A name resolution issue with systemd-resolved we found in the wild


As you may know, you can connect Moss to your fresh Ubuntu 18.04 or 16.04 server, regardless of which provider hosts it. Moss also features native integrations with some cloud providers (Amazon, DigitalOcean, Google and Vultr as of this writing), but you can use Moss with any VPS, cloud instance, or even physical server – not a common use case, but feasible anyway.

A few days ago a customer was having an issue when trying to connect an Ubuntu 18.04 instance (hosted on his provider of choice) to Moss. So I decided to create an account with that provider and investigate the problem. It turned out that the provider’s image had some “suboptimal” configurations and that the default solution for name resolution in Ubuntu 18.04 (bionic) has some related bugs. I think the problem is interesting enough to share, and it also lets us talk about systemd and, more specifically, systemd-resolved for name resolution.

systemd-resolved

systemd is a free software project that aims to provide user-space building blocks for a Linux system. Its best-known component is an init system able to start multiple services in parallel, thereby reducing the boot time for your Linux box. If you have servers running Ubuntu (xenial or bionic), you’re already using systemd as your init – PID 1 – process.

Although this may sound great, systemd has been heavily criticized by part of the open source community for years. I won’t delve into those complaints, but I’ll note one of them: that systemd reinvents lots of sub-systems that didn’t need a fix, leading to subtle incompatibilities with existing infrastructure.

One of those sub-systems is name resolution, i.e. how your server translates domain names into network addresses. systemd comes with its own implementation: systemd-resolved. Ubuntu included systemd-resolved in version 16.10 and it’s now present in the current LTS version – 18.04.

systemd-resolved provides local applications with an interface to the DNS. In addition to implementing a resolver, it adds several capabilities like DNS caching and DNSSEC validation. systemd-resolved can be consumed by applications in three ways:

  • By means of its D-Bus API. D-Bus is a message bus for inter-process communication. It’s part of the freedesktop.org project, and systemd makes heavy use of it. So in general, desktop applications and systemd services are the most likely clients of this interface.
  • By means of its implementation of the glibc API – getaddrinfo(3) and related functions. Not all systemd-resolved capabilities are currently supported through this interface, and further configuration is required to make systemd-resolved handle name resolution in this case. If so configured, I’d say that most software running on your server would use this interface with systemd-resolved.
  • By means of the local DNS stub listener that systemd-resolved runs on IP address 127.0.0.53 on the loopback interface. If an application deals with DNS requests directly, this is the only way that systemd-resolved has to resolve the request.
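To make the second path concrete: Python’s socket.getaddrinfo() wraps the C library’s getaddrinfo(3), so a lookup made this way goes through glibc and whatever sources it is configured to use. The sketch below is only an illustration; resolve_via_libc is a hypothetical helper name, not something Moss or systemd ships.

```python
from __future__ import annotations

import socket


def resolve_via_libc(hostname: str) -> list[str]:
    """Resolve a name through the C library's getaddrinfo(3).

    Hypothetical helper for illustration: which backend answers
    (files, dns, resolve...) is decided by the system configuration,
    not by this code.
    """
    # Restricting to TCP avoids one duplicate entry per socket type.
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] holds the textual address for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})
```

Calling resolve_via_libc("keyserver.ubuntu.com") on a server would trigger a real lookup through glibc; passing a numeric address such as "127.0.0.1" returns it unchanged without any DNS traffic.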

Hmm, this is starting to get complex… isn’t it? Within the same server, how name resolution actually behaves may differ on a per-application and per-configuration basis. And things can become trickier.

Let’s quickly see how name resolution has been traditionally done in Linux systems.

/etc/resolv.conf and friends

If you’ve ever done some system configuration in Linux, you’ll know that the name resolver is configured in /etc/resolv.conf. It usually consists of a list of nameservers which are queried in order. That’s it. In the old days the sysadmin would set up this file and move on.

But then, more dynamic environments became “the new normal”. In particular, Linux desktops and cloud computing required dynamic networking environments, so a static configuration file for name resolution wasn’t a good fit in such cases. Therefore, applications like resolvconf were (and still are) used to dynamically update /etc/resolv.conf based on external information. resolvconf is not intended to be used by hand, but from other configuration software like ifup, ifdown, dhclient, or dnsmasq.

How does this relate to systemd-resolved? Well, we have more complexity here. systemd-resolved might either be the provider of /etc/resolv.conf or consume that file. It depends on the compatibility mode chosen by the system administrator. Basically, you either point applications at 127.0.0.53 for name resolution, have systemd-resolved publish the upstream name servers so that applications can bypass it, or let other packages manage /etc/resolv.conf (in which case systemd-resolved consumes the file).
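One way to tell which of these modes a server is in is to look at where /etc/resolv.conf really points. The following is a minimal sketch: the three paths are the ones I recall from the systemd-resolved(8) manual page and should be treated as assumptions for your systemd version, and the function names and mode labels are made up for this example.

```python
import os

# Assumed locations, as documented in systemd-resolved(8); verify them
# against the systemd version installed on your server.
STUB_FILE = "/run/systemd/resolve/stub-resolv.conf"   # lists 127.0.0.53
STATIC_STUB_FILE = "/usr/lib/systemd/resolv.conf"     # static, lists 127.0.0.53
UPSTREAM_FILE = "/run/systemd/resolve/resolv.conf"    # lists upstream servers


def classify_resolv_conf(real_path: str) -> str:
    """Map the resolved path of /etc/resolv.conf to a (hypothetical) mode label."""
    if real_path in (STUB_FILE, STATIC_STUB_FILE):
        return "stub"      # applications talk to 127.0.0.53
    if real_path == UPSTREAM_FILE:
        return "bypass"    # applications talk to the upstream servers directly
    return "foreign"       # some other package (or the admin) owns the file


def current_mode(path: str = "/etc/resolv.conf") -> str:
    """Follow symlinks and classify the file that is actually in use."""
    return classify_resolv_conf(os.path.realpath(path))
```

A file owned by resolvconf, for example /run/resolvconf/resolv.conf, is classified as "foreign" by this sketch.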

Ok, you must be confused by now… Let me explain the problem that prompted this blog post, and I’ll use it as an example to walk through these configs.

The name resolution issue

The problem that our customer was having on his cloud provider’s Ubuntu 18.04 server was this one:

root@ubuntu-1804-image:~# gpg --recv-keys --keyserver hkp://keyserver.ubuntu.com:80 14AA40EC0831756756D7F66C4F4EA0AAE5267A6C
gpg: keyserver receive failed: Invalid argument

gpg relies on dirmngr (both part of the GNU Privacy Guard project) to handle certificates and revocation lists. When I looked for the appropriate error logs, I found a name resolution issue for hostname keyserver.ubuntu.com:

root@ubuntu-1804-image:~# cat /var/log/syslog | grep dirmngr
Jun  5 09:13:44 ubuntu-1804-image dirmngr[3005]: resolving 'keyserver.ubuntu.com' failed: Invalid argument
Jun  5 09:13:44 ubuntu-1804-image dirmngr[3005]: can't connect to 'keyserver.ubuntu.com': host not found
[redacted]

Ok, so the host wasn’t being found. Let’s try to resolve it:

root@ubuntu-1804-image:~# host keyserver.ubuntu.com
keyserver.ubuntu.com has address 91.189.89.49
keyserver.ubuntu.com has address 91.189.90.55

Hmm… it works – something strange is happening. What nameservers are being used?

root@ubuntu-1804-image:~# cat /etc/resolv.conf
[redacted]
nameserver 127.0.0.53
nameserver 1.0.0.1
nameserver 1.1.1.1

Since 127.0.0.53 is the first name server in /etc/resolv.conf, it should be the one handling DNS requests first. How is systemd-resolved configured?

root@ubuntu-1804-image:~# cat /etc/systemd/resolved.conf
#  This file is part of systemd.
#
#  systemd is free software; you can redistribute it and/or modify it
#  under the terms of the GNU Lesser General Public License as published by
#  the Free Software Foundation; either version 2.1 of the License, or
#  (at your option) any later version.
#
# Entries in this file show the compile time defaults.
# You can change settings by editing this file.
# Defaults can be restored by simply deleting this file.
#
# See resolved.conf(5) for details

[Resolve]
#DNS=
#FallbackDNS=
#Domains=
#LLMNR=no
#MulticastDNS=no
#DNSSEC=no
#Cache=yes
#DNSStubListener=yes

Apparently this is ok – every entry is commented out, so the compile-time defaults apply and, as per the last line, the stub resolver (127.0.0.53) is enabled and should be answering name resolution queries. Let’s check if it’s actually running:

root@ubuntu-1804-image:~# systemctl status systemd-resolved.service
● systemd-resolved.service - Network Name Resolution
   Loaded: loaded (/lib/systemd/system/systemd-resolved.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2018-06-05 08:04:10 UTC; 12min ago
   [redacted]
root@ubuntu-1804-image:~# netstat -nlutp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 127.0.0.53:53           0.0.0.0:*               LISTEN      1753/systemd-resolv
udp    18432      0 127.0.0.53:53           0.0.0.0:*                           1753/systemd-resolv
[redacted]

Yes, it’s running and the stub is listening on the appropriate ports – udp/53 and tcp/53. But wait, we saw before that applications can also interface with systemd-resolved by means of D-Bus or glibc APIs. If dirmngr and host used different approaches, we might infer that one of them is exposing an issue but the other one isn’t.

In fact, dirmngr uses glibc’s getaddrinfo(), whereas host is part of BIND and deals with DNS requests directly. I checked this by disassembling the code with objdump (binutils package), but I could have reviewed the source code instead. Are dirmngr’s calls being handled by systemd-resolved directly? To answer this, we have to check whether the hosts: directive in /etc/nsswitch.conf contains the keyword “resolve”. However, we can see that’s not the case in the server under study.

root@ubuntu-1804-image:~# cat /etc/nsswitch.conf
# /etc/nsswitch.conf
#
# Example configuration of GNU Name Service Switch functionality.
# If you have the `glibc-doc-reference' and `info' packages installed, try:
# `info libc "Name Service Switch"' for information about this file.

passwd:         compat systemd
group:          compat systemd
shadow:         compat
gshadow:        files

hosts:          files dns
networks:       files

protocols:      db files
services:       db files
ethers:         db files
rpc:            db files

netgroup:       nis

Since the hosts: line only lists “files dns”, glibc’s own DNS client reads /etc/resolv.conf. So apparently we can assume that systemd-resolved handles dirmngr’s requests once they reach 127.0.0.53. If the latter times out, the Cloudflare servers will be queried instead.

Then why does the gpg command fail to resolve the hostname? What’s really happening under the hood? Network traffic will tell us. Let’s capture DNS requests and responses with tcpdump while we run gpg.

We can observe different phases in the captured traffic.
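The nsswitch.conf check described above is easy to script. This is a minimal sketch with hypothetical helper names (hosts_sources, uses_nss_resolve); it only reads the hosts: line and ignores bracketed action items such as [!UNAVAIL=return].

```python
from __future__ import annotations

import re


def hosts_sources(nsswitch_text: str) -> list[str]:
    """Return the lookup sources listed on the hosts: line, in order."""
    for line in nsswitch_text.splitlines():
        line = line.split("#", 1)[0].strip()      # drop comments
        if line.startswith("hosts:"):
            spec = line[len("hosts:"):]
            spec = re.sub(r"\[[^\]]*\]", " ", spec)  # drop [STATUS=action] items
            return spec.split()
    return []


def uses_nss_resolve(nsswitch_text: str) -> bool:
    """True if glibc lookups are routed to systemd-resolved's NSS module."""
    return "resolve" in hosts_sources(nsswitch_text)
```

For the file shown above, hosts_sources() yields files and dns, and uses_nss_resolve() is false.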

  1. The application looks for the IPv4 addresses (Type A records) of keyserver.ubuntu.com. The query reaches systemd-resolved’s stub listener and it issues three queries in parallel – one for Cloudflare and two for Google DNS servers. The first response is returned to the application.
  2. The application looks for the IPv6 addresses (Type AAAA records) of keyserver.ubuntu.com. Same behavior as before.
  3. The application looks for a Type 0 (Class 7168) record and systemd-resolved’s stub listener replies with a Format Error.
  4. Queries time out 5 seconds later and the process starts over.

The record in phase 3 turns out to be RRSIG – it holds digital signatures of resource records which are used during the DNSSEC authentication process. At the time of writing, systemd-resolved doesn’t support queries for these (and related) records under certain conditions. We can easily check this by forcing 127.0.0.53 to resolve an RRSIG query (it fails). If the same query is served by Cloudflare’s name server, it succeeds:

root@ubuntu-1804-image:~# host -t RRSIG keyserver.ubuntu.com 127.0.0.53
Using domain server:
Name: 127.0.0.53
Address: 127.0.0.53#53
Aliases: 

Host keyserver.ubuntu.com not found: 1(FORMERR)

root@ubuntu-1804-image:~# host -t RRSIG keyserver.ubuntu.com 1.1.1.1
Using domain server:
Name: 1.1.1.1
Address: 1.1.1.1#53
Aliases: 

keyserver.ubuntu.com has no RRSIG record
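For readers who want to see what such a query looks like on the wire, here is a minimal sketch that builds an RRSIG question by hand and reads the response code out of a reply. It follows the message layout from RFC 1035 and the RRSIG type number from RFC 4034; build_query and rcode are hypothetical helper names, and the fixed query ID is just a placeholder.

```python
import struct

RRSIG = 46     # RR type number assigned to RRSIG (RFC 4034)
FORMERR = 1    # RCODE 1, "Format Error" (RFC 1035)


def build_query(name: str, qtype: int, query_id: int = 0x1234) -> bytes:
    """Build a single-question DNS query with recursion desired."""
    # Header: ID, flags (only RD set), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE as requested, QCLASS 1 (IN)
    return header + qname + struct.pack("!HH", qtype, 1)


def rcode(response: bytes) -> int:
    """Extract the 4-bit response code from the flags word of a DNS reply."""
    (flags,) = struct.unpack("!H", response[2:4])
    return flags & 0x000F
```

Sending build_query("keyserver.ubuntu.com", RRSIG) over UDP to 127.0.0.53 port 53 and passing the reply to rcode() is the scripted equivalent of the first host command above; a result equal to FORMERR corresponds to the error shown there.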

Finally we have something that explains why gpg failed. However, it’s still not clear why all queries time out, since A and AAAA records were successfully answered. Let’s keep looking into that.

Why is systemd-resolved issuing 3 queries in parallel? Let’s check its status:

root@ubuntu-1804-image:~# systemd-resolve --status
Global
         DNS Servers: 1.0.0.1
                      1.1.1.1
          DNSSEC NTA: 10.in-addr.arpa
                      16.172.in-addr.arpa
                      168.192.in-addr.arpa
                      17.172.in-addr.arpa
                      18.172.in-addr.arpa
                      19.172.in-addr.arpa
                      20.172.in-addr.arpa
                      21.172.in-addr.arpa
                      22.172.in-addr.arpa
                      23.172.in-addr.arpa
                      24.172.in-addr.arpa
                      25.172.in-addr.arpa
                      26.172.in-addr.arpa
                      27.172.in-addr.arpa
                      28.172.in-addr.arpa
                      29.172.in-addr.arpa
                      30.172.in-addr.arpa
                      31.172.in-addr.arpa
                      corp
                      d.f.ip6.arpa
                      home
                      internal
                      intranet
                      lan
                      local
                      private
                      test

Link 3 (eth1)
      Current Scopes: DNS
       LLMNR setting: yes
MulticastDNS setting: no
      DNSSEC setting: no
    DNSSEC supported: no
         DNS Servers: 8.8.8.8
                      8.8.4.4

Link 2 (eth0)
      Current Scopes: DNS
       LLMNR setting: yes
MulticastDNS setting: no
      DNSSEC setting: no
    DNSSEC supported: no
         DNS Servers: 8.8.8.8
                      8.8.4.4

We have:

  • 1.0.0.1 and 1.1.1.1: Cloudflare’s name servers as the global DNS servers. These come from /etc/resolv.conf as we saw before.
  • 8.8.8.8 and 8.8.4.4: Google’s name servers for queries flowing through interface eth0. These come from external sources, in particular a DHCP server.
  • 8.8.8.8 and 8.8.4.4: Same as above but for interface eth1.
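The same summary can be extracted mechanically from the status output. The sketch below assumes the exact layout shown in this post (unindented scope headers, indented “Key: value” lines, and indented continuation lines); other systemd versions may print something different, and dns_servers_by_scope is a hypothetical helper name.

```python
from __future__ import annotations

import re

# An indented "Key: value" line; keys consist of letters and spaces only,
# so IPv6 addresses on continuation lines are not mistaken for keys.
KEY_VALUE = re.compile(r"^([A-Za-z ]+): (.*)$")


def dns_servers_by_scope(status_text: str) -> dict[str, list[str]]:
    """Group the 'DNS Servers' entries by scope (Global, Link N (ifname))."""
    servers: dict[str, list[str]] = {}
    scope = None   # current section header
    key = None     # last key seen inside the current section
    for raw in status_text.splitlines():
        if not raw.strip():
            continue
        if not raw[0].isspace():          # unindented line starts a new scope
            scope, key = raw.strip(), None
            servers[scope] = []
            continue
        if scope is None:
            continue
        line = raw.strip()
        match = KEY_VALUE.match(line)
        if match:
            key, value = match.group(1), match.group(2)
        else:
            value = line                  # continuation of the previous key
        if key == "DNS Servers":
            servers[scope].append(value)
    return servers
```

Applied to the output above, this yields three scopes: Global with the two Cloudflare addresses, and one entry per link with the two Google addresses.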

The three parallel queries match the expected behavior of systemd-resolved: one for the global name server and another one per network interface. Who’s setting up Cloudflare as the global name server? Let’s revisit /etc/resolv.conf:

root@ubuntu-1804-image:~# ls -l /etc/resolv.conf
lrwxrwxrwx 1 root root 29 May  7 12:36 /etc/resolv.conf -> ../run/resolvconf/resolv.conf

root@ubuntu-1804-image:~# cat /etc/resolv.conf
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
#     DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
# 127.0.0.53 is the systemd-resolved stub resolver.
# run "systemd-resolve --status" to see details about the actual nameservers.

nameserver 127.0.0.53
nameserver 1.0.0.1
nameserver 1.1.1.1

What a strange config! Don’t you think? Note the conflicting settings:

  1. You’re using systemd-resolved and it takes per-interface DNS configurations from a DHCP server. Such DHCP server lists external DNS servers – from Google.
  2. But /etc/resolv.conf is being managed by resolvconf. This should mean that the sysadmin wants to use the last compatibility mode of systemd-resolved regarding /etc/resolv.conf, the one where other packages manage the file. However, since 127.0.0.53 is listed as a name server, systemd-resolved takes control over name resolution.
  3. But resolvconf includes additional (but different) external DNS servers – from Cloudflare – for your global name resolution settings.

This configuration nightmare, along with the RRSIG bug we discussed earlier, makes your system break in a very subtle, hard-to-debug, hard-to-understand way. In particular, when some requests are awaiting an answer from some of the upstream DNS servers, but systemd-resolved receives a query it doesn’t support (like the one searching for an RRSIG record), the name resolution process seems to fail entirely.

In my opinion, the settings that the cloud provider chose are flawed (from a maintainability viewpoint) and must be fixed. Even if they worked, they make little sense and add complexity to an already complex setup. The provider should choose a clear policy – either use 127.0.0.53 as the only name server or get rid of systemd-resolved – and stick with it.
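A check for this kind of mixed setup could look like the sketch below. It is only an illustration of the policy suggested here, with hypothetical helper names; it flags any resolv.conf that lists the stub address together with other name servers.

```python
from __future__ import annotations

STUB_ADDRESS = "127.0.0.53"  # systemd-resolved's stub listener


def nameservers(resolv_conf_text: str) -> list[str]:
    """Return the nameserver entries of a resolv.conf, in file order."""
    servers = []
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):   # skip comments
            continue
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def mixes_stub_with_upstreams(resolv_conf_text: str) -> bool:
    """True if the stub resolver is listed alongside other name servers."""
    servers = set(nameservers(resolv_conf_text))
    return STUB_ADDRESS in servers and len(servers) > 1
```

The file from this post, with 127.0.0.53 followed by 1.0.0.1 and 1.1.1.1, is flagged; a file listing only 127.0.0.53, or only upstream servers, is not.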

Conclusion

DNS is a critical component of the Internet, and it’s more complex than it might seem at first sight. For the sake of additional functionality, systemd-resolved adds more complexity to it. According to the author of systemd-resolved:

resolved is not supposed to be a DNS server, it’s supposed to be exactly good enough so that libc-like DNS clients can resolve their stuff

Therefore, it seems a bit unfortunate that distros like Ubuntu Server (among others) and upstream providers include it as the default solution for name resolution without careful settings.

In this article I’ve tried to walk you through the pain of tracking down a real issue on a real provider. If you think I got something wrong, just drop me a message and I’ll be happy to update this post. Hey, and don’t forget to sign up below if you want us to send you an email as we publish more stuff 😀
