Posted on Leave a comment

systemd-resolved: introduction to split DNS

Fedora 33 switches the default DNS resolver to systemd-resolved. In simple terms, this means that systemd-resolved will run as a daemon. All programs wanting to translate domain names to network addresses will talk to it. This replaces the current default lookup mechanism where each program individually talks to remote servers and there is no shared cache.

If necessary, systemd-resolved will contact remote DNS servers. systemd-resolved is a “stub resolver”—it doesn’t resolve all names itself (by starting at the root of the DNS hierarchy and going down label by label), but forwards the queries to a remote server.

A single daemon handling name lookups provides significant benefits. The daemon caches answers, which speeds answers for frequently used names. The daemon remembers which servers are non-responsive, while previously each program would have to figure this out on its own after a timeout. Individual programs only talk to the daemon over a local transport and are more isolated from the network. The daemon supports fancy rules which specify which name servers should be used for which domain names—in fact, the rest of this article is about those rules.

Split DNS

Consider the scenario of a machine that is connected to two semi-trusted networks (wifi and ethernet), and also has a VPN connection to your employer. Each of those three connections has its own network interface in the kernel. And there are multiple name servers: one from a DHCP lease from the wifi hotspot, two specified by the VPN and controlled by your employer, plus some additional manually-configured name servers. Routing is the process of deciding which servers to ask for a given domain name. Do not mistake this with the process of deciding where to send network packets, which is called routing too.

The network interface is king in systemd-resolved. systemd-resolved first picks one or more interfaces which are appropriate for a given name, and then queries one of the name servers attached to that interface. This is known as “split DNS”.

There are two flavors of domains attached to a network interface: routing domains and search domains. They both specify that the given domain and any subdomains are appropriate for that interface. Search domains have the additional function that single-label names are suffixed with that search domain before being resolved. For example, a lookup for “server” is treated as a lookup for “server.example.com” if the search domain is “example.com.” In systemd-resolved config files, routing domains are prefixed with the tilde (~) character.

Specific example

Now consider a specific example: your VPN interface tun0 has a search domain private.company.com and a routing domain ~company.com. If you ask for mail.private.company.com, it is matched by both domains, so this name would be routed to tun0.

A request for www.company.com is matched by the second domain and would also go to tun0. If you ask for www, (in other words, if you specify a single-label name without any dots), the difference between routing and search domains comes into play. systemd-resolved attempts to combine the single-label name with the search domain and tries to resolve www.private.company.com on tun0.

If you have multiple interfaces with search domains, single-label names are suffixed with all search domains and resolved in parallel. For multi-label names, no suffixing is done; search and routing domains are are used to route the name to the appropriate interface. The longest match wins. When there are multiple matches of the same length on different interfaces, they are resolved in parallel.

A special case is when an interface has a routing domain ~. (a tilde for a routing domain and a dot for the root DNS label). Such an interface always matches any names, but with the shortest possible length. Any interface with a matching search or routing domain has higher priority, but the interface with ~. is used for all other names. Finally, if no routing or search domains matched, the name is routed to all interfaces that have at least one name server attached.

Lookup routing in systemd-resolved

Domain routing

This seems fairly complex, partially because of the historic names which are confusing. In actual practice it’s not as complicated as it seems.

To introspect a running system, use the resolvectl domain command. For example:

$ resolvectl domain
Global:
Link 4 (wlp4s0): ~.
Link 18 (hub0):
Link 26 (tun0): redhat.com

You can see that www would resolve as www.redhat.com. over tun0. Anything ending with redhat.com resolves over tun0. Everything else would resolve over wlp4s0 (the wireless interface). In particular, a multi-label name like www.foobar would resolve over wlp4s0, and most likely fail because there is no foobar top-level domain (yet).

Server routing

Now that you know which interface or interfaces should be queried, the server or servers to query are easy to determine. Each interface has one or more name servers configured. systemd-resolved will send queries to the first of those. If the server is offline and the request times out or if the server sends a syntactically-invalid answer (which shouldn’t happen with “normal” queries, but often becomes an issue when DNSSEC is enabled), systemd-resolved switches to the next server on the list. It will use that second server as long as it keeps responding. All servers are used in a round-robin rotation.

To introspect a running system, use the resolvectl dns command:

$ resolvectl dns
Global:
Link 4 (wlp4s0): 192.168.1.1 8.8.4.4 8.8.8.8
Link 18 (hub0):
Link 26 (tun0): 10.45.248.15 10.38.5.26

When combined with the previous listing, you know that for www.redhat.com, systemd-resolved will query 10.45.248.15, and—if it doesn’t respond—10.38.5.26. For www.google.com, systemd-resolved will query 192.168.1.1 or the two Google servers 8.8.4.4 and 8.8.8.8.

Differences from nss-dns

Before going further detail, you may ask how this differs from the previous default implementation (nss-dns). With nss-dns there is just one global list of up to three name servers and a global list of search domains (specified as nameserver and search in /etc/resolv.conf).

Each name to query is sent to the first name server. If it doesn’t respond, the same query is sent to the second name server, and so on. systemd-resolved implements split-DNS and remembers which servers are currently considered active.

For single-label names, the query is performed with each of the the search domains suffixed. This is the same with systemd-resolved. For multi-label names, a query for the unsuffixed name is performed first, and if that fails, a query for the name suffixed by each of the search domains in turn is performed. systemd-resolved doesn’t do that last step; it only suffixes single-label names.

A second difference is that with nss-dns, this module is loaded into each process. The process itself communicates with remote servers and implements the full DNS stack internally. With systemd-resolved, the nss-resolve module is loaded into the process, but it only forwards the query to systemd-resolved over a local transport (D-Bus) and doesn’t do any work itself. The systemd-resolved process is heavily sandboxed using systemd service features.

The third difference is that with systemd-resolved all state is dynamic and can be queried and updated using D-Bus calls. This allows very strong integration with other daemons or graphical interfaces.

Configuring systemd-resolved

So far, this article talked about servers and the routing of domains without explaining how to configure them. systemd-resolved has a configuration file (/etc/systemd/resolv.conf) where you specify name servers with DNS= and routing or search domains with Domains= (routing domains with ~, search domains without). This corresponds to the Global: lists in the two listings above.

In this article’s examples, both lists are empty. Most of the time configuration is attached to specific interfaces, and “global” configuration is not very useful. Interfaces come and go and it isn’t terribly smart to contact servers on an interface which is down. As soon as you create a VPN connection, you want to use the servers configured for that connection to resolve names, and as soon as the connection goes down, you want to stop.

How does then systemd-resolved acquire the configuration for each interface? This happens dynamically, with the network management service pushing this configuration over D-Bus into systemd-resolved. The default in Fedora is NetworkManager and it has very good integration with systemd-resolved. Alternatives like systemd’s own systemd-networkd implement similar functionality. But the interface is open and other programs can do the appropriate D-Bus calls.

Alternatively, resolvectl can be used for this (it is just a wrapper around the D-Bus API). Finally, resolvconf provides similar functionality in a form compatible with a tool in Debian with the same name.

Scenario: Local connection more trusted than VPN

The important thing is that in the common scenario, systemd-resolved follows the configuration specified by other tools, in particular NetworkManager. So to understand how systemd-resolved names, you need to see what NetworkManager tells it to do. Normally NM will tell systemd-resolved to use the name servers and search domains received in a DHCP lease on some interface. For example, look at the source of configuration for the two listings shown above:

There are two connections: “Parkinson” wifi and “Brno (BRQ)” VPN. In the first panel DNS:Automatic is enabled, which means that the DNS server received as part of the DHCP lease (192.168.1.1) is passed to systemd-resolved. Additionally. 8.8.4.4 and 8.8.8.8 are listed as alternative name servers. This configuration is useful if you want to resolve the names of other machines in the local network, which 192.168.1.1 provides. Unfortunately the hotspot DNS server occasionally gets stuck, and the other two servers provide backup when that happens.

The second panel is similar, but doesn’t provide any special configuration. NetworkManager combines routing domains for a given connection from DHCP, SLAAC RDNSS, and VPN, and finally manual configuration and forward this to systemd-resolved. This is the source of the search domain redhat.com in the listing above.

There is an important difference between the two interfaces though: in the second panel, “Use this connection only for resources on its network” is checked. This tells NetworkManager to tell systemd-resolved to only use this interface for names under the search domain received as part of the lease (Link 26 (tun0): redhat.com in the first listing above). In the first panel, this checkbox is unchecked, and NetworkManager tells systemd-resolved to use this interface for all other names (Link 4 (wlp4s0): ~.). This effectively means that the wireless connection is more trusted.

Scenario: VPN more trusted than local network

In a different scenario, a VPN would be more trusted than the local network and the domain routing configuration reversed. If a VPN without “Use this connection only for resources on its network” is active, NetworkManager tells systemd-resolved to attach the default routing domain to this interface. After unchecking the checkbox and restarting the VPN connection:

$ resolvectl domain
Global:
Link 4 (wlp4s0):
Link 18 (hub0):
Link 28 (tun0): ~. redhat.com
$ resolvectl dns
Global:
Link 4 (wlp4s0):
Link 18 (hub0):
Link 28 (tun0): 10.45.248.15 10.38.5.26

Now all domain names are routed to the VPN. The network management daemon controls systemd-resolved and the user controls the network management daemon.

Additional systemd-resolved functionality

As mentioned before, systemd-resolved provides a common name lookup mechanism for all programs running on the machine. Right now the effect is limited: shared resolver and cache and split DNS (the lookup routing logic described above). systemd-resolved provides additional resolution mechanisms beyond the traditional unicast DNS. These are the local resolution protocols MulticastDNS and LLMNR, and an additional remote transport DNS-over-TLS.

Fedora 33 does not enable MulticastDNS and DNS-over-TLS in systemd-resolved. MulticastDNS is implemented by nss-mdns4_minimal and Avahi. Future Fedora releases may enable these as the upstream project improves support.

Implementing this all in a single daemon which has runtime state allows smart behaviour: DNS-over-TLS may be enabled in opportunistic mode, with automatic fallback to classic DNS if the remote server does not support it. Without the daemon which can contain complex logic and runtime state this would be much harder. When enabled, those additional features will apply to all programs on the system.

There is more to systemd-resolved: in particular LLMNR and DNSSEC, which only received brief mention here. A future article will explore those subjects.

Posted on Leave a comment

Create a wifi hotspot with Raspberry Pi 3 and Fedora

If you’re already running Fedora on your Pi, you’re already most of the way to a wifi hotspot. A Raspberry Pi has a wifi interface that’s usually set up to join an existing wifi network. This interface can be reconfigured to provide a new wifi network. If a room has a good network cable and a bad wifi signal (a brick wall, foil-backed plasterboard, and even a window with a metal oxide coating are all obstacles), fix it with your Pi.

This article describes the procedure for setting up the hotspot. It was tested on third generation Pis – a Model B v1.2, and a Model B+ (the older 2 and the new 4 weren’t tested). These are the credit-card size Pis that have been around a few years.

This article also delves a little way into the network concepts behind the scenes. For instance, “hotspot” is the term that’s caught on in public places around the world, but it’s more accurate to use the term WLAN AP (Wireless Local Area Network Access Point).In fact, if you want to annoy your friendly neighborhood network administrator, call a hotspot a “wifi router”. The inaccuracy will make their eyes cross.

A few nmcli commands configure the Raspberry Pi as a wifi AP. The nmcli command-line tool controls the NetworkManager daemon. It’s not the only network configuration system available. More complex solutions are available for the adventurous. Check out the hostapd RPM package and the OpenWRT distro. Have a look at Internet connection sharing with NetworkManager for more ideas.

A dive into network administration

The hotspot is a routed AP (Access Point). It sits between two networks, the current wired network and its new wireless network, and takes care of the post-office-style forwarding of IP packets between them.

Routing and interfaces

The wireless interface on the Raspberry Pi is named wlan0 and the wired one is eth0. The new wireless network uses one range of IP addresses and the current wired network uses another. In this example, the current network range is 192.168.0.0/24 and the new network range is 10.42.0.0/24. If these numbers make no sense, that’s OK. You can carry on without getting to grips with IP subnets and netmasks. The Raspberry Pi’s two interfaces have IP addresses from these ranges.

Packets are sent to local computers or remote destinations based on their IP addresses. This is routing work, and it’s where the routed part of routed AP name comes from. If you’d like to build a more complex router with DHCP and DNS, pick up some tips from the article How to use Fedora Server to create a router / gateway.

It’s not a bridged AP

Netowrk bridging is another way of extending a network, but it’s not how this Pi is set up. This routed AP is not a bridged AP. To understand the difference between routing and bridging, you have to know a little about the networking layers of the OSI network model. A good place to start is the beginner’s guide to network troubleshooting in Linux. Here’s the short answer.

  • layer 3, network ← Yes, our routed AP is here.
  • layer 2, data link ← No, it’s not a bridged AP.
  • layer 1, physical ← Radio transmission is covered here.

A bridge works at a lower layer of the network stack – it uses ethernet MAC addresses to send data. If this was a bridged AP, it wouldn’t have two sets of IP addresses; the new wireless network and the current wired network would use the same IP subnet.

IP masquerading

You won’t find an IP address starting with 10. anywhere on the Internet. It’s a private address, not a public address. To get an IP packet routed out of the wifi network and back in again, packet addresses have to be changed. IP masquerading is a way of making this routing work. The masquerade name is used because the packets’ real addresses are hidden. the wired network doesn’t see any addresses from the wireless network.

IP masquerading is set up automatically by NetworkManager. NetworkManager adds nftables rules to handle IP masquerading.

The Pi’s network stack

A stack of network hardware and software makes wifi work.

  • Network hardware
  • Kernel space software
  • User space software

You can see the network hardware. The Raspberry Pi has two main hardware components – a tiny antenna and Broadcom wifi chip. MagPi magazine has some great photos.

Kernel software provides the plumbing. There’s no need to work on these directly – it’s all good to go in the Fedora distribution.

  • Broadcom driver modules talk to the hardware. List these with the command lsmod | grep brcm.
  • A TCP/IP stack handles protocols.
  • The netfilter framework filters packets.
  • A network system ties these all together.

User space software customizes the system. It’s full of utilities that either help the user, talk to the kernel, or connect other utilities together. For instance, the firewall-cmd tool talks to the firewalld service, firewalld talks to the nftables tool, and nftables talks to the netfilter framework in the kernel. The nmcli commands talk to NetworkManager. And NetworkManager talks to pretty much everything.

Create the AP

That’s enough theory — let’s get practical. Fire up your Raspberry Pi running Fedora and run these commands.

Install software

Nearly all the required software is included with the Fedora Minimal image. The only thing missing is the dnsmasq package. This handles the DHCP and IP address part of the new wifi network, automatically. Run this command using sudo:

$ sudo dnf install dnsmasq

Create a new NetworkManager connection

NetworkManager sets up one network connection automatically, Wired connection 1. Use the nmcli tool to tell NetworkManager how to add a wifi connection. NetworkManager saves these settings, and a bunch more, in a new config file.

The new configuration file is created in the directory /etc/sysconfig/network-scripts/. At first, it’s empty; the image has no configuration files for network interfaces. If you want to find out more about how NetworkManager uses the network-scripts directory, the gory details are in the nm-settings-ifcfg-rh man page.

[nick@raspi ~]$ ls /etc/sysconfig/network-scripts/
[nick@raspi ~]$

The first nmcli command, to create a network connection, looks like this. There’s more to do — the Pi won’t work as a hotspot after running this.

nmcli con add \ type wifi \ ifname wlan0 \ con-name 'raspi hotspot' \ autoconnect yes \ ssid 'raspi wifi'

The following commands complete several more steps:

  • Create a new connection.
  • List the connections.
  • Take another look at the network-scripts folder. NetworkManager added a config file.
  • List available APs to connect to.

This requires running several commands as root using sudo:

$ sudo nmcli con add type wifi ifname wlan0 con-name 'raspi hotspot' autoconnect yes ssid 'raspi wifi'
Connection 'raspi wifi' (13ea67a7-a8e6-480c-8a46-3171d9f96554) successfully added.
$ sudo nmcli connection show
NAME UUID TYPE DEVICE
Wired connection 1 59b7f1b5-04e1-3ad8-bde8-386a97e5195d ethernet eth0
raspi wifi 13ea67a7-a8e6-480c-8a46-3171d9f96554 wifi wlan0
$ ls /etc/sysconfig/network-scripts/
ifcfg-raspi_wifi
$ sudo nmcli device wifi list
IN-USE BSSID SSID MODE CHAN RATE SIGNAL BARS SECURITY 01:0B:03:04:C6:50 APrivateAP Infra 6 195 Mbit/s 52 ▂▄__ WPA2 02:B3:54:05:C8:51 SomePublicAP Infra 6 195 Mbit/s 52 ▂▄__ --

You can remove the new config and start again with this command:

$ sudo nmcli con delete 'raspi hotspot'

Change the connection mode

A NetworkManager connection has many configuration settings. You can see these with the command nmcli con show ‘raspi hotspot’. Some of these settings start with the label 802-11-wireless. This is to do with industry standards that make wifi work – the IEEE organization specified many protocols for wifi, named 802.11. This new wifi connection is in infrastructure mode, ready to connect to a wifi access point. The Pi isn’t supposed to connect to another AP; it’s supposed to be the AP that others connect to.

This command changes the mode from infrastructure to AP. It also sets a few other wireless properties. The bg value tells NetworkManager to follow two old IEEE standards – 802.11b and 802.11g. Basically it configures the radio to use the 2.4GHz frequency band, not the 5GHz band. ipv4.method shared means this connection will be shared with others.

  • Change the connection to a hotspot by changing the mode to ap.
sudo nmcli connection \ modify "raspi hotspot" \ 802-11-wireless.mode ap \ 802-11-wireless.band bg \ ipv4.method shared

The connection starts automatically. The dnsmasq application gives the wlan0 interface an IP address of 10.42.0.1. The manual commands to start and stop the hotspot are:

$ sudo nmcli con up "raspi hotspot"
$ sudo nmcli con down "raspi hotspot"

Connect a device

The next steps are to:

  • Watch the log.
  • Connect a smartphone.
  • When you’ve seen enough, type
    ^C

    ([control][c]) to stop watching the log.

$ journalctl --follow
-- Logs begin at Wed 2020-04-01 18:23:45 BST. --
...

Use a wifi-enabled device, like your phone. The phone can find the new raspi wifi network.

Messages about an associating client appear in the activity log:

Jun 10 18:08:05 raspi wpa_supplicant[662]: wlan0: AP-STA-CONNECTED 94:b0:1f:2e:d2:bd
Jun 10 18:08:05 raspi wpa_supplicant[662]: wlan0: CTRL-EVENT-SUBNET-STATUS-UPDATE status=0
Jun 10 18:08:05 raspi dnsmasq-dhcp[713]: DHCPREQUEST(wlan0) 10.42.0.125 94:b0:1f:2e:d2:bd
Jun 10 18:08:05 raspi dnsmasq-dhcp[713]: DHCPACK(wlan0) 10.42.0.125 94:b0:1f:2e:d2:bd nick

Examine the firewall

A new security zone named nm-shared has appeared. This is stopping some wifi access.

$ sudo firewall-cmd --get-active-zones
[sudo] password for nick:
nm-shared interfaces: wlan0
public interfaces: eth0

The new zone is set up to accept everything because the target is ACCEPT. Clients are able to use web, mail and SSH to get to the Internet.

$ sudo firewall-cmd --zone=nm-shared --list-all
nm-shared (active) target: ACCEPT icmp-block-inversion: no interfaces: wlan0 sources: services: dhcp dns ssh ports: protocols: icmp ipv6-icmp masquerade: no forward-ports: source-ports: icmp-blocks: rich rules: rule priority="32767" reject

This big list of config settings takes a little examination.

The first line, the innocent-until-proven-guilty option target: ACCEPT says all traffic is allowed through, unless a rule says otherwise. It’s the same as saying these types of traffic are all OK.

  • inbound packets – requests sent from wifi clients to the Raspberry Pi
  • forwarded packets – requests from wifi clients to the Internet
  • outbound packets – requests sent by the PI to wifi clients

However, there’s a hidden gotcha: requests from wifi clients (like your workstation) to the Raspberry Pi may be rejected. The final line — the mysterious rule in the rich rules section — refers to the routing policy database. The rule stops you from connecting from your workstation to your Pi with a command like this: ssh 10.42.0.1. This rule only affects traffic sent to to the Raspberry Pi, not traffic sent to the Internet, so browsing the web works fine.

If an inbound packet matches something in the services and protocols lists, it’s allowed through. NetworkManager automatically adds ICMP, DHCP and DNS (Internet infrastructure services and protocols). An SSH packet doesn’t match, gets as far as the post-processing stage, and is rejected — priority=”32767″ translates as “do this after all the processing is done.”

If you want to know what’s happening behind the scenes, that rich rule creates an nftables rule. The nftables rule looks like this.

$ sudo nft list chain inet firewalld filter_IN_nm-shared_post
table inet firewalld { chain filter_IN_nm-shared_post { reject }
}

Fix SSH login

Connect from your workstation to the Raspberry Pi using SSH.This won’t work because of the rich rule. A protocol that’s not on the list gets instantly rejected.

Check that SSH is blocked:

$ ssh 10.42.0.1
ssh: connect to host 10.42.0.1 port 22: Connection refused

Next, add SSH to the list of allowed services. If you don’t remember what services are defined, list them all with firewall-cmd ‐‐get-services. For SSH, use option ‐‐add-service ssh or ‐‐remove-service ssh. Don’t forget to make the change permanent.

$ sudo firewall-cmd --add-service ssh --permanent --zone=nm-shared
success

Now test with SSH again.

$ ssh 10.42.0.1
The authenticity of host '10.42.0.1 (10.42.0.1)' can't be established.
ECDSA key fingerprint is SHA256:dDdgJpDSMNKR5h0cnpiegyFGAwGD24Dgjg82/NUC3Bc.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added '10.42.0.1' (ECDSA) to the list of known hosts.
Last login: Tue Jun 9 18:58:36 2020 from 10.0.1.35
[email protected]'s password:

SSH access is no longer blocked.

Test as a headless computer

The raspberry pi runs fine as a headless computer. From here on, you can use SSH to work on your Pi.

  • Power off.
  • Remove keyboard and video monitor.
  • Power on.
  • Wait a couple minutes.
  • Connect from your workstation to the Raspberry Pi using SSH. Use either the wired interface or the wireless one; both work.

Increase security with WPA-PSK

The WPA-PSK (Wifi Protected Access with Pre-Shared Key) system is designed for home users and small offices. It is password protected. Use nmcli again to add WPA-PSK:

$ sudo nmcli con modify "raspi hotspot" wifi-sec.key-mgmt wpa-psk
$ sudo nmcli con modify "raspi hotspot" wifi-sec.psk "hotspot-password"

Troubleshooting

Here are a couple recommendations:

The bad news is, there are no troubleshooting tips here. There are so many things that can go wrong, there’s no way of covering them.

Troubleshooting a network stack is tricky. If one component goes wrong, it may all go wrong. And making changes like reloading firewall rules can upset services like NetworkManager and sshd. You know you’re in the weeds when you find yourself running nftables commands like nft list ruleset and firewalld commands like firewall-cmd ‐‐set-log-denied=all.

Play with your new platform

Add value to your new AP. Since you’re running a Pi, there are many hardware add-ons. Since it’s running Fedora, you have thousands of packages available. Try turning it into a mini-NAS, or adding battery back-up, or perhaps a music player.


Photo by Uriel SC on Unsplash.

Posted on Leave a comment

TCP window scaling, timestamps and SACK

The Linux TCP stack has a myriad of sysctl knobs that allow to change its behavior.  This includes the amount of memory that can be used for receive or transmit operations, the maximum number of sockets and optional features and protocol extensions.

There are  multiple articles that recommend to disable TCP extensions, such as timestamps or selective acknowledgments (SACK) for various “performance tuning” or “security” reasons.

This article provides background on what these extensions do, why they
are enabled by default, how they relate to one another and why it is normally a bad idea to turn them off.

TCP Window scaling

The data transmission rate that TCP can sustain is limited by several factors. Some of these are:

  • Round trip time (RTT).  This is the time it takes for a packet to get to the destination and a reply to come back. Lower is better.
  • lowest link speed of the network paths involved
  • frequency of packet loss
  • the speed at which new data can be made available for transmission
    For example, the CPU needs to be able to pass data to the network adapter fast enough. If the CPU needs to encrypt the data first, the adapter might have to wait for new data. In similar fashion disk storage can be a bottleneck if it can’t read the data fast enough.
  • The maximum possible size of the TCP receive window. The receive window determines how much data (in bytes) TCP can transmit before it has to wait for the receiver to report reception of that data. This is announced by the receiver. The receiver will constantly update this value as it reads and acknowledges reception of the incoming data. The receive windows current value is contained in the TCP header that is part of every segment sent by TCP. The sender is thus aware of the current receive window whenever it receives an acknowledgment from the peer. This means that the higher the round-trip time, the longer it takes for sender to get receive window updates.

TCP is limited to at most 64 kilobytes of unacknowledged (in-flight) data. This is not even close to what is needed to sustain a decent data rate in most networking scenarios. Let us look at some examples.

Theoretical data rate

With a round-trip-time of 100 milliseconds, TCP can transfer at most 640 kilobytes per second. With a 1 second delay, the maximum theoretical data rate drops down to only 64 kilobytes per second.

This is because of the receive window. Once 64kbyte of data have been sent the receive window is already full.  The sender must wait until the peer informs it that at least some of the data has been read by the application. 

The first segment sent reduces the TCP window by the size of that segment. It takes one round-trip before an update of the receive window value will become available. When updates arrive with a 1 second delay, this results in a 64 kilobyte limit even if the link has plenty of bandwidth available.

In order to fully utilize a fast network with several milliseconds of delay, a window size larger than what classic TCP supports is a must. The ’64 kilobyte limit’ is an artifact of the protocols specification: The TCP header reserves only 16bits for the receive window size. This allows receive windows of up to 64KByte. When the TCP protocol was originally designed, this size was not seen as a limit.

Unfortunately, its not possible to just change the TCP header to support a larger maximum window value. Doing so would mean all implementations of TCP would have to be updated simultaneously or they wouldn’t understand one another anymore. To solve this, the interpretation of the receive window value is changed instead.

The ‘window scaling option’ allows to do this while keeping compatibility to existing implementations.

TCP Options: Backwards-compatible protocol extensions

TCP supports optional extensions. This allows to enhance the protocol with new features without the need to update all implementations at once. When a TCP initiator connects to the peer, it also send a list of supported extensions. All extensions follow the same format: an unique option number followed by the length of the option and the option data itself.

The TCP responder checks all the option numbers contained in the connection request. If it does not understand an option number it skips
‘length’ bytes of data and checks the next option number. The responder omits those it did not understand from the reply. This allows both the sender and receiver to learn the common set of supported options.

With window scaling, the option data always consist of a single number.

The window scaling option

 
Window Scale option (WSopt): Kind: 3, Length: 3
    +---------+---------+---------+
    | Kind=3  |Length=3 |shift.cnt|
    +---------+---------+---------+
         1         1         1

The window scaling option tells the peer that the receive window value found in the TCP header should be scaled by the given number to get the real size.

For example, a TCP initiator that announces a window scaling factor of 7 tries to instruct the responder that any future packets that carry a receive window value of 512 really announce a window of 65536 byte. This is an increase by a factor of 128. This would allow a maximum TCP Window of 8 Megabytes.

A TCP responder that does not understand this option ignores it. The TCP packet sent in reply to the connection request (the syn-ack) then does not contain the window scale option. In this case both sides can only use a 64k window size. Fortunately, almost every TCP stack supports and enables this option by default, including Linux.

The responder includes its own desired scaling factor. Both peers can use a different number. Its also legitimate to announce a scaling factor of 0. This means the peer should treat the receive window value it receives verbatim, but it allows scaled values in the reply direction — the recipient can then use a larger receive window.

Unlike SACK or TCP timestamps, the window scaling option only appears in the first two packets of a TCP connection, it cannot be changed afterwards. It is also not possible to determine the scaling factor by looking at a packet capture of a connection that does not contain the initial connection three-way handshake.

The largest supported scaling factor is 14. This allows TCP window sizes
of up to one Gigabyte.

Window scaling downsides

It can cause data corruption in very special cases. Before you disable the option – it is impossible under normal circumstances. There is also a solution in place that prevents this. Unfortunately, some people disable this solution without realizing the relationship with window scaling. First, let’s have a look at the actual problem that needs to be addressed. Imagine the following sequence of events:

  1. The sender transmits segments: s_1, s_2, s_3, … s_n
  2.  The receiver sees: s_1, s_3, .. s_n and sends an acknowledgment for s_1.
  3.  The sender considers s_2 lost and sends it a second time. It also sends new data contained in segment s_n+1.
  4.  The receiver then sees: s_2, s_n+1, s_2: the packet s_2 is received twice.

This can happen for example when a sender triggers re-transmission too early. Such erroneous re-transmits are never a problem in normal cases, even with window scaling. The receiver will just discard the duplicate.

Old data to new data

The TCP sequence number can be at most 4 Gigabyte. If it becomes larger than this, the sequence wraps back to 0 and then increases again. This is not a problem in itself, but if this occur fast enough then the above scenario can create an ambiguity.

If a wrap-around occurs at the right moment, the sequence number s_2 (the re-transmitted packet) can already be larger than s_n+1. Thus, in the last step (4), the receiver may interpret this as: s_2, s_n+1, s_n+m, i.e. it could view the ‘old’ packet s_2 as containing new data.

Normally, this won’t happen because a ‘wrap around’ occurs only every couple of seconds or minutes even on high bandwidth links. The interval between the original and a unneeded re-transmit will be a lot smaller.

For example,with a transmit speed of 50 Megabytes per second, a
duplicate needs to arrive more than one minute late for this to become a problem. The sequence numbers do not wrap fast enough for small delays to induce this problem.

Once TCP approaches ‘Gigabyte per second’ throughput rates, the sequence numbers can wrap so fast that even a delay by only a few milliseconds can create duplicates that TCP cannot detect anymore. By solving the problem of the too small receive window, TCP can now be used for network speeds that were impossible before – and that creates a new, albeit rare problem. To safely use Gigabytes/s speed in environments with very low RTT receivers must be able to detect such old duplicates without relying on the sequence number alone.

TCP time stamps

A best-before date

In the most simple terms, TCP timestamps just add a time stamp to the packets to resolve the ambiguity caused by very fast sequence number wrap around. If a segment appears to contain new data, but its timestamp is older than the last in-window packet, then the sequence number has wrapped and the ”new” packet is actually an older duplicate. This resolves the ambiguity of re-transmits even for extreme corner cases.

But this extension allows for more than just detection of old packets. The other major feature made possible by TCP timestamps are more precise round-trip time measurements (RTTm).

A need for precise round-trip-time estimation

When both peers support timestamps,  every TCP segment carries two additional numbers: a timestamp value and a timestamp echo.

 
TCP Timestamp option (TSopt): Kind: 8, Length: 10
+-------+----+----------------+-----------------+
|Kind=8 | 10 |TS Value (TSval)|EchoReply (TSecr)|
+-------+----+----------------+-----------------+
    1      1         4                4

An accurate RTT estimate is crucial for TCP performance. TCP automatically re-sends data that was not acknowledged. Re-transmission is triggered by a timer: If it expires, TCP considers one or more packets that it has not yet received an acknowledgment for to be lost. They are then sent again.

But “has not been acknowledged” does not mean the segment was lost. It is also possible that the receiver did not send an acknowledgment so far or that the acknowledgment is still in flight. This creates a dilemma: TCP must wait long enough for such slight delays to not matter, but it can’t wait for too long either.

Low versus high network delay

In networks with a high delay, if the timer fires too fast, TCP frequently wastes time and bandwidth with unneeded re-sends.

In networks with a low delay however,  waiting for too long causes reduced throughput when a real packet loss occurs. Therefore, the timer should expire sooner in low-delay networks than in those with a high delay. The tcp retransmit timeout therefore cannot use a fixed constant value as a timeout. It needs to adapt the value based on the delay that it experiences in the network.

Round-trip time measurement

TCP picks a retransmit timeout that is based on the expected round-trip time (RTT). The RTT is not known in advance. RTT is estimated by measuring the delta between the time a segment is sent and the time TCP receives an acknowledgment for the data carried by that segment.

This is complicated by several factors.

  • For performance reasons, TCP does not generate a new acknowledgment for every packet it receives. It waits  for a very small amount of time: If more segments arrive, their reception can be acknowledged with a single ACK packet. This is called “cumulative ACK”.
  •  The round-trip-time is not constant. This is because of a myriad of factors. For example, a client might be a mobile phone switching to different base stations as its moved around. Its also possible that packet switching takes longer when link or CPU utilization increases.
  • a packet that had to be re-sent must be ignored during computation. This is because the sender cannot tell if the ACK for the re-transmitted segment is acknowledging the original transmission (that arrived after all) or the re-transmission.

This last point is significant: When TCP is busy recovering from a loss, it may only receives ACKs for re-transmitted segments. It then can’t measure (update) the RTT during this recovery phase. As a consequence it can’t adjust the re-transmission timeout, which then keeps growing exponentially. That’s a pretty specific case (it assumes that other mechanisms such as fast retransmit or SACK did not help). Nevertheless, with TCP timestamps, RTT evaluation is done even in this case.

If the extension is used, the peer reads the timestamp value from the TCP segments extension space and stores it locally. It then places this value in all the segments it sends back as the “timestamp echo”.

Therefore the option carries two timestamps: Its senders own timestamp and the most recent timestamp it received from the peer. The “echo timestamp” is used by the original sender to compute the RTT. Its the delta between its current timestamp clock and what was reflected in the “timestamp echo”.

Other timestamp uses

TCP timestamps even have other uses beyond PAWS and RTT measurements. For example it becomes possible to detect if a retransmission was unnecessary. If the acknowledgment carries an older timestamp echo, the acknowledgment was for the initial packet, not the re-transmitted one.

Another, more obscure use case for TCP timestamps is related to the TCP syn cookie feature.

TCP connection establishment on server side

When connection requests arrive faster than a server application can accept the new incoming connection, the connection backlog will eventually reach its limit. This can occur because of a mis-configuration of the system or a bug in the application. It also happens when one or more clients send connection requests without reacting to the ‘syn ack’ response. This fills the connection queue with incomplete connections. It takes several seconds for these entries to time out. This is called a “syn flood attack”.

TCP timestamps and TCP syn cookies

Some TCP stacks allow to accept new connections even if the queue is full. When this happens, the Linux kernel will print a prominent message to the system log:

Possible SYN flooding on port P. Sending Cookies. Check SNMP counters.

This mechanism bypasses the connection queue entirely. The information that is normally stored in the connection queue is encoded into the SYN/ACK responses TCP sequence number. When the ACK comes back, the queue entry can be rebuilt from the sequence number.

The sequence number only has limited space to store information. Connections established using the ‘TCP syn cookie’ mechanism can not support TCP options for this reason.

The TCP options that are common to both peers can be stored in the timestamp, however. The ACK packet reflects the value back in the timestamp echo field which allows to recover the agreed-upon TCP options as well. Else, cookie-connections are restricted by the standard 64 kbyte receive window.

Common myths – timestamps are bad for performance

Unfortunately some guides recommend disabling TCP timestamps to reduce the number of times the kernel needs to access the timestamp clock to get the current time. This is not correct. As explained before, RTT estimation is a necessary part of TCP. For this reason, the kernel always takes a microsecond-resolution time stamp when a packet is received/sent.

Linux re-uses the clock timestamp taken for the RTT estimation for the remainder of the packet processing step. This also avoids the extra clock access to add a timestamp to an outgoing TCP packet.

The entire timestamp option only requires 10 bytes of TCP option space in each packet, this is not a significant decrease in space available for packet payload.

common myths – timestamps are a security problem

Some security audit tools and (older) blog posts recommend to disable TCP
timestamps because they allegedly leak system uptime: This would then allow to estimate the patch level of the system/kernel. This was true in the past: The timestamp clock is based on a constantly increasing value that starts at a fixed value on each system boot. A timestamp value would give a estimate as to how long the machine has been running (uptime).

As of Linux 4.12 TCP timestamps do not reveal the uptime anymore. All timestamp values sent use a peer-specific offset. Timestamp values also wrap every 49 days.

In other words, connections from or to address “A” see a different timestamp than connections to the remote address “B”.

Run sysctl net.ipv4.tcp_timestamps=2 to disable the randomization offset. This makes analyzing packet traces recorded by tools like wireshark or tcpdump easier – packets sent from the host then all have the same clock base in their TCP option timestamp.  For normal operation the default setting should be left as-is.

Selective Acknowledgments

TCP has problems if several packets in the same window of data are lost. This is because TCP Acknowledgments are cumulative, but only for packets
that arrived in-sequence. Example:

  • Sender transmits segments s_1, s_2, s_3, … s_n
  • Sender receives ACK for s_2
  • This means that both s_1 and s_2 were received and the
    sender no longer needs to keep these segments around.
  • Should s_3 be re-transmitted? What about s_4? s_n?

The sender waits for a “retransmission timeout” or ‘duplicate ACKs’ for s_2 to arrive. If a retransmit timeout occurs or several duplicate ACKs for s_2 arrive, the sender transmits s_3 again.

If the sender receives an acknowledgment for s_n, s_3 was the only missing packet. This is the ideal case. Only the single lost packet was re-sent.

If the sender receives an acknowledged segment that is smaller than s_n, for example s_4, that means that more than one packet was lost. The
sender needs to re-transmit the next segment as well.

Re-transmit strategies

Its possible to just repeat the same sequence: re-send the next packet until the receiver indicates it has processed all packet up to s_n. The problem with this approach is that it requires one RTT until the sender knows which packet it has to re-send next. While such strategy avoids unnecessary re-transmissions, it can take several seconds and more until TCP has re-sent the entire window of data.

The alternative is to re-send several packets at once. This approach allows TCP to recover more quickly when several packets have been lost. In the above example TCP re-send s_3, s_4, s_5, .. while it can only be sure that s_3 has been lost.

From a latency point of view, neither strategy is optimal. The first strategy is fast if only a single packet has to be re-sent, but takes too long when multiple packets were lost.

The second one is fast even if multiple packet have to be re-sent, but at the cost of wasting bandwidth. In addition, such a TCP sender could have transmitted new data already while it was doing the unneeded re-transmissions.

With the available information TCP cannot know which packets were lost. This is where TCP Selective Acknowledgments (SACK) come in. Just like window scaling and timestamps, it is another optional, yet very useful TCP feature.

The SACK option

 
   TCP Sack-Permitted Option: Kind: 4, Length 2
   +---------+---------+
   | Kind=4  | Length=2|
   +---------+---------+

A sender that supports this extension includes the “Sack Permitted” option in the connection request. If both endpoints support the extension, then a peer that detects a packet is missing in the data stream can inform the sender about this.

 
   TCP SACK Option: Kind: 5, Length: Variable
                     +--------+--------+
                     | Kind=5 | Length |
   +--------+--------+--------+--------+
   |      Left Edge of 1st Block       |
   +--------+--------+--------+--------+
   |      Right Edge of 1st Block      |
   +--------+--------+--------+--------+
   |                                   |
   /            . . .                  /
   |                                   |
   +--------+--------+--------+--------+
   |      Left Edge of nth Block       |
   +--------+--------+--------+--------+
   |      Right Edge of nth Block      |
   +--------+--------+--------+--------+

A receiver that encounters segment_s2 followed by s_5…s_n, it will include a SACK block when it sends the acknowledgment for s_2:

 
                +--------+-------+
                | Kind=5 |   10  |
+--------+------+--------+-------+
| Left edge: s_5                 |
+--------+--------+-------+------+
| Right edge: s_n                |
+--------+-------+-------+-------+

This tells the sender that segments up to s_2 arrived in-sequence, but it also lets the sender know that the segments s_5 to s_n were also received. The sender can then re-transmit these two packets and proceed to send new data.

The mythical lossless network

In theory SACK provides no advantage if the connection cannot experience packet loss. Or the connection has such a low latency that even waiting one full RTT does not matter.

In practice lossless behavior is virtually impossible to ensure.
Even if the network and all its switches and routers have ample bandwidth and buffer space packets can still be lost:

  • The host operating system might be under memory pressure and drop
    packets. Remember that a host might be handling tens of thousands of packet streams simultaneously.
  • The CPU might not be able to drain incoming packets from the network interface fast enough. This causes packet drops in the network adapter itself.
  • If TCP timestamps are not available even a connection with a very small RTT can stall momentarily during loss recovery.

Use of SACK does not increase the size of TCP packets unless a connection experiences packet loss. Because of this, there is hardly a reason to disable this feature. Almost all TCP stacks support SACK – it is typically only absent on low-power IOT-alike devices that are not doing TCP bulk data transfers.

When a Linux system accepts a connection from such a device, TCP automatically disables SACK for the affected connection.

Summary

The three TCP extensions examined in this post are all related to TCP performance and should best be left to the default setting: enabled.

The TCP handshake ensures that only extensions that are understood by both parties are used, so there is never a need to disable an extension globally just because a peer might not support it.

Turning these extensions off results in severe performance penalties, especially in case of TCP Window Scaling and SACK. TCP timestamps can be disabled without an immediate disadvantage, however there is no compelling reason to do so anymore. Keeping them enabled also makes it possible to support TCP options even when SYN cookies come into effect.

Posted on Leave a comment

Automating Network Devices with Ansible

Ansible is a great automation tool for system and network engineers, with Ansible we can automate small network to a large scale enterprise network. I have been using Ansible to automate both Aruba, and Cisco switches from my Fedora powered laptops for a couple of years. This article covers the requirements and executing a couple of playbooks.

Configuring Ansible

If Ansible is not installed, it can be installed using the command below

$ sudo dnf -y install ansible

Once installed, create a folder in your home directory or a directory of your preference and copy the ansible configuration file. For this demonstration, I will be using the following.

$ mkdir -pv /home/$USER/network_automation
$ sudo cp -v /etc/ansible.cfg /home/$USER/network_automation
$ cd /home/$USER/network_automation
$ sudo chown $USER.$USER && chmod 0600 ansible.cfg

To prevent lengthy commands from failing, edit the ansible.cfg and append the following lines. We must add the persistent connection and set the desired time in seconds for the command_timeout as demonstrated below. A use case where this is useful is when you are performing backups of a network device that has a lengthy configuration.

$ vim ansible.cfg
[persistent_connection]
command_timeout = 300
connection_timeout = 30

Requirements

If SELinux is enabled, you will need to install SELinux binding, which is required when using the copy module.

# Install SELinux bindings
dnf -y install python3-libselinux python3-libsemanage

Creating the inventory

The inventory holds the names of the network assets, and grouping of the assets are in square brackets [], below is a  sample inventory.

[site_a]
Core_A ansible_host=192.168.122.200
Distro_A ansible_host=192.168.122.201
Distro_B ansible_host=192.168.122.202

Group vars can be used to address the common variables, for example, credentials, network operating system, and so on. Ansible document on inventory provides additional details.

Playbook

Playbooks are Ansible’s configuration, deployment, and orchestration language. They can describe a policy you want your remote systems to enforce, or a set of steps in a general IT process.
Ansible Playbook

Read Operations

Let us create a simple playbook to run a show command to read the configuration on a few switches.

 
  1 ---
  2 - name: Basic Playbook
  3   hosts: site_a
  4   connection: local
  5
  6   tasks:
  7   - name: Get Interface Brief
  8     ios_command:
  9       commands:
 10         - show ip interface brief | e una
 11     register: interfaces
 12
 13   - name: Print results
 14     debug:
 15       msg: "{{ interfaces.stdout[0] }}
Without Debug

With Debug

The above images show the differences without and with the debug module respectively.

Let’s break the playbook into three blocks, starting with lines 1 to 4.

  • The three dashes/hyphens starts the YAML document
  • The hosts defines the hosts or host groups, multiple groups are comma-separated
  • Connection defines the methodology to connect to the network devices. Another option is network_cli (recommended method) and will be used later in this article. See IOS Platform Options for more details.

Lines 6 to 11 starts the tasks, we will be using ios_command and ios_config. This play will execute the show command show ip interface brief | e una and save the output from the command into the interfaces variable, with the register key.

Lines 13 to 15, by default, when you execute a show command you will not see the output, though this is not used during automation. It is very useful for debugging; therefore, the debug module was used.

The below video shows the execution of the playbook. There are a couple of ways you can execute the playbook.

  • Passing arguments to the command line, for example, include -u <username> -k to prompt for the remote user credentials
 
ansible-playbook -i inventory show_demo.yaml -u admin -k
  • Include the credentials in the host or group vars
 
ansible-playbook -i inventory show_demo.yaml

Never store passwords in plain text. We recommend using SSH keys to authenticate SSH connections. Ansible supports ssh-agent to manage your SSH keys. If you must use passwords to authenticate SSH connections, we recommend encrypting them with
Using Vault in Playbooks

Passing arguments to the command line
Credentials in the inventory

If we want to save the output to a file, we will use the copy module as shown in the playbook below. In addition to using the copy module, we will include the backup_dir variable to specify the directory path.

 
---
- name: Get System Infomation
  hosts: site_a
  connection: network_cli
  gather_facts: no
 
  vars:
    backup_dir: /home/eramirez/dev/ansible/fedora_magazine
 
  tasks:
  - name: get system interfaces
    ios_command:
      commands:
        - show ip int br | e una
    register: interface
   
  - name: Save result to disk
    copy:
      content: "{{ interface.stdout[0] }}"
      dest: "{{ backup_dir }}/{{ inventory_hostname }}.txt"

To demonstrate the use of variables in the inventory, we will use plain text. This method Must not be used in production.

 
[site_a]
Core_A ansible_host=192.168.122.200
Distro_A ansible_host=192.168.122.201
Distro_B ansible_host=192.168.122.202
[all:vars]
ansible_connection=network_cli
ansible_network_os=ios
ansible_user=admin
ansible_password=fedora
ansible_become=yes
ansible_become_password=yes
ansible_become_method=enable

Write Operations

In the previous section, we saw that we could get information from the network devices; in this section, we will write (add/modify) the configuration on these network devices. To make changes to the network device, we will be using the ios config module.

Let us create a playbook to configure a couple of interfaces in all of the network devices in site_a. We will first take a backup of the current configuration of all devices in site_a. Lastly, we will save the configuration.

 
---
- name: Get System Infomation
  hosts: site_a
  connection: network_cli
  gather_facts: no
 
  vars:
    backup_dir: /home/eramirez/dev/ansible/fedora_magazine
 
  tasks:
  - name: Backup configs
    ios_config:
      backup: yes
      backup_options:
        filename: "{{ inventory_hostname }}_running_cfg.txt"
        dir_path: "{{ backup_dir }}"
   
  - name: get system interfaces
    ios_config:
      lines:
        - description Raspberry Pi
        - switchport mode access
        - switchport access vlan 100
        - spanning-tree portfast
        - logging event link-status
        - no shutdown
      parents: "{{ item }}"
    with_items:
      - interface FastEthernet1/12
      - interface FastEthernet1/13
     
  - name: Save switch configuration
    ios_config:
      save_when: modified

Before we execute the playbook, we will first validate the interface configuration. We will then run the playbook and confirm the changes as illustrated below.

Conclusion

This article is a basic introduction to whet your appetite that demonstrates how Ansible is used to manage network devices. Ansible is capable of automating a vast network, which includes MPLS routing and performing validation before executing the next task.

Posted on Leave a comment

Internet connection sharing with NetworkManager

NetworkManager is the network configuration daemon used on Fedora and many other distributions. It provides a consistent way to configure network interfaces and other network-related aspects on a Linux machine. Among many other features, it provides a Internet connection sharing functionality that can be very useful in different situations.

For example, suppose you are in a place without Wi-Fi and want to share your laptop’s mobile data connection with friends. Or maybe you have a laptop with broken Wi-Fi and want to connect it via Ethernet cable to another laptop; in this way the first laptop become able to reach the Internet and maybe download new Wi-Fi drivers.

In cases like these it is useful to share Internet connectivity with other devices. On smartphones this feature is called “Tethering” and allows sharing a cellular connection via Wi-Fi, Bluetooth or a USB cable.

This article shows how the connection sharing mode offered by NetworkManager can be set up easily; it addition, it explains how to configure some more advanced features for power users.

How connection sharing works

The basic idea behind connection sharing is that there is an upstream interface with Internet access and a downstream interface that needs connectivity. These interfaces can be of a different type—for example, Wi-Fi and Ethernet.

If the upstream interface is connected to a LAN, it is possible to configure our computer to act as a bridge; a bridge is the software version of an Ethernet switch. In this way, you “extend” the LAN to the downstream network. However this solution doesn’t always play well with all interface types; moreover, it works only if the upstream network uses private addresses.

A more general approach consists in assigning a private IPv4 subnet to the downstream network and turning on routing between the two interfaces. In this case, NAT (Network Address Translation) is also necessary. The purpose of NAT is to modify the source of packets coming from the downstream network so that they look as if they originate from your computer.

It would be inconvenient to configure manually all the devices in the downstream network. Therefore, you need a DHCP server to assign addresses automatically and configure hosts to route all traffic through your computer. In addition, in case the sharing happens through Wi-Fi, the wireless network adapter must be configured as an access point.

There are many tutorials out there explaining how to achieve this, with different degrees of difficulty. NetworkManager hides all this complexity and provides a shared mode that makes this configuration quick and convenient.

Configuring connection sharing

The configuration paradigm of NetworkManager is based on the concept of connection (or connection profile). A connection is a group of settings to apply on a network interface.

This article shows how to create and modify such connections using nmcli, the NetworkManager command line utility, and the GTK connection editor. If you prefer, other tools are available such as nmtui (a text-based user interface), GNOME control center or the KDE network applet.

A reasonable prerequisite to share Internet access is to have it available in the first place; this implies that there is already a NetworkManager connection active. If you are reading this, you probably already have a working Internet connection. If not, see this article for a more comprehensive introduction to NetworkManager.

The rest of this article assumes you already have a Wi-Fi connection profile configured and that connectivity must be shared over an Ethernet interface enp1s0.

To enable sharing, create a connection for interface enp1s0 and set the ipv4.method property to shared instead of the usual auto:

$ nmcli connection add type ethernet ifname enp1s0 ipv4.method shared con-name local

The shared IPv4 method does multiple things:

  • enables IP forwarding for the interface;
  • adds firewall rules and enables masquerading;
  • starts dnsmasq as a DHCP and DNS server.

NetworkManager connection profiles, unless configured otherwise, are activated automatically. The new connection you have added should be already active in the device status:

$ nmcli device
DEVICE         TYPE      STATE         CONNECTION
enp1s0         ethernet  connected     local
wlp4s0         wifi      connected     home-wifi

If that is not the case, activate the profile manually with nmcli connection up local.

Changing the shared IP range

Now look at how NetworkManager configured the downstream interface enp1s0:

$ ip -o addr show enp1s0
8: enp1s0 inet 10.42.0.1/24 brd 10.42.0.255 ...

10.42.0.1/24 is the default address set by NetworkManager for a device in shared mode. Addresses in this range are also distributed via DHCP to other computers. If the range conflicts with other private networks in your environment, change it by modifying the ipv4.addresses property:

$ nmcli connection modify local ipv4.addresses 192.168.42.1/24

Remember to activate again the connection profile after any change to apply the new values:

$ nmcli connection up local $ ip -o addr show enp1s0
8: enp1s0 inet 192.168.42.1/24 brd 192.168.42.255 ...

If you prefer using a graphical tool to edit connections, install the nm-connection-editor package. Launch the program and open the connection to edit; then select the Shared to other computers method in the IPv4 Settings tab. Finally, if you want to use a specific IP subnet, click Add and insert an address and a netmask.

Adding custom dnsmasq options

In case you want to further extend the dnsmasq configuration, you can add new configuration snippets in /etc/NetworkManager/dnsmasq-shared.d/. For example, the following configuration:

dhcp-option=option:ntp-server,192.168.42.1
dhcp-host=52:54:00:a4:65:c8,192.168.42.170

tells dnsmasq to advertise a NTP server via DHCP. In addition, it assigns a static IP to a client with a certain MAC.

There are many other useful options in the dnsmasq manual page. However, remember that some of them may conflict with the rest of the configuration; so please use custom options only if you know what you are doing.

Other useful tricks

If you want to set up sharing via Wi-Fi, you could create a connection in Access Point mode, manually configure the security, and then enable connection sharing. Actually, there is a quicker way, the hotspot mode:

$ nmcli device wifi hotspot [ifname $dev] [password $pw]

This does everything needed to create a functional access point with connection sharing. The interface and password options are optional; if they are not specified, nmcli chooses the first Wi-Fi device available and generates a random password. Use the ‘nmcli device wifi show-password‘ command to display information for the active hotspot; the output includes the password and a text-based QR code that you can scan with a phone:

What about IPv6?

Until now this article discussed sharing IPv4 connectivity. NetworkManager also supports sharing IPv6 connectivity through DHCP prefix delegation. Using prefix delegation, a computer can request additional IPv6 prefixes from the DHCP server. Those public routable addresses are assigned to local networks via Router Advertisements. Again, NetworkManager makes all this easier through the shared IPv6 mode:

$ nmcli connection modify local ipv6.method shared

Note that IPv6 sharing requires support from the Internet Service Provider, which should give out prefix delegations through DHCP. If the ISP doesn’t provides delegations, IPv6 sharing will not work; in such case NM will report in the journal that no prefixes are available:

policy: ipv6-pd: none of 0 prefixes of wlp1s0 can be shared on enp1s0

Also, note that the Wi-Fi hotspot command described above only enables IPv4 sharing; if you want to also use IPv6 sharing you must edit the connection manually.

Conclusion

Remember, the next time you need to share your Internet connection, NetworkManager will make it easy for you.

If you have suggestions on how to improve this feature or any other feedback, please reach out to the NM community using the mailing list, the issue tracker or joining the #nm IRC channel on freenode.

Posted on Leave a comment

Cloning a MAC address to bypass a captive portal

If you ever attach to a WiFi system outside your home or office, you often see a portal page. This page may ask you to accept terms of service or some other agreement to get access. But what happens when you can’t connect through this kind of portal? This article shows you how to use NetworkManager on Fedora to deal with some failure cases so you can still access the internet.

How captive portals work

Captive portals are web pages offered when a new device is connected to a network. When the user first accesses the Internet, the portal captures all web page requests and redirects them to a single portal page.

The page then asks the user to take some action, typically agreeing to a usage policy. Once the user agrees, they may authenticate to a RADIUS or other type of authentication system. In simple terms, the captive portal registers and authorizes a device based on the device’s MAC address and end user acceptance of terms. (The MAC address is a hardware-based value attached to any network interface, like a WiFi chip or card.)

Sometimes a device doesn’t load the captive portal to authenticate and authorize the device to use the location’s WiFi access. Examples of this situation include mobile devices and gaming consoles (Switch, Playstation, etc.). They usually won’t launch a captive portal page when connecting to the Internet. You may see this situation when connecting to hotel or public WiFi access points.

You can use NetworkManager on Fedora to resolve these issues, though. Fedora will let you temporarily clone the connecting device’s MAC address and authenticate to the captive portal on the device’s behalf. You’ll need the MAC address of the device you want to connect. Typically this is printed somewhere on the device and labeled. It’s a six-byte hexadecimal value, so it might look like 4A:1A:4C:B0:38:1F. You can also usually find it through the device’s built-in menus.

Cloning with NetworkManager

First, open nm-connection-editor, or open the WiFI settings via the Settings applet. You can then use NetworkManager to clone as follows:

  • For Ethernet – Select the connected Ethernet connection. Then select the Ethernet tab. Note or copy the current MAC address. Enter the MAC address of the console or other device in the Cloned MAC address field.
  • For WiFi – Select the WiFi profile name. Then select the WiFi tab. Note or copy the current MAC address. Enter the MAC address of the console or other device in the Cloned MAC address field.

Bringing up the desired device

Once the Fedora system connects with the Ethernet or WiFi profile, the cloned MAC address is used to request an IP address, and the captive portal loads. Enter the credentials needed and/or select the user agreement. The MAC address will then get authorized.

Now, disconnect the WiFi or Ethernet profile, and change the Fedora system’s MAC address back to its original value. Then boot up the console or other device. The device should now be able to access the Internet, because its network interface has been authorized via your Fedora system.

This isn’t all that NetworkManager can do, though. For instance, check out this article on randomizing your system’s hardware address for better privacy.

Posted on Leave a comment

Use sshuttle to build a poor man’s VPN

Nowadays, business networks often use a VPN (virtual private network) for secure communications with workers. However, the protocols used can sometimes make performance slow. If you can reach reach a host on the remote network with SSH, you could set up port forwarding. But this can be painful, especially if you need to work with many hosts on that network. Enter sshuttle — which lets you set up a quick and dirty VPN with just SSH access. Read on for more information on how to use it.

The sshuttle application was designed for exactly the kind of scenario described above. The only requirement on the remote side is that the host must have Python available. This is because sshuttle constructs and runs some Python source code to help transmit data.

Installing sshuttle

The sshuttle application is packaged in the official repositories, so it’s easy to install. Open a terminal and use the following command with sudo:

$ sudo dnf install sshuttle

Once installed, you may find the manual page interesting:

$ man sshuttle

Setting up the VPN

The simplest case is just to forward all traffic to the remote network. This isn’t necessarily a crazy idea, especially if you’re not on a trusted local network like your own home. Use the -r switch with the SSH username and the remote host name:

$ sshuttle -r username@remotehost 0.0.0.0/0

However, you may want to restrict the VPN to specific subnets rather than all network traffic. (A complete discussion of subnets is outside the scope of this article, but you can read more here on Wikipedia.) Let’s say your office internally uses the reserved Class A subnet 10.0.0.0 and the reserved Class B subnet 172.16.0.0. The command above becomes:

$ sshuttle -r username@remotehost 10.0.0.0/8 172.16.0.0/16

This works great for working with hosts on the remote network by IP address. But what if your office is a large network with lots of hosts? Names are probably much more convenient — maybe even required. Never fear, sshuttle can also forward DNS queries to the office with the –dns switch:

$ sshuttle --dns -r username@remotehost 10.0.0.0/8 172.16.0.0/16

To run sshuttle like a daemon, add the -D switch. This also will send log information to the systemd journal via its syslog compatibility.

Depending on the capabilities of your system and the remote system, you can use sshuttle for an IPv6 based VPN. You can also set up configuration files and integrate it with your system startup if desired. If you want to read even more about sshuttle and how it works, check out the official documentation. For a look at the code, head over to the GitHub page.


Photo by Kurt Cotoaga on Unsplash.