Posted on Leave a comment

Web of Trust, Part 1: Concept

Every day we rely on technologies that nobody can fully understand. Since well before the industrial revolution, complex and challenging tasks have required an approach that breaks the work into smaller-scale tasks. Each task produces specialized knowledge that we use in some parts of our lives, while in other parts we trust the skills that others have learned. This shared-knowledge approach also applies to software. Even the most avid readers of this magazine will likely not compile and validate every piece of code they run, simply because the world of computers is itself too big for one person to grasp.

Still, even though it is nearly impossible to understand everything that happens within your PC when you are using it, that does not leave you blind and unprotected. FLOSS software shares trust, giving protection to all users, even if individual users can’t grasp all parts in the system. This multi-part article will discuss how this ‘Web of Trust’ works and how you can get involved.

But first we’ll have to take a step back and discuss the basic concepts before we can delve into the details and the web. A note before we start: security is not just about viruses and malware. Security also includes your privacy, your economic stability and your technological independence.

One-Way System

By design, computers can only function through the most rudimentary logic: true or false, AND or OR. This Boolean logic is not readily accessible to humans, so we must do something special. We write applications in code that we can (reasonably) comprehend (human-readable code). Once completed, we turn this human-readable code into code that the computer can comprehend (machine code).

The step of conversion is called compilation and/or building, and it’s a one-way process. Compiled code (machine code) is not really understandable by humans, and it takes special tools to study it in detail. You can understand small chunks, but on the whole, an entire application becomes a black box.

This subtle difference shifts power, in this case the influence of one person over another. The person who has written the human-readable version of the application and then releases it as compiled code for use by others knows all about what the code does, while the end user sees only a very limited part. When using software in compiled form, it is impossible to know for certain what an application is intended to do unless the original human-readable code can be viewed.

The Nature of Power

Richard Stallman spearheaded the discussion of this shift of power. It began in the 1980s, when computers left the world of academia and research and entered the world of commerce and consumers. Suddenly, that power became a source of control and exploitation.

One way to combat this imbalance of power is the concept of FLOSS software. FLOSS software is built on the 4-Freedoms, which give you a wide array of other ‘affiliated’ rights and guarantees. In essence, FLOSS software uses copyright licensing as a form of moral contract that forces software developers not to leverage the one-way power against their users. The principal way of doing this is with the GNU General Public Licenses, which Richard Stallman created and has been promoting ever since.

One of those guarantees is that you can see the code that should be running on your device. When you get a device using FLOSS software, the manufacturer should provide you with the code that the device is using, as well as all the instructions that you need to compile that code yourself. Then you can replace the code on the device with the version you compiled yourself. Even better, if you compare the version you have with the version on the device, you can see if the device manufacturer tried to cheat you or other customers.
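
The comparison described above can be sketched with checksums. This is only an illustration, not a complete verification procedure: it assumes the build is reproducible (the same source produces bit-identical output), and the function name and the file paths are hypothetical.

```shell
#!/usr/bin/env bash
# Compare two build artifacts by their SHA-256 digests.
# Assumption: the build is reproducible; if it is not, differing
# digests do not prove that anything is wrong.
same_build() {
    local mine theirs
    mine=$(sha256sum -- "$1") || return 2      # unreadable file: status 2
    theirs=$(sha256sum -- "$2") || return 2
    # sha256sum prints "<digest>  <file>"; keep only the digest.
    [ "${mine%% *}" = "${theirs%% *}" ]
}

# Usage (hypothetical paths):
#   same_build ./my-build/firmware.bin ./from-device/firmware.bin \
#       && echo "builds match" || echo "builds differ"
```

A match only shows that the two files are identical; it says nothing about whether the source code itself is trustworthy, which is where review by others comes in.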

This is where the Web of Trust comes back into the picture. The Web of Trust implies that even if the vast majority of people can’t validate the workings of a device, others can do so on their behalf. Journalists, security analysts and hobbyists can do the work that others might be unable to do. And if they find something, they have the power to share their findings.

Security by Blind Trust

This holds, of course, only if the application and all components underneath it are FLOSS. Proprietary software, or even software which is merely Open Source, has compiled versions that nobody can recreate and validate. Thus, you can never truly know if that software is secure. It might have a backdoor, it might sell your personal data, or it might be pushing a closed ecosystem to create vendor lock-in. With closed-source software, your security is only as good as the company making the software is trustworthy.

For companies and developers, this actually creates another snare. While you might still care about your users and their security, you’re a liability: if a criminal can get to your official builds or supply chain, there is no way for anybody to discover that afterwards. An increasing number of attacks do not target users directly, but instead try to get in by exploiting the trust that companies and developers have carefully grown.

You should also not underestimate pressure from outside: governments can ask you to ignore a vulnerability, or they might even demand cooperation. Investment firms or shareholders may also insist that you create vendor lock-in for future use. The blind trust that you demand of your users can be used against you.

Security by a Web of Trust

If you are a user, FLOSS software is good because others can warn you when they find suspicious elements. You can use any FLOSS device with minimal economic risk, and there are many FLOSS developers who care for your privacy. Even if the details are beyond you, there are rules in place to facilitate trust.

If you are a tinkerer, FLOSS is good because with a little extra work, you can check the promises of others. You can warn people when something goes wrong, and you can validate the warnings of others. You’re also able to check individual parts in a larger picture. The libraries used by FLOSS applications are also open for review: it’s “Trust all the way down”.

For companies and developers, FLOSS is also a great reassurance that your trust can’t be easily subverted. If malicious actors wish to attack your users, any irregularity can quickly be spotted. Last but not least, since you also stand to defend your customers’ economic well-being and privacy, you can use that as an important selling point to customers who care about their own security.

Fedora’s case

Fedora embraces the concept of FLOSS and stands strong to defend it. There are comprehensive legal guidelines, and Fedora’s own principles, the Four Foundations (Freedom, Friends, Features, and First), directly reference the 4-Freedoms.

Fedora's Foundation logo, with Freedom highlighted. Illustrative.

To this end, entire systems have been set up to facilitate this kind of security. Fedora works completely in the open, and any user can check the official servers. Koji is the name of the Fedora build system, and you can see every application and its build logs there. For added security, there is also Bodhi, which orchestrates the deployment of an application. Multiple people must approve it before the application can become available.

This creates the Web of Trust on which you can rely. Every package in the repository goes through the same process, and at every point somebody can intervene. There are also escalation systems in place to report issues, so that issues can quickly be tackled when they occur. Individual contributors also know that they can be reviewed at any time, which itself is already enough of a precaution to dissuade mischievous thoughts.

You don’t have to trust Fedora implicitly; you can get something better: trust in users like you.

Use dnsmasq to provide DNS & DHCP services

Many tech enthusiasts find the ability to control their host name resolution important. Setting up servers and services usually requires some form of fixed address, and sometimes also requires special forms of resolution such as defining Kerberos or LDAP servers, mail servers, etc. All of this can be achieved with dnsmasq.

dnsmasq is a lightweight and simple program which enables issuing DHCP addresses on your network and registering the hostname & IP address in DNS. This configuration also allows external resolution, so your whole network will be able to speak to itself and find external sites too.

This article covers installing and configuring dnsmasq on either a virtual machine or small physical machine like a Raspberry Pi so it can provide these services in your home network or lab. If you have an existing setup and just need to adjust the settings for your local workstation, read the previous article which covers configuring the dnsmasq plugin in NetworkManager.

Install dnsmasq

First, install the dnsmasq package:

sudo dnf install dnsmasq

Next, enable and start the dnsmasq service:

sudo systemctl enable --now dnsmasq

Configure dnsmasq

First, make a backup copy of the dnsmasq.conf file:

sudo cp /etc/dnsmasq.conf /etc/dnsmasq.conf.orig

Next, edit the file and make changes to the following to reflect your network. In this example, mydomain.org is the domain name, 192.168.1.10 is the IP address of the dnsmasq server and 192.168.1.1 is the default gateway.

sudo vi /etc/dnsmasq.conf

Insert the following contents:

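# Never forward plain names or private reverse lookups upstream
domain-needed
bogus-priv
# Ignore /etc/resolv.conf and use these upstream DNS servers instead
no-resolv
server=8.8.8.8
server=8.8.4.4
# Answer queries for the local domain only from /etc/hosts and DHCP leases
local=/mydomain.org/
listen-address=::1,127.0.0.1,192.168.1.10
# Append the domain to simple host names from /etc/hosts
expand-hosts
domain=mydomain.org
# Hand out addresses .100 to .200 with a 24-hour lease
dhcp-range=192.168.1.100,192.168.1.200,24h
# Tell DHCP clients which default gateway to use
dhcp-option=option:router,192.168.1.1
dhcp-authoritative
dhcp-leasefile=/var/lib/dnsmasq/dnsmasq.leases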

Test the config to check for typos and syntax errors:

$ sudo dnsmasq --test
dnsmasq: syntax check OK.

Now edit the hosts file, which can contain both statically- and dynamically-allocated hosts. Static addresses should lie outside the DHCP range you specified earlier. Hosts using DHCP but which need a fixed address should be entered here with an address within the DHCP range.

sudo vi /etc/hosts

The first two lines should be there already. Add the remaining lines to configure the router, the dnsmasq server, and two additional servers.

127.0.0.1   localhost localhost.localdomain
::1         localhost localhost.localdomain
192.168.1.1    router
192.168.1.10   dnsmasq
192.168.1.20   server1
192.168.1.30   server2
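
Before restarting the service, it can help to confirm that the static entries really lie outside the DHCP range. The following is a minimal sketch, assuming bash and IPv4 addresses only; the function names are hypothetical, and the range is the one from the dhcp-range line in the example configuration.

```shell
#!/usr/bin/env bash
# Convert a dotted-quad IPv4 address to an integer so it can be compared.
ip_to_int() {
    local a b c d
    IFS=. read -r a b c d <<< "$1"
    echo $(( (a << 24) + (b << 16) + (c << 8) + d ))
}

# Succeed when address $1 lies inside the inclusive range $2..$3.
in_dhcp_range() {
    local ip lo hi
    ip=$(ip_to_int "$1")
    lo=$(ip_to_int "$2")
    hi=$(ip_to_int "$3")
    [ "$ip" -ge "$lo" ] && [ "$ip" -le "$hi" ]
}

# Check the static addresses from the example hosts file.
for ip in 192.168.1.1 192.168.1.10 192.168.1.20 192.168.1.30; do
    if in_dhcp_range "$ip" 192.168.1.100 192.168.1.200; then
        echo "$ip is inside the DHCP range"
    else
        echo "$ip is outside the DHCP range"
    fi
done
```

All four example addresses should be reported as outside the range, which is what you want for purely static hosts.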

Restart the dnsmasq service:

sudo systemctl restart dnsmasq

Next add the services to the firewall to allow the clients to connect:

sudo firewall-cmd --add-service={dns,dhcp}
sudo firewall-cmd --runtime-to-permanent

Test name resolution

First, install bind-utils to get the nslookup and dig commands. These allow you to perform both forward and reverse lookups. You could use ping if you’d rather not install extra packages, but these tools are worth installing for the additional troubleshooting functionality they can provide.

sudo dnf install bind-utils

Now test the resolution. First, test the forward (hostname to IP address) resolution:

$ nslookup server1
Server:       127.0.0.1
Address:    127.0.0.1#53
Name:    server1.mydomain.org
Address: 192.168.1.20

Next, test the reverse (IP address to hostname) resolution:

$ nslookup 192.168.1.20
20.1.168.192.in-addr.arpa    name = server1.mydomain.org.
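
The name in that answer is simply the IP address with its four parts reversed, followed by in-addr.arpa. A small sketch of the transformation, assuming bash (the function name is hypothetical):

```shell
#!/usr/bin/env bash
# Build the in-addr.arpa name that a reverse lookup queries for an IPv4 address.
reverse_name() {
    local a b c d
    IFS=. read -r a b c d <<< "$1"
    echo "$d.$c.$b.$a.in-addr.arpa"
}

reverse_name 192.168.1.20   # prints 20.1.168.192.in-addr.arpa
```

With bind-utils installed, dig -x 192.168.1.20 builds this name for you and performs the query.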

Finally, test resolving hostnames outside of your network:

$ nslookup fedoramagazine.org
Server:       127.0.0.1
Address:    127.0.0.1#53
Non-authoritative answer:
Name:    fedoramagazine.org
Address: 35.196.109.67

Test DHCP leases

To test DHCP leases, you need to boot a machine which uses DHCP to obtain an IP address. Any Fedora variant will do that by default. Once you have booted the client machine, check that it has an address and that it corresponds to the lease file for dnsmasq.

From the machine running dnsmasq:

$ sudo cat /var/lib/dnsmasq/dnsmasq.leases
1598023942 52:54:00:8e:d5:db 192.168.1.100 server3 01:52:54:00:8e:d5:db
1598019169 52:54:00:9c:5a:bb 192.168.1.101 server4 01:52:54:00:9c:5a:bb
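
The first column of each lease is its expiry time in seconds since the epoch, which is hard to read at a glance. Below is a minimal sketch that prints the leases in a friendlier form. It assumes bash, the column order shown above (expiry, MAC address, IP address, host name, client ID), and GNU date for the time conversion; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Print "IP hostname MAC expires <time>" for each lease in a dnsmasq lease file.
show_leases() {
    local expiry mac ip host rest when
    while read -r expiry mac ip host rest; do
        # Skip anything that does not start with a numeric expiry time.
        case $expiry in ''|*[!0-9]*) continue ;; esac
        # GNU date converts the epoch value; otherwise show it unchanged.
        when=$(date -u -d "@$expiry" '+%Y-%m-%d %H:%M UTC' 2>/dev/null) || when=$expiry
        printf '%s %s %s expires %s\n' "$ip" "$host" "$mac" "$when"
    done < "$1"
}

# Usage (run as a user that can read the lease file):
#   show_leases /var/lib/dnsmasq/dnsmasq.leases
```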

Extending functionality

You can give a host a fixed IP address via DHCP by adding it to your hosts file with the address you want (within your DHCP range). Then add the following line to the dnsmasq.conf file, which assigns the address listed in the hosts file to any host that has that name:

dhcp-host=myhost

Alternatively, you can specify a MAC address which should always be given a fixed IP address:

dhcp-host=11:22:33:44:55:66,192.168.1.123

You can specify a PXE boot server if you need to automate machine builds:

tftp-root=/tftpboot
dhcp-boot=/tftpboot/pxelinux.0,boothost,192.168.1.240

This should point to the actual address of your TFTP server.

If you need to specify SRV or TXT records, for example for LDAP, Kerberos or similar, you can add these:

srv-host=_ldap._tcp.mydomain.org,ldap-server.mydomain.org,389
srv-host=_kerberos._udp.mydomain.org,krb-server.mydomain.org,88
srv-host=_kerberos._tcp.mydomain.org,krb-server.mydomain.org,88
srv-host=_kerberos-master._udp.mydomain.org,krb-server.mydomain.org,88
srv-host=_kerberos-adm._tcp.mydomain.org,krb-server.mydomain.org,749
srv-host=_kpasswd._udp.mydomain.org,krb-server.mydomain.org,464
txt-record=_kerberos.mydomain.org,KRB-SERVER.MYDOMAIN.ORG

There are many other options in dnsmasq. The comments in the original config file describe most of them. For full details, read the man page, either locally or online.

Installing and running Vagrant using qemu-kvm

Vagrant is a brilliant tool, used by DevOps professionals, coders, sysadmins and regular geeks to stand up repeatable infrastructure for development and testing. From their website:

Vagrant is a tool for building and managing virtual machine environments in a single workflow. With an easy-to-use workflow and focus on automation, Vagrant lowers development environment setup time, increases production parity, and makes the “works on my machine” excuse a relic of the past.

If you are already familiar with the basics of Vagrant, the documentation provides a better reference build for all available features and internals.

Vagrant provides easy to configure, reproducible, and portable work environments built on top of industry-standard technology and controlled by a single consistent workflow to help maximize the productivity and flexibility of you and your team.

https://www.vagrantup.com/intro

This guide will walk through the steps necessary to get Vagrant working on a Fedora-based machine.

I started with a minimal install of Fedora Server as this reduces the memory footprint of the host OS, but if you already have a working Fedora machine, either Server or Workstation, then this should still work.

Check the machine supports virtualisation:

$ sudo lscpu | grep Virtualization
Virtualization:                  VT-x
Virtualization type:             full

Install qemu-kvm:

sudo dnf install qemu-kvm libvirt libguestfs-tools virt-install rsync

Enable and start the libvirt daemon:

sudo systemctl enable --now libvirtd

Install Vagrant:

sudo dnf install vagrant

Install the Vagrant libvirtd plugin:

sudo vagrant plugin install vagrant-libvirt

Add a box:

vagrant box add fedora/32-cloud-base --provider=libvirt

Create a minimal Vagrantfile to test:

$ mkdir vagrant-test
$ cd vagrant-test
$ vi Vagrantfile

Vagrant.configure("2") do |config|
  config.vm.box = "fedora/32-cloud-base"
end

Note the capitalisation of the file name and in the file itself.

Check the file:

vagrant status

Current machine states:

default                   not created (libvirt)

The Libvirt domain is not created. Run 'vagrant up' to create it.

Start the box:

vagrant up

Connect to your new machine:

vagrant ssh

That’s it – you now have Vagrant working on your Fedora machine.

To stop the machine, use vagrant halt. This simply halts the machine but leaves the VM and disk in place.
To shut it down and delete it use vagrant destroy. This will remove the whole machine and any changes you’ve made in it.

Next steps

You don’t need to download boxes before issuing the vagrant up command. You can specify the box and the provider in the Vagrantfile directly, and Vagrant will download the box if it’s not already there. Below is an example which also sets the amount of memory and the number of CPUs:

# -*- mode: ruby -*-
# vi: set ft=ruby :

Vagrant.configure("2") do |config|
  config.vm.box = "fedora/32-cloud-base"
  config.vm.provider :libvirt do |libvirt|
    libvirt.cpus = 1
    libvirt.memory = 1024
  end
end
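
If you create test machines often, you may prefer to generate this file rather than type it each time. The following is a small sketch of one way to do that from the shell, assuming bash; the function name is hypothetical, and the defaults simply mirror the example above.

```shell
#!/usr/bin/env bash
# Write a Vagrantfile for the libvirt provider to standard output.
# Arguments: box name, number of CPUs, memory in MB (all optional).
make_vagrantfile() {
    local box=${1:-fedora/32-cloud-base} cpus=${2:-1} memory=${3:-1024}
    cat <<EOF
Vagrant.configure("2") do |config|
  config.vm.box = "$box"
  config.vm.provider :libvirt do |libvirt|
    libvirt.cpus = $cpus
    libvirt.memory = $memory
  end
end
EOF
}

# Usage: a machine with 2 CPUs and 2 GB of memory
#   make_vagrantfile fedora/32-cloud-base 2 2048 > Vagrantfile
```

Running vagrant validate in the same directory afterwards checks the generated file before you start the machine.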

For more information on using Vagrant, creating your own machines and using different boxes, see the official documentation at https://www.vagrantup.com/docs

There is a huge repository of boxes ready to download and use, and the official location for these is Vagrant Cloud – https://app.vagrantup.com/boxes/search. Some are basic operating systems and some offer complete functionality such as databases, web servers etc.

Incremental backups with Btrfs snapshots

Snapshots are an interesting feature of Btrfs. A snapshot is a copy of a subvolume. Taking a snapshot is immediate. However, taking a snapshot is not like performing an rsync or a cp, and a snapshot doesn’t occupy space as soon as it is created.

Editor’s note: From the BTRFS Wiki – A snapshot is simply a subvolume that shares its data (and metadata) with some other subvolume, using Btrfs’s COW capabilities.

Occupied space will increase alongside the data changes in the original subvolume or in the snapshot itself, if it is writeable. Files that are modified or deleted in the subvolume after the snapshot was taken still reside in the snapshot as they were. This is a convenient way to perform backups.

Using snapshots for backups

A snapshot resides on the same disk where the subvolume is located. You can browse it like a regular directory and recover a copy of a file as it was when the snapshot was taken. However, keeping snapshots only on the same disk as the snapshotted subvolume is not an ideal backup strategy: if the hard disk breaks, the snapshots are lost as well. An interesting feature of snapshots is the ability to send them to another location. A snapshot can be sent to an external hard drive or to a remote system via SSH (the destination file system needs to be formatted as Btrfs as well). To do this, use the commands btrfs send and btrfs receive.

Taking a snapshot

In order to use the send and receive commands, it is important to create the snapshot as read-only, because snapshots are writeable by default.

The following command will take a snapshot of the /home subvolume. Note the -r flag for readonly.

sudo btrfs subvolume snapshot -r /home /.snapshots/home-day1

Instead of day1, the snapshot name can be the current date, like home-$(date +%Y%m%d). Snapshots look like regular subdirectories. You can place them wherever you like. The directory /.snapshots could be a good choice to keep them neat and to avoid confusion.

Editor’s note: Snapshots will not take recursive snapshots of themselves. If you create a snapshot of a subvolume, every subvolume or snapshot that the subvolume contains is mapped to an empty directory of the same name inside the snapshot.

Backup using btrfs send

In this example the destination Btrfs volume in the USB drive is mounted as /run/media/user/mydisk/bk . The command to send the snapshot to the destination is:

sudo btrfs send /.snapshots/home-day1 | sudo btrfs receive /run/media/user/mydisk/bk

This is called initial bootstrapping, and it corresponds to a full backup. This task will take some time, depending on the size of the /home directory. Obviously, subsequent incremental sends will take a shorter time.

Incremental backup

Another useful feature of snapshots is the ability to perform the send task in an incremental way. Let’s take another snapshot.

sudo btrfs subvolume snapshot -r /home /.snapshots/home-day2

In order to perform the send task incrementally, you need to specify the previous snapshot as a base and this snapshot has to exist in the source and in the destination. Please note the -p option.

sudo btrfs send -p /.snapshots/home-day1 /.snapshots/home-day2 | sudo btrfs receive /run/media/user/mydisk/bk

And again (the day after):

sudo btrfs subvolume snapshot -r /home /.snapshots/home-day3
sudo btrfs send -p /.snapshots/home-day2 /.snapshots/home-day3 | sudo btrfs receive /run/media/user/mydisk/bk

Cleanup

Once the operation is complete, you can keep the snapshot. But if you perform these operations on a daily basis, you could end up with a lot of them. This could lead to confusion and potentially a lot of used space on your disks. So it is good advice to delete snapshots that you don’t need anymore.

Keep in mind that in order to perform an incremental send you need at least the last snapshot. This snapshot must be present in the source and in the destination.

sudo btrfs subvolume delete /.snapshots/home-day1
sudo btrfs subvolume delete /.snapshots/home-day2
sudo btrfs subvolume delete /run/media/user/mydisk/bk/home-day1
sudo btrfs subvolume delete /run/media/user/mydisk/bk/home-day2

Note: the day 3 snapshot was preserved in the source and in the destination. In this way, tomorrow (day 4), you can perform a new incremental btrfs send.
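
The daily routine can be sketched as a small script. This is an illustration under stated assumptions rather than a finished backup tool: it assumes bash, date-based snapshot names of the form home-YYYYMMDD (so that sorting by name is sorting by date), and the paths used in this article. The function names are hypothetical. The script only prints the commands, so you can review them before running anything.

```shell
#!/usr/bin/env bash
SRC=${SRC:-/.snapshots}                   # where the local snapshots live
DEST=${DEST:-/run/media/user/mydisk/bk}   # mounted Btrfs backup volume

# Print the name of the newest home-* snapshot in directory $1,
# ignoring the name given in $2. Prints an empty line if there is none.
latest_snapshot() {
    local dir=$1 skip=${2:-} newest="" entry
    for entry in "$dir"/home-*; do
        [ -d "$entry" ] || continue
        [ "${entry##*/}" = "$skip" ] && continue
        newest=${entry##*/}      # glob results are sorted, so the last one wins
    done
    echo "$newest"
}

# Print the commands for today's backup: incremental when the newest
# earlier snapshot also exists at the destination, otherwise a full send.
backup_commands() {
    local today parent
    today="home-$(date +%Y%m%d)"
    parent=$(latest_snapshot "$SRC" "$today")
    echo "btrfs subvolume snapshot -r /home $SRC/$today"
    if [ -n "$parent" ] && [ -d "$DEST/$parent" ]; then
        echo "btrfs send -p $SRC/$parent $SRC/$today | btrfs receive $DEST"
    else
        echo "btrfs send $SRC/$today | btrfs receive $DEST"
    fi
}

backup_commands
```

Once the printed commands look right, one way to run them is to pipe the output to a root shell, for example backup_commands | sudo sh.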

As some final advice, if the USB drive has a bunch of space, you could consider maintaining multiple snapshots in the destination, while in the source disk you would keep only the last one.
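
A retention rule like that can be sketched as follows. The example assumes bash and date-based snapshot names of the form home-YYYYMMDD, so that sorting by name is sorting by age; the function name and the number of snapshots to keep are hypothetical choices.

```shell
#!/usr/bin/env bash
# List every home-* snapshot in directory $1 except the newest $2, oldest first.
snapshots_to_prune() {
    local dir=$1 keep=$2 entry count
    local -a all=()
    for entry in "$dir"/home-*; do
        [ -d "$entry" ] && all+=("$entry")
    done
    count=$(( ${#all[@]} - keep ))
    if [ "$count" -gt 0 ]; then
        printf '%s\n' "${all[@]:0:count}"
    fi
}

# Usage: review the list first, then delete the listed snapshots.
#   snapshots_to_prune /run/media/user/mydisk/bk 7
#   snapshots_to_prune /run/media/user/mydisk/bk 7 | xargs -r -n1 sudo btrfs subvolume delete
```

On the source disk, keep at least 1 so that the snapshot needed as the parent of the next incremental send is never removed.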

Btrfs Coming to Fedora 33

by Chris Murphy and Langdon White


User data is the most important thing on a computer. Whether it’s source code for the next big release, family pictures, a music library, or anything else, you want it to be safe. Changing the default file system is not a change to make casually. The Fedora Project is changing the default file system for desktop variants (Fedora Workstation, Fedora KDE, etc), for the first time since Fedora 11. Btrfs will replace ext4 as the default filesystem in Fedora 33.

What does this mean for me?

Btrfs is a stable and mature file system with modern features: data integrity, optimizations for SSDs, compression, cheap writable snapshots, multiple device support, and more.

The switch to Btrfs will use a single-partition disk layout, and Btrfs’ built-in volume management. The previous default layout placed constraints on disk usage that can be a difficult adjustment for novice users. Btrfs solves this problem by avoiding it.

As a techie, you may have heard of bit rot, and memory bit flips. Data can be corrupted by a multitude of physical factors, even cosmic rays from the sun! Before an SSD fails outright, often it will return either zeros or garbage, instead of your data. Btrfs safeguards your data with checksums, and performs verification on every read. Corrupt data is never given to your programs, and it won’t replicate into your backups to be discovered another day (or year).

Btrfs uses a “copy-on-write” model: your data and the file system itself are never overwritten. This enhances crash-safeness. When copying a file, Btrfs does not write new data until you actually change the old data, saving space.
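
The space-saving copy described above is known as a reflink copy. Here is a minimal sketch of requesting one explicitly, assuming GNU cp; the function and file names are hypothetical. With --reflink=auto, cp shares the data blocks where the file system supports it and quietly falls back to an ordinary copy elsewhere, so the example runs on any file system.

```shell
#!/usr/bin/env bash
# Copy a file, sharing data blocks with the original where the file system allows.
cow_copy() {
    cp --reflink=auto -- "$1" "$2"
}

workdir=$(mktemp -d)
printf 'family pictures\n' > "$workdir/original"
cow_copy "$workdir/original" "$workdir/copy"

# The copy has the same contents either way; only the space used differs.
cmp "$workdir/original" "$workdir/copy" && echo "contents identical"
rm -rf "$workdir"
```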

In fact, users will save more space when using Btrfs’ transparent compression. Compressing data reduces total writes, saves space, and extends flash drive life. In many cases, it can also improve performance. Compression can be enabled on an entire file system, or per subvolume, directory, and even per file. You will be able to opt-in to using compression in Fedora 33. And it’s one of the features we’re looking forward to taking advantage of by default in future Fedora releases.

Trusted

Facebook uses Btrfs on millions of machines in production. They compare its stability to ext4 and XFS (another file system available in Fedora). In fact, they use Btrfs to “improve” the quality of the consumer storage hardware that they use in production. Btrfs detects problems before the hardware fails.

(open)SUSE have been using Btrfs for many years now, including SUSE Linux Enterprise Server (SLES). You can’t imagine a company that provides support to customers shipping software that they don’t completely trust.

What’s next?

The Change is code complete, and has been testable in Rawhide as the default file system since early July. Btrfs has been explicitly supported in Fedora since 2012. This is expected to be a transparent change for most users, however it is still significant. Fedora will ensure we deliver the dependable and reliable experience Fedora users have come to expect.

Special thanks to: Ben Cotton, Michael Catanzaro, and the Fedora Workstation Working Group for contributing to this article.

Tune up your sound with PulseEffects: Speakers

Audio components for your computer don’t always produce the quality of sound you want. For instance your laptop speakers may be a bit “tinny” sounding, or a set of speakers for your desktop may be too boomy for your room. Or if you use a desk or headset microphone, you may find that recordings you make are not as high quality as you’d like. Enter PulseEffects!

PulseAudio and GStreamer

The PulseAudio sound server comes with Fedora Workstation by default. It’s highly flexible and easily modified. PulseAudio can deal with many different inputs and outputs. For instance, it lets you switch between different inputs known as sources (such as microphones, or sound files) or outputs known as sinks (such as speakers or headphones).

By default, PulseAudio manages sound as streams, digitally sampled at a specific rate and bit depth with a defined number of channels — two for most stereo streams. It handles different sample rates for you, so you don’t have to know the details of the stream. PulseAudio simply deals with moving sound from one point to another.

The GStreamer multimedia framework, on the other hand, provides myriad ways to modify audio and video data on their way through a pipeline. GStreamer comes with plugins that allow it to attach to PulseAudio. This means that you can use GStreamer to make a pipeline between your inputs and outputs to change audio streams.

PulseEffects manages this process with a nice, graphical front end. It lets you select and order different effects for your sound.

Installing PulseEffects

To install PulseEffects, use the Software tool and type pulseeffects to find the package. Fedora carries this software in its official repositories. So if you need to, you can switch the source from the Flatpak version to the Fedora one. Then click Install.

If you’re using a command line, you can use the sudo command with dnf to do the same thing:

$ sudo dnf install pulseeffects

Start playing some audio. This can come from your Videos or Rhythmbox media player, a website such as YouTube, a music streaming app such as Spotify, or something else. The best source to use is a full-fidelity digital audio source, like a CD or a FLAC file. (MP3 and online digital streams cut out some frequency information to reduce data size.) Ideally, it should be music that you are used to listening to in many places, and know well, like an album or playlist. Put it on repeat so you can use it while you tune your sound.

Then launch the program using either the Software tool’s Launch control, or your desktop’s application launcher. On Fedora Workstation, go to the Activities hotspot, use the Show Applications control to locate PulseEffects in the list, and click to launch. You should see your application in the list of active sound streams, with color bars that show an average frequency response:

PulseEffects initial screen with one sound stream from Videos

Notice all the sound modules available on the left. None are running by default when you first start PulseEffects. Any enabled modules are applied to your sound in order from top to bottom. You can use the up/down controls next to the module names to alter the order.

What does “better” mean?

Before we get started, realize that what constitutes “better” will usually be different based on many factors:

  • Specific hardware like the model of speakers or microphone
  • The environment the sound device is in (your room)
  • What your ears prefer

There is no magic cure for bad sound that works universally everywhere. So the examples you’ll see are based on some common problems. But you will need to use your ears to determine what’s best for your hardware, in the place you’re using it.

Making desktop speakers sound better

Often desktop speaker sets consist of a subwoofer and small satellite speakers. These tend to be both excessively “boomy,” meaning too much very low frequency sound, and “honky” or “boxy,” meaning too much of some middle (or “mid”) frequencies. To fix frequencies that are over- or under-represented, we can use an equalizer.

By default, all the effects are off in PulseEffects. Since you want to modify a sound output — your speakers — make sure the “speaker” icon at the upper left is selected. Locate the Equalizer control in PulseEffects and select it. Select the toggle switch at the top of the equalizer controls to turn it on.

The default equalizer appears as a 30-band graphic EQ. Older readers might be familiar with seeing physical equipment like this. Each band alters not just that specific frequency, but a fairly narrow band of frequencies around it. Think of it like a “dip” or “bump” in the frequency graph, depending on whether you lower or raise the slider.

PulseEffects default EQ, with a roll-off of extreme low frequencies and some reduction of unpleasant “boxiness” around 450Hz

If you’re not sure how to alter frequencies to get better sound, click the “tools” icon under the on/off toggle. Under Presets you can select different EQ settings to find something closest to what you like. Then you can modify those settings as you like. If things get out of control, under Settings use the Flat response control to zero out all the EQ.

Using the “tools” icon, you can also choose a different number of bands to simplify your choices. Using the “gear” icon above each band, you can choose different types of EQ, as well as the width and Q. Additional filter types like a low/high pass or low/high shelf are also available. Feel free to play with the EQ to see how it works, but be careful not to increase EQ levels too high if your speakers are above a moderate volume, because you can damage speakers that way.

Tips from a mix engineer

These guidelines may help you find the optimal sound for your situation.

  • It’s almost always better to reduce a problem frequency than to boost other things. If you boost too much, your music can start to distort.
  • To fix excessive boominess, apply a high pass filter somewhere between 30 and 50 Hz. You may also want to try a bell EQ reduction somewhere between 40 and 100 Hz.
  • If you want to fix a boxy sound (reminds you of a cardboard box), try a bell EQ to reduce some frequencies between 300 and 500 Hz.
  • To fix a honky or nasal sound, try reducing some frequencies between 650 and 900 Hz.
  • If guitar/keyboard solos or vocals seem a bit muffled, try a gentle boost centered somewhere between 1 and 2 kHz to make them a little more present.
  • If your speakers sound overly tinny, apply a high shelf reduction starting somewhere between 4 and 8 kHz — start at a high frequency and dial back to where it’s helpful. To fix a dull sound, apply a high shelf boost using the same approach.

Remember that a little EQ goes a long way. Try keeping your bell boosts or cuts between +4 and -4 on the sliders. The goal is not to make the music sound extreme, but to make slight corrections. Otherwise your ears will get tired more quickly, or in extreme cases you may even get headaches.

Watch the Input and Output meters at the bottom of every module. If you see a lot of green on one or both, the sound module is overloading at that stage. You’ll often find MP3 files, especially of modern music, have this issue. You may also see “warning” icons flashing over the check mark on enabled modules.

One way to cure this is to use the Limiter module at the beginning of the chain: turn the input gain on the chain down by about 3dB and leave the limit at 0dB. This simply lowers the overall signal level without applying any actual limiting. Then you can run other modules without worrying quite as much about distortion or overload in later stages.

Making laptop speakers sound better

While the above guidelines might be good for bigger speakers, laptop speakers have the additional burden of being very small. Typically they lack bass response, because larger speakers, more of them, and more powerful magnets are needed to produce those frequencies well.

However, you can correct this using the Bass Enhancer module in PulseEffects. You may want to move this module downward in the stack after your EQ for best results. Rather than turning the amount up excessively, try a modest change of +3 or +4dB, and then move the Scope frequency around until you find where you start to notice good results. Don’t be tempted to amplify too much because again, if it’s too high you could start to damage your laptop speakers over time.

Storing your work

First, set PulseEffects to run whenever you log in. Use the “hamburger” tool at the top right to open up the General settings. Set Start Service at Login to enabled, and also enable the option to Process All Outputs. This does not mean all devices will get the same settings. Instead, it means that PulseEffects will run a chain for any sound output device you have connected. You can apply different chains to different devices.

Next, select the Presets button, and in the text box, type a name for your preset. One recommendation is to use the name of the device for which you’ve created a chain. Then click the “+” icon to add the preset. If you make changes, you can either use the “save” icon to save the changes to the selected preset, or click Apply to throw them away and re-apply the saved preset.

Finally, you can click the “cycle” icon if you want the preset to be applied every time the currently used sound output is detected. This is almost always a good idea. If you want to set up different presets for other outputs, first connect the output. Then make a new preset as described above, and select that to be auto-applied.

One final note: When you close the PulseEffects application, your active chain of effects does not stop. It will stay running unless you reset or stop the service. PulseEffects will consume a few percent of CPU time (depending on processor speed). On all but the oldest systems the load should not be noticeable. However, if you are sensitive to power use such as on a laptop, you may want to stop the service using this command:

$ pulseeffects -q

Conclusion

Remember that every environment and person’s hearing is different, so beware of the overly dogmatic. Finally, you can’t make terrible speakers into great ones. But you can usually make them sound not so terrible — and if you have decent speakers, you usually can make them sound quite good!

The PulseEffects author also has both a LiberaPay donation site and a Patreon account, so if you find the software useful, you might want to consider contributing.

In the next installment, you’ll learn how to set up better sound on a desktop or headset microphone, to improve your teleconference meetings or make better audio or video spoken content. Until then, enjoy your new sound possibilities.


Photo by Paul Esch-Laurent on Unsplash.


TCP window scaling, timestamps and SACK

The Linux TCP stack has a myriad of sysctl knobs that let you change its behavior. These include the amount of memory that can be used for receive or transmit operations, the maximum number of sockets, and optional features and protocol extensions.

Multiple articles recommend disabling TCP extensions, such as timestamps or selective acknowledgments (SACK), for various “performance tuning” or “security” reasons.

This article provides background on what these extensions do, why they are enabled by default, how they relate to one another and why it is normally a bad idea to turn them off.

TCP Window scaling

The data transmission rate that TCP can sustain is limited by several factors. Some of these are:

  • Round trip time (RTT). This is the time it takes for a packet to get to the destination and a reply to come back. Lower is better.
  • The lowest link speed of the network paths involved.
  • The frequency of packet loss.
  • The speed at which new data can be made available for transmission.
    For example, the CPU needs to be able to pass data to the network adapter fast enough. If the CPU needs to encrypt the data first, the adapter might have to wait for new data. In similar fashion, disk storage can be a bottleneck if it can’t read the data fast enough.
  • The maximum possible size of the TCP receive window. The receive window determines how much data (in bytes) TCP can transmit before it has to wait for the receiver to report reception of that data. This is announced by the receiver. The receiver will constantly update this value as it reads and acknowledges reception of the incoming data. The receive window’s current value is contained in the TCP header that is part of every segment sent by TCP. The sender is thus aware of the current receive window whenever it receives an acknowledgment from the peer. This means that the higher the round-trip time, the longer it takes for the sender to get receive window updates.

Without extensions, TCP is limited to at most 64 kilobytes of unacknowledged (in-flight) data. This is not even close to what is needed to sustain a decent data rate in most networking scenarios. Let us look at some examples.

Theoretical data rate

With a round-trip-time of 100 milliseconds, TCP can transfer at most 640 kilobytes per second. With a 1 second delay, the maximum theoretical data rate drops down to only 64 kilobytes per second.
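The two figures above follow from dividing the window size by the round-trip time. Here is a minimal sketch of that arithmetic; the function name is made up for illustration, and the window uses the article's round figure of 64 kilobytes rather than the exact 16-bit maximum.

```python
def max_throughput(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound in bytes/s when at most window_bytes may be in flight."""
    if rtt_seconds <= 0:
        raise ValueError("rtt_seconds must be positive")
    # One full window can be delivered per round trip, and no more.
    return window_bytes / rtt_seconds


CLASSIC_WINDOW = 64 * 1000  # round figure used in the text above

for rtt in (0.1, 1.0):
    rate = max_throughput(CLASSIC_WINDOW, rtt)
    print(f"RTT {rtt * 1000:.0f} ms: at most {rate / 1000:.0f} kilobytes/s")
```

Link bandwidth does not appear in the calculation at all, which is the point: the window and the delay alone cap the rate.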

This is because of the receive window. Once 64 kilobytes of data have been sent, the receive window is already full. The sender must wait until the peer informs it that at least some of the data has been read by the application.

The first segment sent reduces the TCP window by the size of that segment. It takes one round-trip before an update of the receive window value will become available. When updates arrive with a 1 second delay, this results in a 64 kilobyte limit even if the link has plenty of bandwidth available.

In order to fully utilize a fast network with several milliseconds of delay, a window size larger than what classic TCP supports is a must. The ’64 kilobyte limit’ is an artifact of the protocol’s specification: The TCP header reserves only 16 bits for the receive window size. This allows receive windows of up to 64 kilobytes. When the TCP protocol was originally designed, this size was not seen as a limit.

Unfortunately, it’s not possible to just change the TCP header to support a larger maximum window value. Doing so would mean all implementations of TCP would have to be updated simultaneously or they wouldn’t understand one another anymore. To solve this, the interpretation of the receive window value is changed instead.

The ‘window scaling option’ makes this possible while staying compatible with existing implementations.

TCP Options: Backwards-compatible protocol extensions

TCP supports optional extensions. This makes it possible to enhance the protocol with new features without the need to update all implementations at once. When a TCP initiator connects to the peer, it also sends a list of supported extensions. All extensions follow the same format: a unique option number followed by the length of the option and the option data itself.

The TCP responder checks all the option numbers contained in the connection request. If it does not understand an option number, it skips ‘length’ bytes of data and checks the next option number. The responder omits those it did not understand from the reply. This allows both the sender and receiver to learn the common set of supported options.
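The skip-by-length walk described above can be sketched in a few lines. This is an illustration, not kernel code: the function names are hypothetical, and it relies on two details the text does not spell out, namely that option kinds 0 (end of list) and 1 (padding) are single bytes without a length field, and that the length counts the kind and length bytes themselves. Kind 2 is the maximum segment size option, and kind 254 merely stands in for an option the responder does not know.

```python
from typing import List, Set, Tuple

END_OF_OPTIONS = 0  # single byte, terminates the option list
NO_OPERATION = 1    # single byte, used for padding


def parse_tcp_options(raw: bytes) -> List[Tuple[int, bytes]]:
    """Split a TCP option area into (kind, data) pairs."""
    options = []
    i = 0
    while i < len(raw):
        kind = raw[i]
        if kind == END_OF_OPTIONS:
            break
        if kind == NO_OPERATION:
            i += 1
            continue
        if i + 1 >= len(raw):
            raise ValueError("truncated option")
        length = raw[i + 1]  # counts the kind and length bytes too
        if length < 2 or i + length > len(raw):
            raise ValueError("bad option length")
        options.append((kind, raw[i + 2:i + length]))
        i += length  # this is the skip that makes unknown kinds harmless
    return options


def common_options(offered, understood: Set[int]):
    """Keep only the offered options whose kind the responder understands."""
    return [(kind, data) for kind, data in offered if kind in understood]


# MSS=1460, padding, window scale shift 7, SACK permitted, unknown kind 254
request = bytes([2, 4, 0x05, 0xB4, 1, 3, 3, 7, 4, 2, 254, 4, 0xAA, 0xBB])
offered = parse_tcp_options(request)
print(common_options(offered, understood={2, 3, 4, 8}))
```

Because every option announces its own length, the unknown kind 254 is stepped over without the parser needing to know what it means.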

With window scaling, the option data always consists of a single number.

The window scaling option

 
Window Scale option (WSopt): Kind: 3, Length: 3
    +---------+---------+---------+
    | Kind=3  |Length=3 |shift.cnt|
    +---------+---------+---------+
         1         1         1

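The window scaling option tells the peer that the receive window value found in the TCP header should be shifted left by the given number of bits (shift.cnt) to get the real size. In other words, a scaling factor of n multiplies the header value by 2 to the power of n.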

For example, a TCP initiator that announces a window scaling factor of 7 instructs the responder that any future packet carrying a receive window value of 512 really announces a window of 65536 bytes. This is an increase by a factor of 128. This would allow a maximum TCP window of 8 megabytes.

A TCP responder that does not understand this option ignores it. The TCP packet sent in reply to the connection request (the syn-ack) then does not contain the window scale option. In this case both sides can only use a 64k window size. Fortunately, almost every TCP stack supports and enables this option by default, including Linux.

The responder includes its own desired scaling factor. The two peers can use different numbers. It’s also legitimate to announce a scaling factor of 0. This means the peer should treat the receive window value it receives verbatim, but it still allows scaled values in the reply direction, so the recipient can then use a larger receive window.

Unlike SACK or TCP timestamps, the window scaling option only appears in the first two packets of a TCP connection and cannot be changed afterwards. It is also not possible to determine the scaling factor by looking at a packet capture of a connection that does not contain the initial three-way handshake.

The largest supported scaling factor is 14. This allows TCP window sizes of up to one gigabyte.
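The numbers in this section can be checked with a short sketch. It assumes the scaling factor is applied as a left shift of the 16-bit header value; the function name is made up for illustration.

```python
MAX_SHIFT = 14          # largest supported scaling factor
MAX_HEADER_VALUE = 0xFFFF  # the header field is 16 bits wide


def real_window(header_value: int, shift: int) -> int:
    """Receive window in bytes for a header value and a scaling factor."""
    if not 0 <= header_value <= MAX_HEADER_VALUE:
        raise ValueError("header value must fit in 16 bits")
    if not 0 <= shift <= MAX_SHIFT:
        raise ValueError("scaling factor must be between 0 and 14")
    return header_value << shift


print(real_window(512, 7))                # the example from the text
print(real_window(MAX_HEADER_VALUE, 7))   # just under 8 megabytes
print(real_window(MAX_HEADER_VALUE, 14))  # just under one gigabyte
```

A scaling factor of 0 leaves the header value untouched, which matches the verbatim interpretation described above.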

Window scaling downsides

Window scaling can cause data corruption in very special cases. Before you rush to disable the option: under normal circumstances this is impossible, and there is also a solution in place that prevents it. Unfortunately, some people disable this solution without realizing the relationship with window scaling. First, let’s have a look at the actual problem that needs to be addressed. Imagine the following sequence of events:

  1. The sender transmits segments: s_1, s_2, s_3, … s_n
  2. The receiver sees: s_1, s_3, … s_n and sends an acknowledgment for s_1.
  3. The sender considers s_2 lost and sends it a second time. It also sends new data contained in segment s_n+1.
  4. The receiver then sees: s_2, s_n+1, s_2: the packet s_2 is received twice.

This can happen for example when a sender triggers re-transmission too early. Such erroneous re-transmits are never a problem in normal cases, even with window scaling. The receiver will just discard the duplicate.

Old data to new data

The TCP sequence number is a 32-bit counter of bytes, so it can cover at most 4 gigabytes of data. Beyond that, the sequence wraps back to 0 and then increases again. This is not a problem in itself, but if it occurs fast enough, the above scenario can create an ambiguity.

If a wrap-around occurs at the right moment, the sequence number s_2 (the re-transmitted packet) can already be larger than s_n+1. Thus, in the last step (4), the receiver may interpret this as: s_2, s_n+1, s_n+m, i.e. it could view the ‘old’ packet s_2 as containing new data.

Normally, this won’t happen because a ‘wrap around’ occurs only every couple of seconds or minutes even on high bandwidth links. The interval between the original and an unneeded re-transmit will be a lot smaller.

For example, with a transmit speed of 50 megabytes per second, a duplicate needs to arrive more than one minute late for this to become a problem. The sequence numbers do not wrap fast enough for small delays to induce this problem.
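The “more than one minute” figure comes from dividing the size of the sequence space by the transmit rate. A minimal sketch of that calculation follows; the function name is hypothetical, and the rate is taken as decimal megabytes.

```python
SEQUENCE_SPACE = 2 ** 32  # sequence numbers are 32 bits and count bytes


def seconds_until_wrap(bytes_per_second: float) -> float:
    """Time to send enough data for the sequence number to wrap once."""
    if bytes_per_second <= 0:
        raise ValueError("rate must be positive")
    return SEQUENCE_SPACE / bytes_per_second


# 50 megabytes per second, as in the example above: about 86 seconds
print(f"{seconds_until_wrap(50e6):.0f} seconds")
```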

Once TCP approaches ‘Gigabyte per second’ throughput rates, the sequence numbers can wrap so fast that even a delay by only a few milliseconds can create duplicates that TCP cannot detect anymore. By solving the problem of the too small receive window, TCP can now be used for network speeds that were impossible before – and that creates a new, albeit rare problem. To safely use Gigabytes/s speed in environments with very low RTT, receivers must be able to detect such old duplicates without relying on the sequence number alone.

TCP time stamps

A best-before date

In the simplest terms, TCP timestamps just add a timestamp to the packets to resolve the ambiguity caused by very fast sequence number wrap-around. If a segment appears to contain new data, but its timestamp is older than that of the last in-window packet, then the sequence number has wrapped and the “new” packet is actually an older duplicate. This resolves the ambiguity of re-transmits even for extreme corner cases.
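The “older than” test can be sketched as follows. This is a simplified illustration rather than the kernel's logic: the function name is made up, and it assumes the timestamp is a 32-bit value (as the option layout further down shows) that itself wraps, so the comparison is done modulo 2^32.

```python
TS_MODULUS = 2 ** 32  # timestamp values are 4 bytes wide


def timestamp_is_older(tsval: int, last_in_window_tsval: int) -> bool:
    """True if tsval lies before last_in_window_tsval in wrap-around order."""
    diff = (tsval - last_in_window_tsval) % TS_MODULUS
    # A difference in the upper half of the range is negative when viewed
    # as a signed 32-bit number, meaning tsval is the older of the two.
    return diff >= TS_MODULUS // 2


print(timestamp_is_older(100, 200))          # True: a late duplicate
print(timestamp_is_older(200, 100))          # False: genuinely newer
print(timestamp_is_older(5, 2 ** 32 - 10))   # False: newer, the clock wrapped
```

A receiver following this sketch would drop a segment for which the function returns True, even if its sequence number looks like new data.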

But this extension allows for more than just detection of old packets. The other major feature made possible by TCP timestamps is more precise round-trip time measurement (RTTm).

A need for precise round-trip-time estimation

When both peers support timestamps, every TCP segment carries two additional numbers: a timestamp value and a timestamp echo.

 
TCP Timestamp option (TSopt): Kind: 8, Length: 10
+-------+----+----------------+-----------------+
|Kind=8 | 10 |TS Value (TSval)|EchoReply (TSecr)|
+-------+----+----------------+-----------------+
    1      1         4                4

An accurate RTT estimate is crucial for TCP performance. TCP automatically re-sends data that was not acknowledged. Re-transmission is triggered by a timer: If it expires, TCP considers one or more packets that it has not yet received an acknowledgment for to be lost. They are then sent again.

But “has not been acknowledged” does not mean the segment was lost. It is also possible that the receiver did not send an acknowledgment so far or that the acknowledgment is still in flight. This creates a dilemma: TCP must wait long enough for such slight delays to not matter, but it can’t wait for too long either.

Low versus high network delay

In networks with a high delay, if the timer fires too fast, TCP frequently wastes time and bandwidth with unneeded re-sends.

In networks with a low delay, however, waiting for too long causes reduced throughput when a real packet loss occurs. Therefore, the timer should expire sooner in low-delay networks than in those with a high delay. The TCP retransmit timeout therefore cannot use a fixed constant value. It needs to adapt the value based on the delay that it experiences in the network.

Round-trip time measurement

TCP picks a retransmit timeout that is based on the expected round-trip time (RTT). The RTT is not known in advance. RTT is estimated by measuring the delta between the time a segment is sent and the time TCP receives an acknowledgment for the data carried by that segment.

This is complicated by several factors.

  • For performance reasons, TCP does not generate a new acknowledgment for every packet it receives. It waits for a very small amount of time: If more segments arrive, their reception can be acknowledged with a single ACK packet. This is called “cumulative ACK”.
  • The round-trip-time is not constant. This is because of a myriad of factors. For example, a client might be a mobile phone switching to different base stations as it’s moved around. It’s also possible that packet switching takes longer when link or CPU utilization increases.
  • A packet that had to be re-sent must be ignored during computation. This is because the sender cannot tell if the ACK for the re-transmitted segment is acknowledging the original transmission (that arrived after all) or the re-transmission.

This last point is significant: When TCP is busy recovering from a loss, it may only receive ACKs for re-transmitted segments. It then can’t measure (update) the RTT during this recovery phase. As a consequence it can’t adjust the re-transmission timeout, which then keeps growing exponentially. That’s a pretty specific case (it assumes that other mechanisms such as fast retransmit or SACK did not help). Nevertheless, with TCP timestamps, RTT evaluation is done even in this case.

If the extension is used, the peer reads the timestamp value from the TCP segment’s extension space and stores it locally. It then places this value in all the segments it sends back as the “timestamp echo”.

Therefore the option carries two timestamps: its sender’s own timestamp and the most recent timestamp it received from the peer. The “echo timestamp” is used by the original sender to compute the RTT. It’s the delta between its current timestamp clock and what was reflected in the “timestamp echo”.
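The store-and-echo exchange described above can be modeled with two small objects. The class and method names are hypothetical, clock values are abstract ticks, and the sketch deliberately ignores the finer rules a real stack applies when choosing which timestamp to echo.

```python
from dataclasses import dataclass

TS_MODULUS = 2 ** 32  # TSval and TSecr are 4-byte fields


@dataclass
class TimestampOption:
    tsval: int  # the sender's own timestamp clock
    tsecr: int  # most recent timestamp received from the peer


class Endpoint:
    """Hypothetical endpoint that tracks nothing but timestamp state."""

    def __init__(self) -> None:
        self.last_peer_tsval = 0

    def receive(self, option: TimestampOption) -> None:
        # Store the peer's timestamp so it can be echoed back later.
        self.last_peer_tsval = option.tsval

    def build_option(self, clock: int) -> TimestampOption:
        return TimestampOption(tsval=clock % TS_MODULUS,
                               tsecr=self.last_peer_tsval)

    def rtt_from(self, option: TimestampOption, clock: int) -> int:
        # Delta between the current clock and the echoed value, in ticks.
        return (clock - option.tsecr) % TS_MODULUS


client, server = Endpoint(), Endpoint()
server.receive(client.build_option(clock=1000))  # client sends at tick 1000
reply = server.build_option(clock=777_000)       # server has its own clock
print(client.rtt_from(reply, clock=1050))        # prints 50
```

The two clocks never need to agree: the client only ever subtracts a value it produced itself, which is why the server's unrelated clock base does not matter.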

Other timestamp uses

TCP timestamps have other uses beyond the detection of old duplicates (known as PAWS, Protection Against Wrapped Sequences) and RTT measurements. For example, it becomes possible to detect if a retransmission was unnecessary. If the acknowledgment carries an older timestamp echo, the acknowledgment was for the initial packet, not the re-transmitted one.

Another, more obscure use case for TCP timestamps is related to the TCP syn cookie feature.

TCP connection establishment on server side

When connection requests arrive faster than a server application can accept the new incoming connection, the connection backlog will eventually reach its limit. This can occur because of a mis-configuration of the system or a bug in the application. It also happens when one or more clients send connection requests without reacting to the ‘syn ack’ response. This fills the connection queue with incomplete connections. It takes several seconds for these entries to time out. This is called a “syn flood attack”.

TCP timestamps and TCP syn cookies

Some TCP stacks can accept new connections even if the queue is full. When this happens, the Linux kernel will print a prominent message to the system log:

Possible SYN flooding on port P. Sending Cookies. Check SNMP counters.

This mechanism bypasses the connection queue entirely. The information that is normally stored in the connection queue is encoded into the TCP sequence number of the SYN/ACK response. When the ACK comes back, the queue entry can be rebuilt from the sequence number.

The sequence number only has limited space to store information. Connections established using the ‘TCP syn cookie’ mechanism cannot support TCP options for this reason.

The TCP options that are common to both peers can be stored in the timestamp, however. The ACK packet reflects the value back in the timestamp echo field, which makes it possible to recover the agreed-upon TCP options as well. Otherwise, cookie connections are restricted to the standard 64 kbyte receive window.

Common myths – timestamps are bad for performance

Unfortunately some guides recommend disabling TCP timestamps to reduce the number of times the kernel needs to access the timestamp clock to get the current time. This is not correct. As explained before, RTT estimation is a necessary part of TCP. For this reason, the kernel always takes a microsecond-resolution time stamp when a packet is received/sent.

Linux re-uses the clock timestamp taken for the RTT estimation for the remainder of the packet processing step. This also avoids the extra clock access to add a timestamp to an outgoing TCP packet.

The entire timestamp option only requires 10 bytes of TCP option space in each packet. This is not a significant decrease in the space available for packet payload.

Common myths – timestamps are a security problem

Some security audit tools and (older) blog posts recommend disabling TCP timestamps because they allegedly leak system uptime, which would then make it possible to estimate the patch level of the system/kernel. This was true in the past: The timestamp clock is based on a constantly increasing value that starts at a fixed value on each system boot. A timestamp value would give an estimate as to how long the machine has been running (uptime).

As of Linux 4.12 TCP timestamps do not reveal the uptime anymore. All timestamp values sent use a peer-specific offset. Timestamp values also wrap every 49 days.

In other words, connections from or to address “A” see a different timestamp than connections to the remote address “B”.

Run sysctl net.ipv4.tcp_timestamps=2 to disable the randomization offset. This makes analyzing packet traces recorded by tools like wireshark or tcpdump easier – packets sent from the host then all have the same clock base in their TCP option timestamp.  For normal operation the default setting should be left as-is.

Selective Acknowledgments

TCP has problems if several packets in the same window of data are lost. This is because TCP acknowledgments are cumulative, but only for packets that arrived in-sequence. Example:

  • Sender transmits segments s_1, s_2, s_3, … s_n
  • Sender receives ACK for s_2
  • This means that both s_1 and s_2 were received and the sender no longer needs to keep these segments around.
  • Should s_3 be re-transmitted? What about s_4? s_n?

The sender waits for a “retransmission timeout” or ‘duplicate ACKs’ for s_2 to arrive. If a retransmit timeout occurs or several duplicate ACKs for s_2 arrive, the sender transmits s_3 again.

If the sender receives an acknowledgment for s_n, s_3 was the only missing packet. This is the ideal case. Only the single lost packet was re-sent.

If the sender instead receives an acknowledgment for a segment earlier than s_n, for example s_4, that means that more than one packet was lost. The sender needs to re-transmit the next segment as well.

Re-transmit strategies

It’s possible to just repeat the same sequence: re-send the next packet until the receiver indicates it has processed all packets up to s_n. The problem with this approach is that it requires one RTT until the sender knows which packet it has to re-send next. While such a strategy avoids unnecessary re-transmissions, it can take several seconds or more until TCP has re-sent the entire window of data.

The alternative is to re-send several packets at once. This approach allows TCP to recover more quickly when several packets have been lost. In the above example TCP re-sends s_3, s_4, s_5, … even though it can only be sure that s_3 has been lost.

From a latency point of view, neither strategy is optimal. The first strategy is fast if only a single packet has to be re-sent, but takes too long when multiple packets were lost.

The second one is fast even if multiple packets have to be re-sent, but at the cost of wasting bandwidth. In addition, such a TCP sender could have transmitted new data already while it was doing the unneeded re-transmissions.

With the available information TCP cannot know which packets were lost. This is where TCP Selective Acknowledgments (SACK) come in. Just like window scaling and timestamps, it is another optional, yet very useful TCP feature.

The SACK option

 
   TCP Sack-Permitted Option: Kind: 4, Length 2
   +---------+---------+
   | Kind=4  | Length=2|
   +---------+---------+

A sender that supports this extension includes the “Sack Permitted” option in the connection request. If both endpoints support the extension, then a peer that detects a packet is missing in the data stream can inform the sender about this.

 
   TCP SACK Option: Kind: 5, Length: Variable
                     +--------+--------+
                     | Kind=5 | Length |
   +--------+--------+--------+--------+
   |      Left Edge of 1st Block       |
   +--------+--------+--------+--------+
   |      Right Edge of 1st Block      |
   +--------+--------+--------+--------+
   |                                   |
   /            . . .                  /
   |                                   |
   +--------+--------+--------+--------+
   |      Left Edge of nth Block       |
   +--------+--------+--------+--------+
   |      Right Edge of nth Block      |
   +--------+--------+--------+--------+

A receiver that encounters segment s_2 followed by s_5…s_n will include a SACK block when it sends the acknowledgment for s_2:

 
                  +--------+--------+
                  | Kind=5 |   10   |
+--------+--------+--------+--------+
|         Left edge: s_5            |
+--------+--------+--------+--------+
|         Right edge: s_n           |
+--------+--------+--------+--------+

This tells the sender that segments up to s_2 arrived in-sequence, but it also lets the sender know that the segments s_5 to s_n were received. The sender can then re-transmit just the two missing packets, s_3 and s_4, and proceed to send new data.
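From the sender's point of view, the cumulative acknowledgment plus the SACK blocks are enough to compute exactly which byte ranges are still missing. The following sketch shows one way to do that; the function name is made up, edges are plain byte offsets, and sequence number wrap-around is ignored for simplicity. The example assumes 1000-byte segments with s_1 starting at offset 0 and n = 8.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # (first byte, one past the last byte)


def missing_ranges(cumulative_ack: int, sack_blocks: List[Range]) -> List[Range]:
    """Byte ranges the receiver has not reported, up to the highest SACK edge."""
    holes = []
    cursor = cumulative_ack  # everything before this arrived in-sequence
    for left, right in sorted(sack_blocks):
        if left > cursor:
            holes.append((cursor, left))  # a gap before this SACK block
        cursor = max(cursor, right)
    return holes


SEGMENT = 1000
ack_for_s2 = 2 * SEGMENT                     # s_1 and s_2 arrived in-sequence
sack_s5_to_s8 = (4 * SEGMENT, 8 * SEGMENT)   # left edge s_5, right edge s_8

print(missing_ranges(ack_for_s2, [sack_s5_to_s8]))  # [(2000, 4000)]
```

The single hole from 2000 to 4000 covers exactly s_3 and s_4, so those are the only segments that need to be re-sent.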

The mythical lossless network

In theory SACK provides no advantage if the connection cannot experience packet loss, or if the connection has such a low latency that even waiting one full RTT does not matter.

In practice lossless behavior is virtually impossible to ensure. Even if the network and all its switches and routers have ample bandwidth and buffer space, packets can still be lost:

  • The host operating system might be under memory pressure and drop packets. Remember that a host might be handling tens of thousands of packet streams simultaneously.
  • The CPU might not be able to drain incoming packets from the network interface fast enough. This causes packet drops in the network adapter itself.
  • If TCP timestamps are not available, even a connection with a very small RTT can stall momentarily during loss recovery.

Use of SACK does not increase the size of TCP packets unless a connection experiences packet loss. Because of this, there is hardly a reason to disable this feature. Almost all TCP stacks support SACK – it is typically only absent on low-power IoT-like devices that are not doing TCP bulk data transfers.

When a Linux system accepts a connection from such a device, TCP automatically disables SACK for the affected connection.

Summary

The three TCP extensions examined in this post are all related to TCP performance and should best be left to the default setting: enabled.

The TCP handshake ensures that only extensions that are understood by both parties are used, so there is never a need to disable an extension globally just because a peer might not support it.

Turning these extensions off results in severe performance penalties, especially in case of TCP Window Scaling and SACK. TCP timestamps can be disabled without an immediate disadvantage, however there is no compelling reason to do so anymore. Keeping them enabled also makes it possible to support TCP options even when SYN cookies come into effect.
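To check whether a system still uses the defaults, you can read the three settings directly. The sketch below is one possible way to do this. It assumes the sysctls are exposed as files under /proc/sys/net/ipv4, and that the window scaling and SACK knobs are named tcp_window_scaling and tcp_sack (only tcp_timestamps is named in this article). The function name is made up for illustration.

```python
from pathlib import Path
from typing import Dict, Optional

SYSCTLS = ("tcp_window_scaling", "tcp_timestamps", "tcp_sack")


def read_tcp_extension_settings(
        base: str = "/proc/sys/net/ipv4") -> Dict[str, Optional[int]]:
    """Return the current value of each sysctl, or None if it is unreadable."""
    settings: Dict[str, Optional[int]] = {}
    for name in SYSCTLS:
        try:
            settings[name] = int((Path(base) / name).read_text().strip())
        except (OSError, ValueError):
            settings[name] = None  # for example when not run on Linux
    return settings


if __name__ == "__main__":
    for name, value in read_tcp_extension_settings().items():
        if value is None:
            state = "unknown"
        elif value == 0:
            state = "DISABLED"
        else:
            state = "enabled"
        print(f"net.ipv4.{name} = {value} ({state})")
```

Any line reported as DISABLED points at a setting that was changed from its default and is worth a second look.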


Backup and restore Toolboxes

Toolboxes started life often described as disposable containers, and that is still one of their major uses: install stuff, try it out in the relative safety of a container, and lastly, cleanly dispose of it. There is minimal risk and fuss, and no pesky residual libraries and applications hang around on the host long after you have finished.

So why would you back up a Toolbox? Sometimes, they have more permanent uses, contain complex and lengthy installs, or are being used for critical applications. For example, a Toolbox can serve as a development environment that contains hardware-associated drivers and applications. Or it could host an application you want to run in a container for which there is no Flatpak, or one that has requirements a Flatpak doesn’t satisfy. While they can be handy on Fedora Workstation, toolbox containers are often essential for Silverblue users, since they offer an easy way to install applications that can’t successfully be installed by rpm-ostree. In these situations a busted Toolbox can be a major headache. But if a backup exists, you can quickly restore a Toolbox or move it to another workstation.

The backup process uses Podman to create an image of an existing toolbox container, and save that image to an archive file. To restore the toolbox container, load the image from the archive file and then create a Toolbox from that image. The new toolbox container will be an identical copy of your backed up toolbox container.

It is important to note that this process does not back up data, just what you have installed in the toolbox container. This includes packages installed from repositories or from a local rpm file using dnf. If you need to back up data, Podman’s commit command, which is used to capture an image of the toolbox container, has an option to include volumes attached to the container.

Creating a backup

To back up a toolbox container you will need its name and container ID, which you can get by using toolbox list. For this example I am going to back up my golang development toolbox container, imaginatively named go.

$ toolbox list
CONTAINER ID  CONTAINER NAME  CREATED      STATUS  IMAGE NAME
00ff783a102f  go              5 weeks ago  exited  registry.fedoraproject.org/f32/fedora-toolbox:32

If the container’s status shows as running, you should stop it using podman container stop container_name. Although the commit command has a -p option to pause the container, make sure that the Toolbox is not running; this helps it initialize correctly when restored from backup.

$ podman container stop go

To create an image of the toolbox container use

podman container commit -p container_ID backup-image-name

Depending on the complexity of the Toolbox, this can take a little while.

$ podman container commit -p 00ff783a102f go-backup

Now confirm that the image has been created:

$ toolbox list

You should get output similar to what is below…

IMAGE ID      IMAGE NAME                  CREATED
cfcb13046db7  localhost/go-backup:latest  About a minute ago

CONTAINER ID  CONTAINER NAME  CREATED      STATUS  IMAGE NAME
00ff783a102f  go              5 weeks ago  exited  registry.fedoraproject.org/f32/fedora-toolbox:32

Now save the backup image to a tar archive file using podman save -o backup-filename.tar backup-image-name.

$ podman save -o go.tar go-backup

Confirm the archive file, our toolbox container backup, was created.

$ ls go.tar 

Do some tidying up, remove the backup image and, if needed, remove the original Toolbox.

$ podman rmi go-backup
$ toolbox rm go
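If you back up Toolboxes regularly, the stop, commit and save steps above can be collected in a small helper. The sketch below only assembles and prints the commands shown in this section; it does not run anything. The function name is made up for illustration, and the tidy-up commands are left out on purpose so nothing gets deleted by accident.

```python
import shlex
from typing import List


def backup_commands(container_id: str, container_name: str,
                    image: str, archive: str) -> List[List[str]]:
    """Return the backup steps from this section as argument lists."""
    return [
        ["podman", "container", "stop", container_name],
        ["podman", "container", "commit", "-p", container_id, image],
        ["podman", "save", "-o", archive, image],
    ]


for argv in backup_commands("00ff783a102f", "go", "go-backup", "go.tar"):
    print("$", shlex.join(argv))
```

To actually execute the steps, each argument list could be passed to subprocess.run(argv, check=True), which stops at the first command that fails.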

Restore a backup

To create an image from the backup file made above, use the command podman load -i backup_filename.

$ podman load -i go.tar

Then you can confirm the image was created with…

$ toolbox list
IMAGE ID      IMAGE NAME                  CREATED
cfcb13046db7  localhost/go-backup:latest  17 minutes ago

Now create a toolbox container from the restored image with toolbox create --container container_name --image image_name, specifying the full repository and version tag as the image name.

$ toolbox create --container go --image localhost/go-backup:latest

Confirm that the toolbox was created.

$ toolbox list
IMAGE ID      IMAGE NAME                  CREATED
cfcb13046db7  localhost/go-backup:latest  20 minutes ago

CONTAINER ID  CONTAINER NAME  CREATED         STATUS      IMAGE NAME
34cef6b7e28d  go              21 seconds ago  configured  localhost/go-backup:latest

Finally, you can test that the restored Toolbox works…

$ toolbox enter --container go

If you can enter the newly created toolbox container, you will see the toolbox prompt and will have successfully backed up and restored your Pet toolbox container.


LaTeX typesetting, Part 3: formatting

This series covers basic formatting in LaTeX. Part 1 introduced lists. Part 2 covered tables. In part 3, you will learn about another great feature of LaTeX: the flexibility of granular document formatting. This article covers customizing the page layout, table of contents, title sections, and page style.

Page dimension

When you first wrote your LaTeX document, you may have noticed that the default margins are larger than you might expect. The margins depend on the type of paper you specified (for example a4 or letter) and on the document class (article, book, report, and so on). There are a few ways to modify the page margins; one of the simplest is the fullpage package.

This package sets the body of the page such that the page is almost full.

Fullpage package documentation

The illustration below demonstrates the LaTeX default body compared to using the fullpage package.

Another option is to use the geometry package. Before you explore how the geometry package can manipulate margins, first look at the page dimensions as depicted below.

  1. one inch + \hoffset
  2. one inch + \voffset
  3. \oddsidemargin = 31pt
  4. \topmargin = 20pt
  5. \headheight = 12pt
  6. \headsep = 25pt
  7. \textheight = 592pt
  8. \textwidth = 390pt
  9. \marginparsep = 35pt
  10. \marginparwidth = 35pt
  11. \footskip = 30pt

To set the margin to 1 (one) inch using the geometry package, use the following example:

\usepackage{geometry}
\geometry{a4paper, margin=1in}

In addition to the above example, the geometry package can modify the paper size and orientation. To set the paper size, together with the width and height of the body through the total option, use the example below:

\usepackage[a4paper, total={7in, 8in}]{geometry}

To change the page orientation, you need to add landscape to the geometry options as shown below:

\usepackage{geometry}
\geometry{a4paper, landscape, margin=1.5in}
Landscape Orientation

Table of contents

By default, a LaTeX table of contents is titled “Contents”. There are times when you may prefer to relabel it as “Table of Contents”, change the vertical spacing between the ToC title and your first section or chapter, or simply change the color of the text.

To change the text, add the following lines to your preamble, substituting english with your desired language:

\usepackage[english]{babel}
\addto\captionsenglish{
\renewcommand{\contentsname}
{\bfseries{Table of Contents}}}

To manipulate the vertical spacing between the ToC title and the lists of figures, sections, and chapters, use the tocloft package. The two lengths used in this article are cftbeforesecskip and cftaftertoctitleskip.

The tocloft package provides means of controlling the typographic design of the ToC, List of Figures and List of Tables.

Tocloft package documentation

\usepackage{tocloft}
\setlength\cftbeforesecskip{2pt}
\setlength\cftaftertoctitleskip{30pt}

cftbeforesecskip is the spacing between the sections in the ToC, while cftaftertoctitleskip is the space between the text “Table of Contents” and the first section in the ToC. The image below shows the differences between the default and the modified ToC.

Default ToC
Customized ToC

Borders

When using the package hyperref in your document, LaTeX section lists in the ToC and references including \url have a border, as shown in the images below.

To remove these borders, include the following in the preamble. In the previous section, “Table of contents,” you can see that there are no borders in the ToC.

\usepackage{hyperref}
\hypersetup{ pdfborder = {0 0 0}}

Title section

To modify the title section font, style, and/or color, use the package titlesec. In this example, you will change the font size, font style, and font color of the section, subsection, and subsubsection. First, add the following to the preamble.

\usepackage{titlesec}
\titleformat*{\section}{\Huge\bfseries\color{darkblue}}
\titleformat*{\subsection}{\huge\bfseries\color{darkblue}}
\titleformat*{\subsubsection}{\Large\bfseries\color{darkblue}}

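Taking a closer look at the code, \titleformat*{\section} specifies the sectioning level to format. The above example formats the first three levels. The {\Huge\bfseries\color{darkblue}} portion specifies the font size, font style, and font color. The darkblue color is not predefined; it has to be defined with \definecolor from the xcolor package, as done in the preamble file shown under Tips.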

Page style

To customize the page headers and footers, one package you can use is fancyhdr. This example uses the package to modify the page style, header, and footer. The comments in the code below briefly describe what each option does.

\usepackage{fancyhdr} %load the package
\pagestyle{fancy} %for header to be on each page
\fancyhead[L]{} %keep left header blank
\fancyhead[C]{} %keep centre header blank
\fancyhead[R]{\leftmark} %add the section/chapter to the header right
\fancyfoot[L]{Static Content} %add static text to the left footer
\fancyfoot[C]{} %keep centre footer blank
\fancyfoot[R]{\thepage} %add the page number to the right footer
\setlength\voffset{-0.25in} %space between page border and header (1in + space)
\setlength\headheight{12pt} %height of the actual header.
\setlength\headsep{25pt} %separation between header and text.
\renewcommand{\headrulewidth}{2pt} % add header horizontal line
\renewcommand{\footrulewidth}{1pt} % add footer horizontal line

The results of this change are shown below:

Header
Footer

Tips

Centralize the preamble

If you write many TeX documents, you can create a .tex file that holds the preamble for each of your document categories and reference that file from your documents. For example, I use a structure.tex as shown below.

$ cat article_structure.tex
\usepackage[english]{babel}
\addto\captionsenglish{
\renewcommand{\contentsname}
{\bfseries{\color{darkblue}Table of Contents}}
} % Relabel the contents
%\usepackage[margin=0.5in]{geometry} % specifies the margin of the document
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{graphicx} % allows you to add graphics to the document
\usepackage{hyperref} % permits redirection of URL from a PDF document
\usepackage{fullpage} % format the content to utilise the full page
%\usepackage{a4wide}
\usepackage[export]{adjustbox} % to force image position
%\usepackage[section]{placeins} % to have multiple images in a figure
\usepackage{tabularx} % for wrapping text in a table
%\usepackage{rotating}
\usepackage{multirow}
\usepackage{subcaption} % to have multiple images in a figure
%\usepackage{smartdiagram} % initialise smart diagrams
\usepackage{enumitem} % to manage the spacing between lists and enumeration
\usepackage{fancyhdr} %, graphicx} %for header to be on each page
\pagestyle{fancy} %for header to be on each page
%\fancyhf{}
\fancyhead[L]{}
\fancyhead[C]{}
\fancyhead[R]{\leftmark}
\fancyfoot[L]{Static Content} %\includegraphics[width=0.02\textwidth]{virgin_voyages.png}}
\fancyfoot[C]{} % clear center
\fancyfoot[R]{\thepage}
\setlength\voffset{-0.25in} %Space between page border and header (1in + space)
\setlength\headheight{12pt} %Height of the actual header.
\setlength\headsep{25pt} %Separation between header and text.
\renewcommand{\headrulewidth}{2pt} % adds horizontal line
\renewcommand{\footrulewidth}{1pt} % add horizontal line (footer)
%\renewcommand{\oddsidemargin}{2pt} % adjuct the margin spacing
%\renewcommand{\pagenumbering}{roman} % change the numbering style
%\renewcommand{\hoffset}{20pt}
%\usepackage{color}
\usepackage[table]{xcolor}
\hypersetup{ pdfborder = {0 0 0}} % removes the red border from the table of contents
%\usepackage{wasysym} %add checkbox
%\newcommand\insq[1]{%
% \Square\ #1\quad%
%} % specify the command to add checkbox
%\usepackage{xcolor}
%\usepackage{colortbl}
%\definecolor{Gray}{gray}{0.9} % create new colour
%\definecolor{LightCyan}{rgb}{0.88,1,1} % create new colour
%\usepackage[first=0,last=9]{lcg}
%\newcommand{\ra}{\rand0.\arabic{rand}}
%\newcolumntype{g}{>{\columncolor{LightCyan}}c} % create new column type g
%\usesmartdiagramlibrary{additions}
%\setcounter{figure}{0}
\setcounter{secnumdepth}{0} % sections are level 1
\usepackage{csquotes} % the proper way of using double quotes
%\usepackage{draftwatermark} % Enable watermark
%\SetWatermarkText{DRAFT} % Specify watermark text
%\SetWatermarkScale{5} % Toggle watermark size
\usepackage{listings} % add code blocks
\usepackage{titlesec} % Manipulate section/subsection
\titleformat*{\section}{\Huge\bfseries\color{darkblue}} % update sections to bold with the colour blue
\titleformat*{\subsection}{\huge\bfseries\color{darkblue}} % update subsections to bold with the colour blue
\titleformat*{\subsubsection}{\Large\bfseries\color{darkblue}} % update subsubsections to bold with the colour blue
\usepackage[toc]{appendix} % Include appendix in TOC
\usepackage{xcolor}
\usepackage{tocloft} % For manipulating Table of Contents vertical spacing
%\setlength\cftparskip{-2pt}
\setlength\cftbeforesecskip{2pt} %spacing between the sections
\setlength\cftaftertoctitleskip{30pt} % space between the first section and the text ``Table of Contents''
\definecolor{navyblue}{rgb}{0.0,0.0,0.5}
\definecolor{zaffre}{rgb}{0.0, 0.08, 0.66}
\definecolor{white}{rgb}{1.0, 1.0, 1.0}
\definecolor{darkblue}{rgb}{0.0, 0.2, 0.6}
\definecolor{darkgray}{rgb}{0.66, 0.66, 0.66}
\definecolor{lightgray}{rgb}{0.83, 0.83, 0.83}
%\pagenumbering{roman}

In your articles, refer to the structure.tex file as shown in the example below:

\documentclass[a4paper,11pt]{article}
\input{/path_to_structure.tex}
\begin{document}
…...
\end{document}

Add watermarks

To enable watermarks in your LaTeX document, use the draftwatermark package. The code snippet and image below demonstrate how to add a watermark to your document. By default the watermark color is grey, which can be changed to your desired color.

\usepackage{draftwatermark}
\SetWatermarkText{\color{red}Classified} %add watermark text
\SetWatermarkScale{4} %specify the size of the text

Conclusion

In this series you saw some of the basic but rich features that LaTeX provides for customizing a document to your needs or to the audience it will be presented to. With LaTeX, there are many packages available to customize the page layout, style, and more.


Spam Classification with ML-Pack

Introduction

ML-Pack is a small-footprint C++ machine learning library that can be easily integrated into other programs. It is an actively developed open source project released under a BSD-3 license. Machine learning has gained popularity due to the large amount of electronic data that can be collected. Other popular machine learning frameworks include TensorFlow, MxNet, PyTorch, Chainer and Paddle Paddle; however, these are designed for more complex workflows than ML-Pack. On Fedora, ML-Pack is packaged by its lead developer Ryan Curtin. In addition to a command line interface, ML-Pack has bindings for Python and Julia. Here, we will focus on the command line interface, since this may be useful for system administrators to integrate into their workflows.

Installation

You can install ML-Pack on the Fedora command line using

$ sudo dnf -y install mlpack mlpack-bin

You can also install the documentation, development headers and Python bindings by using …

$ sudo dnf -y install mlpack-doc \
mlpack-devel mlpack-python3

though they will not be used in this introduction.

Example

As an example, we will train a machine learning model to classify spam SMS messages. To keep this article brief, Linux commands will not be fully explained, but you can find out more about them by using the man command. For example, for wget, the first command used below,

$ man wget

explains that wget downloads files from the web and lists the options you can use with it.

Get a dataset

We will use an example spam dataset in Indonesian provided by Yudi Wibisono:

 
$ wget https://drive.google.com/file/d/1-stKadfTgJLtYsHWqXhGO3nTjKVFxm_Q/view
$ unzip dataset_sms_spam_bhs_indonesia_v1.zip

Pre-process dataset

We will try to classify a message as spam or ham by the number of occurrences of each word in the message. We first change the file line endings, remove line 243, which is missing a label, and then remove the header from the dataset. Then we split the data into two files, labels and messages. Since the label is the last character on each line, each line is reversed so that the label comes first. The first character is written to the labels file and everything after it is written to the messages file, with each result reversed back to its original order.

$ tr '\r' '\n' < dataset_sms_spam_v1.csv > dataset.txt
$ sed '243d' dataset.txt > dataset1.csv
$ sed '1d' dataset1.csv > dataset.csv
$ rev dataset.csv | cut -c1 | rev > labels.txt
$ rev dataset.csv | cut -c2- | rev > messages.txt
$ rm dataset.csv
$ rm dataset1.csv
$ rm dataset.txt
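The reverse-and-cut trick is easier to follow on a single line. The sketch below uses a made-up sample row (not taken from the dataset) and assumes the label is always the last character of the line:

```shell
#!/bin/bash
# Hypothetical CSV row in the same shape as the dataset: message text, then the label
sample='hello how are you,0'

# Reverse the line so the label becomes the first character, take it, reverse back
label=$(printf '%s\n' "$sample" | rev | cut -c1 | rev)

# Reverse the line, drop the first character (the label), reverse back
message=$(printf '%s\n' "$sample" | rev | cut -c2- | rev)

echo "label:   $label"
echo "message: $message"
```

The message keeps its trailing comma at this stage; it disappears later, when punctuation is stripped from the messages.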

Machine learning works on numeric data, so we will use labels of 0 for ham and 1 for spam. The dataset contains three labels: 0 for normal SMS (ham), 1 for fraud (spam), and 2 for promotion (spam). We relabel 2 as 1, so that both fraud and promotions are labelled as spam.

$ tr '2' '1' < labels.txt > labels.csv
$ rm labels.txt

The next step is to convert all text in the messages to lower case and, for simplicity, remove punctuation and any symbols that are not spaces, line endings or in the range a-z (one would need to expand this range of symbols for production use):

$ tr '[:upper:]' '[:lower:]' < \
messages.txt > messagesLower.txt
$ tr -Cd 'abcdefghijklmnopqrstuvwxyz \n' < \
messagesLower.txt > messagesLetters.txt
$ rm messagesLower.txt

We now obtain a sorted list of unique words used (this step may take a few minutes, so use nice to give it a low priority while you continue with other tasks on your computer).

$ nice -20 xargs -n1 < messagesLetters.txt > temp.txt
$ sort temp.txt > temp2.txt
$ uniq temp2.txt > words.txt
$ rm temp.txt
$ rm temp2.txt
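Part of the waiting time comes from xargs -n1, which starts a separate echo process for every word. As a possible alternative, not the method used in this article, the same sorted word list can be sketched with tr and sort -u. The sample messages and the temporary directory are made up for illustration:

```shell
#!/bin/bash
# Toy input standing in for messagesLetters.txt (hypothetical sample data)
workdir=$(mktemp -d)
printf '%s\n' 'free prize now' 'call now  free' > "$workdir/messagesLetters.txt"

# Put every word on its own line (squeezing repeated separators),
# drop any empty lines, then sort and keep one copy of each word
tr -s ' ' '\n' < "$workdir/messagesLetters.txt" |
  sed '/^$/d' |
  sort -u > "$workdir/words.txt"

cat "$workdir/words.txt"
```

For the toy input this prints call, free, now and prize, one word per line.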

We then create a matrix, where for each message, the frequency of word occurrences is counted (more on this on Wikipedia, here and here). This requires a few lines of code, so the full script, which should be saved as ‘makematrix.sh’ is below

#!/bin/bash
declare -a words=()
declare -a letterstartind=()
declare -a letterstart=()
letter=" "
i=0
lettercount=0
while IFS= read -r line; do
  labels[$((i))]=$line
  let "i++"
done < labels.csv
i=0
while IFS= read -r line; do
  words[$((i))]=$line
  firstletter="$( echo $line | head -c 1 )"
  if [ "$firstletter" != "$letter" ]
  then
    letterstartind[$((lettercount))]=$((i))
    letterstart[$((lettercount))]=$firstletter
    letter=$firstletter
    let "lettercount++"
  fi
  let "i++"
done < words.txt
letterstartind[$((lettercount))]=$((i))
echo "Created list of letters"
touch wordfrequency.txt
rm wordfrequency.txt
touch wordfrequency.txt
messagecount=0
messagenum=0
messages="$( wc -l messages.txt )"
i=0
while IFS= read -r line; do
  let "messagenum++"
  declare -a wordcount=()
  declare -a wordarray=()
  read -r -a wordarray <<< […]
  […] >> wordfrequency.txt
  echo "Processed message ""$messagenum"
  let "i++"
done < messagesLetters.txt
# Create csv file
tr ' ' ',' < wordfrequency.txt > data.csv

Since Bash is an interpreted language, this simple implementation can take up to 30 minutes to complete. If you run the above Bash script on your primary workstation, run it as a task with low priority so that you can continue with other work while you wait:

$ nice -20 bash makematrix.sh
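To see what the word-frequency matrix looks like, the counting step can also be sketched in awk on a toy input. This is a minimal sketch rather than the script above: the three-word vocabulary, the sample messages and the temporary directory are made up, and the vocabulary file is assumed to hold one word per line.

```shell
#!/bin/bash
# Toy vocabulary and messages (hypothetical sample data, not from the SMS dataset)
workdir=$(mktemp -d)
printf '%s\n' 'free' 'hello' 'prize' > "$workdir/words.txt"
printf '%s\n' 'hello hello free' 'prize free prize' 'nothing here' > "$workdir/messages.txt"

awk '
  # First file: remember each vocabulary word and its column position
  NR == FNR { col[$1] = ++nwords; next }
  # Second file: count how often each vocabulary word occurs in the message
  {
    split("", count)                 # reset the per-message counters
    for (f = 1; f <= NF; f++)
      if ($f in col) count[col[$f]]++
    row = ""
    for (c = 1; c <= nwords; c++)
      row = row (c > 1 ? "," : "") (count[c] + 0)
    print row
  }
' "$workdir/words.txt" "$workdir/messages.txt" > "$workdir/data.csv"

cat "$workdir/data.csv"
```

Each output row belongs to one message and each column to one vocabulary word, so the first toy message gives the row 1,2,0 (one free, two hello, no prize).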

Once the script has finished running, split the data into testing (30%) and training (70%) sets:

$ mlpack_preprocess_split \
--input_file data.csv \
--input_labels_file labels.csv \
--training_file train.data.csv \
--training_labels_file train.labels.csv \
--test_file test.data.csv \
--test_labels_file test.labels.csv \
--test_ratio 0.3 \
--verbose

Train a model

Now train a Logistic regression model:

$ mlpack_logistic_regression \
--training_file train.data.csv \
--labels_file train.labels.csv --lambda 0.1 \
--output_model_file lr_model.bin

Test the model

Finally we test our model by producing predictions,

$ mlpack_logistic_regression \
--input_model_file lr_model.bin \
--test_file test.data.csv \
--output_file lr_predictions.csv

and comparing the predictions with the true labels,

$ export incorrect=$(diff -U 0 lr_predictions.csv \
test.labels.csv | grep '^@@' | wc -l)
$ export tests=$(wc -l < lr_predictions.csv)
$ echo "scale=2; 100 * ( 1 - $((incorrect)) \
/ $((tests)))" | bc

This gives approximately 90% validation rate, similar to that obtained here.
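Note that diff -U 0 prints one @@ header per hunk, and a hunk can cover several adjacent differing lines, so counting hunks can give a lower number than the count of wrong predictions. A per-line comparison is one way to cross-check the figure. The sketch below assumes both files hold exactly one label per line, in the same order; the sample values and the temporary directory are made up for illustration.

```shell
#!/bin/bash
# Toy predictions and labels (hypothetical values); lines 2 and 3 both differ
workdir=$(mktemp -d)
printf '%s\n' 1 0 1 1 0 > "$workdir/predictions.csv"
printf '%s\n' 1 1 0 1 0 > "$workdir/labels.csv"

# Put prediction and label side by side, then count matching lines
accuracy=$(paste -d ',' "$workdir/predictions.csv" "$workdir/labels.csv" |
  awk -F ',' '{ total++; if ($1 == $2) correct++ }
              END { printf "%.2f", 100 * correct / total }')

echo "Accuracy: ${accuracy}%"
```

Here three of the five predictions match, so the sketch reports 60.00%, while the two adjacent mismatches would appear as a single hunk in the diff output.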

The dataset is composed of approximately 50% spam messages, so the validation rates are quite good without doing much parameter tuning. In typical cases, datasets are unbalanced, with many more entries in some categories than in others. In these cases a good validation rate can be obtained even when the class with few entries is mostly mispredicted. Thus, to better evaluate these models, one can compare the number of misclassifications of spam and the number of misclassifications of ham. Of particular importance in applications is the number of false positive spam results, since messages flagged as spam are typically not delivered. The script below produces a confusion matrix, which gives a better indication of misclassification. Save it as ‘confusion.sh’

#!/bin/bash
declare -a labels
declare -a lr
i=0
while IFS= read -r line; do
  labels[i]=$line
  let "i++"
done < test.labels.csv
i=0
while IFS= read -r line; do
  lr[i]=$line
  let "i++"
done < lr_predictions.csv
TruePositiveLR=0
FalsePositiveLR=0
TrueZeroLR=0
FalseZeroLR=0
Positive=0
Zero=0
for i in "${!labels[@]}"; do
  if [ "${labels[$i]}" == "1" ]
  then
    let "Positive++"
    if [ "${lr[$i]}" == "1" ]
    then
      let "TruePositiveLR++"
    else
      let "FalseZeroLR++"
    fi
  fi
  if [ "${labels[$i]}" == "0" ]
  then
    let "Zero++"
    if [ "${lr[$i]}" == "0" ]
    then
      let "TrueZeroLR++"
    else
      let "FalsePositiveLR++"
    fi
  fi
done
echo "Logistic Regression"
echo "Total spam" $Positive
echo "Total ham" $Zero
echo "Confusion matrix"
echo "             Predicted class"
echo "               Ham | Spam "
echo "             ---------------"
echo " Actual| Ham  | " $TrueZeroLR "|" $FalsePositiveLR
echo " class | Spam | " $FalseZeroLR " |" $TruePositiveLR
echo ""

then run the script

$ bash confusion.sh

You should get output similar to

Logistic Regression
Total spam 183
Total ham 159
Confusion matrix
             Predicted class
               Ham | Spam
             ---------------
 Actual| Ham  |  128 | 31
 class | Spam |  26  | 157

which indicates a reasonable level of classification. Other methods you can try in ML-Pack for this problem include Naive Bayes, random forest, decision tree, AdaBoost and perceptron.
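The counts in the sample output can be turned into summary figures. The sketch below takes the totals (183 spam, 159 ham) and the correctly classified counts (157 spam, 128 ham) from that output and derives the misclassified counts from them. The variable names are chosen for illustration, and the percentages apply only to this sample output, since a different random split gives different counts.

```shell
#!/bin/bash
# Counts taken from the sample output above
tp=157      # spam correctly predicted as spam
tn=128      # ham correctly predicted as ham
spam=183    # total spam messages in the test set
ham=159     # total ham messages in the test set

fp=$((ham - tn))    # ham wrongly flagged as spam
fn=$((spam - tp))   # spam that slipped through as ham

# Precision: share of flagged messages that really are spam
# Recall: share of spam messages that were flagged
metrics=$(awk -v tp="$tp" -v fp="$fp" -v fn="$fn" -v tn="$tn" 'BEGIN {
  printf "precision=%.1f recall=%.1f accuracy=%.1f",
    100 * tp / (tp + fp), 100 * tp / (tp + fn),
    100 * (tp + tn) / (tp + fp + fn + tn)
}')

echo "$metrics"
```

The false positive count (31 ham messages flagged as spam here) is the figure to watch when wrongly blocked messages are costly.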

To improve the error rate, you can try other pre-processing methods on the initial data set. Neural networks can give up to 99.95% validation rates, see for example here, here and here. However, using these techniques with ML-Pack cannot be done on the command line interface at present and is best covered in another post.

For more on ML-Pack, please see the documentation.