# Temperature Logging: Redux

Previously, while experimenting with logging temperature using a Raspberry Pi (to monitor the temperatures experienced by fermenting cider), I noted that the Pi-based setup was something of a terrible hack, and that it should be possible to do the job more efficiently with some slightly less common hardware.

I decided that an improved version would be interesting to build for use at home, since it’s both kind of fun to collect data like that, and actually knowing the temperature in the house is useful at times. The end result of this project is that I can generate graphs like the one below of conditions around the house:

## Software requirements

My primary requirement for home monitoring of this sort is that it not depend on a proprietary hub (especially not one that depends on an external service that might go away without warning), and I’d also like something that can be integrated with my existing (but minimal) home automation setup that’s based around Home Assistant running on my home server.

Given that my main software is open source, it should be possible to integrate an arbitrary solution with it, with varying amounts of reverse engineering and implementation necessary. Because reverse-engineering services like that is not my idea of fun, it’s much preferable to find something that’s already supported and take advantage of others’ work. While I don’t mind debugging, I don’t want to build an integration from scratch if I don’t need to.

## Hardware selection

As observed last time, the “hub” model for connecting “internet of things” devices to a network seems to be the best choice from a security standpoint- the only externally-visible network device is the hub, which can apply arbitrary security policies to communications between devices and to public networks (in the simplest case, forbidding all communications with public networks). Indeed, recent scholarly work (PDF) suggests systems that work on this model but apply more sophisticated policies to communications passing through the hub.

With that in mind, I decided a Zigbee network for my sensors would be appropriate- the sensors themselves have no ability to talk to public networks because they don’t even run an Internet Protocol stack, and it’s already a fairly common standard for communication. Plus, I was able to get several of the previously-mentioned Xiaomi temperature, humidity and barometric pressure sensors for about $10 each; a quite reasonable cost, given they’re battery powered with very long life and good wireless range.

Home Assistant already has some support for Zigbee devices; most relevant here seems to be its implementation of the Zigbee Home Automation application standard. Though the documentation isn’t very clear, it supports (or, should support) any radio that communicates with a host processor over a UART interface and speaks either the XBee or EZSP serial protocol.

Since the documentation for Home Assistant specifically notes that the Elelabs Zigbee USB adapter is compatible, I bought one of those. Its documentation includes a description of how to configure Home Assistant with it and specifically mentions Xiaomi Aqara devices (which includes the sensors I had selected), so I was confident that radio would meet my needs, though unsure of exactly what protocol was actually used to communicate with the radio over USB at the time I ordered it.

## Experimenting

Once I received the Zigbee radio-on-a-usb-stick, I immediately tried to manually drive it using whatever libraries I could use to set up a network and get one of my sensors connected to it. This ended up not working, but I did learn a lot about how the radio adapter is meant to work.

For working with it in Python, the Elelabs documentation points to bellows, a library providing EZSP protocol support for the zigpy Zigbee stack. It also includes a command-line interface exposing some basic commands, perfect for the sort of experimentation I wanted to do.
Getting connected was easy; I plugged the USB stick into my Linux workstation and it appeared right away as a PL2303 USB-to-serial converter. Between this and noting that bellows implements the EZSP protocol, I inferred that the Elelabs stick is a Silicon Labs EM35x microcontroller running the EmberZNet stack in a network coordinator mode, with a PL2303 exposing a UART over USB so the host can communicate with the microcontroller (and the rest of the network) by speaking EZSP. Having worked that out and made sense of it, I printed out a label for the stick that says what it is (“Elelabs Zigbee USB adapter”) and how to communicate with it (EZSP at 57600 baud) since the stick is completely unmarked otherwise and being able to tell what it does just by looking at it is very helpful.

Trying to use the bellows CLI, the status output seemed okay and the NCP was running. In order to connect one of my sensors, I then needed to figure out how to make the sensor join the network after using bellows permit to let new devices join the network. The sensors each came with a little instruction booklet, but it was all in Chinese. With the help of Google Translate, I was able to take photos of it and find the important bit- holding the button on the sensor for about 5 seconds until the LED blinks three times will reset it, at which point it will attempt to join an open network.

On trying to run bellows permit prior to resetting a sensor to get it on the network, I encountered an annoying bug- it didn’t seem to do anything, and Python emitted a warning: RuntimeWarning: coroutine 'permit' was never awaited. I dug into that a little more and found the libraries make heavy use of PEP 492 coroutines, and the warning was fairly clear that a function was declared async when it shouldn’t have been (or its coroutine wasn’t then given to an event loop) so the function actually implementing permit never ran.
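
To make that failure mode concrete, here is a minimal sketch of the same class of mistake. It is not the bellows code: `permit`, `broken_cli` and `fixed_cli` are hypothetical names chosen for illustration. Calling a coroutine function only builds a coroutine object, so unless it is awaited or handed to an event loop its body never runs, and Python eventually emits the “never awaited” warning.

```python
import asyncio

calls = []  # records every time the coroutine body actually executes


async def permit(duration):
    """Hypothetical stand-in for a coroutine that opens the network for joins."""
    calls.append(duration)
    return duration


def broken_cli():
    # Calling a coroutine function only creates a coroutine object;
    # the body does not run, and Python warns that it was never awaited.
    permit(30)


def fixed_cli():
    # Handing the coroutine to an event loop is what actually executes it.
    return asyncio.run(permit(30))
```

Running `broken_cli()` leaves `calls` empty, while `fixed_cli()` runs the body and returns its result.
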
I eventually tracked down the problem, patched it locally and filed a bug which has since been fixed.

Having fixed that bug, I continued to try to get a sensor on my toy network but was ultimately (apparently) unsuccessful. I could permit joins and reset the sensor and see debug output indicating something was happening on the network, but never any conclusive messages saying a new device had joined and rather a lot of messages along the lines of “unrecognized message.” I couldn’t tell if it was working or not, so moved on to hooking up Home Assistant.

### Setting up Home Assistant

Getting set up with Home Assistant was mostly just a matter of following the guide provided with the USB stick, but using my own knowledge of how to set up the software (not using hassio). Configuring the zha component and pointing it at the right USB path is pretty easy. I did discover that specifying the database_path for the zha component alone is not enough to make it work; if the file doesn’t already exist setup just fails. Simply creating an empty file at the configured path is enough- apparently that file is an sqlite database that zigpy uses to track known devices.

Still following the Elelabs document, I spent a bit of time invoking zha.permit and trying to get a sensor online to no apparent success. After a little more searching, I found discussion on the Home Assistant forums and in particular one user suggesting that these particular sensors are somewhat finicky when joining a network. They suggested (and my findings agree) that holding the button on the sensor to reset it, then tapping the button approximately every second for a little while (another 5-10 seconds) will keep it awake long enough to successfully join the network.

The keep-awake tapping approach did eventually work, though I also found that Home Assistant sometimes didn’t show a new sensor (or parts of a new sensor, like it might show the temperature but not humidity or pressure) until I restarted it.
This might be a bug or a misconfiguration on my part, but it’s minor enough not to worry about.

At this point I’ve verified that my software and hardware can all work, so it’s time to set up the permanent configuration.

## Permanent configuration

As mentioned above, I run Home Assistant on my Linux home server. Since I was already experimenting on a Linux system, that configuration should be trivial to transfer over, but for one additional desire I had: I want more freedom in where I place the Zigbee radio, in particular not just plugged directly into a free USB port on the server. Putting it in a reasonably central location with other radios (say, near the WiFi router) would be nice. A simple solution might be a USB extension cable, but I didn’t have any of those handy and strewing more wires about the place feels inelegant. My Internet router (a TP-Link Archer C7 running OpenWrt) does have an available USB port though, so I suspected it would be possible to connect the Zigbee radio to the router and make it appear as a serial port on the server. This turned out to be true!

### Serial over network

To find the solution for running a serial port over the network, I first searched for existing protocols; it turns out there’s a standard one that’s somewhat commonly used in fancy networking equipment, specified by RFC 2217. RFC 2217 specifies a set of extensions to Telnet allowing serial port configuration (bit rate, data bits, parity, etc) and flow control over Telnet.

Having identified a protocol that does what I want, it’s then a matter of finding software that works as a client (assuming I’ll be able to find or write a suitable server). Suitable clients are somewhat tricky however, since from an application perspective UART use on Linux involves making specialized ioctls to the device to configure it, then reading and writing bytes as usual.
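
As a rough illustration of what “configure it, then read and write bytes” means in practice, the sketch below uses Python’s standard termios module, which wraps those terminal configuration calls. A pseudoterminal stands in for a real UART so it runs without hardware; `set_baud` and `demo` are hypothetical helpers, and I’m assuming a Linux host.

```python
import os
import termios


def set_baud(fd, speed=termios.B57600):
    """Set input and output speed on a terminal-like file descriptor."""
    attrs = termios.tcgetattr(fd)   # fetch the current terminal settings
    attrs[4] = speed                # index 4 is the input speed
    attrs[5] = speed                # index 5 is the output speed
    termios.tcsetattr(fd, termios.TCSANOW, attrs)
    return termios.tcgetattr(fd)[5]  # read back what was applied


def demo():
    # A pseudoterminal pair stands in for something like /dev/ttyUSB0.
    master, slave = os.openpty()
    try:
        applied = set_baud(slave)
        os.write(slave, b"hello")   # once configured, it is ordinary byte I/O
        return applied, os.read(master, 5)
    finally:
        os.close(master)
        os.close(slave)
```

A network socket has no equivalent of the `tcsetattr` step, which is the gap a protocol like RFC 2217 exists to fill.
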
Making an RFC2217 network serial device appear like a local device would seem to involve writing a kernel driver that exports a new class of RFC2217 device nodes supporting the relevant ioctls- none exists.[1]

An alternate approach (not using RFC 2217) might be USB/IP, which is supported in mainline Linux and allows a server to bind USB devices physically connected to it to a virtual USB controller that can then be remotely attached to a different physical machine over a network. This seems like a more complex and potentially fragile solution though, so I put that aside after learning of it.

Since Linux doesn’t have any kernel-level support for remote serial ports, I needed to search for support at the application level. It turns out bellows uses pyserial to communicate with radios, and pySerial is a quite featureful library- while most users will only ever provide device names like COM1 or /dev/ttyUSB0, it supports a range of more exotic URLs specifying connections, including RFC 2217. So given a suitable server running on a remote machine, I should be able to configure Home Assistant to use a URL like rfc2217://zigbee.local:25 to reach the Zigbee radio.

### Serial server

The next step in setting up the Zigbee radio plugged into the router is finding an application that can expose a PL2303 over the network with the RFC 2217 protocol. That turned out to be a short search, where I quickly discovered ser2net which does the job and is already packaged for OpenWrt. Installing it on the router was trivial, though I also needed to be sure the kernel module(s) required to expose the USB-serial port were available:

    # opkg install kmod-usb-serial-pl2303 ser2net

Having installed ser2net, I still had to figure out how to configure it. While the documentation describes its configuration format, I know from experience that configuring servers on OpenWrt is usually done differently (as something of a concession to targeting embedded systems without much storage).
I quickly found that the package had installed a sample configuration file at /etc/config/ser2net:

    config ser2net global
        option enabled 1

    config controlport
        option enabled 0
        option host localhost
        option port 2000

    config default
        option speed 115200
        option databits 8
        option parity 'none'
        option stopbits 1
        option rtscts false
        option local false
        option remctl true

    config proxy
        option enabled 0
        option port 5000
        option protocol telnet
        option timeout 0
        option device '/dev/ttyAPP0'
        option baudrate 115200
        option databits 8
        option parity 'none'
        option stopbits 1
    #   option led_tx 'tx'
    #   option led_rx 'rx'
        option rtscts false
        option local false
        option xonxoff false

Unfortunately, this configuration doesn’t include any comments so the reader is forced to guess the meaning of each option. They mostly correspond to words that appear in the ser2net manual, but I didn’t trust guesses so went digging in the OpenWrt packages source code and found the script responsible for converting /etc/config/ser2net into an actual configuration file when starting ser2net.

My initial guess at the configuration I wanted looked something like this:

    config proxy
        option enabled 1
        option port 5000
        option protocol telnet
        option timeout 0
        option device '/dev/ttyUSB0'
        option baudrate 57600
        option remctl true

The protocol is specified as telnet because RFC 2217 is a layer on top of telnet (my first guess was that I actually wanted raw until actually reading the RFC and seeing it was a set of telnet extensions), and the device is the device name that I found the Zigbee stick appeared as when plugged into the router.[2]

Unfortunately, this configuration didn’t work and pyserial gave back a somewhat perplexing error message:

    serial.serialutil.SerialException: Remote does not seem to support RFC2217 or BINARY mode [we-BINARY:False(INACTIVE), we-RFC2217:False(REQUESTED)]

Without much visibility into what the serial driver was trying to do, I opted to examine the network traffic with Wireshark.
I first attempted to use the text-mode interface (tshark -d tcp.port==5000,telnet -f 'port 5000'), but quickly gave up and switched to the GUI instead. I captured the traffic passing between the server and router, but there was almost nothing! The client (pyserial) was sending some Telnet negotiation messages (DO ECHO, WILL suppress go ahead and COM port control), then nothing happened for a few seconds and the connection closed.

Since restarting Home Assistant for every one of these serial tests was quite cumbersome, at this point I checked if pyserial includes any programs suitable for testing connectivity. It happily does, provided in my distribution’s package as miniterm.py. Running miniterm.py rfc2217://c7:5000 failed in the same way, so I had a quicker debugging tool.

At this point the problem seems like it’s at the server side, so I stopped the ser2net server on the router and started one in the foreground, with a custom configuration specified on the command line:

    $ /etc/init.d/ser2net stop
    $ ser2net -n -d -C '5000:telnet:0:/dev/ttyUSB0:57600 remctl'
    ser2net[14914]: Unable to create network socket(s) on line 0

While ser2net didn’t outright fail, it did print a concerning error message. Does it work if I change the port it’s listening on?

    $ ser2net -n -d -C '1234:telnet:0:/dev/ttyUSB0:57600 remctl'

And then running miniterm.py succeeds, leaving me with a terminal I could type into (but didn’t, since I don’t know how to speak EZSP with my keyboard).

    $ miniterm.py rfc2217://c7:1234 57600
    --- Miniterm on rfc2217://c7:1234 57600,8,N,1 ---
    --- Quit: Ctrl+] | Menu: Ctrl+T | Help: Ctrl+T followed by Ctrl+H ---
    --- exit ---

I discovered after a little digging (netstat -lnp) that miniupnpd was already listening on port 5000 of the router, so changing the port fixes the confusing problem. A different sample port in the ser2net configuration would have prevented such an issue, as would ser2net giving up when it fails to bind to a requested port instead of printing a message and pretending nothing happened. But at least I didn’t have to patch anything to make it work.

With ser2net listening on port 2525 instead, Home Assistant can connect to it (hooray!). But it immediately throws a different error: NotImplementedError: write_timeout is currently not supported. I’ve found another bug in a rarely-exercised corner of this software stack, have I? Well, kind of. Finding that error message in the pyserial source, something is trying to set the write timeout to zero and it’s simply not implemented in pyserial for RFC2217 connections. This is ultimately because Home Assistant (as alluded to earlier with bellows and zigpy) is all coroutine-based so it uses pyserial-asyncio to adapt the blocking APIs provided by pyserial to something that works nicely with coroutines running on an event loop. When pyserial-asyncio tries to set non-blocking mode by making the timeout zero, we find it’s not supported.

    def _reconfigure_port(self):
        """Set communication parameters on opened port."""
        if self._socket is None:
            raise SerialException("Can only operate on open ports")

        # if self._timeout != 0 and self._interCharTimeout is not None:
            # XXX

        if self._write_timeout is not None:
            raise NotImplementedError('write_timeout is currently not supported')
            # XXX

While I could probably implement non-blocking support for RFC 2217 in pyserial, that seemed rather difficult and not my idea of fun.
So instead I looked for a workaround- if RFC 2217 won’t work, does pyserial support a protocol that will? The answer is of course yes: I can use socket:// for a raw socket connection to the ser2net server. This sacrifices the ability to change UART parameters (format, baud rate, etc) on the fly, but since the USB stick doesn’t support changing parameters on the fly anyway (as far as I can tell), this is no problem.

## Final configuration

The ser2net configuration that I’m now using looks like this:

    config proxy
        option enabled 1
        option port 2525
        option protocol raw
        option timeout 0
        option device '/dev/ttyUSB0'
        option baudrate 57600
        option remctl 0

And the relevant stanza in Home Assistant configuration (the baud rate needs to be specified, but pyserial ignores it for socket:// connections):

    zha:
      usb_path: 'socket://c7:2525'
      database_path: /srv/homeassistant/.homeassistant/zigbee.db
      baudrate: 57600

After ensuring the zigbee.db file exists and restarting Home Assistant to reload the configuration, I was able to pair all three sensors by following the procedure described above: call the permit service in Home Assistant, then reset the sensor by holding the button until its LED blinks three times, then tap the button every second or so for a bit.

I did observe some strange behavior on pairing the sensors that made me think they weren’t pairing correctly, like error messages in the log (ERROR (MainThread) [homeassistant.components.sensor] Setup of platform zha is taking longer than 60 seconds. Startup will proceed without waiting any longer.) and some parts of each sensor not appearing (the temperature might be shown, but not humidity or pressure). Restarting Home Assistant after pairing the sensors made everything appear as expected though, so there may be a bug somewhere in there but I can’t be bothered to debug it since there was a very easy workaround.
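
For a sense of what a socket:// URL amounts to, the sketch below treats the connection as nothing more than a TCP byte stream, with no Telnet negotiation layered on top. It is only an illustration: a local echo thread stands in for ser2net in raw mode, the function names are made up, and the payload is arbitrary bytes rather than valid EZSP.

```python
import socket
import threading


def start_fake_ser2net():
    """Stand-in for ser2net in raw mode: echoes bytes instead of driving a UART."""
    server = socket.create_server(("127.0.0.1", 0))  # port 0: let the OS pick one

    def serve():
        conn, _ = server.accept()
        with conn:
            while True:
                data = conn.recv(1024)
                if not data:
                    break
                conn.sendall(data)
        server.close()

    threading.Thread(target=serve, daemon=True).start()
    return server.getsockname()[1]


def raw_exchange(host, port, payload):
    """Send bytes over a plain TCP stream and read the same number back."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)
        received = b""
        while len(received) < len(payload):
            chunk = sock.recv(1024)
            if not chunk:
                break
            received += chunk
        return received
```

Nothing in the exchange carries a baud rate or parity setting, which is why those have to be fixed in the ser2net configuration instead.
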

## Complaining about async I/O

It’s rather interesting to me that the major bugs I encountered in trying to set up this system in a slightly unusual configuration were related to asynchronous I/O running in event loops- this is an issue that’s become something of my pet problem, such that I will argue to just about anybody who will listen that asynchronous I/O is usually unnecessary and more difficult to program. That I discovered two separate bugs in the tools that make this work relating to running asynchronous I/O in event loops seems to support that conclusion.

If Home Assistant simply spawned threads for components I believe it would simplify individual parts (perhaps at the cost of some slightly more complex low-level communication primitives) and make the system easier to debug. Instead, it runs all of its dependencies in a way they are not well-exercised in, presumably in search of “maximum performance” that seems entirely irrelevant when considering the program’s main function is acting as a hub for a variety of external devices.

I have (slowly) been working on distilling all these complaints into a series of essays on the topic, but for now this is a fine opportunity to wave a finger at something that I think is strictly worse because it’s evented.

## Conclusion

I’m pretty happy with the sensors and software configuration I have now- the sensors are tiny and unobtrusive, while the software does a fine job of logging data and presenting live readings for my edification.

I’d like to also configure a “real” database like InfluxDB to store my sensor readings over arbitrarily long time periods (since Home Assistant doesn’t remember data forever, reasonably so), which shouldn’t be too difficult (it’s supported as a module) but is somewhat unrelated to setting up Zigbee sensors in the first place. Until then, I’m pretty happy with these results despite the fact that I think the developers have made a terrible choice with evented I/O.

1. I did find somebody asking for input on the implementation of exactly that, but it looks like nothing ever came of it. A reply suggesting an application at the master end of a pty (pseudoterminal) suggests an interesting alternate option, but it doesn’t appear to be possible to receive parameter change requests from a pty (though flow control is exposed when running in “packet mode”).
2. I was concerned at the outset that the router might be completely unable to see the Zigbee stick, since apparently the Archer C7 doesn’t include a USB 1.1 OHCI or UHCI controller, so it’s incapable of communicating at all with low-speed devices like keyboards! I’ve heard (but not verified myself) that connecting a USB 2.0 hub will allow the router to communicate with low-speed devices downstream of the hub as a workaround.

# Building a terrible 'IoT' temperature logger

I had approximately the following exchange with a co-worker a few days ago:

Them: “Hey, do you have a spare Raspberry Pi lying around?”

Me: [thinks] “..yes, actually.”

T: “Do you want to build a temperature logger with Prometheus and a DS18B20+?”

M: “Uh, okay?”

It later turned out that that co-worker had been enlisted by yet another individual to provide a temperature logger for their project of brewing cider, to monitor the temperature during fermentation. Since I had all the hardware at hand (to wit, a Raspberry Pi 2 that I wasn’t using for anything and temperature sensors provided by the above co-worker), I threw something together. It also turned out that the deadline was quite short (brewing began just two days after this initial exchange), but I made it work in time.

## Interfacing the thermometer

As noted above, the core of this temperature logger is a DS18B20 temperature sensor.
Per the manufacturer:

“The DS18B20 digital thermometer provides 9-bit to 12-bit Celsius temperature measurements … communicates over a 1-Wire bus that by definition requires only one data line (and ground) for communication with a central microprocessor. … Each DS18B20 has a unique 64-bit serial code, which allows multiple DS18B20s to function on the same 1-Wire bus. Thus, it is simple to use one microprocessor to control many DS18B20s distributed over a large area.”

Indeed, this is a very easy device to interface with. But even given the svelte hardware needs (power, data and ground signals), writing some code that speaks 1-Wire is not necessarily something I’m interested in. Fortunately, these sensors are very commonly used with the Raspberry Pi, as illustrated by an Adafruit tutorial published in 2013.

The Linux kernel provided for the Pi in its default Raspbian (Debian-derived) distribution supports bit-banging 1-Wire over its GPIOs by default, requiring only a device tree overlay to activate it. This is as simple as adding a line to /boot/config.txt to make the machine’s boot loader instruct the kernel to apply a change to the hardware configuration at boot time:

    dtoverlay=w1-gpio

With that configuration, one simply needs to wire the sensor up. The w1-gpio device tree configuration by default uses GPIO 4 on the Pi as the data line, then power and grounds need to be connected and a pull-up resistor added to the data line (since 1-Wire is an open-drain bus).

The w1-therm kernel module already understands how to interface with these sensors- meaning I don’t need to write any code to talk to the temperature sensor: Linux can do it all for me! For instance, reading the temperature out in an interactive shell to test, after booting with the 1-Wire overlay enabled:

    $ modprobe w1-gpio w1-therm
    $ cd /sys/bus/w1/devices
    $ ls
    28-000004b926f1  w1_bus_master1
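
Each sensor directory listed there contains a w1_slave file, which the w1-therm driver fills with the raw scratchpad bytes, a CRC verdict and the temperature in thousandths of a degree Celsius. A minimal sketch of parsing it by hand follows; the SAMPLE text is illustrative data in the driver’s format rather than a capture from my sensor, and the function names are hypothetical.

```python
# Illustrative contents of a w1_slave file (not a real capture).
SAMPLE = (
    "72 01 4b 46 7f ff 0e 10 57 : crc=57 YES\n"
    "72 01 4b 46 7f ff 0e 10 57 t=23125\n"
)


def parse_w1_slave(text):
    """Return degrees Celsius, or None if the driver reported a bad CRC."""
    lines = text.strip().splitlines()
    if len(lines) < 2 or not lines[0].endswith("YES"):
        return None
    # The reading follows "t=" on the second line, in millidegrees Celsius.
    _, _, millidegrees = lines[1].rpartition("t=")
    return int(millidegrees) / 1000.0


def read_sensor(device_id, base="/sys/bus/w1/devices"):
    """Read one sensor through sysfs, e.g. read_sensor('28-000004b926f1')."""
    with open(f"{base}/{device_id}/w1_slave") as f:
        return parse_w1_slave(f.read())
```

With the sample text above, `parse_w1_slave(SAMPLE)` gives 23.125.
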

## Temperature exporter

Having connected the thermometer to the Pi and set up Prometheus, we now need to glue them together such that Prometheus can read the temperature. The usual way is for Prometheus to make HTTP requests to its known data sources, where the response is formatted such that Prometheus can make sense of the metrics. There is some support for having metrics sources push their values to Prometheus through a bridge (that basically just remembers the values it’s given until they’re scraped), but that seems inelegant given it would require running another program (the bridge) and goes against how Prometheus is designed to work.

I’ve published the source for the metrics exporter I ended up writing, and will give it a quick description in the remainder of this section.

The easiest solution to providing a service over HTTP is using the http.server module, so that’s what I chose to use. When the program starts up it scans for temperature sensors and stores them. This has a downside of never returning data if a sensor is accidentally disconnected at startup, but detection is fairly slow and only doing it at startup makes it clearer if sensors are accidentally disconnected during operation, since reading them will fail at that point.

#!/usr/bin/env python3

import socketserver
from http.server import HTTPServer, BaseHTTPRequestHandler
from w1thermsensor import W1ThermSensor

SENSORS = W1ThermSensor.get_available_sensors()

The request handler has a method that builds the whole response at once, which is just plain text based on a simple template.

class Exporter(BaseHTTPRequestHandler):
    METRIC_HEADER = ('# HELP w1therm_temperature Temperature in Kelvin of the sensor.\n'
                     '# TYPE w1therm_temperature gauge\n')

    def build_exposition(self, sensor_states):
        # Start from the fixed header, then add one sample line per sensor.
        out = self.METRIC_HEADER
        for sensor, temperature in sensor_states.items():
            out += 'w1therm_temperature{{id="{}"}} {}\n'.format(sensor, temperature)
        return out
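
To show what that template produces, here is a standalone sketch of the same formatting logic outside the handler class. I’m assuming sensor_states maps a sensor ID to a temperature in Kelvin (as the HELP text implies); the celsius_to_kelvin helper and the sample reading are illustrative, not taken from the published exporter.

```python
METRIC_HEADER = (
    "# HELP w1therm_temperature Temperature in Kelvin of the sensor.\n"
    "# TYPE w1therm_temperature gauge\n"
)


def celsius_to_kelvin(celsius):
    """Assumed conversion step; the sensor itself reports Celsius."""
    return celsius + 273.15


def build_exposition(sensor_states):
    """Render {sensor_id: kelvin} in the Prometheus text exposition format."""
    out = METRIC_HEADER
    for sensor, temperature in sorted(sensor_states.items()):
        # Doubled braces are literal braces in str.format templates.
        out += 'w1therm_temperature{{id="{}"}} {}\n'.format(sensor, temperature)
    return out


if __name__ == "__main__":
    print(build_exposition({"000004b926f1": 296.25}), end="")
    # # HELP w1therm_temperature Temperature in Kelvin of the sensor.
    # # TYPE w1therm_temperature gauge
    # w1therm_temperature{id="000004b926f1"} 296.25
```
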

do_GET is called by BaseHTTPRequestHandler for all HTTP GET requests to the server. Since this server doesn’t really care what you want (it only exports one thing- metrics), it completely ignores the request and sends back metrics.

    def do_GET(self):
        response = self.build_exposition(self.get_sensor_states())
        response = response.encode('utf-8')

        # We're careful to send a content-length, so keepalive is allowed.
        self.protocol_version = 'HTTP/1.1'
        self.close_connection = False

        self.send_response(200)
        self.send_header('Content-Type', 'text/plain; version=0.0.4')
        self.send_header('Content-Length', str(len(response)))
        self.end_headers()
        self.wfile.write(response)

The http.server API is somewhat cumbersome in that it doesn’t try to handle setting Content-Length on responses to allow clients to keep connections open between requests, but at least in this case it’s very easy to set the Content-Length on the response and correctly implement HTTP 1.1. The Content-Type used here is the one specified by the Prometheus documentation for exposition formats.
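
A quick way to check the keepalive behaviour is to make two requests over a single client connection. This sketch uses a stripped-down stand-in handler (not the exporter itself) with a placeholder body; it assumes the Prometheus text format content type `text/plain; version=0.0.4`.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

BODY = b"w1therm_temperature 296.25\n"  # placeholder payload


class FixedBodyHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # persistent connections need HTTP/1.1

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()
        self.wfile.write(BODY)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_twice_on_one_connection():
    server = HTTPServer(("127.0.0.1", 0), FixedBodyHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
    results = []
    try:
        for _ in range(2):
            conn.request("GET", "/metrics")
            response = conn.getresponse()
            # will_close is False when the server lets the connection persist.
            results.append((response.read(), response.will_close))
    finally:
        conn.close()
        server.shutdown()
        server.server_close()
    return results
```

Without the Content-Length header the client has no way to know where the body ends, so the server would have to close the connection to mark the end of each response.
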

The rest of the program is just glue, for the most part. The console_entry_point function is the entry point for the w1therm_prometheus_exporter script specified in setup.py. The network address and port to listen on are taken from the command line, then an HTTP server is started and allowed to run forever.
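
In outline, that glue might look like the following. This is a hedged sketch rather than the published source: I’m assuming argparse with two positional arguments (matching how the program is invoked with an address and a port), and the handler defaults to the stock BaseHTTPRequestHandler as a placeholder where the real script would pass its Exporter class.

```python
import argparse
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_args(argv=None):
    """Parse 'ADDRESS PORT' from the command line (assumed argument layout)."""
    parser = argparse.ArgumentParser(
        description="Export 1-Wire temperature readings over HTTP")
    parser.add_argument("address", help="address to listen on, e.g. localhost")
    parser.add_argument("port", type=int, help="TCP port to listen on")
    return parser.parse_args(argv)


def console_entry_point(argv=None, handler=BaseHTTPRequestHandler):
    args = parse_args(argv)
    server = HTTPServer((args.address, args.port), handler)
    try:
        server.serve_forever()  # runs until the process is stopped
    finally:
        server.server_close()
```
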

### As a server

As a Python program with a few non-standard dependencies, installation of this server is not particularly easy. While I could sudo pip install everything and call it sufficient, that’s liable to break unexpectedly if other parts of the system are automatically updated- in particular the Python interpreter itself (though Debian as a matter of policy doesn’t update Python to a different release except as a major update, so it shouldn’t happen without warning). What I’d really like is the ability to build a single standalone program that contains everything in a convenient single-file package, and that’s exactly what PyInstaller can do.

A little bit of wrestling with pyinstaller configuration later (included as the .spec file in the repository), I had successfully built a pretty heavy (5MB) executable containing everything the server needs to run. I placed a copy in /usr/local/bin, for easy accessibility in running it.

I then wrote a simple systemd unit for the temperature server to make it start automatically, installed as /etc/systemd/system/w1therm-prometheus-exporter.service:

[Unit]
Description=Exports 1-wire temperature sensor readings to Prometheus
Documentation=https://bitbucket.org/tari/w1therm-prometheus

[Service]
ExecStart=/usr/local/bin/w1therm-prometheus-exporter localhost 9000
Restart=always

StandardOutput=journal
StandardError=journal

# Standalone binary doesn't need any access beyond its own binary image and
# a tmpfs to unpack itself in.
DynamicUser=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true

[Install]
WantedBy=multi-user.target

Enable the service, and it will start automatically when the system boots:

systemctl enable w1therm-prometheus-exporter.service

This unit includes rather more protection than is probably very useful, given the machine is single-purpose, but it seems like good practice to isolate the server from the rest of the system as much as possible.

• DynamicUser will make it run as a system user with ID semi-randomly assigned each time it starts so it doesn’t look like anything else on the system for purposes of resource (file) ownership.
• ProtectSystem makes it impossible to write to most of the filesystem, protecting against accidental or malicious changes to system files.
• ProtectHome makes it impossible to read any user’s home directory, preventing information leak from other users.
• PrivateTmp gives the server its own private /tmp directory, so it can’t interfere with temporary files created by other things, nor can its own files be interfered with- preventing possible races which could be exploited.

## Pi connectivity

Having built the HTTP server, I needed a way to get data from it to Prometheus. As discussed earlier, the Raspberry Pi with the sensor is on a WiFi network that doesn’t permit any incoming connections, so how can Prometheus scrape metrics if it can’t connect to the Pi?

One option is to push metrics to Prometheus, using the push gateway. However, I don’t like that option because the push gateway is intended mostly for jobs that run unpredictably, in particular where they can exit without warning. This isn’t true of my sensor server. PushProx provides a rather better solution, wherein clients connect to a proxy which forwards fetches from Prometheus to the relevant client, though I think my ultimate solution is just as effective and simpler.

What I ended up doing is using autossh to open an SSH tunnel at the Prometheus server which connects to the Raspberry Pi’s metrics server. Autossh is responsible for keeping the connection alive, managed by systemd. Code is going to be much more instructive here than a long-form description, so here’s the unit file:

[Unit]
Description=SSH reverse tunnel from %I for Prometheus
After=network-online.target
Wants=network-online.target

[Service]
User=autossh
ExecStart=/usr/bin/autossh -N -p 22 -l autossh -R 9000:localhost:9000 -i /home/autossh/id_rsa %i
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Installed as /etc/systemd/system/autossh-tunnel@.service, this unit file tells systemd that we want to start autossh when the network is online and try to ensure it always stays online. I’ve increased RestartSec from the default 100 milliseconds because I found that even with the dependency on network-online.target, ssh could fail when the system was booting up with DNS lookup failures, then systemd would give up. Increasing the restart time means it takes much longer for systemd to give up, and in the meantime the network actually comes up.

The autossh process itself runs as a system user I created just to run the tunnels (useradd --system -m autossh), and opens a reverse tunnel from port 9000 on the remote host to the same port on the Pi. Authentication is with an SSH key I created on the Pi and added to the Prometheus machine in Google Cloud, so it can log in to the server without any human intervention. Teaching systemd that this should run automatically is a simple enable command away[1]:

systemctl enable autossh-tunnel@pitemp.example.com

Then it’s just a matter of configuring Prometheus to scrape the sensor exporter. The entire Prometheus config looks like this:

global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'w1therm'
    static_configs:
      - targets: ['localhost:9000']

That’s pretty self-explanatory; Prometheus will fetch metrics from port 9000 on the same machine (which is actually an SSH tunnel to the Raspberry Pi), and do so every 15 seconds. When the Pi gets the request for metrics, it reads the temperature sensors and returns their values.
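To make the scrape step a little more concrete, here is a minimal sketch of parsing the kind of text that Prometheus fetches through the tunnel. The sample line is an assumption modeled on the `w1therm_temperature` metric and `id` label used in this setup, not captured output, and `parse_sample` is a hypothetical helper that only handles simple lines (no escaped quotes, timestamps or comment lines).

```python
import re

# Matches simple exposition lines such as: name{label="value",...} 12.5
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="([^"]*)"')


def parse_sample(line):
    """Parse one metric line into (name, labels, value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError("not a simple sample line: %r" % line)
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


# Hypothetical sample line, not real output from the sensor server.
name, labels, value = parse_sample('w1therm_temperature{id="000004b926f1"} 297.9')
print(name, labels, value)
```

Running something like this against the tunnel endpoint on the Prometheus machine would be a quick way to check that the tunnel is up before blaming the Prometheus configuration.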

## Data retention

I included InfluxDB in the setup to get arbitrary retention of temperature data- Prometheus is designed primarily for real-time monitoring of computer systems, to alert human operators when things appear to be going wrong. Consequently, in the default configuration Prometheus only retains captured data for a few weeks, and doesn’t provide a convenient way to export data for archival or analysis. While the default retention is probably sufficient for this project’s needs, I wanted better control over how long that data was kept and the ability to save it as long as I liked. So while Prometheus doesn’t offer that control itself, it does support reading and writing data to and from various other databases, including InfluxDB (which I chose only because a package for it is available in Debian without any additional work).

Unfortunately, the version of Prometheus available in Debian right now is fairly old- 1.5.2, where the latest release is 2.2. More problematic, while Prometheus now supports a generic remote read/write API, this was added in version 2.0 and is not yet available in the Debian package. Combined with the lack of documentation (as far as I could find) for the old remote write feature, I was a little bit stuck.

Things ended up working out nicely though- I happened to see flags relating to InfluxDB in the Prometheus web UI, which mostly have no default values:

• storage.remote.influxdb-url
• storage.remote.influxdb.database = prometheus
• storage.remote.influxdb.retention-policy
• storage.remote.influxdb.username

These can be specified to Prometheus by editing /etc/default/prometheus, which is part of the Debian package and provides the command line arguments to the server without requiring users to directly edit the file that tells the system how to run Prometheus. I ended up with these options there:

ARGS="--storage.local.retention=720h \
--storage.remote.influxdb-url=http://localhost:8086/ \
--storage.remote.influxdb.retention-policy=autogen"

The first option just makes Prometheus keep its data longer than the default, whereas the others tell it how to write data to InfluxDB. I determined where InfluxDB listens for connections by looking at its configuration file /etc/influxdb/influxdb.conf and making a few guesses: a comment in the http section there noted that “these (HTTP endpoints) are the primary mechanism for getting data into and out of InfluxDB” and included the settings bind-address=":8086" and auth-enabled=false, so I guessed (correctly) that telling Prometheus to find InfluxDB at http://localhost:8086/ should be sufficient.

Or, it was almost enough: after setting the influxdb-url and restarting Prometheus, it was periodically logging warnings about getting errors back from InfluxDB. Given the influxdb.database setting defaults to prometheus, I (correctly) assumed I needed to create a database. A little browsing of the Influx documentation and a few guesses later, I had done that:

$ apt-get install influxdb-client
$ influx
Visit https://enterprise.influxdata.com to register for updates, InfluxDB server management, and monitoring.
Connected to http://localhost:8086 version 1.0.2
InfluxDB shell version: 1.0.2
> CREATE DATABASE prometheus;

Examining the Prometheus logs again, now it was failing and complaining that the specified retention policy didn’t exist. Noting that the Influx documentation for the CREATE DATABASE command mentioned that the autogen retention policy will be used if no other is specified, setting the retention-policy flag to autogen and restarting Prometheus made data start appearing, which I verified by waiting a little while and making a query (guessing a little bit about how I would query a particular metric):

> USE prometheus;
> SELECT * FROM w1therm_temperature LIMIT 10;
name: w1therm_temperature
-------------------------
time                    id              instance        job     value
1532423583303000000     000004b926f1    localhost:9000  w1therm 297.9
1532423598303000000     000004b926f1    localhost:9000  w1therm 297.9
1532423613303000000     000004b926f1    localhost:9000  w1therm 297.9
1532423628303000000     000004b926f1    localhost:9000  w1therm 297.9
1532423643303000000     000004b926f1    localhost:9000  w1therm 297.9
1532423658303000000     000004b926f1    localhost:9000  w1therm 297.9
1532423673303000000     000004b926f1    localhost:9000  w1therm 297.9
1532423688303000000     000004b926f1    localhost:9000  w1therm 297.9
1532423703303000000     000004b926f1    localhost:9000  w1therm 297.9
1532423718303000000     000004b926f1    localhost:9000  w1therm 297.9
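The raw rows are not very readable, so here is a small sketch that converts one into a human-friendly form. It assumes the `time` column is nanoseconds since the UNIX epoch and that the values are in Kelvin (which is what 297.9 looks like for a room-temperature reading); `describe_row` is a hypothetical helper, not part of any of the tools used here.

```python
from datetime import datetime, timezone


def describe_row(time_ns, value_kelvin):
    """Convert a nanosecond timestamp and a Kelvin reading for display."""
    # Integer division avoids float rounding on 19-digit timestamps.
    seconds, nanos = divmod(time_ns, 1_000_000_000)
    when = datetime.fromtimestamp(seconds, tz=timezone.utc)
    when = when.replace(microsecond=nanos // 1000)
    # Assumption: the stored value is in Kelvin.
    celsius = round(value_kelvin - 273.15, 2)
    return when.isoformat(), celsius


print(describe_row(1532423583303000000, 297.9))
```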

## Results

A sample graph of the temperature over two days:

The fermentation temperature is quite stable, with daily variation of less than one degree in either direction from the baseline.

## Refinements

I later improved the temperature server to handle SIGHUP as a trigger to scan for sensors again, which is a slight improvement over restarting it, but not very important because the server is already so simple (and fast to restart).

On reflection, using Prometheus and scraping temperatures is a very strange way to go about solving the problem of logging the temperature (though it has the advantage of using only tools I was already familiar with so it was easy to do quickly). Pushing temperature measurements from the Pi via MQTT would be a much more sensible solution, since that’s a protocol designed specifically for small sensors to report their states. Indeed, there is no shortage of published projects that do exactly that more efficiently than my Raspberry Pi, most of them using ESP8266 microcontrollers which are much lower-power and can still connect to Wi-Fi networks.

Getting sensor readings through an MQTT broker and storing them to be able to graph them is not quite as trivial as scraping them with Prometheus, but I suspect there does exist a software package that does most of the work already. If not, I expect a quick and dirty one could be implemented with relative ease.

On the other hand, running a device like that which is internet-connected but is unlikely to ever receive anything remotely looking like a security update seems ill-advised if it’s meant to run for anything but a short amount of time. In that case it would be better to have the sensor be part of a Zigbee network instead: Zigbee does not permit direct internet connectivity, which avoids the fraught terrain of needing to protect both the device itself from attack and the data transmitted by the device from unauthorized use (eavesdropping), by taking ownership of that problem away from the sensor.

It remains possible to forward messages out to an MQTT broker on the greater internet using some kind of bridge (indeed, this is the system used by many consumer “smart device” platforms, like Philips’ Hue though I don’t think they use MQTT), where individual devices connect only to the Zigbee network, and a more capable bridge is responsible for internet connectivity. The problem of keeping the bridge secure remains, but is appreciably simpler than needing to maintain the security of each individual device in what may be a heterogeneous network.

It’s even possible to get inexpensive off-the-shelf temperature and humidity sensors that connect to Zigbee networks, like some sold by Xiaomi, offering much better finish than a prototype-quality one I might be able to build myself, very good battery life, and still capable of operating in a heterogeneous Zigbee network with arbitrary other devices (though you wouldn’t know it from the manufacturer’s documentation, since they want consumers to commit to their “platform” exclusively)!

So while my solution is okay in that it works fine with hardware I already had on hand, a much more robust solution is readily available with off-the-shelf hardware and only a little bit of software to glue it together. If I needed to do this again and wanted a solution that doesn’t require my expertise to maintain it, I’d reach for those instead.

1. Hostname changed to an obviously fake one for anonymization purposes.

# sax-ng

Over on Cemetech, we’ve long had an embedded chat widget called “SAX” (“Simultaneous Asynchronous eXchange”). It behaves kind of like a traditional shoutbox, in that registered users can use the SAX widget to chat in near-real-time. There is also a bot that relays messages between the on-site widget and an IRC channel, which we call “saxjax”.

The implementation of this, however, was somewhat lacking in efficiency. It was first implemented around mid-2006, and saw essentially no updates until just recently. The following is a good example of how dated the implementation was:

// code for Mozilla, etc
if (window.XMLHttpRequest) {
    xmlhttp=new XMLHttpRequest()
    xmlhttp.open("GET",url,true)
    xmlhttp.send(null)
} else if (window.ActiveXObject) {
    // code for IE
    xmlhttp=new ActiveXObject("Microsoft.XMLHTTP")
    if (xmlhttp) {
        xmlhttp.open("GET",url,true)
        xmlhttp.send()
    }
}


The presence of ActiveXObject here implies it was written at a time when a large fraction of users would have been using Internet Explorer 5 or 6 (the first version of Internet Explorer released which supported the standard form of XMLHttpRequest was version 7).

Around a year ago (that’s how long this post has been a draft for!), I took it upon myself to design and implement a more modern replacement for SAX. This post discusses that process and describes the design of the replacement, which I have called “sax-ng.”

## Legacy SAX

The original SAX implementation, as alluded to above, is based on AJAX polling. On the server, approximately the 30 most recent messages were stored in a MySQL database and a few PHP scripts managed retrieving and modifying messages in the database. This design was a logical choice when initially built, since the web site was running on a shared web host (supporting little more than PHP and MySQL) at the time.

Eventually this design became a problem, as essentially every page containing SAX that is open at any given time regularly polls for new messages. Each poll calls into PHP on the server, which opens a database connection to perform one query. Practically, this means a very large number of database connections being opened at a fairly regular pace. In mid-2012 the connection count reached levels where the shared hosting provider were displeased with it, and requested that we either pay for a more expensive hosting plan or reduce resource usage.

In response, we temporarily disabled SAX, then migrated the web site to a dedicated server provided by OVH, who had opened a new North American datacenter in July. We moved to the dedicated server in August of 2012. This infrastructure change kept the system running, and opened the door to a more sophisticated solution since we gained the ability to run proper server applications.

Meanwhile, the limitations of saxjax (the IRC relay bot) slowly became more evident over time. The implementation was rather ad-hoc, in Python. It used two threads to implement relay, with a dismaying amount of shared state used to relay messages between the two threads. It tended to stop working correctly in case of an error in either thread, be it due to a transient error response from polling the web server for new messages, or an encoding-related exception thrown from the IRC client (since Python 2.x uses bytestrings for most tasks unless specifically told not to, and many string operations, particularly outputting the string to somewhere, can break without warning when used with data that is not 8-bit clean, that is, basically anything that isn’t ASCII).

Practically, this meant that the bot would frequently end up in a state where it would only relay messages one way, or relay none at all. I put some time into making it more robust to these kinds of failures early in 2015, such that some of the time it would manage to catch these errors and outright restart (rather than try to recover from an inconsistent state). Doing so involved some pretty ugly hacks though, which prompted a return to some longtime thoughts on how SAX could be redesigned for greater efficiency and robustness.

## sax-ng

For a long time prior to beginning this work, I frequently (semi-jokingly) suggested XMPP (Jabber) as a solution to the problems with SAX. At a high level this seems reasonable: XMPP is a chat protocol with a number of different implementations available, and is relatively easy to set up as a private chat service.

On the other hand, the feature set of SAX imposes a few requirements which are not inherently available for any given chat service:

1. An HTTP gateway, so clients can run inside a web browser.
2. Group chat, not just one-to-one conversation capability.
3. External authentication (logging in to the web site should permit connection to chat as well).
4. Retrieval of chat history (so a freshly-loaded page can have some amount of chat history shown).

As it turns out, ejabberd enables all of these, with relatively little customization. mod_http_bind provides an HTTP gateway as specified in XEP-0206, mod_muc implements multi-user chat as specified in XEP-0045 which also includes capabilities to send chat history to clients when they connect, and authentication can be handled by an external program which speaks a simple binary protocol and is invoked by ejabberd.

Main implementation of the new XMPP-based system was done in about a week, perhaps 50 hours of concerted work total (though I may be underestimating). I had about a month of “downtime” at the beginning of this past summer, the last week of which was devoted to building sax-ng.

### ejabberd

The first phase involved setting up an instance of ejabberd to support the rest of the system. I opted to run it inside Docker, ideally to make the XMPP server more self-contained and avoid much custom configuration on the server. Conveniently, somebody had already built a Docker configuration for ejabberd with a wealth of configuration switches, so it was relatively easy to set up.

Implementing authentication against the web site was also easy, referring to the protocol description in the ejabberd developers guide. Since this hooks into the website’s authentication system (a highly modified version of phpBB), this script simply connects to the mysql server and runs queries against the database.

Actual authentication is performed with phpBB SIDs (Session ID), rather than a user’s password. It was built this way because the SID and username are stored in a cookie, which is available to a client running in a web browser. This is probably also somewhat more secure than storing a password in the web browser, since the SID is changed regularly so data exposure via some vector cannot compromise a user’s web site password.

Error handling in the authentication script is mostly nonexistent. The Erlang approach to such problems is mostly “restart the component if it fails”, so in case of a problem (of which the only real possibility is a database connection error) ejabberd will restart the authentication script and attempt to carry on. In practice this has proven to be perfectly reliable.
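For readers who haven’t seen the external authentication protocol, here is a minimal sketch of its framing as I understand it from the ejabberd documentation: each request is a two-byte big-endian length followed by a colon-separated string such as `auth:user:server:password`, and each answer is four bytes (a length of 2, then 1 or 0). This is not the actual script; `check_sid` is a hypothetical callback standing in for the database query against phpBB sessions.

```python
import io
import struct


def read_request(stream):
    """Read one length-prefixed request; return its fields or None at EOF."""
    header = stream.read(2)
    if len(header) < 2:
        return None
    (length,) = struct.unpack(">H", header)
    payload = stream.read(length).decode("utf-8")
    # Format is operation:user:server[:password]; the last field may contain ':'.
    return payload.split(":", 3)


def write_response(stream, ok):
    """Write the 4-byte answer: length 2, then 1 for success or 0 for failure."""
    stream.write(struct.pack(">HH", 2, 1 if ok else 0))


def serve(inp, out, check_sid):
    """Answer requests until EOF; check_sid is a hypothetical callback."""
    while True:
        fields = read_request(inp)
        if fields is None:
            break
        if fields[0] == "auth" and len(fields) == 4:
            _, user, server, sid = fields
            write_response(out, bool(check_sid(user, server, sid)))
        else:
            # A real script would also need to handle operations like isuser.
            write_response(out, False)


# Demonstration with in-memory streams instead of ejabberd's stdin/stdout.
request = b"auth:alice:example.com:made-up-sid"
inp = io.BytesIO(struct.pack(">H", len(request)) + request)
out = io.BytesIO()
serve(inp, out, lambda user, server, sid: sid == "made-up-sid")
print(out.getvalue())
```

Because the script exits as soon as its input ends or anything raises, it fits the “let it crash and be restarted” approach described above.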

In XMPP MUC (Multi-User Chat), users are free to choose any nickname they wish. For our application, there is really only one room and we wish to enforce that the nickname used in XMPP is the same as a user’s username on the web site. There ends up being no good way in ejabberd to require that a user take a given nickname, but we can ensure that it is impossible to impersonate other users by registering all site usernames as nicknames in XMPP. Registered nicknames may only be used by the user to which they are registered, so the only implementation question is in how to automatically register nicknames.

I ended up writing a small patch to mod_muc_admin, providing an ejabberdctl subcommand to register a nickname. This patch is included in its entirety below.

diff --git a/src/mod_muc_admin.erl b/src/mod_muc_admin.erl
index 9c69628..3666ba0 100644
@@ -15,6 +15,7 @@
     start/2, stop/1, % gen_mod API
     muc_online_rooms/1,
     muc_unregister_nick/1,
+    muc_register_nick/3,
     create_room/3, destroy_room/3,
     create_rooms_file/1, destroy_rooms_file/1,
     rooms_unused_list/2, rooms_unused_destroy/2,
@@ -38,6 +39,9 @@
 
 %% Copied from mod_muc/mod_muc.erl
 -record(muc_online_room, {name_host, pid}).
+-record(muc_registered,
+        {us_host = {{<<"">>, <<"">>}, <<"">>} :: {{binary(), binary()}, binary()} | '$1',
+         nick = <<"">> :: binary()}).
 
 %%----------------------------
 %% gen_mod
@@ -73,6 +77,11 @@ commands() ->
                        module = ?MODULE, function = muc_unregister_nick,
                        args = [{nick, binary}],
                        result = {res, rescode}},
+     #ejabberd_commands{name = muc_register_nick, tags = [muc],
+                       desc = "Register the nick in the MUC service to the JID",
+                       module = ?MODULE, function = muc_register_nick,
+                       args = [{nick, binary}, {jid, binary}, {domain, binary}],
+                       result = {res, rescode}},
 
      #ejabberd_commands{name = create_room, tags = [muc_room],
                        desc = "Create a MUC room name@service in host",
@@ -193,6 +202,16 @@ muc_unregister_nick(Nick) ->
        error
     end.
 
+muc_register_nick(Nick, JID, Domain) ->
+    {jid, UID, Host, _,_,_,_} = jlib:string_to_jid(JID),
+    F = fun (MHost, MNick) ->
+                mnesia:write(#muc_registered{us_host=MHost,
+                                             nick=MNick})
+        end,
+    case mnesia:transaction(F, [{{UID, Host}, Domain}, Nick]) of
+        {atomic, ok} -> ok;
+        {aborted, _Error} -> error
+    end.
 
 %%----------------------------
 %% Ad-hoc commands

It took me a while to work out how exactly to best implement this feature, but considering I had never worked in Erlang before it was reasonably easy. I do suspect some familiarity with Haskell and Rust provided background to more easily understand certain aspects of the language, though. The requirement that I duplicate the muc_registered record (since apparently Erlang provides no way to import records from another file) rubs me the wrong way, though.

In practice, then, a custom script traverses the web site database, invoking ejabberdctl to register the nickname for every existing user at server startup and then periodically or on demand when the server is running.

### Web interface

The web interface into XMPP was implemented with Strophe.js, communicating with ejabberd via HTTP-bind with the standard support in both the client library and server.

The old SAX design served a small amount of chat history with every page load so it was immediately visible without performing any additional requests after page load, but since the web server never receives chat data (it all goes into XMPP directly), this is no longer possible. The MUC specification allows a server to send chat history to clients when they join a room, but that still requires several HTTP round-trips (taking up to several seconds) to even begin receiving old lines.

I ended up storing a cache of messages in the browser, which is used to populate the set of displayed messages on initial page load. Whenever a message is received and displayed, its text, sender and a timestamp are added to the local cache. On page load, messages from this cache which are less than one hour old are displayed. The tricky part with this approach is avoiding duplication of lines when messages sent as part of room history already exist, but checking the triple of sender, text and timestamp seems to handle these cases quite reliably.

### webridge

The second major feature of SAX is to announce activity on the web site’s bulletin board, such as when people create or reply to threads. Since the entire system was previously managed by code tightly integrated with the bulletin board, a complete replacement of the relevant code was required.

In the backend, SAX functionality was implemented entirely in one PHP function, so replacing the implementation was relatively easy. The function’s signature was something like saxSay($type, $who, $what, $where), where type is a magic number indicating what kind of message it is, such as the creation of a new thread, a post in a thread or a message from a user. The interpretation of the other parameters depends on the message type, and tends to be somewhat inconsistent. The majority of that function was a maze of comparisons against the message type, emitting a string which was eventually pushed into the chat system.

Rather than attempt to make sense of that code, I decided to replace it with a switch statement over symbolic values (whereas the old code just used numbers with no indication of purpose), feeding simple invocations of sprintf. Finding the purpose of each of the message types was the most challenging part of that, but it wasn’t terribly difficult as I ended up searching the entire web site source code for references to saxSay and determined the meaning of the types from the caller’s context.

To actually feed messages from PHP into XMPP, I wrote a simple relay bot which reads messages from a UNIX datagram socket and repeats them into a MUC room. A UNIX datagram socket was selected because there need not be any framing information in messages coming in (just read a datagram and copy its payload), and this relay should not be accessible to anything running outside the same machine (hence a UNIX socket).

The bot is implemented in Python with Twisted, utilizing Twisted’s provided protocol support for XMPP. It is run as a service under twistd, with configuration provided via environment variables because I didn’t want to write anything to handle reading a more “proper” configuration file.

When the PHP code calls saxSay, that function connects to a socket with path determined from web site configuration and writes the message into that socket. The relay bot (“webridge”) receives these messages and writes them into MUC.

### saxjax-ng

Since keeping a web page open for chatting is not particularly convenient, we also operate a bridge between the SAX chat and an IRC channel called saxjax. The original version of this relay bot was of questionable quality at best: the Python implementation ran two threads, each providing one-way communication through a list. No concurrency primitives, little sanity.

Prior to creation of sax-ng I had put some amount of effort in improving the reliability of that system, since an error in either thread would halt all processing of messages in the direction corresponding to the thread in which the error occurred. Given there was essentially no error handling anywhere in the program, this sort of thing happened with dismaying frequency.

saxjax-ng is very similar in design to webridge, in that it’s Twisted-based and uses the Twisted XMPP library. On the IRC side, it uses Twisted’s IRC library (shocking!). Both ends of this end up being very robust when combined with the components that provide automatic reconnection and a little bit of custom logic for rotating through a list of IRC servers. Twisted guarantees singlethreaded operation (that’s the whole point; it’s an async event loop), so relaying a message between the two connections is simply a matter of repeating it on the other connection.

## Contact with users

This system has been perfectly reliable since deployment, after a few changes. Most notably, the http-bind interface for ejabberd was initially exposed on port 5280 (the default for http-bind). Users behind certain restrictive firewalls can’t connect to that port, so we quickly reconfigured our web server to reverse-proxy to http-bind and solve that problem. Doing so also means the XMPP server doesn’t need its own copy of the server’s SSL certificate.

There are still some pieces of the web site that emit messages containing HTML entities in accordance with the old system. The new system.. doesn’t emit HTML entities because that should be the responsibility of something doing HTML presentation (Strong Opinion) and I haven’t bothered trying to find the things that are still emitting HTML-like strings.

The reconnect logic on the web client tends to act like it’s received multiples of every message that arrives after it’s tried to reconnect to XMPP, such as when a user puts their computer to sleep and later resumes; the web client tries to detect the lost connection and reopen it, and I think some event handlers are getting duplicated at that point. Haven’t bothered working on a fix for that either.

# Conclusion

ejabberd is a solid piece of software and not hard to customize. Twisted is a good library for building reliable network programs in Python, but has enough depth that some of its features lack useful documentation so finding what you need and figuring out how to use it can be difficult.

This writeup has been languishing for too long so I’m done writing now.

# Web history archival and WARC management

I’ve been a sort of ‘rogue archivist’ along the lines of the Archive Team for some time, but generally lack the combination of motivation and free time to directly take part in their activities. That said, I do sometimes go on bursts of archival since these things do concern me; it’s just a question of when I’ll get manic enough to be useful and latch onto an archival task as the one to do. An earlier public example is when I mirrored ticalc.org.

The historical record contains plenty of instances where people maintained copies of their communications or other documentation which has proven useful to study, and in the digital world the same is likely to be true. With the ability to cheaply store large amounts of data, it is also relatively easy to generate collections in the hope of their future utility.

Something I first played with back in 2014 was extracting lists of web pages to archive from web browser history. From a public perspective this may not be particularly interesting, but if maintained over a period of time this data could be interesting as a snapshot of a typical-in-some-fashion individual’s daily life, or for purposes I can’t foresee.

Today I’m going to write a little about how I collect this data and reduce the space requirements. The products of this work that are source code can be found on Bitbucket.

## Collecting History

I use Firefox as my everyday web browser, which combined with Firefox Sync provides ready access to a reasonably complete record of my web browsing activity. The first step is extracting the actual browser history, which is a relatively straightforward process since Firefox maintains all of this data in SQLite databases. I use cookies.sqlite and places.sqlite from my Firefox profile.

Extracting history from places.sqlite is as simple as running a query that emits timestamps and corresponding URLs. For example:

sqlite3 places.sqlite \
    "SELECT visit_date, url FROM moz_places, moz_historyvisits \
     WHERE moz_places.id = moz_historyvisits.place_id \
       AND visit_date > $LASTRUN \
     ORDER BY visit_date"

This will print the timestamp and URL for every page in history newer than LASTRUN (which can easily be omitted to get everything), with the fields separated by pipes (|). The timestamp (visit_date) is a UNIX timestamp expressed in microseconds.
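As a small illustration of that output format, here is a sketch that parses one line of it. The sample line is made up, and `parse_history_line` is a hypothetical helper rather than part of my tooling.

```python
from datetime import datetime, timezone


def parse_history_line(line):
    """Split 'visit_date|url' and convert the microsecond timestamp."""
    # Split only on the first pipe, since URLs may contain '|' themselves.
    stamp, url = line.rstrip("\n").split("|", 1)
    seconds, micros = divmod(int(stamp), 1_000_000)
    when = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return when.replace(microsecond=micros), url


# Hypothetical line in the format printed by the query above.
when, url = parse_history_line("1400000000123456|https://example.com/a|b")
print(when.isoformat(), url)
```

Splitting on only the first pipe matters for the same reason the later `cut` invocation uses `-f 2-`: everything after the first separator belongs to the URL.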

While there’s some utility in just grabbing web pages, the real advantage I’ve found in using data directly from a web browser is that it can gain a personal touch, with access to private data granted in many cases by cookies. This does imply that the data should not be shared, but as with personal letters in history this formerly-private information may become useful in the future at a point when the privacy of that data is no longer a concern for those involved.

Again using sqlite and the cookies.sqlite file we got from Firefox, it’s relatively easy to extract a cookies.txt file that can be read by many tools:

sqlite3 -separator ' ' cookies.sqlite << EOF
.mode tabs
SELECT host,
       CASE substr(host,1,1)='.' WHEN 0 THEN 'FALSE' ELSE 'TRUE' END,
       path,
       CASE isSecure WHEN 0 THEN 'FALSE' ELSE 'TRUE' END,
       expiry,
       name,
       value
FROM moz_cookies;
EOF

The output of that sqlite invocation can be redirected directly into a cookies.txt file without any further work.
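The same export can be sketched with Python’s sqlite3 module, which is handy if the result needs further processing. The column names come from the query above; the `moz_cookies` table name and the header line are assumptions on my part (some readers, such as Python’s http.cookiejar, refuse files without the `# Netscape HTTP Cookie File` header), and the demo data is invented.

```python
import sqlite3

QUERY = """
SELECT host,
       CASE WHEN substr(host, 1, 1) = '.' THEN 'TRUE' ELSE 'FALSE' END,
       path,
       CASE isSecure WHEN 0 THEN 'FALSE' ELSE 'TRUE' END,
       expiry, name, value
FROM moz_cookies
"""


def export_cookies(conn):
    """Return cookies.txt content for the cookies in an open database."""
    lines = ["# Netscape HTTP Cookie File"]
    for row in conn.execute(QUERY):
        lines.append("\t".join(str(field) for field in row))
    return "\n".join(lines) + "\n"


def demo():
    """Build a throwaway in-memory table holding one made-up cookie."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE moz_cookies "
                 "(host, path, isSecure, expiry, name, value)")
    conn.execute("INSERT INTO moz_cookies VALUES "
                 "('.example.com', '/', 0, 2000000000, 'session', 'abc')")
    return export_cookies(conn)


print(demo(), end="")
```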

With the list of URLs and cookies, it’s again not difficult to capture a WARC containing every web page listed. I’ve used wget, largely out of convenience. Taking advantage of a UNIX shell, I usually do the following, piping the URL list into wget:

cut -d '|' -f 2- urls.txt | \
    wget --warc-file="$(date +%Y%m%d-%H%M%S)" --warc-cdx --warc-max-size=1G \
         -e robots=off -U "Inconspicuous Browser" \
         --timeout 30 --tries 2 --page-requisites \
         --load-cookies cookies.txt \
         --delete-after -i -

This will download every URL given to it with the cookies extracted earlier, and will also download external resources (like images) when they are referenced in downloaded pages. The process will be logged to a WARC file named with the time the process was started, limiting to approximately 1-gigabyte chunks.

This takes a while, and the best benefits are to be had from running this at fairly short intervals which will tend to provide more unexpired cookies and catch changes over short periods of time, thus presenting a more accurate view of what the browser’s user is actually doing.

## Deduplication

On completion, I’m presented with a directory containing some number of compressed WARC files. That’s a reasonable place to leave it, but this weekend after doing an archival run that yielded about 90 gigabytes of data I decided to look into making it smaller, especially considering I know my archive runs end up grabbing many copies of the same resources on web sites which I visit frequently (for example, icons on DuckDuckGo).

The easy approach would be to use a compression scheme which tends to work better than gzip (the typical compression scheme for WARCs). However, doing so would destroy a useful property in that the files do not need to be completely decompressed for viewing. These are built such that with an index showing where a particular record exists in the archive, a user does not need to decompress the entire file up to that point (as would be the case with most compression schemes)- it is possible to seek to that point in the compressed file and decompress just the desired record.
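That per-record property is easy to demonstrate. The following is a minimal sketch, assuming each record is compressed as its own gzip member and an index of byte offsets is kept alongside; the record contents are placeholders rather than real WARC records, and a real reader would seek in the file instead of slicing an in-memory blob.

```python
import gzip
import zlib


def build_archive(records):
    """Compress each record as its own gzip member; return data and offsets."""
    blob, offsets = b"", []
    for record in records:
        offsets.append(len(blob))
        blob += gzip.compress(record)
    return blob, offsets


def read_record(blob, offset):
    """Decompress only the gzip member that starts at the given offset."""
    # wbits=31 selects gzip framing; decompression stops at the member's end
    # and any following members are left untouched in unused_data.
    decoder = zlib.decompressobj(31)
    return decoder.decompress(blob[offset:])


records = [b"first record", b"second record", b"third record"]
blob, offsets = build_archive(records)
print(read_record(blob, offsets[1]))
```

A compressor that treats the whole file as one stream would usually do better on size, but would give up exactly this ability to start decompressing in the middle.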

I had hoped that the professionals in this field had already considered ways to make their archives smaller, and that ended up being true but the documentation is very sparse: the only truly useful material was a recent presentation by Youssef Eldakar from the Bibliotheca Alexandrina cursorily describing tools to deduplicate entries in WARC files using revisit records which point to a previous date-URL combination that has the same contents[1].

I don’t see any strong reason to keep my archives split into 1-gigabyte pieces and it’s slightly easier to perform deduplication on a single large archive, so I used megawarc to join a number of smaller archives into one big one.

It was easy enough to find the published code for the tools described in the presentation, so all I had to do was figure out how to run them.. right?

## The Process

The logical procedure for deduplication is as follows:

1. Run warcsum to compute hashes of every record of interest in the specified archive(s), writing them to a file.
2. Run warccollres to examine the records and their hashes, determining which ones are actually the same and which are just hash collisions.
3. Run warcrefs to rewrite the archives with references when duplicates are found.

I had a hard time actually getting that to happen, though.

### warcsum

Running warcsum was relatively easy; it happily chewed on my test archive for a while and eventually spat out a long list of files. I later discovered that it wasn’t processing the whole archive, though- it stopped after about two gigabytes of data. I eventually found that the program (written in C) was using int as a type to represent file offsets, so the apparent offset in a file becomes negative after reading two gigabytes of data, which causes the program to end, thinking it’s done everything. I patched the relevant bits to use 64-bit types (like off_t) when working with file offsets, and eventually got it to emit 1.7 million records rather than the few tens of thousands I was getting before.
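To illustrate why the two-gigabyte mark is where things go wrong, here is a small sketch that mimics storing an offset in a 32-bit signed integer. It uses Python's ctypes rather than C, and `as_c_int` is a hypothetical helper, not anything from warcsum; it assumes the usual platforms where int is 32 bits wide:

```python
import ctypes

def as_c_int(value):
    """Return `value` as a 32-bit signed integer would hold it (wrapping)."""
    # ctypes performs no overflow checking, so out-of-range values wrap around.
    return ctypes.c_int32(value).value

two_gib = 2**31
print(as_c_int(two_gib - 1))  # last offset that still fits
print(as_c_int(two_gib))      # one byte further: the stored offset is negative
```

A 64-bit type has no such problem until offsets reach 2**63, far beyond any archive I expect to produce.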

While investigating the premature termination, I found (using warcat) that wget sometimes writes record length fields that are one byte longer than the actual record. I spent a while trying to investigate that and repair the length fields in hopes of fixing warcsum’s premature termination, but it ended up being unnecessary. In practice this off-by-one doesn’t seem to be harmful, but I do find it somewhat concerning.

I also discovered that warcsum assumes wrapping arithmetic for determining how large some buffers should be, which is undefined behavior in C and could cause Bad Things to happen. I fixed the instance where I saw it, but that didn’t seem to be causing any issues on my dataset.

### warccollres

Moving on to warccollres, I found that it assumes a lot of infrastructure which I lack. Given the name of a WARC file, it expects to have access to a MySQL server which can indicate a URL where records from the WARC can be downloaded- a reasonable assumption if you’re a professional working within an organization like the Bibliotheca Alexandrina or Internet Archive, but excessive for my purposes and difficult to set up.

I ended up rewriting all of warccollres in Python, using a self-contained database and assuming direct access to the files. There’s nothing particularly novel in there (see warccollres.py in the repository). WARC records are read from the archive and compared where they have the same hash to determine actual equality, and duplicates are marked as such.
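The core comparison can be sketched roughly as follows. This is not the code from warccollres.py: `mark_duplicates` and `weak_digest` are hypothetical names, the records are held in memory rather than read from an archive or database, and the digest is deliberately weak so that a collision actually occurs in the example:

```python
from collections import defaultdict

def weak_digest(payload):
    """A deliberately poor hash, used only to force a collision below."""
    return sum(payload) % 8

def mark_duplicates(records):
    """records: iterable of (offset, digest, payload) in archive order.

    Returns {offset: offset_of_first_identical_record} for each duplicate.
    """
    seen = defaultdict(list)  # digest -> [(offset, payload)] with distinct payloads
    duplicates = {}
    for offset, digest, payload in records:
        for original_offset, original_payload in seen[digest]:
            if payload == original_payload:  # same hash AND same bytes
                duplicates[offset] = original_offset
                break
        else:
            # First record with this hash, or a collision with different bytes.
            seen[digest].append((offset, payload))
    return duplicates

payloads = [(0, b"abc"), (100, b"cba"), (200, b"abc")]
records = [(off, weak_digest(data), data) for off, data in payloads]
print(mark_duplicates(records))
```

Here the records at offsets 0 and 100 share a hash but differ in content, so only the record at offset 200 is marked as a copy of the one at offset 0.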

I originally imported everything into a sqlite database and did all the work in there (not importing file contents though– that would be very inefficient), but this was rather slow because sqlite tends to be slow on workloads that involve more than a little bit of writing to the database. With some changes I made it use a “real” database (MariaDB) which helped. After tuning some parameters on the database server to allow it to use much larger amounts of memory (innodb_buffer_pool_size..) and creating some indexes on the imported data, everything moved along at a nice clip.

As the process went on, it seemed to slow down- early on everything was I/O-bound and status messages were scrolling by too fast for me to see, but after a few hundred thousand records had been processed I could see a significant slowdown. Looking at resource usage, the database was the limiting factor.

It turned out that though I had created indexes in the database on the columns that get queried frequently, it was still performing a full table scan to satisfy the requirement that records be processed in the order in which they appear in the WARC file. (I determined this by manually running some queries and having mariadb ANALYZE them for information on how it processed the query.) After creating a composite index of the copy_number and warc_offset columns (which I wasn’t even aware was possible until I read the grammar for CREATE INDEX carefully, and had to experiment to discover that the order in which they are specified matters), the process again became I/O-bound. Where the first 1.2 million records or so were processed in about 16 hours, the last 500 thousand were completed in only about an hour after I created that index.
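The effect of a composite index can be reproduced on a small scale. The sketch below uses Python's built-in sqlite3 as a stand-in for MariaDB, so the plan output differs from what ANALYZE prints there; the table layout, the index name idx_copy_offset and the query are assumptions for illustration, with only the column names taken from the text above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE warcsums ("
    "id INTEGER PRIMARY KEY, copy_number INTEGER, warc_offset INTEGER)"
)
conn.executemany(
    "INSERT INTO warcsums (copy_number, warc_offset) VALUES (?, ?)",
    [(n % 8 + 1, n * 512) for n in range(1000)],
)

QUERY = (
    "SELECT id, warc_offset FROM warcsums "
    "WHERE copy_number = 1 ORDER BY warc_offset"
)

def plan(sql):
    """Return the human-readable steps of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql)
    return " | ".join(row[-1] for row in rows)  # last column holds the detail text

before = plan(QUERY)
# Equality column first, ordering column second: matching rows can then be
# read from the index already sorted by offset.
conn.execute("CREATE INDEX idx_copy_offset ON warcsums (copy_number, warc_offset)")
after = plan(QUERY)

print("before:", before)
print("after: ", after)
```

Without the index the plan is a scan of the whole table followed by a sort; with it, the lookup and the ordering are both served by the index.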

### warcrefs

Compared to the earlier parts, warcrefs is a quite docile tool, perhaps in part because it’s implemented in Java. I made a few changes to the file describing how Maven should build it so I could get a jar file containing the program and all its required libraries which would be easy to run. With the file-offset issues in warcsum fresh in my mind, I proactively checked for similar issues in warcrefs and found it used int for file offsets throughout (which in Java is always a 32-bit value). I changed the relevant parts to use long instead, avoiding further problems with large files.

As I write this warccollres is still running on a large amount of data, so I can’t truly evaluate the capabilities of warcrefs. I did test it on a small archive which had some duplication and it was successful (verified by manual inspection [2]).

### warcrefs revisited

I’m writing this section after the above-mentioned run of warccollres finished and I got to run warcrefs over about 30 gigabytes of data. It turned out a few additional changes were required.

1. I forgot to recompile the jar after changing its use of file offsets to use longs, at which point I found the error reporting was awful in that the program only printed the error message and nothing else. It bailed out on reaching a file offset not representable as an int, but I couldn’t tell that until I made it print a proper stack trace.
2. Portions of revisit records were processed as strings but have lengths in bytes. Where multibyte characters are used this yields a wrong size. Fortunately, the WARC library used to write output checks these so I just had to fix it to use byte lengths everywhere.
3. Reading records to deduplicate reopened the input file for every record and never closed them, causing the program to eventually reach the system open file limit and fail. I had to make it close those.
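The second issue in the list above is easy to demonstrate. This sketch is in Python rather than Java, and `lengths` is a hypothetical helper, but the distinction between counting characters and counting encoded bytes is the same:

```python
def lengths(text):
    """Return (number of characters, number of bytes when encoded as UTF-8)."""
    return len(text), len(text.encode("utf-8"))

print(lengths("example.com/cafe"))       # ASCII only: the two counts agree
print(lengths("example.com/caf\u00e9"))  # one accented character: they differ
```

A length field computed from the character count would be one byte short for the second URL, which is exactly the kind of mismatch the WARC library caught.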

## Results

I got surprisingly good savings out of deduplication on my initial large dataset. Turns out web browser history has a lot more duplication than a typical archive: about 50% on my data, where Eldakar cited a number closer to 15% for general archives.

    $ ls -lh
    total 47G
    -rw-r--r-- 1 tari users  14G Jan 18 15:07 mega_dedup.warc.gz
    -rw-r--r-- 1 tari users  33G Jan 17 10:45 mega.warc.gz
    -rw-r--r-- 1 tari users 275M Jan 17 11:39 mega.warcsum
    -rw-r--r-- 1 tari users 415M Jan 18 13:50 warccollres.txt

The input file was 33 gigabytes, reduced to only 14 after deduplication. I’ve manually checked that all the records appear to be there, so that appears to be true deduplication only. There are 1709118 response records in the archive (that’s the number of lines in the warcsum file), with only 210467 unique responses [3], making an average of about 8 copies per response. Perhaps predictably, this implies that the duplicated records tend to be small since the overall savings was much less than 8 times.

## Improvements

At this point deduplication is not a very automated process, since there are three different programs involved and a database must be set up. This would be relatively easy to script, but it hasn’t yet seen enough use for me to be confident in its ability to run unattended.

There are some inefficiencies, especially in warccollres.py which decompresses records in their entirety into memory (where it could stream them or back them with real files to reduce memory requirements for large records). It also requires that there be only one WARC file under consideration, which was a concession to simplicity of implementation.

In the downloading process, I found that it will sometimes get hung up on streams, particularly streaming audio like Hutton Orbital Radio where the actual stream URL appears in browser history. The result of that kind of thing is downloading a “file” of unbounded size at a rather low speed (since it’s delivered only as fast as the audio will be played back).

wpull is a useful tool to replace wget with (that is also mostly compatible, for convenience) which can help address these issues.
It supports custom scripts to control its operation in a more fine-grained way, which would probably permit detection of streams so they don’t get downloaded. Also attractive is wpull’s support for running Javascript in downloaded pages, which allows it to capture data that is not served “baked in” to a web page as is often the case on modern web sites, especially “social” ones.

## Concluding

I ended up spending the majority of a weekend hammering out most of this code, from about 11:00 on Saturday through about 18:00 on Sunday with only about an hour total for food-breaks and a too-much-yet-not-enough 6-hour pause to sleep. I might not call it pleasant, but it’s a good feeling to build something like this successfully and before losing interest in it for an indeterminate amount of time.

I have long-term plans regarding software to automate archiving tasks like this one, and that was where my work here started early on Saturday. I’d hope that future manic chunks of time like this one will lead to further progress on that concept, but personal history says this kind of incredibly-productive block of time occurs at most a few times a year, and the target of my concentration is unpredictable [4]. Call it a goal to work toward, maybe: the ability to work on archiving as an occupation, rather than a sadly neglected hobby.

In any case, if you missed it, the collection of code I put together for deduplication is available on Bitbucket. The history-gathering portions I use are basically exactly as described in the relevant sections, leaving a lot of room for future improvement. Thanks for reading if you’ve come this far, and I hope you find my work useful!

1. I’m not entirely comfortable with that approach, since there is no particular guarantee that any record exists with the specified “coordinates” (time of retrieval and network location) in web-space.
However, this approach does maintain sanity even if a WARC is split into its individual records which is another important consideration.
2. WARC files are mostly plain text with possibly-binary network traffic in between, so it’s relatively easy to browse them with tools like zless and verify everything looks correct. It’s quite convenient, really.
3. SELECT count(id) FROM warcsums WHERE copy_num = 1
4. In fact, the last time I did something like this I (re)wrote a large amount of chat infrastructure which I still have yet to finish writing up for this blog.

# GStreamer's playbin, threads and queueing

I’ve been working on a project that uses GStreamer to play back audio files in an automatically-determined order. My implementation uses a playbin, which is nice and easy to use. I had some issues getting it to continue playback on reaching the end of a file, though.

According to the documentation for the about-to-finish signal:

> This signal is emitted when the current uri is about to finish. You can set the uri and suburi to make sure that playback continues.
>
> This signal is emitted from the context of a GStreamer streaming thread.

Because I wanted to avoid blocking a streaming thread under the theory that doing so might interrupt playback (the logic in determining what to play next hits external resources so may take some time), my program simply forwarded that message out to be handled in the application’s main thread by posting a message to the pipeline’s bus.

Now, this approach appeared to work, except it didn’t start playing the next URI, and the pipeline never changed state- it was simply wedged. Turns out that you must assign to the uri property from the same thread, otherwise it doesn’t do anything.

Fortunately, it turns out that blocking that streaming thread while waiting for data isn’t an issue (determined by experiment by simply blocking the thread for a while before setting the uri).

# Chainloading Truecrypt

I recently purchased a new laptop computer (a Lenovo Thinkpad T520), and wanted to configure it to dual-boot between Windows and Linux. Since this machine is to be used “on the go”, I also wanted to have full encryption of any operating systems on the device. My choices of tools for this are Truecrypt on the Windows side, and dm_crypt with LUKS on Linux. Mainly due to rather troublesome design on the Windows side of this setup, it was not as easy as I might have hoped. I did eventually get it working, however.

## Admonishment

Truecrypt was ["Discontinued"](https://www.grc.com/misc/truecrypt/truecrypt.htm) in 2014, but still works okay. VeraCrypt is substantially a drop-in replacement if you’re looking for a piece of software that is still actively maintained. As of this update (early 2017) the only non-commercial option for an encrypted Windows system booted from UEFI is Windows’ native BitLocker (with which dual-booting is possible but it won’t be possible to read the encrypted Windows partition from Linux), but if you’re booting via legacy BIOS these instructions should still work for TrueCrypt or VeraCrypt.

# Windows

Installing Windows on the machine was easy enough, following the usual installation procedure. I created a new partition to install Windows to filling half of the disk, and let it do its thing. Downloading and installing Truecrypt is similarly easy. From there, I simply chose the relevant menu entry to turn on system encryption.

The first snag appeared when the system encryption wizard refused to continue until I had burned an optical disk containing the recovery information (in case the volume headers were to get corrupted). I opted to copy the iso file to another location, with the ability to boot it via grub4dos if necessary in the future (or merely burn a disc as necessary).
The solution to this was to re-invoke the volume creation wizard with the noisocheck option:

    C:\Program Files\TrueCrypt>TrueCrypt Format.exe /noisocheck

One reboot followed, and I was able to let TrueCrypt go through and encrypt the system. It was then time to set up Linux.

# Linux

Basic setup of my Linux system was straightforward. Arch (my distribution of choice) offers good support for LUKS encryption of the full system, so most of the installation went smoothly. On reaching the bootloader installation phase, I let it install and configure syslinux (my loader of choice simply because it is easier to configure than GRUB), but did not install it to the MBR. With the installation complete, I had to do some work to manually back up the MBR installed by Truecrypt, then install a non-default MBR for Syslinux.

First up was backing up the Truecrypt MBR to a file:

    # dd if=/dev/sda of=/mnt/boot/tc.bs count=1

That copies the first sector of the disk (512 bytes, containing the MBR and partition table) to a file (tc.bs) on my new /boot partition.

Before installing a Syslinux MBR, I wanted to ensure that chainloading the MBR from a file would work. To that end, I used the installer to chainload to my new installation, and used that to attempt loading Windows. The following incantation (entered manually from the syslinux prompt) eventually worked:

    .com32 chain.c32 hd0 1 file=/tc.bs

Pulling that line apart, I use the chainloader to boot the file tc.bs in the base of my /boot partition, and load the first partition on my first hard drive (that is, where Windows is installed). This worked, so I booted once more into the installer to install the Syslinux MBR:

    # dd if=/usr/lib/syslinux/mbr.bin of=/dev/sda bs=1 count=440 conv=notrunc

This copies 440 bytes from the given file to my hard drive, where 440 bytes is the size of the MBR boot code.
The input file is already that size so the count parameter should not be necessary, but one cannot be too careful when doing such modification to the MBR.

Rebooting that, sadly, did not work. It turns out that the Syslinux MBR merely scans the current hard drive for partitions that are marked bootable, and boots the first one. The Truecrypt MBR does the same thing, which is troublesome: in order for Truecrypt to work the Windows partition must be marked bootable, but Syslinux is unable to find its configuration when this is the case.

Enter altmbr.bin. Syslinux ships several different MBRs, and the alternate does not scan for bootable partitions. Instead, the last byte of the MBR is set to a value indicating which partition to boot from. Following the example from the Syslinux wiki (linked above), then, I booted once more from my installer and copied the altmbr into position:

    # printf '\x5' | cat /usr/lib/syslinux/altmbr.bin - | dd bs=1 count=440 conv=notrunc of=/dev/sda

This shell pipeline echoes a single byte of value 5, appends it to the contents of altmbr.bin, and writes the resulting 440 bytes to the MBR on sda. The 5 comes from the partition Syslinux was installed on, in this case the first logical partition on the disk (/dev/sda5).

With that, I was able to boot Syslinux properly and it was a simple matter to modify the configuration to boot either Windows or Linux on demand. Selected parts of my syslinux.cfg file follow:

    UI menu.c32

    LABEL arch
        MENU LABEL Arch Linux
        LINUX /vmlinuz-linux
        APPEND root=/dev/mapper/Homura-root cryptdevice=/dev/sda6:HomuHomu ro
        INITRD /initramfs-linux.img

    LABEL windows
        MENU LABEL Windows 7
        COM32 chain.c32
        APPEND hd0 1 file=/tc.bs

# Further resources

For all things Syslinux, the documentation wiki offers documentation sufficient for most purposes, although it can be somewhat difficult to navigate. A message from the Syslinux mailing list gave me the key to making Syslinux work from the MBR.
The Truecrypt documentation offered some interesting information, but was surprisingly useless in the quest for a successful chainload (indeed, the volume creation wizard very clearly states that using a non-truecrypt MBR is not supported).

# High-availability /home revisited

About a month ago, I wrote about my experiments in ways to keep my home directory consistently available. I ended up concluding that DRBD is a neat solution for true high-availability systems, but it’s not really worth the trouble for what I want to do, which is keeping my home directory available and in-sync across several systems.

Considering the problem more, I determined that I really value a simple setup. Specifically, I want something that uses very common software, and is resistant to network failures. My local network going down is an extremely rare occurrence, but it’s possible that my primary workstation will become a portable machine at some point in the future- if that happens, anything that depends on a constant network connection becomes hard to work with.

If an always-online option is out of the question, I can also consider solutions which can handle concurrent modification (which DRBD can do, but requires using OCFS, making that solution a no-go).

## Rsync

rsync is many users’ first choice for moving files between computers, and for good reason: it’s efficient and easy to use. The downside in this case is that rsync tends to be destructive: because the source of a copy operation is taken to be the canonical version, any modifications made in the destination will be wiped out. I already have regular cron jobs running incremental backups of my entire /home so the risk of rsync permanently destroying valuable data is low. However, being forced to recover from backup in case of accidental deletions is a hassle, and increases the danger of actual data loss.
In that light, a dumb rsync from the NAS at boot-time and back to it at shutdown could make sense, but carries undesirable risk. It would be possible to instruct rsync to never delete files, but the convenience factor is reduced, since any file deletions would have to be done manually after boot-up. What else is there?

## Unison

I eventually decided to just use Unison, another well-known file synchronization utility. Unison is able to handle non-conflicting changes between destinations as well as intelligently detect which end of a transfer has been modified. Put simply, it solves the problems of rsync, although there are still situations where it requires manual intervention. Those are handled with reasonable grace, however, with prompting for which copy to take, or the ability to preserve both and manually resolve the conflict.

Knowing Unison can do what I want and with acceptable amounts of automation (mostly only requiring intervention on conflicting changes), it became a simple matter of configuration. Observing that all the important files in my home directory which are not already covered by some other synchronization scheme (such as configuration files managed with Mercurial) are only in a few subdirectories, I quickly arrived at the following profile:

    root = /home/tari
    root = /media/Caring/sync/tari
    path = incoming
    path = pictures
    path = projects
    path = wallpapers

Fairly obvious function here, the two sync roots are /home/tari (my home directory) and /media/Caring/sync/tari (the NAS is mounted via NFS at /media/Caring), and only the four listed directories will be synchronized. An easy and robust solution.

I have yet to configure the system for automatic synchronization, but I’ll probably end up simply installing a few scripts to run unison at boot and when shutting down, observing that other copies of the data are unlikely to change while my workstation is active. Some additional hooks may be desired, but I don’t expect configuration to be difficult.
If it ends up being more complex, I’ll just have to post another update on how I did it.

Update Jan. 30: I ended up adding a line to my rc.local and rc.shutdown scripts that invokes unison:

    su tari -c "unison -auto home"

Note that the Unison profile above is stored as ~/.unison/home.prf, so this handles syncing everything I listed above.

# Experiments with a high-availability /home

I was recently experimenting with ways to configure my computing setup for high availability of my personal data, which is stored in a Btrfs-formatted partition on my SSD. When my workstation is booted into Windows, however, I want to be able to access my data with minimal effort. Since there’s no way to access a Btrfs volume natively from within Windows, I had to find another approach. It seemed like automatically syncing files out to my NAS was the best solution, since that’s always available and independent of most other things I would be doing at any time.

# Candidates

The obvious first option for syncing files to the NAS is the ever-common rsync. It’s great at periodic file transfers, but real-time syncing of modifications is rather beyond the ken of rsync. lsync provides a reasonable way to keep things reasonably in-sync, but it’s far from an elegant solution. Were I so motivated, it would be reasonable to devise a similar rsync wrapper using inotify (or similar mechanisms) to only handle modified files and possibly even postpone syncing changes until some change threshold is exceeded. With existing software, however, rsync is a rather suboptimal solution.

From a cursory scan, cluster filesystems such as ceph or lustre seem like good options for tackling this problem. The main disadvantage of the cluster filesystem approach, however, is rather high complexity. Most cluster filesystem implementations require a few layers of software, generally both a metadata server and storage server. In large deployments that software stack makes sense, but it’s needless complexity for me.
In addition, ensuring that data is correctly duplicated across both systems at any given time may be a challenge. I didn’t end up trying this route so ensuring data duplication may be easier than it seems, but a cluster filesystem ultimately seemed like needless complexity for what I wanted to do.

While researching cluster filesystems, I discovered xtreemfs, which has a number of unique features, such as good support for wide-area storage networks, and is capable of operating securely even over the internet. Downsides of xtreemfs are mostly related to the technology it’s built on, since the filesystem itself is implemented with Linux’s FUSE (Filesystem in USErspace) layer and is implemented in Java. Both those properties make it rather clunky to work with and configure, so I ended up looking for another solution after a little time spent attempting to build and configure xtreemfs.

The solution I ultimately settled upon was DRBD, which is a block-level replication tool. Unlike the other approaches, DRBD sits at the block level (rather than the filesystem level), so any desired filesystem can be run on top of it. This was a major advantage to me, because Btrfs provides a few features that I find important (checksums for data, and copy-on-write snapshotting). Handling block-level syncing is necessarily somewhat more network-intensive than running at the file level, but since I was targeting use over a gigabit LAN, network usage was a peripheral concern.

# Implementation

From the perspective of normal operation, a DRBD volume looks like RAID 1 running over a network. One host is marked as the primary, and any changes to the volume on that host are propagated to the secondary host. If the primary goes offline for whatever reason, the secondary system can be promoted to the new primary, and the resource stays available.
In my intended use of DRBD, my workstation machine would be the primary in order to achieve normal I/O performance while still replicating changes to the NAS. Upon taking the workstation down for whatever reason (usually booting it into another OS), all changes should be on the NAS, which remains active as a lone secondary. DRBD doesn’t allow secondary volumes to be used at all (mainly since that would introduce additional concerns to ensure data integrity), so in order to mount the secondary and make it accessible (such as via a Samba share) the first step is to mark the volume as primary.

I was initially cautious about how bringing the original primary back online would affect synchronization, but it turned out to handle such a situation gracefully. When the initial primary (workstation) comes back online following promotion of the secondary (NAS), the former primary is demoted back to secondary status, which also ensures that any changes while the workstation was offline are correctly mirrored back. While the two stores are resyncing, it is possible to mark the workstation as primary once more and continue normal operation while the NAS’ modifications sync back.

Given that both my NAS and workstation machines run Arch Linux, setup of DRBD for this scheme was fairly simple. First order of business was to create a volume to base DRBD on. The actual DRBD driver is part of mainline Linux since version 2.6.33, so having the requisite kernel module loaded was easy. The userspace utilities are available in the AUR, so it was easy to get those configured and installed.
Finally, I created a resource configuration file as follows:

    resource home {
      device /dev/drbd0;
      meta-disk internal;
      protocol A;
      startup {
        become-primary-on Nakamura;
      }
      on Nakamura {
        disk /dev/Nakamura/home;
        address ipv4 192.168.1.10:7789;
      }
      on Nero {
        disk /dev/loop0;
        address ipv4 192.168.1.8:7789;
      }
    }

The device option specifies what name the DRBD block device should be created with, and meta-disk internal specifies that the DRBD metadata (which contains such things as the dirty bitmap for syncing modified blocks) should be stored within the backing device, rather than in some external file. The protocol line specifies asynchronous operation (don’t wait for a response from the secondary before returning saying a write is complete), which helps performance but makes the system less robust in the case of a sudden failure. Since my use case is less concerned with robustness and more with simple availability and maintaining performance as much as possible, I opted for the asynchronous protocol. The startup block specifies that Nakamura (the workstation) should be promoted to primary when it comes online.

The two on blocks specify the two hosts of the cluster. Nakamura’s volume is backed by a Linux logical volume (in the volume group ‘Nakamura’), while Nero’s is hosted on a loop device. I chose to use a loop device on Nero simply because the machine has a large amount of storage (6TB in RAID5), but no unallocated space. In using a loop device I ended up ignoring a warning in the DRBD manual about running it over loop block devices causing deadlocks- this ended up being a poor choice, as described later.

It was a fairly simple matter of bringing the volumes online once I had written the configuration. Load the relevant kernel module, and use the userland utilities to set up the backing device. Finally, bring the volume up. Repeat this series of steps again on the other host.

    # modprobe drbd
    # drbdadm create-md home
    # drbdadm up home

With the module loaded and a volume online, status information is visible in /proc/drbd, looking something like the following (shamelessly taken from the DRBD manual):

    $ cat /proc/drbd
    version: 8.3.0 (api:88/proto:86-89)
    GIT-hash: 9ba8b93e24d842f0dd3fb1f9b90e8348ddb95829 build by buildsystem@linbit, 2008-12-18 16:02:26
     0: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r---
        ns:0 nr:8 dw:8 dr:0 al:0 bm:2 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

The first few lines provide version information, and the two lines beginning with ‘0:’ describe the state of a DRBD volume. Of the rest of the information, we can see that both hosts are online and communicating (Connected), both are currently marked as secondaries (Secondary/Secondary), and both have the latest version of all data (UpToDate/UpToDate).

The last step in creating the volume is to mark one host as primary. Since this is a newly-created volume, marking one host as primary requires invalidation of the other, prompting resynchronization of the entire device. I execute drbdadm primary --force home on Nakamura to mark that host as having the canonical version of the data, and the devices begin to synchronize.
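For scripting around this (for instance, to check whether a host is safe to promote), the fields described above can be pulled out of the status line with a few lines of Python. This is only a sketch: `parse_status` is a hypothetical helper, it handles just the cs/ro/ds fields from the sample output, and it assumes the local host's value is listed before the peer's in each pair:

```python
import re

# Sample status line, copied from the /proc/drbd output shown earlier.
STATUS = "0: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r---"

def parse_status(line):
    """Extract connection state, roles and disk states from a status line."""
    # Collect every key:value token, such as cs:Connected.
    fields = dict(re.findall(r"(\w+):(\S+)", line))
    local_role, peer_role = fields["ro"].split("/")
    local_disk, peer_disk = fields["ds"].split("/")
    return {
        "connection": fields["cs"],
        "local_role": local_role,
        "peer_role": peer_role,
        "local_disk": local_disk,
        "peer_disk": peer_disk,
    }

state = parse_status(STATUS)
print(state)
```

A boot script could, for example, refuse to mount the volume unless the connection is Connected and the local disk is UpToDate.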

Once everything is set, it becomes possible to use the DRBD block device (/dev/drbd0 in my configuration) like any other block device- create filesystems, mount it, or write random data to it. With a little work to invoke the DRBD initscripts at boot time, I was able to get everything working as expected. There were a few small issues with the setup, though:

• Nero (the NAS) required manual intervention to be promoted to the primary role. This could be improved by adding some sort of hooks on access to promote it to primary and mount the volume. This could probably be implemented with autofs for a truly transparent function, or even a simple web page hosted by the NAS which prompts promotion when it is visited.
• Deadlocks! I mentioned earlier that I chose to ignore the warning in the manual about deadlocks when running DRBD on top of loop devices, and I did start seeing some on Nero. All I/O on the volume hosting the loop device on Nero would stall, and the only way out was by rebooting the machine.

# Conclusion

DRBD works for keeping data in sync between two machines in a transparent fashion, at the cost of a few more software requirements and a slight performance hit. The kernelspace tools are in mainline Linux so should be available in any reasonably recent kernel, but availability of the userspace utilities is questionable. Fortunately, building them for oneself is fairly easy. Provided the drbd module is loaded, it is not necessary to use the userspace utilities to bring the volume online- the backing block device can be mounted without DRBD, but the secondary device will need to be manually invalidated upon reconnect. That’s useful for ensuring that it’s difficult for data to be rendered inaccessible, since the userspace utilities are not strictly needed to get at the data.

I ultimately didn’t continue running this scheme for long, mainly due to the deadlock issues I had on the NAS. Those could have been resolved with some time spent reorganizing the storage on that host, but I decided that wasn’t worth the effort. To achieve a similar effect, I ended up configuring a virtual machine on my Windows installation that has direct access to the disks which have Linux-hosted data, so I can boot the physical Linux installation in a virtual machine. By modifying the initscripts a little, I configured it to start Samba at boot time when running virtualized in order to give access to the data. The virtualized solution is a bit more of a hack than DRBD and is somewhat less robust (an unexpected shutdown brings two operating systems down hard), but I think the relative simplicity and absence of a network tether are a reasonable compromise.

Were I to go back to a DRBD-backed solution at some time, I might want to look into using DRBD in dual-primary mode. In most applications only a single primary can be used since most filesystems are designed without the locking required to allow multiple drivers to operate on them at the same time (this is why NFS and similar network filesystems require lock managers). Using a shared-disk filesystem such as OCFS (or OCFS2), DRBD is capable of having both hosts in primary mode, so the filesystem can be mounted and modified on both hosts at once. Using dual primaries would simplify the promotion scheme (each host must simply be promoted to primary when it comes online), but would also require care to avoid split-brain situations (in which communications are lost but both hosts are still online and processing I/O requests, so they desync and require manual intervention to resolve conflicts). I didn’t try OCFS2 at all during this experiment mainly because I didn’t want to stop using btrfs as my primary filesystem.

To conclude, DRBD works for what I wanted to do, but deadlocks while running it on a loop device kept me from using it for long. The virtual machine-based version of this scheme performs well enough for my needs, despite being rather clunky to work with. I will keep DRBD in mind for similar uses in the future, though, and may revisit the issue at a later date when my network layout changes.

Update 26.1.2012: I’ve revisited this concept in a simpler (and less automatic) fashion.

# How not to distribute software

I recently acquired a TI eZ430-Chronos watch/development platform. It’s a pretty fancy piece of kit just running the stock firmware, but I got it with hacking in mind, so of course that’s what I set out to do. Little did I know that TI’s packaging of some of the related tools is a good lesson in what not to do when packaging software for users of any system that isn’t Windows.

The first thing to do when working with a new platform is usually to try out the sample applications, and indeed in this case I did exactly that. TI helpfully provide a distribution of the PC-side software for communicating with the Chronos that runs on Linux, but things cannot be that easy. What follows is a loose transcript of my session to get slac388a unpacked so I could look at the provided code.

$ unzip slac388a.zip
$ ls
Chronos-Setup
$ chmod +x Chronos-Setup
$ ./Chronos-Setup
$ # Oh, it did nothing. Maybe it segfaulted silently because it’s poorly written?
$ dmesg | tail
[snip]
[2591.111811] [drm] force priority to high
[2591.111811] [drm] force priority to high
$ file Chronos-Setup
Chronos-Setup: ELF 32-bit LSB executable, Intel 80386, version 1 (GNU/Linux), statically linked, stripped
$ gdb Chronos-Setup
GNU gdb (GDB) 7.3
Copyright (C) 2011 Free Software Foundation, Inc.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
(no debugging symbols found)...done.
(gdb) r
Starting program: /home/tari/workspace/chronos-tests/Chronos-Setup
[Inferior 1 (process 9214) exited with code 0177]

Great. It runs and exits with code 127. How useful.
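For anyone wondering how 0177 becomes 127: gdb prints the exit status with a leading zero, which marks it as octal. A quick worked check of the arithmetic:

```python
# gdb reported "exited with code 0177"; the leading zero means octal.
reported = "0177"
exit_code = int(reported, 8)  # 1*64 + 7*8 + 7
print(exit_code)
```

Status 127 is the value shells conventionally use when a command cannot be found or run. Whether that convention is what produced it here is my guess, since the installer itself said nothing.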

I moved the program over to a 32-bit system, and of course it worked fine, although that revealed a stunningly brain-dead design decision. The following image says everything.

To recap, this was a Windows-style self-extracting installer packed in a zip archive upon initial download, designed to run on a 32-bit Linux system, which failed silently when run on a 64-bit system. I am simply stunned by the bad design.
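The 32-bit versus 64-bit distinction that file reported is visible in the first few bytes of any ELF binary, so an installer could check it and complain instead of failing silently. A minimal sketch of that check, assuming only the standard ELF identification header; the function names and the hand-built sample headers are illustrative.

```python
ELF_MAGIC = b"\x7fELF"
EI_CLASS = 4  # index of the class byte within the ELF identification bytes
CLASS_NAMES = {1: "32-bit", 2: "64-bit"}

def elf_class(header):
    """Return '32-bit' or '64-bit' for ELF header bytes, else None."""
    if len(header) <= EI_CLASS or header[:4] != ELF_MAGIC:
        return None  # too short, or not an ELF file at all
    return CLASS_NAMES.get(header[EI_CLASS])

def elf_class_of_file(path):
    """Read the identification bytes of a file on disk and classify it."""
    with open(path, "rb") as f:
        return elf_class(f.read(16))

# Hand-built 16-byte identification headers, for illustration only.
print(elf_class(b"\x7fELF\x01\x01\x01" + bytes(9)))
print(elf_class(b"\x7fELF\x02\x01\x01" + bytes(9)))
```

Given the file output in the transcript, elf_class_of_file("Chronos-Setup") would be expected to report a 32-bit binary.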

Bonus tidbit: it unpacked an uninstaller in the directory of source code and compiled demo applications, as if whoever packaged it decided the users (remember, this is an embedded development demo board so it’s logical to assume the users are fairly tech-savvy) were too clueless to delete a single directory when the contents were no longer wanted. I think the only possible reaction is a hearty :facepalm:.

# Pointless Linux Hacks

I nearly always find it interesting to muck about in someone else’s code, often to add simple features or to make it do something silly, and the Linux kernel is no exception to that. What follows is my own first adventure into patching Linux to do my evil bidding.

Aside from mucking about in code for fun, digging through public source code such as that provided by Linux can be very useful when developing something new.

## A short story

I was doing nothing of particular importance yesterday afternoon when I was booting up my previously mentioned netbook. The machine usually runs on a straight framebuffer powered by KMS on i915 hardware, and my kernel is configured to show the famous Tux logo while booting.

Readers familiar with the logo behaviour might already see where I’m going with this: the kernel typically displays one copy of the logo for each processor in the system (so a uniprocessor machine shows one Tux, a quad-core shows four, and so on). As a bit of a joke, a friend suggested patching my kernel to make the machine look much more powerful than it really is. Of course, that’s exactly what I did, and here’s the patch for Linux 2.6.38.

--- drivers/video/fbmem.c.orig	2011-04-14 07:26:34.865849376 -0400
+++ drivers/video/fbmem.c	2011-04-13 13:06:28.706011678 -0400
@@ -635,7 +635,7 @@
 	int y;
 
 	y = fb_show_logo_line(info, rotate, fb_logo.logo, 0,
-			      num_online_cpus());
+			      4 * num_online_cpus());
 	y = fb_show_extra_logos(info, y, rotate);
 
 	return y;


Quite simply, my netbook now pretends to have an eight-core processor (the Atom with SMT reports two logical cores) as far as the visual indications go while booting up.

## Source-diving

Thus we come to source-diving, a term I’ve borrowed from the community of Nethack players to describe the process of searching for the location of a particular piece of code in some larger project.

Diving in someone else’s source is frequently useful, although I don’t have any specific examples of it in my own work at the moment. For an outside example, have a look at musca, a tiling window manager for X that was written from scratch but used ratpoison and dwm (two other X window managers) as models:

Musca’s code is actually written from scratch, but a lot of useful stuff was gleaned from reading the source code of those two excellent projects.

A personal recommendation for anyone seeking to go source-diving: become good friends with grep. In the case of my patch above, the process went something like this:

• grep -R LOGO_LINUX linux-2.6.38/ to find all references to LOGO_LINUX in the source tree.
• Examine the related files, find drivers/video/fbmem.c, which contains the logo display code.
• Find the part which controls the number of logos to display by searching that file for ‘cpu’, assuming (correctly) that it must call some outside function to get the number of CPUs active in the system.
• Patch line 638 (for great justice).
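The first step above is nothing more than a recursive text search. Roughly the same search can be sketched as follows; the function name grep_recursive is hypothetical, and this does a plain substring match rather than reimplementing grep.

```python
import os

def grep_recursive(needle, root):
    """Yield (path, line_number, line) for each line under root containing needle."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                # errors="replace" keeps binary or oddly-encoded files from raising.
                with open(path, "r", errors="replace") as f:
                    for number, line in enumerate(f, start=1):
                        if needle in line:
                            yield path, number, line.rstrip("\n")
            except OSError:
                continue  # unreadable file: skip it and carry on

if __name__ == "__main__":
    # Same search as the grep invocation in the list above.
    for path, number, line in grep_recursive("LOGO_LINUX", "linux-2.6.38/"):
        print("%s:%d:%s" % (path, number, line))
```

In practice grep is faster and already installed, so this is mostly useful as a starting point when the search needs extra logic, such as filtering by file type.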

Next up in my source-diving adventures will be finding the code which controls what happens when the user presses Control+Alt+Delete, in anticipation of sometime rewriting fb-hitler into a standalone kernel rather than a program running on top of Linux.