# Fomu: a beginner's guide

FPGAs are pretty cool pieces of hardware for tinkering with, and have become remarkably easy to approach as a hobbyist in recent years. Boards like the TinyFPGA BX don’t require any special hardware to use and can provide a simple platform for modestly-scoped projects or just for learning.

While the software tools for programming FPGAs have historically been proprietary and provided by the hardware manufacturer, Symbiflow (enabled and probably inspired by earlier work like Project IceStorm) provides completely free and open-source tooling and documentation for programming some FPGAs. This significantly lowers the cost of entry (most vendors provide some free version of their design software limited to lower-end devices; a license for the non-free version is well into the realm of “if you have to ask, you can’t afford it”) and appears to yield better results in many cases.1

As somebody who finds it fun to learn new things and experiment with new kinds of creations, I find FPGAs quite interesting- they’re complex devices that enable very powerful creations, with excellent depth for mastery. While I did some course lab work with Altera FPGAs in university (and a little bit of chip design/layout later), I’d call those mostly canned tasks with easily-understood requirements and problem-solving approaches; it was sufficient to familiarize myself with the systems, but not enough to be particularly useful.

The announcement of Fomu caught my interest because I was aware of the earlier Tomu but wasn’t sufficiently interested to try to acquire any hardware. With Fomu however, I’m rather more interested because it enables interesting capabilities for playing with hardware- others have already demonstrated small RISC-V CPUs running in that FPGA (despite its modest logic capacity), for instance.

Even more conveniently for being able to play with Fomu, I’ve been in contact with Mithro (who is approximately half of the team behind Fomu) and gotten access to a stockpile of “hacker edition” boards that have been hand-assembled but not programmed at all. With slightly early access to hardware, I’ve been able to do some exploration, re-familiarize myself with the world of digital logic design, and figure out the hardware.

## Hardware

In summary, Fomu is a small (9.4 by 13 by 0.6 millimeters) circuit board with a Lattice ICE40UP5K-UWG30 FPGA, a 16-megabit SPI Flash for configuration (and other data) storage, a single RGB LED for blinkiness and a 48 MHz MEMS oscillator to provide a clock.

The whole thing is built so it can fit inside a standard USB port. Production boards are meant to ship with a USB bootloader that allows new configurations to be uploaded to the board only via that USB connection, but hacker boards are provided completely unprogrammed (and untested).

### Schematic

Before we can make the hardware do something, we’ll need to understand how everything is put together:

Unfortunately, this schematic leaves some things to be desired. While it does allow us to see what parts are actually on the board and generally how they’re connected, it fails to clearly mark the external connections- power and data lines on the USB connector, test points and utility I/O pads.

The USB connections are easy to figure out, however; it’s a standard pinout so we can easily identify which physical pads on the board correspond to VUSB (5V supply), ground and the two data lines (USBP, USBN). Rather trickier to work out is the function of each of the test points on the board, though there is a provided template for laser-cutting a programming jig which provides some hints:

This jig is meant to be built up by stacking four layers of material, engraving a small pocket in the bottom to hold the board to program and inserting pogo pins in the small holes to contact with the test points on the board. This template helps us in that it has labels for the test points, though! All of the test points are clearly identified, except it’s unclear what voltage is expected on the power supply.

By inspecting the board myself, I eventually determined that the test point for supplying power (marked VCC on the programming jig template) is downstream of the 3.3V regulator (not connected to the USB power supply pad) so it expects 3.3 Volts for programming.

### Convenient pinout diagram

By way of improving the schematic, here’s that same photo of the board with the signal names from the schematic pointed out on each of the pads, and the individual chips pointed out.

And the same in tabular form for easy searching:

| Silkscreen | Schematic | Description |
|------------|-----------|-------------|
| V | +3V3 | 3.3V rail |
| S | CS | SPI chip select (active low) |
| C | SCK | SPI clock |
| I | MISO | SPI MISO |
| O | MOSI | SPI MOSI |
| R | CRESET_B | FPGA reset (active low) |
| G | GND | Ground |
| 1 | PIN1 | User I/O 1 |
| 2 | PIN2 | User I/O 2 |
| 3 | PIN3 | User I/O 3 |
| 4 | PIN4 | User I/O 4 |

## Bootstrapping

My first task in attempting to bootstrap a board and load some configuration on it was building a programming jig. Given there was already a template for a laser-cut acrylic one and I have access to a benchtop laser cutter, this was easy:

It’s a little bit ugly because the pogo pins I had ready access to are too small to fit nicely in the laser-cut holes, so I had to carefully glue them in place.

Actually programming a board can be done with the fomu-flash utility running on a Raspberry Pi. I conveniently had a Raspberry Pi 2 to hand, so with a little wiring to the Pi’s GPIO header I had a jig that should work. Unfortunately, it didn’t- all I got when trying to make it identify the on-board flash chip was 1s:

fomu-flash -i
Manufacturer ID: unknown (ff)
Memory model: unknown (ff)
Memory size: unknown (ff)
Device ID: ff
Serial number: ff ff ff ff
Status 1: ff
Status 2: ff
Status 3: ff

I gave up on that hardware after spending a while experimenting with it, and decided to design a custom programming jig that might be a little easier to ensure pin alignment is good. This is a little bit tricky because the minimum pitch of the test points is just 1.8 mm, which is not large enough for the 0.1-inch (2.54 mm) DuPont connectors commonly used for prototyping and desirable in this case because they’re very easy to connect to the Raspberry Pi’s GPIO header.

### A better jig

Fortunately, I also have access to rather sophisticated prototyping tools and had some nice parts handy from other projects. In particular, a Form 2 stereolithographic 3D printer and some good pogo pins, Preci-dip 90155-AS. I computer-modeled a jig to be 3d-printed that should be both compact and robust, pictured below (see the end of this post for downloadable resources):

### Features

The design takes great advantage of the flexibility of 3d printing for fabrication: it is easy to install the pogo pins off-vertical by making the (press-fit) holes at an arbitrary angle; this would be very difficult with conventional fabrication, but it allows the spacing of the pins at the top of the jig to be large enough that 2.54mm connectors can be used, despite the pad spacing on the board being only 1.8mm.

On the bottom side, there are several narrow features that act as a shelf to support the board (which is 0.6mm thick) so its outside surface is flush with the bottom surface of the jig. A semicircular boss on one side mates with the cutout on the PCB to key the jig so it is obvious when the board is correctly oriented in the jig. A small cutout on one edge allows a tool to be inserted to pull the board out if needed, because the fit is close enough that it might stick sometimes (or a tool could be pushed through from the top).

As a manufacturability consideration, the top surface has a slant between opposite corners. This improves the print quality on a Form 2- because dimensions on that side are not critical the part is designed to be printed with that side “down” (actually up, once in the printer) and supports attached to it on that end. By allowing the printer to gradually build up a slope rather than immediately build a plane, it can better produce the intended shape- an earlier version of the design with a flat top had a very rough finish because large and thin layers of material tend to warp until enough material is built up to be self-supporting.

The choice of pogo pins in particular is key, since they’re made with a small shoulder and retaining barbs that allow them to be easily press-fit into a connector shell:

The one downside of these pins is the short tail, intended for mounting to a circuit board. While the aforementioned DuPont connectors can be mated to the tail, they are not very secure and come off at the slightest force. A revised design choosing parts for their function and not just immediate availability might prefer to use a part like 90101-AS, which is intended for wire termination rather than board mounting- then wires can be securely attached to the pin rather than tenuously placed on it. My workaround that didn’t involve buying more parts was carefully gluing the wires in place, which seems to work okay.
## Programming

Having built a jig that I could be confident would work correctly, we now return to the problem of actually programming the board. Connecting the new jig to my Raspberry Pi in the same way I did the first one, it failed in the same way- reading all 1s. At this point I was rather stumped, with a few possible explanations for the problem:

1. Both jigs are unreliable
2. I’m wiring the jigs up incorrectly
3. Software on my Pi is configured incorrectly
4. All of the Fomus I tried were faulty

To discount the first two possibilities, I was able to borrow Mithro’s professionally-built jig that already had a Raspberry Pi 3 connected to it. I didn’t have any credentials to log in to that Pi and use it interactively however, so I was limited to checking its wiring and carefully ensuring I connected my Pi to the jig in the same way, then trying to program again. This also failed.

Having tried that, I had to assume my Pi was somehow misconfigured, since it seemed increasingly unlikely that I was doing anything wrong and it seemed implausible that all of my boards were faulty. I eventually took the SD card out of the other jig’s Pi and inspected the software it would run by connecting it to another computer. This amounted to the same fomu-flash program I was using, so I inspected the system configuration in /boot/config.txt and found a variety of non-default options that seemed plausibly useful. Ultimately, I found some magic words:

dtparam=spi=on

This option makes the kernel on the Pi expose a hardware-assisted SPI peripheral, which seems like an obvious missing option until you realize that fomu-flash actually bit-bangs SPI because the hardware support is insufficient for this application. In any case, I did find that turning that option on makes everything work correctly:

fomu-flash -i
Memory model: AT25SF161 (86)
Memory size: 16 Mbit (01)
Device ID: 14
Serial number: ff ff ff ff
Status 1: 02
Status 2: 00
Status 3: ff

I reported the bug and made a note of this in the documentation so hopefully nobody else has to deal with that problem in the future, even if the root cause is mystifying.

## Success!

With the ability to talk to the configuration flash, it’s then possible to write an actual bitstream. To avoid needing to write one myself, it’s easy to take the LED blinker example from the fomu-tests repository:

fomu-tests/blink$ make FOMU_REV=hacker
...
Info: constrained 'rgb0' to bel 'X4/Y31/io0'
Info: constrained 'rgb1' to bel 'X5/Y31/io0'
Info: constrained 'rgb2' to bel 'X6/Y31/io0'
Info: constrained 'clki' to bel 'X6/Y0/io1'
Warning: unmatched constraint 'spi_mosi' (on line 5)
Warning: unmatched constraint 'spi_miso' (on line 6)
Warning: unmatched constraint 'spi_clk' (on line 7)
Warning: unmatched constraint 'spi_cs' (on line 8)
Info: constrained 'user_1' to bel 'X12/Y0/io1'
Info: constrained 'user_2' to bel 'X5/Y0/io0'
Info: constrained 'user_3' to bel 'X9/Y0/io1'
Info: constrained 'user_4' to bel 'X19/Y0/io1'
Warning: unmatched constraint 'usb_dn' (on line 13)
Warning: unmatched constraint 'usb_dp' (on line 14)
...
Info: Device utilisation:
Info:      ICESTORM_LC:    33/ 5280    0%
Info:     ICESTORM_RAM:     0/   30    0%
Info:            SB_IO:     5/   96    5%
Info:            SB_GB:     1/    8   12%
Info:     ICESTORM_PLL:     0/    1    0%
Info:      SB_WARMBOOT:     0/    1    0%
Info:     ICESTORM_DSP:     0/    8    0%
Info:   ICESTORM_HFOSC:     0/    1    0%
Info:   ICESTORM_LFOSC:     0/    1    0%
Info:           SB_I2C:     0/    2    0%
Info:           SB_SPI:     0/    2    0%
Info:           IO_I3C:     0/    2    0%
Info:      SB_LEDDA_IP:     0/    1    0%
Info:      SB_RGBA_DRV:     1/    1  100%
Info:   ICESTORM_SPRAM:     0/    4    0%
...
Built 'blink' for Fomu hacker
$ fomu-flash -w blink.bin
Erasing @ 018000 / 01969a  Done
Programming @ 01959a / 01969a  Done
$ fomu-flash -v blink.bin
Reading @ 01969a / 01969a  Done

Programming that to the board yields a blinking LED as expected, so I’ve achieved success in the basic form of this project by getting the FPGA to do something. Further exploration will involve writing gateware with Migen rather than straight Verilog (because I find Verilog to be rather tedious to write) and trying to build a system around a RISC-V CPU (because that sounds interesting).

## Resources

If you want to make your own copy of the programming jig or just explore it, you’ve got several options:

• View the model at OnShape. This will allow you to view and make changes to the parametric model, which is what you’ll need to make most useful changes to it.
• View the STL online. A quick and dirty way to get an interactive view of the model.
• Download the STL. If you just want to try to 3D print your own, this is all you need. It may also be useful if you want to make changes using a 3d modeling program (rather than a CAD program).
• Download a Solidworks part file. This was just exported from OnShape, but you might prefer this if you want to use SolidWorks to edit the model.

All of the official documentation for Fomu is available on Github. For basic information (such as what I referred to when writing up this project), that’s a great starting point. I designed the programming jig in OnShape, which is a pretty good and very convenient CAD tool.

1. FPGA vendors don’t publish all the information required to build configuration bitstreams for their hardware, possibly because they wish to support their side business in selling design tool licenses- this despite the fact that (anecdotally, since I can’t recall where I saw it) many FPGA developers say that vendor tooling is one of the biggest annoyances in their work. The open-source tools require a fair amount of painstaking reverse-engineering of chips to create! [return]

# Building a terrible 'IoT' temperature logger

I had approximately the following exchange with a co-worker a few days ago:

Them: “Hey, do you have a spare Raspberry Pi lying around?”

Me: [thinks] “..yes, actually.”

T: “Do you want to build a temperature logger with Prometheus and a DS18B20+?”

M: “Uh, okay?”

It later turned out that that co-worker had been enlisted by yet another individual to provide a temperature logger for their project of brewing cider, to monitor the temperature during fermentation. Since I had all the hardware at hand (to wit, a Raspberry Pi 2 that I wasn’t using for anything and temperature sensors provided by the above co-worker), I threw something together. It also turned out that the deadline was quite short (brewing began just two days after this initial exchange), but I made it work in time.

## Interfacing the thermometer

As noted above, the core of this temperature logger is a DS18B20 temperature sensor. Per the manufacturer:

> The DS18B20 digital thermometer provides 9-bit to 12-bit Celsius temperature measurements … communicates over a 1-Wire bus that by definition requires only one data line (and ground) for communication with a central microprocessor. … Each DS18B20 has a unique 64-bit serial code, which allows multiple DS18B20s to function on the same 1-Wire bus. Thus, it is simple to use one microprocessor to control many DS18B20s distributed over a large area.

Indeed, this is a very easy device to interface with. But even given the svelte hardware needs (power, data and ground signals), writing some code that speaks 1-Wire is not necessarily something I’m interested in.
Fortunately, these sensors are very commonly used with the Raspberry Pi, as illustrated by an Adafruit tutorial published in 2013. The Linux kernel provided for the Pi in its default Raspbian (Debian-derived) distribution supports bit-banging 1-Wire over its GPIOs by default, requiring only a device tree overlay to activate it. This is as simple as adding a line to /boot/config.txt to make the machine’s boot loader instruct the kernel to apply a change to the hardware configuration at boot time:

dtoverlay=w1-gpio

With that configuration, one simply needs to wire the sensor up. The w1-gpio device tree configuration by default uses GPIO 4 on the Pi as the data line, then power and ground need to be connected and a pull-up resistor added to the data line (since 1-Wire is an open-drain bus). The w1-therm kernel module already understands how to interface with these sensors- meaning I don’t need to write any code to talk to the temperature sensor: Linux can do it all for me!

For instance, reading the temperature out in an interactive shell to test, after booting with the 1-Wire overlay enabled:

$ modprobe w1-gpio w1-therm
$ cd /sys/bus/w1/devices
$ ls
28-000004b926f1  w1_bus_master1
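
Reading a sensor by hand is then just a matter of reading the w1_slave file under that device directory, which ends with the temperature in thousandths of a degree Celsius. A quick sanity check in Python (a sketch only; the glob pattern matches any DS18B20, and the format is the one produced by w1-therm):

import glob

# Read each DS18B20's w1_slave report; w1-therm ends it with a 't=NNNNN'
# field giving the temperature in millidegrees Celsius.
for path in glob.glob('/sys/bus/w1/devices/28-*/w1_slave'):
    with open(path) as f:
        raw = f.read()
    milli_c = int(raw.rsplit('t=', 1)[1])
    print(path, milli_c / 1000, 'degrees C')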

## Temperature exporter

Having connected the thermometer to the Pi and set up Prometheus, we now need to glue them together such that Prometheus can read the temperature. The usual way is for Prometheus to make HTTP requests to its known data sources, where the response is formatted such that Prometheus can make sense of the metrics. There is some support for having metrics sources push their values to Prometheus through a bridge (that basically just remembers the values it’s given until they’re scraped), but that seems inelegant given it would require running another program (the bridge) and goes against how Prometheus is designed to work.

I’ve published the source for the metrics exporter I ended up writing, and will give it a quick description in the remainder of this section.

The easiest solution to providing a service over HTTP is using the http.server module, so that’s what I chose to use. When the program starts up it scans for temperature sensors and stores them. This has a downside of never returning data if a sensor is accidentally disconnected at startup, but detection is fairly slow and only doing it at startup makes it clearer if sensors are accidentally disconnected during operation, since reading them will fail at that point.

#!/usr/bin/env python3

import socketserver
from http.server import HTTPServer, BaseHTTPRequestHandler
from w1thermsensor import W1ThermSensor

SENSORS = W1ThermSensor.get_available_sensors()

The request handler has a method that builds the whole response at once, which is just plain text based on a simple template.

class Exporter(BaseHTTPRequestHandler):
    METRIC_HEADER = ('# HELP w1therm_temperature Temperature in Kelvin of the sensor.\n'
                     '# TYPE w1therm_temperature gauge\n')

    def build_exposition(self, sensor_states):
        # Start from the HELP/TYPE header, then add one sample line per sensor.
        out = self.METRIC_HEADER
        for sensor, temperature in sensor_states.items():
            out += 'w1therm_temperature{{id="{}"}} {}\n'.format(sensor, temperature)
        return out

do_GET is called by BaseHTTPRequestHandler for all HTTP GET requests to the server. Since this server doesn’t really care what you want (it only exports one thing- metrics), it completely ignores the request and sends back metrics.

    def do_GET(self):
        response = self.build_exposition(self.get_sensor_states())
        response = response.encode('utf-8')

        # We're careful to send a content-length, so keepalive is allowed.
        self.protocol_version = 'HTTP/1.1'
        self.close_connection = False

        self.send_response(200)
        # Content-Type as specified by the Prometheus exposition format docs.
        self.send_header('Content-Type', 'text/plain; version=0.0.4')
        self.send_header('Content-Length', str(len(response)))
        self.end_headers()
        self.wfile.write(response)

The http.server API is somewhat cumbersome in that it doesn’t try to handle setting Content-Length on responses to allow clients to keep connections open between requests, but at least in this case it’s very easy to set the Content-Length on the response and correctly implement HTTP 1.1. The Content-Type used here is the one specified by the Prometheus documentation for exposition formats.
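
For reference, a complete response body from this exporter ends up looking something like the following sketch (assembled from the header and template above, with the sensor ID seen earlier; the value is whatever the sensor reads at scrape time):

    # HELP w1therm_temperature Temperature in Kelvin of the sensor.
    # TYPE w1therm_temperature gauge
    w1therm_temperature{id="000004b926f1"} 297.9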

The rest of the program is just glue, for the most part. The console_entry_point function is the entry point for the w1therm_prometheus_exporter script specified in setup.py. The network address and port to listen on are taken from the command line, then an HTTP server is started and allowed to run forever.
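
As a rough sketch of that glue (reconstructed from the description rather than copied from the published source, so names and details may differ), the entry point amounts to:

import sys

def console_entry_point():
    # Listen address and port come from the command line, e.g.
    #   w1therm_prometheus_exporter localhost 9000
    address = sys.argv[1]
    port = int(sys.argv[2])
    # HTTPServer and Exporter are the ones defined earlier in the program.
    server = HTTPServer((address, port), Exporter)
    server.serve_forever()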

### As a server

As a Python program with a few non-standard dependencies, installation of this server is not particularly easy. While I could sudo pip install everything and call it sufficient, that’s liable to break unexpectedly if other parts of the system are automatically updated- in particular the Python interpreter itself (though Debian as a matter of policy doesn’t update Python to a different release except as a major update, so it shouldn’t happen without warning). What I’d really like is the ability to build a single standalone program that contains everything in a convenient single-file package, and that’s exactly what PyInstaller can do.

After a little bit of wrestling with PyInstaller configuration (included as the .spec file in the repository), I had successfully built a pretty heavy (5MB) executable containing everything the server needs to run. I placed a copy in /usr/local/bin, for easy access when running it.

I then wrote a simple systemd unit for the temperature server to make it start automatically, installed as /etc/systemd/system/w1therm-prometheus-exporter.service:

[Unit]
Description=Exports 1-wire temperature sensor readings to Prometheus
Documentation=https://bitbucket.org/tari/w1therm-prometheus

[Service]
ExecStart=/usr/local/bin/w1therm-prometheus-exporter localhost 9000
Restart=always

StandardOutput=journal
StandardError=journal

# Standalone binary doesn't need any access beyond its own binary image and
# a tmpfs to unpack itself in.
DynamicUser=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true

[Install]
WantedBy=multi-user.target

Enable the service, and it will start automatically when the system boots:

systemctl enable w1therm-prometheus-exporter.service

This unit includes rather more protection than is probably very useful, given the machine is single-purpose, but it seems like good practice to isolate the server from the rest of the system as much as possible.

• DynamicUser will make it run as a system user with ID semi-randomly assigned each time it starts so it doesn’t look like anything else on the system for purposes of resource (file) ownership.
• ProtectSystem makes it impossible to write to most of the filesystem, protecting against accidental or malicious changes to system files.
• ProtectHome makes it impossible to read any user’s home directory, preventing information leak from other users.
• PrivateTmp gives the server its own private /tmp directory, so it can’t interfere with temporary files created by other things, nor can its own temporary files be interfered with- preventing possible races which could be exploited.

## Pi connectivity

Having built the HTTP server, I needed a way to get data from it to Prometheus. As discussed earlier, the Raspberry Pi with the sensor is on a WiFi network that doesn’t permit any incoming connections, so how can Prometheus scrape metrics if it can’t connect to the Pi?

One option is to push metrics to Prometheus, using the push gateway. However, I don’t like that option because the push gateway is intended mostly for jobs that run unpredictably, in particular where they can exit without warning. This isn’t true of my sensor server. PushProx provides a rather better solution, wherein clients connect to a proxy which forwards fetches from Prometheus to the relevant client, though I think my ultimate solution is just as effective and simpler.

What I ended up doing is using autossh to open an SSH tunnel at the Prometheus server which connects to the Raspberry Pi’s metrics server. Autossh is responsible for keeping the connection alive, managed by systemd. Code is going to be much more instructive here than a long-form description, so here’s the unit file:

[Unit]
Description=SSH reverse tunnel from %I for Prometheus
After=network-online.target
Wants=network-online.target

[Service]
User=autossh
ExecStart=/usr/bin/autossh -N -p 22 -l autossh -R 9000:localhost:9000 -i /home/autossh/id_rsa %i
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Installed as /etc/systemd/system/autossh-tunnel@.service, this unit file tells systemd that we want to start autossh when the network is online and try to ensure it always stays online. I’ve increased RestartSec from the default 100 milliseconds because I found that even with the dependency on network-online.target, ssh could fail when the system was booting up with DNS lookup failures, then systemd would give up. Increasing the restart time means it takes much longer for systemd to give up, and in the meantime the network actually comes up.

The autossh process itself runs as a system user I created just to run the tunnels (useradd --system -m autossh), and opens a reverse tunnel from port 9000 on the remote host to the same port on the Pi. Authentication is with an SSH key I created on the Pi and added to the Prometheus machine in Google Cloud, so it can log in to the server without any human intervention. Teaching systemd that this should run automatically is a simple enable command away1:

systemctl enable autossh-tunnel@pitemp.example.com

Then it’s just a matter of configuring Prometheus to scrape the sensor exporter. The entire Prometheus config looks like this:

global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'w1therm'
    static_configs:
      - targets: ['localhost:9000']

That’s pretty self-explanatory; Prometheus will fetch metrics from port 9000 on the same machine (which is actually an SSH tunnel to the Raspberry Pi), and do so every 15 seconds. When the Pi gets the request for metrics, it reads the temperature sensors and returns their values.

## Data retention

I included InfluxDB in the setup to get arbitrary retention of temperature data- Prometheus is designed primarily for real-time monitoring of computer systems, to alert human operators when things appear to be going wrong. Consequently, in the default configuration Prometheus only retains captured data for a few weeks, and doesn’t provide a convenient way to export data for archival or analysis. While the default retention is probably sufficient for this project’s needs, I wanted better control over how long that data was kept and the ability to save it as long as I liked. So while Prometheus doesn’t offer that control itself, it does support reading and writing data to and from various other databases, including InfluxDB (which I chose only because a package for it is available in Debian without any additional work).

Unfortunately, the version of Prometheus available in Debian right now is fairly old- 1.5.2, where the latest release is 2.2. More problematic, while Prometheus now supports a generic remote read/write API, this was added in version 2.0 and is not yet available in the Debian package. Combined with the lack of documentation (as far as I could find) for the old remote write feature, I was a little bit stuck.

Things ended up working out nicely though- I happened to see flags relating to InfluxDB in the Prometheus web UI, most of which have no default values:

• storage.remote.influxdb-url
• storage.remote.influxdb.database = prometheus
• storage.remote.influxdb.retention-policy
• storage.remote.influxdb.username

These can be specified to Prometheus by editing /etc/default/prometheus, which is part of the Debian package and provides the command line arguments to the server without requiring users to directly edit the file that tells the system how to run Prometheus. I ended up with these options there:

ARGS="--storage.local.retention=720h \
--storage.remote.influxdb-url=http://localhost:8086/ \
--storage.remote.influxdb.retention-policy=autogen"

The first option just makes Prometheus keep its data longer than the default, whereas the others tell it how to write data to InfluxDB. I determined where InfluxDB listens for connections by looking at its configuration file /etc/influxdb/influxdb.conf and making a few guesses: a comment in the http section there noted that “these (HTTP endpoints) are the primary mechanism for getting data into and out of InfluxDB” and included the settings bind-address=":8086" and auth-enabled=false, so I guessed (correctly) that telling Prometheus to find InfluxDB at http://localhost:8086/ should be sufficient.

Or, it was almost enough: after setting the influxdb-url and restarting Prometheus, it was periodically logging warnings about getting errors back from InfluxDB. Given that the influxdb.database setting defaults to prometheus, I (correctly) assumed I needed to create a database. A little browsing of the Influx documentation and a few guesses later, I had done that:

$ apt-get install influxdb-client
$ influx
Visit https://enterprise.influxdata.com to register for updates, InfluxDB server management, and monitoring.
Connected to http://localhost:8086 version 1.0.2
InfluxDB shell version: 1.0.2
> CREATE DATABASE prometheus;

Examining the Prometheus logs again, now it was failing and complaining that the specified retention policy didn’t exist. Noting that the Influx documentation for the CREATE DATABASE command mentioned that the autogen retention policy will be used if no other is specified, setting the retention-policy flag to autogen and restarting Prometheus made data start appearing, which I verified by waiting a little while and making a query (guessing a little bit about how I would query a particular metric):

> USE prometheus;
> SELECT * FROM w1therm_temperature LIMIT 10;
name: w1therm_temperature
-------------------------
time                    id              instance        job     value
1532423583303000000     000004b926f1    localhost:9000  w1therm 297.9
1532423598303000000     000004b926f1    localhost:9000  w1therm 297.9
1532423613303000000     000004b926f1    localhost:9000  w1therm 297.9
1532423628303000000     000004b926f1    localhost:9000  w1therm 297.9
1532423643303000000     000004b926f1    localhost:9000  w1therm 297.9
1532423658303000000     000004b926f1    localhost:9000  w1therm 297.9
1532423673303000000     000004b926f1    localhost:9000  w1therm 297.9
1532423688303000000     000004b926f1    localhost:9000  w1therm 297.9
1532423703303000000     000004b926f1    localhost:9000  w1therm 297.9
1532423718303000000     000004b926f1    localhost:9000  w1therm 297.9

## Results

A sample graph of the temperature over two days:

The fermentation temperature is quite stable, with daily variation of less than one degree in either direction from the baseline.

## Refinements

I later improved the temperature server to handle SIGHUP as a trigger to scan for sensors again, which is a slight improvement over restarting it, but not very important because the server is already so simple (and fast to restart).

On reflection, using Prometheus and scraping temperatures is a very strange way to go about solving the problem of logging the temperature (though it has the advantage of using only tools I was already familiar with so it was easy to do quickly). Pushing temperature measurements from the Pi via MQTT would be a much more sensible solution, since that’s a protocol designed specifically for small sensors to report their states. Indeed, there is no shortage of published projects that do exactly that more efficiently than my Raspberry Pi, most of them using ESP8266 microcontrollers which are much lower-power and can still connect to Wi-Fi networks.
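
To illustrate the alternative, a push-based version of the measurement loop might look roughly like this, using the paho-mqtt client library; the topic and broker hostname here are made up for illustration, and a real deployment would also want authentication:

import time

import paho.mqtt.publish as publish
from w1thermsensor import W1ThermSensor

sensor = W1ThermSensor()  # first detected DS18B20
while True:
    # Publish one reading (degrees Celsius) to a hypothetical broker and topic.
    publish.single('cider/fermenter/temperature',
                   payload='{:.2f}'.format(sensor.get_temperature()),
                   hostname='mqtt.example.com')
    time.sleep(15)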

Getting sensor readings through an MQTT broker and storing them to be able to graph them is not quite as trivial as scraping them with Prometheus, but I suspect there does exist a software package that does most of the work already. If not, I expect a quick and dirty one could be implemented with relative ease.

On the other hand, running a device like that which is internet-connected but is unlikely to ever receive anything remotely resembling a security update seems ill-advised if it’s meant to run for anything but a short amount of time. In that case it would be better to make the sensor part of a Zigbee network instead: Zigbee does not permit direct internet connectivity, and so it avoids the fraught terrain of needing to protect both the device itself from attack and the data transmitted by the device from unauthorized use (eavesdropping), by taking ownership of that problem away from the sensor.

It remains possible to forward messages out to an MQTT broker on the greater internet using some kind of bridge (indeed, this is the system used by many consumer “smart device” platforms, like Philips’ Hue though I don’t think they use MQTT), where individual devices connect only to the Zigbee network, and a more capable bridge is responsible for internet connectivity. The problem of keeping the bridge secure remains, but is appreciably simpler than needing to maintain the security of each individual device in what may be a heterogeneous network.

It’s even possible to get inexpensive off-the-shelf temperature and humidity sensors that connect to Zigbee networks, like some sold by Xiaomi, offering much better finish than a prototype-quality one I might be able to build myself, very good battery life, and still capable of operating in a heterogeneous Zigbee network with arbitrary other devices (though you wouldn’t know it from the manufacturer’s documentation, since they want consumers to commit to their “platform” exclusively)!

So while my solution is okay in that it works fine with hardware I already had on hand, a much more robust solution is readily available with off-the-shelf hardware and only a little bit of software to glue it together. If I needed to do this again and wanted a solution that doesn’t require my expertise to maintain it, I’d reach for those instead.

1. Hostname changed to an obviously fake one for anonymization purposes. [return]

# Considering my backup systems

With the recent news that Crashplan were doing away with their “Home” offering, I had reason to reconsider my choice of online backup provider. Since I haven’t written anything here lately and the results of my exploration (plus a description of everything else I do to ensure data longevity) might be of interest to others looking to set up backup systems for their own data, a version of my notes from that process follows.

## The status quo

I run a Linux-based home server for all of my long-term storage, currently 15 terabytes of raw storage with btrfs RAID on top. The choice of btrfs and RAID allows me some degree of robustness against local disk failures and accidental damage to data.

If a disk fails I can replace it without losing data, and using btrfs’ RAID support it’s possible to use heterogeneous disks, meaning when I need more capacity it’s possible to remove one disk (putting the volume into a degraded state), add a new (larger) one, and rebalance onto the new disk.

btrfs’ ability to take copy-on-write snapshots of subvolumes at any time makes it reasonable to take regular snapshots of everything, providing a first line of defense against accidental damage to data. I use Snapper to automatically create rolling snapshots of each of the major subvolumes:

• Synchronized files (mounted to other machines over the network) have 8 hourly, 7 daily, 4 weekly and 3 monthly snapshots available at any time.
• Staging items (for sorting into other locations) have a snapshot for each of the last two hours only, because those items change frequently and are of low value until considered further.
• Everything else keeps one snapshot from the last hour and each of the last 3 days.

This configuration strikes a balance according to my needs for accident recovery and storage demands plus performance. The frequently-changed items (synchronized with other machines and containing active projects) have a lot of snapshots because most individual files are small but may change frequently, so a large number of snapshots will tend to have modest storage needs. In addition, the chances of accidental data destruction are highest there. The other subvolumes are either more static or lower-value, so I feel little need to keep many snapshots of them.
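
For reference, the Snapper timeline limits that implement the schedule for the synchronized-files subvolume look roughly like this (a sketch: the config name is made up, but the option names are standard Snapper timeline settings):

    # /etc/snapper/configs/sync (excerpt)
    TIMELINE_CREATE="yes"
    TIMELINE_LIMIT_HOURLY="8"
    TIMELINE_LIMIT_DAILY="7"
    TIMELINE_LIMIT_WEEKLY="4"
    TIMELINE_LIMIT_MONTHLY="3"
    TIMELINE_LIMIT_YEARLY="0"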

I use Crashplan to back up the entire system to their “cloud”1 service for $5 per month. The rate at which I add data to the system is usually lower than the rate at which it can be uploaded to Crashplan as a backup, so in most cases new data is backed up remotely within hours of being created.

Finally, I have a large USB-connected external hard drive as a local offline backup. Also formatted with btrfs like the server (but with the entire disk encrypted), I can use btrfs send to send incremental backups to this external disk, even without the ability to send information from the external disk back. In practice, this means I can store the external disk somewhere else completely (possibly without an Internet connection) and occasionally shuttle diffs to it to update it to a more recent version. I always unplug this disk from power and its host computer when it’s not being updated, so it should only be vulnerable to physical damage and not accidental modification of its contents.

### Synchronization and remotes

For synchronizing current projects between my home server (which I treat as the canonical repository for everything) and other machines, the tools vary according to the constraints of the remote system. I mount volumes over NFS or SMB from systems that rarely or never leave my network. For portable devices (laptop computers), Syncthing (running on the server and portable device) makes bidirectional synchronization easy without requiring that both machines always be on the same network.

I keep very little data on portable devices that is not synchronized back to the server, but because it is (or, was) easy to set up, I used Crashplan’s peer-to-peer backup feature to back up my portable computers to the server. Because the Crashplan application is rather heavyweight (it’s implemented in Java!) and it refuses to include peer-to-peer backups in uploads to their storage service (reasonably so; I can’t really complain about that policy), my remote servers back up to my home server with Borg.

I also have several Android devices that aren’t always on my home network- these aren’t covered very well by backups, unfortunately. I use FolderSync to automatically upload things like photos to my server, which covers most of the data I create on those devices, but it seems difficult to make a backup of an Android device that includes things like preferences and per-app data without rooting the device (which I don’t wish to do for various reasons).

### Summarizing the status quo

• btrfs snapshots offer quick access to recent versions of files.
• btrfs RAID provides resilience against single-disk failures and easy growth of total storage in my server.
• Remote systems synchronize or back up most of their state to the server.
• Everything on the server is continuously backed up to Crashplan’s remote servers.
• A local offline backup can be easily moved and is rarely even connected to a computer, so it should be robust against even catastrophic failures.

## Evaluating alternatives

Now that we know how things were, we can consider alternative approaches to solve the problem of Crashplan’s $5-per-month service no longer being available. The primary factors for me are cost and storage capacity. Because most of my data changes rarely but none of it is strictly immutable, I want a system that makes it possible to do incremental backups.
This will of course also depend on software support, but it means that I will tend to prefer services with straightforward pricing because it is difficult to estimate how many operations (read or write) are necessary to complete an incremental backup.

Some services like Dropbox or Google Drive as commonly-known examples might be appropriate for some users, but I won’t consider them. As consumer-oriented services positioned for the use case of “make these files available whenever I have Internet access,” they’re optimized for applications very different from the needs of my backups and tend to be significantly more expensive at the volumes I need.

So, the contenders:

• Crashplan for Small Business: just like Crashplan Home (which was going away), but costs $10/mo for unlimited storage and doesn’t support peer-to-peer backup. Can migrate existing Crashplan Home backup archives to Small Business as long as they are smaller than 5 terabytes. • Backblaze:$50 per year for unlimited storage, but their client only runs on Mac and Windows.
• Google Cloud Storage: four flavors available, where the interesting ones for backups are Nearline and Coldline. Low cost per gigabyte stored, but costs are incurred for each operation and transfer of data out.
• Backblaze B2: very low cost per gigabyte, but incurs costs for download.
• Online.net C14: very low cost per gigabyte, no cost for operations or data transfer in the “intensive” flavor.
• AWS Glacier: lowest cost for storage, but very high latency and cost for data retrieval.

The pricing is difficult to consume in this form, so I’ll make some estimates with an 8 terabyte backup archive. This somewhat exceeds my current needs, so should be a useful if not strictly accurate guide. The following table summarizes expected monthly costs for storage, addition of new data and the hypothetical cost of recovering everything from a backup stored with that service.

| Service | Storage cost | Recovery cost | Notes |
|---------|--------------|---------------|-------|
| Crashplan | $10 | 0 | "Unlimited" storage, flat fee. |
| Backblaze | $4.17 | 0 | "Unlimited" storage, flat fee. |
| GCS Nearline | $80 | ~$80 | Negligible but nonzero cost per operation. Download $0.08 to $0.23 per gigabyte depending on total monthly volume and destination. |
| GCS Coldline | $56 | ~$80 | Higher but still negligible cost per operation. All items must be stored for at least 90 days (kind of). |
| B2 | $40 | $80 | Flat fee for storage and transfer per-gigabyte. |
| C14 | €40 | 0 | "Intensive" flavor. Other flavors incur per-operation costs. |
| Glacier | $32 | $740 | Per-gigabyte retrieval fees plus Internet egress. Reads may take up to 12 hours for data to become available. Negligible cost per operation. Minimum storage 90 days (like Coldline). |

Note that for Google Cloud and AWS I’ve used the pricing quoted for the cheapest regions; Iowa on GCP and US East on AWS.
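
As a sanity check on the storage column, the per-gigabyte services are simple multiplication over the 8 terabyte estimate; a quick sketch using per-gigabyte prices consistent with the table above (assumed values, not authoritative):

ARCHIVE_GB = 8 * 1000  # estimated size of the backup archive, in gigabytes

# Approximate per-gigabyte-per-month storage prices (assumptions).
per_gb_month = {
    'GCS Nearline': 0.010,
    'GCS Coldline': 0.007,
    'B2': 0.005,
    'Glacier': 0.004,
}

for service, price in per_gb_month.items():
    print('{}: ${:.0f}/month'.format(service, price * ARCHIVE_GB))
# Crashplan ($10/month) and Backblaze ($50/year, about $4.17/month) are flat-rate.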

### Analysis

Backblaze is easily the most attractive option, but the availability restriction for their client (which is required to use the service) to Windows and Mac makes it difficult to use. It may be possible to run a Windows virtual machine on my Linux server to make it work, but that sounds like a lot of work for something that may not be reliable. Backblaze is out.

AWS Glacier is inexpensive for storage, but extremely expensive and slow when retrieving data. The pricing structure is complex enough that I’m not comfortable depending on this rough estimate for the costs, since actual costs for incremental backups would depend strongly on the details of how they were implemented (since the service incurs charges for reads and writes). The extremely high latency on bulk retrievals (up to 12 hours) and higher cost for lower-latency reads makes it questionable that it’s even reasonable to do incremental backups on Glacier. Not Glacier.

C14 is attractively priced, but because they are not widely known I expect backup packages will not (yet?) support it as a destination for data. Unfortunately, that means C14 won’t do.

Google Cloud is fairly reasonably-priced, but Coldline’s storage pricing is confusing in the same ways that Glacier is. Either flavor is better pricing-wise than Glacier simply because the recovery cost is so much lower, but there are still better choices than GCS.

B2’s pricing for storage is competitive and download rates are reasonable (unlike Glacier!). It’s worth considering, but Crashplan still wins in cost. Plus I’m already familiar with software for doing incremental backups on their service (their client!) and wouldn’t need to re-upload everything to a new service.

## Fallout

I conclude that the removal of Crashplan’s “Home” service effectively means a doubling of the monthly cost to me, but little else. There are a few extra things to consider, however.

First, my backup archive at Crashplan was larger than 5 terabytes so could not be migrated to their “Business” version. I worked around that by removing some data from my backup set and waiting a while for those changes to translate to “data is actually gone from the server including old versions,” then migrating to the new service and adding the removed data back to the backup set. This means I probably lost a few old versions of the items I removed and re-added, but I don’t expect to ever need any of them.

Second and more concerning in general is the newfound inability to do peer-to-peer backups from portable (and otherwise) computers to my own server. For Linux machines that are always Internet-connected Borg continues to do the job, but I needed a new package that works on Windows. I’ve eventually chosen Duplicati, which can connect to my server the same way Borg does (over SSH/SFTP) and will in general work over arbitrarily-restricted internet connections in the same way that Crashplan did.

## Concluding

I’m still using Crashplan, but converting to their more-expensive service was not quite trivial. It’s still much less expensive to back up to their service than to the alternatives, which means they still have some significant freedom to raise the cost before I consider some other way to back up my data remotely.

As something of a corollary, it’s pretty clear that my high storage use on Crashplan is subsidized by other customers who store much less on the service; this is just something they must recognize when deciding how to price the service!

# Reflecting on Breath of the Wild

I’ve been enjoying The Legend of Zelda: Breath of the Wild recently, and reflected some on what makes it interesting to me from a non-gameplay perspective. This document is a version of those musings organized for publication, though perhaps less well organized than my usual writings- I am by no means a skilled critic, but spending longer in composing this would likely just delay its completion to little benefit.

Note that at the time of this writing I have not yet completed the game, but there are still some minor spoilers for the early portions of the game and general premise.

I haven’t really played any Zelda games before Breath of the Wild. Once (long ago) I played a little bit of A Link to the Past but didn’t find it interesting (and was bad at it). Similarly, some time later I tried Ocarina of Time and failed to find anything compelling about it. Claiming that those games are simply not fun would be disingenuous, given both of them appear in multiple lists of “greatest video games ever” compiled by various parties. The correct question to answer here is then which of the differences between those earlier games and Breath of the Wild make the latter interesting to me, but the former not.

Having had extremely limited exposure to earlier Zelda games, I am in a poor position to comment on gameplay differences. In principle I appreciate the open design of Breath of the Wild that allows the player to go anywhere and do just about anything following the completion of a short tutorial section, but it seems to me that I have not been taking much advantage of that freedom. My general impression of the older Zelda style is that there is usually only one way to progress, which may be a frustrating factor that does not rank highly in my mind; Breath of the Wild, by contrast, gives me as a player a sufficiently precise goal and the freedom to go after it at any time, along with hints as to things I might do to make success in that quest more probable.

It’s possible that merely offering more freedom to the player and applying contemporary popular game design principles (and resources) to the general Zelda formula is enough to capture the interest of the Present Me. Certainly when I tried the other above-described games they would have been relatively old so I may have been accustomed to games simply built with different style.

On the other hand, I have opinions about how Zelda has previously approached storytelling and how it does in Breath of the Wild which make for rather more interesting thoughts, so I will turn to those considerations instead.

My general impression of the typical Zelda plot is “oh no stop the evil man from doing the evil thing.” As far as it goes, that’s fine but I find it uninteresting. Breath of the Wild takes a somewhat different approach to the general concept that I find much more compelling.

Ganon, the usual Big Bad, is depicted as an elemental evil in Breath of the Wild, rather than some kind of dark wizard. If a mere dark wizard could be the source of an existential threat of the sort that requires a Great Hero (the player) to protect the world, it seems that there would be more frequent existential threats. Given the presentation suggesting that the player is a unique and noteworthy hero (for instance, collecting artifacts that are implied to be unique and offer powerful abilities to a chosen wielder), putting them in a world where the threat demanding heroism is as mundane as a dark wizard of some kind is rather unconvincing.

The ability of the player character to defeat this elemental evil is not (at least from the start) taken as self-evident, in comparison to how I see other Zelda games, which from the beginning seem to take it as given that the player character is destined to become a great hero and save the world.

Breath of the Wild starts with the player as the only sentient being in a wild landscape with traces of former glory, quickly introducing a guide sort of character who acts as a gently-guided tutorial before giving fuzzy backstory about how the player character is essentially one in a long line of heroes. The details of this are revealed over hours of gameplay, largely leaving the player alone in a vast world scattered with suggestions of a long history.

I previously found the official canon for Zelda to be rather silly, because every new game is basically the same structurally (Link the hero must do hero things to assist/rescue Zelda the princess-heroine from the Big Bad). This has been justified as that they are these recurring characters over very long periods of time, but it struck me mostly as a convenient excuse for recycling characters that are well-liked.

## A world with history

In Breath of the Wild the history of the world is used to good effect, including the recurring nature of the major characters. Lore-keeping NPCs allude to the distant past and what are effectively past incarnations of the main characters, while side characters are more defined by their roles so it is plausible to reuse them.

The world itself acts as a testament to the long history of the land, with ruined structures dotting the landscape (often large stone constructions that are impressive even as ruins) and even older structures (shrines) that serve as waypoints in what might otherwise be aimless wandering for the player. While the shrines are largely ignored by NPCs, they are not invisible- NPCs seem aware of their presence, but because they are not useful they go largely ignored. If there is a weak point in this bit of world-building, the lack of interest or elaboration around shrines seems to be it- but this seems fair when these shrines are meant to be orders of magnitude older than any other relevant features of the landscape.

The art style might earn some credit for informing the feel of the world, but I don’t feel qualified to comment on that. I’ll leave comments on the art of the game to commentators better informed about art, leaving my thoughts at how it seems to be a good use of the limited computing power available to the game’s developers, recognizing that the Wii U and Switch are underpowered systems by the standards of other contemporary gaming platforms.

## Characterization

I noted earlier that Breath of the Wild does not seem to take it as given that the player’s character is destined to become a great hero and save the world, but an alternate interpretation might be that the player already has hero credentials. It is shown early on that Link was some kind of royally appointed champion, and it is then implied that those selfsame credentials derive to some extent from Link being an embodiment of a spirit that protects the world through time immemorial.

Despite these heroic credentials, the player at the beginning of the game is effectively powerless, having been idle (unconscious) for a century, following a disastrous defeat at the hands of an encroaching Ganon.

Much of the characterization is done through flashbacks, which I find pretty effective. Putting this kind of backstory in-line might have been distracting and taken away from the feel of a desolate and ruined world, which makes for some of the most intriguing parts of the setting.

It is a bit unusual among games (those with this kind of budget available, at least) that this one largely does without voice acting- most characters (the player included) “speak” in what amounts to grunts, except in flashbacks and with a few major characters who are fully voiced. I find this works well; the voiced portions are not particularly wordy, and they suggest some kind of connection between Link and the world’s past, making the player’s task something more resembling a quest to regain the glory of the land, rather than just preventing a cataclysm (which has already happened!).

I appreciate the gravitas of the voiced instances in general as appropriate to the events surrounding a kingdom under mortal threat. Despite that, there are welcome comedic and otherwise lighthearted instances that seem appropriate when considering that Link and Zelda are supposed to be relatively young. In particular, Zelda as a character does not come across to me as a nearly-omniscient plot device (“go here and do this, you must!”), but instead as a typical person thrust into a position of power as royalty doing what they feel is right in a difficult situation.

Other characters help the game illustrate that in the past Link is coming from, there were organized attempts to resist the impending doom embodied in Ganon. Compare this to my impression of other Zeldas, where if anybody else is aware of the existential hazard that must be defeated by the player they foolishly choose to put their trust in the abilities of the player who begins with no credentials to speak of.

The most guidance any person gives to the player regarding how best to achieve their ultimate goal of defeating the Great Evil hanging over the land amounts to general information, stating what happened in the past and suggesting what might be helpful. Not only does this support the feeling that nobody knows a magical secret that will defeat Ganon, but it also meshes nicely with the feeling that the player is responsible for restoring the former glory of the land (and themselves), leaving it up to the player what level of achievement is appropriate.

If desired, a sufficiently skilled player can go directly from the game’s introduction to defeating the Big Bad, in some fashion asserting that there is nothing that need be reclaimed.

## Concluding

With a largely ruined and wild world in the grip of an elemental evil, Breath of the Wild suggests that there is a long and glorious forgotten history to be rediscovered and reclaimed. Whereas the feeling I get from the narrative setup of other Zelda games is “become a hero and save the land!”, this one is closer to “reclaim the forgotten glory of the land by rediscovering your own heroic nature.” Less a story of self-actualization, the player is fighting the universe itself to build a better world for themselves and every other character in it- but they are granted great freedom to decide what must be done and how.

Recognizing my preferences in fiction, it is hardly surprising that I find Breath of the Wild’s approach to storytelling compelling. For instance, I quite enjoy the works of Alastair Reynolds, which often have themes relating to inhumanly long timescales and the inability of humans to comprehend all relevant aspects. Taken slightly differently, I greatly appreciate Lovecraftian horror with its bleak outlook and Breath of the Wild hits some similar notes (albeit in a much more positive fashion).

So, yeah. I like Breath of the Wild and really appreciate how it diverges from other Zelda games in presentation (though I lack the perspective to objectively judge how much it actually differs!). Even if you, dear reader, have found other Zelda iterations uninteresting, this one might be worth looking at.

# An illustrated guide to LLVM

At the most recent Rust Sydney meetup (yesterday, “celebrating” Rust’s second birthday) I gave a talk intended to provide an introduction to using LLVM to build compilers, using Rust as the implementing language. The presentations were not recorded, which might have been neat, but I’m publishing the slides and notes here for anybody who might find them interesting or useful. It is, however, not as illustrated as the title may suggest.

It’s embedded below, or you can view standalone in your browser or as a PDF, available with or without presenter notes. Navigate with the arrow keys on your keyboard or by swiping. Press ? to show additional keys for controls; in particular, s will open a presenter view that includes the plentiful notes I’ve included.

For the curious: I built the presentation with reveal.js, and its preparation consumed a lot more time than I initially expected (though that’s not the fault of reveal).

# Quick and dirty web image optimization

Given a large pile of images that nominally live on a web server, I want to make them smaller and more friendly to serve to clients. This is hardly novel: for example, Google offers detailed advice on reducing the size of images for the web. I have mostly JPEG and PNG images, so jpegtran and optipng are the tools of choice for bulk lossless compression.

To locate and compress images, I’ll use GNU find and parallel to invoke those tools. For JPEGs I take a simple approach, preserving comment tags and creating a progressive JPEG (which can be displayed at a reduced resolution before the entire image has been downloaded).

find -iname '*.jpg' -printf '%P,%s\n' \
     -exec jpegtran -o -progressive -copy comments -outfile {} {} \; \
     > delta.csv


Where things get a little more interesting is when I output the name and size of each located file (-printf ...) and store those in a file (> delta.csv).1 This is so I can collect more information about the changes that were made.

## Check the JPEGs

Loading up a Jupyter notebook for quick computations driven with Python, I load the file names and original sizes, adding the current size of each listed file to my dataset.

import os
deltas = []

with open('delta.csv', 'r') as f:
    for line in f:
        name, _, size = line.partition(',')
        size = int(size)
        newsize = os.stat(name).st_size
        deltas.append((name, size, newsize))

From there, it’s quick work to do some list comprehensions and generate some overall statistics:

original_sizes = [orig for (_, orig, _) in deltas]
final_sizes = [new for (_, _, new) in deltas]
shrinkage = [orig - new for (_, orig, new) in deltas]

pct_total_change = 100 * (sum(original_sizes) -
                          sum(final_sizes)) / sum(original_sizes)

pct_change = [shrinkage /
              orig for (shrinkage, orig) in zip(shrinkage, original_sizes)]
avg_pct_change = 100 * sum(pct_change) / len(pct_change)

print('Total size reduction:', sum(shrinkage),
      'bytes ({}%)'.format(round(pct_total_change, 2)))
avg = sum(shrinkage) / len(shrinkage)
print('Average reduction per file:', avg,
      'bytes ({}%)'.format(round(avg_pct_change, 2)))

This is by no means good code, but that’s why this short write-up is “quick and dirty”. Over the JPEGs alone, I rewrote 2162 files, saving 8820474 bytes overall; about 8.4 MiB, 8.02% of the original size of all of the files- a respectable savings for exactly zero loss in quality.

## PNGs

I processed the PNGs in a similar fashion, having optipng emit interlaced PNGs (which can be displayed at reduced resolution before all of the data has arrived) and try more combinations of compression parameters than it usually would, trading CPU time for possible gains. There was no similar tradeoff to make with the JPEGs: jpegtran’s optimization of the encoding is entirely deterministic, whereas optipng relies on experimental compression of image data with varying parameters to find a minimum size.

Since I observed that many of the PNGs were already optimally-sized and interlacing made them significantly larger (by more than 10% in many cases), I also considered only files that were more than 256 KiB in size. To speed up processing overall, I used parallel to run multiple instances of optipng at once (since it’s a single-threaded program) to better utilize the multicore processors at my disposal.

find -iname '*.png' -size +256k -printf '%P,%s\n' \
     -exec parallel optipng -i 1 -o 9 ::: {} + \
     > delta.csv


Running the same analysis over this output, I found that interlacing had a significant negative effect on image size. There were 68 inputs larger than 256 KiB, and the size of all of the files increased by 2316732 bytes; 2.2 MiB, nearly 10% of the original size.

A few files had significant size reductions (about 300 KiB in the best case), but the largest increase in size had similar magnitude. Given the overall size increase, the distribution of changes must be skewed towards net increases.

### Try again

Assuming most of the original images were not interlaced (most programs that write PNG don’t interlace unless specifically told to) and recognizing that interlaced PNGs tend to compress worse, I ran this again but without interlacing (-i 0) and selecting all files regardless of size.

The results of this second run were much better: over 3102 files, saving 12719421 bytes (12.1 MiB), 15.9% of the original combined size. One file saw a whopping 98% reduction in size, from 234 KB to only 2914 bytes- inspecting that one myself, the original was inefficiently coded at 32 bits per pixel (8-bit RGBA), and it was reduced to two bits per pixel. I expect a number of other files had similar but less dramatic transformations. Some were not shrunk at all, but optipng is smart enough to skip rewriting those, so it will never make a file larger- the first run was an exception because I asked it to make interlaced images.

## That’s all

I saved about 20 MiB for around 5000 files- not bad. A copy of the notebook I used to do measurements is available (check out nbviewer for an online viewer), useful if you want to do something similar with your own images. I would not recommend doing it to ones that are not meant purely for web viewing, since the optimization process may strip metadata that is useful to preserve.

1. Assumption: none of the file names contain commas, since I’m calling this a CSV (comma-separated values) file. It’s true in this instance, but may not be in others.

[return]

# sax-ng

Over on Cemetech, we’ve long had an embedded chat widget called “SAX” (“Simultaneous Asynchronous eXchange”). It behaves kind of like a traditional shoutbox, in that registered users can use the SAX widget to chat in near-real-time. There is also a bot that relays messages between the on-site widget and an IRC channel, which we call “saxjax”.

The implementation of this, however, was somewhat lacking in efficiency. It was first implemented around mid-2006, and saw essentially no updates until just recently. The following is a good example of how dated the implementation was:

// code for Mozilla, etc
if (window.XMLHttpRequest) {
    xmlhttp=new XMLHttpRequest()
    xmlhttp.open("GET",url,true)
    xmlhttp.send(null)
} else if (window.ActiveXObject) {
    // code for IE
    xmlhttp=new ActiveXObject("Microsoft.XMLHTTP")
    if (xmlhttp) {
        xmlhttp.open("GET",url,true)
        xmlhttp.send()
    }
}


The presence of ActiveXObject here implies it was written at a time when a large fraction of users would have been using Internet Explorer 5 or 6 (the first version of Internet Explorer released which supported the standard form of XMLHttpRequest was version 7).

Around a year ago (that’s how long this post has been a draft for!), I took it upon myself to design and implement a more modern replacement for SAX. This post discusses that process and describes the design of the replacement, which I have called “sax-ng.”

## Legacy SAX

The original SAX implementation, as alluded to above, is based on AJAX polling. On the server, a set of approximately the 30 most recent messages was stored in a MySQL database, and a few PHP scripts managed retrieving and modifying messages in the database. This design was a logical choice when initially built, since the web site was running on a shared web host (supporting little more than PHP and MySQL) at the time.

Eventually this design became a problem, as essentially every page containing SAX that is open at any given time regularly polls for new messages. Each poll calls into PHP on the server, which opens a database connection to perform one query. Practically, this means a very large number of database connections are opened at a fairly regular pace. In mid-2012 the connection count reached levels where the shared hosting provider was displeased with it, and requested that we either pay for a more expensive hosting plan or reduce resource usage.

In response, we temporarily disabled SAX, then migrated the web site to a dedicated server provided by OVH, who had opened a new North American datacenter in July. We moved to the dedicated server in August of 2012. This infrastructure change kept the system running, and opened the door to a more sophisticated solution since we gained the ability to run proper server applications.

Meanwhile, the limitations of saxjax (the IRC relay bot) slowly became more evident over time. The implementation was rather ad-hoc, in Python: it used two threads to implement the relay, with a dismaying amount of shared state used to pass messages between them. It tended to stop working correctly in case of an error in either thread, whether due to a transient error response from polling the web server for new messages, or an encoding-related exception thrown from the IRC client (Python 2.x uses bytestrings for most tasks unless specifically told not to, and many string operations, particularly writing a string out somewhere, can break without warning on data that is not 8-bit clean, which is basically anything that isn’t ASCII).

Practically, this meant that the bot would frequently end up in a state where it would only relay messages one way, or relay none at all. I put some time into making it more robust to these kinds of failures early in 2015, such that some of the time it would manage to catch these errors and outright restart (rather than try to recover from an inconsistent state). Doing so involved some pretty ugly hacks though, which prompted a return to some longtime thoughts on how SAX could be redesigned for greater efficiency and robustness.

## sax-ng

For a long time prior to beginning this work, I frequently (semi-jokingly) suggested XMPP (Jabber) as a solution to the problems with SAX. At a high level this seems reasonable: XMPP is a chat protocol with a number of different implementations available, and is relatively easy to set up as a private chat service.

On the other hand, the feature set of SAX imposes a few requirements which are not inherently available for any given chat service:

1. An HTTP gateway, so clients can run inside a web browser.
2. Group chat, not just one-to-one conversation capability.
3. External authentication (logging in to the web site should permit connection to chat as well).
4. Retrieval of chat history (so a freshly-loaded page can have some amount of chat history shown).

As it turns out, ejabberd enables all of these, with relatively little customization. mod_http_bind provides an HTTP gateway as specified in XEP-0206, mod_muc implements multi-user chat as specified in XEP-0045 which also includes capabilities to send chat history to clients when they connect, and authentication can be handled by an external program which speaks a simple binary protocol and is invoked by ejabberd.

The main implementation of the new XMPP-based system was done in about a week, perhaps 50 hours of concerted work in total (though I may be underestimating). I had about a month of “downtime” at the beginning of this past summer, the last week of which was devoted to building sax-ng.

### ejabberd

The first phase involved setting up an instance of ejabberd to support the rest of the system. I opted to run it inside Docker, ideally to make the XMPP server more self-contained and avoid much custom configuration on the server. Conveniently, somebody had already built a Docker configuration for ejabberd with a wealth of configuration switches, so it was relatively easy to set up.

Implementing authentication against the web site was also easy, following the protocol description in the ejabberd developers guide. Since this hooks into the web site’s authentication system (a highly modified version of phpBB), the script simply connects to the MySQL server and runs queries against the site database.

Actual authentication is performed with phpBB SIDs (Session ID), rather than a user’s password. It was built this way because the SID and username are stored in a cookie, which is available to a client running in a web browser. This is probably also somewhat more secure than storing a password in the web browser, since the SID is changed regularly so data exposure via some vector cannot compromise a user’s web site password.
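
ejabberd’s external auth protocol (per the developers guide mentioned above) frames each request as a 2-byte big-endian length followed by a colon-separated string like auth:user:server:password, and expects a reply of a 2-byte length (always 2) followed by a 2-byte result. A minimal sketch in Python; this is not the actual script, and the phpBB session check is left as a placeholder since its queries aren’t reproduced here:

import struct
import sys

def check_sid(username, sid):
    # Placeholder: confirm that `sid` is a live phpBB session belonging to
    # `username` by querying the site database. Details omitted here.
    raise NotImplementedError

def reply(ok):
    # Every reply is a 2-byte length (always 2) plus a 2-byte result code.
    sys.stdout.buffer.write(struct.pack('>HH', 2, 1 if ok else 0))
    sys.stdout.buffer.flush()

def main():
    while True:
        header = sys.stdin.buffer.read(2)
        if len(header) < 2:
            break  # ejabberd closed the pipe; exit quietly
        (length,) = struct.unpack('>H', header)
        request = sys.stdin.buffer.read(length).decode('utf-8')
        fields = request.split(':', 3)
        if fields[0] == 'auth' and len(fields) == 4:
            user, _server, sid = fields[1], fields[2], fields[3]
            reply(check_sid(user, sid))
        elif fields[0] == 'isuser':
            reply(True)  # any registered site user may connect
        else:
            reply(False)

if __name__ == '__main__':
    main()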

Error handling in the authentication script is mostly nonexistent. The Erlang approach to such problems is mostly “restart the component if it fails”, so in case of a problem (of which the only real possibility is a database connection error) ejabberd will restart the authentication script and attempt to carry on. In practice this has proven to be perfectly reliable.

In XMPP MUC (Multi-User Chat), users are free to choose any nickname they wish. For our application, there is really only one room and we wish to enforce that the nickname used in XMPP is the same as a user’s username on the web site. There ends up being no good way in ejabberd to require that a user take a given nickname, but we can ensure that it is impossible to impersonate other users by registering all site usernames as nicknames in XMPP. Registered nicknames may only be used by the user to which they are registered, so the only implementation question is in how to automatically register nicknames.

I ended up writing a small patch to mod_muc_admin, providing an ejabberdctl subcommand to register a nickname. This patch is included in its entirety below.

diff --git a/src/mod_muc_admin.erl b/src/mod_muc_admin.erl
index 9c69628..3666ba0 100644
@@ -15,6 +15,7 @@
     start/2, stop/1, % gen_mod API
     muc_online_rooms/1,
     muc_unregister_nick/1,
+    muc_register_nick/3,
     create_room/3, destroy_room/3,
     create_rooms_file/1, destroy_rooms_file/1,
     rooms_unused_list/2, rooms_unused_destroy/2,
@@ -38,6 +39,9 @@

 %% Copied from mod_muc/mod_muc.erl
 -record(muc_online_room, {name_host, pid}).
+-record(muc_registered,
+        {us_host = {{<<"">>, <<"">>}, <<"">>} :: {{binary(), binary()}, binary()} | '$1',
+         nick = <<"">> :: binary()}).

 %%----------------------------
 %% gen_mod
@@ -73,6 +77,11 @@ commands() ->
                        module = ?MODULE, function = muc_unregister_nick,
                        args = [{nick, binary}], result = {res, rescode}},
+    #ejabberd_commands{name = muc_register_nick, tags = [muc],
+                       desc = "Register the nick in the MUC service to the JID",
+                       module = ?MODULE, function = muc_register_nick,
+                       args = [{nick, binary}, {jid, binary}, {domain, binary}],
+                       result = {res, rescode}},
     #ejabberd_commands{name = create_room, tags = [muc_room],
                        desc = "Create a MUC room name@service in host",
@@ -193,6 +202,16 @@ muc_unregister_nick(Nick) ->
            error
    end.

+muc_register_nick(Nick, JID, Domain) ->
+    {jid, UID, Host, _,_,_,_} = jlib:string_to_jid(JID),
+    F = fun (MHost, MNick) ->
+                mnesia:write(#muc_registered{us_host=MHost,
+                                             nick=MNick})
+        end,
+    case mnesia:transaction(F, [{{UID, Host}, Domain}, Nick]) of
+        {atomic, ok} -> ok;
+        {aborted, _Error} -> error
+    end.

 %%----------------------------
 %% Ad-hoc commands

It took me a while to work out how exactly to best implement this feature, but considering I had never worked in Erlang before, it was reasonably easy. I do suspect some familiarity with Haskell and Rust provided background to more easily understand certain aspects of the language. The requirement that I duplicate the muc_registered record (since apparently Erlang provides no way to import records from another file) rubs me the wrong way, though.

In practice, then, a custom script traverses the web site database, invoking ejabberdctl to register the nickname for every existing user at server startup, and then periodically or on demand while the server is running.

### Web interface

The web interface into XMPP was implemented with Strophe.js, communicating with ejabberd via HTTP-bind using the standard support in both the client library and the server.

The old SAX design served a small amount of chat history with every page load so it was immediately visible without performing any additional requests after page load, but since the web server never receives chat data (it all goes into XMPP directly), this is no longer possible. The MUC specification allows a server to send chat history to clients when they join a room, but that still requires several HTTP round-trips (taking up to several seconds) to even begin receiving old lines.

I ended up storing a cache of messages in the browser, which is used to populate the set of displayed messages on initial page load. Whenever a message is received and displayed, its text, sender and a timestamp are added to the local cache. On page load, messages from this cache which are less than one hour old are displayed. The tricky part with this approach is avoiding duplication of lines when messages sent as part of room history already exist, but checking the triple of sender, text and timestamp seems to handle these cases quite reliably.

### webridge

The second major feature of SAX is to announce activity on the web site’s bulletin board, such as when people create or reply to threads. Since the entire system was previously managed by code tightly integrated with the bulletin board, a complete replacement of the relevant code was required. In the backend, SAX functionality was implemented entirely in one PHP function, so replacing the implementation was relatively easy.

The function’s signature was something like saxSay($type, $who, $what, $where), where type is a magic number indicating what kind of message it is, such as the creation of a new thread, a post in a thread or a message from a user. The interpretation of the other parameters depends on the message type, and tends to be somewhat inconsistent. The majority of that function was a maze of comparisons against the message type, emitting a string which was eventually pushed into the chat system. Rather than attempt to make sense of that code, I decided to replace it with a switch statement over symbolic values (whereas the old code just used bare numbers with no indication of purpose), feeding simple invocations of sprintf. The most challenging part was finding the purpose of each message type, but even that wasn’t terribly difficult: I searched the entire web site source code for references to saxSay and determined the meaning of each type from the callers’ context.

To actually feed messages from PHP into XMPP, I wrote a simple relay bot which reads messages from a UNIX datagram socket and repeats them into a MUC room. A UNIX datagram socket was selected because there need not be any framing information in messages coming in (just read a datagram and copy its payload), and this relay should not be accessible to anything running outside the same machine (hence a UNIX socket). The bot is implemented in Python with Twisted, utilizing Twisted’s provided protocol support for XMPP. It is run as a service under twistd, with configuration provided via environment variables because I didn’t want to write anything to handle reading a more “proper” configuration file.

When the PHP code calls saxSay, that function connects to a socket with a path determined from web site configuration and writes the message into that socket. The relay bot (“webridge”) receives these messages and writes them into MUC.

### saxjax-ng

Since keeping a web page open for chatting is not particularly convenient, we also operate saxjax, a bridge between the SAX chat and an IRC channel. The original version of this relay bot was of questionable quality at best: the Python implementation ran two threads, each providing one-way communication through a list. No concurrency primitives, little sanity. Prior to the creation of sax-ng I had put some amount of effort into improving the reliability of that system, since an error in either thread would halt all processing of messages in the direction corresponding to the thread in which the error occurred. Given there was essentially no error handling anywhere in the program, this sort of thing happened with dismaying frequency.

saxjax-ng is very similar in design to webridge, in that it’s Twisted-based and uses the Twisted XMPP library. On the IRC side, it uses Twisted’s IRC library (shocking!). Both ends of this end up being very robust when combined with the components that provide automatic reconnection and a little bit of custom logic for rotating through a list of IRC servers. Twisted guarantees single-threaded operation (that’s the whole point; it’s an async event loop), so relaying a message between the two connections is simply a matter of repeating it on the other connection.

## Contact with users

This system has been perfectly reliable since deployment, after a few changes. Most notably, the http-bind interface for ejabberd was initially exposed on port 5280 (the default for http-bind). Users behind certain restrictive firewalls can’t connect to that port, so we quickly reconfigured our web server to reverse-proxy to http-bind and solve that problem. Doing so also means the XMPP server doesn’t need its own copy of the server’s SSL certificate.

There are still some pieces of the web site that emit messages containing HTML entities in accordance with the old system. The new system doesn’t emit HTML entities, because that should be the responsibility of whatever is doing HTML presentation (Strong Opinion), and I haven’t bothered trying to find the things that are still emitting HTML-like strings.

The reconnect logic on the web client tends to act like it’s received multiples of every message that arrives after it’s tried to reconnect to XMPP, such as when a user puts their computer to sleep and later resumes; the web client tries to detect the lost connection and reopen it, and I think some event handlers are getting duplicated at that point. I haven’t bothered working on a fix for that either.

## Conclusion

ejabberd is a solid piece of software and not hard to customize. Twisted is a good library for building reliable network programs in Python, but has enough depth that some of its features lack useful documentation, so finding what you need and figuring out how to use it can be difficult. This writeup has been languishing for too long, so I’m done writing now.

# Web history archival and WARC management

I’ve been a sort of ‘rogue archivist’ along the lines of the Archive Team for some time, but generally lack the combination of motivation and free time to directly take part in their activities. That said, I do sometimes go on bursts of archival since these things do concern me; it’s just a question of when I’ll get manic enough to be useful and latch onto an archival task as the one to do. An earlier public example is when I mirrored ticalc.org.

The historical record contains plenty of instances where people maintained copies of their communications or other documentation which have proven useful to study, and in the digital world the same is likely to be true. With the ability to cheaply store large amounts of data, it is also relatively easy to generate collections in the hope of their future utility.

Something I first played with back in 2014 was extracting lists of web pages to archive from web browser history. From a public perspective this may not be particularly interesting, but if maintained over a period of time this data could be interesting as a snapshot of a typical-in-some-fashion individual’s daily life, or for purposes I can’t foresee.

Today I’m going to write a little about how I collect this data and reduce its space requirements. The products of this work that are source code can be found on Bitbucket.

## Collecting History

I use Firefox as my everyday web browser, which combined with Firefox Sync provides ready access to a reasonably complete record of my web browsing activity. The first step is extracting the actual browser history, which is a relatively straightforward process since Firefox maintains all of this data in SQLite databases. I use cookies.sqlite and places.sqlite from my Firefox profile.

Extracting history from places.sqlite is as simple as running a query that emits timestamps and corresponding URLs. For example:

sqlite3 places.sqlite \
    "SELECT visit_date, url FROM moz_places, moz_historyvisits \
     WHERE moz_places.id = moz_historyvisits.place_id \
     AND visit_date > $LASTRUN \
     ORDER BY visit_date"

This will print the timestamp and URL for every page in history newer than LASTRUN (which can easily be omitted to get everything), with the fields separated by pipes (|). The timestamp (visit_date) is a UNIX timestamp expressed in microseconds.
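
As a small illustration (not part of the original workflow), that output is easy to consume from Python; assuming it was redirected to a file named history.txt, something like this converts the microsecond timestamps into readable dates:

from datetime import datetime, timezone

with open('history.txt') as f:
    for line in f:
        visit_date, _, url = line.rstrip('\n').partition('|')
        # visit_date is microseconds since the UNIX epoch
        when = datetime.fromtimestamp(int(visit_date) / 1e6, tz=timezone.utc)
        print(when.isoformat(), url)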

While there’s some utility in just grabbing web pages, the real advantage I’ve found in using data directly from a web browser is that it can gain a personal touch, with access to private data granted in many cases by cookies. This does imply that the data should not be shared, but as with personal letters in history this formerly-private information may become useful in the future at a point when the privacy of that data is no longer a concern for those involved.

Again using sqlite and the cookies.sqlite file we got from Firefox, it’s relatively easy to extract a cookies.txt file that can be read by many tools:

sqlite3 -separator ' ' cookies.sqlite << EOF
.mode tabs
SELECT host,
       CASE substr(host,1,1)='.' WHEN 0 THEN 'FALSE' ELSE 'TRUE' END,
       path,
       CASE isSecure WHEN 0 THEN 'FALSE' ELSE 'TRUE' END,
       expiry,
       name,
       value
FROM moz_cookies;
EOF

The output of that sqlite invocation can be redirected directly into a cookies.txt file without any further work.

With the list of URLs and cookies, it’s again not difficult to capture a WARC containing every web page listed. I’ve used wget, largely out of convenience. Taking advantage of a UNIX shell, I usually do the following, piping the URL list into wget:

cut -d '|' -f 2- urls.txt | \
    wget --warc-file=date --warc-cdx --warc-max-size=1G \
         --load-cookies cookies.txt \
         -e robots=off -U "Inconspicuous Browser" \
         --timeout 30 --tries 2 --page-requisites \
         --delete-after -i -

This will download every URL given to it with the cookies extracted earlier, and will also download external resources (like images) when they are referenced in downloaded pages. The process will be logged to a WARC file named with the time the process was started, limiting to approximately 1-gigabyte chunks.

This takes a while, and the best benefits are to be had from running this at fairly short intervals which will tend to provide more unexpired cookies and catch changes over short periods of time, thus presenting a more accurate view of what the browser’s user is actually doing.

## Deduplication

On completion, I’m presented with a directory containing some number of compressed WARC files. That’s a reasonable place to leave it, but this weekend after doing an archival run that yielded about 90 gigabytes of data I decided to look into making it smaller, especially considering I know my archive runs end up grabbing many copies of the same resources on web sites which I visit frequently (for example, icons on DuckDuckGo).

The easy approach would be to use a compression scheme which tends to work better than gzip (the typical compression scheme for WARCs). However, doing so would destroy a useful property in that the files do not need to be completely decompressed for viewing. These are built such that with an index showing where a particular record exists in the archive, a user does not need to decompress the entire file up to that point (as would be the case with most compression schemes)- it is possible to seek to that point in the compressed file and decompress just the desired record.

I had hoped that the professionals in this field had already considered ways to make their archives smaller, and that ended up being true, but the documentation is very sparse: the only truly useful material was a recent presentation by Youssef Eldakar from the Bibliotheca Alexandrina cursorily describing tools to deduplicate entries in WARC files using revisit records, which point to a previous date-URL combination that has the same contents1.
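
For the curious, a revisit record is an ordinary WARC record whose headers declare that its payload is identical to one captured earlier; abridged, and with made-up values purely for illustration, it looks roughly like this:

WARC/1.0
WARC-Type: revisit
WARC-Target-URI: https://example.com/static/icon.png
WARC-Date: 2017-01-18T02:10:00Z
WARC-Profile: http://netpreserve.org/warc/1.0/revisit/identical-payload-digest
WARC-Refers-To-Target-URI: https://example.com/static/icon.png
WARC-Refers-To-Date: 2017-01-15T20:33:00Z
WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ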

I don’t see any strong reason to keep my archives split into 1-gigabyte pieces, and it’s slightly easier to perform deduplication on a single large archive, so I used megawarc to join a number of smaller archives into one big one.

It was easy enough to find the published code for the tools described in the presentation, so all I had to do was figure out how to run them.. right?

## The Process

The logical procedure for deduplication is as follows:

1. Run warcsum to compute hashes of every record of interest in the specified archive(s), writing them to a file.
2. Run warccollres to examine the records and their hashes, determining which ones are actually the same and which are just hash collisions.
3. Run warcrefs to rewrite the archives with references when duplicates are found.

I had a hard time actually getting that to happen, though.

### warcsum

Running warcsum was relatively easy; it happily chewed on my test archive for a while and eventually spat out a long list of files. I later discovered that it wasn’t processing the whole archive, though- it stopped after about two gigabytes of data. I eventually found that the program (written in C) was using int as the type for file offsets, so the apparent offset in a file becomes negative after reading two gigabytes of data, which causes the program to end, thinking it’s done everything. I patched the relevant bits to use 64-bit types (like off_t) when working with file offsets, and eventually got it to emit 1.7 million records rather than the few tens of thousands I was getting before.

While investigating the premature termination, I found (using warcat) that wget sometimes writes record length fields that are one byte longer than the actual record. I spent a while trying to investigate that and repair the length fields in hopes of fixing warcsum’s premature termination, but it ended up being unnecessary. In practice this off-by-one doesn’t seem to be harmful, but I do find it somewhat concerning.

I also discovered that warcsum assumes wrapping arithmetic for determining how large some buffers should be, which is undefined behavior in C and could cause Bad Things to happen. I fixed the instance where I saw it, but that didn’t seem to be causing any issues on my dataset.

### warccollres

Moving on to warccollres, I found that it assumes a lot of infrastructure which I lack. Given the name of a WARC file, it expects to have access to a MySQL server which can indicate a URL where records from the WARC can be downloaded- a reasonable assumption if you’re a professional working within an organization like the Bibliotheca Alexandrina or Internet Archive, but excessive for my purposes and difficult to set up.

I ended up rewriting all of warccollres in Python, using a self-contained database and assuming direct access to the files. There’s nothing particularly novel in there (see warccollres.py in the repository). WARC records are read from the archive and compared where they have the same hash to determine actual equality, and duplicates are marked as such.
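
The core of the comparison can be sketched briefly. This is a condensed illustration rather than the actual warccollres.py: it assumes each record is stored as its own gzip member (the usual convention for .warc.gz) and that warcsum’s output has already been parsed into (offset, length, digest) tuples.

import gzip
import io
from collections import defaultdict

def read_record(archive, offset, length):
    # Each record is an independent gzip member, so it can be decompressed
    # in isolation given its offset and compressed length.
    with open(archive, 'rb') as f:
        f.seek(offset)
        member = f.read(length)
    return gzip.GzipFile(fileobj=io.BytesIO(member)).read()

def find_duplicates(archive, records):
    by_digest = defaultdict(list)
    for offset, length, digest in records:
        by_digest[digest].append((offset, length))

    duplicates = []  # (duplicate_offset, original_offset) pairs
    for group in by_digest.values():
        if len(group) < 2:
            continue  # a unique hash cannot be a duplicate
        seen = []  # distinct contents observed so far for this hash
        for offset, length in group:
            body = read_record(archive, offset, length)
            for orig_offset, orig_body in seen:
                if body == orig_body:
                    # Same hash and same bytes: a true duplicate.
                    duplicates.append((offset, orig_offset))
                    break
            else:
                seen.append((offset, body))
    return duplicates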

I originally imported everything into a sqlite database and did all the work in there (not importing file contents though– that would be very inefficient), but this was rather slow because sqlite tends to be slow on workloads that involve more than a little bit of writing to the database. With some changes I made it use a “real” database (MariaDB) which helped. After tuning some parameters on the database server to allow it to use much larger amounts of memory (innodb_buffer_pool_size..) and creating some indexes on the imported data, everything moved along at a nice clip.

As the process went on, it seemed to slow down- early on everything was I/O-bound and status messages were scrolling by too fast for me to see, but after a few hundred thousand records had been processed I could see a significant slowdown. Looking at resource usage, the database was the limiting factor.

It turned out that though I had created indexes in the database on the rows that get queried frequently, it was still performing a full table scan to satisfy the requirement that records be processed in the order which they appear in the WARC file. (I determined this by manually running some queries and having mariadb ANALYZE them for information on how it processed the query.) After creating a composite index of the copy_number and warc_offset columns (which I wasn’t even aware was possible until I read the grammar for CREATE INDEX carefully, and had to experiment to discover that the order in which they are specified matters), the process again became I/O-bound. Where the first 1.2 million records or so were processed in about 16 hours, the last 500 thousand were completed in only about an hour after I created that index.

### warcrefs

Compared to the earlier parts, warcrefs is a quite docile tool, perhaps in part because it’s implemented in Java. I made a few changes to the file describing how Maven should build it so I could get a jar file containing the program and all its required libraries which would be easy to run. With the file-offset issues in warcsum fresh in my mind, I proactively checked for similar issues in warcrefs and found it used int for file offsets throughout (which in Java is always a 32-bit value). I changed the relevant parts to use long instead, avoiding further problems with large files.

As I write this warccollres is still running on a large amount of data, so I can’t truly evaluate the capabilities of warcrefs. I did test it on a small archive which had some duplication and it was successful (verified by manual inspection2).

### warcrefs revisited

I’m writing this section after the above-mentioned run of warccollres finished and I got to run warcrefs over about 30 gigabytes of data. It turned out a few additional changes were required.

1. I forgot to recompile the jar after changing its use of file offsets to use longs, at which point I found the error reporting was awful in that the program only printed the error message and nothing else. It bailed out on reaching a file offset not representable as an int, but I couldn’t tell that until I made it print a proper stack trace.
2. Portions of revisit records were processed as strings but have lengths specified in bytes, so where multibyte characters are involved the computed size is wrong (see the short example after this list). Fortunately, the WARC library used to write output checks these lengths, so I just had to fix the code to count bytes everywhere.
3. Reading records to deduplicate reopened the input file for every record and never closed them, causing the program to eventually reach the system open file limit and fail. I had to make it close those.
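
The second item is a classic pitfall that is easy to demonstrate in any language; in Python, for instance:

text = 'héllo'
print(len(text))                  # 5 characters
print(len(text.encode('utf-8')))  # 6 bytes, which is what a length field must count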

## Results

I got surprisingly good savings out of deduplication on my initial large dataset. Turns out web browser history has a lot more duplication than a typical archive: about 50% on my data, where Eldakar cited a number closer to 15% for general archives.

$ ls -lh
total 47G
-rw-r--r-- 1 tari users  14G Jan 18 15:07 mega_dedup.warc.gz
-rw-r--r-- 1 tari users  33G Jan 17 10:45 mega.warc.gz
-rw-r--r-- 1 tari users 275M Jan 17 11:39 mega.warcsum
-rw-r--r-- 1 tari users 415M Jan 18 13:50 warccollres.txt

The input file was 33 gigabytes, reduced to only 14 after deduplication. I’ve manually checked that all the records appear to be there, so that appears to be true deduplication only. There are 1709118 response records in the archive (that’s the number of lines in the warcsum file), with only 210467 unique responses3, making an average of about 8 copies per response. Perhaps predictably, this implies that the duplicated records tend to be small since the overall savings was much less than 8 times.

## Improvements

At this point deduplication is not a very automated process, since there are three different programs involved and a database must be set up. This would be relatively easy to script, but it hasn’t yet seen enough use for me to be confident in its ability to run unattended.

There are some inefficiencies, especially in warccollres.py which decompresses records in their entirety into memory (where it could stream them or back them with real files to reduce memory requirements for large records). It also requires that there be only one WARC file under consideration, which was a concession to simplicity of implementation.

In the downloading process, I found that it will sometimes get hung up on streams, particularly streaming audio like Hutton Orbital Radio where the actual stream URL appears in browser history. The result of that kind of thing is downloading a “file” of unbounded size at a rather low speed (since it’s delivered only as fast as the audio will be played back).

wpull is a useful tool to replace wget with (that is also mostly compatible, for convenience) which can help address these issues. It supports custom scripts to control its operation in a more fine-grained way, which would probably permit detection of streams so they don’t get downloaded. Also attractive is wpull’s support for running Javascript in downloaded pages, which allows it to capture data that is not served “baked in” to a web page as is often the case on modern web sites, especially “social” ones.

## Concluding

I ended up spending the majority of a weekend hammering out most of this code, from about 11:00 on Saturday through about 18:00 on Sunday with only about an hour total for food-breaks and a too-much-yet-not-enough 6-hour pause to sleep. I might not call it pleasant, but it’s a good feeling to build something like this successfully and before losing interest in it for an indeterminate amount of time.

I have long-term plans regarding software to automate archiving tasks like this one, and that was where my work here started early on Saturday. I’d hope that future manic chunks of time like this one will lead to further progress on that concept, but personal history says this kind of incredibly-productive block of time occurs at most a few times a year, and the target of my concentration is unpredictable4. Call it a goal to work toward, maybe: the ability to work on archiving as an occupation, rather than a sadly neglected hobby.

In any case, if you missed it, the collection of code I put together for deduplication is available on Bitbucket. The history-gathering portions I use are basically exactly as described in the relevant sections, leaving a lot of room for future improvement. Thanks for reading if you’ve come this far, and I hope you find my work useful!
1. I’m not entirely comfortable with that approach, since there is no particular guarantee that any record exists with the specified “coordinates” (time of retrieval and network location) in web-space. However, this approach does maintain sanity even if a WARC is split into its individual records, which is another important consideration. [return]
2. WARC files are mostly plain text with possibly-binary network traffic in between, so it’s relatively easy to browse them with tools like zless and verify everything looks correct. It’s quite convenient, really. [return]
3. SELECT count(id) FROM warcsums WHERE copy_num = 1 [return]
4. In fact, the last time I did something like this I (re)wrote a large amount of chat infrastructure which I still have yet to finish writing up for this blog. [return]

# HodorCSE

Localization of software, while not trivial, is not a particularly novel problem. Where it gets more interesting is in resource-constrained systems, where your ability to display strings is limited by display resolution, and memory limitations may make it difficult to include multiple localized copies of any given string in a single binary. All of this is then on top of the usual (admittedly slight in well-designed systems) difficulty in selecting a language at runtime and maintaining reasonably readable code.

This all comes to mind following discussion of providing translations of Doors CSE, a piece of software for the TI-84+ Color Silver Edition1 that falls squarely into the “embedded software” category. The simple approach (and the one taken in previous versions of Doors CS) to localizing it is just replacing the hard-coded strings and rebuilding. As something of a joke, it was proposed to make additional “joke” translations, for languages such as Klingon or pirate. I proposed a Hodor translation, along the lines of the Hodor UI patch2 for Android. After making that suggestion, I decided to exercise my skills a bit and actually make one.

## Hodor (Implementation)3

Since I don’t have access to the source code of Doors CSE, I had to modify the binary to rewrite the strings. Referring to the file format guide, we know that TI-8x applications are mostly Intel hex with a short header. Additionally, these applications are cryptographically signed, which implies I will need to re-sign the application once I have made my changes.

### Dumping contents

I installed the IntelHex module in a Python virtualenv to process the file into a format easier to modify, though I ended up not needing much capability from there. I simply used a hex editor to remove the header from the 8ck file (the first 0x4D bytes).

Simply trying to convert the 8ck payload to binary without further processing doesn’t work in this case, because Doors CSE is a multipage application. On these calculators Flash applications are split into 16-kilobyte pages which get swapped into the memory bank at 0x4000. Thus the logical address of the beginning of each page is 0x4000, and programs that are not aware of the special delimiters used in the TI format (to delimit pages) handle this poorly.
The raw hex file (after removing the 8ck header) looks like this:

:020000020000FC
:20400000800F00007B578012010F8021088031018048446F6F727343534580908081020382
:2040200022090002008070C39D40C39A6DC3236FC30E70C3106EC3CA7DC3FD7DC3677EC370
:20404000A97EC3FF7EC35D40C35D40C33D78C34E78C36A78C37778C35D40C3A851C9C940F3
:2040600001634001067001C36D00CA7D00BC6E00024900097A00E17200487500985800BDF8
[snip]
:020000020001FB
:204000003A9987B7CA1940FE01CA3440FE02CAFB40FE03CA0541C3027221415D7E23666FAD
:20402000EF7D4721B98411AE84010900EDB0EFAA4AC302723A9B87B7CA4940FE01CA4340B9
:20404000C30272CD4F40C30272CDB540C30272EF67452100002275FE3EA03273FECD63405B

Lines 1 and 7 here are the TI-specific page markers, indicating the beginning of pages 0 and 1, respectively. The lines following each of those contain 32 (0x20) bytes of data starting at address 0x4000. I extracted the data from each page out to its own file with a text editor, minus the page delimiter. From there, I was able to use the hex2bin.py script provided with the IntelHex module to create two binary files, one for each page.

### Modifying strings

With two binary files, I was ready to modify some strings. The calculator’s character set mostly coincides with ASCII, so I used the strings program packaged with GNU binutils to examine the strings in the image.

$ strings page00.bin
HDoorsCSE
##6M#60>
oJ:T
Uo&
dQ:T
[snip]
xImprove BASIC editor
Display clock
Enable lowercase
Always launch Doors CSE
Launch Doors CSE with
PRGM]


With some knowledge of the strings in there, it was reasonably short work to find them with a hex editor (in this case I used HxD) and replace them with variants on the string “Hodor”.

I also found that page 1 of the application contains no meaningful strings, so I ended up only needing to examine page 0. Some of the reported strings require care in modification, because they refer to system-invariant strings. For example, “OFFSCRPT” appears in there, which I know from experience is the magic name which may be given to an AppVar to make the calculator execute its contents when turned off. Thus I did not modify that string, in addition to a few others (names of authors, URLs, etc).
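
Though I made these edits by hand, the same kind of replacement is easy to script. A hypothetical sketch (not what I actually ran) that pads each replacement out to the original string’s length, since inserting or removing bytes would shift every following offset in the image:

def hodorify(data, original, replacement=b'Hodor'):
    # Repeat the replacement and trim it so it is exactly as long as the
    # original string, then substitute it everywhere it appears.
    repeated = replacement * (len(original) // len(replacement) + 1)
    return data.replace(original, repeated[:len(original)])

with open('page00.bin', 'rb') as f:
    page = f.read()

for victim in (b'Display clock', b'Enable lowercase'):
    page = hodorify(page, victim)

with open('page00-hodor.bin', 'wb') as f:
    f.write(page)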

### Repacking

I ran bin2hex.py to convert the modified page 0 binary back into hex, and pasted the contents of that file back into the whole-app hex file (replacing the original contents of page 0). From there, I had to re-sign the binary.4 WikiTI points out how easy that process is, so I installed rabbitsign and went on my merry way:

$ rabbitsign -g -r -o HodorCSE.8ck HodorCSE.hex

### Testing

I loaded the app up in an emulator to give it a quick test, and was met by complete nonsense, as intended.

I’m providing the final modified 8ck here, for the amusement of my readers. I don’t suggest that anybody use it seriously, not for the least reason that I didn’t test it at all thoroughly to be sure I didn’t inadvertently break something.

## Extending the concept

It’s relatively easy to extend this concept to the calculator’s OS as well (and in fact similar string replacements have been done before) with the OS signing keys in hand. I lack the inclination to do so, but surely somebody else would be able to do something fun with it using the process I outlined here.

1. That name sounds stupider every time I write it out. Henceforth, it’s just “the CSE.” [return]
2. The programmer of that one took it surprisingly far, such that all of the code that feasibly can be is also Hodor-filled. [return]
3. Hodor hodor hodor hodor. Hodor hodor hodor. [return]
4. This signature doesn’t identify the author, as you might assume. Once upon a time TI provided the ability for application authors to pay some amount of money to get a signing key associated with them personally, but that system never saw wide use. Nowadays everybody signs their applications with the public “freeware” keys, just because the calculator requires that all apps be signed and the public keys must be stored on the calculator (the freeware keys are preinstalled on all of them). [return]

# "A Sufficiently Smart Compiler"

On a bit of a lark today, I decided to see if I could get Spasm running in a web browser via Emscripten. I was successful, but found that something seemed to be optimizing out most of main() such that I had to hack in my own main function that performed the same critical functions and (for the sake of simplicity) hard-coded the relevant command-line options.

Looking into the problem a bit further, I observed that not all of main() was being removed; there was one critical line left in. The beginning of the function in source and the generated code were as follows.

C++ source:

int main (int argc, char **argv) {
    int curr_arg = 1;
    bool case_sensitive = false;
    bool is_storage_initialized = false;

    use_colors = true;
    extern WORD user_attributes;
    user_attributes = save_console_attributes ();
    atexit (restore_console_attributes_at_exit);

    //if there aren't enough args, show info
    if (argc < 2) {

Generated Javascript (asm.js):

function _main($argc, $argv) {
  $argc = $argc | 0;
  $argv = $argv | 0;
  HEAP8[4296] = 1;
  __Z23save_console_attributesv() | 0;
  return 0;
}


Spasm is known to work in general, so I found it unlikely that LLVM’s optimizer was simply miscompiling otherwise-valid code. Building with optimizations turned off generated correct code, so it was definitely the optimizer breaking this and not some silly bug in Emscripten. Looking a little deeper into the save_console_attributes function, we see the following code:

WORD save_console_attributes () {
#ifdef WIN32
    CONSOLE_SCREEN_BUFFER_INFO csbiScreenBufferInfo;
    GetConsoleScreenBufferInfo (GetStdHandle (STD_OUTPUT_HANDLE), &csbiScreenBufferInfo);
    return csbiScreenBufferInfo.wAttributes;
#endif
}

Since I'm not building for a Windows target (Emscripten's runtime environment resembles a Unix-like system), this was preprocessed down to an empty function (returning void), but it's declared with a non-void return. Smells like [undefined behavior](http://blog.regehr.org/archives/213)! Let's make this function return 0:

WORD save_console_attributes () {
#ifdef WIN32
    CONSOLE_SCREEN_BUFFER_INFO csbiScreenBufferInfo;
    GetConsoleScreenBufferInfo (GetStdHandle (STD_OUTPUT_HANDLE), &csbiScreenBufferInfo);
    return csbiScreenBufferInfo.wAttributes;
#else
    return 0;
#endif
}


With that single change, I now get useful code in main. Evidently LLVM’s optimizer was smart enough to recognize the call to that function invoked UB and optimized out the rest of main.

## Concluding

This issue illustrates nicely the dangers of a sufficiently smart compiler, where updates to your compiler might break otherwise-working code because it’s subtly broken. This is particularly of concern in C, where the compilers tend to go to extreme measures to optimize the generated code and there are a lot of ways to inadvertently invoke undefined behavior.

Static analyzers are a big help in finding these issues. Looking more closely at the compiler output from building Spasm, it emitted a warning regarding this function, as well as several potential buffer overflows of the following form:

    char s[64];
    strncat(s, "/", sizeof(s));


This looks correct, but is subtly broken: the length parameter taken by strncat is the maximum number of characters to append, and it must leave room for the null terminator. The third parameter should be sizeof(s) - 1 in this case (assuming s starts out empty), otherwise the string’s null terminator might be written out of bounds.

## Appendix

The code for my work on this is up on Bitbucket and might be of interest to some readers. I fear that by working on this project I’ve inadvertently committed to becoming the future maintainer of Spasm, which I find to contain a significant amount of poor-quality code. Perhaps I’ll have to write a replacement for Spasm in Rust, which I’ve been quite pleased with as a potential replacement for C: it avoids many of C’s pitfalls and is rather more modern in its capabilities.