Sheprador.com

Building Performance, Availability and Reliability

For the past nine months, I’ve been working at APTrust, building an online digital repository for universities to back up important data. The main point of the system is to enable universities to recover data in the event of a local or regional disaster.

The two big projects I worked on prior to APTrust both had a focus on performance and availability. My current project, by definition, must focus on availability and reliability.

RoomKey: Performance and Availability

Roomkey.com was a hotel metasearch and booking site, similar to Booking.com. While there, I did a lot of load and stress testing of the rates-checking engine I helped to build. When a user searched for a hotel on RoomKey, the rates engine connected to a number of partner systems (Hilton, Marriott, etc.) to check rates and availability for hotel rooms for the selected city and dates.

This presented two challenges. The first was that the rates engine needed to manage a large number of concurrent HTTPS connections to our partners. Some of those partners required us to use client SSL certificates. When we started, there was no good Clojure library that supported both asynchronous client HTTP connections and client SSL certificates.

So I added support for x509 client certificates to the Clojure/Apache HTTP library. At the time, however, the underlying Java Apache HTTP library was moving to a new version and proved too unstable for production. So I added x509 support to Clojure’s http.async.client and we moved forward.
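The changes I made were specific to those Clojure libraries, but as a rough illustration of what client-certificate support involves, here is a minimal sketch using Go’s standard library (the certificate file names and partner URL are placeholders, not anything from RoomKey):

```go
package main

import (
	"crypto/tls"
	"log"
	"net/http"
)

func main() {
	// Load the client certificate and private key issued by the partner.
	// File names here are placeholders.
	cert, err := tls.LoadX509KeyPair("client.crt", "client.key")
	if err != nil {
		log.Fatal(err)
	}

	// Attach the certificate to the client's TLS configuration so it is
	// presented during the handshake with the partner's HTTPS endpoint.
	client := &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				Certificates: []tls.Certificate{cert},
			},
		},
	}

	resp, err := client.Get("https://partner.example.com/rates")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println("status:", resp.Status)
}
```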

The second challenge was that the partner systems all used SOAP, so the rates engine wound up parsing a lot of XML, and some of this XML was quite bloated. At times, the rates engine might be parsing over a million XML documents an hour. No DOM or SAX parser could keep up with that load. Oracle’s Java Profiler showed that XML parsing was by far the biggest user of CPU and memory.

I wound up writing a Clojure wrapper around the Java VTD XML library, which offers extraordinary performance in exchange for a different way of thinking about how to navigate an XML document. VTD provided an enormous performance boost.

Since RoomKey was a consumer site, availability was very important. I worked with a team from Cognizant to do extensive load and stress testing. The idea was to figure out when and what to scale on AWS’s ElasticBeanstalk. We got the basics hammered out, and apparently did a decent job, since there were no scaling problems in the weeks after launch, even as the big hotel chains directed a huge amount of traffic to our site.

RoomKey eventually hired a couple of full-time sysadmin/ops people, which was a big relief. If you have a complex system or any meaningful traffic, you need at least one dedicated ops person. And though I’m competent at setting up and managing systems and even enjoy some of that work, I was a developer at RoomKey, and that’s where I wanted to focus my time. The ops staff did an excellent job keeping everything running smoothly as the system grew in scope and complexity.

EmergenSee: Performance and Availability

I worked on several projects after RoomKey, but the biggest one was EmergenSee: a mobile personal security app that runs on Android and iOS. When you turn the app on, it starts streaming audio, video and location data to central servers running on AWS. The servers notify participating law enforcement agencies that monitor your current location, as well as your friends and contacts. Law enforcement (and your friends) can then see and hear whatever your phone’s video camera and mic are recording. They can see your location on a map, where you’re going and how fast you are moving. They can also chat with you through the app.

A number of universities used this app. If a student walking across campus at night felt threatened, she could turn the app on before anything bad happened, and campus police instantly had her pinpointed on a map and could see and hear what was going on around her. The application’s silent mode allowed domestic abuse victims to stream what was happening at home and to communicate via text-based chat with people who could help them. Anyone could use it to show local law enforcement where a crime or an emergency was occurring. The video and audio provided evidence that could be helpful in court.

I built the back end of this system on AWS, using Python, Django, Postgres, EC2, S3 and Elastic Beanstalk. This system had to be highly available and responsive. An incident like a fire, a bombing, or a riot might lead to many people streaming video into the system at once. Every active device would be sending location data every few seconds, and there might be chat data going back and forth as well.

The observer stations, usually police or private security firms, typically had a number of computers monitoring activity around the clock in a particular geographic area or among a population of subscribers.

Unlike many data-driven web applications, EmergenSee could not rely very heavily on caching, because data was often streaming in by the second, and the monitors had to see up-to-the-second data. (The app could have benefitted from technology like Meteor.js, but Meteor did not exist at the time.) All data had to be preserved as well, since it may someday become evidence in a trial.

As at RoomKey, I did a lot of load testing with JMeter, which could simulate large numbers of app users (who stream a lot of audio, video and location data) and law enforcement users (who constantly poll for new incidents). The point of load testing was to find the bottlenecks and resource limits, and to know what and when to scale up for increased load.

I had expected network I/O to be a bottleneck during times of heavy app-user load, but that was not the case. The bottleneck was CPU, driven by the observer stations, which polled frequently for new data. And much of the CPU burden came from Django turning SQL results into Python objects. [1]

We were able to reduce CPU load by optimizing queries for the most common polling requests, and returning Python dictionaries instead of Django models for those requests. Using custom SQL and dictionary results inside an MVC framework seems like an anti-pattern–and it is. But MVC frameworks were built to solve general problems, and when you start to get into the details of performance, you often find you are dealing with very specific problems. So you break out of the framework from time to time, for the same reasons people sometimes write Python modules in C or mix assembly language into their C code. It’s not ideal, but we live in the real world, where the problems are sometimes messier than our neatly conceived systems.

It’s essential to comment any code that implements non-standard practices or non-obvious solutions, because whoever maintains the code after you is going to wonder why you did what you did. Without comments, a conscientious programmer will want to bring your non-standard code in line with the rest of the code base, inadvertently harming performance. A careless programmer might look at your odd code and think, “Well, if the original programmer was willing to deviate from the standard practice, I guess I’m free to hack in whatever solution is easiest.” Your comments should explain to both these people why you did what you did.

We were able to tune EmergenSee so that it scaled well under load, thanks to JMeter and Stackdriver. We also tuned the Postgres database to perform well, using the indexing principles described in Relational Database Index Design and the Optimizers.

[Update: January, 2023] Though it was a very useful app with a noble goal, EmergenSee had a truly tragic ending. If no similar app exists today, it should.

APTrust: Reliability and Availability

[Note: The paragraphs below describe the first generation of the APTrust system. You’ll find an overview of the current 2023 architecture in Architecting for the Cloud.]

Because APTrust’s purpose is to ensure that data is preserved, reliability is the most important feature of the system. Universities upload intellectual objects to a holding area; a set of ingest services then moves the files to long-term storage, records metadata about the files in a Rails application, and provides additional features like file retrieval, deletion, and periodic checksums to ensure long-term data integrity. (Because we keep multiple copies of each file, we can replace corrupt files with copies that are known to be good.)

An intellectual object, by the way, is a collection of files and metadata. For example, a collection of 100 photos that document US shipbuilding during World War II, along with XML files describing metadata about each photo (photographer, location, date, subject, etc.) may constitute an intellectual object. A single intellectual object may contain one file of only a few kilobytes or thousands of files taking up several terabytes. Each APTrust partner university may have thousands or tens of thousands of objects in the archive.

The system uses quite a bit of network I/O (for transferring files), storage space, and CPU (for calculating MD5 and SHA-256 checksums on hundreds of gigabytes of files at a time). But unlike RoomKey and EmergenSee, it does not have to be particularly fast. It’s perfectly acceptable for items to sit in a staging area for several hours awaiting ingest.
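As a rough sketch of that checksum work (not APTrust’s actual code), Go’s standard library can compute both digests in a single pass over a file, which matters when files run to hundreds of gigabytes:

```go
package main

import (
	"crypto/md5"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
)

// checksums streams a file once and returns its MD5 and SHA-256 digests.
// Hashing in a single pass means the file is read from disk only once.
func checksums(path string) (string, string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", "", err
	}
	defer f.Close()

	md5Hash := md5.New()
	shaHash := sha256.New()

	// MultiWriter feeds every block read from disk to both hashes.
	if _, err := io.Copy(io.MultiWriter(md5Hash, shaHash), f); err != nil {
		return "", "", err
	}
	return hex.EncodeToString(md5Hash.Sum(nil)),
		hex.EncodeToString(shaHash.Sum(nil)), nil
}

func main() {
	md5sum, sha256sum, err := checksums("example.tar")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("md5:", md5sum)
	fmt.Println("sha256:", sha256sum)
}
```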

The system must be reliable and accountable. It must keep a full record of everything that happens to all of the data ever submitted for ingest. APTrust uses a system called Fedora (not to be confused with Fedora Linux) to record metadata and events for items that have been successfully ingested.

However, there’s a lot that can go wrong during the ingest process before Fedora ever knows anything about the intellectual object’s existence. Network failures can prevent ingest services from accessing uploaded files. Incorrectly packaged objects may fail validation. Resource limitations in Fedora can prevent the recording of essential metadata. Network problems can prevent items from being copied to long-term storage. Local storage limitations can cause failures in either the ingest services or the Fedora Rails application.

I had to write the ingest services to anticipate these problems and to recover from them. You can anticipate some things, such as a lack of disk space, by running some checks before you start processing. (For example, don’t start downloading a huge file if you can’t find a local volume with enough free space to store it.) Other things, such as network problems, you cannot anticipate. But you still have to be able to recover from them, and you have to be able to do it in a smart way.
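A minimal sketch of that kind of pre-flight check, assuming a Linux host (the staging path and size threshold are illustrative):

```go
package main

import (
	"fmt"
	"syscall"
)

// hasSpaceFor reports whether the volume containing path has at least
// `need` bytes free. A check like this runs before downloading a large
// tar file for ingest. (Linux-specific sketch; the Statfs_t field types
// differ slightly across platforms.)
func hasSpaceFor(path string, need uint64) (bool, error) {
	var stat syscall.Statfs_t
	if err := syscall.Statfs(path, &stat); err != nil {
		return false, err
	}
	free := stat.Bavail * uint64(stat.Bsize)
	return free >= need, nil
}

func main() {
	ok, err := hasSpaceFor("/mnt/staging", 200*1024*1024*1024) // 200GB
	if err != nil {
		fmt.Println("statfs failed:", err)
		return
	}
	fmt.Println("enough space:", ok)
}
```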

For example, if you’ve downloaded a 200GB tar file and then something fails, you want to resume processing without having to download the file again, because downloading that much data can be slow and expensive. Similarly, if half of the 20,000 files in an object were successfully copied to long-term storage before the network went down, you want to copy only the remaining 10,000 when the network comes back up. If the system was not able to ingest an object for any reason, you want to know the point at which the process failed and the full state of the item when it failed.
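A hedged sketch of the idea, written in Go (the language the ingest services are built in, as described below): the types here are illustrative, not APTrust’s actual structs, but they show how per-file state makes the storage step resumable.

```go
package ingest

import "time"

// FileRecord and ObjectState are illustrative types. They record what has
// already been done for each file in an intellectual object.
type FileRecord struct {
	Path     string
	Size     int64
	StoredAt time.Time // zero value means "not yet copied to long-term storage"
}

type ObjectState struct {
	Identifier string
	Files      []FileRecord
}

// FilesNeedingStorage returns only the files that had not been copied when
// the previous attempt failed, so a retry skips the files that already made
// it to long-term storage.
func (s *ObjectState) FilesNeedingStorage() []FileRecord {
	var pending []FileRecord
	for _, f := range s.Files {
		if f.StoredAt.IsZero() {
			pending = append(pending, f)
		}
	}
	return pending
}
```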

We built APTrust’s ingest services in Go, which is very good at managing multiple concurrent processes that may block on network or disk I/O. Go’s concurrency model follows Hoare’s Communicating Sequential Processes (CSP), which is well suited to coordinating concurrent and asynchronous work.

The CSP pattern is similar to what you see in an automobile factory. The car goes down one line where the engine is put into the frame. On the next line, the seats and interior items are installed. On the next line, the body panels are installed. The next line is for painting.
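In Go, each line of that factory maps naturally onto a goroutine connected to the next by a channel. Here is a toy version of the pattern; the stage names are illustrative, not APTrust’s actual services:

```go
package main

import "fmt"

// stage is one line of the assembly plant: a goroutine that reads items
// from one channel and hands them to the next stage via another channel.
func stage(name string, in <-chan string) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		for item := range in {
			out <- item + " -> " + name
		}
	}()
	return out
}

func main() {
	items := make(chan string)
	go func() {
		defer close(items)
		for i := 1; i <= 3; i++ {
			items <- fmt.Sprintf("object-%d", i)
		}
	}()

	// Chain the stages: download -> validate -> store -> record metadata.
	done := stage("record", stage("store", stage("validate", stage("download", items))))
	for item := range done {
		fmt.Println(item)
	}
}
```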

CSP makes concurrent and asynchronous processes very easy to conceptualize. It also enables you to ensure that your data is in a consistent and valid state as it moves from one process to the next. When used in combination with a simple task queue (we use Bitly’s NSQ), items moving through the system can fail at any step of the way and then be picked up again and processed from the last completed step.

As items make their way through the APTrust ingest process, they are accompanied by a JSON-serializable Go structure that records information about the object’s current state. NSQ includes a number of topics for different stages of the ingest process. The Go services each perform a step of the ingest process (download, validation, storage, metadata recording, replication storage, etc.), then pass the item’s ID into the next NSQ topic, while storing its JSON state data in a disk-backed key-value store called Bolt.
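A rough sketch of that save-and-restore step, reusing the illustrative ObjectState type from the earlier sketch and the original boltdb/bolt package (the bucket name is made up):

```go
package ingest

import (
	"encoding/json"

	"github.com/boltdb/bolt"
)

// saveState serializes an object's ingest state to JSON and stores it in Bolt
// under the object's identifier, so a later worker (or a restart) can pick up
// exactly where processing stopped.
func saveState(db *bolt.DB, state *ObjectState) error {
	data, err := json.Marshal(state)
	if err != nil {
		return err
	}
	return db.Update(func(tx *bolt.Tx) error {
		bucket, err := tx.CreateBucketIfNotExists([]byte("ingest_state"))
		if err != nil {
			return err
		}
		return bucket.Put([]byte(state.Identifier), data)
	})
}

// loadState retrieves and deserializes the saved state, or returns nil if the
// object has no saved state yet.
func loadState(db *bolt.DB, identifier string) (*ObjectState, error) {
	var state *ObjectState
	err := db.View(func(tx *bolt.Tx) error {
		bucket := tx.Bucket([]byte("ingest_state"))
		if bucket == nil {
			return nil
		}
		data := bucket.Get([]byte(identifier))
		if data == nil {
			return nil
		}
		state = &ObjectState{}
		return json.Unmarshal(data, state)
	})
	return state, err
}
```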

If the system goes down at any time, it can resume where it left off when it comes back up. If an item cannot be ingested, the JSON representation of its state goes into a trouble queue. In the beta version of the system, the trouble queue revealed a number of problems with network communications and with the Go services sending data to the Rails application that Rails did not expect. When these bugs were fixed, we could requeue items in the trouble queue, and processing would resume where it left off, thanks to the JSON state information preserved in the Bolt database.

In building a system whose primary focus is reliability, I was surprised at how much code was dedicated to anticipating and handling the failures of external systems. I haven’t done a formal count, but I believe half, or more than half, of the code in the system is related to anticipating or recovering from failure conditions.

It’s been an interesting challenge, and a good learning experience. Learning to use Go to build a system that reliably processes many terabytes of data has been the best part of the process. While Clojure solves problems of parallelism, concurrency and asynchronous programming with immutability [2], refs, atoms, and threading, Go addresses many of the same problems with simple channels and lightweight goroutines. [3] I think those features, coupled with a good standard library and the fact that Go’s syntax and data structures are familiar to developers accustomed to C-like languages, have gone a long way toward driving Go’s widespread adoption.

Clojure is about as elegant as a language can get, and it does seem to attract very high-calibre developers. I don’t know if it will get the market share that Go will, simply because learning to think in Lisp is very challenging for most procedural and OO programmers. (It was for me, but when I finally became proficient, it was very satisfying.) But Clojure’s learning curve and the elegant way it makes you break down problems can be a plus, since you pretty much know you’ll be working with smart people if you’re working on a Clojure project.


Notes

1. I did some load and stress testing on Rails while I was at Hotelicopter, which was the precursor to RoomKey. Both Django and Rails showed high CPU usage when the system was running queries whose results had to be converted into a large number of model objects. I expected the Django application to eventually blow out the system’s memory and CPU and stop responding entirely, the way the Rails app had. But that never happened. Memory plateaued under Django, and response time slowed, but the system never stopped responding.

Rails and Django solve similar problems in similar ways, but I strongly prefer Django for its explicitness and generally good programming practices. Rails employs some abhorrent programming practices, such as concatenating strings at runtime, evaluating them, and then making the evaluated result a permanent method or attribute of a class. Debugging that kind of code takes tremendous effort, and in Rails, you will be debugging that kind of code a lot.

In practice, a Django developer spends most of his time designing and writing good code, while a Rails developer spends huge amounts of time debugging code whose implicit assumptions conflict with his explicit instructions.

2. For lists and collections, Clojure uses Software Transactional Memory to implement Multiversion Concurrency Control, which provides the appearance of immutability. Clojure does this well, allowing the programmer to proceed as if his data structures are immutable.

3. Go also includes mutexes for the rare times you need them, but you won’t need them nearly as often as you do in Java, C#, and other languages that rely on threading and shared state.