Architecting for the Cloud
When I joined APTrust in 2014, my mission was to create a digital preservation repository to be shared by a number of universities throughout the US. Universities would upload their most valuable digital materials for preservation, and our service would store them in S3 and Glacier, on the east coast and on the west coast.
On top of that, we provided a number of additional services, including regular fixity checks, deletion workflows requiring admin approval, PREMIS event logging (PREMIS is a standard from the US Library of Congress), automated restoration spot tests, and more.
At the time, few university libraries had experience using AWS, and most didn’t fully trust it. The universities, and libraries in particular, had been burned by proprietary abandonware, so they hesitated to fully commit to this new “cloud” idea, since it was the product of a giant corporation.
One essential component of my mandate was to build a portable system, so that if AWS didn’t pan out, we could move all of the compute resources onto any other Unix/Linux system, anywhere.
This meant I had to build on a traditional server-based model, where the primary assumption is that all resources (CPU, RAM, disk) are local.
The Server-Based Model
During the data ingest process, the system needs a certain amount of RAM and CPU to process incoming data. Running SHA checksums on terabytes of data was one of the most CPU-intensive aspects of ingest. The system also needed local storage to temporarily house files for checksumming and metadata extraction. Finally, it needed local storage to hold interim processing data, which consisted mainly of checksums and file metadata.
Ingests could fail if the checksums we calculated did not match the checksums in depositor-supplied manifests, if files were missing, or if the deposit payload included files not mentioned in the manifests. They could also fail if some essential metadata was invalid or missing.
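To make those failure modes concrete, here is a rough sketch of what that sort of manifest validation might look like in Go. The function and its types are illustrative stand-ins, not APTrust’s actual code, and it covers only the checksum and file-listing checks:

```go
package ingest

import "fmt"

// ValidateManifest compares the checksums calculated during ingest
// against a depositor-supplied manifest. It returns one error per
// problem so the depositor can see everything wrong with the deposit
// at once.
func ValidateManifest(manifest, calculated map[string]string) []error {
	var errs []error

	// Every file listed in the manifest must be present, and its
	// calculated checksum must match the manifest entry.
	for path, want := range manifest {
		got, ok := calculated[path]
		switch {
		case !ok:
			errs = append(errs, fmt.Errorf("%s is in the manifest but missing from the payload", path))
		case got != want:
			errs = append(errs, fmt.Errorf("checksum mismatch for %s: manifest says %s, we calculated %s", path, want, got))
		}
	}

	// Files in the payload that no manifest mentions are also an error.
	for path := range calculated {
		if _, ok := manifest[path]; !ok {
			errs = append(errs, fmt.Errorf("%s is in the payload but not in the manifest", path))
		}
	}
	return errs
}
```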
When ingest succeeded and all files were properly stored in S3 and Glacier, the system recorded all of the checksums, metadata, and PREMIS events in an external database: a permanent registry that depositors could access through both a Web UI and a REST API.
The upside of this model was that it was easy to deploy. The suite of microservices that handled the ingest process was written in Go, so each service compiled to a single binary with no external dependencies. We used Ansible to push all of the binaries onto an EC2 server, and supervisord to keep them running.
The system worked, and we met the depositors' requirement of a fully portable application suite that could be moved to any vanilla Linux server on a day’s notice.
Limitations of the Server-Based Model
This model works well for applications that have a stable and well-known workload. Our workload, however, was unpredictable and could vary hugely, up to 10,000x. To scale, we would have to shut down the current EC2 instance, then use Ansible to re-deploy the entire stack to a larger or smaller instance.
Because the prior instance might have left interim processing data on EBS volumes, we had to ensure that the new instance mounted all of the volumes that had been attached to the old one.
We also had hard limits on CPU, RAM, and network I/O, dictated by the size of the chosen EC2 instance and, ultimately, by the largest instances AWS could offer.
Local storage limits were an even bigger problem. During normal operations, 500GB of local storage was fine, but sometimes we needed 10TB or more. Amazon’s Elastic File System didn’t work for us when it first launched, because its baseline throughput was just too slow, and we didn’t keep data on disk long enough to accumulate I/O credits.
In addition, after years in production, our once-cautious depositors were gaining faith in us, sending more and more data.
Now that the system had proved itself and was struggling to keep up with load, it was time to ditch the old server-centric architecture and redesign for the cloud.
Re-Architecting for the Cloud
The re-architecture had several goals:
- The system had to scale horizontally. This meant that our seventeen microservices could not store any interim files or interim metadata on local EBS volumes. All data had to be stored where any worker could reach it.
- The system had to be fault-tolerant. If any worker anywhere failed, another worker had to be able to pick up where the failed worker left off, preferably without duplicating work.
- The system had to scale automatically, with no more manual intervention from admins. We needed to run a cheap, skeleton operation during the slow hours of the day and scale up automatically to handle the occasional ingest flood.
Horizontal Scaling
We chose to put our service workers in Docker containers and run them on Amazon’s ECS. We built out a staging system and tested it under various high-stress workloads to discover the trigger points at which each container should scale up or down. Some workers had to scale on CPU usage, others on memory, and others on network I/O.
We moved the interim metadata that all workers needed to access from a local, disk-based key-value store to Redis, so that any worker could access it on the local private subnet. Redis was an excellent solution, because it’s fast and highly available, and the data it stores tends to have a short lifespan (in most cases, a matter of minutes).
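As a rough illustration rather than a copy of our production code, saving and loading one of those interim records with the third-party go-redis client might look something like this. The key scheme, record fields, and 30-minute TTL are all invented for the example:

```go
package interim

import (
	"context"
	"encoding/json"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// FileRecord is an illustrative interim record: checksums and basic
// metadata for one file in a deposit.
type FileRecord struct {
	Path    string            `json:"path"`
	Size    int64             `json:"size"`
	Digests map[string]string `json:"digests"` // e.g. "sha256" -> hex digest
}

// SaveRecord writes the record as JSON under a per-deposit key with a
// short expiration, since interim data usually lives only minutes.
func SaveRecord(ctx context.Context, rdb *redis.Client, depositID string, rec FileRecord) error {
	data, err := json.Marshal(rec)
	if err != nil {
		return err
	}
	key := fmt.Sprintf("deposit:%s:file:%s", depositID, rec.Path)
	return rdb.Set(ctx, key, data, 30*time.Minute).Err()
}

// LoadRecord fetches and unmarshals the record so any worker on the
// private subnet can pick up where another left off.
func LoadRecord(ctx context.Context, rdb *redis.Client, depositID, path string) (FileRecord, error) {
	var rec FileRecord
	data, err := rdb.Get(ctx, fmt.Sprintf("deposit:%s:file:%s", depositID, path)).Bytes()
	if err != nil {
		return rec, err
	}
	err = json.Unmarshal(data, &rec)
	return rec, err
}
```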
The riskiest part of the re-architecture was replacing the big EBS volumes. During the ingest process, our system has to unpack tar files and then read data from each of the unpacked files multiple times (to perform checksums, extract metadata, check file integrity, and so on).
Some of these tar files are several terabytes in size. Some contain hundreds of thousands of files. Could we unpack them all to an S3 staging bucket and have subsequent workers reliably access them?
We wouldn’t know until we tried.
As it turns out, S3 is well suited to a “write-once, read-multiple” workload. As long as your microservices are running in the same region as your S3 bucket (and preferably in the same availability zone), they will have fast, reliable access to the data.
The Go programming language provides good support for stream processing, allowing us to run format identification and checksum calculations without ever having to write files to local disk. We simply stream the bits across the network and through our functions, which run very quickly when no disk writes are involved.
Better still, Go’s standard library includes a multi-writer, which allows us to run a single stream through multiple checksum algorithms (MD5, SHA-1, SHA-256, and SHA-512) in a single pass.
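Here is a simplified sketch of that pattern using the AWS SDK for Go v2. The bucket and key are placeholders, and the function is a stand-in for our actual worker code rather than an excerpt from it:

```go
package checksum

import (
	"context"
	"crypto/md5"
	"crypto/sha1"
	"crypto/sha256"
	"crypto/sha512"
	"encoding/hex"
	"io"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// StreamingDigests pulls an object from a staging bucket and runs the
// stream through four hash algorithms in a single pass, with nothing
// written to local disk.
func StreamingDigests(ctx context.Context, client *s3.Client, bucket, key string) (map[string]string, error) {
	obj, err := client.GetObject(ctx, &s3.GetObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	})
	if err != nil {
		return nil, err
	}
	defer obj.Body.Close()

	md5Hash, sha1Hash := md5.New(), sha1.New()
	sha256Hash, sha512Hash := sha256.New(), sha512.New()

	// io.MultiWriter fans the single network stream out to all four hashers.
	if _, err := io.Copy(io.MultiWriter(md5Hash, sha1Hash, sha256Hash, sha512Hash), obj.Body); err != nil {
		return nil, err
	}

	return map[string]string{
		"md5":    hex.EncodeToString(md5Hash.Sum(nil)),
		"sha1":   hex.EncodeToString(sha1Hash.Sum(nil)),
		"sha256": hex.EncodeToString(sha256Hash.Sum(nil)),
		"sha512": hex.EncodeToString(sha512Hash.Sum(nil)),
	}, nil
}
```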
In our old server-centric infrastructure, the EBS volumes used to store temporary files were our largest expense. They were a wasteful expense as well, since we generally had to keep large volumes on hand for ingest floods, and those volumes sat mostly empty.
Using S3 to stage temporary files gave us better performance, infinite scalability, and massively reduced costs, since staging files have a lifecycle of only minutes or hours. Unlike EBS, where you pay for the volume whether it’s full or not, in S3 you pay only for what you use.
Fault Tolerance
All of our workers keep a full record of the work they’ve completed in Redis. Records are stored in JSON format, which all workers can read. The JSON records tend to be about 1 kilobyte in size. When we ingest a tar file, the system will create one record for the tar file itself, and one for each file inside the tarball. Because some tarballs contain hundreds of thousands of files, this can amount to a lot of data.
Workers read and update these records up to several times per second. A lightly resourced Redis instance has no trouble handling this load.
Workers do occasionally fail at their tasks, the most common reasons being:
- Transient network errors or temporarily unavailable external services.
- ECS killing a worker during the scale-down process. (We do handle the SIGTERM that ECS sends before the final SIGKILL, but occasional long-running processes still have problems.)
- An underlying library hitting a nil-pointer panic or running out of memory. (These are rare.)
In each of these cases, the worker re-queues the task, and a new worker will pick it up, usually within a minute.
Network and external service errors are easy to detect and handle. If a worker runs into too many “Connection reset by peer” or “Service Unavailable” errors, it knows it’s banging its head against a wall, and it requeues the task for someone else to do later.
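A hedged sketch of that give-up-and-requeue decision; the error strings, the threshold, and the helper names are illustrative, not our exact implementation:

```go
package worker

import (
	"context"
	"errors"
	"strings"
)

// isTransient reports whether an error looks like a network blip or a
// temporarily unavailable external service rather than a real
// processing failure. The string matching is deliberately crude and
// purely illustrative.
func isTransient(err error) bool {
	if err == nil {
		return false
	}
	msg := err.Error()
	return strings.Contains(msg, "connection reset by peer") ||
		strings.Contains(msg, "Service Unavailable") ||
		errors.Is(err, context.DeadlineExceeded)
}

// maxTransientErrors is an arbitrary threshold: after this many
// transient failures in a row, the worker stops retrying and requeues
// the task so another worker can try again later.
const maxTransientErrors = 5

func shouldRequeue(consecutiveFailures int) bool {
	return consecutiveFailures >= maxTransientErrors
}
```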
A worker being killed requires special handling. When it receives the SIGTERM, it flushes outstanding metadata to Redis and requeues whatever tasks it’s working on. ECS gives workers about 90 seconds of cleanup time between the SIGTERM and the SIGKILL, but in practice, our cleanup usually completes in 1-2 seconds.
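Below is a minimal sketch of that shutdown path using Go’s os/signal package. The processNext, flushMetadata, and requeue functions are hypothetical stand-ins for whatever a worker actually needs to do:

```go
package worker

import (
	"context"
	"os"
	"os/signal"
	"syscall"
)

// Run processes tasks until ECS asks the container to stop. The
// processNext, flushMetadata, and requeue callbacks are placeholders
// for the worker's real task loop and cleanup logic.
func Run(processNext func(context.Context) error, flushMetadata func() error, requeue func() error) error {
	// signal.NotifyContext cancels ctx when SIGTERM (or Ctrl-C) arrives.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, os.Interrupt)
	defer stop()

	for {
		select {
		case <-ctx.Done():
			// Shutdown requested: flush interim metadata to Redis and
			// requeue in-flight tasks so another worker can finish them.
			if err := flushMetadata(); err != nil {
				return err
			}
			return requeue()
		default:
			if err := processNext(ctx); err != nil {
				return err
			}
		}
	}
}
```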
Automatic Scaling
As mentioned above, ECS handles scale-up and scale-down automatically, provided you supply the right metrics: it monitors your workers, but it needs to know which resource limits to watch when deciding whether to add or remove capacity.
Before moving our new system into production, we extensively tested our two most common problematic workloads to find the trigger points for each worker.
One problematic workload involved ingests that had to process a small number of very large files, for example, ten files of 800GB each. This workload tended to tie up workers for long periods, blocking other ingests.
The other problematic workload involved tarballs containing hundreds of thousands of files. Processing each file involves some overhead, including checksum calculation, format identification, fetching and updating interim processing data, etc. That amounts to a lot of network chatter, and a substantial number of internal operations that affect CPU and memory usage.
As expected, the trigger points differed for each worker. Some were CPU-bound, some memory-bound, and others bound by network I/O.
The Cloud-Based Architecture in Production
Once we figured out the scaling triggers, we pushed the system from staging into demo, where our depositors could have at it. They didn’t stress it much, but their test ingests did expose a number of edge cases that we needed to fix before going into production.
In production, the system has behaved well. During the quieter hours of the day, we run a single instance of each worker in the smallest available containers, for minimal cost. During ingest floods, we scale up without human intervention, and then the system returns to its quiet state on its own.
The overall cost of running these workers in ECS is less than what we were paying to keep a single giant EBS volume on hand in the prior system. That’s a substantial win that pays off every month. And our tech team, now freed of having to monitor and intervene in production systems, can turn its attention to more valuable work.
Reflection
Though it didn’t take advantage of all the cloud had to offer, the original architecture wasn’t wrong. It met a hard stakeholder requirement of reducing risk by being vendor-agnostic and portable.
It ran successfully for years, allowing us to work out bugs and edge cases, and to understand usage patterns. That, in turn, allowed us to clearly define the functional requirements and the cost and performance goals of the new architecture. We could not have developed the new architecture successfully without first defining those requirements and goals.
In fact, poorly defined or wrongly defined requirements and goals are among the most common reasons projects fail. Running a good-enough, slightly overpriced system for a while ultimately allowed us to define what a really good, well-priced system would look like.
For those of you tasked with porting legacy server-based systems to the cloud, I hope this article gives you food for thought. As you make the move, keep in mind that your main task is to change the local-first resource assumption to a network-first resource assumption. Design your systems so that all resources are network-accessible, so that all existing and future workers can access them.
From there, see what services your cloud provider offers. Determine whether those are a good match for your application’s use case, and then assess the costs of those services. That is often the hardest part, especially given the non-obvious usage costs of providers like AWS. The best way to uncover these hidden costs is to create a proof-of-concept system that simulates the workloads you’ll be running in production.
When you make the move to an auto-scaling, self-healing system, you’ll relieve your developers and admins of a lot of angst, and free up time and mindspace for more productive and satisfying work.