This is Part 9 of the Fundamentals of DevOps and Software Delivery series.
In Part 8, you learned how to protect your data in transit and at rest. In
this blog post, you’ll learn about other aspects of data, including storage, querying,
and replication. What data am I referring to? Just about all software relies on data: social networking apps need
profile, connection, and messaging data; shopping apps need inventory and purchase data; fitness apps need
workout and activity data.
Data is one of your most valuable, longest-lived assets. In all likelihood, your data will outlive your shiny
web framework, your orchestration tool, your service mesh, your CI/CD pipeline, most employees at your company, and
perhaps even the company itself, starting a second life as part of an acquisition. Data is important, and this
blog post will show you how to manage it properly, covering the following use cases:
Local storage
Hard drives
Primary data store
Relational databases
Caching
Key-value stores and content distribution networks (CDNs)
File storage
File servers and object stores
Semistructured data and search
Document stores
Analytics
Columnar databases
Asynchronous processing
Queues and streams
Scalability and availability
Replication and partitioning
Backup and recovery
Snapshots, continuous backups, and replication
As you go through these use cases, this blog post will walk you through hands-on examples,
including deploying a PostgreSQL database, automating schema migrations, configuring backups and replication, serving
files from S3, and using CloudFront as a CDN. Let’s jump right in by learning about hard drives.
Local Storage: Hard Drives
The most basic form of data storage is to write to your local hard drive. The following are the most common types of
hard drives used today:
Physical hard drives on prem
If you use physical servers in an on-prem data center, you typically use hard drives that are physically
attached to those servers. A deep dive on hard drive technology is beyond the scope of this blog post series. All
I’ll say for now is that you’ll want to look into different types of hard drives (e.g., magnetic, SSD), hard drive
interfaces (e.g., SATA, NVMe), and techniques for improving reliability and performance, such as redundant array of independent disks (RAID).
Network-attached hard drives in the cloud
If you use the cloud, you typically attach hard drives over the network. Examples of these types of hard drives include Amazon Elastic Block Store (EBS) and
Google Persistent Disk (full list).
Network-attached drives are mounted in the local filesystem, providing a filesystem path you can read from and write
to that looks and behaves exactly like a local, physically attached hard drive. The advantage of network-attached
drives is that you can use software (e.g., OpenTofu, Pulumi) to detach and reattach them (e.g., as part of a
deployment); the drawback is higher latency.
Shared hard drives in the cloud and on prem
For some use cases, such as file serving (which you’ll read about in Section 9.4), sharing a single network-attached hard drive among multiple servers can be advantageous, so they can all read from and
write to the same disk. Several popular protocols are used for sharing hard drives over the network: Network File System (NFS), Common Internet File System (CIFS), and Server Message Block (SMB). Some cloud providers offer
managed services that use these protocols under the hood, such as Amazon Elastic File System (EFS) and
Google Cloud Filestore (full list).
Volumes in container orchestration tools
By default, the filesystem of a container is ephemeral, so any data you write to it will be lost
when that container is replaced. If you need to persist data for the long term, you need to
configure your orchestration tool to create a persistent volume and mount it at a specific path within the
container. The software within that container can then write to that path just as if it’s a normal local hard drive,
and the data in that persistent volume will be retained even if the container is redeployed or replaced. Under the
hood, the orchestration tool may handle the persistent volume differently in different deployment environments. For
example, if you’re using Kubernetes in AWS (EKS), you might get an EBS volume; in Google Cloud (GKE), you might
get a Google Persistent Disk; and on your local computer (Docker Desktop), you might get a folder on your local hard
drive.
Running Data Stores in Containers
Containers are designed to be easy to distribute, scale, and throw away (hence the default of ephemeral disks), which
is great for stateless apps and local development, but not for data stores in production. Not all data stores, data
tools, and data vendors support running in containers, and not all orchestration tools support persistent volumes
(and those that do often have immature implementations). I prefer to run data stores in production by using managed
services, such as Amazon’s Relational Database Service (as you’ll see in Section 9.2.4).
I’d run a data store in a container only if my company was all-in on Kubernetes, which has the most mature persistent
volume implementation, and we had significant operational experience with it.
Just because you have a local hard drive doesn’t mean you should always use it. Years ago, as a summer intern at a
financial services company, I was tasked with writing a load-generator app that could test how the company’s financial
software handled various traffic patterns. This app needed to record the responses it got from the financial software,
and as I knew nothing about data storage at the time, I decided to write that data to a file on the local hard drive,
using a custom file format I made up. This quickly led to problems:
Querying the data
Once I started running tests with my load-generator app, my coworkers would ask me questions about the results. What
percentage of the requests were successful? How long did the requests take, on average? What response codes did I
get? To answer each of these questions, I had to write more and more code to extract insights from my custom file
format.
Evolving the data format
I’d occasionally have to update the file format used by the load-generator app, only to later realize that I could no
longer read files written in the old format.
Handling concurrency
To be able to generate sufficient load, I realized I’d have to run the load-generator app on multiple computers. My
code couldn’t handle this at all, as it knew only how to write data on one computer and couldn’t handle concurrency.
Eventually, the summer came to a close, and I ran out of time before I could fix all these issues. I suspect the
company quietly discarded my load-generator app after that. The problems I ran into—querying the data, evolving the
data format, handling concurrency—are something you have to deal with anytime you store data. As you’ll see shortly,
solving these problems takes a long time (decades), so whenever you need to store data, instead of using a custom file
format on the local hard drive, you should store it in a dedicated, mature data store.
You’ll see examples of data stores later in this blog post, such as relational databases,
document stores, and key-value stores. For now, the main takeaway is that these dedicated data stores should be
the only stateful systems in your architecture—that is, the only systems that use their local hard drives to store
data for the long term (persistent data). All your apps should be stateless, using their local hard
drives only to store ephemeral data that it’s OK to lose. Keeping apps stateless makes them easier to deploy, maintain, and
scale, and ensures that your data is stored in systems designed for data storage (with built-in solutions for
querying, data formats, concurrency, and so on).
Key Takeaway 1
Keep your applications stateless. Store all your data in dedicated data stores.
Let’s now turn our attention to some of these dedicated data stores, starting with the primary data store for most
companies, the relational database.
Primary Data Store: Relational Databases
Relational databases have been the dominant data storage solution for decades—and for good reason. They are flexible;
do a great job of maintaining data integrity and consistency; can be configured for remarkable scalability and
availability; offer a strong security model; have a huge ecosystem of tools, vendors, and developers;
store data efficiently (temporally and spatially); and are the most mature data storage technology available.
The last point, the maturity of relational databases, is worth focusing on. Consider the initial release dates of some
of the most popular relational databases
(full list):
Oracle (1979), Microsoft SQL Server (1989), MySQL (1995), PostgreSQL (1996, though it evolved from a codebase developed in the
1970s), and SQLite (2000). These databases have been in development for 25–50 years, and they are still in active
development today.
Data storage is not a technology you can develop quickly. As Joel Spolsky wrote,
good software takes at least
a decade to develop; with databases, it may be closer to two decades. That’s how long it takes to build a
piece of software that can be trusted with one of your company’s most valuable assets, your data, so that
you can be confident it won’t lose the data, won’t corrupt it, and won’t leak it.
One of the key takeaways from Part 8 is that you should not roll your own cryptography
unless you have extensive training and experience in that discipline; the same is true of data stores. The only time it
makes sense to create your own is when your use case falls outside the bounds of all existing data stores,
which is a rare occurrence that typically happens only at massive scale (the scale of a Google, Meta,
or X). And even then, do it only if you have at least a decade to spare.
Key Takeaway 2
Don’t roll your own data stores; always use mature, battle-tested, proven, off-the-shelf solutions.
Relational databases not only are mature solutions, but as you’ll see shortly, they also provide a set of tools that make
them reliable and flexible enough to handle a remarkably wide variety of use cases, from being embedded directly within
your application (SQLite can run in-process or even in a browser) all the way up to clusters of thousands of servers
that store petabytes of data. By comparison, just about all the other data storage technologies you’ll learn about in
this blog post are much younger than relational databases, and are designed for only a narrow set of
use cases. This is why most companies use relational databases as their primary data stores—the source of truth for
their data.
In this section, we’ll take a brief look at how relational databases handle the following data storage
concepts:
Reading and writing data
ACID transactions
Schemas and constraints
Later in this blog post, you’ll be able to compare how other data stores handle these same concepts.
Let’s start with reading and writing data.
Reading and Writing Data
A relational database stores data in tables, which represent a collection of related items. Each item is
stored in a row, and each row in a table has the same columns. For example, if you were working on a website for
a bank and needed to store data about its customers, you might have a customers table with each row representing
one customer as a tuple of id, name, date_of_birth, and balance, as shown in Table 16.
Table 16. The customers table
id
name
date_of_birth
balance
1
Brian Kim
1948-09-23
1500
2
Karen Johnson
1989-11-18
4853
3
Wade Feinstein
1965-02-29
2150
Relational databases require you to define a schema to describe the structure of each table before you can write any
data to that table. You’ll see how to define the schema for the customers table in Table 16 a little later
in this blog post. For now, let’s imagine the schema already exists and focus on how to read and write
data. To interact with a relational database, you use Structured Query Language (SQL).
Watch Out for Snakes: SQL Has Many Dialects
In theory, SQL is a language standardized by the American National Standards Institute (ANSI) and ISO, and is the same across all relational databases. In
practice, every relational database has its own dialect of SQL that is slightly different. In this blog post series,
I’m focusing on SQL concepts that apply to all relational databases, but I had to test my code somewhere, so
these examples use the PostgreSQL dialect.
The SQL to write data is an INSERT INTO statement, followed by the name of the table, the columns to insert, and the
values to put into those columns. Example 144 shows how to insert the three rows from
Table 16 into the customers table.
How do you know whether these INSERT statements worked? One way is to try reading the data back out. To read data with a
relational database, you use the same language, SQL, to formulate queries. The SQL syntax for queries is a SELECT
statement, followed by the columns you wish to select, or the wildcard * for all columns, then FROM, followed by
the name of the table to query. Example 145 shows how to retrieve all the data from the
customers table.
As you’d expect, this query returns the three rows inserted in Example 144. You can filter the
results by adding a WHERE clause with conditions to match. Example 146 shows a SQL query that
selects customers born after 1950, which should return just two of the three rows.
SQL is ubiquitous in the world of software, so it’s worth taking the time to learn it, as it will help you build
applications, do performance tuning, perform data analysis, and more. That said, going into all the details of SQL is
beyond the scope of this blog post series; see Part
9 recommended reading if you’re interested. All I’ll say for now is that SQL and the
relational model are exceptionally flexible, allowing you to query your data in countless ways; for example, you can
use WHERE to filter data; ORDER BY to sort data; GROUP BY to group data; JOIN to query data from multiple
tables; COUNT, SUM, AVG, and a variety of other aggregate functions to perform calculations on your data;
indices to make queries faster; and more. If I had used a relational database for that load-generator app when I was
a summer intern, I could’ve replaced thousands of lines of custom query code with a dozen lines of SQL.
The flexibility and expressiveness of SQL is one of the many reasons most companies use relational databases as their primary data stores. Another major reason is ACID transactions, as discussed next.
ACID Transactions
A transaction is a set of coherent operations that should be performed as a unit. In relational databases,
transactions must meet the following four properties:
Atomicity
Either all the operations in the transaction happen, or none of them do. Partial successes or partial failures are
not allowed.
Consistency
The operations always leave the data in a state that is valid according to all the rules and constraints you’ve
defined in the database.
Isolation
Even though transactions may be happening concurrently, the result should be the same as if the transactions
had happened sequentially.
Durability
Once a transaction has completed, it is recorded to persistent storage (typically, to a hard drive) so that even in
the case of a system failure, that transaction isn’t lost.
These four properties taken together form the acronym ACID, and it’s one of the defining properties of just about
all relational databases. For example, going back to the bank example with the customers table, imagine that the bank
charged a $100 annual fee for each customer. When the fee was due, you could use a SQL UPDATE statement to deduct
$100 from every customer, as shown in Example 147.
A relational database will apply this change to all customers in a single ACID transaction. Either the
transaction will complete successfully, and all customers will end up with $100 less, or no customers will be
affected at all. This may seem obvious, but many of the data stores you’ll see later in this blog post
do not support ACID transactions, so it would be possible for those data stores to crash partway through this
transaction and end up with some customers having $100 less and some unaffected.
Relational databases also support transactions across multiple statements. The canonical example is
transferring money, such as moving $100 from the customer with ID 1 (Brian Kim) to the customer with ID 2 (Karen
Johnson), as shown in Example 148.
All the statements between START TRANSACTION and COMMIT will execute as a single ACID transaction, ensuring that
one account has the balance decreased by $100, and the other increased by $100, or neither account will be affected
at all. If you were using one of the data stores from later in this blog post that don’t support ACID
transactions, you could end up in an in-between state that is inconsistent. For example, the first statement completes,
subtracting $100, but then the data store crashes before the second statement runs, and as a result, the $100 simply
vanishes into thin air. With a relational database, this sort of thing is not possible, regardless of crashes or
concurrency. This is a major reason relational databases are a great choice as your company’s source of truth. Another major reason for using relational data stores is the support for schemas
and constraints, as discussed next.
Schemas and Constraints
Relational databases require you to define a schema for each table before you can read and write data to that table.
To define a schema, you again use SQL, this time with a CREATE TABLE statement, followed by the name of the
table and a list of the columns. Example 149 shows the SQL to create the customers table in
Table 16.
This code creates a table called customers with columns called id, name, date_of_birth, and
balance. Note that the schema also includes integrity constraints to enforce business rules, such as
the following:
Domain constraints
These limit the kind of data you can store in the table. For example, each column has a
type, such as INT, VARCHAR, and DATE, so the database will prevent you from inserting data of the
wrong type. Also, the id column specifies SERIAL, which is a pseudo type (an alias) that gives you a convenient
way to capture three domain constraints: first, it sets the type of the id column to INT; second, it adds a
NOT NULL constraint, so the database will not allow you to insert a row that is missing a value for
this column; third, it sets the default value for this column to an automatically incrementing sequence, which
generates a monotonically increasing ID that is guaranteed to be unique for each new row. This is why the id column
ends up with IDs 1, 2, and 3 in Example 145.
Key constraints
The primary key is a column or columns that can be used to uniquely identify each row in the table. Example 149 makes id the primary key, so the database will ensure that every row has a unique value for this column.
Foreign-key constraints
These allow one table to reference another table. For example, since bank customers could have
more than one account, each with its own balance, instead of having a single balance column in the customers
table, you could create a separate table called accounts, as shown in Example 150.
The ID of the customer who owns this account. The REFERENCES keyword labels this column as a foreign key into
the id column of the customers table. This will prevent you from accidentally inserting a row into
the accounts table that has an invalid customer ID.
Foreign-key constraints are another defining characteristic of relational databases—another major reason they are
a good source of truth. These constraints allow you to express and enforce relationships among tables (this is what the
"relational" in "relational database" refers to), which is essential to maintaining the referential integrity of your
data.
Key Takeaway 3
Use relational databases as your primary data store (the source of truth), as they are secure, reliable, and mature, and they
support schemas, integrity constraints, foreign-key constraints, joins, ACID transactions, and a flexible query
language (SQL).
In addition to using CREATE TABLE to define the schema for new tables, you can use ALTER TABLE to modify the schema
for existing tables (e.g., to add a new column). Carefully defining and modifying a schema is what allows you to evolve
your data storage over time without running into backward-compatibility issues, as I did with my load-generator app.
Initially, you might manage schemas manually, connecting directly to the database and executing CREATE TABLE and
ALTER TABLE commands by hand. However, as is often the case with manual work, this becomes error prone and tedious.
Over time, the number of CREATE TABLE and ALTER TABLE commands piles up, and as you add more and more environments
where the database schema must be set up (e.g., dev, stage, prod), you’ll need a more systematic way to manage your
database schemas. The solution, as you saw in Part 2, is to manage your schemas as code.
In particular, schema migration tools can help, such as Redgate Flyway, Liquibase, and Knex.js (full list). These
tools allow you to define your initial schemas and all the subsequent modifications as code, typically in an ordered
series ofmigration files that you check into version control. For example, Flyway uses standard SQL in .sql files
(e.g., v1_create_customers.sql, v2_create_accounts.sql, and v3_update_customers.sql), whereas Knex.js uses a
JavaScript DSL in .jsfiles (e.g., 20240825_create_customers.js,20240827_create_accounts.js, and
20240905_update_customers.js). You apply these migration files by using the schema migration tool, which keeps
track of which of your migration files have already been applied and which haven’t, so no matter what state your
database is in, or how many times you run the migration tool, you can be confident your database will end up with the
desired schema.
As you make changes to your app, new versions of the app code will rely on new versions of your database schema. To
ensure that these versions are automatically deployed to each environment, you will need to integrate the schema migration
tool into your CI/CD pipeline (something you learned about in Part 5). One approach is to run the
schema migrations as part of your app’s boot code, just before the app starts listening for requests. The main
advantage of this option is that it works not only in shared environments (e.g., dev, stage, prod), but also in every
developer’s local environment, which is not only convenient, but also ensures your schema migrations are constantly
being tested. The main disadvantage is that migrations sometimes take a long time, and if an app takes too long to boot,
some orchestration tools will think there’s a problem and try to redeploy the app before the migration can finish.
Also, if you are running serverless apps, which already struggle with cold starts, you shouldn’t add anything to the
boot code that makes it worse. In these cases, you’re better off running migrations as a separate step in your
deployment pipeline, just before you deploy the app.
Now that you’ve seen the concepts behind relational databases, let’s see those concepts in action with a real-world
example.
Example: PostgreSQL, Lambda, and Schema Migrations
In this section, you’ll go through an example of deploying PostgreSQL, a popular open
source relational database, using Amazon’s Relational Database Service (RDS), a
fully managed service that provides a secure, reliable, and scalable way to run several types of relational
databases, including PostgreSQL, MySQL, Microsoft SQL Server, and Oracle Database. You’ll then manage the schema for this
database by using Knex.js and deploy a Lambda function to run a Node.js app that connects to the PostgreSQL database over
TLS and runs queries.
Here are the steps to set this up:
Create an OpenTofu module.
Create schema migrations.
Create the Lambda function.
Let’s start with creating the OpenTofu module.
Create an OpenTofu module
Example Code
As a reminder, you can find all the code examples in the blog post series’s repo in GitHub.
Head into the folder you’ve been using for this blog post series’s examples and create a new subfolder for this
blog post, and within it, a new OpenTofu root module called lambda-rds:
You can deploy PostgreSQL on RDS by using a reusable module called rds-postgres, which is
in the blog post series’s sample code repo in the ch9/tofu/modules/rds-postgres folder.
To use this module, create a file called main.tf in the lambda-rds module, with the initial contents shown in
Example 151.
variable"username"{description="Username for master DB user."type=string}variable"password"{description="Password for master DB user."type=stringsensitive=true}
These input variables allow you to pass in the username and password via environment variables, so you don’t have to
put these secrets directly into your code (as you learned in Part 8, do not store secrets
as plaintext!). Next, update main.tf with the code shown in Example 153 to deploy a Lambda function.
This code uses the same lambda module you’ve seen multiple times throughout this
blog post series to deploy a serverless Node.js app:
1
The source code for the function will be in the src folder. You’ll see what this code looks like shortly.
2
Use environment variables to pass the Lambda function all the details about the database, including the database
name, hostname, port, username, and password.
3
Create a Lambda function URL that will trigger the Lambda function in response to HTTP(S) requests.
Finally, add output variables for the Lambda function URL, as well as the database name, host, and port, to an
outputs.tf file, as shown in Example 154.
output"function_url"{description="The URL of the Lambda function"value=module.app.function_url}output"db_name"{description="The name of the database"value=module.rds_postgres.db_name}output"db_host"{description="The hostname of the database"value=module.rds_postgres.hostname}output"db_port"{description="The port of the database"value=module.rds_postgres.port}
Now that the OpenTofu code is defined, let’s move on to the schema migrations.
Create schema migrations
To create the schema migrations, create a src folder within the lambda-rds module:
console
$ mkdir-psrc
$ cdsrc
Next, create a package.json file with the contents shown in Example 155.
These commands install the following dependencies:
knex
This is the Knex.js library. The first npm install command installs it so it’s available to your Lambda function, and the second npm install command installs it with the --global flag so it’s available as a CLI tool in your terminal.
pg
This is the node-postgres library that Knex.js will use to talk to PostgreSQL.
You’re now ready to configure how Knex.js will connect to PostgreSQL. Knex.js will talk to PostgreSQL over the network,
and to protect this communication, PostgresSQL encrypts connections by using TLS (which you learned about in
Part 8). To validate the database’s TLS certificate, you need to do the following two steps:
Download the certificates for the CA that signed PostgreSQL’s TLS certificate:
Since you’re using RDS to run PostgreSQL, AWS is the CA. Download its certificate for the us-east-2 region from
the AWS website,
in PEM format. Save it under the filename rds-us-east-2-ca-cert.pem in the src folder.
Configure your app to trust the CA certificate:
Configure Knex.js to use the CA certificate by creating a file called knexfile.js, with the contents shown in
Example 156.
Use the PostgreSQL library (node-postgres) to talk to the database.
2
Read the CA certificate you just downloaded from the AWS website.
3
This JSON object configures the connection to use the database name, host, port, username, and password from the
environment variables you passed to the Lambda function in the OpenTofu code, and to validate the TLS certificate by
using the CA cert you read in 2. Note that this code also allows you to pass in the database username and
password, using environment variables of the form TF_VAR_xxx; you’ll see how this is used shortly.
4
Enable debug logging so you can see the queries Knex.js is running.
Next, create your first schema migrations as follows:
console
$ knexmigrate:makecreate_customers_tables
This creates a migrations folder, and within it, a file called <TIMESTAMP>_create_customers_table.js,
where TIMESTAMP indicates the time that you ran the knex migrate:make command. Replace the contents of this
file with Example 157.
With Knex.js, you manage your schemas in sequential .js files as follows:
1
The up function is where you define how to update the database schema.
2
Create the customers table with the same schema you first saw in Example 149, except
instead of using raw SQL (e.g., CREATE TABLE), you use a fluent JavaScript API (e.g., createTable()).
3
Populate the database with some initial data, adding the same three customers to the customers
table that you initially saw in Example 144, again using a fluent JavaScript API instead of raw
SQL.
4
The down function is where you define how to undo the schema changes in the up function. This gives you a way
to roll back changes in case of bugs, outages, or as part of testing. The code here deletes the customer table.
Now that you’ve defined your schema migrations, let’s fill in the Lambda function.
Create the Lambda function
Let’s create a Lambda function that can connect to the PostgreSQL database over TLS, perform some queries, and return
the results as JSON. Create app.js, which is the entry point for this function, with the contents shown in
Example 158.
Load the database connection configuration from knexfile.js.
2
Create a Knex.js client, using the configuration from 1.
3
Use the Knex.js client to perform the exact database query you saw in Example 146, which
fetches all customers born after 1950.
4
Return the results of the query as JSON.
You are now ready to deploy. First, set the TF_VAR_username and TF_VAR_password environment variables to the
username and password for the database master user:
Now that the PostgreSQL database is deployed, you can use the Knex CLI to apply schema migrations. Normally, you’d
integrate this step into your CI/CD pipeline, but for this example, you can apply the schema migrations
from your own computer. First, you need to expose the database name, host, and port that you just saw in the output
variables via the environment variables knexfile.js is expecting (you’ve already exposed the username and password as
environment variables):
If the migrations apply successfully, your database should be ready to use. To test it out, copy the URL in the
function_url output variable and open it up to see if the database query in that Lambda function returns the
customers born after 1950:
If you see a JSON response, congrats, you’ve successfully applied schema migrations to a PostgreSQL database, and you
have a web app running in AWS that’s able to talk to a PostgreSQL database over TLS!
Get Your Hands Dirty
Here are a few exercises you can try at home to go deeper:
The rds-postgres module deploys the database into the public subnets of the default VPC. Update the code to instead
deploy into the private subnets of a custom VPC (see Part 7). You’ll have to
update the way the Lambda function connects to the database (e.g., see Amazon RDS Proxy).
The Lambda function is using the master user to query the database. Switch to a more limited database user.
You may wish to run tofu destroy now to clean up your infrastructure so you don’t accumulate any charges.
Alternatively, you may want to wait until later in this blog post, when you update this example code
to enable backups and replicas for the database. Either way, make sure to commit your latest code to Git.
Now that you’ve had a look at the primary data store use case, let’s turn our attention to the next use case:
caching.
Caching: Key-Value Stores and CDNs
A cache is a way to store a subset of your data so you can serve that data with lower latency. The cache
achieves lower latency than the primary data store by storing the data in memory rather than on disk (refer back to
Table 9), and/or by storing the data in a format optimized for rapid retrieval (e.g., a
hash table) rather than flexible query mechanics (e.g., relational tables). A cache can also reduce latency in the
primary data store by offloading a portion of the primary data store’s workload.
The simplest version of a cache is an in-memory hash table directly in your application code. Example 159 shows
a simplified example of such a cache.
The cache is a hash table (aka map or object) that the app stores in memory.
2
Check whether the data you want is already in the cache. If so, return it immediately, without making an expensive query.
3
If the data isn’t in the cache, perform the expensive query, store the result in the cache (so future lookups
are fast), and return the result. This is known as the cache-aside strategy.
I labeled this approach "simplified" for the following reasons:
Memory usage
As written, the cache will grow indefinitely, so if you have enough unique keys, your app may run out of memory.
Real-world caching mechanisms typically need a way to configure a maximum cache size and a policy for evicting data
when that size is exceeded (e.g., evict the oldest or least frequently used entries).
Concurrency
Depending on the programming language, you may have to use synchronization primitives (e.g., locking) to
handle concurrent queries that update the cache.
Cold starts
If the cache is only in memory, then every single time you redeploy the app, it will start with an empty cache, which
may cause performance issues.
Cache invalidation
The code handles read operations but not write operations. Whenever you write to
the primary data store, you need to update the cache too. Otherwise, future queries will return stale data.
The first and second issues are reasonably easy to resolve with better code. The third and fourth issues are more
challenging. Cache invalidation in particular is one of those problems that’s much harder than it seems.[34] If you have, say, 20 replicas of your app, all with code similar to this example, then every time you write to your primary data store, you need to find a way to
(a) detect the change has happened and (b) invalidate or update 20 caches.
This is why, except for simple cases, the typical way most companies handle caching is by deploying a centralized
data store dedicated to caching. This way, you avoid cold starts, and you have only a single place to update
when you do cache invalidation. For example, you might do write-through caching: whenever you write to your
primary data store, you also update the cache. The two most common types of data stores that you use for centralized
caching are key-value stores and CDNs.
Key-Value Stores
Key-value stores are optimized for a single use case: fast lookup by a known identifier (key). They are
effectively a hash table distributed across multiple servers. The idea is to deploy the key-value store between
your app servers and your primary data store, as shown in Figure 82, so requests that are in the cache (a
cache hit) are returned quickly, without having to talk to the primary data store, and only requests that
aren’t in the cache (a cache miss) go to the primary data store (after which they are added to the cache
to speed up future requests).
Figure 82. Key-value stores act as a cache between app servers and primary data stores
Some of the major players in the key-value store space include Redis/Valkey[35] and Memcached (full list). The API for
most key-value stores primarily consists of just two types of functions, one to insert a key-value pair and one to look
up a value by key. For example, with Redis, you use SET to insert a key-value pair and GET to look up a key:
console
$ SETkeyvalue
OK$ GETkey
value
Key-value stores do not require you to define a schema ahead of time (sometimes referred to as schemaless, but this
is a misnomer, as you’ll learn in Section 9.5.3), so you can store any kind of value you want.
Typically, the values are either simple scalars (e.g., strings, integers) or blobs containing arbitrary data
that is opaque to the key-value store. Since the data store is aware of only keys and basic types of
values, functionality is typically limited. You shouldn’t
expect support for flexible queries, joins, foreign-key constraints, ACID transactions, or many of the other powerful
features of a relational database.
Key Takeaway 4
Use key-value stores to cache data, speeding up queries and reducing load on your primary data store.
You can deploy key-value stores yourself, or you can use managed services, such as Redis Cloud and Amazon ElastiCache
(full list). Once you
have a key-value store deployed, many libraries can automatically use them for cache-aside and write-through caching
without you having to implement those strategies manually. For example, you can automatically cache WordPress content by using the Redis and
Memcached plugins; and you can automatically cache any content you access via the Java Database Connectivity (JDBC) APIs by using the Redis Smart Cache plugin.
Let’s now look at the second type of data store commonly used for caching: CDNs.
CDNs
A content delivery network (CDN) consists of servers that are distributed all over the world, called points of presence (PoPs), that cache data from your origin servers (your app servers), and serve that data from a PoP
that is as close to the user as possible. Whereas a key-value store goes between your app servers and
your database, a CDN goes between your users and your app servers, as shown in Figure 83.
Figure 83. CDNs act as a cache between your users and your app servers
When a user makes a request, it first goes to the PoP closest to that user, and if the content is already
cached, the user gets a response immediately. If the content isn’t already cached, the PoP forwards the request to your
origin servers, caches the response (to make future requests fast), and then returns it to the user. Some of the major
players in the CDN space include CloudFlare, Akamai, Fastly, and Amazon Cloudfront
(full list). CDNs offer several
advantages:
Reduce latency
CDN servers are distributed all over the world; Akamai, for example, has more than 4,000 PoPs
in over 130 countries. This allows you to serve content from locations that are closer to your users,
which can significantly reduce latency (refer back to Table 9), without your
company having to invest the time and resources to deploy and maintain app servers all over the world.
Reduce load
Once the CDN has cached a response for a given key, it no longer needs to send a request to the underlying app
server for that key—at least, not until the data in the cache has expired or been invalidated. If you have a good
cache hit ratio (the percentage of requests that are a cache hit), this can significantly reduce the load on the
underlying app servers.
Improve security
Many CDNs provide additional layers of security, such as a web application firewall (WAF), which can
inspect and filter HTTP traffic to prevent certain types of attacks (e.g., SQL injection, XSS), and distributed denial-of-service (DDoS) protection, which shields you from malicious attempts to overwhelm your servers with
artificial traffic generated from servers around the world.
Other benefits
CDNs have gradually been offering more and more features that let you take advantage of their massively
distributed network of PoPs. Here are just a few examples: with edge computing, the CDN allows you to run
small bits of code on the PoPs, as close to your users (as close to the "edge") as possible; with compression, the
CDN automatically uses algorithms such as gzip or Brotli to minimize bandwidth usage; and with localization, knowing
which local PoP was used allows you to choose the language in which to serve content.
CDNs are most valuable for content that is the same for all of your users and doesn’t change often. For
example, news publications can usually offload a huge portion of their traffic to CDNs, as once an article is published,
every user sees the same content, and that content isn’t updated too often. On the other hand, social networks and
collaborative software can’t leverage CDNs as much, as every user sees different content, and the content changes often.
Key Takeaway 5
Use CDNs to cache static content, reducing latency for your users and reducing load on your servers.
Virtually all companies can benefit from a CDN when serving completely static content, such as
images, videos, binaries, JavaScript, and CSS. Instead of having your app servers waste CPU and memory on serving static content, you can offload most of this work to a CDN. In fact, many companies choose not to have their app
servers involved in static content at all, not even as an origin server for a CDN, and instead offload all static
content to dedicated file servers and object stores, as described in the next section.
File Storage: File Servers and Object Stores
One type of data most companies have to deal with comes in the form of static files. Some of these are files created
by your company’s developers, such as the JavaScript, CSS, and images you use on a website. Others are files
created by your customers, such as the photos and videos users might upload to a social media app.
You could store static files in a typical database (e.g., as a blob), which has the advantage of keeping all your data in a single system that already has security controls, data backups, monitoring, and so on. However, using a database for static content also has many disadvantages:
Slower database
Storing files in a database bloats the size of the database, making everything slower. Databases are already
a common bottleneck to scalability and availability (as you’ll learn in Section 9.8);
storing files in them only makes that worse.
Slower and more expensive replicas and backups
Replicating and backing up a larger database is more expensive and slower.
Increased latency
Serving files from your database to a web browser requires you to proxy each file through an app server, which
increases latency.
CPU, memory, and bandwidth overhead
Proxying files in a database through an app server increases bandwidth, CPU, and memory usage, both on the app
server and the database.
Instead of storing static files in a database, you typically store and serve them from dedicated file servers or object
stores.
File Servers
A file server is designed to store and serve static content, such as images, videos, binaries,
JavaScript, and CSS, so that your app servers can focus entirely on serving dynamic content (content that is
different for each user and request). Requests first go to a CDN, which returns a response immediately if it is already
cached, and if not, the CDN uses your app servers and file servers as origin servers for dynamic and static content,
respectively, as shown in Figure 84.
Figure 84. App servers serve dynamic content, file servers serve static content, and the CDN caches what it can of each
Most web-server and load-balancer software can easily be configured to serve files, including all the ones you saw in
Part 3 (e.g., Apache, nginx, HAProxy). The hard part is handling the following:
Storage
You need to provide sufficient hard drive capacity to store the files.
Metadata
You typically need to store metadata related to the files, such as owner, upload date, and tags. You
could store the metadata on the filesystem next to the files themselves, but the more common approach is to store
it in a separate data store (e.g., a relational database) that makes it easier to query the metadata.
Security
You need to control who can create, read, update, and delete files. You may also need to
encrypt data at rest and in transit, as you learned in Part 8.
Scalability and availability
You could host all the files on a single server, but as you know from Part 3, a single server is
a single point of failure. To support a lot of traffic, and to be resilient to outages, you typically need to use
multiple servers.
Solving these problems for a small number of files can be straightforward, but if you get to the scale of a Snapchat, whose users upload more than
4
billion pictures per day, these become considerable challenges that require lots of custom tooling, huge numbers of
servers and hard drives, RAID, NFS, and so on. One way to make these challenges easier is to offload much of this work
to an object store.
Object Stores
An object store (sometimes called a blob store) is a system designed to store opaque objects, or blobs, often in the
form of files with associated metadata. Typically, these are cloud services, so you can think of object stores as
a file server as a service. Some of the major players in this space are Amazon Simple Storage Service (S3), Google
Cloud Storage (GCS), and Azure Blog Storage
(full list). Object stores
provide out-of-the-box solutions to the challenges you saw with file servers in the preceding section:
Storage
Object stores provide nearly unlimited disk space, usually for low prices. Amazon S3, for example, is around $0.02 per
GB per month, with a generous free tier.
Metadata
Most object stores allow you to associate metadata with each file you upload. For instance, S3 allows you to configure
both system-defined metadata (e.g., standard HTTP headers such as entity tag and content type, as you’ll see in Section 9.4.3) and user-defined metadata (arbitrary key-value pairs).
Security
Most object stores offer access controls and encryption. S3, for example, provides IAM for access control, TLS
for encryption in transit, and AES for encryption at rest.
Scalability and availability
Object stores typically provide scalability and availability at a level few companies can achieve. S3
provides 99.999999999% durability and 99.99% availability, for example.
Many object stores also provide a variety of other useful features, such as automatic archival or deletion of older files, replication across data centers in different regions, search, analytics, and integration with compute.[36] This combination of features is why even companies that otherwise keep everything on prem often turn to the cloud and object stores for file storage.
Key Takeaway 6
Use file servers and object stores to serve static content, allowing your app servers to focus on serving dynamic
content.
To get a better sense for file storage, let’s go through an example.
Example: Serving Files with S3 and CloudFront
In this section, you’re going to set up scalable, highly available, globally distributed static content hosting by
going through the following three steps:
Create an S3 bucket configured for website hosting.
Upload static content to the S3 bucket.
Deploy CloudFront as a CDN in front of the S3 bucket.
Let’s start with creating the S3 bucket.
Create an S3 bucket configured for website hosting
Head into the folder you’ve been by using for this blog post series’s examples, and create a folder for a new OpenTofu
root module called static-website:
You can deploy a website on S3 using a module called s3-website, which is
in the blog post series’s sample code repo in the ch9/tofu/modules/s3-website folder.
The s3-website module creates an S3 bucket, makes its contents publicly accessible, and configures it as a website,
which means it can support redirects, error pages, and so on. To use the s3-website module, create
a file called main.tf in the static-website module, with the contents shown in Example 160.
provider"aws"{region="us-east-2"}module"s3_bucket"{source="brikis98/devops/book//modules/s3-website"version="1.0.0" # TODO: fill in your own bucket name!name="fundamentals-of-devops-static-website"(1)index_document="index.html"(2)}
This code creates an S3 bucket configured as follows:
1
The name to use for the S3 bucket. S3 bucket names must be globally unique, so you’ll have to fill in your own
bucket name here.
2
The suffix to use for directory requests. For example, if you set this to index.html, a request for the directory
/foo will return the contents of /foo/index.html.
Now that you have an S3 bucket, let’s put some content in it.
Upload static content to the S3 bucket
As an example of static content, let’s create an HTML page that includes CSS and an image. First, create a content
folder within the static-website folder:
console
$ mkdir-pcontent
In this folder, create index.html with the contents shown in Example 161.
<htmllang="en"><head><title>Fundamentals of DevOps and Software Delivery</title><linkrel="stylesheet"href="styles.css"></head><body><h1>Hello, World!</h1><p>This is a static website hosted on S3, with CloudFront as a CDN.</p><imgsrc="cover.png"alt="Fundamentals of DevOps and Software Delivery"></body></html>
This is a simple HTML page that renders an image and is decorated with a stylesheet. Add an image of your choice
into the content folder, naming it cover.png. For the stylesheet, create styles.css with the contents shown in
Example 162.
This code uses the aws_s3_object resource to upload files as follows:
1
Have the aws_s3_object resource loop over a map; the key is a file to upload from the content folder, and the value is the content type for that file.
2
Upload the files to the S3 bucket you created earlier.
3
For each file, use the key in the map as its path within the S3 bucket.
4
Read the contents of each file from the content folder.
5
Set the entity tag (ETag) HTTP response header to the MD5 hash of each file’s contents. This tells web browsers whether a file has changed and should be redownloaded. OpenTofu also uses this value to know whether a file needs to be reuploaded.
6
Set the content type HTTP response header for each file to the value in the map. This tells web browsers how to display the contents of the file (e.g., browsers know to render text/html as HTML, image/png as a PNG image, and so on).
7
Set the cache control HTTP response header for each file. The public directive indicates that it’s safe for shared caches (e.g., CDNs) to cache this resource, while max-age=300 indicates that shared caches and web browsers can cache this content for up to 300 seconds (5 minutes).
Watch Out for Snakes: Don’t Upload Files to S3 via OpenTofu
I don’t recommend the aws_s3_object resource for managing a large number of files in production because of (a)
performance and throttling issues, and (b) the lack of a way to process the files (e.g., minification, fingerprinting,
compression). Instead, use an asset pipeline (e.g., Rails
Asset Pipeline with the asset_sync Gem) or a library designed to sync files
with S3 (e.g., s3_website).
Finally, add output variables in outputs.tf as shown in Example 164.
output"s3_website_endpoint"{description="The endpoint for the website hosted in the S3 bucket"value=module.s3_bucket.website_endpoint}
Deploy this code as usual, authenticating to AWS, and running init and apply:
console
$ tofuinit
$ tofuapply
When apply completes, you should see the s3_website_endpoint output. Open http://<S3_WEBSITE_ENDPOINT>
(S3 websites support only HTTP; CloudFront supports HTTPS, as you’ll see shortly), and you’ll see something like
Figure 85.
Figure 85. The static website in S3
If the page shows up correctly, congrats, you’re successfully using S3 as a file server!
Let’s now put a CDN in front of it.
Deploy CloudFront as a CDN in front of the S3 bucket
To deploy CloudFront as a CDN, you can use a module called cloudfront-s3-website, which is in the blog post series’s sample code repo in the ch9/tofu/modules/cloudfront-s3-website folder.
The cloudfront-s3-website module creates a globally distributed CloudFront distribution, configures your static
website in S3 as an origin, sets up a domain name and TLS certificate, and plugs in some basic caching settings.
Update main.tf to use the cloudfront-s3-website module as shown in Example 165.
Pass in the S3 bucket name. This is mostly used as the unique ID within the CloudFront distribution.
2
Pass in the S3 bucket website endpoint. CloudFront will use this as the origin, sending requests to it for any
content that isn’t already cached.
3
The time-to-live (TTL) settings tell CloudFront how long it should cache content before
sending a new request to the origin server (the S3 bucket). The preceding code sets the minimum and default TTL to
0, which tells CloudFront to configure caching based on the content headers (e.g., cache control), and the maximum
TTL to 300 seconds (5 minutes), which tells CloudFront not to cache anything longer than that, regardless of
headers (which is convenient for testing).
4
Configure CloudFront to return the contents of index.html whenever someone makes a request to the root of your
CloudFront distribution’s domain name.
Add the CloudFront distribution’s domain name as an output variable in outputs.tf, as shown in
Example 166.
output"cloudfront_domain_name"{description="The domain name of the CloudFront distribution"value=module.cloudfront.domain_name}
Rerun init and apply:
console
$ tofuinit
$ tofuapply
CloudFront can take 2–10 minutes to deploy, so be patient. When apply completes, you should see the
cloudfront_domain_name output variable. Open https://<CLOUDFRONT_DOMAIN_NAME> (yes, HTTPS this time!) in your web
browser, and you should see the same content as in Figure 85. Congrats, you’re now serving and caching
static content via a network of 600+ CloudFront PoPs dispersed all over the world!
Get Your Hands Dirty
Here are a few exercises you can try at home to go deeper:
Configure CloudFront to use a
custom domain
name and TLS certificate. For example, use static.<YOUR-DOMAIN> as the domain name (using the domain name you registered
in Part 7), and use ACM to provision a TLS certificate.
When you’re done testing, commit your changes to Git, and run tofu destroy to clean everything up again. Now that
you’ve seen how to store files, let’s turn our attention to the next use case, which is handling semistructured data
and search.
Semistructured Data and Search: Document Stores
Relational databases are a great choice when your data has a clear, consistent, and predictable structure, which allows you to store the data in tables with well-defined schemas and to perform queries
on well-defined column names. However, this isn’t always the case. For example, if you are building software similar
to a wiki, where users can create arbitrary documents, tags, categories, labels, and so on, fitting all this into a static relational schema may be tough. For these use cases dealing with semistructured data, a
document store may be a better fit.
A document store is similar to a key-value store, except the values are richer
data structures called documents that the document store natively understands, so you get access to more advanced
functionality for querying and updating that data.
Popular general-purpose document stores include MongoDB and Couchbase
(full list).
Some document stores are optimized for search, building search indices on top of the documents
so you can use free-text search, faceted search, and so on. Popular options for search include
Elasticsearch/OpenSearch[37] and Algolia
(full list).
This section takes a brief look at document stores by considering the same data storage concepts you
saw with relational databases:
Reading and writing data
ACID transactions
Schemas and constraints
We’ll start with reading and writing data.
Reading and Writing Data
To get a sense of how document stores work, let’s use MongoDB as an example. MongoDB allows you to store JSON documents
in collections, somewhat analogously to the way a relational database allows you to store rows in tables. MongoDB does
not require you to define a schema for your documents, so you can store JSON data in any format you want. To read and
write data, you use the MongoDB Query Language (MQL), which is similar to JavaScript. Example 167
shows how you can use the insertMany command to store JSON documents in a collection called bank.
Example 167. Insert documents into the bank collection (ch9/mongodb/bank.js)
This is the same bank example you saw with relational databases earlier in this blog post, with the
same three customers as in Table 16. To read data back out, you can use the find command as shown in
Example 168.
Example 168. Look up all documents in the bank collection (ch9/mongodb/bank.js)
You get back the exact documents you inserted, except for one new item: MongoDB automatically adds an _id field to
every document, which it uses as a unique identifier, similar to a primary key. You can look up a document by ID as shown in Example 169.
The big difference between key-value stores and document stores is that document stores can natively understand and
process the full contents of each document rather than treating them as opaque blobs. This gives you richer query
functionality. Example 170 shows how to find all customers born after 1950, the
same query you saw in SQL in Example 146.
You also get richer functionality when updating documents. Example 171 shows
how to use the updateMany command to deduct $100 from all customers, similar to the SQL UPDATE you saw in
Example 147.
All this richer querying and update functionality is great, but it has two major limitations. First, most
document stores do not support working with multiple collections; there is no support for
joins.[38] Second, most document
stores don’t support ACID transactions, as discussed next.
ACID Transactions
The code in Example 171 has a serious problem: most document stores
don’t support ACID transactions.[39] You might get atomic operations on a single document (e.g., if you
updated one document with the updateOne command), but you rarely get them for updates to multiple documents. That means
it’s possible for that code to deduct $100 from some customers but not others (e.g., if MongoDB crashes in the middle
of the updateMany operation).
This is not at all obvious from the code, and many developers who are not aware of this are caught off guard when their
document store operations don’t produce the results they expect. This is one of many gotchas with using nonrelational
databases, especially as your source of truth. Other major gotchas include dealing with eventual consistency, as you’ll
see in Section 9.8.3, and the lack of support for schemas and constraints, as discussed next.
Schemas and Constraints
Most document stores do not require you to define a schema or constraints up front. This is sometimes referred to as
schemaless, but that’s a bit of a misnomer. The reality is that there is always a schema. The only
question is whether you enforce a schema-on-read or a schema-on-write.
Relational databases enforce a schema-on-write,
which means the schema and constraints must be defined ahead of time, and the database will allow you to write only
data that matches the schema and constraints. Most document stores, such as MongoDB, don’t require you to define the
schema or constraints ahead of time, so you can structure your data however you want, but eventually, something will
read that data, and that code will have to enforce a schema-on-read to be able to parse the data and do something
useful with it. For example, to parse data from the bank collection you saw in the previous section, you might create
the Java code shown in Example 172.
This Java class defines a schema and constraints: you’re expecting field names such
as name and balance with types String and int, respectively. More accurately, this is an example of
schema-on-read, as this class defines the schema you’re expecting from the data store, and either the data you read
matches the Customer data structure, or you will get an error. Since document stores don’t enforce schemas or
constraints, you can insert any data you want in any collection, as in
Example 173.
Did you catch the error? The code uses birth_date instead of date_of_birth. Whoops. MongoDB will
allow you to insert this data without any complaints, but when you try to parse this data with the Customer class,
you may get an error. And this is just one of many types of errors you may get with schema-on-read. Since most document
stores don’t support domain constraints or foreign-key constraints, you will also have to worry about typos in field
names, incorrect types for fields, IDs that reference nonexistent documents in other collections, and so on.
Dealing with these errors when you read the data is hard, so it’s better to prevent these errors in the first place by
blocking invalid data on write. That’s an area where schema-on-write has a decided advantage, as it allows you to
ensure that your data is well formed by enforcing a schema and constraints in one place, the (well-tested) data store,
instead of trying to enforce it in dozens of places, including in every part of your application code, every script, and every console interaction.
That said, schema-on-read is advantageous if you are dealing with semistructured or nonuniform data. I wouldn’t
use a document store for highly structured bank data, but I might use one for user-generated documents, event-tracking
data, and log messages. Schema-on-read can also be advantageous if the schema changes often. With a relational database,
certain types of schema changes take a long time or even require downtime. With schema-on-read, all you have to do is
update your application code to be able to handle both the new data format and the old one, and your migration is done.
Or, to be more accurate, your migration has just started, and it will happen incrementally as new data gets written.
Key Takeaway 7
Use document stores for semistructured and nonuniform data, when you can’t define a schema ahead of time, or for
search, when you need free-text search, faceted search, etc.
You should consider one other trade-off when choosing between schema-on-read and schema-on-write: performance. With schema-on-write,
as with a relational database, the data store knows the schema ahead of time, and the schema is the same for all the
data in a single table, so the data can be stored efficiently, both in terms of disk space usage and the
performance of disk lookup operations. With schema-on-read, as with a document store, since each document can have a
different schema, the data store has to store the schema with that document, which is less efficient. This is
one of the reasons that data stores designed for performance and efficiency typically use schema-on-write.
This includes data stores designed to extract insights from your data by using analytics, as discussed in the next section.
Analytics: Columnar Databases
Some data storage technologies are optimized for storing your data in a format that makes analyzing it
easier and faster. This is part of the larger field now called data science, which
combines statistics, computer science, information science, software engineering, and visualization to extract
insights from your data. A deep dive on data science is beyond the scope of this blog post series, but it is worth briefly
touching on some of the data storage technologies involved, as deploying and maintaining these systems often
falls under the purview of DevOps.
Under the hood, many analytics systems are based on columnar databases, so this section covers the basics
of what a columnar database is and then looks at common columnar database use cases.
Columnar Database Basics
On the surface, columnar databases (aka column-oriented databases) look similar to relational databases. They store data in tables that consist of rows and columns, they usually have you define a schema ahead of time, and
sometimes they support a query language that looks similar to SQL. However, a few major differences exist.
First, most columnar databases do not support ACID transactions, joins, foreign-key constraints, and many other key
relational database features. Second, the key design principle of columnar databases, and the source of their name,
is that they are column oriented, which means they are optimized for operations across columns, whereas relational
databases are typically row oriented, which means they are optimized for operations across rows of data. This is best
explained with an example. Consider the books table shown in Table 17.
Table 17. The books table
id
title
genre
year_published
1
Clean Code
tech
2008
2
Code Complete
tech
1993
3
The Giver
sci-fi
1993
4
World War Z
sci-fi
2006
How does this data get stored on the hard drive? In a row-oriented relational database, the values in each row will be
kept together, so conceptually, the serialized data might look similar to Example 174.
Example 174. How a row-oriented database may serialize data on disk
[1] Clean Code,tech,2008
[2] Code Complete,tech,1993
[3] The Giver,sci-fi,1993
[4] World War Z,sci-fi,2006
Compare this to the way a column-oriented store might serialize the same data, as shown in
Example 175.
Example 175. How a column-oriented database may serialize data on disk
In this format, all the values in a single column are laid out sequentially, with the contents of a column as keys
(e.g., Clean Code or tech), and the IDs as values (e.g., 1 or 1,2). Now consider the query shown in
Example 176.
Because this query uses SELECT *, it will need to read every column for any matching rows. With the row-oriented
storage in Example 174, the data for all the columns in a row is laid out sequentially on the hard
drive, whereas with the column-oriented storage in Example 175, the data for each column is
scattered across the hard drive. Hard drives perform better for sequential reads than random reads, so for this sort of
query, especially with a large amount of data (think millions or billions of rows), the row-oriented approach will be
considerably faster. Compare that to the query in Example 177.
This query uses an aggregate, SELECT COUNT(*), so it will need to read only the values in the year_published column
to satisfy the WHERE clause, and then count the number of matches. With the row-oriented storage in
Example 174, this requires jumping all over the hard drive to read the year_published value for
each row, whereas with the column-oriented storage in Example 175, all the data for the
year_published column is laid out sequentially. Therefore, for this sort of query, with millions or billions of rows, the
column-oriented approach will be considerably faster. When you’re doing analytics, aggregate functions such as
COUNT, SUM, AVG come up all the time, which is why the column-oriented approach is used in many analytics use cases, as described next.
Analytics Use Cases
The following are some of the most common use cases for columnar databases:
General-purpose columnar databases
Popular columnar databases that you can use for a wide variety of use cases include Apache Cassandra, Google Bigtable, and Apache
HBase (full list).
Time-series databases
Many data stores designed for storing and analyzing time-series data, such as metrics and logs, also use
column-oriented designs, because performing aggregate queries on this sort of data is common (e.g., show me the
average response time for my app). Popular time-series databases include Prometheus and InfluxDB
(full list).
Big data
Big data is a somewhat fuzzy term for data sets that are vastly larger than what you can process using
traditional data analytics systems, such as relational databases and document stores.
This is a significant challenge for companies operating at massive scale, such as Google, Meta, and X. In 2004, Google released a paper on
MapReduce that
described its approach to batch processing huge amounts of data by using distributed systems. This kicked off a wave of
big data tools. Some of the major players in this space include Apache Hadoop (an open source MapReduce implementation)
and Cloudera (full list).
Fast data
Big data systems typically do batch processing, generating analytics from large data sets by running in the
background on a periodic basis (e.g., once per day). Fast data systems are designed to do stream processing
(a topic you’ll learn more about in Section 9.7.2), generating analytics from large data sets
by running continuously, incrementally processing streams of data on a near real-time basis (e.g., in
milliseconds). Some of the major players in this space include Apache Spark and Apache Flink
(full list).
Data warehouses
A data warehouse is a central repository that integrates data from all your other systems, as shown in
Figure 86. With all your data in one place, you can perform a variety of analytics and generate reports. Data warehouses are often column oriented and use specialized schemas (e.g., star and snowflake schemas)
optimized for analytics. Major players in the data warehouse space include Snowflake, Amazon Redshift,
and Google BigQuery (full list).
Figure 86. Integrating all your data in a data warehouse
Key Takeaway 8
Use columnar databases for time-series data, big data, fast data, data warehouses, and anywhere else you need to
quickly perform aggregate operations on columns.
Figure 86 is highly simplified. First, what look like simple arrows from the various
systems to the data warehouse are actually complicated background processes known as extract, transform, and load
(ETL) that extract data from one system in one format, transform it into the format used by another system
(cleaning up and standardizing the data along the way), and then load it into that other system. This is complicated
enough that a whole ecosystem of ETL tools has sprung up, including Apache Airflow, Stitch, and Qlik
(full list). Second, there are not
only arrows from each system to the data warehouse, but arrows between many of the other systems too, representing
background jobs, event-based communication, and so on. All this falls into the realm of asynchronous processing,
which is the topic of the next section.
Asynchronous Processing: Queues and Streams
In Part 7, you learned about how microservices can communicate with one another by using tools
such as service discovery and service meshes. This approach typically uses synchronous communication: service A
sends a request to service B and waits for service B to send a response before moving on. In many use
cases, you’ll want to use asynchronous communication: service A sends a message to service B and then
moves on without waiting for the response. Service B can process that message at its own pace, and if it needs to
send a response, it can send a new asynchronous message to service A.
Note that the asynchronous style of communication has a hidden implication: you want to be sure that each
message is (eventually) processed. Losing a message or processing it multiple times could have negative consequences
for your business. Therefore, you don’t typically just send these messages from service A directly to service B, and
have service B hold on to the message in memory. Depending on the number of messages, this could use up more
memory than service B has available, and if service B crashes, all messages it hasn’t processed will be lost.
Instead, service A sends messages to, and service B reads messages from, a data store designed to facilitate
this type of asynchronous communication by persisting messages to disk and tracking the state of those messages
(e.g., whether they have been processed). Broadly speaking, two types of data stores can do this:
Message queues
Event streams
This section dives into each of these.
Message Queues
A message queue is a data store that can be used for asynchronous communication between producers, who write
messages to the queue, and consumers, who read messages from the queue, as shown in Figure 87.
Figure 87. The steps of using a message queue
The typical process of using a queue is as follows:
A producer, such as service A, publishes a message to the queue.
The queue persists the message to disk. This ensures that the message will eventually be processed, even if the queue or
either service has an outage.
A consumer, such as service B, periodically polls the queue for new messages.
When there is a new message, the queue returns the message to service B. The queue may record that the message is "in
progress" so that no other consumer can read the message at the same time.
Service B processes the message.
Once the message has been successfully processed, service B deletes the message from the queue. This ensures that the
message is processed only one time.
Queues are most often used for tasks that run in the background as opposed to tasks you do during a live request
from a user. For example, if you are building an app that lets users upload images, and you need to process each image
(e.g., create copies of the image in different sizes for web, mobile, and thumbnail previews), you may want to do
that in the background rather than making the user wait for it. To do that, your frontend server stores the original
image on a file server and adds a message to a queue with the location of the image. Later, a separate consumer
process reads the message from the queue, downloads the image from the file server, processes the image, and when it’s
done, deletes the message from the queue. Other common use cases include encoding videos, sending email campaigns,
delivering notifications, generating reports, and order processing.
Popular message queues include RabbitMQ, Apache ActiveMQ, and ZeroMQ
(full list). Queues
provide several key benefits:
Handling traffic spikes
A queue acts as a buffer between your services, which allows you to deal with spikes in traffic. If service A and B
were communicating synchronously, and traffic suddenly increased by 10 times, B might not be able to keep up with the
load, and you’d have outages and lost messages. With the queue in between, service A can write as many messages as
it wants, and service B can process them at whatever rate it can handle.
Decoupling
With synchronous communication, every service needs to know the interface to talk to every other service. In a large
company, one service may use JSON over HTTP, whereas another uses gRPC; one service may be in a service mesh that
requires mTLS, while another is a legacy service that is not in the mesh; and so on. Connecting many disparate
services can be a massive undertaking. With asynchronous communication via a message queue, each service solely needs
to know how to talk to one thing, the API used by the message queue, so it gives you a decoupled, standardized
mechanism for communication.
Guaranteeing tasks are completed
If service A sends a synchronous message to service B but never gets a response or gets an error, what do you do?
Most code just errors out. Some code uses retry logic, but if you’re not careful, this might result in service B
processing the message multiple times, or, if service B is overloaded, it might make the problem worse. Using
asynchronous communication with a message queue allows you to guarantee that each task is (eventually) completed,
even in the face of outages and crashes, as the queue persists message data and metadata (e.g., whether that message
has been processed). Note that in distributed systems theory, guaranteeing that a message is delivered exactly once is
provably impossible (if you’re curious why, look up the two generals problem), so in practice, most message queues
provide at least once delivery, and you create consumers who are idempotent (can ignore duplicate messages).
Guaranteeing ordering and priority
Some message queues can guarantee not only at-least-once delivery but also that messages are delivered in a
specific order. First-in, first-out (FIFO) queues, for example, deliver messages in the order they were received,
and priority queues deliver messages with the highest priority first.
Key Takeaway 9
Use message queues to run tasks in the background, with guarantees that tasks are completed and executed in a specific
order.
Whereas message queues are used for one-to-one communication between a producer and a consumer, event streams are used
for one-to-many communication, as discussed next.
Event Streams
Event-streaming tools allow services to communicate asynchronously in a manner similar to a message queue. The main difference is that instead of each message being consumed by a single consumer, streaming allows each message
to be consumed by multiple consumers, as shown in Figure 88. Some of the most popular event-streaming tools
include Apache Kafka, Amazon Kinesis, and NATS
(full list).
Figure 88. The steps of using an event-streaming platform
The typical process of using event streaming is as follows:
A producer, such as service A, publishes a message to the event stream.
The event stream persists the message to disk. This ensures that the message will eventually be processed, even if the
event stream or any other service has an outage. Under the hood, the messages are recorded in a log, which is an
append-only, totally ordered sequence of records, ordered by time, as shown in Figure 89.
Figure 89. An event-streaming log with one publisher and three consumers
One or more consumers, such as services B, C, and D, poll the event-streaming platform for new messages.
For each consumer, the streaming platform records that consumer’s offset in the log—the
last message that consumer saw. When there is a new message past that offset, the streaming platform returns that
message to the consumer.
Services B, C, and D process whatever messages they receive.
Once a service has successfully processed a message, it updates its offset in the streaming platform log, ensuring that it
won’t see the same message again.
At its most basic level, event streaming could be used as a replacement for a message queue to allow services to
communicate asynchronously. However, this is not the primary use case. A message queue is typically used to allow
service A to send a message specifically destined for service B. In contrast, in event streaming, every service publishes a stream of events that represent important data points or changes in state in that service but that aren’t
necessarily designed for any one specific recipient. This approach allows for multiple other services to subscribe and react to
whatever streams of events are relevant to them. This is known as an event-driven architecture.
The difference between messages in a message queue and events in an event stream has a profound impact on the way you build
your services. In Figure 86, you saw a simplified diagram showing all systems sending their data to a data
warehouse. Figure 90 shows a slightly more realistic image.
Figure 90. A system architecture without event streaming has n2 connections
As the number of services grows, the number of connections between them grows even faster. With n services, you end up
with roughly n2 connections across a variety of interfaces and protocols that often require complicated ETL. Setting
up and maintaining all these connections can be a massive undertaking. Event streaming offers an alternative solution,
as shown in Figure 91.
Figure 91. A system architecture with event streaming has n connections
In Part 7, you saw that a network switch allows you to connect n computers with n cables (each
computer has one cable connected to the switch) instead of n2. Analogously, an event-streaming platform allows you to
connect n services with n connections (each service has one connection to the event-streaming platform) instead of n2.
Dramatically simplified connectivity is one of the major benefits of an event-driven architecture. Another major
benefit, and one that’s less obvious, is that an event-driven architecture allows you to add new services—new
consumers—without having to modify any existing producers.
An example can help illustrate the power of this concept. First, consider an architecture where services message each
other directly. For example, service A might send the message a new image has been uploaded to location X, please process that image to service B. Six months later, you want to add a new service C to scan images for inappropriate
content. For this service to do its job, you have to update service A to send an additional message to service
C: a new image has been uploaded to location X, please scan that image for inappropriate content.
Now compare this to an event-driven architecture, where service A doesn’t have to know about the existence of other
services at all. Service A merely publishes important events, such as a new image has been uploaded to location X.
Perhaps on day one, service B subscribes to this event stream and is able to process each image. Six months later, when
you add service C, it can subscribe to the same event stream to start scanning images for inappropriate content—without
any need to modify service A. You could add dozens more services that consume service A’s event stream, again, with
no need for A to be aware of them at all.
In an event-driven architecture, every service publishes important events, such as a new user has registered, a user clicked a button, an order has been placed, a server is down, and so on. Any other service can subscribe to any of these event streams to perform a variety of actions (e.g., update a search index, detect fraudulent activity,
generate a report, or send out a notification). Moreover, each time a service subscribes to an event stream, it
can choose to start at offset 0 in that stream (refer to Figure 89), effectively "going back in time," and
processing all the historical events from that event stream (e.g., all images that have ever been uploaded) until it
catches up to the latest offset, or it can start immediately at the latest offset and just process new events.
Event-driven architectures provide many benefits:
All the benefits of a message queue
Event streams offer most of the same benefits as you saw with message queues previously: they help you
handle traffic spikes, decouple services, guarantee that tasks are completed, and guarantee task ordering.
Even stronger decoupling
Message queues provide a limited amount of decoupling by allowing services to communicate by having to interact
with only one interface, the queue, but some coupling remains, as each service must be aware of other services to send
them messages. Event streaming also allows services to interact with only a single interface, the event-streaming
platform, but it is even more decoupled, as publishers don’t need to be aware of consumers at all. This unlocks
remarkable flexibility and scalability in your architecture.
Monitoring
Event streaming turns out to be an excellent way to implement monitoring, including metrics and logs. The stream of
events from each service inherently provides visibility into what that service is doing, and you can hook up various
consumers to help make your monitoring data easier to visualize, search, and so on. You’ll learn more about
monitoring in Part 10.
ETL and stream processing
In Section 9.6.2, you learned about big data, fast data, and data
warehouses. Event streams play a key role in each. First, event streaming gives you a single, standardized
way to do ETL. Second, I mentioned that fast data is all about processing streams of data; well, the event stream
platform is what provides those streams of data!
Key Takeaway 10
Use event streams to build highly scalable, decoupled, event-driven architectures.
Let’s now move on, not to a specific data storage use case but to the general
problem of data store scalability and availability.
Scalability and Availability
For many companies, the data store is the biggest bottleneck in terms of scalability and availability. Making stateless software scalable and highly available is relatively easy; doing so for stateful software is considerably trickier. Over the years, there have been many attempts to solve this problem, and the key lesson we’ve learned is that there’s no one-size-fits-all solution. No data store or architecture is a silver bullet that can magically fix all your scalability and availability challenges. That said, some patterns come up again and again. This section looks at these common patterns in the following systems:
Relational databases
NoSQL and NewSQL databases
Distributed systems
Let’s start with scalability and availability patterns for relational databases.
Relational Databases
Most relational databases are designed to run on a single server. However, as you learned in
Part 3, a single server is a single point of failure—a bottleneck to scalability and availability.
This isn’t something you have to worry about until you are storing tremendous amounts of data and serving a huge amount
of traffic. However, if you get to that point, you should be aware that it’s easy to scale a database vertically by making a single server more powerful (more CPU, more memory, more disk space), but it’s harder to scale a database horizontally across multiple servers. To horizontally scale a relational database—or any data store—there are two primary
strategies, replication and partitioning.
Replication
Replication involves copying the same data to multiple servers called replicas. One of the key benefits of
replication is fault tolerance. Servers and hard drives fail all the time, so having a copy of your data in more than
one place can help improve availability.
One option is to replicate data to a standby replica, which doesn’t serve
any live traffic but can be swapped in quickly if the primary database goes down. Another option is to replicate data
to one or more active replicas, which serve live traffic, thereby allowing you to horizontally scale your database
by adding more replicas.
Note that with a relational database, these are read replicas, which can handle read
traffic, whereas all write traffic must instead go to the primary database. This helps you offload all read traffic
from the primary database, which is especially helpful for the many types of software that have vastly more reads than
writes.
Partitioning
Whereas replication is copying the same data to multiple servers, partitioning (aka sharding) is copying
different subsets of the data to different servers that each can handle both reads and writes. The
goal of partitioning is to divide your data set deterministically between n servers so that each one has to
handle only 1/nth of the total load.
Going back once more to the bank example, if you had grown to 10 million customers, you could partition them across 10 servers, so all the data in the customers table for customers with id 0–1,000,000 would be on server 0, all the data for customers 1,000,001–2,000,000 would be on server 1, and so on. If the bank had a website where most of the pages showed data for only one customer at a time, each database would have to handle only ~10% of the load, which is a huge win.
Partitioning effectively turns a single-node database into a distributed system, which helps with availability and
scalability, but it comes at a cost. We’ll dive into distributed systems challenges later in Section 9.8.3, but for now I’ll list just a few examples. With partitioning, you lose the ability to use
auto-incrementing sequences, queries across data in different partitions, and foreign-key constraints across data in
different partitions. You even lose ACID transactions for data in different partitions; for example, if a customer with id 50 wanted to transfer money to a customer with id 3,000,000, you couldn’t perform this update in a single transaction because the data for each customer lives in a separate partition. Moreover, if you don’t partition your data correctly, or if your access patterns change, it’s possible to develop hot spots, where certain partitions get a disproportionately higher percentage of traffic and become overloaded. To fix these, you have to update the partitioning strategy, known as rebalancing, which can be difficult and expensive, as it may require moving around a lot of data.
Key Takeaway 11
Use replication and partitioning to make relational databases more scalable and highly available.
Replication and partitioning can provide a remarkable amount of scalability and availability for a relational database. For example, Meta uses a relational database (MySQL) as the primary data store for its 3+ billion users, and the scale of their deployment is astonishing, consisting of thousands of servers across multiple continents, hosting millions of shards, and storing petabytes of data.[40] That said, some companies choose to handle scalability and availability challenges by moving away from relational databases entirely, as discussed next.
NoSQL and NewSQL Databases
In the mid- to late 2000s, the challenges with scalability and high availability for relational databases led to the
creation of nonrelational databases, often called NoSQL databases. NoSQL, which at various times stood
for Non-SQL or Not-Only-SQL, is a fuzzy term that refers to databases that do not use SQL or the relational model. Over
the years, many types of nonrelational databases have been created, most of which failed to gain wide adoption (e.g.,
object databases in the '90s, XML databases in the early 2000s), but NoSQL in particular refers to databases that were
built in the late 2000s, primarily by internet companies struggling to adapt relational databases to unprecedented
demands in performance, availability, and data volume.
The early inspirations for NoSQL included Google’s 2006 paper on
Bigtable and Amazon’s 2007 paper on
Dynamo. The actual term "NoSQL" came after
these papers, originating as a Twitter hashtag (#NoSQL) for a
2009 meetup
in San Francisco to discuss "open source, distributed, nonrelational databases," which is still the best definition of
NoSQL that we have. The primary types of data stores that fall under the NoSQL umbrella are key-value stores, document
stores, and columnar databases, all of which you’ve already seen in this blog post.[41]
Most NoSQL databases were designed from the ground up for scalability and availability, so the default deployment often
includes replication and partitioning. For example, MongoDB is typically deployed in a cluster that consists of
multiple shards, and each shard has a primary (for writes) and one or more replicas (for reads), plus dedicated
servers that handle query routing, auto-sharding, and auto-rebalancing. The benefit is that
you get a highly scalable and available data store. The cost is that these are complicated distributed systems, and
that comes with challenges, as you’ll see in the next section. For now, I’ll just say that, in the pursuit
of scalability and high availability, most NoSQL data stores sacrifice key features from relational databases, such as
ACID transactions, referential integrity, and flexible query languages that support joins.
For some use cases, this was too many sacrifices, which led to the creation of a new breed of relational database in
the mid- to late 2010s, often called NewSQL, that tried to retain the strengths of a relational database (e.g., ACID
transactions, SQL), while providing better availability and scalability. Some of the major players in this space include
Google Spanner, Amazon Aurora, and CockroachDB
(full list). Under the hood,
these are also complex distributed systems that use
replication and partitioning to achieve high scalability and availability, but they use new techniques to not
sacrifice too many relational database benefits along the way. The approaches they use are fascinating but beyond
the scope of this blog post series, especially as many of the early NewSQL players died out, and the ones that remain
are still relatively young and immature.
Remember, data storage technology takes at least a decade to mature. As of the writing of this blog post series, most
NoSQL data stores are 10–15 years old, so they are just starting to become mature and reliable systems. Most NewSQL
systems are still less than 10 years old, so they are still relatively young (at least as far as data storage
technologies go). Given that both NoSQL and NewSQL databases are typically complex distributed systems, they face
challenges that may take even more than a decade to solve, as discussed next.
Distributed Systems
As you may remember from Part 6, distributed systems are complicated, and distributed data
stores even more so. One of the challenges is that all distributed systems are subject to the
CAP theorem, which gets its name from the following three properties:
Consistency (C)
Every read receives the most recent write.
Availability (A)
Every request receives a response, even if some servers are down.
Partition tolerance (P)
The distributed system continues to work even if a break in communications (known as a partition) occurs between some of the servers.
All three of these are desirable properties for a distributed system, but the CAP theorem says you can pick only two. Moreover, in practice, no network can guarantee there will never be any partitions, so all real-world
distributed systems have to provide partition tolerance—they have to pick P—which means you really get to pick only one
more. That is, in the presence of a network partition, does your distributed system provide consistency (C) or
availability (A)?
Some systems, such as HBase and Redis, pick C, so they try to keep data consistent on all nodes, and in the case of a
network partition, they lose availability. If you use a data store that picks C, you have to accept that, from time
to time, that data store will be down. Other systems, such as Cassandra, Riak, and CouchDB, pick A, so they are
eventually consistent, which means that during a network partition, they will remain available, but different nodes
may end up with different data. In fact, even without a partition, eventually consistent systems may have different
data on different nodes, at least for a short time. If you use a data store that picks A, you have to
deal with an eventually consistent data model, which can be confusing for programmers and users (e.g., you just
updated some data, but after refreshing the page, you still see the old data).
Another challenge with distributed systems is that they introduce many new failure modes. At some point, every data
store will fail. The question is, how many ways can the system fail, and how easy is it to understand and fix
each one? Usually, the number and complexity of failure modes on a single-node system (e.g., a relational database) is
far lower than on a distributed NoSQL or NewSQL system that has multiple writers, auto-sharding, auto-rebalancing,
eventual consistency, consensus algorithms, and so on. Having to deal with too many failure modes
was one of the main reasons
Pinterest stopped using
Cassandra and Etsy stopped using MongoDB.
One more challenge with distributed systems is figuring out a business model to support them. Building a reliable data store takes a decade or two, and finding a way to sustainably pay developers during all that time is tricky. Many
data store companies have shut down (e.g., RethinkDB,
FoundationDB, GenieDB, ScaleDB, and many
others) which is a huge problem if your company relies on these technologies for storing your most valuable asset!
A data store that has been around 20+ years is not only more mature than one that just came out in the last
few years, but also more likely to still be around another 20 years from now (this is
called the Lindy effect).
This doesn’t mean you should avoid distributed systems, NoSQL, or NewSQL. It just means that you need to understand
what they are good at, what they are not good at, and the risks you are taking on. For example, if you have extreme
scale and availability requirements that you can’t handle with a relational database, and you have a team willing to
put in the time and effort to deploy and maintain a NoSQL or NewSQL database, then by all means, go for it. But if
you’re a tiny startup with virtually no traffic, using a complex distributed data store right out of the gate might
not be the right way to spend your limited resources.
Key Takeaway 12
Use NoSQL and NewSQL databases when your scalability and availability requirements exceed what you can do with a
relational database—but only if you can invest in the time and expertise of deploying and maintaining a distributed
data store.
Note that things can go wrong even with the most mature and battle-tested data store. Therefore, as the final topic of
this blog post, let’s talk about how to manage backup and recovery to minimize the risk of losing data.
Backup and Recovery
Whatever data store you pick, you’ll want to think through your backup-and-recovery strategy—that is, how to
protect your company from disaster scenarios that cause you to lose your data. Losing your company’s data can do
tremendous damage or even put you out of business. Broadly speaking, you need to protect against three types of disaster scenarios:
Data loss
You can lose data because of hardware failures, such as a server or hard drive dying, or software failures, such as
accidentally deleting data because of a bug in your code.
Data corruption
It’s possible for you to not lose any data, but for the data to become corrupted. This could be due to a software
bug, human error, or a malicious actor.
Inaccessible data
It’s also possible for you to not lose or corrupt your data, but you can no longer access it. This could happen, for
example, because you lost the encryption key that was used to encrypt the data.
To some extent, the solution is simple: back up your data. Periodically make copies of your data and store
those copies elsewhere, so if something goes wrong, you can restore from one of these copies. But it’s not as simple as it first sounds, as you’ll learn in the following sections:
Backup Strategies
Backup Recommendations
Example: Rackups and Read Replicas with PostgreSQL
Let’s start with backup strategies.
Backup Strategies
The following are the most common strategies for backing up data:
Scheduled disk backups
Scheduled data store backups
Continuous data store backups
Data store replication
These strategies are not mutually exclusive. In fact, each strategy has different advantages and drawbacks, and protects against different types of disasters, so it’s usually a good idea to use several of these strategies to ensure you’re fully covered. Let’s take a closer look.
Scheduled disk backups
The most basic approach to backups is to take a snapshot of the entire hard drive on a regular schedule (e.g.,
daily), and store that snapshot on a separate server.
Advantages
Protects against data loss.
Protects against data corruption.
Moderately portable. Some backup software gives you snapshots you can move to servers in different hosting
environments (e.g., on prem to cloud), but some does not (e.g., cloud provider backups sometimes work only with that
cloud provider).
Highly reliable. Disk snapshots store the exact data store configuration and version from when you took the
snapshot, so it’s likely to work as before.
Drawbacks
Disk space overhead. You store a lot of extra data you might not need (e.g., the OS and other software), so your
backups use more disk space.
CPU and memory overhead. Backing up more data requires using more CPU and memory while the backup process is running.
Inconsistent data. The data store may have data buffered in memory or only partially written to disk when you take
a snapshot, so you may end up with a snapshot that has incomplete or inconsistent data. You can mitigate this by
shutting down the data store while taking a snapshot, at the cost of downtime.
Does not protect against inaccessible data. The data within the hard drive snapshot will still be encrypted with
the same key, so if you lost that key, you won’t be able to read the data when you recover from the snapshot.
Lost data between snapshots. When a crash occurs, you lose any data between snapshots. Taking snapshots more
frequently can reduce the amount of data you lose, albeit at the cost of more resource usage and disk space.
The first three drawbacks to this approach—disk space overhead, CPU and memory overhead, and inconsistent data—can be
solved by using the next approach.
Scheduled data store backups
Many data stores provide a native way to back up just the data in that data store (rather than the entire hard drive),
so you can take snapshots of that data on a regular schedule and store those snapshots on a separate server. For
example, many relational databases can do a SQL dump, where they dump out a snapshot of SQL statements (e.g., CREATE
TABLE, INSERT INTO, etc.) that represent the contents of the database at a specific point in time.
Advantages
Protects against data loss.
Protects against data corruption.
Protects against inaccessible data if you encrypt the snapshot with a different key.
Consistent data. Data store snapshots are guaranteed to be consistent.
Less disk space overhead. You store only what’s in the data store, and not the surrounding OS and
software.
Less CPU and memory overhead. Less data to back up means you use less CPU and memory while the backup process is
running. You can even offload the backup process to a replica, thereby reducing CPU and memory overhead to zero on
the primary.
Highly portable. Scheduled data store backups usually work even if you move to a different server, OS, hosting environment, etc.
Moderately reliable. If you can’t deploy the data store with the same configuration and version as the original,
there’s a small risk you won’t be able to recover all your data. This typically happens only if you try to run
especially old backups with newer (possibly backward-incompatible) versions of the data store software.
Drawbacks
Lost data between snapshots. When a crash occurs, you lose any data between snapshots. Taking snapshots more
frequently can reduce the amount of data you lose, albeit at the cost of more resource usage and disk space.
Many companies use this approach, but the last drawback, where you may lose data between snapshots, leads some
companies to the next option.
Continuous data store backups
Some data stores allow you to take a snapshot after every single change. The way this works depends on the data
store. That said, a common way to implement a data store is to use a write-ahead log (WAL), an append-only file where
for every single requested change (e.g., every INSERT and UPDATE statement), the data store first writes the
requested change to the WAL and then performs the requested change on the actual data. If you back up each WAL entry
to a separate server, you get continuous backups, where you can replay the WAL to restore to any point in time.
Advantages
Protects against data loss.
Protects against data corruption.
Protects against inaccessible data if you encrypt the backups with a different key.
No lost data between snapshots.
Consistent data. Data store backups are guaranteed to be consistent.
Less disk space overhead. You store only what’s in the data store, and not the surrounding OS and
software. That said, continuous backups will be bigger, as they store every single change, whereas scheduled data
store backups store just the result. If a single row has changed 1,000 times, for example, the continuous backup will have
all 1,000 entries, whereas the scheduled data store backup would just have a single entry, which is the latest value.
Drawbacks
CPU and memory overhead. Backing up each change may slightly increase the database latency and use more CPU and
memory at all times (whereas periodic snapshots add overhead only when the backup process is running).
Not all data stores support continuous backups.
Not as portable. Moving continuous backups to a different server, OS, or hosting environment is not
always supported.
Not as reliable. If you can’t deploy the data store with the same configuration and version as the original,
there’s a small risk you won’t be able to recover all your data. This typically happens only if you try to run
especially old backups with newer (possibly backward-incompatible) versions of the data store software.
Continuous backups are the gold standard, so if they are available to you, you should typically use them. Moreover,
the same mechanism you use for continuous backups can be used for replication.
Data store replication
You saw earlier in this blog post that you can create replicas of a database to act as a failover or
handle some of the load. Replication also acts as a backup mechanism. In fact, replication is often based on the WAL
(each requested change is sent to the replica servers, which will end up with the exact same data after executing those
changes), so it’s effectively a type of continuous backup.
Advantages
Protects against data loss.
No lost data between snapshots.
Consistent data. Data store replication is guaranteed to be consistent.
Less disk space overhead. You store only what’s in the data store, and not the surrounding OS and
software.
Provides additional replicas to distribute the load or act as standbys for failover.
Drawbacks
Does not protect against data corruption. If invalid data is written to your primary data store, all the replicas
will end up with that invalid data too!
CPU and memory overhead. Replicating each change may slightly increase the database latency and use more CPU and
memory at all times (whereas periodic snapshots add overhead only when the backup process is running).
Does not protect against inaccessible data, as the replica typically uses the same encryption key as the
primary.
More expensive, as you need to have a live, active server running (the replica). You can mitigate some of this
extra cost by using a standby replica.
Now that you’ve seen the basic backup strategies, let’s talk about some of the practices for making your backups more
effective.
Backup Recommendations
I recommend a few practices when backing up your data:
The 3-2-1 backup rule
You should have at least three copies of your data, stored on at least two types of media,
and at least one copy stored offsite. For example, if you had a primary database (SSD), a replica database (SSD), and
nightly snapshots stored on magnetic hard drives in another data center, that would meet the 3-2-1 requirements. This
easy-to-remember rule that ensures that your data can survive a variety of disasters, from loss of a production
server all the way up to loss of an entire region.
Test your backups regularly
If you never test your backups, there’s a good chance they don’t work. Ensure that the step-by-step process of how
to restore from a backup is documented, so your team doesn’t have to figure out the process in the middle of a
crisis situation. Run through this process regularly—both manually, such as in a practice session you schedule once
per year, and automatically, with an automated test that regularly restores a system from backup and
checks that everything works as expected (e.g., the queries against the backup return the same data as the primary).
Protect your backups
You need to take just as much care with protecting your backups as the original data. As you learned in
Part 8, many data breaches are not from brilliant algorithmic hacks, but from a hacker
stumbling upon an old copy of the data that was poorly protected. Therefore, you need to ensure that your backups
have multiple layers of protection (defense in depth), including encryption, private networks, authentication,
and so on.
Key Takeaway 13
Ensure that your data stores are securely backed up to protect against data loss and data corruption, protect your backups,
test your backup strategy regularly, and follow the 3-2-1 rule.
Now that you’ve learned about backup strategies and recommended patterns, let’s put it into practice with a real-world
example using PostgreSQL.
Example: Backups and Read Replicas with PostgreSQL
Earlier in this blog post, you created a lambda-rds root module that could deploy PostgreSQL in AWS by using RDS. Let’s update that example to do the following:
Take daily snapshots
Back up the database by taking a daily snapshot.
Create a read replica
Create another database instance that will act as a read replica. This will help scale read traffic and act as
an additional form of (continuous) backup.
Head over to the lambda-rds module, open up main.tf, find your usage of the rds-postgres module, and update it
as shown in Example 178.
One of the benefits of using a managed service like RDS is that it makes it easy to enable common functionality,
such as backups:
1
Setting the backup retention period to a value greater than zero enables daily snapshots. This code configures RDS to retain those snapshots for 14 days. Older snapshots will be deleted automatically, saving you on storage costs.
2
This code configures the snapshotting process to run from 4 to 5 a.m. Coordinated Universal Time (UTC). You should set this to a time when load on the database tends to be lower. Also, remember that any data written between snapshots could be lost, so if you run an important business process at a specific time every day, you may want to schedule the snapshot shortly after that process.
To add a read replica, use the rds-postgres module a second time, as shown in Example 179.
Again, using RDS makes it easy to use common functionality such as read replicas:
1
Since the primary database is called bank, name the replica bank-replica.
2
Set the replicate_source_db parameter to the identifier of the primary database. This is the setting that
configures this database instance as a read replica.
The read replica, as its name indicates, is read-only, whereas the primary database accepts both reads and writes. To
run schema migrations, you need both read and write access, so you should continue to use the primary URL for those.
However, the Lambda function needs only read access for its one database query, so you can update it to talk to the
read replica rather than the primary, as shown in Example 180.
An RDS replica can take 5–15 minutes to deploy, so be patient. When apply completes, head over to the
Lambda console, click the lambda-rds-app function, select
the Configuration tab, and click "Environment variables" on the left side. You should see something similar to
Figure 92.
Figure 92. The Lambda function should be configured to talk to the replica
The Lambda function should now have the DB_HOST set to the replica’s URL, and not the primary’s. If everything looks
correct, test out the URL in the function_url output variable one more time:
If you see the exact same results as before, congrats, that means your Lambda function is now reading data from a
read replica! Moreover, your database is now backed up, both via the replica and daily snapshots.
Get Your Hands Dirty
Here are a few exercises you can try at home to go deeper:
Test your backups! Set the snapshot_identifier parameter of the rds-postgres module to restore from a snapshot.
When you’re done testing, commit your code, and run destroy to clean up everything. As part of the
destroy process, RDS will take one final snapshot of the database, which is a handy fail-safe in case you delete a
database by accident.
Conclusion
Let’s review the key takeaways from this blog post. Here are the first few:
Keep your applications stateless. Store all your data in dedicated data stores.
Don’t roll your own data stores; always use mature, battle-tested, proven, off-the-shelf solutions.
Use relational databases as your primary data store (the source of truth), as they are secure, reliable, and mature, and they support schemas, integrity constraints, foreign-key constraints, joins, ACID transactions, and a flexible query language (SQL).
These first few takeaways essentially boil down to one simple idea: your default choice for data storage should be a
relational database. Maybe that sounds boring to you. Perhaps you’re itching to try that sexy new NoSQL database that
you read about on Hacker News. It perfectly fits your use case! It’s so slick! It’s so fast and cutting-edge! When it
comes to data storage, cutting-edge is typically a bad thing. Yes, cutting-edge might get you access to new features,
but what it really means is that you’ll also be the one to discover all the bugs, corner cases, missing integrations, failure modes, and so on. Typically, you want a data store to be reliable, predictable,
and stable. This is the opposite of cutting-edge. It’s boring. But when it comes to data storage, boring is good,
and you should choose boring technologies.
That doesn’t mean you should never use other data stores. But you should use them only if you have use cases that a
relational database can’t handle, which means you need to understand what these other data stores are good at—and what
they are not good at. The next several key takeaways from this blog post cover the typical use cases
for other types of data stores:
Use key-value stores to cache data, speeding up queries and reducing load on your primary data store.
Use CDNs to cache static content, reducing latency for your users and reducing load on your servers.
Use file servers and object stores to serve static content, allowing your app servers to focus on serving
dynamic content.
Use document stores for semistructured and nonuniform data, when you can’t define a schema ahead of time, or for
search, when you need free-text search, faceted search, etc.
Use columnar databases for time-series data, big data, fast data, data warehouses, and anywhere else you need to
quickly perform aggregate operations on columns.
Use message queues to run tasks in the background, with guarantees that tasks are completed and executed in a
specific order.
Use event streams to build highly scalable, decoupled, event-driven architectures.
In addition to these use cases, you may need to pick your data store based on scalability and availability
requirements, per the following key takeaways:
Use replication and partitioning to make relational databases more scalable and highly available.
Use NoSQL and NewSQL databases when your scalability and availability requirements exceed what you can do with a
relational database—but only if you can invest in the time and expertise of deploying and maintaining a distributed
data store.
Finally, whatever data store you choose, keep the following key takeaway in mind:
Ensure that your data stores are securely backed up to protect against data loss and data corruption, protect your backups, test your backup strategy regularly, and follow the 3-2-1 rule.
Let’s now move on to Part 10, which is all about how to monitor your systems, including logs,
metrics, and alerts.