3 System Design Concepts for Distributed Systems

When working in large-scale distributed systems, there are a few considerations that will help you make them more reliable and maintainable.


The biggest complication of a distributed system is that networks are unreliable. When two services communicate, there is no guarantee that the response will make it back to the requester.

To handle this we can use idempotent operations. Any request that produces the same result no matter how many times it is invoked is considered idempotent.

Consider an Express API endpoint that takes a name, inserts it into a database, and returns a user identifier to the client. Here is the code:

router.post("/api/v1/user", (req, res) => {
  const id = db.insert({
    name: req.body.name
  });
  res.send({ userId: id });
});

We can see that for each request, we create a new user. There are many issues with this. First, if the client loses connection to the server after creating a user, the client will not know that the operation was successful and will try again, creating duplicate users.

Another problem we want to fix is that the database identifier should not be disclosed to the client. If we have a database cluster, the identifier is only unique to a single database server, not to the cluster, so two different records could end up with identical identifiers and confuse the client.

Both of these issues can be resolved by making the endpoint idempotent. Of the common REST verbs, POST is not idempotent by default, but it can be implemented so that it is.

router.post("/api/v1/user/:id", (req, res) => {
  db.insert({
    id: req.params.id,
    name: req.body.name
  });
  res.sendStatus(201);
});

Note that the endpoint now requires an id (usually in the form of a GUID) as a param, generated by the client and used to create the user in the database. If the client doesn’t get a 201 response, it can retry without the risk of creating a duplicate record. The backend service should still validate all inputs and ensure the id is in the expected format.

It also solves the database-specific ID issue by decoupling the backend and client from the database. The client creates unique identifiers that can be used to refer to records, allowing the database to scale without worrying about colliding identifiers.

Processing time can be a major bottleneck in heavily loaded distributed systems, whether due to high traffic or because a single piece of the system requires a lot of processing. For a distributed system to scale, servers must remain stateless, so long-running work needs to be moved off the request path.

There are many ways to reduce latency, but the most common is to use a messaging system. A messaging system, or message queue, takes a request and passes it to another service that processes the message when it has capacity. This way, the backend is not overloaded and can be scaled up without dropping requests. Another option is to incorporate triggers on the database records.
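As a sketch of the idea, here is a toy in-memory queue in plain Node; a production system would use a broker such as RabbitMQ, SQS, or Kafka, and all names here are illustrative.

```javascript
// In-memory stand-in for a message queue.
const queue = [];

// Producer: the API accepts the request and enqueues it immediately,
// so the request path is never blocked by slow processing.
function enqueue(task) {
  queue.push(task);
  return { status: 202 }; // 202: accepted for asynchronous processing
}

// Consumer: a separate worker drains the queue when it has capacity.
// Returns the number of messages still waiting.
function processNext(handler) {
  const task = queue.shift();
  if (task) handler(task);
  return queue.length;
}
```

The producer responds as soon as the message is enqueued; the worker can be scaled independently to keep up with the backlog.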

An asynchronous request means the task will be performed without the client waiting for it to finish. The drawback is that the client loses visibility into the result of the request; the immediate response only confirms that the request was received.

This is where the decision about how to inform the client of the result comes in. One of the most common ways for a client to wait for a resource is polling: the client checks at intervals until the requested value is available. Exponential backoff helps reduce the strain on the server by increasing the interval between checks the longer the client has to wait. This is the simplest option, but it is resource intensive on the server and creates extra work for both the client and the server.
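The polling loop with exponential backoff can be sketched as follows; `checkStatus` is a hypothetical function that resolves to the result, or `null` if the resource is not ready yet.

```javascript
// Delay doubles after each empty response: 100ms, 200ms, 400ms, ...
function backoffDelay(attempt, baseMs) {
  return baseMs * 2 ** attempt;
}

// Poll until checkStatus returns a non-null result, backing off between tries.
async function pollWithBackoff(checkStatus, { baseMs = 100, maxAttempts = 5 } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const result = await checkStatus();
    if (result !== null) return result;
    await new Promise((resolve) => setTimeout(resolve, backoffDelay(attempt, baseMs)));
  }
  throw new Error("timed out waiting for result");
}
```

Capping `maxAttempts` keeps a missing resource from being polled forever; the base delay and cap should be tuned to the system's expected processing time.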

If you are using service-to-service communication (both backend systems), a better option is to use webhooks. A webhook is a pattern where the requester provides a URL to an endpoint it hosts, which the service calls to notify it that the job is done. The URL can be passed in the request or configured anywhere the service can reach. Here is a diagram of one way to implement webhooks using AWS and its Simple Notification Service (SNS).

[Diagram by Scott Gering]
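Stripped of the transport, the webhook flow looks like this minimal sketch in plain Node. In a real system `notify` would be an HTTP POST to the callback URL (delivered, for example, via AWS SNS); here it is injected so the flow can run locally, and all names are illustrative.

```javascript
// In-memory job store standing in for the worker service's state.
const jobs = new Map();

// The requester submits a job together with a callback URL it hosts.
function submitJob(id, payload, callbackUrl) {
  jobs.set(id, { payload, callbackUrl, status: "pending" });
  return { status: 202 }; // accepted; the result will arrive via the webhook
}

// The worker finishes the job and notifies the requester at its callback URL.
function completeJob(id, result, notify) {
  const job = jobs.get(id);
  job.status = "done";
  notify(job.callbackUrl, { jobId: id, result }); // in reality: an HTTP POST
}
```

The requester does no polling at all: it simply exposes an endpoint and waits to be called back.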

The basic idea behind immutable data is to never modify or delete it, so that the full history is preserved for future use. Immutable data makes debugging easier: the resulting audit log gives you a snapshot of what happened over time, which is valuable when you get reports of issues that can’t be reproduced. It is also a valuable way to handle concurrency, since it removes the need for multiple updates to the same resource at the same time.

Let’s say we have a SQL database tracking order status. The simplest architecture would be to create a row for each order and update that row as its state changes. But then we lose the ability to go back and analyze how orders changed over time, or to determine what happened to a particular order.

This is where immutable data comes into play. Each update to an order creates a new row, preserving the previous state for future use. It also means that no matter which server is accessing the data, there are no concurrency issues: even when a row is superseded by a new status update, two servers never attempt to modify the same resource simultaneously. To access the current state, simply fetch the latest record for the order.
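The append-only pattern can be sketched in plain Node, with an in-memory array standing in for the SQL table; every status change is an insert, never an update, and all names are illustrative.

```javascript
// Append-only "table" of order-status rows.
const orderStatuses = [];
let nextRowId = 1;

// Equivalent SQL: INSERT INTO order_status (order_id, status, created_at) VALUES (...)
function recordStatus(orderId, status) {
  orderStatuses.push({ rowId: nextRowId++, orderId, status, createdAt: Date.now() });
}

// The current state is simply the most recent row for the order.
function currentStatus(orderId) {
  const rows = orderStatuses.filter((r) => r.orderId === orderId);
  return rows.length ? rows[rows.length - 1].status : null;
}
```

Because rows are never updated in place, the full history of each order remains queryable for auditing and debugging.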

The downside of immutable data is that it requires much more storage. But storage is relatively cheap these days, and depending on the requirements of the system, old data can be purged after a certain time frame. Queries also become more complex, since we need to filter the data and optimize based on the access patterns.

Overall, immutable data will save a lot of headaches down the road, giving you visibility into your application that you could not otherwise have had.
