
January 21, 2021 ~ 7 min read

Reactive Systems in the Serverless age


Many business applications that are big and successful seem to be built with reactive systems: Netflix, LinkedIn, Tencent, Zalando, Uber, Tesla, Datadog, Ericsson, Lyft, etc.

1. Job to be done: huge performance

In the same way that Node.js brought non-blocking I/O for better performance on single-server architectures, Reactive Systems are a framework for building distributed systems that make better use of available compute resources.

2. Benefits

2.1. The actor model as a simple mental model of distributed systems

Reactive systems are based on the Actor Model, a formalism first introduced in 1973 by MIT researchers Carl Hewitt, Peter Bishop and Richard Steiger in "A Universal Modular Actor Formalism for Artificial Intelligence".

Here is a small analogy that helps understand the actor model:

Think of yourself texting your friends. You can either ask them something or tell them something. They can, in turn, text other friends. While you wait for their answer, you can do something else. Most humans also have a natural callback: they chase their friends if they don't answer. We have a back-off mechanism too: if a friend does not answer after being chased several times, we contact another friend. And we have a natural backpressure mechanism: if too many friends solicit us at once, we enter a state called "burnout" during which we cease to communicate.

All these concepts are inherent to the Actor model. Frameworks that implement this model allow us to build systems that are responsive (i.e. fast), resilient, elastic and message-driven (a.k.a. "reactive" as defined in the "Reactive Manifesto" in 2013). Like communities of humans.
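To make the analogy concrete, here is a minimal sketch of an actor written with Python's asyncio. It is only an illustration of the idea, not how Akka or Erlang implement it: the names Actor, tell, ask and receive, and the message shapes, are assumptions made up for this example. The friend has a private state, a bounded mailbox (a crude form of backpressure) and handles one message at a time.

```python
import asyncio


class Actor:
    """A toy actor: private state, a bounded mailbox, one message at a time."""

    def __init__(self, name, mailbox_size=100):
        self.name = name
        # A bounded queue gives simple backpressure: senders wait when it is full.
        self.mailbox = asyncio.Queue(maxsize=mailbox_size)
        self.known_facts = {}  # private state, only touched by this actor

    async def tell(self, message):
        """Fire-and-forget: drop a message in the mailbox and move on."""
        await self.mailbox.put((message, None))

    async def ask(self, message, timeout=1.0):
        """Send a message and wait for the reply without blocking the event loop."""
        reply = asyncio.get_running_loop().create_future()
        await self.mailbox.put((message, reply))
        return await asyncio.wait_for(reply, timeout)

    async def run(self):
        """Process messages strictly one at a time, in arrival order."""
        while True:
            message, reply = await self.mailbox.get()
            result = self.receive(message)
            # Skip the reply if nobody is waiting or the asker already timed out.
            if reply is not None and not reply.done():
                reply.set_result(result)

    def receive(self, message):
        """Hypothetical message protocol: ("remember", (key, value)) or ("recall", key)."""
        kind, payload = message
        if kind == "remember":
            key, value = payload
            self.known_facts[key] = value
            return None
        if kind == "recall":
            return self.known_facts.get(payload, "no idea")
        return None


async def main():
    friend = Actor("friend")
    worker = asyncio.create_task(friend.run())
    await friend.tell(("remember", ("party", "Saturday 8pm")))  # tell something
    answer = await friend.ask(("recall", "party"))              # ask something
    worker.cancel()
    return answer


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The caller never touches the friend's state directly; everything goes through messages. If the friend does not answer within the timeout, ask raises asyncio.TimeoutError, which is where a caller could chase again or contact another friend.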

2.2. Reliability

Frameworks implementing this model help build complex applications with smaller code bases, which leads to fewer bugs:

[...] we also often cut down code by 80% for the same functionality when compared to equivalent imperative code.

Akara Sucharitakul, Principal Member of Technical Staff at PayPal, 2016

Over the past 30 years, people have designed many tools and paradigms to help build Reactive applications. One of the oldest and most notable is the Erlang programming language, created by Joe Armstrong and his team at Ericsson in the mid-1980s. Erlang was the first language that brought Actors into mainstream popularity.

Armstrong and his team faced a daunting challenge: to build a language that would support the creation of distributed applications that are nearly impervious to failure. Over time, Erlang evolved in the Ericsson laboratory, culminating with its use in the late 1990s to build the AXD 301 telephone switch, which reportedly achieved “nine nines” of uptime—availability 99.9999999% of the time. Consider exactly what that means. For a single application running on a single machine, that would be roughly 3 seconds of downtime in 100 years!

Reactive Design Patterns, Roland Kuhn with Brian Hanafee and Jamie Allen https://www.manning.com/books/reactive-design-patterns
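The "3 seconds in 100 years" figure in the quote can be checked with a quick calculation. This is a small sketch, assuming a year of 365.25 days; the function name downtime_seconds is made up for the example.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # 31,557,600 seconds


def downtime_seconds(nines, years):
    """Expected downtime in seconds over `years` for an availability with `nines` nines."""
    unavailability = 10 ** -nines  # nine nines -> 0.000000001
    return unavailability * years * SECONDS_PER_YEAR


if __name__ == "__main__":
    print(round(downtime_seconds(9, 100), 2))  # nine nines over a century, in seconds
    print(round(downtime_seconds(5, 1) / 60, 2))  # five nines over one year, in minutes
```

Nine nines over a century comes out at a little over 3 seconds. For comparison, the more common "five nines" target allows a bit more than 5 minutes of downtime per year.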

2.3. Better use of resources leading to better performance

Here is a simplified description of how applications used to work: a thread starts a task, the task calls an outside resource and must wait for the response before it can continue, and meanwhile the thread is blocked doing nothing. This is called blocking I/O.

Let's go back to Node.js, mentioned in the introduction. Node.js is an implementation of the reactor pattern, which resolves this problem.

The event loop, running on one thread, cycles through the incoming events and handles them. Callback functions are registered for requests that will result in a long-running task or blocking operation. The handle for the event gets added to a queue. The event loop iterates through the queue and will eventually observe the completion of the long-running task, trigger a callback, and return the result to the application.

event-loop.png

Reactive Systems Explained by Grace Jansen & Peter Gollmar, 2020, O'Reilly
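The same idea can be sketched with Python's asyncio, which also runs a single-threaded event loop. This is not Node.js code, only an illustration of non-blocking I/O: call_outside_resource is a hypothetical function, and asyncio.sleep stands in for a real network call.

```python
import asyncio
import time


async def call_outside_resource(name, seconds):
    # asyncio.sleep stands in for a network call. While this task waits,
    # the event loop is free to run the other tasks on the same thread.
    await asyncio.sleep(seconds)
    return f"{name} done"


async def main():
    start = time.perf_counter()
    # Three calls are started together and awaited as a group.
    results = await asyncio.gather(
        call_outside_resource("a", 0.2),
        call_outside_resource("b", 0.2),
        call_outside_resource("c", 0.2),
    )
    elapsed = time.perf_counter() - start
    return results, elapsed


if __name__ == "__main__":
    results, elapsed = asyncio.run(main())
    print(results, f"{elapsed:.2f}s")
```

With blocking calls, the three waits would add up to about 0.6 seconds. Here they overlap on one thread, so the total stays close to the longest single wait.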

This is great because suddenly you can get a lot more out of a single machine. But how do you distribute this across several machines?

In the actor model, the actors can be distributed anywhere. Each machine can have millions of actors (one actor uses ~500 bytes of RAM) and actors can be spread across several machines. Here is what PayPal observed when they rewrote one of their systems using Akka, a leading Reactive System framework.

paypal-performance.png

Graph from the presentation "Turning PayPal’s Product Performance Tracking platform Reactive, End-to-End" by Michael Zeltser, 2018, Sr. Member of Technical Staff, Architect – Core Platform Services at PayPal.

The graph shows that the volume of transactions handled by this service at PayPal increased fourfold with the same compute resources.

With Akka, actors spread across different machines can exchange messages with a latency of only hundreds of microseconds.

paypal-performance-2.png

The above benchmark ran on two m4.4xlarge EC2 instances (16 vCPUs and 64 GB of RAM each).

3. Tradeoffs

3.1. Reactive is hard and scary

Companies who embrace Reactive Systems achieve results that they would not have achieved otherwise.

akka-difficult.png

This comes from a slide presented by an architect working at Lightbend, the company behind Akka: https://qconlondon.com/ln2018/system/files/presentation-slides/high-performance-akka.pdf

Akka does not make things simple. Thinking Reactive instead of Imperative is a change that teams struggle with.

Moving to Reactive is seen as such a challenge that I have seen large companies ask strategy consultants to help them decide whether to make the move.

Large IT consulting firms also promote Reactive. For example, IBM publishes a lot of content about the advantages of Reactive Systems and how to migrate to them.

3.2. There are many options to pick from but mostly in Java

There are many frameworks to choose from when building a Reactive System in Java.

The other most established option is Erlang. It's virtually impossible to hire Erlang engineers. WhatsApp famously explained that they did not try to hire Erlang engineers but engineers smart enough that they could train them to do Erlang. There are enlightening stories of companies regretting choosing Erlang: https://news.ycombinator.com/item?id=23283675

If your team is not already using Java, going Reactive will be an even bigger challenge.

3.3. You still need servers and people to manage them

You still need to manage VMs or a Kubernetes cluster to deploy your Reactive System to.

For example, deploying Akka clusters is not a straightforward task. Lightbend, the company behind Akka, licenses a tool to help companies deploy Akka. And those clusters still need to be managed.

In addition, Reactive Systems are often built around Kafka as an event source, and teams then find themselves having to configure and manage it as well.

Capital One are one of the leading financial institutions using Reactive Programming. I would love to know more about how they do it.

Senior software architect at a French Government organization, November 2020

Capital One have been talking about their journey to Reactive Programming with Akka since 2017. All the content they publish is very positive about their experience but they are progressively moving more of their architecture to serverless, specifically leveraging Lambda, Kinesis and DynamoDB.

In 2019, Lightbend announced that they were working on "Akka serverless" (https://www.lightbend.com/akka-serverless), which can be seen as a sign that infrastructure is currently slowing down the adoption of Reactive Systems.

4. Secret sauce: building Reactive Systems with serverless technologies

The decision to go Reactive is a big one. Companies do not make this decision lightly. The main concern is around skills. Will my team know how to build and run such a system?

Whether you go Reactive with a traditional framework or with serverless technologies, a progressive migration of existing systems is necessary, and so is thinking in terms of events. In 2019, IBM published a brilliant tutorial on progressively migrating to a Reactive architecture: https://developer.ibm.com/languages/java/tutorials/reactive-in-practice-1/. In particular, they see Event Storming as a required first step, just like with any serverless project.

So what are the differences between going Reactive with serverless and going Reactive the traditional way?

  • Developers can keep using the languages they know, even if they are not Java developers.
  • They don't have to grasp all the concepts of Reactive Programming but still get the benefits of a loosely coupled and elastic architecture. They will write code that runs in a Lambda function in the language they are most familiar with.
  • The infrastructure is easier to manage with serverless and it truly scales effortlessly.
  • However, in a system of Actors deployed on serverless, the latency between actors is in milliseconds, whereas it is in hundreds of microseconds with Akka, for example. So if latency is critical to your business, serverless might not be the right tool.
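As a rough illustration of what an actor-like unit can look like on serverless, here is a sketch of a function with the (event, context) signature that AWS Lambda uses for Python handlers. Everything else is an assumption made for the example: the event shape, the BalanceUpdated message, and the two in-memory stand-ins, which a real system would replace with a key-value store such as DynamoDB and an event stream such as Kinesis.

```python
# Hypothetical in-memory stand-ins so the sketch runs locally.
STATE_TABLE = {}  # would be a key-value store such as DynamoDB
OUTBOX = []       # would be an event stream such as Kinesis


def handler(event, context):
    """React to one incoming message: load state, update it, emit a new event."""
    account_id = event["account_id"]  # assumed event shape
    balance = STATE_TABLE.get(account_id, 0) + event["amount"]
    STATE_TABLE[account_id] = balance  # persist the actor's state
    emitted = {
        "type": "BalanceUpdated",  # made-up event name
        "account_id": account_id,
        "balance": balance,
    }
    OUTBOX.append(emitted)  # message for downstream functions
    return emitted


if __name__ == "__main__":
    print(handler({"account_id": "a1", "amount": 10}, None))
    print(handler({"account_id": "a1", "amount": 5}, None))
```

The function keeps no state between invocations; the state lives in the store and the communication happens through events, which is what keeps the pieces loosely coupled. One difference from a true actor is that concurrent invocations are not serialized for you, so a real version would need something like conditional writes to avoid lost updates.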

JR Beaudoin

I'm JR Beaudoin, CTO of Theodo in New York. Follow me on Twitter.