How we designed an Event-Driven Architecture using Cloud Functions

Théophile de Segonzac
Published in Teemo Tech Blog
5 min read · Dec 12, 2018

What was the legacy architecture?

At Teemo, we process a huge volume of information, and to this end, we are used to collecting data in different ways. One of our main methods is to collect data directly from external systems.

Data Loader - Legacy Architecture

The above workflow picks up fresh data every hour; if new files are detected, they are uploaded into our internal storage, managed by Google Cloud Storage. Our data loader application, which runs on Google Compute Engine, then groups the files into batches so that multiple files can be processed simultaneously, and launches a Google Dataflow job to process the data and insert it into our data warehouse, managed with Google BigQuery.
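To make the batching and job-launch step concrete, here is a minimal sketch of what such a loader could do, assuming a templated Dataflow job; the project, bucket prefix, template path, and batch size are hypothetical placeholders rather than our actual configuration.

```python
from google.cloud import storage
from googleapiclient.discovery import build

BATCH_SIZE = 50  # hypothetical batch size

def launch_batch_job(project, bucket_name, template_path):
    """Group pending files into one batch and launch a templated Dataflow job."""
    client = storage.Client(project=project)
    blobs = list(client.list_blobs(bucket_name, prefix="incoming/"))
    batch = [b.name for b in blobs[:BATCH_SIZE]]
    if not batch:
        return None

    dataflow = build("dataflow", "v1b3")
    return dataflow.projects().templates().launch(
        projectId=project,
        gcsPath=template_path,  # e.g. gs://my-templates/load-to-bigquery (hypothetical)
        body={
            "jobName": "legacy-loader-batch",
            "parameters": {
                "inputFiles": ",".join(f"gs://{bucket_name}/{n}" for n in batch)
            },
        },
    ).execute()
```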

This architecture is perfectly adequate as long as we only manage a few external sources, and only a few! What happens if we have to manage hundreds of different sources? Consider that for each new source, we must duplicate the following components:

  • Compute Engine Instance
  • Cloud Storage Bucket
  • Cloud Dataflow Stream

We can quickly see that maintaining hundreds of servers becomes a bottleneck, not to mention a significant cost.

Why did we choose an event-driven architecture?

First of all, you may ask: what is event-driven architecture? “Event-driven architecture (EDA) means constructing your system as a series of commands and/or events”, which is relevant here because we can decompose our current workflow as follows:

Commands/Events Workflow

But the most interesting thing is that event-driven architecture lends itself very well to architectural patterns, which is exactly what we need here: the new sources we integrate send us data that is very similar to the existing ones. Being able to use a single generic pipeline to handle each new source is absolutely fantastic, because we can drastically reduce our costs by maintaining only one implementation of each component. This is much easier to maintain than a hundred disparate services.
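As an illustration of what “one generic pipeline” means in practice, here is a minimal sketch of a normalized file event that every source could be mapped into; the field names are hypothetical and only meant to show that the downstream components never need source-specific logic.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class FileEvent:
    """A source-agnostic description of a newly received file (hypothetical schema)."""
    source: str        # logical name of the external source
    bucket: str        # Cloud Storage bucket holding the file
    object_name: str   # full object path inside the bucket
    received_at: str   # ISO-8601 timestamp of reception

def to_pubsub_payload(event: FileEvent) -> bytes:
    """Serialize the event so it can travel on a Pub/Sub topic."""
    return json.dumps(asdict(event)).encode("utf-8")
```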

How did we design our architecture?

EDA works even better when it is designed with microservices, as you must respect the following chain: “Command -> Event -> Command -> Event”. In fact, it is much more complicated to implement EDA with a monolithic application, because it involves managing the entire Command/Event chain on a single instance, which makes the software architecture, maintenance, and sustainability of the system more complex. If we want to effectively divide our current architecture into small components, we must ask ourselves when and how tasks are launched.

Events during the workflow

In the diagram above, we have defined the events; the next question is how the actions will interact with them. As mentioned above, we must respect the Command/Event chain, so we can consider that each action performed generates an event, which in turn triggers the next action.
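To make this chaining idea concrete, here is a minimal, deliberately abstract sketch of the “Command -> Event -> Command” loop; the event names and handlers are hypothetical and only illustrate that every action ends by emitting the event that triggers the next one.

```python
# Hypothetical handlers: each one performs an action and returns the next event.
def handle_file_received(event):
    print(f"accumulating {event['object_name']}")
    return {"type": "batch_scheduled", "files": [event["object_name"]]}

def handle_batch_scheduled(event):
    print(f"launching processing for {len(event['files'])} file(s)")
    return None  # end of the chain

HANDLERS = {
    "file_received": handle_file_received,
    "batch_scheduled": handle_batch_scheduled,
}

def dispatch(event):
    """Run the Command/Event chain until no further event is produced."""
    while event is not None:
        event = HANDLERS[event["type"]](event)

dispatch({"type": "file_received", "object_name": "source-a/2018-12-12/data.csv"})
```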

Let’s consider that we no longer care about how the files are downloaded; we just provide a Google Storage Bucket as an entry point to our data pipeline.

Components/Events during the workflow

One of the main reasons we redesigned our architecture was to make it more scalable. With this philosophy in mind, we decided to delegate these components to managed services as much as possible.

For processing simple actions, our best ally is Google Cloud Functions: it is an event-driven, serverless computing platform and it scales automatically. Cloud Functions can be triggered by several sources, such as HTTP requests, Google Cloud Storage, Google Pub/Sub, and so on.
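As a minimal sketch, assuming a background Cloud Function and a hypothetical file-events topic, a Storage-triggered function that forwards each new file to Pub/Sub could look like this (the project, topic, and function names are placeholders, not our actual code):

```python
import json
from google.cloud import pubsub_v1

PROJECT_ID = "my-project"   # hypothetical project
TOPIC_ID = "file-events"    # hypothetical Pub/Sub topic

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)

def on_file_uploaded(event, context):
    """Background Cloud Function triggered when an object is finalized in a bucket."""
    message = {
        "bucket": event["bucket"],
        "object_name": event["name"],
        "received_at": context.timestamp,
    }
    # Publish the normalized file event; the rest of the pipeline reacts to it.
    publisher.publish(topic_path, data=json.dumps(message).encode("utf-8"))
```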

Did you say Google Pub/Sub? Pub/Sub is a reliable and scalable messaging service that allows you to send and receive messages between independent applications.

In short, we can use Cloud Functions to launch jobs, Pub/Sub to handle communication between the components, and Dataflow to process the batches.

Data Pipeline - Architecture

Let’s take each action again:

  1. Each time a new file is received, the File Command function is triggered.
  2. A File Receive event is sent to the File Event Pub/Sub queue.
  3. A Dataflow Stream process accumulates File Events.
  4. A Schedule Batch event is sent to Pub/Sub once a sufficient number of files has accumulated.
  5. The Schedule Batch event is received by the Schedule Batch processor.
  6. The Dataflow processing component is executed by the Schedule Batch processor (a minimal sketch follows below).
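For the last two steps, here is a minimal sketch of what the Schedule Batch processor could look like as a Pub/Sub-triggered background Cloud Function that launches a templated Dataflow job; the project, template path, and parameter names are hypothetical assumptions rather than our production values.

```python
import base64
import json
from googleapiclient.discovery import build

PROJECT_ID = "my-project"                          # hypothetical project
TEMPLATE_PATH = "gs://my-templates/process-batch"  # hypothetical Dataflow template

def schedule_batch_processor(event, context):
    """Background Cloud Function triggered by a Schedule Batch message on Pub/Sub."""
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    files = payload["files"]  # list of gs:// URIs accumulated by the Dataflow stream

    dataflow = build("dataflow", "v1b3")
    dataflow.projects().templates().launch(
        projectId=PROJECT_ID,
        gcsPath=TEMPLATE_PATH,
        body={
            "jobName": f"process-batch-{context.event_id}",
            "parameters": {"inputFiles": ",".join(files)},
        },
    ).execute()
```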

The real deal using Cloud Functions

Google Cloud Functions is one of the main managed services that allowed us to implement EDA in our data pipeline. To fully benefit from the advantages of Cloud Functions, the entry point must be generic: a single function handles every source file uploaded to Google Cloud Storage, even though each source uses a different file name or format. This is why we dedicated an external job, outside our pipeline, to correctly download and rename each of them.
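As an illustration of that genericity, assuming the external job renames files to a hypothetical `<source>/<date>/<original-name>` convention, the generic entry point only needs to parse the object path to know which source it is dealing with:

```python
def parse_source(object_name: str) -> dict:
    """Extract the source and date from a normalized object path.

    Assumes the hypothetical convention "<source>/<YYYY-MM-DD>/<file>",
    e.g. "partner-a/2018-12-12/export.csv".
    """
    source, date, filename = object_name.split("/", 2)
    return {"source": source, "date": date, "filename": filename}

# Example: every source goes through the same code path.
print(parse_source("partner-a/2018-12-12/export.csv"))
```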

Cloud Functions are perfectly adapted to our needs; they are one of the easiest ways to run your code in the cloud while avoiding maintenance and scalability issues. Ultimately, this is in line with what we expect from cloud services in terms of pricing: you pay only for what you actually use.

Conclusion

This new architecture is a major step forward in terms of consistency and performance over time; the integration of a new external source can now be managed in less than 48 hours, while it used to take 7 days.
