From Redis to Memorystore: a migration path

Minwei Chen
Published in Teemo Tech Blog
Jan 10, 2019 · 5 min read


Context and business requirement

At Teemo, we bid for mobile advertising opportunities in real time. These opportunities are provided by auction servers on ad exchanges. Once we win the bid, the ad will be displayed on a user’s mobile phone.

For a more efficient campaign and a better user experience, ads should be displayed evenly over the campaign's duration. Our solution is to maintain an impression count per user and to bid only if that counter is below a defined limit, so that no user is shown too many ads.
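To make the idea concrete, here is a minimal sketch of that check, assuming a plain redis-py client and a hypothetical key layout of the form cap:<user_id>; the real counters and limits differ.

```python
import redis

FREQUENCY_CAP = 5  # hypothetical limit: maximum ads shown to a user per capping window

# Redis instance holding the per-user counters (host is a placeholder).
r = redis.Redis(host="localhost", port=6379)


def should_bid(user_id: str) -> bool:
    """Bid only if this user's impression counter is below the defined limit."""
    count = r.get(f"cap:{user_id}")
    return int(count or 0) < FREQUENCY_CAP
```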

Technical constraints and implementation

Multiple bidder instances, spread across different regions, are responsible for bidding on advertising opportunities.

The Won bid notification server records displayed ads, and clicks on those ads, via public APIs. It sends ad impression information to the Capping server, which counts impressions per user.

Original architecture

The main requirement of a real-time bidding system is that bid decisions must be made in less than 120 ms. To meet it, our bidders (GCE instances) are replicated and located close to each auction server selling advertising opportunities.

Another challenge is updating the capping information and making it available to our bidders as fast as possible.

To do so, we used Redis master-slave replication. The Capping server maintains the impression counts on a Redis master instance, and each bidder instance runs a local Redis slave so it can make quick bidding decisions based on advertising campaigns and capping. Changes on the master are replicated to all slaves within a few milliseconds.
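As a rough sketch of the write side of this pattern (host name and key layout are hypothetical), the Capping server increments counters on the master and relies on replication to propagate them to every bidder's local slave, where a check like should_bid() above reads them:

```python
import redis

# Redis master maintained by the Capping server (hypothetical internal host name).
master = redis.Redis(host="capping-redis-master.internal", port=6379)


def record_impression(user_id: str, ttl_seconds: int = 86400) -> None:
    """Count one displayed ad for this user; the key expires after the capping window."""
    key = f"cap:{user_id}"
    pipe = master.pipeline()
    pipe.incr(key)                 # increment the per-user impression counter
    pipe.expire(key, ttl_seconds)  # refresh the expiry so the counter eventually disappears
    pipe.execute()
```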

Our main issue was that the Redis master was a single point of failure: any downtime prevented the slaves from being updated, which could make us overbid. Moreover, a new bidder instance had to wait for the initial master-slave synchronization, which could take up to 10 minutes, before being ready. These issues prevented auto-scaling from working properly, so our bidders were sized for traffic peaks, which was not cost-efficient.

We wanted to replace these local Redis instances and remove the master/slave replication pattern. We also wanted to switch to a more managed solution to limit operational costs.


Google Cloud Memorystore

Introduced by Google in May 2018, Cloud Memorystore provides fully managed Redis instances with many interesting features:

  • Allows easy migration with zero code changes (illustrated after this list)
  • High availability instances providing automatic failover
  • Maintenance tasks like patching, scaling (up to 300 GB), and high availability are all managed; dev teams have nothing to configure and can spend more time coding
  • Monitoring is provided by Stackdriver, which exposes the most commonly used Redis metrics
  • Redis instances are isolated by region and protected from the internet using Virtual Private Cloud (VPC) networks
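The "zero code changes" point deserves a small illustration: since Memorystore speaks the Redis protocol, the only change on our side is the host the client connects to (the reserved IP below is made up); the capping code itself stays the same.

```python
import redis

# Before: local Redis slave running on the bidder instance
# r = redis.Redis(host="localhost", port=6379)

# After: the regional Memorystore instance, reachable through its reserved VPC IP
r = redis.Redis(host="10.0.0.4", port=6379)

print(int(r.get("cap:some-user") or 0))  # same commands, same client library
```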

Not so straightforward

Our first idea was to create a single Memorystore instance and use it from all Bidder clusters.

First idea

Things looked easy and wonderful on paper; in practice, however, we immediately ran into Memorystore's access restrictions.

One of Memorystore's restrictions is that an instance can only be accessed from the region it was created in, so only the bidders located in that same region could connect to it (shown by red arrows).

Furthermore, we noticed that the latency between US and EU regions could exceed 100 ms, so the affected bidders would almost always time out. Parallel regional Memorystore instances, on the other hand, would offer a much more acceptable latency of 15 ms or less.

Redesign

Feasible solution

We settled on the solution shown in the diagram above. The Won bid notification server sends messages to a Pub/Sub topic, which is consumed by three Dataflow streaming jobs, one in each chosen region. Three Memorystore instances provide low latency for each Bidder cluster.

When we chose Memorystore for our Redis instances, we also decided to use Dataflow to stream data into it. Dataflow is more flexible than the original Capping server hosted on a GCE instance: the number of worker instances scales automatically, which lets the pipeline handle future surges of traffic.
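A minimal sketch of such a streaming job, assuming the Beam Python SDK, a hypothetical subscription name, JSON messages carrying a user_id, and a made-up Memorystore IP; the real pipeline is more involved:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions


class IncrementCap(beam.DoFn):
    """Increment the per-user capping counter in the regional Memorystore instance."""

    def __init__(self, redis_host):
        self.redis_host = redis_host
        self.client = None

    def setup(self):
        import redis  # imported on the worker so each worker opens its own connection
        self.client = redis.Redis(host=self.redis_host, port=6379)

    def process(self, message):
        event = json.loads(message.decode("utf-8"))
        key = f"cap:{event['user_id']}"
        self.client.incr(key)
        self.client.expire(key, 86400)  # keep counters for one capping window


options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (p
     | "ReadImpressions" >> beam.io.ReadFromPubSub(
         subscription="projects/my-project/subscriptions/impressions-eu")
     | "UpdateCapping" >> beam.ParDo(IncrementCap(redis_host="10.0.0.4")))
```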

We also considered other Google Cloud services such as Cloud Functions, but unfortunately it is not VPC-native and therefore cannot reach Memorystore. App Engine and App Engine Flex support autoscaling and easy deployments without the management overhead, but they are geographically limited to the region chosen the first time App Engine is set up on a project. Using them would require at least one project per region, which complicates maintenance, design, and billing.

Discussion & Conclusion

The components between the Won bid notification server and the Bidders are now fully managed and more flexible.

On the other hand, as seen above, we had to create and operate additional infrastructure because Memorystore cannot be accessed across regions, so the overall complexity has not been reduced that much. The isolated Redis instances also lack consistency: we had to develop additional tooling to handle incidents by reconstructing data from BigQuery and to keep all Redis instances synchronized.
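As an illustration of the reconstruction idea (the table name, schema, and host below are hypothetical), a counter rebuild could query BigQuery for the day's impressions per user and rewrite the keys in a given Memorystore instance:

```python
import redis
from google.cloud import bigquery

bq = bigquery.Client()
r = redis.Redis(host="10.0.0.4", port=6379)  # the Memorystore instance to repopulate

query = """
SELECT user_id, COUNT(*) AS impressions
FROM `my-project.ads.impressions`      -- hypothetical impressions table
WHERE DATE(event_time) = CURRENT_DATE()
GROUP BY user_id
"""

pipe = r.pipeline(transaction=False)
for row in bq.query(query).result():
    # Overwrite each counter with the value recomputed from BigQuery.
    pipe.set(f"cap:{row.user_id}", row.impressions, ex=86400)
pipe.execute()
```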

In addition, Memorystore does not scale automatically, so we have to adapt the capacity to our storage needs manually. And because of the way Redis purges expired keys, we pay for additional storage space: expired keys are kept in memory for a while before being evicted.

Moving our infrastructure to Memorystore is not a cheap solution: a 10 GB Standard Tier instance costs around $400 per month, and we now also have to pay for the auxiliary infrastructure. Still, the result is satisfactory: the Bidder clusters have gotten rid of their local Redis instances, and the startup delay has been considerably reduced, from 10 minutes to a few seconds.

As for latency, our bidders used to reach their local Redis slave in less than 1 ms; with Memorystore, this has increased to 2–3 ms.

Next steps

Future

As a next step, we can improve the architecture by using another Dataflow job to write messages into BigQuery. Migrating the Won bid notification server from Compute Engine to a more flexible infrastructure such as App Engine or Cloud Functions is also on our TODO list. The whole subsystem is becoming more robust and efficient.
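As a rough sketch of that extra job (the topic, table, and schema below are hypothetical), another Beam streaming pipeline could read the same Pub/Sub messages and append them to BigQuery:

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (p
     | "ReadNotifications" >> beam.io.ReadFromPubSub(
         topic="projects/my-project/topics/won-bid-notifications")
     | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
     | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
         "my-project:ads.impressions",
         schema="user_id:STRING,campaign_id:STRING,event_time:TIMESTAMP",
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))
```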
