Processing time zones from latitude/longitude at high scale

Michael Troehler & Guillaume Charhon
Teemo Tech Blog · Nov 30, 2017

Teemo is the Drive-to-Store marketing platform that is revolutionizing retail advertising.

We combine a deep understanding of offline consumer behaviour with algorithmic learning to produce measurable, highly accountable in-store visits. Having the correct time zone is critical to measuring those visits accurately. As we entered new international markets, we needed to match each event with its time zone. Since we couldn’t easily change the anonymous data coming to us, we had to reprocess all our past events, as well as incoming ones in real time.

Searching for a time zone lookup library

We had to find a fast and accurate library which provided the time zone for a given latitude and longitude.

The fastest library we found was java-tzwhere, which loads the tz_world shapefiles and indexes them in an R-tree. It has two main drawbacks for us:

● It only covers time zones on land

● Even though the R-tree is balanced and lookups run in O(log_M n), with M being the maximum number of entries a page can contain, the average lookup takes around 5 ms. At that rate, it would have taken around 92 days to process our 548,628,401,322 records on a 64-core instance.

Splitting the earth

We needed a way to use this library more efficiently. Our idea was to “split the earth” into small squares of a few kilometers each, in order to generate discrete data and precompute the time zone once per square.

The grid spans latitudes from −90 to 90 degrees and longitudes from −180 to 180 degrees; you then choose the granularity between points. We chose about 5 km, which corresponds to 0.05 degrees (one degree of latitude is roughly 111 km, so 0.05 degrees ≈ 5.5 km).

With this configuration we generated a grid of 25,920,000 points (3,600 latitude points × 7,200 longitude points).
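A minimal Python sketch of the idea (the file name, rounding, and variable names are our own simplifications, not the actual generator):

```
# Sketch of the grid generation, assuming a 0.05-degree step.
STEP = 0.05

def generate_grid():
    lat_points = int(180 / STEP)   # 3,600
    lon_points = int(360 / STEP)   # 7,200
    for i in range(lat_points):
        for j in range(lon_points):
            # round() keeps the CSV free of floating-point noise
            yield round(-90 + i * STEP, 2), round(-180 + j * STEP, 2)

with open("grid.csv", "w") as f:
    f.write("latitude,longitude\n")
    for latitude, longitude in generate_grid():
        f.write("%s,%s\n" % (latitude, longitude))
```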

We built a geo zone generator, which is available on our GitHub.

The output file looks like this:

latitude,longitude
51.15,2.2
51.15,2.25
51.15,2.3
51.15,2.35
51.15,2.4
51.15,2.45
51.15,2.5
51.2,2.35
51.2,2.4
51.2,2.45

We then scanned all the generated geo zones and assigned a time zone to each one using java-tzwhere:

51.15,2.2,Europe/Paris
51.15,2.25,Europe/Paris
51.15,2.3,Europe/Paris
51.15,2.35,Europe/Paris
51.15,2.4,Europe/Paris
51.15,2.45,Europe/Paris
51.15,2.5,Europe/Paris
51.2,2.35,Europe/Paris
51.2,2.4,Europe/Paris
51.2,2.45,Europe/Paris
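A minimal sketch of this scan, using the Python tzwhere package as a stand-in for java-tzwhere (our actual pipeline used the Java library):

```
from tzwhere import tzwhere  # Python port of tzwhere, standing in for java-tzwhere

tz = tzwhere.tzwhere()

with open("grid.csv") as src, open("grid_with_tz.csv", "w") as dst:
    next(src)  # skip the latitude,longitude header
    for line in src:
        latitude, longitude = map(float, line.split(","))
        # tzNameAt returns None over oceans and for small islands
        # missing from the tz_world shapefiles
        name = tz.tzNameAt(latitude, longitude)
        dst.write("%s,%s,%s\n" % (latitude, longitude, name or ""))
```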

We then had a simple grid with time zones assigned to all land. One problem remained: how do we assign a time zone to islands that java-tzwhere could not find?

1. We built a k-d tree from the previous file, filtering out the lines without a time zone.

2. We re-scanned the input file and, for each latitude/longitude without a time zone, queried the k-d tree for the nearest neighbor. If the nearest point was within a threshold distance (let’s say 1 kilometer), we assigned its time zone to the current location, as sketched below.
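A minimal sketch of this nearest-neighbor pass, using scipy’s k-d tree and a crude degree-based distance cutoff (the threshold value is an assumption; a real implementation would use a proper geographic distance):

```
import numpy as np
from scipy.spatial import cKDTree

rows = [line.rstrip("\n").split(",") for line in open("grid_with_tz.csv")]

# Build the k-d tree only from grid points that already have a time zone
known = [(float(lat), float(lon)) for lat, lon, name in rows if name]
known_names = [name for _, _, name in rows if name]
tree = cKDTree(np.array(known))

MAX_DISTANCE = 0.01  # in degrees, roughly 1 km; assumed threshold

for row in rows:
    lat, lon, name = row
    if not name:
        distance, i = tree.query((float(lat), float(lon)))
        if distance < MAX_DISTANCE:
            row[2] = known_names[i]  # inherit the nearest point's time zone
```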

We finally had a file with time zones on land and islands!

Building the timezone lookup library

The last step was to create a library which gives you the time zone for a given latitude and longitude.

The dataset generated in the previous section was loaded into an array. The array index for a given latitude and longitude can be found with the following hash function:

```
def index(latitude, longitude):
    # Convert coordinates to grid offsets; STEP is the granularity in degrees
    index_latitude = int((latitude + 90) / STEP)
    index_longitude = int((longitude + 180) / STEP)
    # Row-major layout: one row of LONGITUDE_POINTS entries per latitude step
    return index_longitude + (index_latitude * LONGITUDE_POINTS)
```

STEP is the granularity of the dataset in degrees (0.05 in our case), and LONGITUDE_POINTS is the number of longitude points per latitude row (7,200).

You can find our open-source Java implementation on GitHub: timezone-lookup. The performance is obviously much faster, since you read the value directly from a primitive array.
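As an illustration, here is a self-contained Python sketch of loading the dataset and answering lookups in constant time (the file name carries over from the earlier steps; everything else follows the hash function above):

```
STEP = 0.05
LONGITUDE_POINTS = int(360 / STEP)  # 7,200
LATITUDE_POINTS = int(180 / STEP)   # 3,600

def index(latitude, longitude):
    # Same hash function as above
    return int((longitude + 180) / STEP) + int((latitude + 90) / STEP) * LONGITUDE_POINTS

# One slot per grid point, addressed by the hash function
timezones = [None] * (LATITUDE_POINTS * LONGITUDE_POINTS)

with open("grid_with_tz.csv") as f:
    for line in f:
        lat, lon, name = line.rstrip("\n").split(",")
        timezones[index(float(lat), float(lon))] = name or None

def lookup(latitude, longitude):
    return timezones[index(latitude, longitude)]

print(lookup(51.15, 2.3))  # -> Europe/Paris
```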

Processing a huge amount of events

As our data is securely stored in Google Cloud Storage (GCS), Cloud Dataflow was a natural choice to process our historical events. Cloud Dataflow is a managed service which provides stream and batch data processing. As Cloud Dataflow supports Java, we were able to run a batch pipeline using our timezone-lookup library, reading from one GCS bucket and storing the results in another. It took us a few days to process hundreds of terabytes of data.
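For illustration, here is what such a batch pipeline could look like with Beam’s Python SDK (the bucket paths, project id, and event format are hypothetical; our actual pipeline used the Java SDK and our timezone-lookup library):

```
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def add_timezone(line):
    # Assume each event line starts with "latitude,longitude,...";
    # lookup() is the constant-time function sketched in the previous section
    latitude, longitude = map(float, line.split(",")[:2])
    return "%s,%s" % (line, lookup(latitude, longitude))

options = PipelineOptions(
    runner="DataflowRunner",            # executes on Cloud Dataflow
    project="my-project",               # hypothetical project id
    temp_location="gs://my-bucket/tmp", # hypothetical staging bucket
)

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | beam.io.ReadFromText("gs://events-input/*")    # hypothetical input bucket
     | beam.Map(add_timezone)
     | beam.io.WriteToText("gs://events-output/enriched"))
```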

We were also able to use timezone-lookup to assign a time zone to our real-time incoming events (hundreds of thousands of events per second) in our regular Java Vert.x pipeline.

Conclusion

We had to overcome multiple obstacles to process this large number of events, both in our historical data and in our real-time incoming traffic. As the available libraries were too slow for our usage, we built our own time zone lookup file. We also built a k-d tree to handle islands. Finally, Cloud Dataflow helped us scale the implementation within a very limited timeframe using a serverless approach, sparing us the burden of cluster administration.
