Behind the Scenes: Big Data at Torbit

by Tylor Arndt on February 19, 2013

A Trillion Queryable Performance Metrics (and Counting)

An ever-increasing torrent of data flows into the analytical engines of Torbit. The pageviews represented by Real User Measurement (RUM) data are the lifeblood of the internet. Helping our customers deeply understand their users’ experiences through their RUM data is a core component of our mission to make the internet faster.

Torbit Insight generates a mind-boggling amount of data each day. The increasing volume of RUM beacon data can be attributed to our existing customers’ growing success and to strong, continued customer growth. We process over 6 billion performance metrics a day, and our goal is to keep our customers’ data safe forever. The volume of this data is a core metric of our success. Every success has a price; in this case, the price of Torbit’s continued success is an ever-increasing volume of data to store, analyze and present to our customers.

Alternative Solutions

At Torbit, we evaluated a number of the standard industry tools in the big data toolbox, such as Hadoop, Riak, MySQL Cluster and Google’s BigQuery. All of these tools fell short in one respect or another for our specific use case and goals.

We wanted:

  • Multi-host MapReduce (or similar) support
  • Minimal query latency
  • High throughput
  • Minimal infrastructure changes
  • Low recurring cost outside of growth-related costs
  • Scalable growth
  • Independence from third-party service providers


No solution came close to meeting all of these requirements, and we wanted to satisfy as many of them as possible. In the graph below, you will see that our final solution was able to outperform Google BigQuery, with common queries running about three times faster.

Since most off-the-shelf solutions are general purpose, they often sacrifice performance, storage efficiency or both to provide extra features that are superfluous to our needs. For example, most of the data we process fits the paradigm of a fixed-schema, keyed time-series database. This means that any storage engine with relational or variable-schema support will have made structural compromises to enable functionality that our narrow use case does not benefit from.

After considering our goal of a high-performance, highly scalable, fixed-purpose clustered data-store, we ran a short-term R&D project to determine whether this might be one of those rare cases where developing our own solution was the best path. Reinventing the wheel is not to be undertaken lightly, so we wanted to be confident that our use case was sufficiently specific to warrant it.

After evaluating our unoptimized prototype and discovering that it was simple, extensible and competitive in performance with some of the off-the-shelf solutions on our evaluation list, we decided to commit. Additionally, a peripheral benefit of building our own solution was that we were able to write our implementation in Go (also known as Golang). We find that Go is very well suited to this kind of development, and it has become Torbit’s preferred back-end development language.

Big Wins Processing Big Data

As is common in many large-scale data-crunching systems, our solution starts with a MapReduce library. For this purpose we created a MapReduce implementation that we call Atlas. Atlas not only supports local and multi-host (network) MapReduce jobs, it also supports external mappers written in Go (which is convenient to deploy because Go binaries are statically linked). Since Cgo can be used to mix C and Go in the same project, it would be trivial to write most of a mapper in C if that were desired. Go’s support for functions as first-class citizens, as well as the built-in serialization provided by “encoding/gob” in the standard library, made this task much more pleasant than it would have been in many other languages.
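
Atlas’s own API is not shown in this post, so the names in the sketch below are hypothetical; it only illustrates the pattern that made Go a good fit: a mapper builds a partial aggregate from its host’s slice of the data, “encoding/gob” serializes that result for the trip across the network, and a reducer merges the partial results.

    package main

    import (
        "bytes"
        "encoding/gob"
        "fmt"
        "log"
    )

    // Metric is a hypothetical beacon record; the real Atlas schema is not
    // shown in this post.
    type Metric struct {
        Timestamp int64   // Unix seconds
        Name      string  // e.g. "page_load_ms"
        Value     float64 // measured value in milliseconds
    }

    // Result is the partial aggregate a mapper produces from its local data.
    type Result struct {
        Count int64
        Sum   float64
    }

    // mapper aggregates one host's slice of the data and gob-encodes the
    // partial result, as it would be before crossing the network to the reducer.
    func mapper(records []Metric) ([]byte, error) {
        var part Result
        for _, m := range records {
            part.Count++
            part.Sum += m.Value
        }
        var buf bytes.Buffer
        if err := gob.NewEncoder(&buf).Encode(part); err != nil {
            return nil, err
        }
        return buf.Bytes(), nil
    }

    // reducer decodes each mapper's partial result and merges them.
    func reducer(parts [][]byte) (Result, error) {
        var total Result
        for _, p := range parts {
            var part Result
            if err := gob.NewDecoder(bytes.NewReader(p)).Decode(&part); err != nil {
                return Result{}, err
            }
            total.Count += part.Count
            total.Sum += part.Sum
        }
        return total, nil
    }

    func main() {
        // In a real multi-host job these two slices would live on different
        // cluster hosts and the encoded partials would travel over the network.
        hostA := []Metric{{1361232000, "page_load_ms", 812}, {1361232001, "page_load_ms", 640}}
        hostB := []Metric{{1361232002, "page_load_ms", 958}}

        var parts [][]byte
        for _, local := range [][]Metric{hostA, hostB} {
            p, err := mapper(local)
            if err != nil {
                log.Fatal(err)
            }
            parts = append(parts, p)
        }

        total, err := reducer(parts)
        if err != nil {
            log.Fatal(err)
        }
        fmt.Printf("mean page load: %.1f ms over %d beacons\n", total.Sum/float64(total.Count), total.Count)
    }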

To abstract the underlying details and complexity involved in invoking MapReduce jobs, we created a web-service to act as an arbiter between the data cluster and the external front-end client systems that consume it.

Introducing Atlas

 

[Graph: query performance compared with Google BigQuery]

Atlas is a general-purpose MapReduce library that can be used for everything from crunching log files to optimizing user content; however, we still needed a highly efficient way to store our analytics data for the Atlas mappers to leverage. With the prerequisite of a solid MapReduce library behind us, the next component we needed was a data-store.

When creating a fixed-schema data-store is warranted, one of the most visible benefits to be gained is the extreme efficiency with which you can store your data. Techniques such as careful binary encoding and field de-duplication can massively reduce data size. We saw significant savings, and we know we have further headroom to greatly reduce our storage usage.

[Graph: storage savings from the compact fixed-schema encoding]
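
The sketch below is not our actual on-disk format; it is a minimal illustration, with made-up field names, of the two techniques mentioned above: fixed-schema records packed with “encoding/binary”, so no per-row field names or type tags are stored, and a small interning table that de-duplicates repeated string values.

    package main

    import (
        "bytes"
        "encoding/binary"
        "fmt"
        "log"
    )

    // stringTable de-duplicates repeated field values (metric names, URLs, ...)
    // by storing each distinct string once and referencing it by a small ID.
    // The table itself would be written once per data file, amortized across
    // many records.
    type stringTable struct {
        ids map[string]uint16
    }

    func (t *stringTable) id(s string) uint16 {
        if t.ids == nil {
            t.ids = make(map[string]uint16)
        }
        if id, ok := t.ids[s]; ok {
            return id
        }
        id := uint16(len(t.ids))
        t.ids[s] = id
        return id
    }

    // record is a hypothetical fixed-schema time-series sample. Because the
    // schema is fixed, every row has the same compact binary shape.
    type record struct {
        Delta  uint32 // seconds since the start of the file's time window
        Metric uint16 // ID into the string table
        Value  uint32 // e.g. milliseconds
    }

    func main() {
        var table stringTable
        samples := []struct {
            ts     uint32
            metric string
            value  uint32
        }{
            {0, "page_load_ms", 812},
            {1, "page_load_ms", 640},
            {2, "dns_ms", 38},
        }

        // Every record packs into the same 10 bytes.
        var buf bytes.Buffer
        for _, s := range samples {
            rec := record{Delta: s.ts, Metric: table.id(s.metric), Value: s.value}
            if err := binary.Write(&buf, binary.LittleEndian, rec); err != nil {
                log.Fatal(err)
            }
        }
        fmt.Printf("%d records in %d bytes\n", len(samples), buf.Len())
    }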

While any compression scheme trades CPU utilization against storage capacity, when CPU capacity is sufficiently abundant a more compact data format simply means less I/O. This almost always results in better data-store performance.

Since our data-store is fundamentally a time-series database, we were able to structure our data so that most read and write operations are sequential. SSDs are fantastic drop-in tools for increasing I/O performance, and in random workloads they are often an order of magnitude faster than traditional disks. However, in sequential workloads SSDs are often only 1.5-3x faster than much larger traditional disks, and they are much more expensive per unit of capacity. In sequential I/O workloads it is not uncommon to discover that a RAID array of spinning disks at the same price point is actually much faster than an SSD, with the added benefit of greater capacity.

[Graph: sequential I/O performance, SSD versus spinning-disk RAID]
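
The sketch below is an illustration rather than a description of our actual files: when fixed-size records are appended in timestamp order, a time-range query reduces to a single seek followed by one contiguous, sequential scan.

    package main

    import "fmt"

    // A layout sketch (an assumption, not our actual file format): each metric
    // gets one append-only file per day, and fixed-size records are appended in
    // timestamp order. Writes are pure appends, and a time-range read becomes a
    // single seek followed by one contiguous, sequential scan.

    const recordSize = 10 // bytes per fixed-schema record (see the encoding sketch above)

    // byteRange converts a record range [first, last] into the contiguous byte
    // range a range query would scan. A small in-memory index mapping
    // timestamps to record positions is assumed to supply first and last.
    func byteRange(first, last int64) (offset, length int64) {
        return first * recordSize, (last - first + 1) * recordSize
    }

    func main() {
        // Suppose the index says records 120000 through 180000 cover the hour
        // the client asked about.
        offset, length := byteRange(120000, 180000)
        fmt.Printf("seek to byte %d, then read %d bytes sequentially\n", offset, length)
    }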

System Outline

The final system begins with a web-service against which client systems interface. To ensure resiliency, an instance of the web-service runs on each cluster host. When a client request arrives, the web-service creates a MapReduce job to fulfill it. The reducer function of that MapReduce job runs within the web-service instance handling the request.

The Atlas library code running in the web-service communicates with the remote Atlas function servers on each host, invoking the specified mapper function. The function servers’ mappers then communicate with their local data-store to gather their subset of the data and perform their processing task.

After processing, the mappers send their results back through Atlas to the reducer running within the web-service. The reducer then formats and returns its response to the external client.
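
None of Atlas’s real interfaces appear in this post, so the handler below is only a sketch of the flow just described, with hypothetical names and a stubbed-out remote call: the web-service fans a query out to each host’s function server, merges the partial results in-process, and formats the response for the client.

    package main

    import (
        "encoding/json"
        "log"
        "net/http"
        "sync"
    )

    // partial is the aggregate each host's mapper returns. The field names and
    // the wire format are hypothetical (the post mentions "encoding/gob").
    type partial struct {
        Count int64
        Sum   float64
    }

    // cluster lists the hosts whose Atlas function servers the web-service
    // would call; the addresses are made up for this sketch.
    var cluster = []string{"atlas-node-1:9090", "atlas-node-2:9090"}

    // invokeMapper stands in for the Atlas client call that asks a remote
    // function server to run a named mapper against its local data-store.
    // Here it returns canned data so the sketch runs on its own.
    func invokeMapper(host, mapperName string) (partial, error) {
        return partial{Count: 1000, Sum: 850000}, nil
    }

    // queryHandler is the reducer side of the flow: fan the query out to every
    // host, merge the partial results in-process, and format the answer.
    func queryHandler(w http.ResponseWriter, r *http.Request) {
        var (
            mu    sync.Mutex
            total partial
            wg    sync.WaitGroup
        )
        for _, host := range cluster {
            wg.Add(1)
            go func(host string) {
                defer wg.Done()
                p, err := invokeMapper(host, "mean_page_load")
                if err != nil {
                    log.Printf("mapper on %s failed: %v", host, err)
                    return
                }
                mu.Lock()
                total.Count += p.Count
                total.Sum += p.Sum
                mu.Unlock()
            }(host)
        }
        wg.Wait()

        w.Header().Set("Content-Type", "application/json")
        json.NewEncoder(w).Encode(map[string]float64{
            "mean_page_load_ms": total.Sum / float64(total.Count),
        })
    }

    func main() {
        http.HandleFunc("/query", queryHandler)
        log.Fatal(http.ListenAndServe(":8080", nil))
    }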

Here is an example of how a small two-node cluster might be configured.
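
The configuration below is illustrative rather than Atlas’s actual format; the host names, ports and paths are assumptions. It only captures the shape described above: each host runs a web-service, a function server and a local data-store shard, and knows about its peer.

    package main

    import "fmt"

    // HostConfig describes one cluster member in this sketch. The field names,
    // host names and ports are hypothetical; they only mirror the components
    // described above: a web-service for clients, an Atlas function server for
    // mappers, and a local data-store directory holding that host's shard.
    type HostConfig struct {
        Name           string
        WebServiceAddr string
        FunctionAddr   string
        DataDir        string
        Peers          []string
    }

    func main() {
        cluster := []HostConfig{
            {
                Name:           "atlas-node-1",
                WebServiceAddr: ":8080",
                FunctionAddr:   ":9090",
                DataDir:        "/data/atlas",
                Peers:          []string{"atlas-node-2:9090"},
            },
            {
                Name:           "atlas-node-2",
                WebServiceAddr: ":8080",
                FunctionAddr:   ":9090",
                DataDir:        "/data/atlas",
                Peers:          []string{"atlas-node-1:9090"},
            },
        }
        for _, h := range cluster {
            fmt.Printf("%s: web-service %s, function server %s, data in %s, peers %v\n",
                h.Name, h.WebServiceAddr, h.FunctionAddr, h.DataDir, h.Peers)
        }
    }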

We’re a data-driven company and we’re dedicated to building the tools we need to better understand what makes websites fast. Atlas is just one of the tools we’ve built at Torbit. If you have an interest in and a passion for big data processing, making the internet faster and solving interesting problems, you should send us a note. We always enjoy talking to other people who get excited about the power of big data to transform businesses.

