Tracking your website performance with a RUM (Real User Measurement) tool like Torbit Insight is a critical first step for all websites. It’s impossible to know how your site is performing without tracking it regularly and RUM is one of the best ways to do this.
Once you have a baseline idea of how your site is performing using the standard metrics like Page Ready and Page Load it can be extremely valuable to start using your own site specific events to track performance. For instance, maybe “Above the fold” time or the time that it takes for users to be able to interact with the sidebar navigation is actually a more significant performance metric to track with your visitors. Steve Souders recently posted about the importance of moving beyond “window.onload()” and we couldn’t agree more. This is why we’re proud to announce the addition of custom event tracking in Torbit Insight.
With custom events, you can instrument your page to track any relevant timing metric and see all the same reports you know and love in Torbit Insight. The simplest example is tracking a different load time metric (something other than Page Load or Page Ready, which are included by default) – for example “Sidebar loaded”. To add tracking for this custom event, you’d simply need to add the following code snippet at the point in the page where you consider the sidebar to be fully loaded.
So start tracking your own custom events and get an even better view into your site’s performance and how it’s impacting your business. Contact us at firstname.lastname@example.org to get custom events added to your Insight account.
A Trillion Queryable Performance Metrics (and Counting)
An ever increasing torrent of data flows into the analytical engines of Torbit. The pageviews represented by Real User Measurement (RUM) data are the life-blood of the internet. Helping our customers deeply understand their users’ experiences through their RUM data is a core component of our mission to make the internet faster.
Torbit Insight generates a mind-boggling amount of data each day. The increasing volume of RUM beacon data can be attributed to our existing customers’ increased success and to strong continued customer growth. We process over 6 billion performance metrics a day and our goal is to keep our customers’ data safe forever. The volume of this data is a core metric of our success. Every success has a price to be paid; in this case, the price of Torbit’s continued success is an ever increasing volume of data to store, analyze and present to our customers.
At Torbit, we evaluated a number of the standard industry tools in the big data toolbox such as Hadoop, Riak, MySQL Cluster and Google’s BigQuery. All of these tools fell short in one respect or another for our specific use case and goals. Our requirements were:
Multi-host, MapReduce or similar support
Minimal query latency
Minimal possible infrastructure changes
Low recurring cost outside of growth related costs
Independence from third party service providers
No solution came close to meeting all these requirements, and we wanted as many as we could get, if not all of them. In the graph below, you will see that our final solution was able to out-perform Google BigQuery with common queries being about 3 times faster.
Since most off-the-shelf solutions are general purpose, they often sacrifice performance, storage efficiency or both to provide extra features that are superfluous to our needs. For example, most of the data we process fits into the paradigm of a fixed-schema keyed time-series database. This means that any storage engine with relational or variable-schema support will have made structural compromises to enable functionality that our narrow use case does not benefit from.
After considering our goal of a high performance, highly scalable, fixed-purpose clustered data-store, we ran a short-term R&D project to determine if this might be one of those rare cases where developing our own solution was the best path. Reinventing the wheel is not to be undertaken lightly, so we wanted to be confident that our use case was sufficiently specific to warrant it.
After evaluating our unoptimized prototype and discovering it was simple, extensible and had performance competitive with some of the off-the-shelf solutions on our evaluation list, we decided to commit. Additionally, a peripheral benefit of building our own solution was that we were able to write our implementation in Go (also known as Golang). We find that Go is very well suited for this kind of development, and it has become Torbit’s preferred back-end development language.
Big Wins Processing Big Data
As is common in many large scale data crunching systems, our solution starts with a MapReduce library. For this purpose we created a MapReduce implementation that we call Atlas. Atlas not only supports local and multi-host (network) MapReduce Jobs, it supports external mappers written in Go (which is important because Go is statically linked). Since Cgo can be used to mix C and Go in the same project, it would be trivial to write most of a mapper in C if that was desired. Go’s support for functions as first class citizens, as well as built-in serialization from “encoding/gob” in the standard library, made this task much more pleasant than it would have been in many other languages.
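To make the idea concrete, here is a minimal in-process sketch of the MapReduce pattern in Go. It is not the Atlas API: the names Partial, Mapper, Reducer and runJob are hypothetical, and goroutines stand in for the remote hosts that a multi-host job would use. It does show why first-class functions are convenient here, since mappers and reducers are passed around as ordinary values.

```go
package main

import (
	"fmt"
	"sync"
)

// Partial is a hypothetical intermediate result: a count and a sum of load times.
type Partial struct {
	Count int
	Sum   float64
}

// Mapper processes one shard of samples; Reducer merges the partial results.
// Both are plain function types (illustrative names, not the Atlas API).
type Mapper func(shard []float64) Partial
type Reducer func(parts []Partial) Partial

// runJob runs one mapper per shard concurrently, then reduces the results.
// In a multi-host setup each shard would live on a different machine.
func runJob(shards [][]float64, m Mapper, r Reducer) Partial {
	parts := make([]Partial, len(shards))
	var wg sync.WaitGroup
	for i, shard := range shards {
		wg.Add(1)
		go func(i int, shard []float64) {
			defer wg.Done()
			parts[i] = m(shard) // each goroutine writes only its own slot
		}(i, shard)
	}
	wg.Wait()
	return r(parts)
}

// sumMapper counts and sums the samples in a single shard.
func sumMapper(shard []float64) Partial {
	p := Partial{}
	for _, v := range shard {
		p.Count++
		p.Sum += v
	}
	return p
}

// mergeReducer adds up the per-shard counts and sums.
func mergeReducer(parts []Partial) Partial {
	total := Partial{}
	for _, p := range parts {
		total.Count += p.Count
		total.Sum += p.Sum
	}
	return total
}

func main() {
	// Made-up load times in seconds, split across three shards.
	shards := [][]float64{{1.2, 2.4}, {3.0}, {0.9, 1.5, 2.0}}
	total := runJob(shards, sumMapper, mergeReducer)
	fmt.Printf("samples=%d mean=%.2fs\n", total.Count, total.Sum/float64(total.Count))
}
```

Because the mapper only returns a small Partial, very little data has to travel back to the reducer, which is the property that makes this pattern attractive across hosts.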
To abstract the underlying details and complexity involved with invoking the MapReduce jobs, we created a web-service to act as an arbiter between the data cluster and the consuming external front-end client systems.
Atlas is a general purpose MapReduce library that can be used for everything from crunching log files to optimizing user content; however, we still need a highly efficient way to store our analytics data for the Atlas mappers to leverage. With the prerequisite of a solid MapReduce library behind us, the next component we needed was a data-store.
When creating a fixed schema data-store is warranted, one of the most visible benefits to be gained is the extreme efficiency with which you can store your data. Techniques such as careful binary encoding and field de-duplication can massively reduce data size. We saw significant savings, and we know we have further headroom to greatly reduce our storage usage.
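As a rough illustration of how much a fixed binary layout can save, the sketch below packs a record into 12 bytes with Go’s encoding/binary package and compares it with the same record as JSON. The Sample struct and its fields are assumptions made for this example, not our actual schema; the PageID field stands in for field de-duplication by storing an index into a lookup table rather than the URL itself.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"encoding/json"
	"fmt"
)

// Sample is a hypothetical fixed-schema record; the fields are illustrative only.
type Sample struct {
	Timestamp uint32 // seconds since the Unix epoch
	PageID    uint32 // de-duplicated: index into a URL table, not the URL string
	LoadMs    uint16 // load time in milliseconds
	Browser   uint8  // small enum instead of a user-agent string
	Country   uint8  // small enum instead of a country name
}

// encode packs the record into a fixed 12-byte little-endian layout.
func encode(s Sample) ([]byte, error) {
	var buf bytes.Buffer
	if err := binary.Write(&buf, binary.LittleEndian, s); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decode reads a record back from its 12-byte form.
func decode(b []byte) (Sample, error) {
	var s Sample
	err := binary.Read(bytes.NewReader(b), binary.LittleEndian, &s)
	return s, err
}

func main() {
	s := Sample{Timestamp: 1350000000, PageID: 42, LoadMs: 2190, Browser: 3, Country: 1}
	packed, err := encode(s)
	if err != nil {
		panic(err)
	}
	asJSON, err := json.Marshal(s)
	if err != nil {
		panic(err)
	}
	back, err := decode(packed)
	if err != nil {
		panic(err)
	}
	fmt.Printf("binary: %d bytes, JSON: %d bytes, round trip ok: %v\n",
		len(packed), len(asJSON), back == s)
}
```

Fixed-width records also make sequential scans simple, because the offset of the Nth record is just N times the record size.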
While there is always a CPU utilization versus storage capacity trade-off to be had in any compression scheme, when CPU capacity is sufficiently abundant, a more compact data format means less I/O. This almost always results in better data-store performance.
Since our data-store is fundamentally a time-series database, we were able to structure our data in such a manner to ensure that most read and write operations were sequential. SSDs are fantastic drop-in tools for increasing I/O performance, and when they are used in random workloads SSDs are often an order of magnitude faster than traditional disk. However, in sequential workloads SSDs are often only 1.5-3x faster than much larger traditional disks and are much more expensive per unit of capacity. In sequential I/O workloads it is not uncommon to discover that a RAID array of spinning disks at the same price point is actually much faster than SSD, as well as having the expected benefit of greater capacity.
The final system begins with a web-service against which client systems interface. To ensure resiliency, an instance of the web-service runs on each cluster host. When a client request arrives the web-service creates a MapReduce job to fulfill client requests. The reducer function component of the MapReduce job runs within the web-service handling the request.
The Atlas library code running in the web-service communicates with the remote Atlas function servers on each host, invoking the specified Mapper function. The function server’s mappers then communicate with their local data-store to gather their sub-set of data and proceed to perform their processing task.
After processing, the mappers send their results back to the reducer running within the web-service via Atlas. The reducer proceeds to format and return its response to the external client.
Here is an example of how a small two node cluster might be configured:
We’re a data-driven company and we’re dedicated to building the tools we need to better understand what makes websites fast. Atlas is just one of the tools we’ve built at Torbit. If you have an interest and passion for big data processing, making the internet faster and solving interesting problems you should send us a note. We always enjoy talking to other people who get excited about the power of big data to transform businesses.
It wasn’t that long ago when anyone looking for Real User Measurement (RUM) had to implement it themselves. Thankfully, that is no longer the case. As the RUM space gets more crowded, I thought it might be valuable to make a list of 10 things you should consider when choosing a RUM provider.
1: Ability to correlate your site speed with revenue
Site speed = $$. One of the most powerful things you can do with RUM is show the correlation between your site speed and your business metrics like your bounce rate, conversion rate and revenue. No RUM product is complete without the ability to go to your boss with a case study showing how much performance matters for your business.
2: 100% sampling rate
Every pageview matters. Processing data from billions of pageviews is hard. Many people don’t want to deal with that much data and choose to sample instead. Google Analytics collects some limited performance data, but their default sampling rate is 1%! Behind every pageview statistic is a story – a story of a potential customer, on the brink of clicking away because your site is too slow! You want a RUM provider that understands that every pageview matters and will track your performance data across 100% of your visitors.
3: Percentiles and histograms
Averages lie. Averages can be very misleading when looking at performance data. I regularly see websites with an average load time of 4 seconds, while 10% of their visitors are experiencing 20 second load times! If you believe every pageview matters, you’ll probably agree that 20 second load times aren’t okay. To really understand what is happening on your site, you need to be analyzing your top percentiles and looking at a histogram breakdown.
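Here is a small Go sketch of that effect. The data set is made up to match the scenario above (most visitors at 2 seconds, 10% at 20 seconds), and the percentile function uses the simple nearest-rank definition; other tools may interpolate differently.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// mean returns the arithmetic mean, or 0 for an empty slice.
func mean(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	var sum float64
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

// percentile returns the nearest-rank percentile for p in (0, 100].
func percentile(xs []float64, p float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	sorted := append([]float64(nil), xs...) // copy so the caller's slice is untouched
	sort.Float64s(sorted)
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

func main() {
	// Hypothetical load times in seconds: 80% at 2s, 10% at 4s, 10% at 20s.
	var samples []float64
	for i := 0; i < 16; i++ {
		samples = append(samples, 2)
	}
	samples = append(samples, 4, 4, 20, 20)

	fmt.Printf("mean=%.1fs p50=%.1fs p95=%.1fs\n",
		mean(samples), percentile(samples, 50), percentile(samples, 95))
}
```

For this sample the mean is 4 seconds and the median is 2 seconds, while the 95th percentile is 20 seconds. The average alone gives no hint that one visitor in ten is waiting that long.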
4: Realtime data
Live in the moment. It’s not very helpful if you find out tomorrow that your site was slow today. I recently talked to a top news site that used our realtime feature to catch a bad deploy. They didn’t have synthetic tests set up on the affected page, so Keynote reported that everything was fine. Only because they had our realtime graph on a wall monitor were they able to catch the issue right away and revert the slow code.
5: Total browser coverage
6: Drilldown capabilities
Actionable data. Our slowest pages tab not only gives developers a prioritized todo list, but also gives actionable suggestions on what to fix. Want to know how fast your site loads in IE 6? Or compare your site speed in NY to your speed in SF? Drilldown capabilities are crucial for doing in-depth analysis. We’ve actually had customers build new data centers after digging into their performance data and realizing how much even a 1-second speed up was worth to them. RUM is great as an analytics tool. It’s best when it drives actionable decisions.
7: Long term data retention
Commitment. At Torbit, we’ll keep your data safe forever. Believe it or not, some analytics tools actually throw your data away over time. The data gets expensive to store over time, so they just toss it out. But what if you want to compare your site speed today to your site speed a year ago? Personally, I want to know that my data is being stored securely and will be available, no matter how far in the future I end up needing it.
8: Support for A/B testing
Measure then optimize. It’s very powerful to be able to test the performance implications of a new feature with your actual visitors. One of the most common uses of Torbit Insight is to evaluate CDN performance. Wayfair used Torbit Insight to discover that Akamai was not delivering a meaningful improvement to their site speed and shared the results on the Wayfair blog.
9: Affordable price
How about free? At Torbit, we’re really passionate about making the web faster. We believe the first step in achieving our mission is to help people get their hands on accurate performance data. We’ve worked hard to make sure our free plan includes tons of useful functionality. If at some point, you choose to upgrade to a paid plan, great! If not, that’s fine too.
10: Proven team and proven product
Don’t get left behind. The idea of RUM has been around for a while, but adoption has only recently started to pick up. We live in a fast changing environment. I personally believe we have only scratched the surface of what is possible with RUM. Resource Timing is already available in IE 10 and will be rolling out to other browsers soon. Make sure you pick a product built by a proven team that you can trust to stay on the cutting edge as new technology is available.
Torbit is used by hundreds of sites including multi-billion dollar corporations, top retailers and leading media properties. Our Insight product has been battle tested with billions of performance metrics. We started Torbit with an audacious mission and we’re dedicated to seeing it through. We live to make the web faster. In spite of all the improvements in browser technology and faster connection speeds, websites are slower than ever. As we head into 2013, I hope you’ll join us in pursuing a faster internet for everyone.
The concept of prefetching is pretty simple. We often know about resources the browser is likely to need before the browser does. Prefetching involves either giving the browser hints of pages or resources it is likely to need so that it can download them ahead of time, or actually downloading resources into the browser cache before needed so that the overhead of requesting and downloading the object can be preemptively handled or done in a non-blocking way.
There are many ways to prefetch content, but here are 3 simple options.
DNS Prefetching
DNS is the protocol that converts human readable domains (mysite.com) into computer readable IPs (22.214.171.124). DNS resolution is generally pretty fast and measured in hundreds of milliseconds, but because it must happen before any request to the server can be made it can cause a cascade effect that has a real impact on the overall load time of a page. Often we know about several other domains that will need to be loaded for resources later in the page or user session, such as subdomains for static content (images.mydomain.com) or domains for 3rd party content. Some browsers support a link tag that identifies these domains that need to be resolved so the browser can resolve them ahead of time. The tag to do this is pretty straightforward:
Adding this tag causes the browser to do the DNS resolution ahead of time, instead of waiting until a resource requires it later. This technique is probably most valuable to preload DNS for content on other pages on your site that visitors are likely to go to. This feature is supported in Chrome, Firefox, and IE9+.
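If your pages are generated server-side, the hints can be emitted from a list of host names. The Go sketch below assumes the commonly documented form of the hint, a link element with rel="dns-prefetch" and a protocol-relative href. The function name dnsPrefetchTags and the second host name are made up for the example.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// dnsPrefetchTags returns one <link rel="dns-prefetch"> element per host,
// ready to be placed in the <head> of a page.
func dnsPrefetchTags(hosts []string) string {
	var b strings.Builder
	for _, h := range hosts {
		// Protocol-relative href: only the host name matters for a DNS hint.
		fmt.Fprintf(&b, "<link rel=\"dns-prefetch\" href=\"//%s\">\n", html.EscapeString(h))
	}
	return b.String()
}

func main() {
	// Hosts the visitor is likely to need later in the page or session.
	fmt.Print(dnsPrefetchTags([]string{"images.mydomain.com", "cdn.example.com"}))
}
```

Keeping the host list in one place makes it easy to add a third-party domain once and have every page hint it.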
Although shaving a few hundred milliseconds might seem trivial, in aggregate this can be a measurable gain. It’s also a safe optimization and easy to implement. I was curious to see how often this technique is used, so I crawled the top 100K Alexa sites. It turns out only 552 sites (0.55%) are currently using DNS prefetching. This is a cheap win, and something more sites should leverage.
Resource Prefetching
Images make up a large portion of the overall bytes of many major websites today. Often the overhead of making the requests and downloading images can have a significant performance impact. In many cases, though, the site developer knows when an image will be needed that won’t be detected early by the browser, such as an image loaded from an ajax request or other user action on the page. Resource prefetching is when you load an image, script, stylesheet, or other resource into the browser preemptively. This is most often done with images, but can be done with any type of resource that can be cached in the browser.
Of the three techniques I’m covering here, this is by far the oldest and the most used. Unfortunately I can’t give a concrete number about adoption because there are too many ways to implement this to detect in my Alexa crawl. Still, many sites don’t properly leverage this technique and even just preloading a few images can make a huge difference for the user experience.
Page Prefetching / Prerendering
Page prefetching is very similar to resource prefetching, except that we actually load the new page itself preemptively. This was first made available in Firefox. You can hint to the browser that a page (or an individual resource) should be prefetched by including the following tag:
In the case of prerendering, the browser not only downloads the page, but also the necessary resources for that page. It also begins to render the page in memory (not visible to the user) so that when the request for the page is made it can appear nearly instantaneous to the user. Prerendering was first added in Chrome. You can hint that a page should be prerendered by including the following tag:
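Both hints use the same link element and differ only in the rel value, so a server-side helper can generate either one. This Go sketch assumes the commonly documented rel="prefetch" and rel="prerender" forms; the helper name linkHint and the example URL are hypothetical.

```go
package main

import (
	"fmt"
	"html"
)

// linkHint renders a <link> hint for the given rel value and URL.
// Only the two hints discussed here are accepted.
func linkHint(rel, href string) (string, error) {
	switch rel {
	case "prefetch", "prerender":
		return fmt.Sprintf(`<link rel="%s" href="%s">`, rel, html.EscapeString(href)), nil
	}
	return "", fmt.Errorf("unsupported hint %q", rel)
}

func main() {
	for _, rel := range []string{"prefetch", "prerender"} {
		tag, err := linkHint(rel, "/my-next-page.html")
		if err != nil {
			panic(err)
		}
		fmt.Println(tag)
	}
}
```

Since a wrong prerender guess costs the visitor bandwidth and CPU, it is worth gating the call on whatever signal tells you the next page is very likely.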
This technique is by far the most controversial and the riskiest of the three. Prerendering a page should only be done when there is a high confidence that the user will go to that page next. The most well known example of this is Google Search, which will prerender the first result of the page if the confidence is high enough. I found only 95 examples of this in my crawl of the Alexa Top 100k sites. Although this technique is clearly not for every use case, I think many more sites could leverage this to improve the user experience.
Prefetching in general is often a controversial topic. Many people argue that it is not efficient and leads to a waste in bandwidth. It also uses client resources unnecessarily (most notably on mobile devices). Also worth mentioning is that in some cases prefetching or prerendering of pages can have adverse effects on analytics and log tracking, since there is no obvious way to tell a user actually visiting (and seeing) the page apart from the browser prerendering it without the user’s knowledge.
Despite all of these cautions, prefetching can be a huge win. The fastest request is always the one we never have to make and getting as much into the cache as possible is the best way to make that happen. By making these expensive requests when the user is not waiting on them, we can greatly improve the perceived performance of even the slowest sites on the slowest networks. If you’re not already doing so, it’s worth trying these techniques on your site. The results will vary, so be sure to use Real User Measurement (e.g. Torbit) to find out how much of an improvement prefetching makes for you.
Today, Patrick Meenan (the mastermind behind WebPagetest) wrote a great blog post titled Motivation and Incentive. In this post, Patrick discussed his favorite performance article from 2012 – an article by Kyle Rush about the A/B testing that took place on Obama’s campaign site during the 2012 election. Patrick went on to talk about the importance of having the right incentives in place to make change happen.
I’ve been talking a lot lately about the misaligned incentives of CDNs. I was pleased to see Patrick agrees:
Maybe it’s my tinfoil hat getting a bit tight, but given that CDNs usually bill you for the number of bits they serve on your behalf, it doesn’t feel like they are particularly motivated to make sure you are only serving as many bits as you need to. Things like always gzipping content where appropriate is one of the biggest surprises. It seems like a no-brainer but most CDN’s will just pass-through whatever your server responds with and won’t do the simple optimization of gzipping as much as possible (most of them have it as an available setting but it is not enabled by default).
Certainly you don’t want to be building your own CDN but you should be paying very careful attention to the configuration or your CDN(s) to make sure the content they are serving is optimized for your needs.
If you haven’t read his post yet, be sure to check it out. He hits on a lot of important topics as he explores the root motivations behind providers across the entire tech stack.
Over the last few years, I’ve heard countless stories from customers who have been surprised to learn that their CDN didn’t have Gzip enabled. It sounds crazy since Gzip is one of the simplest tricks to make your site load faster. CDNs promise to make your site faster, but they also charge by the byte.
Think about that.
CDNs make more money when they serve larger files, more frequently.
Why are CDNs not doing more to enable compression for their customers? Sadly, as it often turns out, to find the answer you simply need to follow the money. The larger the files you send, the more money your CDN makes. This puts their business goals directly at odds with their marketing that says they want to help make your website fast.
I decided to dig into the data to see how wide-spread the problem is. I recently shared the results on the Performance Calendar which is a great resource for anyone who cares about performance. If you use a CDN or have ever questioned if you’re getting the performance you’re paying for, I hope you’ll check out the full article.
Here at Torbit we’re always working to make our Insight tool more useful and more accessible. The performance of your site is critical for your site’s success as we’ve seen time and again the strong correlation between speed and user engagement. Knowing just how important speed is to your business has always been a core part of our product. Tracking your performance is critical for everyone in the organization from the frontend developers to the CEO, and now we’ve made it easier than ever to have everyone involved in monitoring your site’s performance.
We’re happy to announce that you can now have multiple users in your Torbit account. This allows everyone in the company to have their own login and still access all the critical performance data Torbit provides. To add new users to your account simply click Account -> Manage Accounts -> Settings and find the “Add a User” section. Here you simply provide the new user’s email address and, optionally, their job title. They’ll then be invited to create a Torbit account or immediately added to your account if they’re already a Torbit user. You can see what this looks like in the screen shot below.
This has been one of our most requested features and we’re excited to make it available. We hope you’ll take this chance to share your performance data with your entire team.
Two years ago today, Jon and I met at Old Chicago and committed to each other to start Torbit. It was a few weeks later when we formally signed the paperwork, but we think of October 10th as our company birthday. Our founding date has special significance to us. We started the company on 10/10/10. In binary, that’s 42. If you don’t get the significance, just Google “the answer to life, the universe and everything”.
We’ve come a long way over the last two years. We’ve moved the company from Boulder, Colorado to Sunnyvale, California. And more recently, we’ve found a home in San Mateo. The team has grown. We’ve launched new products. We’ve raised money from some incredible investors. More importantly, we’ve helped thousands of websites measure and improve their site speed. We’ve worked with everyone from multi-billion dollar corporations to top internet retailers to top 100 sites.
In April this year, we launched Torbit Insight. For the first time we made it possible for any website to quickly find out how much website performance is impacting their revenue. In the last few months we’ve seen website performance change from being a technical metric to a business metric. Our users are not just developers and operations guys (although we have plenty of those too!). We have C-level execs and VPs using our data to drive real business decisions for their companies. With Torbit Insight, we’ve helped our customers evaluate their CDN, pick locations for their next data center, catch issues when they deployed bad code and finally quantify how much speed matters for their business. Torbit is used by hundreds of brands you know and we measure billions of pageviews every month. In fact, if you look at our ratio of pageviews per engineer it’s at over 2 billion! Not even the internet giants have a ratio that high.
To appreciate this milestone, it’s good to revisit why we built Torbit in the first place. We started out with an audacious goal. We wanted to make the web faster. We have a clearer sense of purpose today than when we started. In spite of all the improvements in browser technology and higher connection speeds, websites are slower than ever. We’re passionate about making the web faster and it’s that motivation that drives everything we do.
The last two years have been amazing. I am privileged to get to come to work every day with the best team in the world. I couldn’t be more proud of our team and what they have accomplished. I also know that we’re just getting started. Our best days are ahead of us and I couldn’t be more excited to see what the coming years have in store for us. Thanks for tagging along as we do what we love most: helping people go fast!
I finally got around to reading the Steve Jobs biography by Walter Isaacson that’s been sitting on my bookshelf for months. It’s a great read and I’ve found myself captivated by the stories and lessons that can be found in Steve Jobs’ life. One story in particular jumped out at me:
One day Jobs came into the cubicle of Larry Kenyon, an engineer who was working on the Macintosh operating system, and complained that it was taking too long to boot up. Kenyon started to explain, but Jobs cut him off. “If it could save a person’s life, would you find a way to shave ten seconds off the boot time?” he asked. Kenyon allowed that he probably could. Jobs went to a whiteboard and showed that if there were five million people using the Mac, and it took ten seconds extra to turn it on every day, that added up to three hundred million or so hours per year that people would save, which was the equivalent of at least one hundred lifetimes saved per year. “Larry was suitably impressed, and a few weeks later he came back and it booted up twenty-eight seconds faster,” Atkinson recalled. “Steve had a way of motivating by looking at the bigger picture.”
At Torbit, we believe that speed really matters. We have a simple, but audacious goal. We think the internet is too slow and we’re doing our best to fix it. It’s humbling to think about the collective amount of time (and lives) we’ve already helped save. It’s the reason why we founded this company. It’s the motivation behind what we do every day.
Last week our CEO, Josh Fraser, gave a presentation at the San Francisco Web Performance Meetup cleverly titled “Yo ho ho and a few billion pageviews of RUM” – quite relevant for today, International Talk Like a Pirate Day. If you have some spare time, it’s definitely worth watching! (video and slides). In preparation for the talk, Josh and I gathered some intriguing statistics using the terabytes of data Torbit has collected in the last 4 months. Much of that data is listed below on a categorical basis using a sample of 1,000 sites representing 6.7 billion pageviews.
Frontend vs Backend
As a developer, I spend a good deal of time making my backend code efficient. While that certainly does matter, the vast majority of time users spend waiting is due to frontend loading. Steve Souders’ Golden Rule for Performance states that 80-90% of the end-user response time is spent on the frontend. Across Torbit’s data, that number is actually 93%. We measure frontend vs backend timing based on “time to first byte” (TTFB) and on average 7% of load time is spent on the backend compared to a whopping 93% on the frontend.
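As a sketch of how such a split can be computed for a single pageview, the Go function below treats everything up to the first byte as backend time and the rest of the load as frontend time. The function name and the sample timings are made up for illustration; they are not taken from our data set.

```go
package main

import "fmt"

// split returns the backend and frontend shares of a page load,
// treating time to first byte (TTFB) as the backend portion.
func split(ttfb, onload float64) (backend, frontend float64) {
	if onload <= 0 {
		return 0, 0 // no meaningful split without a positive load time
	}
	backend = ttfb / onload
	return backend, 1 - backend
}

func main() {
	// Hypothetical pageview: first byte after 0.21s, onload after 3.0s.
	backend, frontend := split(0.21, 3.0)
	fmt.Printf("backend=%.0f%% frontend=%.0f%%\n", backend*100, frontend*100)
}
```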
The following values are for the onload time across our sample set.
Geometric Mean: 2.19s
90th Percentile: 10.38s
95th percentile: 16.86s
99th percentile: 43.73s
You’ll see with our mobile data, everything is shifted to the right on the histogram. This is caused by a myriad of reasons including slower processors, latency and slower connections with Edge, 3G, etc.
Geometric Mean: 3.12s
90th Percentile: 12.07s
95th percentile: 18.11s
99th percentile: 44.42s
Taking a closer look at latency, the average response transfer time (time from first byte to last byte of the HTML response) is 0.30s from desktop browsers and 1.30s on mobile browsers. That’s over 4 times slower! The most important thing you can do to improve your performance on mobile is to reduce the number of requests that you make.
There are many factors that impact the performance experienced by end users in varying locations. When it comes to load times on the web, geography matters a lot.
Where’s the US? The US is the 22nd fastest country. Hopefully Google Fiber will help US cable companies to get with the times.
By US State
Or if you prefer a visualization…
As you can see, the southern and rural states are the slowest, not too terribly surprising.
Of cities we have at least 100,000 data points for, below are the fastest and slowest.
Johannesburg, South Africa
University Park, USA
By US City
Slowest US Cities
Fastest US Cities
University Park, PA
College Park, MD
Notre Dame, IN
Stony Brook, NY
Princeton Junction, NJ
I can’t see a good reason why Independence, Ohio would be so quick, but what most of the fastest cities have in common is a major university (Penn State, University of Maryland, Notre Dame, Stanford, etc.).
Safari as the quickest browser might be a little shocking… As Josh mentioned in his talk, it’s hard to tell what specifically leads to the faster speeds – using Safari typically means they’re on a Mac, a pricier (likely higher performance) machine, and can afford higher speed internet.
Chrome on Android
Read into that as you wish…
Bounce Rate (Desktop)
Load time plays a huge role in bounce rate. Not even 10 years ago most people were used to waiting 10-30 seconds for a page to load on dial-up. These days pages are expected to load right away, or the consumer will lose interest. The graph above is more proof of that fact.
Bounce Rate (Mobile)
Mobile devices suffer the same fate, just shifted to the right some. An interesting thing to note is how high the bounce rate is for pages that load in one second. With a bit of context behind the graph, the reason for that value is that typically the only pages that will load in 1 second are error pages.
Not only are consumers more likely to leave your page due to slow load times, they’re much less likely to be engaged users. I know if I have to wait 6 or 7 seconds for a page to load, I’m not going to stick around that site for long – this graph proves most people have that same mindset. Notice that engagement is doubled by reducing onload time from 6 seconds to 2 seconds.
If your site loads in 7 seconds on average, clearly that means you should add in a few more requests to bump that up to 9 seconds… In all seriousness though, mobile shows the same overall trend of less engagement with higher load times as expected. It has been suggested that the bimodal nature of this graph with the bump at 9 seconds might represent the difference between pages viewed on 3G versus wifi. In other words, perhaps we’re more patient if we know we’re on 3G and 9 seconds feels more reasonable to us.
Hopefully these statistics are of value to you, or at least somewhat entertaining. If you would like to see how your own site fares, sign up for Torbit Insight where we provide all these statistics and more!