Why do so many companies write a homegrown pageviews tracking system? Between Google Analytics, Kissmetrics and many others, isn't that a completely solved problem?
These popular solutions lack domain knowledge. They are easily capable of segmenting users by region or browser, but they fail to recognize rules core to your business. Tracking pageviews with a homegrown system becomes your next sprint's goal.
Implementing a hit counter service is quite tricky. This is a write-heavy, asynchronous problem that must minimize impact on page rendering time, while dealing with rapidly growing amounts of data. Is there a middle ground between using Google Analytics and rolling out our own homegrown implementation? How can we use Google Analytics for data collection and inject domain knowledge into gathered data, incrementally, without writing our own service?
Let's write a Rake task that pulls data from Google Analytics. We can run it daily. Start with a Ruby gem called Garb.
Garb requires Google Analytics credentials. Those can go into a YAML configuration file that is also an ERB template, so production can read its values from environment settings while the test account values are hardcoded.
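Here is a minimal sketch of loading such a file with nothing but Ruby's standard library. The key names (email, password, ua), the GA_* environment variable names and the inline template are assumptions for illustration; a real application would read the template from a file under config/.

```ruby
require "erb"
require "yaml"

# Hypothetical config template: the test values are hardcoded, while
# production pulls its values from environment variables via ERB tags.
GA_CONFIG_TEMPLATE = <<~CONFIG
  test:
    email: "test@example.com"
    password: "password"
    ua: "UA-12345678-1"
  production:
    email: "<%= ENV['GA_EMAIL'] %>"
    password: "<%= ENV['GA_PASSWORD'] %>"
    ua: "<%= ENV['GA_UA'] %>"
CONFIG

# Renders the ERB tags first, then parses the result as YAML and
# returns the settings hash for the requested environment.
def load_ga_config(template, environment)
  rendered = ERB.new(template).result
  YAML.safe_load(rendered).fetch(environment)
end
```

Calling load_ga_config(GA_CONFIG_TEMPLATE, "production") returns a hash with string keys, and fetch raises a KeyError if the environment is missing from the file.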
Establish a Google Analytics session and fetch the profile corresponding to the Google user account with Garb.
Garb needs a data model to collect pageviews. It extends Garb::Model and defines a set of "metrics" and "dimensions".
You can play with the Google Analytics Query Explorer to see the many possible metrics (such as pageviews) and dimensions (such as requested page path).
By default, Google Analytics lets clients retrieve 1000 records in a single request. To get all records we can add an iterator, called all, that will keep making requests until the server runs out of data. The code for config/initializers/garb_model.rb is in this gist and I made a pull request into Garb if you'd rather merge that onto your fork.
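The paging idea can be sketched independently of Garb. This is not the code from the gist: fetch_all is a hypothetical helper that takes a block standing in for a single API request, and it assumes a 1-based start index like the start-index parameter of the Google Analytics reporting API.

```ruby
# Keeps requesting pages until a page comes back shorter than the page
# size, which signals that the server has run out of data.
# The block receives (start_index, limit) and must return an array.
def fetch_all(page_size: 1000)
  records = []
  start_index = 1 # assumption: the API counts records from 1

  loop do
    page = yield(start_index, page_size)
    records.concat(page)
    break if page.size < page_size

    start_index += page_size
  end

  records
end
```

With 2500 records on the server this makes three requests (1000, 1000 and 500 records). When the total is an exact multiple of the page size, one extra request returns an empty page and ends the loop.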
The majority of our pages are in the form of "/model/id", for example "/artwork/leonardo-mona-lisa". We're interested in all pageviews for a given artwork and in pageviews for a given artist, on a given date. We'll store selected Google Analytics data in a GoogleAnalyticsPageviewsRecord model described further.
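To make the "/model/id" idea concrete, here is a rough sketch in plain Ruby rather than the original model. The PageviewsRecord struct, the row keys (:page_path, :date, :pageviews) and the rule that sub-pages count toward their parent artwork or artist are all assumptions.

```ruby
require "date"

# Hypothetical stand-in for a stored record: total pageviews for one
# model ID on one date.
PageviewsRecord = Struct.new(:model, :model_id, :date, :pageviews)

# Matches "/artwork/<id>" and "/artist/<id>", ignoring anything after the ID.
MODEL_PATH = %r{\A/(artwork|artist)/([^/?#]+)}

# Sums the pageviews of all rows that belong to the same model, ID and date.
# Rows whose path does not look like "/model/id" are skipped.
def build_records(rows)
  totals = Hash.new(0)

  rows.each do |row|
    match = MODEL_PATH.match(row[:page_path])
    next unless match

    totals[[match[1], match[2], row[:date]]] += row[:pageviews]
  end

  totals.map do |(model, model_id, date), pageviews|
    PageviewsRecord.new(model, model_id, date, pageviews)
  end
end
```

Given rows for "/artwork/leonardo-mona-lisa" and "/artwork/leonardo-mona-lisa/zoom" on the same date, build_records returns a single record whose pageviews field is the sum of both.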
GoogleAnalyticsPageviewsRecord contains the total pageviews for a given model ID at a given date. We now have a record for each artwork and artist. We can roll up existing data into a set of collections, incrementally. For example, google_analytics_artworks_monthly will contain the monthly hits for each artwork.
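The rollup itself can be illustrated in memory. The sketch below is an assumption about the shape of the computation, not the original implementation: rollup_monthly is a hypothetical helper, and records are plain hashes with :model_id, :date and :pageviews keys.

```ruby
require "date"

# Adds daily records into monthly totals keyed by [model_id, first day of
# the month]. Passing the result of a previous run as `into` makes the
# rollup incremental: only the new daily records need to be processed.
def rollup_monthly(records, into: Hash.new(0))
  records.each do |record|
    month = Date.new(record[:date].year, record[:date].month, 1)
    into[[record[:model_id], month]] += record[:pageviews]
  end

  into
end
```

A daily job would call rollup_monthly with that day's records and the totals accumulated so far, so earlier days are never recomputed.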
The rollup lets us query these tables directly. For example, the following query returns a record with the pageviews for Leonardo's "Mona Lisa" in January 2012.
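As a stand-in for a database query, this sketch performs the same lookup against an in-memory array. The row shape, the monthly_pageviews helper and the sample numbers are made up for illustration; only the artwork ID and the month come from the text.

```ruby
require "date"

# Illustrative monthly rollup rows. The pageview counts are invented.
MONTHLY_ARTWORK_HITS = [
  { model_id: "leonardo-mona-lisa", month: Date.new(2011, 12, 1), pageviews: 900 },
  { model_id: "leonardo-mona-lisa", month: Date.new(2012, 1, 1), pageviews: 1200 },
  { model_id: "some-other-artwork", month: Date.new(2012, 1, 1), pageviews: 50 }
].freeze

# Returns the rollup row for one artwork in one month, or nil if none exists.
def monthly_pageviews(rows, model_id, year, month)
  wanted = Date.new(year, month, 1)
  rows.find { |row| row[:model_id] == model_id && row[:month] == wanted }
end

record = monthly_pageviews(MONTHLY_ARTWORK_HITS, "leonardo-mona-lisa", 2012, 1)
```

Because the data is already aggregated per artwork and month, the lookup touches a single row instead of summing daily records at query time.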
One of the obvious advantages of pulling Google Analytics data is the low volume of requests and offline processing. We're letting Google Analytics do the hard work of collecting data for us in real time and are consuming its API without the performance or time pressures.