A few years ago I was asked to make our analytics data available to our customers: give them the tools they need to optimize their websites for SEO and to check how many of their efforts converted into deals, leads, etc.
We had a few options (like piwik.org), but we already had a few years of data inside Google Analytics, and we wanted to integrate the reports with our own software, making it a one-stop shop for our customers.
After attempting to retrieve the data on the fly with the Google Analytics API, we realized we might get more requests than we had anticipated for more data and more detailed reports, so we decided to import the data into our database, create a bunch of indexed views to ease the pain on the database (and on ourselves), and give the customers what they want.
And so, a program was born.
I'll describe the stages briefly so you can get your own data, but I won't go into our implementation for obvious business reasons (though by the time this article was written, we had stopped using Analytics).
First, a few notes about the Google Analytics API limits: 50,000 requests per project per day, 10 queries per second per IP, 10,000 requests per profile per day, and 10 concurrent requests per profile.
Sounds pretty high, no?
But imagine the 11th person hitting your reporting page and getting an error, or imagine your 10,000 users have been through your reporting page and someone wants to make the 10,001st request. From my experience, SEO people will refresh the page as often as they can to see if anything changed in the last 3 seconds.
Of course we could cache the results for X amount of time, but why not store the entire dataset on our own servers and serve it from there?
So how do I get my program to respect Google's limits?
1. For each request, retrieve the largest amount of data the API allows. It makes each request slower, but it also saves requests, making those 10,000 per profile count for more.
2. Write a method that checks whether I've reached the 10 concurrent requests per IP and delays the next request until one of them finishes.
3. Extend point 2 to also check whether I've made fewer than 10 requests in the past second; if I'm at the limit, wait a second.
If you're not implementing a multithreaded application, the 10-per-second and 10-concurrent-per-IP limits are irrelevant to you.
So let's start with a limiter. I've implemented a class which does the limiting job; it's a combination of a semaphore and a temporal semaphore.
Limiter.cs
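Here's a minimal sketch of the idea, assuming .NET's SemaphoreSlim; this is a reconstruction rather than the original class. A plain semaphore caps the in-flight requests, and a queue of timestamps enforces the per-second rate.

using System;
using System.Collections.Generic;
using System.Threading;

public class Limiter : IDisposable
{
    private readonly SemaphoreSlim _concurrent;                        // caps in-flight requests (the "semaphore" part)
    private readonly Queue<DateTime> _recent = new Queue<DateTime>();  // timestamps of the last second's requests (the "temporal" part)
    private readonly int _perSecond;
    private readonly object _lock = new object();

    public Limiter(int maxConcurrent, int maxPerSecond)
    {
        _concurrent = new SemaphoreSlim(maxConcurrent, maxConcurrent);
        _perSecond = maxPerSecond;
    }

    // Call before each API request; blocks until both limits allow it.
    public void Acquire()
    {
        _concurrent.Wait();
        while (true)
        {
            TimeSpan wait;
            lock (_lock)
            {
                DateTime cutoff = DateTime.UtcNow.AddSeconds(-1);
                while (_recent.Count > 0 && _recent.Peek() < cutoff)
                    _recent.Dequeue();                                 // forget requests older than one second
                if (_recent.Count < _perSecond)
                {
                    _recent.Enqueue(DateTime.UtcNow);
                    return;
                }
                wait = _recent.Peek().AddSeconds(1) - DateTime.UtcNow;
            }
            if (wait > TimeSpan.Zero)
                Thread.Sleep(wait);                                    // wait until the oldest timestamp ages out
        }
    }

    // Call after the request completes (ideally in a finally block).
    public void Release()
    {
        _concurrent.Release();
    }

    public void Dispose()
    {
        _concurrent.Dispose();
    }
}

Usage is the usual acquire/release pair around each API call: create new Limiter(10, 10) to match Google's limits, then limiter.Acquire(); try { /* query Analytics */ } finally { limiter.Release(); }.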
Then I've added the main program. First it creates an AnalyticsService and sets the authorization (GDataCredentials).
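Roughly like this, based on the old GData client library (Google.GData.Analytics / Google.GData.Client); the application name and credentials here are placeholders:

using Google.GData.Analytics;
using Google.GData.Client;

// The string identifies your application to Google; any descriptive name will do.
AnalyticsService service = new AnalyticsService("MyCompany-AnalyticsImport-1.0");

// ClientLogin-style credentials, as the old GData library uses.
service.Credentials = new GDataCredentials("user@example.com", "password");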
Then we retrieve all the profiles/accounts with an AccountQuery.
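For example, continuing from the service above (the feed URL and member names are from my memory of the GData library, so treat them as assumptions):

// The accounts feed lists every profile the authenticated user can access.
AccountQuery accountQuery = new AccountQuery("https://www.google.com/analytics/feeds/accounts/default");
AccountFeed accounts = service.Query(accountQuery);

foreach (AccountEntry entry in accounts.Entries)
{
    // Each entry carries the profile's title, its table ID ("ga:xxxxxxxx",
    // which DataQuery expects in its Ids property) and properties such as the time zone.
    System.Console.WriteLine(entry.Title.Text);
}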
Then we determine the time zone that profile is using; it's important if your application is serving multiple time zones, so everyone gets a consistent time. Analytics stores and serves all dates and hours in the profile's time zone. I'm using PublicDomain to process TzTimeZone, as tz database support is not part of the .NET framework.
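A sketch of the conversion, assuming PublicDomain's TzTimeZone API (the method names are from memory, so check the library for the exact signatures):

using System;
using PublicDomain;

// The profile's time zone name comes from the accounts feed (its ga:timezone property).
TzTimeZone profileZone = TzTimeZone.GetTimeZone("America/New_York");

// Analytics reports dates in the profile's local time; normalizing to UTC before
// storing keeps profiles from different time zones consistent with each other.
DateTime localMidnight = new DateTime(2012, 8, 1, 0, 0, 0);
DateTime utc = profileZone.ToUniversalTime(localMidnight);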
After that, we're going to retrieve the records with a DataQuery. Analytics uses a combination of Metrics and Dimensions to store data. Think of Dimensions as the "group by" section in a SQL query and of Metrics as the select section.
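For example, a query roughly equivalent to SELECT date, source, SUM(visits), SUM(pageviews) ... GROUP BY date, source. The profile ID is a placeholder, and the property names follow the old GData DataQuery as I remember it:

DataQuery query = new DataQuery("https://www.google.com/analytics/feeds/data");
query.Ids = "ga:12345678";                // placeholder table ID, taken from the accounts feed
query.Dimensions = "ga:date,ga:source";   // the "group by" columns
query.Metrics = "ga:visits,ga:pageviews"; // the aggregated "select" columns
query.GAStartDate = "2012-08-01";
query.GAEndDate = "2012-08-02";
query.NumberToRetrieve = 10000;           // grab as many rows per request as allowed (point 1 above)

DataFeed feed = service.Query(query);
foreach (DataEntry entry in feed.Entries)
{
    // Each entry is one "row": its dimension values plus the metric totals.
}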
You can find the reference here: https://developers.google.com/analytics/devguides/reporting/core/dimsmets
And there's a cool tool called the Google Analytics Query Explorer, in which you can execute queries and see the returned data immediately.
A few more thoughts which might help you implement your own tools:
1. Implement a timeout on method execution (sketched below); I've noticed that from time to time some of the methods tend to freeze.
2. Implement a retry on method execution (also sketched below).
3. Implement a sort of sync: every day, read the data from yesterday until today, so everything stays in sync despite time zone differences.
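For points 1 and 2, a minimal sketch; the helper names (ExecuteWithTimeout, ExecuteWithRetry) are my own, not from the original project:

using System;
using System.Threading;
using System.Threading.Tasks;

public static class Reliability
{
    // Point 1: run func on another thread and give up if it doesn't finish in time.
    // Note the timed-out call keeps running in the background; that's usually
    // acceptable for a background importer.
    public static T ExecuteWithTimeout<T>(Func<T> func, TimeSpan timeout)
    {
        Task<T> task = Task.Factory.StartNew(func);
        if (!task.Wait(timeout))
            throw new TimeoutException("The operation did not complete in time.");
        return task.Result;
    }

    // Point 2: retry func up to maxAttempts times, pausing between attempts.
    public static T ExecuteWithRetry<T>(Func<T> func, int maxAttempts, TimeSpan delay)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return func();
            }
            catch (Exception)
            {
                if (attempt >= maxAttempts)
                    throw;
                Thread.Sleep(delay);
            }
        }
    }
}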
You can find the demo project here:
https://github.com/drorgl/ForBlog/tree/master/GoogleAnalyticsData/GoogleAnalyticsDemo