Link to part 1
Link to part 2
Link to part 3

Entering data in SimpleDB

Below is the script I wrote to parse the logs and make entries in SimpleDB.


#!/usr/bin/env ruby
require 'rubygems'
require 'file/tail'
require 'right_aws'
require 'logger'
require 'geoip'

g = GeoIP.new('GeoIP.dat')

$logger = Logger.new("sdb.log")
# Connection for the SimpleDB interface (replace the two placeholders with your own keys)
@sdb = RightAws::SdbInterface.new('AWS Access Key', 'AWS Secret Key', {:multi_thread => true, :logger => $logger})

@alltimehash = Hash.new(0) # All-time count per country
@h = Hash.new(0)           # Count per country for the current hour
@lasthr = nil

# Function for entering data in SimpleDB
def enter_in_sdb(date, country)
  # Update the all-time count and store it in the 'Analytics' domain
  @alltimehash[country] += 1

  attributes = {}
  attributes['date'] = date
  attributes['count'] = @alltimehash[country]
  @sdb.put_attributes('Analytics', country, attributes, true)

  # Reset the hourly counts whenever the hour changes
  hr = date.split(":")[0]
  if hr != @lasthr
    @h.clear
    @lasthr = hr
  end

  @h[country] += 1

  # Hourly counts go to a per-hour domain, e.g. "Analytics" + hour
  domain_name = "Analytics" + hr.to_s
  attributes['count'] = @h[country]
  @sdb.put_attributes(domain_name, country, attributes, true)
end

File.open('haproxy.log') do |log|
  log.extend(File::Tail)
  log.interval = 0
  log.backward(10)
  loop do
    begin
      log.tail do |line|
        # Process each line to get timestamp and ip (parsing step omitted here)
        country = g.country(ip)[4]
        enter_in_sdb(timestamp, country)
      end
    rescue StandardError => e
      puts "Met an exception"
      puts e.inspect
      next
    end
  end
end

This script writes to SimpleDB after processing each line. Unfortunately, our load balancer is not in the Amazon cluster, so writing to SimpleDB wasn't very fast. SlideShare has huge traffic, and haproxy.log was growing much faster than we could log data in SimpleDB. After 1 hour of running, we were already lagging by 50 minutes in making entries in SimpleDB.

To solve this problem, we decided to batch the data. We started collecting counts in hashes for 5 minutes, and after 5 minutes (or when the hour digit changed) we put the data in SimpleDB. In this manner we were able to log analytics data in SimpleDB with just a 5-minute lag. You can tune this 5-minute interval to your requirements. If we were in the Amazon cluster, we could have reduced this lag to a few seconds or even to zero.
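The batching idea can be sketched roughly as follows. This is only an illustration, not the code we run in production: the class name BatchedHourlyCounter, its parameters, and the injectable clock are all hypothetical, and the writer block stands in for the put_attributes calls shown in the script above. It also assumes that each write replaces the stored count with the running total for the hour, as the original script does.

```ruby
# Hypothetical sketch of batching: collect counts in a hash and hand them to a
# writer block every flush_interval seconds, or as soon as the hour changes.
class BatchedHourlyCounter
  def initialize(flush_interval: 300, clock: -> { Time.now.to_f }, &writer)
    @flush_interval = flush_interval # seconds between writes (5 minutes)
    @clock = clock                   # injectable so the sketch can be tested
    @writer = writer                 # called as writer.call(hour, counts)
    @counts = Hash.new(0)            # running totals for the current hour
    @hour = nil
    @dirty = false                   # true when there are unwritten hits
    @last_flush = @clock.call
  end

  # Record one hit for the given hour and country.
  def add(hour, country)
    if @hour && hour != @hour
      flush          # write the final totals for the hour that just ended
      @counts.clear  # start the new hour from zero
    elsif @clock.call - @last_flush >= @flush_interval
      flush          # interval elapsed: write the running totals so far
    end
    @hour = hour
    @counts[country] += 1
    @dirty = true
  end

  # Pass a snapshot of the current totals to the writer, if anything changed.
  def flush
    @writer.call(@hour, @counts.dup) if @dirty
    @dirty = false
    @last_flush = @clock.call
  end
end

# Demo with a fake clock; a real writer would call @sdb.put_attributes
# once per country instead of appending to an array.
now = 0.0
written = []
counter = BatchedHourlyCounter.new(flush_interval: 300, clock: -> { now }) do |hour, counts|
  written << [hour, counts]
end

counter.add("10", "IND")
counter.add("10", "USA")
now = 301.0
counter.add("10", "IND") # 5 minutes passed: totals so far are written first
counter.add("11", "IND") # hour changed: final totals for hour "10" are written
counter.flush            # write whatever is left
```

With this shape, the number of SimpleDB round trips depends on the flush interval and the number of countries seen, not on the number of log lines, which is why the lag stays bounded by the interval.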

In the next (and last) part, we will see the consumer code and the graphs produced.
