At SlideShare, we had a task to batch-process around 10 million old slideshows. These slideshows were processed in parallel, and we were supposed to maintain logs so we could debug anything that failed during the process. Each slideshow has an ID associated with it, and these IDs ranged from 1 to 13 million, which means there were some deleted slideshows in between.
Everything went fine in the early stages. We processed slideshows 1 to 100,000 without any problem. When an engineer declared that the first 1 million were done, we were happy that the 1-million mark had been reached well before the anticipated time. During verification, however, we noticed that around 300,000 slideshows had not been processed. To make matters worse, these slideshows were not in a continuous range but were spread almost evenly between 100,000 and 1 million.
No matter how good the logs we captured were, it was not easy to parse them and build a list of unprocessed slideshows. Worse, we captured really detailed logs, so parsing them was a nightmare. We also learnt that these 300,000 slideshows had been missed, not failed, so finding them in the DB wasn't an option.
When we wrote this script, we did one thing right. The script took a start and an end slideshow ID as arguments and processed the slideshows between them. We created a separate log file for each script instance (for each batch) and appended the start and end slideshow_id to the log file name, following the nomenclature 'batch_process_start#_end#.log'. This also gave us the flexibility to run the script in parallel for small batches.
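To make the pattern concrete, here is a minimal sketch of such a batch script in Python. The process_slideshow() function is a hypothetical stand-in for the actual per-slideshow work (our real script was different); only the argument handling and the per-batch log file naming reflect what the post describes.

```python
import logging
import sys

def process_slideshow(slideshow_id):
    # Hypothetical placeholder for the real per-slideshow processing.
    pass

def main():
    # Start and end slideshow IDs come in as command-line arguments.
    start_id, end_id = int(sys.argv[1]), int(sys.argv[2])
    # One log file per batch, with the ID range embedded in the name:
    # batch_process_start<start>_end<end>.log
    log_name = f"batch_process_start{start_id}_end{end_id}.log"
    logging.basicConfig(filename=log_name, level=logging.INFO)
    for slideshow_id in range(start_id, end_id + 1):
        try:
            process_slideshow(slideshow_id)
            logging.info("processed %d", slideshow_id)
        except Exception:
            logging.exception("failed on %d", slideshow_id)

if __name__ == "__main__":
    main()
```

Because each invocation owns its own log file, any number of instances can run in parallel without their logs interleaving.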
When we identified the missing-300,000-slideshows problem, we arranged the log files in order of their start ID. When we drew out the sequence from start_id to end_id for each log file, we noticed gaps between the end_id of one batch and the start_id of the next. There were many batches with such gaps.
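The gap check itself never has to open a log file; the filenames carry everything needed. A sketch of that check, assuming the naming scheme above:

```python
import glob
import re

# Pull (start_id, end_id) out of each batch log filename and sort by start.
pattern = re.compile(r"batch_process_start(\d+)_end(\d+)\.log")
ranges = sorted(
    (int(m.group(1)), int(m.group(2)))
    for f in glob.glob("batch_process_start*_end*.log")
    if (m := pattern.search(f))
)

# Any hole between one batch's end and the next batch's start is a
# range of slideshows that no instance of the script ever covered.
for (_, prev_end), (next_start, _) in zip(ranges, ranges[1:]):
    if next_start > prev_end + 1:
        print(f"missing batch: {prev_end + 1} to {next_start - 1}")
```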
Once we identified all the missing batches, we ran our script for just those batches. After completion, when we rearranged the log files as before, we saw no gaps between the end_id of one batch and the start_id of the next. And when we verified whether all the data had been processed, we found that it had.
This naming convention saved us from having to parse log files to identify which ranges were missing. So it's not just which logs you capture that helps; how you save those logs matters too.
A good practical example of the get-things-done mindset, even though the approach is non-strategic. I mean, shouldn't we debug why the core algorithm was missing some IDs in the first pass, rather than taking a problem-fix, problem-fix... approach?
Are we sure about the data consistency, or that the current fix won't lead to other problems in the future?
Parallel processing is a challenging programming skill. Good that you are working on these things...
Thanks Venkata for your feedback. We used to run this script manually with a start and an end ID, and all the slideshows in a given range were processed normally, so we were sure the script itself wasn't skipping slideshows.
Also, this was a one-off script, so spending time debugging it wasn't something we wanted. And this strategy worked well for us to identify the problems quickly.