At SlideShare, we had a task to batch-process around 10 million old slideshows. The slideshows were processed in parallel, and we were supposed to maintain logs so we could debug anything that failed along the way. Each slideshow has an id associated with it, and the ids ranged from 1 to 13 million, which means there were deleted slideshows in between.
Everything went fine in the early stages. We processed the first 100 thousand slideshows without any problem. When an engineer declared that the first 1 million were done, we were happy that the 1 million mark had been reached well before the anticipated time. During verification, however, we noticed that around 300 thousand slideshows had not been processed. To make matters worse, these slideshows were not in a continuous range but spread almost evenly between 100,000 and 1 million.
No matter how good our logs were, it was not easy to parse them and build a list of unprocessed slideshows. Worse, we had captured really detailed logs, which made parsing them a nightmare. We also learnt that these 300 thousand slideshows had been missed, not failed, so finding them in the DB wasn't an option either.
When we wrote this script, we did one thing right: the script took start and end slideshow ids as arguments and processed the slideshows between them. We created a separate log file for each script instance (one per batch) and appended the start and end slideshow_id to the log file name, following the nomenclature 'batch_process_start#_end#.log'. This also gave us the flexibility to run the script in parallel for small batches.
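The article doesn't show the script itself, but the shape it describes can be sketched like this (a minimal illustration, assuming Python; `process_slideshow` and the module layout are hypothetical, only the filename pattern comes from the article):

```python
# Sketch of a batch-processing script that takes a start and end id
# and writes its own log file named after the range it covers.
import argparse
import logging

def process_slideshow(slideshow_id):
    # Placeholder for the real per-slideshow work (hypothetical).
    pass

def log_filename(start_id, end_id):
    # Nomenclature from the article: batch_process_start#_end#.log
    return f"batch_process_{start_id}_{end_id}.log"

def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("start_id", type=int)
    parser.add_argument("end_id", type=int)
    args = parser.parse_args(argv)

    # One log file per batch, so each parallel run is self-describing.
    logging.basicConfig(filename=log_filename(args.start_id, args.end_id),
                        level=logging.INFO)

    for slideshow_id in range(args.start_id, args.end_id + 1):
        try:
            process_slideshow(slideshow_id)
            logging.info("processed slideshow %d", slideshow_id)
        except Exception:
            logging.exception("failed to process slideshow %d", slideshow_id)
```

A batch would then be launched as, say, `main(["100001", "200000"])` (or via the command line), and each such invocation leaves behind its own `batch_process_100001_200000.log`.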
When we discovered the 300 thousand missing slideshows, we arranged the log files in order of their start_id. Drawing a sequence from start_id to end_id for each log file, we noticed gaps between the end_id of one batch and the start_id of the next. There were many batches with such gaps.
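Because each filename encodes its range, this gap check needs nothing but a directory listing. A small sketch of the idea (the filename pattern is from the article; the function name is illustrative):

```python
# Sketch: detect unprocessed id ranges purely from log filenames.
import re

LOG_NAME_RE = re.compile(r"batch_process_(\d+)_(\d+)\.log")

def find_gaps(filenames):
    """Return (gap_start, gap_end) ranges not covered by any batch log."""
    ranges = []
    for name in filenames:
        m = LOG_NAME_RE.fullmatch(name)
        if m:
            ranges.append((int(m.group(1)), int(m.group(2))))
    ranges.sort()  # arrange batches in order of their start_id

    gaps = []
    for (_, prev_end), (next_start, _) in zip(ranges, ranges[1:]):
        # A gap exists when the next batch doesn't start right
        # after the previous one ends.
        if next_start > prev_end + 1:
            gaps.append((prev_end + 1, next_start - 1))
    return gaps
```

Each gap it reports is exactly a missing batch, so the same (start, end) pairs can be fed straight back into the processing script.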
Once we had identified all the missing batches, we ran our script for just those batches. After completion, we rearranged the log files as before and saw no gaps between the end_id of one batch and the start_id of the next. And when we verified whether all the data had been processed, we found that it had.
This naming convention saved us from parsing log files to identify what was missing. So it's not only what you log that helps; how you save those logs matters too.