Suicide Workers

SlideShare is the world’s largest community for sharing presentations. We process thousands of files in different formats (pdf, ppt, pptx, doc, docx, keynote, odp and many others) daily. At times the number of uploads surges to unexpectedly large volumes, and at other times it stays low (typically around Christmas).

We always want our users to have a smooth experience while sharing anything on SlideShare. This means that even when we have lots of traffic, users shouldn’t experience a delay in the conversion process. How does SlideShare achieve this?

We have different machines and workflows for processing different kinds of document formats, and we run these processes on separate machines. We call them ‘conversion workers’ or ‘factories.’ The key lesson is that with a fixed number of conversion workers, we can’t ensure the same conversion time in both high and low traffic situations. So we need to increase the number of workers when traffic is high and kill them when traffic returns to normal.

This is one of the reasons we chose to go with the cloud: we can get new machines working for us quickly. We chose Amazon Web Services as our cloud provider and use EC2 as the computing power for our conversion factories. We puppetized our conversion factories so that we can bring them into action just by running a script. Even then, we have to wait some time before the machines are ready. So we create AMIs of the stable versions of the different factories to reduce spawning time.

Now what if the upload traffic at SlideShare suddenly goes up significantly? Normally, users would have to wait longer for their jobs to finish. Instead, we create new workers on the fly using those AMIs and Puppet, so that more workers are available to handle the surge. If we kept them running once spawned, a few high-traffic episodes would bankrupt us. So these new workers kill themselves after some time. The lifetime of these ad hoc workers isn’t fixed; it depends on the traffic at that time. That’s why we call them SUICIDE WORKERS.
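To make the idea concrete, here is a minimal sketch of a worker whose lifetime depends on traffic. It is not our production code: the names `run_suicide_worker`, `fetch_job`, `convert`, `terminate` and `max_idle_polls` are all hypothetical, and the rule "shut down after a few consecutive empty polls" is just one simple way a worker could decide that the surge is over.

```python
from typing import Callable, Optional


def run_suicide_worker(
    fetch_job: Callable[[], Optional[str]],   # hypothetical: returns a job, or None if nothing is pending
    convert: Callable[[str], None],           # hypothetical: converts one document
    terminate: Callable[[], None],            # hypothetical: shuts this machine down
    max_idle_polls: int = 3,                  # assumed policy: die after this many empty polls in a row
) -> int:
    """Process jobs until the worker has been idle for too long, then shut down.

    Returns the number of jobs converted before termination.
    """
    idle_polls = 0
    converted = 0
    while idle_polls < max_idle_polls:
        job = fetch_job()
        if job is None:
            # No work available: one step closer to shutting down.
            # A real worker would also sleep here between polls.
            idle_polls += 1
            continue
        # Traffic is still coming in: reset the countdown and keep working.
        idle_polls = 0
        convert(job)
        converted += 1
    # The surge is over, so the worker removes itself from the fleet.
    terminate()
    return converted
```

Because the idle counter resets on every job, a worker spawned during a long surge lives as long as the surge does, while one spawned at the tail end dies almost immediately.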

One may ask why we don’t use the Auto Scaling feature provided by AWS. The simple answer is that our metrics for spawning a machine are different from the ones AWS Auto Scaling uses.

We use SQS to send document data to conversion workers. Conversion workers read data from SQS and start working. So when the length of our SQS queue exceeds a threshold, we know that our current conversion fleet is not enough to handle the increased load.
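The threshold check can be sketched roughly as follows. This is an illustration under assumptions, not our actual spawning logic: `workers_to_spawn`, `jobs_per_worker` and `max_extra_workers` are hypothetical names, and sizing the fleet by dividing the backlog by a per-worker capacity is only one plausible policy. The queue length itself could be read from the queue's `ApproximateNumberOfMessages` attribute.

```python
import math


def workers_to_spawn(queue_length: int, threshold: int,
                     jobs_per_worker: int, max_extra_workers: int) -> int:
    """Decide how many extra workers to start for the current queue length.

    All parameters are illustrative assumptions:
      threshold          queue length the regular fleet can absorb
      jobs_per_worker    backlog one extra worker is expected to clear
      max_extra_workers  hard cap so a surge cannot run up an unbounded bill
    """
    if queue_length <= threshold:
        # The current fleet is keeping up; spawn nothing.
        return 0
    backlog = queue_length - threshold
    # Round up so that a small backlog still gets one worker.
    needed = math.ceil(backlog / jobs_per_worker)
    return min(needed, max_extra_workers)
```

For example, with a threshold of 100 and 20 jobs per worker, a queue of 250 messages leaves a backlog of 150, which rounds up to 8 extra workers.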

Update: You can read the updated version of this blog post, with code, at the SlideShare engineering blog.
