Amazon S3 | Batch processing of image files

At work, we store all of our images on Amazon S3.  This gives us a reliable place to host millions of photos.

We store a couple of different sizes of each image, so the Amazon S3 buckets (think of them as top-level folders) are named after the image sizes we keep.  As a new feature, we started saving a larger size (1680px wide) whenever anyone uploaded a new photo through our site, so that new bucket needed to be backfilled for all previous images using the largest size we had stored up to that point (1200px wide).

As I looked into this, I found that Amazon offers no batch process that lets you send them a large list of keys along with a command to run on those keys.  You have to write a program to do it instead.

Since we had almost 5 million photos to transfer, this batch process needed to be fast, which meant keeping I/O time to a minimum.

The first I/O bottleneck is reading in the list of keys we wanted to transfer from one bucket to the other.  To reduce the time spent pulling these from the database, I did some pre-processing on the data: I ran a select query to extract each image ID and image extension and exported the results to a file (C:\1680pxWide.rpt).  This is basically just a text file containing a table of the image IDs and extensions I wanted to work with.  At the start of my program, I read the entire file into memory rather than reading it line by line, because the file was large (almost 75 MB).  Loading it in a single pass means the hard disk is only hit once; after that, everything is served from memory, which is much faster.
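For illustration, that read-everything-at-once step might look something like this minimal sketch (the file path is the one mentioned above; the class and method names are just placeholders, and parsing each line into an ID and extension is left out):

    using System.IO;

    class KeyFileLoader
    {
        // Read the entire export in one pass so the disk is only touched once;
        // every line after this is served from memory.
        public static string[] LoadKeyLines(string path)
        {
            return File.ReadAllLines(path);   // e.g. @"C:\1680pxWide.rpt"
        }
    }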

Now that I had all the information I needed about the keys I wanted to transfer, I could start the transfer process.  The server this program would run on had many cores, so I wanted to use all of them to speed things up.  The built-in parallel for loop in C# is very useful and easy to use; it abstracts away many of the difficulties of parallel programming over a simple data set.  So I wrapped a parallel loop around the list of input lines from my data file.  Each core on the machine could then process a line in parallel, and since everything was already in memory, it would be very fast.
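A rough skeleton of that loop, assuming the export has already been loaded as above (CopyImage is a hypothetical stand-in for the per-image work described next):

    using System.IO;
    using System.Threading.Tasks;

    class BatchCopy
    {
        static void Main()
        {
            string[] lines = File.ReadAllLines(@"C:\1680pxWide.rpt");

            // Each line describes one image; Parallel.ForEach spreads the
            // iterations across all available cores.
            Parallel.ForEach(lines, line =>
            {
                CopyImage(line);   // hypothetical per-image copy, sketched below
            });
        }

        static void CopyImage(string line)
        {
            // Parse the image ID and extension, then issue the S3 copy request.
        }
    }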

Lastly, for each line I fill in an Amazon S3 copy request with the source and destination bucket/key names I want to act on.  This tells Amazon to copy the object directly inside their network, so the file is never downloaded and re-uploaded, which saves a lot of time.  The request is sent off, and I make sure to output any errors that might have happened.  In hindsight, it would have been better to log those errors to a file.
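A sketch of that copy request, assuming the AWS SDK for .NET (the synchronous CopyObject call shown here is the one available in the .NET Framework builds of the SDK; the bucket names are placeholders, not our real ones):

    using System;
    using Amazon.S3;
    using Amazon.S3.Model;

    class ImageCopier
    {
        // Copies a single object between buckets entirely inside S3;
        // the image bytes never travel through our server.
        public static void CopyImage(IAmazonS3 client, string imageId, string extension)
        {
            var request = new CopyObjectRequest
            {
                SourceBucket = "photos-1200px",        // placeholder bucket names
                SourceKey = imageId + extension,
                DestinationBucket = "photos-1680px",
                DestinationKey = imageId + extension
            };

            try
            {
                client.CopyObject(request);
            }
            catch (AmazonS3Exception ex)
            {
                // Would be better written to a log file, as noted above.
                Console.Error.WriteLine("Copy failed for " + imageId + extension + ": " + ex.Message);
            }
        }
    }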

The program ran very smoothly, and in under a day we had all of our files copied over to the new bucket!  Below is the code for reference; feel free to use it yourself.

 
