Getting the most out of Amazon S3

Amazon S3 is a very useful service. According to the official Amazon Web Services website:

Amazon S3 is storage for the Internet. It is designed to make web-scale computing easier for developers.

It's a no-frills service and does exactly what it promises: it makes life easy for developers so that they can concentrate on features and leave the scaling to Amazon. If you are new to Amazon S3, here's a good starting guide for you.

S3 is like a sharp sword: you must know how to handle it or you can hurt yourself. That's exactly what happened to me. One of the many applications we are developing on MySpace, Sketch Me, required us to store (and serve) a huge amount of image data (user sketches). Due to the viral nature of the application, the load almost tripled every month. S3 was a clear choice. It saved us time, money and headaches. Our current stats (with caching) are:

  • Total files stored: 205GB
  • Bandwidth per month: 2TB
  • GET Requests per month: 112m

Clearly I would not like to waste time setting up image-serving servers that can handle such load, and I am more than happy to outsource it to Amazon S3.

You would be surprised: before caching, when our total images were just 5GB, the number of requests was 263m ($363.91), almost double what it is now with 205GB of images.

So if we take the total requests to be directly proportional to the number of images, at that rate the requests today would be approximately 4.5 billion, or about $15,000 :O

How did I tame the beast?

At first look the pricing of Amazon's S3 service seems quite cheap. Wait until you get your first bill and you will see how cents add up to huge $$$.

Storage

  • $0.150 per GB – first 50 TB / month of storage used

Data Transfer

  • $0.100 per GB – all data transfer in
  • $0.170 per GB – first 10 TB / month data transfer out

Requests

  • $0.01 per 1,000 PUT, POST, or LIST requests
  • $0.01 per 10,000 GET and all other requests
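
To see how those cents add up: 263 million GET requests at $0.01 per 10,000 comes to about $263 in request charges alone; the rest of that $363.91 pre-caching bill was bandwidth and storage (rough arithmetic from the figures above).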

So after getting the first bill for a few hundred dollars, I sat down thinking about how to bring it down. While digging through the HTTP headers, I found out that by default S3 doesn't set any cache-related response headers. So even when the visitor has requested the file from S3 before and has it in their browser cache, the browser will still send an HTTP GET request to S3 just to verify whether the file has changed. S3 returns a 304 Not Modified response if the file has not changed and the file won't be downloaded again. You may think S3 just saved you a few GB of bandwidth cost, but each of these requests still costs you ($0.01 per 10,000 GETs), and that is generally the bulk of the S3 bill.
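
You can see this for yourself by dumping the response headers of an object you have already uploaded. Here is a quick sketch (the URL is a placeholder, point it at one of your own files); with S3's defaults you will see ETag and Last-Modified but no Cache-Control or Expires, which is why the browser keeps revalidating:

<?php
// Fetch and print the response headers for an S3 object.
// Note: get_headers() itself issues a request to fetch them.
$url = 'http://s3.amazonaws.com/your_bucket/dir1/dir2/filename.ext';

foreach (get_headers($url) as $header) {
    echo $header, "\n";
}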

Since the photos our users upload almost never change, asking S3 every time whether the file has changed is simply not required. You can stop the browser from sending this extra request by setting an appropriate Cache-Control or Expires header on the files. For example, setting Cache-Control: max-age=864000 tells the browser not to request the same file again for the next 10 days (3600*24*10 seconds).
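
To be safe you can set both headers (Cache-Control is the HTTP/1.1 header and takes precedence, Expires is the older HTTP/1.0 fallback). Here is a small sketch of the two values, using the 10-day age from above:

<?php
// Build the two cache header values for a given age in seconds.
$age = 3600 * 24 * 10;                                          // 10 days

$cacheControl = 'max-age=' . $age;                              // Cache-Control: max-age=864000
$expires      = gmdate('D, d M Y H:i:s \G\M\T', time() + $age); // Expires: an absolute date 10 days from now

echo "Cache-Control: $cacheControl\n";
echo "Expires: $expires\n";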

Fortunately S3 allows us to do that, but there is no simple and easy way to do it. So I decided to write a small script (see below) to achieve this.

After Domestication

After setting the cache headers, my bill went down drastically. The traffic doubled and the number of images almost tripled (5GB to 15GB) within a month, while the number of requests dropped to a third of what it was; against the tripled image count, that is roughly a 9-times reduction in request cost. Ideally, your bandwidth cost should be more than your requests cost.

If you own a high-traffic blog or website, you can also store your JavaScript or CSS files on S3 with far-future Expires headers and use versioning (changing the file name or a version query string when the contents change) so that the browser knows when the file has changed.

For Example:

<link href="http://s3.amazonaws.com/lalit/style.css?v=3" ... />
after a change in the stylesheet, change your code to
<link href="http://s3.amazonaws.com/lalit/style.css?v=4" ... />

Domestication

The popular Firefox extension S3Fox doesn't allow us to do that, so I decided to write a small script using the S3 PHP REST Library.

You can download the code here (zip 6k).

Update: Fixed a bug in the code (MIME type calculation was failing on some PHP configurations).

To use the script, you have to upload it to your own server (running PHP). You need to edit the upload.php file to specify your AWS access key, AWS secret and S3 bucket name.

/* One time settings. */
$awsAccessKey	= '---';	// your AWS Key
$awsSecretKey	= '---';	// your AWS Secret
$bucket_name	= '---';	// S3 Bucket name
$age		= 3600*24*10;	// Cache age 10 days
 
/* File Data */
$s3_dir_name	= 'dir1/dir2/';	// Directory on s3 where you want to upload file
				// example http://s3.amazonaws.com/bucket_name/dir1/dir2/filename.ext
 
$upload_file	= 'filename.ext';// name of the file you want to upload.
				// keep it in the same dir as this file.

After you have saved the config info, every time you need to upload a file you have to specify the file name and the directory name in upload.php and call it from the browser: http://yoursite.com/s3/upload.php. (I know it sucks; I promise I will make it better soon.)
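
For the curious, below is a stripped-down sketch of the kind of request upload.php makes: a raw REST PUT signed with your AWS keys (Signature Version 2, which is what S3 uses here), with the Cache-Control header attached so that S3 stores it with the object and returns it on every GET. The content type and file names below are placeholders; the real script also works out the MIME type for you.

<?php
// Minimal signed PUT to S3 that stores a Cache-Control header with the object.
$awsAccessKey = '---';                   // your AWS Key
$awsSecretKey = '---';                   // your AWS Secret
$bucket_name  = '---';                   // S3 bucket name
$age          = 3600 * 24 * 10;          // cache age: 10 days
$s3_dir_name  = 'dir1/dir2/';            // directory on S3
$upload_file  = 'filename.ext';          // local file, kept next to this script

$key          = $s3_dir_name . $upload_file;
$contentType  = 'image/jpeg';            // set this to match your file
$date         = gmdate('D, d M Y H:i:s \G\M\T');
$amzHeaders   = "x-amz-acl:public-read\n";   // make the object publicly readable
$resource     = "/$bucket_name/$key";

// Signature Version 2 signs the verb, MD5 (empty here), content type, date,
// x-amz-* headers and the resource. Cache-Control is not part of the signature;
// it is simply stored with the object and echoed back on every GET.
$stringToSign = "PUT\n\n$contentType\n$date\n$amzHeaders$resource";
$signature    = base64_encode(hash_hmac('sha1', $stringToSign, $awsSecretKey, true));

$ch = curl_init("http://s3.amazonaws.com$resource");
curl_setopt_array($ch, array(
    CURLOPT_CUSTOMREQUEST  => 'PUT',
    CURLOPT_POSTFIELDS     => file_get_contents($upload_file),
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HTTPHEADER     => array(
        "Date: $date",
        "Content-Type: $contentType",
        "Cache-Control: max-age=$age",   // the header that stops the repeat GETs
        "x-amz-acl: public-read",
        "Authorization: AWS $awsAccessKey:$signature",
    ),
));
curl_exec($ch);
echo curl_getinfo($ch, CURLINFO_HTTP_CODE) == 200 ? "Uploaded\n" : "Upload failed\n";
curl_close($ch);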

If you want to avoid all this trouble, there is a paid tool, Bucket Explorer, which helps you upload files with custom headers. I haven't used it, but it looks great and works on Windows/Linux/Mac.

I hope your next S3 bill will come down :)

• • •

17 Responses to Getting the most out of Amazon S3

  1. Lalit says:

    Thanks Gaurav!
    Actually we tried this out at SlideShare too, but I was not aware of the impact because I didn’t have the S3 stats. I am sure it would have been more drastic than this.

  2. Samir says:

    Hi Lalit,

    Good info. http://www.codeplex.com/spaceblock is a free API for S3 which allows cache-control edits.

    Thx
    SA

  3. Pingback: Amazon S3 « Sandeep Verma

  4. You could also compare costs with Mosso’s S3-like cloud storage service, and have a look at Steadyoffload.com (I believe it is cheaper than S3)

  5. 4x4 Mecca says:

    Thank you guys! Lalit for writing this and the PHP script and Samir for the link. I got SpaceBlock working and it’s pretty easy now. Thanks!

  6. Andy says:

    Another freeware Amazon S3 client that allows setting cache control and other HTTP headers is CloudBerry Explorer http://s3explorer.cloudberrylab.com/

  7. Fred says:

    Hey Lalit, nice idea.

    To prevent GET requests you need to modify the cache-control http header or the expire header?

    I’ve read the developer.yahoo.com section about cache control and expires headers and I’m a bit confused :)

    Actually I’m using the Amazon S3 plugin for WordPress that automatically modifies expires headers, am I ok with those settings?

    Thanks
    Fred

  8. Lalit says:

    Fred,
    Ideally both, if you are using both of them.

    By default S3 doesn’t have any of these headers set, so setting any one of them is good enough. But to be on the safer side, you should set both.

  9. Pingback: Amazon CloudFront and S3 maximum cost

  10. Pingback: Lower your Amazon S3 Bill and also Improve Website Loading Time | _Twi8erTwi8er

  11. Mangal says:

    Hi Lalit,

    Does this method help in faster display of images from s3 to web page too?

    • Lalit says:

      It doesn’t affect the speed of downloading images or other static content. It merely instructs the browser to use the cached version of the image or file instead of downloading it again when it is requested a 2nd time. That way the loading of that image or file is *much faster the 2nd time*, as it is read directly from disk.

  12. Amit says:

    Lalit, thanks for the script.

    I used it but am unclear if I set it right. Can you help with how I can verify (in PHP)? I used Bucket Explorer, selected properties, but did not see cache-control details there. I am aware your script interacts with S3, but I am not sure of the parameter order, like can I add $age at the end? Or should it be after the array(), ‘Content-type’?

    Sorry for the newbie queries.

    • Lalit says:

      Amit,
      You just have to change the $age variable in upload.php
      That should be it. You can also verify it by accessing the file in the browser and checking the headers in Firebug.

  13. Ryan Hellyer says:

    Thanks for writing the PHP application :)

    This problem has been driving me batty all night. I couldn’t understand why my cache headers were all over the place. I’d been blaming my CloudFront settings, but it turns out that S3 was the problem, and a problem I could not fix … until I saw your script just now :)