S3 (in)Sanity: When `info` doesn’t cut it.

If you are a web developer, in this day and age, you will at some point swim a bit in the waters of AWS. It is a fertile land, with much fruit and exciting things to do, and once there, you will eventually work with S3. S3 is fairly straightforward in its basic use-cases; with s3fs it's basically just another drive, with the added bonus that you can mount that drive across multiple systems and share files, like cached images or pages.

That's one of the ways we use it where I work, but in addition to that, we have an S3 bucket for each of our application environments (dev, test, staging, and production), and these buckets are mounted on the systems that fall into those groups.

Recently, I was tasked with rewriting a piece of software that syncs data down from staging or production into dev or test. Historically, this was done rather blindly, with a simple call to `s3cmd sync` pointed at whatever it was told were the two buckets and folders to sync from and to. Also historically, if a user entered incorrect data, it had a tendency to fail silently, mucking things up in the process, which is where I came in.

In my rewrite, I wanted to do extensive sanity checking before executing any of the tasks the tool needed to perform, and one of those sanity checks was to make sure that our source and destination actually existed. The sanity checks had to be fast, and ideally they would take the form of single commands I could execute against the shell, inspecting the return value to determine success.

I had already gathered that I could use `s3cmd ls` to list the contents of a directory, but I knew that listing the contents of any of the directories I would be syncing would incur significant wall time due to their size, and sanity checks such as these should be fast.
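For what it's worth, that ruled-out check would have looked something along these lines (a rough sketch in PHP with made-up bucket and folder names, not our actual code):

```php
<?php
// The obvious check: list the folder and see whether anything comes back.
// On a nonexistent prefix, `s3cmd ls` simply prints nothing, so we test
// for non-empty output -- but on directories the size of ours, the
// listing itself takes far too long for a quick sanity check.
$listing = shell_exec('s3cmd ls s3://my-bucket/some/huge/folder/');

$exists = trim((string) $listing) !== '';
```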

But why not just omit the trailing slash, and have it look for the folder itself rather than listing its contents?

Well, a little further research led me to this StackOverflow Q&A, which pointed out that this approach has the side effect of listing every item that starts with that path, resulting in false positives when a sibling directory shares the prefix but is not the actual directory I am looking for. The accepted answer went on to suggest using `s3cmd info`, a very quick command that simply returns various metadata about the requested node in S3.
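If you have not used it, the `info` check boils down to something like this (again a sketch, with a placeholder path):

```php
<?php
// Quick existence check via `s3cmd info`: it prints metadata for the
// requested node and exits non-zero when the node cannot be found.
// (The path below is a placeholder.)
exec('s3cmd info s3://my-bucket/some/folder/ >/dev/null 2>&1', $output, $exitCode);

$exists = ($exitCode === 0);   // only a clean exit counts as "it exists"
```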

This was precisely what I needed, and I finished the refactor and moved on.

Fast-forward a few days, and we ran into a problem:

The folder exists, but `s3cmd info` is throwing a 404. We are catching the exit code of the command to determine success, and (as you can imagine) a 404 does not produce an exit code that indicates any kind of success.

So I was left with a bit of a dilemma. Using `ls` on the directory is slow, using `info` seems to be unreliable, and for some reason the internet was not bustling with answers on this one. So how do I solve this? Well, as is usually the case, I went with my initial solution, with a twist.

I know that a possible side effect of searching without the trailing slash is that it lists identically prefixed siblings, but I also know that it is quite a bit faster than listing the entire directory (as the `ls` happens natively in S3). So I can leverage `grep` to ensure that I am actually getting the result I want, without the overhead of a huge listing.

Seems obvious enough, but it wasn’t, for some reason… :-/

So, to automate this, I intentionally trim the last character off the search string if it is a `/`, execute `ls` on what remains, and pipe the output into `grep` to confirm that what I am actually looking for exists.
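In our PHP wrapper, that boils down to something like the following (paths and variable names here are placeholders, and I am leaving input sanitization aside for the moment):

```php
<?php
// $path is the S3 URI we want to verify (placeholder value here).
$path   = 's3://my-bucket/some/folder/';
$prefix = rtrim($path, '/');   // trim off the trailing slash, if any

// List the trimmed prefix (fast, as the listing happens natively in S3),
// then grep the output for the exact path we were asked about.
exec("s3cmd ls {$prefix} | grep -F {$path} >/dev/null", $output, $exitCode);

$exists = ($exitCode === 0);   // grep exits 0 only when it finds a match
```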

The `>/dev/null` bit discards the output, as we do not need it; we just want the return code. This also works to confirm the existence of files.

And it’s as simple as that.

You will, of course, want to sanitize the input directory to make sure it doesn't have any nefarious bits in it, perhaps using `escapeshellarg` or `escapeshellcmd`, but beyond that, this is really all there is to it. In our utility we do this once for each endpoint, and if they all pass, we know we are all set to execute the sync when we get around to it.
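Wrapped up with the escaping in place, the whole check might look something like this (the function name and bucket paths are made up, so adapt to taste):

```php
<?php
// Returns true when the given S3 URI exists, using the ls-plus-grep trick.
function s3PathExists(string $path): bool
{
    $prefix = rtrim($path, '/');   // trim the trailing slash, if any

    // Escape the user-supplied path before it ever touches the shell.
    $cmd = sprintf(
        's3cmd ls %s | grep -F %s >/dev/null',
        escapeshellarg($prefix),
        escapeshellarg($path)
    );

    exec($cmd, $output, $exitCode);

    return $exitCode === 0;   // grep only exits 0 on a match
}

// Hypothetical usage: verify both endpoints before kicking off the sync.
$ready = s3PathExists('s3://staging-bucket/app/cache/')
      && s3PathExists('s3://dev-bucket/app/cache/');
```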

Hope that helps, and happy coding.
