Tuesday, September 17, 2013

Calculating an MD5 Hash for Google Cloud Storage

When you upload an object to cloud storage, it's a good idea to make sure that the object in the cloud is the same object you have locally. The simplest test would be to download the object and compare it to yours directly, but that's annoying for big objects. Fortunately, Google Cloud Storage automatically calculates an MD5 hash of every object that you upload.

MD5 hashes are small numbers that are calculated by examining all of the bytes of a file. If you make any changes to the file, the MD5 hash changes. For purposes of error checking, it's statistically impossible to change the file without its MD5 hash changing as well. So if you have an object with an MD5 hash of def51393b98548cf7f9471d2820a0347, and the object in the cloud has the same hash, you can be pretty darn sure that they are the same object.
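A quick illustration of how sensitive the hash is to changes (the data here is made up; any single-byte change produces a completely different digest):

```python
import hashlib

original = b"important data"
modified = b"important datA"  # a single byte changed

print(hashlib.md5(original).hexdigest())
print(hashlib.md5(modified).hexdigest())
```

The two printed digests share no obvious resemblance, which is exactly what makes the comparison useful for error checking.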

An MD5 hash is a 128-bit number. There are several ways to represent it. The popular md5sum command prints MD5s as a string of 32 hexadecimal digits (0-9 and a-f).

$ md5sum importantFile.txt
def51393b98548cf7f9471d2820a0347 /home/automaton/importantFile.txt

Another way to represent an MD5 hash takes the MD5 hash (as a binary number) and encodes that binary value into a base64 representation. This is the way that both HTTP's "Content-MD5" header and Google Cloud Storage usually think about MD5s.

Here is how we generate this style of MD5:

$ openssl dgst -md5 -binary importantFile.txt | openssl enc -base64
3vUTk7mFSM9/lHHSggoDRw==

Just to restate, def51393b98548cf7f9471d2820a0347 and 3vUTk7mFSM9/lHHSggoDRw== are two different ways to encode the same MD5 value.
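We can check that claim directly in Python: decode the hex digits back into the 16 raw bytes they represent, then base64-encode those bytes.

```python
import base64
import binascii

hex_md5 = "def51393b98548cf7f9471d2820a0347"
# Decode the 32 hex digits to the underlying 16 bytes, then base64-encode them.
raw_bytes = binascii.unhexlify(hex_md5)
print(base64.b64encode(raw_bytes).decode('ascii'))  # 3vUTk7mFSM9/lHHSggoDRw==
```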

Here is how we would generate the first style of MD5 in Python:

import hashlib
with open(file_name, 'rb') as f:
    hash = hashlib.md5(f.read()).hexdigest()
# 'def51393b98548cf7f9471d2820a0347'

And here's how we would generate the second style (the one used for Content-MD5 or Google Cloud Storage hashes):

import hashlib
import base64
with open(file_name, 'rb') as f:
    binary_hash = hashlib.md5(f.read()).digest()
# b64encode returns bytes in Python 3, so decode to get a string.
hash = base64.b64encode(binary_hash).decode('ascii')
# '3vUTk7mFSM9/lHHSggoDRw=='

Note that one common mistake is to take the hexadecimal string value, "def51393b98548cf7f9471d2820a0347", and base64 it. This produces something that looks like the right kind of base64'd data, but it's wrong: it encodes the 32 ASCII characters of the hex string rather than the 16 raw bytes of the hash.
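Here's the mistake side by side with the correct version:

```python
import base64

hex_md5 = "def51393b98548cf7f9471d2820a0347"

# Wrong: base64-encoding the hex *string* encodes 32 ASCII characters
# and yields a 44-character result.
wrong = base64.b64encode(hex_md5.encode('ascii')).decode('ascii')

# Right: base64-encoding the 16 raw bytes the hex string represents
# yields a 24-character result.
right = base64.b64encode(bytes.fromhex(hex_md5)).decode('ascii')

print(wrong)
print(right)  # 3vUTk7mFSM9/lHHSggoDRw==
```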

So now we have an MD5 hash. Great! We can upload the object to cloud storage and then verify that its hash is correct. Still, it's kind of a pain to upload it first and only then check its metadata to make sure it's right.

Well, actually, we don't have to! We can tell Google Cloud Storage exactly what hash the object we're uploading should have. Using the XML interface, we put the value in the Content-MD5 header, and using the JSON interface, we set the md5Hash property of the object metadata. If the data that arrives at the server doesn't have exactly that MD5 value, the upload request fails, and we can simply try the upload again.
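A rough sketch of what an XML-API upload with a Content-MD5 header could look like (the bucket name, object name, and auth token are made up; the request is only built here, since sending it requires real credentials):

```python
import base64
import hashlib
import urllib.request

data = b"hello, cloud"
content_md5 = base64.b64encode(hashlib.md5(data).digest()).decode('ascii')

# Hypothetical bucket/object names and a placeholder token.
request = urllib.request.Request(
    url='https://storage.googleapis.com/my-bucket/importantFile.txt',
    data=data,
    method='PUT',
    headers={
        'Content-MD5': content_md5,
        'Authorization': 'Bearer YOUR_ACCESS_TOKEN',
    },
)
# urllib.request.urlopen(request) would perform the upload; if the MD5 the
# server computes doesn't match Content-MD5, Cloud Storage rejects the request.
```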
