Hashing – Good For You to Know
Humans Don’t Hash
When faced with the task of comparing two electronic files to ensure that they are identical, people often compare printouts line by line. One way to compare files (while preserving the physical evidence of a printout) is to append a hash to the end of each printout[2. Using a tool like Linux’s diff might be an even better solution because, in the case of a difference, the user wouldn’t have to search the document to find out what changed.]. On Windows, Hash Calculator would be a nice tool, as it even allows the user to cut and paste text into a pane to be hashed. Unfortunately, printing out hashes might cause more questions and uncertainty if one’s boss doesn’t understand what those cryptic characters are.
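For the curious, here is a minimal Ruby sketch of what "compare two files by hash" could look like. The method name same_contents? is my own invention for illustration, and I picked SHA-256 as an assumption; any decent hash would serve the same purpose.

```ruby
require 'digest'

# Hypothetical helper: returns true when both files produce the same
# SHA-256 digest, which for practical purposes means identical contents.
def same_contents?(path_a, path_b)
  Digest::SHA256.file(path_a).hexdigest == Digest::SHA256.file(path_b).hexdigest
end

# Example use (the file names are placeholders):
#   puts same_contents?('report_v1.txt', 'report_v2.txt')
```

Note that this only says whether the files differ, not where. For that, a tool like diff is still the better fit.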
What is hashing?
Hashing is taking some digital input (text, a picture, a song) and producing a short, fixed-length signature of the data that is, for practical purposes, unique. A hash (signature) looks something like this: 34542d6380f7993a99bb7bae8bb4177f. The important points about hashing are:
- It’s unlikely that two different inputs will produce the same hash. How unlikely? With MD5, about 1 in 2^128 for any given pair of inputs if you are just haphazardly trying to produce identical hashes. There are ways to find collisions more efficiently (MD5 in particular is no longer considered collision-resistant), and the birthday attack means that the more inputs you hash, the better the odds that some pair of them collides. But forget about all that. If collision avoidance is extremely important, just use a bigger, better hashing algorithm (e.g. SHA-512). Keep in mind that we can’t even fathom numbers like 2^512.
- Even a very small change will produce a very different hash. Notice how one letter change (“is” → “iz”) dramatically changes the hash in this Ruby code:
    require 'digest/md5'

    puts Digest::MD5.hexdigest('hashing is tight') # => b72fa580415e60b1cd7613dfcc044d01
    puts Digest::MD5.hexdigest('hashing iz tight') # => 41891aedfef6b48526a771f893649408

- Making a hash is easy. Finding the input from the hash is hard. Because of this property, hashes are referred to as one-way functions. Again, there are some caveats (mainly rainbow table attacks), but these issues can be avoided if this type of security is required.
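To put a number on "a very small change produces a very different hash", here is a rough sketch that counts how many of MD5's 128 output bits differ between two inputs. The differing_bits method is a made-up name for this post, not part of any library.

```ruby
require 'digest/md5'

# Hypothetical helper: count how many of the 128 bits differ between
# the MD5 digests of two strings.
def differing_bits(a, b)
  x = Digest::MD5.hexdigest(a).to_i(16) # digest as one big integer
  y = Digest::MD5.hexdigest(b).to_i(16)
  (x ^ y).to_s(2).count('1')            # XOR leaves a 1 wherever the bits differ
end

puts differing_bits('hashing is tight', 'hashing iz tight')
```

For a well-behaved hash the count tends to land somewhere around half of the bits, which is why the two digests look completely unrelated.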
It’s all around us
- Git version control uses hashes to tell if a file has changed.
- Programs that synchronize files (e.g. rsync) can use hashes to know which files have not changed and can be ignored when synchronizing.
- Your passwords are (often) hashed when they are stored on your machine and on websites. This means that someone who sees your stored password hash doesn’t automatically know your password.
- Cool next-generation file systems like ZFS and BTRFS use hashing to improve reliability and add features like easy creation of snapshots and the ability to roll back changes (think Apple’s Time Machine).
- Cryptographic signatures can ensure that a message has not been modified and, with the help of public key encryption, can verify the sender.
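As one concrete illustration of the password item above, here is a sketch of salted password hashing using PBKDF2 from Ruby's bundled OpenSSL library. The method names, the iteration count, and the 16-byte salt are assumptions I chose for the example, not a description of how any particular website does it. The random salt is what takes the wind out of rainbow tables, since the same password hashes differently for every user.

```ruby
require 'openssl'
require 'securerandom'

ITERATIONS = 50_000 # assumed work factor; real systems tune this
KEY_LENGTH = 32     # bytes of derived hash to keep

# Hypothetical helper: derive a slow, salted hash from a password.
def hash_password(password, salt = SecureRandom.random_bytes(16))
  derived = OpenSSL::PKCS5.pbkdf2_hmac(
    password, salt, ITERATIONS, KEY_LENGTH, OpenSSL::Digest.new('SHA256')
  )
  { salt: salt, hash: derived } # store both; the password itself is never stored
end

# Hypothetical helper: re-derive with the stored salt and compare.
# A production system would likely use a constant-time comparison here.
def password_matches?(password, record)
  hash_password(password, record[:salt])[:hash] == record[:hash]
end

record = hash_password('correct horse')
puts password_matches?('correct horse', record)
puts password_matches?('wrong horse', record)
```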
End Users Should Know Something About Hashing
Whenever I see people struggling to use computers, I most often find that their expectations of computer usability are very low. Sometimes users feel that if they make a mistake they will have to start a day-long project all over again. Users who aren’t very familiar with computers often feel that a single mouse click in the wrong spot will be devastating to all of their data.
Of course, ideally an end user operates a device without any knowledge of its inner workings, but certain understandings (e.g. the concept of saving something to the hard disk, the concept of being connected to the Internet) help users feel more confident as they operate and troubleshoot.
Hashes look ugly, but a user who has some concept of hashing is able to tackle problems that other users cannot. For instance, I observed a user who received a license key for a piece of software and found that, upon entering their name and key, the product would not register. The key was derived from a hash that included the user name, and the user had capitalized their name when entering it into the program. Writing their name in all lowercase, as the company had in its emails, allowed the user to register the product[3. Yes, this was a shoddy product and a shoddy company].
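I don't know how that product actually built its keys, but a purely hypothetical scheme shows why capitalization matters. Everything here (the secret, the method names, the 16-character key) is invented for illustration.

```ruby
require 'digest/md5'

# Invented vendor secret for this illustration only.
SECRET = 'example-vendor-secret'

# Hypothetical scheme: the key is the first 16 hex characters of an
# MD5 digest over the user name and the vendor secret.
def license_key(name)
  Digest::MD5.hexdigest("#{name}:#{SECRET}")[0, 16]
end

# Strict check: the name must match byte for byte, capitalization included.
def valid_key_strict?(name, key)
  license_key(name) == key
end

# Friendlier check: normalize the name first, assuming keys were
# issued against the trimmed, lowercase form of the name.
def valid_key_forgiving?(name, key)
  license_key(name.strip.downcase) == key
end

issued = license_key('jane doe')
puts valid_key_strict?('Jane Doe', issued)
puts valid_key_forgiving?('Jane Doe', issued)
```

A user who knows that one changed letter changes the whole hash will think to try the exact spelling from the email. A less shoddy product would simply normalize the name before hashing.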
Users who understand hashing can feel confident when they use operations like Ubuntu’s merge all to merge files from different folders (for Windows users… maybe WinMerge?). The alternative is painstaking and paranoid checking of many files to ensure that no data is lost.
Users who understand hashing can feel confident when using systems like Apple’s Time Machine and understand that this activity is easy for the computer and is an efficient use of space. They can feel that their data is secure while omitting the periodic backups that they may have grown accustomed to. When redundant hard drive arrays become more common (like those possible with ZFS), users will feel confident when removing a dead hard drive from their system, even though this would have been disastrous in the past.
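To make the folder-merging idea concrete, here is a small sketch that groups files from several folders by the hash of their contents, so byte-identical copies show up together regardless of their names. The method names are my own, and I make no claim that any particular merge tool works this way.

```ruby
require 'digest'
require 'find'

# Hypothetical helper: map each SHA-256 digest to the list of files
# (under any of the given folders) whose contents produce it.
def files_by_digest(*folders)
  groups = Hash.new { |hash, key| hash[key] = [] }
  folders.each do |folder|
    Find.find(folder) do |path|
      next unless File.file?(path) # skip directories and other non-files
      groups[Digest::SHA256.file(path).hexdigest] << path
    end
  end
  groups
end

# Hypothetical helper: only the groups that contain more than one file.
def duplicates(*folders)
  files_by_digest(*folders).values.select { |paths| paths.size > 1 }
end

# Example use (folder names are placeholders):
#   duplicates('photos_old', 'photos_new').each { |paths| puts paths.join(' == ') }
```

Files that land in the same group can be merged without opening them, and anything left over is exactly the set worth a human look.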
Anyway, it’s just a random thought that turned into a long post. What else should a computer-literate person (dare I say “power user”?) know to help them on their way?
Footnotes: