Hashes and Passwords: Not quite as oversimplified

This post was migrated. The content was written 2019-26-1

An overly technical brief introduction to hashing and passwords

I've been spending too much time on HackForums lately (and it hasn't even been a full day), explaining stuff like zero-days, hash-cracking, anonymity and secure deletion, wireless hacking, and reverse engineering. One recurring theme I've already seen is relating to password hashes and what they are.

People often think of hashes in the same terms that they think of encryption – encryption takes data and makes it unreadable, and hashes do much the same. The problem is that although they operate in much the same way and do, for the most part, the same thing at the overview level, they're 2 entirely separate things.

What are hashes?

To understand what we're really saying when we say “crack a hash” you need to understand what they are, and what the differences are between hashing and encrypting information.

Encryption is meant to hide data from anyone without knowledge of some secret value (aka the key). It is intended on being a reversible process – knowledge of the key allows you to get the data back. The most common encryption algorithm is AES, an algorithm standardized as FIPS 197 (the United States Federal Information Processing Security) publication back in November 2001.

AES is a block cipher used to secure secret information. Block ciphers operate on fixed-size units of data – typically 128 bits – and encrypt each block separately. This will be represented here as a function, encrypt(message, key), which takes a message, combines it with a secret key, and gives back the value.

Most hash functions are built from an encryption algorithm, due to the way modern encryption algorithms behave. This is because that, in order to be considered secure, a few properties need to be in place.

  1. The message should be irretrievable without the key.
  2. Given the message, the key should still be irretrievable, to prevent known-plaintext attacks – where an attacker knows some or all of the original message, but still needs the key.
There are other properties, but these are the 2 we're concerned with.
The way most hashes are defined is as a Merkle-Damgård construction. A Merkle-Damgård construction takes a message and puts it in place of the key when encrypting it. Then, the previous value of the hash is used as the message, instead. (Note that this is an oversimplification, but explains why the properties work out how they do)

Remember property #1 above, saying the message should be irretrievable without the key? If the message is the key, then we need the message to get the message, making it irreversible.

Then there's property #2 above, which states the key should be irretrievable even with the message. This is important because of the definition of a block cipher I gave above. A block cipher takes fixed-size units of data – got 128 bits to encrypt? That's 1 block, but 256 bits will require another block – so will 129 bits, for that matter. Why does this matter? Because each block is an entire message.

The hashing algorithms operate 1 block at a time. They take the first block, hash that, then combine that with the hash of the next block. That is then used as the message when hashing the next block.

Let's say you could get the key using the original message – when you hash data, you could use knowledge of the hash's initial state, and connect the dots until you know what was hashed. Every hash has an initial state – hypothetically, if you were to hash zero blocks, the result would be the initial state. This is used as the key for the first block hashed. The initial state could lead to getting the information needed to break the next block, until the hash becomes useless – you can still reverse it, anyway.

But with that second property, you defend against that, since the actual message is protected. By swapping the message and the key, we go from something meant to be reversed, to something that can never be reversed.

So how can I decrypt a hash?

You might already know the answer to this, but hashes aren't decrypted – they're cracked. Ideally, the most efficient way to crack a hash would be to try every possible message until you just happen to get one right. That way, the security is based purely on the message. The upper limit to the security is the block-size divided by 2 – MD5, which uses a 128 bit block, will be cracked in 2^64 attempts using this approach – longer than most people want to wait. Note, also, that MD5 can in practice be cracked in only a few seconds, but that's a different attack.

If the hash is perfectly secure, the brute-force algorithm is the most efficient you'll ever be able to pull off. This is what tools like Hashcat and John the Ripper do – they guess until they get one right. They will also use a wordlist – a list of things it's more likely to be – to speed up the attack, since let's face it, nobody uses truly random passwords unless they have LastPass or some other tool do it for them.

When you talk about decrypting a hash, you really mean cracking it. It's a search where you say “Is this it? NO. Is this it? NO.” until you get to “Is this it? Yup!”

It's strategic stumbling.

What is a salt when hashing?

A salt is a value meant to make it harder to crack a hash. A unique salt will generate a unique hash, even if the message is the same. This way, someone can't just hash every password once and look it up in a massive table (this is what Ophcrack and most online hash decryptor services do). It forces you to do the legwork every time, even if you've cracked that hash before with a different salt.

The salt value is stored alongside the hash – it's not meant to be any more secret than the hash itself, although you still should keep it at least as secure as the hash. I like to compare it to salting your food – it tastes like sweet security, but the attackers get all fat and slow from consuming too much. After all, who doesn't like salted foods? That is, before they start caring about their health.

What's the actual most efficient way to crack a hash?

I've been talking about hashes in a rather idealized manner, so let's bring this down to Earth. Although hashes are typically pretty close to their expected security levels for a while, hashes such as MD5 and SHA-1 have been broken in practice using more mathematically sophisticated methods. How? Because hashes are hard.

Creating a secure hash is difficult – you need to understand a lot about a lot, and this leads to some hashes being broken after people learn more about them. Take MD5, for example. MD5 was originally intended to cryptographically verify whether 2 pieces of data are the same, without revealing the data. The problem is MD5 is vulnerable to a collision attack – given an MD5, I can calculate other data that will produce the same MD5. Is it the same as the original data? I don't know, but I also don't care, since when verifying data, all that matters is that the hashes match.

How can I identify a hash when I see it?

Most hashes are also PRFs, or pseudorandom functions. Any secure PRF has the ability to create every possible output, with equal probability. If I have 128 bits of data, that data is a valid MD5, but that doesn't mean it's being used as one. Therefore, I can do a few things:
  1. I can find the algorithm that generates or compares the hash.
  2. I can find the original input. Note that this has to be the original input – collisions don't count, since as stated, every 128-bit block of data is the MD5 of something.
If I want to do option #1, I need to do a bit of reverse engineering. For a website, I need to hope the algorithm is somewhere I can see it (it's probably wishful thinking, but check the Javascript).

For option #2, unless it's an easy case, you don't identify the hash – you say “MD5? Looks good to me :P” and go on your merry way. You can't actually use that knowledge, since it very well may not be, but it's not like hashes or useful, right? Right? (Note: Wrong)



I hope you learned something, and know that even this overly technical explanation is an oversimplification of how things really are. 

This is a complex field of study, so get used to it!

All original non-code/non-software content is committed to the public domain, except where otherwise explicitly stated. Code/software is licensed under the BSD 3-clause license, except where otherwise explicitly stated. Content not originally created by Serpent Security may be subject to separate licensing terms.