November 2, 2018

Message Digest in Java

How do you hash data with the MD5, SHA-1 or SHA-256 algorithm in Java?

Long story short

To save time, let’s go straight to an example that uses the MD5 hashing algorithm:

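import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Example {
    public static void main(String[] args) {
        try {
            // any source of data represented as a byte array
            byte[] data = "test string".getBytes(StandardCharsets.UTF_8);

            // instance of the desired algorithm: MD5, SHA-1 or SHA-256
            MessageDigest messageDigest = MessageDigest.getInstance("MD5");
            byte[] digest = messageDigest.digest(data);

            // not necessary, but often preferred: a hex string, zero-padded to
            // two characters per byte (plain toString(16) drops leading zeros)
            String stringRepresentation =
                    String.format("%0" + (digest.length * 2) + "x", new BigInteger(1, digest));
            System.out.println(stringRepresentation);
            // 6f8db599de986fab7a21625b7916589c

        } catch (NoSuchAlgorithmException e) {
            // ...
        }
    }
}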

Let’s check the result against the built-in md5 command (macOS/BSD):

md5 -s "test string"
MD5 ("test string") = 6f8db599de986fab7a21625b7916589c
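
The same pattern works for the other algorithms named above. The following is a minimal sketch, not a definitive implementation: the class name HexDigests and its hex method are hypothetical helpers made up for illustration, and it assumes the runtime provides the "MD5", "SHA-1" and "SHA-256" algorithms.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class; the name is illustrative and not part of any library.
final class HexDigests {

    // Hashes UTF-8 text with the named algorithm and returns lowercase hex.
    static String hex(String algorithm, String text) {
        try {
            MessageDigest md = MessageDigest.getInstance(algorithm);
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // two hex characters per byte, so leading zeros are kept
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(algorithm + " is not available", e);
        }
    }

    public static void main(String[] args) {
        for (String algorithm : new String[] {"MD5", "SHA-1", "SHA-256"}) {
            System.out.println(algorithm + ": " + hex(algorithm, "test string"));
        }
    }
}
```

The digest length depends only on the algorithm: 32 hex characters for MD5, 40 for SHA-1 and 64 for SHA-256.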

Thread-safety

MessageDigest is not thread-safe: an instance keeps internal state between calls, so in general each thread should use its own instance.

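If you already depend on the Apache Commons Codec library, you can use its DigestUtils class, which is immutable and thread-safe:

 byte[] digest = new DigestUtils(SHA_224).digest(dataToDigest);

Note the caveat in the Apache Commons documentation, though: “However the MessageDigest instances it creates generally won’t be.” In other words, a DigestUtils object can be shared between threads, but a MessageDigest obtained from it still cannot.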
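
With the JDK alone, one possible way to give every thread its own instance is a ThreadLocal. This is only a sketch under that assumption; the class PerThreadSha256 and its sha256 method are hypothetical names, and SHA-256 is chosen purely as an example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper; the class and method names are illustrative only.
final class PerThreadSha256 {

    // Each thread lazily creates and then reuses its own MessageDigest.
    private static final ThreadLocal<MessageDigest> DIGEST =
            ThreadLocal.withInitial(() -> {
                try {
                    return MessageDigest.getInstance("SHA-256");
                } catch (NoSuchAlgorithmException e) {
                    throw new IllegalStateException("SHA-256 is not available", e);
                }
            });

    static byte[] sha256(byte[] data) {
        MessageDigest md = DIGEST.get();
        md.reset();             // drop any state left over from an aborted earlier use
        return md.digest(data); // digest() also resets the instance when it completes
    }

    public static void main(String[] args) throws InterruptedException {
        byte[] data = "test string".getBytes(StandardCharsets.UTF_8);
        Runnable task = () -> System.out.println(
                Thread.currentThread().getName() + ": " + sha256(data).length + " bytes");
        Thread a = new Thread(task, "worker-a");
        Thread b = new Thread(task, "worker-b");
        a.start();
        b.start();
        a.join();
        b.join();
    }
}
```

If hashing is not on a hot path, simply calling MessageDigest.getInstance for every operation is an equally safe and simpler choice.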

Use cases

Generic idea

A hash function, or message digest algorithm, is a function that takes data of arbitrary length as input and returns data of a fixed length as output.

The most important properties a hash function should have:

  1. Deterministic. For the same input, a hash function must always return the same output; otherwise it is not usable.
  2. One-way. It should be computationally infeasible to recover the original data (message) from a given hash.
  3. Weak collision resistance. Given a message, it should be very hard to find a different message with the same hash.
  4. Strong collision resistance. It should be very hard to find any two different messages with the same hash.
  5. Unpredictability. The function is deterministic, but its result for a given message should look unpredictable: the digests of two close/similar messages must not be similar.
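
Some of these properties can be observed directly. The sketch below is an illustration, not a proof: it uses SHA-256 as an assumed example algorithm, and HashPropertiesDemo, sha256 and differingBits are hypothetical names. It shows that the output length is fixed, that the same input gives the same digest, and that changing a single character changes many bits of the digest.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical demo class; names are illustrative only.
final class HashPropertiesDemo {

    static byte[] sha256(String text) {
        try {
            return MessageDigest.getInstance("SHA-256")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is not available", e);
        }
    }

    // Counts the bit positions in which two equal-length digests differ.
    static int differingBits(byte[] a, byte[] b) {
        int count = 0;
        for (int i = 0; i < a.length; i++) {
            count += Integer.bitCount((a[i] ^ b[i]) & 0xff);
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] first = sha256("test string");
        byte[] again = sha256("test string");
        byte[] similar = sha256("test strinh"); // only the last character differs

        System.out.println("digest length in bytes: " + first.length);
        System.out.println("deterministic: " + Arrays.equals(first, again));
        System.out.println("differing bits out of 256: " + differingBits(first, similar));
    }
}
```

For a well-behaved hash function, roughly half of the output bits are expected to differ between the two similar messages.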

Examples of application

The most common use cases are in cryptography: integrity checks, password storage, SSL, PGP. But hashing is useful elsewhere too.

Search algorithms can use hashing to match identical content without comparing all of the content.

Related to search is finding duplicates. For example, an image can be hashed and then looked up in a database by its hash alone.
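
As a rough sketch of the duplicate-finding idea, the code below groups in-memory items by the SHA-256 digest of their content. Everything here is an assumption made for illustration: the DuplicateFinder class, its method names, the file names and the use of an in-memory map instead of a real database.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; names and data are illustrative only.
final class DuplicateFinder {

    // Turns the SHA-256 digest of the content into a string usable as a map key.
    static String digestKey(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
            return Base64.getEncoder().encodeToString(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is not available", e);
        }
    }

    // Returns groups of item names whose contents have the same digest.
    static List<List<String>> findDuplicates(Map<String, byte[]> items) {
        Map<String, List<String>> byDigest = new LinkedHashMap<>();
        for (Map.Entry<String, byte[]> item : items.entrySet()) {
            byDigest.computeIfAbsent(digestKey(item.getValue()), k -> new ArrayList<>())
                    .add(item.getKey());
        }
        List<List<String>> duplicates = new ArrayList<>();
        for (List<String> names : byDigest.values()) {
            if (names.size() > 1) {
                duplicates.add(names);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        Map<String, byte[]> images = new LinkedHashMap<>();
        images.put("a.png", new byte[] {1, 2, 3});
        images.put("b.png", new byte[] {9, 9});
        images.put("c.png", new byte[] {1, 2, 3});
        System.out.println(findDuplicates(images)); // prints [[a.png, c.png]]
    }
}
```

Only the short digests are compared during the search. If absolute certainty is needed, the few candidates that share a digest can still be compared byte by byte afterwards.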

The DVCS tools Git and Mercurial both use SHA-1 to build a unique identifier for each commit, which lets them synchronize and maintain the history of changes across a distributed set of repositories.

The TCP/IP checksum is used to detect corruption of data in TCP segments and IPv4 headers. It is a weak check by modern standards.
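
To make the last example concrete, here is a minimal sketch of the 16-bit one's-complement sum described in RFC 1071, on which the TCP and IPv4 checksums are based. The class and method names are hypothetical, and real TCP checksums also cover a pseudo-header that this sketch leaves out. The sample bytes are the numerical example from RFC 1071.

```java
// Hypothetical helper; the class and method names are illustrative only.
final class InternetChecksum {

    // Computes the 16-bit one's-complement checksum over the given bytes.
    static int checksum(byte[] data) {
        long sum = 0;
        int i = 0;
        // Add the data as big-endian 16-bit words.
        for (; i + 1 < data.length; i += 2) {
            sum += ((data[i] & 0xff) << 8) | (data[i + 1] & 0xff);
        }
        // An odd trailing byte is treated as if padded with a zero byte.
        if (i < data.length) {
            sum += (data[i] & 0xff) << 8;
        }
        // Fold the carries back into the low 16 bits.
        while ((sum >>> 16) != 0) {
            sum = (sum & 0xffff) + (sum >>> 16);
        }
        // The checksum is the one's complement of the folded sum.
        return (int) (~sum & 0xffff);
    }

    public static void main(String[] args) {
        byte[] sample = {0x00, 0x01, (byte) 0xf2, 0x03,
                         (byte) 0xf4, (byte) 0xf5, (byte) 0xf6, (byte) 0xf7};
        // Same 16-bit words, but with the first two swapped.
        byte[] swapped = {(byte) 0xf2, 0x03, 0x00, 0x01,
                          (byte) 0xf4, (byte) 0xf5, (byte) 0xf6, (byte) 0xf7};
        System.out.println(String.format("%04x", checksum(sample)));  // prints 220d
        System.out.println(String.format("%04x", checksum(swapped))); // prints 220d
    }
}
```

Because addition does not depend on order, reordering the 16-bit words leaves the checksum unchanged, which is one reason it counts as a weak check compared with a cryptographic hash.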

General recommendations

  1. Do not forget about the thread-safety properties of the MessageDigest class. It is not thread-safe, so do not share one instance between threads.
  2. Avoid using the MD5, SHA-0 and SHA-1 algorithms. These algorithms have been compromised and should no longer be used for any application that requires collision resistance, such as generating digital signatures or time stamps. They are also a poor fit for password storage.
  3. Always search for and check the latest news about compromises of your chosen algorithm and how they affect your particular case.