How to calculate the MD5 checksum of a file with Python

The MD5 checksum is a 128-bit hash value, usually written as 32 hexadecimal characters, that acts as a fingerprint of a file's content. It is not a digital signature, since no key is involved, but it is often used to verify that a file has not been altered or corrupted during transfer. Python provides simple and effective tools to calculate the MD5 checksum of a file through the hashlib module in the standard library. Below, we will explore how to use this module to calculate the MD5 checksum of a file in a few steps.

To avoid exhausting memory, especially with large files, it is advisable to read the file in blocks rather than all at once. In Python, we can do this with a loop that reads the file in blocks of a fixed size, for example 4096 bytes.

As each block is read, we update the hash object with it. Feeding the data block by block produces the same digest as hashing the whole content in one call, so we get the MD5 of the complete file without loading the entire content into memory.
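To see that block-wise updates give the same result as a single call, here is a minimal sketch on an in-memory byte string. The sample data and the 8-byte block size are chosen only for illustration.

```python
import hashlib

# Sample data, chosen only for illustration
data = b"The quick brown fox jumps over the lazy dog"

# One-shot: hash the whole byte string in a single call
one_shot = hashlib.md5(data).hexdigest()

# Incremental: feed the same bytes in 8-byte blocks
incremental = hashlib.md5()
for start in range(0, len(data), 8):
    incremental.update(data[start:start + 8])

# Both approaches yield the same digest
print(one_shot == incremental.hexdigest())  # True
```

The same property is what makes it safe to hash a file one block at a time.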


import hashlib

def calculate_md5(file_path):
    # Create an MD5 hash object
    md5_hash = hashlib.md5()

    # Open the file in binary read mode
    with open(file_path, "rb") as f:
        # Read the file in 4096-byte blocks until read() returns b"" (end of file)
        for block in iter(lambda: f.read(4096), b""):
            md5_hash.update(block)

    # Return the hex digest, i.e. the MD5 checksum as a hexadecimal string
    return md5_hash.hexdigest()

# Example usage
fpath = "path/to/your/file"
checksum = calculate_md5(fpath)
print("MD5 Checksum:", checksum)

Explanation:

  1. MD5 Object Creation: hashlib.md5() creates a hash object using the MD5 algorithm.

  2. File Opening: We open the file in binary read mode ("rb") to ensure that the file content is read as a sequence of bytes, regardless of the file type.

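  3. Block Reading: iter(lambda: f.read(4096), b"") calls f.read(4096) repeatedly and stops when it returns the empty bytes object b"", which signals the end of the file. We therefore read the file in blocks of at most 4096 bytes. This approach is especially useful for large files, as it keeps memory usage low.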

  4. Hash Update: For each block read, we update the hash object using md5_hash.update(block).

  5. MD5 Hash Calculation and Output: After reading the entire file, we get the MD5 checksum using md5_hash.hexdigest(), which returns the hash as a hexadecimal string.
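In practice, the resulting string is compared with a checksum published alongside the file. The following is a minimal sketch of that comparison; the names md5_of_file and verify_md5 and the block_size parameter are hypothetical helpers introduced here for illustration, not part of hashlib.

```python
import hashlib

def md5_of_file(file_path, block_size=4096):
    """Return the MD5 hex digest of a file, read block by block."""
    md5_hash = hashlib.md5()
    with open(file_path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            md5_hash.update(block)
    return md5_hash.hexdigest()

def verify_md5(file_path, expected_checksum):
    """Return True if the file's MD5 matches the expected hex string."""
    # hexdigest() returns lowercase hex, so normalize the expected value
    # (published checksums are sometimes uppercase or padded with spaces)
    return md5_of_file(file_path) == expected_checksum.strip().lower()
```

A call such as verify_md5("path/to/your/file", published_value) would then return True only when the file on disk matches the published checksum.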

Conclusion

The MD5 algorithm is fast and effective for detecting accidental corruption, but it is not suitable for applications that require a high level of security: it is not collision-resistant, and practical methods exist for constructing two different files with the same MD5 checksum. In security-critical applications, it is better to use more robust algorithms such as SHA-256. However, for basic integrity checks and to detect accidental changes to files, MD5 remains a practical and widely used choice.
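Switching to SHA-256 requires very little change, because hashlib exposes all of its algorithms through the same update/hexdigest interface. As a sketch, the block-reading approach can be generalized with hashlib.new(), which selects an algorithm by name; the function name calculate_checksum and its parameters are assumptions made for this example.

```python
import hashlib

def calculate_checksum(file_path, algorithm="sha256", block_size=4096):
    """Return the hex digest of a file using the named hashlib algorithm."""
    # hashlib.new() builds a hash object from an algorithm name,
    # e.g. "md5", "sha1" or "sha256"
    hash_obj = hashlib.new(algorithm)

    with open(file_path, "rb") as f:
        # Same block-wise reading as for MD5
        for block in iter(lambda: f.read(block_size), b""):
            hash_obj.update(block)

    return hash_obj.hexdigest()
```

Calling calculate_checksum(fpath) would return a SHA-256 digest, while calculate_checksum(fpath, "md5") would be expected to match the result of the MD5 function shown earlier.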
