ada's blog

compression algorithms

a deep dive into compression algorithms and how to notice them in hex
last updated at published at

Compression Algorithms

Update: 2023-07-11 - zenhax is offline, replaced links with archive.org links.

One of the things that my programmer friends often ask me about is how I can tell what kind of compression algorithm is used by a file. This is an interesting question, and I hope that this post will help you understand how I notice compression algorithms in hex.

I will not be going over the fundamentals of compression algorithms or go into detail about how they work.

The forum posts1 and reference docs that have taught me how to do this are referenced where appropriate.

I will explain some of the structure of how the compression algorithms are set up because I believe that understanding what these values mean, it will help you understand why they are there and how to notice them when the configuration values are anything but the defaults.

All magic values are written as byte sequences (i.e. big endian)

The Basics

The first thing you need to know is that compression algorithms are not magic. In many cases compression algorithms have a sanity check (a “magic” number) that is used to verify and set up the decompressor. In other cases parts of the data can be seen in the compressed data.

The Dreaded Lempel-Ziv Algorithm (Lz*)

The Lempel-Ziv algorithm is a compression algorithm that is used in many compression formats. It comes in a lot of flavors and figuring out which one is used can be difficult.

To figure out if a file might be compressed with an Lz algorithm, you should look for the following:

This is because LZ algorithms use a dictionary to store data that has been seen before, and the first byte (the “block”) is used to determine the length of the data to be copied from the dictionary.

I strongly suggest using comscan2 with quickbms3 to test what compression algorithm is used by a file when you encounter this and LZ4 (see below) does not work.

LZ4

LZ44 is a common compression algorithm used especially in video games during the 2010s.

LZ4’s block has the following format:

struct lz4_block {
    uint8_t encode_count : 4;
    uint8_t literal_count : 4;
}

The encode_count is the number of bytes to copy from the decompressed stream, and the literal_count is the number of bytes to copy from the compressed data. There is also a special case where either byte is 0F, which means that the next bytes are added to the count (until the byte is no longer FF). The minimal number of literals read is 4.

LZMA, and LZMA2

LZMA5 has no hard defined header6, though it will often start with 5D or 2C followed by a 32-bit integer that is the size of the inline dictionary data (usually zero.)

LZMA2 likewise has no header, though it will often start with 18 followed by compressed data. Note that this byte is optional.

I have not yet seen a raw LZMA stream in the wild beyond 7z files, likely due to it’s large overhead.

DEFLATE, Zlib, and GZip

DEFLATE7 is a compression algorithm that is used in many files, and you will most likely have already seen it if you do any amount of file analysis.

ZLib uses a DEFLATE block with a header, and ADLER32 as it’s checksum algorithm.

ZLib

ZLib8 preprends a 2 byte header to the compressed block (usually deflate). This usually is 78 9C or 78 DA

The first byte is the compression method, and the second byte has some flags. The compression method is usually 8, which is DEFLATE.

struct zlib_header {
    uint8_t compression_method : 4;
    uint8_t compression_info : 4;
    uint8_t checksum : 5;
    uint8_t dict : 1;
    uint8_t level : 2;
}

The compression_info is the log base 2 of the window size (the size of the dictionary used by the compressor), and the checksum a the checksum of the header. The dict flag is set if a dictionary is used, and the level is the compression level used by the compressor.

Knowing this, zlib header will always start with 78 if the compression method is DEFLATE.

DEFLATE

A “raw” DEFLATE9 stream is not very common, but it is still used in some places (especially in files produced by C# projects).

It usually starts with C# or E# but it’s a bit more complicated than that.

The DEFLATE block has the following format:

struct deflate_block {
    uint8_t final : 1;
    uint8_t type : 2;
}

The final flag is set if this is the last block in the stream, and the type is the type of block.

GZip

GZip10 is a ZLib stream with a well formed header.

GZip will always start with a magic number (1F 8B) as well as a compression method (8 for DEFLATE).

The header has the following format:

struct gzip_header {
    uint16_t magic;
    uint8_t compression_method;
    uint8_t flags;
    uint32_t timestamp;
    uint8_t xtra_flags;
    uint8_t os;
}

ZStandard

ZStandard11 (zstd) is a relatively new compression algorithm that is starting to be used in many places due to it’s ability to have very high compression ratios with specialized dictionaries.

Fortunately, ZStandard has a magic number that is used to identify the file, this usually is ## B5 2F FD with the unknown byte being the specific version.

The ZStandard header has the following format12:

struct zstd_header {
    uint32_t magic;
    uint8_t content_size_flag : 2;
    uint8_t single_segment_flag : 1;
    uint8_t unused : 1;
    uint8_t reserved : 1;
    uint8_t checksum_flag : 1;
    uint8_t dict_id_flag : 2;
}

ZDictionary

ZStandard might use a dictionary13 to compress the data, and the dictionary is stored either as a separate file, or in the same file as the compressed data. In some cases it might be in the executable itself (very rare!) Documentation on ZDict is sparse, however we know that the magic value is 37 A4 30 EC

The ZDict header has the following format:

struct zdict_header {
    uint32_t magic;
    uint32_t dict_id;
}

Oodle

Oodle14 is a proprietary compression format used in many games, and has a hardware encoder in the PS5.

Oodle will always start with #C if it is made with version 4 or higher. Version 4 Oodle files have the following format:

struct oodle_block_header {
    uint8_t magic : 4;
    uint8_t version : 2;
    bool copy : 1;
    bool reset : 1;
    uint8_t compression_type : 7;
    bool has_checksum : 1;
}

Compression type will be between 0 and 13 as of Oodle Version 9 (oo2core_9), and the checksum used is a modified Jenkins algorithm. From this we can deduce that the second byte will be between 00 and 0D, or 80 and 8D

Note that Oodle will still load version 3 and older files, which will start with #0, #1, #2, or #3 and will usually look like LZW or LZB.

Zip Signature Speedrun

Compression archives almost always have a signature at the start of the file. I’m adding them here for completeness.