It depends on the type of hash. For the cryptographic hashes used as checksums, changing a single byte is enough to break the match, because any change to the input produces a completely different output. Their intent is only to identify whether files are exact matches.
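To make the exact-match point concrete, here is a minimal Python sketch using SHA-256 from the standard library. The sample bytes and the helper name `differing_hex_chars` are made up for illustration; the only claim is that flipping one bit gives a different digest.

```python
import hashlib

# Hypothetical sample data standing in for a file's contents.
original = b"The quick brown fox jumps over the lazy dog"

# Flip a single bit in the first byte to simulate a one-byte change.
altered = bytearray(original)
altered[0] ^= 0x01
altered = bytes(altered)

digest_a = hashlib.sha256(original).hexdigest()
digest_b = hashlib.sha256(altered).hexdigest()


def differing_hex_chars(a: str, b: str) -> int:
    """Count the positions where two equal-length hex digests differ."""
    return sum(x != y for x, y in zip(a, b))


print(digest_a)
print(digest_b)
print("hex characters that differ:", differing_hex_chars(digest_a, digest_b))
```

The two digests share no useful relationship, which is why a checksum can only answer "is this the exact same file?" and nothing looser.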
However, the type of hashing used for CSAM is called a semantic hash (more commonly known as a perceptual hash). The intent of this type of hash is that similar content produces a similar (or identical) output. I can’t walk you through exactly how the hash is computed, but it is designed specifically so that minor alterations do not prevent identification.
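The idea of "similar content, similar hash" can be sketched with a toy average hash. This is not the algorithm any real detection tool uses (that isn't known here); it is a simplified stand-in, and the pixel grids and the function names `average_hash` and `hamming` are assumptions made up for the example.

```python
def average_hash(pixels):
    """Toy similarity-preserving hash: one bit per pixel, set if the pixel
    is brighter than the image's mean brightness."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]


def hamming(a, b):
    """Number of bit positions where two hashes differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical 4x4 grayscale "image": dark left half, bright right half.
image = [
    [10, 20, 200, 210],
    [15, 25, 205, 215],
    [12, 22, 202, 212],
    [18, 28, 208, 218],
]

# Minor alteration: brighten every pixel slightly.
brightened = [[min(255, p + 5) for p in row] for row in image]

# Genuinely different content: bright left half, dark right half.
different = [row[2:] + row[:2] for row in image]

print(hamming(average_hash(image), average_hash(brightened)))
print(hamming(average_hash(image), average_hash(different)))
```

Matching then becomes a question of whether the distance between two hashes falls under some threshold, rather than whether the hashes are byte-for-byte equal.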
If, for instance, I were pirating a video game, would packing it in an encrypted container along with a GB or two of downloaded YouTube videos be sufficient to defeat semantic hashing? What about taking that encrypted volume and spanning it across multiple files?
Encrypting it should be enough to defeat either hash.
Without encryption, I think it would depend on the implementation. I’m not aware of the specific limitations of the tools they use, but they’re built for photos and video, so they shouldn’t meaningfully generalize to other formats.