YSK about compression

@flint5436 · 2 years ago

tal · edit-2 2 years ago

YSK long noise videos can’t effectively be compressed

From the standpoint of loading down Reddit today, yes. But if we want to talk computer science theory and what compression algorithms one could build, that’s not really true.

There are two classes of compression – lossless compression and lossy compression.

Lossless compression retains an exact copy of the original data. Compress and then decompress and you get back the original.

So, okay. How can you compress data? I mean, if I have a byte of data, eight 1s or 0s, how can I use fewer than eight 1s or 0s to store those? For lossless compression, the answer is that you have to have some knowledge of what information it is that you’re storing. If you know that some pattern of length N comes up more frequently than others, then you can assign a shorter pattern of length M to represent it, and then use the old pattern of length N to represent something else that is less common. Lossless compression is just the art of reordering representations of data to more closely fit the frequency with which they arise: shorter for things that are relatively more common.
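To make the "shorter patterns for more common things" idea concrete, here is a minimal sketch of Huffman coding, one classic way of doing this. The function name `huffman_code` and the sample input are made up for illustration; real compressors layer a lot more on top.

```python
import heapq
from collections import Counter

def huffman_code(data: bytes) -> dict[int, str]:
    """Build a prefix code in which frequent byte values get shorter bit strings."""
    freq = Counter(data)
    if len(freq) == 1:
        # Degenerate case: a single distinct symbol still needs one bit.
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, tie_breaker, {symbol: code_so_far}).
    heap = [(weight, i, {sym: ""}) for i, (sym, weight) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie_breaker = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codes with 0 and 1.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {sym: "0" + code for sym, code in left.items()}
        merged.update({sym: "1" + code for sym, code in right.items()})
        heapq.heappush(heap, (w1 + w2, tie_breaker, merged))
        tie_breaker += 1
    return heap[0][2]

text = b"aaaaaaaabbbc"            # 'a' is far more common than 'b' or 'c'
code = huffman_code(text)
encoded_bits = sum(len(code[byte]) for byte in text)
original_bits = 8 * len(text)
```

With this input the common byte `a` ends up with a shorter code than the rarer `b` and `c`, so `encoded_bits` comes out well below `original_bits`. The sketch ignores the cost of storing the code table itself, which a real format would have to pay.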

If you’re wrong about that order, then lossless compression can make the representation of your data larger.

Now, technically the noise in there is actually probably very predictable, because it’s likely based off a pseudorandom number generator (PRNG). That isn’t really “random” – it’s just making numbers that look random from a single number that’s hard to predict. If the PRNG isn’t having more entropy injected into it over time, then all of the noise generated during the session comes down to that one small number. If you were clever and could figure out the seed – a small number, often something like 64 bits, and often seeded off something like the Unix time at the time that the random numbers started being generated, which makes it even more predictable than that – or at least the internal state of the PRNG, maybe 256 bits – you could basically store the content of the whole video in just a few bytes. However, it’s not always easy to determine that original state – in the case of cryptographically secure PRNGs, it’s specifically intended to be impractical.
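A minimal sketch of that "the whole video is just the seed" idea, assuming the noise came from Python's `random.Random` (a Mersenne Twister, not a cryptographically secure PRNG). The seed value, frame size, and function names here are hypothetical.

```python
import random

def noise_frame(rng: random.Random, width: int, height: int) -> bytes:
    """One black-and-white frame: every pixel is 0 (black) or 255 (white)."""
    return bytes(rng.choice((0, 255)) for _ in range(width * height))

def noise_video(seed: int, frames: int, width: int = 64, height: int = 48) -> list[bytes]:
    """Regenerate every frame from nothing but the seed."""
    rng = random.Random(seed)
    return [noise_frame(rng, width, height) for _ in range(frames)]

seed = 1234567890                  # hypothetical seed; this is all we would store
original = noise_video(seed, frames=10)
rebuilt = noise_video(seed, frames=10)
total_pixels = sum(len(frame) for frame in original)
```

`rebuilt` is identical to `original`, so a handful of bytes (the seed plus the frame count and dimensions) stands in for every pixel. The hard part in practice is recovering the seed from a video you did not generate yourself.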

However, we generally treat pseudorandom noise as if it were actually truly random, rather than just pseudorandom, which means that it’s totally unpredictable, and if that is the case, then you cannot losslessly compress it and make it smaller, not over a sufficient quantity of noise, because you can know nothing about the frequency with which a given pattern arises.
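You can see this effect with any general-purpose lossless compressor. A small sketch using Python's standard `zlib` module, with `os.urandom` standing in for "truly random" data (the sizes are arbitrary):

```python
import os
import zlib

random_data = os.urandom(100_000)        # unpredictable bytes from the OS
repetitive_data = b"blue" * 25_000       # same length, highly predictable

compressed_random = zlib.compress(random_data, 9)
compressed_repetitive = zlib.compress(repetitive_data, 9)

print(len(random_data), "->", len(compressed_random))
print(len(repetitive_data), "->", len(compressed_repetitive))
```

The repetitive input shrinks to a tiny fraction of its size, while the random input does not shrink at all and typically comes out slightly larger because of the container overhead.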

So, depending upon the source of noise used, we might be able to do lossless compression of noise: if pseudorandom noise was used (probably), and if we can figure out the seed that was used to generate that noise.

Okay, enough about lossless compression. Can we do lossy compression of noise?

And there the answer is…yeah, probably yes.

The way lossy compression works is that we have to know something about what information is actually “important” when we get around to actually using it. That lets us throw out some of the less-important information. What we get back, unlike with lossless compression, is not true to the original, but it’s a lot closer than if we just threw out information without regard for what’s important and what’s not. Lossy compression is often used to compress audio and video.

For a lot of things, there’s a lot of not-very-important information.

Let’s say that we’re looking at a video of noise. Your brain doesn’t care about every exact pixel there. It’s looking for shapes that remain across multiple frames and move together, so it can pick out objects and the like. Your brain just basically sees the noise as one big field of stuff of an approximate color changing at a given rate. None of the specifics of that noise matter. Basically, regardless of what seed was used to generate that noise, pretty much all noise with the given properties (black and white, 1 pixel size, changes every frame, N fps) looks pretty much identical. So a good form of lossy compression in a video codec would be to detect anything that looks like noise – and then just replace it with generated noise using a fixed seed. All noise looks pretty much identical to a human. So you’d get pretty much identical-looking output. Most codecs in wide use don’t try to detect noise this way, though AV1’s film grain synthesis works along similar lines: the encoder strips the grain, and the decoder regenerates similar-looking grain from a few parameters.
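Here is a toy sketch of that "detect noise, then regenerate it from a fixed seed" scheme. Everything in it is an assumption for illustration: the "detector" is just a check that zlib barely shrinks the frame, `FIXED_SEED` is an arbitrary constant, and no real codec is implemented this way. It needs Python 3.9+ for `randbytes`.

```python
import random
import zlib

FIXED_SEED = 42  # hypothetical constant shared by encoder and decoder

def looks_like_noise(frame: bytes, threshold: float = 0.9) -> bool:
    """Crude stand-in for a noise detector: noise barely shrinks under zlib."""
    if not frame:
        return False
    return len(zlib.compress(frame)) / len(frame) > threshold

def encode_frame(frame: bytes) -> tuple[str, bytes]:
    if looks_like_noise(frame):
        # Lossy path: keep only the frame length and discard the pixel values.
        return ("noise", len(frame).to_bytes(4, "big"))
    return ("raw", zlib.compress(frame))

def decode_frame(kind: str, payload: bytes) -> bytes:
    if kind == "noise":
        size = int.from_bytes(payload, "big")
        # Regenerate similar-looking (not identical) noise from the fixed seed.
        return random.Random(FIXED_SEED).randbytes(size)
    return zlib.decompress(payload)

noisy = random.Random(7).randbytes(64 * 48)   # grayscale noise frame
flat = bytes(64 * 48)                         # solid black frame

noisy_kind, noisy_payload = encode_frame(noisy)
flat_kind, flat_payload = encode_frame(flat)
decoded_noisy = decode_frame(noisy_kind, noisy_payload)
decoded_flat = decode_frame(flat_kind, flat_payload)
```

The noise frame is stored in four bytes and comes back as different noise of the same size, which is the lossy part; the flat frame round-trips exactly. A less naive version would keep one generator running across frames so that consecutive noise frames differ from each other.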

So video of noise is lossily-compressible.

Now, I will grant that this is unrelated to putting load on Reddit, but, hey, might as well start filling the Fediverse with useful information now.

skulblaka · 2 years ago

This is a good post. I don’t really have much more to add than that but I want to boost you with the interaction. Very well written and informational, and as far as I can tell, accurate.

@A_Toasty_Strudel · 2 years ago

I started reading and felt my brain start to numb a little. Is there an ELI5?

@SameOldJorts · 2 years ago

So say you have a picture, and it’s made up of pixels, and you want to send that picture to someone, but in order to do so, you have to make it smaller. You could send the most important bits and allow reconstruction on the receiver’s end, or you could somehow make it smaller without changing the information. So if your picture is four blue pixels, followed by 3 red and 2 yellow, you could send exactly that description (“4 blue, 3 red, 2 yellow”) instead of blue, blue, blue, blue, red, red, red, yellow, yellow. This would be lossless; GIF, PNG, etc. are generally lossless formats. JPEG is lossy compression, and it would be like telling your friend receiving the picture “I have a picture of a bird, here’s part of a beak, one wing, a tail, and one foot.” Your friend, being smart, can reconstruct the data that wasn’t sent (other wing and foot, body) because they have a good idea how the rest of the bird should look based on the parts they see. Lossy is better for getting files really small, but lossless is important if all the information needs to reach the receiver. Hope that helps.
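The blue/red/yellow example is run-length encoding. A minimal sketch of it follows; the function names are made up, and note that GIF and PNG actually use more elaborate schemes (LZW and DEFLATE), so this only shows the general idea.

```python
from itertools import groupby

def rle_encode(pixels):
    """Collapse each run of identical pixels into a (color, count) pair."""
    return [(color, len(list(run))) for color, run in groupby(pixels)]

def rle_decode(runs):
    """Expand (color, count) pairs back into the original pixel list."""
    return [color for color, count in runs for _ in range(count)]

pixels = ["blue"] * 4 + ["red"] * 3 + ["yellow"] * 2
runs = rle_encode(pixels)
print(runs)  # [('blue', 4), ('red', 3), ('yellow', 2)]
```

Decoding `runs` gives back exactly the nine original pixels, which is what makes it lossless.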

@A_Toasty_Strudel · 2 years ago

This was actually super informative! Thanks, my dude!

@flint5436 · 2 years ago

A fellow CS major, I see. No, I appreciate the thoroughness. You are right, I was trying to put it in layman’s terms and might have been a little too cursory.

I gotta disagree on the pseudorandomness tho. At least in Linux, /dev/random fills its entropy pool from device drivers. So there is no simple algorithm behind it where you can copy a seed. You would have to copy the system state and all external events happening (e.g. the Ethernet network traffic) to generate the same output.

@PixxlMan · 2 years ago

It’s likely that the compression uses a fixed bit rate, meaning that no matter how large the original file is, the file spat out by Reddit’s compressor won’t be any larger than any other video of the same length. The quality of a video with higher entropy (like a noise video) would simply be decreased until it fit into the allocated space. That means this wouldn’t work.
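As back-of-the-envelope arithmetic: with a fixed bit rate, output size depends only on duration. The 2 Mbit/s figure below is a made-up example, not Reddit's actual setting, and container overhead is ignored.

```python
def cbr_size_bytes(bitrate_bps: int, duration_s: float) -> int:
    """Approximate output size of a constant-bit-rate encode."""
    return int(bitrate_bps * duration_s / 8)

BITRATE = 2_000_000                              # hypothetical 2 Mbit/s target
ten_minute_video = cbr_size_bytes(BITRATE, 600)  # noise or not, same answer
print(ten_minute_video)                          # 150000000
```

The content of the video never enters the formula, which is the point: noise costs quality, not bytes.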

LazaroFilm · 2 years ago

If it’s VBR, one of two things can happen: either the compressor decides the noise can be averaged to a solid gray background, and the bit rate will be very low since it only has to display a solid gray card, or it deems each pixel worthy of staying distinct, which forces the compression ratio to drop and will raise the bit rate to the maximum.

@flint5436 · 2 years ago

Oh yeah good point, I didn’t think about server side compression. But I think noise videos might still be worse, since they provide no additional content for the platform.

@SkullHex2 · edit-2 2 years ago

deleted by creator

Trebach · 2 years ago

/dev/urandom will supply different seeds for the different function calls if you’re on Linux because it’s pulling from a rotating random number generator. Eventually it will repeat itself if there’s not enough random data being generated from /dev/random to pull from though.

For Windows, instead of hard-coding a number as the random number generator’s seed, use “time(0)”. It’s the number of whole seconds since January 1, 1970, so it changes every second, and your noise should be different than anyone else’s noise. Someone demonstrates it here https://www.youtube.com/watch?v=3jb8NNmooCM but just prints it to the command line.
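One caveat with a time-based seed, tying back to the point above about seeds being guessable: if someone knows roughly when the noise was generated, they can simply try every second in that window. A minimal sketch, assuming Python's `random.Random` (3.9+ for `randbytes`) as the generator; the timestamps and function names are hypothetical.

```python
import random

def first_bytes(seed: int, n: int = 16) -> bytes:
    """The first n bytes of noise produced from a given seed."""
    return random.Random(seed).randbytes(n)

def recover_seed(sample: bytes, approx_time: int, window_s: int = 3600):
    """Try every whole-second timestamp within +/- window_s of approx_time."""
    for candidate in range(approx_time - window_s, approx_time + window_s + 1):
        if first_bytes(candidate, len(sample)) == sample:
            return candidate
    return None

true_seed = 1_700_000_123          # hypothetical; stands in for int(time.time())
sample = first_bytes(true_seed)    # what an observer sees at the start of the noise
guess = recover_seed(sample, approx_time=1_700_000_000)
```

Here `guess` comes back equal to `true_seed` after a few thousand attempts, so a time-based seed makes the noise different from run to run but not hard to reproduce.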

@_MoveSwiftly · 2 years ago

Could you please add a “Why YSK:”? It’s rule #2. Thank you. :)

tal · edit-2 2 years ago

Removed by mod

@flint5436 · 2 years ago

Oh sorry, Jerboa does not show community descriptions, so I did not see the rules. But I am kind of curious tho, does it need to be worded like that? I wrote that you should know this if you want to keep the costs down, doesn’t that count? :P

@_MoveSwiftly · 2 years ago

Valid feedback, I’ll have to discuss it with other mods. It’s a nice way to state why a YSK is being posted, so it’s helpful to have and make the content easy to digest.

Codo · 2 years ago

This is fascinating. Thanks for taking the time to explain

onyx · edit-2 2 years ago

The noise videos could be from someone using that platform to store their own data.

This is happening on YouTube: there’s a tool that embeds data/files into a video and then uploads that to YouTube.
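A rough sketch of how files can be hidden in video frames at all: turn every bit into a black or white pixel, then threshold the pixels to read the bits back. This illustrates the general idea only and is not a description of how the linked tool works; the function names are made up.

```python
def bytes_to_pixels(data: bytes) -> list[int]:
    """Each bit becomes one pixel: 0 -> black (0), 1 -> white (255)."""
    return [255 if (byte >> shift) & 1 else 0
            for byte in data
            for shift in range(7, -1, -1)]

def pixels_to_bytes(pixels: list[int]) -> bytes:
    """Threshold at mid-gray so mildly distorted pixels still decode."""
    bits = [1 if value >= 128 else 0 for value in pixels]
    return bytes(
        sum(bit << (7 - i) for i, bit in enumerate(bits[start:start + 8]))
        for start in range(0, len(bits), 8)
    )

payload = b"hello"
pixels = bytes_to_pixels(payload)
# Simulate mild lossy-compression damage: whites get darker, blacks get lighter.
damaged = [value - 40 if value == 255 else value + 40 for value in pixels]
recovered = pixels_to_bytes(damaged)
```

The payload survives the simulated damage because each bit only has to land on the right side of mid-gray. To a viewer, a frame built this way from compressed or encrypted data looks a lot like the noise videos discussed in this thread.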

https://github.com/DvorakDwarf/Infinite-Storage-Glitch

nobodyspecial · 2 years ago

You know what would be even worse? Uploading 10 hour versions of “What’s Going On” the He-Man mix. With a bit of extra data or a few seconds less data to make it unique. Heck, converted to an animated image and set as a profile pic could work too.

https://www.youtube.com/watch?v=Kob0G2hE8IY – source material.