How does reverse engineering of software work?

@[email protected] · 1 year ago

How does reverse engineering of software work?

@[email protected] · 1 year ago

Software consists of instructions for a computer to do something. These are made to be easy to follow for a computer, not for a human.

Humans write software in a human-readable form, the source code. This then (usually) gets converted to a machine-readable form, called machine code (or bytecode for some languages).

Depending on the programming language and settings used, more or less of the original information is lost in the process. For some platforms (.NET, Java) you can get most of the structure back from the bytecode, sometimes even with most of the original variable and function names, and see relatively easily what the program does.
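To make the "names survive in bytecode" point concrete, here is a small sketch using Python rather than .NET or Java, since Python also compiles to bytecode and ships a disassembler in its standard library. The function `greet` and its variable names are made up for illustration.

```python
import dis

def greet(user_name):
    greeting = "hello " + user_name
    return greeting

# The compiled code object still carries the original names and constants.
code = greet.__code__
print(code.co_name)      # name of the function
print(code.co_varnames)  # names of arguments and local variables
print(code.co_consts)    # constants used by the function

# Human-readable listing of the bytecode instructions
dis.dis(greet)
```

The exact instruction listing differs between Python versions, but the variable names and the string constant show up in all of them, which is why this kind of bytecode is comparatively pleasant to reverse.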

For other languages (e.g. C/C++), even the structure is lost - you can’t even reliably tell which parts of the program belong to the same function. You can read the machine code, and it “clearly” says what it does, but trying to make sense of that mess is slow, error-prone, and you won’t fully understand every part (it’s just too much), so you will mostly be looking for parts that seem related to what you’re interested in. For example, if you’re looking for an encryption algorithm, you may look for code that opens two files, reads from one and writes to the other, then look for a piece of code “nearby” that’s doing a lot of math. Or for malware, you may want to focus on network connections. Since the software needs to talk to the operating system to make network connections, this tends to happen in a standardized way and you can quickly find the part of the code that talks to the network features of the OS.

You can also run the program step by step and observe what it does (possibly messing with it while doing so to see how that changes the behavior).

For an example of how machine code looks, what in source code would be ShowDialog('hello') could become:

put 0x1005f225 (your reverse engineering tool helpfully will add a note that this is the address where a text “hello” is stored) into register R1
increment stack pointer by 4
push R1 onto the stack
call 0x1000443C (you now look at the code there)
put the value of R1 into R4
if R1 is zero jump to 0x10004458
put R4 onto the stack
call the OS function to show a dialog (if you’re lucky your reverse engineering tool has identified this for you.)

(Made up inaccurate example just to illustrate the idea. It’s horrible to read.)
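The "helpful note" about where a text is stored can be imitated in a few lines. This is only a rough sketch of the idea behind string extraction (similar in spirit to the Unix `strings` tool), not how any particular reverse engineering tool does it. The `blob` below is invented data standing in for a piece of a real binary.

```python
import re

def find_strings(data: bytes, min_len: int = 4):
    """Return (offset, text) pairs for runs of printable ASCII in data."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [(m.start(), m.group().decode("ascii")) for m in pattern.finditer(data)]

# Made-up bytes: some non-text bytes with two readable strings in between
blob = b"\x55\x89\xe5\x00\x01hello\x00\xc3\x90\x90Show dialog\x00"

for offset, text in find_strings(blob):
    print(hex(offset), text)
```

Once you know the offset of an interesting string, you can search the code for places that refer to that address, which is often the fastest way to find the part you care about.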

@[email protected] · 1 year ago

Thanks a lot for this good explanation!

@Rednax · 1 year ago

People seem to equate reverse engineering with decompiling. Those are not the same.

To me, reverse engineering attempts to answer Why a piece of code does something, while just reading code attempts to answer What the code does. You attempt to reconstruct the decisions that led to the current behaviour of the software.

Even if you do have the source code, and can easily answer What the code does, you may still not know why.

For example: why did the Lemmy devs disable captchas in server version 0.18.0? It is easy to see in the code that they did, but if they left no documentation, it is hard to know why. And without knowing why, you cannot fix any problem they had with it. Unfortunately, most why-questions are a lot harder to answer than that one, mostly because the Lemmy devs are decent at communication.

@Zardoz · 1 year ago

Reverse engineering is more about understanding how a piece of software does something so you can better work with it, or make your own version of it. Typically requires a lot of time studying it, and usually goes hand in hand with decompiling. But decompiled source isn’t the cleanest and doesn’t give you the exact same code the devs have. It only gives you the low level order of operations.

Most of the time, knowing why requires understanding all the code from an architecture perspective, which typically requires being part of the internal decision making. You won’t get that unless you have the actual source code with good comments and documentation. All of which would be stripped out during compilation.

Whenever I reverse engineer something at work, it is usually because it’s a super old 3rd party software that’s out of support, and I need to see how it’s performing some task. I’m never able to get the context of why they do it a certain way, but I do get the how of it.

@WalrusByte · 1 year ago

While you are correct, I would note that OP didn’t ask “What is reverse engineering?”, they asked “How do you reverse engineer software?”. That typically starts with decompiling in some form. You’re right that it’s not the whole picture, but I would say “Decompiling and studying binaries” would be a satisfactory ELI5 answer to OP’s question.

@WalrusByte · 1 year ago

To understand this you need to know how code is compiled into machine code. So basically computers only understand ones and zeros, but that’s really hard for humans to work with. So we created something called assembly, which lets us write more human understandable words like “add” and “sub” for calculations, and each of those maps to a certain machine code instruction (AKA ones and zeros). But it turns out using just assembly was also pretty tedious, so they created languages like C, where another program called a compiler takes in C code, which is easier for humans to understand, and converts it to the equivalent assembly automatically.

So most software you run on the computer is a binary, meaning it’s a bunch of machine code that was previously compiled from some other language like C. You can disassemble these binaries back into assembly, which you can then manually read and convert back to a more human readable language. There are also other tools out there that make this process easier, but that’s the basic idea: take ones and zeros, convert them back into assembly, then try and figure out how it works from there.
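As a toy illustration of "take ones and zeros, convert them back into assembly", here is a tiny sketch that only knows a handful of one-byte x86 instructions (`nop`, `ret`, and `push`/`pop` of a 32-bit register). Real instructions are usually several bytes long and far messier to decode, so treat this as a picture of the idea, not a usable disassembler. The function name `disassemble` and the sample bytes are my own.

```python
# 32-bit x86 register names in their encoding order (0..7)
REGS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def disassemble(code: bytes):
    """Translate a few one-byte x86 opcodes into assembly text."""
    lines = []
    for offset, byte in enumerate(code):
        if byte == 0x90:
            text = "nop"
        elif byte == 0xC3:
            text = "ret"
        elif 0x50 <= byte <= 0x57:
            text = "push " + REGS[byte - 0x50]
        elif 0x58 <= byte <= 0x5F:
            text = "pop " + REGS[byte - 0x58]
        else:
            # Anything else is beyond this toy; show the raw byte instead
            text = f"db 0x{byte:02x}"
        lines.append(f"{offset:04x}: {text}")
    return lines

sample = bytes([0x55, 0x90, 0x5D, 0xC3])
for line in disassemble(sample):
    print(line)
```

The lookup-table approach is the whole trick: each byte pattern has a fixed meaning, so turning machine code into assembly is mechanical. The hard part is everything after that, when you try to work out what the assembly is for.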

@Rikudou_Sage · 1 year ago

Really depends on the kind of software and what exactly you’re trying to reverse engineer.

If it’s a piece of software communicating via the internet with something and you want to know what the communication looks like, you install another program that can capture all network traffic on your computer and look through the requests and their responses. Basically you do some action (like click a button) and watch the requests. From that you know what the button does and how to replicate it.
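Once you have a captured request, picking it apart is fairly mechanical. Below is a minimal sketch that splits a raw HTTP/1.1 request into its parts. The captured text, the host `example.com`, and the `/api/like` endpoint are all invented for illustration, and real traffic is usually encrypted, so the capture tool has to decrypt it before you get text like this.

```python
def parse_request(raw: str):
    """Split a raw HTTP/1.1 request into method, path, headers and body."""
    head, _, body = raw.partition("\r\n\r\n")
    request_line, *header_lines = head.split("\r\n")
    method, path, _version = request_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return {"method": method, "path": path, "headers": headers, "body": body}

# Hypothetical request seen after clicking a "like" button
captured = (
    "POST /api/like HTTP/1.1\r\n"
    "Host: example.com\r\n"
    "Content-Type: application/json\r\n"
    "\r\n"
    '{"post_id": 42}'
)

req = parse_request(captured)
print(req["method"], req["path"])
print(req["headers"])
print(req["body"])
```

With the method, path, headers and body in hand, replicating the button usually means sending the same request yourself and changing one value at a time to see what the server accepts.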

If you mean the source code of an app, I don’t know that much about that, but I know you can decompile the software (which means you take the app and turn it into source code) which produces a horrible looking source you then go through and look for what you’re interested in. But honestly, I don’t have a deep knowledge of how it’s done, definitely not enough to be explaining it to a 5yo.