• @[email protected]
    192 years ago

    Software consists of instructions for a computer to do something. These are made to be easy to follow for a computer, not for a human.

    Humans write software in a human-readable form, the source code. This then (usually) gets converted to a machine-readable form, called machine code (or bytecode for some languages).
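
    As a rough illustration of that conversion, here is a minimal sketch using Python, which compiles source to bytecode. The function `show_greeting` is made up for this example; `dis` is Python's standard disassembler module.

```python
import dis

def show_greeting():
    # Human-readable source: build a message and return its length.
    message = "hello"
    return len(message)

# The compiled form the Python interpreter actually executes.
raw_bytes = show_greeting.__code__.co_code
print(raw_bytes.hex())

# A disassembly listing: readable, but far less pleasant than the source.
dis.dis(show_greeting)
```

    The exact bytes and instruction names differ between Python versions, so treat the printed output as an example of the form, not as something stable.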

    Depending on the programming language and settings used, varying amounts of information are lost in the process. For some languages (.NET, Java) you can get most of the structure back from the bytecode, sometimes even with most of the original variable and function names, and see relatively easily what the program does.
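
    Python bytecode behaves like the "names survive" case, so it can serve as a small sketch of the idea (it is not .NET or Java, just a convenient stand-in). The function `greet` and its variables are hypothetical; the `co_*` attributes are standard fields of Python code objects.

```python
def greet(user_name):
    greeting = "hello " + user_name
    return greeting.upper()

code = greet.__code__

# All of this is recovered from the compiled code object, not from the source text.
print(code.co_name)      # name of the function
print(code.co_varnames)  # argument and local variable names
print(code.co_names)     # names of attributes and globals the code refers to
print(code.co_consts)    # constants such as string literals
```

    With this much metadata left in place, a decompiler has a lot to work with. Compiled C/C++ normally keeps none of the local names.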

    For other languages (e.g. C/C++), even the structure is lost: you can’t even reliably tell which parts of the program belong to the same function. You can read the machine code, and it “clearly” says what it does, but making sense of that mess is slow and error-prone, and you won’t fully understand every part (there is just too much of it). So you mostly look for the parts that seem related to what you’re interested in. For example, if you’re looking for an encryption algorithm, you might look for code that opens two files, reads from one and writes to the other, and then look for a piece of code “nearby” that does a lot of math. For malware, you might focus on network connections instead. Since software has to go through the operating system to make network connections, this happens in a standardized way, and you can quickly find the part of the code that calls the OS networking functions.
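
    One simple way to find such "interesting" spots is to pull readable text out of the binary and search it for keywords, similar in spirit to the Unix `strings` tool. This is a minimal sketch of that related trick, not a full analysis method; the sample bytes and the keyword list are made up for illustration.

```python
import re

def extract_strings(blob, min_len=4):
    """Return runs of printable ASCII characters at least min_len long."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [match.decode("ascii") for match in re.findall(pattern, blob)]

# Made-up stand-in for a compiled program: junk bytes with a few names embedded.
blob = b"\x00\x01\x8b\x05connect\x00\xff\x10send\x00\x02ab\x00hello world\x00"

found = extract_strings(blob)
print(found)

# Hypothetical keyword list: names that hint at network activity.
network_hints = [s for s in found if s in ("connect", "send", "recv")]
print(network_hints)
```

    In a real binary, names of imported OS functions often show up this way, which gives you a starting point for where to read the machine code.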

    You can also run the program step by step and observe what it does (possibly messing with it while doing so to see how that changes the behavior).
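
    Here is a small sketch of the step-by-step idea, using Python's built-in tracing hook `sys.settrace` instead of a machine-code debugger. The function `mystery` stands in for code you don't understand yet; it and the `steps` log are invented for this example.

```python
import sys

def mystery(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

steps = []  # (line offset inside mystery, snapshot of local variables)

def tracer(frame, event, arg):
    # Record the local variables each time a new line of mystery() is about to run.
    if frame.f_code is mystery.__code__ and event == "line":
        offset = frame.f_lineno - frame.f_code.co_firstlineno
        steps.append((offset, dict(frame.f_locals)))
    return tracer

sys.settrace(tracer)
result = mystery(3)
sys.settrace(None)

for offset, local_vars in steps:
    print(offset, local_vars)
print("result:", result)
```

    Watching `total` grow from step to step tells you what the loop computes without reading the code itself. A real debugger does the same with registers and memory.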

    As an example of what machine code looks like: what would be ShowDialog('hello') in source code could become

    • put 0x1005f225 into register R1 (your reverse engineering tool will helpfully add a note that this is the address where the text “hello” is stored)
    • increment the stack pointer by 4
    • push R1 onto the stack
    • call 0x1000443C (you now look at the code there)
    • put the value of R1 into R4
    • if R1 is zero, jump to 0x10004458
    • push R4 onto the stack
    • call the OS function to show a dialog (if you’re lucky, your reverse engineering tool has identified this for you)

    (Made-up, inaccurate example just to illustrate the idea. It’s horrible to read.)
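
    To show where a listing like that comes from, here is a toy sketch of what a disassembler does: it turns raw bytes into text and attaches notes to addresses it recognizes. The instruction encoding is entirely hypothetical (one opcode byte plus a 4-byte operand) and matches no real CPU; only the two addresses are taken from the example above.

```python
import struct

# Hypothetical toy encoding: 1 opcode byte + 4-byte little-endian operand.
MNEMONICS = {0x01: "load r1,", 0x02: "push r1", 0x03: "jz r1,"}
HAS_OPERAND = {0x01, 0x03}

# Notes a reverse engineering tool might attach to addresses it recognizes.
ANNOTATIONS = {0x1005F225: 'string "hello"'}

def disassemble(code):
    lines = []
    for offset in range(0, len(code), 5):
        opcode, operand = struct.unpack_from("<BI", code, offset)
        text = MNEMONICS.get(opcode, "??")
        if opcode in HAS_OPERAND:
            text += " 0x%08x" % operand
        if operand in ANNOTATIONS:
            text += "  ; " + ANNOTATIONS[operand]
        lines.append(text)
    return lines

program = (
    struct.pack("<BI", 0x01, 0x1005F225)
    + struct.pack("<BI", 0x02, 0)
    + struct.pack("<BI", 0x03, 0x10004458)
)

print(program.hex())  # what the "machine code" looks like on disk
for line in disassemble(program):
    print(line)
```

    Real instruction sets have variable-length instructions and many more opcodes, which is part of why telling code from data is hard in practice.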