In Computer Science, when do you learn the fundamentals of high level languages and the methodologies a compiler uses to create assembly instructions? What is the primary book used for this course? Like, if you’re using IDA or Ghidra and trying to piece together what is happening in binary execution, I want to know which structures to look for at this basic level.

I’m asking about the simple stuff like what you find in Arduino sketches with variables, type declarations, branching, looping, booleans, flags, interrupts etc. Also how these might differ across architectures like CISC/RISC, Harvard/von Neumann, and various platform specifics like unique instruction set architecture implementations.

I have several microcontrollers with Flash Forth running the threaded interpreter. I never learned to branch and loop in FF like I can in Bash, Arduino, or Python. I hope exploring the post topic will help me fill in the gap in my understanding using the good ol’ hacker’s chainsaw. If any of you can read between the lines of this inquiry and make an inference that might be helpful, please show me the shortcuts. I am a deeply intuitive learner who needs to build from a foundation of application rather than memorization or theory. TIA

  • @agent_flounder
    9 months ago

    You probably want to look for books on reverse engineering. And a book on assembly for your CPU.

    I learned assembly language for VAX-11 (this was like 30+ years ago) in a CS class. We also learned 6502 assembly in a computer engineering class. Neither book would help you. You want a book specific to whatever CPU you’re using.

    Now, I never took it, but friends in college took a CS Compilers course where they learned the basics of writing a compiler. That’s not quite what you’re talking about, though it might help.

    Trying to understand what a program does is reverse engineering. And a tool like IDA Pro would help you understand subroutines, variables, flow, library calls, and so on.

    A debugger will be invaluable for seeing a program execute one instruction at a time.

    You would need to know the assembly language for your CPU. And it would help to become familiar with certain patterns. I haven’t done much assembly (but I have done assembly on a few different CPUs) nor much reverse engineering so I’m not sure I can lend a whole lot of insight there.

    As you learn assembly instructions, you will start to understand how loops, subroutines, if/then/else, and other things are accomplished for your CPU.

    For example, if/then/else and loops are often accomplished with conditional branching. The conditions are based on CPU flags (bits in the Status Register) that are set by a comparison instruction. You’ll start to recognize how if/then/else and loops and other things are commonly implemented in assembly (without necessarily having to study the compiler; it will be obvious without knowing anything but assembly).
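A minimal sketch of that pattern in C. The first function is ordinary structured code; the second is the same logic rewritten with `goto` so that every decision is a comparison followed by a conditional jump, which is roughly the shape a disassembler shows. The function names and labels are made up for illustration, and a real compiler may order or fold the branches differently.

```c
#include <stddef.h>

/* Structured version: what you would normally write. */
int sum_positive(const int *values, size_t count)
{
    int total = 0;
    for (size_t i = 0; i < count; i++) {
        if (values[i] > 0) {
            total += values[i];
        }
    }
    return total;
}

/* Same logic flattened into "compare, then conditionally jump".
   The labels are hypothetical names for branch targets. */
int sum_positive_branchy(const int *values, size_t count)
{
    int total = 0;
    size_t i = 0;
loop_test:
    if (!(i < count)) goto loop_done;     /* compare i with count, leave loop if not lower */
    if (!(values[i] > 0)) goto skip_add;  /* compare value with 0, skip the add if <= 0    */
    total += values[i];
skip_add:
    i++;
    goto loop_test;                       /* unconditional jump back to the loop test */
loop_done:
    return total;
}
```

Both functions return the same result for the same input, so compiling them and comparing the two listings is one way to practice spotting the pattern.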

    Another example might be how C structs are implemented. Some CPUs provide convenient memory addressing modes for structs, some don’t. Nearly all I am familiar with provide a convenient way to reference arrays with a simple index.
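A small sketch of what those addressing modes compute, written out by hand in C. The `struct sensor` type and the function names are hypothetical. A field access is "base address plus a fixed offset", and an array access is "base address plus index times element size".

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record type, used only for illustration. */
struct sensor {
    uint8_t  id;
    uint16_t reading;
    uint32_t timestamp;
};

/* Field access done by hand: base address + fixed field offset. */
uint16_t reading_by_offset(const struct sensor *s)
{
    const unsigned char *base = (const unsigned char *)s;
    uint16_t value;
    memcpy(&value, base + offsetof(struct sensor, reading), sizeof value);
    return value;
}

/* Array access done by hand: base address + index * element size. */
uint32_t element_by_address(const uint32_t *array, size_t index)
{
    const unsigned char *base = (const unsigned char *)array;
    uint32_t value;
    memcpy(&value, base + index * sizeof(uint32_t), sizeof value);
    return value;
}
```

In a disassembly, the fixed offset usually shows up as a small constant next to a base register, and the scaled index as a shift or multiply (or a scaled addressing mode where the CPU has one).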

    Subroutines are jumps to a set location, and at the end of that code is a return instruction. Usually registers have to be saved when jumping and restored when returning. Arguments to the subroutine are passed either by value or by reference, pushed onto the stack or, on many CPUs, placed in registers, depending on the calling convention. The return value is provided through some convention (machines with lots of registers might always use one particular register for it).
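To make "by value" versus "by reference" concrete, here is a minimal C sketch with two made-up functions. Where the argument and return value physically live (stack or register) depends on the CPU's calling convention, so treat this as something to compile and inspect rather than a statement about any particular CPU.

```c
/* By value: the callee receives a copy, so the caller's variable is unchanged. */
int add_one_by_value(int n)
{
    n = n + 1;
    return n;          /* result comes back through the return-value convention */
}

/* By reference: the callee receives an address and writes through it. */
void add_one_by_reference(int *n)
{
    *n = *n + 1;       /* a load, an add, and a store through the pointer */
}
```

In the disassembly, the second function tends to stand out because it loads from and stores to an address it was handed instead of working only on its own copy.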

    I guess bottom line, learn assembly for your particular CPU, then take a crack at using a debugger and disassembler / reverse engineering tool.

    I’m not entirely sure I follow why that is needed to learn how to do branching in forth but I only vaguely remember that language. Maybe if I did it would be more clear.

    Anyway I hope this helps at least a little.

    • @[email protected]
      9 months ago

      A great way to learn would also be to write your own C programs, then disassemble them or use the -S compiler flag to see the result of the compilation. Play around with different optimization levels (-O) and compare the output.
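A function small enough that the generated assembly stays readable is a reasonable starting point. The file name `count_bits.c` and the function below are only an example; it has one loop and one condition, so the output should contain one backward jump and one conditional skip.

```c
/* count_bits.c: one loop, one condition, no library calls. */
unsigned count_set_bits(unsigned value)
{
    unsigned count = 0;
    while (value != 0) {        /* loop test: compare against zero     */
        if (value & 1u) {       /* condition: test the lowest bit      */
            count++;
        }
        value >>= 1;            /* shift the next bit into position    */
    }
    return count;
}
```

With GCC, something like `gcc -S -O0 count_bits.c -o count_O0.s` followed by `gcc -S -O2 count_bits.c -o count_O2.s` gives two listings to compare side by side.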

  • @abhibeckert
    9 months ago

    “What is the primary book used”

    There isn’t one. Most people don’t learn this stuff by reading a book.

    The best way to learn is by looking at actual assembly code, then researching what each instruction does. I wouldn’t start with actual compiler generated code. Being computer generated, it’s often quite messy and obviously undocumented. Best to start with easier to read code like the examples I’ve included below: a simple “print Hello World” for a CISC CPU (32-bit x86 Linux), then for a RISC CPU (32-bit ARM Linux).

    Notice CISC uses mov, int and xor, while RISC uses mov, ldr, and svc. You should look those up in a manual (plenty of free ones online) but in simple terms:

    • mov: copies a value from one place to another. RISC and CISC have an instruction with the same name, but they’re not identical: the x86 mov can also read or write memory, while the ARM mov only works on registers and small constants
    • int: means interrupt, essentially stop execution (for a moment) and hand execution over to other software (here, the kernel)
    • xor: an exclusive OR operation. XORing a register with itself, as in xor ebx, ebx, is a common way to set it to zero
    • ldr: is “load register”, it loads a value from memory into a register (here, the address of the string)
    • svc: means “supervisor call”, which is used in much the same way as int. The code is asking the kernel to do something (once to write to stdout, and once to terminate the program).

    ; 32-bit x86 (CISC), Linux, NASM syntax
    section .data
        helloWorld db 'Hello World',0xa  ; 'Hello World' (11 bytes) followed by a newline character
    
    section .text
        global _start
    
    _start:
        ; write(1, helloWorld, 12)
        mov eax, 4          ; system call number for sys_write
        mov ebx, 1          ; file descriptor 1 is stdout
        mov ecx, helloWorld ; pointer to the string to print
        mov edx, 12         ; length of the string to print (11 characters + newline)
        int 0x80            ; call kernel
    
        ; exit(0)
        mov eax, 1          ; system call number for sys_exit
        xor ebx, ebx        ; exit status 0
        int 0x80            ; call kernel
    

    @ 32-bit ARM (RISC), Linux, GNU assembler syntax ('@' starts a comment)
    .section .data
    helloWorld:
        .asciz "Hello World\n"      @ 12 bytes of text plus a terminating NUL
    
    .section .text
    .global _start
    
    _start:
        @ write(1, helloWorld, 12)
        mov r0, #1                  @ file descriptor 1 is stdout
        ldr r1, =helloWorld         @ pointer to the string to print
        mov r2, #12                 @ length of the string to print (NUL not written)
        mov r7, #4                  @ system call number for sys_write
        svc 0                       @ make system call
    
        @ exit(0)
        mov r0, #0                  @ exit status 0
        mov r7, #1                  @ system call number for sys_exit
        svc 0                       @ make system call
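For comparison, here is a sketch of the same two kernel requests made from C through the POSIX wrappers `write` and `_exit`. The function names `say_hello` and `finish` are made up for this example. Each call ends up loading a system call number and arguments into registers and then trapping into the kernel, much like the hand-written listings.

```c
#include <unistd.h>

static const char hello_world[] = "Hello World\n";

/* Asks the kernel to write the string to stdout.
   Returns the number of bytes written, or -1 on error. */
ssize_t say_hello(void)
{
    /* sizeof includes the terminating NUL, so subtract 1 to skip it. */
    return write(STDOUT_FILENO, hello_world, sizeof hello_world - 1);
}

/* Asks the kernel to terminate the process with exit status 0. */
void finish(void)
{
    _exit(0);
}
```

Calling `say_hello()` and then `finish()` from `main` and disassembling the result shows how much extra code a C runtime adds around the same two system calls.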
    
    
  • @marcos
    9 months ago

    I don’t understand…

    Are you trying to learn how to make compilers/interpreters? Are you trying to learn how to write assembly like the compiler does? Or are you trying to have a deeper understanding of those languages?

    Those are all very different things. (Computer architecture is also one very different thing, but from what I understand that’s not what you want.)

    • BoscoBear
      9 months ago

      Not op. Deeper understanding of how these compilers are written across different architectures yet share common compiled structures.

      • @marcos
        9 months ago

        “Deeper understanding of how these compilers are written” you can get in a compiler book. I’ve found a copy of the dragon book here: https://iitd-plos.github.io/col729/refs/ALSUdragonbook.pdf

        Currently, I’d recommend you read a monadic parser tutorial and skip the practical material about compiler parsing (the theory is still very useful). There are more modern books, more focused on semantics, but I can’t recall any to recommend.

        “Deeper understanding about compilers across different architectures” looks like an assembly course to me. If you want to compare RISC and CISC, you’ll probably want x86 assembly and something like MIPS. (Notice that you will probably never use any of those in practice. But any assembly you would use in practice is too complicated to start with.)

        But that “yet share common compiled structures” part, I have no idea at all. I’m not sure anybody formally studies this. You may want to read about the LLVM intermediate representation and how to create a backend for it.

        • BoscoBear
          9 months ago

          The LLVM is probably the most appropriate answer to my question. Books about the development of it would be outstanding.

    • @j4k3OP
      9 months ago

      I want to get a deeper understanding of how assembly maps to high level structures. FF has poor documentation in general, but I can compile my own Forth words using assembly. I don’t know assembly as a functional language but know the basics. I’m mostly looking for a way to better understand what FF is doing or write my own branching. I also want to better understand reverse engineering basics using Ghidra.

      • @marcos
        9 months ago

        Oh, ok. You want to learn PIC assembly.

        Forth is a fun language, in that most of what one would study about compilers does not apply to it at all. You would need some book specifically aimed at Forth.

        I don’t think you will get anything useful from computer science material. You need focused, technical material, not theory.

        Anyway, a processor manual is usually called a “datasheet”. (E.g. https://ww1.microchip.com/downloads/en/devicedoc/35007b.pdf) That will have the hardware information (instructions, interrupts, I/O, embedded devices, hardware flags, register types, etc).

        The types, variables, and control flow are defined by the language, not the hardware. And again, whatever Forth gives you will be highly unusual and probably not covered in a compilers book. I don’t have a good book on Forth to recommend.

        (I hope somebody gets a better recommendation than mine, because honestly, now that I understood your problem, this is quite useless. Sorry.)

  • @[email protected]
    9 months ago

    I learned through three things:

    1. writing some basic functions in assembly code by hand for a course (not many)
    2. implementing a basic compiler back-end in llvm (any similar IR or assembly target would do)
    3. learning the principles other people were using to write fast code (in my case game engine developers)

    The first two things helped me understand how common code constructs are translated to assembly, so I can do a rough projection in my head when skimming a C function. Nowadays you can get quite far just by playing around on godbolt.

    The third thing helps surface the less visible aspects of CPUs. After learning how a few low-level optimisations work, all the principles and explanations start to repeat, and 90% of them apply to every modern architecture. You can set out with specific high-level questions, like:

    • why is iteration faster with an array than a linked list?
    • what does vectorisation mean?
    • what is a “struct of arrays” optimisation?
    • why does the ECS pattern make game engines fast?

    Very quickly you’ll find lots of insightful articles and comments explaining things like CPU caching, prefetching, branch prediction, pipelining, etc.
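As one concrete instance of those questions, here is a minimal C sketch contrasting an "array of structs" layout with a "struct of arrays" layout. The `particle` types, `PARTICLE_COUNT`, and the function names are hypothetical. Both functions compute the same sum; the difference is only in how the data sits in memory, which is what the caching and vectorisation articles tend to discuss.

```c
#include <stddef.h>

#define PARTICLE_COUNT 4   /* hypothetical size, kept tiny for illustration */

/* Array of structs: each particle's fields sit next to each other. */
struct particle {
    float x, y, z;
    float mass;
};

/* Struct of arrays: each field is its own contiguous array. */
struct particles_soa {
    float x[PARTICLE_COUNT];
    float y[PARTICLE_COUNT];
    float z[PARTICLE_COUNT];
    float mass[PARTICLE_COUNT];
};

/* Reads one float out of every 16-byte record, skipping x, y, z each time. */
float total_mass_aos(const struct particle *items, size_t count)
{
    float total = 0.0f;
    for (size_t i = 0; i < count; i++) {
        total += items[i].mass;
    }
    return total;
}

/* Reads consecutive floats, so every byte fetched from memory is used. */
float total_mass_soa(const struct particles_soa *items, size_t count)
{
    float total = 0.0f;
    for (size_t i = 0; i < count; i++) {
        total += items->mass[i];
    }
    return total;
}
```

Pasting both functions into godbolt with optimisation enabled is a quick way to see whether the compiler treats the two loops differently on a given target.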

    I have no book recommendations for you. I’ve found all the best information is freely online in blogs and comment sections, and that the best way to direct my learning is to have a project (or get employed to do low-level stuff). Might be different for you though!

  • @solrize
    9 months ago

    For reverse engineering you probably have to study some assembly output of compilers since the methods of implementing stuff like C++ vtables can be a bit intricate. There are also some books you can read. This bundle has one that I haven’t looked at:

    https://www.fanatical.com/en/bundle/effective-cybersecurity-prevention-bundle

    There are decompilation tools that can recognize some of that stuff automatically too (idk if Ghidra does that).

    Handwritten asm code will generally not look like compiled code. The Forth interpreters you are looking at are probably a good place to start.
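Since the original question is about branching in a threaded Forth, here is a toy token-threaded interpreter in C as a sketch of the classic technique. It is not FlashForth's actual implementation, and every name in it (`run`, the `OP_*` tokens, the two demo functions) is made up. The point it illustrates is that IF/ELSE/THEN and BEGIN/WHILE/REPEAT traditionally compile down to just two primitives: a branch taken when the top of the stack is zero, and an unconditional branch.

```c
#include <stddef.h>

/* Hypothetical token set for a toy stack machine. */
enum op { OP_LIT, OP_SUB, OP_DUP, OP_ZBRANCH, OP_BRANCH, OP_HALT };

/* Inner interpreter: fetch a token, execute it, repeat.
   Returns the top of the data stack when OP_HALT is reached. */
int run(const int *code)
{
    int stack[16];
    int sp = 0;          /* number of items on the data stack      */
    size_t ip = 0;       /* index of the next token to be executed */

    for (;;) {
        switch (code[ip++]) {
        case OP_LIT:     stack[sp++] = code[ip++];                break;
        case OP_SUB:     sp--; stack[sp - 1] -= stack[sp];        break;
        case OP_DUP:     stack[sp] = stack[sp - 1]; sp++;         break;
        case OP_ZBRANCH: /* pop a flag; jump only if it is zero */
            if (stack[--sp] == 0) ip = (size_t)code[ip];
            else                  ip++;
            break;
        case OP_BRANCH:  ip = (size_t)code[ip];                   break;
        case OP_HALT:    return stack[sp - 1];
        default:         return -1;   /* unknown token */
        }
    }
}

/* Roughly what "flag IF 10 ELSE 20 THEN" compiles to. */
int run_if_else(int flag)
{
    const int code[] = {
        OP_LIT, flag,      /*  0: push the flag                         */
        OP_ZBRANCH, 8,     /*  2: IF: jump to the false part on zero    */
        OP_LIT, 10,        /*  4: true part                             */
        OP_BRANCH, 10,     /*  6: ELSE: jump over the false part        */
        OP_LIT, 20,        /*  8: false part                            */
        OP_HALT            /* 10: THEN: both paths meet here            */
    };
    return run(code);
}

/* Roughly "start BEGIN 1 - DUP WHILE REPEAT"; assumes start >= 1. */
int run_countdown(int start)
{
    const int code[] = {
        OP_LIT, start,     /*  0: push the counter                      */
        OP_LIT, 1,         /*  2: BEGIN: loop body starts here          */
        OP_SUB,            /*  4: counter - 1                           */
        OP_DUP,            /*  5: copy it so the test can consume one   */
        OP_ZBRANCH, 10,    /*  6: WHILE: leave the loop when zero       */
        OP_BRANCH, 2,      /*  8: REPEAT: jump back to BEGIN            */
        OP_HALT            /* 10: loop exit                             */
    };
    return run(code);
}
```

When reading a real Forth's source or disassembly, looking for the equivalents of these two branch primitives and the addresses stored after them is usually the quickest way to see how its control-flow words are built.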