• @Aceticon
    link
    30
    edit-2
    2 months ago

    First a fair warning: I learned this stuff 3 decades ago and I’ve actually been working as a programmer since then. I do believe the example I’ll provide still applies up to a point, though CPUs often implement strategies to make this less of a problem.

    =====

    CPU’s are internally like an assembly line or a processing pipeline, were the processing of an assembly instruction is broken down into a number of steps. A rough example (representative but not exactly for any specific CPU architecture) would be:

    • Step 1: fetch assembly instruction from memory
    • Step 2: fetch into the CPU data in memory that the instruction requires (if applicable).
    • Step 3: execute arithmetic or binary operation (if applicable).
    • Step 4: evaluate conditions (if applicable)
    • Step 5: write results to memory (if applicable)

    Now, if the CPU was waiting for all the steps to be over for the processing of an assembly opcode before starting processing of the next, that would be quite a waste since for most of the time the functionality in there would be available for use but not being used (in my example, the Arithmetic Processing Unit, which is what’s used in step #3, would not be used during the time when the other steps were being done).

    So what they did was get CPUs to process multiple opcodes in parallel, so in my example pipeline you would have on opcode on stage #1, another that already did stage #1 and is on stage #2 and so on, hence why I also called it an assembly line: at each step a “worker” is doing some work on the “product” and then passing it to the next “worker” which does something else on it and they’re all working at the same time doing their thing, only each doing their bit for a different assembly instruction.

    The problem with that technique is: what happens if you have an opcode which is a conditional jump (i.e. start processing from another point in memory if a condition is valid: which is necessary to have to implement things like a “for” or “while” loop or jumping over of a block of code in an “if” condition fails)?

    Remember, in the my example pipeline the point at which the CPU finally figures out if it should jump or not is almost at the end of the pipeline (step #4), so everything before that in the pipeline might be wrong assembly instructions being processed because, say, the CPU assumed “no-jump” and kept picking up assembly instructions from the memory positions after that conditional-jump instruction but it turns out it does have to jump so it was supposed to be processing instructions from somewhere else in memory.

    The original naive way to handle this problem was to not process any assembly instructions after a conditional jump opcode had been loaded in step #1 and take the processing of the conditional jump through each step of the pipeline until the CPU figured out if the jump should occur or not, by which point the CPU would then start loading opcodes from the correct memory position. This of course meant every time a conditional jump appeared the CPU would get a lot slower whilst processing it.

    Later, the solution was to do speculative processing: the CPU tried to guess if it would the condition would be true (i.e. jump) or false (not jump) and load and start processing the instructions from the memory position matching that assumption. If it turned out the guess was wrong, all the contents of the pipeline behind that conditional jump instruction would be thrown out. This is part of the reason why the pipeline is organised in such a way that the result of the work only ever gets written to memory at the last step - if it turns out it was working in the wrong instructions, it just doesn’t do the last step for those wrong instructions. This is in average twice as fast as the naive solution (and better guessing makes it faster still) but it still slowed down the CPU every time a conditional jump appeared.

    Even later the solution was to do the processing of both branches (i.e. “jump” and “no-jump”) in parallel and then once the condition had been evaluated throw out the processing for the wrong branch and keep doing the other one. This solved the speed problem but at the cost of having double of everything, plus had some other implications on things such as memory caching (which I’m not going to go into here as that’s a whole other Rabbit Hole)

    Whilst I believe modern CPUs of the kind used in PCs don’t have this problem (and probably also at least ARM7 and above), I’ve actually been doing some Shader programming of late (both Computing and Graphics Shaders) and if I interpreted what I read correctly a version of this kind of problem still affected GPUs not that long ago (probably because GPUs work by having massive numbers of processing units which work in parallel, so by necessity they are simple) though I believe nowadays it’s not as inadvisable to use “if” when programming shaders as it used to be a few years ago.

    Anyways, from a programming point of view, this is the reason why C compilers have an optimization option of doing something called “loop unrolling” - if you have a “for” loop with a fixed number of iterations known at compile time - for example for(int i = 0; i < 5; i++){ /* do stuff */ } - the compiler instead of generating in assembly a single block of code with the contents of the “for” loop and a conditional jump at the end, will instead “unroll the loop” by generating the assembly for the body of the loop as many times as the loop would loop - so in my example the contents of that “for” loop would end up as 5 blocks in assembly each containing the assembly for the contents, one after the other, the first for i=0, the next for i=1 and so on.

    As I said, it’s been a long time since I’ve learned this and I believe nowadays general CPUs implement strategies to make this a non-problem, but if you’re programming microcontrollers or doing stuff like Compute Shaders to run on GPUs, I believe it’s actually the kind of thing you still have to take in account if you want max performance.

    Edit: Just remembered that even the solution of doing the parallel execution of both branches doesn’t solve everything. For example, what if you have TWO conditional jump instructions one after the other? Theoretically would need almost 4 or everything to handle parallel execution for it. How about 3 conditional jumps? “Is you nested for-loop screwing your performance? More news at 11!”. As I said, this kind of stuff is a bit of a Rabbit Hole.