• @kromem
    link
    English
    10
    edit-2
    7 months ago

    The network architecture seems to create a virtualized hyperdimensional network on top of the actual network nodes, so the node precision really doesn’t matter much as long as quantization occurs in pretraining.

    If it’s post-training, it’s degrading the precision of the already encoded network, which is sometimes acceptable but always lossy. But being done at the pretrained layer it actually seems to be a net improvement over higher precision weights even if you throw efficiency concerns out the window.

    You can see this in the perplexity graphs in the BitNet-1.58 paper.