Inside GPT-3
I’ve got a background in deep learning and I still struggle to understand the attention mechanism. I know it’s a key/value store but I’m not sure what it’s doing to the tensor when it passes through different layers.
@behohippy @saint Instead of modeling the sequence timestep by timestep, attention lets us feed the whole sequence through the network in parallel, much like a fully connected layer. The positional encoding is what tells the model where each token sits in the sequence, and keys that receive low attention weights contribute little to the output…
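To make that concrete, here is a minimal sketch of single-head scaled dot-product attention in NumPy. It's illustrative only: the function name, shapes, and toy data are assumptions, and real GPT-3 adds multiple heads, learned query/key/value projections, and causal masking, none of which are shown here.

```python
# Minimal single-head scaled dot-product attention sketch (NumPy).
# Names and shapes are illustrative, not GPT-3's actual implementation.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d) arrays; in self-attention all three come from
    # learned linear projections of the same input sequence.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # how much each query "looks at" each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V  # weighted mix of values, one row per query

# Toy usage: 4 tokens, model width 8. Positional information would be added
# to the token embeddings before the projections, since attention itself is
# order-agnostic; here we just use random vectors.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8): every position is processed in parallel
```

The key/value framing in the question above maps directly onto `K` and `V` here: each query retrieves a soft, weighted blend of the values rather than a single stored entry.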
What are you eating that needs a napkin that large?