Wait, is this implementation just wrong? "num_channel" is the embed_d, and K and Q should be applied to every embedded vector. Here each embedded vector has length 1, so K, Q and V should each be a single scalar, applied like torch.matmul(x_i, self.K), i.e. [sequence_dim, 1] x [1, 1]. This is also intuitive because the sequence can have a different size and self-attention needs to work anyway.
If you instead want to treat "num_inputs" as the embed_d, then the attention matrix W would be 1x1, because there is only one 768-dimensional vector; but instead the matrix W here is [num_inputs, num_inputs].
Now that you have updated the code, I think you have just flipped the error; it's still there, just doing the opposite (still wrong) thing. For example, if "num_steps" is 768 and "num_inputs" is 1, then K, Q, V are 1x1, fine, but then the attention matrix cannot also be 1x1; it needs to be 768x768. You can also do the opposite: K, Q, V being 768x768 and the attention matrix 1x1. Before, the error was that you made both the K, Q, V and the attention matrix 768x768 by applying the matrices on the wrong dimension; now the error is that you have made both the K, Q, V and the attention matrix 1x1.
For a simple fix, just make the attention matrix W [num_steps, num_steps] instead of [num_inputs, num_inputs]; then "y_i = torch.matmul(x_v_i, w_i)" becomes "y_i = torch.matmul(w_i, x_v_i)", and it should be correct.
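To make the shapes concrete, here is a minimal sketch of the suggested fix in numpy (the repo uses PyTorch, but the math is the same; the toy sizes and variable names are made up for illustration, not taken from the repo):

```python
import numpy as np

num_steps, embed_d = 4, 3  # toy sizes; the thread's actual case is 768 steps

rng = np.random.default_rng(0)
x = rng.standard_normal((num_steps, embed_d))

# K, Q, V project each embedded vector, so they live on the embed dim
K = rng.standard_normal((embed_d, embed_d))
Q = rng.standard_normal((embed_d, embed_d))
V = rng.standard_normal((embed_d, embed_d))

x_q, x_k, x_v = x @ Q, x @ K, x @ V  # each [num_steps, embed_d]

# The attention matrix W is [num_steps, num_steps], NOT [num_inputs, num_inputs]
scores = x_q @ x_k.T / np.sqrt(embed_d)
w = np.exp(scores - scores.max(axis=-1, keepdims=True))
w /= w.sum(axis=-1, keepdims=True)  # row-wise softmax

# Attention is applied on the step dimension: y = W @ V(x)
y = w @ x_v  # [num_steps, embed_d]

assert w.shape == (num_steps, num_steps)
assert y.shape == (num_steps, embed_d)
```

Because W is built from the steps themselves, this works for any sequence length, which is the whole point raised above.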
I just checked and the attention matrix is 768x768. Basically, I reshape each sample to be a single step with an embed size of 768. I hope this makes sense. I may rewrite the README later tonight to make this point clear.
If you have a single embedded vector, it should be clear that the self-attention matrix cannot be an arbitrary NxN but only 1x1; follow the fix I added to my previous comment.
Also, if K, Q, V and the self-attention matrix have the same dimensions even when the embed_d and the sequence dimension have different sizes, there is something wrong.
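This gives a quick sanity check one could write; a numpy sketch (function and variable names are illustrative, not from the repo):

```python
import numpy as np

def attention_shapes(num_steps, embed_d):
    """Return the K/Q/V weight shape and the attention-matrix shape."""
    x = np.zeros((num_steps, embed_d))
    Wq = np.zeros((embed_d, embed_d))   # K, Q, V live on the embed dim
    scores = (x @ Wq) @ (x @ Wq).T      # attention lives on the step dim
    return Wq.shape, scores.shape

kqv, attn = attention_shapes(num_steps=768, embed_d=1)
assert kqv == (1, 1)        # K, Q, V are 1x1 scalars here...
assert attn == (768, 768)   # ...while the attention matrix is 768x768
```

When embed_d and the sequence length differ, the two shapes must differ too; if they come out equal, a matrix is being applied on the wrong dimension.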
I made the same mistake recently! It doesn't help that there are so many poorly documented repos that make similar mistakes (and, worse yet, half-baked Medium articles that blatantly give false information). Directly translating papers to PyTorch is hard too; I really wish there were a better API to express these models in, one that aligns better with how they're actually documented in papers.
No one here actually checks anything before upvoting it. Having said that, this repo seems worthwhile and I appreciate the effort of OP. It's open source, which empowers people to fix such problems themselves and make a PR, or at least open an issue.
Once I got this feedback from GaggiX, I was actually looking on HN for a way to remove my post, but then discovered I could not! So I have tried to fix it as soon as possible.
For the best, IMO. That was a great interaction, and anybody running into issues related to this problem in the future actually stands a chance of finding this thread now.