Hacker News

Wait, is this implementation just wrong? "num_channel" is the embed_d, and K and Q should be applied to every embedded vector. Here each embedded vector has length 1, so K, Q and V should each be a single scalar, applied like torch.matmul(x_i, self.K): [sequence_dim, 1] x [1, 1]. This is also intuitive because the sequence can have a different size and self-attention needs to work anyway.

If you instead want to treat "num_inputs" as embed_d, then the attention matrix W would be 1x1, because there is only one 768-dimensional vector; but the matrix W here is [num_inputs, num_inputs].
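The degenerate case the comment describes can be sketched like this (a minimal illustration with placeholder names, not the repo's actual code): with an embedding size of 1, the K, Q, V projections are 1x1 scalars, yet the attention matrix is still [seq_len, seq_len] and the same code works for any sequence length.

```python
import torch

# Placeholder shapes: each step embeds to a length-1 vector,
# so K, Q, V are 1x1 (scalars).
seq_len, embed_d = 5, 1
x = torch.randn(seq_len, embed_d)

K = torch.randn(embed_d, embed_d)  # 1x1
Q = torch.randn(embed_d, embed_d)  # 1x1
V = torch.randn(embed_d, embed_d)  # 1x1

# Applied per embedded vector, as in torch.matmul(x_i, self.K):
# [seq_len, 1] x [1, 1] -> [seq_len, 1]
q, k, v = x @ Q, x @ K, x @ V

# The attention matrix is [seq_len, seq_len], independent of embed_d.
w = torch.softmax(q @ k.T, dim=-1)
y = w @ v  # [seq_len, 1]
```

Because nothing here hardcodes `seq_len`, the same code runs unchanged for sequences of any length, which is the commenter's intuition.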



You may be right. I was not expecting this to be at the top of HN

UPDATE: I have updated the repo, attempting to address the issue raised by GaggiX


Now that you have updated the code, I think you have just flipped the error; it's still there, just doing the opposite (still wrong) thing. For example, if "num_steps" is 768 and "num_inputs" is 1, then K, Q, V are 1x1, which is fine, but then the attention matrix cannot also be 1x1; it needs to be 768x768. You can also do the opposite: K, Q, V being 768x768 and the attention matrix 1x1. Before, the error was that you made both K, Q, V and the attention matrix 768x768 by applying the matrices along the wrong dimension; now the error is that you have made both K, Q, V and the attention matrix 1x1.

For a simple fix, just make the attention matrix W [num_steps, num_steps] instead of [num_inputs, num_inputs]; then "y_i = torch.matmul(x_v_i, w_i)" becomes "y_i = torch.matmul(w_i, x_v_i)", and it should be correct.
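The proposed fix amounts to a standard single-head self-attention, which can be sketched as follows (shapes and names like `num_steps` and `embed_d` are placeholders, not necessarily the repo's actual variables):

```python
import torch

num_steps, embed_d = 4, 768
x = torch.randn(num_steps, embed_d)  # one embedded vector per step

# Projection matrices act on the embedding dimension: [embed_d, embed_d]
Q = torch.randn(embed_d, embed_d)
K = torch.randn(embed_d, embed_d)
V = torch.randn(embed_d, embed_d)

q, k, v = x @ Q, x @ K, x @ V  # each [num_steps, embed_d]

# Attention matrix W is [num_steps, num_steps]: one weight per pair of steps
w = torch.softmax(q @ k.T / embed_d ** 0.5, dim=-1)

# W multiplies from the left, matmul(w, x_v), mixing information across steps
y = w @ v  # [num_steps, embed_d]
```

Note that each row of `w` sums to 1 after the softmax, and the output keeps the input's shape, so stacking layers works for any sequence length.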


I just checked and the attention matrix is 768x768. Basically, I reshape each sample to be a single step with an embed size of 768. I hope this makes sense. I may rewrite the README later tonight to make this point clear.

Let me know if you think I still don't get it!


If you have a single embedded vector, it should be clear that the self-attention matrix cannot be an arbitrary NxN but only 1x1; follow the fix I added to my previous comment.

Also, if K, Q, V and the self-attention matrix have the same dimensions even when embed_d and the sequence dimension have different sizes, something is wrong.
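That sanity check can be made mechanical: pick embed_d and the sequence length to be different, and verify the projection matrices and the attention matrix come out with different shapes (placeholder names below, not the repo's code):

```python
import torch

# Deliberately different sizes so a shape mix-up is impossible to miss
seq_len, embed_d = 10, 768
x = torch.randn(seq_len, embed_d)

Q = torch.randn(embed_d, embed_d)
K = torch.randn(embed_d, embed_d)

w = torch.softmax((x @ Q) @ (x @ K).T / embed_d ** 0.5, dim=-1)

assert Q.shape == (embed_d, embed_d)   # projections: [embed_d, embed_d]
assert w.shape == (seq_len, seq_len)   # attention: [seq_len, seq_len]
assert Q.shape != w.shape              # they must differ here
```

If the projections and the attention matrix end up with the same shape in this setup, one of them is being applied along the wrong dimension.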


Okay, I think it is fixed. I appreciate your help. If you see this, let me know if the issue still exists!


Now it should be correct.


I made the same mistake recently! It doesn't help that there are so many poorly documented repos that make similar mistakes (and, worse yet, half-baked Medium articles that blatantly give false information). Directly translating papers to PyTorch is hard too; I really wish there were a better API for expressing these models, one that aligned better with how they're actually documented in papers.


[flagged]


"Don't be snarky."

"Edit out swipes."

https://news.ycombinator.com/newsguidelines.html


No one here actually checks anything before upvoting it. Having said that, this repo seems worthwhile and I appreciate the effort of OP. It's open source, which empowers people to fix such problems themselves and make a PR, or at least file an issue.


The discussions are sometimes more interesting than the submission itself.


Once I got this feedback from GaggiX, I actually looked on HN for a way to remove my post, but discovered I could not! So I tried to fix it as soon as possible.


For the best, IMO. That was a great interaction, and anybody running into this problem in the future actually stands a chance of finding this thread now.




