GPUs are great at doing the same math on every item of a large multidimensional array.
Therefore, unsurprisingly, the per-item cost of inference on a batch of items is significantly lower at a batch size of, say, 8 than at 1. (In the case of Transformers there are further gains to be made, because roughly half of the attention calculations for token k+1 are identical to those for token k and can easily be reused by writing the formulas a certain way; the keyword to look for is "causal attention mask".)
In any reasonable GPU inference setup the weights would be preloaded.
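As a concrete illustration, here is a minimal timing sketch (assuming PyTorch and, ideally, a CUDA device; the layer size and iteration counts are arbitrary choices for illustration, not anything from this thread): with the weights loaded once, the per-item time of a forward pass drops sharply as the batch grows.

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
# Weights are loaded onto the device once, up front ("preloaded").
layer = torch.nn.Linear(4096, 4096, bias=False).to(device)

@torch.no_grad()
def per_item_ms(batch_size, iters=100):
    x = torch.randn(batch_size, 4096, device=device)
    for _ in range(10):  # warm-up so one-time kernel setup is excluded
        layer(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        layer(x)
    if device == "cuda":
        torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    return (time.perf_counter() - start) / (iters * batch_size) * 1000

print(f"batch size 1: {per_item_ms(1):.4f} ms per item")
print(f"batch size 8: {per_item_ms(8):.4f} ms per item")  # typically much lower
```

On most GPUs the batch-of-8 run finishes in well under 8x the batch-of-1 time, which is exactly the per-item saving described above.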
Indeed, GPUs are great at doing the same calculation in parallel. But if it were just that, there would be enough opportunity to parallelise even without doing the exact same calculation multiple times.
The main reason I can come up with for why doing the same calculation 8 times in parallel is cheaper than doing it 8 times sequentially is that you benefit from better locality of reference: the weights are fetched from memory once and reused for all 8 items, instead of being fetched 8 times.
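A rough numpy sketch of that locality argument (sizes are illustrative): expressed as one matrix-matrix product, the weight matrix is read from memory once and its tiles are reused across all 8 items, instead of being streamed in for 8 separate matrix-vector products.

```python
import numpy as np

d = 4096
W = np.random.randn(d, d).astype(np.float32)   # the "weights"
X = np.random.randn(8, d).astype(np.float32)   # a batch of 8 input vectors

# Sequential: 8 matrix-vector products; W is streamed from memory each time.
ys_seq = [x @ W.T for x in X]

# Batched: one matrix-matrix product; tiles of W stay in cache/registers and
# are reused across all 8 rows, so far fewer bytes move per FLOP.
Y_batch = X @ W.T

assert np.allclose(np.stack(ys_seq), Y_batch, rtol=1e-4, atol=1e-3)

# Rough arithmetic intensity (FLOPs per byte of weight traffic):
flops = 2 * 8 * d * d
print("sequential:", flops / (8 * W.nbytes), "FLOPs/byte")  # ~0.5
print("batched:   ", flops / W.nbytes, "FLOPs/byte")        # ~4.0
```

The FLOP count is identical either way; only the bytes moved per FLOP change, which is the locality-of-reference gain.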
As I said, the attention step is O(n^2) per token when generating sequentially from scratch (so O(n^3) for the whole sequence), but O(n^2) in total when calculating the entire sequence in one parallel pass, where n is the length of the sequence.
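A small numpy sketch of that claim (single attention head, illustrative names): one causally masked pass over the whole sequence is O(n^2), while recomputing attention from scratch at each generation step costs O(k^2) at step k, i.e. O(n^3) overall. The final assert shows the prefix work is identical across steps, which is what caching and reuse exploit.

```python
import numpy as np

def causal_attention(Q, K, V):
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                  # n x n: the O(n^2) part
    scores[np.triu_indices(n, k=1)] = -np.inf      # causal mask: no lookahead
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)              # row-wise softmax
    return w @ V

n, d = 16, 32
Q, K, V = (np.random.randn(n, d) for _ in range(3))

# One parallel pass over the whole sequence: O(n^2) total.
full = causal_attention(Q, K, V)

# Sequential generation, recomputing the whole prefix at each step:
# O(k^2) at step k, hence O(n^3) total.
step_outputs = [causal_attention(Q[: k + 1], K[: k + 1], V[: k + 1])[-1]
                for k in range(n)]

# Row k of the parallel result equals the k-th sequential output: the
# prefix work is identical, which is why it can be cached and reused.
assert np.allclose(full, np.stack(step_outputs))
```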