GPUs are great at doing the same math on every item of a large multidimensional array.
Therefore, unsurprisingly, the per-item cost of inference on a batch of items is significantly lower at a batch size of, say, 8 than at 1. (In the case of Transformers there are further gains to be made, because roughly half of the attention calculations for token k+1 are identical to those for token k and can easily be reused by writing the formulas a certain way; the keyword to look for is "causal attention mask".)
In any reasonable GPU inference setup the weights would be preloaded.
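As a concrete illustration, here is a minimal timing sketch (assuming PyTorch and, ideally, a CUDA device; the layer size and iteration counts are arbitrary choices for illustration, not anything from this thread): with the weights loaded once, the per-item time of a forward pass drops sharply as the batch grows.

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
# Weights are loaded onto the device once, up front ("preloaded").
layer = torch.nn.Linear(4096, 4096, bias=False).to(device)

@torch.no_grad()
def per_item_ms(batch_size, iters=100):
    x = torch.randn(batch_size, 4096, device=device)
    for _ in range(10):  # warm-up so one-time kernel setup is excluded
        layer(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        layer(x)
    if device == "cuda":
        torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    return (time.perf_counter() - start) / (iters * batch_size) * 1000

print(f"batch size 1: {per_item_ms(1):.4f} ms per item")
print(f"batch size 8: {per_item_ms(8):.4f} ms per item")  # typically much lower
```

On most GPUs the batch-of-8 run finishes in well under 8x the batch-of-1 time, which is exactly the per-item saving described above.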
Indeed, GPUs are great at doing the same calculation in parallel. But if it were just that, there would be enough opportunity to parallelise even without doing the exact same calculation multiple times.
The main reason I can come up with for why doing the same calculation 8 times in parallel is cheaper than doing it 8 times sequentially is that you benefit from better locality of reference: the weights are fetched from memory once and reused for all 8 items, instead of being fetched 8 times.
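A rough numpy sketch of that locality argument (sizes are illustrative): expressed as one matrix-matrix product, the weight matrix is read from memory once and its tiles are reused across all 8 items, instead of being streamed in for 8 separate matrix-vector products.

```python
import numpy as np

d = 4096
W = np.random.randn(d, d).astype(np.float32)   # the "weights"
X = np.random.randn(8, d).astype(np.float32)   # a batch of 8 input vectors

# Sequential: 8 matrix-vector products; W is streamed from memory each time.
ys_seq = [x @ W.T for x in X]

# Batched: one matrix-matrix product; tiles of W stay in cache/registers and
# are reused across all 8 rows, so far fewer bytes move per FLOP.
Y_batch = X @ W.T

assert np.allclose(np.stack(ys_seq), Y_batch, rtol=1e-4, atol=1e-3)

# Rough arithmetic intensity (FLOPs per byte of weight traffic):
flops = 2 * 8 * d * d
print("sequential:", flops / (8 * W.nbytes), "FLOPs/byte")  # ~0.5
print("batched:   ", flops / W.nbytes, "FLOPs/byte")        # ~4.0
```

The FLOP count is identical either way; only the bytes moved per FLOP change, which is the locality-of-reference gain.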
As I said, the attention step is O(n^2) per token when generating sequentially from scratch (so O(n^3) for the whole sequence), but O(n^2) in total when calculating the entire sequence in one parallel pass, where n is the length of the sequence.
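A small numpy sketch of that claim (single attention head, illustrative names): one causally masked pass over the whole sequence is O(n^2), while recomputing attention from scratch at each generation step costs O(k^2) at step k, i.e. O(n^3) overall. The final assert shows the prefix work is identical across steps, which is what caching and reuse exploit.

```python
import numpy as np

def causal_attention(Q, K, V):
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                  # n x n: the O(n^2) part
    scores[np.triu_indices(n, k=1)] = -np.inf      # causal mask: no lookahead
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)              # row-wise softmax
    return w @ V

n, d = 16, 32
Q, K, V = (np.random.randn(n, d) for _ in range(3))

# One parallel pass over the whole sequence: O(n^2) total.
full = causal_attention(Q, K, V)

# Sequential generation, recomputing the whole prefix at each step:
# O(k^2) at step k, hence O(n^3) total.
step_outputs = [causal_attention(Q[: k + 1], K[: k + 1], V[: k + 1])[-1]
                for k in range(n)]

# Row k of the parallel result equals the k-th sequential output: the
# prefix work is identical, which is why it can be cached and reused.
assert np.allclose(full, np.stack(step_outputs))
```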