With a vision transformer, each token is typically a 16x16-pixel patch. I found an example that trains a vision transformer on 224x224 images, so going with that: 256 pixels per token and 50,176 pixels per image gives 196 tokens per frame. At 24 frames per second, that's 4,704 tokens per second, or 16,934,400 tokens per hour. In total we're at 2.6x10^15 tokens.
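The per-frame and per-hour figures above can be checked with a few lines of arithmetic (assuming, as above, 224x224 frames, 16x16-pixel patches, and 24 fps):

```python
# Back-of-envelope token count for video, assuming 224x224 frames,
# 16x16-pixel patches (one token each), and 24 frames per second.
patch_px = 16
frame_px = 224
fps = 24

tokens_per_frame = (frame_px // patch_px) ** 2   # 14 * 14 = 196
tokens_per_second = tokens_per_frame * fps       # 196 * 24 = 4704
tokens_per_hour = tokens_per_second * 3600       # 16,934,400

print(tokens_per_frame, tokens_per_second, tokens_per_hour)
# → 196 4704 16934400
```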
GPT-3 was trained on 5x10^11 tokens, so YouTube tokenized this way would be around four orders of magnitude more tokens than GPT-3 was trained on.
GPT-3 was undertrained by 1-2 orders of magnitude, so the compute required to train a model on YouTube would be around 6 orders of magnitude higher than what was used to train GPT-3, about one million times more.
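A quick sanity check on those ratios, using only the figures quoted above (2.6x10^15 YouTube tokens, 5x10^11 GPT-3 tokens, and taking the undertraining estimate at its high end of 2 orders of magnitude):

```python
import math

# Figures quoted in the text above.
youtube_tokens = 2.6e15
gpt3_tokens = 5e11

token_ratio = youtube_tokens / gpt3_tokens  # ~5200x
print(f"token ratio: {token_ratio:.0f}x "
      f"(~{math.log10(token_ratio):.1f} orders of magnitude)")

# If GPT-3 was undertrained by up to 2 orders of magnitude, a
# compute-adequate run on the YouTube corpus needs roughly that many
# additional orders of magnitude of compute on top of the token gap.
compute_orders = math.log10(token_ratio) + 2
print(f"compute gap: ~{compute_orders:.0f} orders of magnitude")
```

The token ratio alone is ~5,200x (closer to 3.7 orders of magnitude than 4), which is why adding the 2 orders of undertraining lands near the 6 orders quoted above.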
I did a linear regression on the training costs from Cerebras(1) and came up with the formula cost = (1901.67366 * X) - 197902.72715 dollars, where X is the number of tokens in billions.
Plugging in 2.6x10^15 tokens, we get a training cost of about 5 billion dollars. A lot of optimizations could probably bring that cost down, so maybe it's doable in a few years.
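Evaluating the fitted line directly (this is just the regression formula above wrapped in a function, with the 2.6x10^15-token total from earlier plugged in):

```python
def training_cost_usd(tokens):
    """Fitted linear cost model: X is the token count in billions."""
    x_billions = tokens / 1e9
    return 1901.67366 * x_billions - 197902.72715

# Total YouTube token count estimated earlier in the text.
print(f"${training_cost_usd(2.6e15):,.0f}")  # ≈ 4.9 billion dollars
```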
1. https://cirrascale.com/cerebras.php