
The GPT-3 175B model required 3.14e23 FLOPs of compute for training. Even at a theoretical 28 TFLOPS per V100 and the lowest reserved Azure pricing, that works out to 355 GPU-years and $3.6M for a single training run!
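
For anyone who wants to sanity-check that arithmetic, here's a rough sketch (the hourly rate is an assumption picked to match the quoted total; actual reserved pricing varies):

    # Back-of-envelope check of the figures above
    total_flops = 3.14e23            # reported training compute for GPT-3 175B
    v100_flops = 28e12               # theoretical peak for one V100, in FLOPS

    gpu_seconds = total_flops / v100_flops
    gpu_years = gpu_seconds / (365 * 24 * 3600)
    print(f"{gpu_years:.0f} GPU-years")          # roughly 355 GPU-years

    usd_per_gpu_hour = 1.15          # assumed reserved rate, not an official price
    gpu_hours = gpu_seconds / 3600
    print(f"${gpu_hours * usd_per_gpu_hour / 1e6:.1f}M")  # roughly $3.6M at this rate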


I know it's a meme that the GPT team just takes the latest transformer model and adds an order of magnitude more parameters, done and done!

It'll be interesting to see whether the new paradigm really offers new insights, or whether it's just kicking the can down the road and we'll run into the limits of generalizability in some other fashion.

I guess what irks me is that there is so little theory and math behind many of these papers, even when there are dozens of co-authors on them.

The question of generalizability is deeply connected to statistics, e.g. causal models, spurious correlations and so forth. Statements about these things are just "thrown" in there, without any citation or proof. In peer review, wouldn't anyone object? Those are clearly things that we actually do not know enough about to be sure.

Edit: Reflecting further, perhaps this rapid iteration and focus on results is actually something positive. Perhaps it's good the way it is, without so many scientific conventions and signals of deference. Perhaps those conventions are what made other sciences more anemic, and their absence is what has made ML so productive.

All my whining aside, impressive work of course.


Can you point out some books/authors/papers to close the gap between statistics and NNs?



