
Does the paper assume uniform settings throughout the training phase? Or does the bound hold no matter what training strategy is used, given the dataset?


They only experimented with different cosine learning rate decay schedules, but found the results consistent across those. The results also held across two types of experiment: varying the number of training tokens for a given model size, and varying model size for a given number of training FLOPs.
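For reference, a cosine learning rate decay schedule of the kind they varied can be sketched as follows (a minimal illustration; the peak and minimum rates here are made-up parameters, not values from the paper):

```python
import math

def cosine_lr(step, total_steps, peak_lr=3e-4, min_lr=3e-5):
    """Cosine decay: anneal the learning rate from peak_lr at step 0
    down to min_lr at total_steps, following a half-cosine curve."""
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

Changing the schedule amounts to changing `total_steps` (the decay horizon) or the peak/minimum rates; the claim above is that the fitted results were robust to such variations.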



