
Does the paper assume uniform settings throughout the training phase? Or does the bound hold no matter what training strategy is used, given the dataset?


They only experimented with different cosine learning rate decay schedules, but found the results consistent across those. The results also held across two types of experiment: varying the number of training tokens for a given model size, and varying model size for a given number of training FLOPs.
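For reference, a cosine learning rate decay schedule of the kind they varied can be sketched as follows (a minimal illustration; the peak and minimum rates here are made-up parameters, not values from the paper):

```python
import math

def cosine_lr(step, total_steps, peak_lr=3e-4, min_lr=3e-5):
    """Cosine decay: anneal the learning rate from peak_lr at step 0
    down to min_lr at total_steps, following a half-cosine curve."""
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

Changing the schedule amounts to changing `total_steps` (the decay horizon) or the peak/minimum rates; the claim above is that the fitted results were robust to such variations.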



