Hacker News

I wondered: how is training data balanced? If you put in too much Wikipedia, does your model end up sounding like a walking encyclopedia?

After doing the Karpathy tutorials, I tried training my model on the TinyStories dataset. Soon I noticed that it was always using the same name for its story characters. That name appears very frequently in the dataset.




At this scale, that kind of thing is not really a problem; you just dump all the data you can find into the model (pre-training) [1]. Of course, the pre-training data influences the model, but reinforcement learning is really what determines the model’s writing style and, in general, how it “thinks” (post-training).

[1] This data is still heavily filtered/cleaned.


This isn’t quite accurate. Data weighting is quite important in pretraining.
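As a rough illustration of what data weighting means here (the sources and weights below are invented, not from any real training mixture): each source gets a sampling weight, so one huge corpus doesn't drown out smaller high-quality ones just by raw size.

```python
import random

# Invented sources and weights, for illustration only. Note the web
# corpus is 10x larger than Wikipedia but still capped at weight 0.5.
sources = {
    "web":       {"docs": ["web doc"] * 1000, "weight": 0.5},
    "wikipedia": {"docs": ["wiki doc"] * 100, "weight": 0.3},
    "code":      {"docs": ["code doc"] * 50,  "weight": 0.2},
}

def sample_batch(n, seed=0):
    """Draw n documents: pick a source by weight, then a doc from it."""
    rng = random.Random(seed)
    names = list(sources)
    weights = [sources[s]["weight"] for s in names]
    return [
        rng.choice(sources[rng.choices(names, weights=weights)[0]]["docs"])
        for _ in range(n)
    ]

print(sample_batch(5))
```

Over many samples, Wikipedia contributes ~30% of the batch despite being a tenth the size of the web corpus; change the weights and the model sees a very different distribution.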


