Notes on 'The 2025 AI Engineer Reading list' - Foundational models
The following are loose notes taken while going through the papers in The 2025 AI Engineer Reading list. They are personal notes and not really refined for an external audience. Still, they might be useful to someone :).
Foundational Models
GPT-1
- Uses a two-step process: first unsupervised pretraining, then supervised fine-tuning. Fine-tuning happens not just via a more curated dataset (that’s what I thought, and what’s in later models referred to as “Supervised Fine-Tuning”, SFT) but by training multiple different capabilities like text classification, text similarity, and choosing the best response.
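A minimal sketch of how I picture this two-stage recipe: one shared backbone, a language-modeling head for pretraining and a task head for fine-tuning, with the LM loss kept around as an auxiliary objective during fine-tuning. The tiny model, data, and loss weight below are made up for illustration, not GPT-1’s actual setup.

```python
import torch
import torch.nn as nn

# Toy stand-in for the transformer backbone (causally masked, decoder-style).
class TinyBackbone(nn.Module):
    def __init__(self, vocab=1000, d=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, tokens):
        seq_len = tokens.size(1)
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        return self.encoder(self.embed(tokens), mask=causal)

backbone = TinyBackbone()
lm_head = nn.Linear(64, 1000)   # next-token prediction (stage 1: pretraining)
cls_head = nn.Linear(64, 2)     # e.g. binary text classification (stage 2: fine-tuning)

tokens = torch.randint(0, 1000, (8, 32))
labels = torch.randint(0, 2, (8,))

# Stage 1: unsupervised pretraining -- predict token t+1 from the tokens up to t.
hidden = backbone(tokens)
lm_loss = nn.functional.cross_entropy(
    lm_head(hidden[:, :-1]).reshape(-1, 1000), tokens[:, 1:].reshape(-1))

# Stage 2: supervised fine-tuning -- a task head on the last position's hidden
# state, keeping the LM loss as an auxiliary objective (weight is illustrative).
cls_loss = nn.functional.cross_entropy(cls_head(hidden[:, -1]), labels)
ft_loss = cls_loss + 0.5 * lm_loss
```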
GPT-2
- 1.5 billion parameters still underfit the training dataset
- Task-specific fine-tuning as described above might not be necessary, as the LLM can just adapt to new use cases
- The authors suspect that the LLM can learn from content in any format and then apply that to other tasks. For example, training on Wikipedia and then being able to use facts from it in a conversation
- Common Crawl’s text quality was too low - instead they used Reddit outbound links as a form of human pruning of the dataset
- Comparison of GPT-2 output with other datasets or benchmarks is sometimes difficult because of bad data quality in those datasets (tokenization, permutation)
- GPT-2 is really not great (translation is worse than dictionary-based translation, summarization is about as bad as selecting three random sentences from the main text) - “much much worse [compared to domain specific SOTA baselines]” to put it in the words of the authors
- A lot of test/validation data is contained in the training set (see the contamination-check sketch at the end of this section)
- Hallucinations have not yet been identified as a problem; rather they are viewed as a sign of the model’s ability to generalize (“GPT-2 is also able to write news articles about the discovery of talking unicorns.”)
- My understanding is that GPT-2 wasn’t fine-tuned
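Since train/test overlap comes up here and again for GPT-3, here is a rough sketch of what such a contamination check can look like: flag test documents that share long n-grams with the training corpus. The 8-gram length and the “any shared n-gram” threshold are arbitrary choices for illustration, not the papers’ exact setup.

```python
# Build the set of n-grams seen in training, then flag test documents that
# share at least one of them. Corpus and test set below are dummy data.
def ngrams(text, n=8):
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

train_corpus = ["the quick brown fox jumps over the lazy dog and keeps on running"]
test_set = [
    "a completely unrelated document about something else entirely",
    "the quick brown fox jumps over the lazy dog and keeps on running today",
]

train_ngrams = set()
for doc in train_corpus:
    train_ngrams |= ngrams(doc)

flagged = [doc for doc in test_set if ngrams(doc) & train_ngrams]
print(flagged)   # the second test document is flagged as contaminated
```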
GPT-3
GPT-3 Base / Language Models are Few-Shot Learners
- not fine-tuned
- 175 billion parameters - over 100x the size of GPT-2 (and, per the paper, 10x larger than any previous non-sparse language model)
- focus on in-context learning (see the few-shot prompt sketch after this list)
- In some cases almost as good as the SOTA of domain-specific baselines (much better than GPT-2)
- Figure 3.1 strongly supports the hypothesis that more compute leads to better performance
- Still issues with coherence over longer (more than a couple of sentences) outputs.
- Hallucination still viewed as a feature, not a bug
- Future work: currently all training tokens are weighted equally, while they may vary drastically in how important they are to get right. (There is some discussion elsewhere for GPT-4 on whether RLHF does a better job of providing nuanced feedback)
- For the first time, the paper goes into risks, attacks, biases, and environmental impact
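To make “in-context learning” concrete for myself: the task is demonstrated inside the prompt and the weights never change; the number of examples k is the only knob. A tiny sketch of building such a prompt (the format is illustrative, not the paper’s exact prompt):

```python
# Few-shot prompting: the "training" examples are simply prepended to the
# prompt; no gradient update happens anywhere.
def few_shot_prompt(examples, query, instruction="Translate English to French:"):
    lines = [instruction]
    for english, french in examples:
        lines.append(f"{english} => {french}")
    lines.append(f"{query} =>")   # the model is asked to complete from here
    return "\n".join(lines)

prompt = few_shot_prompt(
    examples=[("sea otter", "loutre de mer"), ("cheese", "fromage")],  # k = 2 shots
    query="peppermint",
)
print(prompt)   # this string is sent to the (frozen) model as-is
```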
Codex / Evaluating Large Language Models Trained on Code
- Proposes HumanEval (see the pass@k sketch after this list)
- Shows that different numbers of samples (k) fare best at different temperatures
- Highlights that Codex learns much more slowly per token read than a human would
- Misalignment: a prompt containing errors will get the model to “deliberately” put errors in the response, to best predict the next token.
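HumanEval’s headline metric is pass@k. As I understand it, the paper estimates it per problem from n sampled completions, of which c pass the unit tests, using an unbiased and numerically stable formula instead of literally drawing k samples; a sketch:

```python
import numpy as np

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem: n completions were sampled,
    c of them pass the unit tests. Computes 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0                # every possible draw of k contains a passing sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples for one problem, 13 of them pass the tests.
print(pass_at_k(200, 13, 1))      # ~0.065, i.e. 13/200
print(pass_at_k(200, 13, 10))
```

The temperature note above then reads as: the best sampling temperature depends on which k you are optimizing pass@k for.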
InstructGPT / Training language models to follow instructions with human feedback
- InstructGPT is a much better-aligned model, giving better responses, while using mostly the same training data (98%)
- GPT-3 is first fine-tuned on a high-quality dataset (“Supervised Fine-Tuning”, SFT)
- Then the SFT model is further trained on human feedback to pick the winning response among multiple responses. The resulting model is called the Reward Model (RM)
- The reward model is then used (via PPO) to further refine the SFT model (see the reward-model loss sketch after this list)
- The model based on the human feedback is performing much better than the model only fine-tuned based on the high quality data set
- I’ve struggled to understand the RLHF / policy model / SFT part of the paper; Huyen Chip lays this out much more understandably than the paper. Also check out Yoav Goldberg’s notes on John Schulman’s talk on RLHF
- Hallucinations are still an issue even if alignment is a lot better
- Personal thought: I wonder if hallucinations happen when, during fine-tuning, responses are provided whose content is not sufficiently contained in the model. The model would then learn that providing plausible instead of truthful responses is desired. Maybe it would be possible to judge how solidly the SFT data is contained in the base dataset, perhaps by reusing a similar approach and building an in-dataset / out-of-dataset classifier.
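To pin down the reward-model step for myself: the RM (the SFT backbone plus a scalar head) is trained with a pairwise ranking loss so that the human-preferred response scores higher than the rejected one. A minimal sketch of just that loss, with dummy scores standing in for the model outputs:

```python
import torch
import torch.nn.functional as F

# Pairwise ranking loss for the reward model: push the score of the
# human-preferred ("chosen") response above the rejected one.
def rm_loss(r_chosen, r_rejected):
    return -F.logsigmoid(r_chosen - r_rejected).mean()   # -log sigmoid(r_w - r_l)

# Dummy scalar rewards; in reality these come from the RM scoring
# (prompt, response) pairs from the human comparison data.
r_chosen = torch.randn(16, requires_grad=True)
r_rejected = torch.randn(16, requires_grad=True)
loss = rm_loss(r_chosen, r_rejected)
loss.backward()

# The RLHF stage then maximizes this learned reward with PPO, with a KL
# penalty that keeps the policy close to the SFT model.
```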
GPT-4 Technical Report
- GPT-4 outperforms many domain-specific baseline SOTA ML models
- They focus on accurately predicting model performance/behavior from much smaller models, which lets them iterate much more cheaply before pretraining the final model (see the scaling-fit sketch after this list)
- Multimodality: Images are accepted as input
- GPT-4 has improved alignment and truthfulness but no details are given as to how
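The report doesn’t spell out the fitting code, so this is only my mental model of “predictable scaling”: fit a power law to the final losses of small runs and extrapolate to the target compute. If I remember correctly the report’s fit also includes an irreducible-loss term, which this sketch drops; all numbers below are made up.

```python
import numpy as np

# Fit a power law (a straight line in log-log space) to the losses of small
# training runs, then extrapolate to the full-scale compute budget.
compute = np.array([1e18, 1e19, 1e20, 1e21])   # FLOPs of the small runs (made up)
loss = np.array([3.9, 3.4, 3.0, 2.65])         # their final losses (made up)

# log(loss) = b * log(compute) + log(a)  =>  loss ~ a * compute^b
b, log_a = np.polyfit(np.log(compute), np.log(loss), deg=1)

def predict_loss(c):
    return np.exp(log_a) * c ** b

print(predict_loss(1e25))   # forecast for the (much larger) final run
```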
o3 / Deliberative Alignment: Reasoning Enables Safer Language Models
- CoT improves detection of forbidden prompts as well as reduces false-positive rates
- Context distillation is a method that creates responses using a prompt containing certain information (e.g. the OpenAI safety policies) and then uses these responses as training data. This way the additional context can later be omitted from the prompt, which saves compute and improves alignment (see the data-generation sketch after this list).
- A safety-unaligned (but instruct- and CoT-tuned?) o1 model is used to generate “safe” responses using a custom system prompt
- A thus safety-aligned o-series model has lower inference costs (doesn’t need the policies in the prompt) and has lower failure rates
- The quality-control model for o3 scores responses across different categories, and the worst score across those categories is taken as the overall score of a response.
- Both SFT and RLHF improve model performance on safety
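My reading of the context-distillation step, as a sketch: generate responses with the policy text in the prompt, filter them with a judge model, then fine-tune on pairs where the policy text is dropped from the prompt. `generate` and `judge_score` below are placeholders, not OpenAI’s actual pipeline.

```python
# Context distillation: the policy text is only present while generating the
# training data; the fine-tuned model later has to answer without it.
SAFETY_POLICY = "..."   # the safety spec/policy text, used only at data-generation time

def generate(prompt):
    return "chain of thought ... final answer"   # placeholder for the base reasoning model

def judge_score(prompt, response, policy):
    return 1.0                                   # placeholder judge/reward model

def build_distillation_data(prompts, min_score=0.8):
    data = []
    for p in prompts:
        with_policy = f"{SAFETY_POLICY}\n\nUser: {p}"
        response = generate(with_policy)                      # CoT + answer, policy in context
        if judge_score(p, response, SAFETY_POLICY) >= min_score:
            data.append({"prompt": p, "response": response})  # policy text omitted here
    return data

sft_data = build_distillation_data(["example user prompt ..."])
```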
To be continued