7:55AM
ReaderLM-v2 looks like a very interesting model: [twitter](https://x.com/JinaAI_/status/1879551743748706487). Their announcement [blog](https://jina.ai/news/readerlm-v2-frontier-small-language-model-for-html-to-markdown-and-json/) post contains a number of useful details about how it was trained.
![[Pasted image 20250116075625.png]]
Especially interesting are the techniques they used to expand the context window; this kind of context extension comes up a lot in the local LLaMA communities.
> We began with **long-context pretraining**, using the `html-markdown-1m` dataset. Techniques like ring-zag attention and rotary positional encoding (RoPE) were used to progressively expand the model’s context length from 32,768 tokens to 256,000 tokens. To maintain stability and efficiency, we adopted a gradual training approach, starting with shorter sequences and incrementally increasing the context length.
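To make the RoPE side of this concrete for myself, here's a minimal sketch of staged context extension. Only the 32,768 → 256,000 endpoints come from the post; the intermediate lengths and the RoPE base values below are made up, and I'm guessing they scale the base as sequences get longer, which is the common trick here.

```python
import torch

def rope_angles(head_dim: int, max_positions: int, base: float = 10_000.0) -> torch.Tensor:
    """Rotary positional encoding: one rotation angle per (position, frequency) pair."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(max_positions).float()
    return torch.outer(positions, inv_freq)  # shape: (max_positions, head_dim // 2)

# Hypothetical schedule: train on progressively longer sequences, raising the RoPE
# base each time so the rotation angles stay in a range the model has already seen.
schedule = [(32_768, 10_000.0), (65_536, 80_000.0), (131_072, 320_000.0), (256_000, 1_280_000.0)]
for ctx_len, rope_base in schedule:
    angles = rope_angles(head_dim=128, max_positions=ctx_len, base=rope_base)
    print(f"train at {ctx_len} tokens, RoPE base {rope_base:,.0f}, angles {tuple(angles.shape)}")
```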
More detail on their SFT approach would be useful as well - it's not clear exactly what this stage involves.
> Following pretraining, we moved to **supervised fine-tuning (SFT)**. This stage utilized the refined datasets generated in the data preparation process. These datasets included detailed instructions for Markdown and JSON extraction tasks, along with examples for refining drafts. Each dataset was carefully designed to help the model learn specific tasks, such as identifying main content or adhering to schema-based JSON structures.
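Presumably this is fairly standard instruction tuning on (HTML, Markdown) and (HTML, JSON) pairs. A minimal sketch of what a single training record might look like - the field names and instruction text are my own guesses, not Jina's schema:

```python
# Works with any Hugging Face-style tokenizer.
def build_sft_example(tokenizer, html: str, markdown: str,
                      instruction: str = "Convert the following HTML to Markdown:"):
    prompt_ids = tokenizer.encode(f"{instruction}\n\n{html}\n\n")
    target_ids = tokenizer.encode(markdown) + [tokenizer.eos_token_id]
    return {
        "input_ids": prompt_ids + target_ids,
        # Standard SFT masking: only the response tokens contribute to the loss.
        "labels": [-100] * len(prompt_ids) + target_ids,
    }
```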
This is a supervised task as well - I wonder how challenging it was to set up.
> We then applied **direct preference optimization (DPO)** to align the model’s outputs with high-quality results. In this phase, the model was trained on pairs of draft and refined responses. By learning to prioritize the refined outputs, the model internalized the subtle distinctions that define polished and task-specific results.
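If I understand this correctly, it's the standard DPO objective (Rafailov et al.), with the draft as the rejected response and the refined output as the chosen one. A sketch of that loss - the inputs are summed per-sequence log-probabilities under the policy and a frozen reference model, and their actual implementation may well differ:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """Prefer the refined (chosen) response over the draft (rejected),
    measured relative to a frozen reference model."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

In practice you'd probably reach for something like trl's `DPOTrainer` rather than writing this by hand.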
This unsupervised technique is new to me - I need to learn more about it.
> Finally, we implemented **self-play reinforcement tuning**, an iterative process where the model generated, refined, and evaluated its own outputs. This cycle allowed the model to improve continuously without requiring additional external supervision. By leveraging its own critiques and refinements, the model gradually enhanced its ability to produce accurate and structured outputs.
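My rough guess at what one round of that loop could look like - `generate` and `score` here are placeholders for model calls and an automatic evaluator; nothing in this sketch comes from the post:

```python
# One self-play round: the model drafts, critiques/refines its own draft, and the
# winning (refined, draft) pairs become preference data for another tuning pass.
def self_play_round(generate, score, prompts):
    new_pairs = []
    for prompt in prompts:
        draft = generate(prompt)
        refined = generate(f"Critique the draft below, then produce an improved version.\n\n{draft}")
        # Keep only pairs where the model's own refinement actually scores higher.
        if score(prompt, refined) > score(prompt, draft):
            new_pairs.append({"prompt": prompt, "chosen": refined, "rejected": draft})
    return new_pairs
```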