What Parameter Golf taught us about AI-assisted research
Key Points
- 16 MB artifact + 10-minute 8×H100 training budget
- AI coding agents sped iteration but complicated review
- Quantization and test‑time adaptation drove strong gains
Summary
Parameter Golf was an eight-week open ML challenge to minimize held-out loss on FineWeb under strict resource limits: a 16 MB artifact cap (weights + training code) and a 10-minute training budget on 8×H100s. Over 1,000 participants submitted 2,000+ entries. Winning submissions combined disciplined optimizer and tuning work, aggressive compression (GPTQ variants), test‑time adaptation, and novel tokenization and architectural ideas. AI coding agents accelerated iteration and broadened participation, but they also made review, attribution, and reproducibility harder.
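To make the size cap concrete, here is a rough back-of-the-envelope sketch of how many weights fit in the artifact at different precisions. It assumes that 16 MB means 16,000,000 bytes and uses a hypothetical 50,000-byte training script; it also ignores quantization scales, metadata, and any additional file compression, so the numbers are upper bounds for illustration rather than contest figures.

```python
def max_params(limit_bytes: int, code_bytes: int, bits_per_weight: float) -> int:
    """Upper bound on the number of weights that fit after the code is counted."""
    if bits_per_weight <= 0:
        raise ValueError("bits_per_weight must be positive")
    weight_bytes = limit_bytes - code_bytes
    if weight_bytes < 0:
        raise ValueError("code alone exceeds the artifact limit")
    return int(weight_bytes * 8 // bits_per_weight)


LIMIT_BYTES = 16_000_000  # assumption: decimal megabytes
CODE_BYTES = 50_000       # hypothetical size of the training code

for bits in (16, 8, 4):
    count = max_params(LIMIT_BYTES, CODE_BYTES, bits)
    print(f"{bits:>2} bits/weight -> up to {count:,} parameters")
```

Under these assumptions, halving the bits per weight doubles the parameter budget, which is why compression matters so much under a fixed byte cap.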
Key Points
- Contest constraints: minimize held-out loss on FineWeb; 16 MB artifact limit; 10-minute training on 8×H100s; baseline and evaluation scripts provided.
- Participation: 1,000+ participants, 2,000+ submissions; RunPod sponsored $1,000,000 in compute, lowering entry barriers.
- Technical themes that surfaced:
  - Training optimization and disciplined combination of improvements (optimizer schedules, weight decay, initialization).
  - Quantization and compression advances (GPTQ-lite, full-Hessian GPTQ) that materially improved evaluation scores.
  - Test‑time strategies and evaluation tricks (LoRA per-document adaptation, self-generated calibration) that required careful rule review.
  - New modeling and data ideas (lossless tokenizer operators, efficient attention variants, learned token features, mini depth recurrence).
- AI coding agents: widely used for prototyping, code inspection, and faster experiments, but they added noise, helped rule-violating strategies spread between submissions, and increased reviewer workload.
- Operations: high submission volume required automated triage (Codex-based bot) and clearer community review tooling to scale leaderboard management.
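As a reference point for the quantization theme above, the following is a minimal sketch of plain symmetric round-to-nearest quantization, applied to one row of weights. This is not GPTQ and not any participant's method: GPTQ-style approaches additionally use second-order (Hessian) information to compensate for rounding error, which this baseline does not attempt. The function names and the synthetic Gaussian weights are illustrative assumptions.

```python
import random


def quantize_row(row, bits):
    """Symmetric round-to-nearest quantization of one row to signed ints."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(w) for w in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return q, scale


def dequantize_row(q, scale):
    """Map integer codes back to approximate float weights."""
    return [v * scale for v in q]


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(256)]  # synthetic weights

for bits in (8, 4):
    q, scale = quantize_row(row, bits)
    err = mean_squared_error(row, dequantize_row(q, scale))
    print(f"{bits}-bit: scale={scale:.6f}, mse={err:.3e}")
```

The reconstruction error grows as the bit-width shrinks; error-compensating methods aim to recover some of that loss at the same storage cost.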
Practical recommendations for engineers
- Prioritize reproducibility and compact, self-contained artifacts under strict size/time budgets.
- Invest in quantization and export workflows early; under tight limits, compression can beat naive model scaling.
- Automate submission checks and triage to catch invalid evaluation paths when using agent-accelerated pipelines.
- Use agents to accelerate prototyping, but enforce provenance, tests, and reviewer checkpoints so that invalid strategies are not copied from one submission to the next.
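The automated-check recommendation could start from something as small as the sketch below, which verifies a submission directory against a byte cap and a reported training time. It is a hypothetical pre-submission check, not the contest's actual triage bot: the function names, the 16,000,000-byte reading of "16 MB", and the idea of passing a measured `train_seconds` value are all assumptions made for illustration.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class CheckResult:
    ok: bool
    reasons: list = field(default_factory=list)


def artifact_bytes(root: Path) -> int:
    """Total size in bytes of all regular files under root."""
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def check_submission(root, train_seconds,
                     max_bytes=16_000_000, max_train_seconds=600.0):
    """Return a CheckResult listing every budget the submission exceeds."""
    reasons = []
    size = artifact_bytes(Path(root))
    if size > max_bytes:
        reasons.append(f"artifact is {size} bytes, limit is {max_bytes}")
    if train_seconds > max_train_seconds:
        reasons.append(
            f"training took {train_seconds:.1f}s, limit is {max_train_seconds:.1f}s"
        )
    return CheckResult(ok=not reasons, reasons=reasons)


# Usage: build a throwaway submission directory and check it.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "train.py").write_text("print('train')\n")
    (Path(tmp) / "weights.bin").write_bytes(b"\x00" * 1024)
    result = check_submission(tmp, train_seconds=580.0)
    print(result.ok, result.reasons)
```

Collecting all failure reasons, rather than stopping at the first, gives reviewers and agents a complete picture in one pass. Checks for rule compliance of the evaluation path itself would need contest-specific logic that this sketch does not cover.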