Pushing language models beyond superhuman performance
- When explicitly optimized for a task, large deep learning systems reach superhuman performance with relative ease. Modern image classifiers are human-level or better, and even medium-size language models are unquestionably superhuman at their core task of next-token prediction (the objective is written out after this list).
- The main reason language models aren't superhuman at most tasks today is that they aren't explicitly optimized for things like "be funny" or "reason like a mathematician" or "be better at programming". Those skills are learned only incidentally, because they happen to be useful for next-token prediction, and neural nets are lazy learners: they learn all of the easy things before any of the hard things.
- There are early indicators (RLHF, task vectors) that once a model reaches near human-level performance at a task, and we build some kind of discriminator or reward model for what "better performance" at that task looks like (based on human feedback or self-play), the model rapidly surpasses human performance at that task (a minimal sketch of such a reward model follows this list).
- One of the key findings of the original RLHF paper from OpenAI was that models trained with this method not only matched but outperformed humans. That's something a model trained on the typical supervised or self-supervised objective can't accomplish: pure imitation caps the model at the quality of its demonstrations, while a reward model supplies a training signal for outputs better than anything in the training data.
- I think we're only beginning to scratch the surface of these recipes for achieving superhuman performance by directly optimizing for specific skills. I suspect we'll see superhuman reasoning, search, and reading comprehension bootstrap themselves from existing LLMs in 2023 (one concrete shape such a bootstrapping loop could take is sketched at the end of this list).
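For concreteness, the next-token prediction objective referenced in the first bullet is the standard autoregressive cross-entropy loss (the usual formulation, not specific to any one model):

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)
```

"Superhuman" here means the model's per-token log-loss on held-out text is far better than what people manage when asked to guess the next token.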
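To make the reward-model mechanism concrete, here's a minimal PyTorch sketch of training a scalar reward model from pairwise human preferences with a Bradley-Terry-style loss, the formulation commonly used in RLHF-style reward modeling. The `RewardModel` below is a stand-in MLP over fixed-size embeddings rather than a real LM encoder, and every name and dimension is illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps a pooled sequence representation to a scalar reward. In practice
    the encoder would be a pretrained LM; a small MLP over fixed-size
    embeddings keeps this sketch self-contained."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)  # one scalar reward per sequence

def preference_loss(rm: RewardModel, chosen: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: push the reward of the human-preferred
    completion above the reward of the rejected one."""
    return -F.logsigmoid(rm(chosen) - rm(rejected)).mean()

# Toy training step on random "embeddings" standing in for two completions
# of the same prompt, where `chosen` is the one a human labeler preferred.
rm = RewardModel()
opt = torch.optim.Adam(rm.parameters(), lr=1e-4)
chosen, rejected = torch.randn(8, 768), torch.randn(8, 768)
opt.zero_grad()
loss = preference_loss(rm, chosen, rejected)
loss.backward()
opt.step()
print(f"pairwise preference loss: {loss.item():.3f}")
```

The key property is that nothing in this loss references human-level outputs directly; it only encodes "better vs. worse", so the signal keeps pointing upward even past the quality of any demonstration.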
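Finally, one deliberately simple shape the bootstrapping in the last bullet could take is best-of-n rejection sampling: sample several completions from the current model, keep the one the reward model scores highest, and fine-tune on the winners so the next round starts from a stronger model. This is a sketch under assumed interfaces; the `generate` and `score` callables are placeholders, not any real API:

```python
import random
from typing import Callable

def best_of_n(
    generate: Callable[[str], str],      # samples one completion from the current model
    score: Callable[[str, str], float],  # reward model: higher = better
    prompt: str,
    n: int = 16,
) -> str:
    """One round of rejection-sampling self-improvement: draw n completions
    and keep the one the reward model prefers. Fine-tuning on these winners
    distills the reward model's judgment back into the model, so the next
    round samples from a stronger starting point."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))

# Toy usage with stand-in functions: the "model" appends random words and
# the "reward model" simply prefers longer completions.
def fake_generate(prompt: str) -> str:
    words = ["alpha", "beta", "gamma", "delta"]
    return prompt + " " + " ".join(random.choices(words, k=random.randint(1, 6)))

def fake_score(prompt: str, completion: str) -> float:
    return float(len(completion))

print(best_of_n(fake_generate, fake_score, "Once upon a time", n=8))
```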