Linus's stream

Impossible to overstate the technical complexity involved in building a simple rich text editor that works well. I'm constantly surprised by how deceptively simple it seems from the outside, and how even extremely well-resourced teams with very talented people often discover latent bugs or misalignments between "what should make sense" and "what people seem to expect the thing to do in the real world."

Composable reasoning

A key tension when engineering with deep learning systems is balancing composability with performance. In general, when not bottlenecked by compute or data, DL systems perform best when trained end-to-end on a suitable objective. But the intermediate features of the learned model are hard to interpret, and because those features aren't intended to be consumed by humans, it's hard to integrate many such systems into a larger whole in a way that lets humans reason about the behavior of the integrated system as reliably as they can about “classic” software systems with well-defined interfaces between modular parts.

For example, Tesla's autonomous driving system has hundreds of subsystems trained semi-independently (same model backbone shared for efficiency, but different training objectives for downstream models). These subsystems perform specific, human-legible tasks like “identify drivable areas” or “locate pedestrians in the scene” which feed into subsequent tasks. This makes debugging and maintenance easier, but might cap system performance for tasks that require passing ambiguity or other more nuanced state between models.
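
To make the pattern concrete, here's a minimal sketch of the shared-backbone, many-heads architecture in PyTorch. The module names, layer sizes, and tasks are illustrative stand-ins of mine, not anything from Tesla's actual stack:

```python
import torch
import torch.nn as nn

class PerceptionBackbone(nn.Module):
    """Shared feature extractor reused across tasks (hypothetical)."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
    def forward(self, x):
        return self.encoder(x)

class DrivableAreaHead(nn.Module):
    """Task-specific head: per-pixel drivable-area mask."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.head = nn.Conv2d(feat_dim, 1, 1)
    def forward(self, feats):
        return torch.sigmoid(self.head(feats))

class PedestrianHead(nn.Module):
    """Task-specific head: coarse pedestrian heatmap."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.head = nn.Conv2d(feat_dim, 1, 1)
    def forward(self, feats):
        return torch.sigmoid(self.head(feats))

backbone = PerceptionBackbone()
heads = {"drivable_area": DrivableAreaHead(), "pedestrians": PedestrianHead()}

frame = torch.randn(1, 3, 128, 128)   # dummy camera frame
feats = backbone(frame)               # shared, reused features
outputs = {name: head(feats) for name, head in heads.items()}
```

The point is that each head's output is a human-legible artifact (a drivable-area mask, a pedestrian heatmap) that downstream code can consume and that engineers can inspect when something goes wrong.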

Language model cascades are a notable exception: chains of thought, which force the model to “think” through human-interpretable intermediate states (generated text), seem to improve performance on many tasks compared to a baseline of direct prompting (though they still lose to direct optimization against a large supervised dataset). LLM chains also improve debugging and interpretability, and in general make “engineering with NLP models” more tractable.
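
A minimal sketch of what that looks like as code, where `complete` stands in for a call to whatever LLM API you use (the prompts and function names here are mine, purely illustrative):

```python
def complete(prompt: str) -> str:
    """Placeholder for a single LLM call; wire this to your provider of choice."""
    raise NotImplementedError

def answer_directly(question: str) -> str:
    # Baseline: one opaque call, no visible intermediate state.
    return complete(f"Question: {question}\nAnswer:")

def answer_with_reasoning(question: str) -> tuple[str, str]:
    # Step 1: ask the model to externalize its reasoning as plain text.
    reasoning = complete(
        f"Question: {question}\nThink through this step by step before answering:"
    )
    # Step 2: condition the final answer on that human-readable intermediate state.
    answer = complete(
        f"Question: {question}\nReasoning: {reasoning}\nFinal answer:"
    )
    # The reasoning string is an inspectable, debuggable artifact of the chain.
    return reasoning, answer
```

The `reasoning` string is the interesting part: it's an intermediate state of the system that a human can read, log, and debug, unlike the hidden activations inside an end-to-end model.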

Composition and modularity are key to how we maintain large classic software systems, and it seems noteworthy that (1) deep learning systems in general push against this practice and (2) LLM programs can be composed from modular pieces without sacrificing power. This is not to diminish the importance of work in AI interpretability, though, and I think there’s lots of valuable advances ahead in how we convert opaque learned features in end-to-end trained systems to human-legible ideas.

The latter is also the subject of some of my current work: How can we take intermediate features of generative models, render them legible to humans, and then let us use them to further control and refine existing models?

What is the self-driving car of NLP?

Autonomous driving is a landmark problem in computer vision, perhaps the real-world problem, as @geohot of Comma often says:

Self-driving cars are still the coolest applied AI problem today.

I think it’s worth thinking about what such an applied, real-world machine learning problem would be for natural language understanding and text generation. My hypothesis is that a good candidate for this “self-driving car of NLP” is open-domain, abstractive question answering, wherein a human uses an AI system to synthesize a natural language answer to some knowledge-based question, based on a large corpus of diverse documents, only some of which contain information relevant to answering the question.

Natural language web search is the most ambitious form of this problem, but a more tractable target might be to solve a similar problem for organizations with lots of private knowledge — chat histories, emails, planning documents, feedback surveys, paperwork, contacts, and on and on and on — synthesizing answers to questions like "Do I know any recruiters working at a biotech company?" or "Was there any update about our deal with X from last week's board meeting?" or, even higher level, "What are the most common customer complaints we have about Y feature?"
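
Here's a minimal retrieve-then-synthesize sketch of that shape of system, with a toy word-overlap retriever and a placeholder `complete` function standing in for an LLM call. A real system would use a learned retriever; everything named here is an assumption of mine:

```python
def complete(prompt: str) -> str:
    """Placeholder for an LLM completion call."""
    raise NotImplementedError

def retrieve(question: str, documents: list[str], k: int = 3) -> list[str]:
    # Toy lexical retriever: rank documents by word overlap with the question.
    q_words = set(question.lower().split())
    scored = sorted(documents, key=lambda d: -len(q_words & set(d.lower().split())))
    return scored[:k]

def answer(question: str, documents: list[str]) -> str:
    # Only a few documents are relevant; pull those, then synthesize an answer.
    context = "\n\n".join(retrieve(question, documents))
    prompt = (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return complete(prompt)

# e.g. answer("Do I know any recruiters working at a biotech company?", all_my_documents)
```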

Both autonomous driving and ODQA:

  • are challenging and unsolved technical problems whose perfect solution requires fully general intelligence and would advance the state of the art in many related research domains
  • meet clear and lucrative market needs with willingness to pay (== willingness for the market to finance R&D necessary to push on the problem)
  • are hot problem spaces with many initial players, but probably few real winners in the end, mostly driven by superiority in data, compute ($$), and research capabilities.
  • automate real-world activities that most humans perform often in their daily lives, so that solving them would dramatically improve most people’s lives and save humans lots of labor.

In particular, speaking from my personal experience as well as the opinions of experts I’ve spoken to, I believe the market is dramatically underestimating the unsolved technical challenges standing between us today and a long-term solution to this problem.

What is the "autonomous driving" of natural language understanding and generation?

Assembly programming is to prompt engineering what Python is to ___?

Life is just a process of acquiring more and more inboxes to check every day until you die.

Even though five years have passed since it was published, there is no work that better captures the potential I feel in AI and generative models for augmenting human intelligence than Using Artificial Intelligence to Augment Human Intelligence. It is criminally under-discussed, especially amongst the crowd of people currently building interfaces on top of large generative models like GPT and Stable Diffusion.

Pushing language models beyond superhuman performance

  • When explicitly optimized for a task, large deep learning systems achieve superhuman performance easily. Modern image classifiers are human-level or better; even medium-size language models are unquestionably superhuman at their core task of next-token prediction.
    • The main reason language models aren't superhuman at most tasks today is that they aren't explicitly optimized for things like "be funny" or "reason like a mathematician" or "be better at programming". Those skills are learned only indirectly, because they happen to help with next-token prediction, but neural nets are lazy learners: they learn all of the easy things before any of the hard things.
  • There are early indicators (RLHF, task vectors) that once we achieve near human-level performance at a task with a particular model, and we develop some kind of discriminator or reward model for what "better performance" at that task looks like (based on human feedback or self-play), models rapidly surpass human performance on that task (see the sketch after this list).
    • One of the key findings of the original RLHF paper from OpenAI was that models trained with this method not only matched, but outperformed humans — something a model trained on the typical supervised or self-supervised objective cannot accomplish.
  • I think we're only beginning to scratch the surface of these formulas for achieving superhuman performance by directly optimizing for specific skills. I suspect we'll see superhuman reasoning, search, and reading comprehension bootstrapped from existing LLMs in 2023.
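
Here's a minimal best-of-n sketch of the generate-then-score recipe the second bullet points at. `generate` and `reward` are placeholders for a sampled LLM completion and a learned preference/reward model; this is my illustration of the general idea, not any particular paper's implementation:

```python
def generate(prompt: str) -> str:
    """Placeholder: one sampled completion from a language model."""
    raise NotImplementedError

def reward(prompt: str, completion: str) -> float:
    """Placeholder: scalar score from a reward model trained on human preferences."""
    raise NotImplementedError

def best_of_n(prompt: str, n: int = 16) -> str:
    # Sample n candidates, keep the one the reward model scores highest.
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: reward(prompt, c))
```

Full RLHF goes further and fine-tunes the generator against the reward signal, but even simple reranking shows how a discriminator can pull outputs past the average human demonstration in the training data.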

Tools for the million-call language model chain future

There is no shortage of tools and startups that want to help people write "language model chains" or "cascades". These tools stitch multiple calls to large language models together to perform complex high-level tasks. Beyond adding capability, making reasoning steps explicit can also make language model programs more interpretable and easier to debug.

LLM chaining tools today accommodate on the order of 10-100 calls to language models, and towards the higher end of that spectrum, these interfaces get very unwieldy to use. But it seems likely to me that, as LLM inference cost falls and latency drops, the average number of LLM inference calls in a language model cascade program is going to go up exponentially, or at least superlinearly, bottlenecked by inference compute cost and the quality of tooling.

Tools that work for 100-call model cascades are going to look very different than those designed for 1M-call model cascades, analogous to how programming languages and environments for MHz-range computers look very different from languages and tools for modern multi-core GHz-range computers. I think this is a forward-looking problem worth thinking about: What kinds of tools do we need to enable future language model programs with millions of calls to multi-modal, large generative models?
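
To make the scale concrete, here's a sketch of how a single high-level task already fans out into one call per document plus a reduce step. `complete_async` is a hypothetical async wrapper around an LLM API, and the semaphore stands in for the scheduling and batching machinery that million-call tooling would actually need:

```python
import asyncio

async def complete_async(prompt: str) -> str:
    """Placeholder for an async LLM call."""
    raise NotImplementedError

async def common_complaints(documents: list[str], max_concurrency: int = 64) -> str:
    sem = asyncio.Semaphore(max_concurrency)

    async def summarize(doc: str) -> str:
        async with sem:  # throttle in-flight calls to the model
            return await complete_async(f"Summarize the customer complaints in:\n{doc}")

    # Map: one model call per document -- a million documents means a million calls.
    partials = await asyncio.gather(*(summarize(d) for d in documents))

    # Reduce: one final call to synthesize the partial summaries.
    # (A real system would reduce hierarchically to fit context windows.)
    joined = "\n".join(partials)
    return await complete_async(f"List the most common complaints across these summaries:\n{joined}")
```

At that scale the interesting tooling questions stop being about prompt templates and start being about scheduling, caching, retries, and cost accounting.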

Is this a component of the final system?

When prototyping towards an ambitious vision, every component of the prototype should be either

  1. smoothly scalable all the way up to become a component of the final, full-scale system, or else
  2. able to teach you so much about what to build next, so cheaply and quickly, that the component itself going to waste is worth it.

If a component falls in neither category, it's most likely unnecessary effort; perhaps even a distraction.

(The primary exception to this rule is internal tools that help you develop the real thing faster and more cheaply.)

A lot of people are building LLM-powered systems that are not a component of a sustainable long-term vision of a society with cheap, universally accessible AI systems that will be orders of magnitude more capable and just as good at understanding our intent.