Pushing language models beyond superhuman performance
- When explicitly optimized for a task, large deep learning systems easily achieve superhuman performance. Modern image classifiers are human-level or better; even medium-size language models are unquestionably superhuman at their core task of next-token prediction.
- The main reason language models aren't superhuman at most tasks today is that they aren't explicitly optimized for things like "be funny" or "reason like a mathematician" or "be better at programming". Those skills are learned only proximately, because they help with next-token prediction, but neural nets are lazy learners: they learn all of the easy things before any of the hard things.
- There are indicators (RLHF, task vectors) that demonstrate that once we achieve near human-level performance at a task with a particular model, and we develop some kind of discriminator or reward model for what "better performance" at that task looks like (based on human feedback or self-play), models rapidly surpass human performance for that task.
- One of the key findings of the original RLHF paper from OpenAI was that models trained with this method not only matched, but outperformed humans — something a model trained on the typical supervised or self-supervised objective cannot accomplish.
- I think we're only beginning to scratch the surface of these formulas for achieving superhuman performance by directly optimizing for specific skills. I suspect we'll see superhuman reasoning, search, and reading comprehension bootstrap themselves from existing LLMs in 2023.
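The simplest version of this pattern, optimizing directly for a skill via a discriminator rather than hoping it emerges from the next-token objective, is best-of-n selection against a reward model. A minimal sketch, where `generate` and `reward_model` are hypothetical stubs standing in for a real LLM sampler and a real learned reward model:

```python
def generate(prompt, n):
    # Hypothetical stub: sample n candidate outputs for a prompt.
    # A real system would call a language model API here.
    return [f"{prompt} {'!' * i}" for i in range(n)]

def reward_model(output):
    # Hypothetical stand-in for a learned discriminator of "better
    # performance" at the task; here it simply scores by length.
    return len(output)

def best_of_n(prompt, n=8):
    """Sample n candidates and keep the one the reward model prefers:
    selection pressure toward the skill itself, not the proxy objective."""
    return max(generate(prompt, n), key=reward_model)
```

Real RLHF goes further and fine-tunes the model's weights against the reward signal, but even this sampling-time filter shows how a discriminator can push outputs past what supervised training alone selects for.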
Tools for the million-call language model chain future
There is no shortage of tools and startups that want to help people write "language model chains" or "cascades". These tools stitch multiple calls to large language models together to perform complex high-level tasks. Beyond adding capability, making reasoning steps explicit can also make language model programs more interpretable and easier to debug.
LLM chaining tools today accommodate on the order of 10-100 calls to language models, and toward the higher end of that spectrum, these interfaces get very unwieldy to use. But it seems likely to me that, as LLM inference cost falls and latency drops, the average number of LLM inference calls in a language model cascade program is going to go up exponentially, or at least superlinearly, bottlenecked by inference compute cost and the quality of tooling.
Tools that work for 100-call model cascades are going to look very different from those designed for 1M-call model cascades, analogous to how programming languages and environments for MHz-range computers look very different from languages and tools for modern multi-core GHz-range computers. I think this is a forward-looking problem worth thinking about: what kinds of tools do we need to enable future language model programs with millions of calls to multi-modal, large generative models?
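The core shape of today's chaining tools is simple enough to sketch. A toy version, where `llm` is a hypothetical stand-in for a real model call, and the `Chain` class and its `trace` field are illustrative names, not any particular library's API:

```python
from typing import Callable, List

def llm(prompt: str) -> str:
    # Hypothetical stub for a single language model call;
    # a real chain would hit a model API here.
    return f"response({prompt})"

class Chain:
    """A toy language model chain: each step turns the running text into
    a new prompt, making intermediate reasoning explicit and inspectable."""

    def __init__(self):
        self.steps: List[Callable[[str], str]] = []
        self.trace: List[str] = []  # intermediate outputs, for debugging

    def add(self, make_prompt: Callable[[str], str]) -> "Chain":
        self.steps.append(lambda text: llm(make_prompt(text)))
        return self

    def run(self, text: str) -> str:
        self.trace = [text]
        for step in self.steps:
            text = step(text)
            self.trace.append(text)
        return text
```

A sequential list of steps like this is fine at 10-100 calls; the open question above is what replaces it when a program fans out to millions of concurrent calls, where saving a linear trace per run stops being a viable debugging story.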
Is this a component of the final system?
When prototyping towards an ambitious vision, every component of the prototype should either
- scale smoothly all the way up to become a component of the final, full-scale system, or else
- teach you so much about what to build next, so cheaply and quickly, that the component itself going to waste is worth it.
If a component falls in neither category, it's most likely unnecessary effort; perhaps even a distraction.
(The primary exception to this rule is internal tools that help you develop the real thing faster and more cheaply.)
A lot of people are building LLM-powered systems that are not a component of a sustainable long-term vision: a society with cheap, universally accessible AI systems that will be orders of magnitude more capable and just as good at understanding our intent.
There are no original ideas; only infinitely varied remixes.
This itself isn't a new idea. All thinking is creative recombination. But language model-based creative tools of today all require us to start each idea from scratch, at an empty text box.
When I think within my own mind, I build upon snippets of conversations and bits of quotes from stories I've heard, and recombine them to produce something new. The ingredients with which I think are not words or tokens, but pieces of ideas, abstract blobs of concepts and quotes from my memory. It seems obvious that we should be able to work with language models similarly, at least for creative use cases, by recombining pieces of our experience rather than typing into a mostly empty rectangle.
Instead of instructing with words or prompting with keywords, I want to bring in a passage from my favorite author and say "what does this make you think of?" I want to smash two different paragraphs about creativity from my notes together inside a neural network and see what ideas fall out. I want to paint over a model-generated image with a brush I conjured from the color palette of my favorite photograph. I want to control these models with ideas and experiences plucked from my memory, not tokens and words.
Creation is iterated refinement in an infinite option space
(an excerpt from a text conversation)
I was talking earlier today about how you can view a creation process not as additive (starting with a sentence, and then adding another and another) but as iterated refinement and filtering through an infinite option space (there are infinitely many continuations of your first sentence; how do you choose the right one?). Because LLMs can explicitly assign probabilities to possible continuations, this is an interesting way to look at writing with an AI, and AI-augmented creation in general.
In the case of writing, an interesting UI for this could be like, each paragraph you write becomes a "card", and the AI can place cards underneath a card as suggestions for alternative wordings, newly discovered images, links, etc. Cards kind of "peeking out" from under a paragraph might be a neat way to signal "there's something new here you should look at" without intruding on the writer's space when they haven't explicitly summoned help. It also makes possible this other interface that I've thought about, where you click on any paragraph's "stack of cards" and it unfurls to show you 3-4 alternatives that explore other stylistic variations or clearer wordings.
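The refinement-and-filtering view above can be sketched in a few lines. Here `continuations` is a hypothetical stub with hard-coded candidates and probabilities standing in for a real model's distribution, and `cards_for` is an illustrative name for the ranking step that would back the card UI:

```python
def continuations(sentence):
    # Hypothetical stand-in for a model's distribution over continuations;
    # a real system would sample and score these with an LLM.
    return {
        sentence + " and the rain began.": 0.5,
        sentence + " but nothing changed.": 0.3,
        sentence + " so we kept walking.": 0.2,
    }

def cards_for(sentence, keep=2):
    """Writing as filtering an option space: enumerate candidate
    continuations, rank them, and surface only the top few as "cards"
    peeking out from under the paragraph."""
    options = continuations(sentence)
    ranked = sorted(options, key=options.get, reverse=True)
    return ranked[:keep]
```

The interesting design work is in `keep` and the ranking: showing 3-4 strong alternatives rather than the single argmax is what turns generation into the kind of browsable option space the card interface imagines.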
One of the critical things that Midjourney got right, as I learned from talking to David, was targeting prosumers and "enthusiastic non-professional" users first over "pro" users. Pros may appear more lucrative in the beginning for a creative product, but they have ingrained workflows that they don't want to change. This early in the lifecycle of the technology and product, when rapid iteration is critical, Midjourney benefits enormously from the flexibility afforded by a less "pro" user base with more flexible workflows that can evolve as quickly as they can ship and experiment.
I imagine similar tailwinds benefit Replit, who seem to ship interesting future-facing ideas about how software is built faster than almost any other organization of any size.
An insight about recommender systems from someone whose name I regretfully can't recall at the moment:
Many recommender systems/algorithms model interest as a precise static region in representation space, such that the algorithm becomes about zooming in forever and ever to higher-resolution patches of this space to find exactly what the user wants. In reality, interests shift, and the recommender algorithm may influence the user's shifting interests, in addition to being informed by them. So it makes more sense to model recommendation algorithms as a thing that traverses a linked list or a branching tree evolving over time, more than a "zooming in forever" into some perfectly interest-aligned patch of the topic space.
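The contrast between "zooming in forever" and traversal can be made concrete with a toy sketch. Everything here is illustrative: `TOPIC_TREE` is a hypothetical hand-built topic hierarchy, and `choose` stands in for whatever signal captures the user's (shifting) preference at each step:

```python
# A toy topic tree: recommendation as traversal of an evolving structure,
# not an ever-finer zoom into one static region of taste space.
TOPIC_TREE = {
    "music": ["jazz", "ambient"],
    "jazz": ["bebop", "modal jazz"],
    "ambient": ["drone", "field recordings"],
}

def recommend_path(start, choose, depth=2):
    """Walk the tree, letting the user's choice at each node steer
    the path; the path itself, not a fixed point, is the interest."""
    path = [start]
    node = start
    for _ in range(depth):
        children = TOPIC_TREE.get(node, [])
        if not children:
            break
        node = choose(children)
        path.append(node)
    return path
```

In the static-region view, the same user is a single coordinate to be located ever more precisely; in this view, each step both reads the user's interest and can nudge where it goes next.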
I noted once that the most interesting potential for virtual/mixed reality wasn't to put yourself in a virtual office or the ocean floor; it was that you could experience entirely different worlds with different physics, where time flows differently, where acoustics mutate as sound waves fly through the air. In VR, you could move through scales of experience, from nanometers to miles, as easily as you move a few feet through space in the real world.
I feel a similar sense of underexplored potential in large generative models for images and text. We can use these models to render anything at all, tell any story at all, invent any language, create any soundscape ... and we use these dream machines mostly to render simulacra of reality with the details swapped around.
Endowed with the magic to immerse ourselves in worlds of our own making and languages of our own creation, we are so eager to rebuild worlds that already constrain us, speaking languages just as familiar as our own. We are given the power to imagine anything, and we imagine the here and now. Why?
All around us, there are other worlds blooming, if we only looked a bit closer.