Linus's stream

We are beginning to enter the alchemy stage of our relationship with intelligence. Soon, we will have chemistry.

In search of transcendent conversations; only finding email

Bleep bloop next token bleep bloop

After much careful study I've come to the conclusion that the E line is the best NYC subway line, followed closely by the F, L, and 6.

There are so many "Prompt-Ops" tools and I'm sold on none of them

Since early this year, I've had the fortune of speaking with many founders working on LLM-related tools that I'll broadly call "Prompt-Ops" tools. These tools are targeted at devs building products on top of LLMs, and aim to do any of the following:

  • Proxy, log, and cache requests to LLM inference providers for cost savings and observability
  • Provide a better UI to iterate on prompts
  • Provide a simpler technical framework (utility library, web API wrapper) to call prompt chains as "LLM microservices"
  • Help version control and collaborate on prompts in a team
  • Improve workflows for evaluating LLM outputs across models and prompts
  • Gather human feedback and continuously fine-tune models for cost savings and output quality
  • Integrate with vector databases to deliver easier and higher-quality retrieval-augmented text generation

The best known example of this class of tool is LangChain, which does many, but not all, of the above. There are dozens more startups and open-source libraries trying to offer similar value propositions.

I haven't really found any of them appealing enough to adopt, let alone pay for. I'm not yet sure whether this is because the space is nascent and founders don't yet know what problem they're solving, or because the use cases most familiar to me (research, and potential adoption at work) aren't the ones these products target. I do feel that most offerings in this space are built for side-project builders and tinkerers rather than for people building products with real users.

Regardless, for the benefit of anyone thinking about building tools in this space, I thought I'd share some notes I took after a bunch of these conversations.

  • All existing solutions feel like “LLM app frameworks” that assume that the core of a particular service is a call to an LLM chain, and require business logic to be plugged into it, rather than the business logic calling into LLMs as needed. This makes it hard for most existing SaaS codebases/companies to adopt, because most apps are not built around a chain of LLM calls.
  • All existing solutions try to offer some kind of UX for evaluating model output, but I haven't seen any that take a rigorous quantitative approach to evals, nor any that offer something better for qualitative evals than a giant spreadsheet (say, an embedding-space visualization of model outputs clustered by error types or failure modes). Humanloop has the most reasonable evaluation UI I've seen so far, but it's built for human feedback in production, not frequent feedback and iteration during development cycles.
  • These tools frequently want to be in the critical path of production workloads (proxy between users and inference providers), which is a big, sometimes unacceptable latency and security tradeoff. It seems better to e.g. let developers use the tool to push final prompts and model configs to their codebase via GitHub and then let them call providers directly.
  • What I really need as someone building a product atop LLMs is:
    1. Reliable, insightful evaluations that are easy to run and interpret
    2. Better version control for prompts that integrates them into a codebase as if prompts were just source code, because they are (for example, something that continuously commits prompt changes to a folder in my project's GitHub repo)
    3. Better composition and reuse of “shared prompt logic”

Nobody has really pushed beyond the basics on any of those 3 fronts, as far as I can tell. (If you are, get in touch!)
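To make the "prompts are just source code" point concrete, here's a minimal sketch of what that could look like: prompt templates living in a versioned folder in the repo, loaded like any other source file, with a shared preamble composed into task prompts. All names here (`prompts/`, `load_prompt`, `build_prompt`) are hypothetical illustrations, not any existing tool's API.

```python
# Sketch: prompts as versioned source files, with shared prompt logic
# composed at call time. The folder layout and function names are
# hypothetical, just to illustrate the workflow.
from pathlib import Path
from string import Template

PROMPT_DIR = Path("prompts")  # hypothetical folder, committed to the repo
PROMPT_DIR.mkdir(exist_ok=True)

# In practice this file would be edited and committed like any source file;
# we write it here so the sketch is self-contained.
(PROMPT_DIR / "summarize.txt").write_text(
    "Summarize the following text in one sentence:\n$text"
)

# "Shared prompt logic": a reusable preamble composed into every task prompt.
SYSTEM_PREAMBLE = "You are a concise assistant."

def load_prompt(name: str) -> Template:
    """Read a prompt template from the repo folder, like any source file."""
    return Template((PROMPT_DIR / f"{name}.txt").read_text())

def build_prompt(name: str, **fields: str) -> str:
    """Compose the shared preamble with a task-specific template."""
    return SYSTEM_PREAMBLE + "\n\n" + load_prompt(name).substitute(fields)

prompt = build_prompt("summarize", text="LLM tooling is fragmented.")
```

Because the templates are plain files in the repo, diffs, reviews, and rollbacks of prompts come for free from git, and the app calls inference providers directly rather than through a third-party proxy.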

Expertise is having knowledge in your head that allows you to better read the world. — Fred Hebert

He knows that there are in the soul tints more bewildering, more numberless, and more nameless than the colours of an autumn forest […] Yet he seriously believes that these things can every one of them, in all their tones and semitones, in all their blends and unions, be accurately represented by an arbitrary system of grunts and squeals. He believes that an ordinary civilized stockbroker can really produce out of his own inside noises which denote all the mysteries of memory and all the agonies of desire.

— G.K. Chesterton

Pixel space is the ultimate IO surface between computers and humans — everything that isn't auditory or tactile ultimately gets flattened to raster images before it's consumed by our senses. It feels like a big part of my job in AI interface design these days is collapsing the layers of abstraction between pixel space and deep models' representations.

Most notable among the layers of abstraction between pixels and raw information is writing, from written natural language to formal notations. The best notations flatten key operations into transformations in pixel space.

If ChatGPT is the command-line interface, where's the multi-touch?

An underrated benefit of direct manipulation interfaces is low-friction iteration and exploration of the possibility space, where low friction means:

  • Intuitive
  • Fast, with immediate feedback
  • Reversible and interruptible, making it easy to undo and redo

It might even be fair to say that direct manipulation is only instrumental to the more fundamental end goal of low-friction iteration and exploration.