There are so many "Prompt-Ops" tools and I'm sold on none of them
Since early this year, I've had the fortune of speaking with many founders working on LLM-related tools that I'll broadly call "Prompt-Ops" tools. These tools are targeted at devs building products on top of LLMs, and aim to do any of the following:
- Proxy, log, and cache requests to LLM inference providers for cost savings and observability
- Provide a better UI to iterate on prompts
- Provide a simpler technical framework (utility library, web API wrapper) to call prompt chains as "LLM microservices"
- Help version control and collaborate on prompts in a team
- Improve workflows for evaluating LLM outputs across models and prompts
- Gather human feedback and continuously fine-tune models for cost savings and output quality
- Integrate with vector databases to deliver easier and higher-quality retrieval-augmented text generation
The best known example of this class of tool is LangChain, which does many, but not all, of the above. There are dozens more startups and open-source libraries trying to offer similar value propositions.
I haven't really found any of them appealing enough to adopt, let alone pay for. I'm not yet sure if this is because this space is nascent and the founders don't know what problem they're solving, or because the use cases most familiar to me (research and potential adoption at work) aren't the ones targeted by these products. I do feel that most offerings in this space are built for side project builders and tinkerers rather than people building products with real users.
Regardless, for the benefit of anyone thinking about building tools in this space, I thought I'd share some notes I took after a bunch of these conversations.
- All existing solutions feel like “LLM app frameworks”: they assume the core of a service is a call to an LLM chain and require business logic to be plugged into it, rather than letting business logic call into LLMs as needed. This makes them hard for most existing SaaS codebases/companies to adopt, because most apps are not built around a chain of LLM calls.
- All existing solutions try to offer some kind of UX for evaluating model output, but I haven't seen any with a rigorous quantitative approach to evals, nor any that offer something better for qualitative evals than a giant spreadsheet (say, an embedding-space visualization of model outputs clustered by error type or failure mode). Humanloop has the most reasonable evaluation UI I've seen so far, but it's built for human feedback in production, not frequent feedback and iteration during development.
- These tools frequently want to be in the critical path of production workloads (proxy between users and inference providers), which is a big, sometimes unacceptable latency and security tradeoff. It seems better to e.g. let developers use the tool to push final prompts and model configs to their codebase via GitHub and then let them call providers directly.
- What I really need as someone building a product atop LLMs is:
- Reliable, insightful evaluations that are easy to run and interpret
- Better version control for prompts that integrates them into a codebase as if prompts were just source code, because they are (an example might be something that continuously commits prompt changes to a folder in my project's GitHub repo)
- Better composition and reuse of “shared prompt logic”
Nobody has really pushed beyond the basics on any of those 3 fronts, as far as I can tell. (If you are, get in touch!)
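To make the second and third wishes concrete, here's a minimal sketch of what prompts-as-source-code with shared fragments might look like. Everything here is hypothetical: the `prompts/` folder layout, the `{shared/...}` include syntax, and the `load_prompt` helper are illustrative choices, not any existing tool's API.

```python
# A minimal sketch of "prompts as source code": prompt text lives in
# version-controlled files, and shared prompt logic is composed by
# simple textual inclusion. Folder layout and syntax are hypothetical.
from pathlib import Path
import tempfile

# Stand-in for a prompts/ folder checked into the repo.
repo = Path(tempfile.mkdtemp())
(repo / "shared").mkdir()
(repo / "shared" / "tone.txt").write_text("Answer concisely. Cite sources.")
(repo / "summarize.txt").write_text(
    "{shared/tone}\n\nSummarize the following document:\n{document}"
)

def load_prompt(name: str, **params: str) -> str:
    """Load a prompt file, splice in shared fragments, then fill parameters."""
    text = (repo / f"{name}.txt").read_text()
    # Resolve {shared/...} includes before filling user-supplied parameters.
    for frag in (repo / "shared").glob("*.txt"):
        text = text.replace(f"{{shared/{frag.stem}}}", frag.read_text())
    return text.format(**params)

prompt = load_prompt("summarize", document="Q3 earnings call transcript...")
print(prompt.splitlines()[0])  # first line comes from the shared fragment
```

The point of the sketch is that prompts diff, review, and version exactly like code, and a shared fragment edited once propagates everywhere it's included, which is the kind of composition I mean.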
Expertise is having knowledge in your head that allows you to better read the world. — Fred Hebert
He knows that there are in the soul tints more bewildering, more numberless, and more nameless than the colours of an autumn forest […] Yet he seriously believes that these things can every one of them, in all their tones and semitones, in all their blends and unions, be accurately represented by an arbitrary system of grunts and squeals. He believes that an ordinary civilized stockbroker can really produce out of his own inside noises which denote all the mysteries of memory and all the agonies of desire.
— G.K. Chesterton
Pixel space is the ultimate IO surface between computers and humans — everything that isn't auditory or tactile ultimately gets flattened to raster images before it's consumed by our senses. It feels like a big part of my job in AI interface design these days is collapsing the layers of abstraction between pixel space and deep models' representations.
Most notable among these layers of abstraction between pixels and raw information is writing, from natural language to formal notations. The best notations flatten key operations into transformations in pixel space.
An underrated benefit of direct manipulation interfaces is low-friction iteration and exploration of the possibility space, where low friction means:
- Intuitive
- Fast, with immediate feedback
- Reversible and interruptible, so easy to undo/redo
It might even be fair to say that direct manipulation is only instrumental to the more fundamental end goal of low-friction iteration and exploration.
Lots of people and companies seem to be doing a lot of redundant work to make easy things easier for building with LLMs, but what I'm really interested in are the tools that make hard things easy and previously impossible things possible.
Prepping a talk for an undergraduate fellowship program, I'm struck by the duality of being just experienced enough to have small nuggets of knowledge I think are worth sharing, but not nearly experienced enough to have the conviction to preach them like gospel. Just a series of personal anecdotes.