There are so many "Prompt-Ops" tools and I'm sold on none of them
Since early this year, I've had the fortune of speaking with many founders working on LLM-related tools that I'll broadly call "Prompt-Ops" tools. These tools are targeted at devs building products on top of LLMs, and aim to do any of the following:
- Proxy, log, and cache requests to LLM inference providers for cost savings and observability
- Provide a better UI to iterate on prompts
- Provide a simpler technical framework (utility library, web API wrapper) to call prompt chains as "LLM microservices"
- Help version control and collaborate on prompts in a team
- Improve workflows for evaluating LLM outputs across models and prompts
- Gather human feedback and continuously fine-tune models for cost savings and output quality
- Integrate with vector databases to deliver easier and higher-quality retrieval-augmented text generation
The best known example of this class of tool is LangChain, which does many, but not all, of the above. There are dozens more startups and open-source libraries trying to offer similar value propositions.
I haven't really found any of them appealing enough to adopt, let alone pay for. I'm not yet sure whether this is because the space is nascent and the founders don't yet know what problem they're solving, or because the use cases most familiar to me (research and potential adoption at work) aren't the ones these products target. I do feel that most offerings in this space are built for side-project builders and tinkerers rather than people building products with real users.
Regardless, for the benefit of anyone thinking about building tools in this space, I thought I'd share some notes I took after a bunch of these conversations.
- All existing solutions feel like “LLM app frameworks” that assume the core of a service is a call to an LLM chain and require business logic to be plugged into it, rather than letting business logic call into LLMs as needed. This makes them hard for most existing SaaS codebases/companies to adopt, because most apps are not built around a chain of LLM calls.
- All existing solutions try to offer some kind of UX for evaluating model output, but I haven’t seen any with a rigorous quantitative approach to evals, nor anything better than a giant spreadsheet for qualitative evals (say, an embedding-space visualization of model outputs clustered by error type or failure mode). Humanloop has the most reasonable evaluation UI I've seen so far, but it's aimed at human feedback in production, not frequent feedback/iteration during development cycles.
- These tools frequently want to sit in the critical path of production workloads (as a proxy between users and inference providers), which introduces latency and security tradeoffs that are sometimes unacceptable. It seems better to, e.g., let developers use the tool to push final prompts and model configs to their codebase via GitHub, and then let them call providers directly.
- What I really need as someone building a product atop LLMs is:
- Reliable, insightful evaluations that are easy to run and interpret
	- Better version control for prompts, integrating them into a codebase as if they were just source code, because they are (for example, a tool that continuously commits prompt changes to a folder in my project's GitHub repo)
	- Better composition/reuse of “shared prompt logic”
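To make the last two points concrete, here's a minimal sketch of what "prompts as source code" plus shared prompt logic could look like. Everything in it is hypothetical — the `prompts/` folder layout, the fragment names, and `compose_prompt` are illustrative, not any real tool's API:

```python
# Prompts live as plain files checked into the repo, shared fragments are
# composed at call time, and business logic calls the provider directly --
# no proxy sitting between the app and the inference API.
# All names here (prompts/ layout, compose_prompt) are made up for illustration.
from pathlib import Path
from string import Template
import tempfile

# Stand-in for a prompts/ folder that would normally be version-controlled.
repo = Path(tempfile.mkdtemp())
(repo / "prompts").mkdir()
(repo / "prompts" / "tone.txt").write_text(
    "You are a concise, friendly support assistant.\n"
)
(repo / "prompts" / "summarize.txt").write_text(
    "$tone\nSummarize the following ticket in one sentence:\n$ticket\n"
)

def compose_prompt(name: str, **params: str) -> str:
    """Load a prompt file, substituting shared fragments and runtime params."""
    shared = {"tone": (repo / "prompts" / "tone.txt").read_text().strip()}
    template = Template((repo / "prompts" / name).read_text())
    return template.substitute({**shared, **params})

prompt = compose_prompt("summarize.txt", ticket="My login link expired.")
print(prompt)
# From here, the app would call its inference provider directly with `prompt`;
# the tooling's job ends at keeping prompts versioned and composable.
```

Because prompts are just files in the repo, the usual workflow comes for free: diffs in code review, `git blame` on a prompt change, and rollbacks without touching a third-party dashboard.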
Nobody has really pushed beyond the basics on any of these three fronts, as far as I can tell. (If you are, get in touch!)