A random idea I've been thinking about: heterogeneous-media shared documents as a collaboration medium between human orchestrators and LLM-style agents.
I sort of mentioned this in passing in a past blog post. I think it would be interesting to build an assistant experience where tasks are framed as the model filling out a worksheet or modifying some shared, persistent environment alongside the user.
So instead of "Can you book me a flight for X at Y?" you're just co-authoring a "trip planning" doc with the assistant, and e.g. you may tag the assistant under a "flights" section, and the assistant gets to work, using appropriate context about trip schedules and goals elsewhere in the doc. It can obviously put flight info into the doc in response, but if it needs to clarify things or put additional info/metadata/"show work" or share progress updates as it does stuff, a shared doc provides a natural place for those things without having to do sync communication with the user over the rather constrained text chat interface.
A "document" vs "message thread" is also more versatile because the model can use it as a human-readable place to put its long-term knowledge. e.g. keeping track of a list of reminders to check every hour, a place for it to write down things it learns about the user's preferences such that it's user-editable/auditable, etc. Microsoft's defunct Cortana assistant had a "Notebook" feature that worked this way, and I thought still think it was a good idea.
The heterogeneous media part is where I think it would be extra useful if this weren't just a Google Doc, but accommodated "rich action cards" like what you see in actionable mobile notifications or in MercuryOS. The model could do things like embed a Linear ticket or create an interactive map route preview, such that those "cards" are fully interactive embeds but remain legible to the language model under the hood.
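As a very rough sketch of what "legible to the language model" could mean mechanically, here's one possible shape for such a card, with all names and fields invented for illustration: the interactive embed renders from structured payload data, while a flattened text form is what the model reads when it scans the doc.

```python
from dataclasses import dataclass


@dataclass
class ActionCard:
    kind: str          # e.g. "linear_ticket" or "map_route"
    payload: dict      # structured data the interactive embed renders from
    summary: str       # short description readable by both humans and the model
    author: str = "assistant"

    def to_model_text(self) -> str:
        # Flattened form the language model sees when it reads the shared doc.
        fields = ", ".join(f"{k}={v}" for k, v in self.payload.items())
        return f"[{self.kind} card by {self.author}: {self.summary} ({fields})]"


card = ActionCard(
    kind="map_route",
    payload={"from": "hotel", "to": "IKEA", "mode": "transit"},
    summary="route preview for the store run",
)
print(card.to_model_text())
```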
To make good tools, think of yourself as a toy maker.
Idea: Obscura
Synopsis:
- Instagram circa early 2010s, but instead of filters that apply color changes from a LUT or something, each filter is a different direction in the Stable Diffusion latent space.
- Main technical challenge here would be editing real images while preserving fidelity to the original, especially on things like human faces, but I think that will be a solved problem within a few months. Imagic achieves this, sort of, but takes a long time (~5-6 minutes on a top-of-the-line datacenter GPU). A rough sketch of the latent-direction filter mechanic is below.
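Minimal sketch of the mechanic using the Stable Diffusion VAE from Hugging Face diffusers: encode the photo to a latent, nudge the latent along a per-filter direction, and decode. The direction tensor here is a placeholder (a real one would be learned offline, e.g. from styled/unstyled image pairs), and the VAE latent is only a stand-in for whichever latent space actually carries the style.

```python
import numpy as np
import torch
from PIL import Image
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

def to_tensor(img: Image.Image) -> torch.Tensor:
    x = np.asarray(img.convert("RGB").resize((512, 512)), dtype=np.float32)
    return torch.from_numpy(x / 127.5 - 1.0).permute(2, 0, 1).unsqueeze(0)  # [1, 3, 512, 512]

@torch.no_grad()
def apply_filter(img: Image.Image, direction: torch.Tensor, strength: float = 1.0) -> np.ndarray:
    latent = vae.encode(to_tensor(img)).latent_dist.mean   # [1, 4, 64, 64]
    latent = latent + strength * direction                 # each "filter" is one direction
    out = vae.decode(latent).sample.clamp(-1, 1)[0]
    return ((out.permute(1, 2, 0) + 1) * 127.5).byte().numpy()

# Placeholder direction; a real filter's direction would come from example pairs.
blank_filter = torch.zeros(1, 4, 64, 64)
```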
I’ve been thinking a lot this week about Paper (the classic iPad drawing app). They gave you a color palette of like 15 colors, the simplest color mixer in the world (and no other color picker), and like 5 "brushes": pencil, pen, watercolor, and so on. Super simple. As simple as possible. But it was very hard to make something that looked ugly using those tools. What if you could build images by capturing the basics of an idea from a napkin sketch or something in front of you, and then use like 5-10 "style filters" in the Stable Diffusion latent space to "artsify" it in a way that was hard to mess up?
Inspirations from Paper by Fiftythree
- The experience of reading, ideation, craft, creation deserves a beautiful interface, not just a functional one. The interface should inspire creation, not intimidate with a blank canvas.
- Paper is great because it offers a simple, minimal set of tools that each work beautifully without needing or affording customization, but combine together to form a cohesive toolkit. Attribute vectors in a latent space UI should feel similar: a small default set of abstractions that don't need customization, but inspire beautiful creation and provide a huge expressive range. This isn't so much about PCA or anything quantitatively computed as it is a matter of taste: what minimal set of thoughtfully designed tools makes people unafraid to create, because an unassuming brushstroke can yield something beautiful that they can be proud of?
- Paper's color mixing wheel is a beautiful interface. It lets users traverse a latent space without explicitly constructing or visualizing the space or attribute vectors, by simply picking from a palette or sampling colors from their canvas. We can imagine implementing something similar for images or language using generative models, moving around in latent space by swirling and remixing images and texts we come across in life or across the web.
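A tiny sketch of that "mixing" gesture, reduced to its core: blend two items by interpolating between their embeddings. The embeddings below are random stand-ins; in a real version they'd come from an encoder (e.g. CLIP) and be decoded back into an image or text by a generative model.

```python
import torch

def slerp(a: torch.Tensor, b: torch.Tensor, t: float) -> torch.Tensor:
    # Spherical interpolation, which tends to stay closer to the data manifold than a lerp.
    a_n, b_n = a / a.norm(), b / b.norm()
    omega = torch.acos((a_n * b_n).sum().clamp(-1.0, 1.0))
    if omega.abs() < 1e-6:
        return (1 - t) * a + t * b
    return (torch.sin((1 - t) * omega) * a + torch.sin(t * omega) * b) / torch.sin(omega)

emb_a = torch.randn(768)            # stand-in: embedding of a photo I took
emb_b = torch.randn(768)            # stand-in: embedding of a painting I came across
remixed = slerp(emb_a, emb_b, 0.3)  # swirl 30% of b into a
```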
Good food for thought at a time of explosive experimentation and exploration in creative interfaces using generative AI models.
The degree to which just scaling up compute (read: money) in language models solves problems that architectural tweaks can't... is truly annoying sometimes.
Oral tradition is the more canonical form of language — written came later. Maybe that means Glyph should primarily be situated in context and oral/linear-first?
Text-to-image generative models as a kind of "augmented imagination" rather than "augmented creativity" or "augmented craft".
Posit: Higher intelligence is being able to learn in-context the arbitrary classes of patterns/functions that a less intelligent system needs to be directly optimized ("trained") to learn.
For example:
- Larger transformers can in-context learn things smaller models need to be trained to know
- Extrapolating, super-human general intelligence will be able to learn in-context things that take humans lots of iterations to learn.
- The set of things that take humans many iterations to learn includes the ability to autoregressively predict (imitate) another human. Thus, a sufficiently intelligent AGI will be able to mimic any human with arbitrary accuracy by learning from "in-context" observations.
Being able to in-context learn an "arbitrary class of functions" here seems pretty profound. It doesn't just include language, but also models of environments and policies (Gato), arbitrary linear functions, etc.
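For the "arbitrary linear functions" case specifically, the experimental setup (as in published work on in-context regression) is easy to sketch: each prompt encodes a fresh linear function through example (x, f(x)) pairs, and the model must predict f at a query point with no weight updates. The exact packing below is just one assumed format; the transformer itself is left out.

```python
import torch

def linear_icl_prompt(dim: int = 8, k: int = 16):
    w = torch.randn(dim)                  # a fresh linear function for this prompt
    xs = torch.randn(k + 1, dim)          # k in-context examples plus one query point
    ys = xs @ w
    # Pack (x, f(x)) pairs into one flat context sequence the model conditions on.
    context = torch.cat([torch.cat([x, y.view(1)]) for x, y in zip(xs[:-1], ys[:-1])])
    return context, xs[-1], ys[-1]        # model sees (context, query), must predict the target

ctx, query, target = linear_icl_prompt()
```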
Notes on WebTexture
I don't think I have time to build this. If you're interested in building this, email me (linus @ my main website) and I'm down to help.
Premise
- There's a lot of value in, and demand for, language models capable of performing tasks on the Web by acting as if it were a human taking actions in a browser. Many AI labs and startups are working on this.
- All existing approaches seem to build these action-taking models from HTML representations of webpages, which has obvious long-term limitations for visually rich websites and uses an LM's context window poorly, spending it on markup characters and non-semantic text.
- HTML also doesn't extend to native software, which is still where a lot of enterprise automation value lies.
- The Web is a reinforcement learning-hostile environment, because (1) it's extremely diverse, and (2) reward signals are often far removed from actions that cause them.
- On the other hand, collecting high quality human data points for fine-tuning on labelled action trajectories is very expensive.
What we need is a multimodal model that can learn about webpages in a self-supervised way against an Internet-scale dataset, which can enable sample-efficient training of action-taking models downstream.
Idea
- WebTexture is a medium-size (~20-50B params) pretrained multimodal (image + text) representation model for web content that maps a screenshot and displayed text on a webpage into a unified quantized token sequence for downstream tasks.
- TL;DR — Webpage screenshot + displayed or OCR'd text goes into the model, and the model translates it seq2seq style into a heterogeneous stream of learned "style tokens" (representing non-linguistic information like visual style, UI element type, position, etc.) as well as word tokens. Action-taking models then consume this token stream and output actions against specific UI elements on the page.
WebTexture input
[screenshot patch 1]<bos>click for accessibility<eos>
[screenshot patch 2]<bos>Welcome to IKEA<eos>
[screenshot patch 3]<bos>about<eos><bos>contact us<eos>
WebTexture output
[page header]<bos>Welcome to IKEA<eos>
[about link]<bos>about<eos>
[generic button]<bos>contact us<eos>
Action model input
<bos>Ask if IKEA has the Blahaj in stock.<eos><WebTexture output>
Action model output
[click]contact us
JavaScript
$('button:contains("contact us")').click()
Proposed architecture and training objective (initial noob guess)
- A two-headed "hydra" transformer:
- Seq2seq language model backbone, which can be initialized from e.g. UL2-20B.
- Text tokens cross-attend to ViT patch embeddings with position biases, so that patches where the text occurs on screen are weighted more heavily. (But I feel like the position biases could just be learned? A rough sketch of this cross-attention is after this list.)
- Closest thing I've seen in published literature is Meta's CommerceMM, which is a multimodal representation model for e-commerce product listings.
- Meta has also done CM3, which trains a fill-in-the-middle GPT over multimodal web data and shows some great zero-shot capabilities.
- This model can be trained on a number of self-supervised objectives. The most obvious is masked language modeling, but I can also imagine doing something like shuffling some of the patch-text pairs and having the model discriminate which are mispaired.
- It's important here that the model is trained on something that makes it learn visual detail, not just text content.
- There are good heuristics for web content segmentation into semantic "blocks", like looking at scroll and layout boundaries in CSS, the browser's compositor layers ("z-index" layers), etc.
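A very rough sketch of the cross-attention-with-position-bias idea from above, with all dimensions and the distance-based parameterization of the bias guessed rather than designed: text tokens query over ViT patch embeddings, and a learned scalar downweights patches far from where the text sits on screen.

```python
import torch
import torch.nn as nn


class TextToPatchCrossAttention(nn.Module):
    def __init__(self, d_model: int = 768):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        # Learned scalar controlling how strongly on-screen proximity biases attention.
        self.distance_weight = nn.Parameter(torch.tensor(1.0))

    def forward(self, text_tokens, patch_embs, text_xy, patch_xy):
        # text_tokens: [B, T, d], patch_embs: [B, P, d]
        # text_xy: [B, T, 2], patch_xy: [B, P, 2] -- normalized on-screen coordinates
        q, k, v = self.q(text_tokens), self.k(patch_embs), self.v(patch_embs)
        logits = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5  # [B, T, P]
        dist = torch.cdist(text_xy, patch_xy)                  # distance from each token to each patch
        logits = logits - self.distance_weight * dist          # nearby patches get a boost
        return logits.softmax(dim=-1) @ v                      # text tokens enriched with visual context


layer = TextToPatchCrossAttention()
fused = layer(torch.randn(2, 32, 768), torch.randn(2, 256, 768),
              torch.rand(2, 32, 2), torch.rand(2, 256, 2))
```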