Notes on WebTexture
I don't think I have time to build this. If you're interested in building this, email me (linus @ my main website) and I'm down to help.
- There's a lot of value in, and demand for, language models capable of performing tasks on the Web by acting as if it were a human taking actions in a browser. Many AI labs and startups are working on this.
- All existing approaches seem to build these action-taking models from HTML representations of webpages, which has obvious long-term limitations for visually rich websites and uses an LM's context window poorly, spending tokens on markup characters and non-semantic text.
- HTML also doesn't extend to native software, which is still where a lot of enterprise automation value lies.
- The Web is a reinforcement learning-hostile environment, because (1) it's extremely diverse, and (2) reward signals are often far removed from actions that cause them.
- On the other hand, collecting high-quality human demonstrations (labelled action trajectories) for fine-tuning is very expensive.
What we need is a multimodal model that can learn about webpages in a self-supervised way against an Internet-scale dataset, which can enable sample-efficient training of action-taking models downstream.
- WebTexture is a medium-size (~20-50B params) pretrained multimodal (image + text) representation model for web content that maps a screenshot and displayed text on a webpage into a unified quantized token sequence for downstream tasks.
- TL;DR — Webpage screenshot + displayed or OCR'd text goes into the model, and the model translates it seq2seq style into a heterogeneous stream of learned "style tokens" (representing non-linguistic information like visual style, UI element type, position, etc.) as well as word tokens. Action-taking models then consume this token stream and output actions against specific UI elements on the page.
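To make the interface concrete, here's a minimal sketch of what that heterogeneous token stream might look like. Everything here (the `StyleToken`/`WordToken` names, the style vocabularies) is hypothetical; the real quantized vocabulary would be learned, not hand-written.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class StyleToken:
    """Non-linguistic token: visual style, UI element type, position, etc."""
    kind: str   # e.g. "ui_type", "position", "font_weight" (made-up buckets)
    value: str  # quantized value, e.g. "button", "bottom_right", "bold"

@dataclass(frozen=True)
class WordToken:
    """A displayed or OCR'd word from the page."""
    text: str

Token = Union[StyleToken, WordToken]

# A "Sign in" button might come out as a short run of style tokens
# followed by its visible text:
stream: list[Token] = [
    StyleToken("ui_type", "button"),
    StyleToken("position", "bottom_right"),
    StyleToken("font_weight", "bold"),
    WordToken("Sign"),
    WordToken("in"),
]

def render(tokens: list[Token]) -> str:
    """Flatten the mixed stream into a readable debug string."""
    parts = []
    for t in tokens:
        if isinstance(t, StyleToken):
            parts.append(f"<{t.kind}={t.value}>")
        else:
            parts.append(t.text)
    return " ".join(parts)
```

A downstream action model would consume a stream like this and emit actions targeting specific UI elements (here, the button).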
Proposed architecture and training objective (initial noob guess)
- A two-headed "hydra" transformer:
- Seq2seq language model backbone, which can be initialized from e.g. UL2-20B.
- Text tokens cross-attend to ViT patch embeddings with position biases so that patches where the text occurs on screen are weighted higher. (But I feel like position biases can be learned?)
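A rough NumPy sketch of that cross-attention, with a fixed distance-based position bias standing in for whatever the real model would use (per the parenthetical above, a learned bias over distance would be the natural upgrade). All shapes and the `scale` knob are assumptions.

```python
import numpy as np

def cross_attention_with_position_bias(text_q, patch_kv, text_pos, patch_pos,
                                       scale=4.0):
    """
    text_q:    (T, d) query vectors for text tokens
    patch_kv:  (P, d) key/value vectors for ViT patch embeddings
    text_pos:  (T, 2) normalized (x, y) screen position of each text token
    patch_pos: (P, 2) normalized (x, y) center of each patch
    Returns (T, d) attended outputs. The bias upweights patches spatially
    near where the text is rendered; `scale` controls how sharp that is.
    """
    d = text_q.shape[1]
    logits = text_q @ patch_kv.T / np.sqrt(d)  # (T, P) content scores
    # Squared screen distance between each text token and each patch center
    dist2 = ((text_pos[:, None, :] - patch_pos[None, :, :]) ** 2).sum(-1)
    logits = logits - scale * dist2            # nearer patches score higher
    # Softmax over patches, numerically stabilized
    weights = np.exp(logits - logits.max(-1, keepdims=True))
    weights = weights / weights.sum(-1, keepdims=True)
    return weights @ patch_kv
```

With zeroed content scores, a text token's output is dominated by the patch it sits on top of, which is the intended inductive bias.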
- Closest thing I've seen in published literature is Meta's CommerceMM, which is a multimodal representation model for e-commerce product listings.
- Meta has also done CM3, which learns a fill-in-the-middle GPT over multimodal web data and shows some great zero-shot capabilities.
- This model can be trained on a number of self-supervised objectives. The most obvious is masked language modeling, but I can also imagine doing something like shuffling some of the patch-text pairs and having the model discriminate which are mispaired.
- It's important here that the model is trained on something that makes it learn visual detail, not just text content.
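A sketch of how that mispairing objective could construct training batches. The data layout (a list of patch/text pairs per page) and `shuffle_frac` are assumptions; the point is that shuffled texts are rotated among the chosen indices so every "mispaired" example is genuinely mispaired.

```python
import random

def make_mispairing_batch(pairs, shuffle_frac=0.3, rng=None):
    """
    pairs: list of (patch_embedding, text_span) tuples from one page.
    Returns (examples, labels): label 1 means the patch and text genuinely
    co-occur, 0 means the text was swapped in from elsewhere on the page.
    A discriminator head on the backbone predicts the label, which forces
    the model to check visual detail against the text, not just read it.
    """
    rng = rng or random.Random(0)
    n = len(pairs)
    # Pick at least 2 indices so rotation never maps an index to itself
    k = min(n, max(2, int(shuffle_frac * n)))
    chosen = sorted(rng.sample(range(n), k))
    examples, labels = [], []
    for i, (patch, text) in enumerate(pairs):
        if i in chosen:
            # Rotate texts among the chosen indices: each gets a wrong text
            j = chosen[(chosen.index(i) + 1) % len(chosen)]
            examples.append((patch, pairs[j][1]))
            labels.append(0)
        else:
            examples.append((patch, text))
            labels.append(1)
    return examples, labels
```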
- There are good heuristics for web content segmentation into semantic "blocks", like looking at scroll and layout boundaries in CSS, the browser's compositor layers ("z-index" layers), etc.
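One of those heuristics might look like this, operating over a flat list of rendered-DOM node summaries. The node fields (`top`, `layer`) and the gap threshold are made up for illustration; a real pipeline would pull them from the browser's layout and compositor state.

```python
def segment_blocks(nodes):
    """
    nodes: flat list of dicts describing rendered DOM nodes in document
    order, with hypothetical fields like {"text": ..., "top": px offset,
    "layer": compositor layer id}. Groups consecutive nodes into semantic
    "blocks", starting a new block at a compositor-layer change or a large
    vertical gap (a crude stand-in for scroll/layout boundaries).
    """
    GAP_PX = 40  # assumed threshold for what counts as a layout boundary
    blocks, current = [], []
    prev = None
    for node in nodes:
        boundary = prev is not None and (
            node["layer"] != prev["layer"]          # new compositor layer
            or node["top"] - prev["top"] > GAP_PX   # big vertical jump
        )
        if boundary and current:
            blocks.append(current)
            current = []
        current.append(node)
        prev = node
    if current:
        blocks.append(current)
    return blocks
```

Blocks like these could then define the patch/text pairing granularity for the objectives above.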