the stream

2022/11/18 7:21

Notes on WebTexture

I don't think I have time to build this. If you're interested in building this, email me (linus @ my main website) and I'm down to help.

Premise

There's a lot of value in, and demand for, language models capable of performing tasks on the Web by acting as if they were humans taking actions in a browser. Many AI labs and startups are working on this.
All existing approaches seem to build these action-taking models from HTML representations of webpages, which has obvious long-term limitations for visually rich websites and uses an LM's context length poorly, since much of it is spent on special code characters and non-semantic text.
- HTML also doesn't extend to native software, which is still where a lot of enterprise automation value lies.
The Web is a reinforcement learning-hostile environment, because (1) it's extremely diverse, and (2) reward signals are often far removed from actions that cause them.
On the other hand, collecting high quality human data points for fine-tuning on labelled action trajectories is very expensive.

What we need is a multimodal model that can learn about webpages in a self-supervised way against an Internet-scale dataset, which can enable sample-efficient training of action-taking models downstream.

Idea

WebTexture is a medium-size (~20-50B params) pretrained multimodal (image + text) representation model for web content that maps a screenshot and displayed text on a webpage into a unified quantized token sequence for downstream tasks.
TL;DR — Webpage screenshot + displayed or OCR'd text goes into the model, and the model translates it seq2seq style into a heterogeneous stream of learned "style tokens" (representing non-linguistic information like visual style, UI element type, position, etc.) as well as word tokens. Action-taking models then consume this token stream and output actions against specific UI elements on the page.

WebTexture input
	[screenshot patch 1]<bos>click for accessibility<eos>
	[screenshot patch 2]<bos>Welcome to IKEA<eos>
	[screenshot patch 3]<bos>about<eos><bos>contact us<eos>
WebTexture output
	[page header]<bos>Welcome to IKEA<eos>
	[about link]<bos>about<eos>
	[generic button]<bos>contact us<eos>
Action model input
	<bos>Ask if IKEA has the Blahaj in stock.<eos><WebTexture output>
Action model output
	[click]contact us
JavaScript
	[...document.querySelectorAll('a, button')].find(el => el.textContent.trim() === 'contact us')?.click()
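
The last two steps of the example (action model output, then JavaScript) could be glued together roughly like this. This is only a sketch: the "[type]target" action format is assumed from the example above, and `parseAction`, `findByText` and `runAction` are hypothetical names, not part of any existing library.

```javascript
// Hypothetical helper: parse an action string such as "[click]contact us"
// into { type, target }. The "[type]target" format is an assumption taken
// from the example in these notes, not from any published spec.
function parseAction(output) {
  const match = /^\[([a-z]+)\](.*)$/s.exec(output.trim());
  if (!match) {
    throw new Error(`Unrecognized action: ${output}`);
  }
  return { type: match[1], target: match[2].trim() };
}

// Find the first element whose visible text equals the target
// (case-insensitive). `elements` is any array of objects with a
// `textContent` string, so this works on real DOM nodes, e.g.
// [...document.querySelectorAll('a, button')], as well as plain objects.
function findByText(elements, target) {
  const wanted = target.toLowerCase();
  return elements.find(
    (el) => el.textContent.trim().toLowerCase() === wanted
  );
}

// Execute an action string against a list of candidate elements.
// Only "click" is handled in this sketch; returns true if it ran.
function runAction(output, elements) {
  const { type, target } = parseAction(output);
  const el = findByText(elements, target);
  if (!el) return false;
  if (type === 'click') {
    el.click();
    return true;
  }
  return false;
}
```

Matching on visible text rather than on HTML attributes keeps the action model's output independent of the page's markup, which is the point of consuming a WebTexture stream instead of raw HTML.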

Proposed architecture and training objective (initial noob guess)

A two-headed "hydra" transformer:
- Seq2seq language model backbone, which can be initialized from e.g. UL2-20B.
- Text tokens cross-attend to ViT patch embeddings with position biases so that patches where the text occurs on screen are weighted higher. (But I feel like position biases can be learned?)
- Closest thing I've seen in published literature is Meta's CommerceMM, which is a multimodal representation model for e-commerce product listings.
- Meta has also done CM3, which trains a fill-in-the-middle GPT over multimodal web data and shows some great zero-shot capabilities.
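
To make the position-bias idea in the list above concrete, here is a minimal sketch of a fixed, hand-designed bias: each patch gets an additive penalty proportional to its distance from the text's on-screen bounding box, and the penalty is added to the cross-attention logits before the softmax. The distance-based form, the `scale` value and all function names are assumptions for illustration; a learned bias would replace `positionBias` entirely.

```javascript
// Distance from a point to an axis-aligned box (0 if the point is inside).
function distanceToBox(px, py, box) {
  const dx = Math.max(box.x - px, 0, px - (box.x + box.width));
  const dy = Math.max(box.y - py, 0, py - (box.y + box.height));
  return Math.hypot(dx, dy);
}

// One additive bias per screenshot patch, in row-major order.
// Patches whose center lies inside the text's bounding box get 0;
// others get a negative bias that grows with pixel distance.
// `scale` is an arbitrary illustrative constant, not a tuned value.
function positionBias(textBox, grid, scale = 0.01) {
  const { rows, cols, patchSize } = grid;
  const bias = [];
  for (let r = 0; r < rows; r++) {
    for (let c = 0; c < cols; c++) {
      const cx = (c + 0.5) * patchSize;
      const cy = (r + 0.5) * patchSize;
      bias.push(-scale * distanceToBox(cx, cy, textBox));
    }
  }
  return bias;
}

// Softmax over (logits + bias): attention weights of one text token
// over all patches, with nearby patches weighted higher.
function biasedAttention(logits, bias) {
  const shifted = logits.map((l, i) => l + bias[i]);
  const max = Math.max(...shifted);
  const exps = shifted.map((v) => Math.exp(v - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}
```

With uniform logits, `biasedAttention([0, 0, 0, 0], positionBias(box, grid))` puts the most weight on the patch that contains the text, which is the behavior the bias is meant to encourage.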
This model can be trained on a number of self-supervised objectives. The most obvious is masked language modeling, but I can also imagine doing something like shuffling some of the patch-text pairs and having the model discriminate which are mispaired.
- It's important here that the model is trained on something that makes it learn visual detail, not just text content.
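
The shuffled-pair objective could get its training examples from something like the sketch below: pick a random subset of patch-text pairs, rotate the texts among them, and label each pair by whether its text still matches. The corruption rate, the rotation scheme and the name `makeMispairedExample` are all assumptions for illustration, not a tested recipe.

```javascript
// Build one training example for the "which pairs are mispaired?" objective.
// `pairs` is an array of { patch, text }; `random` is injectable so the
// corruption can be made reproducible. `corruptProb` is an illustrative
// default, not a tuned hyperparameter.
function makeMispairedExample(pairs, corruptProb = 0.3, random = Math.random) {
  // Choose which positions to corrupt.
  const chosen = [];
  pairs.forEach((_, i) => {
    if (random() < corruptProb) chosen.push(i);
  });

  // Rotate texts among the chosen positions, so each chosen patch
  // receives the text of another chosen patch. With fewer than two
  // chosen positions there is nothing to swap.
  const texts = pairs.map((p) => p.text);
  if (chosen.length >= 2) {
    chosen.forEach((idx, k) => {
      const source = chosen[(k + 1) % chosen.length];
      texts[idx] = pairs[source].text;
    });
  }

  // Label by comparing against the original text, so that swapping two
  // identical strings is not counted as a mispairing.
  return pairs.map((p, i) => ({
    patch: p.patch,
    text: texts[i],
    mispaired: texts[i] !== p.text,
  }));
}
```

Because the text content is unchanged and only its pairing moves, the model cannot solve this from language alone; it has to look at the patch, which is the property the bullet above asks for.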
There are good heuristics for web content segmentation into semantic "blocks", like looking at scroll and layout boundaries in CSS, the browser's compositor layers ("z-index" layers), etc.
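
As a rough sketch of those heuristics, the code below walks a tree of nodes and starts a new block at every scroll container, positioned element with an explicit z-index, or fixed/sticky element. The exact rule set is an assumption, z-index is only a loose proxy for real compositor layers, and the `{ id, style, text, children }` node shape is hypothetical; in a browser, `style` could be filled from `getComputedStyle(element)`.

```javascript
// Decide whether a node starts a new semantic block, based on a
// computed-style-like object. These three rules are illustrative guesses.
function isBlockBoundary(style = {}) {
  const position = style.position || 'static';
  const zIndex = style.zIndex === undefined ? 'auto' : String(style.zIndex);
  const overflow = [style.overflowX, style.overflowY];
  const scrolls = overflow.some((v) => v === 'auto' || v === 'scroll');
  const layered = position !== 'static' && zIndex !== 'auto';
  const pinned = position === 'fixed' || position === 'sticky';
  return scrolls || layered || pinned;
}

// Walk the tree depth-first and group text under the nearest boundary
// ancestor. The root always opens the first block.
function segmentBlocks(root) {
  const blocks = [];
  function visit(node, current) {
    let block = current;
    if (!block || isBlockBoundary(node.style)) {
      block = { id: node.id, texts: [] };
      blocks.push(block);
    }
    if (node.text) block.texts.push(node.text);
    (node.children || []).forEach((child) => visit(child, block));
  }
  visit(root, null);
  return blocks;
}
```

Each resulting block could then be cropped out of the screenshot and paired with its text to produce patch-text pairs like the ones in the WebTexture input example.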