Notes on WebTexture

I don't think I have time to build this. If you're interested in building this, email me (linus @ my main website) and I'm down to help.

Premise

  • There's a lot of value in, and demand for, language models capable of performing tasks on the Web by acting as a human would in a browser. Many AI labs and startups are working on this.
  • All existing approaches seem to build these action-taking models from HTML representations of webpages, which has obvious long-term limitations for visually rich websites and uses an LM's context window poorly, filling it with markup characters and non-semantic text.
    • HTML also doesn't extend to native software, which is still where a lot of enterprise automation value lies.
  • The Web is a reinforcement learning-hostile environment, because (1) it's extremely diverse, and (2) reward signals are often far removed from actions that cause them.
  • On the other hand, collecting high-quality human data points for fine-tuning on labelled action trajectories is very expensive.

What we need is a multimodal model that can learn about webpages in a self-supervised way against an Internet-scale dataset, enabling sample-efficient training of action-taking models downstream.

Idea

  • WebTexture is a medium-size (~20-50B params) pretrained multimodal (image + text) representation model for web content that maps a screenshot and displayed text on a webpage into a unified quantized token sequence for downstream tasks.
  • TL;DR — Webpage screenshot + displayed or OCR'd text goes into the model, and the model translates it seq2seq-style into a heterogeneous stream of learned "style tokens" (representing non-linguistic information like visual style, UI element type, position, etc.) interleaved with word tokens. Action-taking models then consume this token stream and output actions against specific UI elements on the page. (There's a rough data-structure sketch after the example below.)
WebTexture input
	[screenshot patch 1]<bos>click for accessibility<eos>
	[screenshot patch 2]<bos>Welcome to IKEA<eos>
	[screenshot patch 3]<bos>about<eos><bos>contact us<eos>
WebTexture output
	[page header]<bos>Welcome to IKEA<eos>
	[about link]<bos>about<eos>
	[generic button]<bos>contact us<eos>
Action model input
	<bos>Ask if IKEA has the Blahaj in stock.<eos><WebTexture output>
Action model output
	[click]contact us
JavaScript
	$('a:contains("contact us"), button:contains("contact us")').first().click()
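
For concreteness, here's a minimal Python sketch of the records in the example above. The class and field names (PatchText, StyledSpan, Action) are made up for illustration, not taken from any existing codebase.
Python (sketch)
	from dataclasses import dataclass
	import numpy as np

	@dataclass
	class PatchText:
	    # WebTexture input: one screenshot patch paired with the text rendered inside it
	    patch: np.ndarray      # RGB pixels for this region of the screenshot
	    spans: list[str]       # displayed or OCR'd text, one <bos>...<eos> span per entry

	@dataclass
	class StyledSpan:
	    # WebTexture output: a learned style token followed by the span's word tokens
	    style_token: str       # e.g. "[page header]", "[about link]", "[generic button]"
	    text: str              # e.g. "Welcome to IKEA"

	@dataclass
	class Action:
	    # Action model output: an action token plus the text of the target element
	    verb: str              # e.g. "[click]"
	    target_text: str       # e.g. "contact us"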

Proposed architecture and training objective (initial noob guess)

  • A two-headed "hydra" transformer:
    • Seq2seq language model backbone, which can be initialized from e.g. UL2-20B.
    • Text tokens cross-attend to ViT patch embeddings with position biases, so that patches where the text occurs on screen are weighted more heavily. (But I feel like the position biases can be learned?) There's a rough sketch of this cross-attention after this list.
    • The closest thing I've seen in published literature is Meta's CommerceMM, which is a multimodal representation model for e-commerce product listings.
    • Meta has also done CM3, which learns a fill-in-the-middle GPT over multimodal web data and shows some great zero-shot capabilities.
  • This model can be trained on a number of self-supervised objectives. The most obvious is masked language modeling, but I can also imagine doing something like shuffling some of the patch-text pairs and having the model discriminate which ones are mispaired (sketched after this list).
    • It's important here that the model is trained with an objective that forces it to learn visual detail, not just text content.
  • There are good heuristics for segmenting web content into semantic "blocks", like looking at scroll and layout boundaries in CSS, the browser's compositor layers ("z-index" layers), etc. There's a rough sketch of this at the end of this section.
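
Here's a rough sketch of the cross-attention with position biases described above, as a single PyTorch module. This is my own guess at one way to wire it up, not a reference implementation; the module name, the distance-based bias, and the learned bias scale are all assumptions.
Python (sketch)
	import torch
	import torch.nn as nn

	class TextToPatchCrossAttention(nn.Module):
	    # Text tokens (queries) attend over ViT patch embeddings (keys/values), with an
	    # additive bias favoring patches near each text token's on-screen position.
	    def __init__(self, d_model: int, n_heads: int):
	        super().__init__()
	        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
	        self.bias_scale = nn.Parameter(torch.tensor(1.0))  # learned, per the note above

	    def forward(self, text_tokens, patch_embeds, text_xy, patch_xy):
	        # text_tokens: (B, T, d)   patch_embeds: (B, P, d)
	        # text_xy: (B, T, 2) normalized on-screen position of each text token
	        # patch_xy: (B, P, 2) normalized center of each screenshot patch
	        dist = torch.cdist(text_xy, patch_xy)        # (B, T, P) screen-space distances
	        bias = -self.bias_scale * dist               # nearer patches get higher attention scores
	        # MultiheadAttention takes per-batch float biases as an attn_mask of shape (B*n_heads, T, P)
	        mask = bias.repeat_interleave(self.attn.num_heads, dim=0)
	        out, _ = self.attn(text_tokens, patch_embeds, patch_embeds, attn_mask=mask)
	        return out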
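
And a rough sketch of the patch-text mispairing objective mentioned above: shuffle the text side of a fraction of the pairs in a batch, then train a small classification head (alongside masked language modeling) to predict which pairs are still matched. The function name and the 25% shuffle fraction are arbitrary choices for illustration.
Python (sketch)
	import torch

	def make_mispaired_batch(texts: list[str], shuffle_frac: float = 0.25):
	    # Replace a fraction of the texts with texts taken from other examples in the
	    # batch, and return binary "is this pair still correctly matched" labels.
	    B = len(texts)
	    labels = torch.ones(B)
	    n = max(1, int(B * shuffle_frac))
	    chosen = torch.randperm(B)[:n].tolist()   # rows whose text gets swapped out
	    donors = torch.randperm(B)[:n].tolist()   # rows to take replacement text from
	    new_texts = list(texts)
	    for c, d in zip(chosen, donors):
	        if c != d and texts[c] != texts[d]:
	            new_texts[c] = texts[d]
	            labels[c] = 0.0                   # this patch-text pair is now mismatched
	    return new_texts, labels

	# Training step: encode (patch, new_text) pairs with WebTexture, then a binary
	# classification head predicts `labels` with BCE loss, added to the MLM loss.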
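
Finally, a rough sketch of the block-segmentation heuristic in the last bullet, using Playwright to ask a live page which elements establish scroll containers or set an explicit z-index. The selector logic is a guess; real compositor-layer information would need something like the Chrome DevTools Protocol instead.
Python (sketch)
	from playwright.sync_api import sync_playwright

	# Collect elements that look like semantic "blocks": scroll containers
	# (overflow: auto/scroll) and elements that set an explicit z-index.
	JS_FIND_BLOCKS = """
	() => [...document.querySelectorAll('*')]
	  .filter(el => {
	    const s = getComputedStyle(el);
	    return ['auto', 'scroll'].includes(s.overflowY) || s.zIndex !== 'auto';
	  })
	  .map(el => {
	    const r = el.getBoundingClientRect();
	    return { tag: el.tagName, x: r.x, y: r.y, w: r.width, h: r.height };
	  })
	"""

	with sync_playwright() as p:
	    browser = p.chromium.launch()
	    page = browser.new_page()
	    page.goto("https://www.ikea.com")
	    blocks = page.evaluate(JS_FIND_BLOCKS)   # candidate block boundaries for patching
	    browser.close()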