Linus's stream

Text-to-image generative models as a kind of "augmented imagination" rather than "augmented creativity" or "augmented craft".

Posit: Higher intelligence is being able to learn, in-context, an arbitrary class of patterns/functions that a less intelligent system would need to be directly optimized ("trained") to learn.

For example:

  • Larger transformers can in-context learn things smaller models need to be trained to know
  • Extrapolating, super-human general intelligence will be able to learn in-context things that take humans lots of iterations to learn.
    • The set of things that take humans many iterations to learn includes the ability to autoregressively predict (imitate) another human. Thus, a sufficiently intelligent AGI will be able to mimic any human with arbitrary accuracy by learning from "in-context" observations.

Being able to in-context learn an "arbitrary class of functions" here seems pretty profound. It doesn't just include language, but also models of environments and policies (Gato), arbitrary linear functions, etc.
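
A concrete picture of what "in-context learning a linear function" means, as a rough illustrative sketch (the setup and names are mine, loosely following the in-context regression framing, not anyone's actual code):

import numpy as np

# Sample a brand-new linear "task" w and build a prompt of (x, f(x)) pairs.
# A sufficiently capable sequence model should read this prompt and predict
# f(x_query) with no weight updates; a weaker model would need to be trained on w.
rng = np.random.default_rng(0)
d, n_examples = 4, 8
w = rng.normal(size=d)

prompt = []
for _ in range(n_examples):
    x = rng.normal(size=d)
    prompt += [x, np.array([x @ w])]   # ... x_i, f(x_i), x_{i+1}, f(x_{i+1}), ...

x_query = rng.normal(size=d)
prompt.append(x_query)
print(x_query @ w)   # the value an in-context learner should approximately recover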

Notes on WebTexture

I don't think I have time to build this. If you're interested in building this, email me (linus @ my main website) and I'm down to help.

Premise

  • There's a lot of value in, and demand for, language models capable of performing tasks on the Web by acting as if it were a human taking actions in a browser. Many AI labs and startups are working on this.
  • All existing approaches seem to build these action-taking models from HTML representations of webpages, which has obvious long-term limitations for visually rich websites and uses an LM's context window poorly, filling it with markup characters and non-semantic text.
    • HTML also doesn't extend to native software, which is still where a lot of enterprise automation value lies.
  • The Web is a reinforcement learning-hostile environment, because (1) it's extremely diverse, and (2) reward signals are often far removed from actions that cause them.
  • On the other hand, collecting high-quality labelled action trajectories from humans for fine-tuning is very expensive.

What we need is a multimodal model that can learn about webpages in a self-supervised way against an Internet-scale dataset, which can enable sample-efficient training of action-taking models downstream.

Idea

  • WebTexture is a medium-size (~20-50B params) pretrained multimodal (image + text) representation model for web content that maps a screenshot and displayed text on a webpage into a unified quantized token sequence for downstream tasks.
  • TL;DR — Webpage screenshot + displayed or OCR'd text goes into the model, and the model translates it seq2seq style into a heterogeneous stream of learned "style tokens" (representing non-linguistic information like visual style, UI element type, position, etc.) as well as word tokens. Action-taking models then consume this token stream and output actions against specific UI elements on the page.
WebTexture input
	[screenshot patch 1]<bos>click for accessibility<eos>
	[screenshot patch 2]<bos>Welcome to IKEA<eos>
	[screenshot patch 3]<bos>about<eos><bos>contact us<eos>
WebTexture output
	[page header]<bos>Welcome to IKEA<eos>
	[about link]<bos>about<eos>
	[generic button]<bos>contact us<eos>
Action model input
	<bos>Ask if IKEA has the Blahaj in stock.<eos><WebTexture output>
Action model output
	[click]contact us
JavaScript
	$('[text="contact us"]').click()
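
To make the input format above concrete, here's a rough sketch of how (screenshot patch, displayed text) pairs could be scraped for pretraining. Playwright and the element selectors are my assumptions for illustration, not part of the proposal; a real pipeline would segment pages with the layout heuristics mentioned below.

import os
from playwright.sync_api import sync_playwright

def collect_pairs(url, out_dir="patches"):
    # Crop a screenshot patch around each visible element and pair it with its
    # text, roughly matching the "[screenshot patch]<bos>text<eos>" format above.
    os.makedirs(out_dir, exist_ok=True)
    pairs = []
    with sync_playwright() as p:
        page = p.chromium.launch().new_page()
        page.goto(url)
        for i, el in enumerate(page.query_selector_all("h1, h2, p, a, button")):
            box = el.bounding_box()        # {"x", "y", "width", "height"}, or None if not rendered
            text = el.inner_text().strip()
            if box is None or text == "":
                continue
            path = f"{out_dir}/patch_{i}.png"
            page.screenshot(path=path, clip=box)
            pairs.append((path, f"<bos>{text}<eos>"))
    return pairs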

Proposed architecture and training objective (initial noob guess)

  • A two-headed "hydra" transformer:
    • Seq2seq language model backbone, which can be initialized from e.g. UL2-20B.
    • Text tokens cross-attend to ViT patch embeddings with position biases so that patches where the text occurs on screen are weighted higher. (But I feel like position biases can be learned? There's a rough sketch of one way to wire this up after this list.)
    • Closest thing I've seen in published literature is Meta's CommerceMM, which is a multimodal representation model for e-commerce product listings.
    • Meta has also done CM3, which learns a fill-in-the-middle GPT over multimodal web data and shows some great zero-shot capabilities.
  • This model can be trained on a number of self-supervised objectives. The most obvious is masked language modeling, but I can also imagine doing something like shuffling some of the patch-text pairs and having the model discriminate which are mispaired.
    • It's important here that the model is trained on something that makes it learn visual detail, not just text content.
  • There are good heuristics for web content segmentation into semantic "blocks", like looking at scroll and layout boundaries in CSS, the browser's compositor layers ("z-index" layers), etc.
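
A minimal sketch of the patch-text cross-attention with a learned position bias, to make the "hydra" idea concrete. This is my guess at one way to wire it up in PyTorch, not the proposed implementation; the bias form (a learned scale on negative screen distance) is an assumption.

import torch
import torch.nn as nn

class PatchTextCrossAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Learned scale on negative screen distance: patches near where a token
        # is rendered get a higher attention score. Per-head scales would also work.
        self.bias_scale = nn.Parameter(torch.tensor(1.0))

    def forward(self, text_tokens, patch_embs, token_xy, patch_xy):
        # text_tokens: (B, T, D), token_xy: (B, T, 2) normalized screen coords
        # patch_embs:  (B, P, D), patch_xy: (B, P, 2) patch centers, same coords
        dist = torch.cdist(token_xy, patch_xy)          # (B, T, P)
        bias = -self.bias_scale * dist                  # closer patch => higher score
        # MultiheadAttention accepts an additive float mask, broadcast per head:
        attn_mask = bias.repeat_interleave(self.attn.num_heads, dim=0)  # (B*H, T, P)
        out, _ = self.attn(text_tokens, patch_embs, patch_embs, attn_mask=attn_mask)
        return out

The mispair-discrimination objective could then be a small classification head over the pooled output that predicts, for each patch-text pair, whether it was shuffled, trained alongside the masked language modeling loss.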

As a society, we need to evolve beyond reading and writing walls of text.

People need to be more thoughtful building products on top of LLMs. The fact that they generate text is not the point. LLMs are cheap, infinitely scalable, predictably consistent black boxes to soft human-like reasoning. That's the headline! The text I/O mode is just the API to this reasoning genie. It's a side effect of the training paradigm (self-supervision on Internet scale text data).

A vanishingly small slice of knowledge work has the shape of text-in-text-out (copywriting/Jasper). The real alpha is not in generating text, but in using this new capability and wrapping it into jobs that have other shapes. Text generation in the best LLM products will be an implementation detail, as much as backend APIs are for current SaaS.

Text, like code, is a liability, not an asset. An organization should strive to own as little text as necessary to express their information and accomplish tasks. If you don't heed this warning, you end up with a Notion that has 10 copies of every piece of information, 4 of which contradict each other and only 2 of which reliably surface on searches. Willy-nilly spraying the GPT-3 next-token prediction powder on your tool/product is a recipe for disaster outside of narrow workflows where text is the asset being produced. In all other cases, don't ship the API to the user. Text generation is not the product.

Notion's "AI" product is an affront to Doug Engelbart's name. Is there nobody left at the company who's thinking creatively about AI x knowledge work?

Thinking in neighborhoods

As I've been using latent space navigation tools more in my own thinking work, one feeling I've noticed is that every idea no longer feels independent and singular, but instead feels like one of many (infinite?) possible variations from which I've simply plucked one version. If I look a bit closer to the left and right of any given idea, there are similar ideas from different perspectives and different ideas with shared perspectives, simply waiting to be made visible.

I call this thinking in neighborhoods -- where every thought exists within a neighborhood of other ideas tightly connected to each other, just out of sight.

For example, if I write down a sentence about generation ships traveling across the universe:

Generation ships are spacecraft where multiple generations of people live and die as they travel towards some destination, such as another star system. Because of the general difficulty of space travel and the speed limit of light, these voyages might take centuries to be completed.

Rather than thinking of that sentence as, well, just a sentence about some spaceships, I'm instead beginning to feel the "neighborhood" of ideas that lie next to it.

Some points in this idea neighborhood are just paraphrases, but others retell the story in different styles. Yet others change settings, wrap the idea in positive or negative light, or change the topic entirely, while keeping the tone and structure of the sentence. From the outputs of the model I used to study this idea, my favorite of the set compared the scale of interstellar travel to the vastness of the ocean, and another speculated that planets themselves may be considered a kind of generation ship.

As I've adjusted to this new feeling, a complementary sensation has drifted into the back of my mind: In a world where ideas exist in this infinitely densely packed fabric of variations, thinking individual strands of thoughts without visceral awareness of the rich variations within every thought seems like a loss, like trying to take in the night sky with a telescope that only sees one star at a time. My hope is that with some combination of better tools and ways to represent/communicate ideas, we'll be able to open our minds up to the infinite variations present just beside every word and idea we perceive today.

Added a few functions for drawing histograms into the Oak standard library, and now I can do this!

std.range(100000) |>
	std.map(random.normal) |>
	debug.histo({ bars: 20, label: :start, cols: 50 })

... which renders ...

    7 ▏
   38 ▏
  175 ▌
  574 █▋
 1740 █████
 3951 ███████████▍
 7804 ██████████████████████▌
12066 ██████████████████████████████████▉
15913 ██████████████████████████████████████████████
17296 ██████████████████████████████████████████████████
15783 █████████████████████████████████████████████▋
11614 █████████████████████████████████▋
 7160 ████████████████████▊
 3562 ██████████▎
 1527 ████▍
  577 █▋
  159 ▌
   40 ▏
   12 ▏
    1 ▏

Might our descendants look upon our probable emergence and departure from Earth's gravity well the same way we look upon our ancestors' emergence and departure from the ocean — as a necessary proving ground that we outgrew?

I was trying to develop an intuition for how "hard" masked language modeling and masked text reconstruction (autoencoding) are, so I wrote a little script to "mask" a certain % of words from stdin.

{
	println: println
	default: default
	map: map
	stdin: stdin
	constantly: constantly
} := import('std')
{
	split: split
	trim: trim
	join: join
} := import('str')

Cli := import('cli').parse()

// masking probability, from --mask-fraction or -m, defaulting to 0
MaskFraction := float(Cli.opts.'mask-fraction') |>
	default(float(Cli.opts.m)) |>
	default(0)

stdin() |>
	trim() |>
	split('\n') |>
	map(fn(l) l |> split(' ')) |>
	// replace every character of a word with '_' with probability MaskFraction
	map(fn(l) l |> map(fn(w) if rand() < MaskFraction { true -> w |> map(constantly('_')), _ -> w })) |>
	map(fn(l) l |> join(' ')) |>
	join('\n') |>
	println()

and then e.g.

pbpaste | oak masker.oak --mask-fraction 0.3

Difficulty shoots through the roof for me around 30% masking. (Coincidentally, this is also the masking rate at which I'm training my current ContraBT5 bottleneck model.)