I was trying to develop an intuition for how "hard" masked language modeling and masked text reconstruction (autoencoding) are, so I wrote a little script to "mask" a certain % of words from stdin.
// std helpers for I/O and list manipulation
{
	println: println
	default: default
	map: map
	stdin: stdin
	constantly: constantly
} := import('std')

// string utilities
{
	split: split
	trim: trim
	join: join
} := import('str')

// the mask fraction comes from --mask-fraction, falling back to -m, then 0
Cli := import('cli').parse()
MaskFraction := float(Cli.opts.'mask-fraction') |>
	default(float(Cli.opts.m)) |>
	default(0)

// split stdin into lines and words, replace every character of a randomly
// chosen MaskFraction of the words with '_', then stitch the text back together
stdin() |>
	trim() |>
	split('\n') |>
	map(fn(l) l |> split(' ')) |>
	map(fn(l) l |> map(fn(w) if rand() < MaskFraction { true -> w |> map(constantly('_')), _ -> w })) |>
	map(fn(l) l |> join(' ')) |>
	join('\n') |>
	println()
and then run it with, e.g.
pbpaste | oak masker.oak --mask-fraction 0.3
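Since the script also checks a short -m option before falling back to 0, this should behave the same:

pbpaste | oak masker.oak -m 0.3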
Difficulty shoots through the roof for me around 30% masking. (Coincidentally, that's also the mask fraction at which I'm training my current ContraBT5 bottleneck model.)
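For a concrete sense of what 30% masking feels like, here's one possible run on a stock example sentence (which words get masked is random, so any given run will differ):

echo 'the quick brown fox jumps over the lazy dog' | oak masker.oak -m 0.3
the _____ brown fox _____ over the lazy ___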