I was trying to develop an intuition for how "hard" masked language modeling and masked text reconstruction (autoencoding) are, so I wrote a little script to "mask" a certain % of words from stdin.
// std helpers for I/O and list manipulation
{
	println: println
	default: default
	map: map
	stdin: stdin
	constantly: constantly
} := import('std')

// string utilities
{
	split: split
	trim: trim
	join: join
} := import('str')

// the mask fraction comes from --mask-fraction, falling back to -m, then 0
Cli := import('cli').parse()
MaskFraction := float(Cli.opts.'mask-fraction') |>
	default(float(Cli.opts.m)) |>
	default(0)

// split stdin into lines and words, replace every character of a randomly
// chosen MaskFraction of the words with '_', then stitch the text back together
stdin() |>
	trim() |>
	split('\n') |>
	map(fn(l) l |> split(' ')) |>
	map(fn(l) l |> map(fn(w) if rand() < MaskFraction { true -> w |> map(constantly('_')), _ -> w })) |>
	map(fn(l) l |> join(' ')) |>
	join('\n') |>
	println()
and then run it with, e.g.
pbpaste | oak masker.oak --mask-fraction 0.3
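Since the script also checks a short -m option before falling back to 0, this should behave the same:

pbpaste | oak masker.oak -m 0.3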
Difficulty shoots through the roof for me around 30% masking. (Coincidentally, that's also the mask fraction at which I'm training my current ContraBT5 bottleneck model.)
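For a concrete sense of what 30% masking feels like, here's one possible run on a stock example sentence (which words get masked is random, so any given run will differ):

echo 'the quick brown fox jumps over the lazy dog' | oak masker.oak -m 0.3
the _____ brown fox _____ over the lazy ___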