Somewhere between 1B and 5B parameters, transformer-based language models go from interesting to intelligent to insightful. Currently training a 3B model (t5-3b) after having worked for a while with a sub-1B one (t5-large), and the difference is palpable.