Somewhere between 1B and 5B parameters, transformer-based language models go from interesting to intelligent to insightful. I'm currently training a 3B model (t5-3b) after having worked for a while with a sub-1B one (t5-large) -- the difference is palpable.
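For a sense of the size gap between the two checkpoints, here's a minimal sketch (assuming the Hugging Face `transformers` library and the public t5-large / t5-3b checkpoints) that loads each one and counts its parameters:

```python
# Minimal sketch: compare parameter counts of t5-large vs. t5-3b.
# Assumes the Hugging Face `transformers` library and the public T5 checkpoints;
# note that downloading t5-3b pulls several GB of weights.
from transformers import T5ForConditionalGeneration

for name in ("t5-large", "t5-3b"):
    model = T5ForConditionalGeneration.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e9:.2f}B parameters")
    del model  # free memory before loading the next checkpoint
```

Roughly speaking, t5-large sits just under 1B parameters while t5-3b is close to 3B, so the jump described above is about a 4x increase in scale.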