I think the next bit of performance may come from leaning hard into QAT (quantization-aware training). We know there's a lot of wasted precision in models, so the more the model accounts for that during training, the better quality small quants can get.
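To make the idea concrete, here's a minimal sketch of what QAT does under the hood: the forward pass "fake-quantizes" the weights so the model learns to tolerate the rounding error, while gradients flow through a straight-through estimator. The 4-bit symmetric per-tensor quantizer and layer name here are just illustrative choices, not any particular framework's implementation.

```python
import torch
import torch.nn as nn

class FakeQuantLinear(nn.Linear):
    """Linear layer that trains against simulated low-bit weights (illustrative sketch)."""
    def __init__(self, in_features, out_features, bits=4):
        super().__init__(in_features, out_features)
        self.qmax = 2 ** (bits - 1) - 1  # e.g. 7 for 4-bit symmetric

    def forward(self, x):
        # Per-tensor symmetric scale based on the current weight range.
        scale = self.weight.abs().max() / self.qmax
        w_q = torch.clamp(torch.round(self.weight / scale), -self.qmax, self.qmax) * scale
        # Straight-through estimator: forward uses the quantized weights,
        # backward treats the rounding as identity so gradients still flow.
        w = self.weight + (w_q - self.weight).detach()
        return nn.functional.linear(x, w, self.bias)

layer = FakeQuantLinear(16, 8, bits=4)
out = layer(torch.randn(2, 16))  # trains like a normal Linear, but "sees" 4-bit weights
```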
I also think diffusion LLMs' ability to revise previous tokens is amazing, as is the ability to iteratively rerun an autoregressive LLM to improve output quality.
I think a mix of QAT and iterative inference will bring the biggest upgrades to local use: a smaller, higher-quality model that you can choose to run for longer when you want better outputs. A rough sketch of that loop is below.
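By "iterative inference" I mean something as simple as drafting an answer and then feeding it back for revision a few times. The `generate` function below is a hypothetical stand-in for whatever local model call you actually use; the prompt wording is just one way to phrase the revision step.

```python
def generate(prompt: str) -> str:
    ...  # stand-in: call your local autoregressive model here

def iterative_refine(task: str, rounds: int = 3) -> str:
    """Draft once, then spend extra compute revising the draft a few times."""
    draft = generate(task)
    for _ in range(rounds):
        draft = generate(
            f"Task: {task}\n\nPrevious answer:\n{draft}\n\n"
            "Revise the previous answer, fixing any mistakes and improving clarity."
        )
    return draft
```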