Don't get me wrong: it *is* a flaw when something performs better on a benchmark than in actual usage. Sometimes that's a flaw of the benchmark (they're supposed to simulate actual usage); other times it's a flaw of effectively developing *for* the benchmark. That's not great, and should at least be brought up as a mistake.
But categorising it as "cheating", etc? Nah - they're probably mistakes similar to overtraining an AI, tbh.
(Some stuff is evil though: don't violate human rights, dammit)