What I Learned Trying to Benchmark Roleplay Models
Every RP benchmark either runs on vibes or tests the wrong thing. I built one that tried to do better, and the real lesson was how badly LLM-as-judge fails on subjective tasks, until 162 community voters fixed it.