Chatbots are getting scary good, but evaluating them? That’s still a pain. Scoring with BLEU and ROUGE feels like trying to judge a movie by its subtitles. Human evaluation is time-consuming, inconsistent, and honestly… nobody has time for that.
So here’s the question I tackled in this project:
Can we let an LLM evaluate other LLMs?
Spoiler: Yep. And it’s shockingly effective.