
We swept 156,928 job postings and 435,000+ blog and news articles for mentions of ~90 LLM-eval frameworks and ~55 evaluation methodologies. The results: hiring names a tiny tool set (LangSmith and Langfuse together account for 56% of all eval-tool mentions, and no framework appears in more than 1% of postings), practitioners are converging on LLM-as-a-judge with a rubric, and the benchmarks the press argues about (SWE-bench, MMLU, GPQA) show up in roughly zero job descriptions.
May 12, 2026
