Blackpearl unveils GTM-Bench for AI sales evaluation
Summary
Blackpearl Group has introduced GTM-Bench, a benchmark designed to measure the commercial value of AI systems in sales and prospecting workflows. It tests leading AI models, including those from OpenAI, Anthropic, Google, and DeepSeek, on real-world tasks using both public and proprietary data. Four of the six leading AI sales agents tested produced negative overall scores, highlighting the risk that poor-quality output can outweigh any useful results.
The benchmark focuses on "buyer and seller coherence": whether an AI system can understand a seller's offerings, identify likely buyers, and return relevant, evidence-backed prospect records. It covers 72 tasks across 11 types and 15 market categories, built from 59,881 prospecting queries. The scoring system rewards a good lead with +1 and penalizes a bad lead with -1 to reflect the commercial cost of low-quality prospecting.
One AI agent generated 6,342 prospect records for a single task, illustrating the problem of prioritizing volume over quality. Blackpearl CEO Nick Lissette said the benchmark aims to shift focus from activity to outcomes, noting that "bad AI may be worse than no AI at all." The stronger systems cast a wide net and then narrowed their results using evidence, while weaker systems returned large numbers of records with less discipline.
Blackpearl's RTSA system achieved a net score of +26,615.6, while GPT-5.5 scored +4,040.9 with proprietary data access and +1,015.4 with public data only. No model led every category: GPT-5.5 outperformed Blackpearl RTSA in healthcare, recruiting, industrial, and real estate, and Blackpearl's system was also weaker in public sector and sustainability tasks. Lissette said this variation suggests businesses cannot assume one AI system will perform best in every environment, and that models should be tested against specific commercial tasks.
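The +1/-1 scoring rule described above explains how a high-volume agent can end up with a negative total. The following is a minimal sketch of that rule only, not Blackpearl's published code: the function name `net_score` and the two example agents are hypothetical, and because the reported scores are fractional, the real benchmark presumably applies weighting or partial credit that this summary does not describe.

```python
from typing import Iterable


def net_score(lead_is_good: Iterable[bool]) -> int:
    """Sum +1 for each good lead and -1 for each bad lead.

    Illustrative only: the actual GTM-Bench scorer is not described
    in this summary and evidently produces fractional scores.
    """
    return sum(1 if good else -1 for good in lead_is_good)


# Hypothetical agents (made-up numbers, not benchmark results).
# A disciplined agent returns few records, mostly good.
focused = [True] * 40 + [False] * 10
# A high-volume agent returns more good leads in absolute terms,
# but buries them under a much larger number of bad ones.
high_volume = [True] * 60 + [False] * 400

focused_score = net_score(focused)
high_volume_score = net_score(high_volume)
print(focused_score, high_volume_score)
```

Under this rule the high-volume agent finds more good leads than the focused one yet scores far lower, which is the pattern the benchmark is designed to expose.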
He also pointed to a broader industry shift toward specialized AI systems, comparing Blackpearl's approach to Harvey in legal AI and Tempus in health AI. The benchmark also examined the value of proprietary data: GPT-5.5's score improved almost fourfold with access to Blackpearl's internal data. Blackpearl's own system still outperformed GPT-5.5, which the company argues shows that the design of task-specific AI agents has a significant effect on sales outcomes. Lissette explained that combining great data with foundation models yields roughly four times better results, and that adding go-to-market vertical AI on top yields a further improvement of roughly six times, for a total of about twenty-six times. Blackpearl has made the benchmark's methodology, code, tasks, and results public to ensure transparency and to allow others to re-run the experiments and challenge the findings.
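The multipliers Lissette cites can be checked against the net scores reported in this summary. The sketch below assumes that the baseline for the "twenty-six times" figure is GPT-5.5 with public data only, which the summary implies but does not state; the variable names are illustrative.

```python
# Net scores as reported in the summary above.
gpt_public_only = 1015.4       # GPT-5.5, public data only
gpt_with_proprietary = 4040.9  # GPT-5.5, with Blackpearl's internal data
blackpearl_rtsa = 26615.6      # Blackpearl RTSA

# Gain from adding proprietary data to a foundation model.
data_gain = gpt_with_proprietary / gpt_public_only
# Further gain from the task-specific agent on top of that data.
agent_gain = blackpearl_rtsa / gpt_with_proprietary
# Overall gain relative to the assumed public-data baseline.
total_gain = blackpearl_rtsa / gpt_public_only

print(f"data: {data_gain:.2f}x, agent: {agent_gain:.2f}x, total: {total_gain:.2f}x")
```

The two steps multiply rather than add, so the rounded "four times" and "six times" figures understate the product slightly: the unrounded ratios are closer to 4.0 and 6.6, which is how the total reaches about 26 rather than 24.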
(Source: IT Brief Australia)