Framework: AI Tool Evaluation Matrix
A structured decision matrix for evaluating AI tools before committing. Scores tools across seven weighted criteria to cut through marketing hype and make informed choices.
Use cases
Strategy & Planning, Operations & Workflow
Platforms
Model-Agnostic
The resource
Copy and adapt. Do not paste blind.
```markdown
# AI Tool Evaluation Matrix
```

When to Use This
Use this before committing to any AI tool, platform, or service. The matrix is designed to cut through marketing hype and demo-driven excitement by forcing structured evaluation against criteria that actually matter for day-to-day use.
Especially useful when: your team is debating between competing tools, you are advising a client on AI tool selection, you are about to commit budget to an annual subscription, or you find yourself evaluating a new AI tool every week (the matrix forces discipline).
Why It Works
Weighted criteria reflect real-world importance. Output Quality and Reliability are weighted at x3 because they determine whether you actually use the tool after the first week. A tool that is fast to set up (x2) but produces unreliable output (x3) will score poorly, as it should. The weights encode the priority hierarchy that most people learn through painful experience.
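To make the weighting concrete, here is a minimal Python sketch of the weighted total. The criteria names and the x2/x3 weights below are illustrative assumptions, not the full matrix; substitute the seven criteria and weights from your own copy.

```python
# Minimal sketch of the weighted scoring. Criteria and weights are placeholders;
# replace them with the seven criteria and weights from your own matrix.
WEIGHTS = {
    "Output Quality": 3,  # x3: determines whether you still use the tool after week one
    "Reliability": 3,     # x3
    "Setup Speed": 2,     # x2
    "Integration": 2,     # hypothetical weight for illustration
}

def weighted_total(scores: dict[str, int]) -> int:
    """Sum of (score x weight), with each score on the 1-5 anchor scale."""
    return sum(WEIGHTS[criterion] * score for criterion, score in scores.items())

# Example: fast to set up but unreliable output loses to a steadier tool.
tool_a = {"Output Quality": 2, "Reliability": 2, "Setup Speed": 5, "Integration": 3}
tool_b = {"Output Quality": 4, "Reliability": 4, "Setup Speed": 3, "Integration": 3}
print(weighted_total(tool_a), weighted_total(tool_b))  # 28 vs 36
```

The fast-but-unreliable tool scoring lower is exactly the behaviour the weights are designed to produce.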
The evaluation protocol prevents demo bias. Most AI tools are evaluated by watching a demo or running the tool's own example inputs. The protocol requires testing with YOUR actual use cases and YOUR real inputs. This is where most tools reveal their limitations.
The scoring anchors (1-5 with descriptions) prevent inflation. Without anchors, everyone scores everything 3 or 4. The explicit descriptions ("5 = best in class, no meaningful compromise") force honest assessment.
The red flag list provides automatic disqualification. Some issues are not matters of degree. No privacy policy is not a "score of 2 on Data Handling" — it is a reason not to use the tool at all. The red flags encode these deal-breakers.
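The red flags sit outside the scoring entirely; no weighted total can rescue a disqualified tool. A minimal sketch of that rule, with placeholder flags standing in for your own list:

```python
# Red flags disqualify outright; they never feed into the weighted score.
# The flags below are illustrative placeholders for your own red flag list.
RED_FLAGS = {"no privacy policy", "no data export", "no pricing transparency"}

def disqualified(observed_flags: set[str]) -> bool:
    """A single red flag rules the tool out, regardless of its total score."""
    return any(flag in RED_FLAGS for flag in observed_flags)

print(disqualified({"no privacy policy"}))  # True: stop here, do not score further
```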
The decision log creates accountability. Recording the reasoning and setting a review date prevents the common pattern of choosing a tool, forgetting why, and letting the subscription auto-renew long after the tool stopped being useful.
How to Customise
Adjust the weights. If integration is more important than output quality for your context (e.g. the tool connects critical systems), increase the Integration weight. The weights should reflect YOUR priorities.
Add criteria. Some teams need additional criteria: "Team adoption" (will people actually use it?), "Compliance" (does it meet regulatory requirements?), or "Customisability" (can you tailor it to your workflow?). Add rows with appropriate weights.
Create use-case-specific versions. An evaluation matrix for "AI writing tools" might add criteria like "Voice consistency" and "Format flexibility." An evaluation for "AI code tools" might add "Language support" and "IDE integration." Specialise the matrix for your most common evaluation scenarios.
Limitations
This framework is deliberately simple. It does not capture nuance that matters in some contexts: enterprise procurement requirements, security audit results, vendor lock-in analysis, or total cost of ownership modelling. For enterprise-scale decisions, use this matrix as a first pass to shortlist, then apply deeper due diligence.
The scoring is subjective. Two evaluators may score the same tool differently. For team decisions, have each evaluator score independently, then discuss and align where scores diverge by more than 1 point.
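For the independent-scoring step, a small sketch like this surfaces the criteria worth discussing before you total anything (criteria names and scores are hypothetical):

```python
# Flag criteria where two evaluators' scores diverge by more than 1 point.
evaluator_a = {"Output Quality": 4, "Reliability": 3, "Setup Speed": 5}
evaluator_b = {"Output Quality": 2, "Reliability": 3, "Setup Speed": 4}

to_discuss = [
    criterion
    for criterion in evaluator_a
    if abs(evaluator_a[criterion] - evaluator_b[criterion]) > 1
]
print(to_discuss)  # ['Output Quality'] -- align on these before totalling
```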
Model Notes
This framework is model-agnostic. It evaluates AI tools, not AI models. However, you can use this same structure to compare models (Claude vs GPT vs Gemini) by adjusting the criteria to: output quality, instruction following, context window, pricing, API reliability, and ecosystem.
Related Resources
System Prompt: Research Analyst
A system prompt for configuring an LLM as a structured research analyst that separates facts from interpretation, scores confidence, and flags gaps clearly.
Research & Analysis · Strategy & Planning
Framework: Prompt Audit Checklist
A 15-point checklist for evaluating any prompt before putting it into production. Catches the most common prompt failures: vague instructions, missing constraints, absent error handling, and untested edge cases.
Operations & Workflow · Strategy & Planning