AI Hiring Doesn't Have to Be a Black Box

Most AI hiring tools score the same CV 40 one day and 80 the next. Here's what we changed to make AI hiring scoring reliable enough that a human reviewer reaches the same answer.

Janet Paul

May 28, 2026

Most AI hiring tools score the same CV 40 one day and 80 the next. Here's what we changed to make AI hiring scoring reliable enough that a human reviewer reaches the same answer.

How we made AI hiring scoring reliable enough that a human reviewer reaches the same answer, and why most AI hiring tools score the same CV 40 one day and 80 the next.

The same candidate shouldn't get two different scores. But run their CV through most AI hiring tools twice and that's exactly what happens. A 40 on Monday, an 80 on Wednesday, same job, same requirements, no explanation for the gap. It's the open secret of AI hiring: the scores are unreliable in a way the product pages never admit, and the recruiters who notice it stop trusting the number and go back to reading CVs by hand. They're not wrong to.

We make an AI hiring tool, so this is our problem as much as anyone's. And we'll be honest about the limits. No one can fully open up a large AI model and trace exactly why it produced a given answer; some of how these models work is genuinely hard to see into. But the kind of AI you use to screen the people who want to work for you doesn't have to be opaque in the ways that matter. The score a candidate gets should be something a human reviewer, given the same inputs, would arrive at on their own. If the AI can't do that, you have a tool you can't trust.

This piece is about what it takes to make AI scoring reliable: stable enough that the same CV gets the same score, and clear enough that you can see why it got that score. We had to work that out to build something we'd trust ourselves. Here's what that took.

Why most AI hiring tools give unreliable scores

Three things tend to go wrong, and each one makes the next worse.

The first is that the AI sees too much. Almost every AI screening tool we have looked at feeds the raw CV straight into the model. That CV contains a candidate's name, gender markers, the schools they went to, where they live, signals that imply their age. The AI's training data carries biased scoring patterns tied to each of those signals. The model has read millions of documents in which certain names and certain schools were treated as higher status than others, and that bias gets carried into how it scores a CV. You do not need to believe AI is intentionally biased to see the problem. The training data was biased because the world it was scraped from was biased. The score reflects that.

The second is that the AI is asked to do too much at once. One model, one prompt, "read this CV, compare it to this job description, return a score." That is several cognitive jobs in a single step, and the model wobbles between them.

The third is that the scoring is anchored against the wrong thing. Most AI tools score the CV against a job description. But a job description is a candidate-attraction tool, not a decision-making tool (see our piece on job descriptions). It is written to make the role sound appealing, not to define what evidence makes someone qualified for it. Scoring against it is scoring against a moving target.

Each of these is fixable. We know because we fixed them.

The three changes that made our scoring reliable

We did not invent any of this. The three design choices we made are individually well-known. What is uncommon is using them together, in a category that has been racing in the opposite direction.

1. We strip identifying information before the AI sees the CV

The CIPD's Inclusive Recruitment guide for employers

recommends anonymising CVs before they go in front of human reviewers, because the same biases that live in training data live in human reviewers too. We applied the idea to AI screening. The candidate's name, the demographic and lifestyle signals carried alongside it, the schools and locations the AI's training data has opinions about, all of it is stripped out before any scoring step. The AI sees the evidence of what the candidate has done. It does not see who the candidate is.

This was the single biggest fix for the candidate score swings. With identifying information present, two CVs with the same skills evidence could score very differently. With it removed, the scoring became consistent.

This is engineering, not aspiration. Anyone could do it. Most AI hiring tools do not, because it is costly and it makes the scoring slower. We chose to do it anyway because a faster but biased AI is not actually faster: it just produces a worse shortlist quickly.

2. We split the scoring into multiple steps

The other half of the scoring inconsistency problem is that asking one AI model to do everything in one shot is unreliable. Read the CV, anonymise it, hold the competency framework in mind, evaluate the evidence, calculate a score, all in one prompt. The model spreads itself too thin.

So we split the scoring into multiple steps. No single model is asked to do everything at once. Each step does one narrow task and hands its output to the next. We will not spell out the exact sequence here, but the principle is the one that matters: when we asked the AI to do everything in one step, the scores moved a lot; when we broke the evaluation into smaller pieces, each one doing one job, the reliability improved by a lot.

This is the engineering trade most AI hiring tools have not made. It is more steps, not fewer. It costs more compute per candidate and the work is slower. We think that trade is the right one to make. A fast wrong answer still wastes the interview slot.

3. The framework is the judgement, not the AI

When the hiring team sets a competency to a particular expertise level, the scoring is mechanical. Does the candidate's evidence match the indicators we already defined for that level? A human given the same framework and the same evidence reaches the same answer. The AI is not deciding what counts as expert; the hiring team decided that, in advance, by writing the framework. The AI is applying that decision.

That is what makes the score auditable. You can ask, "Why this number?" and trace it back to the indicators in the framework. A score is never the AI's opinion. It is the framework's opinion, applied at speed.

Why our scores match a human reviewer 95% of the time

Once we made those three changes, we needed to know whether the AI was doing what we thought it was, or whether we were telling ourselves a story.

We ran a blind test. The same candidate was scored by the AI and by a human reviewer, independently. Both saw the same competency framework, the same anonymised CV, and the same answers to knockout questions. The human reviewer scored without seeing what the AI had given. We ran this across about 60-70 candidates over several test rounds, and then again against our first customer's live screening process.

In about 95% of cases, the AI's score matched the human's score within 1-2 points out of 100.

This is not "the AI is unusually intelligent." It is "the AI and the human are evaluating the same evidence against the same framework, so they reach the same answer." The framework is doing the work.

The validation is internal. It is not peer-reviewed, and we are not pretending it is. We ran the test multiple times because consistency itself is a signal, and we have described the methodology in enough detail that anyone could replicate it. In a category where most reliability claims are unfalsifiable, that is the standard we want to set for ourselves.

When the AI and the human disagreed by more than 1-2 points, it was usually because the human reviewer had not caught a piece of evidence buried somewhere in a long CV. A human reading a five-page CV under time pressure misses things. The AI does not. When we went back and looked at the disagreements, the AI was usually the one that had read more carefully.

What explainable scoring looks like on a candidate profile

A score you can trust shows its work. Not as one number you have to take on faith, but broken down to the level of each competency, with the reasoning visible for each one. That principle is what separates an explainable score from an opaque one, whatever tool produces it.

In Workcraft, that looks like a row of traffic lights on every candidate profile, one for each competency. Green means the candidate's evidence matches the expertise level you asked for. Yellow means partial match. Red means the indicators are not there. Next to each light is a short, keyword-level explanation of why, drawn from the candidate's CV and knockout answers.

Explainable candidate evaluation shows how an applicant scores against your criteria

What that means in practice is that the hiring team does not need to read the CV first. They read the colours and the reasons, and decide who to dig into. The CV is still there if they want it. But the AI's reasoning is visible on the surface.

Our customers also do not use the score the way we expected. They do not threshold on "above 80." They look at the top 10, top 15, or top 20 candidates and ignore the rest. The score's precision matters less than the relative order of the funnel. The AI's job is to sort that funnel reliably enough that a hiring team can review 15 candidates instead of 200.

That is not a black box. That is the work showing.

Three questions to ask any AI hiring tool

If you are evaluating AI hiring tools, including ours, three questions separate the reliable ones from the black boxes:

Does it strip identifying information before scoring, or does the model see the raw CV? If the model sees names, schools, and locations, its training-data bias is in your scores.
Does one model do everything in one step, or is the work split up? Asking a single model to read, compare, and score in one prompt is where the wild swings come from.
Is the score anchored to a framework you control and can audit, or to the AI's own judgment? If you can't trace a score back to criteria you set, you can't trust it and you can't defend it.

A tool that gets these right gives you the same score for the same candidate every time, and shows you why. A tool that doesn't is a black box, whatever its marketing says.

Reliable AI hiring is a choice most tools haven't made

Reliable AI hiring is not a question of whether AI is the right tool. It is a question of how the tool is built, and those three questions are how you tell the difference. The principles behind them don't change when the next AI model comes out: strip the inputs that trigger biased patterns, split the work so no single model carries the whole job, anchor the scoring on a framework the hiring team controls and can audit.

If you tried an AI hiring tool and gave up because the scores did not make sense, that was a reasonable thing to do. The tool you tried was probably a black box. But it doesn't have to stay that way. The version that earns your trust is the version where a human reviewer, given the same inputs, would reach the same answer the AI did.

FAQ

Is the 95% validation peer-reviewed?

No. It is Workcraft's internal validation. We ran the same blind test multiple times across approximately 60-70 candidates because consistency is itself a credibility signal, and we have named the methodology in enough detail that anyone could replicate it. Internal validation is not peer reviewed.

If AI models can't be fully explained, how can your scoring be trusted?

It is a fair point. The deep-learning models underneath any AI scoring system are not fully interpretable. What we control is what we put into them and how we use what comes out. We anonymise the inputs so demographic patterns cannot bias the score. We split the work across multiple steps so no single model is doing everything at once. We anchor every score on a competency framework, so a human can audit any candidate's score by tracing it back to the indicators the hiring team defined. That is not a fully open box. But it is not opaque either.

Why does the same CV get different scores from most AI hiring tools?

Two reasons. First, the AI sees identifying information (name, demographic signals, school, location) and its training data carries biased scoring patterns tied to those signals. Second, most AI scoring tools ask one model to do everything at once: read the CV, compare it to a job description, return a score. That is several cognitive tasks in one prompt, and the model wobbles. Strip the identifying signals before scoring and break the work into smaller steps, and the swings stop.

How big is the sample size?

About 60-70 candidates across multiple test runs. We ran the blind validation 4-5 times internally, with about 10 candidates per run, then re-ran the comparison against our first customer's live screening of 70+ candidates. The same 95% agreement held across both contexts.

What happens when the AI and the human reviewer disagree?

Most of the time it is because the human reviewer did not catch a piece of evidence buried somewhere in a long CV. A human reading a five-page CV under time pressure misses things. The AI does not. When we went back and looked at the disagreements, the AI was usually the one that had read more carefully.

Doesn't anchoring on a framework just move the black box to the framework?

The framework is built by the hiring team and is visible to them, with explicit indicators at each expertise level. It is not a black box because the hiring team sees and controls it directly. The AI applies the framework; the framework is the judgement.

The 95% human-agreement figure and the 60-70 candidate sample are Workcraft's internal blind-test validation, not externally published research.

Sources & outbound links

CIPD, Inclusive recruitment: Guide for employers