Udo Nkwocha

We are Letting LLMs Decide Who Gets Hired and Doing It Wrong

The interviewer finishes the call. They open the transcript, paste it into a chatbot, paste the evaluation criteria, and ask "How did the candidate do?"

A paragraph comes back. Clean. Confident. Authoritative.

They copy it into Greenhouse. Submit.

That's a $200K decision. Made by a $20 chatbot. With no rubric, no audit, no oversight.

There's a new person in the hiring room. Nobody interviewed them. Nobody trained them. And nobody can fire them.

I've spent over 400 hours interviewing software engineering candidates. Only recently did I start using LLMs to help draft my feedback, and I get why it's tempting. Writing notes after back-to-back interviews is tedious, and with every hour that passes, context evaporates.

But it's not just me. With transcription tools everywhere now, LLM-assisted evaluations are fast becoming a standard part of the interviewer's hiring arsenal.

I felt the risk firsthand outside of work during a mock interview. I used ChatGPT to transcribe the session, dropped the same transcript into an evaluation template, and watched how easily the outcome could be steered. With a few prompt tweaks, the same performance could read as impressive or shaky.

Trying to interview without transcription or LLM help isn't a great alternative either. It's hard to meaningfully engage in a conversation while typing notes, and if you don't capture things in the moment, you forget more than you think.

So the question isn't whether LLMs should be used in hiring. It's how to design a workflow around them that's auditable, responsible, and justifiable.

Over the last few weeks, I dug into that. I didn’t expect it to take me from Industrial-Organizational Psychology to computer science papers on arXiv, but it was genuinely exciting. Connecting those worlds gave me a first-draft approach to making LLM-assisted hiring safer, more consistent, and auditable.

The current reckless, naive approach

Transcript + evaluation criteria + LLM prompt => paste the output into the hiring system. It’s fast, but it’s riddled with predictable failure modes:

  1. Hallucinated details: LLMs can confidently invent "evidence".[1]
  2. Verbosity bias: longer answers can be judged as better even when they add little substance (the “verbosity attack”).[1]
  3. Position and framing bias: if you show the model two options (A vs. B), it often favors the one it sees first; swap the order, and the verdict can flip. Even small changes in how you ask the question can change the verdict.[1]
  4. Transcript errors become decision errors: speech-to-text mistakes propagate downstream and can change the evaluation.[2]

My suggested approach: A structured, auditable workflow

The foundation is not the model. It’s the interview.

1) Start with a structured interview

Google's guidance is simple: ask candidates applying for the same job a consistent set of questions, score them on the same scale, and make decisions against predetermined qualifications.[4] They also publish an example interview grading rubric that shows what "levels" look like in practice.[8] A rubric matters because it reduces variance that comes from interviewers, not candidates.

2) Use behaviorally anchored rubrics

This is not a new idea. Industrial-Organizational Psychology has three closely related ways to build rubrics:

  • Behaviorally Anchored Rating Scales (BARS) is a framework where each score level is tied to concrete examples of what "good" and "bad" look like.[5]
  • Behavioral Observation Scales (BOS) asks the interviewer to rate how often they observed specific behaviors (for example, "clarified requirements" or "tested assumptions").[6]
  • Behavioral Summary Scales (BSS) is a simplified version of BARS that uses a few short behavior summaries (instead of lots of detailed anchors), making it easier for different interviewers to stay consistent.[7]

For interviews, I think BSS is the best fit. Interview feedback is noisy, time-constrained, and often debated by reviewers. A small number of clearly described levels (for example: poor, mixed, good, excellent) are easier to calibrate across interviewers, and easier to audit later.

Example: Software Engineering Coding Interview Rubric

Google's sample structured interview grading rubric
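
To make the later steps concrete, here is a minimal sketch of how a BSS-style rubric could be represented in code, assuming Python and a handful of placeholder behavior statements (the wording below is mine, not Google's published rubric):

```python
from dataclasses import dataclass

# Levels in ascending order; step 4 maps these onto numeric scores.
LEVELS = ["poor", "mixed", "good", "excellent"]

@dataclass
class FocusArea:
    name: str
    behaviors: dict[str, list[str]]  # level -> short BSS-style behavior summaries

# Illustrative fragment only; the statements are paraphrased placeholders.
problem_solving = FocusArea(
    name="Problem Solving",
    behaviors={
        "poor": [
            "Struggles to break the problem into steps or pick a workable approach",
        ],
        "mixed": [
            "Reaches a workable approach, but only after heavy hints",
        ],
        "good": [
            "Clarifies requirements and tests assumptions before committing to an approach",
        ],
        "excellent": [
            "Compares alternative approaches and justifies the tradeoffs",
        ],
    },
)
```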

3) Extract grounded evidence ("cite before you speak")

Now the LLM's job is not to "judge the candidate". It's to match the transcript and artifacts to the rubric's behavior statements. For each behavior under each level (for example, "Struggles to break the problem into steps or pick a workable approach"), the model should extract the best supporting evidence from the interview artifacts (transcript, code, and any diagram) and return it with a direct link to the exact transcript line(s) so a human reviewer can verify the claim.

If it can't find evidence, it must say "insufficient evidence" rather than guessing. This "cite before you speak" constraint makes the output harder to steer and easier to audit: every rating is backed by something you can verify.[3]

Prompt construction is critical here. Explicitly instruct the model to extract the shortest transcript segment that demonstrates the behavior. This mitigates verbosity bias, where longer responses are favored regardless of substance.
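
Here is a sketch of what that extraction step could look like, assuming a generic `call_llm(prompt)` helper for whatever provider you use; the prompt wording and JSON shape are illustrative. The key point is that every cited line is re-checked against the transcript before a human ever sees the claim:

```python
import json

def build_prompt(behavior: str, numbered_transcript: str) -> str:
    # Ask for the *shortest* supporting segment to counter verbosity bias,
    # and require an explicit empty result instead of a guess.
    return (
        "You are matching interview evidence to a rubric behavior.\n"
        "Return JSON of the form:\n"
        '  {"evidence": [{"lines": [<line numbers>], "quote": "<shortest supporting segment>"}]}\n'
        'If there is no supporting evidence, return {"evidence": []}. Do not guess.\n\n'
        f"Behavior: {behavior}\n\n"
        f"Transcript (numbered lines):\n{numbered_transcript}\n"
    )

def verify(claims: list[dict], transcript_lines: list[str]) -> list[dict]:
    """Keep only claims whose cited lines exist and actually contain the quoted text."""
    kept = []
    for claim in claims:
        cited = " ".join(
            transcript_lines[i - 1]
            for i in claim.get("lines", [])
            if 1 <= i <= len(transcript_lines)
        )
        if claim.get("quote") and claim["quote"] in cited:
            kept.append(claim)
    return kept

def extract_evidence(behavior: str, transcript_lines: list[str], call_llm) -> list[dict]:
    numbered = "\n".join(f"{i}: {line}" for i, line in enumerate(transcript_lines, start=1))
    raw = call_llm(build_prompt(behavior, numbered))  # call_llm is a placeholder for your provider's API
    claims = json.loads(raw).get("evidence", [])
    return verify(claims, transcript_lines)
```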

Not all models ground equally well, but you can measure how well a given model does. Zeng et al. propose the Claim Grounding Rate (CGR): the fraction of claims supported by the evidence set (higher is better).[3]

CGR = grounded claims / total claims
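
A direct, if simplified, way to compute it over the extraction output (claim format as in the sketch above):

```python
def claim_grounding_rate(claims: list[dict], transcript_lines: list[str]) -> float:
    """CGR = grounded claims / total claims; a claim counts as grounded
    if its quote really appears in the transcript lines it cites."""
    if not claims:
        return 0.0
    grounded = 0
    for claim in claims:
        cited = " ".join(
            transcript_lines[i - 1]
            for i in claim.get("lines", [])
            if 1 <= i <= len(transcript_lines)
        )
        if claim.get("quote") and claim["quote"] in cited:
            grounded += 1
    return grounded / len(claims)
```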

You can push this further. Inspired by Andrej Karpathy's LLM Council, run evidence extraction with two or three different models (ideally from different providers). When all of them independently cite the same transcript lines for a behavior, that's a strong signal that the evidence is real. When they disagree, you've surfaced a review point for the human-in-the-loop.

Example: Multi-Model Evidence Extraction

Demonstrating consensus across GPT-5, Claude Sonnet, and Gemini 2.5 Pro
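
One way to operationalize that consensus check: collect, per model, the transcript lines cited for a behavior and look for overlap. The sketch below assumes each model's output has already been parsed into the claim format used earlier; what counts as "agreement" is a policy choice, not a fixed rule.

```python
def cited_lines(claims: list[dict]) -> set[int]:
    """All transcript line numbers one model cites for a behavior."""
    return {i for claim in claims for i in claim.get("lines", [])}

def consensus(per_model_claims: dict[str, list[dict]]) -> dict:
    """Flag a behavior as 'agreed' when every model cites at least one common line."""
    line_sets = [cited_lines(c) for c in per_model_claims.values() if c]
    common = set.intersection(*line_sets) if line_sets else set()
    return {
        "models_with_evidence": len(line_sets),
        "common_lines": sorted(common),
        # Disagreement (some models find evidence, others don't, or no shared lines)
        # becomes a review point for the human-in-the-loop.
        "agreed": len(line_sets) == len(per_model_claims) and bool(common),
    }

# Example usage with three hypothetical model outputs:
report = consensus({
    "model_a": [{"lines": [41, 42], "quote": "..."}],
    "model_b": [{"lines": [42], "quote": "..."}],
    "model_c": [{"lines": [42, 43], "quote": "..."}],
})
print(report)  # {'models_with_evidence': 3, 'common_lines': [42], 'agreed': True}
```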

4) Apply a scoring model

Once the evidence is collected, scoring becomes a policy choice. The hiring team has to decide how to turn behavior-level evidence into a rating for each focus area, and then combine those focus areas into a final decision.

In Industrial-Organizational Psychology, a few common decision models show up repeatedly when combining signals:

  • Compensatory model: strong scores in one area can offset weaker scores in another (a weighted sum).[9]
  • Multiple hurdle model: candidates must pass each stage in sequence, and failing a stage stops the process.[10]
  • Multiple cutoff model: candidates must meet a minimum bar on every dimension (fail one, fail overall).[11]
  • Modified compensatory model: apply floors first, then allow compensation above those floors (a practical hybrid used in HR).[12]
  • Conjunctive-compensatory: an academic framing of “floors first, then tradeoffs.”[13]
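
As a rough illustration of how these decision models differ in code (the weights, cutoffs, and stage order are all illustrative policy choices, not recommendations):

```python
def compensatory(scores: dict[str, float], weights: dict[str, float]) -> float:
    # Weighted sum: strength in one area can offset weakness in another.
    return sum(weights[a] * s for a, s in scores.items())

def multiple_cutoff(scores: dict[str, float], cutoffs: dict[str, float]) -> bool:
    # Must meet a minimum bar on every dimension; fail one, fail overall.
    return all(scores[a] >= cutoffs[a] for a in cutoffs)

def multiple_hurdle(stage_results: list[bool]) -> bool:
    # Stages are evaluated in sequence; the first failure stops the process.
    for passed in stage_results:
        if not passed:
            return False
    return True

def modified_compensatory(
    scores: dict[str, float], weights: dict[str, float], floors: dict[str, float]
) -> float | None:
    # Floors first, then compensation above the floors.
    if not multiple_cutoff(scores, floors):
        return None  # rejected at the floor stage
    return compensatory(scores, weights)
```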

For this example, we'll use the compensatory model, where strong evidence in one behavior can offset weaker evidence in another. First, score within each focus area by mapping levels to numbers:

poor = -1, mixed = 0, good = 1, excellent = 2

For a given focus area (say, Problem Solving), your rubric contains multiple behavior statements across levels. After evidence extraction, each behavior statement gets a score (only if there's sufficient evidence). Then aggregate those behavior scores into a single focus-area score:

PS = (s₁ + s₂ + … + sₙ) / n

Higher scores indicate more good/excellent behaviors, and lower scores indicate more poor behaviors.
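
Concretely, with the level-to-number mapping above, the focus-area score is just the mean over the behaviors where evidence was found. The example values below are made up:

```python
LEVEL_SCORE = {"poor": -1, "mixed": 0, "good": 1, "excellent": 2}

def focus_area_score(evidenced_behaviors: list[tuple[str, str]]) -> float | None:
    """Average the scores of behaviors with sufficient evidence.

    evidenced_behaviors: (behavior statement, level) pairs where evidence was found.
    Returns None when no behavior in the area has any evidence.
    """
    if not evidenced_behaviors:
        return None
    scores = [LEVEL_SCORE[level] for _, level in evidenced_behaviors]
    return sum(scores) / len(scores)

# Example: three evidenced behaviors in Problem Solving.
ps = focus_area_score([
    ("Clarifies requirements before coding", "good"),
    ("Tests assumptions against examples", "good"),
    ("Struggles to break the problem into steps", "poor"),
])
print(ps)  # (1 + 1 - 1) / 3 ≈ 0.33
```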

The downside of a pure compensatory model is that it can average away serious "poor" signals. That is how teams end up hiring a toxic high performer, sometimes described as a "brilliant jerk".[14]

A practical fix is to add a floor before compensation. If at least 50% of behaviors where evidence was found are rated poor, force the focus-area rating to poor. For example, if the Poor level lists 10 behaviors but evidence is found for 5 of them, then the area is rated poor and higher levels cannot compensate.[12]

If (poor_evidenced / total_evidenced) ≥ 0.5 → area = poor

This is a "floors first, then tradeoffs" approach, a conjunctive stage that filters out unacceptable levels before allowing compensation above the floor.[13]

Conclusion

Making LLM-assisted hiring auditable and defensible is more complex than it looks.

Platforms like Zoom, Google Meet, Teams, and CoderPad weren't built for interviewing in the age of AI. Sure, they have AI tools bolted on, but that just makes them AI horseless carriages. The interview experience needs to be reimagined from the ground up. That's why I built ScreenStack, an early prototype implementing this workflow.

It brings transcript, code, and whiteboard into one place and captures code snapshots throughout the interview, so you see how the candidate arrived at their solution, not just the final result. This reveals the problem-solving process far better than static artifacts. ScreenStack applies the evidence-based approach from this article plus many other techniques to help teams make faster, more rigorous hiring decisions.

Try it here: https://app.screenstack.tech/

References

  [1] Zheng et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv:2306.05685. https://arxiv.org/abs/2306.05685
  [2] Ansari et al. Evaluating Cascaded Speech-to-Text–LLM–Text-to-Speech Systems. arXiv:2507.16835. https://arxiv.org/html/2507.16835v1
  [3] Zeng et al. Cite Before You Speak: Enhancing Context-Response Grounding in E-commerce Conversational LLM-Agents. arXiv:2503.04830. https://arxiv.org/abs/2503.04830
  [4] Google re:Work. Use structured interviewing. https://rework.withgoogle.com/intl/en/guides/hiring-use-structured-interviewing
  [5] Smith & Kendall (1963). Retranslation of expectations (Citation Classic summary). https://garfield.library.upenn.edu/classics1983/A1983RB10200001.pdf
  [6] Kline & Sulsky (2009). Measurement and Assessment Issues in Performance Appraisal. Canadian Psychology. https://stevenmbrownportfolio.weebly.com/uploads/1/7/4/6/17469871/measurement_issues.pdf
  [7] Klieger et al. (2018). Development of the Behaviorally Anchored Rating Scales for the Skills Demonstration and Progression Guide (BSS discussion). https://files.eric.ed.gov/fulltext/EJ1202769.pdf
  [8] Google. Interview grading rubric (example). https://docs.google.com/document/d/1iPw2p90HbEciKpt84JSVwefFnYtkN_W3X9SVV6FtvCg/edit
  [9] Schmidt & Hunter (1998). The Validity and Utility of Selection Methods in Personnel Psychology. Psychological Bulletin.
  [10] Sackett & Roth (1996). Multi-stage selection strategies: A Monte Carlo investigation of effects on performance and minority hiring. Personnel Psychology.
  [11] Guion (2011). Assessment, Measurement, and Prediction for Personnel Decisions (2nd ed.). Routledge. https://www.taylorfrancis.com/books/mono/10.4324/9780203836767/assessment-measurement-prediction-personnel-decisions-robert-guion
  [12] SHRM. Employee Selection - Structured Exercise (Instructor's manual).
  [13] Srinivasan (1988). A conjunctive-compensatory approach to the self-explication of multiattributed preferences. Decision Sciences.
  [14] Housman & Minor (2015). Toxic Workers. Harvard Business School Working Paper 16-057.