Signal Through the Noise

Honest takes on code, AI, and what actually works


Evaluate: Why Human Judgment Is Non-Negotiable

Posted on February 3, 2026 by ivan.turkovic

We have arrived at the phase of ADD (AI-Driven Development) where the most important human skill comes into play. You have written a specification. You have generated code using appropriate context and patterns. Now you must determine whether that code is actually correct.

This is not a formality. AI-generated code can be syntactically correct, pass basic tests, and still be fundamentally wrong. It can implement something plausible that is not what you specified. It can handle the happy path while silently failing on edge cases. It can introduce subtle vulnerabilities that no linter or automated tool will catch.

Evaluation is the quality gate that separates disciplined AI usage from hope-based development. And it requires skills that only humans bring.

Why Evaluation Cannot Be Skipped or Automated Away

The temptation to skip evaluation is real. You wrote a detailed specification. You used good prompt patterns. The generated code looks clean and well-structured. Why not just run the tests and move on?

Because the tests themselves might be incomplete. Because “looks clean” is not the same as “is correct.” Because the AI’s failure modes are subtle, not obvious.

There is also a temptation to automate evaluation entirely: let the linter check style, let the tests check correctness, let static analysis check security. These tools are valuable. They catch real issues. But they check for known patterns of failure, not for the kinds of novel errors that AI can introduce.

An AI might generate a function that perfectly implements a caching layer but invalidates cached entries using the wrong key format. The tests pass because the test data happens to work with both key formats. The linter finds nothing wrong. Static analysis sees correct caching patterns. Only a human reading the code and understanding the system recognizes that this caching approach will fail in production when key formats diverge.
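
Below is a minimal sketch of that failure mode, with hypothetical names and a plain dict standing in for the cache. The write path and the invalidation path build the key differently, and the test data happens to make both formats agree:

```python
# Hypothetical sketch: tests with "convenient" data pass while real data
# exposes the key-format divergence between the write and invalidation paths.

cache: dict[str, dict] = {}

def cache_user(org_id: str, user_id: str, data: dict) -> None:
    # Write path builds keys as "org:user".
    cache[f"{org_id}:{user_id}"] = data

def invalidate_user(org_id: str, user_id: str) -> None:
    # Invalidation path silently lowercases the org id -- a plausible
    # "normalization" the AI adds here but not on the write path.
    cache.pop(f"{org_id.lower()}:{user_id}", None)

# Test data happens to be lowercase, so both key formats agree: test passes.
cache_user("acme", "42", {"name": "Ada"})
invalidate_user("acme", "42")
assert "acme:42" not in cache

# Production data diverges: the stale entry survives invalidation.
cache_user("ACME", "42", {"name": "Ada"})
invalidate_user("ACME", "42")
print("stale entry still cached:", "ACME:42" in cache)  # True
```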

I have encountered this exact category of failure multiple times. The code works. The tests pass. Everything looks correct. But a detail that requires understanding the full system context is wrong, and no automated tool has the system-level awareness to catch it.

Automated tools complement human evaluation. They do not replace it. Use them to catch the easy issues so your human evaluation time can focus on the hard ones.

The Multiple Dimensions of Evaluation

Effective evaluation examines code across five dimensions. Each catches different categories of failure, and none is sufficient alone.

Correctness: Does It Implement the Specification?

The most fundamental question. Does the code do what the specification says it should do?

This sounds simple, but it requires careful comparison between the specification and the implementation. Check each requirement. Does the function accept the specified input types? Does it return the specified output types? Does it handle each specified error condition? Does it follow each specified business rule?

Common AI failures in correctness:

Partial implementation. The AI implements most of the specification but quietly omits a requirement. It generates eight out of ten validation rules. The two missing rules are not immediately obvious because the code looks complete.

Plausible alternatives. The AI implements something similar to the specification but not identical. You specified a function that returns the first matching item. The AI returns the last matching item. Both implementations are reasonable; only one matches the specification.
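
Here is a small illustration of that divergence, using a hypothetical specification that asks for the first matching item. Both functions look like reasonable implementations of "return the matching item"; only one matches the spec:

```python
def find_first_match(items, predicate):
    # What the specification asks for: first match, None if absent.
    for item in items:
        if predicate(item):
            return item
    return None

def find_match_plausible_alternative(items, predicate):
    # What the AI might produce instead: still "the matching item",
    # but it is the last one.
    result = None
    for item in items:
        if predicate(item):
            result = item
    return result

orders = [{"id": 1, "status": "open"}, {"id": 2, "status": "open"}]

def is_open(order):
    return order["status"] == "open"

print(find_first_match(orders, is_open)["id"])                  # 1 (per spec)
print(find_match_plausible_alternative(orders, is_open)["id"])  # 2 (diverges)
```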

Specification interpretation. The AI interprets ambiguous parts of the specification differently than you intended. This reveals specification gaps, which is useful, but only if you catch the divergence during evaluation.

Check correctness by tracing through the specification point by point. Do not rely on an overall impression of “this looks right.” Compare systematically.

Fitness: Does It Fit the Existing System?

Correct code that does not fit your system is wrong in context. Fitness evaluation examines whether the generated code integrates properly with the surrounding codebase.

Pattern consistency. Does the code follow the same patterns as the rest of your system? If your codebase uses the repository pattern for data access, does the generated code follow it, or did the AI invent its own approach?

Naming conventions. Do variable names, function names, and class names follow your team’s conventions? This sounds trivial, but inconsistent naming creates real confusion for future developers.

Error handling alignment. Does the generated code handle errors the same way the rest of your system does? If your system uses typed errors with error codes, does the generated code produce typed errors with error codes, or does it throw generic exceptions?
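
A short sketch of that mismatch, assuming a hypothetical AppError type standing in for whatever typed-error convention your system actually uses:

```python
class AppError(Exception):
    # Hypothetical stand-in for your system's typed error convention.
    def __init__(self, code: str, message: str):
        super().__init__(message)
        self.code = code

def charge_consistent(amount: int) -> None:
    # Fits the system: callers can branch on err.code.
    if amount <= 0:
        raise AppError("PAYMENT_INVALID_AMOUNT", "amount must be positive")

def charge_generated(amount: int) -> None:
    # Typical AI output: correct behavior, wrong error convention.
    if amount <= 0:
        raise ValueError("Invalid amount")

for charge in (charge_consistent, charge_generated):
    try:
        charge(0)
    except AppError as err:
        print("handled by the system's error path:", err.code)
    except ValueError:
        print("generic exception bypassed the system's error handling")
```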

Abstraction levels. Does the generated code operate at the same level of abstraction as surrounding code? AI sometimes produces code that is more granular or more abstract than the context requires, creating a mismatch that makes the codebase harder to understand.

Fitness evaluation requires knowledge of your codebase. This is one reason why the skills discussed in Post 1 of this series remain essential. You cannot evaluate fitness for a system you do not understand.

Security: Does It Introduce Vulnerabilities?

Security evaluation is particularly important for AI-generated code because AI models replicate patterns from training data, including insecure patterns. The code may look correct while containing vulnerabilities that are not visible to casual review.

Input handling. Does the code validate and sanitize all external input? Does it use parameterized queries for database access? Does it encode output for the appropriate context (HTML, URL, SQL)?
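
For the database point, a minimal sqlite3 sketch shows the difference between string concatenation (a pattern AI readily reproduces from training data) and a parameterized query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'ada@example.com')")

def find_user_unsafe(email: str):
    # Vulnerable: user input concatenated straight into SQL.
    return conn.execute(
        f"SELECT id FROM users WHERE email = '{email}'"
    ).fetchall()

def find_user_safe(email: str):
    # Parameterized query: the driver handles escaping.
    return conn.execute(
        "SELECT id FROM users WHERE email = ?", (email,)
    ).fetchall()

malicious = "' OR '1'='1"
print(find_user_unsafe(malicious))  # returns every row
print(find_user_safe(malicious))    # returns nothing
```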

Authentication and authorization. If the code involves access control, does it check permissions correctly? Does it fail closed (denying access by default) rather than failing open?
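
A hypothetical permission check makes the fail-open versus fail-closed distinction concrete; the role names and permission table are illustrative only:

```python
ROLE_PERMISSIONS = {"admin": {"delete_project"}, "viewer": set()}

def can_delete_fail_open(role: str) -> bool:
    # Fail-open: an unknown role falls through to "allowed".
    if role in ROLE_PERMISSIONS and "delete_project" not in ROLE_PERMISSIONS[role]:
        return False
    return True

def can_delete_fail_closed(role: str) -> bool:
    # Fail-closed: deny unless the permission is explicitly granted.
    return "delete_project" in ROLE_PERMISSIONS.get(role, set())

print(can_delete_fail_open("intern"))    # True -- unknown role slips through
print(can_delete_fail_closed("intern"))  # False
```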

Data exposure. Does the code log sensitive data? Does it include sensitive information in error messages? Does it expose internal details through API responses?

Cryptographic practices. If the code involves encryption, hashing, or token generation, does it use current recommended approaches? AI models sometimes generate code using deprecated cryptographic libraries or approaches that were common in older training data.

Timing vulnerabilities. Does the code compare secrets using constant-time comparison? Does it reveal information through response timing differences?
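
In Python, for instance, a plain == on a secret can short-circuit at the first differing character and leak information through timing, while hmac.compare_digest compares in constant time. A minimal sketch:

```python
import hmac

def verify_token_naive(provided: str, expected: str) -> bool:
    # String equality can short-circuit, leaking timing information.
    return provided == expected

def verify_token_constant_time(provided: str, expected: str) -> bool:
    # Comparison time does not depend on where the strings differ.
    return hmac.compare_digest(provided.encode(), expected.encode())

print(verify_token_constant_time("abc123", "abc123"))  # True
print(verify_token_constant_time("abc124", "abc123"))  # False
```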

Security evaluation requires specific knowledge. If your team lacks security expertise, consider whether AI-generated code touching security-sensitive areas should receive external review.

Performance: Will It Scale?

AI-generated code often works correctly at small scale but contains performance issues that emerge under production load.

Algorithmic complexity. What is the time complexity of the generated code? Is it appropriate for expected input sizes? AI sometimes generates O(n²) solutions when O(n) or O(n log n) approaches are needed.
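
A classic illustration: checking a list for duplicates with a nested loop versus a set. Both are correct; only one survives production input sizes:

```python
import random

def has_duplicates_quadratic(items):
    # O(n^2): compares every pair. Fine for 100 items, painful for 100,000.
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False

def has_duplicates_linear(items):
    # O(n): one set membership check per item.
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False

data = [random.randint(0, 10**9) for _ in range(100_000)]
print(has_duplicates_linear(data))   # returns almost instantly
# has_duplicates_quadratic(data)     # same answer, ~5 billion comparisons
```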

Memory usage. Does the code load unnecessary data into memory? Does it create unnecessary copies? For large data sets, does it use streaming or pagination?

Database efficiency. Does the code produce efficient queries? Does it avoid N+1 query patterns? Does it use appropriate indexing?
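
A sketch of the N+1 shape against an in-memory sqlite3 table, with hypothetical table and column names: one query per id versus a single grouped query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, i % 10) for i in range(100)])

def order_counts_n_plus_one(customer_ids):
    # One round trip per customer id -- the classic N+1 shape once you
    # count the query that fetched the ids in the first place.
    return {
        cid: conn.execute(
            "SELECT COUNT(*) FROM orders WHERE customer_id = ?", (cid,)
        ).fetchone()[0]
        for cid in customer_ids
    }

def order_counts_single_query(customer_ids):
    # One grouped query for all customers; the database does the work.
    ids = list(customer_ids)
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        "SELECT customer_id, COUNT(*) FROM orders "
        f"WHERE customer_id IN ({placeholders}) GROUP BY customer_id",
        ids,
    ).fetchall()
    return dict(rows)

print(order_counts_single_query(range(10)) == order_counts_n_plus_one(range(10)))  # True
```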

Concurrency. If the code handles concurrent operations, does it use appropriate synchronization? Does it avoid unnecessary blocking?

Performance evaluation requires understanding of your system’s scale. Code that performs adequately for 100 records might be catastrophically slow for 100,000. Evaluate against realistic production volumes, not test data sizes.

Maintainability: Can Others Understand It?

The final dimension addresses the long-term health of your codebase. Code is read far more often than it is written. AI-generated code must be understandable by humans who encounter it in the future.

Clarity of intent. Does the code clearly communicate what it does and why? Are complex sections explained? Would a new team member understand this code without the specification?

Appropriate complexity. Is the code as simple as it could be? AI sometimes generates unnecessarily complex solutions: using advanced features where simple ones would suffice, or abstracting things that do not need abstraction.

Testability. Is the code structured so it can be easily tested? Does it have clear inputs, outputs, and side effects? Or does it entangle concerns in ways that make testing difficult?

Documentation accuracy. If the AI generated comments or documentation, are they accurate? AI-generated comments sometimes describe what the code should do rather than what it actually does, especially after iterations that changed the implementation without updating comments.

Common AI Failure Modes

Understanding how AI commonly fails helps you know where to focus evaluation attention.

Plausible but incorrect logic. The code follows a reasonable approach but gets a detail wrong. An off-by-one error in a boundary check. A comparison that should be inclusive but is exclusive. A sort that is ascending when it should be descending. These errors are hard to spot because the code reads naturally.
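
A tiny hypothetical example of the inclusive/exclusive variety, where the specification says a discount applies to orders of $100 or more:

```python
THRESHOLD = 100

def discount_applies_generated(total: int) -> bool:
    # Reads naturally, but the exclusive comparison silently excludes
    # the boundary value the specification includes.
    return total > THRESHOLD

def discount_applies_per_spec(total: int) -> bool:
    return total >= THRESHOLD

print(discount_applies_generated(100))  # False -- diverges from the spec
print(discount_applies_per_spec(100))   # True
```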

Training data bias. The AI generates code reflecting common patterns in its training data, which may not be current best practices. Using older library versions, deprecated API calls, or patterns that have known issues. The code works but uses approaches that experienced developers would avoid.

Hidden assumptions. The AI makes assumptions about data format, size, encoding, or availability that are not visible in the code. These assumptions hold for the test data but fail for production data.

Confident incorrectness. The AI generates code with authoritative-looking comments that describe incorrect behavior. “This function returns the median value” when it actually returns the mean. The comment creates false confidence during review.
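
A compact, hypothetical example of the pattern: the docstring promises a median, the body computes a mean, and a reviewer skimming the comment walks away reassured.

```python
def central_value(samples: list[float]) -> float:
    """Return the median of the samples."""  # Confident, and wrong.
    # The implementation actually computes the arithmetic mean.
    return sum(samples) / len(samples)

data = [1.0, 2.0, 100.0]
print(central_value(data))  # 34.33..., while the median is 2.0
```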

Copy-paste drift. When the AI generates multiple similar functions, later functions may drift from the pattern of earlier ones. The first three are correct; the fourth has a subtle variation because the AI’s attention shifted.

Evaluation Techniques and Heuristics

Read the code without the specification first. Before checking against your specification, read the code and form your own understanding of what it does. Then compare that understanding to the specification. This catches cases where the code does something different from what was specified but looks reasonable enough that specification-focused review might miss it.

Trace through edge cases manually. Pick the most challenging edge cases from your specification and trace the code’s execution path for each. Do not trust your intuition that “it probably handles this correctly.” Walk through the code line by line with concrete edge case values.

Check what is not there. AI-generated code often fails by omission. Look for what should be present but is not: missing null checks, missing error handling, missing validation, missing logging. The code that is present might be correct; the danger is in what was left out.
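
A hypothetical example of failure by omission: every line that is present is reasonable; the problem is the check that is not there.

```python
def normalize_email(user: dict) -> str:
    # The visible code is fine. The omission: nothing verifies that
    # "email" exists or is not None before calling .strip().
    return user["email"].strip().lower()

print(normalize_email({"email": "  Ada@Example.COM "}))  # works on happy-path data
# normalize_email({"email": None})   # AttributeError in production
# normalize_email({})                # KeyError in production
```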

Compare to your exemplar. If you used the Exemplar Pattern during generation, compare the output to the exemplar. Does it follow the same patterns? Where it differs, is the difference intentional or accidental?

Ask “what could go wrong?” For each significant piece of logic, imagine failure scenarios. What if this API call times out? What if this data is null? What if this string is in an unexpected encoding? If the code does not handle these scenarios, add them to your evaluation notes.

Look for the second-order effects. First-order evaluation asks “does this code work?” Second-order evaluation asks “what does this code change about the system?” Does it introduce new dependencies? Does it change the performance profile of an existing workflow? Does it alter error behavior in ways that affect calling code? AI-generated code exists in a system, and its effects extend beyond its immediate function.

Time-box your evaluation. For routine tasks, set a time limit. If you cannot evaluate the code thoroughly in that time, the code may be too complex for the task. Consider decomposing the task or simplifying the specification. Evaluation should be thorough but proportionate to the risk and complexity of the generated code.

The Skill Requirement

Here is the uncomfortable truth about evaluation: you cannot evaluate what you do not understand.

If you do not know how SQL injection works, you cannot evaluate whether generated code is vulnerable to it. If you do not understand time complexity, you cannot evaluate whether an algorithm will scale. If you do not know your codebase’s architecture, you cannot evaluate whether generated code fits it.

This is why the skill maintenance practices I described in Post 1 of this series are not optional extras. They are prerequisites for effective evaluation. The better you understand code, architecture, and engineering principles, the more effective your evaluations become.

This also means that evaluation speed improves with practice. The first time you systematically evaluate AI-generated code across all five dimensions, it takes significant time. As you build pattern recognition for common AI failure modes, you develop intuitions about where to focus attention. Experienced evaluators check the same dimensions but do so faster because they recognize patterns.

Do not let the time cost of thorough evaluation discourage you. The alternative is not saving time. The alternative is spending time later debugging production issues, fixing security vulnerabilities, or untangling architectural drift. Evaluation time is an investment with measurable returns.

Teams can also build evaluation efficiency by sharing their findings. When one developer discovers a new AI failure mode, sharing that discovery means the entire team can check for it. Over time, the team develops a collective awareness of where AI-generated code tends to fail in their specific context. This collective knowledge makes individual evaluations faster and more effective.

When Evaluation Fails Repeatedly

Sometimes generated code fails evaluation consistently. Each attempt produces output that does not meet the specification, does not fit the system, or introduces problems across multiple dimensions.

This is a signal, and you should read it correctly.

If the code fails on correctness repeatedly, the specification may be ambiguous or the task may exceed the AI’s reliable capability. Consider revising the specification or decomposing the task further.

If the code fails on fitness repeatedly, you may need to provide more context. Include more exemplars. Add more constraints. Make the system’s patterns explicit in the specification.

If the code fails on security or performance repeatedly, the task may be in a category where AI assistance is unreliable. Consider writing security-critical or performance-critical code manually.

If multiple dimensions fail simultaneously, the task may not be suitable for AI-assisted generation at all. Recognizing when to step away from AI and write code directly is itself a valuable skill.

Repeated evaluation failure should prompt reflection, not frustration. Each failure teaches you something about what the AI can and cannot do reliably, and that knowledge improves your judgment for future tasks.

From Evaluation to Integration

Code that passes evaluation across all five dimensions moves to the Integrate phase. But evaluation does not end with a binary pass/fail. Good evaluation produces notes: minor concerns, potential improvements, areas to monitor. These notes travel with the code into integration.

In Post 11, I will provide concrete evaluation checklists that operationalize the dimensions covered here. These checklists transform the principles of evaluation into repeatable practices that ensure consistency across team members and across tasks.

For now, begin practicing multi-dimensional evaluation. The next time you review AI-generated code, consciously check each dimension: correctness, fitness, security, performance, maintainability. Notice which dimensions you naturally focus on and which you tend to skip. The dimensions you skip are where problems hide.


Let’s Continue the Conversation

Which evaluation dimension do you find most challenging? Where have you discovered AI failures that surprised you?

What evaluation techniques have you developed beyond what is covered here?

Share your experience in the comments. Evaluation is a skill that sharpens with practice and with exposure to others’ approaches.
