In the previous post, I covered the five dimensions of evaluating AI-generated code: correctness, fitness, security, performance, and maintainability. Understanding these dimensions is essential. But understanding is not enough.
Under time pressure, even experienced developers skip evaluation steps. They focus on the dimensions they find most interesting or most familiar, and they neglect the others. When different team members evaluate code, they check different things, and nobody notices the gaps until a bug reaches production.
Checklists address this problem. They transform evaluation principles into repeatable practice. They ensure that every evaluation covers the same ground, regardless of who performs it or how much time pressure exists.
This post provides complete checklists you can use immediately, plus guidance on adapting them to your context and evolving them based on what you learn.
Why Checklists Work for Evaluation
The case for checklists comes from fields where the cost of missing something is high: aviation, surgery, construction. In these domains, professionals with decades of experience still use checklists because human memory and attention are unreliable under pressure.
Software development shares these characteristics. The cost of missing a security vulnerability or a performance problem can be severe. The pressure to ship is constant. And evaluation requires checking many things in sequence, which is exactly the situation where checklists provide the most value.
Checklists work because they externalize memory. You do not need to remember what to check. The checklist remembers for you. This frees cognitive resources for the actual evaluation: looking at the code, understanding what it does, and judging whether it is correct.
Checklists also create consistency across a team. When everyone uses the same checklist, everyone checks the same things. Gaps that would exist with individual judgment are eliminated. New team members can perform effective evaluations immediately because the checklist encodes what experienced members have learned.
Finally, checklists are learnable artifacts. When evaluation misses a bug, you can ask: “What checklist item would have caught this?” The answer becomes a new item, and future evaluations benefit from the lesson learned. Over time, your checklist becomes a record of everything that has ever gone wrong.
The Basic Evaluation Checklist
This checklist covers all five evaluation dimensions from the previous post. It is designed to be comprehensive enough to catch common issues while remaining short enough to actually use. Fifteen items, each with a specific question to answer.
Correctness
- Specification match. Does every requirement in the specification have a corresponding implementation? List each requirement and verify it is addressed.
- Input handling. Does the code handle all specified input types, including edge cases like empty values, null, maximum sizes, and invalid formats?
- Output conformance. Does the code produce output in exactly the specified format, type, and structure?
- Error conditions. Does the code handle all specified error conditions and produce the specified error responses?
- Logic verification. For complex logic, have you traced through at least one happy path and one error path with concrete values? (A brief sketch follows this list.)
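To make the logic-verification item concrete, here is a minimal sketch of tracing one happy path and one error path with concrete values. The `apply_discount` function is hypothetical; substitute whatever the generation actually produced.

```python
# Hypothetical generated function under evaluation: applies a percentage
# discount to an order total, rejecting discounts outside 0-100.
def apply_discount(total: float, percent: float) -> float:
    if percent < 0 or percent > 100:
        raise ValueError(f"invalid discount percent: {percent}")
    return round(total * (1 - percent / 100), 2)

# Happy path traced with concrete values: 20% off 50.00 should be 40.00.
assert apply_discount(50.00, 20) == 40.00

# Error path traced with concrete values: a 150% discount must be rejected.
try:
    apply_discount(50.00, 150)
except ValueError:
    pass  # expected: the specified error condition is produced
else:
    raise AssertionError("invalid discount was silently accepted")
```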
Fitness
- Pattern consistency. Does the code follow the same patterns used elsewhere in the codebase for similar functionality?
- Naming alignment. Do names (variables, functions, classes) follow the conventions established in the existing codebase?
- Error handling style. Does the code handle errors using the same approach (exceptions, result types, error codes) as the surrounding system?
Security
- Input validation. Is all external input validated before use? Are there any paths where unvalidated input reaches sensitive operations? (A sketch of validation and encoding follows this list.)
- Output encoding. Is output encoded appropriately for its context (HTML, SQL, URL, JSON)?
- Sensitive data handling. Does the code avoid inappropriately logging, exposing in error messages, or transmitting sensitive data?
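As a concrete illustration of the input-validation and output-encoding items, the sketch below validates an external value before it is used and escapes it for its HTML context. The `render_greeting` handler and its validation rule are hypothetical; the pattern, not the specifics, is what the checklist items ask you to confirm.

```python
import html
import re

# Hypothetical handler: accepts a display name from an external request
# and embeds it in an HTML fragment.
NAME_PATTERN = re.compile(r"^[A-Za-z][A-Za-z '\-]{0,49}$")

def render_greeting(raw_name: str) -> str:
    # Input validation: reject anything outside the expected shape
    # before the value is used anywhere.
    if not NAME_PATTERN.match(raw_name):
        raise ValueError("invalid display name")
    # Output encoding: escape for the HTML context even after validation,
    # so a later change to the validation rule cannot reintroduce injection.
    return f"<p>Hello, {html.escape(raw_name)}</p>"

print(render_greeting("Grace O'Connor"))  # valid input, safely encoded
```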
Performance
- Algorithmic efficiency. Is the algorithm appropriate for expected input sizes? Are there any O(n²) or worse operations that could become problems at scale?
- Resource management. Does the code properly acquire and release resources (connections, file handles, memory)?
- Query efficiency. For code involving database access, are queries efficient? Are N+1 patterns avoided? (A sketch of the N+1 pattern follows this list.)
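For the query-efficiency item, the N+1 pattern is the most common offender in generated data-access code. A minimal sketch, assuming a hypothetical `db.query` helper that runs parameterized SQL and returns rows as dictionaries (the `ANY(...)` form assumes a PostgreSQL-style driver):

```python
# N+1 pattern: one query for the orders, then one query per order.
def load_orders_with_items_n_plus_one(db, customer_id):
    orders = db.query("SELECT * FROM orders WHERE customer_id = %s", (customer_id,))
    for order in orders:
        order["items"] = db.query(
            "SELECT * FROM order_items WHERE order_id = %s", (order["id"],)
        )
    return orders

# Preferred: fetch all items for the relevant orders in one query,
# then group them in memory.
def load_orders_with_items(db, customer_id):
    orders = db.query("SELECT * FROM orders WHERE customer_id = %s", (customer_id,))
    order_ids = [order["id"] for order in orders]
    items = db.query(
        "SELECT * FROM order_items WHERE order_id = ANY(%s)", (order_ids,)
    )
    items_by_order = {}
    for item in items:
        items_by_order.setdefault(item["order_id"], []).append(item)
    for order in orders:
        order["items"] = items_by_order.get(order["id"], [])
    return orders
```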
Maintainability
- Clarity of intent. Would a developer unfamiliar with this code understand what it does and why? Is complex logic explained? (A brief example follows.)
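A small illustration of what the clarity-of-intent item looks for. The fee calculation and its constants are hypothetical; the difference is whether the "why" behind each number is recoverable from the code alone.

```python
# Hard to evaluate: the constants and the rounding rule are unexplained.
def fee(amount):
    return max(round(amount * 0.029 + 0.30, 2), 0.50)

# Easier to evaluate: the intent behind each number is stated.
def processing_fee(amount: float) -> float:
    """Card processing fee: 2.9% plus a $0.30 fixed charge,
    with a $0.50 minimum per transaction (hypothetical pricing)."""
    PERCENTAGE_RATE = 0.029
    FIXED_CHARGE = 0.30
    MINIMUM_FEE = 0.50
    return max(round(amount * PERCENTAGE_RATE + FIXED_CHARGE, 2), MINIMUM_FEE)
```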
Using the Basic Checklist
Work through the checklist in order. For each item, give a clear yes or no. If the answer is no, note the specific issue. If you cannot answer the question because you lack information, that itself is a finding: the code may need more context or documentation.
Do not rush. The checklist takes five to ten minutes for a typical code generation. This is not overhead. This is the core of evaluation. The time invested here prevents much larger time investments debugging production issues later.
When an item reveals a problem, you have three options: reject the code and regenerate with an improved specification, fix the issue manually, or note it as accepted technical debt with a plan to address it later. The checklist does not make this decision for you. It ensures you see the issue so you can make an informed decision.
Security-Focused Checklist Additions
For code that handles authentication, authorization, user input, or sensitive data, add these items to the basic checklist.
- Authentication verification. If the code requires authentication, is the authentication check present and correct? Can it be bypassed?
- Authorization checks. If the code requires specific permissions, are those permissions verified? Is the check at the right level of granularity?
- Injection prevention. For code that constructs queries or commands, are parameterized approaches used consistently? Is there any string concatenation with user input? (A sketch contrasting the two follows this list.)
- Secret handling. Are secrets (API keys, passwords, tokens) handled appropriately? Are they ever logged, exposed in URLs, or stored insecurely?
- Cryptographic practices. If encryption or hashing is used, are current recommended algorithms and libraries used? Are there deprecated approaches?
- Session management. For code involving sessions, are sessions created, validated, and invalidated correctly?
- Rate limiting awareness. For exposed endpoints, has the need for rate limiting been considered? Is the code structured to support rate limiting if needed?
- Error information leakage. Do error messages avoid revealing internal details (stack traces, database schemas, internal paths) to external users?
- Cross-site protections. For web-facing code, are CSRF protections in place? Are cookies configured with appropriate flags?
- Dependency security. If the code introduces dependencies, are they from trusted sources? Are there known vulnerabilities in the versions used?
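The injection-prevention item is easiest to check with a concrete contrast in mind. A minimal sketch using Python's built-in sqlite3 module; the table and queries are hypothetical stand-ins for whatever the generated code constructs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'a@example.com')")

def find_user_unsafe(email: str):
    # String concatenation with user input: the pattern the checklist
    # item asks you to flag.
    return conn.execute(
        "SELECT id FROM users WHERE email = '" + email + "'"
    ).fetchall()

def find_user_safe(email: str):
    # Parameterized query: the input is passed as data, never as SQL text.
    return conn.execute(
        "SELECT id FROM users WHERE email = ?", (email,)
    ).fetchall()

print(find_user_safe("a@example.com"))  # [(1,)]
```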
Performance-Focused Checklist Additions
For code with scaling requirements, high-frequency execution, or resource constraints, add these items.
- Memory patterns. Does the code avoid loading large datasets entirely into memory? Is streaming or pagination used where appropriate?
- Caching considerations. Where caching could improve performance, is it implemented? Where caching could cause stale data issues, is it avoided?
- Concurrency safety. If the code may execute concurrently, are shared resources protected appropriately? Are there race conditions?
- Database round trips. Is the number of database round trips minimized? Could multiple queries be combined?
- Index usage. For database queries, will they use indexes effectively? Are there full table scans on large tables?
- Timeout handling. For external calls, are timeouts configured? Does the code handle timeouts gracefully? (A brief sketch follows this list.)
- Resource pooling. For resources like database connections, is connection pooling used appropriately?
- Lazy loading. Is expensive computation or data loading deferred until actually needed?
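For the timeout-handling item, a minimal sketch using only the standard library; the URL, the two-second budget, and the fallback behavior are all hypothetical choices you would adjust to your context.

```python
from typing import Optional
import urllib.error
import urllib.request

def fetch_exchange_rate(url: str = "https://example.com/rates") -> Optional[float]:
    # Timeout configured on the external call so a slow dependency
    # cannot stall this code path indefinitely.
    try:
        with urllib.request.urlopen(url, timeout=2.0) as response:
            return float(response.read().decode())
    except (urllib.error.URLError, TimeoutError, ValueError):
        # Graceful handling: return a sentinel the caller can act on
        # (fall back to a cached rate, retry later, or surface an error).
        return None
```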
Domain-Specific Considerations
The checklists above are generic. Your domain will have specific concerns that deserve their own checklist items.
Financial systems need items about decimal precision, rounding rules, and audit trails. Healthcare systems need items about data privacy regulations and consent tracking. Real-time systems need items about latency budgets and deterministic behavior.
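As one example of turning a domain concern into something checkable, the sketch below shows the decimal-precision issue a financial-systems item would prompt you to look for; the amounts and tax rate are arbitrary.

```python
from decimal import Decimal, ROUND_HALF_UP

# Binary floats accumulate representation error in money calculations.
print(0.1 + 0.2)  # 0.30000000000000004

# Decimal keeps exact cents and makes the rounding rule explicit.
price = Decimal("19.99")
tax = (price * Decimal("0.0825")).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
print(price + tax)  # 21.64
```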
To develop domain-specific items, review past bugs and incidents. Each one that reached production is a candidate for a checklist item that would have caught it earlier. Ask: “What question, if we had asked it during evaluation, would have revealed this problem?”
Document the origin of domain-specific items. “Added after the 2025 decimal rounding incident” helps future team members understand why the item matters and prevents it from being removed as seemingly unnecessary.
Integrating with Existing Code Review
If your team already has code review practices, evaluation checklists should complement them rather than replace them.
One approach: use the evaluation checklist before submitting code for review. The checklist becomes a self-review step that catches obvious issues before another developer spends time on them. The code review then focuses on higher-level concerns: design decisions, architectural fit, and knowledge sharing.
Another approach: incorporate checklist items into your review template. If your team uses pull request templates, add checklist sections that reviewers complete. This makes checklist usage visible and creates accountability.
The goal is not additional process but better coverage. If your current reviews already check everything on these lists, you do not need to change anything. But most teams, when they compare their actual review practices to a comprehensive checklist, find gaps they did not realize existed.
Evolving Checklists Based on Discovered Issues
A checklist is not a static document. It should evolve as you learn.
When a bug escapes evaluation and reaches later stages or production, conduct a brief retrospective. Ask: “What checklist item would have caught this?” If no existing item would have caught it, create a new item. If an existing item should have caught it but was overlooked, investigate why. Perhaps the item is too vague. Perhaps it needs an example. Perhaps it needs to be earlier in the list so it gets more attention.
When checklist items consistently pass without revealing issues, consider whether they are still necessary. If a certain category of bug never occurs in your AI-generated code, the checklist items addressing it may be obsolete. Keep items that address severe risks even if they rarely trigger, but remove items that address low-risk issues that never occur.
When checklist items consistently trigger false positives, refine them. An item that always flags non-issues will train reviewers to ignore it, which defeats the purpose. Narrow the scope, add exceptions, or reword to improve precision.
Review your checklist quarterly. Remove items that have become irrelevant. Add items based on recent learnings. Ensure items are still clear and actionable. A well-maintained checklist grows more valuable over time.
Checklist Anti-Patterns
The exhaustive checklist. Fifty items nobody reads. The intention is thoroughness, but the result is that reviewers skim or skip entirely. Keep your active checklist short enough to actually use. If you have many items, consider splitting into a basic checklist everyone uses and specialized checklists for specific contexts.
The vague checklist. Items like “code is secure” or “performance is adequate” that mean different things to different people. Every item should have a specific question that can be answered yes or no. “Is all user input validated before database queries?” is specific. “Code handles input properly” is not.
The aspirational checklist. Items that describe ideal practices your team does not actually follow. If the checklist says “comprehensive unit tests exist” but your team does not write comprehensive unit tests for generated code, the item will always be skipped or falsely checked. Checklists should reflect actual practice, not aspirational practice.
The ignored checklist. The checklist exists but is not actually used. Perhaps it is not integrated into the workflow. Perhaps it takes too long. Perhaps people do not believe it adds value. An unused checklist provides no benefit. If your checklist is being ignored, find out why and address the root cause.
Making Checklists a Team Asset
Checklists work best when they are shared and maintained by the team, not imposed by an individual.
Store checklists in version control alongside your code. Changes to the checklist should be reviewed like code changes. This creates shared ownership and ensures everyone is working from the same version.
Discuss checklist updates in team retrospectives. When someone proposes a new item, discuss whether it belongs. When someone questions an existing item, evaluate whether it still provides value. These discussions reinforce the purpose of each item and build collective understanding.
Onboard new team members by walking through the checklist. Each item is an opportunity to explain a past lesson or a current standard. The checklist becomes a teaching tool that accelerates learning.
Let’s Continue the Conversation
What checklist items would you add for your domain? What bugs have escaped your evaluation that a checklist item could have caught?
Does your team use evaluation checklists today? If so, what has made them effective or ineffective?
Share your experience in the comments. The best checklists are built from collective learning.