Evaluation design scenarios — CCA-F Exam Prep

L3.16|Evaluation design scenarios

Mystery

The AI passed every test. Then it failed in production.

A content summarization AI scored 95% on the eval suite. The team deployed with confidence. Within a week, customers complained: summaries were missing critical details, inventing facts, and contradicting the source material.

The eval suite had 50 test cases. All were short, well-structured articles. Production content included legal documents, medical reports, and rambling forum posts. The AI was tested on easy mode and deployed on hard mode.

The eval suite was the problem. Not the AI.
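
One way to catch this kind of mismatch before deployment is to compare the mix of content types in the eval suite against a sample of production traffic. The sketch below is a minimal illustration, not a prescribed method. The function name `coverage_gaps`, the `min_ratio` threshold, the content-type labels, and the production proportions are all assumptions made up for the example. Only the 50 short-article test cases come from the scenario above.

```python
from collections import Counter


def coverage_gaps(eval_types, production_types, min_ratio=0.5):
    """Return content types that are underrepresented in the eval suite.

    A type is flagged when its share of the eval suite is less than
    min_ratio times its share of the production sample.
    The result maps each flagged type to (eval_share, production_share).
    """
    eval_counts = Counter(eval_types)
    prod_counts = Counter(production_types)
    eval_total = len(eval_types)
    prod_total = len(production_types)

    gaps = {}
    for content_type, prod_count in prod_counts.items():
        prod_share = prod_count / prod_total
        eval_share = eval_counts.get(content_type, 0) / eval_total if eval_total else 0.0
        if eval_share < min_ratio * prod_share:
            gaps[content_type] = (eval_share, prod_share)
    return gaps


# 50 eval cases, all short articles (from the scenario).
eval_suite = ["short_article"] * 50

# Hypothetical production sample: these proportions are invented for illustration.
production_sample = (
    ["short_article"] * 40
    + ["legal_document"] * 25
    + ["medical_report"] * 20
    + ["forum_post"] * 15
)

gaps = coverage_gaps(eval_suite, production_sample)
for content_type, (eval_share, prod_share) in sorted(gaps.items()):
    print(f"{content_type}: {eval_share:.0%} of eval suite vs {prod_share:.0%} of production")
```

With these made-up numbers, the check flags legal documents, medical reports, and forum posts, each of which has no eval coverage at all. A 95% score on such a suite says nothing about those types.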