Expert advice for testing generative AI applications: A recommended reading list

John Gluck
March 20, 2024

Part of the challenge of testing generative AI applications in this emergent and ever-changing field is simply finding information on exactly how to do it, and then determining whether that information is reliable.

We compiled a list of articles representing the pinnacle of current thinking about testing generative AI applications. We refer to these articles in our day-to-day work, and they align with our own thinking on these subjects. You will find all of this information valuable and relevant to testing your generative AI application. We grouped the articles into sections ordered by when testers are likely to encounter each subject during the SDLC.

We specialize in black-box end-to-end automation testing, and we’ve done our fair share of testing on generative AI apps.  So believe us when we say we recognize the goods when we see them.

Understanding generative AI app structure

If you want to test generative AI applications, you must first understand their structure. You can understand that structure and still test the application as a black box without losing your objectivity. A grasp of the structure gives you the context to concentrate your testing efforts where you’ll find the best bugs.

Deploying Large Language Models in Production: The Anatomy of LLM Applications by Andrew Wilson—This is a solid primer on the structure of apps built around LLMs (including generative AI apps). Understanding what each component does gives you a better idea of the defects you should look for in that part of the app.

Generative AI Design Patterns: A Comprehensive Guide by Vincent Koc—Most teams develop applications around existing models (e.g., GPT-4, PaLM). This article discusses the operational and design patterns commonly used today. It is not an exhaustive list (how could it be?), but it goes into a fair amount of detail about how models get built and run. It’s a good reference for familiarizing yourself with the various approaches teams use to build and operate their models.

Unit testing/white-box testing

There’s more to testing AI than functional black-box testing. It’s important to understand what your developers are doing with their white-box testing, specifically so that you can avoid duplicating coverage and reduce the time and money spent on the overall testing effort.

Implementing cost-effective Test-Driven Development in an LLM application by Fanis Vlachos—This real-life example describes a sensible approach to testing genAI applications. By limiting their testing to component interactions, the team avoided any direct testing of the model. It’s not just an excellent example of how to do component-level testing on an AI-based app; it’s a primer for how to do any component integration testing. Note that the emphasis is on cost efficiency: the real motivation for shifting left.
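To make that idea concrete, here is a minimal sketch (ours, not the article’s) of a component-level test in which the model is replaced by a stub. The SupportBot class, its answer method, and the llm_client dependency are hypothetical names invented for the example; the point is that the assertions exercise the application’s handling of model output rather than the model itself.

```python
from unittest.mock import Mock

# Hypothetical application component: wraps an LLM client and
# post-processes its output before returning it to the caller.
class SupportBot:
    def __init__(self, llm_client):
        self.llm_client = llm_client

    def answer(self, question: str) -> str:
        prompt = f"Answer the customer's question briefly.\nQuestion: {question}"
        raw = self.llm_client.complete(prompt)
        # Application logic we actually want to cover: trimming and fallback.
        text = raw.strip()
        return text if text else "Sorry, I don't know."


def test_answer_trims_model_output():
    # The model is stubbed out, so the test covers only the component's
    # interaction with it -- no model behavior is being tested.
    fake_llm = Mock()
    fake_llm.complete.return_value = "  Reset your password from the login page.  "
    bot = SupportBot(llm_client=fake_llm)

    assert bot.answer("How do I reset my password?") == (
        "Reset your password from the login page."
    )


def test_answer_falls_back_on_empty_output():
    fake_llm = Mock()
    fake_llm.complete.return_value = ""
    bot = SupportBot(llm_client=fake_llm)

    assert bot.answer("How do I reset my password?") == "Sorry, I don't know."
```

Because nothing in these tests depends on what the model actually returns, they run fast and deterministically, which is exactly where the cost savings come from.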

Black-box testing

This is one of our favorite subjects here at QA Wolf. We’ve located a few articles we like, including one we wrote ourselves to fill a gap in the coverage.

Testing Language Models (and Prompts) Like We Test Software by Marco Tulio Ribeiro—If your application uses a third-party model – and, let’s face it, these days most applications do – then you can’t really test the model, and you probably don’t want to. However, you do need to test the prompt template, because it’s the most critical component of your application that your developers control and the one they change most frequently. This article walks you through some realistic examples of how to test your application’s prompt template.
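As a rough illustration of the kind of prompt-template checks the article describes, here is a minimal sketch. The SUMMARY_TEMPLATE string and the build_prompt function are invented for the example; the idea is to pin down the instructions your developers change most often with plain, fast unit tests.

```python
# Hypothetical prompt template under test.
SUMMARY_TEMPLATE = (
    "You are a helpful assistant. Summarize the text below in at most "
    "{max_sentences} sentences. Do not invent facts.\n\nText:\n{document}"
)


def build_prompt(document: str, max_sentences: int = 3) -> str:
    """Render the template with user-supplied content."""
    return SUMMARY_TEMPLATE.format(document=document, max_sentences=max_sentences)


def test_prompt_keeps_the_guardrail_instruction():
    # The guardrail sentence is the part developers tweak most often,
    # so lock it down with an explicit assertion.
    prompt = build_prompt("The quarterly report shows revenue grew 12%.")
    assert "Do not invent facts." in prompt


def test_prompt_interpolates_document_and_limit():
    prompt = build_prompt("Some customer feedback.", max_sentences=5)
    assert "Some customer feedback." in prompt
    assert "at most 5 sentences" in prompt
```

Tests like these never call the model; they catch regressions in the template itself, which is usually where the bugs your developers can actually fix live.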

6 things to know about automated testing of genAI apps by John Gluck—QA Wolf has been automating black-box tests for generative AI apps for quite some time now, and we’ve learned some stuff along the way. This article discusses some of the thorny challenges we’ve tackled while automating tests for these apps and gives guidelines you can apply in your own organization.

Testing LLM-Based Applications: Strategy and Challenges by Xin Chen—This article is about automated end-to-end testing. It uses a real-life scenario to walk through the team’s reasoned approach to validating their application with automated black-box tests. It breaks the process down into testing goals that testers can apply to any generative AI application.

Model testing

Depending on your application’s structure, your team will need to test the model to a different degree, from simple smoke testing to thorough UAT. These articles can help you determine what level of testing is appropriate for your application and offer tips and strategies for implementing it.

ML Model Testing: 4 Teams Share How They Test Their Models by Stephan Oladele—This piece discusses the complexities and nuances of testing machine learning models. It presents a range of testing strategies, both automated and manual, that teams use to validate the robustness and reliability of ML models in real-world scenarios. Read this piece to learn what your developers are doing (and what they aren’t doing but could be) to test the model. Understanding what is possible will help you make sure your developers have done all they can to get the model ready for production.

AI Shift-Left, Test Right by Tariq King—This useful article explains how models can be tested continuously, from initial design through the release process and on past production. It also offers advice on how testers of traditional applications can add value to these efforts with their knack for customer empathy and their analytical skills, and it names specific tools along with examples of how to use them.

LLM Testing: Too Simple? by Jason Arbon—You won’t get far in the AI testing space without running into Jason Arbon; he has established himself as a thought leader. This piece is noteworthy because it provides practical information and points to specific, freely available datasets that anyone who actually needs to test a model can use.

AI testing war stories

This stuff is hard, and there are lots of mistakes to be made. Learn from the experience of others, and don't get discouraged. These stories illustrate the continuing value of the tester's mindset when applied to AI-based projects. And you will learn a thing or two along the way.  

Test Automation for Machine Learning: An Experience Report by Angie Jones—This is a classic testing tale about the author, a giant in the testing community, learning about how this new domain differs from those that came before. She learns by doing and shows us how she automated her application’s model testing while encountering the universal experience of self-doubt in the face of “experts.” 

Diary Testing An AI-Based Product by Wayne Roseberry—The author shares what he learned from testing a generative AI application that itself tests other applications. He works out how the application is structured and then applies his expertise in creative exploration and, above all, persistence. There’s a lot of talk these days about how testing AI differs from testing traditional apps; this piece focuses on what stays the same.

How I Contributed as a Tester to a Machine Learning System: Opportunities, Challenges, and Learnings by Shivani Gaba and Ben Linders—This is a timely piece because it addresses how a tester can add value to ongoing model testing efforts and demonstrates how the tester’s perspective positively affects the outcome. 

Happy reading!
