How to Test an AI
Complex systems often seem hard to understand when it comes to how ‘accurate’ they are at their job. This blog seeks to provide a simple, non-math-heavy guide to a few of the most common ways these systems are measured.
We will break this up into:
Classification Systems: The most common. These systems predict a confidence restricted to between 0 (False) and 1 (True).
Regression Systems: Systems that predict a number on an unbounded scale, as a linear regression does.
Text Generative AI Output: Generative AI systems that produce a final output of text, such as Gemini, LLaMA, HuggingFace models, OpenAI/ChatGPT, and other popular transformer/“large language model” systems.
…Of Classification Systems
The most common systems today are those whose final output is a classification confidence (“is event X true or false?”). Most larger systems, including generative AI systems, are built from these. These systems may be evaluated by viewing their coefficients / feature importances (not available for deep learning systems), by qualitatively viewing their output, or with several quantitative metrics that help determine their relative accuracy.
All of these quantitative metrics use the same data - an input list of probabilities (predictions) and the actual answers represented as 0s and 1s (meaning False and True respectively). Depending on the goal of the task at hand, a different metric may be more useful.
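To make this concrete, here is a tiny illustrative sketch of that shared input format; the variable names (y_prob, y_true) are placeholders of my own choosing, not part of any particular library:

```python
# Illustrative only: the two aligned lists every metric below consumes.
y_prob = [0.92, 0.10, 0.75, 0.33, 0.58]  # model's predicted probability that each event is True
y_true = [1,    0,    1,    0,    1]     # what actually happened (1 = True, 0 = False)
```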
Log Loss (most common): captures how far the predicted probabilities deviate from the actual outcomes, punishing confident wrong predictions most heavily. This is most useful when the accuracy of the probability itself is important, such as in sports betting.
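As a minimal sketch (assuming scikit-learn is installed and using made-up toy lists), log loss is a single call:

```python
# Minimal sketch: log loss with scikit-learn on illustrative toy data.
from sklearn.metrics import log_loss

y_true = [1, 0, 1, 0, 1]
y_prob = [0.92, 0.10, 0.75, 0.33, 0.58]

# Lower is better; confident-but-wrong predictions are punished most heavily.
print(log_loss(y_true, y_prob))
```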
Precision-Recall (PR) Curve: The PR curve assesses a system's ability to detect and prioritize true instances over false instances. Simply put, the larger the area under it, the better, so area under the PR curve is a commonly used metric. The F1 score, also common, is the harmonic mean - a sort of fancy average - of precision and recall at a chosen threshold.
The curve itself is created by sorting output probabilities from largest to smallest, then evaluating them one by one. If you imagine the predictions as a deck of cards, you start with the top card and flip each card over in turn to reveal whether the prediction was actually True or False. Precision is the percentage of cards/predictions flipped so far that turned out to be True. Recall is the number of true positives seen so far as a proportion of all the true positives in the deck. Typically the curve bends downward at the end, toward roughly 50% precision for a balanced test set.
For a poor model with zero predictive power, a balanced test set (half true, half false) will typically have a PR curve area of about 0.5, the baseline usually drawn as a dotted line. Said another way, every guess is a coin flip.
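Here is a minimal sketch of these PR metrics, again assuming scikit-learn and made-up toy data:

```python
# Minimal sketch: PR curve, area under it, and F1 with scikit-learn.
from sklearn.metrics import precision_recall_curve, average_precision_score, f1_score

y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_prob = [0.95, 0.80, 0.72, 0.55, 0.51, 0.30, 0.25, 0.10]

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)  # the curve itself
print(average_precision_score(y_true, y_prob))  # average precision, the usual stand-in for PR curve area

# F1 needs hard True/False predictions, so pick a threshold (0.5 here).
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]
print(f1_score(y_true, y_pred))  # harmonic mean of precision and recall at that threshold
```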
Receiver Operating Characteristic (ROC): Less common, the ROC curve measures how sensitive the model is to false positives by plotting the true positive rate against the false positive rate. A perfect area under the ROC curve is 1.0, while a model with zero predictive power on balanced test data will roughly fall along the dotted diagonal. Note that on heavily unbalanced test data the ROC curve can look deceptively good, and the PR curve is usually the more informative of the two.
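A similar sketch for the ROC curve and its area, under the same assumptions:

```python
# Minimal sketch: ROC curve and ROC AUC with scikit-learn on toy data.
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_prob = [0.95, 0.80, 0.72, 0.55, 0.51, 0.30, 0.25, 0.10]

fpr, tpr, thresholds = roc_curve(y_true, y_prob)  # false positive rate vs. true positive rate
print(roc_auc_score(y_true, y_prob))  # 1.0 is perfect; ~0.5 means no predictive power
```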
Calibration: Not all models generate a true probability; some generate only a confidence. In an ideal world these are the same, but in practice they often are not, as in the case of support vector machine models. A model may output a number as high as 0.9 (90% confidence), for example, yet among all the times 90% shows up, the prediction is actually true only 70% of the time. It is, in effect, overconfident. This may sound bizarre, but it can be corrected if handled properly. We call this “model calibration,” and multiple methods for doing it exist.
A calibrated model will fall along the dotted diagonal, demonstrating that each confidence value results in an equivalent proportion of true positives at that confidence value.
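One possible sketch of checking and correcting calibration with scikit-learn; the SVM model, the synthetic dataset, and the sigmoid (Platt scaling) method are illustrative choices, not the only options:

```python
# Minimal sketch: calibrating an SVM and producing reliability-diagram data.
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data purely for illustration.
X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# SVMs output confidences, not true probabilities; wrapping them calibrates the output.
calibrated = CalibratedClassifierCV(SVC(), method="sigmoid", cv=5).fit(X_train, y_train)
y_prob = calibrated.predict_proba(X_test)[:, 1]

# Each pair compares "what the model said" vs. "how often it was actually true".
prob_true, prob_pred = calibration_curve(y_test, y_prob, n_bins=10)
print(list(zip(prob_pred.round(2), prob_true.round(2))))
```

Plotting prob_pred against prob_true gives the reliability diagram described above; a well-calibrated model hugs the diagonal.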
…Of Regression Systems
For regression, performance is typically measured by comparing the list of expected and actual numbers with a simple mathematical method. It is worth noting that machine learning systems are FAR more adept at bounded-range predictions (e.g. between 0 and 1) than they are at regression tasks, so it is common, where possible, to break a problem up into bounded tasks when generating a final scalar. That said, they are still effective at regression.
Mean Absolute Error (MAE) / Mean Squared Error (MSE): These are all, roughly, the average error calculated in different ways: MAE averages the absolute size of each miss, while MSE averages the squared misses and therefore punishes large errors more heavily.
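Both are one-liners in scikit-learn; a minimal sketch with made-up numbers:

```python
# Minimal sketch: the two most common regression error metrics.
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5,  0.0, 2.0, 8.0]

print(mean_absolute_error(y_true, y_pred))  # average of |error|; less sensitive to outliers
print(mean_squared_error(y_true, y_pred))   # average of error^2; punishes large misses harder
```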
…Of Text Generative AI Systems
Text generative AI systems are frequently benchmarked against pre-built question sets to determine their abstraction and reasoning ability, and they are assessed by the percentage of questions they answer correctly (a minimal scoring sketch follows the list below). In some cases, they may use a portion of the questions as a sort of training set, but this is rare.
Here is an example of the benchmarks a top HuggingFace text AI model is scored against:
ARC (AI2 Reasoning Challenge): Evaluates AI systems' ability to reason, consisting of 7,787 multiple-choice science questions.
HellaSwag: Tests the ability to predict plausible endings for scenarios described in text. It has roughly 70,000 multiple-choice questions.
MMLU (Massive Multitask Language Understanding): 57 tasks and 14,079 test-set questions in multiple-choice format across a broad spectrum of subjects.
TruthfulQA: Designed to evaluate models not just on their ability to generate accurate answers but also on their commitment to truthfulness and their avoidance of misleading or false information. It contains 817 open-ended questions across multiple categories where a plurality of false information is readily available - for example:
ChatGPT 3.0 answered one such question with “It tells us that global warming is a hoax.” The correct answer is “The weather on one day doesn’t tell us about climate trends.”
Winogrande: Tests AI models' reasoning capabilities via fill-in-the-blank English sentences with multiple-choice answers. It has roughly 44,000 questions.
GSM8K (Grade School Math 8K): 8,500 grade-school level math problems.
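As mentioned above, the final score on these benchmarks is simply the percentage of correct answers. Here is a minimal, library-free sketch of that scoring loop; the questions and the stand-in “model” are purely illustrative and not drawn from any real benchmark:

```python
# Illustrative scoring loop: compare the model's chosen answer to the gold answer.
questions = [
    {"prompt": "2 + 2 = ?", "choices": ["3", "4", "5"], "answer": "4"},
    {"prompt": "Water freezes at ___ degrees Celsius.", "choices": ["0", "50", "100"], "answer": "0"},
]

def stand_in_model(prompt, choices):
    # Placeholder for a real LLM call; here it naively picks the first choice.
    return choices[0]

correct = sum(stand_in_model(q["prompt"], q["choices"]) == q["answer"] for q in questions)
print(f"Benchmark score: {100 * correct / len(questions):.1f}%")
```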
Conclusion
The metrics and benchmarks outlined above serve as critical tools in the evaluation of AI systems, highlighting the diversity of approaches to measuring the capabilities and performance of models across a range of tasks. As AI technology continues to evolve, these benchmarks will play a pivotal role in guiding improvements, ensuring models are not only more sophisticated and versatile but also reliable and ethically aligned with expectations for accuracy and truthfulness.