False Positives and the Confusion Matrix

In recent customer conversations about the Respond Analyst, questions arose about our “confusion matrix.” At the root of this is a concern shared by all SOC managers: false positives and their impact across Security Engineering, the SOC, and Incident Response. After some probing, it became clear that the line of questioning carried an implicit assumption about our use of machine learning that needed clarification.

In machine learning, a confusion (or error) matrix is used to judge the performance of a classifier, typically in a supervised learning setting. This works well when you have well-labeled sets of real data to compare against and can build a formal result table. From the sample confusion matrix below, we can quickly see that this particular algorithm is quite good at finding rabbits but not so good at classifying dogs. A confusion matrix can grow quite large as the number of classification outcomes grows.

[Figure: sample multi-class confusion matrix; source: [1]]
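To make that concrete, here is a minimal sketch of how such a matrix is built. The animal labels and counts are invented for illustration, and scikit-learn is simply one convenient way to tabulate the result; this is not a description of any particular product's internals.

```python
from sklearn.metrics import confusion_matrix

# Invented example labels: this classifier finds rabbits reliably
# but frequently mislabels dogs.
actual    = ["rabbit", "rabbit", "rabbit", "rabbit", "rabbit",
             "dog", "dog", "dog", "dog", "dog",
             "cat", "cat", "cat"]
predicted = ["rabbit", "rabbit", "rabbit", "rabbit", "rabbit",
             "dog", "dog", "cat", "cat", "rabbit",
             "cat", "cat", "cat"]

labels = ["cat", "dog", "rabbit"]
cm = confusion_matrix(actual, predicted, labels=labels)

# Rows are the actual classes, columns are the predicted classes.
print(labels)
print(cm)
```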

However, long before machine learning came along, statistical theory had already developed the concepts needed to identify errors in the outcomes of hypothesis testing. Statistical test theory originated in the late 1920s and defined a straightforward binary classification matrix to represent the four possible outcomes of testing a hypothesis. The errors that arise even received special names: a Type I error is a false positive, and a Type II error is a false negative.

[Figure: the 2x2 matrix of hypothesis-testing outcomes, including Type I and Type II errors; source: [2]]
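In the binary case, the same table collapses to four cells, and two of them are exactly the Type I and Type II errors. A minimal sketch, with invented event data:

```python
# 1 = malicious, 0 = benign; both lists are made up for illustration.
actual    = [0, 0, 1, 0, 1, 0, 0, 1]
predicted = [0, 1, 1, 0, 0, 0, 1, 1]

pairs = list(zip(actual, predicted))
tp = sum(a == 1 and p == 1 for a, p in pairs)
fp = sum(a == 0 and p == 1 for a, p in pairs)  # Type I error: false positive
fn = sum(a == 1 and p == 0 for a, p in pairs)  # Type II error: false negative
tn = sum(a == 0 and p == 0 for a, p in pairs)

print(f"TP={tp}  FP (Type I)={fp}  FN (Type II)={fn}  TN={tn}")
```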

The distinction between the two framings is important. The confusion matrix measures the performance of a machine learning classifier. Type I and Type II errors arise from whether a hypothesis is rejected or not, judged against a significance level, e.g., 5%. Statistical testing is a critical tool and continues to be a cornerstone of scientific proof in many areas, including medicine and cybersecurity.
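As a rough illustration of what “rejected at a 5% significance level” means in practice, the sketch below runs a simple binomial test with SciPy. The scenario and counts are invented; it is only meant to show the mechanics of comparing a p-value to a significance threshold.

```python
from scipy import stats

# Invented scenario: out of 100 escalated alerts, 62 turned out to be
# actionable. Test the null hypothesis that the true rate is 50%.
result = stats.binomtest(k=62, n=100, p=0.5, alternative="two-sided")

alpha = 0.05  # the 5% significance level
if result.pvalue < alpha:
    print(f"Reject the null hypothesis (p = {result.pvalue:.3f})")
else:
    print(f"Fail to reject the null hypothesis (p = {result.pvalue:.3f})")
```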

In the SOC, false positives are considered a source of considerable wasted time and energy, and there is a constant effort to minimize them. The accepted wisdom is that all false positives are bad. However, the quest for zero false positives cannot be achieved without sacrificing detection, and it is definitely not good security practice. The root of this lies in the nature of the traffic that security sensors monitor and in two related concepts: the false positive paradox [3] and the base rate fallacy [4].

The false positive paradox occurs when a positive result is more likely to be a false positive than a true positive. This situation arises when looking for an extremely rare event in a huge population of samples, which is exactly the case when we analyze billions of network packets and hundreds of thousands of events looking for rare cyber-attacks. While striving to minimize false positives and their associated cost, there is a practical threshold below which, somewhat counter-intuitively, the overall effectiveness of detection decreases. Security sensors tend to produce what appear to be false positives, and the SOC's reaction is to “tune down” those devices because the human analyst cannot deal with the total volume. In the process, many valuable pieces of data are lost.
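Here is a minimal back-of-the-envelope sketch of the paradox using Bayes' rule. The base rate, detection rate, and false positive rate below are invented purely to show the shape of the problem:

```python
# Invented numbers: 1 in 100,000 monitored events is actually malicious,
# the sensor catches 99% of attacks, and it wrongly flags 1% of benign events.
base_rate = 1 / 100_000       # P(malicious)
sensitivity = 0.99            # P(alert | malicious)
false_positive_rate = 0.01    # P(alert | benign)

p_alert = sensitivity * base_rate + false_positive_rate * (1 - base_rate)
p_malicious_given_alert = sensitivity * base_rate / p_alert

print(f"P(malicious | alert) = {p_malicious_given_alert:.2%}")
# Roughly 0.1%: even a very accurate sensor produces overwhelmingly
# false positives when the event it looks for is this rare.
```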

Related but different is the base rate fallacy. Where the false positive paradox is statistical, the base rate fallacy is a human failing: the tendency to latch onto recent, specific information and give it significantly more weight than the underlying base rate when making decisions or estimating likelihood. There is considerable psychology research into why humans fall into this trap, but it is clear that humans are not disciplined, probabilistic reasoners. Kahneman and Tversky wrote about this back in 1973! [5]

As you have probably guessed, the Respond Analyst is not measured by a confusion matrix, but rather the way a human analyst is measured. For example, for a given set of facts, how accurately is judgment applied to test the hypotheses? Does this set of facts represent something malicious and actionable? There is also no need to tune down the volume of alerts: the Respond Analyst can analyze events at 800X the rate of a human analyst. Nor does it fall prey to the base rate fallacy, because it unfailingly models sound probabilistic judgment using our PGO™ algorithm.

The confusion matrix makes for an interesting talking point, but the more interesting question is how to improve the quality of the incidents escalated to the IR team. By reducing or even eliminating the concern over false positive findings from security devices like NIDS/NIPS, we give SOCs a fighting chance against their very human adversary.

Steve Dyer, CTO, Respond Software

You might like this article:  Probability Theory:  The Space Between One and Zero

 

[1] https://en.wikipedia.org/wiki/Confusion_matrix

[2] https://en.wikipedia.org/wiki/Type_I_and_type_II_errors#Statistical_test_theory

[3] https://en.wikipedia.org/wiki/False_positive_paradox

[4] https://en.wikipedia.org/wiki/Base_rate_fallacy

[5] http://psycnet.apa.org/record/1974-02325-001