Just as a point, not as a critic, but your question should be formulated in a different way: "what statistics should any person know?".
Fact is, unfortunately we all deal with statistics. It's a fact of life. Polls, weather forecast, drug effectiveness, insurances, and of course some parts of computer science. Being able to critically analyze the presented data gives the line between picking the right understanding or being scammed, whatever that means.
Said that, I think the following points are important to understand
- mean, median, standard deviation of a sample, and the difference between sample and population (this is very important)
- the distributions, and why the gaussian distribution is so important (the central limit theorem)
- What it is meant with Null Hypothesis testing.
- What is variable transformation, correlation, regression, multivariate analysis.
- What is bayesian statistics.
- Plotting methods.
All these points are critical not only to you as a computer scientist, but also as a human being. I will give you some examples.
- The evaluation of the null hypothesis is critical for testing of the effectiveness of a method. For example, if a drug works, or if a fix to your hardware had a concrete result or it's just a matter of chance. Say you want to improve the speed of a machine, and change the hard drive. Does this change matters? you could do sampling of performance with the old and new hard disk, and check for differences. Even if you find that the average with the new disk is lower, that does not mean the hard disk has an effect at all. Here enters Null hypothesis testing, and it will give you a confidence interval, not a definitive answer, like : there's a 90 % probability that changing the hard drive has a concrete effect on the performance of your machine.
- Correlation is important to find out if two entities "change alike". As the internet mantra "correlation is not causation" teaches, it should be taken with care. The fact that two random variables show correlation does not mean that one causes the other, nor that they are related by a third variable (which you are not measuring). They could just behave in the same way. Look for pirates and global warming to understand the point. A correlation reports a possible signal, it does not report a finding.
- Bayesian. We all know the spam filter. but there's more. Suppose you go to a medical checkup and the result tells you have cancer (I seriously hope not, but it's to illustrate a point). Fact is: most of the people at this point would think "I have cancer". That's not true. A positive testing for cancer moves your probability of having cancer from the baseline for the population (say, 8 per thousands people have cancer, picked out of thin air number) to a higher value, which is not 100 %. How high is this number depends on the accuracy of the test. If the test is lousy, you could just be a false positive. The more accurate the method, the higher is the skew, but still not 100 %. Of course, if multiple independent tests all confirm that you have cancer, then it's very probable you actually have it, but still it's not 100 %. maybe it's 99.999 %. This is a point many people don't understand about bayesian statistics.
- Plotting methods. That's another thing that is always left unattended. Analysis of data does not mean anything if you cannot convey effectively what they mean via a simple plot. Depending on what information you want to put into focus, or the kind of data you have, you will prefer a xy plot, a histogram, a violin plot, or a pie chart.