Can you draw a dinosaur with 142 points? (The answer is yes, by the way.) On a graph, draw a dinosaur with 142 points such that:
- The horizontal mean is 54.26 and the vertical mean is 47.83
- The horizontal standard deviation is 16.76 and the vertical standard deviation is 26.93
- The Pearson correlation coefficient is -0.06
Obviously, this would be ridiculous to do by hand. How would you even place 142 points under these constraints? And what even is a Pearson correlation coefficient?
The PCC (short for Pearson correlation coefficient, not precipitated calcium carbonate) is, in short, the ratio of the covariance of two variables to the product of their standard deviations.
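Written as a formula, that sentence looks roughly like this. The symbols are my own choice of notation rather than anything fixed by this post: r is the PCC, cov(x, y) is the covariance of the horizontal and vertical values, and the two sigmas are their standard deviations.

```latex
r_{xy} = \frac{\operatorname{cov}(x, y)}{\sigma_x \, \sigma_y}
```

Dividing by the standard deviations is what keeps the result between -1 and 1, whatever units the two variables are measured in.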
Remember standard deviations? The square root of the average of the squares of the distances between the points and the mean? If not, then here you go. If it’s still confusing, here’s an excellent example on Wikipedia.
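If code is easier to read than a sentence, here is a minimal Python sketch of that definition. The function name `population_std` and the sample values are made up for illustration, and it divides by the number of points exactly as the sentence above says (the "sample" version of the standard deviation divides by one less than that).

```python
import math

def population_std(values):
    """Square root of the average squared distance from the mean."""
    mean = sum(values) / len(values)
    squared_distances = [(v - mean) ** 2 for v in values]
    return math.sqrt(sum(squared_distances) / len(values))

# Illustrative values only, chosen so the arithmetic is easy to check by hand:
# the mean is 5 and the squared distances add up to 32, so 32 / 8 = 4.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_std(sample))  # 2.0
```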
Covariance is a bit trickier. Think of it as a rawer, harder-to-read version of correlation: the same idea, but not yet scaled by the standard deviations. The sign (positive/negative) of the covariance tells you how the values correspond. If it is positive, the two variables tend to behave similarly: when one increases, so does the other, and vice versa. If it is negative, it is the opposite: when one variable increases, the other tends to decrease, and vice versa.
If you want the formula for covariance, it’s on this Wikipedia page. Even after reading it, I can’t fully understand it, mainly because it is written in terms of expected values.
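The sign behaviour is easier to see with numbers. Below is a small Python sketch; the `covariance` function and the three tiny lists are invented for this example, and it uses the "divide by the number of points" form of covariance.

```python
def covariance(xs, ys):
    """Average of the products of each pair's distances from the two means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

xs = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]   # goes up whenever xs goes up
falling = [10, 8, 6, 4, 2]  # goes down whenever xs goes up

print(covariance(xs, rising))   # 4.0
print(covariance(xs, falling))  # -4.0
```

Each product is positive when both values sit on the same side of their means and negative when they sit on opposite sides, which is why the sign of the total tells you whether the two variables move together or apart.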
Now that we’ve cleared some of the statistics terms, let’s go back to our 142-point dinosaur.
In the hidden land of Canada in 2017, two researchers created twelve sets of data that had basically the same statistics as the dinosaur, whose name is the Datasaurus. Together they are known as the Datasaurus Dozen: all twelve sets share the same horizontal and vertical means, the same standard deviations, and the same Pearson correlation. Yet they look completely different. How could this be?
This is the truly deceptive aspect of statistics. A handful of summary values, such as the means or standard deviations, tells you very little about what a graph looks like: completely different graphs can share all of them. Across all thirteen graphs, each of the five values differs by less than a hundredth, so once they are cut off at two decimal places, the graphs can be said to have equal values.
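You can see a (much less dramatic) version of this yourself. The Python sketch below is my own toy example, not the method the researchers used and not the Datasaurus data: it takes five made-up points, spins them half a turn around their centre, and compares the five statistics rounded to two decimal places. The `summary` helper is a name I made up.

```python
import math

def summary(points):
    """Return (mean x, mean y, std x, std y, PCC), rounded to two decimals."""
    n = len(points)
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points) / n
    pcc = cov / (std_x * std_y)
    return tuple(round(v, 2) for v in (mean_x, mean_y, std_x, std_y, pcc))

# Made-up points for illustration.
original = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 4)]

# Spin every point half a turn around the centre of the cloud.
centre_x = sum(x for x, _ in original) / len(original)
centre_y = sum(y for _, y in original) / len(original)
flipped = [(2 * centre_x - x, 2 * centre_y - y) for x, y in original]

print(summary(original))  # (3.0, 3.0, 1.41, 1.41, 0.6)
print(summary(flipped))   # (3.0, 3.0, 1.41, 1.41, 0.6)
print(sorted(original) == sorted(flipped))  # False
```

The two sets of points are different, so the plots are different, yet the five numbers cannot tell them apart.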
If you want a page on this with images, the link is here.