More than Just the Mean

From Stepping Up


Describing a set of numbers may seem easy: you can just take the mean. But a set of numbers holds more information than its mean alone. Take these two sets of numbers:

{0, 1, -1}

and

{0, 1000, -1000}.

Both have the same mean, but they're very different sets of numbers. How do you distinguish between the two? Read on!

Central tendency

Let's say we have a set of values all associated with a given quantity. A first goal of statistics is to be able to represent an "average" value for this quantity. An average summarizes a data set in a single value and provides information about the magnitude, the sign and the units of measured values.

Comparing multiple samples requires a tool that makes the comparison both easy and quick; central tendency does just that! Statistics provides you with various descriptors for the average: median, mode and mean, among others. However, you must be careful, as they represent your data in very different ways.


If you sort the set in increasing order and give each value an index, the median and the mode are easy to find.

  • The median is the value whose index lies in the middle of the sorted set, that is, half the values are below it and half are above it. If the number of values n is odd, the median is the middle value. Counting in from either end, we find this value in position (n+1)/2. When n is even, there are two middle values. In this case, the median is the average of the two values in positions n/2 and (n/2)+1.
  • The mode is simply the most frequent value(s). For example, the mode of {1,1,3,3,3,4,7} is 3, while the modes of {1,2,2,4,4,5} are 2 and 4.
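The position rules above can be sketched in a few lines of Python. This is only an illustration: the function names `median` and `modes` are chosen here and do not come from the guide.

```python
from collections import Counter

def median(values):
    """Median by the position rule: the value at (n+1)/2 for odd n,
    the average of the values at n/2 and (n/2)+1 for even n."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty set is undefined")
    mid = n // 2  # 0-based index of 1-based position (n/2)+1
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def modes(values):
    """All values that share the highest frequency, in increasing order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)

print(median([1, 1, 3, 3, 3, 4, 7]))  # 3
print(median([1, 2, 2, 4, 4, 5]))     # 3.0
print(modes([1, 1, 3, 3, 3, 4, 7]))   # [3]
print(modes([1, 2, 2, 4, 4, 5]))      # [2, 4]
```

Returning a list from `modes` keeps the single-mode and multi-mode cases uniform.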

Probably the most commonly used central tendency measure is the arithmetic mean:

  • The mean is calculated by summing all values and dividing by the total number of values, n. Formally, for a set of numbers X={x1,x2,...,xn} the mean is x̄ = (1/n)*SUM(xi), where the sum runs over i = 1, ..., n.
Important Information
A common mistake is to confuse the terms mean and average; while the term mean mainly refers to the arithmetic mean (see above), average encompasses all measures of central tendency.
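As a small illustration of the arithmetic mean, the sketch below applies the sum-divided-by-n rule to the two sets from the introduction. The function name `mean` and the variable names are chosen here for the example.

```python
def mean(values):
    """Arithmetic mean: the sum of all values divided by their count n."""
    values = list(values)
    if not values:
        raise ValueError("the mean of an empty set is undefined")
    return sum(values) / len(values)

small_spread = [0, 1, -1]
large_spread = [0, 1000, -1000]

# Both sets have the same mean even though their values differ greatly.
print(mean(small_spread), mean(large_spread))  # 0.0 0.0
```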


Other more specialized tools might be of interest for your project:

  • The midrange is the mean of the maximum and the minimum values. It only summarizes the two extreme values in a set and can be biased if those are outliers.
  • The weighted mean is a mean computed so as to give more importance to some data than to others. This can be useful if you have a good reason to believe that some data are more reliable than others, for instance because their uncertainty or dispersion is lower.
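Both specialized measures can be sketched briefly. The weights below are hypothetical numbers picked for illustration; how you derive weights from reliability is a choice for your own project.

```python
def midrange(values):
    """Mean of the minimum and the maximum value."""
    return (min(values) + max(values)) / 2

def weighted_mean(values, weights):
    """Sum of weight * value, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("the weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight

print(midrange([1, 2, 2, 4, 4, 5]))  # 3.0

# Hypothetical case: the first measurement is trusted three times as much.
print(weighted_mean([10.0, 20.0], [3, 1]))  # 12.5
```

With equal weights, `weighted_mean` reduces to the ordinary arithmetic mean.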


Data dispersion

Two experiments are carried out. For one experiment, the values are all similar; for the other, the values are very different. A question that statistics can answer is: how different are the values from one another?

The fact that values are not all the same, and the extent to which they differ, is called dispersion.


An important question, which can help you with your research, is: why is it that values are different, for the same experiment?

Common sources of dispersion are:

  • Observations are drawn from a sample that is heterogeneous,
  • Variation due to the random nature of the variable being observed,
  • Errors and uncertainties in the measurements.


Consequently, data dispersion provides two types of information:

  • Information about the variable(s) being observed, and
  • Information about the quality of the methodology.

It is often difficult to separate the two, so you have to use your judgement to decide on the probable cause of dispersion. Good projects try to find solid grounds to justify this kind of decision.


Let's have a look at two measures that describe the dispersion of a data set: the range and the standard deviation.

  • The observed range is the difference between the maximum and the minimum observed (experimental) values. Since it relies on only the two extreme values, it conveys little information. However, one can make arguments about the extent of dispersion and the precision of a method by comparing it to the potential range, that is, the difference between the largest and smallest values that could possibly be observed.
  • The standard deviation takes into account how far each value is from the mean. Therefore it is very useful in thinking about how widely spread the values in a data set are. If many data points are close to the mean, then the standard deviation is small; if many data points are far from the mean, then the standard deviation is large. If all the data values are equal, then the standard deviation is zero. How do we calculate standard deviation?
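Before turning to the standard deviation, here is a minimal sketch of the observed range from the first bullet, applied to the two sets from the introduction. The potential range of 2000 is a hypothetical figure that assumes an instrument able to read from -1000 to 1000.

```python
def observed_range(values):
    """Difference between the largest and the smallest observed value."""
    return max(values) - min(values)

small_spread = [0, 1, -1]
large_spread = [0, 1000, -1000]

print(observed_range(small_spread))  # 2
print(observed_range(large_spread))  # 2000

# Hypothetical potential range: an instrument reading from -1000 to 1000.
potential_range = 1000 - (-1000)
print(observed_range(small_spread) / potential_range)  # 0.001
```

The range separates the two sets even though their means are identical.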


One way to think about spread is to examine how far each data value is from the mean. This difference is called a deviation. We first square the deviations and then average them. We square them so that positive and negative deviations do not cancel each other out. Squaring also emphasizes larger differences. When we add up these squared deviations and find their average, we call the result the variance.

s^2 = (1/n)*SUM((xi - x̄)^2), where x̄ is the arithmetic mean and the sum runs over i = 1, ..., n.

The variance is not the ideal measure of spread because it is expressed in squared units. So we take the square root of the variance, s^2. The result, s, is the standard deviation. Formally, the standard deviation is the root mean square (RMS) deviation of the values from their arithmetic mean. Altogether, then, the standard deviation of the data is found by the following formula:

s = sqrt((1/n)*SUM((xi - x̄)^2)), where x̄ is the arithmetic mean and the sum runs over i = 1, ..., n.
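The calculation described above can be sketched as follows. This follows the text in averaging the squared deviations, that is, dividing by n. Some tools divide by n - 1 instead when the data are a sample from a larger population, which is not what is shown here. The function names are chosen for this example.

```python
import math

def variance(values):
    """Average of the squared deviations from the mean (divides by n)."""
    n = len(values)
    if n == 0:
        raise ValueError("the variance of an empty set is undefined")
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / n

def standard_deviation(values):
    """Square root of the variance, in the same units as the data."""
    return math.sqrt(variance(values))

print(standard_deviation([0, 1, -1]))        # about 0.816
print(standard_deviation([0, 1000, -1000]))  # about 816.5
print(standard_deviation([5, 5, 5]))         # 0.0
```

Python's standard library offers `statistics.pvariance` and `statistics.pstdev` for the same divide-by-n calculation.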


Skewness

It can sometimes happen that the data is not distributed symmetrically on both sides of the median. The distribution of numbers is then said to be skewed, or asymmetric.


Skewness can be useful in analyzing data that is not distributed normally. Many statistical models simply assume that data is distributed normally about the mean. A normal distribution has a skewness of zero, so a skewness value clearly different from zero indicates that this assumption does not hold.

Skewness deals primarily with the tails of a distribution, as seen in a bar chart of the values. Sometimes the values on either side of the bar chart taper off in a manner that differs from a regular bell curve (for example, one side tapers steeply compared to the other). It is for this reason that measuring skewness can lend more credit and meaning to your analysis.

There are two types of skewness: negative skewness and positive skewness. Negative skewness refers to a tail that is longer towards the left. The majority of values are concentrated to the right in this case. A good example would be a bar chart with values {1,2,500,700,800,1100,1400}. It is very easy to see that this distribution has very few low values. Here the mean (about 643) is lower than the median (700). Whether this type of distribution is intended or not, it is important to note that in a negatively skewed distribution the mean is typically lower than the median, which is in turn typically lower than the mode. In a regular bell curve, all three quantities are equal.

Positive skewness is the exact opposite of negative skewness. Most values are low, and concentrated to the left of the distribution. The mean will be greater than the median, which will be greater than the mode.
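The comparison of mean and median can be turned into a number. The sketch below uses Pearson's median skewness coefficient, 3 * (mean - median) / standard deviation. This is only one of several skewness measures, and it is an assumption here, not necessarily the formula given on the linked page. The second data set is a hypothetical example made up for illustration.

```python
import math

def pearson_median_skewness(values):
    """3 * (mean - median) / standard deviation (dividing by n).
    Negative when the mean lies below the median, positive when above."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("skewness of an empty set is undefined")
    mean = sum(ordered) / n
    mid = n // 2
    if n % 2 == 1:
        median = ordered[mid]
    else:
        median = (ordered[mid - 1] + ordered[mid]) / 2
    sd = math.sqrt(sum((x - mean) ** 2 for x in ordered) / n)
    if sd == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return 3 * (mean - median) / sd

left_tailed = [1, 2, 500, 700, 800, 1100, 1400]  # example from the text
right_tailed = [1, 2, 2, 3, 3, 4, 50]            # hypothetical example

print(pearson_median_skewness(left_tailed) < 0)   # True
print(pearson_median_skewness(right_tailed) > 0)  # True
```

Different skewness measures can disagree on small data sets, so it is worth stating which one you used.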

Here is a link with the formulae regarding calculating skewness. Skewness Formulae


This article was written by:

Aaron Hakim, Jean-Philippe Demers and Arif Ali Awan
