Geographical Statistics
Measures of Central Tendency
These are not specific statistical tests but achieve a similar objective in that they summarise a data set into one figure that can be compared with other data sets very easily.
Statistical Tests
Statistical tests are simply a way of summarising a large set of data into a single figure. This is a lot easier to deal with, and the results can be compared with results from other data sets.
There are a range of different tests that show a range of different things about the data in question.
Statistical significance
The result of any statistical test must be tested for its significance; most results do not mean anything until they have been tested for significance. A null hypothesis cannot be rejected if the significance of the result is more than 0.05, i.e. if the result falls below the 95% confidence level.
----------------------------------------------------------------------------------------------------------------------------------
Measures of Central Tendency
These are a way of summarising large data sets into one single value, which summarises the data set. They are not specific statistical tests but achieve a similar objective, allowing easy comparison between data sets.
· Mean
To get the mean of a data set you add up all the results and divide by the number of results that you have.
This result is the average of the data set, around which all the other results vary; it is represented by the symbol x̄ (X-bar).
This is a useful measurement as it is quick and easy to calculate, and the result is usually a simple figure. The means of two data sets can be easily compared, showing whether the results from the two sets are of similar value or widely different. It takes all the data into consideration, and the result is suitable for further mathematical processing to provide more information about the data set.
Although the mean is very useful, and one of the most widely used statistics, it is easily distorted by extreme values; this makes it less reliable and allows it to be used to misrepresent data, since one large value can make the mean appear much larger than it should be.
· Median
This is the middle number of a data set. It is found by ranking the data in order; for small data sets the middle number will be obvious, but for larger data sets the formula (N + 1) / 2 can be applied, where N is the number of results in the data set. This formula gives the position of the middle number, not its value: find this position in the ranked data and that is the median. If there is an even number of results in the data set there will appear to be two median values; to overcome this, take the average of the two.
The median is a useful figure for comparing the central values of two data sets to see if they are similar or widely different, and it is suitable for further mathematical processing to test whether similarities or differences are statistically significant. The median is also more robust than the mean because it is not affected by extreme values; however, it shows nothing of the spread of the data, being only one figure from the set.
· Mode
The mode is simply the value that occurs most often in a data set (Mode = Most Often). It is therefore very quick and easy to calculate, and remains unaffected by extreme values. Any data set with two modes is said to be bi-modal, with three it is tri-modal, etc., but comparing data sets with more than three or four modal values would not be very useful.
The mode is not suitable for further mathematical processing and has limited value with numerical data sets. However it is very useful for analysing categories and questionnaire results, from which tally charts have been constructed; it would therefore be more applicable to human based geographical studies.
The measures of central tendency are all useful in their own right, in different ways for different data sets. The mode is a measure unique to certain types of data, and the median and mean are most informative when used together. These values are also made more useful through further statistical tests. All three are shown in the short sketch below.
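A minimal Python sketch of the three measures (not from the original post; the rainfall figures are invented for illustration):

```python
# Minimal sketch of the three averages; the rainfall values are invented.
from statistics import mean, median, multimode

rainfall = [389, 786, 990, 485, 531, 485]

print(mean(rainfall))       # sum of the results / number of results = 611
print(median(rainfall))     # middle value of the ranked data; with an even
                            # count, the average of the two middle values = 508
print(multimode(rainfall))  # most frequently occurring value(s) = [485]
                            # (a list, since a data set can be bi-modal etc.)
```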
----------------------------------------------------------------------------------------------------------
Normal Distribution
A normal distribution is represented by a bell-shaped curve, which appears when a histogram of the results is plotted and the general trend of the results is shown.
A normal distribution occurs when the mean, median and mode are all the same. It means that the mean value is in the centre of the data set, and therefore representative of the data, not skewed by extreme values. It also means that the mean is the most frequently occurring value. Most values in the data are clustered around the mean, with a decreasing number of extreme values either side.
The steeper the curve the more clustered around the mean value the data is, and vice versa. The larger the data set the more likely it is to show a normal distribution, which is why we take a large number of samples when conducting fieldwork.
When the data doesn't conform to a normal distribution it is said to be "skewed", meaning that extreme values are included in the data set and the mean is less reliable.
(Note that extreme values are more likely to be anomalous, so make a data set less accurate!)
· Negatively skewed data is where the mode and median lie to the right of the mean (it is skewed by extremely small results)
· Positively skewed data is where the mode and median lie to the left of the mean (it is skewed by extremely large results)
Normal distribution and skewness are both important as they determine which type of test can be applied to a data set (a rough way of checking is sketched after this list):
· If the data set shows normal distribution then parametric tests should be applied!
· If the data is skewed then non-parametric tests should be applied!
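As a rough illustration of this choice, a mean that sits well away from the median hints at skew. This is only a rule of thumb sketch (the 0.25 threshold is an invented illustration, not a standard figure); in practice a formal normality test such as Shapiro-Wilk would be used:

```python
# Rough illustrative check: a mean far from the median suggests skew.
# The 0.25 threshold is an arbitrary rule of thumb, not a formal test.
from statistics import mean, median, stdev

def rough_test_choice(data):
    if abs(mean(data) - median(data)) > 0.25 * stdev(data):
        return "skewed -> use a non-parametric test"
    return "roughly symmetric -> a parametric test may be suitable"

print(rough_test_choice([5, 6, 7, 8, 9]))      # symmetric data
print(rough_test_choice([5, 6, 7, 8, 9, 60]))  # extreme value drags the mean up
```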
-----------------------------------------------------------------------------------------------------------
Standard Deviation
Standard deviation is a test that measures dispersion in a data set and the reliability of the mean value, as it measures the spread of data around the mean. It is used for measurements of variability/diversity in a data set.
A normal distribution means that most of the values in a data set are close to the average value and few results tend to one extreme; the mean value is representative of the data set and therefore reliable. If data is very clustered around the mean, the bell-shaped curve of a normal distribution will be steep and the standard deviation value will be small. If data has a large spread with more extreme values, the bell-shaped curve will be flatter and the standard deviation value large.
To calculate standard deviation it is best to put data into a table:
| Year | Rainfall (X) | X - mean | (X - mean)² |
|------|--------------|----------|-------------|
| 1    | 389          | -321.17  | 103150.17   |
| 2    | 786          | 75.83    | 5750.19     |
| 3    | 990          | 279.83   | 78304.83    |
| 4    | 1195         | 484.83   | 235060.13   |
| 5    | 485          | -225.17  | 50701.53    |
| 6    | 1293         | 582.83   | 339690.81   |
| 7    | 531          | -179.17  | 32101.89    |
| 8    | 372          | -338.17  | 114358.95   |
| 9    | 421          | -289.17  | 83619.29    |
| 10   | 983          | 272.83   | 74436.21    |
| 11   | 384          | -326.17  | 106386.87   |
| 12   | 693          | -17.17   | 294.81      |
| ∑    | 8522         |          | 1223855.68  |

Mean (x̄) = 710.17
1. Add up all the numbers in the X column, then divide by the number of results to find the mean (x̄).
2. Once the mean is calculated take it away from each value individually
3. Square the result to remove any negative numbers
4. Then add up all the results in the final column (the ∑ symbol means "sum of")
5. Once this has been done you need to use the following equation: standard deviation = √( ∑(X - mean)² / n )
6. You already have the result for the top line (the ∑(X - mean)² total); simply divide it by the number of results (n) and square-root the answer. The result for this data set is 319.36. (A short sketch reproducing these steps follows this section.)
Interpreting the value can be slightly hard. For normally distributed data, one standard deviation either side of the mean accounts for 68% of the results; in this example that means 68% of the results lie within 319.36 of the mean, i.e. between 390.81 and 1029.53. Two standard deviations either side of the mean account for 95% of the data, and three standard deviations for 99.7%.
We can see that this figure is quite large, and that there is a large range in the values of data at even one standard deviation. Therefore there is a large spread of data in the set, and the mean is not very representative.
Standard deviation is very useful for comparing data sets that have the same or similar means, as two data sets can have the same mean but very different standard deviation values, showing a very different spread of results. Standard deviation takes into account all the actual results of a data set and provides more information than the mean alone; however, because the test is based on the mean it is affected by extreme values in a data set.
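A minimal Python sketch of steps 1-6, using the figures from the rainfall table above:

```python
# Steps 1-6 from the table above; divides by n (the population standard
# deviation), matching the worked example.
from math import sqrt

rainfall = [389, 786, 990, 1195, 485, 1293, 531, 372, 421, 983, 384, 693]

n = len(rainfall)
x_bar = sum(rainfall) / n                       # step 1: mean = 710.17
squared = [(x - x_bar) ** 2 for x in rainfall]  # steps 2-3: (X - mean)^2
total = sum(squared)                            # step 4: ~1223855.67
sd = sqrt(total / n)                            # step 6: 319.36

print(round(x_bar, 2), round(sd, 2))
```

Note that dividing by n - 1 instead of n gives the sample standard deviation, which some calculators and spreadsheets use by default.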
-----------------------------------------------------------------------------------------------------------
Measuring Dispersion
The dispersion of a data set is simply the spread of data within the results; these measures provide information about the spread of data around the mean and median values, and are therefore used in conjunction with measures of central tendency.
· Range
The range is simply the difference between the lowest and highest figures in a data set. It is therefore a crude indication of the spread of the data: very quick to calculate, but easily skewed by extreme values/anomalies.
· Interquartile Range
This removes the top and bottom quarters of the results and shows the dispersion of the central 50% of results, it therefore removes, and remains unaffected by, extreme values. It is best used with the median, and a higher interquartile range means that there is a greater spread of results about the median and vice versa. Although this result is not skewed by extreme values, and therefore often more representative of the dispersion of a data set than the range, it doesn't take all values into account.
The interquartile range is calculated in a similar way to the median. The data must first be ranked and then the upper and lower quartiles calculated; where the median is the middle (half-way) value, the quartiles are the quarter values. As with the median, the formulas give the position of each quartile, not the actual value.
If data is ranked lowest to highest the formulas are as follows: (N is the number of values)
LQ = (N + 1) / 4
UQ = 3 × (N + 1) / 4
If data is ranked highest to lowest the formulas are as follows:
LQ = 3 × (N + 1) / 4
UQ = (N + 1) / 4
Notice the formulas are just the opposite way round; it doesn't really matter which way the data is ranked as long as you remember that the upper quartile value should be bigger than the lower quartile value. The upper quartile is the value 75% through the data, the median is 50% and the lower quartile 25%.
The interquartile range (IQR) is the difference between UQ and LQ; a short sketch of the calculation follows this list.
· Standard deviation
This is also a measure of dispersion; it measures the spread of data about the mean rather than the median. It takes into account all results, but is affected by extreme values and anomalies.
See the post on Standard Deviation for further information.
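A sketch of the range and the quartile-position formulas above (the data values are invented, and interpolating fractional positions is one common convention among several):

```python
# Range and interquartile range via the (N + 1) / 4 position formulas.
def value_at(ranked, position):
    """Value at a 1-based rank position, interpolating if fractional."""
    i = int(position) - 1
    frac = position - int(position)
    if frac == 0:
        return ranked[i]
    return ranked[i] + frac * (ranked[i + 1] - ranked[i])

data = sorted([45, 61, 85, 93, 126, 148, 179])  # ranked lowest to highest
n = len(data)

print(data[-1] - data[0])             # range: 179 - 45 = 134
lq = value_at(data, (n + 1) / 4)      # position 2 -> 61
uq = value_at(data, 3 * (n + 1) / 4)  # position 6 -> 148
print(uq - lq)                        # IQR: 148 - 61 = 87
```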
-----------------------------------------------------------------------------------------------------------
Correlation Coefficients
Correlation
When two things vary together there is said to be a correlation between them; correlations are often shown by a line of best fit on a scatter graph.
A positive correlation occurs when an increase in one variable results in an increase in the other variable; on a graph it is a diagonal line running SW to NE across the page.
A negative correlation occurs when an increase in one variable results in a decrease in the other variable; on a graph it is a diagonal line running NW to SE across the page.
No correlation is present when a line of best fit cannot be drawn on the graph.
Note that a correlation does not mean a causal relationship between the two variables, i.e. that an increase in one causes an increase in the other; it simply suggests a link between the two that can be investigated further.
Spearman's Rank Correlation Coefficient (Rs)
This is a statistical test used to analyse the correlation between two variables; the Spearman's rank value (Rs) measures how strong a correlation is and in which direction it runs.
This is a non-parametric test, so it can be used with data that is not normally distributed, but it can only analyse variables with a monotonic relationship, i.e. one that consistently increases or decreases (which can be checked with a scatter graph). The data must be ordinal, i.e. it can be ranked in order.
The data must first be put into a table:
| Site | Discharge (Variable 1) | Rank | Velocity (Variable 2) | Rank | d  | d² |
|------|------------------------|------|-----------------------|------|----|----|
| 1    | 0                      | 5    | 0.05                  | 5    | 0  | 0  |
| 2    | 100                    | 4    | 0.19                  | 3    | 1  | 1  |
| 3    | 200                    | 3    | 0.18                  | 4    | -1 | 1  |
| 4    | 300                    | 2    | 0.34                  | 1    | 1  | 1  |
| 5    | 400                    | 1    | 0.28                  | 2    | -1 | 1  |
| ∑    |                        |      |                       |      |    | 4  |
|
1. The results of the two variables are ranked highest to lowest separately (it doesn't actually matter which way you rank them, as long as you rank both variables the same way).
2. "d" is then calculated by taking the the second rank away from the first rank at each site
3. The value for d is squared to remove any negative values, and the sum of this column is calculated
4. The equation is then applied: Rs = 1 - (6 × ∑d²) / (n(n² - 1))
5. You already have the result for the top line (∑d²). On the bottom line, "n" is the number of pairs of data, i.e. 5. Then don't forget to take your result away from 1. For this data set Rs = 0.8.
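A short Python sketch reproducing the worked example (there are no ties in this data, so a simple ranking suffices):

```python
# Spearman's rank for the table above; ranks are highest-to-lowest and
# this simple ranking assumes no tied values (true for this data set).
discharge = [0, 100, 200, 300, 400]
velocity = [0.05, 0.19, 0.18, 0.34, 0.28]

def ranks(values):
    ordered = sorted(values, reverse=True)  # largest value gets rank 1
    return [ordered.index(v) + 1 for v in values]

d_sq = sum((a - b) ** 2 for a, b in zip(ranks(discharge), ranks(velocity)))
n = len(discharge)
rs = 1 - (6 * d_sq) / (n * (n ** 2 - 1))    # 1 - (6 x 4) / (5 x 24)
print(rs)  # 0.8
```

In practice scipy.stats.spearmanr(discharge, velocity) gives the same coefficient along with a p-value for the significance step below.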
Interpreting the value is relatively easy. Your Rs value should always be between -1 and +1; if it isn't, you have done something wrong. A value of -1 means a perfect negative correlation, a value of +1 means a perfect positive correlation, and a value of 0 means no correlation at all. The closer to +/-1 your value is, the stronger the correlation in your data set.
Significance
The result cannot, however, be used as evidence, or to reject a null hypothesis, unless it is statistically significant. Your Rs value must be checked against a Spearman's rank significance table using the degrees of freedom for your data (usually taken as the number of pairs of data minus 2, although some tables are indexed by the number of pairs directly). The higher your degrees of freedom, i.e. the more results you have, the more likely your result is to be significant.
The result of 0.8 suggests a fairly strong positive correlation, but because of the small amount of data collected it is not significant, so it cannot be used as evidence.
Pearson's Product Moment Correlation Coefficient
This is a more accurate test for correlation than Spearman's rank because it uses the actual values in the data set, rather than relative ranks.
This is a parametric test, so it can only be used for data that shows a normal distribution, and it has a much more complicated equation than Spearman's rank.
(I am unsure exactly how to work this value out; but for exams you are likely just to need to know the pros and cons as listed above.)
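The post doesn't give the equation, but for reference the standard Pearson formula divides the covariance of the two variables by the product of their standard deviations. A minimal sketch on invented data:

```python
# Pearson's r = covariance(x, y) / (sd(x) * sd(y)); data values invented.
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(round(pearson([1, 2, 3, 4, 5], [2, 4, 5, 4, 6]), 2))  # 0.85
```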
-----------------------------------------------------------------------------------------------------------
Mann-Whitney U test
This is a test that compares the medians of two data sets, to see if there is a significant difference between them. It shows whether one of the samples tends to have larger values than the other, and therefore whether there is a real difference between the data sets or whether any perceived difference is simply due to chance. Two data sets can have similar means but very different values within them, so this test highlights differences within the data sets; it can also be used to show that two independent samples do in fact have the same distribution.
This is a non-parametric test, meaning the data doesn't have to be normally distributed; the parametric alternative is Student's t-test, which is a much more complicated analysis. The U test uses the relative ranks of the data, so the data must be ordinal.
Before the test is completed a null hypothesis must be established, this is always the same and states:
There is no significant difference between the two data sets.
This is the hypothesis that you aim to disprove (or prove, if you want to show that the samples do indeed show the same distribution/come from the same population).
Completing the test:
For example, suppose you wanted to see whether there was a difference in traffic flow before and after a supermarket was built. You could collect data before construction and after construction, then use the Mann-Whitney U test to see whether the difference in traffic flow is significant or not.
The data first needs to be put into a table, with the two data sets being labelled A and B (you can use any letters you like as long as they are constant throughout the analysis).
| Traffic flow before construction (A) | Rank (ra) | Site number | Traffic flow after construction (B) | Rank (rb) |
|---|---|---|---|---|
| 126 | 11   | 1  | 194 | 2  |
| 148 | 7    | 2  | 128 | 10 |
| 85  | 15.5 | 3  | 69  | 18 |
| 61  | 19   | 4  | 135 | 9  |
| 179 | 4    | 5  | 171 | 5  |
| 93  | 12.5 | 6  | 149 | 6  |
| 45  | 20   | 7  | 89  | 14 |
| 189 | 3    | 8  | 248 | 1  |
| 85  | 15.5 | 9  | 79  | 17 |
| 93  | 12.5 | 10 | 137 | 8  |
|     | ∑ra = 120 | |    | ∑rb = 90 |
- The two samples must be ranked together; in Mann-Whitney, data is always ranked highest to lowest, i.e. the largest result has rank 1. If there are any ties between values, the average of the tied ranks is used for all of them, e.g. ranks 15 and 16 both equal 85 so both are given the value 15.5; if ranks 15, 16 and 17 all had the same value, all three would be ranked 16.
- Next you must calculate Ua and Ub. This uses two formulas, where na and nb are the numbers of results in samples A and B: Ua = (na × nb) + na(na + 1)/2 - ∑ra, and Ub = (na × nb) + nb(nb + 1)/2 - ∑rb.
- To calculate Ua, first multiply the number in sample A by the number in sample B, e.g. 10 × 10 = 100. Then complete the next part of the equation, which for this example is 10 × (10 + 1) / 2 = 55. So we now have 100 + 55 = 155, and we then subtract ∑ra (the sum of the ranks in column A): 155 - 120 = 35, meaning Ua = 35.
- The value of Ub must then be calculated using the same formula but substituting the values for b. For this example Ub = 65.
- The U values on their own are meaningless and tell us nothing about the data; instead we must test them against significance tables. For this we only use the smaller of the two U values, so for this example we use Ua (35).
- The smaller U value then needs to be put into a critical-value table to test its significance; the table has a row of numbers along the top and a column of numbers down the side. Use the number of results in sample A (e.g. 10) for the column and the number of results in sample B (e.g. 10) for the row. Find where this column and row intersect, and that is your critical value.
- The U value used is significant at the 0.05 level if it is less than or equal to the critical value in the table; this means there is a difference between the two samples, with only a 5% probability that the result is due to chance.
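A Python sketch reproducing the worked example above (the tied-rank averaging matches the rule in the first step):

```python
# Mann-Whitney U for the traffic-flow table; ranks highest-to-lowest
# against the pooled data, averaging the ranks of tied values.
before = [126, 148, 85, 61, 179, 93, 45, 189, 85, 93]    # sample A
after = [194, 128, 69, 135, 171, 149, 89, 248, 79, 137]  # sample B

pooled = sorted(before + after, reverse=True)

def tied_rank(v):
    positions = [i + 1 for i, x in enumerate(pooled) if x == v]
    return sum(positions) / len(positions)  # average rank for ties

ra = sum(tied_rank(v) for v in before)  # sum of ranks in A = 120
rb = sum(tied_rank(v) for v in after)   # sum of ranks in B = 90
na, nb = len(before), len(after)

ua = na * nb + na * (na + 1) / 2 - ra   # 100 + 55 - 120 = 35
ub = na * nb + nb * (nb + 1) / 2 - rb   # 100 + 55 - 90  = 65
print(min(ua, ub))  # 35; the smaller U goes to the critical-value table
```

scipy.stats.mannwhitneyu offers the same test with a p-value, avoiding the manual table lookup.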
Chi-Squared Test
The Chi-squared test can become very complicated and can be completed in several ways. I am going to concentrate on the simplest form of the test, which you are most likely to come across at A-level.
This is a test used for categorised data; because it uses categories rather than continuous measurements it doesn't require a normal distribution, and it is usually classed as a distribution-free (non-parametric) test.
Chi-squared is used to test if there is a significant difference between the expected frequencies and the actual observed frequencies in one or more categories. It can then assess if any difference between the expected and observed frequencies is due to sampling error and chance or is a significant difference that can be investigated further. This test is sometimes called the Chi Squared goodness of fit test, as it can be used to assess the "goodness of fit" of an observed distribution to a theoretical one.
Calculating Chi-Squared:
The ratio of males to females in a science faculty is 1:1, but in the chemistry lectures there have been 80 females and 40 males. Is this a significant difference from the expected numbers?
(You can suggest from the figures that it is, but it will not always be so clear!)
You first need a table of the observed values:
|                 | Female | Male |
|-----------------|--------|------|
| Observed values | 80     | 40   |
Then the expected values need to be calculated: the expected values are always the numbers you would expect in a random distribution. They can be worked out in two ways:
- You predict that all the values in the categories will be the same. This is calculated by dividing the total of all the categories by the number of categories, in the example (80+40=) 120/2 meaning 60 in each category.
- You determine the expected frequencies on some prior knowledge. Eg suppose we alter the question for this example and have the knowledge that 30% of the faculty were males and 70% females; the expected values would now be 36 males and 84 females.
|                     | Female | Male | Total |
|---------------------|--------|------|-------|
| Observed values (O) | 80     | 40   | 120   |
| Expected values (E) | 60     | 60   | 120   |
| O - E               | 20     | -20  | 0     |
| (O - E)²            | 400    | 400  |       |
| (O - E)² / E        | 6.67   | 6.67 | 13.34 |
Note that the totals of the observed values and expected values must always be the same. The total of the O - E row must always be 0.
The chi-squared equation is χ² = ∑((O - E)² / E), so the chi-squared value for this example is 13.34.
Again this value means nothing on its own and must be checked against significance tables; here the degrees of freedom are n - 1, where n is the number of categories, and the chi-squared value must be greater than the critical value stated in the table. If it is greater than the critical value, there is a significant difference between the observed and expected frequencies, e.g. something is causing more females than males to attend the chemistry lectures, or causing the males to miss them.
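A final sketch of the goodness-of-fit calculation (note the table's 13.34 comes from rounding each (O - E)²/E term to 6.67; the unrounded value is 13.33):

```python
# Chi-squared goodness of fit for the lecture example: 80 females, 40 males,
# expected split 60/60 under the "all categories equal" assumption.
observed = [80, 40]
expected = [sum(observed) / len(observed)] * len(observed)  # [60.0, 60.0]

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_sq, 2))  # 13.33; degrees of freedom = 2 categories - 1 = 1
```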