Monday, 30 January 2012

Radial Diagrams


Radial diagrams are a type of graph in which values extend outwards from a central point, so they show the relationship of each variable to that central point or item.
They are useful because a number of different variables can be plotted on the same graph, as more than one axis can be used. However, using more than four or five variables can make a diagram quite complicated.
The most common form of this type of graph is a wind rose diagram, shown below, which shows the frequency of wind direction; the axes represent North, South, East and West, and the number on each axis is the length of time that the wind was blowing in that direction. The prevailing wind direction can therefore be shown easily.

Radial diagrams are most commonly used to show the relationship of a variable to compass direction, i.e. they are used as directional diagrams, as in the wind rose diagram above, which shows the relationship between wind and compass direction. However, they can be used to plot a range of variables.
They are advantageous because trends in the data set are clearly shown, with the variable with the largest value being highlighted on the graph. It is clear from the graph that the wind blows in an almost north-south direction for much of the time.
The data that is applicable to this method is limited, but some data sets, such as compass direction, would be hard to present in any other way. It can also be hard to read exact values from the axes, as including the scale often makes the diagram too crowded; data often has a wide range of values when a number of different variables are plotted, meaning that it can be hard to find a suitable scale to use.
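For anyone wanting to reproduce this kind of plot, below is a minimal sketch of a wind-rose-style radial diagram in Python using matplotlib's polar axes; the eight compass directions and the hour counts are made-up values purely for illustration.

import numpy as np
import matplotlib.pyplot as plt

# Hours of wind recorded from each of eight compass directions
# (illustrative figures only, not real data).
directions = ['N', 'NE', 'E', 'SE', 'S', 'SW', 'W', 'NW']
hours = [30, 12, 8, 10, 28, 14, 9, 11]

# Convert the compass points to angles, with North at the top
# and the directions running clockwise as on a compass.
angles = np.deg2rad(np.arange(0, 360, 45))

ax = plt.subplot(projection='polar')
ax.set_theta_zero_location('N')
ax.set_theta_direction(-1)
ax.bar(angles, hours, width=np.deg2rad(40))
ax.set_xticks(angles)
ax.set_xticklabels(directions)
plt.show()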


Friday, 27 January 2012

Chi Squared

Chi-Squared Test

The Chi-squared test has the capability to become very complicated and can be completed in several ways. I am going to concentrate on the simplest form of the test, which is the one you are most likely to come across at A-level.
This is a test that is used for categorised data; because it uses categories rather than continuous data it doesn't require a normal distribution and is therefore neither parametric nor non-parametric.
Chi-squared is used to test whether there is a significant difference between the expected frequencies and the actual observed frequencies in one or more categories. It can then assess whether any difference between the expected and observed frequencies is due to sampling error and chance, or is a significant difference that can be investigated further. This test is sometimes called the Chi-squared goodness of fit test, as it can be used to assess the "goodness of fit" of an observed distribution to a theoretical one.

Calculating Chi-Squared:

The ratio of males to females in a science faculty is 1:1, but in the chemistry lectures there have been 80 females and 40 males. Is this a significant difference in number from what is expected?
(You can suggest from the figures that it is, but it will not always be so clear!)

You first need a table of the observed values:



                     Female    Male
Observed values        80       40


Then the expected values need to be calculated: the expected values are always the numbers you would expect in a random distribution. They can be worked out in two ways (a short sketch of both calculations follows this list):
  1. You predict that all the values in the categories will be the same. This is calculated by dividing the total of all the categories by the number of categories; in the example (80 + 40 =) 120 / 2, meaning 60 in each category.
  2. You determine the expected frequencies from some prior knowledge. E.g. suppose we alter the question for this example and know that 30% of the faculty were male and 70% female; the expected values would now be 36 males and 84 females.
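As a rough illustration of those two methods in Python (the function names here are just for this sketch, not from any statistics library):

# Two ways of generating expected frequencies, matching the list above.
# The function names are illustrative only.

def expected_equal(observed):
    # Method 1: assume every category should hold the same count.
    total = sum(observed)
    return [total / len(observed)] * len(observed)

def expected_from_proportions(observed, proportions):
    # Method 2: split the total according to known proportions.
    total = sum(observed)
    return [total * p for p in proportions]

observed = [80, 40]  # females, males
print(expected_equal(observed))                          # [60.0, 60.0]
print(expected_from_proportions(observed, [0.7, 0.3]))   # [84.0, 36.0]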
We can then insert this data into a full table:


                        Female    Male    Total
Observed values (O)       80       40      120
Expected values (E)       60       60      120
O - E                     20      -20        0
(O - E)²                 400      400
(O - E)² / E             6.67     6.67    13.34


Note that the total of the observed values and the total of the expected values must always be the same. The O - E values must always sum to 0.

The chi-squared statistic is the sum of (O - E)² / E across all the categories, so the chi-squared value for this example is 6.67 + 6.67 = 13.34.
Again, this value means nothing on its own and must be tested against significance tables; the degrees of freedom here are the number of categories minus one (2 - 1 = 1), and the chi-squared value must be greater than the critical value stated in the table. If it is greater than the critical value, there is a significant difference between the observed and expected frequencies. E.g. something is causing more females than males to attend the chemistry lectures, or causing the males to miss these lectures.
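If you want to check the arithmetic by computer, below is a sketch of the same test in Python using scipy.stats.chisquare with the observed and expected values from the table above; it also gives a p-value directly, so a critical value table is not needed.

from scipy.stats import chisquare

# Observed lecture attendance and the expected frequencies
# (an equal split of the 120 students).
observed = [80, 40]
expected = [60, 60]

result = chisquare(f_obs=observed, f_exp=expected)
print(result.statistic)  # about 13.33, matching the hand calculation
print(result.pvalue)     # well below 0.05, so the difference is significant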


Mann-Whitney U test


This is a test that compares the medians of two data sets, to see if there is a significant difference between them. It shows whether one of the samples tends to have larger values than the other, and therefore whether there is a real difference between the data sets or whether any perceived difference is simply due to chance. Two data sets can have similar means but very different values within them; this test therefore highlights differences within the data sets. It can also be used to show that two independent samples do in fact have the same distribution.
This is a non-parametric test, meaning the data doesn't have to be normally distributed; the parametric alternative is Student's t-test, which is a much more complicated analysis. The U test uses the relative ranks of the data, so the data must be ordinal.
Before the test is completed a null hypothesis must be established; this is always the same and states:
There is no significant difference between the two data sets
This is the hypothesis that you aim to disprove (or prove, if you want to show that the samples do indeed have the same distribution / come from the same population).

Completing the test:
For example, suppose you wanted to see whether there was a difference in traffic flow before and after a supermarket was built. You could collect data before and after the construction, then use the Mann-Whitney U test to see whether the difference in traffic flow is significant or not.
The data first needs to be put into a table, with the two data sets labelled A and B (you can use any letters you like as long as they are consistent throughout the analysis).
 

Site   Traffic flow before construction (A)   Rank (ra)   Traffic flow after construction (B)   Rank (rb)
  1                   126                        11                       194                       2
  2                   148                         7                       128                      10
  3                    85                        15.5                      69                      18
  4                    61                        19                       135                       9
  5                   179                         4                       171                       5
  6                    93                        12.5                     149                       6
  7                    45                        20                        89                      14
  8                   189                         3                       248                       1
  9                    85                        15.5                      79                      17
 10                    93                        12.5                     137                       8

Sum of ranks:                               Σra = 120                                         Σrb = 90


  1. The two samples must be ranked together; in Mann-Whitney the data is always ranked highest to lowest, i.e. the largest value is ranked 1. If there are any ties between values, the average of the tied ranks is used for all of them; e.g. ranks 15 and 16 both correspond to a value of 85, so both are given the rank 15.5. If ranks 15, 16 and 17 all had the same value, all three would be ranked 16.
  2. Next you must work out Ua and Ub. This uses two formulas: Ua = na x nb + na(na + 1)/2 - Σra and Ub = na x nb + nb(nb + 1)/2 - Σrb, where na and nb are the number of values in each sample and Σra and Σrb are the sums of the ranks in columns A and B.
  3. To calculate Ua, first multiply the number in sample A by the number in sample B, e.g. 10 x 10 = 100. Then complete the next part of the equation, which for this example is 10 x (10 + 1) / 2 = 55. We now have 100 + 55 = 155, and we then subtract Σra (the sum of the ranks in column A): 155 - 120 = 35, meaning Ua = 35.
  4. The value of Ub must then be calculated using the same formula but substituting the values for B. For this example Ub = 100 + 55 - 90 = 65.
  5. The U values on their own are meaningless and tell us nothing about the data; instead we must test them against significance tables. For this we only use the smaller of the two U values, so for this example we use Ua (35).
  6. The smaller U value then needs to be looked up in a critical value table to test its significance; the table will have a row of numbers along the top and a column of numbers down the side. You should use the number of results in sample A (e.g. 10) for the column and the number of results in sample B (e.g. 10) for the row. Find where this column and row intersect; this is your critical value. E.g.

[Critical values table for the Mann-Whitney U test at the 0.05 significance level; for two samples of 10, the critical value is 23.]
  7. The U value is significant at the 0.05 level if it is equal to or smaller than the critical value in the table; this means that there is a difference between the two samples, with only a 5% probability that the result is due to chance. In this example Ua (35) is larger than the critical value (23), so the difference in traffic flow before and after construction is not significant at this level.
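For comparison, here is a sketch of the same test in Python using scipy.stats.mannwhitneyu with the traffic-flow figures from the table above; scipy reports the U statistic and a p-value directly, so no critical value table is needed.

from scipy.stats import mannwhitneyu

# Traffic flow at the ten sites before (A) and after (B) construction,
# taken from the table above.
before = [126, 148, 85, 61, 179, 93, 45, 189, 85, 93]
after  = [194, 128, 69, 135, 171, 149, 89, 248, 79, 137]

# Two-sided test of the null hypothesis that the two samples
# come from the same distribution.
result = mannwhitneyu(before, after, alternative='two-sided')
print(result.statistic)   # U for the first sample: 35, as calculated by hand
print(result.pvalue)      # greater than 0.05, so the difference is not significant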