I thought it might be helpful to have a thread on statistics. To begin, I will give an overview by summarising a BBC TV programme called "The Joy of Statistics", presented by Professor Hans Rosling, which hopefully you can still find and enjoy on the *** somewhere.
Purpose - Statistics can tell us whether what we think and believe is actually true or not: are men better drivers than women, are Germans taller than Frenchmen, is drug A better than drug B, and so on. All we need is the ability to define and obtain relevant data, and this is far from easy. For example, it is easy to define and get data on infant mortality, but what data could we define to show that God exists? Professor Rosling joked that "a man remonstrated with his member of parliament telling him that unemployment was up 13%, pay is down 5%, suicides have increased by 7%, so why is the government wasting money collecting statistics?" It is in fact surprising how little accurate information any person has about the world they live in. One might think that highly educated people know more, but often what they know is no better, and sometimes worse, than what a simple guess might yield.
Statistics comes from the word 'state', and some scholars pinpoint the origin of statistics to 1663, with the publication of Natural and Political Observations upon the Bills of Mortality by John Graunt. However, in its modern form it began in Sweden in 1749, more than 250 years ago, with the creation of the Office of Tables (Tabellverket) for the systematic collection of national statistics. It was the first time that any government could get an accurate picture of its people. One interesting finding by the Office of Tables was that the population was 2 million, not the 20 million that everyone thought! After the shock of this finding the Swedish government took action, for example by improving health care, because it now knew where mortality was occurring. So knowing the statistics lets one exercise control; interestingly, in its early days statistics was also known as 'political arithmetic'.
The Average - In the early years huge amounts of data were collected and shown in tables, but it was soon realised that to make sense of the data it had to be analysed and presented in better ways than just long tables of numbers. One of the first tools was the average, which reduces a whole mass of data to a single number that characterises a whole population. For example, in the UK the number of people who die in traffic accidents is almost constant from year to year, so here we have a case where a statistic describes a social phenomenon. But averages don't tell you the whole story, and one can get weird results, such as the fact that Swedish people on average have slightly fewer than two legs (average number of legs 1.999), so most Swedish people have slightly more than the average number of legs. This oddity arises because some people have one leg and some have none, but no one has more than two legs.
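The leg example is easy to check with a few lines of Python. This is only a sketch: the population of 10,000 people and the counts of one-legged and no-legged people are numbers I made up for illustration, not Swedish data.

```python
from statistics import mean, median

# Made-up population (an assumption, not real data):
# 9,990 people with two legs, 8 with one leg, 2 with none.
legs = [2] * 9990 + [1] * 8 + [0] * 2

average_legs = mean(legs)

# Count the people whose number of legs is above the average.
above_average = sum(1 for n in legs if n > average_legs)

print(f"mean number of legs:   {average_legs:.4f}")
print(f"median number of legs: {median(legs)}")
print(f"share of people above the mean: {above_average / len(legs):.1%}")
```

The mean is pulled just below two by a handful of people, while the median stays at two, which is why a single average can mislead when the data is lopsided.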
Variation - So we need to look at variation in the data, and to get a handle on that we need to look at shapes, which typically show how the data varies around the average. As these shapes were explored, one shape turned up again and again. It fitted so many sets of measurements that Sir Francis Galton, a polymath, named it the 'normal curve'. In time other distributions were discovered, related to social or other processes:
Continuous Distributions - Normal Distribution, Uniform Distribution, Cauchy Distribution, t Distribution, F Distribution, Chi-Square Distribution, Exponential Distribution, Weibull Distribution, Lognormal Distribution, Gamma Distribution, Double Exponential Distribution, Tukey-Lambda Distribution, Beta Distribution etc
Discrete Distributions - Binomial Distribution and Poisson Distribution
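To make "variation around the average" concrete, here is a small Python sketch that draws values from a normal distribution and measures their spread. The heights, the centre of 170 and the spread of 8 are assumptions chosen for illustration only.

```python
import random
from statistics import mean, pstdev

random.seed(1)  # fixed seed so the run is repeatable

# Simulated heights in cm (illustrative values, not real measurements).
heights = [random.gauss(170, 8) for _ in range(10_000)]

centre = mean(heights)    # the average
spread = pstdev(heights)  # standard deviation: typical distance from the average

# For a normal curve roughly two thirds of values lie within one
# standard deviation of the average.
within_one_sd = sum(1 for h in heights if abs(h - centre) <= spread) / len(heights)

print(f"average: {centre:.1f}")
print(f"standard deviation: {spread:.1f}")
print(f"share within one standard deviation: {within_one_sd:.1%}")
```

Swapping `random.gauss` for, say, `random.expovariate` gives a quite different shape with the same kind of summary numbers, which is one reason to look at the shape and not only at the average.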
Data Patterns - So shapes show patterns in the data, but they are also a tool for communicating with a wider public. These days it's almost an art form, for we can use different shapes, colours and animations to tell the story. If the story in the numbers can be told by a beautiful and clever image then everyone can understand it. One of the pioneers of statistical graphics was Florence Nightingale, who was herself a passionate statistician. She once famously said that "to understand God's thoughts we must study statistics for these are the measure of his purpose", so for her statistics was a religious duty. She went to the Crimean War and for two years recorded mortality in meticulous detail, gathering her data in a devastating report that forced the government to set up a committee of enquiry. What secured her place in statistical history is the graphics she used, and one in particular: the polar area graph, in which she showed deaths from wounds, deaths from accidents and deaths from preventable causes. Her graphics were so clear as to be unmistakable, and they led to a revolution in nursing care and hygiene in hospitals worldwide.
Correlation - After seeing in the statistics what is happening, the next step is to try to find out why, using the powerful idea of correlation: how things vary together. A correlation may point to a link of cause and effect (crime correlates with poverty, infection correlates with poor sanitation), but it does not prove one, and correlation can be very, very tricky. For example, we might find a correlation between shoe size and intelligence, but it is extremely unlikely that there is a causal link. One of the most famous cases was the investigation by Dr. Richard Doll in the 1950s of a causal link between lung cancer and smoking. The work was difficult, of course, because Doll had to show that no other factor was involved, and that is much harder than you might think. A randomised controlled trial (RCT) was not possible, so he worked from comparisons instead: smokers and non-smokers, those who got lung cancer and those who did not. He also looked at those who stopped smoking and calculated their reduced risk of getting the disease.
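The shoe size example can be simulated to show how a hidden third factor creates a correlation with no causal link. In this Python sketch I assume the hidden factor is age in a group of children: older children have bigger feet and also score higher on a test. All the numbers are invented for illustration.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

random.seed(42)  # fixed seed so the run is repeatable

# Invented data: 2,000 children aged 5 to 15.
ages = [random.randint(5, 15) for _ in range(2000)]
# Shoe size and test score both depend on age, but not on each other.
shoe = [20 + 1.5 * a + random.gauss(0, 1.5) for a in ages]
score = [10 + 5 * a + random.gauss(0, 8) for a in ages]

overall = pearson(shoe, score)

# Hold the hidden factor fixed: look only at the ten-year-olds.
ten_year_olds = [(s, t) for a, s, t in zip(ages, shoe, score) if a == 10]
within = pearson([s for s, _ in ten_year_olds], [t for _, t in ten_year_olds])

print(f"correlation across all children: {overall:.2f}")
print(f"correlation among ten-year-olds: {within:.2f}")
```

Across all the children the correlation is strong, yet among children of the same age it largely disappears. Ruling out other factors in this way is the kind of work Doll had to do for smoking.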
Cautions - When we come up with a correlation, therefore, we must not stop thinking but try as hard as we can to disprove it, by looking for other possible explanations of cause and effect, by getting more data, or both. If it withstands all those efforts at refutation then we might cautiously say we really have something here. So data is the oxygen of science, for the more we have the more correlations we will find, and in today's world data is growing exponentially.
Example - Consider language translation. The old way of doing it would be to discover all the rules of a language and program them. But language is flexible: it is not necessary to know the rules to understand what is being said, what is being said may itself be ungrammatical or ambiguous, and in every language there are exceptions. Modern methods use correlations between words and between phrases, and find that these words and phrases often also correlate with words and phrases in another language. In a way they treat everything as an exception, so a system in essence just needs large bodies of text to find the correlations and use them to translate (see Google Translate).
The Future - With such vast amounts of data it is possible to do what's called 'computational science', where statistics might change how science is done or even which science is possible. One might take the data one has and run it against a whole range of models or hypotheses to see where the best fit occurs. We might do this with tens of thousands of scenarios or hypotheses, with the system automatically discarding the poor cases.
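A toy version of "run the data against many hypotheses and discard the poor ones" can be sketched in Python. Everything here is an assumption for illustration: the data is simulated from a straight line, the hypotheses are a grid of candidate slopes and intercepts, and the rule for discarding (error more than twice the best) is arbitrary.

```python
import random
from itertools import product

random.seed(0)  # fixed seed so the run is repeatable

# Simulated measurements that roughly follow y = 3x + 2 (invented data).
xs = [i / 10 for i in range(100)]
ys = [3.0 * x + 2.0 + random.gauss(0, 0.5) for x in xs]

def squared_error(slope, intercept):
    """Sum of squared differences between a hypothesis and the data."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# A few thousand hypotheses: every combination of slope and intercept.
slopes = [s / 10 for s in range(0, 61)]        # 0.0 to 6.0
intercepts = [b / 10 for b in range(-50, 51)]  # -5.0 to 5.0
scored = [(squared_error(m, b), m, b) for m, b in product(slopes, intercepts)]

best_error, best_slope, best_intercept = min(scored)

# Automatically discard hypotheses that fit much worse than the best one.
survivors = [h for h in scored if h[0] <= 2 * best_error]

print(f"hypotheses tried: {len(scored)}")
print(f"best fit: y = {best_slope}x + {best_intercept}")
print(f"hypotheses kept after discarding poor fits: {len(survivors)}")
```

Real systems search far richer families of models than straight lines, but the principle is the same: score every candidate against the data and keep only those that fit well.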
In my next post I will begin with some simple statistical ideas - basics are everything!