
Review of “Statistics Done Wrong” by Alex Reinhart

This is a book review for the applied math journal SIAM Review.  Comments are welcome.  A short version of the book can be found here.

Most of us accept that statistics is not applied mathematics. The goal of statistics is to obtain answers from data, and mathematics just happens to be an exceptionally useful tool for doing so. However, for many applied mathematicians, statistical analysis is an integral part of daily work. Experimental studies often motivate our research, and we use data to develop and validate our models. To understand how experimental outcomes are interpreted, and to communicate with scientists in other fields, a knowledge of statistics is indispensable.

One issue that we need to take seriously is that misapplications of statistics have led to false conclusions in many, maybe even most, published studies [1]. Although the “soft sciences” have received the most scrutiny in this regard [3], the “hard sciences” suffer from closely related problems [2]. Anybody who uses the results of statistical data analysis – and this includes most applied mathematicians – needs to be aware of these issues.

As the title suggests, Alex Reinhart’s “Statistics Done Wrong” [4] is not a textbook. Rather, its aim is to explain some of the ways in which data analysis can, and often does, go wrong. The book is related to Darrell Huff’s classic “How to Lie with Statistics” [5], which covers many topics that are now part of freshman courses. Huff, a journalist (and later consultant to the tobacco industry), provided a lively discussion that alerted general readers to the misuses of statistics by the media and politicians. Since the first edition of Huff’s book in 1954, computational power has increased immensely. But our increased ability to collect and analyze data has also made it easier to misapply statistics. Reinhart’s book aims to introduce the present consumers of statistics to the resulting problems, and suggests ways to avoid them.

The book starts with a review of hypothesis testing, p-values and statistical power. Here Reinhart introduces a recurrent topic of the book: the errors and “truth inflation” that arise because most journals and scientists prefer to publish positive results. The ease with which we can analyze data makes the problems of multiple comparisons and double dipping particularly important. The book provides a number of thoughtful examples to illustrate these issues. The last few chapters provide good guides to data sharing, publication, and reproducibility in research. Each chapter ends with a list of useful tips.

Most of these issues are more subtle than those discussed by Huff [5]. While not heavy on math, the book presents arguments that require reflection. The ideas are frequently illustrated using well-chosen examples. The book is thus informative, yet entertaining and easy to read.

Reinhart predominantly discusses issues resulting from the misuse of frequentist statistics. This is understandable, as the frequentist approach is currently dominant in most sciences. However, it is worth noting that Bayesian approaches make it easier to deal with some of the main problems discussed in the book: they simplify the handling of multiple comparisons, and replace p-values with measures that are easier to interpret. Still, Bayesian statistics is not a magic bullet – as Bayesian approaches become more common over the next decades, we may need another volume describing their misuses.

What is the audience for this book? Many of the topics need to be familiar to anybody doing science today. The book could also provide good supplementary material for a second course in statistics.

Doing statistics can be tricky. Finding the right experimental design requires a careful consideration of the question to be answered. The interpretation of the results requires a good understanding of the methods that are used. All statistical models are by necessity approximate. Knowing how to verify that the underlying assumptions are reasonable, and choosing an appropriate way to analyze data, is essential. A central point here is that the statistical analysis deserves as much attention as the conclusions we draw from it. And perhaps the most important lesson of this book is that questions of statistical analysis should be addressed when the research is planned.

Reinhart’s book is not a comprehensive list of the different ways in which misuses of statistics can lead us astray. It provides no foolproof answers on how to detect problems in statistical analysis. However, it does an excellent job of introducing a range of common pitfalls, and provides sensible tips on how to avoid them. Doing statistics means accepting that we will be wrong some of the time. The best we can do is to maximize our chances of being right, and understand how likely it is that we are not.


  1. Ioannidis, J. P. A. Why most published research findings are false. PLoS Medicine, 2:8, e124 (2005).
  2. Button, K. S., et al. Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14, p. 365-376 (2013).
  3. Open Science Collaboration. Estimating the reproducibility of psychological science. Science, 349:6251, p. 943 (2015).
  4. Reinhart, A. Statistics Done Wrong. No Starch Press (2014).
  5. Huff, D. How to Lie with Statistics. W. W. Norton & Company, Inc. (1954).

Statistics and clinical trials

On the night of December 13, 1799, George Washington woke his wife Martha to tell her that he was feeling ill. Following the medical practice of the day, he was bled repeatedly and given an assortment of medicines, some of which contained mercury. By the time Washington died, half of his blood had been removed. He might have lived longer had the doctors simply done nothing.

Washington’s case is not unique. Doctors of the past did not know whether their medicines worked. Some, like quinine, were real cures. Most others, like lead and mercury, did more harm than good. Until the second half of the 20th century, physicians did not have the tools to decide which medicines help patients. And it was not microscopes or sophisticated lab equipment they were lacking. Rather, they did not know how to reliably test and compare different treatments.

Suppose you want to test whether a drug helps people sleep better. You could give it to many patients, and ask them if their sleep has improved. But how can you tell whether the results are due to chance, or the placebo effect? And how many people do you need to ask to conclude that the drug works? One answer is to divide patients randomly into two groups. Give the drug to those in the test group, and not to those in the so-called control group. If those in the test group fare better, the drug is effective.

This may sound simple, but it is not. Some people in both groups will still sleep better by chance. Fortunately, if you have sufficiently many patients, such differences even out. How many patients do you need, and how confident can you be in your conclusions? The mathematical theory of probability and statistics provides precise answers to these questions.
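
One way to make the question "could this be due to chance?" concrete is a permutation test. The sketch below is only an illustration: the sleep durations are invented, and the function name `permutation_p_value` is hypothetical. It repeatedly reshuffles the patients between the two groups and counts how often a random split produces a difference at least as large as the one observed.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_p_value(treated, control, n_shuffles=10000, seed=0):
    """One-sided permutation test for a difference in group means."""
    rng = random.Random(seed)            # fixed seed so the run is repeatable
    observed = mean(treated) - mean(control)
    pooled = list(treated) + list(control)
    at_least_as_large = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)              # random re-assignment to the two groups
        fake_treated = pooled[:len(treated)]
        fake_control = pooled[len(treated):]
        if mean(fake_treated) - mean(fake_control) >= observed:
            at_least_as_large += 1
    return observed, at_least_as_large / n_shuffles

# Invented data: hours of sleep per night for 8 patients in each group.
treated = [7.1, 6.8, 7.4, 7.9, 6.5, 7.2, 7.7, 6.9]
control = [6.4, 6.9, 6.1, 7.0, 6.6, 5.9, 6.8, 6.3]

observed, p_value = permutation_p_value(treated, control)
print(f"observed difference: {observed:.4f} hours, p-value: {p_value:.4f}")
```

A small p-value says that random assignment alone rarely produces such a gap. It does not say how large the benefit is, and with fewer patients the same gap would be much less convincing.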

The ethical questions are more difficult. In such clinical trials we give the drug to some people and not to others. How can we withhold a potentially life-saving drug from some dying patients, and give it to others? But remember – before a treatment is proven to work, it is possible that doing nothing is the better option. The reason we test treatments is that we only suspect that they work. Doctors, like the rest of us, are not clairvoyant. They do not know what treatment is best until it is tested.

As an example, antiarrhythmic drugs were given to patients for decades to stop irregular heartbeats. Doctors argued that it was self-evident that these drugs saved lives. The drugs were eventually tested in clinical trials, but only because doctors believed they should be used more widely. However, a statistical analysis showed that these supposedly life-saving medicines were killing patients each year. The drugs may have been responsible for more than 50,000 deaths in the US alone. What seemed obvious and intuitive turned out to be very wrong.

We laugh at the medieval use of leeches, and other remedies that help “balance the humors”. But even today we sometimes base our decisions on intuition rather than evidence. Many avoid vaccinations which have been proven to be safe and effective. On the other hand, doctors and patients frequently demand expensive medical tests when none are needed. Clinical trials and statistics can tell us which treatments work. It is up to us to make use of that knowledge.

Some notes:

  1. How much doctors rely on proven medicines, and how much they go by instinct and guesses is a matter of debate.
  2. I recommend Druin Burch’s book “Taking the Medicine” for a look into the history of clinical trials.
  3. Much has been written about the modern anti-vaccine movement. Here is an older, but still relevant article in Wired.
  4. The ethics of randomized clinical trials is a complicated subject. I have written about a particular case here. You will find links to more detailed discussions here.
  5. Here is a good article about overtreatment.

How to guide graduate students to good research projects?

I think it is great when graduate students come up with problems to work on by themselves. There is probably no better preparation for a research career. Unfortunately, grad students in math only have about 3-4 years to learn about a field, produce new results, publish their research and write a thesis. This does not leave a lot of time for exploration. The role of the advisor is to try to help select the most fruitful directions. Here are some of the questions that I found helpful:

1) Is the problem new, or has it been answered in some form before?

This is essential since you don’t want to have a student writing a thesis or a paper on something that is already known – and if it is known, it is good to find out how others did it. This can then naturally lead to further questions.

2)  Is the question interesting?

This is harder. One way is to have students think about how they could convince other graduate students that this problem is worth studying. Afterwards, they can think about how to convince non-scientists.

3) Is the question related to what we are doing in the lab/research group? Can it be answered in a reasonable time?

This is a more practical point. More senior graduate students who already have the majority of the thesis written can have more leeway.

I am sure there are a number of other criteria here. In his TED talk, Uri Alon says that he approaches research like improv theater. This is great once a question is clearly articulated, and the group is looking for a way forward. Picking the right problem to work on is tricky. What other questions can help guide graduate students to a good problem, while encouraging them to take a role in designing their own research?

Compressive Sensing

In his short story “On Exactitude in Science” Jorge Luis Borges describes a map so detailed that it is the exact size of the empire it represents. Every bridge, road and house on the map is the size of its real counterpart. Of course, such a map is absurd. Maps are useful because they show the most important features of a location – they give us only the information we need. Even a map that is more than a few feet across is useless for navigation. The job of a mapmaker is then to discard enormous amounts of detail, and show us only what we need to know to get from one place to another.

This is not only true for maps. Our megapixel cameras are similar to the cartographers in Borges’ story. Each photo we take contains enormous amounts of data. However, the information that we need to reconstruct a good approximation of each image is much smaller. This is why photos are almost always compressed when stored. Indeed, most compression works by discarding some of the unnecessary information in the original pictures.

But if much of the information in a photo is thrown away afterwards, then why do we need megapixel cameras? Why not just record the essential information about the scene in front of us to start with? If we did so, we would not have to compress the files later.

How to do this has long puzzled scientists and mathematicians. The problem is that one approach may work for some pictures, but fail for others. How can we make a camera that always makes only the measurements we need to reconstruct a picture?

The mathematician Emmanuel Candès stumbled upon the answer almost by chance while trying to remove noise from an image. He knew that an exact algorithm that would identify only relevant information would require far too much time to run. He therefore tried an approach that at first sight should not have worked well. To his surprise he recovered almost all the information in the original image – the process worked like magic. Candès said later: “It was as if you gave me the first three digits of a 10-digit bank account number – and then I was able to guess the next seven.”

This was the first step in the development of what is now called “compressive sensing”: A way to take good pictures by taking very few measurements. Surprisingly, rather than carefully planning which measurements to take, it is best to take measurements at random. When this was first proposed, many thought it must be wrong. How could taking measurements at random work better than a sophisticated algorithm? Later mathematical analysis answered these doubts, however. As a result, compressive sensing is used in different areas of medical imaging, and its applications are growing.
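
To see the idea at work on a toy problem, here is a minimal sketch under stated assumptions. The "signal" is a list of 50 numbers of which only two are non-zero, and we record just 30 random weighted sums of its entries. For brevity the recovery uses a simple greedy method (orthogonal matching pursuit) that I have swapped in; it is not the convex optimization approach analyzed in the original work, and all names and sizes here are made up for illustration.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve(matrix, rhs):
    """Solve a small square linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - tail) / aug[r][r]
    return z

def omp(columns, y, max_terms=10, tol=1e-9):
    """Greedy sparse recovery; columns[j] is column j of the measurement matrix."""
    support, coeffs = [], []
    residual = y[:]
    while len(support) < max_terms and dot(residual, residual) > tol:
        # Pick the unused column that best matches what is still unexplained.
        unused = [j for j in range(len(columns)) if j not in support]
        best = max(unused, key=lambda j: abs(dot(columns[j], residual))
                   / dot(columns[j], columns[j]) ** 0.5)
        support.append(best)
        # Least-squares fit of y on the chosen columns (normal equations).
        gram = [[dot(columns[a], columns[b]) for b in support] for a in support]
        coeffs = solve(gram, [dot(columns[a], y) for a in support])
        residual = [y[i] - sum(c * columns[k][i] for c, k in zip(coeffs, support))
                    for i in range(len(y))]
    x = [0.0] * len(columns)
    for c, k in zip(coeffs, support):
        x[k] = c
    return x

rng = random.Random(0)
n, m = 50, 30                       # signal length, number of measurements
x_true = [0.0] * n
x_true[7], x_true[31] = 2.0, -1.0   # only two entries carry information
columns = [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(n)]
y = [sum(columns[j][i] * x_true[j] for j in range(n)) for i in range(m)]

x_hat = omp(columns, y)
error = max(abs(a - b) for a, b in zip(x_hat, x_true))
print(f"{m} random measurements of a length-{n} signal, max error {error:.1e}")
```

The measurement weights are drawn at random without looking at the signal, yet the 30 sums are enough to pin down all 50 entries because almost all of them are zero.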

We are awash in information. But the main problem is not how to acquire more. Like cartographers, we need to find what matters, and discard the rest. And here, as in so many other matters, it is mathematics that shows the way.

Some notes:

Here is an interview with one of the discoverers of compressive sensing, Emmanuel Candès. Follow the link to a more technical review article.

Here is a good article in Wired that gives a more detailed overview of the origins of compressive sensing. Here is another overview with a bit more math, but also very understandable. Here is also an understandable lecture.

Simpson’s paradox

Here is the text of another episode for Engines. For a great illustration of this idea, see this excellent web page created by undergrads at UC Berkeley. I will try to write a second one on compressive sensing, and record them together. Comments are welcome, of course:

In 1973 the University of California at Berkeley was sued for sex discrimination in graduate student admissions. The case seemed clear cut: only 35% of female applicants, compared to 44% of male applicants, were admitted to graduate programs across the university. However, when statisticians looked at the data in more detail they found a surprise. When looking at the admission rates of individual departments, the apparent bias disappeared. Across departments, women were either more likely to be admitted or about equally likely to be admitted as men. Indeed, at the level of individual departments, women seemed to fare slightly better.

This is an example of Simpson’s paradox – a paradox that can affect averages whenever we combine, or pool, data. Here is another example involving two baseball players, Derek Jeter and David Justice. In both 1995 and 1996 David Justice had a higher batting average than Derek Jeter. However, when we compute the batting average over both seasons, then Derek Jeter is ahead of David Justice. Again, pooling the data gives a different picture than looking at smaller chunks.

How is this possible? Let’s look at the case of graduate school applicants to the University of California at Berkeley. It turns out that more women applied to departments in the humanities, while men tended to apply in higher numbers to engineering and science departments. Humanities departments had fewer available slots, and rejected more applicants. Thus female applicants applied mostly to departments which admitted fewer students, whether male or female. As a result, the overall fraction of women admitted was lower than that of men. A bias may have existed, but it was not a bias in the rate of admissions. Rather, it was a bias in the number of women who chose to apply for graduate studies in technical fields.
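
The arithmetic behind the reversal is easy to check. The numbers below are invented for illustration and are not the actual Berkeley figures: two hypothetical departments, each of which admits women at a slightly higher rate than men, while most women apply to the more selective one.

```python
# Invented numbers: department -> group -> (applicants, admitted).
applications = {
    "engineering": {"men": (800, 480), "women": (100, 65)},
    "humanities":  {"men": (200, 40),  "women": (900, 225)},
}

def department_rate(department, group):
    applied, admitted = applications[department][group]
    return admitted / applied

def pooled_rate(group):
    """Admission rate after lumping all departments together."""
    applied = sum(dept[group][0] for dept in applications.values())
    admitted = sum(dept[group][1] for dept in applications.values())
    return admitted / applied

for department in applications:
    print(f"{department:12s} men {department_rate(department, 'men'):.0%}, "
          f"women {department_rate(department, 'women'):.0%}")
print(f"{'pooled':12s} men {pooled_rate('men'):.0%}, women {pooled_rate('women'):.0%}")
```

Within each department women do better (65% against 60%, and 25% against 20%), yet the pooled rates are 52% for men and 29% for women, because nine out of ten women applied to the department that rejects most applicants.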

Simpson’s paradox can have important consequences. For example, medical researchers compared a less invasive treatment for kidney stones to the established surgical methods, and found the new treatment to be better overall. However, the less invasive treatment was more frequently applied to small kidney stones. Since smaller kidney stones are easier to treat, this gave an advantage to the new, less invasive method. When the treatments were compared separately on small kidney stones and large kidney stones, the traditional treatment proved to be more successful. Taking into account kidney stone size completely changed the conclusion about which treatment is better.

The outcomes of lawsuits, promotions, and our choice of medical treatments are frequently based on numerical evidence. Yet our intuition can easily mislead us when we think about numbers. Mathematics and statistics can help – they can give us answers to the question we are asking. But it is up to us to make sure that we are asking the right questions.


The Wikipedia article on Simpson’s Paradox has a number of other good examples.

The mathematician John Tukey is credited with saying that “Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.”

How to make decisions when the environment is changing

I have recently revisited the following old problem in a new paper with Alan Veliz-Cuba and Zack Kilpatrick: Suppose that you are trying to decide between different alternatives. Maybe you are trying to find the best place to fish, deciding behind which bush to look for your dinner, or choosing between different products in a supermarket. Humans, animals, and even animal collectives decide between alternatives all the time. Such decisions are frequently not based on a single observation. Indeed, animals accumulate evidence, and combine bits of information to arrive at a better choice.

There has been a lot of theoretical work to determine the best way to integrate information that arrives over time. There are also fascinating experiments that suggest that certain animals make decisions in a way that is very similar to the theoretical models. Indeed, neuroscientists have even captured the neural signature of such processes.


This figure shows the activity of neurons in area LIP of monkeys that are deciding whether a cloud of dots moves predominantly to the right or to the left. As the animal accumulates information for one alternative or the other, the activity of a cell increases. Figure from Gold and Shadlen, Annual Review of Neuroscience, vol. 30, p. 535 (2007).

However, in most classical studies of such decision making the correct (or better) choice is fixed during a trial (for a recent study where this is not the case, see here). Unless there is some pressure to make a decision quickly, in such cases it is best to accumulate as much evidence as possible, that is, wait indefinitely to make sure that you are making the correct choice. However, the natural world constantly changes, and what is a correct choice or better option at one instant may no longer be so in the next. Our goal was to extend classical models to the case where the truth is not constant.

The model that we derived has some interesting features: An ideal evidence accumulator will discount prior evidence in a way that is determined by the volatility of the environment. In other words, to perform well in a changing environment it is best to forget evidence that is too old, as it is no longer pertinent. Indeed, our model shows that you want to keep evidence that has arrived over about one average environmental epoch. As a consequence, even if for some reason the best option does not change for an extended time, there is a limit to the certainty you can attain about your choice. Your certainty is limited by the amount of information you keep in memory.
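
The effect of discounting can be seen in a discrete-time toy version of the problem. The sketch below is not the continuous-time model from the paper, whose equations are not given in this post. It assumes an environment with two possible states that switches with a fixed probability `hazard` between observations, and tracks the log odds in favor of the first state with the standard Bayesian update for such a two-state process. The evidence values are invented.

```python
import math

def discount(log_odds, hazard):
    """Account for a possible switch of the environment before the next observation."""
    if hazard == 0.0:
        return log_odds                      # static world: nothing is forgotten
    odds = math.exp(log_odds)
    switched = ((1 - hazard) * odds + hazard) / ((1 - hazard) + hazard * odds)
    return math.log(switched)

def accumulate(evidence, hazard):
    """Return the log odds after each observation.

    Each entry of `evidence` is the log likelihood ratio of one observation.
    """
    log_odds, trace = 0.0, []
    for llr in evidence:
        log_odds = discount(log_odds, hazard) + llr
        trace.append(log_odds)
    return trace

# Invented input: 50 observations favoring the first option, then 5 favoring the second.
evidence = [1.0] * 50 + [-1.0] * 5
static = accumulate(evidence, hazard=0.0)
volatile = accumulate(evidence, hazard=0.1)

print(f"after 50 consistent observations: static {static[49]:.2f}, volatile {volatile[49]:.2f}")
print(f"after 5 contrary observations:    static {static[-1]:.2f}, volatile {volatile[-1]:.2f}")
```

With `hazard=0.0` the certainty grows without bound and the observer barely notices the change at the end. With `hazard=0.1` the log odds can never exceed log(0.9/0.1) plus the size of one observation, and the observer switches its preference within a few observations.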

The nice thing about differential equation models is that they suggest plausible neural implementations. We proposed such a model with the activity in different neural populations representing the evidence for different choices. Interestingly, unlike in classical models, the different populations here are coupled through excitation.

You can find the paper on BioRxiv.

Evolution in a test tube

The amazing variety of life on this planet is a product of evolution. However, it took billions of years for sharks, chimps and magnolias to evolve from their common ancestor. Given that evolution operates on such an enormous timescale, how could we possibly study it in a laboratory? Human life just seems too short.

But not all evolution is slow. Within our lifetime bacteria have evolved defenses against the most powerful antibiotics. Indeed, many antibiotics are themselves the result of evolution-fueled warfare between different bacteria and between bacteria and the fungi they attack. We simply borrow some of their weapons. But bacteria that are not killed by antibiotics can prosper. They give rise to new resistant generations, rendering our weapons useless. This type of evolution can occur within days, and if we don’t discover new drugs, the resulting antibiotic resistance may end up costing millions of lives by the middle of this century.

Can we control how organisms evolve? Our ancestors have done this to suit their own ends: dogs and wheat are in their current form a result of evolution that humans have been steering.

Scientists have tried to do this more deliberately. Perhaps the first was the Reverend William Dallinger. Just over 20 years after Darwin published his theory of evolution, Reverend Dallinger examined whether single-celled organisms could adapt to slow changes in their environment. He started with an incubator filled with microbes that could initially only survive at room temperature. Over six years, the Reverend slowly increased the temperature inside the incubator to 158 degrees F to see whether the microbes would adapt.

More recent versions of this experiment are being carried out by many scientists, including Tim Cooper at the University of Houston, and Richard Lenski at Michigan State University. In 1988, Lenski started growing bacteria, giving them just enough food to survive from day to day. He has been observing 12 different lines of bacteria ever since. This amounts to over 60,000 bacterial generations. At roughly 25 years per human generation, that is equivalent to about 1,500,000 human years – longer than our species has been around.

The results of this experiment are giving extraordinary insights into how life changes and adapts. The 12 different lines of bacteria have all evolved to thrive on their meager diets. Looking at their genes reveals that they have often used the same tricks – the same mutations – to achieve this. But in one line something unexpected happened. The bacteria started feeding in a completely new way, a change similar to us evolving the ability to eat wood.

So the churn of mutations, and transfer of genes, keeps creating variants of organisms that have never before existed. Most quickly disappear. A few succeed and create offspring that inherit their parents’ characteristics. And so – as Darwin wrote at the end of his Origin of Species – “from so simple a beginning endless forms most beautiful and most wonderful have been, and are being, evolved.”

Additional notes:

We have become so used to antibiotics that it is hard to imagine what medicine would look like without them. Antibiotics are not only used to treat infections, but also make many other medical treatments possible by preventing infections in the first place. Here evolution works against us – medicine without antibiotics will look very different indeed.

Here are more details about how Reverend Dallinger, a devout Methodist, came to accept evolution.

Some of Richard Lenski’s reflections on his long running experiment.