Phil Birnbaum's "bad regression" puzzles
If you've ever wanted to see someone painstakingly deconstruct a regression analysis and show all the subtle reasons it can generate wild, weird and completely wrong results, there is good stuff at Sabermetric Research - Phil Birnbaum. It's a sports blog, but sports knowledge isn't needed (knowledge of regression analysis generally is, if you want to follow all the details).
Birnbaum's not exactly the only person to do takedowns of bad studies. But when Birnbaum notices that something is "off," he doesn't just point it out and move on. He isn't satisfied with "This conclusion is implausible" or "This conclusion isn't robust to sensitivity analysis." He digs all the way to the bottom to understand exactly how a study got its wrong result. His deconstructions of bad regressions are like four-star meals or masterful jazz solos ... I don't want to besmirch them by trying to explain them, so if you're into regression deconstructions you should just click through the links below.
(I'm not going to explain what regression analysis is today, for which I apologize; if I ever do, I will link back to this post. It's very hard to explain it compactly and clearly, as you can see from Wikipedia's attempt, but it is VERY common in social science research. Kind of a bad combination IMO. If you hear "This study shows [something about people]," it's more likely than not that the study relies on regression analysis.)
Some good (old) ones:
- Estimating whether Aaron Rodgers's contract overpays or underpays him by making a scatterplot of pay and value-added for other quarterbacks and seeing whether he's above or below the regression line. The answer changes completely when you switch the x- and y-axes. Which one is right, and what exactly is wrong with the other one? (Birnbaum linked to the original post, but that link is now dead, so I am linking directly to an Internet Archive version. Birnbaum's "solution" is down in the comments; just search for his name.)
- Deconstructing an NBA time-zone regression: the key coefficients turn out to be literally meaningless.
- Do younger brothers steal more bases? Parts I, II, III, although I think it's OK to skip Part II (and Part I is short).
- The OBP/SLG regression puzzle: Parts I, II, III, IV, V. This one is very weedsy and you'll probably want to skip parts, though it's also kind of glorious to see just how doggedly he digs on every strange numerical result. He also makes an effort to explain what's going on for people who don't know baseball. The essence of the puzzle: OBP and SLG are both indicators of a team's performance, but when one regresses a team's performance on OBP and SLG, the ratio of the two coefficients is pretty far off from the "true" ratio (value of OBP / value of SLG), which is separately known. I think the issues here are extremely general, and not good news for the practice of treating regression coefficients as effect sizes.
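
For readers who want to see the axis-switching effect from the first puzzle in miniature, here is a rough sketch. Everything in it is an assumption made for illustration: the five "quarterbacks" and their numbers are invented, the names `fit_line` and `verdicts` are hypothetical helpers, and this is not Birnbaum's analysis or data. It only shows that regressing pay on value and regressing value on pay give two different lines through the same scatterplot, so one player can look overpaid by one and underpaid by the other.

```python
import numpy as np

# Invented numbers for five hypothetical quarterbacks (NOT real data).
value = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # value added, arbitrary units
pay = np.array([1.0, 3.0, 2.0, 5.0, 4.0])    # pay, arbitrary units


def fit_line(x, y):
    """Ordinary least squares fit of y on x; returns (slope, intercept)."""
    slope, intercept = np.polyfit(x, y, 1)
    return slope, intercept


def verdicts(player_value, player_pay):
    """Judge one player against each regression line.

    Returns (verdict from pay-on-value, verdict from value-on-pay).
    """
    # Regression 1: predict pay from value. Paid more than predicted -> overpaid.
    slope, intercept = fit_line(value, pay)
    predicted_pay = intercept + slope * player_value
    first = "overpaid" if player_pay > predicted_pay else "underpaid"

    # Regression 2: predict value from pay. Worth more than predicted -> underpaid.
    slope, intercept = fit_line(pay, value)
    predicted_value = intercept + slope * player_pay
    second = "underpaid" if player_value > predicted_value else "overpaid"

    return first, second


# A hypothetical star: top value, top pay.
print(verdicts(player_value=5.0, player_pay=5.0))
```

With these made-up numbers the two regressions disagree about the same player. The sketch does not say which verdict is right; that is the puzzle, and Birnbaum's answer is in the linked comments.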