Banning the use of common sense in data analysis increases cases of research failure: evidence from Sweden

Olle Folke writes:

I wanted to highlight a paper by an author who has previously been featured on your blog when he was one of the co-authors of a paper on the effect of strip clubs on sex crimes in New York. This paper looks at the effect of criminalizing the buying of sex in Sweden and finds a 40-60% increase in rapes. However, the paper is just as problematic as the one on strip clubs. In what I view as his two main specifications he uses the timing of the ban to estimate the effect. However, while there is no variation across regions, he uses regional data to estimate the effect, which of course does not make any sense. Not surprisingly, there is no adjustment for the dependence of the error term across observations.

What makes this analysis particularly weird is that there actually is no shift in the outcome if we use national data (see figure below). So basically the results must have been manufactured. As the author has not posted any replication files it is not possible to figure out what he has done to achieve the huge increase.

I think that his response to this critique is that he has three alternative estimation methods. However, these are not very convincing, and my suspicion is that none of those results would hold up to scrutiny. Also, I find the use of alternative methods both strange and problematic. First, it suggests that no single method is convincing in itself, and doing several additional problematic analyses does not make the first one better. Also, it gives the author an out when criticized, as it involves a lot of labor to work through each analysis (especially when there are no replication data).

I took a look at the linked paper, and . . . yeah, I’m skeptical. The article begins:

This paper leverages the timing of a ban on the purchase of sex to assess its impact on rape offenses. Relying on Swedish high-frequency data from 1997 to 2014, I find that the ban increases the number of rapes by around 44–62%.

But the above graph, supplied by Folke, does not show any apparent effect at all. The linked paper has a similar graph using monthly data that also shows nothing special going on in 1999:

This one’s a bit harder to read because of the two axes, the log scale, and the shorter time frame, but the numbers seem similar. In the time period under study, the red curve is around 5.0 on the log scale per month, and 12 × exp(5) ≈ 1780, while the annual curve is around 2000, so that seems to line up.

So, not much going on in the aggregate. But then the paper says:

Several pieces of evidence find that rape more than doubled after the introduction of the ban. First, Table 1 finds that the average before the ban is around 6 rapes per region and month, while after the introduction is roughly 12. Second, Table 2 presents the results of the naive analysis of regressing rape on a binary variable taking value 0 before the ban and 1 after, controlling for year, month, and region fixed effects. Results show that the post ban period is associated with an increase of around 100% of cases of rape in logs and 125% of cases of rape in the inverse hyperbolic sine transformation (IHS, hereafter). Third, a simple descriptive exercise –plotting rape normalized before the ban around zero by removing pre-treatment fixed effects– encounters that rape boosted around 110% during the sample period (Fig. 4).

OK, the averages don’t really tell us anything much at all: they’re looking at data from 1997-2014, the policy change happened in 1999, in the midst of a slow increase, and most of the change happened after 2004, as is clearly shown in Folke’s graph. So Table 1 and Table 2 are pretty much irrelevant.

But what about Figure 4:

This looks pretty compelling, no?

I dunno. The first thing is that the claim of “more than doubling” relies very strongly on the data after 2004. log(2) = 0.69, and if you look at that graph, the points only reach 0.69 around 2007, so the inference is leaning very heavily on the model by which the treatment causes a steady annual increase, rather than a short-term change in level at the time of the treatment. The other issue is the data before 1999, which are flat in this graph but showed an increasing trend in the two graphs earlier in this post. That makes a big difference in Figure 4! Replace that flat line pre-1999 with a positively-sloped line, and the story looks much different. Indeed, that line is soooo flat and right on zero that I wonder if this is an artifact of the statistical fitting procedure (“Pre-treatment fixed effects are removed from the data to normalize the number of rapes around zero before the ban.”). I’m not really sure. The point is that something went wrong.
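Here’s a rough simulation of how that kind of artifact could arise. This is my own invention, not the paper’s data or code, and it assumes one plausible reading of “removing pre-treatment fixed effects”: subtract each pre-ban year’s own mean from the pre-ban observations. Do that to a series with a smooth upward trend and no effect at all in 1999, and you get a pre-period that is flat at zero by construction, followed by a steady post-ban rise.

```python
# A rough simulation (my own, not the paper's data or code): a smooth upward trend
# with no treatment effect in 1999, then "pre-treatment fixed effects" removed by
# subtracting each pre-ban year's own mean.
import numpy as np

rng = np.random.default_rng(1)
years = np.repeat(np.arange(1997, 2015), 12)              # monthly observations, 1997-2014
t = np.arange(len(years)) / 12.0
y = 5 + 0.8 * t + rng.normal(0, 0.5, len(years))          # steady trend, no jump at 1999

pre_years = np.arange(1997, 1999)
y_norm = y.copy()
for yr in pre_years:                                      # subtract each pre-ban year's own mean
    y_norm[years == yr] -= y[years == yr].mean()
pre_mean = y[np.isin(years, pre_years)].mean()
y_norm[~np.isin(years, pre_years)] -= pre_mean            # center post-ban years by the pre-ban mean

for yr in (1997, 1998, 2005, 2014):
    print(yr, round(y_norm[years == yr].mean(), 2))
# Pre-ban years sit at zero by construction; post-ban years drift steadily upward,
# giving a "treatment effect" picture even though no effect was simulated.
```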

They next show their regression discontinuity model, which fits a change in level rather than slope:

There’s something else strange going on here: if they’re really fitting fixed effects for years, how can they possibly estimate a change over time? This is not making a lot of sense.
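To see why, here’s a toy example with made-up regional panel data (nothing from the paper): a post-ban dummy that is constant within calendar years is just the sum of the post-1999 year dummies, so the design matrix is rank-deficient and the treatment coefficient is not identified. Whatever estimate comes out reflects an arbitrary normalization by the software, or a coding error, not the data.

```python
# Toy example (made-up panel, not the paper's): with year fixed effects, a treatment
# dummy that is constant within years is perfectly collinear with the year dummies.
import numpy as np

regions = np.arange(21)
years = np.arange(1997, 2015)
R, Y = np.meshgrid(regions, years, indexing="ij")
year_obs = Y.ravel()

post = (year_obs >= 1999).astype(float)                           # ban dummy, same for every region in a year
year_dummies = (year_obs[:, None] == years[None, :]).astype(float)
X = np.column_stack([post, year_dummies])

print("rank:", np.linalg.matrix_rank(X), "columns:", X.shape[1])
# rank 18 vs 19 columns: "post" equals the sum of the 1999-2014 year dummies, so the
# treatment effect cannot be separated from the year fixed effects.
```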

I’m not going to go through all of this paper in detail; I just did the above quick checks in order to get a rough sense of what was going on, and to make sure I didn’t see anything immediately wrong with Folke’s basic analysis.

Folke continued:

The paper is even stranger than I had expected. I have gotten part of the regression code, and he is estimating models that would not give any estimates on the treatment if there were no coding error (treatment is constant within years but he includes year fixed effects). Also, when I do the RD analysis he claims he is doing, I get the figure below, in which there clearly is not a jump of 0.6 log points…

What the hell????

This one goes into the regression discontinuity hall of fame.

The next day, Folke followed up:

It took some digging and coding to figure out how the author was able to find such a large effect. We [Joop Adema, Olle Folke, and Johanna Rickne] have now written up a draft of a comment where we show that it is all based on a specification error, and that he ends up estimating something entirely different from what he claims to be estimating.

The big picture, or, how can this sort of error be avoided or its consequences mitigated?

Look, everybody makes mistakes. Statistical models are hard to fit and interpret, data can be a mess, and social science theories are vague enough that if you’re not careful you can explain just about anything.

Still, it looks like this paper was an absolute disaster and a bit of an embarrassment for the Journal of Population Economics, which published it.

Should the problems have been noticed earlier? I’d argue yes.

The problems with the regression discontinuity model—OK, we’re not gonna expect the author, reviewers, or editors of a paper to look too carefully at that—it’s a big ugly equation, after all—and we can’t expect author, reviewers, or editors to check the code—that’s a lot of work, right? Equations that don’t make sense, that’s just the cost of doing business.

The clear problem is the pattern in the aggregate data, the national time series that shows no jump in 1999.

I’m not saying that, just cos there’s no jump in 1999, that the policy had no effect. I’m just saying that the lack of jump in 1999 is right there for everyone to see. At the very least, if you’re gonna claim you found an effect, you’re under the scientific obligation to explain how you found that effect given the lack of pattern in the aggregate data. Such things can happen—you can have an effect that happens to be canceled out in the data by some other pattern at the same time—but then you should explain it, give that trail of breadcrumbs.

So, I’m not saying the author, reviewers, and editors of that paper should’ve seen all or even most of the problems with this paper. What I am saying is that they should’ve engaged with the contradiction between their claims and what was shown by the simple time series. To have not done this is a form of “scientism,” a kind of mystical belief in the output of a black box, a “believe the stats, not your lying eyes” kind of attitude.

Also, as Folke points out, the author of this paper has a track record of extracting dramatic findings using questionable data analysis.

I have no reason to think that the author is doing things wrong on purpose. Statistics is hard! The author’s key mistakes in these two papers have been:

1. Following a workflow in which contrary indications were ignored or set aside rather than directly addressed.

2. A lack of openness to the possibility that the work could be fatally flawed.

3. Various technical errors, including insufficient concern about data quality, a misunderstanding of regression discontinuity checks, and an inappropriate faith in robustness checks.

In this case, Adema, Folke, and Rickne did a lot of work to track down what went wrong in that published analysis. A lot of work for an obscure paper in a minor journal. But the result is a useful general lesson, which is why I’m sharing the story here.

The feel-good open science story versus the preregistration (who do you think wins?)

This is Jessica. Null results are hard to take. This may seem especially true when you preregistered your analysis, since technically you’re on the hook to own up to your bad expectations or study design! How embarrassing. No wonder some authors can’t seem to give up hope that the original hypotheses were true, even as they admit that the analysis they preregistered didn’t produce the expected effects. Other authors take an alternative route, one that deviates more dramatically from the stated goals of preregistration: they bury aspects of that pesky original plan and instead proceed under the guise that they preregistered whatever post-hoc analyses allowed them to spin a good story. I’ve been seeing this a lot lately.

On that note, I want to follow up on the previous blog discussion on the 2023 Nature Human Behavior article “High replicability of newly discovered social-behavioural findings is achievable” by Protzko, Krosnick, Nelson (of Data Colada), Nosek (of OSF), Axt, Berent, Buttrick, DeBell, Ebersole, Lundmark, MacInnis, O’Donnell, Perfecto, Pustejovsky, Roeder, Walleczek, and Schooler. It’s been about four months since Bak-Coleman and Devezer posted a critique that raised a number of questions about the validity of the claims the paper makes. 

This was a study that asked four labs to identify (through pilot studies) four effects for possible replication. The same lab then did a larger (n=1500) preregistered confirmation study for each of their four effects, documenting their process and sharing it with three other labs, who attempted to replicate it. The originating lab also attempted a self-replication for each effect. 

The paper presents analyses of the estimated effects and replicability across these studies as evidence that four rigor-enhancing practices used in the post-pilot studies (confirmatory tests, large sample sizes, preregistration, and methodological transparency) lead to high replicability of social psychology findings. The observed replicability is said to be higher than expected based on observed effect sizes and power estimates, and notably higher than prior estimates of replicability in the psych literature. All tests and analyses are described as preregistered, and, according to the abstract, the high replication rate they observe “justifies confidence in rigour-enhancing methods to increase the replicability of new discoveries.”

On the surface it appears to be an all-around win for open science. The paper has already been cited over fifty times. From a quick glance, many of these citing papers refer to it as if it provides evidence that open practices cause high replicability.

But one of the questions raised by Bak-Coleman and Devezer about the published version was about their claim that all of the confirmatory analyses they present were preregistered. There was no such preregistration in sight if you checked the provided OSF link. I remarked back in November that even in the best case scenario where the missing preregistration was found, it was still depressing and ironic that a paper whose message is about the value of preregistration could make claims about its own preregistration that it couldn’t back up at publication time. 

Around that time, Nosek said on social media that the authors were looking for the preregistration for the main results. Shortly after, Nature Human Behavior added a warning label indicating an investigation of the work:

warning added by NHB to the article 

And if I’m trying to tell the truth, it’s all bad

It’s been some months, and the published version hasn’t changed (beyond the added warning), nor do the authors appear to have made any subsequent attempts to respond to the critiques.  Given the “open science works” message of the paper, the high profile author list, and the positive attention it’s received, it’s worth discussing here in slightly more detail how some of these claims seem to have come about.

The original linked project repository has been updated with historical files since the Bak-Coleman and Devezer critique. By clicking through the various versions of the analysis plan, analysis scripts, and versions of the manuscript, we can basically watch the narrative about the work (and what was preregistered) change over time.

The first analysis plan is dated October 2018 by OSF and outlines a set of analyses of a decline effect (where effects decrease after an initial study) that differ substantially from the story presented in the published paper. This document first describes a data-collection process that splits each confirmation study and replication into two halves, with 750 observations collected first and the other 750 second. Each confirmation study and replication study is also assigned to either (a) analyze the first half-sample and then the second half-sample or (b) analyze the second half-sample and then the first half-sample.

There were three planned tests:

  1. Whether the effects statistically significantly increase or decrease depending on whether they belonged to the first or the second 750-observation half-sample;
  2. Whether the effect size of the originating lab’s self-replication study is statistically larger or smaller than that of the originating lab’s confirmation study;
  3. Whether effects statistically significantly decrease or increase across all four waves of data collection (all 16 effects, each with its confirmation study and four replications).

If you haven’t already guessed it, the goal of all this is to evaluate whether a supernatural-like effect resulted in a decreased effect size in whichever wave was analyzed second. It appears all this is motivated by hypotheses that some of the authors (okay, maybe just Schooler) felt were within the realm of possibility. There is no mention of comparing replicability in the original analysis plan, nor in the preregistered analysis code uploaded last December in a dump of historical files by James Pustejovsky, who appears to have played the role of a consulting statistician. This is despite the blanket claim that all analyses in the main text were preregistered, and despite these analyses being further described as confirmatory in the paper’s supplement.

The original intent did not go unnoticed by one of the reviewers (Tal Yarkoni) for Nature Human Behavior, who remarks: 

The only hint I can find as to what’s going on here comes from the following sentence in the supplementary methods: “If observer effects cause the decline effect, then whichever 750 was analyzed first should yield larger effect sizes than the 750 that was analyzed second”. This would seem to imply that the actual motivation for the blinding was to test for some apparently supernatural effect of human observation on the results of their analyses. On its face, this would seem to constitute a blatant violation of the laws of physics, so I am honestly not sure what more to say about this. 

It’s also clear that the results were analyzed in 2019. The first public presentation of results from the individual confirmation studies and replications can be traced to a talk Schooler gave at the Metascience 2019 conference in September, where he presents the results as evidence of an incline effect. The definition of replicability he uses is not the one used in the paper. 

Cause if you’re looking for the proof, it’s all there

There are lots of clues in the available files that suggest the main message about rigor-enhancing practices emerged as the decline effects analysis above failed to show the hypothesized effect. For example, there’s a comment on an early version of the manuscript (March 2020) where the multi-level meta-analysis model used to analyze heterogeneity across replications is suggested by James. This suggestion comes after data collection had been done and initial results analyzed, but the analysis is presented as confirmatory in the paper, with p-values and discussion of significance. As further evidence that it wasn’t preplanned, a historical file added more recently by James describes it as exploratory. It shows up later in the main analysis code with some additional deviations, no longer designated as exploratory. By the next version of the manuscript, it has been labeled a confirmatory analysis, as it is in the final published version.

This is pretty clear evidence that the paper is not accurately portraying itself. 

Similarly, various definitions of replicability show up in earlier versions of the manuscript: the rate at which the replication is significant, the rate at which the replication effect size falls within the confirmatory study CI, and the rate at which replications produce significant results for significant confirmatory studies. Those which produce higher rates of replicability relative to statistical power are retained and those with lower rates are either moved to the supplement, dismissed, or not explored further because they produced low values. For example, defining replicability using overlapping confidence intervals was moved to the supplement and not discussed in the main text, with the earliest version of the manuscript (Deciphering the Decline Effect P6_JEP.docx) justifying its dismissal because it “produced the ‘worst’ replicability rates” and “performs poorly when original studies and replications are pre-registered.” Statistical power is also recalculated across revisions to align with the new narrative.
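To get a feel for how much the choice of definition can matter, here’s a toy calculation with fabricated effect estimates and standard errors (nothing from the actual studies): the same set of original/replication pairs gives noticeably different “replicability” rates depending on which of those definitions you adopt.

```python
# Toy illustration (all numbers fabricated, nothing from the actual studies): the
# same collection of original/replication estimates yields different "replicability"
# rates depending on the definition.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_effects, se = 1000, 0.05
true = rng.normal(0.10, 0.05, n_effects)          # heterogeneous true effects
orig = true + rng.normal(0, se, n_effects)        # original (confirmation) estimates
rep = true + rng.normal(0, se, n_effects)         # replication estimates
z = stats.norm.ppf(0.975)

orig_sig = np.abs(orig / se) > z
rep_sig = np.abs(rep / se) > z

print("replication significant:             ", rep_sig.mean())
print("replication inside original's CI:    ", (np.abs(rep - orig) < z * se).mean())
print("significant rep. of significant orig.:", rep_sig[orig_sig].mean())
```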

In a revision letter submitted prior to publication (Decline Effect Appeal Letterfinal.docx), the authors tell the reviewers they’re burying the supernatural motivation for the study:

Reviewer 1’s fourth point raised a number of issues that were confusing in our description of the study and analyses, including the distinction between a confirmation study and a self-replication, the purpose and use of splitting samples of 1500 into two subsamples of 750, the blinding procedures, and the references to the decline effect. We revised the main text and SOM to address these concerns and improve clarity. The short answer to the purpose of many of these features was to design the study a priori to address exotic possibilities for the decline effect that are at the fringes of scientific discourse.

There’s more, like a file where they appear to have tried a whole bunch of different models in 2020, after the earliest provided draft of the paper, got some varying results, and never disclosed this in the published version or supplement (at least I didn’t see any mention). But I’ll stop there for now.

C’mon baby I’m gonna tell the truth and nothing but the truth

It seems clear that the dishonesty here was in service of telling a compelling story about something. I’ve seen things like this transpire plenty of times: the goal of getting published leads to attempts to find a good story in whatever results you got. Combined with the appearance of rigor and a good reputation, a researcher can be rewarded for work that on closer inspection involves so much post-hoc interpretation that the preregistration seems mostly irrelevant. It’s not surprising that the story here ends up being one that we would expect some of the authors to have faith in a priori. 

Could it be that the authors were pressured by reviewers or editors to change their story? I see no evidence of that. In fact, the same reviewer who noted the disparity between the original analysis plan and the published results encouraged the authors to tell the real story:  

I won’t go so far as to say that there can be no utility whatsoever in subjecting such a hypothesis to scientific test, but at the very least if this is indeed what the authors are doing, I think they should be clear about that in the main text, otherwise readers are likely to misunderstand what the blinding manipulation is supposed to accomplish, and are at risk of drawing incorrect conclusions

You can’t handle the truth, you can’t handle it

It’s funny to me how little attention the warning label or the multiple points raised by Bak-Coleman and Devezer (which I’m simply concretizing here) have drawn, given the zeal with which some members of the open science crowd strive to expose questionable practices in other work. My guess is this is because of the feel-good message of the paper and the reputation of the authors. The lack of attention seems selective, which is part of why I’m bringing up some details here. It bugs me (though doesn’t surprise me) to think that whether questionable practices get called out depends on who exactly is in the author list.

 

What do I care? Why should you? 

On some level, the findings the paper presents – that if you use large studies and attempt to eliminate QRPs, you can get a high rate of statistical significance – are very unsurprising. So why care if the analyses weren’t exactly decided in advance? Can’t we just call it sloppy labeling and move on?  

I care because if deception is occurring openly in papers published in a respected journal for behavioral research by authors who are perceived as champions of rigor, then we still have a very long way to go. Treating this paper as if it cleanly estimated the causal effect of rigor-enhancing practices is not, in my view, a win for open science. The authors’ lack of concern about labeling exploratory analyses as confirmatory, their attempt to spin the null findings from the intended study into a result about effects on replicability (even though the definition they use is unconventional and appears to have been chosen because it led to a higher value), and the seemingly selective summary of prior replication rates from the literature should be acknowledged as the paper accumulates citations. At this point months have passed and there have been no amendments to the paper, nor any admission by the authors that the published manuscript makes false claims about its preregistration status. Why not just own up to it?

It’s frustrating because my own methodological stance has been positively impacted by some of these authors. I value what the authors call rigor-enhancing practices. In our experimental work, my students and I routinely use preregistration, we do design calculations via simulations to choose sample sizes, we attempt to be transparent about how we arrive at conclusions. I want to believe that these practices do work, and that the open science movement is dedicated to honesty and transparency. But if papers like the Nature Human Behavior article are what people have in mind when they laud open science researchers for their attempts to rigorously evaluate their proposals, then we have problems.

There are many lessons to be drawn here. When someone says all the analyses are preregistered, don’t just take them at their word, regardless of their reputation. Another lesson, which I think Andrew has previously highlighted, is that researchers sometimes form alliances for the sake of impact with others who may have different views, and this can lead to compromised standards. Big collaborative papers where you can’t be sure what your co-authors are up to should make all of us nervous. Dishonesty is not worth the citations.

Writing inspiration from J.I.D. and Mereba.

Bayesian inference with informative priors is not inherently “subjective”

The quick way of saying this is that using a mathematical model informed by background information to set a prior distribution for logistic regression is no more “subjective” than deciding to run a logistic regression in the first place.

Here’s a longer version:

Every once in a while you get people saying that Bayesian statistics is subjective bla bla bla, so every once in a while it’s worth reminding people of my 2017 article with Christian Hennig, Beyond subjective and objective in statistics. Lots of good discussion there too. Here’s our abstract:

Decisions in statistical data analysis are often justified, criticized or avoided by using concepts of objectivity and subjectivity. We argue that the words ‘objective’ and ‘subjective’ in statistics discourse are used in a mostly unhelpful way, and we propose to replace each of them with broader collections of attributes, with objectivity replaced by transparency, consensus, impartiality and correspondence to observable reality, and subjectivity replaced by awareness of multiple perspectives and context dependence. Together with stability, these make up a collection of virtues that we think is helpful in discussions of statistical foundations and practice.

The advantage of these reformulations is that the replacement terms do not oppose each other and that they give more specific guidance about what statistical science strives to achieve. Instead of debating over whether a given statistical method is subjective or objective (or normatively debating the relative merits of subjectivity and objectivity in statistical practice), we can recognize desirable attributes such as transparency and acknowledgement of multiple perspectives as complementary goals. We demonstrate the implications of our proposal with recent applied examples from pharmacology, election polling and socio-economic stratification. The aim of the paper is to push users and developers of statistical methods towards more effective use of diverse sources of information and more open acknowledgement of assumptions and goals.
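To connect the quick version above to something concrete, here’s a minimal sketch (my own toy example, not from the article): the same logistic regression fit twice, once with an essentially flat prior, which is itself a choice, and once with a weakly informative normal prior. Both fits rest on modeling decisions; the prior just makes one more of them explicit.

```python
# Minimal sketch (toy data): maximum a posteriori logistic regression with an
# informative normal prior on the coefficients, versus the (nearly) flat-prior MLE.
# The point is only that the prior is one more modeling choice, like the choice of
# a logistic link in the first place.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = rng.binomial(1, expit(0.5 * x))
X = np.column_stack([np.ones(n), x])            # intercept + one predictor

def neg_log_posterior(beta, prior_sd):
    eta = X @ beta
    log_lik = np.sum(y * eta - np.log1p(np.exp(eta)))       # Bernoulli-logit likelihood
    log_prior = -0.5 * np.sum((beta / prior_sd) ** 2)        # normal(0, prior_sd) prior
    return -(log_lik + log_prior)

flat = minimize(neg_log_posterior, np.zeros(2), args=(1e6,)).x      # essentially the MLE
informative = minimize(neg_log_posterior, np.zeros(2), args=(2.5,)).x  # weakly informative prior
print("near-flat prior:", flat.round(2), " normal(0, 2.5) prior:", informative.round(2))
```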

Philip K. Dick’s character names

The other day I was thinking of some of the wonderful names that Philip K. Dick gave to his characters:
Joe Chip
Glen Runciter
Bob Arctor
Palmer Eldritch
Perky Pat

And, of course, Horselover Fat.

My personal favorite names from these stories are Ragle Gumm from Time out of Joint, and Addison Doug, the main character in an obscure spaceship/time-travel story from 1974.

I feel like it shows a deep confidence to give your characters this sort of name. As names, they’re off, but at the same time they’re just right in context. “Addison Doug,” indeed.

Some authors are good at titles, some are good at last lines, some are good at names. So many books, even great books, have character names that are boring or too cute or just fine, but no more than just fine. To come up with these distinctive names is a high-risk ploy that, when it works, adds something special to the whole story.

The contrapositive of “Politics and the English Language.” One reason writing is hard:

In his classic essay, “Politics and the English Language,” the political journalist George Orwell drew a connection between cloudy writing and cloudy content.

The basic idea was: if you don’t know what you’re saying, or if you’re trying to say something you don’t really want to say, then one strategy is to write unclearly. Conversely, consistently cloudy writing can be an indication that the writer ultimately doesn’t want to be understood.

In Orwell’s words:

[The English language] becomes ugly and inaccurate because our thoughts are foolish, but the slovenliness of our language makes it easier for us to have foolish thoughts.

He continues:

In our time, political speech and writing are largely the defence of the indefensible. Things like the continuance of British rule in India, the Russian purges and deportations, the dropping of the atom bombs on Japan, can indeed be defended, but only by arguments which are too brutal for most people to face, and which do not square with the professed aims of the political parties. Thus political language has to consist largely of euphemism, question-begging and sheer cloudy vagueness.

A few years ago I posted on this topic, drawing an analogy to cloudy writing in science. To be sure, much of the bad writing in science comes from researchers who have never learned to write clearly. Writing is hard!

But it’s not just that. A key problem with a lot of the bad science that we see featured in PNAS, Ted, NPR, Gladwell, Freakonomics, etc., is that the authors are trying to use statistical analysis and storytelling to do something they can’t do with their science, which is to draw near-certain conclusions from noisy data that can’t support strong conclusions. This leads to tortured constructions such as this from a medical journal:

The pair‐wise results (using paired‐samples t‐test as well as in the mixed model regression adjusted for age, gender and baseline BMI‐SDS) showed significant decrease in BMI‐SDS in the parents–child group both after 3 and 24 months, which indicate that this group of children improved their BMI status (were less overweight/obese) and that this intervention was indeed effective.

However, as we wrote in the results and the discussion, the between group differences in the change in BMI‐SDS were not significant, indicating that there was no difference in change in our outcome in either of the interventions. We discussed, in length, the lack of between‐group difference in the discussion section. We assume that the main reason for the non‐significant difference in the change in BMI‐SDS between the intervention groups (parents–child and parents only) as compared to the control group can be explained by the fact that the control group had also a marginal positive effect on BMI‐SDS . . .

Obv not as bad as political journalists in the 1930s defending Stalin’s purges or whatever; the point is that the author is in the awkward position of trying to use the ambiguities of language to say something while not quite saying it. Which leads to unclear and barely readable writing, not just by accident.

The writing and the statistics have to be cloudy, because if they were clear, the emptiness of the conclusions would be apparent.
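The statistical issue lurking in that quoted passage is the classic one that a “significant” within-group change in each arm tells you nothing about the between-group comparison, which is what matters. A toy simulation (made-up numbers, nothing to do with the actual trial) makes the point:

```python
# Toy simulation (fabricated data, not the actual trial): both arms improve over
# time, so each within-group paired test is "significant," yet there is no
# between-group difference in the change.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 100
common_improvement = -0.3                                  # both arms drift down over time

baseline_t = rng.normal(2.0, 0.8, n)
follow_t = baseline_t + common_improvement + rng.normal(0, 0.5, n)   # "treatment" arm
baseline_c = rng.normal(2.0, 0.8, n)
follow_c = baseline_c + common_improvement + rng.normal(0, 0.5, n)   # "control" arm

print(stats.ttest_rel(follow_t, baseline_t).pvalue)        # within-group: tiny p-value
print(stats.ttest_rel(follow_c, baseline_c).pvalue)        # within-group: tiny p-value
print(stats.ttest_ind(follow_t - baseline_t,
                      follow_c - baseline_c).pvalue)       # between-group change: not significant
```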

The problem

Orwell’s statement, when transposed to writing a technical paper, is that if you attempt to cover the gaps in your reasoning with words, this will typically yield bad writing. Indeed, if you’re covering the gaps in your reasoning with words, you’ll either have bad writing or dishonest writing, or both. In some important way, it’s a good thing that this sort of writing is so hard to follow; otherwise it could be really misleading.

Now let’s flip it around.

Often you will find yourself trying to write an article, and it will be very difficult to write it clearly. You’ll go around and around, and whatever you do, your written output will feel like the worst of both worlds: a jargon-filled mess, while at the same time being sloppy and imprecise. Try to make it more readable and it becomes even sloppier and harder to follow at a technical level; try to make it accurate and precise, and it reads like a complicated, uninterpretable set of directions.

You’re stuck. You’re in a bad place. And any direction you take makes the writing worse in some important way.

What’s going on?

It could be this: You’re trying to write something you don’t fully understand, you’re trying to bridge a gap between what you want to say and what is actually justified by your data and analysis . . . and the result is “Orwellian,” in the sense that you’re desperately using words to try to paper over this yawning chasm in your reasoning.

The solution

One way out of this trap is to follow what we could call Orwell’s Contrapositive.

It goes like this: Step back. Pause in whatever writing you’re doing. Pull out a new sheet of paper (or an empty document on the computer) and write, as directly as you can, in two columns. Column 1 is what you want to be able to say (the method is effective, the treatment saves lives, whatever); Column 2 is what is supported by your evidence (the method works better than a particular alternative in a particular setting, fewer people died in the treatment than the control group after adjusting this and that, whatever).

At that point, do the work to pull Column 2 to Column 1, or make concessions to reality to shift Column 1 toward Column 2. Do what it takes to get them to line up.

At this point, you’ve left the bad zone in which you’re trying to say more than you can honestly say. And the writing should then go much smoother.

That’s the contrapositive: if bad writing is a sign of someone trying to say the indefensible, then you can make your writing better by not trying to say the indefensible, either by expanding what is legitimately defensible or restricting what you’re trying to say.

Remember the folk theorem of statistical computing: When you have computational problems, often there’s a problem with your model. Orwell’s Contrapositive is a sort of literary analogy to that.

One reason writing is hard

To put it another way: One reason writing is hard is that we use writing to cover the gaps in our reasoning. This is not always a bad thing! On the way to the destination of covering these gaps is the important step of revealing these gaps. We write to understand. Writing has an internal logic that can protect us from (some) errors and gaps—if we let it, by reacting to the warning sign that the writing is unclear.

Hey! Here’s a study where all the preregistered analyses yielded null results but it was presented in PNAS as being wholly positive.

Ryan Briggs writes:

In case you haven’t seen this, PNAS (who else) has a new study out entitled “Unconditional cash transfers reduce homelessness.” This is the significance statement:

A core cause of homelessness is a lack of money, yet few services provide immediate cash assistance as a solution. We provided a one-time unconditional CAD$7,500 cash transfer to individuals experiencing homelessness, which reduced homelessness and generated net societal savings over 1 y. Two additional studies revealed public mistrust in homeless individuals’ ability to manage money and the benefit of counter-stereotypical or utilitarian messaging in garnering policy support for cash transfers. This research adds to growing global evidence on cash transfers’ benefits for marginalized populations and strategies to increase policy support. Although not a panacea, cash transfers may hasten housing stability with existing social supports. Together, this research offers a new tool to reduce homelessness to improve homelessness reduction policies.

Based on that, I was surprised to read the pre-registration documents and supplemental information and learn that literally none of the outcomes that the researchers pre-registered were significant. Even the variable that they chose to focus on (days homeless) was essentially the same in the 12 month follow up (0.18 vs 0.17) and, just eyeballing Table S3, it seems the differences were rarely large and not ever significant in any single follow up period.

This is now generating news coverage about how cash transfers work to reduce homelessness (e.g., here and here).

I guess in a sense pre-registration worked because we can see that they did not expect this and had to explore to find it, but what good does that do if the press just reports it all credulously?

I have mixed feelings on this one. On one hand, I don’t like the whole statistical-significance-thresholding thing: if the study found positive results, this could be worth reporting, even if the results are within the margin of error. This within-the-margin-of-error bit should just be mentioned in the news articles. On the other hand, if the researchers are rummaging around through their results looking for something big to report, then, yeah, these results will be massively biased upward.

So, from that perspective, maybe a good headline would not be, “Homeless people were given lump sums of cash. Their spending defied stereotypes” or “B.C. researchers studied how homeless people spent a $7,500 handout. Here’s what they found,” but rather something like, “Preliminary results from a small study suggest . . .”

But then we could step back and ask, How did this study get the press in the first place? I’m guessing PNAS is the reason. So let’s head to the PNAS paper. From the abstract:

Exploratory analyses showed that over 1 y, cash recipients spent fewer days homeless, increased savings and spending with no increase in temptation goods spending, and generated societal net savings of $777 per recipient via reduced time in shelters.

I guess that “exploratory analysis” is code for non-preregistered or non-statistically-significant. Either way, I think it’s irresponsible and statistically incorrect—although, regrettably, absolutely standard practice—to report this “$777” without any regularization or partial pooling toward zero. It’s a biased estimate, and the bias could be huge.
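For intuition, here’s a back-of-the-envelope sketch of what partial pooling toward zero does to a noisy estimate like that one. All the numbers except the $777 are invented for illustration; the study’s actual standard error isn’t in front of me.

```python
# Back-of-the-envelope shrinkage (invented standard error and prior, not the study's
# actual numbers): under a normal-normal model, the posterior mean pulls a noisy
# estimate toward zero in proportion to how noisy it is.
import numpy as np

estimate = 777.0       # reported point estimate of net savings per recipient
se = 1000.0            # hypothetical standard error (assumed, for illustration)
prior_sd = 300.0       # prior: true net savings probably within a few hundred dollars of zero

shrinkage = prior_sd**2 / (prior_sd**2 + se**2)
posterior_mean = shrinkage * estimate
posterior_sd = np.sqrt(1 / (1 / prior_sd**2 + 1 / se**2))

print(round(posterior_mean), round(posterior_sd))   # roughly 64 +/- 287: much less dramatic than 777
```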

On the other hand, Figure 1 of the paper looks very impressive! This figure displays 35 outcomes, almost all of which go in a positive direction (fewer days homeless, more days in stable housing, higher value of savings . . . all the way down to lower substance use severity, lower cost of all service use, and lower cost of shelter use). The very few negative outcomes were tiny compared to their uncertainty. If you look at Figure 1, the evidence looks overwhelming.

But Figure 1 does not seem like such a great summary of the data displayed elsewhere in the paper. Looking at Table 3, the good stuff all seems to be happening in the 1-month and 3-month followups, without much happening after 1 year.

Here’s what the authors wrote:

The preregistered analyses yielded null effects in cognitive and well-being outcomes, which could be due to the low statistical power from the small participant number in each condition or the possibility that any effect on cognition and well-being may take more than 1 mo to show up.

I agree that these null findings should be mentioned right up there in the abstract. They should also include the possibility that the treatment really has no consistent effect on these outcomes. It’s kinda lame to give all these alibis and never even consider that maybe there’s nothing going on.

What about the housing effects going away after a year? The authors write:

First, the cost of living is extremely high in Vancouver, and the majority of the cash was spent within the first 3 mo for most recipients. Second, while the cash provided immediate benefits, control participants eventually “caught up” over time.

On the other hand, here’s what they said about a different result:

By combining the two cash and two noncash conditions to increase statistical power, exploratory analyses showed that cash recipients showed higher positive affect at 1 mo and higher executive function at 3 mo. Based on debriefing, participants expressed that while they were initially happy with the cash transfer, moving out of homelessness into stable housing took substantial efforts and hard work in the first few months, which could explain the delayed effect on cognitive function.

They’ve successfully convinced me that they have the ability to explain any possible result they might find.

The thing that bothers me most about the paper is that the authors don’t seem to have wrestled with the ways in which their results seem to refute their theoretical framework. Their choice of what to preregister suggests that they were expecting to find large effects on cognitive and subjective well-being outcomes and then maybe, if they were lucky, they’d find some positive results on financial and housing outcomes. I guess their theory was that the money would give people a better take on life, which could then lead to material benefits. Actually, though, they found no benefits on the cognitive and subjective outcomes—when I say “no benefits,” I mean, yeah, really nothing, not just nothing statistically significant—but the money did seem to help people pay the rent for the first few months. That’s fine—there are worse things than giving low-income people some money to pay the rent!—it’s just a different story from what they’d started with. It’s less of a psychology story and more of an economics story. In any case, yeah, further study is required. I just think that they could get the most from their existing study if they thought more about what went wrong with their theory.

Hey—let’s collect all the stupid things that researchers say in order to deflect legitimate criticism

When rereading this post the other day, I noticed the post that came immediately before.

I followed the link and came across the delightful story of a researcher who, after one of his papers was criticized, replied, “We can keep debating this after 11 years, but I’m sure we all have much more pressing things to do (grants? papers? family time? attacking 11-year-old papers by former classmates? guitar practice?)” One of the critics responded with appropriate disdain, writing:

This comment exemplifies the proclivity of some authors to view publication as the encasement of work in a casket, buried deeply so as to never be opened again lest the skeletons inside it escape. But is it really beneficial to science that much of the published literature has become . . . a vast graveyard of undead theories?

I agree. To put it another way: Yes, ha ha ha, let’s spend our time on guitar practice rather than exhuming 11-year-old published articles. Fine—I’ll accept that, as long as you also accept that we should not be citing 11-year-old articles.

As is so often the case, the authors of published work are happy to get unthinking positive publicity and citations, but when anything negative comes in, they pull up the drawbridge.

From the perspective of the ladder of responses to criticism, the above behavior isn’t so bad: they’re not suing their critics or using surrogates to attack their critics or labeling anybody as suicide bombers or East German secret police; they’re just trying to laugh it off. From a scientific perspective, though, it’s still pretty bad to act as if there’s something wrong with discussing the flaws of a paper that’s still being cited, just cos it’s a decade old.

Putting together a list

Anyway, this made me think of a fun project, which is to list all the different ways that researchers try to avoid addressing legitimate criticism of their published work.

Here are a few responses we’ve seen. I won’t bother finding the links right now, but if we put together a good list, I can go back and provide references for all of them.

1. The corrections do not affect the main results of the paper. (Always a popular claim, even if the corrections actually do affect the main results of the paper.)

2. The criticism should be dismissed because the critics are obsessive/Stasi/terrorists, etc. (Recall the Javert paradox.)

3. The critics are jealous losers sniping at their betters. Or, if that doesn’t work, the critics are picking on unfortunate young researchers. (I don’t think it does any favors to researchers of any age to exempt their work from criticism.)

4. The criticism is illegitimate if it does not go through the peer-review process. (A hard claim to swallow given how the peer-review process is rigged against criticism of published papers.)

5. Criticism should be a discreet exchange between author and critic, with no public criticism. (But the people who claim to hold that attitude seem to have no problem when their work is cited or praised in a public way.)

The most common response to criticism seems to be to just ignore it entirely and hope it goes away. Unfortunately, that strategy often seems to work very well!

Jonathan Bailey vs. Stephen Wolfram

Key quote:

While there are definitely environments where using a ghostwriter is acceptable, academic publishing typically isn’t one of them.

The reason is simple: Using a ghostwriter on an academic paper entails having an author do significant work on the paper without receiving credit or having their work disclosed. This is broadly seen as a breach of authorship and an act of research misconduct unto itself.

Why are all these school cheating scandals happening?

Paul Alper writes:

While the national scene is all about woke, book banning and the like, apparently Columbia University is still dealing with the long-standing conundrum, the best method to teach kids how to read.

He’s referring to this news article, “Amid Reading Wars, Columbia Will Close a Star Professor’s Shop,” which begins:

Lucy Calkins ran a beloved — and criticized — center at Teachers College for four decades. It is being dissolved. . . .

Her curriculum had teachers conduct “mini-lessons” on reading strategies, but also gave students plenty of time for silent reading and freedom to choose their own books. Supporters say those methods empower children, but critics say they waste precious classroom minutes, and allow students to wallow in texts that are too easy.

Some of the practices she once favored, such as prompting children to guess at words using the first letter and context clues, like illustrations, have been discredited.

Over the past three years, several prominent school districts — including New York City, the nation’s largest — dropped her program, though it remains in wide use. . . .

Critics of her ideas, including some cognitive scientists and instructional experts, said her curriculum bypassed decades of settled research, often referred to as the science of reading. That body of research suggests that direct, carefully sequenced instruction in phonics, vocabulary building and comprehension is more effective for young readers than Dr. Calkins’s looser approach.

Alper writes:

This article did not at all mention anything about language specifics. I bring this up because my granddaughters are in a Minneapolis Spanish immersion primary school. Because Spanish is almost 100 per cent phonetic, and English is terrible in this regard, they spell and read better in Spanish than they do in English. The mechanics of learning to read, back in my day, were simple and devoid of theory or disagreement. You kept at it until you got it right. The “it” was English only, because no accommodation was made for special needs, immigrants, or the outside world in general.

I know some people at Teachers College but I’ve never encountered Prof. Calkins, nor have I ever looked at the literature on language teaching. So I got nothin’ on this one.

But I did reply that the above story isn’t half as bad as this one from a few years back, which I titled, “What’s the stupidest thing the NYC Department of Education and Columbia University Teachers College did in the past decade?” It involved someone who was found to be a liar, a cheat, and a thief, and then, with all that known, was hired for two jobs as a school principal! And then a Teachers College professor said, “We felt that on balance, her recommendations were so glowing from everyone we talked to in the D.O.E. that it was something that we just were able to live with.” This came out in the news after the principal in question was found to have “forged answers on students’ state English exams in April because the students had not finished the tests.” Quelle surprise, no? A liar/cheat/thief gets a new job doing the same thing and then does more lying and cheating (maybe no stealing that time, though).

Alper responded:

You wrote that in 2015, which is about the same time as this story, which made Fani Willis RICO famous:

Her most prominent case was her prosecution of the Atlanta Public Schools cheating scandal. Willis, an assistant district attorney at the time, served as lead prosecutor in the 2014 to 2015 trial of twelve educators accused of correcting answers entered by students to inflate the scores of state administered standardized tests.

SAT and all the others did not exist in my 1950 NYC school days, but I believe we did have the so-called Regents Exams and they are still around. It never crossed my mind that the scoring of those exams was not on the up and up. Was I being naive? Was there more honesty and/or less messing around back then and it was just not financially worth it?

Here’s my response:

1. This particular form of cheating sounds no easier or harder now than in the past.

2. In the past (i.e., somewhere between 1950 and 2015), tests were important for students but not so much for schools. So, yeah, students may have been motivated to cheat, but teachers and school administrators did not have any motivation, either to help students cheat or to massively cheat on their own. Nowadays, tests can be high stakes for the school administrators, and so, for some of them, cheating is worth the risk.

“Whistleblowers always get punished”

In one of our comment threads about how scholars and journalists should be thanking, not smearing, people who ask for replications, Allan Stam writes:

The corollary to all this, and closely related to Javert’s paradox, is the social law: Whistleblowers always get punished.

The Javert paradox, as regular readers will recall, goes like this: Suppose you find a problem with published work. If you just point it out once or twice, the authors of the work are likely to do nothing. But if you really pursue the problem, then you look like a Javert, that is, like an obsessive, a “hater,” someone who needs to “get a life.” It’s complicated, because some critics really do obsess over unimportant details.

On the other hand, details that are unimportant in themselves can be important as indicating bigger problems. For example, the Nudgelords hyped some junk science. In one way, that’s no big deal: everybody makes mistakes. But their lack of interest in their mistakes and their willingness to memory-hole these errors suggests a deeper problem, in that their workflow is lacking that important feedback loop that can allow themselves to identify places where their model for the world has failed. A lack of interest in confronting the failure of one’s model: that’s something that bothered me with so many Bayesians back in the early 1990s, motivating much of my work on posterior predictive checking, and it bothers me today.

The point is, sometimes to find the problems you have to look at the details in detail, which takes the sort of extra effort that can make you look obsessive—heck, maybe it is obsessive. But, so what? And, sure, sometimes a critic will be obsessive and also just be mistaken, and that’s annoying, but there’s little we can do except to try our best to respond to those mistaken criticisms when they arise.

Now back to Stam’s point.

I pretty much agree with what he’s saying: whistleblowing just about always seems to be a bad career move. The clarification I’d like to make is that the “punishment” received by a whistleblower is not necessarily anyone directly trying to punish anyone.

Here’s how it goes. Scholar A does something wrong—maybe it’s flat-out cheating, maybe it’s just bad work that the scholar doesn’t want anyone to re-examine, which, OK, that attitude is a form of cheating too (Clarke’s law!). Scholar B points out the problem.

At this point, no “whistleblowing” has happened. “Whistleblowing” occurs following two more steps: (1) Scholar A, instead of behaving properly by acknowledging and considering the criticism, evades it or flat-out lies about it; (2) Scholar B, instead of just letting this be the end, keeps on about it. I guess that even this is not necessarily whistleblowing. Also, the whistleblower has to be on the inside.

OK, so at this point it’s a negative-sum game. Scholar A can get the reputation of someone who does bad work and refuses to learn from mistakes. Scholar B can get the reputation of not being a team player. The more this goes on, the more both scholars are hurt. Even if the final consensus is close to Scholar B’s position, so that Scholar B has “won” the intellectual and social argument, it’s still likely to be a net loss, in that Scholar B gets some reputation as a difficult person. Conversely, even if Scholar A “wins” in the sense of there being a consensus judgment that the criticism was misguided, there can still be a vague cloud that hangs over Scholar A’s head.

Part of this whole net-loss thing arises because most academics get no negative coverage at all. In politics, any success brings some negative coverage, and getting into a fight can be worth it, by helping you stand out from the crowd. In academia, you want to be known for positive contributions. At least, in science academia. Humanities and some of social science seem different: there, I guess it’s more common for scholars to make their names through controversy.

Anyway, here’s my point. A scientific dispute involving claims of unethical behavior can easily end up hurting both sides. Even if nobody’s trying to punish a whistleblower, there are negative social consequences, and in that sense I think Stam is correct.

“I was left with an overwhelming feeling that the World Values Survey is simply a vehicle for telling stories about values . . .”

Dale Lehman writes:

My guess is that you are familiar with the World Values Survey – I was not until I saw it described in the Economist this week (August 12, 2023). It has probably been used in the careers of many academics and is a monumental effort to collect survey data about values from across the world over a long period of time (the latest wave of the survey includes around 130,000 respondents from at least 90 countries). With the caveat that I have no experience with this data and have not read anything about its methodological development, I am struck by what seems like a shoddy ill-conceived research effort. To begin with a minor thing that appeared in the Economist story, I’ve attached a screenshot of part of what appears in the print magazine (the online version interactively builds up this view so the print version is more complete but provides less context). I have an issue with the visualization – some might call it a quibble, I’d call it a major problem – and I can’t tell if the blame lies with The Economist or the WVS, but it is what first alerted me to this data. The change in values over time is shown by the line segments ending in a circle marker for the latest survey wave. Why didn’t they use arrows rather than a circle at one end? I think this is inexcusable – arrows invoke pre-attentive visual processing whereas the line segment/circles force me to constantly reassess the picture to understand how things are changing. In other words, the visual presented doesn’t work – arrows would be immensely better. I don’t believe that is just sloppiness – I think it reveals something more fundamental, and that is what really concerns me about the WVS.

Moving on in the graph, I am immediately struck by the dimensions of the graph. The methodology is described in detail on the WVS website (https://www.worldvaluessurvey.org/WVSContents.jsp) and I haven’t reviewed it in detail. But I have a number of issues about these measurements. Among these:

– The survival vs. self-expression dimension strikes me as unintuitive. Since the questions involved (such as the importance of religion vs the importance of environmental protection) are linked to wealth, and much of the WVS research concerns changes in values as wealth changes, why not measure wealth directly? My preference would be for the more unambiguous (relatively speaking) measures like GDP rather than these derived measures, which seem vague to me.

– I have similar issues with the other dimension: traditional vs secular-rational. Neither of these seem intuitive to me and the underlying questions don’t improve things. There are questions about “national pride” and self descriptions of whether or not someone feels “very happy.” I find it very difficult to see how these map cleanly into the dimension they are being used for.

– Since these surveys are done across many countries and over time, I think the meaning of the words may change. For example, asking whether “people are trustworthy” requires the idea of “trust” to mean the same thing in different places and different periods of time. I see no evidence of this and can imagine that there might be differences in how people interpret phrases like that. In general, it seems to me that the wording of these survey questions was not carefully thought out or tested (though perhaps I just am not familiar with their development).

– I am disturbed by the use of single points to represent entire countries. Indeed, there is considerable discussion of how heterogeneous countries are, but the graphs use average measures to represent entire countries. As with many things, the average may be less interesting than the variability. This concern is accentuated by the aggregation of these countries into groups such as “Protestant Europe” and “Orthodox Europe.” I don’t find these groups particularly intuitive either.

– I’m unconvinced that the two-dimensional picture of values is the best way to analyze values. Are these two dimensions the most important? Why two? Perhaps the changes over time simply reveal how valid the dimensions are rather than any intrinsic changes in the values people hold.

There is more, but I’ll stress again that I have no background with this data. I can say it was difficult for me to even read the Economist article since almost every statement struck me as troublesome regarding what was being measured and how it relates to the fundamental methodology of this two dimensional view of values. I also can’t tell how much of my concern lies with the Economist article or the WVS itself. But I was left with an overwhelming feeling that the WVS is simply a vehicle for telling stories about values and how they differ between countries or groups of people and how these change over time. Those stories are naturally interesting, but I don’t see that the methodology and data support any particular story over any other. It seems like a perfect mechanism for academic career development, but little else.

My reply: I’m not sure! I’ve never worked with the World Values Survey myself. Maybe some readers can share their thoughts?

Inspiring story from a chemistry classroom

From former chemistry teacher HildaRuth Beaumont:

I was reminded of my days as a newly qualified teacher at a Leicestershire comprehensive school in the 1970s, when I was given a group of reluctant pupils with the instruction to ‘keep them occupied’. After a couple of false starts we agreed that they might enjoy making simple glass ornaments. I knew a little about glass blowing so I was able to teach them how to combine coloured and transparent glass to make animal figures and Christmas tree decorations. Then one of them made a small bottle complete with stopper. Her classmate said she should buy some perfume, pour some of it into the bottle and give it to her mum as a Mother’s Day gift. ‘We could actually make the perfume too,’ I said. With some dried lavender, rose petals, and orange and lemon peel, we applied solvent extraction and steam distillation to good effect and everyone was able to produce small bottles of perfume for their mothers.

What a wonderful story. We didn’t do anything like this in our high school chemistry classes! Chemistry 1 was taught by an idiot who couldn’t understand the book he was teaching out of. Chemistry 2 was taught with a single-minded goal of teaching us how to solve the problems on the Advanced Placement exam. We did well on the exam and learned essentially zero chemistry. On the plus side, this allowed me to place out of the chemistry requirement in college. On the minus side . . . maybe it would’ve been good for me to learn some chemistry in college. I don’t remember doing any labs in Chemistry 2 at all!

Preregistration is a floor, not a ceiling.

This comes up from time to time, for example someone sent me an email expressing a concern that preregistration stifles innovation: if Fleming had preregistered his study, he never would’ve noticed the penicillin mold, etc.

My response is that preregistration is a floor, not a ceiling. Preregistration is a list of things you plan to do, that’s all. Preregistration does not stop you from doing more. If Fleming had followed a pre-analysis protocol, that would’ve been fine: there would have been nothing stopping him from continuing to look at his bacterial cultures.

As I wrote in comments to my 2022 post, “What’s the difference between Derek Jeter and preregistration?” (which I just added to the lexicon), you don’t preregister “the” exact model specification; you preregister “an” exact model specification, and you’re always free to fit other models once you’ve seen the data.

It can be really valuable to preregister, to formulate hypotheses and simulate fake data before gathering any real data. To do this requires assumptions—it takes work!—and I think it’s work that’s well spent. And then, when the data arrive, do everything you’d planned to do, along with whatever else you want to do.
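To make that concrete, here’s a minimal sketch in Python of what the fake-data step could look like; the sample size, effect size, and test here are made-up placeholders, not a recommendation for any particular design:

```python
import numpy as np

rng = np.random.default_rng(2024)

# Assumed design and effect size -- these are the preregistered assumptions.
n_per_group = 100       # hypothetical planned sample size
assumed_effect = 0.3    # hypothetical standardized effect size
n_sims = 1000

rejections = 0
for _ in range(n_sims):
    control = rng.normal(0.0, 1.0, n_per_group)
    treated = rng.normal(assumed_effect, 1.0, n_per_group)
    # The planned analysis: difference in means with a normal approximation.
    diff = treated.mean() - control.mean()
    se = np.sqrt(treated.var(ddof=1) / n_per_group + control.var(ddof=1) / n_per_group)
    if abs(diff / se) > 1.96:
        rejections += 1

print(f"Estimated power under the assumed effect: {rejections / n_sims:.2f}")
```

The exact numbers don’t matter; the point is that writing even a toy simulation like this forces you to commit to assumptions and an analysis plan before seeing any data, and nothing about it stops you from fitting other models once the real data arrive.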

Planning ahead should not get in the way of creativity. It should enhance creativity because you can focus your data-analytic efforts on new ideas rather than having to first figure out what defensible default thing you’re supposed to do.

Aaaand, pixels are free, so here’s that 2022 post in full:

“On the uses and abuses of regression models: a call for reform of statistical practice and teaching”: We’d appreciate your comments . . .

John Carlin writes:

I wanted to draw your attention to a paper that I’ve just published as a preprint: On the uses and abuses of regression models: a call for reform of statistical practice and teaching (pending publication I hope in a biostat journal). You and I have discussed how to teach regression on a few occasions over the years, but I think with the help of my brilliant colleague Margarita Moreno-Betancur I have finally figured out where the main problems lie – and why a radical rethink is needed. Here is the abstract:

When students and users of statistical methods first learn about regression analysis there is an emphasis on the technical details of models and estimation methods that invariably runs ahead of the purposes for which these models might be used. More broadly, statistics is widely understood to provide a body of techniques for “modelling data”, underpinned by what we describe as the “true model myth”, according to which the task of the statistician/data analyst is to build a model that closely approximates the true data generating process. By way of our own historical examples and a brief review of mainstream clinical research journals, we describe how this perspective leads to a range of problems in the application of regression methods, including misguided “adjustment” for covariates, misinterpretation of regression coefficients and the widespread fitting of regression models without a clear purpose. We then outline an alternative approach to the teaching and application of regression methods, which begins by focussing on clear definition of the substantive research question within one of three distinct types: descriptive, predictive, or causal. The simple univariable regression model may be introduced as a tool for description, while the development and application of multivariable regression models should proceed differently according to the type of question. Regression methods will no doubt remain central to statistical practice as they provide a powerful tool for representing variation in a response or outcome variable as a function of “input” variables, but their conceptualisation and usage should follow from the purpose at hand.

The paper is aimed at the biostat community, but I think the same issues apply very broadly at least across the non-physical sciences.

Interesting. I think this advice is roughly consistent with what Aki, Jennifer, and I say and do in our books Regression and Other Stories and Active Statistics.

More specifically, my take on teaching regression is similar to what Carlin and Moreno say, with the main difference being that I find that students have a lot of difficulty understanding plain old mathematical models. I spend a lot of time teaching the meaning of y = a + bx, how to graph it, etc. I feel that most regression textbooks focus too much on the error term and not enough on the deterministic part of the model. Also, I like what we say on the first page of Regression and Other Stories, about the three tasks of statistics being generalizing from sample to population, generalizing from control to treatment group, and generalizing from observed data to underlying constructs of interest. I think models are necessary for all three of these steps, so I do think that understanding models is important, and I’m not happy with minimalist treatments of regression that describe it as a way of estimating conditional expectations.
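As an example of the kind of exercise I have in mind, here’s a minimal sketch in Python (the intercept, slope, and noise level are all made up) that separates the deterministic part of the model from the error term:

```python
import numpy as np
import matplotlib.pyplot as plt

# Deterministic part of the model: y = a + b*x
a, b = 2.0, 0.5            # made-up intercept and slope
x = np.linspace(0, 10, 50)
y_det = a + b * x

# Add an error term to get simulated data
rng = np.random.default_rng(1)
y_obs = y_det + rng.normal(0, 1.0, size=x.shape)

plt.plot(x, y_det, label="y = a + bx (deterministic part)")
plt.scatter(x, y_obs, s=15, alpha=0.6, label="simulated data")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()
```

Seeing the line y = a + bx and the simulated scatter on the same plot is the sort of thing I want students comfortable with before we get to estimation or the error term.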

The first of these tasks is sampling inference, the second is causal inference, and the third refers to measurement. Statistics books (including my own) spend lots of time on sampling and causal inference, not so much on measurement. But measurement is important! For an example, see here.

If any of you have reactions to Carlin and Moreno’s paper, or if you have reactions to my reactions, please share them in comments, as I’m sure they’d appreciate it.

How often is there a political candidate such as Vivek Ramaswamy who is so much stronger in online polls than telephone polls?

Palko points to this news article, “The mystery of Vivek Ramaswamy’s rapid rise in the polls,” which states:

Ramaswamy’s strength comes almost entirely from polls conducted over the internet, according to a POLITICO analysis. In internet surveys over the past month — the vast majority of which are conducted among panels of people who sign up ahead of time to complete polls, often for financial incentives — Ramaswamy earns an average of 7.8 percent, a clear third behind Trump and DeSantis.

In polls conducted mostly or partially over the telephone, in which people are contacted randomly, not only does Ramaswamy lag his average score — he’s way back in seventh place, at just 2.6 percent.

There’s no singular, obvious explanation for the disparity, but there are some leading theories for it, namely the demographic characteristics and internet literacy of Ramaswamy’s supporters, along with the complications of an overly white audience trying to pronounce the name of a son of immigrants from India over the phone.

And then, in order for a respondent to choose Ramaswamy in a phone poll, he or she will have to repeat the name back to the interviewer. And the national Republican electorate is definitely older and whiter than the country as a whole: In a recent New York Times/Siena College poll, more than 80 percent of likely GOP primary voters were white, and 38 percent were 65 or older.

‘When your candidate is named Vivek Ramaswamy,’ said one Republican pollster, granted anonymity to discuss the polling dynamics, ‘that’s like DEFCON 1 for confusion and mispronunciation.’

Palko writes:

Keeping in mind that the “surge” was never big (it maxed out at 10% and has been flat since) and that we’re talking about fairly small numbers in absolute terms, here are some questions:

1. How much do we normally expect phone and online to agree?

2. Ramaswamy generally scores around 3 times higher online than with phone. Have we seen that magnitude before?

3. How about a difficult-name bias? Have we seen that before? How about Buttigieg, for instance? Did a foreign-sounding name hurt Obama in early polls?

4. Is the difference in demographics great enough to explain the difference? Aren’t things like gender and age normally reweighted?

5. Are there other explanations we should consider?

I don’t have any answers here, just one thought which is that it’s early in the campaign (I guess I should call it the pre-campaign, given that the primary elections haven’t started yet), and so perhaps journalists are reasoning that, even if this candidate is not very popular among voters, his active internet presence makes him a reasonable dark-horse candidate looking forward. An elite taste now but could perhaps spread to the non-political-junkies in the future? Paradoxically, the fact that Ramaswamy has this strong online support despite his extreme political stances could be taken as a potential sign of strength? I don’t know.

Conformal prediction and people

This is Jessica. A couple of weeks ago I wrote a post in response to Ben Recht’s critique of conformal prediction for quantifying uncertainty in a prediction. Compared to Ben, I am more open-minded about conformal prediction and associated generalizations like conformal risk control. Quantified uncertainty is inherently incomplete as an expression of the true limits of our knowledge, but I still often find value in trying to quantify it over stopping at a point estimate.

If expressions of uncertainty are generally wrong in some ways but still sometimes useful, then we should be interested in how people interact with different approaches to quantifying uncertainty. So I’m interested in seeing how people use conformal prediction sets relative to alternatives. This isn’t to say that I think conformal approaches can’t be useful without being human-facing (which is the direction of some recent work on conformal decision theory). I just don’t think I would have spent the last ten years thinking about how people interact and make decisions with data and models if I didn’t believe that they need to be involved in many decision processes. 

So now I want to discuss what we know from the handful of controlled studies that have looked at human use of prediction sets, starting with the one I’m most familiar with since it’s from my lab.

In Evaluating the Utility of Conformal Prediction Sets for AI-Advised Image Labeling, we study people making decisions with the assistance of a predictive model. Specifically, they label images with access to predictions from a pre-trained computer vision model. In keeping with the theme that real world conditions may deviate from expectations, we consider two scenarios: one where the model makes highly accurate predictions because the new images are from the same distribution as those that the model is trained on, and one where the new images are out of distribution. 

We compared their accuracy and the distance between their responses and the true label (in the Wordnet hierarchy, which conveniently maps to ImageNet) across four display conditions. One was no assistance at all, so we could benchmark unaided human accuracy against model accuracy for our setting. People were generally worse than the model in this setting, though the human with AI assistance was able to do better than the model alone in a few cases.

The other three displays were variations on model assistance, including the model’s top prediction with the softmax probability, the top 10 model predictions with softmax probabilities, and a prediction set generated using split conformal prediction with 95% coverage.
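For readers who haven’t seen the construction, here’s a minimal generic sketch of split conformal prediction for classification, plus the coverage check that goes with it. This is the textbook recipe with made-up function names, not our actual experiment code:

```python
import numpy as np

def conformal_prediction_sets(cal_probs, cal_labels, test_probs, alpha=0.05):
    """Split conformal prediction sets for classification.

    cal_probs:  (n_cal, K) softmax outputs on a held-out calibration set
    cal_labels: (n_cal,)   integer true labels for the calibration set
    test_probs: (n_test, K) softmax outputs for the new instances
    Returns one array of label indices (the prediction set) per test instance.
    """
    n = len(cal_labels)
    # Nonconformity score: one minus the softmax probability of the true label.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected (1 - alpha) quantile of the calibration scores.
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    qhat = np.quantile(scores, q_level, method="higher")
    # Keep every label whose score falls at or below the threshold.
    return [np.where(1.0 - p <= qhat)[0] for p in test_probs]

def empirical_coverage(pred_sets, true_labels):
    """Fraction of instances whose true label lands inside the prediction set."""
    return np.mean([label in s for s, label in zip(pred_sets, true_labels)])
```

Under exchangeability of the calibration and test data, sets built this way contain the true label with probability at least 1 − alpha; that is the guarantee that distribution shift can erode, which is part of what we wanted to probe with the out-of-distribution images.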

We calibrated the prediction sets we presented offline, not dynamically. Because the human is making decisions conditional on the model predictions, we should expect the distribution to change. But often we aren’t going to be able to calibrate adaptively because we don’t immediately observe the ground truth. And even if we do, at any particular point in time we could still be said to hover on the boundary of having useful prior information and steering things off course. So when we introduce a new uncertainty quantification to any human decision setting, we should be concerned with how it works when the setting is as expected and when it’s not, i.e., the guarantees may be misleading.

Our study partially gets at this. Ideally we would have tested some cases where the stated coverage guarantee for the prediction sets was false. But for the out-of-distribution images we generated, we would have had to do a lot of cherry-picking of stimuli to break the conformal coverage guarantee as much as the top-1 coverage broke. The coverage degraded a little but stayed pretty high over the entire set of out-of-distribution instances for the types of perturbations we focused on (>80%, compared to 70% for top 10 and 43% for top 1). For the set of stimuli we actually tested, the coverage for all three was a bit higher, with top-1 coverage getting the biggest bump (70%, compared to 83% for top 10 and 95% for conformal). Below are some examples of the images people were classifying (where easy and hard are based on the cross-entropy loss given the model’s predicted probabilities, and smaller and larger refer to the size of the prediction sets).

We find that prediction sets don’t offer much value over top-1 or top-10 displays when the test instances are iid, and they can reduce accuracy on average for some types of instances. However, when the test instances are out of distribution, accuracy is slightly higher with access to prediction sets than with either top-k. This was the case even though the prediction sets for the OOD instances get very large (the average set size for “easy” OOD instances, as defined by the distribution of softmax values, was ~17, for “hard” OOD instances it was ~61, with people sometimes seeing sets with over 100 items). For the in-distribution cases, average set size was about 11 for the easy instances, and 30 for the hard ones.  

Based on the differences in coverage across the conditions we studied, our results are more likely to be informative for situations where conformal prediction is used because we think it’s going to degrade more gracefully under unexpected shifts. I’m not sure it’s reasonable to assume we’d have a good hunch about that in practice though.

In designing this experiment in discussion with my co-authors, and thinking more about the value of conformal prediction to model-assisted human decisions, I’ve been thinking about when a “bad” (in the sense of coming with a misleading guarantee) interval might still be better than no uncertainty quantification. I was recently reading Paul Meehl’s clinical vs statistical prediction, where he contrasts clinical judgments  doctors make based on intuitive reasoning to statistical judgments informed by randomized controlled experiments. He references a distinction between the “context of justification” for some internal sense of probability that leads to a decision like a diagnosis, and the “context of verification” where we collect the data we need to verify the quality of a prediction. 

The clinician may be led, as in the present instance, to a guess which turns out to be correct because his brain is capable of that special “noticing the unusual” and “isolating the pattern” which is at present not characteristic of the traditional statistical techniques. Once he has been so led to a formulable sort of guess, we can check up on him actuarially. 

Thinking about the ways prediction intervals can affect decisions makes me think that whenever we’re dealing with humans, there’s potentially going to be a difference between what an uncertainty expression says and can guarantee and the value of that expression for the decision-maker. Quantifications with bad guarantees can still be useful if they change the context of discovery in ways that promote broader thinking or taking the idea of uncertainty seriously. This is what I meant when in my last post I said “the meaning of an uncertainty quantification depends on its use.” But precisely articulating how they do this is hard. It’s much easier to identify ways calibration can break.

There are a few other studies that look at human use of conformal prediction sets, but to avoid making this post even longer, I’ll summarize them in an upcoming post.

P.S. There have been a few other interesting posts on uncertainty quantification in the CS blogosphere recently, including David Stutz’s response to Ben’s remarks about conformal prediction, and on designing uncertainty quantification for decision making from Aaron Roth.

“Hot hand”: The controversy that shouldn’t be. And thinking more about what makes something into a controversy:

I was involved in a recent email discussion, leading to this summary:

There is no theoretical or empirical reason for the hot hand to be controversial. The only good reason for there being a controversy is that the mistaken paper by Gilovich et al. appeared first. At this point we should give Gilovich et al. credit for bringing up the hot hand as a subject of study and accept that they were wrong in their theory, empirics, and conclusions, and we can all move on. There is no shame in this for Gilovich et al. We all make mistakes, and what’s important is not the personalities but the research that leads to understanding, often through tortuous routes.

“No theoretical reason”: see discussion here, for example.

“No empirical reason”: see here and lots more in the recent literature.

“The only good reason . . . appeared first”: Beware the research incumbency rule.

More generally, what makes something a controversy? I’m not quite sure, but I think the news media play a big part. We talked about this recently in the context of the always-popular UFOs-as-space-aliens theory, which used to be considered a joke in polite company but now seems to have reached the level of controversy.

I don’t have anything systematic to say about all this right now, but the general topic seems very worthy of study.

“Here’s the Unsealed Report Showing How Harvard Concluded That a Dishonesty Expert Committed Misconduct”

Stephanie Lee has the story:

Harvard Business School’s investigative report into the behavioral scientist Francesca Gino was made public this week, revealing extensive details about how the institution came to conclude that the professor committed research misconduct in a series of papers.

The nearly 1,300-page document was unsealed after a Tuesday ruling from a Massachusetts judge, the latest development in a $25 million lawsuit that Gino filed last year against Harvard University, the dean of the Harvard Business School, and three business-school professors who first notified Harvard of red flags in four of her papers. All four have been retracted. . . .

According to the report, dated March 7, 2023, one of Gino’s main defenses to the committee was that the perpetrator could have been someone else — someone who had access to her computer, online data-storage account, and/or data files.

Gino named a professor as the most likely suspect. The person’s name was redacted in the released report, but she is identified as a female professor who was a co-author of Gino’s on a 2012 now-retracted paper about inducing honest behavior by prompting people to sign a form at the top rather than at the bottom. . . . Allegedly, she was “angry” at Gino for “not sufficiently defending” one of their collaborators “against perceived attacks by another co-author” concerning an experiment in the paper.

But the investigation committee did not see a “plausible motive” for the other professor to have committed misconduct by falsifying Gino’s data. “Gino presented no evidence of any data falsification actions by actors with malicious intentions,” the committee wrote. . . .

Gino’s other main defense, according to the report: Honest errors may have occurred when her research assistants were coding, checking, or cleaning the data. . . .

Again, however, the committee wrote that “she does not provide any evidence of [research assistant] error that we find persuasive in explaining the major anomalies and discrepancies.”

The full report is at the link.

Some background is here, also here, and some reanalyses of the data are linked here.

Now we just have to get to the bottom of the story about the shredder and the 80-pound rock and we’ll pretty much have settled all the open questions in this field.

We’ve already determined that the “burly coolie” story and the “smallish town” story never happened.

It’s good we have dishonesty experts. There’s a lot of dishonesty out there.

Abraham Lincoln and confidence intervals

This one from 2017 is good; I want to share it with all of you again:

Our recent discussion with mathematician Russ Lyons on confidence intervals reminded me of a famous logic paradox, in which equality is not as simple as it seems.

The classic example goes as follows: Abraham Lincoln is the 16th president of the United States, but this does not mean that one can substitute the two expressions “Abraham Lincoln” and “the 16th president of the United States” at will. For example, consider the statement, “If things had gone a bit differently in 1860, Stephen Douglas could have become the 16th president of the United States.” This becomes flat-out false if we do the substitution: “If things had gone a bit differently in 1860, Stephen Douglas could have become Abraham Lincoln.”

Now to confidence intervals. I agree with Rink Hoekstra, Richard Morey, Jeff Rouder, and Eric-Jan Wagenmakers that the following sort of statement, “We can be 95% confident that the true mean lies between 0.1 and 0.4,” is not in general a correct way to describe a classical confidence interval. Classical confidence intervals represent statements that are correct under repeated sampling based on some model; thus the correct statement (as we see it) is something like, “Under repeated sampling, the true mean will be inside the confidence interval 95% of the time” or even “Averaging over repeated samples, we can be 95% confident that the true mean lies between the lower and upper endpoints of the confidence interval.” Russ Lyons, however, felt the statement “We can be 95% confident that the true mean lies between 0.1 and 0.4,” was just fine. In his view, “this is the very meaning of ‘confidence.’”

This is where Abraham Lincoln comes in. We can all agree on the following summary:

A. Averaging over repeated samples, we can be 95% confident that the true mean lies between the lower and upper endpoints of the confidence interval.

And we could even perhaps feel that the phrase “confidence interval” implies “averaging over repeated samples,” and thus the following statement is reasonable:

B. “We can be 95% confident that the true mean lies between the lower and upper endpoints of the confidence interval.”

Now consider the other statement that caused so much trouble:

C. “We can be 95% confident that the true mean lies between 0.1 and 0.4.”

In a problem where the confidence interval is [0.1, 0.4], “the lower and upper endpoints of the confidence interval” is just “0.1 and 0.4.” So B and C are the same, no? No. Abraham Lincoln, meet the 16th president of the United States.

In statistical terms, once you supply numbers on the interval, you’re conditioning on it. You’re no longer implicitly averaging over repeated samples. Just as, once you supply a name to the president, you’re no longer implicitly averaging over possible elections.

So here’s what happened. We can all agree on statement A. Statement B is a briefer version of A, eliminating the explicit mention of replications because they are implicit in the reference to a confidence interval. Statement C does a seemingly innocuous switch but, as a result, implies conditioning on the interval, thus resulting in a much stronger statement that is not necessarily true (that is, in mathematical terms, is not in general true).
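A small simulation, with generic made-up numbers, shows what statement A is about:

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean, sigma, n, n_reps = 0.25, 1.0, 100, 10_000

covered = 0
for _ in range(n_reps):
    y = rng.normal(true_mean, sigma, n)
    se = y.std(ddof=1) / np.sqrt(n)
    lower, upper = y.mean() - 1.96 * se, y.mean() + 1.96 * se
    covered += (lower < true_mean < upper)

print(f"Coverage over repeated samples: {covered / n_reps:.3f}")  # close to 0.95
```

Roughly 95% of the intervals contain the true mean, which is statement A. But any single realized interval, say [0.1, 0.4], either contains the true mean or it doesn’t; once you condition on the endpoints you are making the stronger claim C.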

None of this is an argument over statistical practice. One might feel that classical confidence statements are a worthy goal for statistical procedures, or maybe not. But, like it or not, confidence statements are all about repeated sampling and are not in general true about any particular interval that you might see.

P.S. More here.

You probably don’t have a general algorithm for an MLE of Gaussian mixtures

Those of you who are familiar with Garey and Johnson’s 1979 classic, Computers and Intractability: a guide to the theory of NP-completeness, may notice I’m simply “porting” their introduction, including the dialogue, to the statistics world.

Imagine Andrew had tasked me and Matt Hoffman with fitting simple standard (aka isotropic, aka spherical) Gaussian mixtures rather than hierarchical models. Let’s say that Andrew didn’t like that K-means got a different answer every time he ran it, that K-means++ wasn’t much better, and that even using soft clustering (i.e., fitting the stat model with EM) didn’t let him replicate simulated data. Would we have something like Stan for mixtures? Sadly, no. Matt and I may have tried and failed. We wouldn’t want to go back to Andrew and say,

  1. “We can’t find an efficient algorithm. I guess we’re just too dumb.”

We’re computer scientists and we know about proving hardness. We’d like to say,

  2. “We can’t find an efficient algorithm, because no such algorithm is possible.”

But that would’ve been beyond Matt’s and my grasp, because, in this particular case, it would require solving the biggest open problem in theoretical computer science. Instead, it’s almost certain we would have come back and said,

  3. “We can’t find an efficient algorithm, but neither can all these famous people.”

That seems weak. Why would we say that? Because we could’ve proven that the problem is NP-hard. A problem is in the class P if it can be solved in polynomial time with a deterministic algorithm. A problem is in the class NP when there is a non-deterministic (i.e., infinitely parallel) algorithm to solve it in polynomial time. A problem is NP-hard if it’s at least as hard as every problem in NP (formally specified through reductions, a powerful CS proof technique that’s the basis of Gödel’s incompleteness theorem). An NP-hard problem often has a non-deterministic algorithm that makes a complete set of (exponentially many) guesses in parallel and then spends polynomial time on each one verifying whether or not it is a solution. A problem is NP-complete if it is NP-hard and a member of NP. Some well known NP-complete problems are bin packing, satisfiability in propositional logic, and the traveling salesman problem—there’s a big list of NP-complete problems.

Nobody has found a tractable algorithm to solve an NP-hard problem. When we (computer scientists) say “tractable,” we mean solvable in polynomial time with a deterministic algorithm (i.e., the problem is in P). The only known algorithms for NP-hard problems are exponential. Researchers have been working for the last 50+ years trying to prove that the class of P problems is strictly smaller than the class of NP problems, i.e., that P ≠ NP.

In other words, there’s a Turing Award waiting for you if you can actually turn response (3) into response (2).

In the case of mixtures of standard (spherical, isotropic) Gaussians there’s a short JMLR paper with a proof that maximum likelihood estimation is NP-hard.

And yes, that’s the same Tosh who was the first author of the “piranha” paper.

Ising models that are not restricted to be planar are also NP-hard.

What both these problems have in common is that they are combinatorial and require inference over sets. I think (though am really not sure) that one of the appeals of quantum computing is potentially solving NP-hard problems.
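Coming back to the opening complaint, the part about K-means giving a different answer on every run is easy to reproduce. Here’s a minimal sketch with simulated mixture data and a plain implementation of Lloyd’s algorithm (all numbers made up); different random initializations typically land in different local optima:

```python
import numpy as np

# Simulate data from a spherical Gaussian mixture with six well-separated components.
rng = np.random.default_rng(42)
means = np.array([[i * 6.0, j * 6.0] for i in range(3) for j in range(2)])
X = np.vstack([rng.normal(m, 1.0, size=(100, 2)) for m in means])

def lloyds_sse(X, k, seed, n_iter=100):
    """Run Lloyd's algorithm (K-means) from a random start; return final within-cluster SSE."""
    r = np.random.default_rng(seed)
    centers = X[r.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    return ((X - centers[labels]) ** 2).sum()

for seed in range(5):
    print(f"seed {seed}: within-cluster SSE = {lloyds_sse(X, k=6, seed=seed):.1f}")
```

The likelihood surface of the corresponding mixture model is multimodal in the same way, which is why EM restarts disagree too; the hardness result above says that, assuming P ≠ NP, there is no general and efficient fix.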

P.S. How this story really would’ve gone is that we would’ve told Andrew that some simple distributions over NP-hard problem instances lead to expected polynomial time algorithms, and we’d be knee-deep in the kinds of heuristics used to pack container ships efficiently.