One of the wisest things my PhD advisor said to me was “You need to think more and do less,” and this is probably good advice for many of us. It can seem much more productive to clean data or run regressions than to sit in silence trying to figure something out. But doing before thinking can quickly lead one down the path of hours, days, or even weeks of wasted effort.
The principle of thinking things through first applies to all stages of a project, including robustness checks. In most cases, performing robustness checks is not the same as trying every possible permutation of your regression specification and/or dataset. In fact, for most projects, estimating every possible combination of sample, fixed effects, and controls will quickly leave you feeling overwhelmed and exhausted (I tried this early on in my career, so I can vouch for this myself). Moreover, you are bound to get a weird result here and there if you are running dozens of regressions. Then you will be left wondering if it was a fluke or if your project has failed an important robustness check. (Yes, you can do multiple-hypothesis adjustments, but if you do not pare down the number of regressions ex ante, such adjustments will needlessly kill your statistical power.)
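To put rough numbers on the "weird result here and there" problem, here is a minimal sketch. It assumes the regressions are independent tests of true null hypotheses, which real robustness checks on the same data are not, so treat the output as an illustration rather than a calculation for any actual project. The function names are made up for this example, and Bonferroni is just one (conservative) adjustment among several.

```python
def familywise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_threshold(num_tests, alpha=0.05):
    """Per-test significance level that keeps the family-wise rate at or below alpha."""
    return alpha / num_tests


for num_tests in (1, 5, 20, 50):
    fwer = familywise_error_rate(num_tests)
    cutoff = bonferroni_threshold(num_tests)
    print(f"{num_tests:>2} tests: P(at least one fluke) = {fwer:.2f}, "
          f"Bonferroni per-test cutoff = {cutoff:.4f}")
```

The second column is why a stray significant (or insignificant) result is nearly guaranteed once you run dozens of regressions; the third is why adjusting for all of them after the fact costs so much power.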
So here are some broad tips for how to conduct robustness tests and empirical analysis in general. Not all these tips apply in all cases, but they should be a good starting point for many quasi-experimental empirical papers. Of course, if a referee has asked for a specific robustness check/specification that falls outside these guidelines, you should do it if feasible (or, at the very least, explain why it is a bad idea on ex ante grounds).
Always estimate differences in pre-treatment trends if you can. I am surprised how often I review papers where the authors simply assert that the treatment or treatment timing is as good as random but make no attempt to empirically evaluate differences in pre-treatment trends. Estimating these differences is something one should do at the very beginning of any diff-in-diff project, even if you are pretty sure the treatment is as good as randomly assigned. Do not forget that even researchers who run randomized controlled experiments test for covariate balance just to be sure randomization worked as intended!
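As a bare-bones sketch of what such a check computes: take the treated-minus-control gap in the outcome for each pre-treatment period and compare it to the gap in a reference period (here, the last period before treatment). If trends are parallel, these relative gaps should be close to zero. The function name `pretrend_gaps`, the tuple layout, and the toy numbers are all hypothetical. In an actual paper you would estimate the leads in a regression with appropriately clustered standard errors rather than eyeball raw gaps.

```python
from collections import defaultdict
from statistics import mean


def pretrend_gaps(rows, reference_period):
    """Treated-minus-control gap in each pre-period, relative to the reference period.

    rows: iterable of (period, in_treated_group, outcome), pre-treatment data only.
    """
    by_cell = defaultdict(list)
    for period, in_treated_group, outcome in rows:
        by_cell[(period, bool(in_treated_group))].append(outcome)

    def gap(period):
        return mean(by_cell[(period, True)]) - mean(by_cell[(period, False)])

    periods = sorted({period for period, _ in by_cell})
    reference_gap = gap(reference_period)
    return {period: gap(period) - reference_gap for period in periods}


# Toy data (made up): both groups rise by 0.5 per period, so trends are parallel
# even though the treated group starts at a higher level.
toy_rows = [
    (-3, True, 3.0), (-3, True, 3.2), (-3, False, 1.0), (-3, False, 1.2),
    (-2, True, 3.5), (-2, True, 3.7), (-2, False, 1.5), (-2, False, 1.7),
    (-1, True, 4.0), (-1, True, 4.2), (-1, False, 2.0), (-1, False, 2.2),
]

for period, relative_gap in pretrend_gaps(toy_rows, reference_period=-1).items():
    print(f"period {period}: relative gap = {relative_gap:+.3f}")
```

Note that a level difference between the groups is not a problem here; only a gap that changes over the pre-period is.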
Pick your preferred regression specification/sample before you know what your results are. Of course, you are not necessarily wedded to this sample/specification. Sometimes during the course of a project you find out something about your data you didn’t realize when starting out. But choosing one early on helps ensure that your regressions are on firm ground conceptually.
Your preferred specification should be based on the most natural sample of treated and control units for your study (e.g., counties in hurricane-prone states). This will often be the sample that includes the largest number of treated units and enough high-quality control units to estimate a credible counterfactual. For example, try not to pare down your control units so much that you have 3 times as many treated as control units. At the other extreme, it is also unlikely that having a sample with 10 times as many control as treated units will be more useful than something closer to a 1-to-1 ratio. If you have a panel dataset, your preferred specification should be based on a balanced panel.
Your preferred specification should not necessarily include the most comprehensive controls. Rather, think about which controls take care of the likely confounders and only include those in your preferred specification. For example, if you think that the timing of a policy is as good as random, stick with a specification that only has time and unit fixed effects and leave other time-varying controls for a robustness check. There is nothing more frustrating than seeing specifications with potentially endogenous or simply excessive controls and not knowing what the results look like without them!
To see if the results based on your preferred specification/sample are robust, change one thing at a time based on the most likely threats to your empirical approach. If your preferred specification does not include the most comprehensive set of controls, you can now make the controls more or less comprehensive. That said, not all sets of fixed effects/controls are reasonable. For example, if you have a panel diff-in-diff design, showing specifications without any time or unit fixed effects is unlikely to be informative.
If you have multiple outcomes, pick the most important outcome (or two) and show your robustness checks for that outcome. For less important outcomes, it should be fine to only show your preferred specification or, at the very least, way fewer robustness checks than for the main one(s).
If there is a smaller/larger set of treated units for which the estimated effect might be different, there is no need to replicate the same set of regressions as in your main sample. Show your preferred specification for these samples and maybe a few other key specifications, depending on how many other robustness checks you already have.
Avoid specifications with potentially endogenous controls, i.e., variables that are potentially influenced by the treatment itself. Whether or not your results change when such controls are included is not informative about the robustness of your estimates.
The broad guiding principle behind each robustness check should be that the check addresses a specific concern the reader might have about your study. Do not run a regression just because you can!
Consider adding some falsification/placebo exercises. Should the lead of your treatment variable be significant? Do you have so many predictor variables that one might worry about overfitting? Might the treatment be spilling over to nearby control units? Do you have so many instruments that you could have a many-weak-instruments problem? Are there some outcomes that should not be affected by the treatment? Are you less than 100% confident that your standard errors are calculated correctly? Address these concerns with a falsification exercise!
There are many possible placebo exercises: estimating the “effect” of treatment leads, generating many random predictor variables, estimating the “effect” of treatment on nearby control units, re-shuffling the treatment variable at random many times, etc. A key principle for choosing one is the same as for robustness checks in general: a placebo exercise should address a specific concern a reader might have about your study.
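Here is a minimal sketch of the "re-shuffle the treatment variable at random many times" exercise, using a simple difference in means as a stand-in for whatever estimator your paper actually uses. The function names and the toy data are invented for illustration. In a panel diff-in-diff you would typically reshuffle treatment across units (not observations) and re-run your preferred regression on each draw.

```python
import random
from statistics import mean


def diff_in_means(outcomes, treated):
    """Mean outcome of treated units minus mean outcome of control units."""
    treated_y = [y for y, d in zip(outcomes, treated) if d]
    control_y = [y for y, d in zip(outcomes, treated) if not d]
    return mean(treated_y) - mean(control_y)


def permutation_p_value(outcomes, treated, num_draws=2000, seed=0):
    """Share of reshuffled 'treatments' with an estimate at least as large as the real one."""
    rng = random.Random(seed)  # fixed seed so the exercise is reproducible
    actual = diff_in_means(outcomes, treated)
    labels = list(treated)
    at_least_as_extreme = 0
    for _ in range(num_draws):
        rng.shuffle(labels)  # keeps the number of treated units fixed
        if abs(diff_in_means(outcomes, labels)) >= abs(actual):
            at_least_as_extreme += 1
    # Add one to numerator and denominator so the p-value is never exactly zero.
    return actual, (at_least_as_extreme + 1) / (num_draws + 1)


# Toy data (made up): five treated units followed by five control units.
toy_outcomes = [5.1, 4.8, 5.5, 5.0, 5.3, 3.9, 4.1, 4.0, 3.8, 4.2]
toy_treated = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

estimate, p_value = permutation_p_value(toy_outcomes, toy_treated)
print(f"estimate = {estimate:.2f}, permutation p-value = {p_value:.4f}")
```

The specific concern this addresses is that conventional standard errors might be off: if the placebo "effects" are frequently as large as your real estimate, the real one is not as unusual as its stars suggest.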
Remember that not all robustness check “failures” are failures. If you oversaturate your model with too many fixed effects, you may lose a lot of useful variation and make your estimated treatment effect very noisy. If your key outcome variable has a long, thin right tail, then running regressions using its levels may also get you a whole lot of noise. If you throw in an endogenous control (aka “bad control”) thinking you’re doing a robustness check, the resulting “treatment” effect will be biased and may even flip signs. If treatment assignment is conditional on a covariate and the results disappear when you do not control for that covariate, that does not mean your results are not robust. The key question is: is there a good reason a particular check did not work out? Thinking through your robustness checks before running them will help you minimize the possibility of ex post rationalizing!
Finally, when evaluating the results of your robustness checks, do not just pay attention to the number of stars. Look at the point estimates and the standard errors. If a robustness check results in the same point estimates but larger standard errors, that may be perfectly fine, depending on what you changed. For example, if you threw out 90% of your data and your stars disappeared, I wouldn’t worry too much about it. By contrast, if your point estimate falls because of a reasonable robustness check and your standard errors are sufficiently small to rule out your preferred point estimate with 95% confidence, that is a very bad sign.
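The two cases above can be sketched in a few lines. This is a rough heuristic under stated assumptions: it uses a normal-approximation 95% interval and treats your preferred point estimate as a fixed number, ignoring its own sampling error and the correlation between the two estimates. The function names and the example numbers are hypothetical.

```python
def confidence_interval(estimate, std_error, z=1.96):
    """Normal-approximation confidence interval (z = 1.96 gives roughly 95%)."""
    return estimate - z * std_error, estimate + z * std_error


def rules_out(preferred_estimate, check_estimate, check_std_error, z=1.96):
    """True if the robustness check's interval excludes the preferred point estimate."""
    low, high = confidence_interval(check_estimate, check_std_error, z)
    return not (low <= preferred_estimate <= high)


preferred = 0.10  # hypothetical preferred point estimate

# Check A: same point estimate, much larger standard error (stars disappear).
print("Check A rules out preferred estimate:", rules_out(preferred, 0.10, 0.08))

# Check B: point estimate falls and the standard error is small.
print("Check B rules out preferred estimate:", rules_out(preferred, 0.02, 0.02))
```

Check A loses its stars because its interval includes zero, but it also comfortably includes the preferred estimate, so it is uninformative rather than damning. Check B is the worrying one.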
There are likely other useful tips for doing robustness checks out there. If you have one, leave it in the comments below and I’ll update this post as needed!