
Context Matters for Size: Why External Validity Claims and Development Practice Don't Mix - Working Paper 336

Authors: Pritchett, L. and Sandefur, J.
Publication date: July 8th, 2013

In this working paper of the Center for Global Development, the authors examine how policymakers and practitioners should interpret the impact evaluation literature when presented with conflicting experimental and non-experimental estimates of the same intervention across varying contexts.

Louise Shaxson reviewed the paper and shared the following comments:

Pritchett and Sandefur (p3) note the existence of an impact evaluation paradigm described by four features:

  • “Evidence rankings that ignore external validity” (i.e. the ones that begin with a systematic review at the top and case studies at the bottom)
  • “Meta-analysis of the average effect of a vaguely-specified ‘intervention’ which likely varies enormously across contexts”
  • “Clustering evaluation resources in a few expensive studies in locations chosen for researchers’ convenience”
  • “The irresistible urge to formulate global policy recommendations.”

In the face of this, they argue that when considering what evidence to use for making policy recommendations, much greater attention should be paid to context and heterogeneity than to ideas of rigour and internal validity. The reason is their observation that “…the current state of the literature appears to suggest that external validity is a much greater threat to making accurate estimates of policy-relevant parameters than is structural bias undermining the internal validity of parameters derived from observational data” (p29).

They take issue with the idea that RCTs can be useful in a ‘planning’ approach to development[1] based on a ‘rigorous’ approach to evidence. They show that this approach is superficially attractive but logically incoherent: when extrapolating findings from one or more RCTs, external validity matters far more than internal validity in determining whether results are transferable to other contexts. “…once extrapolated from its exact context (where context includes everything), RCT estimates lose any claim to superior ‘rigor’” (p3: their emphasis).

They say that there are two trade-offs we need to make when considering what constitutes rigorous evidence (rigorous here being very broadly defined). First, experimental and non-experimental results across different contexts are in tension with each other: when assessing which is more ‘rigorous’, should we be more concerned with the internal validity of evidence from the ‘wrong’ context, or the external validity of evidence from the ‘right’ context? Second, there is a tension between equally well-identified results across contexts, because the true causal parameters may not be fully specified in the analysis and may not be homogeneous across contexts. This could result from non-random placement of RCTs (contexts being purposively chosen), or from a variable as simple as organisational capacity to implement the study. They cite an education study in Kenya to show that if two organisations implementing the same RCT differ in their institutional capacity to run the study, they may produce significantly different results: “the context of implementing organization determined the rigorous estimate of causal impact” (p31).

There are two specific conclusions for the consumers of experimental research: 

“Avoid strict rankings of evidence.  These can be highly misleading.  At a minimum, evidence rankings must acknowledge a steep trade-off between internal and external validity.”

“Learn to live without external validity.  Experimentation is indispensable to finding new solutions to development challenges….  Given the evidence of significant heterogeneity in underlying causal parameters…experimental methods in development economics appear better suited to contribute to a process of project management, evaluation, and refinement—rather than a (sic) elusive quest to zero in on a single set of universally true parameters or universally effective programs.” (p34)

Their overarching conclusion is that we should not view impact evaluations as the basis for transferring lessons to other contexts or for making more and better global policy prescriptions (in the way that Duflo and Banerjee suggest in Poor Economics). This doesn’t mean they should be abandoned—they argue that more and better RCTs are needed—but that they would be more effectively used within a ‘searcher’ paradigm of development, with practitioners and policymakers using RCTs to improve the learning of development organisations within their own processes.

[1] As opposed to a ‘searching’ approach to development (see Easterly, 2006). This also corresponds to the idea that new development economics is ‘an approach of “experimentation” which emphasizes adaptation to local context and a search for “best fit” rather than “best practice”’ (see Rodrik 2008 and Crook & Booth 2011, referenced in the paper).