We Have to Be Discrete About This: A Non-Parametric Imputation Technique for Missing Categorical Data

Published online by Cambridge University Press:  19 July 2012

Abstract

Missing values are a frequent problem in empirical political science research. Surprisingly, the match between the measurement of the missing values and the correcting algorithms applied is seldom studied. While multiple imputation is a vast improvement over the deletion of cases with missing values, it is often unsuitable for imputing highly non-granular discrete data. We develop a simple technique for imputing missing values in such situations, which is a variant of hot deck imputation, drawing from the conditional distribution of the variable with missing values to preserve the discrete measure of the variable. This method is tested against existing techniques using Monte Carlo analysis and then applied to real data on democratization and modernization theory. Software for our imputation technique is provided in a free, easy-to-use package for the R statistical environment.
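The general idea behind hot deck imputation for a discrete variable can be sketched in a few lines. The following is a minimal illustration only, not the authors' R package or their exact algorithm: the function names `hot_deck_impute` and `multiple_hot_deck`, the use of `None` to mark a missing value, and the simple count-of-matching-covariates affinity rule are all assumptions made for this example. What it shows is the property the abstract emphasizes: every imputed value is drawn from values actually observed among similar cases, so the discrete measurement of the variable is preserved.

```python
import random


def hot_deck_impute(rows, target, seed=0):
    """Fill missing (None) values of `target` by drawing from donor rows.

    A donor is any row in which `target` is observed. For each recipient,
    the donors that agree with it on the largest number of other variables
    form the donor pool, and one of their observed values is drawn at
    random. The input rows are not modified.
    """
    rng = random.Random(seed)
    donors = [r for r in rows if r[target] is not None]
    if not donors:
        raise ValueError("no observed values to draw from")

    completed = []
    for row in rows:
        if row[target] is not None:
            completed.append(dict(row))
            continue

        # Illustrative affinity: number of observed covariates on which
        # the donor matches the recipient exactly.
        def affinity(donor):
            return sum(
                1
                for key, value in row.items()
                if key != target and value is not None and donor.get(key) == value
            )

        scores = [affinity(d) for d in donors]
        best = max(scores)
        pool = [d[target] for d, s in zip(donors, scores) if s == best]

        filled = dict(row)
        filled[target] = rng.choice(pool)  # always an observed category
        completed.append(filled)
    return completed


def multiple_hot_deck(rows, target, m=5, seed=0):
    """Return m completed copies of the data, each with independent draws."""
    return [hot_deck_impute(rows, target, seed=seed + i) for i in range(m)]
```

Each of the `m` completed datasets would then be analysed separately and the estimates combined, as in any multiple imputation workflow. A model-based imputation of a five-point scale can return a value such as 2.37, whereas a draw from the donor pool cannot.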

Type
Articles
Copyright
Copyright © Cambridge University Press 2012

Footnotes

*

Department of Political Science, University of North Carolina; and Department of Political Science, Washington University (email: jgill@wustl.edu), respectively. The authors wish to thank Micah Altman, James Fowler, Katie Gan, Adam Glynn, Justin Grimmer, Dominik Hangartner, Michael Kellerman, Gary King, Ryan Moore and Randolph Siverson for valuable comments. Replication data is available at http://www.unc.edu/~skylerc/.

References

1 The term ‘missing data’ can mean either missing values (e.g. item non-response in a survey) or missing observations such as refusal to take an entire survey. Throughout this work, we use the term exclusively to mean the first case.

2 Taagepera, Rein and Shugart, Matthew Soberg, Seats and Votes: The Effects and Determinants of Electoral Systems (New Haven, Conn.: Yale University Press, 1989).

3 Peter Mair and Ingrid van Biezen, ‘Party Membership in Twenty European Democracies, 1980–2000’, Party Politics, 7 (2001), 5–21.

4 Palmer, Harvey D. and Whitten, Guy D., ‘The Electoral Impact of Unexpected Inflation and Economic Growth’, British Journal of Political Science, 29 (1999), 623–639.

5 Reiter, Dan, ‘Does Peace Nurture Democracy?’ Journal of Politics, 63 (2001), 935–948.

6 Tsiatis, Anastasios A., Semiparametric Theory and Missing Data (New York: Springer, 2010).

Enders, Craig K., Applied Missing Data Analysis (New York: The Guilford Press, 2010).

Tan, Ming T., Tian, Guo-Liang and Ng, Kai Wang, Bayesian Missing Data Problems: EM, Data Augmentation and Noniterative Computation (New York: Chapman & Hall/CRC, 2009).

Molenberghs, Geert and Kenward, Michael G., Missing Data in Clinical Studies (New York: Wiley, 2007).

McKnight, Patrick E., McKnight, Katherine M., Sidani, Souraya and Figueredo, Aurelio Jose, Missing Data: A Gentle Approach (New York: The Guilford Press, 2007).

7 Rees, Phil H. and Duke-Williams, Oliver, ‘Methods for Estimating Missing Data on Migrants in the 1991 British Census’, International Journal of Population Geography, 3 (1997), 323–368.

8 Rees and Duke-Williams, ‘Methods for Estimating Missing Data on Migrants in the 1991 British Census’.

9 Roderick J. A. Little and Donald B. Rubin, Statistical Analysis with Missing Data, 2nd edn (New York: Wiley, 2002), p. 42.

10 Allison, Paul D., Missing Data (Thousand Oaks, Calif.: Sage, 2001).

Little, Roderick J. A., ‘Regression with Missing X's: A Review’, Journal of the American Statistical Association, 87 (1992), 1227–1237.

Little, Roderick J. A., ‘Approximately Calibrated Small Sample Inference about Means from Bivariate Normal Data with Missing Values’, Computational Statistics & Data Analysis, 7 (1988), 161–178.

Rubin, Donald B., ‘Inference and Missing Data (with Discussion)’, Biometrika, 63 (1976), 581–592.

King, Gary, Honaker, James, Joseph, Anne and Scheve, Kenneth, ‘Analyzing Incomplete Political Science Data: An Alternative Algorithm for Multiple Imputation’, American Political Science Review, 95 (2001), 49–69.

11 Honaker, James and King, Gary, ‘What to Do about Missing Values in Time-Series Cross-Section Data’, American Journal of Political Science, 54 (2010), 561–581.

12 Rubin, ‘Inference and Missing Data’; King, Honaker, Joseph and Scheve, ‘Analyzing Incomplete Political Science Data’; Little and Rubin, Statistical Analysis with Missing Data.

13 Little and Rubin, Statistical Analysis with Missing Data, p. 12.

14 King, Honaker, Joseph and Scheve, ‘Analyzing Incomplete Political Science Data’.

15 Gelman, Andrew and Hill, Jennifer, Data Analysis Using Regression and Multilevel/Hierarchical Models (New York: Cambridge University Press, 2007).

16 Bailar, John C. III and Bailar, Barbara A., ‘Comparison of the Biases of the “Hot Deck” Imputation Procedure with an “Equal Weights” Imputation Procedure’, Symposium on Incomplete Data: Panel on Incomplete Data of the Committee on National Statistics, National Research Council, 1997, 422–47.

Cox, Brenda G., ‘The Weighted Sequential Hot Deck Imputation Procedure’, Proceedings of the Section on Survey Research Methods, American Statistical Association (1980), 721–6.

Rockwell, Richard C., ‘An Investigation of Imputation and Differential Quality of Data in the 1970 Census’, Journal of the American Statistical Association, 70 (1975), 39–42.

17 Rubin, Donald B., Multiple Imputation for Nonresponse in Surveys (New York: Wiley, 2004).

18 Rubin, ‘Inference and Missing Data’.

19 Rubin, Donald B., ‘Formalizing Subjective Notions about the Effect of Nonrespondents in Sample Surveys’, Journal of the American Statistical Association, 72 (1977), 538–543.

Rubin, Donald B., ‘Multiple Imputations in Sample Surveys: A Phenomenological Bayesian Approach to Nonresponse’, Proceedings of the Survey Research Methods Section of the American Statistical Association (1978), 20–34.

Rubin, Donald B. and Schenker, Nathaniel, ‘Multiple Imputation for Interval Estimation from Simple Random Samples with Ignorable Nonresponse’, Journal of the American Statistical Association, 81 (1986), 366–374.

Rubin, Donald B., ‘Statistical Matching Using File Concatenation with Adjusted Weights and Multiple Imputations’, Journal of Business and Economic Statistics, 4 (1986), 87–94.

Rubin, Donald B., Schafer, J. L. and Schenker, Nathaniel, ‘Imputation Strategies for Missing Values in Post-Enumeration Surveys’, Survey Methodology, 14 (1988), 209–221.

Rubin, Donald B., ‘Multiple Imputation after 18+ Years’, Journal of the American Statistical Association, 91 (1996), 473–489.

20 The combined $\bar{\theta}_{M}$ is in fact an average, but the treatment of the variability of this estimate is slightly more complicated than an average since it needs to account for within imputation variation and between imputation variation. The subject of multiple estimate combination will be discussed in some detail below. See Little and Rubin, Statistical Analysis with Missing Data, for a more detailed treatment.
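For reference, the standard combining rules for multiple imputation can be written out as follows. This is a sketch of the usual textbook form rather than a quotation from the article: the symbols $\hat{\theta}_m$ (the estimate from the $m$-th completed dataset), $U_m$ (its estimated variance), and $W$, $B$ and $T$ are notation assumed here for illustration.

```latex
\bar{\theta}_{M} = \frac{1}{M}\sum_{m=1}^{M}\hat{\theta}_{m},
\qquad
W = \frac{1}{M}\sum_{m=1}^{M} U_{m},
\qquad
B = \frac{1}{M-1}\sum_{m=1}^{M}\bigl(\hat{\theta}_{m}-\bar{\theta}_{M}\bigr)^{2},
\qquad
T = W + \Bigl(1+\frac{1}{M}\Bigr)B .
```

Here $W$ is the within-imputation variance, $B$ is the between-imputation variance, and $T$ is the total variance attached to $\bar{\theta}_{M}$. The factor $(1 + 1/M)$ inflates $B$ to reflect the fact that only a finite number of imputations is used.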

21 Kim, Jae Kwang, ‘Finite Sample Properties of Multiple Imputation Estimators’, Annals of Statistics, 32 (2004), 766–783.

Kim, Jae Kwang and Fuller, Wayne, ‘Fractional Hot Deck Imputation’, Biometrika, 91 (2004), 559–578.

Fuller, Wayne and Kim, Jae Kwang, ‘Hot Deck Imputation for the Response Model’, Statistics Canada, 31 (2005), 139–149.

22 Schafer, Joseph L., Analysis of Incomplete Multivariate Data (New York: Chapman & Hall/CRC, 1997).

23 King, Honaker, Joseph and Scheve, ‘Analyzing Incomplete Political Science Data’; Honaker and King, ‘What to Do about Missing Values in Time-Series Cross-Section Data’.

24 The articles describing the Amelia procedure have received over 330 ISI citations as of this writing.

25 Reilly, Marie, ‘Data Analysis Using Hot Deck Multiple Imputation’, The Statistician, 42 (1993), 307–313.

26 Kalton, Graham and Kish, Leslie, ‘Some Efficient Random Imputation Methods’, Communications in Statistics – Theory and Methods, 13 (1984), 1919–1939.

Fay, Robert E., ‘Alternative Paradigms for the Analysis of Imputed Survey Data’, Journal of the American Statistical Association, 91 (1996), 490–498.

27 Reilly, ‘Data Analysis Using Hot Deck Multiple Imputation’.

28 Reilly, ‘Data Analysis Using Hot Deck Multiple Imputation’.

29 For linguistic parsimony, we generally use the term ‘respondent’ below, but these methods are immediately applicable to datasets where the rows reflect any other type of observation.

30 Gower, J. C., ‘A General Coefficient of Similarity and Some of its Properties’, Biometrics, 27 (1971), 857–871.

31 Rosenbaum, Paul R. and Rubin, Donald B., ‘The Central Role of the Propensity Score in Observational Studies for Causal Effects’, Biometrika, 70 (1983), 41–55.

32 Kim, ‘Finite Sample Properties of Multiple Imputation Estimators’; Kim and Fuller, ‘Fractional Hot Deck Imputation’; Fuller and Kim, ‘Hot Deck Imputation for the Response Model’.

33 Kim, ‘Finite Sample Properties of Multiple Imputation Estimators’.

34 Little and Rubin, Statistical Analysis with Missing Data; Rubin, ‘Multiple Imputations in Sample Surveys’; Rubin, Multiple Imputation for Nonresponse in Surveys; Rubin, ‘Multiple Imputation after 18+ Years’.

35 Little and Rubin, Statistical Analysis with Missing Data.

36 Our software formats its output so that it can be used seamlessly with the R package Zelig; Kosuke Imai, Gary King and Olivia Lau, ‘Zelig: Everyone's Statistical Software’, Comprehensive R Archive Network (2006). This has the advantage of allowing the user to run, in a single line of code, a great variety of models on the multiply imputed datasets and have the combination handled automatically.

37 King, Honaker, Joseph and Scheve, ‘Analyzing Incomplete Political Science Data’; Honaker and King, ‘What to Do about Missing Values in Time-Series Cross-Section Data’.

38 Stef van Buuren, Jaap P. L. Brand, C. G. M. Groothuis-Oudshoorn and Donald B. Rubin, ‘Fully Conditional Specification in Multivariate Imputation’, Journal of Statistical Computation and Simulation, 76 (2006), 1049–1064.

Stef van Buuren, ‘Multiple Imputation of Discrete and Continuous Data by Fully Conditional Specification’, Statistical Methods in Medical Research, 16 (2007), 219–242.

39 Dempster, A. P., Laird, N. M. and Rubin, D. B., ‘Maximum Likelihood from Incomplete Data via the EM Algorithm’, Journal of the Royal Statistical Society, Series B, 39 (1977), 493–510.

40 We also ran experiments where the missing values were MCAR, but, as we would expect theoretically, no method was biased under those conditions.

41 Lipset, Seymour M., ‘Some Social Requisites of Democracy: Economic Development and Political Legitimacy’, American Political Science Review, 53 (1959), 69–105.

42 Cutright, Phillips, ‘National Political Development: Its Measurement and Social Correlates’, in Nelson W. Polsby, Robert A. Dentler and Paul A. Smith, eds, Politics and Social Life: An Introduction to Political Behavior (Boston, Mass.: Houghton Mifflin, 1963), 569–581.

Deutsch, Karl W., ‘Social Mobilization and Political Development’, American Political Science Review, 55 (1961), 493–510.

Dahl, Robert A., Polyarchy: Participation and Opposition (New Haven, Conn.: Yale University Press, 1971).

Burkhart, Ross E. and Lewis-Beck, Michael S., ‘The Economic Development Thesis’, American Political Science Review, 88 (1994), 903–910.

Londregan, John B. and Poole, Keith T., ‘Does High Income Promote Democracy?’ World Politics, 49 (1996), 1–30.

43 Przeworski, Adam, Democracy and the Market: Political and Economic Reforms in Eastern Europe (New York: Cambridge University Press, 1991).

Przeworski, Adam and Limongi, Fernando, ‘Political Regimes and Economic Growth’, Journal of Economic Perspectives, 7 (1993), 51–69.

Przeworski, Adam, Alvarez, Michael E., Cheibub, Jose A. and Limongi, Fernando, ‘What Makes Democracies Endure?’ Journal of Democracy, 7 (1996), 39–55.

Przeworski, Adam and Limongi, Fernando, ‘Modernization: Theories and Facts’, World Politics, 49 (1997), 155–183.

Przeworski, Adam, Alvarez, Michael E., Cheibub, Jose A. and Limongi, Fernando, Democracy and Development: Political Institutions and Well-Being in the World, 1950–1990 (New York: Cambridge University Press, 2000).

44 Przeworski, Alvarez, Cheibub and Limongi, Democracy and Development.

45 Boix, Carles, Democracy and Redistribution (New York: Cambridge University Press, 2002).

Boix, Carles and Stokes, Susan, ‘Endogenous Democratization’, World Politics, 55 (2003), 517–549.

Epstein, David L., Bates, Robert, Goldstone, Jack, Kristensen, Ida and O'Halloran, Sharyn, ‘Democratic Transitions’, American Journal of Political Science, 50 (2006), 551–569.

46 Przeworski, Alvarez, Cheibub and Limongi, Democracy and Development.

47 Przeworski, Alvarez, Cheibub and Limongi, Democracy and Development.

48 Przeworski, Alvarez, Cheibub and Limongi, Democracy and Development.

49 The true results are true to the extent that they are the results actually obtained by analysing the complete data. They are not true in the more traditional sense of being the true population parameters an empirical analysis attempts to estimate.

50 Przeworski, Alvarez, Cheibub and Limongi, Democracy and Development.

51 Przeworski, Alvarez, Cheibub and Limongi, Democracy and Development.

52 Imai, King and Lau, ‘Zelig’.