Scholarly research

My research interests lie in developing Bayesian statistical methodology for applied problems that intersect with the social sciences. In particular, I focus on Bayesian methods that help statistical agencies release the microdata they collect from surveys to the public in a useful and safe way. Under the synthetic data approach, data disseminators generate simulated data from statistical models estimated on the confidential data and release the synthetic data to the public. I am also working on differential privacy approaches.

Peer-reviewed publications:

* indicates an undergraduate student co-author

  1. Hu, J., Williams, M. R. and Savitsky, T. D. (forthcoming), Mechanisms for global differential privacy under Bayesian data synthesis, to appear in the Data Privacy special issue of Statistica Sinica. arXiv link
  2. Hu, J. and Bowen, C. M. (2024), Advancing microdata privacy protection: a review of synthetic data methods, WIREs Computational Statistics, e1636. doi:10.1002/wics.1636.
  3. Hu, J. and Savitsky, T. D. (2023), Bayesian data synthesis and disclosure risk quantification: an application to the Consumer Expenditure Surveys, Transactions on Data Privacy, 16:2, 83-121.
  4. Guo, S.* and Hu, J. (2023), Data privacy protection and utility preservation through Bayesian data synthesis: a case study on Airbnb listings, The American Statistician, 77(2), 192-200. link to the published paper
  5. Schneider, M. J., Hu, J., Mankad, S. and Bale, C. D. (2023), Protecting the anonymity of online users through Bayesian data synthesis, Expert Systems With Applications, 216, 119409. ResearchGate link
  6. Hu, J., Savitsky, T. D. and Williams, M. R. (2022), Risk-efficient Bayesian pseudo posterior data synthesis for privacy protection, Journal of Survey Statistics and Methodology, 10(5), 1370-1399. link to the published paper
  7. Cao, Y.* and Hu, J. (2022), Privacy protection for youth risk behavior using Bayesian data synthesis: a case study to the YRBS, Privacy in Statistical Databases e-proceedings.
  8. Hu, J., Drechsler, J. and Kim, H. J. (2022), Accuracy gains from privacy amplification through sampling for differential privacy, Journal of Survey Statistics and Methodology, Special Issue: Privacy, Confidentiality, and Disclosure Protection, 10(3), 688-719. link to the published paper
  9. Hu, J., Savitsky, T. D. and Williams, M. R. (2022), Private tabular survey data products through synthetic microdata generation, Journal of Survey Statistics and Methodology, Special Issue: Privacy, Confidentiality, and Disclosure Protection, 10(3), 720-752. link to the published paper
  10. Savitsky, T. D., Williams, M. R. and Hu, J. (2022), Bayesian pseudo posterior mechanism under asymptotic differential privacy, Journal of Machine Learning Research, 23(55), 1-37. Open Access
  11. Hu, J., Akande, O. and Wang, Q. (2021), Multiple imputation and synthetic data generation with the R package NPBayesImputeCat, The R Journal, 13:2, 90-110. Open Access
  12. Drechsler, J. and Hu, J. (2021), Synthesizing geocodes to facilitate access to detailed geographical information in large scale administrative data, Journal of Survey Statistics and Methodology, 9(3), 523-548. Open Access
  13. Hornby, R.* and Hu, J. (2021), Identification risks evaluation of partially synthetic data with the IdentificationRiskCalculation R package, Transactions on Data Privacy, 14:1, 37-52. Open Access
  14. Hu, J., Savitsky, T. D. and Williams, M. R. (2020), Risk-weighted data synthesizers for microdata dissemination, Special Issue: A New Generation of Statisticians Tackles Data Privacy, CHANCE, 33(4), 29-36. Open Access
  15. Ros, K.*, Olsson, H.* and Hu, J. (2020), Two-phase data synthesis for income: an application to the NHIS, Privacy in Statistical Databases e-proceedings. link
  16. Hu, J. (2019), Bayesian estimation of attribute and identification disclosure risks in synthetic data, Transactions on Data Privacy, 12:1, 61-89. Open Access
  17. Hu, J. and Hoshino, N. (2018), The Quasi-Multinomial synthesizer for categorical data, Privacy in Statistical Databases, Lecture Notes in Computer Science 11126 ed. J. Domingo-Ferrer and F. Montes, Springer, 75-91.
  18. Manrique-Vallier, D. and Hu, J. (2018), Bayesian non-parametric generation of synthetic multivariate categorical data in the presence of structural zeros, Journal of the Royal Statistical Society, Series A (Statistics in Society), 181(3), 635-647.
  19. Hu, J., Reiter, J. P. and Wang, Q. (2018), Dirichlet Process mixture models for modeling and generating synthetic versions of nested categorical data, Bayesian Analysis, 13(1), 183-200. Open Access; see software page for NestedCategBayesImpute for method implementation.
  20. Hu, J. and Drechsler, J. (2015), Generating synthetic geocoding information for public release, In: S. A. Europäische Kommission (Hrsg.), NTTS – Conferences on New Techniques and Technologies for Statistics, 56-59.
  21. Hu, J., Reiter, J. P. and Wang, Q. (2014), Disclosure risk evaluation for fully synthetic categorical data, Privacy in Statistical Databases, Lecture Notes in Computer Science 8744 ed. J. Domingo-Ferrer, Springer, 185-199; see software page for NPBayesImputeCat for method implementation.
  22. Hu, J., Mitra, R. and Reiter, J. P. (2013), Are independent parameter draws necessary for multiple imputation? The American Statistician, 67(3), 143-149.
  23. Hu, J. and Reiter, J. P. (2013), Non-parametric Bayesian model for generating synthetic household data, Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality 2013.

Technical reports:

* indicates an undergraduate student co-author

  1. Savitsky, T. D., Hu, J. and Williams, M. R., Re-weighting of vector-weighted mechanisms for utility maximization under differential privacy. arXiv link
  2. Hornby, R.* and Hu, J., Bayesian estimation of attribute disclosure risks in synthetic data with the AttributeRiskCalculation R package. arXiv link

Work in progress:

* indicates an undergraduate student co-author