My research interests lie in developing Bayesian statistical methodology for applied problems that intersect with the social sciences. In particular, I focus on Bayesian methods that help statistical agencies release the microdata they collect from surveys to the public in a way that is both useful and safe. Under the synthetic data approach, data disseminators estimate statistical models on the confidential data, generate simulated data from those models, and release the synthetic data to the public. I am also working on differential privacy approaches.
Peer-reviewed publications:
* indicates an undergraduate student co-author
- Hu, J., Williams, M. R. and Savitsky, T. D. (forthcoming), Mechanisms for global differential privacy under Bayesian data synthesis, to appear in the Data Privacy special issue of Statistica Sinica. arXiv link
- Hu, J. and Bowen, C. M. (2024), Advancing microdata privacy protection: a review of synthetic data methods, WIREs Computational Statistics, e1636. doi:10.1002/wics.1636.
- Hu, J. and Savitsky, T. D. (2023), Bayesian data synthesis and disclosure risk quantification: an application to the Consumer Expenditure Surveys, Transactions on Data Privacy, 16:2, 83-121.
- Guo, S.* and Hu, J. (2023), Data privacy protection and utility preservation through Bayesian data synthesis: a case study on Airbnb listings, The American Statistician, 77(2), 192-200. link to the published paper
- Schneider, M. J., Hu, J., Mankad, S. and Bale, C. D. (2023), Protecting the anonymity of online users through Bayesian data synthesis, Expert Systems With Applications, 216, 119409. ResearchGate link
- Hu, J., Savitsky, T. D. and Williams, M. R. (2022), Risk-efficient Bayesian pseudo posterior data synthesis for privacy protection, Journal of Survey Statistics and Methodology, 10(5), 1370-1399. link to the published paper
- Cao, Y.* and Hu, J. (2022), Privacy protection for youth risk behavior using Bayesian data synthesis: a case study to the YRBS, Privacy in Statistical Databases e-proceedings.
- Hu, J., Drechsler, J. and Kim, H. J. (2022), Accuracy gains from privacy amplification through sampling for differential privacy, Journal of Survey Statistics and Methodology, Special Issue: Privacy, Confidentiality, and Disclosure Protection, 10(3), 688-719. link to the published paper
- Hu, J., Savitsky, T. D. and Williams, M. R. (2022), Private tabular survey data products through synthetic microdata generation, Journal of Survey Statistics and Methodology, Special Issue: Privacy, Confidentiality, and Disclosure Protection, 10(3), 720-752. link to the published paper
- Savitsky, T. D., Williams, M. R. and Hu, J. (2022), Bayesian pseudo posterior mechanism under asymptotic differential privacy, Journal of Machine Learning Research, 23(55), 1-37. Open Access
- Hu, J., Akande, O. and Wang, Q. (2021), Multiple imputation and synthetic data generation with the R package NPBayesImputeCat, The R Journal, 13:2, 90-110. Open Access
- Drechsler, J. and Hu, J. (2021), Synthesizing geocodes to facilitate access to detailed geographical information in large scale administrative data, Journal of Survey Statistics and Methodology, 9(3), 523-548. Open Access
- Hornby, R.* and Hu, J. (2021), Identification risks evaluation of partially synthetic data with the IdentificationRiskCalculation R package, Transactions on Data Privacy, 14:1, 37-52. Open Access
- Hu, J., Savitsky, T. D. and Williams, M. R. (2020), Risk-weighted data synthesizers for microdata dissemination, Special Issue: A New Generation of Statisticians Tackles Data Privacy, CHANCE, 33(4), 29-36. Open Access
- Ros, K.*, Olsson, H.* and Hu, J. (2020), Two-phase data synthesis for income: an application to the NHIS, Privacy in Statistical Databases e-proceedings. link
- Hu, J. (2019), Bayesian estimation of attribute and identification disclosure risks in synthetic data, Transactions on Data Privacy, 12:1, 61-89. Open Access
- Hu, J. and Hoshino, N. (2018), The Quasi-Multinomial synthesizer for categorical data, Privacy in Statistical Databases, Lecture Notes in Computer Science 11126 ed. J. Domingo-Ferrer and F. Montes, Springer, 75-91.
- Manrique-Vallier, D. and Hu, J. (2018), Bayesian non-parametric generation of synthetic multivariate categorical data in the presence of structural zeros, Journal of the Royal Statistical Society, Series A (Statistics in Society), 181(3), 635-647.
- Hu, J., Reiter, J. P. and Wang, Q. (2018), Dirichlet Process mixture models for modeling and generating synthetic versions of nested categorical data, Bayesian Analysis, 13(1), 183-200. Open Access; see software page for NestedCategBayesImpute for method implementation.
- Hu, J. and Drechsler, J. (2015), Generating synthetic geocoding information for public release, In: S. A. Europäische Kommission (Hrsg.), NTTS – Conferences on New Techniques and Technologies for Statistics, 56-59.
- Hu, J., Reiter, J. P. and Wang, Q. (2014), Disclosure risk evaluation for fully synthetic categorical data, Privacy in Statistical Databases, Lecture Notes in Computer Science 8744 ed. J. Domingo-Ferrer, Springer, 185-199; see software page for NPBayesImputeCat for method implementation.
- Hu, J., Mitra, R. and Reiter, J. P. (2013), Are independent parameter draws necessary for multiple imputation? The American Statistician, 67(3), 143-149.
- Hu, J. and Reiter, J. P. (2013), Non-parametric Bayesian model for generating synthetic household data, Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality 2013.
Technical reports:
- Savitsky, T. D., Hu, J. and Williams, M. R., Re-weighting of vector-weighted mechanisms for utility maximization under differential privacy. arXiv link
- Hornby, R.* and Hu, J., Bayesian estimation of attribute disclosure risks in synthetic data with the AttributeRiskCalculation R package. arXiv link
Work in progress: