In my current collaboration with Stephen Jenkins, we are grappling with the challenge of providing a self-contained replication package alongside our paper.
It’s relatively easy to share the code for our model estimations, including code developed by other authors. However, many researchers face the same challenge we do: how to distribute data that we’re not allowed to share due to privacy or proprietary reasons.
In fact, for this particular project, only Stephen has had access to the data. I’ve mainly worked on the code that estimates new models (for those interested, see references below).
Now that it’s time to publish our “big” paper, we need a strategy to create a synthetic dataset that satisfies privacy-protection constraints while still preserving the moments we care about, as well as those that other researchers may find interesting.
To this end, I propose a simple strategy that could work: Multiple Imputation. It may not be the best method available, so I welcome any feedback or suggestions.
To explain how the method works, I’ll use the Swiss Labor Market Survey 1998 dataset, which is publicly available and used as an example dataset in the command -oaxaca- (Jann 2008).
The Problem
Assume you signed a confidentiality agreement to work with Swiss Survey data and are ready to submit your work. However, you are required to provide a replication package with the code that produces the tables, as well as the dataset itself. Since you cannot share the original data, you suggest generating 5 synthetic datasets instead. By doing so, people can apply your code and reach conclusions similar to those of your main paper, with the advantage that the data is simulated, thus addressing the privacy concerns.
Here is a piece of code that loads the example data and summarizes its missing values:
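```stata
frame reset
set linesize 255
* Load the example data and summarize its missing values
use http://fmwww.bc.edu/RePEc/bocode/o/oaxaca.dta, clear
misstable summarize
```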
Four variables (wages, tenure, experience, and ISCO) have missing data when `lfp==0` (people who are not working).
The solution
The first step is to decide on the size of the synthetic dataset. You could create a dataset with the same number of observations or adjust it to your desired sample size. I will expand the dataset to double its size, tag the new observations, and set all variables to missing except for `lfp`, since some variables are only observed for those in the labor force. Alternatively, you could have created `lfp` as a random draw from a Bernoulli distribution with the same probability as in the original data (a sketch of that alternative appears after the code below).
```stata
expand 2, gen(tag)
foreach i of varlist lnwage educ exper tenure isco female age single married divorced kids6 kids714 wt {
    qui: replace `i' = . if tag==1
}
```
(1,647 observations created)
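Had you preferred the Bernoulli alternative mentioned above, a minimal sketch of it (not used in the rest of this post) could look like this:

```stata
* Sketch of the alternative: draw lfp for the synthetic observations from a
* Bernoulli distribution with the same participation rate as the original data
qui: sum lfp if tag==0
replace lfp = (runiform() < r(mean)) if tag==1
```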
Next, create the multiply imputed datasets using the predictive mean matching strategy. To do this, `mi set` the data and register all variables to be imputed. Then impute them with `mi impute chained` using `pmm`, making sure none of the variables are collinear and that variables with structural missing values (those observed only when `lfp==1`) are specified in a separate equation. The only fully observed explanatory variable here is `lfp`.
```stata
mi set wide
mi register impute lnwage educ exper tenure isco female age single married kids6 kids714 wt
set seed 101
qui: mi impute chain (pmm, knn(20)) educ female age single married kids6 kids714 wt ///
    (pmm if lfp==1, knn(20)) lnwage exper tenure isco = lfp, add(5)
```
You now have 5 sets of imputed variables that can be used to create unique synthetic datasets with a structure similar to the original confidential dataset. Let’s now put the newly created data into frames, so we can estimate a few models and compare them with the original data.
```stata
forvalues i = 1/5 {
    frame put _`i'_* lfp if tag==1, into(fr`i')
    frame fr`i': ren _`i'_* *
}
use http://fmwww.bc.edu/RePEc/bocode/o/oaxaca.dta, clear
```
(Excerpt from the Swiss Labor Market Survey 1998)
Comparing Results
You can now estimate four models (a linear regression, quantile regressions at the 10th and 90th percentiles, and a Heckman selection model) using the original data and each of the synthetic datasets:
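```stata
* Models on the original data
qui: reg lnwage educ exper tenure female
est sto m1
qui: qreg lnwage educ exper tenure female, q(10)
est sto m2
qui: qreg lnwage educ exper tenure female, q(90)
est sto m3
qui: heckman lnwage educ exper tenure female age, ///
    selec(lfp = educ female age single married kids6 kids714) two
est sto m4
* Same models on each synthetic dataset
forvalues i = 1/5 {
    frame fr`i' {
        qui: reg lnwage educ exper tenure female
        est sto m1`i'
        qui: qreg lnwage educ exper tenure female, q(10)
        est sto m2`i'
        qui: qreg lnwage educ exper tenure female, q(90)
        est sto m3`i'
        qui: heckman lnwage educ exper tenure female age, ///
            selec(lfp = educ female age single married kids6 kids714) two
        est sto m4`i'
    }
}
```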
Now let’s compare the models. The comparison tables below are produced with -esttab- using markdown output:
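```stata
* Compare the original and synthetic estimates side by side
display "**Linear Regression**"
esttab m1 m1?, mtitle(Original Fake1 Fake2 Fake3 Fake4 Fake5) nonum nogaps md se compress b(3)
display "**Quantile Regression 10**"
esttab m2 m2?, mtitle(Original Fake1 Fake2 Fake3 Fake4 Fake5) nonum nogaps md se compress b(3)
display "**Quantile Regression 90**"
esttab m3 m3?, mtitle(Original Fake1 Fake2 Fake3 Fake4 Fake5) nonum nogaps md se compress b(3)
display "**Heckman selection model**"
esttab m4 m4?, mtitle(Original Fake1 Fake2 Fake3 Fake4 Fake5) nonum nogaps md se compress b(3)
```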
**Linear Regression**

|        | Original  | Fake1     | Fake2     | Fake3     | Fake4     | Fake5     |
|--------|-----------|-----------|-----------|-----------|-----------|-----------|
| educ   | 0.085***  | 0.076***  | 0.059***  | 0.077***  | 0.069***  | 0.063***  |
|        | (0.005)   | (0.005)   | (0.006)   | (0.005)   | (0.005)   | (0.005)   |
| exper  | 0.011***  | 0.010***  | 0.011***  | 0.012***  | 0.006***  | 0.009***  |
|        | (0.002)   | (0.001)   | (0.002)   | (0.002)   | (0.001)   | (0.001)   |
| tenure | 0.008***  | 0.008***  | 0.002     | 0.005**   | 0.007***  | 0.005**   |
|        | (0.002)   | (0.002)   | (0.002)   | (0.002)   | (0.002)   | (0.002)   |
| female | -0.084*** | -0.027    | -0.136*** | -0.063*   | -0.056*   | -0.113*** |
|        | (0.025)   | (0.025)   | (0.026)   | (0.025)   | (0.023)   | (0.024)   |
| _cons  | 2.213***  | 2.297***  | 2.580***  | 2.336***  | 2.464***  | 2.529***  |
|        | (0.068)   | (0.068)   | (0.074)   | (0.070)   | (0.063)   | (0.064)   |
| N      | 1434      | 1434      | 1434      | 1434      | 1434      | 1434      |

Standard errors in parentheses; * p < 0.05, ** p < 0.01, *** p < 0.001
**Quantile Regression 10**

|        | Original  | Fake1     | Fake2     | Fake3     | Fake4     | Fake5     |
|--------|-----------|-----------|-----------|-----------|-----------|-----------|
| educ   | 0.103***  | 0.088***  | 0.069***  | 0.083***  | 0.075***  | 0.067***  |
|        | (0.017)   | (0.015)   | (0.018)   | (0.011)   | (0.011)   | (0.013)   |
| exper  | 0.020***  | 0.012**   | 0.014**   | 0.012***  | 0.008*    | 0.009*    |
|        | (0.005)   | (0.004)   | (0.005)   | (0.003)   | (0.003)   | (0.004)   |
| tenure | 0.001     | 0.006     | 0.004     | 0.002     | 0.006     | 0.010*    |
|        | (0.006)   | (0.005)   | (0.006)   | (0.004)   | (0.004)   | (0.005)   |
| female | -0.151    | 0.022     | -0.128    | -0.019    | -0.161**  | -0.154*   |
|        | (0.081)   | (0.070)   | (0.079)   | (0.053)   | (0.054)   | (0.063)   |
| _cons  | 1.462***  | 1.681***  | 1.939***  | 1.835***  | 1.971***  | 1.994***  |
|        | (0.219)   | (0.193)   | (0.228)   | (0.149)   | (0.146)   | (0.171)   |
| N      | 1434      | 1434      | 1434      | 1434      | 1434      | 1434      |

Standard errors in parentheses; * p < 0.05, ** p < 0.01, *** p < 0.001
**Quantile Regression 90**

|        | Original  | Fake1     | Fake2     | Fake3     | Fake4     | Fake5     |
|--------|-----------|-----------|-----------|-----------|-----------|-----------|
| educ   | 0.064***  | 0.073***  | 0.047***  | 0.069***  | 0.062***  | 0.057***  |
|        | (0.009)   | (0.008)   | (0.009)   | (0.008)   | (0.008)   | (0.007)   |
| exper  | 0.004     | 0.009***  | 0.009***  | 0.011***  | 0.003     | 0.005**   |
|        | (0.003)   | (0.002)   | (0.003)   | (0.002)   | (0.002)   | (0.002)   |
| tenure | 0.008*    | 0.009***  | -0.001    | 0.008**   | 0.012***  | 0.005     |
|        | (0.003)   | (0.003)   | (0.003)   | (0.003)   | (0.003)   | (0.003)   |
| female | -0.054    | -0.054    | -0.149*** | -0.052    | -0.009    | -0.106**  |
|        | (0.044)   | (0.035)   | (0.041)   | (0.039)   | (0.039)   | (0.033)   |
| _cons  | 2.984***  | 2.804***  | 3.247***  | 2.863***  | 2.990***  | 3.121***  |
|        | (0.119)   | (0.097)   | (0.118)   | (0.111)   | (0.106)   | (0.089)   |
| N      | 1434      | 1434      | 1434      | 1434      | 1434      | 1434      |

Standard errors in parentheses; * p < 0.05, ** p < 0.01, *** p < 0.001
**Heckman selection model**

|         | Original  | Fake1     | Fake2     | Fake3     | Fake4     | Fake5     |
|---------|-----------|-----------|-----------|-----------|-----------|-----------|
| lnwage  |           |           |           |           |           |           |
| educ    | 0.072***  | 0.066***  | 0.052***  | 0.068***  | 0.059***  | 0.057***  |
|         | (0.005)   | (0.005)   | (0.006)   | (0.005)   | (0.005)   | (0.005)   |
| exper   | 0.002     | 0.001     | -0.000    | 0.004*    | -0.002    | 0.002     |
|         | (0.002)   | (0.002)   | (0.002)   | (0.002)   | (0.002)   | (0.002)   |
| tenure  | 0.002     | 0.003     | -0.002    | -0.000    | 0.002     | 0.001     |
|         | (0.002)   | (0.002)   | (0.002)   | (0.002)   | (0.002)   | (0.002)   |
| female  | -0.105*** | -0.067**  | -0.189*** | -0.105*** | -0.096*** | -0.144*** |
|         | (0.029)   | (0.025)   | (0.026)   | (0.025)   | (0.024)   | (0.024)   |
| age     | 0.015***  | 0.013***  | 0.015***  | 0.012***  | 0.013***  | 0.011***  |
|         | (0.002)   | (0.002)   | (0.002)   | (0.002)   | (0.002)   | (0.002)   |
| _cons   | 1.991***  | 2.071***  | 2.226***  | 2.133***  | 2.201***  | 2.317***  |
|         | (0.073)   | (0.073)   | (0.081)   | (0.075)   | (0.069)   | (0.070)   |
| lfp     |           |           |           |           |           |           |
| educ    | 0.149***  | 0.055**   | 0.024     | 0.018     | 0.064***  | 0.028     |
|         | (0.028)   | (0.020)   | (0.020)   | (0.019)   | (0.019)   | (0.018)   |
| female  | -1.785*** | -0.074    | -0.246**  | -0.104    | -0.011    | -0.177    |
|         | (0.161)   | (0.091)   | (0.091)   | (0.088)   | (0.090)   | (0.090)   |
| age     | -0.039*** | -0.021*** | -0.023*** | -0.018*** | -0.025*** | -0.023*** |
|         | (0.007)   | (0.005)   | (0.005)   | (0.005)   | (0.005)   | (0.005)   |
| single  | -0.100    | -0.792**  | -0.586**  | -0.498**  | -0.863*** | -0.780**  |
|         | (0.231)   | (0.241)   | (0.201)   | (0.186)   | (0.229)   | (0.238)   |
| married | -0.867*** | -0.929*** | -0.765*** | -0.489**  | -0.882*** | -1.017*** |
|         | (0.158)   | (0.219)   | (0.169)   | (0.160)   | (0.198)   | (0.213)   |
| kids6   | -0.716*** | -0.730*** | -0.648*** | -0.699*** | -0.790*** | -0.714*** |
|         | (0.082)   | (0.067)   | (0.064)   | (0.066)   | (0.069)   | (0.062)   |
| kids714 | -0.343*** | -0.378*** | -0.373*** | -0.281*** | -0.402*** | -0.201*** |
|         | (0.065)   | (0.059)   | (0.056)   | (0.058)   | (0.058)   | (0.058)   |
| _cons   | 3.543***  | 2.720***  | 3.006***  | 2.560***  | 2.729***  | 3.135***  |
|         | (0.486)   | (0.382)   | (0.388)   | (0.345)   | (0.383)   | (0.383)   |
| /mills  |           |           |           |           |           |           |
| lambda  | -0.123    | 0.128*    | 0.251***  | 0.062     | 0.206***  | 0.077     |
|         | (0.065)   | (0.061)   | (0.065)   | (0.070)   | (0.057)   | (0.061)   |
| N       | 1647      | 1647      | 1647      | 1647      | 1647      | 1647      |

Standard errors in parentheses; * p < 0.05, ** p < 0.01, *** p < 0.001
I won’t spend too much time interpreting the models. However, it is worth noting that they produce broadly similar results, with the largest discrepancies appearing in the quantile regressions. Still, with this approach one could distribute replication code that runs on both the true data and the synthetic data, adding transparency to the work.
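To actually ship the synthetic data with the package, each frame can be saved as its own file; a minimal sketch, where the file names are just placeholders:

```stata
* Save each synthetic frame as a separate .dta file (placeholder file names)
forvalues i = 1/5 {
    frame fr`i': save "synthetic_swiss_`i'.dta", replace
}
```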
Conclusions
As the analysis shows, results from the synthetic datasets are not expected to replicate the original estimates exactly, because the imputation introduces random error. Keeping this in mind, we can create synthetic datasets like this one and report two sets of results: one based on the actual data and the other based on the synthetic dataset(s).
This should help researchers provide replication packages with both code and data, improving the transparency of research that uses restricted data.