Constructing Synthetic Datasets

How to go public with private data

Introduction

In my current collaboration with Stephen Jenkins, we are grappling with the challenge of providing a self-contained replication package alongside our paper.

It’s relatively easy to share the code for our model estimations, including code developed by other authors. However, many researchers face the same challenge we do: how to distribute data that we’re not allowed to share due to privacy or proprietary reasons.

In fact, for this particular project, only Stephen has had access to the data. I’ve mainly worked on the code that estimates new models (for those interested, see references below).

Now that it’s time to publish our “big” paper, we need a strategy to create a synthetic dataset that satisfies privacy protection constraints while still preserving the moment structure we care about, as well as moments that others may find interesting.

To this end, I propose a simple strategy that could work: Multiple Imputation. While it may not be the best method available, I welcome any feedback or suggestions.

To explain how the method works, I’ll use the Swiss Labor Market Survey 1998 dataset, which is publicly available and used as an example dataset in the command -oaxaca- (Jann 2008).

The Problem

Assume you signed a confidentiality agreement to work with the Swiss survey data and are ready to submit your work. However, you are required to provide a replication package with the code that produces the tables and the dataset itself. Since you cannot share the original data, you propose generating 5 synthetic datasets instead. That way, readers can run your code and reach conclusions similar to those in your main paper, while the data are simulated and privacy concerns are addressed.

Here is a piece of code that can be used for that:

Code
frame reset
set linesize 255
use http://fmwww.bc.edu/RePEc/bocode/o/oaxaca.dta, clear
misstable summarize
(Excerpt from the Swiss Labor Market Survey 1998)
                                                               Obs<.
                                                +------------------------------
               |                                | Unique
      Variable |     Obs=.     Obs>.     Obs<.  | values        Min         Max
  -------------+--------------------------------+------------------------------
        lnwage |       213               1,434  |   >500    .507681    5.259097
         exper |       213               1,434  |   >500          0    49.16667
        tenure |       213               1,434  |    323          0    44.83333
          isco |       213               1,434  |      9          1           9
  -----------------------------------------------------------------------------

Four variables (lnwage, exper, tenure, and isco) are missing for the 213 observations with lfp==0 (people who are not working).

The solution

The first step is to decide on the size of the synthetic dataset. You could create a dataset with the same number of observations or adjust it to your desired sample size. I will double the dataset, tag the new observations, and set every variable except lfp to missing in the new observations. I keep lfp because some variables are structurally missing: they are only observed for people in the labor force. Alternatively, you could create lfp as a random draw from a Bernoulli distribution with the same probability as in the original data.

Code
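* Duplicate every observation; tag==1 marks the new copies
expand 2, gen(tag)
* Blank out everything except lfp in the new copies
foreach i of varlist lnwage educ exper tenure isco female age single married divorced kids6 kids714 wt {
  qui:replace `i'=. if tag==1
}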
(1,647 observations created)
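
The Bernoulli alternative mentioned above could look roughly like the following. This is only a minimal sketch, not what I use in the rest of the post: it assumes the expanded data (with tag) are in memory, and lfp_syn is a hypothetical variable name introduced here for illustration.

```stata
* Sketch: draw a synthetic lfp instead of carrying over the observed one.
* lfp_syn is a hypothetical name; the seed value is arbitrary.
set seed 2023
* Share of people in the labor force among the original observations
summarize lfp if tag==0, meanonly
local p = r(mean)
* One Bernoulli(p) draw per new observation
generate byte lfp_syn = rbinomial(1, `p') if tag==1
tabulate lfp_syn
```

If you go this route, lfp_syn would take the place of lfp for the tagged observations in the steps that follow.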

Next, create multiple imputed datasets using predictive mean matching (PMM). To do this, mi set the data and register all variables to be imputed. Then impute them with chained equations (mi impute chain), using pmm for every variable. Make sure none of the variables are collinear, and specify the variables with structurally missing data separately so that they are imputed only when lfp==1. The only exogenous (fully observed) variable here is lfp.

Code
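* Declare the mi style and register the variables to be imputed
mi set wide
mi register imputed lnwage educ exper tenure isco female age single married kids6 kids714 wt
set seed 101
* Chained PMM with 20 nearest neighbors; labor-market variables are imputed only when lfp==1
qui: mi impute chain (pmm, knn(20)) educ female age single married kids6 kids714 wt ///
    (pmm if lfp==1, knn(20)) lnwage exper tenure isco = lfp, add(5)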
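
Before moving on, it may be worth checking how well the imputed values track the original moments. The sketch below is one possible quick check, not part of the original workflow: it assumes the wide-style imputed variables (prefixed _1_ for the first imputation) and tag are in memory, and it only looks at the first imputation and a handful of variables.

```stata
* Sketch: compare means and standard deviations, original vs. first synthetic set
foreach v of varlist lnwage educ exper tenure {
    display as text "Variable: `v' (original, then synthetic)"
    summarize `v' if tag==0
    summarize _1_`v' if tag==1
}
* Compare the correlation structure as well
correlate lnwage educ exper tenure if tag==0
correlate _1_lnwage _1_educ _1_exper _1_tenure if tag==1
```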

You now have 5 sets of variables that can be used to create unique synthetic datasets with a structure similar to the original confidential dataset. Let’s now put the newly created data into frames, so we can estimate a few models and compare them with the original data.

Code
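* Copy each imputed set (new observations only) into its own frame and strip the _i_ prefix
forvalues i = 1/5 {
  frame put _`i'_* lfp if tag==1, into(fr`i')
  frame fr`i':ren _`i'_* *
}
* Reload the original data in the default frame
use http://fmwww.bc.edu/RePEc/bocode/o/oaxaca.dta, clear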
(Excerpt from the Swiss Labor Market Survey 1998)
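
To ship the synthetic data with a replication package, each frame could be written to its own file. This is a minimal sketch that assumes the frames fr1 to fr5 created above; the file names are hypothetical.

```stata
* Sketch: save each synthetic frame as a separate dataset.
* The names synthetic_1.dta ... synthetic_5.dta are hypothetical.
forvalues i = 1/5 {
    frame fr`i': save "synthetic_`i'.dta", replace
}
```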

Comparing Results

You can now estimate 4 models using the original data and the synthetic data.
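
The estimation step could be sketched as follows. The specifications here are inferred from the rows of the tables below (and twostep from the reported lambda), so treat them as my assumption rather than the exact commands behind the tables. The sketch assumes the original data are in the default frame and that fr1 to fr5 hold the synthetic datasets.

```stata
* Sketch only: specifications inferred from the table rows.
local wageeq lnwage educ exper tenure female

* Original data
regress `wageeq'
qreg `wageeq', quantile(10)
qreg `wageeq', quantile(90)
heckman `wageeq' age, select(lfp = educ female age single married kids6 kids714) twostep

* Same four models on each synthetic dataset
forvalues i = 1/5 {
    frame fr`i' {
        regress `wageeq'
        qreg `wageeq', quantile(10)
        qreg `wageeq', quantile(90)
        heckman `wageeq' age, select(lfp = educ female age single married kids6 kids714) twostep
    }
}
```

Results can then be stored and tabulated side by side with whichever table command you prefer.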

Now let’s compare the models:

Linear Regression

          Original    Fake1       Fake2       Fake3       Fake4       Fake5
educ      0.085***    0.076***    0.059***    0.077***    0.069***    0.063***
          (0.005)     (0.005)     (0.006)     (0.005)     (0.005)     (0.005)
exper     0.011***    0.010***    0.011***    0.012***    0.006***    0.009***
          (0.002)     (0.001)     (0.002)     (0.002)     (0.001)     (0.001)
tenure    0.008***    0.008***    0.002       0.005**     0.007***    0.005**
          (0.002)     (0.002)     (0.002)     (0.002)     (0.002)     (0.002)
female    -0.084***   -0.027      -0.136***   -0.063*     -0.056*     -0.113***
          (0.025)     (0.025)     (0.026)     (0.025)     (0.023)     (0.024)
_cons     2.213***    2.297***    2.580***    2.336***    2.464***    2.529***
          (0.068)     (0.068)     (0.074)     (0.070)     (0.063)     (0.064)
N         1434        1434        1434        1434        1434        1434

Standard errors in parentheses
* p < 0.05, ** p < 0.01, *** p < 0.001

Quantile Regression 10

          Original    Fake1       Fake2       Fake3       Fake4       Fake5
educ      0.103***    0.088***    0.069***    0.083***    0.075***    0.067***
          (0.017)     (0.015)     (0.018)     (0.011)     (0.011)     (0.013)
exper     0.020***    0.012**     0.014**     0.012***    0.008*      0.009*
          (0.005)     (0.004)     (0.005)     (0.003)     (0.003)     (0.004)
tenure    0.001       0.006       0.004       0.002       0.006       0.010*
          (0.006)     (0.005)     (0.006)     (0.004)     (0.004)     (0.005)
female    -0.151      0.022       -0.128      -0.019      -0.161**    -0.154*
          (0.081)     (0.070)     (0.079)     (0.053)     (0.054)     (0.063)
_cons     1.462***    1.681***    1.939***    1.835***    1.971***    1.994***
          (0.219)     (0.193)     (0.228)     (0.149)     (0.146)     (0.171)
N         1434        1434        1434        1434        1434        1434

Standard errors in parentheses
* p < 0.05, ** p < 0.01, *** p < 0.001

Quantile Regression 90

          Original    Fake1       Fake2       Fake3       Fake4       Fake5
educ      0.064***    0.073***    0.047***    0.069***    0.062***    0.057***
          (0.009)     (0.008)     (0.009)     (0.008)     (0.008)     (0.007)
exper     0.004       0.009***    0.009***    0.011***    0.003       0.005**
          (0.003)     (0.002)     (0.003)     (0.002)     (0.002)     (0.002)
tenure    0.008*      0.009***    -0.001      0.008**     0.012***    0.005
          (0.003)     (0.003)     (0.003)     (0.003)     (0.003)     (0.003)
female    -0.054      -0.054      -0.149***   -0.052      -0.009      -0.106**
          (0.044)     (0.035)     (0.041)     (0.039)     (0.039)     (0.033)
_cons     2.984***    2.804***    3.247***    2.863***    2.990***    3.121***
          (0.119)     (0.097)     (0.118)     (0.111)     (0.106)     (0.089)
N         1434        1434        1434        1434        1434        1434

Standard errors in parentheses
* p < 0.05, ** p < 0.01, *** p < 0.001

Heckman selection model

          Original    Fake1       Fake2       Fake3       Fake4       Fake5
lnwage
educ      0.072***    0.066***    0.052***    0.068***    0.059***    0.057***
          (0.005)     (0.005)     (0.006)     (0.005)     (0.005)     (0.005)
exper     0.002       0.001       -0.000      0.004*      -0.002      0.002
          (0.002)     (0.002)     (0.002)     (0.002)     (0.002)     (0.002)
tenure    0.002       0.003       -0.002      -0.000      0.002       0.001
          (0.002)     (0.002)     (0.002)     (0.002)     (0.002)     (0.002)
female    -0.105***   -0.067**    -0.189***   -0.105***   -0.096***   -0.144***
          (0.029)     (0.025)     (0.026)     (0.025)     (0.024)     (0.024)
age       0.015***    0.013***    0.015***    0.012***    0.013***    0.011***
          (0.002)     (0.002)     (0.002)     (0.002)     (0.002)     (0.002)
_cons     1.991***    2.071***    2.226***    2.133***    2.201***    2.317***
          (0.073)     (0.073)     (0.081)     (0.075)     (0.069)     (0.070)
lfp
educ      0.149***    0.055**     0.024       0.018       0.064***    0.028
          (0.028)     (0.020)     (0.020)     (0.019)     (0.019)     (0.018)
female    -1.785***   -0.074      -0.246**    -0.104      -0.011      -0.177
          (0.161)     (0.091)     (0.091)     (0.088)     (0.090)     (0.090)
age       -0.039***   -0.021***   -0.023***   -0.018***   -0.025***   -0.023***
          (0.007)     (0.005)     (0.005)     (0.005)     (0.005)     (0.005)
single    -0.100      -0.792**    -0.586**    -0.498**    -0.863***   -0.780**
          (0.231)     (0.241)     (0.201)     (0.186)     (0.229)     (0.238)
married   -0.867***   -0.929***   -0.765***   -0.489**    -0.882***   -1.017***
          (0.158)     (0.219)     (0.169)     (0.160)     (0.198)     (0.213)
kids6     -0.716***   -0.730***   -0.648***   -0.699***   -0.790***   -0.714***
          (0.082)     (0.067)     (0.064)     (0.066)     (0.069)     (0.062)
kids714   -0.343***   -0.378***   -0.373***   -0.281***   -0.402***   -0.201***
          (0.065)     (0.059)     (0.056)     (0.058)     (0.058)     (0.058)
_cons     3.543***    2.720***    3.006***    2.560***    2.729***    3.135***
          (0.486)     (0.382)     (0.388)     (0.345)     (0.383)     (0.383)
/mills
lambda    -0.123      0.128*      0.251***    0.062       0.206***    0.077
          (0.065)     (0.061)     (0.065)     (0.070)     (0.057)     (0.061)
N         1647        1647        1647        1647        1647        1647

Standard errors in parentheses
* p < 0.05, ** p < 0.01, *** p < 0.001

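I won’t spend much time interpreting the models. The main point is that the wage equations estimated by OLS and by the Heckman model give broadly similar results in the original and the synthetic data. The quantile regressions agree less closely, and so does the Heckman selection equation: the coefficient on female in the lfp equation, for instance, is -1.785 in the original data but ranges from -0.246 to -0.011 in the synthetic datasets. Still, with this approach one could distribute replication code that runs on both the true data and the synthetic data, which adds transparency to the work.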

Conclusions

As the comparison shows, results from the synthetic datasets will not replicate the original results exactly, because the imputation introduces random error. Keeping this in mind, we can distribute synthetic datasets like these together with two sets of results: one based on the actual data and one based on the synthetic dataset(s).

This should make it easier to provide replication packages with both code and data, improving the transparency of research that uses restricted data.

References

Jann, Ben. 2008. “The Blinder–Oaxaca Decomposition for Linear Regression Models.” The Stata Journal 8 (4): 453–79. https://doi.org/10.1177/1536867X0800800401.