I describe how to use merge, and the meaning of the different types of merging options
Stata
Tips
Programming
Author
Fernando Rios-Avila
Published
April 20, 2023
Merging 1:1
When you can uniquely identify all observations in each dataset using a common identifier viariable.
use file1merge 1:1 id using file2
Merging m:1
When you are trying to transfer family level data to each family member. The family member file is uniquely identified
use family_membersmergem:1 hid usingfamily
Merging 1:m
When you try to merge together data from various family members to a single family file. Each member is merged to its family.
usefamilymerge 1:m hid using family_members
Merging m:m
Something you NEVER want to do. It tries to merge to datasets based on a single set identification variable. This variable does not identify unique observations in either file. The merge is done by “id” based on the order they appear in the data. It will typically provide very odd results.
use family_mem1mergem:m id using family_mem2
True Merge m x m
If you were trying m:m merge (which is probably wrong), means you want to joinby. This means merge both files using all combinations of individuals that have the same id. This will create a very large dataset, unless other restrictions are applied.
use family_mem1joinby id using family_mem2
Source Code
---title: "Merging Data 1:1, m:1, 1:m and m:m?"description: "I describe how to use `merge`, and the meaning of the different types of merging options"author: "Fernando Rios-Avila"date: "4/20/2023"categories: - Stata - Tips - Programmingformat: html: code-line-numbers: true mermaid: theme: neutraldraft: false---## Merging 1:1When you can uniquely identify all observations in each dataset using a common identifier viariable.```statause file1merge 1:1 id using file2``````{mermaid}flowchart BT subgraph File2 db1-1["id=1"] db1-2["id=2"] db1-3["id=3"] end subgraph File1 db2-1["id=1"] db2-2["id=2"] db2-3["id=3"] end db1-1-->db2-1 db1-2-->db2-2 db1-3-->db2-3```## Merging m:1When you are trying to transfer *family* level data to each family member. The family member file is uniquely identified ```statause family_membersmerge m:1 hid using family``````{mermaid}flowchart BT subgraph members ["Family Members"] db1-1["hid=1,pid=1"] db1-2["hid=1,pid=2"] db1-3["hid=2,pid=1"] db1-4["hid=2,pid=2"] db1-5["hid=3,pid=1"] db1-6["hid=3,pid=2"] db1-7["hid=3,pid=3"] end subgraph Family db2-1["hid=1"] db2-2["hid=2"] db2-3["hid=3"] end db2-1-->db1-1 db2-1-->db1-2 db2-2-->db1-3 db2-2-->db1-4 db2-3-->db1-5 db2-3-->db1-6 db2-3-->db1-7```## Merging 1:mWhen you try to merge together data from various family members to a single *family* file. Each member is merged to its family.```statause familymerge 1:m hid using family_members``````{mermaid}flowchart BT subgraph members ["Family Members"] db1-1["hid=1,pid=1"] db1-2["hid=1,pid=2"] db1-3["hid=2,pid=1"] db1-4["hid=2,pid=2"] db1-5["hid=3,pid=1"] db1-6["hid=3,pid=2"] db1-7["hid=3,pid=3"] end subgraph Family db2-1["hid=1"] db2-2["hid=2"] db2-3["hid=3"] end db1-1-->db2-1 db1-2-->db2-1 db1-3-->db2-2 db1-4-->db2-2 db1-5-->db2-3 db1-6-->db2-3 db1-7-->db2-3```## Merging m:mSomething you **NEVER** want to do. It tries to merge to datasets based on a single set identification variable. This variable does not identify unique observations in either file. The merge is done by "id" based on the order they appear in the data. It will typically provide very odd results.```statause family_mem1merge m:m id using family_mem2``````{mermaid}flowchart BT subgraph members ["Family Members 1"] db1-1["hid=1,pid=1"] db1-2["hid=1,pid=2"] db1-3["hid=2,pid=1"] db1-4["hid=2,pid=2"] db1-5["hid=3,pid=1"] db1-6["hid=3,pid=2"] db1-7["hid=3,pid=3"] end subgraph Family ["Family Members 2"] db2-1["hid=1,pid=2"] db2-2["hid=1,pid=3"] db2-3["hid=1,pid=4"] db2-4["hid=2,pid=1"] db2-5["hid=2,pid=3"] db2-6["hid=3,pid=2"] db2-7["hid=3,pid=4"] end db2-1-->db1-1 db2-2-->db1-2 db2-4-->db1-3 db2-5-->db1-4 db2-6-->db1-5 db2-7-->db1-6```## True Merge m x mIf you were trying `m:m` merge (which is probably wrong), means you want to `joinby`. This means merge both files using all combinations of individuals that have the same id. This will create a very large dataset, unless other restrictions are applied.```statause family_mem1joinby id using family_mem2``````{mermaid}flowchart BT subgraph members ["Family Members 1"] db1-1["hid=1,pid=1"] db1-2["hid=1,pid=2"] db1-3["hid=2,pid=1"] db1-4["hid=2,pid=2"] db1-5["hid=3,pid=1"] db1-6["hid=3,pid=2"] db1-7["hid=3,pid=3"] end subgraph Family ["Family Members 2"] db2-1["hid=1,pid=2"] db2-2["hid=1,pid=3"] db2-3["hid=1,pid=4"] db2-4["hid=2,pid=1"] db2-5["hid=2,pid=3"] db2-6["hid=3,pid=2"] db2-7["hid=3,pid=4"] end db2-1-->db1-1 db2-2-->db1-1 db2-3-->db1-1 db2-1-->db1-2 db2-2-->db1-2 db2-3-->db1-2 db2-4-->db1-3 db2-4-->db1-4 db2-5-->db1-4 db2-5-->db1-3 db2-6-->db1-5 db2-6-->db1-6 db2-6-->db1-7 db2-7-->db1-5 db2-7-->db1-6 db2-7-->db1-7```