Chapter 1 Key variables
Created on: 24 July 2025
Authors: Daniel Gebhardt, Arne Bethmann, Madhu Chauhan and Milena Mühlmeister
Introduction: This blog series is an extension of the PASS User Guide and is intended to help users understand how the Scientific Use File (SUF) of the panel study ‘Labour Market and Social Security’ (PASS) can be processed in the programming languages Stata and R. In the blog articles you will find code examples for using the PASS data. If R packages are used for the examples in R, these packages are listed in the respective blog articles. The R package haven is used to import the Stata datasets.
Disclaimer: Since we use the SUF wave 17 to develop this blog, the values in the output of the code examples may change if new waves or the campus file (CF) are used. The information presented is generated using the R code. However, the code for Stata works in the same way as the R code and leads to the same results.
Description: This blog article provides an overview of the key variables in the different PASS datasets. The R packages dplyr and tidyr are used in this chapter.
Key variables are used to identify units and observations and to establish links between different datasets. These variables are essential whenever a certain research question requires information from different datasets which must therefore be combined before analyses can be carried out.
This section explains the key variables of PASS and how they are used. First, it explains how the key variables relate to the structure of the scientific use file (SUF) and its datasets, which were discussed in section 5 (reference). Second, the variables are described in more detail, and an overview of the key variables included in the different datasets of the SUF is given. The section concludes with several practical examples illustrating the use of the key variables.
1.1 Key variables and their connection to the structure of the scientific use file
The structure of the SUF and its datasets was illustrated in Chapter 5 (reference). There it was shown that the datasets of the SUF can be classified by their level (household or individual), their type (register, cross section, weight or spell) and the format in which they are prepared (wide, long, spell). Which key variables can be used to identify units and their respective observations depends on the level and format of the dataset.
On the household as well as on the individual level PASS uses specific identification numbers (ID) that are constant across waves. These ID-numbers can be used to identify certain units - households or persons - in all datasets of the SUF and across all waves.
A household can be identified via the current household number hnr and can be related to its household of origin via the original household number uhnr (see footnote 1). Households keep their hnr across waves. If part of an already surveyed household splits off, the newly formed household gets a new hnr and keeps it for future waves.
Individuals are assigned a constant personal ID-number pnr when they are a member of a successfully surveyed household in PASS for the first time. Persons keep their pnr across waves, even if they change between households, e. g. when they leave their household of origin and form a new split-off household.
Using only the ID-numbers - hnr on the household and pnr on the individual level - one can clearly identify a unit in each of the different datasets but not necessarily a certain observation. Which additional information is required to clearly identify an observation depends on the format of the specific dataset in question.
Datasets that are prepared in wide format (the register datasets) contain only one observation per unit while the wave-specific information is stored in wave-specific variables, e. g. age1 for a person’s age in wave 1, age2 for the age in wave 2 and so on. In these datasets each unit has exactly one observation and can therefore be clearly identified using the ID-variables alone.
Datasets that are prepared in long format (the cross-sectional datasets and the weights) as well as the datasets that are prepared in spell format (the different spell datasets) can contain more than one observation per unit. Datasets in long format contain as many wave-specific observations for each unit as there are waves this unit was interviewed in, e. g. if a household was interviewed twice, the household dataset contains two observations for this household - one for each wave with an interview. Therefore, the wave indicator welle is required in addition to the household or personal ID-number in order to identify an observation clearly. In spell format datasets, the spell number spellnr has to be taken into account when identifying an observation. The spell datasets contain as many observations as there are episodes reported by the household or person, e. g. the employment spells contain two observations for a person if this person reported two episodes of employment.
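The point that an ID number alone does not identify an observation in long format can be checked directly. The following is a minimal sketch on an invented toy data frame (the object name hh_long and all values are assumptions for illustration, not PASS data); it counts how often each key combination occurs, using dplyr.

```r
library(dplyr)

# Toy stand-in for a long-format household dataset (all values invented)
hh_long <- data.frame(
  hnr   = c(10010008, 10010008, 21011685),
  welle = c(1, 2, 2),
  hhtyp = c(1, 1, 3)
)

# hnr alone: household 10010008 was interviewed twice, so it occurs twice
dup_by_hnr <- hh_long %>%
  count(hnr) %>%
  filter(n > 1)

# hnr together with welle: no key combination occurs more than once
dup_by_key <- hh_long %>%
  count(hnr, welle) %>%
  filter(n > 1)

nrow(dup_by_hnr)
nrow(dup_by_key)
```

The same check should work for spell datasets by replacing welle with spellnr, and for individual-level datasets by replacing hnr with pnr.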
All datasets include key variables which are used to identify units and observations and to establish links to other datasets of the SUF. The key variables are described in Table 1.1, and Table 1.2 lists which of them each dataset contains. For further information about their meaning and on how to use them, see the corresponding chapter in @berg2025codebook. We strongly recommend that PASS users familiarise themselves with the structure of the datasets, their meaning and the key variables before combining different datasets.
| Key variable | Label | Description |
|---|---|---|
| hnr | Current household number | Eight-digit, constant ID number of a household, which is allocated when the household joins the panel. The first digit indicates the wave in which the household was first part of the gross sample of PASS. E. g.: 10010008 - household in gross sample for first time in 1st wave, 21011685 - household in gross sample for first time in 2nd wave, … |
| uhnr | Original household number | Eight-digit, constant ID number that points to the original household. In the case of households that were drawn directly for one of the subsamples, the uhnr is the same as the respective hnr. In the case of households which have split off from panel households (split-off households), the uhnr corresponds to the hnr of the household from which the split-off household originated. |
| hnr$ | Household number in wave$ | Eight-digit, constant ID number of the household in wave$ of PASS. This variable is only contained in the register datasets processed in wide format. |
| pnr | Constant personal ID number | Ten-digit, constant ID number of the individual. The pnr is allocated when a person first joins a PASS survey household. The first eight digits consist of the household number of the household to which the person belonged when he or she joined PASS, and the last two digits are the serial number that the person had within this household. E. g.: 1001000801 - person joined PASS in household 10010008 and had the serial number 01 in this household. |
| zplfd$ | Serial number of the target person in the household in wave$ | Two-digit serial number within the household in wave$, which indicates the person’s position in the household structure. Within a particular household the zplfd is constant in principle. If a person moves to a different household between the waves, then a new zplfd is allocated in the new household - in this case zplfd1 and zplfd2 differ. Serial numbers that were already used for a certain household in one of the previous waves are not allocated to anyone else. The numbering of new people in a household begins at N+1 (N = highest zplfd ever allocated in that household). |
| welle | Indicator for survey wave | Both the household and individual datasets as well as the corresponding weighting files of PASS are processed in long format. For every interview that was conducted with a household or a person there is a row in the data matrix. By means of the wave indicator (welle) it is possible to assign these different observations for a household or a person to the respective survey wave. |
| spellnr | Spell number | As in the datasets processed in long format, another variable is necessary in addition to the household and personal ID numbers in order to identify observations clearly in the spell datasets. In the different subject-related datasets the spells were put into chronological order and then each one was given a serial number, the spell number, within the household or the person. It is not easily possible to relate spell information clearly to a survey wave as the spells contain cross-wave information. |
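
Because the ID numbers are built from fixed digit positions, their components can be recovered with integer arithmetic. The following is a minimal sketch in base R using the example value 1001000801 from the description of pnr; the object names are assumptions for illustration. Note that the first-digit rule for hnr is only documented here with examples for waves 1 and 2, so the last step should be treated as illustrative for those cases.

```r
# Example pnr taken from the description of the key variable pnr
pnr <- 1001000801

# First eight digits: household number at the time the person joined PASS
hnr_at_entry <- pnr %/% 100

# Last two digits: serial number of the person within that household
serial_no <- pnr %% 100

# First digit of the eight-digit household number: wave of first gross sample
first_wave <- hnr_at_entry %/% 1e7

hnr_at_entry
serial_no
first_wave
```

Since a person keeps the pnr after moving, hnr_at_entry is the household at entry into the panel and need not equal the person's current hnr.
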
| Dataset | Key variables contained |
|---|---|
| | hnr, uhnr, hnr$*, pnr, zplfd$*, welle, spellnr |
| Household level | |
| Household register (hh_register) | x x x |
| Household dataset (HHENDDAT) | x x x |
| Household weights (hweights) | x x |
| Household dataset on retirement provision (HAVDAT, wave 3 only) | x x x |
| Citizen’s Benefit (Bürgergeld) [a] spells (alg2_spells) | x x |
| Individual level | |
| Person register (p_register) | x x x x |
| Person dataset (PENDDAT) | x x x x |
| Children dataset (KINDER) | x x x x |
| Person weights (pweights) | x x |
| Websurvey (Web, wave 16 only) | x x x x |
| Vignettes: employment mothers (VIGDAT_MUK, wave 15 only) | x |
| Vignettes: evaluation of vacancies (VIGDAT_KON, wave 12 only) | x |
| Vignettes: readiness to accept a job (VIGDAT_SUB, wave 5 only) | x |
| Person dataset on retirement provision (PAVDAT, wave 3 only) | x x |
| Employment biographies (bio_spells, from wave 2) | x x |
| One-Euro-Job spells (ee_spells, from wave 4) | x x |
| Measure spells (mn_spells, wave 2 and 3 only) | x x |
| Measure spells (massnahmespells, wave 1 only) | x x |
| Unemployment Benefit I spells (alg1_spells, wave 1 only) | x x |
| [a] The Citizen’s Benefit (Bürgergeld) came into effect on 1 January 2023 and replaced the former Unemployment Benefit II (Arbeitslosengeld II). | |
| * $ represents the number of a certain wave and indicates a wave-specific variable, e. g. hnr$ represents the household number in wave$; the variable name for wave 1 is therefore hnr1. | |
1.2 Example: Merging household data with the individual dataset
If household data are to be merged with the individual dataset (e. g. the information on the type of the household which is contained in the variable hhtyp), then the two relevant key variables - the household number (hnr) and the wave indicator (welle) - must be used.
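* Example using Stata:
* Load the individual dataset (one row per person and wave)
quietly use "${data}/PENDDAT.dta", clear
* Add the household type from the household dataset via hnr and welle
quietly merge m:1 hnr welle using "${data}/HHENDDAT.dta", keepusing(hhtyp)
tabulate _merge, missing
* Drop household observations without a matching personal interview
drop if _merge == 2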
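# Example using R:
library(haven)  # read_dta()
library(dplyr)  # select(), left_join(), %>%
# `data` is assumed to hold the path to the data folder, including the trailing separator
PENDDAT <- read_dta(paste0(data, "PENDDAT.dta"))
HHENDDAT <- read_dta(paste0(data, "HHENDDAT.dta"))
# Keep all person observations and add hhtyp from the matching household and wave
df <- PENDDAT %>%
  left_join(
    HHENDDAT %>%
      select(hnr, welle, hhtyp),
    by = c("hnr", "welle"))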
This example merges information from the household dataset with the personal interviews. Some re-interviewed households have no personal interview in the respective wave. These household observations are dropped in Stata and never enter the result of the left join in R, so the merged dataset contains only observations for which both a household and a personal interview are available.
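The R left join does not produce an equivalent of Stata's _merge variable. If the match status is of interest, it can be reconstructed. The following is a minimal sketch on invented toy data (the data frames, the helper columns in_person and in_household, and the variable merge_status are assumptions for illustration); it uses a full join and flags which source each row came from.

```r
library(dplyr)

# Toy person and household data (all values invented)
persons <- data.frame(
  pnr   = c(1001000801, 1001000802, 2101168501),
  hnr   = c(10010008, 10010008, 21011685),
  welle = c(1, 1, 2)
)
households <- data.frame(
  hnr   = c(10010008, 21011685, 10010008),
  welle = c(1, 2, 2),
  hhtyp = c(1, 3, 1)
)

# Full join keeps unmatched rows from both sides; the flags record the origin
merged <- full_join(
  persons %>% mutate(in_person = TRUE),
  households %>% mutate(in_household = TRUE),
  by = c("hnr", "welle")
) %>%
  mutate(merge_status = case_when(
    !is.na(in_person) & !is.na(in_household) ~ "both",
    !is.na(in_person)                        ~ "person only",
    TRUE                                     ~ "household only"
  ))

table(merged$merge_status)
```

In this toy data, household 10010008 has a household interview in wave 2 but no personal interview, so it ends up as "household only", which corresponds to _merge == 2 in the Stata example.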
1.3 Example: Merging the household weights with the household dataset
The household dataset and the household weights are available in the same format and on the same level. Accordingly, the datasets can be merged directly. The same procedure is used for merging the individual dataset and the person weights.
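* Example using Stata:
* Load the household dataset (one row per household and wave)
quietly use "${data}/HHENDDAT.dta", clear
* Both datasets are in long format, so hnr and welle identify rows one-to-one
quietly merge 1:1 hnr welle using "${data}/hweights.dta"
tabulate _merge, missing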
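# Example using R:
library(haven)  # read_dta()
library(dplyr)  # inner_join()
# `data` is assumed to hold the path to the data folder, including the trailing separator
HHENDDAT <- read_dta(paste0(data, "HHENDDAT.dta"))
hweights <- read_dta(paste0(data, "hweights.dta"))
# Keep the observations that appear in both datasets for the same household and wave
df <- inner_join(HHENDDAT, hweights, by = c("hnr", "welle"))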
This example shows a perfect match of the household dataset and the household weights. For each household that was interviewed in a certain wave an observation from the weighting dataset was merged. See blog 4 on the use of the weights.
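An inner join silently drops unmatched rows, so a perfect match is worth verifying rather than assuming. The following is a minimal sketch on invented toy data (the data frames and the weight column name are hypothetical, not the names used in hweights); it uses dplyr's anti_join to list rows that have no partner in the other dataset.

```r
library(dplyr)

# Toy household data and weights (all values and the column `weight` invented)
hh <- data.frame(
  hnr   = c(10010008, 10010008, 21011685),
  welle = c(1, 2, 2)
)
hw <- data.frame(
  hnr    = c(10010008, 10010008, 21011685),
  welle  = c(1, 2, 2),
  weight = c(1.2, 0.9, 1.5)
)

# Household observations without a weight, and weights without a household row
hh_without_weight <- anti_join(hh, hw, by = c("hnr", "welle"))
weight_without_hh <- anti_join(hw, hh, by = c("hnr", "welle"))

nrow(hh_without_weight)
nrow(weight_without_hh)
```

If both counts are zero, the inner join returns every observation of both datasets, which is the situation described for the household dataset and the household weights.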
1.4 Example: Merging information from the individual dataset with the person-specific spell data
When merging spell data with the household or individual dataset, the different logics of the datasets must always be taken into account. Whilst the household and individual datasets contain wave-specific observations of the study units, spells cannot be assigned clearly to one particular wave. A spell of employment, for example, can span several survey dates. Such a spell appears in the data as a single observation with its start and end dates. If, for instance, individual-level information is to be merged with the person-specific spell data (spells of employment, unemployment, gaps, employment and training measures), these different data structures have to be taken into consideration. As it is not straightforward to assign every spell to a particular survey wave, only the personal ID number can be used as a key variable. The information from the individual dataset therefore has to be converted to wide format first and then merged with all of a person’s spells. This is demonstrated below using the date of the personal interview, which is available in the individual dataset and is to be merged with the employment spells.
First the individual dataset, reduced to the relevant variables, is converted to wide format. For this the information on the interview date, which has been stored in wave-specific observations so far, is restructured. Instead of there being one observation per survey wave, there is now only one single observation for each individual in the dataset. The information on the interview date is now stored in the wave-specific variables pintdat1, pintdat2, et cetera. For many individuals the spell dataset contains more than one observation. By linking via the personal ID number, the respective interview dates of each individual wave are added to each of a person’s spells and are available for further calculations.
The biography spell dataset consists of different spell types: employment, unemployment, as well as other times out of employment, e. g. retirement, housewife/-husband, and military or civil service. You can keep certain types of spells by using the variable spelltyp. In the example only the employment spells are kept in the dataset.
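* Example using Stata:
* Reduce the individual dataset to the key variables and the interview date
quietly use "${data}/PENDDAT.dta", clear
keep pnr welle pintdat
* Convert to wide format: one row per person, variables pintdat1 to pintdat17
reshape wide pintdat, i(pnr) j(welle)
* Label the wave-specific variables ("Date of the personal interview in wave j")
forvalues j = 1/17 {
    la var pintdat`j' "Datum des Personeninterviews in Welle `j'"
}
save "${storage}/PINTDAT.dta", replace
* Load the biography spells and keep only the employment spells
quietly use "${data}/bio_spells.dta", clear
keep if spelltyp == 1
* Add the interview dates to each spell via the personal ID number
quietly merge m:1 pnr using "${storage}/PINTDAT.dta"
tabulate _merge, missing
* Drop persons without an employment spell
drop if _merge == 2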
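# Example using R:
library(haven)  # read_dta()
library(dplyr)  # select(), filter(), left_join(), %>%
library(tidyr)  # pivot_wider()
# Convert the interview date to wide format: one row per person, one column per wave
PINTDAT <- read_dta(paste0(data, "PENDDAT.dta")) %>%
  select(pnr, welle, pintdat) %>%
  pivot_wider(names_from = welle,
              values_from = pintdat,
              names_prefix = "pintdat")
# Label the wave-specific variables ("Date of the personal interview in wave j")
for (j in 1:17) {
  attr(PINTDAT[[paste0("pintdat", j)]], "label") <- paste0("Datum des Personeninterviews in Welle ", j)
}
# Keep only the employment spells
bio_spells <- read_dta(paste0(data, "bio_spells.dta")) %>%
  filter(spelltyp == 1)
# Add every interview date of a person to each of that person's spells
df <- bio_spells %>%
  left_join(
    PINTDAT,
    by = "pnr"
  )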
This example shows how information from the personal interview is merged with the biography spell dataset. In particular, the date of the personal interview is merged with the spell data, but only for those individuals reporting an employment spell (spelltyp == 1). There are also some individuals who were only interviewed in the 1st wave, some who have not reported any employment spells since and some who were never asked about their employment owing to a filter. These cases are not included in the example.
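The reshape-then-merge logic of this section can be traced on a few rows. The following is a minimal sketch on invented toy data (the data frames and all values, including the numeric date coding, are assumptions for illustration and may not match the format of pintdat in the SUF); it uses tidyr's pivot_wider and dplyr's left_join.

```r
library(dplyr)
library(tidyr)

# Toy interview dates in long format: one row per person and wave (values invented)
interviews <- data.frame(
  pnr     = c(1001000801, 1001000801, 2101168501),
  welle   = c(1, 2, 2),
  pintdat = c(20070115, 20080210, 20080305)
)

# Toy spell data: the first person reports two spells, the second person one
spells <- data.frame(
  pnr     = c(1001000801, 1001000801, 2101168501),
  spellnr = c(1, 2, 1)
)

# Wide format: one row per person, one interview-date column per wave
interviews_wide <- interviews %>%
  pivot_wider(names_from = welle,
              values_from = pintdat,
              names_prefix = "pintdat")

# Each spell receives all interview dates of the person it belongs to
spells_with_dates <- left_join(spells, interviews_wide, by = "pnr")

spells_with_dates
```

The second person was not interviewed in wave 1 in this toy data, so pintdat1 is missing for that person's spell, while both spells of the first person carry the same two interview dates.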
References
Footnotes
1. For households that have been drawn directly for one of the samples, the uhnr is identical to the hnr. Households that have split off from another household in PASS carry a uhnr representing the hnr of the household of origin.
