# A New Approach for the Development of a Public Use Microdata File for Canada's 2011 National Household Survey

## 3. Overall approach for creating the PUMF

The creation of the PUMF required finalizing the data content, selecting the 2.7% sample from the 2011 NHS, developing and applying a disclosure control strategy, and producing household weights for estimation and variance estimation.

## 3.1 Finalizing the data content

The data content was determined in consultation with IPUMS-International and subject-matter specialists. Variable categories were grouped in an attempt to balance confidentiality and analytical needs. The only geographical variables are Province, with the Territories combined, and a rural indicator, which was collapsed in PEI and the North. Rather than being grouped, Age and the income variables were subjected to noise addition and top/bottom coding. For place of birth (POB), mother tongue (MTN), Occupation (OCC) and Citizenship, a minimum population size of 125,000 was used to determine the final categories. For OCC and POB, categories were grouped to respect the target. Smaller hard-to-collapse categories, for example Oceania, were left as is. For MTN and Citizenship, categories below the target were placed in a residual category. For Industry (IND), the 2-digit NAICS was generally followed. In the North, POB, MTN, derived Visible Minority (DVM), and non-Christian Religion (REL) categories with fewer than 400 people were put in a separate category. Those separate categories should have a negligible impact on analyses as they cover a very small portion of the Canadian population.
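The residual-category rule used for MTN and Citizenship can be illustrated with a minimal sketch. It assumes that the population size of a category is estimated as the sum of sample weights; the function name, the `"Other"` label and the toy data are hypothetical and are not taken from the PUMF documentation.

```python
from collections import Counter

def collapse_small_categories(values, weights, min_population=125_000,
                              residual="Other"):
    """Recode categories whose weighted population estimate is below the target.

    Assumption: a category's population size is estimated by summing the
    sample weights of the units reporting it.
    """
    totals = Counter()
    for value, weight in zip(values, weights):
        totals[value] += weight
    small = {cat for cat, total in totals.items() if total < min_population}
    return [residual if value in small else value for value in values]

# Toy example with exaggerated weights so that the threshold bites.
mother_tongue = ["English", "French", "Icelandic", "English"]
weights = [200_000, 150_000, 40_000, 100_000]
recoded = collapse_small_categories(mother_tongue, weights)
print(recoded)
```

For OCC and POB, where small categories were merged with related ones rather than sent to a residual, a mapping table of hand-chosen groupings would replace the single `residual` label.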

## 3.2 Selecting the public use microdata samples (PUMS) from the 2011 NHS

Sample selection was complicated by the presence of related PUMFs, which increases the risk of reidentification. As noted, two *traditional* 2011 NHS PUMFs were being produced. Additionally, post-censal surveys like the 2012 Aboriginal Peoples Survey (APS) and the 2012 Programme for the International Assessment of Adult Competencies (PIAAC), which selected Aboriginal and Immigrant households from the 2011 NHS, were also releasing PUMFs. To minimize the risk of reidentification, it was decided to avoid overlap with those PUMFs as much as possible. Overlap among the three 2011 NHS PUMFs was avoided by splitting the 2011 NHS into three portions, with 45/109^{th} reserved for the present PUMF. By incorporating frame and design information from the post-censal surveys in the selection process, overlap with those surveys was nearly eliminated. The result was a 2.7% sample with a minimum sample weight of 32 (shared by more than 85% of the households) and a maximum weight of 242.22 = 10,900/45.

## 3.3 Developing and applying a disclosure control strategy

The disclosure control strategy is based on a mix of global recoding and targeted perturbation. As noted above, categorical variables were grouped to reduce the need for perturbation. The minimum threshold for categories for variables like POB and MTN was set at 125,000 after a threshold of 100,000 yielded too many problem cases. Further grouping was done in the North. Age and income variables were subjected to both noise addition and top/bottom coding. Age was top-coded at 85. Top codes for income variables were calculated by province (with the North treated separately), rural indicator and sex. For employment income and government transfers, 99^{th} percentiles were used, whereas for other market income 98^{th} percentiles were used. Values above each top code were replaced by their weighted mean. All income variables were bottom-coded at -30,000 for women and for men in the Atlantic provinces and the North, and at -50,000 for men elsewhere. Perturbed income values were also rounded to base 100, except for nonzero values between -50 and +50, which were set to ±1.
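The income coding rules above can be sketched as follows. This is an illustration under stated assumptions, not the production procedure: the function names are hypothetical, the top code is passed in rather than derived from a percentile, ties at exactly half a rounding unit are rounded away from zero, and the noise-addition step is omitted because its details are not given here.

```python
import math

def top_code(values, weights, top):
    """Replace values above `top` by the weighted mean of those values."""
    above = [(v, w) for v, w in zip(values, weights) if v > top]
    if not above:
        return list(values)
    total_weight = sum(w for _, w in above)
    mean_above = sum(v * w for v, w in above) / total_weight
    return [mean_above if v > top else v for v in values]

def bottom_code(values, floor):
    """Raise any value below `floor` (e.g. -30,000 or -50,000) to `floor`."""
    return [max(v, floor) for v in values]

def round_income(value, base=100):
    """Round to the nearest multiple of `base`, keeping small nonzero values nonzero.

    Assumptions: ties are rounded away from zero, and the bounds -50 and +50
    are treated as outside the interval that is set to +/-1.
    """
    if value == 0:
        return 0
    if -50 < value < 50:
        return 1 if value > 0 else -1
    sign = 1 if value > 0 else -1
    return sign * base * math.floor(abs(value) / base + 0.5)

incomes = [10, 20, 100, 200]
weights = [1, 1, 1, 3]
print(top_code(incomes, weights, top=50))      # values above 50 become 175.0
print(bottom_code([-60_000, 5], floor=-30_000))
print([round_income(v) for v in (0, 49, -20, 1234, -1250)])
```

In practice `top_code` would be applied separately within each province, rural indicator and sex cell, since the text states that top codes were calculated at that level.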

Data perturbation efforts focussed on individuals and households with a high risk of disclosure. Candidates were identified through the application of rules and the use of disclosure risk measures.

For several years now, Statistics Canada has been using disclosure risk measures as part of its strategy for creating PUMFs. For this project, it was decided to combine two of them: multiplicity and a Data Intrusion Simulation (DIS) measure. Both work from a set of *identifying variables* (*IVs*), i.e., actual or derived PUMF variables that, in combination, can be used to identify individuals with unique values in the sample *as well as in the population* (e.g., a 68-year-old male dentist in PEI). A microdata file can include a large number of IVs, but it does not make practical sense to use them all simultaneously to identify individuals at risk, because nearly everyone would be unique. Risk scenarios involving a few IVs at a time were thus created and results were obtained by subgroups created from widely or easily known characteristics (e.g., all combinations of three IVs, by province and sex). Subgroups are useful to include certain characteristics, like sex, in every risk scenario or when different subpopulations are at different levels of risk (e.g., respondents from smaller provinces are usually more at risk than those from larger ones).

The first risk measure, multiplicity, takes combinations of IVs, say three at a time within subgroup, and counts the number of times each unit is *sample unique* (i.e., the only respondent in his/her subgroup with a particular combination of IV values). The count, called the multiplicity score, is related to the unit's identification risk since units that appear as unique in more tables are more likely to be identified as unique in the population. The multiplicity score is a heuristic concept. One problem with it is that it ignores features of the sample design that affect risk, such as the sampling rates. Another problem is that it is difficult to come up with a theoretical threshold for a maximum acceptable score unless one is dealing with the simplest of sample designs (Bernoulli sampling).
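A multiplicity score of this kind can be computed with a short routine. The sketch below is illustrative only: records are plain dictionaries, and the variable names (`prov`, `sex`, `age`, `occ`, `pob`) are hypothetical stand-ins for actual PUMF variables.

```python
from collections import Counter
from itertools import combinations

def multiplicity_scores(records, ivs, subgroup_vars, n=3):
    """Count, for each record, the n-way IV tables in which it is sample unique.

    Uniqueness is assessed within the subgroup defined by `subgroup_vars`.
    """
    scores = [0] * len(records)
    for combo in combinations(ivs, n):
        keys = [
            tuple(r[v] for v in subgroup_vars) + tuple(r[v] for v in combo)
            for r in records
        ]
        counts = Counter(keys)
        for i, key in enumerate(keys):
            if counts[key] == 1:
                scores[i] += 1
    return scores

records = [
    {"prov": "PEI", "sex": "M", "age": 68, "occ": "dentist", "pob": "Canada"},
    {"prov": "PEI", "sex": "M", "age": 68, "occ": "teacher", "pob": "Canada"},
    {"prov": "PEI", "sex": "M", "age": 30, "occ": "teacher", "pob": "Canada"},
]
# Two-way tables of (age, occ), (age, pob) and (occ, pob) within province and sex.
print(multiplicity_scores(records, ["age", "occ", "pob"], ["prov", "sex"], n=2))
```

Note that the score depends only on sample counts, which is why it cannot reflect differences in sampling rates.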

The other risk measure is based on the Data Intrusion Simulation, a method of estimating the probability that, for a given scenario (set of IVs), an intruder who matches an arbitrary unit in the population against a sample unique on the PUMF has made a correct match. For Bernoulli and Poisson sampling this probability is estimated from the number of sample uniques and pairs, and the average weight of units in pairs (Skinner and Elliot (2002), Skinner and Carter (2003)). The probability, calculated for any subgroup and scenario, can be assigned to every sample unique therein. This assignment can generate some peculiarities. For example, the estimated probability for, say, a dentist who is sample unique may be affected by whether civil and electrical engineers are placed in the same or different categories. The estimated probability is 1 when there are sample uniques but no pairs for a given scenario. Its variance can be quite high. However, the simplicity and theoretical foundation of DIS make it a very attractive tool for comparing strategies, like the impact of providing different levels of geographical detail on the overall risk.
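The estimator is not written out in this section. As a hedged illustration, the sketch below assumes the form $\hat{\theta} = n_1 / \left(n_1 + 2(\bar{w} - 1)\,n_2\right)$, where $n_1$ is the number of sample uniques, $n_2$ the number of cells containing exactly two sample units, and $\bar{w}$ the average weight of units in those pairs. This follows from the Bernoulli-sampling argument with sampling rate $1/\bar{w}$, but the exact estimator used for the PUMF should be checked against the cited papers. The function name and input layout are hypothetical.

```python
def dis_probability(cells):
    """Estimate the probability that a match against a sample unique is correct.

    `cells` maps each combination of IV values (within one subgroup and
    scenario) to the list of sample weights of the units in that cell.
    Returns None when the table has no sample uniques.
    """
    n_unique = sum(1 for w in cells.values() if len(w) == 1)
    if n_unique == 0:
        return None
    pair_weights = [x for w in cells.values() if len(w) == 2 for x in w]
    if not pair_weights:
        return 1.0  # uniques but no pairs: the estimate is 1
    n_pairs = len(pair_weights) // 2
    mean_weight = sum(pair_weights) / len(pair_weights)
    return n_unique / (n_unique + 2 * (mean_weight - 1) * n_pairs)

table = {
    "dentist": [32],
    "civil engineer": [32],
    "teacher": [32, 32],
    "nurse": [32, 32, 32],
}
print(dis_probability(table))  # 2 / (2 + 2 * 31 * 1) = 0.03125
```

The example also shows the peculiarity mentioned above: merging or splitting categories changes the counts of uniques and pairs, and therefore the probability assigned to every sample unique in the table.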

The two measures were combined into a single unit-level risk measure as follows. Given a set of IVs, for n = 1, 2, and 3, all possible n-way tables by subgroup were generated, and the DIS probability for each table was assigned to each of its sample uniques. Rather than count up the number of times that a unit was unique, the five worst (highest) probabilities for that unit were taken. The probability was 0 for tables where the unit was not unique. The risk measure used, called DIS_{(5)} here, is *the probability that the unit gets correctly matched in at least one of the five n-way tables where it is most likely to be matched* (treating tables as if they were independent). If DIS_{[i]} represents the *i*^{th} highest risk for a unit, then $\text{DIS}_{(5)} = 1 - \prod_{i=1}^{5} \left(1 - \text{DIS}_{[i]}\right)$. Units with DIS_{(5)} above our threshold were considered to be at risk.
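The combination step is simple enough to show directly. In this minimal sketch the function name is hypothetical and the input is the list of DIS probabilities assigned to one unit across all tables, with 0 for tables where the unit is not unique.

```python
import math

def dis_5(table_probs, k=5):
    """Probability of a correct match in at least one of the k riskiest tables.

    Tables are treated as independent, as in the definition of DIS_(5).
    """
    worst = sorted(table_probs, reverse=True)[:k]
    return 1 - math.prod(1 - p for p in worst)

print(dis_5([0.5, 0.5, 0.0, 0.0, 0.0, 0.0]))  # 1 - 0.5 * 0.5 = 0.75
print(dis_5([1.0, 0.1]))                      # any table with probability 1 gives 1.0
```

A single table with an estimated probability of 1 is enough to push the combined measure to 1, which is one reason tables with uniques but no pairs matter so much in practice.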

The DIS probability estimates assume that the population is unclustered. The hierarchical nature of the PUMF necessitated some adaptation. Household members can share characteristics like place of birth, level of education and religion that will affect the table counts of uniques and pairs. In calculating person-level risk, households were only allowed to contribute 0 or 1 unit to each cell. For household-level risk, the approach taken was to generate household-level IVs and subgroups. Household IVs used in one-couple households include household type (e.g., three-generation household), the places of birth present, the highest level of education, the joint occupations of both spouses, and the age/sex distribution of their children (counts for 13 age-sex groups collated into a single IV). Some household IVs, like the previous one, could have hundreds of categories.

Once units at risk were identified, it was necessary to determine which variable(s) should be perturbed. The target variable was often determined using the risk measure again. We generated an individual DIS_{(IV)} measure for each IV. DIS_{(IV)} was similar to DIS_{(5)} except that it was based on the five worst tables that did not include the specified IV. If a unit's DIS_{(5)} was above our threshold, but some of its DIS_{(IV)} were below, then those IVs were preferred candidates for perturbation. The choice of IV depended on factors such as how easy it was to perturb the IV and whether a particular IV had already been perturbed enough times when it was the only choice (which was the case for OCC). When none of the DIS_{(IV)} went below the threshold, the IVs with the lowest DIS_{(IV)} may have been perturbed, or more than one IV was perturbed. Sometimes, a different IV was used just because it was a good candidate for masking. For example, changing someone's DVM would make it much less likely for them to be identified through spontaneous recognition. For a few thousand households, both DIS_{(5)} and the DIS_{(IV)} never went below 1. Many of these so-called *unresolvable* households underwent more drastic perturbation measures such as changing province, adding, removing or swapping members, swapping members' education-employment histories, etc.
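The DIS_{(IV)} screening can be sketched as below. The threshold value, the function names and the table layout are all hypothetical; the actual threshold is not disclosed in this section. Each table is identified by the tuple of IVs it crosses, and the value stored is the unit's DIS probability in that table.

```python
import math

def combined_risk(probs, k=5):
    """1 minus the product of (1 - p) over the k highest probabilities."""
    worst = sorted(probs, reverse=True)[:k]
    return 1 - math.prod(1 - p for p in worst)

def perturbation_candidates(table_probs, ivs, threshold, k=5):
    """Return the IVs whose removal brings the unit's risk below `threshold`.

    For each IV, the risk is recomputed from the tables that do not involve
    that IV, which mirrors the definition of DIS_(IV).
    """
    candidates = {}
    for iv in ivs:
        remaining = [p for table, p in table_probs.items() if iv not in table]
        risk = combined_risk(remaining, k)
        if risk < threshold:
            candidates[iv] = risk
    return candidates

unit_tables = {
    ("occ",): 0.6,
    ("occ", "age"): 0.8,
    ("age",): 0.05,
    ("pob",): 0.02,
    ("age", "pob"): 0.1,
}
# With an illustrative threshold of 0.5, only perturbing OCC resolves this unit.
print(perturbation_candidates(unit_tables, ["occ", "age", "pob"], threshold=0.5))
```

When the returned dictionary is empty, the text indicates that the IVs with the lowest DIS_{(IV)} were considered, or that several IVs were perturbed together.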

Candidates for perturbation were also identified by applying deterministic rules and using estimated population size thresholds. This was done particularly for ethno-cultural variables like POB, MTN, DVM and religion. Rules also targeted large households. Household size was capped at 11. Larger households were either split in two or had a few members removed. Large households of fewer than 11 members were also subjected to more drastic perturbation. Section 4 gives more detail about some of the other deterministic and threshold rules used. Although for reasons of confidentiality and brevity not all methods used are presented, the section provides a good overview of the types of strategies used.

## 3.4 Producing household weights for estimation and variance estimation

After the perturbation was done, sample weights were calibrated so that PUMF estimates added up to Census population counts for 33 post-strata. The post-strata were generated by crossing 11 regions (the ten provinces and the North) with three household types: households with 2 or more Aboriginal members, other households with 6 or more members, and all other households.
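The calibration method is not specified here. As one possible reading, the sketch below assumes a simple ratio adjustment within each post-stratum, with household weights scaled so that the weighted count of persons matches the control total. The function name, the stratum labels and the toy control totals are hypothetical.

```python
from collections import defaultdict

def calibrate_to_poststrata(weights, strata, sizes, control_totals):
    """Scale household weights so weighted person counts hit the controls.

    Assumption: a ratio adjustment is applied within each post-stratum;
    `sizes` holds the number of persons in each household.
    """
    estimates = defaultdict(float)
    for w, s, size in zip(weights, strata, sizes):
        estimates[s] += w * size
    return [w * control_totals[s] / estimates[s] for w, s in zip(weights, strata)]

weights = [32, 32, 40]
strata = ["ON-other", "ON-other", "ON-large"]
sizes = [2, 3, 6]
controls = {"ON-other": 200, "ON-large": 300}
print(calibrate_to_poststrata(weights, strata, sizes, controls))
```

After the adjustment, the weighted person count in each post-stratum reproduces its control total exactly, while households within a post-stratum keep the same relative weights.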

The PUMF sample is close to a self-weighting sample of households, with 85% of the weights equal to 32. As was the case for the traditional 2011 NHS PUMFs, the Random Group Method (Wolter, 2007) was used to allow users to generate variance estimates. With this method, sample households are randomly distributed among 8 replicates. Normally, following this step, each household has a replicate weight that is 8 times its original weight if it was selected for that replicate, and zero otherwise. The calibration step is then repeated for each replicate, leading to a set of 8 calibrated replicate weights for each household, which are equal to zero 7 times out of 8.

One problem with zero weights is that they can yield estimates of zero, which could lead to problems for certain types of analyses; a solution is proposed by Rao and Shao (1999). To avoid having replicate weights equal to zero, rather than using the replicate weights as is, each replicate weight was replaced by the simple average of the original weight and the replicate weight. This yields a weight that is one-half of the original weight when the household is not in a particular replicate, and (1 + 8)/2 = 4.5 times the original weight when it is. It is those weights that were calibrated. The result is a set of 8 nonzero replicate weights for each household that can be used to generate replicate estimates for variance estimation purposes. More information on the estimation of variance using the (calibrated) replicate weights is provided in *2011 National Household Survey Special PUMF* (Statistics Canada, 2015).
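The construction of the averaged replicate weights, before calibration, can be sketched as follows. The function name and the random assignment scheme (dealing shuffled households into groups of near-equal size) are assumptions made for illustration; the recalibration of each replicate is omitted.

```python
import random

def averaged_replicate_weights(weights, n_replicates=8, seed=0):
    """Build nonzero replicate weights by averaging with the original weight.

    Each household is assigned to one replicate. Its raw replicate weight is
    n_replicates times the original weight in that replicate and 0 elsewhere;
    the value returned is the average of the original and the raw weight.
    """
    rng = random.Random(seed)
    order = list(range(len(weights)))
    rng.shuffle(order)
    group = [0] * len(weights)
    for position, household in enumerate(order):
        group[household] = position % n_replicates

    result = []
    for household, w in enumerate(weights):
        row = []
        for r in range(n_replicates):
            raw = n_replicates * w if group[household] == r else 0.0
            row.append((w + raw) / 2)
        result.append(row)
    return result

replicate_weights = averaged_replicate_weights([32.0] * 16)
print(sorted(replicate_weights[0]))  # seven weights of 16.0 and one of 144.0
```

Because each averaged weight is the midpoint of the original weight and the raw replicate weight, the deviation of a replicate total from the full-sample total is half of what the raw replicate weights would give. A variance formula applied to these weights therefore needs a compensating factor; the formula to use is the one documented in the user guide cited above.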