
## Archived Content

# 4. Estimation from the census sample

## 4.1 Operational considerations

## 4.2 Theoretical considerations

## 4.3 Developing an estimation procedure for the census sample

## 4.4 The two-step Pseudo-optimal Regression estimator

## 4.5 Two-pass processing

## 4.6 Differences between population counts and final weighted estimates

## 4.7 Different universes

Information identified as archived is provided for reference, research or recordkeeping purposes. It is not subject to the Government of Canada Web Standards and has not been altered or updated since it was archived. Please contact us to request a format other than those available.


Any sampling procedure requires an associated estimation procedure for scaling sample data up to the population level. The choice of an estimation procedure is generally governed by both operational and theoretical constraints. From the operational viewpoint, the procedure must be feasible within the processing system of which it is a part, while from the theoretical viewpoint, the procedure should minimize the sampling error of the estimates it produces. Sections 4.1 and 4.2 describe the operational and theoretical considerations relevant to the choice of estimation procedures for the census sample. Sections 4.3 and 4.4 discuss some of the methodology used in developing the census weights. The remaining sections introduce the data universes used in the weighting process, and briefly discuss why discrepancies may occur between population counts and weighted estimates.

## 4.1 Operational considerations

Mathematically, an estimation procedure can be described by an algebraic formula, or estimator, that shows how the estimate for the population is calculated as a function of the observed sample values. In small surveys that collect only one or two characteristics, or in cases where the estimation formula is very simple, it might be possible to calculate the sample estimates by applying the given formula to the sample data for each estimate required. However, in a survey or census in which a wide range of characteristics is collected, or in which the estimation formula is at all complex, the procedure of applying a formula separately for each estimate required is not feasible. For example, a separate application of the estimation formula would be required for every cell of every published census tabulation based on sample data. In addition, the calculation of each estimate separately would not necessarily lead to consistency between the various estimates made from the same census sample.

Therefore, the approach taken in the census (and in many sample surveys) is to split the estimation procedure into two steps: (a) the calculation of weights (known as the weighting procedure) and (b) the summing of weights to produce estimated population counts. Any mathematical complexity is then contained in step (a) which is performed just once, while step (b) is reduced to a simple process of summing weights which takes place at the time a tabulation is retrieved. It should be noted that since the weight attached to each sample unit is the same for whatever tabulation is being retrieved, consistency between different estimates based on sample data is assured.
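The two-step approach above can be sketched in a few lines. This is a minimal illustration with invented records and field names, not a census layout: step (a) attaches a weight to each sampled unit once, and step (b) produces any estimate by summing weights.

```python
# Step (a): each sampled unit carries one fixed weight, computed once.
sample = [
    {"sex": "F", "tenure": "owned",  "weight": 4.9},
    {"sex": "F", "tenure": "rented", "weight": 5.1},
    {"sex": "M", "tenure": "owned",  "weight": 5.0},
]

def estimate(records, predicate):
    """Step (b): sum the weights of the records matching a predicate."""
    return sum(r["weight"] for r in records if predicate(r))

# Two different tabulations retrieved from the same weighted sample:
totals_by_sex = {s: estimate(sample, lambda r, s=s: r["sex"] == s)
                 for s in ("F", "M")}
totals_by_tenure = {t: estimate(sample, lambda r, t=t: r["tenure"] == t)
                    for t in ("owned", "rented")}

# Because every unit carries a single fixed weight, both tabulations imply
# the same estimated population total, so the estimates are consistent.
```

Any number of tabulations can be retrieved this way without re-running the (potentially complex) weight calculation.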

## 4.2 Theoretical considerations

For a given sample design and a given estimation procedure, one can, from sampling theory, make a statement about the chances that a certain interval will contain the unknown population value being estimated. The primary criterion in the choice of an estimation procedure is minimization of the width of such intervals so that these statements about the unknown population values are as precise as possible. The usual measure of precision for comparing estimation procedures is known as the standard error. Provided that certain relatively mild conditions are met, intervals of plus or minus two standard errors from the estimate will contain the population value for approximately 95% of all possible samples.
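The two-standard-error interval is simple arithmetic; the numbers below are made up for illustration only.

```python
# A weighted sample estimate of a population count and its standard error
# (both invented for this sketch).
point_estimate = 5000.0
standard_error = 100.0

# For roughly 95% of all possible samples, an interval of plus or minus
# two standard errors around the estimate contains the true value.
lower = point_estimate - 2 * standard_error
upper = point_estimate + 2 * standard_error
print(lower, upper)  # 4800.0 5200.0
```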

As well as minimizing standard error, a second objective in the choice of estimation procedure for the census sample is to ensure, as far as possible, that sample estimates for basic (i.e., 2A) characteristics are consistent with the corresponding known population values. Fortunately, these two objectives are usually complementary in the sense that sampling error tends to be reduced by ensuring that sample estimates for certain basic characteristics are consistent with the corresponding population figures. However, while this is true in general, forcing sample estimates for basic characteristics to be consistent with corresponding population figures for very small subgroups can have a detrimental effect on the standard error of estimates for the sample characteristics themselves.

In the absence of any information about the population being sampled other than that collected for sample units, the estimation procedure would be restricted to weighting the sample units inversely to their probabilities of selection (e.g., if all units had a one-in-five chance of selection, then all selected units would receive a weight of 5). In practice, however, one almost always has some supplementary knowledge about the population (e.g., its total size, and possibly its breakdown by a certain variable—perhaps by province). Such information can be used to improve the estimation formula so as to produce estimates with a greater chance of lying close to the unknown population value. In the case of the census sample, a large amount of very detailed information about the population being sampled is available in the form of the basic 100% data at every geographic level. We can take advantage of this wealth of population information to improve the estimates made from the census sample. However, this information can also be an embarrassment in the sense that it is impossible to make the sample estimates for basic characteristics consistent with all the population information at every geographic level. Differences between sample estimates and population values become visible when a cross-tabulation of a sample variable and a basic variable is produced. The tabulation has to be based on sample data with the result that the marginal totals for the basic variable are sample estimates that can be compared with the corresponding population figures appearing in a different tabulation based on 100% data. They will not necessarily agree. These differences are discussed further in Section 4.6 of this report.

## 4.3 Developing an estimation procedure for the census sample

Given that a weight has to be assigned to each unit (person, family or household) in the sample, the simplest procedure would be to give each unit a weight of 5 (because a one-in-five sample was selected). Such a procedure would be simple and unbiased^{1} and, if nothing but the sample data were known, it might be the optimum procedure. However, although we know that the sample will contain almost exactly one-fifth of all dwellings (excluding collective dwellings and those in canvasser areas), one cannot be certain that it will contain exactly one-fifth of all persons, or one-fifth of each type of household, or one-fifth of all females aged 25 to 34, and so on. Therefore, this procedure would not ensure consistency even for the most important subgroups of the population. For large subgroups, these fractions should be very close to one-fifth, but for smaller subgroups they could differ markedly from one-fifth. The next most simple procedure would be to define certain important subgroups (e.g., age-sex groups within province) and, for each subgroup, to count the number of units in the population in the subgroup (N) and the number in the sample (n) and to assign to each sample unit in the subgroup a weight equal to N/n. These subgroups are often called 'post-strata.'

For example, if there were 5,000 males aged 20 to 24 enumerated in Prince Edward Island, and if 1,020 of these fell in the sample of dwellings, then a weight of 5,000/1,020 = 4.90 would be assigned to each male aged 20 to 24 in the sample in Prince Edward Island. This would ensure that whenever sex and age in five-year groups were cross-classified against a sample characteristic for Prince Edward Island, the marginal total for the male 20-24 age-sex group would agree with the population total of 5,000. This type of estimation procedure is known as 'ratio estimation.' By contrast, had a simple weight of 5 been used, it would have resulted in a sample estimate of 5,100 (1,020 x 5).
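The ratio estimation example works out as follows:

```python
# Post-stratum ratio estimation, using the figures from the example above.
population_count = 5000   # males aged 20 to 24 enumerated in P.E.I.
sample_count = 1020       # of whom this many fell in the one-in-five sample

weight = population_count / sample_count   # N/n, about 4.90
print(round(weight, 2))                    # 4.9

# Summing this weight over the sample reproduces the population count,
# unlike the simple weight of 5, which would give 5,100.
print(round(weight * sample_count))        # 5000
print(5 * sample_count)                    # 5100
```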

Adjusting the simple weights of 5 by small amounts to achieve perfect agreement between estimates and population counts is known as calibration. Prior to the 1991 Census, calibration was achieved using a procedure called Raking Ratio estimation. Household level estimates were generated using a household-level calibrated weight while the person-level estimates were generated using a person-level calibrated weight.

In 1991, the two-step Generalized Regression estimator (GREG) was introduced. It achieved a higher level of agreement between population counts and the corresponding estimates at the enumeration area (EA) level than had been possible with Raking Ratio estimation. In addition, a single household level calibrated weight was used to produce both the household and person level estimates. This eliminated inconsistencies that had been observed in some estimates prior to 1991. The two-step GREG estimator was also used in 1996.

In 2001 and 2006, a pseudo-optimal regression estimator was used because it typically gave slightly better agreement between the population counts and estimates than the GREG, while ensuring that the calibrated weights were all greater than or equal to one. See Bankier (2002) for a more detailed comparison of the regression estimators.

## 4.4 The two-step Pseudo-optimal Regression estimator

With the Pseudo-optimal Regression estimator, the initial weights of approximately 5 were adjusted as little as possible for individual dwellings such that there was perfect agreement between the estimates and the population counts for as many of the basic characteristics as possible that are listed in Appendix B. (These will be called constraints or auxiliary variables.) It was required that this perfect agreement be achieved at the weighting area (WA) level. More information on WAs is given in Section 7.1 of this report.

In 2006, Canada was divided into approximately 50,000 collection units to be used in the collection of census data. The collection unit (CU) is similar in size and has similar attributes to the enumeration area (EA) used prior to the 2006 Census. A one-in-five systematic sample of dwellings was selected from most CUs to be used in the census weighting process. Dissemination areas (DAs) are another geographic level similar in size to CUs. Entire DAs were combined to form WAs. On average there are eight DAs and seven sampled CUs in a WA.

There are 34 auxiliary variables used in the regression process. These include five-year age ranges, marital status, common-law status, sex, household size, and dwelling type. See Appendix B for the 34 auxiliary variables. The objectives for the 2006 Census weighting procedure are:

- (a) To have **exact** population/estimate agreement at the WA level for as many of the 34 auxiliary variables as possible.
- (b) To have **approximate** population/estimate agreement for the larger DAs for the 34 auxiliary variables.

In addition, it is required that:

- (c) there be **exact** population/estimate agreement for 'total number of households' and 'total number of persons' for as many DAs as possible
- (d) final census weights be in the range 1 to 25 inclusive. A lower bound of 1 is required because it is felt that each sampled person should, at minimum, represent themselves
- (e) the method to generate weights be highly automated, since the 6,602 WAs with households subject to sampling must be processed in a short period of time. This method must also adjust automatically for the different patterns of responses in WAs across the country.

Weights are calculated separately in each WA by using an automated weighting system. For each WA being processed, a set of user-defined parameters is passed to the system. An initial weight is assigned to each sampled private household in the WA, and these weights have either two or three weighting adjustment factors applied to them. First, households are sometimes post-stratified at the WA level based on household size because small and large households are under-represented in the sample. A second adjustment is then applied to the weights to try to achieve approximate population/estimate agreement at the DA level, as described in objective (b) above. Finally, a third adjustment is applied to achieve exact population/estimate agreement at the WA and DA levels, as described in objectives (a) and (c) above. For simplicity, the dropping of constraints and the various reasons for it will only be discussed once the initial weights and the three adjustments have been described in more detail.

First, an initial CU-level weight is assigned to each private household in the WA. The weight is equal to the number of private households in the CU divided by the number of private households that were sampled in that CU. Since approximately one in five households would be sampled, initial weights tend to be near five. In 2001, senior units were not part of the census weighting process, and were excluded from the sampling process. However, in 2006, senior units were treated similarly to private households, so they made up part of the sampling frame. Since the proportion of senior units in any CU was usually very small, they typically had little effect on the weighting results. However, for a small number of CUs where there was a high proportion of senior units, the private households and the senior residences were treated as two distinct populations, so two sets of initial weights were calculated for each of these CUs in order to reduce sampling bias. Once the initial weights were created, senior units were treated no differently than private households throughout the remainder of the weighting process. When the standard error adjustment factors of Chapter 9 were calculated, however, a CU where the private households and senior units were treated as two distinct populations was considered as two sampling strata rather than one.
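The initial weight calculation can be sketched as follows. Note the assumptions: the 20% senior-unit share used here to trigger the two-population treatment is an invented threshold (the report does not state the actual cut-off), and the function signature is ours.

```python
def initial_weights(private_pop, private_sample, senior_pop=0,
                    senior_sample=0, senior_share_threshold=0.20):
    """Return the initial weight(s) for one CU as {population: weight}."""
    total_pop = private_pop + senior_pop
    if senior_pop and senior_pop / total_pop > senior_share_threshold:
        # High proportion of senior units: two distinct populations,
        # hence two sets of initial weights for this CU.
        return {"private": private_pop / private_sample,
                "senior": senior_pop / senior_sample}
    # Usual case: one weight near 5 for a one-in-five sample.
    sampled = private_sample + senior_sample
    return {"private": total_pop / sampled, "senior": total_pop / sampled}

print(initial_weights(400, 80))        # {'private': 5.0, 'senior': 5.0}
w = initial_weights(300, 62, 200, 38)  # senior share 40%: two distinct weights
```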

In the first adjustment step, households are sometimes post-stratified based on household size (1, 2, 3, 4, 5, or 6+ persons) at the WA level. The initial weights are multiplied by a factor to generate the post-stratified weights. For example, based on the post-stratified weights, the estimated number of one-person households for a WA would agree with the number of one-person households in the WA population. Very occasionally, a post-stratified weight is constrained to ensure that it lies within the range 1 to 20 inclusive. An upper limit of 20 rather than 25 is used to give some 'room' for further adjustment.
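The post-stratification step amounts to scaling the weights within each household-size class, then clamping. The counts below are invented for illustration.

```python
def post_stratify(initial_weight, pop_count, est_count, lo=1.0, hi=20.0):
    """Scale a weight so the size-class estimate matches the population
    count, then keep it inside [1, 20] to leave room for later adjustments."""
    factor = pop_count / est_count       # population count / initial estimate
    return min(max(initial_weight * factor, lo), hi)

# 80 sampled one-person households with initial weight 5 estimate 400,
# against a WA population count of 500 one-person households:
w = post_stratify(5.0, pop_count=500, est_count=400)
print(w)  # 6.25 -> the 80 households now estimate 500

# The constraint to the range [1, 20] binds only very occasionally:
print(post_stratify(5.0, pop_count=2000, est_count=400))  # 20.0
```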

Next, a first-step regression weighting adjustment factor is calculated at the DA level. The 34 auxiliary variables (age, sex, marital status, household size, and dwelling type) that are to be applied at the WA level in the second adjustment step are sorted in descending order based on the number of households they apply to in the population at the DA level. On this ordered list, the odd-numbered constraints (the first, third, and so on) form one group of 17 while the even-numbered constraints form a second group of 17. The resulting weighting adjustment factors for each group of constraints are averaged together and applied to the post-stratified weights (or the initial weights if post-stratification was not done). Population/estimate differences at the DA level for the 34 constraints are usually reduced, but not eliminated, by using the first-step weights.

Finally, a second-step regression weighting adjustment factor is calculated at the WA level. The 34 auxiliary variables are applied at the WA level along with two auxiliary variables (number of households and number of persons) for each DA in the WA to determine the second-step weighting adjustment factors. These are applied to the first-step weights to generate the final weights. Population/estimate differences at the WA level for the 34 auxiliary variables are eliminated or reduced significantly using the final weights.
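The regression adjustments above can be illustrated with a simplified, one-step calibration: weights are moved by a linear (GREG-style) adjustment so that the weighted estimates hit the constraint totals exactly. The production Pseudo-optimal Regression estimator uses a different distance function and enforces weight bounds; this two-constraint, pure-Python version is a sketch only, with invented data.

```python
def calibrate(d, x, totals):
    """d: initial weights; x[i]: (households, persons) for unit i;
    totals: population (households, persons). Returns calibrated weights."""
    n = len(d)
    # Gaps between population totals and current weighted estimates.
    g = [totals[k] - sum(d[i] * x[i][k] for i in range(n)) for k in (0, 1)]
    # Solve the 2x2 system  A @ lam = g  with  A = sum_i d_i x_i x_i^T.
    a = [[sum(d[i] * x[i][j] * x[i][k] for i in range(n)) for k in (0, 1)]
         for j in (0, 1)]
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    lam = [(g[0] * a[1][1] - g[1] * a[0][1]) / det,
           (g[1] * a[0][0] - g[0] * a[1][0]) / det]
    # Calibrated weight: w_i = d_i * (1 + x_i . lam)
    return [d[i] * (1 + x[i][0] * lam[0] + x[i][1] * lam[1]) for i in range(n)]

d = [5.0, 5.0, 5.0]              # initial weights, one per sampled household
x = [(1, 1), (1, 2), (1, 4)]     # (household count, persons) per household
w = calibrate(d, x, totals=(16, 38))
# The weighted estimates now equal the totals: 16 households, 38 persons.
```

The linear adjustment changes each weight as little as possible (in a least-squares sense) while satisfying the constraints, which is the spirit of the census procedure.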

Constraints are discarded in the first and second steps because:

- they are small (they only apply to a few households in the population)
- they are redundant (also called linearly dependent (LD) constraints)
- they are nearly redundant (also called nearly linearly dependent (NLD) constraints)
- they cause outlier weights (weights outside the range 1 to 25 inclusive) during the calculation of the weights.

For example, since the total number of females plus the total number of males equals the total number of persons, the total number of females can be dropped as a redundant or LD constraint since any two of the constraints being satisfied guarantees that the third will also be satisfied. If the 'Marital status = widowed' constraint is dropped for being small (since there are very few widows in the WA), then the sum of the remaining marital status constraints (single, married, divorced, and separated) will nearly equal the total number of persons, suggesting that one constraint from this group of four could perhaps be dropped for being nearly redundant or NLD.
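The sex example can be put in numbers (all invented): once the estimates for 'total persons' and 'males' are calibrated exactly, the 'females' estimate is determined, so its constraint is redundant (LD).

```python
pop = {"persons": 1000, "males": 480, "females": 520}

# Suppose calibration has forced two of the three estimates to agree:
est = {"persons": 1000.0, "males": 480.0}

# Every person is either male or female, so the third agrees automatically:
est["females"] = est["persons"] - est["males"]
print(est["females"])  # 520.0, matching the population count
```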

Initially, a check is done at the WA level for small, LD and NLD constraints, according to the following procedure:

- The size of a constraint is defined by the number of households in the population to which the constraint applies. A constraint whose size is less than or equal to the SMALL parameter (which equalled 20, 30 or 40 households in 2006) is discarded, since estimates for small constraints tend to be very unstable.

- Next, LD constraints are discarded.

- Following this, the condition number of the matrix being inverted to determine the weighting adjustment factors is lowered by discarding NLD constraints. The condition number (see Press et al., 1992) is the ratio of the largest eigenvalue to the smallest eigenvalue of the matrix being inverted. High condition numbers indicate near collinearity among the constraints, which could cause the estimates to be unstable. To lower the condition number, a forward-selection approach is used. The matrix is recalculated based only on the two largest constraints. If the condition number exceeds the COND parameter (which equalled 1,000, 2,000, 4,000, 8,000 or 16,000 in 2006), the second largest constraint is discarded. From here, the next largest constraint is added to the list of constraints being applied, the matrix is recalculated and its condition number determined. If the condition number increases by more than COND, the just-added constraint is discarded. This process continues until all constraints have been checked. If, after dropping these NLD constraints, the condition number exceeds the MAXC parameter (which equalled 10,000, 20,000, 40,000, 80,000 or 160,000 in 2006), additional constraints are dropped. Constraints are dropped in descending order, based on the amount by which they increased the condition number when they were initially included in the matrix. The condition number of the matrix is recalculated every time a constraint is dropped. When the condition number drops below MAXC, no more constraints are dropped. It should be noted that in 2006, MAXC always equalled 10 times the value of COND.

- Any constraints dropped up to this point are not used in the weighting calculations.
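The forward-selection check described above can be sketched as follows. Assumptions of this sketch: it uses numpy (the production system is not numpy-based), ignores the design weights for simplicity, and the constraint data and the threshold of 1,000 are invented.

```python
import numpy as np

def forward_select(columns, cond_threshold=1000.0):
    """Add constraint columns from largest to smallest; discard a just-added
    column if it pushes the condition number of the cross-product matrix
    past the threshold (nearly linearly dependent)."""
    kept = [columns[0]]                  # start from the largest constraint
    for col in columns[1:]:
        trial = np.column_stack(kept + [col])
        if np.linalg.cond(trial.T @ trial) > cond_threshold:
            continue                     # NLD: discard this constraint
        kept.append(col)
    return kept

persons = np.array([1.0, 2.0, 4.0, 1.0, 3.0])
males = np.array([1.0, 1.0, 2.0, 0.0, 2.0])
females = persons - males                # exactly dependent on the other two
kept = forward_select([persons, males, females])
print(len(kept))  # 2 -> the redundant 'females' column was discarded
```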

Next, before calculating the first-step weighting adjustment factors for a DA, any remaining constraints which are small are dropped for that DA. Those that remain are partitioned into two groups, as was previously described. Then, for each group, any linearly dependent constraints are identified and dropped (constraints which are linearly dependent at the DA level may not be linearly dependent at the WA level). The first-step weighting adjustment factors are then calculated for the remaining constraints in each group. If any of the first-step adjusted weights fall outside the range 1 to 25 inclusive, additional constraints are dropped. A method similar to that used to discard NLD constraints is applied here except that a constraint is discarded if it causes outlier weights. In the interest of computational efficiency, the bisection method (see Press et al., 1992) is used to identify which constraints should be dropped.

Finally, the second-step weighting adjustment factors are calculated based on the constraints that were not discarded for being small, linearly dependent or nearly linearly dependent during the initial analysis of the matrix being inverted. If any of the second-step adjusted weights fall outside the range 1 to 25 inclusive, then additional constraints are dropped using the method outlined for the first-step adjustment.

The census weights are calculated independently in each WA. This makes it possible to use a different set of weighting system parameters for each WA (e.g., SMALL, COND, MAXC, whether to post-stratify, whether to use dwelling type constraints). In 1996, an identical set of parameters was used for every WA in the country. In 2001 and 2006, with the increased processing power achieved through running the weighting system on multiple personal computers (PCs), it was decided to calculate the weights for each WA with several different sets of parameters. Two dwelling type constraints were introduced in 2006 due to large discrepancies observed for these characteristics in certain regions in 2001. These constraints were single detached dwellings and apartments in buildings with fewer than five storeys. Although the introduction of new constraints may reduce the discrepancies for these characteristics, it may result in other constraints being dropped in their place, which would result in a larger discrepancy for those other characteristics. Therefore, the use of the dwelling type constraints was parameterized so they would only be used in WAs where they had an overall positive effect on the discrepancies. Twenty different sets of parameters were used to calculate the weights in each WA in 2006: the 10 sets of parameters used in 2001, each applied once with the dwelling type constraints excluded (as in 2001) and once with them included. A statistic was calculated for each set of parameters to determine which set minimized the differences between the population counts and the sample estimates for the constraints. The weights arrived at with this set of parameters were used for the corresponding WA. This process of selecting the best weights on a WA-by-WA basis was called 'cherry-picking' the parameters.
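'Cherry-picking' reduces to running the weighting under several parameter sets and keeping the set with the smallest population/estimate gaps. The discrepancy statistic used here (sum of absolute gaps over the constraints) and the parameter-set names are invented stand-ins; the report does not give the actual formula.

```python
def pick_best(results):
    """results maps a parameter-set name to its (population, estimate) pairs."""
    def discrepancy(pairs):
        return sum(abs(pop - est) for pop, est in pairs)
    return min(results, key=lambda name: discrepancy(results[name]))

results = {
    "set_07_without_dwelling_type": [(500, 489.0), (210, 214.0)],
    "set_07_with_dwelling_type": [(500, 498.0), (210, 211.0)],
}
best = pick_best(results)
print(best)  # set_07_with_dwelling_type (total gap 3 vs. 15)
```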

For more details on regression estimators, see Bankier (2002) and Fuller (2002).

Regression weights are calculated only for sampled-CU private households and sampled senior units that have received the long census questionnaire (one-fifth of these were sampled; four-fifths were not). Sampled-CU private households and senior units that received a short questionnaire receive a weight of 0 because they contain no information on sample variables. All non-sampled CU private households and senior units receive a weight of 1 since 100% of the respondents in these areas provide information on Form 2B or 2D. Collective households also receive a weight of 1. In this report, the term 'household' will refer to a private household or a senior unit unless otherwise specified.
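The weight assignment rules of this paragraph fit in one dispatch function; the boolean descriptors below are invented flags, not census codes.

```python
def census_weight(is_collective, in_sampled_cu, received_long_form,
                  regression_weight=None):
    if is_collective:
        return 1.0            # collective households: weight 1
    if not in_sampled_cu:
        return 1.0            # non-sampled CUs: all respond on Form 2B/2D
    if not received_long_form:
        return 0.0            # short-form households carry no sample data
    return regression_weight  # sampled long-form households

print(census_weight(False, True, True, regression_weight=5.1))  # 5.1
print(census_weight(False, True, False))                        # 0.0
print(census_weight(False, False, True))                        # 1.0
```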

## 4.5 Two-pass processing

For the 1996, 2001 and 2006 censuses, short-form (2A) write-in responses to the relationship variables were not captured due to budgetary constraints. Instead, they were coded under the generic value 'Other.' Long-form (2B) write-in responses to the relationship variables were still captured and coded in the normal fashion.

During two-pass processing, the long-form data are processed in two stages. In the first stage, called 'Pass 1,' the long and short forms are processed together, representing 100% of the data. The captured long-form write-in responses for relationship are ignored and assigned the generic value 'Other' to coincide with the short-form write-in responses. Editing and imputation are performed the same way for both the long and short forms. In the second stage, called 'Pass 2,' only the long forms are processed; the short forms are not available during imputation. The captured long-form write-in responses for relationship are used rather than the 'Other' responses. Because of the availability of the write-in responses, the quality of the results is assumed to be higher in Pass 2 than in Pass 1.

The weighting system uses the Pass 1 results for all households to calculate the household weights. While it might be possible to use the Pass 1 results for the short forms and Pass 2 results for the long forms, this method could bias the census estimates. This is because of differences in the distribution of the responses for the demographic variables between Pass 1 and Pass 2 as a result of the write-in responses for relationship being present in Pass 2. Published census estimates were produced using Pass 1 weights applied to Pass 2 long form imputed results. The difference between the population counts (based on Pass 2 data for the sampled population and Pass 1 results for the remaining 80% of the population) and Pass 2 estimates is small for most constraints. See Table 7.2.2.2, Chart 7.2.2.3, and Chart 7.2.2.4 in Section 7.2.2 for a comparison of Pass 1 and Pass 2 results.

## 4.6 Differences between population counts and final weighted estimates

Final household weights are generated such that the population counts match the weighted estimates for as many characteristics as possible. Characteristics available from both the long and short form for which consistency is attempted include five-year age ranges, sex, marital status, common-law status, household size and dwelling type. The weighting process attempts to control the population/estimate differences at the weighting area (WA) level where WAs typically contain 1,000 to 3,000 dwellings that are subject to sampling.

There are a few reasons why sample estimates may be different from population counts, particularly for small areas. The main ones are listed below:

- Constraints dropped during the regression process: As described in Section 4.4, constraints can be dropped for generating outlier weights, having small counts, or by being linearly dependent or nearly linearly dependent. Constraints which are dropped are not controlled on, and will usually have some difference between the population counts and the estimates.
- Sub-WA areas: The weighting area is the smallest geographic area for which the weighting system attempts to have agreement between the population counts and weighted estimates for as many auxiliary variables as possible. Therefore, small areas that are contained within WAs (such as DAs or very small municipalities) will usually see discrepancies between the population counts and the weighted estimates.

## 4.7 Different universes

There are three separate universes for which the census data may be observed:

- Private dwellings – Consists of private households and senior units that were subject to sampling. These households were used in the creation of the final household level weights. The majority of the information that is presented in this publication represents the private dwelling universe.
- Private dwellings and non-institutional collectives – Consists of sampled private households and senior units, non-institutional collectives, and also private households and senior units from non-sampled CUs. The additional persons in this universe all received a long form questionnaire, and therefore have 2B data present. This universe is used for all census publications related to sampled variables.
- Private and collective dwellings – Consists of all private households and senior units (sampled and non-sampled) and all collectives (institutional and non-institutional). Residents of institutional collectives answer the short form questionnaire and therefore do not have sampled data available. For this reason, this complete universe is used for all census publications related to basic variables (questions asked on both the short and long forms) but cannot be used for sampled publications.

The institutional collectives population accounts for some of the differences that will be observed when comparing a 2B publication to a 2A publication. The counts and estimates for the three universes discussed above can be found in Table 7.2.2.3.

**Notes:**