# Acknowledgements

This blog couldn’t have been written without the scripts published by the authors of Sturgis et al 2017^{1} and the responses to my questions provided by people at ‘Centre d’Estudis d’Opinió’ (CEO).

# How many people voted?

The official data offered by the Catalan Government a few days after the independence referendum registered a turnout of almost 2.3 million votes or 43.03% of the census^{2}. This count, however, missed the ballot boxes seized by Spanish police forces. The then spokesman of the Catalan Government said that, under ‘normal circumstances’, participation could easily have reached 55%^{3}.

The real number of votes is thus still unknown, and very little survey data on this issue is publicly available. The main exception is the ‘barometer’ survey of the ‘Centre d’Estudis d’Opinió’ (CEO). To my knowledge, this poll is the only one that asked respondents about their participation in the 1-O referendum.

The CEO survey collected data during the last two weeks of October and used quota sampling with sample allocation proportional to population estimates^{4}. The sample size in three out of the four provinces was too small to compute reliable estimates. The sample size of the Province of Barcelona, however, was close to 1,000 respondents and should allow for a decent calculation with the methods presented below.

According to the official information from the Catalan Government, the province had a turnout of 41.2% or around 1.6 million votes. As shown in Figure 1, my estimate from the CEO survey data is that the turnout was most likely between 50.6% and 56.8%. This would mean that between 2 and 2.26 million people voted in this province.

Even knowing that some ballot boxes were taken away by Spanish police forces, the difference between the official information and the CEO survey is large. Thus, estimates from the survey should be taken with a grain of salt. We know that people tend to lie about their participation in elections, stating that they voted when they actually didn’t^{5}. Moreover, a quota sample like the one used by CEO is based on assumptions that might have been violated in this case. These assumptions and the exact methods used to compute the estimates are explained in the next sections in a slightly more technical way. The last part of this blog also compares the current design of the survey with simpler alternatives.

*Note: See section ‘Estimates of turnout in 1-O referendum’ below for details on the computation of these estimates.*

# Quota sampling

Before explaining the particularities of quota sampling, let me give a brief overview of the concepts of probability and non-probability sampling. These concepts are key to understanding the large difference between the official data and CEO’s survey estimates.

## Probability vs non-probability sampling

Put concisely, there are two approaches to survey sampling: probability and non-probability sampling. In the survey industry the first one is still the gold standard. In probability sampling, we draw a random sample of units from a population. This might take various forms (e.g. within strata, proportional to some variable), but the important fact is that we know the probability of each unit being selected. In probability sampling, all units should have a chance to be in the survey (i.e. their probability of inclusion should not be equal to 0). The great advantage of probability sampling is that it’s unbiased. That is, if the sampling were repeated many times, the expected value of the estimates from the samples would be identical to the value for the whole population. Thus, the sample will tend to be (on average) a small-scale representation of the population from which we can obtain unbiased estimates.
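
As a rough illustration of what ‘unbiased’ means here, the following toy simulation (my own sketch in Python, not part of the analysis; the population of 10,000 people and the 43% turnout are made-up numbers) draws many simple random samples and checks that their estimates average out to the population value:

```python
import random

def srs_estimates(population, n, repetitions, seed=0):
    """Draw `repetitions` simple random samples (without replacement) of size n
    and return the sample proportion obtained from each of them."""
    rng = random.Random(seed)
    return [sum(rng.sample(population, n)) / n for _ in range(repetitions)]

# Hypothetical population: 1 = voted, 0 = did not vote.
population = [1] * 4300 + [0] * 5700
true_value = sum(population) / len(population)

estimates = srs_estimates(population, n=500, repetitions=2000)
average_estimate = sum(estimates) / len(estimates)

print(f"population value: {true_value:.3f}")
print(f"average of the sample estimates: {average_estimate:.3f}")
```

Any single sample can miss the population value by a couple of percentage points, but the average over repeated samples should sit very close to it.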

Non-probability samples are those in which the selection of units is not randomised according to principles of probability theory. There is a wide variety of non-probability sampling methods^{6}. They are extensively used in market research but considered of poor quality in many other fields. Unlike probability sample surveys, they don’t have the ‘unbiasedness’ characteristic by design. Surveys of this type tend to rely on adjustments made by statisticians once the sample is collected. Making inferences from non-probability surveys requires some reliance on modelling assumptions and thus significant statistical expertise.

Three main issues of probability samples are coverage error, non-response error and their cost. The first two are problems that affect the capacity of the sample to properly represent the population. Coverage error is caused by some units not being in the list of units from which the random sample is selected^{7}. Non-response error occurs when the survey fails to collect responses for some of the units. This increases the uncertainty of estimates due to smaller sample sizes and risks biasing the survey, as certain profiles of respondents might have a larger propensity to respond than others. Some statisticians have pointed out that, due to non-response, no survey is really a probability sample^{8}. The same voices say that the low response rates that we see in certain research areas might make probability samples worthless.

## Quota sampling and the CEO survey

The survey produced by CEO used quota rather than probability sampling. Quotas specify a number of respondents in each category, so that the final sample matches the known distributions in the population. Quotas were applied to size of municipality and to crossed categories of age, gender and place of birth. From the technical specifications of the survey and CEO’s response to my queries about their sampling procedure I understood that:

- They first selected the municipalities within each province. They did this by sampling random municipalities within strata according to their size.
- The interviewers were sent to do ‘random walks’^{9} in the sampled municipalities. The starting point was selected at random from a list of streets.
- Interviewers then collected responses with face-to-face interviews until the quotas by crossed categories of age, gender and place of birth were filled.

In this particular survey, some quotas could not be filled because of low response rates and budget constraints. Thus, CEO researchers applied a calibration step after the responses were collected.

The main problem with quota sampling is that the selection of respondents within quotas is not randomised. Interviewers can replace any potential respondent with another one with the same characteristics in the quota variables. The estimates from a quota sampling survey will be biased if respondents within quotas differ from those in the population. Therefore, the estimate of the proportion of people that voted in the 1-O referendum will be biased if, within each quota (e.g. males aged 18 to 24 born in Catalonia), the proportion of respondents that voted isn’t the same as that of people with the same characteristics in the population.
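
A small numerical sketch may make this clearer. It is my own toy example in Python with invented numbers (two quota cells, and an assumed `voter_over_representation` factor describing how much more likely voters are to end up interviewed than non-voters in the same cell). The quotas are filled exactly, yet the estimate is off whenever that factor differs from 1:

```python
# Hypothetical quota cells: (share of the population, true turnout in the cell).
cells = {"young": (0.3, 0.40), "old": (0.7, 0.50)}

def expected_quota_estimate(cells, voter_over_representation):
    """Expected turnout estimate when quotas match the population shares exactly
    but, within each cell, voters are `voter_over_representation` times more
    likely to be interviewed than non-voters."""
    estimate = 0.0
    for share, turnout in cells.values():
        voters = voter_over_representation * turnout
        sampled_turnout = voters / (voters + (1 - turnout))
        estimate += share * sampled_turnout
    return estimate

true_turnout = sum(share * turnout for share, turnout in cells.values())
no_selection = expected_quota_estimate(cells, 1.0)    # respondents resemble their cell
with_selection = expected_quota_estimate(cells, 1.5)  # voters easier to reach

print(f"true: {true_turnout:.2f}, no selection: {no_selection:.2f}, "
      f"with selection: {with_selection:.2f}")
```

No amount of quota filling or reweighting on the quota variables removes this kind of bias, because it happens inside the cells.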

## Variance estimation for quota sampling

As important as providing an estimate of the proportion of people that voted is the capacity to provide an estimate of its uncertainty. This means that we don’t only want to know how many people voted, but also how confident we are in this estimate. This is the point where the methodology gets trickier and the actual reason why I started doing this analysis.

Most researchers compute the uncertainty of estimates as if the data came from a simple random sample of the population^{10}. Thus, they do not take into account the actual design of the survey. Sturgis et al. (2017) suggest a method for calculating the precision of estimates from quota samples. This method is based on ‘bootstrap resamples’, i.e. resamples with replacement taken from the survey data. In their article, Sturgis et al. explain it in the following way:

- draw M independent samples by sampling respondents from the full achieved sample, with replacement and in a way which matches the quota sampling design;
- for each sample thus drawn, calculate the point estimates of interest in the same way as for the original sample, including calibration and turnout weighting;
- use the distribution of the estimates from the M resamples to quantify the uncertainty in the poll estimates.

For drawing the M independent resamples simulating quota designs (the first point above), Sturgis et al. (2017) proposed an algorithm on page 8 of their article (also available from their R code). Explained in my own words, this would be:

1. Set quota targets for the quota variables of the survey. Quota variables, unlike strata, don’t need to form mutually exclusive categories^{11}.
2. Do a first iteration of resampling with replacement, with size equal to the total number of observations in the survey. The order of the resampled units is important for the next steps, so do not sort the resample or rearrange the sequence in any way.
3. Drop those responses from the first-iteration resample that overfill a quota category and retain the rest^{12}.
4. Do another iteration of resampling to replace the respondents dropped in the previous point. Use only respondents from non-completed quotas as the pool of potential resampled units.
5. Repeat points 3 and 4 iteratively until all quotas are full or until there are no respondents in the survey who fit the non-completed quotas. In this latter case, the resample will be shorter than the original survey.
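
The steps above can be sketched in a few lines of Python. This is only my reading of the algorithm, not the R code that accompanies Sturgis et al. (2017). The function name `quota_bootstrap`, the toy respondents and the quota targets are all hypothetical, and I assume that a respondent ‘fits’ only if none of their quota categories is already full:

```python
import random

def quota_bootstrap(sample, targets, rng):
    """Draw one bootstrap resample that respects the quota targets.

    sample  : list of dicts, one per respondent
    targets : {quota variable: {category: maximum number of respondents}}
    """
    counts = {var: {cat: 0 for cat in cats} for var, cats in targets.items()}

    def fits(unit):
        # A unit fits if none of its quota categories is already full.
        return all(counts[var][unit[var]] < targets[var][unit[var]] for var in targets)

    resample = []
    pool = list(sample)
    needed = len(sample)
    while needed > 0 and pool:
        # Resample with replacement; the draw order decides who gets dropped.
        for unit in rng.choices(pool, k=needed):
            if fits(unit):
                resample.append(unit)
                for var in targets:
                    counts[var][unit[var]] += 1
        needed = len(sample) - len(resample)
        # Only respondents who still fit the open quotas can be drawn next.
        pool = [unit for unit in sample if fits(unit)]
    return resample

# Hypothetical survey of 8 respondents with two non-crossed quota variables.
sample = (
    [{"sex": "m", "age": "young"}] * 2
    + [{"sex": "m", "age": "old"}] * 2
    + [{"sex": "f", "age": "young"}] * 1
    + [{"sex": "f", "age": "old"}] * 3
)
targets = {"sex": {"m": 4, "f": 4}, "age": {"young": 3, "old": 5}}
resamples = [quota_bootstrap(sample, targets, random.Random(seed)) for seed in range(200)]
```

Because the quota variables are not crossed here, some resamples can end up shorter than the original sample, which is the situation described in the last step.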

## Estimation of 1-O turnout using bootstrap resamples

To estimate the participation in the 1-O referendum using the CEO sample I used the method proposed in Sturgis et al. (2017) and the R code accompanying the article. The scripts with the code for my analysis can be found in the GitHub repository for this article. In summary, after cleaning and rearranging the data I computed the bootstrap resamples, simulating 10,000 different samples with the quota design. Some of these bootstrap resamples had slightly fewer observations than the initial data because no respondents were available to fit the non-completed quotas (explained in point 5 of the algorithm above). The analysis of these resamples should represent the variability in the quota sampling process if it were implemented a large number of times. An important difference from CEO’s original sampling design is that I included the size of municipalities as another quota (with its categories not crossed with other quota variables). Ideally, there should probably be a first step with a resample of municipalities^{13}. The publicly available data, however, didn’t have an identification variable for all municipalities.

After I finished drawing the bootstrap resamples simulating the quota design, I calibrated these using official statistics data from IDESCAT, the official body responsible for collecting and publishing statistics in Catalonia. I performed calibration using: 1. Crossed categories of age and place of birth and 2. First language of respondent^{14}.

Calibration for each resample as well as for the main survey was implemented with a raking algorithm. This procedure is different from that used by CEO when computing their survey weights. They use a post-stratification calibration with crossed categories of age, place of birth and first language^{15}. I didn’t repeat their procedure for two reasons:

- I couldn’t find estimates from official statistics for the cross-classification of these three variables.
- Their procedure results in some very small cells and thus some extreme weights.
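
For readers unfamiliar with raking, here is a minimal sketch of the idea in Python. It is not the code used in my R scripts. The respondents, the variable names and the population shares are invented, and the function assumes every category in `margins` appears at least once in the sample:

```python
def rake(sample, margins, iterations=100):
    """Iterative proportional fitting: return one weight per respondent so that
    the weighted shares match `margins` = {variable: {category: population share}}."""
    weights = [1.0] * len(sample)
    for _ in range(iterations):
        for var, shares in margins.items():
            total = sum(weights)
            weighted = {cat: 0.0 for cat in shares}
            for unit, w in zip(sample, weights):
                weighted[unit[var]] += w
            # Rescale so this variable's weighted shares hit their targets.
            weights = [
                w * shares[unit[var]] * total / weighted[unit[var]]
                for unit, w in zip(sample, weights)
            ]
    return weights

# Hypothetical respondents and hypothetical population shares.
sample = (
    [{"age": "young", "language": "catalan"}] * 4
    + [{"age": "young", "language": "spanish"}] * 1
    + [{"age": "old", "language": "catalan"}] * 3
    + [{"age": "old", "language": "spanish"}] * 2
)
margins = {
    "age": {"young": 0.4, "old": 0.6},
    "language": {"catalan": 0.45, "spanish": 0.55},
}
weights = rake(sample, margins)
```

Raking only needs the marginal distribution of each calibration variable, which is why it sidesteps the lack of official figures for the full cross-classification.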

Some resamples had some large weights, so I applied a minor trim to limit the most extreme weights^{16}. The code accompanying Sturgis et al. (2017) does not include this step. I think it’s reasonable to include it, however, to make simulations close to the actual process which they are meant to reproduce. We would expect researchers to trim the weights of their surveys if they found some extreme weights. At the same time, some resamples will need more trimming than others. Thus, we should incorporate this uncertainty in the variance computation. The final estimates, however, were computed using both trimmed and untrimmed weights for robustness checks. This sensitivity analysis showed that trimming weights had only a very small impact on point estimates.
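
The trimming rule can be sketched as follows. This is a simplified Python illustration, not the code in my R script. The way the cap is computed and the final rescaling are my assumptions for the example, and the weights are invented:

```python
import math

def trim_weights(weights, top_share=0.005, max_ratio=8.0):
    """Cap the largest weights, then rescale to keep the original total."""
    ordered = sorted(weights)
    # Value at (roughly) the 99.5th percentile: weights above it are capped.
    index = max(0, math.ceil(len(ordered) * (1 - top_share)) - 1)
    cap = ordered[index]
    # Also make sure no weight exceeds max_ratio times the smallest weight.
    cap = min(cap, max_ratio * ordered[0])
    trimmed = [min(w, cap) for w in weights]
    # Rescaling by a constant leaves the max/min ratio unchanged.
    scale = sum(weights) / sum(trimmed)
    return [w * scale for w in trimmed]

# Hypothetical calibration weights with two extreme values.
weights = [0.5] * 100 + [1.0] * 98 + [6.0, 9.0]
trimmed = trim_weights(weights)

print(f"max/min before: {max(weights) / min(weights):.1f}")
print(f"max/min after:  {max(trimmed) / min(trimmed):.1f}")
```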

For computing the bootstrap adjusted percentile intervals (see section below) I also computed and calibrated jackknife resamples from the CEO survey. For these, as for the main sample, I left the weights untrimmed as they didn’t have extreme values.

# Estimates of turnout in 1-O referendum

Following the methods implemented by Sturgis et al. (2017) I calculated three kinds of bootstrap confidence intervals. These three kinds of CIs are shown in the table below as well as in the figure at the top of the blog:

- Bootstrap symmetric normal interval (Normal): Uses the standard deviation from the bootstrap estimates as Standard Error.
- Bootstrap standard percentile interval (Percentile): Uses the 2.5% (lower bound) and 97.5% (upper bound) percentiles of the bootstrap distribution of estimates.
- Bootstrap bias-corrected and accelerated interval (BCa): It is also based on percentiles from the bootstrap distribution of estimates, but the percentiles chosen depend on the ‘acceleration’ and ‘bias-correction’ constants. These changes are meant to correct deficiencies of the other bootstrap methods^{17}.

Table 1 below shows that the differences between the three types of confidence intervals are really small, meaning that the variance estimation doesn’t really depend on the exact method used. They all point at an estimated turnout in the province roughly between 50% and 57%. Again, this is way above the official proportion of 41% and thus difficult to believe. The fact that the confidence intervals are so distant from the official data would seem to rule out sampling error as a main reason for the difference between official and survey estimates. In other words, it’s very unlikely that, if the survey were repeated under exactly the same conditions, it would produce an estimate of turnout around 41%. Measurement error (i.e. people declaring that they voted when they actually didn’t) and/or a biased sample seem more plausible sources of error for the CEO survey. A way of checking the latter source of error would be to include further variables in the survey which can be benchmarked against official estimates.
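
To make the three definitions more concrete, here is a compact Python sketch of how such intervals can be obtained from a set of bootstrap and jackknife estimates. It follows the textbook formulas in Efron & Tibshirani (1993) rather than the R code used for the analysis. The data are invented (968 unweighted answers, 523 of them ‘voted’), and for brevity the resamples are plain resamples with replacement rather than the quota design:

```python
import random
from statistics import NormalDist, mean, stdev

def percentile(sorted_values, q):
    """Value at quantile q, using the nearest position in the sorted list."""
    index = round(q * (len(sorted_values) - 1))
    return sorted_values[min(len(sorted_values) - 1, max(0, index))]

def bootstrap_intervals(estimate, boot, jack, level=0.95):
    nd = NormalDist()
    alpha = (1 - level) / 2
    z = nd.inv_cdf(1 - alpha)
    boot_sorted = sorted(boot)

    # Normal interval: bootstrap standard deviation used as the standard error.
    se = stdev(boot)
    normal = (estimate - z * se, estimate + z * se)

    # Percentile interval: 2.5% and 97.5% points of the bootstrap distribution.
    perc = (percentile(boot_sorted, alpha), percentile(boot_sorted, 1 - alpha))

    # BCa: bias-correction from the share of bootstrap estimates below the
    # original estimate, acceleration from the jackknife estimates.
    below = sum(b < estimate for b in boot) / len(boot)
    below = min(max(below, 1 / (len(boot) + 1)), len(boot) / (len(boot) + 1))
    z0 = nd.inv_cdf(below)
    jack_mean = mean(jack)
    numerator = sum((jack_mean - j) ** 3 for j in jack)
    denominator = 6 * sum((jack_mean - j) ** 2 for j in jack) ** 1.5
    a = numerator / denominator if denominator else 0.0

    def adjusted(zq):
        return nd.cdf(z0 + (z0 + zq) / (1 - a * (z0 + zq)))

    bca = (percentile(boot_sorted, adjusted(-z)), percentile(boot_sorted, adjusted(z)))
    return {"normal": normal, "percentile": perc, "bca": bca}

# Hypothetical data: 1 = declared having voted, 0 = did not.
data = [1] * 523 + [0] * 445
n, total = len(data), sum(data)
estimate = total / n
rng = random.Random(1)
boot = [sum(rng.choices(data, k=n)) / n for _ in range(2000)]
jack = [(total - x) / (n - 1) for x in data]  # leave-one-out estimates
intervals = bootstrap_intervals(estimate, boot, jack)
```

With a well-behaved, roughly symmetric bootstrap distribution like this one, the three intervals should come out almost identical.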

# How much more efficient was the Quota sample?

As an extra step of the analysis, I compared the estimated variance from the bootstrap samples using the quota design (seen in previous sections) with that of bootstrap samples assuming a “simple random sample” (SRS) design^{18}^{19}. The latter is a mere resample with replacement from the initial survey respondents, with weights computed for each resample. Comparing the properties of estimates computed under both designs should give us an insight into the theoretical ‘gains’ of using a quota sample instead of just interviewing the same number of people without quotas. This exercise is similar to what Kuha and Sturgis presented last year in the ESRA ‘Conference on Inference for Non-Probability Samples’^{20} and could be used to assess the impact of changes in the design of surveys. For example, we could analyse the effect of using different variables in the quotas and/or calibration steps. Similar procedures are also common for testing other sampling design changes, such as how accurate surveys would be if the sample size increased or decreased^{21}.

Figure 2 shows the distribution of bootstrap estimates, comparing the quota and “SRS” designs. Summary estimates for the variance of these distributions can be found in Table 2 below. The weights clearly pushed the estimates of turnout in the referendum down. Comparing the upper and lower graphs, we see that the unweighted estimates are centred around 61%, while the weighted ones are around 54%. This means that the original sample underrepresented profiles of people that had a lower propensity to vote in the referendum. It also means that the variables used in the weighting procedure (age crossed by place of birth, and first language) are strongly related to the probability of people voting in the referendum. This won’t come as a great surprise to people with knowledge of Catalan politics.

The distributions in the upper graph compare 10,000 unweighted bootstrap estimates. We see that without quotas, the estimates are more dispersed. The standard deviation of the distribution without quotas is 1.56% and that of the distribution with quotas is 1.44%. Thus, the estimated standard error using the quota sample is 0.92 times what it would be if the survey didn’t use quotas. In survey statistics jargon, this ratio is usually called the ‘design factor’. In terms of effective sample size, this would mean that the initial sample of 968 respondents in Barcelona province obtained from a quota design would be as informative as ~1144 respondents obtained from a design without quotas (SRS)^{22}.
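
The arithmetic behind these two figures is short enough to spell out. The snippet below is a Python sketch using only the numbers quoted in the paragraph above. Rounding the ratio to two decimals before squaring is my assumption about how the ~1144 was obtained:

```python
sd_quota = 1.44  # SD of the unweighted bootstrap estimates, quota design (%)
sd_srs = 1.56    # SD of the unweighted bootstrap estimates, "SRS" design (%)
n = 968          # respondents in the Province of Barcelona

# Design factor: ratio of the two estimated standard errors.
deft = round(sd_quota / sd_srs, 2)

# Effective sample size: n divided by the squared design factor.
n_eff = n / deft ** 2

print(f"design factor: {deft}")
print(f"effective sample size: {n_eff:.0f}")
```

Using the unrounded ratio instead gives an effective sample size of about 1136, so the third digit of these figures should not be over-interpreted.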

This advantage in efficiency of the quota design seems to almost disappear once the resamples are weighted to match population proportions in auxiliary variables. This can be seen in the lower graph of Figure 2 and summarised in Table 2. The ratio between the estimated SE of the design with quotas and that without these is close to 1, meaning that the gain in information would have been small, with a gain in effective sample size equivalent to 20 respondents. This is slightly surprising (although plausible) as the weights applied to bootstrap resamples under “SRS” design were slightly larger than those computed for resamples with quota designs.

# Summary, discussion and limitations

The (survey) data and methods used here are unlikely to have produced a good estimate of turnout in the 1st of October referendum. The distance from the official estimates is very large and almost certainly not due to sampling error. Thus, the CEO survey data most probably has underlying problems that persist even after being adjusted with population information from official statistics. Two usual suspects here might be 1) measurement error (i.e. people lying about their participation in the referendum) and 2) violation of the assumptions of the quota design (i.e. some sort of bias/self-selection mechanism).

Hopefully, the blog presented a way of calculating uncertainty of surveys implementing quota sampling. This bootstrap technique was suggested in Sturgis et al. (2017). Similar applications can also be used to compare the efficiency of different survey designs as presented by Kuha, J. and Sturgis, P. (2017).

A limitation that wasn’t addressed in this blog is the presence of missing values in the dependent variable (participation in the 1-O referendum). It would certainly have been useful to check for patterns of missing data and, if needed, use some imputation technique combined with the bootstrap resamples.

Sturgis, P., Kuha, J., Baker, N., Callegaro, M., Fisher, S., Green, J., Jennings, W., Lauderdale, B. E., and Smith, P. (2017) An assessment of the causes of the errors in the 2015 UK General Election opinion polls. Journal of the Royal Statistical Society A. Article available here and code available here.

Official data on the referendum participation can be found here.

The technical specifications of the survey can be found here (in English) and with greater detail here (in Catalan).

See examples of research done on this issue in Holbrook, A.L. & Krosnick, J.A. (2009) Social Desirability Bias in Voter Turnout Reports: Tests Using the Item Count Technique. Available here.

Some of these designs are explained and commented on in the report from the American Association for Public Opinion Research (AAPOR) on Non-Probability Sampling.

This list is commonly known as the sampling frame.

See for example: This article by Andrew Gelman.

I.e. they compute the Standard Error of the estimate under simple random sampling and no finite population correction using the formula \(SE = \sqrt{\frac{s^2}{n} }\) or \(SE = \sqrt{\frac{p(1-p)}{n} }\) for proportions.

For example, we can have a sample which resembles the population by categories in quota variable A and quota variable B but not for crossed categories of A and B.

For example, if the quota target was 10 males 18-24 and the resample included 13 of these, drop the last 3.

This would be something similar to the bootstrap for Stratified or Multistage Sampling explained in pages 207 to 213 of Wolter, K. M. (2007) Introduction to Variance Estimation, 2nd edn. New York: Springer.

The language that respondents first used when they were children.

See the table at the end of the technical report of the survey here (in Catalan).

I trimmed the top 0.5% of weights for all samples and limited the ratio between the minimum and maximum weight in each resample to be equal to 8. The code for this can be found in script 03_calibrate_bootstrap_resamples.R

This correction uses both the bootstrap and the jackknife resample estimates. For more details see page 184 of Efron, B. & Tibshirani, R.J. (1993) An Introduction to the Bootstrap. New York: Chapman & Hall.

It is actually not a ‘simple random sample’ as there is no randomisation of respondents in a non-probability sample. However, I’ll use this terminology for consistency with the presentation given by Kuha, J. & Sturgis, P. (2017) A bootstrap method for estimating the sampling variation in point estimates from quota samples. Available here.

SEs have been calculated here as the standard deviation of the bootstrap estimates.

This is mentioned in page 19 of Hesterberg, T. (2014) What Teachers Should Know about the Bootstrap: Resampling in the Undergraduate Statistics Curriculum, available here.

The effective sample size is calculated as: \(n_{eff} = \frac{n}{DEFT^{2}}\)