Sample selection methods
Sample selection methods are the specific methods used to select the records contained in a sample.
For record sampling and monetary unit sampling, Analytics supports three sample selection methods:
- fixed interval
- cell
- random
For classical variables sampling, the random selection method is the only possibility.
Sample selection method versus sampling type
It is important to understand the distinction between sample selection method and sampling type.
Sampling type refers to the overall statistical method used to arrive at an estimate about a population.
Sample selection method refers to the way in which records are drawn from a population for inclusion in a sample.
Sampling type | Sample selection methods available | Details |
---|---|---|
Record sampling | fixed interval, cell, random | The records contained in the sample are directly selected |
Monetary unit sampling | fixed interval, cell, random | The records contained in the sample are those that correspond to the selected monetary units |
Classical variables sampling | random | The records contained in the sample are directly selected |
Fixed interval selection method
With the fixed interval selection method, an initial monetary unit or record is selected, and all subsequent selections are a fixed interval or distance apart – for example, every 5000th monetary unit, or every 20th record, after the initial selection.
To use the fixed interval selection method, you specify:
- The interval value that Analytics generates when you calculate the sample size
- A start number greater than zero and less than or equal to the interval value
The start number and the interval value are used to select which records are contained in the sample.
Note
If you want Analytics to randomly select a start number, you can enter a start number of ‘0’, or leave the start number blank.
Example
If 62 is the interval generated by Analytics, and you choose 17 as the start number, the following monetary units, or record numbers, are selected:
- 17
- 79 (17+62)
- 141 (79+62)
- 203 (141+62)
- and so on
Each selection is the same distance, or fixed interval, apart.
For monetary unit sampling, the actual record numbers selected are the ones that correspond to the selected monetary units. For more information, see How monetary unit sampling selects records.
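The arithmetic in the example above can be sketched in a few lines of Python. This only illustrates the selection pattern and is not the Analytics implementation; the function name `fixed_interval_selections` and the population size of 250 are assumptions made for the example.

```python
def fixed_interval_selections(start, interval, population_size):
    """Return the positions reached by stepping a fixed interval from `start`."""
    if not 1 <= start <= interval:
        raise ValueError("start must be greater than zero and no greater than the interval")
    # range() stops before population_size + 1, so no selection exceeds the population.
    return list(range(start, population_size + 1, interval))

# Hypothetical population of 250 monetary units or records.
selections = fixed_interval_selections(start=17, interval=62, population_size=250)
print(selections)  # [17, 79, 141, 203]
```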
Considerations
When you use the fixed interval selection method, you need to be alert to any patterns in the data. Because a fixed interval is used for sample selection, a non-representative sample can result if the data has a pattern that coincides with the interval you specify.
For example, suppose you sample expenses using an interval of $10,000, and the same expense category appears at ten-thousand-dollar intervals in the file. In that case, all the selected records come from a single expense category. This type of scenario is uncommon, but you should be aware that it could occur.
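A minimal sketch of this risk, using record numbers rather than dollar amounts to keep it short. The file layout, category names, interval, and start number are all hypothetical.

```python
from collections import Counter

# Hypothetical file: 200 records whose expense category repeats every 5 records.
categories = ["travel", "meals", "supplies", "software", "training"]
records = [categories[(n - 1) % 5] for n in range(1, 201)]  # record n is records[n - 1]

# The interval (20) is a multiple of the 5-record cycle, so the pattern and
# the interval coincide.
interval, start = 20, 3
selected = [records[n - 1] for n in range(start, len(records) + 1, interval)]

selected_counts = Counter(selected)
print(selected_counts)  # Counter({'supplies': 10})
```

Every selection falls on the same category, even though the file contains five categories in equal proportion.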
Cell selection method
With the cell selection method, the data set is divided into multiple equal-sized cells or groups, and one monetary unit, or one record, is randomly selected from each cell.
To use the cell selection method, you specify:
- The interval value that Analytics generates when you calculate the sample size
- A seed value used to initialize the random number generator in Analytics
The interval value dictates the size of each cell. The random number generator specifies which monetary unit or which record number is selected from each cell.
Note
If you want Analytics to randomly select a seed value, you can enter a seed value of ‘0’, or leave the seed value blank.
Example
If 62 is the interval generated by Analytics, one monetary unit, or one record number, is randomly selected from each of the following cells:
- cell 1 (1 to 62)
- cell 2 (63 to 124)
- cell 3 (125 to 186)
- and so on
Each selection is a random distance apart, but constrained within its cell.
For monetary unit sampling, the actual record numbers selected are the ones that correspond to the selected monetary units. For more information, see How monetary unit sampling selects records.
The seed value
If you specify a seed value, it can be any number. Each unique seed value results in a different set of random numbers. If you respecify the same seed value, the same set of random numbers is generated. Explicitly specify a seed value, and save it, if you want to replicate a particular sample selection.
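The cell selection and seed behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the Analytics algorithm: the function name `cell_selections` is hypothetical, the handling of a short final cell is a guess, and Python's random number generator will not produce the same numbers as Analytics for a given seed.

```python
import random

def cell_selections(interval, population_size, seed):
    """Pick one random position from each consecutive cell of width `interval`."""
    rng = random.Random(seed)  # the seed initializes the generator
    picks = []
    for cell_start in range(1, population_size + 1, interval):
        # Assumption: a short final cell is clamped to the population size.
        cell_end = min(cell_start + interval - 1, population_size)
        picks.append(rng.randint(cell_start, cell_end))
    return picks

picks = cell_selections(interval=62, population_size=186, seed=12345)
repeat = cell_selections(interval=62, population_size=186, seed=12345)

print(picks)            # one pick in each of 1-62, 63-124, 125-186
print(picks == repeat)  # True: the same seed reproduces the same selections
```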
Considerations
The main advantage of the cell selection method over the fixed interval selection method is that it avoids problems related to patterns in the data.
For monetary unit sampling, two disadvantages exist:
- Amounts can span the dividing point between two cells, which means they could be selected twice, yielding a less consistent sample than the sample generated by the fixed interval method.
- Larger amounts that are less than the top stratum cutoff have a slightly reduced chance of being selected.
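The first disadvantage can be illustrated with a small sketch. It assumes that each record owns a consecutive run of monetary units equal to its absolute amount, which is a common way to describe monetary unit sampling; the topic How monetary unit sampling selects records is the authoritative description. The amounts and the two picked units are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate

amounts = [40, 50, 30, 66]                # hypothetical absolute amounts
upper_bounds = list(accumulate(amounts))  # [40, 90, 120, 186]

def record_for_unit(unit):
    """Return the 1-based record whose cumulative unit range contains `unit`."""
    return bisect_left(upper_bounds, unit) + 1

# With an interval of 62, cell 1 is units 1-62 and cell 2 is units 63-124.
# Record 2 covers units 41-90, so it spans the boundary between the two cells.
pick_from_cell_1, pick_from_cell_2 = 55, 70  # hand-picked possible outcomes
print(record_for_unit(pick_from_cell_1), record_for_unit(pick_from_cell_2))  # 2 2
```

Both picks are valid for their cells, yet they map to the same record, so the amount is selected twice.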
Random selection method
With the random selection method, all monetary units or records are randomly selected from the entire data set, or from each stratum if you are using classical variables sampling.
Note
Do not use the random selection method with monetary unit sampling if you intend to use Analytics to evaluate any misstatement detected in the resulting sample. Evaluating monetary unit samples requires that you use the fixed interval or the cell selection methods.
To use the random selection method, you specify:
- The sample size, as calculated by Analytics – that is, the number of monetary units or records to select
- A seed value used to initialize the random number generator in Analytics
- The population size – that is, the total absolute value of the sample field, or the total number of records in the data set
For classical variables sampling, sample size and population size can be automatically prefilled by Analytics.
The random number generator specifies which monetary units or which record numbers are selected from the data set. Each selection is a random distance apart.
Note
If you want Analytics to randomly select a seed value, you can enter a seed value of ‘0’, or leave the seed value blank.
The seed value
If you specify a seed value, it can be any number, with one exception: for classical variables sampling, the seed value must be a positive number not greater than 2,147,483,647.
Each unique seed value results in a different set of random numbers. If you respecify the same seed value, the same set of random numbers is generated. Explicitly specify a seed value, and save it, if you want to replicate a particular sample selection. You can also retrieve a seed value from the command log.
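The three inputs to the random selection method, and the effect of reusing a seed, can be sketched as follows. The function name `random_selections` and all parameter values are assumptions for illustration, and the numbers produced by Python's generator will not match those produced by Analytics.

```python
import random

def random_selections(sample_size, population_size, seed):
    """Draw `sample_size` distinct positions from 1..population_size."""
    rng = random.Random(seed)
    # sample() draws without replacement, so no position is repeated.
    return sorted(rng.sample(range(1, population_size + 1), sample_size))

first = random_selections(sample_size=10, population_size=5000, seed=2024)
again = random_selections(sample_size=10, population_size=5000, seed=2024)

print(first)
print(first == again)         # True: same seed, same selections
print(len(set(first)) == 10)  # True: no position is drawn twice
```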
Considerations
Large amounts may be excluded from a monetary unit sample
With the random selection method, each monetary unit has an equal chance of selection, and there is no guarantee that the resulting sample will be evenly distributed. As a result, the distance or gap between selected units may be large in some instances. If all the monetary units associated with a large amount happen to fall into a gap, the amount is not included in the sample. There is also no top stratum cutoff available when using the random selection method.
With the fixed interval and cell selection methods, the selected units are assured of being evenly, or relatively evenly, distributed, and top stratum cutoff is available.
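One consequence of even spacing is that, with a fixed interval, any amount covering at least as many monetary units as the interval must contain a selected unit, whatever the start number. The sketch below checks this for a hypothetical amount and then shows a hand-picked set of units, of the kind a random draw could produce, that misses the same amount. All values are assumptions for illustration.

```python
def units_hit(first_unit, last_unit, selections):
    """Return the selected units that fall inside one amount's unit range."""
    return [u for u in selections if first_unit <= u <= last_unit]

interval, population_size = 62, 620
large_amount = (100, 170)  # hypothetical amount covering 71 units, more than one interval

# Fixed interval: every possible start number lands a selection in the range.
always_hit = all(
    units_hit(*large_amount, range(start, population_size + 1, interval))
    for start in range(1, interval + 1)
)
print(always_hit)  # True

# Random: a hand-picked set of 10 units that a random draw could produce.
possible_random_draw = [12, 95, 180, 240, 305, 377, 410, 466, 530, 601]
print(units_hit(*large_amount, possible_random_draw))  # []
```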
Amounts may be included more than once in a monetary unit sample
Analytics does not generate the same random number twice. However, random numbers that are close, or sequential, can occur.
With monetary unit sampling, close or sequential random numbers equate to close or sequential monetary units being selected, which in turn can lead to an associated amount being selected more than once.
With record sampling and classical variables sampling, the same problem does not exist because each random number equates to a different record.
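A small sketch of the difference, assuming that each record owns a consecutive run of monetary units equal to its absolute amount. The amounts, seed, and sample size are hypothetical, and the mapping is an illustration rather than the Analytics implementation.

```python
import random
from bisect import bisect_left
from itertools import accumulate

amounts = [5, 990, 5]                     # hypothetical absolute amounts
upper_bounds = list(accumulate(amounts))  # [5, 995, 1000]

rng = random.Random(7)
# Ten distinct monetary units drawn from the 1000 units in the population.
units = rng.sample(range(1, upper_bounds[-1] + 1), 10)
# Map each unit to the 1-based record whose cumulative range contains it.
records = [bisect_left(upper_bounds, u) + 1 for u in units]

print(len(set(units)))    # 10: every random number is different
print(len(set(records)))  # at most 3: several units share a record
```

The ten random numbers are all distinct, but with only three records behind them, at least one amount is necessarily selected more than once. In record sampling, the ten distinct numbers would themselves be ten different records.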
Random number algorithms
For record sampling and monetary unit sampling, the random number generator in Analytics has two algorithm options:
- Mersenne-Twister
- The default Analytics algorithm
Mersenne-Twister is a widely used random number algorithm and it has better statistical properties than the default Analytics algorithm. Use the default algorithm if you require backward compatibility with Analytics scripts or sampling results created prior to Analytics version 12.
For classical variables sampling, Mersenne-Twister is not an option and the default Analytics algorithm is used.
Add a record number field
You may find it useful to add a record number field to the Analytics table from which you are drawing a sample. After you draw the sample, the specific record numbers that were selected from the source table are displayed in the output table containing the sample.
Note
A record number field is automatically included in the output table when you use classical variables sampling.
- In the source table, create a computed field that uses the following expression:
RECNO( )
For more information, see Define a conditional computed field.
- When you sample the data, output by Fields, not by Record.
You must output by Fields in order to convert the computed record number field to a physical field that preserves the record numbers from the source table.
- Include the computed record number field in the output fields you specify.