Data classification is the process of grouping raw data into meaningful categories to enable statistical analysis and interpretation.
## Core concept
Classification involves dividing a large mass of data into homogeneous groups or classes based on common characteristics. It is a prerequisite for tabulation and for building a frequency distribution. It reduces complexity, reveals patterns, and makes data suitable for statistical analysis.
Types of classification:
- Geographical classification – Data grouped by location/region (e.g., sales by state)
- Chronological classification – Data grouped by time period (e.g., production by year)
- Qualitative classification – Data grouped by attributes/qualities (e.g., by gender, colour, status)
- Quantitative classification – Data grouped by numerical values/magnitudes (e.g., by income range, weight)
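The four types differ only in which characteristic is used as the grouping key. A minimal sketch in Python, assuming a handful of made-up sales records (the field names `state`, `year`, `status`, `amount` and the helper `classify` are hypothetical, chosen only for illustration):

```python
from collections import defaultdict

# Hypothetical sales records, invented purely for illustration.
records = [
    {"state": "Kerala", "year": 2022, "status": "paid", "amount": 120},
    {"state": "Kerala", "year": 2023, "status": "pending", "amount": 340},
    {"state": "Punjab", "year": 2022, "status": "paid", "amount": 95},
    {"state": "Punjab", "year": 2023, "status": "paid", "amount": 210},
]

def classify(rows, key):
    """Group rows into classes according to the value returned by key(row)."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return dict(groups)

by_state = classify(records, lambda r: r["state"])    # geographical
by_year = classify(records, lambda r: r["year"])      # chronological
by_status = classify(records, lambda r: r["status"])  # qualitative
# Quantitative: bands of width 100, labelled by their lower limit.
by_band = classify(records, lambda r: r["amount"] // 100 * 100)

for label, rows in sorted(by_band.items()):
    print(f"{label}-{label + 100}: {len(rows)} record(s)")
```

The same records fall into different classes depending on the key, which is why the choice of classification type should follow the question being asked of the data.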
## Formula / rule
Class interval formula:
$$\text{Class size (h)} = \frac{\text{Range}}{\text{Number of classes}}$$
Where Range = Highest value − Lowest value
Number of classes (Sturges' rule):
$$k = 1 + 3.322 \times \log_{10}(n)$$
Where n = total number of observations
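The two formulas above can be sketched in Python as follows. The function names are hypothetical, and rounding k to the nearest whole number is an assumption made here (some texts round up instead):

```python
import math

def sturges_classes(n: int) -> int:
    """Number of classes suggested by Sturges' rule, rounded to the nearest integer."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return round(1 + 3.322 * math.log10(n))

def class_size(values, k: int) -> float:
    """Class size h = range / number of classes, with range = highest - lowest."""
    return (max(values) - min(values)) / k

k = sturges_classes(20)
print(k)                        # 5
print(class_size([15, 74], k))  # 11.8
```

In practice the class size is then rounded to a convenient value, so the final number of classes can differ slightly from k.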
Class boundaries vs. class limits:
- Class limits – The stated upper and lower values (e.g., 10–20)
- Class boundaries – Actual limits for continuous data (e.g., 9.5–20.5) to avoid gaps/overlaps
- Class width – Difference between upper and lower boundaries
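As a rough illustration of how boundaries follow from limits, the sketch below assumes inclusive class limits and values recorded to the nearest `unit` (1 by default), so each limit is extended by half a unit. The function name `class_boundaries` is hypothetical:

```python
def class_boundaries(lower_limit, upper_limit, unit=1):
    """Convert inclusive class limits to class boundaries.

    `unit` is the recording precision, i.e. the gap between the upper
    limit of one class and the lower limit of the next.
    """
    half = unit / 2
    return lower_limit - half, upper_limit + half

lower, upper = class_boundaries(10, 20)
print(lower, upper)   # 9.5 20.5
print(upper - lower)  # 11.0, the class width
```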
## Common exam applications
Example: Classify the following marks into appropriate classes
Marks obtained by 20 students: 15, 23, 34, 42, 18, 56, 64, 71, 38, 45, 52, 61, 27, 39, 48, 55, 68, 74, 31, 47
Solution:
- Highest value = 74, Lowest value = 15
- Range = 74 − 15 = 59
- Using Sturges' rule: k = 1 + 3.322 × log₁₀(20) = 1 + 3.322 × 1.301 ≈ 5.3 ≈ 5 classes
- Class width = 59 ÷ 5 = 11.8 ≈ 12
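Classification (lower limit inclusive, upper limit exclusive; the classes start at the round value 10 rather than 15, so a sixth class is needed to include 74):

| Class | Frequency |
|-------|-----------|
| 10–22 | 2 |
| 22–34 | 3 |
| 34–46 | 5 |
| 46–58 | 5 |
| 58–70 | 3 |
| 70–82 | 2 |
| Total | 20 |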
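A tally like this is easy to get wrong by hand, so it can help to check it mechanically. The sketch below counts the 20 marks into classes of width 12 starting at 10, assuming integer values and the lower-inclusive, upper-exclusive convention (so a mark of 34 falls in 34–46, not 22–34). The function name `frequency_table` is hypothetical:

```python
marks = [15, 23, 34, 42, 18, 56, 64, 71, 38, 45,
         52, 61, 27, 39, 48, 55, 68, 74, 31, 47]

def frequency_table(values, start, width):
    """Count integer values in classes [start, start+width), [start+width, ...), ..."""
    if min(values) < start:
        raise ValueError("start must not exceed the smallest value")
    # Enough classes to reach the largest value.
    n_classes = (max(values) - start) // width + 1
    counts = [0] * n_classes
    for v in values:
        counts[(v - start) // width] += 1
    return [(start + i * width, start + (i + 1) * width, c)
            for i, c in enumerate(counts)]

table = frequency_table(marks, start=10, width=12)
for lower, upper, freq in table:
    print(f"{lower}-{upper}: {freq}")
# 10-22: 2
# 22-34: 3
# 34-46: 5
# 46-58: 5
# 58-70: 3
# 70-82: 2
```

The frequencies sum to 20, which is a quick check that every observation has been counted exactly once.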
## Common mistakes
- Overlapping classes – "10–20, 20–30" is ambiguous unless the convention is stated (e.g., lower limit inclusive, upper limit exclusive). Otherwise use "10–19, 20–29" or boundaries "9.5–19.5, 19.5–29.5"
- Unequal class widths – Causes distortion in frequency distribution analysis; always keep widths uniform unless justified
- Too many/too few classes – More than about 15 classes reduces clarity; fewer than 4 loses detail. Use Sturges' rule as a starting point
- Ignoring class boundaries for continuous data – For continuous variables, always use boundaries; for discrete variables, limits suffice
- Not ensuring mutually exclusive, exhaustive classes – Every observation must fit into exactly one class
- Poor choice of classification type – Selecting qualitative classification for numerical data wastes analytical potential
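The mutually exclusive and exhaustive requirement can be tested directly. A minimal sketch, assuming each class is given as a `(lower, upper)` pair with the lower limit inclusive and the upper limit exclusive (the function name `check_classes` is hypothetical):

```python
def check_classes(classes, values):
    """Return (value, matches) for every value not in exactly one class.

    classes: list of (lower, upper) pairs, lower inclusive, upper exclusive.
    """
    problems = []
    for v in values:
        hits = sum(1 for lower, upper in classes if lower <= v < upper)
        if hits != 1:
            problems.append((v, hits))
    return problems

# Exclusive and exhaustive for these values: no problems reported.
print(check_classes([(10, 20), (20, 30)], [10, 20, 29]))  # []
# 30 fits no class (not exhaustive); 20 fits two classes (not exclusive).
print(check_classes([(10, 20), (20, 30)], [30]))          # [(30, 0)]
print(check_classes([(10, 21), (20, 30)], [20]))          # [(20, 2)]
```

An empty result means every observation fits into exactly one class.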
Key examination tip: In numerical problems, always compute range first, apply Sturges' rule to determine optimal number of classes, then calculate class width. State assumptions clearly (e.g., inclusive lower limit, exclusive upper limit) to avoid ambiguity.