Bivariate data analysis involves studying the relationship between two variables to understand how one variable changes with respect to another.
## Core concept
Bivariate data consists of two variables observed on the same unit of observation. Unlike univariate data (single variable), bivariate analysis explores:
- Correlation: strength and direction of linear association between two quantitative variables
- Regression: predicting one variable (dependent) from another (independent)
- Association: relationship between categorical or mixed-type variables
Key distinction: Correlation measures *association*; regression predicts one variable from the other.
## Classification of bivariate data
| Type | Variables | Tools | Example |
|------|-----------|-------|---------|
| Quantitative–Quantitative | Both continuous/discrete | Scatter plot, correlation, regression | Height vs Weight |
| Qualitative–Qualitative | Both categorical | Contingency table, chi-square test | Gender vs Product preference |
| Mixed | One quantitative, one categorical | Box plot, one-way ANOVA | Sales by Region |
## Correlation analysis
Karl Pearson's Correlation Coefficient (r):
$$r = \frac{\sum(x - \bar{x})(y - \bar{y})}{\sqrt{\sum(x - \bar{x})^2 \sum(y - \bar{y})^2}}$$
- Range: −1 to +1
- r = +1: Perfect positive correlation (all points lie exactly on an upward-sloping straight line)
- r = −1: Perfect negative correlation (all points lie exactly on a downward-sloping straight line)
- r = 0: No linear correlation (a non-linear relationship may still exist)
- Rule of thumb (cut-offs vary by textbook): |r| ≥ 0.7: Strong; 0.3 ≤ |r| < 0.7: Moderate; |r| < 0.3: Weak
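The deviation formula for *r* can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function name `pearson_r` and the sample data are made up for this example.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r from paired observations, using the deviation formula."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-products of deviations (numerator of r)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Sums of squared deviations (under the square root in the denominator)
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("r is undefined when a variable has zero variance")
    return s_xy / sqrt(s_xx * s_yy)

# Illustrative data (made up for this sketch)
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # points on an upward-sloping line
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # points on a downward-sloping line
```

The two calls correspond to the perfect positive and perfect negative cases listed above.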
Spearman's Rank Correlation Coefficient (ρ): Used when the data are ordinal (ranked) or heavily skewed. When there are no tied ranks:
$$\rho = 1 - \frac{6\sum d^2}{n(n^2-1)}$$
where *d* = difference between the two ranks of each pair, *n* = number of pairs.
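As a rough sketch of the rank-difference formula, the following Python assumes there are no tied values (ties need averaged ranks and a corrected formula, which this sketch does not attempt). The helper names `ranks` and `spearman_rho` and the sample scores are hypothetical.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n; this sketch assumes no ties."""
    if len(set(values)) != len(values):
        raise ValueError("this sketch assumes no tied values")
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman_rho(xs, ys):
    """Spearman's rho via 1 - 6*sum(d^2) / (n*(n^2 - 1)), no ties assumed."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    n = len(xs)
    rank_x, rank_y = ranks(xs), ranks(ys)
    # d is the difference between the two ranks of each pair
    sum_d_sq = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d_sq / (n * (n ** 2 - 1))

# Made-up scores from two judges for five contestants
judge_a = [10, 20, 30, 40, 50]   # ranks 1, 2, 3, 4, 5
judge_b = [12, 25, 20, 45, 60]   # ranks 1, 3, 2, 4, 5
# Rank differences are 0, -1, 1, 0, 0, so sum(d^2) = 2 and n = 5
print(spearman_rho(judge_a, judge_b))
```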
## Regression analysis (CA Foundation scope)
Simple Linear Regression: Equation of line of best fit. $$\hat{y} = a + bx$$
where:
- b (slope) = $\frac{\sum(x - \bar{x})(y - \bar{y})}{\sum(x - \bar{x})^2}$
- a (intercept) = $\bar{y} - b\bar{x}$
Interpretation:
- *a*: predicted value of *y* when *x* = 0
- *b*: predicted change in *y* for a one-unit increase in *x*
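- Line of regression of y on x: Used to predict *y* from *x*.
- Line of regression of x on y: Used to predict *x* from *y*.
The two lines are generally different; both pass through $(\bar{x}, \bar{y})$, and they coincide only when r = ±1.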
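The slope and intercept formulas, and the difference between the two regression lines, can be illustrated with a short Python sketch. The function name `regression_line` and the data are invented for this example; swapping the arguments gives the line of x on y.

```python
def regression_line(xs, ys):
    """Return (a, b) for the least-squares line y_hat = a + b*x (ys on xs)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("slope is undefined when the predictor has zero variance")
    b = s_xy / s_xx          # slope
    a = y_bar - b * x_bar    # intercept
    return a, b

# Made-up data for illustration
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a_yx, b_yx = regression_line(xs, ys)   # y on x: predicts y from x
a_xy, b_xy = regression_line(ys, xs)   # x on y: predicts x from y

print(f"y on x: y_hat = {a_yx:.2f} + {b_yx:.2f} x")
print(f"x on y: x_hat = {a_xy:.2f} + {b_xy:.2f} y")
# The product of the two slopes equals r squared
print(f"b_yx * b_xy = {b_yx * b_xy:.2f}")
```

Because the two slopes multiply to r², rearranging the y-on-x equation to solve for *x* does not give the x-on-y line unless the correlation is perfect.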
## Worked example
Data: Advertising spend (₹ lakhs) vs Sales (₹ crores) for 5 companies.
| Ad Spend (x, ₹ lakhs) | Sales (y, ₹ crores) |
|---|---|
| 2 | 5 |
| 3 | 7 |
| 4 | 8 |
| 5 | 10 |
| 6 | 12 |
Calculate: Pearson's *r* and the regression equation of *y* on *x*.
$\bar{x} = 4$, $\bar{y} = 8.4$
$\sum(x-\bar{x})(y-\bar{y}) = 6.8 + 1.4 + 0 + 1.6 + 7.2 = 17$
$\sum(x-\bar{x})^2 = 4 + 1 + 0 + 1 + 4 = 10$
$\sum(y-\bar{y})^2 = 11.56 + 1.96 + 0.16 + 2.56 + 12.96 = 29.2$
$r = \frac{17}{\sqrt{10 \times 29.2}} \approx 0.995$ (very strong positive correlation)
$b = \frac{17}{10} = 1.7$; $a = 8.4 - 1.7(4) = 1.6$
Regression equation: $\hat{y} = 1.6 + 1.7x$
*Interpretation*: For every ₹1 lakh increase in ad spend, predicted sales increase by ₹1.7 crore.
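Hand arithmetic of this kind is easy to get wrong, so it can help to recompute each sum directly from the table. The script below is a sketch for checking the working; the variable names are chosen for this example only.

```python
from math import sqrt

ad_spend = [2, 3, 4, 5, 6]    # x, in ₹ lakhs (from the table)
sales = [5, 7, 8, 10, 12]     # y, in ₹ crores (from the table)

n = len(ad_spend)
x_bar = sum(ad_spend) / n
y_bar = sum(sales) / n

# Sums of deviations used in the formulas for r, b and a
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ad_spend, sales))
s_xx = sum((x - x_bar) ** 2 for x in ad_spend)
s_yy = sum((y - y_bar) ** 2 for y in sales)

r = s_xy / sqrt(s_xx * s_yy)   # Pearson's correlation coefficient
b = s_xy / s_xx                # slope of the line of y on x
a = y_bar - b * x_bar          # intercept of the line of y on x

print(f"x_bar = {x_bar}, y_bar = {y_bar}")
print(f"S_xy = {s_xy:.2f}, S_xx = {s_xx:.2f}, S_yy = {s_yy:.2f}")
print(f"r = {r:.3f}")
print(f"y_hat = {a:.2f} + {b:.2f} x")
```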
## Common exam applications
- Predict sales/demand from advertising spend or price
- Test whether correlation is statistically significant
- Compare strength of two correlations
- Identify outliers in scatter plots
- Use regression for forecasting
## Common mistakes
- Confusing correlation with causation: *r* = 0.8 means association, not that x causes y
- Ignoring data type: Pearson's *r* requires quantitative data; use Spearman's for ranks
- Extrapolating beyond data range: Regression is unreliable outside observed values
- Misreading regression direction: $\hat{y} = a + bx$ predicts *y* from *x*; to predict *x* from *y*, use the separate line of regression of x on y instead of rearranging this equation
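One way to guard against the extrapolation mistake is to refuse predictions outside the observed range of *x*. The helper below is a hypothetical sketch: the name `predict_within_range`, its parameters, and the coefficients in the usage line are assumptions made for illustration.

```python
def predict_within_range(a, b, x, x_min, x_max):
    """Predict y_hat = a + b*x, refusing to extrapolate outside [x_min, x_max]."""
    if not (x_min <= x <= x_max):
        raise ValueError(
            f"x = {x} lies outside the observed range [{x_min}, {x_max}]; "
            "the fitted line may not hold there"
        )
    return a + b * x

# Hypothetical coefficients and observed range, chosen only for illustration
print(predict_within_range(a=1.0, b=2.0, x=3, x_min=2, x_max=6))  # 7.0
```

Raising an error is a deliberately strict choice; a gentler variant could return the prediction together with a warning flag.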