Skip to content
noobtoproTake the free diagnostic
Mathematics · High School · Statistics & probability

Inference & regression (intro)

The idea

Inference asks what a sample can responsibly say about a population, and regression asks how one quantity changes with another. The least-squares line ŷ = a + bx is the line minimizing the squared vertical misses, called residuals, between data and predictions; the correlation coefficient r, trapped between −1 and 1, grades the strength and direction of a linear relationship, and r² reports the fraction of variation in y that the line accounts for. Both topics share one discipline: every conclusion ships with uncertainty attached, never as bare fact.

Tidy relationships connect the regression pieces: the slope is b = r × (sy/sx), where sx and sy are the standard deviations of the two variables, and the line always passes through the point of means (x̄, ȳ), so the intercept is a = ȳ − b × x̄. A residual — observed minus predicted — measures how far a real point falls from the line. The misconception headlines are built on: correlation is not causation, since ice cream sales and drowning rates rise together through summer heat, not through each other. And predicting far outside the data's x-range, called extrapolation, is fiction dressed as arithmetic.

Worked example

For 20 students, hours studied (x) and exam score (y) have x̄ = 4, sx = 2, ȳ = 70, sy = 5, and correlation r = 0.8. Find the least-squares line, predict the score for 7 hours of study, and find the residual for a student who studied 7 hours and scored 80.

  1. Slope first: b = r × sy/sx = 0.8 × 5/2 = 2, so each extra hour of study is associated with about 2 more exam points.
  2. The line passes through the point of means (4, 70), so the intercept is a = 70 − 2 × 4 = 62, and the line is ŷ = 62 + 2x.
  3. Predict at x = 7: ŷ = 62 + 2 × 7 = 76 points.
  4. The residual is observed minus predicted: 80 − 76 = 4, so this student outperformed the line's prediction by 4 points.
  5. Frame the strength honestly: r² = 0.64 means study hours account for 64% of score variation, leaving 36% to everything else — and since hours were not randomly assigned, associated with is the only defensible verb.

Answer. The line is ŷ = 62 + 2x; the prediction at 7 hours is 76 points, and the student's residual is +4 points.

Check your understanding

  • Why does the least-squares method minimize squared vertical distances rather than plain or horizontal ones?
  • What does r² = 0.64 tell you, and what does it leave completely unsaid about the remaining 36%?
  • Why is predicting the score for 15 hours of study dangerous here, even though the formula happily produces a number?
  • What change to this study's design would let it support a causal claim about studying and scores?

Build the foundations first

Inference & regression (intro) builds on these concepts. If any feel shaky, start there.

Slope & rate of changeProbability (intro)
Can you reason it out?
noobtopro grades how you think, not just the answer — a sound method scores even when the final number is wrong.
Practice this concept

← All High School mathematics concepts