The Complete Statistics & Data Science Cheat Sheet: 135 Definitions to Ace Your Interview
Preparing for a data science, data analyst, or business intelligence interview? This comprehensive guide is your ultimate cheat sheet. We've compiled 135 essential statistics definitions, organized for quick learning and revision. Bookmark this page, master these concepts, and walk into your next interview with confidence!
Quick Navigation

- Section 1: Population & Sampling
- Section 2: Data, Variables & Visualization
- Section 3: Distributions & Estimation
- Section 4: Hypothesis Testing
- Section 5: Regression & Modeling
Section 1: Population & Sampling
# | Lesson | Term | Definition |
---|---|---|---|
1 | Population vs sample | population | The collection of all items of interest to our study; denoted N. |
2 | Population vs sample | sample | A subset of the population; denoted n. |
3 | Population vs sample | parameter | A value that refers to a population. It is the opposite of a statistic. |
4 | Population vs sample | statistic | A value that refers to a sample. It is the opposite of a parameter. |
5 | Population vs sample | random sample | A sample where each member is chosen from the population strictly by chance. |
Section 2: Data, Variables & Visualization
# | Lesson | Term | Definition |
---|---|---|---|
6 | Types of data | representative sample | A sample taken from the population to reflect the population as a whole. |
7 | Types of data | variable | A characteristic of a unit which may assume more than one value, e.g. height, occupation, age, etc. |
8 | Types of data | type of data | A way to classify data. There are two types of data - categorical and numerical. |
9 | Types of data | categorical data | A subgroup of types of data. Describes categories or groups. |
10 | Types of data | numerical data | A subgroup of types of data. Represents numbers. Can be further classified into discrete and continuous. |
11 | Types of data | discrete data | Data that can be counted in a finite manner. Opposite of continuous. |
12 | Types of data | continuous data | Data that is 'infinite' and impossible to count. Opposite of discrete. |
13 | Levels of measurement | levels of measurement | A way to classify data. There are two levels of measurement - qualitative and quantitative, which are further classed into nominal & ordinal, and ratio & interval, respectively. |
14 | Levels of measurement | qualitative data | A subgroup of levels of measurement. There are two types of qualitative data - nominal and ordinal. |
15 | Levels of measurement | quantitative data | A subgroup of levels of measurement. There are two types of quantitative data - ratio and interval. |
16 | Levels of measurement | nominal | Refers to variables that describe different categories and cannot be put in any order. |
17 | Levels of measurement | ordinal | Refers to variables that describe different categories, but can be ordered. |
18 | Levels of measurement | ratio | A number that has a unique and unambiguous zero point, no matter if a whole number or a fraction. |
19 | Levels of measurement | interval | An interval variable represents a number or an interval. There isn't a unique and unambiguous zero point. For example, degrees in Celsius and Fahrenheit are interval variables, while Kelvin is a ratio variable. |
20 | Categorical variables. Visualization techniques | frequency distribution table | A table that shows the frequency of each distinct value of a variable. |
21 | Categorical variables. Visualization techniques | frequency | Measures how often a value occurs. |
22 | Categorical variables. Visualization techniques | absolute frequency | Measures the NUMBER of occurrences of a variable. |
23 | Categorical variables. Visualization techniques | relative frequency | Measures the RELATIVE NUMBER of occurrences of a variable. Usually expressed as a percentage. |
24 | Categorical variables. Visualization techniques | cumulative frequency | The sum of relative frequencies so far. The cumulative frequency of all members is 100% or 1. |
25 | Categorical variables. Visualization techniques | Pareto diagram | A type of bar chart where frequencies are shown in descending order. There is an additional line on the chart, showing the cumulative frequency. |
26 | The Histogram | histogram | A type of bar chart that represents numerical data. It is divided into intervals (or bins) that are not overlapping and span from the first observation to the last. The intervals (bins) are adjacent - where one stops, the other starts. |
27 | The Histogram | bins (histogram) | The intervals that are represented in a histogram. |
28 | Cross table and scatter plot | cross table | A table which represents categorical data. On one axis we have the categories, and on the other - their frequencies. It can be built with absolute or relative frequencies. |
29 | Cross table and scatter plot | contingency table | See cross table. |
30 | Cross table and scatter plot | scatter plot | A plot that represents numerical data. Graphically, each observation looks like a point on the scatter plot. |
31 | Mean, median and mode | measures of central tendency | Measures that describe the data through 'averages'. The most common are the mean, median and mode. There are also the geometric mean, harmonic mean, weighted mean, etc. |
32 | Mean, median and mode | mean | The simple average of the dataset. Denoted μ for a population and x̄ for a sample. |
33 | Mean, median and mode | median | The middle number in an ordered dataset. |
34 | Mean, median and mode | mode | The value that occurs most often. A dataset can have 0, 1 or multiple modes. |
35 | Skewness | measures of asymmetry | Measures that describe the data through the level of symmetry that is observed. The most common are skewness and kurtosis. |
36 | Skewness | skewness | A measure that describes the degree of asymmetry of the dataset around its mean. |
37 | Variance | sample formula | A formula that is calculated on a sample. The value obtained is a statistic. |
38 | Variance | population formula | A formula that is calculated on a population. The value obtained is a parameter. |
39 | Variance | measures of variability | Measures that describe the data through the level of dispersion (variability). The most common ones are variance and standard deviation. |
40 | Variance | variance | Measures the dispersion of the dataset around its mean. It is measured in units squared. Denoted σ² for a population and s² for a sample. |
41 | Standard deviation and coefficient of variation | standard deviation | Measures the dispersion of the dataset around its mean. It is measured in original units. It is equal to the square root of the variance. Denoted σ for a population and s for a sample. |
42 | Standard deviation and coefficient of variation | coefficient of variation | Measures the dispersion of the dataset around its mean. It is also called 'relative standard deviation'. It is useful for comparing different datasets in terms of variability. |
43 | Covariance | univariate measure | A measure which refers to a single variable. |
44 | Covariance | multivariate measure | A measure which refers to multiple variables. |
45 | Covariance | covariance | A measure of relationship between two variables. Usually, because of its scale of measurement, covariance is not directly interpretable. Denoted σxy for a population and sxy for a sample. |
46 | Correlation | linear correlation coefficient | A measure of relationship between two variables. Very useful for direct interpretation, as it takes values in the interval [-1, 1]. Denoted ρxy for a population and rxy for a sample. |
47 | Correlation | correlation | A measure of the relationship between two variables. There are several ways to compute it, the most common being the linear correlation coefficient. |
Section 3: Distributions & Estimation
# | Lesson | Term | Definition |
---|---|---|---|
48 | What is a distribution | distribution | A function that shows the possible values for a variable and the probability of their occurrence. |
49 | The normal distribution | Bell curve | A common name for the normal distribution. |
50 | The normal distribution | Gaussian distribution | The original name of the normal distribution. Named after the famous mathematician Gauss, who was the first to explore it through his work on the Gaussian function. |
51 | The normal distribution | to control for the mean/std/etc | While holding a particular value constant, we change the other variables and observe the effect. |
52 | The standard normal distribution | standard normal distribution | A normal distribution with a mean of 0, and a standard deviation of 1. |
53 | The standard normal distribution | z-statistic | The statistic associated with the normal distribution. |
54 | The standard normal distribution | standardized variable | A variable which has been standardized using the z-score formula - by first subtracting the mean and then dividing by the standard deviation. |
55 | The central limit theorem | central limit theorem | No matter the distribution of the underlying dataset, the sampling distribution of its mean approximates a normal distribution as the sample size grows. |
56 | The central limit theorem | sampling distribution | The distribution of a sample statistic (such as the sample mean) across repeated samples. |
57 | Standard error | standard error | The standard error is the standard deviation of the sampling distribution. It takes the size of the sample into account. |
58 | Estimators and estimates | estimator | A function or rule according to which we make estimations. |
59 | Estimators and estimates | estimate | The particular value that was estimated through an estimator. |
60 | Estimators and estimates | bias | An unbiased estimator has an expected value equal to the population parameter. A biased one has an expected value different from the population parameter. The bias is the deviation from the true value. |
61 | Estimators and estimates | efficiency (in estimators) | In the context of estimators, the efficiency loosely refers to 'lack of variability'. The most efficient estimator is the one with the least variability. It is a comparative measure. |
62 | Estimators and estimates | point estimator | A function or a rule, according to which we make estimations that will result in a single number. |
63 | Estimators and estimates | point estimate | A single number that is derived from a certain point estimator. |
64 | Estimators and estimates | interval estimator | A function or a rule, according to which we make estimations that will result in an interval. In this guide, we only consider confidence intervals. |
65 | Estimators and estimates | interval estimate | A particular result that was obtained from an interval estimator. It is an interval. |
66 | Definition of confidence intervals | confidence interval | A confidence interval is the range within which you expect the population parameter to be. You have a certain probability of it being correct. |
67 | Definition of confidence intervals | reliability factor | A value from a z-table, t-table, etc. that is associated with our test. |
68 | Definition of confidence intervals | level of confidence | Shows in what % of cases we expect the population parameter to fall into the confidence interval we obtained. Denoted 1 - α. Example: 95% confidence level means that in 95% of the cases, the population parameter will fall into the specified interval. |
69 | Population variance known, z-score | critical value | A value coming from a table for a specific statistic (z, t, F, etc.) associated with the probability (α) that the researcher has chosen. |
70 | Population variance known, z-score | z-table | A table associated with the Z-statistic, where given a probability (α), we can see the value of the standardized variable, following the standard normal distribution. |
71 | Student's T distribution | t-statistic | A statistic that is generally associated with the Student's T distribution, in the same way the z-statistic is associated with the normal distribution. |
72 | Student's T distribution | a rule of thumb | A principle which is approximately true and is widely used in practice due to its simplicity. |
73 | Student's T distribution | t-table | A table associated with the t-statistic, where given a probability (α), and certain degrees of freedom, we can check the reliability factor. |
74 | Student's T distribution | degrees of freedom | The number of variables in the final calculation that are free to vary. |
75 | Margin of error | margin of error | Half the width of a confidence interval. It drives the width of the interval. |
Section 4: Hypothesis Testing
# | Lesson | Term | Definition |
---|---|---|---|
76 | Null vs alternative | hypothesis | Loosely, a hypothesis is 'an idea that can be tested'. |
77 | Null vs alternative | hypothesis test | A test that is conducted in order to verify if a hypothesis is true or false. |
78 | Null vs alternative | null hypothesis | The null hypothesis is the one to be tested. Whenever we are conducting a test, we are trying to reject the null hypothesis. |
79 | Null vs alternative | alternative hypothesis | The alternative hypothesis is the opposite of the null. It is usually the opinion of the researcher, as they are trying to reject the null hypothesis and thus accept the alternative one. |
80 | Null vs alternative | to accept a hypothesis | The statistical evidence shows that the hypothesis is likely to be true. |
81 | Null vs alternative | to reject a hypothesis | The statistical evidence shows that the hypothesis is likely to be false. |
82 | Null vs alternative | one-tailed test | A test that determines whether a value is lower than (or equal to) or higher than (or equal to) a certain value. Such tests are one-sided because the null hypothesis can be rejected only on one side of the distribution. |
83 | Null vs alternative | two-tailed test | A test that determines whether a value is equal to (or different from) a certain value. Such tests are two-sided because the null hypothesis can be rejected on either side - if the parameter is too big or too small. |
84 | Rejection region and significance level | significance level | The probability of rejecting the null hypothesis when it is in fact true. Denoted α. You choose the significance level. All else equal, the lower the level, the better the test. |
85 | Rejection region and significance level | rejection region | The part of the distribution, for which we would reject the null hypothesis. |
86 | Type I error vs type II error | type I error (false positive) | This error consists of rejecting a null hypothesis that is true. The probability of committing it is α, the significance level. |
87 | Type I error vs type II error | type II error (false negative) | This error consists of accepting a null hypothesis that is false. The probability of committing it is β. |
88 | Type I error vs type II error | power of the test | The probability of rejecting a null hypothesis that is false (the researcher's goal). Denoted 1 - β. |
89 | Test for the mean. Population variance known | z-score | The standardized variable associated with the dataset we are testing. It is observed in the table with an α equal to the level of significance of the test. |
90 | Test for the mean. Population variance known | μ₀ | The hypothesized population mean. |
91 | p-value | p-value | The smallest level of significance at which we can still reject the null hypothesis given the observed sample statistic. |
92 | Test for the mean. Population variance unknown | email open rate | A measure of how many people on an email list actually open the emails they have received. |
Section 5: Regression & Modeling
# | Lesson | Term | Definition |
---|---|---|---|
93 | Correlation vs causation | causation | Causation refers to a causal relationship between two variables. When one variable changes, the other changes accordingly. When we have causality, variable A affects variable B, but it is not required that B causes a change in A. |
94 | Correlation vs causation | GDP | Gross domestic product is a monetary measure of the market value of all final goods and services produced in a specific country over a given period. |
95 | The linear regression model | regression analysis | A statistical process for estimating relationships between variables. Usually, it is used for building predictive models. |
96 | The linear regression model | linear regression model | A linear approximation of a causal relationship between two or more variables. |
97 | The linear regression model | dependent variable (ŷ) | The variable that is going to be predicted. It also 'depends' on the other variables. Usually, denoted y. |
98 | The linear regression model | independent variable (xᵢ) | A variable used to predict the dependent variable. It is the observed data (your sample data). Usually denoted x₁, x₂, ..., xₖ. |
99 | The linear regression model | coefficient (βi) | A numerical or constant quantity placed before and multiplying the variable in an algebraic expression. |
100 | The linear regression model | constant (β₀) | The intercept: a constant value that shifts the dependent variable by a fixed amount, regardless of the values of the independent variables. |
101 | The linear regression model | epsilon (ε) | The error of prediction. Difference between the observed value and the (unobservable) true value. |
102 | The linear regression model | regression equation | An equation, where the coefficients are estimated from the sample data. Think of it as an estimator of the linear regression model. |
103 | The linear regression model | b₀, b₁,..., bₖ | Estimates of the coefficients β₀, β₁, ... βₖ. |
104 | Geometrical representation | regression line | The best-fitting line through the data points. |
105 | Geometrical representation | residual (e) | Difference between the observed value and the estimated value by the regression line. Point estimate of the error (ε). |
106 | Geometrical representation | b₀ | The intercept of the regression line with the y-axis for a simple linear regression. |
107 | Geometrical representation | b₁ | The slope of the regression line for a simple linear regression. |
108 | Example | SAT | The SAT is a standardized test for college admission in the US. |
109 | Example | GPA | Grade point average. |
110 | Decomposition | ANOVA | Abbreviation of 'analysis of variance'. A statistical framework for analyzing variance of means. |
111 | Decomposition | SST | Sum of squares total. SST is the sum of the squared differences between the observed dependent variable and its mean. |
112 | Decomposition | SSR | Sum of squares regression. SSR is the sum of the squared differences between the predicted values and the mean of the dependent variable. This is the variability explained by our model. |
113 | Decomposition | SSE | Sum of squares error. SSE is the sum of the squared differences between the observed values and the predicted values. This is the variability that is NOT explained by our model. |
114 | R-squared | r-squared (R²) | A measure ranging from 0 to 1 that shows how much of the total variability of the dataset is explained by our regression model. |
115 | OLS | OLS | An abbreviation of 'ordinary least squares'. It is a method for estimation of the regression equation coefficients. |
116 | Regression tables | regression tables | In this context, they refer to the tables produced by statistical software after you estimate your regression equation. |
117 | Multivariate linear regression model | multivariate linear regression | Also known as multiple linear regression. There is a slight difference between the two, but they are generally used interchangeably. In this guide, the term refers to a linear regression with more than one independent variable. |
118 | Adjusted R-squared | adjusted r-squared | A measure, based on the idea of R-squared, which penalizes the excessive use of independent variables. |
119 | F-test | F-statistic | The F-statistic is connected with the F-distribution in the same way the z-statistic is related to the Normal distribution. |
120 | F-test | F-test | A test for the overall significance of the model. |
121 | Assumptions | assumptions | When performing linear regression analysis, there are several assumptions about your data. They are known as the linear regression assumptions. |
122 | Assumptions | linearity | The relationship between the dependent variable and the independent variables is linear. |
123 | Assumptions | homoscedasticity | Literally means 'the same variance': the error terms have equal variance. |
124 | Assumptions | endogeneity | In statistics, refers to a situation where an independent variable is correlated with the error term. |
125 | Assumptions | autocorrelation | When different error terms in the same model are correlated with each other. |
126 | Assumptions | multicollinearity | Refers to high correlation between two or more independent variables. |
127 | A2. No endogeneity | omitted variable bias | Bias introduced when you leave an important variable out of your model; its effect is absorbed by the error term and distorts the estimated coefficients. |
128 | A3. Normality and homoscedasticity | heteroscedasticity | Literally means 'different variance': the error terms do not all have the same variance. Opposite of homoscedasticity. |
129 | A3. Normality and homoscedasticity | log transformation | A transformation in which a variable in your model is replaced by its logarithm. |
130 | A3. Normality and homoscedasticity | semi-log model | A model in which either the dependent variable or the independent variable(s) are log-transformed, but not both. |
131 | A3. Normality and homoscedasticity | log-log model | A model in which both the dependent and the independent variables are log-transformed. |
132 | A4. No autocorrelation | serial correlation | Another name for autocorrelation. |
133 | A4. No autocorrelation | cross-sectional data | Data taken at one moment in time. |
134 | A4. No autocorrelation | time series data | Data consisting of a sequence of observations taken at successive, equally spaced points in time, e.g. stock prices. |
135 | A4. No autocorrelation | day of the week effect | A well-known phenomenon in finance. It consists of disproportionately high returns on Fridays and low returns on Mondays. |
That's it! You've reviewed 135 of the most critical definitions in statistics and data science. Understanding the context behind these terms is the key to success. Revisit this guide often, practice explaining the concepts in your own words, and you'll be more than ready for any analytical challenge. Good luck with your interviews!