135 Key Statistics Definitions for Students & Interviews 📊

The Complete Statistics & Data Science Cheat Sheet: 135 Definitions to Ace Your Interview

Preparing for a data science, data analyst, or business intelligence interview? This comprehensive guide is your ultimate cheat sheet. We've compiled 135 essential statistics definitions, organized for quick learning and revision. Bookmark this page, master these concepts, and walk into your next interview with confidence!

Fundamental Concept: Population vs. Sample
- Population (N): all items of interest. We calculate parameters from it.
- Sample (n): a subset of the population. We calculate statistics from it.

Section 1: Population & Sampling

| # | Lesson | Word | Definition |
|---|--------|------|------------|
| 1 | Population vs sample | population | The collection of all items of interest to our study; denoted N. |
| 2 | Population vs sample | sample | A subset of the population; denoted n. |
| 3 | Population vs sample | parameter | A value that refers to a population. It is the counterpart of a statistic. |
| 4 | Population vs sample | statistic | A value that refers to a sample. It is the counterpart of a parameter. |
| 5 | Population vs sample | random sample | A sample in which each member is chosen from the population strictly by chance. |
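A quick way to internalize the parameter vs. statistic distinction is to simulate it. The sketch below uses invented values: it builds a simulated population, draws a random sample, and compares the population mean (a parameter) with the sample mean (a statistic).

```python
import random

# A simulated population of 1,000 values; its mean is a PARAMETER.
random.seed(42)
population = [random.gauss(170, 10) for _ in range(1000)]
parameter = sum(population) / len(population)

# A random sample of n = 50; its mean is a STATISTIC that estimates the parameter.
sample = random.sample(population, 50)
statistic = sum(sample) / len(sample)

print(f"population mean (parameter): {parameter:.2f}")
print(f"sample mean (statistic):     {statistic:.2f}")
```

Every fresh random sample yields a slightly different statistic, while the parameter stays fixed; that gap is what sampling theory quantifies.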

Section 2: Data, Variables & Visualization

| # | Lesson | Word | Definition |
|---|--------|------|------------|
| 6 | Types of data | representative sample | A sample taken from the population so as to reflect the population as a whole. |
| 7 | Types of data | variable | A characteristic of a unit that may assume more than one value, e.g. height, occupation, age. |
| 8 | Types of data | type of data | A way to classify data. There are two types of data: categorical and numerical. |
| 9 | Types of data | categorical data | A subgroup of types of data. Describes categories or groups. |
| 10 | Types of data | numerical data | A subgroup of types of data. Represents numbers. Can be further classified into discrete and continuous. |
| 11 | Types of data | discrete data | Data that can be counted in a finite manner. Opposite of continuous. |
| 12 | Types of data | continuous data | Data that is 'infinite' and impossible to count. Opposite of discrete. |
| 13 | Levels of measurement | levels of measurement | A way to classify data. There are two levels of measurement, qualitative and quantitative, which are further classified into nominal & ordinal, and ratio & interval, respectively. |
| 14 | Levels of measurement | qualitative data | A subgroup of levels of measurement. There are two types of qualitative data: nominal and ordinal. |
| 15 | Levels of measurement | quantitative data | A subgroup of levels of measurement. There are two types of quantitative data: ratio and interval. |
| 16 | Levels of measurement | nominal | Variables that describe different categories and cannot be put in any order. |
| 17 | Levels of measurement | ordinal | Variables that describe different categories but can be ordered. |
| 18 | Levels of measurement | ratio | A number with a unique and unambiguous zero point, whether a whole number or a fraction. |
| 19 | Levels of measurement | interval | A number or an interval without a unique and unambiguous zero point. For example, degrees Celsius and Fahrenheit are interval variables, while Kelvin is a ratio variable. |
| 20 | Categorical variables. Visualization techniques | frequency distribution table | A table that shows the frequency of each value of a variable. |
| 21 | Categorical variables. Visualization techniques | frequency | Measures the occurrence of a variable. |
| 22 | Categorical variables. Visualization techniques | absolute frequency | Measures the NUMBER of occurrences of a variable. |
| 23 | Categorical variables. Visualization techniques | relative frequency | Measures the RELATIVE number of occurrences of a variable; usually expressed as a percentage. |
| 24 | Categorical variables. Visualization techniques | cumulative frequency | The running sum of relative frequencies. The cumulative frequency of all members is 100% or 1. |
| 25 | Categorical variables. Visualization techniques | Pareto diagram | A bar chart in which frequencies are shown in descending order, with an additional line showing the cumulative frequency. |
| 26 | The Histogram | histogram | A chart that represents numerical data. It is divided into non-overlapping, adjacent intervals (bins) that span from the first observation to the last: where one bin stops, the next starts. |
| 27 | The Histogram | bins (histogram) | The intervals represented in a histogram. |
| 28 | Cross table and scatter plot | cross table | A table that represents categorical data: categories on one axis, their frequencies on the other. It can be built with absolute or relative frequencies. |
| 29 | Cross table and scatter plot | contingency table | See cross table. |
| 30 | Cross table and scatter plot | scatter plot | A plot that represents numerical data; each observation appears as a point. |
| 31 | Mean, median and mode | measures of central tendency | Measures that describe the data through 'averages'. The most common are the mean, median, and mode; there are also the geometric mean, harmonic mean, weighted mean, etc. |
| 32 | Mean, median and mode | mean | The simple average of the dataset. Denoted μ. |
| 33 | Mean, median and mode | median | The middle number in an ordered dataset. |
| 34 | Mean, median and mode | mode | The value that occurs most often. A dataset can have 0, 1, or multiple modes. |
| 35 | Skewness | measures of asymmetry | Measures that describe the data through its level of symmetry. The most common are skewness and kurtosis. |
| 36 | Skewness | skewness | A measure of the dataset's asymmetry around its mean. |
| 37 | Variance | sample formula | A formula calculated on a sample; the value obtained is a statistic. |
| 38 | Variance | population formula | A formula calculated on a population; the value obtained is a parameter. |
| 39 | Variance | measures of variability | Measures that describe the data through its level of dispersion (variability). The most common are variance and standard deviation. |
| 40 | Variance | variance | Measures the dispersion of the dataset around its mean, in units squared. Denoted σ² for a population and s² for a sample. |
| 41 | Standard deviation and coefficient of variation | standard deviation | Measures the dispersion of the dataset around its mean, in the original units. It is equal to the square root of the variance. Denoted σ for a population and s for a sample. |
| 42 | Standard deviation and coefficient of variation | coefficient of variation | The standard deviation relative to the mean, also called 'relative standard deviation'. Useful for comparing the variability of different datasets. |
| 43 | Covariance | univariate measure | A measure that refers to a single variable. |
| 44 | Covariance | multivariate measure | A measure that refers to multiple variables. |
| 45 | Covariance | covariance | A measure of the joint variability of two variables. Because of its scale of measurement, covariance is usually not directly interpretable. Denoted σxy for a population and sxy for a sample. |
| 46 | Correlation | linear correlation coefficient | A measure of the relationship between two variables. Very useful for direct interpretation, as it takes values in [-1, 1]. Denoted ρxy for a population and rxy for a sample. |
| 47 | Correlation | correlation | A measure of the relationship between two variables. There are several ways to compute it, the most common being the linear correlation coefficient. |
[Diagram: the normal distribution (bell curve)]

Section 3: Distributions & Estimation

| # | Lesson | Word | Definition |
|---|--------|------|------------|
| 48 | What is a distribution | distribution | A function that shows the possible values of a variable and the probability of their occurrence. |
| 49 | The normal distribution | bell curve | A common name for the normal distribution. |
| 50 | The normal distribution | Gaussian distribution | The original name of the normal distribution, named after the mathematician Gauss, who was the first to explore it through his work on the Gaussian function. |
| 51 | The normal distribution | to control for the mean/std/etc. | Holding a particular value constant while changing the other variables and observing the effect. |
| 52 | The standard normal distribution | standard normal distribution | A normal distribution with a mean of 0 and a standard deviation of 1. |
| 53 | The standard normal distribution | z-statistic | The statistic associated with the normal distribution. |
| 54 | The standard normal distribution | standardized variable | A variable that has been standardized using the z-score formula: first subtract the mean, then divide by the standard deviation. |
| 55 | The central limit theorem | central limit theorem | No matter the distribution of the underlying dataset, the sampling distribution of sample means approximates a normal distribution as the sample size grows. |
| 56 | The central limit theorem | sampling distribution | The distribution of a sample statistic (such as the sample mean) across repeated samples. |
| 57 | Standard error | standard error | The standard deviation of the sampling distribution. It takes the size of the sample into account. |
| 58 | Estimators and estimates | estimator | A function or rule according to which we make estimations. |
| 59 | Estimators and estimates | estimate | The particular value obtained through an estimator. |
| 60 | Estimators and estimates | bias | An unbiased estimator has an expected value equal to the population parameter; a biased one has an expected value different from it. The bias is the deviation from the true value. |
| 61 | Estimators and estimates | efficiency (in estimators) | In the context of estimators, efficiency loosely refers to 'lack of variability'. The most efficient estimator is the one with the least variability. It is a comparative measure. |
| 62 | Estimators and estimates | point estimator | A function or rule according to which we make estimations that result in a single number. |
| 63 | Estimators and estimates | point estimate | A single number derived from a certain point estimator. |
| 64 | Estimators and estimates | interval estimator | A function or rule according to which we make estimations that result in an interval. Here we consider only confidence intervals. |
| 65 | Estimators and estimates | interval estimate | A particular result obtained from an interval estimator. It is an interval. |
| 66 | Definition of confidence intervals | confidence interval | The range within which we expect the population parameter to lie, with a certain probability of being correct. |
| 67 | Definition of confidence intervals | reliability factor | A value from a z-table, t-table, etc. that is associated with our test. |
| 68 | Definition of confidence intervals | level of confidence | Shows in what percentage of cases we expect the population parameter to fall within the confidence interval obtained. Denoted 1 − α. Example: a 95% confidence level means that in 95% of cases, the population parameter will fall within the specified interval. |
| 69 | Population variance known, z-score | critical value | A value from a table for a specific statistic (z, t, F, etc.) associated with the probability (α) that the researcher has chosen. |
| 70 | Population variance known, z-score | z-table | A table associated with the z-statistic, where, given a probability (α), we can find the value of the standardized variable following the standard normal distribution. |
| 71 | Student's T distribution | t-statistic | A statistic generally associated with Student's T distribution, in the same way the z-statistic is associated with the normal distribution. |
| 72 | Student's T distribution | rule of thumb | A principle that is approximately true and widely used in practice due to its simplicity. |
| 73 | Student's T distribution | t-table | A table associated with the t-statistic, where, given a probability (α) and certain degrees of freedom, we can check the reliability factor. |
| 74 | Student's T distribution | degrees of freedom | The number of variables in the final calculation that are free to vary. |
| 75 | Margin of error | margin of error | Half the width of a confidence interval. It drives the width of the interval. |
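The pieces above (standard error, reliability factor, margin of error) combine into a confidence interval. A minimal sketch, assuming a known population standard deviation and an invented sample:

```python
import math

# Hypothetical sample; population standard deviation assumed known (σ = 10).
sample = [52, 61, 48, 57, 55, 60, 49, 58, 54, 56]
n = len(sample)
x_bar = sum(sample) / n
sigma = 10

# Standard error of the mean: σ / √n.
se = sigma / math.sqrt(n)

# 95% confidence level → α = 0.05 → reliability factor z(α/2) ≈ 1.96 (z-table).
z = 1.96
margin_of_error = z * se  # half the width of the interval

ci = (x_bar - margin_of_error, x_bar + margin_of_error)
print(f"sample mean: {x_bar:.2f}")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```

If the population variance were unknown, the reliability factor would come from the t-table (with n − 1 degrees of freedom) and s would replace σ.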
[Diagram: a two-tailed hypothesis test — H₀ (the status quo) in the center, with rejection regions (H₁: reject H₀) in both tails]

Section 4: Hypothesis Testing

| # | Lesson | Word | Definition |
|---|--------|------|------------|
| 76 | Null vs alternative | hypothesis | Loosely, a hypothesis is 'an idea that can be tested'. |
| 77 | Null vs alternative | hypothesis test | A test conducted in order to verify whether a hypothesis is true or false. |
| 78 | Null vs alternative | null hypothesis | The hypothesis to be tested. Whenever we conduct a test, we are trying to reject the null hypothesis. |
| 79 | Null vs alternative | alternative hypothesis | The opposite of the null. It is usually the researcher's opinion, as they are trying to reject the null hypothesis and thereby accept the alternative. |
| 80 | Null vs alternative | to accept a hypothesis | The statistical evidence shows that the hypothesis is likely to be true. |
| 81 | Null vs alternative | to reject a hypothesis | The statistical evidence shows that the hypothesis is likely to be false. |
| 82 | Null vs alternative | one-tailed test | A test that determines whether a value is lower (or equal) or higher (or equal) than a certain value. It can be rejected on only one side. |
| 83 | Null vs alternative | two-tailed test | A test that determines whether a value is equal to (or different from) a certain value. It can be rejected on two sides: if the parameter is too big or too small. |
| 84 | Rejection region and significance level | significance level | The probability of rejecting the null hypothesis when it is true. Denoted α. You choose the significance level; all else equal, the lower the level, the better the test. |
| 85 | Rejection region and significance level | rejection region | The part of the distribution for which we would reject the null hypothesis. |
| 86 | Type I error vs type II error | type I error (false positive) | Rejecting a null hypothesis that is true. The probability of committing it is α, the significance level. |
| 87 | Type I error vs type II error | type II error (false negative) | Accepting a null hypothesis that is false. The probability of committing it is β. |
| 88 | Type I error vs type II error | power of the test | The probability of rejecting a null hypothesis that is false (the researcher's goal). Denoted 1 − β. |
| 89 | Test for the mean. Population variance known | z-score | The standardized variable associated with the dataset being tested. It is looked up in the z-table with an α equal to the significance level of the test. |
| 90 | Test for the mean. Population variance known | μ₀ | The hypothesized population mean. |
| 91 | p-value | p-value | The smallest level of significance at which we can still reject the null hypothesis, given the observed sample statistic. |
| 92 | Test for the mean. Population variance unknown | email open rate | A measure of how many people on an email list actually open the emails they receive. |
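These definitions fit together in one worked test. The sketch below (all numbers invented) runs a two-tailed z-test for the mean with known population variance, computing the z-score and p-value by hand:

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# H0: μ = 100 vs H1: μ ≠ 100 (two-tailed), population σ assumed known.
mu_0, sigma = 100, 15
sample_mean, n = 106, 36
alpha = 0.05  # chosen significance level

# z-score of the sample mean under H0: (x̄ - μ0) / (σ / √n).
z = (sample_mean - mu_0) / (sigma / math.sqrt(n))

# Two-tailed p-value: probability of a result at least this extreme under H0.
p_value = 2 * (1 - normal_cdf(abs(z)))

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0 at the 5% significance level.")
else:
    print("Fail to reject H0.")
```

Since the p-value is the smallest α at which we can still reject H₀, comparing it against the chosen α is equivalent to comparing the z-score against the critical value.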

Section 5: Regression & Modeling

| # | Lesson | Word | Definition |
|---|--------|------|------------|
| 93 | Correlation vs causation | causation | A causal relationship between two variables: when one variable changes, the other changes accordingly. With causality, variable A affects variable B, but B is not required to cause a change in A. |
| 94 | Correlation vs causation | GDP | Gross domestic product: a monetary measure of the market value of all final goods and services produced in a country over a period. |
| 95 | The linear regression model | regression analysis | A statistical process for estimating relationships between variables. Usually used for building predictive models. |
| 96 | The linear regression model | linear regression model | A linear approximation of a causal relationship between two or more variables. |
| 97 | The linear regression model | dependent variable (ŷ) | The variable to be predicted; it 'depends' on the other variables. Usually denoted y. |
| 98 | The linear regression model | independent variable (xi) | A variable used to predict the dependent variable. It is the observed (sample) data. Usually denoted x1, x2, ..., xk. |
| 99 | The linear regression model | coefficient (βi) | A numerical or constant quantity placed before and multiplying a variable in an algebraic expression. |
| 100 | The linear regression model | constant (β₀) | A constant value that does not multiply any independent variable but affects the dependent variable in a constant manner. |
| 101 | The linear regression model | epsilon (ε) | The error of prediction: the difference between the observed value and the (unobservable) true value. |
| 102 | The linear regression model | regression equation | An equation whose coefficients are estimated from the sample data. Think of it as an estimator of the linear regression model. |
| 103 | The linear regression model | b₀, b₁, ..., bₖ | Estimates of the coefficients β₀, β₁, ..., βₖ. |
| 104 | Geometrical representation | regression line | The best-fitting line through the data points. |
| 105 | Geometrical representation | residual (e) | The difference between the observed value and the value estimated by the regression line. A point estimate of the error (ε). |
| 106 | Geometrical representation | b₀ | The intercept of the regression line with the y-axis in a simple linear regression. |
| 107 | Geometrical representation | b₁ | The slope of the regression line in a simple linear regression. |
| 108 | Example | SAT | A standardized test for college admission in the US. |
| 109 | Example | GPA | Grade point average. |
| 110 | Decomposition | ANOVA | Abbreviation of 'analysis of variance': a statistical framework for analyzing the variance of means. |
| 111 | Decomposition | SST | Sum of squares total: the sum of the squared differences between the observed dependent variable and its mean. |
| 112 | Decomposition | SSR | Sum of squares regression: the sum of the squared differences between the predicted values and the mean of the dependent variable. This is the variability explained by our model. |
| 113 | Decomposition | SSE | Sum of squares error: the sum of the squared differences between the observed and predicted values. This is the variability NOT explained by our model. |
| 114 | R-squared | r-squared (R²) | A measure ranging from 0 to 1 that shows how much of the total variability of the dataset is explained by the regression model. |
| 115 | OLS | OLS | Abbreviation of 'ordinary least squares': a method for estimating the coefficients of the regression equation. |
| 116 | Regression tables | regression tables | The tables produced by software after fitting a regression equation. |
| 117 | Multivariate linear regression model | multivariate linear regression | Also known as multiple linear regression. There is a slight difference between the two terms, but they are generally used interchangeably. Here it refers to a linear regression with more than one independent variable. |
| 118 | Adjusted R-squared | adjusted r-squared | A measure based on the idea of R-squared that penalizes the excessive use of independent variables. |
| 119 | F-test | F-statistic | Connected with the F-distribution in the same way the z-statistic is related to the normal distribution. |
| 120 | F-test | F-test | A test for the overall significance of the model. |
| 121 | Assumptions | assumptions | When performing linear regression analysis, several assumptions are made about the data. They are known as the linear regression assumptions. |
| 122 | Assumptions | linearity | The relationship between the dependent variable and the independent variables is linear. |
| 123 | Assumptions | homoscedasticity | Literally, 'same variance': the error terms have constant variance. |
| 124 | Assumptions | endogeneity | In statistics, a situation in which an independent variable is correlated with the error term. |
| 125 | Assumptions | autocorrelation | When different error terms in the same model are correlated with each other. |
| 126 | Assumptions | multicollinearity | High correlation between two or more independent variables. |
| 127 | A2. No endogeneity | omitted variable bias | A bias introduced into the error term when an important variable is left out of the model. |
| 128 | A3. Normality and homoscedasticity | heteroscedasticity | Literally, 'different variance'. The opposite of homoscedasticity. |
| 129 | A3. Normality and homoscedasticity | log transformation | A transformation in which a variable (or variables) in the model is replaced with its logarithm. |
| 130 | A3. Normality and homoscedasticity | semi-log model | One side of the model is in logarithms, the other is not. |
| 131 | A3. Normality and homoscedasticity | log-log model | Both sides of the model are in logarithms. |
| 132 | A4. No autocorrelation | serial correlation | Another name for autocorrelation. |
| 133 | A4. No autocorrelation | cross-sectional data | Data taken at one moment in time. |
| 134 | A4. No autocorrelation | time series data | A sequence of observations taken at successive, equally spaced points in time, e.g. stock prices. |
| 135 | A4. No autocorrelation | day of the week effect | A well-known phenomenon in finance: disproportionately high returns on Fridays and low returns on Mondays. |
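Most of the regression terms above can be seen in one small worked example. The sketch below (toy data) estimates b₀ and b₁ by OLS and verifies the ANOVA decomposition SST = SSR + SSE:

```python
# Simple linear regression by OLS, plus the ANOVA decomposition.
x = [1, 2, 3, 4, 5]  # independent variable
y = [2, 4, 5, 4, 5]  # dependent variable
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# OLS estimates: b1 = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²,  b0 = ȳ - b1·x̄
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

y_hat = [b0 + b1 * xi for xi in x]                     # predicted values

sst = sum((yi - y_bar) ** 2 for yi in y)               # total variability
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained by the model
sse = sum((yi - yh) ** 2 for yi, yh in zip(x, y_hat))  # see note below

# Residuals e = y - ŷ; SSE sums their squares.
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

r_squared = ssr / sst  # share of total variability explained by the model
print(f"y ≈ {b0:.2f} + {b1:.2f}·x, R² = {r_squared:.3f}")
```

With this data, b₁ = 0.6, b₀ = 2.2, and R² = SSR/SST = 0.6, so the model explains 60% of the total variability.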

That's it! You've reviewed 135 of the most critical definitions in statistics and data science. Understanding the context behind these terms is the key to success. Revisit this guide often, practice explaining the concepts in your own words, and you'll be more than ready for any analytical challenge. Good luck with your interviews!
