 # Statistical Testing: How to select the best test for your data? || Hypothesis Testing

#### Statistical Testing

The most confusing aspect of statistical testing, especially for those who are new to statistical analysis, is deciding which test to use and when. In this post I will try to make it easier to choose the correct statistical test for your data.

Let us first understand the variables.

• Dependent variable – also known as the output variable
• Independent variable – also known as a predictor variable, which affects the value of the dependent variable

Type of variables

Why do we need statistical tests

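Statistical testing is needed to quantify the risk associated with a decision that is based on a hypothesis, a gut feeling, or an assumption. Two kinds of risk are associated with any hypothesis test: alpha risk (rejecting a null hypothesis that is actually true) and beta risk (failing to reject a null hypothesis that is actually false). A test reports a p-value, which is the probability of seeing a result at least as extreme as the one observed if the null hypothesis were true. The p-value is compared against the chosen alpha risk to decide whether to reject the null hypothesis.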

Statistical assumptions in statistical testing

In statistical testing there are three assumptions that the data need to meet; otherwise the data are not considered fit for testing.

• Homogeneity of variance: The variance within each group should be similar to the variance within the other groups.
• Independence: The observations, and the groups being compared, should not be related to each other.
• Normality of data: The data should follow a normal distribution, that is, a bell-shaped curve.
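
The normality and equal-variance assumptions above can be checked before a test is chosen. The sketch below is one possible way to do it, assuming SciPy is installed; the Shapiro-Wilk and Levene tests are my choice of checks (the post does not name any), and the two groups are made-up values for illustration.

```python
from scipy import stats

# Hypothetical measurements for two groups (illustrative values only).
group_a = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7, 12.5, 12.0]
group_b = [12.6, 12.9, 12.4, 13.1, 12.7, 12.8, 12.5, 13.0, 12.3, 12.9]

alpha = 0.05  # assumed significance level

# Shapiro-Wilk: H0 = the sample comes from a normal distribution.
_, p_normal_a = stats.shapiro(group_a)
_, p_normal_b = stats.shapiro(group_b)

# Levene: H0 = the groups have equal variances.
_, p_equal_var = stats.levene(group_a, group_b)

# A p-value above alpha means the assumption is not rejected.
print("group_a looks normal:", p_normal_a > alpha)
print("group_b looks normal:", p_normal_b > alpha)
print("variances look equal:", p_equal_var > alpha)
```

Independence cannot be checked this way; it depends on how the data were collected.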

Types of statistical testing

Broadly, tests fall into two categories: 1) parametric tests and 2) non-parametric tests.

• Parametric tests are used when:
  • the data follow a normal distribution
  • the sample size is large enough to satisfy the central limit theorem
  • the data are interval or ratio data
• Non-parametric tests are used when:
  • the data type is nominal or ordinal
  • the data distribution is unknown

Parametric Tests

When both X and Y are continuous

• Correlation test
  • It looks for a relationship between the two variables.
  • The strength of the relationship is represented by the correlation coefficient R, which ranges from -1 to +1.
  • An R value close to +1 or -1 indicates a strong positive or negative relationship.
  • It does not differentiate between X and Y but treats both simply as variables.
• Regression test
  • It shows the magnitude of the relationship between the two variables.
  • The magnitude of the relationship is expressed by R-squared.
  • R-squared is the percentage of variation in Y that is explained by X. As a rule of thumb, an R-squared of about 65% or more is treated as a strong fit, although the acceptable threshold depends on the field.
  • It gives a regression equation of the form Y = mX + C.
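
To make R, R-squared and the regression equation concrete, here is a minimal sketch that computes them from their textbook formulas using only the Python standard library. The five data points are invented for illustration; for a regression with a single X, R-squared is simply the square of R.

```python
from statistics import mean

# Hypothetical paired observations (illustrative values only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

x_bar, y_bar = mean(x), mean(y)

# Sums of squares and cross-products around the means.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

# Correlation coefficient R (between -1 and +1).
r = s_xy / (s_xx * s_yy) ** 0.5

# Least-squares line Y = mX + C.
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Share of the variation in Y explained by X.
r_squared = r ** 2

print(f"R = {r:.3f}, R-squared = {r_squared:.0%}")
print(f"Y = {slope:.2f} X + {intercept:.2f}")
```

With these values the fitted line is Y = 0.60 X + 2.20 and R-squared is 60%.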

When X is discrete and Y is Continuous — When we are comparing means

• 1 sample t-test
  • Used when we want to compare the mean of a sample with a standard or target value.
  • It can also be used for baselining, based on the confidence interval.
  • H0: the mean is equal to the standard. Ha: the mean is not equal to the standard.
  • Examples: 1) checking whether the mileage of a particular batch of cars is 15 km/l; 2) checking the average weight of newborn babies in a city against a standard.
• 2 sample t-test
  • Used when we want to compare the means of two samples.
  • H0: the means of both samples are equal. Ha: the means of the two samples are not equal.
  • Examples: 1) comparing the mean mileage of two batches of cars; 2) comparing the mean quality score of two types of call.
• 1 way ANOVA
  • Used when we want to compare the means of more than two samples.
  • H0: the means of all samples are equal. Ha: at least one mean differs from the others.
  • Examples: 1) comparing the means of 6 groups within a particular class; 2) comparing the average pizza delivery time of 5 companies.
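
The three mean-comparison tests above might be run as follows. This is a sketch that assumes SciPy is available; the mileage figures are invented, and 0.05 is an assumed alpha risk rather than a value taken from the post.

```python
from scipy import stats

# Hypothetical mileage readings in km/l (illustrative values only).
batch_a = [14.8, 15.2, 15.1, 14.9, 15.3, 15.0, 14.7, 15.4]
batch_b = [15.6, 15.9, 15.4, 16.1, 15.8, 15.7, 16.0, 15.5]
batch_c = [15.0, 15.3, 14.9, 15.2, 15.1, 15.4, 14.8, 15.2]

alpha = 0.05  # assumed alpha risk

# 1 sample t-test: is the mean of batch_a equal to the 15 km/l target?
_, p_one_sample = stats.ttest_1samp(batch_a, popmean=15.0)

# 2 sample t-test: do batch_a and batch_b have the same mean?
# The default assumes equal variances; equal_var=False gives Welch's t-test.
_, p_two_sample = stats.ttest_ind(batch_a, batch_b)

# 1 way ANOVA: do all three batches share the same mean?
_, p_anova = stats.f_oneway(batch_a, batch_b, batch_c)

for name, p in [("1 sample t", p_one_sample),
                ("2 sample t", p_two_sample),
                ("1 way ANOVA", p_anova)]:
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(f"{name}: p = {p:.4f} -> {decision}")
```

In each case a p-value below alpha leads to rejecting H0 in favour of Ha.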

When X is discrete and Y is Continuous — When we are comparing variance

• 1 variance test
  • Used when we want to compare the variance or standard deviation of one sample with a predefined target.
  • It can also be used for baselining, based on the confidence interval.
  • H0: the standard deviation or variance is equal to the standard. Ha: the standard deviation or variance is not equal to the standard.
• 2 variance test
  • Used when we want to compare the standard deviation or variance of two samples.
  • H0: the standard deviation or variance is equal for both samples. Ha: the standard deviation or variance is not equal for both samples.
• Test for equal variance
  • Used when we want to compare the standard deviation or variance of more than two samples.
  • H0: the standard deviation or variance is equal for all samples. Ha: the standard deviation or variance of at least one sample differs.
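
The variance tests can be sketched in the same way. The post does not say which statistic each test uses, so the choices below are assumptions on my part: a chi-square statistic for one variance, an F ratio for two variances, and Bartlett's test for more than two samples. All three assume normally distributed data. SciPy is assumed to be installed, and the samples and target variance are invented.

```python
from statistics import variance
from scipy import stats

# Hypothetical samples (illustrative values only).
sample_a = [14.8, 15.2, 15.1, 14.9, 15.3, 15.0, 14.7, 15.4]
sample_b = [14.2, 15.9, 15.4, 14.5, 16.1, 15.0, 14.0, 16.3]
sample_c = [15.1, 14.9, 15.0, 15.2, 14.8, 15.1, 15.0, 14.9]

# 1 variance test: chi-square statistic (n - 1) * s^2 / target variance.
target_variance = 0.06  # hypothetical predefined target
df_a = len(sample_a) - 1
chi2_stat = df_a * variance(sample_a) / target_variance
p_one_var = 2 * min(stats.chi2.cdf(chi2_stat, df_a),
                    stats.chi2.sf(chi2_stat, df_a))

# 2 variance test: F ratio of the two sample variances (two-sided).
df_b = len(sample_b) - 1
f_stat = variance(sample_a) / variance(sample_b)
p_two_var = 2 * min(stats.f.cdf(f_stat, df_a, df_b),
                    stats.f.sf(f_stat, df_a, df_b))

# Test for equal variance across more than two samples (Bartlett).
_, p_equal_var = stats.bartlett(sample_a, sample_b, sample_c)

print(f"1 variance: p = {p_one_var:.4f}")
print(f"2 variance: p = {p_two_var:.4f}")
print(f"equal variance: p = {p_equal_var:.4f}")
```

If normality is doubtful, `stats.levene` takes the same arguments as `stats.bartlett` and is less sensitive to non-normal data.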

When X is discrete and Y is discrete — Y is Binary

• 1 proportion test
  • Used when comparing the proportion of an event in a sample with a predefined standard.
  • H0: the proportion is equal to the standard proportion. Ha: the proportion is not equal to the standard proportion.
• 2 proportion test
  • Used when comparing the proportion of an event in two samples.
  • H0: the proportion is equal for both samples. Ha: the proportion is not equal for both samples.
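
Both proportion tests can be approximated with a z-test, which needs only the Python standard library. This is a sketch under the normal approximation, which is reasonable for large samples; statistical packages may use an exact binomial method instead. The function names and counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z_test(successes, n, p0):
    """Two-sided z-test of a sample proportion against a standard p0."""
    p_hat = successes / n
    se = sqrt(p0 * (1 - p0) / n)  # standard error under H0
    z = (p_hat - p0) / se
    return z, 2 * (1 - NormalDist().cdf(abs(z)))


def two_proportion_z_test(s1, n1, s2, n2):
    """Two-sided z-test for equality of two sample proportions."""
    pooled = (s1 + s2) / (n1 + n2)  # common proportion under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (s1 / n1 - s2 / n2) / se
    return z, 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical counts: 60 events in 100 trials against a 50% standard.
z_one, p_one = one_proportion_z_test(60, 100, 0.5)

# Hypothetical counts: 45 of 100 events versus 30 of 100 events.
z_two, p_two = two_proportion_z_test(45, 100, 30, 100)

print(f"1 proportion: z = {z_one:.2f}, p = {p_one:.4f}")
print(f"2 proportion: z = {z_two:.2f}, p = {p_two:.4f}")
```

Both examples give a p-value below 0.05, so H0 would be rejected at that alpha risk.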

When X is continuous and Y is discrete

• Binary Logistic Regression
  • Used when Y is discrete and binary while X is continuous.
  • H0: X does not significantly impact Y. Ha: X significantly impacts Y.
• Ordinal Logistic Regression
  • Used when Y is discrete and ordinal while X is continuous.
  • H0: X does not significantly impact Y. Ha: X significantly impacts Y.
• Nominal Logistic Regression
  • Used when Y is discrete and nominal while X is continuous.
  • H0: X does not significantly impact Y. Ha: X significantly impacts Y.
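
For the binary case, the idea can be illustrated from scratch with the standard library. The sketch fits a one-predictor logistic model with Newton-Raphson steps and then applies a Wald z-test to the slope as one way to judge whether X significantly impacts Y. The data are invented, and both the fitting method and the Wald test are my choices for illustration; a statistics package would normally do this work and may report a different test. Ordinal and nominal models are not covered here.

```python
from math import exp, sqrt
from statistics import NormalDist

# Hypothetical data: a continuous X and a binary Y (illustrative only).
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]


def score_and_information(b0, b1):
    """Gradient of the log-likelihood and the 2x2 information matrix."""
    p = [1.0 / (1.0 + exp(-(b0 + b1 * xi))) for xi in x]
    g0 = sum(yi - pi for yi, pi in zip(y, p))
    g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
    w = [pi * (1.0 - pi) for pi in p]
    h00 = sum(w)
    h01 = sum(wi * xi for wi, xi in zip(w, x))
    h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
    return g0, g1, h00, h01, h11


# Newton-Raphson: add (information matrix)^-1 times the gradient.
b0, b1 = 0.0, 0.0
for _ in range(25):
    g0, g1, h00, h01, h11 = score_and_information(b0, b1)
    det = h00 * h11 - h01 * h01
    b0 += (h11 * g0 - h01 * g1) / det
    b1 += (h00 * g1 - h01 * g0) / det

# Wald test for the slope: H0 = the slope is zero (X does not impact Y).
_, _, h00, h01, h11 = score_and_information(b0, b1)
se_b1 = sqrt(h00 / (h00 * h11 - h01 * h01))
z = b1 / se_b1
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
print(f"Wald z = {z:.2f}, p = {p_value:.4f}")
```

A positive slope means the probability that Y equals 1 rises as X increases.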

Non Parametric Tests

When X is discrete and Y is continuous — when we are comparing centering

• 1 sample sign
  • Used when we want to compare the median of a sample with a predefined standard.
  • It can also be used for baselining, based on the predefined target.
  • H0: the median equals the predefined target. Ha: the median does not equal the predefined target.
• Mann Whitney
  • Used when we want to compare the medians of two samples.
  • H0: the medians of the two samples are equal. Ha: the medians of the two samples are not equal.
• Mood's Median
  • Used when we want to compare the medians of more than two samples.
  • H0: the medians of all samples are equal. Ha: the medians of all samples are not equal.
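
The median-based tests might look like this in practice, assuming SciPy is installed. The delivery times are invented and each sample contains one extreme value, which is the kind of data where a median is more representative than a mean. The sign test is written out through the binomial distribution, since it only counts values above and below the target.

```python
from scipy import stats

# Hypothetical delivery times in minutes (illustrative values only).
store_a = [28, 31, 25, 35, 30, 27, 33, 29, 90, 26]
store_b = [38, 41, 36, 44, 39, 37, 42, 40, 43, 95]
store_c = [30, 29, 34, 31, 28, 33, 32, 27, 35, 88]

# 1 sample sign test against a hypothetical target median of 30.
target = 30
above = sum(v > target for v in store_a)
below = sum(v < target for v in store_a)  # values equal to target are dropped
n_signs = above + below
p_sign = min(1.0, 2 * stats.binom.cdf(min(above, below), n_signs, 0.5))

# Mann Whitney: compare two samples.
_, p_mann_whitney = stats.mannwhitneyu(store_a, store_b,
                                       alternative="two-sided")

# Mood's median: compare more than two samples.
_, p_mood, grand_median, _ = stats.median_test(store_a, store_b, store_c)

print(f"1 sample sign: p = {p_sign:.4f}")
print(f"Mann Whitney: p = {p_mann_whitney:.4f}")
print(f"Mood's median: p = {p_mood:.4f} (grand median {grand_median})")
```

Strictly speaking, Mann Whitney compares the two distributions as a whole; it can be read as a comparison of medians when both samples have a similar shape.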

When X is discrete and Y is discrete

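• Chi Square (cross tabulation)
  • Used when both X and Y are categorical and the raw observations are cross-tabulated.
  • H0: X does not significantly impact Y. Ha: X significantly impacts Y.
• Chi Square (2 way table)
  • Used when the data for X and Y are already summarized as numeric counts in a two-way table.
  • The observed counts are compared with the expected values calculated under H0.
  • H0: X does not significantly impact Y. Ha: X significantly impacts Y.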
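
A chi-square test on a two-way table of counts can be sketched as follows, assuming SciPy is installed. The table is invented: rows stand for two levels of X and columns for three levels of Y. The function also returns the expected counts, which are the values the observed counts are compared against under H0.

```python
from scipy import stats

# Hypothetical two-way table of counts (illustrative values only).
# Rows: two levels of X. Columns: three levels of Y.
observed = [
    [30, 20, 10],
    [15, 25, 20],
]

chi2_stat, p_value, dof, expected = stats.chi2_contingency(observed)

print(f"chi-square = {chi2_stat:.3f}, degrees of freedom = {dof}")
print(f"p = {p_value:.4f}")
print("expected counts under H0:")
print(expected)
```

Here the statistic is about 8.889 with 2 degrees of freedom and a p-value of about 0.012, so H0 would be rejected at an alpha risk of 0.05.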

That is all for now. You can read more in the article below.

https://datascience.foundation/sciencewhitepaper/statistical-testing-understanding-which-testing-methods-to-use