ISA 225

Planted 02021-08-17

Notes for Principles of Business Analytics - ISA 225

Review of Statistical Inference

  • Population:
    • N Size
    • μ Mean
    • 𝜎 Standard Deviation
    • P Ratio (proportion)
  • Sample:
    • n Size
    • p̂ Ratio (proportion) (number of successes in sample / sample size)
    • x̄ Mean
    • s Standard Deviation

$$Z = {sampleRatio - populationRatio \over \sqrt{populationRatio(1-populationRatio) \over sampleSize} }$$

$$Z = {sampleMean - populationMean \over{populationStandardDeviation \over \sqrt{sampleSize}} }$$

$$Z = {p̂ - P \over \sqrt{P(1-P) \over n} }$$

$$Z = {x̄ - μ \over{𝜎 \over \sqrt{n}} }$$
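The two formulas above can be checked numerically with a short script (the numbers are made up for illustration):

```python
# Z-scores for a sample proportion and a sample mean,
# matching the symbolic formulas above.
import math

def z_for_proportion(p_hat, P, n):
    """Z = (p-hat - P) / sqrt(P(1 - P) / n)"""
    return (p_hat - P) / math.sqrt(P * (1 - P) / n)

def z_for_mean(x_bar, mu, sigma, n):
    """Z = (x-bar - mu) / (sigma / sqrt(n))"""
    return (x_bar - mu) / (sigma / math.sqrt(n))

print(z_for_proportion(0.55, 0.50, 100))
print(z_for_mean(102, 100, 10, 25))
```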

Checks

  • Checks:
    • Independent:
      • ✅ Randomly selected
      • ✅ 10% condition: n < 10% of the population size N
    • Normality:
      • ✅ Success/Failure:
        • ✅ np >= 10
        • ✅ n(1-p) >= 10
      • ✅ Expect at least 10 successes and 10 failures

Common Z-Values

| α | Confidence Level | Z* |
| --- | --- | --- |
| 0.10 | 90% | 1.645 |
| 0.05 | 95% | 1.960 ≈ 2 |
| 0.01 | 99% | 2.576 |
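These critical values come from the standard normal inverse CDF: Z* is the z-value with right-tail area α/2. Python's standard library can reproduce the table:

```python
# Z* = (1 - alpha/2) quantile of the standard normal distribution.
from statistics import NormalDist

for alpha in (0.10, 0.05, 0.01):
    z_star = NormalDist().inv_cdf(1 - alpha / 2)
    print(f"alpha = {alpha}: Z* = {z_star:.3f}")
```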

Determining Sample Size

$$sampleSize = {sampleRatio(1-sampleRatio) \times\left({criticalValue \over{marginOfError}}\right)^2 }$$

$$n = {p̂(1-p̂) \times\left({Z^* \over{ME}}\right)^2 }$$

$$sampleSize = {\left({criticalValue \times{ populationStandardDeviation} \over{marginOfError}}\right)^2}$$

$$n = {\left({Z^* \times{𝜎} \over{ME}}\right)^2}$$

$$sampleSize = {\left({criticalValue \times{ sampleStandardDeviation} \over{marginOfError}}\right)^2}$$

$$n = {\left({Z^* \times{s} \over{ME}}\right)^2}$$
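A sketch of both sample-size formulas in code (the margins of error below are made-up inputs; n is rounded up because a sample size must be a whole number):

```python
import math

def sample_size_proportion(p_hat, z_star, me):
    """n = p-hat(1 - p-hat) * (Z*/ME)^2, rounded up."""
    return math.ceil(p_hat * (1 - p_hat) * (z_star / me) ** 2)

def sample_size_mean(sigma, z_star, me):
    """n = (Z* * sigma / ME)^2, rounded up; use s when sigma is unknown."""
    return math.ceil((z_star * sigma / me) ** 2)

# 95% confidence, 3% margin of error, conservative p-hat = 0.5
print(sample_size_proportion(0.5, 1.96, 0.03))
# 95% confidence, sigma = 15, margin of error 2
print(sample_size_mean(15, 1.96, 2))
```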

Notes

The t-distribution describes the statistical properties of sample means that are estimated from small samples; the standard normal distribution is used for large samples.

Hypothesis test for population proportion and mean

A hypothesis is a claim about a population parameter (proportion, mean)

Steps to compute a hypothesis test:

  1. State hypothesis
  2. Calculate the test statistic
  3. Find p-value
  4. Make conclusions based on p-value

The null hypothesis, Ho, is the starting assumption (nothing has changed).

$$H_o: populationParameter = claimedValue$$

The alternative hypothesis, Ha, is a claim that the population parameter differs from the value in the null hypothesis. It can take these forms depending on what you want to test:

Left-tailed hypothesis test: $$H_a: populationParameter \lt claimedValue$$

Right-tailed hypothesis test: $$H_a: populationParameter \gt claimedValue$$

Two-tailed hypothesis test: $$H_a: populationParameter \neq claimedValue$$

Step 2: Calculate the Test Statistics

  • Test statistic for the population proportion P (one-prop Z-test), where p₀ is the claimed value:
    • $$Z = {p̂ - p_0 \over \sqrt{p_0(1-p_0) \over n} }$$
  • Test statistic for the population mean μ (one-sample test):
    1. When the population standard deviation 𝜎 is known (one-sample Z-test):
    • $$Z = {sampleMean - claimedValue \over{populationStandardDeviation \over \sqrt{sampleSize}} }$$
    • $$Z = {x̄ - μ_0 \over{𝜎 \over \sqrt{n}} }$$
    2. When the population standard deviation 𝜎 is unknown (one-sample t-test / Student's t-test):
    • $$t = {sampleMean - claimedValue \over{sampleStdDev \over \sqrt{sampleSize}} }$$
    • $$t = {x̄ - μ_0 \over{s \over \sqrt{n}} }$$
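A small worked example of the unknown-𝜎 case (the data and claimed mean are made up):

```python
# One-sample t statistic: t = (x-bar - mu_0) / (s / sqrt(n)), df = n - 1.
import math
from statistics import mean, stdev

data = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]
mu_0 = 12.0  # claimed value under H0

x_bar, s, n = mean(data), stdev(data), len(data)
t = (x_bar - mu_0) / (s / math.sqrt(n))
print(f"x-bar = {x_bar:.3f}, s = {s:.3f}, t = {t:.3f}, df = {n - 1}")
```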

Step 4: Make conclusion based on p-value

Compare the p-value with the significance level α (always given before the test). A smaller α requires stronger evidence before rejecting the null hypothesis.

  1. Type I error: the null hypothesis is true, but we reject it (false positive)
  2. Type II error: the null hypothesis is false, but we fail to reject it (false negative)

If p-value < α, then reject null hypothesis, we have enough evidence to support Ha.

If p-value > α, then do not reject null hypothesis, we do not have enough evidence to support Ha.

Comparing Two Population Parameters

Two Sample t-test (comparing two population means)

  1. State hypothesis
  2. Check assumptions and calculate test statistics
  3. Find p-value based on test statistics
  4. Make conclusion based on p-value

$$Z = {(ȳ_1 - ȳ_2) - (μ_1 - μ_2) \over\sqrt{ {𝜎_1^2 \over{ n_1 }} + {𝜎_2^2 \over{n_2}} } }$$

Since population standard deviations are unknown, we use the standard errors instead:

$$t = {(ȳ_1 - ȳ_2) - (μ_1 - μ_2) \over\sqrt{ {s_1^2 \over{ n_1 }} + {s_2^2 \over{n_2}} } }$$
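The two-sample t statistic can be computed directly from the group summaries (made-up data; under H0, μ1 − μ2 = 0):

```python
# t = ((y-bar1 - y-bar2) - 0) / sqrt(s1^2/n1 + s2^2/n2)
import math
from statistics import mean, variance

g1 = [23, 25, 28, 22, 26, 24]
g2 = [20, 21, 24, 19, 23, 22]

se = math.sqrt(variance(g1) / len(g1) + variance(g2) / len(g2))
t = (mean(g1) - mean(g2)) / se
print(f"t = {t:.3f}")
```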

Confidence Interval for Difference between Two Population Means

Two sample Z-interval (when $$𝜎_1$$ and $$𝜎_2$$ are known)

$${(ȳ_1 - ȳ_2) \pm Z^* \times \sqrt{ {𝜎_1^2 \over{ n_1 }} + {𝜎_2^2 \over{n_2}} } }$$

Two sample t-interval (when $$𝜎_1$$ and $$𝜎_2$$ are unknown)

$${(ȳ_1 - ȳ_2) \pm t^* \times \sqrt{ {s_1^2 \over{ n_1 }} + {s_2^2 \over{n_2}} } }$$

The $$t^*_{df, \alpha/2}$$ here depends on the confidence level 100(1-α)% and the calculated df.
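A sketch of the unknown-𝜎 interval on made-up data; 1.96 stands in for the critical value (a large-sample approximation — for small samples look up $$t^*_{df, \alpha/2}$$ instead):

```python
# (y-bar1 - y-bar2) +/- crit * sqrt(s1^2/n1 + s2^2/n2)
import math
from statistics import mean, variance

g1 = [23, 25, 28, 22, 26, 24]
g2 = [20, 21, 24, 19, 23, 22]

diff = mean(g1) - mean(g2)
se = math.sqrt(variance(g1) / len(g1) + variance(g2) / len(g2))
crit = 1.96  # swap in t* for small n
lo, hi = diff - crit * se, diff + crit * se
print(f"interval for mu1 - mu2: ({lo:.2f}, {hi:.2f})")
```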

Interpretation of C.I. is similar to one-sample test

Chi-Square Tests

  • One variable?
    • Goodness of Fit Test
    • H0: model fits data
    • Ha: model does not fit data
  • Two variables?
    • Test for independence
    • H0: variables are independent
    • Ha: variables are not independent

Goodness-of-Fit Tests (one variable)

A χ2 goodness of fit test is applied when you have one categorical variable from a single population.

  1. State the hypothesis:
  • H0: model fits. (hypothesized model fits the sample we collected)
  • Ha: model doesn’t fit. (hypothesized model doesn’t fit the sample we collected)
  2. Assumptions and Test Statistics:
  • Assumptions:
    • Counted Data Condition – The data must be counts for the categories of a single categorical variable.
    • Independence Assumption – The counts should be independent of each other.
    • Randomization Condition – The counted individuals should be a random sample of the population.
    • Sample Size Assumption – We expect at least 5 individuals per cell.
  • Test statistics:
    • $${\chi^2 = \sum_{allCells} {(Obs - Exp)^2\over{Exp} } }$$
  3. Find the p-value based on the test statistic
  • df = (# of cells − 1); fix the row of df in the χ2 table, then use the test statistic to find the corresponding p-value, which is the right-tail probability of the test statistic.
  • (or by technology) p-value = P(χ2 > test statistic)
  4. Make conclusions based on the p-value
  • If p-value < α, reject the H0, which means the hypothesized model doesn’t fit the sampled data.
  • If p-value > α, fail to reject H0, we do not have significant evidence to say the model doesn’t fit the sampled data.
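The goodness-of-fit procedure, end to end, on a made-up example (testing whether a six-sided die is fair over 60 rolls):

```python
# chi^2 = sum over cells of (Obs - Exp)^2 / Exp, with df = #cells - 1.
observed = [8, 12, 9, 11, 14, 6]
expected = [10] * 6  # fair die: 60 rolls / 6 faces

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1
print(f"chi2 = {chi2:.2f}, df = {df}")
# From the chi^2 table at df = 5, the alpha = 0.05 critical value is 11.07;
# chi2 = 4.20 < 11.07, so we fail to reject H0: the fair-die model fits.
```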

Chi-Square test for Independence (two variables)

  1. State the hypothesis:
  • H0: variables are independent.
  • Ha: variables are not independent.
  2. Assumptions and Test Statistics:
  • Assumptions:
    • Counted Data Condition – The data must be counts for the categories of two categorical variables.
    • Randomization Condition – The counted individuals should be a random sample of the population.
    • Sample Size Assumption – We expect at least 5 individuals per cell.
  • Test statistics:
    • $${\chi^2 = \sum_{allCells} {(Obs - Exp)^2\over{Exp} } }$$
    • Assuming H0 is true, which means that the variables are independent.
      • $${Exp_{ij} = {totalRow_i \times totalCol_j \over tableTotal} }$$
  3. Find the p-value based on the test statistic
  • df = (# of rows − 1) × (# of cols − 1); fix the row of df in the χ2 table, then use the test statistic to find the corresponding p-value, which is the right-tail probability of the test statistic.
  • p-value = P(χ2 > test statistic)
  4. Make conclusions based on the p-value
  • If p-value < α, reject H0, which means the two variables are not independent.
  • If p-value > α, fail to reject H0; we do not have significant evidence to say the two variables are not independent.
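The same statistic for a two-way table, with expected counts built from the row and column totals (the 2×2 table is made up):

```python
# Exp_ij = (row_i total * col_j total) / table total, assuming independence;
# chi^2 = sum over cells of (Obs - Exp)^2 / Exp, df = (rows-1)*(cols-1).
table = [[30, 20],
         [10, 40]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(table):
    for j, obs in enumerate(row):
        exp = row_totals[i] * col_totals[j] / total
        chi2 += (obs - exp) ** 2 / exp

df = (len(table) - 1) * (len(table[0]) - 1)
print(f"chi2 = {chi2:.2f}, df = {df}")
```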

Simple regression (linear)

Sample regression line:

  • ŷ the predicted value of response variable (y), when x is given as a specific value.
  • b0 the sample y-intercept
  • b1 the sample slope
  • r the correlation coefficient
    • value from -1 to 1
    • closer to 0, the weaker relationship they have
  • r2 the proportion of the observed variation in y that can be accounted for by x, or modeling by x.
    • shows how well the model fits the data
    • value from 0 to 1
    • closer is to 1, the stronger the regression relationship.
  • $${e = y - \hat{y}}$$ the residual, difference between predicted (ŷ), and observed (y) values
  • Ɛ the population error term (the residual at the population level)
  • μy the population mean of y at a given value of x
  • 𝛽0 the population mean value of Y when X = 0
  • 𝛽1 the population mean change in Y for each unit increase in X

Step 1: State the hypothesis

  • $$H_o: \beta_1 = 0$$
  • $$H_a: \beta_1 \ne 0$$

Step 2: Test statistics

  • df=n-2
  • Se is called “Root Mean Squared Error”
  • $${t = {b_1-\beta_1\over{SE(b_1)}}}$$, which under $$H_o$$ ($$\beta_1 = 0$$) reduces to $$t = {b_1 \over SE(b_1)}$$
  • Confidence interval = $${b_1 \pm t^*_{df,{\alpha\over{2}}} \times SE(b_1) }$$
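The slope test can be assembled from scratch on a small made-up dataset (b1, SE(b1), and t as defined above):

```python
import math

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
b0 = y_bar - b1 * x_bar

residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
se = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))  # root mean squared error
se_b1 = se / math.sqrt(sxx)
t = b1 / se_b1  # test statistic for H0: beta_1 = 0
print(f"b1 = {b1:.3f}, SE(b1) = {se_b1:.4f}, t = {t:.2f}, df = {n - 2}")
```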

Regression Assumptions

  1. Linearity Assumption: scatterplot looks like a linear relationship
  2. Independence Assumption: randomly selected
  3. Equal Variance Assumption: scatterplot equally spread out, no clumping and spread around the line in residual plot is reasonably consistent at line 0
  4. Normal Population Assumption: the residuals satisfy the Nearly Normal Condition (roughly unimodal and symmetric, e.g. in a histogram or normal probability plot of the residuals)
