Descriptive Statistics and Counting Principles

This document provides a comprehensive introduction to descriptive statistics, data classification, measures of central tendency and dispersion, and fundamental counting principles including permutations and combinations.

1. Introduction to Statistics

Statistics is the art of learning from data, involving the collection, description, and analysis of data to draw conclusions.

1.1 Population and Sample

Example: In a study of house prices in Tamil Nadu, all houses in Tamil Nadu constitute the population, while 1000 randomly selected houses from urban areas form the sample.

1.2 Branches of Statistics

2. Data

Data are facts and figures collected, analyzed, and summarized for presentation and interpretation. Data are typically collected to understand characteristics or attributes of groups of people, places, things, or events.

2.1 Types of Data Structure

2.2 Variables and Cases

Example: In a student dataset, each student (e.g., Anjali, Pradeep) is a case, while "Name," "Gender," "Date of Birth," and "Board" are variables.
In a tabular format, each row represents a case (the same attributes are recorded for every case), and each column represents a variable (the same type of value is recorded in every row).

2.3 Classification of Data

Data is broadly classified into categorical and numerical types.

2.3.1 Categorical Data (Qualitative Variables)

2.3.2 Numerical Data (Quantitative Variables)

2.3.3 Time-series and Cross-sectional Data

2.3.4 Scales of Measurement

There are four scales of measurement: nominal, ordinal, interval, and ratio.

3. Describing Categorical Data

3.1 Frequency Distribution

A frequency distribution for qualitative data lists distinct values and their frequencies. Each row in a frequency table shows a category and the number of cases in that category.

3.2 Relative Frequency

The relative frequency of a category is the ratio of its frequency to the total number of observations. Relative frequencies always lie between 0 and 1 and sum to 1, which gives a common standard for comparing categories or datasets of different sizes.
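A minimal sketch of building a frequency and relative-frequency table for a categorical variable. The `boards` list is made-up data for illustration, and `frequency_table` is a hypothetical helper name.

```python
from collections import Counter

def frequency_table(values):
    """Map each category to (frequency, relative frequency)."""
    counts = Counter(values)
    total = len(values)
    return {cat: (freq, freq / total) for cat, freq in counts.items()}

# Hypothetical data: exam board recorded for eight students.
boards = ["CBSE", "State", "CBSE", "ICSE", "State", "CBSE", "State", "CBSE"]
table = frequency_table(boards)

for category, (freq, rel) in table.items():
    print(f"{category:6s} {freq:2d} {rel:.3f}")
```

Each printed row corresponds to one row of the frequency table: the category, its count, and its share of all observations.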

3.3 Charts of Categorical Data

The most common displays for categorical variables are bar charts and pie charts.

3.4 The Area Principle and Misleading Graphs

The Area Principle states that the area occupied by a part of a graph should correspond to the amount of data it represents. Violations of this principle are common ways to mislead with statistics.

3.5 Summarizing Categorical Data

Descriptive measures are quantities that summarize a dataset. Measures of central tendency indicate the center or most typical value.

4. Describing Numerical Data

4.1 Types of Variables

4.2 Organizing Numerical Data

Numerical data can be organized by grouping observations into classes (categories or bins) and then constructing frequency and relative-frequency distributions.

Terminology:

4.3 Stem-and-leaf Diagram (Stemplot)

A stem-and-leaf diagram separates each observation into a stem (all but the rightmost digit) and a leaf (the rightmost digit).

Construction Steps:

  1. Separate observations into stem and leaf.
  2. Write stems vertically, smallest to largest, to the left of a vertical rule.
  3. Write each leaf to the right of the rule in its appropriate stem row.
  4. Arrange leaves in each row in ascending order.
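The construction steps above can be sketched in a few lines. This version assumes non-negative integer observations and also prints stems that have no leaves, which is a common convention but not one the steps require. The `scores` list is made-up data.

```python
from collections import defaultdict

def stemplot(data):
    """Return the rows of a stem-and-leaf diagram for non-negative integers."""
    rows = defaultdict(list)
    for value in data:
        stem, leaf = divmod(value, 10)  # step 1: split off the rightmost digit
        rows[stem].append(leaf)
    lines = []
    for stem in range(min(rows), max(rows) + 1):  # step 2: stems, smallest to largest
        leaves = sorted(rows.get(stem, []))       # steps 3 and 4: leaves in ascending order
        lines.append(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}".rstrip())
    return lines

scores = [15, 22, 29, 36, 31, 23, 45, 10]  # hypothetical data
print("\n".join(stemplot(scores)))
```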

4.4 Descriptive Measures for Numerical Data

4.4.1 Measures of Central Tendency

Mean:

The sum of observations divided by the number of observations; commonly referred to as the average.

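Formula: for observations \(x_1, x_2, \ldots, x_n\), the mean is \(\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i\).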

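Effect of Constants: Adding \(c\) to every observation → mean + \(c\); multiplying every observation by \(c\) → mean \(\times\, c\).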
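A minimal sketch of the mean and of how it responds when a constant is added to, or multiplied into, every observation. The `marks` list and the constant `c = 5` are made-up values for illustration.

```python
def mean(xs):
    """Sum of the observations divided by the number of observations."""
    return sum(xs) / len(xs)

marks = [68, 79, 38, 68, 35, 70, 61, 47, 58, 66]  # hypothetical data
c = 5

original = mean(marks)
shifted = mean([x + c for x in marks])  # every observation increased by c
scaled = mean([x * c for x in marks])   # every observation multiplied by c

print(original, shifted, scaled)
```

With this data the shifted mean equals the original mean plus `c`, and the scaled mean equals the original mean times `c`.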

Median:

The middle value in an ordered dataset, dividing the bottom 50% from the top 50%.

Steps:

  1. Arrange data in increasing order.
  2. If \(n\) is odd, the median is the \(\left(\frac{n+1}{2}\right)\)th observation.
  3. If \(n\) is even, the median is the mean of the \(\left(\frac{n}{2}\right)\)th and \(\left(\frac{n}{2}+1\right)\)th observations.

Effect of Constants: Adding \(c\) → median + \(c\); multiplying by \(c\) → median \(\times\, c\). The median is not sensitive to outliers.
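The median steps translate directly into code. The sketch below uses made-up lists and also illustrates the outlier claim: replacing the largest value with a far larger one leaves the median unchanged.

```python
def median(xs):
    """Middle value of the ordered data (mean of the two middle values if n is even)."""
    ordered = sorted(xs)  # step 1: arrange in increasing order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # the (n+1)/2-th observation, counting from 1
    return (ordered[mid - 1] + ordered[mid]) / 2  # mean of the n/2-th and (n/2 + 1)-th

odd_data = [2, 12, 5, 7, 6]           # hypothetical data, n = 5
even_data = [2, 105, 5, 7, 6, 3]      # hypothetical data, n = 6
with_outlier = [2, 1200, 5, 7, 6]     # same as odd_data, but 12 replaced by 1200

print(median(odd_data), median(even_data), median(with_outlier))
```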

Mode: The most frequently occurring value in a dataset.
Effect of Constants: Adding \(c\) → mode + \(c\); multiplying by \(c\) → mode \(\times\, c\).

4.4.2 Measures of Dispersion (Variability/Spread)

Range:

The difference between the largest and smallest values in a dataset.

Formula: Range = Max - Min
Sensitivity: Highly sensitive to outliers as it only considers extreme values.

Variance:

Measures the variability by considering deviations of data values from the central value. Takes into account all observations.

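Formulas:

  Population variance: \(\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2\)
  Sample variance: \(s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2\)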

Units: Expressed in square units of the original variable.

Effect of Constants: Adding \(c\) → variance unchanged; multiplying by \(c\) → variance \(\times\, c^2\).

Standard Deviation:

The square root of the variance.

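Formulas:

  Population standard deviation: \(\sigma = \sqrt{\sigma^2}\)
  Sample standard deviation: \(s = \sqrt{s^2}\)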

Units: Expressed in the same units as the original data.

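Effect of Constants: Adding \(c\) → standard deviation unchanged; multiplying by \(c\) → standard deviation \(\times\, |c|\).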
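A short sketch of variance and standard deviation, assuming the sample form with an \(n-1\) denominator (the population form divides by \(n\) instead). The data and constants are made up; the last lines check how both measures respond to adding and multiplying by a constant.

```python
import math

def sample_variance(xs):
    """Average squared deviation from the mean, using an n - 1 denominator (assumed)."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def sample_sd(xs):
    """Square root of the sample variance, in the original units."""
    return math.sqrt(sample_variance(xs))

data = [3, 5, 7, 9, 11]  # hypothetical data
var = sample_variance(data)
sd = sample_sd(data)

var_shifted = sample_variance([x + 4 for x in data])  # adding a constant
var_scaled = sample_variance([x * 3 for x in data])   # multiplying by a constant
sd_scaled = sample_sd([x * -3 for x in data])         # negative constant: SD scales by |c|

print(var, sd, var_shifted, var_scaled, sd_scaled)
```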

4.5 Percentiles

The sample \(100p\) percentile is the data value such that at least \(100p\%\) of the data are less than or equal to it, and at least \(100(1-p)\%\) are greater than or equal to it.

Computing Percentiles:

  1. Arrange data in increasing order.
  2. Calculate \(np\).
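  3. If \(np\) is not an integer, take the observation at position \(\lceil np \rceil\), the smallest integer greater than \(np\).
  4. If \(np\) is an integer, take the average of the observations at positions \(np\) and \(np+1\).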
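The percentile rule above can be sketched as follows. To avoid floating-point trouble when testing whether \(np\) is an integer, this version takes the percentile as a whole number between 1 and 99 (so \(p\) = `percent / 100`); that interface is a design choice of the sketch, not part of the rule. The data is made up.

```python
def percentile(xs, percent):
    """Sample percentile for an integer percent in 1..99, following the np rule."""
    if not 0 < percent < 100:
        raise ValueError("percent must be strictly between 0 and 100")
    ordered = sorted(xs)                    # step 1: increasing order
    n = len(ordered)
    whole, rest = divmod(n * percent, 100)  # step 2: np = whole + rest / 100
    if rest:
        # np is not an integer: position whole + 1 (1-based) is the
        # smallest integer greater than np, i.e. index `whole` (0-based).
        return ordered[whole]
    # np is an integer: average of positions np and np + 1 (1-based).
    return (ordered[whole - 1] + ordered[whole]) / 2

data = [35, 38, 47, 58, 61, 66, 68, 68, 70, 79]  # hypothetical data, n = 10
print(percentile(data, 25), percentile(data, 50), percentile(data, 10))
```

Here \(np = 2.5\) for the 25th percentile, so the third ordered value is taken, while \(np = 5\) for the 50th percentile, so the fifth and sixth values are averaged.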

4.6 Quartiles

Quartiles divide a dataset into four equal parts using three values.

4.7 Five Number Summary

Minimum – Q1 – Median – Q3 – Maximum

4.8 Interquartile Range (IQR)

IQR = Q3 − Q1
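A sketch of the five number summary and the IQR. It assumes that Q1, the median, and Q3 are taken as the 25th, 50th, and 75th percentiles under the \(np\) rule of section 4.5; other quartile conventions exist and can give slightly different values. The data is made up.

```python
def percentile(xs, percent):
    """Sample percentile for an integer percent in 1..99 (np rule, assumed convention)."""
    ordered = sorted(xs)
    whole, rest = divmod(len(ordered) * percent, 100)
    if rest:
        return ordered[whole]                         # position ceil(np), 1-based
    return (ordered[whole - 1] + ordered[whole]) / 2  # average of positions np and np + 1

def five_number_summary(xs):
    """Return (minimum, Q1, median, Q3, maximum)."""
    return (min(xs), percentile(xs, 25), percentile(xs, 50), percentile(xs, 75), max(xs))

data = [35, 38, 47, 58, 61, 66, 68, 68, 70, 79]  # hypothetical data
summary = five_number_summary(data)
iqr = summary[3] - summary[1]  # IQR = Q3 - Q1

print(summary, iqr)
```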

5. Association Between Two Variables

Association between two variables means that knowing information about one variable provides information about the other.

5.1 Association Between Two Categorical Variables

To find associations, a contingency table is used.
Rule: If row relative frequencies (or column relative frequencies) are the same for all rows (or columns), the variables are not associated. If they differ, the variables are associated.
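The rule can be checked mechanically on a contingency table of counts. Both tables below are made-up data, and the helper names are hypothetical. In real samples the row relative frequencies are rarely exactly equal, so the small tolerance is only a guard against rounding, not a statistical test.

```python
def row_relative_frequencies(table):
    """Convert each row of counts into proportions of that row's total."""
    result = {}
    for row, counts in table.items():
        total = sum(counts.values())
        result[row] = {col: n / total for col, n in counts.items()}
    return result

def is_associated(table, tol=1e-9):
    """True if any row's relative frequencies differ from the first row's."""
    rows = list(row_relative_frequencies(table).values())
    first = rows[0]
    return any(abs(r[col] - first[col]) > tol for r in rows[1:] for col in first)

# Hypothetical counts: gender (rows) by phone ownership (columns).
same_rows = {"Female": {"Owns": 30, "Does not own": 20},
             "Male": {"Owns": 60, "Does not own": 40}}
different_rows = {"Female": {"Owns": 40, "Does not own": 10},
                  "Male": {"Owns": 50, "Does not own": 50}}

print(is_associated(same_rows), is_associated(different_rows))
```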

5.2 Association Between Two Numerical Variables

A scatter plot is a visual test for association between two numerical variables, displaying pairs of values as points on a two-dimensional plane.

Describing Association: Direction (upward, downward, no trend), Curvature (linear, curved), Variation (tightly clustered or not), Outliers.

5.2.1 Measures of Association for Numerical Variables

Covariance:

Quantifies the direction and strength of the linear association between two numerical variables. Its magnitude depends on the units of the variables.

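Formulas:

  Population covariance: \(\text{Cov}(x,y) = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu_x)(y_i - \mu_y)\)
  Sample covariance: \(\text{Cov}(x,y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})\)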

Positive covariance → large values of x tend to occur with large values of y; negative covariance → large values of x tend to occur with small values of y.

Correlation (Pearson r):

\( r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} = \frac{\text{Cov}(x,y)}{s_x s_y} \)

Unitless, −1 ≤ r ≤ +1.
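A minimal sketch of covariance and Pearson's \(r\), assuming the sample form of covariance with an \(n-1\) denominator. The paired data is made up. The final lines illustrate that \(r\) is unitless: rescaling \(x\) changes the covariance but not the correlation.

```python
import math

def covariance(xs, ys):
    """Sample covariance (n - 1 denominator, assumed)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    return sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson r, computed from sums of deviations as in the formula above."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]  # hypothetical data
y = [2, 4, 5, 4, 5]

cov_xy = covariance(x, y)
r_xy = correlation(x, y)
x_rescaled = [100 * v for v in x]  # e.g. metres converted to centimetres

print(cov_xy, r_xy, covariance(x_rescaled, y), correlation(x_rescaled, y))
```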

5.3 Association Between Categorical and Numerical Variables

Point Biserial Correlation Coefficient (\(r_{pb}\)): Measures the association between a numerical variable (X) and a dichotomous categorical variable (Y, coded e.g. as 0 or 1).
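The section does not state a formula for \(r_{pb}\). The sketch below assumes the standard textbook form \(r_{pb} = \frac{\bar{x}_1 - \bar{x}_0}{s_x}\sqrt{p_0 p_1}\), where \(\bar{x}_1\) and \(\bar{x}_0\) are the group means of X, \(p_1\) and \(p_0\) are the group proportions, and \(s_x\) is the standard deviation of X computed with an \(n\) denominator. It then compares the result with Pearson's \(r\) on the 0/1 coded data. The data and names are made up.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation from sums of deviations."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def point_biserial(xs, ys):
    """Assumed textbook form of r_pb; ys must contain only 0 and 1."""
    n = len(xs)
    group1 = [x for x, y in zip(xs, ys) if y == 1]
    group0 = [x for x, y in zip(xs, ys) if y == 0]
    p1, p0 = len(group1) / n, len(group0) / n
    m1, m0 = sum(group1) / len(group1), sum(group0) / len(group0)
    xbar = sum(xs) / n
    s_x = math.sqrt(sum((x - xbar) ** 2 for x in xs) / n)  # SD with n denominator
    return (m1 - m0) / s_x * math.sqrt(p0 * p1)

hours = [1, 2, 3, 4, 5, 6]   # hypothetical numerical variable X
passed = [0, 0, 0, 1, 1, 1]  # hypothetical dichotomous variable Y

r_pb = point_biserial(hours, passed)
print(r_pb, pearson_r(hours, passed))
```

On this data the two computations agree, which reflects that the point biserial coefficient is Pearson's \(r\) applied to a 0/1 coded variable.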

6. Basic Principle of Counting

6.1 Addition Rule of Counting

If action A can occur in \(n_1\) ways and action B in \(n_2\) ways, and A and B cannot occur simultaneously, then total ways = \(n_1 + n_2\).

Example: Choosing one item (a shirt or a pant) from 4 shirts and 3 pants → 4 + 3 = 7 choices.

6.2 Multiplication Rule of Counting

If action A can occur in \(n_1\) ways and, for each of these, action B can occur in \(n_2\) ways, then A and B can occur together in \(n_1 \times n_2\) ways.

Example: Choosing one shirt (4 ways) AND one pant (3 ways) → 4 × 3 = 12 combinations.
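Both counting rules can be verified by listing the outcomes for the shirt and pant example. The item labels are placeholders.

```python
from itertools import product

shirts = [f"shirt{i}" for i in range(1, 5)]  # 4 shirts
pants = [f"pant{i}" for i in range(1, 4)]    # 3 pants

# Addition rule: pick exactly one item, either a shirt or a pant.
single_items = shirts + pants

# Multiplication rule: pick one shirt and one pant together.
outfits = list(product(shirts, pants))

print(len(single_items), len(outfits))
```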

7. Factorial

\(n! = n \times (n-1) \times \cdots \times 1\)
By convention \(0! = 1\)
\(n! = n \times (n-1)!\)
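The recursive identity and the convention \(0! = 1\) give a direct definition. This is a sketch for illustration; in practice Python's built-in `math.factorial` does the same job.

```python
import math

def factorial(n):
    """n! via the recursion n! = n * (n-1)!, with 0! = 1 as the base case."""
    if n < 0:
        raise ValueError("factorial is not defined for negative integers")
    return 1 if n == 0 else n * factorial(n - 1)

print(factorial(0), factorial(5), math.factorial(5))
```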

8. Permutation

A permutation is an ordered arrangement of all or some of n distinct objects. Order matters.

8.1 Permutation Formula (No Repetition)

\(P(n,r) = {}^nP_r = \frac{n!}{(n-r)!}\)

Special cases: \({}^nP_0 = 1\), \({}^nP_1 = n\), \({}^nP_n = n!\)
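A quick check of the permutation formula against brute-force enumeration, using a made-up example of gold, silver, and bronze medals among five runners. `math.perm` requires Python 3.8 or later.

```python
import math
from itertools import permutations

def n_p_r(n, r):
    """Number of ordered arrangements of r out of n distinct objects: n! / (n-r)!."""
    return math.factorial(n) // math.factorial(n - r)

runners = ["A", "B", "C", "D", "E"]      # hypothetical runners
podiums = list(permutations(runners, 3))  # ordered (gold, silver, bronze) triples

print(len(podiums), n_p_r(5, 3), math.perm(5, 3))
```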

8.2 Permutation Formula (With Repetition)

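The number of ordered arrangements of \(r\) objects chosen from \(n\) distinct objects, when each object may be used repeatedly, is \(n^r\).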
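As an illustration, two-character codes over three symbols can be listed and counted. The symbols are made up.

```python
from itertools import product

symbols = "012"  # n = 3 distinct symbols
length = 2       # r = 2 positions, repetition allowed

codes = ["".join(t) for t in product(symbols, repeat=length)]
print(codes, len(codes))
```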

8.3 Rearranging Letters (with Identical Items)

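One kind identical: if \(p\) of the \(n\) objects are identical and the rest are distinct, the number of distinct arrangements is \(\frac{n!}{p!}\)
Multiple kinds: if there are \(p_1, p_2, \ldots, p_k\) identical objects of \(k\) different kinds, the number of distinct arrangements is \(\frac{n!}{p_1! \, p_2! \cdots p_k!}\)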
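A sketch that applies the formula to the letters of a word and, for a short word, checks it by brute force. The words are arbitrary examples.

```python
import math
from collections import Counter
from itertools import permutations

def distinct_arrangements(word):
    """n! divided by the factorial of each letter's repeat count."""
    total = math.factorial(len(word))
    for count in Counter(word).values():
        total //= math.factorial(count)
    return total

by_formula = distinct_arrangements("LEVEL")     # 5! / (2! * 2!)
by_enumeration = len(set(permutations("LEVEL")))  # duplicates collapse in the set

print(by_formula, by_enumeration, distinct_arrangements("MISSISSIPPI"))
```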

8.4 Circular Permutation

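Number of ways to arrange \(n\) distinct objects in a circle:
Clockwise ≠ anticlockwise (the two directions count as different arrangements): \((n-1)!\)
Clockwise = anticlockwise (the two directions count as the same, e.g. beads on a necklace): \(\frac{(n-1)!}{2}\), for \(n \ge 3\)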
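Both circular counts can be confirmed by brute force for small \(n\). The sketch treats two linear orderings as the same circular arrangement when one is a rotation of the other and, optionally, when one is a rotation of the other's mirror image. The function name and items are hypothetical.

```python
import math
from itertools import permutations

def circular_arrangements(items, reflections_equal=False):
    """Count circular arrangements by picking one canonical form per equivalence class."""
    seen = set()
    n = len(items)
    for perm in permutations(items):
        variants = [perm[i:] + perm[:i] for i in range(n)]     # all rotations
        if reflections_equal:
            mirrored = perm[::-1]
            variants += [mirrored[i:] + mirrored[:i] for i in range(n)]
        seen.add(min(variants))  # smallest variant represents the class
    return len(seen)

people = ["A", "B", "C", "D", "E"]
directed = circular_arrangements(people)                            # clockwise != anticlockwise
undirected = circular_arrangements(people, reflections_equal=True)  # clockwise == anticlockwise

print(directed, undirected, math.factorial(4))
```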

9. Combination

A combination is a selection of r objects from n distinct objects where order does not matter.

9.1 Combination Formula

\({}^nC_r = \binom{n}{r} = \frac{n!}{r!(n-r)!}\)

Properties: \({}^nC_r = {}^nC_{n-r}\), \({}^nC_0 = {}^nC_n = 1\)
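A quick check of the combination formula and its properties, using a made-up example of choosing a 3-person committee from five members. `math.comb` requires Python 3.8 or later.

```python
import math
from itertools import combinations

def n_c_r(n, r):
    """Number of unordered selections of r out of n distinct objects."""
    return math.factorial(n) // (math.factorial(r) * math.factorial(n - r))

members = ["A", "B", "C", "D", "E"]          # hypothetical members
committees = list(combinations(members, 3))  # unordered groups of 3

print(len(committees), n_c_r(5, 3), math.comb(5, 3))
```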

9.2 Distinguishing Permutation vs. Combination

Permutation → order matters (e.g. medals)
Combination → order does not matter (e.g. committee)

9.3 Drawing Lines in a Circle

For \(n\) points on a circle:
Undirected lines (segments joining two points): \({}^nC_2\)
Directed lines (arrows from one point to another): \({}^nP_2\)
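For a small case, the lines can be enumerated directly. The sketch labels six points on a circle 0 to 5; the number of points is an arbitrary choice.

```python
import math
from itertools import combinations, permutations

points = range(6)  # six labelled points on a circle

segments = list(combinations(points, 2))  # undirected: {a, b} counted once
arrows = list(permutations(points, 2))    # directed: (a, b) and (b, a) both counted

print(len(segments), len(arrows))
```

Each undirected segment corresponds to exactly two arrows, which is why the directed count is twice the undirected one.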