This document provides a comprehensive introduction to descriptive statistics, data classification, measures
of central tendency and dispersion, and fundamental counting principles including permutations and
combinations.
1. Introduction to Statistics
Statistics is the art of learning from data, involving the collection, description, and analysis of data to draw
conclusions.
1.1 Population and Sample
Population: The entire collection of elements of interest.
Sample: A subgroup of the population selected for detailed study.
Example: In a study of house prices in Tamil Nadu, all houses in Tamil Nadu
constitute the population, while 1000 randomly selected houses from urban areas form the sample.
1.2 Branches of Statistics
Descriptive Statistics: Focuses on describing and summarizing data, often using numerical
or graphical summaries to highlight main points. This can be applied to both sample and population data.
Purpose: To examine and explore information about collected data only.
Example: Calculating the average marks of a class of 50 students.
Inferential Statistics: Involves drawing conclusions or making inferences about a
population based on data obtained from a sample.
Purpose: To use sample information to draw conclusions about the population.
Example: A teacher uses the average marks of a sample of students to conclude the average
marks of all students in the school.
2. Data
Data are facts and figures collected, analyzed, and summarized for presentation and interpretation. Data are
typically collected to understand characteristics or attributes of groups of people, places, things, or events.
2.1 Types of Data Structure
Unstructured Data: Data that is not organized in a predefined manner. It is often text-heavy and
requires more effort to process. Examples include YouTube comments, image files, and social media posts.
Structured Data: Data organized in a standardized, clearly defined, and searchable format,
making it easy to analyze and understand. It typically appears in tabular form.
Example: A student dataset with columns for Name, Gender, Date of Birth, Marks, and Board.
2.2 Variables and Cases
Case (Observation): A unit for which data is collected, uniquely identifying each row in a
dataset.
Variable: A characteristic or attribute that varies across all units.
Example: In a student dataset, each student (e.g., Anjali, Pradeep) is a case,
while "Name," "Gender," "Date of Birth," and "Board" are variables. In a tabular format, rows represent cases
(the same attributes are recorded for each case), and columns represent variables (the same type of value is
recorded in each column).
2.3 Classification of Data
Data is broadly classified into categorical and numerical types.
2.3.1 Categorical Data (Qualitative Variables)
Identifies group membership.
Meaningful mathematical operations cannot be performed on it.
Examples: Gender (F, M), Board (State Board, ICSE, CBSE).
2.3.2 Numerical Data (Quantitative Variables)
Requires common measurement units (e.g., weights in kilograms, prices in rupees).
Example: Marks obtained by students.
2.3.3 Time-series and Cross-sectional Data
Time-series Data: Recorded over a period of time for a single entity.
Example: Temperature in Delhi for seven different days.
Cross-sectional Data: Observed at the same time for several entities.
Example: Temperature of Delhi, Chennai, Jaipur, and Bhopal on a particular day.
2.3.4 Scales of Measurement
There are four scales of measurement: nominal, ordinal, interval, and ratio.
Nominal Scale: Data consists of labels or names used to identify characteristics. No
inherent order or ranking. Can be numerically coded, but the numbers have no mathematical meaning beyond
identification. Examples: Name, Board, Gender, Blood group, Hair color, Brand name of mobile phone,
Number plate of cars.
Ordinal Scale: Possesses properties of nominal data, but the order or rank of data is
meaningful. Differences between categories are not necessarily equal or meaningful. Example: Customer
service ratings (excellent, good, poor) can be ordered, but the difference between "excellent" and "good"
cannot be quantified or assumed equal to the difference between "good" and "poor".
Interval Scale: Has all properties of ordinal data, and the interval between values is
expressed in terms of a fixed unit of measure. Always numeric, and differences between values are
meaningful. Ratios of values have no meaning because the value of zero is arbitrary (no absolute
zero). Example: Temperature in Celsius or Fahrenheit. A 20°C difference is meaningful, but 40°C is not
"twice as hot" as 20°C.
Ratio Scale: Has all properties of interval data, and the ratio of two values is
meaningful. Possesses an absolute zero property, meaning zero indicates the complete absence of the
characteristic. Allows for addition, subtraction, multiplication, and division. Examples: Height (in cm),
Weight (in kg), Marks.
3. Describing Categorical Data
3.1 Frequency Distribution
A frequency distribution for qualitative data lists distinct values and their frequencies. Each row in a
frequency table shows a category and the number of cases in that category.
3.2 Relative Frequency
The relative frequency is the ratio of a category's frequency to the total number of observations. Because it
always lies between 0 and 1, it provides a standard for comparison.
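As a minimal sketch of Sections 3.1 and 3.2, the following Python snippet builds a frequency table and the matching relative frequencies. The list `boards` is made-up illustration data, not taken from any dataset in this document.

```python
from collections import Counter

# Hypothetical sample: school board recorded for ten students.
boards = ["CBSE", "ICSE", "State Board", "CBSE", "CBSE",
          "State Board", "ICSE", "CBSE", "State Board", "CBSE"]

# Frequency distribution: each category with its number of cases.
frequency = Counter(boards)

# Relative frequency: category frequency divided by total observations.
total = len(boards)
relative_frequency = {cat: count / total for cat, count in frequency.items()}

for cat in frequency:
    print(f"{cat:12s} {frequency[cat]:3d} {relative_frequency[cat]:.2f}")
```

The relative frequencies of all categories add up to 1, which is a quick check that no case was dropped or counted twice.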
3.3 Charts of Categorical Data
The most common displays for categorical variables are bar charts and pie charts.
Pie Chart: A circle divided into pieces proportional to the relative frequencies of
qualitative data. Best for comparing parts of a whole, especially when one category makes up more than half.
Bar Chart: Displays distinct values on a horizontal axis and frequencies/relative
frequencies on a vertical axis. Bars represent frequency/relative frequency, with height proportional to the
value. Bars should not touch. Appropriate for representing counts of categories and comparing different
groups. For ordinal variables, the bar chart must preserve the ordering.
Pareto Chart: A bar chart where categories are sorted by frequency in descending order.
Popular in quality control to identify problems.
Stacked Bar Chart: Represents counts for a category, with each bar further segmented to
show frequencies of subcategories within it.
100% Stacked Bar Chart: Useful for visualizing part-to-whole relationships by showing
proportions within each category.
3.4 The Area Principle and Misleading Graphs
The Area Principle states that the area occupied by a part of a graph should correspond to the amount of data it
represents. Violations of this principle are common ways to mislead with statistics.
Decorated Graphs: Charts with unnecessary decorations can violate the area principle by distorting visual
representation.
Truncated Graphs: Bar charts where the baseline is not at zero can exaggerate differences.
Manipulated Y-axis: Expanding or compressing the y-axis scale can make changes in data seem less or more
significant than they are.
Round-off Errors: Occur when percentages or proportions in tables are rounded, causing the total sum to
slightly differ from 100% or 1, which can lead to inaccuracies in pie charts.
3.5 Summarizing Categorical Data
Descriptive measures are quantities that summarize a dataset. Measures of central tendency indicate the center or
most typical value.
Mode: The most common category in a categorical variable (highest frequency). Corresponds
to the longest bar in a bar chart, the largest slice in a pie chart, and the first category in a Pareto
chart. Bimodal/Multimodal Data: The data are bimodal if two categories tie for the highest frequency, and
multimodal if more than two do. The mode can be defined for both nominal and ordinal data.
Median: For an ordinal variable, the median is the category of the middle observation after sorting
the values. If the number of observations is even, the category on either side of the middle can be chosen as
the median. Among categorical data, the median can be defined only for ordinal data.
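The mode and the ordinal median can be sketched in a few lines of Python. The ratings and their ordering (`order`) are hypothetical; the example uses an odd number of observations so that there is a single middle value.

```python
from collections import Counter

# Hypothetical ordinal data: ratings ordered poor < good < excellent.
order = ["poor", "good", "excellent"]
ratings = ["good", "excellent", "poor", "good", "good", "excellent", "poor"]

# Mode: every category that attains the highest frequency
# (more than one entry here would mean bimodal or multimodal data).
counts = Counter(ratings)
top = max(counts.values())
modes = [cat for cat in order if counts[cat] == top]

# Median: sort observations by rank, then take the middle one.
ranked = sorted(ratings, key=order.index)
median = ranked[len(ranked) // 2]

print(modes)   # ['good']
print(median)  # good
```

Sorting by `order.index` only makes sense because the categories have a meaningful order; for nominal data the same code would produce an arbitrary "median".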
4. Describing Numerical Data
4.1 Types of Variables
Discrete Variable: Involves a count of something. Examples: Number of people in a
household, number of spelling mistakes.
Continuous Variable: Involves a measurement of something. Examples: Weight of a person,
height, speed.
4.2 Organizing Numerical Data
Numerical data can be organized by grouping observations into classes (categories or bins) and then constructing
frequency and relative-frequency distributions.
Organizing Discrete Data (single value): If there are only a few distinct values, use a frequency table where
each class is a distinct value.
Organizing Continuous Data: Group the observations into classes, typically 5 to 20 of them. Each observation
must belong to exactly one class. Class intervals are often of equal length.
Terminology:
Lower Class Limit: Smallest value in a class
Upper Class Limit: Largest value in a class
Class Width: Difference between lower limits of consecutive classes
Class Mark: Average of the two class limits
A class interval usually contains its left-end but not its right-end boundary point.
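The grouping described above might be sketched as follows. The weights, the starting point `lower`, the class width `width`, and the number of classes are all assumed values chosen for illustration; each class contains its left end but not its right end, and the class mark is taken as the midpoint of the interval.

```python
# Hypothetical weights in kg.
weights = [41.5, 47.0, 52.3, 55.0, 58.9, 60.0, 63.4, 68.2, 69.9, 74.1]

lower, width, n_classes = 40, 8, 5  # classes [40,48), [48,56), ..., [72,80)

# Each observation falls in exactly one class: floor((w - lower) / width).
frequency = [0] * n_classes
for w in weights:
    k = int((w - lower) // width)
    frequency[k] += 1

total = len(weights)
for k, f in enumerate(frequency):
    lo = lower + k * width
    hi = lo + width
    mark = (lo + hi) / 2  # class mark: midpoint of the interval
    print(f"[{lo}, {hi})  mark={mark}  freq={f}  rel={f / total:.2f}")
```

A value equal to a boundary, such as 48.0, would land in the class that starts at 48, which matches the left-end convention.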
4.3 Stem-and-leaf Diagram (Stemplot)
A stem-and-leaf diagram separates each observation into a stem (all but the rightmost digit) and a leaf (the
rightmost digit).
Construction Steps:
Separate observations into stem and leaf.
Write stems vertically, smallest to largest, to the left of a vertical rule.
Write each leaf to the right of the rule in its appropriate stem row.
Arrange leaves in each row in ascending order.
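The construction steps above can be sketched in Python for two-digit integers. The scores are hypothetical; stems that have no observations are simply omitted in this sketch.

```python
from collections import defaultdict

# Hypothetical test scores (two-digit integers).
scores = [15, 22, 29, 36, 31, 23, 45, 10, 25, 28, 48]

# Step 1: split each observation into stem (all but the last digit)
# and leaf (the last digit).
rows = defaultdict(list)
for x in scores:
    stem, leaf = divmod(x, 10)
    rows[stem].append(leaf)

# Steps 2-4: stems in ascending order, leaves ascending within each row.
stemplot = {stem: sorted(rows[stem]) for stem in sorted(rows)}

for stem, leaves in stemplot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```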
4.4 Descriptive Measures for Numerical Data
4.4.1 Measures of Central Tendency
Mean:
The sum of observations divided by the number of observations; commonly referred to as the average.
Units: Expressed in the same units as the original data.
Effect of Constants on the Standard Deviation (sd):
Adding \(c\) to every observation: No change in sd
Multiplying every observation by \(c\): New sd = \( |c| \times\) Old sd
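The effect of constants on the standard deviation (sd) can be checked numerically. This sketch uses the population sd from Python's `statistics.pstdev`; the same two properties hold for the sample sd (`statistics.stdev`). The data and the constant `c` are illustrative only.

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
c = -3.0  # a negative constant shows why the absolute value |c| is needed

sd = statistics.pstdev(data)
sd_shifted = statistics.pstdev([x + c for x in data])  # add c to every value
sd_scaled = statistics.pstdev([x * c for x in data])   # multiply every value by c

print(sd, sd_shifted, sd_scaled)
```

Shifting moves every observation and the mean by the same amount, so the deviations from the mean, and hence the sd, are unchanged. Scaling stretches every deviation by \(|c|\).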
4.5 Percentiles
The sample \(100p\) percentile is the data value such that at least \(100p\%\) of the data are ≤ it, and at least \(100(1-p)\%\) are ≥ it.
Computing Percentiles:
Arrange data in increasing order.
Calculate \(np\).
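If \(np\) is not an integer → the percentile is the value at position \(\lceil np \rceil\), the smallest integer greater than \(np\).
If \(np\) is an integer → the percentile is the average of the values at positions \(np\) and \(np+1\).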
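A minimal Python sketch of this rule follows. The function name `percentile` and the data are hypothetical, positions are 1-based as in the text, and the function assumes \(0 < p < 1\). Other software often uses different interpolation rules, so results may differ slightly from library defaults.

```python
import math

def percentile(data, p):
    """Sample 100p percentile using the np rule (assumes 0 < p < 1)."""
    values = sorted(data)          # step 1: arrange in increasing order
    position = len(values) * p     # step 2: compute np
    k = round(position)
    if math.isclose(position, k):
        # np is an integer: average the values at positions np and np + 1.
        return (values[k - 1] + values[k]) / 2
    # np is not an integer: value at the smallest integer greater than np.
    return values[math.ceil(position) - 1]

data = [35, 12, 48, 27, 19, 41, 23, 30]   # n = 8
print(percentile(data, 0.25))  # np = 2   -> (19 + 23) / 2 = 21.0
print(percentile(data, 0.90))  # np = 7.2 -> 8th value = 48
```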
4.6 Quartiles
Quartiles divide a dataset into four equal parts using three values.
First Quartile (Q1): 25th percentile
Second Quartile (Q2): 50th percentile = Median
Third Quartile (Q3): 75th percentile
4.7 Five Number Summary
Minimum – Q1 – Median – Q3 – Maximum
4.8 Interquartile Range (IQR)
IQR = Q3 − Q1
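Quartiles, the five number summary, and the IQR follow directly from the percentile rule in Section 4.5. The sketch below restates that rule in a small helper (a hypothetical name, assuming sorted input and \(0 < p < 1\)) and applies it to made-up data.

```python
import math

def percentile(values_sorted, p):
    # Same np rule as in Section 4.5; input must already be sorted.
    position = len(values_sorted) * p
    k = round(position)
    if math.isclose(position, k):
        return (values_sorted[k - 1] + values_sorted[k]) / 2
    return values_sorted[math.ceil(position) - 1]

data = sorted([35, 12, 48, 27, 19, 41, 23, 30])

q1, median, q3 = (percentile(data, p) for p in (0.25, 0.50, 0.75))
five_number_summary = (data[0], q1, median, q3, data[-1])
iqr = q3 - q1

print(five_number_summary)  # (12, 21.0, 28.5, 38.0, 48)
print(iqr)                  # 17.0
```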
5. Association Between Two Variables
Association between two variables means that knowing information about one variable provides information about
the other.
5.1 Association Between Two Categorical Variables
To find associations, a contingency table is used. Rule: If row relative frequencies (or column relative
frequencies) are the same for all rows (or columns), the variables are not associated. If they differ, the
variables are associated.
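The rule can be illustrated with a small, made-up contingency table. The categories, counts, and the function name are hypothetical. With real sample data the row relative frequencies are rarely exactly equal, so in practice one judges whether they are close rather than identical.

```python
import math

# Hypothetical contingency table: rows = gender, columns = phone type.
table = {
    "Female": {"Smartphone": 30, "Basic phone": 20},
    "Male":   {"Smartphone": 45, "Basic phone": 30},
}

def row_relative_frequencies(table):
    """Divide each cell by its row total."""
    result = {}
    for row, counts in table.items():
        total = sum(counts.values())
        result[row] = {col: n / total for col, n in counts.items()}
    return result

rel = row_relative_frequencies(table)

# Associated if any row's relative frequencies differ from the first row's.
rows = list(rel.values())
associated = any(
    not math.isclose(rows[0][col], r[col]) for r in rows[1:] for col in rows[0]
)

print(rel)
print("associated:", associated)  # both rows are 0.6 / 0.4, so False
```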
5.2 Association Between Two Numerical Variables
A scatter plot is a visual test for association between two numerical variables, displaying pairs of values as
points on a two-dimensional plane.
Describing Association: Direction (upward, downward, no trend), Curvature (linear, curved), Variation (tightly
clustered or not), Outliers.
5.2.1 Measures of Association for Numerical Variables
Covariance:
Quantifies the strength of the linear association between two numerical variables.
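The formula is not stated in this section, so the sketch below assumes the usual sample covariance, which divides the sum of products of deviations by \(n - 1\) (the population version divides by \(n\)). The function name and the data are hypothetical. A positive result indicates an upward trend, a negative one a downward trend.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum of (x - mean_x)(y - mean_y), divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data: hours studied and marks obtained by five students.
hours = [1, 2, 3, 4, 5]
marks = [52, 58, 61, 70, 74]

cov = sample_covariance(hours, marks)
print(cov)  # deviations multiply to 22 + 5 + 0 + 7 + 22 = 56; 56 / 4 = 14.0
```

Note that the value depends on the units of both variables, so covariances from different pairs of variables are not directly comparable.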
5.3 Association Between Categorical and Numerical Variables
Point Bi-serial Correlation Coefficient (\(r_{pb}\)): Measures association between a numerical
variable (X) and a dichotomous categorical variable (Y, e.g. 0 or 1).
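The section does not give the formula, so this sketch assumes the common textbook form \(r_{pb} = \frac{\bar{X}_1 - \bar{X}_0}{s_X}\sqrt{p_0 p_1}\), where \(\bar{X}_1\) and \(\bar{X}_0\) are the group means, \(s_X\) is the population standard deviation of all X values, and \(p_0, p_1\) are the group proportions. The function name and the data are hypothetical.

```python
import math

def point_biserial(xs, ys):
    """Point bi-serial correlation between numeric xs and 0/1 coded ys."""
    n = len(xs)
    group1 = [x for x, y in zip(xs, ys) if y == 1]
    group0 = [x for x, y in zip(xs, ys) if y == 0]
    mean1 = sum(group1) / len(group1)
    mean0 = sum(group0) / len(group0)
    mean_all = sum(xs) / n
    # Population standard deviation of all X values (divide by n).
    sd_all = math.sqrt(sum((x - mean_all) ** 2 for x in xs) / n)
    p1, p0 = len(group1) / n, len(group0) / n
    return (mean1 - mean0) / sd_all * math.sqrt(p0 * p1)

# Hypothetical data: marks (X) and whether the student attended tutoring (Y).
marks = [50, 60, 70, 70, 80, 90]
attended = [0, 0, 0, 1, 1, 1]

r_pb = point_biserial(marks, attended)
print(round(r_pb, 4))
```

With this form, the result equals the ordinary correlation coefficient computed between X and the 0/1 coded Y.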
6. Basic Principle of Counting
6.1 Addition Rule of Counting
If action A can occur in \(n_1\) ways and action B in \(n_2\) ways, and A and B cannot occur simultaneously, then total ways = \(n_1 + n_2\).
Example: Choosing one item (a shirt or a pant) from 4 shirts and 3 pants → 4 + 3 = 7 choices.
6.2 Multiplication Rule of Counting
If action A can occur in \(n_1\) ways and action B can occur in \(n_2\) ways, then A and B can occur together in \(n_1 \times n_2\) ways.
Example: Choosing one shirt (4) AND one pant (3) → 4 × 3 = 12 combinations.
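Both counting rules can be verified by listing the choices explicitly. The item labels below are placeholders for the 4 shirts and 3 pants in the examples.

```python
from itertools import product

shirts = ["S1", "S2", "S3", "S4"]
pants = ["P1", "P2", "P3"]

# Addition rule: pick exactly one item, either a shirt OR a pant.
single_item_choices = shirts + pants

# Multiplication rule: pick one shirt AND one pant.
outfits = list(product(shirts, pants))

print(len(single_item_choices))  # 4 + 3 = 7
print(len(outfits))              # 4 * 3 = 12
```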
7. Factorial
\(n! = n \times (n-1) \times \cdots \times 1\)
By convention, \(0! = 1\).
Recursive form: \(n! = n \times (n-1)!\)
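The recursive form translates directly into code. This is an illustrative sketch; in practice Python's built-in `math.factorial` does the same job.

```python
import math

def factorial(n):
    """n! via the recursion n! = n * (n-1)!, with 0! = 1 as the base case."""
    if n < 0:
        raise ValueError("factorial is not defined for negative integers")
    return 1 if n == 0 else n * factorial(n - 1)

print(factorial(0))  # 1
print(factorial(5))  # 5 * 4 * 3 * 2 * 1 = 120
print(factorial(5) == math.factorial(5))  # True
```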
8. Permutation
A permutation is an ordered arrangement of all or some of n distinct objects. Order matters.
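To make the definition concrete, the sketch below lists every ordered arrangement of 2 objects chosen from 3 distinct objects. The counting formula \(n!/(n-r)!\) is the standard result for this situation and is not stated in the text above; the object labels are placeholders.

```python
from itertools import permutations
from math import factorial

objects = ["A", "B", "C"]
r = 2  # number of objects arranged at a time

# Every ordered arrangement of r objects; ("A", "B") and ("B", "A") are distinct.
arrangements = list(permutations(objects, r))

# Standard count of such arrangements: n! / (n - r)!
n = len(objects)
count_by_formula = factorial(n) // factorial(n - r)

print(arrangements)
print(len(arrangements), count_by_formula)  # 6 6
```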