Institution: Pontifical Catholic University of São Paulo (PUC-SP)
School: Faculty of Interdisciplinary Studies
Program: Humanistic AI and Data Science
Semester: 2nd Semester 2025
Professor: Daniel Rodrigues da Silva, Doctor in Mathematics
Important
- Projects and deliverables may be made publicly available whenever possible.
- The course emphasizes practical, hands-on experience with real datasets to simulate professional consulting scenarios in the fields of Data Analysis and Data Mining for partner organizations and institutions affiliated with the university.
- All activities comply with the academic and ethical guidelines of PUC-SP.
- Any content not authorized for public disclosure will remain confidential and securely stored in private repositories.
Tip
This repository is a review of the Statistics course from the undergraduate program Humanities, AI and Data Science at PUC-SP.
Access Data Mining Main Repository
If you'd like to explore the full materials from the 1st year (not only the review), you can visit the complete repository here.
This repository contains materials and examples for Class 1 of the Introduction to Data Mining with Python course, focusing on the fundamental statistical concepts and data analysis techniques that data mining applications rely on.
├── data/                 # Sample datasets
├── notebooks/            # Jupyter notebooks with examples
├── scripts/              # Python scripts for analysis
├── images/               # Generated plots and visualizations
└── docs/                 # Additional documentation
- Python 3.7+
- Required libraries: pandas, numpy, matplotlib, seaborn, scikit-learn
pip install pandas numpy matplotlib seaborn scikit-learn jupyter
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Load sample data
data = [50, 40, 41, 17, 11, 7, 22, 44, 28, 21, 19, 23, 37, 51, 54, 42, 86,
        41, 78, 56, 72, 56, 17, 7, 69, 30, 80, 56, 29, 33, 46, 31, 39, 20,
        18, 29, 34, 59, 73, 77, 36, 39, 30, 62, 54, 67, 39, 31, 53, 44]
# Create histogram
plt.figure(figsize=(10, 6))
plt.hist(data, bins=7, edgecolor='black')
plt.title('Internet Usage Distribution')
plt.xlabel('Minutes Online')
plt.ylabel('Frequency')
plt.show()
# Calculate statistics
print(f"Mean: {np.mean(data):.2f}")
print(f"Median: {np.median(data):.2f}")
print(f"Standard Deviation: {np.std(data):.2f}")
After completing this course, students will be able to:
- Construct and interpret frequency distributions from raw data
- Create various types of histograms and understand their relationship to frequency distributions
- Identify and handle outliers in datasets
- Analyze distribution shapes and their implications
- Calculate and interpret central tendency measures
- Apply statistical concepts to data mining problems
- Use Python tools for statistical analysis and visualization
- Outliers require careful consideration - they may represent valuable insights or data quality issues
- Histogram bins should be chosen thoughtfully - too few may hide patterns, too many may create noise
- Frequency distributions are fundamental to understanding data structure before applying advanced data mining techniques
- Visual analysis complements numerical statistics for comprehensive data understanding
This material is part of the Introduction to Data Mining with Python course, focusing on fundamental statistical concepts essential for effective data analysis and mining.
- Descriptive Statistics Review
- Data Mining Concepts
- Exploratory Data Analysis
- Predictive Analysis
- Clustering
- Association Rules
- Minimum 75% attendance required
- Final grade ≥ 5.0
- Formula: MF = (N₁ + N₂)/2, where Nᵢ = (Pᵢ + Aᵢ)/2
- Pᵢ = Project grade for semester i
- Aᵢ = Activity/exam grade for semester i
 
1. Frequency Distribution
A frequency distribution is a table that shows classes or intervals of data with a count of the number of entries in each class. It's fundamental for understanding data patterns and is the foundation for creating histograms.
- Class limits: Lower and upper boundaries of each class
- Class size: The width of each class interval
- Frequency (f): Number of data entries in each class
- Relative frequency: Proportion of data in each class (f/n)
- Cumulative frequency: Sum of frequencies up to a given class
- Decide the number of classes (typically 5-20)
- Calculate class size: (max - min) / number of classes, rounded up to the next convenient number
- Determine class limits
- Count frequencies for each class
- Calculate additional measures (relative, cumulative frequencies)
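The construction steps above can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference implementation: the function name `frequency_table` is hypothetical, the data are the 50 observations from the quick-start example in this README, and the sketch assumes integer data with the class width rounded up to a whole number.

```python
# Same 50 observations used in the quick-start example of this README.
data = [50, 40, 41, 17, 11, 7, 22, 44, 28, 21, 19, 23, 37, 51, 54, 42, 86,
        41, 78, 56, 72, 56, 17, 7, 69, 30, 80, 56, 29, 33, 46, 31, 39, 20,
        18, 29, 34, 59, 73, 77, 36, 39, 30, 62, 54, 67, 39, 31, 53, 44]


def frequency_table(values, num_classes):
    """Return rows of (lower, upper, f, relative f, cumulative f)."""
    low, high = min(values), max(values)
    # Assumption: integer data. The width is rounded up to the next whole
    # number so that the last class always reaches the maximum value.
    width = (high - low) // num_classes + 1
    n = len(values)
    rows, cumulative = [], 0
    for i in range(num_classes):
        lower = low + i * width
        upper = lower + width - 1  # inclusive upper class limit
        f = sum(1 for v in values if lower <= v <= upper)
        cumulative += f
        rows.append((lower, upper, f, f / n, cumulative))
    return rows


for lower, upper, f, rel, cum in frequency_table(data, num_classes=7):
    print(f"{lower:>2}-{upper:<2}  f={f:<2}  rel={rel:.2f}  cum={cum}")
```

With seven classes the width comes out as 12, which gives the classes 7-18 through 79-90.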
Histograms are directly related to frequency distributions: they are the graphical representation of frequency distribution tables.
- Bar chart representing frequency distribution
- Horizontal axis: Quantitative data values (class boundaries)
- Vertical axis: Frequencies or relative frequencies
- Consecutive bars must touch (unlike regular bar charts)
- Class boundaries: Numbers that separate classes without gaps
- Frequency Histogram: Shows absolute frequencies
- Relative Frequency Histogram: Shows proportions/percentages
- Frequency Polygon: Line graph emphasizing continuous change
Outliers are, by definition, few in number, and they can stem from various causes:
- Data entry errors (typing mistakes)
- Measurement errors
- Fraudulent activities
- Genuine extreme values
- Equipment malfunctions
Outliers tend to:
- Generate few bars (sparse representation)
- Create gaps in the distribution
- Skew the overall pattern
- Affect central tendency measures
- Require special handling in analysis
Signs of outliers in a histogram:
- Isolated bars far from the main distribution
- Large gaps between bars
- Extremely tall or short bars at distribution extremes
- Asymmetric patterns in otherwise normal distributions
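Visual inspection can be backed by a numerical rule. The sketch below uses the common 1.5 × IQR fence, which is an assumption of this example and is not prescribed by the course material. The function name `iqr_outliers` and the sample values are hypothetical, and quartiles are taken as the medians of the lower and upper halves of the sorted data.

```python
from statistics import median


def iqr_outliers(values):
    """Return the values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].

    Assumes at least two values.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])    # median of the lower half
    q3 = median(ordered[-half:])   # median of the upper half
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]


# Hypothetical session lengths in minutes; 400 mimics a data entry error.
sessions = [22, 25, 27, 30, 31, 33, 35, 38, 40, 400]
print(iqr_outliers(sessions))  # [400]
```

A flagged value is only a candidate: whether it is an error or a genuine extreme still has to be decided from context.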
Understanding distribution shapes helps identify data characteristics:
Symmetric Distribution
- Mean ≈ Median ≈ Mode
- Bell-shaped or uniform patterns
- Equal spread on both sides
Left-Skewed Distribution (Negatively Skewed)
- Mean < Median < Mode
- Tail extends to the left
- Few extremely low values
Right-Skewed Distribution (Positively Skewed)
- Mode < Median < Mean
- Tail extends to the right
- Few extremely high values
Uniform Distribution
- All classes have equal frequencies
- Rectangular shape in histogram
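The mean/median relationships listed above suggest a quick heuristic for labelling a sample. The sketch below is only a rough check under stated assumptions: the function name `describe_shape` and the 5% tolerance are hypothetical choices, and the result is not a formal skewness test.

```python
from statistics import mean, median


def describe_shape(values, tolerance=0.05):
    """Label a sample by comparing its mean and median.

    The gap between mean and median is judged relative to the range;
    the tolerance is an arbitrary threshold chosen for illustration.
    """
    m, med = mean(values), median(values)
    spread = max(values) - min(values)
    if spread == 0 or abs(m - med) <= tolerance * spread:
        return "approximately symmetric"
    return "right-skewed" if m > med else "left-skewed"


# Hypothetical samples
print(describe_shape([1, 2, 3, 4, 5]))                    # approximately symmetric
print(describe_shape([1, 2, 2, 3, 3, 3, 4, 20]))          # right-skewed
print(describe_shape([1, 17, 18, 18, 19, 19, 19, 20]))    # left-skewed
```

A histogram remains the better tool for the final judgment, since very different shapes can share the same mean and median.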
Mean
- Sum of all values divided by count
- Most affected by outliers
- Uses all data points
Median
- Middle value when data is ordered
- Less affected by outliers
- Robust measure
Mode
- Most frequently occurring value
- May not exist or may be multiple
- Good for categorical data
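To see how differently these measures react to an extreme value, the following sketch computes them before and after adding one outlier. The numbers are hypothetical, and the helper `modes` is written by hand here (rather than taken from a library) so that it also runs on Python 3.7.

```python
from collections import Counter
from statistics import mean, median


def modes(values):
    """Return every value tied for the highest count.

    If each value occurs once, all values tie, which corresponds to the
    "no mode" case.
    """
    counts = Counter(values)
    top = max(counts.values())
    return [v for v, c in counts.items() if c == top]


salaries = [30, 32, 32, 35, 38]     # hypothetical values
with_outlier = salaries + [300]     # one extreme value added

for label, sample in [("original", salaries), ("with outlier", with_outlier)]:
    print(f"{label}: mean={mean(sample):.2f}, "
          f"median={median(sample)}, mode(s)={modes(sample)}")
```

The mean jumps from 33.40 to 77.83, while the median only moves from 32 to 33.5 and the mode stays at 32.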
- Pattern Recognition: Identifying data distributions
- Anomaly Detection: Finding outliers
- Data Quality Assessment: Checking for errors
- Feature Engineering: Understanding variable distributions
- Model Selection: Choosing appropriate algorithms based on data distribution
import matplotlib.pyplot as plt
import numpy as np
# Create frequency distribution
def create_frequency_distribution(data, num_classes=7):
    min_val, max_val = min(data), max(data)
    class_size = (max_val - min_val) / num_classes
    
    # Define class boundaries
    boundaries = [min_val + i * class_size for i in range(num_classes + 1)]
    
    # Count frequencies
    frequencies = []
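    for i in range(num_classes):
        # Classes are half-open [lower, upper); the last one also includes
        # the upper boundary so the maximum value is counted.
        is_last = i == num_classes - 1
        count = sum(1 for x in data
                    if boundaries[i] <= x and (x < boundaries[i+1] or is_last))
        frequencies.append(count)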
    
    return boundaries, frequencies
# Create histogram
def plot_histogram(data, title="Frequency Distribution"):
    plt.figure(figsize=(10, 6))
    plt.hist(data, bins=7, edgecolor='black', alpha=0.7)
    plt.title(title)
    plt.xlabel('Values')
    plt.ylabel('Frequency')
    plt.grid(True, alpha=0.3)
    plt.show()
Example 1 - Finding the Mean of a Frequency Distribution
| In Words | In Symbols | 
|---|---|
| 1. Find the midpoint of each class. | $ x = \frac{\text{lower limit} + \text{upper limit}}{2} $ | 
| 2. Multiply each midpoint by its class frequency and sum the results. | $ \sum (x \cdot f) $ | 
| 3. Find the sum of all frequencies. | $ n = \sum f $ | 
| 4. Calculate the mean by dividing the sum from step 2 by step 3. | $ \bar{x} = \frac{\sum (x \cdot f)}{n} $ | 
Example: Finding the Mean of a Frequency Distribution
Use the frequency distribution below to approximate the average number of minutes that a sample of internet users spent connected in their last session.
| Class | Midpoint ($x$) | Frequency ($f$) |
|---|---|---|
| 7 – 18 | 12.5 | 6 |
| 19 – 30 | 24.5 | 10 |
| 31 – 42 | 36.5 | 13 |
| 43 – 54 | 48.5 | 8 |
| 55 – 66 | 60.5 | 5 |
| 67 – 78 | 72.5 | 6 |
| 79 – 90 | 84.5 | 2 |
| Class | Midpoint ($x$) | Frequency ($f$) | $x \cdot f$ |
|---|---|---|---|
| 7 – 18 | 12.5 | 6 | 75.0 |
| 19 – 30 | 24.5 | 10 | 245.0 |
| 31 – 42 | 36.5 | 13 | 474.5 |
| 43 – 54 | 48.5 | 8 | 388.0 |
| 55 – 66 | 60.5 | 5 | 302.5 |
| 67 – 78 | 72.5 | 6 | 435.0 |
| 79 – 90 | 84.5 | 2 | 169.0 |
| Total | | $n = 50$ | $\sum (x \cdot f) = 2089.0$ |
$$\bar{x} = \frac{\sum (x \cdot f)}{n} = \frac{2089}{50} \approx 41.8 \text{ minutes}$$
Symmetric Distribution
- A vertical line can be drawn at the middle of the graph, and the halves are nearly identical.
Uniform Distribution (Rectangular)
- All entries have equal or nearly equal frequencies.
- The distribution is symmetric.
Left-Skewed Distribution (Negatively Skewed)
- The "tail" of the graph extends more to the left.
- The mean is to the left of the median.
Right-Skewed Distribution (Positively Skewed)
- The "tail" of the graph extends more to the right.
- The mean is to the right of the median.
Sometimes the mean is calculated with a different weight for each value (a weighted mean). For example, suppose a course grade is composed of:
- 50% for the average of exams
- 15% for the midterm exam
- 20% for the final exam
- 10% for computer lab work
- 5% for homework
A student's grades are:
- Exam average: 86
- Midterm: 96
- Final Exam: 82
- Lab: 98
- Homework: 100
| Source | Grade ($x$) | Weight ($w$) | $x \cdot w$ |
|---|---|---|---|
| Exam Average | 86 | 0.50 | 43.0 |
| Midterm | 96 | 0.15 | 14.4 |
| Final Exam | 82 | 0.20 | 16.4 |
| Lab | 98 | 0.10 | 9.8 |
| Homework | 100 | 0.05 | 5.0 |
| Sum | | $\sum w = 1$ | $\sum (x \cdot w) = 88.6$ |
$$\bar{x} = \frac{\sum (x \cdot w)}{\sum w} = \frac{88.6}{1} = 88.6$$
So, the student did not get an A (minimum required is 90).
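The weighted mean from this example can be reproduced with a short function. This is a minimal sketch: the name `weighted_mean` is a hypothetical helper, and the grades and weights are the ones from the table above.

```python
def weighted_mean(values, weights):
    """Compute sum(x * w) / sum(w)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(x * w for x, w in zip(values, weights)) / total_weight


# Exam average, midterm, final exam, lab, homework
grades = [86, 96, 82, 98, 100]
weights = [0.50, 0.15, 0.20, 0.10, 0.05]

score = weighted_mean(grades, weights)
print(f"Weighted mean: {score:.1f}")        # Weighted mean: 88.6
print(f"Reaches an A (>= 90): {score >= 90}")
```

Dividing by the sum of the weights keeps the function correct even when the weights are given as percentages or raw counts rather than fractions that add up to 1.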
The mean of a frequency distribution is calculated as:
$$\bar{x} = \frac{\sum (x \cdot f)}{n}$$
where $x$ is the class midpoint, $f$ is the frequency of the class, and $n = \sum f$.
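As a check on the formula, the sketch below recomputes the grouped mean for the internet usage example, taking the class limits and frequencies from the table earlier in this section. The variable names are illustrative only.

```python
# Class limits and frequencies from the internet usage example
classes = [(7, 18), (19, 30), (31, 42), (43, 54), (55, 66), (67, 78), (79, 90)]
frequencies = [6, 10, 13, 8, 5, 6, 2]

# Step 1: midpoint of each class
midpoints = [(lower + upper) / 2 for lower, upper in classes]

# Step 2: sum of midpoint * frequency
total = sum(x * f for x, f in zip(midpoints, frequencies))

# Step 3: sum of the frequencies
n = sum(frequencies)

# Step 4: approximate mean of the grouped data
grouped_mean = total / n

print(f"n = {n}, sum(x*f) = {total}, mean = {grouped_mean:.1f} minutes")
```

The result is an approximation, because every entry in a class is treated as if it sat at the class midpoint.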
Copyright 2025 Quantum Software Development. Code released under the MIT License.