{
"cells": [
{
"cell_type": "markdown",
"id": "25d7736c-ba17-4aff-b6bb-66eba20fbf4e",
"metadata": {
"id": "25d7736c-ba17-4aff-b6bb-66eba20fbf4e"
},
"source": [
"# Lab | Data Structuring and Combining Data"
]
},
{
"cell_type": "markdown",
"id": "a2cdfc70-44c8-478c-81e7-2bc43fdf4986",
"metadata": {
"id": "a2cdfc70-44c8-478c-81e7-2bc43fdf4986"
},
"source": [
"## Challenge 1: Combining & Cleaning Data\n",
"\n",
"In this challenge, we will be working with the customer data from an insurance company, as we did in the two previous labs. The data can be found here:\n",
"- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/file1.csv\n",
"\n",
"But this time we also have new data, which can be found in the two CSV files below.\n",
"\n",
"- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/file2.csv\n",
"- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/file3.csv\n",
"\n",
"Note that you'll need to clean and format the new data.\n",
"\n",
"Observation:\n",
"- One option is to first combine the three datasets and then apply the cleaning function to the new combined dataset\n",
"- Another option would be to read the clean file you saved in the previous lab, and just clean the two new files and concatenate the three clean datasets"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "492d06e3-92c7-4105-ac72-536db98d3244",
"metadata": {
"id": "492d06e3-92c7-4105-ac72-536db98d3244"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Shape of Combined Data: (8888, 11)\n",
"\n",
"Combined Data Info:\n",
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 8888 entries, 0 to 8887\n",
"Data columns (total 11 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 st 2060 non-null object \n",
" 1 gender 8888 non-null object \n",
" 2 education 8887 non-null object \n",
" 3 customer_lifetime_value 8880 non-null float64\n",
" 4 income 8887 non-null float64\n",
" 5 monthly_premium_auto 8887 non-null float64\n",
" 6 number_of_open_complaints 8887 non-null object \n",
" 7 policy_type 8887 non-null object \n",
" 8 vehicle_class 8887 non-null object \n",
" 9 total_claim_amount 8887 non-null float64\n",
" 10 state 6827 non-null object \n",
"dtypes: float64(4), object(7)\n",
"memory usage: 763.9+ KB\n",
"\n",
"First 5 rows of Combined Data:\n",
" st gender education customer_lifetime_value income \\\n",
"0 Washington <Na> Master NaN 0.0 \n",
"1 Arizona Female Bachelor 697953.59 0.0 \n",
"2 Nevada Female Bachelor 1288743.17 48767.0 \n",
"3 California Male Bachelor 764586.18 0.0 \n",
"4 Washington Male High School Or Below 536307.65 36357.0 \n",
"\n",
" monthly_premium_auto number_of_open_complaints policy_type \\\n",
"0 1000.0 1/0/00 Personal Auto \n",
"1 94.0 1/0/00 Personal Auto \n",
"2 108.0 1/0/00 Personal Auto \n",
"3 106.0 1/0/00 Corporate Auto \n",
"4 68.0 1/0/00 Personal Auto \n",
"\n",
" vehicle_class total_claim_amount state \n",
"0 Four-Door Car 2.704934 NaN \n",
"1 Four-Door Car 1131.464935 NaN \n",
"2 Two-Door Car 566.472247 NaN \n",
"3 Suv 529.881344 NaN \n",
"4 Four-Door Car 17.269323 NaN \n"
]
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"# Load Data \n",
"file_urls = [\n",
" \"https://raw.githubusercontent.com/data-bootcamp-v4/data/main/file1.csv\",\n",
" \"https://raw.githubusercontent.com/data-bootcamp-v4/data/main/file2.csv\",\n",
" \"https://raw.githubusercontent.com/data-bootcamp-v4/data/main/file3.csv\"\n",
"]\n",
"\n",
"# Load all three files into a list of DataFrames\n",
"dataframes = [pd.read_csv(url) for url in file_urls]\n",
"\n",
"# Define Cleaning Function\n",
"\n",
"def clean_data(df):\n",
" \"\"\"\n",
" Performs standard cleaning and formatting on the customer data.\n",
" \"\"\"\n",
" \n",
" # Standardize Column Names (convert to snake_case)\n",
" df.columns = df.columns.str.lower().str.replace(' ', '_', regex=False)\n",
" \n",
" # Drop identifier and leftover index columns (errors='ignore' skips files without them)\n",
" df = df.drop(columns=['unnamed:_0', 'customer'], errors='ignore')\n",
" \n",
" # Clean and Convert Numeric/Currency Columns\n",
" \n",
" # Clean 'customer_lifetime_value'\n",
" if 'customer_lifetime_value' in df.columns:\n",
" # Remove all non-numeric characters except the decimal point\n",
" df['customer_lifetime_value'] = (\n",
" df['customer_lifetime_value']\n",
" .astype(str)\n",
" .str.replace('[^0-9.]', '', regex=True)\n",
" )\n",
" # Convert to numeric, coercing errors to NaN\n",
" df['customer_lifetime_value'] = pd.to_numeric(df['customer_lifetime_value'], errors='coerce')\n",
"\n",
" # Clean 'income'\n",
" if 'income' in df.columns:\n",
" # Coerce to numeric; unparseable strings become NaN\n",
" df['income'] = pd.to_numeric(df['income'], errors='coerce')\n",
" \n",
" # Clean and Standardize String Columns\n",
" \n",
" # Standardize 'gender' (f/m to female/male)\n",
" if 'gender' in df.columns:\n",
" df['gender'] = (\n",
" df['gender']\n",
" .astype(str).str.lower().str.strip()\n",
" .replace({'f': 'female', 'm': 'male'}, regex=False)\n",
" .replace('nan', pd.NA) # Replace the string 'nan' with a proper NA value\n",
" )\n",
"\n",
" # Standardize remaining object (string) columns to Title Case (missing values stay missing)\n",
" for col in df.select_dtypes(include='object').columns:\n",
" df[col] = df[col].str.title()\n",
" \n",
" # Handle Missing Values and Duplicates\n",
" \n",
" # Drop rows that are entirely NaN\n",
" df = df.dropna(how='all')\n",
" \n",
" # Drop duplicate rows\n",
" df = df.drop_duplicates()\n",
"\n",
" return df\n",
"\n",
"# Clean and Combine Data \n",
"\n",
"# Apply the cleaning function to each DataFrame\n",
"clean_dataframes = [clean_data(df) for df in dataframes]\n",
"\n",
"# Concatenate all cleaned DataFrames into one\n",
"combined_data = pd.concat(clean_dataframes, ignore_index=True)\n",
"\n",
"# Verification \n",
"print(f\"Shape of Combined Data: {combined_data.shape}\")\n",
"print(\"\\nCombined Data Info:\")\n",
"combined_data.info()\n",
"print(\"\\nFirst 5 rows of Combined Data:\")\n",
"print(combined_data.head())"
]
},
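{
"cell_type": "markdown",
"id": "f3b1c2d4-1a2b-4c3d-8e4f-a1b2c3d4e5f6",
"metadata": {},
"source": [
"Note: the `info()` output above shows two leftover issues: the state column is split between `st` (files 1 and 2) and `state` (file 3), and `number_of_open_complaints` still holds strings like `1/0/00`, where the middle value is assumed to be the complaint count. A minimal sketch of one way to finish the cleanup:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f3b1c2d4-1a2b-4c3d-8e4f-a1b2c3d4e5f7",
"metadata": {},
"outputs": [],
"source": [
"# Merge the split state columns: prefer 'st', fall back to 'state'\n",
"combined_data['state'] = combined_data['st'].fillna(combined_data['state'])\n",
"combined_data = combined_data.drop(columns=['st'])\n",
"\n",
"# 'number_of_open_complaints' mixes 'x/y/z' strings with plain numbers;\n",
"# take the middle value when the slash format is present (assumed to be the count)\n",
"noc = combined_data['number_of_open_complaints'].astype(str)\n",
"combined_data['number_of_open_complaints'] = pd.to_numeric(\n",
"    noc.str.split('/').str[1].fillna(noc), errors='coerce'\n",
")"
]
},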
{
"cell_type": "markdown",
"id": "31b8a9e7-7db9-4604-991b-ef6771603e57",
"metadata": {
"id": "31b8a9e7-7db9-4604-991b-ef6771603e57"
},
"source": [
"# Challenge 2: Structuring Data"
]
},
{
"cell_type": "markdown",
"id": "a877fd6d-7a0c-46d2-9657-f25036e4ca4b",
"metadata": {
"id": "a877fd6d-7a0c-46d2-9657-f25036e4ca4b"
},
"source": [
"In this challenge, we will continue to work with customer data from an insurance company, but we will use a dataset with more columns, called marketing_customer_analysis_clean.csv, which can be found at the following link:\n",
"\n",
"https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis_clean.csv\n",
"\n",
"This dataset contains information such as customer demographics, policy details, vehicle information, and the customer's response to the last marketing campaign. Our goal is to explore and analyze this data by performing data cleaning, formatting, and structuring."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "aa10d9b0-1c27-4d3f-a8e4-db6ab73bfd26",
"metadata": {
"id": "aa10d9b0-1c27-4d3f-a8e4-db6ab73bfd26"
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"# Load the dataset\n",
"url = \"https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis_clean.csv\"\n",
"df = pd.read_csv(url)\n",
"\n",
"# Display a quick preview to confirm column names\n",
"# print(df.columns)\n",
"# print(df['Sales Channel'].unique())\n",
"\n",
"# Standardize column names for easier access (if necessary, though they look clean)\n",
"df.columns = df.columns.str.lower().str.replace(' ', '_')"
]
},
{
"cell_type": "markdown",
"id": "df35fd0d-513e-4e77-867e-429da10a9cc7",
"metadata": {
"id": "df35fd0d-513e-4e77-867e-429da10a9cc7"
},
"source": [
"1. You work at the marketing department and you want to know which sales channel brought in the most total revenue. Using a pivot table, create a summary showing the total revenue for each sales channel (agent, branch, call center, and web).\n",
"Round the total revenue to 2 decimal points. Analyze the resulting table to draw insights."
]
},
{
"cell_type": "markdown",
"id": "640993b2-a291-436c-a34d-a551144f8196",
"metadata": {
"id": "640993b2-a291-436c-a34d-a551144f8196"
},
"source": [
"2. Create a pivot table that shows the average customer lifetime value per gender and education level. Analyze the resulting table to draw insights."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "e71722bf-bc0c-4506-827f-6bd842d33e6c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total Revenue (CLV) by Sales Channel:\n",
" Total Revenue (CLV)\n",
"sales_channel \n",
"Agent 33057887.85\n",
"Branch 24359201.21\n",
"Call Center 17364288.37\n",
"Web 12697632.90\n"
]
}
],
"source": [
"# Create the pivot table for total revenue (CLV) by Sales Channel\n",
"revenue_by_channel = pd.pivot_table(\n",
" df, \n",
" values='customer_lifetime_value', \n",
" index='sales_channel', \n",
" aggfunc='sum'\n",
")\n",
"\n",
"# Rename the column and round to 2 decimal places\n",
"revenue_by_channel = revenue_by_channel.rename(\n",
" columns={'customer_lifetime_value': 'Total Revenue (CLV)'}\n",
").round(2)\n",
"\n",
"print(\"Total Revenue (CLV) by Sales Channel:\")\n",
"print(revenue_by_channel.sort_values(by='Total Revenue (CLV)', ascending=False))"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "e16ca0ed-e242-473d-a457-c3ca3d2286e1",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Average Customer Lifetime Value (CLV) by Gender and Education:\n",
"education Bachelor College Doctor High School or Below Master\n",
"gender \n",
"F 7874.27 7748.82 7328.51 8675.22 8157.05\n",
"M 7703.60 8052.46 7415.33 8149.69 8168.83\n"
]
}
],
"source": [
"# Create the pivot table for average CLV by Gender and Education\n",
"avg_clv_by_demographics = pd.pivot_table(\n",
" df, \n",
" values='customer_lifetime_value', \n",
" index='gender', \n",
" columns='education', \n",
" aggfunc='mean'\n",
").round(2)\n",
"\n",
"print(\"\\nAverage Customer Lifetime Value (CLV) by Gender and Education:\")\n",
"print(avg_clv_by_demographics)"
]
},
{
"cell_type": "markdown",
"id": "32c7f2e5-3d90-43e5-be33-9781b6069198",
"metadata": {
"id": "32c7f2e5-3d90-43e5-be33-9781b6069198"
},
"source": [
"## Bonus\n",
"\n",
"You work at the customer service department and you want to know which months had the highest number of complaints by policy type category. Create a summary table showing the number of complaints by policy type and month.\n",
"Show it in a long format table."
]
},
{
"cell_type": "markdown",
"id": "e3d09a8f-953c-448a-a5f8-2e5a8cca7291",
"metadata": {
"id": "e3d09a8f-953c-448a-a5f8-2e5a8cca7291"
},
"source": [
"*In data analysis, a long format table is a way of structuring data in which each observation or measurement is stored in a separate row of the table. The key characteristic of a long format table is that each column represents a single variable, and each row represents a single observation of that variable.*\n",
"\n",
"*More information about long and wide format tables here: https://www.statology.org/long-vs-wide-data/*"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3a069e0b-b400-470e-904d-d17582191be4",
"metadata": {
"id": "3a069e0b-b400-470e-904d-d17582191be4"
},
"outputs": [],
"source": [
"# Your code goes here"
]
}
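,
{
"cell_type": "markdown",
"id": "f3b1c2d4-1a2b-4c3d-8e4f-a1b2c3d4e5f8",
"metadata": {},
"source": [
"One possible sketch for the bonus, assuming the clean dataset provides a `month` column and a numeric `number_of_open_complaints`: grouping by policy type and month and then resetting the index yields the long format directly (one row per policy type/month observation)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f3b1c2d4-1a2b-4c3d-8e4f-a1b2c3d4e5f9",
"metadata": {},
"outputs": [],
"source": [
"# Sum complaints per policy type and month; reset_index returns a long-format table\n",
"complaints_long = (\n",
"    df.groupby(['policy_type', 'month'])['number_of_open_complaints']\n",
"    .sum()\n",
"    .reset_index(name='total_complaints')\n",
")\n",
"print(complaints_long.sort_values('total_complaints', ascending=False))"
]
}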
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}