{
"cells": [
{
"cell_type": "markdown",
"id": "25d7736c-ba17-4aff-b6bb-66eba20fbf4e",
"metadata": {
"id": "25d7736c-ba17-4aff-b6bb-66eba20fbf4e"
},
"source": [
"# Lab | Data Structuring and Combining Data"
]
},
{
"cell_type": "markdown",
"id": "a2cdfc70-44c8-478c-81e7-2bc43fdf4986",
"metadata": {
"id": "a2cdfc70-44c8-478c-81e7-2bc43fdf4986"
},
"source": [
"## Challenge 1: Combining & Cleaning Data\n",
"\n",
"In this challenge, we will be working with the customer data from an insurance company, as we did in the two previous labs. The data can be found here:\n",
"- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/file1.csv\n",
"\n",
"But this time we also have new data, which can be found in the two CSV files below.\n",
"\n",
"- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/file2.csv\n",
"- https://raw.githubusercontent.com/data-bootcamp-v4/data/main/file3.csv\n",
"\n",
"Note that you'll need to clean and format the new data.\n",
"\n",
"Observation:\n",
"- One option is to first combine the three datasets and then apply the cleaning function to the new combined dataset\n",
"- Another option would be to read the clean file you saved in the previous lab, and just clean the two new files and concatenate the three clean datasets"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "492d06e3-92c7-4105-ac72-536db98d3244",
"metadata": {
"id": "492d06e3-92c7-4105-ac72-536db98d3244"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Shape of Combined Data: (8888, 11)\n",
"\n",
"Combined Data Info:\n",
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 8888 entries, 0 to 8887\n",
"Data columns (total 11 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 st 2060 non-null object \n",
" 1 gender 8888 non-null object \n",
" 2 education 8887 non-null object \n",
" 3 customer_lifetime_value 8880 non-null float64\n",
" 4 income 8887 non-null float64\n",
" 5 monthly_premium_auto 8887 non-null float64\n",
" 6 number_of_open_complaints 8887 non-null object \n",
" 7 policy_type 8887 non-null object \n",
" 8 vehicle_class 8887 non-null object \n",
" 9 total_claim_amount 8887 non-null float64\n",
" 10 state 6827 non-null object \n",
"dtypes: float64(4), object(7)\n",
"memory usage: 763.9+ KB\n",
"\n",
"First 5 rows of Combined Data:\n",
" st gender education customer_lifetime_value income \\\n",
"0 Washington <Na> Master NaN 0.0 \n",
"1 Arizona Female Bachelor 697953.59 0.0 \n",
"2 Nevada Female Bachelor 1288743.17 48767.0 \n",
"3 California Male Bachelor 764586.18 0.0 \n",
"4 Washington Male High School Or Below 536307.65 36357.0 \n",
"\n",
" monthly_premium_auto number_of_open_complaints policy_type \\\n",
"0 1000.0 1/0/00 Personal Auto \n",
"1 94.0 1/0/00 Personal Auto \n",
"2 108.0 1/0/00 Personal Auto \n",
"3 106.0 1/0/00 Corporate Auto \n",
"4 68.0 1/0/00 Personal Auto \n",
"\n",
" vehicle_class total_claim_amount state \n",
"0 Four-Door Car 2.704934 NaN \n",
"1 Four-Door Car 1131.464935 NaN \n",
"2 Two-Door Car 566.472247 NaN \n",
"3 Suv 529.881344 NaN \n",
"4 Four-Door Car 17.269323 NaN \n"
]
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"# Load Data \n",
"file_urls = [\n",
" \"https://raw.githubusercontent.com/data-bootcamp-v4/data/main/file1.csv\",\n",
" \"https://raw.githubusercontent.com/data-bootcamp-v4/data/main/file2.csv\",\n",
" \"https://raw.githubusercontent.com/data-bootcamp-v4/data/main/file3.csv\"\n",
"]\n",
"\n",
"# Load all three files into a list of DataFrames\n",
"dataframes = [pd.read_csv(url) for url in file_urls]\n",
"\n",
"# Define Cleaning Function\n",
"\n",
"def clean_data(df):\n",
" \"\"\"\n",
" Performs standard cleaning and formatting on the customer data.\n",
" \"\"\"\n",
" \n",
" # Standardize Column Names (convert to snake_case)\n",
" df.columns = df.columns.str.lower().str.replace(' ', '_', regex=False)\n",
" \n",
" # Drop identifier and leftover index columns (errors='ignore' skips files without them)\n",
" df = df.drop(columns=['unnamed:_0', 'customer'], errors='ignore')\n",
" \n",
" # Clean and Convert Numeric/Currency Columns\n",
" \n",
" # Clean 'customer_lifetime_value'\n",
" if 'customer_lifetime_value' in df.columns:\n",
" # Remove all non-numeric characters except the decimal point\n",
" df['customer_lifetime_value'] = (\n",
" df['customer_lifetime_value']\n",
" .astype(str)\n",
" .str.replace('[^0-9.]', '', regex=True)\n",
" )\n",
" # Convert to numeric, coercing errors to NaN\n",
" df['customer_lifetime_value'] = pd.to_numeric(df['customer_lifetime_value'], errors='coerce')\n",
"\n",
" # Clean 'income'\n",
" if 'income' in df.columns:\n",
" # Coerce to numeric; unparseable strings become NaN\n",
" df['income'] = pd.to_numeric(df['income'], errors='coerce')\n",
" \n",
" # Clean and Standardize String Columns\n",
" \n",
" # Standardize 'gender' (f/m to female/male)\n",
" if 'gender' in df.columns:\n",
" df['gender'] = (\n",
" df['gender']\n",
" .astype(str).str.lower().str.strip()\n",
" .replace({'f': 'female', 'm': 'male'}, regex=False)\n",
" .replace('nan', pd.NA) # Replace the string 'nan' with a proper NA value\n",
" )\n",
"\n",
" # Standardize remaining object (string) columns to Title Case (missing values stay missing)\n",
" for col in df.select_dtypes(include='object').columns:\n",
" df[col] = df[col].str.title()\n",
" \n",
" # Handle Missing Values and Duplicates\n",
" \n",
" # Drop rows that are entirely NaN\n",
" df = df.dropna(how='all')\n",
" \n",
" # Drop duplicate rows\n",
" df = df.drop_duplicates()\n",
"\n",
" return df\n",
"\n",
"# Clean and Combine Data \n",
"\n",
"# Apply the cleaning function to each DataFrame\n",
"clean_dataframes = [clean_data(df) for df in dataframes]\n",
"\n",
"# Concatenate all cleaned DataFrames into one\n",
"combined_data = pd.concat(clean_dataframes, ignore_index=True)\n",
"\n",
"# Verification \n",
"print(f\"Shape of Combined Data: {combined_data.shape}\")\n",
"print(\"\\nCombined Data Info:\")\n",
"combined_data.info()\n",
"print(\"\\nFirst 5 rows of Combined Data:\")\n",
"print(combined_data.head())"
]
},
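{
"cell_type": "markdown",
"id": "f3b1c2d4-1a2b-4c3d-8e4f-a1b2c3d4e5f6",
"metadata": {},
"source": [
"Note: the `info()` output above shows two leftover issues: the state column is split between `st` (files 1 and 2) and `state` (file 3), and `number_of_open_complaints` still holds strings like `1/0/00`, where the middle value is assumed to be the complaint count. A minimal sketch of one way to finish the cleanup:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f3b1c2d4-1a2b-4c3d-8e4f-a1b2c3d4e5f7",
"metadata": {},
"outputs": [],
"source": [
"# Merge the split state columns: prefer 'st', fall back to 'state'\n",
"combined_data['state'] = combined_data['st'].fillna(combined_data['state'])\n",
"combined_data = combined_data.drop(columns=['st'])\n",
"\n",
"# 'number_of_open_complaints' mixes 'x/y/z' strings with plain numbers;\n",
"# take the middle value when the slash format is present (assumed to be the count)\n",
"noc = combined_data['number_of_open_complaints'].astype(str)\n",
"combined_data['number_of_open_complaints'] = pd.to_numeric(\n",
"    noc.str.split('/').str[1].fillna(noc), errors='coerce'\n",
")"
]
},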
{
"cell_type": "markdown",
"id": "31b8a9e7-7db9-4604-991b-ef6771603e57",
"metadata": {
"id": "31b8a9e7-7db9-4604-991b-ef6771603e57"
},
"source": [
"# Challenge 2: Structuring Data"
]
},
{
"cell_type": "markdown",
"id": "a877fd6d-7a0c-46d2-9657-f25036e4ca4b",
"metadata": {
"id": "a877fd6d-7a0c-46d2-9657-f25036e4ca4b"
},
"source": [
"In this challenge, we will continue to work with customer data from an insurance company, but we will use a dataset with more columns, called marketing_customer_analysis_clean.csv, which can be found at the following link:\n",
"\n",
"https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis_clean.csv\n",
"\n",
"This dataset contains information such as customer demographics, policy details, vehicle information, and the customer's response to the last marketing campaign. Our goal is to explore and analyze this data by performing data cleaning, formatting, and structuring."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "aa10d9b0-1c27-4d3f-a8e4-db6ab73bfd26",
"metadata": {
"id": "aa10d9b0-1c27-4d3f-a8e4-db6ab73bfd26"
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"# Load the dataset\n",
"url = \"https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis_clean.csv\"\n",
"df = pd.read_csv(url)\n",
"\n",
"# Display a quick preview to confirm column names\n",
"# print(df.columns)\n",
"# print(df['Sales Channel'].unique())\n",
"\n",
"# Standardize column names for easier access (if necessary, though they look clean)\n",
"df.columns = df.columns.str.lower().str.replace(' ', '_')"
]
},
{
"cell_type": "markdown",
"id": "df35fd0d-513e-4e77-867e-429da10a9cc7",
"metadata": {
"id": "df35fd0d-513e-4e77-867e-429da10a9cc7"
},
"source": [
"1. You work at the marketing department and you want to know which sales channel brought in the most total revenue. Using a pivot table, create a summary showing the total revenue for each sales channel (agent, branch, call center, and web).\n",
"Round the total revenue to 2 decimal points. Analyze the resulting table to draw insights."
]
},
{
"cell_type": "markdown",
"id": "640993b2-a291-436c-a34d-a551144f8196",
"metadata": {
"id": "640993b2-a291-436c-a34d-a551144f8196"
},
"source": [
"2. Create a pivot table that shows the average customer lifetime value per gender and education level. Analyze the resulting table to draw insights."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "e71722bf-bc0c-4506-827f-6bd842d33e6c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total Revenue (CLV) by Sales Channel:\n",
" Total Revenue (CLV)\n",
"sales_channel \n",
"Agent 33057887.85\n",
"Branch 24359201.21\n",
"Call Center 17364288.37\n",
"Web 12697632.90\n"
]
}
],
"source": [
"# Create the pivot table for total revenue (CLV) by Sales Channel\n",
"revenue_by_channel = pd.pivot_table(\n",
" df, \n",
" values='customer_lifetime_value', \n",
" index='sales_channel', \n",
" aggfunc='sum'\n",
")\n",
"\n",
"# Rename the column and round to 2 decimal places\n",
"revenue_by_channel = revenue_by_channel.rename(\n",
" columns={'customer_lifetime_value': 'Total Revenue (CLV)'}\n",
").round(2)\n",
"\n",
"print(\"Total Revenue (CLV) by Sales Channel:\")\n",
"print(revenue_by_channel.sort_values(by='Total Revenue (CLV)', ascending=False))"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "e16ca0ed-e242-473d-a457-c3ca3d2286e1",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Average Customer Lifetime Value (CLV) by Gender and Education:\n",
"education Bachelor College Doctor High School or Below Master\n",
"gender \n",
"F 7874.27 7748.82 7328.51 8675.22 8157.05\n",
"M 7703.60 8052.46 7415.33 8149.69 8168.83\n"
]
}
],
"source": [
"# Create the pivot table for average CLV by Gender and Education\n",
"avg_clv_by_demographics = pd.pivot_table(\n",
" df, \n",
" values='customer_lifetime_value', \n",
" index='gender', \n",
" columns='education', \n",
" aggfunc='mean'\n",
").round(2)\n",
"\n",
"print(\"\\nAverage Customer Lifetime Value (CLV) by Gender and Education:\")\n",
"print(avg_clv_by_demographics)"
]
},
{
"cell_type": "markdown",
"id": "32c7f2e5-3d90-43e5-be33-9781b6069198",
"metadata": {
"id": "32c7f2e5-3d90-43e5-be33-9781b6069198"
},
"source": [
"## Bonus\n",
"\n",
"You work at the customer service department and you want to know which months had the highest number of complaints by policy type category. Create a summary table showing the number of complaints by policy type and month.\n",
"Show it in a long format table."
]
},
{
"cell_type": "markdown",
"id": "e3d09a8f-953c-448a-a5f8-2e5a8cca7291",
"metadata": {
"id": "e3d09a8f-953c-448a-a5f8-2e5a8cca7291"
},
"source": [
"*In data analysis, a long format table is a way of structuring data in which each observation or measurement is stored in a separate row of the table. The key characteristic of a long format table is that each column represents a single variable, and each row represents a single observation of that variable.*\n",
"\n",
"*More information about long and wide format tables here: https://www.statology.org/long-vs-wide-data/*"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3a069e0b-b400-470e-904d-d17582191be4",
"metadata": {
"id": "3a069e0b-b400-470e-904d-d17582191be4"
},
"outputs": [],
"source": [
"# Your code goes here"
]
}
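,
{
"cell_type": "markdown",
"id": "f3b1c2d4-1a2b-4c3d-8e4f-a1b2c3d4e5f8",
"metadata": {},
"source": [
"One possible sketch for the bonus, assuming the clean dataset provides a `month` column and a numeric `number_of_open_complaints`: grouping by policy type and month and then resetting the index yields the long format directly (one row per policy type/month observation)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f3b1c2d4-1a2b-4c3d-8e4f-a1b2c3d4e5f9",
"metadata": {},
"outputs": [],
"source": [
"# Sum complaints per policy type and month; reset_index returns a long-format table\n",
"complaints_long = (\n",
"    df.groupby(['policy_type', 'month'])['number_of_open_complaints']\n",
"    .sum()\n",
"    .reset_index(name='total_complaints')\n",
")\n",
"print(complaints_long.sort_values('total_complaints', ascending=False))"
]
}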
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}