diff --git a/lab-dw-aggregating.ipynb b/lab-dw-aggregating.ipynb index fadd718..917a2f2 100644 --- a/lab-dw-aggregating.ipynb +++ b/lab-dw-aggregating.ipynb @@ -1,165 +1,366 @@ { - "cells": [ - { - "cell_type": "markdown", - "id": "31969215-2a90-4d8b-ac36-646a7ae13744", - "metadata": { - "id": "31969215-2a90-4d8b-ac36-646a7ae13744" - }, - "source": [ - "# Lab | Data Aggregation and Filtering" - ] - }, - { - "cell_type": "markdown", - "id": "a8f08a52-bec0-439b-99cc-11d3809d8b5d", - "metadata": { - "id": "a8f08a52-bec0-439b-99cc-11d3809d8b5d" - }, - "source": [ - "In this challenge, we will continue to work with customer data from an insurance company. We will use the dataset called marketing_customer_analysis.csv, which can be found at the following link:\n", - "\n", - "https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis.csv\n", - "\n", - "This dataset contains information such as customer demographics, policy details, vehicle information, and the customer's response to the last marketing campaign. Our goal is to explore and analyze this data by first performing data cleaning, formatting, and structuring." - ] - }, - { - "cell_type": "markdown", - "id": "9c98ddc5-b041-4c94-ada1-4dfee5c98e50", - "metadata": { - "id": "9c98ddc5-b041-4c94-ada1-4dfee5c98e50" - }, - "source": [ - "1. Create a new DataFrame that only includes customers who:\n", - " - have a **low total_claim_amount** (e.g., below $1,000),\n", - " - have a response \"Yes\" to the last marketing campaign." - ] - }, - { - "cell_type": "markdown", - "id": "b9be383e-5165-436e-80c8-57d4c757c8c3", - "metadata": { - "id": "b9be383e-5165-436e-80c8-57d4c757c8c3" - }, - "source": [ - "2. Using the original Dataframe, analyze:\n", - " - the average `monthly_premium` and/or customer lifetime value by `policy_type` and `gender` for customers who responded \"Yes\", and\n", - " - compare these insights to `total_claim_amount` patterns, and discuss which segments appear most profitable or low-risk for the company." - ] - }, - { - "cell_type": "markdown", - "id": "7050f4ac-53c5-4193-a3c0-8699b87196f0", - "metadata": { - "id": "7050f4ac-53c5-4193-a3c0-8699b87196f0" - }, - "source": [ - "3. Analyze the total number of customers who have policies in each state, and then filter the results to only include states where there are more than 500 customers." - ] - }, - { - "cell_type": "markdown", - "id": "b60a4443-a1a7-4bbf-b78e-9ccdf9895e0d", - "metadata": { - "id": "b60a4443-a1a7-4bbf-b78e-9ccdf9895e0d" - }, - "source": [ - "4. Find the maximum, minimum, and median customer lifetime value by education level and gender. Write your conclusions." - ] - }, - { - "cell_type": "markdown", - "id": "b42999f9-311f-481e-ae63-40a5577072c5", - "metadata": { - "id": "b42999f9-311f-481e-ae63-40a5577072c5" - }, - "source": [ - "## Bonus" - ] - }, - { - "cell_type": "markdown", - "id": "81ff02c5-6584-4f21-a358-b918697c6432", - "metadata": { - "id": "81ff02c5-6584-4f21-a358-b918697c6432" - }, - "source": [ - "5. The marketing team wants to analyze the number of policies sold by state and month. Present the data in a table where the months are arranged as columns and the states are arranged as rows." - ] - }, + "cells": [ + { + "cell_type": "markdown", + "id": "31969215-2a90-4d8b-ac36-646a7ae13744", + "metadata": { + "id": "31969215-2a90-4d8b-ac36-646a7ae13744" + }, + "source": [ + "# Lab | Data Aggregation and Filtering" + ] + }, + { + "cell_type": "markdown", + "id": "a8f08a52-bec0-439b-99cc-11d3809d8b5d", + "metadata": { + "id": "a8f08a52-bec0-439b-99cc-11d3809d8b5d" + }, + "source": [ + "In this challenge, we will continue to work with customer data from an insurance company. We will use the dataset called marketing_customer_analysis.csv, which can be found at the following link:\n", + "\n", + "https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis.csv\n", + "\n", + "This dataset contains information such as customer demographics, policy details, vehicle information, and the customer's response to the last marketing campaign. Our goal is to explore and analyze this data by first performing data cleaning, formatting, and structuring." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "5bd271c8-1839-4025-bf62-0ad53cb2b56c", + "metadata": {}, + "outputs": [ { - "cell_type": "markdown", - "id": "b6aec097-c633-4017-a125-e77a97259cda", - "metadata": { - "id": "b6aec097-c633-4017-a125-e77a97259cda" - }, - "source": [ - "6. Display a new DataFrame that contains the number of policies sold by month, by state, for the top 3 states with the highest number of policies sold.\n", - "\n", - "*Hint:*\n", - "- *To accomplish this, you will first need to group the data by state and month, then count the number of policies sold for each group. Afterwards, you will need to sort the data by the count of policies sold in descending order.*\n", - "- *Next, you will select the top 3 states with the highest number of policies sold.*\n", - "- *Finally, you will create a new DataFrame that contains the number of policies sold by month for each of the top 3 states.*" + "data": { + "text/plain": [ + "Index(['Unnamed: 0', 'Customer', 'State', 'Customer Lifetime Value',\n", + " 'Response', 'Coverage', 'Education', 'Effective To Date',\n", + " 'EmploymentStatus', 'Gender', 'Income', 'Location Code',\n", + " 'Marital Status', 'Monthly Premium Auto', 'Months Since Last Claim',\n", + " 'Months Since Policy Inception', 'Number of Open Complaints',\n", + " 'Number of Policies', 'Policy Type', 'Policy', 'Renew Offer Type',\n", + " 'Sales Channel', 'Total Claim Amount', 'Vehicle Class', 'Vehicle Size',\n", + " 'Vehicle Type'],\n", + " dtype='object')" ] - }, + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import pandas as pd\n", + "df = pd.read_csv(\"https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis.csv\")\n", + "df.columns" + ] + }, + { + "cell_type": "markdown", + "id": "9c98ddc5-b041-4c94-ada1-4dfee5c98e50", + "metadata": { + "id": "9c98ddc5-b041-4c94-ada1-4dfee5c98e50" + }, + "source": [ + "1. Create a new DataFrame that only includes customers who:\n", + " - have a **low total_claim_amount** (e.g., below $1,000),\n", + " - have a response \"Yes\" to the last marketing campaign." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "4fea8b87-c1c8-4442-8103-9ff6c7e18e9c", + "metadata": {}, + "outputs": [ { - "cell_type": "markdown", - "id": "ba975b8a-a2cf-4fbf-9f59-ebc381767009", - "metadata": { - "id": "ba975b8a-a2cf-4fbf-9f59-ebc381767009" - }, - "source": [ - "7. The marketing team wants to analyze the effect of different marketing channels on the customer response rate.\n", - "\n", - "Hint: You can use melt to unpivot the data and create a table that shows the customer response rate (those who responded \"Yes\") by marketing channel." - ] - }, + "name": "stdout", + "output_type": "stream", + "text": [ + " Unnamed: 0 Customer State Customer Lifetime Value Response \\\n", + "3 3 XL78013 Oregon 22332.439460 Yes \n", + "8 8 FM55990 California 5989.773931 Yes \n", + "15 15 CW49887 California 4626.801093 Yes \n", + "19 19 NJ54277 California 3746.751625 Yes \n", + "27 27 MQ68407 Oregon 4376.363592 Yes \n", + "... ... ... ... ... ... \n", + "10844 10844 FM31768 Arizona 5979.724161 Yes \n", + "10852 10852 KZ80424 Washington 8382.478392 Yes \n", + "10872 10872 XT67997 California 5979.724161 Yes \n", + "10887 10887 BY78730 Oregon 8879.790017 Yes \n", + "10897 10897 MM70762 Arizona 9075.768214 Yes \n", + "\n", + " Coverage Education Effective To Date EmploymentStatus \\\n", + "3 Extended College 1/11/11 Employed \n", + "8 Premium College 1/19/11 Employed \n", + "15 Basic Master 1/16/11 Employed \n", + "19 Extended College 2/26/11 Employed \n", + "27 Premium Bachelor 2/28/11 Employed \n", + "... ... ... ... ... \n", + "10844 Extended High School or Below 2/7/11 Employed \n", + "10852 Basic Bachelor 1/27/11 Employed \n", + "10872 Extended High School or Below 2/7/11 Employed \n", + "10887 Basic High School or Below 2/3/11 Employed \n", + "10897 Basic Master 1/26/11 Employed \n", + "\n", + " Gender ... Number of Open Complaints Number of Policies \\\n", + "3 M ... 0.0 2 \n", + "8 M ... 0.0 1 \n", + "15 F ... 0.0 1 \n", + "19 F ... 1.0 1 \n", + "27 F ... 0.0 1 \n", + "... ... ... ... ... \n", + "10844 F ... 0.0 3 \n", + "10852 M ... 0.0 2 \n", + "10872 F ... 0.0 3 \n", + "10887 F ... 0.0 7 \n", + "10897 M ... 0.0 8 \n", + "\n", + " Policy Type Policy Renew Offer Type Sales Channel \\\n", + "3 Corporate Auto Corporate L3 Offer2 Branch \n", + "8 Personal Auto Personal L1 Offer2 Branch \n", + "15 Special Auto Special L1 Offer2 Branch \n", + "19 Personal Auto Personal L2 Offer2 Call Center \n", + "27 Personal Auto Personal L3 Offer2 Agent \n", + "... ... ... ... ... \n", + "10844 Personal Auto Personal L1 Offer2 Agent \n", + "10852 Personal Auto Personal L2 Offer2 Call Center \n", + "10872 Personal Auto Personal L3 Offer2 Agent \n", + "10887 Special Auto Special L2 Offer1 Agent \n", + "10897 Personal Auto Personal L1 Offer1 Agent \n", + "\n", + " Total Claim Amount Vehicle Class Vehicle Size Vehicle Type \n", + "3 484.013411 Four-Door Car Medsize A \n", + "8 739.200000 Sports Car Medsize NaN \n", + "15 547.200000 SUV Medsize NaN \n", + "19 19.575683 Two-Door Car Large A \n", + "27 60.036683 Four-Door Car Medsize NaN \n", + "... ... ... ... ... \n", + "10844 547.200000 Four-Door Car Medsize NaN \n", + "10852 791.878042 NaN NaN A \n", + "10872 547.200000 Four-Door Car Medsize NaN \n", + "10887 528.200860 SUV Small A \n", + "10897 158.077504 Sports Car Medsize A \n", + "\n", + "[1399 rows x 26 columns]\n" + ] + } + ], + "source": [ + "# 1. Filter customers: low total_claim_amount (<1000) and responded \"Yes\"\n", + "low_claim_yes = df[(df['Total Claim Amount'] < 1000) & (df['Response'] == 'Yes')]\n", + "print(low_claim_yes)" + ] + }, + { + "cell_type": "markdown", + "id": "b9be383e-5165-436e-80c8-57d4c757c8c3", + "metadata": { + "id": "b9be383e-5165-436e-80c8-57d4c757c8c3" + }, + "source": [ + "2. Using the original Dataframe, analyze:\n", + " - the average `monthly_premium` and/or customer lifetime value by `policy_type` and `gender` for customers who responded \"Yes\", and\n", + " - compare these insights to `total_claim_amount` patterns, and discuss which segments appear most profitable or low-risk for the company." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "8e47546a-03f3-4eff-ac7c-a1840cc477df", + "metadata": {}, + "outputs": [ { - "cell_type": "markdown", - "id": "e4378d94-48fb-4850-a802-b1bc8f427b2d", - "metadata": { - "id": "e4378d94-48fb-4850-a802-b1bc8f427b2d" - }, - "source": [ - "External Resources for Data Filtering: https://towardsdatascience.com/filtering-data-frames-in-pandas-b570b1f834b9" - ] - }, + "name": "stdout", + "output_type": "stream", + "text": [ + " Monthly Premium Auto Customer Lifetime Value\n", + "Policy Type Gender \n", + "Corporate Auto F 94.30 7712.63\n", + " M 92.19 7944.47\n", + "Personal Auto F 99.00 8339.79\n", + " M 91.09 7448.38\n", + "Special Auto F 92.31 7691.58\n", + " M 86.34 8247.09\n" + ] + } + ], + "source": [ + "# 2. Average monthly_premium and customer lifetime value by policy_type and gender for responded \"Yes\"\n", + "avg_metrics = df[df['Response'] == 'Yes'].groupby(['Policy Type', 'Gender'])[['Monthly Premium Auto', 'Customer Lifetime Value']].mean().round(2)\n", + "print(avg_metrics)" + ] + }, + { + "cell_type": "markdown", + "id": "7050f4ac-53c5-4193-a3c0-8699b87196f0", + "metadata": { + "id": "7050f4ac-53c5-4193-a3c0-8699b87196f0" + }, + "source": [ + "3. Analyze the total number of customers who have policies in each state, and then filter the results to only include states where there are more than 500 customers." + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "15716ae7-2833-4498-a2c6-ab5d98e9d690", + "metadata": {}, + "outputs": [ { - "cell_type": "code", - "execution_count": null, - "id": "449513f4-0459-46a0-a18d-9398d974c9ad", - "metadata": { - "id": "449513f4-0459-46a0-a18d-9398d974c9ad" - }, - "outputs": [], - "source": [ - "# your code goes here" - ] + "name": "stdout", + "output_type": "stream", + "text": [ + "State\n", + "California 3552\n", + "Oregon 2909\n", + "Arizona 1937\n", + "Nevada 993\n", + "Washington 888\n", + "Name: count, dtype: int64\n" + ] } - ], - "metadata": { - "colab": { - "provenance": [] - }, - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.13" + ], + "source": [ + "# 3. Total number of customers per state, filter states with more than 500 customers\n", + "state_counts = df['State'].value_counts()\n", + "state_over_500 = state_counts[state_counts > 500]\n", + "print(state_over_500)" + ] + }, + { + "cell_type": "markdown", + "id": "b60a4443-a1a7-4bbf-b78e-9ccdf9895e0d", + "metadata": { + "id": "b60a4443-a1a7-4bbf-b78e-9ccdf9895e0d" + }, + "source": [ + "4. Find the maximum, minimum, and median customer lifetime value by education level and gender. Write your conclusions." + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "231b9324-fc0a-4a14-9d28-2c0043a42489", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + " max min median\n", + "Education Gender \n", + "Bachelor F 73225.96 1904.00 5640.51\n", + " M 67907.27 1898.01 5548.03\n", + "College F 61850.19 1898.68 5623.61\n", + " M 61134.68 1918.12 6005.85\n", + "Doctor F 44856.11 2395.57 5332.46\n", + " M 32677.34 2267.60 5577.67\n", + "High School or Below F 55277.45 2144.92 6039.55\n", + " M 83325.38 1940.98 6286.73\n", + "Master F 51016.07 2417.78 5729.86\n", + " M 50568.26 2272.31 5579.10\n" + ] } + ], + "source": [ + "# 4. Max, min, median customer lifetime value by education and gender\n", + "clv_stats = df.groupby(['Education', 'Gender'])['Customer Lifetime Value'].agg(['max','min','median']).round(2)\n", + "print(clv_stats)" + ] + }, + { + "cell_type": "markdown", + "id": "b42999f9-311f-481e-ae63-40a5577072c5", + "metadata": { + "id": "b42999f9-311f-481e-ae63-40a5577072c5" + }, + "source": [ + "## Bonus" + ] + }, + { + "cell_type": "markdown", + "id": "81ff02c5-6584-4f21-a358-b918697c6432", + "metadata": { + "id": "81ff02c5-6584-4f21-a358-b918697c6432" + }, + "source": [ + "5. The marketing team wants to analyze the number of policies sold by state and month. Present the data in a table where the months are arranged as columns and the states are arranged as rows." + ] + }, + { + "cell_type": "markdown", + "id": "b6aec097-c633-4017-a125-e77a97259cda", + "metadata": { + "id": "b6aec097-c633-4017-a125-e77a97259cda" + }, + "source": [ + "6. Display a new DataFrame that contains the number of policies sold by month, by state, for the top 3 states with the highest number of policies sold.\n", + "\n", + "*Hint:*\n", + "- *To accomplish this, you will first need to group the data by state and month, then count the number of policies sold for each group. Afterwards, you will need to sort the data by the count of policies sold in descending order.*\n", + "- *Next, you will select the top 3 states with the highest number of policies sold.*\n", + "- *Finally, you will create a new DataFrame that contains the number of policies sold by month for each of the top 3 states.*" + ] + }, + { + "cell_type": "markdown", + "id": "ba975b8a-a2cf-4fbf-9f59-ebc381767009", + "metadata": { + "id": "ba975b8a-a2cf-4fbf-9f59-ebc381767009" + }, + "source": [ + "7. The marketing team wants to analyze the effect of different marketing channels on the customer response rate.\n", + "\n", + "Hint: You can use melt to unpivot the data and create a table that shows the customer response rate (those who responded \"Yes\") by marketing channel." + ] + }, + { + "cell_type": "markdown", + "id": "e4378d94-48fb-4850-a802-b1bc8f427b2d", + "metadata": { + "id": "e4378d94-48fb-4850-a802-b1bc8f427b2d" + }, + "source": [ + "External Resources for Data Filtering: https://towardsdatascience.com/filtering-data-frames-in-pandas-b570b1f834b9" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "449513f4-0459-46a0-a18d-9398d974c9ad", + "metadata": { + "id": "449513f4-0459-46a0-a18d-9398d974c9ad" + }, + "outputs": [], + "source": [ + "# your code goes here" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python [conda env:base] *", + "language": "python", + "name": "conda-base-py" }, - "nbformat": 4, - "nbformat_minor": 5 + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.13.5" + } + }, + "nbformat": 4, + "nbformat_minor": 5 }