diff --git a/lab-dw-aggregating.ipynb b/lab-dw-aggregating.ipynb index fadd718..85ee28e 100644 --- a/lab-dw-aggregating.ipynb +++ b/lab-dw-aggregating.ipynb @@ -1,165 +1,602 @@ { - "cells": [ - { - "cell_type": "markdown", - "id": "31969215-2a90-4d8b-ac36-646a7ae13744", - "metadata": { - "id": "31969215-2a90-4d8b-ac36-646a7ae13744" - }, - "source": [ - "# Lab | Data Aggregation and Filtering" - ] - }, - { - "cell_type": "markdown", - "id": "a8f08a52-bec0-439b-99cc-11d3809d8b5d", - "metadata": { - "id": "a8f08a52-bec0-439b-99cc-11d3809d8b5d" - }, - "source": [ - "In this challenge, we will continue to work with customer data from an insurance company. We will use the dataset called marketing_customer_analysis.csv, which can be found at the following link:\n", - "\n", - "https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis.csv\n", - "\n", - "This dataset contains information such as customer demographics, policy details, vehicle information, and the customer's response to the last marketing campaign. Our goal is to explore and analyze this data by first performing data cleaning, formatting, and structuring." - ] - }, - { - "cell_type": "markdown", - "id": "9c98ddc5-b041-4c94-ada1-4dfee5c98e50", - "metadata": { - "id": "9c98ddc5-b041-4c94-ada1-4dfee5c98e50" - }, - "source": [ - "1. Create a new DataFrame that only includes customers who:\n", - " - have a **low total_claim_amount** (e.g., below $1,000),\n", - " - have a response \"Yes\" to the last marketing campaign." - ] - }, - { - "cell_type": "markdown", - "id": "b9be383e-5165-436e-80c8-57d4c757c8c3", - "metadata": { - "id": "b9be383e-5165-436e-80c8-57d4c757c8c3" - }, - "source": [ - "2. 
Using the original Dataframe, analyze:\n", - " - the average `monthly_premium` and/or customer lifetime value by `policy_type` and `gender` for customers who responded \"Yes\", and\n", - " - compare these insights to `total_claim_amount` patterns, and discuss which segments appear most profitable or low-risk for the company." - ] - }, - { - "cell_type": "markdown", - "id": "7050f4ac-53c5-4193-a3c0-8699b87196f0", - "metadata": { - "id": "7050f4ac-53c5-4193-a3c0-8699b87196f0" - }, - "source": [ - "3. Analyze the total number of customers who have policies in each state, and then filter the results to only include states where there are more than 500 customers." - ] - }, - { - "cell_type": "markdown", - "id": "b60a4443-a1a7-4bbf-b78e-9ccdf9895e0d", - "metadata": { - "id": "b60a4443-a1a7-4bbf-b78e-9ccdf9895e0d" - }, - "source": [ - "4. Find the maximum, minimum, and median customer lifetime value by education level and gender. Write your conclusions." - ] - }, - { - "cell_type": "markdown", - "id": "b42999f9-311f-481e-ae63-40a5577072c5", - "metadata": { - "id": "b42999f9-311f-481e-ae63-40a5577072c5" - }, - "source": [ - "## Bonus" - ] - }, + "cells": [ + { + "cell_type": "markdown", + "id": "31969215-2a90-4d8b-ac36-646a7ae13744", + "metadata": { + "id": "31969215-2a90-4d8b-ac36-646a7ae13744" + }, + "source": [ + "# Lab | Data Aggregation and Filtering" + ] + }, + { + "cell_type": "markdown", + "id": "a8f08a52-bec0-439b-99cc-11d3809d8b5d", + "metadata": { + "id": "a8f08a52-bec0-439b-99cc-11d3809d8b5d" + }, + "source": [ + "In this challenge, we will continue to work with customer data from an insurance company. 
We will use the dataset called marketing_customer_analysis.csv, which can be found at the following link:\n", + "\n", + "https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis.csv\n", + "\n", + "This dataset contains information such as customer demographics, policy details, vehicle information, and the customer's response to the last marketing campaign. Our goal is to explore and analyze this data by first performing data cleaning, formatting, and structuring." + ] + }, + { + "cell_type": "markdown", + "id": "9c98ddc5-b041-4c94-ada1-4dfee5c98e50", + "metadata": { + "id": "9c98ddc5-b041-4c94-ada1-4dfee5c98e50" + }, + "source": [ + "1. Create a new DataFrame that only includes customers who:\n", + " - have a **low total_claim_amount** (e.g., below $1,000),\n", + " - have a response \"Yes\" to the last marketing campaign." + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "id": "fa191ee5-a1dc-4554-8dd8-e1eb48843742", + "metadata": {}, + "outputs": [ { - "cell_type": "markdown", - "id": "81ff02c5-6584-4f21-a358-b918697c6432", - "metadata": { - "id": "81ff02c5-6584-4f21-a358-b918697c6432" - }, - "source": [ - "5. The marketing team wants to analyze the number of policies sold by state and month. Present the data in a table where the months are arranged as columns and the states are arranged as rows." + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
       Response  Total Claim Amount
3      Yes               484.013411
8      Yes               739.200000
15     Yes               547.200000
19     Yes                19.575683
27     Yes                60.036683
...    ...                      ...
10844  Yes               547.200000
10852  Yes               791.878042
10872  Yes               547.200000
10887  Yes               528.200860
10897  Yes               158.077504
\n", + "

1399 rows × 2 columns

\n", + "
" + ], + "text/plain": [ + " Response Total Claim Amount\n", + "3 Yes 484.013411\n", + "8 Yes 739.200000\n", + "15 Yes 547.200000\n", + "19 Yes 19.575683\n", + "27 Yes 60.036683\n", + "... ... ...\n", + "10844 Yes 547.200000\n", + "10852 Yes 791.878042\n", + "10872 Yes 547.200000\n", + "10887 Yes 528.200860\n", + "10897 Yes 158.077504\n", + "\n", + "[1399 rows x 2 columns]" ] - }, + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "import pandas as pd \n", + "url = 'https://raw.githubusercontent.com/data-bootcamp-v4/data/main/marketing_customer_analysis.csv'\n", + "customer_df = pd.read_csv (url)\n", + "#unique_columns = list(set(\n", + "\n", + "Lowpo_df = customer_df[(customer_df['Response']=='Yes') & (customer_df['Total Claim Amount'] < 1000)] \n", + "filtered_df = Lowpo_df[['Response', 'Total Claim Amount']]\n", + "display (filtered_df)\n", + "\n", + "#customer_df.columns\n", + "#Index(['Unnamed: 0', 'Customer', 'State', 'Customer Lifetime Value',\n", + " # 'Response', 'Coverage', 'Education', 'Effective To Date',\n", + " # 'EmploymentStatus', 'Gender', 'Income', 'Location Code',\n", + " # 'Marital Status', 'Monthly Premium Auto', 'Months Since Last Claim',\n", + " # 'Months Since Policy Inception', 'Number of Open Complaints',\n", + " #'Number of Policies', 'Policy Type', 'Policy', 'Renew Offer Type',\n", + " #'Sales Channel', 'Total Claim Amount', 'Vehicle Class', 'Vehicle Size',\n", + " #'Vehicle Type'],\n", + " #dtype='object')" + ] + }, + { + "cell_type": "markdown", + "id": "b9be383e-5165-436e-80c8-57d4c757c8c3", + "metadata": { + "id": "b9be383e-5165-436e-80c8-57d4c757c8c3" + }, + "source": [ + "2. Using the original Dataframe, analyze:\n", + " - the average `monthly_premium` and/or customer lifetime value by `policy_type` and `gender` for customers who responded \"Yes\", and\n", + " - compare these insights to `total_claim_amount` patterns, and discuss which segments appear most profitable or low-risk for the company." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 51, + "id": "8995c2b5-e809-4fed-a3d7-7c77fdcc3484", + "metadata": {}, + "outputs": [ { - "cell_type": "markdown", - "id": "b6aec097-c633-4017-a125-e77a97259cda", - "metadata": { - "id": "b6aec097-c633-4017-a125-e77a97259cda" - }, - "source": [ - "6. Display a new DataFrame that contains the number of policies sold by month, by state, for the top 3 states with the highest number of policies sold.\n", - "\n", - "*Hint:*\n", - "- *To accomplish this, you will first need to group the data by state and month, then count the number of policies sold for each group. Afterwards, you will need to sort the data by the count of policies sold in descending order.*\n", - "- *Next, you will select the top 3 states with the highest number of policies sold.*\n", - "- *Finally, you will create a new DataFrame that contains the number of policies sold by month for each of the top 3 states.*" + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
                        Monthly Premium Auto  Customer Lifetime Value
Policy Type     Gender
Corporate Auto  F                  94.301775              7712.628736
                M                  92.188312              7944.465414
Personal Auto   F                  98.998148              8339.791842
                M                  91.085821              7448.383281
Special Auto    F                  92.314286              7691.584111
                M                  86.343750              8247.088702
\n", + "
" + ], + "text/plain": [ + " Monthly Premium Auto Customer Lifetime Value\n", + "Policy Type Gender \n", + "Corporate Auto F 94.301775 7712.628736\n", + " M 92.188312 7944.465414\n", + "Personal Auto F 98.998148 8339.791842\n", + " M 91.085821 7448.383281\n", + "Special Auto F 92.314286 7691.584111\n", + " M 86.343750 8247.088702" ] + }, + "metadata": {}, + "output_type": "display_data" }, { - "cell_type": "markdown", - "id": "ba975b8a-a2cf-4fbf-9f59-ebc381767009", - "metadata": { - "id": "ba975b8a-a2cf-4fbf-9f59-ebc381767009" - }, - "source": [ - "7. The marketing team wants to analyze the effect of different marketing channels on the customer response rate.\n", - "\n", - "Hint: You can use melt to unpivot the data and create a table that shows the customer response rate (those who responded \"Yes\") by marketing channel." + "data": { + "text/plain": [ + "Policy Type Gender\n", + "Corporate Auto F 433.738499\n", + " M 408.582459\n", + "Personal Auto F 452.965929\n", + " M 457.010178\n", + "Special Auto F 453.280164\n", + " M 429.527942\n", + "Name: Total Claim Amount, dtype: float64" ] - }, + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "#groupby['Policy Type', 'GENDER'] \n", + "yes_df = customer_df[customer_df['Response'] == 'Yes']\n", + "\n", + "tg_df = yes_df.groupby(['Policy Type','Gender']) [['Monthly Premium Auto', 'Customer Lifetime Value']].mean()\n", + "claim_df = yes_df.groupby(['Policy Type','Gender']) ['Total Claim Amount'].mean()\n", + "display (tg_df)\n", + "display (claim_df)" + ] + }, + { + "cell_type": "markdown", + "id": "affbb2e8-e3c9-4400-9f57-af8638fac1c1", + "metadata": {}, + "source": [ + "#2 Answer :\n", + "Personal Auto female is the highest LTV & Monthly fee. \n", + "But given lowest Total Claim amount in Corporate Auto male and relatively high LTV (#3 overall), \n", + "overall Corporate Auto in male might be the best segment." 
+ ] + }, + { + "cell_type": "markdown", + "id": "7050f4ac-53c5-4193-a3c0-8699b87196f0", + "metadata": { + "id": "7050f4ac-53c5-4193-a3c0-8699b87196f0" + }, + "source": [ + "3. Analyze the total number of customers who have policies in each state, and then filter the results to only include states where there are more than 500 customers." + ] + }, + { + "cell_type": "code", + "execution_count": 87, + "id": "77f7d305-3af5-4093-9696-cf41ba52318e", + "metadata": {}, + "outputs": [ { - "cell_type": "markdown", - "id": "b60a4443-a1a7-4bbf-b78e-9ccdf9895e0d", - "metadata": { - "id": "b60a4443-a1a7-4bbf-b78e-9ccdf9895e0d" - }, - "source": [ - "4. Find the maximum, minimum, and median customer lifetime value by education level and gender. Write your conclusions." + "data": { + "text/plain": [ + "State\n", + "California 3552\n", + "Oregon 2909\n", + "Arizona 1937\n", + "Nevada 993\n", + "Washington 888\n", + "Name: count, dtype: int64" ] - }, + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Count customers per state, then keep only states with more than 500 customers\n", + "\n", + "state_df = customer_df['State'].value_counts()\n", + "display(state_df)\n", + "filtered_df = state_df[state_df > 500]  # value_counts() returns a Series, so compare it directly\n", + "#display(filtered_df)" + ] + }, + { + "cell_type": "markdown", + "id": "b60a4443-a1a7-4bbf-b78e-9ccdf9895e0d", + "metadata": { + "id": "b60a4443-a1a7-4bbf-b78e-9ccdf9895e0d" + }, + "source": [ + "4. Find the maximum, minimum, and median customer lifetime value by education level and gender. Write your conclusions." + ] + }, + { + "cell_type": "code", + "execution_count": 86, + "id": "bec3edf2-ef3d-4718-a587-67e47a637126", + "metadata": {}, + "outputs": [ { - "cell_type": "code", - "execution_count": null, - "id": "449513f4-0459-46a0-a18d-9398d974c9ad", - "metadata": { - "id": "449513f4-0459-46a0-a18d-9398d974c9ad" - }, - "outputs": [], - "source": [ - "# your code goes here" + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
                                      max          min       median
Education             Gender
Bachelor              F       73225.95652  1904.000852  5640.505303
                      M       67907.27050  1898.007675  5548.031892
College               F       61850.18803  1898.683686  5623.611187
                      M       61134.68307  1918.119700  6005.847375
Doctor                F       44856.11397  2395.570000  5332.462694
                      M       32677.34284  2267.604038  5577.669457
High School or Below  F       55277.44589  2144.921535  6039.553187
                      M       83325.38119  1940.981221  6286.731006
Master                F       51016.06704  2417.777032  5729.855012
                      M       50568.25912  2272.307310  5579.099207
\n", + "
" + ], + "text/plain": [ + " max min median\n", + "Education Gender \n", + "Bachelor F 73225.95652 1904.000852 5640.505303\n", + " M 67907.27050 1898.007675 5548.031892\n", + "College F 61850.18803 1898.683686 5623.611187\n", + " M 61134.68307 1918.119700 6005.847375\n", + "Doctor F 44856.11397 2395.570000 5332.462694\n", + " M 32677.34284 2267.604038 5577.669457\n", + "High School or Below F 55277.44589 2144.921535 6039.553187\n", + " M 83325.38119 1940.981221 6286.731006\n", + "Master F 51016.06704 2417.777032 5729.855012\n", + " M 50568.25912 2272.307310 5579.099207" ] + }, + "execution_count": 86, + "metadata": {}, + "output_type": "execute_result" } - ], - "metadata": { - "colab": { - "provenance": [] - }, - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.13" - } + ], + "source": [ + "customer_df.groupby(['Education','Gender'])['Customer Lifetime Value'].agg(['max', 'min', 'median'])" + ] + }, + { + "cell_type": "markdown", + "id": "f7d9c372-4d5d-4307-b77e-c4f1307c8fde", + "metadata": {}, + "source": [ + "#4 The best customer profile is California residing male with Educational background of High School or Below &\n", + "Corporate Auto, or female with Bachelor degree with Personal Auto " + ] + }, + { + "cell_type": "markdown", + "id": "b42999f9-311f-481e-ae63-40a5577072c5", + "metadata": { + "id": "b42999f9-311f-481e-ae63-40a5577072c5" + }, + "source": [ + "## Bonus" + ] + }, + { + "cell_type": "markdown", + "id": "81ff02c5-6584-4f21-a358-b918697c6432", + "metadata": { + "id": "81ff02c5-6584-4f21-a358-b918697c6432" + }, + "source": [ + "5. The marketing team wants to analyze the number of policies sold by state and month. 
Present the data in a table where the months are arranged as columns and the states are arranged as rows." + ] + }, + { + "cell_type": "markdown", + "id": "b6aec097-c633-4017-a125-e77a97259cda", + "metadata": { + "id": "b6aec097-c633-4017-a125-e77a97259cda" + }, + "source": [ + "6. Display a new DataFrame that contains the number of policies sold by month, by state, for the top 3 states with the highest number of policies sold.\n", + "\n", + "*Hint:*\n", + "- *To accomplish this, you will first need to group the data by state and month, then count the number of policies sold for each group. Afterwards, you will need to sort the data by the count of policies sold in descending order.*\n", + "- *Next, you will select the top 3 states with the highest number of policies sold.*\n", + "- *Finally, you will create a new DataFrame that contains the number of policies sold by month for each of the top 3 states.*" + ] + }, + { + "cell_type": "markdown", + "id": "ba975b8a-a2cf-4fbf-9f59-ebc381767009", + "metadata": { + "id": "ba975b8a-a2cf-4fbf-9f59-ebc381767009" + }, + "source": [ + "7. The marketing team wants to analyze the effect of different marketing channels on the customer response rate.\n", + "\n", + "Hint: You can use melt to unpivot the data and create a table that shows the customer response rate (those who responded \"Yes\") by marketing channel." 
+ ] + }, + { + "cell_type": "markdown", + "id": "e4378d94-48fb-4850-a802-b1bc8f427b2d", + "metadata": { + "id": "e4378d94-48fb-4850-a802-b1bc8f427b2d" + }, + "source": [ + "External Resources for Data Filtering: https://towardsdatascience.com/filtering-data-frames-in-pandas-b570b1f834b9" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "449513f4-0459-46a0-a18d-9398d974c9ad", + "metadata": { + "id": "449513f4-0459-46a0-a18d-9398d974c9ad" + }, + "outputs": [], + "source": [ + "# your code goes here" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "data_env", + "language": "python", + "name": "data_env" }, - "nbformat": 4, - "nbformat_minor": 5 + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.13.9" + } + }, + "nbformat": 4, + "nbformat_minor": 5 }
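
Bonus items 5 and 6 are still unanswered in this diff (the last cell is the `# your code goes here` placeholder). The block below is a minimal sketch of one possible approach, not the notebook author's solution. It runs on a small made-up frame (`toy_df`, hypothetical) rather than the real CSV; only the column names `State` and `Effective To Date` come from the notebook's column listing. Two assumptions: dates are month/day/two-digit-year strings, and "policies sold" means one row per policy (summing `Number of Policies` would be a different reading).

```python
import pandas as pd

# Hypothetical toy data standing in for marketing_customer_analysis.csv.
# Only the column names 'State' and 'Effective To Date' come from the notebook.
toy_df = pd.DataFrame({
    "State": ["California"] * 4 + ["Oregon"] * 3 + ["Arizona"] * 2 + ["Nevada"],
    "Effective To Date": [
        "1/15/11", "1/5/11", "2/3/11", "2/20/11",   # California
        "1/20/11", "1/28/11", "2/1/11",             # Oregon
        "2/10/11", "1/9/11",                        # Arizona
        "2/11/11",                                  # Nevada
    ],
})

# Assumption: dates are month/day/two-digit-year strings.
toy_df["Month"] = pd.to_datetime(
    toy_df["Effective To Date"], format="%m/%d/%y"
).dt.month

# Item 5: states as rows, months as columns, each cell = number of rows.
policies_by_state_month = pd.crosstab(toy_df["State"], toy_df["Month"])

# Item 6: keep only the three states with the most rows overall.
top3_states = toy_df["State"].value_counts().head(3).index
top3_by_month = policies_by_state_month.loc[top3_states]

print(policies_by_state_month)
print(top3_by_month)
```

On the real data, `toy_df` would be replaced by `customer_df`. `pd.crosstab` fills state/month combinations that never occur with 0, which keeps the table rectangular.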