Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
113 changes: 113 additions & 0 deletions lab-sql-aggregation-and-transformation/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,113 @@
![logo_ironhack_blue 7](https://user-images.githubusercontent.com/23629340/40541063-a07a0a8a-601a-11e8-91b5-2f13e4e6b441.png)

# Lab | SQL Data Aggregation and Transformation

<details>
<summary>
<h2>Learning Goals</h2>
</summary>

This lab allows you to practice and apply the concepts and techniques taught in class.

Upon completion of this lab, you will be able to:

- Use SQL built-in functions such as COUNT, MAX, MIN, AVG to aggregate and summarize data, and use GROUP BY to group data by specific columns. Use the HAVING clause to filter data based on aggregate functions.
- Use SQL to clean, transform, and prepare data for analysis by handling duplicates, null values, renaming columns, and converting data types. Use functions like ROUND, DATE_DIFF, CONCAT, and SUBSTRING to manipulate data and generate insights.
- Use conditional expressions for creating new columns.


<br>
<hr>

</details>

<details>
<summary>
<h2>Prerequisites</h2>
</summary>

Before this starting this lab, you should have learnt about:

- SELECT, FROM, ORDER BY, LIMIT, WHERE, GROUP BY, and HAVING clauses.
- DISTINCT keyword to return only unique values, AS keyword for using aliases.
- Built-in SQL functions such as COUNT, MAX, MIN, AVG, ROUND, DATEDIFF, or DATE_FORMAT.
- CASE statement for conditional logic.
<br>
<hr>

</details>


## Introduction

Welcome to the SQL Data Aggregation and Transformation lab!

In this lab, you will practice how to use SQL queries to extract insights from the [Sakila](https://dev.mysql.com/doc/sakila/en/) database which contains information about movie rentals.

You will build on your SQL skills by practicing how to use the `GROUP BY` and `HAVING` clauses to group data and filter results based on aggregate values. You will also practice how to handle null values, rename columns, and use built-in functions like `MAX`, `MIN`, `ROUND`, `DATE_DIFF`, `CONCAT`, and `SUBSTRING` to manipulate and transform data for generating insights.

Throughout the lab, you will work with two SQL query files: `sakila-schema.sql`, which creates the database schema, and `sakila-data.sql`, which inserts the data into the database. You can download the necessary files locally by following the steps listed in [Sakila sample database - installation](https://dev.mysql.com/doc/sakila/en/sakila-installation.html).

You can also refer to the Entity Relationship Diagram (ERD) of the database to guide your analysis:

<br>

![DB schema](https://education-team-2020.s3-eu-west-1.amazonaws.com/data-analytics/database-sakila-schema.png)

<br><br>

Imagine you work at a movie rental company as an analyst. By using SQL in the challenges below, you are required to gain insights into different elements of its business operations.

## Challenge 1

1. You need to use SQL built-in functions to gain insights relating to the duration of movies:
- 1.1 Determine the **shortest and longest movie durations** and name the values as `max_duration` and `min_duration`.
- 1.2. Express the **average movie duration in hours and minutes**. Don't use decimals.
- *Hint: Look for floor and round functions.*
2. You need to gain insights related to rental dates:
- 2.1 Calculate the **number of days that the company has been operating**.
- *Hint: To do this, use the `rental` table, and the `DATEDIFF()` function to subtract the earliest date in the `rental_date` column from the latest date.*
- 2.2 Retrieve rental information and add two additional columns to show the **month and weekday of the rental**. Return 20 rows of results.
- 2.3 *Bonus: Retrieve rental information and add an additional column called `DAY_TYPE` with values **'weekend' or 'workday'**, depending on the day of the week.*
- *Hint: use a conditional expression.*
3. You need to ensure that customers can easily access information about the movie collection. To achieve this, retrieve the **film titles and their rental duration**. If any rental duration value is **NULL, replace** it with the string **'Not Available'**. Sort the results of the film title in ascending order.
- *Please note that even if there are currently no null values in the rental duration column, the query should still be written to handle such cases in the future.*
- *Hint: Look for the `IFNULL()` function.*

4. *Bonus: The marketing team for the movie rental company now needs to create a personalized email campaign for customers. To achieve this, you need to retrieve the **concatenated first and last names of customers**, along with the **first 3 characters of their email** address, so that you can address them by their first name and use their email address to send personalized recommendations. The results should be ordered by last name in ascending order to make it easier to use the data.*

## Challenge 2

1. Next, you need to analyze the films in the collection to gain some more insights. Using the `film` table, determine:
- 1.1 The **total number of films** that have been released.
- 1.2 The **number of films for each rating**.
- 1.3 The **number of films for each rating, sorting** the results in descending order of the number of films.
This will help you to better understand the popularity of different film ratings and adjust purchasing decisions accordingly.
2. Using the `film` table, determine:
- 2.1 The **mean film duration for each rating**, and sort the results in descending order of the mean duration. Round off the average lengths to two decimal places. This will help identify popular movie lengths for each category.
- 2.2 Identify **which ratings have a mean duration of over two hours** in order to help select films for customers who prefer longer movies.
3. *Bonus: determine which last names are not repeated in the table `actor`.*



## Requirements

- Fork this repo
- Clone it to your machine


## Getting Started

Complete the challenges in this readme in a `.sql`file.

## Submission

- Upon completion, run the following commands:

```bash
git add .
git commit -m "Solved lab"
git push origin master
```

- Paste the link of your lab in Student Portal.
87 changes: 87 additions & 0 deletions lab-sql-aggregation-and-transformation/answers.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,87 @@
USE SAKILA;

#Challenge 1.1.1
SELECT MAX(length)
AS max_duration
FROM film;

SELECT MIN(length)
AS min_duration
FROM film;

#Challenge 1.1.2
SELECT ROUND(AVG(length))
FROM film;

#Challenge 1.2.1
SELECT DATEDIFF(
(SELECT MAX(rental_date) FROM rental),
(SELECT MIN(rental_date) FROM rental)
) AS operating_days;

#Challenge 1.2.2
SELECT *,
MONTH(rental_date) AS rental_month,
DAYNAME(rental_date) AS rental_weekday
FROM rental
LIMIT 20;

#Challenge 1.2.3
SELECT *,
DAYNAME(rental_date) AS rental_weekday,
CASE
WHEN DAYOFWEEK(rental_date) IN (1, 7) THEN 'weekend'
ELSE 'workday'
END AS day_type
FROM rental;

#Challenge 1.3
SELECT
title,
IFNULL(rental_duration, 'Not Available') AS rental_duration
FROM film
ORDER BY title ASC;

#Challenge 1.4
SELECT
CONCAT(first_name, ' ', last_name) AS full_name,
LEFT(email, 3) AS email_prefix
FROM customer
ORDER BY last_name ASC;

#Challenge 2.1.1
SELECT COUNT(*)
FROM film;

#Challenge 2.1.2
SELECT COUNT(*), rating
FROM film
GROUP BY rating;

#Challenge 2.1.3
SELECT COUNT(*) AS number_films_rating, rating
FROM film
GROUP BY rating
ORDER BY number_films_rating ASC;

#Challenge 2.2.1
SELECT
ROUND(AVG(length),2) AS avg_lenght_films,
rating
FROM film
GROUP BY rating
ORDER BY avg_lenght_films DESC;

#Challenge 2.2.2
SELECT
ROUND(AVG(length),2) AS avg_lenght_films,
rating
FROM film
GROUP BY rating
HAVING AVG(length) > 120
ORDER BY avg_lenght_films DESC;

#Challenge 2.3
SELECT DISTINCT
last_name AS unique_last_name
FROM actor;