diff --git a/2.6_web_scraping (1).ipynb b/2.6_web_scraping (1).ipynb new file mode 100644 index 0000000..82df147 --- /dev/null +++ b/2.6_web_scraping (1).ipynb @@ -0,0 +1,1499 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "5171aeed-67b1-4643-bd50-1ffcdfdc9c41", + "metadata": { + "id": "5171aeed-67b1-4643-bd50-1ffcdfdc9c41" + }, + "source": [ + "# Web Scraping" + ] + }, + { + "cell_type": "markdown", + "id": "3117eb6b-47f2-4bc0-afb7-04cab8095055", + "metadata": { + "id": "3117eb6b-47f2-4bc0-afb7-04cab8095055" + }, + "source": [ + "![legtsgo](https://media.giphy.com/media/dwmNhd5H7YAz6/giphy.gif)" + ] + }, + { + "cell_type": "markdown", + "id": "9cd76a2d-6ef9-4aca-a201-197d778af242", + "metadata": { + "id": "9cd76a2d-6ef9-4aca-a201-197d778af242", + "jp-MarkdownHeadingCollapsed": true, + "tags": [] + }, + "source": [ + "By the end of this lesson, you will be able to:\n", + "\n", + "- Identify the primary components of web technologies and their roles: HTML, CSS, and JavaScript.\n", + "- Explain the hierarchical structure of HTML and the significance of tags, attributes, and their relationship.\n", + "- Utilize the **requests** and **Beautiful Soup** libraries to scrape data from a static web page.\n", + "- Construct and execute a script to scrape data from a webpage and export it into a structured text file using the pandas library." + ] + }, + { + "cell_type": "markdown", + "id": "fb8e2aa9", + "metadata": { + "id": "fb8e2aa9", + "tags": [], + "toc": true + }, + "source": [ + "


\n", + "
" + ] + }, + { + "cell_type": "markdown", + "id": "f65c0446-74e5-4e78-89e4-e14bb186407a", + "metadata": { + "id": "f65c0446-74e5-4e78-89e4-e14bb186407a" + }, + "source": [ + "## What is Web Scraping" + ] + }, + { + "cell_type": "markdown", + "id": "3fe39426-4de3-433c-9058-06506fc6e056", + "metadata": { + "id": "3fe39426-4de3-433c-9058-06506fc6e056" + }, + "source": [ + "Web scraping is a method employed by data analysts and developers to retrieve information from web pages. It involves fetching a web page and then parsing that page to obtain desired information. This technique is especially useful when the desired data isn't available through APIs. The extracted data can then be cleaned, analyzed, or stored in databases for further data analytics tasks." + ] + }, + { + "cell_type": "markdown", + "id": "421cf815-cf22-478d-871e-a5ff67219d0b", + "metadata": { + "id": "421cf815-cf22-478d-871e-a5ff67219d0b" + }, + "source": [ + "## Web structure" + ] + }, + { + "cell_type": "markdown", + "id": "56ebf2f1-04c5-4dca-96fa-02cbd095fd09", + "metadata": { + "id": "56ebf2f1-04c5-4dca-96fa-02cbd095fd09" + }, + "source": [ + "The fundamental web technologies that form the structure of the websites we aim to scrape are:\n", + "\n", + "- **HTML**: Standing as the backbone of almost all websites, HTML, the core markup language, is instrumental in creating web pages. It houses all the content available on a webpage.\n", + " \n", + "- **CSS**: This stylesheet language works alongside HTML, taking charge of the presentation aspect of the webpages. It controls how HTML elements are displayed, setting the stage for a visually pleasing and organized web interface.\n", + "\n", + "- **JavaScript**: Adding a dynamic touch to the websites, JavaScript comes into play to create interactive and animated content. 
It can change a page's content even after the page has loaded.\n", + "\n", + "In this lesson, we will work with the HTML of websites." + ] + }, + { + "cell_type": "markdown", + "id": "10d38b73-2533-43f6-a765-0ad26dfa61f3", + "metadata": { + "id": "10d38b73-2533-43f6-a765-0ad26dfa61f3" + }, + "source": [ + "## HTML" + ] + }, + { + "cell_type": "markdown", + "id": "ddca406e-f9ab-48d1-b29b-eaed702dafd6", + "metadata": { + "id": "ddca406e-f9ab-48d1-b29b-eaed702dafd6" + }, + "source": [ + "To scrape the web, you need to understand HTML (HyperText Markup Language).\n", + "\n", + "HTML is the standard markup language used to create web pages. Think of it as the skeleton of a website. It structures content on the web, defining elements such as paragraphs, headings, links, lists, and images. These elements are represented by \"tags\", which enclose content to give it meaning and context.\n", + "\n", + "When web scraping, you'll often navigate this HTML structure to locate and extract the exact data you need. Your browser's \"Inspect\" and \"View Source\" features show the underlying HTML of a page, which is invaluable when working out how to access specific pieces of content programmatically." + ] + }, + { + "cell_type": "markdown", + "id": "6ba82ec5-2130-41ac-bf06-9a981b2e65b6", + "metadata": { + "id": "6ba82ec5-2130-41ac-bf06-9a981b2e65b6" + }, + "source": [ + "![](https://github.com/data-bootcamp-v4/lessons/blob/main/img/html.png?raw=true)" + ] + }, + { + "cell_type": "markdown", + "id": "2f3291a2-5fc6-4e93-b84c-4ef6d45bb433", + "metadata": { + "id": "2f3291a2-5fc6-4e93-b84c-4ef6d45bb433" + }, + "source": [ + "### Exploring Web Page Structures\n", + "\n", + "To inspect the underlying HTML of a web page, right-click anywhere on the page. Choose \"View Page Source\" in browsers like Chrome or Firefox. 
For Internet Explorer, choose \"View Source,\" and for Safari, select \"Show Page Source.\" (In Safari, if this option isn't visible, open Safari Preferences, click the Advanced tab, and enable \"Show Develop menu in menu bar.\")\n", + "\n", + "To start web scraping, you only need to understand **three foundational facts** about HTML.\n" + ] + }, + { + "cell_type": "markdown", + "id": "70d58c48-9b0d-4aa6-8903-e5cf504d16cd", + "metadata": { + "id": "70d58c48-9b0d-4aa6-8903-e5cf504d16cd" + }, + "source": [ + "### Fact 1: HTML is Built on Tags\n", + "\n", + "At its core, HTML is text content wrapped in `<tags>`, which are markers enclosed in angle brackets. The text between the tags is usually what we aim to scrape; the tags give that content structure and meaning and tell the browser how to present it. \"HTML\" stands for HyperText Markup Language.\n", + "\n", + "HTML follows a tree-like structure of parent, child, and sibling tags:\n", + "```\n", + "<html>\n", + "  <head>\n", + "    <title>Page Title</title>\n", + "  </head>\n", + "  <body>\n", + "    <h1>My First Heading</h1>\n", + "    <p>My first paragraph.</p>\n", + "  </body>\n", + "</html>\n", + "```\n", + "\n", + "Here `<html>` is the parent of `<head>` and `<body>`, which are siblings of each other; `<h1>` and `<p>` are children of `<body>`.\n", + "\n", + "For instance, the `<b>` tag signals bold formatting. If \"Jan. 21\" sits between an opening `<b>` tag and its matching closing `</b>` tag, as in `<b>Jan. 21</b>`, the pair marks where the bold styling begins and ends, and the browser renders \"Jan. 21\" in bold.\n", + "\n", + "Tags come in various types, each suited to a specific kind of content:\n", + " * **Headings**: `<h1>`, `<h2>`, `<h3>`, `<h4>`...\n", + " * **Phrasing**: `<b>`, `<i>`, `<em>`, `<strong>`, `<a>`...\n", + " * **Embedded Content**: `