From 4318710caeb32e7546ea7ed7407036e087ec3dcd Mon Sep 17 00:00:00 2001 From: Miguel Florindo Date: Sat, 25 Oct 2025 14:01:30 +0100 Subject: [PATCH] DataAnalist_PT_NOV25_MiguelFlorindo --- filtered_books.csv | 7 + lab-web-scraping.ipynb | 290 ++++++++++++++++++++++++++++++++++++++++- 2 files changed, 292 insertions(+), 5 deletions(-) create mode 100644 filtered_books.csv diff --git a/filtered_books.csv b/filtered_books.csv new file mode 100644 index 0000000..4f9c7ad --- /dev/null +++ b/filtered_books.csv @@ -0,0 +1,7 @@ +UPC,Title,Price (£),Rating,Genre,Availability,Description +ce6396b0f23f6ecc,Set Me Free,17.46,5.0,Young Adult,In stock (19 available),"Aaron Ledbetter’s future had been planned out for him since before he was born. Each year, the Ledbetter family vacation on Tybee Island gave Aaron a chance to briefly free himself from his family’s expectations. When he meets Jonas “Lucky” Luckett, a caricature artist in town with the traveling carnival, he must choose between the life that’s been mapped out for him, and Aaron Ledbetter’s future had been planned out for him since before he was born. Each year, the Ledbetter family vacation on Tybee Island gave Aaron a chance to briefly free himself from his family’s expectations. When he meets Jonas “Lucky” Luckett, a caricature artist in town with the traveling carnival, he must choose between the life that’s been mapped out for him, and the chance at true love. ...more" +6258a1f6a6dcfe50,The Four Agreements: A Practical Guide to Personal Freedom,17.66,5.0,Spirituality,In stock (18 available),"In The Four Agreements, don Miguel Ruiz reveals the source of self-limiting beliefs that rob us of joy and create needless suffering. Based on ancient Toltec wisdom, the Four Agreements offer a powerful code of conduct that can rapidly transform our lives to a new experience of freedom, true happiness, and love. The Four Agreements are: Be Impeccable With Your Word, Don't In The Four Agreements, don Miguel Ruiz reveals the source of self-limiting beliefs that rob us of joy and create needless suffering. Based on ancient Toltec wisdom, the Four Agreements offer a powerful code of conduct that can rapidly transform our lives to a new experience of freedom, true happiness, and love. The Four Agreements are: Be Impeccable With Your Word, Don't Take Anything Personally, Don't Make Assumptions, Always Do Your Best. ...more" +6be3beb0793a53e7,Sophie's World,15.94,5.0,Philosophy,In stock (18 available),"A page-turning novel that is also an exploration of the great philosophical concepts of Western thought, Sophie’s World has fired the imagination of readers all over the world, with more than twenty million copies in print.One day fourteen-year-old Sophie Amundsen comes home from school to find in her mailbox two notes, with one question on each: “Who are you?” and “Where A page-turning novel that is also an exploration of the great philosophical concepts of Western thought, Sophie’s World has fired the imagination of readers all over the world, with more than twenty million copies in print.One day fourteen-year-old Sophie Amundsen comes home from school to find in her mailbox two notes, with one question on each: “Who are you?” and “Where does the world come from?” From that irresistible beginning, Sophie becomes obsessed with questions that take her far beyond what she knows of her Norwegian village. Through those letters, she enrolls in a kind of correspondence course, covering Socrates to Sartre, with a mysterious philosopher, while receiving letters addressed to another girl. Who is Hilde? And why does her mail keep turning up? To unravel this riddle, Sophie must use the philosophy she is learning—but the truth turns out to be far more complicated than she could have imagined. ...more" +657fe5ead67a7767,Untitled Collection: Sabbath Poems 2014,14.27,4.0,Poetry,In stock (16 available),"More than thirty-five years ago, when the weather allowed, Wendell Berry began spending his sabbaths outdoors, walking and wandering around familiar territory, seeking a deep intimacy only time could provide. These walks arranged themselves into poems and each year since he has completed a sequence dated by the year of its composition. Last year we collected the lot into a More than thirty-five years ago, when the weather allowed, Wendell Berry began spending his sabbaths outdoors, walking and wandering around familiar territory, seeking a deep intimacy only time could provide. These walks arranged themselves into poems and each year since he has completed a sequence dated by the year of its composition. Last year we collected the lot into a collection, This Day, the Sabbath Poems 1979-2013. This new sequence for the following year is one of the richest yet. This group provides a virtual syllabus for all of Mr. Berry’s cultural and agricultural work in concentrated form. Many of these poems are drawn from the view from a small porch in the woods, a place of stillness and reflection, a vantage point “of the one/life of the forest composed/of uncountable lives in countless/years each life coherent itself within/ the coherence, the great composure,/of all.” A new collection of Wendell Berry poems is always an occasion of joyful celebration and this one is especially so. ...more" +51653ef291ab7ddc,This One Summer,19.49,4.0,Sequential Art,In stock (16 available),"Every summer, Rose goes with her mom and dad to a lake house in Awago Beach. It's their getaway, their refuge. Rosie's friend Windy is always there, too, like the little sister she never had. But this summer is different. Rose's mom and dad won't stop fighting, and when Rose and Windy seek a distraction from the drama, they find themselves with a whole new set of problems. Every summer, Rose goes with her mom and dad to a lake house in Awago Beach. It's their getaway, their refuge. Rosie's friend Windy is always there, too, like the little sister she never had. But this summer is different. Rose's mom and dad won't stop fighting, and when Rose and Windy seek a distraction from the drama, they find themselves with a whole new set of problems. It's a summer of secrets and sorrow and growing up, and it's a good thing Rose and Windy have each other.In This One Summer two stellar creators redefine the teen graphic novel. Cousins Mariko and Jillian Tamaki, the team behind Skim, have collaborated on this gorgeous, heartbreaking, and ultimately hopeful story about a girl on the cusp of her teen age — a story of renewal and revelation. ...more" +709822d0b5bcb7f4,Thirst,17.27,5.0,Fiction,In stock (16 available),"On a searing summer Friday, Eddie Chapman has been stuck for hours in a traffic jam. There are accidents along the highway, but ambulances and police are conspicuously absent. When he decides to abandon his car and run home, he sees that the trees along the edge of a stream have been burnt, and the water in the streambed is gone. Something is very wrong.When he arrives hom On a searing summer Friday, Eddie Chapman has been stuck for hours in a traffic jam. There are accidents along the highway, but ambulances and police are conspicuously absent. When he decides to abandon his car and run home, he sees that the trees along the edge of a stream have been burnt, and the water in the streambed is gone. Something is very wrong.When he arrives home, the power is out and there is no running water. The pipes everywhere, it seems, have gone dry. Eddie and his wife, Laura, find themselves thrust together with their neighbors while a sense of unease thickens in the stifling night air. Thirst takes place in the immediate aftermath of a mysterious disaster--the Chapmans and their neighbors suffer the effects of the heat, their thirst, and the terrifying realization that no one is coming to help. As violence rips through the community, Eddie and Laura are forced to recall secrets from their past and question their present humanity. In crisp and convincing prose, Ben Warner compels readers to do the same. What might you do to survive? ...more" diff --git a/lab-web-scraping.ipynb b/lab-web-scraping.ipynb index e552783..fc572e6 100644 --- a/lab-web-scraping.ipynb +++ b/lab-web-scraping.ipynb @@ -110,15 +110,295 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 1, "id": "40359eee-9cd7-4884-bfa4-83344c222305", "metadata": { "id": "40359eee-9cd7-4884-bfa4-83344c222305" }, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Starting Books to Scrape Adventure Lab!\n", + "==================================================\n", + "Scraping page 1...\n", + "Found 1 books on page 1\n", + "Scraping page 2...\n", + "Found 2 books on page 2\n", + "Scraping page 3...\n", + "Found 3 books on page 3\n", + "Scraping page 4...\n", + "No books found on page 4 or reached end of catalog\n", + "\n", + "Scraping completed!\n", + "Total books found: 6\n", + "Genres found: ['Young Adult', 'Spirituality', 'Philosophy', 'Poetry', 'Sequential Art', 'Fiction']\n", + "Price range: £14.27 - £19.49\n", + "Rating range: 4.0 - 5.0 stars\n", + "\n", + "First 5 books meeting criteria (Rating ≥ 4.0, Price ≤ £20):\n", + "======================================================================\n", + " Title Price (£) Rating Genre Availability\n", + " Set Me Free 17.46 5.0 Young Adult In stock (19 available)\n", + "The Four Agreements: A Practical Guide to Personal Freedom 17.66 5.0 Spirituality In stock (18 available)\n", + " Sophie's World 15.94 5.0 Philosophy In stock (18 available)\n", + " Untitled Collection: Sabbath Poems 2014 14.27 4.0 Poetry In stock (16 available)\n", + " This One Summer 19.49 4.0 Sequential Art In stock (16 available)\n", + "\n", + "Data saved to 'filtered_books.csv'\n", + "\n", + "Summary Statistics:\n", + "Total books: 6\n", + "Average price: £17.01\n", + "Most common genre: Fiction\n", + "\n", + "==================================================\n", + "Testing with different parameters...\n", + "Scraping page 1...\n", + "Found 1 books on page 1\n", + "Scraping page 2...\n", + "Found 3 books on page 2\n", + "Scraping page 3...\n", + "Found 2 books on page 3\n", + "Scraping page 4...\n", + "Found 1 books on page 4\n", + "Scraping page 5...\n", + "Found 2 books on page 5\n", + "Scraping page 6...\n", + "No books found on page 6 or reached end of catalog\n", + "\n", + "Scraping completed!\n", + "Total books found: 9\n", + "Genres found: ['Young Adult', 'Spirituality', 'Thriller', 'Philosophy', 'Nonfiction', 'Fiction', 'Default', 'Sequential Art', 'Fantasy']\n", + "Price range: £13.34 - £23.82\n", + "Rating range: 5.0 - 5.0 stars\n", + "Found 9 books with rating ≥ 4.5 and price ≤ £25\n", + "Scraping page 1...\n", + "No books found on page 1 or reached end of catalog\n", + "\n", + "Scraping completed!\n", + "Total books found: 0\n" + ] + } + ], "source": [ - "# Your solution goes here" + "import requests\n", + "from bs4 import BeautifulSoup\n", + "import pandas as pd\n", + "import time\n", + "import re\n", + "\n", + "def scrape_books(min_rating=4.0, max_price=20.0):\n", + " \"\"\"\n", + " Scrape book data from Books to Scrape website with rating and price filters.\n", + " \n", + " Parameters:\n", + " min_rating (float): Minimum rating (1-5 stars)\n", + " max_price (float): Maximum price in pounds\n", + " \n", + " Returns:\n", + " pandas.DataFrame: Filtered book data with specified columns\n", + " \"\"\"\n", + " \n", + " base_url = \"https://books.toscrape.com/\"\n", + " all_books = []\n", + " \n", + " def get_rating_class(rating_text):\n", + " \"\"\"Convert rating text to numeric value\"\"\"\n", + " rating_map = {\n", + " 'One': 1.0,\n", + " 'Two': 2.0,\n", + " 'Three': 3.0,\n", + " 'Four': 4.0,\n", + " 'Five': 5.0\n", + " }\n", + " return rating_map.get(rating_text, 0.0)\n", + " \n", + " def scrape_page(page_url):\n", + " \"\"\"Scrape books from a single page\"\"\"\n", + " try:\n", + " response = requests.get(page_url)\n", + " response.raise_for_status()\n", + " soup = BeautifulSoup(response.content, 'html.parser')\n", + " books = soup.find_all('article', class_='product_pod')\n", + " \n", + " page_books = []\n", + " \n", + " for book in books:\n", + " try:\n", + " # Extract basic info from listing\n", + " title = book.h3.a['title']\n", + " price_text = book.find('p', class_='price_color').text\n", + " price = float(re.search(r'£([\\d.]+)', price_text).group(1))\n", + " \n", + " # Get rating\n", + " rating_class = book.p['class'][1]\n", + " rating = get_rating_class(rating_class)\n", + " \n", + " # Skip if doesn't meet filter criteria\n", + " if rating < min_rating or price > max_price:\n", + " continue\n", + " \n", + " # Get book detail page URL\n", + " book_url = book.h3.a['href']\n", + " if book_url.startswith('catalogue/'):\n", + " full_book_url = base_url + book_url\n", + " else:\n", + " full_book_url = base_url + 'catalogue/' + book_url\n", + " \n", + " # Scrape detailed information\n", + " book_details = scrape_book_details(full_book_url, title, price, rating)\n", + " if book_details:\n", + " page_books.append(book_details)\n", + " \n", + " # Be polite to the server\n", + " time.sleep(0.5)\n", + " \n", + " except Exception as e:\n", + " print(f\"Error processing book: {e}\")\n", + " continue\n", + " \n", + " return page_books\n", + " \n", + " except Exception as e:\n", + " print(f\"Error scraping page {page_url}: {e}\")\n", + " return []\n", + " \n", + " def scrape_book_details(book_url, title, price, rating):\n", + " \"\"\"Scrape detailed information from individual book page\"\"\"\n", + " try:\n", + " response = requests.get(book_url)\n", + " response.raise_for_status()\n", + " soup = BeautifulSoup(response.content, 'html.parser')\n", + " \n", + " # Extract UPC\n", + " upc = soup.find('th', string='UPC').find_next_sibling('td').text\n", + " \n", + " # Extract genre\n", + " genre_links = soup.find('ul', class_='breadcrumb').find_all('a')\n", + " genre = genre_links[2].text.strip() if len(genre_links) > 2 else \"Unknown\"\n", + " \n", + " # Extract availability\n", + " availability = soup.find('p', class_='availability').text.strip()\n", + " \n", + " # Extract description\n", + " description = \"\"\n", + " desc_meta = soup.find('meta', {'name': 'description'})\n", + " if desc_meta:\n", + " description = desc_meta['content'].strip()\n", + " else:\n", + " product_desc = soup.find('div', id='product_description')\n", + " if product_desc:\n", + " description = product_desc.find_next_sibling('p').text.strip()\n", + " \n", + " return {\n", + " 'UPC': upc,\n", + " 'Title': title,\n", + " 'Price (£)': price,\n", + " 'Rating': rating,\n", + " 'Genre': genre,\n", + " 'Availability': availability,\n", + " 'Description': description\n", + " }\n", + " \n", + " except Exception as e:\n", + " print(f\"Error scraping book details from {book_url}: {e}\")\n", + " return None\n", + " \n", + " # Start scraping from the first page\n", + " current_page = 1\n", + " while True:\n", + " if current_page == 1:\n", + " page_url = base_url + \"index.html\"\n", + " else:\n", + " page_url = base_url + f\"catalogue/page-{current_page}.html\"\n", + " \n", + " print(f\"Scraping page {current_page}...\")\n", + " page_books = scrape_page(page_url)\n", + " \n", + " if not page_books:\n", + " print(f\"No books found on page {current_page} or reached end of catalog\")\n", + " break\n", + " \n", + " all_books.extend(page_books)\n", + " print(f\"Found {len(page_books)} books on page {current_page}\")\n", + " \n", + " # Check if there's a next page\n", + " try:\n", + " response = requests.get(page_url)\n", + " soup = BeautifulSoup(response.content, 'html.parser')\n", + " next_button = soup.find('li', class_='next')\n", + " if not next_button:\n", + " break\n", + " except:\n", + " break\n", + " \n", + " current_page += 1\n", + " \n", + " # Create DataFrame\n", + " df = pd.DataFrame(all_books)\n", + " \n", + " # Display summary\n", + " print(f\"\\nScraping completed!\")\n", + " print(f\"Total books found: {len(all_books)}\")\n", + " if len(all_books) > 0:\n", + " print(f\"Genres found: {df['Genre'].unique().tolist()}\")\n", + " print(f\"Price range: £{df['Price (£)'].min():.2f} - £{df['Price (£)'].max():.2f}\")\n", + " print(f\"Rating range: {df['Rating'].min()} - {df['Rating'].max()} stars\")\n", + " \n", + " return df\n", + "\n", + "# Example usage and testing\n", + "if __name__ == \"__main__\":\n", + " print(\"Starting Books to Scrape Adventure Lab!\")\n", + " print(\"=\" * 50)\n", + " \n", + " # Scrape books with rating 4.0+ and price <= £20\n", + " books_df = scrape_books(min_rating=4.0, max_price=20.0)\n", + " \n", + " # Display the results\n", + " if not books_df.empty:\n", + " print(f\"\\nFirst 5 books meeting criteria (Rating ≥ 4.0, Price ≤ £20):\")\n", + " print(\"=\" * 70)\n", + " display_columns = ['Title', 'Price (£)', 'Rating', 'Genre', 'Availability']\n", + " print(books_df[display_columns].head().to_string(index=False))\n", + " \n", + " # Save to CSV for further analysis\n", + " books_df.to_csv('filtered_books.csv', index=False)\n", + " print(f\"\\nData saved to 'filtered_books.csv'\")\n", + " \n", + " # Additional analysis\n", + " print(f\"\\nSummary Statistics:\")\n", + " print(f\"Total books: {len(books_df)}\")\n", + " print(f\"Average price: £{books_df['Price (£)'].mean():.2f}\")\n", + " print(f\"Most common genre: {books_df['Genre'].mode().iloc[0]}\")\n", + " \n", + " else:\n", + " print(\"No books found matching the criteria.\")\n", + " \n", + " # Test with different parameters\n", + " print(\"\\n\" + \"=\" * 50)\n", + " print(\"Testing with different parameters...\")\n", + " \n", + " # Test 1: Higher rating threshold\n", + " high_rated_books = scrape_books(min_rating=4.5, max_price=25.0)\n", + " if not high_rated_books.empty:\n", + " print(f\"Found {len(high_rated_books)} books with rating ≥ 4.5 and price ≤ £25\")\n", + " \n", + " # Test 2: Lower price threshold\n", + " cheap_books = scrape_books(min_rating=3.0, max_price=10.0)\n", + " if not cheap_books.empty:\n", + " print(f\"Found {len(cheap_books)} books with rating ≥ 3.0 and price ≤ £10\")" ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c814c8db", + "metadata": {}, + "outputs": [], + "source": [] } ], "metadata": { @@ -126,7 +406,7 @@ "provenance": [] }, "kernelspec": { - "display_name": "Python 3 (ipykernel)", + "display_name": "base", "language": "python", "name": "python3" }, @@ -140,7 +420,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.9.13" + "version": "3.13.5" } }, "nbformat": 4,