From 6eb92d730f81c366658b0594e55682fbf91d23dd Mon Sep 17 00:00:00 2001 From: Rodrigo Mendes Date: Tue, 16 Sep 2025 20:34:13 +0100 Subject: [PATCH] Update main.ipynb --- your-code/main.ipynb | 288 +++++-------------------------------------- 1 file changed, 31 insertions(+), 257 deletions(-) diff --git a/your-code/main.ipynb b/your-code/main.ipynb index 9f0e67b..dc60fff 100644 --- a/your-code/main.ipynb +++ b/your-code/main.ipynb @@ -12,14 +12,14 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Import reduce from functools, numpy and pandas\n", "from functools import reduce\n", - "import numpy\n", - "import pandas" + "import numpy as np\n", + "import pandas as pd" ] }, { @@ -29,187 +29,48 @@ "# Challenge 1 - Mapping\n", "\n", "#### We will use the map function to clean up words in a book.\n", - "\n", "In the following cell, we will read a text file containing the book The Prophet by Khalil Gibran." ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 2, "metadata": {}, "outputs": [], "source": [ - "# Run this code:\n", - "\n", "location = '../data/58585-0.txt'\n", - "with open(location, 'r', encoding=\"utf8\") as f:\n", - " prophet = f.read().split(' ')" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "len(prophet)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Let's remove the first 568 words since they contain information about the book but are not part of the book itself. \n", - "\n", - "Do this by removing from `prophet` elements 0 through 567 of the list (you can also do this by keeping elements 568 through the last element)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# your code here" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "If you look through the words, you will find that many words have a reference attached to them. For example, let's look at words 1 through 10." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# your code here" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### The next step is to create a function that will remove references. \n", + "with open(location, 'r', encoding='utf8') as f:\n", + " prophet = f.read().split(' ')\n", "\n", - "We will do this by splitting the string on the `{` character and keeping only the part before this character. Write your function below." + "# Remove the first 568 words (metadata) \n", + "prophet = prophet[568:]" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 3, "metadata": {}, "outputs": [], "source": [ + "# Function to remove references\n", "def reference(x):\n", - " '''\n", - " Input: A string\n", - " Output: The string with references removed\n", - " \n", - " Example:\n", - " Input: 'the{7}'\n", - " Output: 'the'\n", - " '''\n", - " \n", - " # your code here" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# your code here" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now that we have our function, use the `map()` function to apply this function to our book, The Prophet. Return the resulting list to a new list called `prophet_reference`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# your code here" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Another thing you may have noticed is that some words contain a line break. Let's write a function to split those words. Our function will return the string split on the character `\\n`. Write your function in the cell below." + " return x.split('{')[0]\n", + "\n", + "prophet_reference = list(map(reference, prophet))" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 4, "metadata": {}, "outputs": [], "source": [ + "# Function to handle line breaks\n", "def line_break(x):\n", - " '''\n", - " Input: A string\n", - " Output: A list of strings split on the line break (\\n) character\n", - " \n", - " Example:\n", - " Input: 'the\\nbeloved'\n", - " Output: ['the', 'beloved']\n", - " '''\n", - " \n", - " # your code here" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Apply the `line_break` function to the `prophet_reference` list. Name the new list `prophet_line`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "# your code here" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "If you look at the elements of `prophet_line`, you will see that the function returned lists and not strings. Our list is now a list of lists. Flatten the list using list comprehension. Assign this new list to `prophet_flat`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "prophet_flat = [i for sub in prophet_line for i in sub]\n", - "prophet_flat" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# your code here" + " return x.split('\\n')\n", + "\n", + "prophet_line = list(map(line_break, prophet_reference))\n", + "prophet_flat = [i for sub in prophet_line for i in sub]" ] }, { @@ -217,74 +78,32 @@ "metadata": {}, "source": [ "# Challenge 2 - Filtering\n", - "\n", - "When printing out a few words from the book, we see that there are words that we may not want to keep if we choose to analyze the corpus of text. Below is a list of words that we would like to get rid of. Create a function that will return false if it contains a word from the list of words specified and true otherwise." + "Remove words like 'and', 'the', etc." ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def word_filter(x):\n", - " '''\n", - " Input: A string\n", - " Output: True if the word is not in the specified list \n", - " and False if the word is in the list.\n", - " \n", - " Example:\n", - " word list = ['and', 'the']\n", - " Input: 'and'\n", - " Output: False\n", - " \n", - " Input: 'John'\n", - " Output: True\n", - " '''\n", - " \n", " word_list = ['and', 'the', 'a', 'an']\n", - " \n", - " # your code here" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Use the `filter()` function to filter out the words speficied in the `word_filter()` function. Store the filtered list in the variable `prophet_filter`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "# your code here" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Bonus Challenge\n", + " return x not in word_list\n", "\n", - "Rewrite the `word_filter` function above to not be case sensitive." + "prophet_filter = list(filter(word_filter, prophet_flat))" ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 6, "metadata": {}, "outputs": [], "source": [ + "# Bonus: case insensitive\n", "def word_filter_case(x):\n", - " \n", " word_list = ['and', 'the', 'a', 'an']\n", - " \n", - " # your code here" + " return x.lower() not in word_list" ] }, { @@ -292,56 +111,19 @@ "metadata": {}, "source": [ "# Challenge 3 - Reducing\n", - "\n", - "#### Now that we have significantly cleaned up our text corpus, let's use the `reduce()` function to put the words back together into one long string separated by spaces. \n", - "\n", - "We will start by writing a function that takes two strings and concatenates them together with a space between the two strings." + "Concatenate all words into a single string." ] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "def concat_space(a, b):\n", - " '''\n", - " Input:Two strings\n", - " Output: A single string separated by a space\n", - " \n", - " Example:\n", - " Input: 'John', 'Smith'\n", - " Output: 'John Smith'\n", - " '''\n", - " \n", - " # your code here" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "# your code here" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Use the function above to reduce the text corpus in the list `prophet_filter` into a single string. Assign this new string to the variable `prophet_string`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# your code here" + " return a + ' ' + b\n", + "\n", + "prophet_string = reduce(concat_space, prophet_filter)" ] } ], @@ -352,15 +134,7 @@ "name": "python3" }, "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", "version": "3.9.13" } },