288 changes: 31 additions & 257 deletions your-code/main.ipynb
@@ -12,14 +12,14 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Import reduce from functools, numpy and pandas\n",
"from functools import reduce\n",
"import numpy\n",
"import pandas"
"import numpy as np\n",
"import pandas as pd"
]
},
{
@@ -29,319 +29,101 @@
"# Challenge 1 - Mapping\n",
"\n",
"#### We will use the map function to clean up words in a book.\n",
"\n",
"In the following cell, we will read a text file containing the book The Prophet by Khalil Gibran."
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Run this code:\n",
"\n",
"location = '../data/58585-0.txt'\n",
"with open(location, 'r', encoding=\"utf8\") as f:\n",
" prophet = f.read().split(' ')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"len(prophet)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Let's remove the first 568 words since they contain information about the book but are not part of the book itself. \n",
"\n",
"Do this by removing from `prophet` elements 0 through 567 of the list (you can also do this by keeping elements 568 through the last element)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# your code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you look through the words, you will find that many words have a reference attached to them. For example, let's look at words 1 through 10."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# your code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### The next step is to create a function that will remove references. \n",
"with open(location, 'r', encoding='utf8') as f:\n",
" prophet = f.read().split(' ')\n",
"\n",
"We will do this by splitting the string on the `{` character and keeping only the part before this character. Write your function below."
"# Remove the first 568 words (metadata) \n",
"prophet = prophet[568:]"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# Function to remove references\n",
"def reference(x):\n",
" '''\n",
" Input: A string\n",
" Output: The string with references removed\n",
" \n",
" Example:\n",
" Input: 'the{7}'\n",
" Output: 'the'\n",
" '''\n",
" \n",
" # your code here"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# your code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have our function, use the `map()` function to apply this function to our book, The Prophet. Return the resulting list to a new list called `prophet_reference`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# your code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another thing you may have noticed is that some words contain a line break. Let's write a function to split those words. Our function will return the string split on the character `\\n`. Write your function in the cell below."
" return x.split('{')[0]\n",
"\n",
"prophet_reference = list(map(reference, prophet))"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# Function to handle line breaks\n",
"def line_break(x):\n",
" '''\n",
" Input: A string\n",
" Output: A list of strings split on the line break (\\n) character\n",
" \n",
" Example:\n",
" Input: 'the\\nbeloved'\n",
" Output: ['the', 'beloved']\n",
" '''\n",
" \n",
" # your code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Apply the `line_break` function to the `prophet_reference` list. Name the new list `prophet_line`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# your code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you look at the elements of `prophet_line`, you will see that the function returned lists and not strings. Our list is now a list of lists. Flatten the list using list comprehension. Assign this new list to `prophet_flat`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"prophet_flat = [i for sub in prophet_line for i in sub]\n",
"prophet_flat"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# your code here"
" return x.split('\\n')\n",
"\n",
"prophet_line = list(map(line_break, prophet_reference))\n",
"prophet_flat = [i for sub in prophet_line for i in sub]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Challenge 2 - Filtering\n",
"\n",
"When printing out a few words from the book, we see that there are words that we may not want to keep if we choose to analyze the corpus of text. Below is a list of words that we would like to get rid of. Create a function that will return false if it contains a word from the list of words specified and true otherwise."
"Remove words like 'and', 'the', etc."
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"def word_filter(x):\n",
" '''\n",
" Input: A string\n",
" Output: True if the word is not in the specified list \n",
" and False if the word is in the list.\n",
" \n",
" Example:\n",
" word list = ['and', 'the']\n",
" Input: 'and'\n",
" Output: False\n",
" \n",
" Input: 'John'\n",
" Output: True\n",
" '''\n",
" \n",
" word_list = ['and', 'the', 'a', 'an']\n",
" \n",
" # your code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the `filter()` function to filter out the words specified in the `word_filter()` function. Store the filtered list in the variable `prophet_filter`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# your code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Bonus Challenge\n",
" return x not in word_list\n",
"\n",
"Rewrite the `word_filter` function above to not be case sensitive."
"prophet_filter = list(filter(word_filter, prophet_flat))"
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# Bonus: case insensitive\n",
"def word_filter_case(x):\n",
" \n",
" word_list = ['and', 'the', 'a', 'an']\n",
" \n",
" # your code here"
" return x.lower() not in word_list"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Challenge 3 - Reducing\n",
"\n",
"#### Now that we have significantly cleaned up our text corpus, let's use the `reduce()` function to put the words back together into one long string separated by spaces. \n",
"\n",
"We will start by writing a function that takes two strings and concatenates them together with a space between the two strings."
"Concatenate all words into a single string."
]
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"def concat_space(a, b):\n",
" '''\n",
" Input:Two strings\n",
" Output: A single string separated by a space\n",
" \n",
" Example:\n",
" Input: 'John', 'Smith'\n",
" Output: 'John Smith'\n",
" '''\n",
" \n",
" # your code here"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# your code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the function above to reduce the text corpus in the list `prophet_filter` into a single string. Assign this new string to the variable `prophet_string`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# your code here"
" return a + ' ' + b\n",
"\n",
"prophet_string = reduce(concat_space, prophet_filter)"
]
}
],
@@ -352,15 +134,7 @@
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.13"
}
},
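Review note: the map, flatten, filter, and reduce steps added in this change can be sanity-checked without the book file. The sketch below reuses the function names from the notebook (`reference`, `line_break`, `word_filter_case`, `concat_space`) on a small in-memory list. The `sample` data and the `STOP_WORDS` name are assumptions made for illustration and are not part of the notebook.

```python
from functools import reduce

# Hypothetical sample standing in for the words read from the book file.
sample = ['The{1}', 'prophet\nspoke', 'and', 'a', 'ship{2}', 'came']

# Same four words as word_list in the notebook; the constant name is an assumption.
STOP_WORDS = ['and', 'the', 'a', 'an']

def reference(x):
    # Keep only the part of the word before the '{' reference marker.
    return x.split('{')[0]

def line_break(x):
    # Split a word on embedded newlines; always returns a list.
    return x.split('\n')

def word_filter_case(x):
    # True when the word, lower-cased, is not a stop word.
    return x.lower() not in STOP_WORDS

def concat_space(a, b):
    # Join two strings with a single space.
    return a + ' ' + b

no_refs = list(map(reference, sample))           # strip references
lines = list(map(line_break, no_refs))           # list of lists
flat = [word for sub in lines for word in sub]   # flatten
kept = list(filter(word_filter_case, flat))      # drop stop words
result = reduce(concat_space, kept)              # rebuild one string

print(result)  # prophet spoke ship came
```

Two things worth keeping in mind: `reduce` without an initial value raises `TypeError` on an empty list, and for this particular reducer `' '.join(kept)` produces the same string.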