This repository was archived by the owner on Mar 22, 2022. It is now read-only.

dataset updates #22

Merged 2 commits on Sep 27, 2021
29,137 changes: 28,857 additions & 280 deletions day-1/01-preprocessing-solutions.ipynb

Large diffs are not rendered by default.

249 changes: 86 additions & 163 deletions day-1/01-preprocessing.ipynb

Large diffs are not rendered by default.

1,401 changes: 33 additions & 1,368 deletions day-1/Example_trump.ipynb

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions day-1/data/example2.txt
@@ -1,3 +1,3 @@
-In this little example, we're going to see some of the problems that regularly appear in tokenization. Tokenization may seem simple, but it's harder than it first appears. Why is it so hard? Punctuations, contractions (like don't, won't and would've) get in the way.
+In this little example, we're going to see some of the problems that regularly appear in tokenization. Tokenization may seem simple, but it's harder than it first appears. Why is it so hard? Punctuations, contractions (like don't, won't, and would've) get in the way.

-We can split text into sentences using punctuation, but unfortunately that's not always going to work. For example, if I wanted to tell you about Dr. Frankenstein, or Mrs. Doubtfire, we'd be in trouble. What if I wanted to write about U.C. Berkeley? When you think about it, URLs like www.google.com are troublesome too. How would we settle on a price of $10.50? The main point is that these punctuation characters serve a variety of purposes in writing. Moreover, the functions they serve change depending on the domain (medical vs forum text) and language.
+We can split text into sentences using punctuation, but unfortunately that's not always going to work. For example, if I wanted to tell you about Dr. Bailey, or Ms. Ndegeocello, we'd be in trouble. What if I wanted to write about U.C. Berkeley? When you think about it, URLs like www.google.com are troublesome too. How would we settle on a price of $10.50? The main point is that these punctuation characters serve a variety of purposes in writing. Moreover, the functions they serve change depending on the domain (medical vs forum text) and language.
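
The sentence-splitting problem that example2.txt describes can be reproduced in a few lines. The sketch below is illustrative only and is not taken from the workshop notebooks: the function names and the `ABBREVIATIONS` set are hypothetical, and the sample text is a shortened excerpt of the updated file. It contrasts a naive punctuation-based split with one that skips a small, hand-picked list of abbreviations.

```python
import re

# Shortened excerpt of the updated day-1/data/example2.txt shown in this diff.
TEXT = (
    "For example, if I wanted to tell you about Dr. Bailey, or Ms. Ndegeocello, "
    "we'd be in trouble. What if I wanted to write about U.C. Berkeley? "
    "How would we settle on a price of $10.50?"
)


def naive_split(text):
    """Split after every '.', '?' or '!' that is followed by whitespace."""
    return re.split(r"(?<=[.?!])\s+", text.strip())


# Hypothetical abbreviation list, for illustration only; a real one would be
# much longer and would depend on the domain and language.
ABBREVIATIONS = {"Dr.", "Ms.", "Mrs.", "U.C."}


def abbreviation_aware_split(text):
    """Split on sentence-final punctuation unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".?!" and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text that lacks final punctuation
        sentences.append(" ".join(current))
    return sentences


if __name__ == "__main__":
    # The naive split breaks after "Dr.", "Ms." and "U.C."
    for sentence in naive_split(TEXT):
        print("naive:", sentence)
    # The abbreviation-aware split keeps those tokens inside their sentences.
    for sentence in abbreviation_aware_split(TEXT):
        print("aware:", sentence)
```

On this excerpt the naive split yields six fragments, while the abbreviation-aware version yields the three intended sentences. Neither version treats the period in `$10.50` as a boundary, because no whitespace follows it. A fixed abbreviation list is still fragile (an abbreviation can also end a sentence), which is the kind of ambiguity the example text points at.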
1 change: 1 addition & 0 deletions day-1/data/fires.json

Large diffs are not rendered by default.

3,920 changes: 3,920 additions & 0 deletions day-1/data/harper/minnies-sacrifice.txt

Large diffs are not rendered by default.
