This repository was archived by the owner on Mar 22, 2022. It is now read-only.

Commit db97dfc

Merge pull request #22 from katherinerosewolf/master
dataset updates
2 parents c61a24b + 2a213b9 commit db97dfc

17 files changed: 59632 additions and 4398 deletions.

day-1/01-preprocessing-solutions.ipynb

Lines changed: 28857 additions & 280 deletions
Large diffs are not rendered by default.

day-1/01-preprocessing.ipynb

Lines changed: 86 additions & 163 deletions
Large diffs are not rendered by default.

day-1/Example_trump.ipynb

Lines changed: 33 additions & 1368 deletions
Large diffs are not rendered by default.

day-1/data/example2.txt

Lines changed: 2 additions & 2 deletions
@@ -1,3 +1,3 @@
-In this little example, we're going to see some of the problems that regularly appear in tokenization. Tokenization may seem simple, but it's harder than it first appears. Why is it so hard? Punctuations, contractions (like don't, won't and would've) get in the way.
+In this little example, we're going to see some of the problems that regularly appear in tokenization. Tokenization may seem simple, but it's harder than it first appears. Why is it so hard? Punctuations, contractions (like don't, won't, and would've) get in the way.
 
-We can split text into sentences using punctuation, but unfortunately that's not always going to work. For example, if I wanted to tell you about Dr. Frankenstein, or Mrs. Doubtfire, we'd be in trouble. What if I wanted to write about U.C. Berkeley? When you think about it, URLs like www.google.com are troublesome too. How would we settle on a price of $10.50? The main point is that these punctuation characters serve a variety of purposes in writing. Moreover, the functions they serve change depending on the domain (medical vs forum text) and language.
+We can split text into sentences using punctuation, but unfortunately that's not always going to work. For example, if I wanted to tell you about Dr. Bailey, or Ms. Ndegeocello, we'd be in trouble. What if I wanted to write about U.C. Berkeley? When you think about it, URLs like www.google.com are troublesome too. How would we settle on a price of $10.50? The main point is that these punctuation characters serve a variety of purposes in writing. Moreover, the functions they serve change depending on the domain (medical vs forum text) and language.
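
The sentence-splitting problem that example2.txt describes can be reproduced with a short sketch. This is not code from the repository or its notebooks. The function names and the `ABBREVIATIONS` set are hypothetical and chosen only for illustration, and the sample string is a shortened excerpt of the updated file text. It compares a naive split on sentence-ending punctuation with a split that skips a small, hand-picked list of abbreviations.

```python
import re

# Shortened excerpt of the updated example2.txt text.
text = (
    "For example, if I wanted to tell you about Dr. Bailey, or Ms. Ndegeocello, "
    "we'd be in trouble. What if I wanted to write about U.C. Berkeley? "
    "How would we settle on a price of $10.50?"
)

# Hypothetical, hand-picked list; a real tokenizer needs a far larger one.
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "Ms.", "U.C."}


def naive_split(text):
    """Split after every '.', '?' or '!' that is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.?!])\s+", text) if s]


def abbreviation_aware_split(text):
    """End a sentence at '.', '?' or '!' unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".?!" and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep trailing text that lacks end punctuation
        sentences.append(" ".join(current))
    return sentences


if __name__ == "__main__":
    for sentence in naive_split(text):
        print("naive:", sentence)
    for sentence in abbreviation_aware_split(text):
        print("aware:", sentence)
```

The naive version breaks after "Dr.", "Ms." and "U.C.", while the price $10.50 survives in both versions only because its period is not followed by whitespace. A fixed abbreviation list is brittle and, as the file text notes, depends on domain and language.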

day-1/data/fires.json

Lines changed: 1 addition & 0 deletions
Large diffs are not rendered by default.

day-1/data/harper/minnies-sacrifice.txt

Lines changed: 3920 additions & 0 deletions
Large diffs are not rendered by default.
