This repository was archived by the owner on Mar 22, 2022. It is now read-only.

dataset updates #22

Merged
merged 2 commits into from
Sep 27, 2021

Conversation

katherinerosewolf
Contributor

@katherinerosewolf commented Aug 25, 2021

Day 1

  • Removed transphobic (Mrs. Doubtfire) references in example2.
  • Replaced the Jeopardy dataset (with overtly racist content like "Reds", "Braves") with wildfire dataset.
  • If I had more time I'd also replace the Amazon reviews and Trump tweets.
  • Austen text replaced by Harper text (to increase Black representation).

Day 2

  • Fixed some errors introduced by package updates (e.g., n_topics keyword argument replaced by n_components in LatentDirichletAllocation()).
  • Code updated to fix bugs occurring from deprecated keywords in packages revised since 2019.
  • Note: In the music data, almost all the musicians are white and the grouped reviews focus on white-led outlets. The top five distinctive-word results for "rap" are "blank", "waste", "amiable", "awesomely", and "joyless", while the top five for "indie" are "pppperfect", "wonderfull", "perfect", "meh", and "awesome".
  • Note: The children's literature dataset (childrens_lit.csv.bz2) also uses binary gender ("female", "male") and contains explicitly racist content (see topic 0 in the top words: "Topic Unsupervised #9:
    camp indian mountain rock hut arrows ha rocks mountains valley forest savage bushes stream deer gun animal lake meat animals"), but I couldn't figure out how to easily plug in another dataset for this kind of analysis. I wouldn't teach the module until the racist content is replaced and the analysis uses something other than binary gender, though.
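The keyword change noted above can be sketched as follows. The tiny document-term matrix here is synthetic, made up purely for illustration:

```python
# scikit-learn renamed the `n_topics` argument of
# LatentDirichletAllocation to `n_components` (deprecated in 0.19,
# removed in 0.21), which is the bug fix described in the Day 2 notes.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy document-term count matrix: 4 "documents" over 6 "words".
dtm = np.array([
    [2, 1, 0, 0, 3, 1],
    [0, 0, 4, 2, 0, 1],
    [3, 2, 0, 0, 1, 0],
    [0, 1, 3, 3, 0, 2],
])

# Old (pre-0.21) call:  LatentDirichletAllocation(n_topics=2)
# Current call:
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(dtm)  # shape: (n_documents, n_components)
print(doc_topics.shape)
```

Each row of `doc_topics` is a normalized topic distribution for one document, so the rows sum to 1.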

Day 3

  • Classification: Revised the spam identification exercise to remove the objectification of women in the example. Still uneasy about the wording (since I'd rather neither shame sexuality as dirty nor expose users to it without consent).
  • Word embeddings: Removed some of the sexism in the terms chosen at the end. I'd also be wary of reinforcing false binaries in gender categories.
  • Updated some code to fix bugs.
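The kind of analogy probe that surfaces these embedding biases can be sketched with the usual vector-offset method. The vectors below are tiny hand-made toys, not real embeddings:

```python
# Vector-offset analogy probe (a:b :: c:?), the query style that often
# exposes gender bias in trained embeddings. All vectors here are
# synthetic illustrations, not learned from any corpus.
import numpy as np

vectors = {
    "man":      np.array([1.0, 0.0, 0.2]),
    "woman":    np.array([0.0, 1.0, 0.2]),
    "king":     np.array([1.0, 0.1, 0.9]),
    "queen":    np.array([0.1, 1.0, 0.9]),
    "prince":   np.array([0.9, 0.0, 0.6]),
    "princess": np.array([0.0, 0.9, 0.6]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding the inputs."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = {w: cosine(target, v)
                  for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=candidates.get)

print(analogy("man", "king", "woman"))  # with these toy vectors: "queen"
```

With real pretrained embeddings, the same query pattern applied to occupation terms is what tends to return stereotyped completions, which is why the terms chosen for classroom demos matter.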

@aculich
Contributor

aculich commented Sep 23, 2021

@pattyf tagging you on this

@pssachdeva
Member

This PR has an abundance of useful fixes, so I am going to go ahead and merge it. I think we have some more work to do with regard to the word embeddings and the LDA in Day 2, but these could also be spun into a discussion about the importance of understanding biases in datasets, and of separating what a model tells you about your dataset from some underlying causal interaction.

@pssachdeva pssachdeva merged commit db97dfc into dlab-berkeley:main Sep 27, 2021
@Averysaurus

I second replacing the Trump tweets in the sentiment analysis portion @pssachdeva. Happy to search for alternative data sources. There are fantastic sentiment analyses of literary corpora, but a social media example is relevant and may be what we prefer.

@aculich
Contributor

aculich commented Jan 6, 2022

Here's one candidate to consider...

The SMILE Twitter Emotion dataset is an interesting one outside of the usual tech, politics, or covid tweets that are used so heavily in examples. This dataset also has relevance for DH and social science given the subject matter.

This dataset was collected and annotated for the SMILE project (http://www.culturesmile.org): a collection of tweets mentioning 13 Twitter handles associated with British museums, gathered between May 2013 and June 2015. It was created for the purpose of classifying emotions expressed on Twitter towards arts and cultural experiences in museums.

It contains 3,085 tweets annotated with five emotions: anger, disgust, happiness, surprise, and sadness. See the paper "SMILE: Twitter Emotion Classification using Domain Adaptation" for more details on the dataset.

And there's a recently created notebook Using BERT to do sentiment analysis on SMILE Twitter dataset.
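As a sketch of what a first-pass classifier on data like this might look like before reaching for a BERT-style model, here is a TF-IDF plus logistic regression baseline. The stand-in example rows are assumptions written in the spirit of the dataset, not real SMILE data, which would be loaded from the project's CSV instead:

```python
# Minimal baseline for emotion classification on museum-related tweets.
# The texts and labels below are invented stand-ins; in practice you
# would load the real SMILE CSV (text column + emotion label column).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "Loved the new exhibition at the museum today!",
    "Such a wonderful gallery visit, so happy",
    "The museum was closed, what a letdown",
    "Really disappointed by the long queues",
]
labels = ["happiness", "happiness", "sadness", "sadness"]

# TF-IDF features feeding a logistic regression classifier: a common
# first baseline to compare a fine-tuned BERT model against.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

pred = model.predict(["What a joyful afternoon at the gallery"])
print(pred)
```

With only a handful of rows the prediction is not meaningful; the point is the pipeline shape, which scales unchanged to the full 3,085-tweet, five-label dataset.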

@Averysaurus

Averysaurus commented Jan 7, 2022

This looks like really good source data to use, IMO. Going to run some preprocessing, NLP workshop stuff on it this week. @aculich @pssachdeva

@aculich
Contributor

aculich commented Jan 7, 2022

@Averysaurus that's great! If you get a chance to do that, it would be awesome to have Senior Fellows like yourself and others give a 2-5 minute "lightning talk" highlighting datasets like this one, and to gather feedback and suggestions for other datasets we could consider for our workshops, too.
