This repository was archived by the owner on Mar 22, 2022. It is now read-only.

dataset updates #22

Merged
merged 2 commits into from
Sep 27, 2021

Conversation

katherinerosewolf
Contributor

@katherinerosewolf commented Aug 25, 2021

Day 1

  • Removed transphobic (Mrs. Doubtfire) references in example2.
  • Replaced the Jeopardy dataset (with overtly racist content like "Reds", "Braves") with wildfire dataset.
  • If I had more time I'd also replace the Amazon reviews and Trump tweets.
  • Austen text replaced by Harper text (to increase Black representation).

Day 2

  • Fixed some errors introduced by package updates (e.g., n_topics keyword argument replaced by n_components in LatentDirichletAllocation()).
  • Code updated to fix bugs occurring from deprecated keywords in packages revised since 2019.
  • Note: In the music data, almost all the musicians are white and the grouped reviews focus on white-led outlets. The top five distinctive-word results for "rap" are "blank", "waste", "amiable", "awesomely", and "joyless", while the top five for "indie" are "pppperfect", "wonderfull", "perfect", "meh", and "awesome".
  • Note: The children's literature dataset (childrens_lit.csv.bz2) also uses binary gender ("female", "male") and contains explicitly racist content (see topic 0 in the top words: "Topic Unsupervised #9:
    camp indian mountain rock hut arrows ha rocks mountains valley forest savage bushes stream deer gun animal lake meat animals"), but I couldn't figure out how to easily plug in another dataset for this kind of analysis. I wouldn't teach the module until the racist content is replaced and the analysis uses something other than binary gender, though.
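The keyword change noted above can be sketched as follows. The tiny document-term matrix here is synthetic, made up purely for illustration:

```python
# scikit-learn renamed the `n_topics` argument of
# LatentDirichletAllocation to `n_components` (deprecated in 0.19,
# removed in 0.21), which is the bug fix described in the Day 2 notes.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy document-term count matrix: 4 "documents" over 6 "words".
dtm = np.array([
    [2, 1, 0, 0, 3, 1],
    [0, 0, 4, 2, 0, 1],
    [3, 2, 0, 0, 1, 0],
    [0, 1, 3, 3, 0, 2],
])

# Old (pre-0.21) call:  LatentDirichletAllocation(n_topics=2)
# Current call:
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(dtm)  # shape: (n_documents, n_components)
print(doc_topics.shape)
```

Each row of `doc_topics` is a normalized topic distribution for one document, so the rows sum to 1.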

Day 3

  • Classification: Revised the spam identification exercise to remove the objectification of women in the example. Still uneasy about the wording (since I'd rather neither shame sexuality as dirty nor expose users to it without consent).
  • Word embeddings: Removed some of the sexism in the terms chosen at the end. I'd also be wary of reinforcing false binaries in gender categories.
  • Updated some code to fix bugs.
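The kind of analogy probe that surfaces these embedding biases can be sketched with the usual vector-offset method. The vectors below are tiny hand-made toys, not real embeddings:

```python
# Vector-offset analogy probe (a:b :: c:?), the query style that often
# exposes gender bias in trained embeddings. All vectors here are
# synthetic illustrations, not learned from any corpus.
import numpy as np

vectors = {
    "man":      np.array([1.0, 0.0, 0.2]),
    "woman":    np.array([0.0, 1.0, 0.2]),
    "king":     np.array([1.0, 0.1, 0.9]),
    "queen":    np.array([0.1, 1.0, 0.9]),
    "prince":   np.array([0.9, 0.0, 0.6]),
    "princess": np.array([0.0, 0.9, 0.6]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding the inputs."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = {w: cosine(target, v)
                  for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=candidates.get)

print(analogy("man", "king", "woman"))  # with these toy vectors: "queen"
```

With real pretrained embeddings, the same query pattern applied to occupation terms is what tends to return stereotyped completions, which is why the terms chosen for classroom demos matter.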

@aculich
Contributor

aculich commented Sep 23, 2021

@pattyf tagging you on this

@pssachdeva
Member

This PR has an abundance of useful fixes, so I am going to go ahead and merge it. I think we have some more work to do with regard to the word embeddings and the LDA in Day 2, but these could also be spun into a discussion about the importance of understanding biases in datasets, and of separating what a model tells you about your dataset from some underlying causal interaction.

@pssachdeva pssachdeva merged commit db97dfc into dlab-berkeley:main Sep 27, 2021
@Averysaurus

I second replacing the Trump tweets in the sentiment analysis portion @pssachdeva. Happy to search for alternative data sources. There are fantastic sentiment analyses of literary corpora, but a social media example is relevant and may be what we prefer.

@aculich
Contributor

aculich commented Jan 6, 2022

Here's one candidate to consider...

The SMILE Twitter Emotion dataset is an interesting one outside of the usual tech, politics, or covid tweets that are used so heavily in examples. This dataset also has relevance for DH and social science given the subject matter.

This dataset was collected and annotated for the SMILE project (http://www.culturesmile.org): a collection of tweets mentioning 13 Twitter handles associated with British museums, gathered between May 2013 and June 2015. It was created for the purpose of classifying emotions expressed on Twitter towards arts and cultural experiences in museums.

It contains 3,085 tweets annotated with five emotions: anger, disgust, happiness, surprise, and sadness. See the paper "SMILE: Twitter Emotion Classification using Domain Adaptation" for more details on the dataset.

And there's a recently created notebook Using BERT to do sentiment analysis on SMILE Twitter dataset.
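As a sketch of what a first-pass classifier on data like this might look like before reaching for a BERT-style model, here is a TF-IDF plus logistic regression baseline. The stand-in example rows are assumptions written in the spirit of the dataset, not real SMILE data, which would be loaded from the project's CSV instead:

```python
# Minimal baseline for emotion classification on museum-related tweets.
# The texts and labels below are invented stand-ins; in practice you
# would load the real SMILE CSV (text column + emotion label column).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "Loved the new exhibition at the museum today!",
    "Such a wonderful gallery visit, so happy",
    "The museum was closed, what a letdown",
    "Really disappointed by the long queues",
]
labels = ["happiness", "happiness", "sadness", "sadness"]

# TF-IDF features feeding a logistic regression classifier: a common
# first baseline to compare a fine-tuned BERT model against.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

pred = model.predict(["What a joyful afternoon at the gallery"])
print(pred)
```

With only a handful of rows the prediction is not meaningful; the point is the pipeline shape, which scales unchanged to the full 3,085-tweet, five-label dataset.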

@Averysaurus

Averysaurus commented Jan 7, 2022

This looks like really good source data to use, IMO. Going to run some preprocessing, NLP workshop stuff on it this week. @aculich @pssachdeva

@aculich
Contributor

aculich commented Jan 7, 2022

@Averysaurus that's great! If you get a chance to do that, it would be awesome to have Senior Fellows like yourself and others give a 2-5 minute "lightning talk" highlighting datasets like this one, and to gather feedback and suggestions for other datasets we could consider for our workshops, too.
