WeSearch_StarSem
Some notes on the StarSEM 2012 shared task. I've used annotation conventions similar to our previous work, with <> for cues, {} for scope and now [] for events. For papers, though, we should probably follow Morante et al.'s (2011) conventions of bold for cues, underline for scope and italic for events.
- Entire sentence, except initial/final punctuation: P=64.20 R=17.59 F1=27.61
- From cue to left and right punctuation or sentence boundary: P=97.95, R=32.36, F1=48.65
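The second baseline above can be sketched roughly as follows. This is a minimal illustration, not the code that produced the reported figures: the function names are hypothetical, and it assumes that punctuation tokens are those made up entirely of ASCII punctuation characters and that the cue itself is left out of the predicted scope. The `f1` helper just reproduces the harmonic mean used for the F1 columns.

```python
import string


def is_punct(token):
    """True when the token is non-empty and made up only of punctuation."""
    return bool(token) and all(ch in string.punctuation for ch in token)


def baseline_scope(tokens, cue_index):
    """Indices from the cue out to the nearest punctuation token, or the
    sentence boundary, on each side. Assumption: the cue is excluded."""
    left = cue_index
    while left > 0 and not is_punct(tokens[left - 1]):
        left -= 1
    right = cue_index
    while right < len(tokens) - 1 and not is_punct(tokens[right + 1]):
        right += 1
    return [i for i in range(left, right + 1) if i != cue_index]


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)
```

For example, `round(f1(64.20, 17.59), 2)` gives 27.61 and `round(f1(97.95, 32.36), 2)` gives 48.65, matching the two baseline lines.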
The files are provided in CoNLL format, with the first 7 columns corresponding to:
- (1) Book_Chapter
- (2) sentence number within chapter
- (3) token number within sentence
- (4) word
- (5) lemma
- (6) part-of-speech
- (7) syntax
If the sentence does not have negations:
- (8) the literal string ***
Otherwise there are three columns per negation:
- (8,11,14, ...) word (or part of word) that is part of the cue
- (9,12,15, ...) word (or part of word) that is part of the scope
- (10,13,16, ...) word (or part of word) that is part of the event
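A reader for this column layout might look like the sketch below. Several details are assumptions not stated in these notes: that columns are tab-separated, that sentences are separated by blank lines, and that an empty cue/scope/event cell is written as `_`. The function names are hypothetical.

```python
def split_sentences(text):
    """Group token lines into sentences at blank lines (assumed separator).
    Columns are assumed to be tab-separated."""
    sentences, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line.split("\t"))
        elif current:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def read_negations(rows, empty="_"):
    """Return one dict per negation in a sentence, each holding lists of
    (token number, string) pairs for the cue, scope and event columns."""
    if not rows or len(rows[0]) <= 7 or rows[0][7] == "***":
        return []  # sentence without negation
    n_neg = (len(rows[0]) - 7) // 3  # three columns per negation
    negations = [{"cue": [], "scope": [], "event": []} for _ in range(n_neg)]
    for row in rows:
        for k, neg in enumerate(negations):
            cells = row[7 + 3 * k : 10 + 3 * k]
            for key, value in zip(("cue", "scope", "event"), cells):
                if value != empty:  # assumed marker for an empty cell
                    neg[key].append((int(row[2]), value))
    return negations
```

Keeping the token number alongside each string makes it easy to test later whether a scope is continuous or bridged by the cue.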
3,644 sentences with 986 instances of negation.
98 instances have no scope; 93 instances have a discontinuous scope that is not bridged by the cue.
Of the remaining 795 instances, 439 are aligned with some constituent in the C&J parses. Applying our BioScope slackening heuristics:
- constituent final punctuation (+93)
- constituent initial punctuation (+37)
- initial adverbs when not the cue (+9)
- scope starts with cue when it is a noun (-1)
- scope does not start with an auxiliary (-6)
Applying only the beneficial heuristics leaves us with an alignment rate of 72.7%.
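The 72.7% figure can be reproduced from the counts above if the per-heuristic gains are read as additive, which is an assumption of this sketch rather than something the notes state. The dictionary keys are shorthand labels for the heuristics listed above.

```python
# Counts taken from the notes above; treating the gains as additive is an
# assumption (the heuristics could in principle interact).
continuous_instances = 795
aligned_without_heuristics = 439
gains = {
    "constituent final punctuation": 93,
    "constituent initial punctuation": 37,
    "initial adverbs when not the cue": 9,
    "scope starts with cue when it is a noun": -1,
    "scope does not start with an auxiliary": -6,
}

# Keep only heuristics that increase the number of aligned scopes.
beneficial = sum(gain for gain in gains.values() if gain > 0)
aligned = aligned_without_heuristics + beneficial
alignment_rate = 100 * aligned / continuous_instances
```

Here `aligned` comes to 578 and `round(alignment_rate, 1)` to 72.7.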
The following are listings of instances with no scope, discontinuous scope, scope that is not aligned with a constituent, and scope that is aligned with a constituent. In each listing, instances are delimited by double newlines. The first line is an instance identifier composed of: chapter <TAB> sentence in chapter <TAB> negation in sentence. The second line is the tokens of the sentence, where cues are indicated with < >, scope with { }, and the most specific subsuming constituent of the scope with _ _.
Additional slackening heuristics for CD:
- if a non-NP node on path from cue to subsumer has a sibling CC, scope starts before CC
- move in from initial CC, UH, ADVP or INTJ
Current alignment rate is 81.6% of continuous scopes.
| TRAINING | | | DEVELOPMENT | | |
| Freq. | Cue | PoS | Freq. | Cue | PoS |
| 326 | not | RB | 39 | not | RB |
| 139 | no | DT | 27 | no | DT |
| 72 | un* | JJ | 20 | n't | RB |
| 65 | n't | RB | 16 | nothing | NN |
| 59 | never | RB | 12 | un* | JJ |
| 59 | no | UH | 11 | never | RB |
| 55 | nothing | NN | 7 | without | IN |
| 27 | no | RB | 5 | im* | JJ |
| 24 | without | IN | 4 | nor | CC |
| 22 | *not | RB | 4 | in* | JJ |
| 20 | *less | JJ | 4 | no | UH |
| 17 | in* | JJ | 3 | un* | RB |
| 16 | im* | JJ | 3 | *less | JJ |
| 12 | none | NN | 2 | neither_*_nor | DT_*_CC |
| 6 | nor | CC | 1 | nobody | NN |
| 4 | in* | RB | 1 | in* | RB |
| 4 | un* | RB | 1 | dis* | VBN |
| 4 | *less* | RB | 1 | dis* | NN |
| 4 | ir* | JJ | 1 | save | IN |
| 3 | *less | NN | 1 | ir* | JJ |
| 3 | dis* | NN | 1 | *not | NN |
| 2 | im* | RB | 1 | no_*_nor | DT_*_CC |
| 2 | nowhere | RB | 1 | *n* | RB |
| 2 | neither_*_nor | DT_*_CC | 1 | more | JJR |
| 2 | *not | NN | 1 | im* | NN |
| 2 | *not | VBD | 1 | neither | DT |
| 2 | prevent | VB | 1 | no_more | DT_RBR |
| 2 | *not | VBP | 1 | by_no_means | IN_DT_NNS |
| 2 | on_the_contrary | IN_DT_NN | 1 | un* | VBN |
| 2 | by_no_means | IN_DT_NNS | | | |
| 1 | rather_than | RB_IN | | | |
| 1 | by_no_means | IN_DT_VBZ | | | |
| 1 | nobody | NN | | | |
| 1 | ir* | RB | | | |
| 1 | fail | VBP | | | |
| 1 | no* | NN | | | |
| 1 | un | IN | | | |
| 1 | absence | NN | | | |
| 1 | nothing_at_all | NN_IN_DT | | | |
| 1 | neglected | VBN | | | |
| 1 | dis* | VBN | | | |
| 1 | refused | VBD | | | |
| 1 | no | NNP | | | |
| 1 | in* | NNS | | | |
| 1 | un* | IN | | | |
| 1 | ir* | NN | | | |
| 1 | not_the | RB_DT | | | |
| 1 | not_for_the_world | RB_IN_DT_NN | | | |
| 1 | save | VB | | | |
| 1 | except | VB | | | |
| 1 | *less* | JJ | | | |
| 1 | unusual | JJ | | | |
| 1 | *less* | NN | | | |
| 1 | un* | NN | | | |
| 1 | dis* | JJ | | | |
| 1 | not_*_not | RB_*_RB | | | |
| 1 | un* | VBN | | | |
3,640 sentences with 989 instances of negation.
99 instances have no scope; 92 instances have a discontinuous scope that is not bridged by the cue.
Of the remaining 798 instances, 80 are aligned with some constituent. Applying our BioScope slackening heuristics:
- constituent final punctuation (+437)
- constituent initial punctuation (+70)
- initial adverbs when not the cue (+8)
- scope starts with cue when it is a noun (+2)
- scope does not start with an auxiliary (-9)
Applying only the beneficial heuristics leaves us with an alignment rate of 73.4%.
...
371 instances have no event; 14 instances have discontinuous events. In 6 instances the event lies outside of the scope; these seem to be annotation errors:
- ... only {an} <un>[ambitious] {one who abandons a London career for the country} ...
- ... {an} <un>[justifiable] {intrusion}, ...
- {It} <never> [recovered] {from the blow}, ...
- "But {I} [can]<'t> {forget them}, Miss Stapleton," said I.
- ... and means to [spare] <no> {pains or expense} to restore the grandeur of his family.
- Coming down with an <un>[signed] {warrant}.
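A check for this kind of case can be run directly on the bracketed notation used in these notes. The sketch below is illustrative only: the function name is hypothetical, and it works on the `<cue> {scope} [event]` rendering rather than on the CoNLL columns, flagging every event that opens while no scope span is open.

```python
def events_outside_scope(annotated):
    """Return the [event] strings that open while no {scope} span is open."""
    depth = 0            # number of currently open { } spans
    in_event = False     # are we between [ and ] ?
    opened_in_scope = False
    current = []
    outside = []
    for ch in annotated:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth = max(0, depth - 1)  # tolerate an unmatched closing brace
        elif ch == "[":
            in_event, opened_in_scope, current = True, depth > 0, []
        elif ch == "]":
            if in_event and not opened_in_scope:
                outside.append("".join(current))
            in_event = False
        elif in_event:
            current.append(ch)
    return outside
```

On the third instance listed above, `events_outside_scope("{It} <never> [recovered] {from the blow}, ...")` returns `["recovered"]`, while a string such as `"{I [can]}<'t> {get it clear yet}"` returns an empty list.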
...
Collins' coverage of the training data is 99.4% (21 of 3,640 sentences are not covered). In those 21 there are 10 instances of negation, for example:
- "Know then that in the time of the Great Rebellion (the history of which by the learned Lord Clarendon I most earnestly commend to your attention) this Manor of Baskerville has held by Hugo of that name, nor <can> {[it] be gainsaid that he was a most wild, profane, and {[god]}<less> man}.
| Training | | Development | |
| Freq. | Word | Freq. | Word |
| 35 | don't | 17 | 't |
| 11 | can't | 3 | don't |
| 7 | n't | | |
| 6 | isn't | | |
| 5 | didn't | | |
| 2 | couldn't | | |
Of the training data bigrams ending in n't there are:
- 4 do n't
- 1 did n't
- 1 had n't
- 1 wo n't
Of the development data bigrams ending in 't there are:
- 7 don 't
- 4 can 't
- 3 didn 't
- 1 couldn 't
- 1 shan 't
- 1 wasn 't
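Counts like the two listings above (which split contractions differently, `do n't` versus `don 't`) can be gathered with a small bigram counter. This is a sketch under the assumption that each sentence is already available as a list of token strings; the function name is hypothetical and no case folding is applied.

```python
from collections import Counter


def bigrams_ending_in(sentences, final_token):
    """Count (previous token, final_token) bigrams over tokenised sentences."""
    counts = Counter()
    for tokens in sentences:
        for previous, token in zip(tokens, tokens[1:]):
            if token == final_token:
                counts[(previous, token)] += 1
    return counts
```

Calling it once with `"n't"` and once with `"'t"` on the same data shows which splitting convention a file follows.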
There is a full listing of tokens containing punctuation here: JimWhite/StarSemTokenTabulation.
HoundOfTheBaskervilles_ch1, s1: prefixed cue, weirdness
- Mr. Sherlock Holmes, who was usually very late in the mornings, save upon {those} not <in>{frequent occasions when he was up all night}, was seated at the breakfast table.
- Mr. Sherlock Holmes, who was usually very late in the mornings, save upon {those} <not> {infrequent occasions when he was up all night}, was seated at the breakfast table.
- Mr. Sherlock Holmes, {who was} usually {very late in the mornings,} <save> {upon those not infrequent occasions when he was up all night}, was seated at the breakfast table.
HoundOfTheBaskervilles_ch1, s12: prefixed cue
- Since {we have been so} <un>{[fortunate]} {as to miss him} and have no notion of his errand, this accidental souvenir becomes of importance.
HoundOfTheBaskervilles_ch1, s67: discontinuous scope
- If {he was} in the hospital and yet <not> {on the staff} he could only have been a house-surgeon or a house-physician: little more than a senior student.
HoundOfTheBaskervilles_ch1, s8: weirdness
- It is my experience that it is only an amiable man in this world who receives testimonials, only {an} <un>[ambitious] {one who abandons a London career for the country}, and only an absent-minded one who leaves his stick and not his visiting-card after waiting an hour in your room.
HoundOfTheBaskervilles_ch1, s89: discontinuous scope
- {The dog's jaw}, as shown in the space between these marks, {is} too broad in my opinion for a terrier and <not> {[broad] enough for a mastiff}.
HoundOfTheBaskervilles_ch3, s235: Multi-word cue, discontinuous scope
- Then, again, whom was he waiting for that night, and why was {he [waiting] for him} in the yew alley <rather than> {in his own house}?"
HoundOfTheBaskervilles_ch4, s154: contracted cue
- But as to my uncle's death: well, it all seems boiling up in my head, and {I [can]}<'t> {get it clear yet}.
HoundOfTheBaskervilles_ch4, s233: Abbreviation of "number" tagged as negation
- {No.} 2704 is our man .
| Freq. | Cue | PoS |
| 346 | not | RB |
| 137 | no | DT |
| 71 | un | JJ |
| 64 | no | UH |
| 58 | never | RB |
| 55 | nothing | NN |
| 36 | n't | RB |
| 24 | without | IN |
| 22 | less | JJ |
| 18 | no | RB |
| 17 | in | JJ |
| 16 | im | JJ |
| 12 | none | NN |
| 8 | n't | JJ |
| 6 | 't | RB |
| 6 | n't | VB |
| 5 | n't | NN |
| 5 | no | NNP |
| 5 | ir | JJ |
| 4 | nor | CC |
| 4 | un | RB |
| 4 | less | RB |
| 4 | in | RB |
| 3 | dis | NN |
| 3 | not | VB |
| 3 | less | NN |
| 2 | '<NULL>' | '<NULL>' |
| 2 | not | JJ |
| 2 | un | NN |
| 2 | not | NN |
| 2 | un | IN |
| 2 | nowhere | RB |
| 2 | by_no_means | IN_DT_NN |
| 2 | prevent | VB |
| 2 | n't | NNP |
| 2 | 't | NN |
| 2 | im | RB |
| 2 | on_the_contrary | IN_DT_NN |
| 1 | rather_than | RB_IN |
| 1 | nobody | NN |
| 1 | been | VBN |
| 1 | fail | VBP |
| 1 | neither_*_nor | CC_*_CC |
| 1 | absence | NN |
| 1 | other | JJ |
| 1 | nothing_at_all | NN_IN_DT |
| 1 | can | MD |
| 1 | neglected | VBN |
| 1 | ir | RB |
| 1 | un | VBG |
| 1 | refused | VBD |
| 1 | the | DT |
| 1 | yet | RB |
| 1 | never | NNP |
| 1 | save | VBP |
| 1 | not_for_the_world | RB_IN_DT_NN |
| 1 | un | VBN |
| 1 | signs | NNS |
| 1 | in | NNS |
| 1 | no | JJ |
| 1 | unusual | JJ |
| 1 | dis | VBN |
| 1 | neither_*_nor | DT_*_CC |
| 1 | by_no_means | IN_RB_VBZ |
| 1 | not_*_not | RB_*_RB |
| 1 | except | IN |
| 1 | dis | JJ |
The full list is here. There are 367 token/pos types.
| Freq. | Word | PoS |
| 51 | could | MD |
| 25 | can | RB |
| 19 | have | VBP |
| 14 | had | VBD |
| 12 | know | VB |
| 10 | know | VBP |
| 7 | able | JJ |
| 7 | seen | VBN |
| 6 | happy | JJ |
| 5 | pleasant | JJ |
| 5 | like | IN |
| 5 | sign | NN |
| 5 | say | VB |
| 5 | man | NN |
| 4 | likely | JJ |
| 4 | heard | VBN |
| 4 | saw | VBD |
| 4 | can | MD |
| 4 | possible | JJ |
| 4 | known | JJ |