Integrate concept normalization component to cnlpt #124
base: main
Conversation
A few changes I'd like to see (and work together with @dongfang91 on).
The "task name" argument is now just a way of referring to a column in a data file, and should not be hardcoded in the data-processing code. We no longer use task names to map to task types (classification, tagging, etc.); we now just infer them from the file format. So let's separate the conceptnorm task (a fine name for the column) from the task type, which could be generalized to something like cossim? (IIRC, the differentiating aspect of this task is a massive one-hot output space where we use a cosine similarity layer instead of softmax.)
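For concreteness, the cosine-similarity output layer described above could look something like this sketch. The class name `CosineSimilarityHead` and the frozen-embedding setup are illustrative assumptions, not the branch's actual implementation:

```python
import torch
import torch.nn.functional as F


class CosineSimilarityHead(torch.nn.Module):
    """Score an encoder representation against a fixed matrix of
    concept (CUI) embeddings via cosine similarity, instead of a
    linear layer followed by softmax."""

    def __init__(self, concept_embeddings: torch.Tensor):
        super().__init__()
        # (num_concepts, hidden_dim): the one-hot output space,
        # kept frozen rather than learned.
        self.concept_embeddings = torch.nn.Parameter(
            concept_embeddings, requires_grad=False
        )

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Normalize both sides so the dot product is cosine similarity.
        hidden = F.normalize(hidden, dim=-1)
        concepts = F.normalize(self.concept_embeddings, dim=-1)
        # (batch, hidden_dim) @ (hidden_dim, num_concepts)
        #   -> (batch, num_concepts) similarity scores in [-1, 1]
        return hidden @ concepts.T
```

The argmax over these scores picks the nearest CUI, which is why the massive output space is tractable: no softmax normalization over the full vocabulary is needed at prediction time.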
We should come up with a data format that is unique to cossim and modify cnlp_processors.py to infer that format correctly. The existing formats use a label-then-text layout, and the proposed format looks to invert that -- it would probably be less confusing if we switch it to match the other tasks.
I agree. We could definitely infer the task type from the file format, since the output label is always a CUI (a capital letter C followed by digits). But one thing I'd be concerned about is that the number of CUIs in the output space is larger than the total number of CUIs seen in the existing training data, which means the input data should include a file covering all the CUIs. If someone wants to use our code to train concept normalization models, four inputs would be required: the training data, the full list of CUIs, a giant embedding matrix for those CUIs, and a CUI-less threshold; if someone only uses our models for inference, then only the full list of CUIs is required. I assume the data format is used during training? If so, this data format should be able to cover all four inputs.
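A minimal sketch of the label-shape check described above, assuming a hypothetical helper `looks_like_cossim_task` (this function does not exist in the codebase):

```python
import re

# CUIs have the shape: capital 'C' followed by digits, e.g. 'C0027051'.
CUI_PATTERN = re.compile(r"^C\d+$")


def looks_like_cossim_task(labels):
    """Return True if every label in the column matches the CUI shape,
    suggesting the file should be treated as a cossim-type task."""
    return bool(labels) and all(CUI_PATTERN.match(label) for label in labels)
```

In practice this check would run inside the format-inference logic of cnlp_processors.py, alongside the existing heuristics for classification and tagging formats.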
Yes, I think that's why thinking of this as a brand-new task type is important -- we can infer what type it is from the standard files, but once we realize it's a cossim-type task we will know to look for that extra required file with the output space explicitly specified. Yes, the data format is mainly for training, but it could also be used in --do_predict or --do_eval mode.
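The four training-time inputs discussed above could be bundled into something like the following sketch (the dataclass and its field names are hypothetical, not part of cnlpt):

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ConceptNormTrainingInputs:
    """Hypothetical bundle of the four inputs a cossim task needs."""

    train_file: Path            # training data (text + CUI labels)
    cui_list_file: Path         # file enumerating the full CUI output space
    cui_embeddings_file: Path   # embedding matrix for those CUIs
    cuiless_threshold: float    # similarity cutoff for CUI-less predictions

    def inference_inputs(self) -> Path:
        # For inference-only use, only the CUI output space is required.
        return self.cui_list_file
```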
To do for Dongfang:
Integrate the concept normalization components into this branch. A few places to check:

- How to get the labels during test:
  - https://github.com/Machine-Learning-for-Medical-Language/cnlp_transformers/blob/concept_norm_updates/src/cnlpt/train_system.py#L702
  - https://github.com/Machine-Learning-for-Medical-Language/cnlp_transformers/blob/concept_norm_updates/src/cnlpt/train_system.py#L721
- Change event_tokens to event_mask:
  - https://github.com/Machine-Learning-for-Medical-Language/cnlp_transformers/blob/concept_norm_updates/src/cnlpt/CnlpModelForClassification.py#L477
  - https://github.com/Machine-Learning-for-Medical-Language/cnlp_transformers/blob/concept_norm_updates/src/cnlpt/CnlpModelForClassification.py#L533