@dongfang91 dongfang91 requested a review from tmills February 21, 2023 00:23

@tmills tmills left a comment


A few changes I'd like to see (and work together with @dongfang91 on).

The "task name" argument is now just a way of referring to a column in a data file, and should not be hardcoded in the data processing code. We no longer use task names to map to task types (classification, tagging, etc.); instead we infer them from the file format. So let's separate the conceptnorm task (a fine name for the column) from the task type, which can be generalized to something like cossim? (IIRC, the differentiating aspect of this task is a massive one-hot output space where we use a cosine similarity layer instead of a softmax.)
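As a rough illustration of that cossim idea (a hypothetical sketch, not this repo's actual implementation; `CosineSimilarityHead` and its shapes are assumptions), the output layer would score each encoded mention against a fixed matrix of pre-computed concept embeddings rather than through a learned softmax classifier:

```python
import torch
import torch.nn.functional as F

class CosineSimilarityHead(torch.nn.Module):
    """Hypothetical sketch of a cossim-style output layer: instead of a
    softmax over a learned weight matrix, score each encoded mention
    against a fixed, pre-computed matrix of concept (CUI) embeddings."""

    def __init__(self, concept_embeddings: torch.Tensor):
        super().__init__()
        # (num_cuis, hidden_dim) matrix; stored as a buffer so it is
        # saved with the model but not updated by the optimizer
        self.register_buffer("concepts", F.normalize(concept_embeddings, dim=-1))

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, hidden_dim) mention representations
        hidden = F.normalize(hidden, dim=-1)
        # cosine similarity against every concept: (batch, num_cuis)
        return hidden @ self.concepts.T
```

A "CUI-less" prediction could then be handled by thresholding the maximum similarity, which is presumably where the CUI-less threshold mentioned later in this thread comes in.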

We should come up with a data format that is unique to cossim and modify cnlp_processors.py to infer that format correctly. The existing formats use a label-then-text column order, and the proposed format looks to invert that -- it would probably be less confusing if we switch it to match the other tasks.

@dongfang91

I agree. We could definitely infer the task type from the file format, since the output label is always a CUI (a capital letter C followed by digits). But one thing I'm concerned about is that the number of CUIs in the output space is larger than the total number of CUIs seen in the existing training data, which means the input data should include a file that covers all the CUIs. If someone wants to use our code to train concept normalization models, four inputs would be required: the training data, the full CUI list, a giant embeddings matrix for those CUIs, and a CUI-less threshold; if someone only uses our models for inference, then only the full CUI list is required. I assume the data format is used during training? If so, this data format should be able to cover all four inputs.
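The CUI pattern described above could be detected with a simple heuristic. A minimal sketch, assuming the labels are available as a list of strings (`looks_like_conceptnorm` is an invented name, not part of cnlp_processors.py):

```python
import re

# CUIs are a capital "C" followed by digits, e.g. "C0011849"
CUI_RE = re.compile(r"^C\d+$")

def looks_like_conceptnorm(labels):
    """Heuristic sketch: if every label in a column is a CUI,
    treat the column as a cossim / concept-normalization task."""
    return all(CUI_RE.match(label) for label in labels)
```

This only inspects the labels present in the file, which is exactly the limitation raised above: the full output space would still need to come from a separate CUI file.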

@tmills

tmills commented Apr 3, 2023

Yes, I think that's why thinking of this as a brand-new task type is important -- we can infer what type it is from the standard files, but once we realize it's a cossim-type task we will know to look for the extra required file with the output space explicitly specified. Yes, the data format is mainly for training, but it could also be used in --do_predict or --do_eval mode.

Base automatically changed from v0.5.0 to main April 4, 2023 17:28
@tmills tmills changed the base branch from main to v0.6.0 April 4, 2023 17:38
@tmills tmills changed the base branch from v0.6.0 to dev-v0.6.0 April 4, 2023 18:07
@dongfang91

To do for Dongfang:

  1. Define cosine similarity as a brand-new task type for tasks with a large output space.
  2. Infer the task type from the standard JSON files.
  3. Add extra input file paths (pre-computed concept embeddings; CUI file) to the JSON files.
  4. Test both training and evaluation; evaluation won't use the pre-computed concept embeddings.
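Item 3 might look something like the following sketch; the config keys `cui_file` and `concept_embeddings`, the `.npy` storage format, and the helper name `load_cossim_resources` are all assumptions for illustration, not the repo's actual format:

```python
import json

import numpy as np

def load_cossim_resources(config_path):
    """Hypothetical sketch: read extra cossim input paths from a task's
    JSON config and load the CUI list plus its embedding matrix."""
    with open(config_path) as f:
        cfg = json.load(f)
    # one CUI per line, in the same row order as the embedding matrix
    with open(cfg["cui_file"]) as f:
        cuis = [line.strip() for line in f if line.strip()]
    # pre-computed concept embeddings: (num_cuis, hidden_dim)
    embeddings = np.load(cfg["concept_embeddings"])
    assert embeddings.shape[0] == len(cuis), "CUI list / embedding row mismatch"
    return cuis, embeddings
```

Keeping the CUI list and embedding matrix aligned by row index is what lets an argmax over cosine similarities map directly back to a CUI string at prediction time.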

@tmills tmills changed the base branch from dev-v0.6.0 to dev-v0.7.0 August 3, 2023 21:28
Base automatically changed from dev-v0.7.0 to main November 1, 2024 17:50