One of the dominant scenarios for text is to use a pre-trained encoder (RoBERTa, BERT, XLM-R, etc.) and attach a task-specific head on top of it (classification head, language modeling head, POS tagging head, Q&A head, etc.). I believe this is also true for vision (and perhaps for audio as well, @mthrok?). To the best of my knowledge (please correct me if I am mistaken), vision currently provides a factory function for every possible combination thereof. This approach is somewhat limiting in terms of scalability and the boilerplate overhead that comes with it. Versioning could also become redundant if we replicate the same weights class across each combination for the encoder part.
I wonder what folks think about extending this framework to support model composition?
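To make the idea concrete, here is a minimal sketch of what such composition could look like. The class and names below are hypothetical placeholders for illustration, not an existing API in torchvision/torchtext or HF:

```python
import torch.nn as nn

# Hypothetical sketch: a single generic composition class instead of
# one factory function per (encoder, head) combination.
class ComposedModel(nn.Module):
    """Pairs any pre-trained encoder with any task-specific head."""

    def __init__(self, encoder: nn.Module, head: nn.Module):
        super().__init__()
        self.encoder = encoder
        self.head = head

    def forward(self, inputs):
        # The encoder produces contextual features; the head maps them
        # to task outputs (class logits, span scores, tag scores, ...).
        features = self.encoder(inputs)
        return self.head(features)


# Hypothetical usage -- roberta_encoder and ClassificationHead are
# placeholders, not real factory functions:
# model = ComposedModel(
#     roberta_encoder(weights=...),
#     ClassificationHead(num_classes=2),
# )
```

With something like this, weights would only need to be versioned once per encoder, and adding a new head would not require a new factory function for every encoder it can pair with.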
As a reference, HF also explicitly provides classes for every combination. Here is one example for a RoBERTa encoder + Q&A task.