Closed
Description
Is there an existing issue for this?
- I have searched the existing issues and did not find a match.
Who can help?
No response
What are you working on?
Hi, I'm trying to use the MPNetEmbeddings annotator to compute embeddings for a given dataset.
Current Behavior
When I compute the embeddings, even with the default pretrained model, I obtain vectors much bigger than expected. Their size varies with the text length, but I'm getting at least 5k dimensions, while it should be only 768 dimensions.
I suspect this may be related to the ONNX export not including the average pooling layer that needs to be applied to the token embeddings in order to compute the final sentence embedding.
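To make the suspicion concrete, here is a minimal pure-Python sketch of the average pooling step. It assumes, without confirmation from the library, that the oversized vector is the token-level output flattened row by row, so its length would be a multiple of 768. The names `mean_pool`, `flat`, and `attention_mask` are hypothetical and not part of Spark NLP.

```python
from typing import List

HIDDEN_SIZE = 768  # expected MPNet embedding width, as stated in this issue


def mean_pool(flat: List[float], attention_mask: List[int],
              hidden_size: int = HIDDEN_SIZE) -> List[float]:
    """Average token vectors stored back to back in `flat`, skipping masked tokens.

    Assumption: `flat` holds n_tokens vectors of `hidden_size` values each,
    and `attention_mask` holds one 0/1 flag per token.
    """
    if len(flat) % hidden_size != 0:
        raise ValueError("vector length is not a multiple of hidden_size")
    n_tokens = len(flat) // hidden_size
    if len(attention_mask) != n_tokens:
        raise ValueError("attention_mask must have one entry per token")
    kept = sum(1 for m in attention_mask if m)
    if kept == 0:
        raise ValueError("attention_mask selects no tokens")

    pooled = [0.0] * hidden_size
    for t in range(n_tokens):
        if attention_mask[t]:
            base = t * hidden_size
            for j in range(hidden_size):
                pooled[j] += flat[base + j]
    return [value / kept for value in pooled]


# Synthetic stand-in for a 7-token output: token t is a vector filled with t.
flat = [float(t) for t in range(7) for _ in range(HIDDEN_SIZE)]
pooled = mean_pool(flat, [1] * 7)
print(len(flat), len(pooled))
```

If the observed vector sizes are all multiples of 768, that would be consistent with the pooling step being skipped, though it would not prove it.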
Expected Behavior
The returned embeddings from the MPNetEmbeddings annotator should have only 768 dimensions.
Steps To Reproduce
import sparknlp
from pyspark.ml import Pipeline
from pyspark.sql.types import StringType
from sparknlp.annotator import *
from sparknlp.base import *

spark = sparknlp.start()

documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
embeddings = MPNetEmbeddings.pretrained().setInputCols(["document"]).setOutputCol("mpnet_embeddings")
embeddingsFinisher = EmbeddingsFinisher().setInputCols(["mpnet_embeddings"]).setOutputCols(
    "finished_embeddings").setOutputAsVector(True)

pipeline = Pipeline().setStages([
    documentAssembler,
    embeddings,
    embeddingsFinisher
])

data = spark.createDataFrame(["This is an example sentence", "Each sentence is converted"],
                             StringType()).withColumnRenamed("value", "text")
result = pipeline.fit(data).transform(data)

# One row per document; each vector is expected to have 768 entries
exploded = result.selectExpr("explode(finished_embeddings) as embeddings")
exploded.show(20, False)

Spark NLP version and Apache Spark
spark-nlp=5.1.3
spark=3.3.3
Type of Spark Application
No response
Java Version
openjdk version "11.0.19" 2023-04-18 LTS
Java Home Directory
No response
Setup and installation
Poetry
Operating System and Version
MacOS Ventura
Link to your project (if available)
No response
Additional Information
No response