MPNetEmbeddings annotator returns wrong embeddings #14066

@dfustes

Description

Is there an existing issue for this?

  • I have searched the existing issues and did not find a match.

Who can help?

No response

What are you working on?

Hi, I'm trying to use the MPNetEmbeddings annotator to compute embeddings for a given dataset.

Current Behavior

When I compute the embeddings, even with the default pretrained model, I obtain vectors much larger than expected. Their size varies with the text length, but I'm getting at least 5k dimensions, whereas there should be only 768.

I suspect that this may be related to the ONNX export omitting the average pooling layer that needs to be applied in order to compute the final embeddings.
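
To illustrate the suspicion, here is a minimal pure-Python sketch of what a missing average pooling step would look like. It assumes (this is not confirmed) that the oversized vector is the per-token output flattened row by row, so its length is a multiple of 768. The helper names `unflatten` and `mean_pool` and the 7-token input are hypothetical and are not part of the Spark NLP API.

```python
from typing import List

HIDDEN_SIZE = 768  # expected sentence-embedding width, as stated in this issue


def unflatten(flat: List[float], hidden_size: int = HIDDEN_SIZE) -> List[List[float]]:
    """Split a flat vector into rows of hidden_size values (one row per token).

    Assumption: the flat vector is the token matrix laid out row by row.
    """
    if len(flat) % hidden_size != 0:
        raise ValueError("length is not a multiple of the hidden size")
    return [flat[i:i + hidden_size] for i in range(0, len(flat), hidden_size)]


def mean_pool(token_vectors: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose attention-mask entry is 1."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    width = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(width)]


# Hypothetical input: 7 tokens, each token vector filled with its own index.
tokens = [[float(t)] * HIDDEN_SIZE for t in range(7)]
flat = [x for vec in tokens for x in vec]  # what an unpooled output could look like

pooled = mean_pool(unflatten(flat), [1] * 7)
print(len(flat), len(pooled))
```

If this assumption holds, a 7-token input would produce 7 x 768 = 5376 values before pooling, which would be consistent with the "at least 5k dimensions" I am seeing, and the pooled result would have the expected 768.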

Expected Behavior

The returned embeddings from the MPNetEmbeddings annotator should have only 768 dimensions.

Steps To Reproduce

import sparknlp
from pyspark.ml import Pipeline
from pyspark.sql.types import StringType
from sparknlp.annotator import *
from sparknlp.base import *

spark = sparknlp.start()

documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
embeddings = MPNetEmbeddings.pretrained().setInputCols(["document"]).setOutputCol("mpnet_embeddings")
embeddingsFinisher = EmbeddingsFinisher().setInputCols(["mpnet_embeddings"]).setOutputCols(
    "finished_embeddings").setOutputAsVector(True)

pipeline = Pipeline().setStages([
    documentAssembler,
    embeddings,
    embeddingsFinisher
])

data = spark.createDataFrame(["This is an example sentence", "Each sentence is converted"],
                             StringType()).withColumnRenamed(
    "value", "text")
result = pipeline.fit(data).transform(data)

# Use a separate name so the MPNetEmbeddings annotator above is not shadowed
exploded = result.selectExpr("explode(finished_embeddings) as embeddings")
exploded.show(20, False)

Spark NLP version and Apache Spark

spark-nlp=5.1.3
spark=3.3.3

Type of Spark Application

No response

Java Version

openjdk version "11.0.19" 2023-04-18 LTS

Java Home Directory

No response

Setup and installation

Poetry

Operating System and Version

macOS Ventura

Link to your project (if available)

No response

Additional Information

No response
