Description
Memgraph version
Which version did you use?
3.3
Environment
Some information about the environment you are using Memgraph on: operating
system, architecture (ARM, x86), how do you connect, with or without docker,
which driver etc.
Ubuntu 24.04.2 LTS
Connecting using GQLAlchemy (1.6.0, Python)
Without Docker
Describe the bug
When I upload data to a Memgraph database from CSV files using multiple processes, duplicate nodes and relationships are created. The files do contain potential duplicate nodes and relationships, but the nodes and relationships are indexed and constrained to avoid this.
When running the same process on Memgraph 2.X, no duplicate nodes or relationships were created. On 3.3, a variable number of nodes and relationships is created each time the uploads are run. I'm seeing nearly 100 million more relationships on 3.X than on 2.X, which is a significant deviation.
On 3.3, when I run the uploads in a single process using the same dataset, I get the expected number of nodes and relationships, so I suspect the issue is related to parallelization.
On both 3.3 and 2.X, the database is in "In-Memory Analytical" storage mode. We process ~250 files using 32 processes in parallel.
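For context, a minimal sketch of how the parallel uploads are driven (the names `MEMGRAPH_DIR`, `build_load_query`, `upload_file`, and the file list are illustrative, not our actual pipeline code; the real workers open a GQLAlchemy `Memgraph()` connection and call `execute()`):

```python
# Sketch of the multi-process upload driver. Each worker would run one
# LOAD CSV query per file; here we only build the query strings so the
# shape of the parallelism is visible.
from multiprocessing import Pool

MEMGRAPH_DIR = "/var/lib/memgraph/import/"  # assumed import directory


def build_load_query(memgraph_directory: str, file_name: str) -> str:
    """Build a (simplified) LOAD CSV query for one node file."""
    return (
        f"LOAD CSV FROM '{memgraph_directory}{file_name}' WITH HEADER AS row "
        "MERGE (n:Org {internal_id: row.internal_id})"
    )


def upload_file(file_name: str) -> str:
    # Real pipeline: open a connection here and execute the query.
    return build_load_query(MEMGRAPH_DIR, file_name)


if __name__ == "__main__":
    files = [f"nodes_{i}.csv" for i in range(250)]  # ~250 files in our runs
    with Pool(processes=32) as pool:                # 32 parallel workers
        queries = pool.map(upload_file, files)
```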
The query I run for nodes is:
```cypher
LOAD CSV FROM '{memgraph_directory}{file_name}' WITH HEADER AS row
WITH row
// Conditional logic to create Org node if entity_type is 'org'
FOREACH ( IN CASE WHEN row.entity_type = 'org' THEN [1] ELSE [] END |
  MERGE (n:Org {{internal_id: row.internal_id}})
  ON CREATE SET
    n.name = row.name,
    n.country = row.country
)
// Conditional logic to create Person node if entity_type is 'person'
FOREACH ( IN CASE WHEN row.entity_type = 'person' THEN [1] ELSE [] END |
  MERGE (n:Person {{internal_id: row.internal_id}})
  ON CREATE SET
    n.name = row.name
)
```
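After an upload, the duplication can be confirmed by grouping on the constrained property. This is a sketch of that check, not code from our pipeline (the `DUPLICATE_CHECK` query and `duplicated_ids` helper are illustrative); if `MERGE` behaved atomically across processes, the query should return no rows:

```python
# Group nodes by the indexed/constrained property and report any
# internal_id that appears on more than one node.
DUPLICATE_CHECK = """
MATCH (n:Org)
WITH n.internal_id AS internal_id, count(n) AS copies
WHERE copies > 1
RETURN internal_id, copies
"""


def duplicated_ids(rows):
    """Given (internal_id, copies) result rows, return the duplicated ids."""
    return [internal_id for internal_id, copies in rows if copies > 1]


# Example with fake result rows: one id has two extra copies.
rows = [("a1", 1), ("b2", 3), ("c3", 1)]
print(duplicated_ids(rows))  # -> ['b2']
```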
To Reproduce
Steps to reproduce the behavior:
1. Clear the Memgraph database and set an index/constraint on the unique value / internal ID for nodes and relationships.
2. Have a set of CSVs with node and relationship data. Ensure that the nodes and relationships have duplication based on the indexed value. (Our nodes could've shown up in any number of files across the ~250 we use.)
3. Run multiple processes in parallel for uploads.
4. Once uploads are complete, record the number of nodes/relationships uploaded and repeat steps 1-4 multiple times.
5. After running steps 1-4 multiple times, compare the numbers and see if they differ across runs.
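The comparison in the last step can be sketched as follows (the counts below are made-up illustrative numbers, not our measurements):

```python
# After repeating the upload cycle, compare the recorded
# (node_count, relationship_count) totals across runs.
def counts_consistent(counts):
    """True if every run produced identical (nodes, relationships) totals."""
    return len(set(counts)) <= 1


runs_2x = [(1_000, 5_000), (1_000, 5_000), (1_000, 5_000)]  # stable
runs_33 = [(1_000, 5_000), (1_004, 5_120), (1_001, 5_033)]  # drifts per run
print(counts_consistent(runs_2x))  # -> True
print(counts_consistent(runs_33))  # -> False
```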
Expected behavior
A clear and concise description of what you expected to happen.
I'd expect no duplicate nodes or relationships to be created within the Memgraph database.
I'd expect node and relationship counts to be consistent/identical across repeated uploads of the same dataset.
Logs
If applicable, add logs of Memgraph, CLI output or screenshots to help explain
your problem.
The first picture has the accurate node and relationship counts from Memgraph 2.X. The following two pictures show relationship counts after multi-process uploads on 3.3. The final picture shows the relationship counts from 3.3 after single-process uploads, which match the 2.X multi-process uploads.
Additional context
We only tested this on 3.3 -- we've largely stayed on Memgraph 2.X since discovering this issue as we rely heavily on being able to reload data quickly. At our scale, reloading all of our data in a single thread takes ~10 hours, whereas we can reduce that to ~20 minutes with multiprocessing.
Verification Environment
Once we fix it, what do you need to verify the fix?
I'd probably just want to re-run the upload process and verify the numbers are consistent/as expected.