Description
Memgraph version
Which version did you use?
3.3
Environment
Some information about the environment you are using Memgraph on: operating
system, architecture (ARM, x86), how do you connect, with or without docker,
which driver etc.
Ubuntu 24.04.2 LTS
Connecting using GQLAlchemy (1.6.0, Python)
Without Docker
Describe the bug
When I upload data to a Memgraph database from CSV files using multiple processes, duplicate nodes and relationships are created. The files do contain potential duplicate nodes and relationships, but the nodes and relationships are indexed and constrained to avoid this.
When running the same process on Memgraph 2.X, no duplicate nodes or relationships were created. On 3.3, a variable number of nodes and relationships is created each time the uploads are run. I'm seeing nearly 100 million more relationships on 3.X than on 2.X, which is a significant deviation.
On 3.3, when I run the uploads in a single process using the same dataset, I get the expected number of nodes and relationships, so I suspect the issue is related to parallelization.
On both 3.3 and 2.X, the database is in "In-Memory Analytical" storage mode. We process ~250 files using 32 processes in parallel.
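For context, a minimal sketch of how the parallel uploads are driven (the names `MEMGRAPH_DIR`, `build_load_query`, `upload_file`, and the file list are illustrative, not our actual pipeline code; the real workers open a GQLAlchemy `Memgraph()` connection and call `execute()`):

```python
# Sketch of the multi-process upload driver. Each worker would run one
# LOAD CSV query per file; here we only build the query strings so the
# shape of the parallelism is visible.
from multiprocessing import Pool

MEMGRAPH_DIR = "/var/lib/memgraph/import/"  # assumed import directory


def build_load_query(memgraph_directory: str, file_name: str) -> str:
    """Build a (simplified) LOAD CSV query for one node file."""
    return (
        f"LOAD CSV FROM '{memgraph_directory}{file_name}' WITH HEADER AS row "
        "MERGE (n:Org {internal_id: row.internal_id})"
    )


def upload_file(file_name: str) -> str:
    # Real pipeline: open a connection here and execute the query.
    return build_load_query(MEMGRAPH_DIR, file_name)


if __name__ == "__main__":
    files = [f"nodes_{i}.csv" for i in range(250)]  # ~250 files in our runs
    with Pool(processes=32) as pool:                # 32 parallel workers
        queries = pool.map(upload_file, files)
```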
The query I run for nodes is:
```cypher
LOAD CSV FROM '{memgraph_directory}{file_name}' WITH HEADER AS row
WITH row
// Conditional logic to create Org node if entity_type is 'org'
FOREACH ( IN CASE WHEN row.entity_type = 'org' THEN [1] ELSE [] END |
  MERGE (n:Org {{internal_id: row.internal_id}})
  ON CREATE SET
    n.name = row.name,
    n.country = row.country
)
// Conditional logic to create Person node if entity_type is 'person'
FOREACH ( IN CASE WHEN row.entity_type = 'person' THEN [1] ELSE [] END |
  MERGE (n:Person {{internal_id: row.internal_id}})
  ON CREATE SET
    n.name = row.name
)
```
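After an upload, the duplication can be confirmed by grouping on the constrained property. This is a sketch of that check, not code from our pipeline (the `DUPLICATE_CHECK` query and `duplicated_ids` helper are illustrative); if `MERGE` behaved atomically across processes, the query should return no rows:

```python
# Group nodes by the indexed/constrained property and report any
# internal_id that appears on more than one node.
DUPLICATE_CHECK = """
MATCH (n:Org)
WITH n.internal_id AS internal_id, count(n) AS copies
WHERE copies > 1
RETURN internal_id, copies
"""


def duplicated_ids(rows):
    """Given (internal_id, copies) result rows, return the duplicated ids."""
    return [internal_id for internal_id, copies in rows if copies > 1]


# Example with fake result rows: one id has two extra copies.
rows = [("a1", 1), ("b2", 3), ("c3", 1)]
print(duplicated_ids(rows))  # -> ['b2']
```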
To Reproduce
Steps to reproduce the behavior:
1. Clear the Memgraph database and set an index/constraint on the unique value / internal ID for nodes and relationships.
2. Have a set of CSVs with node and relationship data. Ensure that the nodes and relationships have duplication based on the indexed value. (Our nodes could've shown up in any number of files across the ~250 we use.)
3. Run multiple processes in parallel for uploads.
4. Once uploads are complete, record the number of nodes/relationships uploaded and repeat steps 1-4 multiple times.
5. After running steps 1-4 multiple times, compare the numbers and see if they differ across runs.
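The comparison in the last step can be sketched as follows (the counts below are made-up illustrative numbers, not our measurements):

```python
# After repeating the upload cycle, compare the recorded
# (node_count, relationship_count) totals across runs.
def counts_consistent(counts):
    """True if every run produced identical (nodes, relationships) totals."""
    return len(set(counts)) <= 1


runs_2x = [(1_000, 5_000), (1_000, 5_000), (1_000, 5_000)]  # stable
runs_33 = [(1_000, 5_000), (1_004, 5_120), (1_001, 5_033)]  # drifts per run
print(counts_consistent(runs_2x))  # -> True
print(counts_consistent(runs_33))  # -> False
```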
Expected behavior
A clear and concise description of what you expected to happen.
I'd expect no duplicate nodes or relationships to be created within the Memgraph database.
I'd expect node and relationship counts to be consistent/identical across repeated uploads of the same dataset.
Logs
If applicable, add logs of Memgraph, CLI output or screenshots to help explain
your problem.
The first picture has the accurate node and relationship counts from Memgraph 2.X. The following two pictures show relationship counts after multi-process uploads on 3.3. The final picture shows the relationship counts from 3.3 after single-process uploads, which match the 2.X multi-process uploads.
Additional context
We only tested this on 3.3 -- we've largely stayed on Memgraph 2.X since discovering this issue as we rely heavily on being able to reload data quickly. At our scale, reloading all of our data in a single thread takes ~10 hours, whereas we can reduce that to ~20 minutes with multiprocessing.
Verification Environment
Once we fix it, what do you need to verify the fix?
I'd probably just want to re-run the upload process and verify the numbers are consistent/as expected.