
@spzSource (Contributor) commented Dec 18, 2020

Fixes #799

As explained in the linked issue, there is a memory leak in JVMObjectTracker. Because JVMObjectTracker is a singleton by nature, it retains references to JVM objects throughout the entire life cycle of the Spark driver node.

This in turn causes a massive memory leak when running DotnetRunner multiple times on the same driver node, for instance when using the SetJar approach in Databricks.

To fix the memory leak, JVMObjectTracker is cleaned up right before DotnetBackend shuts down, so that it releases its references to tracked JVM objects and they can be garbage collected.
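For illustration, here is a minimal Scala sketch of the idea; the map type and the member names (`put`, `get`, `remove`, `clear`) are assumptions made for this sketch, not the actual dotnet/spark code:

```scala
import scala.collection.concurrent.TrieMap

// Sketch of a singleton tracker: because the object lives for the whole
// driver JVM, every tracked JVM object stays strongly reachable until it
// is explicitly removed, which is the source of the leak.
object JVMObjectTracker {
  private val objects = TrieMap.empty[String, Object]

  def put(id: String, obj: Object): Unit = { objects.put(id, obj); () }
  def get(id: String): Option[Object] = objects.get(id)
  def remove(id: String): Option[Object] = objects.remove(id)

  // Dropping all references makes the tracked objects eligible for GC.
  def clear(): Unit = objects.clear()
}

class DotnetBackend {
  def close(): Unit = {
    // ... shut down the server, channels, and worker threads ...
    // The fix in its first form: clear the tracker right before shutdown.
    JVMObjectTracker.clear()
  }
}
```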

Heap statistics from the jmap tool after applying the fix:

Attaching to process ID 887, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 25.265-b11

using thread-local object allocation.
Parallel GC with 4 thread(s)

Heap Configuration:
   MinHeapFreeRatio         = 0
   MaxHeapFreeRatio         = 100
   MaxHeapSize              = 6564085760 (6260.0MB)
   NewSize                  = 2187853824 (2086.5MB)
   MaxNewSize               = 2187853824 (2086.5MB)
   OldSize                  = 4376231936 (4173.5MB)
   NewRatio                 = 2
   SurvivorRatio            = 8
   MetaspaceSize            = 21807104 (20.796875MB)
   CompressedClassSpaceSize = 1073741824 (1024.0MB)
   MaxMetaspaceSize         = 17592186044415 MB
   G1HeapRegionSize         = 0 (0.0MB)

Heap Usage:
PS Young Generation
Eden Space:
   capacity = 1568145408 (1495.5MB)
   used     = 1502102368 (1432.5164489746094MB)
   free     = 66043040 (62.983551025390625MB)
   95.78846198426007% used
From Space:
   capacity = 303038464 (289.0MB)
   used     = 232644176 (221.8667755126953MB)
   free     = 70394288 (67.13322448730469MB)
   76.77051055802606% used
To Space:
   capacity = 309854208 (295.5MB)
   used     = 0 (0.0MB)
   free     = 309854208 (295.5MB)
   0.0% used
PS Old Generation
   capacity = 4376231936 (4173.5MB)
   used     = 532408504 (507.7443161010742MB)
   free     = 3843823432 (3665.755683898926MB)
   12.16591149157959% used

70301 interned Strings occupying 8281616 bytes.

Old generation usage is at 12%, compared with 99.9% before the fix.

Point to discuss:
My first attempt was to make JVMObjectTracker non-static, which could potentially provide better protection against leaks in the future. However, making it non-static requires modifying multiple files, and it looks like it would be quite hard to make DotnetForeachBatchHelper work with a non-static version of JVMObjectTracker.

Given all of the above, I decided to implement the simplest possible solution, but I absolutely do not mind discussing other approaches.

@dnfadmin commented Dec 18, 2020

CLA assistant check
All CLA requirements met.

@suhsteve (Member) commented Dec 18, 2020

@imback82 I think the fix this PR addresses and the fix for issue #792 / PR #793 are similar. We can either leave ThreadPool and JVMObjectTracker as singletons and add the appropriate fixes, or change them to classes and each instance of DotnetBackend can have their own instances of them.

This PR takes the first approach, and #793 does the second, but I think we should try to be consistent with both fixes. I'm partial to the 2nd approach and CallbackClient will also need to be revisited with a similar fix when running multiple DotnetBackend instances.

@spzSource (Contributor, Author) commented Dec 27, 2020

> @imback82 I think the fix this PR addresses and the fix for issue #792 / PR #793 are similar. We can either leave ThreadPool and JVMObjectTracker as singletons and add the appropriate fixes, or change them to classes and each instance of DotnetBackend can have their own instances of them.
>
> This PR takes the first approach, and #793 does the second, but I think we should try to be consistent with both fixes. I'm partial to the 2nd approach and CallbackClient will also need to be revisited with a similar fix when running multiple DotnetBackend instances.

Hi @suhsteve and @imback82,

Just to let you know: the fix I created solves the memory leak issue, but it causes job isolation issues, because all tracked objects for one job may be cleaned up during the shutdown of another job.

The only sensible way to fix that is to make JVMObjectTracker non-static, so that each new DotnetBackend has its own instance of JVMObjectTracker. I'm currently working on that.

@suhsteve (Member) commented

> Hi @suhsteve and @imback82,
>
> Just to let you know: the fix I created solves the memory leak issue, but it causes job isolation issues, because all tracked objects for one job may be cleaned up during the shutdown of another job.
>
> The only sensible way to fix that is to make JVMObjectTracker non-static, so that each new DotnetBackend has its own instance of JVMObjectTracker. I'm currently working on that.

Sounds like a good plan. Looking forward to the updated PR.

@spzSource spzSource marked this pull request as draft December 28, 2020 21:51
@spzSource (Contributor, Author) commented

Hi @suhsteve,

The PR has been updated, so JVMObjectTracker is now created per DotnetBackend. The changes were applied to all three versions: 2.3, 2.4, and 3.0. Since there is no common code shared across the versions, some duplication takes place. (It would probably be worth raising the priority of #15 to reduce this duplication.)

Looking forward to any feedback.
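A minimal sketch of the per-backend design described above, assuming a plain class-based tracker owned by each DotnetBackend instance (names are illustrative, not the exact code in this PR):

```scala
import scala.collection.concurrent.TrieMap

// Sketch: the tracker is now an ordinary class instead of a singleton.
class JVMObjectTracker {
  private val objects = TrieMap.empty[String, Object]

  def put(id: String, obj: Object): Unit = { objects.put(id, obj); () }
  def get(id: String): Option[Object] = objects.get(id)
  def clear(): Unit = objects.clear()
}

class DotnetBackend {
  // Each backend owns its own tracker, so tracked objects live no longer
  // than the backend that created them, and shutting down one job cannot
  // release objects that belong to another job.
  val objectTracker = new JVMObjectTracker

  def close(): Unit = {
    objectTracker.clear()
  }
}
```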

@spzSource spzSource marked this pull request as ready for review January 4, 2021 16:31
@spzSource spzSource requested a review from suhsteve January 5, 2021 17:28
@spzSource (Contributor, Author) commented Jan 6, 2021

FYI, I've done a little regression testing in my QA environment using the latest fixes from this PR. All batch jobs that previously suffered from the memory leak are now running correctly, without any memory issues.

It would be great to start reviewing the changes and move towards integrating them into master.

@imback82 (Contributor) commented Jan 6, 2021

Thanks @spzSource for the update. I will get to this PR this week, sorry for the delay.

@gaoyangxiaozhu (Contributor) commented

Looking forward to the PR going to master.

@imback82 (Contributor) left a review comment

A few minor/nit comments, but generally this looks great to me. Thanks @spzSource for working on this!

@spzSource (Contributor, Author) commented

It seems the latest build failed due to a NuGet feed issue:

Unable to load the service index for source https://dotnet.myget.org/F/dotnet-try/api/v3/index.json.

Does anybody have an idea what happened to this feed?

@spzSource spzSource requested a review from imback82 January 12, 2021 21:27
@imback82 (Contributor) commented

> It seems the latest build failed due to a NuGet feed issue:
>
> Unable to load the service index for source https://dotnet.myget.org/F/dotnet-try/api/v3/index.json.
>
> Does anybody have an idea what happened to this feed?

Looks like the feed got deprecated recently. I will create a PR to fix this. Thanks!

@imback82 (Contributor) commented

The build issue is fixed in #807. I pushed the changes to your branch.

@imback82 imback82 added the enhancement label Jan 13, 2021
@imback82 imback82 added this to the 1.0.1 milestone Jan 13, 2021
@spzSource (Contributor, Author) commented Jan 14, 2021

@imback82 and @suhsteve
All comments addressed, I think everything is ready for final review and approval 🤞

@imback82 (Contributor) left a review comment

LGTM, thanks @spzSource!

Labels: enhancement

Successfully merging this pull request may close the following issue:
[BUG]: Memory Leak when running scheduled job using DotnetRunner