Frequent OOMs and High CPU Usage While Serving Eth Calls #22567

@nisdas

System information

Geth version: v1.10.1
OS & Version: Linux

Expected behaviour

After the geth node is synced, it is able to serve any incoming eth_call requests without issues.

Actual behaviour

The geth node frequently OOMs the moment it starts serving eth_call requests, with execution taking a very long time and eventually timing out. The node is killed almost immediately, as memory usage spikes from 1 GB to 10-12 GB in a matter of seconds. This behaviour wasn't observed in earlier releases (1.9.25 and earlier).

Steps to reproduce the behaviour

Run geth with the following flags

      --v5disc
      --http
      --http.addr=0.0.0.0
      --http.corsdomain=*
      --http.vhosts=*
      --ws
      --ws.addr=0.0.0.0
      --ws.origins=*
      --datadir=/ethereum
      --keystore=/keystore
      --verbosity=3
      --light.serve=50
      --metrics
      --ethstats=/ethstats_secret
      --goerli
      --snapshot=false
      --txlookuplimit=0
      --cache.preimages
      --rpc.allow-unprotected-txs
      --pprof

We have reverted all the big changes from v1.10 onwards; however, it hasn't made a difference.
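
For anyone trying to reproduce this, a minimal sketch of timing a single eth_call against the HTTP endpoint is given here. It is an illustration only: the endpoint `http://127.0.0.1:8545` is taken from the log output below, while the helper names, the target address and the call data are hypothetical placeholders, not the requests our node was actually serving.

```python
import json
import time
import urllib.request


def build_eth_call(to, data, block="latest", request_id=1):
    """Build a JSON-RPC 2.0 eth_call request body as a dict."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "eth_call",
        "params": [{"to": to, "data": data}, block],
    }


def timed_post(url, payload, timeout=30.0):
    """POST the payload and return (decoded reply, elapsed seconds)."""
    body = json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.monotonic()
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        reply = json.loads(resp.read())
    return reply, time.monotonic() - start


# Hypothetical usage (placeholder contract address and call data):
#   payload = build_eth_call("0x" + "00" * 20, "0x")
#   reply, elapsed = timed_post("http://127.0.0.1:8545", payload)
#   print(elapsed, reply.get("error"))
```

Comparing the elapsed time for the same payload on 1.9.25 and 1.10.1 would show whether a particular call is the one that regressed.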

Backtrace

INFO [03-24|11:39:23.079] Loaded local transaction journal         transactions=3130 dropped=0
INFO [03-24|11:39:23.092] Regenerated local transaction journal    transactions=3130 accounts=536
WARN [03-24|11:39:23.092] Switch sync mode from fast sync to full sync 
INFO [03-24|11:39:23.140] Unprotected transactions allowed 
WARN [03-24|11:39:23.141] Old unclean shutdowns found              count=141
WARN [03-24|11:39:23.141] Unclean shutdown detected                booted=2021-03-24T10:59:08+0000 age=40m15s
WARN [03-24|11:39:23.142] Unclean shutdown detected                booted=2021-03-24T11:04:52+0000 age=34m31s
WARN [03-24|11:39:23.142] Unclean shutdown detected                booted=2021-03-24T11:10:32+0000 age=28m51s
WARN [03-24|11:39:23.142] Unclean shutdown detected                booted=2021-03-24T11:14:56+0000 age=24m27s
WARN [03-24|11:39:23.142] Unclean shutdown detected                booted=2021-03-24T11:18:12+0000 age=21m11s
WARN [03-24|11:39:23.142] Unclean shutdown detected                booted=2021-03-24T11:21:23+0000 age=18m
WARN [03-24|11:39:23.142] Unclean shutdown detected                booted=2021-03-24T11:23:03+0000 age=16m20s
WARN [03-24|11:39:23.142] Unclean shutdown detected                booted=2021-03-24T11:24:49+0000 age=14m34s
WARN [03-24|11:39:23.142] Unclean shutdown detected                booted=2021-03-24T11:27:13+0000 age=12m10s
WARN [03-24|11:39:23.142] Unclean shutdown detected                booted=2021-03-24T11:31:47+0000 age=7m36s
WARN [03-24|11:39:23.142] Unclean shutdown detected                booted=2021-03-24T11:35:22+0000 age=4m1s
INFO [03-24|11:39:23.142] Allocated cache and file handles         database=/ethereum/geth/les.server        cache=16.00MiB  handles=16
INFO [03-24|11:39:23.174] Configured checkpoint oracle             address=0x18CA0E045F0D772a851BC7e48357Bcaab0a0795D signers=5 threshold=2
INFO [03-24|11:39:23.185] Loaded latest checkpoint                 section=136 head="cb1485…e62327" chtroot="4fb6c4…a2b521" bloomroot="443066…099a79"
INFO [03-24|11:39:23.185] Starting peer-to-peer node               instance=Geth/v1.10.1-stable-c2d2f4ed/linux-amd64/go1.16
INFO [03-24|11:39:23.276] New local node record                    seq=470 id=xxxxxxxxxxxx ip=127.0.0.1 udp=30303 tcp=30303
INFO [03-24|11:39:23.305] Started P2P networking                   self=enode://[email protected]:30303
INFO [03-24|11:39:23.306] IPC endpoint opened                      url=/ethereum/geth.ipc
INFO [03-24|11:39:23.308] HTTP server started                      endpoint=[::]:8545 prefix= cors=* vhosts=*
INFO [03-24|11:39:23.308] WebSocket enabled                        url=ws://[::]:8546
INFO [03-24|11:39:23.308] Stats daemon started 
INFO [03-24|11:39:33.721] Looking for peers                        peercount=1 tried=35 static=0
INFO [03-24|11:39:37.922] Block synchronisation started 
INFO [03-24|11:39:38.849] New local node record                    seq=471 id=xxxxxxx ip=xxxxxxxxx udp=xxxxx tcp=30303
WARN [03-24|11:39:39.683] Served eth_call                          conn=127.0.0.1:49070 reqid=107580 t=6.568050388s err="execution aborted (timeout = 5s)"
WARN [03-24|11:39:43.317] Served eth_call                          conn=127.0.0.1:53254 reqid=1      t=10.147648933s err="execution aborted (timeout = 5s)"
INFO [03-24|11:39:43.987] Looking for peers                        peercount=1 tried=24 static=0
WARN [03-24|11:39:46.515] Served eth_call                          conn=127.0.0.1:35870 reqid=1      t="559.32µs"    err="execution reverted"
WARN [03-24|11:39:46.622] Served eth_call                          conn=127.0.0.1:60854 reqid=1      t="479.825µs"   err="execution reverted"
INFO [03-24|11:39:46.683] Submitted transaction                    hash=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx from=xxxxxxxxxxxxxxxxxxxxx nonce=xxxx recipient=xxxxxxxxxxxxxxxxxx value=0
WARN [03-24|11:39:47.966] Served eth_call                          conn=127.0.0.1:42870 reqid=290669 t=10.325244195s err="execution aborted (timeout = 5s)"
WARN [03-24|11:39:48.460] Failed to retrieve stats server message  err="websocket: close 1006 (abnormal closure): unexpected EOF"
WARN [03-24|11:39:51.966] Served eth_estimateGas                   conn=127.0.0.1:60412 reqid=3971546 t=5.387361441s  err="execution aborted (timeout = 0s)"
WARN [03-24|11:39:54.181] Full stats report failed                 err="use of closed network connection"

These are the last logs before the node is killed due to an OOM. Restarting doesn't help: the node goes through the same sequence and is killed again within a few minutes while serving an eth_call. We expect our geth node to use more memory than normal because it serves public RPC requests, but this has not come up before, which is why we are opening this issue.
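
To pick out the slow calls from a log like the one above, a small parser along these lines may help. It is a sketch that assumes the `Served <method> ... t=<duration> err="<message>"` layout seen in these lines and only handles single-unit durations (for example `6.568050388s` or `559.32µs`); the function names are illustrative, not part of geth.

```python
import re

# Seconds per unit for the single-unit Go durations seen in the log.
_UNITS = {"ns": 1e-9, "\u00b5s": 1e-6, "ms": 1e-3, "s": 1.0}

_LINE = re.compile(
    r'Served (?P<method>\S+)\s.*?\bt="?(?P<value>[0-9.]+)'
    r'(?P<unit>ns|\u00b5s|ms|s)"?\s+err="(?P<err>[^"]*)"'
)


def parse_served(line):
    """Return (method, seconds, error) for a 'Served ...' line, else None."""
    m = _LINE.search(line)
    if m is None:
        return None
    seconds = float(m.group("value")) * _UNITS[m.group("unit")]
    return m.group("method"), seconds, m.group("err")


def slow_calls(lines, threshold=5.0):
    """Return the parsed entries that took at least `threshold` seconds."""
    parsed = (parse_served(line) for line in lines)
    return [p for p in parsed if p is not None and p[1] >= threshold]
```

Running `slow_calls` over the log file with the default threshold of 5 seconds lists the requests that reached the eth_call timeout reported in the error messages.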

This is the heap profile of geth right before it gets killed:

[image: geth_heap]

This is the CPU profile right before it gets killed:

[image: geth_profile]

Both profiles suggest that serving these RPC requests puts the node under heavy load, and that the large increase in memory usage comes from encoding the responses to those requests.

These are the raw profiles if it will help you debug this further:

cpu_profile.pb.gz

heap_profile.pb.gz
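
For reference, profiles like these can be captured from a node started with `--pprof`. The sketch below assumes the pprof HTTP server is listening on its default address `127.0.0.1:6060` and exposes the standard Go `/debug/pprof/` paths; the helper names and output file names are illustrative.

```python
import urllib.request
from urllib.parse import urlencode


def pprof_url(kind, host="127.0.0.1", port=6060, seconds=None):
    """Build a pprof URL; `kind` is e.g. 'heap' or 'profile' (CPU)."""
    url = f"http://{host}:{port}/debug/pprof/{kind}"
    if seconds is not None:
        # CPU profiles are sampled for the given number of seconds.
        url += "?" + urlencode({"seconds": seconds})
    return url


def save_profile(url, path, timeout=120.0):
    """Download a profile and write the raw bytes to `path`."""
    with urllib.request.urlopen(url, timeout=timeout) as resp, open(path, "wb") as out:
        out.write(resp.read())


# Hypothetical usage against a running node:
#   save_profile(pprof_url("heap"), "heap_profile.pb.gz")
#   save_profile(pprof_url("profile", seconds=30), "cpu_profile.pb.gz")
```

The CPU request blocks for the sampling period, so the download timeout should be longer than `seconds`.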
