[bug]: unknown postgres query hangs on startup #9730
Comments
As you can see it's lnd's |
What parameters are you calling |
I'm not calling |
I have 5,835,721 payments. I'm not sure how many invoices I have because of #9717, but it is on a similar order of magnitude (definitely within 20%). I created the DB by sending 1 sat payments back and forth between two nodes, but there is only one channel opening between those two nodes. In the beginning I had a few different ways I was trying to ping-pong the payments and there were some problems, which is why the number of payments isn't the same as the number of invoices. Here's the amount of data the DB uses:
|
I've been looking at this SQLite version, which is much faster, and I'm able to get some relevant information out of the log file using #9734
So, it seems litd accounts is calling … What I don't get is: does litd accounts keep track of which invoices it has already subscribed to and resume from there after each restart? If so, does it need to re-do this from the beginning after the invoice DB migration from the KV to the SQL schema? Also, in #9716 (comment) I get much faster performance with … It could be that my SQLite version ran overnight once without my realizing it, and now litd is just resuming and has nothing to re-scan (because I've created no new invoices). |
OK, I've switched back to my postgres DB and am now using #9734 again. One thing I've noticed here is that the litd accounts sub-server tries to start up before the Channel Router and then they run in parallel (with SQLite, the Channel Router was started first, and the litd accounts sub-server was only started after the scan for in-flight payments completed).
|
I let the above config with postgres run for the rest of the day. I eventually got this (note the
and then CPU was not active anymore and a lot of RAM was released. I then restarted and got
which leads me to believe that
is probably true (it does keep track of where it already subscribed up to and resumes from there after each restart) and
is probably also true. It's likely SQLite will actually take even longer than postgres since
but I'd have to do more testing to confirm. One thing I am not sure of though is why we get both
and
? But either way, for now I think people need to be aware that if they use LND accounts, they will have to plan for a lot more migration time. If they don't use LND accounts but are running litd, they likely want to run litd with … Also, I'm not sure if the … Also, I've used … Is this super long delay going to happen with the first non-litd-accounts usage of |
Thanks for the investigation! I think, as you noted, this is actually an issue with the way the … This means that for that very first restart post-migration (if accounts weren't active before?), it'll effectively ask for the totality of the invoice history to be delivered. I think instead, it should pass in |
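For readers following along, here is a minimal sketch, assuming lnd's lnrpc gRPC API, of what resuming an invoice subscription from persisted indices looks like; the function and its wiring are hypothetical and not litd's actual accounts code:

```go
package main

import (
	"context"
	"log"

	"github.com/lightningnetwork/lnd/lnrpc"
	"google.golang.org/grpc"
)

// resumeInvoiceSubscription subscribes to invoice updates starting from the
// last add/settle indices that were persisted. Non-zero indices ask lnd to
// replay every invoice added or settled after those points, which is the
// "rescan" discussed above; zero values skip the backfill entirely.
func resumeInvoiceSubscription(ctx context.Context, conn *grpc.ClientConn,
	lastAddIndex, lastSettleIndex uint64) error {

	client := lnrpc.NewLightningClient(conn)

	stream, err := client.SubscribeInvoices(ctx, &lnrpc.InvoiceSubscription{
		AddIndex:    lastAddIndex,
		SettleIndex: lastSettleIndex,
	})
	if err != nil {
		return err
	}

	for {
		inv, err := stream.Recv()
		if err != nil {
			return err
		}

		// A real subscriber would persist inv.AddIndex/inv.SettleIndex
		// as it goes so the next restart can resume from here; this
		// sketch only logs them.
		log.Printf("invoice add_index=%d settle_index=%d state=%v",
			inv.AddIndex, inv.SettleIndex, inv.State)
	}
}
```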
Why can't it remember this through the migration?
For the add or settle index? What if we have some invoices that were generated and unsettled before shutdown? |
Would it be of any value to stream some kind of scanning progress to the client so they know the node is still working and not just hung up? This could be helpful for all users of |
But it seems like LITD had an invoice number stored. It did not start from 0, as seen in this log line:
it started from index |
Anyway, I agree we need to paginate here and deliver the progress in chunks to the subscriber. This would bring down memory consumption, and the subscriber could process it on the fly rather than waiting for the whole bulk at the end. |
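As an illustration of what chunked delivery could look like, here is a small sketch built on lnrpc's paginated ListInvoices, handing each page to a callback as it arrives so memory stays bounded by the page size; the pageSize parameter and process callback are made-up names, not an existing litd interface:

```go
package main

import (
	"context"

	"github.com/lightningnetwork/lnd/lnrpc"
)

// listInvoicesInChunks pages through the invoice DB using the index offset
// returned with each response and processes every page on the fly instead of
// buffering the whole result set.
func listInvoicesInChunks(ctx context.Context, client lnrpc.LightningClient,
	pageSize uint64, process func([]*lnrpc.Invoice) error) error {

	var offset uint64
	for {
		resp, err := client.ListInvoices(ctx, &lnrpc.ListInvoiceRequest{
			IndexOffset:    offset,
			NumMaxInvoices: pageSize,
		})
		if err != nil {
			return err
		}
		if len(resp.Invoices) == 0 {
			return nil
		}
		if err := process(resp.Invoices); err != nil {
			return err
		}

		// LastIndexOffset is the add index of the final invoice in
		// this page; use it as the starting offset for the next page.
		offset = resp.LastIndexOffset
	}
}
```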
The 0 actually means we will not do a rescan, so we are fine on the LITD side, but we could add a comment there as well. See: Lines 255 to 259 in c9fe051
|
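To spell out that semantic, a tiny illustrative check (not the litd code referenced above, whose names differ):

```go
package main

// shouldRescan reports whether a backfill should be requested at all. Zero
// stored indices mean nothing was ever tracked, so no rescan is requested
// from lnd; any non-zero index asks lnd to replay invoices added or settled
// after that point.
func shouldRescan(storedAddIndex, storedSettleIndex uint64) bool {
	return storedAddIndex != 0 || storedSettleIndex != 0
}
```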
So, what I think I did was
Now, there could be any point above where litd accounts did not properly record on shutdown the add and settle index that should be tracked on the next restart, but I don't think it started doing this rescan from an old add/settle index until step 13. It could be that the problem was introduced in step 3 and the migrations in steps 5 and/or 10 simply did not show it, deferring the problem from being surfaced until step 13 (I think so because whenever I did shutdowns, CPU usage looked near 0%). |
Also, I'm not sure whether it failed to record properly because I did a hard shutdown or because of a bug. |
We could also be smart enough to check whether the user has no accounts created but has still left this enabled; in that case we know we can automatically skip running
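Something like the following shape, sketched with a hypothetical accounts store rather than litd's actual API:

```go
package main

import "context"

// accountsStore is a stand-in for however accounts are persisted; the
// interface exists only to illustrate the check.
type accountsStore interface {
	NumAccounts(ctx context.Context) (int, error)
}

// shouldTrackInvoices reports whether the invoice-tracking subscription is
// worth starting: if the feature is enabled but no accounts were ever
// created, the backfill has nothing to update and can be skipped.
func shouldTrackInvoices(ctx context.Context, enabled bool,
	store accountsStore) (bool, error) {

	if !enabled {
		return false, nil
	}

	n, err := store.NumAccounts(ctx)
	if err != nil {
		return false, err
	}
	return n > 0, nil
}
```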
In #9729 we have a long delay on startup when using postgres, and a medium delay on startup with SQLite, getting through a search for in-flight HTLCs in the payments DB.
With postgres, I'm having another problem on startup. After the scan for in-flight HTLCs completes, I still have a postgres process running with 100% CPU:
I believe this process was started around the same time as, and was running in parallel with, the search for in-flight HTLCs in the payments DB. However, after that search completed, I let things run for another ~3 hours waiting for the extra postgres process to finish.
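One way to see what that backend is actually executing is PostgreSQL's pg_stat_activity view. A minimal sketch in Go using database/sql with the pgx stdlib driver (the driver choice and connection string are assumptions; running the same query from psql works just as well):

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/jackc/pgx/v5/stdlib" // registers the "pgx" driver
)

func main() {
	// Placeholder DSN; point this at the database lnd/litd is using.
	db, err := sql.Open("pgx", "postgres://user:pass@localhost:5432/lnd")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// pg_stat_activity lists every backend, how long its current query
	// has been running, and the query text itself.
	rows, err := db.Query(`
		SELECT pid, (now() - query_start)::text AS runtime, state,
		       COALESCE(query, '')
		FROM pg_stat_activity
		WHERE state IS NOT NULL
		  AND state <> 'idle'
		  AND query_start IS NOT NULL
		ORDER BY query_start`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var (
			pid            int
			runtime, state string
			query          string
		)
		if err := rows.Scan(&pid, &runtime, &state, &query); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("pid=%d runtime=%s state=%s\n%s\n\n",
			pid, runtime, state, query)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```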
I tried running a profile to see what could be going on:
I'm not sure what is going on, so I tried to kill the postgres process. After killing, I get the following in my logs:
Which leads me to believe that, because I'm actually running litd, litd might be what is interacting with the DB here too. I'm wondering: can litd connect to the database directly without going through lnd? If so, could litd not be seeing the tombstone marker on an old KV-schema DB, or not obeying the
--lnd.db.use-native-sql
argument? Or is this query that litd is making intended to be an ongoing thing that keeps litd accounts in sync with lnd invoices, and it is just implemented really inefficiently on postgres and needs to be refactored?
Also, I re-tested the above with SQLite on what should be the same DB (it was created from the same original bbolt DB); after the 17-minute search through the payments DB for in-flight HTLCs (#9729 (comment)), there was no further CPU activity. So SQLite must be finishing the extra (invoice???) workload very quickly, and I didn't even notice it.