Skip to content

suspended after Processing all files (on backup_content.control.tmp file) #203

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
mkdel opened this issue May 8, 2020 · 16 comments
Closed
Labels
Milestone

Comments

@mkdel
Copy link

mkdel commented May 8, 2020

Hi,
I did:
pg_probackup-9.6 backup -B /mnt/probackup/ --instance 'myPGhost' -b FULL --remote-host=192.168.1.1 --remote-user=postgres -U backup -d backupdb --progress -j 2

and the last line in the log file is:
2020-05-08 03:36:44 UTC: INFO: Progress: (1156427/1156427). Process file "base/16384/44722834"
So looks like it processed 100% of files.

but now is: 14:45, so it's been about 11 hours since no any new message has arrived.

on Postgres database server (myPGhost) no agent is running, but I see some CPU is consumed on Archive server

lsof -p 23877 (PID of running pg_probackup-9.6) shows
`
pg_probac 23877 postgres 3w REG 253,0 147397277 71471 /var/lib/pgsql/pg_probackup_log/myPGhost_2020-18.log

pg_probac 23877 postgres 4u IPv4 30953866 0t0 TCP archiveHost:36042->myPGhost:postgres (ESTABLISHED)

pg_probac 23877 postgres 5w REG 0,42 56614912 4348538889 /mnt/probackup/backups/myPGhost/Q9Z335/backup_content.control.tmp (172.2.1.1:/opt/backup_vol1/DB_Backups/probackup)

pg_probac 23877 postgres 9r FIFO 0,9 0t0 30953869 pipe

pg_probac 23877 postgres 10r FIFO 0,9 0t0 30957855 pipe
`

I see this file is modified all the time:
stat /mnt/probackup/backups/myPGhost/Q9Z335/backup_content.control.tmp
File: ‘/mnt/probackup/backups/myPGhost/Q9Z335/backup_content.control.tmp’
Size: 218171568 Blocks: 788480 IO Block: 1048576 regular file
Device: 2ah/42d Inode: 4348538889 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 26/postgres) Gid: ( 26/postgres)
Context: system_u:object_r:nfs_t:s0
Access: 2020-05-08 14:53:14.654392149 +0000
Modify: 2020-05-08 14:53:18.917391941 +0000
Change: 2020-05-08 14:53:18.917391941 +0000
Birth: -

Should I still wait?

@gsmolk
Copy link
Contributor

gsmolk commented May 8, 2020

Hello!
Hmm, lets take a look at its backtrace.
You have to install gdb and run the following command:
gdb -p 23877 --ex 'thread apply all bt' --batch

@mkdel
Copy link
Author

mkdel commented May 8, 2020

here:

[root@archiveHost ~]# gdb -p 23877 --ex 'thread apply all bt' --batch
[New LWP 24096]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007fbb4d529017 in pthread_join () from /lib64/libpthread.so.0

Thread 2 (Thread 0x7fbb490b7700 (LWP 24096)):
#0  0x00007fbb4d240050 in __close_nocancel () from /lib64/libc.so.6
#1  0x00007fbb4d1cb1dc in __GI__IO_file_close_it () from /lib64/libc.so.6
#2  0x00007fbb4d1be1f8 in fclose@@GLIBC_2.2.5 () from /lib64/libc.so.6
#3  0x0000000000416449 in write_backup_filelist ()
#4  0x00000000004114eb in backup_files ()
#5  0x00007fbb4d527ea5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007fbb4d24e8dd in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fbb4ddc5980 (LWP 23877)):
#0  0x00007fbb4d529017 in pthread_join () from /lib64/libpthread.so.0
#1  0x0000000000413a5e in do_backup_instance ()
#2  0x0000000000414ae9 in do_backup ()
#3  0x000000000040489d in main ()
[Inferior 1 (process 23877) detached]

@gsmolk
Copy link
Contributor

gsmolk commented May 8, 2020

Hmm, looks like pbk issued close() to file descriptor and kernel is still doing something. Flashing buffers maybe?

  1. Is file /mnt/probackup/backups/myPGhost/Q9Z335/backup_content.control.tmp is changing size?
  2. Is /mnt/probackup is located on nfs?
  3. Can you run the following command strace -f -p 23877? Is it doing something?

@mkdel
Copy link
Author

mkdel commented May 8, 2020

Answering you numbered questions

  1. For the last few hours the size of backup_content.control.tmp file is the same
  2. Yes, /mnt/probackup has mounted nfs
  3. Yes, strace shows mostly "[pid 24096] write(5,..." then freeze for a few seconds with "[pid 24096] close(5)" and the starts doing "[pid 24096] write(5," again

@gsmolk
Copy link
Contributor

gsmolk commented May 8, 2020

Ah, I get it.
The flushing of backup_content.control is firing waay too frequently.
The fix is simple, but rolling new release will take some time, so I can provide you with package with fixed binary. This way we can make sure fix is working in your case before shipping it to everyone else. Is it acceptable?

@gsmolk
Copy link
Contributor

gsmolk commented May 8, 2020

On the side note, what is the size of /mnt/probackup/backups/myPGhost/Q9Z335/backup_content.control.tmp right now?

@mkdel
Copy link
Author

mkdel commented May 8, 2020

Current size is 209M.
here is output of stat command with 6 hour difference.

[postgres@archiveHost ~/pg_probackup_log]$ stat  /mnt/probackup/backups/myPGhost/Q9Z335/backup_content.control.tmp
  File: ‘/mnt/probackup/backups/myPGhost/Q9Z335/backup_content.control.tmp’
  Size: 218171568       Blocks: 788480     IO Block: 1048576 regular file
Device: 2ah/42d Inode: 4348538890  Links: 1
Access: (0644/-rw-r--r--)  Uid: (   26/postgres)   Gid: (   26/postgres)
Context: system_u:object_r:nfs_t:s0
Access: 2020-05-08 10:12:28.689213521 +0000
Modify: 2020-05-08 10:12:32.890213316 +0000
Change: 2020-05-08 10:12:32.890213316 +0000
 Birth: -

 
[postgres@archiveHost ~/pg_probackup_log]$ stat  /mnt/probackup/backups/myPGhost/Q9Z335/backup_content.control.tmp
  File: ‘/mnt/probackup/backups/myPGhost/Q9Z335/backup_content.control.tmp’
  Size: 218171568       Blocks: 788480     IO Block: 1048576 regular file
Device: 2ah/42d Inode: 4348538889  Links: 1
Access: (0644/-rw-r--r--)  Uid: (   26/postgres)   Gid: (   26/postgres)
Context: system_u:object_r:nfs_t:s0
Access: 2020-05-08 16:32:59.008100365 +0000
Modify: 2020-05-08 16:33:03.485100147 +0000
Change: 2020-05-08 16:33:03.485100147 +0000
 Birth: -

@mkdel
Copy link
Author

mkdel commented May 8, 2020

about new package - it would be great to test it!

@gsmolk
Copy link
Contributor

gsmolk commented May 8, 2020

Great, I need OS version that you are using.

@mkdel
Copy link
Author

mkdel commented May 8, 2020

cat /etc/redhat-release
CentOS Linux release 7.7.1908 (Core)

@gsmolk
Copy link
Contributor

gsmolk commented May 8, 2020

@mkdel
Copy link
Author

mkdel commented May 8, 2020

Actually, backup is just have finished!

2020-05-08 03:36:44 UTC: INFO: Progress: (1156427/1156427). Process file "base/16384/44722834"
2020-05-08 16:43:21 UTC: INFO: Data files are transferred, time elapsed: 22h:30m
....

13 Hours Gap.

Anyway - new package was installed and backup restarted with the same parameters. It will take a while (at least this night). I'll update you when it's done and let you know how/if this gap decreased.

Thank you so much!

@gsmolk gsmolk added the bug label May 8, 2020
@gsmolk gsmolk added this to the 2.3.2 milestone May 8, 2020
@mkdel
Copy link
Author

mkdel commented May 9, 2020

Yes, this time it was way faster!
2020-05-09 04:06:01 UTC: INFO: Progress: (1159958/1159958). Process file "base/16386/2604"
2020-05-09 04:06:13 UTC: INFO: Data files are transferred, time elapsed: 9h:40m

12 seconds gap

Wow, thank you!

@gsmolk
Copy link
Contributor

gsmolk commented May 10, 2020

I'm going to add checksum calculation for backup_content.control file and implicit verification of said checksum when backup_content.control is access.

gsmolk added a commit that referenced this issue May 11, 2020
… of backup_content.control file, to detect corruption
gsmolk added a commit that referenced this issue May 11, 2020
gsmolk added a commit that referenced this issue May 11, 2020
@gsmolk
Copy link
Contributor

gsmolk commented May 12, 2020

2.3.2 is released, please update and run backup again

@gsmolk
Copy link
Contributor

gsmolk commented May 14, 2020

Fixed.

@gsmolk gsmolk closed this as completed May 14, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants