-
-
Notifications
You must be signed in to change notification settings - Fork 258
Add detach replica in shutdown function when error occurs during apply segment of journal #8688
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Add detach replica in shutdown function when error occurs during apply segment of journal #8688
Conversation
m_replicator = nullptr; | ||
if (m_attachment.hasData()) | ||
m_attachment->detach(&localStatus); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both m_replicator
and m_attachment
are RefPtr
's and nullifying them causes a release()
call which does the proper cleanup (similar to close/detach
) under the hood. So I don't see what is actually fixed. What do I miss?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It turns out that for m_replicator
close is called during release and it is not necessary to do this (I added it for the company with m_attachment, which is the problem).
There detach
is not called during release, only destroy, in which requests, events, transactions, etc. are released and the handle is also removed, but the connection to the DB itself remain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm. YAttachment::destroy()
calls destroy2()
which nullifies next
which in turn calls release()
for the underlying provider. So AFAIU the connection should be closed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right.
The problem occurs when using an external engine, even without a replication error.
It uses the context of the current replication connection and increases the reference counter, and when the main connection is nullified. In the JAttachment::release
function the counter is not equal to 0 and freeEngineData
is not called, in which the following functions should be called in the chain:
purge_attachment
release_attachment
dbb->dbb_extManager->closeAttachment
delete attInfo
~ExternalContextImpl
externalAttachment->release // In which the counter should be reduced and the external engine connection should be deleted.
This behavior occurs in RedDatabase, but I am sure that it will be the same in Firebird, since the code in this part is not different, but I cannot check, since there are problems with configuring jaybird for Firebird.
Calling the detach
function on the main connection directly calls internalDetach
and freeEngineData
, which closes the connection and releases all resources.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, this smells like a problem to me -- the fact that the external engine confuses the reference counter of the object created by user, thus making the whole ref-count-based management unreliable.
Nevertheless, I don't mind fixing this particular issue the way you suggest. However, it seems to me that nullifying the reference (and thus possibly calling release()
) is unnecessary after close()/detach()
. I have a feeling that the code should look something like this:
if (m_replicator.hasData())
{
m_replicator->close(&localStatus);
m_replicator.clear();
}
if (m_attachment.hasData())
{
m_attachment->detach(&localStatus);
m_attachment.clear();
}
Could you please double-check this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to reproduce the issue using sample UDR ?
Yes, I got a reproducible case with UDR on Firebird. I didn't think about it, thanks.
It is enough to setup asynchronous replication and add a procedure on the master:
create or alter procedure gen_rows (
start_n integer not null,
end_n integer not null
) returns (
n integer not null
)
external name 'udrcpp_example!gen_rows'
engine udr;
In this case, the segment will be applied on the replica, but the connection will remain active, if after that add the segment again using the external engine, a second connection will appear, and so on.
And maybe these two pointers should not be
RefPtr
's at all, given they're used by a single object only and destroyed explicitly...
Remade without using RefPtr
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to reproduce the issue using sample UDR ?
Yes, I got a reproducible case with UDR on Firebird. I didn't think about it, thanks.
Could you provide it here ? With all necessary steps, please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you provide it here ? With all necessary steps, please.
Setup replication.conf:
database = d:\repl\master.fdb
{
journal_directory = d:\repl\journal
journal_archive_directory = d:\repl\archive
journal_archive_timeout = 10
}
database = d:\repl\replica.fdb
{
journal_source_directory = d:\repl\replica
}
isql> create database 'd:\repl\master.fdb';
Copy DB to d:\repl\replica.fdb
and execute:
gfix -repl read_only d:\repl\replica.fdb
On master DB:
isql> alter database enable publication;
isql> alter database include all to publication;
Connect to master DB and create procedure:
create or alter procedure gen_rows (
start_n integer not null,
end_n integer not null
) returns (
n integer not null
)
external name 'udrcpp_example!gen_rows'
engine udr;
commit;
Disconnect or wait for archive segment 1.
Create new same procedure on master for create archive 2.
Copy segment 1 to d:\repl\replica\
, wait applied the segment, connect to replica and execute:
isql> select count(*) from mon$attachments;
isql> commit;
got 4 attachments:
- current attachment,
- garbage collector,
- cache writer,
- dead applier attachment.
Copy segment 2 and wait applied, after execute same query and got 5 attachments: current, garbage collector, cache writer, dead applier attachment (x2).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
Create new same procedure on master for create archive 2.
Do you mean recreate
or create
with a new name ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean
recreate
orcreate
with a new name ?
I create it with a new name, but I think it can be recreated.
When apply segment using an external engine (Jaybird/UDR), the replica is not detached by nullifying the variable, because releaseAttachment is not called
13e1e9b
to
26f2495
Compare
If an error occurs during segment apply that results in a return other than PROCESS_CONTINUE, the connection to the replica will not be detach. With asynchronous replication the error will like be repeated, result to many connections that take up RAM and hold a connection to the database.