Skip to content

Conversation

acogoluegnes
Copy link
Contributor

Starting a replica and its reader can temporarily in some cases and environment (e.g. when restarting a node on K8S, when DNS is not sorted out yet). The current logging messages can be misleading and verbose.

This commit tries to avoid crash reports when a replica reader cannot connect to its replica. This is usually a transient issue and the whole crash report does not help much and clutters the log file.

References #186

@acogoluegnes
Copy link
Contributor Author

Here is a comparison of the logging before and after the changes.

Replica reader (before):

2025-06-18T11:06:24.251836+02:00 debug: cluster_write_replication_tls: trying to connect to replica at [{127,0,0,1}]
2025-06-18T11:06:24.251884+02:00 debug: cluster_write_replication_tls [osiris_replica_reader:maybe_connect/5] trying to establish TLS connection to {127,0,0,1} using port 6081
2025-06-18T11:06:24.304796+02:00 debug: Supervisor: {<0.167.0>,tls_dyn_connection_sup}. Started: id=sender,pid=<0.168.0>.
2025-06-18T11:06:24.304872+02:00 debug: Supervisor: {<0.167.0>,tls_dyn_connection_sup}. Started: id=receiver,pid=<0.169.0>.
2025-06-18T11:06:24.354611+02:00 notice: TLS client: In state wait_cert at ssl_handshake.erl:2186 generated CLIENT ALERT: Fatal - Handshake Failure, - {bad_cert,{hostname_check_failed,{requested,{127,0,0,1}},{received,[{dNSName,"M47T1PK9QD"},{dNSName,"M47T1PK9QD"},{dNSName,"localhost"}]}}}
2025-06-18T11:06:24.355556+02:00 debug: cluster_write_replication_tls [osiris_replica_reader:maybe_connect/5] TLS connection refused (handshake failure), host:{127,0,0,1} - port: 6081
2025-06-18T11:06:24.355626+02:00 warning: cluster_write_replication_tls [osiris_replica_reader:init/1] could not connect replica reader to replica at [{127,0,0,1}] port 6081, Reason: connection_refused
2025-06-18T11:06:24.355716+02:00 error: crasher: initial call: osiris_replica_reader:init/1, pid: <0.164.0>, registered_name: [], exit: {connection_refused,[{gen_server,init_it,6,[{file,"gen_server.erl"},{line,2210}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,329}]}]}, ancestors: [osiris_replica_reader_sup,osiris_sup,<0.122.0>], message_queue_len: 0, messages: [], links: [<0.127.0>], dictionary: [], trap_exit: true, status: running, heap_size: 4185, stack_size: 29, reductions: 20857; neighbours:

Replica reader (after):

2025-06-18T11:10:06.234414+02:00 debug: cluster_write_replication_tls: trying to connect to replica at [{127,0,0,1}]
2025-06-18T11:10:06.234466+02:00 debug: cluster_write_replication_tls [osiris_replica_reader:maybe_connect/6] trying to establish TLS connection to {127,0,0,1} using port 6285
2025-06-18T11:10:06.296801+02:00 debug: Supervisor: {<0.162.0>,tls_dyn_connection_sup}. Started: id=sender,pid=<0.163.0>.
2025-06-18T11:10:06.296884+02:00 debug: Supervisor: {<0.162.0>,tls_dyn_connection_sup}. Started: id=receiver,pid=<0.164.0>.
2025-06-18T11:10:06.347450+02:00 notice: TLS client: In state wait_cert at ssl_handshake.erl:2186 generated CLIENT ALERT: Fatal - Handshake Failure, - {bad_cert,{hostname_check_failed,{requested,{127,0,0,1}},{received,[{dNSName,"M47T1PK9QD"},{dNSName,"M47T1PK9QD"},{dNSName,"localhost"}]}}}
2025-06-18T11:10:06.348363+02:00 debug: cluster_write_replication_tls [osiris_replica_reader:maybe_connect/6] TLS connection refused (handshake failure), host:{127,0,0,1} - port: 6285
2025-06-18T11:10:06.348428+02:00 info: cluster_write_replication_tls [osiris_replica_reader:maybe_connect/6] could not connect replica reader to replica on port 6285, reason(s): [{{127,0,0,1},handshake_failure}]
2025-06-18T11:10:06.348462+02:00 warning: cluster_write_replication_tls [osiris_replica_reader:init/1] could not connect replica reader to replica at [{127,0,0,1}] port 6285, Reason: connection_refused

Replica (before):

2025-06-18T11:06:24.355741+02:00 debug: cluster_write_replication_tls [osiris_replica:accept/4] Starting socket acceptor for replication over TLS
2025-06-18T11:06:24.355941+02:00 warning: cluster_write_replication_tls [osiris_replica:handle_continue/2] failed to start replica reader. Reason {connection_refused,{child,undefined,#Ref<0.585219426.2290089985.260843>,{osiris_replica_reader,start_link,[#{name => <<"cluster_write_replication_tls">>,port => 6081,hosts => [{127,0,0,1}],reference => <<"cluster_write_replication_tls">>,leader_pid => <9412.157.0>,transport => ssl,connection_token => <<204,251,140,121,189,8,237,166,95,64,82,123,223,183,109,3,153,20,98,123,61,32,131,72,20,18,45,123,11,217,191,252>>,replica_pid => <0.155.0>,start_offset => {0,empty}}]},temporary,false,5000,worker,[osiris_replica_reader]}}
2025-06-18T11:06:24.356118+02:00 error: Error in process <0.169.0> on node s2@M47T1PK9QD with exit value:, {{badmatch,{error,closed}},[{osiris_replica,accept,4,[{file,"src/osiris_replica.erl"},{line,342}]}]}

Replica (after):

2025-06-18T11:10:06.348578+02:00 debug: cluster_write_replication_tls [osiris_replica:accept/4] Starting socket acceptor for replication over TLS
2025-06-18T11:10:06.348655+02:00 warning: cluster_write_replication_tls [osiris_replica:handle_continue/2] failed to start replica reader on node 's1@M47T1PK9QD'. Reason connection_refused.

Starting a replica and its reader can temporarily in some cases and
environment (e.g. when restarting a node on K8S, when DNS is not sorted
out yet). The current logging messages can be misleading and verbose.

This commit tries to avoid crash reports when a replica reader cannot
connect to its replica. This is usually a transient issue and the whole
crash report does not help much and clutters the log file.

References #186
@acogoluegnes acogoluegnes force-pushed the replica-init-logging branch from 864c763 to ed6fd9b Compare June 18, 2025 09:59
@acogoluegnes acogoluegnes marked this pull request as ready for review June 18, 2025 12:01
@kjnilsson kjnilsson merged commit 535ff5d into main Jun 18, 2025
4 checks passed
@kjnilsson kjnilsson deleted the replica-init-logging branch June 18, 2025 12:45
@acogoluegnes acogoluegnes changed the title Improve logging during replicat initialization Improve logging during replica initialization Jun 18, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants