Skip to content

Unhandled Exceptions causes PgConnection.close to hang forever (v3) #119

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
osaxma opened this issue Aug 20, 2023 · 6 comments
Closed

Unhandled Exceptions causes PgConnection.close to hang forever (v3) #119

osaxma opened this issue Aug 20, 2023 · 6 comments

Comments

@osaxma
Copy link
Contributor

osaxma commented Aug 20, 2023

Add the following code to a file in the test folder and run it to reproduce the issue (or read the comments in the code):

import 'package:postgres/postgres_v3_experimental.dart';
import 'package:test/test.dart';

import 'docker.dart';

void main() {
  // NOTE: The Docker Container will not close after stopping this test so that needs to be done manually. 
  usePostgresDocker();

  // ignore: unused_local_variable
  late final PgConnection conn1;

  late final PgConnection conn2;

  group('description', () {
    setUpAll(() async {
      conn1 = await PgConnection.open(
        PgEndpoint(
          host: 'localhost',
          database: 'dart_test',
          username: 'dart',
          password: 'dart',
        ),
        sessionSettings: PgSessionSettings(
          onBadSslCertificate: (cert) => true,
        ),
      );

      conn2 = await PgConnection.open(
        PgEndpoint(
          host: 'localhost',
          database: 'dart_test',
          username: 'postgres',
          password: 'postgres',
        ),
        sessionSettings: PgSessionSettings(
          onBadSslCertificate: (cert) => true,
        ),
      );
    });

    test('produce error', () async {
      // get conn1 PID
      final res = await conn2
          .execute("SELECT pid FROM pg_stat_activity where usename = 'dart';");
      final conn1PID = res.first.first as int;

      // Simulate issue by terminating a connection during a query
      // ignore: unawaited_futures
      conn1.execute('select * from pg_stat_activity;'); // comment this out and a different error will appear 

      // Terminate the conn1 while the query is running
      await conn2.execute(
          'select pg_terminate_backend($conn1PID) from pg_stat_activity;');
      // this will cause the following exception:
      // PostgreSQLException (PostgreSQLSeverity.fatal 57P01: terminating connection due to administrator command )

      expect(true, true);
    });

    tearDownAll(() async {
      print('closing conn1');
      await conn1.close(); // this will never close & execution will hang here
      print('closing conn2');
      await conn2.close();
    });
  });
}

I think currently there's no direct and easy way for the user to handle such Exceptions (maybe a callback would help?). I also think PgConnection.close should check if the connection was closed/terminated/errored-out before trying a graceful termination process.


Additional Context

I initially encountered a simple issue where await conn.close(); would hang forever due to a connection blocking query used in logical replication (i.e. START_REPLICATION ...... ). I tried to force close the connection but there was no way around it. So I opened a separate connection to kill the replication process in the database in the testing teardown code as a workaround. That's when I came across the issue which is presented in the code snippet above.

While this issue appeared with me in an edge case, I still think there should be a way to handle such severe/fatal PostgreSQL errors because they may appear in other cases. Also, as mentioned earlier, the PgConnection.close should account for such scenarios and maybe have a flag for ungraceful termination (e.g. PgConnection.close(force: true)) in addition to a timeout logic (honestly i didn't try to add timeout settings yet).

cc: @isoos @simolus3

@isoos
Copy link
Owner

isoos commented Aug 20, 2023

That's worth investigating! I think we should check-in such tests into the repo with skip: true, so CI would pass, but it will be there for anybody to look into. There should be a different naming though, like test/fixme/x_test.dart? I am happy to take over a pattern from other places, but worst case a test/todo/ or a test/fixme/ directory would be sufficient...

Note: I've just pushed the above test under the test/fixme/ dir, happy to refactor if there is a better pattern.

@osaxma
Copy link
Contributor Author

osaxma commented Aug 26, 2023

I just looked a bit into this and here is what I found:

For reference, according to PostgreSQL docs SEVERITY-LEVELS, both fatal and panic levels abort the session whereas error levels abort the command (not sure what that exactly means in our context). In v2, this was handled in here:

if (exception.severity == PostgreSQLSeverity.fatal ||
exception.severity == PostgreSQLSeverity.panic) {
return _PostgreSQLConnectionStateClosed();
}

In v3, this is not handled in a variety of places. I observed two scenarios:

  • User may execute a query after a fatal/panic error:
    • issue: execute method would hang forever and the user will never know what's the issue.
    • expected: throw an error that the connection is closed/terminated.
  • User may try to close the connection after a fatal/panic error:
    • issue: close method would hang forever and the user will never know what's the issue.
    • expected: to close the connection and cleanup, gracefully when possible and ungracefully when necessary.
  • Or both could happen together.

Also, while looking into the issue, I found some other weak spots. For instance, in simple query with ignore rows, I got unhandled exceptions such as Bad state: Future already completed or Bad state: Cannot add event after closing. In the second case, the error message is added to _PgResultStreamSubscription._controller after the controller was closed. This is most likely happened because the results were successfully delivered but an error message arrived right afterwards (i.e. the controller was closed successfully but the message was added later causing an error).

As for the close case, I think the method should not try to send a terminated message when the connection was already terminated by the server (which is why it's hanging forever):

@override
Future<void> close() async {
await _operationLock.withResource(() {
// Use lock to await earlier operations
_channel.sink.add(const TerminateMessage());
});
await Future.wait([_channel.sink.close(), _serverMessages.cancel()]);
}
}

In short, whenever a fatal or panic error occur, message handlers should be aware of them and handle them properly by updating the session/connection state and passing the error to the user where applicable.


Another thought about fatal errors, assuming the hanging problem was solved, how should the user know that the connection was terminated? Once they try to execute a query? Or should there be an onError callback when establishing the connection as well (to give the user a chance to react immediately and reestablish the connection)?

One final thought -- any reason why execute and friends do not have a timeout option similar to v2?

@simolus3
Copy link
Contributor

Thanks a lot for the investigation @osaxma!!

I've fixed the problematic handling of error messages in #126, that should also add the Bad state errors (or at least the first one, I couldn't reproduce the second one yet).

(which is why it's hanging forever):

It's actually hanging because the _operationLock is already held by a previous process, hopelessly waiting for a ReadyForQueryMessage that it will never receive.

Another thought about fatal errors, assuming the hanging problem was solved, how should the user know that the connection was terminated?

That's a good point - we could expose a Future<void> get closed to get notified when the connection is closed, or an additional status getter to query the current connection state.

One final thought -- any reason why execute and friends do not have a timeout option similar to v2?

How was the timeout implemented in v2? The places I've checked (not all of them) just applied a timeout to the returned future, which would continue to block the connection when a query takes longer than the timeout, right? Is there a way to reliably cancel queries after a timeout is reached?

@osaxma
Copy link
Contributor Author

osaxma commented Aug 28, 2023

It's actually hanging because the _operationLock is already held by a previous process, hopelessly waiting for a ReadyForQueryMessage that it will never receive.

Oh my bad -- I thought the queue was empty but apparently I was looking at the wrong member in _operationLock when debugging 😅

How was the timeout implemented in v2? The places I've checked (not all of them) just applied a timeout to the returned future, which would continue to block the connection when a query takes longer than the timeout, right?

I just looked at where the timeout was handled in v2. It looks like the query is removed from the queue if it times out and nothing much (_onQueryError is an empty function).

Future<QueryResult<T>> _enqueue<T>(Query<T> query,
{int timeoutInSeconds = 30}) async {
if (_queue.add(query)) {
_connection._transitionToState(_connection._connectionState.awake());
try {
final result =
await query.future.timeout(Duration(seconds: timeoutInSeconds));
_connection._cache.add(query);
_queue.remove(query);
return result!;
} catch (e, st) {
_queue.remove(query);
await _onQueryError(query, e, st);
rethrow;
}
} else {
// wrap the synchronous future in an async future to ensure that
// the caller behaves correctly in this condition. otherwise,
// the caller would complete synchronously. This future
// will always complete as a cancellation error.
return Future(() async => (await query.future)!);
}
}

Is there a way to reliably cancel queries after a timeout is reached?

... apparently it's not straightforward according to 55.2.8. Canceling Requests in Progress: "To issue a cancel request, the frontend opens a new connection to the server and sends a CancelRequest message, rather than the StartupMessage message that would ordinarily be sent across a new connection"

@osaxma
Copy link
Contributor Author

osaxma commented Aug 28, 2023

@simolus3 or at least the first one, I couldn't reproduce the second one yet

This test (using same setup as in first comment) should reproduce the Bad state: Cannot add event after closing unhandled exception:

    test('Simple Protocol - fatal error', () async {
      // Get the PID for conn1
      final res = await conn2
          .execute("SELECT pid FROM pg_stat_activity where usename = 'dart';");
      final conn1PID = res.first.first as int;

      // ignore: unawaited_futures
      final future =
          conn1.execute('select * from pg_stat_activity;', ignoreRows: true);

      await conn2.execute(
          'select pg_terminate_backend($conn1PID) from pg_stat_activity;');
    });
>>>> Logs <<<<
postgres.connection.out: Instance of 'StartupMessage'
postgres.connection.in: Instance of 'AuthenticationMessage'
postgres.connection.out: Instance of 'SaslClientFirstMessage'
postgres.connection.in: Instance of 'AuthenticationMessage'
postgres.connection.out: Instance of 'SaslClientLastMessage'
postgres.connection.in: Instance of 'AuthenticationMessage'
postgres.connection.in: Instance of 'AuthenticationMessage'
postgres.connection.in: Instance of 'ParameterStatusMessage'
postgres.connection.in: Instance of 'ParameterStatusMessage'
postgres.connection.in: Instance of 'ParameterStatusMessage'
postgres.connection.in: Instance of 'ParameterStatusMessage'
postgres.connection.in: Instance of 'ParameterStatusMessage'
postgres.connection.in: Instance of 'ParameterStatusMessage'
postgres.connection.in: Instance of 'ParameterStatusMessage'
postgres.connection.in: Instance of 'ParameterStatusMessage'
postgres.connection.in: Instance of 'ParameterStatusMessage'
postgres.connection.in: Instance of 'ParameterStatusMessage'
postgres.connection.in: Instance of 'ParameterStatusMessage'
postgres.connection.in: Instance of 'ParameterStatusMessage'
postgres.connection.in: Instance of 'ParameterStatusMessage'
postgres.connection.in: Instance of 'BackendKeyMessage'
postgres.connection.in: ReadyForQueryMessage(state = I)
postgres.connection.out: Instance of 'QueryMessage'
postgres.connection.in: Instance of 'RowDescriptionMessage'
postgres.connection.in: Data Row Message: [null, null, [54, 52], null, null, null, [], null, null, null, null, null, null, null, null, null, null, null, null, null, [60, 105, 110, 115, 117, 102, 102, 105, 99, 105, 101, 110, 116, 32, 112, 114, 105, 118, 105, 108, 101, 103, 101, 62], null]
postgres.connection.in: Data Row Message: [null, null, [54, 53], null, [49, 48], [112, 111, 115, 116, 103, 114, 101, 115], [], null, null, null, null, null, null, null, null, null, null, null, null, null, [60, 105, 110, 115, 117, 102, 102, 105, 99, 105, 101, 110, 116, 32, 112, 114, 105, 118, 105, 108, 101, 103, 101, 62], null]
postgres.connection.in: Data Row Message: [[49, 54, 51, 56, 52], [100, 97, 114, 116, 95, 116, 101, 115, 116], [49, 49, 53], null, [49, 54, 51, 56, 53], [100, 97, 114, 116], [], [49, 55, 50, 46, 49, 55, 46, 48, 46, 49], null, [53, 53, 55, 49, 48], [50, 48, 50, 51, 45, 48, 56, 45, 50, 56, 32, 50, 50, 58, 52, 55, 58, 48, 53, 46, 48, 51, 54, 52, 52, 49, 43, 48, 48], [50, 48, 50, 51, 45, 48, 56, 45, 50, 56, 32, 50, 50, 58, 52, 55, 58, 48, 53, 46, 52, 52, 53, 54, 53, 51, 43, 48, 48], [50, 48, 50, 51, 45, 48, 56, 45, 50, 56, 32, 50, 50, 58, 52, 55, 58, 48, 53, 46, 52, 52, 53, 54, 53, 51, 43, 48, 48], [50, 48, 50, 51, 45, 48, 56, 45, 50, 56, 32, 50, 50, 58, 52, 55, 58, 48, 53, 46, 52, 52, 53, 54, 53, 52, 43, 48, 48], null, null, [97, 99, 116, 105, 118, 101], null, [55, 51, 50], null, [115, 101, 108, 101, 99, 116, 32, 42, 32, 102, 114, 111, 109, 32, 112, 103, 95, 115, 116, 97, 116, 95, 97, 99, 116, 105, 118, 105, 116, 121, 59], [99, 108, 105, 101, 110, 116, 32, 98, 97, 99, 107, 101, 110, 100]]
postgres.connection.in: Data Row Message: [[49, 54, 51, 56, 52], [100, 97, 114, 116, 95, 116, 101, 115, 116], [49, 49, 54], null, [49, 48], [112, 111, 115, 116, 103, 114, 101, 115], [], null, null, null, null, null, null, null, null, null, null, null, null, null, [60, 105, 110, 115, 117, 102, 102, 105, 99, 105, 101, 110, 116, 32, 112, 114, 105, 118, 105, 108, 101, 103, 101, 62], null]
postgres.connection.in: Data Row Message: [null, null, [54, 49], null, null, null, [], null, null, null, null, null, null, null, null, null, null, null, null, null, [60, 105, 110, 115, 117, 102, 102, 105, 99, 105, 101, 110, 116, 32, 112, 114, 105, 118, 105, 108, 101, 103, 101, 62], null]
postgres.connection.in: Data Row Message: [null, null, [54, 48], null, null, null, [], null, null, null, null, null, null, null, null, null, null, null, null, null, [60, 105, 110, 115, 117, 102, 102, 105, 99, 105, 101, 110, 116, 32, 112, 114, 105, 118, 105, 108, 101, 103, 101, 62], null]
postgres.connection.in: Data Row Message: [null, null, [54, 51], null, null, null, [], null, null, null, null, null, null, null, null, null, null, null, null, null, [60, 105, 110, 115, 117, 102, 102, 105, 99, 105, 101, 110, 116, 32, 112, 114, 105, 118, 105, 108, 101, 103, 101, 62], null]
postgres.connection.in: CommandCompleteMessage(7 affected rows)
postgres.connection.in: ReadyForQueryMessage(state = I)
postgres.connection.in: Instance of 'ErrorResponseMessage'
Bad state: Cannot add event after closing
dart:async                                      _StreamController.addError
package:postgres/src/v3/connection.dart 508:21  _PgResultStreamSubscription.handleMessage
package:postgres/src/v3/connection.dart 334:25  PgConnectionImplementation._handleMessage
dart:async                                      _EventSinkWrapper.add
test/fixme/v3_bad_state_error_test.dart 88:14   _loggingTransformer.<fn>
dart:async                                      _MultiStreamController.addSync
package:postgres/src/v3/protocol.dart 61:18     _readMessages.<fn>.<fn>.emitFinishedMessages
package:postgres/src/v3/protocol.dart 69:7      _readMessages.<fn>.<fn>.handleChunk
This test failed after it had already completed.
Make sure to use a matching library which informs the test runner
of pending async work.
dart:async                                      _StreamController.addError
package:postgres/src/v3/connection.dart 508:21  _PgResultStreamSubscription.handleMessage
package:postgres/src/v3/connection.dart 334:25  PgConnectionImplementation._handleMessage
dart:async                                      _EventSinkWrapper.add
test/fixme/v3_bad_state_error_test.dart 88:14   _loggingTransformer.<fn>
dart:async                                      _MultiStreamController.addSync
package:postgres/src/v3/protocol.dart 61:18     _readMessages.<fn>.<fn>.emitFinishedMessages
package:postgres/src/v3/protocol.dart 69:7      _readMessages.<fn>.<fn>.handleChunk

✓ service-side connection close Simple Protocol - fatal error
postgres.connection.out: Instance of 'TerminateMessage'

Exited (1).

@osaxma
Copy link
Contributor Author

osaxma commented Nov 2, 2023

I believe this issue has been fully addressed.

Thanks @simolus3 for the fixes!

@osaxma osaxma closed this as completed Nov 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants