random crash at aws_task_scheduler_cancel_task #146


Closed
elad-ep opened this issue Aug 25, 2020 · 11 comments
Labels
bug This issue is a bug. investigating This issue is being investigated and/or work is in progress to resolve the issue.

Comments

@elad-ep

elad-ep commented Aug 25, 2020

Hi, I'm using SDK version 1.7.4. There is a random crash on macOS. Stack:

Thread 17 Crashed:
0   ???                           	000000000000000000 0 + 0
1   	                          	0x0000000107deace1 aws_task_scheduler_cancel_task + 177
2   	                          	0x0000000107e13c4c s_on_shutdown_completion_task + 76
3   	                          	0x0000000107dea92c s_run_all + 348
4   	                          	0x0000000107e11be7 s_event_thread_main + 1863
5   	                          	0x0000000107de8c58 thread_fn + 88
6   libsystem_pthread.dylib       	0x00007fff6ede2109 _pthread_start + 148
7   libsystem_pthread.dylib       	0x00007fff6edddb8b thread_start + 15

I don't have logs from the crash, and the crash is quite rare. I'm currently running again with logging enabled, hoping to reproduce the crash in the next few days. But maybe you can track down the issue from the stack above even before the logs arrive? Thanks...
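For reference, CRT-level logging can be enabled at startup through the SDK's `ApiHandle`. A minimal sketch, assuming a recent aws-iot-device-sdk-cpp-v2 where `ApiHandle::InitializeLogging` is available (the exact signature and the log file path are assumptions and may differ across SDK versions):

```cpp
#include <aws/crt/Api.h>

int main() {
    Aws::Crt::ApiHandle apiHandle;

    // Write Trace-level CRT logs to a file so that rare crashes
    // leave a trail that can be attached to a bug report.
    // LogLevel and the destination path are illustrative.
    apiHandle.InitializeLogging(Aws::Crt::LogLevel::Trace, "/tmp/aws-sdk.log");

    // ... set up the MQTT connection etc. as usual ...
    return 0;
}
```

Trace-level output is verbose; for long-running reproduction attempts a lower level such as `Debug` may be more practical.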

@elad-ep elad-ep added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Aug 25, 2020
@TingDaoK
Contributor

Sorry about that. Unfortunately, the stack trace is not very informative. It sounds like a race condition, which may be hard to track down... If you can provide the logs, that would be super helpful. In the meantime, I'd recommend updating to the latest version first.

@jmklix jmklix added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 2 days. and removed needs-triage This issue or PR still needs to be triaged. labels Aug 25, 2020
@elad-ep
Author

elad-ep commented Aug 26, 2020

We encountered 3 additional crashes yesterday, all with the same stack trace. Unfortunately, that stack is different from the first one I posted in this issue, so maybe there are two different problems. New crash:

Thread 12 Crashed:
0   	                          	0x000000010c6c0579 aws_input_stream_destroy + 9
1   	                          	0x000000010c6d966f Aws::Crt::Http::HttpMessage::~HttpMessage() + 47
2   	                          	0x000000010c6e3ce7 std::__1::__shared_ptr_pointer<Aws::Crt::Http::HttpRequest*, Aws::Crt::Mqtt::MqttConnection::s_onWebsocketHandshake(aws_http_message*, void*, void (*)(aws_http_message*, int, void*), void*)::$_0, std::__1::allocator<Aws::Crt::Http::HttpRequest> >::__on_zero_shared() + 23
3   	                          	0x000000010c6d2c25 Aws::Crt::Auth::HttpSignerCallbackData::~HttpSignerCallbackData() + 53
4   	                          	0x000000010c6d2b74 Aws::Crt::Auth::s_http_signing_complete_fn(aws_signing_result*, int, void*) + 84
5   	                          	0x000000010c678630 s_perform_signing + 48
6   	                          	0x000000010c673041 s_x509_finalize_get_credentials_query + 193
7   	                          	0x000000010c68c444 s_aws_http_connection_manager_execute_transaction + 2068
8   	                          	0x000000010c68d4f9 s_aws_http_connection_manager_on_connection_setup + 745
9   	                          	0x000000010c68a709 s_client_bootstrap_on_channel_setup + 137
10  	                          	0x000000010c6b20cd s_on_host_resolved + 957
11  	                          	0x000000010c6b942b resolver_thread_fn + 2107
12  	                          	0x000000010c685c58 thread_fn + 88
13  libsystem_pthread.dylib       	0x00007fff6a6e5e65 _pthread_start + 148
14  libsystem_pthread.dylib       	0x00007fff6a6e183b thread_start + 15

logs:

crashes happened at:
2020-08-25 19:12:57
2020-08-25 17:48:26
2020-08-25 13:58:21

@elad-ep
Author

elad-ep commented Aug 26, 2020

OK, we now have crashes from another machine, with the original stack trace:

Thread 18 Crashed:
0   ???                           	000000000000000000 0 + 0
1   					0x000000010d174ce1 aws_task_scheduler_cancel_task + 177
2   	                          	0x000000010d19dc4c s_on_shutdown_completion_task + 76
3   	                          	0x000000010d17492c s_run_all + 348
4   	                          	0x000000010d19bbe7 s_event_thread_main + 1863
5   	                          	0x000000010d172c58 thread_fn + 88
6   libsystem_pthread.dylib       	0x00007fff6d866109 _pthread_start + 148
7   libsystem_pthread.dylib       	0x00007fff6d861b8b thread_start + 15

logs:

crash times:
2020-08-26 06:17:58
2020-08-26 04:16:57

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 2 days. label Aug 26, 2020
@TingDaoK TingDaoK added the investigating This issue is being investigated and/or work is in progress to resolve the issue. label Aug 26, 2020
@TingDaoK
Contributor

Please remember to redact your credentials when you share your logs. And please rotate your credentials now.

@TingDaoK
Contributor

Thank you for the information, which helped us uncover one bug. The crash from aws_input_stream_destroy should now be fixed.
However, the aws_task_scheduler_cancel_task crash probably has a different cause. The log doesn't show which task got cancelled and led to the crash... And as I went through the code, I didn't see any logic error that could cause it...
Besides that, I tried to reproduce the scenario preceding your crash, where the server denied the upgrade to websocket, but it worked fine for me. It's hard to say what the cause is... Possibly something in your implementation?
So, for now, I'll cut a release with the fix for aws_input_stream_destroy within this week. Please try the new release once it's out. If the random crash in aws_task_scheduler_cancel_task still happens for you, we'll come back to it...

TingDaoK added a commit to awslabs/aws-crt-cpp that referenced this issue Aug 27, 2020
*Issue #, if available:*
aws/aws-iot-device-sdk-cpp-v2#146
If the underlying http_message gets destroyed before the destructor of HttpMessage runs, the destructor may crash when it accesses the underlying http_message.

*Description of changes:*
- Keep the underlying http_message alive until the destructor is called
@elad-ep
Author

elad-ep commented Aug 28, 2020

Thanks! Please note that our implementation is the same for Windows and Mac, but the crashes so far happen only on Mac. Is the fix relevant only for Mac, or for Windows as well?

Do you want to add more logging in the new build, so that if the aws_task_scheduler_cancel_task crash reproduces you'll have more information?

@TingDaoK
Contributor

We do log right before aws_task_scheduler_cancel_task; however, I cannot find any possible reason in the log that would lead to a crash... But I'll check with the team.

The fix that just got merged is not platform-specific. However, the aws_task_scheduler_cancel_task crash is very likely platform related. We will come back to this later.

@TingDaoK
Contributor

The new release is published. Also, the logging is not fully synchronized... So it's possible the crash happens before its cause is written to the log, which makes this more complicated.
So here is my suggestion: if this isn't really blocking you, or you can work around it in your implementation, I'd suggest you just use the workaround.
Alternatively, you may want to give us more information on how to reproduce the crash. As far as I know, you are using websockets on Mac, and the crash happens rarely. From the log, the crash happens after the server rejects the upgrade to websocket. Is that correct? If so, we will try to reproduce the crash first and get back to you once we can reproduce the error.

@elad-ep
Author

elad-ep commented Sep 1, 2020

Thanks, we are currently testing the new release. If we encounter any crash again, I'll send you the logs and details about our implementation. Thanks!

@elad-ep
Author

elad-ep commented Sep 8, 2020

It seems the crash is resolved. The issue can be closed, thanks!

@TingDaoK
Contributor

TingDaoK commented Sep 8, 2020

Really? YEAH!!! But I still think the other crash was something different 🤯 Anyway, if you encounter the crash again, feel free to reopen this issue, or open another one to report it! Thank you!

@TingDaoK TingDaoK closed this as completed Sep 8, 2020