
CIRCUITPY_PYSTACK_SIZE=4000 crashes ESP32S3 boards to safe mode #7643


Closed
RetiredWizard opened this issue Feb 24, 2023 · 27 comments · Fixed by #7880

Comments

@RetiredWizard

RetiredWizard commented Feb 24, 2023

CircuitPython version

Adafruit CircuitPython 8.1.0-alpha.2-17-g00a03c323 on 2023-02-23; Adafruit Feather ESP32S3 4MB Flash 2MB PSRAM with ESP32S3
Adafruit CircuitPython 8.1.0-alpha.2-17-g00a03c323 on 2023-02-23; BPI-PicoW-S3 with ESP32S3

Code/REPL

No code.py file is executed.

Behavior

On ESP32-S3 boards, setting CIRCUITPY_PYSTACK_SIZE=4000 causes a crash into safe mode.

Auto-reload is off.
Running in safe mode! Not running saved code.

You are in safe mode because:
CircuitPython core code crashed hard. Whoops!
Fault detected by hardware.
Please file an issue with your program at https://github.com/adafruit/circuitpython/issues.
Press reset to exit safe mode.

Press any key to enter the REPL. Use CTRL-D to reload.

A value of 3500 works fine.

Description

I've used values up to 7000 on a board with the Raspberry Pi RP2040 microcontroller without a similar crash.

This may be a total red herring, but building with micropy_stackless=1 prevents the crash. However, setting CIRCUITPY_PYSTACK_SIZE > 3500 still doesn't seem to improve the stack limitations (EDIT: I'm not sure about this now; my stack depth test may have been faulty). I remember @bill88t mentioning that on some boards the parameter had an effective upper limit, but I can't find the note right now. Perhaps the ESP32-S3 was one of those boards.

Additional information

No response

@bill88t

bill88t commented Feb 24, 2023

On S2 I could go up to a megabyte before I added checks. I assumed I hit some integer limit there.

The effective upper limit is 3700 for S2 and 7000 for RP2.

I currently have COVID and am pretty much unable to fix this. However, can you please find the exact value it fails on?

The fix would pretty much be defining an upper limit for every MCU. If the configured value is higher than that, lower it to the maximum.
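A minimal sketch of that clamping rule, written in Python for readability (the real check would live in the C supervisor code). The MAX_PYSTACK table and clamp_pystack_size are hypothetical names; the table holds only the two limits quoted above, and leaving an unknown MCU's value untouched is an assumption, not something decided in this thread.

```python
# Hypothetical per-MCU upper limits for CIRCUITPY_PYSTACK_SIZE, in bytes.
# Values are the ones reported in this thread, not taken from the CircuitPython source.
MAX_PYSTACK = {
    "esp32s2": 3700,
    "rp2040": 7000,
}


def clamp_pystack_size(requested, mcu):
    """Return the pystack size to actually use on `mcu`.

    If a limit is known for the MCU, cap the requested value at it.
    Assumption: an MCU with no known limit keeps the requested value.
    """
    limit = MAX_PYSTACK.get(mcu)
    if limit is None:
        return requested
    return min(requested, limit)


print(clamp_pystack_size(12288, "esp32s2"))  # 3700
print(clamp_pystack_size(3000, "rp2040"))    # 3000
```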

@bill88t

bill88t commented Feb 24, 2023

My stack depth test:

def test(no):
    print(no)
    try:
        test(no+1)
    except RuntimeError:
        print("Done")
test(0)
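The test above prints every level and relies on the interpreter raising RuntimeError once the pystack is exhausted. A variant that returns the deepest level reached instead of printing each one makes runs with different CIRCUITPY_PYSTACK_SIZE values easier to compare. The name max_depth is only illustrative. It also runs unchanged on desktop CPython, where RecursionError is a subclass of RuntimeError.

```python
def max_depth(no=0):
    """Recurse until the interpreter refuses another call, then report how deep we got."""
    try:
        return max_depth(no + 1)
    except RuntimeError:
        # The call to max_depth(no + 1) failed, so `no` is the deepest level that ran.
        return no


print("max depth:", max_depth())
```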

@RetiredWizard
Author

Rest up and get better. This functionality is awesome and absolutely functional even with this issue. I'm not around much this weekend, but I'll try to nail down the exact setting that crashes when I have a chance. Thanks SO much for this work 😁.

@tannewt tannewt added this to the 8.1.0 milestone Feb 24, 2023
@RetiredWizard
Author

On the S3, it seems that a setting of 3504 or higher leads to a hard crash into safe mode if the stack depth test is run. 3503 or lower seems stable.
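Narrowing a boundary like 3503/3504 down by hand is a binary search over the setting. A sketch of the bookkeeping, assuming a hypothetical crashes(size) predicate that you answer yourself by editing settings.toml, resetting the board and running the stack depth test. It also assumes every size above the threshold crashes deterministically, which later comments suggest may not hold on every ESP chip.

```python
def first_crashing_size(crashes, good, bad):
    """Binary-search for the smallest pystack size that crashes.

    Preconditions (not re-checked, to save board resets):
      crashes(good) is False and crashes(bad) is True.
    `crashes` is a hypothetical predicate: set CIRCUITPY_PYSTACK_SIZE,
    reset, run the depth test, report whether it ended in safe mode.
    """
    while bad - good > 1:
        mid = (good + bad) // 2
        if crashes(mid):
            bad = mid   # mid crashes: the threshold is at or below mid
        else:
            good = mid  # mid works: the threshold is above mid
    return bad


# Simulated board whose threshold is 3504, as reported above for the S3.
print(first_crashing_size(lambda size: size >= 3504, good=3500, bad=4000))  # 3504
```

Each loop iteration halves the interval, so going from a 500-byte window to the exact value takes about nine resets.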

@tannewt
Member

tannewt commented Feb 27, 2023

One thing of interest would be to print out the memory addresses that are allocated, to see what they are near.

@RetiredWizard
Author

RetiredWizard commented Mar 1, 2023

I built this for an ESP32-S3-DevKitC-1-N8R2. The crash limit was actually a little higher, but I haven't spent any time looking at that yet. I set the pystack size to 3600 and got the following traceback:

***ERROR*** A stack overflow in task main has been detected.


Backtrace: 0x40378646:0x3fcf2da0 0x4038262d:0x3fcf2dc0 0x40385daa:0x3fcf2de0 0x403843b0:0x3fcf2e60 0x403826dc:0x3fcf2e90 0x403826d2:0x2d797063 |<-CORRUPTED




ELF file SHA256: 199f30faa96abdc6

CPU halted.

I tried running the decode_backtrace function, but the board doesn't seem to have the subprocess library built in. I also don't know whether it would be of much value anyway, since the traceback is flagged as corrupted.
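The backtrace does not have to be decoded on the board. A host-side sketch that splits an ESP-IDF "Backtrace:" line into PC:SP pairs and notes whether the panic handler flagged it as corrupted. The PCs it prints could then be fed to the toolchain's addr2line together with the matching firmware ELF; the exact tool name (for example xtensa-esp32s3-elf-addr2line) depends on the installed ESP-IDF toolchain, so treat it as an assumption.

```python
import re

# One "PC:SP" frame, e.g. 0x40378646:0x3fcf2da0
FRAME_RE = re.compile(r"0x([0-9a-fA-F]+):0x([0-9a-fA-F]+)")


def parse_backtrace(line):
    """Return (frames, corrupted) for an ESP-IDF backtrace line.

    frames is a list of (pc, sp) integer pairs in the order printed;
    corrupted is True when the line ends with the '|<-CORRUPTED' marker.
    """
    corrupted = "|<-CORRUPTED" in line
    frames = [(int(pc, 16), int(sp, 16)) for pc, sp in FRAME_RE.findall(line)]
    return frames, corrupted


BACKTRACE = ("Backtrace: 0x40378646:0x3fcf2da0 0x4038262d:0x3fcf2dc0 "
             "0x40385daa:0x3fcf2de0 0x403843b0:0x3fcf2e60 "
             "0x403826dc:0x3fcf2e90 0x403826d2:0x2d797063 |<-CORRUPTED")

frames, corrupted = parse_backtrace(BACKTRACE)
print("corrupted:", corrupted)
# Space-separated PCs, ready to paste into an addr2line command line.
print(" ".join(hex(pc) for pc, _sp in frames))
```

Only the last frame follows the marker's warning; the earlier PCs are usually still worth decoding.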

@RetiredWizard
Author

One interest would be to print out the memory address that are allocated to see what they are near.

I'm not very fluent in C pointer addresses, but I used the following to print what I'm hoping is the pystack memory address:
mp_printf(&mp_plat_print, "Pystack address %d",(mp_int_t) pystack);
and got the following:
Pystack address 1070215376
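The %d format prints the address in decimal, which is hard to compare with the hexadecimal addresses in the backtrace. Converting it on the host gives the familiar form. Printing it in hex directly from C should also work, assuming the port's mp_printf accepts %x the way MicroPython's does.

```python
pystack_addr = 1070215376  # value printed by the mp_printf call above
print(hex(pystack_addr))   # 0x3fca30d0
```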

@tannewt
Member

tannewt commented Mar 1, 2023

I think the crash makes it pretty clear that we're not checking that we're within the C stack. There is stack checking code, but it's probably not checking both the C stack and the pystack.

@bill88t

bill88t commented Mar 4, 2023

I have finished reading the CircuitPython source for all the stack code (pystack, cstack, heap).
The C stack is allocated before the pystack. Starting from main.c:416, the call to stack_resize (supervisor/shared/stack.c:75) in turn calls allocate_stack (supervisor/shared/stack.c:46).
This function is similar to the other stack allocations, using allocate_memory for a generic supervisor_allocation.
As a reminder, allocate_memory is 'safe'. Implemented in supervisor/shared/memory.c:233, it uses allocate_memory_node, which keeps track of handed-out records (using high/low_head) and should allocate unique, non-overlapping blocks.
Since these are all supervisor_allocations, an overflow into the C stack would mean that something has gone horribly wrong with the allocations.
Even so, the allocations always happen in the following order: cstack -> pystack -> heap. So if anything, it should be impossible for the pystack to 'overflow' into the cstack.
I can't use the pointer addresses since I don't have and can't get the board.
I will build and try to reproduce it on C3 and rp2.
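The argument above rests on the supervisor allocations never overlapping. Given (start, end) pairs printed from a debug build, that property is easy to check on the host. A sketch; overlapping_pairs is a hypothetical helper, and treating each printed end as exclusive (half-open ranges) is an assumption.

```python
def overlapping_pairs(regions):
    """Return every pair of region names whose half-open [start, end) ranges overlap."""
    names = list(regions)
    clashes = []
    for i, name_a in enumerate(names):
        start_a, end_a = regions[name_a]
        for name_b in names[i + 1:]:
            start_b, end_b = regions[name_b]
            # Two half-open ranges overlap when each one starts before the other ends.
            if start_a < end_b and start_b < end_a:
                clashes.append((name_a, name_b))
    return clashes


# Pystack and heap ranges as printed from an S2 debug build later in this thread.
print(overlapping_pairs({
    "pystack": (0x3F78027C, 0x3F78327C),
    "heap": (0x3F783284, 0x3FF80000),
}))
```

An empty list means no two of the listed regions share an address.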

@bill88t

bill88t commented Mar 4, 2023

The following build is the latest master, built manually because I just reinstalled my desktop and was testing it.

[bill88t@KeyFalse | espressif]> time beetle-cleanmake
[ ... ]
real    0m41,789s
user    4m51,394s
sys     1m18,136s

Quite a bit faster than my pi400.

Reproduced on C3, with a pystack of just 2000!!

Adafruit CircuitPython 8.1.0-beta.0-4-g8a1006999 on 2023-03-04; DFRobot Beetle ESP32-C3 with ESP32-C3FN4
>>> def test(no):
...     print(no)
...     try:
...         test(no+1)
...     except:
...         print("Done")
... 
>>> test(0)
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
Done
>>> 
Auto-reload is off.
Running in safe mode! Not running saved code.

You are in safe mode because:
CircuitPython core code crashed hard. Whoops!
Fault detected by hardware.
Please file an issue with your program at https://github.com/adafruit/circuitpython/issues.
Press reset to exit safe mode.

Press any key to enter the REPL. Use CTRL-D to reload.

@bill88t

bill88t commented Mar 4, 2023

It boots and works until you access that memory. The stock pystack size works just fine, so as an emergency workaround I suggest disabling CIRCUITPY_SETTABLE_PYSTACK for C3 and S3 until this issue is solved.

@tannewt
Member

tannewt commented Mar 6, 2023

I'm ok disabling it for boards that crash instead of setting an arbitrary limit.

@bill88t

bill88t commented Mar 6, 2023

I want to fix it so that all boards can use it up to their limits.
For now, however, I will create a PR that disables settable pystack on all C3 and S3 boards as a temporary patch.
When I manage to fix this, I will include the reverts in the fix PR.

@bill88t

bill88t commented Mar 10, 2023

I managed to make it happen on S2 too, though it seems less frequent.
Stack size 7k.

So it affects all ESP chips.

Also, the effective maximum on RP2 has changed. RP2 is still 100% stable.

def test(no):
  print(no)
  try:
    test(no+1)
  except RuntimeError:
    pass
while True:
  test(0)

I will look into it more.

@bill88t

bill88t commented Mar 11, 2023

I hooked up the S2's debug UART to my Pi 400, produced a debug build, and nothing. It doesn't crash now, even with a 12k stack.
My best guess right now is that it's an optimisation error.


@bill88t

bill88t commented Mar 13, 2023

nRF is unaffected, so it's probably an ESP-only thing. It's going to be some optimisation thing and I will cry.

@tannewt
Member

tannewt commented Mar 14, 2023

nRF doesn't have any stack checking besides what CircuitPython does. ESP has FreeRTOS to check the stack as well. I wouldn't be so sure it's esp-only. nRF could have the same issue but not fail deterministically.
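The difference tannewt describes is that FreeRTOS can notice when a task has overrun its stack, while a port without such a check just keeps running on corrupted memory. A simplified Python model of canary-style checking, similar in spirit to FreeRTOS's configCHECK_FOR_STACK_OVERFLOW method 2: new task stacks are filled with 0xA5, and the last 16 bytes at the far end are verified later. ModelStack and its sizes are made up for illustration, and the thread does not say which mechanism ESP-IDF used to print "A stack overflow in task main has been detected".

```python
FILL_BYTE = 0xA5  # fill value FreeRTOS writes into a new task stack
CANARY_LEN = 16   # method 2 inspects the last 16 bytes of the stack


class ModelStack:
    """A downward-growing stack in a bytearray; index 0 is the far end."""

    def __init__(self, size):
        self.mem = bytearray([FILL_BYTE]) * size
        self.sp = size  # empty stack: the first push lands just below `size`

    def push(self, nbytes, value=0x00):
        # Like native code, nothing stops the write from running past the end.
        # Writes below index 0 would hit other memory; this model simply drops them.
        start = max(self.sp - nbytes, 0)
        for i in range(start, self.sp):
            self.mem[i] = value
        self.sp -= nbytes

    def canary_intact(self):
        return all(b == FILL_BYTE for b in self.mem[:CANARY_LEN])


stack = ModelStack(256)
stack.push(200)
print(stack.canary_intact())  # True: the far end is untouched
stack.push(100)
print(stack.canary_intact())  # False: the overflow overwrote the canary
```

On a port with no equivalent check, the second push would go unnoticed and the failure would show up later, if at all, which fits the "not fail deterministically" remark.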

@bill88t

bill88t commented Mar 14, 2023

import gc
a = list()
while True:
  try:
    a.append("a")
  except:
    a.pop()
    gc.collect()
    print("done")
    break
b = """
for i in a:
  if i != "a":
    print("err")
"""
exec(b)

def test(no):
  print(no)
  try:
    test(no+1)
  except RuntimeError:
    print("Done")
for i in range(250):
    test(0)
exec(b)

Passes just fine.

@bill88t

bill88t commented Mar 14, 2023

I have let it run for quite a while now. The pystack is not corrupted and the system is stable. Uptime: 6 hours.
nRF 'is' unaffected.

@bill88t

bill88t commented Mar 15, 2023

I am still trying my luck with S2. No matter what I do, a debug build will NEVER crash.

However I discovered another way to reproduce this issue immediately (unless you are using debug builds):

def test(no):
  try:
    test(no+1)
  except RuntimeError:
    print("Done")

a = "for i in range(10): test(0)"
for j in range(20): exec(a)

It immediately crashes, after printing a single "Done".

I have tried my luck with:

#pragma GCC push_options
#pragma GCC optimize("O0")
/* suspect function(s) go here, compiled without optimisation */
#pragma GCC pop_options

To pinpoint the faulty function, without much success.

@bill88t

bill88t commented Mar 15, 2023

I have news.

By changing ports/espressif/makefile:124 to OPTIMIZATION_FLAGS ?= -O2, the bug is reproduced on debug builds.
So it is 101% an optimisation issue. However, I have not quite determined the fault just yet.

I have toyed around with the S2's debug UART and managed to get logs from main onto it.
Anyway.

With pystack allocated to

 start = 0x3f78027c
  end  = 0x3f78327c

(CIRCUITPY_PYSTACK_SIZE=12288. I did check the sizes and addresses across functions; they are all valid and not incorrectly optimised.)
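A quick host-side sanity check that the printed bounds match the configured size, treating the printed end as one past the last byte:

```python
PYSTACK_START = 0x3F78027C
PYSTACK_END = 0x3F78327C

size = PYSTACK_END - PYSTACK_START
print(size, hex(size))  # 12288 0x3000
assert size == 12288    # matches CIRCUITPY_PYSTACK_SIZE
```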

The crash looks something like:

***ERROR*** A stack overflow in task main has been detected.


Backtrace: 0x4002bdc6:0x3ffe2c40 0x4003309d:0x3ffe2c60 0x40035b0a:0x3ffe2c80 0x4003497a:0x3ffe2d00 0x40033190:0x3ffe2d20 0x40033142:0x3f780328 |<-CORRUPTED




ELF file SHA256: 49f4e038aa7ebd35

CPU halted.

All the above addresses are waaay off from the pystack.

This indicates the issue is probably elsewhere.

@bill88t

bill88t commented Mar 15, 2023

Added a heap pointer print:

pystack: start = 0x3f78027c
pystack:  end  = 0x3f78327c
heap:    start = 0x3f783284
heap:     end  = 0x3ff80000

***ERROR*** A stack overflow in task main has been detected.


Backtrace: 0x4002bdc6:0x3ffe2c80 0x4003309d:0x3ffe2ca0 0x40035b0a:0x3ffe2cc0 0x4003497a:0x3ffe2d40 0x40033190:0x3ffe2d60 0x40033142:0x3ffe2d90 0x4009cb79:0x00000000 |<-CORRUPTED




ELF file SHA256: e8cb997e27dd2b38

CPU halted.

I do not know how to decode the backtrace, but some of the heap region is in there.
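One way to see what each backtrace address is near is to locate it against the regions that were just printed. A host-side sketch; locate and REGIONS are hypothetical names, the table contains only the pystack and heap ranges above, and treating "end" as exclusive is an assumption. Any other regions (C stack, task stacks) would have to be added once their bounds are printed.

```python
REGIONS = {
    "pystack": (0x3F78027C, 0x3F78327C),
    "heap": (0x3F783284, 0x3FF80000),
}


def locate(addr, regions=REGIONS):
    """Say whether `addr` is inside a region, else how far it is from the nearest one."""
    best = None
    for name, (start, end) in regions.items():
        if start <= addr < end:
            return "inside " + name
        # Bytes before the first byte, or past the last byte, of this region.
        gap = start - addr if addr < start else addr - end + 1
        if best is None or gap < best[0]:
            best = (gap, name)
    return "%d bytes from %s" % best


# SP values from the second backtrace above.
for sp in (0x3FFE2C80, 0x3FFE2CA0, 0x3FFE2CC0, 0x3FFE2D40, 0x3FFE2D60, 0x3FFE2D90):
    print(hex(sp), locate(sp))
```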

@gneverov

The original issue description suggests that merely booting with CIRCUITPY_PYSTACK_SIZE=4000 will cause a problem on ESP32. However, most of the subsequent comments talk about running a test program that causes a stack overflow to reproduce the problem. Which one is it?

@bill88t

bill88t commented Mar 28, 2023

Sir. I have no idea at all!

Different ESPs behave differently with this bug.
Meanwhile, I'm waiting for my S3 board to arrive so I can test on it.
S2 is a pain to debug and C3 is cursed.

Current guesstimate: the heap overflows to the right.

@RetiredWizard
Author

I just grabbed the latest bits for Adafruit CircuitPython 8.1.0-beta.0-81-g703b8b227 on 2023-03-29; Adafruit Feather ESP32S3 4MB Flash 2MB PSRAM with ESP32S3 and setting CIRCUITPY_PYSTACK_SIZE=4000 in settings.toml does still cause a crash into safe mode on power up. Setting the value to 3500 results in a normal startup. I have tested other ESP boards as well and they tend to power up okay but crash when exercised with one of the test programs.

@tannewt tannewt self-assigned this Apr 19, 2023
dhalbert pushed a commit that referenced this issue Apr 20, 2023
PicoDVI in CP support 640x480 and 800x480 on Feather DVI, Pico and
Pico W. 1 and 2 bit grayscale are full resolution. 8 and 16 bit
color are half resolution.

Memory layout is modified to give the top most 4k of ram to the
second core. Its MPU is used to prevent flash access after startup.

The port saved word is moved to a watchdog scratch register so that
it doesn't get overwritten by other things in RAM.

Right align status bar and scroll area. This normally gives a few
pixels of padding on the left hand side and improves the odds it is
readable in a case. Fixes #7562

Fixes c stack checking. The length was correct but the top was being
set to the current stack pointer instead of the correct top.
Fixes #7643

This makes Bitmap subscr raise IndexError instead of ValueError
when the index arguments are wrong.
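The C stack checking fix in that commit message can be illustrated with a simplified model of a downward-growing stack. The overflow limit is computed as top minus length, so if "top" is taken from the stack pointer at the moment checking is set up instead of from the real top, the computed limit lands below the real end of the stack by however much stack was already in use. The functions and addresses below are made up for illustration and are not the CircuitPython implementation.

```python
def stack_limit(top, length):
    """Lowest valid address of a stack of `length` bytes growing down from `top`."""
    return top - length


def overflow_detected(sp, top, length):
    return sp < stack_limit(top, length)


# Hypothetical layout: a 24 KiB stack whose real top is REAL_TOP.
REAL_TOP = 0x3FCF8000
LENGTH = 0x6000
REAL_BOTTOM = REAL_TOP - LENGTH

# By the time checking is configured, 2 KiB of stack is already in use.
SP_AT_SETUP = REAL_TOP - 0x800

# A stack pointer that has run 256 bytes past the real end of the stack.
sp = REAL_BOTTOM - 0x100

print("top = real top:   ", overflow_detected(sp, REAL_TOP, LENGTH))     # True
print("top = current SP: ", overflow_detected(sp, SP_AT_SETUP, LENGTH))  # False
```

With the buggy top, an overrun of up to REAL_TOP - SP_AT_SETUP bytes (2 KiB here) goes unreported, which is the kind of gap that would let native code write past the task stack before any check fires.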
@bill88t

bill88t commented Jun 5, 2023

Guys. I have some very painful news.

CIRCUITPY_PYSTACK_SIZE=12288
Adafruit CircuitPython 8.2.0-beta.0 on 2023-05-24; Adafruit Feather ESP32-S3 TFT with ESP32S3

>>> from time import sleep
>>> d=0
>>> a = """try:
...     sleep(0.1)
...     print(d)
...     d += 1
...     exec(a)
... except KeyboardInterrupt:
...     pass
... """
>>> exec(a)
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24

[01:14:13.515] Disconnected
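Each level of that test nests another exec() call, and the test never stops on its own. A bounded variant makes it possible to raise the depth one run at a time and find where the crash starts. This is a sketch: nested_exec is a hypothetical helper, and it passes an explicit globals dict to exec() so the nested snippet can see the function and its arguments.

```python
def nested_exec(depth, limit):
    """Recurse through exec() until `limit` levels are reached; return the depth reached."""
    if depth >= limit:
        return depth
    scope = {"nested_exec": nested_exec, "depth": depth, "limit": limit}
    # Each level compiles and runs a new snippet, like the exec(a) chain above.
    exec("result = nested_exec(depth + 1, limit)", scope)
    return scope["result"]


for limit in (5, 10, 20):
    print(limit, nested_exec(0, limit))
```

On a board that resets partway through, the last limit printed brackets the depth at which the crash starts.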

@bill88t

bill88t commented Jun 9, 2023

@tannewt
I think this should be reopened.
With recursion, the crash still happens.
