No DMA support for SPI in arduino-esp32? #4590

Closed
RudolphRiedel opened this issue Nov 29, 2020 · 21 comments
Labels
Status: Stale Issue is stale stage (outdated/stuck)

Comments

@RudolphRiedel

Neither the SPI class, which is merely a wrapper, nor the underlying esp32-hal-spi.c seems to support DMA.
Is this observation correct?

Or is there a way to send a buffer over SPI with DMA without adding a third party library?

@me-no-dev
Member

DMA will be added at some point but it's not an easy task because of some internal issues. For now you can use the IDF API for SPI and utilise DMA that way.

@RudolphRrr

@lbernstone
Contributor

https://docs.espressif.com/projects/esp-idf/en/latest/esp32/api-reference/peripherals/spi_master.html
You can include idf drivers in arduino, as long as you don't mix the two.
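
As a rough illustration of the IDF route suggested here, the following is a minimal sketch rather than a tested driver. The pin numbers, the choice of `SPI2_HOST`, the 20MHz clock, the 4096-byte limit and the names `spi_dma_setup()` / `spi_dma_send()` are all assumptions made for the example. `SPI_DMA_CH_AUTO` is only available in more recent ESP-IDF versions. The hardware part is only compiled when `ESP_PLATFORM` is defined, so the small helper can be checked on a PC.

```cpp
#include <cstddef>
#include <cstdint>

// spi_transaction_t::length is given in bits, not bytes.
constexpr std::size_t bytes_to_bits(std::size_t bytes) { return bytes * 8; }

#ifdef ESP_PLATFORM  // only built for ESP32 targets
#include "driver/spi_master.h"

// Hypothetical pin numbers; adjust to the actual board.
constexpr int PIN_MOSI = 23, PIN_MISO = 19, PIN_SCLK = 18, PIN_CS = 5;

static spi_device_handle_t dev;

// Set up the bus with a DMA channel and attach one device to it.
esp_err_t spi_dma_setup() {
    spi_bus_config_t bus = {};
    bus.mosi_io_num = PIN_MOSI;
    bus.miso_io_num = PIN_MISO;
    bus.sclk_io_num = PIN_SCLK;
    bus.quadwp_io_num = -1;      // not used
    bus.quadhd_io_num = -1;      // not used
    bus.max_transfer_sz = 4096;  // assumed upper bound for one transfer

    esp_err_t err = spi_bus_initialize(SPI2_HOST, &bus, SPI_DMA_CH_AUTO);
    if (err != ESP_OK) return err;

    spi_device_interface_config_t cfg = {};
    cfg.clock_speed_hz = 20 * 1000 * 1000;  // assumed SPI clock
    cfg.mode = 0;
    cfg.spics_io_num = PIN_CS;
    cfg.queue_size = 1;
    return spi_bus_add_device(SPI2_HOST, &cfg, &dev);
}

// Send a buffer and wait for completion; the buffer should live in DMA-capable memory.
esp_err_t spi_dma_send(const uint8_t *buf, std::size_t len) {
    spi_transaction_t t = {};
    t.length = bytes_to_bits(len);
    t.tx_buffer = buf;
    return spi_device_transmit(dev, &t);
}
#endif
```

In line with the advice not to mix the two, a bus set up this way should not also be driven through the Arduino SPI class.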

@RudolphRiedel
Author

Before you wonder, "RudolphRrr" is my account as well.
This looks promising, the documentation unfortunately not so much.
I managed to put in some workarounds to make SPI at least faster than on the UNO, but sending out the buffer with DMA should be way faster while allowing lower SPI clocks.

With a simple SPI.transfer(data) executing a lot slower on the ESP32 than on the UNO, the SPI class kind of needs an update, and/or probably esp32-hal-spi.c.
The extra functions like SPI.write32() and SPI.writeBytes() are pretty much mandatory when sending more than the occasional byte over SPI - very nice to have but also not really portable.
Yes, I know there is supposed to be a reason why it is that slow, it is thread-safe, or maybe it is.
On the other hand the ESP-IDF driver is not even thread-safe.
And the 6.25µs between SPI transfers with esp32-hal-spi.c really is way too much - that is 1500 clock cycles of doing nothing.
https://user-images.githubusercontent.com/31180093/100015163-d0d87f80-2dd7-11eb-807b-d12fee2e7706.png
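
The 1500-cycle figure can be reproduced with a one-line helper. This is only a sketch of the arithmetic; the name `idle_cycles()` is made up for the example, and the 240MHz CPU clock is an assumption implied by the numbers in the comment.

```cpp
#include <cstdint>

// CPU cycles elapsed during an idle gap: cycles = gap [ns] * f_cpu [MHz] / 1000.
constexpr std::uint64_t idle_cycles(std::uint64_t gap_ns, std::uint64_t cpu_mhz) {
    return gap_ns * cpu_mhz / 1000;
}
```

With a gap of 6250ns and a 240MHz core, `idle_cycles(6250, 240)` evaluates to 1500.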

@stale

stale bot commented Jan 30, 2021

[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale Issue is stale stage (outdated/stuck) label Jan 30, 2021
@RudolphRiedel
Author

Not actually resolved but I worked around the issue by using ESP-IDF in the meantime.

One thing that was interesting to note is that the time between two simple 8-bit transfers on ESP32-Arduino is ridiculously long, especially for a 240MHz controller, with 6µs.
It is even longer with ESP-IDF, I measured 28µs at first using spi_device_transmit().

Yes, I am talking about the pause between sending two bytes: the time from when one byte has finished until the next one starts to transfer.

[image]

When claiming the bus with spi_device_acquire_bus() / spi_device_release_bus() this is even increased to 32µs.

I switched over to use spi_device_polling_transmit() and it got a lot better - for whatever reason.
With locking the bus I got down to 8.2µs; without locking I measured 12.2µs.

[image]

So using simple SPI transfers with ESP-IDF is even slower than with ESP32-Arduino.
But at least this is offset by the possibility to use DMA.
Oh yes, I also found that native ESP-IDF is running slower than the ESP-IDF that is supplied with ESP32-Arduino.

At this point I consider SPI with the ESP32 to be severely broken.

And for reference, when using the Arduino SPI class on an Adafruit Metro M4, which is clocked at 120MHz, the gap between two bytes on the SPI is only 440ns.
Heck, even an Arduino UNO at 16MHz is faster than the ESP32 at doing single byte transfers over SPI.

@stale

stale bot commented Jan 30, 2021

[STALE_CLR] This issue has been removed from the stale queue. Please ensure activity to keep it open in the future.

@stale stale bot removed the Status: Stale Issue is stale stage (outdated/stuck) label Jan 30, 2021
@stale

stale bot commented Jun 20, 2021

[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale Issue is stale stage (outdated/stuck) label Jun 20, 2021
@stale

stale bot commented Jul 8, 2021

[STALE_DEL] This stale issue has been automatically closed. Thank you for your contributions.

@stale stale bot closed this as completed Jul 8, 2021
@s-light

s-light commented Jan 27, 2023

the base issue of very long inter-byte gaps is still valid.
i will open up a new issue - if i get my hands on an oscilloscope to write a good test-case.
i first encountered this issue by trying to get my TI TLC5971 LED-Driver chips working...

@RudolphRiedel
Author

Playing with an ESP32-C3 I learned that the SPI got a lot faster in the meantime but there still is no DMA support.
I tried to combine the SPI class with ESP-IDF but so far the DMA works only once and the overhead to switch is about 600µs.
I guess it is not really feasible to combine esp32-hal-spi / SPI.cpp with ESP-IDF since esp32-hal-spi is doing direct register access?

Combined SPI.cpp / ESP-IDF transfer:
[image]

ESP-IDF no-DMA and DMA transfers:
[image]

SPI class only
[image]

Looking at this it is probably overdue to remove the ESP-IDF based transfers again.
While this makes the SPI transfers blocking, the short blocking transfers take 3.6x to 4.0x more time when using ESP-IDF.
And at least the first 15µs of the ESP-IDF DMA transfer is overhead.
The display update with 228 bytes takes 30µs with the SPI running at 20MHz, this is preparing the buffer and asking ESP-IDF to send it with DMA.
The display update using only the SPI class, which makes this a whole lot more compatible with anything else using the SPI, now takes 139µs, which is way faster than what I measured originally.

I still would like to see DMA support added, 228 bytes is only a tiny buffer.
But this looks feasible now and compatibility is a very nice bonus.
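
To put the 30µs and 139µs figures into perspective, it helps to know how long the 228 bytes occupy the bus on their own. The helper below is a sketch of that arithmetic only; `wire_time_ns()` is a name invented for the example, and it ignores any gaps the peripheral may insert between bytes.

```cpp
#include <cstdint>

// Time the bits alone occupy the bus, in nanoseconds, ignoring all software overhead.
constexpr std::uint64_t wire_time_ns(std::uint64_t bytes, std::uint64_t spi_hz) {
    return bytes * 8 * 1000000000ULL / spi_hz;
}
```

For 228 bytes at 20MHz this gives 91200ns, i.e. 91.2µs. The 30µs measured for the DMA variant is therefore shorter than the transfer itself, which fits the description that it only covers preparing the buffer and handing it to ESP-IDF. If the SPI class transfer also ran at 20MHz, about 48µs of its 139µs would be overhead on top of the wire time.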

@me-no-dev
Member

There are a couple of reasons why we do not have DMA support in Arduino yet. As you found out yourself, short transactions take a lot longer to set up and execute with DMA, and in Arduino we generally have many small transactions: LCD drivers writing pixels, etc. On another topic, DMA SPI on the ESP32 is a bit harder to get going correctly and the IDF team had many headaches with it. Otherwise, the trick is to do the small transactions either without DMA or by polling the ISR directly (instead of waiting for an event in a task to fill the next buffer).
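
One way to read the "polling for small transactions" advice in ESP-IDF terms is to pick the call based on the transfer length. The following is a sketch under assumptions, not how Arduino-ESP32 or ESP-IDF do it internally: the 64-byte threshold and the names `use_queued()` and `spi_send()` are hypothetical, and `dev` is assumed to come from `spi_bus_add_device()` on a DMA-enabled bus. Both paths go through the same bus; what changes is whether the CPU busy-waits or the transaction is queued and completed via interrupt.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical cut-over point; the right value has to be measured on the target.
constexpr std::size_t QUEUE_THRESHOLD_BYTES = 64;

// Short transfers are cheaper when polled; long ones are worth queueing.
constexpr bool use_queued(std::size_t len) { return len >= QUEUE_THRESHOLD_BYTES; }

#ifdef ESP_PLATFORM  // only built for ESP32 targets
#include "freertos/FreeRTOS.h"
#include "driver/spi_master.h"

// `t` and `buf` must stay valid until the transaction has completed.
esp_err_t spi_send(spi_device_handle_t dev, spi_transaction_t *t,
                   const uint8_t *buf, std::size_t len) {
    *t = spi_transaction_t{};
    t->length = len * 8;  // length is given in bits
    t->tx_buffer = buf;
    if (use_queued(len)) {
        // Returns once queued; collect the result later with spi_device_get_trans_result().
        return spi_device_queue_trans(dev, t, portMAX_DELAY);
    }
    // Busy-waits for completion, avoiding interrupt and task-switch overhead.
    return spi_device_polling_transmit(dev, t);
}
#endif
```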

@RudolphRiedel
Author

RudolphRiedel commented Jul 20, 2023

I used no-dma transfers for the shorter transfers because most of these are read operations in my application and the result is needed immediately anyways, like checking if the display is busy or reading the touch data.
So, heck no, doing everything with DMA is not what I am hoping for. :-)

But I also need to transfer buffers of 250...4k bytes, and the option to do that with DMA would be nice, like how it is implemented for the Teensy 4 with a callback function.
This would mean that during such transfers other things could be done and the SPI clock could be lowered as well.

The issue, however, is not that DMA takes a long time to set up in general; it is only ESP-IDF that needs endless cycles for everything.
I am really glad to see that this has been addressed for Arduino-ESP32 now: the time for a 4+1 byte transfer went down from 29.5µs with ESP-IDF to 8.5µs, which really is something.

[image]

The code for this sequence essentially is:
digitalWrite(EVE_CS, LOW);
SPI.transfer(&data, 4);
result = SPI.transfer(0x00);
digitalWrite(EVE_CS, HIGH);
return result;

And I am currently running it on an ESP32-C3, a 160MHz RISC-V, so in absolute terms this still is rather slow.
A 1.44µs pause is 230 clock cycles at 160MHz.
This is not a complaint, just an observation.
And yes, I noticed that SPI.write32() exists, so this is something on my end to optimize, nice. :-)
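
A possible shape of that optimization is sketched below. It assumes that `SPI.write32()` puts the most significant byte on the wire first when the bus is configured as MSBFIRST, which should be verified with a logic analyzer. The helper `pack_be()`, the function `read_byte()` and the pin number are made up for the example. The Arduino part is only compiled for Arduino-ESP32 targets.

```cpp
#include <cstdint>

// Pack four bytes so that b0 is the most significant byte of the 32-bit word.
constexpr std::uint32_t pack_be(std::uint8_t b0, std::uint8_t b1,
                                std::uint8_t b2, std::uint8_t b3) {
    return (std::uint32_t{b0} << 24) | (std::uint32_t{b1} << 16) |
           (std::uint32_t{b2} << 8) | b3;
}

#ifdef ARDUINO_ARCH_ESP32  // only built for Arduino-ESP32 targets
#include <Arduino.h>
#include <SPI.h>

constexpr int EVE_CS = 5;  // hypothetical chip-select pin

// Same 4+1 byte sequence as above, with the four command bytes sent in one call.
uint8_t read_byte(const uint8_t cmd[4]) {
    digitalWrite(EVE_CS, LOW);
    SPI.write32(pack_be(cmd[0], cmd[1], cmd[2], cmd[3]));
    uint8_t result = SPI.transfer(0x00);
    digitalWrite(EVE_CS, HIGH);
    return result;
}
#endif
```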

@me-no-dev
Member

Wrapping this into SPI transactions will make it even faster, since BUS will need to be locked/unlocked only once.

@me-no-dev
Member

me-no-dev commented Jul 20, 2023

This is the proper call order to use best SPI in Arduino (and fastest on ESP32):

// Lock the bus and select the slave device
SPI.beginTransaction(SPISettings(SPI_FREQUENCY, MSBFIRST, SPI_MODE0));
digitalWrite(EVE_CS, LOW);

// Transact the necessary data
SPI.transfer(&data, 4);
result = SPI.transfer(0x00);

// Deselect the slave device and release the bus
digitalWrite(EVE_CS, HIGH);
SPI.endTransaction();

// Return the result
return result;

@RudolphRiedel
Author

RudolphRiedel commented Jul 20, 2023

Or not always use SPI.beginTransaction() / SPI.endTransaction() which makes this even faster. :-)
I am only using SPI.beginTransaction() / SPI.endTransaction() for the larger buffer transfers, as I increase the SPI clock from 8MHz to 20MHz for those; the small transfers are executed under the assumption that the SPI is set up correctly.
Edit: I tried to use SPI.setFrequency(20000000); but this is not working at all, the SPI traffic stops with this call.

And using SPI.write32() is indeed a little faster.

@me-no-dev
Member

When you do not use transactions, the SPI bus is locked/unlocked with each call to its API functions; otherwise it is locked/unlocked only once per transaction. You have two calls to transfer(), which without a transaction will lock/unlock twice. Maybe give it a shot and see if the pause between them is less than 1.4us.

@RudolphRiedel
Author

No, what I meant is that my code expects SPI.beginTransaction() has been called and it would not call it or SPI.endTransaction() if the frequency could be changed without calling these.

@linhz0hz

Is the difficulty with SPI speed specific to the ESP32, or does it also show up on the S3, C3, etc.?
In my use case I need to query a sensor every ms, and the sensor only allows a 10MHz clock. Right now, when I time it with esp_timer_get_time(), the SPI transfer takes ~50us for 7 bytes; that is 12k cycles, while the transfer itself only requires ~6us. It is already using the beginTransaction/endTransaction model. I am confused, as this is a lot, and even worse than what is reported here. Eventually I need to put this code in an interrupt, so I really want to reduce this inefficiency. I am using an ESP32-S3.

@RudolphRrr

> Is the difficulty with SPI speed specific to the ESP32, or does it also show up on the S3, C3, etc.?

The last tests I did were with the C3, so yes, the results should be the same across the family.
I do have an ESP32-S3 and can re-check.

And to make this clear, the implementation here for Arduino is faster now than using the ESP-IDF directly.
I just switched from using ESP-IDF for the benefit of having DMA to using the Arduino class again,
in order to increase the compatibility of my library with anything else.
Using the Arduino class with DMA would be real nice though.

> In my use case I need to query a sensor every ms, and the sensor only allows a 10MHz clock.

Every ms sounds like exclusive use of the SPI and if this is the case you do not need to use SPI.beginTransaction() / SPI.endTransaction() every time.

> Right now, when I time it with esp_timer_get_time(), the SPI transfer takes ~50us for 7 bytes,

That should be more like under 20µs.
Please provide a snippet of commands that shows this behaviour.
And how do you determine it needs 50µs?

If you are using single transfers, try to switch over to a buffer transfer.

@linhz0hz

linhz0hz commented Jul 26, 2023

Indeed, I made a mistake in my measurement. There were actually three SPI transfers in each measurement: querying the status bit, clearing the interrupt in the sensor, and the actual data transfer. These add up to the ~50us observed. I am using esp_timer_get_time() to get the timing, as I currently do not have an oscilloscope handy.
Now I tried eliminating all the beginTransaction and endTransaction calls (but still have three separate transfers). This reduces the time from 50us to 20us, and there are 12 bytes' worth of communication in total, corresponding to ~10us on the wire. That still leaves ~10us of overhead. There is some computation in addition to my SPI transfers; I did not think it would contribute much, but I will double-check.
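
For measurements like this without an oscilloscope, one possible approach is to time a single buffered transfer and subtract the time the bits need on the wire. The sketch below assumes an Arduino-ESP32 build. `overhead_ns()` and `timed_transfer_us()` are names invented for the example, and chip select as well as `SPI.begin()` / `SPI.beginTransaction()` are assumed to be handled by the caller.

```cpp
#include <cstdint>

// Software overhead = measured time minus the time the bits need on the wire.
constexpr std::int64_t overhead_ns(std::int64_t measured_ns, std::int64_t bytes,
                                   std::int64_t spi_hz) {
    return measured_ns - bytes * 8 * 1000000000LL / spi_hz;
}

#ifdef ARDUINO_ARCH_ESP32  // only built for Arduino-ESP32 targets
#include <Arduino.h>
#include <SPI.h>
#include "esp_timer.h"

// Time one buffered transfer of `len` bytes; `buf` is overwritten with the received data.
int64_t timed_transfer_us(uint8_t *buf, uint32_t len) {
    const int64_t start = esp_timer_get_time();  // microseconds since boot
    SPI.transfer(buf, len);
    return esp_timer_get_time() - start;
}
#endif
```

With the numbers reported above, 20µs measured for 12 bytes at 10MHz, `overhead_ns(20000, 12, 10000000)` gives 10400ns, which matches the roughly 10µs of overhead. Since `esp_timer_get_time()` has microsecond resolution, averaging over many transfers gives a more reliable figure than a single reading.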
