
SPI blocking transfers do not wait for completion. #130


Closed
newAM opened this issue Dec 23, 2020 · 13 comments · Fixed by #131

Comments

@newAM
Member

newAM commented Dec 23, 2020

Problem

When doing an SPI transfer with the blocking traits, the CS line goes high before the transfer has completed.

    let mut buf: [u8; 1] = [0];
    w5500_cs.set_low().unwrap();
    spi.write(&[0x00, 0x39, 0x00]).unwrap();
    spi.transfer(&mut buf).unwrap();
    w5500_cs.set_high().unwrap();

[logic analyzer trace: CS rises before the SPI transfer has finished clocking out]

Expected Behavior

When I add some hacky polling for the BSY bit (0x80) in the SPI status register to that code, the transfer works as expected, with the CS line going high only after completion of the SPI bus activity.

    let mut buf: [u8; 1] = [0];
    w5500_cs.set_low().unwrap();
    spi.write(&[0x00, 0x39, 0x00]).unwrap();
    // Poll SPI1_SR (0x4001_3008) until the BSY flag (bit 7, 0x80) clears
    let mut spi1_sr: u32 = unsafe { core::ptr::read_volatile(0x4001_3008 as *const u32) };
    while spi1_sr & 0x80 != 0x00 {
        spi1_sr = unsafe { core::ptr::read_volatile(0x4001_3008 as *const u32) };
    }
    spi.transfer(&mut buf).unwrap();
    // Poll BSY again before raising CS
    let mut spi1_sr: u32 = unsafe { core::ptr::read_volatile(0x4001_3008 as *const u32) };
    while spi1_sr & 0x80 != 0x00 {
        spi1_sr = unsafe { core::ptr::read_volatile(0x4001_3008 as *const u32) };
    }
    w5500_cs.set_high().unwrap();

[logic analyzer trace: with BSY polling, CS rises only after the bus goes idle]

This seems to be a case of under-specified behavior in the embedded-hal (which is fixed in 1.0.0): the docs do not specify whether the blocking part means blocking only until there is buffer space, after which the function returns, or blocking for buffer space and then polling for completion.

I think this is a bug though: despite the under-specification, other STM32 embedded HAL implementations do block for completion (such as the stm32f1, which I also tested), so this one appears to be the odd one out.

Other Information

I am using a STM32F070CBTx.

Relevant Cargo.toml:

[dependencies]
cortex-m = "0.6.4"
cortex-m-rt = { version = "0.6.13", features = ["device"] }
embedded-hal = "0.2.4"
stm32f0xx-hal = { version = "0.17.1", features = ["stm32f070xb"] }
rtt-target = "0.3.0"

$ cargo -V
cargo 1.50.0-nightly (a3c2627fb 2020-12-14)
@newAM
Member Author

newAM commented Dec 23, 2020

stm32f1xx-hal implementation: https://github.com/stm32-rs/stm32f1xx-hal/blob/master/src/spi.rs#L492
which calls into spi_write here: https://github.com/stm32-rs/stm32f1xx-hal/blob/master/src/spi.rs#L268

The difference does appear to be that the stm32f1xx implementation polls for bsy.


Back to this crate, what is the purpose of this line here?
https://github.com/stm32-rs/stm32f0xx-hal/blob/master/src/spi.rs#L485

As far as I know, reads of sr do not have any side effects, and the ok() will just drop any errors. I would like to submit a fix, but I feel I am missing something here.

@therealprof
Member

The difference does appear to be that the stm32f1xx implementation polls for bsy.

Comparing two different generations of a peripheral (or software implementations) is typically not too useful. I guess the problem here is that you're assuming that the transfer is completed by the time the send call returns but it actually is buffered. Maybe we should add a busy check before changing the mode to bidi.

Reading sr does have side effects:

This flag is set by hardware and reset when SPIx_SR is read by software.

@newAM
Member Author

newAM commented Dec 23, 2020

I guess the problem here is that you're assuming that the transfer is completed by the time the send call returns but it actually is buffered.

Is this a bad assumption to make? I asked in the Rust Embedded Element chat and the consensus seemed to be that implementations of the traits should block for completion.

@therealprof
Member

Well, it depends on what the goal is. If you want the best possible write performance without complicated implementations (i.e. DMA), it makes total sense to use the built-in buffer. I'm not quite sure what happened in your first example since the trace doesn't quite seem to line up with the code; there could be other issues with a software NSS implementation, too.

@newAM
Member Author

newAM commented Dec 24, 2020

Well, depends on what the goal is. If you want best possible write performance without complicated implementations (i.e. DMA) it makes total sense to use the built-in buffer.

I took a look through the awesome embedded list and could not find a device driver that assumes the embedded-hal traits do not poll for completion.

For example, the ENC28J60 crate will not work with the stm32f0xx-hal because it does not poll for completion, but it will work with the stm32f1xx-hal.

    fn read_buffer_memory(&mut self, addr: Option<u16>, buf: &mut [u8]) -> Result<(), E> {
        if let Some(addr) = addr {
            self.write_control_register(bank0::Register::ERDPTL, addr.low())?;
            self.write_control_register(bank0::Register::ERDPTH, addr.high())?;
        }

        self.ncs.set_low();
        self.spi.write(&[Instruction::RBM.opcode()])?;
        self.spi.transfer(buf)?;
        self.ncs.set_high();

        Ok(())
    }

I'm not quite sure what happened in your first example since the trace doesn't quite seem to line up with the code

I am not sure I understand what you mean here. The trace not lining up is the issue, specifically the fact that CS goes high before the transfer is complete is unexpected. The rest of the trace looks correct to me.

there could be other issues with a software NSS implementation, too.

Any ideas? I am happy to investigate further!
I do not have a lot of boards on hand, but I can confirm that my first example does work with a STM32 F103C8T6 blue pill board using the stm32f1xx-hal.

@therealprof
Member

I took a look through the awesome embedded list and could not find a device driver that assumes the embedded-hal traits do not poll for completion.

?

How would you see such a thing? If you have a write-only driver there's no problem at all (and those are quite common, e.g. for displays).

For example, the ENC28J60 crate will not work with the stm32f0xx-hal because it does not poll for completion, but it will work with the stm32f1xx-hal.

Is that an assumption or did you verify that? I appreciate that the pattern you're using might be a problem in general but it would be nice to have some hard evidence.

And again, I think the problem might actually be the turn-around from write to transfer because we're actually changing settings.

I am not sure I understand what you mean here. The trace not lining up is the issue, specifically the fact that CS goes high before the transfer is complete is unexpected. The rest of the trace looks correct to me.

I understand that but I am puzzled about how little data seems to be transferred before the loss of sync... what's the SPI clock?

I do not have a lot of boards on hand, but I can confirm that my first example does work with a STM32 F103C8T6 blue pill board using the stm32f1xx-hal.

Yeah, but again, that's a completely different generation of SPI peripheral, which means you can't just assume that using the F1 mechanisms would just work for the F0, too.

It would be great if you could insert a busy check loop in set_send_only and set_bidi_mode and see whether that fixes your problem.
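
Roughly what I mean, as a sketch only (assuming this crate's existing set_send_only / set_bidi_mode helpers and the PAC's bsy accessor used elsewhere in this thread; the actual register writes are elided):

    fn set_bidi_mode(&mut self) {
        // Spin until BSY clears so any frame still shifting out from a
        // previous write has fully left the bus before the mode switch
        while self.spi.sr.read().bsy().bit_is_set() {}
        // ... existing CR1 reconfiguration unchanged ...
    }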

@newAM
Member Author

newAM commented Dec 24, 2020

How would you see such a thing?

I made the assumption (perhaps a bad one, sorry!) that any crate using the blocking embedded-hal SPI traits with a transfer or write wrapped by software-driven chip-select set_high and set_low calls will not be compatible.

If you have a write-only driver there's no problem at all (and those are quite common, e.g. for displays).

I tried it out, and it appears that even for write only this is still a problem.

    cs.set_low().unwrap();
    spi.write(&[0x12, 0x34, 0x56, 0x78]);
    cs.set_high().unwrap();

That code (or similar) also produces a clipped chip-select. Full code here: https://github.com/newAM/stm32f0xx-hal-issue-130/blob/example1/src/main.rs#L70

[logic analyzer trace: CS rises before the write-only transfer has finished clocking out]

Is that an assumption or did you verify that? I appreciate that the pattern you're using might be a problem in general but it would be nice to have some hard evidence.

It was a problem I had a while ago, but I did not dive into it at the time.

Here is a reproduction: https://github.com/newAM/stm32f0xx-hal-issue-130
It is a bit harder to see what is going on because there is a lot more activity on the bus, but it is still pretty clear that the chip select is not acting as expected.

I understand that but I am puzzled about how little data seems to be transferred before the loss of sync... what's the SPI clock?

10kHz, though this can be reproduced at higher frequencies.

Yeah, but again, that's a completely different generation of SPI peripheral, which means you can't just assume that using the F1 mechanisms would just work for the F0, too.

The point I am trying to make is that there is a gap in behavior between the F0 and F1 that should not exist. It should be possible to poll the bsy bit, which exists on both versions (and seems to behave the same across generations), so that both HALs behave similarly.

It would be great if you could insert a busy check loop in set_send_only and set_bidi_mode and see whether that fixes your problem.

Happy to help debug this! 👍

I tried it out, and the problem still exists :(

This is the same as above, just a bare write, since that is a better minimal reproduction.

stm32f0xx-hal: https://github.com/newAM/stm32f0xx-hal/tree/issue-130
source code: https://github.com/newAM/stm32f0xx-hal-issue-130/tree/example2

[logic analyzer trace: CS still clipped with the busy check added to set_send_only and set_bidi_mode]

Edit: I should also mention that all of the above was done with a Nucleo F070RB (original post was with a STM32F070CBTx).

@therealprof
Member

I tried it out, and it appears that even for write only this is still a problem.

Well, yes. The problem is that the data is still in the hardware buffer when the call returns, which isn't a problem if one uses hardware CS or transfer, or runs the bus at very high clocks.

10kHz, though this can be reproduced at higher frequencies.

10kHz explains a lot.

It should be possible to poll the bsy bit, which exists on both versions (and seems to behave the same across generations), so that both HALs behave similarly.

Okay, let's poll the bsy bit at the end of write then. It's going to be a performance hit for lots of consecutive writes at high speeds but possibly we can introduce a separate implementation for such common use cases to deal with that.
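
In rough shape (a sketch only; the full modified write appears in the next comment), that means adding something like this after the send loop:

    // Spin until the BSY flag clears so every queued frame has actually
    // been shifted out on the bus before write returns
    while self.spi.sr.read().bsy().bit_is_set() {}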

@newAM
Member Author

newAM commented Dec 24, 2020

It's going to be a performance hit for lots of consecutive writes at high speeds but possibly we can introduce a separate implementation for such common use cases to deal with that.

Consecutive writes with software chip select (at any frequency) will also produce a clipped CS.

For example, this code, with SPI at 500kHz.

    w5500_cs.set_low().unwrap();
    spi.write(&[0x00, 0x39, 0x00]).unwrap();
    spi.write(&[0x00]).unwrap();
    w5500_cs.set_high().unwrap();

Will produce a waveform like this:
[logic analyzer trace: consecutive writes at 500kHz with a clipped CS]

Measurements

I did some measurements at a 500kHz bus frequency to see what the performance hit is at a more reasonable frequency.

This HAL does not expose the hardware NSS management (as far as I can tell), so this should be a decent summary of the existing possible use cases with software NSS management.

The performance hit to enable back to back writes/transfers seems acceptable to me, should I open a pull-request with the stm32f0xx-hal modification described below?

Transfer + Transfer

Using only transfers in two calls, no modifications to stm32f0xx-hal:

    let mut header: [u8; 3] = [0x00, 0x39, 0x00];
    let mut buf: [u8; 1] = [0];
    w5500_cs.set_low().unwrap();
    spi.transfer(&mut header).unwrap();
    spi.transfer(&mut buf).unwrap();
    w5500_cs.set_high().unwrap();

[logic analyzer trace: transfer + transfer timing, unmodified HAL]

Write + Poll BSY + Transfer

Modifying stm32f0xx-hal to introduce polling, and retaining the original write followed by a transfer.

Modified write function:

    fn write(&mut self, words: &[u8]) -> Result<(), Self::Error> {
        let mut bufcap: u8 = 0;

        // We only want to send, so we don't need to worry about the receive buffer overflowing
        self.set_send_only();

        // Make sure we don't continue with an error condition
        nb::block!(self.check_send())?;

        // We have a 32 bit buffer to work with, so let's fill it before checking the status
        for word in words {
            // Loop as long as our send buffer is full
            while bufcap == 0 {
                bufcap = self.send_buffer_size();
            }

            self.send_u8(*word);
            bufcap -= 1;
        }

        // Block until the BSY flag clears, i.e. everything queued in the
        // FIFO has actually been shifted out on the bus
        loop {
            let sr = self.spi.sr.read();
            if !sr.bsy().bit_is_set() {
                break;
            }
        }

        // Do one last status register check before continuing
        self.check_send().ok();
        Ok(())
    }

Code:

    let mut buf: [u8; 1] = [0];
    w5500_cs.set_low().unwrap();
    spi.write(&[0x00, 0x39, 0x00]).unwrap();
    spi.transfer(&mut buf).unwrap();
    w5500_cs.set_high().unwrap();

[logic analyzer trace: write + poll BSY + transfer timing]

Write + Poll BSY + Write

Same modifications to stm32f0xx-hal as above, but with a write after the write (just to measure timing).

    w5500_cs.set_low().unwrap();
    spi.write(&[0x00, 0x39, 0x00]).unwrap();
    spi.write(&[0x00]).unwrap();
    w5500_cs.set_high().unwrap();

[logic analyzer trace: write + poll BSY + write timing]

@therealprof
Member

500kHz is not reasonably fast; SPI was meant for multi-MHz operation. 😅

Please go ahead and PR a change. Probably the easiest, most sensible, and correct change would be to replace this, in stm32f0xx-hal/src/spi.rs (lines 485 to 486 at fa3f2c2):

    self.check_send().ok();
    Ok(())

with:

    nb::block!(self.check_send())
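
Applied to the tail of write, that would look roughly like this (a sketch; check_send is assumed to return an nb::Result, as it is used with nb::block! elsewhere in this thread):

    // ... queueing loop above unchanged ...

    // Block on the final status check and propagate any error instead of
    // discarding the result with .ok()
    nb::block!(self.check_send())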

@WasabiFan
Copy link

Just throwing this out there...

As far as I can tell, the current implementation has a race condition if you invoke write directly followed by a transfer: the former sets BIDIMODE/BIDIOE for bi-directional operation in transmit-only mode, while the latter switches back to full-duplex unidirectional mode. If you call transfer immediately after write, the FIFO is likely not fully cleared by the time it enters transfer, and it re-enables receiving before everything has shifted out following the write. In other words, it enables receipt before the write has finished, causing more data to be loaded into the RX buffer than expected.

This should be resolved by the discussed fixes in #131. I've validated that it avoids the overruns I encountered in my case.

In theory, fixing this requires the new "busy" check that was proposed; that being said, the FIFO only takes a few extra cycles to clear, so even just doing the reads to check if there is space in the TX buffer is enough. Not that I'm fond of that 🙂

@therealprof
Member

As far as I can tell, the current implementation has a race condition if you invoke write directly followed by a transfer: the former sets BIDIMODE/BIDIOE for bi-directional operation in transmit-only mode, while the latter switches back to full-duplex unidirectional mode.

Yes, that is correct and was mentioned before. This needs to be fixed.

In theory, fixing this requires the new "busy" check that was proposed; that being said, the FIFO only takes a few extra cycles to clear, so even just doing the reads to check if there is space in the TX buffer is enough. Not that I'm fond of that 🙂

You're assuming people would run SPI at high speeds. Something that doesn't necessarily seem to be true as I've learned by this issue. 😅

@newAM
Member Author

newAM commented Dec 28, 2020

You're assuming people would run SPI at high speeds. Something that doesn't necessarily seem to be true as I've learned by this issue. 😅

I do it for ease of debugging with my low-end equipment. 😀
