llama-cpp manually is noticeably faster than running it through the python API #148

Closed as not planned
@rnwang04

Description

Similar to #71 (comment), I found that on my machine llama.cpp runs at 27 ms per token, while the Python bindings run at 68 ms per token, so the bindings are much slower.
Is there any explanation for this performance degradation? Is such a gap normal, and is there a way to close it?
Thanks a lot!
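For reference, a minimal sketch of how per-token latency might be measured from Python so the two numbers are comparable. The timing helper is generic; the commented usage assumes the llama-cpp-python `Llama` class and a hypothetical local model path:

```python
import time

def per_token_ms(generate, n_tokens):
    """Drive a token generator and return average milliseconds per token."""
    start = time.perf_counter()
    count = 0
    for _ in generate(n_tokens):
        count += 1
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / max(count, 1)

# Usage with llama-cpp-python (model path is hypothetical):
# from llama_cpp import Llama
# llm = Llama(model_path="./model.gguf")
# def gen(n):
#     yield from llm("Hello", max_tokens=n, stream=True)
# print(f"{per_token_ms(gen, 64):.1f} ms/token")
```

Comparing this figure against llama.cpp's own per-token timing output (measured on the same prompt and token count) should show where the overhead lies.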

Metadata

    Labels

    duplicate, performance
