This is actually not my question: llama.cpp wants to implement this but has run into some problems. See https://github.com/ggerganov/llama.cpp/pull/4207