
tests : add WER benchmarks #2454

Open

ggerganov opened this issue Oct 5, 2024 · 25 comments
Labels: help wanted (Extra attention is needed), high priority (Very important issue), research🔬, roadmap (Part of a roadmap project)

Comments

@ggerganov
Member

It would be nice to start measuring the word error rate (WER) of whisper.cpp across some representative dataset:

  • short audio
  • long audio
  • english
  • non-english
  • etc.

This will help us catch regressions in the future. I'm not familiar with what is typically used for ASR WER benchmarks, so I'm looking for help from the community.

@ggerganov ggerganov added help wanted Extra attention is needed research🔬 labels Oct 5, 2024
@ggerganov ggerganov moved this to Todo in ggml : roadmap Oct 5, 2024
@ggerganov ggerganov changed the title whisper : add WER tests tests : add WER benchmarks Feb 4, 2025
@ggerganov ggerganov added roadmap Part of a roadmap project high priority Very important issue labels Feb 4, 2025
@harvestingmoon

harvestingmoon commented Feb 5, 2025

Hi Georgi, perhaps we can use LibriSpeech for measuring long audio (approx. ~1,000 hours, but we could trim it to fit the requirements). For short audio, we can use Libri-Light.

Alternatively, there are other audio datasets for measuring WER: https://github.com/jim-schwoebel/voice_datasets

I could start making small sample scripts to see how whisper.cpp fares on these datasets.

@ggerganov
Member Author

Thanks. Yes, I'm not sure what is typically used. But in general, I think any dataset would work. The main goal here is not to compare whisper.cpp numbers with other numbers, but to create a reference set of WER numbers that we track as the development continues. This would allow us to catch regressions when they appear, because the WER scores would get worse in such cases.

Ideally, we can have scripts that perform heavier benchmarks that developers would use locally. But we also need a mode where the scripts run just a few fast benchmarks that can be added to the CI without overloading it, so that these would be computed on every commit.
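For illustration, here is a minimal sketch of what such a regression check could look like in CI, comparing freshly measured WER scores against a committed set of reference numbers (the file layout, benchmark names, and tolerance below are made up for this example; no such script exists in the repo yet):

```python
# wer_regression_check.py -- hypothetical sketch, not an existing whisper.cpp script.
# Compares freshly measured WER scores against committed reference values and
# exits non-zero if any benchmark regressed beyond a small tolerance.
import json
import sys

TOLERANCE = 0.005  # absolute WER increase allowed before we call it a regression


def main(reference_path: str, current_path: str) -> int:
    with open(reference_path) as f:
        reference = json.load(f)  # e.g. {"librispeech-clean/tiny": 0.069, ...}
    with open(current_path) as f:
        current = json.load(f)

    failed = False
    for name, ref_wer in reference.items():
        cur_wer = current.get(name)
        if cur_wer is None:
            print(f"[warn] benchmark '{name}' missing from current run")
            continue
        delta = cur_wer - ref_wer
        status = "OK" if delta <= TOLERANCE else "REGRESSION"
        failed = failed or status == "REGRESSION"
        print(f"{name:35s} ref={ref_wer:.4f} cur={cur_wer:.4f} delta={delta:+.4f} {status}")
    return 1 if failed else 0


if __name__ == "__main__":
    sys.exit(main(sys.argv[1], sys.argv[2]))
```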

@foldl
Collaborator

foldl commented Feb 7, 2025

@harvestingmoon are you working on this?

@harvestingmoon

@foldl Hi, yes, I'm looking at it. Most likely I'll start after the 12th, as it's currently the Chinese New Year period...

@foldl
Collaborator

foldl commented Feb 17, 2025

I think we need a tiny dataset (~10 MB) included directly in this repo. WER can then be measured on the fly.

@harvestingmoon

harvestingmoon commented Feb 17, 2025

Sorry, please ignore the WER calculation above; I will develop another script, since the calculations are completely off from what they should be. I will also look for a smaller, lightweight dataset so that audio can be measured on the fly.

@harvestingmoon

harvestingmoon commented Feb 17, 2025

I have created a better and more robust lightweight script that meets the requirements, @foldl @ggerganov.

WER is measured at 0.3.

It uses this lightweight dataset: https://arxiv.org/abs/2104.01497 and is based on NVIDIA's tutorial for calculating WER:
https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tutorials/tts-evaluate.html


My script calculates the WER for each individual audio file as well as the overall average; here is the pull request: #2824
For context, WER is measured between 0 and 1. A WER of around 0.33 means transcription accuracy is about 67%. The current measurement corresponds to roughly 70% accuracy, which is fairly good for a lightweight model.

Link for reference: https://huggingface.co/learn/audio-course/en/chapter5/evaluation
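For reference, WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference transcript. A tiny worked example, here using the jiwer package as one common choice of implementation (just an assumption; it is not necessarily what the script or the NVIDIA tutorial uses):

```python
# Worked WER example; jiwer is one commonly used Python package for this.
import jiwer

reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"       # 1 substitution + 1 deletion, 6 reference words

wer = jiwer.wer(reference, hypothesis)  # (S + D + I) / N = 2 / 6
print(f"WER = {wer:.2f}, accuracy ~ {1 - wer:.0%}")  # WER = 0.33, accuracy ~ 67%
```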

@harvestingmoon

The pull request contains the script as well as the full ~10 MB dataset, making it fairly lightweight for measuring on the fly as well.

@ggerganov
Member Author

Hi @harvestingmoon, thank you for the effort, but I expect more attention to detail. Will close the PR for now and let someone else give this a try.

@WilliamTambellini
Contributor

Shouldn't the very first step be to add a minimalist edit-distance source (header-only?) used to compute WER/TER?
e.g.
https://github.com/flashlight/flashlight/blob/f59d770b52ea678b039d9ba44693341ba80cf7c5/flashlight/fl/meter/EditDistanceMeter.h
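A rough Python analogue of such an edit-distance meter, purely for illustration (this is neither the flashlight code nor anything currently in whisper.cpp): it accumulates word-level substitutions, insertions, and deletions across utterances and reports the corpus-level error rate.

```python
from dataclasses import dataclass


@dataclass
class EditDistanceMeter:
    """Accumulates word-level substitutions, insertions and deletions over a dataset."""
    substitutions: int = 0
    insertions: int = 0
    deletions: int = 0
    ref_words: int = 0

    def add(self, reference: str, hypothesis: str) -> None:
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = (S, I, D) counts for aligning ref[:i] with hyp[:j]
        dp = [[(0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(1, len(ref) + 1):
            dp[i][0] = (0, 0, i)                 # all reference words deleted
        for j in range(1, len(hyp) + 1):
            dp[0][j] = (0, j, 0)                 # all hypothesis words inserted
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                if ref[i - 1] == hyp[j - 1]:
                    dp[i][j] = dp[i - 1][j - 1]  # exact match, no edit
                    continue
                sub, ins, dele = dp[i - 1][j - 1], dp[i][j - 1], dp[i - 1][j]
                # take the predecessor with the fewest total edits
                if sum(sub) <= sum(ins) and sum(sub) <= sum(dele):
                    dp[i][j] = (sub[0] + 1, sub[1], sub[2])      # substitution
                elif sum(ins) <= sum(dele):
                    dp[i][j] = (ins[0], ins[1] + 1, ins[2])      # insertion
                else:
                    dp[i][j] = (dele[0], dele[1], dele[2] + 1)   # deletion
        s, i_, d = dp[len(ref)][len(hyp)]
        self.substitutions += s
        self.insertions += i_
        self.deletions += d
        self.ref_words += len(ref)

    def error_rate(self) -> float:
        """Corpus-level WER = (S + I + D) / total reference words."""
        return (self.substitutions + self.insertions + self.deletions) / max(self.ref_words, 1)


meter = EditDistanceMeter()
meter.add("the cat sat on the mat", "the cat sit on mat")
print(meter.error_rate())  # 2 edits over 6 reference words -> ~0.33
```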

@redraskal
Collaborator

@ggerganov I'm not sure there is a reasonably sized dataset containing short audio, long audio, English, and non-English content.

What do you think about an approach like tests/run-tests.sh in the CI script to measure WER for a list of audio URLs in the aforementioned categories? Could print the results for each category/model as a table.

This would be lightweight for each commit.

We could have a script to download a larger dataset for local testing. Smaller datasets usually contain a single language or consistent audio duration.

@ggerganov
Member Author

> What do you think about an approach like tests/run-tests.sh in the CI script to measure WER for a list of audio URLs in the aforementioned categories? Could print the results for each category/model as a table.

Yes, sounds good. The CI should download audio files with wget or curl and run WER tests on them. We can combine different sources at the start. Later on, we can use the more powerful nodes, such as the CUDA and M1 ones, to run larger datasets.

> We could have a script to download a larger dataset for local testing. Smaller datasets usually contain a single language or consistent audio duration.

Yes, a much bigger dataset for local testing would be useful.
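For what it's worth, a rough sketch of what such a CI smoke test could look like (the binary path, the assumption that --output-txt writes "<audio>.txt", the sample URLs, and the use of jiwer are all placeholders for illustration, not an existing script):

```python
# ci_wer_smoke_test.py -- hypothetical sketch of a fast per-commit WER check.
import subprocess
import urllib.request
from pathlib import Path

from jiwer import wer  # or any other WER implementation

WHISPER_CLI = "./build/bin/whisper-cli"
MODEL = "models/ggml-tiny.bin"

# (category, audio URL, reference transcript URL) -- placeholder URLs
SAMPLES = [
    ("short/english", "https://example.com/short_en.wav", "https://example.com/short_en.txt"),
    ("long/english",  "https://example.com/long_en.wav",  "https://example.com/long_en.txt"),
]


def fetch(url: str) -> Path:
    dest = Path(url.rsplit("/", 1)[-1])
    if not dest.exists():
        urllib.request.urlretrieve(url, dest)
    return dest


results = []
for category, audio_url, ref_url in SAMPLES:
    audio = fetch(audio_url)
    reference = fetch(ref_url).read_text().strip()
    # transcribe; with --output-txt the hypothesis is assumed to land in "<audio>.txt"
    subprocess.run([WHISPER_CLI, "-m", MODEL, "-f", str(audio),
                    "--no-prints", "--language", "en", "--output-txt"], check=True)
    hypothesis = Path(f"{audio}.txt").read_text().strip()
    results.append((category, wer(reference.lower(), hypothesis.lower())))

print(f"{'category':<16} WER")
for category, score in results:
    print(f"{category:<16} {score:.2%}")
```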

@fujimotos
Contributor

fujimotos commented Apr 3, 2025

@ggerganov Hi. I have been working on this ticket for a while, and spent the last few days benchmarking whisper.cpp on the LibriSpeech corpus.

Now, here is the summary of my measurement results:

  • The following graph shows the recognition accuracy (measured in Word Error Rate) on the LibriSpeech test-clean dataset.

[graph: WER per model on the LibriSpeech test-clean dataset]

Comparison with OpenAI whisper

To illustrate the result shown above, the following table compares whisper.cpp's performance with OpenAI's official WER scores.

In short, the performance is pretty much comparable!

| Model | WER [whisper.cpp] | WER [openai-whisper] * |
|---|---|---|
| tiny | 6.90 | 6.7 |
| base | 4.81 | 4.9 |
| small | 3.37 | 3.3 |
| medium | 2.70 | 2.7 |
| large-v1 | 2.67 | 2.8 |
| large-v2 | 2.58 | 2.5 |
| large-v3 | 1.85 | Not published |
| large-v3-turbo | 1.92 | Not published |

How I performed the benchmark test

I submitted the code I wrote for the benchmark test in PR #2999. The code should be basically the same as how OpenAI evaluates their models.

The testing process is fairly automated (using the power of Makefile), and I also attached some documentation on how to use it.

Please tell me if anything is unclear! I hope it's interesting for you.

@ggerganov
Member Author

@fujimotos Thank you, this is very interesting! Will be taking a look in the next few days.

@fujimotos
Contributor

@ggerganov Thank you!


Technical Note: how long it took to perform the full benchmark

This time, I rented an EC2 c8g.xlarge instance from AWS to perform the benchmark test.

It took roughly 80 hours to benchmark all eight model sizes. Here is the breakdown of the running time:

| MODEL | WER | TIME [REAL] | Real Time Factor |
|---|---|---|---|
| tiny | 6.90 | 28m | 0.08 |
| base | 4.81 | 56m | 0.17 |
| small | 3.37 | 3h2m | 0.56 |
| medium | 2.70 | 9h20m | 1.72 |
| large-v1 | 2.67 | 17h52m | 3.30 |
| large-v2 | 2.58 | 17h55m | 3.31 |
| large-v3 | 1.85 | 17h46m | 3.29 |
| large-v3-turbo | 1.92 | 14h28m | 2.67 |

Observation: trade-off between speed and accuracy

Looking at it from a different angle, I think this confirms the existence of a trade-off between speed and accuracy in whisper.cpp models.

The following graph should illustrate the relationship:

  • The X-axis ("Real time factor") is computed as (inference time) / (audio length), so lower is better (a quick recomputation is sketched below).
  • Note that LibriSpeech test-clean contains 5 hours 24 minutes of speech.
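A quick recomputation of those real-time factors, assuming nothing beyond the numbers in the table above (the reported times are rounded to the minute, so the last digit can differ slightly from the table):

```python
# Real-time factor: RTF = (inference time) / (audio length); lower is better.
AUDIO_MIN = 5 * 60 + 24  # LibriSpeech test-clean: 5h24m of speech

times_min = {
    "tiny": 28, "base": 56, "small": 3 * 60 + 2, "medium": 9 * 60 + 20,
    "large-v1": 17 * 60 + 52, "large-v2": 17 * 60 + 55,
    "large-v3": 17 * 60 + 46, "large-v3-turbo": 14 * 60 + 28,
}

for model, t in times_min.items():
    print(f"{model:<15} RTF ~ {t / AUDIO_MIN:.2f}")  # e.g. base: 56 / 324 ~ 0.17
```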

@ggerganov
Member Author

It would be interesting to perform these benchmarks with Q8_0 quantized models and see how the WER changes. But I think it would be better to run this on a GPU in order to reduce the processing time. Will see how this performs on my M2 Ultra - I think it would be much faster than the AWS instance.

@ggerganov ggerganov moved this from Todo to In Progress in whisper.cpp : roadmap Apr 4, 2025
@ggerganov
Member Author

Here are some results on M2 Ultra with Flash Attention enabled:

| MODEL | WER | TIME [REAL] |
|---|---|---|
| base | 4.90 | 13m28s |
| base-q8_0 | 4.89 | 12m32s |
| small | 3.39 | 24m4s |
| small-q8_0 | 3.36 | 20m33s |

The timings might be a bit off because I was using the computer while the computations were running. But overall, there is no degradation in quality when going to the Q8 models, which is expected, but good to confirm.

@ggerganov
Member Author

The WER tests in #2999 are very useful, but all samples from the LibriSpeech dataset are relatively short and don't include non-speech segments. We should add some additional tests with longer audio samples, preferably with silence intervals, which is what usually trips up Whisper Large v3. When we add the VAD support (#3065), we will be able to measure quantitatively how much it improves the quality in such cases.

@MahmoudAshraf97

I guess this dataset has what you need; I'm using it for long-form evaluation in faster-whisper:
SYSTRAN/faster-whisper#1101

@fujimotos
Contributor

Actually, I know a couple of public benchmark datasets that can be used for this purpose.

> When we add the VAD support (#3065) we will be able to measure quantitatively how much it improves the quality in such cases.

If you don't mind, I think I can post another PR next week that adds a long-form WER benchmark test.

@fujimotos
Contributor

@ggerganov @danbev I have just created a pull request, #3185, that adds a long-form transcription benchmark test.

Benchmark dataset

This time I used the Earnings-21 dataset by Del Rio et al. (2021), which provides 49 hours of English speech sourced from corporate earnings calls.

Earnings-21: A Practical Benchmark for ASR in the Wild (2021)
https://arxiv.org/abs/2104.11348

Here is an audio example:

earnings21_Kuehne_Nagel_90s.mp4

I think there are two benefits to using Earnings-21:

  1. It makes the benchmark result comparable. OpenAI used this dataset in their
    paper, so we can compare our WER score against OpenAI's official number.

  2. It is easy to access. The full dataset is distributed as a Git repo, and
    the total file size is relatively small (just 49 files in mp3 format).

Benchmark Result

I ran the benchmark test using two models: tiny and base. I also tested the
VAD support (introduced by #3065) to see if it improves the general accuracy.

The following table summarizes the benchmark result:

| Speech Recognition | WER (Tiny) | WER (Base) |
|---|---|---|
| Whisper.cpp | 17.37% | 12.53% |
| Whisper.cpp (w. VAD) | 18.91% | 15.70% |
| OpenAI Whisper | 18.7% | 13.5% |
| OpenAI Whisper (.en model) | 17.0% | 12.5% |

Some notes on this table:

  • The version of whisper.cpp I used was 2c4b904, and I enabled the VAD support by adding the following inference parameters:

    WHISPER_FLAGS = --no-prints --language en --output-txt --vad --vad-model ../../models/silero-v5.1.2-ggml.bin
  • OpenAI's scores are retrieved from their original paper (see Appendix D.4).

Some Analysis and Insights

Wondering why VAD did not necessarily improve the recognition accuracy, I looked a bit deeper at the benchmark results.

First, the following graph shows the detailed WER score for each audio recording:

[graph: per-recording WER with and without VAD]

As you can see, the effect of enabling VAD is hit-and-miss. It improves the performance on some audio files, but degrades the accuracy on others.

Looking at the transcriptions produced by whisper.cpp, it seems that the VAD support does prevent hallucinations in some cases, but introduces new hallucinations in others.

So its effectiveness at improving recognition accuracy was limited (I attached some hallucination examples below).

Appendix: Hallucination Examples

4359732.mp3 (Kuehne Nagel International)

Whisper.cpp (tiny)

we have a second we have a second really state portfolio is I at for sale I do not expect material impact on the PNL.
That is going to be the last of 10 million.
Thank you very much.
Thank you.
Thank you very much.
I want to thank you very much.
Thank you very much.
I want to thank you very much.
I want to thank you very much.
I want to thank you very much.
I want to thank you very much.
I want to thank you very much.
I want to thank you very much.
I want to thank you very much.
I want to thank you very much.
I want to thank you very much.
...

With VAD enabled

We have a second. We have a second real estate portfolio is for sale,
I do not expect material impact on the P&L other state. So it's going to be less than 10 million.
Thank you very much. Thank you.
Thank you for coming from Manivakaya, and from Bank of America, please go ahead.
  • VAD prevents the hallucination successfully.

4320211.mp3 (Monroe Inc)

Whisper.cpp (tiny)

At the mid-point of our guidance range, we expect an operating margin of approximately $10.2% interest expense to be approximately $29 million,
depreciation and amortization to be approximately $65 million, and even to be approximately $196 million.
We expect capital expenditures to be approximately $60 million this year.
This guidance reflects an effective tax rate of approximately 23.5% and is based on $34 million diluted weighted average shares outstanding.

With VAD enabled

The appreciation and amortization to be approximately $65 million, and even to be approximately $196 million. We expect capital expenditures to be approximately $60 million this year.
An adjusted basis, which is an adjusted basis, which is an adjusted basis, which is an adjusted basis, which is an adjusted basis.
An adjusted basis, which is an adjusted basis, which is an adjusted basis, which is an adjusted basis, which is an adjusted basis.
An adjusted basis, which is an adjusted basis, which is an adjusted basis, which is an adjusted basis.
An adjusted basis, which is an adjusted basis, which is an adjusted basis, which is an adjusted basis, which is an adjusted basis.
An adjusted basis, which is an adjusted basis, which is an adjusted basis, which is an adjusted basis, which is an adjusted basis.
An adjusted basis, which is an adjusted basis, which is an adjusted basis, which is an adjusted basis, which is an adjusted basis.
  • VAD introduces hallucination where it did not occur originally.

@fujimotos
Contributor

fujimotos commented May 23, 2025

I'd be able to do a more detailed analysis if I had access to more computing power (currently I run all the benchmark tests on my personal AWS account), but anyway, this is the current state of the analysis so far.

@ggerganov @danbev If anything is not clear, please just ask me! Hope it's interesting to you.

@ggerganov
Member Author

ggerganov commented May 23, 2025

@fujimotos Thank you - very interesting and useful! Will be taking a look in the next few days - curious to understand why the VAD can lead to degraded quality. Somehow my expectation is that it should either improve the quality or keep it the same in all cases. Maybe some parameters need adjustment. Also, it would be interesting to run VAD/no-VAD with Large V3 and see how it compares there.

@danbev
Collaborator

danbev commented May 23, 2025

Very interesting indeed! I saw something similar this week when using the Large V3 model, where without VAD it would hallucinate, but when VAD was enabled it did not and seemed to produce valid output.

I've tried running 4320211.mp3 (Monroe Inc) with Large V3, and it does not show these hallucinations:

[00:25:12.680 --> 00:25:29.990]   At the midpoint of our guidance range, we expect an operating margin of approximately 10.2% interest expense to be approximately $29 million depreciation and amortization to be approximately $65 million in EBITDA to be approximately $196 million.
[00:25:29.990 --> 00:25:43.880]   We expect capital expenditures to be approximately $60 million this year. This guidance reflects an effective tax rate of approximately 23.5% and is based on 34 million diluted weighted average shares outstanding.
[00:25:43.880 --> 00:25:49.160]   As always, our guidance does not assume any future acquisitions or greenfield store opening.
[00:25:49.160 --> 00:25:54.690]   I'll now turn the call over to brought, provide some closing remarks before we move to Q&A.
[00:25:54.690 --> 00:26:03.580]   Thanks, Brian. We are making solid strides in the execution of our Monroe forward strategy, in particular, our store rebrand and reimage initiative.

And I also tried with the tiny model:

[00:25:13.160 --> 00:25:19.050]   At the midpoint of our guidance range, we expect an operating margin of approximately $10.2%
[00:25:19.050 --> 00:25:25.480]   interest expense to be approximately $29 million, depreciation and amortization to be approximately
[00:25:25.480 --> 00:25:33.720]   $65 million and EBITDA to be approximately $196 million. We expect capital expenditures to be approximately
[00:25:33.720 --> 00:25:40.180]   $60 million this year. This guidance reflects an effective tax rate of approximately 23.5%
[00:25:40.180 --> 00:25:45.890]   and is based on $34 million diluted weighted average shares outstanding. As always, our guidance
[00:25:45.890 --> 00:25:51.440]   does not assume any future acquisitions or greenfield store opening. I'll now turn the call over
[00:25:51.440 --> 00:25:57.560]   to brought some closing remarks before we move to Q&A. Thanks Brian. We are making solid
[00:25:57.560 --> 00:26:03.960]   strides in the execution of our Monroe Forward Strategy in particular our store rebrand and remaged initiative.

One thing to note is that I'm using the version of whisper.cpp from #3173, which I've been working on this week. The changes were mostly related to how VAD timestamps are aligned to the original audio timestamps, but I also changed from using floats/doubles to using int64_t for the timestamps, and perhaps this has an impact on the audio samples that are passed to whisper_full. I need to look into this a bit further, but it would be interesting to run the benchmarks using #3173 to see if this has an impact (and also to check that I'm not missing something, it being the end of the week).

I've run the benchmarks on my Mac (macOS 15.4.1 on Mac15,3 with an Apple M3 and 24 GB RAM) using the tiny model, with #3173 applied, and got the following result with VAD enabled:

(venv) $ cat tiny.txt
WER: 16.78%

And without VAD:

(venv) $ cat tiny.txt
WER: 18.70%

I'm not seeing the repeats in speech-datasets/earnings21/media/4320211.mp3.txt:

At the midpoint of our guidance range, we expect an operating
margin of approximately 10.2% interest expense to be approximately $29 million dollars depreciation
and amortization to be approximately $65 million dollars and EBITDA to be approximately $196 million.
We expect capital expenditures to be approximately $60 million this year.
This guidance reflects an effective tax rate of approximately 23.5% and is based on 34 million diluted
weighted average shares outstanding.

I'll try using the base model with VAD:

(venv) $ cat base.txt
WER: 13.40%

And without VAD:

$ cat base.txt
WER: 12.57%

@WilliamTambellini
Contributor

Thanks @danbev.
Indeed, it would be better to (re)test with Whisper V3.
