Currently the attention over heads runs in serial:

https://github.com/certik/fastGPT/blob/01eb84b015d89a567245da0445c0abb7d53a8500/gpt2.f90#L101

Each head's computation is independent of the others, so we should try to parallelize that loop and measure whether we get any speedup.
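For concreteness, here is a minimal sketch of one way to do it with OpenMP. This is not the actual code at `gpt2.f90#L101`: the routine name, the array layout (`head_dim × n_seq` per head), and the mask-free per-head body are all assumptions, and the causal mask is omitted for brevity. The only point is the `!$omp parallel do` on the head loop, which is safe because each iteration writes a disjoint slice of the output:

```fortran
! Sketch only: names, shapes, and the per-head body are assumptions,
! not fastGPT's actual implementation. Compile with -fopenmp.
subroutine attention_over_heads(n_head, n_seq, head_dim, q, k, v, y)
implicit none
integer, intent(in) :: n_head, n_seq, head_dim
real, intent(in) :: q(head_dim, n_seq, n_head)
real, intent(in) :: k(head_dim, n_seq, n_head)
real, intent(in) :: v(head_dim, n_seq, n_head)
real, intent(out) :: y(head_dim, n_seq, n_head)
real :: a(n_seq, n_seq)   ! per-thread scratch for attention scores
integer :: h, i
! Each head writes y(:, :, h) only, so iterations are independent;
! the scratch array `a` must be private to each thread.
!$omp parallel do default(shared) private(h, i, a)
do h = 1, n_head
    ! scores: column j holds q_j . k_i / sqrt(head_dim) for all keys i
    a = matmul(transpose(k(:, :, h)), q(:, :, h)) / sqrt(real(head_dim))
    ! softmax over each column (the keys axis), stabilized by maxval
    do i = 1, n_seq
        a(:, i) = exp(a(:, i) - maxval(a(:, i)))
        a(:, i) = a(:, i) / sum(a(:, i))
    end do
    ! weighted sum of values for every query position
    y(:, :, h) = matmul(v(:, :, h), a)
end do
!$omp end parallel do
end subroutine attention_over_heads
```

Whether this actually wins depends on `n_head` relative to the core count and on whether `matmul` already dispatches to a threaded BLAS (nested threading can make things slower), so we should benchmark rather than assume.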