Currently the attention over heads runs in serial:

https://github.com/certik/fastGPT/blob/01eb84b015d89a567245da0445c0abb7d53a8500/gpt2.f90#L101

Each head's computation is independent of the others, so we should try to parallelize that loop and measure whether we get any speedup.
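For concreteness, here is a minimal sketch of one way to do it with OpenMP. This is not the actual code at `gpt2.f90#L101`: the routine name, the array layout (`head_dim × n_seq` per head), and the mask-free per-head body are all assumptions, and the causal mask is omitted for brevity. The only point is the `!$omp parallel do` on the head loop, which is safe because each iteration writes a disjoint slice of the output:

```fortran
! Sketch only: names, shapes, and the per-head body are assumptions,
! not fastGPT's actual implementation. Compile with -fopenmp.
subroutine attention_over_heads(n_head, n_seq, head_dim, q, k, v, y)
implicit none
integer, intent(in) :: n_head, n_seq, head_dim
real, intent(in) :: q(head_dim, n_seq, n_head)
real, intent(in) :: k(head_dim, n_seq, n_head)
real, intent(in) :: v(head_dim, n_seq, n_head)
real, intent(out) :: y(head_dim, n_seq, n_head)
real :: a(n_seq, n_seq)   ! per-thread scratch for attention scores
integer :: h, i
! Each head writes y(:, :, h) only, so iterations are independent;
! the scratch array `a` must be private to each thread.
!$omp parallel do default(shared) private(h, i, a)
do h = 1, n_head
    ! scores: column j holds q_j . k_i / sqrt(head_dim) for all keys i
    a = matmul(transpose(k(:, :, h)), q(:, :, h)) / sqrt(real(head_dim))
    ! softmax over each column (the keys axis), stabilized by maxval
    do i = 1, n_seq
        a(:, i) = exp(a(:, i) - maxval(a(:, i)))
        a(:, i) = a(:, i) / sum(a(:, i))
    end do
    ! weighted sum of values for every query position
    y(:, :, h) = matmul(v(:, :, h), a)
end do
!$omp end parallel do
end subroutine attention_over_heads
```

Whether this actually wins depends on `n_head` relative to the core count and on whether `matmul` already dispatches to a threaded BLAS (nested threading can make things slower), so we should benchmark rather than assume.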