Failed CI on A100 #2064

@xuzhao9

The test test_llama_v2_7b_16h_example_cuda started failing between 20231115 and 20231116.

Failed workflow: https://github.com/pytorch/benchmark/actions/runs/7006721966/job/19059198530

Detailed error and command to reproduce:

$ python run.py llama_v2_7b_16h -d cuda --accuracy
fp64 golden ref were not generated for llama_v2_7b_16h. Setting accuracy check to cosine
CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
Traceback (most recent call last):
  File "/data/users/xzhao9/git/benchmark/torchbenchmark/util/env_check.py", line 510, in check_accuracy
    correct_result = run_n_iterations(
  File "/data/users/xzhao9/git/benchmark/torchbenchmark/util/env_check.py", line 395, in run_n_iterations
    _model_iter_fn(mod, inputs, contexts, optimizer, collect_outputs=False)
  File "/data/users/xzhao9/git/benchmark/torchbenchmark/util/env_check.py", line 393, in _model_iter_fn
    return forward_pass(mod, inputs, contexts, collect_outputs)
  File "/data/users/xzhao9/git/benchmark/torchbenchmark/util/env_check.py", line 370, in forward_pass
    return mod(*inputs)
  File "/home/xzhao9/.conda/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/xzhao9/.conda/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/xzhao9/.conda/envs/py38/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 820, in forward
    outputs = self.model(
  File "/home/xzhao9/.conda/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/xzhao9/.conda/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/xzhao9/.conda/envs/py38/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 708, in forward
    layer_outputs = decoder_layer(
  File "/home/xzhao9/.conda/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/xzhao9/.conda/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/xzhao9/.conda/envs/py38/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 424, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/xzhao9/.conda/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/xzhao9/.conda/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/xzhao9/.conda/envs/py38/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 321, in forward
    query_states = self.q_proj(hidden_states)
  File "/home/xzhao9/.conda/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/xzhao9/.conda/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/xzhao9/.conda/envs/py38/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 116, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
Running eval method from llama_v2_7b_16h on cuda in eager mode with input batch size 1 and precision fp16.
Accuracy:              eager_1st_run_fail

Bisection workflow: https://github.com/pytorch/benchmark/actions/runs/6985353191
Root cause commit: 12b2dd16b050e6495910fc564517fbb51dde1f20 (pytorch/pytorch@12b2dd1)
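
The traceback above ends in F.linear on an fp16 input, where cuBLAS handle creation fails. As a rough triage aid, the sketch below runs a tiny fp16 linear layer on the GPU in a fresh process, which may help tell apart an environment-level cuBLAS problem from one specific to llama_v2_7b_16h. It is not part of the benchmark suite and does not reproduce the original failure: the helper name check_cublas_init and the tensor shapes are made up for illustration, and only standard PyTorch calls are used.

```python
def check_cublas_init(device="cuda"):
    """Return a short status string for a tiny fp16 linear forward pass.

    Hypothetical helper for triage only; names and shapes are illustrative.
    """
    try:
        import torch
    except ImportError:
        return "torch_not_installed"

    if not torch.cuda.is_available():
        return "cuda_unavailable"

    try:
        # A small fp16 matmul; the traceback shows the error surfacing
        # in F.linear, so this exercises the same kind of call.
        x = torch.randn(8, 16, device=device, dtype=torch.float16)
        linear = torch.nn.Linear(16, 4).to(device=device, dtype=torch.float16)
        with torch.no_grad():
            linear(x)
        # Wait for the GPU so that asynchronous errors surface here.
        torch.cuda.synchronize()
    except RuntimeError as exc:
        return f"failed: {exc}"
    return "ok"


if __name__ == "__main__":
    print(check_cublas_init())
```

If this prints "ok" on the affected machine while the reproduce command above still fails, that would suggest the failure depends on the model or on the state of the process at that point, rather than on cuBLAS being unusable in general.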
