Model execution runs excruciatingly slow/doesn't run at all

### 🐛 Describe the bug

When I run inference, the model produces at most the first line of output and then stalls. With Ollama everything runs instantaneously, but I haven't gotten a complete response out of any model on torchchat. Any idea why this is happening?

```bash
(.venv) (base) jakemalis@Jakes-MacBook-Pro torchchat % python3 torchchat.py generate llama3.1 --prompt "What's your favorite color?"              
Downloading meta-llama/Meta-Llama-3.1-8B-Instruct from HuggingFace...
original/params.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 199/199 [00:00<00:00, 1.91MB/s]
config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 855/855 [00:00<00:00, 1.96MB/s]
.gitattributes: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.52k/1.52k [00:00<00:00, 1.65MB/s]
generation_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 184/184 [00:00<00:00, 245kB/s]
README.md: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 44.0k/44.0k [00:00<00:00, 3.38MB/s]
USE_POLICY.md: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.69k/4.69k [00:00<00:00, 15.7MB/s]
LICENSE: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7.63k/7.63k [00:00<00:00, 49.4MB/s]
special_tokens_map.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 296/296 [00:00<00:00, 3.53MB/s]
tokenizer_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 55.4k/55.4k [00:00<00:00, 3.53MB/s]
tokenizer.model: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.18M/2.18M [00:00<00:00, 3.05MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.09M/9.09M [00:02<00:00, 3.63MB/s]
consolidated.00.pth: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16.1G/16.1G [08:07<00:00, 32.9MB/s]
Fetching 12 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [08:07<00:00, 40.66s/it]
Converting meta-llama/Meta-Llama-3.1-8B-Instruct to torchchat format...
NumExpr defaulting to 10 threads.
PyTorch version 2.7.0.dev20250124 available.
Warning: PTEModel (ExecuTorch) not available with exception: No module named 'executorch'
known configs: ['llava-1.5', '13B', '70B', 'CodeLlama-7b-Python-hf', 'Meta-Llama-3.1-70B-Tune', 'Granite-3B-Code', '34B', 'Meta-Llama-3.1-8B', 'stories42M', 'Llama-Guard-3-1B', '30B', 'Meta-Llama-3.1-8B-Tune', 'stories110M', 'Granite-3.1-8B-Instruct', 'Llama-3.2-11B-Vision', 'Meta-Llama-3.2-3B', 'Meta-Llama-3.1-70B', 'Meta-Llama-3.2-1B', 'Granite-3.0-2B-Instruct', 'Granite-3.0-8B-Instruct', '7B', 'stories15M', 'Llama-Guard-3-1B-INT4', 'Mistral-7B', 'Granite-8B-Code', 'Meta-Llama-3-70B', 'Granite-3.1-2B-Instruct', 'Meta-Llama-3-8B']
Model config {'block_size': 131072, 'vocab_size': 128256, 'n_layers': 32, 'n_heads': 32, 'dim': 4096, 'hidden_dim': 14336, 'n_local_heads': 8, 'head_dim': 128, 'rope_base': 500000.0, 'norm_eps': 1e-05, 'multiple_of': 1024, 'ffn_dim_multiplier': 1.3, 'use_tiktoken': True, 'use_hf_tokenizer': False, 'tokenizer_prepend_bos': True, 'max_seq_length': 8192, 'rope_scaling': {'factor': 8.0, 'low_freq_factor': 1.0, 'high_freq_factor': 4.0, 'original_max_position_embeddings': 8192}, 'n_stages': 1, 'stage_idx': 0, 'attention_bias': False, 'feed_forward_bias': False, 'tie_word_embeddings': False, 'embedding_multiplier': None, 'attention_multiplier': None, 'residual_multiplier': None, 'logits_scaling': None}
Moving checkpoint to /Users/jakemalis/.torchchat/model-cache/downloads/meta-llama/Meta-Llama-3.1-8B-Instruct/model.pth.
Done.
Moving model to /Users/jakemalis/.torchchat/model-cache/meta-llama/Meta-Llama-3.1-8B-Instruct.
Unable to import torchao experimental quant_api with error:  [Errno 2] No such file or directory: '/Users/jakemalis/Downloads/torchchat/torchao-build/src/ao/torchao/experimental/quant_api.py'
Using device=mps 
Loading model...
Time to load model: 23.71 seconds
-----------------------------------------------------------
What's your favorite color? - A simple
```
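For reference, here is a rough sketch of how large the weights are, computed from the `Model config` line in the log above. It assumes a standard Llama-style decoder layout (which the log does not spell out) and 2 bytes per parameter, which is consistent with the 16.1G `consolidated.00.pth` download. The helper names `count_params` and `weight_gb` are made up for this illustration and are not part of torchchat.

```python
# Shape values copied from the "Model config" line in the log above.
cfg = {
    "vocab_size": 128256,
    "n_layers": 32,
    "n_heads": 32,
    "n_local_heads": 8,  # key/value heads (grouped-query attention)
    "head_dim": 128,
    "dim": 4096,
    "hidden_dim": 14336,
    "tie_word_embeddings": False,
}


def count_params(c):
    """Count weights, assuming a Llama-style decoder with no bias terms."""
    q_out = c["n_heads"] * c["head_dim"]
    kv_out = c["n_local_heads"] * c["head_dim"]
    # Attention: query, key, value and output projections.
    attn = c["dim"] * q_out + 2 * c["dim"] * kv_out + q_out * c["dim"]
    # Feed-forward: three dim x hidden_dim matrices (gated MLP, assumed).
    ffn = 3 * c["dim"] * c["hidden_dim"]
    # Two normalization weight vectors per layer.
    norms = 2 * c["dim"]
    per_layer = attn + ffn + norms
    embed = c["vocab_size"] * c["dim"]
    output = 0 if c["tie_word_embeddings"] else embed
    # Token embedding + all layers + final norm + output projection.
    return embed + c["n_layers"] * per_layer + c["dim"] + output


def weight_gb(n_params, bytes_per_param=2):
    """Size of the weights in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


n = count_params(cfg)
print(f"parameters: {n:,}")
print(f"weights at 2 bytes/param: {weight_gb(n):.1f} GB")
```

This comes out to roughly 8.03 billion parameters and about 16.1 GB of weights. If the machine's unified memory is close to that figure, swapping could be one explanation for the stall, but that is a guess and not a confirmed diagnosis.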

### Versions

```bash
CMakeLists.txt		CONTRIBUTING.md		README.md		collect_env.py		docs			runner			tokenizer		torchchat.py
CODE_OF_CONDUCT.md	LICENSE			assets			dist_run.py		install			tests			torchchat
(.venv) (base) jakemalis@Jakes-MacBook-Pro torchchat % python collect_env.py
Collecting environment information...
PyTorch version: 2.7.0.dev20250124
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 15.3 (arm64)
GCC version: Could not collect
Clang version: 16.0.0 (clang-1600.0.26.6)
CMake version: version 3.31.4
Libc version: N/A

Python version: 3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:54:21) [Clang 16.0.6 ] (64-bit runtime)
Python platform: macOS-15.3-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M1 Pro

Versions of relevant libraries:
[pip3] numpy==2.2.2
[pip3] torch==2.7.0.dev20250124
[pip3] torchao==0.8.0+git2f97b095
[pip3] torchtune==0.6.0.dev20250124+cpu
[pip3] torchvision==0.22.0.dev20250124
[conda] numpy                     2.2.1                    pypi_0    pypi
[conda] numpy-base                2.1.3           py312he047099_0  
[conda] pytorch                   2.6.0.dev20241112        py3.12_0    pytorch-nightly
[conda] torchaudio                2.5.0.dev20241118       py312_cpu    pytorch-nightly
[conda] torchvision               0.20.0.dev20241118       py312_cpu    pytorch-nightly
```
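The listing above shows both pip and conda entries with different versions for some packages (for example `torch` 2.7.0.dev20250124 under pip and `pytorch` 2.6.0.dev20241112 under conda). A minimal sketch for checking which versions the active interpreter actually sees, using only the standard library; `installed_versions` is a hypothetical helper name, and the package list is copied from the pip entries above.

```python
import sys
from importlib import metadata

# Package names taken from the "Versions of relevant libraries" list above.
PACKAGES = ["numpy", "torch", "torchao", "torchtune", "torchvision"]


def installed_versions(names):
    """Map each package name to the version visible to this interpreter, or None."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


# Show which Python binary is running, then the versions it resolves.
print("interpreter:", sys.executable)
for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'not installed'}")
```

Running this inside the same `.venv` used for `torchchat.py` would show whether the pip or the conda builds are the ones being picked up.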

Model execution runs excruciatingly slow/doesn't run at all #1483
