[Bug]: When using VLLM_USE_MODELSCOPE, the huggingface_hub API will be used to get the model file list.

### Your current environment

<details>
<summary>The output of `python collect_env.py`</summary>

```text
INFO 02-17 15:29:01 __init__.py:190] Automatically detected platform cuda.
Collecting environment information...
PyTorch version: 2.5.1+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.12.9 (main, Feb  5 2025, 19:10:45) [Clang 19.1.6 ] (64-bit runtime)
Python platform: Linux-5.10.134-17.al8.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.2.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB
GPU 2: Tesla V100-SXM2-32GB
GPU 3: Tesla V100-SXM2-32GB
GPU 4: Tesla V100-SXM2-32GB
GPU 5: Tesla V100-SXM2-32GB
GPU 6: Tesla V100-SXM2-32GB
GPU 7: Tesla V100-SXM2-32GB

Nvidia driver version: 535.129.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      46 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             96
On-line CPU(s) list:                0-95
Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
CPU family:                         6
Model:                              85
Thread(s) per core:                 2
Core(s) per socket:                 48
Socket(s):                          1
Stepping:                           4
BogoMIPS:                           4999.99
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat
Hypervisor vendor:                  KVM
Virtualization type:                full
L1d cache:                          1.5 MiB (48 instances)
L1i cache:                          1.5 MiB (48 instances)
L2 cache:                           48 MiB (48 instances)
L3 cache:                           33 MiB (1 instance)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-95
Vulnerability Itlb multihit:        KVM: Mitigation: VMX unsupported
Vulnerability L1tf:                 Mitigation; PTE Inversion
Vulnerability Mds:                  Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown:             Mitigation; PTI
Vulnerability Mmio stale data:      Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed:             Vulnerable
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Vulnerable
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] pyzmq==26.2.1
[pip3] torch==2.5.1
[pip3] torchaudio==2.5.1
[pip3] torchvision==0.20.1
[pip3] transformers==4.48.3
[pip3] triton==3.1.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.7.3.dev133+g84683fa2.d20250214
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NV1	NV1	NV2	PHB	PHB	NV2	PHB	0-95	0		N/A
GPU1	NV1	 X 	NV2	NV1	PHB	PHB	PHB	NV2	0-95	0		N/A
GPU2	NV1	NV2	 X 	NV2	NV1	PHB	PHB	PHB	0-95	0		N/A
GPU3	NV2	NV1	NV2	 X 	PHB	NV1	PHB	PHB	0-95	0		N/A
GPU4	PHB	PHB	NV1	PHB	 X 	NV2	NV1	NV2	0-95	0		N/A
GPU5	PHB	PHB	PHB	NV1	NV2	 X 	NV2	NV1	0-95	0		N/A
GPU6	NV2	PHB	PHB	PHB	NV1	NV2	 X 	NV1	0-95	0		N/A
GPU7	PHB	NV2	PHB	PHB	NV2	NV1	NV1	 X 	0-95	0		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NVIDIA_VISIBLE_DEVICES=all
NVIDIA_REQUIRE_CUDA=cuda>=12.2 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526
NCCL_VERSION=2.19.3-1
NVIDIA_DRIVER_CAPABILITIES=compute,utility
VLLM_USE_MODELSCOPE=True
NVIDIA_PRODUCT_NAME=CUDA
CUDA_VERSION=12.2.2
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
VLLM_USE_V1=1
NCCL_CUMEM_ENABLE=0
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY
```

</details>


### 🐛 Describe the bug

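With `VLLM_USE_MODELSCOPE=True` exported, vLLM still calls the huggingface_hub API to fetch the model file list instead of going through ModelScope.

The request to huggingface.co hangs until the connection times out (more than two minutes per attempt in the log below), and after one retry the server fails with an error: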
```shell
$ export VLLM_USE_MODELSCOPE=True
$ export VLLM_USE_V1=1
$ vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-14B --enable-reasoning --reasoning-parser deepseek_r1
INFO 02-17 15:31:16 __init__.py:190] Automatically detected platform cuda.
INFO 02-17 15:31:18 api_server.py:891] vLLM API server version 0.7.3.dev133+g84683fa2.d20250214
...
ERROR 02-17 15:33:29 config.py:102] Error retrieving file list: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /api/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B/tree/main?recursive=True&expand=False (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f216b87a360>, 'Connection to huggingface.co timed out. (connect timeout=None)'))"), '(Request ID: 1ba47281-d3fd-47d8-b844-bc6c1984d527)'), retrying 1 of 2
ERROR 02-17 15:35:41 config.py:100] Error retrieving file list: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /api/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B/tree/main?recursive=True&expand=False (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f216b87b9e0>, 'Connection to huggingface.co timed out. (connect timeout=None)'))"), '(Request ID: 88b71445-799e-40b8-a4b1-4ba41a5cd4b6)')
Traceback (most recent call last):
....
```
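
For illustration, the behavior I expected can be sketched as a small dispatcher that picks the file-listing backend from the environment variable. This is only a sketch under assumptions, not vLLM's actual code: the function names (`use_modelscope`, `list_files_huggingface`, `list_files_modelscope`, `list_model_files`) are hypothetical, the flag parsing is assumed to accept only "true" (case-insensitive), and the shape of the entries returned by ModelScope's `HubApi.get_model_files` (dicts with `Path` and `Type` keys) is an assumption that should be checked against the installed modelscope version.

```python
import os
from typing import Callable, List, Optional

# A lister takes (repo_id, revision) and returns the file paths in the repo.
Lister = Callable[[str, Optional[str]], List[str]]


def use_modelscope() -> bool:
    # Assumption: the flag is enabled only when it equals "true"
    # (case-insensitive), which covers the "True" value from this report.
    return os.environ.get("VLLM_USE_MODELSCOPE", "False").lower() == "true"


def list_files_huggingface(repo_id: str, revision: Optional[str]) -> List[str]:
    # Imported lazily so the ModelScope path never touches huggingface_hub.
    from huggingface_hub import list_repo_files

    return list_repo_files(repo_id, revision=revision)


def list_files_modelscope(repo_id: str, revision: Optional[str]) -> List[str]:
    from modelscope.hub.api import HubApi

    kwargs = {"model_id": repo_id, "recursive": True}
    if revision is not None:
        kwargs["revision"] = revision
    entries = HubApi().get_model_files(**kwargs)
    # Assumption: each entry is a dict with "Path" and "Type" keys, and
    # "blob" marks a regular file rather than a directory.
    return [e["Path"] for e in entries if e.get("Type") == "blob"]


def list_model_files(
    repo_id: str,
    revision: Optional[str] = None,
    hf_lister: Lister = list_files_huggingface,
    ms_lister: Lister = list_files_modelscope,
) -> List[str]:
    # Route to ModelScope when the flag is set, so huggingface.co is
    # never contacted in that case.
    lister = ms_lister if use_modelscope() else hf_lister
    return lister(repo_id, revision)
```

With the flag set as in the reproduction above, `list_model_files("deepseek-ai/DeepSeek-R1-Distill-Qwen-14B")` would go through ModelScope only. The listers are passed as parameters so the routing can be checked without network access.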


### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.

[Bug]: When using VLLM_USE_MODELSCOPE, the huggingface_hub API will be used to get the model file list. #13382