Conversation

@masahi (Member) commented Nov 7, 2023

Building on @Lunderberg's work in mlc-ai#1096, we are now switching to build-time sharding in mlc_serve. Runtime sharding is no longer supported; when building a model you must pass --use-presharded-weights. You also need the latest contrib-vllm.

Build-time sharding also lets us support FT quantization with Disco, since weight preprocessing is now applied after sharding. This has been confirmed to work on the 7B and 13B models, with both q4f16_ft and q8f16_ft, at --num-shards=2. Other configurations will be tested later.
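To illustrate the ordering change described above, here is a minimal, self-contained sketch of "shard first, then preprocess." Everything in it is an assumption for illustration: the helper names (`shard_rows`, `preprocess`), the row-wise split, and the placeholder transform are not the actual mlc-llm or TVM Disco APIs.

```python
# Hypothetical sketch of build-time sharding: each weight tensor is
# split into num_shards pieces first, and quantization-style
# preprocessing is then applied per shard, so the preprocessing only
# ever sees shard-local tensors.

def shard_rows(weight, num_shards):
    """Split a 2-D weight (list of rows) into contiguous row blocks."""
    rows_per_shard = len(weight) // num_shards
    return [
        weight[i * rows_per_shard:(i + 1) * rows_per_shard]
        for i in range(num_shards)
    ]

def preprocess(shard):
    """Stand-in for FT-style weight preprocessing, applied per shard."""
    return [[x * 2 for x in row] for row in shard]  # placeholder transform

weight = [[1, 2], [3, 4], [5, 6], [7, 8]]
# Shard first, then preprocess each shard independently.
shards = [preprocess(s) for s in shard_rows(weight, num_shards=2)]
```

The key point the sketch captures is the order of operations: under runtime sharding the preprocessing would have run on the full `weight` before splitting, whereas here each shard is preprocessed independently.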

@masahi masahi merged commit 9c006fd into octoml:batch-serving Nov 7, 2023
@Lunderberg (Member) commented:
As an addendum, the build.py script must now be run in two steps: once with --convert-weight-only and once with --build-model-only. This is enforced in this check, and is due to the same parameter-size handling described here.
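The kind of flag check described above can be sketched as follows. This is a hypothetical reconstruction, not the actual build.py code: the function name `parse_build_args` and the exact validation rule are assumptions; only the three flag names come from the discussion.

```python
import argparse

def parse_build_args(argv):
    """Hypothetical sketch of a two-phase build flag check."""
    parser = argparse.ArgumentParser(prog="build.py")
    parser.add_argument("--use-presharded-weights", action="store_true")
    parser.add_argument("--convert-weight-only", action="store_true")
    parser.add_argument("--build-model-only", action="store_true")
    args = parser.parse_args(argv)
    # With presharded weights, weight conversion and model build must
    # run as separate invocations, so exactly one phase flag is required.
    if args.use_presharded_weights and not (
        args.convert_weight_only ^ args.build_model_only
    ):
        parser.error(
            "--use-presharded-weights requires two runs: one with "
            "--convert-weight-only and one with --build-model-only"
        )
    return args
```

A caller would then invoke the script twice, once per phase, with the same remaining arguments.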
