Conversation

@masahi (Member) commented Nov 7, 2023

Building on @Lunderberg's work in mlc-ai#1096, we are now switching to build-time sharding in mlc_serve. Runtime sharding is no longer supported; when building a model you must pass --use-presharded-weights. You also need the latest contrib-vllm.

Build-time sharding also lets us support FT quantization with Disco, since weight preprocessing is now applied after sharding. This has been confirmed to work on the 7B and 13B models, with both q4f16_ft and q8f16_ft, at --num-shards=2. Other configurations will be tested later.
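To illustrate the ordering change described above, here is a minimal, self-contained sketch of "shard first, then preprocess." Everything in it is an assumption for illustration: the helper names (`shard_rows`, `preprocess`), the row-wise split, and the placeholder transform are not the actual mlc-llm or TVM Disco APIs.

```python
# Hypothetical sketch of build-time sharding: each weight tensor is
# split into num_shards pieces first, and quantization-style
# preprocessing is then applied per shard, so the preprocessing only
# ever sees shard-local tensors.

def shard_rows(weight, num_shards):
    """Split a 2-D weight (list of rows) into contiguous row blocks."""
    rows_per_shard = len(weight) // num_shards
    return [
        weight[i * rows_per_shard:(i + 1) * rows_per_shard]
        for i in range(num_shards)
    ]

def preprocess(shard):
    """Stand-in for FT-style weight preprocessing, applied per shard."""
    return [[x * 2 for x in row] for row in shard]  # placeholder transform

weight = [[1, 2], [3, 4], [5, 6], [7, 8]]
# Shard first, then preprocess each shard independently.
shards = [preprocess(s) for s in shard_rows(weight, num_shards=2)]
```

The key point the sketch captures is the order of operations: under runtime sharding the preprocessing would have run on the full `weight` before splitting, whereas here each shard is preprocessed independently.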

@masahi masahi merged commit 9c006fd into octoml:batch-serving Nov 7, 2023
@Lunderberg (Member) commented:
As an addendum, the build.py script must now be run in two steps: once with --convert-weight-only and once with --build-model-only. This is enforced in this check, and is due to the same parameter-size handling described here.
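The kind of flag check described above can be sketched as follows. This is a hypothetical reconstruction, not the actual build.py code: the function name `parse_build_args` and the exact validation rule are assumptions; only the three flag names come from the discussion.

```python
import argparse

def parse_build_args(argv):
    """Hypothetical sketch of a two-phase build flag check."""
    parser = argparse.ArgumentParser(prog="build.py")
    parser.add_argument("--use-presharded-weights", action="store_true")
    parser.add_argument("--convert-weight-only", action="store_true")
    parser.add_argument("--build-model-only", action="store_true")
    args = parser.parse_args(argv)
    # With presharded weights, weight conversion and model build must
    # run as separate invocations, so exactly one phase flag is required.
    if args.use_presharded_weights and not (
        args.convert_weight_only ^ args.build_model_only
    ):
        parser.error(
            "--use-presharded-weights requires two runs: one with "
            "--convert-weight-only and one with --build-model-only"
        )
    return args
```

A caller would then invoke the script twice, once per phase, with the same remaining arguments.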
