State dict serialization #51

base: main

Conversation
This looks really solid; I'm impressed with how quickly this came together. A couple of nits, but nothing major.
Could you please test the speed increase in the e2e test_models tests and report it in this PR?
edit: Ah, I just remembered we talked about landing this in stages, so I think some of the necessary plumbing won't exist.
torchstore/state_dict_utils.py
Outdated

```python
    size: int  # Size in bytes


def generate_tensor_blob(state_dict: Dict[str, Any]):
```
Can we use flatten_state_dict instead of making this recursive?
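For reference, a minimal sketch of how `flatten_state_dict` could replace the recursion. This assumes the helpers in torch's private `torch.distributed.checkpoint._nested_dict` module, so the import path and signatures should be verified against the torch version in use:

```python
# Sketch only: these helpers live in a private torch module, so treat the
# import path and signatures as assumptions to verify locally.
import torch
from torch.distributed.checkpoint._nested_dict import (
    flatten_state_dict,
    unflatten_state_dict,
)

nested = {"model": {"layer1": {"weight": torch.randn(2, 2)}}}

# flatten_state_dict returns a flat {key: tensor} dict plus a mapping that
# records each flat key's original nested path.
flat, mapping = flatten_state_dict(nested)
for key, value in flat.items():  # plain iteration, no recursion needed
    print(key, value.shape)

# The mapping restores the original nesting afterwards.
restored = unflatten_state_dict(flat, mapping)
```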
What do you think about making this a class method of a "TorchStoreStateDict", or similar?
Then we can do things like:

```python
torchstore_sd = TorchStoreStateDict.from_state_dict(original_state_dict)
torchstore_sd.to_state_dict()
```

and also store any necessary data as objects in the state dict.
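Roughly this shape, as a sketch (fields and method bodies here are illustrative, not the final API):

```python
from dataclasses import dataclass, field
from typing import Any, Dict

import torch


@dataclass
class TorchStoreStateDict:
    """Sketch: a state dict whose tensors are packed into a single byte blob."""

    # Flat {key: TensorReference-or-plain-value} mapping.
    flattened_state_dict: Dict[str, Any] = field(default_factory=dict)
    # All tensor bytes, concatenated; TensorReferences index into this.
    tensor_blob: torch.Tensor = field(
        default_factory=lambda: torch.empty(0, dtype=torch.uint8)
    )

    @classmethod
    def from_state_dict(cls, state_dict: Dict[str, Any]) -> "TorchStoreStateDict":
        """Flatten the dict, swap tensors for references, pack bytes."""
        ...

    def to_state_dict(self) -> Dict[str, Any]:
        """Reverse: rebuild tensors from the blob and restore nesting."""
        ...
```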
Done.
torchstore/state_dict_utils.py
Outdated

```python
        return modified_state_dict, torch.empty(0, dtype=torch.uint8)

    # Calculate total size and update offsets
    current_offset = 0
```
I think we have a `_state_dict_size` function in state dict utils.
`_state_dict_size` calculates an approximate size; it returns `size << 20`.
I believe the largest deltas on this PR are:
- Implementing as a class
- Managing DTensor

My recommendation for DTensor is to first convert it to a tensor slice, and store all additional metadata in the state dict. (This is actually my advice for dealing with DTensor in general, so we can reduce the number of branches in the codebase.)
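For illustration, a rough sketch of that conversion using the private `_compute_local_shape_and_global_offset` helper this PR already imports (its exact signature varies across torch versions, so verify it locally; the metadata keys here are made up):

```python
from typing import Any, Dict, Tuple

import torch
from torch.distributed.tensor import DTensor
# Private helper; signature assumed from recent torch versions.
from torch.distributed.tensor._utils import _compute_local_shape_and_global_offset


def dtensor_to_local_and_metadata(dtensor: DTensor) -> Tuple[torch.Tensor, Dict[str, Any]]:
    """Sketch: reduce a DTensor to a plain local tensor plus slice metadata."""
    mesh = dtensor.device_mesh
    local_shape, global_offset = _compute_local_shape_and_global_offset(
        dtensor.shape,          # global shape
        mesh.shape,             # mesh dimensions
        mesh.get_coordinate(),  # this rank's coordinate in the mesh
        dtensor.placements,
    )
    metadata = {
        "global_shape": tuple(dtensor.shape),
        "local_shape": tuple(local_shape),
        "global_offset": tuple(global_offset),
    }
    # The local tensor is an ordinary torch.Tensor, so downstream code can
    # stay DTensor-free; the metadata travels in the state dict.
    return dtensor.to_local(), metadata
```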
Updated to a class representation using the flattened state dict, which makes a lot of sense because list iteration is much simpler than recursion. Also added DTensor support.
Haven't gone through the code but have a general question in mind.
Hi @casteryh, for getting a DTensor with a different sharding plan, the current interface in torchstore is to specify the desired sharding plan via an in-place tensor. Right now in this PR, it only supports …
Overall LGTM.
Still have some questions:

Currently, this is not integrated with torchstore.put / torchstore.get, right? For example, suppose I have a state dict `sd = {"a": t}` where `t` is a DTensor sharded across two ranks.

- On each rank, if I do `ts_sd = TorchStoreStateDict.from_state_dict(sd)`, then `ts_sd` will no longer contain DTensors, right?
- Consequently, if I do a `ts.put("state_dict_key", ts_sd)` on both ranks, then torchstore is supposed to detect that `ts_sd` is a `TorchStoreStateDict` and handle the sharding logic accordingly, right? <- My understanding is that this part is not done yet.
tests/test_state_dict.py
Outdated

```python
    assert torchstore_state_dict.flattened_state_dict == {}
    assert len(torchstore_state_dict.tensor_blob) == 0
```
this seems to be testing implementation details as opposed to behaviors.
Good point. Removed.
tests/test_state_dict.py
Outdated

```python
    assert len(torchstore_state_dict.tensor_blob) == 0

    reconstructed = torchstore_state_dict.to_state_dict()
```
same here
Removed.
tests/test_state_dict.py
Outdated

```python
    scalar_dict = {"scalar": torch.tensor(3.14159)}
    torchstore_state_dict = TorchStoreStateDict.from_state_dict(scalar_dict)

    # Check flattened state dict has TensorReference
    scalar_ref = torchstore_state_dict.flattened_state_dict["scalar"]
```
same
Done.
```python
    # Create DTensor from local tensor
    local_tensor = torch.randn(4, 6, dtype=torch.float32)
    dtensor = DTensor.from_local(local_tensor, device_mesh, [Replicate()])
```
Can you add a test for a sharded DTensor (with world size > 1)? I'm actually also confused about the expected behavior in this case.
That DTensor put-then-get functionality will be added in the next PR, where we integrate the state_dict functionality into torchstore. This PR only does the serialization and deserialization part.
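For the follow-up, a hypothetical sketch of the shape such a sharded test could take (it assumes the multi-process harness the other DTensor tests use, and that `TorchStoreStateDict` lives in `torchstore.state_dict_utils`; the final expectation is exactly the open question above, not settled behavior):

```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import DTensor, Shard

from torchstore.state_dict_utils import TorchStoreStateDict  # assumed location


def _test_sharded_dtensor_roundtrip():
    # Runs inside an initialized 2-rank process group (e.g. gloo on CPU).
    device_mesh = init_device_mesh("cpu", (2,))
    local_tensor = torch.randn(4, 6, dtype=torch.float32)
    # Shard dim 0: each rank holds one 4x6 slice of the 8x6 global tensor.
    dtensor = DTensor.from_local(local_tensor, device_mesh, [Shard(0)])

    ts_sd = TorchStoreStateDict.from_state_dict({"weight": dtensor})
    reconstructed = ts_sd.to_state_dict()

    # Open question: what should reconstructed["weight"] be on each rank —
    # a DTensor again, or a local tensor plus slice metadata? At minimum,
    # the local shard's bytes must survive the blob round trip.
```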
```python
from torch.distributed.tensor._utils import _compute_local_shape_and_global_offset


def create_tensor_slice_from_dtensor(dtensor: DTensor) -> "TensorSlice":
```
Suggested change:

```diff
-def create_tensor_slice_from_dtensor(dtensor: DTensor) -> "TensorSlice":
+from torchstore.transport.pipe import TensorSlice
+
+def create_tensor_slice_from_dtensor(dtensor: DTensor) -> TensorSlice:
```
See the response to the other comment about import ordering.
```python
    Returns:
        TensorSlice containing the distributed tensor metadata
    """
    from torchstore.transport.pipe import TensorSlice
```
Is there a particular reason to avoid importing this at the file level?
If not, move the import to the top of the file.
So there's a circular dependency:

```
pipe.TensorSlice
      ^
      |
dtensor_util.create_tensor_slice_from_dtensor
      ^
      |
pipe.Request.from_dtensor
```

Maybe we should move the `TensorSlice` definition into the `dtensor_util.py` module?
Would `from __future__ import annotations` fix this?
If not, then just leave it as is.
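For reference, `from __future__ import annotations` only defers evaluation of the annotation; constructing a `TensorSlice` at runtime still needs an import. The usual pattern pairs it with `TYPE_CHECKING`. A sketch of what `dtensor_util.py` could look like under that pattern:

```python
# dtensor_util.py (sketch): break the import cycle by importing TensorSlice
# only for type checking; with postponed evaluation, annotations stay strings.
from __future__ import annotations

from typing import TYPE_CHECKING

from torch.distributed.tensor import DTensor

if TYPE_CHECKING:
    # Never executed at runtime, so pipe.py can import this module freely.
    from torchstore.transport.pipe import TensorSlice


def create_tensor_slice_from_dtensor(dtensor: DTensor) -> TensorSlice:
    # Constructing the object still needs a runtime import, deferred to
    # call time so module import order no longer matters.
    from torchstore.transport.pipe import TensorSlice

    ...
```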
Synced with Yuxuan and Lucas. We will do DTensor put and get in the next PR. This PR only makes sure that DTensor can be serialized and deserialized properly.
Maybe try `from __future__ import annotations`.
If it doesn't work, then don't bother.
This PR creates two functions for state_dict serialization and deserialization.

`generate_tensor_blob` recursively looks for tensors in the state_dict and serializes them into a single tensor blob, replacing each tensor in the state_dict with a `TensorReference`. A `TensorReference` holds the metadata (offset, shape, and dtype) of the original tensor.

`reconstruct_state_dict_from_tensor_blob` does the reverse operation of `generate_tensor_blob`: it takes the tensor blob and a state_dict containing only tensor references, and replaces each `TensorReference` in the state_dict with the tensor reconstructed from the blob and its metadata.
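As a concrete illustration of the scheme described above, a minimal sketch of the two functions for a flat state dict of dense tensors (the real implementation also handles nesting and DTensor; names mirror the PR, bodies are illustrative):

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple

import torch


@dataclass
class TensorReference:
    offset: int         # byte offset into the blob
    size: int           # size in bytes
    shape: torch.Size   # original tensor shape
    dtype: torch.dtype  # original tensor dtype


def generate_tensor_blob(state_dict: Dict[str, Any]) -> Tuple[Dict[str, Any], torch.Tensor]:
    """Replace tensors with TensorReferences and pack their bytes into one blob."""
    refs: Dict[str, Any] = {}
    chunks = []
    offset = 0
    for key, value in state_dict.items():  # flat dict for simplicity
        if isinstance(value, torch.Tensor):
            # Reinterpret the tensor's bytes as a 1-D uint8 tensor.
            data = value.detach().contiguous().reshape(-1).view(torch.uint8)
            refs[key] = TensorReference(offset, data.numel(), value.shape, value.dtype)
            chunks.append(data)
            offset += data.numel()
        else:
            refs[key] = value  # non-tensor values pass through unchanged
    blob = torch.cat(chunks) if chunks else torch.empty(0, dtype=torch.uint8)
    return refs, blob


def reconstruct_state_dict_from_tensor_blob(
    blob: torch.Tensor, state_dict: Dict[str, Any]
) -> Dict[str, Any]:
    """Reverse operation: swap each TensorReference back for a real tensor."""
    out: Dict[str, Any] = {}
    for key, value in state_dict.items():
        if isinstance(value, TensorReference):
            # Clone so the byte view starts at storage offset 0 (alignment).
            raw = blob[value.offset : value.offset + value.size].clone()
            out[key] = raw.view(value.dtype).reshape(value.shape)
        else:
            out[key] = value
    return out


# Round trip: the reconstructed tensors match the originals byte for byte.
sd = {"w": torch.randn(4, 6), "scalar": torch.tensor(3.14159), "step": 3}
refs, blob = generate_tensor_blob(sd)
restored = reconstruct_state_dict_from_tensor_blob(blob, refs)
assert torch.equal(restored["w"], sd["w"])
assert torch.equal(restored["scalar"], sd["scalar"])
```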