Skip to content

Conversation

@d-v-b
Copy link
Contributor

@d-v-b d-v-b commented May 26, 2024

Some functions that return dict[str, JSON] were mistakenly annotated as returning JSON.

One wrinkle to this PR is the BatchedCodecPipeline class, where from_dict takes a list, and to_dict returns a list. In this PR, I annotated those methods as if they worked with dicts, which doesn't match the current runtime behavior. I think either the method name or the return type should change here, but I'm not sure which one. @normanrz any ideas?

@normanrz
Copy link
Member

Some functions that return dict[str, JSON] were mistakenly annotated as returning JSON.

dict[str, JSON] is part of JSON so the annotations were not incorrect.

We could also rename these functions to to_json and from_json or serialize and deserialize.

@d-v-b
Copy link
Contributor Author

d-v-b commented May 26, 2024

It's true that JSON includes dict[str, JSON], but the functions previously annotated as returning JSON specifically return dict[str, JSON], and not any of the other members of the JSON union type; accordingly, I think we should use the narrower type hint. The batched codec to_dict method is returning a list, but I think we should change that behavior, and return a dict isomorphic to the batched codec data structure.

We could also rename these functions to to_json and from_json or serialize and deserialize.

at least to me, to_json would imply that the output is a JSON string, which is definitely useful but a rather different thing than a python dict, and I think there's enough utility for the dict representation that we should ensure that this path is as simple as possible. serialize and deserialize could also work as names, provided the return type is a dict

@normanrz
Copy link
Member

The batched codec to_dict method is returning a list, but I think we should change that behavior, and return a dict isomorphic to the batched codec data structure.

Not sure I understand what you mean with isomorphic. In the end a list needs to be put in the zarr.json file and that is what BatchedCodecPipeline.to_dict returns.

@d-v-b
Copy link
Contributor Author

d-v-b commented May 26, 2024

By "isomorphic" I mean "has the same structure". Typically, the classes that inherit from Metadata are basically sets of JSON serializable properties, and to_dict spits out a dict with the same structure. In the case of BatchedCodecPipeline, since it is defined like this:

@dataclass(frozen=True)
class BatchedCodecPipeline(CodecPipeline):
    array_array_codecs: tuple[ArrayArrayCodec, ...]
    array_bytes_codec: ArrayBytesCodec
    bytes_bytes_codecs: tuple[BytesBytesCodec, ...]
    batch_size: int

I would expect BatchedCodecPipeline.to_dict() to produce something like

{"array_array_codecs": [...], "array_bytes_codec": {...}, "bytes_bytes_codecs": {...}, ...} 

I think if cls.to_dict is a) not producing a dictionary, and b) not producing something isomorphic to an instance of cls, then a different method should be used, e.g. to_list

@d-v-b d-v-b marked this pull request as ready for review May 26, 2024 19:18
@normanrz
Copy link
Member

I would expect BatchedCodecPipeline.to_dict() to produce something like

{"array_array_codecs": [...], "array_bytes_codec": {...}, "bytes_bytes_codecs": {...}, ...} 

I think if cls.to_dict is a) not producing a dictionary, and b) not producing something isomorphic to an instance of cls, then a different method should be used, e.g. to_list

Got it. I don't think there is any use in producing an output like in your JSON snippet. I wouldn't mind renaming the methods to to_list and from_list. Would it still make sense to inherit from Metadata?

@d-v-b
Copy link
Contributor Author

d-v-b commented May 26, 2024

I would expect BatchedCodecPipeline.to_dict() to produce something like

{"array_array_codecs": [...], "array_bytes_codec": {...}, "bytes_bytes_codecs": {...}, ...} 

I think if cls.to_dict is a) not producing a dictionary, and b) not producing something isomorphic to an instance of cls, then a different method should be used, e.g. to_list

Got it. I don't think there is any use in producing an output like in your JSON snippet. I wouldn't mind renaming the methods to to_list and from_list. Would it still make sense to inherit from Metadata?

It might be fine to not inherit from Metadata here? I not too familiar with how the class in question gets used, but I feel like for codecs as long as we have 1 place where the codecs are collectively validated, after that we don't need more validation and they can be passed around internally with whatever data makes the most sense for zarr-python without worrying about how it will get serialized to / from JSON (which is Metadata's job, atm). So yeah, I would see how it feels to not inherit from Metadata and just add exactly what the class needs to do its job.

@jhamman jhamman added the V3 label Jul 1, 2024
@jhamman jhamman added this to the After 3.0.0 milestone Aug 15, 2024
@jhamman jhamman changed the base branch from v3 to main October 14, 2024 20:56
@dstansby dstansby removed the V3 label Dec 12, 2024
@dstansby dstansby added the needs release notes Automatically applied to PRs which haven't added release notes label Jan 9, 2025
@dstansby dstansby removed this from the After 3.0.0 milestone May 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs release notes Automatically applied to PRs which haven't added release notes

Projects

Status: In review

Development

Successfully merging this pull request may close these issues.

4 participants