chore(decoder): clean decoders and make csvdecoder available #326
📝 Walkthrough
This PR refactors the decoding and parsing architecture. It removes several deprecated decoders and parsers (e.g., GzipJsonDecoder, JsonParser, JsonLineParser, CsvParser) and introduces a unified approach with a new GzipDecoder and a renamed CsvDecoder. The CompositeRawDecoder now supports configurable streaming.
Sequence Diagram(s)
sequenceDiagram
    participant C as Caller
    participant CRD as CompositeRawDecoder
    participant P as Parser
    C->>CRD: decode(response)
    alt stream_response is True
       CRD->>P: parse(response.raw)
    else stream_response is False
       CRD->>CRD: wrap response.content in BytesIO
       CRD->>P: parse(wrapped content)
    end
    P-->>CRD: return parsed data
    CRD-->>C: yield decoded data
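The flow in the diagram can be sketched roughly as follows. This is a minimal illustration, not the CDK implementation: `FakeResponse`, `JsonLineParserSketch`, and `CompositeRawDecoderSketch` are hypothetical stand-ins, and `FakeResponse` only mimics the `raw` and `content` attributes of a `requests.Response`.

```python
import io
import json
from typing import Any, BinaryIO, Dict, Iterator


class FakeResponse:
    """Hypothetical stand-in exposing only `raw` and `content`."""

    def __init__(self, body: bytes) -> None:
        self.content = body          # full body held in memory
        self.raw = io.BytesIO(body)  # one-shot stream over the same bytes


class JsonLineParserSketch:
    """Hypothetical parser: one JSON document per non-empty line."""

    def parse(self, data: BinaryIO) -> Iterator[Dict[str, Any]]:
        for line in data:
            if line.strip():
                yield json.loads(line)


class CompositeRawDecoderSketch:
    """Chooses between the streaming and in-memory branches of the diagram."""

    def __init__(self, parser: JsonLineParserSketch, stream_response: bool = True) -> None:
        self.parser = parser
        self.stream_response = stream_response

    def decode(self, response: FakeResponse) -> Iterator[Dict[str, Any]]:
        if self.stream_response:
            # Streaming branch: hand the raw stream straight to the parser.
            yield from self.parser.parse(response.raw)
        else:
            # Non-streaming branch: wrap the in-memory body in a fresh BytesIO.
            yield from self.parser.parse(io.BytesIO(response.content))
```

In this sketch the non-streaming branch re-wraps the body on every call, so the same response can be decoded repeatedly, while the streaming branch consumes the single underlying stream on the first pass. A real streamed response may raise on a second read rather than yield nothing.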
Actionable comments posted: 0
🧹 Nitpick comments (15)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4)
2035-2036: Consider leveraging the model parameters or removing the unused argument.
Right now, this method always returns a new JsonDecoder with empty parameters, ignoring the passed-in model. Would it make sense to incorporate model parameters or drop the unused argument to avoid confusion, wdyt?
2039-2040: Make stream_response configurable or confirm it’s always false.
Here, you set stream_response=False for CSV. Are you certain that no streaming scenario is needed for CSV data, or would making it configurable benefit some use cases, wdyt?
2061-2061: Check for ZipfileDecoder parameters.
Currently, the created ZipfileDecoder ignores additional parameters in model.decoder or model.parameters. Do you want to forward them to the parser, or is this intentional, wdyt?
2064-2077: Consider exposing parameter checks & fallback for decoders.
- The _get_parser method doesn't incorporate model.parameters. If additional settings (like encoding) are required, you might unify that logic here.
- There's a potential for infinitely nested GzipParser if the user misconfigures the inner_decoder repeatedly. A recursion limit or check might help.
Wdyt about adding these safeguards?
airbyte_cdk/sources/declarative/decoders/json_decoder.py (2)
24-25: Consider making 'stream_response' a parameter.
It's currently hardcoded to False. Would you like to introduce a parameter to toggle streaming for future flexibility, wdyt?
36-41: Catching broad exceptions.
Catching Exception might mask unexpected errors. Would you like to handle a more specific exception type, wdyt?
unit_tests/sources/declarative/decoders/test_json_decoder.py (2)
11-13: Great alignment with the new composite decoders!
This import approach looks consistent. Would you consider adding more test coverage to verify interplay between CompositeRawDecoder and JsonDecoder, wdyt?
44-45: Testing partial streaming scenarios?
We now set stream=True. Would you like to add tests confirming that partial lines or chunked responses are handled gracefully, wdyt?
unit_tests/sources/declarative/auth/test_token_provider.py (1)
58-60: Testing updated token response.
This properly simulates a new token. Maybe we could also test invalid JSON scenarios to ensure robustness, wdyt?
unit_tests/sources/declarative/extractors/test_dpath_extractor.py (1)
24-24: Consider adding a comment explaining the stream_response flag.
The initialization looks good, but since this is a test file, it might be helpful to add a comment explaining why stream_response=True is needed here, wdyt?
-decoder_jsonl = CompositeRawDecoder(parser=JsonLineParser(), stream_response=True)
+# stream_response=True is required for JSONL parsing to handle streaming responses correctly
+decoder_jsonl = CompositeRawDecoder(parser=JsonLineParser(), stream_response=True)
airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py (2)
142-145: Consider adding docstring for the decode method.
The implementation looks good, but since this is a significant change in behavior, would you consider adding a docstring explaining the difference between streaming and non-streaming modes, wdyt?
 def decode(
     self, response: requests.Response
 ) -> Generator[MutableMapping[str, Any], None, None]:
+    """Decode the response based on stream_response setting.
+
+    When stream_response is True:
+    - Uses response.raw for streaming parsing
+    - Suitable for large responses or JSONL format
+    When stream_response is False:
+    - Uses response.content with BytesIO
+    - Suitable for responses that need to be parsed multiple times
+    """
     if self.is_stream_response():
         yield from self.parser.parse(data=response.raw)  # type: ignore[arg-type]
     else:
         yield from self.parser.parse(data=io.BytesIO(response.content))
134-134: Nice addition of streaming control! Consider adding docstring?
The new stream_response flag and its implementation look good. Would you consider adding a docstring to explain when to use each mode? For example:
 stream_response: bool = True
+"""
+Controls how responses are processed:
+- True: Streams response.raw directly (memory efficient for large responses)
+- False: Loads response.content into memory (allows multiple iterations)
+"""
Also applies to: 136-137, 142-145
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)
1268-1272: Consider adding docstring for CsvDecoder.
The implementation looks good, but would you consider adding a docstring explaining the purpose and configuration options of the CSV decoder, wdyt?
 class CsvDecoder(BaseModel):
     type: Literal["CsvDecoder"]
+    """Decoder for CSV formatted data.
+
+    Attributes:
+        encoding: The character encoding to use (default: utf-8)
+        delimiter: The character used to separate fields (default: comma)
+    """
     encoding: Optional[str] = "utf-8"
     delimiter: Optional[str] = ","
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1)
3012-3025: CsvDecoder – Making CSV decoding available
Introducing the CsvDecoder with clear defaults (utf-8 encoding and a comma delimiter) is a clean and welcome addition. It looks like it accomplishes the PR objective to make CSV decoding available while cleaning up the decoders. Would you be open to adding some tests for different CSV configurations to ensure robustness? wdyt?
unit_tests/sources/declarative/decoders/test_composite_decoder.py (1)
203-213: Great test for stream consumption! Consider adding error message check?
The test for streamed response consumption looks good. Would you consider also asserting the specific error message to ensure the right error is being raised? Something like:
-    with pytest.raises(Exception):
+    with pytest.raises(Exception, match="Response body has already been consumed"):
         list(composite_raw_decoder.decode(response))
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- airbyte_cdk/sources/declarative/declarative_component_schema.yaml(3 hunks)
- airbyte_cdk/sources/declarative/decoders/__init__.py(0 hunks)
- airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py(2 hunks)
- airbyte_cdk/sources/declarative/decoders/json_decoder.py(1 hunks)
- airbyte_cdk/sources/declarative/models/declarative_component_schema.py(8 hunks)
- airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py(5 hunks)
- unit_tests/sources/declarative/auth/test_token_provider.py(3 hunks)
- unit_tests/sources/declarative/decoders/test_composite_decoder.py(1 hunks)
- unit_tests/sources/declarative/decoders/test_decoders_memory_usage.py(0 hunks)
- unit_tests/sources/declarative/decoders/test_json_decoder.py(2 hunks)
- unit_tests/sources/declarative/extractors/test_dpath_extractor.py(1 hunks)
💤 Files with no reviewable changes (2)
- airbyte_cdk/sources/declarative/decoders/__init__.py
- unit_tests/sources/declarative/decoders/test_decoders_memory_usage.py
🧰 Additional context used
🪛 GitHub Actions: Linters
unit_tests/sources/declarative/decoders/test_json_decoder.py
[warning] 1-1: Code would be reformatted to adhere to style guidelines.
unit_tests/sources/declarative/auth/test_token_provider.py
[warning] 1-1: Code would be reformatted to adhere to style guidelines.
unit_tests/sources/declarative/decoders/test_composite_decoder.py
[warning] 1-1: Code would be reformatted to adhere to style guidelines.
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py
[warning] 1-1: Code would be reformatted to adhere to style guidelines.
airbyte_cdk/sources/declarative/decoders/json_decoder.py
[warning] 1-1: Code would be reformatted to adhere to style guidelines.
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-the-guardian-api' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Analyze (python)
🔇 Additional comments (21)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (3)
152-152: Looks good!
The added import for CsvDecoderModel synchronizes well with the rest of the codebase. No issues found, wdyt?
227-227: Nice addition for GzipDecoderModel.
This import appears consistent with your usage in _get_parser. No concerns, wdyt?
2045-2046: Validate streaming approach for JSONL.
Here, the method sets stream_response=True for JSONL. This is likely correct given JSON lines are commonly processed in a streaming manner. Have you tested large JSONL data with this approach, wdyt?
airbyte_cdk/sources/declarative/decoders/json_decoder.py (3)
13-13: Thank you for adopting the composite approach.
This new import ensures we unify decoding logic with JsonParser. Would you like to verify usage in other parts of the codebase for consistency, wdyt?
28-28: Pass-through of 'is_stream_response' looks good.
No issues here!
44-44: Verify empty response behavior.
We yield an empty dict when nothing was decoded. Are we certain we want a single empty mapping rather than not yielding at all or returning an empty list, wdyt?
unit_tests/sources/declarative/auth/test_token_provider.py (2)
4-4: Importing 'json' is good.
This helps us easily create mock responses. No concerns here!
21-21: Switching to '.content' is more realistic.
Setting the token via encoded JSON simulates real response behavior. Would you like to confirm that bytes-to-JSON decoding logic is correctly handled in production, wdyt?
unit_tests/sources/declarative/extractors/test_dpath_extractor.py (2)
12-13: LGTM! Clean import changes.
The imports are correctly updated to use the new decoder architecture.
12-13: LGTM! Nice refactoring of the JsonlDecoder.
The change to use CompositeRawDecoder with JsonLineParser looks good and aligns with the decoder cleanup objectives. The test cases continue to pass with the new implementation.
Also applies to: 24-24
airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py (2)
134-134: LGTM! Good default value choice.
Setting stream_response=True as default maintains backward compatibility while allowing opt-out when needed.
3-3: LGTM! Added required import.
The addition of the io import is necessary for using BytesIO in the non-streaming mode.
unit_tests/sources/declarative/decoders/test_composite_decoder.py (3)
203-213: LGTM! Good test for consumed stream behavior.
The test verifies that attempting to decode an already consumed stream raises an exception, which is the expected behavior.
215-223: LGTM! Good test for non-streaming mode.
The test verifies that non-streaming mode allows multiple decodes of the same response.
215-223: LGTM! Good test for non-streamed mode.
The test effectively verifies that non-streamed responses can be decoded multiple times.
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (4)
1664-1666: LGTM! Good design for GzipDecoder.
The GzipDecoder with inner_decoder support allows for flexible composition of decoders.
1268-1272: LGTM! Clean CsvDecoder implementation.
The CsvDecoder class looks good with appropriate default values for encoding and delimiter.
1664-1666: LGTM! Nice GzipDecoder implementation.
The GzipDecoder class with inner_decoder support looks good and aligns with the decoder cleanup objectives.
1912-1914: LGTM! Simplified decoder options.
The update to SessionTokenAuthenticator's decoder field to only allow JsonDecoder and XmlDecoder makes sense.
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (2)
2141-2155: ZipfileDecoder – Updated property from "parser" to "decoder"
The new changes now require a "decoder" property (instead of the old "parser") and correctly reference the unified decoders (CsvDecoder, GzipDecoder, JsonDecoder, and JsonlDecoder). Would you consider renaming the inner title from "Parser" to "Decoder" for consistency with the rest of the schema? wdyt?
2997-3011: GzipDecoder – Flexible inner_decoder implementation
The new GzipDecoder correctly requires an inner_decoder that can be one of CsvDecoder, GzipDecoder, JsonDecoder, or JsonlDecoder, which aligns with the unified decoding strategy. Have you thought about potential recursion issues if a GzipDecoder is nested inside another? It might be useful to add a note or safeguard if deep-nested gzip decoding isn’t expected. wdyt?
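One way the nesting safeguard raised above could look is a depth check over the declarative config before any parser is built. This is only a sketch under stated assumptions: the config is treated as a plain nested mapping, `MAX_GZIP_DEPTH`, `gzip_nesting_depth`, and `validate_gzip_nesting` are hypothetical names, and the nested field is assumed to be called `inner_decoder` as in the comment above.

```python
from collections.abc import Mapping
from typing import Any

# Hypothetical limit; the PR does not define one.
MAX_GZIP_DEPTH = 3


def gzip_nesting_depth(decoder: Any) -> int:
    """Count how many GzipDecoder layers wrap the innermost decoder."""
    depth = 0
    current = decoder
    # Walk down the chain iteratively so a deep config cannot blow the stack.
    while isinstance(current, Mapping) and current.get("type") == "GzipDecoder":
        depth += 1
        current = current.get("inner_decoder")
    return depth


def validate_gzip_nesting(decoder: Any, max_depth: int = MAX_GZIP_DEPTH) -> None:
    """Raise ValueError when the config nests GzipDecoder more than max_depth times."""
    depth = gzip_nesting_depth(decoder)
    if depth > max_depth:
        raise ValueError(
            f"GzipDecoder is nested {depth} times, which exceeds the limit of {max_depth}"
        )
```

Because a manifest is a finite document, the nesting is always bounded in practice; a check like this mainly turns an accidental deep chain into an early, readable error.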
Actionable comments posted: 0
🧹 Nitpick comments (5)
airbyte_cdk/sources/declarative/decoders/json_decoder.py (2)
24-25: Consider documenting the stream_response=False choice?
The initialization looks good! Would it be helpful to add a comment explaining why stream_response=False is the default here? This could help future maintainers understand the design decision, wdyt?
30-45: Consider more specific error handling?
The code looks good overall! A few thoughts:
- The generic Exception catch might hide specific issues. Would it be helpful to catch and log specific exceptions like JSONDecodeError separately, wdyt?
- The empty dict fallback is a nice safety net, but should we log a warning when this happens to help with debugging?
 try:
     for element in self._decoder.decode(response):
         yield element
         has_yielded = True
-except Exception:
+except json.JSONDecodeError as e:
+    logger.warning(f"Failed to decode JSON response: {e}")
+    yield {}
+except Exception as e:
+    logger.warning(f"Unexpected error while decoding response: {e}")
     yield {}
unit_tests/sources/declarative/decoders/test_json_decoder.py (1)
44-48: Consider adding error case tests?
The happy path tests look good! Would it be valuable to add some error case tests, wdyt? For example:
- Malformed JSON lines
- Mixed valid/invalid JSON lines
- Empty lines between valid JSON
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)
2035-2048: LGTM! Consider adding docstrings for better maintainability?
The implementation looks good. The stream_response flag is correctly set based on the decoder type. Would you consider adding docstrings to explain the purpose and behavior of each decoder method? This could help future maintainers understand the differences between them, wdyt?
Example docstring for create_csv_decoder:
def create_csv_decoder(model: CsvDecoderModel, config: Config, **kwargs: Any) -> Decoder:
    """Creates a CSV decoder using CompositeRawDecoder with CsvParser.

    Args:
        model: The CSV decoder model containing encoding and delimiter settings.
        config: The connector configuration.
        **kwargs: Additional keyword arguments.

    Returns:
        A CompositeRawDecoder instance configured for CSV parsing.
    """
2066-2083: LGTM! Consider enhancing error messages?
The implementation is clean and handles all decoder types appropriately. Would you consider making the error messages more specific by including the list of supported decoders in the error message? This could help users quickly understand what decoders are available, wdyt?
Example enhanced error message:
-raise ValueError(f"Decoder type {model} does not have parser associated to it")
+raise ValueError(f"Decoder type {model} does not have parser associated to it. Supported decoders are: JsonDecoder, JsonlDecoder, CsvDecoder, and GzipDecoder")
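A possible variation on the suggestion above is to derive the list of supported decoders from a lookup table, so the message cannot drift when a decoder is added. The sketch below is hypothetical and does not mirror the factory's real `_get_parser`; `PARSERS`, `get_parser`, and the two parse functions are illustrative names only.

```python
import json
from typing import Any, BinaryIO, Callable, Dict, Iterator

ParseFn = Callable[[BinaryIO], Iterator[Dict[str, Any]]]


def _parse_json(data: BinaryIO) -> Iterator[Dict[str, Any]]:
    """Parse a whole JSON body; a top-level list yields one record per item."""
    body = json.loads(data.read())
    if isinstance(body, list):
        yield from body
    else:
        yield body


def _parse_jsonl(data: BinaryIO) -> Iterator[Dict[str, Any]]:
    """Parse one JSON document per non-empty line."""
    for line in data:
        if line.strip():
            yield json.loads(line)


# Hypothetical registry keyed by decoder type name.
PARSERS: Dict[str, ParseFn] = {
    "JsonDecoder": _parse_json,
    "JsonlDecoder": _parse_jsonl,
}


def get_parser(decoder_type: str) -> ParseFn:
    """Return the parser for a decoder type or explain which types are supported."""
    parser = PARSERS.get(decoder_type)
    if parser is None:
        supported = ", ".join(sorted(PARSERS))
        raise ValueError(
            f"Decoder type {decoder_type} does not have a parser associated to it. "
            f"Supported decoders are: {supported}"
        )
    return parser
```

The trade-off is that a table lookup only fits decoders whose parser needs no extra construction arguments; a decoder that wraps another one would still need dedicated handling.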
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- airbyte_cdk/sources/declarative/decoders/json_decoder.py(1 hunks)
- airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py(5 hunks)
- unit_tests/sources/declarative/auth/test_token_provider.py(3 hunks)
- unit_tests/sources/declarative/decoders/test_composite_decoder.py(1 hunks)
- unit_tests/sources/declarative/decoders/test_json_decoder.py(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- unit_tests/sources/declarative/decoders/test_composite_decoder.py
- unit_tests/sources/declarative/auth/test_token_provider.py
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-the-guardian-api' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Analyze (python)
🔇 Additional comments (5)
airbyte_cdk/sources/declarative/decoders/json_decoder.py (1)
13-13: LGTM! Nice simplification of the decoder structure.
The removal of the dataclass decorator and delegation to CompositeRawDecoder makes the code more maintainable and follows the composition over inheritance principle.
Also applies to: 19-22
unit_tests/sources/declarative/decoders/test_json_decoder.py (1)
11-13: LGTM! Clean import updates.
The imports are properly aligned with the new decoder structure.
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (3)
152-152: LGTM! Clean addition of CsvDecoderModel.
The import and mapping follow the established pattern in the codebase.
Also applies to: 521-521
227-227: LGTM! Clean addition of GzipDecoderModel.
The import follows the established pattern in the codebase.
2063-2063: LGTM! Clean refactor of create_zipfile_decoder.
The change nicely leverages the new _get_parser method, making the code more maintainable and consistent.
I haven't looked into what that would mean for our connectors and published manifests. I think the only concern is around GzipJsonDecoder (not used?) and CompositeRawDecoder, which I believe is used in a few spots, but it's a very simple manifest change to update, right?
        
          
Adding comments on the code review
        
          
Actionable comments posted: 0
🧹 Nitpick comments (4)
airbyte_cdk/sources/declarative/decoders/json_decoder.py (1)
43-44: Would you consider being more specific with error handling? 🤔
Currently, we're catching all exceptions and returning an empty dict. Maybe we could catch specific exceptions (e.g., orjson.JSONDecodeError) to avoid masking unexpected errors, wdyt?
-except Exception:
+except (orjson.JSONDecodeError, UnicodeDecodeError) as e:
+    logger.debug(f"Failed to decode response: {e}")
     yield {}
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)
1268-1272: Should we add validation for encoding and delimiter fields?
The CsvDecoder class looks good, but we could enhance it by adding:
- Field descriptions and examples
- Validation for supported encodings
- Common delimiter options
What do you think about adding these improvements? They would make the schema more user-friendly and help prevent configuration errors. wdyt?
 class CsvDecoder(BaseModel):
     type: Literal["CsvDecoder"]
-    encoding: Optional[str] = "utf-8"
-    delimiter: Optional[str] = ","
+    encoding: Optional[str] = Field(
+        "utf-8",
+        description="Character encoding to use when reading CSV files.",
+        examples=["utf-8", "ascii", "iso-8859-1"],
+        title="Character Encoding",
+    )
+    delimiter: Optional[str] = Field(
+        ",",
+        description="Character used to separate fields in the CSV file.",
+        examples=[",", ";", "\t"],
+        title="Field Delimiter",
+    )
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)
2066-2083: Consider enhancing error messages and documentation?
The parser selection logic is well-structured, but what do you think about these potential improvements? wdyt?
- The error message could be more specific about which decoder types are supported:
    -raise ValueError(f"Decoder type {model} does not have parser associated to it")
    +raise ValueError(f"Decoder type {model} does not support parsing. Supported decoders: JsonDecoder, JsonlDecoder, CsvDecoder, GzipDecoder")
- The comment about JsonDecoder logic could be expanded to explain the specific error cases:
    -# Note that the logic is a bit different from the JsonDecoder as there is some legacy that is maintained to return {} on error cases
    +# Note: JsonParser differs from JsonDecoder in error handling:
    +# - JsonParser returns {} on parsing errors to maintain backward compatibility
    +# - JsonDecoder raises exceptions for better error visibility
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1)
2142-2155: ZipfileDecoder Update – New "decoder" Field and References
The ZipfileDecoder component now requires a "decoder" field (instead of the old "parser") and its properties section has been updated accordingly. The "anyOf" list now includes references to CsvDecoder, GzipDecoder, JsonDecoder, and JsonlDecoder. Could you please confirm that including all these decoders (especially the inclusion of GzipDecoder within ZipfileDecoder) is intentional for handling decompressed zipfile data? wdyt?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- airbyte_cdk/sources/declarative/declarative_component_schema.yaml(3 hunks)
- airbyte_cdk/sources/declarative/decoders/json_decoder.py(1 hunks)
- airbyte_cdk/sources/declarative/models/declarative_component_schema.py(8 hunks)
- airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py(5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-the-guardian-api' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Analyze (python)
🔇 Additional comments (12)
airbyte_cdk/sources/declarative/decoders/json_decoder.py (3)
22-23: Great job documenting the historical context! 🎉
The documentation clearly explains the rationale behind using JsonDecoder instead of CompositeRawDecoder, which will be super helpful for future maintainers.
26-27: Nice refactor using composition! 👍
The initialization is clean and follows the Single Responsibility Principle by delegating to CompositeRawDecoder.
38-47: Love the robust implementation! ✨
The has_yielded flag ensures we maintain the contract of always yielding at least one item, even when the decoder returns nothing. This is a great defensive programming practice!
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (4)
1664-1666: LGTM! The GzipDecoder implementation looks good.
The recursive decoder pattern allows for flexible handling of nested formats, and the naming is consistent with other decoders.
1704-1708: The field name change from parser to decoder looks good!
This change aligns with the previous review comment about naming consistency between GzipDecoder.inner_decoder and ZipfileDecoder.decoder.
1912-1914: LGTM! The SessionTokenAuthenticator decoder field update is correct.
The change simplifies the decoder options to just JsonDecoder and XmlDecoder, which makes sense for session token responses.
2109-2123: LGTM! The decoder field updates in SimpleRetriever and AsyncRetriever are consistent.
The changes consistently use the new CsvDecoder across both retrievers, maintaining uniformity in the codebase.
Also applies to: 2186-2215
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (3)
152-152: LGTM! Clean addition of CsvDecoderModel.
The import and constructor mapping follow the established pattern in the codebase.
Also applies to: 521-521
227-227: LGTM! Clean addition of GzipDecoderModel.
The import follows the established pattern in the codebase.
2035-2048: LGTM! Clean implementation of decoder creation methods.
The methods follow a consistent pattern using CompositeRawDecoder with appropriate parsers. Nice job on keeping the implementations concise and similar in structure.
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (2)
2997-3011: GzipDecoder Enhancements – Recursive Decoder Reference Check
The new GzipDecoder now requires a "decoder" field and its "anyOf" list includes references to CsvDecoder, GzipDecoder, JsonDecoder, and JsonlDecoder. I noticed that GzipDecoder is allowed to reference itself, which could enable chaining of gzip decoders. Is this recursive configuration intentional and aligned with your design objectives? wdyt?
3012-3025: CsvDecoder Definition Verification
The CsvDecoder component now explicitly sets its type with enum [CsvDecoder] and provides default values for encoding and delimiter. This looks clear and consistent with the unified decoder design. Does this meet your requirements for CSV parsing? wdyt?
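To make the two CsvDecoder options concrete, here is a minimal sketch of how `encoding` and `delimiter` could be applied to a binary response stream. `parse_csv` is a hypothetical helper built only on the standard library; it is not the CDK's CsvParser.

```python
import csv
import io
from typing import BinaryIO, Dict, Iterator


def parse_csv(
    data: BinaryIO, encoding: str = "utf-8", delimiter: str = ","
) -> Iterator[Dict[str, str]]:
    """Yield one dict per CSV row, keyed by the header line."""
    # Decode bytes lazily; newline="" lets the csv module handle line endings itself.
    text = io.TextIOWrapper(data, encoding=encoding, newline="")
    yield from csv.DictReader(text, delimiter=delimiter)
```

For example, a semicolon-separated body encoded as iso-8859-1 would be read with `parse_csv(stream, encoding="iso-8859-1", delimiter=";")`, while the defaults match the schema defaults of utf-8 and a comma.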
Approved -- just want to confirm if we should add GzipDecoder as a "top level" decoder in the SimpleRetriever and AsyncRetriever.
Actionable comments posted: 0
🧹 Nitpick comments (4)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (1)
1664-1666: Consider making the decoder field name consistent with other decoders, wdyt?
Based on the past review comments about naming consistency, should we rename decoder to inner_decoder to match other similar components? This would help maintain a consistent naming convention across the codebase.
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (3)
2141-2155: Refactor in ZipfileDecoder’s "decoder" Field
It looks like you’ve replaced the old “parser” property with a unified “decoder” property in the ZipfileDecoder definition. This change now allows the use of CsvDecoder, GzipDecoder, JsonDecoder, and JsonlDecoder. Would you consider adding some inline examples or extra documentation here to clarify for users how to leverage these options? wdyt?
2997-3012: Addition of the GzipDecoder Component
You’ve introduced a new GzipDecoder that requires both a “type” and a nested “decoder” field to define how to interpret the decompressed data. I did notice that within its “decoder” field, you also allow a reference to a GzipDecoder via "#/definitions/GzipDecoder." Is this recursive configuration intentional (for chained decompression) or might it lead to unexpected recursion? Perhaps a comment or guard would help clarify its intended use. wdyt?
3013-3026: Introduction of the CsvDecoder Component
The CsvDecoder definition is straightforward, with reasonable defaults for encoding ("utf-8") and delimiter (","). Would you consider including one or two usage examples (or references to documentation) directly within the schema to help users understand how to correctly configure CSV decoding in practice? wdyt?
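As one possible illustration of how a gzip layer and a nested decoder compose, the sketch below decompresses a stream and hands the decompressed bytes to an inner parse function. Both `parse_gzip` and `parse_csv_bytes` are hypothetical helpers written against the standard library, and the sketch assumes the inner parser is responsible for text decoding.

```python
import csv
import gzip
import io
from typing import Any, BinaryIO, Callable, Dict, Iterator

ParseFn = Callable[[BinaryIO], Iterator[Dict[str, Any]]]


def parse_csv_bytes(data: BinaryIO) -> Iterator[Dict[str, Any]]:
    """Inner parser: decode bytes as utf-8 and read comma-separated rows."""
    text = io.TextIOWrapper(data, encoding="utf-8", newline="")
    yield from csv.DictReader(text)


def parse_gzip(data: BinaryIO, inner_parse: ParseFn) -> Iterator[Dict[str, Any]]:
    """Outer layer: decompress lazily, then delegate to the inner parser."""
    with gzip.GzipFile(fileobj=data, mode="rb") as decompressed:
        yield from inner_parse(decompressed)
```

Passing a function that itself calls `parse_gzip` as `inner_parse` would handle doubly compressed data, which is the chained case the recursive schema reference permits.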
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- airbyte_cdk/sources/declarative/declarative_component_schema.yaml(3 hunks)
- airbyte_cdk/sources/declarative/models/declarative_component_schema.py(8 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-the-guardian-api' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Analyze (python)
🔇 Additional comments (5)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (5)
1268-1272: LGTM! The CsvDecoder class looks well-defined.
The class has sensible defaults for encoding (utf-8) and delimiter (,).
1704-1708: LGTM! The ZipfileDecoder's decoder field update looks good.
The field has been updated to use the new decoder types (CsvDecoder, GzipDecoder, JsonDecoder, JsonlDecoder) consistently.
1912-1914: LGTM! The SessionTokenAuthenticator's decoder field update is correct.
The field has been correctly restricted to only JsonDecoder and XmlDecoder, which aligns with the typical response formats for session token authentication.
2109-2124: LGTM! The SimpleRetriever's decoder field update is comprehensive.
The field now includes all available decoders (CsvDecoder, GzipDecoder, JsonDecoder, etc.) with proper documentation.
2187-2218: LGTM! The AsyncRetriever's decoder field updates are thorough.
Both decoder and download_decoder fields have been updated consistently to include all available decoders.
Actionable comments posted: 0
🧹 Nitpick comments (1)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)
2037-2056: Consider extracting common decoder creation logic? The create_csv_decoder, create_jsonl_decoder, and create_gzip_decoder methods follow a similar pattern of creating a CompositeRawDecoder with a parser. What do you think about extracting this common logic into a private helper method to reduce code duplication? Something like:

    @staticmethod
    def _create_composite_decoder(model: BaseModel, config: Config, stream_response: bool) -> Decoder:
        return CompositeRawDecoder(
            parser=ModelToComponentFactory._get_parser(model, config),
            stream_response=stream_response
        )

This would make the code more DRY and easier to maintain, wdyt?
📒 Files selected for processing (1)
- airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py(7 hunks)
🔇 Additional comments (3)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (3)
105-105: LGTM! The GzipDecoder import is correctly placed in alphabetical order.
522-522: LGTM! The new decoder models are correctly mapped to their creation methods in the PYDANTIC_MODEL_TO_CONSTRUCTOR dictionary.
Also applies to: 552-552
2074-2092: Verify encoding handling in GzipParser with inner parsers
Based on a past review comment, there was an issue where GzipDecoder passes bytes to the inner_parser, which caused problems with non-standard (utf) encoding. Could you verify that this is now handled correctly, especially for cases like:

    GzipDecoder(decoder=JsonDecoder())  # GzipParser passes bytes to JsonParser
    GzipDecoder(decoder=CsvDecoder(encoding='utf-16'))  # Non-standard encoding
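One way to reason about the two cases above is the following stdlib-only sketch. It assumes the gzip layer hands a decompressed byte stream to the inner parser, and that the inner parser alone is responsible for applying the text encoding. All function names here are hypothetical stand-ins, not the CDK's parser classes.

```python
import csv
import gzip
import io
import json
from typing import Any, Callable, Dict, Iterator

# A parser takes a readable byte stream and yields records.
Parser = Callable[[io.BufferedIOBase], Iterator[Dict[str, Any]]]


def json_parser(data: io.BufferedIOBase) -> Iterator[Dict[str, Any]]:
    # json.loads accepts bytes and detects UTF-8/16/32 on its own.
    yield json.loads(data.read())


def csv_parser(encoding: str = "utf-8", delimiter: str = ",") -> Parser:
    def parse(data: io.BufferedIOBase) -> Iterator[Dict[str, Any]]:
        # The encoding is applied here, after decompression, not by the gzip layer.
        text = io.TextIOWrapper(data, encoding=encoding, newline="")
        yield from csv.DictReader(text, delimiter=delimiter)

    return parse


def gzip_parser(inner: Parser) -> Parser:
    def parse(data: io.BufferedIOBase) -> Iterator[Dict[str, Any]]:
        # GzipFile exposes the decompressed payload as a byte stream for the inner parser.
        with gzip.GzipFile(fileobj=data, mode="rb") as decompressed:
            yield from inner(decompressed)

    return parse


json_records = list(gzip_parser(json_parser)(io.BytesIO(gzip.compress(b'{"id": 1}'))))
csv_payload = gzip.compress("id,name\n1,alice\n".encode("utf-16"))
csv_records = list(gzip_parser(csv_parser(encoding="utf-16"))(io.BytesIO(csv_payload)))
```

Under that assumption, a UTF-16 CSV inside a gzip archive decodes correctly because no layer converts bytes to text before the CSV parser does.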
Actionable comments posted: 0
🧹 Nitpick comments (1)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (1)
2041-2044: Consider extracting common decoder creation logic? The CSV, JSONL, and GZIP decoder implementations share the same pattern of creating a CompositeRawDecoder with stream_response=True. Would extracting this into a helper method make sense to reduce duplication, wdyt?

+    @staticmethod
+    def _create_composite_decoder(model: BaseModel, config: Config) -> Decoder:
+        return CompositeRawDecoder(
+            parser=ModelToComponentFactory._get_parser(model, config),
+            stream_response=True
+        )

     @staticmethod
     def create_csv_decoder(model: CsvDecoderModel, config: Config, **kwargs: Any) -> Decoder:
-        return CompositeRawDecoder(
-            parser=ModelToComponentFactory._get_parser(model, config), stream_response=True
-        )
+        return ModelToComponentFactory._create_composite_decoder(model, config)

     @staticmethod
     def create_jsonl_decoder(model: JsonlDecoderModel, config: Config, **kwargs: Any) -> Decoder:
-        return CompositeRawDecoder(
-            parser=ModelToComponentFactory._get_parser(model, config), stream_response=True
-        )
+        return ModelToComponentFactory._create_composite_decoder(model, config)

     @staticmethod
     def create_gzip_decoder(model: GzipDecoderModel, config: Config, **kwargs: Any) -> Decoder:
-        return CompositeRawDecoder(
-            parser=ModelToComponentFactory._get_parser(model, config), stream_response=True
-        )
+        return ModelToComponentFactory._create_composite_decoder(model, config)

Also applies to: 2047-2050, 2053-2056
🔇 Additional comments (4)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (4)
105-105: LGTM! The GzipDecoder import is correctly placed in alphabetical order.
2037-2038: LGTM! The JSON decoder implementation is clean and straightforward.
2074-2091: LGTM! The parser selection logic is well-structured. The implementation:
- Handles each decoder type appropriately
- Provides clear error messages for unsupported decoders
- Correctly wraps inner parsers for GzipParser
Based on the past review comments, I see that you've already addressed the issue with GzipParser and JsonLineParser that was discussed between @artem1205 and @maxi297. The current implementation looks good.
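The three properties listed above (per-type handling, a clear error for unsupported decoders, and recursive wrapping for gzip) can be sketched roughly as below. The model and parser classes are simplified stand-ins defined only for this illustration; they are not the CDK's classes, and the error wording is an assumption.

```python
from dataclasses import dataclass
from typing import Any


# Simplified stand-ins for the declarative decoder models (not the CDK classes).
@dataclass
class JsonlModel:
    pass


@dataclass
class CsvModel:
    encoding: str = "utf-8"
    delimiter: str = ","


@dataclass
class GzipModel:
    decoder: Any  # the inner decoder model


# Simplified stand-ins for the parsers that each model maps to.
@dataclass
class JsonLineParser:
    pass


@dataclass
class CsvParser:
    encoding: str
    delimiter: str


@dataclass
class GzipParser:
    inner_parser: Any


def get_parser(model: Any) -> Any:
    """Map a decoder model to its parser, recursing into gzip's inner decoder."""
    if isinstance(model, JsonlModel):
        return JsonLineParser()
    if isinstance(model, CsvModel):
        return CsvParser(encoding=model.encoding, delimiter=model.delimiter)
    if isinstance(model, GzipModel):
        return GzipParser(inner_parser=get_parser(model.decoder))
    # Hypothetical error message; fail loudly rather than silently picking a default.
    raise ValueError(f"No parser is associated with decoder model {type(model).__name__}")


parser = get_parser(GzipModel(decoder=CsvModel(encoding="utf-16")))
```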
2071-2071: LGTM! The ZipfileDecoder now uses the centralized parser selection logic, maintaining consistency with other decoders.
LGTM!
Adding a link to the issue to add this new functionality to the builder, just for cross-referencing purposes: https://github.com/airbytehq/airbyte-internal-issues/issues/11679
* main:
  fix: update cryptography package to latest version to address CVE (airbytehq#377)
  fix: (CDK) (HttpRequester) - Make the `HttpRequester.path` optional (airbytehq#370)
  feat: improved custom components handling (airbytehq#350)
  feat: add microseconds timestamp format (airbytehq#373)
  fix: Replace Unidecode with anyascii for permissive license (airbytehq#367)
  feat: add IncrementingCountCursor (airbytehq#346)
  feat: (low-code cdk) datetime format with milliseconds (airbytehq#369)
  fix: (CDK) (AsyncRetriever) - Improve UX on variable naming and interpolation (airbytehq#368)
  fix: (CDK) (AsyncRetriever) - Add the `request` and `response` to each `async` operations (airbytehq#356)
  fix: (CDK) (ConnectorBuilder) - Add `auxiliary requests` to slice; support `TestRead` for AsyncRetriever (part 1/2) (airbytehq#355)
  feat(concurrent perpartition cursor): Add parent state updates (airbytehq#343)
  fix: update csv parser for builder compatibility (airbytehq#364)
  feat(low-code cdk): add interpolation for limit field in Rate (airbytehq#353)
  feat(low-code cdk): add AbstractStreamFacade processing as concurrent streams in declarative source (airbytehq#347)
  fix: (CDK) (CsvParser) - Fix the `\\` escaping when passing the `delimiter` from Builder's UI (airbytehq#358)
  feat: expose `str_to_datetime` jinja macro (airbytehq#351)
  fix: update CDK migration for 6.34.0 (airbytehq#348)
  feat: Removes `stream_state` interpolation from CDK (airbytehq#320)
  fix(declarative): Pass `extra_fields` in `global_substream_cursor` (airbytehq#195)
  feat(concurrent perpartition cursor): Refactor ConcurrentPerPartitionCursor (airbytehq#331)
  feat(HttpMocker): adding support for PUT requests and bytes responses (airbytehq#342)
  chore: use certified source for manifest-only test (airbytehq#338)
  feat: check for request_option mapping conflicts in individual components (airbytehq#328)
  feat(file-based): sync file acl permissions and identities (airbytehq#260)
  fix: (CDK) (Connector Builder) - refactor the `MessageGrouper` > `TestRead` (airbytehq#332)
  fix(low code): Fix missing cursor for ClientSideIncrementalRecordFilterDecorator (airbytehq#334)
  feat(low-code): Add API Budget (airbytehq#314)
  chore(decoder): clean decoders and make csvdecoder available (airbytehq#326)
What
https://github.com/airbytehq/airbyte-internal-issues/issues/11616
This is a breaking change but only for an experimental component or one that is only used in source-amplitude so I'm fine keeping this a minor change.
Note that this means we will start parsing twice instead of relying on the in-memory value of `response.json()` from the requests library. However, we expect the parsing done by orjson to be twice as fast, so we don't expect a performance hit even with the double parsing.
Summary by CodeRabbit
New Features
Refactor
Tests
`CompositeRawDecoder` to ensure correct behavior with consumed and non-streamed responses.
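The streamed versus non-streamed behavior mentioned here can be illustrated with a small stdlib-only sketch. `FakeResponse`, `decode`, and `parse_jsonl` are hypothetical stand-ins (a real `requests.Response` exposes `raw` and `content` in a similar way); this is not the CDK's `CompositeRawDecoder`, only the branching described in the walkthrough: use `response.raw` when `stream_response` is true, otherwise wrap `response.content` in a `BytesIO`.

```python
import io
import json
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterator


@dataclass
class FakeResponse:
    """Hypothetical stand-in for an HTTP response with a body and a raw stream."""

    content: bytes
    raw: io.BufferedIOBase


def parse_jsonl(data: io.BufferedIOBase) -> Iterator[Dict[str, Any]]:
    # One JSON document per non-empty line.
    for line in data:
        if line.strip():
            yield json.loads(line)


def decode(
    response: FakeResponse,
    parse: Callable[[io.BufferedIOBase], Iterator[Dict[str, Any]]],
    stream_response: bool,
) -> Iterator[Dict[str, Any]]:
    # Streaming reads the raw stream once; non-streaming re-wraps the buffered body.
    data = response.raw if stream_response else io.BytesIO(response.content)
    yield from parse(data)


body = b'{"id": 1}\n{"id": 2}\n'
response = FakeResponse(content=body, raw=io.BytesIO(body))

streamed = list(decode(response, parse_jsonl, stream_response=True))
consumed = list(decode(response, parse_jsonl, stream_response=True))  # raw is exhausted
buffered = list(decode(response, parse_jsonl, stream_response=False))
```

In this sketch a second streamed read yields nothing because the raw stream is already consumed, while the non-streamed path can be repeated since it always starts from the buffered content.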