feat: specialized dataset classes, fix: datasets refactor #153

Conversation
@property
def dataset_metadata(self) -> Optional[Dict]:
    return self._dataset_metadata
Could probably just return {} here instead of saving to self._dataset_metadata.
Better yet, just return None
)

@classmethod
def create(
This is a little strange; I thought the point of the subclass was to only expose the relevant parameters and exclude non-relevant ones (for example, bq_source in the case of image_dataset).
Hence, in the branch I provided, the signature was:
@classmethod
def create(
    cls,
    display_name: str,
    gcs_source_uris: Sequence[str],
    import_schema_uri: str,
    data_items_labels: Optional[Dict] = None,
    metadata: Sequence[Tuple[str, str]] = (),
    labels: Optional[Dict] = None,
    project: Optional[str] = None,
    location: Optional[str] = None,
    credentials: Optional[auth_credentials.Credentials] = None,
    sync=True,
) -> "Dataset":
I wonder if it makes more sense to use this instead:
def create(
    cls,
    display_name: str,
    datasource: Union[NonTabularDatasource, NonTabularDatasourceImportable],
    metadata: Sequence[Tuple[str, str]] = (),
    labels: Optional[Dict] = None,
    project: Optional[str] = None,
    location: Optional[str] = None,
    credentials: Optional[auth_credentials.Credentials] = None,
    sync=True,
) -> "Dataset":
This would automatically give you validation such as:
- data_items_labels is only relevant when import_schema_uri is provided
- gcs_source_uris is only relevant when import_schema_uri is provided
It'd also make subclasses even more trivial to write.
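A minimal sketch of how such datasource types could encode that validation, assuming a dataclass layout (the class names come from the proposed signature; the fields are assumptions):

from dataclasses import dataclass
from typing import Dict, Optional, Sequence

@dataclass
class NonTabularDatasource:
    # No import-related fields exist here, so a non-importable datasource
    # cannot carry gcs_source_uris or data_items_labels at all.
    dataset_metadata: Optional[Dict] = None

@dataclass
class NonTabularDatasourceImportable(NonTabularDatasource):
    # Import-only fields travel together with import_schema_uri,
    # making invalid combinations unrepresentable.
    gcs_source_uris: Sequence[str] = ()
    import_schema_uri: str = ""
    data_items_labels: Optional[Dict] = None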
Probably need @sasha-gitg's opinion on this since this would be user-facing.
@ivanmkc to your first feedback: bq_source was left in by mistake, will fix it.
Input signature LGTM.
_support_import_schema_classes = ("image",)

def __init__(
Do we need an overridden __init__ method at all in the ImageDataset subclass? It seems like it just calls Dataset's __init__ with no modifications, so wouldn't it be redundant?
One reason for using an overridden __init__ in the subclass is to check that the returned existing Dataset's metadata_schema_uri matches Image (or other subclasses). However, I don't see that here, since you put it in the Dataset class (which I commented on).
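For reference, a rough sketch of the check such an overridden __init__ could perform, assuming the _supported_metadata_schema_uris class attribute suggested elsewhere in this review and the SDK's existing schema module:

from google.cloud.aiplatform import schema

class ImageDataset(Dataset):
    _supported_metadata_schema_uris = (schema.dataset.metadata.image,)

    def __init__(self, dataset_name: str, *args, **kwargs):
        super().__init__(dataset_name, *args, **kwargs)
        # Reject a retrieved resource whose schema this subclass does not support.
        if self.metadata_schema_uri not in self._supported_metadata_schema_uris:
            raise ValueError(
                f"{self.__class__.__name__} does not support "
                f"metadata_schema_uri {self.metadata_schema_uri}"
            )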
)
import_data_config = datasource.import_data_config

return cls._create_and_import(
I thought we agreed to create a function that encapsulates Tabular/Non-tabular parameters, as discussed in https://docs.google.com/document/d/16Nu_jFnGO79mcn83IgtpvTCYYEwxyPLOgdvxdxfILbI/edit?resourcekey=0-89933HPbKWlN6sz31SvBuA#bookmark=id.k68tqmbzwags
That way you wouldn't call _create_and_import directly, which would simplify all the subclasses, as _create_and_import has many optional parameters with potentially invalid combinations.
    input file(s). May contain wildcards. For more
    information on wildcards, see
    https://cloud.google.com/storage/docs/gsutil/addlhelp/WildcardNames.
bq_source: Optional[str] = None:
Not relevant for image datasets
_support_import_schema_classes = None

def __init__(
ditto
see ImageDataset.__init__ comment
cls.metadata_schema_uri = schema.dataset.metadata.tabular
datasource = TabularDatasource(gcs_source, bq_source)

return cls._create_and_import(
ditto, see ImageDataset.create comment
    sync=sync,
)

def import_data(self):
Good catch. I wish there were a cleaner way to do this, like splitting import_data (and other import functionality) out of Dataset so that only the relevant subclasses would even have this function. That way we wouldn't have to explicitly nullify it; this definitely feels hacky.
Perhaps we can do a separate refactor later:
- DatasetBase only includes Dataset creation functionality
- Dataset inherits DatasetBase and adds import functionality
- ImageDataset would inherit Dataset
- TabularDataset would inherit DatasetBase
This way TabularDataset wouldn't even have an import_data function.
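A minimal sketch of that proposed hierarchy (class bodies are placeholders, not the real signatures):

class DatasetBase:
    # Creation-only functionality; no import methods exist at this level.
    @classmethod
    def create(cls, display_name: str) -> "DatasetBase":
        ...

class Dataset(DatasetBase):
    # Adds import functionality on top of creation.
    def import_data(self) -> "Dataset":
        ...

class ImageDataset(Dataset):  # inherits import_data
    ...

class TabularDataset(DatasetBase):  # never has an import_data to nullify
    ...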
Seems big enough to be a subsequent PR.
    return self._metadata_schema_uri

@metadata_schema_uri.setter
def metadata_schema_uri(self, metadata_schema_uri):
ditto
Prefer not to expose a getter, and a setter even less so, unless required.
They can pass this in through the create functions.
Having a setter as well means that the validation has to be carefully synced across all places where metadata_schema_uri is passed in.
It can also introduce temporal dependencies, like I pointed out in the other comment.
f"{cls.metadata_schema_uri} does not support import_schema_uri." | ||
) | ||
datasource = TabularDatasource(gcs_source, bq_source) | ||
elif cls.metadata_schema_uri in [schema.dataset.metadata.image]: |
We can probably just check for tabular and then assume everything else is non-tabular instead of checking that it matches image, etc.
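In other words, something along these lines (a sketch; the argument-free NonTabularDatasource constructor is an assumption):

if cls.metadata_schema_uri == schema.dataset.metadata.tabular:
    datasource = TabularDatasource(gcs_source, bq_source)
else:
    # Anything that is not tabular is treated uniformly as non-tabular.
    datasource = NonTabularDatasource()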
    return True

@property
def metadata_schema_uri(self) -> str:
Is there a reason we're exposing this? I'm of the opinion that we don't need to expose it as a property if not required.
for import_schema_class in self._support_import_schema_classes:
    if (
        self.metadata_schema_uri
This creates a temporal dependency/coupling such that the user has to first set self.metadata_schema_uri before setting self.import_schema_uri, which can introduce bugs, since both are setters.
This wouldn't be apparent to a user unless they peered into the class, and even then it can be error-prone.
For reasons like this, I would prefer not to have property setters if possible.
We could perhaps provide getters if that's something users care about, although I'd lean towards hiding as many internal workings as possible.
@@ -226,7 +226,7 @@ def test_create_and_import_dataset(
    labels=_TEST_LABEL,
    metadata_schema_uri=_TEST_METADATA_SCHEMA_URI_NONTABULAR,
    import_schema_uri=_TEST_IMPORT_SCHEMA_URI,
    data_items_labels=_TEST_DATA_LABEL_ITEMS,
nice fix
Thanks for putting together the PR! Left some comments
# limitations under the License.
#

from google.cloud.aiplatform.datasets.datasources import (
I think here and below we prefer to import the module instead of the classes: from google.cloud.aiplatform.datasets import datasources
@@ -77,45 +155,37 @@ def create(
    metadata_schema_uri: str,
    gcs_source: Optional[Sequence[str]] = None,
    bq_source: Optional[str] = None,
    labels: Optional[Dict] = {},
The benefit of having this might not outweigh the confusion it would cause.
Oh yes, labels is named poorly. Maybe in MBSDK we can rename it as metadata_labels?
If these are GCP resource labels then resource_labels should be OK. I would recommend removing this and creating a ticket to add resource labeling to all MB SDK resources and follow up later.
Created a ticket for adding resource_labels to resources.
class _Datasource(abc.ABC):
    """An abstract class that sets dataset_metadata"""
How much of the distinction between tabular vs non-tabular, and importable vs non-importable, is a tentative oddity that will be removed from the service, and as such how much should be completely hidden from the user?
As an example, while tabular datasets do not support an import_data service method, the library could well support it through BQ/GCS, which is the expected mechanism for updating tabular datasets.
The near-term goal is to keep the Tabular vs Non-tabular distinction internal so the users don't have to worry about it if they used the Dataset subclasses. The subclasses will tell them exactly which arguments are relevant.
So the effect is that internally, the _Datasource class abstracts away the Tabular/Non-tabular details so that the Dataset class doesn't need to know about the differences. If there comes a day that the services can handle Tabular/Non-tabular the same way, all we have to do is replace TabularDatasource and NonTabularDatasource with some UnifiedDatasource that also conforms to _Datasource. Dataset will not have to be touched (unless the service changes things there too).
In the case of importing data, once we have a mechanism to import data to tabular datasets, we just conform TabularDatasource to _DatasourceImportable, provide the relevant logic and it should thus gain the ability to import data.
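A minimal sketch of the abstraction being described, assuming the interface consists of dataset_metadata for creation and import_data_config for import (inferred from the diff context above):

import abc
from typing import Dict, Optional

class _Datasource(abc.ABC):
    """Hides the tabular/non-tabular distinction from Dataset."""

    @property
    @abc.abstractmethod
    def dataset_metadata(self) -> Optional[Dict]:
        """Metadata to pass to the dataset create request."""

class _DatasourceImportable(_Datasource):
    """Conformed to by datasources that also support importing data."""

    @property
    @abc.abstractmethod
    def import_data_config(self) -> Dict:
        """Config to pass to the import data request."""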
Looks great! Some minor requested changes and unit tests. Thanks!
        Required. A fully-qualified dataset resource name or dataset ID.
        Example: "projects/123/locations/us-central1/datasets/456" or
        "456" when project and location are initialized or passed.
    project (str):
    project: (str) = None
The convention is not to include the default value in the docstring. These docstring args should be reverted back to the arg_name (arg_type): format.
"""The metadata schema uri of this dataset resource.""" | ||
return self._gca_resource.metadata_schema_uri | ||
|
||
def _validate_metadata_schema_uri(self) -> bool: |
Since this is validating, it's safe to return None and not a bool.
    return self._gca_resource.metadata_schema_uri

def _validate_metadata_schema_uri(self) -> bool:
    """Validate the metadata_schema_uri of retrieved dataset resource."""
Requires a Raises section.
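Putting both suggestions together, the validator might end up roughly like this (a sketch; the exact message and the use of _supported_metadata_schema_uris are assumptions drawn from other comments in this review):

def _validate_metadata_schema_uri(self) -> None:
    """Validate the metadata_schema_uri of the retrieved dataset resource.

    Raises:
        ValueError: If the retrieved dataset's metadata_schema_uri is not
            supported by this Dataset class.
    """
    if self._supported_metadata_schema_uris and (
        self.metadata_schema_uri not in self._supported_metadata_schema_uris
    ):
        raise ValueError(
            f"{self.__class__.__name__} class cannot be used to retrieve "
            f"dataset resource {self.resource_name}"
        )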
bq_source: Optional[str] = None,
labels: Optional[Dict] = {},
Side point: Avoid providing a mutable type as a default argument.
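The usual fix, for reference (a generic Python pattern, not code from this PR):

def create(labels: Optional[Dict] = None):
    # Each call gets its own fresh dict instead of sharing one mutable default.
    labels = labels if labels is not None else {}
    ...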
@@ -0,0 +1,141 @@
import abc
from typing import Optional, Dict, Sequence, Union
from google.cloud.aiplatform_v1beta1 import (
These should be module level imports and they should be imported directly from the module they are implemented in.
@@ -0,0 +1,141 @@
import abc
This module should be prefixed with an underscore, i.e. _datasources, to indicate that we don't intend for it to be used on the API surface.
@@ -42,6 +41,8 @@ class Dataset(base.AiPlatformResourceNounWithFutureManager):
    _resource_noun = "datasets"
    _getter_method = "get_dataset"

    _support_metadata_schema_uris = None
Prefer _supported_metadata_schema_uris. Also, it would be nice to include the type signature to signal to devs what the supported type is, i.e. _supported_metadata_schema_uris: Optional[Tuple[str]] = None.
google/cloud/aiplatform/utils.py
@@ -18,7 +18,7 @@

import re

from typing import Any, Match, Optional, Type, TypeVar, Tuple
from typing import Any, Match, Optional, Type, TypeVar, Tuple, Sequence
Why was Sequence added?
project: str,
location: str,
credentials: Optional[auth_credentials.Credentials],
datasource: datasources.Datasource,
Thanks for fixing!
- remove Sequence from utils.py
- refactor datasources.py to _datasources.py
- change docstring format to arg_name (arg_type): convention
- change and include the type signature _supported_metadata_schema_uris
- change _validate_metadata_schema_uri
- refactor _create_encapsulated to _create_and_import
- refactor to module level imports
- add tests for ImageDataset and TabularDataset
LGTM. Added a few more comments.
This branch needs to be updated with the latest dev branch so the builds pass.
"""Creates a tabular datasource | ||
|
||
Args: | ||
gcs_source (Optional[Union[str, Sequence[str]]]): |
It's OK to remove the Optional type hint in the docstring as that's communicated by not including "Required" in the description.
        examples:
            str: "gs://bucket/file.csv"
            Sequence[str]: ["gs://bucket/file1.csv", "gs://bucket/file2.csv"]
    bq_source (Optional[str]):
        BigQuery URI to the input table.
We may also want a sample for bq_source to match the documentation for gcs_source.
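For instance, the docstring entry could gain a sample like this (the URI follows the standard bq:// convention; treat the exact value as illustrative):

bq_source (Optional[str]):
    BigQuery URI to the input table.
    example:
        "bq://project.dataset.table_name"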
dataset_metadata = {
    "input_config": {"bigquery_source": {"uri": bq_source}}
}
if metadata_schema_uri == schema.dataset.metadata.tabular:
I wonder if we should push this conditional instantiation of the correct Datasource class into a helper method in the _datasources module.
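Something like the create_datasource helper that the later commit message mentions; this sketch of its shape is an assumption, reusing the datasource classes discussed above:

def create_datasource(
    metadata_schema_uri: str,
    import_schema_uri: Optional[str] = None,
    gcs_source: Optional[Union[str, Sequence[str]]] = None,
    bq_source: Optional[str] = None,
    data_item_labels: Optional[Dict] = None,
) -> _Datasource:
    # Tabular datasets read directly from GCS/BQ and never import.
    if metadata_schema_uri == schema.dataset.metadata.tabular:
        return TabularDatasource(gcs_source, bq_source)
    # Non-tabular datasets import data when an import schema is given.
    if import_schema_uri:
        return NonTabularDatasourceImportable(
            gcs_source, import_schema_uri, data_item_labels
        )
    return NonTabularDatasource()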
assert my_dataset._gca_resource == expected_dataset


class TestTabularDataset:
This should have one more test to ensure import_data raises.
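A minimal sketch of such a test (pytest style; the constant names and the exact exception type are assumptions based on the surrounding test module):

import pytest

def test_tabular_dataset_import_data_raises():
    my_dataset = datasets.TabularDataset(dataset_name=_TEST_NAME)
    # TabularDataset does not support data import, so this must raise.
    with pytest.raises(NotImplementedError):
        my_dataset.import_data()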
LGTM once rest of comments addressed!
Let me know if you need help rebasing.
* fix: unblock builds (#132)
* chore: Update README with Experimental verbiage (#131)
* fix: Fixed comments (#116)
Co-authored-by: Ivan Cheung <[email protected]>
* feat: Implements a wrapped client that instantiates the client at every API invocation (#139)
* feat: Added optional model args for custom training (#129)
* Added optional model args
* fix: Removed etag
* fix: Added predict schemata and fixed type error
* fix: Added description and fixed predict_schemata
* Added _model_serving_container_command, _model_serving_container_args, env=self._model_serving_container_environment_variables and _model_serving_container_ports
* fix: Ran linter
* fix: Added tests for model_instance_schema_uri, model_parameters_schema_uri and model_prediction_schema_uri
* fix: Fixed env and ports and added tests
* fix: Removed model_labels
* fix: Moved container spec creation into init function
* fix: Fixed docstrings
* fix: Moved import to be alphabetical
* fix: Moved model creation to init function
* fix: Fixed predict_schemata
* fix: simplified predict schemata
* fix: added linter
* fix: Fixed trailing comma
* fix: Removed CustomTrainingJob private fields
* fix: Fixed model tests
* fix: Set managed_model to None
Co-authored-by: Ivan Cheung <[email protected]>
* Fix: refactor class constructor for retrieving resource (#125)
* Added property and abstract method _getter_method and _resource_noun, implemented method _get_gca_resource to class AiPlatformResourceNoun; Added _resource_noun, _getter_method, to Dataset, Model, Endpoint, subclasses of _Job, _TrainingJob, refactored (_)get_* and utils.full_resource_name in class constructor to self._get_gca_resource to Dataset, Model, Endpoint, _Job
* Added return value in _get_gca_resource, added method _sync_gca_resource in AiPlatformResourceNoun class; removed job_type, updated status method with _sync_gca_resource in _Job class
* fix: added return type and lint issues
* fix: merge conflict issue with models.py
* fix: F401 'abc' imported but unused
* chore: merge main into dev (#154)
* test: Dataset integration tests (#126)
* Add dataset.metadata.text to schemas
* Add first integration tests, Dataset class
* Make teardown work if test fails, update asserts
* Change test folder name, enable system tests
* Hide test_base, test_end_to_end for Kokoro CI bug
* Add GCP Project env var to Kokoro presubmit cfg
* Restore presubmit cfg, drop --quiet in unit tests
* Restore test_base, test_end_to_end to find timeout
* Skip tests depending on persistent resources
* Use auth default creds for system tests
* Drop unused import os
* feat: specialized dataset classes, fix: datasets refactor (#153)
* feat: Refactored Dataset by removing intermediate layers
* Added image_dataset and tabular_dataset subclass
* Moved metadata_schema_uri responsibility to subclass to enable forecasting
* Moved validation logic for tabular into Dataset._create_tabular
* Added validation in image_dataset and fixed bounding_box schema error
* Removed import_config
* Fixed metadata_schema_uri
* Fixed import and subclasses
* Added EmptyNontabularDatasource
* change import_metadata to ioformat
* added datasources.py
* added support of multiple gcs_sources
* fix: default (empty) dataset_metadata need to be set to {}, not None
* 1) imported datasources 2) added _support_metadata_schema_uris and _support_import_schema_classes 3) added getter and setter/validation for resource_metadata_schema_uri, metadata_schema_uri, and import_schema_uri 4) fixed request_metadata, data_item_labels 5) encapsulated dataset_metadata, and import_data_configs 6) added datasource configuration logic
* added image_dataset.py and tabular_dataset.py
* fix: refactor - create datasets module
* fix: cleanup __init__.py
* fix: data_item_labels
* fix: docstring
* fix: - changed NonTabularDatasource.dataset_metadata default to None - updated NonTabularDatasource docstring - changed gcs_source type hint with Union - changed _create_and_import to _create_encapsulated with datasource - removed subclass.__init__ and irrelevant parameters in create
* fix: import the module instead of the classes for datasources
* fix: removed all validation for import_schema_uri
* fix: set parameter default to immutable
* fix: replaced Datasource / DatasourceImportable abstract class instead of a concrete type
* fix: added examples for gcs_source
* fix: - remove Sequence from utils.py - refactor datasources.py to _datasources.py - change docstring format to arg_name (arg_type): convention - change and include the type signature _supported_metadata_schema_uris - change _validate_metadata_schema_uri - refactor _create_encapsulated to _create_and_import - refactor to module level imports - add tests for ImageDataset and TabularDataset
* fix: remove all labels
* fix: remove Optional in docstring, add example for bq_source
* test: add import_data raise for tabular dataset test
* fix: refactor datasource creation with create_datasource
* fix: lint
Co-authored-by: Ivan Cheung <[email protected]>
* feat: Add AutoML Image Training Job class (#152)
* Add AutoMLImageTrainingJob, tests, constants
* Address reviewer comments
* feat: Add custom container support (#164)
* chore: merge main into dev (#162)
* fix: suppress no project id warning (#160)
* fix: suppress no project id warning
* fix: temporary suppress logging.WARNING and set credentials as google.auth.default credentials
* fix: move default credentials config to credentials property
* fix: add property setter for credentials to avoid everytime reset
* fix: Fixed wrong key value for multilabel (#168)
Co-authored-by: Ivan Cheung <[email protected]>
* feat: Add delete methods, add list_models and undeploy_all for Endpoint class (#165)
* Endpoint list_models, delete, undeploy_all WIP
* Finish delete + undeploy methods, tests
* Add global pool teardowns for test timeout issue
* Address reviewer comments, add async support
* fix: Fixed bug causing training failure for object detection (#171)
Co-authored-by: Ivan Cheung <[email protected]>
* fix: Support intermediary BQ Table for Custom Training (#166)
* chore: add AutoMLImageTrainingJob to aiplatform namespace (#173)
* fix: Unblock build (#174)
* fix: default credentials config related test failures (#167)
* fix: suppress no project id warning
* fix: temporary suppress logging.WARNING and set credentials as google.auth.default credentials
* fix: move default credentials config to credentials property
* fix: add property setter for credentials to avoid everytime reset
* fix: tests for set credentials to default when default not provided
* fix: change credentials with initializer default when not provided in AiPlatformResourceNoun
* fix: use credential mock in tests
* fix: lint
Co-authored-by: sasha-gitg <[email protected]>
Co-authored-by: Ivan Cheung <[email protected]>
Co-authored-by: Ivan Cheung <[email protected]>
Co-authored-by: Morgan Du <[email protected]>
Co-authored-by: Vinny Senthil <[email protected]>
* Fix: pass bq_destination to input data config when using training script (#181)
* fix: pass bigquery destination
* fix: add tests and formatting
Co-authored-by: Ivan Cheung <[email protected]>
Co-authored-by: Ivan Cheung <[email protected]>
Co-authored-by: Morgan Du <[email protected]>
Co-authored-by: Vinny Senthil <[email protected]>
Fixes: