Commit 23466f0

ravinkohli, nabenabe0928, and ArlindKadra authored
Cocktail fixes time debug (#286)
* preprocess inside data validator
* add time debug statements
* Add fixes for categorical data
* add fit_ensemble
* add arlind fix for swa and se
* fix bug in trainer choice fit
* fix ensemble bug
* Correct bug in cleanup
* Cleanup for removing time debug statements
* ablation for adversarial
* shuffle false in dataloader
* drop last false in dataloader
* fix bug for validation set, and cutout and cutmix
* shuffle = False
* Shake Shake updates (#287)
* To test locally
* fix bug in trainer choice fit
* fix ensemble bug
* Correct bug in cleanup
* To test locally
* Cleanup for removing time debug statements
* ablation for adversarial
* shuffle false in dataloader
* drop last false in dataloader
* fix bug for validation set, and cutout and cutmix
* To test locally
* shuffle = False
* To test locally
* updates to search space
* updates to search space
* update branch with search space
* undo search space update
* fix bug in shake shake flag
* limit to shake-even
* restrict to even even
* Add even even and others for shake-drop also
* fix bug in passing alpha beta method
* restrict to only even even
* fix silly bug:
* remove imputer and ordinal encoder for categorical transformer in feature validator
* Address comments from shuhei
* fix issues with ensemble fitting post hoc
* Address comments on the PR
* Fix flake and mypy errors
* Address comments from PR #286
* fix bug in embedding
* Update autoPyTorch/api/tabular_classification.py
  Co-authored-by: nabenabe0928 <[email protected]>
* Update autoPyTorch/datasets/base_dataset.py
  Co-authored-by: nabenabe0928 <[email protected]>
* Update autoPyTorch/datasets/base_dataset.py
  Co-authored-by: nabenabe0928 <[email protected]>
* Update autoPyTorch/pipeline/components/training/trainer/base_trainer.py
  Co-authored-by: nabenabe0928 <[email protected]>
* Address comments from shuhei
* adress comments from shuhei
* fix flake and mypy
* Update autoPyTorch/pipeline/components/training/trainer/RowCutMixTrainer.py
  Co-authored-by: nabenabe0928 <[email protected]>
* Update autoPyTorch/pipeline/tabular_classification.py
  Co-authored-by: nabenabe0928 <[email protected]>
* Update autoPyTorch/pipeline/components/setup/network_backbone/utils.py
  Co-authored-by: nabenabe0928 <[email protected]>
* Update autoPyTorch/pipeline/components/setup/network_backbone/utils.py
  Co-authored-by: nabenabe0928 <[email protected]>
* Update autoPyTorch/pipeline/components/setup/network_backbone/utils.py
  Co-authored-by: nabenabe0928 <[email protected]>
* Apply suggestions from code review
  Co-authored-by: nabenabe0928 <[email protected]>
* increase threads_per_worker
* fix bug in rowcutmix
* Enhancement for the tabular validator. (#291)
* Initial try at an enhancement for the tabular validator
* Adding a few type annotations
* Fixing bugs in implementation
* Adding wrongly deleted code part during rebase
* Fix bug in _get_args
* Fix bug in _get_args
* Addressing Shuhei's comments
* Address Shuhei's comments
* Refactoring code
* Refactoring code
* Typos fix and additional comments
* Replace nan in categoricals with simple imputer
* Remove unused function
* add comment
* Update autoPyTorch/data/tabular_feature_validator.py
  Co-authored-by: nabenabe0928 <[email protected]>
* Update autoPyTorch/data/tabular_feature_validator.py
  Co-authored-by: nabenabe0928 <[email protected]>
* Adding unit test for only nall columns in the tabular feature categorical evaluator
* fix bug in remove all nan columns
* Bug fix for making tests run by arlind
* fix flake errors in feature validator
* made typing code uniform
* Apply suggestions from code review
  Co-authored-by: nabenabe0928 <[email protected]>
* address comments from shuhei
* address comments from shuhei (2)
  Co-authored-by: Ravin Kohli <[email protected]>
  Co-authored-by: Ravin Kohli <[email protected]>
  Co-authored-by: nabenabe0928 <[email protected]>
* Apply suggestions from code review
  Co-authored-by: nabenabe0928 <[email protected]>
* resolve code issues with new versions
* Address comments from shuhei
* make run_traditional_ml function
* implement suggestion from shuhei and fix bug in rowcutmixtrainer
* fix return type docstring
* add better documentation and fix bug in shake_drop_get_bl
* Apply suggestions from code review
  Co-authored-by: nabenabe0928 <[email protected]>
* add test for comparator and other improvements based on PR comments
* fix bug in test
* [fix] Fix the condition in the raising error of all_nan_columns
* [refactor] Unite name conventions of numpy array and pandas dataframe
* [doc] Add the description about the tabular feature transformation
* [doc] Add the description of the tabular feature transformation
* address comments from arlind
* address comments from arlind
* change to as_tensor and address comments from arlind
* correct description for functions in data module
  Co-authored-by: nabenabe0928 <[email protected]>
  Co-authored-by: Arlind Kadra <[email protected]>
  Co-authored-by: nabenabe0928 <[email protected]>
1 parent d37d4a5 commit 23466f0

35 files changed: +1130 -527 lines changed

autoPyTorch/api/base_task.py

Lines changed: 280 additions & 57 deletions
Large diffs are not rendered by default.

autoPyTorch/api/tabular_classification.py

Lines changed: 2 additions & 0 deletions
@@ -275,6 +275,8 @@ def search(
             y_test=y_test,
             dataset_name=dataset_name)

+        if self.dataset is None:
+            raise ValueError("`dataset` in {} must be initialized, but got None".format(self.__class__.__name__))
         return self._search(
             dataset=self.dataset,
             optimize_metric=optimize_metric,
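
The two added lines make `search` fail early when the dataset was never set, rather than handing `None` to `_search`. A minimal sketch of that guard pattern follows. `ToyTask` and its `_search` stub are hypothetical stand-ins for illustration, not the autoPyTorch classes.

```python
from typing import List, Optional


class ToyTask:
    """Hypothetical stand-in for an API task class; not autoPyTorch code."""

    def __init__(self) -> None:
        # The dataset is only populated once data has been provided.
        self.dataset: Optional[List[float]] = None

    def _search(self, dataset: List[float]) -> int:
        # Placeholder for the real search: just report the dataset size.
        return len(dataset)

    def search(self) -> int:
        # Fail early with a clear message instead of passing None downstream.
        if self.dataset is None:
            raise ValueError(
                "`dataset` in {} must be initialized, but got None".format(self.__class__.__name__)
            )
        return self._search(dataset=self.dataset)
```

A likely motivation, given the "Fix flake and mypy errors" entries in the commit message, is that the explicit check lets a static type checker narrow `self.dataset` from `Optional[...]` to its concrete type before the call.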

autoPyTorch/api/tabular_regression.py

Lines changed: 2 additions & 0 deletions
@@ -261,6 +261,8 @@ def search(
             y_test=y_test,
             dataset_name=dataset_name)

+        if self.dataset is None:
+            raise ValueError("`dataset` in {} must be initialized, but got None".format(self.__class__.__name__))
         return self._search(
             dataset=self.dataset,
             optimize_metric=optimize_metric,

autoPyTorch/data/base_feature_validator.py

Lines changed: 56 additions & 27 deletions
@@ -1,5 +1,5 @@
 import logging
-import typing
+from typing import List, Optional, Set, Tuple, Union

 import numpy as np

@@ -12,8 +12,8 @@
 from autoPyTorch.utils.logging_ import PicklableClientLogger


-SUPPORTED_FEAT_TYPES = typing.Union[
-    typing.List,
+SUPPORTED_FEAT_TYPES = Union[
+    List,
     pd.DataFrame,
     np.ndarray,
     scipy.sparse.bsr_matrix,
@@ -35,60 +35,61 @@ class BaseFeatureValidator(BaseEstimator):
             List of the column types found by this estimator during fit.
         data_type (str):
             Class name of the data type provided during fit.
-        encoder (typing.Optional[BaseEstimator])
+        encoder (Optional[BaseEstimator])
             Host a encoder object if the data requires transformation (for example,
             if provided a categorical column in a pandas DataFrame)
-        enc_columns (typing.List[str])
+        enc_columns (List[str])
             List of columns that were encoded.
     """
     def __init__(self,
-                 logger: typing.Optional[typing.Union[PicklableClientLogger, logging.Logger
-                                                      ]] = None,
+                 logger: Optional[Union[PicklableClientLogger, logging.Logger
+                                        ]
+                                  ] = None,
                  ) -> None:
         # Register types to detect unsupported data format changes
-        self.feat_type = None  # type: typing.Optional[typing.List[str]]
-        self.data_type = None  # type: typing.Optional[type]
-        self.dtypes = []  # type: typing.List[str]
-        self.column_order = []  # type: typing.List[str]
+        self.feat_type: Optional[List[str]] = None
+        self.data_type: Optional[type] = None
+        self.dtypes: List[str] = []
+        self.column_order: List[str] = []

-        self.encoder = None  # type: typing.Optional[BaseEstimator]
-        self.enc_columns = []  # type: typing.List[str]
+        self.encoder: Optional[BaseEstimator] = None
+        self.enc_columns: List[str] = []

-        self.logger: typing.Union[
+        self.logger: Union[
             PicklableClientLogger, logging.Logger
         ] = logger if logger is not None else logging.getLogger(__name__)

         # Required for dataset properties
-        self.num_features = None  # type: typing.Optional[int]
-        self.categories = []  # type: typing.List[typing.List[int]]
-        self.categorical_columns: typing.List[int] = []
-        self.numerical_columns: typing.List[int] = []
-        # column identifiers may be integers or strings
-        self.null_columns: typing.Set[str] = set()
+        self.num_features: Optional[int] = None
+        self.categories: List[List[int]] = []
+        self.categorical_columns: List[int] = []
+        self.numerical_columns: List[int] = []
+
+        self.all_nan_columns: Optional[Set[Union[int, str]]] = None

         self._is_fitted = False

     def fit(
         self,
         X_train: SUPPORTED_FEAT_TYPES,
-        X_test: typing.Optional[SUPPORTED_FEAT_TYPES] = None,
+        X_test: Optional[SUPPORTED_FEAT_TYPES] = None,
     ) -> BaseEstimator:
         """
         Validates and fit a categorical encoder (if needed) to the features.
         The supported data types are List, numpy arrays and pandas DataFrames.
         CSR sparse data types are also supported

-        Arguments:
+        Args:
             X_train (SUPPORTED_FEAT_TYPES):
                 A set of features that are going to be validated (type and dimensionality
                 checks) and a encoder fitted in the case the data needs encoding
-            X_test (typing.Optional[SUPPORTED_FEAT_TYPES]):
+            X_test (Optional[SUPPORTED_FEAT_TYPES]):
                 A hold out set of data used for checking
         """

         # If a list was provided, it will be converted to pandas
         if isinstance(X_train, list):
-            X_train, X_test = self.list_to_dataframe(X_train, X_test)
+            X_train, X_test = self.list_to_pandas(X_train, X_test)

         self._check_data(X_train)

@@ -114,14 +115,15 @@ def _fit(
         X: SUPPORTED_FEAT_TYPES,
     ) -> BaseEstimator:
         """
-        Arguments:
+        Args:
             X (SUPPORTED_FEAT_TYPES):
                 A set of features that are going to be validated (type and dimensionality
                 checks) and a encoder fitted in the case the data needs encoding
         Returns:
             self:
                 The fitted base estimator
         """
+
         raise NotImplementedError()

     def _check_data(
@@ -131,19 +133,20 @@ def _check_data(
         """
         Feature dimensionality and data type checks

-        Arguments:
+        Args:
             X (SUPPORTED_FEAT_TYPES):
                 A set of features that are going to be validated (type and dimensionality
                 checks) and a encoder fitted in the case the data needs encoding
         """
+
         raise NotImplementedError()

     def transform(
         self,
         X: SUPPORTED_FEAT_TYPES,
     ) -> np.ndarray:
         """
-        Arguments:
+        Args:
             X_train (SUPPORTED_FEAT_TYPES):
                 A set of features, whose categorical features are going to be
                 transformed
@@ -152,4 +155,30 @@ def transform(
             np.ndarray:
                 The transformed array
         """
+
+        raise NotImplementedError()
+
+    def list_to_pandas(
+        self,
+        X_train: SUPPORTED_FEAT_TYPES,
+        X_test: Optional[SUPPORTED_FEAT_TYPES] = None,
+    ) -> Tuple[pd.DataFrame, Optional[pd.DataFrame]]:
+        """
+        Converts a list to a pandas DataFrame. In this process, column types are inferred.
+
+        If test data is provided, we proactively match it to train data
+
+        Args:
+            X_train (SUPPORTED_FEAT_TYPES):
+                A set of features that are going to be validated (type and dimensionality
+                checks) and a encoder fitted in the case the data needs encoding
+            X_test (Optional[SUPPORTED_FEAT_TYPES]):
+                A hold out set of data used for checking
+        Returns:
+            pd.DataFrame:
+                transformed train data from list to pandas DataFrame
+            pd.DataFrame:
+                transformed test data from list to pandas DataFrame
+        """
+
         raise NotImplementedError()
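
The base class only declares `list_to_pandas`; the concrete behaviour lives in subclasses that are not shown in this excerpt. As a rough illustration of what the docstring describes (infer column types for the training list, then match the test data to them), here is a sketch under stated assumptions. The function name `list_to_pandas_sketch` is hypothetical, and using `infer_objects` plus `astype` is one plausible way to do this, not necessarily what the project does.

```python
from typing import Any, List, Optional, Tuple

import pandas as pd


def list_to_pandas_sketch(
    X_train: List[Any],
    X_test: Optional[List[Any]] = None,
) -> Tuple[pd.DataFrame, Optional[pd.DataFrame]]:
    """Hypothetical helper: convert lists to DataFrames and align test dtypes to train."""
    # infer_objects() asks pandas to pick a more specific dtype for object columns.
    train_df = pd.DataFrame(data=X_train).infer_objects()
    test_df = None
    if X_test is not None:
        # Cast the test frame to the dtypes inferred from the training data.
        test_df = pd.DataFrame(data=X_test).astype(train_df.dtypes.to_dict())
    return train_df, test_df
```

Calling `list_to_pandas_sketch([[1, "a"], [2, "b"]], [[3, "c"]])` returns two frames whose column dtypes agree; with no test list the second element is `None`, matching the `Optional[pd.DataFrame]` in the declared return type.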

autoPyTorch/data/base_target_validator.py

Lines changed: 27 additions & 25 deletions
@@ -1,5 +1,5 @@
 import logging
-import typing
+from typing import List, Optional, Union, cast

 import numpy as np

@@ -12,8 +12,8 @@
 from autoPyTorch.utils.logging_ import PicklableClientLogger


-SUPPORTED_TARGET_TYPES = typing.Union[
-    typing.List,
+SUPPORTED_TARGET_TYPES = Union[
+    List,
     pd.Series,
     pd.DataFrame,
     np.ndarray,
@@ -35,48 +35,50 @@ class BaseTargetValidator(BaseEstimator):
         is_classification (bool):
             A bool that indicates if the validator should operate in classification mode.
             During classification, the targets are encoded.
-        encoder (typing.Optional[BaseEstimator]):
+        encoder (Optional[BaseEstimator]):
             Host a encoder object if the data requires transformation (for example,
             if provided a categorical column in a pandas DataFrame)
-        enc_columns (typing.List[str])
+        enc_columns (List[str])
             List of columns that where encoded
     """
     def __init__(self,
                  is_classification: bool = False,
-                 logger: typing.Optional[typing.Union[PicklableClientLogger, logging.Logger
-                                                      ]] = None,
+                 logger: Optional[Union[PicklableClientLogger,
+                                        logging.Logger
+                                        ]
+                                  ] = None,
                  ) -> None:
         self.is_classification = is_classification

-        self.data_type = None  # type: typing.Optional[type]
+        self.data_type: Optional[type] = None

-        self.encoder = None  # type: typing.Optional[BaseEstimator]
+        self.encoder: Optional[BaseEstimator] = None

-        self.out_dimensionality = None  # type: typing.Optional[int]
-        self.type_of_target = None  # type: typing.Optional[str]
+        self.out_dimensionality: Optional[int] = None
+        self.type_of_target: Optional[str] = None

-        self.logger: typing.Union[
+        self.logger: Union[
             PicklableClientLogger, logging.Logger
         ] = logger if logger is not None else logging.getLogger(__name__)

         # Store the dtype for remapping to correct type
-        self.dtype = None  # type: typing.Optional[type]
+        self.dtype: Optional[type] = None

         self._is_fitted = False

     def fit(
         self,
         y_train: SUPPORTED_TARGET_TYPES,
-        y_test: typing.Optional[SUPPORTED_TARGET_TYPES] = None,
+        y_test: Optional[SUPPORTED_TARGET_TYPES] = None,
     ) -> BaseEstimator:
         """
         Validates and fit a categorical encoder (if needed) to the targets
         The supported data types are List, numpy arrays and pandas DataFrames.

-        Arguments:
+        Args:
             y_train (SUPPORTED_TARGET_TYPES)
                 A set of targets set aside for training
-            y_test (typing.Union[SUPPORTED_TARGET_TYPES])
+            y_test (Union[SUPPORTED_TARGET_TYPES])
                 A hold out set of data used of the targets. It is also used to fit the
                 categories of the encoder.
         """
@@ -95,8 +97,8 @@ def fit(
                     np.shape(y_test)
                 ))
             if isinstance(y_train, pd.DataFrame):
-                y_train = typing.cast(pd.DataFrame, y_train)
-                y_test = typing.cast(pd.DataFrame, y_test)
+                y_train = cast(pd.DataFrame, y_train)
+                y_test = cast(pd.DataFrame, y_test)
                 if y_train.columns.tolist() != y_test.columns.tolist():
                     raise ValueError(
                         "Train and test targets must both have the same columns, yet "
@@ -127,24 +129,24 @@ def fit(
     def _fit(
         self,
         y_train: SUPPORTED_TARGET_TYPES,
-        y_test: typing.Optional[SUPPORTED_TARGET_TYPES] = None,
+        y_test: Optional[SUPPORTED_TARGET_TYPES] = None,
     ) -> BaseEstimator:
         """
-        Arguments:
+        Args:
             y_train (SUPPORTED_TARGET_TYPES)
                 The labels of the current task. They are going to be encoded in case
                 of classification
-            y_test (typing.Optional[SUPPORTED_TARGET_TYPES])
+            y_test (Optional[SUPPORTED_TARGET_TYPES])
                 A holdout set of labels
         """
         raise NotImplementedError()

     def transform(
         self,
-        y: typing.Union[SUPPORTED_TARGET_TYPES],
+        y: Union[SUPPORTED_TARGET_TYPES],
     ) -> np.ndarray:
         """
-        Arguments:
+        Args:
             y (SUPPORTED_TARGET_TYPES)
                 A set of targets that are going to be encoded if the current task
                 is classification
@@ -161,8 +163,8 @@ def inverse_transform(
         """
         Revert any encoding transformation done on a target array

-        Arguments:
-            y (typing.Union[np.ndarray, pd.DataFrame, pd.Series]):
+        Args:
+            y (Union[np.ndarray, pd.DataFrame, pd.Series]):
                 Target array to be transformed back to original form before encoding
         Returns:
             np.ndarray:
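
The changes in this file look purely mechanical: `# type:` comments become inline variable annotations, and `typing.cast` is imported directly as `cast`. The sketch below uses hypothetical `Before` and `After` classes and a hypothetical `as_int_list` helper (none of them from autoPyTorch) to illustrate that both annotation styles leave the same runtime state and that `cast` does nothing at runtime.

```python
from typing import List, Optional, cast


class Before:
    def __init__(self) -> None:
        # Old style: the type lives in a comment that only type checkers read.
        self.out_dimensionality = None  # type: Optional[int]


class After:
    def __init__(self) -> None:
        # New style (PEP 526): the type is part of the assignment syntax.
        self.out_dimensionality: Optional[int] = None


def as_int_list(values: object) -> List[int]:
    # cast() performs no conversion or check at runtime; it returns its
    # argument unchanged and only informs the static type checker.
    return cast(List[int], values)
```

Because `cast` returns the very same object, the `cast(pd.DataFrame, y_train)` lines in the diff do not copy or convert the targets; they only tell the type checker which branch of the `Union` is in play.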

autoPyTorch/data/base_validator.py

Lines changed: 2 additions & 2 deletions
@@ -58,7 +58,7 @@ def fit(
             + Checks for dimensionality as well as missing values are performed.
             + If performing a classification task, the data is going to be encoded

-        Arguments:
+        Args:
             X_train (SUPPORTED_FEAT_TYPES):
                 A set of features that are going to be validated (type and dimensionality
                 checks). If this data contains categorical columns, an encoder is going to
@@ -102,7 +102,7 @@ def transform(
         """
         Transform the given target or features to a numpy array

-        Arguments:
+        Args:
             X (SUPPORTED_FEAT_TYPES):
                 A set of features to transform
             y (typing.Optional[SUPPORTED_TARGET_TYPES]):
