-
Notifications
You must be signed in to change notification settings - Fork 52
Move variable mapping out of mpas_xarray into new generalized_reader module #104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move variable mapping out of mpas_xarray into new generalized_reader module #104
Conversation
@pwolfram and @milenaveneziani, this is the last reorganization PR before I do a pull request to generalize the calendar. This builds on #102, so should not be merged until that PR has been fulfilled. This is a relatively simple reorganization, so I didn't feel like it needed a design document. Please let me know if you disagree and would prefer to have a design document. |
Oops, I need to add CI for this and fix the mpas_xarray CI. Update coming soon... |
Okay, I added CI. |
066d185
to
833171d
Compare
Yes, I'll definitely do that. |
71f7e57
to
74b7bbf
Compare
@pwolfram, probably hold off on reviewing this until #102 has been merged. If you like, you can take a look at the new classes specifically: One question I have for you is how you feel about my file organization:
This organization leads to clunky imports like:
Also, would you prefer the files and directories not be named after the classes they contain, given that there are also several utility functions in these files? Would you prefer I separate the utility functions into their own files (e.g. I didn't want to bother with a design doc for this PR but this particular bit of decision making probably would have warranted a design doc. Let me know if you feel strongly that I should make one. |
74b7bbf
to
447891a
Compare
Okay, this one is rebased and ready to review. |
|
||
dsFirstYear = remove_repeated_time_index(dsFirstYear) | ||
dsFirstYear = openMultifileDataSet(filesFirstYear, | ||
calendar, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor comment, but shouldn't the two lines above have inFiles=filesFirstYear and calendar=calendar, for style consistency?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree that would be an improvement.
preprocess=lambda x: preprocess_mpas(x, yearoffset=yearOffset)) | ||
dsPreprocessed = remove_repeated_time_index(dsPreprocessed) | ||
dsPreprocessed = openMultifileDataSet(inFilesPreprocessed, | ||
calendar, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same minor comment as above
preprocess=lambda x: preprocess_mpas(x, yearoffset=yearOffset)) | ||
dsPreprocessed = remove_repeated_time_index(dsPreprocessed) | ||
dsPreprocessed = openMultifileDataSet(inFilesPreprocessed, | ||
calendar, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment
mpas_analysis/sea_ice/modelvsobs.py
Outdated
endDate=endDate, calendar=calendar) | ||
print 'Reading files {} through {}'.format(infiles[0], infiles[-1]) | ||
print 'Reading files {} through {}'.format(inFiles[0], inFiles[-1]) | ||
|
||
plotsDirectory = buildConfigFullPath(config, 'output', 'plotsSubdirectory') | ||
obsDirectory = config.get('seaIceObservations', 'baseDirectory') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is not the right place to do it, but I realized recently that we should have the iceCocentration and iceThickness observational files defined in the config file, rather than hard wired here...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, maybe make an issue for this.
mpas_analysis/sea_ice/timeseries.py
Outdated
inFilesPreprocessed, | ||
preprocess=lambda x: preprocess_mpas(x, yearoffset=yearOffset)) | ||
dsPreprocessed = openMultifileDataSet(inFilesPreprocessed, | ||
calendar, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment as before, with a few other instances below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Big picture concern: use of classes is making internals of mpas_xarray more confusing, but at the benefit of providing a single function a user needs to understand in order to load in data. There is a trade-off here and in some ways this assumes that our user's won't be developers. This is potentially dangerous given our group's mission and composition.
We definitely need a design document for this code because it is becoming more complex and it will be something everyone will have to live with in the future. I recommend getting feedback from @vanroekel on whether he thinks this all is a step in the right direction because he is less steeped in this project than @xylar, @milenaveneziani, or myself (@pwolfram).
from ..shared.plot.plotting import timeseries_analysis_plot | ||
|
||
from ..shared.io import NameList, StreamsFile | ||
from ..shared.io.utility import buildConfigFullPath | ||
|
||
from ..shared.MpasAnalysisXarray.MpasAnalysisXarray import openMultifileDataSet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xylar, I think this organization is strange. Do we really need a folder with the same class name? Also, typically files with classes are not camel-cased, but would be something like from ...shared.mpas_analysis_xarray import OpenMultifileDataSet
where I'm assuming OpenMultifileDataSet
is a class. If it is a function it should be lower case with _
characters.
variableMap=variableMap, | ||
startDate=startDate, | ||
endDate=endDate, | ||
yearOffset=yearOffset) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note, this changes public API so we probably want at least a very simple design doc.
mpas_analysis/sea_ice/timeseries.py
Outdated
@@ -14,6 +11,9 @@ | |||
from ..shared.timekeeping.utility import stringToDatetime, \ | |||
clampToNumpyDatetime64 | |||
|
|||
from ..shared.MpasAnalysisXarray.MpasAnalysisXarray import openMultifileDataSet | |||
from ..shared.MpasXarray.MpasXarray import subsetVariables |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right @xylar, the organization here is important and need further consideration. I'm thinking we probably do need a design doc
- highlighting directory structure
- highlighting public API
I don't think this needs to be a "full" design doc, but more of a simple listing of how this all should work.
|
||
Utility function | ||
---------------- | ||
openMultifileDataSet : creates an instance of MpasAnalysisXarray, opens a data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Functions typically should not be camel-cased.
I do not have specific opinions about MpasXarray and MpasAnalysisXarray. |
@xylar and @milenaveneziani, I think we want a high-level view of the changes here to directory structure and public API. I'm thinking a minimal quasi-"design doc" may be the best way to do this because we need to make sure these changes are going to be more broadly useful than for just the three of us. |
The high level changes to the directory structure are quite simple. The following have been added:
And the following has been removed:
|
@pwolfram: I am not too concerned about the fact that this PR changes API, since for now we only have 4 developers. But I agree that it would be nice to hear from @vanroekel about these changes. |
32e5350
to
cb78a8f
Compare
TestingI tested this with both LANL tests and it's bit-for-bit with #110 (which fixes a bug in develop). |
@xylar: I don't have much else to say about this PR or #108. I agree with @vanroekel that the call to open the multi-file dataset is much more intuitive now. Thanks for all the work on this. |
cb78a8f
to
5e56c59
Compare
5e56c59
to
8e38579
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The API for opening files is now much cleaner and simpler, which is a huge plus.
I think this PR will make it easier to load in MPAS data for new users.
A few minor, general notes:
-
We should try to keep blocks of code as close to their original locations as possible in order to make incremental changes to the code obvious between commits for debugging purposes.
-
Name changes to use
_
characters are general frowned upon in python because strings with_
characters typically designate functions and methods.
variable_map=variableMap, | ||
start_date=startDateFirstYear, | ||
end_date=endDateFirstYear, | ||
year_offset=yearOffset) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes the usage much clearer, especially because additional selections via xarray aren't needed now.
""" | ||
Overloads the _preprocess function from MpasXarray to include | ||
mapping variable names. See MpasXaray._preprocess for more | ||
details. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xylar, we aren't overloading anymore, right? If so, the docstring needs changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, missed that.
""" | ||
|
||
allvars = ds.data_vars.keys() | ||
|
||
# get set of variables to drop (all ds variables not in vlist) | ||
dropvars = set(allvars) - set(vlist) | ||
dropvars = set(allvars) - set(variable_list) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Functions use underscores, not variables names. This change is confusing and should be rolled-back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I said below, I'm really frustrated with this process. I would like some very clear rules on what our naming conventions are. What I feel like I'm getting instead is a steady stream of corrections with no clear rule behind them. What's more I can't find any clear rule in PEP8 either.
I think vlist is a poor variable name. It is not very descriptive. I'm okay with not having an underscore but I think we want descriptive names over short names. I am fine with variableList
as an alternative.
All that being said, this is your code so I will defer to your style choices. I just want to understand what the rules are and also to get a chance to say whether I agree with them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought at some point we agreed to use camelcase for variables and underscores for functions/scripts? In any terms, let us just agree on something. My only big preference is for longer rather than shorter/cryptic names, but either camelcase or underscores would work equally well for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I didn't understand that. I just talked with @pwolfram and we agreed that CamelCase is the right way to do this for variable names (first letter lowercase) and class names (first letter uppercase). Function and method names as well as directories and file names are all lowercase with underscores.
As a counterexample to this, let me present xarray:
def open_dataset(filename_or_obj, group=None, decode_cf=True,
mask_and_scale=True, decode_times=True,
concat_characters=True, decode_coords=True, engine=None,
chunks=None, lock=None, cache=None, drop_variables=None):
Clearly they don't mind using underscores in variable names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I personally do not care either. But I will add this 'rule' in the python standard requirement somewhere, so we can refer to it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the confluence page with all our requirements:
https://acme-climate.atlassian.net/wiki/display/OCNICE/Standard+for+MPAS+Analysis+scripts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have some type of local document on standards in this repo? It will be easier to refer to if it is local to this repository and not on confluence, although a mirror on confluence may be optimal.
@xylar, you are right that underscored variable names are used in xarray. I would personally say that is suboptimal for us, particularly since we are allowing use of camel case for variable names. I'm happy to bend to the group if you and @milenaveneziani find this too restrictive, however.
""" | ||
assert datetimes[0].year > 1678, \ | ||
'ERROR: yearoffset={}'.format(yearoffset) + \ | ||
'ERROR: year_offset={}'.format(year_offset) + \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as before. This change is not helpful, but is confusing and against PEP8 and standard conventions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I mentioned, this is not covered by PEP8. But I agree that I will change it. Not back to yearoffset
but to yearOffset
.
datetimes = [starts[i] + (ends[i] - starts[i])/2 | ||
for i in range(len(starts))] | ||
return datetimes | ||
|
||
time_var = ds[timestr] | ||
time_var = ds[time_variable_name] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use of underscore for variable, again...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I'll fix it again.
end_date='0005-03-01', | ||
year_offset=1850) | ||
self.assertEqual(len(ds.Time), 1) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really great @xylar -- testing of opening file will hopefully ensure that we have all the functionality we need and if libraries change and this breaks we should know!
02/15/2017 | ||
""" | ||
|
||
def preprocess_fn(ds): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use a partial function here, which would be cleaner because you are essentially doing this by hand. See https://docs.python.org/2/library/functools.html#functools.partial
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, that's a great tip!
# and handle it manually (adding a new variable rather than renaming | ||
# an existing one) | ||
if variable_map is not None and \ | ||
time_variable_name in variable_map: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if you need to break this line, it may be clearer (even if the line is long) to keep on one line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right that it doesn't need to be split in this case.
return ds # }}} | ||
|
||
|
||
def preprocess(ds, time_variable_name, variable_list, sel_values, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code has been largely unchanged, but it is highlighted as though it were new code. In the future, please try to keep edits to code in-place so that we have a clearer idea of how the code evolves from commit to commit, which is obviously very important for debugging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But, this is probably not a huge deal in this case, but is a huge deal in other contexts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am trying to reorganize which functions are public and which are private. This required moving several functions to the end of the file (since my understanding is that private functions should follow public ones whenever possible). GitHub makes arbitrary choices about which functions it follows and which it loses track of in such cases.
In any case, the order of the functions was not arbitrary. It is somewhat related to the history of me making a class and then undoing it per your previous requests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, sounds good. This is a downside of github and diffs in general-- they really can't sort out moves and edits in general. Seems like there is a better way to do this but no real choices for better diff evaluations...
`isel_values`) was provided in `openMultifileDataSet`, this | ||
function performs the appropriate slicing on the data set. | ||
|
||
Parameters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we know that these current doc strings will be parse-able via sphinx? I don't think we need to verify this now as long as we are willing to make tweaks to the docstrings later when we build out a automatic doc generation capability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sure they will need tweaks, but I'm doing my best to follow examples I've seen that are parsed with sphinx.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is exactly what I'd do too. Sounds great.
@pwolfram, I'm really confused about what you want in terms of CamelCase vs. underscores vs nothing at all. I didn't see anything in the PEP8 prescriptions that agrees with what you just said. I really, really, really hate variable names that are all lowercase with no indication of word separation (more than underscores even). If I had to choose, I would always use CamelCase. You've told me not to except for class names (which I'm apparently not supposed to use anyway because classes are confusing). I'm really frustrated and confused about this!!! |
@pwolfram, looking again at https://www.python.org/dev/peps/pep-0008/#prescriptive-naming-conventions, I don't see anything about variable names. |
@xylar, to summarize from our phone conversation earlier today:
When in doubt we should try to follow conventions of the pydata stack, e.g., numpy, scipy, pandas, xarray, etc because PEP8 isn't fully prescriptive. In general camel case is probably discouraged in these efforts for variable names but we are allowing it here for consistency with the MPAS convention to meld with existing python standards. |
@pwolfram, so I've looked at both numpy and xarray, and I see plenty of example where they have function arguments that use lowercase with underscores. See my earlier comment for an example from xarray (xarray.open_dataset). Here's one from
I don't see much evidence that this convention is as established as you seem to think it is. |
Fair enough @xylar, maybe it is just my own personal quirk then. Vim takes care of this issue anyway by identifying and highlighting functions so this isn't a huge problem, but I thought it could be a helpful convention. I'll go with whatever you think best, but my vote still is a clear distinction between function/method names, variable names, and class names. |
So let's stick with the convention @pwolfram outlined above. But let's be careful about referring to it as an established standard in python because I don't see much evidence that that's the case. Especially in xarray, it seems not to be the case. |
A new utility function, open_multifile_dataset, has been added to make it easier and more straightforward to open multifile datasets without having to understand the preprocessor. Several helper functions have been made non-public because they are unlikely to be used directly outside of mpas_xarray. These include: _assert_valid_datetimes, _assert_valid_selections, _ensure_list, _get_datetimes Unit tests have been updated to use open_multifile_dataset but analysis scripts have not yet been updated because they require an alternative open_multifile_dataset and a more complex preprocess function that can handle variable mapping to support the full MPAS-analysis capabilities.
This module includes: * A utility function open_multifile_dataset, similar to that for mpas_xarray, that opens an xarray data set and does the preprocessing needed for MPAS-Analysis, including mapping variable names and slicing the time axis to the appropriate range. * some non-public helper functions for performing mapping of variable names. Unit testing has also been added for generalized_reader
8e38579
to
c42fb0e
Compare
Okay, I really hope I've addressed all your concerns. @pwolfram, could you give it another look when you can? |
c42fb0e
to
ce20177
Compare
I retested this on my laptop. Seems to work, but I don't have a way of testing @milenaveneziani, would you be willing to try this on all the lanl and edison configs to make sure they still work as expected? If so, and if @pwolfram doesn't have further changes, you're free to merge. I've already squashed commits. |
sure, I will test on edison and wolf and report here. |
I've signed-off. |
Great. And my tests just finished as well, successfully: on wolf (GMPAS-240 case) and on edison (beta0 case). |
This merge moves the capability to map variable names (used in MPAS-Analysis to support multiple MPAS and ACME versions as seamlessly as possible) out of
mpas_xarray
to a newgeneralized_reader
module.Both
mpas_xarray
andgeneralized_reader
have functions,open_multifile_dataset
, that can be used in place of existing calls toxarray.open_mfdataset
. These functions hide the complications of using a preprocessing function.generalized_reader.open_multifile_dataset
adds the capabilities to map variables and to select only 'Time' values between a given start and end date.These changes have been made to allow MPAS files to be preprocessed in various ways without explicitly having to add complicating functionality to
mpas_xarray
itself. After these changes, the thought is thatmpas_xarray
can remain generic to any MPAS data set, whereasgeneralized_reader
can adapt to the specific needs for reading data sets within MPAS-Analysis.