Skip to content

Move variable mapping out of mpas_xarray into new generalized_reader module #104

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Feb 16, 2017

Conversation

xylar
Copy link
Collaborator

@xylar xylar commented Feb 8, 2017

This merge moves the capability to map variable names (used in MPAS-Analysis to support multiple MPAS and ACME versions as seamlessly as possible) out of mpas_xarray to a new generalized_reader module.

Both mpas_xarray and generalized_reader have functions, open_multifile_dataset, that can be used in place of existing calls to xarray.open_mfdataset. These functions hide the complications of using a preprocessing function. generalized_reader.open_multifile_dataset adds the capabilities to map variables and to select only 'Time' values between a given start and end date.

These changes have been made to allow MPAS files to be preprocessed in various ways without explicitly having to add complicating functionality to mpas_xarray itself. After these changes, the thought is that mpas_xarray can remain generic to any MPAS data set, whereas generalized_reader can adapt to the specific needs for reading data sets within MPAS-Analysis.

@xylar
Copy link
Collaborator Author

xylar commented Feb 8, 2017

@pwolfram and @milenaveneziani, this is the last reorganization PR before I do a pull request to generalize the calendar. This builds on #102, so should not be merged until that PR has been fulfilled.

This is a relatively simple reorganization, so I didn't feel like it needed a design document. Please let me know if you disagree and would prefer to have a design document.

@xylar
Copy link
Collaborator Author

xylar commented Feb 8, 2017

Oops, I need to add CI for this and fix the mpas_xarray CI. Update coming soon...

@xylar
Copy link
Collaborator Author

xylar commented Feb 8, 2017

Okay, I added CI.

@xylar xylar force-pushed the variable_mapping_reorg branch from 066d185 to 833171d Compare February 8, 2017 19:53
@pwolfram
Copy link
Contributor

pwolfram commented Feb 8, 2017

@xylar, once we fulfill #86 do you mind doing the rebase so that this is easier to review?

@xylar
Copy link
Collaborator Author

xylar commented Feb 8, 2017

Yes, I'll definitely do that.

@xylar xylar force-pushed the variable_mapping_reorg branch 2 times, most recently from 71f7e57 to 74b7bbf Compare February 10, 2017 16:16
@xylar xylar changed the title Reorganize variable name mapping Make MpasXarray and MpasAnalysisXarray classes Feb 10, 2017
@xylar
Copy link
Collaborator Author

xylar commented Feb 10, 2017

@pwolfram, probably hold off on reviewing this until #102 has been merged. If you like, you can take a look at the new classes specifically:
https://github.com/MPAS-Dev/MPAS-Analysis/pull/104/files#diff-bd27288e6337dce2ebec4ac95896b8eb
https://github.com/MPAS-Dev/MPAS-Analysis/pull/104/files#diff-27af017bf6b93f74592ce0c6fca4cc1b

One question I have for you is how you feel about my file organization:

mpas_analysis/shared/MpasXarray/MpasXarray.py
mpas_analysis/shared/MpasAnalysisXarray/MpasAnalysisXarray.py

This organization leads to clunky imports like:

from ..shared.MpasAnalysisXarray.MpasAnalysisXarray import openMultifileDataSet

Also, would you prefer the files and directories not be named after the classes they contain, given that there are also several utility functions in these files? Would you prefer I separate the utility functions into their own files (e.g. utility.py) like we have in other subfolders?

I didn't want to bother with a design doc for this PR but this particular bit of decision making probably would have warranted a design doc. Let me know if you feel strongly that I should make one.

@milenaveneziani
Copy link
Collaborator

@xylar: this needs a rebase and an opinion from @pwolfram about the question above, right?

@xylar xylar force-pushed the variable_mapping_reorg branch from 74b7bbf to 447891a Compare February 10, 2017 17:07
@xylar
Copy link
Collaborator Author

xylar commented Feb 10, 2017

Okay, this one is rebased and ready to review.


dsFirstYear = remove_repeated_time_index(dsFirstYear)
dsFirstYear = openMultifileDataSet(filesFirstYear,
calendar,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comment, but shouldn't the two lines above have inFiles=filesFirstYear and calendar=calendar, for style consistency?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree that would be an improvement.

preprocess=lambda x: preprocess_mpas(x, yearoffset=yearOffset))
dsPreprocessed = remove_repeated_time_index(dsPreprocessed)
dsPreprocessed = openMultifileDataSet(inFilesPreprocessed,
calendar,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same minor comment as above

preprocess=lambda x: preprocess_mpas(x, yearoffset=yearOffset))
dsPreprocessed = remove_repeated_time_index(dsPreprocessed)
dsPreprocessed = openMultifileDataSet(inFilesPreprocessed,
calendar,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment

endDate=endDate, calendar=calendar)
print 'Reading files {} through {}'.format(infiles[0], infiles[-1])
print 'Reading files {} through {}'.format(inFiles[0], inFiles[-1])

plotsDirectory = buildConfigFullPath(config, 'output', 'plotsSubdirectory')
obsDirectory = config.get('seaIceObservations', 'baseDirectory')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is not the right place to do it, but I realized recently that we should have the iceCocentration and iceThickness observational files defined in the config file, rather than hard wired here...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, maybe make an issue for this.

inFilesPreprocessed,
preprocess=lambda x: preprocess_mpas(x, yearoffset=yearOffset))
dsPreprocessed = openMultifileDataSet(inFilesPreprocessed,
calendar,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as before, with a few other instances below.

Copy link
Contributor

@pwolfram pwolfram left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Big picture concern: use of classes is making internals of mpas_xarray more confusing, but at the benefit of providing a single function a user needs to understand in order to load in data. There is a trade-off here and in some ways this assumes that our user's won't be developers. This is potentially dangerous given our group's mission and composition.

We definitely need a design document for this code because it is becoming more complex and it will be something everyone will have to live with in the future. I recommend getting feedback from @vanroekel on whether he thinks this all is a step in the right direction because he is less steeped in this project than @xylar, @milenaveneziani, or myself (@pwolfram).

from ..shared.plot.plotting import timeseries_analysis_plot

from ..shared.io import NameList, StreamsFile
from ..shared.io.utility import buildConfigFullPath

from ..shared.MpasAnalysisXarray.MpasAnalysisXarray import openMultifileDataSet
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xylar, I think this organization is strange. Do we really need a folder with the same class name? Also, typically files with classes are not camel-cased, but would be something like from ...shared.mpas_analysis_xarray import OpenMultifileDataSet where I'm assuming OpenMultifileDataSet is a class. If it is a function it should be lower case with _ characters.

variableMap=variableMap,
startDate=startDate,
endDate=endDate,
yearOffset=yearOffset)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note, this changes public API so we probably want at least a very simple design doc.

@@ -14,6 +11,9 @@
from ..shared.timekeeping.utility import stringToDatetime, \
clampToNumpyDatetime64

from ..shared.MpasAnalysisXarray.MpasAnalysisXarray import openMultifileDataSet
from ..shared.MpasXarray.MpasXarray import subsetVariables
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right @xylar, the organization here is important and need further consideration. I'm thinking we probably do need a design doc

  1. highlighting directory structure
  2. highlighting public API

I don't think this needs to be a "full" design doc, but more of a simple listing of how this all should work.


Utility function
----------------
openMultifileDataSet : creates an instance of MpasAnalysisXarray, opens a data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functions typically should not be camel-cased.

@milenaveneziani
Copy link
Collaborator

I do not have specific opinions about MpasXarray and MpasAnalysisXarray.
About the file organization, I just wonder if we could have both classes in the shared/MpasXarray directory.

@pwolfram
Copy link
Contributor

@xylar and @milenaveneziani, I think we want a high-level view of the changes here to directory structure and public API. I'm thinking a minimal quasi-"design doc" may be the best way to do this because we need to make sure these changes are going to be more broadly useful than for just the three of us.

@xylar
Copy link
Collaborator Author

xylar commented Feb 10, 2017

The high level changes to the directory structure are quite simple. The following have been added:

mpas_analysis/shared/
             - MpasXarray/
                    __init__.py
                    MpasXarray.py
             -  MpasAnalysisXarray/
                     __init__.py
                    MpasAnalysisXarray.py

And the following has been removed:

mpas_analysis/shared/
             - mpas_xarray/
                    __init__.py
                    mpas_xarray.py

@milenaveneziani
Copy link
Collaborator

@pwolfram: I am not too concerned about the fact that this PR changes API, since for now we only have 4 developers. But I agree that it would be nice to hear from @vanroekel about these changes.

@xylar xylar force-pushed the variable_mapping_reorg branch from 32e5350 to cb78a8f Compare February 11, 2017 21:55
@xylar
Copy link
Collaborator Author

xylar commented Feb 11, 2017

Testing

I tested this with both LANL tests and it's bit-for-bit with #110 (which fixes a bug in develop).

@milenaveneziani
Copy link
Collaborator

@xylar: I don't have much else to say about this PR or #108. I agree with @vanroekel that the call to open the multi-file dataset is much more intuitive now. Thanks for all the work on this.

@xylar xylar force-pushed the variable_mapping_reorg branch from cb78a8f to 5e56c59 Compare February 12, 2017 09:10
@xylar xylar removed the in progress label Feb 12, 2017
@xylar xylar force-pushed the variable_mapping_reorg branch from 5e56c59 to 8e38579 Compare February 15, 2017 22:02
@xylar
Copy link
Collaborator Author

xylar commented Feb 15, 2017

@pwolfram, once you are happy with the design doc (#108), could you please re-review this PR at your earliest convenience? Much appreciated!

@xylar xylar changed the title Make MpasXarray and MpasAnalysisXarray classes Move variable mapping out of mpas_xarray into new generalized_reader module Feb 15, 2017
Copy link
Contributor

@pwolfram pwolfram left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The API for opening files is now much cleaner and simpler, which is a huge plus.

I think this PR will make it easier to load in MPAS data for new users.

A few minor, general notes:

  • We should try to keep blocks of code as close to their original locations as possible in order to make incremental changes to the code obvious between commits for debugging purposes.

  • Name changes to use _ characters are general frowned upon in python because strings with _ characters typically designate functions and methods.

variable_map=variableMap,
start_date=startDateFirstYear,
end_date=endDateFirstYear,
year_offset=yearOffset)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes the usage much clearer, especially because additional selections via xarray aren't needed now.

"""
Overloads the _preprocess function from MpasXarray to include
mapping variable names. See MpasXaray._preprocess for more
details.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xylar, we aren't overloading anymore, right? If so, the docstring needs changed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, missed that.

"""

allvars = ds.data_vars.keys()

# get set of variables to drop (all ds variables not in vlist)
dropvars = set(allvars) - set(vlist)
dropvars = set(allvars) - set(variable_list)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functions use underscores, not variables names. This change is confusing and should be rolled-back.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I said below, I'm really frustrated with this process. I would like some very clear rules on what our naming conventions are. What I feel like I'm getting instead is a steady stream of corrections with no clear rule behind them. What's more I can't find any clear rule in PEP8 either.

I think vlist is a poor variable name. It is not very descriptive. I'm okay with not having an underscore but I think we want descriptive names over short names. I am fine with variableList as an alternative.

All that being said, this is your code so I will defer to your style choices. I just want to understand what the rules are and also to get a chance to say whether I agree with them.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought at some point we agreed to use camelcase for variables and underscores for functions/scripts? In any terms, let us just agree on something. My only big preference is for longer rather than shorter/cryptic names, but either camelcase or underscores would work equally well for me.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I didn't understand that. I just talked with @pwolfram and we agreed that CamelCase is the right way to do this for variable names (first letter lowercase) and class names (first letter uppercase). Function and method names as well as directories and file names are all lowercase with underscores.

As a counterexample to this, let me present xarray:

def open_dataset(filename_or_obj, group=None, decode_cf=True,
                 mask_and_scale=True, decode_times=True,
                 concat_characters=True, decode_coords=True, engine=None,
                 chunks=None, lock=None, cache=None, drop_variables=None):

Clearly they don't mind using underscores in variable names.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I personally do not care either. But I will add this 'rule' in the python standard requirement somewhere, so we can refer to it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have some type of local document on standards in this repo? It will be easier to refer to if it is local to this repository and not on confluence, although a mirror on confluence may be optimal.

@xylar, you are right that underscored variable names are used in xarray. I would personally say that is suboptimal for us, particularly since we are allowing use of camel case for variable names. I'm happy to bend to the group if you and @milenaveneziani find this too restrictive, however.

"""
assert datetimes[0].year > 1678, \
'ERROR: yearoffset={}'.format(yearoffset) + \
'ERROR: year_offset={}'.format(year_offset) + \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as before. This change is not helpful, but is confusing and against PEP8 and standard conventions.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned, this is not covered by PEP8. But I agree that I will change it. Not back to yearoffset but to yearOffset.

datetimes = [starts[i] + (ends[i] - starts[i])/2
for i in range(len(starts))]
return datetimes

time_var = ds[timestr]
time_var = ds[time_variable_name]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use of underscore for variable, again...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I'll fix it again.

end_date='0005-03-01',
year_offset=1850)
self.assertEqual(len(ds.Time), 1)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really great @xylar -- testing of opening file will hopefully ensure that we have all the functionality we need and if libraries change and this breaks we should know!

02/15/2017
"""

def preprocess_fn(ds):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use a partial function here, which would be cleaner because you are essentially doing this by hand. See https://docs.python.org/2/library/functools.html#functools.partial

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that's a great tip!

# and handle it manually (adding a new variable rather than renaming
# an existing one)
if variable_map is not None and \
time_variable_name in variable_map:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if you need to break this line, it may be clearer (even if the line is long) to keep on one line.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right that it doesn't need to be split in this case.

return ds # }}}


def preprocess(ds, time_variable_name, variable_list, sel_values,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code has been largely unchanged, but it is highlighted as though it were new code. In the future, please try to keep edits to code in-place so that we have a clearer idea of how the code evolves from commit to commit, which is obviously very important for debugging.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But, this is probably not a huge deal in this case, but is a huge deal in other contexts.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am trying to reorganize which functions are public and which are private. This required moving several functions to the end of the file (since my understanding is that private functions should follow public ones whenever possible). GitHub makes arbitrary choices about which functions it follows and which it loses track of in such cases.

In any case, the order of the functions was not arbitrary. It is somewhat related to the history of me making a class and then undoing it per your previous requests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, sounds good. This is a downside of github and diffs in general-- they really can't sort out moves and edits in general. Seems like there is a better way to do this but no real choices for better diff evaluations...

`isel_values`) was provided in `openMultifileDataSet`, this
function performs the appropriate slicing on the data set.

Parameters
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know that these current doc strings will be parse-able via sphinx? I don't think we need to verify this now as long as we are willing to make tweaks to the docstrings later when we build out a automatic doc generation capability.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sure they will need tweaks, but I'm doing my best to follow examples I've seen that are parsed with sphinx.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is exactly what I'd do too. Sounds great.

@xylar
Copy link
Collaborator Author

xylar commented Feb 16, 2017

Name changes to use _ characters are general frowned upon in python because strings with _ characters typically designate functions and methods.

@pwolfram, I'm really confused about what you want in terms of CamelCase vs. underscores vs nothing at all. I didn't see anything in the PEP8 prescriptions that agrees with what you just said. I really, really, really hate variable names that are all lowercase with no indication of word separation (more than underscores even). If I had to choose, I would always use CamelCase. You've told me not to except for class names (which I'm apparently not supposed to use anyway because classes are confusing). I'm really frustrated and confused about this!!!

@xylar
Copy link
Collaborator Author

xylar commented Feb 16, 2017

@pwolfram, looking again at https://www.python.org/dev/peps/pep-0008/#prescriptive-naming-conventions, I don't see anything about variable names.

@pwolfram
Copy link
Contributor

@xylar, to summarize from our phone conversation earlier today:

  • variable names should be fully lower-case or camel cased with the first letter lower case, e.g., avariablename, aVariableName
  • function names should be all lower case with underscores, a_function_name
  • class names should be camel case with a capital first letter, AClassName, or MyClassName
  • module names should be lower case and may have underscores, a_module_name.py

When in doubt we should try to follow conventions of the pydata stack, e.g., numpy, scipy, pandas, xarray, etc because PEP8 isn't fully prescriptive. In general camel case is probably discouraged in these efforts for variable names but we are allowing it here for consistency with the MPAS convention to meld with existing python standards.

@xylar
Copy link
Collaborator Author

xylar commented Feb 16, 2017

@pwolfram, so I've looked at both numpy and xarray, and I see plenty of example where they have function arguments that use lowercase with underscores. See my earlier comment for an example from xarray (xarray.open_dataset). Here's one from numpy.ma.core:

def set_fill_value(a, fill_value):

I don't see much evidence that this convention is as established as you seem to think it is.

@pwolfram
Copy link
Contributor

Fair enough @xylar, maybe it is just my own personal quirk then. Vim takes care of this issue anyway by identifying and highlighting functions so this isn't a huge problem, but I thought it could be a helpful convention. I'll go with whatever you think best, but my vote still is a clear distinction between function/method names, variable names, and class names.

@xylar
Copy link
Collaborator Author

xylar commented Feb 16, 2017

So let's stick with the convention @pwolfram outlined above. But let's be careful about referring to it as an established standard in python because I don't see much evidence that that's the case. Especially in xarray, it seems not to be the case.

A new utility function, open_multifile_dataset, has been added
to make it easier and more straightforward to open multifile
datasets without having to understand the preprocessor.

Several helper functions have been made non-public because they
are unlikely to be used directly outside of mpas_xarray. These
include: _assert_valid_datetimes, _assert_valid_selections,
_ensure_list, _get_datetimes

Unit tests have been updated to use open_multifile_dataset but
analysis scripts have not yet been updated because they require
an alternative open_multifile_dataset and a more complex
preprocess function that can handle variable mapping to support
the full MPAS-analysis capabilities.
This module includes:
* A utility function open_multifile_dataset, similar to that for
  mpas_xarray, that opens an xarray data set and does the
  preprocessing needed for MPAS-Analysis, including mapping
  variable names and slicing the time axis to the appropriate
  range.

* some non-public helper functions for performing mapping of
  variable names.

Unit testing has also been added for generalized_reader
@xylar xylar force-pushed the variable_mapping_reorg branch from 8e38579 to c42fb0e Compare February 16, 2017 22:04
@xylar
Copy link
Collaborator Author

xylar commented Feb 16, 2017

Okay, I really hope I've addressed all your concerns. @pwolfram, could you give it another look when you can?

@xylar xylar force-pushed the variable_mapping_reorg branch from c42fb0e to ce20177 Compare February 16, 2017 22:18
@xylar
Copy link
Collaborator Author

xylar commented Feb 16, 2017

I retested this on my laptop. Seems to work, but I don't have a way of testing sea_ice.modelvsobs still. I'll be away from my computer for the weekend starting tomorrow so won't be able to do any further testing.

@milenaveneziani, would you be willing to try this on all the lanl and edison configs to make sure they still work as expected?

If so, and if @pwolfram doesn't have further changes, you're free to merge. I've already squashed commits.

@milenaveneziani
Copy link
Collaborator

sure, I will test on edison and wolf and report here.

@pwolfram
Copy link
Contributor

I've signed-off.

@milenaveneziani
Copy link
Collaborator

Great. And my tests just finished as well, successfully: on wolf (GMPAS-240 case) and on edison (beta0 case).

@milenaveneziani milenaveneziani merged commit ce20177 into MPAS-Dev:develop Feb 16, 2017
@xylar xylar deleted the variable_mapping_reorg branch February 23, 2017 10:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants