-
Notifications
You must be signed in to change notification settings - Fork 262
NF nib-diff #617
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NF nib-diff #617
Conversation
…two different headers
…nto nf-nib-diff
…ent, implemented yarik suggestions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments around but I think we need to step back a little and now that you have coded this logic for awhile try to take a fresh look, since I think code could be greatly improved to avoid
- ambiguity (all the try/except for Attribute/Value Errors)
- redundancy (comment about diff_headers and then diff_header_fields)
and also make it shorter, cleaner, and pretty ;-)
Let's start with the "core" function diff_values
. The task -- make it cuter
- it should accept not just 2 values but any number of them (
diff_values(*values)
) - you build your logic using
any
/all
andmap
or generators or list comprehensions, while converting following human language statements into python- value0 is the first value
- if any of the types different from the type of value0 - there is difference
- if value0 is an array, then all should be equal to it in every element of an array
- if value0 is not an array, then all should be equal to it
- If tests pass now, they should pass after this refactoring
After you are done with that, try to express in Python the ultimate diff_headers which would
- get a set of all header fields across all dicts provided to it
- return a dict mapping field to a list of values if
diff_values
given those values returned True
Then given such a dict you would do all the string conversions only for print out, not for logic of comparison (like done now in diff_header_fields)
Makes sense? ;)
nibabel/cmdline/diff.py
Outdated
keyed_inputs.append(list(field_value)) | ||
|
||
except UnboundLocalError: | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesn't look "kosher" -- try to make code explicit to not have some undefined local variables used
nibabel/cmdline/diff.py
Outdated
continue | ||
|
||
for i in range(len(keyed_inputs)): | ||
keyed_inputs[i] = str(keyed_inputs[i]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here and in general in Python try to avoid explicit indexing. Eg. here it is just a list comprehension (or even a map
):
keyed_inputs = [str(x) for x in keyed_inputs]
or
# list() because map in PY3 is a generator
keyed_inputs = list(map(str, keyed_inputs)))
nibabel/cmdline/diff.py
Outdated
|
||
if data_diff: | ||
break | ||
except ValueError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no idea ValueError could/should happen here (no comment etc) but I do not think it should be considered as "there is no difference"
nibabel/cmdline/diff.py
Outdated
return headers | ||
|
||
|
||
def diff_header_fields(header_field, files): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since it is just a single header_field
, better rename function to correspond ;)
Also you also have get_headers_diff
which is just the one which goes over multiple header fields.
So, please make names and signature consistent between them and more descriptive, also while renaming files
into something closer to the current state of affairs, e.g. diff_header_fields(headers, fields)
and diff_header_field(headers, field)
?
adjust also docstring to correspond. e.g.
headers: list of dict
Header records to be compared
field: str
Name of the header field to compare
In PyCharm you could easily rename functions using Refactor -> Rename function.
But then I also spotted diff_headers
above ;-) so it seems that there is duplicate functionality:
- diff_headers now just returns header fields which differ
- then you go again and get the values for those fields
Why not to make get_diff_headers
do what diff_headers
does now?
what you need is just to go through dicts and see which fields differ (or present in one but absent in another) and return a dict with those fields which differ.
nibabel/cmdline/diff.py
Outdated
other_field = other_field.tolist() | ||
|
||
except AttributeError: | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why continue? if one misses smth when others don't -- they differ
Travis should be clear after this commit. AppVeyor has problems relating to its Python environments - some tests that weren't touched by this pull request started crapping out. Otherwise, aside from maybe a few more tests, hopefully this function is ready to be pulled into NiBabel. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's polish it up, and hopefully @matthew-brett could chime in as well!
Then IMHO could be merged and improved upon later when/if needed/desired
nibabel/cmdline/tests/test_utils.py
Outdated
from os.path import (dirname, join as pjoin, abspath) | ||
|
||
|
||
DATA_PATH = abspath(pjoin(dirname(__file__), '../../tests/data')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is already one defined, just use
from nibabel.testing import data_path
note that in above '/' within the path might be *nix specific and might (?) not work on Windows. That is why we use all the path.join
nibabel/cmdline/tests/test_utils.py
Outdated
assert_equal(actual_difference["qoffset_z"], expected_difference["qoffset_z"]) | ||
np.testing.assert_array_equal(actual_difference["srow_x"], expected_difference["srow_x"]) | ||
np.testing.assert_array_equal(actual_difference["srow_y"], expected_difference["srow_y"]) | ||
np.testing.assert_array_equal(actual_difference["srow_z"], expected_difference["srow_z"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
although I like when tests are "explicit" like this, I hate code duplication more ;-) and apparently (thanks google) we could just use np.testing.assert_array_equal(actual_difference, expected_difference)
for f in ('example4d.nii.gz', 'example4d.nii.gz')] | ||
code, stdout, stderr = run_command(['nib-diff'] + fnames2, check_code=False) | ||
assert_equal(stdout, "These files are identical.") | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about comparing different files now? at least a basic test using those nibabel/tests/data/spmT_0001*.nii.gz ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't you think that testing that get_headers_diff works is enough? My concern is that testing text output here has traditionally really messed up with Travis and AppVeyor because their Python environments for some reason print fewer significant figures than my computer does. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- in principle, you don't have to check output for 1-to-1 matching, you could check for the important segments to be there in the output (that fields you expect are listed, etc)
- ideally we should have output consistent across machines.
- as I've outlined in a separate email - if all the printing is moved to another function, then may be it should be tested in a dedicated to that function test. Here we just would need to check that script is indeed running and exiting as expected when we give the identical or different files, with only minimal testing of output (to verify that it is there)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DONE
nibabel/cmdline/diff.py
Outdated
print("{:<15}".format('Field'), end="") | ||
|
||
for f in files: | ||
output = "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is output
simply the file name or something more complicated? If not, using os.path.basename
might be easier, readable and generic across OSes to obtain the filename?
nibabel/cmdline/diff.py
Outdated
print() | ||
|
||
for key, value in diff.items(): | ||
print("{:<15}".format(key), end="") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest capturing string lengths (15, 55) into variables, as hardcoding them is not a good idea, and having them in variables allows us to adapt for different environments. For example, for terminals with smaller widths etc..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And eventually the table Formatter which was used by nib-ls could be used to make it all dynamic without hardcoring much... And then eventually pyout by @kyleam could be used for the most flexible tabulator ;-)
But we thought here to get it to some viable state first and then improve upon
nibabel/cmdline/diff.py
Outdated
item_str = str(item) | ||
# Value might start/end with some invisible spacing characters so we | ||
# would "condition" it on both ends a bit | ||
item_str = re.sub('^[ \t]+', '<', item_str) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how is this different from using str.strip()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Place those indicators only if anything was replaced at any end
nibabel/cmdline/diff.py
Outdated
for key, value in diff.items(): | ||
print("{:<15}".format(key), end="") | ||
|
||
for item in value: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another idea that might be interesting to consider is whether layout differences vertically, which can scale better with number of images. Although horizontal layout looks nice for 2/3 images, which can be the default, but for a 10 images, it might be easier print each differing field into a separate block of values from all images.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, but most frequent case is actually two files and many fields differing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- do not catch SystemExit
- remove test specific handling, just use
assert_raises
outside
nibabel/cmdline/diff.py
Outdated
|
||
except SystemExit: | ||
opts = None | ||
files = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you expect SystemExit to be raised anywhere in above code... by --help
etc? I do not think that this function should continue going forward if that happens... would also cause breakage (e.g. files here become None but later you test len(files)
.
Anyways -- I do not think you want to catch anything at this point
nibabel/cmdline/diff.py
Outdated
sys.stdout.write(display_diff(files, diff)) | ||
raise SystemExit(1) | ||
else: | ||
return diff # this functionality specifically for testing main |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ideally there should be no test specific code in the main code.
OK so a (-0.004%) decrease in coverage is probably not going to mean the end of the world. nib-diff in its current stage is at a good starting point for functionality in NiBabel and I think is ready for merge. @yarikoptic Moving forwards, there are several functionalities to implement that could improve it:
... and more! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
almost there! but see comments ~~~above~~~ below and:
- old comment "what about comparing different files now? at least a basic test using those nibabel/tests/data/spmT_0001*.nii.gz ?" if you aren't using them -- remove.
- also add a test where you have more than 2 files, I don't think you have any. E.g. one with the same file specified 3 times (no changes), and then one where 3rd is different. See that you get correct result overall
that "dropped" test coverage even though small, shows that some critical pieces were never ran, and that there is a good chance they could not work correctly/as expected... that would lead to even more frustration if we had to fix up for it later . so let's push just a bit more in tuning/extending tests
nibabel/cmdline/diff.py
Outdated
return difference | ||
|
||
|
||
def get_data_md5sums(files): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know that you would hate continued review @chrispycheng but, by seeing the function name I got confused why below it returns an empty list if there is only one unique value. So, please
- rename to e.g.
get_data_diff
- add a docstring describing the output like you nicely did for the above
get_header_diff
nibabel/cmdline/diff.py
Outdated
if np.any(value0 != value): # special test for ndarray | ||
return True | ||
else: | ||
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can just return np.any(...)
here. no need for all the return True/False dance since np.any
returns a single bool here
] | ||
|
||
if len(set(md5sums)) == 1: | ||
return [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that might be one contributor to your .000?% coverage miss (do you have codecov extension to the browser installed to see what lines aren't covered?). Apparently there is no test which verifies that you do get empty list in output whenever two files have the same data? you could make a dedicated test for this function and feed it
nibabel/cmdline/diff.py
Outdated
|
||
file_headers = [nib.load(f).header for f in files] | ||
|
||
if opts: # will almost always have a header field |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you don't need this if opts
(so no else
is needed either) since they always will be there!
nibabel/cmdline/diff.py
Outdated
# TODO: header fields might vary across file types, thus prior sensing would be needed | ||
header_fields = file_headers[0].keys() | ||
else: | ||
header_fields = opts.header_fields.split(',') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
apparently has no test case to test this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you test an intermediary if/else statement within a function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
invoke the command/function with options where you specify your list of fields to be used for comparison
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So that would be the command parser itself? I don't think any other NiBabel function has a test for that, maybe that could be a separate PR?
raise SystemExit(1) | ||
|
||
else: | ||
out.write("These files are identical.\n") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and again no test to test this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- simple/integration test: provide two identical files for comparison!
- "advanced"/unit/- logic- test: mock out the diff calls so they return no difference, and then invoke the function to get here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my latest commit I added the simple/integration test. I'm not sure how I would go about the advanced logic test though?
I'd like to state for the record that the AppVeyor fail was because of some other file! It wasn't me! |
@matthew-brett @effigies any final remarks to let this PR finally get in to mature further in the real world of nibabel? ;) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a quick pass, and no serious objections. A couple quibbly questions/comments.
import re | ||
import sys | ||
from collections import OrderedDict | ||
from optparse import OptionParser, Option |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't hold this PR up for this, but just FYI optparse
has been deprecated, and argparse
is the supported argument parser.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah... we should convert all the cmdline tools which still use optparse in some one PR ;)
md5sums = [ | ||
hashlib.md5(np.ascontiguousarray(nib.load(f).get_data(), dtype=np.float32)).hexdigest() | ||
for f in files | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason you're using MD5 and not something more collision-resistant such as SHA256?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since MD5 is sufficient and shorter. It is unlikely that in our lifetime we would see any user who would run into a collision in this use case ;-)
|
||
if diff: | ||
out.write(display_diff(files, diff)) | ||
raise SystemExit(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this now preferred to sys.exit
(or just returning a return code and having the entry point exit with that code)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it is where python violated its own Zen:
In [4]: import this
The Zen of Python, by Tim Peters
...
There should be one-- and preferably only one --obvious way to do it.
...
In [2]: sys.exit?
Docstring:
exit([status])
Exit the interpreter by raising SystemExit(status).
so seems to be exactly the same thing... but it is easier to explain that "we will raise exception and then test in the test that it was raised" ;)
Okay. Since nothing's likely to depend on two separate versions of 👍 to merge. |
thanks! Let's merge then! |
Initial commits are duplicated with PR: RF nib-ls move which has since been merged
Example:
TODOs:
Future diff (to be done in separate PR)