How to align objects in lists properly? #185

Closed
MKaras93 opened this issue Apr 10, 2020 · 14 comments
Comments

@MKaras93
Contributor

Hi. Sorry, I don't know whether this is a bug, a feature request, or just a question. I asked it some time ago on StackOverflow but didn't receive an answer, so I decided to try here.

I'm trying to compare two lists of objects (dicts in this case) with deepdiff:

from deepdiff import DeepDiff

old = [
       {'name': 'war', 'status': 'active'},
       {'name': 'drought', 'status': 'pending'}
]

new = [
       {'name': 'war', 'status': 'pending'},
       {'name': 'fire', 'status': 'pending'}]

DeepDiff(old, new)

# Result:
{'values_changed': 
  {"root[0]['status']": {'new_value': 'pending', 'old_value': 'active'},
   "root[1]['name']": {'new_value': 'fire', 'old_value': 'drought'}}}

The problem is that I need a different way of aligning objects. In my project a particular state (for example war) has a strict life cycle: it appears as pending, transforms to active, and then disappears. I want to use deepdiff to track these changes. Objects with different names are different objects and I don't want them to align with each other.

So the result I expect is:

{'values_changed':
    {"root[0]['status']": {'new_value': 'pending', 'old_value': 'active'}},
 'iterable_item_removed':
    {'root[1]': {'name': 'drought', 'status': 'pending'}},
 'iterable_item_added':
    {'root[1]': {'name': 'fire', 'status': 'pending'}}}

Is there a way I can modify my objects so that they align properly? I've tried replacing the dict with a class with a custom __eq__ method, but it didn't work. Do you have any other suggestions for making objects with the same name align only with each other?

Here is the original stack overflow question.

@testautomation

testautomation commented Apr 10, 2020

you may have to nest your objects(?)

what about this

old = [ 
        {"obj1":{'name': 'war', 'status': 'active'}}, 
        {"obj2":{'name': 'drought', 'status': 'pending'}} 
      ]

new = [ 
        {"obj1":{'name': 'war', 'status': 'pending'}}, 
        {"obj3":{'name': 'fire', 'status': 'pending'}} 
      ]  

from deepdiff import DeepDiff

DeepDiff(old, new)
>>>    {'dictionary_item_added': [root[1]['obj3']], 
        'dictionary_item_removed': [root[1]['obj2']], 
        'values_changed': {"root[0]['obj1']['status']": 
               {'new_value': 'pending', 'old_value': 'active'}}}

or

>>> new = [ 
...        {"war":{'name': 'war', 'status': 'pending'}}, 
...        {"fire":{'name': 'fire', 'status': 'pending'}} 
...      ]

>>> old = [ 
...        {"war":{'name': 'war', 'status': 'active'}}, 
...        {"drought":{'name': 'drought', 'status': 'pending'}} 
...      ]

>>> DeepDiff(old, new)                                                      
{'dictionary_item_added': [root[1]['fire']], 
 'dictionary_item_removed': [root[1]['drought']], 
 'values_changed': {"root[0]['war']['status']": 
    {'new_value': 'pending', 'old_value': 'active'}}}
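
The nesting shown above can also be done mechanically instead of by hand. The following is a minimal sketch, assuming every item carries a unique 'name'; index_by_name is a hypothetical helper written for this illustration and is not part of DeepDiff.

```python
def index_by_name(items, key="name"):
    """Turn a list of dicts into a dict keyed by each item's identifying field.

    Hypothetical helper for illustration; assumes the key is unique per item.
    """
    indexed = {}
    for item in items:
        identifier = item[key]
        if identifier in indexed:
            raise ValueError(f"duplicate {key!r}: {identifier!r}")
        indexed[identifier] = item
    return indexed


old = [
    {"name": "war", "status": "active"},
    {"name": "drought", "status": "pending"},
]
new = [
    {"name": "war", "status": "pending"},
    {"name": "fire", "status": "pending"},
]

old_by_name = index_by_name(old)
new_by_name = index_by_name(new)

# Plain set arithmetic on the keys already shows what appeared or disappeared.
removed = sorted(old_by_name.keys() - new_by_name.keys())
added = sorted(new_by_name.keys() - old_by_name.keys())
common = sorted(old_by_name.keys() & new_by_name.keys())
```

Passing old_by_name and new_by_name to DeepDiff instead of the two lists would then compare 'war' only with 'war', because dictionaries are compared key by key rather than by position.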

@seperman
Owner

Hi @MKaras93 and @testautomation .
Thanks @testautomation for posting a solution.
I'm also going to add a parameter to allow object transformations. That gives you full control over the transformations, which is especially useful for deeply nested objects.

@MKaras93
Contributor Author

Thanks, I'm currently busy with other projects but I will get back to it when I have some time and let you know if it worked ;)

@testautomation

@seperman I have a question that is related or similar to the original post.

Let's assume I have some data like the example below, where rows is a list of lists, each with one object (which can have more and deeper nested elements than shown here). Each object is identifiable by its uid.value.

How can we make sure that Deepdiff compares the correct objects with each other when we compare two of such data sets?

What I have noticed is that Deepdiff may report "false" diffs even when using the ignore_order option, because it may end up comparing the wrong objects with each other.

{
  "rows": 
    [

        [
            {
                "uid": {
                    "_type": "OBJECT_VERSION_ID",
                    "value": "1d0fdce1-f98c-4fcc-b578-6b8d1c207bfb::local.ehrbase.org::1"
                },
                "territory": {
                    "terminology_id": {
                        "value": "ISO_3166-1"
                    },
                    "code_string": "UY"
                },
                "name": {
                    "value": "Minimal",
                    "_type": "DV_TEXT"
                },
              ...
            }
        ],

        [
            {
                "uid": {
                    "_type": "OBJECT_VERSION_ID",
                    "value": "a32f9166-cc32-4bfb-ab46-792e611b876d::local.ehrbase.org::1"
                },
                "territory": {
                    "terminology_id": {
                        "value": "ISO_3166-1"
                    },
                    "code_string": "UY"
                },
                "name": {
                    "value": "Minimal",
                    "_type": "DV_TEXT"
                },
              ...
            }
        ]

    ]
}

@seperman
Owner

seperman commented May 1, 2020

Hi @testautomation
DeepDiff doesn't know what the uid field means, so it can't know that it should compare objects with the same uid with each other. When ignore_order=True, DeepDiff compares each object in the first iterable that has no exact match against every unmatched object in the second iterable to find the closest matches.

Assuming the 2 iterables contain exactly the same uids, the best thing to do is to sort both iterables by the uid field before passing them to deepdiff, then set ignore_order=False. That way the objects with the same uid are automatically compared with each other.
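
As a rough illustration of the sorting step, here is a minimal sketch that assumes the data shape from the question (rows is a list of one-element lists, each object has uid.value). The helper names uid_of and rows_sorted_by_uid and the shortened uid strings are made up for this example.

```python
def uid_of(row):
    """Each row is a one-element list whose object carries uid.value."""
    return row[0]["uid"]["value"]


def rows_sorted_by_uid(data):
    """Return a shallow copy of the data set with its rows sorted by uid."""
    return {**data, "rows": sorted(data["rows"], key=uid_of)}


# Shortened, made-up uids; the real ones look like "<uuid>::local.ehrbase.org::1".
first = {"rows": [
    [{"uid": {"value": "b::1"}, "name": {"value": "Minimal"}}],
    [{"uid": {"value": "a::1"}, "name": {"value": "Minimal"}}],
]}
second = {"rows": [
    [{"uid": {"value": "a::1"}, "name": {"value": "Minimal"}}],
    [{"uid": {"value": "b::1"}, "name": {"value": "Changed"}}],
]}

first_sorted = rows_sorted_by_uid(first)
second_sorted = rows_sorted_by_uid(second)

first_uids = [uid_of(row) for row in first_sorted["rows"]]
second_uids = [uid_of(row) for row in second_sorted["rows"]]

# The approach only lines objects up correctly if both sides hold the same uids.
same_uids = first_uids == second_uids
```

If same_uids is True, first_sorted and second_sorted can be passed to DeepDiff with ignore_order=False, and each position then holds the same uid on both sides.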

@wlad

wlad commented May 13, 2020

fyi: I'm back w/ a new Github handle @wlad. @testautomation is now an org.

@seperman
Owner

seperman commented May 13, 2020 via email

@seperman
Owner

seperman commented May 15, 2020

@wlad Please pull the new changes on the dev branch and test with your data. It should be exponentially faster when ignore_order=True (I hope). I have also added logging so you get some insight while DeepDiff is running. Pass log_frequency_in_sec=20, or any number other than zero, to get status updates during the run. Once the calculations finish, please run the new get_stats() method to get more details and post the output here.
I'm curious how long it will take to run and what stats you will get.
Thanks!

@seperman
Owner

Update: The delta module has 100% test coverage now. The tricky part is now that DeepDiff itself should run with or without Numpy. But I want to make sure in one run of pytest, the coverage report reaches 100%. That means some tests should make DeepDiff think Numpy doesn't exist. And the rest should use the Numpy module. Any ideas how to do that?

I have not tried it but if I can mutate the sys.modules to rename Numpy (np) to something else for some tests and then turn it back to be called np for other tests, then some tests can run with and other tests can run without Numpy. That way in one run of Pytest I can get to 100% coverage. But I think that's a dirty hack. Any recommendations how to proceed?
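
One way the sys.modules idea could look, as a sketch only: placing None under a module's name in sys.modules makes a later import of that name raise ImportError, and unittest.mock.patch.dict restores the original entry afterwards. The helper names module_unavailable and can_import are hypothetical, and the stdlib json module stands in for Numpy so the snippet runs anywhere.

```python
import importlib
import sys
from contextlib import contextmanager
from unittest import mock


@contextmanager
def module_unavailable(name):
    """Make importing `name` raise ImportError inside the with-block."""
    # A None entry in sys.modules tells the import system to halt the import.
    # patch.dict puts the original entry back when the block exits.
    with mock.patch.dict(sys.modules, {name: None}):
        yield


def can_import(name):
    """Report whether `name` is importable right now."""
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


before = can_import("json")
with module_unavailable("json"):
    during = can_import("json")
after = can_import("json")
```

A module that already executed its own "import numpy as np" keeps that reference, so a test would presumably also need to reload the module under test (for example with importlib.reload) inside the with-block for the fallback branch to run.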

@wlad

wlad commented May 16, 2020

Please pull the new changes on the dev branch and test with your data ...

@seperman We have to solve #193 for me to continue testing

That means some tests should make DeepDiff think Numpy doesn't exist. And the rest should use the Numpy module. Any ideas how to do that?

I have no idea how to achieve that in one run, but I just found that pytest (through the pytest-cov plugin) has an option --cov-append, which would allow two separate runs and then merge the coverage results. That can be implemented relatively easily on a CI pipeline (w/ CircleCI or Github Actions).

@dtorres-sf
Contributor

I am currently evaluating this library for diffing python dictionaries and we need a way to compare items in a sequence. In our case there are unique IDs such as:

{ 
  "Cars": [
    {
      "id": "1",
      "make": "Toyota",
      "model": "Camry"
    },
    {
      "id": "2",
      "make": "Toyota",
      "model": "Highlander"
    }
  ]
}

I would prefer not to preprocess making id the key (as described in this thread and what group_by does) because I want to be able to use the Delta functionality to track and store Deltas.

The above is a contrived example to keep things simple, but our use case is more complex. Ignoring order and using hashing would likely provide a match but at a performance cost, when the true compare function is well known (simply comparing functions).

@seperman I would like to get your thoughts on a proposed change that we could help develop if you think it makes sense:

Add an option to DeepDiff such as iterable_compare_func that takes two arguments, a and b, to compare. It would return True if the items are the same, False otherwise.

We would update def _diff_iterable(self, level, parents_ids=frozenset(), _original_type=None): to check if there was a compare function and call a new _diff_iterable_compare_func function. That function would be very similar to _diff_iterable_in_order, except it would have a nested loop and call the compare function across every item to be compared to until it found a match or exhausted the list (in which case the item would be marked as removed). Finally we would need to check the new array to see if anything was added in a similar fashion.

First, do you think this approach would work?

Secondly, are you open to merging something like this in if we develop a patch?
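
To make the proposal concrete, here is a minimal sketch of the nested-loop matching described above. It is not DeepDiff code: match_iterables and same_id are hypothetical names, and the function only reports index pairs rather than producing a diff.

```python
def match_iterables(old_items, new_items, compare_func):
    """Pair items using a user-supplied compare function.

    Returns (matched, removed, added), where matched is a list of
    (old_index, new_index) pairs and the other two are index lists.
    """
    matched, removed = [], []
    unmatched_new = list(range(len(new_items)))
    for old_index, old_item in enumerate(old_items):
        for new_index in unmatched_new:
            if compare_func(old_item, new_items[new_index]):
                matched.append((old_index, new_index))
                # Safe to mutate here because we leave the loop right away.
                unmatched_new.remove(new_index)
                break
        else:
            # The list was exhausted without a match: the item was removed.
            removed.append(old_index)
    # Whatever is left on the new side was never claimed, so it was added.
    return matched, removed, unmatched_new


def same_id(a, b):
    return a["id"] == b["id"]


old_cars = [
    {"id": "1", "make": "Toyota", "model": "Camry"},
    {"id": "2", "make": "Toyota", "model": "Highlander"},
]
new_cars = [
    {"id": "2", "make": "Toyota", "model": "Highlander Hybrid"},
    {"id": "3", "make": "Toyota", "model": "4Runner"},
]

matched, removed, added = match_iterables(old_cars, new_cars, same_id)
```

The matched pairs are the ones that would then be diffed against each other; in the worst case the nested loop calls the compare function once for every combination of old and new items.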

@seperman
Owner

seperman commented Apr 9, 2021

Hi @dtorres-sf
Thanks for the feedback. The approach you are suggesting has some overlap with how the ignore_order=True works. The hashing part is usually cheap. What is expensive is deciding what items to diff against each other to report as modified. DeepDiff does the Cartesian product of the items added or removed between 2 iterables and gets the deep distance between those items. Then picks the items with the shortest distance from each other to diff against each other as "modified" objects. The rest of the items will be reported as added or removed instead of modified.

Perhaps we could target this expensive calculation itself by allowing the user to pass a function like what you described in iterable_compare_func to override:

def _get_most_in_common_pairs_in_iterables(

For example if we have

{ 
  "Cars": [
    {
      "id": "1",
      "make": "Toyota",
      "model": "Camry"
    },
    {
      "id": "2",
      "make": "Toyota",
      "model": "Highlander"
    },
    {
      "id": "3",
      "make": "Toyota",
      "model": "4Runner"
    }
  ]
}

and

{ 
  "Cars": [
    {
      "id": "1",
      "make": "Toyota",
      "model": "Camry"
    },
    {
      "id": "2",
      "make": "Toyota",
      "model": "4Runner"
    },
    {
      "id": "3",
      "make": "Toyota",
      "model": "Highlander"
    }
  ]
}

which has only the model name changed in ids 2 and 3, then deepdiff via hashing finds out that indexes 1 and 2 have changed between the 2 iterables but index 0 has not. It passes this info to _get_most_in_common_pairs_in_iterables to decide which items should be diffed against each other. Here is where the Cartesian product and expensive work happens. A user-provided get_most_in_common_pairs_in_iterables_override function can tell deepdiff to only compare the 2 items with id 2 with each other and the 2 items with id 3 with each other. It will greatly increase the performance! Best of all, it will work with the delta object too since it is not mutating the data. :)
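
The difference in the amount of work can be sketched in plain Python. This is only an illustration of the idea, not DeepDiff's internals: pair_by_id is a hypothetical stand-in for what a user-provided override might do, and the two lists hold just the items that hashing flagged as changed in the example above.

```python
from itertools import product


def pair_by_id(old_items, new_items, key="id"):
    """Pair changed items by their id with one dictionary lookup per item."""
    new_by_id = {item[key]: item for item in new_items}
    return [
        (old, new_by_id[old[key]])
        for old in old_items
        if old[key] in new_by_id
    ]


# Items at indexes 1 and 2, the ones reported as changed by hashing.
old_changed = [
    {"id": "2", "make": "Toyota", "model": "Highlander"},
    {"id": "3", "make": "Toyota", "model": "4Runner"},
]
new_changed = [
    {"id": "2", "make": "Toyota", "model": "4Runner"},
    {"id": "3", "make": "Toyota", "model": "Highlander"},
]

# Without extra knowledge, every old item is a candidate for every new item.
candidate_pairs = list(product(old_changed, new_changed))

# With the id as the known identity, only the matching pairs remain.
id_pairs = pair_by_id(old_changed, new_changed)
```

With n changed items on each side, the Cartesian product needs a distance calculation for each of the n * n candidate pairs, while the id lookup yields at most n pairs to diff.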

@dtorres-sf
Contributor

@seperman I just hacked up a version to do what you mentioned for some quick testing. The good news is I was able to get matching to work correctly, but the bad news is that the Delta object can not be used to recreate the original object (because ignore_order is true). Here is the test I was using (note: I did not pass in a get_most_in_common_pairs_in_iterables_override because for quick testing I just hardcoded the _get_most_in_common_pairs_in_iterables to do the id check).

    def test_ignore_order_compare_ids(self):
        t1 = {
          "Cars": [
            {
              "id": "1",
              "make": "Toyota",
              "model": "Camry"
            },
            {
              "id": "2",
              "make": "Toyota",
              "model": "Highlander",
              "dealers": [
                  {
                      "id": 123,
                      "address": "123 Fake St",
                      "quantity": 50
                  },
                  {
                      "id": 125,
                      "address": "125 Fake St",
                      "quantity": 20
                  }
              ]
            },
            {
              "id": "3",
              "make": "Toyota",
              "model": "4Runner"
            },
            {
              "id": "4",
              "make": "Toyota",
              "model": "supra",
              "color": "red"
            }
          ]
        }

        t2 = {
          "Cars": [
            {
              "id": "7",
              "make": "Toyota",
              "model": "8Runner"
            },
            {
              "id": "3",
              "make": "Toyota",
              "model": "4Runner"
            },
            {
              "id": "1",
              "make": "Toyota",
              "model": "Camry"
            },
            {
              "id": "4",
              "make": "Toyota",
              "model": "Supra",
              "dealers": [
                  {
                      "id": 123,
                      "address": "123 Fake St",
                      "quantity": 50
                  },
                  {
                      "id": 125,
                      "address": "125 Fake St",
                      "quantity": 20
                  }
              ]
            }
          ]
        }



        # This is using a version of DeepDiff that matches IDs in _get_most_in_common_pairs_in_iterables
        diff = DeepDiff(t1, t2, ignore_order=True, report_repetition=True)

        # The diff works
        # Toyota Supra has been correctly matched based on the "id" even though Highlander and Supra have a closer distance

        delta = Delta(diff)
        recreated_t2 = t1 + delta
        replay_diff = DeepDiff(recreated_t2, t2)
        assert replay_diff.to_dict() == {} # XXX: This assert fails

        # Here is the value of replay_diff:
        # {'values_changed': {"root['Cars'][1]['id']": {'new_value': '3', 'old_value': '1'}, "root['Cars'][1]['model']": {'new_value': '4Runner', 'old_value': 'Camry'}, "root['Cars'][2]['id']": {'new_value': '1', 'old_value': '3'}, "root['Cars'][2]['model']": {'new_value': 'Camry', 'old_value': '4Runner'}}}

I am not sure of the best way to solve this; it would almost need a new report item such as "iterable_item_moved".

@seperman
Owner

Hello @MKaras93 @wlad and @dtorres-sf
Thanks to @dtorres-sf 's PR, this feature is now released in DeepDiff 5.5.0! https://zepworks.com/deepdiff/current/other.html#iterable-compare-func
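
For the original question in this thread, using the released option might look like the sketch below. It follows the pattern in the linked documentation (a compare function that raises CannotCompare when it cannot decide); the names same_name and diff_by_name are made up for this example, and the import guard is only there so the snippet loads when deepdiff is not installed.

```python
try:
    from deepdiff import DeepDiff
    from deepdiff.helper import CannotCompare
except ImportError:
    # Fallback so the sketch can be loaded without deepdiff installed.
    DeepDiff = None

    class CannotCompare(Exception):
        pass


def same_name(x, y, level=None):
    """Treat two items as the same object when their 'name' matches."""
    try:
        return x["name"] == y["name"]
    except Exception:
        # Signal that these items cannot be compared with this function.
        raise CannotCompare() from None


def diff_by_name(old, new):
    """Diff two lists while aligning items by 'name' (needs DeepDiff >= 5.5.0)."""
    return DeepDiff(old, new, iterable_compare_func=same_name)


old = [
    {"name": "war", "status": "active"},
    {"name": "drought", "status": "pending"},
]
new = [
    {"name": "war", "status": "pending"},
    {"name": "fire", "status": "pending"},
]
```

Calling diff_by_name(old, new) should then pair 'war' with 'war' and leave 'drought' and 'fire' to be reported as a removed and an added item respectively, which is the alignment asked for at the top of the thread.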
