Skip to content

Add new consider-using-generator checker Issue #3165 #3309

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 22 commits into from
Feb 20, 2021

Conversation

ikraduya
Copy link
Contributor

@ikraduya ikraduya commented Dec 17, 2019

Steps

  • Add yourself to CONTRIBUTORS if you are a new contributor.
  • Add a ChangeLog entry describing what your PR does.
  • If it's a new feature or an important bug fix, add a What's New entry in doc/whatsnew/<current release.rst>.
  • Write a good description on what the PR does.

Description

[Quoting from Issue #3165]
There are many cases of comprehensions inside of any or all calls that are unnecessary and should be replaced by a generator.
For example:

some_list = list(range(1000))
test_old = any([x % 7 == 0 for x in some_list])

Instead, a generator would be less code, and way faster:

test_new = any(x % 7 == 0 for x in some_list)

Speed comparisons:

64 µs ± 1.67 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) # old
447 ns ± 3.48 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) # new

any and all are functions that profit from that because the loop is breaking and can short cut; the remaining elements are not even looped over if the test returns a True in the any function or a False in the all function.

This PR creates new consider-using-generator checker written in pylint/checkers/refactoring.py and add unit test consider_using_generator.py and consider_using_generator.txt

Type of Changes

Type
🐛 Bug fix
✨ New feature
🔨 Refactoring
📜 Docs

Related Issue

Closes #3165

@coveralls
Copy link

Coverage Status

Coverage increased (+0.09%) to 89.812% when pulling 25a92f1 on ikraduya:Issue3165 into dc83a86 on PyCQA:master.

1 similar comment
@coveralls
Copy link

Coverage Status

Coverage increased (+0.09%) to 89.812% when pulling 25a92f1 on ikraduya:Issue3165 into dc83a86 on PyCQA:master.

@coveralls
Copy link

coveralls commented Dec 17, 2019

Coverage Status

Coverage increased (+0.007%) to 91.436% when pulling 522ca9b on ikraduya:Issue3165 into 570e655 on PyCQA:master.

Copy link
Contributor

@AWhetter AWhetter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Thanks for the contribution.
The only important change is the one about the message description. The code is all good though.

Copy link
Member

@Pierre-Sassoulas Pierre-Sassoulas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice new checker, I had to talk about this problem in a code review recently it would have saved some time.

@Pierre-Sassoulas
Copy link
Member

@ikraduya can you rebase on the latest master and take Ashley's remark into account ? :) We're going to release 2.7 soon, now is the time if you want this feature to make it ;)

@ikraduya
Copy link
Contributor Author

@ikraduya can you rebase on the latest master and take Ashley's remark into account ? :) We're going to release 2.7 soon, now is the time if you want this feature to make it ;)

Sure, thanks for reminding

@ikraduya
Copy link
Contributor Author

ikraduya commented Feb 16, 2021

I have rebased on the latest master and edit the message description. Please review @Pierre-Sassoulas
Thanks

Copy link
Member

@Pierre-Sassoulas Pierre-Sassoulas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just have a two minor comments before merging, thank you for reacting so fast ;) !

@@ -684,18 +708,40 @@ def _check_consider_using_comprehension_constructor(self, node):
message_name = "consider-using-set-comprehension"
self.add_message(message_name, node=node)

def _check_consider_using_generator(self, node):
checked_call = ["any", "all"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @ngie-eign comment was relevant here #3309 (comment)
What do you think ?

Copy link
Contributor Author

@ikraduya ikraduya Feb 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure, I have tested those.
The result is slower with the generator.

list:

python -m timeit "some_list = list(range(1000));b = list([x % 7 == 0 for x in some_list]);"
# output: 10000 loops, best of 3: 79.4 usec per loop

python -m timeit "some_list = list(range(1000));b = list(x % 7 == 0 for x in some_list);"
# output: 10000 loops, best of 3: 99.4 usec per loop

set:

python -m timeit "some_list = list(range(1000));b = set([x % 7 == 0 for x in some_list]);"
# output: 10000 loops, best of 3: 89.1 usec per loop

python -m timeit "some_list = list(range(1000));b = set(x % 7 == 0 for x in some_list);"
# output: 10000 loops, best of 3: 104 usec per loop

tuple:

python -m timeit "some_list = list(range(1000));b = tuple([x % 7 == 0 for x in some_list]);"
# output: 10000 loops, best of 3: 80.3 usec per loop

python -m timeit "some_list = list(range(1000));b = tuple(x % 7 == 0 for x in some_list);"
# output: 10000 loops, best of 3: 99.6 usec per loop

compared to any and all

any:

python -m timeit "some_list = list(range(1000));b = any([x % 7 == 0 for x in some_list]);"
# output: 10000 loops, best of 3: 74.2 usec per loop

python -m timeit "some_list = list(range(1000));b = any(x % 7 == 0 for x in some_list);"
# output: 100000 loops, best of 3: 17 usec per loop

all:

python -m timeit "some_list = list(range(1000));b = all([x % 7 == 0 for x in some_list]);"
# output: 10000 loops, best of 3: 84.6 usec per loop

python -m timeit "some_list = list(range(1000));b = all(x % 7 == 0 for x in some_list);"
# output: 100000 loops, best of 3: 16.5 usec per loop

When I checked the python docs, any and all take iterable as the parameter. But list, set, and tuple takes [iterable]. And there isn't any other function that takes iterable. So this 'consider-using-generator' still only relevant with any and all

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok it makes senses.

Suggested change
checked_call = ["any", "all"]
# We only check 'any' and 'all' because for list, set, and tuple a generator performs worse
# See https://github.com/PyCQA/pylint/pull/3309#discussion_r576683109
checked_call = ["any", "all"]

Copy link
Contributor

@ngie-eign ngie-eign Feb 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the timing concerns, but when it boils down to it, it's the unnecessary data copies that I'm worried about with my original comment. With particularly large sequences or iterables, creating unnecessary copies really adds up.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ikraduya: out of curiosity, what version of python was your output above tested on?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think here we could create two separate messages. Actually for all or any there is nothing to 'consider', changing for a generator is a no-brainer, you can cut the execution tree and exit directly at the first element. Except in the worst possible case it's an easy performance win. So we could have a stronger use-generator-instead message for this. Then for other container, maybe for really long lists or sets it would be worth it to use generator, but the question should be considered on a case by case basis. So the user should consider it and we could keep a consider-using-generator for those. What do you think ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ngie-eign I am using python 3.6.9 version.

I have tested @ngie-eign codes using timeit.

tuple(list(range(100)))
tuple(range(100))
set(list(range(100)))
set(range(100))

tuple(list(range(1000000)))
tuple(range(1000000))
set(list(range(1000000)))
set(range(1000000))

output:

1000000 loops, best of 3: 1.44 usec per loop
1000000 loops, best of 3: 0.927 usec per loop
100000 loops, best of 3: 2.72 usec per loop
1000000 loops, best of 3: 2.28 usec per loop


10 loops, best of 3: 45.9 msec per loop
10 loops, best of 3: 36.3 msec per loop
10 loops, best of 3: 93.9 msec per loop
10 loops, best of 3: 64.9 msec per loop

Clearly using the generator wins over the timing because the conversion is unnecessary. But the codes are not using list comprehension. I think we can create another case to check such unnecessary copies code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also profiled the memory for tuple and set when using list comprehension.
Codes:

a = set([x % 7 for x in range(100)])
a = set(x % 7 for x in range(100))
a = set([x % 7 for x in range(1000000)])
a = set(x % 7 for x in range(1000000))

a = tuple([x % 7 for x in range(100)])
a = tuple(x % 7 for x in range(100))
a = tuple([x % 7 for x in range(1000000)])
a = tuple(x % 7 for x in range(1000000))

Results:

Maximum resident set size (kbytes): 9584
Maximum resident set size (kbytes): 9556
Maximum resident set size (kbytes): 17020
Maximum resident set size (kbytes): 9556

Maximum resident set size (kbytes): 9584
Maximum resident set size (kbytes): 9572
Maximum resident set size (kbytes): 24936
Maximum resident set size (kbytes): 17816

It is clear that a very long list/set will get the benefit from using generator :).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Pierre-Sassoulas : I think your suggestion about creating additional suggestions for using generators judiciously with other containers makes a lot of sense (seems like a good candidate for an I class message).

I also think removing the “consider” part in the checker makes a lot of sense too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
checked_call = ["any", "all"]
checked_call = ["any", "all", "set", "list", "tuple"]

@Pierre-Sassoulas Pierre-Sassoulas added this to the 2.7.0 milestone Feb 16, 2021
@Pierre-Sassoulas Pierre-Sassoulas added Checkers Related to a checker Enhancement ✨ Improvement to a component and removed Work in progress labels Feb 16, 2021
@Pierre-Sassoulas Pierre-Sassoulas self-assigned this Feb 16, 2021
Remove item from 2.5.rst
Change `consider-using-generator` message ID
@ikraduya
Copy link
Contributor Author

There exist consider-using-set-comprehension check when the code is set([0 % 7 for i in range(10)])
I think we shouldn't include set in consider-using-generator` right?

@Pierre-Sassoulas Pierre-Sassoulas merged commit 45c8245 into pylint-dev:master Feb 20, 2021
@Pierre-Sassoulas
Copy link
Member

Merging despite the change request review, because @AWhetter review was about the message content and it was fixed.

Thanks a lot for your work @ikraduya, this is a great checkers, will make a lot of codebase faster ! And congratulation on becoming a pylint contributor !

@tweakimp
Copy link

Thank you for implementing this! :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Checkers Related to a checker Enhancement ✨ Improvement to a component
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enhancement: Add a [consider-using-generator]-checker in any or all calls
8 participants