Skip to content

bpo-24665: Add CJK support in textwrap by default. #5649

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

width = 0
pos = 0
for char in text:
width += 2 if east_asian_width(char) in {'F', 'W'} else 1

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why inlining _len(), I don't have seen performance issues and it's less readable (less pythonic)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you know where to break once you have the whole value?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello I was reading too fast. In my version there's the _wide boolean function.
So here width += _wide(char) + 1

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And _len is just return sum(2 if _wide(char) else 1 for char in text) with no performance issues

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More pythonic, DRY.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I won't bet on the performances, calling _len from _slice adds two functions calls per character (one to _len and one to sum). In one case I'm doing it on a character, and in the other case in a whole string. Yes I could also factorize this ternary to a third function, but I don't find it more readable.

width += 2 if east_asian_width(char) in {'F', 'W'} else 1
if width > index:
break
pos += 1

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why note use enumerate(), it's less readable (less pythonic)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because it does not works with enumerate as the last incrementation were not done. I do not remember which case exactly but if you run the unit test you'll spot it easily, it was failing, I'll do if needed but can't right now.

Copy link

@fgallaire fgallaire Feb 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interested in that, the code was strongly tested for txt2tags and don't catch this problem.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your initial implementation was working thanks to your if cjk_len(text) <= index: return text, '' fixing the special case explicitly, I may have tried to avoid it.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Explicit is better than implicit." but the more important is that both solutions are correct.

@fgallaire
Copy link

Don't see my author credit

@fgallaire
Copy link

And you miss the if self.width <= 0: bug fixed in #89

@JulienPalard JulienPalard force-pushed the textwrap-cjk branch 2 times, most recently from 45fd84d to 4623375 Compare March 6, 2018 22:42
@JulienPalard
Copy link
Member Author

And you miss the if self.width <= 0: bug fixed in #89

You're right! And trying to split a wide character yield to an infinite loop.

Don't see my author credit

Gladly fixed and co-authored you.

@fgallaire
Copy link

Thanks @JulienPalard, I'm so happy ! I had almost lost hope to see this issue fixed.

if self.width <= 0:
raise ValueError("invalid width %r (must be > 0)" % self.width)
elif self.width == 1 and _width(text) > len(text):
raise ValueError("invalid width 1 (must be > 1 when CJK chars)")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have done a more complex solution:

elif self.width == 1 and (sum(self._width(chunk) for chunk in chunks) >
                              sum(len(chunk) for chunk in chunks)):

It throws the exception earlier, but it's probably not absolutely necessary.

@JulienPalard JulienPalard requested a review from vstinner March 28, 2018 21:16
Copy link
Member

@terryjreedy terryjreedy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change I request is that this be closed because it is conceptually wrong. Textwrap works in terms of abstract 'characters' (codepoint), not physical units. I will explain this on the issue.

Aside from that, 2 is the wrong number to add, as 'double-width' characters are not actually twice as wide as fixed-pitch Ascii chars of the same height. See the issue for this as well.

@bedevere-bot
Copy link

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

@methane methane closed this Jul 11, 2018
@JulienPalard JulienPalard deleted the textwrap-cjk branch June 16, 2019 14:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants