Fix a bug where GzipReader#gets may return an incomplete line #32

kou · 2021-10-05T07:46:36Z

How to reproduce with x.csv.gz in the issue comment:

Zlib::GzipReader.open("x.csv.gz") do |rio|
  rio.gets(nil, 1024)
  while line = rio.gets(nil, 8192)
    raise line unless line.valid_encoding?
  end
end

Reported by Dimitrij Denissenko. Thanks!!!

See also: ruby/csv#117 (comment)
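For readers without x.csv.gz, the reproduction can be approximated with in-memory data. This is only a sketch: the payload, the limit of 8 bytes, and the variable names are assumptions for illustration, not the data from the report, and whether any chunk comes back with an invalid encoding depends on the zlib version in use.

```ruby
require "stringio"
require "zlib"

# Assumption: any multibyte UTF-8 payload will do; this is not the original x.csv.gz.
data = "\u3042\u3044\n" * 1000 # "あい\n" repeated, 3 bytes per kana
gzipped = Zlib.gzip(data)

# Pin the external encoding so the result does not depend on the locale.
reader = Zlib::GzipReader.new(StringIO.new(gzipped), external_encoding: "UTF-8")
chunks = []
begin
  # rs = nil with a limit: read up to about 8 bytes at a time, ignoring newlines.
  while (chunk = reader.gets(nil, 8))
    chunks << chunk
  end
ensure
  reader.close
end

invalid = chunks.reject(&:valid_encoding?)
puts "#{chunks.size} chunks, #{invalid.size} with invalid encoding"
```

If the limit handling is correct, every chunk should end on a character boundary, so the second number is the one to watch.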

kou · 2021-10-07T05:08:48Z

@nobu What do you think about this?

nobu · 2021-10-08T03:21:56Z

If the filled size equals the reading size, gzreader_charboundary calls rb_enc_left_char_head with p == e.
Oniguruma's left_adjust_char_head functions don't expect this condition.
Although this may be a bug in Oniguruma, the following patch would work around it.

diff --git a/ext/zlib/zlib.c b/ext/zlib/zlib.c
index f9af18f530e..8b6b802e09d 100644
--- a/ext/zlib/zlib.c
+++ b/ext/zlib/zlib.c
@@ -4198,12 +4198,15 @@ static long
 gzreader_charboundary(struct gzfile *gz, long n)
 {
     char *s = RSTRING_PTR(gz->z.buf);
-    char *e = s + ZSTREAM_BUF_FILLED(&gz->z);
-    char *p = rb_enc_left_char_head(s, s + n, e, gz->enc);
+    long f = ZSTREAM_BUF_FILLED(&gz->z);
+    int boundary = (f == n);
+    char *e = s + f;
+    char *p = rb_enc_left_char_head(s, s + n - boundary, e, gz->enc);
     long l = p - s;
     if (l < n) {
 	n = rb_enc_precise_mbclen(p, e, gz->enc);
 	if (MBCLEN_NEEDMORE_P(n)) {
+	    l += boundary;
 	    if ((l = gzfile_fill(gz, l + MBCLEN_NEEDMORE_LEN(n))) > 0) {
 		return l;
 	    }

nobu · 2021-10-08T11:20:48Z

Sorry, I didn't see that there was already a patch.
Doesn't using s+n-1 have a problem when s+n points to a leading byte?
Also, n_bytes seems to be for the character to the left of s+n, not for the character at s+n.

kou · 2021-10-08T21:24:42Z

> Doesn't using s+n-1 have a problem when s+n points to a leading byte?

Umm, I think that gzreader_charboundary() should not care whether s+n points to a leading byte or not.

When n == ZSTREAM_BUF_FILLED(&gz->z):

The byte pointed to by s+n isn't initialized. We should not use it.

When n < ZSTREAM_BUF_FILLED(&gz->z):

If s+n-1 doesn't point to a leading byte, n should not be changed. (No boundary adjustment is needed.)
If s+n-1 points to a leading byte, the byte at s+n is used as part of the boundary character. It doesn't matter whether s+n points to a leading byte or not.

> And n_bytes seems for the left character of s+n but not for the character at s+n.

I think that it's intentional. I think that gzreader_charboundary() should align to the character to the left of s+n.

Anyway, I'm not familiar with the zlib code base. If you think that your patch is the right approach, could you push it? I'm OK with any approach that doesn't return an incomplete line.

kou · 2021-10-09T00:33:29Z

> I think that it's intentional. I think that gzreader_charboundary() should align to the left character of s+n.

Correction: it should align to the character at s+n-1.

kou · 2021-10-11T02:45:28Z

It seems that IO#gets uses s+n-1:

https://github.com/ruby/ruby/blob/b9f7286fe95827631b11342501e471e5e6f13bbb/io.c#L3751

		pp = rb_enc_left_char_head(s, p-1, p, enc);

File.open("/tmp/x", "w") do |output|
  output.puts("あい")
end

File.open("/tmp/x") do |input|
  p input.gets(nil, 4) # This uses the 4th byte (the first byte of "い") not the 5th byte
end
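The s+n-1 rule discussed in this thread can be sketched in plain Ruby. This is an illustration only, not the C code in ext/zlib/zlib.c: the helper names are hypothetical, it assumes valid UTF-8 input, and it assumes the whole character is already in the buffer (so there is no equivalent of the "need more bytes" refill step).

```ruby
# Hypothetical helper (UTF-8 only): index of the first byte of the character
# that contains the byte at +index+. Continuation bytes look like 10xxxxxx.
def left_char_head(bytes, index)
  index -= 1 while index > 0 && (bytes.getbyte(index) & 0xC0) == 0x80
  index
end

# Hypothetical helper: byte length of a UTF-8 character, given its leading byte.
def char_length(lead)
  if lead < 0x80 then 1
  elsif lead >= 0xF0 then 4
  elsif lead >= 0xE0 then 3
  else 2
  end
end

# Extend a requested length of n bytes so that it ends on a character boundary.
# The character examined is the one containing byte n-1 (the last requested
# byte), never byte n, which may lie outside the filled part of the buffer.
def align_to_char_boundary(bytes, n)
  head = left_char_head(bytes, n - 1)
  head + char_length(bytes.getbyte(head))
end

s = "\u3042\u3044\n".b # "あい\n": 3 + 3 + 1 bytes
p align_to_char_boundary(s, 3) # => 3 (byte 3 ends "あ"; nothing to adjust)
p align_to_char_boundary(s, 4) # => 6 (byte 4 starts "い"; extend to its end)
p align_to_char_boundary(s, 7) # => 7 (byte 7 is "\n")
```

With a limit of 4, the last requested byte is the leading byte of "い", so the length grows to 6 and the character is returned whole, which matches the IO#gets behavior shown above.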

kou mentioned this pull request Oct 5, 2021

CSV::Parser::InvalidEncoding may be raised when a GzipReader is passed to CSV.new ruby/csv#117

Closed

nobu reviewed Oct 11, 2021

kou merged commit b1f182e into master Oct 15, 2021

kou deleted the gets-may-return-invalid-line branch October 15, 2021 06:31

kou mentioned this pull request Oct 15, 2021

2.2.0 release #33

Closed
