Skip to content

Conversation

@kou
Copy link
Member

@kou kou commented Oct 5, 2021

See also: ruby/csv#117 (comment)

How to reproduce with x.csv.gz in the issue comment:

Zlib::GzipReader.open("x.csv.gz") do |rio|
  rio.gets(nil, 1024)
  while line = rio.gets(nil, 8192)
    raise line unless line.valid_encoding?
  end
end

Reported by Dimitrij Denissenko. Thanks!!!

See also: ruby/csv#117 (comment)

How to reproduce with x.csv.gz in the issue comment:

    Zlib::GzipReader.open("x.csv.gz") do |rio|
      rio.gets(nil, 1024)
      while line = rio.gets(nil, 8192)
        raise line unless line.valid_encoding?
      end
    end

Reported by Dimitrij Denissenko. Thanks!!!
@kou
Copy link
Member Author

kou commented Oct 7, 2021

@nobu What do you think about this?

@nobu
Copy link
Member

nobu commented Oct 8, 2021

If the filled size equals the reading size, gzreader_charboundary calls rb_enc_left_char_head with p == e.
Oniguruma's left_adjust_char_head functions don't expect this condition.
Although this may be a bug in Oniguruma, the following patch would workaround it.

diff --git a/ext/zlib/zlib.c b/ext/zlib/zlib.c
index f9af18f530e..8b6b802e09d 100644
--- a/ext/zlib/zlib.c
+++ b/ext/zlib/zlib.c
@@ -4198,12 +4198,15 @@ static long
 gzreader_charboundary(struct gzfile *gz, long n)
 {
     char *s = RSTRING_PTR(gz->z.buf);
-    char *e = s + ZSTREAM_BUF_FILLED(&gz->z);
-    char *p = rb_enc_left_char_head(s, s + n, e, gz->enc);
+    long f = ZSTREAM_BUF_FILLED(&gz->z);
+    int boundary = (f == n);
+    char *e = s + f;
+    char *p = rb_enc_left_char_head(s, s + n - boundary, e, gz->enc);
     long l = p - s;
     if (l < n) {
 	n = rb_enc_precise_mbclen(p, e, gz->enc);
 	if (MBCLEN_NEEDMORE_P(n)) {
+	    l += boundary;
 	    if ((l = gzfile_fill(gz, l + MBCLEN_NEEDMORE_LEN(n))) > 0) {
 		return l;
 	    }

@nobu
Copy link
Member

nobu commented Oct 8, 2021

Sorry, I didn't see that there was the patch already.
Doesn't using s+n-1 have a problem when s+n points a leading byte?
And n_bytes seems for the left character of s+n but not for the character at s+n.

@kou
Copy link
Member Author

kou commented Oct 8, 2021

Doesn't using s+n-1 have a problem when s+n points a leading byte?

Umm, I think that gzreader_charboundary() should not care whether s+n points a leading byte or not.

When n == ZSTREAM_BUF_FILLED(&gz->z):

  • A byte pointed by s+n isn't uninitialized. We should not use it.

When n < ZSTREAM_BUF_FILLED(&gz->z):

  • If s+n-1 doesn't point a leading byte, n should not be changed. (No boundary adjustment is needed.)
  • If s+n-1 point a leading byte, s+n byte is used as a part of boundary character. It doesn't care whether s+n points a leading byte or not.

And n_bytes seems for the left character of s+n but not for the character at s+n.

I think that it's intentional. I think that gzreader_charboundary() should align to the left character of s+n.

Anyway, I'm not familiar with zlib code base. If you think that your patch is right approach, could you push your patch. I'm OK with any approach that doesn't return incomplete line.

@kou
Copy link
Member Author

kou commented Oct 9, 2021

I think that it's intentional. I think that gzreader_charboundary() should align to the left character of s+n.

Fix: align to the character of s+n-1

@kou
Copy link
Member Author

kou commented Oct 11, 2021

It seems that IO#gets uses s+n-1:

https://github.com/ruby/ruby/blob/b9f7286fe95827631b11342501e471e5e6f13bbb/io.c#L3751

		pp = rb_enc_left_char_head(s, p-1, p, enc);
File.open("/tmp/x", "w") do |output|
  output.puts("あい")
end

File.open("/tmp/x") do |input|
  p input.gets(nil, 4) # This uses the 4th byte (the first byte of "い") not the 5th byte
end

@kou kou merged commit b1f182e into master Oct 15, 2021
@kou kou deleted the gets-may-return-invalid-line branch October 15, 2021 06:31
@kou kou mentioned this pull request Oct 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants