GzipReader: gets sometimes returns incomplete multi-byte characters

Calling `gets` with a limit can, in certain cases, return a string that ends in the middle of a multi-byte character, leaving the final character incomplete.

### To reproduce

Example file: [example.txt.gz](https://github.com/user-attachments/files/18299872/example.txt.gz) (uncompressed version [example.txt](https://github.com/user-attachments/files/18299871/example.txt))

```ruby
require 'zlib'
# File example.txt.gz is attached to the issue
test_file = Zlib::GzipReader.open("example.txt.gz")
test_file.gets(3125)
test_file.gets(3125)
test_file.gets(3125)
test_file.gets(2513)
test_file.gets(2)
test_file.gets(100)
test_file.gets(8017)
test_file.gets(2)
test_file.gets(4579)
test_file.gets(3127)
test_file.gets(3127)
test_file.gets(3126)
test_file.gets(1)
result = test_file.gets(14020)
result.valid_encoding?
# => false
# Printing the result shows that it ends with two bytes of a three-byte character; the final byte is missing
# ...南部から、つづ\xE3\x81"

# Reading next chunk with gets and combining strings results in a string that is once again complete
next_chunk = test_file.gets(100)
(result + next_chunk).valid_encoding?
# => true
```

This issue can occur when reading compressed CSV files, because the CSV parser reads a chunk from the IO and then checks whether the returned string has a valid encoding: https://github.com/ruby/csv/blob/f33c613ca94013db0667c4311311c98b6c20d5ea/lib/csv/parser.rb#L310-L312
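Until this is fixed, a caller could top up a chunk that ends mid-character before validating it. The following is only a minimal sketch of that idea: `complete_trailing_character` and its `max_extra_bytes` parameter are hypothetical names, not part of zlib or csv, and it assumes the IO's `read(1)` returns raw bytes, or `nil` at end of file.

```ruby
require 'stringio'

# Hypothetical helper (not part of zlib or csv): if `chunk` ends in the
# middle of a multi-byte character, read the missing bytes one at a time.
def complete_trailing_character(chunk, io, max_extra_bytes: 3)
  encoding = chunk.encoding
  bytes = chunk.b # binary copy, so raw bytes can be appended safely
  max_extra_bytes.times do
    break if bytes.dup.force_encoding(encoding).valid_encoding?
    extra = io.read(1)
    break if extra.nil? || extra.empty?
    bytes << extra
  end
  bytes.force_encoding(encoding)
end

# Stand-in for the broken chunk: the last character lacks its final byte.
text = "\u3064\u3065\u304D" # 9 bytes in UTF-8
broken = text.b.byteslice(0, 8).force_encoding(Encoding::UTF_8)
rest = StringIO.new(text.b.byteslice(8, 1))
fixed = complete_trailing_character(broken, rest)
```

With a real reader, `io` would be the `Zlib::GzipReader` itself. The helper cannot repair data that is invalid for other reasons; it simply stops after `max_extra_bytes` attempts.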

### Possible cause

This issue seems to occur when `gets` is called with a limit that is 1 byte smaller than the contents of the currently loaded buffer.

In `gzreader_charboundary`, `rb_enc_precise_mbclen` is used to determine how many bytes are missing at the end of the buffer, but `gzfile_fill` is then called with `n + MBCLEN_NEEDMORE_LEN(n_bytes)`. Because `n` can be 1 byte smaller than the currently buffered size, the requested length does not exceed what is already buffered, and as a result no additional data is read.

https://github.com/ruby/zlib/blob/master/ext/zlib/zlib.c#L4309
```c
static long
gzreader_charboundary(struct gzfile *gz, long n)
{
    // EX: In case of the example above:
    // EX: n = 14020
    // EX: ZSTREAM_BUF_FILLED(&gz->z) = 14021
    // EX: last four bytes of s:
    // EX: ... |  0xA5 |  0xE3 |  0x81 |  0x00 | (byte)
    // EX: ... | 14019 | 14020 | 14021 | 14022 | (index)
    char *s = RSTRING_PTR(gz->z.buf);
    char *e = s + ZSTREAM_BUF_FILLED(&gz->z);
    // EX: p correctly points to 0xE3 (byte 14020 of s)
    char *p = rb_enc_left_char_head(s, s + n - 1, e, gz->enc);
    long l = p - s;
    if (l < n) {
        // EX: rb_enc_precise_mbclen correctly determines that 1 more byte is needed (after 0xE3 and 0x81)
	int n_bytes = rb_enc_precise_mbclen(p, e, gz->enc);
	if (MBCLEN_NEEDMORE_P(n_bytes)) {
            // EX: n + MBCLEN_NEEDMORE_LEN(n_bytes) = 14020 + 1, which is NOT correct:
            // EX: 14021 is already the size of the buffer; the value should be 14022
	    if ((l = gzfile_fill(gz, n + MBCLEN_NEEDMORE_LEN(n_bytes), Qnil)) > 0) {
		return l;
	    }
	}
	else if (MBCLEN_CHARFOUND_P(n_bytes)) {
	    return l + MBCLEN_CHARFOUND_LEN(n_bytes);
	}
    }
    return n;
}
```
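To make the "one more byte is needed" step concrete, here is a rough Ruby model of what `MBCLEN_NEEDMORE_LEN` reports for the bytes between `p` and `e`. It is a sketch for UTF-8 only: `utf8_missing_bytes` is a hypothetical name, and unlike `rb_enc_precise_mbclen` it looks only at the lead byte and does not validate continuation bytes.

```ruby
# Rough UTF-8-only model: how many bytes are still missing from the
# (possibly incomplete) character that starts at the head of `bytes`.
def utf8_missing_bytes(bytes)
  lead = bytes.getbyte(0)
  expected = case lead
             when 0x00..0x7F then 1
             when 0xC2..0xDF then 2
             when 0xE0..0xEF then 3
             when 0xF0..0xF4 then 4
             end
  # Empty input or an invalid lead byte: nothing to wait for.
  return 0 if expected.nil?

  [expected - bytes.bytesize, 0].max
end

# Tail of the buffer in the example: 0xE3 0x81, then the buffer ends.
tail_missing = utf8_missing_bytes("\xE3\x81".b)
```

For the tail in the example, the lead byte `0xE3` announces a three-byte character and only two bytes are buffered, so one byte is missing.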

### Possible solution

Since `rb_enc_precise_mbclen` determines how many bytes are needed based on the entire buffer (up to the `e` pointer), `gzfile_fill` should request the current buffer size plus the number of missing bytes:

```diff
diff --git a/ext/zlib/zlib.c b/ext/zlib/zlib.c
index 0b9c4d6..5d7bd1b 100644
--- a/ext/zlib/zlib.c
+++ b/ext/zlib/zlib.c
@@ -4316,7 +4316,7 @@ gzreader_charboundary(struct gzfile *gz, long n)
     if (l < n) {
        int n_bytes = rb_enc_precise_mbclen(p, e, gz->enc);
        if (MBCLEN_NEEDMORE_P(n_bytes)) {
-           if ((l = gzfile_fill(gz, n + MBCLEN_NEEDMORE_LEN(n_bytes), Qnil)) > 0) {
+           if ((l = gzfile_fill(gz, ZSTREAM_BUF_FILLED(&gz->z) + MBCLEN_NEEDMORE_LEN(n_bytes), Qnil)) > 0) {
                return l;
            }
        }
```
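The arithmetic behind this change can be checked with the numbers from the example. This is a sketch only: the variable names are illustrative, and it assumes that `gzfile_fill` reads more data only when the requested length exceeds what is already buffered.

```ruby
# Values taken from the example in this issue.
limit         = 14_020 # n passed to gzreader_charboundary
buffer_filled = 14_021 # ZSTREAM_BUF_FILLED(&gz->z)
missing       = 1      # MBCLEN_NEEDMORE_LEN(n_bytes)

current_target  = limit + missing         # what the code requests today
proposed_target = buffer_filled + missing # what the patch requests

# Assumption: a fill only reads when the target exceeds the buffered size.
current_reads_more  = current_target > buffer_filled
proposed_reads_more = proposed_target > buffer_filled

puts "current:  target=#{current_target}, reads more: #{current_reads_more}"
puts "proposed: target=#{proposed_target}, reads more: #{proposed_reads_more}"
```

With the current code the target equals the buffered size, so nothing new is read; with the patch the target is one byte past it.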

As far as I have tested, this does seem to fix the issue.

