-
Notifications
You must be signed in to change notification settings - Fork 35
Description
Calling gets with limit in certain cases results in incomplete multi-byte characters as the returned string ends in the middle of a character.
To reproduce
Example file: example.txt.gz (uncompressed version example.txt)
require 'zlib'
# File example.txt.gz is attached to the issue
test_file = Zlib::GzipReader.open("example.txt.gz")
test_file.gets(3125)
test_file.gets(3125)
test_file.gets(3125)
test_file.gets(2513)
test_file.gets(2)
test_file.gets(100)
test_file.gets(8017)
test_file.gets(2)
test_file.gets(4579)
test_file.gets(3127)
test_file.gets(3127)
test_file.gets(3126)
test_file.gets(1)
result = test_file.gets(14020)
result.valid_encoding?
# => false
# Printing out the result shows that it contains two bytes at the end without the final byte to complete the character
# ...南部から、つづ\xE3\x81"
# Reading next chunk with gets and combining strings results in a string that is once again complete
next_chunk = test_file.gets(100)
(result + next_chunk).valid_encoding?
# => trueThis issue can occur when reading compressed CSV files since CSV reader reads a chunk of IO and then checks whether returned string has valid encoding: https://github.com/ruby/csv/blob/f33c613ca94013db0667c4311311c98b6c20d5ea/lib/csv/parser.rb#L310-L312
Possible cause
This issue seems to occur when gets is called with limit, which is 1 byte smaller than the contents of currently loaded buffer.
In function gzreader_charboundary rb_enc_precise_mbclen is used to identify number of missing bytes at the end, but then gzfile_fill is called with n + MBCLEN_NEEDMORE_LEN(n_bytes) where n can be 1 byte smaller than current loaded buffer size, and as a result no additional data is read.
https://github.com/ruby/zlib/blob/master/ext/zlib/zlib.c#L4309
static long
gzreader_charboundary(struct gzfile *gz, long n)
{
// EX: In case of the example above:
// EX: n = 14020
// EX: ZSTREAM_BUF_FILLED(&gz->z) = 14021
// EX: last four bytes of s:
// EX: ... | 0xA5 | 0xE3 | 0x81 | 0x00 | (byte)
// EX: ... | 14019 | 14020 | 14021 | 14022 | (index)
char *s = RSTRING_PTR(gz->z.buf);
char *e = s + ZSTREAM_BUF_FILLED(&gz->z);
// EX: p correctly points to 0xE3 (byte 14020 of s)
char *p = rb_enc_left_char_head(s, s + n - 1, e, gz->enc);
long l = p - s;
if (l < n) {
// EX: rb_enc_precise_mbclen correctly determines that 1 more byte is needed (after 0xE3 and 0x81)
int n_bytes = rb_enc_precise_mbclen(p, e, gz->enc);
if (MBCLEN_NEEDMORE_P(n_bytes)) {
// EX: n + MBCLEN_NEEDMORE_LEN(n_bytes) = 14020 + 1 which is NOT correct:
// EX: 14021 is already size of the buffer, the value should be 14022
if ((l = gzfile_fill(gz, n + MBCLEN_NEEDMORE_LEN(n_bytes), Qnil)) > 0) {
return l;
}
}
else if (MBCLEN_CHARFOUND_P(n_bytes)) {
return l + MBCLEN_CHARFOUND_LEN(n_bytes);
}
}
return n;
}Possible solution
Since rb_enc_precise_mbclen determines how many bytes are needed based on the entire buffer (until e pointer), gzfile_fill should also request current buffer size + number of missing bytes to be read:
diff --git a/ext/zlib/zlib.c b/ext/zlib/zlib.c
index 0b9c4d6..5d7bd1b 100644
--- a/ext/zlib/zlib.c
+++ b/ext/zlib/zlib.c
@@ -4316,7 +4316,7 @@ gzreader_charboundary(struct gzfile *gz, long n)
if (l < n) {
int n_bytes = rb_enc_precise_mbclen(p, e, gz->enc);
if (MBCLEN_NEEDMORE_P(n_bytes)) {
- if ((l = gzfile_fill(gz, n + MBCLEN_NEEDMORE_LEN(n_bytes), Qnil)) > 0) {
+ if ((l = gzfile_fill(gz, ZSTREAM_BUF_FILLED(&gz->z) + MBCLEN_NEEDMORE_LEN(n_bytes), Qnil)) > 0) {
return l;
}
}As far as I tested this indeed seems to fix the issue