-
Notifications
You must be signed in to change notification settings - Fork 227
(PDB-135) Improve and reduce the number of UTF-8 warnings #1640
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -31,37 +31,141 @@ module CharEncoding | |
|
||
Utf8ReplacementChar = [ 0xEF, 0xBF, 0xBD ].pack("c*") | ||
|
||
DEFAULT_INVALID_CHAR = "\ufffd" | ||
|
||
def self.utf8_string(str) | ||
# @api private | ||
def self.all_indexes_of_char(str, char) | ||
(0..str.length).find_all{ |i| str[i] == char} | ||
end | ||
|
||
# @api private | ||
# | ||
# Takes an array and returns a sub-array without the last element | ||
# | ||
# @return [Object] | ||
def self.drop_last(array) | ||
array[0..-2] | ||
end | ||
|
||
# @api private | ||
# | ||
# Takes an array of increasing integers and collapses the sequential | ||
# integers into ranges | ||
# | ||
# @param index_array an array of sorted integers | ||
# @return [Range] | ||
def self.collapse_ranges(index_array) | ||
ranges = index_array.each.inject([]) do |spans, n| | ||
if spans.empty? || spans.last.end != n - 1 | ||
spans << Range.new(n, n) | ||
else | ||
drop_last(spans) << Range.new(spans.last.begin,n) | ||
end | ||
end | ||
end | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Stolen from stackexchange
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Here's the monkey version
|
||
|
||
# @api private | ||
# | ||
# Scans the string s with bad characters found at bad_char_indexes | ||
# and returns an array of messages that give some context around the | ||
# bad characters. This will give up to 100 characters prior to the | ||
# bad character and 100 after. It will return fewer if it's at the | ||
# beginning of a string or if another bad character appears before | ||
# reaching the 100 characters | ||
# | ||
# @param str string coming from to_pson, likely a command to be submitted to PDB | ||
# @param bad_char_indexes an array of indexes into the string where invalid characters were found | ||
# @return [String] | ||
def self.error_char_context(str, bad_char_indexes) | ||
bad_char_ranges = collapse_ranges(bad_char_indexes) | ||
bad_char_ranges.each_with_index.inject([]) do |state, (r, index)| | ||
gap = r.to_a.length | ||
|
||
prev_bad_char_end = bad_char_ranges[index-1].end + 1 if index > 0 | ||
next_bad_char_begin = bad_char_ranges[index+1].begin - 1 if index < bad_char_ranges.length - 1 | ||
|
||
start_char = [prev_bad_char_end || 0, r.begin-100].max | ||
end_char = [next_bad_char_begin || str.length - 1, r.end+100].min | ||
x = [next_bad_char_begin || str.length, r.end+100, str.length] | ||
prefix = str[start_char..r.begin-1] | ||
suffix = str[r.end+1..end_char] | ||
|
||
state << "'#{prefix}' followed by #{gap} invalid/undefined bytes then '#{suffix}'" | ||
end | ||
end | ||
|
||
# @api private | ||
# | ||
# Warns the user if an invalid character was found. If debugging is | ||
# enabled will also log contextual information about where the bad | ||
# character(s) were found | ||
# | ||
# @param str A string coming from to_pson, likely a command to be submitted to PDB | ||
# @param error_context_str information about where this string came from for use in error messages | ||
# @return String | ||
def self.warn_if_invalid_chars(str, error_context_str) | ||
bad_char_indexes = all_indexes_of_char(str, DEFAULT_INVALID_CHAR) | ||
if bad_char_indexes.empty? | ||
str | ||
else | ||
Puppet.warning "#{error_context_str} ignoring invalid UTF-8 byte sequences in data to be sent to PuppetDB, see debug logging for more info" | ||
if Puppet.settings[:log_level] == "debug" | ||
Puppet.debug error_context_str + "\n" + error_char_context(str, bad_char_indexes).join("\n") | ||
end | ||
|
||
str | ||
end | ||
end | ||
|
||
# @api private | ||
# | ||
# Attempts to coerce str to UTF-8, if that fails will output context | ||
# information using error_context_str | ||
# | ||
# @param str A string coming from to_pson, likely a command to be submitted to PDB | ||
# @param error_context_str information about where this string came from for use in error messages | ||
# @return Str | ||
def self.coerce_to_utf8(str, error_context_str) | ||
str_copy = str.dup | ||
# This code is passed in a string that was created by | ||
# to_pson. to_pson calls force_encoding('ASCII-8BIT') on the | ||
# string before it returns it. This leaves the actual UTF-8 bytes | ||
# alone. Below we check to see if this is the case (this should be | ||
# most common). In this case, the bytes are still UTF-8 and we can | ||
# just encode! and we're good to go. If They are not valid UTF-8 | ||
# bytes, that means there is probably some binary data mixed in | ||
# the middle of the UTF-8 string. In this case we need to output a | ||
# warning and give the user more information | ||
str_copy.force_encoding("UTF-8") | ||
if str_copy.valid_encoding? | ||
str_copy.encode!("UTF-8") | ||
else | ||
# This is force_encoded as US-ASCII to avoid any overlapping | ||
# byte related issues that could arise from mis-interpreting a | ||
# random extra byte as part of a multi-byte UTF-8 character | ||
str_copy.force_encoding("US-ASCII") | ||
warn_if_invalid_chars(str_copy.encode!("UTF-8", | ||
:invalid => :replace, | ||
:undef => :replace, | ||
:replace => DEFAULT_INVALID_CHAR), | ||
error_context_str) | ||
end | ||
end | ||
|
||
def self.utf8_string(str, error_context_str) | ||
if RUBY_VERSION =~ /^1.8/ | ||
# Ruby 1.8 doesn't have String#encode and related methods, and there | ||
# appears to be a bug in iconv that will interpret some byte sequences | ||
# as 6-byte characters. Thus, we are forced to resort to some unfortunate | ||
# manual chicanery. | ||
warn_if_changed(str, ruby18_clean_utf8(str)) | ||
elsif str.encoding == Encoding::UTF_8 | ||
# If we get here, we're in ruby 1.9+, so we have the string encoding methods | ||
# available. However, just because a ruby String object is already | ||
# marked as UTF-8, that doesn't guarantee that its contents are actually | ||
# valid; and if you call ruby's ".encode" method with an encoding of | ||
# "utf-8" for a String that ruby already believes is UTF-8, ruby | ||
# seems to optimize that to be a no-op. So, we have to do some more | ||
# complex handling... | ||
|
||
# If the string already has valid encoding then we're fine. | ||
return str if str.valid_encoding? | ||
|
||
# If not, we basically have to walk over the characters and replace | ||
# them by hand. | ||
warn_if_changed(str, str.each_char.map { |c| c.valid_encoding? ? c : "\ufffd"}.join) | ||
else | ||
# if we get here, we're ruby 1.9 and the current string is *not* encoded | ||
# as UTF-8. Thus we can actually rely on ruby's "encode" method. | ||
begin | ||
str.encode('UTF-8') | ||
coerce_to_utf8(str, error_context_str) | ||
rescue Encoding::InvalidByteSequenceError, Encoding::UndefinedConversionError => e | ||
# If we got an exception, the string is either invalid or not | ||
# convertible to UTF-8, so drop those bytes. | ||
# If we got an exception, the string is either invalid or not | ||
# convertible to UTF-8, so drop those bytes. | ||
|
||
warn_if_changed(str, str.encode('UTF-8', :invalid => :replace, :undef => :replace)) | ||
end | ||
end | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -42,7 +42,7 @@ def initialize(command, version, certname, payload) | |
# | ||
# This is roughly inline with how Puppet serializes for catalogs as of | ||
# Puppet 4.1.0. We need a better answer to non-utf8 data end-to-end. | ||
}.to_pson) | ||
}.to_pson, "Error encoding a '#{command}' command for host '#{certname}'") | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. should I see this getting logged with this catalog?
I see this:
in the puppet log, but no mention of replace catalog specifically. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. oh sorry, not on your code |
||
end | ||
end | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A little bit simpler