[WIP] proper delimiter support for line based input #26
base: main
Conversation
I just realized I forgot to run the other specs, not just the modified ones.
So thinking about it, I suggest we refactor multiline and extract the identity map support stuff required in the multiline codec into its own module or something like that. Currently, as it is, the multiline codec code and structure is really confusing; it's not clear what does what.
Some observations:
Thanks @guyboertje for the clarifications.
I had always intended to move the IMC class into LS core once any bugs identified in its use with the file input were ironed out, because it is easier to release new versions of plugins in quick succession. Further, I had imagined that the component channel idea would take off and that the accept method in the ML codec would be subsumed by the component.
@guyboertje but my problem is not the IdentityMapCodec per se, it's the specific support code for it in the multiline codec. For the component idea, that will probably happen at some point, but in the meantime we have to live with the current design :P
I am conflicted about this PR. It forces this codec to change its fundamental external behaviour, that of being a line-oriented data consumer. It means that line-oriented inputs like s3, kafka, http, redis and file must now append a newline.

Inputs supply data in flavours (text bytes or text lines). Codecs consume data in flavours (text bytes or text lines).

Practically, in terms of GH issues and/or support queries, for production use of the ML codec, we see a small usage of it with text-bytes inputs but a massive usage of it with text-lines inputs. This change is positive for the small usage but negative (double buffering and newline-append data mutation) for the massive usage (file input). I struggle to see any urgency for this change.

It is a bug that we cannot detect when a flavour mismatch occurs in a user's config, to either: a) warn the user or prevent LS from starting, or b) correct the mismatch by inserting an "adapter" between the input and the codec, e.g. a bytes-to-lines adapter between the stdin and tcp inputs and the multiline and json codecs.

I want us to work on the problem of appropriate channel composition. We could never move the codec out of the input without this.
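For illustration only, here is a minimal sketch of the kind of bytes-to-lines adapter mentioned above; the class name and API are invented for this example and are not part of any existing plugin:

```ruby
# Hypothetical bytes-to-lines adapter: it buffers raw chunks and only yields
# complete, delimiter-terminated lines, keeping any partial tail for later.
class BytesToLinesAdapter
  def initialize(delimiter = "\n")
    @delimiter = delimiter
    @buffer = ""
  end

  # Accepts an arbitrary chunk of bytes and yields each complete line.
  def extract(data)
    @buffer << data
    parts = @buffer.split(@delimiter, -1)
    @buffer = parts.pop # trailing element is the (possibly empty) partial line
    parts.each { |line| yield line }
  end
end

adapter = BytesToLinesAdapter.new
adapter.extract("foo\nba") { |l| puts l } # prints "foo"
adapter.extract("r\n")     { |l| puts l } # prints "bar"
```

Such an adapter could in principle sit between a bytes-flavoured input (stdin, tcp) and a lines-flavoured codec without either side changing its behaviour.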
@guyboertje I think you raise some very valid concerns. It seems the fundamental issue here is that there's no type information carried from input to codec. The contracts for other things are clear (filters are event -> [event], outputs are [event] -> nil), but there's just a lot of gray area here. I like your proposal for automatic flavor detection + adapters.
I agree we should not change the external default behaviour. I already proposed to make sure we keep the previous/legacy behaviour in that respect and not have any current input modified to append a delimiter, but instead make that behaviour optional and keep the legacy behaviour as the default.
The thing is that this codec, in its current released state, is actually in between line-oriented and text-bytes. It was actually meant as text-bytes since it was doing:

```ruby
def decode(text, &block)
  text = @converter.convert(text)
  text.split("\n").each do |line|
    ...
```

But obviously this is both useless in the context of already line-delimited input and useless for text-bytes input, as it does not properly support lines across data blocks. Furthermore, for already line-delimited input, introducing the buffered tokenizer essentially does the equivalent split on the delimiter. On the other hand, using the buffered tokenizer in the context of text-bytes will actually provide the correct expected behaviour of supporting cross-block lines.

I still believe that using the buffered tokenizer and respecting the current behaviour is the right thing to do, since it will make this codec work correctly the way it was intended to, with all types of inputs and without any performance penalty.

@andrewvc @guy I agree we have a contract confusion in that respect between inputs and codecs, and we should address this in a separate conversation and have a proper design discussion around it. Until we sort this out I believe this PR is correct and will provide value.
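As a self-contained illustration of the cross-block problem described above (not code from the plugin):

```ruby
# A line broken across two reads is wrongly emitted as two lines by a naive
# per-chunk split, while buffering the partial tail reassembles it correctly.
chunks = ["foo\nba", "r\nbaz\n"]

naive = chunks.flat_map { |c| c.split("\n") }
# => ["foo", "ba", "r", "baz"]   "bar" is split in two

buffer = ""
buffered = []
chunks.each do |chunk|
  buffer << chunk
  *complete, buffer = buffer.split("\n", -1) # keep the trailing partial line
  buffered.concat(complete)
end
# buffered => ["foo", "bar", "baz"]
```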
I have reverted to the default behaviour of assuming line-based input data (all original tests pass as-is) and added this new config option:

```ruby
# Assume data received from input plugin as line based. For some input plugins
# like stdin or tcp/udp data is not line based and this option should be set to false.
config :line_based_input, :validate => :boolean, :default => true
```

I believe this is an improvement: it makes the plugin more flexible and fixes a bug which made it not work correctly for non-line-based input.

I would love some opinions on the identity-map-specific support code which intertwines with the actual plugin code and makes it less comprehensible. Since this code is used only when the codec runs inside the IdentityMapCodec, I suggest moving it into a module and including that module. That way we will avoid mixing the "real" codec public interface with the identity map support interface.
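A rough sketch of that module-extraction idea follows; the module name and method bodies are invented for illustration and are not the plugin's actual code:

```ruby
# Hypothetical mixin keeping identity-map-only entry points out of the codec's
# regular public interface (names and bodies invented for illustration).
module IdentityMapSupport
  # Entry point used only when running under the IdentityMapCodec.
  def accept(listener)
    @last_listener = listener
    decode(listener.data) { |event| listener.process_event(event) }
  end

  # Flush buffered lines back through the last known listener.
  def auto_flush(listener = @last_listener)
    return unless listener
    flush { |event| listener.process_event(event) }
  end
end

# In the codec class, the support interface would then be pulled in with a
# single `include IdentityMapSupport`, leaving decode/encode/flush as the
# only "real" codec methods defined directly on the class.
```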
There is also clearly a bug with auto_flush, which assumes it is running within the identity map context and that the listener has been initialized. I am surprised we did not catch this in reviews; or was it assumed that it would always run within the identity mapper?
I already established in a previous comment that this bug existed. But it has nothing to do with an identity map context; the IMC is transparent. It has everything to do with inputs needing to be modified to provide the listener mechanism. The listener implementation transfers the references to objects defined in the input (held in the decode callback block binding) to a listener, and establishes a callback API on the listener such that callbacks are made by method call and not by block call.
```ruby
  end

  def decode(data, &block)
    data = data + @delimiter if @line_based_input
```
Consider not mutating the original string. You could create an `@appendage` ivar once, using `@line_based_input`, in `register`, and then do `_data = data + @appendage`.
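A minimal sketch of what that suggestion could look like, assuming a buftok-style `@buffer` whose `extract` method returns the complete lines found in the data; the helper name is hypothetical:

```ruby
def register
  # Decide once whether a delimiter needs to be appended to incoming data.
  @appendage = @line_based_input ? @delimiter : ""
end

def decode(data, &block)
  _data = data + @appendage    # builds a new string; the caller's `data` is untouched
  @buffer.extract(_data).each do |line|
    process_line(line, &block) # hypothetical helper standing in for the multiline logic
  end
end
```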
At this stage my concern is not specifically about removing this code from this plugin repo (I agree it should eventually) but the IMC support code in the MLC code like the
I am sorry about the confusion. I am talking about the
I am not sure what you mean by "transparent"? It is not transparent in the MLC code, the
```ruby
# Assume data received from input plugin as line based. For some input plugins
# like stdin or tcp/udp data is not line based and this option should be set to false.
config :line_based_input, :validate => :boolean, :default => true
```
Feels like this should be a contract between the input and decoder plugins and not something we need to expose via configuration to users.
But that would need us to properly annotate all input plugins? In which case this option allows us to have something working today, and eventually we can deprecate it once we have the proper info from the input plugins?
From a historical perspective, I think the multiline codec was not really intended to be used with single-payload inputs such as redis, rabbitmq, etc. -- it was only intended for bytestream inputs like stdin, tcp, file(*), etc.
(*) The file input currently only receives lines, because of the way the filewatch library works. It could simplify things if we had an API in filewatch that presented a bytestream instead of line-oriented data and made the codec deal with it.
@colinsurprenant have you considered implementing this with logstash-codec-line instead of implementing buftok directly? This is already done in other plugins.
@jsvd actually I think it is a bad idea, see logstash-plugins/logstash-codec-json_lines#17
Putting this PR on hold for @guyboertje's upcoming refactor PR of the auto-flushing and timer tasks.
Is this still blocked?
I would think that Event Mills, coming soon, will obsolete this.
@guyboertje are Event Mills still coming? If not, we could pick this back up.
@cwurm eventually mills may arrive. Ultimately, this problem is due to a mismatch between how the file input works (what the multiline codec was intended to target) and how things like the tcp input work. The file input already emits lines. Fixing this before mills arrive (if/when they arrive) is possible.
@jordansissel I think we should, there have been issues and PRs dating back to 2015: @colinsurprenant would you be interested in picking this back up? If not, I'd see if I can find the time for it.
Use the `BufferedTokenizer` to correctly parse delimited input across read blocks. The previous implementation assumed a hardcoded `\n` delimiter and also assumed that the delimiter was always part of the same input block, which is obviously not true for streaming data (stdin, tcp, etc.) that is read in fixed-size chunks and passed to the codec; in that case, lines split across blocks were not correctly processed.

TODO:
- There are things that need to be clarified with the help of @guyboertje, who introduced the latest changes related to identity map and auto flushing. In particular, I think the implementation assumes it will run under an identity map codec, which I do not believe is true, for example if used with the stdin input.
- The specs were passing lines without a terminator to the `decode` method. Obviously this is wrong and cannot work with the `BufferedTokenizer`. All specs needed to be adjusted to add a line terminator to the passed lines.