[WIP] proper delimiter support for line based input #26
base: main
Conversation
I just realized I forgot to run the other specs, not just the modified ones.
So thinking about it, I suggest we refactor multiline and extract the identity map support stuff required in the multiline codec into its own module or something like that. Currently, as it is, the multiline codec code and structure is really confusing; it's not clear what does what.
Some observations:
Thanks @guyboertje for the clarifications.
I had always intended to move the IMC class into LS core once any bugs identified in its use with the file input were ironed out, because it is easier to release new versions of plugins in quick succession. Further, I had imagined that the component channel idea would take off and that the accept method in the ML codec would be subsumed by the component.
@guyboertje but my problem is not the IdentityMapCodec per se, it's the specific support code for it in the multiline codec. For the component idea, that will probably happen at some point, but in the meantime we have to live with the current design :P
I am conflicted about this PR. It forces this codec to change its fundamental external behaviour, that of being a line-oriented data consumer. It means that line-oriented inputs like s3, kafka, http, redis and file must now append a newline.

Inputs supply data in flavours (text bytes or text lines). Codecs consume data in flavours (text bytes or text lines).

Practically, in terms of GH issues and/or support queries, for production use of the ML codec, we see a small usage of it with text-bytes inputs but a massive usage of it with text-lines inputs. This change is positive for the small usage but negative (double buffering and newline-append data mutation) for the massive usage (file input). I struggle to see any urgency for this change.

It is a bug that we cannot detect when a flavour mismatch occurs in a user's config, to either: a) warn the user or prevent LS from starting, or b) correct the mismatch by inserting an "adapter" between the input and the codec, e.g. a bytes-to-lines adapter between the stdin and tcp inputs and the multiline and json codecs.

I want us to work on the problem of appropriate channel composition. We could never move the codec out of the input without this.
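For illustration only, here is a minimal sketch of the kind of bytes-to-lines adapter mentioned above; the class name and API are invented for this example and are not part of any existing plugin:

```ruby
# Hypothetical bytes-to-lines adapter: it buffers raw chunks and only yields
# complete, delimiter-terminated lines, keeping any partial tail for later.
class BytesToLinesAdapter
  def initialize(delimiter = "\n")
    @delimiter = delimiter
    @buffer = ""
  end

  # Accepts an arbitrary chunk of bytes and yields each complete line.
  def extract(data)
    @buffer << data
    parts = @buffer.split(@delimiter, -1)
    @buffer = parts.pop # trailing element is the (possibly empty) partial line
    parts.each { |line| yield line }
  end
end

adapter = BytesToLinesAdapter.new
adapter.extract("foo\nba") { |l| puts l } # prints "foo"
adapter.extract("r\n")     { |l| puts l } # prints "bar"
```

Such an adapter could in principle sit between a bytes-flavoured input (stdin, tcp) and a lines-flavoured codec without either side changing its behaviour.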
@guyboertje I think you raise some very valid concerns. It seems the fundamental issue here is that there's no type information carried from input to codec. The contracts for other things are clear (filters are event -> [event], outputs are [event] -> nil), but there's just a lot of gray area here. I like your proposal for automatic flavor detection + adapters.
I agree we should not change the external default behaviour. I already proposed to make sure we keep the previous/legacy behaviour in that respect and not have any current input modified to append a delimiter, but instead make that behaviour optional and keep the legacy behaviour as the default.
The thing is that this codec, in its current released state, is actually in between line-oriented and text-bytes. It was actually meant as text-bytes since it was doing:

```ruby
def decode(text, &block)
  text = @converter.convert(text)
  text.split("\n").each do |line|
    ...
```

But obviously this is both useless in the context of already line-delimited input and useless for text-bytes input, as it does not properly support lines across data blocks. Furthermore, for already line-delimited input, introducing the buffered tokenizer essentially does the equivalent split on the delimiter. On the other hand, using the buffered tokenizer in the context of text-bytes will actually provide the correct expected behaviour of supporting cross-block lines.

I still believe that using the buffered tokenizer and respecting the current behaviour is the right thing to do, since it will make this codec work correctly the way it was intended to, with all types of inputs and without any performance penalty.

@andrewvc @guy I agree we have a contract confusion in that respect between inputs and codecs, and we should address this in a separate conversation and have a proper design discussion around it. Until we sort this out I believe this PR is correct and will provide value.
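As a self-contained illustration of the cross-block problem described above (not code from the plugin):

```ruby
# A line broken across two reads is wrongly emitted as two lines by a naive
# per-chunk split, while buffering the partial tail reassembles it correctly.
chunks = ["foo\nba", "r\nbaz\n"]

naive = chunks.flat_map { |c| c.split("\n") }
# => ["foo", "ba", "r", "baz"]   "bar" is split in two

buffer = ""
buffered = []
chunks.each do |chunk|
  buffer << chunk
  *complete, buffer = buffer.split("\n", -1) # keep the trailing partial line
  buffered.concat(complete)
end
# buffered => ["foo", "bar", "baz"]
```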
I have reverted to the default behaviour of assuming line-based input data (all original tests pass as-is) and added this new config option:

```ruby
# Assume data received from input plugin as line based. For some input plugins
# like stdin or tcp/udp data is not line based and this option should be set to false.
config :line_based_input, :validate => :boolean, :default => true
```

I believe this is an improvement: it makes the plugin more flexible and fixes a bug which made it not work correctly for non-line-based input.

I would love some opinions on the identity-map-specific support code which intertwines with the actual plugin code and makes it less comprehensible. Since this code is used only when the codec runs inside the IdentityMapCodec, I suggest moving it into a module and including that module. That way we will avoid mixing the "real" codec public interface with the identity map support interface.
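A rough sketch of that module-extraction idea follows; the module name and method bodies are invented for illustration and are not the plugin's actual code:

```ruby
# Hypothetical mixin keeping identity-map-only entry points out of the codec's
# regular public interface (names and bodies invented for illustration).
module IdentityMapSupport
  # Entry point used only when running under the IdentityMapCodec.
  def accept(listener)
    @last_listener = listener
    decode(listener.data) { |event| listener.process_event(event) }
  end

  # Flush buffered lines back through the last known listener.
  def auto_flush(listener = @last_listener)
    return unless listener
    flush { |event| listener.process_event(event) }
  end
end

# In the codec class, the support interface would then be pulled in with a
# single `include IdentityMapSupport`, leaving decode/encode/flush as the
# only "real" codec methods defined directly on the class.
```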
There is also clearly a bug with auto_flush, which assumes it is running within the identity map context and that the listener has been initialized. I am surprised we did not catch this in reviews; or was it assumed that it would always run within the identity mapper?
I already established in a previous comment that this bug existed. But it has nothing to do with an identity map context; the IMC is transparent. It has everything to do with inputs needing to be modified to provide the listener mechanism. The listener implementation transfers the references to objects defined in the input (held in the decode callback block binding) to a listener, and establishes a callback API on the listener such that callbacks are made by method call and not by block call.
```ruby
  end

  def decode(data, &block)
    data = data + @delimiter if @line_based_input
```
Consider not mutating the original string. You could create an `@appendage` ivar once, using `@line_based_input`, in `register`, and then do `_data = data + @appendage`.
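A minimal sketch of what that suggestion could look like, assuming a buftok-style `@buffer` whose `extract` method returns the complete lines found in the data; the helper name is hypothetical:

```ruby
def register
  # Decide once whether a delimiter needs to be appended to incoming data.
  @appendage = @line_based_input ? @delimiter : ""
end

def decode(data, &block)
  _data = data + @appendage    # builds a new string; the caller's `data` is untouched
  @buffer.extract(_data).each do |line|
    process_line(line, &block) # hypothetical helper standing in for the multiline logic
  end
end
```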
At this stage my concern is not specifically about removing this code from this plugin repo (I agree it should eventually) but the IMC support code in the MLC code like the
I am sorry about the confusion. I am talking about the
I am not sure what you mean by "transparent"? It is not transparent in the MLC code, the
```ruby
# Assume data received from input plugin as line based. For some input plugins
# like stdin or tcp/udp data is not line based and this option should be set to false.
config :line_based_input, :validate => :boolean, :default => true
```
Feels like this should be a contract between the input and decoder plugins and not something we need to expose via configuration to users.
But that would need us to properly annotate all input plugins? In which case this option allows us to have something working today, and eventually we can deprecate it once we have the proper info from the input plugins?
From a historical perspective, I think the multiline codec was not really intended to be used with single-payload inputs such as redis, rabbitmq, etc. -- it was only intended for bytestream inputs like stdin, tcp, file(*), etc.
(*) The file input currently only receives lines, because of the way the filewatch library works. It could simplify things if we had an API in filewatch that presented a bytestream instead of line-oriented data and made the codec deal with it.
@colinsurprenant have you considered implementing this with logstash-codec-line instead of implementing buftok directly? This is already done in other plugins.
@jsvd actually I think it is a bad idea, see logstash-plugins/logstash-codec-json_lines#17
Putting this PR on hold for @guyboertje's upcoming refactor PR of the auto-flushing and timer tasks.
Is this still blocked?
I would think that Event Mills, coming soon, will obsolete this.
@guyboertje are Event Mills still coming? If not, we could pick this back up.
@cwurm eventually mills may arrive. Ultimately, this problem is due to a mismatch between how the file input works (what the multiline codec was intended to target) and how things like the tcp input work. The file input already emits lines. Fixing this before mills arrive (if/when they arrive) is possible.
@jordansissel I think we should, there have been issues and PRs dating back to 2015: @colinsurprenant would you be interested in picking this back up? If not, I'd see if I can find the time for it.
Use the `BufferedTokenizer` to correctly parse delimited input across read blocks. The previous implementation assumed a hardcoded `\n` delimiter and also assumed that the delimiter was always part of the same input block, which is obviously not true for streaming data (stdin, tcp, etc.) that is read in fixed-size chunks and passed to the codec; in that case, lines split across blocks were not correctly processed.

TODO:
- There are things that need to be clarified with the help of @guyboertje, who introduced the latest changes related to identity map and auto flushing. In particular, I think the implementation assumes it will run under an identity map codec, which I do not believe is true, for example if used with the stdin input.
- The specs were passing lines without a terminator to the `decode` method. Obviously this is wrong and cannot work with the `BufferedTokenizer`. All specs needed to be adjusted to add a line terminator to the passed lines.