Skip to content

Filters sort and uniq #7

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
breml opened this issue Jun 15, 2019 · 18 comments
Closed

Filters sort and uniq #7

breml opened this issue Jun 15, 2019 · 18 comments

Comments

@breml
Copy link

breml commented Jun 15, 2019

First of all, this is a really interesting package, thanks for the effort.

In my scripts I often use the shell commands sort and uniq so I feel these two would be a good additions for this package.

@bitfield
Copy link
Owner

Yes, I'd like that too! Could you provide a short example program of the kind that you'd like to write in script that uses sort and uniq the way you want to use them? That gives us a concrete design proposal to talk about.

@breml
Copy link
Author

breml commented Jun 21, 2019

The use case is normally to process some kind of log file where I try to find out the frequency of some thing. So in bash this normally looks like this:

cat some.log | grep "the log lines of interest" | awk '{ print $5 }' | sort | uniq -c | sort -rn | head

The awk part is used to select a particular column (see #6).
With uniq I normally use the -c to count the frequency.
sort is used in two ways, first to sort alphabetically and then to sort by frequency (emited by uniq) in reverse order.
In most cases I am only interested in the top X, therefore I use head.

@bitfield
Copy link
Owner

Great! Can you turn this into a rough example of the Go program you'd like to write using script?

@breml
Copy link
Author

breml commented Jun 25, 2019

One challenge is, that the unix tools like grep, tr, sort, uniq, etc. have a lot of options and flags, that control the exact behaviour. In order to create an API, that is easy to use and flexible to extend, maybe functional options could be something worth to consider (but this is maybe a topic for another issue).

I think, something like this would be great (already combined with the functional options, which are placed in distinct packages for better readability):

script.Stdin().Match("the log lines of interest").Col(" ", 5).Sort().Uniq(uniq.WithCount()).Sort(sort.Reverse(), sort.Numeric()).Stdout()

@bitfield
Copy link
Owner

I'm completely in sympathy with your ideas, but one of the design principles for script is that the API should be very simple. The most common things should need zero config. Something like this would be more in keeping with the spirit of the package:

script.Stdin().Match("the log lines of interest").Column(5).Uniq().Sort().CountLines().Stdout()

What do you think?

@breml
Copy link
Author

breml commented Jun 27, 2019

@bitfield In my opinion, the sample you have put together does not the same as my command, but there is also one simplification, that I like.

The simplification

For the unix command uniq to work, the imput needs to be sorted. Therefore my command had the sequence Sort().Uniq().Sort(). My understanding of your sample is, that the Uniq(), that would be implemented in the script package would not have this precondition (because the sorting is maybe done with a map), so there is no need to sort the imput first.

The differences

Column(5)

I can life with the fact, that the default behaviour of Column() would be to split by whitespace. Of course, there are a lot of cases, where the delimiter is something different (like , or ;). So my question is, how would you address these cases. ColumnByComma()?

Uniq

The default behaviour of the unix command uniq is, that it just returns all unique lines of the input. With the flag -c, the number of occurences are counted and returned together with the lines. This looks like this:

  5 foobar
 12 barbaz
  4 foobaz

So if you want to have a "simple" API, you would need something like UniqWithCounts().

Sort

The default behaviour of the unix command sort is to sort alphabetically. If you want to sort an imput by numeric value, you need to pass the flag -n, which then looks for a number starting from the left side of the line and then sorts by these numbers. The flag -r reverses the order, such that the lines with the most hits are provided first in the output (because these lines are of the most interest).
So this would become something like SortByNumberReversed()

CountLines

This does not work, because I am not interested in the number of lines after sort and uniq.

Alternative Proposal

script.Stdin().Match("the log lines of interest").Column(5).UniqWithCounts().SortByNumberReversed().Stdout()

So this boils down to the question, what exactly is considered an "simple" API. I feel like removing the posibility for options/flags leads to a very broad (but still not very flexible) API. Alone the case for Sort would currently need:

  • Sort(): sort alphabetically
  • SortReversed(): sort alphabetically reversed
  • SortByNumber(): sort by leading numbers/integers
  • SortByNumberReversed(): sort by leading numbers/integers reversed

You can imagine your self, where this leads with more options added. The unix command sort provides some more options e.g. ignore case (but I think the case is preserved in the output), sort by human numbers (like 1K, 1G, etc.) and by month. I can also imagin, that sort by date could be interesting.

Bottom line

I personally prefer the API with the functional options. But this is your package, so you have to give the guidelines.

@bitfield
Copy link
Owner

Very thoughtful contribution, thank you!

You make an excellent point: that if you try to consider every possible way someone might want to use your functionality, you end up with a very complicated API (one way or the other). Either you have to provide lots of variants of no-config methods, or methods with lots of config.

This is exactly why I'm being such a hardass about use cases. I've rejected a lot of my own feature proposals on the basis that they're just something I thought would be cool, without having an actual, real requirement for them in a production program.

It sounds like the use case you're proposing is something roughly like this:

  • Find the top 10 most common 'column 5' values from a set of input lines

(That's quite complex! What specific real-world thing is it you're trying to do here? It's always helpful to know the concrete details.)

Now, we don't necessarily have to do this in a one-liner. That's the nice thing about a Go library; we can just use Go for anything the library doesn't have built in. But if we were to do this entirely in a script pipe, we would need at least the following:

  • Cut by columns
  • Count frequencies of identical lines
  • Sort lines numerically
  • Get first N lines

It's worth noting that I'm not particularly interested in following the API of specific Unix commands, except where it's obvious enough that it doesn't matter if you're familiar with the Unix equivalent.)

So here's a counter-counter-proposal, for five specific features:

  • Column(col) selects a specified column number, delimited by whitespace
  • CountFreq() outputs the unique lines in its input, prepended by a frequency count and whitespace
  • SortNumeric() sorts its input lines by numeric order
  • Reverse() reverses the input lines
  • First(n) outputs only the first N lines of input

Here's how your use case might be implemented using these:

script.Stdin().Match("foo").Column(5).CountFreq().SortNumeric().Reverse().First(10).Stdout()

@bitfield
Copy link
Owner

By the way, one thing I often do which is similar to this is to analyse webserver logs to find the IP address generating the most requests. For example, I cut the IP address out of access.log lines, and sort them by frequency to get the top 10 requesters.

@bitfield
Copy link
Owner

PS: maybe CountFreq should default to sorting its results with the highest frequency first? Does any other default make sense? If not, this seems like a sensible idea.

@breml
Copy link
Author

breml commented Jun 28, 2019

My real world example I had in mind, when you asked me about a specific use case was pretty similar to the use case you outline in your above comment. In my case it was the log of an ssh daemon and I was interested in the source IP addresses from where the most failed login attempts have been made.

This use case I obviously solved with a bash one-liner. If I use a particular bash one-liner often, I put it in a script. Maybe later, some additional logic is added to process the logs in different ways. In the end I have a bash script, that would be better implemented in something like Go. So this is the way it normally takes in my case. Therefore I would prefere an API, that is similar to the unix commands I already use, because the translation process from the existing bash one-liner or even script to a Go program with this package would become easier.

That being said, your counter-counter proposal covers my use case pretty good. Based on your last comment, it could be simplified to:

script.Stdin().Match("foo").Column(5).CountFreq().First(10).Stdout()

where CountFreq() would count the frequency of identical lines and return the result in descending order of the frequencies, each unique line prefixed with the frequency and a whitespace (the command uniq takes care of proper indention. because we return the lines with the most frequent line first, this should be an easy thing to do here as well).

@bitfield
Copy link
Owner

I would prefere an API, that is similar to the unix commands I already use, because the translation process from the existing bash one-liner or even script to a Go program with this package would become easier.

Yes, understood. By and large the Unix API is clear, straightforward, and unambiguous, and where that's so, I'm happy to adopt it. Some things or options are called by weird names, for historical reasons, and where there's an alternative that's clearer to people who don't write shell scripts, I'll use that instead. Clarity first, Unix familiarity second.

For example, Match() is called Match(), not Grep() (although that was naturally the first thing I thought of).

@bitfield
Copy link
Owner

Based on your last comment, it could be simplified to:

script.Stdin().Match("foo").Column(5).CountFreq().First(10).Stdout()

This looks very good to me. Are you interested in working on implementation of any of these? I might start putting a PR together with a log-line counting example, and the code to make it work.

@chinglinwen
Copy link

Hi, for a side note

The name CountFreq feels a little strange to me, I think what it does includes two part

  1. Uniq ( uniq lines, so no same line show again )
  2. Print frequency ( should be an option UniqCount or PrefixCount? maybe turn on by default )

So I think CountFreq use the name Uniq with option feels more intuitive.

@bitfield
Copy link
Owner

bitfield commented Jul 1, 2019

I think CountFreq use the name Uniq with option feels more intuitive.

I understand why you might feel that way, but here's another way of looking at it:

The primary thing that the method does is to count frequency. As a side effect of this it reduces the input to unique lines. But if you just wanted unique lines, you wouldn't use this method, because it prepends frequency counts.

I can quite see that one might want a method which produces unique lines, and that method should be called Unique(), certainly. So I don't want to lose that name by allocating it to something whose main purpose is not to produce unique lines.

@chinglinwen
Copy link

Looks good to me now .

@bitfield
Copy link
Owner

bitfield commented Jul 1, 2019

@breml have a look at #11 and see what you think. All the components for our visitor-counting example are now in place.

@bitfield
Copy link
Owner

bitfield commented Jul 3, 2019

@breml thanks for all the help on this one! Do you have everything you want now in v0.9.0?

@breml
Copy link
Author

breml commented Jul 4, 2019

LGTM, thanks.

@breml breml closed this as completed Jul 4, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants