
split() should document the behavior around adjacent separators, particularly at the start of the string #25986

Closed
nikomatsakis opened this issue Jun 3, 2015 · 10 comments
Labels
T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@nikomatsakis
Contributor

It's not clear to me if this is the expected behavior or not, but this program:

fn main() {
    let x = "    a  b c".to_string();
    let d: Vec<_> = x.split(' ').collect();
    println!("{:?}", d);
}

yields:

["", "", "", "", "a", "", "b", "c"]

whereas I expected:

["a","b","c"]

If the current behavior is expected, it should be more clearly documented, at minimum.

cc @Kimundi

@nikomatsakis
Contributor Author

That said, it does seem inconsistent to me? For example, four leading spaces yield four empty entries, but two intermediate spaces yield one empty entry, and one intermediate space yields no empty entries.

@Veedrac
Contributor

Veedrac commented Jun 3, 2015

There are two reasonable behaviours I see. One is to split around padding, to get ["a", "b", "c"]. The other is to split around delimiters, e.g. ",12,,34".split(',') → ["", "12", "", "34"], which keeps the pieces at both ends of the string even when they are empty.
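
A minimal sketch of the two behaviours side by side. The helper names `split_padding` and `split_delimiters` are made up for illustration; the first relies on the standard `str::split_whitespace`, the second on plain `str::split`:

```rust
/// Split around runs of whitespace ("padding"): empty pieces never appear.
fn split_padding(s: &str) -> Vec<&str> {
    s.split_whitespace().collect()
}

/// Split at every delimiter: each separator is a cut, so empty pieces are kept.
fn split_delimiters(s: &str, sep: char) -> Vec<&str> {
    s.split(sep).collect()
}

fn main() {
    println!("{:?}", split_padding("    a  b c")); // ["a", "b", "c"]
    println!("{:?}", split_delimiters(",12,,34", ',')); // ["", "12", "", "34"]
}
```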

@bluss
Member

bluss commented Jun 3, 2015

It is certainly expected by the implementors; this is the way it has been since the introduction of .split(), I believe. Since it's stable, I guess it's a documentation issue.

Replace each space with a splitting point and it seems consistent:

"    a  b c"
"||||a||b|c"

=> "", "", "", "", "a", "", "b", "c"

@nikomatsakis I guess you're not taking into account the string that precedes the first separator. It's empty in this case.
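
One way to check the "splitting point" reading is to count: a string with n separators should split into n + 1 pieces, empty ones included. A small sketch, with hypothetical helper names `count_separators` and `count_pieces`:

```rust
/// Number of separator characters in the string.
fn count_separators(s: &str, sep: char) -> usize {
    s.chars().filter(|&c| c == sep).count()
}

/// Number of pieces produced by `split`, counting empty ones.
fn count_pieces(s: &str, sep: char) -> usize {
    s.split(sep).count()
}

fn main() {
    let s = "    a  b c";
    // 7 spaces, so 8 pieces: the empty string before the first
    // space counts as a piece too.
    println!("{} separators", count_separators(s, ' '));
    println!("{} pieces", count_pieces(s, ' '));
}
```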

@bluss
Member

bluss commented Jun 3, 2015

FWIW, Python does the same thing

>>> "    a  b c".split(' ')
['', '', '', '', 'a', '', 'b', 'c']

Both Rust and Python can filter the list or iterator just fine, but it's slightly simpler in Python since an empty string is a falsy value.
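
For the Rust side, filtering could look roughly like this. The helper name `split_non_empty` is made up for illustration; it only combines `split`, `filter` and `is_empty` from the standard library:

```rust
/// Split on `sep` and drop the empty pieces produced by leading,
/// trailing, or adjacent separators.
fn split_non_empty(s: &str, sep: char) -> Vec<&str> {
    s.split(sep).filter(|piece| !piece.is_empty()).collect()
}

fn main() {
    println!("{:?}", split_non_empty("    a  b c", ' ')); // ["a", "b", "c"]
}
```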

@alexcrichton added the T-libs-api label Jun 3, 2015
@nikomatsakis
Contributor Author

@bluss yes, ok, that makes sense (I did not, it's true, consider the "empty string" at the beginning).

I don't particularly like this behavior -- I've always found it annoying in Python too :) -- but yes I agree it's a doc issue, not a bug.

@nikomatsakis changed the title from "split() can include separators in the result" to "split() should document the behavior around adjacent separators, particularly at the start of the string" Jun 3, 2015
@Kimundi
Member

Kimundi commented Jun 3, 2015

@bluss basically said all I'd say already. So yeah, I guess the docs could be improved a bit?

@bluss added the A-docs label Jun 6, 2015
@nagisa
Member

nagisa commented Jun 7, 2015

> That said, it does seem inconsistent to me? For example, four leading spaces yield four empty entries, but two intermediate spaces yield one empty entry, and one intermediate space yields no empty entries.

.split(x).connect(x) should be an identity function, which is how it is consistent.

@Veedrac
Contributor

Veedrac commented Jun 7, 2015

@nagisa This doesn't seem to actually be the case. The number of leading spaces seems to match the number of empty entries. E.g.:

println!("{:?}", " a  b c".to_owned().split(' ').collect::<Vec<_>>());
#>>> ["", "a", "", "b", "c"]

As such, the requirement seems to hold.

@nagisa
Member

nagisa commented Jun 7, 2015

> @nagisa This doesn't seem to actually be the case.

I don’t understand what you’re trying to say here, because split(x).connect(x) is an identity function (well, yeah, you have to collect it in between, etc; but that’s not the point):

" a  b c" == " a  b c".to_owned().split(" ").collect::<Vec<_>>().connect(" ")
(20:39:06) playbot: (notice) true
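
The same round trip as a standalone sketch. Note that `connect` was later renamed to `join` in the standard library, so this version uses `join`. The helper name `round_trip` is made up for illustration, and a non-empty separator is assumed:

```rust
/// Split on `sep`, then glue the pieces back together with the same `sep`.
/// Because `split` keeps empty pieces, this should reproduce the input.
fn round_trip(s: &str, sep: &str) -> String {
    s.split(sep).collect::<Vec<_>>().join(sep)
}

fn main() {
    for s in [" a  b c", "    a  b c", "a b ", ""] {
        println!("{:?} round-trips: {}", s, round_trip(s, " ") == s);
    }
}
```

Filtering out the empty pieces before joining would break this property, which is one argument for `split` keeping them.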

@Veedrac
Contributor

Veedrac commented Jun 7, 2015

Sorry, I thought by saying "should be" you were implying that it wasn't - I was pointing out that it is consistent. Nvm then, we agree.

steveklabnik added a commit to steveklabnik/rust that referenced this issue Jun 9, 2015
This can be confusing when whitespace is the separator

Fixes rust-lang#25986
bors added a commit that referenced this issue Jun 10, 2015
This can be confusing when whitespace is the separator

Fixes #25986
6 participants