RFC 1054: str-words

libs ()

Summary

Rename or replace str::words to side-step the ambiguity of “a word”.

Motivation

The str::words method is currently marked #[unstable(reason = "the precise algorithm to use is unclear")]. Indeed, the concept of “a word” is not easy to define in the presence of punctuation, or across languages with differing conventions, including some that do not use spaces to separate words at all.

Issue #15628 suggests changing the algorithm to be based on the Word Boundaries section of Unicode Standard Annex #29: Unicode Text Segmentation.

While a Rust implementation of UAX#29 would be useful, it belongs on crates.io more than in std.

Therefore, std would be better off avoiding the question of defining word boundaries entirely.

Detailed design

Rename the words method to split_whitespace, and keep the current behavior unchanged. (That is, return an iterator equivalent to s.split(char::is_whitespace).filter(|s| !s.is_empty()).)
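As a minimal sketch of the equivalence stated above, the following compares the two forms on a sample string. The helper names split_then_filter and via_split_whitespace are hypothetical and exist only for this illustration; split_whitespace is assumed to be available under the name proposed here.

```rust
// Reference behavior from the text: split on every whitespace character,
// then drop the empty pieces produced by leading, trailing, or repeated whitespace.
fn split_then_filter(s: &str) -> Vec<&str> {
    s.split(char::is_whitespace)
        .filter(|s| !s.is_empty())
        .collect()
}

// The same result through the renamed method.
fn via_split_whitespace(s: &str) -> Vec<&str> {
    s.split_whitespace().collect()
}

fn main() {
    let text = "  Mary had\ta little  \n lamb ";
    // Both forms should yield the same sequence of non-empty substrings.
    assert_eq!(split_then_filter(text), via_split_whitespace(text));
    println!("{:?}", via_split_whitespace(text));
}
```

Note that punctuation is left attached to neighboring text; only whitespace acts as a separator, which is exactly why the name avoids the term “word”.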

Rename the return type std::str::Words to std::str::SplitWhitespace.

Optionally, keep a words wrapper method for a while, marked both #[deprecated] and #[unstable], with a deprecation message that suggests split_whitespace or the chosen alternative.
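The transitional wrapper could look roughly like the sketch below. This is an illustration in user code, not the standard library's implementation: the extension trait WordsExt and the helper collect_words are hypothetical, and the sketch uses the user-facing #[deprecated(note = ...)] attribute rather than the internal stability attributes that std itself would use.

```rust
use std::str::SplitWhitespace;

// Hypothetical extension trait standing in for the old inherent method.
trait WordsExt {
    #[deprecated(note = "use `split_whitespace` instead")]
    fn words(&self) -> SplitWhitespace<'_>;
}

impl WordsExt for str {
    // The wrapper simply forwards to the renamed method.
    fn words(&self) -> SplitWhitespace<'_> {
        self.split_whitespace()
    }
}

// Callers of `words` get a deprecation warning unless they opt out explicitly.
#[allow(deprecated)]
fn collect_words(s: &str) -> Vec<&str> {
    s.words().collect()
}

fn main() {
    println!("{:?}", collect_words("still  works\tfor now"));
}
```

Existing callers would keep compiling during the transition, while the warning points them at the new name.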

Drawbacks

split_whitespace is very similar to the existing str::split<P: Pattern>(&self, P) method, and having a separate method seems like weak API design. (But see below.)
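To make the similarity and the difference concrete, here is a small sketch contrasting plain split with a whitespace predicate against split_whitespace. The helper name plain_split is hypothetical.

```rust
// Splitting on a whitespace predicate alone keeps the empty pieces that
// appear before leading whitespace and between consecutive separators.
fn plain_split(s: &str) -> Vec<&str> {
    s.split(char::is_whitespace).collect()
}

fn main() {
    let text = " a  b";
    // Plain split: empty strings are part of the output.
    println!("{:?}", plain_split(text));
    // split_whitespace: only the non-empty pieces remain.
    println!("{:?}", text.split_whitespace().collect::<Vec<_>>());
}
```

The two differ only in how empty substrings are handled, which is the behavior the dedicated method packages up.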

Alternatives

Unresolved questions

Is there a better alternative?