“Zero or more”

Posted by waldner on 10 August 2010, 8:20 pm

This usually comes up in a form similar to "I'm running this code:

$ echo "foobar 123" | sed 's/[0-9]*/blah/'
blahfoobar 123

but I don't get the expected result!" (which they think should be "foobar blah"). The result they get is instead as shown above.

Why is that? Remember that the star quantifier "*" means "zero or more" times of whatever is quantified. So in particular, "zero times" is perfectly fine. As a matter of fact, there's a match for "zero times [0-9]" just at the beginning of our string, and that's what sed replaces. Yes, it's a zero-length match (but, can a match of "zero times" something be non-zero-length?). Yet, it's still a perfectly valid match.

But there's more. By convention, there's a match for "zero times anything" between any two characters of the string, and also another one at the end:

$ echo "foobar" | sed 's/[0-9]*/BLAH/g'
BLAHfBLAHoBLAHoBLAHbBLAHaBLAHrBLAH

So why, in the original example, does sed replace the first match even if there is a longer match later in the string (that is, "123")? Shouldn't quantifiers be greedy and always find the longest match?

Sure quantifiers are greedy, but greediness matters only when multiple matches are possible at the same position. So if our pattern is /[0-9]*/ and we are matching against "2847293745 foo", surely "2847293745" is matched, rather than "284729374", "28472937" or the other possible shorter matches down to "2" and the empty string: of all the possible matches at the same position, the longest is chosen.

But this is not the case in our original example. In it, there's no contention: there is only a single possible match at the beginning of the string, and the regex engine is content with it.

The usual (although somewhat simplistic) description for this behavior is "leftmost longest match", so stop at the first match that is found starting from the left, and if at that position multiple matches are possible, take the longest one (the aforementioned greediness). Our sed took the leftmost match, which is also the longest at that position, even if it has length zero.

Another pitfall which results from the same misconception is for example grepping for '[0-9]*' and wondering why all the input lines are printed. It's because they all match the pattern.

To sum it up, what people really mean when they write this code is usually "one or more", not "zero or more", so they really want this:

$ echo "foobar 123" | sed 's/[0-9][0-9]*/blah/'
foobar blah

There are other ways to express "one or more". With BREs (Basic Regular Expressions, those used by sed or grep) it is possible to say

[0-9]\{1,\}

or, with GNU tools, also

[0-9]\+

With ERE (Extended Regular Expressions) tools, like awk or grep -E, "one or more" can be expressed as

[0-9]{1,}

or simply

[0-9]+

Filed under faq Tagged awk, grep, quantifier, regexp, sed, star, text processing

Comments are closed | Permalink

2 Comments

Mussé Redi says:

July 23, 2017 at 12:18

Thank you very much, this was just the explanation I was looking for! I didn't consider two things before reading this; 1) that a string of zero length is considered a match by "zero or more times"; 2) that greediness matters only when multiple matches are possible at the same position.
Nick C says:

August 11, 2010 at 03:20

Thanks for this, it summarises nicely the difference between the various implementations of regex: BRE, ERE and Gnu.

One thing I find frustrating is the inconsistency in control chars, i.e. the []{}()*?+ and so on. I really wish the original implementers had thought it through more carefully. I'm talking about (to use your examples) why the [] of [0-9] are not escaped, but the {} of \{1,\} *are* escaped. There doesn't seem to be any rhyme or reason to it, you just have to memorise how the various control chars are treated.

Then there are the inconsistent implementations or implementations that are only partial: bash, vim and python's regex are all different to each other, for example.

Of course, we are stuck with the situation now, there are just too many scripts that rely on it.

\1

“Zero or more”

2 Comments

BTC

Recent Posts

Categories

Archives

\1

“Zero or more”

2 Comments

BTC

Recent Posts

Categories

Tags

Archives