技术专题

boost regex正则匹配的 memory exhausted

2009年3月12日 阅读(353)

实际上根本原因在正则表达式写的不够简洁正确,具有歧义。

The problem here is that matching a back-reference is an NP-complete problem – Boost.Regex has a built in limiter that causes it to throw an exception rather than go on trying to find a match "for ever". Other engines just carry on regardless – so while they may find a match in this particular case (eventually), with a bit more text they might equally well hang the machine. In short, there ain’t no good solution except to rewrite your regular expression 🙁 Can’t help much there as I can’t tell what you’re trying to do from the Japanese.

HTH, John.

方法:made sure to add anchors such that I don’t run into catastrophic backtraking.

看几则邮件及讨论:

https://svn.boost.org/trac/boost/ticket/620

bug报告:

Ticket #620 (closed Bugs: Wont Fix) "memory exhausted" in regex matching

nobody johnmaddock regex None

Description ?

Using the following regex:(a*ba*)*cto match the following text:ababababababababababababwill throw an exception saying "Memory exhausted"(tried with g++ 4.0.4 on linux with boost 1.32 and1.33. Similar problem on Windows/mingw with 1.32)The following, more useful expression has the same problem:(a*ba*d?)*cCharles-Henri Gros <chgros@coverity.com>

Attachments

回复:

Logged In: YES user_id=14804

It’s a deliberate feature: it’s possible to writeexpressions that take *forever* to find a match. Mostperl-regex engines don’t protect you against this,Boost.Regex deliberately keeps track of how many states thefinite state machine has visited, and if it looks like themachine is "thrashing" as a result of a badly writtenexpression will throw an expression.

That’s what you’reseeing here, and yes the error message is misleading :-(It’s possible to hack the regex source to up the limits used(see estimate_max_state_count in perl_matcher_common.hpp),but it’s better to fix the regular expression used: problemsoccur when you have something that looks like (a*)*. Theexpression you posted could be improved by using somethinglike (a*b(?:a*d)?)*c which I *think* cuts out the recursivepaths through the state machine. This is also the sort ofproblem for which (?>…) independent sub-expressions wereinvented.HTH, John.

Add/Change #620 ("memory exhausted" in regex matching)

http://lists.boost.org/boost-users/2007/01/24551.php

回复:

Kyle Alons wrote:

>> The attached sample works properly in boost v1.30.2 but fails with a

>> ‘Memory exhausted’ exception in v1.31 through v1.33.1. Is this a

>> bug, and/or do you have any ideas on rephrasing the expression to

>> work in boost 1.33.1? Thanks.

Kyle: the trick with these is to change the expression to make it as unambiguous as possible: the exception is thrown when the regex state machine visits too many states while trying to find a match and then gives it up as a lost cause rather than risking looking indefinitely.

In this case the first (.*|\\n*) is superfluous since . can match \n as well, so the machine can "thrash" if it encounters a lot of whitespace. Using (?!System)* fixed the problem and brought the execution time down enormously.

HTH, John.

 

讨论:Regex failure

Regex failure  by Bob Kline Feb 06, 2007; 06:19am :: Rate this Message:    – Use ratings to moderate (?)

Reply | Reply to Author | Print | View Threaded | Show Only this Message
We have encountered a problem in the regex++ package, which reports
having exhausted memory after examining a very short string with a
regular expression of only modest complexity.  I realize that the
documentation for the package does not specify how much memory usage is
too much, but since the same combination of regular expression and test
string works without problems with a number of other programming
toolsets (e.g., Python, Perl, C#) I’m guessing that the maintainers of
the package would be interested in tracking down the problem (I would if
it were my software).  Here’s a repro case which boils the problem down
to the tiniest example:

#include <boost/regex.hpp>
int main() {
    boost::wregex  e(L"^[^\\s]( ?([^\\s]+ ?)*[^\\s])?$");
    boost::wcmatch m;
    boost::regex_match(L"codeine phosphate ", m, e);
    return 0;
}

I have confirmed that the behavior is present in the most recent version
of the Boost code by retrieving and building the latest set of sources
from CVS this morning.  (I understand the usefulness of having the user
perform this check, but making this a requirement, as the web
instructions for submitting bug reports do, may be eliminating a
substantial number of valuable reports; it took a number of attempts,
with the connection to the CVS server hanging several times, before I
could even get to the very lengthy build step.)

The failure is not triggered if a version of regex_match is used which
does not take the match_results argument, but then of course we don’t
get access to the match results.

I’m pretty sure that the expression is boiled down to the least
ambiguous form (without changing the semantics).  In plain English, it’s
looking for strings that have no leading or trailing whitespace, and for
which any internal whitespace runs are comprised solely of a single
blank character.  Doesn’t seem like a very esoteric pattern.

We have reproduced the behavior on Linux and on Windows.

Hope this is useful.  Feel free to contact me if you need any further
information.

Bob Kline

_______________________________________________
Boost-users mailing list
Boost-users@…
http://lists.boost.org/mailman/listinfo.cgi/boost-users
 

 Re: Regex failure  by Peter Dimov Feb 06, 2007; 08:50am :: Rate this Message:    – Use ratings to moderate (?)

Reply | Reply to Author | Print | View Threaded | Show Only this Message
Bob Kline wrote:

> #include <boost/regex.hpp>
> int main() {
>    boost::wregex  e(L"^[^\\s]( ?([^\\s]+ ?)*[^\\s])?$");
>    boost::wcmatch m;
>    boost::regex_match(L"codeine phosphate ", m, e);
>    return 0;
> }

[…]

> I’m pretty sure that the expression is boiled down to the least
> ambiguous form (without changing the semantics).  In plain English,
> it’s looking for strings that have no leading or trailing whitespace,
> and for which any internal whitespace runs are comprised solely of a
> single blank character.

I’ve no idea which part of the regexp causes the library to give up, but
isn’t there a much simpler reformulation:

  ^[^\s]+( [^\s]+)*$

that achieves the same result? In addition to running at all, it should also
run much faster. 🙂 Unless I’ve missed something, of course.

_______________________________________________
Boost-users mailing list
Boost-users@…
http://lists.boost.org/mailman/listinfo.cgi/boost-users
 

 Re: Regex failure  by John Maddock Feb 06, 2007; 08:31pm :: Rate this Message:    – Use ratings to moderate (?)

Reply | Reply to Author | Print | View Threaded | Show Only this Message
Bob Kline wrote:
> We have encountered a problem in the regex++ package, which reports
> having exhausted memory after examining a very short string with a
> regular expression of only modest complexity.  I realize that the
> documentation for the package does not specify how much memory usage
> is too much, but since the same combination of regular expression and
> test string works without problems with a number of other programming
> toolsets (e.g., Python, Perl, C#) I’m guessing that the maintainers of
> the package would be interested in tracking down the problem (I would
> if it were my software).  Here’s a repro case which boils the problem
> down to the tiniest example: …[show rest of quote]
This is probably not what you want to hear, but the behaviour is by design:
Perl-regexes are in the worst case NP-Hard problems to solve, so there is a
built in state-count that tracks how many states in the machine are visited
and throws an exception if the number of states visited looks too large
compared to the size of the string.  The aim is to weed out "bad" regexes
that may cause problems with certain texts and not others before they can do
too much damage.  Obviously this is a heuristic that sometimes throws too
early or too late.

When this issue occurs it’s almost always a problem with the expression
being ambiguous: each time a repeat or other decision has to be made then
either alternative can be taken, so if no match is found the regex state
machine has to thrash around trying ever possible combination.

Is there any reason why you can’t simplify your expression to just
"^[^\\s]+( ([^\\s]+)*)?$" which does the same thing much more efficiently?

HTH, John.

_______________________________________________
Boost-users mailing list
Boost-users@…
http://lists.boost.org/mailman/listinfo.cgi/boost-users
 

 Re: Regex failure  by Nat Goodspeed Feb 06, 2007; 11:30pm :: Rate this Message:    – Use ratings to moderate (?)

Reply | Reply to Author | Print | View Threaded | Show Only this Message
—–Original Message—–
From: boost-users-bounces@… on behalf of John Maddock
Sent: Tue 2/6/2007 7:31 AM
To: boost-users@…
Subject: Re: [Boost-users] Regex failure

Is there any reason why you can’t simplify your expression to just
"^[^\\s]+( ([^\\s]+)*)?$" which does the same thing much more efficiently?

[Nat] <offtopic> Is there any research work about programmatically optimizing a given regex to an equivalent, but more efficient, one? </offtopic>

_______________________________________________
Boost-users mailing list
Boost-users@…
http://lists.boost.org/mailman/listinfo.cgi/boost-users

 Re: Regex failure  by John Maddock Feb 07, 2007; 12:56am :: Rate this Message:    – Use ratings to moderate (?)

Reply | Reply to Author | Print | View Threaded | Show Only this Message
Nat Goodspeed wrote:
> [Nat] <offtopic> Is there any research work about programmatically
> optimizing a given regex to an equivalent, but more efficient, one?
> </offtopic>

That would certainly be useful.  Last time I did a literature search in this
area (a few years ago now I admit) there was a lot of work on *simplified*
regular expressions: there was some overlap with search engine technology
which made this a hot topic for a while.  Likewise anything that can be
reduced to a DFA ends up being automatically simplified, but I found very
little for Perl-style expressions.

John.

_______________________________________________
Boost-users mailing list
Boost-users@…
http://lists.boost.org/mailman/listinfo.cgi/boost-users
 

 Re: Regex failure  by Bob Kline Feb 07, 2007; 02:33am :: Rate this Message:    – Use ratings to moderate (?)

Reply | Reply to Author | Print | View Threaded | Show Only this Message
On 2/6/2007 7:31 AM, John Maddock wrote:
> Is there any reason why you can’t simplify your expression to just
> "^[^\\s]+( ([^\\s]+)*)?$" which does the same thing much more efficiently?

No reason at all.  I have no idea why we didn’t see that when were
looking at this originally, unless it was just way too late in the
afternoon that day. 🙂  Thanks, John (and Peter, who also suggested the
same simplification), and sorry for the wasted bandwidth.

Bob

_______________________________________________
Boost-users mailing list
Boost-users@…
http://lists.boost.org/mailman/listinfo.cgi/boost-users
 

You Might Also Like