Home Section Blog Technology Parsing Balanced Parenthesis with Regular Expressions
Parsing Balanced Parenthesis with Regular Expressions PDF Print E-mail
Blog - Technology
Written by Dennis Reinhardt   
Tuesday, 24 March 2009 20:48
One of the well known limitations of Regular Expressions (regex) is that they cannot count balanced pairs of parenthesis.  While true, if we can limit the depth of nesting, we can write a regex which recognizes balanced parenthesis to that depth.  Here is the regex for depth 3:

[\s]*?\(([a-zA-Z0-9_\$\.=+&%!\\|:\?\*\-,/; \[\]]|('[^']*?')|("[^"]*?")|(\(\/[^\/]*?\/[\)]?)|[\s]*?\(([a-zA-Z0-9_\$\.=+&%!\\|:\?\*\-,/; \[\]]|('[^']*?')|("[^"]*?")|(\(\/[^\/]*?\/[\)]?)|[\s]*?\(([a-zA-Z0-9_\$\.=+&%!\\|:\?\*\-,/; \[\]]|('[^']*?')|("[^"]*?")|(\(\/[^\/]*?\/[\)]?))*\))*\))*\)

However, this result cannot easily be decomposed and discussed for purposes of understanding where it came from. Here is the same thing in Python source format with line numbers added for discussion purposes:

1. al1             = "a-zA-Z0-9_\$\."
2. op1             = "=+&%!\\|:\?\*\-"
3. txt1_re         = "\'[^\']*?\'"
4. txt2_re         = '\"[^\"]*?\"'
5. rgx1_re         = "\(\/[^\/]*?\/[\)]?"
6. parn_fill       = "%s%s,/; \[\]"          % (al1,op1)
7. parn_repeat     = "*"
8. parn_inner_skip = "[%s]|(%s)|(%s)|(%s)"   % (parn_fill, txt1_re, txt2_re, rgx1_re)
9. parn1           = "[\s]*?\((%s)%s\)"      % (parn_inner_skip, parn_repeat)
10 parn2           = "[\s]*?\((%s|%s)%s\)"   % (parn_inner_skip, parn1, parn_repeat)
11 parn3           = "[\s]*?\((%s|%s)%s\)"   % (parn_inner_skip, parn2, parn_repeat)

Line 8 defines the "fill" which can appear between parenthesis.  One of those fill items are single quoted text strings (line 3) and double quoted text stringa (line 4).  We must scan over text strings and not merely process a character at a time.  Consider the simple subroutine call:

    calling("(");

If we merely scanned from left to right, we could mistake the left parenthesis inside the quotes as starting a new nesting level.  Thus, we must scan quoted strings as an entity.  


At some point, we end up with an expression for fill characters.  Line 9 is pretty straightforward for processing parenthesis one level deep.  The input possibilities are:

    ...(...)...
    ...(...)...(...)...
    ...(...)...(...)...(...)...
    etc.

and the answer is always, "(...)", the first set of balance parenthesis.  Line 9 does that and does not apply an repetition to the parenthesis literals.

The situation is more complicated at level 2.  The "fill" for level 2 now includes the expression we have already defined for level 1.  So, the definition of level 2 in line 10 includes parn1 inside the repetition.  Likewise, the definition of level 3 includes repeated level 2.

This can be extended to further levels if desired.

Note we have simplified here somewhat as the text strings need to be scanned internally in at least double characters for backslash escape sequences and single characters for everything else.  It is outside the scope of present discussion to re-write lines 3-5, although it is recognized that this must be done for a final system.
Trackback(0)
Comments (1)Add Comment
0
Parsing Balanced Parenthesis with Regular Expressions
written by donaljeo, May 24, 2010
Balanced parentheses are literally a textbook example of a language that cannot be processed with regular expressions. JSON is essentially balanced parentheses plus a bunch of other stuff, with the braces replaced by parens. In the hierarchy of formal languages, JSON is a context-free language. Regular expressions can't parse context-free languages.

Some systems offer extensions to regular expressions that kinda-sorta handle balanced expressions. However they're all ugly hacks, they're all unportable, and they're all ultimately the wrong tool for the job.

In professional work, you would almost always use an existing JSON parser. If you want to roll your own for educational purposes then I'd suggest starting with a simple arithmetic a+ certification grammar that supports + - * / ( ). (JSON has some escaping rules which, while not complex, will make your first attempt harder than it needs to be.) Basically, you'll need to:

1. Decompose the language into an alphabet of symbols
2. Write a context-free grammar in terms of those symbols thatrecognizes the language
3. Convert the grammar into Chomsky normal form, or near enough to make step 5 easy
4. Write a lexer that converts raw text into your input alphabet
5. Write a recursive descent parser that takes your lexer's output, parses it, and produces some kind of output

This is a typical third-year CS assignment at just about any university.

The next step is to find out how complex a JSON string you need to trigger a stack overflow in your recursive parser. Then look at the other types of parsers that can be written, and you'll understand why anyone who has to parse a context-free language in the real world uses a tool like yacc or antlr instead of writing a parser by hand.

If that's more learning than you were looking for then you should feel free to go use an off-the-shelf JSON parser, satisified that you learned something important and useful: the limits of regular expressions.

Write comment

busy
Last Updated on Tuesday, 24 March 2009 20:51
 
Copyright © 2010 TextToolKit. All Rights Reserved.
Joomla! is Free Software released under the GNU/GPL License.