Regex Block and Blocked Context Switch

Although regular expressions are a very powerful feature, they cannot express some complex language constructions. This is due to a general limitation of the parser: the scope of a regular expression is limited to a single line of text, so it cannot match across multiple lines of the parsed text. Programming languages also often have constructions nested inside each other an unlimited number of times, and regular expressions do not help much there either.

To define more complex syntax structures and context-free grammar constructions, use the special element <RegexBlock>.

 

Element: <RegexBlock>

 

Attribute: innerScheme, type: string, case-sensitive, scheme reference
Causes the parser to switch to the specified scheme to parse the matched text.
Same as the innerScheme attribute of <RegexRule>; see its description for details.

 

Attribute: start, type: Regular expression
This attribute gives the regular expression that triggers the switch to the scheme referred to by the innerScheme attribute. Nota bene: sub-scheme parsing starts from the first symbol of the start regex match, unless you specify a start_token0..N attribute (see below).
Thus, the matched text is parsed again, by the rules of innerScheme.

 

Attribute: end, type: Regular expression
This attribute gives the regular expression that ends parsing by the scheme referred to by the innerScheme attribute. After the text matched by the end regex, the parser switches back to the outer scheme. Inside the nested innerScheme, the exit rule given by the end regexp has the highest priority. Also, in the end regexp you can refer to any group matched by the start regexp, using the $0..$9 syntax for the group number. See Example2 for details.

Attribute: start_moreWordSeparators, type: string
This attribute extends the default word separator chars used by the \b regexp operator, for the start regexp only.
See the corresponding topic in the regexps section.
See the <KeywordRegex> element for an example.

Attribute: start_moreWordChars, type: string
This attribute extends the default word chars used by the \b regexp operator, for the start regexp only. See the links above.

Attribute: end_moreWordSeparators, type: string
This attribute extends the default word separator chars used by the \b regexp operator, for the end regexp only. See the links above.

Attribute: end_moreWordChars, type: string
This attribute extends the default word chars used by the \b regexp operator, for the end regexp only. See the links above.
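
As an illustration of the word-boundary attributes, the following is a minimal sketch, not taken from a shipped scheme. It assumes a hypothetical language where blocks run from the keyword begin to the keyword end, and where '-' may appear inside identifiers. The scheme names Main and Block and the choice of '-' are assumptions made for this example. With '-' added to the word chars, \b is expected not to match between begin and '-', so identifiers such as begin-section or end-marker should not open or close the block:

<Scheme name='Block' defaultToken='default'>
</Scheme>

<Scheme name='Main' defaultToken='default'>
    <!-- Hypothetical begin ... end block; '-' is treated as a word char
         for both boundary checks (assumption made for this sketch) -->
    <RegexBlock start='\bbegin\b'
                end='\bend\b'
                start_moreWordChars='-'
                end_moreWordChars='-'
                innerScheme='Block'/>
</Scheme>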

 

For easier description of nested scheme bounds, you can use the start_token0..N and end_token0..N attributes.

 

Attribute: start_token0..N, type: string, case-sensitive, <Token> reference
This attribute splits the text matched by the start regexp into tokens. The splitting rules are the same as for token0..N of the <RegexRule> element (see above).

 

Attribute: end_token0..N, type: string, case-sensitive, <Token> reference
This attribute splits the text matched by the end regexp into tokens. The splitting rules are the same as for token0..N of the <RegexRule> element (see above).
Nota bene: tokens produced by the start_token and end_token attributes belong to the inner scheme, not the outer one. This is important for coloring and for the further syntax parsing used for fold generation.
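
To show how start_token0 and end_token0 interact with nesting, here is a hedged sketch of comments that may contain further comments, in the (* ... *) style. The scheme name NestedComment and the token name commentBound are illustrative, and the sketch assumes that a scheme may name itself in innerScheme, which this section does not state explicitly. Because start_token0 is given, the opening (* should be emitted as a commentBound token rather than parsed again by the inner scheme, so the inner rule is expected to fire only on a further (*:

<Scheme name='NestedComment' defaultToken='comment'>
    <!-- Assumption: a scheme may refer to itself as innerScheme,
         so each nested (* opens one more level -->
    <RegexBlock start='\(\*'
                end='\*\)'
                start_token0='commentBound'
                end_token0='commentBound'
                innerScheme='NestedComment'/>
</Scheme>

<Scheme name='Main' defaultToken='default'>
    <!-- Outermost comment level -->
    <RegexBlock priority='100'
                start='\(\*'
                end='\*\)'
                start_token0='commentBound'
                end_token0='commentBound'
                innerScheme='NestedComment'/>
</Scheme>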

 

Attribute: priority, type: Integer
This attribute gives the priority of this rule (for the start regexp) when the parsed text is acceptable for several rules. See the description of the <RegexRule> priority attribute above for more.

 

Attribute: innerContentGroup, type: Integer
Gives the group number in the rule’s regexp used to get the token “contents” for further syntax parsing. For more, see the “Syntax Blocks” section.

 

Attribute: chainBlock, type: string, case-sensitive, <ChainBlock> reference
This attribute instructs the parser not to switch back to the outer scheme after the end of innerScheme. Instead, the parser switches to the innerScheme of the specified <ChainBlock>.
See the <ChainBlock> element for more information.

 

Example1 of <RegexBlock>:

 

<Scheme name='Comment' inherit='Text' defaultToken='comment'>
    <!-- We will fold big comments -->
    <SyntaxBlock capture="true">
        <!-- commentBound comes from the outer scheme rule
             (start_token/end_token) of the RegexBlock -->
        <Start> commentBound  </Start>
        <End> commentBound  </End>
    </SyntaxBlock>
</Scheme>

<Scheme name='CPP' defaultToken='defaultCPP'>
    <!-- Here we give C/C++ - style multiline comment -->
    <RegexBlock priority='100'
                start='\/\*'
                end='\*\/'
                start_token0='commentBound'
                end_token0='commentBound'
                innerScheme='Comment'/>

    <!-- Another form for element -->
    <!-- We don’t want to create rules
         for comment bounds again, in
         Comment scheme. We use start_token,
         end_token attributes to avoid that. -->
    <RegexBlock priority='100'
                start_token0='commentBound'
                end_token0='commentBound'
                innerScheme='Comment'>
        <Start> \/\* </Start>
        <End> \*\/ </End>
    </RegexBlock>
</Scheme>

 

Example2 of <RegexBlock> (referencing start regex groups):

 

<Scheme name='HereDoc'
        defaultToken='string'>
</Scheme>

<Scheme name='Main' defaultToken='default'>
    <RegexBlock innerScheme='HereDoc'>
        <Start>  [^ &lt; ]? &lt; &lt; &lt; (\w+)  </Start>
        <End> ^ $1 ; </End>
    </RegexBlock>
</Scheme>

 

This highlights PHP heredoc syntax, from the opening <<<SOMETEXT to the closing SOMETEXT:

 

<<<_HEREDOC_
   Here is  Php 'Heredoc' string;        No substitutions performed.
_HEREDOC_;
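
The same group referencing can be sketched for other delimiters that must match at both ends. The fragment below is an illustration only: it assumes Lua-style long strings, which open with [, any number of = signs, and [, and close with ], the same number of = signs, and ]. The scheme names LongString and LuaMain are hypothetical, and the sketch assumes that an empty group (the plain [[ ... ]] form) is substituted as empty text.

<Scheme name='LongString' defaultToken='string'>
</Scheme>

<Scheme name='LuaMain' defaultToken='default'>
    <!-- Group 1 captures the run of '=' signs in the opening bracket;
         $1 in the end regexp requires the same run in the closing one -->
    <RegexBlock innerScheme='LongString'
                start='\[(=*)\['
                end='\]$1\]'/>
</Scheme>

With these rules, [==[ some text ]==] would be treated as one string, while a ]=] inside it would not end the block.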