Introduction:
Every programmer, at some point, wants to write a simple regular expression parser.
It is not that difficult once you know the steps.
It is a two-step process:
i. Convert regular expression to NFA
ii. Then traverse through the NFA to match the string.
The first question that arises is: what is an NFA?
Nondeterministic finite automaton :-
In automata theory, a nondeterministic finite automaton (NFA), or nondeterministic finite state machine, is a finite state machine that (1) does not require input symbols for state transitions and (2) is capable of transitioning to zero or two or more states for a given start state and input symbol. (Borrowed from Wikipedia)
The point is that we can represent a regular expression in the form of an NFA.
To convert a regular expression to an NFA, you just need to know the basics of how the different symbols are represented in an NFA.
We can use the basic building blocks given in the diagram below to convert any regex to an NFA.
For example, if we want to convert the regular expression a(b|c) to an NFA, it will become:
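As a hedged illustration, one possible NFA for a(b|c) can be written down by hand as a transition table. The state numbers, the `EPSILON` marker, and the class and method names below are all assumptions made for this sketch. It is not necessarily the exact automaton a given construction would produce, but it shows both properties from the definition: transitions that need no input symbol, and a state with more than one successor.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hand-written NFA for a(b|c); state numbers are arbitrary choices for this sketch. */
final class HandBuiltNfa {
    // Assumed marker meaning "this transition consumes no input symbol".
    static final char EPSILON = '\0';

    // TRANSITIONS.get(state).get(symbol) is the set of possible next states.
    static final Map<Integer, Map<Character, Set<Integer>>> TRANSITIONS = new HashMap<>();

    static void addTransition(int from, char symbol, int to) {
        TRANSITIONS.computeIfAbsent(from, k -> new HashMap<>())
                   .computeIfAbsent(symbol, k -> new TreeSet<>())
                   .add(to);
    }

    static {
        addTransition(0, 'a', 1);      // read 'a'
        addTransition(1, EPSILON, 2);  // branch into the "b" alternative
        addTransition(1, EPSILON, 4);  // branch into the "c" alternative
        addTransition(2, 'b', 3);
        addTransition(4, 'c', 5);
        addTransition(3, EPSILON, 6);  // both alternatives join in state 6,
        addTransition(5, EPSILON, 6);  // which this sketch treats as accepting
    }

    /** All states reachable from the given state on the given symbol (possibly none). */
    static Set<Integer> next(int state, char symbol) {
        Map<Character, Set<Integer>> row =
                TRANSITIONS.getOrDefault(state, Collections.emptyMap());
        return row.getOrDefault(symbol, Collections.emptySet());
    }

    public static void main(String[] args) {
        System.out.println(next(1, EPSILON)); // prints [2, 4]
        System.out.println(next(0, 'b'));     // prints []
    }
}
```

State 1 has two epsilon successors, and state 0 has no successor at all on 'b'. A deterministic automaton would allow neither.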
For a simple explanation of how to convert a regex to an NFA, you can watch this video.
Now that you have the basics of converting a regex to an NFA, let us see how to do it in a program.
As I said, there are two steps for parsing a regex. Let's start with the first step.
i. Convert regular expression to NFA:-
To convert a regex to an NFA, we will break the work down into three steps.
a. Put a concatenation operator (. dot) wherever concatenation is required.
b. Convert that concatenated regex to postfix.
c. Then build the NFA from the different parts (as we did in the example above).
a. Put concatenation operator wherever concatenation is required:-
As a simple rule, concatenation is required between these pairs: ab, a(, )a, *a, *(, )(.
So put the concatenation operator (. dot) between them.
For example, the regex "(a|b)*cd" will become "( a | b ) * . c . d" [spaces are added for readability].
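The six pairs above boil down to one check: insert a dot when the left character can end an operand (a literal, `)` or `*`) and the right character can start one (a literal or `(`). Here is a minimal sketch of that rule. The class and method names are hypothetical, `+` is not handled, and a literal `.` in the input would clash with the concatenation marker.

```java
/** Inserts an explicit '.' wherever two adjacent characters are implicitly concatenated. */
final class ConcatInserter {
    static final char CONCAT = '.';

    /** True for characters this sketch treats as plain operands (literals). */
    static boolean isOperand(char c) {
        return c != '(' && c != ')' && c != '*' && c != '|' && c != CONCAT;
    }

    static String insertConcat(String regex) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < regex.length(); i++) {
            char current = regex.charAt(i);
            out.append(current);
            if (i + 1 < regex.length()) {
                char next = regex.charAt(i + 1);
                // Left side can end an operand: literal, ')' or '*'.
                boolean leftEnds = isOperand(current) || current == ')' || current == '*';
                // Right side can start an operand: literal or '('.
                boolean rightStarts = isOperand(next) || next == '(';
                if (leftEnds && rightStarts) {
                    out.append(CONCAT);
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(insertConcat("(a|b)*cd")); // prints (a|b)*.c.d
    }
}
```

The output matches the article's example once the readability spaces are removed.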
b. Convert that concatenated regex to postfix:-
For a quick overview of how to convert an arithmetic expression to postfix, this video is a good starter.
Converting a regex is very similar, just with different operators that have different precedences.
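These are the precedences I have given to the different operators:
PLUS('+', 1), STAR('*', 2), VERTICAL_BAR('|', 3), CLOSING_PARENTHESES(')', 4), DOT('.', 0), OPEN_PARENTHESES('(', -1);
I have not completely tested PLUS('+', 1) and STAR('*', 2), so try them out and make changes accordingly. Note that the usual regex convention is that star binds tightest, then concatenation, then alternation. If a larger number here means tighter binding, ranking VERTICAL_BAR above DOT would group a|bc as (a|b)c, so check the ordering against the expressions you need.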
So after converting our concatenated regex to postfix it will become "a b | * c . d ."
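The conversion can be sketched with the same operator-stack approach used for arithmetic expressions. This is a minimal sketch with hypothetical class and method names, and it does not use the precedence table above. It assumes the conventional ordering (star, then concatenation, then alternation, where a larger number binds tighter) and expects the concatenation dots to be present already.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Converts an infix regex with explicit '.' concatenation into postfix. */
final class RegexToPostfix {
    /** Larger number binds tighter. Conventional ordering, assumed for this sketch. */
    static int precedence(char op) {
        switch (op) {
            case '*': return 3;
            case '.': return 2;
            case '|': return 1;
            default:  return 0; // '(' waits on the stack until its ')' arrives
        }
    }

    static boolean isOperator(char c) {
        return c == '*' || c == '.' || c == '|';
    }

    static String toPostfix(String infix) {
        StringBuilder out = new StringBuilder();
        Deque<Character> operators = new ArrayDeque<>();
        for (char c : infix.toCharArray()) {
            if (c == '(') {
                operators.push(c);
            } else if (c == ')') {
                // Flush everything back to the matching '(' and discard it.
                while (!operators.isEmpty() && operators.peek() != '(') {
                    out.append(operators.pop());
                }
                if (!operators.isEmpty()) {
                    operators.pop();
                }
            } else if (isOperator(c)) {
                // Pop operators that bind at least as tightly (left-associative).
                while (!operators.isEmpty() && precedence(operators.peek()) >= precedence(c)) {
                    out.append(operators.pop());
                }
                operators.push(c);
            } else {
                out.append(c); // operand goes straight to the output
            }
        }
        while (!operators.isEmpty()) {
            out.append(operators.pop());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toPostfix("(a|b)*.c.d")); // prints ab|*c.d.
        System.out.println(toPostfix("a|b.c"));      // prints abc.|
    }
}
```

The first line of output is the article's "a b | * c . d ." without the spaces. The second shows that, under the assumed ordering, a|bc groups as a|(bc).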
c. Then build NFA using different part :-
This is the easy step. After getting the postfix expression, all you need to do is follow these simple steps.
i. If the token is an operand, push it onto the stack.
ii. If the token is an operator, apply it and push the result back onto the stack.
So for our example "a b | * c . d .", it will do these operations:
PUSH a
PUSH b
UNION
STAR
PUSH c
CONCAT
PUSH d
CONCAT
POP R
As we can see, it is very similar to the evaluation of arithmetic expressions. The difference is that in regular expressions the star operation pops only one element from the stack and evaluates the star operator.
The PUSH and POP operations actually work with a stack of simple NFA objects. If we PUSH symbol a on the stack, the operation creates two state objects on the heap and creates a transition object on symbol a from state 1 to state 2. The UNION operation pops two elements, makes the transformation and pushes the result on the stack.
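The stack operations above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the class names, the `EPSILON` marker, and the exact epsilon wiring (a Thompson-style construction) are assumptions. `countStates` is a hypothetical helper added only so the result can be inspected.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Builds an NFA from a postfix regex using a stack of NFA fragments. */
final class PostfixToNfa {
    static final char EPSILON = '\0'; // assumed marker for "no input consumed"

    static final class State {
        final List<Transition> transitions = new ArrayList<>();
        void connect(char symbol, State target) { transitions.add(new Transition(symbol, target)); }
    }

    static final class Transition {
        final char symbol;
        final State target;
        Transition(char symbol, State target) { this.symbol = symbol; this.target = target; }
    }

    /** A fragment with one entry state and one exit (accepting) state. */
    static final class Nfa {
        final State start;
        final State accept;
        Nfa(State start, State accept) { this.start = start; this.accept = accept; }
    }

    // PUSH: two new states joined by one transition on the symbol.
    static Nfa symbol(char c) {
        State start = new State(), accept = new State();
        start.connect(c, accept);
        return new Nfa(start, accept);
    }

    // CONCAT: the exit of the first fragment flows into the entry of the second.
    static Nfa concat(Nfa first, Nfa second) {
        first.accept.connect(EPSILON, second.start);
        return new Nfa(first.start, second.accept);
    }

    // UNION: a new entry branches into both fragments; both exits join a new exit.
    static Nfa union(Nfa left, Nfa right) {
        State start = new State(), accept = new State();
        start.connect(EPSILON, left.start);
        start.connect(EPSILON, right.start);
        left.accept.connect(EPSILON, accept);
        right.accept.connect(EPSILON, accept);
        return new Nfa(start, accept);
    }

    // STAR: pops only one fragment; allows skipping it or looping back into it.
    static Nfa star(Nfa inner) {
        State start = new State(), accept = new State();
        start.connect(EPSILON, inner.start);
        start.connect(EPSILON, accept);              // zero repetitions
        inner.accept.connect(EPSILON, inner.start);  // repeat
        inner.accept.connect(EPSILON, accept);       // stop repeating
        return new Nfa(start, accept);
    }

    static Nfa build(String postfix) {
        Deque<Nfa> stack = new ArrayDeque<>();
        for (char c : postfix.toCharArray()) {
            switch (c) {
                case '*': stack.push(star(stack.pop())); break;
                case '.': { Nfa second = stack.pop(); Nfa first = stack.pop();
                            stack.push(concat(first, second)); break; }
                case '|': { Nfa right = stack.pop(); Nfa left = stack.pop();
                            stack.push(union(left, right)); break; }
                default:  stack.push(symbol(c));
            }
        }
        return stack.pop(); // the final POP yields the complete NFA
    }

    /** Hypothetical helper: counts the states reachable from the start state. */
    static int countStates(Nfa nfa) {
        Set<State> seen = new HashSet<>();
        Deque<State> todo = new ArrayDeque<>();
        todo.push(nfa.start);
        while (!todo.isEmpty()) {
            State s = todo.pop();
            if (seen.add(s)) {
                for (Transition t : s.transitions) todo.push(t.target);
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println(countStates(build("ab|*c.d."))); // prints 12
    }
}
```

With this wiring, each of the four symbols contributes two states, and UNION and STAR add two more each, which gives the 12 states reported for the example.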
ii. Then traverse through the NFA to match the string:-