Lately I've been playing around with F# and specifically with using it for lexing and parsing. I've had to dig around a bit to get all the information needed to understand how to get a simple project up and running so I thought compiling that information would be a good idea.
This is going to be a tutorial that will help you set up a simple F# console application that will read the user's input and try to evaluate it based on the basic math syntax that we will declare.
It should be able to evaluate input like:
1 + 1
(1 + -2) - (6 / 3)
Pretty advanced stuff, I know :D
If you don't feel like typing everything and just want to download the code and start tinkering then try the following:
npm install -g github-files-fetcher
github-files-fetcher -url="https://github.com/ThanosPapathanasiou/personal-blog/tree/master/code-examples/lexing-and-parsing"
code lexing-and-parsing
Otherwise, here's a step by step instruction to help you get up and running.
Let's start with a brand new F# console app:
mkdir lexing-and-parsing
cd lexing-and-parsing
dotnet new console -lang "F#"
dotnet run
That should give us the default console output:
Hello world from F#
In order to use the F# lexer and parser we'll need to add the FsLexYacc NuGet package as a reference. So our next step is to run:
dotnet add package FsLexYacc
The way lexing and parsing works in F# is by generating F# code from two files, a .fsl one that contains the rules for the lexer and a .fsy one that contains the rules for the parser.
Let's create them:
echo '' > Lexer.fsl
echo '' > Parser.fsy
Next we'll need to edit the .fsproj file and add these two lines in the <PropertyGroup>
<FsLexToolExe>fslex.dll</FsLexToolExe>
<FsYaccToolExe>fsyacc.dll</FsYaccToolExe>
and these ones in the <ItemGroup>
<FsYacc Include="Parser.fsy">
<OtherFlags>--module Parser</OtherFlags>
</FsYacc>
<FsLex Include="Lexer.fsl">
<OtherFlags>--module Lexer --unicode</OtherFlags>
</FsLex>
<Compile Include="Parser.fsi" />
<Compile Include="Parser.fs" />
<Compile Include="Lexer.fs" />
What these changes do is basically tell the F# compiler that it has the fslex and fsyacc tools at its disposal and that it will have to run the transformation of the Lexer.fsl and Parser.fsy files before the compilation starts. We can even specify the module name that the generated code is part of. I chose to keep it simple with Lexer and Parser.
You will also notice that we are including three files (Parser.fsi, Parser.fs and Lexer.fs) that don't exist in our project yet. These files will be generated from our .fsl and .fsy files by the lexer and parser tools as the output of the transformation I mentioned above.
At this point our project is set up to use the lexer and parser. However, since we don't have anything in the Lexer.fsl and Parser.fsy files yet, the compilation will fail.
So let's add something!
Lexing is the process of taking raw text and converting it into a sequence of tokens. A token is a string with an assigned meaning.
In our case we'll try to create a lexer and parser that will do something basic: evaluate simple mathematical operations, e.g. 1 + 1 should return 2, -1 + 2 = 1, (1 + 2) - (4 + 1) = -2, etc.
So, our lexer should identify these tokens: numbers, the four operators (+, -, *, /), left and right parentheses, and the end of the input (whitespace gets skipped).
This is the finished Lexer.fsl that tokenizes according to the above rules.
{
open FSharp.Text.Lexing
open Parser
let lexeme lexbuf = LexBuffer<_>.LexemeString lexbuf
}
let digit = ['0'-'9']
let operator = ['+' '-' '*' '/']
let whitespace = [' ' '\t']
rule tokenize = parse
| whitespace { tokenize lexbuf }
| ['-']?digit+ { lexeme lexbuf |> System.Int32.Parse |> Number }
| operator { lexeme lexbuf |> OP }
| '(' { LPAREN }
| ')' { RPAREN }
| eof { EOF }
| _ { lexeme lexbuf |> sprintf "Parsing error: %s" |> failwith }
Let's go through it and I'll explain it bit by bit.
First bit:
{
open FSharp.Text.Lexing
open Parser
let lexeme lexbuf = LexBuffer<_>.LexemeString lexbuf
}
I mentioned earlier that Lexer.fsl isn't written in F# but in a custom language. That is true for the rest of this file, but not for the bits between { and }.
Here we are using F# to open the FSharp.Text.Lexing namespace that we need in order to access the input buffer LexBuffer, which allows us to have a look at the part of the input that is being tokenized.
We are also opening our own Parser module (the one that will be autogenerated by Parser.fsy). We need that because the definition of the tokens that we want to output is there.
Finally, we are declaring a small helper function that returns the matching characters of a token as a string, i.e. the characters 1, 2, 3 would be returned as "123".
Second bit:
let digit = ['0'-'9']
let operator = ['+' '-' '*' '/']
let whitespace = [' ' '\t']
rule tokenize = parse
| whitespace { tokenize lexbuf }
| ['-']?digit+ { lexeme lexbuf |> System.Int32.Parse |> Number }
| operator { lexeme lexbuf |> OP }
| '(' { LPAREN }
| ')' { RPAREN }
| eof { EOF }
| _ { lexeme lexbuf |> sprintf "Parsing error: %s" |> failwith }
In this part the actual tokenization process happens.
First, for readability, we define what a digit, an operator and whitespace are. Then we have the actual tokenization function, rule tokenize = parse, followed by all the cases we support (with the final one, _, handling all the undesired states by simply raising an error). These cases are matched against a kind of regular expression.
Take the number case as an example of what happens: we grab the matched characters (lexeme lexbuf), parse them into a System.Int32 and then return that int32 as a Number token (this Number token will be defined in our Parser.fsy file). The operator case works the same way and returns an OP token, while the parentheses and the end of the input simply return the LPAREN, RPAREN and EOF tokens.
Remember: everything between { and } is valid F# code, that's why we are able to use |> to pipe the matched text into the Number and OP tokens.
As an example, if we gave it (1 + 2) + -3 the symbols should be:
LPAREN NUMBER(1) OP("+") NUMBER(2) RPAREN OP("+") NUMBER(-3) EOF
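If you want to see the token stream for yourself once the project compiles, you can temporarily add a small helper like the one below (the printTokens function is just my illustration, not part of the final project):

open FSharp.Text.Lexing

// Hypothetical helper: repeatedly ask the lexer for the next token and print it,
// stopping once it reports the end of the input.
let printTokens (input: string) =
    let lexbuf = LexBuffer<char>.FromString input
    let rec loop () =
        let token = Lexer.tokenize lexbuf
        printfn "%A" token
        if token <> Parser.EOF then loop ()
    loop ()

// printTokens "(1 + 2) + -3" should print:
// LPAREN, Number 1, OP "+", Number 2, RPAREN, OP "+", Number -3, EOF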
That looks good, let's see how to do the parsing now!
Parsing is the process of reading a stream of tokens and attempting to match them to a set of rules with the end result being an abstract syntax tree.
In layman's terms, lexing is the process of defining the alphabet (the actual characters allowed) of your language while parsing is the process of defining the syntax (i.e. subject - verb - object)
The syntax that we want to be able to understand is based upon basic math rules.
NUMBER OP NUMBER
NUMBER OP (NUMBER OP NUMBER)
Here's the end result:
%{
let getOp op =
match op with
| "+" -> ( + )
| "-" -> ( - )
| "/" -> ( / )
| "*" -> ( * )
| _ -> failwith "unknown operator"
%}
%token <int> Number
%token <string> OP
%token EOF LPAREN RPAREN
%start parse
%type <int> parse
%%
parse: expr EOF { $1 }
expr: Number { $1 }
| expr OP expr { $1 |> (getOp $2) <| $3 }
| LPAREN expr RPAREN { $2 }
Like with the lexing above, let's go through it bit by bit.
The first part consists of the following F# code.
%{
let getOp op =
match op with
| "+" -> ( + )
| "-" -> ( - )
| "/" -> ( / )
| "*" -> ( * )
| _ -> failwith "unknown operator"
%}
Here, we are declaring an F# function that matches the operator symbols to actual F# mathematical functions and returns them. So it takes a string argument as input and returns a function of type int -> int -> int (i.e. a function that takes two int arguments and returns a result of type int).
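To make that a bit more concrete, this is roughly how getOp behaves when called from plain F# (illustration only, nothing you need to add to the project):

let add = getOp "+"          // add : int -> int -> int
let three = add 1 2          // 3
let twenty = getOp "*" 4 5   // 20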
Next, we have the token definitions. Notice that the tokens Number and OP have .NET types attached to them.
%token <int> Number
%token <string> OP
%token EOF LPAREN RPAREN
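For reference, fsyacc turns these %token declarations into a discriminated union in the generated Parser.fs. Simplified, it should look roughly like this:

// Simplified sketch of the token type that fsyacc generates in Parser.fs
type token =
  | EOF
  | LPAREN
  | RPAREN
  | OP of string
  | Number of int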
Right after that we specify to the parser what rule it should use as a start and what .net type the parsing operation will return:
%start parse
%type <int> parse
%%
Finally, we have the actual rules. These rules declare what syntax is valid in our language.
Let's go through them one at a time.
parse: expr EOF { $1 }
We are parsing a single expression and we expect the EOF symbol right after. The part between the { and } is a sort of substitution. The parse
rule expects an expression and an EOF
symbol and we only care about the first argument, hence $1 (i.e. try to parse the expr
)
expr: Number { $1 }
| expr OP expr { $1 |> (getOp $2) <| $3 }
| LPAREN expr RPAREN { $2 }
This is a bit more complicated. You can see that an expr can be different things, and the rule is also recursive. It basically states that an expression can be a single Number, two expressions combined with an operator, or an expression wrapped in parentheses.
You can also notice that in the second case, we are calling the getOp function we declared earlier instead of just propagating the operator symbol. So in the generated code the + operator will not be a simple string variable but an int -> int -> int function.
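Since the substitution is plain F#, for an expression like 1 + 2 it boils down to something like this (illustration only):

// With $1 = 1, $2 = "+" and $3 = 2 the substitution becomes:
let result = 1 |> (getOp "+") <| 2   // same as (getOp "+") 1 2, i.e. 1 + 2 = 3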
Let's try and use the input from the lexing example.
The initial input was (1 + 2) + -3 and the symbols we got were:
LPAREN NUMBER(1) OP("+") NUMBER(2) RPAREN OP("+") NUMBER(-3) EOF
The symbols are matched against the parse rule and they give us this abstract syntax tree:
        expr     op        expr
       /          |           \
      /        OP("+")    NUMBER(-3)
     /
LPAREN expr RPAREN
        |
    expr     op      expr
   /          |         \
NUMBER(1)  OP("+")   NUMBER(2)
That would basically be translated and executed like this:
NUMBER(1) OP("+") NUMBER(2) translates to 1 |> (+) <| 2 and results in 3
NUMBER(3) OP("+") NUMBER(-3) translates to 3 |> (+) <| -3 and results in 0
0 is the final value that parse returns.
We have the lexer and parser all ready. It's finally time to use them!
Open Program.fs
and replace the code with this:
open System
open FSharp.Text.Lexing
open Lexer
open Parser
let evaluate (input:string) =
    let lexbuf = LexBuffer<char>.FromString input
    let output = Parser.parse Lexer.tokenize lexbuf
    string output
[<EntryPoint>]
let main argv =
    printfn "Press Ctrl+c to Exit"
    while true do
        printf "Evaluate > "
        let input = Console.ReadLine()
        try
            let result = evaluate input
            printfn "%s" result
        with ex -> printfn "%s" (ex.ToString())
    0 // return an integer exit code
This is a very simple main function that will continue to loop until the user closes the program. It just takes the user's input and calls the evaluate function declared at the beginning of the program.
The evaluate function does a couple of interesting things. First, it creates a lex buffer from the input string given by the user: let lexbuf = LexBuffer<char>.FromString input. Next, it uses that buffer and the Lexer.tokenize function we declared in the lexer as input to the Parser.parse function we declared in our parser: let output = Parser.parse Lexer.tokenize lexbuf. Finally, it converts the output from int to string.
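So, once everything compiles, calling evaluate looks like this (illustration only):

let two  = evaluate "1 + 1"        // "2"
let zero = evaluate "(1 + 2) + -3" // "0"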
That's it. Now we can open our terminal, type dotnet run and we should be greeted with:
Press Ctrl+c to Exit
Evaluate >
Try entering (1 + 2) + -3 and see if it evaluates to 0. Try any number of complicated expressions and see if the result matches what you expect. Have a look at what happens if you use illegal characters.
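For instance, an input with a character our lexer doesn't know about should hit the catch-all _ rule, and the try/with in main should print the exception, roughly like this (the exact stack trace will differ):

Evaluate > 1 & 1
System.Exception: Parsing error: &
   at ... (stack trace follows)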
Try replacing the expr rule with this equivalent:
expr: Number { $1 }
| OP expr expr { (getOp $1) $2 $3 }
| LPAREN expr RPAREN { $2 }
After we recompile, our parser will suddenly expect us to use Polish notation!
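If everything recompiled correctly, a session should now go roughly like this (illustrative):

Evaluate > + 1 2
3
Evaluate > - (+ 1 2) 4
-1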
If you want to understand this a bit better I would encourage you to start tinkering with the parser. Try coming up with your own rules!
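As a starting point, here is one idea (my own example, not part of the original project): teach getOp a new operator in Parser.fsy and add the matching character to the operator character class in Lexer.fsl (let operator = ['+' '-' '*' '/' '%']).

// Hypothetical extension of getOp in Parser.fsy: support the modulo operator.
let getOp op =
    match op with
    | "+" -> ( + )
    | "-" -> ( - )
    | "/" -> ( / )
    | "*" -> ( * )
    | "%" -> ( % )
    | _ -> failwith "unknown operator"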
As always,
Happy coding!
If you want to have a look at the final project, instead of typing it you can get it here
If you want to get into parsing and lexing and expand your understanding a bit more you can have a look at these sites:
If you want to understand a bit more about the abstract syntax trees then you can read wikipedia's entry on Backus-Naur form