“I hear and I forget. I see and I remember. I do and I understand.” - Confucius

True learning happens only when we turn knowledge into action and practice it. If you only watch or listen without an action plan, that knowledge will hardly ever become a skill.

Today's post is an important one. It contains a lot of new knowledge and key concepts related to compilers. I really hope you will set aside some time (a few hours in the evening, or more) to actively learn the material in this and the following lessons. Active learning means reading and mastering the concepts in the article, then practicing them by writing code.

The full code for part 2 is available on GitHub.

Before starting this article, let me briefly recap the previous one. There we built a lexical analyzer, which breaks an HTML string into a stream of tokens. Each token carries a type (Element, Text, or Attribute in HTML) and a value, which is the string the token represents.
For example, if we have the following HTML string:

<p>My mother has <span style="color:blue">blue</span> eyes.</p>

We get an array of tokens as follows:

(
   {
      type: TAG_OPEN,
      value: 'p'
   },
   {
      type: TEXT,
      value: 'My mother has '
   },
   {
      type: TAG_OPEN,
      value: 'span'
   },
   {
      type: ATTR,
      value: ('style', 'color:blue')
   },
   {
      type: TEXT,
      value: 'blue'
   },
   {
      type: TAG_CLOSE,
      value: 'span'
   },
   ... ,
   {
      type: EOF,
      value: None
   }
)
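
As a rough sketch, the token list above could be represented in Python as follows. The Token class and the string constants are assumptions made for illustration, modeled on the printed output shown later in this post; the actual code in the repository may look different.

```python
# Hypothetical token type constants (plain strings for simplicity).
TAG_OPEN, TAG_CLOSE, ATTR, TEXT, EOF = 'TAG_OPEN', 'TAG_CLOSE', 'ATTR', 'TEXT', 'EOF'


class Token:
    """Hypothetical token: a type plus the value it represents."""

    def __init__(self, type, value):
        self.type = type    # one of the constants above
        self.value = value  # a string, an attribute pair, or None for EOF

    def __repr__(self):
        return 'Token({}, {!r})'.format(self.type, self.value)


# The token stream for the example sentence, ending with an EOF marker.
tokens = [
    Token(TAG_OPEN, 'p'),
    Token(TEXT, 'My mother has '),
    Token(TAG_OPEN, 'span'),
    Token(ATTR, ('style', 'color:blue')),
    Token(TEXT, 'blue'),
    Token(TAG_CLOSE, 'span'),
    Token(TEXT, ' eyes.'),
    Token(TAG_CLOSE, 'p'),
    Token(EOF, None),
]
```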

That is all the lexical analysis stage needs to do.

Syntax analysis phase

This stage discovers the structure of the token stream, or in other words converts the token stream into another format. This process is called parsing, or syntax analysis. The part of the compiler that performs this task is called the parser, or syntax analyzer.
To parse the complex syntax of Angular or other programming languages, we need to build an intermediate representation (IR), and a tree is the data structure that best suits an IR.

Let me first talk about the term tree, which you have probably heard of already.

A tree is a data structure consisting of one or more nodes arranged in a hierarchy.
A tree has one root, which is the top node.
Every node except the root has exactly one parent.
A parent may have one or more children. Children are ordered from left to right.
A node without children is called a leaf node.
A node that has children and is not the root is called an interior node.
Each child can also be considered the root of a subtree (a tree within the tree).
In computer science, a tree is drawn from top to bottom, starting at the root node, with its branches growing downward from left to right.
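
The properties above can be sketched with a small Python class. The Node class and its method names are hypothetical, and the labels are borrowed from the example sentence only for illustration; they are not meant to reproduce any particular figure.

```python
class Node:
    """Hypothetical tree node with an ordered list of children."""

    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []  # ordered left to right
        self.parent = None              # stays None only for the root
        for child in self.children:
            child.parent = self

    def is_root(self):
        return self.parent is None

    def is_leaf(self):
        return not self.children

    def is_interior(self):
        # Has children, but is not the top node.
        return not self.is_root() and not self.is_leaf()


# A small tree: 'p' is the root, 'span' is an interior node,
# and the text nodes are leaves.
root = Node('p', [
    Node('My mother has '),
    Node('span', [Node('blue')]),
    Node(' eyes.'),
])
```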

Here is a tree for the HTML:

<p>My mother has <span style="color:blue">blue</span> eyes.</p>

The IR that I use throughout this series is called an abstract syntax tree (AST).
The picture above is also an example of an abstract syntax tree. As you can see, the nodes are just text; they are not syntax or rules (I will explain what a rule is below), and they carry no detailed information about the node. That is why it is called abstract.

Typically, a parser restructures the token stream and generates an AST based on a specific grammar. In the next part of this article I will talk about grammars and why they matter.

A grammar is a notation for describing the syntax of a programming language. It is also known as a context-free grammar (grammar for short). For example, here is the grammar for the HTML tree I used above:

Body: (Text | Tag)*
Tag :  OPEN_TAG (ATTRIBUTE)* (Body)? CLOSE_TAG

This notation is called EBNF (Extended Backus-Naur Form). I will go through the grammar in detail so you can understand it:

A grammar consists of a sequence of rules, also called productions. There are 2 rules in my grammar:

A rule consists of a non-terminal, called the head or left-hand side of the production, a colon ':' in the middle, and a sequence of terminals and/or non-terminals, called the body or right-hand side of the production.

Tokens such as Text, OPEN_TAG, ATTRIBUTE, and CLOSE_TAG are called terminals, and variables like Body and Tag are called non-terminals. A non-terminal consists of a sequence of terminals and/or non-terminals, as illustrated below:

The non-terminal on the left-hand side of the first rule is called the start symbol. In my case, the start symbol is Body.

You can read the Body rule as follows: a Body is any sequence of Text and Tag items, possibly empty. A Tag consists of an opening tag (OPEN_TAG), followed by zero or more attributes (ATTRIBUTE), optionally followed by another Body, and finally a closing tag (CLOSE_TAG).

The grammar uses the following symbols:

| - Alternatives. A vertical bar means "or", so Text | Tag means Text or Tag.
(...) - Groups terminals and/or non-terminals, as in (Text | Tag).
(...)* - Matches the contents of the group zero or more times.
(...)? - Matches the contents of the group zero or one time.

A grammar defines a language by describing the sentences the language can form. For example, a Vietnamese sentence consists of a subject and a predicate, and the grammar combines the two to form a sentence. The grammar I defined above works the same way: starting from Body, you repeatedly replace a non-terminal (such as Tag) with the body of its rule until you get a complete body. A complete body consists only of terminals and contains no non-terminals. A grammar cannot define a language if its non-terminals cannot eventually be replaced by terminals.
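
The substitution process described above can be sketched in a few lines of Python. The helper names substitute and is_complete are hypothetical, and the steps shown are just one possible derivation that ends in the token sequence of the example sentence.

```python
TERMINALS = {'Text', 'OPEN_TAG', 'ATTRIBUTE', 'CLOSE_TAG'}


def is_complete(symbols):
    """A complete body contains terminals only."""
    return all(symbol in TERMINALS for symbol in symbols)


def substitute(symbols, non_terminal, replacement):
    """Replace the first occurrence of non_terminal with replacement."""
    i = symbols.index(non_terminal)
    return symbols[:i] + replacement + symbols[i + 1:]


# Start from the start symbol and keep replacing non-terminals.
steps = [['Body']]
for non_terminal, replacement in [
    ('Body', ['Tag']),                                   # (Text | Tag)*, one Tag
    ('Tag', ['OPEN_TAG', 'Body', 'CLOSE_TAG']),          # no attributes
    ('Body', ['Text', 'Tag', 'Text']),                   # text, nested tag, text
    ('Tag', ['OPEN_TAG', 'ATTRIBUTE', 'Body', 'CLOSE_TAG']),
    ('Body', ['Text']),
]:
    steps.append(substitute(steps[-1], non_terminal, replacement))

for step in steps:
    print(' '.join(step), '(complete)' if is_complete(step) else '')
```

Only the last step is complete: every earlier step still contains Body or Tag.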

Okay, if this is your first encounter with grammars and the terminology around them, you probably feel something like this.

If so, congratulations: you have started your journey into compilers. The road ahead is long, but you have already set one foot on it. Better to start walking now than to put it off until later.

Now I will explain how to map a grammar to code. You just need to follow these guidelines:

1. Each rule R defined in the grammar becomes a method of the same name in the parser, and each reference to the rule becomes a method call: R(). The body of the method follows the body of the rule, using these same guidelines.
2. Alternatives (a1 | a2 | aN) become an if-elif-else statement.
3. An optional grouping (...)* becomes a while statement that can loop zero or more times.
4. An optional grouping (...)? becomes an if statement.
5. Each reference to a token T becomes a call to the method eat: eat(T). The eat method consumes the token T if current_token matches T, then gets the next token and assigns it to current_token. (current_token is a variable that holds the current token during parsing.)
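
The eat method is only described in words above, so here is a minimal sketch of how it might look. The Parser constructor, the namedtuple-based Token, and the choice to raise SyntaxError on a mismatch are all assumptions; the post does not say what happens when the token does not match.

```python
from collections import namedtuple

# Hypothetical token shape and token types for this sketch.
Token = namedtuple('Token', ['type', 'value'])
TEXT, EOF = 'TEXT', 'EOF'


class Parser:
    def __init__(self, tokens):
        self.tokens = iter(tokens)
        # current_token holds the token currently being examined.
        self.current_token = next(self.tokens)

    def eat(self, token_type):
        # Assumption: a mismatch is reported as a syntax error.
        if self.current_token.type != token_type:
            raise SyntaxError('expected {}, got {}'.format(
                token_type, self.current_token.type))
        consumed = self.current_token
        # Move on to the next token; stay on EOF once the stream runs out.
        self.current_token = next(self.tokens, Token(EOF, None))
        return consumed


parser = Parser([Token(TEXT, 'hello'), Token(EOF, None)])
parser.eat(TEXT)  # consumes the TEXT token; current_token is now EOF
```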

A visual representation is as follows:

Okay, I will now convert the grammar to code step by step, following the guidelines above.

There are 2 rules in the grammar: the Body rule and the Tag rule. I will start with the Body rule. First, following guideline 1, create a method named body. The body of this rule is an optional grouping (...)*, so it becomes a while loop according to guideline 3, and the alternative (Text | Tag) becomes an if-elif-else statement according to guideline 2. Altogether, we get:

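def body(self):
    # Body: (Text | Tag)*
    # Loop while the current token can start a Text or a Tag.
    while self.current_token.type in (TAG_OPEN, TEXT):
        token = self.current_token
        if token.type == TAG_OPEN:
            # A reference to the Tag rule becomes a call to tag().
            self.tag()
        elif token.type == TEXT:
            # A reference to a token becomes a call to eat().
            self.eat(TEXT)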

The Tag rule is handled similarly: create a method named tag (guideline 1). It first calls eat to consume the TAG_OPEN token (guideline 5). The optional grouping (...)* becomes a while loop (guideline 3), and the optional grouping (...)? becomes an if statement (guideline 4). Finally, it calls eat to consume the TAG_CLOSE token (guideline 5).

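def tag(self):
    # Tag: OPEN_TAG (ATTRIBUTE)* (Body)? CLOSE_TAG
    self.eat(TAG_OPEN)
    # (ATTRIBUTE)*: consume zero or more attributes.
    while self.current_token.type == ATTR:
        self.eat(ATTR)
    # (Body)?: parse a nested body only if one starts here.
    if self.current_token.type in (TAG_OPEN, TEXT):
        self.body()
    self.eat(TAG_CLOSE)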

OK, let's try the program. If, for any valid HTML string, the code runs without raising a syntax error and every token has been consumed, then the parser is working properly.
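
To try this end to end without the lexer, the two methods can be dropped into a small self-contained class and fed a hand-written token list. This is only a sketch: the Token namedtuple, the constructor, the eat implementation, and the parse entry point are assumptions, and the repository code may be organized differently.

```python
from collections import namedtuple

Token = namedtuple('Token', ['type', 'value'])  # hypothetical token shape
TAG_OPEN, TAG_CLOSE, ATTR, TEXT, EOF = 'TAG_OPEN', 'TAG_CLOSE', 'ATTR', 'TEXT', 'EOF'


class Parser:
    def __init__(self, tokens):
        # Assumes a non-empty token list that ends with an EOF token.
        self.tokens = iter(tokens)
        self.current_token = next(self.tokens)

    def eat(self, token_type):
        if self.current_token.type != token_type:
            raise SyntaxError('expected {}, got {}'.format(
                token_type, self.current_token.type))
        self.current_token = next(self.tokens, Token(EOF, None))

    def body(self):
        while self.current_token.type in (TAG_OPEN, TEXT):
            if self.current_token.type == TAG_OPEN:
                self.tag()
            else:
                self.eat(TEXT)

    def tag(self):
        self.eat(TAG_OPEN)
        while self.current_token.type == ATTR:
            self.eat(ATTR)
        if self.current_token.type in (TAG_OPEN, TEXT):
            self.body()
        self.eat(TAG_CLOSE)

    def parse(self):
        self.body()
        self.eat(EOF)  # fails unless every token has been consumed
        return True


valid = [
    Token(TAG_OPEN, 'p'), Token(TEXT, 'My mother has '),
    Token(TAG_OPEN, 'span'), Token(ATTR, ('style', 'color:blue')),
    Token(TEXT, 'blue'), Token(TAG_CLOSE, 'span'),
    Token(TEXT, ' eyes.'), Token(TAG_CLOSE, 'p'), Token(EOF, None),
]
print(Parser(valid).parse())

# A tag that is never closed should be rejected.
broken = [Token(TAG_OPEN, 'p'), Token(TEXT, 'hi'), Token(EOF, None)]
try:
    Parser(broken).parse()
except SyntaxError as error:
    print('syntax error:', error)
```

Note that this grammar only checks structure: it does not verify that the name of a closing tag matches the name of its opening tag.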

Here is a test run on my computer, with the following input:

<p>My mother has <span style="color:blue">blue</span> eyes.</p>

python testpart2.py
Start tag: p
Data     : My mother has
Start tag: span
     attr: ('style', 'color:blue')
Data     : blue
End tag  : span
Data     :  eyes.
End tag  : p
(Token(TAG_OPEN, 'p'), Token(TEXT, 'My mother has '), Token(TAG_OPEN, 'span'), Token(ATTR, ('style', 'color:blue')), Token(TEXT, 'blue'), Token(TAG_CLOSE, 'span'), Token(TEXT, ' eyes.'), Token(TAG_CLOSE, 'p'), Token(EOF, None))

(Token(TAG_OPEN, 'p'), Token(TEXT, 'My mother has '), Token(TAG_OPEN, 'span'), Token(ATTR, ('style', 'color:blue')), Token(TEXT, 'blue'), Token(TAG_CLOSE, 'span'), Token(TEXT, ' eyes.'), Token(TAG_CLOSE, 'p'))

Today I focused on the knowledge and concepts related to parsers, trees, ASTs, and grammars. I will stop here to make sure you understand them first. In lesson 3 I will show how to build an AST while parsing.
Please read the code and practice with it. This is an extremely important article for understanding the ones that follow.

Check your understanding

Try to answer these questions to reinforce your knowledge:

What is the process of restructuring a token stream called?
What is the name of the part of the compiler that performs parsing?
What is a context-free grammar (grammar)?
What are rules / productions?
What are terminals and non-terminals? (Identify all of them in the lesson's figure.)
What is the head of a rule?
What is the body of a rule?
What is the start symbol of a grammar?
Finally, make sure you master the guidelines in this post.

Explore more:

https://ruslanspivak.com/lsbasi-part4/

https://blog.mgechev.com/2017/09/16/developing-simple-interpreter-transpiler-compiler-tutorial/