Tokenization - I

Making sense of a given SQL query logically and arithmetically

ยท

4 min read

Tokenization - I
Play this article

Tokenization means (in my words) process making sense of a given statement of a particular syntax (a programming/markup language etc.) to be processed logically and arithmetically in a system.

The important parameter for tokenization is the syntax to which the tokenization is applied to. In kopf, we will use the SQL syntax as is. with that being said, SQL syntax is made of statements. These statements will contain the operands & the operators.

SELECT col_name FROM table_name WHERE id > 2003;

Here:-

  • SELECT, FROM, and WHERE are operators (also >, but I meant the dominant SQL ops)
  • col_name, table_name, id and 2003 are operands

A SQL engine will take this and lay its ops as follows:-

  1. Get everything from table table_name
  2. Iterate through each row & enumerate to get rows with only id > 2003
  3. Extract values of col_name to a memory place from obtained results
  4. Harvested results in given memory are the result of the issued SQL statement * I assumed there are no indexes found for table_name in the DBMS

Constraints of kopf SQL

Since this is an experiment, we will enforce some of the SQL conventions as mandatory. Plus, few add-ons ๐Ÿ˜‰. This is to focus on the core concept rather than wasting time on input sanitization & validations.

As of now, the following constraints are observed and issued queries are assumed to abide when working with kopf:

  • SQL tokens are uppercase
  • single whitespace of width 1 (i.e., Unicode char when spacebar is pressed) will differentiate each token
  • in INSERT queries, to-be inserted values will be inside parenthesis () and delimited by commas , and no whitespace to be found in between values.

Parsing a query

With constraints laid down, let's try to parse it. SQL is somewhat Natural Language. It's intuitive af. E.g.

SELECT col_name FROM table_name

In SQL we will read from left to right. The first operator (first token) signifies what we have to do. We gotta READ the values of col_name, the first operand (the second token). the second operator (the third token) tells us we gotta read from table_name, the second operand (fourth token). Thus, we will read col_name from table_name.

First off, let's have a main entry point. It will read stdin and parse the query and do the job. The entry point will be brainstem. Then skimmed query will be sent to the cerebrum via cerebellum. In cerebrum we will have a simple conditional that will invoke a function for each type of query.

Each such function will signature as follows:

fn operator<T: serde::Value>(
    table_name: &str,
    operand: T,
    where_tuple: WHERE_STRUCT
    ) -> serde::Value {};

e.g.

select('table', col_name, (id, 0b1, 2003);

So, you get the idea. Thus, we have to do the following:

  1. Determine the main operator (SELECT|INSERT|UPDATE|DELETE etc.)
  2. Extract the table
  3. Extract operand [WIP]
  4. Get where_tuple if given [WIP]
  5. Cast the magic [WIP]
  6. Return the result to the callee [WIP]

Prototype

I will use python for a quick prototype. In python, we can work with strings very easily without worrying about memory management like we have to do in rust.

Determine the main operator

query = input("kopf> ")

In constraints, we defined only whitespace is allowed in between tokens and few other rules. These will help us focus on the core itself. So extracting the tokens is as easy as...

tokens = query.split(" ")

Now the main operator will be at tokens[0]. So we can use a very long if-else block to determine the operator logic.

Python doesn't have switch logic, but rust does. So in rust, we will swap ๐Ÿ˜Š

So, the determinant will be

operator = tokens[0]

if operator == "SELECT":
    pass
elif operator == "INSERT":
    pass
elif operator == "UPDATE":
    pass
elif operator == "DELETE":
    pass
else:
    raise Exception(f"Syntax Error in query:\n\t{query}")

Extract Table

Fairly straightforward due to the way SQL works. Table name is either at tokens[1] or token[2] or tokens[3]. Hence, the above determinant will become...

if operator == "SELECT":
    tbl = tokens[3]
elif operator == "INSERT":
    tbl = tokens[2]
elif operator == "UPDATE":
    tbl = tokens[1]
elif operator == "DELETE":
    tbl = tokens[2]
else:
    raise Exception(f"Syntax Error in query:\n\t{query}")

That's it for now ๐Ÿ˜’

I know, I haven't finished the flow. This took me a while to do the research and compile it into an article and program. Remember, this series is asynchronous. It only gives you futures/promises. The actual result might delay or never happen ๐Ÿ˜˜ and fail. Please make sure, you check out later.

Attribution: Photo by Pavel Danilyuk from Pexels

Adieu, this is BE, signing off ๐Ÿ‘‹.

ย