Taken from Why I love Rust for tokenising and parsing:
I am currently writing an analysis tool for SQL: sqleibniz, specifically for the sqlite dialect. The goal is to perform static analysis of SQL input, including syntax checks and checks whether tables, columns and functions exist. Combining this with an embedded sqlite runtime and the ability to assert conditions in this runtime creates a really great dev experience for SQL.
Furthermore, I want to be able to show the user high-quality error messages with context, explanations and the ability to mute certain diagnostics.
This analysis includes the stages of lexical analysis/tokenisation, the parsing of SQL according to the sqlite documentation and the analysis of the resulting constructs.
After completing the static analysis part of the project, I plan on writing an LSP server for SQL, so stay tuned for that.
The Problem: Inaccurate and Incomplete Diagnostics
This post is about the SQL keyword suggestion implementation: if the input contains an identifier at a point at which a SQL keyword is expected, the sqleibniz parser should generate a diagnostic containing suggestions for keywords the identifier could be. For example, misspelling the `SELECT` keyword:
```sql
SELET INDEX username;
```
Invoking sqleibniz with these statements results in the following diagnostic:
```text
error[Unimplemented]: Unknown Token
 -> /home/teo/programming/sqleibniz/example/stmt.sql:1:1
 01 | SELET INDEX username;
    |  ^^^^^ error occurs here.
    |
  ~ note: sqleibniz does not understand the token Ident("SELET"), skipping ahead to next statement
  * Unimplemented: Source file contains constructs sqleibniz does not yet understand
=============================== Summary ================================
[-] example/stmt.sql:
    1 Error(s) detected
    0 Error(s) ignored

=> 0/1 Files verified successfully, 1 verification failed.
```
However, this diagnostic is not that useful. The parser knows it expects a keyword, and a keyword can only be one of the list of known keywords. The parser should therefore propagate this knowledge to the user. Sqleibniz now does exactly this.
The Solution: A New Diagnostic
```text
error[UnknownKeyword]: Unknown Keyword
 -> /home/teo/programming/sqleibniz/example/stmt.sql:1:1
 01 | SELET INDEX username;
    |  ^^^^^ error occurs here.
    |
  ~ note: 'SELET' is not a known keyword, did you mean:
        - SELECT
        - SET
        - LEFT
  * UnknownKeyword: Source file contains an unknown keyword
=============================== Summary ================================
[-] example/stmt.sql:
    1 Error(s) detected
    0 Error(s) ignored

=> 0/1 Files verified successfully, 1 verification failed.
```
This reworked diagnostic includes a specific new rule: `Rule::UnknownKeyword`. This rule is used for all identifiers found at a point where a keyword should have occurred. Furthermore, the note of the diagnostic includes a list of at most three known keywords similar to the found identifier.
Computing similarity of two words
There is a whole list of algorithms to choose from when attempting to classify the similarity of two strings; I chose the Levenshtein distance.
The Idea
The idea of the Levenshtein distance is to compute the minimum number of insertions, deletions and substitutions of characters needed to change a word $a$ into a word $b$. The naïve and slow implementation is fairly easy to code up; the Wikipedia article linked above already includes a nice mathematical definition of it:

$$
\operatorname{lev}(a, b) =
\begin{cases}
|a| & \text{if } |b| = 0, \\
|b| & \text{if } |a| = 0, \\
\operatorname{lev}\bigl(\operatorname{tail}(a), \operatorname{tail}(b)\bigr) & \text{if } \operatorname{head}(a) = \operatorname{head}(b), \\
1 + \min
\begin{cases}
\operatorname{lev}\bigl(\operatorname{tail}(a), b\bigr) \\
\operatorname{lev}\bigl(a, \operatorname{tail}(b)\bigr) \\
\operatorname{lev}\bigl(\operatorname{tail}(a), \operatorname{tail}(b)\bigr)
\end{cases} & \text{otherwise.}
\end{cases}
$$
If you do not know this notation: the left part of each case is the value to return, the right side is the condition under which that case applies. For instance, $|a| \text{ if } |b| = 0$ resolves to returning the length of $a$ if the length of $b$ is equal to $0$. Other useful bits to know regarding this notation:
- $|x|$: the length of $x$
- $\operatorname{head}(x)$: the first character of $x$
- $\operatorname{tail}(x)$: all characters of $x$ except the first
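To make the notation concrete (my own example, not from the article): for $x = \text{SELET}$ we have $|x| = 5$, $\operatorname{head}(x) = \text{S}$ and $\operatorname{tail}(x) = \text{ELET}$.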
Levenshtein works like this: if the length of either $a$ or $b$ is $0$, return the length of the other. If the first characters are equal, return the distance of both tails. Otherwise, the distance is equal to one plus the lowest value of either the insertion (line 1), the deletion (line 2) or the substitution (line 3).
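As a quick sanity check, here is my own worked trace for the misspelling from earlier (not part of the original article). The first four characters of SELET and SELECT match pairwise, so the recursion strips them until it compares T with CT; there, the heads differ and the cheapest branch is $\operatorname{lev}(\text{T}, \text{T}) = 0$, i.e. a single insertion of C:

$$
\operatorname{lev}(\text{SELET}, \text{SELECT})
= \operatorname{lev}(\text{T}, \text{CT})
= 1 + \min(2,\, 0,\, 1)
= 1
$$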
The Implementation
Implementing this is a one-to-one match with the above definition:
```rust
pub fn distance(a: &[u8], b: &[u8]) -> usize {
    if b.is_empty() {
        // |a| if |b| = 0
        a.len()
    } else if a.is_empty() {
        // |b| if |a| = 0
        b.len()
    } else if a[0] == b[0] {
        // equal heads: the distance is the distance of both tails
        distance(
            a.get(1..).unwrap_or_default(),
            b.get(1..).unwrap_or_default(),
        )
    } else {
        // otherwise: one plus the minimum of insertion (line 1),
        // deletion (line 2) and substitution (line 3)
        let first = distance(a.get(1..).unwrap_or_default(), b);
        let second = distance(a, b.get(1..).unwrap_or_default());
        let third = distance(
            a.get(1..).unwrap_or_default(),
            b.get(1..).unwrap_or_default(),
        );
        1 + first.min(second).min(third)
    }
}
```
I took a test from the Wikipedia article:
```rust
#[cfg(test)]
mod lev {
    use super::distance;

    #[test]
    fn kitten_sitting() {
        // https://en.wikipedia.org/wiki/Levenshtein_distance#Example
        assert_eq!(distance("kitten".as_bytes(), "sitting".as_bytes()), 3);
    }
}
```
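To tie this back to the misspelling from the beginning of the post, a second check could look like this (my own addition, not part of the article's test suite):

```rust
#[test]
fn selet_select() {
    // a single insertion of 'C' turns SELET into SELECT
    assert_eq!(distance("SELET".as_bytes(), "SELECT".as_bytes()), 1);
}
```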
Hooking it up to the diagnostic creation in the parser
I then created a list of the already known keywords and mapped each keyword to a tuple of its `distance` value to the found identifier and the keyword itself. Collecting these tuples into a binary tree map (`BTreeMap`) and extracting the values of this map sorts the keywords according to their distance to the identifier. I only want the top three, so I take these:
```rust
impl Keyword {
    /// suggestions returns three suggestions based on their smallest
    /// Levenshtein distance computed via lev::distance
    pub fn suggestions(s: &str) -> Vec<&str> {
        let input = s.to_uppercase();
        let bytes = input.as_bytes();
        vec![
            "ABORT",
            // .....
            "WITHOUT",
        ]
        .into_iter()
        .map(|keyword| (lev::distance(bytes, keyword.as_bytes()), keyword))
        .collect::<BTreeMap<usize, &str>>()
        .into_values()
        .take(3)
        .collect()
    }
}
```
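Based on the diagnostic shown above, a test for this function could look roughly like the following (my own sketch; the exact result depends on the full keyword list):

```rust
#[test]
fn selet_suggestions() {
    // matches the suggestions in the diagnostic above; suggestions()
    // uppercases its input first, so casing does not matter
    assert_eq!(Keyword::suggestions("selet"), vec!["SELECT", "SET", "LEFT"]);
}
```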
`Keyword` is the value enclosed in `Type::Keyword` and holds all possible SQL keywords. If a top-level token with the `Type::Ident` type is found, the diagnostic is generated:
```rust
/// Function naming directly corresponds to the sqlite3 documentation of sql syntax.
///
/// ## See:
///
/// - https://www.sqlite.org/lang.html
/// - https://www.sqlite.org/lang_expr.html
impl<'a> Parser<'a> {
    /// see: https://www.sqlite.org/syntax/sql-stmt.html
    fn sql_stmt(&mut self) -> Option<Box<dyn nodes::Node>> {
        let r = match self.cur()?.ttype {
            // ....
            Type::Ident(_) => {
                if let Type::Ident(name) = &self.cur()?.ttype {
                    self.errors.push(self.err(
                        "Unknown Keyword",
                        &format!(
                            "'{}' is not a known keyword, did you mean: \n\t- {}",
                            name,
                            Keyword::suggestions(name).join("\n\t- ").as_str()
                        ),
                        self.cur()?,
                        Rule::UnknownKeyword,
                    ));
                }
                self.skip_until_semicolon_or_eof();
                None
            }
        };
        // ...
    }
}
```
Generating a diagnostic with my architecture works by pushing diagnostics into the array in the `Parser.errors` field. An error is created via `Parser.err`; this method fills the values of the `error::Error` structure with a title, the position in the source code and a note. These errors are then passed to a formatter in the `error` module, which is responsible for implementing the pretty printing of the diagnostic locations and the color highlighting.
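For illustration, the `error::Error` structure could be shaped roughly like this; the field names and types are my guesses based on the description above, not sqleibniz's actual definition:

```rust
/// hypothetical sketch of the error module's types, inferred from the
/// prose above -- the real definitions in sqleibniz may differ
pub enum Rule {
    Unimplemented,
    UnknownKeyword,
    // ...
}

pub struct Error {
    /// e.g. "Unknown Keyword"
    pub title: String,
    /// position of the offending token in the source file
    pub line: usize,
    pub column: usize,
    /// the text printed after '~ note:' in the diagnostic
    pub note: String,
    /// the rule this diagnostic belongs to, e.g. Rule::UnknownKeyword
    pub rule: Rule,
}
```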
The whole implementation is probably not the fastest or even a remotely efficient way of doing it; I will switch to a matrix approach, as described in the Wagner–Fischer algorithm. Once I've done so, I will also use the `lev::distance` function to suggest column names and table names.
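For reference, a minimal sketch of that matrix approach could look like this (my own sketch of the textbook Wagner–Fischer algorithm, not sqleibniz code):

```rust
/// computes the Levenshtein distance with the Wagner–Fischer matrix:
/// dp[i][j] holds the distance between a[..i] and b[..j]
pub fn distance_matrix(a: &[u8], b: &[u8]) -> usize {
    let (m, n) = (a.len(), b.len());
    let mut dp = vec![vec![0usize; n + 1]; m + 1];
    for i in 0..=m {
        dp[i][0] = i; // delete all i characters of the prefix of a
    }
    for j in 0..=n {
        dp[0][j] = j; // insert all j characters of the prefix of b
    }
    for i in 1..=m {
        for j in 1..=n {
            let cost = if a[i - 1] == b[j - 1] { 0 } else { 1 };
            dp[i][j] = (dp[i - 1][j] + 1) // deletion
                .min(dp[i][j - 1] + 1) // insertion
                .min(dp[i - 1][j - 1] + cost); // substitution
        }
    }
    dp[m][n]
}
```

This trades the exponential recursion above for $O(|a| \cdot |b|)$ time and memory.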