Making SQL Keyword Suggestions Work

Taken from Why I love Rust for tokenising and parsing:

I am currently writing an analysis tool for SQL: sqleibniz, specifically for the sqlite dialect.

The goal is to perform static analysis of SQL input, including syntax checks and checks that tables, columns and functions exist. Combining this with an embedded sqlite runtime and the ability to assert conditions in this runtime creates a really great dev experience for SQL.

Furthermore, I want to be able to show the user high quality error messages with context, explanations and the ability to mute certain diagnostics.

This analysis includes the stages of lexical analysis/tokenisation, the parsing of SQL according to the sqlite documentation and the analysis of the resulting constructs.

After completing the static analysis part of the project, I plan on writing an LSP server for SQL, so stay tuned for that.

The Problem: Inaccurate and Incomplete Diagnostics

This post is about the SQL keyword suggestion implementation: if the input contains an identifier at a point where a SQL keyword is expected, the sqleibniz parser should generate a diagnostic containing suggestions for keywords the identifier could be.

For example, misspelling the SELECT keyword:

SELET INDEX username;

Invoking sqleibniz with these statements results in the following diagnostic:

error[Unimplemented]: Unknown Token
 -> /home/teo/programming/sqleibniz/example/stmt.sql:1:1
 01 | SELET INDEX username;
    | ^^^^^ error occurs here.
    |
    ~ note: sqleibniz does not understand the token Ident("SELET"), skipping ahead to next statement
  * Unimplemented: Source file contains constructs sqleibniz does not yet understand
=============================== Summary ================================
[-] example/stmt.sql:
    1 Error(s) detected
    0 Error(s) ignored

=> 0/1 Files verified successfully, 1 verification failed.

However, this diagnostic is not that useful. The parser knows it expects a keyword, and a keyword can only be one of the fixed list of known keywords, so the parser should propagate this knowledge to the user. sqleibniz now does exactly this.

The Solution: A New Diagnostic

error[UnknownKeyword]: Unknown Keyword
 -> /home/teo/programming/sqleibniz/example/stmt.sql:1:1
 01 | SELET INDEX username;
    | ^^^^^ error occurs here.
    |
    ~ note: 'SELET' is not a known keyword, did you mean:
        - SELECT
        - SET
        - LEFT
  * UnknownKeyword: Source file contains an unknown keyword
=============================== Summary ================================
[-] example/stmt.sql:
    1 Error(s) detected
    0 Error(s) ignored

=> 0/1 Files verified successfully, 1 verification failed

This reworked diagnostic includes a new rule: Rule::UnknownKeyword. This rule is used for all identifiers found at a point where a keyword should have occurred. Furthermore, the note of the diagnostic includes a list of at most three known keywords similar to the found identifier.
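
A rough sketch of that rule enum; only Unimplemented and UnknownKeyword appear in this post, everything else about it here is an assumption:

/// Sketch only: the real Rule enum in sqleibniz may contain more variants.
pub enum Rule {
    /// the source contains constructs sqleibniz does not yet understand
    Unimplemented,
    /// an identifier was found where a keyword was expected
    UnknownKeyword,
    // ... further rules
}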

Computing similarity of two words

There are several algorithms to choose from when attempting to quantify the similarity of two strings. I chose the Levenshtein distance.

The Idea

The idea of the Levenshtein distance is to compute the minimum number of insertions, deletions and substitutions of characters needed to change a word a into a word b. The naïve and slow implementation is fairly easy to code up; the Wikipedia article on the Levenshtein distance already includes a nice mathematical definition:

$$ \textrm{lev}(a,b) = \begin{cases} |a| & |b| = 0 \\ |b| & |a| = 0 \\ \textrm{lev}(\textrm{tail}(a), \textrm{tail}(b)) & \textrm{head}(a) = \textrm{head}(b) \\ 1 + \min \begin{cases} \textrm{lev}(\textrm{tail}(a), b) \\ \textrm{lev}(a, \textrm{tail}(b)) \\ \textrm{lev}(\textrm{tail}(a), \textrm{tail}(b)) \end{cases} & \textrm{otherwise} \end{cases} $$

If you do not know this notation: the left part of each case is the value to return, the right side is the condition for that case to apply. For instance, the first case returns the length of a if the length of b is equal to 0. Other useful bits regarding this notation:

  • |x|: length of x
  • head(x): first character of x
  • tail(x): all characters except the first of x

Levenshtein works like this: if the length of a or b is 0, return the length of the other. If the first characters are equal, return the distance of the two tails. Otherwise, the distance is one plus the lowest value of either the insertion (line 1), the deletion (line 2) or the substitution (line 3):

$$ 1 + \min \begin{cases} \textrm{lev}(\textrm{tail}(a), b) \\ \textrm{lev}(a, \textrm{tail}(b)) \\ \textrm{lev}(\textrm{tail}(a), \textrm{tail}(b)) \end{cases} $$
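
For example, lev(SELET, SELECT) = 1, since a single insertion of C turns SELET into SELECT, and lev(SELET, SET) = 2, since two deletions turn SELET into SET. This matches the order of the suggestions in the diagnostic above.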

The Implementation

Implementing this is a one-to-one match with the above definition:

pub fn distance(a: &[u8], b: &[u8]) -> usize {
    if b.is_empty() {
        a.len()
    } else if a.is_empty() {
        b.len()
    } else if a[0] == b[0] {
        // first characters match: the distance is the distance of both tails
        distance(&a[1..], &b[1..])
    } else {
        // otherwise: one edit plus the minimum of the three recursive cases
        let first = distance(&a[1..], b);
        let second = distance(a, &b[1..]);
        let third = distance(&a[1..], &b[1..]);
        1 + first.min(second).min(third)
    }
}

I took a test from the Wikipedia article:

#[cfg(test)]
mod lev {
    use super::distance;

    #[test]
    fn kitten_sitting() {
        // https://en.wikipedia.org/wiki/Levenshtein_distance#Example
        assert_eq!(distance("kitten".as_bytes(), "sitting".as_bytes()), 3);
    }
}
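
To tie this back to the diagnostic shown earlier, a further test could be added to the same module (my addition, not part of the original test suite), checking the distances behind the three SELET suggestions:

    #[test]
    fn selet_suggestions() {
        // distances of "SELET" to the keywords suggested in the diagnostic above
        assert_eq!(distance("SELET".as_bytes(), "SELECT".as_bytes()), 1);
        assert_eq!(distance("SELET".as_bytes(), "SET".as_bytes()), 2);
        assert_eq!(distance("SELET".as_bytes(), "LEFT".as_bytes()), 3);
    }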

Hooking it up to the diagnostic creation in the parser

I then created a list of the already known keywords and mapped each keyword to a tuple of its distance to the found identifier and the keyword itself. Collecting these tuples into a binary tree map (BTreeMap) keyed by the distance and extracting the values sorts the keywords by their distance to the identifier. I only want the top three, so I take those:

use std::collections::BTreeMap;

impl Keyword {
    /// suggestions returns at most three suggestions based on their smallest
    /// Levenshtein distance to the input, computed via lev::distance
    pub fn suggestions(s: &str) -> Vec<&str> {
        let input = s.to_uppercase();
        let bytes = input.as_bytes();
        vec![
            "ABORT",
            // .....
            "WITHOUT",
        ]
        .into_iter()
        .map(|keyword| (lev::distance(bytes, keyword.as_bytes()), keyword))
        .collect::<BTreeMap<usize, &str>>()
        .into_values()
        .take(3)
        .collect()
    }
}

Keyword is the value enclosed in Type::Keyword and holds all possible SQL keywords.
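
Calling it looks roughly like this; with the full keyword list, this yields exactly the three suggestions from the diagnostic above:

let suggestions = Keyword::suggestions("SELET");
assert_eq!(suggestions, vec!["SELECT", "SET", "LEFT"]);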

If a top-level token of Type::Ident is found, the diagnostic is generated:

/// Function naming directly corresponds to the sqlite3 documentation of sql syntax.
///
/// ## See:
///
/// - https://www.sqlite.org/lang.html
/// - https://www.sqlite.org/lang_expr.html
impl<'a> Parser<'a> {
    /// see: https://www.sqlite.org/syntax/sql-stmt.html
    fn sql_stmt(&mut self) -> Option<Box<dyn nodes::Node>> {
        let r = match self.cur()?.ttype {
            // ....
            Type::Ident(_) => {
                if let Type::Ident(name) = &self.cur()?.ttype {
                    self.errors.push(self.err(
                        "Unknown Keyword",
                        &format!(
                            "'{}' is not a known keyword, did you mean: \n\t- {}",
                            name,
                            Keyword::suggestions(name).join("\n\t- ").as_str()
                        ),
                        self.cur()?,
                        Rule::UnknownKeyword,
                    ));
                }
                self.skip_until_semicolon_or_eof();
                None
            }
        };
        // ...
    }
}

Generating a diagnostic in my architecture works by pushing diagnostics into the vector in the Parser.errors field. An error is created via Parser.err; this method fills the values of the error::Error structure with a title, the position in the source code and a note. These errors are then passed to a formatter in the error module, which is responsible for pretty printing the diagnostic locations and for the color highlighting.
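
A minimal sketch of these pieces; only the signature of err is visible in the snippet above, the field names and the token position representation are my assumptions:

/// Sketch only: the real error::Error in sqleibniz may differ.
pub struct Error {
    pub title: String,
    pub note: String,
    pub rule: Rule,
    /// assumed position representation: line and byte offsets of the offending token
    pub line: usize,
    pub start: usize,
    pub end: usize,
}

impl<'a> Parser<'a> {
    /// builds an error::Error from a title, a note, the offending token and a rule
    fn err(&self, title: &str, note: &str, token: &Token, rule: Rule) -> Error {
        Error {
            title: title.into(),
            note: note.into(),
            rule,
            line: token.line, // assumed Token fields
            start: token.start,
            end: token.end,
        }
    }
}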

The whole implementation is probably not the fastest or even a remotely efficient way of doing it; I will switch to a matrix approach, as described in the article on the Wagner–Fischer algorithm. Once I've done so, I will also use the lev::distance function to suggest column names and table names.
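
For reference, a sketch of what that matrix approach could look like. This is my rendition of the textbook Wagner–Fischer algorithm, not sqleibniz code; the function name distance_matrix is made up:

/// Wagner–Fischer: compute the Levenshtein distance with a full
/// (len(a)+1) x (len(b)+1) matrix instead of recursion.
pub fn distance_matrix(a: &[u8], b: &[u8]) -> usize {
    let (m, n) = (a.len(), b.len());
    // d[i][j] is the distance between the first i chars of a and the first j chars of b
    let mut d = vec![vec![0usize; n + 1]; m + 1];
    for i in 0..=m {
        d[i][0] = i; // i deletions turn a prefix of length i into the empty string
    }
    for j in 0..=n {
        d[0][j] = j; // j insertions build a prefix of b from the empty string
    }
    for i in 1..=m {
        for j in 1..=n {
            let substitution = if a[i - 1] == b[j - 1] { 0 } else { 1 };
            d[i][j] = (d[i - 1][j] + 1) // deletion
                .min(d[i][j - 1] + 1) // insertion
                .min(d[i - 1][j - 1] + substitution); // substitution (or match)
        }
    }
    d[m][n]
}

This trades the exponential recursion of the naïve version for O(len(a) * len(b)) time and memory, which can be reduced further to two rows since each cell only depends on the previous row.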