Inserting 100k rows 66 times faster


In the process of implementing the initial data-syncing logic for a mostly offline application, I noticed my database abstraction taking a solid 2 minutes to insert just short of 750k rows into a single database table.

I set out to fix this bottleneck.

Problems and Issues

A new issue arose before I could even tackle the aforementioned performance problems: at first I used the spread syntax to accept a variadic number of function parameters.

class Database {
  async create(...items) {}
}
new Database().create(...new Array(750_000));

This polluted the call stack and promptly caused a stack overflow:

/home/teo/programming_wsl/blog_test/main.js:7
new Database().create(...new Array(750_000));
               ^

RangeError: Maximum call stack size exceeded
    at Object.<anonymous> (/home/teo/programming_wsl/blog_test/main.js:7:16)
    at Module._compile (node:internal/modules/cjs/loader:1275:14)
    at Module._extensions..js (node:internal/modules/cjs/loader:1329:10)
    at Module.load (node:internal/modules/cjs/loader:1133:32)
    at Module._load (node:internal/modules/cjs/loader:972:12)
    at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:83:12)
    at node:internal/main/run_main_module:23:47

Node.js v19.9.0

Believe me, this seems obvious in retrospect, but it took me a solid day of combing through TypeScript source maps to find the cause.

Takeaway

I will never use the spread operator again - looking at you, react.js
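
The fix itself was trivial once found: accept the rows as a single array parameter instead of spreading them onto the call stack. A minimal sketch of the changed signature:

class Database {
  // the rows arrive as one array argument, so the call stack holds a
  // single reference instead of 750k individual arguments
  async create(items) {}
}
new Database().create(new Array(750_000));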

First Estimates and Benchmarking

Fixing this issue allowed me to call my create function and pass in the rows to insert. The first tests were nothing short of disappointing: 57 seconds to insert 500k rows and 1 minute 10 seconds to insert 750k rows - this was too slow.

  async create(items) {
    return new Promise((res, rej) => {
      // serialize() guarantees the queued statements execute in order
      this.connection.serialize(() => {
        this.connection.run("BEGIN TRANSACTION");
        // one INSERT statement per row
        for (let i = 0; i < items.length; i++) {
          try {
            this.connection.run(
              "INSERT INTO user (name, age) VALUES (?, ?)",
              items[i]?.name,
              items[i]?.age
            );
          } catch (e) {
            this.connection.run("ROLLBACK TRANSACTION");
            return rej(e);
          }
        }
        this.connection.run("COMMIT");
        return res();
      });
    });
  }

My next idea was to get specific and reproducible numbers, so I created a benchmark for the simplified create implementation above with differing loads, starting at 10 rows and stopping at a million rows:

const Database = require("./main.js");

const DB = new Database();

const amount = [10, 100, 1000, 10_000, 100_000, 1_000_000];
const data = { name: "xnacly", age: 99 };

describe("create", () => {
  for (const a of amount) {
    let d = new Array(a).fill(data);
    test(`create-${a}`, async () => {
      await DB.create(d);
    });
  }
});

Measured times:

Rows      | create
----------|--------
10        | 2ms
100       | 1ms
1,000     | 13ms
10,000    | 100ms
100,000   | 1089ms
1,000,000 | 11795ms

Approaches

My first idea was to omit calls in the hot path, which is obviously the create method's loop that inserts every row passed into it. After taking a look, I noticed there are no heavy operations here, nor even many of them. So how could I improve the database interaction itself? This was the moment I stumbled upon bulk inserts.

INSERT INTO user (name, age) VALUES ("xnacly", 99);

The naive example implementation makes a database call for every row it wants to insert. Bulk inserting reduces this to a single call to the database layer by appending more value tuples to the above statement:

INSERT INTO user (name, age) VALUES
    ("xnacly", 99),
    ("xnacly", 99),
    ("xnacly", 99);

Using the above, a faster create method can be implemented as follows:

  async createFast(items) {
    if (!items.length) return Promise.resolve();
    // build a single statement with one (?,?) tuple per row
    let insert = "INSERT INTO user (name, age) VALUES ";
    insert += new Array(items.length).fill("(?,?)").join(",");
    // flatten the rows into a single parameter array
    let params = new Array(items.length * 2);
    let i = 0;
    for (const item of items) {
      params[i] = item.name;
      params[i + 1] = item.age;
      i += 2;
    }
    return new Promise((res, rej) => {
      this.connection.serialize(() => {
        this.connection.run("BEGIN TRANSACTION");
        try {
          this.connection.run(insert, params);
        } catch (e) {
          this.connection.run("ROLLBACK TRANSACTION");
          return rej(e);
        }
        this.connection.run("COMMIT");
        // resolve inside the serialized block, after the COMMIT is queued
        return res();
      });
    });
  }
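
One caveat: SQLite caps the number of bound parameters in a single statement (SQLITE_MAX_VARIABLE_NUMBER, 999 in older builds and 32766 since SQLite 3.32.0), and at two parameters per row a large enough batch can exceed that cap. A minimal sketch of a chunked wrapper, assuming the createFast method above; the createChunked name and chunk size are made up for illustration:

  // Hypothetical wrapper: split the rows so each bulk statement stays
  // below the bound-parameter limit (16_000 rows * 2 params = 32_000,
  // just under the 32766 default of newer SQLite builds)
  async createChunked(items, chunkSize = 16_000) {
    for (let i = 0; i < items.length; i += chunkSize) {
      await this.createFast(items.slice(i, i + chunkSize));
    }
  }

Note that each chunk commits its own transaction this way; keeping the whole batch atomic would require lifting BEGIN/COMMIT out of createFast.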

Extending our tests for createFast:

describe("createFast", () => {
  for (const a of amount) {
    let d = new Array(a).fill(data);
    test(`createFast-${a}`, async () => {
      await DB.createFast(d);
    });
  }
});

Rows      | create  | createFast | Improvement
----------|---------|------------|------------
10        | 2ms     | 1ms        | 2x
100       | 1ms     | 1ms        | /
1,000     | 13ms    | 3ms        | 4.3x
10,000    | 100ms   | 22ms       | 4.5x
100,000   | 1089ms  | 215ms      | 5.1x
1,000,000 | 11795ms | 1997ms     | 5.9x

Info

The benchmarks above were run on the following system:

  • CPU: Intel(R) Core(TM) i5-6200U @ 2.30 GHz, 4 cores
  • RAM: 8.0 GB DDR3 1600 MHz

The system was under heavy load while benchmarking, so the recorded times are slower than they would otherwise be.

Escaping from micro benchmarks into the real world

Applying the results from the micro benchmarks to my real-world project's database layer resulted in significant runtime improvements of up to 66x.

Rows    | create  | createFast | Improvement
--------|---------|------------|------------
10      | 7ms     | 2ms        | 3.5x
100     | 48ms    | 1ms        | 48x
1,000   | 335ms   | 9ms        | 37.2x
10,000  | 2681ms  | 80ms       | 33.5x
100,000 | 25347ms | 390ms      | 66x