I'm genuinely surprised that there isn't column-level shared-dictionary string compression built into SQLite, MySQL/MariaDB or Postgres, like this post is describing.
SQLite has no compression support, MySQL/MariaDB have page-level compression which doesn't work great and I've never seen anyone enable in production, and Postgres has per-value compression which is good for extremely long strings, but useless for short ones.
There are just so many string columns where values and substrings get repeated so much, whether you're storing names, URL's, or just regular text. And I have databases I know would be reduced in size by at least half.
Is it just really really hard to maintain a shared dictionary when constantly adding and deleting values? Is there just no established reference algorithm for it?
It still seems like it would be worth it even if it were something you had to manually set. E.g. wait until your table has 100,000 values, build a dictionary from those, and the dictionary is set in stone and used for the next 10,000,000 rows too unless you rebuild it in the future (which would be an expensive operation).
> Is it just really really hard to maintain a shared dictionary when constantly adding and deleting values? Is there just no established reference algorithm for it?
Enums? Foreign key to a table with (id bigint generated always as identity, text text) ?
> I have databases I know would be reduced in size by at least half.
Most people don't employ these strategies because storage is cheap and compute time is expensive.
The current prices for SSDs and DRAM may change the strategies.
In a mini-PC that I have assembled recently, with a medium-priced CPU (an Arrow Lake H Core Ultra 5) the cost of storage has been 60% of the total cost of the computer, and it was only 60% because I have decided to buy much less memory than I would have bought last summer (i.e. I bought 32 GB DRAM + 3 TB SSDs, while I wanted double amounts, but then the price would have become unacceptable).
Moreover, I bought the mini-PC and memories in Europe, but the same exact computer model from ASUS, with the same memories, has in USA, on Newegg, a price about 35% higher than in Europe, and a similar ratio is valid for other products, so US customers are likely to be even more affected by this great increase in the cost of storage.
Strings in textual index are already compressed, with common prefix compression or other schemes. They are perfectly queryable. Not sure if their compression scheme is for index or data columns.
Global column dictionary has more complexity than normal. Now you are touching more pages than just the index pages and data page. The dictionary entries are sorted, so you need to worry about page expansion and contraction. They sidestep the problems by making it immutable, presumably building it up front by scanning all the data.
Not sure why using FSST is better than using a standard compression algorithm to compress the dictionary entries.
Storing the strings themselves as dictionary IDs is a good idea, as they can be processed quickly with SIMD.
> Not sure why using FSST is better than using a standard compression algorithm to compress the dictionary entries.
I believe the reason is that FSST allows access to individual strings in the compressed corpus, which is required for fast random access. This is more important for OLTP than OLAP, I assume.
More standard compression algorithms, such as zstd, might decompress very fast, but I don't think they allow that
1, complicates and slows down update, which is typically more important in OLTP than OLAP
2, is generally bad for high cardinality columns, which requires tracking cardinality to make decisions, which further complicates things.
lastly, additional operational complexity (like the table maintenance system you described in last paragraph) could reduce system reliability, and they might decide it's not worth the price or against their philosophy.
There are some databases that can move an entire column into the index. But that's mostly going to work for schemas where the number of distinct values is <<< rowcount, so that you're effectively interning the rows.
But string interning is what they're doing, isn't it?
> Dictionary compression is a well-known and widely applied technique. The basic idea is that you store all the unique input values within a dictionary, and then you compress the input by substituting the input values with smaller fixed-size integer keys that act as offsets into the dictionary. Building a CedarDB dictionary on our input data and compressing the data would look like this:
That's string interning!!
Is interning just too old a concept now and it has to be rediscovered/reinvented and renamed?
String interning only stores whole strings, dictionary compression stores substrings. Essentially string interning is a trivial subset of dictionary compression (because you don’t need a substringing scheme, which is the hard part).
Doesn't interning usually refer to when you only consider identical copies, as opposed to dictionary compression where you allow for concatenations? E.g.
Interning:
1: "foo"
2: "bar"
my_string = "foo" // stored as ref->1
my_other_string = "foobarbaz" // not found & too long to get interned, stored as "foobarbaz"
Dictionary compression:
1: "foo"
2: "bar"
my_string = "foo" // stored as ref->1
my_other_string = "foobarbaz" // stored as ref->1,ref->2,"baz" (or ref->1,ref->2,ref->3 and "baz" is added to the dict)
Only if you explicitly intern the string. Interning can be expensive because
- the runtime has to check whether the string already is interned.
- you’re permanently creating the string, so if it isn’t reused later, that means your memory usage needlessly goes up.
Both get more expensive the more strings you intern.
I think interning is a hack that very rarely, if ever should be used. It likely won’t work well for large strings, as these tend to be unique, and use cases where it helps for shirt strings often are better handled by using enums or symbols, or by using a custom set of strings. If you do the latter, you have more control over Emory usage; you can do such things such as removing the least recently used strings, or ditching the entire cache when you’re done needing it (prime example: parsing a large XML file with many repeated nodes)
> you’re permanently creating the string, so if it isn’t reused later, that means your memory usage needlessly goes up.
Nowadays lots of runtime will GC interned strings to avoid this, this can mean churn in the interning table but avoids mis-estimations bloating the process too much.
Their own cherry-picked example benchmarks show that there is frequently no benefit, after enabling FSST, to query runtimes and sometimes query runtimes are actually negatively affected. Despite hard empirical evidence against FSST, they decided to enable it across-the-board? Not even "hey, we built this cool thing which may or may not improve your workloads, here's how to opt-in", but pushing straight up performance regressions for their customers with no way of opting-out?
Why would I ever trust my data to engineers who are more interested in their academically-interesting resume-driven-development than in actual customer outcomes?
The compression algorithm is very similar to a greedy subword tokenizer, which is used in BERT and other older language models, but has become less popular in favor of BPE.
I've implemented a similar system based on the original 2020 paper,
but we applied it to the text log to try to "extract" similar features
from free-form text. It looked promising and even supported full
regex search, but the work was ultimately abandoned when we got
acquired.
I was evaluating it recently but it's not FOSS, so buyer beware. I'm totally fine with commercialization, but I hesitate to build on top of data stores with no escape hatches or maintenance plans–especially when they're venture backed. It is self-hostable, but not OSS.
It's probably going to be acquired. The last effort to commercialize the TUM (Technical University of Munich) database group's work was acquired by Snowflake and disappeared into that stack.
CedarDB is the commercialization of Umbra, the TUM group's in-memory database lead by professor Thomas Neumann. Umbra is a successor to HyPer, so this is the third generation of the system Neumann came up with.
Umbra/CedarDB isn't a completely new way of doing database stuff, but basically a combination of several things that rearchitect the query engine from the ground up for modern systems: A query compiler that generates native code, a buffer pool manager optimized for multi core, push-based DAG execution that divides work into batches ("morsels"), and in-memory Adaptive Radix Tries (never used in a database before, I think).
It also has an advanced query planner that embraces the latest theoretical advances in query optimization, especially some techniques to unnest complex multi-join query plans, especially with queries that have a ton of joins. The TUM group has published some great papers on this.
> It also has an advanced query planner that embraces the latest theoretical advances in query optimization, especially some techniques to unnest complex multi-join query plans, especially with queries that have a ton of joins. The TUM group has published some great papers on this.
I always wondered how good these planners are in practice. The Neumann/Moerkotte papers are top notch (I've implemented several of them myself), but a planner is much more than its theoretical capabilities; you need so much tweaking and tuning to make anything work well, especially in the cost model. Does anyone have any Umbra experience and can say how well it works for things that are not DBT-3?
Umbra is not an in-memory database (Hyper was). TUM gave up on the feasibility of in-memory databases several years ago (when the price of RAM relative to storage stopped falling).
Thanks for the correction. My understanding was that it was still in-memory but "fell back on" disk. ART indexes were touted as one of the novel aspects of Umbra, and my understanding is that ART doesn't work well as an on-disk data structure, so I guess I need to read up on the architecture now.
No, again, ART was Hyper's specialty. Because you're right, ART specializes at in-memory workloads it is not amenable to paging.
I believe Umbra is heavily BTree based, just like its cousin LeanStore.
One of its specific innovations is its buffer pool which uses virtual memory overcommit and multiple possible buffer sizes to squeeze better performance out of page management.
My understanding is the research projects LeanStore & Umbra -- and now I assume the product CedarDB based on the people involved, etc. -- are systems based on the observation that a) existing on-disk systems aren't built well with the characteristics of nVME/SSD drives in mind b) RAM prices up to this year were not dropping at the same rate as they were early in the 2010s, meaning that pure in-memory databases were not so competitive, so it's important to look at how we can squeeze performance out of systems that perform paging. And of course in the last 6 months this has become extremely relevant with the massive spike in RAM prices.
That and the query compilation stuff, I guess, which I know less about.
Thanks for the corrections and info. Will check out that video. I could have sworn I had read these things about Umbra, but I suppose it was HyPer. Both interesting designs!
I am currently writing an on-disk B+tree implementation inspired by LMDB's blazing fast memory-mapped CoW approach, so I'm quite keen to learn what makes TUM's stuff fast.
Yeah I think the way Umbra was pitched when I watched the talks and read the paper was as more as "hybrid" in the sense that it aimed for something close to in-memory performance while optimizing the page-in/page-out performance profile.
The part of Umbra I found interesting was the buffer pool, so that's where focused most of my attention when reading though.
This isn't strictly correct: you probably mean wrt compressed size. Compression is a tradeoff between size reduction and compression and decompression speed. So while things like Bellard's tz_zip (https://bellard.org/ts_zip/) or nncp compress really well they are extremely slow compared to say zstd or the much faster compression scheme in the article. It's a totally different class of codec.
SQLite has no compression support, MySQL/MariaDB have page-level compression which doesn't work great and I've never seen anyone enable in production, and Postgres has per-value compression which is good for extremely long strings, but useless for short ones.
There are just so many string columns where values and substrings get repeated so much, whether you're storing names, URL's, or just regular text. And I have databases I know would be reduced in size by at least half.
Is it just really really hard to maintain a shared dictionary when constantly adding and deleting values? Is there just no established reference algorithm for it?
It still seems like it would be worth it even if it were something you had to manually set. E.g. wait until your table has 100,000 values, build a dictionary from those, and the dictionary is set in stone and used for the next 10,000,000 rows too unless you rebuild it in the future (which would be an expensive operation).
Enums? Foreign key to a table with (id bigint generated always as identity, text text) ?
> I have databases I know would be reduced in size by at least half.
Most people don't employ these strategies because storage is cheap and compute time is expensive.
In a mini-PC that I have assembled recently, with a medium-priced CPU (an Arrow Lake H Core Ultra 5) the cost of storage has been 60% of the total cost of the computer, and it was only 60% because I have decided to buy much less memory than I would have bought last summer (i.e. I bought 32 GB DRAM + 3 TB SSDs, while I wanted double amounts, but then the price would have become unacceptable).
Moreover, I bought the mini-PC and memories in Europe, but the same exact computer model from ASUS, with the same memories, has in USA, on Newegg, a price about 35% higher than in Europe, and a similar ratio is valid for other products, so US customers are likely to be even more affected by this great increase in the cost of storage.
Global column dictionary has more complexity than normal. Now you are touching more pages than just the index pages and data page. The dictionary entries are sorted, so you need to worry about page expansion and contraction. They sidestep the problems by making it immutable, presumably building it up front by scanning all the data.
Not sure why using FSST is better than using a standard compression algorithm to compress the dictionary entries.
Storing the strings themselves as dictionary IDs is a good idea, as they can be processed quickly with SIMD.
I believe the reason is that FSST allows access to individual strings in the compressed corpus, which is required for fast random access. This is more important for OLTP than OLAP, I assume. More standard compression algorithms, such as zstd, might decompress very fast, but I don't think they allow that
1, complicates and slows down update, which is typically more important in OLTP than OLAP
2, is generally bad for high cardinality columns, which requires tracking cardinality to make decisions, which further complicates things.
lastly, additional operational complexity (like the table maintenance system you described in last paragraph) could reduce system reliability, and they might decide it's not worth the price or against their philosophy.
But string interning is what they're doing, isn't it?
> Dictionary compression is a well-known and widely applied technique. The basic idea is that you store all the unique input values within a dictionary, and then you compress the input by substituting the input values with smaller fixed-size integer keys that act as offsets into the dictionary. Building a CedarDB dictionary on our input data and compressing the data would look like this:
That's string interning!!
Is interning just too old a concept now and it has to be rediscovered/reinvented and renamed?
- the runtime has to check whether the string already is interned.
- you’re permanently creating the string, so if it isn’t reused later, that means your memory usage needlessly goes up.
Both get more expensive the more strings you intern.
I think interning is a hack that very rarely, if ever should be used. It likely won’t work well for large strings, as these tend to be unique, and use cases where it helps for shirt strings often are better handled by using enums or symbols, or by using a custom set of strings. If you do the latter, you have more control over Emory usage; you can do such things such as removing the least recently used strings, or ditching the entire cache when you’re done needing it (prime example: parsing a large XML file with many repeated nodes)
Nowadays lots of runtime will GC interned strings to avoid this, this can mean churn in the interning table but avoids mis-estimations bloating the process too much.
It's not old yet; it'll be old once it's been renamed two or three times...
Their own cherry-picked example benchmarks show that there is frequently no benefit, after enabling FSST, to query runtimes and sometimes query runtimes are actually negatively affected. Despite hard empirical evidence against FSST, they decided to enable it across-the-board? Not even "hey, we built this cool thing which may or may not improve your workloads, here's how to opt-in", but pushing straight up performance regressions for their customers with no way of opting-out?
Why would I ever trust my data to engineers who are more interested in their academically-interesting resume-driven-development than in actual customer outcomes?
The paper suggests that you could rework string matching to work on the compressed data but they haven't done it.
Seems to be another commercial cloud-hosted thing offering a Postgres API? https://dbdb.io/db/cedardb
https://cedardb.com/blog/ode_to_postgres/
Successor to Umbra, I believe.
I know somebody (quite talented) working there. It's likely to kick ass in terms of performance.
But it's hard to get people to pay for a DB these days.
CedarDB is the commercialization of Umbra, the TUM group's in-memory database lead by professor Thomas Neumann. Umbra is a successor to HyPer, so this is the third generation of the system Neumann came up with.
Umbra/CedarDB isn't a completely new way of doing database stuff, but basically a combination of several things that rearchitect the query engine from the ground up for modern systems: A query compiler that generates native code, a buffer pool manager optimized for multi core, push-based DAG execution that divides work into batches ("morsels"), and in-memory Adaptive Radix Tries (never used in a database before, I think).
It also has an advanced query planner that embraces the latest theoretical advances in query optimization, especially some techniques to unnest complex multi-join query plans, especially with queries that have a ton of joins. The TUM group has published some great papers on this.
I always wondered how good these planners are in practice. The Neumann/Moerkotte papers are top notch (I've implemented several of them myself), but a planner is much more than its theoretical capabilities; you need so much tweaking and tuning to make anything work well, especially in the cost model. Does anyone have any Umbra experience and can say how well it works for things that are not DBT-3?
I believe Umbra is heavily BTree based, just like its cousin LeanStore.
One of its specific innovations is its buffer pool which uses virtual memory overcommit and multiple possible buffer sizes to squeeze better performance out of page management.
The talk at https://www.youtube.com/watch?v=pS2_AJNIxzU is delightful.
My understanding is the research projects LeanStore & Umbra -- and now I assume the product CedarDB based on the people involved, etc. -- are systems based on the observation that a) existing on-disk systems aren't built well with the characteristics of nVME/SSD drives in mind b) RAM prices up to this year were not dropping at the same rate as they were early in the 2010s, meaning that pure in-memory databases were not so competitive, so it's important to look at how we can squeeze performance out of systems that perform paging. And of course in the last 6 months this has become extremely relevant with the massive spike in RAM prices.
That and the query compilation stuff, I guess, which I know less about.
I am currently writing an on-disk B+tree implementation inspired by LMDB's blazing fast memory-mapped CoW approach, so I'm quite keen to learn what makes TUM's stuff fast.
https://github.com/leanstore/leanstore
Very different approach from lmdb's mmap though. Not limited to single writer. Explicit buffer pool. Big difference from Umbra is fixed sized pages.
The part of Umbra I found interesting was the buffer pool, so that's where focused most of my attention when reading though.