Introducing RawTree
by Maksim Kita
Most analytics databases today assume you know your table schemas upfront. You define tables, set up indexes, create materialized views, and only then can you start querying your data.
Without indexes or materialized views the database is slow. With them, optimizing queries requires a lot of work, usually involving iterating the tables and materialized view's schemas, and they are not effective at dealing with unstructured/semi-structured data. All of this makes their APIs and interaction complex, which is not ideal for coding agents, because they have to make decisions about data structure/schema/indexes before they have any data. The core issue is "schema before data".
Imagine a team building an AI support agent. Every run produces different data: prompts, tool calls, retrieval results, latencies, error payloads, user metadata, feedback, sometimes extra fields that did not exist the day before. At the beginning they do not know which of those fields will matter most. They just want to keep everything and start asking questions immediately.
Ideally, the agent should be able to send requests like:
curl -X POST https://api.rawtree.com/v1/tables/agent_runs \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '[
{"run_id":"r1","model":"gpt-5","latency_ms":812,"tool":"search","status":"ok"},
{"run_id":"r2","model":"gpt-5","latency_ms":"1540","tool":"browser","status":"error","error_code":"timeout"}
]'So that after a few runs, it could start running queries like these:
curl -X POST "https://api.rawtree.com/v1/query" \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '{"sql":"SELECT tool, status, count(), avg(latency_ms) \
FROM agent_runs \
GROUP BY tool, status \
ORDER BY count() DESC;"}'We built RawTree to solve this problem. Our main motivation and goal was to create an ideal analytical database for agents, and thus for humans too. The main feature is its simplicity: you just ingest any data, query with SQL, fetch results via API. No schema upfront, and no need to design indexes, primary keys, or materialized views to start working. RawTree is designed to be useful on raw workloads immediately, then improve recurring workloads over time by adapting to actual query access patterns.
Before jumping into details of each feature and history of RawTree I wanted to highlight some of the main features that make RawTree a unique database:
- Schemaless fast and predictable ingestion. You can connect RawTree to log exporter/webhook/custom built API/ingestion pipeline in minutes.
- Reliable type system and SQL for dynamic columns. Custom and fast (almost zero overhead) operators for GROUP BY/JOINs/ORDER BY/DISTINCT/Window functions.
- Fast query execution on raw untyped data. For even faster execution client can add types directly in SQL without necessity to change schema.
- Database optimizes itself based on client query access patterns. There are no indexes/primary keys/materialized views to define; the primary key indexes/JIT projections are created automatically under the hood based on query access patterns without having to define them manually.
- Clean API for agents. A coding agent should be able to learn to use it successfully with a few lines of text.
History of RawTree
RawTree grew out of my previous work on Ursa - ClickHouse research fork. Ursa started as a performance project: I wanted to see how fast I could make an analytical engine if I did not have to preserve backward compatibility. That work gave me a serious engine foundation, but it also made another problem more obvious.
Performance is only part of the difficulty. A lot of analytical systems still assume that before you can do useful work, someone has to decide the schema, the sort order, the indexes, the projections, and how the data model will evolve. That can work when the workload is stable and the team already knows what matters. It doesn't works as well when the data is messy, the shape keeps evolving, and queries appear before the model is settled.
I wanted a different operating model: ingest first, query immediately, and let the physical optimization follow real usage instead of forcing those decisions upfront.
That is not just a product or API problem. It creates a few hard database problems.
- Type system for dynamic data. Data cannot stay as raw JSON internally if you want serious analytical performance. It has to be stored and processed in typed form (String, Integer, etc.). But once a column can contain different types, you need clear and predictable SQL semantics for filtering, grouping, ordering, joins, aggregates, and functions.
- Performance on raw workloads. Accepting semi-structured data is not useful if it makes the engine much slower. The goal is not convenience at any cost. The goal is to keep the engine fast while still allowing flexible ingest and query patterns.
- Optimization without upfront modeling. If the user does not define primary keys, indexes, or projections in advance, the engine needs to infer useful physical structures from actual workload patterns. That means collecting the right statistics and making good decisions automatically, without turning the system into a black box.
RawTree is the result of trying to solve those problems together in one database.
The rest of this post shows some of the main pieces: schemaless ingestion, SQL over dynamic data, and automatic physical optimization based on real query access patterns.
Schemaless ingestion and APIs
RawTree is built around RawMergeTree table engine. This engine allows you to ingest any data and primary key/runtime indexes/projections will be created automatically.
CREATE TABLE test_table ENGINE=RawMergeTree;
INSERT INTO test_table VALUES
('{"id": 0, "value": "Value_0"}'),
('{"id": 1, "value": 1}'),
('{"id": 2, "value": 2.0}');
SELECT id, value, dynamicType(id) AS id_type, dynamicType(value) AS value_type
FROM test_table;
┌─id─┬─value───┬─id_type─┬─value_type─┐
│ 0 │ Value_0 │ Int64 │ String │
│ 1 │ 1 │ Int64 │ Int64 │
│ 2 │ 2.0 │ Int64 │ Float64 │
└────┴─────────┴─────────┴────────────┘This makes it possible to build clean and simple APIs for agents. You just insert your data, and tables and columns are created automatically.
curl -X POST https://api.rawtree.com/v1/tables/events \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '[
{"action":"click","user":"alice","value":42},
{"action":"view","user":"bob","value":10,"info":"some additional info"}
]'
Returns: {"inserted":2}Autogenerated projections
Autogenerated projections are created automatically based on actual query access patterns. The system will automatically identify repeatedly executed slow queries and will create the necessary projections. For example:
SELECT
region,
sum(amount) AS sum
FROM test_auto_projection
GROUP BY region
ORDER BY region ASC
5 rows in set. Elapsed: 2.125 sec. Processed 100.00 million rows, 3.50 GB (47.06 million rows/s., 1.65 GB/s.)
Peak memory usage: 25.60 MiB.If you run this relatively slow query multiple times, then projection advisor will decide to automatically create projection and then:
SELECT
region,
sum(amount) AS sum
FROM test_auto_projection
GROUP BY region
ORDER BY region ASC
┌─region─┬─────────sum─┐
│ 0 │ 9945969605 │
│ 1 │ 9966469035 │
│ 2 │ 9997709681 │
│ 3 │ 10009853640 │
│ 4 │ 10028792257 │
└────────┴─────────────┘
5 rows in set. Elapsed: 0.005 sec.The projection advisor will favour lightweight projections for low-cardinality aggregation scenarios, where the number of groups after aggregation is small. This lets the engine maintain many lightweight projections with low storage overhead to accommodate client access patterns.
Autogenerated PRIMARY KEY
In RawMergeTree, data is stored in parts similar to ordinary MergeTree. But each part has its own primary key. The engine automatically creates a primary key when data is inserted, based on column types and cardinality. Later during actual query access pattern statistics are collected and primary key is rebuilt during merges based on those statistics. For example:
CREATE TABLE test_table ENGINE = RawMergeTree;
INSERT INTO test_table
SELECT concat('{"id":', toString(number), ',"value":"Value_', toString(number), '"}')
FROM numbers(10_000_000);
SELECT id, value FROM test_table WHERE id = 5;
┌─id─┬─value───┐
│ 5 │ Value_5 │
└────┴─────────┘
1 row in set. Elapsed: 0.005 sec. Processed 13.95 thousand rows, 237.33 KB (2.74 million rows/s., 46.64 MB/s.)
Peak memory usage: 187.55 KiB.If we run EXPLAIN we can see amount of data that was read and that index was actually used:
EXPLAIN indexes = 1 SELECT * FROM test_table WHERE id = 5;
┌─explain────────────────────────────────────────────────────────────┐
│ Expression ((Project names + Projection)) │
│ Expression ((WHERE + Change column names to column identifiers)) │
│ ReadFromMergeTree (default.test_table) │
│ Indexes: │
│ PrimaryKey │
│ Condition: true │
│ Parts: 1/1 │
│ Granules: 2/1221 │
│ Ranges: 2 │
└────────────────────────────────────────────────────────────────────┘I will write more about autogenerated PRIMARY KEY features in next blog posts and cover more details like how engine handles complex scenarios like dynamic primary key columns, primary key analysis for dynamic columns, actual filter statistics collection and why there are 2 granules in EXPLAIN output instead of 1.
Type System
All SQL operators DISTINCT, GROUP BY, JOINS, ORDER BY and both ordinary and aggregate functions need to support working natively with dynamic types and provide good and predictable default behavior.
To show why this matters, consider a few places where ordinary SQL semantics become tricky for dynamic columns. In systems without dynamic-type-aware operators, even basic operations such as ORDER BY, GROUP BY, and aggregates either fail or require the user to opt into behavior that may not match what they expect.
ORDER BY operator in ClickHouse with a Dynamic type:
CREATE TABLE test_dynamic (value Dynamic);
INSERT INTO test_dynamic VALUES (1), (2.0), (3);
SELECT * FROM test_dynamic ORDER BY value;
Received exception from server (version 26.5.1):
Code: 44. DB::Exception: Received from localhost:9000. DB::Exception: Data types Variant/Dynamic are not allowed in ORDER BY keys,
because it can lead to unexpected results. Consider using a subcolumn with a specific data type instead
(for example 'column.Int64' or 'json.some.path.:Int64' if its a JSON path subcolumn) or casting this column to a specific data type.
Set setting allow_suspicious_types_in_order_by = 1 in order to allow it. (ILLEGAL_COLUMN)Even when allowing this operation explicitly, the behavior may not be what a client expects:
SET allow_suspicious_types_in_order_by = 1
SELECT value FROM test_dynamic ORDER BY value;
┌─value─┐
│ 2 │
│ 1 │
│ 3 │
└───────┘This sorting behavior also needs to be consistent with all functions that compare values internally like arraySort:
SELECT arraySort([1, 2.0, 3]::Array(Dynamic))
┌─arraySort(CA⋯Dynamic)'))─┐
│ [2,1,3] │
└──────────────────────────┘In RawTree ORDER BY is supported for dynamic columns and all functions that compare values by default,
and it is possible to set legacy behavior if needed with dynamic_columns_compare_by_value setting:
CREATE TABLE test_dynamic (value Dynamic);
INSERT INTO test_dynamic VALUES (1), (2.0), (3);
SELECT * FROM test_dynamic ORDER BY value;
┌─value─┐
│ 1 │
│ 2 │
│ 3 │
└───────┘
SELECT arraySort([1, 2.0, 3]::Array(Dynamic))
┌─arraySort(CA⋯Dynamic)'))─┐
│ [1,2,3] │
└──────────────────────────┘Similar issues show up with GROUP BY. Without dynamic-type-aware grouping semantics, compatible values with different concrete integer types can be treated as separate groups:
CREATE TABLE test_dynamic (value Dynamic);
INSERT INTO test_dynamic SELECT toUInt8(1);
INSERT INTO test_dynamic SELECT toUInt64(1);
SELECT value, COUNT() FROM test_dynamic GROUP BY value;
Received exception from server (version 26.5.1):
Code: 44. DB::Exception: Received from localhost:9000. DB::Exception: Data types Variant/Dynamic are not allowed in GROUP BY keys,
because it can lead to unexpected results. Consider using a subcolumn with a specific data type instead (for example 'column.Int64'
or 'json.some.path.:Int64' if its a JSON path subcolumn) or casting this column to a specific data type.
Set setting allow_suspicious_types_in_group_by = 1 in order to allow it. (ILLEGAL_COLUMN)For example:
SELECT value, COUNT() FROM test_dynamic GROUP BY value;
┌─value─┬─COUNT()─┐
│ 1 │ 1 │
│ 1 │ 1 │
└───────┴─────────┘In RawTree for GROUP BY there is special operator implementations where all integer types are treated as the same virtual Integer type:
CREATE TABLE test_dynamic (value Dynamic);
INSERT INTO test_dynamic SELECT toUInt8(1);
INSERT INTO test_dynamic SELECT toUInt64(1);
SELECT value, COUNT() FROM test_dynamic GROUP BY value;
┌─value─┬─COUNT()─┐
│ 1 │ 2 │
└───────┴─────────┘Aggregates have the same problem. A function like sum needs to understand which concrete dynamic types it can process:
CREATE TABLE test_dynamic (value Dynamic);
INSERT INTO test_dynamic SELECT toUInt8(1);
SELECT sum(value) FROM test_dynamic;
Received exception from server (version 26.3.1):
Code: 43. DB::Exception: Received from localhost:9000. DB::Exception: Illegal type Dynamic of argument
for aggregate function sum. (ILLEGAL_TYPE_OF_ARGUMENT)In RawTree it works without any issues:
CREATE TABLE test_dynamic (value Dynamic);
INSERT INTO test_dynamic SELECT toUInt8(1);
SELECT sum(value) FROM test_dynamic;
┌─sum(value)─┐
│ 1 │
└────────────┘A curious reader could ask what will happen if sum function is executed on column that contains String and Integer types? It is a very interesting system design question. In RawTree we ignore values that function cannot process, but we also provide a hint to the client for visibility and awareness:
CREATE TABLE test_dynamic (value Dynamic);
INSERT INTO test_dynamic VALUES (1), (2), ('Bad string');
SELECT sum(value) FROM test_dynamic;
┌─sum(value)─┐
│ 3 │
└────────────┘
Hint: Aggregate function 'sum' skipped rows where Dynamic column arguments have concrete types (String).
These types are incompatible with the function's signature. For function 'sum(value)'
in scope 'SELECT sum(value) FROM test_dynamic'In RawTree similar hint logic is also implemented for ordinary functions:
SELECT value FROM test_dynamic WHERE value >= 0;
┌─value─┐
│ 1 │
│ 2 │
└───────┘
2 rows in set. Elapsed: 0.002 sec.
Hint: Function 'greaterOrEquals' skipped rows with Dynamic column argument at index 0 for column 'value' with concrete type String.
Type String is incompatible with other function arguments, function was called with types (String, UInt8).
For function 'value >= 0' in scope 'SELECT value FROM test_dynamic WHERE value >= 0'In future blog posts I will deep dive into RawTree type system and explain more details about it.
Performance
As part of the RawTree launch we are also working on RawBench, a benchmark for schemaless JSON analytics. The benchmark ingests and queries raw GitHub JSON data from https://www.gharchive.org/ and is similar to other benchmarks like ClickBench and JSONBench in methodology, but focused on semi-structured data.
The early results look promising. This is a preview of the combined score, where lower is better and bars use a compressed scale so the slower systems remain readable:
RawTree ×2.42 █
StarRocks ×4.50 ██
Apache Doris ×5.85 ██▌
ClickHouse ×6.38 ██▊
Elasticsearch ×44.78 ████████
DuckDB ×85.46 ███████████
PostgreSQL ×101.01 ████████████
MongoDB ×392.21 ████████████████████We will publish RawBench separately with the repository, exact scope, methodology, setup, reproduction steps, and final results soon.
Conclusion
Too much analytical work still begins with modeling decisions that are hard to make well before real data and real queries exist.
We think that assumption is wrong for a growing class of workloads, especially semi-structured and agent-driven ones. But removing schema upfront is only useful if the database still behaves like a serious analytical system: types have to make sense, SQL has to stay predictable, and performance has to remain strong.
That is the problem RawTree is trying to solve. If this sounds interesting, you should join RawTree's private beta