
Technology selection and engineering practices of TCHouse-C real-time data updates

TCHouse-C is Tencent Cloud's fully managed cloud service built on the ClickHouse kernel. It already serves many enterprise customers, including more than 90% of Tencent's internal ClickHouse workloads.

The main content includes the following parts:

1. Real-time data updates

2. SCHEMA LESS

3. Q&A

Speaker|Peng Jian, Senior Engineer, Tencent

Edited and organized|Zhang Junguang

Content proofreading|Li Yao

Produced by DataFun

01

Real-time data updates

1. Real-time data update scenarios


In the current context of big data analysis, the demand for real-time data updates is becoming increasingly prominent, especially focusing on two key scenarios.

  • The first type is high-frequency insertion, deletion, modification, and query of data. Although such operations are generally considered expensive in ClickHouse, and we usually advise customers to avoid doing them frequently, they occur often in real analytics workloads. For example, real-time dashboards, IoT device monitoring, user behavior tracking, and e-commerce transactions all require frequent updates and corrections to real-time data. These scenarios often demand very high update rates, reaching tens or even hundreds of thousands of operations per second (QPS), together with low latency: updated data should be visible immediately. Under ClickHouse's current architecture and design, it is difficult to meet such real-time, high-concurrency requirements.
  • The second type is using partial-column updates to build large wide tables. We generally recommend building wide tables at the business layer to take full advantage of ClickHouse's query and analytics performance, but the methods the community provides for building wide tables are hard to use in practice and their performance is unsatisfactory. With ClickHouse support for UPSERT, building wide tables becomes easier and more efficient: thanks to partial-column updates, each upstream business can update only its own columns, which greatly simplifies data integration. These are the two main types of scenarios customers face today.

2. ClickHouse's existing data update scheme


Let's take a look at ClickHouse's existing data update scheme. The community version provides ALTER TABLE ... UPDATE/DELETE operations, but they are not sufficient for real-time data updates. The main reason is that when an update is submitted, ClickHouse rewrites the affected data parts, synchronously or asynchronously in the background. Such rewrites are relatively expensive, and the updated data does not become visible until the rewrite completes, so updates are not real-time.

This approach has two glaring shortcomings. First, it relies on the mutation mechanism to complete data updates, which can disturb normal business operation when the data volume is large, especially during peak hours; for this reason, we generally advise customers not to run such operations during peak business periods. Second, the updates are not real-time: users cannot see the updated data immediately after submitting the update.
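
As a concrete illustration, the sketch below shows what this mutation-based path looks like in community ClickHouse; the table and column names are hypothetical.

```sql
-- Mutation-based update/delete in community ClickHouse (illustrative table).
ALTER TABLE orders UPDATE status = 'shipped' WHERE order_id = 1001;
ALTER TABLE orders DELETE WHERE created_at < '2023-01-01';

-- The statements return quickly, but the affected data parts are rewritten
-- in the background; rewrite progress can be watched in system.mutations.
SELECT mutation_id, command, is_done
FROM system.mutations
WHERE table = 'orders';
```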

3. ClickHouse Cloud provides lightweight-update/delete


From publicly available information, we learned that ClickHouse Cloud provides a lightweight-update/delete feature. It is compelling because it makes data updates visible in real time: rather than relying solely on the traditional mutation mechanism, it takes a more flexible and efficient approach.

Specifically, lightweight-update/delete stores the update expressions to be applied in memory, in Keeper. When a query is executed, the system first checks for relevant update expressions. If the corresponding data parts have already been rewritten, the underlying data is returned to the client directly; if the rewrite has not completed, the update expressions are applied to the underlying data and the updated result is computed at query time. This mechanism ensures real-time visibility of data updates.

However, there are some potential issues with this feature. First, it still relies on the mutation mechanism, which means that in the case of very frequent updates and large scales, there can be an impact on query performance. Because the system needs to apply these updated expressions at query time, this increases the complexity and computational burden of the query. Second, since update expressions are stored in memory, if the amount of data updated is very large, it can consume a lot of memory resources and need to be carefully managed and monitored.

Overall, ClickHouse's lightweight-update/delete feature provides a new solution for real-time data updates, but it also needs to be evaluated for applicability and performance impact based on specific business scenarios and needs.
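
For reference, open-source ClickHouse (since 22.8) exposes a related lightweight DELETE statement that marks rows through a hidden _row_exists column instead of rewriting parts immediately. The sketch below uses a hypothetical table and is not the Cloud-only lightweight-update feature described above.

```sql
-- Lightweight DELETE: rows are masked immediately via the hidden _row_exists
-- column and physically removed by later background merges.
DELETE FROM orders WHERE order_id = 1001;

-- The classic mutation form, for comparison (heavier: rewrites data parts).
ALTER TABLE orders DELETE WHERE order_id = 1001;
```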

4. Real-time data update in the field of data warehouse, and existing solutions in the industry

(1)Copy-On-Write


The first is the Copy-on-Write strategy. When data is updated, if existing stock data overlaps or conflicts with the update, the system rewrites that portion of the data during the write itself. In other words, the update and rewrite are already done as part of data writing, so the data that lands on storage is already in the correct state, containing all necessary updates.

A notable characteristic of this approach is its high write cost: it resembles the traditional mutation mechanism, but moves the rewriting into the write path, which increases the complexity and resource consumption of writes. Its advantage is a low query cost: because the stored data is already in the correct state, queries can use it directly, without additional computation or transformation. This is the core strength of Copy-on-Write.

(2)Merge-On-Read


Compared with Copy-on-Write, Merge-on-Read avoids the high write cost. When data is written, the system does not resolve conflicts immediately; it simply appends the updated data. Write operations therefore stay simple and efficient, because conflicts and overlaps between data do not need to be handled right away.

However, data conflicts are deferred to the query phase. When a query runs, if the data has not yet been merged in the background, the system must resolve conflicts, merge, and deduplicate during the query before returning correct results. Merge-on-Read is therefore characterized by a relatively low write cost but a relatively high query cost, since more data and logic may need to be processed at query time to guarantee correct results.

(3)Delta-Store


Compared with the previous two schemes, Delta-Store introduces a key feature: a primary key index. When updated data is written, the primary key index is used to quickly locate the existing data that conflicts with the update. Conflicting rows receive incremental (delta) updates appended alongside the existing data, while non-conflicting rows are simply appended as new data.

Although Delta-Store also suffers from data conflicts, it resolves them in a more efficient way than Merge-on-Read. During the write process, Delta-Store identifies and handles conflicts, which sacrifices some write performance but significantly improves performance when querying. Because conflicting data is already processed at write time, query-time merge is relatively inexpensive, improving query responsiveness and efficiency.

Delta-Store is characterized by its ability to handle data conflicts during the write process, avoiding a lot of merging and deduplication operations at query time. This design makes Delta-Store more efficient and stable when handling a large number of update and query requests.

(4)Delete-Insert


The Delete-Insert scheme can greatly improve query performance in certain scenarios, and it is implemented by introducing primary key indexes to optimize the data update and query process.

In the Delete-Insert scheme, when new data is written, the system first uses the primary key index to check whether any existing data conflicts with it. Conflicting data is not modified directly; instead, it is marked as deleted (tombstoned). The new data is then appended to the stock data.

During the query, all data is read, but the data that has already been marked for deletion (i.e., expired) is filtered out based on the previously set deletion marker. In this way, queries can skip data that is no longer needed, significantly improving query efficiency.

The Delete-Insert scheme works particularly well with the vectorized engines used in OLAP (Online Analytical Processing) systems. OLAP systems typically process large datasets, and vectorized engines handle many rows at once, so the filtering step of Delete-Insert can be executed very efficiently, further improving query performance.

In addition, the Delete-Insert scheme also conforms to the data processing characteristics of columnar storage. In columnar storage, data is stored in columns, whereas deletion tags in the Delete-Insert schema can be handled as a separate column. This allows the system to read only those columns of data that are not marked for deletion at query time, further reducing I/O operations and data processing time.

In general, the Delete-Insert scheme implements efficient data updating and querying by introducing primary key indexing and tombstone mechanisms, especially in OLAP scenarios.

5. TCHouse-C real-time data update solution: Delete-Insert

(1) Overall plan


In the data warehouse field there are many mature technical solutions. During technology selection we decided to adopt the Delete-Insert approach. Specifically, we support standard Upsert (Merge Insert/Update) semantics at the SQL access layer, which lets us handle inserts, updates, and deletes precisely; both condition-based deletion and updates of specified fields are supported, fully matching standard Upsert requirements.
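
The exact TCHouse-C DDL and DML are not shown in this talk, so the sketch below only illustrates the intended Upsert semantics; the engine clause, table, and columns are assumptions.

```sql
-- Assumed upsert-capable table: a newly written row whose key already exists
-- replaces the old row instead of coexisting with it.
CREATE TABLE user_profile
(
    user_id   UInt64,
    name      String,
    city      String,
    last_seen DateTime
)
ENGINE = UniqueMergeTree      -- upsert engine discussed later in this talk
ORDER BY user_id;             -- user_id serves as the unique key

-- Upsert: inserts the row for user_id = 42, or replaces the existing one.
INSERT INTO user_profile VALUES (42, 'Alice', 'Shenzhen', now());
```

Condition-based deletes and per-field (partial-column) updates for wide-table building are described in the talk as supported at this layer; their exact syntax is omitted here.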

At the storage engine layer, we introduced a row-level index at the table level. Its main purpose is to efficiently locate potentially conflicting data when new data is written and mark it for deletion. Within each data part, a bitmap stores these delete marks, which helps us manage data efficiently and optimize queries.

In addition, we support ClickHouse's multi-replica synchronization mechanism, which provides higher data availability and fault tolerance. With multi-replica synchronization, data is stored redundantly on multiple nodes; even if one node fails, the data can be recovered quickly from the other nodes, ensuring business continuity.

(2) Data writing and deduplication


We rely primarily on the row-level index to solve the deduplication problem. When a new data part is written, each primary key is checked against the index to quickly determine whether that row already exists in the stock data.

If matching stock data exists, the corresponding rows are marked as deleted in the bitmaps of the data parts that contain them. Once this is done, the new data part and the bitmap marks are committed together. Typically this commit completes within a short time, about 15 seconds, ensuring data consistency and timeliness.

When querying, we introduce a virtual column, exists_row, built from each data part's bitmap. In the common case where a part has no bitmap, the virtual column adds essentially no computation to the query, so queries remain very fast.

If a part does have a bitmap, the query consults it to determine whether each row still exists. By combining the row-level index with the bitmap mechanism, we improve query efficiency and solve the deduplication problem while keeping the data correct.
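
Conceptually (this is not the engine's actual internal SQL), the exists_row virtual column behaves like an implicit filter appended to queries over parts that carry a delete bitmap:

```sql
-- What the user writes (illustrative table):
SELECT count() FROM orders WHERE status = 'shipped';

-- What is effectively evaluated for parts that have a delete bitmap:
SELECT count() FROM orders WHERE status = 'shipped' AND exists_row = 1;

-- Parts with no bitmap skip the extra predicate entirely, which is why the
-- common case adds almost no overhead.
```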

(3) Multi-replica data


First, we solved multi-replica data synchronization. We use ClickHouse's sequential multi-replica synchronization mechanism, which maintains a queue of log entries in ZooKeeper (ZK); each replica replays these log entries to synchronize its own data.

In the engineering implementation, we introduced a row-level version mechanism to resolve conflicts between updates. The business layer may issue concurrent updates to the same key, and by the time those updates reach the server their ordering may be skewed by network latency and other factors. The version mechanism lets the business layer explicitly specify the ordering, resolving such update conflicts.

Second, in order to synchronize row-level deletion operations, we introduced the "tombstone" mechanism. With the presence of "tombstones", we can synchronize the deletion status to each replica, ensuring that all replicas are aware of what data has been deleted.

Finally, we addressed persistent deletion. In ClickHouse's default implementation, deletion is propagated by writing new data parts and synchronizing them to replica nodes. In our scheme, however, each replica node must be told explicitly which data has been deleted, so we introduced a special type of data part that records and synchronizes these deletion events.

In summary, these three mechanisms address multi-replica data synchronization, data update conflicts, and persistent deletion.

6. Performance test of TCHouse-C real-time data update scheme

We compared the UniqueMergeTree table engine introduced in TCHouse-C against the community edition's ReplacingMergeTree table engine. Tests covered three scenarios, all run on the same high-performance machine with 64 CPU cores, 256 GB of memory, and four 3.5 TB SSDs.


Scenario 1: Data import.

A 600-million-row test set was built from the standard SSB (Star Schema Benchmark) dataset, specifically the lineorder_flat table. The purpose of this test is to evaluate how both table engines perform when large amounts of data are imported quickly.
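
For reference, the sketch below shows the kind of DDL such an import comparison would use; the columns are abbreviated and the engine clauses are illustrative rather than the exact test DDL.

```sql
CREATE TABLE lineorder_flat_replacing
(
    LO_ORDERKEY   UInt32,
    LO_LINENUMBER UInt8,
    LO_REVENUE    UInt32
    -- ... remaining SSB lineorder_flat columns
)
ENGINE = ReplacingMergeTree
ORDER BY (LO_ORDERKEY, LO_LINENUMBER);

-- The UniqueMergeTree variant keeps the same columns but uses the
-- upsert-capable engine, so identical import / update / query workloads
-- can be replayed against both tables.
```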

Scenario 2: Concurrent data updates.

The SSB dataset is also used, and the dataset size is maintained at 600 million rows. This test simulates the data update operation in a high-concurrency environment to verify the efficiency and stability of UniqueMergeTree and ReplacingMergeTree when handling a large number of concurrent update requests.

Scenario 3: Updates on a specific dataset.

The New York Taxi Data dataset from the official website was used. Although the original dataset contained only 20 million rows, we modified it to keep the test consistent, in particular adjusting the order IDs, and ultimately constructed a 600-million-row test set. This scenario examines how the two table engines differ when working with a real-world dataset that has its own data distribution and update patterns.

Through the testing of these three scenarios, we can get a more comprehensive understanding of the performance advantages and characteristics of the UniqueMergeTree table engine compared to the ReplacingMergeTree table engine in data import, concurrent updates, and processing of specific datasets.

(1) Data import


In the performance evaluation of data import, we directly compared the two major table engines in the community: ReplacingMergeTree and normal MergeTree, and also added our Upsert Table, which is UniqueMergeTree. By accurately recording and counting the import time (the shorter the time, the better the performance), we came to the following conclusions:

In practice, UniqueMergeTree has shown significant performance improvements over ReplacingMergeTree in terms of data import. This improvement is mainly due to UniqueMergeTree's efficient mechanism for handling uniqueness constraints, i.e., leveraging internal table-level or row-level indexes for collision detection and deduplication while writing data, which reduces unnecessary post-processing and speeds up the data import process.

However, compared to MergeTree, which pursues extreme write performance, UniqueMergeTree's import performance, while excellent, is still slightly inferior. This is because MergeTree, as the basic table engine of ClickHouse, is designed to maximize write speed, and does not introduce an additional indexing mechanism to handle uniqueness requirements like UniqueMergeTree. Therefore, in scenarios without uniqueness constraints, MergeTree can usually provide higher write efficiency.

In summary, UniqueMergeTree provides excellent data import performance on the premise of ensuring data uniqueness, which is the preferred solution for processing large-scale datasets with uniqueness requirements. Although the internal indexing mechanism increases the overhead of writing to a certain extent, it also significantly improves data consistency and accuracy.

(2) Data query


The second scenario focuses on data query performance. First, the 13 SSB queries were run one at a time against the lineorder table to measure single-query latency. To evaluate performance more thoroughly, we tested the ReplacingMergeTree table engine in two modes: with the FINAL keyword added to the query, and without it.

For queries using the FINAL keyword, we observed significantly higher latency, which is expected behavior for ReplacingMergeTree: FINAL forces the engine to complete all necessary merges at query time to guarantee eventual consistency of the result, and this extra work increases query latency.

In queries that don't use the FINAL keyword, UniqueMergeTree shows a significant performance advantage over ReplacingMergeTree, actually achieving a performance improvement of several times. This result is a testament to the efficiency of UniqueMergeTree in processing queries, especially in scenarios where no additional merge operations are required to ensure data consistency.

In summary, UniqueMergeTree performs well in terms of single query performance, especially when the FINAL keyword is not used, which can significantly improve query efficiency and provide users with a smoother data access experience.
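
An SSB-style query in the two ReplacingMergeTree modes compared above looks like this (table name illustrative):

```sql
-- With FINAL: merges are applied at query time, so results are deduplicated,
-- but latency is noticeably higher.
SELECT sum(LO_EXTENDEDPRICE * LO_DISCOUNT) AS revenue
FROM lineorder_flat_replacing FINAL
WHERE toYear(LO_ORDERDATE) = 1993 AND LO_DISCOUNT BETWEEN 1 AND 3 AND LO_QUANTITY < 25;

-- Without FINAL: fast, but rows whose old versions have not yet been merged
-- away may be counted more than once.
SELECT sum(LO_EXTENDEDPRICE * LO_DISCOUNT) AS revenue
FROM lineorder_flat_replacing
WHERE toYear(LO_ORDERDATE) = 1993 AND LO_DISCOUNT BETWEEN 1 AND 3 AND LO_QUANTITY < 25;
```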


In our performance tests on concurrent queries, we randomly generated 100 primary keys and queried the rows corresponding to those primary keys concurrently in the two table engines (ReplacingMergeTree and UniqueMergeTree). The test used 32 concurrent threads to simulate a high-concurrent access scenario.

The test results show that UniqueMergeTree far outperforms ReplacingMergeTree on concurrent queries. This advantage is largely due to UniqueMergeTree's internal primary key index: because the index is built as data is inserted, target rows can be located quickly at query time, which greatly improves efficiency. The advantage is especially pronounced under concurrency, allowing UniqueMergeTree to respond to many concurrent query requests more quickly.

In summary, UniqueMergeTree achieves a significant performance improvement in concurrent query performance through the built-in primary key indexing mechanism, which provides strong support for processing high-concurrency query requests.

(3) Data Update/Deletion


The third scenario focuses on performance evaluation of data updates and deletions. First, we built a test table with 600 million rows of data, and based on that, we tested a single update and delete operation. The tests covered data update volumes of 10 million, 500, and 1 million rows at different scales to fully evaluate the performance of the two table engines (ReplacingMergeTree and UniqueMergeTree) for data update and deletion.

In the single-delete tests, we observed that the latency of an update (or delete) operation correlates closely with the amount of data involved: the smaller the data volume, the lower the latency and the better the performance. Notably, UniqueMergeTree is very fast when the volume of updated data is relatively small, such as 1 million rows or fewer, and its advantage there is particularly clear. This is largely due to UniqueMergeTree's internal optimizations, which handle small updates efficiently without large-scale merges or rewrites of the entire table.

In summary, UniqueMergeTree has demonstrated significant performance advantages in terms of data updates and deletions, especially when dealing with small-scale data updates. This advantage makes UniqueMergeTree ideal for use cases that require frequent data updates and deletions.


Let's look at the performance of concurrent updates/deletes. We used 32 concurrent threads to simulate deletion operations under high load to evaluate the performance of ReplacingMergeTree and UniqueMergeTree in a concurrent environment.

The test results showed that UniqueMergeTree exhibited extremely low latency in 32 concurrent delete operations. This excellent performance is mainly due to the efficient indexing mechanism and data management strategy inside UniqueMergeTree, which enables it to quickly locate and delete target data in a concurrent environment without the need for large-scale refactoring or merging of entire tables.

In contrast, ReplacingMergeTree exhibits a higher latency in concurrent deletions. This is mainly because ReplacingMergeTree relies on a mutation mechanism in the background to handle data updates and deletions. In a concurrent environment, multiple mutations can be triggered at the same time, resulting in resource contention and lock contention, increasing the latency of operations.

In summary, UniqueMergeTree is significantly better than ReplacingMergeTree in terms of concurrent deletion performance, mainly due to its efficient internal indexing and data management mechanisms. This advantage enables UniqueMergeTree to maintain low latency when processing high-concurrency delete requests, providing users with a better data manipulation experience.

7. TCHouse-C real-time data update roadmap


Regarding the roadmap for real-time data updates: we currently use in-memory hash indexes, and there is still a lot of room for optimization. Building on Upsert semantics, we will optimize the execution plans of SQL operations such as updates and deletions to achieve more lightweight data versioning (partially implemented). In addition, using the existing index technology, we will strengthen point-lookup capabilities for specific data, such as market data.

To improve performance and scalability, we will optimize the query engine to reduce the query burden, and plan to scale from full data indexing to cold data indexing to support larger datasets and a wider range of application scenarios. These plans are designed to further improve the efficiency and flexibility of real-time data updates.

02

SCHEMA LESS

1. Semi-structured data


Next, we'll explore strategies for processing semi-structured data. Semi-structured data such as JSON, monitoring data, and logs is very common in business and production systems; it comes from a wide range of sources and holds great potential value, but the community does not yet have a solution for processing it efficiently. Analyzing and mining such data yields deep insight and predictive capability; its flexibility affects the cost and variability of data production and lets it adapt to different business scenarios, while its diversity and irregularity make it possible to express complex data.

2. ClickHouse's advantages in handling semi-structured data


ClickHouse shows significant advantages when handling log-like semi-structured data. First, it is cost-effective: compared with systems such as ES, its high compression ratio significantly reduces storage requirements, while vectorized execution ensures superior performance, lowering cost and improving query efficiency. For the same workload, the required machine configuration can therefore be reduced accordingly.

Second, ClickHouse is also outstanding in terms of performance, with high write throughput and low query latency, which is extremely friendly to business applications and meets the needs of real-time data processing.

In addition, the ease of operation and maintenance is also one of its highlights. The O&M of ClickHouse is relatively simple, the system has no external dependencies, and has high stability. By using cloud services such as Tencent Cloud, the O&M burden is further reduced, allowing users to focus more on core business logic.

3. ClickHouse is widely used in the industry to process semi-structured data


In the industry, ClickHouse is widely used to process semi-structured data. Well-known companies such as Xiaohongshu, Bilibili, Ctrip, and Uber use it for log and APM data and have achieved significant performance improvements. For example, by replacing traditional Object or JSON-string storage with a schema-less pattern, Xiaohongshu improved performance by nearly 20x and cut costs by more than 50%. Bilibili has shared similarly significant performance and cost results. Ctrip and Uber also reported large performance gains, although Uber did not disclose specific cost figures. In addition, JD.com, Vipshop, Kuaishou, and others are exploring ClickHouse for semi-structured data; although challenges remain in current applications, the industry broadly recognizes its advantages and continues to explore them.

4. ClickHouse's pain points in dealing with semi-structured data


ClickHouse faces some pain points when dealing with semi-structured data. First, if you store the data as a raw string and then apply analysis functions to it, CPU consumption is high, the compression ratio is low, and schema information is lost, so ClickHouse's acceleration capabilities cannot be fully exploited. Second, the community's early Object solution, which stores JSON in an Object column, also suffers from a low compression ratio and high CPU usage, and cannot use advanced features such as secondary indexes, materialized views, and projections for performance optimization. Despite these challenges, the community continues to improve and explore more efficient options.
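
As a concrete reminder of that early community approach, here is a minimal sketch of an Object('json') column (an experimental type; table and field names are illustrative):

```sql
SET allow_experimental_object_type = 1;

CREATE TABLE app_logs
(
    ts      DateTime,
    payload Object('json')     -- whole JSON document stored in one column
)
ENGINE = MergeTree
ORDER BY ts;

INSERT INTO app_logs VALUES (now(), '{"level":"error","svc":"api","latency_ms":137}');

-- Sub-fields are addressed with dot paths, but such a column cannot carry
-- secondary indexes, materialized views, or projections on those paths.
SELECT payload.svc, payload.latency_ms
FROM app_logs
WHERE payload.level = 'error';
```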

5. TCHouse-C's new solution for semi-structured data: SCHEMA-LESS

(1) Overall plan


In Tencent Cloud's ClickHouse service, we implemented an innovative schema-less solution to simplify writing business data into ClickHouse. When creating a table, you only need to specify the primary key, partition key, and sorting key; there is no need to define the table schema in detail. When business data is written, it only needs to contain these key fields plus a JSON structure, either as a string or as structured data, and the system will accept it.

After receiving such a request, the server will automatically parse the JSON content, build a tree-like structure, parse the data type based on the traversal results from the root node to the leaf node, and finally convert the data into a block format that can be recognized by ClickHouse. During this process, if a new field is encountered, the system will automatically extend the schema to ensure the integrity and consistency of the data.

It is worth noting that the Schema extension and the commit of the data partition (Part) are completed in the same transaction, and once the transaction fails, the system will automatically roll back to ensure that the internal state of the data will not be damaged by partial writes.

In order to support distributed queries, we implement a fast synchronization mechanism for schemas. With components such as Keeper, new schema information can be quickly subscribed to or actively distributed by other nodes in the cluster, ensuring consistency and accuracy in data queries.

At the query level, we've optimized and rewritten SQL to support advanced features such as fuzzy queries by field name or path. This feature makes full use of the flexibility of JSON data, allowing users to efficiently query through prefix matching and other methods without fully understanding the data structure, which greatly improves the user experience.

In summary, TCHouse-C's schema-less solution provides strong support for efficient writing and flexible query of business data, and is ideal for processing semi-structured data.

(2) Program details


The details of TCHouse-C's solution for processing semi-structured data focus on the following four aspects:

  • Parsing semi-structured data: First, parse the received semi-structured data (such as JSON) into a block format that the ClickHouse kernel can recognize. This process involves extracting data from a JSON structure and building chunks inside ClickHouse based on the data type and structure for subsequent processing.
  • Query rewrite: To support flexible queries on fields in semi-structured data, we need to rewrite SQL queries. This includes handling fuzzy queries by field name or path, ensuring that users can take advantage of the flexibility of JSON for effective data retrieval even without a full understanding of the data structure.
  • Dynamically extend the schema: As data continues to be written, new fields may be introduced. That's why we've implemented a dynamic schema extension mechanism that automatically updates a table's schema to include new fields without interrupting service. This mechanism ensures the integrity of the data and the scalability of the system.
  • Intra-cluster schema synchronization: To support distributed queries, you need to ensure that all nodes in the cluster have the latest schema information. Therefore, we have implemented a fast synchronization mechanism for schema information, and used tools such as Keeper to ensure that the new schema can be quickly subscribed or distributed by other nodes in the cluster, so as to ensure the consistency and accuracy of queries.

(3) Cases


TCHouse-C's new approach to working with semi-structured data has demonstrated its power through real-world examples. Let's say we create a ClickHouse table, specify a partition key and sort key, and enable the dynamic schema (i.e., schema-less) feature. When data is written, a JSON string field is included in addition to the necessary partition and sort keys. This JSON string may have originated from a storage service such as S3 and was downloaded and written to ClickHouse as part of the data.

With schema-less enabled, the table actually has a non-fixed schema structure inside ClickHouse that dynamically stores data based on JSON fields. Users don't need to define all the possible fields beforehand to start the data writing process.

Once the data has been successfully written, users can take advantage of the rich features provided by ClickHouse to perform in-depth analysis of JSON fields. They can use JSON fields directly in their queries, or they can create indexes for those fields to improve query efficiency. In addition, users can create Projections based on JSON fields to optimize the performance of specific queries. Furthermore, through materialized views, users can synchronize data based on JSON fields to another table, realizing flexible processing and transformation of data.

During the whole process, users don't need to care about changes or adjustments to the fields. TCHouse-C's schema-less approach automatically handles these underlying details, ensuring data integrity and consistency, while providing users with a simple yet powerful way to process and analyze semi-structured data.
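
Putting the case above together as a sketch: TCHouse-C's actual schema-less DDL and setting names are not given in this talk, so everything below, in particular the enable_dynamic_schema setting and the expanded field names, is hypothetical.

```sql
-- Hypothetical table: only partition/sort keys plus the raw JSON field are declared.
CREATE TABLE events
(
    event_time DateTime,
    event_id   UInt64,
    raw_json   String
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(event_time)
ORDER BY (event_time, event_id)
SETTINGS enable_dynamic_schema = 1;   -- assumed switch for the schema-less feature

-- A write carries the keys plus the JSON document; any new JSON fields seen
-- here would extend the table schema automatically.
INSERT INTO events VALUES
    (now(), 1, '{"user":{"id":42,"city":"Shenzhen"},"action":"click"}');

-- Expanded fields can then be queried, indexed, used in projections, or fed
-- into materialized views like ordinary columns (field naming assumed).
SELECT count() FROM events WHERE `user.city` = 'Shenzhen';
```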


(4) Effect


TCHouse-C's new approach to semi-structured data delivers significant performance and cost benefits. First, it greatly simplifies the data writing process and makes service access very easy. Whether for a new business or an existing one, there is no need to repeatedly align schemas with the platform provider: the business side only needs to know what data it wants to write and that data's schema, while the service side focuses on providing stable storage and compute. This decoupled approach to data management significantly reduces communication cost and time.

Second, because the JSON data is internally expanded and stored by field, the scheme achieves a higher cost performance and compression ratio. Common compression ratios can be as high as 5-10x, which not only reduces the need for storage space, but also improves the query performance of the data. Users can query based on specific fields without having to scan the entire JSON object, further improving query efficiency.

In addition, the new TCHouse-C solution provides additional optimization tools. This includes secondary indexing of fields, support for materialized views (projections), and data management capabilities, such as deleting unwanted or incorrect fields. These features, which were often difficult or expensive to implement in previous scenarios, are now easily feasible. At the same time, the time-to-live (TTL) mechanism on a per-field or per-table basis provides more flexibility for data management.

In practice, these improvements have resulted in significant performance gains and cost reductions. Our testing in real customer production environments has shown at least a 20x performance improvement and a 50% cost reduction. This shows that TCHouse-C's schema-less scheme is a very suitable tool for processing semi-structured data such as log data on the ClickHouse platform. It not only improves the efficiency of data processing and query, but also reduces the overall operating cost, providing strong support for the digital transformation of enterprises.

03

Q&A

Q1: Is TCHouse-C's primary key index at the single-node level or the cluster level?

A1: TCHouse-C's primary key index is designed to be node-level. When writing data, if you want a specific data distribution, that is, to distribute writes according to some rule (such as a particular key), you still need to specify that key explicitly to guide the distribution. In other words, although the primary key index itself is per node, choosing an appropriate key affects how data is distributed across the cluster and thereby optimizes query efficiency and data management.
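
A standard way to express such an explicit distribution key in ClickHouse is the sharding key of a Distributed table; the sketch below is illustrative (cluster and table names are assumed, reusing the hypothetical user_profile table from earlier) rather than TCHouse-C-specific.

```sql
-- Route all writes for the same user_id to the same shard, so that shard's
-- node-level primary key index sees every version of that key.
CREATE TABLE user_profile_dist AS user_profile
ENGINE = Distributed('my_cluster', currentDatabase(), 'user_profile', sipHash64(user_id));

INSERT INTO user_profile_dist VALUES (42, 'Alice', 'Shenzhen', now());
```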

Q2: If the community used Copy-on-Write, wouldn't writes be slow but queries fast, since queries would not need to deal with tombstone markers?

A2: Community ClickHouse chose Merge-on-Read: asynchronous background merges combined with the synchronous FINAL keyword at query time, which strikes a reasonable balance between data write efficiency and query result consistency.

Q3: How much storage does the row-level index add? What does its data structure look like? Is the index built in real time at write time, or asynchronously?

A3: In TCHouse-C, when we talk about adding a "row-level index" to ClickHouse (ClickHouse itself does not support row-level indexes in the traditional sense; here it means a primary-key-based structure used to locate rows and optimize queries), the memory footprint is closely tied to the length of the primary key fields. We deliberately avoid hashing the primary key directly, mainly because of hash collisions and the resulting risks to data consistency and query correctness. Instead, the index is built directly from the primary key fields, so its memory footprint varies with the size of the primary key content.

Under the premise that the overall design is reasonable, the primary key setting can effectively control the storage size of the index. As you can observe in our test scenario, the memory footprint of the primary key index is limited to 30-40 GB, even with data volumes of more than 600 GB, indicating that the primary key field selection and indexing strategy is appropriate.

For the implementation details of indexes, we currently use table-level full indexes, which means that the indexes cover all the data in the table. We do use some features or principles of hash indexing to optimize query efficiency, but it should be noted that the "hash index" here is not a type that ClickHouse natively supports, but a technical means or strategy we use at the implementation level.

The index is built in two phases: when the service starts, the system loads the data and builds the initial index; then, as new data is written, it is indexed in real time. This dynamic index update mechanism ensures both data freshness and efficient queries.

However, it's also important to note that if the index is too large, it can indeed put a strain on ClickHouse's memory management. Therefore, when designing an indexing strategy, it is necessary to comprehensively consider the characteristics of the data, the requirements of the query, and the resource constraints of the system to find the best balance.

Q4: If the index is too large and puts pressure on memory management, what solution does TCHouse-C consider based on to solve this problem?

A4: In a production environment, not all scenarios require real-time logical updates to the data of the entire table. By limiting the scope of updates, especially for partitions that are considered "cold data", the need for memory and compute resources can be effectively reduced.

Specifically, you might consider building your primary key index at the partition level, rather than at the entire table level. In this way, the data inside each partition can ensure the uniqueness and non-duplication of the primary key, without the need for strict synchronization and inspection on a global scale. This strategy is acceptable in most cases because it allows you to be more flexible with old or cold data while maintaining consistency in critical data (i.e., the most recent data).

Of course, this optimization also involves a trade-off. On the one hand, it reduces the complexity and resource consumption of the system; on the other hand, it may sacrifice certain data consistency and integrity guarantees in some scenarios. Therefore, when implementing such a strategy, it is necessary to fully understand the business needs and data characteristics to ensure that the trade-offs made are realistic.

Q5: Following the schema-free JSON optimization direction, logs are stored in ClickHouse for analysis, but ClickHouse's concurrency support is not great. When several people run log analysis at the same time, will queries frequently fail?

A5: It's true that ClickHouse's storage mechanism and data model are essential for performance optimization. When data is stored in ClickHouse in a manner similar to "schema-less", it still follows a strict columnar storage structure internally, which has significant performance advantages for large-scale data analysis. As a columnar database management system designed for online analytical processing (OLAP), ClickHouse has many optimization tools and mechanisms, such as partitioning, indexing, and compression, which can be fully utilized in the large and wide table mode to provide extremely high query performance.

When it comes to concurrent operations and potential failures, it does depend on a variety of factors, including the configuration of the system's resources, the complexity of the query, the frequency of data updates, and so on. When resources are sufficient and used correctly, ClickHouse is usually able to handle concurrent requests well and reduce the occurrence of failures. In addition, ClickHouse's distributed architecture also provides good fault tolerance and scalability, which helps to further improve the stability and reliability of the system.

The practical experience you mentioned in using ClickHouse to store logs at Bilibili is very valuable. From a practical point of view, even on a high-traffic platform like Bilibili, ClickHouse can easily cope with the storage and query requirements of log data. This is due to the efficient performance and stability of ClickHouse, which makes it ideal for working with large-scale log data. At the same time, this also proves that under the premise of correct configuration and use, ClickHouse can be competent for various complex business scenarios.

Q6: How does ClickHouse handle JSON field type changes and their impact on query performance and correctness? Specifically, when the data type of a JSON field changes from a number to a string during the storage process, how does ClickHouse identify and store the changed data type? Does this type change affect the correctness or performance of subsequent queries?

A6: There are some details involved in handling JSON field type changes in ClickHouse. First of all, regarding the derivation of numeric types, ClickHouse tries to store the data with the smallest possible storage space based on the actual size of the data, such as step-by-step derivation from bytes to integer types. However, this automatic derivation mechanism can cause problems when faced with subsequent changes in the data type, especially when the values are outside the range of the initial derivation type.

To solve this problem, we took precautions: for numeric types, we uniformly store them as 64-bit integers (such as Int64) or double-precision floating-point numbers (such as Double), regardless of the actual data size. This strategy ensures that even if the data type changes subsequently, it is stored correctly, avoiding data overflow or loss of accuracy due to type derivation errors.

Regarding type changes in JSON fields, ClickHouse takes a series of measures when it finds that a field's type does not match the defined type. The most direct is to throw an exception and notify the business layer to handle it: the business layer can modify the table structure with an ALTER TABLE statement to accommodate the new data type, or remove the problem field and introduce a new one. While straightforward, this approach requires sufficient flexibility and responsiveness from the business side.

We're also exploring a smarter approach: when data is written, try to convert a string-typed field to the expected numeric type (e.g. a long integer). If the conversion succeeds, the value is stored as the new type; if it fails (for example, the string contains non-numeric characters), an error is reported to the user. This improves the system's degree of automation and reduces the need for manual intervention.
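
The "try to convert, otherwise report an error" idea can be illustrated with ClickHouse's built-in conversion functions; this shows the behavior only and is not TCHouse-C's internal code.

```sql
SELECT
    toInt64OrNull('12345') AS ok_value,   -- 12345: stored as the expected numeric type
    toInt64OrNull('12x45') AS bad_value;  -- NULL: such a value would be rejected and reported

-- Plain toInt64('12x45') would instead throw a parsing exception immediately.
```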

Also, for numeric types, the highest precision data types, such as 64-bit integers and double-precision floating-point numbers, are used uniformly to ensure data integrity and accuracy. For the conversion of strings to numeric values, it is also an effective way to handle errors by trying to convert and reporting errors. This design not only ensures the robustness of the system, but also improves the user experience.

That's all for this sharing, thank you.

