
Shopping cart migration from Cassandra to Mongo to enhance the shopping experience - Translation

Author: Flash Gene

Introduction

The shopping cart is a key component in an e-commerce business. A smooth checkout process boosts user confidence in the platform. Given the importance of the shopping cart in retaining users' purchase intent, we need to ensure that the cart system is highly stable, performant, fault-tolerant, and able to handle high traffic.

Myntra's user base continues to grow rapidly. The more users on the platform, the more shopping carts are created and the more actions are taken on each cart. The shopping cart data is saved in the Cassandra data store. Cassandra is a strong choice for shopping cart persistence because of its huge scalability in terms of write operations and fault tolerance. However, why it wasn't the perfect choice for us and our journey to MongoDB is exactly what we'll be discussing in this blog post.

The decision to migrate from Cassandra to MongoDB

During peak traffic periods, the shopping cart service started to see an error rate of about 0.05%, caused by timeouts as latency spiked by about 1 second; Cassandra latency exceeded 250 milliseconds. As a result, a large number of users could not "add to cart", which led to a drop in our revenue. Here are some of the key observations about Cassandra.

Problems with Cassandra

We use Cassandra as the primary database to hold user carts. While this was initially justified with high availability and scalability in mind, recently we've started to notice quite a few performance issues affecting the overall stability of the shopping cart application.

  • Tombstones cause high read latency: in the shopping cart, almost 35% of operations are updates (overwrites), deletions, and TTL expirations, which produces a very large number of tombstones in Cassandra. Each delete/update on a cart creates 16 tombstones. As tombstones accumulate, read queries slow down because the database must scan a large number of tombstones to reach the live data. In our case, this caused latency spikes of over 250 milliseconds, breaching our SLAs and degrading service performance. To contain the problem, we had to repair and compact the database regularly, especially before HRD events.
  • Use of batch statements: the multiple lookup tables created in Cassandra require at least 4–5 queries to read and update whenever a cart is created, merged, or deleted. Batch statements are used to keep these tables consistent, which increases network traffic to the database, overall network bandwidth utilization, and latency.
  • Counter column limitations: Cassandra does not allow counter columns and non-counter columns in the same table. We needed a separate table just to track the number of products, which added the complexity of managing multiple updates.

Cassandra 改进

We made a number of improvements to mitigate these issues.

  • Reduced GC grace period: to reduce tombstone accumulation, the GC grace period was reduced from the default of 10 days to just 6 hours, leading to faster tombstone removal and improved performance. We use quorum consistency, which ensures that deleted data does not reappear when a tombstone has not propagated to a replica before compaction.
  • Long GC pauses: frequent system pauses and read-latency spikes were observed, coinciding with GC pauses lasting 2 to 3 seconds. Enabling GC logging and analyzing GC performance helped identify the issue; after the long pauses were resolved, system stability improved and the read-latency spikes subsided.
  • Off-heap caching: cache settings in Cassandra were reconfigured to address performance issues. key_cache_size_in_mb and row_cache_size_in_mb were decreased from 2 GB and 4 GB, respectively, to 512 MB each, while file_cache_size_in_mb was increased from 8 GB to 18 GB. Off-heap caching options, such as the block cache, were used to cache portions of the SSTables read from disk. Reconfiguring the cache settings and leveraging off-heap caching improved cache efficiency and sped up read operations.
  • Resized bloom filters: bloom filters speed up read operations by avoiding unnecessary checks against SSTables; we tuned their in-memory size to optimize this.
  • Compaction strategy: choosing the right compaction strategy for your workload is critical. The default Size-Tiered Compaction strategy suits write-intensive workloads, while Leveled Compaction may be better suited to read-intensive workloads.

Overall, these measures made the Cassandra system more efficient and stable: pauses were reduced, cache utilization improved, and the compaction strategy was matched to the workload. With these changes, we were able to support 1.5 times our previous load. However, they were not enough to solve our problem, because our data is transient in nature and needs a store that is highly optimized for reads and deletes.

Choosing Mongo

There are a few major factors to consider when choosing a data store for the cart. Strong consistency, support for secondary indexes, and more helped us evaluate whether Mongo was the right choice.

  • The cart is a read-heavy system, with a 3:1 read-to-write ratio. High read volumes can be handled by adding more slave nodes to the MongoDB cluster.
  • MongoDB supports secondary indexes on any field in a document, which makes it more suitable for certain shopping cart use cases.
  • Cassandra nodes require regular repair and compaction operations, which MongoDB clusters avoid.
  • Mongo requires less maintenance and fewer nodes than Cassandra to support the same amount of traffic. We reduced our footprint by 40%.
  • In contrast to Cassandra, Mongo easily maps business data to schemas with optional attributes.

As we moved forward with MongoDB, we also considered a few other factors:

  • The entire shopping cart is stored as a single document. While this optimizes reads that load the whole cart at once, it can increase disk throughput because each read or write touches the full document. For workloads that load an entire cart, however, this model performs well thanks to the database's internal cache and the OS page cache.
  • Checkout operations require strongly consistent data, so checkout reads across the cart need to be served by master nodes.
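For illustration, a single-document cart under this model might look like the sketch below. All field names are assumptions for this example, not Myntra's actual schema:

```python
from datetime import datetime, timedelta, timezone

# Illustrative single-document cart (field names are assumptions).
# The whole cart is read and written as one document, so loading a
# cart is a single query by _id -- no lookup tables, no batch reads.
cart = {
    "_id": "u123:cart",             # cart ID derived from the user ID
    "sessionId": "s-789",
    "items": [
        {"skuId": 101, "quantity": 2, "price": 499},
        {"skuId": 202, "quantity": 1, "price": 1299},
    ],
    "itemCount": 3,                  # kept inline; no separate counter table
    "expireAt": datetime.now(timezone.utc) + timedelta(days=30),
}

def load_cart(collection, cart_id):
    """A consistent read of the full cart is one lookup by primary key."""
    return collection.find_one({"_id": cart_id})
```

Keeping the item count inline also sidesteps Cassandra's counter-column restriction mentioned earlier.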

Data modeling and schema design

The old Cassandra schema

One of the main problems with the old Cassandra schema was the use of multiple lookup tables to reach the data. To access a cart, the system had to make two read calls: one to the lookup table to get the cart ID, and one to the cart table to get the relevant data. This meant at least two read calls per cart read operation, and any insert or delete required at least two write calls.

The Mongo schema

In the new Mongo schema, we made some changes to correct the above issues. We moved away from using a lookup table to find the cart ID; instead, the user identifier is an explicit part of the cart ID, which eliminates the need for multiple lookup tables.

  • Logged-In-Cart: This document stores the cart ID and session ID as well as cart-related data. The decision to explicitly store the session ID in the login cart document was made to reduce the number of merge operations.
  • Non-Logged-In-Cart: This document contains the shopping cart of the user who is not logged in. This cart is valid for a different period than a logged-in cart. We do not store user-related data in this shopping cart. Once the cart merging process is complete, this cart will be deleted.
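Deriving cart IDs from the user or session identifier might look like the sketch below; the exact ID format is an assumption for illustration, not the production scheme:

```python
# Sketch: the cart ID embeds the user (or session) identifier, so a
# cart can be fetched directly by _id with no lookup table.
def logged_in_cart_id(user_id: str) -> str:
    # The user ID is an explicit part of the cart ID.
    return f"cart:user:{user_id}"

def non_logged_in_cart_id(session_id: str) -> str:
    # Anonymous carts are keyed by session ID instead.
    return f"cart:session:{session_id}"
```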

Cart merge process

A user who is not logged in can create a shopping cart. When that customer then logs in to their Myntra account, we want to show them the items from both the non-logged-in cart and the logged-in cart. The process of combining the two is known as cart merging.

Cassandra vs. MongoDB handling

Cassandra flow

(Diagram: cart merge flow in Cassandra)

Mongo flow

(Diagram: cart merge flow in Mongo)

MongoDB cart expiration TTL solution

The system maintains a collection each for logged-in and non-logged-in user carts, each with a different TTL (time to live). The TTL index's expireAfterSeconds is set to 0, and the actual expiration time is managed by the application, which stores a datetime in the indexed field. The TTL monitor periodically scans the TTL indexes and identifies expired documents to delete. Checkpoints occur every 50 seconds, and within the next 10 seconds the monitor asks the MongoDB threads to expire as many documents as possible.

If they are not all processed within 10 seconds, the remaining expired documents pile up, and the MongoDB threads are then asked to expire them in bulk. After expiration, the reindexing process is triggered. To minimize the impact of this heavy index activity, documents are scheduled to expire at midnight, when overall traffic is low. This approach reduces index churn and ensures that all expirations occur during off-peak hours.
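The application-managed expiry described above can be sketched in plain Python. The helper name `cart_expire_at` and the `expireAt` field are assumptions for illustration; the pymongo index call is shown only as a comment so the sketch runs standalone:

```python
from datetime import datetime, time, timedelta

# Assumed index setup (pymongo), shown for context only:
#   collection.create_index("expireAt", expireAfterSeconds=0)
# With expireAfterSeconds=0, MongoDB deletes a document once the
# datetime stored in its "expireAt" field has passed.

def cart_expire_at(created: datetime, ttl_days: int) -> datetime:
    """Expiry aligned to the midnight after the nominal TTL, so bulk
    deletions (and the index churn they trigger) happen off-peak."""
    nominal = created + timedelta(days=ttl_days)
    # Round up to the next midnight after the nominal expiry.
    return datetime.combine(nominal.date() + timedelta(days=1), time.min)
```

For example, a cart created on 2024-01-01 at 15:30 with a 30-day TTL would be scheduled to expire at midnight on 2024-02-01.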

MongoDB scale benchmarking

The goal of the benchmark was to determine the correct hardware and software configuration to handle the cart traffic, keeping in mind the consistency requirements of the cart application compared to Cassandra.

  • The first step is to identify the NFRs to benchmark
      • The number of concurrent requests to meet future demand
      • Expected throughput for each API
      • All in-flight features, so the scale target is right
      • Expected latency levels
  • Identify the data model to use with Mongo for the benchmarking exercise
  • Determine the hardware configuration
      • Number of cores
      • RAM sized for the workload's hot data (e.g., 25–30%)
      • Estimated storage capacity
      • Hard drives with caching enabled
  • Determine the MongoDB configuration
      • Mongo version
      • Master/slave setup (1 master, 2 slaves to start)
      • Checkpoints: every 1 minute
      • Journal sync: every 100 milliseconds
      • Cache configuration
      • Consistency level
  • Determine the workload based on current traffic patterns
      • Read-heavy, with a read:write ratio of 3:1
      • Prepare the sample dataset
  • Determine the client configuration
      • Connection pool size
  • Perform different workload runs
      • All reads
      • Mixed loads
      • Different consistency levels
  • Capture application and database metrics
      • Latency, throughput
      • CPU utilization, RAM usage
      • Read and write IOPS
      • Disk IO utilization
      • Disk IO throughput
      • P99 and p99.9 latency

In benchmarking, we were able to sustain ~1.7 million RPM at ~50 ms latency (p99.9).

Migration strategy

Here are the goals set for the migration strategy:

  • Zero downtime
  • Zero impact on traffic
  • Enable AB to route traffic

Enable bi-directional dual synchronization between Cassandra and Mongo

Before performing the one-time migration, we set up a data-synchronization pipeline between Cassandra and MongoDB: any write to Cassandra is synchronized to MongoDB, and vice versa. This is processed asynchronously so that it doesn't affect the write latency of incoming user requests. All race conditions, failures, and retries are handled so that the data in the two stores always stays in sync.

As part of the dual synchronization process, every operation performed in one database is synchronized to the other through a Kafka-based pipeline from the primary database to the secondary. When a user performs an action, a journal entry is created to capture it, and an event is published to the Kafka pipeline. This journal entry ensures that we don't switch a user's assigned database until the secondary database has consumed their events. Once an event is consumed, the corresponding journal entry is deleted. If no entry exists for a user, that user is eligible for migration.
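The journal mechanics can be sketched with in-memory stand-ins for the journal table and the Kafka topic; all names here are illustrative, not the production components:

```python
# In-memory stand-ins (illustrative) for the journal table and Kafka topic.
journal = {}     # user_id -> latest unsynced write event
published = []   # stand-in for the Kafka pipeline

def on_cart_write(user_id, payload, primary_db):
    """Record a write: keep only the latest event per user, so the
    most recent data is what gets synced to the other database."""
    event = {"user": user_id, "payload": payload, "primary": primary_db}
    journal[user_id] = event       # newer write overwrites an older one
    published.append(event)        # publish to the sync pipeline

def on_event_consumed(user_id):
    """The secondary DB applied the event; clear the journal entry."""
    journal.pop(user_id, None)

def eligible_for_migration(user_id):
    """A user can be switched only when no unsynced event remains."""
    return user_id not in journal
```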

One-time data migration

The plan involves migrating the full shopping cart dataset from Cassandra to MongoDB in one go. This will be done using a batch application that reads the lookup table and moves the cart associated with each entry by querying the Cassandra Cart table for the latest data. To prevent duplicate migrations, records that have been migrated through dual synchronization will not be migrated in bulk.
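The batch loop might be sketched as follows; the callables are hypothetical stand-ins for the real lookup-table reader, Cassandra cart reader, dual-sync check, and Mongo writer:

```python
# Sketch of the one-time batch migration (all names are illustrative).
def migrate_all_carts(lookup_entries, already_synced,
                      read_cassandra_cart, write_mongo_cart):
    migrated, skipped = 0, 0
    for entry in lookup_entries:
        cart_id = entry["cart_id"]
        if already_synced(cart_id):
            # Carts already copied by dual sync are not migrated again.
            skipped += 1
            continue
        cart = read_cassandra_cart(cart_id)  # fetch the latest data
        write_mongo_cart(cart)
        migrated += 1
    return migrated, skipped
```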

Use A/B for incremental traffic routing

The one-time migration was released behind a feature gate with user-driven A/B. We enabled A/B so that a small slice of traffic (e.g., 1% of users) flowed into the MongoDB cluster, and once no issues were reported, rolled out to 100% of users.
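One common way to implement such percentage-based routing is deterministic hash bucketing; the sketch below is illustrative only (the function name and bucket scheme are assumptions, not Myntra's actual A/B system):

```python
import hashlib

def routes_to_mongo(user_id: str, rollout_percent: float) -> bool:
    """Deterministic per-user bucketing: the same user always lands in
    the same bucket, so their traffic doesn't flip-flop between data
    stores as the rollout percentage grows."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 10_000         # bucket in 0..9999
    return bucket < rollout_percent * 100     # e.g. 1% -> buckets 0..99
```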

Ensure fault tolerance

Dual synchronization logs

  • To ensure that no data is lost during dual sync or the one-time migration, we decided to maintain a log. Whenever a write is performed on the user's primary database (which can be Cassandra or MongoDB), we keep that event in the log until the data has been synchronized to the other database.
  • Consider a user who makes changes to their cart just as we enable A/B for them: because dual sync is an asynchronous process, it takes some time to copy the data, so some of it could be lost.
  • So, whenever a user tries to access their cart, we first check the log. If an event already exists for that user, we read from the primary database (i.e., where the last write occurred), which is also maintained in that schema.
  • Once the dual-synchronization job completes, the event is removed from the log.
  • We maintain only a single event per user in the log: if a user performs two writes in a row, we keep only the most recent write's event, even if the first has not yet been replicated to the secondary database. This ensures that the latest data is always what gets synced.
  • For all records that fail to be published to Kafka (for whatever reason), the synchronization job picks up the database entries related to the change and updates the second database from them.
  • During the one-time migration, one of our jobs continuously samples random users' data in both databases to find discrepancies. This is important because we made significant changes to the schema design between the two data stores.
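The read-path check from the list above might look like this sketch, with an in-memory stand-in for the sync log (names are illustrative):

```python
# In-memory stand-in for the sync log (illustrative).
sync_log = {}   # user_id -> {"primary": "cassandra" | "mongo"}

def database_for_read(user_id, ab_assignment):
    """Serve from the last-written database while a sync is pending;
    otherwise the A/B assignment decides."""
    event = sync_log.get(user_id)
    if event is not None:
        # Pending sync: only the primary recorded with the event is
        # guaranteed to have the user's latest write.
        return event["primary"]
    return ab_assignment
```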

Metrics, alerts, and monitoring

Here are some metrics that can help resolve issues faster.

One-time data migration metrics

  • One-time migrations (rate, overall)
  • One-time migration errors (rate, overall)
  • Data validation errors (rate, overall)

Dual-sync metrics

  • Update event publishes (rate, overall)
  • Update event consumption (rate, overall)
  • Publish and consumption errors (rate, overall)

Deployment strategy

MongoDB was launched "silently" to avoid any risk to incoming traffic. The sequential rollout was instrumental in making the move to Mongo seamless.

  • First, we enable dual sync
  • Then we enable the sync log (the database-decision table, which ensures a user's database doesn't change until their data is fully migrated)
  • One-time migration of existing shopping carts (all shopping carts that are not moved via dual sync)
  • Reconcile all errors in a one-time migration
  • Turn on feature gating
  • Turn on A/B for a small subset of users to route traffic to Mongo
  • Monitor the dashboard for any anomalies
  • Fix any synchronization-related issues between the two datastores
  • Incrementally increase your A/B users
  • Cover 100% of user traffic to Mongo
  • Clean up all existing data in Cassandra
  • Once stabilized, retire the Cassandra nodes

Conclusion

It's been a few months since the migration, we haven't had any issues, and we've successfully resolved all the performance bottlenecks we observed in Cassandra. Tens of millions of shopping carts have been migrated, and latency has improved from 250 ms (during peak periods) to under 25 ms. Read performance has improved dramatically thanks to the better schema design and MongoDB's read path. The move has also reduced our hardware footprint by 40% and eliminated the significant maintenance and operating costs that Cassandra required. That said, we don't rule out that Cassandra, given a better schema design from the very beginning of the cart's evolution, could have performed better for us than it did.

Overall, by scaling Mongo nodes, we now have the ability to handle up to 3 times the current amount of traffic before we need to reconsider the next best option in the future.

Author: Agnihotri, Myntra Engineering

Co-contributors: Shubham Shailesh Doizad, Pramod MG, Ajit Kumar Avuthu, Vindhya Shanmugam, Subhash Kumar Mamillapalli

Source: https://medium.com/myntra-engineering/cart-migration-from-cassandra-to-mongo-for-an-enhanced-shopping-performance-104c5d3b0da7
