Using INFINI Console for Incremental Data Migration in Elasticsearch
Elasticsearch
INFINI Console
2023-06-09

Introduction #

Version 1.3.0 of INFINI Console adds support for incremental migration to the data migration feature. This article provides an overview of the use cases for incremental migration and how it works.

Scenario Overview #

Taking the logging scenario as an example, let’s assume that Cluster A has an index called request-logs for recording online HTTP request logs, with the following data structure:

{
  "request_body": {...},
  "request_header": {...},
  "method": "POST",
  "request_time": "2023-06-09 12:30:09+800", // Time when the client recorded the request
  "@timestamp": "2023-06-09 12:30:11+800" // Time when the request was written to Elasticsearch
}

We want to fully import the data from this index into the request-logs index in Cluster B. To ensure the integrity of the imported data, we first need to account for the delays involved in writing data to Elasticsearch:

  1. Data may arrive at Cluster A with some delay. Logs are usually collected asynchronously from many different nodes, and network delays vary, so the time at which each log reaches Elasticsearch is not uniform.
  2. Data written to Elasticsearch is not immediately visible to queries; written data becomes searchable only after Elasticsearch performs an asynchronous refresh.

In other words, if the delay for each request log, from collection through writing to Elasticsearch to becoming queryable, is at most d, then the range we can fully migrate in each incremental run is [Current Time - Migration Interval - d, Current Time - d). As long as no record's write delay exceeds d, we can pull the complete dataset from Cluster A and push it to Cluster B.
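
For illustration, this window maps directly onto an Elasticsearch range query using date math. The following is a minimal sketch assuming a 15-minute migration interval and d = 1 minute, so the bounds become now-16m and now-1m; Console computes the equivalent bounds for each run automatically:

GET request-logs/_search
{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "now-16m", // Current Time - Migration Interval - d
        "lt": "now-1m"    // Current Time - d
      }
    }
  }
}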

What if Cluster A's Data Is Updated? #

Typically, we do not update log documents. Each document is immutable once written, so filtering on the @timestamp field is enough to identify the data to be migrated, and each record needs to be migrated only once to keep the target cluster consistent.

If the source data is updated, how can we still perform incremental migration? In most cases, we record each document's last update time in an update_time field, and incremental migration can then filter on update_time instead of @timestamp; a concrete sketch of one such incremental pass follows the figures below.

During the first migration, Index A contains the following data, and after performing the first migration operation, the data is successfully written to the target index:

Migration step 1

Before the second migration, one record in Index A is updated. The migration process detects this record through its update_time field, copies it, and overwrites the old record in the target cluster:

Migration step 2

As you can see, even if the source data is updated, as long as we record the update time for each data record, the data written to Cluster B during the migration process will remain complete and consistent.
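
Console drives this copy internally, but as a rough equivalent we can express one incremental pass with Elasticsearch's own _reindex-from-remote API; the host URL and window bounds below are illustrative placeholders, not values Console uses:

POST _reindex
{
  "source": {
    "remote": {
      "host": "http://cluster-a:9200" // placeholder address for Cluster A
    },
    "index": "request-logs",
    "query": {
      "range": { // only documents updated inside the current window
        "update_time": {
          "gte": "now-16m",
          "lt": "now-1m"
        }
      }
    }
  },
  "dest": {
    "index": "request-logs"
  }
}

Because _reindex preserves document IDs by default, an updated source document simply overwrites its stale copy in the target index, matching the behavior shown above. (Reindexing from a remote cluster also requires whitelisting the host via reindex.remote.whitelist on the cluster running the reindex.)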

What if Cluster A's Data Is Deleted? #

If documents are deleted from the source cluster, the migration logic described above gives the second migration run no way to tell that already-migrated data has since been deleted:

Migration process with hard delete

If we want to guarantee a complete migration, we need to avoid physically deleting documents in the source cluster. Instead, we can mark documents as deleted without actually removing them, and then migrate based on the time of deletion (or update), as sketched after the figure below:

Migration process with soft delete
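
As a minimal sketch of the soft-delete approach (the deleted flag and the document ID are illustrative assumptions, not something Console requires):

// Mark the document as deleted instead of removing it, and bump
// update_time so the next incremental pass picks the change up.
// <doc-id> is a placeholder for the document's _id.
POST request-logs/_update/<doc-id>
{
  "doc": {
    "deleted": true,
    "update_time": "2023-06-09 12:45:00+0800"
  }
}

Queries against Cluster B can then exclude soft-deleted documents, for example with a must_not clause on the deleted flag, and physical cleanup on both clusters can be deferred until the tombstones are no longer needed.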

Performing Incremental Data Migration with INFINI Console #

In INFINI Console, we can use the Data Tool - Data Migration feature to perform incremental data migration. As an example, let’s create a data migration task to migrate the data from the .infini_requests_logging-000002 index to the request index in the target cluster.

Create migration task in console

Since we are migrating into a newly created index, in the “Initialize Configuration” step we configure the mappings and settings for the target index as prompted. If no special configuration is required, the “Auto Optimize” option can auto-fill the settings.

Update target index setting

Next, we configure the field used to detect newly written data and the expected write delay. Request logs typically record their write time in the @timestamp field, and the write delay is usually no more than 1 minute, so in the “Migrate Setting” step we fill in the corresponding “Incremental” information. If the volume of historical data is large, or the interval between incremental runs is long, we can also configure “Partition” rules to split the task into finer-grained slices, so that no single task exports data for long enough to put excessive load on Elasticsearch (a conceptual example follows the screenshot below).

Update migration settings
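
To make the effect of partitioning concrete: conceptually, each partition corresponds to its own bounded slice of the source data, so a large backlog is exported as a series of small requests rather than one long-running scroll. Below is a sketch of a single one-day slice; Console generates the actual splits, and the field, width, and date format here simply follow the sample documents above:

{
  "range": {
    "@timestamp": {
      "gte": "2023-06-08 00:00:00+0800",
      "lt": "2023-06-09 00:00:00+0800"
    }
  }
}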

Finally, when creating the task, we select “Detect Incremental Data” and set the task to check for incremental data every 15 minutes.

Update execution settings

After clicking “Start,” the incremental task begins running. The first run migrates the historical data in full; after that, the task automatically checks every 15 minutes and migrates any newly detected data to the target index.

Task execution detail

Summary #

Beyond data consistency, the design of a migration feature also has to consider performance, stability, Elasticsearch version compatibility, and other factors. INFINI Console provides a user-friendly data migration solution that covers a wide range of migration scenarios.
