
Migrating a critical service from NoSQL to SQL with zero downtime


by: Jona Derksen

At Kramp Hub we are working on an e-commerce platform that sells agricultural products. The platform's service landscape consists of a microservices architecture. Each development team is responsible for a domain and the corresponding microservices, and each team chooses the persistence solutions it finds suitable for its services.

An order management service has been developed for the e-commerce platform. When this service went live, it was used by hundreds of thousands of customers. Over time, it was decided to replace its persistence layer. In this article I describe how this process went from the perspective of our development team, including the choices we made and the problems we encountered along the way.

The order service was initially built on Google's Datastore to store all order data. As the use of the service grew, so did the amount of data to store. Over time extra fields were added to the model and new queries had to be supported. This had a big impact on the performance of our order service, to the point where it could no longer support one specific query because of time-outs.

We had to improve the performance of the service to support the increasing traffic and amount of data. Since the data model had become more relational, we decided to move to a relational database and replace the NoSQL database with a SQL database. The more complex entities could then be stored in separate tables and linked through the relations in the data. On top of that, we would be able to write better queries to support all the specific requests, and most likely support the query that was giving time-outs.

To replace the database of this highly critical service without any downtime, the following plan was created:
1) The database schema had to be defined.
2) The old data would be migrated using a custom-built service. Once the data was migrated, a new version of the order service would be deployed, writing to both database instances.
3) The data would be validated to be in sync on both database instances.
4) The order service would be switched to depend only on the Postgres database for the order data.

In the end we managed to migrate without any downtime, but we did come across some problems. For each step I will discuss the problems we encountered and how we tackled them.

Step 1

 

Database Schema

 

In Datastore there were two major kinds that contained all the important order data. Within these kinds there were multiple entities that contained embedded entities, and these embedded entities held well-structured data. They could therefore be moved to separate tables in Postgres and linked to the order using a foreign key.
In one of the kinds the key of the entities was a string value containing letters and numbers. When moving the data from this kind to Postgres, the key would change from an alphanumeric key to a numeric key.

These two changes would be the main changes to the data and the schema. Keeping the changes as small as possible would help us avoid logical changes to the order service, so the focus could stay on the database migration.
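As an illustration only (the article does not show the actual schema), a JPA-style mapping of an order with its formerly embedded entities split out into a child table could look roughly like this; the entity, table and column names are made up:

import jakarta.persistence.*;
import java.util.ArrayList;
import java.util.List;

// Hypothetical mapping sketch: the embedded Datastore entities become rows in a
// separate table, linked back to the order with a foreign key. The alphanumeric
// Datastore key is replaced by a numeric, database-generated id (keeping the old
// key in a plain column is an assumption here).
@Entity
@Table(name = "orders")
class OrderEntity {
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    @Column(name = "external_key")
    private String externalKey; // old alphanumeric Datastore key

    @OneToMany(mappedBy = "order", cascade = CascadeType.ALL, orphanRemoval = true)
    private List<OrderLineEntity> lines = new ArrayList<>();
}

@Entity
@Table(name = "order_lines")
class OrderLineEntity {
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    @ManyToOne(fetch = FetchType.LAZY)
    @JoinColumn(name = "order_id") // foreign key to orders.id
    private OrderEntity order;
}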

Step 2

 

Updating the order service

 

The added logic was to start writing the order data both to Datastore (as usual) and to Postgres (new). The data would still be read from Datastore until step four, when we would start reading from Postgres instead.

The entire migration was started because the performance of the order service was not satisfactory, and letting the service write to two databases could deteriorate the performance even further. We expected the deterioration to be minimal, though: the extra write operation would add a few milliseconds to the request, and since the write action is non-blocking it does not delay the service response. Reads would only ever go to either Datastore or Postgres, and would therefore not affect performance negatively.

It was important for the service to keep the same behaviour: if something failed while writing to Postgres, this should not affect the order flow. The service would therefore catch any Postgres-related error, log it and continue as if nothing happened. This approach could result in data inconsistencies, but the order flow would not be affected, giving us time to find and fix any problems, as well as any data inconsistencies, if they arose.
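A minimal sketch of what this dual-write step could look like, assuming a simple repository wrapper; the store interface, order record and class names are made up, not the actual service code:

import java.util.concurrent.CompletableFuture;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Dual-write sketch: Datastore stays the source of truth, the Postgres write is
// non-blocking and any Postgres failure is logged instead of failing the order flow.
class DualWriteOrderRepository {
    private static final Logger log = LoggerFactory.getLogger(DualWriteOrderRepository.class);

    record Order(String id, String status) {} // simplified placeholder for the real model
    interface OrderStore { void save(Order order); Order find(String orderId); }

    private final OrderStore datastoreStore; // existing Datastore-backed store
    private final OrderStore postgresStore;  // new Postgres-backed store

    DualWriteOrderRepository(OrderStore datastoreStore, OrderStore postgresStore) {
        this.datastoreStore = datastoreStore;
        this.postgresStore = postgresStore;
    }

    void save(Order order) {
        datastoreStore.save(order); // primary write, unchanged behaviour

        // Best-effort write to Postgres; errors are caught and logged so the
        // order flow is never affected while the migration is in progress.
        CompletableFuture.runAsync(() -> postgresStore.save(order))
                .exceptionally(ex -> {
                    log.error("Writing order {} to Postgres failed", order.id(), ex);
                    return null;
                });
    }

    Order find(String orderId) {
        return datastoreStore.find(orderId); // reads stay on Datastore until step 4
    }
}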

The e-commerce platform is active in multiple countries, which made it possible to roll out the new feature, reading from the Postgres database, per country. The rollout per country was done to keep the platform stable and to catch problems as quickly as possible with as little impact as possible: start with a country with fewer users and monitor the behaviour before rolling out to the next country.
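A sketch of such a per-country read switch; how the set of enabled countries is managed (feature flags, configuration) is an assumption, and the types are made up:

import java.util.Set;

// Per-country rollout sketch: reads go to Postgres only for countries that have
// been switched over; all other countries keep reading from Datastore.
class OrderReadRouter {
    interface OrderReader { Object findOrder(String orderId); }

    private final Set<String> postgresCountries; // assumed to come from configuration
    private final OrderReader datastoreReader;
    private final OrderReader postgresReader;

    OrderReadRouter(Set<String> postgresCountries, OrderReader datastoreReader, OrderReader postgresReader) {
        this.postgresCountries = postgresCountries;
        this.datastoreReader = datastoreReader;
        this.postgresReader = postgresReader;
    }

    Object findOrder(String orderId, String countryCode) {
        return postgresCountries.contains(countryCode)
                ? postgresReader.findOrder(orderId)
                : datastoreReader.findOrder(orderId);
    }
}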

Migrating the data

 

To migrate all the data, the decision was made to create a custom service. The service would read from Datastore, map the data to the newly created database schema and insert it into the Postgres database. While building the migration service we found out that it would take too long to migrate all the data this way. So instead of connecting to the Postgres database directly and writing the data entity by entity, a SQL file was created with an insert statement for each entity. This SQL file was then used to import all the data into the Postgres database. Our test data contained some inconsistencies, which made the import fail. The faulty insert statement was located in the import file, and some checks were added to validate the data and prevent such invalid statements from being generated.
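A rough sketch of this export step, with made-up entity fields and validation rules, assuming the job writes one INSERT statement per entity to a .sql file:

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Instant;
import java.util.List;

// Bulk-export sketch: instead of writing every entity to Postgres one by one,
// produce a .sql file with one INSERT statement per entity and import it in a single run.
class OrderExportJob {
    record DatastoreOrder(String key, String status, Instant createdAt) {}

    void export(List<DatastoreOrder> orders, Path outputFile) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(outputFile)) {
            for (DatastoreOrder order : orders) {
                if (!isValid(order)) {
                    // Report and skip inconsistent records instead of emitting an
                    // insert statement that would make the whole import fail.
                    System.err.println("Skipping invalid order: " + order.key());
                    continue;
                }
                writer.write(toInsertStatement(order));
                writer.newLine();
            }
        }
    }

    private boolean isValid(DatastoreOrder order) {
        return order.key() != null && order.status() != null && order.createdAt() != null;
    }

    private String toInsertStatement(DatastoreOrder order) {
        // Values would need proper escaping in a real export.
        return String.format("INSERT INTO orders (external_key, status, created_at) VALUES ('%s', '%s', '%s');",
                order.key(), order.status(), order.createdAt());
    }
}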

There was a requirement to prevent any downtime on the order service. Because of that requirement, an old version of the service would still be running during the data migration. That version could still create new orders, which would not exist in Postgres; these uncertain orders we called delta data. To keep this delta data as small as possible, the data migration and the deployment of the new order service were planned at the end of a business day.

To ensure that both database instances would be in sync, a second, smaller migration would be performed, covering only the time window in which we expected the delta data. Some of that data could already be in the Postgres database, but in that case it would simply be overwritten with the same data, which is not a problem. The initially created migration service, which read data from Datastore, mapped it and inserted it into Postgres, could be reused for this purpose, since the amount of data was small and a migration run would not take long. The only adaptation was a date and time range filter to select the data that needed to be migrated. The service would also log any failures, giving us the ability to monitor this migration process more precisely.
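A sketch of what this delta run could look like, assuming Spring's JdbcTemplate and a Postgres upsert on a unique external key; the reader interface and column names are made up:

import java.sql.Timestamp;
import java.time.Instant;
import java.util.List;
import org.springframework.jdbc.core.JdbcTemplate;

// Delta-migration sketch: only orders created in the given window are re-migrated,
// and rows that were already migrated are simply overwritten with the same values.
class DeltaMigrationJob {
    record DatastoreOrder(String externalKey, String status, Instant createdAt) {}
    interface DatastoreReader { List<DatastoreOrder> findCreatedBetween(Instant from, Instant to); }

    private final DatastoreReader datastoreReader;
    private final JdbcTemplate jdbcTemplate;

    DeltaMigrationJob(DatastoreReader datastoreReader, JdbcTemplate jdbcTemplate) {
        this.datastoreReader = datastoreReader;
        this.jdbcTemplate = jdbcTemplate;
    }

    void migrate(Instant from, Instant to) {
        // Assumes a unique index on orders.external_key.
        String upsert = "INSERT INTO orders (external_key, status, created_at) VALUES (?, ?, ?) "
                + "ON CONFLICT (external_key) DO UPDATE SET status = EXCLUDED.status";
        for (DatastoreOrder order : datastoreReader.findCreatedBetween(from, to)) {
            try {
                jdbcTemplate.update(upsert, order.externalKey(), order.status(),
                        Timestamp.from(order.createdAt()));
            } catch (Exception e) {
                // Log failures so this smaller run can be monitored precisely.
                System.err.println("Delta migration failed for order " + order.externalKey() + ": " + e);
            }
        }
    }
}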

Step 3

 

Validating

 

Parts of the migration service could be reused for validating the data. Given a date range, the service would check whether all the data in Datastore matched the data in Postgres. The same service could then be used to migrate the correct data from Datastore to Postgres if any out-of-sync records were identified. After that, we could investigate what caused the incorrect data in Postgres and fix it. No big problems arose during this step, so after a while of monitoring and validating we moved on to the next step. Although, as it turned out, we should not have done so yet.

Step 4

 

Reading from Postgres

 

The order service was then switched to read from Postgres instead of Datastore. Now that the order service depended on Postgres, it would throw an error instead of just logging it if something in the process failed. The service would still write to both instances, keeping a backup of the data in Datastore and giving us the option to roll back the service at any time with the correct data still available.

The service was rolled out per country, as mentioned before, starting with a smaller country and monitoring the behaviour. The database CPU consumption went up extremely high. Fortunately, making more CPU power available to Postgres solved the immediate problem, but the usage surprised us: seeing such high resource consumption in a small country, we had to expect a lot more once the larger countries went live. After creating some extra indexes the CPU consumption dropped dramatically and the database instance was stable again.

The data looked valid throughout the process, so it was time to roll out to a bigger country (our home country). After rolling out there, we were notified about users seeing incorrect order information. Rolling back, so that we depended on Datastore again, made the problems disappear and gave us time to investigate what had happened.

Because we rolled out per country, the impact on users was limited to the countries that had been rolled out. However, we expected everything to be going correctly, since we did not get any failure reports from the deployment to the smaller country. There could be two reasons for this. It could be that, because of the smaller number of users in that country combined with the short time in which the problem occurred, nobody had experienced it. Another reason could be that some users did experience the problems, but since this was not happening in our home country, the communication was inadequate and the reports had not reached us yet.

Validating

 

Because the validation of the data was done on all fields of an entity, validating a day of orders was slow, so we were unable to validate all the data. The data we did selectively validate was consistent, but we missed inconsistencies in the remaining data. Once the painful problem was reported by users, the data validation method was changed. The inconsistency was in the part of the data used to determine the status of the order. This status field was the only field that would be updated after the initial creation of an order. Since this field was responsible for all the identified data inconsistencies, our validation process was changed to look only at this field. This was a much faster check and could be run on all the data. All inconsistencies were found and the correct data was migrated.
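A sketch of this status-only check, with made-up types and lookups; it collects the keys of out-of-sync orders so they can be re-migrated:

import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Validation sketch: compare only the status field, the one field that changes
// after an order is created, for all orders in a given time window.
class OrderStatusValidator {
    interface StatusSource { Map<String, String> statusByOrderKey(Instant from, Instant to); }

    private final StatusSource datastore;
    private final StatusSource postgres;

    OrderStatusValidator(StatusSource datastore, StatusSource postgres) {
        this.datastore = datastore;
        this.postgres = postgres;
    }

    List<String> findOutOfSyncOrders(Instant from, Instant to) {
        Map<String, String> expected = datastore.statusByOrderKey(from, to);
        Map<String, String> actual = postgres.statusByOrderKey(from, to);

        List<String> outOfSync = new ArrayList<>();
        for (Map.Entry<String, String> entry : expected.entrySet()) {
            // Missing rows and differing statuses both count as inconsistencies.
            if (!entry.getValue().equals(actual.get(entry.getKey()))) {
                outOfSync.add(entry.getKey());
            }
        }
        return outOfSync;
    }
}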

Kubernetes

 

The inconsistencies in the data were hard to explain, since everything seemed to work fine. Looking at the times the data inconsistencies happened, they occurred during deployments of the order service. The order service runs on a Kubernetes cluster: when a new version of the order service is deployed, the current pods are shut down and new pods are started. Because of the way the shutdown of pods was configured, a pod did not have enough time to let any remaining Postgres writes complete before shutting down. After this configuration was changed, no more inconsistencies in the order data occurred.
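The actual fix was made in the pod shutdown configuration (for example the termination grace period or a preStop hook), but the underlying idea can also be illustrated on the application side: give in-flight Postgres writes time to finish before the process exits. A minimal sketch, with made-up names:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Graceful-shutdown sketch (illustration of the idea, not the actual fix):
// drain queued Postgres writes before the JVM exits.
class PostgresWriteExecutor {
    private final ExecutorService executor = Executors.newFixedThreadPool(4);

    PostgresWriteExecutor() {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            executor.shutdown(); // stop accepting new writes
            try {
                // Wait up to 30 seconds for queued writes to complete; this must
                // fit within the pod's termination grace period.
                if (!executor.awaitTermination(30, TimeUnit.SECONDS)) {
                    executor.shutdownNow();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));
    }

    void submitWrite(Runnable postgresWrite) {
        executor.submit(postgresWrite);
    }
}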

Conclusion

 

Thanks to this change we are now able to run the query that was giving time-outs before. We see a slight overall improvement in performance and can start writing better-performing queries to improve even further.

During the whole process we ensured the data in Datastore stayed correct. That way we were always able to roll back at any time, or migrate data from Datastore to Postgres. Maintaining a correct set of data throughout the process was crucial.

Having a migration service to migrate the data, initially only for the delta, turned out to be really useful later on. When something goes wrong, quite a bit of data is impacted, and being able to automatically migrate a selected list of records is very helpful in that case.

Rolling out per country is nice to have, but keep in mind that applying it to smaller groups of users does not give a foolproof prediction of the correctness of the service functionality.

In the end we managed to migrate without any downtime, although some users did experience some inconsistencies. We managed to keep the impact to a minimum throughout the process and are now serving all order data from Postgres.
