What's a data migration?
[This is a guest post from Elliot Gunn, a talented technical writer at Datafold. Check out his medieval-themed guide to data quality.]
The TL;DR
Data migration is about transferring information (data) from one place to another. It’s like relocating everything you own when you move to a new place, but digitally.
Let’s break it down:
Data can include documents, pictures, videos, databases, and more. It’s anything that a company stores, whether on-premises or in the cloud
Migration is the process of transferring the data from one system to another, or between multiple systems
Data migrations generally refer to database migrations–like how the British government’s notifications system recently migrated its PostgreSQL database into its own Amazon Web Services account. On the surface, it’s pretty straightforward: you move your data from one database to another. So why am I writing about this, you ask?
The answer is that data migrations are typically very expensive and lengthy projects. They’re something that developers dread, and will do everything in their power to avoid. While most of us probably won’t ever have to contend with a full data migration, it’s likely that you’ve already done a much scaled-down version. If you’ve ever had to switch from Windows to macOS or vice versa, whether because of work requirements or changing personal preferences, you’ve done data migration-lite.
Both systems do similar things: let you check email, browse the web, and store photos. But there are slight variations too, and some of them are tricky enough to add friction to the process. For example, some software that works on Windows may not have a direct equivalent on a Mac, and you’ll need to find and install alternative programs.
Why do data migrations matter to me?
Even if you’ve never been involved in a data migration, you’ve probably been impacted by one, and not in a good way.
Because data migration involves moving and transforming data between non-identical systems, it takes a lot of care and expertise to get right. It’s very tough to do one without creating errors and inconsistencies along the way that could impact employees and customers.
MySpace famously lost 12 years of data during a server migration gone wrong–that’s 12 years of users’ photos, videos, and music gone forever.
Data migrations can have cascading effects when things go wrong:
If you work in analytics, you need accurate data to keep your dashboards up-to-date. Any issues with data integrity can hurt the team’s credibility
Because data migrations can take a while to complete, your usual data sources may become temporarily unavailable, or you might lose time looking for new data storage locations
Security teams often worry about whether sensitive data is migrated securely: any breaches can hurt customers
The worst case scenario: something goes wrong with a migration, and your app goes down. Or worse, your app goes down and you lose data.
Context: why you would need to do a migration in the first place
Alright, you get the idea – migrations are difficult and costly (we’ll walk through the steps later). So why do companies pour all that time and effort into them in the first place?
Upgrading to better tech
As a previous Technically piece on databases outlined, there are hundreds of databases and they’re all slightly different when it comes to storage formats, data models, query languages, scalability, indexing, and more.
Databases are constantly evolving, and new databases pop up every day (see: DuckDB, CockroachDB). Many legacy databases just don’t cut it anymore for what companies need them for, whether it’s to improve performance, scalability, or functionality. Others want to switch to more user-friendly databases like Snowflake to enable self-serve analytics for a broader, less technical group of employees.
Growing needs
Migrations are often prompted when companies hit different stages of growth and need different technologies to reach their next target.
When startups grow and mature, they collect and produce higher volumes of data, and eventually their databases hit a limit in performance and capacity. This milestone means companies start to look elsewhere to find a solution that can support their growth.
Many early stage companies move from on-premises to the cloud to support their explosive growth. Netflix’s 7-year migration to AWS was motivated by the ability to spin up new machines quickly to support their global expansion without having to first purchase and set up hardware. This also allowed them to provision new resources on-demand to run experiments on their key recommendation algorithms.
Others make the choice to move away from the cloud to on-premises to save money. 37signals made ripples in the tech community last year explaining their exit from AWS and GCP after a public cost-benefit review found that they could save $7 million over five years.
Security
Many companies are compelled to migrate data because of new security regulations or to meet industry standards. For example, companies trying to attain SOC 2 compliance might need to migrate their data to more secure and compliant platforms or services, or platforms that allow better audit trail capabilities.
Remember: the larger your company is and the more data you have, the more painful a migration is going to be. So if you’re eventually going to have to move to a new database, it’s usually better to get it done sooner rather than later.
How do you do a migration?
Imagine you’re on the engineering team at Technically, and you’ve started to realize that your basic self-hosted PostgreSQL database isn’t going to cut it for the next few years of subscriber growth. Instead, you want to move to a managed, serverless provider like PlanetScale. How do you get started?
Have a plan
Planning ahead is key for migrations to succeed. Companies identify what data assets (e.g. dashboards, tables, metrics) they have, and decide when and how to migrate each one over based on their importance to business operations. There are also many migration strategies to choose from: parallel migration (running old and new systems simultaneously), phased migration (moving in stages), and lift-and-shift (moving everything without changing any pre-existing code in the process).
Duplicate data and redirect traffic
The easiest way to start a migration is by copying existing data and code from the old to the new system. Once the new system is stable, users can shift to querying data from the new system instead. As the migration often proceeds in stages, this helps to minimize disruptions to existing workflows. One way to think of this: when you’re moving, you don’t want to land in a completely empty house.
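To make that concrete, here’s a minimal sketch of the copy step in Python. Everything here is hypothetical: the connection strings, the subscribers table, and the batch size are made up, and for simplicity it pretends both databases speak the PostgreSQL wire protocol (a real move to a MySQL-compatible provider like PlanetScale would need a different driver).

```python
# A toy sketch of the "duplicate" half of duplicate-and-redirect.
# Connection strings and the table name are hypothetical.
import psycopg2  # PostgreSQL driver; assumes both sides speak Postgres

OLD_DB = "postgresql://app:secret@old-server:5432/technically"    # hypothetical
NEW_DB = "postgresql://app:secret@new-provider:5432/technically"  # hypothetical

def copy_table(table: str, batch_size: int = 10_000) -> None:
    """Copy all rows from the old database to the new one, in batches."""
    with psycopg2.connect(OLD_DB) as old, psycopg2.connect(NEW_DB) as new:
        reader, writer = old.cursor(), new.cursor()
        reader.execute(f"SELECT * FROM {table} ORDER BY id")
        while batch := reader.fetchmany(batch_size):
            placeholders = ", ".join(["%s"] * len(batch[0]))
            writer.executemany(
                f"INSERT INTO {table} VALUES ({placeholders})", batch
            )

if __name__ == "__main__":
    copy_table("subscribers")
    # After validating the copy (next step), the app's config is flipped
    # so queries hit NEW_DB instead of OLD_DB: that's the "redirect".
```

In practice you’d also have to handle writes that land on the old system while the copy is running, which is exactly why teams often run both systems in parallel for a while before cutting over.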
Validate data
As many things can go wrong in the migration process, it’s important to make sure that the data is transferred over correctly and completely. This is called validating data integrity, and it can be done through automated tests that look for discrepancies. It’s a lot more complicated than just “copying and pasting” your data, especially at large volumes.
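In their simplest form, those automated tests are identical queries run against both systems and compared. Here’s a minimal sketch, using the same hypothetical tables as above (real validation suites are far more thorough):

```python
# A sketch of post-migration sanity checks: run identical aggregate
# queries against both systems and flag any mismatches.
# The table and column names are hypothetical.
CHECKS = {
    "row_count":     "SELECT COUNT(*) FROM subscribers",
    "max_id":        "SELECT MAX(id) FROM subscribers",
    "revenue_total": "SELECT SUM(amount_cents) FROM payments",
}

def run_checks(old_cursor, new_cursor) -> list[str]:
    """Return human-readable descriptions of failed checks (empty = pass)."""
    failures = []
    for name, query in CHECKS.items():
        old_cursor.execute(query)
        new_cursor.execute(query)
        old_value = old_cursor.fetchone()[0]
        new_value = new_cursor.fetchone()[0]
        if old_value != new_value:
            failures.append(f"{name}: old={old_value}, new={new_value}")
    return failures
```

Checks like these catch missing rows quickly, but they can’t tell you which individual values got mangled along the way; that’s what the row-level diffing we’ll get to later is for.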
Why are migrations so difficult?
Companies often architect migration plans that span years. For example, the Nasdaq announced plans to migrate their data to Amazon Web Services back in 2021. Two years later, they posted an update on part two. They estimated that the entire migration to AWS would likely take 10 years. 10 years!
Why is the process so challenging? Imagine that you’re a data engineer at an ecommerce store serving millions of customers worldwide, and you’ve been hired to help move their data infrastructure off a legacy database like Oracle and onto Snowflake. Here are some challenges you’ll likely encounter along the way:
Disrupting everyday work
One key metric you’ll be measured by is downtime–how long a system is not available for use. Migrations are very hard to achieve with zero downtime. When Notion was ready to complete their database transformation (dividing it into smaller “shards” for better performance), it meant taking their application offline for 5 minutes for the server to switch over. That’s 5 minutes where nobody could use the product, and that’s considered fast for this type of thing.
Your company’s award-winning customer support staff rely on real-time dashboards to track where purchases are. When a data source is taken offline for longer than anticipated, whether accidentally or intentionally during a migration, it throws these dashboards into chaos. The team can’t help customers troubleshoot delivery issues, leading to delays in resolving tickets and negatively impacting customer loyalty.
Companies may also call for a freeze on new analytics work until data validation (making sure that the old and new datasets match up) is done. Although companies try to minimize this downtime in their planning, this phase can still take weeks or months to complete.
Migration is a team effort
While migration is a technical challenge, it also requires extensive collaboration. Data migrations involve intensive planning across levels and departments of a company. You might spend just as much time in meetings as coding: conducting IT infrastructure audits with different departments, figuring out contingency plans, and jumping on calls to negotiate contracts with external migration specialists.
For larger companies, the planning process alone can take years (seriously) as companies need to first look at their existing data infrastructure, choose between different migration strategies, allocate resources (or more commonly, hire vendors), and have a process in place to reduce disruptions to the business.
SQL comes in dialects
You’ve worked with Oracle databases for your entire career and know how to optimize all its configurations. But Snowflake is new to you, and after some research, you learn that the two SQL dialects are different enough that you’ll need to come up with some way to translate between them.
Databases have their own SQL dialects. They’re generally pretty similar, with a lot of overlap, but sometimes that final five percent difference is enough to cause complications in a data migration. Someone has to translate and transform SQL queries, which can introduce errors into the process.
Automated SQL translators have emerged to make the process less manual, but they’re still not perfect solutions: they miss the context of the query and can produce incorrect translations. Also, a query optimized for one database will often need to be re-tuned to perform well in another. Translating SQL still needs human effort, which adds time.
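As a concrete example, here’s what one of those open source translators, the Python library sqlglot, looks like in use. This is just a sketch: the query is made up, and a real migration would run thousands of these through human review.

```python
# A sketch of automated SQL translation with sqlglot (pip install sqlglot),
# an open source SQL transpiler. The query below is made up.
import sqlglot

# Oracle-flavored SQL: functions like NVL() and SYSDATE aren't universal.
oracle_query = "SELECT id, NVL(discount, 0) AS discount, SYSDATE FROM orders"

# Ask sqlglot to rewrite the query in Snowflake's dialect.
(snowflake_query,) = sqlglot.transpile(
    oracle_query, read="oracle", write="snowflake"
)
print(snowflake_query)  # the Snowflake-compatible rewrite
```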
Is the data even the same?
Your team pulls an all-nighter one quiet Sunday evening to actually run your migration script, and everything miraculously moves from Oracle to Snowflake without any errors and only a few minutes of downtime. But you’re not done yet. You get a message from the machine learning team the next day: “Hey, our models are producing unexpected results. Is the data in Snowflake actually right?”
Data migration experts grapple with a puzzle similar to the Ship of Theseus paradox: How do you know your data remains exactly the same amidst all the transformations?
Trying to answer this question is challenging, and many tools and companies have emerged to address this and other migration problems.
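Conceptually, answering it boils down to comparing a fingerprint of every row on both sides. Here’s a deliberately simplified sketch (hypothetical table, and real diffing tools are much cleverer about avoiding full-table scans):

```python
# A simplified sketch of row-level diffing: hash every row in both
# databases and report the primary keys that are missing or mismatched.
# Real tools checksum whole ranges of rows and only drill down into a
# range when its checksum differs, so they scale to billions of rows.
import hashlib

def row_hashes(cursor, table: str) -> dict:
    """Map each row's primary key to a hash of the rest of the row."""
    cursor.execute(f"SELECT * FROM {table} ORDER BY id")
    return {
        row[0]: hashlib.md5(repr(row[1:]).encode()).hexdigest()
        for row in cursor.fetchall()
    }

def diff_table(old_cursor, new_cursor, table: str) -> list:
    """Return primary keys whose rows differ between old and new."""
    old = row_hashes(old_cursor, table)
    new = row_hashes(new_cursor, table)
    return sorted(
        key for key in old.keys() | new.keys()
        if old.get(key) != new.get(key)
    )
```

Even this has a Ship of Theseus wrinkle: two databases can store the “same” value differently (timestamp precision, trailing zeros on decimals), so a naive hash will flag rows that are semantically identical. Handling those cases well is a big part of what the tools below do.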
Some tools for making it easier
Data migrations are a $10+ billion industry, and migration is a really hard problem to completely automate. Most successful migrations are the result of data professionals with deep technical expertise working together to manage the process and solve for edge cases.
In recent years, better tools have emerged to automate the most critical parts of the process and make them less painful:
Amazon’s Database Migration Service
Amazon’s DMS is a cloud service that helps migrate a variety of data systems. It’s a convenient option for those looking to do migrations within Amazon’s infrastructure, as in the case of the UK government, which used DMS to migrate its PostgreSQL databases.
Datafold
The most challenging data migrations happen between two different kinds of databases. Datafold’s cross-database diffing helps validate whether you’re getting the same data (called data parity) during a migration, and their row-level diffs (comparisons at each data value) work fast even at billions of rows. Datafold provides full-cycle migration automation, combining SQL code translation and cross-database validation to simplify and speed up the process. The UI makes it easy to eyeball where discrepancies are coming from, and it’s free to trial and test out on your own databases.
Open source options
There are a bunch of open source tools out there for data migrations. GitHub has gh-ost, Percona has pt-online-schema-change, Facebook/Meta has OSC, and there’s also Vitess, which was originally developed at YouTube (migrations are a small side feature of it). They’re all slightly different, but the general gist is that they help you copy data from one place to another.