This post is part 1 of a two-part series about how organizations use Azure Cosmos DB to meet real world needs, and the difference it’s making to them. In part 1, we explore the challenges that led the Microsoft Office Licensing Service team to move from Azure Table storage to Azure Cosmos DB, and how it migrated its production workload to the new service. In part 2, we examine the outcomes resulting from the team’s efforts.
The challenge: Limited throughput and other capabilities
At Microsoft, the Office Licensing Service (OLS) supports activation of the Microsoft Office client on millions of devices around the world, including Windows, Mac, tablet, and mobile devices. It stores information such as machine ID, product ID, activation count, and expiration date. The Office client accesses OLS more than 240 million times per day, with the first call coming upon license activation and a call every 2-3 days thereafter as the client checks that the license is still valid.
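For a concrete picture, a record of this shape maps naturally onto a Table entity. The field names and values below are purely illustrative, not the actual OLS schema:

```python
# Hypothetical OLS activation record modeled as a Table entity.
# The PartitionKey/RowKey choices here are illustrative only.
license_entity = {
    "PartitionKey": "machine-7f3a9c",         # e.g., derived from the machine ID
    "RowKey": "O365ProPlus",                  # e.g., the licensed product ID
    "ActivationCount": 3,
    "ExpirationDate": "2020-06-30T00:00:00Z",
    "LicenseCategory": "enterprise",          # consumer, enterprise, OEM, ...
}
```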
Until recently, OLS relied on Azure Table storage for its backend data store, which contained about 5 TB of data spread across 18 tables—with separate tables used for different license categories such as consumer, enterprise, and OEM pre-installation.
In early 2018, after years of continued workload growth, the OLS service began approaching the point where it would require more throughput than Table storage could deliver. If the issue wasn’t addressed, the inherent throughput limit of Table storage would begin to threaten overall service quality to the detriment of millions of users worldwide.
Danny Cheng, a software engineer at Microsoft who leads the OLS development team, explains:
“Each Table storage account has a fixed maximum throughput and doesn’t scale past that. By 2018, OLS was running low on available storage throughput, and, given that we were already maintaining each table in its own Table storage account, there was no way for us to get more throughput to serve more requests from our customers. We were being throttled during peak usage hours for the OLS service, so we had to find a more scalable storage backend soon.”
In looking for a long-term solution to its storage needs, the OLS team wanted more than just additional throughput. “We wanted the ability to deploy OLS in different regions around the world, as a means of minimizing latency by putting copies of the service closer to where our users are,” Cheng continues. “But with Table storage, geo-replication capabilities are fairly limited.”
The OLS team also wanted better disaster recovery. With Table storage, all data was stored in multiple regions within the United States, but all reads and writes went to the primary region, and there were no SLAs for replication to the two backup regions, which could lag by as much as 60 minutes. If the primary region became unavailable, human intervention would be required and data loss would be likely.
“If a region were to go down, it would be a real panic situation—with 30 to 60 minutes of downtime and a similar window for data loss,” says Cheng.
The solution: A lift-and-shift migration to Azure Cosmos DB
The OLS team chose to move to Azure Cosmos DB, which offered a lift-and-shift migration path from Table storage—making it easy to swap in a premium backend service with turnkey global distribution, low latency, virtually unlimited scalability, guaranteed high availability, and more.
“At first, when we realized we needed a new storage backend, it was intimidating in that we didn’t know how much new code would be needed,” says Cheng. “We looked at several storage options on Azure, and Azure Cosmos DB was the only one that met all our needs. And with its Table API, we wouldn’t even need to write much new code. In many ways, it was an ideal lift-and-shift—delivering the scalability we needed and lots of other benefits with little work.”
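Because the Azure Cosmos DB Table API is compatible with the Table storage SDKs and data model, existing table code can typically be repointed by swapping the connection string. Below is a minimal sketch using the azure-data-tables Python SDK; the account name, key, and table name are placeholders, and the OLS code base itself isn't necessarily Python:

```python
from azure.data.tables import TableClient

# Previously, the client pointed at an Azure Table storage account:
# "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>;"
# "EndpointSuffix=core.windows.net"

# The same code targets the Azure Cosmos DB Table API simply by
# swapping in a Cosmos DB Table API connection string.
connection_string = (
    "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>;"
    "TableEndpoint=https://<account>.table.cosmos.azure.com:443/;"
)
table_client = TableClient.from_connection_string(
    connection_string, table_name="ConsumerLicenses"  # illustrative name
)

# Entity reads and writes are unchanged from the Table storage code path.
entity = table_client.get_entity(
    partition_key="machine-7f3a9c", row_key="O365ProPlus"
)
```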
Design decisions
In preparing to deploy Azure Cosmos DB, the OLS team had to make a few basic design decisions, both of which are illustrated in a code sketch after the list below:
- Consistency level, which gave the team options for addressing the fundamental tradeoffs between read consistency and latency, availability, and throughput.
“We picked strong consistency because some of our business logic requires reading from storage immediately after writing to it,” explains Cheng.
- Partition key, which dictates how items within an Azure Cosmos DB container are divided into logical partitions—and determines the ultimate scalability of the data store.
“With the Azure Cosmos DB Table API, partition keys naturally map to what we had in Table storage—so we were able to reuse the same partition key,” says Cheng.
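Neither decision requires much code, but both are easy to illustrate. The sketch below, again using the azure-data-tables Python SDK with placeholder names, shows the write-then-read pattern that motivates strong consistency, plus a point read and single-partition query built on the reused partition key:

```python
from azure.data.tables import TableClient, UpdateMode

table_client = TableClient.from_connection_string(
    "<cosmos-db-table-api-connection-string>",  # placeholder
    table_name="ConsumerLicenses",              # illustrative name
)

# Write-then-read: the business logic reads an entity back immediately
# after updating it. With strong consistency the read is guaranteed to
# observe the write; under weaker levels (e.g., eventual), a read served
# by a lagging replica could still return the old ActivationCount.
entity = table_client.get_entity("machine-7f3a9c", "O365ProPlus")
entity["ActivationCount"] += 1
table_client.update_entity(entity, mode=UpdateMode.REPLACE)
latest = table_client.get_entity("machine-7f3a9c", "O365ProPlus")
assert latest["ActivationCount"] == entity["ActivationCount"]

# Partition key reuse: because the Table API keeps the PartitionKey/RowKey
# model, the original key design carries over, and a filter on PartitionKey
# still touches only one logical partition.
for item in table_client.query_entities(
    query_filter="PartitionKey eq @pk", parameters={"pk": "machine-7f3a9c"}
):
    print(item["RowKey"], item.get("ExpirationDate"))
```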
Migration process
Although Azure Cosmos DB offered a data migration tool, its use at that time would have entailed some downtime for the OLS service, which wasn’t an option. (Note: Today you can do live migrations without downtime.) To address this, the OLS team built a data migration solution that consisted of three components:
- A Data Migrator that moves current data from Table storage to Azure Cosmos DB.
- A Dual Writer that writes new database changes to both Table storage and Azure Cosmos DB.
- A Consistency Checker that catches any mismatches between Table storage and Azure Cosmos DB.
The Data Migrator component is based on the same data migration tool that the Azure Cosmos DB team provides to Microsoft customers.
“To solve the downtime problem, we added Dual Writer and Consistency Checker components, which run on the same production servers as the OLS service itself,” explains Cheng.
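The post doesn't include the team's implementation, but the dual-write-and-verify pattern Cheng describes might look roughly like the following sketch (Python again, with placeholder connection strings and table names, and with retry plumbing omitted):

```python
from azure.data.tables import TableClient, UpdateMode

# Both clients target the same logical table in the two backends.
table_storage = TableClient.from_connection_string(
    "<table-storage-connection-string>", table_name="ConsumerLicenses")
cosmos_table = TableClient.from_connection_string(
    "<cosmos-db-table-api-connection-string>", table_name="ConsumerLicenses")

def dual_write(entity: dict) -> None:
    """Apply every change to both stores. Table storage stays the source
    of truth during migration, so a Cosmos DB failure is queued for retry
    rather than failing the request (retry plumbing omitted)."""
    table_storage.upsert_entity(entity, mode=UpdateMode.REPLACE)
    try:
        cosmos_table.upsert_entity(entity, mode=UpdateMode.REPLACE)
    except Exception as exc:
        print(f"Cosmos DB write failed; queue for retry: {exc}")

def entities_match(partition_key: str, row_key: str) -> bool:
    """Consistency check: compare the same entity in both stores,
    ignoring any service-specific metadata keys."""
    system = {"Timestamp", "etag", "odata.etag"}
    a = table_storage.get_entity(partition_key, row_key)
    b = cosmos_table.get_entity(partition_key, row_key)
    return ({k: v for k, v in a.items() if k not in system}
            == {k: v for k, v in b.items() if k not in system})
```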
The OLS team completed the migration process in late 2019. Today, Azure Cosmos DB is deployed to the same three regions that Table storage used, a choice the team made to mimic the Table storage topology as closely as possible during the migration. As before, North Central US is the primary (read/write) region while the other two regions are currently read-only. The Azure Cosmos DB environment has 18 tables containing 5 TB of data and consumes about 1 million request units per second (RU/s), the unit used to reserve guaranteed database throughput in Azure Cosmos DB.
Now that the migration is complete, the team plans to turn on multi-master capabilities, which will write-enable all regions instead of just the primary one. Building on this, the team also plans to scale out globally by replicating its backend store to additional regions around the world, putting copies of the OLS data closer to users to reduce latency for the Office client.
In part 2 of this series, we examine the outcomes resulting from the team’s efforts to build its new Office Licensing Service on Azure Cosmos DB.