Let’s review the options for data sharing in the cloud to understand what’s available and where things are heading. This matters for many organizations facing data movement costs, slow access to data, or the risks of trying new and promising approaches.
Multi-cloud comes into play with data sharing as well. As interoperability between technologies improves, many companies want to pick the best service or product for each problem, even when those services live on different clouds. For example, Cloud X is good for A, Cloud Y for B. The result is data distributed across different services, regions, or clouds. Even something as simple as logging data spread across clouds raises the same question: organizations still want to query and analyze it efficiently and at minimal cost.
There are challenges, trade-offs, and multiple approaches we can take with data sharing in the cloud.
Data Sharing By Copy (Static Data Pattern)
In this pattern, several services run in the same cloud, and compute and storage are not independent: each service owns its own data. To work with data from another service, we have to copy or move it first. There’s nothing wrong with this approach for simple architectures, as long as it meets your criteria.
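The copy-based pattern can be sketched in a few lines of plain Python. This is a conceptual illustration only, not tied to any real cloud SDK; `ServiceStore` and `share_by_copy` are hypothetical names. The point it demonstrates is that the shared copy is a duplicate, so it costs storage and immediately starts going stale.

```python
class ServiceStore:
    """Storage coupled to a single service (illustrative, not a real API)."""
    def __init__(self):
        self.rows = []

    def write(self, row):
        self.rows.append(row)


def share_by_copy(source: ServiceStore, target: ServiceStore):
    """Sharing means physically duplicating the data into the other store."""
    for row in source.rows:
        target.write(dict(row))  # a copy, so later updates won't propagate


analytics = ServiceStore()
warehouse = ServiceStore()
analytics.write({"event": "login", "user": "a"})

share_by_copy(analytics, warehouse)
analytics.write({"event": "logout", "user": "a"})

# The copy is already stale: the second event exists only at the source.
print(len(analytics.rows))   # 2
print(len(warehouse.rows))   # 1
```

For a one-off export this staleness may be perfectly acceptable; it becomes a problem when the copy must be kept current, because every refresh repeats the movement cost.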
No-Copy Data Sharing (Direct Access Pattern)
This approach is quickly gaining adoption. Again, several services run in the same cloud, but compute and storage are completely separate.
This brings better scalability, cost efficiency, and direct access to data. It is enabled by features such as Azure Synapse Link for Azure Cosmos DB, BigQuery external data sources, and Snowflake data sharing.
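The contrast with the copy pattern is easy to see in a sketch. Again this is purely conceptual, assuming hypothetical `SharedStorage` and `ComputeService` classes rather than any vendor API: compute services hold a reference to a shared storage layer, so no data is duplicated and every reader sees the same state.

```python
class SharedStorage:
    """A storage layer decoupled from any one compute service."""
    def __init__(self):
        self.rows = []


class ComputeService:
    """Compute that links to shared storage instead of owning a copy."""
    def __init__(self, storage: SharedStorage):
        self.storage = storage  # a reference, not a duplicate

    def count(self):
        return len(self.storage.rows)


lake = SharedStorage()
bi_tool = ComputeService(lake)
ml_job = ComputeService(lake)

lake.rows.append({"event": "login"})

# Both consumers see the write immediately, and zero copies were made.
print(bi_tool.count(), ml_job.count())  # 1 1
```

Services like BigQuery external data sources work on this principle: the query engine reads the data where it already lives instead of ingesting a copy first.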
Multi-Cloud Replication Data Sharing (Twinning Pattern)
The “twinning pattern” is common in third-party managed systems, or systems you manage that span multiple clouds or regions.
In this approach, the system replicates data between clouds, so consumers can access it from whichever cloud they are in.
This technique requires replication, so it carries extra cost and replication lag, even when it happens under the hood. Cross-region data sharing in Snowflake is an example of this approach in action.
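The replication trade-off can be sketched as follows. This is a toy model, not how any particular vendor implements replication; `Region`, `TwinnedSystem`, and the queue-based replicator are all hypothetical. It shows the two costs the text mentions: every row is shipped a second time, and the replica lags until replication catches up.

```python
from collections import deque


class Region:
    """One copy of the data, living in a particular cloud or region."""
    def __init__(self, name):
        self.name = name
        self.rows = []


class TwinnedSystem:
    """Writes land in the primary and are replicated asynchronously."""
    def __init__(self, primary: Region, replica: Region):
        self.primary = primary
        self.replica = replica
        self.pending = deque()  # replication backlog: the source of lag

    def write(self, row):
        self.primary.rows.append(row)
        self.pending.append(row)  # every row is shipped again: extra cost

    def replicate(self):
        while self.pending:
            self.replica.rows.append(self.pending.popleft())


system = TwinnedSystem(Region("cloud-x"), Region("cloud-y"))
system.write({"order": 1})

print(len(system.replica.rows))  # 0, the replica lags until replication runs
system.replicate()
print(len(system.replica.rows))  # 1, now both twins hold the data
```

Consumers near the replica get fast local reads, which is the payoff that justifies the duplication.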
Multi-Cloud No-Copy Data Sharing (Data Portal Pattern)
In this emerging approach, a service runs across multiple clouds, storage spans services on different clouds, compute and storage are separated, and data is accessed directly with no need to copy or move it.
This is enabled by technologies like Azure Arc and Google Anthos; examples include BigQuery Omni and Azure Arc-enabled data services.
How cloud data sharing impacts organizations
There are two themes that are true for enabling better data sharing options for an organization:
- Separation of storage and compute, with links to other data sources for direct access.
- Multi-cloud or hybrid platforms that bring compute to data located anywhere (Anthos, Arc).
For cloud providers, this has consequences: today, companies may pass over a cloud service simply because they have already committed to a different cloud and store their data there.
If companies can store data anywhere and still access and share it easily, they will adjust their decision-making accordingly. Developer experience, programmability, interoperability, integrations, open standards: I think these factors will become more prominent when choosing cloud services.
Which cloud data sharing pattern is right for me?
We reviewed four different and common approaches to data sharing, within a cloud or across multi-cloud.
Simple and common copy-based approaches are still a fine option for use cases where they solve the problem at hand, provided the company’s data processing cost and latency requirements are satisfied.
Approaches with direct data linking, enabled by the separation of storage and compute, significantly reduce data movement costs. They also help eliminate siloed teams and monster data lakes, enabling better data sharing for domain-organized data-product teams, in line with the data mesh concept.
The emerging multi-cloud technologies still need to develop and mature, but they are promising: they let organizations choose the right tool for the job and stay focused on business problems without incurring data transfer costs.
About the author
Lena Hall is a Director of Engineering at Microsoft working on Azure, where she focuses on large-scale distributed systems and modern architectures. She leads a team and the technical strategy for product improvement efforts across Big Data services at Microsoft, and drives engineering initiatives to advance and accelerate cloud services. Lena has 10 years of experience in solution architecture and software engineering with a focus on distributed cloud programming, real-time system design, highly scalable and performant systems, big data analysis, data science, functional programming, and machine learning. Previously, she was a Senior Software Engineer at Microsoft Research. She co-organizes a conference called ML4ALL, and is often an invited member of program committees for conferences like Kafka Summit, Lambda World, and others. Lena holds a master’s degree in computer science. Twitter: @lenadroid. LinkedIn: Lena Hall.