Running analytics on real-time data is a challenge many data engineers face today. But not all analytics can be done in real time! Whether they can depends on the volume of the data and the processing requirements, and even the business logic itself can become a bottleneck. For example, think about a join between two huge tables with more than 100 million rows each. The join is possible, but it might not complete in real time; we might call it … near real-time.
Real-time: Information must be processed immediately (such as a credit card transaction in e-commerce).
Near real-time: You don’t need the data immediately (such as machine-log dashboards); you can tolerate a delay of 2-15 minutes.
What is Real-Time Analytics?
Real-time analytics lets users see, analyze, and understand data as it arrives in a system. Data is ingested into the system in real time or near real time. Operations, logic, and mathematics are then applied to the data to give users insights for making real-time decisions on fresh data.
To enable real-time analytics, all the components in the system should operate in real time. Data ingested into the system should be processed in a real-time manner, either event by event or using sliding windows with a micro-batch stream-processing approach.
Micro-batch stream processing is often associated with Apache Spark Structured Streaming, which enables us to process data in micro-batches, bringing latency much closer to real time than classic batch processing. Read more about Spark streaming here.
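To make the idea concrete, here is a minimal plain-Python sketch of micro-batch processing with a sliding window — not Spark itself, just the underlying pattern. The event data, batch interval, and window size are all hypothetical; in a real pipeline the events would arrive continuously from a streaming source.

```python
from collections import deque

# Hypothetical events: (timestamp_seconds, value). In a real pipeline these
# would arrive continuously from a source such as Kafka.
events = [(0, 10), (1, 20), (2, 30), (4, 40), (5, 50), (7, 60)]

BATCH_INTERVAL = 2   # seconds of data per micro-batch
WINDOW_SIZE = 4      # sliding window covers the last 4 seconds

def micro_batches(events, interval):
    """Group time-ordered events into micro-batches by arrival time."""
    batch, batch_end = [], interval
    for ts, value in events:
        while ts >= batch_end:          # close the current batch
            yield batch_end, batch
            batch, batch_end = [], batch_end + interval
        batch.append((ts, value))
    yield batch_end, batch              # flush the final batch

window = deque()                        # events currently inside the window
results = []
for batch_end, batch in micro_batches(events, BATCH_INTERVAL):
    window.extend(batch)
    # Evict events that have slid out of the window.
    while window and window[0][0] <= batch_end - WINDOW_SIZE:
        window.popleft()
    # Per-batch aggregate over the sliding window (here: a running sum).
    results.append((batch_end, sum(v for _, v in window)))

print(results)
```

Each micro-batch triggers one aggregation over the current window, which is exactly the trade-off micro-batching makes: results update every batch interval rather than on every single event.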
Some of today’s real-time data analytics platforms’ most significant challenges are the various data streams/pipelines that need to be supported, together with multiple types of sources. With batch processing, we have the privilege of gathering all the data offline and extracting, transforming, and loading it into a central data warehouse — usually using a global schema.
Do I need a Global Schema?
Defining a global schema is a challenge in its own right. We often end up with a global schema where some columns are well defined, while other columns have shady/vague names like colX, colY, other … Many times, the ‘other’ column holds data whose meaning depends on its origin.
To understand the traditional warehouse’s global-schema challenge, think about how you would design one schema to represent two social media channels — Twitter and TikTok. You might end up with endless columns where half of the rows are null or hold default values that won’t necessarily be used.
Do you remember the classic database normalization technique? It helps us remove redundant data and better structure our schemas. 1NF, 2NF, 3NF, and beyond become extremely challenging as the variety, volume, and velocity of data grow, together with the complexity and needs of data across organizations’ systems.
This is why, with a modern data warehouse, there is no need for a global schema; it allows us more flexibility by managing access to structured and unstructured data from various databases, and lets us connect different database tables to one dashboard if needed.
How do I connect multiple data sources?
Another term that you should know is EAI — Enterprise Application Integration. When we think about data, this is usually enabled through a shared service-bus architecture that transfers messages from one part of the system to another.
Most of the time, we collect data from various sources. Modern data warehouses enable us to stream and process data from these sources directly in their computation pools. For example, Apache Spark pools available from a modern data warehouse platform can significantly help us process and join data from multiple streaming pub/sub tools like Kafka or messaging systems like RabbitMQ.
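As a sketch of what “joining data from multiple sources” means at its core, here is a minimal plain-Python example that inner-joins messages from two simulated streams on a shared key. The message shapes and field names are hypothetical; in practice the messages would come from Kafka consumer and RabbitMQ client libraries, and the join would run inside a Spark pool.

```python
# Hypothetical messages from two sources, represented as plain dicts.
kafka_clicks = [
    {"user_id": 1, "page": "/home"},
    {"user_id": 2, "page": "/pricing"},
]
rabbit_profiles = [
    {"user_id": 1, "country": "US"},
    {"user_id": 2, "country": "DE"},
    {"user_id": 3, "country": "FR"},
]

def join_streams(left, right, key):
    """Inner-join two message streams on a shared key field."""
    index = {msg[key]: msg for msg in right}   # build a lookup from one stream
    for msg in left:
        match = index.get(msg[key])
        if match is not None:                  # emit only matching pairs
            yield {**msg, **match}

joined = list(join_streams(kafka_clicks, rabbit_profiles, "user_id"))
print(joined)
```

Each click event is enriched with the matching profile from the other stream — the same hash-join idea that a distributed engine applies at scale, with partitioning and shuffling layered on top.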
When should I use Serverless SQL?
Let’s think for a second: who uses the collected data? We have business analysts who would like to run real-time queries on their data, and data scientists and ML researchers who would like to explore the data before applying an ML algorithm to solve a problem. Data engineers continuously take care of the data; then there are salespeople, customer support, and many more personas. Not all of them need a dedicated distributed computation system that is on all the time. Many of them would prefer a tool they can use on demand, for their exact need, when they need it.
This is where scalable serverless SQL can help cut costs. For many data scientists and business analysts, being able to run a query on demand, get fast results over big data, and continue with their work is crucial; when that isn’t possible, it becomes a real bottleneck and a source of frustration for the teams.
Do I need real-time MapReduce or event-driven microservices?
Lastly, I would like to talk about a subject that confuses many. For real-time MapReduce-style processing, you may be familiar with Storm, Spark Streaming, and Flink — open-source solutions that enable us to run analytics on real-time data at scale. Event-driven microservices, by contrast, are small independent services that each run stream processing individually on their own machine, meaning state is usually not shared between machines; hence they are less suitable for analytics workloads. The two can be combined in a shared architecture to support a product, but that usually means introducing a messaging bus through which the data from the various machines is saved into a shared, most often distributed, database.
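The state-sharing distinction above can be sketched in a few lines of plain Python. The event names and the routing of events to two service instances are hypothetical; a plain `Counter` stands in for each instance’s local state, and a merged `Counter` stands in for the shared distributed database.

```python
from collections import Counter

# Hypothetical event stream partitioned across two independent service
# instances (e.g. by a load balancer or a partitioned message bus).
events = ["login", "click", "login", "purchase", "click", "login"]
instance_a = events[0::2]   # events routed to instance A
instance_b = events[1::2]   # events routed to instance B

# Each microservice instance keeps its OWN state; nothing is shared.
state_a = Counter(instance_a)
state_b = Counter(instance_b)

# Neither instance alone can answer a global analytics question:
print("per-instance logins:", state_a["login"], state_b["login"])

# For analytics, each instance must publish its state to a shared store
# (here a plain Counter standing in for a distributed database).
shared_store = Counter()
for local_state in (state_a, state_b):
    shared_store.update(local_state)

print("global logins:", shared_store["login"])
```

Each instance sees only its own partial counts; the global answer exists only after every instance writes its state into the shared store — which is exactly why combining the two styles requires that extra messaging and storage layer.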
Curious to learn more?
Join us for a 3-hour online FREE event on Dec 7, 8 am PST | 4 pm GMT | 11 am ET, to learn about data analytics, Apache Spark, how to pick the right distributed database, Delta Lake, what serverless SQL is and how to get started with it, and how you can use data to do good — from notable community speakers like Holden Karau, Tim Berglund, Jacek Laskowski, Anna Hoffman, and more.
The event is free but subject to pre-registration here.