Architecting a traditional relational data solution is highly prescriptive. Analyzing data or creating insights in this paradigm requires a lot of supporting systems and code. These systems typically have to:
- Connect to a source system to extract data
- Apply business rules or logic through transformations/ETL
- Load data into a data warehouse optimized for storage
- Load/aggregate data within a data mart optimized for reporting
- Build a report or structure to analyze the data
This paradigm has several drawbacks:
- Batch processing creates stale data - the average age of data is measured in days (or sometimes months)
- Batch processing can irreversibly transform source data - data consumers don't know where data originated or what business logic was applied
- Raw source data is not persisted - limits the ability to reprocess data if business logic changes or additional rules are developed
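The batch pipeline above can be sketched as a single nightly job. This is a minimal in-memory illustration, not a real implementation; the source rows, business rule, and mart aggregation are all hypothetical examples:

```python
# Minimal sketch of a nightly batch ETL job; the data and rules
# are hypothetical illustrations of the five steps above.

def extract(source_rows):
    """Step 1: pull raw rows from a source system."""
    return list(source_rows)

def transform(rows):
    """Step 2: apply a business rule (here: keep completed orders only).
    Note the raw rows are discarded after this step - a key drawback."""
    return [r for r in rows if r["status"] == "completed"]

def load_warehouse(rows):
    """Step 3: load into a warehouse table keyed for storage."""
    return {r["order_id"]: r for r in rows}

def load_mart(warehouse):
    """Step 4: aggregate into a data mart optimized for reporting."""
    mart = {}
    for row in warehouse.values():
        mart[row["region"]] = mart.get(row["region"], 0) + row["amount"]
    return mart

source = [
    {"order_id": 1, "region": "east", "amount": 100, "status": "completed"},
    {"order_id": 2, "region": "east", "amount": 50, "status": "cancelled"},
    {"order_id": 3, "region": "west", "amount": 75, "status": "completed"},
]
# Step 5: the report reflects data only as of the last batch run.
report = load_mart(load_warehouse(transform(extract(source))))
```

Note how the cancelled order vanishes after `transform`: a downstream consumer of `report` cannot tell where the numbers came from or what logic was applied, which is exactly the lineage problem described above.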
## Streaming Data is Faster Data

Time is money. Capturing data and making it available quickly within an organization will be a differentiator for companies in the modern data architecture. Imagine a customer interacting with a bank's website who runs into an issue applying for a mortgage and calls the customer service line for help. What if the representative could see exactly what page the customer is on, what he or she was trying to do, and the specific error being displayed at the moment of the call? That visibility would fundamentally change the way service reps coach customers toward becoming more self-sufficient.
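For example, the website could emit a clickstream event the moment the error appears, so the service-rep console sees it in near real time. In this sketch an in-memory queue stands in for a real streaming platform, and all event field names are hypothetical:

```python
import json
import queue
import time

# An in-memory queue stands in for a streaming platform topic;
# the event field names here are hypothetical.
clickstream = queue.Queue()

def emit_page_event(customer_id, page, action, error=None):
    """Publish what the customer is doing the moment it happens."""
    event = {
        "customer_id": customer_id,
        "page": page,
        "action": action,
        "error": error,
        "ts": time.time(),
    }
    clickstream.put(json.dumps(event))
    return event

# The website emits the event as the error is shown...
emit_page_event("cust-42", "/mortgage/apply", "submit",
                error="INCOME_DOC_MISSING")

# ...and the service-rep console consumes it when the call comes in.
latest = json.loads(clickstream.get_nowait())
```

The point is the data path, not the technology: the event is available to the representative seconds after it happened, instead of after tonight's batch load.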
One common misconception about streaming is that all data needs to be delivered in near real time. That is possible, but the required latency varies by use case, and lower latency comes at additional cost. The main points to consider with streaming data are:
- How much latency is acceptable on new data?
- What volume of data are you working with?
- Can the records be processed individually?
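These three questions often come down to whether you handle each record individually as it arrives or buffer records into micro-batches, trading latency for throughput. A minimal sketch of the two styles (the batch size is an arbitrary illustration):

```python
# Sketch of the per-record vs. micro-batch trade-off:
# smaller batches mean lower latency but more per-record overhead.

def process_per_record(records, handler):
    """Lowest latency: handle each record individually as it arrives."""
    return [handler(r) for r in records]

def process_micro_batches(records, handler, batch_size=3):
    """Higher throughput: buffer records and handle them in small groups,
    at the cost of each record waiting for its batch to fill."""
    batches = []
    for i in range(0, len(records), batch_size):
        batches.append([handler(r) for r in records[i:i + batch_size]])
    return batches

records = list(range(7))
individual = process_per_record(records, lambda r: r * 2)
batched = process_micro_batches(records, lambda r: r * 2)
```

If records can only be interpreted together (the third question above), per-record processing may not be an option at all, regardless of the latency budget.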
## Streaming Data is More Available

Another common practice in a streaming paradigm is to create a streaming hub, which eliminates the point-to-point connections commonly found in an ETL architecture. "Data democratization" is a term I hear many clients discussing; in short, it means not storing data in silos. A streaming data hub supports sharing data across departments or lines of business and can significantly increase analytics and insight opportunities. Having a single view of a customer across all product offerings not only creates a streamlined experience for the customer, but also allows you to better understand each customer's product utilization, behavior patterns, and willingness to try new products.
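The hub replaces point-to-point feeds with a publish/subscribe model: a producer writes to a named topic once, and any number of departments read it. A minimal in-memory sketch (the topic and department names are hypothetical):

```python
from collections import defaultdict

class StreamingHub:
    """Minimal in-memory stand-in for a streaming hub: producers
    publish to named topics, any number of consumers subscribe."""

    def __init__(self):
        self.topics = defaultdict(list)       # topic -> retained events
        self.subscribers = defaultdict(list)  # topic -> callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, event):
        self.topics[topic].append(event)
        for callback in self.subscribers[topic]:
            callback(event)

hub = StreamingHub()
marketing_view, risk_view = [], []

# Two departments consume the same stream - no point-to-point feeds,
# and both see the same single view of the customer.
hub.subscribe("customer-activity", marketing_view.append)
hub.subscribe("customer-activity", risk_view.append)
hub.publish("customer-activity",
            {"customer_id": "cust-42", "product": "mortgage"})
```

Adding a third consumer requires no change to the producer, which is the availability argument in miniature: the data is published once and democratized by subscription.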
Data Democratization also instills the concepts of data producers and data consumers within the organization. Data producers are typically charged with making sure that all data is captured reliably to minimize data quality issues and produce it into the ecosystem quickly. Data consumers are primarily concerned with having that single view of the customer, knowing where data originated, understanding what logic was applied along the way, and accurately reporting results. The streaming data hub only underscores the need for a robust data governance policy to ensure that information is shared effectively, appropriate data security rules are enforced, and data quality checks are implemented.
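One concrete way producers can "capture data reliably" is to enforce a data quality gate before an event enters the ecosystem. This sketch assumes a hypothetical field contract; a real deployment would typically use a schema registry or similar governance tooling:

```python
# Hypothetical producer-side contract: every event on this stream
# must carry these fields, or consumers downstream can't trust it.
REQUIRED_FIELDS = {"customer_id", "event_type", "ts"}

def produce(event, stream):
    """Data quality gate at produce time: reject malformed events
    so data consumers never have to clean them up downstream."""
    missing = REQUIRED_FIELDS - event.keys()
    if missing:
        raise ValueError(f"event rejected, missing: {sorted(missing)}")
    stream.append(event)

stream = []
produce({"customer_id": "cust-42", "event_type": "login", "ts": 1}, stream)

try:
    produce({"customer_id": "cust-42"}, stream)  # fails the quality check
    rejected = False
except ValueError:
    rejected = True
```

Pushing the check to the producer side keeps quality problems out of the shared hub entirely, rather than asking every consumer to defend against them.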
## Streaming Data is More Flexible

In today's agile environment, flexibility is key: iterative development, experimentation, and failing fast are the norm. Cloud-based architectures offer an environment where storage is relatively cheap, and many leading organizations are realizing the benefits of data experimentation. These companies typically choose an architecture that stores two versions of the data: the raw form as it was originally captured, and the enriched form with business transformations applied. The streaming paradigm is central to this methodology, serving data rapidly to support prototyping and deliver insights quickly.
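The dual-storage pattern can be sketched in a few lines: every event is persisted exactly as captured, and an enriched copy is derived alongside it. The enrichment rule here (flagging high-value transactions) is a hypothetical example:

```python
# Sketch of the dual-storage pattern: keep raw events exactly as
# captured, and derive an enriched stream alongside them.

raw_store = []       # immutable capture of events as they arrived
enriched_store = []  # business transformations applied

def enrich(event):
    """Hypothetical business rule: flag high-value transactions."""
    return {**event, "high_value": event["amount"] >= 100}

def capture(event):
    raw_store.append(event)               # raw form is always persisted
    enriched_store.append(enrich(event))  # enriched form for reporting

capture({"id": 1, "amount": 150})
capture({"id": 2, "amount": 40})
```

Because `raw_store` is never transformed in place, the original facts survive even if the definition of "high value" later changes, which is what makes the two points below possible.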
- You can reprocess data as things change. If the business rules defined within your organization change, you now have historical raw data that you can reprocess to generate new answers, along with a clear view into the lineage of events. Organizations can also analyze the enriched data to discover new trends or correlations and engineer new data streams for organizational consumption.
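Because the raw stream is persisted, a rule change becomes a replay rather than a rebuild. In this sketch the threshold change stands in for a hypothetical business-rule revision:

```python
# Sketch of reprocessing: when a business rule changes, replay the
# persisted raw events through the new rule to rebuild the answers.

raw_events = [  # historical raw data, captured before the rule changed
    {"id": 1, "amount": 150},
    {"id": 2, "amount": 80},
    {"id": 3, "amount": 40},
]

def enrich(event, threshold):
    """Business rule parameterized so a revision is just a new value."""
    return {**event, "high_value": event["amount"] >= threshold}

# Original rule, and the revised rule replayed over the same raw data.
old_view = [enrich(e, threshold=100) for e in raw_events]
new_view = [enrich(e, threshold=75) for e in raw_events]
```

In the batch world described at the top of this article, `raw_events` would not exist, and the revised rule could only apply to data captured from today forward.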
- You can future-proof your data as you define new metrics. If your business model changes and you define a new key performance indicator, streaming architectures allow you to process historical data to "seed" the new metric, keeping you from having to start from scratch. This gives your organization the ability to see what a measurement would look like when applied to existing data and opens the door to more proof-of-concept exploration. Modeling new KPIs this way can surface insights that might previously have been disregarded, and if a KPI doesn't make sense, it can be tweaked or scrapped with minimal disruption to existing business processes.
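Seeding a new KPI is the same replay idea applied to a metric that did not exist when the data was captured. The KPI here (a session-to-purchase conversion rate) and the event fields are hypothetical:

```python
# Sketch of "seeding" a newly defined KPI from historical events:
# the metric didn't exist at capture time, but the persisted raw
# stream lets us compute it retroactively.

historical_events = [  # hypothetical captured activity, by month
    {"month": "jan", "sessions": 10, "purchases": 2},
    {"month": "feb", "sessions": 20, "purchases": 8},
]

def conversion_rate(event):
    """Newly defined KPI, applied retroactively to old events."""
    return event["purchases"] / event["sessions"]

seeded_kpi = {e["month"]: conversion_rate(e) for e in historical_events}
```

The seeded history shows immediately whether the metric behaves sensibly on real data; if it doesn't, the definition can be tweaked and replayed again without touching any production process.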