DataStax is a data management company that provides commercial support, software, and a cloud database-as-a-service based on Apache Cassandra. DataStax also provides event streaming support and a cloud service based on Apache Pulsar. We talked with Chris Latimer, Vice President of Product Management at DataStax, about the history of the company, its products, and the industry.
Can you tell us about the story of DataStax? What were the most important milestones?
DataStax's story is linked to the history of Apache Cassandra, the NoSQL database that was originally developed at Facebook before being open-sourced. Jonathan Ellis, the co-founder of DataStax, was one of the most active members of the open-source community and was instrumental in getting the project adopted by the Apache Software Foundation. Since then, Cassandra has become the database that companies rely on when they need to run applications at massive scale.
Companies like Instagram, eBay, Facebook, Uber and Apple all run Cassandra, so it touches virtually everyone every day. For companies that want to provide service at this kind of scale, Cassandra keeps data always available and clusters operational across geo-distributed environments.
Over the past few years, DataStax has contributed code like its database drivers to the open-source community as well as supporting the development of Cassandra 4.0 with testing and other contributions. The company was the first to launch a serverless open-source database (Astra DB), which helps enterprises focus on innovation rather than database management. The future for Cassandra is very bright – the new release of 4.0 has gone through one of the most stringent series of testing processes because it has to be the most reliable database service ever. The world runs on Cassandra, so this new update has to be production-ready right from the start.
DataStax delivers the open, multi-cloud stack, purpose-built for modern data apps. What are the solutions that differentiate you from your competitors?
It's the combination of open-source, serverless, and multi-cloud support that makes DataStax different from competitors. First, Apache Cassandra was built from the start to run across multiple locations, so companies are not tied to a specific cloud provider and can run across private, public, and hybrid environments. With DataStax, companies can run the same service across multiple cloud regions, across multiple cloud providers, and across multiple geographies at massive scale.
Getting that Open Data Stack together is essential for companies to derive real-time value from their data and create new and innovative offerings in the data-driven economy. According to our State of the Data Race report, leading companies are four times more likely to use Apache Cassandra and Kubernetes along with at least two of Apache Spark, Apache Pulsar, Apache Kafka, and Elasticsearch. Getting this stack together involves more than just putting together individual components – instead, it is how this whole stack works together in production. For those companies that do succeed here, they are twice as likely to drive at least 20% of their revenues using data and analytics.
Building on this, DataStax makes it easier than ever to roll out that data stack for modern, modular applications over time without breaking the architecture. Astra DB simplifies the process of implementing cloud-native Cassandra applications, reducing deployment time from weeks to minutes. The result is a combination that makes it easy to adopt cloud-native applications with pay-as-you-go pricing and the freedom and agility of multi-cloud and open source.
What are the benefits of application event streaming?
Applications and IoT devices create massive amounts of data over time. That data will go into a database for long-term management. On top of this, you might want to use that data when it is created to trigger an action. This is where application event streaming comes into play, as you can create sources of data – publishers – and then services that consume that data when it fits certain criteria, known as subscribers. When a publisher creates event data that fits those criteria it is automatically sent over to the subscriber where action can take place.
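The publisher/subscriber flow described above can be sketched in a few lines. This is a minimal in-memory stand-in for a messaging broker, not the Apache Pulsar API; the topic name, event fields, and criteria predicate are all illustrative assumptions.

```python
from collections import defaultdict

class Broker:
    """Minimal in-memory stand-in for a messaging broker."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, criteria, handler):
        # criteria: a predicate deciding whether an event interests this subscriber
        self._subscribers[topic].append((criteria, handler))

    def publish(self, topic, event):
        # Deliver the event to every subscriber whose criteria it matches.
        for criteria, handler in self._subscribers[topic]:
            if criteria(event):
                handler(event)

broker = Broker()
alerts = []
# Subscriber: act only on temperature readings above 30 degrees.
broker.subscribe("sensors", lambda e: e["temp"] > 30, alerts.append)

# Publisher: emit readings as they are created.
broker.publish("sensors", {"device": "a1", "temp": 22})
broker.publish("sensors", {"device": "a2", "temp": 35})

print(alerts)  # only the matching event reaches the subscriber
```

In a real system the broker is a distributed service and the subscriber runs in a separate process, but the shape of the interaction is the same: publishers emit events, and delivery happens automatically when an event matches a subscription.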
Common use cases for application event streaming include moving data between microservices and application components, and updating and consuming application data in real time. You might have data scientists who need to capture all events and process the data to build predictive models. Another use case is where you have Internet of Things devices sending raw sensor data to you from thousands or even millions of devices, and you have to scale to millions of readings per second and enrich that raw sensor data with contextual information. E-commerce businesses use unified event streaming to create real-time personalized offers for consumers.
Applications, sensors, bank transactions, website activities, and server log files all create data. Streaming data is becoming increasingly important. Could you please explain why?
At a high level, all of those things play an important role in the march towards hyper-automation, and a lot of that is tightly coupled to advancements in artificial intelligence (AI) and machine learning (ML). Things like sensors are central to transformational technologies like industrial IoT and autonomous vehicles. But sensors on their own just provide raw data about the environment where that sensor exists. It's the data from each sensor that is then combined with data from other sensors and processed in real time by an ML model that results in something really remarkable.
Likewise, enterprises are finding ways to extract more and more value from data that originates from application activities, logs and transactional data by combining it with AI/ML. Streaming is a very natural technology to use here since you can replay a series of events as many times as you wish. When training and testing ML models, this is very convenient since you can effectively time travel and see how things would have played out if you used model A vs model B.
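The "time travel" idea above can be made concrete: replay the identical event log through two candidate models and compare. The event log, the two toy models, and the lunchtime rule are all hypothetical, chosen only to show the replay pattern.

```python
# Hypothetical replayable event log: (hour_of_day, did_purchase) pairs,
# captured from a stream so they can be replayed as often as needed.
events = [(9, 0), (12, 1), (13, 1), (20, 0), (12, 1)]

def model_a(hour):
    return 1  # baseline: always predict a purchase

def model_b(hour):
    return 1 if 11 <= hour <= 14 else 0  # simple lunchtime rule

def accuracy(model, log):
    # Replay the same stream of historical events for each candidate model.
    hits = sum(1 for hour, label in log if model(hour) == label)
    return hits / len(log)

print(accuracy(model_a, events))  # 0.6
print(accuracy(model_b, events))  # 1.0
```

Because the stream is replayable, both models are scored on exactly the same history, which is what makes an apples-to-apples "model A vs model B" comparison possible.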
Could you please give examples from the real-world use cases?
Some examples of event streaming are really apparent in everyday use. For example, watching your rideshare driver navigate to your location is accomplished by the driver’s phone streaming their location to the rideshare company and then the company streaming that data to your phone which is then rendered onto a map and updated in real-time.
Other examples are things we never notice. For example a retailer processing purchase information in real-time and sending that information to participants in the supply chain so that manufacturers can produce goods in smaller batches and with shorter lead time. From your perspective you just experience an item you want to buy being in stock and at a reasonable price. Most people don’t stop to think that the lower price was made possible because the retailer didn’t need to hold inventory as long or that the item was in stock with very little excess inventory. Even fewer people think about the role that event streaming had to play in this scenario.
What are the key components of a streaming data architecture? What decisions do developers and DevOps teams have to make?
Generally, the backbone of a streaming data architecture is going to be a distributed messaging system like Apache Pulsar. This will provide capabilities to ingest data streams as well as to deliver those data streams to consumers who are subscribed to the stream.
For a lot of use cases, you can do a surprising amount with just this simple foundation. Eventually, you will probably need stream processing capabilities to provide in-pipeline mediation/transformation capabilities using something like Pulsar Functions. For more advanced stream processing capabilities, especially when that processing is analytical in nature, you’ll likely want to combine your messaging platform with something like Apache Flink.
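The in-pipeline mediation/transformation step mentioned above can be sketched as a per-message enrichment function. This is a plain Python generator in the spirit of a Pulsar Function, not the Pulsar Functions API; the lookup table and event fields are illustrative assumptions.

```python
def enrich(raw_readings, device_locations):
    # Per-message transformation, in the spirit of a Pulsar Function:
    # consume an event, enrich it with contextual data, emit it downstream.
    for reading in raw_readings:
        location = device_locations.get(reading["device"], "unknown")
        yield {**reading, "location": location}

# Hypothetical contextual lookup table and raw sensor events.
locations = {"a1": "warehouse-1"}
raw = [{"device": "a1", "temp": 21}, {"device": "b9", "temp": 30}]

print(list(enrich(raw, locations)))
```

A real Pulsar Function would receive one message at a time from an input topic and publish the enriched result to an output topic, but the core logic, a small stateless transform applied to each event as it flows through the pipeline, looks just like this.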
Depending on how you want to access stream data, an operational data caching layer may be part of your architecture as well.
Developers and DevOps teams often want to ensure that the platforms they are going to use fit well into their software development lifecycle. This generally means they want to have a convenient way to treat configuration as code to facilitate software releases and they want to rely on APIs/CLI tools for deployments and configuration of the technologies that make up their streaming architecture. This is one of the things that attract these users to Apache Pulsar since it was fundamentally architected with DevOps in mind.
One of your products, Astra Streaming’s public beta version, is out. What is Astra Streaming? Could you please give detailed information about it?
DataStax Astra Streaming is a cloud-native messaging and event streaming platform that is powered by Apache Pulsar.
While more and more enterprises and developers are discovering the advantages of Apache Pulsar, they want those advantages delivered to them as a cloud-native service with all the benefits they’ve come to expect. This is exactly what DataStax has built.
With Astra Streaming, you can deploy Apache Pulsar to the cloud of your choice in the same geographic location where your applications are running. You don’t need to worry about sizing servers or scaling them up and down as your event volume ebbs and flows – DataStax takes care of that for you.
Astra Streaming also lets you start small and grow with consumption-based pricing that allows you to pay only for what you use. If you create an Astra Streaming instance and then stop using it, your pricing scales to zero so you don’t have to worry about being billed for resources you aren’t utilizing.
In addition to all the capabilities that come out of the box with Apache Pulsar, Astra Streaming also has built-in integration with Astra DB, DataStax’s Apache Cassandra-as-a-Service cloud offering. This means you can easily stream data both into and out of Astra DB to build real-time data pipelines and create read-optimized views of your streaming data that can be queried just like any other database table.
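A database sink like the one described, streaming events into a read-optimized view, can be sketched in miniature. The dictionary here stands in for a database table and the event shape is hypothetical; this is not the Astra DB sink API.

```python
# Read-optimized view maintained from a stream: the "table" always holds
# the most recent reading per device, keyed the way queries will ask for it.
latest_by_device = {}

def sink(event):
    # Upsert: later events for the same device overwrite earlier ones.
    latest_by_device[event["device"]] = event["temp"]

for event in [{"device": "a1", "temp": 20},
              {"device": "a2", "temp": 25},
              {"device": "a1", "temp": 23}]:
    sink(event)

print(latest_by_device)  # {'a1': 23, 'a2': 25}
```

The point of the pattern is that consumers never query the raw stream; they query a view that the sink keeps continuously up to date.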
What is the role of multi-cloud in building data strategy?
One of the problems that enterprises have struggled with practically forever is having data siloed and inaccessible across the organization. As cloud adoption became the norm, this problem was compounded by the increased reliance on cloud databases provided by public cloud providers. Multi-cloud strategies add one more level of complexity to this problem by trapping data within cloud-specific databases scattered across multiple cloud providers.
As enterprises refine their multi-cloud strategies, they realize that their data strategy needs to be multi-cloud as well. One of the standard practices that you’ll commonly find is a reliance on technology like Apache Cassandra that was built on an architecture fundamentally created to run in this type of geographically dispersed deployment.
Of course, when you embark on a cloud strategy you generally want to see your operational burden and responsibilities decrease, so enterprises want the capabilities of Cassandra but with all the benefits of an as-a-service model. This is why so many multi-cloud enterprises are drawn to DataStax Astra DB. Once you've wrestled with the problems associated with siloed data across clouds, the value of Astra DB as a way to solve those challenges becomes immediately apparent.
What are the benefits of Apache Cassandra in data modeling?
I love this question because I was stuck in a bad relationship with third normal form for a long time.
The problem every application developer has inevitably faced is that they spend all this time carefully crafting beautiful relationships between tables and normalizing things just right, but then they run a query and it’s horrifically slow – at least once you reach a significant volume of data and load on your application.
So what do you do? Typically you add indexes and start denormalizing your data. Indexes are going to severely impact your write throughput and denormalized views are going to have different tradeoffs depending on the RDBMS you are using. At the end of the day you find that while your relational data model may make for a very beautiful diagram, it’s not helping you deliver a performant, usable application which is the only thing you really care about as a developer.
Rather than go through that entire process, Apache Cassandra forces you to model your data in a way that's optimized for how you want to store and retrieve it. And while your entity-relationship diagram won't be as pretty, that's a small price to pay to be the developer who built the app that responds to user requests in a few milliseconds even when it's being hammered by millions of concurrent users.
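Query-first modeling can be shown in miniature: instead of normalized tables joined at read time, you store the data already shaped for the query. This sketch uses a plain dictionary as a stand-in for a Cassandra table partitioned by user; the keys, fields, and data are illustrative, not a real schema.

```python
# Denormalized "table": partition key is the user id, and rows within a
# partition are kept pre-sorted newest-first (as a clustering order would).
orders_by_user = {
    "u1": [("2021-06-02", "book"), ("2021-06-01", "lamp")],
    "u2": [("2021-06-03", "mug")],
}

def recent_orders(user_id, limit=10):
    # A single partition lookup answers the query -- no joins, no scans.
    return orders_by_user.get(user_id, [])[:limit]

print(recent_orders("u1"))
```

In Cassandra this corresponds to choosing the partition key and clustering columns to match the query ("orders for a user, newest first"), accepting duplication of data across tables so that every read stays a cheap single-partition lookup.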
What are the challenges for those managing cloud and data center environments?
Before NoSQL, companies would spend time and money on physical hardware to support their database, as it was technically difficult to run clusters and keep up with demand compared to just buying bigger physical boxes. After this, it was hard to virtualize database workloads too, so in most cases database infrastructure sat on dedicated hardware next to the virtualized systems of the application. As cloud adoption grew, similar issues persisted. Ephemeral cloud instances worked great for web and app servers, but teams had to run dedicated instances for their databases, even though this added management overhead and cost.
Today, we are seeing more of a shift due to the impact of Kubernetes. Organizations don’t want multiple versions of infrastructure to manage, as this requires hiring more people and keeping track of more stuff. Instead, they want to automate and manage things in the same way, regardless of where it runs or who provides that infrastructure. With more companies moving to hybrid and public cloud deployment, databases have to change.
There are some valid objections when it comes to running databases in containers, like needing high-performance file systems and looking at how to prevent possible contention. However, those issues are going away – you can see this in how distributed databases like Apache Cassandra work, where placement of individual nodes means that hardware failure doesn’t impact database uptime.
The current state-of-the-art thinking around data and Kubernetes involves creating operators that translate how databases want to work into what Kubernetes wants them to do. However, this does not go far enough. Instead, we can design our databases so they use more of what Kubernetes offers with resource management and orchestration for basic operation of the database, and then use this to support more orchestration and automation around our applications and data together.
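The operator pattern mentioned above boils down to a reconcile loop: compare desired state to actual state and take the smallest action that closes the gap. This sketch is a toy reconcile function, not the Kubernetes API; the node list and action names are illustrative assumptions.

```python
def reconcile(desired_replicas, actual_nodes):
    # One pass of an operator's reconcile loop: compare the spec
    # (desired replica count) to observed state (running nodes) and
    # return the actions needed to converge.
    if len(actual_nodes) < desired_replicas:
        return ["add node"] * (desired_replicas - len(actual_nodes))
    if len(actual_nodes) > desired_replicas:
        return ["remove node"] * (len(actual_nodes) - desired_replicas)
    return []  # already converged: nothing to do

print(reconcile(3, ["n1"]))              # ['add node', 'add node']
print(reconcile(3, ["n1", "n2", "n3"]))  # []
```

A real database operator does much more per action (safe bootstrapping, decommissioning, repairs), but the argument in the paragraph above is that databases should expose their state so that Kubernetes' own resource management can drive loops like this, rather than the operator translating around it.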
What does this mean for those looking at multi-cloud and data? We will be able to run services where the data is closer to the application, and they will operate in the same way. This reduces the mental load and technical debt over time for application developers, and makes it easier to plan migrations over time too. Want to shift your entire application workload, data included? Using this approach, it will be possible in the future.
What can you say about the future of streaming data?
In my view, event streaming is still in its infancy. I think some of the technologies that were early pioneers in this space like Apache Kafka will stop being a technology standard and transform into a legacy API standard. There won’t be a lot of enthusiasm for doing big-bang migrations which require updating every Kafka app in the enterprise, but there will be even less enthusiasm for expanding the adoption of Kafka’s monolithic architecture and incomplete feature set. Next-generation streaming platforms like Apache Pulsar that offer better performance, superior architecture, comprehensive capabilities along with Kafka compatibility will really accelerate the exodus away from Kafka.
I also think there are a lot of interesting possibilities on the horizon for event streaming in general. What would the world look like if you started reimagining everyday data structures as event streams? Why couldn’t a web server be implemented as an event stream of requests and responses? Why couldn’t every log message that gets logged be an event? In a way similar to how object-oriented programming changed the way we thought about data as objects, I expect equally impactful changes to happen as we increasingly embrace an event-oriented model of seeing the world.
Things will really get interesting as all the second- and third-order effects start to emerge. As we treat more and more of our data as event-oriented, and capture those events in a replayable way, imagine how substantially this will accelerate the capabilities of AI/ML. And as edge computing continues to get more powerful, what will happen as we can deliver those events, often enriched with ML, to the edge? We're talking about advancements here that could easily reshape our world every bit as much as the internet and mobile computing did.