Sumo Logic is a cloud-based machine data analytics company that focuses on security, operations, and BI use cases. The company provides log management and analytics services, leveraging machine-generated big data to deliver real-time IT insights. We spoke with Erez Barak, Vice President of Observability at Sumo Logic, about observability and its future.
What is observability today? Where did it come from?
Observability today is defined as the combination of log, metric, and trace data that tells you what is going on in your application. The concept, however, comes from an approach that has existed for many years in control theory, where you use output data from a system to determine how that system is performing.
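As a minimal illustration of how the three signal types fit together (plain-stdlib Python; the service name, field names, and events are all hypothetical), each signal can carry a shared identifier so a log line, a metric, and a trace span from the same request can later be correlated:

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("checkout")

def handle_request():
    """Emit one log line, one metric, and one trace span for a single request."""
    # A shared id ties all three signals back to the same request.
    trace_id = uuid.uuid4().hex
    start = time.perf_counter()

    # Signal 1: a structured log event.
    log.info(json.dumps({"event": "order_placed", "trace_id": trace_id}))

    duration_ms = (time.perf_counter() - start) * 1000
    # Signal 2: a metric measurement.
    metric = {"name": "request_duration_ms", "value": duration_ms,
              "trace_id": trace_id}
    # Signal 3: a trace span describing this unit of work.
    span = {"span": "handle_request", "trace_id": trace_id,
            "duration_ms": duration_ms}
    return metric, span
```

In a real deployment an SDK such as OpenTelemetry handles this plumbing; the point of the sketch is only that correlation across the three signals is what turns raw data into observability.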
I like this context, as it points out that observability is about much more than the sum of the data you have. It’s about context, it’s about how things change over time, and it provides that continuous intelligence into what is happening and the impact that your decisions have. This goes from the smallest technical decision – we need to update from version 2.0 to version 2.1 on one of the open-source tools we use – through to bigger business decisions like changing an application design or launching a new service. That observability data coming through can be used to inform all those decisions over time and ensure your application reliability is where customers expect it to be.
How are software development teams using observability, and how does it compare to the other potential solutions that they might have used in the past?
Software teams have used log data for years to tell them what happened in their applications and to get to root causes. This mainly meant looking for things that had gone wrong, so they could be fixed, or where performance was not as good as expected, so it could be improved. Over time, metrics and trace data and their correlation to log data provided more context. The next step was to bring all this together so you had more context and you could make better decisions.
As applications have moved to cloud infrastructure and to microservices designs, the value of observability has gone up. Tracking down problems is harder because applications are more complex, so these tools can deliver much more value back to developers. You can think of observability as the next generation of application performance management in the cloud, where distributed tracing plays the role of code-level instrumentation.
How is open source getting used in this area?
When we started around logs more than a decade ago, we had to build everything that we provided. Today, the open-source community has matured, and there is more standardization across open-source projects on what developers actually want to use. OpenTelemetry brings the different point projects together into a unified approach that developers can use to understand their applications. We are also investing in open-source projects for SLO and SLI management as a means to share a common methodology across the organization.
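To make the SLO/SLI idea concrete, here is a small sketch (plain Python; the numbers, the 99.9% target, and the function names are all illustrative, not any particular tool's API) of computing an availability SLI from request counts and checking how much of an SLO's error budget remains:

```python
def availability_sli(total_requests: int, failed_requests: int) -> float:
    """SLI: the fraction of requests that succeeded."""
    return (total_requests - failed_requests) / total_requests

def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative means the SLO is breached)."""
    budget = 1.0 - slo_target   # e.g. 0.001 for a 99.9% availability SLO
    burned = 1.0 - sli          # the actual observed error rate
    return (budget - burned) / budget

# 400 failures out of 1,000,000 requests against a 99.9% target:
# the budget allows 1,000 failures, so 60% of the budget is left.
sli = availability_sli(total_requests=1_000_000, failed_requests=400)
remaining = error_budget_remaining(sli, slo_target=0.999)
```

The value of sharing this as a common methodology is that every team computes "are we within budget?" the same way, regardless of which service they own.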
This approach helps developers get the insights they need, but they also aren’t locked into any particular provider when it comes to working with that data. This helps them be more productive and focus on what is really important to them. At the same time, we have launched our own distribution of OpenTelemetry so we can help our customers adopt it faster. This gives customers the best of the open-source approach along with the scale in data processing and management that we provide.
How does this data get used today, and what other processes can it be used to support?
The first use case for observability is telling you how reliable your application is from a systems, application, and user perspective. Applications are more distributed, and that insight helps you see where potential problems exist. However, the same data can be used to understand other things alongside your software. For example, your observability data provides a record of how your applications perform over time, but you should also get insight into your software pipelines and how your team is getting releases through your CI/CD pipeline.
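As one hypothetical example of pipeline insight (the event shape and dates are invented for illustration), the same event-based approach can compute a delivery metric such as mean lead time from commit to production deploy:

```python
from datetime import datetime, timedelta

# Hypothetical pipeline events: commit time and production deploy time per release.
releases = [
    {"commit": datetime(2023, 5, 1, 9, 0), "deployed": datetime(2023, 5, 1, 15, 0)},
    {"commit": datetime(2023, 5, 2, 10, 0), "deployed": datetime(2023, 5, 3, 10, 0)},
]

def mean_lead_time(events) -> timedelta:
    """Average time from commit to production deploy across releases."""
    deltas = [e["deployed"] - e["commit"] for e in events]
    return sum(deltas, timedelta()) / len(deltas)
```

Tracked over time, a metric like this shows whether process changes are actually speeding releases up, using the same tooling you already apply to application reliability.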
This data can be used to help you improve your processes around software development as a whole, not just the software itself. If you can combine data from your tools, like Atlassian Jira and GitHub, you can get a picture of how your team is doing and where to find opportunities to help them improve. We talk a lot about how software and data can be used to improve the business, and we can apply those same learnings to software development optimization too.
How can other teams use data in their operations?
Security is a big use case for observability data. Security operations centers, or SOCs, have huge amounts of data coming in, and they have to analyze that data to spot risks or new attacks. They can use observability data as part of this process to see where there are outliers in the data and where things might need to be investigated. Another benefit is that you only need to keep that data once: rather than having one data set for observability and a duplicate for the security team, you can keep a single copy that both teams use. That helps save on costs over time, particularly as your applications scale up.
One big benefit here is that you can help the security team ask different questions of their data. Rather than looking for patterns alone, you can help them understand what expected application behavior looks like, then help them understand what changes are coming up and where to look for risks. This helps them avoid false positives in their data and concentrate on the right areas.
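As a toy sketch of the outlier idea (stdlib only; the data, threshold, and the notion that failed-login counts are the shared signal are all assumptions for illustration), a simple z-score over a metric both teams already keep can surface points worth investigating:

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) > threshold * sigma]

# Per-minute failed-login counts; the 90 stands in for a brute-force spike.
failed_logins = [3, 2, 4, 3, 5, 2, 3, 90, 4, 3]
suspicious = zscore_outliers(failed_logins, threshold=2.5)
```

Real detection systems use far more sophisticated models, but the shared-data point stands: the security team queries the same copy of the data the operations team already collects.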
What does this mean in practice for businesses? Do they have to change their processes, their mindsets?
Consolidating data and tools is a good opportunity for the future. Both development and security teams in organizations have to work at high volume and velocity around data, but this won’t organically result in collaboration. We’ll continue to see more of the same if senior leadership doesn’t force a change in approach. As workloads increase due to digital transformation initiatives, organizations are in danger of individual teams doubling down on their own silos of work, focusing on what they can deliver and ignoring the bigger picture.
I predict more enterprises will put together centers of excellence, teams that constantly work on that collaboration across functions and improve results around data. Teams should be working from the same platform, from the same data and making decisions that benefit both sides. This should deliver better application reliability and security in equal measure. Human nature won’t change without the right support, and you might need to force the issue in order for things to succeed.
What do you see as the future here? What’s coming next around observability?
More organizations are using artificial intelligence (AI) and machine learning (ML) capabilities across their infrastructure, helping them work faster, more efficiently, and more accurately. AI has been sitting on the sidelines of observability for a while, but over the last two years we have seen more systems start to monitor AI in production, and this will increase in line with wider AI adoption.
Teams need to be able to answer the question ‘how do I know if this is working correctly and fairly?’ This means extending their existing observability approaches to cover their AI and ML capabilities. Today, there’s a big gap there, and that creates risk as organizations can’t manage those assets properly.
I think the work being done around OpenTelemetry provides insight into how this will develop. The open-source community will work out how observability into AI models functions over time. This will help those teams answer questions about their approach and the results that they see coming in. For example, can you say that your models are unbiased if you can’t explain why they came up with one result for one person compared with another? Using observability data, you can go back to that transaction and use it to explain why you got the result that you did.
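A minimal sketch of the "go back to that transaction" idea (everything here is hypothetical: the toy scoring rule stands in for a real model, and the field names are invented) is to record each prediction alongside the exact inputs that produced it, so a specific decision can be revisited later:

```python
# Hypothetical audit log: one record per model decision, keyed by transaction.
audit_log = []

def predict_and_record(transaction_id: str, features: dict) -> float:
    """Score a transaction and record inputs + output for later explanation."""
    # Toy scoring rule standing in for a real ML model.
    score = 1.0 if features.get("income", 0) > 50_000 else 0.0
    audit_log.append({"transaction_id": transaction_id,
                      "features": features,
                      "score": score})
    return score

predict_and_record("txn-1", {"income": 60_000})

def explain(transaction_id: str) -> dict:
    """Retrieve the recorded decision for a given transaction."""
    return next(r for r in audit_log if r["transaction_id"] == transaction_id)
```

With this kind of record in place, the question "why did this person get a different result?" becomes a lookup over observability data rather than guesswork.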