In any data management policy, there are two extremes: Save everything (just in case), and delete everything that’s out of date. The two extremes work hand in hand when, eventually, you decide that even though you want to save it all, the realities of storage costs have forced you to arbitrarily delete your data.
Ideally, you would retain “interesting” data that might be useful and remove the rest. Even better would be not to collect the data in the first place.
Operational and environmental impact of excess data
Storing huge amounts of data is disastrous not only for business but also for the planet. Cloud computing accounts for 2.5% to 3.7% of all global greenhouse gas emissions, according to The Shift Project think tank, and stored monitoring data, logs, traces and other metrics that you may never look at again are part of that footprint.
These harmful emissions can be significantly reduced by taking a conscious, targeted approach: store only relevant and valuable data, and proactively clean up boring data. Operations staff benefit from the additional annotated references and can be much more productive when analyzing past events. And the business could (for once) see hard cost savings from reduced storage and data transfer, a bottom-line impact you can’t ignore.
Critical vs. boring time periods
Operations teams always want to go back and analyze what happened after an incident. Perhaps you have automated runbooks that scale up your services during peak load times, and you’d like to know whether they work. Your automation should have rolled back, turned off feature flags, or failed over to another site, right?
New deployments are usually worth reviewing retrospectively, because you want to understand how a release affects the production user experience, or because you are trying to move from maintenance windows to rolling updates. Save the important data, and discard the rest.
Programmatically distinguishing boring from critical
Service level objectives (SLOs) are goals for how a service should perform against expectations; they keep users happy and provide an early warning of poor performance. Many organizations automate their incident response and resource planning by defining SLOs in code (e.g., using OpenSLO). Automated SLOs can also be part of a data management strategy because they give a clear picture of when a system is behaving normally (“boring” times) and when something unexpected or important is happening (“interesting” times). The service level indicators (SLIs) that feed into SLOs make excellent summary visualizations and allow you to discard more detailed data, especially when you combine multiple data sources into a single SLO.
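As a rough illustration of how an SLO can separate boring from interesting time, the sketch below labels fixed time windows by comparing a simple availability SLI (good requests divided by total requests) against a target. The `Window` type, the `classify` function and the 99.9% target are all assumptions made for this example; they are not part of OpenSLO or any particular tool.

```python
from dataclasses import dataclass


@dataclass
class Window:
    start: int  # window start, in epoch seconds
    good: int   # requests that met the objective
    total: int  # all requests seen in the window


def classify(windows, target=0.999):
    """Label each window 'boring' if its SLI meets the target, else 'interesting'."""
    labels = []
    for w in windows:
        # Assumption: a window with no traffic shows no evidence of trouble.
        sli = w.good / w.total if w.total else 1.0
        label = "boring" if sli >= target else "interesting"
        labels.append((w.start, round(sli, 4), label))
    return labels


windows = [
    Window(start=0, good=1000, total=1000),
    Window(start=300, good=950, total=1000),
    Window(start=600, good=0, total=0),
]
labels = classify(windows)
for entry in labels:
    print(entry)
```

Only the second window falls below the target, so only that one would be flagged for closer attention; the per-window SLI itself is the kind of compact summary you can keep after the detailed data is gone.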
Automated SLOs can also trigger time-stamped events that enrich the data (often called “annotations”), which can then serve as a “system of record” for the reliability of the system. Once the interesting times are annotated, you can easily use this extra contextual data to create a data retention plan that preserves full fidelity around the dramatic moments, moves boring data to cold storage, and eventually schedules it for removal.
With a sufficiently large and stable system, you may also be able to use machine-learning methods to identify and annotate critical times. AIOps, or machine-learning-driven operational automation, can play a role on top of SLOs and other ways of classifying time series data, alongside equally important event tracking systems.
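A retention plan of this kind can be expressed as a small decision function. The following is a minimal sketch, assuming annotated windows are available as `(start, end)` pairs in epoch seconds; the function name, the tier names and the 7-day and 90-day thresholds are illustrative choices rather than recommendations.

```python
DAY = 86_400  # seconds in a day


def retention_action(sample_ts, now, interesting_windows, hot_days=7, cold_days=90):
    """Decide what to do with one sample, given annotated (start, end) windows."""
    # Samples inside an annotated window keep their full fidelity.
    if any(start <= sample_ts <= end for start, end in interesting_windows):
        return "keep_full"
    age = now - sample_ts
    if age <= hot_days * DAY:
        return "keep_full"      # still inside the primary retention period
    if age <= cold_days * DAY:
        return "move_to_cold"   # boring and old: cheaper storage is enough
    return "delete"             # boring and past the cold retention period


now = 100 * DAY
interesting = [(4 * DAY, 6 * DAY)]
print(retention_action(now - 1 * DAY, now, interesting))
print(retention_action(now - 10 * DAY, now, interesting))
print(retention_action(now - 95 * DAY, now, interesting))
```

The three calls cover a recent boring sample, an older boring sample, and a very old sample that happens to fall inside an annotated window.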
Case Study: Clean Up After Yourself
A company I worked with to improve observability data retention used SLOs to shorten its data retention window. The window was already short (7 days), after which the retention policy deleted all observability data. They had a lot of data on a huge platform, and the sheer volume forced them into a brute-force method of controlling storage costs. Of course, the team still wanted to investigate and learn from everything, which led to a scramble to figure out what had happened within a week! Unfortunately, the weekly purges made it harder to diagnose recurring issues because they could not compare recent events with earlier ones.
To solve this problem, they created a list of timestamps for changes in the running system that could be useful to investigate. A simple API would record a timestamp and a text field with details about why that moment was worth looking at. They then automated calls to this API when an SLO was degrading, when new releases were pushed to production, or when an operator noticed a problem in real time that could be the start of an incident. Because the timestamps were tagged both automatically and manually, there were plenty of annotated references available to explain why each event had been recorded, without much extra effort.
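The article does not describe the API itself, so the following is only a sketch of what such a timestamp-plus-text interface might look like, kept in memory for simplicity. `AnnotationLog`, `mark`, `between` and the `source` tag values are hypothetical names chosen for this example.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Annotation:
    timestamp: float  # epoch seconds of the moment worth revisiting
    note: str         # free-text reason the moment was marked
    source: str       # who marked it, e.g. "slo", "deploy" or "operator"


@dataclass
class AnnotationLog:
    entries: list = field(default_factory=list)

    def mark(self, note, source, timestamp=None):
        """Record an annotation, defaulting to the current time."""
        ts = time.time() if timestamp is None else timestamp
        entry = Annotation(ts, note, source)
        self.entries.append(entry)
        return entry

    def between(self, start, end):
        """Return annotations whose timestamp lies in [start, end]."""
        return [e for e in self.entries if start <= e.timestamp <= end]


log = AnnotationLog()
log.mark("checkout SLO is degrading", source="slo", timestamp=1000.0)
log.mark("release pushed to production", source="deploy", timestamp=2000.0)
log.mark("operator noticed a latency spike", source="operator", timestamp=5000.0)
print([e.source for e in log.between(900, 2500)])
```

Keeping the `source` alongside the note makes it possible to tell later whether an automated rule or a person flagged the moment.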
Once they had the timestamps, it was easy to get the data collection system to automatically scan the source data and export the interesting times (+/- 1 hour from each timestamp) to cold storage before the retention policy permanently removed the data. Because they were confident that the interesting data would be saved, they were able to shorten the primary retention period to three days (halving the hot storage cost) and gain much greater visibility into recurring issues, which greatly increased the team’s ability to learn.
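The export step can be sketched as two small functions: one that pads each marked timestamp by an hour on either side and merges overlapping ranges, and one that picks out the records falling inside those ranges. The function names and the `(timestamp, payload)` record shape are assumptions for illustration; a real pipeline would query its own storage backend instead of filtering a list.

```python
HOUR = 3600  # seconds


def export_windows(timestamps, padding=HOUR):
    """Turn marked timestamps into merged (start, end) windows."""
    merged = []
    for ts in sorted(timestamps):
        start, end = ts - padding, ts + padding
        if merged and start <= merged[-1][1]:
            # Overlaps the previous window, so extend it instead of adding one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def select_for_cold_storage(records, windows):
    """Keep only (timestamp, payload) records that fall inside a window."""
    return [r for r in records if any(s <= r[0] <= e for s, e in windows)]


marks = [10_000, 12_000, 50_000]
windows = export_windows(marks)
records = [(7_000, "a"), (15_000, "b"), (30_000, "c"), (53_000, "d")]
kept = select_for_cold_storage(records, windows)
print(windows)
print(kept)
```

Merging matters when several annotations cluster around one incident: the first two marks here collapse into a single window, so the same data is not exported twice.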
Case Study: Triggering Actionable Collections
Another company had a similar situation, but with a twist: they wanted to prevent the data from being written in the first place! The company has a large fleet of devices that emit log data, which is collected and analyzed centrally. Carrying that data over the network incurred high data transfer costs and was usually unnecessary (except when it wasn’t!). They erred on the side of collecting more data rather than less, but even with a cleanup strategy to reduce storage costs, the upfront network costs were still a significant recurring expense.
To solve this, they created an API that could adjust the verbosity of the data the devices produced and sent home to a central observability system. They could remotely trigger verbose metrics in any part of the fleet and also bring devices back into concise mode. This became especially valuable during critical times such as over-the-air (OTA) updates, when the team pushed new software and needed to monitor the rollout closely.
Before an OTA update, they would trigger “verbose” mode. Then they would roll out the update, with detailed logging available to manage the release. As soon as each device completed the update, it would revert to “concise” mode. Apart from OTA updates, they could also trigger changes in the tracing level using SLOs. For example, if an automated SLO detected that errors had increased relative to expectations, it could automatically switch devices to verbose mode to improve the team’s ability to investigate. On the other hand, if there were network bandwidth issues and the data pipeline was slowing down, they could switch logging from concise mode to an even lighter form of metrics, heartbeat-only mode, and reduce the data load on the service.
The entire process, from starting an OTA update to changing logging levels, rising error rates and marking the update complete, was annotated in a time series database using a simple API. This added context to investigations and also aided data retention and cleanup after the fact.
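The mode switching described here amounts to a small decision rule. The sketch below is one way to write it down; the `Mode` enum, the `choose_mode` function and, in particular, the choice to let pipeline congestion take priority over everything else are assumptions of this example, not details from the case study.

```python
from enum import Enum


class Mode(Enum):
    HEARTBEAT = "heartbeat"  # lightest: liveness signals only
    CONCISE = "concise"      # normal day-to-day logging
    VERBOSE = "verbose"      # detailed logging for investigations


def choose_mode(ota_in_progress, slo_breached, pipeline_congested):
    """Pick a logging mode for a device from three boolean signals."""
    # Assumption: protecting a congested pipeline outranks collecting detail.
    if pipeline_congested:
        return Mode.HEARTBEAT
    # An update in flight or an SLO breach both call for detailed logs.
    if ota_in_progress or slo_breached:
        return Mode.VERBOSE
    return Mode.CONCISE


print(choose_mode(ota_in_progress=True, slo_breached=False, pipeline_congested=False))
print(choose_mode(ota_in_progress=False, slo_breached=False, pipeline_congested=False))
print(choose_mode(ota_in_progress=False, slo_breached=True, pipeline_congested=True))
```

Writing the priority order down explicitly forces a decision about conflicting signals, such as an SLO breach during a bandwidth shortage, before it happens in production.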
Keep interesting data; delete the rest
Everyone is watching their budget closely and looking for creative places to cut costs while preserving productivity. Now is a perfect time to take a close look at how much data you’re storing and what you’re actually doing with it. Observability data has a short shelf life, but that doesn’t stop it from growing to fill all the available space. Even a small improvement in your collection, retention, summarization and annotation can be a big boost and enable your organization to do more with less.