Imagine you are a cybersecurity incident responder. 

Picture it: A developer discovers that the website of one of your subsidiaries has been defaced by a well-known hacking group that targets your industry. The website is running in a Kubernetes cluster. Before the report of the incident makes it to your incident response team, well-meaning decisions get made and the website is removed from production by spinning down the web application pods in Kubernetes. You, as the incident responder, now have three big problems:

  • All employees, including those of your subsidiaries, need to know how to properly report and respond to cyber incidents. You will need to develop training for the future.
  • Depending on state and federal laws, you are required to report this website defacement to local and federal law enforcement agencies (if you are in the United States, the Department of Justice publishes a federal reporting chart).
  • The pods no longer exist in Kubernetes, so you cannot observe the website defacement for yourself. You also have very limited information in Kubernetes and in the system logs of the node the application pod was running on. Right now, if the incident response team needed to look at the logs, the server resource usage, or the network connections, each team member would need to be granted direct and elevated access to each of the relevant servers.

In summary, your incident response team does not have enough information readily available to perform proper analysis, identify whether or not the incident has been contained, identify whether more eradication steps are necessary, and determine if the organization is ready to enter the recovery phase of incident response by bringing the website back online. This also limits the information available to law enforcement and could affect your cyber insurance. Any analysis that does happen must be performed manually and requires elevated permissions to the servers. Observability tools eliminate the need for elevated access to individual servers while providing centralized logging, monitoring, and tracing data.

This is why observability matters for cybersecurity. Observability data is the foundation of analysis.

What exactly is observability?

Observability includes three major categories. The first is logging, which collects and centralizes the system, network, security, and application logs from your environments. Usually, this is done with a lightweight agent like Promtail, Fluent Bit, or Elastic’s Beats. Log shipping is also highly customizable to your environment. If the logs are in a standard logging format, shipping is quick and easy to configure. If the log format is non-standard, the logs can still be shipped, but additional configuration is required to parse them into the expected format. Further configuration ensures that the important logs are shipped while any noise gets dropped.
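As a minimal sketch of the extra configuration a non-standard format needs, the Python below parses a pipe-delimited log line into named fields. The line layout, the field names, and the `parse_line` function are all assumptions made up for illustration. Agents such as Promtail or Fluent Bit express the same idea in their own parser or pipeline configuration rather than in Python.

```python
import json
import re
from typing import Optional

# Hypothetical non-standard format (an assumption for this example):
#   "2023-03-01 12:00:05|WARN|web-1|login failed for user=alice"
LINE_PATTERN = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\|"
    r"(?P<level>[A-Z]+)\|"
    r"(?P<host>[^|]+)\|"
    r"(?P<message>.*)$"
)


def parse_line(line: str) -> Optional[dict]:
    """Return the line as a dict of named fields, or None if it does not match."""
    match = LINE_PATTERN.match(line.strip())
    return match.groupdict() if match else None


record = parse_line("2023-03-01 12:00:05|WARN|web-1|login failed for user=alice")
print(json.dumps(record))
```

Lines that do not match return `None`, which gives the pipeline a clear point to decide whether to drop them or ship them unparsed.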

The next major category is tracing. The definition of tracing may vary, but the general gist is that tracing shows flow. This flow might be at the network or service mesh level (Istio, Traefik, etc.), or it might be at the application layer such as applications using OpenTelemetry standards to ship tracing data to tools like Jaeger. Grafana defines tracing as “a representation of a chain of events through a system.”

The final category is metrics. This includes information about resource usage and custom metrics generated by an application. With the rise in popularity of Prometheus, many applications like GitLab ship with metric endpoints already created and ready for ingestion—simply add the target metric endpoint to Prometheus’ scrape configurations.
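To make the idea of a metric endpoint more concrete, here is a rough sketch of the plain-text format that Prometheus scrapes. The `render_metric` helper and the sample metric are hypothetical and use only the standard library. A real application would more likely use an official Prometheus client library, which also handles escaping and other details this sketch ignores.

```python
def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs. Label values are
    assumed not to need escaping in this simplified sketch.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        label_text = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        series = f"{name}{{{label_text}}}" if label_text else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


text = render_metric(
    "http_requests_total",
    "Total HTTP requests handled.",
    "counter",
    [
        ({"method": "GET", "status": "200"}, 1027),
        ({"method": "POST", "status": "500"}, 3),
    ],
)
print(text)
```

Serving text like this over HTTP is what lets Prometheus pull the values once the endpoint is listed as a scrape target.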

Observability should include anything that enhances visibility into a system.

This is not limited to servers and should include additional information meaningful to your environment from sources like NetFlow, cloud services, or even operational technology.

Why is observability important?

The visibility provided by observability is crucial for decision-making. Developers and administrators use this information when evaluating application performance, tracking down bugs, or troubleshooting failures. Security engineers use these tools when tracking and analyzing anomalies or for threat hunting. The information provided aids in decision-making during incidents—both the “something is broken” kind of incident and the security kind of incident.

These tools simplify the path to identifying important information. Having observability tools readily available for everyone involved in troubleshooting or incident resolution saves the time and effort of tracking down logs, resources, or connection information on individual servers. These tools also provide historical data for comparison, rather than only current details. This is particularly useful when investigating resource usage issues or identifying bottlenecks, for instance. Centralized logging products like Elasticsearch, Splunk, and Grafana Loki, in particular, provide powerful search features.

Having observability tools is necessary for identifying a breach. Some hacking incidents, like website defacement, are obvious. Others are not so obvious. An attacker may sit in a system quietly observing and siphoning off data using backdoors and other obfuscation techniques. According to Mandiant, the median “dwell time” where an attacker is in a system undetected is 16 days. Even if an attacker is not caught in real time, the historical data provided by observability should make it possible to verify the activity after the fact. This historical data will include information about connections between systems (tracing), individual application and system activity (logging), and any unusual resource usage such as sudden spikes in memory or CPU (metrics).

In contrast to Mandiant’s dwell time statistics, IBM’s Cost of a Data Breach Report 2022 states that 207 days is the average amount of time for organizations to discover a data breach. Most entities that claim they have never been breached likely do not use observability tools. If there is little or no visibility, these entities cannot know for certain that there has not been a breach.

How is this different from security tools?

At this point, you may be wondering how observability is different from security tools like SIEM (security information and event management). The answer is that it is not entirely different. Observability acts as the foundation for security tools like SIEM. The security tools layer on functionality for correlating information and flagging anomalies. Some tools, like Elasticsearch and Splunk, offer both observability and SIEM functionality. Some organizations deploy separate agents for observability tools and SIEM. Others choose tools with both sets of functionality to limit the number of agents. Still others deploy only observability agents and then ship the information from the observability tools to their SIEM solution.

How do I get to a solution?

The first step in the pursuit of any new technical solution should be to identify your requirements. Here are some examples:

The observability solution:

  • Shall use a single pane for displaying information
  • Shall run well in K8s
  • Shall integrate with the cloud
  • Shall collect data from on-premises systems
  • Shall meet compliance requirements for FedRAMP (or other required compliance or cybersecurity frameworks like CMMC, RMF, HITRUST, or ISO 27001)
  • Shall encrypt data at rest and in transit

Next, evaluate a variety of products. Consider both commercial and open-source solutions and be open to mixing and matching stacks to find the best working solution for your set of requirements and budget. For example, a common stack for log collection is Elasticsearch, Logstash, and Kibana (ELK), but if it is run in Kubernetes, it often becomes Elasticsearch, Fluent Bit, and Kibana (EFK). Likewise, Promtail in the Promtail, Loki, and Grafana (PLG) stack can be swapped out for other log collectors. Another consideration is leveraging powerful cloud tools such as AWS CloudWatch and CloudTrail.

After evaluating products, continue to make architectural decisions that meet your functional and non-functional requirements. This will include decisions like storage location, firewall rule strategies, and access control for any web portals. This will also include decisions about data collection strategies, such as whether to use a push model, a pull model, or a combination of both for metrics collection.

Next, build out your metrics, logging, and tracing services. This will include building new servers or services for your new tools, while also altering settings for currently running servers and applications to ship metrics, traces, and logs in meaningful and useful formats.

After building, you will need to manage storage resources with retention policies and iterate on configurations to filter out what is unimportant. This can be done by leveraging scrape configurations or log pipelines, updating metric endpoints, and customizing tracing configurations to keep data under control. 
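One way tracing configurations keep data under control is sampling, where only a fraction of traces is kept. The sketch below shows one possible approach: hashing the trace ID so the same trace always gets the same decision. The `keep_trace` function, the ID format, and the 10% rate are assumptions for illustration, not the behavior of any particular tracing tool.

```python
import hashlib


def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Keep roughly `sample_rate` (0.0 to 1.0) of traces.

    Hashing the ID makes the decision deterministic, so every service
    that sees the same trace ID makes the same choice.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < sample_rate


trace_ids = [f"trace-{n}" for n in range(10_000)]  # made-up IDs
kept = [t for t in trace_ids if keep_trace(t, 0.10)]
print(f"kept {len(kept)} of {len(trace_ids)} traces")
```

A lower rate saves storage but makes it less likely that the one trace you need during an investigation was kept, so the rate is worth revisiting as requirements change.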

Once your data is in a meaningful format, leverage it by creating dashboards that display useful data. This requires finding a balance between information overload and too little data. You can also add monitoring and automated alerting for events like high CPU usage or specific log messages. This can be taken a step further by leveraging AI/ML algorithms for anomaly detection and correlation (check out Oteemo’s AI/ML services). The higher the volume of data, the more important and useful monitoring and automated alerting become for daily operations.
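As a simple illustration of automated alerting on high CPU usage, the sketch below fires only when usage stays above a threshold for several consecutive samples, so a single brief spike does not page anyone. The `should_alert` function, the 90% threshold, and the sample values are hypothetical. Real monitoring tools express this kind of rule in their own alerting configuration.

```python
from typing import Sequence


def should_alert(cpu_samples: Sequence[float], threshold: float, min_consecutive: int) -> bool:
    """Return True if the most recent `min_consecutive` samples all exceed `threshold`."""
    if min_consecutive < 1:
        raise ValueError("min_consecutive must be at least 1")
    if len(cpu_samples) < min_consecutive:
        return False
    return all(sample > threshold for sample in cpu_samples[-min_consecutive:])


brief_spike = [40.0, 95.0, 42.0, 41.0]
sustained = [40.0, 95.0, 42.0, 96.0, 97.0, 98.0]

print(should_alert(brief_spike, threshold=90.0, min_consecutive=3))
print(should_alert(sustained, threshold=90.0, min_consecutive=3))
```

Requiring a sustained condition is one common way to trade a little detection delay for fewer false alarms.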

Finally, iterate continuously. Observability takes constant maintenance to remain useful and meaningful. These systems cannot be set up and left alone. They must be maintained.

Where will there be challenges with observability tools?

The biggest challenge in observability is collecting the necessary data at the right time and filtering out the noise. Here are some tips:

  • Use scrape configs or data pipelines to drop any log lines that are empty, filter out other unimportant data, and mask any secrets or tokens
  • Turn on debug logging in applications only for troubleshooting, then immediately turn it back off
  • For prebuilt metrics, consider creating custom dashboards that only display the data points that are important to you
  • For custom metrics, only write metrics you are positive you need. Leverage experts on your teams to define which metrics are important.
  • Adjust sampling settings to filter out unwanted traces and specify how often traces should be sampled
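The first tip above can be sketched as a small filtering step. The Python below drops empty lines and debug noise, then masks values that look like secrets. The `clean_logs` function, the key names in `SECRET_PATTERN`, and the sample lines are assumptions for illustration. In practice this logic usually lives in the log agent's pipeline configuration.

```python
import re
from typing import Iterable, Iterator

# Hypothetical key names; adjust to the secrets that can appear in your logs.
SECRET_PATTERN = re.compile(r"\b(token|password|api_key)=\S+", re.IGNORECASE)


def clean_logs(lines: Iterable[str]) -> Iterator[str]:
    """Drop empty and DEBUG lines, and mask secret values in the rest."""
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # drop empty lines
        if " DEBUG " in f" {stripped} ":
            continue  # drop debug noise
        yield SECRET_PATTERN.sub(r"\1=***", stripped)


raw = [
    "",
    "   ",
    "DEBUG cache miss for key=user:42",
    "INFO login ok token=abc123",
    "ERROR db timeout after 30s",
]
cleaned = list(clean_logs(raw))
for line in cleaned:
    print(line)
```

Masking before shipping matters because anything that reaches centralized storage becomes searchable by everyone with access to the log stack.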

Another common challenge is storage space and available server resources. Retention policies and storage space are especially important for log collection. Each log solution handles index management of time series data differently. Setting data retention policies and ensuring storage space is available are crucial to success. Define a strategy for meeting any compliance requirements and moving data between hot and cold storage.
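A retention strategy can be thought of as a rule that maps the age of data to a storage tier. The sketch below is a minimal example of that rule. The `storage_tier` function, the 30-day hot window, and the 365-day retention period are assumptions for illustration. Your actual values should come from your compliance requirements, and most log solutions provide their own lifecycle management features to enforce them.

```python
from datetime import date, timedelta


def storage_tier(log_day: date, today: date, hot_days: int, retention_days: int) -> str:
    """Classify a day's worth of logs as 'hot', 'cold', or 'delete' by age."""
    age_days = (today - log_day).days
    if age_days > retention_days:
        return "delete"
    if age_days > hot_days:
        return "cold"
    return "hot"


today = date(2023, 6, 30)
for days_old in (3, 45, 400):
    day = today - timedelta(days=days_old)
    print(day, storage_tier(day, today, hot_days=30, retention_days=365))
```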

Grafana Loki, for example, works most efficiently with larger chunks of data stored together. It divides data into streams based on unique combinations of labels. The more labels that logs are given, the more streams there are and the smaller the chunks of data become, resulting in a larger index and many small chunks that slow down queries. In contrast, Elasticsearch best practice is to keep individual shards relatively small, with Elastic's guidance suggesting roughly 10GB to 50GB per shard. Shards much larger than that can affect performance. Elastic also replicates data, which means more storage space is required based on your replication needs.
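To see why labels matter so much in a label-indexed store like Loki, the sketch below counts how many unique label combinations the same 1,000 log entries produce with and without a high-cardinality label. It also shows the basic arithmetic for replicated storage. The `count_streams` and `total_storage_gb` helpers, the label names, and the sizes are all hypothetical.

```python
from collections import Counter


def count_streams(entries, label_names):
    """Count log entries per unique combination of the chosen labels."""
    return Counter(tuple(entry[name] for name in label_names) for entry in entries)


def total_storage_gb(primary_gb: float, replicas: int) -> float:
    """Storage needed when each primary copy has `replicas` extra copies."""
    return primary_gb * (1 + replicas)


# 1,000 made-up entries from one application in one environment.
entries = [{"app": "web", "env": "prod", "request_id": f"req-{n}"} for n in range(1000)]

low = count_streams(entries, ["app", "env"])
high = count_streams(entries, ["app", "env", "request_id"])
print(len(low), "stream(s) when labeled by app and env")
print(len(high), "stream(s) after adding request_id as a label")
print(total_storage_gb(200, replicas=1), "GB needed for 200 GB of data with one replica")
```

With `request_id` as a label, every entry lands in its own stream, which is the pattern that produces many tiny chunks. Values like request IDs are usually better left in the log line and searched at query time.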

A final challenge is the reliable transport of information between services in the observability stack. If network connectivity between the collector and the centralized storage is interrupted (during updates, for example) or slow, data can be lost. Some tools in an observability stack will keep a local buffer and clear it out once connectivity returns to normal, but if the buffer fills up before connectivity is restored, any additional data that would have been added to the buffer will be lost. Where possible, configure collector buffers to be large enough to hold the amount of data generated during a typical outage of the connection between the collector and the ingestor.
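The buffer sizing advice comes down to simple arithmetic: data rate multiplied by expected outage length, plus some headroom. The sketch below shows that calculation and a toy bounded buffer that drops new entries once full, mirroring the failure mode described above. The `required_buffer_bytes` function, the `BoundedBuffer` class, the 1.5x headroom, and the example rates are assumptions for illustration. Real collectors have their own buffer settings and overflow behavior.

```python
from collections import deque


def required_buffer_bytes(bytes_per_second: float, outage_seconds: float, headroom: float = 1.5) -> int:
    """Estimate buffer size as data rate x outage length x a safety margin."""
    return int(bytes_per_second * outage_seconds * headroom)


class BoundedBuffer:
    """Toy collector buffer that loses new entries once it is full."""

    def __init__(self, max_entries: int):
        self.entries = deque()
        self.max_entries = max_entries
        self.dropped = 0

    def add(self, entry: str) -> bool:
        if len(self.entries) >= self.max_entries:
            self.dropped += 1  # buffer full: this entry is lost
            return False
        self.entries.append(entry)
        return True

    def flush(self) -> list:
        """Return buffered entries for shipping and empty the buffer."""
        sent = list(self.entries)
        self.entries.clear()
        return sent


# 50 KB/s of logs during a 10-minute outage, with 1.5x headroom.
print(required_buffer_bytes(50_000, 600))

buffer = BoundedBuffer(max_entries=3)
for n in range(5):  # simulate 5 log lines arriving while the network is down
    buffer.add(f"line-{n}")
print("dropped:", buffer.dropped)
print("shipped after reconnect:", buffer.flush())
```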

Imagine you are a cybersecurity incident responder with observability tools.

Now that we have an understanding of what observability is and the basic steps and challenges with implementation, let’s return to the beginning scenario. Your incident response team had very little available information regarding a website defacement, but now we’re going to switch to a situation where you have observability tools available. 

Imagine the same scenario where your subsidiary’s website has been defaced. A well-meaning team member has still taken down the Kubernetes pods running the website, making them unavailable, but your incident response team has access to an observability stack containing logs, traces, and metrics. 

You start the analysis process by looking at the logs for the web service in your log stack’s web interface. You use powerful search features to identify the specific activity, including an IP address and exact time, of the exploit that allowed the attacker to deface your subsidiary’s website. You do some more digging and identify indications of further compromise in the pod logs. These suggest the attacker was able not only to gain access to the web service pods, but also to escape the container into the node operating system by exploiting an old, unpatched Docker vulnerability. You are able to correlate this information with traces showing anomalous traffic and metrics showing high resource usage where the attacker appears to be siphoning off data. Through your analysis, you are now able to provide law enforcement with up-to-date information and to proceed through the incident response steps of containment, eradication, and recovery while using the observability data for decision-making.

In the end, your website is back up and running with a fix for the initial exploit, the operations team has patched Docker and server versions, and you are continuing to work with the subsidiary to implement additional network segmentation to prevent pivoting to other critical systems. The rest of the investigation is now in the hands of law enforcement. The IP address belongs to a commonly used VPN, but law enforcement is able to take that information and continue investigating the cybercrime.

Observability tools provide much-needed information not only for developers, administrators, and anyone in a network operations center but also for security teams responding to incidents and making day-to-day security decisions. At Oteemo we specialize in architecting and implementing flexible and robust observability stacks that form the foundation of a comprehensive cybersecurity toolset. Get in touch today to find out how we can help strengthen your security posture by adding or refining observability.