The realities of security monitoring and the promise of SIEM?
In enterprise IT, data is collected from any number of IT and security devices and then used to monitor, protect, understand and manage our technology-enabled businesses. Because of the ever-expanding attack surface, the amount of data collected today is unmanageable, and ironically, we have only a very general idea of what value it should provide. One of the most common objections I hear when talking about security operations is, “I have 100 or more data sources to monitor.” What makes this statement a myth is not that we collect data from more than 100 data sources, but how we use that data in our security operations programs and how it gets reported to executive management.
What’s Your Monitoring Effectiveness?
Having spent decades directly managing security operations teams that collect mountains of data, I find it evident that they barely leverage any of it for security monitoring and detection purposes. If you have 100 data sources reporting into your Security Information and Event Management (SIEM) system, but you only have rule logic applied to 5-10 of them (and then maybe only one or two rules for most), what is your actual monitoring effectiveness? To understand this more deeply, we’ll review the uses for collected data and which data applies to each use within our security programs.
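One rough way to put a number on that question is to count how many collected sources have any rule logic at all. The sketch below is only an illustration: the `DataSource` type, the `monitoring_effectiveness` function and the inventory figures are hypothetical, not taken from any SIEM product.

```python
from dataclasses import dataclass


@dataclass
class DataSource:
    name: str
    rule_count: int  # detection rules that actually reference this source


def monitoring_effectiveness(sources: list[DataSource]) -> float:
    """Fraction of collected sources that have at least one detection rule."""
    if not sources:
        return 0.0
    covered = sum(1 for s in sources if s.rule_count > 0)
    return covered / len(sources)


# Hypothetical inventory: 100 sources collected, only 8 with any rule logic.
inventory = [
    DataSource(f"source-{i}", rule_count=2 if i < 8 else 0) for i in range(100)
]
print(f"{monitoring_effectiveness(inventory):.0%} of sources have rule logic")
```

A fuller measure would also weigh how many rules each source has and how often they fire, but even this simple ratio makes the gap between "collected" and "monitored" visible.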
“I have seen the largest, most sophisticated companies on the planet report the fact they collect 100+ data sources but fail to explain that only 3-5 of them have any form of rule logic or other use that can support the cost of collecting them all.”
How Enterprise Data is Used in the Security Environment
The diversity of users needing access to this information presents a key friction point for data aggregation and collection. This is very often seen in shared Splunk instances. Previously, business intelligence strategies called for a data warehouse that collected everything, with datamarts then created for each user’s specific information requirements. Today, big data and data-bus solutions have solved this issue, but they are rarely adopted and deployed with all users in mind. IT operations and IT security present the highest conflict situation because they are the most likely to be on a shared infrastructure with very different business requirements. The team responsible for managing the SIEM (or big data solution) and collecting data has the mission to collect everything, since more is better, right? Often the operations team does not have sufficient authority to get engineering to focus narrowly on what matters to them for detection and response. This is the tail wagging the dog!
Monitoring & Analysis
Looking at data in time sequence, from the moment it is first collected until it is discarded at the end of its information lifecycle, we generally start with analytical monitoring in as close to real time as possible. Within monitoring, we pay attention to performance, health status, availability, configuration, and security. Each of these requires different fields from the logs produced. They are analyzed entirely differently, and all are high-volume, low-signal monitoring problems.
Typically, the security signal comes from malicious signatures, traffic patterns or behaviors. However, another theory holds that anomalies are more likely to be malicious than signatures. I do not agree with this theory; I simply add anomalies to the mix along with signatures and all other potential indicators of malicious activity. These have historically proven extremely difficult for humans to monitor effectively at scale, which is why automation is the more important requirement.
Reporting & Metrics
One of the next uses in the lifecycle for the data we collect is reporting and metrics. These are designed to answer questions of governance, risk and compliance (GRC), operations analysis, and the overall effectiveness and efficiency of the business infrastructure. They rarely do. The common top 10 report is an example of “data” without “information.”
There is a ubiquitous statement attributed to W. Edwards Deming: “You get what you measure.” A key challenge in operational security is the inability to secure sufficient budget to support security operations adequately. In these cases, many of the metrics reported are used for budget justification rather than for measuring effectiveness and efficiency. It is tough to prove a “cost avoidance ROI.”
The lack of nuance and the naked agendas that are so common in reporting and metrics result in the “Myth of 100 Data Sources.” I have seen the largest, most sophisticated companies on the planet report the fact they cover 100+ data sources but fail to explain that only 3-5 of them have any form of logic or use that can support the cost of collecting them all. The reasoning that “we might use it someday, forensically,” is a feeble and expensive justification.
Hunting is a relatively new security discipline, and it is in position to overtake monitoring rapidly. Hunting requires a longer and more in-depth set of data. It includes tasks such as identifying hyper-current attack methods, locating the appropriate data sources, doing pivot-and-search from suspicious sources, looking at specific slices of time where enterprise activity was deemed suspicious, and visualizing large amounts of data. In the security operations centers I have been involved in building, we always dedicated a four-hour block of time on Friday to conduct a retrospective hunt over the last week. In those hunts we discovered that, for the most part, more incidents were missed than detected during 24×7 operational monitoring. The highest quality incidents were found during the retrospective hunt.
Once we’ve identified that a security situation exists, some data becomes useful for forensic analysis. This includes determining the extent of an intrusion, performing root cause analysis, supporting or confirming the conclusions of incident response teams, and supporting investigations as requested by the corporate legal department. Where data is likely to be presented for forensic purposes in a court of law, you must be able to establish that it was either collected as a business record or has been maintained in a forensically approved data store to verify that it could not have been maliciously modified.
One of the most critical applications of IT data, especially as data volumes scale beyond human management, is process automation. The types of automation that are appropriate include things like the application of logic and algorithms to identify specific issues within the data, the addition of context, business process intelligence for operational improvement and the automation/orchestration of appropriate response action based on security situations detected.
There are many data collection automation opportunities. However, one thing still holds: there is a ton of useless data that needs sifting through when automating from raw telemetry. This fact is one of the guiding principles of what I call the “small data” movement, as opposed to the big data movement, where you find an actual use for the data before you collect and store a ton of it.
The data sources we collect provide visibility across many different technical perspectives. There is network telemetry, security telemetry, application telemetry, host-based telemetry, cloud telemetry, and contextual telemetry, to start. Within each of these categories, we have infrastructure devices that produce logs and sensors that monitor, whether for operations or security.
Even within a category, each vendor or specific implementation contains radically different information in its log files or alert stream. While there are common logging formats and methods, there is very little agreement or commonality in the information they contain for any specific purpose. Logs are highly inconsistent and hard to leverage effectively for any purpose.
Let’s walk through a high-level list of the types of sources we’re talking about:
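To make the inconsistency concrete, here is a minimal sketch of what mapping two differently shaped records onto a shared set of field names might look like. The source types, vendor keys and common field names are all hypothetical and chosen only for illustration; they do not follow any particular vendor format or published schema.

```python
# Hypothetical mappings: vendor-specific key -> common field name.
FIELD_MAPS = {
    "vendor_a_firewall": {"src": "source_ip", "dst": "dest_ip", "act": "action"},
    "vendor_b_proxy": {
        "client_addr": "source_ip",
        "server_addr": "dest_ip",
        "verdict": "action",
    },
}


def normalize(source_type: str, raw: dict) -> dict:
    """Keep only the fields we have a use for, renamed to common names."""
    mapping = FIELD_MAPS[source_type]
    return {common: raw[vendor] for vendor, common in mapping.items() if vendor in raw}


# Two records describing the same event in different vendor vocabularies.
fw = normalize(
    "vendor_a_firewall",
    {"src": "10.0.0.5", "dst": "203.0.113.9", "act": "deny", "rule_id": "42"},
)
px = normalize(
    "vendor_b_proxy",
    {"client_addr": "10.0.0.5", "server_addr": "203.0.113.9", "verdict": "deny"},
)
print(fw == px)  # True: both reduce to the same three common fields
```

In practice, every new source needs its own mapping, which is part of why the cost of each additional source is higher than it first appears.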
Network sources include routers, switches, load balancers, and other LAN/WAN network infrastructure equipment. These sources provide visibility into what is transiting your network. Inbound, lateral and outbound are all interesting viewpoints, but these devices mostly provide performance and availability information and are highly repetitive in their log messages. These sources also include network-specific sensors, such as deep packet inspection and network flow collection, which are used for anomaly detection, performance monitoring, and forensic review.
With security devices, there are some blended network infrastructure technologies like firewalls and proxies, and also dedicated security infrastructure like Identity and Access Management systems or Intrusion Prevention sensors. Security focuses on sensors that detect potential attack signatures and behaviors at various chokepoints and critical nodes. These sensors are looking for signatures or anomalies that might indicate suspicious activity in the form of malicious hackers or code. These also have a high noise-to-signal ratio and are hard to analyze.
Application telemetry is the least mature and has the highest attack rate. This runs the gamut from web server logs to user-experience application monitoring for custom e-commerce applications. Application telemetry is highly focused on performance and availability, and almost wholly ignores all but very basic security uses. This makes it almost useless for detection monitoring. The main exception is the authentication of users, but that is a shallow set of use cases.
Host-based telemetry also has multiple categories. The native operating system has defensive and logging facilities that can be monitored, though the volume is extremely high with only a tiny number of events indicating maliciousness. There are also host-based agents, ranging from Next-Generation Antivirus (NGAV), Endpoint Protection Platforms (EPP) and Endpoint Detection and Response (EDR) to simple IT operations instrumentation.
With cloud as the newest critical component in our business IT strategy, it also has most of its focus on performance and availability rather than security controls. There are still minimal use cases detected through monitoring in cloud tools. The primary argument is that much of the traditional security is handled by the cloud provider and invisible to the user. This situation led to the development of the Cloud Access Security Broker (CASB), so the cloud user could demonstrate compliance and security requirements without resorting to the provider’s controls. Native cloud telemetry is very shallow at the moment with CASBs filling the gap.
Context is a category that many people fail to think about thoroughly. Maintaining a historical record of context is essential to identifying impacted assets that are transient on the network. For example, what IP address did an offending host have at any given point in time? Can I map back to the hostname of that asset, and how far back in time can I go to locate a malicious incident on a highly mobile endpoint? We commonly refer to this use as event alignment. The same problem applies to users, and all of that contextual information is critical to making informed decisions. The criticality of every entity in the enterprise is also key to business-prioritized risk decisions.
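As an illustration of event alignment, the sketch below keeps a time-ordered record of which host held an IP address and answers the question "who had this address at that moment?" The `LeaseHistory` class and the sample hosts are hypothetical, and the sketch simplifies by assuming each lease lasts until the next one for the same IP begins.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import datetime


class LeaseHistory:
    """Keeps a time-ordered record of which host held each IP address."""

    def __init__(self):
        self._starts = defaultdict(list)  # ip -> lease start times, ascending
        self._hosts = defaultdict(list)   # ip -> hostnames, in the same order

    def record(self, ip, hostname, start):
        # Assumes leases for a given IP are recorded in chronological order.
        self._starts[ip].append(start)
        self._hosts[ip].append(hostname)

    def host_at(self, ip, when):
        """Return the hostname that held `ip` at time `when`, or None."""
        starts = self._starts.get(ip, [])
        idx = bisect_right(starts, when) - 1  # last lease starting at or before `when`
        return self._hosts[ip][idx] if idx >= 0 else None


history = LeaseHistory()
history.record("10.0.0.5", "laptop-a", datetime(2024, 1, 1, 8, 0))
history.record("10.0.0.5", "laptop-b", datetime(2024, 1, 1, 13, 0))

# An alert fired for 10.0.0.5 at 09:30; which asset was it?
print(history.host_at("10.0.0.5", datetime(2024, 1, 1, 9, 30)))  # laptop-a
```

How far back such a lookup can reach depends entirely on how long the context history is retained, which is why this record needs its own retention decision rather than inheriting the one used for raw logs.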
Not surprisingly, data sources are all narcissistic. They only talk about themselves and often repeat things that are not valuable. Routers are notorious for talking about route flapping: route-up, route-down, route-flap. If no one cares, why do we collect it? Unfortunately, an agreed-upon set of informational fields across data categories, vendors and applications does not exist. If it did, we could derive much higher value from all this data.
Moral of the story: don’t get sucked into the myth that collecting 100 data sources into a single data platform is providing real value for operational security or any of the other applications. The reality is that these efforts are not providing much value at all and are most likely increasing your cost and volume without reducing your risk significantly. Collect only what matters and focus on deriving deeper value from it via automation. And remember, humans should not monitor consoles for high-noise, low-signal use cases.
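One way to test whether a source is mostly repeating itself is to measure how much of its volume comes from a handful of message types. The following is a rough sketch under simple assumptions: `repetition_ratio` is a hypothetical helper, the sample messages are invented, and real logs would first need variable parts such as timestamps and interface names stripped out.

```python
from collections import Counter


def repetition_ratio(messages, top_n=3):
    """Share of total volume produced by the `top_n` most frequent messages."""
    if not messages:
        return 0.0
    counts = Counter(messages)
    top_volume = sum(count for _, count in counts.most_common(top_n))
    return top_volume / len(messages)


# Invented sample: a router that mostly reports route state changes.
router_log = (
    ["route-up"] * 40
    + ["route-down"] * 40
    + ["route-flap"] * 15
    + ["config changed by admin"] * 5
)
print(f"{repetition_ratio(router_log):.0%} of volume is three message types")
```

A source where a few message types dominate the volume and none of them feeds a rule, report or hunt is a strong candidate for filtering at collection time.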