Reducing Your Splunk License Usage (Trimming the Fat)

In a perfect world, we would each be able to collect every byte of data from all of our systems and store it forever on free infrastructure that requires no maintenance, is easily accessible, and is completely secure. Unfortunately, here on Planet Earth we are commonly forced to optimize our systems along a cost-benefit curve, proving that the value of the data we collect is worth more than the cost of the infrastructure, licensing, and maintenance. For most organizations, Splunk has no trouble demonstrating its worth year after year, but even your local office Splunk addict wants to be sure they’re getting the most bang for their buck before asking for a license increase at the yearly budget meeting. According to Splunk’s pricing guide, with a 100 GB/day term license you’re paying $600/year per GB/day of ingestion (about $60,000 per year in total), not to mention the cost of the infrastructure required to manage this load.

Before starting a discussion about renewing an oversized Splunk license or purchasing additional license capacity, take the time to verify you’re making the most of the license you have. In this article, we’ll explore ways you can trim the fat from your Splunk license consumption and better align your Splunk usage with your organization’s strategic goals.

Where is all this data coming from?

The first step in reducing your license consumption is understanding the sources of your daily data ingestion. The best way to begin analyzing this is to look at license consumption by “sourcetype”. “sourcetype” is a required metadata field that Splunk attaches to each event. When properly configured, it describes the format of the data (e.g. “syslog”, “WinEventLog:Security”, “cisco:asa”) and in most cases gives you a rough idea of where that data comes from and what it contains.

Splunk core provides a very robust and practical dashboard out-of-the-box (Log in to your license master as an admin > Settings > Licensing > Usage Report > Previous 30 Days > Split by: Source type), or you can run your own query, something like this:

index=_internal source=*license_usage.log type="Usage" earliest=-14d@d latest=@d
| fields _time, st, b
| bin _time span=1d
| stats sum(b) as b by _time, st
| stats avg(b) as avg_b max(b) as max_b by st
| eval avg_GB=round(avg_b/1024/1024/1024, 3)
| eval max_GB=round(max_b/1024/1024/1024, 3)
| fields st, avg_GB, max_GB
| sort - avg_GB, max_GB

Closely examine your largest sourcetypes: anything consuming more than 1% of your license quota on average per day (for a 300 GB/day license, that is 3 GB/day or more). Can you explain the contents of each of these sourcetypes and the reason you’re ingesting them? If not, consider making a simple Excel sheet listing your largest sourcetypes, investigating them one by one in Splunk, and making a note of their contents (e.g. “wireless network traffic”, “SSO access logs”, “application crash logs”). Be as specific as possible without overdoing it: if you can identify the specific product, data center, location, or responsible team, do so. Completing this may require a little research, but if the research starts to take more than a few minutes, move on and keep reading. Side note: In my own environment I’ve gone so far as to write out descriptions for every sourcetype, save them as a lookup in Splunk, and create a dashboard with details on every sourcetype and index for other users to reference.
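The 1% rule of thumb can be applied directly in a search. The following is a rough sketch rather than a definitive report: it assumes the 300 GB/day quota from the example (set quota_GB to your own licensed volume), and the field names quota_GB and pct_of_quota are made up for illustration.

```spl
index=_internal source=*license_usage.log type="Usage" earliest=-14d@d latest=@d
| bin _time span=1d
| stats sum(b) as b by _time, st
| stats avg(b) as avg_b by st
| eval avg_GB=round(avg_b/1024/1024/1024, 3)
| eval quota_GB=300
| eval pct_of_quota=round(100 * avg_GB / quota_GB, 2)
| where pct_of_quota >= 1
| sort - pct_of_quota
```

The sourcetypes that survive the where clause are the ones worth listing in your spreadsheet first.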

At this point you may find 99% (or some other large majority) of your license consumption is assigned to a single sourcetype. If that’s the case, pivot your investigation on another Splunk metadata field such as host, source, or index – is one subset of hosts, sources, or indexes accumulating more data than others? Try something like:

By host:

index=_internal source=*license_usage.log type="Usage" st="<your_large_sourcetype>"
| stats sum(b) by h

By source:

index=_internal source=*license_usage.log type="Usage" st="<your_large_sourcetype>"
| stats sum(b) by s

By index:

index=_internal source=*license_usage.log type="Usage" st="<your_large_sourcetype>"
| stats sum(b) by idx
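The raw byte totals from these searches can be hard to read. As a hedged variation on the host breakdown, the sketch below adds an explicit time range, converts bytes to GB, and keeps only the top 20 hosts. The 7-day window and the limit of 20 are arbitrary choices, not recommendations.

```spl
index=_internal source=*license_usage.log type="Usage" st="<your_large_sourcetype>" earliest=-7d@d latest=@d
| stats sum(b) as b by h
| eval GB=round(b/1024/1024/1024, 3)
| sort - GB
| head 20
```

Swap h for s or idx to get the same view by source or index. One caveat: in environments with a very large number of distinct hosts and sources, Splunk may leave the h and s fields blank in license_usage.log to keep the log small, in which case the index breakdown is likely to be the more reliable of the three.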

How did I get all this data?

Now that you have an idea of where all that license goes every day, you may be wondering how Splunk began ingesting some of this data in the first place. The most common cause of unintended ingestion is Splunk administrators neglecting to fully vet apps and add-ons (TAs, etc.) downloaded from Splunkbase during deployment.

The best and most common example of this is with the Splunk add-ons for Linux (*Nix) and Windows. A Splunk administrator downloading these add-ons for the first time would open the inputs configuration files (inputs.conf) and find literally dozens of inputs to gather data on system performance, security, and change monitoring. Having never used these add-ons before, that administrator would likely start by enabling ALL of these inputs and deploying the app to collect data. Depending on your environment this could be (and usually is) a TON of data.

A good Splunk administrator (or one who has any time on their hands at all) will review all the new sourcetypes coming in and trim any unneeded inputs. A bad Splunk administrator (or one of the many who have no time on their hands) may just verify the data is flowing in as expected and move on. If a license warning is flagged that night for going over the daily quota, the bad Splunk administrator may review the data they have collected and trim the enabled inputs down to something more manageable, but otherwise the data will continue to be collected, unnoticed. I should note here that these particular add-ons are some of the most useful available on Splunkbase, and they are not the only apps or add-ons guilty of luring Splunk administrators into collecting unnecessary data. Without careful attention before, during, and after deployment, however, these apps and add-ons can greatly increase your daily ingest.

How do I start reducing my Splunk license usage?

The most logical approach would be to assign a value to each data set (sourcetype) and start cutting these sets from least to most valuable. However, assigning a specific, meaningful value to each data set can be very challenging and needlessly time-consuming. A more practical method would look like the following:

  • Eliminate completely useless data
  • Focus on your key use case
  • Tune your key data sources

Eliminate completely useless data

Believe it or not, there is most likely some data being ingested into your Splunk environment today that serves no purpose whatsoever and in some cases is actually throwing off your metrics. The most common example of this is duplicate data. Many applications rotate logs on a schedule, and if Splunk is not configured to monitor these logs in a way that matches your rotation scheme, Splunk may re-ingest logs after they rotate. The Splunk add-on for Linux (*Nix) contains a monitor input for the /etc directory that triggers a re-ingest when a file’s modtime is changed. Some applications may change this modtime frequently without actually modifying the file, causing Splunk to ingest the same file thousands of times.
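One rough way to look for duplicate data is to hash each raw event and count repeats. This is only a sketch: <your_index> and <your_sourcetype> are placeholders, raw_hash is a field name invented here, and identical events are not always duplicates (a heartbeat message, for instance, may legitimately repeat), so treat the results as leads to investigate rather than proof.

```spl
index=<your_index> sourcetype="<your_sourcetype>" earliest=-24h
| eval raw_hash=md5(_raw)
| stats count values(source) as sources by raw_hash, host
| where count > 1
| sort - count
| head 50
```

A high count spread across several rotated file names in the sources column is a reasonable hint that rotated logs are being re-ingested.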

Another common problem is meaningless data. Some legitimate data sources may occasionally send events to Splunk that provide little or no meaning (e.g. empty messages, encrypted messages, overly long messages). These events can generally be filtered out before indexing with a combination of props.conf, regular expressions defined in transforms.conf (typically routing matching events to the nullQueue), and a little bit of CPU on the indexer. Longer messages can be trimmed down using the TRUNCATE parameter in props.conf.
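Before writing filters or choosing a TRUNCATE value, it can help to see how event sizes are actually distributed. The sketch below is one possible starting point; the placeholders, the field names, and the 20-character cutoff for "nearly empty" events are all assumptions made for illustration. Note that len() counts characters, which only approximates bytes.

```spl
index=<your_index> sourcetype="<your_sourcetype>" earliest=-24h
| eval event_chars=len(_raw)
| stats count avg(event_chars) as avg_chars perc95(event_chars) as p95_chars max(event_chars) as max_chars count(eval(event_chars < 20)) as near_empty_events by source
| sort - max_chars
```

A max_chars far above p95_chars suggests a handful of oversized events that truncation could trim, while a large near_empty_events count points toward filtering.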

Focus on your key use case

Once you are sure you’ve cut all completely useless data, take a moment to consider how Splunk is used in your organization. If Splunk was bought to serve as a SIEM tool for your SOC, do you really need to ingest server performance statistics from perfmon and top? If Splunk was bought to serve as your IT operations log management tool, do you really need to be ingesting firewall logs or access logs from your authentication stack? If Splunk was bought to serve the Splunk app developers, do you really need to be monitoring your internal domain controller infrastructure? You get the idea.

It may be the case that Splunk was purchased as an enterprise security tool (for example) but because server performance logs were being ingested, the IT operations team has come to rely on Splunk. If that is the case, consider addressing this in the next budget cycle – maybe it’s time the IT operations team starts paying their share of the license cost. In any case, ideally you can avoid cutting any data sources that are actually being leveraged by your users.

Tune your key data sources

Now that you’ve thinned out the herd to your must-have data sources, you can begin tuning these data sources to your must-have events. Splunk provides out-of-the-box methods for filtering events and blacklisting individual files or directories, but most tuning will likely take place at the log source.

One of the best examples of this involves Windows Security Event Logs. The amount of data generated by your Windows servers can vary greatly depending on the configured audit policies. Some audit policy settings will actually cause a Windows Server to log every allow and deny from the local Windows firewall, easily generating gigabytes of data per day per server. The best way to address this is to find a reasonable audit policy that works for your organization’s needs and deploy it using group policy across your environment.
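To see which Windows events dominate your volume, a simple count by event code is often enough. This sketch assumes the EventCode field is extracted (as it is when the Splunk add-on for Windows is installed) and uses <your_windows_index> as a placeholder; pct is a field name made up here.

```spl
index=<your_windows_index> sourcetype="WinEventLog:Security" earliest=-24h
| stats count by EventCode
| eventstats sum(count) as total
| eval pct=round(100 * count / total, 1)
| sort - count
| head 20
```

If event codes such as 5156 and 5157 (Windows Filtering Platform connection permitted and blocked) sit at the top of the list, the firewall auditing described above is the likely culprit.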

Another good candidate for tuning is your syslog feeds. Many network devices and applications have adopted syslog as a standard logging mechanism. These same devices and applications will often allow administrators to select a log level (severity), which sets the maximum level of detail to log. As a rule of thumb, “INFO” provides useful logging without being overly verbose, but some use cases require “DEBUG” and others need no more detail than “ERROR”. Consider your use cases: do you really need “DEBUG” level logging on all of your wireless controllers?
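If your syslog data has no parsed severity field, a crude keyword match can still show which hosts are sending the most verbose logs. The sketch below assumes the level appears as a plain word in the raw message, which is not true of every device; the placeholders and the log_level and total_chars field names are illustrative only.

```spl
index=<your_index> sourcetype="<your_syslog_sourcetype>" earliest=-24h
| rex field=_raw "(?i)\b(?<log_level>debug|info|notice|warn|warning|error|crit|critical)\b"
| eval log_level=upper(log_level)
| fillnull value="UNKNOWN" log_level
| eval event_chars=len(_raw)
| stats count sum(event_chars) as total_chars by host, log_level
| sort - total_chars
```

Hosts with a large DEBUG total are good candidates for a conversation with the team that owns the device.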

Also consider reviewing your data collection intervals. Do you really need to poll Perfmon every 10 seconds, or would 5 minutes suffice? Five minutes is 300 seconds, so that one change collects 30 times fewer samples, roughly a 30x reduction in license usage for that data set.

Is a misconfiguration somewhere in your IT environment generating a lot of noise in your logs? A broadcast storm could send your license usage through the roof overnight. A disabled account repeatedly attempting login via a script or service may set off a slew of security alarms or trigger additional events in Splunk. These are just a few examples of the many ways to tune your key data sources.
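Sudden spikes caused by this kind of noise can be surfaced from the same license log used earlier. As a rough sketch, the search below flags any day on which a sourcetype exceeded its own 30-day average by more than three standard deviations. The three-sigma threshold and the 30-day window are arbitrary assumptions and will need tuning for your environment.

```spl
index=_internal source=*license_usage.log type="Usage" earliest=-30d@d latest=@d
| bin _time span=1d
| stats sum(b) as b by _time, st
| eventstats avg(b) as avg_b stdev(b) as stdev_b by st
| where b > avg_b + (3 * stdev_b)
| eval GB=round(b/1024/1024/1024, 3)
| eval avg_GB=round(avg_b/1024/1024/1024, 3)
| table _time, st, GB, avg_GB
| sort - GB
```

Each row is a day worth investigating: compare GB against avg_GB, then drill into that sourcetype for the same date to find the noisy host or source.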

Bottom Line

Even the best Splunk administrator could benefit from a second look or a second pair of eyes on their license usage. While not every Splunk environment suffers from unintended or unnecessary data ingestion, Splunk Professional Service consultants routinely find gigabytes per day of savings in client environments. The bottom line is Splunk gives you the freedom to get as specific as possible with your data collection – you just need the time, expertise, and creativity to match your ingestion with your strategic goals.
