How Splunk Retrieves Data Quickly…the Secret Sauce!

As a leader in SIEM solutions, Splunk has proven that it can accommodate and process big data very efficiently. Splunk treats big data analysis and processing as something to enjoy rather than a challenge to endure; it even prints that attitude on its t-shirts:

[Image: Splunk big data t-shirt]

So, if Splunk claims that it likes big data, it should have a reliable methodology and processing algorithm to prove itself when it meets big data, right?

Sources of Big Data

When we have a firm that deals with daily internal and external activities, like employee access activity, sales performance, financial transactions, unauthorized intruders, etc., we want to know what is actually happening, don’t we? Big data is what our machines and applications are using to keep us informed about those activities. This is what we call “logging.”

Every machine, regardless of its category (server, network, security, access, etc.), and every application (whether in-house or in the cloud) logs data. If we read and process those logs, we can efficiently achieve the following:

  • Business activity analysis and reporting
  • External and internal security threat monitoring
  • Internal employee activity monitoring
  • Network and communication monitoring
  • Vulnerability management
  • And many more…

This data will enhance our business analyses and improve security controls within our businesses. To handle this amount of data and information, we will need a robust and trusted solution that provides all-time, custom-time, and real-time processing. This is where Splunk takes the lead.

This is shown in another Splunk t-shirt that I personally really like and truly believe in:

[Image: Splunk big data t-shirt]

How Does Splunk Process Data so Quickly?

Think about your logs as a very big book with thousands of pages and subjects. To make it easier for readers to access a certain subject, there is an “index table” which is a table that associates a certain subject with a page number. If we are searching for a specific subject within that book, we don’t need to go over all the pages in the book to find it. We can just get the page number from the index table and jump directly to that subject. Nice!

But what if the subject by itself is not enough for us to get to the information we need within this huge book? We will need a more detailed index table that will take us directly to what we are looking for, like a character name, an incident, time of an event, etc.

Splunk leverages this basic feature when it ingests and retrieves our data, so it’s actually creating an “index table” of sorts for each of the data sources ingested into it. By default, this “index table” will include host, source, and sourcetype. We call these metadata fields, and those “index table” files are called time-series index (TSIDX) files.
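To make the book analogy concrete, here is a minimal Python sketch of an “index table” built over the three default metadata fields. It is only an illustration of the idea, not Splunk's actual TSIDX format; the sample events and the function names `build_index_table` and `lookup` are hypothetical.

```python
from collections import defaultdict

# Hypothetical events; in Splunk the raw text would live in a bucket's raw data.
EVENTS = [
    {"host": "firewall-1", "source": "/var/log/asa.log", "sourcetype": "cisco:asa",
     "raw": "Built outbound TCP connection"},
    {"host": "firewall-2", "source": "/var/log/asa.log", "sourcetype": "cisco:asa",
     "raw": "Deny tcp src outside"},
    {"host": "firewall-1", "source": "/var/log/asa.log", "sourcetype": "cisco:asa",
     "raw": "Teardown TCP connection"},
]


def build_index_table(events, fields=("host", "source", "sourcetype")):
    """Map each (field, value) pair to the positions of the events that carry it."""
    table = defaultdict(list)
    for position, event in enumerate(events):
        for field in fields:
            table[(field, event[field])].append(position)
    return table


def lookup(table, events, field, value):
    """Jump straight to the matching events instead of reading every event."""
    return [events[pos] for pos in table.get((field, value), [])]


INDEX_TABLE = build_index_table(EVENTS)
matches = lookup(INDEX_TABLE, EVENTS, "host", "firewall-1")
print(len(matches))  # 2
```

The lookup touches only the two events listed under ("host", "firewall-1"), which is the same shortcut the page numbers in a book index provide.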

TSIDX Files…the Secret Sauce!

When data is being ingested into Splunk, it will be stored in a pre-defined index as a pre-defined sourcetype. This “raw” data will be stored in directories called “buckets.” These buckets include epoch timestamps in their names to indicate the earliest and latest time of the events within the bucket.
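As a rough sketch of how timestamps in bucket names can narrow a search, the Python below assumes bucket directories named db_<latest>_<earliest>_<id> and keeps only those whose time span overlaps the search window. The bucket names and the helper functions are made up for illustration.

```python
from typing import NamedTuple


class Bucket(NamedTuple):
    name: str
    latest: int    # epoch seconds of the newest event in the bucket
    earliest: int  # epoch seconds of the oldest event in the bucket


def parse_bucket(name: str) -> Bucket:
    """Parse a name of the assumed form db_<latest>_<earliest>_<id>."""
    parts = name.split("_")
    return Bucket(name, int(parts[1]), int(parts[2]))


def buckets_in_range(names, search_earliest, search_latest):
    """Keep only buckets whose time span overlaps the search window."""
    selected = []
    for name in names:
        bucket = parse_bucket(name)
        if bucket.earliest <= search_latest and bucket.latest >= search_earliest:
            selected.append(bucket.name)
    return selected


# Hypothetical bucket directory names.
BUCKET_NAMES = [
    "db_1700003600_1700000000_0",
    "db_1700007200_1700003601_1",
    "db_1700010800_1700007201_2",
]

print(buckets_in_range(BUCKET_NAMES, 1700004000, 1700005000))
# ['db_1700007200_1700003601_1']
```

Two of the three buckets are ruled out from their names alone, before any file inside them is opened.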

At the same time, the associated TSIDX file is created, and it automatically stores the default metadata fields (host, source, and sourcetype). When we search, we can specify, for example, the index and the host the data comes from. Splunk then narrows down the buckets it has to read according to our three conditions (index, time, and host), using bloom filters to skip buckets that cannot contain what we are looking for.

“Index” represents the directory the data lives in on the file system, “time” tells Splunk which buckets should be considered as part of this search, and “host” is looked up in the TSIDX files. This expedites the search because Splunk only has to read the raw data that our “index table” points to.
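For readers unfamiliar with bloom filters, the following is a minimal, generic Python sketch of the data structure. It is not Splunk's implementation; the class name, the bit-array size, and the hashing scheme are assumptions chosen for brevity. The key property is that a bloom filter can answer “definitely not here” or “possibly here”, which is enough to skip a whole bucket.

```python
import hashlib


class BloomFilter:
    """A tiny bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, term):
        # Derive num_hashes bit positions by salting the term with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{term}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, term):
        for pos in self._positions(term):
            self.bits |= 1 << pos

    def might_contain(self, term):
        # If any bit is unset, the term was never added to this filter.
        return all((self.bits >> pos) & 1 for pos in self._positions(term))


# Hypothetical terms seen in one bucket.
bucket_filter = BloomFilter()
for term in ("firewall-1", "cisco:asa", "10.1.1.1"):
    bucket_filter.add(term)

print(bucket_filter.might_contain("firewall-1"))  # True: added terms are never missed
print(BloomFilter().might_contain("firewall-1"))  # False: an empty filter rules the bucket out
```

A term that was never added will almost always return False, but a small false-positive rate is the price of the compact representation, so a “possibly here” answer still has to be confirmed against the bucket's index.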

Case #1: index=cisco_asa host=firewall-1 earliest=-7d
This search will return all the logs gathered by Splunk during that period of time and only from that specific host. But what if those logs are still huge and what if we need to narrow our search further to a particular IP address with activity on this host?

Case #2: index=cisco_asa host=firewall-1 client_IP=10.1.1.1 earliest=-7d
Because this field client_IP does not exist in the TSIDX files, it’s extracted at search time (the moment we search the data). Splunk will retrieve all the raw data just as in Case #1 and then will extract that field from each raw event and return only the matching events where client_IP=10.1.1.1. This will take some time, right?

But if the field client_IP actually exists in the TSIDX files (our “index table”), then the data will be retrieved very efficiently and very quickly, because again Splunk will use the “index table” to fetch only the data that satisfies that particular condition: client_IP=10.1.1.1.
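The difference between the two approaches can be sketched in Python by counting how many raw events each one has to read. This is a toy comparison under stated assumptions: the events, the regular expression, and the function names are hypothetical, and real Splunk field extraction is considerably more involved.

```python
import re

# Hypothetical raw events from one host.
RAW_EVENTS = [
    "host=firewall-1 action=allow client_IP=10.1.1.1",
    "host=firewall-1 action=deny client_IP=10.2.2.2",
    "host=firewall-1 action=allow client_IP=10.1.1.1",
    "host=firewall-1 action=allow client_IP=10.3.3.3",
]

CLIENT_IP = re.compile(r"client_IP=(\S+)")


def search_time_filter(raw_events, wanted_ip):
    """Search-time style: read every event, extract client_IP, keep the matches."""
    scanned, matches = 0, []
    for raw in raw_events:
        scanned += 1
        found = CLIENT_IP.search(raw)
        if found and found.group(1) == wanted_ip:
            matches.append(raw)
    return matches, scanned


def build_ip_index(raw_events):
    """Index-time style: done once at ingest, maps each client_IP to event positions."""
    index = {}
    for pos, raw in enumerate(raw_events):
        found = CLIENT_IP.search(raw)
        if found:
            index.setdefault(found.group(1), []).append(pos)
    return index


def indexed_filter(raw_events, ip_index, wanted_ip):
    """Read only the events the index points to."""
    positions = ip_index.get(wanted_ip, [])
    return [raw_events[p] for p in positions], len(positions)


slow_matches, slow_reads = search_time_filter(RAW_EVENTS, "10.1.1.1")
IP_INDEX = build_ip_index(RAW_EVENTS)
fast_matches, fast_reads = indexed_filter(RAW_EVENTS, IP_INDEX, "10.1.1.1")
print(slow_reads, fast_reads)  # 4 2
```

Both paths return the same two events, but the indexed path reads only those two, while the search-time path reads all four. The cost of the indexed path is paid earlier, when the index is built and stored.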

Users can adjust and extend the “index table” by adding indexed fields, so we can add as many interesting fields as we need for a particular type of data. Using this strategy, we can significantly speed up our searching and data processing.

Tstats, Data Models, and Acceleration

To get more benefit out of indexed fields and data, Splunk provides the “tstats” command. This command looks only at the “index table” (the TSIDX files) and returns results without reading the raw data, which keeps processing time low. It helps end users and administrators get statistical data very quickly. Additional information about the usage of tstats can be found here.
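The reason statistics can be served from the “index table” alone is that counts are already implied by its structure. The Python sketch below illustrates this with a hypothetical table and a hypothetical `count_by` helper; it is conceptually similar to a search such as `| tstats count where index=cisco_asa by host`, though it says nothing about how tstats is implemented internally.

```python
# Hypothetical index table: (field, value) -> positions of the matching events.
INDEX_TABLE = {
    ("host", "firewall-1"): [0, 2, 3],
    ("host", "firewall-2"): [1],
    ("sourcetype", "cisco:asa"): [0, 1, 2, 3],
}


def count_by(index_table, field):
    """Count events per value of a field using only the index table."""
    return {
        value: len(positions)
        for (name, value), positions in index_table.items()
        if name == field
    }


print(count_by(INDEX_TABLE, "host"))  # {'firewall-1': 3, 'firewall-2': 1}
```

No raw event is read at any point: the length of each position list is the count.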

Splunk has also introduced data models. A data model is a knowledge object: a dataset that we configure with our own search criteria and fields of interest. When we create and accelerate a data model, the fields defined in it are saved into TSIDX files. This means that when we search an accelerated data model, we are searching just the TSIDX files (our “index table”), so the data is retrieved very quickly.

This acceleration is achieved by having Splunk periodically run searches in the background to accumulate the required data inside of our data model. Splunk makes sure that the data is always up to date and can be quickly retrieved. Users must enable acceleration and be sure Splunk has the resources (mostly CPU) it needs to get the job done. This feature is widely leveraged within Splunk Enterprise Security (ES). You can access more information about data model acceleration here.

Additional Considerations

Storage capacity should be carefully considered here, because the more fields we add to the “index table,” the more storage it consumes. Data model accelerations also consume their own storage in addition to the existing raw data. Keep in mind that:

  • As a rule of thumb, indexed data occupies approximately 50% of the original data size on disk (about 15% for the compressed raw data and about 35% for the TSIDX files).
  • Data model accelerations and indexed fields affect storage considerations but do not count extra against Splunk license usage.
  • Splunk has many solutions to capacity planning that can be found here.
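The rule of thumb above turns into a quick back-of-the-envelope estimate. The Python sketch below applies the 15% and 35% ratios to a hypothetical daily ingest volume and retention period; the function name and the example numbers are assumptions, real ratios vary with the data, and data model acceleration summaries are not included.

```python
def estimate_storage_gb(daily_ingest_gb, retention_days,
                        raw_ratio=0.15, tsidx_ratio=0.35):
    """Estimate disk usage from the 15% raw + 35% TSIDX rule of thumb."""
    ingested_gb = daily_ingest_gb * retention_days
    raw_gb = round(ingested_gb * raw_ratio, 1)
    tsidx_gb = round(ingested_gb * tsidx_ratio, 1)
    return {
        "raw_gb": raw_gb,
        "tsidx_gb": tsidx_gb,
        "total_gb": round(raw_gb + tsidx_gb, 1),
    }


# Hypothetical deployment: 100 GB/day kept for 90 days.
estimate = estimate_storage_gb(daily_ingest_gb=100, retention_days=90)
print(estimate)
```

For this example, 9,000 GB of ingested data works out to roughly 1,350 GB of compressed raw data plus 3,150 GB of TSIDX files, or about 4,500 GB in total.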

Summary

In this article, we walked through the Splunk methodology for efficiently storing and retrieving data. We identified the “Secret Sauce” that Splunk uses to return results quickly: the TSIDX files. This process is very important, and I consider it the core value of Splunk. Storage, processing, and retrieval time are the most important criteria customers consider when they are looking for a SIEM solution or big data processing tool.

About Aditum

Aditum’s Splunk Professional Services consultants can assist your team with best practices to optimize your Splunk deployment and get more from Splunk.

Our certified Splunk Architects and Splunk Consultants manage successful Splunk deployments, environment upgrades and scaling, dashboard, search, and report creation, and Splunk Health Checks. Aditum also has a team of accomplished Splunk Developers that focus on building Splunk apps and technical add-ons.

Contact us directly to learn more.