Splunk, What’s in My Data?!

An absurdly brief document on discovering interesting things

 

David Carasso

 

 

In each of the examples below, replace “…” with whatever search you want to pull events from. For example, to understand a particular source, replace “…” with “source=<path to source>”; to understand fatal errors, replace “…” with the keyword “fatal”.

 

Event Overview

 

Show 5 examples from each cluster, from most common cluster to least:

 

…| cluster labelonly=t showcount=t
| sort -cluster_count, cluster_label, _time
| dedup 5 cluster_label
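If it helps to see what the sort and dedup steps do, here is a minimal Python sketch of the same logic. It assumes each event already carries a cluster label (the `(label, time, raw)` tuples are hypothetical input) and illustrates the idea only; it is not how Splunk implements dedup.

```python
from collections import Counter


def examples_per_cluster(events, per_cluster=5):
    """Keep up to `per_cluster` events per label, most common label first.

    `events` is a list of (cluster_label, time, raw) tuples. The ordering
    mirrors `sort -cluster_count, cluster_label, _time | dedup 5 cluster_label`.
    """
    counts = Counter(label for label, _, _ in events)
    # Cluster size descending, then label ascending, then time ascending.
    ordered = sorted(events, key=lambda e: (-counts[e[0]], e[0], e[1]))
    kept, seen = [], Counter()
    for label, time, raw in ordered:
        if seen[label] < per_cluster:  # dedup: at most N rows per label
            seen[label] += 1
            kept.append((counts[label], label, time, raw))
    return kept
```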

 

Group events by their first 5 punctuation characters, most common group first:

 

…| rex field=punct "(?P<smallpunct>.{5})"
| eval smallpunct = "*" + smallpunct
| stats first(_raw) as example count by smallpunct
| sort -count
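The same grouping, sketched in Python under the assumption that you already have each event's punct string next to its raw text (the `(punct, raw)` pairs are hypothetical input). It uses the same regular expression as the rex above and keeps the leading "*" that the eval adds.

```python
import re
from collections import Counter

# Same pattern as the rex; events whose punct is shorter than 5 characters
# do not match and are left out of the grouping.
FIRST_FIVE = re.compile(r"(?P<smallpunct>.{5})")


def group_by_punct_prefix(events):
    """Return (smallpunct, example, count) rows, most common prefix first.

    `events` is a list of (punct, raw) pairs.
    """
    counts, examples = Counter(), {}
    for punct, raw in events:
        m = FIRST_FIVE.search(punct)
        if m is None:
            continue
        key = "*" + m.group("smallpunct")
        counts[key] += 1
        examples.setdefault(key, raw)  # keep the first event seen, like first(_raw)
    return [(key, examples[key], n) for key, n in counts.most_common()]
```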

 

Field Overview

 

Show patterns of co-occurring fields (a value of 1.0 means two fields always occur together):

 

…| fields - date* source* time*
| correlate
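This document doesn't spell out the formula correlate uses, so the Python sketch below swaps in one simple co-occurrence score: the fraction of events containing either field that contain both. Like correlate's output, it is 1.0 exactly when two fields always appear together, but treat it as an illustration, not a reimplementation. The list of field dictionaries is hypothetical input.

```python
from itertools import combinations


def cooccurrence(events):
    """Score each pair of fields by (#events with both) / (#events with either).

    `events` is a list of dicts of extracted fields. A score of 1.0 means the
    two fields always appear together.
    """
    fields = sorted({f for e in events for f in e})
    scores = {}
    for a, b in combinations(fields, 2):
        both = sum(1 for e in events if a in e and b in e)
        either = sum(1 for e in events if a in e or b in e)
        scores[(a, b)] = both / either  # either >= 1, since fields come from events
    return scores
```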

Use 'top' to find the most common Log_Level values for each Component:

 

… | top Log_Level by Component
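To make the output of top … by concrete, here is a small Python sketch of per-group value counts. The event dictionaries are hypothetical, the limit of 10 mirrors top's default, and computing the percentage within each group is an assumption of this sketch.

```python
from collections import Counter, defaultdict


def top_by(events, field, by, limit=10):
    """For each value of `by`, list the most common values of `field`.

    Returns {by_value: [(field_value, count, percent), ...]}, where percent
    is relative to the events in that group that have `field`.
    """
    groups = defaultdict(Counter)
    for e in events:
        if field in e and by in e:
            groups[e[by]][e[field]] += 1
    result = {}
    for key, counter in groups.items():
        total = sum(counter.values())
        result[key] = [(v, n, 100.0 * n / total) for v, n in counter.most_common(limit)]
    return result
```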

 

Alternatively, we can build a contingency table to see the patterns more conveniently:

 

… | contingency maxcols=5 Log_Level Component
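A contingency table is a cross-tabulation: one field's values down the side, the other's across the top, and event counts in the cells. Here is a minimal Python sketch, assuming dictionaries of extracted fields as input; it keeps only the `maxcols` most common column values, sorts rows alphabetically, and does not add totals.

```python
from collections import Counter


def contingency(events, row_field, col_field, maxcols=5):
    """Cross-tabulate two fields.

    Rows are row_field values, columns are the `maxcols` most common
    col_field values, and each cell is a count of events.
    """
    pairs = Counter((e[row_field], e[col_field]) for e in events
                    if row_field in e and col_field in e)
    col_totals = Counter()
    for (_, col), n in pairs.items():
        col_totals[col] += n
    cols = [c for c, _ in col_totals.most_common(maxcols)]
    rows = sorted({r for r, _ in pairs})
    # Counter returns 0 for missing (row, col) pairs, so empty cells are 0.
    table = {r: {c: pairs[(r, c)] for c in cols} for r in rows}
    return cols, table
```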

 

The associate command can automatically identify relationships between field values:

 

… | associate Log_Level Component
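Splunk's documentation describes associate in terms of entropy: how much knowing that one field has a particular value reduces the uncertainty about another field. The Python sketch below computes that entropy reduction for a single reference value; it illustrates the idea and does not reproduce the command's exact scoring or filtering. Field names and events are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy, in bits, of a list of categorical values."""
    counts = Counter(values)
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def entropy_improvement(events, ref_field, ref_value, target_field):
    """How much knowing ref_field == ref_value reduces uncertainty about target_field."""
    all_targets = [e[target_field] for e in events if target_field in e]
    given = [e[target_field] for e in events
             if target_field in e and e.get(ref_field) == ref_value]
    return entropy(all_targets) - entropy(given)
```

A large improvement means that value of the reference field is a strong hint about the target field, for example a Component that almost always logs at one Log_Level.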

 

Transaction Bursts

 

One of the most useful ways to find transactions, or higher-level events, is to look for pauses between your events, as real-world activity often happens in bursts. Here we find the most common pauses between events:

…| delta _time as timedelta
| top timedelta
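Outside Splunk, the same histogram of gaps takes a few lines of Python. This sketch sorts timestamps oldest first, so every gap is non-negative; delta works in search order, which is usually newest first, so the SPL version may report negative values. The timestamp list is hypothetical input in seconds.

```python
from collections import Counter


def top_gaps(timestamps, limit=10):
    """Return the most common gaps between consecutive events as (gap, count)."""
    ts = sorted(timestamps)  # oldest first, so gaps are >= 0
    gaps = [later - earlier for earlier, later in zip(ts, ts[1:])]
    return Counter(gaps).most_common(limit)
```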

 

Now let's group events into transactions, splitting when there's a pause of more than 2 seconds, and then filtering out all the transactions of size 1:

 

…| transaction maxpause=2s
| search eventcount > 1
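Splitting on pauses is simple to sketch: walk the sorted timestamps and start a new group whenever the gap since the previous event exceeds maxpause. This is a minimal Python illustration of that grouping rule, not the transaction command itself, which supports many other constraints.

```python
def group_by_pause(timestamps, maxpause=2.0):
    """Split timestamps (seconds) into bursts wherever the gap exceeds maxpause."""
    groups = []
    for t in sorted(timestamps):
        if groups and t - groups[-1][-1] <= maxpause:
            groups[-1].append(t)  # still inside the current burst
        else:
            groups.append([t])  # pause too long: start a new group
    return groups
```

Filtering the result with `len(g) > 1` corresponds to `search eventcount > 1`; keeping `len(g) == 1` instead gives the single-event transactions.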

 

Incidentally, transactions of size 1 are also interesting, as they are often single, longer-running, higher-level commands:

 

…| transaction maxpause=2s
| search eventcount = 1

 

Anomalies

 

Very often you want to find “problems” in your IT data, but you don’t know what to look for.

 

Look for events that do not cluster into large groups:

 

…| cluster showcount=true
| sort cluster_count
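The cluster command groups events by textual similarity. As a much cruder stand-in, the Python sketch below groups raw events by a signature with every run of digits masked, then lists the smallest groups first, which is where unusual events tend to be. The masking rule is an assumption of this sketch, not cluster's algorithm.

```python
import re
from collections import Counter


def rare_patterns(raw_events):
    """Return (signature, count) pairs, smallest groups first.

    A simplification for illustration, not Splunk's cluster algorithm.
    """
    # Mask digit runs so "user 1 ok" and "user 22 ok" share a signature.
    counts = Counter(re.sub(r"\d+", "#", raw) for raw in raw_events)
    # Ascending count, like `sort cluster_count`; ties broken by signature text.
    return sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))
```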

 

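Find unexpected events by looking for field values that are improbable given the rest of the data, such as values far from the mean. Here we check the delay field; pthresh=0.02 sets the probability below which a value counts as anomalous, and action=filter keeps only the events with anomalous values: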

 

… | anomalousvalue delay action=filter pthresh=0.02
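To see how a probability threshold like pthresh can flag outliers, here is a Python sketch that swaps in a plain z-score test: each value's two-sided tail probability under a normal distribution with the sample's mean and standard deviation is compared with pthresh. This is not anomalousvalue's actual scoring, only one way to build intuition for it.

```python
import math
import statistics


def anomalous_values(values, pthresh=0.02):
    """Return values whose two-sided normal tail probability is below pthresh.

    A z-score stand-in for illustration only; it does not reproduce
    anomalousvalue's own scoring.
    """
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return []  # all values identical: nothing stands out
    flagged = []
    for v in values:
        z = abs(v - mean) / sd
        p = math.erfc(z / math.sqrt(2))  # P(|Z| >= z) for a standard normal
        if p < pthresh:
            flagged.append(v)
    return flagged
```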

 

Find unexpected events based on their historical context, that is, on how much each event differs from the events that preceded it:

 

… | anomalies
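The anomalies command scores each event by how unexpected it is compared with the events before it. One way to approximate that idea, and the one assumed in this Python sketch (not the command's exact formula), is compression: an event that repeats recent history adds few bytes when compressed after it, while a novel event adds many. The `window` size is a hypothetical parameter.

```python
import zlib


def _csize(text):
    """Compressed size of text in bytes, a rough measure of its information."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def unexpectedness(previous, event):
    """Extra compressed bytes `event` costs after `previous`, relative to its own size.

    Near 0 means the event is predictable from the context; near 1 means novel.
    """
    context = "\n".join(previous)
    extra = _csize(context + "\n" + event) - _csize(context)
    return extra / _csize(event)


def score_events(events, window=10):
    """Score each event against the `window` events that came before it."""
    return [unexpectedness(events[max(0, i - window):i], e) for i, e in enumerate(events)]
```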

 

Once you have searches that find unexpected events, you can set alerts on them. You can also combine events into ‘transactions’ and look for anomalies in groups of events.

 

Graphing

 

One of the most obvious ways to discover what your data is saying is simply to graph it. The problem is that most people don’t graph enough. Graph lots and lots of fields against other fields: delay by time, delay by IP, max(delay) by IP, min(delay) by referrer, count(referrer) by day of week or by time of day, delay by state, and so on.
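In Splunk each of these is a short stats or chart search, for example `stats max(delay) by ip`. For readers who want to see what such an aggregation computes before it is charted, here is a small Python sketch; the delay and ip field names are hypothetical.

```python
from collections import defaultdict


def aggregate_by(events, value_field, by_field, func=max):
    """Apply `func` (max, min, len, ...) to value_field for each by_field value."""
    groups = defaultdict(list)
    for e in events:
        if value_field in e and by_field in e:
            groups[e[by_field]].append(e[value_field])
    return {key: func(vals) for key, vals in groups.items()}
```

Passing `func=min` or `func=len` gives the min(...) and count(...) variants from the list above.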

Eventually you’ll tease out little nuggets of knowledge.