Multi-field extractions in Splunk

06.13.2011

As a SysAdmin, one of the cooler tools that I’ve worked with is Splunk. A project I’m on indexes absolutely every log that it generates into Splunk, from firewall logs to system logs to custom application logs. Splunk does an excellent job of identifying the format of the data we ingest and automatically extracting fields for log types that it knows about. Our custom application logs, though, need a little massaging before we can put them to use. That’s where field extractions come in handy.

Field extractions let you define a regular expression that runs against log messages and pulls out the fields you name. If our custom logs contain a hostname and an error code, for example, field extractions let me pull those values from the logs and then write searches based on them. For a quick overview, check out the Splunk video on using the Interactive Field Extractor (IFE). One feature that I just discovered is the ability to extract multiple fields with a single field extraction. This allows you to parse an entire log message into its component fields using just one field extraction statement.

Extracting xferlog fields

As an example, let’s look at an xferlog entry generated by ProFTPD.

Mon Feb 26 12:52:43 2001 2 3.example.com 26295 /var/ftp/pubinfo/jpeg/NeptDS.jpg \
b _ o a mozilla@ ftp 0 * c

(I’ve added linebreaks throughout this post to make it more readable. The actual examples would be all one line without the ‘\’ characters.)

The xferlog format consists of fourteen fields, all of which we may be interested in searching on at some point. The long way to get those fields would be to write thirteen individual field extractions, one for each field. (The date is parsed automatically by Splunk, so we’ll leave that one alone.) The better way is to create one long regular expression that extracts all of the fields we’re interested in at once.

The documentation on doing this using the Interactive Field Extractor is pretty sparse (read: non-existent). After experimenting with the IFE, I found that Splunk was expecting a regular expression that looked something like:

^[^ ]+ (?P<FIELDNAME1>[^ ]+)(?:[^ \n]* ){6}(?P<FIELDNAME2>[^ ]+)
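To see what a pattern of that shape does, here is a minimal Python sketch. Python’s re module accepts the same (?P<name>...) named-group syntax, so it makes a handy scratchpad, but it is not Splunk and may differ in the details. The two field names and the token counts are my own choices for illustration, and I’m assuming the character class in the skip group is meant to exclude spaces and newlines.

```python
import re

# The sample xferlog entry as a single line (no continuation characters).
LINE = ("Mon Feb 26 12:52:43 2001 2 3.example.com 26295 "
        "/var/ftp/pubinfo/jpeg/NeptDS.jpg b _ o a mozilla@ ftp 0 * c")

# (?:[^ \n]* ){N} skips N space-delimited tokens without capturing them.
# Here: skip the 5 date tokens, capture one field, skip 6 more, capture another.
PATTERN = re.compile(
    r"^(?:[^ \n]* ){5}(?P<transfer_time>[^ ]+) "
    r"(?:[^ \n]* ){6}(?P<access_mode>[^ ]+)"
)

match = PATTERN.match(LINE)
fields = match.groupdict() if match else {}
print(fields)
```

The non-capturing group is what keeps the pattern short when you only want a couple of fields that sit far apart in the message.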

With that in mind, the regular expression to parse the xferlog would be:

^[^ ]+ [^ ]+ [^ ]+ [^ ]+ [^ ]+ (?P<FIELDNAME1>[^ ]+) (?P<FIELDNAME2>[^ ]+) (?P<FIELDNAME3>[^ ]+) (?P<FIELDNAME4>[^ ]+) (?P<FIELDNAME5>[^ ]+) (?P<FIELDNAME6>[^ ]+) (?P<FIELDNAME7>[^ ]+) (?P<FIELDNAME8>[^ ]+) (?P<FIELDNAME9>[^ ]+) (?P<FIELDNAME10>[^ ]+) (?P<FIELDNAME11>[^ ]+) (?P<FIELDNAME12>[^ ]+) (?P<FIELDNAME13>[^ ]+)$

When I tried to enter that into the IFE the form began to act squirrely after the eighth field and wouldn’t let me add anything else. As far as I can tell, there’s a limit on how long your regex can be when using the Interactive Field Extractor. I saved the field extraction anyway, providing field names when I saved it.

In the Splunk management interface you can create and edit raw field extractions without the (helpful) overhead of the IFE. The extraction I’d just created looked like:

^[^ ]+ [^ ]+ [^ ]+ [^ ]+ [^ ]+ (?P<transfer_time>[^ ]+) (?P<remote_host>[^ ]+) (?P<file_size>[^ ]+) (?P<filename>[^ ]+) (?P<transfer_type>[^ ]+) (?P<special_action_flag>[^ ]+) (?P<direction>[^ ]+)

Adding the last six fields was as simple as appending them to the end of the regular expression and pressing save. No errors about the string being too long and, when I performed a search, all of the fields in each xferlog message were available to search against.
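Since every field uses the same capture pattern, the long regular expression can also be generated instead of typed by hand. The following is a rough Python sketch of that idea, not something Splunk provides: the helper name build_extraction and its parameters are hypothetical, and the output still has to be pasted into the management interface manually.

```python
import re

# Field names in xferlog order, leaving out the timestamp that Splunk parses itself.
FIELD_NAMES = [
    "transfer_time", "remote_host", "file_size", "filename",
    "transfer_type", "special_action_flag", "direction",
    "access_mode", "username", "service_name",
    "authentication_method", "authenticated_user_id", "completion_status",
]

def build_extraction(field_names, skipped_tokens=5, anchor_end=True):
    """Build one regex that skips the leading tokens and captures each named field."""
    skip = " ".join(["[^ ]+"] * skipped_tokens)
    captures = " ".join(f"(?P<{name}>[^ ]+)" for name in field_names)
    return f"^{skip} {captures}" + ("$" if anchor_end else "")

# The first seven fields only, then the full thirteen-field version.
partial = build_extraction(FIELD_NAMES[:7], anchor_end=False)
full = build_extraction(FIELD_NAMES)
print(full)
```

Appending more fields then means adding names to the list rather than editing a very long string.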

Note: When naming fields, you cannot use the ‘-’ character. This tripped me up for a while. Splunk doesn’t give you an error, but the extraction also doesn’t return any results.
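Because the failure is silent, it can help to check the field names before saving. The sketch below is a pre-flight check of my own in Python, not a Splunk feature: the function name is made up, and the rule it applies (letters, digits and underscores, with no leading digit) is a conservative assumption rather than Splunk’s documented naming rule.

```python
import re

# Conservative rule (an assumption): letters, digits, underscores, no leading digit.
SAFE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
# Finds the name inside each (?P<name>...) group of a regex given as plain text.
GROUP_NAME = re.compile(r"\(\?P<([^>]*)>")

def unsafe_field_names(extraction):
    """Return the named groups in a regex string that break the conservative rule."""
    return [name for name in GROUP_NAME.findall(extraction)
            if not SAFE_NAME.fullmatch(name)]

bad = unsafe_field_names(r"(?P<remote-host>[^ ]+) (?P<file_size>[^ ]+)")
print(bad)
```

The extraction is scanned as text on purpose, since Python’s own re.compile would reject the hyphenated group name before you could inspect it.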

The final field extraction ended up looking like:

^[^ ]+ [^ ]+ [^ ]+ [^ ]+ [^ ]+ (?P<transfer_time>[^ ]+) (?P<remote_host>[^ ]+) (?P<file_size>[^ ]+) (?P<filename>[^ ]+) (?P<transfer_type>[^ ]+) (?P<special_action_flag>[^ ]+) (?P<direction>[^ ]+) (?P<access_mode>[^ ]+) (?P<username>[^ ]+) (?P<service_name>[^ ]+) (?P<authentication_method>[^ ]+) (?P<authenticated_user_id>[^ ]+) (?P<completion_status>[^ ]+)$
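If you want to sanity-check an expression like this outside of Splunk, a short Python sketch can run it against the sample entry from earlier in the post. This assumes Python’s re module treats the pattern the same way Splunk does, which holds for the simple constructs used here but is not guaranteed in general.

```python
import re

# The final extraction, split across string literals only for readability.
XFERLOG = re.compile(
    r"^[^ ]+ [^ ]+ [^ ]+ [^ ]+ [^ ]+ "
    r"(?P<transfer_time>[^ ]+) (?P<remote_host>[^ ]+) (?P<file_size>[^ ]+) "
    r"(?P<filename>[^ ]+) (?P<transfer_type>[^ ]+) "
    r"(?P<special_action_flag>[^ ]+) (?P<direction>[^ ]+) "
    r"(?P<access_mode>[^ ]+) (?P<username>[^ ]+) (?P<service_name>[^ ]+) "
    r"(?P<authentication_method>[^ ]+) (?P<authenticated_user_id>[^ ]+) "
    r"(?P<completion_status>[^ ]+)$"
)

# The sample xferlog entry as a single line.
LINE = ("Mon Feb 26 12:52:43 2001 2 3.example.com 26295 "
        "/var/ftp/pubinfo/jpeg/NeptDS.jpg b _ o a mozilla@ ftp 0 * c")

match = XFERLOG.match(LINE)
fields = match.groupdict() if match else {}
for name, value in fields.items():
    print(f"{name}={value}")
```

An empty result here means the pattern did not match the whole line, which is worth knowing before you go hunting for missing fields in a search.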

Moral of the story

If you’re writing a complex field extraction, then the Interactive Field Extractor is a really useful tool to see if you’re going to get the results you want. If you’re writing a long field extraction, then adding the extraction through the management interface is the way to go.