A license (metadata) to kill (for)…

Forensic artifacts can be looked at from many different angles. A few years ago I proposed the concept of filighting, which tried to solve the problem of finding unusual, orphaned, and potentially malicious files dropped inside directories full of files that DO NOT reference these orphaned files at all.

I really hope that forensic analysis tools will evolve to add more features that help automate file system analysis based not only on lists of known hashes and/or file extensions, but also on paths, partial (relative) paths, file names, actual file types determined from content, and ideas that rely on more complex algorithms: prebuilt artifact collections, various correlations (ideas like filighting), and of course machine learning and AI.

Today I want to explore one more angle of looking at file system artifacts: classes of file content. There are many file formats out there: executables, documents, configuration files, database files, and many others. The classification I am focusing on today, though, is slightly different: the format itself doesn’t interest me too much, but the function of the file does…

My guinea pig will be the license file. The type of file that is all over the place, yet no one ever reads it. And yes, removing license files from the examiner’s view (during file system analysis) may not add a lot of value, but it serves here only to illustrate the idea. There are many other file classes like this that can be treated as noise to the examiner’s eyes, and if we start clustering them together, who knows, maybe we have just saved some person-hours there…

I asked myself the following question:

– having a file system in front of me, how do I find all license files on it?

There are at least a few approaches I can think of:

  • use hashes of known license files,
  • use file names typically used by license files,
  • analyze the content of all files and look for content that resembles a license file.

All of them have their own challenges:

  • the first one needs a lot of prep work to collect good hashes,
  • the second one is hard to do w/o some proper analysis of a clean sample set, and
  • the third one is the most reliable, but it’s slow & needs even more preparation, because it has to take into account a few more aspects: localization issues (licenses written in various languages), file encoding issues (Unicode variants, ASCII, MBCS), file formats (TXT, RTF, HTM(L), PDF, DOC(X), etc.), and of course performance (reading many files to analyze their content is expensive, plus not every file referencing GPL, LGPL, or GNU is a license file).

I am going to focus here on the second one.

Your typical license file is usually called license, license.txt, or eula.txt, and in the case of Open Source, we often see files named gpl.txt, license.gpl.txt, lgpl.txt, etc.

When you start researching this file-naming bit a little more, you will soon realize that there are a lot of variations. A lot of the issues listed under the third approach come into play as well, f.ex.:

  • file names can be localized,
  • file extensions can be .txt, .rtf, .htm(l), .doc(x), .pdf, .xml,
  • some of the file names have typos,
  • many license file names use various prefixes or suffixes that identify the licensed software, or the language or code page the license file is written in,
  • some file names may refer to compressed files, f.ex. *.tx_ (in installation packages),
  • some license files may be stored inside archives (including password-protected ones) or installers,
  • some licenses are embedded inside the compiled help files (.hlp, .chm),
  • some programs may be hiding the licensing information in files named with various infixes: copying, releasenotes, thirdparty, copyright, and their variants, etc.,
  • some may refer to the software version, f.ex. full, trial, or evaluation,
  • some files with ‘license’ in their name often refer to actual software licensing (getting keys, subscriptions, transferring licenses, etc.) rather than the license text itself,
  • finally, some file names may be available in an 8.3 DOS notation only.

As usual, the more you look, the more complex the problem becomes.

For this post I have compiled a large file containing possible license file names. You can download it here.

Will it make anybody’s life easier?

I don’t know.

What matters is that we learned a little bit more about how difficult the process of automated file system analysis is. What started as a trivial and frivolous idea ended up being a Don Quixotish attempt to formalize something that is impossible to tackle, even with a data-heavy approach…

Excelling at Excel, Part 4

Excel is the emperor of automation. Not the SOAR type, but the local one – yours.

Why?

Its formulas and VBA capabilities can turn many awfully mundane tasks into plenty of automation opportunities…

For instance… certain programming tasks.

The case/switch syntax is a beautiful construct. It allows us to define a large set of complex if/then statements in a very elegant way. It is very often used to split a data/value set into conditions that determine the result/output/outcome based on the input.

Now, there are some programming languages that do not support case/switch statements well (not a part of their functional specification, introduced only in later versions and hence not fully compatible, etc.).

Writing case/switch statements for these languages is TOUGH. This is because their limitations often force us to rely on a bunch of NESTED IF clauses… Writing many of these, nested, is not for the faint-hearted. Typos, an incorrect number of opening or closing parentheses, and simply getting lost in the complexity of this nested logic are all very easy to come by.

And this is where Excel can help a lot.

Imagine a hypothetical scenario where you have to write code that takes this input data (names spread loosely across columns Value1, Value2, Value3) and generates the output (A, B, C, D, E) based on people’s names:
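For illustration, the input could be laid out along these lines (say, in columns A through D; the exact arrangement is an assumption, while the name-to-letter mapping follows the generated formula quoted later in this post):

Value1   Value2   Value3   Output
John     Jack     Anne     A
Peter                      B
Kate     Leo               C
Paul     Ariel    Fyodor   D
Amy      Maria             E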

It’s pretty straightforward to define rules based on this set in programming languages that support case/switch, but if the only thing available is nested ifs, it’s much harder.

Let’s try though…

The first idea we can tackle is to convert these 3 values (some of which are empty) into a per-row list of values:

You will admit that the List column looks more ‘programmer friendly’ now.

The formula that produces the values in the List column is obscenely simple:
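It is little more than conditional string concatenation; a minimal sketch of such a formula, assuming Value1, Value2, and Value3 live in columns A, B, and C, and the List column starts in E2, could be:

="('" & A2 & "'" & IF(B2<>"", ", '" & B2 & "'", "") & IF(C2<>"", ", '" & C2 & "'", "") & ")"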

We simply build a parenthesis-embraced list of values where the first one is always present, and the two others are added only if the respective cells are not empty.

Trivial!

But how do we convert this list of values into a programmatic construct that gives us A, B, C, D, or E depending on the name (one of the up to three values per row)?

This is one way to do it (using the ternary operator):

(if (name in ('John', 'Jack', 'Anne') ? 'A' : (if (name in ('Peter') ? 'B' : (if (name in ('Kate', 'Leo') ? 'C' : (if (name in ('Paul', 'Ariel', 'Fyodor') ? 'D' : (if (name in ('Amy', 'Maria') ? 'E' : "N/A"))))))))))

If you are curious where this formula comes from, it is taken from the very same spreadsheet; it is the value of the F2 cell:

And how did we build this one?

This is how:
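Again, it is mostly string concatenation; a sketch of the per-row formula in the F column (assuming the output letter sits in column D, the List in column E, and the formula is filled down only as far as the last data row, so the cell below the last one is empty) could be:

="(if (name in " & E2 & " ? '" & D2 & "' : " & IF(F3="", """N/A""", F3) & "))"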

What we do here is build basic ternary logic where we take one part of the comparison from the current row, and the alternative from the row below. In the end, a sort-of recursion happens and we end up with a sequence of nested ternary operators doing their work.

It may be a bit surprising, but using Excel to build complex logical statements like the one above, statements that can ultimately be pasted directly into your favorite programming editor, is actually very easy…

You benefit from the fact that the input data is saved in Excel format and is easy to edit, plus, as long as the formulas are correct, the resulting nested constructs are generated in a syntactically correct way, with far fewer chances of introducing a basic typo.