Taxonomy, at its heart, is about making connections between concepts and labels. On the conceptual side, taxonomy design requires analyzing and understanding users’ needs and mental models. On the label side, there is the body of content (or “corpus” in info science speak), which may be quite large, running to millions of words (or more!). Getting a handle on that much text can be challenging for a human mind, but luckily we live in a time with technology that doesn’t break a sweat running millions of processes.

Text analysis and processing can be useful for a number of common taxonomy development tasks including:

  • Text mining for candidate terms & synonyms
  • Search log analysis
  • Statistical analysis of current metadata use (e.g. from a CMS database export)
  • Term extraction (e.g. from product names or article titles)
  • Data cleanup or transformations
  • Aggregation or separation of values based on different criteria
  • Mapping free text to new controlled taxonomy terms
  • Summarizing labels used in a folder structure
  • Replacing a subset of terms
  • Frequency analysis (seeing how many times any term from a list appears in a corpus)
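
As an illustration of the last item, here is a minimal Python sketch of a frequency analysis. The function name term_frequencies, the sample corpus, and the term list are all made up for this example; matching is case-insensitive and on whole words only, which is an assumption you may want to change for your own content.

```python
import re
from collections import Counter

def term_frequencies(corpus, terms):
    """Count how often each term from a list appears in a corpus (case-insensitive)."""
    counts = Counter()
    for term in terms:
        # \b restricts matches to whole words, so "tax" does not match inside "taxonomy"
        pattern = r"\b" + re.escape(term) + r"\b"
        counts[term] = len(re.findall(pattern, corpus, flags=re.IGNORECASE))
    return counts

# Hypothetical sample data, just to show the call
corpus = "Taxonomy design starts with users. A good taxonomy maps labels to concepts."
freq = term_frequencies(corpus, ["taxonomy", "labels", "metadata"])
print(freq.most_common())
```

Terms that never appear stay in the result with a count of zero, which is handy for spotting taxonomy terms that nobody actually uses.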

There are a number of high-end, enterprise-grade applications available for purchase or as a service that advertise advanced analysis, complex machine learning algorithms, and dazzling visualizations. But not everyone has the resources for, or the need for, that level of support. Luckily, there are many approaches that can do a lot of the heavy lifting and provide very useful results using readily available tools that you probably already have on hand.

Excel / OpenOffice / Google Sheets are all different spreadsheets with the same core functionality, including formulas, pivot tables, and the ability to extend them with more complex programming or plug-ins (sometimes requiring additional purchase).

Command line tools are available natively on all Linux and macOS computers and can be added to Windows (free!). Many of these commands take only a few minutes to learn and have the added advantage that they can be applied to multiple files (or an entire directory). Commands can be combined or chained together to form more complex processes. For example, “sort | uniq -dc | sort -rn” first sorts the lines so that identical ones sit next to each other (uniq only compares adjacent lines), then returns every line that occurs more than once, along with a count, and finally passes that to a second sort, which orders the results from most to least frequent.
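
For comparison, the same duplicate-counting idea can be sketched in Python. This is only an illustration, not a replacement for the command line version: the function name duplicate_lines and the sample values are invented here, and unlike uniq it trims surrounding whitespace and skips blank lines before comparing.

```python
from collections import Counter

def duplicate_lines(lines):
    """Return (count, line) pairs for lines that occur more than once, most frequent first."""
    # Assumption: surrounding whitespace is ignored and blank lines are skipped
    counts = Counter(line.strip() for line in lines if line.strip())
    dupes = [(n, line) for line, n in counts.items() if n > 1]
    # Highest count first; ties are broken alphabetically
    return sorted(dupes, key=lambda pair: (-pair[0], pair[1]))

# Hypothetical sample data standing in for the lines of a file
sample = ["shoes", "boots", "shoes", "sandals", "boots", "shoes"]
for count, line in duplicate_lines(sample):
    print(count, line)
```

Because Counter tallies every line regardless of position, the input does not need to be sorted first. An open file can be passed in directly in place of the sample list, since iterating over a file yields its lines.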

Scripting (simple programming) may seem daunting, but with a very basic introduction to the overall approach (i.e. how to create and run code) and the many examples available through a quick Google search, there is almost never a need to write code from scratch. The most common programming language for simple text manipulation is Python. Just search “normalize text python” and then cut and paste the results:

    # example input (any text will do)
    string = "Taxonomy Boot Camp"
    # convert to lower case
    lower_string = string.lower()
    print(lower_string)
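
Lowercasing is usually only the first step. As a rough sketch of what a slightly fuller normalization might look like, the snippet below also replaces punctuation with spaces and collapses runs of whitespace. The function name normalize and the sample label are made up for illustration, and the exact rules (for instance, whether a hyphen should split a term) are assumptions to adjust for your own content.

```python
import re

def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = text.lower()
    # Anything that is not a letter, digit, underscore, or whitespace becomes a space
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace into single spaces and trim the ends
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical messy label, e.g. from a product export
print(normalize("  Running-Shoes,  MEN (New!) "))
```

Running labels through the same normalization before comparing them makes it much easier to spot variants that differ only in case, spacing, or punctuation.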

To see examples of each of these approaches and learn more about DIY Text Analysis, check out my presentation from Taxonomy Boot Camp. You can use this as a cheat sheet for all the most useful operations to use in your taxonomy work.