Dovecot is pleased to announce that, after an enormous technical and editorial effort, the UN Office of the High Commissioner for Human Rights (OHCHR) has launched their new, taxonomy-powered website. This Drupal site features over 20,000 pages of HTML content and tens of thousands of digital assets supporting the crucial and sensitive work of the organization across the globe.
In this site, taxonomy drives complex content aggregation and dynamic placement as well as search and filtering.
Dovecot will be presenting “Optimizing the haystack: Improving findability in content-heavy websites” with partners Bluestate and Axelerant at DrupalCon 2022 in Portland, Oregon on April 26. Be sure to say hi if you are able to attend!
Read our case study for more on how we helped OHCHR with taxonomy harmonization and development.
Taxonomy, at its heart, is about making connections between concepts and labels. On the conceptual side, taxonomy design requires analyzing and understanding users’ needs and mental models. On the label side, there is the body of content (or “corpus” in info science speak), which may be quite large, running to millions of words (or more!). Getting a handle on that much text can be challenging for a human mind, but luckily we live in a time when technology doesn’t break a sweat running millions of processes.
Text analysis and processing can be useful for a number of common taxonomy development tasks including:
- Text mining for candidate terms & synonyms
- Search log analysis
- Statistical analysis of current metadata use (e.g. from a CMS database export)
- Term extraction (e.g. from product names or article titles)
- Data cleanup or transformation
- Aggregation or separation of values based on different criteria
- Mapping free text to new controlled taxonomy terms
- Summarizing labels used in a folder structure
- Replacing a subset of terms
- Frequency analysis (seeing how many times any term from a list appears in a corpus)
There are a number of high-end, enterprise-grade applications available for purchase or as a service that advertise advanced analysis, complex machine learning algorithms, and dazzling visualizations. But not everyone has the resources, or the need, for that level of support. Luckily, there are many approaches that can do a lot of the heavy lifting and provide very useful results using readily available tools that you probably already have on hand.
Excel, OpenOffice, and Google Sheets are different spreadsheet applications with the same core functionality, including formulas and pivot tables, and each can be extended with more complex programming or plug-ins (sometimes requiring an additional purchase).
Command line tools are available natively on all Linux and macOS computers and can be added to Windows (free!). Many of these commands take only a few minutes to learn and have the added advantage that they can be applied to multiple files (or an entire directory). Commands can be combined or chained together to form more complex processes. For example, “sort | uniq -dc | sort -rn” first sorts the lines of a file so that identical lines sit next to each other (uniq only compares adjacent lines), then returns every line that occurs more than once along with a count, and finally passes that to the sort function, which orders the results by count, highest first.
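For readers who would rather run a script than a pipeline, a similar duplicate count can be sketched in Python. This is a minimal sketch rather than a drop-in replacement for the command line version: the function name `duplicated_lines` and the sample lines are made up for illustration, and unlike uniq it trims surrounding whitespace and skips blank lines.

```python
from collections import Counter


def duplicated_lines(lines):
    """Return (line, count) pairs for lines that occur more than once,
    most frequent first, with ties in alphabetical order."""
    # Trim whitespace and ignore blank lines before counting.
    counts = Counter(line.strip() for line in lines if line.strip())
    repeated = [(line, n) for line, n in counts.items() if n > 1]
    return sorted(repeated, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical sample data; in practice the lines would come from a file,
# e.g. open("terms.txt", encoding="utf-8").
sample = ["water", "health", "water", "education", "health", "water"]
for line, n in duplicated_lines(sample):
    print(n, line)
```

Because the counting happens in a dictionary-like `Counter`, the input does not need to be sorted first.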
Scripting (simple programming) may seem daunting, but once you have a very basic introduction to the overall approach (i.e. how to create and run code), so many examples are available through a quick Google search that there is almost never a need to write code from scratch. A popular programming language for simple text manipulation is Python. Just search “normalize text python” and then copy and paste the results:
    # convert to lower case
    lower_string = string.lower()
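Lowercasing is usually only the first step. As a slightly fuller sketch, the snippet below normalizes text and then counts how often each term from a list appears in it (the frequency analysis mentioned in the task list earlier). The function names, the sample terms, and the sample text are all hypothetical, and the matching is deliberately naive: it finds exact, whole-word matches only, with no stemming or synonym handling, and it strips ASCII punctuation only.

```python
import re
import string


def normalize(text):
    """Lowercase, replace ASCII punctuation with spaces, collapse whitespace."""
    text = text.lower()
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return " ".join(text.translate(table).split())


def term_frequencies(terms, corpus):
    """Count whole-word occurrences of each term in the normalized corpus."""
    clean = normalize(corpus)
    counts = {}
    for term in terms:
        # \b keeps "human right" from matching inside "human rights".
        pattern = r"\b" + re.escape(normalize(term)) + r"\b"
        counts[term] = len(re.findall(pattern, clean))
    return counts


# Hypothetical sample data for illustration only.
corpus = (
    "Human rights are universal. Human-rights education matters; "
    "education is a human right."
)
terms = ["human rights", "human right", "education"]
print(term_frequencies(terms, corpus))
```

Normalizing both the corpus and the terms the same way is what lets “Human-rights” and “human rights” count as the same label.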
To see examples of each of these approaches and learn more about DIY Text Analysis, check out my presentation from Taxonomy Boot Camp. You can use this as a cheat sheet for all the most useful operations to use in your taxonomy work.