How Analytics can help you uncover the knowledge hidden in your documents

21 January 2016

Organizations often possess a lot of data that’s being stored in unstructured formats. Most of the time, this involves data that people were able to enter as free text, such as e-mails, call center logs, presentations, manuals etc. In these instances, analytics can help to access the value hidden within this data.

To make this point clear, let’s take a look at a project we've been working on here at AE involving the CVs of our consultants.

AE consultans have got lots of experience in various technologies and methodologies. While we like to help each other out with issues or by answering questions, it’s not always easy to keep track of who can be relied on for a certain skill or expertise. These skills are included in the CVs of our consultants, but these files are not easily searchable in a structured way.

Next to that, we’d also like to get an overview of how our colleagues relate to one another: which people have the same skills, what are potential gaps within the organization etc.

To get a clear insight on this data, we followed the typical steps of an analytics project: data gathering, data cleaning, data analysis and visualization. This process was then further operationalized to make the data accessible to all AE colleagues.

Let’s zoom in on each of these steps below.

Data gathering and cleaning

We need a list of potential skills as well as a list of which consultants possess them. Each consultant has their own AE CV in a Word document where they’ve listed all of their skills in a bulleted list. That means we can automatically look through these CVs, extract the skills and then clean up the data.

First, we need to convert the Word documents to a format that allows for automated parsing. Word docs are structured files from which we can extract an XML-representation. To complete this step, we used a PowerShell script to fetch the most recent version of each CV and to transform it to the XML-representation, with the help of Apache Tika.

$files = dir dump $Directory -rec -exclude "* ned*","*old*","* nl*","*.pdf"."*.xls","*-NED*"
$files | group directory | foreach {
$a = @($ | sort {[datetime]$_.lastwritetime} -Descending)[0]
$b = "xml\" + $a.Directory.Name + ".xml"
echo $a" to "$b;
java -jar tika.jar -r $a.FullName >> $b

This generates an XML-file for each AE consultant:


The skills can be found as elements between the chapters “Professional Skills” and “Professional Experience”. To extract the skills from the XML-files, we can use XPATH queries in R. R is a powerful and extensible open source Analytics tool.

Further down this blog post, we’ll use R to visualize the skills, but for now we only use the XML package to parse the Word documents.

doc = xmlInternalTreeParse(x,addAttributeNamespaces = FALSE)
src = xpathSApply(doc, "(//p[following::p[contains(translate(text(), 'PROFESINALXRC', 'profesinalxrc'),'professional experience')] and preceding::p[contains(translate(text(), 'PROFESINALK', 'profesinalk'),'professional skills')] and (@class='opsomming' or @class='opsom2')])",xmlValue,simplify=TRUE)

This results in a list of skills, but our work is not done yet.

On the AE CVs, skills often have been combined in one sentence or phrase as part of one bullet point (e.g. “R & SAS”), descriptions have been added (e.g. “experienced in…”), the same tool is mentioned in various ways (e.g. "Microsoft Systems Center", "MSSC", "MS Systems Center" ...), sometimes even with version numbers, and so on.

This list has to be cleared further to streamline the notations and to remove any clutter. To do this, we used an excellent open source tool called Open Refine. Using R, we exported the list of extracted skills for each consultant to a CSV-file in which each row exists out of two columns: the name of the person and a single skill. Next, we performed the following actions on the second column:

  • Split on special signs such as '(', '&', ')', ...
  • Remove spaces before and after skills
  • Merge similar skills and convert to 1 specific notation
  • Delete skills which we don’t use for our analysis.

analytics_data_skilled_2 analytics_data_skilled_4 analytics_data_skilled_7

Open Refine contains a number of helpful operations to aid in this process, such as automatic clustering of similar skills, the ability to sort on the most common skills etc. The result of this step is to create a clean list of skills for each consultant, which we can then use to search and analyze.

Data analysis and visualization

Now that we’ve got a clean set of consultant skills data, we want to visualize it. We want to know which consultants have similar skills and visualize that info in an intuitive way.

We’ve chosen to do this using the Multidimensional Scaling (MDS) technique: consultants are spread out across two dimensions, with consultants with ‘similar’ skills placed closely together whereas consultants with ‘less similar’ skills are placed further apart. The amount of ‘similar’ skills between consultants is determined by looking at how many skills these consultants have in common.

To calculate the MDS results, we once again turned to R, this time with the stats package. We visualized the result using D3.js, a JavaScript library for data visualization, as shown here below. This way, we can gain insight into which consultants have similar skills, which groups/clusters we can define, which consultant profiles overlap almost completely, etc.

(Open full map in new window)


To make the cleaned-up list fully searchable, we used this list for the internal launch of an application of ours called Sk!lld.

The initial idea for Sk!lld was launched at our AE hackathon. Sk!lld is a web app everyone at AE can use to find people based on their skills and in which consultants can keep their CV up to date. The skills were added to the user profiles, after which they could be searched using Apache Solr.


Our colleagues can now keep their skills up to date inside the app and even indicate which skills they would like to develop.


Sk!lld also makes suggestions based on your search terms or based on the skills you’ve already got on your profile. We get these suggestions by applying Analytics (more specifically: Association Rule Mining). The algorithm looks for sets of skills that often appear together and determines which rules can be deducted from this. These rules are recalculated daily and integrated into the Sk!lld search engine. This lowers the threshold to add relevant skills to a user profile or to find related profiles. We once again used R to implement the Association Rule Mining algorithm, this time with the arules package.


As you can see, the information uncovered from the unstructured text documents that our consultant CVs were, turned out to be so valuable that we have made the data available to everyone at AE, so they can easily find people with specific skills.

Special thanks to Davy Sannen for his help creating this post.

Want to know more about what value analytics can unlock for your company? Contact us for more information.

Alex Wauters

Written by Alex Wauters

Post a Comment

Lists by Topic

see all

Posts by Topic

see all