Anonymizing Reviewer Identity in Microsoft Word 2011 with Ruby and Automator

Microsoft Word 2011 lets you turn on the Track Changes feature and distribute your files to friends and colleagues. You can then see their comments and corrections.

Now consider the case of an editor of a scientific journal. The workflow goes something like this:

  1. Receive article from fresh-faced graduate student.
  2. Pick three reviewers.
  3. Receive article back from reviewers, bathed in blood-red ink. Figuratively speaking, of course.
  4. Anonymize reviewer comments.
  5. Return paper to graduate student, who reads it and weeps.

Notice step 4, in which the identity of the reviewers is cast into shadow. That way the graduate student only knows to hate "Reviewer number 2" instead of a specific person.

Microsoft's solution to this is to strip all personally identifiable information (which they like to refer to by the acronym PII) from the document in the following way:

Step 1: Open a saved copy of the document.

Step 2: Go to Word / Preferences / Security and check the box under "Privacy options" that says "Remove personal information from this file on save."

Step 3: Go to File / Save a Copy... and save a copy of the file, naming it something like MyFile-clean.docx.

This file now has no more metadata in it. The comments that have been made using the Track Changes feature have been anonymized into a single entity called "Author".

That's great. But there is a problem. In our scientific journal editor workflow example above, the graduate student doesn't know who to hate. Typically you'll get two good reviewers and then one who is just out in left field. This reviewer has just been turned down for promotion, missed a big grant, and got some bad gas station coffee. We want to isolate that reviewer, not mix in the grumpy comments with the good.

That got me thinking.

Doesn't Microsoft now use an open standard, Office Open XML (OOXML), for all of their documents? Can't we just go in there and tweak things a bit?

If we could sift through the contents of the XML and find the bits where the reviewers are identified, we could enumerate them and then anonymize them on an individual basis.

So that's what I did. I wrote a Ruby script that basically does the following:

  • Make a copy of the original file.

  • Extract the Open Packaging Conventions contents (a .docx file is a ZIP archive of XML files) to a working directory.

  • Reach in and find all attributes of authors and their initials (w:author and w:initials). They look like this in the wild:

<w:comment w:id="13" w:author="John Doe" w:date="2010-10-29T13:37:00Z" w:initials="JD">

  • Transform the attributes into the following for each identified author:

<w:comment w:id="13" w:author="Reviewer 1" w:date="2010-10-29T13:37:00Z" w:initials="">

  • Repackage the files using the Open Packaging Conventions.

  • Present the user with filename-clean.docx.

The reviewers are now conveniently renamed Reviewer 1, Reviewer 2, etc., and their initials are empty strings.
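
To make the renaming step concrete, here is a minimal Ruby sketch of the transformation described above. It is not the script itself, which leans on sed. It uses plain regular-expression substitution on a string of XML, and the method name anonymize_reviewers and the mapping hash are made up for illustration. It assumes attribute values are written with double quotes, as in the example above.

```ruby
# Rewrites w:author / w:initials attributes in a string of WordprocessingML.
# Each distinct author name gets a stable label: "Reviewer 1", "Reviewer 2", ...
# `mapping` is a hypothetical hash (original name => label) that can be shared
# across calls so the numbering stays the same in every XML part.
def anonymize_reviewers(xml, mapping = {})
  # Replace each author with a numbered label, in order of first appearance.
  renamed = xml.gsub(/w:author="([^"]*)"/) do
    name = Regexp.last_match(1)
    mapping[name] ||= "Reviewer #{mapping.size + 1}"
    %(w:author="#{mapping[name]}")
  end
  # Blank out the initials entirely.
  renamed.gsub(/w:initials="[^"]*"/, 'w:initials=""')
end

sample = '<w:comment w:id="13" w:author="John Doe" ' \
         'w:date="2010-10-29T13:37:00Z" w:initials="JD">'
puts anonymize_reviewers(sample)
```

Passing the same hash for every XML part that is processed keeps the numbering consistent, so a given reviewer gets the same label in comments and in tracked changes.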

Now I know what you're thinking. The correct way to do this is to use SAX or DOM and a proper XML parser to reach into portions of the tree and change attributes. You'll want to use XSLT. You'll want to read the 5,220-page Office Open XML Part 4 - Markup Language Reference. You can do that if you want to. But that's not what I did (OK, I admit to downloading it and skimming sections). I took the quick and dirty way and used a gnarled old copy of sed.

The problem, though, is that my colleagues are unlikely to whip open a terminal and run a Ruby script. They want a nice, self-contained little droplet that they can drop a .docx file onto and magically a cleaned version will appear.

"Easy!" I thought. "I can do that with Automator!" Well, sort of. It turns out there is no good way to raise an informative error if you save your Ruby script as an Automator app that uses the Run Shell Script action. Instead, you get this unhelpful dialog box: 'The action "Action Name" encountered an error. Check the action's properties and try running the workflow again.' This is only going to confuse my colleagues who, I repeat, expect to just drop a file onto an icon and have it work. If they mistakenly drop a text file on there, or a .doc instead of a .docx, I want them to receive a nice error message like, "This only works with .docx files. The file you dropped on me was not a .docx file."

Using Ruby's raise only puts the error in the Automator log and still brings up the aforementioned dialog box. Enter...AppleScript. Sigh. It always comes back to AppleScript, doesn't it?

So I added a brief AppleScript to error-check the dropped file. If the file has a suffix other than .docx, the script displays a helpful dialog box. Otherwise, it passes the file through to the next action in the workflow (that is, the Ruby script) and all is well.
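
For illustration, the suffix check amounts to something like the following Ruby sketch. This is not the AppleScript used in the workflow, and it cannot put up a dialog box on its own. It only shows the decision being made. The method name docx_problem is hypothetical, and treating the suffix case-insensitively is an assumption.

```ruby
# Returns nil when the path has a .docx suffix, otherwise a message suitable
# for showing to the user. Hypothetical helper; the real check is AppleScript.
def docx_problem(path)
  # Assumption: ".DOCX" should be accepted as well as ".docx".
  return nil if File.extname(path).downcase == ".docx"

  "This only works with .docx files. " \
    "The file you dropped on me was not a .docx file."
end
```

A caller would show the returned message when it is not nil, and otherwise hand the file on to the anonymizing step.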

I give you now the pain of my labors, in both Workflow and Application form. I don't guarantee that it will work for you. I don't guarantee that it will expunge all mention of everyone in the document. I don't even guarantee that it will download to your browser without setting the internet on fire. Remember, nothing in life is guaranteed. Especially this software which you're downloading for free off of some guy's internet website.

Update February 2014: OS X 10.9 Mavericks changed the system Ruby to version 2.0.0. The attached application named (note the underscore) has been changed to work on 10.9. But before it will run, you will need to open Utilities / Terminal and type:

sudo gem install ftools


The app works perfectly for me. I followed the instructions and entered 'sudo gem install ftools' in Terminal. I'm running Word for Mac 2011 on Sierra 10.12.4. Thanks so much!