The five pitfalls of document labeling - and how to avoid them

By Nick Adams, Ph.D.

Whether you call it ‘content analysis’, ‘textual data labeling’, ‘hand-coding’, or ‘tagging’, a lot more researchers and data science teams are starting up annotation projects these days. Many want human judgment labeled onto text to train AI (via supervised machine learning approaches). Others have tried automated text analysis and found it wanting. Now they’re looking for ways to label text that aren’t so hard to interpret and explain. Some just want what social scientists have always wanted: a way to analyze massive archives of human behavior (like the Supreme Court’s transcripts or diplomatic correspondence) at high scales. With so much digitized textual data now available, and so many patterns and insights to be discovered, it’s no wonder people are excited about annotation.

We always encourage researchers to dream big and tackle the most intricate and impactful questions in their fields. But annotation projects are not exactly easy (I say that as someone who has consulted on a hundred or more annotation projects since my days at UC Berkeley, where I taught research methods and founded and led text analysis organizations at the D-Lab and Berkeley Institute for Data Science). Here, I outline the five most common ways things go wrong, and offer some advice to keep you in the clear.

1. Gathering data that are too thin

This is the most common mistake researchers make. When you first read through a sample of your documents, you may become excited by a single research question and decide to just focus on labeling content that seems pertinent to that question. Or maybe you want to label more contextual information, but you’ve been advised to limit your scope so the project won’t become too large to manage.

Limiting the richness of your conceptual scheme (AKA coding scheme, ontology, or label set) might seem like the quickest way to answers and a published journal article. But it can actually prevent you from getting any useful answers, much less a publication. Reviewers will be keen to point out variables omitted from your study and suggest that you may have cherry-picked passages. This has been especially true since Biernacki’s scathing critique of all first-generation manual content analysis methods.

Moreover, while it’s tempting to quickly label what you think you need for your narrow research question, there are start-up costs associated with labeling projects. And going back to the documents to satisfy a reviewer or pull on some thread of inquiry is not a trivial amount of work. As long as you have annotators ready to go, and they’re reading through your documents, it’s worth labeling information on potentially confounding, mediating, moderating, or instrumental variables. Doing so means more than just satisfying picky reviewers. It also means doing better science, and uncovering complex relationships and mechanisms you would have missed––the kind of discoveries that will take your research from good to great.

2. Gathering too little data

If you are gathering rich data, you might think you don’t need much of it to perform your analyses. This can be true if you’re just writing up a theory memo, or simply trying to argue that some new social phenomenon or mechanism exists. But, if people find the phenomenon or mechanism at all interesting, they will soon want to know how and why it appears under various conditions. These questions can only be answered by gathering and labeling more data.

Sometimes AI and machine learning researchers succumb to this pitfall, as well. They want to use data labels to train a text classifying algorithm via a machine learning process. The problem is: they estimate the size of their annotation job based on one variable label that is rather common in their raw documents. Their resulting AI performs well when labeling for that variable. But it stinks at accurately classifying text by the other variables/labels. That’s because they have gathered too little training data on the less common variables/labels.
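One way to catch this early is a pilot pass: label a small sample, count how many examples each label actually yields, and size the full job around the rarest labels you care about. A minimal sketch of that check (the `min_examples` floor, the function name, and the data shape are illustrative assumptions, not any particular tool’s API):

```python
from collections import Counter

def underrepresented_labels(annotations, min_examples=200):
    """annotations: (text_span, label) pairs from a pilot labeling pass.
    min_examples is an illustrative floor; pick one suited to your model
    and the rarity of the phenomenon.
    Returns each label with too few examples, mapped to its count."""
    counts = Counter(label for _, label in annotations)
    return {label: n for label, n in counts.items() if n < min_examples}

# A pilot where one label is common and another is rare:
pilot = [("...", "protest event")] * 300 + [("...", "arrest")] * 12
print(underrepresented_labels(pilot))  # flags the rare label for more data
```

Any label this flags is one your classifier will likely "stink at" unless you gather more examples of it before training.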

The power of your comparative statistical analyses and of your machine learning training set depends on going the extra mile and gathering more data. It’s very often the difference between impressionistic, somewhat-better-than-nothing results and world-class, game-changing results.

3. Failing to appropriately validate your data

Language is slippery. That’s why poetry can be so beautiful and moving––and why machines can’t understand it. But it’s also why your critics will seem to have such abundant ammunition as they try to shoot holes in your evidence and conclusions. If you want to identify, count, compare, and analyze meanings recorded as textual data, you’ll need to armor up. And that means you need a comprehensive validation strategy. When someone asks how you know that your team labeled all the relevant information, and labeled it correctly, it won’t be enough to say they are very smart and dedicated. It won’t be enough to report their confidence in their own work. The research standard now requires that you show multiple independent annotators applying labels identically. Some tools make it easy to find and report this annotator consensus while others leave it to you to figure that out on your own––or more likely, plead with reviewers to accept your results as they are.


There’s more. Not only will annotators’ skills and performance vary, they will vary across the different variables/labels in your project. For instance, annotator variance will likely be much lower for straightforward labels such as ‘# of people reported in attendance’ than for more subjective labels like ‘evidence of quixotic mood’. You will need to report Krippendorff’s alpha scores for each of your variables/labels and ensure they are above a threshold of inter-rater reliability appropriate for your field and application (usually around 0.68 or higher for social science, or 0.5 or higher for machine learning). Here, again, some tools help you report these statistics easily while others require you to train up your data science skills and write some scripts.
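If you do end up writing your own scripts, Krippendorff’s alpha for nominal (categorical) labels can be computed from a coincidence matrix of annotator value pairs. A self-contained sketch, assuming each unit is coded by two or more annotators (the function name and input shape are my own, not from any annotation tool):

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """units: list of lists; each inner list holds the labels different
    annotators assigned to one unit (missing annotations simply omitted).
    Returns alpha = 1 - D_o / D_e for nominal data.
    Assumes at least two distinct categories appear overall."""
    o = Counter()  # coincidence matrix: o[(c, k)] = weighted pair count
    for values in units:
        m = len(values)
        if m < 2:
            continue  # singly-coded units contribute no pairable values
        for c, k in permutations(values, 2):
            o[(c, k)] += 1 / (m - 1)
    n_c = Counter()  # marginal totals per category
    for (c, _), w in o.items():
        n_c[c] += w
    n = sum(n_c.values())
    d_o = sum(w for (c, k), w in o.items() if c != k)       # observed disagreement
    d_e = sum(n_c[c] * n_c[k]                                # expected disagreement
              for c in n_c for k in n_c if c != k) / (n - 1)
    return 1 - d_o / d_e

# Two annotators, four units: one disagreement out of four
print(krippendorff_alpha_nominal([["a", "a"], ["a", "b"], ["b", "b"], ["b", "b"]]))
```

Run this per variable/label, and you can see exactly which label definitions need tightening before you label the full corpus.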

In addition to these metrics, it’s important to have the ability to monitor and improve your data as you proceed. Some tools allow you to calculate everything at the end. That’s better than nothing, but it is too late to make the improvements to variable/label definitions and annotator instructions that would allow you to produce higher quality labels. Ideally, you want annotation tools that allow you to monitor and adjust an annotator’s output early and often, so that you are improving your processes and ensuring sufficient label quality as you go. The best annotation tools provide machine and human data validation features, so that high consensus labels are automatically accepted and a veteran annotator can adjudicate edge cases.
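The accept-or-adjudicate logic described above can be sketched as a simple consensus rule: auto-accept when enough annotators agree, and route everything else to a veteran annotator. The 0.8 cutoff and function names here are illustrative assumptions, not any particular tool’s defaults:

```python
from collections import Counter

def triage(annotator_labels, accept_ratio=0.8):
    """annotator_labels: the labels several annotators gave one item.
    Auto-accept when a clear majority agrees; otherwise flag the item
    for adjudication. The 0.8 cutoff is illustrative; tune it per label."""
    top_label, votes = Counter(annotator_labels).most_common(1)[0]
    if votes / len(annotator_labels) >= accept_ratio:
        return ("accept", top_label)
    return ("adjudicate", None)

print(triage(["protest", "protest", "protest", "protest", "riot"]))
print(triage(["protest", "riot", "vigil"]))
```

Running a rule like this continuously, rather than at the end, is what lets you spot a drifting label definition while there is still time to fix it.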

4. Underestimating the management load

If you’re like most researchers, you’re probably more expert than you know. You spend your days talking to people who are already on your same wavelength. Then you go to conferences where the amplitude rises, but the wavelength is the same. To you, annotating your documents seems easy. All you have to do is just read through them and drag the labels onto the relevant content. No problem!

But if you’re doing a project of any size, you’ll need help. And you’ll soon find that what is so obvious to you is not so obvious to your team of annotators. Their lay definitions of variables/labels will not totally overlap with yours. The exceptions that stump them will require more explanation than you expected. And you will have to repeat yourself dozens of times until all your instructions are well documented in your codebook. Then, you will have to repeat yourself more to ensure everyone is applying the codebook similarly.

You will also have to define your annotation tasks and decide who on your team does which. You’ll need to nag your annotators to complete their work and send it back to you. And, before all that, you’ll have to train them on how to use whatever annotation tools you’ve chosen. All of this becomes a lot of work. And if you’re working with the most common form of research assistant––a university student––you’re in for an unwelcome surprise. You’ll have to repeat your training process at least once every year, if not each semester. While some of the other pitfalls can be detrimental to a completed annotation project, this may be the one that most often kills a project outright. Many a researcher has abandoned their project after two or three semesters of management headaches. To be sure you don’t suffer this fate, you’ll need to find ways to boost your management skills and capacity, or carefully choose tools that mitigate (rather than exacerbate) this pitfall.

5. Using the wrong tools for the job

Most projects that succumb to the pitfalls above are led by smart and competent professionals. They just aren’t using the right tools. It’s hard to collect rich data at a high scale using most available technology. CAQDAS tools, like AtlasTI, NVivo, MaxQDA, and DeDoose, were initially designed to help a single expert annotate a couple hundred interview transcripts. Using them with a team requires perfectly transferring your expertise to every team member, an error-prone management and documentation challenge that always affects downstream data quality. Users of CAQDAS often land in pitfall #2 (gathering too little data). Sometimes, however, users of these tools choose to reduce the nuance and complexity of their labeling, which very often lands them in pitfall #1 (gathering thin data).

Other tools like LightTag, TagTog, and DiscoverText are better for small research teams, but are designed only for very simple tasks, like labeling sentiment in tweets, or identifying named entities in documents. Users who want to dig deep into their documents with a very rich label set will soon find that they have to create, manage, and supervise many dozens of different tasks with different output to be collected, refined, and re-routed from each of their pools of annotators. Such projects become vulnerable to all pitfalls as researchers struggle to manage task delegation (#4) and data quality (#3) while they walk a tightrope between #1 and #2.

First generation crowd annotation platforms like MTurk and Figure Eight limit project managers in other ways. While their access to a global Internet workforce holds out the hope of high scale data (avoiding pitfall #2), these tools were never designed for research-grade annotation. They allow an annotator to check boxes classifying blocks of text (e.g. tagging a product description as ‘furniture’), but they don’t allow annotators to place categorical labels on specific words and phrases of text, as all the other tools do. For this reason, users of these platforms routinely find themselves in pitfall #1 or #3. Managing tasks and data is not easy with these platforms either.

Only one tool has been specifically designed to help you gather all the data you want from every single item in a giant set of documents. It was created by a social scientist and research software engineer determined to overcome all the pitfalls listed above. It’s called TagWorks, and it is a second generation crowd annotation platform that efficiently guides volunteer & crowdworker efforts via custom-rigged annotation assembly lines. Its task interfaces are designed to yield highly valid data without requiring your close training and supervision of annotators. So, your management load will be cut in half while your document throughput increases tenfold.

TagWorks data is trusted by scientists, too. With multiple features to monitor, spot-check, improve, and measure the reliability of project data and annotator performance, it has been endorsed by the global leader in social science methods, SAGE Publishing. They even invested in TagWorks’ parent company. With TagWorks, you can have your expertise applied to your documents in months not years––no trade-offs required.

To get more helpful tips on annotation projects, sign up for our email list here. And if you’d like to schedule a free consultation as you plan your next project, email the TagWorks team at


Nick Adams is an expert in social science methods and natural language processing, and the CEO of Thusly Inc., which provides TagWorks as a service. He holds a doctorate in sociology from the University of California, Berkeley, and is the founder and Chief Scientist of the Goodly Labs, a tech-for-social-good non-profit based in Oakland, CA.