Talking Papers: a world without data entry?

November 16, 2009

Humanitarian Data Collection 2.0

Last week at Camp Roberts, entrepreneur Todd Huffman was kind enough to take me on a tour of Walking Papers, a remarkable service that allows users to print out paper maps, annotate them manually, upload them into OpenStreetMap, and use the annotations to transcribe new content. It’s like digital tracing paper. Walking Papers is a brilliant idea in its recognition that paper – like it or not – still has an important role to play in field environments.

What really caught my attention was that the paper forms Walking Papers emits encode map quadrant coordinates, as well as a unique identifier, in a 2-D barcode that is used to process annotated maps once they’ve been scanned and uploaded.  When a map is uploaded, Walking Papers is able to read the barcode and plot the location on the globe to which the scan corresponds.  Although it’s not yet possible for Walking Papers to decipher my annotations automatically, the barcode is at least machine-readable:  once the scan has been uploaded, I can take it from there to transcribe what I have drawn.  This imaginative and insightful approach got me thinking about a related problem I’ve been keen to address for some time:  data entry.  How can we use paper as a more effective channel for information flow during and after humanitarian emergencies?

Paper, Paper Everywhere

In every disaster zone and every rural development environment where I’ve worked, paper is still king when it comes to collecting structured data: assessing population needs, tracking inventory stock levels, conducting health surveys, filing situation reports, logging security incidents, and in general maintaining shared awareness of the situation unfolding on the ground. In spite of more than a decade of work by literally hundreds of organizations developing PDA-based data collection systems, the default option in the deep field remains unchanged: print out a form, take it to the field, fill it out with clipboard and stubby pencil, bring it back, and enter the data manually at a computer.

One day, hopefully soon, we won’t need paper in the field.  But that day is still years away.   There are many reasons for paper’s continuing status as the tool of choice for field data collection.  It’s cheap.  It’s light.  It’s compact. It doesn’t need recharging.  It doesn’t need Internet access. It’s a familiar – and spacious — form factor.  It works in hot weather or cold.  It can be read under bright sun.  It’s not affected by dust.  Yes, it fares poorly if it gets wet, or torn, or smudged, and options for sensor integration and data validation are…lacking, but on balance, as a data collection tool beyond the edge of the network, it still has a lot going for it.  As a tool for data transport, however, particularly in crises where time is of the essence, paper is inefficient and ineffective.

Yet Another “Last Mile” Problem

If you depend on the data collected on paper forms to understand the needs of vulnerable populations and make decisions that affect their welfare, paper is the weakest link in your information supply chain.  At virtually every stage in a paper-based process, there is room for human error to alter or lose critical data:  when it’s written down, during transport, when it’s read, when it’s entered into a database.  Paper is a fragile medium to begin with, but paper in the hands of hot, tired, busy, stressed-out relief workers in the chaos of a major disaster is fraught with problems.  As long as paper is used for data collection, error and data loss will continue to reduce the effectiveness of humanitarian coordination, and unless someone invents self-validating paper, it’s hard to see ways that technology can help here anytime soon.

An Opportunity

There is, however, one shortcoming of paper that we might be able to address today.  Virtually everywhere in the world of relief and development, completed paper forms accumulate in piles until someone has the time to enter the data manually into a spreadsheet, database, or other application.  Data entry is not only a juncture where errors tend to be introduced; it’s also the point that tends to contribute most heavily to latency in the flow of humanitarian information.   When critical information needed to match needs to resources reaches decision-makers too late, coordination breaks down, further delays are introduced, resources are misallocated, and too little arrives too late to help a population in need.  Components of a potential solution to this data entry problem already exist, though no one seems to have solved it decisively. Before I suggest where we might go, I need to explain why current tools haven’t filled the gap.

Limitations of OCR

Optical Character Recognition (OCR) technology, for example, has been around for decades and has improved markedly in recent years. Sahana, who brought a team to Camp Roberts, have already done some excellent work in configuring their disaster management system to emit OCR-friendly forms, and I am convinced that such approaches have tremendous potential to increase the viability of OCR in the field and the quality of the data captured. But there’s a reason OCR alone won’t eliminate the need for data entry in humanitarian work the way it has, say, for many of the forms we complete in a non-crisis setting. A major limitation in applying OCR to paper forms in a humanitarian context is that the underlying schema of the data being collected is itself in a state of constant flux.

Emergencies are by their very nature dynamical systems characterized by emergent effects. Weather, disease, and natural hazards may worsen conditions without warning. Poorly understood needs – or poor communications – may have secondary effects that change the situation on the ground dramatically. Populations affected by the situation may respond in unforeseen ways, constructive and destructive, that alter both the availability of resources and their need for them. Political decisions, news reports, and the choice of a single word may all change the course of events. What one thought one needed to know yesterday may no longer be important to ask today, or it may have been the wrong question all along, and as the response moves from critical intervention to mitigation and recovery, needs keep changing.

As a result of this dynamic, the forms designed to assess population needs at the outset of a response soon become inadequate. Questions must be added.  Others must be removed.  The schema of the data being collected has changed, impacting form and database design.   A few days later it happens again. And again.  And layouts change, as does the wording of questions. In many cases, updated or entirely new forms are designed, printed, distributed and collected in the field. Even if OCR could be used to extract data from these forms with 100% accuracy, it would do little for a decision-maker looking to make sense of the data, because this data is organized according to an unfamiliar schema that emerged at the edge of the network.

Self-Describing Paper

Walking Papers, however, suggests a way forward. That little barcode in the corner in effect contains a machine-readable schema for the map annotations, and it got me thinking about an article I read several years ago which noted that PDF-based forms could potentially encode their schemas automatically within 2-D barcodes. I find this idea fascinating. Print such a PDF, and you have a paper analog of XML: a self-describing document, machine- and human-readable, that contains both data and the schema describing that data. It’s a paper form that tells you what it is. After a chat with Mike Migurski of Walking Papers, I’m code-naming the concept “Talking Papers”, and I’m hoping to get a team together to work on making it a reality.

Talking Papers

Imagine the following scenario. There has been a major earthquake, and you’re a nutrition expert working in the Food Security cluster in a makeshift office near the center of the affected area. You design a Household Nutrition Survey form on your laptop, pair it with a Walking Papers map of a village, print out 100 copies, and hand them out to a few trusted local volunteers to take house to house. As completed forms come back to you, you quickly scan them into your PC – no waiting for time to perform cumbersome manual data entry. Auto-magically, the data and metadata are extracted from the form and – Internet access permitting – uploaded directly into an online, collaborative environment where you and your colleagues review, correct, and validate the data against its schema. Once scrubbed, the data moves on to the next step in the supply chain: some download it in one or more standard formats, while others publish it into online repositories such as Freebase, GeoCommons, DevInfo, Sahana, Mesh4X, RapidSMS, and GATHER for mapping, analysis, and sharing.

Building Blocks

I’ve bounced this idea off of a few folks already, including John Crowley, Matt Berg, Chamindra De Silva, Todd Huffman, Ed Jezierski, and Chris Blow.  We’ve agreed that making this work will require, at a minimum:

1) a tool to create printable forms,

2) a tool to read uploaded scans of completed forms, and

3) a tool to review, scrub, and publish data once it has been extracted.

Ideally, building, reading, and scrubbing features should be available offline, since Internet access is a scarce commodity in places where Talking Papers would be most useful, but it probably makes sense to get an online, browser-based version up and running first to get user input as quickly as possible.  I think each tool should exist as a completely separate service, as there may be other uses for such capabilities.  Where existing tools can be modified to address the requirements described above, I’m all for it.

Below are a few initial thoughts on building blocks.

1.  Form Generator

This should be a user-friendly online tool with a drag-and-drop interface that allows users to design text-entry-friendly, OCR-friendly forms with an option to export to PDF. The tool would encode a serialized version of the schema in a supported standard format (e.g., Turtle, XForms) within a band of high-capacity 2-D barcodes directly on the form. The barcodes should be duplicated across both the top and the bottom of the printable form for redundancy. Fields on the form, in addition to human-readable text labels, might have tiny machine-readable labels – perhaps also in barcode format – that associate the values that follow with data elements in the encoded schema. When a form is created, the designer would specify a default URL for the data-scrubbing workspace to which scans will be uploaded and processed, so that URL could also be encoded in the barcode, making Talking Papers not only self-describing but self-routing. Ideally the tool would be able to import schemas in standard formats generated by other tools and let users work with those as a starting point for form layout.
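To make the self-describing, self-routing idea a little more concrete, here is a minimal sketch of how a schema and a workspace URL might be packed into a single text string that a 2-D barcode could carry. Everything in it is an assumption for illustration: JSON stands in for Turtle or XForms, and the `TP1` prefix, the function names, and the example URL are hypothetical. Rendering the string as an actual barcode symbol is left out.

```python
import base64
import binascii
import json
import zlib


def encode_form_payload(schema: dict, workspace_url: str) -> str:
    """Pack a form schema and its routing URL into barcode-ready text."""
    document = {"schema": schema, "workspace_url": workspace_url}
    # Sorted keys and compact separators keep the output deterministic and small.
    raw = json.dumps(document, sort_keys=True, separators=(",", ":")).encode("utf-8")
    # A CRC32 of the uncompressed bytes lets a reader detect a damaged scan.
    checksum = zlib.crc32(raw)
    body = base64.urlsafe_b64encode(zlib.compress(raw, 9)).decode("ascii")
    # "TP1" is a hypothetical format tag, not an existing standard.
    return f"TP1:{checksum:08x}:{body}"


def decode_form_payload(text: str) -> dict:
    """Reverse encode_form_payload; raise ValueError if the payload is bad."""
    prefix, checksum_hex, body = text.split(":", 2)
    if prefix != "TP1":
        raise ValueError("not a Talking Papers payload")
    try:
        raw = zlib.decompress(base64.urlsafe_b64decode(body))
    except (binascii.Error, zlib.error) as exc:
        raise ValueError("payload is corrupted") from exc
    if zlib.crc32(raw) != int(checksum_hex, 16):
        raise ValueError("checksum mismatch")
    return json.loads(raw)
```

The same string would be printed in both the top and bottom barcode bands, so a reader could fall back to the second copy if the first fails its checksum.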

Here is a mock-up of a Talking Papers form I’ve annotated in red, based on a Sahana OCR form:

[Image: mock-up of a Talking Papers form, annotated in red]

2.  Forms Reader

This code library, like the Form Generator, would ideally be embeddable in both online and offline services.  It would be able to process uploaded scans of completed forms, performing OCR on the data entered while also extracting the schema and target data-scrubbing workspace from within the barcode.  Ideally it should be able to tag segments in the data with an OCR confidence level to assist with scrubbing.  Once the data has been extracted, both it and the schema should be pushed into the Data-Scrubber.  This payload should also include the UID of the associated Walking Papers map, if any.
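As a rough sketch of what the Forms Reader might hand to the Data Scrubber, the snippet below bundles OCR results, each tagged with a confidence level, together with the schema, the target workspace, and an optional Walking Papers map UID. The class name, function name, and dictionary keys are all hypothetical, and the schema layout (a `fields` list of named elements) is an assumption; the OCR and barcode-decoding steps themselves are not shown.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class ExtractedField:
    name: str          # data element name taken from the form's schema
    value: str         # text recognised by OCR (empty if nothing was read)
    confidence: float  # OCR confidence between 0.0 and 1.0


def build_scrubber_payload(
    schema: dict,
    workspace_url: str,
    fields: List[ExtractedField],
    map_uid: Optional[str] = None,
) -> dict:
    """Bundle one scanned form's data and schema for the Data Scrubber."""
    known = {element["name"] for element in schema["fields"]}
    unknown = [f.name for f in fields if f.name not in known]
    if unknown:
        # A field the schema does not describe suggests a mis-read form.
        raise ValueError(f"fields not in schema: {unknown}")
    read = {f.name for f in fields}
    return {
        "workspace_url": workspace_url,
        "schema": schema,
        "map_uid": map_uid,  # None when no Walking Papers map is attached
        "record": [asdict(f) for f in fields],
        "missing": sorted(known - read),  # schema fields OCR did not return
    }
```

Carrying the confidence alongside each value, rather than discarding low-confidence reads, leaves the judgment call to the people doing the scrubbing.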

3.  Data Scrubber

This online application would help users clean up data sets within collaborative workspaces where they can review, edit, and publish data processed by the Forms Reader. A simple data grid UI would be a great start, and Google Spreadsheets would probably take us part of the way there. Data successfully extracted should be displayed in rows, with columns corresponding to each field in the schema. Some visual indication, perhaps coloring or shading, could show where content is suspect or draw the eye to blank areas where OCR failed entirely. It might also be helpful to have a feature that detects and highlights or clusters duplicate entries. Each row should contain a link to the original scan to assist users in inferring the original intent of the individual who completed the form (e.g., to review content that OCR could not interpret, as well as marginalia and other annotations). Once the user is happy with the content, hooks should be provided to allow him or her to download the data in common formats or push the data into a variety of repositories.
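The flagging and duplicate-detection ideas could start out as something very simple. The sketch below assumes each row is a dictionary mapping a column name to a `(value, confidence)` pair; that layout, the function names, and the 0.8 threshold are all hypothetical choices for illustration, and duplicates are matched only on exact values after trimming and lower-casing.

```python
def flag_cells(rows, columns, threshold=0.8):
    """Return {(row_index, column): "blank" or "suspect"} for cells to review."""
    flags = {}
    for index, row in enumerate(rows):
        for column in columns:
            # A column missing from the row is treated like an empty read.
            value, confidence = row.get(column, ("", 0.0))
            if not value.strip():
                flags[(index, column)] = "blank"
            elif confidence < threshold:
                flags[(index, column)] = "suspect"
    return flags


def find_duplicates(rows, columns):
    """Group the indexes of rows whose normalised values match in every column."""
    groups = {}
    for index, row in enumerate(rows):
        key = tuple(row.get(column, ("", 0.0))[0].strip().lower() for column in columns)
        groups.setdefault(key, []).append(index)
    return [indexes for indexes in groups.values() if len(indexes) > 1]
```

A grid UI could use the first result to shade cells and the second to cluster rows, while the link to the original scan lets a reviewer decide whether two matching rows really are the same household.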

How to Get Involved

Chris Blow has generously agreed to contribute to the design process; he has already stood up a repository and has begun working on scenarios. We need your ideas, suggestions and concerns. We need designers, developers, testers, and user-practitioners willing to test this system in the field and help us shape it into something genuinely useful. Substantial early user input, an agile, open-source, collaborative process, support for open data standards, and well-designed mashup-friendly APIs will be critical.

If you are interested in contributing to this effort, please contact me.

Closing Thoughts

Most of the concepts underlying Talking Papers are not new. Many of the required building blocks already exist in some form. But these capabilities, as far as I know, have never been brought together in a simple, flexible implementation that will actually work in the humanitarian field. What if we could design a system that generates and reads such forms, creating a seamless bidirectional bridge between paper and applications that takes you from data collection to data scrubbing in one hop – skipping data entry entirely? Can you imagine a paper-based XForms client? I believe strongly that this kind of technology could help to streamline the information supply chain in humanitarian operations dramatically, allowing those who depend on such information to save lives.