Talking Papers: a world without data entry?

Humanitarian Data Collection 2.0

Last week at Camp Roberts, entrepreneur Todd Huffman was kind enough to take me on a tour of  Walking Papers, a remarkable service that allows users to print out paper maps, annotate them manually, upload them into OpenStreetMap, and use the annotations to transcribe new content.  It’s like digital tracing paper.  Walking Papers is a brilliant idea in its recognition that paper – like it or not — still has an important role to play in field environments.

What really caught my attention was that the paper forms Walking Papers emits encode map quadrant coordinates, as well as a unique identifier, in a 2-D barcode that is used to process annotated maps once they’ve been scanned and uploaded.  When a map is uploaded, Walking Papers is able to read the barcode and plot the location on the globe to which the scan corresponds.  Although it’s not yet possible for Walking Papers to decipher my annotations automatically, the barcode is at least machine-readable:  once the scan has been uploaded, I can take it from there to transcribe what I have drawn.  This imaginative and insightful approach got me thinking about a related problem I’ve been keen to address for some time:  data entry.  How can we use paper as a more effective channel for information flow during and after humanitarian emergencies?

Paper, Paper Everywhere

In every disaster zone and every rural development environment where I’ve worked, paper is still king when it comes to collecting structured data: assessing population needs, tracking inventory stock levels, conducting health surveys, filing situation reports, logging security incidents, and in general maintaining shared awareness of the situation unfolding on the ground.  In spite of more than a decade of work by literally hundreds of organizations developing PDA-based data collection systems, the default option in the deep field remains unchanged: print out a form, take it to the field, fill it out with clipboard and stubby pencil, bring it back, and enter the data manually at a computer.

One day, hopefully soon, we won’t need paper in the field.  But that day is still years away.   There are many reasons for paper’s continuing status as the tool of choice for field data collection.  It’s cheap.  It’s light.  It’s compact. It doesn’t need recharging.  It doesn’t need Internet access. It’s a familiar – and spacious — form factor.  It works in hot weather or cold.  It can be read under bright sun.  It’s not affected by dust.  Yes, it fares poorly if it gets wet, or torn, or smudged, and options for sensor integration and data validation are…lacking, but on balance, as a data collection tool beyond the edge of the network, it still has a lot going for it.  As a tool for data transport, however, particularly in crises where time is of the essence, paper is inefficient and ineffective.

Yet Another “Last Mile” Problem

If you depend on the data collected on paper forms to understand the needs of vulnerable populations and make decisions that affect their welfare, paper is the weakest link in your information supply chain.  At virtually every stage in a paper-based process, there is room for human error to alter or lose critical data:  when it’s written down, during transport, when it’s read, when it’s entered into a database.  Paper is a fragile medium to begin with, but paper in the hands of hot, tired, busy, stressed-out relief workers in the chaos of a major disaster is fraught with problems.  As long as paper is used for data collection, error and data loss will continue to reduce the effectiveness of humanitarian coordination, and unless someone invents self-validating paper, it’s hard to see ways that technology can help here anytime soon.

An Opportunity

There is, however, one shortcoming of paper that we might be able to address today.  Virtually everywhere in the world of relief and development, completed paper forms accumulate in piles until someone has the time to enter the data manually into a spreadsheet, database, or other application.  Data entry is not only a juncture where errors tend to be introduced; it’s also the point that tends to contribute most heavily to latency in the flow of humanitarian information.   When critical information needed to match needs to resources reaches decision-makers too late, coordination breaks down, further delays are introduced, resources are misallocated, and too little arrives too late to help a population in need.  Components of a potential solution to this data entry problem already exist, though no one seems to have solved it decisively. Before I suggest where we might go, I need to explain why current tools haven’t filled the gap.

Limitations of OCR

Optical Character Recognition (OCR) technology, for example, has been around for decades and has improved markedly in recent years.  Sahana, who brought a team to Camp Roberts, have already done some excellent work in configuring their disaster management system to emit OCR-friendly forms, and I am convinced that such approaches have tremendous potential to increase the viability of OCR in the field and the quality of the data captured.  But there’s a reason OCR alone won’t eliminate the need for data entry in humanitarian work the way it has, say, for many of the forms we complete in a non-crisis setting.  A major limitation in applying OCR to paper forms in a humanitarian context is that the underlying schema of the data being collected is itself in a state of constant flux.

Emergencies are by their very nature dynamical systems characterized by emergent effects.  Weather, disease, and natural hazards may worsen conditions without warning.  Poorly understood needs – or poor communications – may have secondary effects that change the situation on the ground dramatically.  Populations affected by the situation may respond in unforeseen ways – constructive and destructive – that alter both the availability of resources and their need for them.  Political decisions, news reports, and the choice of a single word may all change the course of events.  What one thought one needed to know yesterday may no longer be important to ask today, or it may have been the wrong question all along, and as the response moves from critical intervention to mitigation and recovery, needs keep changing.

As a result of this dynamic, the forms designed to assess population needs at the outset of a response soon become inadequate. Questions must be added.  Others must be removed.  The schema of the data being collected has changed, impacting form and database design.   A few days later it happens again. And again.  And layouts change, as does the wording of questions. In many cases, updated or entirely new forms are designed, printed, distributed and collected in the field. Even if OCR could be used to extract data from these forms with 100% accuracy, it would do little for a decision-maker looking to make sense of the data, because this data is organized according to an unfamiliar schema that emerged at the edge of the network.

Self-Describing Paper

Walking Papers, however, suggests a way forward. That little barcode in the corner in effect contains a machine-readable schema for the map annotations, and it got me thinking about an article I read several years ago which noted that PDF-based forms could potentially encode their schemas automatically within 2-D barcodes.  I find this idea fascinating.  Print such a PDF, and you have a paper analog of XML:  a self-describing document, machine- and human-readable, that contains both data and the schema describing that data.  It’s a paper form that tells you what it is.  After a chat with Mike Migurski of Walking Papers, I’m code-naming the concept “Talking Papers”, and I’m hoping to get a team together to work on making it a reality.

Talking Papers

Imagine the following scenario.  There has been a major earthquake, and you’re a nutrition expert working in the Food Security cluster in a makeshift office near the center of the affected area. You design a Household Nutrition Survey form on your laptop, pair it with a Walking Papers map of a village, print out 100 copies, and hand them out to a few trusted local volunteers to take house to house.  As completed forms come back to you, you quickly scan them into your PC – no waiting for time to perform cumbersome manual data entry.  Auto-magically, the data and metadata are extracted from the form and – Internet access permitting — uploaded directly into an online, collaborative environment where you and your colleagues review, correct, and validate the data against its schema.  Once scrubbed, the data moves on to the next step in the supply chain:  some download it in one or more standard formats, while others publish it into online repositories such as Freebase, GeoCommons, DevInfo, Sahana, Mesh4X, RapidSMS, GATHER, etc. for mapping, analysis, and sharing.

Building Blocks

I’ve bounced this idea off of a few folks already, including John Crowley, Matt Berg, Chamindra De Silva, Todd Huffman, Ed Jezierski, and Chris Blow.  We’ve agreed that making this work will require, at a minimum:

1) a tool to create printable forms,

2) a tool to read uploaded scans of completed forms, and

3) a tool to review, scrub, and publish data once it has been extracted.

Ideally, building, reading, and scrubbing features should be available offline, since Internet access is a scarce commodity in places where Talking Papers would be most useful, but it probably makes sense to get an online, browser-based version up and running first to get user input as quickly as possible.  I think each tool should exist as a completely separate service, as there may be other uses for such capabilities.  Where existing tools can be modified to address the requirements described above, I’m all for it.

Below are a few initial thoughts on building blocks.

1.  Form Generator

This should be a user-friendly online tool with a drag-and-drop interface that allows users to design text-entry-friendly, OCR-friendly forms with an option to export to PDF.  The tool would encode a serialized version of the schema in a supported standard format (e.g., Turtle, XForms) within a band of high-capacity 2-D barcodes directly on the form.  The barcodes should be duplicated across both the top and the bottom of the printable form for redundancy.  Fields on the form, in addition to human-readable text labels, might have tiny machine-readable labels – perhaps also in barcode format – that associate the values that follow with data elements in the encoded schema.  When a form is created, the designer would specify a default URL for the data-scrubbing workspace to which scans will be uploaded and processed, so that URL could also be encoded in the barcode – making Talking Papers not only self-describing, but self-routing. Ideally the tool would be able to import schemas in standard formats generated by other tools and let users work with those as a starting point for form layout.
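To make the barcode band a little more concrete, here is a minimal sketch in Python of how a schema and its workspace URL might be packed into barcode-sized chunks and recovered from a mix of top-band and bottom-band copies. Everything specific in it is an assumption for illustration: JSON stands in for Turtle or XForms, the 500-byte chunk size is a guess at what one 2-D barcode could hold, and the function names are hypothetical. Rendering and reading the actual barcodes is left out entirely.

```python
import json
import struct
import zlib

CHUNK_BYTES = 500  # assumed capacity of one 2-D barcode (hypothetical figure)
HEADER = ">HHI"    # chunk index, chunk count, CRC32 of the chunk body
HEADER_SIZE = struct.calcsize(HEADER)


def schema_to_chunks(schema, workspace_url, chunk_bytes=CHUNK_BYTES):
    """Serialize schema and routing URL, compress, and split into chunks."""
    payload = json.dumps(
        {"schema": schema, "workspace": workspace_url}, sort_keys=True
    ).encode("utf-8")
    packed = zlib.compress(payload)
    parts = [packed[i:i + chunk_bytes] for i in range(0, len(packed), chunk_bytes)]
    chunks = []
    for index, part in enumerate(parts):
        header = struct.pack(HEADER, index, len(parts), zlib.crc32(part))
        chunks.append(header + part)
    return chunks


def chunks_to_schema(chunks):
    """Rebuild the payload from whatever undamaged chunks were scanned."""
    good = {}
    total = None
    for chunk in chunks:
        if len(chunk) <= HEADER_SIZE:
            continue  # too short to be a real chunk
        index, count, crc = struct.unpack(HEADER, chunk[:HEADER_SIZE])
        body = chunk[HEADER_SIZE:]
        if zlib.crc32(body) != crc:
            continue  # damaged copy; the duplicate band may have a clean one
        good[index] = body
        total = count
    if total is None or len(good) != total:
        raise ValueError("not enough undamaged chunks to rebuild the schema")
    packed = b"".join(good[i] for i in range(total))
    return json.loads(zlib.decompress(packed).decode("utf-8"))
```

Printing the same list of chunks at the top and bottom of the page means the reader can pass every chunk it manages to decode, from either band, into `chunks_to_schema`; a smudged chunk in one band is simply skipped as long as its twin survives.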

Here is a mock-up of a Talking Papers form I’ve annotated in red, based on a Sahana OCR form:

[Image: mock-up of a Talking Papers form, annotated in red, based on a Sahana OCR form]

2.  Forms Reader

This code library, like the Form Generator, would ideally be embeddable in both online and offline services.  It would be able to process uploaded scans of completed forms, performing OCR on the data entered while also extracting the schema and target data-scrubbing workspace from within the barcode.  Ideally it should be able to tag segments in the data with an OCR confidence level to assist with scrubbing.  Once the data has been extracted, both it and the schema should be pushed into the Data Scrubber.  This payload should also include the UID of the associated Walking Papers map, if any.
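As a rough illustration of the payload described above, the following Python sketch bundles the decoded schema, the workspace URL, per-field OCR text with a confidence value, and an optional Walking Papers map UID. The class and field names are hypothetical, the schema is assumed to be a dictionary with a `fields` list, and the OCR results are assumed to arrive as a simple mapping from field name to a (text, confidence) pair; a real OCR engine would report its results in its own format.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class FieldReading:
    element: str       # data element name taken from the embedded schema
    text: str          # the OCR engine's best guess at what was written
    confidence: float  # assumed to be a score between 0.0 and 1.0


@dataclass
class ScanPayload:
    schema: dict                   # schema decoded from the barcode band
    workspace_url: str             # where the Data Scrubber should receive it
    readings: list                 # one FieldReading per schema field
    scan_ref: str                  # pointer back to the original scan image
    map_uid: Optional[str] = None  # Walking Papers map UID, if any


def build_payload(decoded_barcode, ocr_results, scan_ref, map_uid=None):
    """Combine barcode contents and OCR output into one payload."""
    schema = decoded_barcode["schema"]
    readings = []
    for element in schema["fields"]:
        # A field the OCR step could not read at all becomes an empty
        # reading with zero confidence, so it shows up during scrubbing.
        text, confidence = ocr_results.get(element, ("", 0.0))
        readings.append(FieldReading(element, text, confidence))
    return ScanPayload(
        schema=schema,
        workspace_url=decoded_barcode["workspace"],
        readings=readings,
        scan_ref=scan_ref,
        map_uid=map_uid,
    )
```

Calling `asdict()` on the result gives a plain dictionary that could be serialized and posted to the workspace URL. Keeping `scan_ref` in the payload is what lets each row in the scrubbing tool link back to the original scan.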

3.  Data Scrubber

This online application would help users clean up data sets within collaborative workspaces where they can review, edit, and publish data processed by the Forms Reader.  A simple data grid UI would be a great start, and Google Spreadsheets would probably take us part of the way there.  Data successfully extracted should be displayed in rows, with columns corresponding to each field in the schema.  Some visual indication – perhaps coloring or shading?  — could indicate where content was suspect, or could draw the eye to blank areas where OCR failed entirely. It might also be helpful to have a feature that detects and highlights or clusters duplicate entries. Each row should contain a link to the original scan to assist users in inferring the original intent of the individual who completed the form (e.g., to review content that OCR could not interpret, as well as marginalia and other annotations).  Once the user is happy with the content, hooks should be provided to allow him or her to download the data in common formats or push the data into a variety of repositories.
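The two review aids mentioned above, shading suspect cells and clustering duplicates, could start out very simply. The Python sketch below is one possible starting point under stated assumptions: the 0.80 confidence threshold is arbitrary, the function names are hypothetical, and duplicates are detected only by exact match after normalizing case and whitespace, which is far cruder than real record linkage.

```python
from collections import defaultdict

SUSPECT_BELOW = 0.80  # assumed confidence threshold, to be tuned in practice


def cell_status(text, confidence, suspect_below=SUSPECT_BELOW):
    """Classify one cell so the grid can color or shade it."""
    if not text.strip():
        return "blank"    # OCR failed entirely or the field was left empty
    if confidence < suspect_below:
        return "suspect"  # readable, but worth a look at the original scan
    return "ok"


def duplicate_groups(rows, key_fields):
    """Group row numbers whose key fields match after normalization."""
    groups = defaultdict(list)
    for row_number, row in enumerate(rows):
        key = tuple(
            " ".join(str(row.get(name, "")).lower().split())
            for name in key_fields
        )
        groups[key].append(row_number)
    return [numbers for numbers in groups.values() if len(numbers) > 1]
```

For example, `duplicate_groups(rows, ["household_id", "head_of_household"])` would return lists of row numbers that appear to describe the same household, which the grid could then highlight together for a human to confirm or dismiss.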

How to Get Involved

Chris Blow has generously agreed to contribute to the design process; he has already stood up a repository at and has begun working on scenarios.  We need your ideas, suggestions and concerns.  We need designers, developers, testers, and user-practitioners willing to test this system in the field and help us shape it into something genuinely useful.  Substantial early user input, an agile, open-source, collaborative process, support for open data standards, and well-designed mashup-friendly APIs will be critical.

If you are interested in contributing to this effort, please contact me.

Closing Thoughts

Most of the concepts underlying Talking Papers are not new.  Many of the required building blocks already exist in some form.  But these capabilities as far as I know have never been brought together in a simple, flexible implementation that will actually work in the humanitarian field. What if we could design a system that generates and reads such forms, creating a seamless bidirectional bridge between paper and applications that takes you from data collection to data scrubbing in one hop – skipping data entry entirely?  Can you imagine a paper-based XForms client?  I believe strongly that this kind of technology could help to streamline the information supply chain in humanitarian operations dramatically, allowing those who depend on such information to save lives.


~ by rgkirkpatrick on November 16, 2009.

10 Responses to “Talking Papers: a world without data entry?”

  1. Hi Robert,

    I’m happy to see that you have pointed out the importance and the benefits that we can achieve through the use of paper-based forms, especially when it comes to managing post-disaster activities. With your ideas, suggestions and recommendations I’ll improve the form layout which I have been developing for Sahana.


  2. […] Talking Papers: a world without data entry? Using self-describing paper forms on disaster zone (tags: paper forms disaster) […]

  3. Thanks, Hayesha. We’ve got a Google Group to discuss the project here:

    * Group name: Talking Papers
    * Group home page:

  4. Robert – Interesting thoughts and I’d like to work with you a bit more on this idea, especially if you think a “where” component could be added into the mix. I could use some clarification though – how do you conceptually and operationally differentiate what you’re talking about from the use of something like scan-tron or bubble-fill type forms? I work pretty closely with numerous emergency management agencies and this has long been a problem for field crews, especially for situations involving thousands of people or very large geographic extents.

    You might be interested in the digital solution used by Victoria Police during the recent Australian bush fires:

    • Hi Talbot,

      Thanks for your comments. Scantron…painful childhood memories of #2 pencils. I do see an analogy here to Scantron-like forms, and indeed, in some cases, it might be appropriate to use bubbles, at least for answering multiple-choice questions. But I think the Scantron “interface” (if one can use that term about paper), although it decreases the chance of “read” error by the scanner in comparison to OCR of free text, ironically increases the chance of “write” errors by the person filling out the form. I can remember having to go back over every answer to make sure I hadn’t shifted all my answers one column to the right. Now try to get a first responder who is short on sleep and long on stress to complete such a form. That won’t go well. I can definitely imagine that this has caused problems in the past.

      I also don’t like the fact that Scantron forms take up huge amounts of real estate on the page. My hope is that we can find a way to label fields on a Talking Papers form with 1- or 2-D barcodes in such a way that the Reader application can infer which text segment the label refers to, while allowing for a decent number of data elements to be filled out on each page compared to what is possible with Scantron forms.

      Unlike Scantron forms, Talking Papers forms will be open — anyone will be able to create them. They will be highly flexible. Because the payload extracted from a Talking Papers form is eventually transformed into a self-describing XML document, users in the field with laptops should theoretically be able to create these forms while offline, print them, get them completed, and submit them into recognition systems that have no pre-configured ability to read that particular form layout, and then push the data into repositories that have no prior record of that schema.

      Anyway, there are a few thoughts on the comparison with Scantron forms. We do want Talking Papers to be as easy to use as possible. I often draw an analogy between first responders in a disaster and climbers approaching the summit of Everest. High altitude mountaineers often do things like write messages on the sleeve of their jackets such as “untie boots before removing”, because they know that in the oxygen-poor atmosphere, they are effectively working with an IQ reduced by 20 points. Post-disaster environments have a similar effect. You are certainly not at your best, from a cognitive standpoint, and you want any tools or forms you have to work with to be as intuitive as possible to use. For example, I know that even if I haven’t slept in three days, I will still be able to figure out how to play my favorite song on my iPod. I would hope we can make tools for first responders similarly straightforward to use.

      Thanks also for the link to the ESRI handheld solution. Let me be clear here. Whenever it is technically, financially, and operationally feasible to use a high-tech approach such as that, I’m in favor of doing so. Talking Papers is not an alternative to PDAs. The issue is that, particularly in the developing world, we are many years away from a paperless field, and I would like to see if Talking Papers can help us get better data sooner from paper-based systems that cannot quickly be replaced by mobile devices.

      If you have ideas on the design or approach, please join our Google Group and share them.

  5. I would also like to take a minute to extend an open invitation to all interested in crisis/emergency response mapping and related support activities to join the community of practice established through the Geospatial Information and Technology Association. It’s free to join and you’ll find a decent amount of materials (papers, webcasts, etc.) openly available. There is a Google group associated with the ERS and I would love to see more participation, especially by the likes of the forward thinking folks here. Website is below

  6. This diagnostic is so true!!!

    “Virtually everywhere in the world of relief and development, completed paper forms accumulate in piles until someone has the time to enter the data manually into a spreadsheet, database, or other application.“

    To address this issue, some actors have worked on using PDAs for data collection, while others have contracted with companies to rationalize their big surveys with forms-processing software and Optical Mark Recognition (OMR) readers. Both approaches apply only to very specific and “predictable” types of survey.
    As rightly pointed out, to improve this very specific step of data entry, we definitely miss something easy to deploy, cheap & deep field suitable.

    What you are trying to describe here would be of real added value in the humanitarian world!! The main points to stress would be that it should work easily out-of-the-box (no need for a consultant) and remain totally open source and free (no license). This would be a true conceptual difference from systems like Scantron!
    As you rightly said, it would also have to work both online (so that information can be shared and spread) and offline (to suit real field condition). There are definitely a lot of challenges out there but the game is worth the candle…

    Maybe a kind of “form builder wizard” could be added in the first “Building Block”. Data management officers in the field could then avoid, as much as possible, some common design mistakes when they need to build their own forms. Here is a sample of different “needs assessment forms” used in the field. Some of those forms are tending to become de facto standards and could be used as adaptable templates offered by the wizard (maybe an online repository of form templates should be developed?).

    You may have seen the offer posted recently to have a “needs assessment” application developed in the context of the camp management cluster.
    It sounds like what you describe here could be very useful for such kind of application…

  7. […] of Groove and Microsoft Humanitarian Systems. With the formalities out of the way, I can tear apart his first post to get at the raw meat […]

  8. Hi Robert, it’s wonderful to see an end to end re-thinking of the data lifecycle pre-“authoritative database.” I’ve a few questions/comments.

    – It seems you consider the use cases in humanitarian relief and development communities together? I’ve never worked in the former, but it seems that, for example, the wake of a hurricane and a rural health clinic have vastly different data collection requirements. Take personnel: I imagine the level of education of a team of aid workers swooping in for a short time is quite different from that of a new team of local community health workers. In terms of accessibility as a design concern, would this variability split your current effort to some extent?

    – I like the idea of imbuing pieces of paper with schema, giving each some measure of data independence. How long-lived will these pieces of paper be? If they are not going to be kept around as the original source documents, but rather are used as a transport medium for information which will arrive in its authoritative state only after transcription — how much data independence is necessary? For instance, why not just one small barcode per form keyed to a schema in your scanning/cleaning program? Maybe the point is that you wouldn’t have to have the schema in your scanning program in order to process a self-describing piece of paper – a great advantage if you foresee a future of mix-and-match form generation tools and scan-and-clean tools with a barcode-based api in between them.

    – There are form design tools in various existing applications, like MS Access and OpenMRS. Do you think it’s most useful to roll your own or create a library/plugin/web service for schema-tizing a paper form? In the web service scenario, maybe you send it a schema, it sends you barcodes. In this scenario, you provide an opportunity to “pre-integrate” your data, e.g. match your request schema elements and attributes against existing schemata. As part of the OpenII project, a suite of open source tools for information integration, we have a schema repository and various integration tools you could leverage:

    – You mention that OCR of free text is hard, but at the end of the day, your solution is still OCR – only you add a data cleaning/profiling tool to the mix. I wonder, for the OCR problem, could you lean on digital paper (e.g. Anoto)? And for the data cleaning tool, would you prefer to roll your own or integrate one of the open source tools like Talend?

    A potentially related piece of research is our recent work on Usher — where we push data cleaning techniques (specifically multivariate outlier detection) to the time of data entry for electronic forms. A wrinkle is that we assume that some data already exist from which we build a probabilistic model. The model can be helpful with tasks like form design suggestions (e.g. generate a set of constraints) and anomaly detection (esp. in the multivariate sense). More info on Usher in the link to my website below.

    Thanks for the great read. Look forward to hearing more! (I joined your google group).


  9. Kuang,

    Many thanks for your thoughtful comments! I’m thrilled to have you involved in this project. Please accept my apologies for the delay in replying. I’m in the process of relocating to the East Coast, and the logistics have been rather more demanding than I’d anticipated.

    Yes, I agree that data collection requirements in the immediate aftermath of a sudden-onset emergency do differ qualitatively from the relative routine of ongoing rural health surveillance, and to a certain extent, a split in the effort will probably be necessary. For example, as I mentioned in my original post, data collection in the context of a humanitarian crisis often consists, in part, of determining what new types of data one ought to begin collecting; this requirement underlies my conviction that Talking Papers should support schema evolution driven from the bottom up, where field workers with visibility into events on the ground may have the option to extend forms on the fly. Data collected in a global health context, by contrast, will typically be relatively inflexible, conforming to the requirements of long-standing program supported by national health information systems or databases used for reporting to donors. In such contexts, bottom up changes to schema would likely be inappropriate.

    I do, however, for several reasons, still believe that Talking Papers have potential applicability in both contexts:

    1. Even in the context of routine rural health data collection, I believe that Talking Papers represents an improvement over existing paper-based methods on the data entry side.

    2. Designing for post-crisis end users imposes stringent constraints in terms of ease of use, self-documenting interfaces, etc. It’s a bit like the NASA mentality: if we make sure it holds up in hard vacuum under 10 Gs of thrust, it will probably work in a less demanding context as well. In other words, if we do all we can to ensure that a hot, tired, distracted relief volunteer can figure out how to complete the form, our efforts will probably yield quality improvements for community health workers as well. The converse does not necessarily apply.

    3. As I’ve often noted, the most useful technology in a crisis is the one you are already using. Introduction of new and unfamiliar technologies in a post-disaster context is inherently challenging, due to the reduced “absorptive capacity” of users under pressure. I’m reminded of climbers nearing the summit of Everest, who often write notes to themselves like “untie boots before removing” because they know that in that oxygen-poor environment, they are essentially operating at a reduced I.Q. My point here is that in many of the countries where Talking Papers would be most useful, the local population faces both increased exposure to conflict, poverty, disease, natural hazards, and macroeconomic shock and relatively weak resilience to such events. In other words, if we can drive adoption of Talking Papers as a standard tool for routine data collection in these parts of the world, the mechanism will already be institutionalized and broadly familiar when a genuine crisis emerges, and those aspects of Talking Papers most valuable in emergencies will be available.

    4. Finally, Talking Papers’ inherently flexible schemas might potentially be useful in the months and years following a major crisis, as data collection requirements gradually change to support a shift from response and recovery operations to reconstruction and, hopefully, positive community adaptation.

    Your question about how long-lived printed Talking Papers forms are likely to be makes sense, and I certainly don’t know the answer, but my intention here was to maximize options. If one operates under the assumption that there will be a single master repository for data that also stores every version of every schema, then one could simplify matters tremendously by expecting that the full schema would be preserved only online, requiring that each form merely encode a reference to that schema and version. But I can also imagine scenarios where extended offline data entry, scrubbing, fusion, analysis, sharing and reporting must be supported, as well as cases where either paper copies of data, or storage devices, are lost, found, lost, and found again. Responses to sudden-onset complex emergencies are, as you know, chaotic. Bad things happen to data, software, hardware, and the people involved. Too often in years past I have seen technical solutions that strove for a performance-oriented ideal of “clean efficiency” at the expense of reliability – which often means a great deal of redundancy and local persistence of data and schema. Talking Papers at first glance might seem like overkill. But I’m aiming to create a system that is easy to punch holes in but hard to take out completely.

    Regarding integration with existing tools, I am keen to support a range of existing form-design tools, provided that we can figure out how to transform their output into a Talking Papers form and back again. Pre-integration sounds like a very appealing option – one I had not considered. I did hear a bit about OpenII from a team at Google last year, and it sounds like a good option to consider. Please let us know how you’d suggest we proceed.

    Yes, it’s true that Talking Papers is still dependent on OCR. For digital paper, Anoto would probably be a great option to explore, provided that Anoto support would be an add-on. Have you experimented with their technologies? I don’t know how pervasive digital pens are yet, or what they cost, but one (unproven) hypothesis proposed by several of my colleagues in the Open Mobile Consortium is that the cost of even relatively expensive mobile phones is soon offset by the savings gained in moving from paper to mobile data collection (paper is cheap, but fuel is not). I’d expect a move to digital paper would likely yield a substantial ROI as well, due to improvements in timeliness, accuracy, and training. It would be interesting to see where the break-even point is.

    I haven’t used Talend, though it’s supposed to be quite powerful. Do you believe it would suffice as a starting point to get us to the requirements I’ve described? Would we not run the risk of giving users more than they need? I’m really envisioning a simple interface with built-in data scrubbing helper services. As I’ve mentioned, I’d be interested in collaboration features as well, down the road. It occurred to me that something like Google Fusion Tables might be worth experimenting with as well, from this perspective, though I don’t know of any ways to extend its functionality.

    Kuang, your paper on Usher is inspiring, and I think the entire group would be interested in learning about how Usher could be used. The approach reminds me of visual processing in the retina – you start preprocessing as far upstream as possible. That’s quite exciting to consider, once we get the basics up and working. We’d need to keep in mind the possibility that completed Talking Papers forms might be OCR’d by a device running a data-scrubbing application that may or may not be connected to the Internet at the time – an interesting, though probably not insurmountable, constraint. Where do things stand with your planned field tests? Are you looking at mobile implementations? Finally, I can see how the techniques you describe could also be adapted to enable other kinds of interactions such as rapid “smart tagging” of collections of items.

    Kuang, thanks very much for your insights and suggestions, and again, please accept my apologies for the delay in replying to your thoughtful post. Please do follow up and share more of your ideas on our Google group!

    Warm Regards,

