In numerical analysis courses we discuss condition numbers as a means for measuring the sensitivity of the solution of a problem to perturbations in the data. Traditionally, we say there are three main sources of data errors:
- Rounding errors in storing the data on the computer. For example, the Hilbert matrix with $(i,j)$ entry $1/(i+j-1)$ cannot be stored exactly in floating point arithmetic.
- Measurement errors. If the data comes from physical measurements or experiments then it will have inherent uncertainties, which could be quite large (perhaps of relative size ).
- Errors from an earlier computation. If the data for the given problem is the solution to another problem it will inherit errors from the previous problem.
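The first of these sources is easy to see in a small experiment. The sketch below assumes the standard definition of the Hilbert matrix, with $(i,j)$ entry $1/(i+j-1)$, and uses exact rational arithmetic to measure the relative error committed when each entry is rounded to a double. The function names `hilbert_entry` and `storage_error` are my own, chosen for illustration.

```python
from fractions import Fraction

def hilbert_entry(i, j):
    """Exact (i, j) entry of the Hilbert matrix, with 1-based indices."""
    return Fraction(1, i + j - 1)

def storage_error(i, j):
    """Relative error from rounding the exact entry to a double."""
    exact = hilbert_entry(i, j)
    # float() rounds to the nearest double; Fraction(float) recovers
    # the exact rational value of that double.
    stored = Fraction(float(exact))
    return abs(stored - exact) / exact

n = 4
for i in range(1, n + 1):
    # Entries such as 1, 1/2 and 1/4 are stored exactly (error 0);
    # entries such as 1/3 and 1/5 are not.
    print([float(storage_error(i, j)) for j in range(1, n + 1)])
```

Only the entries whose denominators are powers of 2 are stored exactly. The others carry a relative error that is tiny, no larger than the unit roundoff $2^{-53}$ in double precision, but it is not zero.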
Recently I learned of a fourth source of error: scanning and photocopying.
Traditionally, photocopiers were based on xerography, whereby electrostatic charges on a light-sensitive photoreceptor are used to attract toner particles and then transfer them onto paper to form an image. Nowadays, photocopiers are more likely to comprise a combined scanner and printer, as for example in consumer all-in-one devices.
Last year, German computer scientist David Kriesel discovered that the Xerox WorkCentre 7535 and 7556 machines can jumble up different areas in a scan. In particular, he found an example where many occurrences of the digit “6” are replaced by “8” during the scanning process. See his blog post.
It seems that the Xerox scanners in question use JBIG2, a compression standard for bi-level (black and white) images that segments the image into patches and uses pattern matching. The default parameters were evidently not a good choice, because they can lead to these serious errors. Xerox subsequently released software patches.
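To see how pattern matching on patches can turn a "6" into an "8", here is a toy model. It is emphatically not the JBIG2 codec or Xerox's implementation: the 5-by-3 glyph bitmaps, the Hamming-distance test, the `threshold` parameter, and all function names are assumptions made up for illustration. The idea is that each patch is compared with the symbols already stored, and if it is "close enough" to one of them the stored symbol is reused.

```python
# Hypothetical 5x3 bitmaps for three digits ('#' = black, '.' = white).
GLYPHS = {
    "8": ("###", "#.#", "###", "#.#", "###"),
    "6": ("###", "#..", "###", "#.#", "###"),
    "1": (".#.", "##.", ".#.", ".#.", "###"),
}
NAMES = {bitmap: digit for digit, bitmap in GLYPHS.items()}

def hamming(a, b):
    """Number of pixels at which two equally sized bitmaps differ."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def encode(patches, threshold):
    """Store each patch as an index into a dictionary of symbols,
    reusing the first stored symbol within `threshold` pixels of it."""
    dictionary, indices = [], []
    for patch in patches:
        for k, symbol in enumerate(dictionary):
            if hamming(patch, symbol) <= threshold:
                indices.append(k)  # lossy step: patch replaced by symbol
                break
        else:
            dictionary.append(patch)
            indices.append(len(dictionary) - 1)
    return dictionary, indices

def decode(dictionary, indices):
    """Rebuild the page from the symbol dictionary."""
    return [dictionary[k] for k in indices]

def scan(text, threshold):
    """Round trip a string of digits through encode and decode."""
    patches = [GLYPHS[c] for c in text]
    return "".join(NAMES[p] for p in decode(*encode(patches, threshold)))

print(scan("8616", threshold=0))  # 8616: exact matching is lossless
print(scan("8616", threshold=1))  # 8818: each 6 is one pixel from the stored 8
```

In this toy model the result looks perfectly crisp, because every output glyph is a genuine glyph from the page. That is what makes this kind of error so much harder to spot than ordinary blur or noise.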
One would not imagine that scanning on today’s high resolution machines could change whole blocks of pixels. Given the wide range of uses of scanners, including transmission of exam marks, financial information, and engineering specifications, as well as the ubiquitous digitizing of historic documents including journal articles, this is very disturbing.
The problem of mangled scans may not be limited to Xerox machines, as other reports show (see this post and this post).
The moral of the story is: run sanity checks on your scanned data and do not assume that scans (or the results of optical character recognition on them) are accurate!
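What a sanity check looks like depends on the data. As one minimal sketch, suppose each row of a scanned table of marks carries a stated total, so that the entries can be checked against it. The data and the function name `failed_rows` below are invented for illustration.

```python
def failed_rows(rows):
    """Return the indices of rows whose entries do not sum to the stated total.

    Each row is a pair (values, stated_total) of integers.
    """
    return [k for k, (values, total) in enumerate(rows) if sum(values) != total]

# Hypothetical marks as written on paper, and as read back from a scan
# in which two 6s in the first row have become 8s.
original = [([16, 28, 6], 50), ([18, 8, 61], 87)]
scanned = [([18, 28, 8], 50), ([18, 8, 61], 87)]

print(failed_rows(original))  # []
print(failed_rows(scanned))   # [0]
```

A check like this catches a substituted digit only when the document contains some redundancy (here, the totals), and it will miss errors that happen to cancel or that corrupt the total consistently, so it reduces the risk rather than removing it.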