In numerical analysis courses we discuss condition numbers as a means for measuring the sensitivity of the solution of a problem to perturbations in the data. Traditionally, we say there are three main sources of data errors:
- Rounding errors in storing the data on the computer. For example, the Hilbert matrix, with (i,j) entry 1/(i+j-1), cannot be stored exactly in floating point arithmetic.
- Measurement errors. If the data comes from physical measurements or experiments then it will have inherent uncertainties, which could be quite large.
- Errors from an earlier computation. If the data for the given problem is the solution to another problem it will inherit errors from the previous problem.
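The first bullet can be made concrete: the Hilbert matrix entry 1/3 has no finite binary expansion, so merely storing it in IEEE double precision introduces a rounding error. A minimal Python check (the variable names are my own):

```python
from fractions import Fraction

# Entry (1,3) of the Hilbert matrix is 1/(1+3-1) = 1/3.
exact  = Fraction(1, 3)
stored = Fraction(1.0 / 3.0)   # exact rational value of the stored double

print(stored == exact)   # False: 1/3 has no finite binary representation
# The relative error is nonzero but below the unit roundoff 2^(-52):
rel_err = abs(stored - exact) / exact
print(rel_err < Fraction(1, 2**52))   # True
```

`Fraction` converts the double to its exact rational value, so the comparison is exact, not a floating point test.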
Recently I learned of a fourth source of error: scanning and photocopying.
Traditionally, photocopiers were based on xerography, whereby electrostatic charges on a light sensitive photoreceptor are used to attract toner particles and then transfer them onto paper to form an image. Nowadays, photocopiers are more likely to comprise a combined scanner and printer, as for example in consumer all-in-one devices.
Last year, German computer scientist David Kriesel discovered that the Xerox WorkCentre 7535 and 7556 machines can jumble up different areas in a scan. In particular, he found an example where many occurrences of the digit “6” are replaced by “8” during the scanning process. See his blog post.
It seems that the Xerox scanners in question use the JBIG2 compression standard, which is designed for bi-level (black-and-white) images and achieves high compression by segmenting the image into small patches and, via pattern matching, reusing a single stored symbol for patches it judges to be the same. The default matching parameters were evidently a poor choice, because they allowed visually similar but distinct symbols to be conflated, leading to these serious errors. Xerox subsequently released software patches.
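To see how a lax matching threshold can swap digits, here is a toy sketch of pattern-matching-and-substitution compression. This is my own simplified illustration, not the actual JBIG2 algorithm or Xerox's code; the 5x3 bitmaps and the Hamming-distance matcher are assumptions for demonstration.

```python
def hamming_distance(a, b):
    """Number of differing pixels between two equal-sized binary bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def compress_symbols(patches, threshold):
    """Encode each patch as an index into a symbol dictionary.

    A patch reuses the first dictionary symbol within `threshold` differing
    pixels (a lossy match); otherwise it is added as a new symbol.
    """
    dictionary, coded = [], []
    for patch in patches:
        for idx, sym in enumerate(dictionary):
            if hamming_distance(patch, sym) <= threshold:
                coded.append(idx)      # reuse an existing symbol
                break
        else:
            dictionary.append(patch)
            coded.append(len(dictionary) - 1)
    return dictionary, coded

# Tiny made-up 5x3 bitmaps for "6" and "8"; they differ in just one pixel.
SIX   = ["111", "100", "111", "101", "111"]
EIGHT = ["111", "101", "111", "101", "111"]

# A strict threshold keeps the digits distinct...
_, code_strict = compress_symbols([SIX, EIGHT], threshold=0)
print(code_strict)   # [0, 1]

# ...but a lax threshold merges them: the "8" is rendered with the "6" glyph.
_, code_lax = compress_symbols([SIX, EIGHT], threshold=1)
print(code_lax)      # [0, 0]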
One would not imagine that scanning on today’s high resolution machines could change whole blocks of pixels. Given the wide range of uses of scanners, including transmission of exam marks, financial information, and engineering specifications, as well as the ubiquitous digitizing of historic documents including journal articles, this is very disturbing.
The problem of mangled scans may not be limited to Xerox machines, as other reports show (see this post and this post).
The moral of the story is: run sanity checks on your scanned data and do not assume that scans (or the results of optical character recognition on them) are accurate!
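One simple sanity check of this kind is to verify scanned figures against a stated total, since a substituted digit will almost always break the checksum. A minimal sketch (the values and tolerance here are made up for illustration):

```python
# Values read by OCR from a scanned table, plus the total printed on the page.
scanned_values = [12.50, 33.80, 8.95]
stated_total = 55.25

# Tolerance covers floating point rounding, not OCR errors: a single
# substituted digit changes the sum by far more than 0.005.
if abs(sum(scanned_values) - stated_total) > 0.005:
    raise ValueError("Scanned column does not sum to the stated total; "
                     "re-check the scan for substituted digits.")
print("Checksum passed.")
```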
One thought on “A New Source of Data Errors: Scanning and Photocopying”
I am having a bit of a problem understanding why this is happening. We are talking about a scanner+printer combo (in one device). It seems to me that it would be cheaper (and more efficient) to scan the document and then print it than to scan it, compress it, decompress it right away, and then print it. I assume that the scanning and the printing components share memory (presumably for cost efficiency), so I don't see any benefit from compression followed by immediate decompression.
Do you know the reason for introducing compression/decompression into this process?