Segmentation of page images having artifacts of photocopying and scanning
L. Cinque (a), S. Levialdi (a), L. Lombardi (b) and S. Tanimoto (c)
a Dipartimento di Scienze dell'Informazione, University of Rome “La Sapienza”, Via Salaria 113, 00198 Rome, Italy
b Dipartimento di Informatica e Sistemistica, University of Pavia, Via Ferrata 1, 27100 Pavia, Italy
c Department of Computer Sci. and Engineering, Box 352350, University of Washington, Seattle, WA 98195, USA
Received 18 June 1999; revised 16 February 2001; accepted 26 February 2001. Available online 11 February 2002.
Abstract
The analysis of scanned documents is important in the construction of digital libraries and paperless offices. One significant challenge is coping with artifacts of photocopying and scanning. We present a series of simple techniques for handling these difficulties. Using 125 images of the University of Washington scanned documents database, we demonstrate the effectiveness of these methods in preparing the images for segmentation by a multiresolution algorithm.
Author Keywords: Document analysis; Artifact elimination; Segmentation; Print-through; Marginal artifact; Partial extra page; Digital library
Article Outline
1. Introduction
1.1. General motivation
1.2. Problem description
2. Previous work
3. Processing methods
3.1. Eliminating print-through
Algorithm 1—[Treatment of print-through]
3.2. Marginal artifacts and partial extra pages
Algorithm 2—[Treatment of marginal artifacts and partial extra pages]
4. Multiresolution segmentation method
4.1. Phase 1: construction of feature pyramids
4.2. Phase 2: classification of regions
5. Experimental results and discussion
5.1. Computational considerations
6. Conclusions
References
Vitae
1. Introduction
1.1. General motivation
Online digital libraries can provide improved distribution of information and more flexible access via search algorithms than traditional print libraries can. However, adding existing print materials to electronic libraries is a costly, slow process unless good automated procedures can be developed. When academic journal articles are photocopied and/or scanned from their bound, printed versions, artifacts are often introduced into the images that make further analysis difficult. Either these artifacts must be removed before further processing, or the subsequent processing steps must be designed to tolerate them. We addressed the problem by developing means to reduce or eliminate the artifacts in the scanned document images prior to segmentation.