Document Scanner

The document scanner takes an image of a document as input and outputs a cropped, perspective-corrected, black-and-white image (similar to what one would receive from a traditional scanner).
(Source code for this is in the perception toolbox.)

Resources Used:

  • Computer Vision: Algorithms and Applications by Richard Szeliski
    • Chapter 2.1 Geometric Primitives and Transforms
    • Chapter 6.1 2D and 3D Feature-Based Alignment
      • Section on the use of RANSAC for determining homographies
  • PyImageSearchGurus Course
    • Module 1.10 Gradients and Edge Detection
    • Module 1.11 Contours
  • Implementation of OpenCV's findContours
    • Suzuki, S. and Abe, K., Topological Structural Analysis of Digitized Binary Images by Border Following (1983)

How Does it Work?

The document scanning process is a pipeline of computer vision algorithms, each of which relies on the output quality of the previous step. At a high level, these are the steps involved in scanning a document:

  1. Detecting edges within the image
  2. Searching for the largest closed contour (hopefully our document)
  3. Finding the four corners of this document
  4. Determining the output document dimensions
  5. Transforming the plane described by the four corners of the document to the output document corners

Let's dive into each step in more detail!

Detecting Edges

Detected Edges

The following approach is known as Canny edge detection.
Here we analyze the pixel intensities in a gray image. We are able to identify the edges present in an image by looking for the largest rates of change of pixel intensity.
Images usually contain pixel intensity outliers, known as 'noise'. Taking the gradient of a noisy image accentuates the noise relative to the signal, leading to false edges being detected. An important first step is therefore to minimize noise in the image with a pixel blurring technique; a common choice is a Gaussian blur.
Pixel gradients above a certain value are seen as ‘edges’ whereas others are not – the final result will be a binary image. We still need to do some work on our edges before we can evaluate contours.
We need to thin out our edges so that we have a single-pixel-thick contour that outlines the document. Using non-maxima suppression, we analyze a kernel (a grouping of pixels) and determine the orientation of the gradient (the direction in which the intensities are increasing). We then preserve the most intense pixel in that direction and suppress the others.
Now that we have thinned our edges by suppressing select pixel gradient values, we must determine which gradient values actually represent an edge. Canny edge detection uses a hysteresis threshold to decide this. Any gradient value above the upper threshold is an edge, any below the lower threshold is not, and any value in between is only considered an edge if it is connected to an edge with values above the upper threshold.
I determine the threshold values by taking the median pixel intensity and setting the lower threshold X% below it and the upper threshold X% above it. I tune this percentage according to the problem.

#find edges: blur to suppress noise, then Canny with median-based thresholds
#(`resized` is the resized input image; assumes cv2 and numpy as np are imported)
blurred = cv2.GaussianBlur(resized, (11, 11), 0)
sigma = 0.33
v = np.median(blurred)
lower = int(max(0, (1.0 - sigma) * v))
upper = int(min(255, (1.0 + sigma) * v))
edges = cv2.Canny(blurred, lower, upper)

Document Corners

Located Document Corners (Red Dots)

In order to find document corners we must first find all the contours within our edge image. Using cv2.findContours() we can retrieve a set of closed contours, each described by the list of points that make up its contour line. The implementation of this algorithm is described in the paper "Suzuki, S. and Abe, K., Topological Structural Analysis of Digitized Binary Images by Border Following (1983)". The algorithm scans the binary edge image pixel by pixel until it encounters the edge of a closed contour. It then performs a form of traditional border following (turn left on '1', right on '0'), marking each border pixel with an ID unique to that contour. It keeps track of the last seen contour as well as the current contour so that it can establish a topological structure of borders. The output of the algorithm provides a description of each closed contour (by each pixel or by each vertex) along with the indices of the contours nested within it. findContours can also be asked to return only the external closed contours.
We can sort the retrieved contours by the area within the contour, making an assumption that the largest contour is our document. This is a valid assumption because we can imagine the user would make the document the focus of their image.
The document contour will not be a perfect quadrilateral; it will have far more than four corners because the paper is not perfectly flat. We will need to approximate the document contour with a quadrilateral. Using cv2.approxPolyDP, an implementation of the Ramer–Douglas–Peucker algorithm, we can recursively remove the points of a line that matter least in determining its shape. The algorithm takes a prescribed acceptable level of deviation that controls how far the simplified line may stray from the original.
After approximating the shape of the document we will have the location of all corner points of the document.

#find the 4 corners of the document

#cv2.findContours returns (contours, hierarchy) in OpenCV 4
contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
#assume the largest contour (by enclosed area) is the document
contour = sorted(contours, key=cv2.contourArea, reverse=True)[:1]
#approximate the contour with a polygon; the tolerance is a fraction of the perimeter
peri = cv2.arcLength(contour[0], True)
peri_factor = 0.015
approx_cnt = cv2.approxPolyDP(contour[0], peri_factor * peri, True)

Transform Document to Output Image

Scanned homework using my document scanner application

A homography matrix can be used to describe a transformation from any quadrilateral to another quadrilateral.
By determining the homography that transforms our document to the dimensions of the output image, we are assuming that our document is a plane. Because it is not actually a perfect plane, the scan will not be perfect and there will be some minor distortion.
The homography that maps 2D pixels from one image to another is described by a 3×3 matrix with 8 degrees of freedom (the matrix is only defined up to scale).
In homogeneous coordinates, X2 ≈ H·X1, where X1 is a coordinate in the first image and X2 is the corresponding coordinate in the second. Because we have matched X1 (a corner of the document) to X2 (a corner of the output image), each match gives two equations in 8 unknowns: the first relates 'x' coordinates, the second relates 'y' coordinates. The 8 unknowns are the entries of the homography matrix, with the ninth entry fixed by the choice of scale.
In order to compute the homography we need 8 equations to solve for the 8 unknowns. Our four corner matches yield exactly 8 equations.
With a solved homography matrix we can now transform our document in our input image to the output image.

#`doc_w` and `h_output` are the output document width and height computed earlier
dims_out = (int(doc_w), int(h_output))
#output corners in the order: top-left, top-right, bottom-left, bottom-right
scanned_corners = np.array([[0, 0], [dims_out[0] - 1, 0],
                            [0, dims_out[1] - 1], [dims_out[0] - 1, dims_out[1] - 1]],
                           dtype=np.float32)

#compute transform between input image document corners and output image corners,
#then apply it; approx_cnt must be float32 and its corners ordered to match
#scanned_corners
transform = cv2.getPerspectiveTransform(approx_cnt.reshape(4, 2).astype(np.float32),
                                        scanned_corners)
scanned = cv2.warpPerspective(gray, transform, dims_out)
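One caveat: approxPolyDP returns the four corners in contour order, which may not match the order of scanned_corners above, so the warp can come out rotated or mirrored. A small helper of my own (a common sum/difference trick, not part of OpenCV) orders them consistently:

```python
import numpy as np

def order_corners(pts):
    """Order 4 points as top-left, top-right, bottom-left, bottom-right."""
    pts = pts.reshape(4, 2).astype(np.float32)
    ordered = np.zeros((4, 2), dtype=np.float32)
    s = pts.sum(axis=1)             # x + y: smallest at TL, largest at BR
    d = np.diff(pts, axis=1)[:, 0]  # y - x: smallest at TR, largest at BL
    ordered[0] = pts[np.argmin(s)]  # top-left
    ordered[1] = pts[np.argmin(d)]  # top-right
    ordered[2] = pts[np.argmax(d)]  # bottom-left
    ordered[3] = pts[np.argmax(s)]  # bottom-right
    return ordered
```

Passing approx_cnt through this helper before cv2.getPerspectiveTransform guarantees each detected corner lands on the matching output corner.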
