Extract Tables From Images in Python

Extracting tables from images can be a tedious and time-consuming task, especially if you have a large number of images to process. However, with the right tools and techniques, you can automate this process and extract tables from images quickly and easily.

In this article, we will explore how to extract tables from images using Python. We will cover a library (img2table) that can be used to identify and extract tables from images, along with sample code and explanations. Whether you are working with scanned documents, photos, or other types of images, this article will provide you with the tools and knowledge you need to extract tables efficiently and accurately.

What is img2table?

Img2Table is a straightforward, user-friendly Python library for table extraction and identification that is based on OpenCV image processing and supports PDF files in addition to the majority of popular image file formats.

Thanks to its design, it offers a useful, lighter-weight alternative to neural-network-based solutions, especially when running on CPU.

It supports the following file formats:

  • JPEG files - *.jpeg, *.jpg, *.jpe

  • Portable Network Graphics - *.png

  • JPEG 2000 files - *.jp2

  • Windows bitmaps - *.bmp, *.dib

  • WebP - *.webp

  • Portable image format - *.pbm, *.pgm, *.ppm, *.pxm, *.pnm

  • PFM files - *.pfm

  • OpenEXR Image files - *.exr

img2table Features

  • Table identification for images and PDF files, including bounding boxes at the table cell level.

  • Handling of complex table structures such as merged cells.

  • Extraction of table titles.

  • Extraction of table content, with support for OCR tools and services.

  • Extracted tables are returned as simple objects that include a Pandas DataFrame representation.

  • Export of extracted tables to an Excel file, preserving their original structure.

The package is simple (in comparison to deep learning solutions) and needs little or no training. It still has some limitations, though: identification of more complex borderless tables is not yet supported and may call for CNN-based approaches.

Implementation

Installation

Just like every other Python package, img2table can be installed via pip.

pip install img2table
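
If you plan to actually parse table content (not just detect tables), an OCR engine is needed as well. The examples below use Tesseract, which is a system dependency rather than a pip package; a minimal setup sketch for Debian/Ubuntu, assuming the standard apt package names (adapt to your OS):

# Tesseract itself must be installed at the system level (Debian/Ubuntu example)
sudo apt-get install tesseract-ocr

# Optional: additional language packs with traineddata files, e.g. English
sudo apt-get install tesseract-ocr-eng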

Working with Images

from img2table.document import Image

image = Image(src, dpi=200, detect_rotation=False)

We instantiate Image, where src is the path to the image (required); dpi is an optional int (default 200) used to adapt the OpenCV algorithm parameters; and detect_rotation is a boolean (default False) that, when enabled, detects and corrects skew or rotation of the image.

Let's have an example:

from img2table.document import Image

# Instantiation of the image
img = Image(src="image.jpg")

# Table identification
image_tables = img.extract_tables()

# Result of table identification
image_tables

#output
[ExtractedTable(title=None, bbox=(10, 8, 745, 314),shape=(6, 3)),
 ExtractedTable(title=None, bbox=(936, 9, 1129, 111),shape=(2, 2))]
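
Each ExtractedTable in the returned list exposes the table title, its bounding box, and a Pandas DataFrame of its content once OCR has been run. A small inspection sketch, assuming the title, bbox (with x1/y1/x2/y2 coordinates) and df attributes described in the library's documentation:

# Inspect the first detected table
table = image_tables[0]

print(table.title)                   # table title, None if not detected
print(table.bbox.x1, table.bbox.y1,  # top-left corner of the bounding box
      table.bbox.x2, table.bbox.y2)  # bottom-right corner

# table.df is a Pandas DataFrame of the cell contents; without an OCR
# instance passed to extract_tables, the cells will be empty
print(table.df)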

Working with PDF

from img2table.document import PDF

pdf = PDF(src, dpi=200, pages=[0, 2])

This works the same way as with images, except for the new pages parameter, which is a list of PDF page indexes to be processed. If no indexes are specified, all pages are processed.
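
As a quick illustration (a sketch using a hypothetical document.pdf), the two instantiations below process all pages and only the first and third pages, respectively:

from img2table.document import PDF

# Process every page of the document
pdf_all = PDF(src="document.pdf")

# Process only the first and third pages (0-based indexes)
pdf_selected = PDF(src="document.pdf", pages=[0, 2])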

Working with OCR

To parse the content of tables, img2table offers an interface for various OCR tools and services.

from img2table.ocr import TesseractOCR

ocr = TesseractOCR(n_threads=1, lang="eng", tessdata_dir="...")

Here n_threads is the number of concurrent threads used to call Tesseract (an int, default 1); lang is the language used by Tesseract for text extraction and is optional; and tessdata_dir is the directory containing the Tesseract traineddata files.

Note: Usage of Tesseract-OCR requires prior installation.

Let's have a look at an example.

from img2table.document import PDF
from img2table.ocr import TesseractOCR

# Instantiation of the pdf
pdf = PDF(src="tablesfile.pdf")

# Instantiation of the OCR, Tesseract, which requires prior installation
ocr = TesseractOCR(lang="eng")

# Table identification and extraction
pdf_tables = pdf.extract_tables(ocr=ocr)

# We can also create an Excel file containing the tables
pdf.to_xlsx('tables.xlsx', ocr=ocr)
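
For a PDF, the tables returned by extract_tables are grouped by page, so the results can be walked page by page. A short sketch continuing the example above, assuming the dict-of-lists return shape described in the library's documentation (verify against your installed version):

# pdf_tables maps each processed page index to its list of extracted tables
for page_index, tables in pdf_tables.items():
    for table in tables:
        print(f"Page {page_index}: {table.title}")
        print(table.df)  # cell contents as a Pandas DataFrame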

Extracting Multiple Tables

The extract_tables method of a document allows multiple tables to be extracted simultaneously from a PDF page or an image.

from img2table.ocr import TesseractOCR
from img2table.document import Image

# Instantiation of OCR
ocr = TesseractOCR(n_threads=1, lang="eng")

# Instantiation of document, either an image or a PDF
doc = Image(src, dpi=200)

# Table extraction
extracted_tables = doc.extract_tables(ocr=ocr,
                                      implicit_rows=True,
                                      borderless_tables=False,
                                      min_confidence=50)

Most of these parameters were discussed earlier when working with images and PDFs, but there are a few new ones: ocr is the OCR instance used to parse the document text; implicit_rows is a boolean indicating whether implicit rows should be identified; borderless_tables indicates whether borderless tables should also be extracted; and min_confidence is the minimum OCR confidence level required to process text, from 0 (the worst) to 99 (the best).
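
Since doc here is an image, extracted_tables is a plain list of ExtractedTable objects, just like in the earlier example. A small sketch of how the results might be consumed (df assumes the OCR instance was supplied; the to_xlsx export shown earlier for PDFs should work the same way for images):

# Walk the extracted tables of the image
for table in extracted_tables:
    print(table.title)  # detected title, or None
    print(table.bbox)   # bounding box of the table in the image
    print(table.df)     # cell contents as a Pandas DataFrame

# Export the document's tables directly to an Excel file
doc.to_xlsx("tables.xlsx", ocr=ocr)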

Conclusion

All of the image processing is done with OpenCV, via the opencv-python package. The algorithm is built on the Hough Transform, which detects lines in an image, allowing the horizontal and vertical lines that make up table borders to be recognized. There really isn't much more to the library, because the intention was to keep it as straightforward as possible and to avoid the complications that can come with heavier approaches.
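
To give an idea of that foundation, here is a minimal, self-contained sketch of line detection with OpenCV's probabilistic Hough Transform. This is not img2table's actual code, just an illustration of the underlying idea; the image path, thresholds and tolerance values are placeholders:

import cv2
import numpy as np

# Load the image in grayscale and extract edges to emphasize table borders
img = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(img, 50, 150)

# Probabilistic Hough Transform: returns line segments as (x1, y1, x2, y2)
lines = cv2.HoughLinesP(edges, 1, np.pi / 180, 100,
                        minLineLength=100, maxLineGap=10)

# Keep only the (roughly) horizontal and vertical segments, the ones that
# typically form a table grid, and draw them back for visual inspection
if lines is not None:
    for x1, y1, x2, y2 in lines[:, 0]:
        if abs(x1 - x2) < 5 or abs(y1 - y2) < 5:
            cv2.line(img, (x1, y1), (x2, y2), 0, 2)

cv2.imwrite("lines.jpg", img)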

View the project's documentation on GitHub.

Let's connect on Twitter and on LinkedIn. You can also subscribe to my YouTube channel.

Happy Coding!