OCR Reader in Python: Extract Tables from PDF

At Gallup Pakistan Digital Analytics, we have embarked on an exciting project to streamline data extraction from the Pakistan Bureau of Statistics (PBS) trade reports. These reports, rich with vital trade statistics, are published in PDF format each month. Our objective is to harness the power of an OCR Reader to simplify and automate the data extraction process, transforming these PDFs into easily analyzable Excel files.

The Challenge

Trade statistics data from PBS is detailed and voluminous, often spanning numerous pages in PDF format. Manually extracting and organizing this data is a time-consuming and error-prone task. This is where our OCR Reader project comes into play. By leveraging optical character recognition (OCR) technology, we can efficiently extract and process this data.

Our Solution

We have developed a Python-based solution that utilizes several powerful libraries: pandas for data manipulation, tabula-py for reading PDF tables, PyPDF2 for handling PDF files, and openpyxl for exporting data to Excel. Here’s how it works:

User-Friendly Input: The script prompts users to specify which pages of the PDF to process. You can enter specific page numbers or choose to process all pages.
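Accurate Data Extraction: Using tabula-py, the script extracts tables from the specified pages. Note that tabula-py reads the text layer of a PDF rather than performing optical character recognition in the strict sense, so it works on text-based PDFs; scanned pages would need a separate OCR pass first. This step involves setting the coordinates of the area containing the tables to ensure precise extraction.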
Data Transformation: The extracted data undergoes several transformations:
- Cleaning: Drops empty rows and columns to remove irrelevant data.
- Splitting Cells: Handles multiline cells to ensure each cell contains a single value.
- Column Adjustment: Reorders and renames columns to match the expected format. If columns are missing, they are added to maintain consistency.
- Splitting Specific Columns: Splits certain columns into multiple columns for detailed data, ensuring that all relevant information is captured.
Export to Excel: Finally, the processed data is saved into an Excel file, with each page’s data organized into separate sheets. This structured format makes it easy to analyze and visualize the data.
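
The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the function names, the range syntax in the page input (such as "5-7"), the "Page_N" sheet names, and any column names passed in are hypothetical. The tabula-py and PyPDF2 imports are kept inside the functions that need them, since tabula-py also requires a Java runtime.

```python
import pandas as pd


def parse_page_spec(spec, total_pages):
    """Turn input such as 'all' or '1, 3, 5-7' into a sorted list of page numbers."""
    spec = spec.strip().lower()
    if spec in ("", "all"):  # treating empty input as "all" is an assumption
        return list(range(1, total_pages + 1))
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:  # range syntax is an assumption, not from the article
            start, end = (int(p) for p in part.split("-", 1))
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    # Pages outside the document are silently ignored
    return sorted(p for p in pages if 1 <= p <= total_pages)


def count_pages(pdf_path):
    from PyPDF2 import PdfReader  # available in recent PyPDF2 releases
    return len(PdfReader(pdf_path).pages)


def extract_tables(pdf_path, page, area):
    """Read tables inside `area` = [top, left, bottom, right] (points) on one page."""
    import tabula  # tabula-py; needs Java installed
    return tabula.read_pdf(pdf_path, pages=page, area=area,
                           guess=False, multiple_tables=True)


def clean_table(df):
    """Drop empty rows/columns, then give each line of a multiline cell its own row."""
    df = df.dropna(how="all").dropna(axis=1, how="all")
    rows = []
    for record in df.itertuples(index=False, name=None):
        # Every value becomes a string here; numeric parsing is left to a later step
        cells = [str(c).splitlines() if pd.notna(c) else [] for c in record]
        height = max((len(c) for c in cells), default=0)
        for i in range(height):
            rows.append([c[i] if i < len(c) else None for c in cells])
    return pd.DataFrame(rows, columns=df.columns)


def align_columns(df, rename_map, expected):
    """Rename, reorder, and add missing columns (columns not in `expected` are dropped)."""
    return df.rename(columns=rename_map).reindex(columns=expected)


def split_column(df, col, new_names):
    """Split one whitespace-separated column into several named columns."""
    parts = (df[col].astype("string")
             .str.split(n=len(new_names) - 1, expand=True)
             .reindex(columns=range(len(new_names))))
    df = df.drop(columns=[col])
    for i, name in enumerate(new_names):
        df[name] = parts[i]
    return df


def export_sheets(tables_by_page, out_path):
    """Write one sheet per page to an Excel workbook via openpyxl."""
    with pd.ExcelWriter(out_path, engine="openpyxl") as writer:
        for page, df in tables_by_page.items():
            df.to_excel(writer, sheet_name=f"Page_{page}", index=False)
```

A driver script would call count_pages, pass the user's input through parse_page_spec, run extract_tables and the three transformation helpers for each page, and hand the resulting dictionary of DataFrames to export_sheets. The area coordinates and the expected column list depend on the layout of the PBS report and would need to be set by inspecting the PDF.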

The Impact

By automating the extraction and processing of PBS trade statistics, we have significantly reduced the time and effort required to handle this data. This project not only enhances our analytical capabilities but also ensures that we can provide timely and accurate insights into Pakistan’s trade statistics.

In conclusion, our OCR Reader project at Gallup Pakistan Digital Analytics exemplifies how innovative technology can transform data management processes. We’re excited about the potential this project holds for improving data accessibility and analysis, paving the way for more informed decision-making in the realm of trade statistics.

Check Out Our Project on GitHub