Company: Unstructured IO
Role: Senior Machine Learning Engineer
https://github.com/Unstructured-IO
https://unstructured-io.github.io/unstructured/index.html
Project: Unstructured Library and API for Pre-processing Text Documents
Developed and enhanced the unstructured library, an open-source project that provides components for pre-processing text documents such as PDFs, HTML, and Word documents.
The library offers partitioning, cleaning, and staging bricks that let users build pipelines tailored to their document-processing needs. Also contributed to the development of the unstructured API, which exposes the library's capabilities as a service.
Technologies and Tools:
- Programming Language: Python
- File Processing Libraries: pandas, pdf2image, pdfminer, lxml, docx, pypandoc, email
- Image Processing Libraries: tesseract-ocr, PaddleOCR, PIL
- ML Models and Libraries: faster_rcnn, DocLayNet, torch, torchvision, layoutparser, detectron2
Key Accomplishments:
- Implemented partitioning of PDF and image documents using ML models, OCR libraries, and PDF processing libraries.
- Developed a pipeline that combines layout detection ML models (faster_rcnn, DocLayNet), OCR libraries (tesseract-ocr, PaddleOCR), and PDF processing libraries (pdfminer.six, pdf2image, pikepdf, pypdf) to parse and structure PDF and image files.
- Implemented the extraction of image block elements ("Image", "Figure", "Table") as separate images, both using the unstructured library directly and through API calls.
- Enhanced element ordering and sorting by pre-processing bounding boxes and implementing the xy-cut sorting algorithm.
- Improved the parsing and structuring of HTML and MS Word documents.
- Implemented functionality to track emphasized text (<strong>, <em>, <span>, <b>, <i> tags) in HTML documents and added the tracked information to the metadata of the associated elements.
- Added support for tracking emphasized text (bold/italic formatting) from paragraphs and tables in MS Word documents.
- Collaborated on the development of the unstructured API and client SDK.
- Contributed to the design and implementation of the API endpoints and request/response handling.
- Assisted in the development of the client SDK to simplify integration with the unstructured API.
- Resolved over 300 bugs and reviewed more than 200 pull requests on the project's GitHub repository.
- Addressed critical issues related to out-of-memory errors when processing large PDF documents by implementing chunked processing and temporary file storage.
- Fixed parsing issues in HTML documents resulting in zero elements by adding exception handling and sanity checks.
- Resolved Unicode encoding/decoding issues for various file formats (txt, eml, html, xml) by implementing robust encoding detection and fallback mechanisms.
- Implemented OCR functionality for PDF and image files.
- Added support for full-page and individual-block OCR using tesseract-ocr and PaddleOCR libraries.
- Developed functionality to merge inferred layout with OCR layout, populating inferred region text with OCR text when available.
- Wrote evaluation scripts to validate the performance of the OCR module.
- Enhanced the extraction of image block elements from PDF documents.
- Implemented the ability to save image block elements (Image, Figure, Table) to the local file system when using the unstructured library directly.
- Added support for returning image block elements as base64-encoded data when using the unstructured API.
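Illustrative sketch (element ordering): the xy-cut sorting mentioned above can be approximated with a short recursive routine. This is a simplified variant written for illustration, not the library's implementation; the function names, the (x0, y0, x1, y1) box layout with a top-left origin, and the fallback ordering are all assumptions.

```python
from typing import List, Tuple

# Assumed box layout: (x0, y0, x1, y1) with the origin at the top-left corner.
Box = Tuple[float, float, float, float]


def _split_on_gaps(boxes: List[Box], axis: str) -> List[List[Box]]:
    """Group boxes into runs separated by an empty gap along the given axis."""
    lo, hi = (1, 3) if axis == "y" else (0, 2)
    ordered = sorted(boxes, key=lambda b: b[lo])
    groups = [[ordered[0]]]
    reach = ordered[0][hi]  # furthest extent covered by the current group
    for box in ordered[1:]:
        if box[lo] >= reach:
            groups.append([box])  # a gap was found: start a new group
        else:
            groups[-1].append(box)
        reach = max(reach, box[hi])
    return groups


def xy_cut_sort(boxes: List[Box], axis: str = "y", tried_other: bool = False) -> List[Box]:
    """Return boxes in reading order by alternating horizontal and vertical cuts."""
    if len(boxes) <= 1:
        return list(boxes)
    other = "x" if axis == "y" else "y"
    groups = _split_on_gaps(boxes, axis)
    if len(groups) == 1:
        if tried_other:
            # No cut is possible on either axis: fall back to top-to-bottom, left-to-right.
            return sorted(boxes, key=lambda b: (b[1], b[0]))
        return xy_cut_sort(boxes, other, True)
    result: List[Box] = []
    for group in groups:
        result.extend(xy_cut_sort(group, other, False))
    return result
```

On a page with a full-width title above two columns, this sketch returns the title first, then the left column top to bottom, then the right column, provided the columns are separated by a clear vertical gap.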
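Illustrative sketch (encoding fallback): the encoding detection and fallback mechanism mentioned above could look roughly like the following. It uses only the Python standard library; the function name, the candidate list, and the final latin-1 fallback are assumptions made for this example, and a production implementation may instead rely on a dedicated detection library.

```python
import codecs
from typing import Sequence, Tuple


def decode_with_fallback(
    data: bytes, candidates: Sequence[str] = ("utf-8", "cp1252")
) -> Tuple[str, str]:
    """Decode bytes by trying encodings in order; return (text, encoding used)."""
    # A byte-order mark is the strongest hint, so try the matching codec first.
    if data.startswith(codecs.BOM_UTF8):
        ordered = ("utf-8-sig",) + tuple(candidates)
    elif data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        ordered = ("utf-16",) + tuple(candidates)
    else:
        ordered = tuple(candidates)

    for encoding in ordered:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate

    # latin-1 maps every byte to a code point, so this last resort cannot fail.
    return data.decode("latin-1"), "latin-1"
```

Returning the encoding alongside the text lets the caller record which fallback was used, which helps when debugging files that decode without errors but contain the wrong characters.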
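Illustrative sketch (merging inferred and OCR layouts): populating inferred regions with OCR text, as described above, can be expressed as an overlap test between bounding boxes. The Region class, the function names, and the 0.5 coverage threshold are hypothetical and chosen only for this example; they are not taken from the library.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed box layout: (x0, y0, x1, y1) with the origin at the top-left corner.
Box = Tuple[float, float, float, float]


@dataclass
class Region:
    box: Box
    text: str = ""


def _area(box: Box) -> float:
    return max(box[2] - box[0], 0.0) * max(box[3] - box[1], 0.0)


def _intersection_area(a: Box, b: Box) -> float:
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def fill_text_from_ocr(
    inferred: List[Region], ocr: List[Region], min_coverage: float = 0.5
) -> List[Region]:
    """Fill empty inferred regions with text from the OCR regions they mostly contain."""
    for region in inferred:
        if region.text:
            continue  # keep text that was already extracted
        inside = [
            o
            for o in ocr
            if _area(o.box) > 0
            and _intersection_area(region.box, o.box) / _area(o.box) >= min_coverage
        ]
        # Join OCR fragments top to bottom, then left to right.
        inside.sort(key=lambda o: (o.box[1], o.box[0]))
        region.text = " ".join(o.text for o in inside if o.text)
    return inferred
```

Measuring coverage relative to the OCR fragment's own area, rather than the inferred region's, keeps small word-level boxes from being rejected just because the enclosing region is large.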