Company: Unstructured IO

Role: Senior Machine Learning Engineer

https://github.com/Unstructured-IO

https://unstructured-io.github.io/unstructured/index.html

Project: Unstructured Library and API for Pre-processing Text Documents

Developed and enhanced the unstructured library, an open-source project that provides components for pre-processing text documents such as PDFs, HTML pages, and Word documents.

The library offers partitioning, cleaning, and staging bricks that enable users to build tailored pipelines for their specific document processing needs. Additionally, contributed to the development of the unstructured API, which exposes the library's capabilities as a service.

Technologies and Tools:

Key Accomplishments:

  1. Implemented partitioning of PDF and image documents using ML models, OCR libraries, and PDF processing libraries.
  2. Improved the parsing and structuring of HTML and MS Word documents.
  3. Collaborated on the development of the unstructured API and client SDK.
  4. Resolved over 300 bugs and reviewed more than 200 pull requests on the project's GitHub repository.
  5. Implemented OCR functionality for PDF and image files.
  6. Enhanced the extraction of image block elements from PDF documents.