Unstructured Model Context Protocol Hackathon


Mar 30, 2025 - 07:07

This blog is part of the Unstructured MCP Hackathon. It elaborates on the project details and provides a follow-along tutorial for better understanding.

Participating in the Unstructured MCP Virtual Hackathon was an exciting opportunity to explore the Model Context Protocol (MCP) and contribute a tool that simplifies extracting structured data from research papers.

Introduction to the Unstructured MCP Virtual Hackathon

The Unstructured MCP Virtual Hackathon, announced on Unstructured's blog, invited developers to create MCP servers using the Unstructured API. The challenge aimed to showcase how Unstructured's API can process unstructured data to solve real-world problems. Participants were encouraged to build solutions that benefit the developer community and share their implementations.

What is MCP?

MCP is an open protocol that standardizes how applications provide context to Large Language Models (LLMs). Think of MCP like a USB-C port for AI applications. Just as USB-C provides a standardized way to connect your devices to various peripherals and accessories, MCP provides a standardized way to connect AI models to different data sources and tools. Read more about it here.

MCP architecture

MCP Components:

  • MCP Hosts: Programs like Claude Desktop, Integrated Development Environments (IDEs), or AI tools that want to access data through MCP.
  • MCP Clients: Protocol clients that maintain one-to-one connections with servers.
  • MCP Servers: Lightweight programs that expose specific capabilities through the standardized Model Context Protocol.
  • Local Data Sources: Your computer’s files, databases, and services that MCP servers can securely access.
  • Remote Services: External systems available over the internet (e.g., through APIs) that MCP servers can connect to.
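To make the client and server roles more concrete: MCP messages are JSON-RPC 2.0, and a client asks a server to run one of its tools with a `tools/call` request. The snippet below is a minimal sketch of what such a request looks like on the wire. The tool name `run_workflow` and the `workflow_id` argument are hypothetical placeholders, not names confirmed by this project.

```python
import json
from itertools import count

# JSON-RPC requests carry an id so the response can be matched to the request.
_request_ids = count(1)


def make_tool_call(tool_name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request asking an MCP server to run one tool."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


if __name__ == "__main__":
    # Hypothetical tool name and argument, used only for illustration.
    request = make_tool_call("run_workflow", {"workflow_id": "example-id"})
    print(json.dumps(request, indent=2))
```

In practice the host (for example Claude Desktop) builds and sends these messages for you; the sketch only shows the shape of the exchange.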

Introducing the MCP Server Hackathon Project

In response to the hackathon challenge, I developed an MCP server aimed at extracting structured data from research paper PDFs. By leveraging the Unstructured API, the server processes PDF documents to extract elements such as titles, abstracts, sections, figures, tables, and references. This structured data can then be utilized for various purposes, including fine-tuning language models to assist researchers in conducting efficient literature reviews.
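As a rough illustration of what "structured data" means here: Unstructured returns a document as a list of elements, each with a `type` (such as `Title`, `NarrativeText`, or `Table`) and its `text`. The sketch below groups such elements by type. The sample elements are invented for illustration and are not output from the project.

```python
from collections import defaultdict

# Hypothetical sample in the general shape of Unstructured's element output.
SAMPLE_ELEMENTS = [
    {"type": "Title", "text": "A Sample Paper Title"},
    {"type": "NarrativeText", "text": "This paper studies a sample problem."},
    {"type": "Title", "text": "1. Introduction"},
    {"type": "NarrativeText", "text": "Prior work has examined related questions."},
    {"type": "Table", "text": "Model Accuracy Baseline 0.71"},
]


def group_by_type(elements):
    """Collect the text of each element under its element type."""
    grouped = defaultdict(list)
    for element in elements:
        element_type = element.get("type", "Unknown")
        grouped[element_type].append(element.get("text", ""))
    return dict(grouped)


if __name__ == "__main__":
    for element_type, texts in group_by_type(SAMPLE_ELEMENTS).items():
        print(f"{element_type}: {len(texts)} element(s)")
```

Grouping by type is one simple way to separate headings, body text, and tables before turning a paper into training or review data.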

MCP Server Architecture

MCP Hackathon Architecture

Follow Along: Building the MCP Server

To replicate or build upon this project, follow the steps outlined below:

Prerequisites:

  • Python Environment: Ensure Python 3.10 or higher is installed (the mcp Python package does not support older versions).
  • Unstructured API Key: Obtain an API key from Unstructured to access their document processing services.
  • Google Cloud Service Account: Set up a Google Cloud project and create a service account with appropriate permissions to access Google Drive for reading PDFs. Download the JSON credentials file. (Ensure you have at least one research paper PDF available in Google Drive.)
  • MongoDB Database: Set up a MongoDB cluster with at least one database and a collection to store the structured PDF data (in JSON format).
  • Claude Desktop: Install Claude Desktop to act as the MCP host that connects to the server and runs Unstructured workflows through its API.

Setup and Installation:

1. Clone the Repository:

   git clone https://github.com/HeetVekariya/MCPHackathon.git
   cd MCPHackathon

2. Install uv:

  • uv is an extremely fast Python package and project manager, written in Rust. Download it from here.

3. Install Dependencies:

   uv add "mcp[cli]"
   uv pip install --upgrade unstructured-client python-dotenv

  • Alternatively, use:

   uv sync

4. Set Up Environment Variables:

  • Copy the .env.template file to a new file named .env, then fill in the value for each key.

Example:

UNSTRUCTURED_API_KEY=""
MONGO_DB_CONNECTION_STRING=""
GOOGLEDRIVE_SERVICE_ACCOUNT_KEY=""
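Before starting the server, it can help to confirm that none of these values were left empty. The following is a small sketch, not part of the project's code. It assumes the variables are already in the process environment, for example after calling `load_dotenv()` from python-dotenv. The helper name `missing_keys` is my own.

```python
import os

# The three keys listed in the .env example above.
REQUIRED_KEYS = (
    "UNSTRUCTURED_API_KEY",
    "MONGO_DB_CONNECTION_STRING",
    "GOOGLEDRIVE_SERVICE_ACCOUNT_KEY",
)


def missing_keys(env=None):
    """Return the required keys that are unset or blank in the given mapping."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing values for:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```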

5. Develop the Workflow over Unstructured:

  • After setting up the Google Drive source connector and the MongoDB destination connector on Unstructured, build the workflow as shown below:

Unstructured Workflow

6. Configure Claude Desktop:

  • Open the Claude Desktop configuration file by running the appropriate command for your operating system:
# For macOS:
code ~/Library/Application\ Support/Claude/claude_desktop_config.json

# For Windows:
code $env:AppData\Claude\claude_desktop_config.json
  • In that file, add the following entry, replace the placeholder paths with the actual paths on your machine, and save the file:
{
    "mcpServers":
    {
        "UNS_MCP":
        {
            "command": "ABSOLUTE/PATH/TO/.local/bin/uv",
            "args":
            [
                "--directory",
                "ABSOLUTE/PATH/TO/YOUR-UNS-MCP-REPO/uns_mcp",
                "run",
                "server.py"
            ],
            "env":
            {
                "UNSTRUCTURED_API_KEY":""
            },
            "disabled": false
        }
    }
}
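A malformed configuration file is an easy mistake to make at this step, so a quick check before restarting Claude Desktop can save time. The sketch below is my own helper, not part of the project or of Claude Desktop. It parses the file and reports a few common problems, such as invalid JSON or an "env" entry that is not an object.

```python
import json
import sys


def check_claude_config(text: str) -> list:
    """Return a list of problems found in a claude_desktop_config.json string."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg} (line {exc.lineno})"]

    if not isinstance(config, dict) or not isinstance(config.get("mcpServers"), dict):
        return ['top level must be an object with an "mcpServers" object']

    problems = []
    for name, server in config["mcpServers"].items():
        if not isinstance(server, dict):
            problems.append(f"{name}: server entry must be an object")
            continue
        if not isinstance(server.get("command"), str):
            problems.append(f'{name}: "command" must be a string')
        if not isinstance(server.get("args", []), list):
            problems.append(f'{name}: "args" must be a list')
        if not isinstance(server.get("env", {}), dict):
            problems.append(f'{name}: "env" must be an object')
    return problems


if __name__ == "__main__":
    # Usage: python check_config.py path/to/claude_desktop_config.json
    with open(sys.argv[1], encoding="utf-8") as handle:
        found = check_claude_config(handle.read())
    print("\n".join(found) if found else "Configuration looks well-formed.")
```

The check only covers structure; it does not verify that the paths exist or that the API key is valid.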

7. Final Step:

  • Restart Claude Desktop so that it picks up the MCP server, confirm that the workflow tools are available, and then start querying Claude to run workflows on Unstructured.

Results and Outcomes

  • The MCP server successfully processes research paper PDFs, extracting structured data that can be used for various analytical purposes. By integrating Unstructured's API, the server streamlines the extraction process, making it a valuable tool for researchers and developers working with academic literature.

List unstructured workflows

Run unstructured workflow

  • Log in to the Unstructured platform and go to the jobs tab to check the latest job scheduled by the query we ran above. You will be able to see details like the image below:

Unstructured job details

Resources:

  • GitHub Repository: Access the complete codebase and documentation here.
  • Unstructured's MCP Repository: UNS-MCP
  • Unstructured API Documentation: Refer to the official documentation for detailed information on API usage and features here.
  • Not sure what Unstructured is? Check out my detailed introduction to Unstructured here.

Conclusion

Participating in the Unstructured MCP Virtual Hackathon was a rewarding experience that led to the development of a tool addressing a real-world need in academic research. The MCP server not only showcases the capabilities of Unstructured's API but also contributes to the broader community by providing a solution for efficient research paper data processing.