AI-Ready Congressional Data: Building a Python Tool for Plaintext Extraction

Let's cut to the chase: Congressional Records are gold mines of political data, but they're trapped in PDFs and scattered across congress.gov. Yeah, there's an API, but it just gives you more PDFs or links to HTML pages. Not exactly AI-friendly.
I needed plaintext, I needed it in bulk, and I needed it yesterday. So I built a Python tool that:
- Grabs data through the congress.gov API
- Ignores those pesky PDFs
- Extracts clean plaintext from the HTML pages
- Organizes everything neatly
Why plaintext? Because you can actually do something with it! Search it, analyze it, feed it to AI models, cite specific sentences, or just read it without your eyes bleeding from PDF formatting.
Technical Implementation
The tool leverages several key strategies to efficiently download records:
1. API Integration
The script uses the official congress.gov API, which requires an API key (free to obtain):
import time

import requests

def fetch_issues(volume, api_key):
    url = f"https://api.congress.gov/v3/daily-congressional-record/{volume}?format=json&api_key={api_key}&limit=1000"
    retries = 10
    for i in range(retries):
        response = requests.get(url)
        if response.status_code == 200:
            return response.json().get("dailyCongressionalRecord", [])
        else:
            # Back off exponentially on any non-200 response before retrying.
            wait_time = 2 ** i
            print(f"Error: {response.status_code}, {response.text}. Retrying in {wait_time} seconds...")
            time.sleep(wait_time)
    print("Failed to fetch issues after multiple retries.")
    return []
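For context, here's a minimal sketch of how that return value might be turned into the list of issue numbers used later on. The issueNumber field name is an assumption about the response shape, not something shown above:

# Hypothetical driver snippet; "issueNumber" is an assumed field name in the response.
issues = fetch_issues(170, "your_api_key_here")
issue_numbers = [issue.get("issueNumber") for issue in issues if issue.get("issueNumber")]
print(f"Found {len(issue_numbers)} issues in volume 170")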
2. Intelligent Rate Limiting
To handle API rate limits, the script implements exponential backoff (which is integrated directly in the API fetch functions as shown above).
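The pattern itself is small enough to stand alone. As a minimal sketch (a hypothetical retry_request helper, not a function from the script), the retry loop looks like this:

import time

import requests

def retry_request(url, retries=10, timeout=30):
    """Sketch of the retry pattern: wait 1s, 2s, 4s, ... between attempts."""
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=timeout)
            if response.status_code == 200:
                return response
        except (requests.exceptions.Timeout, requests.exceptions.ConnectionError):
            pass  # fall through to the backoff sleep below
        time.sleep(2 ** attempt)
    return None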
3. Concurrent Processing
Downloads are accelerated using Python's concurrent.futures:
with concurrent.futures.ThreadPoolExecutor() as executor:
    futures = [executor.submit(fetch_and_compile_articles, args.volume, issue_number, args.api_key, args.force_override)
               for issue_number in issue_numbers]
    for future in concurrent.futures.as_completed(futures):
        future.result()
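One design note: ThreadPoolExecutor's default worker count can outpace the API's rate limits, which is exactly what the backoff logic then has to absorb. Capping the pool is a simple mitigation; max_workers=4 below is an illustrative value, not one taken from the script:

import concurrent.futures

# Cap concurrency so parallel downloads are less likely to trip rate limits.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(fetch_and_compile_articles, args.volume, n, args.api_key, args.force_override)
               for n in issue_numbers]
    for future in concurrent.futures.as_completed(futures):
        future.result()  # re-raise any exception from a worker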
4. Smart Text Extraction
The script handles congress.gov's HTML pages to extract clean plaintext:
from bs4 import BeautifulSoup

def fetch_text_from_url(url):
    print(f"Fetching {url}...")
    retries = 10
    for i in range(retries):
        try:
            response = requests.get(url, timeout=30)
            if response.status_code == 200:
                soup = BeautifulSoup(response.text, 'html.parser')
                return soup.get_text()
            else:
                wait_time = 2 ** i
                print(f"Error: {response.status_code}. Retrying in {wait_time} seconds...")
                time.sleep(wait_time)
        except (requests.exceptions.Timeout, requests.exceptions.ConnectionError) as e:
            wait_time = 2 ** i
            print(f"Error: {e}. Retrying in {wait_time} seconds...")
            time.sleep(wait_time)
    print(f"Failed to fetch {url} after multiple retries.")
    return None
Instead of using the PDFs provided by the API, the tool navigates to the HTML URLs and extracts the text content directly. This gives us much cleaner text than we'd get from PDF extraction.
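One caveat: soup.get_text() keeps whatever whitespace the page's markup produces. If you want tidier output, a small post-processing pass is an easy optional extra; this is a sketch of that idea, not part of the script above:

from bs4 import BeautifulSoup

def extract_clean_text(html):
    """Sketch: extract visible text and collapse stray blank lines."""
    soup = BeautifulSoup(html, 'html.parser')
    raw = soup.get_text(separator='\n')
    lines = (line.strip() for line in raw.splitlines())
    return '\n'.join(line for line in lines if line)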
5. URL Extraction and Organization
The tool intelligently extracts the URLs for the HTML pages from the API response:
def extract_formatted_text_urls(data):
    urls = []
    if "articles" in data:
        for article in data["articles"]:
            for section in article.get("sectionArticles", []):
                for text_entry in section.get("text", []):
                    if text_entry.get("type") == "Formatted Text":
                        urls.append({
                            "url": text_entry.get("url"),
                            "startPage": section.get("startPage"),
                            "endPage": section.get("endPage")
                        })
    # Sort by startPage, then endPage
    urls.sort(key=lambda x: (x["startPage"], x["endPage"]))
    return [url["url"] for url in urls]
This makes sure we get all the text in a logical order, sorted by page number.
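To make the nesting concrete, here's the response shape the walker expects, written out as a hypothetical Python literal inferred from the code above (the values are invented placeholders, not real API data):

data = {
    "articles": [
        {
            "sectionArticles": [
                {
                    "startPage": "S100",
                    "endPage": "S102",
                    "text": [
                        {"type": "PDF", "url": "https://example.com/page.pdf"},
                        {"type": "Formatted Text", "url": "https://example.com/page.htm"},
                    ],
                }
            ]
        }
    ]
}
urls = extract_formatted_text_urls(data)  # -> ["https://example.com/page.htm"]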
Example Usage
With the core functionality in place, using the tool is straightforward:
# Example usage
python get_congressional_record.py 170 your_api_key_here
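Under the hood, the entry point just wires those arguments to the functions above. A minimal sketch of that wiring, which may differ in detail from the real script:

import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Download a Congressional Record volume as plaintext.")
    parser.add_argument("volume", type=int, help="volume number, e.g. 170")
    parser.add_argument("api_key", help="congress.gov API key")
    parser.add_argument("--force-override", action="store_true",
                        help="re-download files that already exist on disk")
    args = parser.parse_args()
    # ...then fetch the issue list and dispatch to the thread pool shown earlier.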
The script handles everything: fetching issues, extracting articles, processing HTML, and saving the plaintext files in a structured directory format:
def fetch_and_compile_articles(volume, issue, api_key, force_override=False):
    print(f"Fetching articles for Volume {volume}, Issue {issue}...")
    directory = f'congressional_records_{volume}/issue_{issue}'
    os.makedirs(directory, exist_ok=True)
    data = fetch_articles(volume, issue, api_key)
    if data and data.get("articles"):
        urls = extract_formatted_text_urls(data)
        for url in urls:
            file_name = url.split('/')[-1].replace('.htm', '.txt')
            file_path = os.path.join(directory, file_name)
            # Check if the file already exists and if force_override is not set
            if os.path.exists(file_path) and not force_override:
                print(f"File {file_path} already exists. Skipping...")
                continue
            text_content = fetch_text_from_url(url)
            if text_content:
                with open(file_path, 'w') as f:
                    f.write(text_content)
                    f.write("\n\n")
    else:
        print("No articles")
With just one command, you can download an entire volume of Congressional Records in plaintext format, properly organized and ready for analysis.
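For a sense of the on-disk organization, the layout looks roughly like this (file names are illustrative placeholders; they come from whatever HTML URLs the API returns):

congressional_records_170/
  issue_1/
    <article-page>.txt
    <article-page>.txt
  issue_2/
    ...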
This tool has already proven useful in real-world applications: I used it to help build the larger government data AI project described in my article Making Sense of Congressional Data with LightRAG, Amazon Bedrock and Ollama, where having clean plaintext was essential for effective AI processing.
Features and Benefits
- Automatic Organization: Files are saved in a logical directory structure by volume and issue
- Resume Capability: Existing files are skipped by default, so an interrupted run picks up where it left off; pass --force-override to re-download everything
- Error Resilience: Automatic retries handle temporary network issues
- Minimal Dependencies: Only requires requests and beautifulsoup4 (see the install command below)
- Plaintext Output: All records are saved as clean, structured text files
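Setting that up is one command (standard pip usage, nothing project-specific):

pip install requests beautifulsoup4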
Example: Working with the Plaintext Data
Here's a simple example of how you might analyze the downloaded plaintext records:
import os
import glob
import re

def find_mentions(keyword, volume):
    mentions = []
    files = glob.glob(f"congressional_records_{volume}/**/*.txt", recursive=True)
    for file in files:
        with open(file, 'r', encoding='utf-8') as f:
            content = f.read()
        if re.search(r'\b' + re.escape(keyword) + r'\b', content, re.IGNORECASE):
            # Extract the paragraph containing the keyword
            paragraphs = content.split('\n\n')
            for para in paragraphs:
                if re.search(r'\b' + re.escape(keyword) + r'\b', para, re.IGNORECASE):
                    mentions.append({
                        'text': para,
                        'source': f"Congressional Record Vol. {volume}, file: {file}",
                    })
    return mentions
# Example: Find all mentions of "climate change" in Volume 168
climate_mentions = find_mentions("climate change", 168)
Of course you can imagine how easy it would be to ingest this into a RAG pipeline.
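As a starting point, here's a hedged sketch of chunking the downloaded files for ingestion. The chunk size and overlap are arbitrary illustrative values, and a real pipeline would likely use its framework's own loaders:

import glob

def chunk_records(volume, chunk_size=1000, overlap=200):
    """Sketch: split each record into overlapping character chunks with a source tag."""
    chunks = []
    for path in glob.glob(f"congressional_records_{volume}/**/*.txt", recursive=True):
        with open(path, 'r', encoding='utf-8') as f:
            text = f.read()
        step = chunk_size - overlap
        for start in range(0, len(text), step):
            chunks.append({"source": path, "text": text[start:start + chunk_size]})
    return chunks

# Example: prepare Volume 168 for indexing in a vector store.
docs = chunk_records(168)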
Lessons Learned
Building this tool taught me several valuable lessons:
- API Rate Limits are Your Friend: They prevent abuse and ensure fair access
- Exponential Backoff Works: It's a simple but effective retry strategy
- Direct Text Extraction is Superior: Pulling text from the HTML pages yields cleaner results than extracting it from the API's PDFs
- Concurrency Speeds Things Up: But be mindful of rate limits
What's Next?
This tool opens up possibilities for further development:
- Add support for specific date ranges or topics
- Implement speaker identification to track who said what
- Build citation generators for academic and journalistic use
- Create a web interface for non-technical users
- Develop text analysis tools specifically for Congressional discourse
- Build APIs to serve this plaintext data to other applications
I'm already using the plaintext data from this tool in a larger government data AI project (see Making Sense of Congressional Data with LightRAG, Amazon Bedrock and Ollama), which demonstrates how this foundational data extraction work enables more sophisticated AI applications.
Conclusion
Government transparency shouldn't be limited by technical barriers or format issues. By automating Congressional Record downloads and converting them to plaintext, we make this valuable data more accessible and useful for everyone. Whether you're a researcher needing specific citations, a journalist fact-checking claims, a developer building civic tech, or an engaged citizen, having easy access to plaintext Congressional proceedings empowers better civic participation.
The complete source code is available on GitHub. Feel free to contribute, suggest features, or build upon this foundation.
Remember: democracy works best when information flows freely. Let's build tools that make that happen.
Have you built tools for accessing government data? What challenges did you face? Do you have ideas for how plaintext Congressional Records could be used? Let me know in the comments!