AI-Ready Congressional Data: Building a Python Tool for Plaintext Extraction

Let's cut to the chase: Congressional Records are gold mines of political data, but they're trapped in PDFs and scattered across congress.gov. Yeah, there's an API, but it just gives you more PDFs or links to HTML pages. Not exactly AI-friendly.
I needed plaintext, I needed it in bulk, and I needed it yesterday. So I built a Python tool that:
- Grabs data through the congress.gov API
- Ignores those pesky PDFs
- Extracts clean plaintext from the HTML pages
- Organizes everything neatly
Why plaintext? Because you can actually do something with it! Search it, analyze it, feed it to AI models, cite specific sentences, or just read it without your eyes bleeding from PDF formatting.
Technical Implementation
The tool leverages several key strategies to efficiently download records:
1. API Integration
The script uses the official congress.gov API, which requires an API key (free to obtain):
import time

import requests

def fetch_issues(volume, api_key):
    url = f"https://api.congress.gov/v3/daily-congressional-record/{volume}?format=json&api_key={api_key}&limit=1000"
    retries = 10
    for i in range(retries):
        response = requests.get(url)
        if response.status_code == 200:
            return response.json().get("dailyCongressionalRecord", [])
        else:
            # Back off exponentially on any non-200 response before retrying.
            wait_time = 2 ** i
            print(f"Error: {response.status_code}, {response.text}. Retrying in {wait_time} seconds...")
            time.sleep(wait_time)
    print("Failed to fetch issues after multiple retries.")
    return []
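For context, here's a minimal sketch of how that return value might be turned into the list of issue numbers used later on. The issueNumber field name is an assumption about the response shape, not something shown above:

# Hypothetical driver snippet; "issueNumber" is an assumed field name in the response.
issues = fetch_issues(170, "your_api_key_here")
issue_numbers = [issue.get("issueNumber") for issue in issues if issue.get("issueNumber")]
print(f"Found {len(issue_numbers)} issues in volume 170")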
2. Intelligent Rate Limiting
To handle API rate limits, the script implements exponential backoff (which is integrated directly in the API fetch functions as shown above).
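The pattern itself is small enough to stand alone. As a minimal sketch (a hypothetical retry_request helper, not a function from the script), the retry loop looks like this:

import time

import requests

def retry_request(url, retries=10, timeout=30):
    """Sketch of the retry pattern: wait 1s, 2s, 4s, ... between attempts."""
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=timeout)
            if response.status_code == 200:
                return response
        except (requests.exceptions.Timeout, requests.exceptions.ConnectionError):
            pass  # fall through to the backoff sleep below
        time.sleep(2 ** attempt)
    return None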
3. Concurrent Processing
Downloads are accelerated using Python's concurrent.futures:
with concurrent.futures.ThreadPoolExecutor() as executor:
    futures = [executor.submit(fetch_and_compile_articles, args.volume, issue_number, args.api_key, args.force_override)
               for issue_number in issue_numbers]
    for future in concurrent.futures.as_completed(futures):
        future.result()
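One design note: ThreadPoolExecutor's default worker count can outpace the API's rate limits, which is exactly what the backoff logic then has to absorb. Capping the pool is a simple mitigation; max_workers=4 below is an illustrative value, not one taken from the script:

import concurrent.futures

# Cap concurrency so parallel downloads are less likely to trip rate limits.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(fetch_and_compile_articles, args.volume, n, args.api_key, args.force_override)
               for n in issue_numbers]
    for future in concurrent.futures.as_completed(futures):
        future.result()  # re-raise any exception from a worker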
4. Smart Text Extraction
The script handles congress.gov's HTML pages to extract clean plaintext:
from bs4 import BeautifulSoup

def fetch_text_from_url(url):
    print(f"Fetching {url}...")
    retries = 10
    for i in range(retries):
        try:
            response = requests.get(url, timeout=30)
            if response.status_code == 200:
                soup = BeautifulSoup(response.text, 'html.parser')
                return soup.get_text()
            else:
                wait_time = 2 ** i
                print(f"Error: {response.status_code}. Retrying in {wait_time} seconds...")
                time.sleep(wait_time)
        except (requests.exceptions.Timeout, requests.exceptions.ConnectionError) as e:
            wait_time = 2 ** i
            print(f"Error: {e}. Retrying in {wait_time} seconds...")
            time.sleep(wait_time)
    print(f"Failed to fetch {url} after multiple retries.")
    return None
Instead of using the PDFs provided by the API, the tool navigates to the HTML URLs and extracts the text content directly. This gives us much cleaner text than we'd get from PDF extraction.
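One caveat: soup.get_text() keeps whatever whitespace the page's markup produces. If you want tidier output, a small post-processing pass is an easy optional extra; this is a sketch of that idea, not part of the script above:

from bs4 import BeautifulSoup

def extract_clean_text(html):
    """Sketch: extract visible text and collapse stray blank lines."""
    soup = BeautifulSoup(html, 'html.parser')
    raw = soup.get_text(separator='\n')
    lines = (line.strip() for line in raw.splitlines())
    return '\n'.join(line for line in lines if line)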
5. URL Extraction and Organization
The tool intelligently extracts the URLs for the HTML pages from the API response:
def extract_formatted_text_urls(data):
    urls = []
    if "articles" in data:
        for article in data["articles"]:
            for section in article.get("sectionArticles", []):
                for text_entry in section.get("text", []):
                    if text_entry.get("type") == "Formatted Text":
                        urls.append({
                            "url": text_entry.get("url"),
                            "startPage": section.get("startPage"),
                            "endPage": section.get("endPage")
                        })
    # Sort by startPage, then endPage
    urls.sort(key=lambda x: (x["startPage"], x["endPage"]))
    return [url["url"] for url in urls]
This makes sure we get all the text in a logical order, sorted by page number.
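To make the nesting concrete, here's the response shape the walker expects, written out as a hypothetical Python literal inferred from the code above (the values are invented placeholders, not real API data):

data = {
    "articles": [
        {
            "sectionArticles": [
                {
                    "startPage": "S100",
                    "endPage": "S102",
                    "text": [
                        {"type": "PDF", "url": "https://example.com/page.pdf"},
                        {"type": "Formatted Text", "url": "https://example.com/page.htm"},
                    ],
                }
            ]
        }
    ]
}
urls = extract_formatted_text_urls(data)  # -> ["https://example.com/page.htm"]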
Example Usage
With the core functionality in place, using the tool is straightforward:
# Example usage
python get_congressional_record.py 170 your_api_key_here
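Under the hood, the entry point just wires those arguments to the functions above. A minimal sketch of that wiring, which may differ in detail from the real script:

import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Download a Congressional Record volume as plaintext.")
    parser.add_argument("volume", type=int, help="volume number, e.g. 170")
    parser.add_argument("api_key", help="congress.gov API key")
    parser.add_argument("--force-override", action="store_true",
                        help="re-download files that already exist on disk")
    args = parser.parse_args()
    # ...then fetch the issue list and dispatch to the thread pool shown earlier.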
The script handles everything: fetching issues, extracting articles, processing HTML, and saving the plaintext files in a structured directory format:
def fetch_and_compile_articles(volume, issue, api_key, force_override=False):
    print(f"Fetching articles for Volume {volume}, Issue {issue}...")
    directory = f'congressional_records_{volume}/issue_{issue}'
    os.makedirs(directory, exist_ok=True)
    data = fetch_articles(volume, issue, api_key)
    if data and data.get("articles"):
        urls = extract_formatted_text_urls(data)
        for url in urls:
            file_name = url.split('/')[-1].replace('.htm', '.txt')
            file_path = os.path.join(directory, file_name)
            # Check if the file already exists and if force_override is not set
            if os.path.exists(file_path) and not force_override:
                print(f"File {file_path} already exists. Skipping...")
                continue
            text_content = fetch_text_from_url(url)
            if text_content:
                with open(file_path, 'w') as f:
                    f.write(text_content)
                    f.write("\n\n")
    else:
        print("No articles")
With just one command, you can download an entire volume of Congressional Records in plaintext format, properly organized and ready for analysis.
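For a sense of the on-disk organization, the layout looks roughly like this (file names are illustrative placeholders; they come from whatever HTML URLs the API returns):

congressional_records_170/
  issue_1/
    <article-page>.txt
    <article-page>.txt
  issue_2/
    ...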
This tool has already proven useful in real-world applications: I used it to help build the larger government data AI project described in my article Making Sense of Congressional Data with LightRAG, Amazon Bedrock and Ollama, where having clean plaintext was essential for effective AI processing.
Features and Benefits
- Automatic Organization: Files are saved in a logical directory structure by volume and issue
- Resume Capability: Existing files are skipped by default, so an interrupted run picks up where it left off; pass --force-override to re-download everything
- Error Resilience: Automatic retries handle temporary network issues
- Minimal Dependencies: Only requires requests and beautifulsoup4 (see the install command below)
- Plaintext Output: All records are saved as clean, structured text files
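Setting that up is one command (standard pip usage, nothing project-specific):

pip install requests beautifulsoup4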
Example: Working with the Plaintext Data
Here's a simple example of how you might analyze the downloaded plaintext records:
import os
import glob
import re

def find_mentions(keyword, volume):
    mentions = []
    files = glob.glob(f"congressional_records_{volume}/**/*.txt", recursive=True)
    for file in files:
        with open(file, 'r', encoding='utf-8') as f:
            content = f.read()
        if re.search(r'\b' + re.escape(keyword) + r'\b', content, re.IGNORECASE):
            # Extract the paragraph containing the keyword
            paragraphs = content.split('\n\n')
            for para in paragraphs:
                if re.search(r'\b' + re.escape(keyword) + r'\b', para, re.IGNORECASE):
                    mentions.append({
                        'text': para,
                        'source': f"Congressional Record Vol. {volume}, file: {file}",
                    })
    return mentions
# Example: Find all mentions of "climate change" in Volume 168
climate_mentions = find_mentions("climate change", 168)
Of course you can imagine how easy it would be to ingest this into a RAG pipeline.
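As a starting point, here's a hedged sketch of chunking the downloaded files for ingestion. The chunk size and overlap are arbitrary illustrative values, and a real pipeline would likely use its framework's own loaders:

import glob

def chunk_records(volume, chunk_size=1000, overlap=200):
    """Sketch: split each record into overlapping character chunks with a source tag."""
    chunks = []
    for path in glob.glob(f"congressional_records_{volume}/**/*.txt", recursive=True):
        with open(path, 'r', encoding='utf-8') as f:
            text = f.read()
        step = chunk_size - overlap
        for start in range(0, len(text), step):
            chunks.append({"source": path, "text": text[start:start + chunk_size]})
    return chunks

# Example: prepare Volume 168 for indexing in a vector store.
docs = chunk_records(168)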
Lessons Learned
Building this tool taught me several valuable lessons:
- API Rate Limits are Your Friend: They prevent abuse and ensure fair access
- Exponential Backoff Works: It's a simple but effective retry strategy
- Direct Text Extraction is Superior: Pulling text from the HTML pages yields cleaner results than extracting it from the API's PDFs
- Concurrency Speeds Things Up: But be mindful of rate limits
What's Next?
This tool opens up possibilities for further development:
- Add support for specific date ranges or topics
- Implement speaker identification to track who said what
- Build citation generators for academic and journalistic use
- Create a web interface for non-technical users
- Develop text analysis tools specifically for Congressional discourse
- Build APIs to serve this plaintext data to other applications
I'm already using the plaintext data from this tool in a larger government data AI project (see Making Sense of Congressional Data with LightRAG, Amazon Bedrock and Ollama), which demonstrates how this foundational data extraction work enables more sophisticated AI applications.
Conclusion
Government transparency shouldn't be limited by technical barriers or format issues. By automating Congressional Record downloads and converting them to plaintext, we make this valuable data more accessible and useful for everyone. Whether you're a researcher needing specific citations, a journalist fact-checking claims, a developer building civic tech, or an engaged citizen, having easy access to plaintext Congressional proceedings empowers better civic participation.
The complete source code is available on GitHub. Feel free to contribute, suggest features, or build upon this foundation.
Remember: democracy works best when information flows freely. Let's build tools that make that happen.
Have you built tools for accessing government data? What challenges did you face? Do you have ideas for how plaintext Congressional Records could be used? Let me know in the comments!