Streamlining Routine ML Tasks with LangChain: A Hacker News Comment Analysis Example

Introduction

LangChain and its companion framework LangGraph are synonymous with building autonomous "agents" capable of complex interactions: retrieving external data, chaining multiple tools together, and so on. While that's certainly their headline use case, I've noticed they also streamline more mundane day-to-day ML tasks.

One example is a Hacker News comment-analysis notebook I put together. There’s nothing earth-shattering about the underlying code, but I appreciate how LangChain abstracts away the complexity of interacting with large language models (LLMs). You can see the complete langchain_hn_comment_analysis Jupyter notebook in my GitHub repo.

Below, I’ll walk you through some of the key pieces—starting with how we define our LLM with structured output, then how we handle prompt templates to direct the model’s responses.

Setting up the LLM

from langchain_ollama import ChatOllama

llm = ChatOllama(
    model="qwen2.5",   # which local Ollama model to use
    temperature=0.0,   # a float, not a string; 0.0 keeps output deterministic
    verbose=False,     # disable detailed logging
    keep_alive=-1,     # keep the model loaded in memory indefinitely
)

llm = llm.with_structured_output(PostCommentEvaluation, method="json_schema")

Here, ChatOllama is LangChain’s integration with Ollama, which serves models locally. The parameters control various aspects:

  • model: Which LLM to use (e.g., “qwen2.5”).
  • temperature: How “creative” the model is (0.0 makes the output essentially deterministic).
  • verbose: Toggles detailed logging.
  • keep_alive: How long Ollama keeps the model loaded in memory after a request (-1 keeps it loaded indefinitely).

By calling llm.with_structured_output(PostCommentEvaluation, method="json_schema"), we tell LangChain that any response we get from the LLM should conform to the structure defined by our PostCommentEvaluation class. This eliminates the need to manually parse unstructured text or JSON—LangChain handles that for us. And if you ever switch to another chat model, LangChain’s chat model integrations let you do it with minimal code changes.
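As a quick sketch of that portability, here’s what swapping in OpenAI’s hosted models might look like. This isn’t part of the notebook; it assumes you have the langchain-openai package installed and an API key configured, and the model name is just an example:

from langchain_openai import ChatOpenAI

# Swap the chat model; everything downstream stays the same.
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.0)

# The structured-output call is unchanged -- only the chat model differs.
llm = llm.with_structured_output(PostCommentEvaluation, method="json_schema")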

Defining the Output Schema

The PostCommentEvaluation class looks like this:

from typing import List, Literal
from pydantic import BaseModel, Field

class PostCommentEvaluation(BaseModel):
    """
    A class to represent the evaluation of comments.

    Attributes:
    summary (str): Summary of comments into a few sentences.
    key_points (List[str]): Key points from the comments.
    topics (List[str]): Main topics discussed in the comments.
    controversies (List[str]): Controversial takeaways from the comments.
    sentiment (Literal): Overall emotional sentiment of the comments.
    """
    summary: str = Field(..., description="Summary comments into a few sentences")
    key_points: List[str] = Field([], description="Key points from the comments")
    topics: List[str] = Field([], description="Main topics discussed in the comments")
    controversies: List[str] = Field([], description="Controversial takeaways from the comments")
    sentiment: Literal[
        "happiness", "anger", "sadness", "fear", "surprise", "disgust", "trust", "anticipation"
    ] = Field(..., description="Overall emotional sentiment of the comments")

PostCommentEvaluation is a Pydantic model, so we get data validation out of the box. With this schema in place, the LLM’s response is parsed and validated into a PostCommentEvaluation instance.
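To see that validation in action, here’s a quick sketch (assuming Pydantic v2, whose model_validate raises a ValidationError on bad input; the sample data is made up):

from pydantic import ValidationError

# Valid data parses cleanly into a PostCommentEvaluation instance.
ok = PostCommentEvaluation.model_validate({
    "summary": "Commenters debated the trade-offs of the approach.",
    "sentiment": "trust",
})
print(ok.key_points)  # [] -- list fields fall back to their defaults

# An out-of-vocabulary sentiment is rejected by the Literal constraint.
try:
    PostCommentEvaluation.model_validate({"summary": "…", "sentiment": "confused"})
except ValidationError as e:
    print(e)  # reports that "confused" is not a permitted value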

Crafting the Prompt

The next bit is the prompt template—let’s call it EVALUATE_POST_COMMENTS_PROMPT. This is where LangChain really clicks for me. Here’s what happens:

  • I define a string template with placeholders like {post}, {comments}, and {user_question}.
  • LangChain’s PromptTemplate lets me fill those placeholders with real data (the HN post details, a chunk of comments, and my specific query).
  • Because PromptTemplate uses f-string formatting by default (rather than a full templating engine like jinja2), I don’t have to worry about template injection issues.

A snippet might look something like:

from langchain_core.prompts import PromptTemplate

EVALUATE_POST_COMMENTS_PROMPT = PromptTemplate.from_template(
    """
    Analyze the comments:

    {post}

    {comments}

    Please summarize, highlight key points, identify controversies, and provide an overall sentiment.
    Respond in JSON format with the following fields:
    summary, key_points, topics, controversies, sentiment.

    {user_question}
    """
)
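Filling the template and calling the model compose naturally with LangChain’s pipe operator. A minimal sketch (the variable values here are placeholders, not the notebook’s actual formatting):

# Chain the prompt into the structured-output model.
chain = EVALUATE_POST_COMMENTS_PROMPT | llm

result = chain.invoke({
    "post": "Title: Example post | Score: 100 | By: someuser",
    "comments": "1. First comment…\n2. Second comment…",
    "user_question": "What's the overall sentiment here?",
})
print(result.sentiment)  # result is a PostCommentEvaluation instance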

If you’re curious about more advanced prompt patterns or want to see prompts that other folks are using, check out the LangChain Hub. It’s essentially a community registry where folks share their custom prompt templates.
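Pulling a shared prompt is a one-liner. For example, using a well-known public handle (this assumes the langchainhub package is installed):

from langchain import hub

# Pull a community-shared prompt by its handle.
rag_prompt = hub.pull("rlm/rag-prompt")
print(rag_prompt.input_variables)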

Putting It All Together
So how does this fit into the Hacker News pipeline?

  1. Fetch Data: I grab a Hacker News post’s metadata plus its comments (including nested replies) via the Hacker News Firebase API (see the end-to-end sketch after the example output below).
  2. Format Data: I have small utility functions to convert the post info and comments into a neat, template-friendly format.
  3. Generate Prompt: Using those details, I fill in {post}, {comments}, and whatever user question I have (like “What’s the overall sentiment here?”).
  4. Invoke the Model: Call llm.invoke(...) with the final prompt.
  5. Validate & Use Results: LangChain returns a structured Python object that matches my PostCommentEvaluation. For instance:
{
  "summary": "Several commenters discussed the pros and cons of open sourcing…",
  "key_points": ["Open-source concerns", "Security vulnerabilities…"],
  "topics": ["Open source", "Security", "Community engagement"],
  "controversies": ["Whether it should remain private or not"],
  "sentiment": "anticipation"
}
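To make the flow concrete, here’s a minimal end-to-end sketch. It’s not the notebook’s actual code: fetch_item, fetch_comments, and the string formatting are simplified stand-ins for the utility functions described above, and the post ID is an arbitrary example.

import requests

HN_API = "https://hacker-news.firebaseio.com/v0"

def fetch_item(item_id: int) -> dict:
    """Fetch a single post or comment from the Hacker News Firebase API."""
    return requests.get(f"{HN_API}/item/{item_id}.json").json()

def fetch_comments(item: dict) -> list[dict]:
    """Recursively collect a post's comments, including nested replies."""
    comments = []
    for kid_id in item.get("kids", []):
        kid = fetch_item(kid_id)
        if kid and not kid.get("deleted") and not kid.get("dead"):
            comments.append(kid)
            comments.extend(fetch_comments(kid))  # descend into replies
    return comments

# 1. Fetch data
post = fetch_item(8863)  # an arbitrary example post ID
comments = fetch_comments(post)

# 2. Format data into template-friendly strings
post_text = f"Title: {post['title']} | Score: {post.get('score', 0)}"
comments_text = "\n".join(c.get("text", "") for c in comments)

# 3-5. Generate the prompt, invoke the model, use the validated result
evaluation = (EVALUATE_POST_COMMENTS_PROMPT | llm).invoke({
    "post": post_text,
    "comments": comments_text,
    "user_question": "What's the overall sentiment here?",
})
print(evaluation.sentiment)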

Summary

LangChain may be best known for its advanced agent-based workflows, but this Hacker News project demonstrates its usefulness for simpler, day-to-day ML tasks. Not only did Pydantic schemas make it straightforward to enforce a well-defined output format, but working through the LangChain documentation also highlighted important best practices—like employing f-string templates for secure prompt construction. These lessons transfer directly to real-world ML pipelines, saving you both time and trouble when working with LLMs.