Command Line LLM Text Completions

Lately I've been hacking away at some LLM integrations. I found myself needing to stream text completions directly into a file from the command line. I decided to release this part of my project with a detailed explanation of how it works. Hopefully this helps some of you get your own local LLM integrations up and running. Buckle up, this one's a long read.

Purpose

Spinning up a node app that does completions is pretty straightforward, but for CLI usage I ran into a few usability considerations that eventually led to the following code. I wanted this to be easy to use with a smooth user experience. Below is a complete outline of every code block explaining what it does and the reasoning behind it. I decided to release this as a guide for anyone interested in the details.

Dependencies

The script uses minimal imports. GPT4All and its Node bindings are the only external requirements.

import { loadModel, createCompletionStream } from 'gpt4all';
import { open as openFile } from 'node:fs/promises';
import readline from 'node:readline';

State Management

A small state object tracks what the script is doing to keep the user experience smooth.

const state = {
    busy: false,    // Block input during generation
    killed: false,  // Handle cancellation gracefully
    flashOn: false, // Loading animation state
    flashLoop: null // Loading animation timer
};

Error Handling

Gracefully shut down on unexpected errors.

process.on('uncaughtException', err => {
    console.error(err);
    if (state.busy) {
        shutdown();
    } else {
        process.exit();
    }
});

Processing mode

Can be 'cpu' | 'gpu' | 'amd' | 'nvidia' | 'intel' | ''.

The best available GPU will be used by default, falling back to CPU if no GPU is available.

const device = process.env.DEVICE ?? 'gpu';
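
For example, to force CPU inference for a single run (using the llm-complete command shown in the examples later on):

$ DEVICE=cpu llm-complete -p "Once upon a time"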

Model Configuration

Load a local model using the GPT4All bindings. If you want to experiment with different models, you can easily read these values from ENV variables or a JSON file.

const modelName = process.env.MODEL ?? 'mistral-7b-v0.1.Q4_K_M.gguf';
const ctx = process.env.CTX ?? 2048; // 2048 is max for Mistral 7b

const model = await loadModel(modelName, {
    modelConfigFile: "./models.json", // Per-model settings
    allowDownload: false, // We will manually download gguf file
    verbose: false,       // Suppress detailed output from model
    device: device,       // Processing device, set by ENV variable
    nCtx: ctx,            // Max context size, varies by model
    ngl: 100              // Number of gpu layers to use
});

The model must exist in GPT4All's model path. On Arch Linux this is ~/.local/share/nomic.ai/GPT4All/. An entry for this model must exist in models.json. You can use the metadata provided by Nomic or specify your own in the following format if your model is not listed. The GPT4All wiki provides good guidance on configuring custom models.

[
  {
    "order": "a",
    "name": "Mistral 7B",
    "filename": "mistral-7b-v0.1.Q4_K_M.gguf",
    "url": "https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/blob/main/mistral-7b-v0.1.Q4_K_M.gguf?download=true",
    "md5sum": "a5b363017e471c713665d57433f76e65",
    "filesize": "4368438912",
    "requires": "2.5.0",
    "ramrequired": "8",
    "parameters": "7 billion",
    "quant": "q4_0",
    "type": "Mistral",
    "description": "For creative completions, developed by Mistral AI",
    "promptTemplate": "%1",
    "chatTemplate": "",
    "systemPrompt": ""
  }
]

This project uses Mistral 7B Base converted to GGUF Format by TheBloke. There are plenty of models to choose from but for lightweight creative writing this one does quite well. It can be run on a decent laptop and is released under the Apache 2.0 license allowing commercial use. There are newer and larger models in this series but this one hits a good balance of resource usage and creativity.

The systemPrompt and chatTemplate options are not needed for basic completions. More on chat mode in the next article. This script does not generate responses with a personality. You cannot ask it a question and get a well-formed response. To use this tool, you feed it incomplete text, and it will complete the text for you.

Example Input

This is a story

Example Output

This is a story about how my best friend and I are complete opposites.

My name...

Tips for Better Completions

  1. End your prompt mid-sentence for more natural continuations
  2. Use markdown or code formatting to guide the style
  3. Include examples of the desired output format
  4. Keep context under 2048 tokens for best performance
  5. Use append mode -a for iterative writing
  6. Provide useful context details to guide the output

Command Line Argument Processing

Completions can be done a few different ways:

  • No input, random output to terminal
  • String input from command line with -p or --prompt flag.
  • File input with -f or --file flag
  • Output can be redirected with > output.txt
  • Output can be appended to input file with -a or --append flag.

const args = process.argv.slice(2);
const flags = {
    prompt : [ '-p', '--prompt' ],
    file : [ '-f', '--file'   ],
    append : [ '-a', '--append' ]
};
let inputPath, directInput, inputFile, outputFile;
const useFile = args.some(arg => flags.file.includes(arg));
const append = args.some(arg => flags.append.includes(arg));
const prompt = args.some(arg => flags.prompt.includes(arg));
const inputIndex = getInputIndex();

function getInputIndex() {
    if (append) return args.findIndex(arg => flags.append.includes(arg)) + 1;
    if (useFile) return args.findIndex(arg => flags.file.includes(arg)) + 1;
    if (prompt) return args.findIndex(arg => flags.prompt.includes(arg)) + 1;
    return -1;
}

Not the DRYest way to handle this, but it gets the job done.
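
If you want something DRYer, one possible sketch is a small helper that returns the argument following whichever flag alias is present (argAfter is a hypothetical name, reusing the args and flags objects from above):

// Hypothetical helper: return the value that follows the first matching flag alias
function argAfter(aliases) {
    const i = args.findIndex(arg => aliases.includes(arg));
    return i >= 0 ? args[i + 1] : undefined;
}

// Usage sketch: const inputPath = argAfter(flags.file) ?? argAfter(flags.append);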

Input Validation

If file mode is specified, make sure a path was provided. For direct input, we can continue with or without a prompt.

// Proceed with or without input
if (useFile || append) {
    // Get file path from args
    inputPath = args[inputIndex];
    if (!inputPath) {
        console.error('Error: No input file specified after ', args[inputIndex-1]);
        process.exit(1);
    }
} else {
    // Use string input from command line or empty input
    directInput = prompt ? args[inputIndex] : '';
}
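
The input file handle itself has to be opened before anything can be read from it or streamed back to it. Here is a minimal sketch using the openFile import from above, assuming 'a+' access in append mode (so the handle can be read, truncated, and appended to) and read-only access otherwise:

if (useFile || append) {
    try {
        // 'a+' = read + append for -a mode, plain read otherwise (assumed flags)
        inputFile = await openFile(inputPath, append ? 'a+' : 'r');
    } catch (err) {
        console.error('Error opening input file:', err.message);
        process.exit(1);
    }
}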

Generator Settings

Here you can adjust the quality of your output. There are many resources online discussing these options. ChatGPT can give you a good breakdown if needed. The settings below are reasonable defaults; adjust them to your use case.

const predict = process.env.PREDICT ?? 128;

const settings = {
    temperature: 0.7,    // Controls creativity (0.0-1.0)
    topK: 40,            // Limits vocabulary to top K tokens
    topP: 0.9,           // High probability cutoff
    minP: 0.1,           // Low probability cutoff
    repeatPenalty: 1.2,  // Penalize repeated tokens, 1 = No Penalty
    repeatLastN: 64,     // Lookback window for repeats
    nBatch: 2048,        // Tokens to process concurrently, higher values use more RAM
    nPredict: predict,   // Maximum tokens to generate, increase for longer output
    contextErase: 0.75,  // Percentage of past context to erase if exceeded
    promptTemplate: '%1' // Can override prompt template from config file
};

The big one to adjust here is nPredict. This decides how long your output will be. You can adjust this value with the PREDICT ENV var. The default of 128 tokens will result in a decent-sized paragraph of text or its equivalent (lists, code, etc.). For example:

Input

$ llm-complete -p "export class SillyButton extends HTMLElement {"

Output

export class SillyButton extends HTMLElement {
  constructor() {
    super();
    this.attachShadow({ mode: 'open' });
    const template = document.createElement('template');
    template.innerHTML = ``;
    this.

You can continue generating in append mode to keep building off previous work. Using a text editor that supports streaming input, like VS Code or Vim, you can see the results in real time, make edits, save, then continue generating.

$ llm-complete -a silly-button.js # add some text
$ llm-complete -a silly-button.js # run again to add more
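
If a pass feels too short, the PREDICT variable described above can be raised for any of these invocations, for example:

$ PREDICT=256 llm-complete -a silly-button.js # longer continuation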

Terminal Interface Setup

Connect input/output streams for writing to terminal. Override the default prompt and prevent tab completions.

const rl = readline.createInterface({
    input: process.stdin,
    output: process.stdout,
    terminal: true,
    prompt: '',
    completer: () => [[], '']
});

Input Control Functions

Block all keyboard input while processing and hide the cursor.

const tty = rl._ttyWrite; // Save the default handler so input can be restored later

function blockInput() {
    rl._ttyWrite = () => {};
    process.stderr.write('\x1b[?25l');
}

function restoreInput() {
    rl._ttyWrite = tty;
    process.stderr.write('\x1b[?25h');
}

Allow Ctrl+C to cancel generation even though all other input is blocked.

rl.input.on('data', key => {
    const keyStr = key.toString();
    if (state.busy) {
        if (keyStr === '\x03') {
            state.killed = true;
            return;
        }
    }
});

Interrupting Generation

Return false in this callback to stop the model from generating any more tokens. You can inspect the current token here to decide whether or not to stop generating. In this script, we trigger cancellation only with Ctrl+C, but this can be expanded on if needed.

settings.onResponseToken = (tokenId, token) => {
    return !state.killed;
};
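
As one illustrative expansion (not part of the script), the same callback could also stop at a custom stop sequence by accumulating the generated text. The STOP variable and generated accumulator below are hypothetical:

// Illustrative only: additionally stop once an optional STOP sequence appears
const stopSequence = process.env.STOP ?? '';
let generated = '';

settings.onResponseToken = (tokenId, token) => {
    generated += token;
    if (stopSequence && generated.includes(stopSequence)) return false;
    return !state.killed;
};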

Progress Indication

Show a flashing ellipsis while busy. Uses stderr to avoid polluting our output.

function startIndicating() {
    state.flashLoop = setInterval(() => {
        if (state.flashOn) {
            state.flashOn = false;
            process.stderr.write('   \b\b\b');
        } else {
            state.flashOn = true;
            process.stderr.write('...\b\b\b');
        }
    }, 400);
}

function stopIndicating() {
    clearInterval(state.flashLoop);
    state.flashOn = false;
}

Streaming Append to File

In append mode, tokens are streamed directly back to the input file. Before this we check if the input file ends in a single newline. If so, truncate it. This allows us to pass a partial sentence as input while adhering to POSIX text file standards. We only strip single newlines. Double newlines are left intact to allow starting completion with a new paragraph.

Single Newline Example

Input:

This is a story

Output:

This is a story about something...

Double Newline Example

Input:

# Test Plan:
- Do Tests
- More Tests


Output:

# Test Plan:
- Do Tests
- More Tests

# Test 1:
- Check the input

The model decides how to continue the text. If it determines that there should be a newline after the input, it will add one. Manipulating the input like this helps the model continue the text in a natural way.

To perform this check we read the last two bytes of the input file into a buffer.

async function createWriteStream() {
    try {
        const eof = (await inputFile.stat()).size - 2;
        if (eof > 1) {
            const tempFile = Buffer.alloc(2);
            const lastBytes = (await inputFile.read(tempFile,0,2,eof)).buffer;
            const isSingleNewline = (lastBytes[0] !== 0x0A && lastBytes[1] === 0x0A);
            if (isSingleNewline) { await inputFile.truncate(eof+1); }
        }
        outputFile = inputFile.createWriteStream();
    } catch (err) {
        console.error('Error creating WriteStream:', err);
        shutdown();
    }
}

Write Output

Stream output to file, redirected stdout, or direct to terminal.

function writeToken(token) {
    if (append) {
        try { outputFile.write(token); } catch (err) {
            console.error('Error writing to output file:', err);
            shutdown();
        }
    } else if (!process.stdout.isTTY) {
        process.stdout.write(token);
    } else {
        rl.line += token;
        rl.cursor = rl.line.length;
        rl._refreshLine();
    }
}

Shutdown and Cleanup

Prevents a segfault by allowing the model time to free its own resources.

function shutdown() {
    if (state.busy) {
        setTimeout(dispose, 800);
    } else {
        dispose();
    }
}

function dispose() {
    model.dispose();
    restoreInput();
    process.exit();
}

Stream Processing

Completion will continue until nPredict tokens are generated. This can result in fragmented sentences at the end of output. To prevent this, we will detect sentence boundaries and drop any trailing fragments.

const boundaries = /[.?!…:;\n]/;

To accomplish this task we must buffer the output. Default buffer is 30 tokens. This can be adjusted as necessary with the BUFFER ENV var. Output is delayed until the buffer is full. This allows us to drop the sentence fragment before output writing catches up. After all tokens are collected, we write the remaining buffer out on a timer to simulate the streaming effect of token generation.

async function processStream(input) {
    state.busy = true;

    // For stdout, we need to trim newlines from
    // input like we do when appending to file
    input = input.replace(/([^\n])\n$/, '$1');

    try {
        if (append) {
            // appending file already holds input
            await createWriteStream();
        } else {
            // write input to terminal
            writeToken(input);
        }

        // prompt the model
        const stream = createCompletionStream(
            model, input, settings
        );

        // Configure buffer
        const bufferAhead = process.env.BUFFER ?? 30;
        let buffer = [];
        let currentToken = -1;
        let currentIndex = 0;
        let currentBoundary = -1;

        // Loop until all tokens are received
        for await (let token of stream.tokens) {
            if (state.killed) return;

            // Prevent double space between input and output
            if (currentToken < 0 &&
                token.startsWith(' ') &&
                input.endsWith(' ')
            ) {
                token = token.toString().slice(1);
            }

            // Buffer tokens
            buffer.push(token);
            currentToken++;

            // Detect position of last sentence boundary
            if (token.match(boundaries)) {
                currentBoundary = buffer.length;
            }

            // Hide loading indicator before outputting to terminal
            if (currentToken == bufferAhead && !append && process.stdout.isTTY) {
                stopIndicating();
            }

            // Don't start outputting until buffer is full
            if (currentToken >= bufferAhead) {
                writeToken(buffer[currentIndex]);
                currentIndex++;
            }
        }

        // Drop any trailing sentence fragment from buffer
        const boundary = currentBoundary > 0 ? currentBoundary : buffer.length - 1;
        buffer = buffer.slice(currentIndex, boundary);
        while (buffer.slice(-1)[0]?.match(/\n/)) {
            buffer.pop();
        }

        // Process remaining buffer by continuing to output one token at time
        for (let i = 0; i < buffer.length; i++) {
            if (state.killed) return;
            await new Promise(resolve => {
                setTimeout(() => {
                    writeToken(buffer[i]);
                    resolve();
                }, 200);
            });
        }
    } catch (error) {
        handleProcessingError(error);
    } finally {
        processingStopped();
    }
}

Error Handling

Gracefully shut down if errors are encountered during processing.

function handleProcessingError(err) {
    console.error('Error processing stream:', err.message);
    shutdown();
}

Processing Completion

Write a final newline, reset the terminal state, and shut down gracefully when processing stops.

function processingStopped() {
    writeToken('\n');
    stopIndicating();
    restoreInput();
    shutdown();
}

Process Initialization

We read the input file async to prevent blocking. This ensures the flashing indicator and cancel detection will work while loading. Start processing after the file is fully read.

blockInput();
startIndicating();
if (inputFile) {
    inputFile.readFile('utf8')
        .then(input => processStream(input))
        .catch(error => {
            console.error('Error reading file:', error.message);
            process.exit(1);
        });
} else {
    processStream(directInput);
}

Future Improvements

I originally included inline editing of results on the terminal, but this proved to be trickier than expected. The core functionality works, but there are some quirks that need handling. I may release an update with this feature at some point if I can get it working properly.

Installation

You can use this as a base for your own implementation; I have released the code under the MIT License. Or you can install it via npm and start using it right away by running the installed llm-complete executable.
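
Assuming a global install (the package name below is a placeholder; substitute the published one):

$ npm install -g llm-complete  # placeholder package name
$ llm-complete -p "Once upon a time"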