Java Split Text File Efficiently

The problem

When facing the task of writing Java code that splits a text file into chunks of a maximum size, we might be tempted to write something like the following:

private long maxFileSizeBytes; // maximum size of each output chunk

public void split(Path inputFile, Path outputDir) throws IOException {
    Files.createDirectories(outputDir);

    try (BufferedReader reader = Files.newBufferedReader(inputFile)) {
        int fileIndex = 0;
        long currentSize = 0;
        BufferedWriter writer = newWriter(outputDir, fileIndex++);

        String line;
        while ((line = reader.readLine()) != null) {
            byte[] lineBytes = (line + System.lineSeparator()).getBytes();
            if (currentSize + lineBytes.length > maxFileSizeBytes) {
                writer.close();
                writer = newWriter(outputDir, fileIndex++);
                currentSize = 0;
            }
            writer.write(line);
            writer.newLine();
            currentSize += lineBytes.length;
        }
        writer.close();
    }
}

private BufferedWriter newWriter(Path dir, int index) throws IOException {
    Path filePath = dir.resolve("part_" + index + ".txt");
    return Files.newBufferedWriter(filePath);
}

This code is technically correct; however, it's quite inefficient when splitting large files into many chunks.

It performs many heap allocations (one per line), resulting in lots of temporary objects that are created and discarded (Strings, byte arrays).

There's also a less obvious issue: it copies data through several buffers and performs context switches between user and kernel mode.

Here’s what happens in the current code under the hood:

BufferedReader:

  • Calls read() on an underlying FileReader or InputStreamReader.
  • Data is copied from kernel space → user space buffer.
  • Then parsed into Java Strings (heap-allocated).

getBytes():

  • Converts String to a new byte[] → more heap allocations.

BufferedWriter:

  • Takes the byte/char data from user space.
  • Calls write(), which again involves copying from user space → kernel space.
  • Eventually flushed to disk.

So data moves back and forth between kernel and user space multiple times, with extra heap churn. Apart from the garbage-collection pressure, this has the following consequences:

  • Memory bandwidth is wasted copying between buffers.
  • High CPU utilization for what could be a disk-to-disk transfer.
  • The OS could have handled the bulk copy directly (via DMA or optimized I/O), but the Java code forfeits that efficiency by routing every byte through user-space logic.

The solution

So, how can we avoid these issues? The answer is to use zero-copy as much as possible, i.e. to avoid leaving kernel space whenever we can. In Java this can be done with the FileChannel method long transferTo(long position, long count, WritableByteChannel target). It essentially performs a disk-to-disk transfer and leverages whatever I/O optimizations the operating system provides.
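
To make the API concrete, here is a minimal sketch (not part of the splitter developed below; the method name is just illustrative) that copies an entire file channel-to-channel. Since transferTo may transfer fewer bytes than requested, it is called in a loop:

// Minimal sketch: copy a whole file without pulling its contents into user space.
static void copyFile(Path source, Path target) throws IOException {
    try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ);
            FileChannel out = FileChannel.open(target,
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
        long size = in.size();
        long position = 0;
        // transferTo may copy fewer bytes than requested, so keep going until done
        while (position < size) {
            position += in.transferTo(position, size - position, out);
        }
    }
}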

A key challenge arises: transferTo operates on raw byte ranges and can therefore cut a line in half. To address this, we need a strategy that keeps lines intact even though the file is processed by moving byte segments.

Without that constraint the problem would be trivial: call transferTo for every chunk, advancing the offset as position = position + maxFileSize until no more data can be transferred.
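
A sketch of that naive loop might look like this (it reuses the maxSizePerFileInBytes and tempDir fields introduced in the full listing further down, and deliberately ignores line boundaries, which is exactly the problem we still have to solve):

// Naive sketch: split purely by byte count; chunks may end mid-line.
private void splitIgnoringLines(Path fileToSplit) throws IOException {
    try (FileChannel inputChannel = FileChannel.open(fileToSplit, StandardOpenOption.READ)) {
        long fileSize = inputChannel.size();
        long position = 0;
        int fileCounter = 1;
        while (position < fileSize) {
            long chunkSize = Math.min(maxSizePerFileInBytes, fileSize - position);
            Path outputFilePath = tempDir.resolve("_part" + fileCounter++);
            try (FileChannel outputChannel = FileChannel.open(outputFilePath,
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
                // note: a production version should loop until chunkSize bytes are transferred
                inputChannel.transferTo(position, chunkSize, outputChannel);
            }
            position += chunkSize;
        }
    }
}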

To preserve line integrity, we need to find the end of the last complete line within each byte chunk. We do this by seeking to the expected end of the chunk and scanning backwards to locate the preceding line break. This gives us the exact byte count for the chunk, so the final line is included unbroken. This will be the only piece of code performing buffer allocations and copies, and since these operations should be minimal, the performance impact is expected to be negligible.

The following code illustrates the techniques discussed:

private static final int LINE_ENDING_SEARCH_WINDOW = 8 * 1024;

private long maxSizePerFileInBytes;
private Path outputDirectory;
private Path tempDir;

public void split(Path fileToSplit) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(fileToSplit.toFile(), "r");
            FileChannel inputChannel = raf.getChannel()) {

        long fileSize = raf.length();
        long position = 0;
        int fileCounter = 1;

        while (position < fileSize) {
            // Calculate end position (try to get close to max size)
            long targetEndPosition = Math.min(position + maxSizePerFileInBytes, fileSize);

            // If we're not at the end of the file, find the last line ending before max size
            long endPosition = targetEndPosition;
            if (endPosition < fileSize) {
                endPosition = findLastLineEndBeforePosition(raf, position, targetEndPosition);
            }

            long chunkSize = endPosition - position;
            var outputFilePath = tempDir.resolve("_part" + fileCounter);
            try (FileOutputStream fos = new FileOutputStream(outputFilePath.toFile());
                    FileChannel outputChannel = fos.getChannel()) {
                inputChannel.transferTo(position, chunkSize, outputChannel);
            }

            position = endPosition;
            fileCounter++;
        }

    }
}

private long findLastLineEndBeforePosition(RandomAccessFile raf, long startPosition, long maxPosition)
        throws IOException {
    long originalPosition = raf.getFilePointer();

    try {
        int bufferSize = LINE_ENDING_SEARCH_WINDOW;
        long chunkSize = maxPosition - startPosition;

        if (chunkSize < bufferSize) {
            bufferSize = (int) chunkSize;
        }

        byte[] buffer = new byte[bufferSize];
        long searchPos = maxPosition;

        while (searchPos > startPosition) {
            long distanceToStart = searchPos - startPosition;
            int bytesToRead = (int) Math.min(bufferSize, distanceToStart);

            long readStartPos = searchPos - bytesToRead;
            raf.seek(readStartPos);

            int bytesRead = raf.read(buffer, 0, bytesToRead);
            if (bytesRead <= 0)
                break;

            // Search backwards through the buffer for newline
            for (int i = bytesRead - 1; i >= 0; i--) {
                if (buffer[i] == '\n') {
                    return readStartPos + i + 1;
                }
            }

            searchPos -= bytesRead;
        }

        throw new IllegalArgumentException(
                "File cannot be split: no newline found within the limits.");
    } finally {
        raf.seek(originalPosition);
    }
}

The findLastLineEndBeforePosition method has certain limitations. Specifically, it only looks for Unix-style \n line endings, very long lines can trigger numerous backward read iterations, and files containing lines longer than maxSizePerFileInBytes cannot be split at all. However, it's well suited for scenarios like splitting access log files, which typically have short lines and a high volume of entries.
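
Before looking at the numbers, note that the snippets above assume the fields (maxSizePerFileInBytes, tempDir) are initialized somewhere. A hypothetical wrapper, shown only to illustrate how the benchmark below could wire things up (the class name and constructor are illustrative, not from the original code):

// Hypothetical wrapper class; only the fields and method names match the snippets above.
public class ZeroCopyFileSplitter {

    private static final int LINE_ENDING_SEARCH_WINDOW = 8 * 1024;

    private final long maxSizePerFileInBytes;
    private final Path tempDir;

    public ZeroCopyFileSplitter(long maxSizePerFileInBytes, Path tempDir) {
        this.maxSizePerFileInBytes = maxSizePerFileInBytes;
        this.tempDir = tempDir;
    }

    // split(Path) and findLastLineEndBeforePosition(...) as shown above
}

// e.g. zeroCopySplitter = new ZeroCopyFileSplitter(1024 * 1024, Paths.get("/tmp/split_output"));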

Benchmarking and Performance Analysis

In theory, our almost-zero-copy splitter should be faster than the initial implementation; now it's time to measure how much faster it actually is. To that end, I've run benchmarks of the two implementations, with the following results:

Benchmark                                                    Mode  Cnt           Score      Error   Units
FileSplitterBenchmark.splitFile                              avgt   15        1179.429 ±   54.271   ms/op
FileSplitterBenchmark.splitFile:·gc.alloc.rate               avgt   15        1349.613 ±   60.903  MB/sec
FileSplitterBenchmark.splitFile:·gc.alloc.rate.norm          avgt   15  1694927403.481 ± 6060.581    B/op
FileSplitterBenchmark.splitFile:·gc.count                    avgt   15         718.000             counts
FileSplitterBenchmark.splitFile:·gc.time                     avgt   15         317.000                 ms
FileSplitterBenchmark.splitFileZeroCopy                      avgt   15          77.352 ±    1.339   ms/op
FileSplitterBenchmark.splitFileZeroCopy:·gc.alloc.rate       avgt   15          23.759 ±    0.465  MB/sec
FileSplitterBenchmark.splitFileZeroCopy:·gc.alloc.rate.norm  avgt   15     2555608.877 ± 8644.153    B/op
FileSplitterBenchmark.splitFileZeroCopy:·gc.count            avgt   15          10.000             counts
FileSplitterBenchmark.splitFileZeroCopy:·gc.time             avgt   15           5.000                 ms

Below are the benchmark code and the file size used for the above results.

int maxSizePerFileInBytes = 1024 * 1024; // 1 MB chunks

@Setup(Level.Trial)
public void setup() throws Exception {
    inputFile = Paths.get("/tmp/large_input.txt");
    outputDir = Paths.get("/tmp/split_output");
    // Create a large file for benchmarking if it doesn't exist
    if (!Files.exists(inputFile)) {
        try (BufferedWriter writer = Files.newBufferedWriter(inputFile)) {
            for (int i = 0; i < 10_000_000; i++) {
                writer.write("This is line number " + i);
                writer.newLine();
            }
        }
    }
}

@Benchmark
public void splitFile() throws Exception {
    splitter.split(inputFile, outputDir);
}

@Benchmark
public void splitFileZeroCopy() throws Exception {
    zeroCopySplitter.split(inputFile);
}

$ du -h /tmp/large_input.txt
266M    /tmp/large_input.txt

Analyzing the results, the zeroCopySplitter demonstrates a considerable speedup, taking only 77 ms compared to the initial splitter's 1179 ms for this specific case. This performance advantage can be crucial when dealing with large volumes of data or many files.

Conclusion

Efficiently splitting large text files requires thinking about system-level behavior, not just application logic. The straightforward approach illustrates the cost of excessive copying and allocation, while a redesigned solution that leverages zero-copy transfers and still preserves line integrity yields substantial performance improvements.

This demonstrates the impact of system-aware programming and understanding I/O mechanics in creating significantly faster and more resource-efficient tools for handling large textual data like logs or datasets.