Java Split Text File Efficiently
The problem
When faced with the task of writing Java code that splits a text file into chunks of a maximum size, we might be tempted to write something like the following:
private long maxFileSizeBytes; // maximum chunk size, assumed to be set elsewhere (e.g. via a constructor)

public void split(Path inputFile, Path outputDir) throws IOException {
    Files.createDirectories(outputDir);
    try (BufferedReader reader = Files.newBufferedReader(inputFile)) {
        int fileIndex = 0;
        long currentSize = 0;
        BufferedWriter writer = newWriter(outputDir, fileIndex++);
        String line;
        while ((line = reader.readLine()) != null) {
            byte[] lineBytes = (line + System.lineSeparator()).getBytes();
            if (currentSize + lineBytes.length > maxFileSizeBytes) {
                writer.close();
                writer = newWriter(outputDir, fileIndex++);
                currentSize = 0;
            }
            writer.write(line);
            writer.newLine();
            currentSize += lineBytes.length;
        }
        writer.close();
    }
}

private BufferedWriter newWriter(Path dir, int index) throws IOException {
    Path filePath = dir.resolve("part_" + index + ".txt");
    return Files.newBufferedWriter(filePath);
}
This code is technically correct, but it's quite inefficient for splitting large files into many chunks.
It performs a heap allocation per line, producing lots of temporary objects (Strings, byte arrays) that are created and immediately discarded.
There's also a less obvious issue: the data is copied through several buffers, and every read and write involves a context switch between user and kernel mode.
Here’s what happens in the current code under the hood:
BufferedReader:
- Calls read() on an underlying FileReader or InputStreamReader.
- Data is copied from kernel space → user space buffer.
- Then parsed into Java Strings (heap-allocated).
getBytes():
- Converts the String to a new byte[] → more heap allocations.
BufferedWriter:
- Takes the byte/char data from user space.
- Calls write(), which again involves copying from user space → kernel space.
- Eventually flushed to disk.
So data moves back and forth between kernel and user space multiple times, with extra heap churn. Apart from the garbage collection pressure, it has the following consequences:
- Memory bandwidth is wasted copying between buffers.
- High CPU utilization for what could be a disk-to-disk transfer.
- The OS could have handled the bulk copy directly (via DMA or other optimized I/O), but the Java code forfeits that efficiency by pulling every byte through user-space logic.
The solution
So, how can we avoid the above issues? The answer is to use zero-copy as much as possible, i.e. to avoid leaving kernel space whenever we can. In Java this can be done with the FileChannel method long transferTo(long position, long count, WritableByteChannel target). It essentially performs a disk-to-disk transfer and leverages whatever I/O optimizations the operating system provides.
A key challenge arises: transferTo operates on raw byte counts, so a naive split can cut a line in half. To address this, we need a strategy that keeps lines intact even though the file is moved around in byte segments.
Solving the problem without the above challenge would be easy: just call transferTo for every chunk, advancing the position as position = position + maxFileSize until no more data can be transferred, as in the sketch below.
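For reference, a minimal sketch of that simple loop might look like the following. The method name, the maxFileSize parameter, and the output naming scheme are assumptions for illustration; it ignores line boundaries entirely, which is exactly the problem addressed next.
// Illustrative sketch only: split purely by byte count, ignoring line boundaries.
private void splitIgnoringLines(Path fileToSplit, long maxFileSize, Path outputDir) throws IOException {
    try (FileChannel inputChannel = FileChannel.open(fileToSplit, StandardOpenOption.READ)) {
        long fileSize = inputChannel.size();
        long position = 0;
        int fileCounter = 1;
        while (position < fileSize) {
            long chunkSize = Math.min(maxFileSize, fileSize - position);
            Path chunk = outputDir.resolve("_part" + fileCounter++);
            try (FileChannel outputChannel = FileChannel.open(chunk,
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
                // transferTo may move fewer bytes than requested; a robust version would loop on the return value.
                inputChannel.transferTo(position, chunkSize, outputChannel);
            }
            position += chunkSize;
        }
    }
}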
To preserve line integrity, we need to determine the end of the last complete line within each byte chunk. We'll achieve this by initially seeking to the expected end of the chunk and then scanning backwards to locate the preceding line break. This will give us the accurate byte count for the chunk, ensuring the inclusion of the final, unbroken line. This will be the sole section of code performing buffer allocations and copies, and since these operations should be minimal, the performance impact is expected to be negligible.
The following code illustrates the techniques discussed:
private static final int LINE_ENDING_SEARCH_WINDOW = 8 * 1024;

private long maxSizePerFileInBytes;
private Path outputDirectory;
private Path tempDir;

private void split(Path fileToSplit) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(fileToSplit.toFile(), "r");
         FileChannel inputChannel = raf.getChannel()) {
        long fileSize = raf.length();
        long position = 0;
        int fileCounter = 1;
        while (position < fileSize) {
            // Calculate end position (try to get close to max size)
            long targetEndPosition = Math.min(position + maxSizePerFileInBytes, fileSize);
            // If we're not at the end of the file, find the last line ending before max size
            long endPosition = targetEndPosition;
            if (endPosition < fileSize) {
                endPosition = findLastLineEndBeforePosition(raf, position, targetEndPosition, fileToSplit);
            }
            long chunkSize = endPosition - position;
            var outputFilePath = tempDir.resolve("_part" + fileCounter);
            // Kernel-space transfer of the chunk straight into the output file
            try (FileOutputStream fos = new FileOutputStream(outputFilePath.toFile());
                 FileChannel outputChannel = fos.getChannel()) {
                inputChannel.transferTo(position, chunkSize, outputChannel);
            }
            position = endPosition;
            fileCounter++;
        }
    }
}

// Scans backwards from maxPosition for the last '\n' and returns the position just after it
private long findLastLineEndBeforePosition(RandomAccessFile raf, long startPosition, long maxPosition,
        Path fileToSplit) throws IOException {
    long originalPosition = raf.getFilePointer();
    try {
        int bufferSize = LINE_ENDING_SEARCH_WINDOW;
        long chunkSize = maxPosition - startPosition;
        if (chunkSize < bufferSize) {
            bufferSize = (int) chunkSize;
        }
        byte[] buffer = new byte[bufferSize];
        long searchPos = maxPosition;
        while (searchPos > startPosition) {
            long distanceToStart = searchPos - startPosition;
            int bytesToRead = (int) Math.min(bufferSize, distanceToStart);
            long readStartPos = searchPos - bytesToRead;
            raf.seek(readStartPos);
            int bytesRead = raf.read(buffer, 0, bytesToRead);
            if (bytesRead <= 0)
                break;
            // Search backwards through the buffer for a newline
            for (int i = bytesRead - 1; i >= 0; i--) {
                if (buffer[i] == '\n') {
                    return readStartPos + i + 1;
                }
            }
            searchPos -= bytesRead;
        }
        throw new IllegalArgumentException(
                "File " + fileToSplit + " cannot be split. No newline found within the limits.");
    } finally {
        raf.seek(originalPosition);
    }
}
The findLastLineEndBeforePosition method has certain limitations. Specifically, it only handles Unix-style line endings (\n), very long lines can lead to numerous backward read iterations, and files containing lines longer than maxSizePerFileInBytes cannot be split. However, it's well-suited for scenarios like splitting access log files, which typically have short lines and a high volume of entries.
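Before looking at numbers, here is a hypothetical usage sketch. The class name ZeroCopyFileSplitter, its constructor, and the example path are assumptions for illustration (the snippet above only shows the fields and the split method, which would need to be exposed publicly).
// Hypothetical wiring; class name, constructor, and path are assumed for illustration.
Path input = Paths.get("/var/log/access.log");
Path chunkDir = Files.createTempDirectory("chunks");

ZeroCopyFileSplitter splitter = new ZeroCopyFileSplitter(10 * 1024 * 1024, chunkDir); // 10 MB chunks
splitter.split(input);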
Benchmarking and Performance Analysis
In theory, our almost zero-copy file splitter should be faster than the initial implementation; now it's time to measure how much faster. I've run benchmarks of the two implementations, and these are the results.
Benchmark Mode Cnt Score Error Units
FileSplitterBenchmark.splitFile avgt 15 1179.429 ± 54.271 ms/op
FileSplitterBenchmark.splitFile:·gc.alloc.rate avgt 15 1349.613 ± 60.903 MB/sec
FileSplitterBenchmark.splitFile:·gc.alloc.rate.norm avgt 15 1694927403.481 ± 6060.581 B/op
FileSplitterBenchmark.splitFile:·gc.count avgt 15 718.000 counts
FileSplitterBenchmark.splitFile:·gc.time avgt 15 317.000 ms
FileSplitterBenchmark.splitFileZeroCopy avgt 15 77.352 ± 1.339 ms/op
FileSplitterBenchmark.splitFileZeroCopy:·gc.alloc.rate avgt 15 23.759 ± 0.465 MB/sec
FileSplitterBenchmark.splitFileZeroCopy:·gc.alloc.rate.norm avgt 15 2555608.877 ± 8644.153 B/op
FileSplitterBenchmark.splitFileZeroCopy:·gc.count avgt 15 10.000 counts
FileSplitterBenchmark.splitFileZeroCopy:·gc.time avgt 15 5.000 ms
Following is the benchmark code and the file size used for the above results.
int maxSizePerFileInBytes = 1024 * 1024; // 1 MB chunks

@Setup(Level.Trial)
public void setup() throws Exception {
    inputFile = Paths.get("/tmp/large_input.txt");
    outputDir = Paths.get("/tmp/split_output");
    // Create a large file for benchmarking if it doesn't exist
    if (!Files.exists(inputFile)) {
        try (BufferedWriter writer = Files.newBufferedWriter(inputFile)) {
            for (int i = 0; i < 10_000_000; i++) {
                writer.write("This is line number " + i);
                writer.newLine();
            }
        }
    }
}

@Benchmark
public void splitFile() throws Exception {
    splitter.split(inputFile, outputDir);
}

@Benchmark
public void splitFileZeroCopy() throws Exception {
    zeroCopySplitter.split(inputFile);
}
$ du -h /tmp/large_input.txt
266M /tmp/large_input.txt
Analyzing the results, the zeroCopySplitter demonstrates a considerable speedup, taking only about 77 ms compared to the initial splitter's roughly 1179 ms for this specific case (roughly a 15x improvement), while allocating around 660 times less memory per operation. This performance advantage can be crucial when dealing with large volumes of data or many files.
Conclusion
Efficiently splitting large text files demands attention to system-level performance, not just correct logic. While the basic approach highlights the cost of excessive memory copies and allocations, a redesigned solution that leverages zero-copy transfers while preserving line integrity yields substantial performance improvements.
This demonstrates the impact of system-aware programming and understanding I/O mechanics in creating significantly faster and more resource-efficient tools for handling large textual data like logs or datasets.