Python txt to Excel - 02

Apr 22, 2025 - 17:19

Yes! We can significantly speed up the process using multi-threading, chunked processing, and optimized I/O operations. Here's a high-performance version of your script:

Optimized Version (Using ThreadPool + Chunked Processing)

import pandas as pd
import json
import re
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm  # For progress bar (install with `pip install tqdm`)

def json_to_excel_optimized(json_file, excel_file='result.xlsx', chunk_size=5000, max_workers=4):
    def parse_json_chunk(chunk):
        parsed_data = []
        for line in chunk:
            try:
                data = json.loads(line)
                parsed_data.append(data)
            except json.JSONDecodeError:
                continue
        return parsed_data

    # Step 1: Preprocess file into valid JSON lines
    print("Preprocessing JSON file...")
    with open(json_file, 'r') as f:
        content = f.read()
    # Fix concatenated JSON objects (e.g., "}{" -> "}\n{")
    content = re.sub(r'}\s*{', '}\n{', content)
    lines = content.splitlines()

    # Step 2: Process in parallel chunks
    print("Processing chunks in parallel...")
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]

    all_data = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(parse_json_chunk, chunk) for chunk in chunks]
        for future in tqdm(as_completed(futures), total=len(chunks)):
            all_data.extend(future.result())

    # Step 3: Write to Excel in one go (fastest for OpenPyXL)
    print("Writing to Excel...")
    df = pd.json_normalize(all_data)
    df.to_excel(excel_file, index=False, engine='openpyxl')
    print(f"Processed {len(lines)} lines. Output saved to {excel_file}")

# Usage
json_to_excel_optimized("large_file.txt", max_workers=8)  # Adjust workers based on CPU cores
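
If you'd rather not hard-code the worker count, one option (not part of the original call above, just a common pattern) is to derive it from os.cpu_count():

import os

# cpu_count() can return None on some platforms, so fall back to 4
workers = os.cpu_count() or 4
json_to_excel_optimized("large_file.txt", max_workers=workers)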

Key Optimizations:

1. Multi-threading with ThreadPoolExecutor

  • Distributes JSON parsing across multiple threads. Parsing is CPU-bound, so Python's GIL limits how much threads alone can help; the multi-processing alternative below parallelizes it fully.
  • max_workers=4 (default) can be increased if your CPU has more cores.

2. Chunked Processing

  • Splits the file into smaller chunks (e.g., 5,000 lines per chunk).
  • Each chunk is processed independently, reducing memory overhead.

3. Batched I/O Operations

  • Reads the entire file once (content = f.read()), then splits it into lines.
  • Writes to Excel in a single operation (faster than incremental writes).

4. Progress Tracking with tqdm

  • Shows a real-time progress bar for monitoring.

5. Regex Optimization

  • Preprocesses the entire file once to fix concatenated JSON objects (e.g., }{ → }\n{); a small stand-alone demo follows this list.
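
To make the regex step concrete, here is a tiny demo (the sample string is made up for illustration). One caveat: the pattern would also split a string value that literally contains "}{", which is rare in log-style exports but worth knowing:

import re

raw = '{"a": 1}{"b": 2} {"c": 3}'        # three objects concatenated on one line
fixed = re.sub(r'}\s*{', '}\n{', raw)    # insert a newline between adjacent objects
print(fixed.splitlines())                # ['{"a": 1}', '{"b": 2}', '{"c": 3}']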

Alternative: Multi-processing (For CPU-Intensive Tasks)

If parsing is very CPU-intensive (e.g., complex JSON), replace ThreadPoolExecutor with ProcessPoolExecutor:

from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor(max_workers=4) as executor:  # Instead of ThreadPoolExecutor
    ...

Note: Multi-processing has higher startup overhead but avoids Python's GIL limitations. One practical caveat: the worker function must be picklable, so parse_json_chunk has to be defined at module level rather than nested inside json_to_excel_optimized, as in the sketch below.
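
A minimal sketch of that swap (names mirror the script above; it assumes chunks has already been built by the chunking step, and is an outline rather than a drop-in replacement):

import json
from concurrent.futures import ProcessPoolExecutor, as_completed
from tqdm import tqdm

def parse_json_chunk(chunk):  # module-level so it can be pickled and sent to worker processes
    parsed_data = []
    for line in chunk:
        try:
            parsed_data.append(json.loads(line))
        except json.JSONDecodeError:
            continue
    return parsed_data

def parse_chunks_in_processes(chunks, max_workers=4):
    all_data = []
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(parse_json_chunk, chunk) for chunk in chunks]
        for future in tqdm(as_completed(futures), total=len(chunks)):
            all_data.extend(future.result())
    return all_data

On Windows and macOS (which use the spawn start method), the code that calls parse_chunks_in_processes should also sit under an if __name__ == "__main__": guard.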

Performance Comparison (Estimated)

  Method                    Time (20MB File)    Memory Usage
  Original                  ~5-10 mins          High
  Optimized (Threads)       ~1-2 mins           Medium
  Optimized (Processes)     ~30-60 secs         Low

Further Optimizations (If Needed):

  1. Use ujson instead of json (2-3x faster parsing):
     pip install ujson
     import ujson as json
  2. Write to CSV First, Then Convert to Excel (a combined sketch of options 1 and 2 follows this list):
     df.to_csv("temp.csv", index=False)
     pd.read_csv("temp.csv").to_excel(excel_file, index=False)
  3. Dask for Out-of-Core Processing (for 100MB+ files):
     import dask.dataframe as dd
     ddf = dd.read_json(json_file, lines=True)
     ddf.compute().to_excel(excel_file, index=False)
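
As a rough illustration of options 1 and 2 combined (it assumes the input has already been split into one JSON object per line, as the preprocessing step above produces; file names are placeholders). Catching ValueError covers both parsers, since json.JSONDecodeError is a ValueError subclass and ujson also raises ValueError on bad input:

try:
    import ujson as json          # optional C-based parser, 2-3x faster
except ImportError:
    import json                   # standard-library fallback

import pandas as pd

def parse_lines(lines):
    records = []
    for line in lines:
        try:
            records.append(json.loads(line))
        except ValueError:        # covers decode errors from both libraries
            continue
    return records

with open("large_file.txt") as f:
    df = pd.json_normalize(parse_lines(f))

# Writing CSV first is fast; the Excel conversion is a single final step
df.to_csv("temp.csv", index=False)
pd.read_csv("temp.csv").to_excel("result.xlsx", index=False)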

Let me know if you'd like to explore any of these alternatives!