Python file conversion takes more than 2 days - python

I have a folder that contains around 400 txt files. The max size of a txt file is 2 to 2.5 MB.
I am trying to convert these files to csv with Python code. My code works and quickly converts txt to csv when the txt files are small (even with more than 500 files), but when the files are a bit heavier it takes quite a long time.
It's obviously going to take longer for heavy data, but the problem is that this conversion process has been running for 2 days and is not even 50% complete.
Is there any way to convert these txt files to csv quickly? I mean within a few hours.
If it takes more than 2 days, I will not have enough time to analyze the results.
My code is here:
import glob
import os, os.path, glob
import numpy as np
import matplotlib.pyplot as plt
from natsort import natsorted
import pandas as pd
from matplotlib.patches import Ellipse
from matplotlib.text import OffsetFrom
from mpl_toolkits.mplot3d import Axes3D
from random import random

data_folder = "./all/"
data_folder

files = natsorted(glob.glob(data_folder + 'dump*.data'))
number_of_files = len(files)
#print(number_of_files)
#files

file_open = open("./all/dump80000.data", "r")
with open("./all/dump80000.data") as f:
    lines = f.readlines()

# removing 'ITEM:'
s = 'ITEM: ATOMS '
lines[8] = lines[8].replace(s, '')

# getting the header names
headers = lines[8].split()
headers.append('TIMESTEP')

df = pd.DataFrame(columns=headers)

counter = 0
for total_files in range(number_of_files):
    with open(files[total_files]) as f:
        lines = f.readlines()
    total_atoms = int(lines[3])
    for i in range(total_atoms):
        row_elements = lines[9+i].split()
        row_elements.append(int(lines[1]))
        df.loc[counter] = row_elements
        counter = counter + 1

df.to_csv(r'all.csv', index=False)
Any ideas or suggestions?
Thank you.
In case you need a txt sample:
https://raw.githubusercontent.com/Laudarisd/dump46000.data
or
https://raw.githubusercontent.com/Laudarisd/test/main/dump46000.data

How about using a simple readline loop? I suspect readlines and/or pd.DataFrame are consuming most of the time. The following seems to be fast enough for me.
import glob
import time

start = time.time()

data_folder = "./all/"
files = glob.glob(data_folder + 'dump*.data')

# get header from one of the files
with open('all/dump46000.data', 'r') as f:
    for _ in range(8):
        next(f)  # skip first 8 lines
    header = ','.join(f.readline().split()[2:]) + '\n'

for file in files:
    with open(file, 'r') as f, open('all.csv', 'a') as g:  # note the 'a'
        g.write(header)  # write the header
        for _ in range(9):
            next(f)  # skip first 9 lines
        for line in f:
            g.write(line.rstrip().replace(' ', ',') + '\n')

print(time.time() - start)
# id,type,x,y,z,vx,vy,vz,fx,fy,fz
# 201,1,0.00933075,-0.195667,1.53332,-0.000170702,-0.000265168,0.000185569,0.00852572,-0.00882728,-0.0344813
# 623,1,-0.101572,-0.159675,1.52102,-0.000125008,-0.000129469,6.1561e-05,0.0143586,-0.0020444,-0.0400259
# 851,1,-0.0654623,-0.176443,1.52014,-0.00017815,-0.000224676,0.000329338,0.0101743,0.00116504,-0.0344114
# 159,1,-0.0268728,-0.186269,1.51979,-0.000262947,-0.000386994,0.000254515,0.00961213,-0.00640215,-0.0397847

Taking a quick glance at your code, it seems you're taking the following approach to convert a file:
Open the file
Read the entire file into a buffer
Process the buffer
However, if you can make some small adjustments to your code:
Open the file
Read one line
Process the line
Continue until the file is done
Basically, take an iterative approach instead of reading the whole file in at once. You can then make it even faster using asyncio, processing all your files concurrently, as in the sketch below.
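A minimal sketch of that idea, assuming the same dump-file layout as in the question; the convert_one helper and the output naming are illustrative, and asyncio.to_thread requires Python 3.9+:
import asyncio
import glob

def convert_one(path, out_path):
    # line-by-line conversion of a single dump file (illustrative helper)
    with open(path) as f, open(out_path, 'w') as out:
        for _ in range(9):
            next(f)  # skip the 9-line header block
        for line in f:
            out.write(','.join(line.split()) + '\n')

async def convert_all(paths):
    # run each blocking conversion in a worker thread and wait for all of them
    await asyncio.gather(
        *(asyncio.to_thread(convert_one, p, p + '.csv') for p in paths)
    )

asyncio.run(convert_all(glob.glob('./all/dump*.data')))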

It's hard to give precise help without knowing exactly what data you want to extract from those files but from a first glance you definitely should use one of pandas' built-in file reading methods which are guaranteed to be many times faster than your code. Assuming you wish to skip the first 9 rows, you could do something like:
headers = ["a", "b", ...]
pd.read_csv(open("./all/dump80000.data"), skiprows=9, sep=" ", names=headers)
If this is still not fast enough, you can parallelize your code since most of the processing is just loading data into memory.
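For instance, here is a minimal sketch of the per-file read_csv approach applied to all dump files; the column names are taken from the sample file's header and the TIMESTEP handling mirrors the question's code, so treat it as a sketch rather than a drop-in solution:
import glob
import pandas as pd
from natsort import natsorted

frames = []
for path in natsorted(glob.glob('./all/dump*.data')):
    # read the numeric block of one dump file in a single call
    df = pd.read_csv(path, skiprows=9, sep=r'\s+',
                     names=['id', 'type', 'x', 'y', 'z',
                            'vx', 'vy', 'vz', 'fx', 'fy', 'fz'])
    with open(path) as f:
        first_two = [next(f) for _ in range(2)]
    df['TIMESTEP'] = int(first_two[1])  # the timestep value is on the second line
    frames.append(df)

# concatenate once at the end instead of growing a DataFrame row by row
pd.concat(frames, ignore_index=True).to_csv('all.csv', index=False)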

I recommend breaking the problem down into distinct steps for a few files, then once you're sure you understand how to correctly code each step independently, you can think about combining them:
convert all TXT to CSVs
process each CSV doing what you need
Here's how to do step 1:
import csv

out_f = open('output.csv', 'w', newline='')
writer = csv.writer(out_f)

in_f = open('input.txt')

# Consume first 8 lines you don't want
for _ in range(8):
    next(in_f)

# Get and fix-up your header
header = next(in_f).replace('ITEM: ATOMS ', '')
writer.writerow(header.split())

# Read the rest of the file line-by-line, splitting by space,
# which will make a row that the CSV writer can write
for line in in_f:
    row = line.split()
    writer.writerow(row)

in_f.close()
out_f.close()
When I ran that against your sample .data file, I got:
id,type,x,y,z,vx,vy,vz,fx,fy,fz
201,1,0.00933075,-0.195667,1.53332,-0.000170702,-0.000265168,0.000185569,0.00852572,-0.00882728,-0.0344813
623,1,-0.101572,-0.159675,1.52102,-0.000125008,-0.000129469,6.1561e-05,0.0143586,-0.0020444,-0.0400259
851,1,-0.0654623,-0.176443,1.52014,-0.00017815,-0.000224676,0.000329338,0.0101743,0.00116504,-0.0344114
...
Do that for all 400 TXT files, then write another script to process the resulting CSVs.
I'm on an M1 MacBook Air with a good, fast SSD. Converting that one .data file takes less than a tenth of a second. Unless you've got a really slow disk, I can't see both steps taking more than an hour.
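As a rough sketch of step 1 applied to every file (the output naming is just an example):
import csv
import glob
from natsort import natsorted

def convert(data_path, csv_path):
    # same logic as above, wrapped in a function so it can be applied to every file
    with open(data_path) as in_f, open(csv_path, 'w', newline='') as out_f:
        writer = csv.writer(out_f)
        for _ in range(8):
            next(in_f)  # consume the first 8 lines
        writer.writerow(next(in_f).replace('ITEM: ATOMS ', '').split())
        for line in in_f:
            writer.writerow(line.split())

for path in natsorted(glob.glob('./all/dump*.data')):
    convert(path, path.replace('.data', '.csv'))  # e.g. dump80000.data -> dump80000.csv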

Related

Most efficient way to convert large .txt files (size >30GB) .txt into .csv after pre-processing using Python

I have data in a .txt file that looks like this (let's name it "myfile.txt"):
28807644'~'0'~'Maun FCU'~'US#####28855353'~'0'~'WNB Holdings LLC'~'US#####29212330'~'0'~'Idaho First Bank'~'US#####29278777'~'0'~'Republic Bank of Arizona'~'US#####29633181'~'0'~'Friendly Hills Bank'~'US#####29760145'~'0'~'The Freedom Bank of Virginia'~'US#####100504846'~'0'~'Community First Fund Federal Credit Union'~'US#####
I have tried a couple of ways to convert this .txt into a .csv; one of them was using the csv library, but since I like pandas a lot, I used the following:
import pandas as pd
import time
#time at the start of program is noted
start = time.time()
# We set the path where our file is located and read it
path = r'myfile.txt'
f = open(path, 'r')
content = f.read()
# We replace undesired strings and introduce a line break.
content_filtered = content.replace("#####", "\n").replace("'", "")
# We read everything in columns with the separator "~"
df = pd.DataFrame([x.split('~') for x in content_filtered.split('\n')], columns = ['a', 'b', 'c', 'd'])
# We print the dataframe into a csv
df.to_csv(path.replace('.txt', '.csv'), index = None)
end = time.time()
#total time taken to print the file
print("Execution time in seconds: ",(end - start))
This takes about 35 seconds to process a 300 MB file. I can accept that kind of performance, but I'm trying to do the same for a much larger file, which is 35 GB, and it produces a MemoryError.
I tried using the csv library, but the results were similar. I attempted putting everything into a list and afterward writing it out to a CSV:
import csv

# We write to CSV
with open(path.replace('.txt', '.csv'), "w") as outfile:
    write = csv.writer(outfile)
    write.writerows(split_content)
Results were similar, not a huge improvement. Is there a way or methodology I can use to convert VERY large .txt files into .csv? Likely above 35GB?
I'd be happy to read any suggestions you may have, thanks in advance!
I took your sample string, and made a sample file by multiplying that string by 100 million (something like your_string*1e8...) to get a test file that is 31GB.
Following #Grismar's suggestion of chunking, I made the following, which processes that 31GB file in ~2 minutes, with a peak RAM usage depending on the chunk size.
The complicated part is keeping track of the field and record separators, which are multiple characters, and will certainly span across a chunk, and thus be truncated.
My solution is to inspect the end of each chunk and see if it has a partial separator. If it does, that partial is removed from the end of the current chunk, the current chunk is written-out, and the partial becomes the beginning of (and should be completed by) the next chunk:
CHUNK_SZ = 1024 * 1024

FS = "'~'"
RS = '#####'

# With chars repeated in the separators, check most specific (least ambiguous)
# to least specific (most ambiguous) to definitively catch a partial with the
# fewest number of checks
PARTIAL_RSES = ['####', '###', '##', '#']
PARTIAL_FSES = ["'~", "'"]
ALL_PARTIALS = PARTIAL_FSES + PARTIAL_RSES

f_out = open('out.csv', 'w')
f_out.write('a,b,c,d\n')

f_in = open('my_file.txt')
line = ''

while True:
    # Read chunks till no more, then break out
    chunk = f_in.read(CHUNK_SZ)
    if not chunk:
        break

    # Any previous partial separator, plus new chunk
    line += chunk

    # Check end-of-line for a partial FS or RS; only when separators are more than one char
    final_partial = ''

    if line.endswith(FS) or line.endswith(RS):
        pass  # Write-out will replace complete FS or RS
    else:
        for partial in ALL_PARTIALS:
            if line.endswith(partial):
                final_partial = partial
                line = line[:-len(partial)]
                break

    # Process/write chunk
    f_out.write(line
                .replace(FS, ',')
                .replace(RS, '\n'))

    # Add partial back, to be completed next chunk
    line = final_partial

# Clean up
f_in.close()
f_out.close()
Since your code just does straight up replacement, you could just read through all the data sequentially and detect parts that need replacing as you go:
def process(fn_in, fn_out, columns):
    new_line = b'#####'

    with open(fn_out, 'wb') as f_out:
        # write the header
        f_out.write((','.join(columns) + '\n').encode())

        i = 0
        with open(fn_in, "rb") as f_in:
            while (b := f_in.read(1)):
                if ord(b) == new_line[i]:
                    # keep matching the newline block
                    i += 1
                    if i == len(new_line):
                        # if matched entirely, write just a newline
                        f_out.write(b'\n')
                        i = 0
                    # write nothing while matching
                    continue
                elif i > 0:
                    # if you reach this, it was a partial match, write it
                    f_out.write(new_line[:i])
                    i = 0
                if b == b"'":
                    pass
                elif b == b"~":
                    f_out.write(b',')
                else:
                    # write the byte if no match
                    f_out.write(b)

process('my_file.txt', 'out.csv', ['a', 'b', 'c', 'd'])
That does it pretty quickly. You may be able to improve performance by reading in chunks, but this is pretty quick all the same.
This approach has the advantage over yours that it holds almost nothing in memory, but it does very little to optimise reading the file fast.
Edit: there was a big mistake in an edge case, which I realised after re-reading, fixed now.
Just to share an alternative way, based on convtools (table docs | github).
This solution is faster than the OP's, but ~7 times slower than Zach's (Zach works with str chunks, while this one works with row tuples, read via csv.reader).
Still, this approach may be useful because it lets you tap into stream processing and work with columns: rearrange them, add new ones, etc.
from convtools import conversion as c
from convtools.contrib.fs import split_buffer
from convtools.contrib.tables import Table

def get_rows(filename):
    with open(filename, "r") as f:
        for row in split_buffer(f, "#####"):
            yield row.replace("'", "")

Table.from_csv(
    get_rows("tmp.csv"), dialect=Table.csv_dialect(delimiter="~")
).into_csv("tmp_out.csv", include_header=False)

How to read a large tsv file in python and convert it to csv

I have a large tsv file (around 12 GB) that I want to convert to a csv file. For smaller tsv files, I use the following code, which works but is slow:
import pandas as pd

table = pd.read_table(path_of_tsv_file, sep='\t')
table.to_csv(path_and_name_of_csv_file, index=False)
However, this code does not work for my large file, and the kernel resets in the middle.
Is there any way to fix the problem? Does anyone know if the task is doable with Dask instead of Pandas?
I am using Windows 10.
Instead of loading all lines at once in memory, you can read line by line and process them one after another:
With Python 3.x:
fs = ","

table = str.maketrans('\t', fs)
fName = 'hrdata.tsv'

f = open(fName, 'r')
try:
    line = f.readline()
    while line:
        print(line.translate(table), end="")
        line = f.readline()
except IOError:
    print("Could not read file: " + fName)
finally:
    f.close()
Input (hrdata.tsv):
Name Hire Date Salary Sick Days remaining
Graham Chapman 03/15/14 50000.00 10
John Cleese 06/01/15 65000.00 8
Eric Idle 05/12/14 45000.00 10
Terry Jones 11/01/13 70000.00 3
Terry Gilliam 08/12/14 48000.00 7
Michael Palin 05/23/13 66000.00 8
Output:
Name,Hire Date,Salary,Sick Days remaining
Graham Chapman,03/15/14,50000.00,10
John Cleese,06/01/15,65000.00,8
Eric Idle,05/12/14,45000.00,10
Terry Jones,11/01/13,70000.00,3
Terry Gilliam,08/12/14,48000.00,7
Michael Palin,05/23/13,66000.00,8
Command:
python tsv_csv_convertor.py > new_csv_file.csv
Note:
If you use a Unix env, just run the command:
tr '\t' ',' <input.tsv >output.csv
You can use chunksize to iterate over the entire file in pieces. Note that this uses .read_csv() instead of .read_table()
df = pd.DataFrame()

for chunk in pd.read_csv('Check1_900.csv', header=None, names=['id', 'text', 'code'], chunksize=1000):
    df = pd.concat([df, chunk], ignore_index=True)
source
You can also try the low_memory=False flag (source).
The next thing to try would be memory_map (scroll down at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html):
memory_map : bool, default False
If a filepath is provided for filepath_or_buffer, map the file object directly onto memory and access the data directly from there. Using this option can improve performance because there is no longer any I/O overhead.
Note that to_csv() has similar functionality.
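For this question specifically, here is a minimal sketch that streams the TSV and appends each chunk to the CSV, so only one chunk is held in memory at a time (file names and chunk size are placeholders):
import pandas as pd

reader = pd.read_csv('input.tsv', sep='\t', chunksize=100_000)
for i, chunk in enumerate(reader):
    # write the header only for the first chunk, then append
    chunk.to_csv('output.csv', mode='w' if i == 0 else 'a',
                 header=(i == 0), index=False)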
Correct me if I'm wrong, but a TSV file is basically a CSV file that uses a tab character instead of a comma. To translate this efficiently in Python, you need to iterate through the lines of your source file, replace the tabs with commas, and write the new line to the new file. You don't need any module to do this; writing the solution in Python is actually quite simple:
def tsv_to_csv(filename):
    ext_index = filename.rfind('.tsv')
    if ext_index == -1:
        new_filename = filename + '.csv'
    else:
        new_filename = filename[:ext_index] + '.csv'

    with open(filename) as original, open(new_filename, 'w') as new:
        for line in original:
            new.write(line.replace('\t', ','))

    return new_filename
Iterating through the lines like this only loads each line into memory one by one, instead of loading the whole thing into memory. It might take a while to process 12GB of data though.
UPDATE:
In fact, now that I think about it, it may be significantly faster to use binary I/O on such a large file, and then to replace the tabs with commas on large chunks of the file at a time. This code follows that strategy:
from io import FileIO

# This chunk size loads 1MB at a time for conversion.
CHUNK_SIZE = 1 << 20

def tsv_to_csv_BIG(filename):
    ext_index = filename.rfind('.tsv')
    if ext_index == -1:
        new_filename = filename + '.csv'
    else:
        new_filename = filename[:ext_index] + '.csv'

    original = FileIO(filename, 'r')
    new = FileIO(new_filename, 'w')
    table = bytes.maketrans(b'\t', b',')

    while True:
        chunk = original.read(CHUNK_SIZE)
        if len(chunk) == 0:
            break
        new.write(chunk.translate(table))

    original.close()
    new.close()
    return new_filename
On my laptop using a 1GB TSV file, the first function takes 4 seconds to translate to CSV while the second function takes 1 second. Tuning the CHUNK_SIZE parameter might speed it up more if your storage can keep up, but 1MB seems to be the sweet spot for me.
Using tr as mentioned in another answer took 3 seconds for me, so the chunked python approach seems fastest.
You can use Python's built-in read and write to rewrite the file line by line. This may take some time to process depending on your file size, but it shouldn't run out of memory since you're working line by line.
with open("input.tsv", "r") as input_file:
    for line in input_file:
        with open("output.csv", "a") as output:
            line = line.replace("\t", ",")
            output.write(line)

Line has wrong number of columns, but I can't find which line

I have this very big text file (about 2.5 Gb), which I need to load and put in a numpy array of 2 columns using Python. Somewhere in the text file the number of columns seems to be wrong, so it can't load it.
I am trying to find out where exactly this happens, so I can fix it. However, the line number I get is not much help. I would like to get the first value of the line.
The file looks like this:
1.001 1
1.002 0
1.003 3
1.004 1
etc...
I am opening the file like this:
import numpy as np

with open('paths 8_10.txt', 'r') as paths_list:
    for file_path in paths_list:
        with open(file_path.strip(), 'r') as file:
            data = np.loadtxt(file_path.strip())
            t = data[:,0]
            x = data[:,1]
So I would like the value of t at the location where the program crashes.
I was thinking about a for-loop which prints the value up until where it stops loading, but I can't get it to work.
If speed is not an issue, I suggest you write a small test harness as follows:
import csv

with open('paths 8_10.txt', 'rb') as paths_list:
    csv_reader = csv.reader(paths_list)
    for line_number, line in enumerate(csv_reader, start=1):
        if len(line) != 2:
            print "Line {} has {} columns: {}".format(line_number, len(line), line)
This would let you identify which entries need fixing for use in your main script.
If needed, this approach could easily be extended to skip over erroneous lines or truncate the extra columns and write out the file automatically, thus fixing it for future use.
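For example, a minimal sketch of that extension, assuming whitespace-separated two-column data files as in the sample (the file names are placeholders):
# Copy only the well-formed lines to a cleaned file and report the rest.
with open('data.txt') as src, open('data_cleaned.txt', 'w') as dst:
    for line_number, line in enumerate(src, start=1):
        fields = line.split()
        if len(fields) == 2:
            dst.write(line)  # keep well-formed lines
        else:
            print("Skipping line {}: {!r}".format(line_number, line))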

Multiple editing of CSV files

I have hit a small snag operating on CSV files in Python (3.5). Previously I was working with single files and there was no problem, but right now I have >100 files in one folder.
So, my goal is:
To parse all *.csv files in the directory
From each file, delete the first 6 rows; the files consist of the following data:
"nu(Ep), 2.6.8"
"Date: 2/10/16, 11:18:21 AM"
19
Ep,nu
0.0952645,0.123776,
0.119036,0.157720,
...
0.992060,0.374300,
Save each file separately (for example, adding "_edited"), so that only the numbers are saved.
As an option: I have data subdivided into two parts for one material. For example: Ag(0-1_s).csv and Ag(1-4)_s.csv (after steps 1-3 they should be like Ag(*)_edited.csv). How can I merge these two files by adding the data from (1-4) to the end of (0-1), saving the result in a third file?
My code so far is the following:
import os, sys
import csv
import re
import glob
import fileinput

def get_all_files(directory, extension='.csv'):
    dir_list = os.listdir(directory)
    csv_files = []
    for i in dir_list:
        if i.endswith(extension):
            csv_files.append(os.path.realpath(i))
    return csv_files

csv_files = get_all_files('/Directory/Path/Here')

# Here is the problem with csv's, I dont know how to scan files
# which are in the list "csv_files".
for n in csv_files:
    #print(n)
    lines = []  # empty, because I dont know how to write it properly per each file
    input = open(n, 'r')
    reader = csv.reader(n)
    temp = []
    for i in range(5):
        next(reader)
    # a for loop for here regarding rows?
    # for row in n: ???
    #     ???
    input.close()

    #newfilename = "".join(n.split(".csv")) + "edited.csv"
    #newfilename can be used within open() below:
    with open(n + '_edited.csv', 'w') as nf:
        writer = csv.writer(nf)
        writer.writerows(lines)
This is the fastest way I can think of. If you have a solid-state drive, you could throw multiprocessing at this for more of a performance boost (see the sketch after the basic version below).
import glob
import os

for fpath in glob.glob('path/to/directory/*.csv'):
    fname = os.path.basename(fpath).rsplit(os.path.extsep, 1)[0]
    with open(fpath) as infile, open(os.path.join('path/to/dir', fname + "_edited" + os.path.extsep + 'csv'), 'w') as outfile:
        for _ in range(6):
            infile.readline()  # skip the 6 header rows
        for line in infile:
            outfile.write(line)
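And a minimal sketch of that multiprocessing variant (the directory paths and the strip_header helper are illustrative):
import glob
import multiprocessing
import os

def strip_header(fpath):
    # drop the first 6 lines of one file and write a *_edited.csv next to it
    fname = os.path.basename(fpath).rsplit(os.path.extsep, 1)[0]
    out_path = os.path.join(os.path.dirname(fpath), fname + '_edited.csv')
    with open(fpath) as infile, open(out_path, 'w') as outfile:
        for _ in range(6):
            infile.readline()
        for line in infile:
            outfile.write(line)

if __name__ == '__main__':
    with multiprocessing.Pool() as pool:
        pool.map(strip_header, glob.glob('path/to/directory/*.csv'))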

Splitting a CSV file into equal parts?

I have a large CSV file that I would like to split into a number that is equal to the number of CPU cores in the system. I want to then use multiprocess to have all the cores work on the file together. However, I am having trouble even splitting the file into parts. I've looked all over google and I found some sample code that appears to do what I want. Here is what I have so far:
def split(infilename, num_cpus=multiprocessing.cpu_count()):
    READ_BUFFER = 2**13
    total_file_size = os.path.getsize(infilename)
    print total_file_size
    files = list()
    with open(infilename, 'rb') as infile:
        for i in xrange(num_cpus):
            files.append(tempfile.TemporaryFile())
            this_file_size = 0
            while this_file_size < 1.0 * total_file_size / num_cpus:
                files[-1].write(infile.read(READ_BUFFER))
                this_file_size += READ_BUFFER
            files[-1].write(infile.readline())  # get the possible remainder
            files[-1].seek(0, 0)
    return files

files = split("sample_simple.csv")
print len(files)

for ifile in files:
    reader = csv.reader(ifile)
    for row in reader:
        print row
The two prints show the correct file size and that it was split into 4 pieces (my system has 4 CPU cores).
However, the last section of the code that prints all the rows in each of the pieces gives the error:
for row in reader:
_csv.Error: line contains NULL byte
I tried printing the rows without running the split function and it prints all the values correctly. I suspect the split function has added some NULL bytes to the resulting 4 file pieces but I'm not sure why.
Does anyone know if this is a correct and fast method to split the file? I just want resulting pieces that can be read successfully by csv.reader.
As I said in a comment, csv files would need to be split on row (or line) boundaries. Your code doesn't do this and potentially breaks them up somewhere in the middle of one — which I suspect is the cause of your _csv.Error.
The following avoids doing that by processing the input file as a series of lines. I've tested it and it seems to work standalone, in the sense that it divided the sample file up into approximately equal-size chunks; they can only be approximately equal because it's unlikely that a whole number of rows will fit exactly into a chunk.
Update
This is a substantially faster version of the code than I originally posted. The improvement comes from the fact that it now uses the temp file's own tell() method to determine the constantly changing length of the file as it's being written, instead of calling os.path.getsize(), which eliminated the need to flush() the file and call os.fsync() on it after each row is written.
import csv
import multiprocessing
import os
import tempfile

def split(infilename, num_chunks=multiprocessing.cpu_count()):
    READ_BUFFER = 2**13
    in_file_size = os.path.getsize(infilename)
    print 'in_file_size:', in_file_size
    chunk_size = in_file_size // num_chunks
    print 'target chunk_size:', chunk_size
    files = []
    with open(infilename, 'rb', READ_BUFFER) as infile:
        for _ in xrange(num_chunks):
            temp_file = tempfile.TemporaryFile()
            while temp_file.tell() < chunk_size:
                try:
                    temp_file.write(infile.next())
                except StopIteration:  # end of infile
                    break
            temp_file.seek(0)  # rewind
            files.append(temp_file)
    return files

files = split("sample_simple.csv", num_chunks=4)
print 'number of files created: {}'.format(len(files))

for i, ifile in enumerate(files, start=1):
    print 'size of temp file {}: {}'.format(i, os.path.getsize(ifile.name))
    print 'contents of file {}:'.format(i)
    reader = csv.reader(ifile)
    for row in reader:
        print row
    print ''
