Python - multiprocessing with large strings slower

I am working on a textual analysis of a large sample of 10-Ks (about 150,000) and am desperately trying to speed up my program with multiprocessing. The relevant function loads the txt files, parses them with some regular expressions, and saves them as "clean":
def plain_10k(f):
    input_text = open(ipath + "\\" + f, errors="ignore").read()
    # REGEXP
    output_file = open(opath + "\\" + f, "w", errors="ignore")
    output_file.write(input_text)
    output_file.close()
I try to perform this function over a list of file names as follows:
with Pool(processes=8) as pool, tqdm(total=len(files_10k)) as pbar:
    for d in pool.imap_unordered(plain_10k, files_10k):
        pbar.update()
Unfortunately, the program seems to be stuck: it never returns anything (i.e. never saves any clean txt files). Even with a small list of 10 files, nothing happens.
What is the problem here?
If it is relevant: the size of the input txt files ranges from 10 KB to 10 MB, with the majority being smaller than 1 MB.
I am quite new to Python, so the code above is the result of hours of googling and certainly not very good. I am happy about any comments and suggestions.
Thank you very much in advance!
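For reference, here is a minimal runnable sketch of the same pattern with the pieces the snippet leaves out; the ipath/opath directories and the placeholder regex are assumptions, not the questioner's actual code. Note that on Windows the pool must be created under an if __name__ == "__main__": guard, otherwise the worker processes never start.

import os
import re
from multiprocessing import Pool
from tqdm import tqdm

ipath = "input_10k"   # assumed input directory
opath = "clean_10k"   # assumed output directory

def plain_10k(f):
    # read the raw filing
    with open(os.path.join(ipath, f), errors="ignore") as fh:
        text = fh.read()
    # placeholder for the real cleaning regexes
    text = re.sub(r"<[^>]+>", " ", text)
    # write the cleaned version under the same file name
    with open(os.path.join(opath, f), "w", errors="ignore") as out:
        out.write(text)

if __name__ == "__main__":
    files_10k = os.listdir(ipath)
    with Pool(processes=8) as pool, tqdm(total=len(files_10k)) as pbar:
        for _ in pool.imap_unordered(plain_10k, files_10k):
            pbar.update()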

Related

Processing huge amount of text data in memory

I am trying to process ~20 GB of data on an Ubuntu system with 64 GB of RAM.
This step is part of some preprocessing steps to generate feature vectors for training an ML algorithm.
The original implementation (written by someone on my team) used lists. It does not scale well as we add more training data. It looks something like this:
all_files = glob("./Data/*.*")
file_ls = []
for fi in tqdm(all_files):
    with open(file=fi, mode="r", encoding='utf-8', errors='ignore') as f:
        file_ls.append(f.read())
This runs into a memory error (the process gets killed).
So I thought I should try replacing the list-based approach with a trie:
def insert(word):
    cur_node = trie_root
    for letter in word:
        if letter in cur_node:
            cur_node = cur_node[letter]
        else:
            cur_node[letter] = {}
            cur_node = cur_node[letter]
    cur_node[None] = None  # marks the end of a word

trie_root = {}
for fi in tqdm(all_files):
    with open(file=fi, mode="r", encoding='utf-8', errors='ignore') as f:
        for word in f.read().split():
            insert(word)
This too gets killed. The above is demo code that I wrote to capture the memory footprint of the objects. The worst part is that the demo code for the list runs standalone, but the demo code for the trie gets killed, leading me to believe that this implementation is worse than the list implementation.
My goal is to write some efficient code in Python to resolve this issue.
Kindly help me solve this problem.
EDIT:
Responding to @Paul Hankin: the data processing involves first taking each file and adding a generic placeholder for terms with a normalized term frequency greater than 0.01, after which each file is split into a list and a vocabulary is calculated taking all the processed files into consideration.
One of the simple solutions to this problem might be to NOT store the data in a list or any other data structure. You can try writing the data out to a file while doing the reading.
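A minimal sketch of that suggestion, assuming the same glob pattern as the question and a hypothetical process_text() standing in for the real preprocessing: each file's result is appended to an output file immediately, so nothing accumulates in memory.

from glob import glob
from tqdm import tqdm

def process_text(text):
    # placeholder for the real preprocessing (placeholder terms, splitting, vocabulary counts)
    return " ".join(text.split())

with open("processed.txt", "w", encoding="utf-8") as out:
    for fi in tqdm(glob("./Data/*.*")):
        with open(fi, mode="r", encoding="utf-8", errors="ignore") as f:
            # write the processed file straight to disk instead of keeping it in a list
            out.write(process_text(f.read()) + "\n")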

Processing large JSON with multiple root elements and read into pandas dataframe

I want to (pre)process large JSON files (5-10 GB each), which contain multiple root elements. These root elements follow each other without a separator, like this: {}{}....
So I first wrote the following simple code to get a valid JSON File:
with open(file) as f:
    file_data = f.read()
file_data = file_data.replace("}{", "},{")
file_data = "[" + file_data + "]"
df = pd.read_json(file_data)
Obviously this doesn't work with large files. Even a 400 MB file doesn't work. (I've got 16 GB of memory.)
I've read that it's possible to work with chunks, but I don't manage to get this into "chunk logic".
Is there a way to "chunkenize" this?
I am glad for your help.
I am having a hard time visualizing the multiple-root-element idea, but you should write the file_data contents to disk and try reading them in separately. If you have the file open, it will consume RAM in addition to the RAM consumed by the file_data object (and possibly even the modified object, though that's a garbage-collector question; I think garbage collection happens after the function returns). Try using f.close() explicitly instead of the with, and return the result from a separate function.
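One possible way to "chunkenize" concatenated root objects without rewriting the whole file (a sketch, not taken from the answer above): json.JSONDecoder.raw_decode parses one root object at a time from a growing buffer. The chunk size and the file name are assumptions.

import json
import pandas as pd

def iter_json_objects(path, chunksize=1 << 20):
    decoder = json.JSONDecoder()
    buffer = ""
    with open(path, encoding="utf-8") as f:
        while True:
            chunk = f.read(chunksize)
            if not chunk:
                break
            buffer += chunk
            # pull as many complete root objects out of the buffer as possible
            while True:
                buffer = buffer.lstrip()
                if not buffer:
                    break
                try:
                    obj, end = decoder.raw_decode(buffer)
                except ValueError:
                    break  # current object is incomplete; read more data first
                yield obj
                buffer = buffer[end:]

df = pd.DataFrame(iter_json_objects("big_file.json"))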

What would make this code that combines some flat files run faster?

I'm new to Python and haven't gotten into any optimization work yet. I'm attempting to take a bunch of files that are themselves already pretty large and combine them into one large file that I guess will wind up being close to 50-100 GB. More memory than I have, at any rate. I was given the code below and it works great for small files. When I try to run it over the actual files for my use case, it completely locks up my computer.
I understand that pandas is fast. I'm guessing that data frames are stored in memory. If that is the case, then that is probably what is wrecking things here. Is there any kind of mechanism to spill to disk, or possibly to write to an existing file instead of trying to hold the whole thing in a dataframe before writing to disk? Or possibly another option that I didn't think of?
import pandas as pd
import os

file_masks = ['fhv', 'green', 'yellow']

def combine_files(file_mask):
    csvfiles = []
    for path, directories, files in os.walk('TaxiDriveData/'):
        csvfiles.extend([os.path.join(path, fn) for fn in files if fn.startswith(file_mask)])
    df = pd.concat((pd.read_csv(fn) for fn in csvfiles))
    df.to_csv(os.path.join('TaxiDriveCombinedData', file_mask + '_trip_data.csv'), index=False)

for m in file_masks:
    combine_files(m)
Here's a non-pandas solution that doesn't load everything into memory. I haven't tested it, but it should work.
import os

file_masks = ['fhv', 'green', 'yellow']

def combine_files(file_mask):
    with open(os.path.join('TaxiDriveCombinedData', file_mask + '_trip_data.csv'), 'w') as fout:
        csvfiles = []
        for path, directories, files in os.walk('TaxiDriveData/'):
            csvfiles.extend([os.path.join(path, fn) for fn in files if fn.startswith(file_mask)])
        for in_file in csvfiles:
            with open(in_file, 'r') as fin:
                # next(fin)  # uncomment this if you want to skip each file's header line
                for line in fin:
                    fout.write(line)

for m in file_masks:
    combine_files(m)
You don't need Python to do that. There are a lot of tools on a Linux system that can join files and are optimized, or have parameters, to do this very efficiently: join, cat, dd...
This is not the most efficient option, but, for example:
cat input/*.csv > output/combined.csv
If you want a high-performance Python version, I recommend you read and write the files in chunks instead of reading them line by line.
Your biggest problem is I/O, and you can optimize it by reading and writing larger blocks from the hard disk. If you read and write in the optimal block size of your hard drive and your filesystem, you will notice the difference.
For example, a common block size for newer HDDs is 4096 bytes (4 KiB).
You can try something like the following:
NEW_LINE = '\n'

def read_in_chunks(f, chunksize=4096):
    while True:
        chunk = f.read(chunksize)
        if not chunk:
            break
        yield chunk

(...)

fout = open('output.csv', 'w')
for fname in files:
    with open(fname) as fin:
        buffer = ''
        for chunk in read_in_chunks(fin):
            buffer += chunk
            if NEW_LINE not in buffer:
                continue  # keep accumulating until we have at least one complete line
            lines, buffer = buffer.rsplit(NEW_LINE, 1)
            fout.write(lines + NEW_LINE)  # rsplit drops the last newline, so re-add it
        if buffer:
            fout.write(buffer)  # flush any trailing partial line
fout.close()

fastest method to read big data files in python

I have got some (about 60) huge (>2 GB) CSV files which I want to loop through to make subselections (e.g. each file contains data for 1 month of various financial products; I want to make 60-month time series of each product).
Reading an entire file into memory (e.g. by loading the file in Excel or MATLAB) is unworkable, so my initial search on Stack Overflow made me try Python. My strategy was to loop through each line iteratively and write it away to some folder. This strategy works fine, but it is extremely slow.
From my understanding there is a trade-off between memory usage and computation speed. Loading the entire file into memory is one end of the spectrum (the computer crashes); loading a single line into memory each time is obviously at the other end (computation time is about 5 hours).
So my main question is: *Is there a way to load multiple lines into memory, so as to make this process (100 times?) faster, while not losing functionality?* And if so, how would I implement this? Or am I going about this all wrong? Mind you, below is just a simplified version of what I am trying to do (I might want to make subselections in dimensions other than time). Assume that the original data files have no meaningful ordering (other than being split into 60 files, one for each month).
The particular method I am trying is:
# Creates a time series per bond
import csv
import linecache

# I have a row of comma-separated bond identifiers in 'allBonds.txt' for each month
# I have 60 large files financialData_&month&year
filedoc = []
months = ['jan','feb','mar','apr','may','jun','jul','aug','sep','oct','nov','dec']
years = ['08','09','10','11','12']
bonds = []
for j in range(0, 5):
    for i in range(0, 12):
        filedoc.append('financialData_' + str(months[i]) + str(years[j]) + '.txt')

for x in range(0, 60):
    line = linecache.getline('allBonds.txt', x)
    bonds = line.split(',')  # generate the identifiers for this particular month
    with open(filedoc[x]) as text_file:
        for line in text_file:
            temp = line.split(';')
            if temp[2] in bonds:  # checks if the bond of this iteration is among those we search for
                output_file = open('monthOutput' + str(temp[2]) + str(filedoc[x]) + '.txt', 'a')
                datawriter = csv.writer(output_file, dialect='excel', delimiter='^', quoting=csv.QUOTE_MINIMAL)
                datawriter.writerow(temp)
                output_file.close()
Thanks in advance.
P.s. Just to make sure: the code works at the moment (though any suggestions are welcome of course), but the issue is speed.
I would test pandas.read_csv, mentioned in https://softwarerecs.stackexchange.com/questions/7463/fastest-python-library-to-read-a-csv-file . It supports reading the file in chunks (the iterator=True option).
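A rough sketch of what that chunked read could look like for one month's file; the separator, the position of the bond identifier in column 2, and the chunk size are assumptions based on the question's code, and bonds is the identifier list from that code.

import pandas as pd

bonds = set(bonds)  # membership tests on a set are much faster than on a list
for chunk in pd.read_csv('financialData_jan08.txt', sep=';', header=None, chunksize=100000):
    # keep only the rows whose bond identifier is in this month's list
    selected = chunk[chunk[2].isin(bonds)]
    selected.to_csv('monthOutput_jan08.csv', mode='a', sep='^', header=False, index=False)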
I think this part of your code may cause serious performance problems if the condition is matched frequently.
if temp[2] in bonds:  # checks if the bond of this iteration is among those we search for
    output_file = open('monthOutput'+str(temp[2])+ str(filedoc[x]) +'.txt', 'a')
    datawriter = csv.writer(output_file, dialect='excel', delimiter='^',
                            quoting=csv.QUOTE_MINIMAL)
    datawriter.writerow(temp)
    output_file.close()
It would be better to avoid opening a file, creating a csv.writer() object, and then closing the file inside a loop.
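For example, here is a sketch of that idea using the question's variable names (it belongs inside the question's for x in range(0, 60) loop, so filedoc, x and bonds are assumed to exist): cache one open file and one csv.writer per bond, so each output file is opened once per month instead of once per matching row.

import csv

writers = {}  # bond identifier -> (file handle, csv writer)
with open(filedoc[x]) as text_file:
    for line in text_file:
        temp = line.split(';')
        if temp[2] in bonds:
            if temp[2] not in writers:
                fh = open('monthOutput' + str(temp[2]) + str(filedoc[x]) + '.txt', 'a')
                writers[temp[2]] = (fh, csv.writer(fh, dialect='excel', delimiter='^',
                                                   quoting=csv.QUOTE_MINIMAL))
            writers[temp[2]][1].writerow(temp)
# close all per-bond output files once the month's file has been processed
for fh, _ in writers.values():
    fh.close()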

Reading JSON from gigabytes of .txt files and add to the same list

I have 300 txt files (each between 80-100 MB) that I have to put into a list object and use all of the content at the same time. I already created a working solution, but unfortunately it crashes with a MemoryError when I load more than 3 txt files. I'm not sure that it matters, but I have a lot of RAM, so I could easily load 30 GB into memory if that would solve the problem.
Basically I would like to loop through the 300 txt files inside the same for loop. Is it possible to create a list object that holds 30 GB of content? Or achieve it in any different way? I would really appreciate it if somebody could explain the ideal solution or any useful tips.
Here is how I tried it; it produces the MemoryError after loading 3 txt files.
import json

def addContentToList(filenm):
    with open(filenm, encoding="ISO-8859-1") as v:
        jsonContentTxt.extend(json.load(v))

def createFilenameList(name):
    for r in range(2, 300):
        file_str = "%s%s.txt" % (name, r,)
        filenames.append(file_str)

filename1 = 'log_1.txt'
filename2 = 'log_'
filenames = []
jsonContentTxt = []

with open(filename1, encoding="ISO-8859-1") as f:
    jsonContentTxt = json.load(f)

createFilenameList(filename2)
for x in filenames:
    addContentToList(x)

json_data = json.dumps(jsonContentTxt)
content_list = json.loads(json_data)
print(content_list)
Put down the chocolate-covered banana and step away from the European currency systems.
Text files are a really bad idea for storing data like this. You should use a database; I recommend PostgreSQL or SQLite.
Apart from that, your error is probably due to using a 32-bit version of Python (which caps your memory allocation at 2 GB); use 64-bit instead. Even so, I think you'd be better off using a more appropriate tool for the job rather than allocating 30 GB of memory.
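A minimal standard-library sketch of the database suggestion, assuming the question's log_*.txt naming and that each file holds a JSON array (the table and column names are made up): each file is loaded once, inserted into SQLite, and released before the next one.

import json
import sqlite3

conn = sqlite3.connect("logs.db")
conn.execute("CREATE TABLE IF NOT EXISTS records (doc TEXT)")

for r in range(1, 300):
    with open("log_%s.txt" % r, encoding="ISO-8859-1") as f:
        records = json.load(f)  # only one file is held in memory at a time
    # store each JSON record as a text row; querying can then be done via SQL
    conn.executemany("INSERT INTO records (doc) VALUES (?)",
                     ((json.dumps(rec),) for rec in records))
    conn.commit()

conn.close()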
