Processing a huge amount of text data in memory - Python

I am trying to process ~20 GB of data on an Ubuntu system with 64 GB of RAM.
This step is part of some preprocessing steps used to generate feature vectors for training an ML algorithm.
The original implementation (written by someone on my team) uses lists. It does not scale well as we add more training data. It looks something like this:
all_files = glob("./Data/*.*")
file_ls = []
for fi in tqdm(all_files):
    with open(file=fi, mode="r", encoding='utf-8', errors='ignore') as f:
        file_ls.append(f.read())
This runs into a memory error (the process gets killed).
So I thought I should try replacing the list-based approach with a trie:
def insert(word):
    cur_node = trie_root
    for letter in word:
        if letter in cur_node:
            cur_node = cur_node[letter]
        else:
            cur_node[letter] = {}
            cur_node = cur_node[letter]
    cur_node[None] = None  # marks the end of a word

trie_root = {}
for fi in tqdm(all_files):
    with open(file=fi, mode="r", encoding='utf-8', errors='ignore') as f:
        for word in f.read().split():
            insert(word)
This too gets killed. The above is demo code that I wrote to measure the memory footprint of the objects. The worst part is that the list demo runs standalone, but the trie demo gets killed, leading me to believe that this implementation is worse than the list implementation.
My goal is to write some memory-efficient code in Python to resolve this issue.
Kindly help me solve this problem.
EDIT:
Responding to @Paul Hankin: the data processing involves first taking each file and adding a generic placeholder for terms with a normalized term frequency greater than 0.01, after which each file is split into a list and a vocabulary is calculated taking all the processed files into consideration.

One simple solution to this problem might be to NOT store the data in a list or any other in-memory data structure. You can try writing the data out to a file as you read it.
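For example, here is a minimal sketch of that idea, streaming each input file line by line and appending the processed text to a single output file. The helper process_line is hypothetical; the placeholder/term-frequency step described in the question is not implemented here.

from glob import glob
from tqdm import tqdm

def process_line(line):
    # hypothetical per-line preprocessing; replace with the real
    # placeholder/term-frequency logic described in the question
    return " ".join(line.split())

all_files = glob("./Data/*.*")
with open("processed_corpus.txt", "w", encoding="utf-8") as out:
    for fi in tqdm(all_files):
        with open(fi, mode="r", encoding="utf-8", errors="ignore") as f:
            for line in f:  # stream; never hold all file contents in RAM at once
                out.write(process_line(line) + "\n")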

Related

Training Word2Vec Model from sourced data - Issue Tokenizing data

I have recently sourced and curated a lot of reddit data from Google BigQuery.
The dataset looks like this:
Before passing this data to word2vec to create a vocabulary and be trained, it is required that I properly tokenize the 'body_cleaned' column.
I have attempted the tokenization with both manually created functions and NLTK's word_tokenize, but for now I'll keep it focused on using word_tokenize.
Because my dataset is rather large (close to 12 million rows), it is impossible for me to open and operate on the dataset in one go: Pandas tries to load everything into RAM and, as you can understand, it crashes, even on a system with 24 GB of RAM.
I am facing the following issue:
When I tokenize the dataset (using NLTK's word_tokenize) and apply the function to the dataset as a whole, it correctly tokenizes, and word2vec accepts that input and learns/outputs words correctly in its vocabulary.
When I tokenize the dataset by first batching the dataframe and iterating through it, the resulting token column is not what word2vec expects: although word2vec trains its model on that data for over 4 hours, the resulting vocabulary it learns consists of single characters in several encodings, as well as emojis - not words.
To troubleshoot this, I created a tiny subset of my data and tried to perform the tokenization on that data in two different ways:
Knowing that my computer can handle performing the action on the dataset, I simply did:
reddit_subset = reddit_data[:50]
reddit_subset['tokens'] = reddit_subset['body_cleaned'].apply(lambda x: word_tokenize(x))
This produces the following result:
This in fact works with word2vec and produces a model one can work with. Great so far.
Because of my inability to operate on such a large dataset in one go, I had to get creative with how I handle it. My solution was to batch the dataset and work on it in small iterations using pandas' own chunksize argument.
I wrote the following function to achieve that:
def reddit_data_cleaning(filepath, batchsize=20000):
    if batchsize:
        df = pd.read_csv(filepath, encoding='utf-8', error_bad_lines=False, chunksize=batchsize, iterator=True, lineterminator='\n')
    print("Beginning the data cleaning process!")
    start_time = time.time()
    flag = 1
    chunk_num = 1
    for chunk in df:
        chunk[u'tokens'] = chunk[u'body_cleaned'].apply(lambda x: word_tokenize(x))
        chunk_num += 1
        if flag == 1:
            chunk.dropna(how='any')
            chunk = chunk[chunk['body_cleaned'] != 'deleted']
            chunk = chunk[chunk['body_cleaned'] != 'removed']
            print("Beginning writing a new file")
            chunk.to_csv(str(filepath[:-4] + '_tokenized.csv'), mode='w+', index=None, header=True)
            flag = 0
        else:
            chunk.dropna(how='any')
            chunk = chunk[chunk['body_cleaned'] != 'deleted']
            chunk = chunk[chunk['body_cleaned'] != 'removed']
            print("Adding a chunk into an already existing file")
            chunk.to_csv(str(filepath[:-4] + '_tokenized.csv'), mode='a', index=None, header=None)
    end_time = time.time()
    print("Processing has been completed in: ", (end_time - start_time), " seconds.")
Although this piece of code allows me to actually work through this huge dataset in chunks and produces results where I would otherwise crash from memory failures, I get a result which doesn't fit my word2vec requirements, and it leaves me quite baffled as to the reason for it.
I used the above function to perform the same operation on the data subset to compare how the results differ between the two approaches, and got the following:
The desired result is in the new_tokens column, while the function that chunks the dataframe produces the "tokens" column result.
Is anyone any wiser as to why the same tokenization function produces a wholly different result depending on how I iterate over the dataframe?
I appreciate it if you read through the whole issue and stuck with it!
First & foremost, beyond a certain size of data, & especially when working with raw text or tokenized text, you probably don't want to be using Pandas dataframes for every interim result.
They add extra overhead & complication that isn't fully 'Pythonic'. This is particularly the case for:
Python list objects where each word is a separate string: once you've tokenized raw strings into this format, as for example to feed such texts to Gensim's Word2Vec model, trying to put those into Pandas just leads to confusing list-representation issues (as with your columns where the same text might be shown as either ['yessir', 'shit', 'is', 'real'] – which is a true Python list literal – or [yessir, shit, is, real] – which is some other mess likely to break if any tokens have challenging characters).
the raw word-vectors (or later, text-vectors): these are more compact & natural/efficient to work with in raw Numpy arrays than Dataframes
So, by all means, if Pandas helps for loading or other non-text fields, use it there. But then use more fundamental Python or Numpy datatypes for tokenized text & vectors - perhaps using some field (like a unique ID) in your Dataframe to correlate the two.
Especially for large text corpuses, it's more typical to get away from CSV and instead use large text files, with one text per newline-separated line, and each line pre-tokenized so that spaces can be fully trusted as token separators.
That is: even if your initial text data has more complicated punctuation-sensitive tokenization, or other preprocessing that combines/changes/splits other tokens, try to do that just once (especially if it involves costly regexes), writing the results to a single simple text file which then fits the simple rules: read one text per line, split each line only by spaces.
Lots of algorithms, like Gensim's Word2Vec or FastText, can either stream such files directly or via very low-overhead iterable-wrappers - so the text is never completely in memory, only read as needed, repeatedly, for multiple training iterations.
For more details on this efficient way to work with large bodies of text, see this article: https://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/
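As a rough illustration of that streaming pattern (my sketch, not the answerer's code): a minimal re-iterable wrapper over such a space-delimited, one-text-per-line file could look like this; Gensim's Word2Vec accepts any restartable iterable of token lists.

class LineCorpus:
    """Re-iterable corpus: one text per line, tokens separated by spaces."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield line.split()  # only one line's tokens in memory at a time

# usage sketch (assumes gensim is installed and corpus.txt is already pre-tokenized)
# from gensim.models import Word2Vec
# model = Word2Vec(sentences=LineCorpus("corpus.txt"), vector_size=100, workers=4)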
After taking gojomo's advice, I simplified my approach to reading the CSV file and writing to a text file.
My initial approach using pandas had yielded some pretty bad processing times for a file with around 12 million rows, and memory issues due to how pandas reads data all into memory before writing it out to a file.
What I also realized was that I had a major flaw in my previous code.
I was printing some output (as a sanity check), and because I printed output too often, I overflowed Jupyter and crashed the notebook, not allowing the underlying and most important task to complete.
I got rid of that, simplified reading with the csv module and writing into a txt file, and I processed the reddit database of ~12 million rows in less than 10 seconds.
Maybe not the finest piece of code, but I was scrambling to solve an issue that stood as a roadblock for me for a couple of days (and not realizing that part of my problem was my sanity checks crashing Jupyter was an even bigger frustration).
def generate_corpus_txt(csv_filepath, output_filepath):
    import csv
    import time
    start_time = time.time()
    with open(csv_filepath, encoding='utf-8') as csvfile:
        datareader = csv.reader(csvfile)
        count = 0
        header = next(csvfile)  # consume the header line so the reader starts at the data
        print(time.asctime(time.localtime()), " ---- Beginning Processing")
        with open(output_filepath, 'w+') as output:
            # Check that the file is not empty
            if header != None:
                for row in datareader:
                    # row is a list of the CSV fields; join them into one space-separated line
                    processed_row = str(' '.join(row)) + '\n'
                    output.write(processed_row)
                    count += 1
                    if count == 1000000:
                        print(time.asctime(time.localtime()), " ---- Processed 1,000,000 Rows of data.")
                        count = 0
    print('Processing took:', int((time.time() - start_time) / 60), ' minutes')
    # the with blocks close both files, so no explicit close() calls are needed

Best way to write rows of a numpy array to file inside, NOT after, a loop?

I'm new here and to python in general, so please forgive any formatting issues and whatever else. I'm a physicist and I have a parametric model, where I want to iterate over one or more of the model's parameter values (possibly in an MCMC setting). But for simplicity, imagine I have just a single parameter with N possible values. In a loop, I compute the model and several scalar metrics pertaining to it.
I want to save the data [parameter value, metric1, metric2, ...] line-by-line to a file. I don't care what type: .pickle, .npz, .txt, .csv or anything else are fine.
I do NOT want to save the array after all N models have been computed. The issue here is that, sometimes a parameter value is so nonphysical that the program I call to calculate the model (which is a giant complicated thing years in development, so I'm not touching it) crashes the kernel. If I have N = 30000 models to do, and this happens at 29000, I'll be very unhappy and have wasted a lot of time. I also probably have to be conscious of memory usage - I've figured out how to do what I propose with a text file, but it crashes around 2600 lines because I don't think it likes opening a text file that long.
So, some pseudo-code:
filename = 'outFile.extension'
dataArray = np.zeros([N, 3])
idx = 0
for p in Parameter1:
    modelOutputVector = calculateModel(p)
    metric1, metric2 = getMetrics(modelOutputVector)
    dataArray[idx, 0] = p
    dataArray[idx, 1] = metric1
    dataArray[idx, 2] = metric2
    ### Line that saves data here
    idx += 1
I'm partial to npz or pickle formats, but can't figure out how to do this with either. If there is a better format or a better solution, I appreciate any advice.
Edit: What I tried, in order to make a text file, was this, inside the loop:
fileObject = open(filename, 'ab')
np.savetxt(fileObject, rowOfData, delimiter = ',', newline = ' ')
fileObject.write('\n')
fileObject.close()
The first time it crashed at 2600 or whatever I thought it was just coincidence, but every time I try this, that's where it stops. I could hack it and make a batch of files that are all 2600 lines, but there's got to be a better solution.
It's hard to say with such limited knowledge of the error, but if you think it is a file-writing error, maybe you could try something like:
with open(filename, 'ab') as fileObject:
    # code that computes numpy array
    np.savetxt(fileObject, rowOfData, delimiter=',', newline=' ')
    fileObject.write('\n')
    # no need to .close() because the "with open()" will handle it
However:
I have not used np.savetxt().
I am not an expert on your project.
I do not even know if it is truly a file-writing error to begin with.
I just prefer the with open() technique because that's how all the introductory Python books I've read structure their file reading/writing, so I assume there is wisdom in it. You could also consider doing as fabianegli commented and save to separate files (that's what my work does).
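As an alternative sketch (my own, not from the answer above): opening the file once before the loop and flushing after each row keeps every completed iteration on disk even if a later model call crashes the kernel. The names Parameter1, calculateModel and getMetrics are the question's own placeholder functions.

import csv

with open('outFile.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['parameter', 'metric1', 'metric2'])  # header
    for p in Parameter1:
        modelOutputVector = calculateModel(p)   # may crash the kernel for a nonphysical p
        metric1, metric2 = getMetrics(modelOutputVector)
        writer.writerow([p, metric1, metric2])
        f.flush()  # this row is safely on disk before the next model run starts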

fastest method to read big data files in python

I have got some (about 60) huge (>2 GB) CSV files which I want to loop through to make subselections (e.g. each file contains data for 1 month of various financial products; I want to make 60-month time series of each product).
Reading an entire file into memory (e.g. by loading the file in excel or matlab) is unworkable, so my initial search on stackoverflow made me try python. My strategy was to loop through each line iteratively and write it away in some folder. This strategy works fine, but it is extremely slow.
From my understanding there is a trade-off between memory usage and computation speed. Loading the entire file into memory is one end of the spectrum (the computer crashes), while loading a single line into memory at a time is the other end (computation time is about 5 hours).
So my main question is: is there a way to load multiple lines into memory, so as to make this process (100 times?) faster, without losing functionality? And if so, how would I implement this? Or am I going about this all wrong? Mind you, below is just a simplified version of what I am trying to do (I might want to make subselections in dimensions other than time). Assume that the original data files have no meaningful ordering (other than being split into 60 files, one per month).
The method in particular I am trying is:
#Creates a time series per bond
import csv
import linecache
#I have a row of comma-separated bond identifiers 'allBonds.txt' for each month
#I have 60 large files financialData_&month&year
filedoc = []
months = ['jan','feb','mar','apr','may','jun','jul','aug','sep','oct','nov','dec']
years = ['08','09','10','11','12']
bonds = []
for j in range(0,5):
    for i in range(0,12):
        filedoc.append('financialData_' + str(months[i]) + str(years[j]) + '.txt')
for x in range(0,60):
    line = linecache.getline('allBonds.txt', x)
    bonds = line.split(',')  #generate the identifiers for this particular month
    with open(filedoc[x]) as text_file:
        for line in text_file:
            temp = line.split(';')
            if temp[2] in bonds:  #checks if the bond of this iteration is among those we search for
                output_file = open('monthOutput' + str(temp[2]) + str(filedoc[x]) + '.txt', 'a')
                datawriter = csv.writer(output_file, dialect='excel', delimiter='^', quoting=csv.QUOTE_MINIMAL)
                datawriter.writerow(temp)
                output_file.close()
Thanks in advance.
P.s. Just to make sure: the code works at the moment (though any suggestions are welcome of course), but the issue is speed.
I would test pandas.read_csv, mentioned in https://softwarerecs.stackexchange.com/questions/7463/fastest-python-library-to-read-a-csv-file . It supports reading the file in chunks (the iterator=True / chunksize options).
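A minimal sketch of that chunked-reading idea follows; the file name, separator and the assumption that the bond identifier sits in the third column are taken loosely from the question and may need adjusting.

import pandas as pd

# assumed layout: semicolon-separated rows with the bond identifier in column index 2
for chunk in pd.read_csv('financialData_jan08.txt', sep=';', header=None, chunksize=100_000):
    selected = chunk[chunk[2].isin(bonds)]  # 'bonds' is the identifier list from the question
    selected.to_csv('monthOutput_jan08.csv', mode='a', header=False, index=False)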
I think this part of your code may cause serious performance problems if the condition is matched frequently.
if temp[2] in bonds:  #checks if the bond of this iteration is among those we search for
    output_file = open('monthOutput' + str(temp[2]) + str(filedoc[x]) + '.txt', 'a')
    datawriter = csv.writer(output_file, dialect='excel', delimiter='^',
                            quoting=csv.QUOTE_MINIMAL)
    datawriter.writerow(temp)
    output_file.close()
It would be better to avoid opening a file, creating a csv.writer() object and then closing the file inside the loop.
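One hedged way to apply that advice (my sketch, not the answerer's code): keep the already-opened writers in a dict keyed by output filename, so each output file is opened exactly once; filedoc, bonds and x are the question's own variables.

import csv

writers = {}  # output filename -> (file handle, csv.writer)
with open(filedoc[x]) as text_file:
    for line in text_file:
        temp = line.split(';')
        if temp[2] in bonds:
            out_name = 'monthOutput' + str(temp[2]) + str(filedoc[x]) + '.txt'
            if out_name not in writers:
                fh = open(out_name, 'a', newline='')
                writers[out_name] = (fh, csv.writer(fh, dialect='excel', delimiter='^',
                                                    quoting=csv.QUOTE_MINIMAL))
            writers[out_name][1].writerow(temp)

for fh, _ in writers.values():
    fh.close()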

MPI in Python: load data from a file by line concurrently

I'm new to python as well as MPI.
I have a huge data file, 10 GB, and I want to load it into, e.g., a list or whatever is more efficient; please suggest.
Here is the way I load the file content into a list
def load(source, size):
    data = [[] for _ in range(size)]
    ln = 0
    with open(source, 'r') as input:
        for line in input:
            ln += 1
            data[ln % size].append(line)  # distribute the lines round-robin across the sublists
    return data
Note:
source: the file name
size: the number of concurrent processes; I divide the data into [size] sublists for parallel computing using MPI in Python.
Please advise how to load the data more efficiently and faster. I have been searching for days but couldn't find anything matching my purpose; if something exists, please comment with a link here.
Regards
If I have understood the question, your bottleneck is not Python data structures. It is the I/O speed that limits the efficiency of your program.
If the file is written in contiguous blocks on the HDD, then I don't know of a way to read it faster than reading the file from the first byte to the end.
But if the file is fragmented, you could create multiple threads, each reading a part of the file. This may slow down the individual reads, but modern HDDs implement a technique named NCQ (Native Command Queueing): it gives high priority to read operations on sectors near the current position of the HDD head, which can improve the overall read speed when using multiple threads.
To suggest an efficient data structure in Python for your program, you need to mention what operations you will perform on the data (delete, add, insert, search, append and so on) and how often.
By the way, if you use commodity hardware, 10 GB of RAM is expensive. Try reducing the need for this amount of RAM by loading only the data necessary for a computation, then replacing it with new data for the next operation. You can overlap the computation with the I/O operations to improve performance.
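A hedged sketch of that overlap idea (mine, not the answerer's): prefetch the next block of lines on a background thread while the current block is being processed; the chunk size and process_chunk are placeholders.

from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def read_chunk(f, chunk_size=100_000):
    return list(islice(f, chunk_size))  # next block of lines, [] at end of file

def process_chunk(lines):
    return sum(len(line) for line in lines)  # placeholder computation

with open('huge_file.txt') as f, ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(read_chunk, f)       # start reading the first chunk
    while True:
        lines = future.result()
        if not lines:
            break
        future = pool.submit(read_chunk, f)   # prefetch the next chunk...
        process_chunk(lines)                  # ...while computing on the current one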
(original) Solution using pickling
The strategy for your task can go this way:
split the large file into smaller ones, making sure they are divided on line boundaries
have Python code which can convert a smaller file into the resulting list of records and save them as a pickled file
run the Python code for all the smaller files in parallel (using Python or other means)
run integrating code, taking the pickled files one by one, loading the list from each and appending it to the final result.
To gain anything, you have to be careful, as overhead can overcome all possible gains from parallel runs:
as Python uses the Global Interpreter Lock (GIL), do not use threads for parallel processing; use processes. As processes cannot simply pass data around, you have to pickle the results and let the other (final integrating) part read them back.
try to minimize the number of loops. For this reason it is better to:
not split the large file into too many smaller parts. To use the power of your cores, it is best to match the number of parts to the number of cores (or possibly twice as many, but going higher will spend too much time on switching between processes).
pickling allows saving individual items, but it is better to create a list of items (records) and pickle the list as one object. Pickling one list of 1000 items will be faster than pickling 1000 small items one by one.
some tasks (splitting the file, calling the conversion task in parallel) can often be done faster by existing tools in the system. If you have this option, use it.
In my small test, I created a file with 100 thousand lines with content "98-BBBBBBBBBBBBBB", "99-BBBBBBBBBBB" etc. and tested converting it into a list of numbers [...., 98, 99, ...].
For splitting I used the Linux command split, asking it to create 4 parts while preserving line borders:
$ split -n l/4 long.txt
This created smaller files xaa, xab, xac, xad.
To convert each smaller file I used the following script, which converts the content into a file with the extension .pickled containing the pickled list.
# chunk2pickle.py
import pickle
import sys

def process_line(line):
    return int(line.split("-", 1)[0])

def main(fname, pick_fname):
    with open(pick_fname, "wb") as fo:
        with open(fname) as f:
            pickle.dump([process_line(line) for line in f], fo)

if __name__ == "__main__":
    fname = sys.argv[1]
    pick_fname = fname + ".pickled"
    main(fname, pick_fname)
To convert one chunk of lines into pickled list of records:
$ python chunk2pickle.py xaa
and it creates the file xaa.pickled.
But as we need to do this in parallel, I used the parallel tool (which has to be installed on the system):
$ parallel -j 4 python chunk2pickle.py {} ::: xaa xab xac xad
and I found new files with extension .pickled on the disk.
-j 4 asks to run 4 processes in parallel; adjust it to your system, or leave it out and it will default to the number of cores you have.
parallel can also get the list of parameters (input file names in our case) by other means, like the ls command:
$ ls x?? |parallel -j 4 python chunk2pickle.py {}
To integrate the results, use script integrate.py:
# integrate.py
import pickle

def main(file_names):
    res = []
    for fname in file_names:
        with open(fname, "rb") as f:
            res.extend(pickle.load(f))
    return res

if __name__ == "__main__":
    file_names = ["xaa.pickled", "xab.pickled", "xac.pickled", "xad.pickled"]
    # here you have the list of records you asked for
    records = main(file_names)
    print(records)
In my answer I have used a couple of external tools (split and parallel). You may do a similar task with Python too. My answer focuses only on giving you an option to keep the Python code for converting lines into the required data structures. A complete pure-Python answer is not covered here (it would get much longer and probably slower).
Solution using a process Pool (no explicit pickling needed)
The following solution uses multiprocessing from Python. In this case there is no need to pickle the results explicitly (I am not sure whether the library does it automatically, or whether it is not necessary and data are passed by other means).
# direct_integrate.py
from multiprocessing import Pool

def process_line(line):
    return int(line.split("-", 1)[0])

def process_chunkfile(fname):
    with open(fname) as f:
        return [process_line(line) for line in f]

def main(file_names, cores=4):
    p = Pool(cores)
    return p.map(process_chunkfile, file_names)

if __name__ == "__main__":
    file_names = ["xaa", "xab", "xac", "xad"]
    # here you have the list of records you asked for
    # warning: records are in groups.
    record_groups = main(file_names)
    for rec_group in record_groups:
        print(rec_group)
This updated solution still assumes that the large file is available in the form of four smaller files.

Python Memory error solutions if permanent access is required

First, I am aware of the number of Python memory error questions on SO, but so far none has matched my use case.
I am currently trying to parse a bunch of textfiles (~6k files with ~30 GB) and store each unique word. Yes, I am building a wordlist, no I am not planning on doing evil things with it, it is for the university.
I implemented the list of found words as a set (created with words = set([]), used with words.add(word)) and I am just adding every found word to it, considering that the set mechanics should remove all duplicates.
This means that I need permanent access to the whole set for this to work (Or at least I see no alternative, since the whole list has to be checked for duplicates on every insert).
Right now, I am running into a MemoryError about 25% of the way through, when it uses about 3.4 GB of my RAM. I am on 32-bit Linux, so I know where that limitation comes from, and my PC only has 4 GB of RAM, so even 64-bit would not help here.
I know that the complexity is probably terrible (probably O(n) on each insert, although I don't know how Python sets are implemented (trees?)), but it is still (probably) faster and (definitely) more memory-efficient than adding each word to a primitive list and removing duplicates afterwards.
Is there any way to get this to run? I expect about 6-10 GB of unique words, so using my current RAM is out of the question, and upgrading my RAM is currently not possible (and does not scale too well once I start letting this script loose on larger amounts of files).
My only idea at the moment is caching on disk (which will slow the process down even more), or writing temporary sets to disk and merging them afterwards, which would take even more time and whose complexity would be horrible indeed. Is there even a solution that will not result in horrible runtimes?
For the record, this is my full source. As it was written for personal use only, it is pretty horrible, but you get the idea.
import os
import sys

words = set([])
lastperc = 0
current = 1
argl = 0

print "Searching for .txt-Files..."
for _, _, f in os.walk("."):
    for file in f:
        if file.endswith(".txt"):
            argl = argl + 1

print "Found " + str(argl) + " Files. Beginning parsing process..."
print "0% 50% 100%"
for r, _, f in os.walk("."):
    for file in f:
        if file.endswith(".txt"):
            fobj = open(os.path.join(r, file), "r")
            for line in fobj:
                line = line.strip()
                word, sep, remains = line.partition(" ")
                if word != "":
                    words.add(word)
                    word, sep, remains = remains.partition(" ")
                    while sep != "":
                        words.add(word)
                        word, sep, remains2 = remains.partition(" ")
                        remains = remains2
                    if remains != "":
                        words.add(remains)
            newperc = int(float(current) / argl * 100)
            if newperc - lastperc > 0:
                for i in range(newperc - lastperc):
                    sys.stdout.write("=")
                    sys.stdout.flush()
                lastperc = newperc
            current = current + 1

print ""
print "Done. Set contains " + str(len(words)) + " different words. Sorting..."
sorteddic = sorted(words, key=str.lower)
print "Sorted. Writing to File"
print "0% 50% 100%"
lastperc = 0
current = 1
sdicl = len(sorteddic) - 1
fobj = open(sys.argv[1], "w")
for element in sorteddic:
    fobj.write(element + "\n")
    newperc = int(float(current) / sdicl * 100)
    if newperc - lastperc > 0:
        for i in range(newperc - lastperc):
            sys.stdout.write("=")
            sys.stdout.flush()
        lastperc = newperc
    current = current + 1
print ""
print "Done. Enjoy your wordlist."
Thanks for your help and Ideas.
You're probably going to need to store the keys on disk. A key-value store like Redis might fit the bill.
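For instance, a hedged sketch using the redis-py client (assuming a Redis server is running locally): Redis sets deduplicate automatically and live in a separate, memory-managed process rather than in the 32-bit Python process.

import redis

r = redis.Redis(host='localhost', port=6379)

def add_words_from_file(path):
    with open(path) as f:
        for line in f:
            words = line.split()
            if words:
                r.sadd('wordlist', *words)  # Redis SADD silently ignores duplicates

# r.scard('wordlist') gives the number of unique words collected so far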
Do you really mean 6-10 GB of unique words? Is this English text? Surely even counting proper nouns and names there shouldn't be more than a few million unique words.
Anyway, what I would do is process one file at a time, or even one section (say, 100k) of a file at a time, and build a unique wordlist just for that portion. Then just union all the sets as a post-processing step, as in the sketch below.
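A minimal sketch of that per-portion approach (my own illustration, with per-file rather than per-section sets):

import glob

def words_in_file(path):
    """Unique words in one file; only this file's lines pass through memory."""
    found = set()
    with open(path, errors='ignore') as f:
        for line in f:
            found.update(line.split())
    return found

all_words = set()
for path in glob.glob("./**/*.txt", recursive=True):
    all_words |= words_in_file(path)  # union the per-file sets as a post-processing step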
My inclination is towards a database table, but if you want to stay within a single framework, check out PyTables: http://www.pytables.org/moin
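If a database table appeals, a hedged sketch using the standard-library sqlite3 module would keep the unique-word set on disk; the table and column names here are made up.

import sqlite3

conn = sqlite3.connect('wordlist.db')
conn.execute('CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY)')

def add_words(words):
    # INSERT OR IGNORE lets the PRIMARY KEY constraint handle deduplication on disk
    conn.executemany('INSERT OR IGNORE INTO words VALUES (?)', ((w,) for w in words))
    conn.commit()

# after all files are processed:
# for (word,) in conn.execute('SELECT word FROM words ORDER BY word'):
#     print(word)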
The first thing I'd try would be to restrict words to lower-case characters – as Tyler Eaves pointed out, this will probably reduce the set size enough to fit into memory. Here's some very basic code to do this:
import os
import fnmatch
import re

def find_files(path, pattern):
    for root, directories, files in os.walk(path):
        for f in fnmatch.filter(files, pattern):
            yield os.path.join(root, f)

words = set()
for file_name in find_files(".", "*.txt"):
    with open(file_name) as f:
        data = f.read()
        words.update(re.findall(r"\w+", data.lower()))
A few more comments:
I would usually expect the dictionary to grow rapidly at the beginning; very few new words should be found late in the process, so your extrapolation might severely overestimate the final size of the word list.
Sets are very efficient for this purpose. They are implemented as hash tables, and adding a new word has an amortised complexity of O(1).
Hash your keys into a codespace that is smaller and more manageable. Key the hash to a file containing the keys with that hash. The table of hashes is much smaller and the individual key files are much smaller.
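A hedged sketch of that bucketing idea (my own interpretation of the answer): hash each word into one of, say, 1024 bucket files, then deduplicate each small bucket independently.

import hashlib
import os

NUM_BUCKETS = 1024
os.makedirs('buckets', exist_ok=True)

def bucket_path(word):
    h = int(hashlib.md5(word.encode('utf-8')).hexdigest(), 16) % NUM_BUCKETS
    return os.path.join('buckets', 'bucket_%04d.txt' % h)

def add_word(word):
    # append-only; duplicates are allowed here and removed per bucket later
    with open(bucket_path(word), 'a', encoding='utf-8') as f:
        f.write(word + '\n')

def unique_words_in_bucket(path):
    # each bucket is small enough to deduplicate with an in-memory set
    with open(path, encoding='utf-8') as f:
        return set(line.strip() for line in f)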
