Huge memory consumption on list .split() - python

I have a simple function that reads a file and returns a list of words:
def _read_words(filename):
    with tf.gfile.GFile(filename, "r") as f:
        words = f.read().replace("\n", " %s " % EOS).split()
        print(" %d Mo" % (sys.getsizeof(words) / (1000 * 1000)))
        return words
Notes:
tf.gfile.GFile comes from TensorFlow; the same thing happens with open, so you can ignore it.
EOS is a string containing "<eos>"
When I run it with a 1.3GB file, the process reserves more than 20GB of RAM (see htop screenshot), but it prints 2353 Mo for sys.getsizeof(words).
Note that this process is nothing more than:
import reader
path = "./train.txt"
w = reader._read_words(path)
When I run it step by step I see the following:
d = file.read() => 4.039 GB RAM
d = d.replace('\n', ' <eos> ') => 5.4GB RAM
d = d.split() => 22GB RAM
So here I am:
I can understand that split uses extra memory, but that much looks ridiculous.
Could I use better operations/data structures to do it? A solution using numpy would be great.
I can find a workaround (see below) specific to my case, but I still don't understand why I can't do it the straightforward way.
Any clues or suggestions are welcome,
Thx,
Paul
Workaround:
As explained in the comments, I need to:
Build a vocabulary, i.e. a dict {"word": count}.
Then I select only the top n words by occurrence (n being a parameter).
Each of these words is assigned an integer (1 to n; 0 is for the end-of-sentence tag <eos>).
I load the whole data, splitting by sentence (\n or our internal tag <eos>).
We sort it by sentence length.
What I can do better (see the sketch below for the first two steps):
Count words on the fly (line by line).
Build the vocabulary using this word count (we don't need the whole data then).
Load the whole data as integers, using the vocabulary, and therefore saving some memory.
Sort the list: same length, less memory consumption.
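A minimal sketch of the first two improvements (counting on the fly, then building the vocabulary from the counts), assuming a plain open and the same <eos> convention; this is an illustration, not the question's actual code:
import collections

def build_vocab(filename, top_n, eos="<eos>"):
    # Count words line by line so the whole file never sits in memory at once.
    counts = collections.Counter()
    with open(filename, "r") as f:
        for line in f:
            counts.update(line.split())
    # Reserve id 0 for the end-of-sentence tag, then ids 1..top_n by frequency.
    vocab = {eos: 0}
    for word, _ in counts.most_common(top_n):
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab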

Related

Speed up Spacy processing

I want to pre-process text data using spacy (or something else).
My code below works but is really slow. I only have a 20 MB zipped text file as a demo, and it takes more than 10 minutes to process with my code.
The problem is that I'll eventually need to process about 20 GB of zipped text files, so I want to speed up my algorithm first.
Also, how will I be able to deal with a 20 GB zipped text file? It will blow past my 16 GB of main memory if I run the code below. Can I read it line by line and still get a good speed?
Any help would be appreciated.
import zipfile
import string
import spacy

nlp = spacy.load("en_core_web_sm", n_process=4)
with zipfile.ZipFile(filename, 'r') as thezip:
    text = thezip.open(thezip.filelist[0], mode='r').read()
    text = text.decode('utf-8').splitlines()
    for doc in nlp.pipe(text, disable=["tok2vec", "parser", "attribute_ruler"], batch_size=2000):
        # Do something with the doc here
        # First remove punctuation
        tokens = [t for t in doc if t.text not in string.punctuation]
        # then remove stop words, weird unicode characters, words with digits in them
        # and empty characters.
        tokens = [t for t in tokens if not t.is_stop and t.is_ascii and not t.is_digit and len(t) > 1 and not any(char.isdigit() for char in t.text)]
        # remove empty lines, make it lower case and put them in sentence form
        if len(tokens):
            sentence = " ".join(token.text.lower() for token in tokens)
            # do something useful with sentence here
It looks like you just want to use the spaCy tokenizer? In that case use nlp = spacy.blank("en") instead of spacy.load, and then you can leave out the disable part in nlp.pipe.
Also to be clear, you're using spaCy v2?
Here's a function that makes your code faster and also cleaner:
def is_ok(tok):
    # this is much faster than `not in string.punctuation`
    if tok.is_punct: return False
    if tok.is_stop: return False
    if not tok.is_ascii: return False
    if tok.is_digit: return False
    if len(tok.text) < 2: return False
    # this gets rid of anything with a number in it
    if 'd' in tok.shape_: return False
    return True

# replace your stuff with this:
toks = [tok for tok in doc if is_ok(tok)]
Reading your zip file one line at a time should be totally fine since you're just using the tokenizer.
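To make the line-at-a-time reading concrete, here is a minimal sketch (my illustration, not the answer's code) that streams the first archive member through a tokenizer-only pipeline; "demo.zip" is a placeholder path and is_ok is the helper defined above:
import io
import zipfile
import spacy

nlp = spacy.blank("en")            # tokenizer-only pipeline, as suggested above

def lines_from_zip(zip_path):
    # Stream the first member of the archive line by line instead of
    # decoding the whole file into memory at once.
    with zipfile.ZipFile(zip_path, 'r') as thezip:
        with thezip.open(thezip.filelist[0], mode='r') as raw:
            for line in io.TextIOWrapper(raw, encoding='utf-8'):
                yield line.rstrip('\n')

for doc in nlp.pipe(lines_from_zip("demo.zip"), batch_size=2000):
    toks = [tok for tok in doc if is_ok(tok)]
    # do something useful with toks here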

Make list lookup faster?

I have a file in which every line contains the name of a person, and files containing the text of speeches. The names file is very big (250k lines) and ordered alphabetically; each speech file has around 1k lines. What I want to do is look up the names in each speech file and wrap every occurring name from my names file in a replacement.
This is my code. EDIT: The with block that opens the names list is executed only once.
members_list = []
with open(path, 'r') as l:
    for line in l.readlines():
        members_list.append(line.strip('\n'))

for member in self.members_list:
    if member in self.body:
        self.body = self.body.replace(member, '<member>' + member + '</member>')
This code takes about 2.2 seconds to run, but because I have many speech files (4.5k) the total time is around 3 hours.
Is it possible to make this faster? Are generators the way to go?
Currently, you re-scan each speech in memory once for each of the 250,000 names when you check if member in self.body.
You need to parse the speech body once, finding whole words, spaces, and punctuation. Then you need to see if you have found a name, using a constant-time set lookup of the known member names (or at worst a log-time lookup).
The problem is that member names can span a varying number of words. So here is a quick (and not very good) implementation I wrote up that handles checking the last three words.
# This is where you load members from a file.
# set gives us constant time lookup
members = set()
for line in ['First Person', 'Pele', 'Some Famous Writer']:
    members.add(line)

# sample text
text = 'When Some Famous Writer was talking to First Person about Pele blah blah blah blah'

from collections import deque

# pretend we are actually parsing, but I'm just splitting. So lazy.
# This is why I'm not handling punctuation and spaces well, but not relevant to the current topic
wordlist = text.split()

# buffer the last three words
buffer = deque()

# TODO: loop while not done, but this sort of works to show the idea
for word in wordlist:
    name = None
    if len(buffer) and buffer[0] in members:
        name = buffer.popleft()
    if not name and len(buffer) > 1:
        two_word_name = buffer[0] + ' ' + buffer[1]
        if two_word_name in members:
            name = two_word_name
            buffer.popleft()
            buffer.popleft()
    if not name and len(buffer) > 2:
        three_word_name = buffer[0] + ' ' + buffer[1] + ' ' + buffer[2]
        if three_word_name in members:
            name = three_word_name
            buffer.popleft()
            buffer.popleft()
            buffer.popleft()
    if name:
        print('<member>', name, '</member> ')
    if len(buffer) > 2:
        print(buffer.popleft() + ' ')
    buffer.append(word)

# TODO handle the remaining words which are still in the buffer
print(buffer)
I am just trying to demonstrate the concept. This doesn't handle spaces or punctuation. This doesn't handle the end at all -- it needs to loop while not done. It creates a bunch of temporary strings as it parses. But it illustrates the basic concept of parsing once, and even though it is horribly slow at parsing through the speech text, it might beat searching the speech text 250,000 times.
The reason you want to parse the text and check for name in set is that you do this once. A set has constant-time (hash) lookup on average, so it is much faster to check if name in members.
If I get the chance, I might edit it later to be a class that generates tokens, and fix finding names at the end, but I didn't intend this to be your final code.
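As an aside (not part of the answer above), another common technique is to compile all member names into a single regular-expression alternation and let re make one pass over each speech. Here is a minimal, self-contained sketch with toy data standing in for the names file and the speech text; with 250k names the compile step takes a while, and whether it beats the buffered-set approach depends on the data:
import re

members_list = ['First Person', 'Pele', 'Some Famous Writer']   # from the names file
body = 'When Some Famous Writer was talking to First Person about Pele'

# Longest names first, so a longer name wins when one is a prefix of another.
members_list.sort(key=len, reverse=True)
pattern = re.compile('|'.join(re.escape(name) for name in members_list))

# One pass over the speech, wrapping every match in <member> tags.
body = pattern.sub(lambda m: '<member>' + m.group(0) + '</member>', body)
print(body)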

How can I effectively pull out human readable strings/terms from code automatically?

I'm trying to determine the most common words, or "terms" (I think) as I iterate over many different files.
Example - For this line of code found in a file:
for w in sorted(strings, key=strings.get, reverse=True):
I'd want these unique strings/terms returned to my dictionary as keys:
for
w
in
sorted
strings
key
strings
get
reverse
True
However, I want this code to be tunable so that I can return strings with periods or other characters between them as well, because I just don't know what makes sense yet until I run the script and count up the "terms" a few times:
strings.get
How can I approach this problem? It would help to understand how I can do this one line at a time so I can loop it as I read my file's lines in. I've got the basic logic down but I'm currently just doing the tallying by unique line instead of "term":
strings = dict()
fname = '/tmp/bigfile.txt'
with open(fname, "r") as f:
    for line in f:
        if line in strings:
            strings[line] += 1
        else:
            strings[line] = 1

for w in sorted(strings, key=strings.get, reverse=True):
    print str(w).rstrip() + " : " + str(strings[w])
(Yes I used code from my little snippet here as the example at the top.)
If the only python token you want to keep together is the object.attr construct, then all the tokens you are interested in would fit the regular expression
\w+\.?\w*
which basically means "one or more word characters (including _), optionally followed by a . and then some more word characters".
Note that this would also match number literals like 42 or 7.6, but those are easy enough to filter out afterwards.
Then you can use collections.Counter to do the actual counting for you:
import collections
import re

pattern = re.compile(r"\w+\.?\w*")

# here I'm using the source file for `collections` as the test example
with open(collections.__file__, "r") as f:
    tokens = collections.Counter(t.group() for t in pattern.finditer(f.read()))

for token, count in tokens.most_common(5):  # show only the top 5
    print(token, count)
Running python version 3.6.0a1 the output is this:
self 226
def 173
return 170
self.data 129
if 102
Which makes sense for the collections module, since it is full of classes that use self and define methods. It also shows that it captures self.data, which fits the construct you are interested in.
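If you prefer to process one line at a time (as mentioned in the question) and to drop the number literals afterwards, a minimal sketch along the same lines might look like this (my illustration, reusing the /tmp/bigfile.txt path from the question):
import collections
import re

pattern = re.compile(r"\w+\.?\w*")
strings = collections.Counter()

fname = '/tmp/bigfile.txt'                      # path from the question
with open(fname, "r") as f:
    for line in f:                              # one line at a time
        strings.update(t.group() for t in pattern.finditer(line))

# drop plain number literals such as 42 or 7.6 afterwards
strings = collections.Counter({t: c for t, c in strings.items()
                               if not t[0].isdigit()})

for token, count in strings.most_common(10):
    print(token, count)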

Rosalind Profile and Consensus: Writing long strings to one line in Python (Formatting)

I'm trying to tackle a problem on Rosalind where, given a FASTA file of at most 10 sequences of at most 1 kb each, I need to output the consensus sequence and profile (how many of each base the sequences have at each position). My code works for small sequences (verified).
However, I have issues formatting my output when it comes to large sequences.
What I expect to return, regardless of length, is:
"consensus sequence"
"A: one line string of numbers without commas"
"C: one line string """" "
"G: one line string """" "
"T: one line string """" "
All aligned with each other, each on its own line, or at least formatted in a way that I can carry forward as a unit to keep the alignment intact.
But when I run my code for a large sequence, I get each string below the consensus sequence broken up by newlines, presumably because the string itself is too long. I've been struggling to think of ways to circumvent the issue, but my searches have been fruitless. I'm thinking about some iterative writing algorithm that can write the entirety of the above expectation in chunks. Any help would be greatly appreciated. I have attached the entirety of my code below for the sake of completeness, with block comments as needed; the main section in question is the writing block at the end.
def cons(file):
    #returns consensus sequence and profile of a FASTA file
    import os
    path = os.path.abspath(os.path.expanduser(file))
    with open(path, "r") as D:
        F = D.readlines()
    #initialize list of sequences, list of all strings, and a temporary storage
    #list, respectively
    SEQS = []
    mystrings = []
    temp_seq = []
    #get a list of strings from the file, stripping the newline character
    for x in F:
        mystrings.append(x.strip("\n"))
    #if the string in question is a nucleotide sequence (without ">")
    #i'll store that string into a temporary variable until I run into a string
    #with a ">", in which case I'll join all the strings in my temporary
    #sequence list and append to my list of sequences SEQS
    for i in range(1, len(mystrings)):
        if ">" not in mystrings[i]:
            temp_seq.append(mystrings[i])
        else:
            SEQS.append(("").join(temp_seq))
            temp_seq = []
    SEQS.append(("").join(temp_seq))
    #set up list of nucleotide counts for A,C,G and T, in that order
    ACGT = [[0 for i in range(0, len(SEQS[0]))],
            [0 for i in range(0, len(SEQS[0]))],
            [0 for i in range(0, len(SEQS[0]))],
            [0 for i in range(0, len(SEQS[0]))]]
    #assumed to be equal length sequences. Counting amount of shared nucleotides
    #in each column
    for i in range(0, len(SEQS[0]) - 1):
        for j in range(0, len(SEQS)):
            if SEQS[j][i] == "A":
                ACGT[0][i] += 1
            elif SEQS[j][i] == "C":
                ACGT[1][i] += 1
            elif SEQS[j][i] == "G":
                ACGT[2][i] += 1
            elif SEQS[j][i] == "T":
                ACGT[3][i] += 1
    ancstr = ""
    TR_ACGT = list(zip(*ACGT))
    acgt = ["A: ", "C: ", "G: ", "T: "]
    for i in range(0, len(TR_ACGT) - 1):
        comp = TR_ACGT[i]
        if comp.index(max(comp)) == 0:
            ancstr += ("A")
        elif comp.index(max(comp)) == 1:
            ancstr += ("C")
        elif comp.index(max(comp)) == 2:
            ancstr += ("G")
        elif comp.index(max(comp)) == 3:
            ancstr += ("T")
    '''
    writing to file... trying to get it to write as
    consensus sequence
    A: blah(1line)
    C: blah(1line)
    G: blah(1line)
    T: blah(1line)
    which works for small sequences. but for larger sequences
    python keeps adding newlines if the string in question is very long...
    '''
    myfile = "myconsensus.txt"
    writing_strings = [acgt[i] + ' '.join(str(n) for n in ACGT[i] for i in range(0, len(ACGT))) for i in range(0, len(acgt))]
    with open(myfile, 'w') as D:
        D.writelines(ancstr)
        D.writelines("\n")
        for i in range(0, len(writing_strings)):
            D.writelines(writing_strings[i])
            D.writelines("\n")

cons("rosalind_cons.txt")
Your code is totally fine except for this line:
writing_strings = [acgt[i] + ' '.join(str(n) for n in ACGT[i] for i in range(0, len(ACGT))) for i in range(0, len(acgt))]
The nested for i inside the join accidentally replicates your data. Try replacing it with:
writing_strings = [acgt[i] + str(ACGT[i])[1:-1] for i in range(0, len(acgt))]
and then the rest of your writing loop (D.writelines(writing_strings[i])) works unchanged.
The [1:-1] slice is a lazy way to get rid of the brackets around the stringified list; if you need the counts without commas, use acgt[i] + ' '.join(str(n) for n in ACGT[i]) instead.
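For completeness, here is a minimal sketch of the writing step that produces space-separated, comma-free profile lines, reusing ancstr, acgt and ACGT from the question's code (my illustration, not the answer's):
# Write the consensus line, then one space-separated profile line per base.
# ancstr, acgt and ACGT are the variables built inside the question's cons().
with open("myconsensus.txt", "w") as out:
    out.write(ancstr + "\n")
    for label, counts in zip(acgt, ACGT):
        out.write(label + " ".join(str(n) for n in counts) + "\n")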

Improve efficiency in Python matching

We have the following input, and we would like to keep the rows whose "APP_ID" column (the 4th column) is the same and whose "Category" column (the 18th column) contains one "Cell" and one "Biochemical", or one "Cell" and one "Enzyme".
A , APPID , C , APP_ID , D , E , F , G , H , I , J , K , L , M , O , P , Q , Category , S , T
,,, APP-1 ,,,,,,,,,,,,,, Cell ,,
,,, APP-1 ,,,,,,,,,,,,,, Enzyme ,,
,,, APP-2 ,,,,,,,,,,,,,, Cell ,,
,,, APP-3 ,,,,,,,,,,,,,, Cell ,,
,,, APP-3 ,,,,,,,,,,,,,, Biochemical ,,
The ideal output will be
A , APPID , C , APP_ID , D , E , F , G , H , I , J , K , L , M , O , P , Q , Category , S , T
,,, APP-1 ,,,,,,,,,,,,,, Enzyme ,,
,,, APP-3 ,,,,,,,,,,,,,, Biochemical ,,
,,, APP-1 ,,,,,,,,,,,,,, Cell ,,
,,, APP-3 ,,,,,,,,,,,,,, Cell ,,
"APP-1" is kept because their column 3 are the same and their Category are one "Cell" and the other one is "Enzyme". The same thing for "APP-3", which has one "Cell" and the other one is "Biochemical" in its "Category" column.
The following attempt could do the trick:
import os

App = ["1"]
for a in App:
    outname = "App_" + a + "_target_overlap.csv"
    out = open(outname, 'w')
    ticker = 0
    cell_comp_id = []
    final_comp_id = []
    # make compound with cell activity (to a target) list first
    filename = "App_" + a + "_target_Detail_average.csv"
    if os.path.exists(filename):
        file = open(filename)
        line = file.readlines()
        if ticker == 0:  # Deal with the title
            out.write(line[0])
            ticker = ticker + 1
        for c in line[1:]:
            c = c.split(',')
            if c[17] == " Cell ":
                cell_comp_id.append(c[3])
            else:
                cell_comp_id = list(set(cell_comp_id))
    # while we have the list of compounds with cell activity, now we search the Bio and Enz and make one final compound list
    if os.path.exists(filename):
        for c in line[1:]:
            temporary_line = c  # for output_temp
            c = c.split(',')
            for comp in cell_comp_id:
                if c[3] == comp and c[17] == " Biochemical ":
                    final_comp_id.append(comp)
                    out.write(str(temporary_line))
                elif c[3] == comp and c[17] == " Enzyme ":
                    final_comp_id.append(comp)
                    out.write(str(temporary_line))
                else:
                    final_comp_id = list(set(final_comp_id))
    # After we obtain a final compound list in target a, we go through all the csv again to output the cell data
    filename = "App_" + a + "_target_Detail_average.csv"
    if os.path.exists(filename):
        for c in line[1:]:
            temporary_line = c  # for output_temp
            c = c.split(',')
            for final in final_comp_id:
                if c[3] == final and c[17] == " Cell ":
                    out.write(str(temporary_line))
    out.close()
When the input file is small (tens of thousands of lines), this script finishes in reasonable time. However, as the input files grow to millions or billions of lines, the script takes forever to finish (days...). I think the issue is that we first create a list of APPIDs that have "Cell" in the 18th column, and then go back and compare this "Cell" list (maybe half a million entries) against the whole file (say a million lines): whenever an APPID from the Cell list matches a row whose 18th column is "Enzyme" or "Biochemical", we keep that row. This step seems to be very time consuming.
I am thinking that maybe preparing "Cell", "Enzyme" and "Biochemical" dictionaries and comparing them would be faster? Does anyone have a better way to process this? Any example/comment would be helpful. Thanks.
We use python 2.7.6.
reading the file(s) efficiently
One big problem is that you're reading the file all in one go using readlines. This will require loading it ALL into memory at one go. I doubt if you have that much memory available.
Try:
with open(filename) as fh:
    out.write(fh.readline())  # ticker
    for line in fh:  # iterate through lines 'lazily', reading as you go.
        c = line.split(',')
style of code to start with. This should help a lot. Here it is in context:
# make compound with cell activity (to a target) list first
if os.path.exists(filename):
    with open(filename) as fh:
        out.write(fh.readline())  # ticker
        for line in fh:
            cols = line.split(',')
            if cols[17] == " Cell ":
                cell_comp_id.append(cols[3])
the with open(...) as syntax is a very common python idiom which automatically handles closing the file when you finish the with block, or if there is an error. Very useful.
sets
Next thing is, as you suggest, using sets a little better.
You don't need to recreate the set each time; you can just update it to add items. Here's some example set code (written in the python interpreter style: >>> at the beginning means it's a line of stuff to type - don't actually type the >>> bit!):
>>> my_set = set()
>>> my_set
set()
>>> my_set.update([1,2,3])
>>> my_set
set([1,2,3])
>>> my_set.update(["this","is","stuff"])
>>> my_set
set([1,2,3,"this","is","stuff"])
>>> my_set.add('apricot')
>>> my_set
set([1,2,3,"this","is","stuff","apricot"])
>>> my_set.remove("is")
>>> my_set
set([1,2,3,"this","stuff","apricot"])
So you can add items to, and remove them from, a set without creating a new set from scratch (which you are doing each time with the cell_comp_id=list(set(cell_comp_id)) bit).
You can also get differences, intersections, etc:
>>> set(['a','b','c','d']) & set(['c','d','e','f'])
set(['c','d'])
>>> set([1,2,3]) | set([3,4,5])
set([1,2,3,4,5])
See the docs for more info.
So let's try something like:
cells = set()
enzymes = set()
biochemicals = set()

with open(filename) as fh:
    out.write(fh.readline())  # ticker
    for line in fh:
        cols = line.split(',')
        row_id = cols[3]
        row_category = cols[17]
        if row_category == ' Cell ':
            cells.add(row_id)
        elif row_category == ' Biochemical ':
            biochemicals.add(row_id)
        elif row_category == ' Enzyme ':
            enzymes.add(row_id)
Now you have sets of cells, biochemicals and enzymes. You only want the intersections of these, so:
cells_and_enzymes = cells & enzymes
cells_and_biochemicals = cells & biochemicals
You can then go through all the files again and simply check if row_id (or c[3]) is in either of those lists, and if so, print it.
You can actually combine those two lists even further:
cells_with_enz_or_bio = cells_and_enzymes | cells_and_biochemicals
which would be the cells which have enzymes or biochemicals.
So then when you run through the files the second time, you can do:
if row_id in cells_with_enz_or_bio:
    out.write(line)
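Putting that second pass together, a minimal sketch (reusing filename, out and cells_with_enz_or_bio from the snippets above) might look like:
# Second pass: re-read the file and keep the rows whose APP_ID was seen
# with a Cell entry plus an Enzyme or Biochemical entry in the first pass.
with open(filename) as fh:
    fh.readline()                       # header was already written in pass one
    for line in fh:
        row_id = line.split(',')[3]
        if row_id in cells_with_enz_or_bio:
            out.write(line)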
after all that?
Just using those suggestions might be enough to get you by. You still are storing in memory the entire sets of cells, biochemicals and enzymes, though. And you're still running through the files twice.
So there are two ways we could potentially speed it up, while still staying with a single python process. I don't know how much memory you have available. If you run out of memory, then it might possibly slow things down slightly.
reducing sets as we go.
If you do have a million records, and 800000 of them are pairs (have a cell record and a biochemical record) then by the time you get to the end of the list, you're storing 800000 IDs in sets. To reduce memory usage, once we've established that we do want to output a record, we could save that information (that we want to print the record) to a file on disk, and stop storing it in memory. Then we could read that list back later to figure out which records to print.
Since this does increase disk IO, it could be slower. But if you are running out of memory, it could reduce swapping, and thus end up faster. It's hard to tell.
with open('to_output.tmp', 'a') as to_output:
    for a in App:
        # ... do your reading thing into the sets ...
        if row_id in cells and (row_id in biochemicals or row_id in enzymes):
            to_output.write('%s,' % row_id)
            cells.discard(row_id)          # discard() doesn't raise if the id
            biochemicals.discard(row_id)   # is missing from one of the sets
            enzymes.discard(row_id)
once you've read through all the files, you now have a file (to_output.tmp) which contains all the ids that you want to keep. So you can read that back into python:
with open('to_output.tmp') as ids_file:
    ids_to_keep = set(ids_file.read().split(','))
which means you can then on your second run through the files simply say:
if row_id in ids_to_keep:
    out.write(line)
using dict instead of sets:
If you have plenty of memory, you could bypass all of that and use dicts for storing the data, which would let you run through the files only once, rather than using sets at all.
cells = {}
enzymes = {}
biochemicals = {}

with open(filename) as fh:
    out.write(fh.readline())  # ticker
    for line in fh:
        cols = line.split(',')
        row_id = cols[3]
        row_category = cols[17]
        if row_category == ' Cell ':
            cells[row_id] = line
        elif row_category == ' Biochemical ':
            biochemicals[row_id] = line
        elif row_category == ' Enzyme ':
            enzymes[row_id] = line
        if row_id in cells and row_id in biochemicals:
            out.write(cells[row_id])
            out.write(biochemicals[row_id])
            if row_id in enzymes:
                out.write(enzymes[row_id])
        elif row_id in cells and row_id in enzymes:
            out.write(cells[row_id])
            out.write(enzymes[row_id])
The problem with this method is that if any rows are duplicated, it will get confused.
If you are sure that the input records are unique, and that they either have enzyme or biochemical records, but not both, then you could easily add del cells[row_id] and the appropriate others to remove rows from the dicts once you've printed them, which would reduce memory usage.
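For example, under that assumption, the matching block above could become something like:
if row_id in cells and row_id in biochemicals:
    out.write(cells[row_id])
    out.write(biochemicals[row_id])
    del cells[row_id]          # free the rows we have already written
    del biochemicals[row_id]
elif row_id in cells and row_id in enzymes:
    out.write(cells[row_id])
    out.write(enzymes[row_id])
    del cells[row_id]
    del enzymes[row_id]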
I hope this helps :-)
A technique I have used to deal with massive files quickly in Python is to use the multiprocessing library to split the file into large chunks, and process those chunks in parallel in worker subprocesses.
Here's the general algorithm:
Based on the amount of memory you have available on the system that will run this script, decide how much of the file you can afford to read into memory at once. The goal is to make the chunks as large as you can without causing thrashing.
Pass the file name and chunk beginning/end positions to subprocesses, which will each open the file, read in and process their sections of the file, and return their results.
Specifically, I like to use a multiprocessing pool, then create a list of chunk start/stop positions, then use the pool.map() function. This will block until everyone has completed, and the results from each subprocess will be available if you catch the return value from the map call.
For example, you could do something like this in your subprocesses:
import operator

# assume we have passed in a byte position to start and end at, and a file name:
with open("fi_name", 'r') as fi:
    fi.seek(chunk_start)
    chunk = fi.readlines(chunk_end - chunk_start)

retriever = operator.itemgetter(3, 17)  # extracts only the elements we want
APPIDs = {}

for line in chunk:
    ID, category = retriever(line.split(','))
    try:
        APPIDs[ID].append(category)  # we've seen this ID before, add category to its list
    except KeyError:
        APPIDs[ID] = [category]  # we haven't seen this ID before - make an entry

# APPIDs entries will look like this:
#
# <APPID> : [list of categories]

return APPIDs
In your main process, you would retrieve all the returned dictionaries and resolve duplicates or overlaps, then output something like this:
for ID, categories in APPIDs.iteritems():
    if ('Cell' in categories) and ('Biochemical' in categories or 'Enzyme' in categories):
        # print or whatever
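Putting the chunking and the pool together, here is a minimal end-to-end sketch (my illustration, not the answer's exact code); the worker, the chunk helper and the file name are assumptions modelled on the question and the answer above:
import multiprocessing
import operator
import os

def process_chunk(args):
    # Hypothetical worker: read one byte range and build {APPID: [categories]}.
    path, start, end = args
    retriever = operator.itemgetter(3, 17)   # APP_ID and Category columns
    appids = {}
    with open(path, 'rb') as fi:
        fi.seek(start)
        pos = start
        while pos < end:
            raw = fi.readline()
            if not raw:
                break
            pos += len(raw)
            fields = raw.decode('utf-8').split(',')
            if len(fields) > 17:    # the header row becomes one harmless extra key
                app_id, category = retriever(fields)
                appids.setdefault(app_id.strip(), []).append(category.strip())
    return appids

def make_chunks(path, chunk_size=64 * 1024 * 1024):
    # Yield (path, start, end) byte ranges that each end on a line boundary.
    size = os.path.getsize(path)
    with open(path, 'rb') as f:
        start = 0
        while start < size:
            f.seek(min(start + chunk_size, size))
            f.readline()                     # advance to the end of the current line
            end = min(f.tell(), size)
            yield (path, start, end)
            start = end

if __name__ == '__main__':
    path = "App_1_target_Detail_average.csv"   # file name used in the question
    pool = multiprocessing.Pool()              # defaults to one worker per CPU
    results = pool.map(process_chunk, list(make_chunks(path)))
    pool.close()
    pool.join()
    # Merge the per-chunk dicts, then apply the Cell + (Biochemical or Enzyme)
    # check described above to decide which rows to keep.
    merged = {}
    for part in results:
        for app_id, cats in part.items():
            merged.setdefault(app_id, []).extend(cats)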
A couple of notes/caveats:
Pay attention to the load on your hard disk/SSD/wherever your data is located. If your current method is already maxing out its throughput, you probably won't see any performance improvements from this. You can try implementing the same algorithm with threading instead.
If you do get a heavy hard disk load that's NOT due to memory thrashing, you can also reduce the number of simultaneous subprocesses you're allowing in the pool. This will result in fewer read requests to the drive, while still taking advantage of truly parallel processing.
Look for patterns in your input data you can exploit. For example, if you can rely on matching APPIDs being next to each other, you can actually do all of your comparisons in the subprocesses and let your main process hang out until it's time to combine the subprocess data structures.
TL;DR
Break your file up into chunks and process them in parallel with the multiprocessing library.
