I am using Python to parse a large file. What I want to do is:
if condition is true:
    append to list A
else:
    append to list B
I want to use generator expressions for this, to save memory. Here is the actual code:
def is_low_qual(read):
    lowqual_bp=(bq for bq in phred_quals(read) if bq < qual_threshold)
    if iter_length(lowqual_bp) > num_allowed:
        return True
    else:
        return False

lowqual=(read for read in SeqIO.parse(r_file,"fastq") if is_low_qual(read)==True)
highqual=(read for read in SeqIO.parse(r_file,"fastq") if is_low_qual(read)==False)
SeqIO.write(highqual,flt_out_handle,"fastq")
SeqIO.write(lowqual,junk_out_handle,"fastq")

def iter_length(the_gen):
    return sum(1 for i in the_gen)
You can use itertools.tee in conjunction with itertools.ifilter and itertools.ifilterfalse (on Python 3, the built-in filter and itertools.filterfalse):
import itertools

def is_condition_true(x):
    ...

gen1, gen2 = itertools.tee(sequences)
low = itertools.ifilter(is_condition_true, gen1)
high = itertools.ifilterfalse(is_condition_true, gen2)
Using tee ensures that the function works correctly even if sequences is itself a generator.
Note, though, that tee could itself use a fair bit of memory (up to a list of size len(sequences)) if low and high are consumed at different rates (e.g. if low is exhausted before high is used).
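On Python 3 the same idea can be sketched with the built-in filter and itertools.filterfalse, since ifilter and ifilterfalse no longer exist there. The helper name partition and the sample data are illustrative only, not part of the question's code:

```python
from itertools import filterfalse, tee


def partition(predicate, iterable):
    """Split one iterable into (matching, non_matching) lazy iterators."""
    first, second = tee(iterable)
    return filter(predicate, first), filterfalse(predicate, second)


# Works even when the input is a one-shot generator.
numbers = (n for n in range(10))
low, high = partition(lambda n: n < 4, numbers)
low_items = list(low)    # consuming low first makes tee buffer items for high
high_items = list(high)
```

Exhausting low before touching high is exactly the case in which tee has to buffer the whole input, so this form is mainly useful when both iterators are consumed at roughly the same pace.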
I think you're striving to avoid iterating over your collection twice. If so, this type of approach works:
high, low = [], []
_Nones = [high.append(x) if is_condition_true(x) else low.append(x) for x in sequences]
This is probably ill-advised because it uses a list comprehension purely for its side effects, which is generally considered unpythonic.
Just to add a more general answer: If your main concern is memory, you should use one generator that loops over the whole file, and handle each item as low or high as it comes. Something like:
for r in sequences:
    if condition_true(r):
        handle_low(r)
    else:
        handle_high(r)
If you need to collect all high/low elements before using either, then you can't guard against a potential memory hit. The reason is that you can't know which elements are high/low until you read them. If you have to process low first, and it turns out all the elements are actually high, you have no choice but to store them in a list as you go, which will use memory. Doing it with one loop allows you to handle each element one at a time, but you have to balance this against other concerns (i.e., how cumbersome it is to do it this way, which will depend on exactly what you're trying to do with the data).
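The single-loop approach might look like this when each item is written out immediately instead of being collected. This is only a sketch: split_stream, the two file-like sinks, and the sample predicate are hypothetical names, not part of the question's code.

```python
import io


def split_stream(items, predicate, true_sink, false_sink):
    """Route each item to one of two sinks in a single pass.

    Only one item is held at a time, so memory use does not grow
    with the size of the input.
    """
    counts = {True: 0, False: 0}
    for item in items:
        matched = bool(predicate(item))
        sink = true_sink if matched else false_sink
        sink.write(f"{item}\n")
        counts[matched] += 1
    return counts


# In-memory buffers stand in for real output files here.
low_out, high_out = io.StringIO(), io.StringIO()
counts = split_stream((n for n in range(6)), lambda n: n < 2, low_out, high_out)
```

Because both outputs are open at the same time, neither category ever has to wait in a list for the other to be processed.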
This is surprisingly difficult to do elegantly. Here's something that works:
from itertools import tee, ifilter, ifilterfalse
low, high = [f(condition, g) for f, g in zip((ifilter, ifilterfalse), tee(seq))]
Note that as you consume items from one resulting iterator (say low), the internal deque in tee will have to expand to contain any items that you have not yet consumed from high (including, unfortunately, those which ifilterfalse will reject). As such this might not save as much memory as you're hoping.
Here's an implementation that uses as little additional memory as possible:
from collections import deque

def filtertee(func, iterable, codomain=(False, True)):
    it = iter(iterable)
    deques = dict((r, deque()) for r in codomain)
    def gen(mydeque):
        while True:
            while not mydeque:          # as long as the local deque is empty
                try:
                    newval = next(it)   # fetch a new value,
                except StopIteration:
                    return              # source exhausted: end this generator
                result = func(newval)   # find its image under `func`,
                try:
                    d = deques[result]  # find the appropriate deque, and
                except KeyError:
                    raise ValueError("func returned value outside codomain")
                d.append(newval)        # add it.
            yield mydeque.popleft()
    return dict((r, gen(d)) for r, d in deques.items())
This returns a dict from the codomain of the function to a generator providing the items that take that value under func:
gen = filtertee(condition, seq)
low, high = gen[True], gen[False]
Note that it's your responsibility to ensure that condition only returns values in codomain.
I am new to Python and I was wondering if there was a way I could shorten/optimise the below loops:
for breakdown in data_breakdown:
    for data_source in data_source_ids:
        for camera in camera_ids:
            if (camera.get("id") == data_source.get("parent_id")) and (data_source.get("id") == breakdown.get('parent_id')):
                for res in result:
                    if res.get("camera_id") == camera.get("id"):
                        res.get('data').update({breakdown.get('name'): breakdown.get('total')})
I tried this oneliner, but it doesn't seem to work:
res.get('data').update({breakdown.get('name'): breakdown.get('total')}) for camera in camera_ids if (camera.get("id") == data_source.get("parent_id")) and (data_source.get("id") == breakdown.get('parent_id'))
You can use itertools.product to handle the nested loops for you, and I think (although I'm not sure because I can't see your data) you can skip all the .get and .update and just use the [] operator:
from itertools import product
for b, d, c in product(data_breakdown, data_source_ids, camera_ids):
    if c["id"] != d["parent_id"] or d["id"] != b["parent_id"]:
        continue
    for res in result:
        if res["camera_id"] == c["id"]:
            res['data'][b['name']] = b['total']
If anything, to optimize the performance of those loops, you should make them longer and more nested, with the data_source.get("id") == breakdown.get('parent_id') happening outside of the camera loop.
But there is perhaps an alternative, where you could change the structure of your data so that you don't need to loop nearly as much to find matching ID values. Convert each of your current lists (of dicts) into a single dict whose keys are the 'id' values you'll be trying to match in that loop, and whose values are the whole dicts:
sources_dict = {source.get("id"): source for source in data_source_ids}
cameras_dict = {camera.get("id"): camera for camera in camera_ids}
results_dict = {res.get("camera_id"): res for res in result}
Now the whole loop only needs one level:
for breakdown in data_breakdown:
    source = sources_dict[breakdown["parent_id"]]
    camera = cameras_dict[source["parent_id"]]
    res = results_dict[camera["id"]]
    res["data"][breakdown["name"]] = breakdown["total"]  # res is a dict, so index it
This code assumes that all the lookups with get in your current code were going to succeed in getting a value. You weren't actually checking if any of the values you were getting from a get call was None, so there probably wasn't much benefit to it.
I'd further note that it's not clear if the camera loop in your original code was at all necessary. You might have been able to skip it and just directly compare data_source['parent_id'] against res['camera_id'] without comparing them both to a camera['id'] in between. In my updated version, that would translate to leaving out the creation of the cameras_dict and just directly indexing results_dict with source["parent_id"] rather than indexing to find camera first.
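To make the dict-based lookup concrete, here is a small self-contained run on made-up data, using the variant that skips the camera lookup. The field values are invented for illustration; only the key names (id, parent_id, camera_id, data, name, total) come from the question, and sources_by_id and results_by_camera are hypothetical names.

```python
# Hypothetical sample data shaped like the lists in the question.
data_source_ids = [
    {"id": "src1", "parent_id": "cam1"},
    {"id": "src2", "parent_id": "cam2"},
]
data_breakdown = [
    {"parent_id": "src1", "name": "cars", "total": 3},
    {"parent_id": "src2", "name": "bikes", "total": 7},
]
result = [
    {"camera_id": "cam1", "data": {}},
    {"camera_id": "cam2", "data": {}},
]

# Build each index once: O(1) lookups replace the inner loops.
sources_by_id = {source["id"]: source for source in data_source_ids}
results_by_camera = {res["camera_id"]: res for res in result}

for breakdown in data_breakdown:
    source = sources_by_id[breakdown["parent_id"]]
    res = results_by_camera[source["parent_id"]]
    res["data"][breakdown["name"]] = breakdown["total"]
```

One design caveat: a dict keeps a single entry per key, so this assumes each camera_id appears in only one result. If several results can share a camera, the index would need to map each id to a list of results.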
I have a function which accepts two inputs provided by itertools.combinations, and outputs a solution. The two inputs should be stored as a tuple forming the key in the dict, while the result is the value.
I can pool this and get all of the results as a list, which I can then insert into a dictionary one-by-one, but this seems inefficient. Is there a way to get the results as each job finishes, and directly add it to the dict?
Essentially, I have the code below:
all_solutions = {}
for start, goal in itertools.combinations(graph, 2):
    all_solutions[(start, goal)] = search(graph, start, goal)
I am trying to parallelize it as follows:
all_solutions = {}
manager = multiprocessing.Manager()
graph_pool = manager.dict(graph)
pool = multiprocessing.Pool()
results = pool.starmap(search, zip(itertools.repeat(graph_pool),
                                   itertools.combinations(graph, 2)))
for i, start_goal in enumerate(itertools.combinations(graph, 2)):
    start, goal = start_goal[0], start_goal[1]
    all_solutions[(start, goal)] = results[i]
Which actually works, but iterates twice, once in the pool, and once to write to a dict (not to mention the clunky tuple unpacking).
This is possible, you just need to switch to using a lazy mapping function (not map or starmap, which have to finish computing all the results before you can begin using any of them):
import itertools
import multiprocessing
from functools import partial
from itertools import tee
manager = multiprocessing.Manager()
graph_pool = manager.dict(graph)
pool = multiprocessing.Pool()
# Since you're processing in order and in parallel, tee might help a little
# by only generating the dict keys/search arguments once. That said,
# combinations of n choose 2 are fairly cheap; the overhead of tee's caching
# might overwhelm the cost of just generating the combinations twice
startgoals1, startgoals2 = tee(itertools.combinations(graph, 2))
# Use partial binding of search with graph_pool to be able to use imap
# without a wrapper function; using imap lets us consume results as they become
# available, so the tee-ed generators don't store too many temporaries
results = pool.imap(partial(search, graph_pool), startgoals2)
# Efficiently create the dict from the start/goal pairs and the results of the search
# This line is eager, so it won't complete until all the results are generated, but
# it will be consuming the results as they become available in parallel with
# calculating the results
all_solutions = dict(zip(startgoals1, results))
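A runnable miniature of the imap-plus-zip pattern is sketched below. Several things here are assumptions made for illustration: the toy search function and graph are invented, the pairs are materialized in a list instead of being tee-ed, and multiprocessing.dummy.Pool (a thread-backed pool with the same API) is used so the snippet runs without an `if __name__ == "__main__"` guard. With a picklable module-level search function, multiprocessing.Pool should be a drop-in replacement.

```python
import itertools
from functools import partial
from multiprocessing.dummy import Pool  # thread-backed pool with the Pool API


def search(graph, start_goal):
    """Toy stand-in for the real search: reports whether an edge exists."""
    start, goal = start_goal
    return goal in graph[start]


# Hypothetical adjacency-list graph, for illustration only.
graph = {"a": ["b"], "b": ["c"], "c": []}

pairs = list(itertools.combinations(graph, 2))
with Pool(2) as pool:
    # imap yields results lazily and in input order, so zip pairs each
    # key with its own result while later jobs are still running.
    results = pool.imap(partial(search, graph), pairs)
    all_solutions = dict(zip(pairs, results))
```

The in-order guarantee of imap is what makes the zip safe; imap_unordered would need each job to return its own key alongside the result.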
I have code that generates a list of 28 dictionaries. It cycles thru 28 files and links data points from each file in the appropriate dictionary. In order to make my code more flexible I wanted to use:
tegDics = [dict() for x in range(len(files))]
But when I run the code the first 27 dictionaries are blank and only the last, tegDics[27], has data. Below is the code including the clumsy, yet functional, code I'm having to use that generates the dictionaries:
x=0
import os
files=os.listdir("DirPath")
os.chdir("DirPath")
tegDics = [{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{}] # THIS WORKS!!!
#tegDics = [dict() for x in range(len(files))] - THIS WON'T WORK!!!
allRads=[]
while x<len(tegDics): # now builds dictionaries
    for line in open(files[x]):
        z=line.split('\t')
        allRads.append(z[2])
        tegDics[x][z[2]]=z[4] # pairs catNo with locNo
    x+=1
Does anybody know why the more elegant code doesn't work?
Since you're using x within the list comprehension, it will no longer be zero by the time you reach the while loop - it will be len(files)-1 instead. (This is Python 2 behaviour, where a list comprehension's loop variable leaks into the enclosing scope; Python 3 keeps it local.) I suggest changing the variable you use to something else. It's traditional to use a single underscore for a value you don't care about.
tegDics = [dict() for _ in range(len(files))]
It could be useful to eliminate your use of x entirely. It's customary in python to iterate directly over the objects in a sequence, rather than using a counter variable. You might do something like:
for tegDic in tegDics:
    #do stuff with tegDic here
It's slightly trickier in your case, though, since you want to iterate through tegDics and files at the same time. You can use zip to do that.
import os
files=os.listdir("DirPath")
os.chdir("DirPath")
tegDics = [dict() for _ in range(len(files))]
allRads=[]
for file, tegDic in zip(files,tegDics):
    for line in open(file):
        z=line.split('\t')
        allRads.append(z[2])
        tegDic[z[2]]=z[4] # pairs catNo with locNo
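A shorter spelling looks tempting, but beware: multiplying a list repeats references to one and the same dict, so every entry would end up sharing the same data. It is not a substitute for the list comprehension:
tegDics = [{}]*len(files)  # not equivalent: len(files) references to a single dict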
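List multiplication deserves a quick check before it is used for this purpose, because `[{}] * n` repeats a reference to one dict rather than creating n dicts. A short sketch (the variable names are illustrative):

```python
n = 3

shared = [{}] * n                   # n references to ONE dict
separate = [{} for _ in range(n)]   # n independent dicts

shared[0]["key"] = "value"          # visible through every entry of shared
separate[0]["key"] = "value"        # affects only the first dict

all_same_object = all(d is shared[0] for d in shared)
```

After this runs, every element of shared shows the new key, while only the first element of separate does.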
I have multiple processes, each dealing with lists that have 40000 tuples. This nearly maxes out the memory available on the machine. If I do this:
while len(collection) > 0:
    row = collection.pop(0)
    row_count = row_count + 1
    new_row = []
    for value in row:
        if value is not None:
            in_chars = str(value)
        else:
            in_chars = ""
        #escape any naughty characters
        new_row.append("".join(["\\" + c if c in redshift_escape_chars else c for c in in_chars]))
    new_row = "\t".join(new_row)
    rows += "\n"+new_row
    if row_count % 5000 == 0:
        gc.collect()
does this free more memory ?
Since the collection is shrinking at the same rate that rows is growing, your memory usage will remain stable. The gc.collect() call is not going to make much difference.
Memory management in CPython is subtle. Just because you remove references and run a collection cycle doesn't necessarily mean that the memory will be returned to the OS. See this answer for details.
To really save memory, you should structure this code around generators and iterators instead of large lists of items. I'm very surprised you say you're having connection timeouts because fetching all the rows should not take much more time than fetching a row at a time and performing the simple processing you are doing. Perhaps we should have a look at your db-fetching code?
If row-at-a-time processing is really not a possibility, then at least keep your data as an immutable deque and perform all processing on it with generators and iterators.
I'll outline these different approaches.
First of all, some common functions:
# if you don't need random-access to elements in a sequence
# a deque uses less memory and has faster appends and deletes
# from both the front and the back.
from collections import deque
from itertools import izip, repeat, islice, chain
import re
re_redshift_chars = re.compile(r'[abcdefg]')
def istrjoin(sep, seq):
    """Return a generator that acts like sep.join(seq), but lazily
    The separator will be yielded separately
    """
    return islice(chain.from_iterable(izip(repeat(sep), seq)), 1, None)

def escape_redshift(s):
    return re_redshift_chars.sub(r'\\\g<0>', s)

def tabulate(row):
    return "\t".join(escape_redshift(str(v)) if v is not None else '' for v in row)
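These helpers target Python 2, where izip is the lazy zip. On Python 3 the built-in zip is already lazy, so a sketch of the same istrjoin helper would look like this (the sample inputs are illustrative):

```python
from itertools import chain, islice, repeat


def istrjoin(sep, seq):
    """Lazily yield the pieces of sep.join(seq), separators included."""
    # zip pairs every item with a leading separator; islice drops the
    # very first separator so the output starts with an item.
    return islice(chain.from_iterable(zip(repeat(sep), seq)), 1, None)


pieces = list(istrjoin("\n", (s for s in ["a", "b", "c"])))
joined = "".join(istrjoin(", ", ["x", "y"]))
empty = list(istrjoin(", ", []))
```

Because the pieces are yielded one by one, they can be handed to writelines() without ever building the full joined string.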
Now the ideal is row-at-a-time processing, like this:
cursor = db.cursor()
cursor.execute("""SELECT * FROM bigtable""")
# fetch one row at a time; fetchall() would load the whole result set at once
rowstrings = (tabulate(row) for row in iter(cursor.fetchone, None))
lines = istrjoin("\n", rowstrings)
file_like_obj.writelines(lines)
cursor.close()
This will take the least possible amount of memory--only a row at a time.
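As a concrete illustration of row-at-a-time processing, here is a sketch against an in-memory SQLite database. The table contents are invented, the escaping step is left out for brevity, and the sketch relies on sqlite3 cursors being iterable; other drivers may need fetchone() in a loop instead.

```python
import io
import sqlite3


def tabulate(row):
    """Tab-join one row, rendering NULL (None) as an empty field."""
    return "\t".join("" if v is None else str(v) for v in row)


# An in-memory table stands in for the real database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE bigtable (id INTEGER, name TEXT)")
db.executemany("INSERT INTO bigtable VALUES (?, ?)",
               [(1, "a"), (2, None), (3, "c")])

out = io.StringIO()
cursor = db.execute("SELECT * FROM bigtable ORDER BY id")
# Iterating the cursor fetches rows on demand; fetchall() would build
# the complete list of rows first.
for row in cursor:
    out.write(tabulate(row) + "\n")
cursor.close()
db.close()
```

Only the current row and the output buffer are alive at any moment; with a real file as the sink, memory use stays flat regardless of table size.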
If you really need to store the entire resultset, you can modify the code slightly:
cursor = db.cursor()
cursor.execute("SELECT * FROM bigtable")
collection = deque(cursor.fetchall())
cursor.close()
rowstrings = (tabulate(row) for row in collection)
lines = istrjoin("\n", rowstrings)
file_like_obj.writelines(lines)
Now we gather all results into collection first which remains entirely in memory for the entire program run.
However we can also duplicate your approach of deleting collection items as they are used. We can keep the same "code shape" by creating a generator that empties its source collection as it works. It would look something like this:
def drain(coll):
    """Return an iterable that deletes items from coll as it yields them.
    coll must support `coll.popleft()`, `coll.pop(0)` or `del coll[0]`.
    A deque is recommended!
    """
    if hasattr(coll, 'popleft'):
        pop = coll.popleft          # deque: O(1) pops from the left
    elif hasattr(coll, 'pop'):
        def pop():
            return coll.pop(0)      # list: works, but O(n) per pop
    else:
        def pop():
            item = coll[0]
            del coll[0]
            return item
    while True:
        try:
            item = pop()
        except IndexError:          # coll is empty: end the generator
            return
        yield item
Now you can easily substitute drain(collection) for collection when you want to free up memory as you go. After drain(collection) is exhausted, the collection object will be empty.
If your algorithm depends on popping from the left side or beginning of a list, you can use a deque object from collections as a faster alternative.
As a comparison:
import timeit
f1='''
q=deque()
for i in range(40000):
    q.append((i,i,'tuple {}'.format(i)))
while q:
    q.popleft()
'''
f2='''
l=[]
for i in range(40000):
    l.append((i,i,'tuple {}'.format(i)))
while l:
    l.pop(0)
'''
print('deque took {:.2f} seconds to popleft()'.format(timeit.timeit(stmt=f1, setup='from collections import deque',number=100)))
print('list took {:.2f} seconds to pop(0)'.format(timeit.timeit(stmt=f2,number=100)))
Prints:
deque took 3.46 seconds to popleft()
list took 37.37 seconds to pop(0)
So for this particular test of popping from the beginning of the list or queue, deque is more than 10x faster.
This large advantage is only for the left side however. If you run this same test with pop() on both the speed is roughly the same. You can also reverse the list in place and pop from the right side to get the same results as popleft from the deque.
In term of 'efficiency', it will be far more efficient to process single rows from the database. If that is not an option, process your list (or deque) 'collection' in place.
Try something along these lines.
First, break out the row processing:
def process_row(row):
    # I did not test this obviously, but I think I xlated your row processing faithfully
    new_row = []
    for value in row:
        if value is not None:   # keep 0 and other falsy non-None values
            in_chars = str(value)
        else:
            in_chars = ''
        # escape any characters listed in redshift_escape_chars
        new_row.append("".join(["\\" + c if c in redshift_escape_chars else c for c in in_chars]))
    return '\t'.join(new_row)
Now look at using a deque to allow fast pops from the left:
def cgen(collection):
    # if collection is a deque:
    while collection:
        yield '\n'+process_row(collection.popleft())
Or if you want to stick to a list:
def cgen(collection):
    collection.reverse()
    while collection:
        yield '\n'+process_row(collection.pop())
I think that your original approach of pop(0), process the row, and call gc every 5000 rows is probably suboptimal. The gc will be called automatically far more often than that anyway.
My final recommendation:
1. Use a deque. It is just like a list but faster for left-side pushes or pops;
2. Use popleft() so you do not need to reverse the list (if the order of collection is meaningful);
3. Process your collection in place as a generator;
4. Throw out the notion of calling gc since it is not doing anything for you.
5. Throw out 1-4 here if you can just call the db and get 1 row and process 1 row at a time!
I have a set of filenames coming from two different directories.
currList=set(['pathA/file1', 'pathA/file2', 'pathB/file3', etc.])
My code is processing the files, and needs to change currList
by comparing it to its content at the former iteration, say processList.
For that, I compute a symmetric difference:
toProcess=set(currList).symmetric_difference(set(processList))
Actually, I need the symmetric_difference to operate on the basename (file1...) not
on the complete filename (pathA/file1).
I guess I need to reimplement the __eq__ operator, but I have no clue how to do that in python.
is reimplementing __eq__ the right approach?
or
is there another better/equivalent approach?
Here is a token (and likely poorly constructed) itertools version that should run a little bit faster if speed ever becomes a concern (although I agree that @Zarkonnen's one-liner is pretty sweet, so +1 there :) ).
from itertools import ifilter
currList = set(['pathA/file1', 'pathA/file2', 'pathB/file3'])
processList=set(['pathA/file1', 'pathA/file9', 'pathA/file3'])
# This can also be a lambda inside the map functions - the speed stays the same
def FileName(f):
    return f.split('/')[-1]
# diff will be a set of filenames with no path that will be checked during
# the ifilter process
curr = map(FileName, list(currList))
process = map(FileName, list(processList))
diff = set(curr).symmetric_difference(set(process))
# This filters out any elements from the symmetric difference of the two sets
# where the filename is not in the diff set
results = set(ifilter(lambda x: x.split('/')[-1] in diff,
                      currList.symmetric_difference(processList)))
You can do this with the magic of generator expressions.
def basename(x):
    return x.split("/")[-1]
result = set(x for x in set(currList).union(set(processList)) if (basename(x) in [basename(y) for y in currList]) != (basename(x) in [basename(y) for y in processList]))
should do the trick. It gives you all the elements X that appear in one list or the other, and whose basename-presence in the two lists is not the same.
Edit:
Running this with:
currList=set(['pathA/file1', 'pathA/file2', 'pathB/file3'])
processList=set(['pathA/file1', 'pathA/file9', 'pathA/file3'])
returns:
set(['pathA/file2', 'pathA/file9'])
which would appear to be correct.
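For larger inputs, the same result can be obtained without rebuilding the basename lists for every element. The sketch below precomputes the two basename sets once. The function name basename_symmetric_difference is hypothetical, and posixpath.basename is used on the assumption that the paths are always '/'-separated, as in the examples.

```python
import posixpath


def basename_symmetric_difference(current, previous):
    """Return paths whose basename occurs in exactly one of the two sets."""
    current_names = {posixpath.basename(p) for p in current}
    previous_names = {posixpath.basename(p) for p in previous}
    only_in_one = current_names ^ previous_names
    return {p for p in set(current) | set(previous)
            if posixpath.basename(p) in only_in_one}


currList = {'pathA/file1', 'pathA/file2', 'pathB/file3'}
processList = {'pathA/file1', 'pathA/file9', 'pathA/file3'}
toProcess = basename_symmetric_difference(currList, processList)
```

With the sample sets, toProcess contains 'pathA/file2' and 'pathA/file9', matching the output of the one-liner; no custom __eq__ is needed because the comparison happens on the derived basenames.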