Python efficiency and large objects in memory

I have multiple processes, each dealing with lists of 40,000 tuples. This nearly maxes out the memory available on the machine. If I do this:
while len(collection) > 0:
    row = collection.pop(0)
    row_count = row_count + 1
    new_row = []
    for value in row:
        if value is not None:
            in_chars = str(value)
        else:
            in_chars = ""
        # escape any naughty characters
        new_row.append("".join(["\\" + c if c in redshift_escape_chars else c for c in in_chars]))
    new_row = "\t".join(new_row)
    rows += "\n" + new_row
    if row_count % 5000 == 0:
        gc.collect()
does this free more memory?

Since the collection is shrinking at the same rate that rows is growing, your memory usage will remain stable. The gc.collect() call is not going to make much difference.
Memory management in CPython is subtle. Just because you remove references and run a collection cycle doesn't necessarily mean that the memory will be returned to the OS. See this answer for details.
To really save memory, you should structure this code around generators and iterators instead of large lists of items. I'm very surprised you say you're having connection timeouts because fetching all the rows should not take much more time than fetching a row at a time and performing the simple processing you are doing. Perhaps we should have a look at your db-fetching code?
If row-at-a-time processing is really not a possibility, then at least keep your data in a deque, treat it as read-only, and perform all processing on it with generators and iterators.
I'll outline these different approaches.
First of all, some common functions:
# If you don't need random access to elements in a sequence,
# a deque uses less memory and has faster appends and deletes
# from both the front and the back.
from collections import deque
# izip is Python 2 only; on Python 3 use the built-in zip instead
from itertools import izip, repeat, islice, chain
import re

re_redshift_chars = re.compile(r'[abcdefg]')

def istrjoin(sep, seq):
    """Return a generator that acts like sep.join(seq), but lazily

    The separator will be yielded separately
    """
    return islice(chain.from_iterable(izip(repeat(sep), seq)), 1, None)

def escape_redshift(s):
    return re_redshift_chars.sub(r'\\\g<0>', s)

def tabulate(row):
    return "\t".join(escape_redshift(str(v)) if v is not None else '' for v in row)
Now the ideal is row-at-a-time processing, like this:
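cursor = db.cursor()
cursor.execute("""SELECT * FROM bigtable""")
# Iterate over the cursor itself (most DB-API drivers support this);
# fetchall() would load every row into memory before processing starts.
rowstrings = (tabulate(row) for row in cursor)
lines = istrjoin("\n", rowstrings)
file_like_obj.writelines(lines)
cursor.close()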
This will take the least possible amount of memory--only a row at a time.
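As a self-contained illustration of the row-at-a-time idea, here is a minimal Python 3 sketch. It assumes an in-memory sqlite3 database standing in for the real one, an io.StringIO standing in for file_like_obj, and a hypothetical helper lines_from that ends each line with a newline instead of using istrjoin. Escaping is left out to keep it short.

```python
import io
import sqlite3

def tabulate(row):
    # Tab-separate the values of one row; None becomes an empty string.
    return "\t".join("" if v is None else str(v) for v in row)

def lines_from(rows):
    # Yield one newline-terminated line per row, lazily.
    for row in rows:
        yield tabulate(row) + "\n"

# Stand-in database (assumption: the real code would use its own connection)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE bigtable (id INTEGER, name TEXT)")
db.executemany("INSERT INTO bigtable VALUES (?, ?)",
               [(1, "a"), (2, None), (3, "c")])

out = io.StringIO()
cursor = db.execute("SELECT * FROM bigtable ORDER BY id")
out.writelines(lines_from(cursor))  # iterating the cursor fetches rows on demand
cursor.close()
db.close()
```

Only the row currently being formatted is held by the generator; the full result set is never built as a Python list.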
If you really need to store the entire resultset, you can modify the code slightly:
cursor = db.cursor()
cursor.execute("SELECT * FROM bigtable")
collection = deque(cursor.fetchall())
cursor.close()

rowstrings = (tabulate(row) for row in collection)
lines = istrjoin("\n", rowstrings)
file_like_obj.writelines(lines)
Now we gather all results into collection first which remains entirely in memory for the entire program run.
However we can also duplicate your approach of deleting collection items as they are used. We can keep the same "code shape" by creating a generator that empties its source collection as it works. It would look something like this:
def drain(coll):
    """Return an iterable that deletes items from coll as it yields them.

    coll must support `coll.popleft()`, `coll.pop(0)` or `del coll[0]`.
    A deque is recommended!
    """
    if hasattr(coll, 'popleft'):
        # deque.pop() takes no index, so use popleft() for deques
        pop = coll.popleft
    elif hasattr(coll, 'pop'):
        pop = lambda: coll.pop(0)
    else:
        def pop():
            item = coll[0]
            del coll[0]
            return item
    while True:
        try:
            item = pop()
        except IndexError:
            # end the generator with return; raising StopIteration inside
            # a generator is an error on Python 3.7+ (PEP 479)
            return
        yield item
Now you can easily substitute drain(collection) for collection when you want to free up memory as you go. After drain(collection) is exhausted, the collection object will be empty.
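To make the draining behaviour concrete, here is a minimal Python 3 sketch. The name drain_deque is hypothetical and the function is a simplified, deque-only variant of the idea above, not the answer's exact function.

```python
from collections import deque

def drain_deque(dq):
    # Yield items from the left of dq, removing each one as it is yielded.
    while dq:
        yield dq.popleft()

collection = deque([(1, "a"), (2, "b"), (3, "c")])

# Consume the draining generator; each row is removed from the deque as it is read
seen = [row for row in drain_deque(collection)]
```

After the loop, seen holds the rows in their original order and collection is empty, so each row becomes collectable as soon as nothing else references it.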

If your algorithm depends on pop'ing from the left side or beginning of a list, you can use deque object from collections as a faster alternative.
As a comparison:
import timeit
f1='''
q=deque()
for i in range(40000):
    q.append((i,i,'tuple {}'.format(i)))
while q:
    q.popleft()
'''
f2='''
l=[]
for i in range(40000):
    l.append((i,i,'tuple {}'.format(i)))
while l:
    l.pop(0)
'''
print 'deque took {:.2f} seconds to popleft()'.format(timeit.timeit(stmt=f1, setup='from collections import deque',number=100))
print 'list took {:.2f} seconds to pop(0)'.format(timeit.timeit(stmt=f2,number=100))
Prints:
deque took 3.46 seconds to popleft()
list took 37.37 seconds to pop(0)
So for this particular test of popping from the beginning of the list or queue, deque is more than 10x faster.
This large advantage is only for the left side however. If you run this same test with pop() on both the speed is roughly the same. You can also reverse the list in place and pop from the right side to get the same results as popleft from the deque.
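A small Python 3 sketch of that equivalence, using made-up sample data: reversing the list once and popping from the right yields items in the same order as popleft() on a deque.

```python
from collections import deque

rows = [10, 20, 30, 40]

# deque: pop from the left, O(1) per pop
q = deque(rows)
from_deque = []
while q:
    from_deque.append(q.popleft())

# list: reverse once (O(n)), then pop from the right, O(1) per pop
l = list(rows)
l.reverse()
from_list = []
while l:
    from_list.append(l.pop())
```

Both loops visit the items in their original order; only list.pop(0) pays the cost of shifting every remaining element on each call.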
In terms of 'efficiency', it will be far more efficient to process single rows from the database. If that is not an option, process your list (or deque) 'collection' in place.
Try something along these lines.
First, break out the row processing:
def process_row(row):
    # I did not test this obviously, but I think I xlated your row processing faithfully
    new_row = []
    for value in row:
        # test against None, as the original does, so that 0 and False are kept
        if value is not None:
            in_chars = str(value)
        else:
            in_chars = ''
        new_row.append("".join(["\\" + c if c in redshift_escape_chars else c for c in in_chars]))
    return '\t'.join(new_row)
Now look at using a deque to allow fast pops from the left:
def cgen(collection):
    # if collection is a deque:
    while collection:
        yield '\n'+process_row(collection.popleft())
Or if you want to stick to a list:
def cgen(collection):
    collection.reverse()
    while collection:
        yield '\n'+process_row(collection.pop())
I think that your original approach of pop(0), process the row, and call gc every 5000 rows is probably suboptimal. The gc will be called automatically far more often than that anyway.
My final recommendation:
1. Use a deque. It is just like a list but faster for left-side pushes or pops;
2. Use popleft() so you do not need to reverse the list (if the order of collection is meaningful);
3. Process your collection in place as a generator;
4. Throw out the notion of calling gc since it is not doing anything for you.
5. Throw out 1-4 here if you can just call the db and get 1 row and process 1 row at a time!

Related

Extract data from dictionary as fast as possible

I have a dictionary d with around 500 main keys (name1, name2, etc.). Each value is itself a small dictionary with 5 keys (called ppty1, ppty2, etc.), and the corresponding values are floats converted to strings.
I want to extract data faster than I presently do, based on a list of lists of the form ['name1', 'ppty3','ppty4'] (name1 could be any other nameX and ppty3 and ppty4 could be any other pptyX).
In my application, I have many dictionaries, but they differ only by the values of the fields ppty1, ..., ppty5. All the keys are "static". I do not care if there are some preliminary operations, I would just like the processing time of one dictionary to be, ideally, much faster than now. My poor implementation, consisting in looping over every field takes about 3 ms.
Here is the code to generate d and fields; this is just to simulate dummy data, it does not need to be improved:
import random
random.seed(314)

# build dictionary
def make_small_dict():
    d = {}
    for i in range(5):
        key = "ppty" + str(i)
        d[key] = str(random.random())
    return d

d = {}
for i in range(500):
    d["name" + str(i)] = make_small_dict()

# build fields
def make_row():
    # randint is inclusive at both ends, so stay within the keys created above
    line = ['name' + str(random.randint(0, 499))]
    [line.append('ppty' + str(random.randint(0, 4))) for i in range(2)]
    return line

fields = [0]*300
for i in range(300):
    fields[i] = [make_row() for j in range(3)]
For example, fields[0] returns
[['name420', 'ppty1', 'ppty1'],
['name206', 'ppty1', 'ppty2'],
['name21', 'ppty2', 'ppty4']]
so the first row of the output should be something like
[[d['name420']['ppty1'], d['name420']['ppty1']],
 [d['name206']['ppty1'], d['name206']['ppty2']],
 [d['name21']['ppty2'], d['name21']['ppty4']]]
My solution:
import time

start = time.time()
data = [0] * len(fields)
i = 0
for field in fields:
    data2 = [0] * 3
    j = 0
    for row in field:
        lst = [d[row[0]][key] for key in [row[1], row[2]]]
        data2[j] = lst
        j += 1
    data[i] = data2
    i += 1
print(time.time() - start)
My main question is: how can I improve my code? A few additional questions:
Later, I need to do some operations such as column extraction and basic operations on some entries of data: would you recommend storing the extracted values directly in an np.array?
How can I avoid extracting the same values multiple times (fields has some redundant rows such as ['name1', 'ppty3', 'ppty4'])?
I read that things such as i += 1 take a little bit of time; how can I avoid them?
This was tough to read, so I started by breaking bits out into functions. Then I could test to see if that worked using just a list comprehension. It's already faster: a comparison over 10000 runs with timeit showed this code runs in about 64% of the original code's time.
In this case I kept everything in lists to force execution so it is directly comparable, but you could use generators or map, and that'd push the computation back to when the data is actually consumed.
def row_lookup(name, key1, key2):
    return (d[name][key1], d[name][key2])  # Tuple is faster to construct than list

def field_lookup(field):
    return [row_lookup(*row) for row in field]

start = time.time()
result = [field_lookup(field) for field in fields]
print(time.time() - start)
print(data == result)

# without dupes in fields (note: groupby only collapses adjacent duplicates)
from itertools import groupby
result = [field_lookup(field) for field, _ in groupby(fields)]
Change just the result assignment line to:
result = map(field_lookup, fields)
And the runtime becomes negligible, because in Python 3 map returns a lazy iterator, so it's not actually going to compute the data until you ask it for the result. This is not a fair comparison, but if you're not going to consume all the data, you'd save time. Change the list comprehensions in the functions to generator expressions and you'd get the same benefit there too. Multiprocessing and asyncio didn't improve performance in this case.
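The deferred evaluation of map described above can be observed directly in Python 3. This is a minimal sketch with a hypothetical lookup function that records each call:

```python
calls = []

def lookup(x):
    calls.append(x)   # record that work was actually done
    return x * 2

lazy = map(lookup, [1, 2, 3])   # Python 3: returns an iterator, runs nothing yet
calls_before = list(calls)

first = next(lazy)              # only now is lookup called, and only once
calls_after_one = list(calls)

rest = list(lazy)               # consuming the rest triggers the remaining calls
```

Timing only the map(...) line therefore measures the cost of creating the iterator, not the lookups themselves.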
If you can change the structure you can preprocess your fields into a list of just the rows [['namex', 'pptyx', 'pptyX']..]. In this case, you can change it to just a single list comprehension, which lets you get this down to about 29% of the original runtime, ignoring the preprocessing to slim the fields.
from itertools import groupby, chain
slim_fields = [row for row, _ in groupby(chain.from_iterable(fields))]
results = [(d[name][key1], d[name][key2]) for name, key1, key2 in slim_fields]
In this case, results is just a list of tuples containing the values: [(value1, value2)..]
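Because groupby only merges duplicates that sit next to each other, repeated rows scattered through fields are still looked up more than once. One possible way to handle that, sketched here in Python 3 with made-up sample data, is to deduplicate with dict.fromkeys and reuse cached results. Whether this is actually faster than repeating two cheap dictionary lookups is an assumption that would need measuring.

```python
d = {
    "name1": {"ppty1": "0.1", "ppty2": "0.2", "ppty3": "0.3"},
    "name2": {"ppty1": "1.1", "ppty2": "1.2", "ppty3": "1.3"},
}
rows = [
    ["name1", "ppty1", "ppty2"],
    ["name2", "ppty3", "ppty1"],
    ["name1", "ppty1", "ppty2"],   # duplicate, but not adjacent to the first
]

# dict.fromkeys keeps first-seen order (Python 3.7+) and drops every repeat;
# rows are converted to tuples so they are hashable
unique_rows = list(dict.fromkeys(tuple(row) for row in rows))

# Look each unique row up once, then reuse the cached value
cache = {(name, k1, k2): (d[name][k1], d[name][k2])
         for name, k1, k2 in unique_rows}
results = [cache[tuple(row)] for row in rows]
```

The cache approach preserves the original row order in results, which the slimmed-down list alone does not.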

Adding pool results to a dict

I have a function which accepts two inputs provided by itertools combinations, and outputs a solution. The two inputs should be stored as a tuple forming the key in the dict, while the result is the value.
I can pool this and get all of the results as a list, which I can then insert into a dictionary one-by-one, but this seems inefficient. Is there a way to get the results as each job finishes, and directly add it to the dict?
Essentially, I have the code below:
all_solutions = {}
for start, goal in itertools.combinations(graph, 2):
    all_solutions[(start, goal)] = search(graph, start, goal)
I am trying to parallelize it as follows:
all_solutions = {}
manager = multiprocessing.Manager()
graph_pool = manager.dict(graph)
pool = multiprocessing.Pool()
results = pool.starmap(search, zip(itertools.repeat(graph_pool),
                                   itertools.combinations(graph, 2)))
for i, start_goal in enumerate(itertools.combinations(graph, 2)):
    start, goal = start_goal[0], start_goal[1]
    all_solutions[(start, goal)] = results[i]
Which actually works, but iterates twice, once in the pool, and once to write to a dict (not to mention the clunky tuple unpacking).
This is possible, you just need to switch to using a lazy mapping function (not map or starmap, which have to finish computing all the results before you can begin using any of them):
import itertools
import multiprocessing
from functools import partial
from itertools import tee
manager = multiprocessing.Manager()
graph_pool = manager.dict(graph)
pool = multiprocessing.Pool()
# Since you're processing in order and in parallel, tee might help a little
# by only generating the dict keys/search arguments once. That said,
# combinations of n choose 2 are fairly cheap; the overhead of tee's caching
# might overwhelm the cost of just generating the combinations twice
startgoals1, startgoals2 = tee(itertools.combinations(graph, 2))
# Use partial binding of search with graph_pool to be able to use imap
# without a wrapper function; using imap lets us consume results as they become
# available, so the tee-ed generators don't store too many temporaries
results = pool.imap(partial(search, graph_pool), startgoals2)
# Efficiently create the dict from the start/goal pairs and the results of the search
# This line is eager, so it won't complete until all the results are generated, but
# it will be consuming the results as they become available in parallel with
# calculating the results
all_solutions = dict(zip(startgoals1, results))
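If the order in which results arrive does not matter, another option is to have each job return its own key, so the dict can be filled in completion order. The Python 3 sketch below makes several assumptions: search is a hypothetical stand-in taking (graph, pair), keyed_search is an illustrative wrapper, and multiprocessing.dummy.Pool (a thread-backed pool with the same API) is used so the example runs anywhere without pickling concerns.

```python
import itertools
from functools import partial
from multiprocessing.dummy import Pool  # thread-backed pool with the Pool API

def search(graph, pair):
    # Hypothetical stand-in for the real search: sum of the two node weights.
    start, goal = pair
    return graph[start] + graph[goal]

def keyed_search(graph, pair):
    # Return the key along with the result so completion order does not matter.
    return pair, search(graph, pair)

graph = {"a": 1, "b": 2, "c": 3}

pool = Pool(2)
try:
    pairs = itertools.combinations(sorted(graph), 2)
    # imap_unordered yields each (key, value) pair as soon as its job finishes
    all_solutions = dict(pool.imap_unordered(partial(keyed_search, graph), pairs))
finally:
    pool.close()
    pool.join()
```

With a real multiprocessing.Pool the same calls apply, but the worker functions must be defined at module level so they can be pickled, and on platforms that spawn processes the pool setup belongs under an `if __name__ == "__main__":` guard.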

Creating Generator object from record list within a function

I'm trying to create a generator object for a list of records with data from a MySQL database, so I'm passing the MySQL cursor object to the function as a parameter.
My issue is that if the "if block" containing yield records is commented out, the cust_records function works perfectly fine, but if I uncomment it, the function stops working.
I'm not sure whether this is the right way to yield a list object in Python 3.
My code so far:
def cust_records(result_set) :
    block_id = None
    records = []
    i = 0
    for row in result_set :
        records.append((row['id'], row, smaller_ids))
    if records :
        yield records
The point of generators is lazy evaluation, so storing all records in a list and yielding the list makes no sense at all. If you want to retain lazy evaluation (which is IMHO preferable, especially if you have to work on arbitrary datasets that might get huge), you want to yield each record, i.e.:
def cust_records(result_set) :
    for row in result_set :
        yield (row['id'], row, smaller_ids)

# then
def example():
    cursor.execute(<your_sql_query_here>)
    for record in cust_records(cursor):
        print(record)
Otherwise (if you really want to consume as much memory as possible), just make cust_records a plain function:
def cust_records(result_set) :
    records = []
    for row in result_set :
        records.append((row['id'], row, smaller_ids))
    return records
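A middle ground between yielding one record at a time and returning the whole list is to yield fixed-size batches. This is a minimal Python 3 sketch; the name batched_records and the size parameter are assumptions for illustration, not part of the original code.

```python
from itertools import islice

def batched_records(result_set, size):
    # Yield lists of at most `size` records, pulling rows only as needed.
    it = iter(result_set)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return          # source exhausted: end the generator
        yield batch

# Stand-in for a cursor: any iterable of rows works
rows = [{"id": i} for i in range(5)]
batches = list(batched_records(rows, 2))
```

At most one batch is held in memory at a time, so the generator stays lazy while still handing the caller a list to work with.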

Is there a way to do it faster?

ladder has around 15,000 elements, and this code snippet runs in 5-8 seconds. Is there any way to make it faster? I tried it without checking for duplicates and without creating the accs list, and the time went down to 2-3 seconds, but I don't want duplicates in the CSV file.
I work in Python 2.7.9.
accs = []
with codecs.open('test.csv','w', encoding="UTF-8") as out:
    row = ''
    for element in ladder:
        if element['account']['name'] not in accs:
            accs.append(element['account']['name'])
            row += element['account']['name']
            if 'twitch' in element['account']:
                row += "," + element['account']['twitch']['name'] + ","
            else:
                row += ",,"
            row += str(element['account']['challenges']['total']) + "\n"
    out.write(row)
seen = set()
results = []
for user in ladder:
    acc = user['account']
    name = acc['name']
    if name not in seen:
        seen.add(name)
        twitch_name = acc['twitch']['name'] if "twitch" in acc else ''
        challenges = acc['challenges']['total']
        results.append("%s,%s,%d" % (name, twitch_name, challenges))
with codecs.open('test.csv','w', encoding="UTF-8") as out:
    out.write("\n".join(results))
You can’t do much about the loop, since you need to go through every element in ladder after all. However, you can improve this membership test:
if element['account']['name'] not in accs:
Since accs is a list, this will essentially loop through all items of accs and check if the name is in there. And you loop for every element in ladder, so this can easily become very inefficient.
Instead, use a set rather than a list for accs, as this gives you constant-time membership lookups on average. That reduces your algorithm from quadratic to linear complexity. For that, just use accs = set() and change your code to use accs.add() instead of append.
Another issue is that you are doing string concatenation. Every time you do someString + "something" you throw away that string object and create a new one. This can become inefficient for a high number of operations too. Instead, use a list here to collect all the elements you want to write, and then join them:
row = []
row.append(element['account']['name'])
if 'twitch' in element['account']:
    row.append(element['account']['twitch']['name'])
else:
    row.append('')
row.append(str(element['account']['challenges']['total']))
out.write(','.join(row))
out.write('\n')
Alternatively, since you are writing to a file anyway, you could just call out.write multiple times with each string part.
Finally, you could also look into the csv module if you are interested in writing out CSV data.
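For reference, a minimal sketch of the csv-module route combined with the set-based duplicate check. It is written for Python 3 (the question uses 2.7.9, where file handling for csv differs), uses made-up sample data, and writes to an io.StringIO as a stand-in for the output file.

```python
import csv
import io

ladder = [
    {"account": {"name": "ann", "twitch": {"name": "ann_tv"}, "challenges": {"total": 3}}},
    {"account": {"name": "bob", "challenges": {"total": 5}}},
    {"account": {"name": "ann", "twitch": {"name": "ann_tv"}, "challenges": {"total": 3}}},
]

out = io.StringIO(newline="")                  # stand-in for the real file
writer = csv.writer(out, lineterminator="\n")
seen = set()
for element in ladder:
    acc = element["account"]
    if acc["name"] in seen:                    # constant-time membership test
        continue
    seen.add(acc["name"])
    twitch_name = acc["twitch"]["name"] if "twitch" in acc else ""
    writer.writerow([acc["name"], twitch_name, acc["challenges"]["total"]])
```

The writer also takes care of quoting, so names containing commas or quotes do not corrupt the file.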

Python generator expression if-else

I am using Python to parse a large file. What I want to do is
if condition == True:
    append to list A
else:
    append to list B
I want to use generator expressions for this - to save memory. I am putting in the actual code.
def is_low_qual(read):
    lowqual_bp=(bq for bq in phred_quals(read) if bq < qual_threshold)
    if iter_length(lowqual_bp) > num_allowed:
        return True
    else:
        return False

lowqual=(read for read in SeqIO.parse(r_file,"fastq") if is_low_qual(read)==True)
highqual=(read for read in SeqIO.parse(r_file,"fastq") if is_low_qual(read)==False)
SeqIO.write(highqual,flt_out_handle,"fastq")
SeqIO.write(lowqual,junk_out_handle,"fastq")

def iter_length(the_gen):
    return sum(1 for i in the_gen)
You can use itertools.tee in conjunction with itertools.ifilter and itertools.ifilterfalse (these are the Python 2 names; in Python 3 use the built-in filter and itertools.filterfalse):
import itertools

def is_condition_true(x):
    ...

gen1, gen2 = itertools.tee(sequences)
low = itertools.ifilter(is_condition_true, gen1)
high = itertools.ifilterfalse(is_condition_true, gen2)
Using tee ensures that the function works correctly even if sequences is itself a generator.
Note, though, that tee could itself use a fair bit of memory (up to a list of size len(sequences)) if low and high are consumed at different rates (e.g. if low is exhausted before high is used).
I think you're striving to avoid iterating over your collection twice. If so, this type of approach works:
high, low = [], []
_Nones = [high.append(x) if is_condition_true(x) else low.append(x) for x in sequences]
This is probably inadvisable because it uses a list comprehension for its side effects, which is generally considered unpythonic.
Just to add a more general answer: If your main concern is memory, you should use one generator that loops over the whole file, and handle each item as low or high as it comes. Something like:
for r in sequences:
    if condition_true(r):
        handle_low(r)
    else:
        handle_high(r)
If you need to collect all high/low elements before using either, then you can't guard against a potential memory hit. The reason is that you can't know which elements are high/low until you read them. If you have to process low first, and it turns out all the elements are actually high, you have no choice but to store them in a list as you go, which will use memory. Doing it with one loop allows you to handle each element one at a time, but you have to balance this against other concerns (i.e., how cumbersome it is to do it this way, which will depend on exactly what you're trying to do with the data).
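When each item can be handled as soon as it is classified, for example by writing it to one of two output handles, the single loop needs no extra storage. A minimal Python 3 sketch, assuming a hypothetical is_low condition and io.StringIO objects standing in for the two output files:

```python
import io

def is_low(value, threshold=10):
    # Hypothetical condition: values below the threshold count as "low".
    return value < threshold

sequences = iter([3, 42, 7, 15, 8])   # any one-shot iterator works
low_out, high_out = io.StringIO(), io.StringIO()

# One pass: classify each item and write it straight to the matching handle
for item in sequences:
    target = low_out if is_low(item) else high_out
    target.write("%d\n" % item)
```

The source is read exactly once and nothing is buffered beyond the current item, which is what makes this approach cheaper than tee when both outputs can be produced in the same pass.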
This is surprisingly difficult to do elegantly. Here's something that works:
from itertools import tee, ifilter, ifilterfalse
low, high = [f(condition, g) for f, g in zip((ifilter, ifilterfalse), tee(seq))]
Note that as you consume items from one resulting iterator (say low), the internal deque in tee will have to expand to contain any items that you have not yet consumed from high (including, unfortunately, those which ifilterfalse will reject). As such this might not save as much memory as you're hoping.
Here's an implementation that uses as little additional memory as possible:
from collections import deque

def filtertee(func, iterable, codomain=(False, True)):
    it = iter(iterable)
    deques = dict((r, deque()) for r in codomain)
    def gen(mydeque):
        while True:
            while not mydeque:          # as long as the local deque is empty
                try:
                    newval = next(it)   # fetch a new value,
                except StopIteration:
                    return              # source exhausted (explicit return needed since PEP 479)
                result = func(newval)   # find its image under `func`,
                try:
                    d = deques[result]  # find the appropriate deque, and
                except KeyError:
                    raise ValueError("func returned value outside codomain")
                d.append(newval)        # add it.
            yield mydeque.popleft()
    return dict((r, gen(d)) for r, d in deques.items())
This returns a dict from the codomain of the function to a generator providing the items that take that value under func:
gen = filtertee(condition, seq)
low, high = gen[True], gen[False]
Note that it's your responsibility to ensure that condition only returns values in codomain.
