I have two lists (items and sales), and for each (item, sale) pair across the two lists I have to call a function. I'm looking for a Pythonic way to avoid this redundant looping.
First Loop:
import itertools

# Create item_sales_list
item_sales_list = list()
for item, sale in itertools.product(items, sales):
    if sale > 100:
        item_sales_list.append([item, sale])
result = some_func_1(item_sales_list)
Second Loop:
# Call a function with the result returned from the first function (some_func_1)
for item, sale in itertools.product(items, sales):
    some_func_2(item, sale, result)
You can at least avoid the second call to itertools.product if you store its result in a list, adding the condition at the call site of some_func_1:
item_sales_list = list(itertools.product(items, sales))
result = some_func_1([el for el in item_sales_list if el[1] > 100])
for item, sale in item_sales_list:
    some_func_2(item, sale, result)
It is impossible to do it with one pass unless you can pass an incomplete version of result to some_func_2.
A solution, and a frame challenge.
First, to avoid calculating itertools.product() multiple times, you can calculate it once up-front and then use it for both loops:
item_product = list(itertools.product(items, sales))
item_sales_list = [[item, sales] for item, sales in item_product if sales > 100]
Second, there's actually no time disadvantage to looping twice: you're still doing basically the same amount of work (the same operations, the same number of times each), so it's still in the same complexity class. And in this case it's unavoidable, because you need the result of the first calculation (which requires going over the entire list) to do the second calculation.
result = some_func_1(item_sales_list)
for item, sale in item_product:
    some_func_2(item, sale, result)
If you can modify some_func_2() so that it doesn't need the entire item_sales_list in order to work, then you could load it into the same for loop and do them one after another. Without knowing how some_func_2() works, it's impossible to give any further advice.
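As a sketch of what that could look like: if some_func_2 accepted one (item, sale) pair at a time plus a running (possibly incomplete) aggregate, a single pass would work. The bodies of some_func_1 and some_func_2 below are entirely hypothetical stand-ins, just to make the shape of the loop concrete:

```python
import itertools

# Hypothetical stand-ins: some_func_1 is assumed to be a foldable aggregate
# (a plain sum here), and some_func_2 is assumed to accept one pair plus the
# running, possibly incomplete, aggregate.
def some_func_1(pairs):
    return sum(sale for _, sale in pairs)

def some_func_2(item, sale, partial_result):
    return (item, sale, partial_result)

def one_pass(items, sales):
    result = 0   # incomplete aggregate, updated as we go
    calls = []
    for item, sale in itertools.product(items, sales):
        if sale > 100:
            result += some_func_1([(item, sale)])  # fold one qualifying pair in
        calls.append(some_func_2(item, sale, result))
    return result, calls
```

Whether this is acceptable depends entirely on whether some_func_2 can tolerate seeing a partial result, which only you can judge.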
I have this loop:
for s in sales:
    salezip = sales[s][1]
    salecount = sales[s][0]
    for d in deals:
        dealzip = deals[d][1]
        dealname = deals[d][0]
        for zips in ziplist:
            if salezip == zips[0] and dealzip == zips[1]:
                distance = zips[2]
                print "MATCH FOUND"
                if not salesdict.has_key(dealname):
                    salesdict[dealname] = [dealname, dealzip, salezip, salecount, distance]
                else:
                    salesdict[dealname][3] += salecount
And it is taking FOREVER to run. The sales dictionary has 13k entries, the deals dictionary has 1000 entries, and the ziplist has 1.8M entries. It is obviously very slow when it hits the ziplist part. I have it set to print "MATCH FOUND" when it successfully finds a match, and it hasn't printed in over 20 minutes. What can I do to make this run quicker?
Purpose of the code:
Loop through sales data which contains the amount of apples sold and the location of the purchase, pull the location and quantity info. Then, loop through apple dealers, find their location and their name. Then, loop through the ziplist data which shows distance between zip codes, sorted in ascending order of distance. The second it finds a match of the sales zip and dealer zip, it adds them to a dictionary with all of their information.
Having ziplist as an actual list of (zip1, zip2, distance) is insane - you want a data structure where you can directly find the desired item, without having to loop through the entire data set.
A dictionary with (zip1, zip2) as the key, and the distance as the value, would be enormously faster. Note that you'd need to insert each distance under the key (zip2, zip1) as well, to handle lookups in the opposite direction. Alternatively, you could sort [zip1, zip2] into numeric order before using it as a key (both on insert and lookup), so that it doesn't matter which order they are specified in.
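A minimal sketch of that idea (the zip codes and distance below are made up for illustration):

```python
def build_distance_table(ziplist):
    # Key each (zip1, zip2) pair in sorted order so a lookup works
    # regardless of which direction the pair is given in.
    table = {}
    for zip1, zip2, distance in ziplist:
        table[tuple(sorted((zip1, zip2)))] = distance
    return table

def lookup_distance(table, zip1, zip2):
    # Returns None when the pair is not in the table.
    return table.get(tuple(sorted((zip1, zip2))))
```

Building the table is a single O(n) pass over ziplist; every lookup afterwards is O(1) instead of a 1.8M-element scan.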
The best thing you can do is to reorganize your code so that you don't have to loop so many times, and you don't have to do as many look-ups. It looks to me like you're looping over ziplist roughly 13 million times (13,000 sales × 1,000 deals) when a single pass would do. Here are a couple of ideas that might help:
First, create a way to quickly look up sale and deal information by zip rather than by name:
sale_by_zip = {sales[key][1]: sales[key] for key in sales}
deal_by_zip = {deals[key][1]: deals[key] for key in deals}
Then, make the iteration through the ziplist the only outer loop:
for zips in ziplist:
    salezip = zips[0]
    dealzip = zips[1]
    if salezip in sale_by_zip and dealzip in deal_by_zip:
        distance = zips[2]
        print "MATCH FOUND"
        dealname = deal_by_zip[dealzip][0]
        salecount = sale_by_zip[salezip][0]
        if not salesdict.has_key(dealname):
            salesdict[dealname] = [dealname, dealzip, salezip, salecount, distance]
        else:
            salesdict[dealname][3] += salecount
This should drastically reduce the amount of processing you need to do.
As others have noted, the structure of ziplist is also not the most well-suited to this problem. My suggestions assume ziplist is something you receive from an external source and cannot change the format of without making an extra pass over it. If you are building the ziplist yourself, however, consider something that would give you faster lookups.
The root of your problem is that you're processing the zip list multiple times - for every deal and then again for every sale.
One possibility is to reverse the nesting order of your loops: start with the ziplist, then the sales list, and finally the deals dictionary. If you're going to iterate through something multiple times, at least iterating repeatedly through the smaller dictionary would be a lot faster.
If there aren't a lot of matches, perhaps using "in" would be quicker, e.g. if dealzip in zips:, and then processing from there.
Basically, I want a fancy oneliner that doesn't read all of the files I'm looking at into memory, but still processes them all, and saves a nice sample of them.
The oneliner I would like to do is:
def foo(findex):
    return [bar(line) for line in findex]  # but skip every nth term
But I would like to be able to not save every nth line in that. i.e., I still want it to run (for byte position purposes), but I don't want to save the image, because I don't have enough memory for that.
So, if the output of bar(line) is 1,2,3,4,5,6,... I would like it to still run on 1,2,3,4,5,6,... but I would like the return value to be [1,3,5,7,9,...] or something of the sort.
Use enumerate to get the index, and filter with modulo to take every other line:
return [bar(line) for i,line in enumerate(findex) if i%2]
Generalize that with i % n: every time the index is divisible by n, i % n == 0 is falsy, so bar(line) isn't emitted into the list comprehension.
enumerate works for every iterable (file handle, generator ...), so it's way better than using range(len(findex))
Now the above is incorrect if you need bar to run on all the values (because you rely on its side effects): the filter prevents execution. In that case, do it in two stages, for instance using map to apply your function to all items of findex (which guarantees that every line is processed) and then picking only the results you're interested in, with the same modulo filter applied after execution:
l = [x for i,x in enumerate(map(bar,findex)) if i%n]
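Since the whole point is to avoid holding everything in memory, a generator variant of the same idea may fit even better. This is a sketch; bar and n stand in for whatever you actually use:

```python
def sampled(findex, bar, n):
    # Run bar on every line (so any side effects still happen),
    # but only yield results whose index is not a multiple of n.
    for i, line in enumerate(findex):
        value = bar(line)
        if i % n:
            yield value
```

Because it yields lazily, only the sampled results you consume are ever materialized; the rest are processed and discarded one at a time.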
If findex is subscriptable (accepts the [] operator with indices), you can try it this way:
def foo(findex):
    return [bar(findex[i]) for i in range(0, len(findex), 2)]
I'm trying to get the difference in time between the last two times a person has applied for our service. My solution works, but it's ugly.
Is there a more pythonic way of accomplishing this?
for customer in previous_apps:
    app_times = df.ix[df['customer_id'] == customer, 'hit_datetime']
    days_since_last_app = [(b - a).days for a, b in zip(app_times, app_times[1:])][-1:][0]
    df.ix[df['customer_id'] == customer, 'days_since_last_app'] = days_since_last_app
Having a list comprehension calculate all the differences between consecutive application dates, then slicing it with [-1:] so you have a list containing only the last element, then extracting that element by indexing with [0], is completely unnecessary.
You can just take the last application date, app_times[-1], and the second-to-last one, app_times[-2], and take the difference:
days_since_last_app = (app_times[-1] - app_times[-2]).days
This will fail if there are fewer than 2 entries in the list, so you probably want a special case for that.
(I'm guessing that line evolved into what it is by trying to resolve IndexErrors that were the result of not having previous entries.)
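A guarded version of that, assuming app_times is an ordered sequence of datetimes (None here stands in for whatever default you want when there is no previous application):

```python
from datetime import datetime

def days_since_last_app(app_times):
    # Fewer than two applications: there is no gap to compute.
    if len(app_times) < 2:
        return None
    return (app_times[-1] - app_times[-2]).days

# Example: ten days between the last two applications
gap = days_since_last_app([datetime(2020, 1, 1), datetime(2020, 1, 11)])
```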
Start by defining a two-argument function that calculates the time difference for you, e.g. time_diff(a, b). Use it something like this:
df["last_visit"] = df.groupby("customer_id")["hit_datetime"].transform(
    lambda s: time_diff(*s.iloc[-2:]))
(Assuming the values in hit_datetime are sorted, which your code implies they are.)
The above "broadcasts" the last_visit values, since multiple records have the same customer_id. If you prefer you can just store the result as a Series with one row per customer:
last_visit = df.groupby("customer_id")["hit_datetime"].apply(
    lambda s: time_diff(*s.iloc[-2:]))
I'm not sure I precisely understand how your data is structured, but the following should provide the functionality you require:
df.sort_values(['customer_id', 'hit_datetime'], ascending=True, inplace=True)
df['days_since_last_app'] = df.groupby('customer_id')['hit_datetime'].transform(
    lambda y: y.diff().apply(lambda x: 0 if x != x else x.days))
I am trying to create a function, new_function, that takes a number as an argument.
This function will manipulate values in a list based on what number I pass as an argument. Within this function, I will place another function, new_sum, that is responsible for manipulating values inside the list.
For example, if I pass 4 into new_function, I need new_function to run new_sum on each of the first four elements. The corresponding value will change, and I need to create four new lists.
example:
listone = [1, 2, 3, 4, 5]

def new_function(value):
    for i in range(0, value):
        new_list = listone[:]
        variable = new_sum(i)
        new_list[i] = variable
        return new_list

# running new_function(4) should return four new lists
# [(new value for index zero, based on new_sum), 2, 3, 4, 5]
# [1, (new value for index one, based on new_sum), 3, 4, 5]
# [1, 2, (new value for index two, based on new_sum), 4, 5]
# [1, 2, 3, (new value for index three, based on new_sum), 5]
My problem is that I keep getting only one list back. What am I doing wrong?
Fix the indentation of the return statement so it is outside the loop:
listone = [1, 2, 3, 4, 5]

def new_function(value):
    for i in range(0, value):
        new_list = listone[:]
        variable = new_sum(i)
        new_list[i] = variable
    return new_list
The problem with return new_list is that once you return, the function is done.
You can make things more complicated by accumulating the results and returning them all at the end:
listone = [1, 2, 3, 4, 5]

def new_function(value):
    new_lists = []
    for i in range(0, value):
        new_list = listone[:]
        variable = new_sum(i)
        new_list[i] = variable
        new_lists.append(new_list)
    return new_lists
However, this is exactly what generators are for: If you yield instead of return, that gives the caller one value, and then resumes when he asks for the next value. So:
listone = [1, 2, 3, 4, 5]

def new_function(value):
    for i in range(0, value):
        new_list = listone[:]
        variable = new_sum(i)
        new_list[i] = variable
        yield new_list
The difference is that the first version gives the caller a list of four lists, while the second gives the caller an iterator of four lists. Often, you don't care about the difference; in fact, an iterator may be better for responsiveness, memory, or performance reasons.*
If you do care, it often makes more sense to just make a list out of the iterator at the point you need it. In other words, use the second version of the function, then just write:
new_lists = list(new_function(4))
By the way, you can simplify this by not trying to mutate new_list in-place, and instead just change the values while copying. For example:
def new_function(value):
    for i in range(value):
        yield listone[:i] + [new_sum(i)] + listone[i + 1:]
* Responsiveness is improved because you get the first result as soon as it's ready, instead of only after they're all ready. Memory use is improved because you don't need to keep all of the lists in memory at once, just one at a time. Performance may be improved because interleaving the work can result in better cache behavior and pipelining.
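For example, with a hypothetical new_sum that just multiplies the index by 10 (a made-up stand-in, since the question never shows the real new_sum), the generator version produces:

```python
listone = [1, 2, 3, 4, 5]

def new_sum(i):  # hypothetical stand-in for the real new_sum
    return i * 10

def new_function(value):
    for i in range(value):
        # Build each new list while copying, replacing only index i.
        yield listone[:i] + [new_sum(i)] + listone[i + 1:]

result = list(new_function(2))
# result[0] replaces index 0, result[1] replaces index 1
```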
I know that it is not allowed to remove elements while iterating over a list, but is it allowed to add elements to a Python list while iterating? Here is an example:
for a in myarr:
    if somecond(a):
        myarr.append(newObj())
I have tried this in my code and it seems to work fine; however, I don't know if that's because I'm just lucky and it will break at some point in the future.
EDIT: I prefer not to copy the list since "myarr" is huge, and therefore it would be too slow. Also I need to check the appended objects with "somecond()".
EDIT: At some point "somecond(a)" will be false, so there can not be an infinite loop.
EDIT: Someone asked about the "somecond()" function. Each object in myarr has a size, and each time "somecond(a)" is true and a new object is appended to the list, the new object will have a size smaller than a. "somecond()" has an epsilon for how small objects can be and if they are too small it will return "false"
Why don't you just do it the idiomatic C way? This ought to be bullet-proof. Note that indexing into a Python list is O(1) (CPython lists are arrays under the hood, not linked lists), so this is not a "Shlemiel the Painter" algorithm. In any case, I tend not to worry about optimization until it becomes clear that a particular section of code is really a problem. First make it work; then worry about making it fast, if necessary.
If you want to iterate over all the elements:
i = 0
while i < len(some_list):
    more_elements = do_something_with(some_list[i])
    some_list.extend(more_elements)
    i += 1
If you only want to iterate over the elements that were originally in the list:
i = 0
original_len = len(some_list)
while i < original_len:
    more_elements = do_something_with(some_list[i])
    some_list.extend(more_elements)
    i += 1
well, according to http://docs.python.org/tutorial/controlflow.html
It is not safe to modify the sequence
being iterated over in the loop (this
can only happen for mutable sequence
types, such as lists). If you need to
modify the list you are iterating over
(for example, to duplicate selected
items) you must iterate over a copy.
You could use islice from itertools to create an iterator over only the portion of the list that exists now; len(myarr) is evaluated once, up front, so entries you append during the loop don't affect the items you're iterating over:
islice(myarr, len(myarr))
Even better, you don't have to iterate over all the elements: islice also accepts a step argument, so you can skip ahead.
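A sketch of the islice idea: snapshot the current length first, so appends during the loop don't extend the iteration. The somecond and newObj below are made-up stand-ins for the question's real functions:

```python
from itertools import islice

myarr = [3, 2, 1]

def somecond(a):  # hypothetical condition
    return a >= 2

def newObj():  # hypothetical new object
    return 0

# len(myarr) is evaluated once when islice is called, so only the
# three original items are visited; appended items are left alone.
for a in islice(myarr, len(myarr)):
    if somecond(a):
        myarr.append(newObj())
```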
In short: if you're absolutely sure all new objects fail the somecond() check, then your code works fine; it just wastes some time iterating over the newly added objects.
Before giving a proper answer, you have to understand why it is considered a bad idea to change a list/dict while iterating over it. When using a for statement, Python tries to be clever and fetches a dynamically calculated item each time. Taking a list as an example, Python keeps an index and each time returns l[index] to you. If you are changing l, the result of l[index] can be messy.
NOTE: Here is a stackoverflow question to demonstrate this.
The worst case for adding elements while iterating is an infinite loop; try the following in a Python REPL (or don't, if you can read a bug):
import random

l = [0]
for item in l:
    l.append(random.randint(1, 1000))
    print item
It will print numbers non-stop until memory is used up, or it is killed by the system or the user.
With the internal reason understood, let's discuss the solutions. Here are a few:
1. make a copy of origin list
Iterate over the original list, and modify the copy.
result = l[:]
for item in l:
    if somecond(item):
        result.append(Obj())
2. control when the loop ends
Instead of handing control to Python, you decide how to iterate the list:
length = len(l)
for index in range(length):
    if somecond(l[index]):
        l.append(Obj())
Before iterating, calculate the list length, and only loop length times.
3. store added objects in a new list
Instead of modifying the original list, store new objects in a new list and concatenate them afterward.
added = [Obj() for item in l if somecond(item)]
l.extend(added)
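For instance, with a concrete somecond and Obj (both made up for illustration):

```python
l = [1, 5, 2, 8]

def somecond(item):  # hypothetical condition
    return item > 4

class Obj:  # hypothetical new-object type
    pass

# One pass builds the additions; the extend happens only after
# iteration is finished, so nothing is modified mid-loop.
added = [Obj() for item in l if somecond(item)]
l.extend(added)
```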
You can do this.
bonus_rows = []
for a in myarr:
    if somecond(a):
        bonus_rows.append(newObj())
myarr.extend(bonus_rows)
Access your list elements directly by index i; the range is computed once when the loop starts, so you can append to your list safely:
for i in xrange(len(myarr)):
    if somecond(myarr[i]):
        myarr.append(newObj())
Make a copy of your original list and iterate over that; see the modified code below:
for a in myarr[:]:
    if somecond(a):
        myarr.append(newObj())
I had a similar problem today. I had a list of items that needed checking; if the objects passed the check, they were added to a result list. If they didn't pass, I changed them a bit and if they might still work (size > 0 after the change), I'd add them on to the back of the list for rechecking.
I went for a solution like
items = [...what I want to check...]
result = []
while items:
    recheck_items = []
    for item in items:
        if check(item):
            result.append(item)
        else:
            item = change(item)  # Note that this always lowers the integer size(),
                                 # so no danger of an infinite loop
            if item.size() > 0:
                recheck_items.append(item)
    items = recheck_items  # Let the loop restart with these, if any
My list is effectively a queue, should probably have used some sort of queue. But my lists are small (like 10 items) and this works too.
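The same recheck loop written with the queue the answer alludes to, collections.deque. The check and change functions, and the use of a plain "item > 0" in place of the original item.size() > 0, are hypothetical stand-ins:

```python
from collections import deque

def process(items, check, change):
    # Queue-based recheck loop: items that fail the check are changed
    # and pushed back for rechecking, until they pass or shrink away.
    todo = deque(items)
    result = []
    while todo:
        item = todo.popleft()
        if check(item):
            result.append(item)
        else:
            item = change(item)  # assumed to always shrink the item
            if item > 0:         # stand-in for the original item.size() > 0
                todo.append(item)
    return result
```

deque gives O(1) popleft, whereas popping from the front of a list is O(n); for ten items it makes no difference, but it scales better.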
You can use an index and a while loop instead of a for loop if you want the loop to also cover the elements that are added to the list during the loop:
i = 0
while i < len(myarr):
    a = myarr[i]
    i = i + 1
    if somecond(a):
        myarr.append(newObj())
Expanding S.Lott's answer so that new items are processed as well:
todo = myarr
done = []
while todo:
    added = []
    for a in todo:
        if somecond(a):
            added.append(newObj())
    done.extend(todo)
    todo = added
The final list is in done.
Alternate solution (note that somecond must actually be called on each element, and in Python 3 reduce lives in functools):
reduce(lambda acc, a: acc + [newObj()] if somecond(a) else acc, myarr, list(myarr))
Assuming you are adding at the end of the list arr, you can try this method I often use:
arr = [...]  # The list I want to work with
current_length = len(arr)
i = 0
while i < current_length:
    current_element = arr[i]
    do_something(arr[i])
    # Time to insert
    insert_count = 1  # how many items you are adding at the end
    arr.append(item_to_be_inserted)
    # IMPORTANT: advance the index and extend the limit
    i += 1
    current_length += insert_count
This is just boilerplate; if you run it as-is, your program will freeze because of an infinite loop (the loop grows as fast as the index advances). DO NOT FORGET TO TERMINATE THE LOOP unless you mean it to keep going.