I'm trying to get the difference in time between the last two times a person has applied for our service. My solution works, but it's ugly.
Is there a more pythonic way of accomplishing this?
for customer in previous_apps:
    app_times = df.ix[df['customer_id'] == customer, 'hit_datetime']
    days_since_last_app = [(b - a).days for a, b in zip(app_times, app_times[1:])][-1:][0]
    df.ix[df['customer_id'] == customer, 'days_since_last_app'] = days_since_last_app
Having a list comprehension calculate all the differences between application dates, then slicing the result with [-1:] so you have a list containing only the last element, and finally extracting that element by indexing with [0], is completely unnecessary.
You can just take the last application date, app_times[-1], and the second-to-last one, app_times[-2], and take the difference:
days_since_last_app = (app_times[-1] - app_times[-2]).days
This will fail if there are fewer than two entries in the list, so you probably want a special case for that.
(I'm guessing that line evolved into what it is by trying to resolve IndexErrors that were the result of not having previous entries.)
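A minimal sketch of that guard (the fallback value is an assumption; pick whatever sentinel fits your data):

if len(app_times) >= 2:
    days_since_last_app = (app_times.iloc[-1] - app_times.iloc[-2]).days
else:
    days_since_last_app = None  # assumed fallback for customers with a single application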
Start by defining a two-argument function that calculates the time difference for you, e.g. time_diff(a, b). Use it something like this:
df["last_visit"] = df.groupby("customer_id").apply(
lambda x: x.apply(time_diff(*x["hit_datetime"][-2:]))
(Assuming the values in hit_datetime are sorted, which your code implies they are.)
The above "broadcasts" the last_visit values, since multiple records have the same customer_id. If you prefer you can just store the result as a Series with one row per customer:
last_visit = df.groupby("customer_id")["hit_datetime"].apply(
    lambda x: time_diff(*x.iloc[-2:]))
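For completeness, a minimal sketch of what time_diff might look like (only the name comes from the suggestion above; the exact definition is an assumption):

def time_diff(a, b):
    # days elapsed between two datetime-like values, assuming b is the later one
    return (b - a).days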
I'm not sure I precisely understand how your data is structured, but the following should provide the functionality you require:
df.sort_values(['customer_id', 'hit_datetime'], ascending=True, inplace=True)
df['days_since_last_app'] = df.groupby('customer_id')['hit_datetime'].transform(
    lambda y: y.diff().apply(lambda x: 0 if x != x else x.days))
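A quick check of that approach on illustrative data (not the original poster's; the x != x test is simply a NaT check):

import pandas as pd

df = pd.DataFrame({
    'customer_id': [1, 1, 1, 2, 2],
    'hit_datetime': pd.to_datetime(
        ['2017-01-01', '2017-01-05', '2017-01-20', '2017-03-01', '2017-03-11']),
})
df.sort_values(['customer_id', 'hit_datetime'], ascending=True, inplace=True)
df['days_since_last_app'] = df.groupby('customer_id')['hit_datetime'].transform(
    lambda y: y.diff().apply(lambda x: 0 if x != x else x.days))
# days_since_last_app per row: 0, 4, 15, 0, 10 (a customer's first application gets 0)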
Related
I have two lists (items, sales), and for each pair of item and sales elements across the two lists I have to call a function. I'm looking for a pythonic way to avoid such redundant looping.
First Loop:
# Create item_sales_list
item_sales_list = list()
for item, sales in itertools.product(items, sales):
    if sales > 100:
        item_sales_list.append([item, sales])
result = some_func_1(item_sales_list)
Second Loop:
# Call a function with the result returned from first function (some_func_1)
for item, sales in itertools.product(items, sales):
    some_func_2(item, sales, result)
You can avoid the second call to itertools.product at least if you store the result in the list, adding the condition at the call site of some_func_1:
item_sales_list = list(itertools.product(items, sales))
result = some_func_1([el for el in item_sales_list if el[1] > 100])
for item, sales in item_sales_list:
    some_func_2(item, sales, result)
It is impossible to do it with one pass unless you can pass an incomplete version of result to some_func_2.
A solution, and a frame challenge.
First, to avoid calculating itertools.product() multiple times, you can calculate it once up-front and then use it for both loops:
item_product = list(itertools.product(items, sales))
item_sales_list = [[item, sales] for item, sales in item_product if sales > 100]
Second, there's actually no time disadvantage to looping twice: you're still doing the same operations the same number of times, so the work stays in the same complexity class. And in this case it's unavoidable, because you need the result of the first calculation (which requires going over the entire list) before you can do the second one.
result = some_func_1(item_sales_list)
for item, sales in item_product:
    some_func_2(item, sales, result)
If you can modify some_func_2() so that it doesn't need the entire item_sales_list in order to work, then you could load it into the same for loop and do them one after another. Without knowing how some_func_2() works, it's impossible to give any further advice.
Is there any way to construct a list of dataframes from a 1-dim dataframe or list? I thought apply would do the trick but it seems not to be the case. The job can be done easily by using a for loop but I wish to avoid that. More details down below.
This is the code I tried, but it doesn't work:
pd.DataFrame([1,2,3,4,5]).apply(lambda x: pd.DataFrame([x]))
This is the code that does the trick, but a for loop is what I wish to avoid at all costs; do run it so that you can see what I'm actually trying to achieve:
list = [1,2,3,4,5]
j = []
for i in list:
    i = pd.DataFrame([i])
    j = j + [i]
In the project I work on, what I need to do is much more complex than turning an element into a 1x1 dataframe: each element would be transformed into a large dataframe, and eventually all of the generated dataframes would be collected into a list. The only bottleneck is exactly the issue I described. Thanks in advance.
You can simplify and speed up your loop by using a list comprehension, which avoids some of the per-iteration overhead of an explicit for loop.
Note: I renamed your list to lst, since list is the name of a built-in type in Python; don't shadow it with a variable name.
dfs = [pd.DataFrame([x]) for x in lst]
Now we can access each dataframe:
print(dfs[0])
   0
0  1

print(dfs[1])
   0
0  2
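The same pattern scales to more involved per-element transformations; build_frame below is a hypothetical stand-in for whatever the real project does to each element:

import pandas as pd

def build_frame(x):
    # hypothetical per-element builder; the real one could return a large dataframe
    return pd.DataFrame({'value': [x], 'squared': [x ** 2]})

lst = [1, 2, 3, 4, 5]
dfs = [build_frame(x) for x in lst]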
When doing data processing tasks I often find myself applying a series of compositions, vectorized functions, etc. to some input iterable of data to generate a final result. Ideally I would like something that works for both lists and generators (in addition to any other iterable). I can think of a number of ways to structure such code, but every one of them has some aspect that feels unclean or unidiomatic to me. I have outlined the different methods I can think of below, but my question is: is there a recommended, idiomatic way to do this?
Methods I can think of, illustrated with a simple example that is generally representative of the kind of pipeline I mean:
Write it as one large expression
result = [sum(group)
for key, group in itertools.groupby(
filter(lambda x: x <= 2, [x **2 for x in input]),
keyfunc=lambda x: x % 3)]
This is often quite difficult to read for any non-trivial sequence of steps. When reading through the code one also encounters each step in reverse order.
Save each step into a different variable name
squared = [x**2 for x in input]
filtered = filter(lambda x: x < 2, squared)
grouped = itertools.groupby(filtered, keyfunc=lambda x: x % 3)
result = [sum(group) for key, group in grouped]
This introduces a number of local variables that can often be hard to name descriptively; additionally, if the result of some or all of the intermediate steps is especially large, keeping them around could be very wasteful of memory. If one wants to add a step to this process, care must be taken that all variable names are updated correctly: for example, if we wished to divide every number by two, we would add the line halved = [x / 2.0 for x in filtered], but would also have to remember to change filtered to halved in the following line, as sketched below.
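For concreteness, inserting that step means touching two adjacent lines (a sketch with illustrative data; the key function is passed positionally here, which is what itertools.groupby actually expects):

import itertools

input = [0.5, 1.0, 1.5, 2.0, 3.0]                     # illustrative data
squared = [x**2 for x in input]
filtered = filter(lambda x: x < 2, squared)
halved = [x / 2.0 for x in filtered]                  # newly added step
grouped = itertools.groupby(halved, lambda x: x % 3)  # was: groupby(filtered, ...)
result = [sum(group) for key, group in grouped]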
Store each step into the same variable name
tmp = [x**2 for x in input]
tmp = filter(lambda x: x < 2, tmp)
tmp = itertools.groupby(tmp, keyfunc=lambda x: x % 3)
result = [sum(group) for key, group in tmp]
I guess this seems to me the least bad of these options, but storing things in a generically named placeholder variable feels un-pythonic and makes me suspect that there is some better way out there.
Code Review is often a better place for style questions; SO is more for problem solving. But CR can be picky about the completeness of the example.
Still, I can offer a few observations:
If you wrap this calculation in a function, naming isn't such a big deal. The names don't have to be globally meaningful.
A number of your expressions are generators, and itertools tends to produce generators or generator expressions, so memory use shouldn't be much of an issue.
import itertools

def better_name(input):
    squared = (x**2 for x in input)  # generator expression
    filtered = filter(lambda x: x < 2, squared)
    grouped = itertools.groupby(filtered, lambda x: x % 3)
    result = (sum(group) for key, group in grouped)
    return result

list(better_name(input))
Using def functions instead of lambdas can also make the code clearer. There's a trade off. Your lambdas are simple enough that I'd probably keep them.
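For illustration, a sketch of the same pipeline with named helpers (the helper names are mine, not from the original post):

import itertools

def keep_small(x):
    return x < 2

def residue_mod_3(x):
    return x % 3

def summed_groups(values):
    squared = (x ** 2 for x in values)
    filtered = filter(keep_small, squared)
    grouped = itertools.groupby(filtered, residue_mod_3)
    return (sum(group) for key, group in grouped)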
Your 2nd option is much more readable than the 1st. The order of the expressions guides my reading and mental evaluation. In the 1st it's hard to identify the inner-most or first evaluation. And groupby is a complex operation, so any help in compartmentalizing the action is welcome.
Following the filter docs, these are equivalent:
filtered = filter(lambda x: x < 2, squared)
filtered = (x for x in squared if x<2)
I was missing the return at first. The function can return a generator, as shown, or an evaluated list.
groupby's key function is its second argument, named key rather than keyfunc, so keyfunc=... raises a TypeError; pass it positionally (as above) or as key=....
groupby is a complex function. It returns an iterator that produces (key, group) tuples, where each group is itself an iterator. Returning something like the following makes that structure more obvious:
((key, list(group)) for key, group in grouped)
So a code style that clarifies its use is desirable.
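For example (a small illustrative sketch; groupby only groups consecutive items, so the input here is already ordered by key):

import itertools

grouped = itertools.groupby([1, 4, 7, 2, 5], lambda x: x % 3)
pairs = [(key, list(group)) for key, group in grouped]
# pairs == [(1, [1, 4, 7]), (2, [2, 5])]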
I have a huge list of tuples from which I want to extract individual columns. I have tried two methods.
Assume the list is named List and I want to extract the jth column.
First one is
column=[item[j] for item in List]
Second one is
newList=zip(*List)
column=newList[j]
However, both methods are too slow, since the list has about 50000 tuples and each tuple has about 100 elements. Is there a faster way to extract the columns from the list?
This is something NumPy does well:
import numpy as np

A = np.array(List)  # this step may take a while; ideally keep the data as an np.array before this point
column = A[:, j]    # the slice itself should be really quite fast
That said,
newList=zip(*List)
column=newList[j]
takes less than a second for me with a 50000x100 list of tuples (note that in Python 3, zip returns an iterator, so you would need list(zip(*List)) before indexing), so maybe profile your code and make sure the bottleneck is actually where you think it is...
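A quick way to do that profiling (a sketch with stand-in data; adjust number to taste):

import timeit
import numpy as np

List = [tuple(range(100)) for _ in range(50000)]  # stand-in for the real data
j = 42

print(timeit.timeit(lambda: [item[j] for item in List], number=10))  # list comprehension
print(timeit.timeit(lambda: list(zip(*List))[j], number=10))         # transpose with zip
A = np.array(List)
print(timeit.timeit(lambda: A[:, j], number=10))                     # numpy slicing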
I have a dataframe where each series is filled with 0s and 1s, as follows:
flagdf=pd.DataFrame({'a':[1,0,1,0,1,0,0,1], 'b':[0,0,1,0,1,0,1,0]})
Now, depending on some analysis I have done, I need to change some 0s to 1s. So the final dataframe will be:
final=pd.DataFrame({'a':[1,1,1,0,1,1,1,1], 'b':[1,1,1,0,1,1,1,1]})
The results of the analysis, which show which 0s have to be changed, are stored in a second dataframe built with a multi-index:
     first  last
a 1      1     1
  5      5     6
b 0      0     1
  5      5     5
  7      7     7
For each 'a' and 'b' I have the first and the last indexes of the 0s I need to change.
First question: The second index in the multi-index dataframe is equal to the series 'first'. I was initially trying to use it directly, but I found it easier to handle two series rather than an index and a series. Am I missing something?
Here is the code to do the job:
def change_one_value_one_column(flagdf, col_name, event):
    flagdf[col_name].iloc[event] = 1

def change_val_column(col_name, tochange, flagdf):
    col_tochange = tochange.ix[col_name]
    tomod = col_tochange[['first', 'last']].values
    iter_tomod = [xrange(el[0], el[1] + 1) for el in tomod]
    [change_one_value_one_column(flagdf, col_name, event) for iterel in iter_tomod for event in iterel]

[change_val_column(col_name, tochange, flagdf) for col_name in flagdf.columns]
Second question: I genuinely think that a list comprehension is usually a good thing, but in cases like this, when I write a function specifically so it can be called from a list comprehension, I have some doubt. Is it truly the best thing to do?
Third question: I think the code is quite pythonic, but I am not proud of the last list comprehension, which runs over the series of the dataframe purely for its side effects; using the apply method would look better to my eyes (though I'm not sure how to do it). Nonetheless, is there any real reason (apart from elegance) why I should make these changes?
To answer the part about exhausting an iterator, I think you have a few pythonic choices (all of which I prefer over a list comprehension):
# the easiest, and most readable
for col_name in flagdf.columns:
    change_val_column(col_name, tochange, flagdf)

# consume/exhaust an iterator using built-in any (assuming each call returns None)
any(change_val_column(col_name, tochange, flagdf) for col_name in flagdf.columns)

# use itertools' consume recipe
consume(change_val_column(col_name, tochange, flagdf) for col_name in flagdf.columns)
See the consume recipe in the itertools documentation.
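For reference, the consume recipe from the itertools docs is roughly:

import collections
from itertools import islice

def consume(iterator, n=None):
    """Advance the iterator n steps ahead; if n is None, consume it entirely."""
    if n is None:
        collections.deque(iterator, maxlen=0)  # feed the entire iterator into a zero-length deque
    else:
        next(islice(iterator, n, n), None)     # advance to the empty slice starting at position n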
However, when doing this kind of thing in numpy/pandas, you should be asking yourself "can I vectorize / use indexing here?". If you can, your code will usually be both faster and more readable.
I think in this case you'll be able to remove one level of loops by doing something like:
def change_val_column(col_name, tochange, flagdf):
    col_tochange = tochange.ix[col_name]  # Note: you're accessing the index, not a column, here??
    tomod = col_tochange[['first', 'last']].values
    for i, j in tomod:
        flagdf.loc[i:j, col_name] = 1
You may even be able to remove the for loop, but it's not obvious how / what the intention is here...
If I'm staying in python and iterating over rows, I prefer using zip/izip as a first pass.
from itertools import izip  # Python 3: just use the built-in zip

for col, start, end in izip(tochange.index.get_level_values(0), tochange['first'], tochange['last']):
    flagdf.loc[start:end, col] = 1
Simple and fast.
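A self-contained check of that loop (the construction of tochange below is my reading of the printed multi-index table; plain zip is used so it runs on Python 3):

import pandas as pd

flagdf = pd.DataFrame({'a': [1, 0, 1, 0, 1, 0, 0, 1],
                       'b': [0, 0, 1, 0, 1, 0, 1, 0]})
tochange = pd.DataFrame(
    {'first': [1, 5, 0, 5, 7], 'last': [1, 6, 1, 5, 7]},
    index=pd.MultiIndex.from_tuples([('a', 1), ('a', 5), ('b', 0), ('b', 5), ('b', 7)]))

for col, start, end in zip(tochange.index.get_level_values(0),
                           tochange['first'], tochange['last']):
    flagdf.loc[start:end, col] = 1

# flagdf now equals the `final` frame from the question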