Pandas apply function to column taking the value of previous column - python

I have to create a time series using column values to compute the Recency of a customer.
The formula I have to use is R(t) = 0 if the customer has bought something in that month, R(t-1) + 1 otherwise.
I managed to compute a dataframe:
CustomerID -1 0 1 2 3 4 5 6 7 8 9 10 11 12
0 17850 0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1 13047 0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 1.0
2 12583 0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 14688 0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
4 15311 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3750 15471 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3751 13436 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3752 15520 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3753 14569 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3754 12713 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
There is a 0 if the customer has bought something in that month and a 1 otherwise. The column names indicate a time period, with column "-1" as a dummy column.
How can I replace the value in each column with 0 if the current value is 0 and with the value of the previous column + 1 otherwise?
For example, the final result for the second customer should be 0 1 0 0 1 0 0 1 0 1 0 1 2
I know how to apply a function to a column, but I don't know how to make that function use the value from the previous column.

Just use the apply function to iterate through the rows of the dataframe and do the manipulation there. Note that the running recency must come from the values already computed, not from the original row:
def apply_function(row):
    out = [row.iloc[0]]                    # keep CustomerID unchanged
    for item in row.iloc[1:]:              # walk the month columns left to right
        out.append(0 if item == 0 else out[-1] + 1)  # R(t) = 0 if bought, else R(t-1) + 1
    return out

new_df = df.apply(apply_function, axis=1, result_type='expand')
new_df.columns = df.columns  # restore the original column names
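If you want to avoid a Python-level loop entirely, the classic "cumulative sum with resets" idiom also works here. A sketch of a vectorized alternative (not part of the answer above), assuming the month columns are everything after CustomerID and the dummy "-1" column is all zeros:
months = df.iloc[:, 1:]             # month columns, dummy "-1" included
s = months.cumsum(axis=1)           # running count of 1s in each row
# freeze the running count at each 0 and subtract it, resetting the counter after every purchase
recency = s - s.where(months.eq(0)).ffill(axis=1).fillna(0)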

Do you insist on using the column structure? It is common with time series to use rows, e.g., a dataframe with columns CustomerID, hasBoughtThisMonth. You can then easily add the Recency column by using a pandas transform().
I cannot place comments yet, hence the question in this way.
Edit: here is another way to go about it. I took two customers as an example, and some random numbers for whether or not they bought something in each month.
Basically, you reshape your table to long form and use a groupby + cumsum to get your result. Notice that this avoids your dummy column.
import pandas as pd
import numpy as np

np.random.seed(1)

# Make example dataframe
df = pd.DataFrame({'CustomerID': [1]*12 + [2]*12,
                   'Month': [1,2,3,4,5,6,7,8,9,10,11,12]*2,
                   'hasBoughtThisMonth': np.random.randint(2, size=24)})

# Make Recency column by finding contiguous groups of ones, then groupby + cumsum
contiguous_groups = df['hasBoughtThisMonth'].diff().ne(0).cumsum()
df['Recency'] = (df.groupby(by=['CustomerID', contiguous_groups], as_index=False)
                   ['hasBoughtThisMonth'].cumsum().reset_index(drop=True))
The result is
CustomerID Month hasBoughtThisMonth Recency
0 1 1 1 1
1 1 2 1 2
2 1 3 0 0
3 1 4 0 0
4 1 5 1 1
5 1 6 1 2
6 1 7 1 3
7 1 8 1 4
8 1 9 1 5
9 1 10 0 0
10 1 11 0 0
11 1 12 1 1
12 2 1 0 0
13 2 2 1 1
14 2 3 1 2
15 2 4 0 0
16 2 5 0 0
17 2 6 1 1
18 2 7 0 0
19 2 8 0 0
20 2 9 0 0
21 2 10 1 1
22 2 11 0 0
23 2 12 0 0

It would be easier if you first set CustomerID as the index and transpose your dataframe, then apply your custom function to each column.
i.e. something like:
df.set_index('CustomerID').T.apply(custom_func)
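The answer leaves custom_func open; here is a sketch of what it might look like under that layout, where each column is one customer's monthly 0/1 series (custom_func is a hypothetical name, not defined in the answer):
import pandas as pd

def custom_func(col):
    # col holds one customer's 0/1 flags, oldest month first
    out, prev = [], 0
    for flag in col:
        prev = 0 if flag == 0 else prev + 1  # R(t) = 0 if bought, else R(t-1) + 1
        out.append(prev)
    return pd.Series(out, index=col.index)

result = df.set_index('CustomerID').T.apply(custom_func)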

Related

Python pandas How to pick up certain values by internal numbering?

I have a dataframe that looks like this:
Answers all_answers Score
0 0.0 0 72
1 0.0 0 73
2 0.0 0 74
3 1.0 1 1
4 -1.0 1 2
5 1.0 1 3
6 -1.0 1 4
7 1.0 1 5
8 0.0 0 1
9 0.0 0 2
10 -1.0 1 1
11 0.0 0 1
12 0.0 0 2
13 1.0 1 1
14 0.0 0 1
15 0.0 0 2
16 1.0 1 1
The first column is a signal that the sign has changed in the calculation flow.
The second is simply the first with the minus sign removed.
The third is a running count for the second column: how many ones or zeros have occurred in a row.
I want to add a fourth column that keeps only the runs of ones that are at least, say, 5 long, while preserving the sign from the first column.
To get something like this
Answers all_answers Score New
0 0.0 0 72 0
1 0.0 0 73 0
2 0.0 0 74 0
3 1.0 1 1 1
4 -1.0 1 2 -1
5 1.0 1 3 1
6 -1.0 1 4 -1
7 1.0 1 5 1
8 0.0 0 1 0
9 0.0 0 2 0
10 -1.0 1 1 0
11 0.0 0 1 0
12 0.0 0 2 0
13 1.0 1 1 0
14 0.0 0 1 0
15 0.0 0 2 0
16 1.0 1 1 0
17 0.0 0 1 0
Is it possible to do this with pandas?
You can use:
# group by consecutive 0/1
g = df['all_answers'].ne(df['all_answers'].shift()).cumsum()
# get size of each group and compare to threshold
m = df.groupby(g)['all_answers'].transform('size').ge(5)
# mask small groups
df['New'] = df['Answers'].where(m, 0)
Output:
Answers all_answers Score New
0 0.0 0 72 0.0
1 0.0 0 73 0.0
2 0.0 0 74 0.0
3 1.0 1 1 1.0
4 -1.0 1 2 -1.0
5 1.0 1 3 1.0
6 -1.0 1 4 -1.0
7 1.0 1 5 1.0
8 0.0 0 1 0.0
9 0.0 0 2 0.0
10 -1.0 1 1 0.0
11 0.0 0 1 0.0
12 0.0 0 2 0.0
13 1.0 1 1 0.0
14 0.0 0 1 0.0
15 0.0 0 2 0.0
16 1.0 1 1 0.0
A faster way (with regex):
import pandas as pd
import re

def repl5(m):
    # replace each run of five or more '1's with the same number of '5's
    return '5' * len(m.group())

s = df['all_answers'].astype(str).str.cat()  # flatten the column into one string
d = re.sub('(?:1{5,})', repl5, s)            # mark the qualifying runs
mask = [x == '5' for x in d]                 # boolean mask, one entry per row
df['New'] = df['Answers'].where(mask, 0.0)
df
Output:
Answers all_answers Score New
0 0.0 0 72 0.0
1 0.0 0 73 0.0
2 0.0 0 74 0.0
3 1.0 1 1 1.0
4 -1.0 1 2 -1.0
5 1.0 1 3 1.0
6 -1.0 1 4 -1.0
7 1.0 1 5 1.0
8 0.0 0 1 0.0
9 0.0 0 2 0.0
10 -1.0 1 1 0.0
11 0.0 0 1 0.0
12 0.0 0 2 0.0
13 1.0 1 1 0.0
14 0.0 0 1 0.0
15 0.0 0 2 0.0
16 1.0 1 1 0.0

Grouping by classes

I would like to see how many times a url is labelled with 1 and how many times it is labelled with 0.
My dataset is
Label URL
0 0.0 www.nytimes.com
1 0.0 newatlas.com
2 1.0 www.facebook.com
3 1.0 www.facebook.com
4 0.0 issuu.com
... ... ...
3572 0.0 www.businessinsider.com
3573 0.0 finance.yahoo.com
3574 0.0 www.cnbc.com
3575 0.0 www.ndtv.com
3576 0.0 www.baystatehealth.org
I tried df.groupby("URL")["Label"].count(), but it does not return the expected output, which is:
Label URL Freq
0 0.0 www.nytimes.com 1
0 1.0 www.nytimes.com 0
1 0.0 newatlas.com 1
1 1.0 newatlas.com 0
2 1.0 www.facebook.com 2
2 0.0 www.facebook.com 0
4 0.0 issuu.com 1
4 1.0 issuu.com 0
... ... ...
What fields should I use in the groupby to get something like the above df (expected output)?
You need unique combinations of URL and Label:
df.groupby(["URL", "Label"]).size()
Now you can also do value_counts:
df.value_counts(["URL", "Label"])
Or use agg:
df.groupby("URL").agg({'Label': lambda x: x.nunique()})

Find nearest neighbors

I have a large dataframe of the form:
user_id time_interval A B C D E F G H ... Z
0 12166 2.0 3.0 1.0 1.0 1.0 3.0 1.0 1.0 1.0 ... 0.0
1 12167 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0
2 12168 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0
3 12169 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0
4 12170 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 ... 0.0
... ... ... ... ... ... ... ... ... ... ... ... ...
I would like to find, for each user_id, based on the columns A-Z as coordinates, the closest neighbors within a 'radius' distance r. The output should look like the following, for example, for r=0.1:
user_id neighbors
12166 [12251,12345, ...]
12167 [12168, 12169,12170, ...]
... ...
I tried for-looping throughout the user_id list but it takes ages.
I did something like this:
from scipy.spatial import distance

neighbors = []
for i in range(len(dataframe)):
    user_neighbors = [dataframe["user_id"][j] for j in range(i + 1, len(dataframe))
                      if distance.euclidean(dataframe.values[i][2:], dataframe.values[j][2:]) < 0.1]
    neighbors.append([dataframe["user_id"][i], user_neighbors])
and I have been waiting for hours.
Is there a pythonic way to improve this?
Here's how I've done it using the apply method.
The dummy data consists of columns A-D, with an added column for the neighbors:
print(df)
user_id time_interval A B C D neighbors
0 12166 2 3 2 2 3 NaN
1 12167 0 1 4 3 3 NaN
2 12168 0 4 3 3 1 NaN
3 12169 0 2 2 3 2 NaN
4 12170 0 3 3 1 1 NaN
the custom function:
def func(row):
    r = 2.5  # the radius threshold
    # Euclidean distance from this row to every row, over the coordinate columns
    out = df[(((df.iloc[:, 2:-1] - row[2:-1])**2).sum(axis=1)**0.5).le(r)]['user_id'].to_list()
    out.remove(row['user_id'])  # a point is not its own neighbor
    df.loc[row.name, ['neighbors']] = str(out)

df.apply(func, axis=1)
the output:
print(df):
user_id time_interval A B C D neighbors
0 12166 2 3 2 2 3 [12169, 12170]
1 12167 0 1 4 3 3 [12169]
2 12168 0 4 3 3 1 [12169, 12170]
3 12169 0 2 2 3 2 [12166, 12167, 12168]
4 12170 0 3 3 1 1 [12166, 12168]
Let me know if it outperforms the for-loop approach.
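For large frames a spatial index will beat both approaches. A sketch (not from either answer) using scipy's cKDTree, assuming the coordinate columns are everything from the third column onward, as in the question's layout:
from scipy.spatial import cKDTree

coords = dataframe.iloc[:, 2:].to_numpy()    # columns A-Z as coordinates
tree = cKDTree(coords)
hits = tree.query_ball_point(coords, r=0.1)  # per row: indices of all rows within radius r
dataframe['neighbors'] = [
    dataframe['user_id'].iloc[[j for j in idx if j != i]].tolist()
    for i, idx in enumerate(hits)
]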

new Pandas Dataframe column calculated from other column values

How can I create a new column in a dataframe that consists of the MEAN of an indexed range of values in that row?
example:
1 2 3 JUNK
0 0.0 0.0 0.0 A
1 1.0 1.0 -1.0 B
2 2.0 2.0 1.0 C
the JUNK column would be ignored when trying to determine the MEAN column
expected output:
1 2 3 JUNK MEAN
0 0.0 0.0 0.0 A 0.0
1 1.0 1.0 -1.0 B 0.33
2 2.0 2.0 1.0 C 1.66
Use drop to remove, or iloc to filter out, the unnecessary column:
df['MEAN'] = df.drop('JUNK', axis=1).mean(axis=1)
df['MEAN'] = df.iloc[:, :-1].mean(axis=1)
print (df)
1 2 3 JUNK MEAN
0 0.0 0.0 0.0 A 0.000000
1 1.0 1.0 -1.0 B 0.333333
2 2.0 2.0 1.0 C 1.666667
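If the goal is to skip every non-numeric column rather than drop one known column, select_dtypes is an alternative (a one-line sketch):
df['MEAN'] = df.select_dtypes('number').mean(axis=1)  # averages only the numeric columns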

Pandas, create new columns based on existing with repeated count

It's a bit complicated to explain, so I'll do my best. I have a pandas dataframe with two columns: hour (from 1 to 24) and value (corresponding to each hour). The dataset index is huge, but the hour column repeats on a 24-hour basis (from 1 to 24). I am trying to create 24 new columns: value -1, value -2, value -3 ... value -24, which for each row hold the value from 1 hour earlier, 2 hours earlier, and so on (from the rows above).
hour | value | value -1 | value -2 | value -3| ... | value - 24
1 10 0 0 0 0
2 11 10 0 0 0
3 12 11 10 0 0
4 13 12 11 10 0
...
24 32 31 30 29 0
1 33 32 31 30 10
2 34 33 32 31 11
and so on...
All the value numbers are just for the example. As I said, there are lots of rows: not only the 24 hours of a single day, but the whole series repeating from 1 to 24, and so on.
Thanks in advance and may the force be with you!
Is this what you need?
df = pd.DataFrame([[1, 10], [2, 11],
                   [3, 12], [4, 13]], columns=['hour', 'value'])

for i in range(1, 25):  # one lag column per hour: value -1 through value -24
    df['value -' + str(i)] = df['value'].shift(i).fillna(0)
Is this what you are looking for?
import pandas as pd

df = pd.DataFrame({'hour': list(range(24))*2,
                   'value': list(range(48))})

shift_cols_n = 10
for shift in range(1, shift_cols_n):
    new_column_name = 'value - ' + str(shift)
    # Assuming that you don't have any NAs in your dataframe
    df[new_column_name] = df['value'].shift(shift).fillna(0)
    # A safer (and less simple) way, in case you have NAs in your dataframe
    df[new_column_name] = df['value'].shift(shift)
    df.loc[:shift - 1, new_column_name] = 0  # only the first `shift` rows lack an earlier value

print(df.head(9))
hour value value - 1 value - 2 value - 3 value - 4 value - 5 \
0 0 0 0.0 0.0 0.0 0.0 0.0
1 1 1 0.0 0.0 0.0 0.0 0.0
2 2 2 1.0 0.0 0.0 0.0 0.0
3 3 3 2.0 1.0 0.0 0.0 0.0
4 4 4 3.0 2.0 1.0 0.0 0.0
5 5 5 4.0 3.0 2.0 1.0 0.0
6 6 6 5.0 4.0 3.0 2.0 1.0
7 7 7 6.0 5.0 4.0 3.0 2.0
8 8 8 7.0 6.0 5.0 4.0 3.0
value - 6 value - 7 value - 8 value - 9
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0
6 0.0 0.0 0.0 0.0
7 1.0 0.0 0.0 0.0
8 2.0 1.0 0.0 0.0
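Both loops add columns one at a time; building all the lag columns in a single concat avoids repeated dataframe fragmentation. A sketch, assuming df has a 'value' column as above:
import pandas as pd

# build all 24 lag columns at once; the dict keys become the column names
lags = pd.concat({f'value -{i}': df['value'].shift(i).fillna(0) for i in range(1, 25)},
                 axis=1)
df = pd.concat([df, lags], axis=1)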
