Pandas MultiIndex DataFrame sort - python

First, here is my data:
In [14]: new_df
Out[14]:
action_type 1 2 3
user_id
0000110e00f7c85f550b329dc3d76210 31.0 4.0 0.0
00004931fe12d6f678f67e375b3806e3 8.0 4.0 0.0
0000c2b8660766ed74bafd48599255f0 0.0 2.0 0.0
0000d8d4ea411b05e0392be855fe9756 19.0 0.0 3.0
ffff18540a9567b455bd5645873e56d5 1.0 0.0 0.0
ffff3c8cf716efa3ae6d3ecfedb2270b 58.0 2.0 0.0
ffffa5fe57d2ef322061513bf60362ff 0.0 2.0 0.0
ffffce218e2b4af7729a4737b8702950 1.0 0.0 0.0
ffffd17a96348904fe49216ba3c7006f 1.0 0.0 0.0
[9 rows x 3 columns]
In [15]: new_df.columns
Out[15]: Int64Index([1, 2, 3], dtype='int64', name=u'action_type')
In [16]: new_df.index
Out[16]:
Index([u'0000110e00f7c85f550b329dc3d76210',
u'00004931fe12d6f678f67e375b3806e3',
...
u'ffffa5fe57d2ef322061513bf60362ff',
u'ffffce218e2b4af7729a4737b8702950',
u'ffffd17a96348904fe49216ba3c7006f'],
dtype='object', name=u'user_id', length=9)
The output that I want is:
# sort by the action_type value 1
action_type 1 2 3
user_id
ffff3c8cf716efa3ae6d3ecfedb2270b 58.0 2.0 0.0
0000110e00f7c85f550b329dc3d76210 31.0 4.0 0.0
0000d8d4ea411b05e0392be855fe9756 19.0 0.0 3.0
00004931fe12d6f678f67e375b3806e3 8.0 4.0 0.0
ffff18540a9567b455bd5645873e56d5 1.0 0.0 0.0
ffffce218e2b4af7729a4737b8702950 1.0 0.0 0.0
ffffd17a96348904fe49216ba3c7006f 1.0 0.0 0.0
0000c2b8660766ed74bafd48599255f0 0.0 2.0 0.0
ffffa5fe57d2ef322061513bf60362ff 0.0 2.0 0.0
[9 rows x 3 columns]
# sort by the action_type value 2
action_type 1 2 3
user_id
00004931fe12d6f678f67e375b3806e3 8.0 4.0 0.0
0000110e00f7c85f550b329dc3d76210 31.0 4.0 0.0
ffff3c8cf716efa3ae6d3ecfedb2270b 58.0 2.0 0.0
0000c2b8660766ed74bafd48599255f0 0.0 2.0 0.0
ffffa5fe57d2ef322061513bf60362ff 0.0 2.0 0.0
0000d8d4ea411b05e0392be855fe9756 19.0 0.0 3.0
ffff18540a9567b455bd5645873e56d5 1.0 0.0 0.0
ffffce218e2b4af7729a4737b8702950 1.0 0.0 0.0
ffffd17a96348904fe49216ba3c7006f 1.0 0.0 0.0
[9 rows x 3 columns]
So, what I want to do is sort the DataFrame by a single action_type column (1, 2 or 3), or by the sum of any combination of them (1+2, 1+3, 2+3 or 1+2+3). In other words, the output should be sorted by each user's value for one action_type, or by each user's sum over a chosen set of action_types.
For example:
for user_id 0000110e00f7c85f550b329dc3d76210, the value of action_type 1 is 31.0, the value of action_type 2 is 4.0 and the value of action_type 3 is 0.0. The sum of action_type 1 and action_type 2 for this user is 31.0 + 4.0 = 35.0
I have tried new_df.sortlevel(), but it seems to sort the DataFrame by the user_id index, not by the action_type values (1, 2, 3).
How can I do this? Thank you!

UPDATE:
If you want to sort by column values, just try sort_values:
df.sort_values(column_names)
Example:
In [173]: df
Out[173]:
1 2 3
0 6 3 8
1 0 8 0
2 3 8 0
3 5 2 7
4 1 2 1
sort descending by column 2
In [174]: df.sort_values(by=2, ascending=False)
Out[174]:
1 2 3
1 0 8 0
2 3 8 0
0 6 3 8
3 5 2 7
4 1 2 1
sort descending by sum of columns 2+3
In [177]: df.assign(sum=df.loc[:,[2,3]].sum(axis=1)).sort_values('sum', ascending=False)
Out[177]:
1 2 3 sum
0 6 3 8 11
3 5 2 7 9
1 0 8 0 8
2 3 8 0 8
4 1 2 1 3
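If you would rather not keep a helper column, the same sort can be done through the index order of the summed Series. A minimal sketch on a toy frame with the same values as above:

```python
import pandas as pd

# toy frame mirroring the example above
df = pd.DataFrame({1: [6, 0, 3, 5, 1],
                   2: [3, 8, 8, 2, 2],
                   3: [8, 0, 0, 7, 1]})

# order rows by the sum of columns 2 and 3, descending, without adding a column
order = df[[2, 3]].sum(axis=1).sort_values(ascending=False).index
result = df.loc[order]
```

The frame itself is untouched; only the row order changes, so no cleanup step is needed afterwards.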
OLD answer:
If I got you right, you can do it this way:
In [107]: df
Out[107]:
a b c
0 9 1 4
1 0 5 7
2 5 9 8
3 3 9 7
4 1 2 5
In [108]: df.assign(sum=df.sum(axis=1)).sort_values('sum', ascending=True)
Out[108]:
a b c sum
4 1 2 5 8
1 0 5 7 12
0 9 1 4 14
3 3 9 7 19
2 5 9 8 22

Related

Python pandas How to pick up certain values by internal numbering?

I have a dataframe that looks like this:
Answers all_answers Score
0 0.0 0 72
1 0.0 0 73
2 0.0 0 74
3 1.0 1 1
4 -1.0 1 2
5 1.0 1 3
6 -1.0 1 4
7 1.0 1 5
8 0.0 0 1
9 0.0 0 2
10 -1.0 1 1
11 0.0 0 1
12 0.0 0 2
13 1.0 1 1
14 0.0 0 1
15 0.0 0 2
16 1.0 1 1
The first column is a signal that the sign has changed in the calculation flow.
The second is the first column with the minus sign removed.
The third is a running count for the second column: how many consecutive ones or zeros there have been.
I want to add a fourth column that keeps only the ones that occur in a run of, for example, at least 5, while preserving the sign from the first column.
To get something like this
Answers all_answers Score New
0 0.0 0 72 0
1 0.0 0 73 0
2 0.0 0 74 0
3 1.0 1 1 1
4 -1.0 1 2 -1
5 1.0 1 3 1
6 -1.0 1 4 -1
7 1.0 1 5 1
8 0.0 0 1 0
9 0.0 0 2 0
10 -1.0 1 1 0
11 0.0 0 1 0
12 0.0 0 2 0
13 1.0 1 1 0
14 0.0 0 1 0
15 0.0 0 2 0
16 1.0 1 1 0
17 0.0 0 1 0
Is it possible to do this by Pandas ?
You can use:
# group by consecutive 0/1
g = df['all_answers'].ne(df['all_answers'].shift()).cumsum()
# get size of each group and compare to threshold
m = df.groupby(g)['all_answers'].transform('size').ge(5)
# mask small groups
df['New'] = df['Answers'].where(m, 0)
Output:
Answers all_answers Score New
0 0.0 0 72 0.0
1 0.0 0 73 0.0
2 0.0 0 74 0.0
3 1.0 1 1 1.0
4 -1.0 1 2 -1.0
5 1.0 1 3 1.0
6 -1.0 1 4 -1.0
7 1.0 1 5 1.0
8 0.0 0 1 0.0
9 0.0 0 2 0.0
10 -1.0 1 1 0.0
11 0.0 0 1 0.0
12 0.0 0 2 0.0
13 1.0 1 1 0.0
14 0.0 0 1 0.0
15 0.0 0 2 0.0
16 1.0 1 1 0.0
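The shift-compare-cumsum pattern above is a general run-length trick. A self-contained sketch on a toy 0/1 series (the threshold of 5 mirrors the question):

```python
import pandas as pd

s = pd.Series([0, 0, 1, 1, 1, 1, 1, 0, 1, 1])

# label each run of identical consecutive values with its own id
group_id = s.ne(s.shift()).cumsum()

# size of the run each element belongs to
run_len = s.groupby(group_id).transform('size')

# keep values only where the run is at least 5 long, zero out the rest
kept = s.where(run_len.ge(5), 0)
```

Here only the middle run of five ones survives; the trailing run of two ones is zeroed out.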
A faster way (with regex):
import pandas as pd
import re

def repl5(m):
    return '5' * len(m.group())

s = df['all_answers'].astype(str).str.cat()
d = re.sub('(?:1{5,})', repl5, s)
mask = [x == '5' for x in d]
df['New'] = df['Answers'].where(mask, 0.0)
df
Output:
Answers all_answers Score New
0 0.0 0 72 0.0
1 0.0 0 73 0.0
2 0.0 0 74 0.0
3 1.0 1 1 1.0
4 -1.0 1 2 -1.0
5 1.0 1 3 1.0
6 -1.0 1 4 -1.0
7 1.0 1 5 1.0
8 0.0 0 1 0.0
9 0.0 0 2 0.0
10 -1.0 1 1 0.0
11 0.0 0 1 0.0
12 0.0 0 2 0.0
13 1.0 1 1 0.0
14 0.0 0 1 0.0
15 0.0 0 2 0.0
16 1.0 1 1 0.0

pandas Dataframe: Subtract a groupby mean of subset data from the full original data

I would like to subtract [a groupby mean of subset] from the [original] dataframe:
I have a pandas DataFrame data whose index is in datetime object (monthly, say 100 years = 100yr*12mn) and 10 columns of station IDs. (i.e., 1200 row * 10 col pd.Dataframe)
1)
I would like to first take a subset of above data, e.g. top 50 years (i.e., 50yr*12mn),
data_sub = data_org[data_org.index.year <= top_50_year]
and calculate monthly mean for each month for each stations (columns). e.g.,
mean_sub = data_sub.groupby(data_sub.index.month).mean()
or
mean_sub = data_sub.groupby(data_sub.index.month).transform('mean')
which seem to do the job.
2)
Now I want to subtract above from the [original] NOT from the [subset], e.g.,
data_org - mean_sub
which I do not know how to. So in summary, I would like to calculate monthly mean from a subset of the original data (e.g., only using 50 years), and subtract that monthly mean from the original data month by month.
It was easy to subtract if I were using the full [original] data to calculate the mean (i.e., .transform('mean') or .apply(lambda x: x - x.mean()) do the job), but what should I do if the mean is calculated from a [subset] data?
Could you share your insight for this problem? Thank you in advance!
@mozway The input (and also the output) shape looks like the following:
[image: input shape with random values]
Only the output values are the anomalies from the [subset]'s monthly mean. Thank you.
One idea is to replace non-matching rows with NaN using DataFrame.where; GroupBy.transform then yields the same index as the original DataFrame, so the subtraction is possible:
np.random.seed(123)
data_org = pd.DataFrame(np.random.randint(10, size=(10, 3)),
                        index=pd.date_range('2000-01-01', periods=10, freq='3M'))
print (data_org)
0 1 2
2000-01-31 2 2 6
2000-04-30 1 3 9
2000-07-31 6 1 0
2000-10-31 1 9 0
2001-01-31 0 9 3
2001-04-30 4 0 0
2001-07-31 4 1 7
2001-10-31 3 2 4
2002-01-31 7 2 4
2002-04-30 8 0 7
top_50_year = 2000
data1 = data_org.where(data_org.index.to_series().dt.year <= top_50_year)
print (data1)
0 1 2
2000-01-31 2.0 2.0 6.0
2000-04-30 1.0 3.0 9.0
2000-07-31 6.0 1.0 0.0
2000-10-31 1.0 9.0 0.0
2001-01-31 NaN NaN NaN
2001-04-30 NaN NaN NaN
2001-07-31 NaN NaN NaN
2001-10-31 NaN NaN NaN
2002-01-31 NaN NaN NaN
2002-04-30 NaN NaN NaN
mean_data1 = data1.groupby(data1.index.month).transform('mean')
print (mean_data1)
0 1 2
2000-01-31 2.0 2.0 6.0
2000-04-30 1.0 3.0 9.0
2000-07-31 6.0 1.0 0.0
2000-10-31 1.0 9.0 0.0
2001-01-31 2.0 2.0 6.0
2001-04-30 1.0 3.0 9.0
2001-07-31 6.0 1.0 0.0
2001-10-31 1.0 9.0 0.0
2002-01-31 2.0 2.0 6.0
2002-04-30 1.0 3.0 9.0
df = data_org - mean_data1
print (df)
0 1 2
2000-01-31 0.0 0.0 0.0
2000-04-30 0.0 0.0 0.0
2000-07-31 0.0 0.0 0.0
2000-10-31 0.0 0.0 0.0
2001-01-31 -2.0 7.0 -3.0
2001-04-30 3.0 -3.0 -9.0
2001-07-31 -2.0 0.0 7.0
2001-10-31 2.0 -7.0 4.0
2002-01-31 5.0 0.0 -2.0
2002-04-30 7.0 -3.0 -2.0
Another idea with filtering:
np.random.seed(123)
data_org = pd.DataFrame(np.random.randint(10, size=(10, 3)),
                        index=pd.date_range('2000-01-01', periods=10, freq='3M'))
print (data_org)
0 1 2
2000-01-31 2 2 6
2000-04-30 1 3 9
2000-07-31 6 1 0
2000-10-31 1 9 0
2001-01-31 0 9 3
2001-04-30 4 0 0
2001-07-31 4 1 7
2001-10-31 3 2 4
2002-01-31 7 2 4
2002-04-30 8 0 7
top_50_year = 2000
data_sub = data_org[data_org.index.year <= top_50_year]
print (data_sub)
0 1 2
2000-01-31 2 2 6
2000-04-30 1 3 9
2000-07-31 6 1 0
2000-10-31 1 9 0
mean_sub = data_sub.groupby(data_sub.index.month).mean()
print (mean_sub)
0 1 2
1 2 2 6
4 1 3 9
7 6 1 0
10 1 9 0
Create a new column m for the months:
data_org['m'] = data_org.index.month
print (data_org)
0 1 2 m
2000-01-31 2 2 6 1
2000-04-30 1 3 9 4
2000-07-31 6 1 0 7
2000-10-31 1 9 0 10
2001-01-31 0 9 3 1
2001-04-30 4 0 0 4
2001-07-31 4 1 7 7
2001-10-31 3 2 4 10
2002-01-31 7 2 4 1
2002-04-30 8 0 7 4
Then mean_sub is merged onto this column by DataFrame.join:
mean_data1 = data_org[['m']].join(mean_sub, on='m')
print (mean_data1)
m 0 1 2
2000-01-31 1 2 2 6
2000-04-30 4 1 3 9
2000-07-31 7 6 1 0
2000-10-31 10 1 9 0
2001-01-31 1 2 2 6
2001-04-30 4 1 3 9
2001-07-31 7 6 1 0
2001-10-31 10 1 9 0
2002-01-31 1 2 2 6
2002-04-30 4 1 3 9
df = data_org - mean_data1
print (df)
0 1 2 m
2000-01-31 0 0 0 0
2000-04-30 0 0 0 0
2000-07-31 0 0 0 0
2000-10-31 0 0 0 0
2001-01-31 -2 7 -3 0
2001-04-30 3 -3 -9 0
2001-07-31 -2 0 7 0
2001-10-31 2 -7 4 0
2002-01-31 5 0 -2 0
2002-04-30 7 -3 -2 0
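A third route, sketched here with the same seeded data_org and mean_sub as above, is to broadcast mean_sub back onto the full index with reindex. This avoids both the NaN masking and the helper month column:

```python
import numpy as np
import pandas as pd

np.random.seed(123)
data_org = pd.DataFrame(np.random.randint(10, size=(10, 3)),
                        index=pd.date_range('2000-01-01', periods=10, freq='3M'))

# monthly mean computed from the subset only (index of mean_sub is the month number)
data_sub = data_org[data_org.index.year <= 2000]
mean_sub = data_sub.groupby(data_sub.index.month).mean()

# expand the per-month means to one row per original timestamp, then subtract
expanded = mean_sub.reindex(data_org.index.month).set_axis(data_org.index)
anomalies = data_org - expanded
```

reindex with the duplicated month labels repeats each monthly mean row as often as needed, and set_axis restores the datetime index so the subtraction aligns row by row.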

Pandas apply function to column taking the value of previous column

I have to create a timeseries using column values for computing the Recency of a customer.
The formula I have to use is R(t) = 0 if the customer has bought something in that month, R(t-1) + 1 otherwise.
I managed to compute a dataframe
CustomerID -1 0 1 2 3 4 5 6 7 8 9 10 11 12
0 17850 0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1 13047 0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 1.0
2 12583 0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 14688 0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
4 15311 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3750 15471 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3751 13436 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3752 15520 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3753 14569 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3754 12713 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
In which there is a 0 if the customer bought something in that month and a 1 otherwise. The column names indicate time periods, with column "-1" as a dummy column.
How can I replace the value in each column with 0 if the current value is 0 and with the value of the previous column + 1 otherwise?
For example, the final result for the second customer should be 0 1 0 0 1 0 0 1 0 1 0 1 2
I know how to apply a function to a column, but I don't know how to make that function use the value from the previous column.
You can use the apply function to iterate through the rows of the dataframe. Note that the running count has to be carried along explicitly, since each new value depends on the previously computed value, not on the raw previous cell:
def apply_function(row):
    out, prev = [], 0
    for i, item in enumerate(row):
        # keep the first (dummy) column as-is; otherwise 0 on a purchase month, previous count + 1 otherwise
        prev = item if i == 0 else (0 if item == 0 else prev + 1)
        out.append(prev)
    return out

new_df = df.apply(apply_function, axis=1, result_type='expand')
new_df.columns = df.columns  # just to restore the previous column names
Do you insist on using the column structure? It is common with time series to use rows, e.g., a dataframe with columns CustomerID, hasBoughtThisMonth. You can then easily add the Recency column by using a pandas transform().
I cannot yet place comments hence the question in this way.
Edit: here is another way to go by. I took two customers as an example, and some random numbers of whether or not they bought something in a month.
Basically, you pivot your table, and use a groupby+cumsum to get your result. Notice that I avoid your dummy column in this way.
import pandas as pd
import numpy as np

np.random.seed(1)
# Make an example dataframe
df = pd.DataFrame({'CustomerID': [1]*12 + [2]*12,
                   'Month': [1,2,3,4,5,6,7,8,9,10,11,12]*2,
                   'hasBoughtThisMonth': np.random.randint(2, size=24)})

# Make the Recency column by finding contiguous groups of ones, then groupby + cumsum
contiguous_groups = df['hasBoughtThisMonth'].diff().ne(0).cumsum()
df['Recency'] = df.groupby(by=['CustomerID', contiguous_groups],
                           as_index=False)['hasBoughtThisMonth'].cumsum().reset_index(drop=True)
The result is
CustomerID Month hasBoughtThisMonth Recency
0 1 1 1 1
1 1 2 1 2
2 1 3 0 0
3 1 4 0 0
4 1 5 1 1
5 1 6 1 2
6 1 7 1 3
7 1 8 1 4
8 1 9 1 5
9 1 10 0 0
10 1 11 0 0
11 1 12 1 1
12 2 1 0 0
13 2 2 1 1
14 2 3 1 2
15 2 4 0 0
16 2 5 0 0
17 2 6 1 1
18 2 7 0 0
19 2 8 0 0
20 2 9 0 0
21 2 10 1 1
22 2 11 0 0
23 2 12 0 0
It would be easier if you first set CustomerID as the index and transpose your dataframe, then apply your custom function, i.e. something like:
df.T.apply(custom_func)
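That hint can be sketched concretely. Assuming a toy frame shaped like the question's (customers as rows, months as columns, 1 meaning no purchase that month; the recency helper below is hypothetical), transposing makes each customer a column, so one column-wise function can carry the running count:

```python
import pandas as pd

# toy frame: rows are customers, columns are months (1 = no purchase that month)
df = pd.DataFrame([[0, 1, 0, 0, 1, 1, 1],
                   [1, 1, 1, 0, 1, 0, 0]],
                  index=['cust_a', 'cust_b'])

def recency(col):
    # R(t) = 0 on a purchase month, R(t-1) + 1 otherwise
    out, prev = [], 0
    for v in col:
        prev = 0 if v == 0 else prev + 1
        out.append(prev)
    return pd.Series(out, index=col.index)

# transpose so customers become columns, apply per column, transpose back
result = df.T.apply(recency).T
```

The double transpose keeps the original customer-per-row layout in the result.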

Sum of dataframe columns to another dataframe column Python gives NaN

I want to sum the rows and columns of two dataframes (pdf and wdf) and save the results in the columns of another dataframe (to_hex).
I tried it for one dataframe and it worked; it doesn't work for the other (it gives NaN), and I cannot see what the difference is.
to_hex = pd.DataFrame(0, index=np.arange(len(sasiedztwo)), columns=['ID', 'podroze', 'p_rozmyte'])
to_hex.loc[:, 'ID'] = wdf.index + 1
to_hex.index = pdf.index
to_hex.loc[:, 'podroze'] = pd.DataFrame(pdf.sum(axis=0))[:]
to_hex.index = wdf.index
to_hex.loc[:, 'p_rozmyte'] = pd.DataFrame(wdf.sum(axis=0))[:]
This is how pdf dataframe looks like:
0 1 2 3 4 5 6 7 8
0 0 0 10 0 0 0 0 0 100
1 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 1000
8 0 0 0 0 0 0 0 0 0
This is wdf:
0 1 2 3 4 5 6 7 8
0 2.5 5.0 35.0 0.0 27.5 55.0 25.0 50.0 102.5
1 0.0 0.0 5.0 0.0 0.0 0.0 0.0 0.0 300.0
2 0.0 0.0 2.5 0.0 0.0 0.0 0.0 0.0 25.0
3 0.0 0.0 5.0 0.0 0.0 0.0 0.0 0.0 300.0
4 0.0 0.0 2.5 0.0 0.0 0.0 0.0 0.0 525.0
5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 250.0
6 0.0 0.0 2.5 0.0 0.0 0.0 0.0 0.0 525.0
7 0.0 0.0 250.0 0.0 250.0 500.0 250.0 500.0 1000.0
8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 500.0
And this is the result in to_hex:
ID podroze p_rozmyte
0 1 0 NaN
1 2 0 NaN
2 3 10 NaN
3 4 0 NaN
4 5 0 NaN
5 6 0 NaN
6 7 0 NaN
7 8 0 NaN
8 9 1100 NaN
SOLUTION:
One option is to modify your code as follows:
to_hex.loc[:, 'ID'] = wdf.index + 1
# to_hex.index = pdf.index  # no need
to_hex.loc[:, 'podroze'] = pdf.sum(axis=0)    # modified; directly use the Series output from sum()
# to_hex.index = wdf.index  # no need
to_hex.loc[:, 'p_rozmyte'] = wdf.sum(axis=0)  # modified
Then you get:
ID podroze p_rozmyte
0 1 0 2.5
1 2 0 5.0
2 3 10 302.5
3 4 0 0.0
4 5 0 277.5
5 6 0 555.0
6 7 0 275.0
7 8 0 550.0
8 9 1100 3527.5
I think the reason you get NaN in one case and correct values in the other lies in to_hex.dtypes:
ID int64
podroze int64
p_rozmyte int64
dtype: object
And as you can see, the to_hex dataframe has int64 column types. This is fine when you assign the pdf sums (since they have the same dtype):
pd.DataFrame(pdf.sum(axis=0))[:].dtypes
0 int64
dtype: object
but does not work when you add wdf:
pd.DataFrame(wdf.sum(axis=0))[:].dtypes
0 float64
dtype: object
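Equivalently, the whole frame can be built in one constructor, so each column keeps its natural dtype from the start and no pre-allocated int64 columns get in the way. A minimal sketch with small stand-in frames for pdf and wdf:

```python
import pandas as pd

# small stand-ins for the question's pdf (integer counts) and wdf (float weights)
pdf = pd.DataFrame([[0, 10], [0, 1000]])
wdf = pd.DataFrame([[2.5, 35.0], [0.0, 250.0]])

# column sums go in as Series, so each to_hex column gets its own dtype
to_hex = pd.DataFrame({'ID': pdf.columns + 1,
                       'podroze': pdf.sum(axis=0),
                       'p_rozmyte': wdf.sum(axis=0)})
```

Because 'p_rozmyte' is created directly from the float Series, it comes out as float64 rather than being coerced into a pre-existing int64 column.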

Computing the difference between first and last values in a rolling window

I am using the Pandas rolling window tool on a one-column dataframe whose index is in datetime form.
I would like to compute, for each window, the difference between the first value and the last value of said window. How do I refer to the relative index when giving a lambda function? (in the brackets below)
df2 = df.rolling('3s').apply(...)
IIUC:
In [93]: df = pd.DataFrame(np.random.randint(10,size=(9, 3)))
In [94]: df
Out[94]:
0 1 2
0 7 4 5
1 9 9 3
2 1 7 6
3 0 9 2
4 2 3 7
5 6 7 1
6 1 0 1
7 8 4 7
8 0 0 9
In [95]: df.rolling(window=3).apply(lambda x: x[0]-x[-1])
Out[95]:
0 1 2
0 NaN NaN NaN
1 NaN NaN NaN
2 6.0 -3.0 -1.0
3 9.0 0.0 1.0
4 -1.0 4.0 -1.0
5 -6.0 2.0 1.0
6 1.0 3.0 6.0
7 -2.0 3.0 -6.0
8 1.0 0.0 -8.0
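One caveat: in current pandas, the window is passed to the function as a labeled Series by default, where x[0] and x[-1] act as label lookups rather than positions. Passing raw=True hands the function a plain NumPy array instead, so positional indexing is guaranteed (and usually faster). A sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [7, 9, 1, 0, 2]})

# first minus last value of each 3-row window; raw=True passes a numpy array
diff = df.rolling(window=3).apply(lambda x: x[0] - x[-1], raw=True)
```

The first two rows are NaN because a full 3-row window is not yet available there.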
