Compute a new column as delta to another value in pandas dataframe - python

I have this data frame:
rank cost brand city
0 1 1 a x
1 2 2 a x
2 3 3 a x
3 4 4 a x
4 5 5 a x
5 1 2 b y
6 2 4 b y
7 3 6 b y
8 4 8 b y
9 5 10 b y
I want to create a new column 'delta' which contains the cost difference compared to rank 1 for a certain brand-city combination.
Desired outcome:
rank cost brand city delta
0 1 1 a x 0
1 2 2 a x 1
2 3 3 a x 2
3 4 4 a x 3
4 5 5 a x 4
5 1 2 b y 0
6 2 4 b y 2
7 3 6 b y 4
8 4 8 b y 6
9 5 10 b y 8
This answer gave me some hints, but I am stuck on the fact that I cannot map a series to a multi-index.
To save on typing, here is some code:
import pandas as pd

data = {'rank': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
        'cost': [1, 2, 3, 4, 5, 2, 4, 6, 8, 10],
        'brand': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b'],
        'city': ['x', 'x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'y'],
        # desired result, for checking:
        'delta': ['0', '1', '2', '3', '4', '0', '2', '4', '6', '8']}

df = pd.DataFrame(data).drop(columns='delta')
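Before the answers, a direct sketch of the mapping the question is stuck on (my sketch, not one of the answers below): build a Series of rank-1 costs keyed by (brand, city) and let index alignment do the lookup.
# rank-1 cost per (brand, city), as a MultiIndex-ed Series
base = df.loc[df['rank'].eq(1)].set_index(['brand', 'city'])['cost']
# alignment on the shared MultiIndex broadcasts base across each group's rows
df['delta'] = (df.set_index(['brand', 'city'])['cost'] - base).values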

This is transform + first:
df['delta'] = df['cost'] - df.groupby(['brand', 'city'])['cost'].transform('first')
df
Out[291]:
rank cost brand city delta
0 1 1 a x 0
1 2 2 a x 1
2 3 3 a x 2
3 4 4 a x 3
4 5 5 a x 4
5 1 2 b y 0
6 2 4 b y 2
7 3 6 b y 4
8 4 8 b y 6
9 5 10 b y 8
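Note that transform('first') uses whatever row happens to come first in each group, so it matches rank 1 only because the frame is already sorted by rank within each brand-city pair. A variant sketch keyed on rank == 1 explicitly (assuming exactly one rank-1 row per group):
# keep only the rank-1 cost (NaN elsewhere), then broadcast it per group;
# 'first' skips NaN, so every row gets its group's rank-1 cost
base = df['cost'].where(df['rank'].eq(1)).groupby([df['brand'], df['city']]).transform('first')
df['delta'] = df['cost'] - base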

Use groupby with apply:
df['delta'] = (df.groupby(['brand', 'city'], group_keys=False)
                 .apply(lambda x: x['cost'] - x[x['rank'].eq(1)]['cost'].values[0]))
df
rank cost brand city delta
0 1 1 a x 0
1 2 2 a x 1
2 3 3 a x 2
3 4 4 a x 3
4 5 5 a x 4
5 1 2 b y 0
6 2 4 b y 2
7 3 6 b y 4
8 4 8 b y 6
9 5 10 b y 8

A solution without groupby: it sorts by rank, then uses pd.merge_ordered and assign to create the delta column.
In [1077]: (pd.merge_ordered(df.sort_values(['brand', 'city', 'rank']),
                             df.query('rank == 1'),
                             how='left', on=['brand', 'city', 'rank'],
                             fill_method='ffill')
              .assign(delta=lambda x: x.cost_x - x.cost_y)
              .drop(columns='cost_y'))
Out[1077]:
brand city cost_x rank delta
0 a x 1 1 0
1 a x 2 2 1
2 a x 3 3 2
3 a x 4 4 3
4 a x 5 5 4
5 b y 2 1 0
6 b y 4 2 2
7 b y 6 3 4
8 b y 8 4 6
9 b y 10 5 8
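An equivalent and perhaps more transparent sketch with a plain merge (again assuming exactly one rank-1 row per brand-city pair):
# look up each group's rank-1 cost and join it back on the group keys
base = df.loc[df['rank'].eq(1), ['brand', 'city', 'cost']].rename(columns={'cost': 'base_cost'})
out = df.merge(base, on=['brand', 'city'])
out['delta'] = out['cost'] - out['base_cost']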

Related

How to compute nested group proportions using python pandas without losing original row count

Given the following data:
df = pd.DataFrame({
'where': ['a','a','a','a','a','a'] + ['b','b','b','b','b','b'],
'what': ['x','y','z','x','y','z'] + ['x','y','z','x','y','z'],
'val' : [1,3,2,5,4,3] + [5,6,3,4,5,3]
})
Which looks like this:
where what val
0 a x 1
1 a y 3
2 a z 2
3 a x 5
4 a y 4
5 a z 3
6 b x 5
7 b y 6
8 b z 3
9 b x 4
10 b y 5
11 b z 3
I would like to compute the proportion of what in where, and create a new
column that represents this.
The column will have duplicates. If I consider what = x in the above and
add that column in, then the data would be as follows:
where what val what_where_prop
0 a x 1 6/18
1 a y 3
2 a z 2
3 a x 5 6/18
4 a y 4
5 a z 3
6 b x 5 9/26
7 b y 6
8 b z 3
9 b x 4 9/26
10 b y 5
11 b z 3
Here 6/18 is computed by taking the total of x in a (6 = 1 + 5) over the total of val in a (18 = 1 + 3 + 2 + 5 + 4 + 3). The same process gives 9/26.
The full solution will be filled similarly for y and z in the final column.
IIUC,
df['what_where_group'] = (df.groupby(['where', 'what'], as_index=False)['val']
                            .transform('sum')
                            .div(df.groupby('where')['val'].transform('sum'),
                                 axis=0))
df
Output:
   where what  val  what_where_group
0      a    x    1          0.333333
1      a    y    3          0.388889
2      a    z    2          0.277778
3      a    x    5          0.333333
4      a    y    4          0.388889
5      a    z    3          0.277778
6      b    x    5          0.346154
7      b    y    6          0.423077
8      b    z    3          0.230769
9      b    x    4          0.346154
10     b    y    5          0.423077
11     b    z    3          0.230769
Details:
First, group by the two keys where and what; with as_index=False the group keys are not set as the index. Transform with sum to broadcast each (where, what) total back to its rows. Next, group by where alone and transform with sum. Lastly, divide the first result by the second using div, row-wise (axis=0).
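Spelled out as separate steps, the same computation is (a sketch):
num = df.groupby(['where', 'what'])['val'].transform('sum')  # e.g. 6 on every (a, x) row
den = df.groupby('where')['val'].transform('sum')            # e.g. 18 on every 'a' row
df['what_where_group'] = num / den                           # 6/18 = 0.333...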
Another way:
g = df.set_index(['where', 'what'])['val']
num = g.groupby(level=[0, 1]).sum()
denom = g.groupby(level=0).sum()
ww_group = num.div(denom, level=0).rename('what_where_group')
df.merge(ww_group, left_on=['where', 'what'], right_index=True)
Output:
   where what  val  what_where_group
0      a    x    1          0.333333
3      a    x    5          0.333333
1      a    y    3          0.388889
4      a    y    4          0.388889
2      a    z    2          0.277778
5      a    z    3          0.277778
6      b    x    5          0.346154
9      b    x    4          0.346154
7      b    y    6          0.423077
10     b    y    5          0.423077
8      b    z    3          0.230769
11     b    z    3          0.230769
Details:
Basically the same as before, just in explicit steps; the merge then broadcasts each group's ratio back onto every matching row.

Concatenate columns to one column with conditions pandas

I would like to concatenate all the columns of the dataset:
df = pd.DataFrame([['0987', 4, 'j'], ['9', 4, 'y'], ['9', 6, 't'], ['4', '', 'o'], ['', 9, 'o']],
columns=['col_a', 'col_b', 'col_c'])
In [1]:
col_a col_b col_c
0 0987 4 j
1 9 4 y
2 9 6 t
3 4 o
4 9 o
Into one column, with two added conditions. The first is that all empty or null entries must be removed, not added to the new set. The second is that an entry in the new column (col_new) gets a label of 1 if it comes from col_a or col_c, and 0 otherwise.
So I would like the result to look like this:
col_new label
0 0987 1
1 9 1
2 9 1
3 4 1
4 4 0
5 4 0
6 6 0
7 9 0
8 j 1
9 y 1
10 t 1
11 o 1
12 o 1
Use DataFrame.melt; for the new label column use rename with a lambda function, and lastly filter rows with DataFrame.query:
df = (df.rename(columns=lambda x: 1 if x in ['col_a', 'col_c'] else 0)
        .melt(var_name='label', value_name='col_new')
        .query('col_new != ""')[['col_new', 'label']])
print (df)
col_new label
0 0987 1
1 9 1
2 9 1
3 4 1
5 4 0
6 4 0
7 6 0
9 9 0
10 j 1
11 y 1
12 t 1
13 o 1
14 o 1
If there are missing values:
import numpy as np

df = pd.DataFrame([['0987', 4, 'j'], ['9', 4, 'y'], ['9', 6, 't'],
                   ['4', np.nan, 'o'], [np.nan, 9, 'o']],
                  columns=['col_a', 'col_b', 'col_c'])
df = (df.rename(columns=lambda x: 1 if x in ['col_a', 'col_c'] else 0)
        .melt(var_name='label', value_name='col_new')
        .query('col_new == col_new')[['col_new', 'label']])  # NaN != NaN, so this drops missing values
Or use DataFrame.dropna for filtering:
df = (df.rename(columns=lambda x: 1 if x in ['col_a', 'col_c'] else 0)
        .melt(var_name='label', value_name='col_new')[['col_new', 'label']])
df = df.dropna(subset=['col_new'])
print (df)
col_new label
0 0987 1
1 9 1
2 9 1
3 4 1
5 4 0
6 4 0
7 6 0
9 9 0
10 j 1
11 y 1
12 t 1
13 o 1
14 o 1
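For reference, a minimal sketch of the same result built with pd.concat instead of rename + melt, which avoids the duplicate 0/1 column labels the rename creates (starting again from the original df):
parts = []
for col in df.columns:
    part = df[[col]].rename(columns={col: 'col_new'})  # one column at a time
    part['label'] = 1 if col in ('col_a', 'col_c') else 0
    parts.append(part)
out = pd.concat(parts, ignore_index=True)
out = out[out['col_new'].notna() & out['col_new'].ne('')]  # drop NaN and empty strings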

Python dataframe, drop everything after a certain record

I have a dataframe like this
import pandas as pd
df = pd.DataFrame({'id' : [1, 1, 1, 1, 2, 2, 2, 3, 3, 3], \
'counter' : [1, 2, 3, 4, 1, 2, 3, 1, 2, 3], \
'status':['a', 'b', 'b' ,'c', 'a', 'a', 'a', 'a', 'a', 'b'], \
'additional_data' : [12,35,13,523,6,12,6,1,46,236]}, \
columns=['id', 'counter', 'status', 'additional_data'])
df
Out[37]:
id counter status additional_data
0 1 1 a 12
1 1 2 b 35
2 1 3 b 13
3 1 4 c 523
4 2 1 a 6
5 2 2 a 12
6 2 3 a 6
7 3 1 a 1
8 3 2 a 46
9 3 3 b 236
The id column indicates which rows belong together, the counter indicates the order of the rows, and status is a special status code. Within each id, I want to drop all rows after the first occurrence of a row with status='b', keeping that first 'b' row.
Final output should look like this
id counter status additional_data
0 1 1 a 12
1 1 2 b 35
4 2 1 a 6
5 2 2 a 12
6 2 3 a 6
7 3 1 a 1
8 3 2 a 46
9 3 3 b 236
All help is, as always, greatly appreciated.
Use a custom function with idxmax to find the index of the first b row in each group; .loc slicing is label-inclusive, so that row is kept:
def f(x):
    m = x['status'].eq('b')
    if m.any():
        # slice up to and including the first 'b' row
        return x.loc[:m.idxmax()]
    return x
a = df.groupby('id', group_keys=False).apply(f)
print (a)
id counter status additional_data
0 1 1 a 12
1 1 2 b 35
4 2 1 a 6
5 2 2 a 12
6 2 3 a 6
7 3 1 a 1
8 3 2 a 46
9 3 3 b 236
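A groupby-apply-free sketch of the same idea, using a running count of b rows per id (my variable names, not from the answer above):
m = df['status'].eq('b')
seen = m.groupby(df['id']).cumsum()      # number of 'b' rows so far within each id
keep = (seen == 0) | (m & (seen == 1))   # rows before any 'b', plus the first 'b' itself
out = df[keep]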

Reordering columns in CSV

The question has been posted before, but the requirements were not properly conveyed. I have a CSV file with more than 1000 columns:
A B C D .... X Y Z
1 0 0.5 5 .... 1 7 6
2 0 0.6 4 .... 0 7 6
3 0 0.7 3 .... 1 7 6
4 0 0.8 2 .... 0 7 6
Here X, Y and Z are the 999th, 1000th and 1001st columns, and A, B, C, D are the 1st, 2nd, 3rd and 4th. I need to reorder the columns in such a way that it gives me the following.
D Y Z A B C ....X
5 7 6 1 0 0.5 ....1
4 7 6 2 0 0.6 ....0
3 7 6 3 0 0.7 ....1
2 7 6 4 0 0.8 ....0
That is, the 4th column becomes the 1st, the 1000th and 1001st columns become the 2nd and 3rd, and the other columns are shifted right accordingly.
So the question is how to reorder your columns in a custom way.
For example, suppose you have the following DF and you want to reorder the columns by position (indices):
5, 3, rest...
DF
In [82]: df
Out[82]:
A B C D E F G
0 1 0 0.5 5 1 7 6
1 2 0 0.6 4 0 7 6
2 3 0 0.7 3 1 7 6
3 4 0 0.8 2 0 7 6
columns
In [83]: cols = df.columns.tolist()
In [84]: cols
Out[84]: ['A', 'B', 'C', 'D', 'E', 'F', 'G']
reordered (note that each pop mutates the list in place, so the second index refers to the already-shortened list):
In [88]: cols = [cols.pop(5)] + [cols.pop(3)] + cols
In [89]: cols
Out[89]: ['F', 'D', 'A', 'B', 'C', 'E', 'G']
In [90]: df[cols]
Out[90]:
F D A B C E G
0 7 5 1 0 0.5 1 6
1 7 4 2 0 0.6 0 6
2 7 3 3 0 0.7 1 6
3 7 2 4 0 0.8 0 6
In [4]: df
Out[4]:
A B C D X Y Z
0 1 0 0.5 5 1 7 6
1 2 0 0.6 4 0 7 6
2 3 0 0.7 3 1 7 6
3 4 0 0.8 2 0 7 6
In [5]: df.reindex(columns=['D','Y','Z','A','B','C','X'])
Out[5]:
   D  Y  Z  A  B    C  X
0  5  7  6  1  0  0.5  1
1  4  7  6  2  0  0.6  0
2  3  7  6  3  0  0.7  1
3  2  7  6  4  0  0.8  0
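For the real file with 1000+ columns you would not type the names out; a positional sketch, assuming D, Y and Z sit at 0-based positions 3, 999 and 1000:
cols = list(df.columns)
front = [cols[3], cols[999], cols[1000]]                           # D, Y, Z
rest = [c for i, c in enumerate(cols) if i not in (3, 999, 1000)]  # everything else, in order
df = df[front + rest]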

cumcount groupby --- how to list by groups

My question is related to this question
import pandas as pd
df = pd.DataFrame(
[['A', 'X', 3], ['A', 'X', 5], ['A', 'Y', 7], ['A', 'Y', 1],
['B', 'X', 3], ['B', 'X', 1], ['B', 'X', 3], ['B', 'Y', 1],
['C', 'X', 7], ['C', 'Y', 4], ['C', 'Y', 1], ['C', 'Y', 6]],
columns=['c1', 'c2', 'v1'])
df['CNT'] = df.groupby(['c1', 'c2']).cumcount()+1
I got the column 'CNT', but I'd like to break it apart by the value of 'c2' to obtain cumulative counts of 'X' and 'Y' respectively:
c1 c2 v1 CNT Xcnt Ycnt
0 A X 3 1 1 0
1 A X 5 2 2 0
2 A Y 7 1 2 1
3 A Y 1 2 2 2
4 B X 3 1 1 0
5 B X 1 2 2 0
6 B X 3 3 3 0
7 B Y 1 1 3 1
8 C X 7 1 1 0
9 C Y 4 1 1 1
10 C Y 1 2 1 2
11 C Y 6 3 1 3
Any suggestions? I am just starting to explore Pandas and appreciate your help.
I don't know a way to do this directly, but starting from the calculated CNT column, you can do it as follows:
Make the Xcnt and Ycnt columns:
In [13]: df['Xcnt'] = df['CNT'][df['c2']=='X']
In [14]: df['Ycnt'] = df['CNT'][df['c2']=='Y']
In [15]: df
Out[15]:
c1 c2 v1 CNT Xcnt Ycnt
0 A X 3 1 1 NaN
1 A X 5 2 2 NaN
2 A Y 7 1 NaN 1
3 A Y 1 2 NaN 2
4 B X 3 1 1 NaN
5 B X 1 2 2 NaN
6 B X 3 3 3 NaN
7 B Y 1 1 NaN 1
8 C X 7 1 1 NaN
9 C Y 4 1 NaN 1
10 C Y 1 2 NaN 2
11 C Y 6 3 NaN 3
Next, we want to fill the NaN's per group of c1 by forward filling:
In [23]: df['Xcnt'] = df.groupby('c1')['Xcnt'].ffill()
In [24]: df['Ycnt'] = df.groupby('c1')['Ycnt'].ffill().fillna(0)
In [25]: df
Out[25]:
c1 c2 v1 CNT Xcnt Ycnt
0 A X 3 1 1 0
1 A X 5 2 2 0
2 A Y 7 1 2 1
3 A Y 1 2 2 2
4 B X 3 1 1 0
5 B X 1 2 2 0
6 B X 3 3 3 0
7 B Y 1 1 3 1
8 C X 7 1 1 0
9 C Y 4 1 1 1
10 C Y 1 2 1 2
11 C Y 6 3 1 3
For Ycnt an extra fillna was needed to convert to 0 the NaNs at the start of a group (where there was nothing to fill forward from).
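For what it's worth, a more direct sketch that avoids the intermediate NaN handling entirely: the running count of a value within each c1 group is just the cumulative sum of a boolean mask:
df['Xcnt'] = df['c2'].eq('X').groupby(df['c1']).cumsum()
df['Ycnt'] = df['c2'].eq('Y').groupby(df['c1']).cumsum()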
