I am working on a machine learning task and I want to change each row from "numbered objects" to "objects sorted by some attributes".
For example, I have 5 heroes in 2 teams, each represented by their stats (dN_%stat% and rN_%stat% columns), and I want to sort the heroes within each team by the stats numbered 3, 4, 0, 2, so that the strongest hero comes first, and so on.
Here is my current code, but it is very slow, so I want to use native pandas objects and operations instead:
def sort_heroes(df):
    #stats is the predefined list of stat name suffixes (e.g. 'xp', 'gold')
    for match_id in df.index:
        for team in ['r', 'd']:
            heroes = []
            for n in range(1, 6):
                heroes.append(
                    [df.loc[match_id, '%s%s_%s' % (team, n, stat)]
                     for stat in stats])
            heroes.sort(key=lambda x: (x[3], x[4], x[0], x[2]))
            for n in range(1, 6):
                for i, stat in enumerate(stats):
                    df.loc[match_id, '%s%s_%s' %
                           (team, n, stat)] = heroes[n - 1][i]
A short example with simplified but representative data:
match_id r1_xp r1_gold r2_xp r2_gold r3_xp r3_gold d1_xp d1_gold d2_xp d2_gold
1 10 20 100 10 5000 300 0 0 15 5
2 1 1 1000 80 100 13 200 87 311 67
What I want is to sort those column groups (rN_ and dN_ prefixes) within each team, first by gold and then by xp:
match_id r1_xp r1_gold r2_xp r2_gold r3_xp r3_gold d1_xp d1_gold d2_xp d2_gold
1 5000 300 10 20 100 10 15 5 0 0
2 1000 80 100 13 1 1 200 87 311 67
You can use:
df.set_index('match_id', inplace=True)
#create MultiIndex with 3 levels
arr = df.columns.str.extract(r'([rd])(\d*)_(.*)', expand=True).T.values
df.columns = pd.MultiIndex.from_arrays(arr)
#reshape df, sorting
df = df.stack([0,1]).reset_index().sort_values(['match_id','level_1','gold','xp'],
                                               ascending=[True,False,False,False])
print (df)
match_id level_1 level_2 gold xp
4 1 r 3 300.0 5000.0
2 1 r 1 20.0 10.0
3 1 r 2 10.0 100.0
1 1 d 2 5.0 15.0
0 1 d 1 0.0 0.0
8 2 r 2 80.0 1000.0
9 2 r 3 13.0 100.0
7 2 r 1 1.0 1.0
5 2 d 1 87.0 200.0
6 2 d 2 67.0 311.0
#assign new values to level_2
df.level_2 = df.groupby(['match_id','level_1']).cumcount().add(1).astype(str)
#get original shape
df = df.set_index(['match_id','level_1','level_2']).stack().unstack([1,2,3]).astype(int)
df = df.sort_index(level=[0,1,2], ascending=[False, True, False], axis=1)
#Multiindex in columns to column names
df.columns = ['{}{}_{}'.format(x[0], x[1], x[2]) for x in df.columns]
df.reset_index(inplace=True)
print (df)
match_id r1_xp r1_gold r2_xp r2_gold r3_xp r3_gold d1_xp d1_gold \
0 1 5000 300 10 20 100 10 15 5
1 2 1000 80 100 13 1 1 200 87
d2_xp d2_gold
0 0 0
1 311 67
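If this is needed repeatedly, the steps above could be packed into a single helper (a sketch under the same assumptions as the answer: the only stats are xp and gold, gold is the primary sort key; the name sort_teams is just for illustration):
import pandas as pd

def sort_teams(df):
    df = df.set_index('match_id')
    #MultiIndex with 3 levels: team, hero number, stat name
    arr = df.columns.str.extract(r'([rd])(\d*)_(.*)', expand=True).T.values
    df.columns = pd.MultiIndex.from_arrays(arr)
    #long format, strongest hero first within each match and team
    df = (df.stack([0, 1]).reset_index()
            .sort_values(['match_id', 'level_1', 'gold', 'xp'],
                         ascending=[True, False, False, False]))
    #renumber heroes 1..5 inside each team
    df['level_2'] = df.groupby(['match_id', 'level_1']).cumcount().add(1).astype(str)
    #back to the original wide shape and flat column names
    df = df.set_index(['match_id', 'level_1', 'level_2']).stack().unstack([1, 2, 3]).astype(int)
    df = df.sort_index(level=[0, 1, 2], ascending=[False, True, False], axis=1)
    df.columns = ['{}{}_{}'.format(*c) for c in df.columns]
    return df.reset_index()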
Let's take this sample dataframe and this list of ids:
df=pd.DataFrame({'Id':['A','A','A','B','C','C','D','D'], 'Weight':[50,20,30,1,2,8,3,2], 'Value':[100,100,100,10,20,20,30,30]})
Id Weight Value
0 A 50 100
1 A 20 100
2 A 30 100
3 B 1 10
4 C 2 20
5 C 8 20
6 D 3 30
7 D 2 30
L = ['A','C']
The Value column has the same value for each id in the Id column. For the specific ids in L, I would like to apply the weights from the Weight column to the Value column. I am currently doing it the following way, but it is extremely slow on my real, much bigger dataframe:
for i in L:
    df.loc[df["Id"]==i, "Value"] = (df.loc[df["Id"]==i, "Value"] * df.loc[df["Id"]==i, "Weight"] /
                                    df[df["Id"]==i]["Weight"].sum())
How could I do that efficiently?
Expected output :
Id Weight Value
0 A 50 50
1 A 20 20
2 A 30 30
3 B 1 10
4 C 2 4
5 C 8 16
6 D 3 30
7 D 2 30
The idea is to work only on the rows filtered by Series.isin, and to use GroupBy.transform with sum to get per-group sums broadcast back to the same shape as the filtered DataFrame:
L = ['A','C']
m = df['Id'].isin(L)
df1 = df[m].copy()
s = df1.groupby('Id')['Weight'].transform('sum')
df.loc[m, 'Value'] = df1['Value'].mul(df1['Weight']).div(s)
print (df)
Id Weight Value
0 A 50 50.0
1 A 20 20.0
2 A 30 30.0
3 B 1 10.0
4 C 2 4.0
5 C 8 16.0
6 D 3 30.0
7 D 2 30.0
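The same logic also works without the intermediate copy, assigning straight through the mask (a minimal, self-contained sketch of the same idea):
import pandas as pd

df = pd.DataFrame({'Id':['A','A','A','B','C','C','D','D'],
                   'Weight':[50,20,30,1,2,8,3,2],
                   'Value':[100,100,100,10,20,20,30,30]})
L = ['A','C']

#rows whose Id is in L
m = df['Id'].isin(L)
#per-Id weight sums, computed on the filtered rows only
s = df.loc[m].groupby('Id')['Weight'].transform('sum')
#weighted values written back through the same mask
df.loc[m, 'Value'] = df.loc[m, 'Value'].mul(df.loc[m, 'Weight']).div(s)
print (df)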
I'm working on exporting data frames to Excel after joining them.
However, after the join, when I calculate subtotals with groupby, the word "Subtotal" ends up in the index column of the result.
Is there any way to move it into the code column and sort the indexes?
Here is my code:
def subtotal(df__, sheet_name):
    container = []
    for key, group in df__.groupby('key'):
        group.loc['Subtotal'] = group[['quantity', 'quantity2', 'quantity3']].sum()
        container.append(group)
    df_subtotal = pd.concat(container)
    df_subtotal.loc['GrandTotal'] = df__[['quantity', 'quantity2', 'quantity3']].sum()
    print(df_subtotal)
    return df_subtotal.to_excel(writer, sheet_name=sheet_name)
Use np.where() to fill the NaN values in the code column with the corresponding value from df.index, then assign a new index array to df.index.
import numpy as np
df['code'] = np.where(df['code'].isna(), df.index, df['code'])
df.index = np.arange(1, len(df) + 1)
print(df)
code key product quntity1 quntity2 quntity3
1 cs01767 a apple-a 10 0 10.0
2 Subtotal NaN NaN 10 0 10.0
3 cs0000 b bannana-a 50 10 40.0
4 cs0000 b bannana-b 0 0 0.0
5 cs0000 b bannana-c 0 0 0.0
6 cs0000 b bannana-d 80 20 60.0
7 cs0000 b bannana-e 0 0 0.0
8 cs01048 b bannana-f 0 0 NaN
9 cs01048 b bannana-g 0 0 0.0
10 Subtotal NaN NaN 130 30 100.0
11 cs99999 c melon-a 50 10 40.0
12 cs99999 c melon-b 20 20 0.0
13 cs01188 c melon-c 10 0 10.0
14 Subtotal NaN NaN 80 30 50.0
15 GrandTotal NaN NaN 220 60 160.0
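Putting it together, the fix can slot into the helper just before the Excel write (a sketch, assuming the same column names as in the question; the writer is passed in explicitly here rather than used as a global):
import numpy as np
import pandas as pd

def subtotal(df__, writer, sheet_name):
    container = []
    for key, group in df__.groupby('key'):
        group.loc['Subtotal'] = group[['quantity', 'quantity2', 'quantity3']].sum()
        container.append(group)
    df_subtotal = pd.concat(container)
    df_subtotal.loc['GrandTotal'] = df__[['quantity', 'quantity2', 'quantity3']].sum()
    #move the 'Subtotal'/'GrandTotal' labels from the index into the code column
    df_subtotal['code'] = np.where(df_subtotal['code'].isna(),
                                   df_subtotal.index, df_subtotal['code'])
    #renumber the index 1..n
    df_subtotal.index = np.arange(1, len(df_subtotal) + 1)
    return df_subtotal.to_excel(writer, sheet_name=sheet_name)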
I have a dataframe (a partial view is shown below) and I want to convert it into multiple rows without changing the order.
RESP HR SPO2 PULSE
1 46 122 0 0
2 46 122 0 0
3
4
One possible solution is to use reshape; the only requirement is that the total number of values is divisible by 4 (so all the data can be reshaped into a 4-column DataFrame):
df1 = pd.DataFrame(df.values.reshape(-1, 4), columns=['RESP','HR','SPO2','PULSE'])
df1['RESP1'] = df1['RESP'].shift(-1)
General data solution:
a = '46 122 0 0 46 122 0 0 45 122 0 0 45 122 0'.split()
df = pd.DataFrame([a]).astype(int)
print (df)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
0 46 122 0 0 46 122 0 0 45 122 0 0 45 122 0
#flatten values
a = df.values.ravel()
#number of new columns
N = 4
#array filled with NaNs so that the end of the last row can be padded
arr = np.full(((len(a) - 1)//N + 1)*N, np.nan)
#fill the array with the flattened values
arr[:len(a)] = a
#reshape into a new DataFrame (the last value is NaN)
df1 = pd.DataFrame(arr.reshape((-1, N)), columns=['RESP','HR','SPO2','PULSE'])
#new column created by shifting the first column
df1['RESP1'] = df1['RESP'].shift(-1)
print(df1)
RESP HR SPO2 PULSE RESP1
0 46.0 122.0 0.0 0.0 46.0
1 46.0 122.0 0.0 0.0 45.0
2 45.0 122.0 0.0 0.0 45.0
3 45.0 122.0 0.0 NaN NaN
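If the padding logic is needed more than once, it could be wrapped in a small helper (a sketch; the name to_rows is made up for illustration):
import numpy as np
import pandas as pd

def to_rows(values, columns):
    #reshape a flat sequence into len(columns) columns, padding the last row with NaN
    a = np.asarray(values, dtype=float).ravel()
    n = len(columns)
    arr = np.full(((len(a) - 1)//n + 1)*n, np.nan)
    arr[:len(a)] = a
    return pd.DataFrame(arr.reshape(-1, n), columns=columns)

df1 = to_rows('46 122 0 0 46 122 0 0 45 122 0 0 45 122 0'.split(),
              ['RESP','HR','SPO2','PULSE'])
df1['RESP1'] = df1['RESP'].shift(-1)
print (df1)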
Here's another way with groupby:
df = pd.DataFrame(np.arange(12).reshape(1, -1), columns=list('abcd'*3))
new_df = pd.concat((x.stack().reset_index(drop=True).rename(k)
                    for k, x in df.groupby(df.columns, axis=1)),
                   axis=1)
new_df = (new_df.assign(a1=lambda x: x['a'].shift(-1))
                .rename(columns={'a1': 'a'}))
Output:
a b c d a
0 0 1 2 3 4.0
1 4 5 6 7 8.0
2 8 9 10 11 NaN
I have a table that looks something like this:
Column 1 | Column 2 | Column 3
1 a 100
1 r 100
1 h 200
1 j 200
2 a 50
2 q 50
2 k 40
3 a 10
3 q 150
3 k 150
Imagine I am trying to get the top values of each groupby('Column 1') group.
Normally I would just use .head(n), but in this case I only want the top rows that share the same Column 3 value, like:
Column 1 | Column 2 | Column 3
1 a 100
1 r 100
2 a 50
2 q 50
3 a 10
Assuming the table is already in the order I want it
Any advice would be highly appreciated
I think you first need groupby with first and then merge:
print (df.groupby('Column 1')['Column 3'].first().reset_index())
Column 1 Column 3
0 1 100
1 2 50
2 3 10
print (pd.merge(df,
                df.groupby('Column 1')['Column 3'].first().reset_index(),
                on=['Column 1','Column 3']))
Column 1 Column 2 Column 3
0 1 a 100
1 1 r 100
2 2 a 50
3 2 q 50
4 3 a 10
Timings:
df = pd.concat([df]*1000).reset_index(drop=True)
%timeit pd.merge(df, df.groupby('Column 1')['Column 3'].first().reset_index(), on=['Column 1','Column 3'])
100 loops, best of 3: 3.58 ms per loop
%timeit df[(df.assign(diff=df.groupby('Column 1')['Column 3'].diff().fillna(0)).groupby('Column 1')['diff'].cumsum() == 0)]
100 loops, best of 3: 5.06 ms per loop
My solution (without merging):
In [83]: idx = (df.assign(diff=df.groupby('Column1')['Column3'].diff().fillna(0))
....: .groupby('Column1')['diff'].cumsum() == 0
....: )
In [84]: df[idx]
Out[84]:
Column1 Column2 Column3
0 1 a 100
1 1 r 100
4 2 a 50
5 2 q 50
7 3 a 10
Explanation:
In [85]: df.assign(diff=df.groupby('Column1')['Column3'].diff().fillna(0))
Out[85]:
Column1 Column2 Column3 diff
0 1 a 100 0.0
1 1 r 100 0.0
2 1 h 200 100.0
3 1 j 200 0.0
4 2 a 50 0.0
5 2 q 50 0.0
6 2 k 40 -10.0
7 3 a 10 0.0
8 3 q 150 140.0
9 3 k 150 0.0
In [86]: df.assign(diff=df.groupby('Column1')['Column3'].diff().fillna(0)).groupby('Column1')['diff'].cumsum()
Out[86]:
0 0.0
1 0.0
2 100.0
3 100.0
4 0.0
5 0.0
6 -10.0
7 0.0
8 140.0
9 140.0
Name: diff, dtype: float64
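Another equivalent filter, assuming (as the question states) that the table is already ordered and that the top value does not reappear later within a group, is to keep the rows whose Column 3 equals the first value of their group:
print (df[df['Column 3'] == df.groupby('Column 1')['Column 3'].transform('first')])
This avoids both the merge and the diff/cumsum bookkeeping.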
I have customer records with id, timestamp and status.
ID, TS, STATUS
1 10 GOOD
1 20 GOOD
1 25 BAD
1 30 BAD
1 50 BAD
1 600 GOOD
2 40 GOOD
.. ...
I am trying to calculate how much time is spent in consecutive BAD statuses (let's assume the order above is correct) per customer. So for customer id=1, (30-25) + (50-30) + (600-50) = 575 seconds in total were spent in BAD status.
What is the method of doing this in Pandas? If I calculate .diff() on TS, that would give me the differences, but how can I tie that 1) to the customer and 2) to particular status "blocks" for that customer?
Sample data:
df = pandas.DataFrame({'ID':[1,1,1,1,1,1,2],
                       'TS':[10,20,25,30,50,600,40],
                       'Status':['G','G','B','B','B','G','G']},
                      columns=['ID','TS','Status'])
Thanks,
In [1]: df = DataFrame({'ID':[1,1,1,1,1,2,2],'TS':[10,20,25,30,50,10,40],
   ...:                 'Status':['G','G','B','B','B','B','B']}, columns=['ID','TS','Status'])
In [2]: f = lambda x: x.diff().sum()
In [3]: df['diff'] = df[df.Status=='B'].groupby('ID')['TS'].transform(f)
In [4]: df
Out[4]:
ID TS Status diff
0 1 10 G NaN
1 1 20 G NaN
2 1 25 B 25
3 1 30 B 25
4 1 50 B 25
5 2 10 B 30
6 2 40 B 30
Explanation:
Subset the dataframe to only those records with the desired Status. Groupby the ID and apply the lambda function diff().sum() to each group. Use transform instead of apply because transform returns an indexed series which you can use to assign to a new column 'diff'.
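To make the transform/apply distinction concrete (a minimal sketch using the frame from In [1]):
g = df[df.Status == 'B'].groupby('ID')['TS']
#transform broadcasts the per-group result back onto every row of the group
print (g.transform(lambda x: x.diff().sum()))
#apply collapses each group to a single value instead
print (g.apply(lambda x: x.diff().sum()))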
EDIT: New response to account for expanded question scope.
In [1]: df
Out[1]:
ID TS Status
0 1 10 G
1 1 20 G
2 1 25 B
3 1 30 B
4 1 50 B
5 1 600 G
6 2 40 G
In [2]: df['shift'] = -df['TS'].diff(-1)
In [3]: df['diff'] = df[df.Status=='B'].groupby('ID')['shift'].transform('sum')
In [4]: df
Out[4]:
ID TS Status shift diff
0 1 10 G 10 NaN
1 1 20 G 5 NaN
2 1 25 B 5 575
3 1 30 B 20 575
4 1 50 B 550 575
5 1 600 G -560 NaN
6 2 40 G NaN NaN
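One caveat: the plain .diff(-1) above runs across customer boundaries (hence the -560 on the last 'G' row of ID 1). It is harmless here because only 'B' rows are summed, but a per-ID version avoids it (a sketch):
#duration to the next timestamp, computed within each ID so customers never mix
df['shift'] = -df.groupby('ID')['TS'].diff(-1)
df['diff'] = df[df.Status=='B'].groupby('ID')['shift'].transform('sum')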
Here's a solution to separately aggregate each contiguous block of bad status (part 2 of your question?).
In [5]: df = pandas.DataFrame({'ID':[1,1,1,1,1,1,1,1,2,2,2],
   ...:                        'TS':[10,20,25,30,50,600,650,670,40,50,60],
   ...:                        'Status':['G','G','B','B','B','G','B','B','G','B','B']},
   ...:                       columns=['ID','TS','Status'])
In [6]: grp = df.groupby('ID')
In [7]: def status_change(df):
   ...:     return (df.Status.shift(1) != df.Status).astype(int)
   ...:
In [8]: df['BlockId'] = grp.apply(lambda df: status_change(df).cumsum())
In [9]: df['Duration'] = grp.TS.diff().shift(-1)
In [10]: df
Out[10]:
ID TS Status BlockId Duration
0 1 10 G 1 10
1 1 20 G 1 5
2 1 25 B 2 5
3 1 30 B 2 20
4 1 50 B 2 550
5 1 600 G 3 50
6 1 650 B 4 20
7 1 670 B 4 NaN
8 2 40 G 1 10
9 2 50 B 2 10
10 2 60 B 2 NaN
In [11]: df[df.Status == 'B'].groupby(['ID', 'BlockId']).Duration.sum()
Out[11]:
ID BlockId
1 2 575
4 20
2 2 10
Name: Duration
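The same per-block totals can also be computed without the intermediate columns, assuming the frame from In [5] (a sketch; the block counter increments whenever Status or ID changes, so block numbers run globally rather than per ID):
dur = df.groupby('ID')['TS'].diff().shift(-1)
block = (df['Status'].ne(df['Status'].shift()) | df['ID'].ne(df['ID'].shift())).cumsum()
m = df['Status'] == 'B'
print (dur[m].groupby([df.loc[m, 'ID'], block[m]]).sum())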