I have been working with multi-level DataFrames quite a bit recently, and I have found that they can significantly reduce computation time for large data sets. For example, consider this simple DataFrame:
df = pd.DataFrame([
[1, 111, 0], [2, 222, 0], [1, 111, 0],
[2, 222, 1], [1, 111, 1], [2, 222, 2]
], columns=["ID", "A", "B"], index=[1, 1, 2, 2, 3, 3]
)
df.head(6)
ID A B
1 1 111 0
1 2 222 0
2 1 111 0
2 2 222 1
3 1 111 1
3 2 222 2
which can be pivoted by ID to create a multi-level data frame:
pivot_df = df.pivot(columns="ID")
pivot_df.head()
A B
ID 1 2 1 2
1 111 222 0 0
2 111 222 0 1
3 111 222 1 2
The great thing about having my data in this format is that I can perform "vector" operations across all IDs simply by referencing the level 0 columns:
pivot_df["A"] * (1 + pivot_df["B"])**2
ID 1 2
1 111 222
2 111 888
3 444 1998
These operations are really helpful for me! In real life, my computations are much more complex and need to be performed for more than 1000 IDs. A typical DataFrame that I work with has 10 columns at level 0, 1000 IDs at level 1, and 350 rows.
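For scale, here is a minimal sketch of constructing a frame of roughly that size (the field names F0-F9, the ID range, and the random values are made up for illustration):
import numpy as np
import pandas as pd

fields = [f"F{i}" for i in range(10)]   # hypothetical level-0 field names
ids = range(1, 1001)                    # hypothetical level-1 IDs
cols = pd.MultiIndex.from_product([fields, ids], names=[None, "ID"])
big = pd.DataFrame(np.random.rand(350, len(cols)), columns=cols)

# the same kind of "vector" operation, applied across all 1000 IDs at once
res = big["F0"] * (1 + big["F1"]) ** 2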
I am interested in figuring out how to do two things: update values for a particular field in this pivoted DataFrame, and create a new column in this DataFrame. Something like
pivot_df["A"] = pivot_df["A"] * (1 + pivot_df["B"])**2
or
pivot_df["C"] = pivot_df["A"] * (1 + pivot_df["B"])**2
I do not get any errors when I perform either of these, but the DataFrame remains unchanged. I have also tried using .loc and .iloc, but I am having no success.
I think that the problem is maintaining the multi-level structure of the computed DataFrames, but I am pretty new to using multi-level DataFrames and not sure how to solve this problem efficiently. I have a clumsy workaround that is not efficient (create a dictionary of computed DataFrames and then merge them all together):
from collections import OrderedDict
from functools import reduce
import numpy as np

df_dict = OrderedDict()
df_dict["A"] = pivot_df["A"]
df_dict["B"] = pivot_df["B"]
df_dict["C"] = pivot_df["A"] * (1 + pivot_df["B"])**2
dfs = [val.T.set_index(np.repeat(key, val.shape[1]), append=True).T for key, val in df_dict.items()]
final_df = reduce(lambda x, y: pd.merge(x, y, left_index=True, right_index=True), dfs)
final_df.columns = final_df.columns.swaplevel(0, 1)
or similarly,
df_dict = OrderedDict()
df_dict["A"] = pivot_df["A"] * (1 + pivot_df["B"])**2
df_dict["B"] = pivot_df["B"]
dfs = [val.T.set_index(np.repeat(key, val.shape[1]), append=True).T for key, val in df_dict.items()]
final_df = reduce(lambda x, y: pd.merge(x, y, left_index=True, right_index=True), dfs)
final_df.columns = final_df.columns.swaplevel(0, 1)
This is not necessarily clunky (I was kind of proud of the workaround), but this is certainly not efficient or computationally optimized. Does anyone have any recommendations?
Option 1
Don't pivot first!
You stated that it was convenient to pivot because you could perform vector calculations in the new pivoted form. This is a bit of a misconception, because you could just as easily have performed those calculations prior to the pivot.
df['C'] = df["A"] * (1 + df["B"]) ** 2
df.pivot(columns='ID')
A B C
ID 1 2 1 2 1 2
1 111 222 0 0 111 222
2 111 222 0 1 111 888
3 111 222 1 2 444 1998
Or in a piped one-liner if you prefer
df.assign(C=df.A * (1 + df.B) ** 2).pivot(columns='ID')
A B C
ID 1 2 1 2 1 2
1 111 222 0 0 111 222
2 111 222 0 1 111 888
3 111 222 1 2 444 1998
Option 2
pd.concat
But to answer your question...
pdf = df.pivot(columns='ID')
pd.concat([
pdf.A, pdf.B, pdf.A * (1 + pdf.B) ** 2
], axis=1, keys=['A', 'B', 'C'])
A B C
ID 1 2 1 2 1 2
1 111 222 0 0 111 222
2 111 222 0 1 111 888
3 111 222 1 2 444 1998
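As a side note (a small sketch, not something the question requires): if you ever want the ID level on the outside instead, swaplevel plus sort_index on the columns handles it:
out = pd.concat([pdf.A, pdf.B, pdf.A * (1 + pdf.B) ** 2],
                axis=1, keys=['A', 'B', 'C'])
out = out.swaplevel(0, 1, axis=1).sort_index(axis=1)   # IDs become the outer column level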
Option 3
more pd.concat
Add another level to columns prior to concat
pdf = df.pivot(columns='ID')
c = pdf.A * (1 + pdf.B) ** 2
c.columns = [['C'] * len(c.columns), c.columns]
pd.concat([pdf, c], axis=1)
A B C
ID 1 2 1 2 1 2
1 111 222 0 0 111 222
2 111 222 0 1 111 888
3 111 222 1 2 444 1998
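Another possibility, offered as a sketch rather than a tested recipe: assign the computed block column by column using full tuple keys, which keeps the MultiIndex columns intact:
pdf = df.pivot(columns='ID')
c = pdf.A * (1 + pdf.B) ** 2
for i in c.columns:              # the level-1 IDs
    pdf[('C', i)] = c[i]
pdf = pdf.sort_index(axis=1)     # optional: keep the column index lexsorted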
Related
df = pd.DataFrame({'ID': ['A','A','A','A','A'],
'target': ['B','B','B','B','C'],
'length':[208,315,1987,3775,200],
'start':[139403,140668,141726,143705,108],
'end':[139609,140982,143711,147467,208]})
ID target length start end
0 A B 208 139403 139609
1 A B 315 140668 140982
2 A B 1987 141726 143711
3 A B 3775 143705 147467
4 A C 200 108 208
If I perform the operation:
(df.assign(length=
    df['start'].lt(df['end'].shift())
      .mul(df['start'] - df['end'].shift(fill_value=0))
      .add(df['length'])))
I get the correct result but how do I apply this logic to every group in a groupby?
for (a, b) in df.groupby(['start','end']):
    (df.assign(length=
        df['start'].lt(df['end'].shift())
          .mul(df['start'] - df['end'].shift(fill_value=0))
          .add(df['length'])))
This leaves the dataframe unchanged. Why?
Group the df on the required columns (ID and target) and shift the end column, then apply your formula as usual:
s = df.groupby(['ID', 'target'])['end'].shift()
df['length'] = df['start'].lt(s) * df['start'].sub(s.fillna(0)) + df['length']
ID target length start end
0 A B 208.0 139403 139609
1 A B 315.0 140668 140982
2 A B 1987.0 141726 143711
3 A B 3769.0 143705 147467
4 A C 200.0 108 208
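If you prefer to keep the assign-style formula from the question, here is a hedged sketch of the same idea written with groupby.apply (column names as above; the groupby-shift version above is likely faster):
def adjust(g):
    # shift 'end' within the group, then apply the same length formula
    s = g['end'].shift()
    return g.assign(length=g['start'].lt(s)
                            .mul(g['start'] - s.fillna(0))
                            .add(g['length']))

out = df.groupby(['ID', 'target'], group_keys=False).apply(adjust)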
I need to select a row within a pandas groupby based on a condition.
Condition 1: for a given group (R1, R2, W), if the TYPE A row's amount2 is equal to the TYPE B row's amount2, we need to return the complete TYPE A row as output.
Condition 2: for a given group (R1, R2, W), if the TYPE A row's amount2 is not equal to the TYPE B row's amount2, we need to sum up amount1 and amount2 across both the TYPE A and TYPE B rows, and take the remaining columns from the TYPE A row as output.
Input dataframe
R1 R2 W TYPE amount1 amount2 Status Exchange
0 123 12 1 A 111 222 C 1.5
1 123 12 1 B 111 222 D 2.5
2 123 12 2 A 222 222 A 1.5
3 123 12 2 B 333 333 D 2.5
4 123 12 3 A 444 444 D 2.5
5 123 12 3 B 333 333 E 3.5
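For reference, this sample input can be rebuilt with something like the following (a sketch; dtypes assumed from the display):
import pandas as pd

df = pd.DataFrame({
    'R1': [123] * 6,
    'R2': [12] * 6,
    'W': [1, 1, 2, 2, 3, 3],
    'TYPE': ['A', 'B', 'A', 'B', 'A', 'B'],
    'amount1': [111, 111, 222, 333, 444, 333],
    'amount2': [222, 222, 222, 333, 444, 333],
    'Status': ['C', 'D', 'A', 'D', 'D', 'E'],
    'Exchange': [1.5, 2.5, 1.5, 2.5, 2.5, 3.5],
})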
Expected output
R1 R2 W TYPE amount1 amount2 Status Exchange
0 123 12 1 A 111 222 C 1.5
1 123 12 2 A 555 555 A 1.5
2 123 12 3 A 777 777 D 2.5
First it is necessary to get all groups where the TYPE A and TYPE B rows have the same amount2: reshape with DataFrame.set_index and DataFrame.unstack so that the A and B values sit side by side, compare them with Series.eq, and then align the mask back to the original rows (so it has the same length as the original) with DataFrame.join:
df1 = df.set_index(['R1','R2','W','TYPE'])['amount2'].unstack()
m = df1['A'].eq(df1['B']).rename('m')
m = df.join(m, on=['R1','R2','W'])['m']
Then, for the matching rows (here the first group), filter only the A rows by boolean indexing, chaining the two conditions with & (bitwise AND):
df2 = df[m & df['TYPE'].eq('A')]
print (df2)
R1 R2 W TYPE amount1 amount2 Status Exchange
0 123 12 1 A 111 222 C 1.5
Then filter all the other groups with the inverted mask (~) and aggregate with GroupBy.agg, taking the first value for most columns (GroupBy.first) and the sum for the amount columns:
cols = df.columns.difference(['R1','R2','W','amount1','amount2'])
d1 = dict.fromkeys(['amount1','amount2'], 'sum')
d2 = dict.fromkeys(cols, 'first')
df3 = df[~m].groupby(['R1','R2','W'], as_index=False).agg({**d1, **d2}).assign(TYPE='A')
print (df3)
R1 R2 W amount1 amount2 Exchange Status TYPE
0 123 12 2 555 555 1.5 A A
1 123 12 3 777 777 2.5 D A
Last, join everything back together with concat and, if necessary, sort with DataFrame.sort_values:
df4 = pd.concat([df2, df3], ignore_index=True, sort=False).sort_values(['R1','R2','W'])
print (df4)
R1 R2 W TYPE amount1 amount2 Status Exchange
0 123 12 1 A 111 222 C 1.5
1 123 12 2 A 555 555 A 1.5
2 123 12 3 A 777 777 D 2.5
Another solution:
#get the rows for A for each grouping
#assumption is TYPE is already sorted with A always ahead of B
core = ['R1','R2','W']
A = df.groupby(core).first()
#get rows for B for each grouping
B = df.groupby(core).last()
#first condition
cond1 = (A.amount1.eq(B.amount1)) & (A.amount2.eq(B.amount2))
#extract outcome from A to get the first part
part1 = A.loc[cond1]
#second condition
cond2 = A.amount2.ne(B.amount2)
#add the 'amount1' and 'amount 2' columns based on the second condition
part2 = (B.loc[cond2].filter(['amount1','amount2']) +
         A.loc[cond2].filter(['amount1','amount2']))
#merge with A to get the remaining columns
part2 = part2.join(A[['TYPE','Status','Exchange']])
#merge part1 and 2 to get final result
pd.concat([part1,part2]).reset_index()
R1 R2 W TYPE amount1 amount2 Status Exchange
0 123 12 1 A 111 222 C 1.5
1 123 12 2 A 555 555 A 1.5
2 123 12 3 A 777 777 D 2.5
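If the A-before-B ordering assumed in the comments is not guaranteed in your data, a small safeguard (a sketch) is to sort before taking first/last:
# ensure TYPE 'A' precedes 'B' within each (R1, R2, W) group
df = df.sort_values(['R1', 'R2', 'W', 'TYPE'])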
I have a pandas data frame as follows:
id group type action cost
101 A 1 10
101 A 1 repair 3
102 B 1 5
102 B 1 repair 7
102 B 1 grease 2
102 B 1 inflate 1
103 A 2 12
104 B 2 9
I need to reshape it from long to wide, but depending on the value of the action column, as follows:
id group type action_std action_extra
101 A 1 10 3
102 B 1 5 10
103 A 2 12 0
104 B 2 9 0
In other words, for the rows with an empty action field the cost value should go under the action_std column, while for the rows with a non-empty action field the cost values should be summed under the action_extra column.
I've tried several combinations of groupby / agg / pivot but I cannot find a fully working solution...
I would suggest you simply split the cost column into a cost and a cost_extra column. Something like the following:
import numpy as np
result = df.assign(
cost_extra=lambda df: np.where(
df['action'].notnull(), df['cost'], np.nan
)
).assign(
cost=lambda df: np.where(
df['action'].isnull(), df['cost'], np.nan
)
).groupby(
["id", "group", "type"]
)["cost", "cost_extra"].agg(
"sum"
)
result looks like:
cost cost_extra
id group type
101 A 1 10.0 3.0
102 B 1 5.0 10.0
103 A 2 12.0 0.0
104 B 2 9.0 0.0
Check groupby with unstack:
df.cost.groupby([df.id,df.group,df.type,df.action.eq('')]).sum().unstack(fill_value=0)
action False True
id group type
101 A 1 3 10
102 B 1 10 5
103 A 2 0 12
104 B 2 0 9
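To match the column names asked for in the question, a small follow-up sketch renames the boolean labels produced by unstack (it keeps the same assumption that an empty action is an empty string):
out = (df.cost.groupby([df.id, df.group, df.type, df.action.eq('')])
         .sum()
         .unstack(fill_value=0)
         .rename(columns={True: 'action_std', False: 'action_extra'})
         .reset_index())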
Thanks for your hints. This is the solution that I like best in the end (also for its simplicity):
df["action_std"] = df["cost"].where(df["action"] == "")
df["action_extra"] = df["cost"].where(df["action"] != "")
df = df.groupby(["id", "group", "type"])["action_std", "action_extra"].sum().reset_index()
I have a dataframe with several columns [A, B, C, ..., Z]. I want to delete all rows from the dataframe which have the property that their values in columns [B, C, ..., Z] are equal to 0 (integer zero).
Example df:
A B C ... Z
0 3 0 0 ... 0
1 1 0 0 ... 0
2 2 1 2 ... 3 <-- keep only this as it has values other than zero
I tried to do this like so:
df = df[(df.columns[1:] != 0).all()]
I can't get it to work. I am not too experienced with conditions in indexers. I wanted to avoid a solution that chains a zero test for every column. I am sure that there is a more elegant solution to this.
Thanks!
EDIT:
The solution worked for an artificially created dataframe, but when I used it on the dataframe I got from reading a CSV, it failed. The file looks like this:
A;B;C;D;E;F;G;H;I;J;K;L;M;N;O;P;Q;R;S;T;U;V;W;X;Y;Z
0;25310;169;81;0;0;0;12291181;31442;246;0;0;0;0;0;0;0;0;0;251;31696;0;0;329;0;0
1;6252727;20480;82;0;0;0;31088;85;245;0;0;0;0;0;0;0;0;0;20567;331;0;0;329;0;0
2;6032184;10961;82;0;0;0;31024;84;245;0;0;0;0;0;0;0;0;0;11046;330;0;0;329;0;0
3;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0
4;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0
5;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0
6;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0
7;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0
8;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0
9;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0
10;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0;0
I read it using the following commands:
import pandas as pd
# retrieve csv file as dataframe
df = pd.read_csv('PATH/TO/FILE',
                 decimal=',',
                 sep=';')
df[list(df)] = df[list(df)].astype('int')
print(df)
df = df[(df.iloc[:, 1:] != 0).all(axis=1)]
print(df)
The first print statement shows that the frame is read correctly, but the second print gives me an empty dataframe. How can this be?
Use iloc to select all columns except the first:
df = df[(df.iloc[:, 1:] != 0).all(axis=1)]
print (df)
A B C Z
2 2 1 2 3
EDIT: For the CSV data you want to drop only the rows where every value after the first column is zero, so use any (keep rows with at least one non-zero value) instead of all (keep only rows with no zeros at all):
df = df[(df.iloc[:, 1:] != 0).any(axis=1)]
print (df)
A B C D E F G H I J ... Q R S T \
0 0 25310 169 81 0 0 0 12291181 31442 246 ... 0 0 0 251
1 1 6252727 20480 82 0 0 0 31088 85 245 ... 0 0 0 20567
2 2 6032184 10961 82 0 0 0 31024 84 245 ... 0 0 0 11046
U V W X Y Z
0 31696 0 0 329 0 0
1 331 0 0 329 0 0
2 330 0 0 329 0 0
[3 rows x 26 columns]
I have a dataset with two columns, and I would like to do some operations on a particular column and get a new dataframe altogether. Consider this as my dataset:
A B
1 01
1 56
1 89
1 108
2 23
2 36
2 89
3 13
4 45
I would like to perform two operations on column B and create a dataframe with the two resulting columns. The first column (C) should be the highest value of B for each A minus its lowest value, e.g. 108 - 1 = 107 for A = 1 and 89 - 23 = 66 for A = 2; if A has only a single row it should simply be 0. The second column (D) should be a fixed number, say 125, minus the very first value of B for each A, i.e. 125 - 1, 125 - 23, 125 - 13, and so on. We should get something like this:
A C D
1 107 124
2 66 102
3 0 112
4 0 80
I was thinking of using .loc to find the specific position of the value and then subtract it. How should I do this?
Use agg with 'first' and a custom lambda function, then rename the columns and subtract column D from 125:
df = df.groupby('A')['B'].agg([lambda x: x.max() - x.min(), 'first']) \
.rename(columns={'first':'D','<lambda>':'C'}) \
.assign(D= lambda x: 125 - x['D']) \
.reset_index()
print (df)
A C D
0 1 107 124
1 2 66 102
2 3 0 112
3 4 0 80
rename is necessary because groupby agg with a dictionary for renaming is deprecated.
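In newer pandas (0.25+), named aggregation avoids the deprecated dict-renaming form altogether; a sketch of the same computation:
out = df.groupby('A', as_index=False).agg(
    C=('B', lambda x: x.max() - x.min()),
    D=('B', lambda x: 125 - x.iloc[0]),
)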
Another solution:
df = df.groupby('A')['B'].agg(['min','max', 'first']) \
.rename(columns={'first':'D','min':'C'}) \
.assign(D=lambda x: 125 - x['D'], C=lambda x: x['max'] - x['C']) \
.drop('max', axis=1) \
.reset_index()
print (df)
A C D
0 1 107 124
1 2 66 102
2 3 0 112
3 4 0 80
u = df.groupby('A').agg(['max', 'min', 'first'])
u.columns = 'max', 'min', 'first'
u['C'] = u['max'] - u['min']
u['D'] = 125 - u['first']
del u['min']
del u['max']
del u['first']
u.reset_index()
# A C D
#0 1 107 124
#1 2 66 102
#2 3 0 112
#3 4 0 80
You could do it in one agg call with a dict of lambdas:
In [1494]: df.groupby('A', as_index=False).B.agg(
{'C': lambda x: x.max() - x.min(), 'D': lambda x: 125-x.iloc[0]})
Out[1494]:
A C D
0 1 107 124
1 2 66 102
2 3 0 112
3 4 0 80