Groupby and assign operation result to each group - python

df = pd.DataFrame({'ID': ['A','A','A','A','A'],
                   'target': ['B','B','B','B','C'],
                   'length': [208, 315, 1987, 3775, 200],
                   'start': [139403, 140668, 141726, 143705, 108],
                   'end': [139609, 140982, 143711, 147467, 208]})
ID target length start end
0 A B 208 139403 139609
1 A B 315 140668 140982
2 A B 1987 141726 143711
3 A B 3775 143705 147467
4 A C 200 108 208
If I perform the operation:
(df.assign(length=df['start'].lt(df['end'].shift())
                    .mul(df['start'] - df['end'].shift(fill_value=0))
                    .add(df['length'])))
I get the correct result but how do I apply this logic to every group in a groupby?
for (a, b) in df.groupby(['start', 'end']):
    (df.assign(length=df['start'].lt(df['end'].shift())
                        .mul(df['start'] - df['end'].shift(fill_value=0))
                        .add(df['length'])))
Leaves the dataframe unchanged?

Group the df on the required columns (ID and target), shift the end column, then apply your formula as usual:
s = df.groupby(['ID', 'target'])['end'].shift()
df['length'] = df['start'].lt(s) * df['start'].sub(s.fillna(0)) + df['length']
ID target length start end
0 A B 208.0 139403 139609
1 A B 315.0 140668 140982
2 A B 1987.0 141726 143711
3 A B 3769.0 143705 147467
4 A C 200.0 108 208
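If you prefer to keep the whole computation inside the groupby, here is a roughly equivalent per-group sketch (not the answer above verbatim; it assumes the same column names as the example):
import pandas as pd

def adjust_length(g):
    # Within each (ID, target) group, compare 'start' with the previous row's 'end'
    # and subtract any overlap from 'length'.
    prev_end = g['end'].shift()
    overlap = g['start'].lt(prev_end) * g['start'].sub(prev_end.fillna(0))
    return g.assign(length=g['length'] + overlap)

df = df.groupby(['ID', 'target'], group_keys=False).apply(adjust_length)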


Pandas: select rows by random groups while keeping all of the group's variables

My dataframe looks like this:
id std number
A 1 1
A 0 12
B 123.45 34
B 1 56
B 12 78
C 134 90
C 1234 100
C 12345 111
I'd like to select random groups of id while retaining all of the rows in each selected group, so that the dataframe would look like this:
id std number
A 1 1
A 0 12
C 134 90
C 1234 100
C 12345 111
I tried it with
size = 1000
replace = True
fn = lambda obj: obj.loc[np.random.choice(obj.index, size, replace),:]
df2 = df1.groupby('Id', as_index=False).apply(fn)
and
df2 = df1.sample(n=1000).groupby('id')
but obviously that didn't work. Any help would be appreciated.
You need to create random ids first and then compare the original id column using Series.isin with boolean indexing:
#number of groups
N = 2
df2 = df1[df1['id'].isin(df1['id'].drop_duplicates().sample(N))]
print (df2)
id std number
0 A 1.0 1
1 A 0.0 12
5 C 134.0 90
6 C 1234.0 100
7 C 12345.0 111
Or:
N = 2
df2 = df1[df1['id'].isin(np.random.choice(df1['id'].unique(), N))]
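One caveat on the second variant: np.random.choice samples with replacement by default, so it can return duplicate ids and therefore fewer than N distinct groups. A minimal sketch with replace=False:
import numpy as np

N = 2
# replace=False guarantees N distinct ids (requires N <= df1['id'].nunique())
chosen = np.random.choice(df1['id'].unique(), N, replace=False)
df2 = df1[df1['id'].isin(chosen)]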

Pandas reshape a multicolumn dataframe long to wide with conditional check

I have a pandas data frame as follows:
id   group  type  action   cost
101  A      1              10
101  A      1     repair   3
102  B      1              5
102  B      1     repair   7
102  B      1     grease   2
102  B      1     inflate  1
103  A      2              12
104  B      2              9
I need to reshape it from long to wide, but depending on the value of the action column, as follows:
id group type action_std action_extra
101 A 1 10 3
102 B 1 5 10
103 A 2 12 0
104 B 2 9 0
In other words, for rows with an empty action field the cost value should go under the action_std column, while for rows with a non-empty action field the cost values should be summed under the action_extra column.
I've attempted with several combinations of groupby / agg / pivot but I cannot find any fully working solution...
I would suggest you simply split the cost column into a cost and a cost_extra column. Something like the following:
import numpy as np

result = (
    df.assign(
        cost_extra=lambda df: np.where(df['action'].notnull(), df['cost'], np.nan)
    )
    .assign(
        cost=lambda df: np.where(df['action'].isnull(), df['cost'], np.nan)
    )
    .groupby(["id", "group", "type"])[["cost", "cost_extra"]]
    .agg("sum")
)
result looks like:
cost cost_extra
id group type
101 A 1 10.0 3.0
102 B 1 5.0 10.0
103 A 2 12.0 0.0
104 B 2 9.0 0.0
Check groupby with unstack
df.cost.groupby([df.id,df.group,df.type,df.action.eq('')]).sum().unstack(fill_value=0)
action False True
id group type
101 A 1 3 10
102 B 1 10 5
103 A 2 0 12
104 B 2 0 9
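If you want the True/False labels produced by unstack mapped back to the requested column names, one possible follow-up (assuming, as above, that an empty action is stored as an empty string):
out = (df['cost']
       .groupby([df['id'], df['group'], df['type'], df['action'].eq('')])
       .sum()
       .unstack(fill_value=0)
       .rename(columns={True: 'action_std', False: 'action_extra'})
       .reset_index())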
Thanks for your hints, this is the solution that I finally like the most (also for its simplicity):
df["action_std"] = df["cost"].where(df["action"] == "")
df["action_extra"] = df["cost"].where(df["action"] != "")
df = df.groupby(["id", "group", "type"])["action_std", "action_extra"].sum().reset_index()

Is there no syntax sugar for dynamically creating columns in a multi-indexed pandas dataframe?

First, I show the pandas dataframe to elucidate my problem.
import pandas as pd
mi = pd.MultiIndex.from_product([["A","B"],["c","d"]], names=['lv1', 'lv2'])
df1 = pd.DataFrame([[1,2,3,4],[5,6,7,8],[9,10,11,12]],columns=mi)
this python code creates dataframe(df1) like this:
#input dataframe
lv1 A B
lv2 c d c d
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
I want to create 'c*d' columns on lv2 using df1's data, like this:
#output dataframe after calculation
lv1 A B
lv2 c d c*d c d c*d
0 1 2 2 3 4 12
1 5 6 30 7 8 56
2 9 10 90 11 12 132
For this problem, I wrote some code like this:
for l1 in mi.levels[0]:
    df1.loc[:, (l1, "c*d")] = df1.loc[:, (l1, "c")] * df1.loc[:, (l1, "d")]
df1.sort_index(axis=1, inplace=True)
Although this code almost solves my problem, I really want to write it without a 'for' statement, like this:
df1.loc[:,(slice(None),"c*d")]=df1.loc[:,(slice(None),"c")]*df1.loc[:,(slice(None),"d")]
With this statement, I get a KeyError saying 'c*d' is missing.
Is there no syntax sugar for this calculation? Or can I achieve better performance with other code?
A slightly improved version of your solution:
for l1 in mi.levels[0]:
    df1.loc[:, (l1, "c*d")] = df1.loc[:, (l1, "c")] * df1.loc[:, (l1, "d")]

mux = pd.MultiIndex.from_product([df1.columns.levels[0], ['c', 'd', 'c*d']])
df1 = df1.reindex(columns=mux)
print (df1)
A B
c d c*d c d c*d
0 1 2 2 3 4 12
1 5 6 30 7 8 56
2 9 10 90 11 12 132
Another solution with stack and unstack:
mux = pd.MultiIndex.from_product([df1.columns.levels[0], ['c', 'd', 'c_d']])
df1 = (df1.stack(0)
          .assign(c_d=lambda x: x.prod(axis=1))
          .unstack()
          .swaplevel(0, 1, axis=1)
          .reindex(columns=mux))
print (df1)
A B
c d c_d c d c_d
0 1 2 2 3 4 12
1 5 6 30 7 8 56
2 9 10 90 11 12 132
df2 = df1.xs("c", axis=1, level=1).mul(df1.xs("d", axis=1, level=1))
df2.columns = pd.MultiIndex.from_product([df2.columns, ['c*d']])
print (df2)
A B
c*d c*d
0 2 12
1 30 56
2 90 132
mux = pd.MultiIndex.from_product([df2.columns.levels[0], ['c','d','c*d']])
df = df1.join(df2).reindex(columns=mux)
print (df)
A B
c d c*d c d c*d
0 1 2 2 3 4 12
1 5 6 30 7 8 56
2 9 10 90 11 12 132
An explanation of jezrael's answer using stack, which may be the most idiomatic way in pandas.
output = (
    df1
    # "Stack" the data by moving the top level ('lv1') of the
    # column MultiIndex into the row index;
    # now the rows are a MultiIndex and the columns
    # are a regular Index.
    .stack(0)
    # Since we only have 2 columns now, 'lv2' ('c' & 'd'),
    # we can multiply them together along the row axis.
    # The assign method takes key=value pairs mapping new column
    # names to the function used to calculate them. Here we're
    # wrapping them in a dictionary and unpacking it using **
    .assign(**{'c*d': lambda x: x.product(axis=1)})
    # Undoes the stack operation, moving 'lv1' back to the
    # column index, but now as the bottom level of the column index.
    .unstack()
    # This sets the order of the column index MultiIndex levels.
    # Since they are named we can use the names; you can also use
    # their integer positions instead. Here axis=1 references
    # the column index.
    .swaplevel('lv1', 'lv2', axis=1)
    # Sort the values in both levels of the column MultiIndex.
    # This will order them as c, c*d, d, which is not what you
    # specified above; however, having a sorted MultiIndex is required
    # for indexing via .loc[:, (...)] to work properly.
    .sort_index(axis=1)
)
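If the goal is specifically to make the question's sliced assignment work without a for loop, a hedged sketch along the same lines: pre-create the 'c*d' slots (so the slicer no longer raises a KeyError) and assign the products as a plain array to sidestep column-label alignment.
import pandas as pd

# Pre-create empty 'c*d' columns, then sort so multi-level slicers work reliably.
mux = pd.MultiIndex.from_product([df1.columns.levels[0], ['c', 'd', 'c*d']],
                                 names=df1.columns.names)
df1 = df1.reindex(columns=mux).sort_index(axis=1)

# .to_numpy() avoids aligning on the mismatched level-2 labels ('c' vs 'd').
df1.loc[:, (slice(None), 'c*d')] = (df1.loc[:, (slice(None), 'c')].to_numpy()
                                    * df1.loc[:, (slice(None), 'd')].to_numpy())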

Subtracting values from a column on a specific condition and getting a new DataFrame

I have a dataset with two columns, and I would like to do some operations on a particular column to get a new dataframe. Consider this as my dataset:
A B
1 01
1 56
1 89
1 108
2 23
2 36
2 89
3 13
4 45
I would like to perform two operations on column B and create a dataframe with the resulting 2 columns. The 1st column should be the highest B value in each A group minus its lowest: for A = 1 that is 108 - 1 = 107, for A = 2 it is 89 - 23, and if a group has only a single row it should simply be 0. The 2nd column should be a specific number, assume it to be 125, minus the very first B value in each group, i.e. (125 - 1), (125 - 23), (125 - 13)... We should get something like this:
A C D
1 107 124
2 66 102
3 0 112
4 0 80
I was thinking of using .loc to find the specific position of the value and then subtracting it. How should I do this?
Use agg with 'first' and a custom lambda function, then rename the columns and subtract D from 125:
df = (df.groupby('A')['B'].agg([lambda x: x.max() - x.min(), 'first'])
        .rename(columns={'first': 'D', '<lambda>': 'C'})
        .assign(D=lambda x: 125 - x['D'])
        .reset_index())
print (df)
A C D
0 1 107 124
1 2 66 102
2 3 0 112
3 4 0 80
rename is necessary because groupby agg with a dictionary for renaming is deprecated.
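For newer pandas (0.25+), named aggregation avoids the rename step entirely; a minimal sketch of that variant:
df = (df.groupby('A')['B']
        .agg(C=lambda x: x.max() - x.min(), D='first')
        .assign(D=lambda x: 125 - x['D'])
        .reset_index())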
Another solution:
df = (df.groupby('A')['B'].agg(['min', 'max', 'first'])
        .rename(columns={'first': 'D', 'min': 'C'})
        .assign(D=lambda x: 125 - x['D'], C=lambda x: x['max'] - x['C'])
        .drop('max', axis=1)
        .reset_index())
print (df)
A C D
0 1 107 124
1 2 66 102
2 3 0 112
3 4 0 80
u = df.groupby('A').agg(['max', 'min', 'first'])
u.columns = 'max', 'min', 'first'
u['C'] = u['max'] - u['min']
u['D'] = 125 - u['first']
del u['min']
del u['max']
del u['first']
u.reset_index()
# A C D
#0 1 107 124
#1 2 66 102
#2 3 0 112
#3 4 0 80
You could do:
In [1494]: df.groupby('A', as_index=False).B.agg(
               {'C': lambda x: x.max() - x.min(), 'D': lambda x: 125 - x.iloc[0]})
Out[1494]:
A C D
0 1 107 124
1 2 66 102
2 3 0 112
3 4 0 80

Set values in a multi-level pandas data frame python

I have been working with multi-level DataFrames pretty recently, and I have found that they can significantly reduce computation time for large data sets. For example, consider the simple data frame:
df = pd.DataFrame([
    [1, 111, 0], [2, 222, 0], [1, 111, 0],
    [2, 222, 1], [1, 111, 1], [2, 222, 2]
], columns=["ID", "A", "B"], index=[1, 1, 2, 2, 3, 3])
df.head(6)
ID A B
1 1 111 0
1 2 222 0
2 1 111 0
2 2 222 1
3 1 111 1
3 2 222 2
which can be pivoted by ID to create a multi-level data frame:
pivot_df = df.pivot(columns="ID")
pivot_df.head()
A B
ID 1 2 1 2
1 111 222 0 0
2 111 222 0 1
3 111 222 1 2
The great thing about having my data in this format is that I can perform "vector" operations across all IDs simply by referencing the level 0 columns:
pivot_df["A"] * (1 + pivot_df["B"])**2
ID 1 2
1 111 222
2 111 888
3 444 1998
These operations are really helpful for me! In real life, my computations are much more complex and need to be performed for > 1000 IDs. A common DataFrame size that I work with contains 10 columns (at level 0) with 1000 IDs (at level 1) with 350 rows.
I am interested in figuring out how to do two things: update the values for a particular field in this pivoted DataFrame, and create a new column for this DataFrame. Something like
pivot_df["A"] = pivot_df["A"] * (1 + pivot_df["B"])**2
or
pivot_df["C"] = pivot_df["A"] * (1 + pivot_df["B"])**2
I do not get any errors when I perform either of these, but the DataFrame remains unchanged. I have also tried using .loc and .iloc, but I am having no success.
I think that the problem is maintaining the multi-level structure of the computed DataFrames, but I am pretty new to using multi-level DataFrames and not sure how to solve this problem efficiently. I have a clumsy workaround that is not efficient (create a dictionary of computed DataFrames and then merge them all together)...
df_dict = OrderedDict()
df_dict["A"] = pivot_df["A"]
df_dict["B"] = pivot_df["B"]
df_dict["C"] = pivot_df["A"] * (1 + pivot_df["B"])**2
dfs = [val.T.set_index(np.repeat(key, val.shape[1]), append=True).T for key, val in df_dict.iteritems()]
final_df = reduce(lambda x, y: pd.merge(x, y, left_index=True, right_index=True), dfs)
final_df.columns = final_df.columns.swaplevel(0, 1)
or similarly,
df_dict = OrderedDict()
df_dict["A"] = pivot_df["A"] * (1 + pivot_df["B"])**2
df_dict["B"] = pivot_df["B"]
dfs = [val.T.set_index(np.repeat(key, val.shape[1]), append=True).T for key, val in df_dict.iteritems()]
final_df = reduce(lambda x, y: pd.merge(x, y, left_index=True, right_index=True), dfs)
final_df.columns = final_df.columns.swaplevel(0, 1)
This is not necessarily clunky (I was kind of proud of the workaround), but this is certainly not efficient or computationally optimized. Does anyone have any recommendations?
Option 1
Don't pivot first!
You stated that it was convenient to pivot because you could perform vector calculations in the new pivoted form. This is a misrepresentation, because you could just as easily have performed those calculations prior to the pivot.
df['C'] = df["A"] * (1 + df["B"]) ** 2
df.pivot(columns='ID')
A B C
ID 1 2 1 2 1 2
1 111 222 0 0 111 222
2 111 222 0 1 111 888
3 111 222 1 2 444 1998
Or in a piped one-liner if you prefer
df.assign(C=df.A * (1 + df.B) ** 2).pivot(columns='ID')
A B C
ID 1 2 1 2 1 2
1 111 222 0 0 111 222
2 111 222 0 1 111 888
3 111 222 1 2 444 1998
Option 2
pd.concat
But to answer your question...
pdf = df.pivot(columns='ID')
pd.concat([
pdf.A, pdf.B, pdf.A * (1 + pdf.B) ** 2
], axis=1, keys=['A', 'B', 'C'])
A B C
ID 1 2 1 2 1 2
1 111 222 0 0 111 222
2 111 222 0 1 111 888
3 111 222 1 2 444 1998
Option 3
more pd.concat
Add another level to columns prior to concat
pdf = df.pivot(columns='ID')
c = pdf.A * (1 + pdf.B) ** 2
c.columns = [['C'] * len(c.columns), c.columns]
pd.concat([pdf, c], axis=1)
A B C
ID 1 2 1 2 1 2
1 111 222 0 0 111 222
2 111 222 0 1 111 888
3 111 222 1 2 444 1998
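As a small generalization of Option 3, the concat pattern can be wrapped in a helper; this is just a sketch and the helper name (add_block) is hypothetical:
import pandas as pd

def add_block(pivoted, name, values):
    # Give the level-1 result a new top-level label and append it to the pivoted frame.
    values = values.copy()
    values.columns = pd.MultiIndex.from_product([[name], values.columns])
    return pd.concat([pivoted, values], axis=1)

pdf = df.pivot(columns='ID')
pdf = add_block(pdf, 'C', pdf['A'] * (1 + pdf['B']) ** 2)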
