I have a data frame df1 like this:
         A    B  C ...
mean    10  100  1
std     11  110  2
median  12  120  3
I want to make another df with a separate column for each (df1 column, header-row name) pair:
A-mean  A-std  A-median  B-mean  B-std  B-median  C-mean  C-std  C-median ...
    10     11        12     100    110       120       1      2         3
Basically I have used the pandas.DataFrame.describe function and now I would like to flatten the result this way.
You can unstack your DataFrame into a Series, flatten the Index, turn it back into a DataFrame and transpose the result.
out = (
    df.unstack()
      .pipe(lambda s: s.set_axis(s.index.map('-'.join)))
      .to_frame().T
)
print(out)
A-mean A-std A-median B-mean B-std B-median C-mean C-std C-median
0 10 11 12 100 110 120 1 2 3
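The same steps can also be written imperatively, without pipe (a sketch, assuming df is the describe() output from the question):

s = df.unstack()                 # Series with a (column, stat) MultiIndex
s.index = s.index.map('-'.join)  # flatten to 'A-mean', 'A-std', ...
out = s.to_frame().T             # single-row DataFrame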
I have two dataframes, just like below.
Dataframe1:
country  type  start_week  end_week
      1     a          12        13
      2     b          13        14
Dataframe2:
country  type  week  value
      1     a    12   1000
      1     a    13    900
      1     a    14    800
      2     b    12   1000
      2     b    13    900
      2     b    14    800
I want to add a column to the first dataframe with the mean value from the second dataframe for each key (country + type), taken over the weeks between start_week and end_week.
The desired output looks like this:
country  type  start_week  end_week  avg
      1     a          12        13  950
      2     b          13        14  850
Here is one way:

combined = df1.merge(df2, on=['country', 'type'])
combined = combined.loc[(combined.start_week <= combined.week) & (combined.week <= combined.end_week)]
output = combined.groupby(['country', 'type', 'start_week', 'end_week'])['value'].mean().reset_index()
output:

   country type  start_week  end_week  value
0        1    a          12        13  950.0
1        2    b          13        14  850.0
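If you want the column named avg as in the desired output, you can rename it at the end (a small addition to the snippet above):

output = output.rename(columns={'value': 'avg'})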
You can use pd.melt and compare the two frames' numpy arrays row by row.

# melt df1 so start_week and end_week become rows under a single 'week' column
melted_df1 = df1.melt(id_vars=['country', 'type'], value_name='week')[['country', 'type', 'week']]

# nested loop comparing rows of the two dataframe arrays
result = []
for i in df2.values:
    for j in melted_df1.values:
        if (j == i[:3]).all():
            result.append(i)
            break

# compute the mean of the matched rows per type
# (infer_objects restores the numeric dtypes lost by going through .values)
result_df = (pd.DataFrame(result, columns=df2.columns)
               .infer_objects()
               .groupby('type').mean(numeric_only=True)
               .reset_index()['value'])

# assign the result back to df1
df1['avg'] = result_df
  country type  start_week  end_week    avg
0       1    a          12        13  950.0
1       2    b          13        14  850.0
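Note that the melted frame contains only the interval endpoints, so weeks strictly between start_week and end_week are never matched; it happens to work here because each interval spans exactly two weeks. A range-aware variant (a sketch) could expand each interval into one row per week before merging:

# expand each (country, type) interval into one row per week
expanded = (df1.assign(week=df1.apply(lambda r: list(range(r.start_week, r.end_week + 1)), axis=1))
               .explode('week')
               .astype({'week': int}))
# merge on the exact week and average per interval
avg = (expanded.merge(df2, on=['country', 'type', 'week'])
               .groupby(['country', 'type', 'start_week', 'end_week'])['value']
               .mean()
               .reset_index(name='avg'))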
I have a pandas data frame as follows:
 id  group  type  action   cost
101  A      1              10
101  A      1     repair    3
102  B      1               5
102  B      1     repair    7
102  B      1     grease    2
102  B      1     inflate   1
103  A      2              12
104  B      2               9
I need to reshape it from long to wide, but depending on the value of the action column, as follows:
 id  group  type  action_std  action_extra
101  A      1             10             3
102  B      1              5            10
103  A      2             12             0
104  B      2              9             0
In other words, for rows with an empty action field the cost value should go under the action_std column, while for rows with a non-empty action field the cost values should be summed under the action_extra column.
I've attempted several combinations of groupby / agg / pivot but I cannot find a fully working solution...
I would suggest you simply split the cost column into a cost and a cost_extra column. Something like the following:

import numpy as np

result = df.assign(
    # cost where an action is present, else NaN
    cost_extra=lambda df: np.where(df['action'].notnull(), df['cost'], np.nan)
).assign(
    # cost where no action is present, else NaN
    cost=lambda df: np.where(df['action'].isnull(), df['cost'], np.nan)
).groupby(
    ["id", "group", "type"]
)[["cost", "cost_extra"]].agg(  # note the double brackets: list-of-columns selection
    "sum"
)
result looks like:
                cost  cost_extra
id  group type
101 A     1     10.0         3.0
102 B     1      5.0        10.0
103 A     2     12.0         0.0
104 B     2      9.0         0.0
You can use groupby with unstack (this assumes the empty action cells are empty strings):

df.cost.groupby([df.id, df.group, df.type, df.action.eq('')]).sum().unstack(fill_value=0)
action False True
id group type
101 A 1 3 10
102 B 1 10 5
103 A 2 0 12
104 B 2 0 9
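To get the column names from the question, you could rename the boolean labels and flatten the index afterwards (a small extension of the snippet above):

out = (df.cost.groupby([df.id, df.group, df.type, df.action.eq('')])
         .sum()
         .unstack(fill_value=0)
         .rename(columns={True: 'action_std', False: 'action_extra'})
         .reset_index())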
Thanks for your hints, this is the solution that I finally like the most (also for its simplicity):
df["action_std"] = df["cost"].where(df["action"] == "")
df["action_extra"] = df["cost"].where(df["action"] != "")
df = df.groupby(["id", "group", "type"])[["action_std", "action_extra"]].sum().reset_index()
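If the empty action cells are NaN rather than empty strings (which depends on how the frame was built), the masks would use isna()/notna() instead:

df["action_std"] = df["cost"].where(df["action"].isna())
df["action_extra"] = df["cost"].where(df["action"].notna())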
I am trying to find some elegant ways of rearranging a pandas dataframe.
My initial dataframe looks like this:
   PS  PSS  10PS  10PSS  5PS  5PSS
1   6  263     5     23    2   101
2   5   49     2     30    1    30
The desired arrangement would be:

   1-PS  1-PSS  1-10PS  1-10PSS  1-5PS  1-5PSS  2-PS  2-PSS  2-10PS  2-10PSS  2-5PS  2-5PSS
A     6    263       5       23      2     101     5     49       2       30      1      30

where A is a new index and the rows are merged into the columns.
You need stack here, then join the MultiIndex levels into single labels:
s=df.stack().to_frame('A')
s.index=s.index.map('{0[0]}-{0[1]}'.format)
s.T
Out[42]:
1-PS 1-PSS 1-10PS 1-10PSS 1-5PS 1-5PSS 2-PS 2-PSS 2-10PS 2-10PSS \
A 6 263 5 23 2 101 5 49 2 30
2-5PS 2-5PSS
A 1 30
Hopefully these lines can help you out:
# Put a pandas Series for each row in a generator
series = (pd.Series(i, index=['{}-{}'.format(ind, x) for x in df.columns])
          for ind, i in zip(df.index, df.values))
# Concatenate and convert to frame + transpose
df = pd.concat(series).to_frame('A').T
Full example:

import pandas as pd
from io import StringIO  # pd.compat.StringIO was removed in newer pandas

data = '''\
index PS PSS 10PS 10PSS 5PS 5PSS
1 6 263 5 23 2 101
2 5 49 2 30 1 30'''
df = pd.read_csv(StringIO(data), sep=r'\s+').set_index('index')

# Put a pandas Series for each row in a generator
series = (pd.Series(i, index=['{}-{}'.format(ind, x) for x in df.columns])
          for ind, i in zip(df.index, df.values))
# Concatenate and convert to frame + transpose
df = pd.concat(series).to_frame('A').T
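For reference, printing the result of the full example should give the single row requested (reconstructed from the sample data):

print(df)
#    1-PS  1-PSS  1-10PS  1-10PSS  1-5PS  1-5PSS  2-PS  2-PSS  2-10PS  2-10PSS  2-5PS  2-5PSS
# A     6    263       5       23      2     101     5     49       2       30      1      30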
First, I show the pandas dataframe to elucidate my problem.
import pandas as pd
mi = pd.MultiIndex.from_product([["A","B"],["c","d"]], names=['lv1', 'lv2'])
df1 = pd.DataFrame([[1,2,3,4],[5,6,7,8],[9,10,11,12]],columns=mi)
This Python code creates a dataframe (df1) like this:
#input dataframe
lv1 A B
lv2 c d c d
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
I want to add a 'c*d' column on lv2 for each lv1 group, computed from df1's data, like this:
#output dataframe after calculation
lv1 A B
lv2 c d c*d c d c*d
0 1 2 2 3 4 12
1 5 6 30 7 8 56
2 9 10 90 11 12 132
For this problem, I wrote some code like this:

for l1 in mi.levels[0]:
    df1.loc[:, (l1, "c*d")] = df1.loc[:, (l1, "c")] * df1.loc[:, (l1, "d")]
df1.sort_index(axis=1, inplace=True)
Although this code almost solves my problem, I really want to write it without the 'for' statement, like this:
df1.loc[:,(slice(None),"c*d")]=df1.loc[:,(slice(None),"c")]*df1.loc[:,(slice(None),"d")]
With this statement, I got a KeyError saying 'c*d' is missing.
Is there no syntax sugar for this calculation? Or can I achieve better performance by other code?
A slightly improved version of your solution:

for l1 in mi.levels[0]:
    df1.loc[:, (l1, "c*d")] = df1.loc[:, (l1, "c")] * df1.loc[:, (l1, "d")]

mux = pd.MultiIndex.from_product([df1.columns.levels[0], ['c', 'd', 'c*d']])
df1 = df1.reindex(columns=mux)
print(df1)

    A            B
    c   d  c*d   c   d  c*d
0   1   2    2   3   4   12
1   5   6   30   7   8   56
2   9  10   90  11  12  132
Another solution with stack and unstack (note c_d instead of c*d, because assign keywords must be valid identifiers):

mux = pd.MultiIndex.from_product([df1.columns.levels[0], ['c', 'd', 'c_d']])
df1 = (df1.stack(0)
          .assign(c_d=lambda x: x.prod(1))
          .unstack()
          .swaplevel(0, 1, 1)
          .reindex(columns=mux))
print(df1)

    A            B
    c   d  c_d   c   d  c_d
0   1   2    2   3   4   12
1   5   6   30   7   8   56
2   9  10   90  11  12  132
Another option: select the 'c' and 'd' levels with xs, multiply them, and join the result back:

df2 = df1.xs("c", axis=1, level=1).mul(df1.xs("d", axis=1, level=1))
df2.columns = pd.MultiIndex.from_product([df2.columns, ['c*d']])
print (df2)
A B
c*d c*d
0 2 12
1 30 56
2 90 132
mux = pd.MultiIndex.from_product([df2.columns.levels[0], ['c','d','c*d']])
df = df1.join(df2).reindex(columns=mux)
print (df)
A B
c d c*d c d c*d
0 1 2 2 3 4 12
1 5 6 30 7 8 56
2 9 10 90 11 12 132
An explanation of jezrael's stack answer, which may be the most idiomatic way in pandas:
output = (
    df1
    # "Stack" the data by moving the top level ('lv1') of the
    # column MultiIndex into the row index; now the rows are a
    # MultiIndex and the columns are a regular Index.
    .stack(0)
    # Since we only have two columns now, 'lv2' ('c' & 'd'),
    # we can multiply them together along the row axis.
    # The assign method takes key=value pairs mapping new column
    # names to the function used to calculate them. Here we're
    # wrapping them in a dictionary and unpacking it with **
    # because 'c*d' is not a valid keyword argument.
    .assign(**{'c*d': lambda x: x.product(axis=1)})
    # Undoes the stack operation, moving 'lv1' back to the
    # column index, but now as the bottom level of the column index.
    .unstack()
    # This sets the order of the column MultiIndex levels.
    # Since they are named we can use the names; you can also use
    # their integer positions instead. Here axis=1 references
    # the column index.
    .swaplevel('lv1', 'lv2', axis=1)
    # Sort the values in both levels of the column MultiIndex.
    # This orders them as c, c*d, d, which is not what you
    # specified above; however, having a sorted MultiIndex is
    # required for indexing via .loc[:, (...)] to work properly.
    .sort_index(axis=1)
)
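As a quick check, with the sorted column MultiIndex the slice-based selection from the question now works for reading (values reconstructed from the sample data):

print(output.loc[:, (slice(None), 'c*d')])
# lv1   A    B
# lv2 c*d  c*d
# 0     2   12
# 1    30   56
# 2    90  132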
I have a table in a pandas dataframe df:

Leafid  pidx  pidy  value
   100     1     3     10
   100     2     6     12
   200     5     7     48
   300     7     1     11
I have another dataframe df2:

pid  price
  1     10
  2     20
  3     30
  4     40
  5     50
  6     60
  7     70
I want to merge df and df2 so that I have two more columns, price_pidx and price_pidy, and then also compute price_pidy / price_pidx.
For example:

Leafid  pidx  pidy  value  price_pidx  price_pidy  price_pidy/price_pidx
   100     1     3     10          10          30                      3

My final df should have the columns

pidx  pidy  value  price_pidy/price_pidx

I don't want to use .map() for this.
Is there any way to do it using pd.merge? I know how to bring in one column, price_pidx, e.g.

pd.merge(df, df2[['pid', 'price']], how='left', left_on='pidx', right_on='pid')

but how to bring in both price_pidx and price_pidy?
Without map it is complicated, because you need to reshape with melt, then merge, and finally unstack:

# drop Leafid first so that only pidx and pidy are melted
df = pd.melt(df.drop(columns='Leafid'), id_vars='value', value_name='pid', var_name='g')
df2 = pd.merge(df, df2[['pid', 'price']], how='left', on='pid')
df2 = df2.set_index(['value', 'g']).unstack()
df2.columns = ['_'.join(col) for col in df2.columns]
df2['col'] = df2.price_pidy / df2.price_pidx
df2 = df2.rename(columns={'pid_pidx': 'pidx', 'pid_pidy': 'pidy'})
print(df2)
       pidx  pidy  price_pidx  price_pidy       col
value
10        1     3          10          30  3.000000
11        7     1          70          10  0.142857
12        2     6          20          60  3.000000
48        5     7          50          70  1.400000
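For completeness, the double merge the question hints at also works without map; a sketch (each merge brings in the price for one of the two pid columns):

out = (df.merge(df2.rename(columns={'pid': 'pidx', 'price': 'price_pidx'}), on='pidx')
         .merge(df2.rename(columns={'pid': 'pidy', 'price': 'price_pidy'}), on='pidy'))
out['price_pidy/price_pidx'] = out['price_pidy'] / out['price_pidx']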