Let's suppose I have:
ID A1 B1 A2 B2
1 3 4 5 6
2 7 8 9 10
I want to use pandas stack and want to achieve something like this:
ID A B
1 3 4
1 5 6
2 7 8
2 9 10
but what I get is:
ID A B
1 3 4
2 7 8
1 5 6
2 9 10
This is what I am using:
df.stack().reset_index()
Is it possible to achieve something like this using stack()? The append() method in pandas does this, but if possible I want to achieve it using stack(). Any ideas?
You can use pd.wide_to_long:
pd.wide_to_long(df, ['A','B'], 'ID', 'value', sep='', suffix='.+')\
.reset_index()\
.sort_values('ID')\
.drop('value', axis=1)
Output:
ID A B
0 1 3 4
2 1 5 6
1 2 7 8
3 2 9 10
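For reference, a runnable sketch of this answer; the sample frame below is reconstructed from the question:

```python
import pandas as pd

# Reconstruct the question's wide frame
df = pd.DataFrame({'ID': [1, 2],
                   'A1': [3, 7], 'B1': [4, 8],
                   'A2': [5, 9], 'B2': [6, 10]})

# wide_to_long melts the A*/B* column pairs; suffix='.+' captures the
# trailing digit into the new 'value' level, which we then drop
out = (pd.wide_to_long(df, ['A', 'B'], 'ID', 'value', sep='', suffix='.+')
         .reset_index()
         .sort_values('ID')
         .drop('value', axis=1))
```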
Create a new columns object by splitting the existing column names into tuples. This assumes single-character letters followed by a single digit.
d = df.set_index('ID')
d.columns = d.columns.map(tuple)
d.stack().reset_index('ID')
ID A B
1 1 3 4
2 1 5 6
1 2 7 8
2 2 9 10
One-line
df.set_index('ID').rename(columns=tuple).stack().reset_index('ID')
More generalized
d = df.set_index('ID')
s = d.columns.str
d.columns = [
    s.extract(r'^(\D+)', expand=False),
    s.extract(r'(\d+)$', expand=False)
]
d.stack().reset_index('ID')
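Put together, a runnable sketch of this generalized approach, with the sample frame reconstructed from the question:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2],
                   'A1': [3, 7], 'B1': [4, 8],
                   'A2': [5, 9], 'B2': [6, 10]})

d = df.set_index('ID')
s = d.columns.str
# Split each name into its letter prefix and digit suffix -> MultiIndex columns,
# then stack the digit level back into the row index
d.columns = [s.extract(r'^(\D+)', expand=False),
             s.extract(r'(\d+)$', expand=False)]
out = d.stack().reset_index('ID')
```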
A more interesting way (note that groupby(..., axis=1) is deprecated in recent pandas):
s = df.set_index('ID')
s.groupby(s.columns.str[0], axis=1).agg(lambda x: x.values.tolist()).stack().apply(pd.Series).unstack(0).T.reset_index(level=0, drop=True)
Out[90]:
A B
ID
1 3 4
2 7 8
1 5 6
2 9 10
Related
I have a pandas DataFrame that looks like this:
a b c
8 3 3
4 3 3
5 3 3
1 9 4
7 3 1
1 3 3
6 3 3
9 7 7
1 7 7
I want to get a DataFrame like this:
a b c
17 3 3
1 9 4
7 3 1
7 3 3
10 7 7
Essentially, I want to add together the values in column a when the values in columns b and c are the same, but I want to do that in sections. groupby wouldn't work here because it would put the DataFrame out of order. I have an iterative solution, but it is messy and not very Pythonic. Is there a way to do this using the functions of the DataFrame?
Let us do shift with cumsum to create the sub-group key:
s = df[['b','c']].ne(df[['b','c']].shift()).all(1).cumsum()
out = df.groupby([s,df.b,df.c]).agg({'a':'sum','b':'first','c':'first'}).reset_index(drop=True)
a b c
0 17 3 3
1 1 9 4
2 7 3 1
3 7 3 3
4 10 7 7
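As a runnable check, assuming the question's frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [8, 4, 5, 1, 7, 1, 6, 9, 1],
                   'b': [3, 3, 3, 9, 3, 3, 3, 7, 7],
                   'c': [3, 3, 3, 4, 1, 3, 3, 7, 7]})

# s increments whenever both b and c change versus the previous row;
# grouping also on b and c separates runs where only one of them changed
s = df[['b', 'c']].ne(df[['b', 'c']].shift()).all(1).cumsum()
out = (df.groupby([s, df.b, df.c])
         .agg({'a': 'sum', 'b': 'first', 'c': 'first'})
         .reset_index(drop=True))
```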
Try this:
df.groupby(['b', 'c', df[['b', 'c']].diff().any(axis=1).cumsum()], as_index=False)['a'].sum()
Output:
b c a
0 3 1 7
1 3 3 17
2 3 3 7
3 7 7 10
4 9 4 1
I need to calculate a column based on other rows. Basically I want new_column to be the sum of base_column over all rows with the same id.
I currently do the following (but it is not really efficient); what is the most efficient way to achieve this?
def calculate(x):
    # in fact my filter is more complex: basically same id and date in the last 4 weeks
    filtered_df = df[df["id"] == df.at[x.name, "id"]]
    df.at[x.name, "new_column"] = filtered_df["base_column"].sum()
df.apply(calculate, axis=1)
You can do as below:
df['new_column']= df.groupby('id')['base_column'].transform('sum')
input
id base_column
0 1 2
1 1 4
2 2 5
3 3 6
4 5 7
5 7 4
6 7 5
7 7 3
output
id base_column new_column
0 1 2 6
1 1 4 6
2 2 5 5
3 3 6 6
4 5 7 7
5 7 4 12
6 7 5 12
7 7 3 12
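A runnable sketch of the transform approach, using the input above:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 3, 5, 7, 7, 7],
                   'base_column': [2, 4, 5, 6, 7, 4, 5, 3]})

# transform('sum') computes one sum per id group and broadcasts it
# back onto every row of that group
df['new_column'] = df.groupby('id')['base_column'].transform('sum')
```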
Another way to do this is to use groupby and merge
import pandas as pd
df = pd.DataFrame({'id':[1,1,2],'base_column':[2,4,5]})
# compute sum by id
sum_base = df.groupby("id").agg({"base_column": 'sum'}).reset_index().rename(columns={'base_column': 'new_column'})
# join the result to df
df = pd.merge(df,sum_base,how='left',on='id')
# id base_column new_column
#0 1 2 6
#1 1 4 6
#2 2 5 5
import pandas as pd
df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9],[10,11,12]],columns=['A','B','C'])
df[df['B']%2 ==0]['C'] = 5
I am expecting this code to change the value of column C to 5 wherever B is even, but it is not working.
It returns the table as follows:
A B C
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
I am expecting it to return
A B C
0 1 2 5
1 4 5 6
2 7 8 5
3 10 11 12
If you need to change values of a column in a DataFrame, use DataFrame.loc with the condition and the column name:
df.loc[df['B']%2 ==0, 'C'] = 5
print (df)
A B C
0 1 2 5
1 4 5 6
2 7 8 5
3 10 11 12
Your solution is a nice example of chained indexing - see the docs.
You could just change the order to:
df['C'][df['B']%2 == 0] = 5
and it also works, although this still relies on chained assignment, which stops working under copy-on-write (the default in pandas 3.0), so the .loc approach is preferred.
Using numpy.where:
import numpy as np

df['C'] = np.where(df['B']%2 == 0, 5, df['C'])
Output
A B C
0 1 2 5
1 4 5 6
2 7 8 5
3 10 11 12
I want to group a DataFrame and get the nlargest values of column 'C', but the return is a Series, not a DataFrame.
dftest = pd.DataFrame({'A':[1,2,3,4,5,6,7,8,9,10],
'B':['A','B','A','B','A','B','A','B','B','B'],
'C':[0,0,1,1,2,2,3,3,4,4]})
dfn=dftest.groupby('B',group_keys=False)\
.apply(lambda grp:grp['C'].nlargest(int(grp['C'].count()*0.8))).sort_index()
The result is a Series:
2 1
4 2
5 2
6 3
7 3
8 4
9 4
Name: C, dtype: int64
I want the result to be a DataFrame, like:
A B C
2 3 A 1
4 5 A 2
5 6 B 2
6 7 A 3
7 8 B 3
8 9 B 4
9 10 B 4
Update: sorry, column 'A' does not in fact contain sequential integers; dftest might be more like
dftest = pd.DataFrame({'A':['Feb','Flow','Air','Flow','Feb','Beta','Cat','Feb','Beta','Air'],
'B':['A','B','A','B','A','B','A','B','B','B'],
'C':[0,0,1,1,2,2,3,3,4,4]})
and the result should be
A B C
2 Air A 1
4 Feb A 2
5 Beta B 2
6 Cat A 3
7 Feb B 3
8 Beta B 4
9 Air B 4
It may be a bit clumsy, but it does what you asked:
dfn = dftest.groupby('B').apply(lambda grp: grp['C'].nlargest(int(grp['C'].count()*0.8))).reset_index().rename(columns={'level_1': 'A'})
dfn.A = dfn.A + 1
dfn = dfn[['A','B','C']].sort_values(by='A')
Thanks to my friends, the following code works for me:
dfn=dftest.groupby('B',group_keys=False)\
.apply(lambda grp:grp.nlargest(n=int(grp['C'].count()*0.8),columns='C').sort_index())
the dfn is
In [8]:dfn
Out[8]:
A B C
2 3 A 1
4 5 A 2
6 7 A 3
5 6 B 2
7 8 B 3
8 9 B 4
9 10 B 4
My previous code dealt with a Series; the later one deals with a DataFrame.
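A runnable sketch with the updated (string-valued) dftest from the question:

```python
import pandas as pd

dftest = pd.DataFrame({'A': ['Feb', 'Flow', 'Air', 'Flow', 'Feb',
                             'Beta', 'Cat', 'Feb', 'Beta', 'Air'],
                       'B': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'B', 'B'],
                       'C': [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]})

# Calling nlargest on the whole group frame keeps every column, so the
# result stays a DataFrame instead of collapsing to a Series
dfn = (dftest.groupby('B', group_keys=False)
             .apply(lambda grp: grp.nlargest(n=int(grp['C'].count() * 0.8),
                                             columns='C').sort_index()))
```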
df = pd.DataFrame({'a':[1,2,3,4],'b':[5,6,7,8],'c':[9,10,11,12]})
How can I insert a new row of zeros at index 0 in a single line?
I tried pd.concat([pd.DataFrame([[0,0,0]]), df]) but it did not work.
The desired output:
a b c
0 0 0 0
1 1 5 9
2 2 6 10
3 3 7 11
4 4 8 12
You can concat the temp df with the original df, but you need to pass the same column names so that it aligns in the concatenated df. Additionally, to get the index you desire, call reset_index with the drop=True param.
In [87]:
pd.concat([pd.DataFrame([[0,0,0]], columns=df.columns),df]).reset_index(drop=True)
Out[87]:
a b c
0 0 0 0
1 1 5 9
2 2 6 10
3 3 7 11
4 4 8 12
Alternatively to EdChum's solution you can do this (note that DataFrame.append was removed in pandas 2.0, so this only works on older versions):
In [163]: pd.DataFrame([[0,0,0]], columns=df.columns).append(df, ignore_index=True)
Out[163]:
a b c
0 0 0 0
1 1 5 9
2 2 6 10
3 3 7 11
4 4 8 12
An answer more specific to the dataframe being prepended to:
pd.concat([df.iloc[[0], :] * 0, df]).reset_index(drop=True)
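A quick runnable check of this trick, with the frame from the question:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4],
                   'b': [5, 6, 7, 8],
                   'c': [9, 10, 11, 12]})

# Multiplying the first row by 0 yields a row of zeros with the right
# column labels and dtypes, ready to prepend
out = pd.concat([df.iloc[[0], :] * 0, df]).reset_index(drop=True)
```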