Cannot re-add column to pandas multi-index dataframe after deletion - python

It seems odd that after deleting a column, I cannot add it back under the same name. I create a simple dataframe with multi-level column labels, add a new column using only a level-0 name, and then delete it.
>>> import pandas as pd
>>> df = pd.DataFrame([[1,2,3],[4,5,6]])
>>> df.columns=[['a','b','c'],['e','f','g']]
>>> print(df)
a b c
e f g
0 1 2 3
1 4 5 6
>>> df['d'] = df.c+2
>>> print(df)
a b c d
e f g
0 1 2 3 5
1 4 5 6 8
>>> del df['d']
>>> print(df)
a b c
e f g
0 1 2 3
1 4 5 6
Now I try to add it again, but it has no effect, and no error or warning is shown.
>>> df['d'] = df.c+2
>>> print(df)
a b c
e f g
0 1 2 3
1 4 5 6
Is this expected behaviour? Should I file a bug report with the pandas project? The issue does not occur if I add the 'd' column with both levels specified, like this:
df['d', 'x'] = df.c+2
Thanks,
PS: Python is 2.7.14 and pandas 0.20.1

The problem is that the MultiIndex levels are not removed after calling del:
del df['d']
print(df)
a b c
e f g
0 1 2 3
1 4 5 6
Check columns:
print (df.columns)
MultiIndex(levels=[['a', 'b', 'c', 'd'], ['e', 'f', 'g', '']],
labels=[[0, 1, 2], [0, 1, 2]])
The solution is to remove the stale levels with MultiIndex.remove_unused_levels:
df.columns = df.columns.remove_unused_levels()
print (df.columns)
MultiIndex(levels=[['a', 'b', 'c'], ['e', 'f', 'g']],
labels=[[0, 1, 2], [0, 1, 2]])
df['d'] = df.c+2
print (df)
a b c d
e f g
0 1 2 3 5
1 4 5 6 8
Another solution is to assign with the full key. A tuple is needed to select a MultiIndex column:
df[('d', '')] = df.c+2
print (df)
a b c d
e f g
0 1 2 3 5
1 4 5 6 8
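The stale level can be checked directly on the column index. This is a minimal sketch, assuming a recent pandas on Python 3; the names stale and clean are only for illustration, and whether the silent failure from the question still reproduces may depend on the pandas version.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
df.columns = [['a', 'b', 'c'], ['e', 'f', 'g']]

# Add a column under the full key ('d', ''), then delete it again.
df[('d', '')] = df[('c', 'g')] + 2
del df[('d', '')]

# The deleted label is still listed among the level values.
stale = 'd' in df.columns.levels[0]

# Drop level values that no column refers to any more.
df.columns = df.columns.remove_unused_levels()
clean = 'd' not in df.columns.levels[0]

print(stale, clean)
```

Checking df.columns.levels rather than print(df) matters here, because the printed frame looks identical before and after the cleanup.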

Related

Join two columns of sequentially values

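I have a dataframe where column A holds the messages and column B marks the sender: 1 is the client, 2 is the admin.
I need to join consecutive rows with 1 (client messages) into one value, and likewise join consecutive rows with 2 (admin responses), across the whole dataframe.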
df1 = pd.DataFrame({'A' : ['a', 'b', 'c', 'd', 'e', 'f', 'h', 'j', 'de', 'be'],
'B' : [1, 1, 2, 1, 1, 1, 2, 2, 1, 2]})
df1
A B
0 a 1
1 b 1
2 c 2
3 d 1
4 e 1
5 f 1
6 h 2
7 j 2
8 de 1
9 be 2
I need to get in the end this dataframe:
df2 = pd.DataFrame({'A' : ['a, b', 'd, e, f', 'de'],
'B' : ['c', 'h, j', 'be' ]})
Out:
A B
0 a,b c
1 d,e,f h,j
2 de be
I do not know how to do this
Create groups from consecutive values in B: compare each value with the shifted column and take the cumulative sum, then aggregate with first and join. Add a helper column C so the result can be pivoted with DataFrame.pivot in the next step.
This solution works if runs of 1 and 2 alternate in order; values may repeat within a run.
df = (df1.groupby(df1['B'].ne(df1['B'].shift()).cumsum())
.agg(B = ('B','first'), A= ('A', ','.join))
.assign(C = lambda x: x['B'].eq(1).cumsum()))
print (df)
B A C
B
1 1 a,b 1
2 2 c 1
3 1 d,e,f 2
4 2 h,j 2
5 1 de 3
6 2 be 3
df = (df.pivot(index='C', columns='B', values='A')
.rename(columns={1:'A',2:'B'})
.reset_index(drop=True).rename_axis(None, axis=1))
print (df)
A B
0 a,b c
1 d,e,f h,j
2 de be
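The consecutive-run trick used for the grouping key can be looked at on its own. A small sketch, with starts and run_id as illustrative names that are not part of the answer:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['a', 'b', 'c', 'd', 'e', 'f', 'h', 'j', 'de', 'be'],
                    'B': [1, 1, 2, 1, 1, 1, 2, 2, 1, 2]})

# True wherever B differs from the previous row; the first row always
# counts as a change because it is compared with a missing value.
starts = df1['B'].ne(df1['B'].shift())

# The running total of changes gives one id per consecutive run.
run_id = starts.cumsum()

print(run_id.tolist())  # [1, 1, 2, 3, 3, 3, 4, 4, 5, 6]
```

Rows that share a run_id are exactly the rows joined into one cell by the aggregation step.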

How to embed a Series into specific rows in a Dataframe?

For instance, now I have a data frame df initially:
df = pd.DataFrame()
df['A'] = pd.Series([1, 1, 2, 2, 1, 3, 3])
df['B'] = pd.Series(['a', 'b', 'c', 'd', 'e', 'f', 'g'])
df
# A B
1 a
1 b
2 c
2 d
1 e
3 f
3 g
Now I'd like to replace the values in column B, for the rows where column A equals 1, with the list [0, 1, 2]. So, here is my expected result after embedding:
df
# A B
1 0
1 1
2 c
2 d
1 2
3 f
3 g
How to achieve this goal?
df.loc[df['A']==1, 'B'] = [0, 1, 2]
print(df)
Prints:
A B
0 1 0
1 1 1
2 2 c
3 2 d
4 1 2
5 3 f
6 3 g
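The assignment relies on the list having one value per selected row. A hedged sketch of the same idea with an explicit length check; the names mask and new_values and the dtype=object choice are my assumptions, not part of the answer:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 1, 2, 2, 1, 3, 3],
    # dtype=object is an assumption so the column can mix strings and integers.
    'B': pd.Series(['a', 'b', 'c', 'd', 'e', 'f', 'g'], dtype=object),
})

mask = df['A'] == 1
new_values = [0, 1, 2]

# The list is assigned positionally, so its length must equal
# the number of selected rows.
if len(new_values) != mask.sum():
    raise ValueError('need exactly one value per selected row')

df.loc[mask, 'B'] = new_values
print(df)
```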

Latest values based on time column

I have mydf below, which I have sorted on a dummy time column and the id:
mydf = pd.DataFrame(
{
'id': ['A', 'B', 'B', 'C', 'A', 'C', 'A'],
'time': [1, 4, 3, 5, 2, 6, 7],
'val': ['a', 'b', 'c', 'd', 'e', 'f', 'g']
}
).sort_values(['id', 'time'], ascending=False)
mydf
id time val
5 C 6 f
3 C 5 d
1 B 4 b
2 B 3 c
6 A 7 g
4 A 2 e
0 A 1 a
I want to add a column (last_val) which, for each unique id, holds the previous val based on the time column. Rows for which there is no last_val can be dropped. The output in this example would look like:
mydf
id time val last_val
5 C 6 f d
1 B 4 b c
6 A 7 g e
4 A 2 e a
Any ideas?
Use DataFrameGroupBy.shift after sort_values(['id', 'time'], ascending=False) (already done in the question), then remove rows with missing values with DataFrame.dropna:
mydf['last_val'] = mydf.groupby('id')['val'].shift(-1)
mydf = mydf.dropna(subset=['last_val'])
A similar solution, which instead removes the last row of each id by using duplicated on the id column:
mydf['last_val'] = mydf.groupby('id')['val'].shift(-1)
mydf = mydf[mydf['id'].duplicated(keep='last')]
print (mydf)
id time val last_val
5 C 6 f d
1 B 4 b c
6 A 7 g e
4 A 2 e a
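Both variants can be run side by side to confirm they agree on this data. A minimal sketch; shifted, by_dropna and by_duplicated are illustrative names:

```python
import pandas as pd

mydf = pd.DataFrame({
    'id': ['A', 'B', 'B', 'C', 'A', 'C', 'A'],
    'time': [1, 4, 3, 5, 2, 6, 7],
    'val': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
}).sort_values(['id', 'time'], ascending=False)

# Within each id the rows are newest first, so shift(-1) pulls in the next-older val.
shifted = mydf.assign(last_val=mydf.groupby('id')['val'].shift(-1))

# Variant 1: drop rows that have no older value.
by_dropna = shifted.dropna(subset=['last_val'])

# Variant 2: drop the last (oldest) row of every id.
by_duplicated = shifted[shifted['id'].duplicated(keep='last')]

print(by_dropna.equals(by_duplicated))
```

The two variants would differ only if val itself contained missing values, since dropna would then also remove rows whose older value is missing.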

How to get equivalent of pandas melt using groupby + stack?

I have recently been learning groupby and stack, and came across a pandas method called melt. I would like to know how to achieve the same result as melt using groupby and stack.
Here is the MWE:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
'B': [1, 1, 2, 2, 1],
'C': [10, 20, 30, 40, 50],
'D': ['X', 'Y', 'X', 'Y', 'Y']})
df1 = pd.melt(df, id_vars='A',value_vars=['B','C'],var_name='variable',value_name='value')
print(df1)
A variable value
0 1 B 1
1 1 B 1
2 1 B 2
3 2 B 2
4 2 B 1
5 1 C 10
6 1 C 20
7 1 C 30
8 2 C 40
9 2 C 50
How to get the same result using groupby and stack?
My attempt
df.groupby('A')[['B','C']].count().stack(0).reset_index()
This is not quite correct, and I am looking for suggestions.
I guess you do not need groupby, just stack + sort_values:
result = df[['A', 'B', 'C']].set_index('A').stack().reset_index().sort_values(by='level_1')
result.columns = ['A', 'variable', 'value']
Output
A variable value
0 1 B 1
2 1 B 1
4 1 B 2
6 2 B 2
8 2 B 1
1 1 C 10
3 1 C 20
5 1 C 30
7 2 C 40
9 2 C 50
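To check that the stack result really matches melt, the two frames can be compared after resetting the index. A sketch under my own assumptions: the stable sort and the value-by-value comparison are additions, not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': [1, 1, 2, 2, 1],
                   'C': [10, 20, 30, 40, 50],
                   'D': ['X', 'Y', 'X', 'Y', 'Y']})

melted = pd.melt(df, id_vars='A', value_vars=['B', 'C'],
                 var_name='variable', value_name='value')

stacked = df[['A', 'B', 'C']].set_index('A').stack().reset_index()
stacked.columns = ['A', 'variable', 'value']
# A stable sort keeps the original row order inside each variable.
stacked = stacked.sort_values(by='variable', kind='stable').reset_index(drop=True)

# Compare cell by cell, ignoring any dtype differences.
same = melted.values.tolist() == stacked.values.tolist()
print(same)
```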

In pandas, how to create a variable that is n for the nth observation within a group?

Consider this:
df = pd.DataFrame({'B': ['a', 'a', 'b', 'b'], 'C': [1, 2, 6,2]})
df
Out[128]:
B C
0 a 1
1 a 2
2 b 6
3 b 2
I want to create a variable that simply corresponds to the ordering of observations after sorting by 'C' within each groupby('B') group.
df.sort_values(['B','C'])
Out[129]:
B C order
0 a 1 1
1 a 2 2
3 b 2 1
2 b 6 2
How can I do that? I am thinking about creating a column that is one, and using cumsum but that seems too clunky...
I think you can use range with len(df):
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3],
'B': ['a', 'a', 'b'],
'C': [5, 3, 2]})
print(df)
A B C
0 1 a 5
1 2 a 3
2 3 b 2
df.sort_values(by='C', inplace=True)
#or without inplace
#df = df.sort_values(by='C')
print(df)
A B C
2 3 b 2
1 2 a 3
0 1 a 5
df['order'] = range(1,len(df)+1)
print(df)
A B C order
2 3 b 2 1
1 2 a 3 2
0 1 a 5 3
EDIT by comment:
I think you can use groupby with cumcount:
import pandas as pd
df = pd.DataFrame({'B': ['a', 'a', 'b', 'b'], 'C': [1, 2, 6,2]})
df.sort_values(['B','C'], inplace=True)
#or without inplace
#df = df.sort_values(['B','C'])
print(df)
B C
0 a 1
1 a 2
3 b 2
2 b 6
df['order'] = df.groupby('B', sort=False).cumcount() + 1
print(df)
B C order
0 a 1 1
1 a 2 2
3 b 2 1
2 b 6 2
Nothing wrong with Jezrael's answer but there's a simpler (though less general) method in this particular example. Just add groupby to JohnGalt's suggestion of using rank.
>>> df['order'] = df.groupby('B')['C'].rank()
B C order
0 a 1 1.0
1 a 2 2.0
2 b 6 2.0
3 b 2 1.0
In this case, you don't really need the ['C'] but it makes the ranking a little more explicit and if you had other unrelated columns in the dataframe then you would need it.
But if you are ranking by more than 1 column, you should use Jezrael's method.
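The two approaches can be compared on the question's data. This is a sketch only; rank(method='first') with astype(int) is my variation on the rank suggestion, chosen to get whole numbers, and by_cumcount and by_rank are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'B': ['a', 'a', 'b', 'b'], 'C': [1, 2, 6, 2]})

# Sort, number the rows inside each group, then let index alignment
# put the numbers back on the original rows.
by_cumcount = df.sort_values(['B', 'C']).groupby('B').cumcount() + 1

# rank(method='first') breaks ties by order of appearance, giving whole numbers.
by_rank = df.groupby('B')['C'].rank(method='first').astype(int)

df['order'] = by_cumcount
print(df)
```

Plain rank() uses method='average', so tied values of C within a group would receive fractional ranks; that is one reason to prefer cumcount or method='first' when a strict 1..n numbering is wanted.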
