I have mydf below, which I have sorted on a dummy time column and the id:
mydf = pd.DataFrame(
    {
        'id': ['A', 'B', 'B', 'C', 'A', 'C', 'A'],
        'time': [1, 4, 3, 5, 2, 6, 7],
        'val': ['a', 'b', 'c', 'd', 'e', 'f', 'g']
    }
).sort_values(['id', 'time'], ascending=False)
mydf
id time val
5 C 6 f
3 C 5 d
1 B 4 b
2 B 3 c
6 A 7 g
4 A 2 e
0 A 1 a
I want to add a column (last_val) which, for each unique id, holds the latest val based on the time column. Entries for which there is no last_val can be dropped. The output in this example would look like:
mydf
id time val last_val
5 C 6 f d
1 B 4 b c
6 A 7 g e
4 A 2 e a
Any ideas?
Use DataFrameGroupBy.shift after sort_values(['id', 'time'], ascending=False) (already done in the question), then remove the rows with missing values with DataFrame.dropna:
mydf['last_val'] = mydf.groupby('id')['val'].shift(-1)
mydf = mydf.dropna(subset=['last_val'])
A similar solution, which instead removes the last duplicated row per id:
mydf['last_val'] = mydf.groupby('id')['val'].shift(-1)
mydf = mydf[mydf['id'].duplicated(keep='last')]
print (mydf)
id time val last_val
5 C 6 f d
1 B 4 b c
6 A 7 g e
4 A 2 e a
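For reference, the whole approach as a self-contained sketch:

```python
import pandas as pd

mydf = pd.DataFrame({
    'id': ['A', 'B', 'B', 'C', 'A', 'C', 'A'],
    'time': [1, 4, 3, 5, 2, 6, 7],
    'val': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
}).sort_values(['id', 'time'], ascending=False)

# Within each id the rows are now ordered newest-first, so the next row
# down holds the previous val in time; shift(-1) pulls it up.
mydf['last_val'] = mydf.groupby('id')['val'].shift(-1)

# Rows with no earlier observation for their id get NaN and are dropped.
out = mydf.dropna(subset=['last_val'])
print(out)
```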
Related
I have a dataframe where column 'A' holds messages and column 'B' marks the sender: 1 = client, 2 = admin.
I need to merge consecutive client messages (rows with 1) into one row, and likewise merge consecutive admin responses (rows with 2), across the dataframe.
df1 = pd.DataFrame({'A' : ['a', 'b', 'c', 'd', 'e', 'f', 'h', 'j', 'de', 'be'],
                    'B' : [1, 1, 2, 1, 1, 1, 2, 2, 1, 2]})
df1
    A  B
0 a 1
1 b 1
2 c 2
3 d 1
4 e 1
5 f 1
6 h 2
7 j 2
8 de 1
9 be 2
I need to get in the end this dataframe:
df2 = pd.DataFrame({'A' : ['a, b', 'd, e, f', 'de'],
                    'B' : ['c', 'h, j', 'be']})
Out:
A B
0 a,b c
1 d,e,f h,j
2 de be
I do not know how to do this
Create groups of consecutive values in B with the standard trick of comparing shifted values and taking the cumulative sum, then aggregate with first and join. Also create a helper column for pivoting in the next step with DataFrame.pivot.
This solution works as long as the 1,2 pairs occur in sequential order (duplicates within a run are fine).
df = (df1.groupby(df1['B'].ne(df1['B'].shift()).cumsum())
         .agg(B=('B', 'first'), A=('A', ','.join))
         .assign(C=lambda x: x['B'].eq(1).cumsum()))
print (df)
B A C
B
1 1 a,b 1
2 2 c 1
3 1 d,e,f 2
4 2 h,j 2
5 1 de 3
6 2 be 3
df = (df.pivot(index='C', columns='B', values='A')
        .rename(columns={1:'A', 2:'B'})
        .reset_index(drop=True).rename_axis(None, axis=1))
print (df)
A B
0 a,b c
1 d,e,f h,j
2 de be
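The grouping trick above can be seen in isolation: comparing each value of B with its shifted neighbour marks the start of every run, and the cumulative sum then numbers the runs.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['a', 'b', 'c', 'd', 'e', 'f', 'h', 'j', 'de', 'be'],
                    'B': [1, 1, 2, 1, 1, 1, 2, 2, 1, 2]})

# True wherever B differs from the previous row (the first row compares
# against NaN, so it is True as well); cumsum numbers each run of equal values.
group = df1['B'].ne(df1['B'].shift()).cumsum()
print(group.tolist())  # [1, 1, 2, 3, 3, 3, 4, 4, 5, 6]
```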
I have dataframe:
import numpy as np
import pandas as pd

d_test = {
    'c1' : ['a', 'b', np.nan, 'c'],
    'c2' : ['d', np.nan, 'e', 'f'],
    'test': [1, 2, 3, 4],
}
df_test = pd.DataFrame(d_test)
I want to concatenate columns c1 and c2 into one and get the following resulting dataframe:
a 1
b 2
c 4
d 1
e 3
f 4
I tried to use
pd.concat([df_test.c1 , df_test.c2], axis = 0)
to generate such a column, but have no idea how to keep the 'test' column as well during concatenation.
Use melt:
df_test.melt('test').dropna()[['value', 'test']]
result:
value test
0 a 1
1 b 2
3 c 4
4 d 1
6 e 3
7 f 4
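For comparison, the asker's concat idea can also be completed by carrying test along as the index (a sketch; it gives the same pairs, just with the columns ordered test-first until the final reorder):

```python
import numpy as np
import pandas as pd

df_test = pd.DataFrame({
    'c1': ['a', 'b', np.nan, 'c'],
    'c2': ['d', np.nan, 'e', 'f'],
    'test': [1, 2, 3, 4],
})

# Stack c1 and c2 on top of each other while keeping test as the index,
# then drop the missing entries and turn the index back into a column.
out = (pd.concat([df_test.set_index('test')['c1'],
                  df_test.set_index('test')['c2']])
         .dropna()
         .rename('value')
         .reset_index())
print(out[['value', 'test']])
```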
I have a CSV-based dataframe
name value
A 5
B 5
C 5
D 1
E 2
F 1
and a values count dictionary like this:
{
5: 2,
1: 1
}
How do I split the original dataframe into two:
name value
A 5
B 5
D 1
name value
C 5
E 2
F 1
So how do I split a dataframe in pandas, given a list of column values and counts?
This worked for me:
def target_indices(df, value_count):
    indices = []
    for index, row in df.iterrows():
        for key in value_count:
            if key == row['value'] and value_count[key] > 0:
                indices.append(index)
                value_count[key] -= 1
    return indices
df = pd.DataFrame({'name': ['A', 'B', 'C', 'D', 'E', 'F'], 'value': [5, 5, 5, 1, 2, 1]})
value_count = {5: 2, 1: 1}
indices = target_indices(df, value_count)
df1 = df.iloc[indices]
print(df1)
df2 = df.drop(indices)
print(df2)
Output:
name value
0 A 5
1 B 5
3 D 1
name value
2 C 5
4 E 2
5 F 1
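A vectorized alternative, as a sketch (it relies on the dictionary meaning "take the first N occurrences of each value", which matches the example):

```python
import pandas as pd

df = pd.DataFrame({'name': ['A', 'B', 'C', 'D', 'E', 'F'],
                   'value': [5, 5, 5, 1, 2, 1]})
value_count = {5: 2, 1: 1}

# cumcount numbers the occurrences of each value 0, 1, 2, ...;
# a row goes into the first split while its counter is below the quota.
quota = df['value'].map(value_count).fillna(0)
mask = df.groupby('value').cumcount() < quota
df1, df2 = df[mask], df[~mask]
print(df1)
print(df2)
```

Values missing from the dictionary get a quota of 0 and fall into the second split, matching how row E behaves in the example.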
I have a dataframe with different categories and want to exclude all the values which are above a given percentile for each category.
d = {'cat': ['A', 'B', 'A', 'A', 'C', 'C', 'B', 'A', 'B', 'C'],
'val': [1, 2, 4, 2, 1, 0, 9, 8, 7, 7]}
df = pd.DataFrame(data=d)
cat val
0 A 1
1 B 2
2 A 4
3 A 2
4 C 1
5 C 0
6 B 9
7 A 8
8 B 7
9 C 7
So, for example, excluding everything above the 0.95 quantile should result in:
cat val
0 A 1
1 B 2
2 A 4
3 A 2
4 C 1
5 C 0
8 B 7
because we have:
>>> df.loc[df['cat']=='A', 'val'].quantile(0.95)
7.399999999999999
>>> df.loc[df['cat']=='B', 'val'].quantile(0.95)
8.8
>>> df.loc[df['cat']=='C', 'val'].quantile(0.95)
6.399999999999999
In reality there are many categories and I need a neat way to do it.
You can use the quantile function in combination with groupby:
df.groupby('cat')['val'].apply(lambda x: x[x < x.quantile(0.95)]).reset_index().drop(columns='level_1')
I came up with the following solution:
idx = [False] * df.shape[0]
for cat in df['cat'].unique():
    idx |= (df['cat'] == cat) & (df['val'].between(0, df.loc[df['cat'] == cat, 'val'].quantile(0.95)))
df[idx]
df[idx]
but it would be nice to see other solutions (hopefully better ones).
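A transform-based variant avoids the apply and keeps the original index (a sketch):

```python
import pandas as pd

df = pd.DataFrame({'cat': ['A', 'B', 'A', 'A', 'C', 'C', 'B', 'A', 'B', 'C'],
                   'val': [1, 2, 4, 2, 1, 0, 9, 8, 7, 7]})

# transform broadcasts each group's 0.95 quantile back onto its rows,
# so a single boolean mask filters the whole frame at once.
threshold = df.groupby('cat')['val'].transform(lambda s: s.quantile(0.95))
out = df[df['val'] < threshold]
print(out)
```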
It seems odd that after deleting a column, I cannot add it back under the same name. I create a simple dataframe with multi-level columns, add a new column specifying only the level-0 name, and then delete it.
>>> import pandas as pd
>>> df = pd.DataFrame([[1,2,3],[4,5,6]])
>>> df.columns=[['a','b','c'],['e','f','g']]
>>> print(df)
a b c
e f g
0 1 2 3
1 4 5 6
>>> df['d'] = df.c+2
>>> print(df)
a b c d
e f g
0 1 2 3 5
1 4 5 6 8
>>> del df['d']
>>> print(df)
a b c
e f g
0 1 2 3
1 4 5 6
Now I try to add it again, and it seems like it has no effect and no error or warning is shown.
>>> df['d'] = df.c+2
>>> print(df)
a b c
e f g
0 1 2 3
1 4 5 6
Is this expected behaviour? Should I file a bug report with the pandas project? There is no such issue if I add the 'd' column with both levels specified, like this:
df['d', 'x'] = df.c+2
Thanks,
PS: Python is 2.7.14 and pandas 0.20.1
The problem is that the MultiIndex levels are not removed after calling del:
del df['d']
print(df)
a b c
e f g
0 1 2 3
1 4 5 6
Check columns:
print (df.columns)
MultiIndex(levels=[['a', 'b', 'c', 'd'], ['e', 'f', 'g', '']],
labels=[[0, 1, 2], [0, 1, 2]])
The solution is to drop them with MultiIndex.remove_unused_levels:
df.columns = df.columns.remove_unused_levels()
print (df.columns)
MultiIndex(levels=[['a', 'b', 'c'], ['e', 'f', 'g']],
labels=[[0, 1, 2], [0, 1, 2]])
df['d'] = df.c+2
print (df)
a b c d
e f g
0 1 2 3 5
1 4 5 6 8
Another solution is to assign using a tuple, which is how a MultiIndex column is selected:
df[('d', '')] = df.c+2
print (df)
a b c d
e f g
0 1 2 3 5
1 4 5 6 8
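An end-to-end sketch of the stale-levels behaviour and the fix (as observed on modern pandas; the question used 0.20.1):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
df.columns = pd.MultiIndex.from_arrays([['a', 'b', 'c'], ['e', 'f', 'g']])

df[('d', '')] = df[('c', 'g')] + 2   # add a column with both levels given
del df['d']                          # the column itself is gone...

# ...but 'd' still lingers in the (now unused) level values.
print('d' in df.columns.levels[0])   # True

df.columns = df.columns.remove_unused_levels()
print('d' in df.columns.levels[0])   # False
```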