I have a dataframe where column A holds the messages and column B marks the sender: 1 is the client and 2 is the admin.
I need to merge the consecutive client messages (B == 1) into one row and merge the consecutive admin responses (B == 2) that follow them, across the whole dataframe.
import pandas as pd

df1 = pd.DataFrame({'A' : ['a', 'b', 'c', 'd', 'e', 'f', 'h', 'j', 'de', 'be'],
                    'B' : [1, 1, 2, 1, 1, 1, 2, 2, 1, 2]})
df1
    A  B
0 a 1
1 b 1
2 c 2
3 d 1
4 e 1
5 f 1
6 h 2
7 j 2
8 de 1
9 be 2
I need to get in the end this dataframe:
df2 = pd.DataFrame({'A' : ['a, b', 'd, e, f', 'de'],
'B' : ['c', 'h, j', 'be' ]})
Out:
A B
0 a,b c
1 d,e,f h,j
2 de be
I do not know how to do this
The solution works as long as the 1s and 2s occur in sequential pairs (runs of consecutive duplicates are fine).
Create groups from consecutive values in B - the trick is to compare the values with their shifted neighbours and take the cumulative sum - then aggregate B with first and join A. A helper column C is also created for pivoting in the next step with DataFrame.pivot:
df = (df1.groupby(df1['B'].ne(df1['B'].shift()).cumsum())
.agg(B = ('B','first'), A= ('A', ','.join))
.assign(C = lambda x: x['B'].eq(1).cumsum()))
print (df)
B A C
B
1 1 a,b 1
2 2 c 1
3 1 d,e,f 2
4 2 h,j 2
5 1 de 3
6 2 be 3
df = (df.pivot(index='C', columns='B', values='A')
        .rename(columns={1:'A', 2:'B'})
        .reset_index(drop=True)
        .rename_axis(None, axis=1))
print (df)
A B
0 a,b c
1 d,e,f h,j
2 de be
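To see what the consecutive-group trick produces before the aggregation, here is a quick look at the helper series (an illustrative sketch added here, not part of the original answer) - each run of equal values in B gets its own integer label:

groups = df1['B'].ne(df1['B'].shift()).cumsum()
print(groups.tolist())
# [1, 1, 2, 3, 3, 3, 4, 4, 5, 6]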
I have a dataframe:
import numpy as np
import pandas as pd

d_test = {
    'c1' : ['a', 'b', np.nan, 'c'],
    'c2' : ['d', np.nan, 'e', 'f'],
    'test': [1, 2, 3, 4],
}
df_test = pd.DataFrame(d_test)
I want to concatenate columns c1 and c2 into one column and get the following resulting dataframe:
a 1
b 2
c 4
d 1
e 3
f 4
I tried to use
pd.concat([df_test.c1, df_test.c2], axis=0)
to generate such a column, but I have no idea how to keep the 'test' column as well during the concatenation.
Use melt - 'test' stays as the id column, c1 and c2 are stacked into a single value column, and the NaN rows are dropped:
df_test.melt('test').dropna()[['value', 'test']]
result:
value test
0 a 1
1 b 2
3 c 4
4 d 1
6 e 3
7 f 4
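A possible alternative along the lines of the pd.concat attempt in the question (a sketch only, not the answer's method; the name out is just for illustration) - pair each value column with 'test' before concatenating, then drop the missing values:

# concatenate (value, test) pairs column by column, then drop rows with NaN values
out = pd.concat(
    [df_test[[c, 'test']].rename(columns={c: 'value'}) for c in ['c1', 'c2']],
    ignore_index=True,
).dropna(subset=['value'])
print(out)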
For instance, I have a data frame df initially:
df = pd.DataFrame()
df['A'] = pd.Series([1, 1, 2, 2, 1, 3, 3])
df['B'] = pd.Series(['a', 'b', 'c', 'd', 'e', 'f', 'g'])
df
   A  B
0  1  a
1  1  b
2  2  c
3  2  d
4  1  e
5  3  f
6  3  g
Now I'd like to replace column B in the rows where column A equals 1 with the values from a list [0, 1, 2]. Here is my expectation after the replacement:
df
   A  B
0  1  0
1  1  1
2  2  c
3  2  d
4  1  2
5  3  f
6  3  g
How to achieve this goal?
df.loc[df['A']==1, 'B'] = [0, 1, 2]
print(df)
Prints:
A B
0 1 0
1 1 1
2 2 c
3 2 d
4 1 2
5 3 f
6 3 g
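One caveat worth adding (not part of the original answer): the assigned list must contain exactly as many elements as there are matching rows, otherwise pandas raises a length-mismatch ValueError. A quick sanity check:

n = (df['A'] == 1).sum()                    # 3 matching rows here
df.loc[df['A'] == 1, 'B'] = list(range(n))  # lengths agree, so the assignment works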
I have mydf below, which I have sorted on a dummy time column and the id:
mydf = pd.DataFrame(
{
'id': ['A', 'B', 'B', 'C', 'A', 'C', 'A'],
'time': [1, 4, 3, 5, 2, 6, 7],
'val': ['a', 'b', 'c', 'd', 'e', 'f', 'g']
}
).sort_values(['id', 'time'], ascending=False)
mydf
id time val
5 C 6 f
3 C 5 d
1 B 4 b
2 B 3 c
6 A 7 g
4 A 2 e
0 A 1 a
I want to add a column (last_val) which, for each unique id, holds the latest val based on the time column. Entries for which there is no last_val can be dropped. The output in this example would look like:
mydf
id time val last_val
5 C 6 f d
1 B 4 b c
6 A 7 g e
4 A 2 e a
Any ideas?
Use DataFrameGroupBy.shift after sort_values(['id', 'time'], ascending=False) (already done in the question), and then remove the rows with missing values with DataFrame.dropna:
mydf['last_val'] = mydf.groupby('id')['val'].shift(-1)
mydf = mydf.dropna(subset=['last_val'])
A similar solution, which instead drops the last row of each id (the one without a following value) using Series.duplicated:
mydf['last_val'] = mydf.groupby('id')['val'].shift(-1)
mydf = mydf[mydf['id'].duplicated(keep='last')]
print (mydf)
id time val last_val
5 C 6 f d
1 B 4 b c
6 A 7 g e
4 A 2 e a
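For reference, the same idea as a single chain (a compact sketch under the same assumption that mydf is already sorted by ['id', 'time'] descending; out is just an illustrative name):

# shift val within each id, then drop the rows that have no following value
out = (mydf.assign(last_val=mydf.groupby('id')['val'].shift(-1))
           .dropna(subset=['last_val']))
print(out)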
If I have a df like so:
dfdict = {'1': ['a', 'a', 'a', 'b'], '2': ['a', 'b', 'c', 'a'], '3': ['b', 'a', 'd', 'c']}
df1 = pd.DataFrame(dfdict)
1 2 3
0 a a b
1 a b a
2 a c d
3 b a c
I want to save only the rows where col 1 matches 2 OR 1 matches 3. In this case, rows 0 and 1 would be saved:
1 2 3
0 a a b
1 a b a
I tried:
df2 = df1.loc[df1['1'] == df1['2'] & df1['1'] == df1['3']]
but I get the error TypeError: unsupported operand type(s) for &: 'str' and 'bool'.
I would also like to get the other lines where col 1 does NOT match 2 OR 3, i.e. rows 2 and 3, in a separate df.
Option 1
eq, fixing your code - Series.eq sidesteps the operator-precedence problem and turns the condition into OR:
df1[df1['1'].eq(df1['2']) | df1['1'].eq(df1['3'])]
1 2 3
0 a a b
1 a b a
Option 2
np.vectorize
f = np.vectorize(lambda x, y, z: x in (y, z))
df1[f(df1['1'], df1['2'], df1['3'])]
1 2 3
0 a a b
1 a b a
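For completeness (an addition beyond the two options above): the original attempt only needs parentheses, because & and | bind more tightly than ==, and the requirement is OR rather than AND. The inverted mask then gives the non-matching rows asked for at the end of the question:

# parenthesise the comparisons and use | for OR
mask = (df1['1'] == df1['2']) | (df1['1'] == df1['3'])
matches = df1[mask]        # rows 0 and 1
non_matches = df1[~mask]   # rows 2 and 3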
It seems odd that after deleting a column, I cannot add it back with the same name. I create a simple dataframe with multi-level columns, add a new column with only the level-0 name specified, and then delete it.
>>> import pandas as pd
>>> df = pd.DataFrame([[1,2,3],[4,5,6]])
>>> df.columns=[['a','b','c'],['e','f','g']]
>>> print(df)
a b c
e f g
0 1 2 3
1 4 5 6
>>> df['d'] = df.c+2
>>> print(df)
a b c d
e f g
0 1 2 3 5
1 4 5 6 8
>>> del df['d']
>>> print(df)
a b c
e f g
0 1 2 3
1 4 5 6
Now I try to add it again, and it seems like it has no effect and no error or warning is shown.
>>> df['d'] = df.c+2
>>> print(df)
a b c
e f g
0 1 2 3
1 4 5 6
Is this expected behaviour? Should I file a bug report with the pandas project? There is no such issue if I add the 'd' column with both levels specified, like this:
df['d', 'x'] = df.c+2
Thanks,
PS: Python is 2.7.14 and pandas 0.20.1
The problem is that the MultiIndex levels are not cleaned up after calling del - the removed labels stay in the levels:
del df['d']
print(df)
a b c
e f g
0 1 2 3
1 4 5 6
Check columns:
print (df.columns)
MultiIndex(levels=[['a', 'b', 'c', 'd'], ['e', 'f', 'g', '']],
labels=[[0, 1, 2], [0, 1, 2]])
The solution is to remove them with MultiIndex.remove_unused_levels:
df.columns = df.columns.remove_unused_levels()
print (df.columns)
MultiIndex(levels=[['a', 'b', 'c'], ['e', 'f', 'g']],
labels=[[0, 1, 2], [0, 1, 2]])
df['d'] = df.c+2
print (df)
a b c d
e f g
0 1 2 3 5
1 4 5 6 8
Another solution is to reassign using the full MultiIndex key - a tuple is needed to select a MultiIndex column:
df[('d', '')] = df.c+2
print (df)
a b c d
e f g
0 1 2 3 5
1 4 5 6 8
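A small follow-up on selecting that column afterwards (an added note, not from the original answer): with MultiIndex columns, df['d'] returns a one-column DataFrame whose remaining column level is '', while the full tuple key returns the Series itself:

print(df['d'])        # DataFrame with a single column labelled ''
print(df[('d', '')])  # the underlying Series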