To try, I have:
test = pd.DataFrame([[1,'A', 'B', 'A B r'], [0,'A', 'B', 'A A A'], [2,'B', 'C', 'B a c'], [1,'A', 'B', 's A B'], [1,'A', 'B', 'A'], [0,'B', 'C', 'x']])
replace = [['x', 'y', 'z'], ['r', 's', 't'], ['a', 'b', 'c']]
I would like to replace parts of the values in the last column with 0, but only if they appear in the replace sublist at the position given by the number in that row's first column.
For example, looking at the first three rows:
So, since 'r' is in replace[1], that cell becomes A B 0.
'A' is not in replace[0], so it stays as A A A,
'a' and 'c' are both in replace[2], so it becomes B 0 0,
etc.
I tried something like
test[3] = test[3].apply(lambda x: ' '.join([n if n not in replace[test[0]] else 0 for n in test.split()]))
but it's not changing anything.
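(For reference, the attempt above fails because the lambda passed to `Series.apply` receives only the cell string, yet it references the whole `test[0]` column and calls `.split()` on the DataFrame itself. A minimal row-wise fix, as a sketch reusing the `test` and `replace` defined above, could be:)

```python
import pandas as pd

test = pd.DataFrame([[1, 'A', 'B', 'A B r'], [0, 'A', 'B', 'A A A'],
                     [2, 'B', 'C', 'B a c'], [1, 'A', 'B', 's A B'],
                     [1, 'A', 'B', 'A'], [0, 'B', 'C', 'x']])
replace = [['x', 'y', 'z'], ['r', 's', 't'], ['a', 'b', 'c']]

# apply over rows (axis=1) so each row's first column can pick the sublist
test[3] = test.apply(
    lambda row: ' '.join('0' if tok in replace[row[0]] else tok
                         for tok in row[3].split()),
    axis=1)
print(test[3].tolist())  # ['A B 0', 'A A A', 'B 0 0', '0 A B', 'A', '0']
```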
IIUC, use zip and a list comprehension to accomplish this.
I've simplified and created a custom replace_ function, but feel free to use regex to perform the replacement if needed.
def replace_(st, reps):
    for old, new in reps:
        st = st.replace(old, new)
    return st
df['new'] = [replace_(b, zip(replace[a], ['0']*3)) for a,b in zip(df[0], df[3])]
Outputs
0 1 2 3 new
0 1 A B A B r A B 0
1 0 A B A A A A A A
2 2 B C B a c B 0 0
3 1 A B s A B 0 A B
4 1 A B A A
5 0 B C x 0
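One caveat worth noting: `str.replace` matches substrings, not whole tokens, so this approach only behaves as expected when the replacement values cannot occur inside longer tokens. A quick illustration:

```python
def replace_(st, reps):
    for old, new in reps:
        st = st.replace(old, new)
    return st

# 's' is replaced even inside the token 'sat'
print(replace_('sat A B', [('s', '0')]))  # 0at A B
```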
Use list comprehension with lookup in sets:
test[3] = [' '.join('0' if i in set(replace[a]) else i for i in b.split())
for a,b in zip(test[0], test[3])]
print (test)
0 1 2 3
0 1 A B A B 0
1 0 A B A A A
2 2 B C B 0 0
3 1 A B 0 A B
4 1 A B A
5 0 B C 0
Or convert to sets beforehand to improve performance:
r = [set(x) for x in replace]
test[3]=[' '.join('0' if i in r[a] else i for i in b.split()) for a,b in zip(test[0], test[3])]
Finally, I think I see what you need:
s = pd.Series(replace).reindex(test[0])
["".join([dict.fromkeys(y, '0').get(c, c) for c in x]) for x, y in zip(test[3], s)]
['A B 0', 'A A A', 'B 0 0', '0 A B', 'A', '0']
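The per-character trick here is `dict.fromkeys(y, '0')`, which maps every character in the row's replace list to `'0'`, while `.get(c, c)` leaves everything else (including spaces) untouched. In isolation:

```python
# Build a char -> '0' lookup from the row's replacement list
reps = ['r', 's', 't']
table = dict.fromkeys(reps, '0')

# .get(c, c) falls back to the original character when not in the table
out = ''.join(table.get(c, c) for c in 'A B r')
print(out)  # A B 0
```

Note this works character by character, so it only suits single-character replacement values.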
Related
I have a dataframe where in column B, 1 marks a client message and 2 an admin response.
I need to merge consecutive client rows (B == 1) together and consecutive admin rows (B == 2) together across the dataframe.
df1 = pd.DataFrame({'A' : ['a', 'b', 'c', 'd', 'e', 'f', 'h', 'j', 'de', 'be'],
'B' : [1, 1, 2, 1, 1, 1, 2, 2, 1, 2]})
df1
A B
0 a 1
1 b 1
2 c 2
3 d 1
4 e 1
5 f 1
6 h 2
7 j 2
8 de 1
9 be 2
I need to get in the end this dataframe:
df2 = pd.DataFrame({'A' : ['a, b', 'd, e, f', 'de'],
'B' : ['c', 'h, j', 'be' ]})
Out:
A B
0 a,b c
1 d,e,f h,j
2 de be
I do not know how to do this
Create groups by consecutive values in B (the usual trick: compare the values with their shifted copy, then take the cumulative sum), and aggregate with first and join. Also create a helper column for the possible pivoting in the next step by DataFrame.pivot.
This solution works if the 1,2 pairs occur in sequential order, duplicates allowed.
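The shift-and-cumsum trick can be seen in isolation: comparing each value with its predecessor marks the start of every run, and the cumulative sum of those marks numbers each consecutive run.

```python
import pandas as pd

s = pd.Series([1, 1, 2, 1, 1, 1, 2, 2, 1, 2])

# True at every position where the value differs from the previous row
starts = s.ne(s.shift())
# Cumulative sum of the booleans numbers each consecutive run
group_id = starts.cumsum()
print(group_id.tolist())  # [1, 1, 2, 3, 3, 3, 4, 4, 5, 6]
```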
df = (df1.groupby(df1['B'].ne(df1['B'].shift()).cumsum())
.agg(B = ('B','first'), A= ('A', ','.join))
.assign(C = lambda x: x['B'].eq(1).cumsum()))
print (df)
B A C
B
1 1 a,b 1
2 2 c 1
3 1 d,e,f 2
4 2 h,j 2
5 1 de 3
6 2 be 3
df = (df.pivot(index='C', columns='B', values='A')
.rename(columns={1:'A',2:'B'})
.reset_index(drop=True).rename_axis(None, axis=1))
print (df)
A B
0 a,b c
1 d,e,f h,j
2 de be
How can I find the same values in the columns regardless of their position?
df = pd.DataFrame({'one':['A','B', 'C', 'D', 'E', np.nan, 'H'],
'two':['B', 'E', 'C', np.nan, np.nan, 'H', 'L']})
The result I want to get:
three
0 B
1 E
2 C
3 H
The exact logic is unclear, you can try:
out = pd.DataFrame({'three': sorted(set(df['one'].dropna())
&set(df['two'].dropna()))})
output:
three
0 B
1 C
2 E
3 H
Or maybe you want to keep the items of column two (with their original index)?
out = (df.loc[df['two'].isin(df['one'].dropna()), 'two']
.to_frame(name='three')
)
output:
three
0 B
1 E
2 C
5 H
Try this:
df = pd.DataFrame(set(df['one']).intersection(df['two']), columns=['Three']).dropna()
print(df)
Output:
Three
1 C
2 H
3 E
4 B
I have the following dataframe
import pandas as pd
foo = pd.DataFrame({'id': [1,1,1,1,2,2,2,2,3,3,3,3],
'time': [1,2,3,4,1,2,3,4,1,2,3,4],
'cat': ['a', 'a', 'b', 'c',
'a', 'b', 'b', 'b',
'c', 'b', 'c', 'b']
})
I want to calculate, by id, how many times cat changes from one time to the next.
So:
for id == 1, cat changes from a to a 1 time, from a to b 1 time and from b to c 1 time
for id == 2, cat changes from a to b 1 time, and from b to b 2 times
for id == 3, cat changes from c to b 2 times, and from b to c 1 time
Any ideas how I could compute that?
Ideally the output should look something like:
pd.DataFrame({'id': [1,2,3],
'a to a': [1,0,0],
'a to b': [1,1,0],
'a to c': [0,0,0],
'b to a': [0,0,0],
'b to b': [0,2,0],
'b to c': [1,0,1],
'c to a': [0,0,0],
'c to b': [0,0,2],
'c to c': [0,0,0]
})
Similar to @Anky's answer, we shift within each group to pair the current value with the next one, then take a crosstab. Since str.cat keeps the NaN produced by the shift, and crosstab ignores NaN, only within-group transitions are counted.
import pandas as pd
s = foo['cat'].str.cat(' to ' + foo.groupby('id')['cat'].shift(-1))
pd.crosstab(foo['id'], s)
cat a to a a to b b to b b to c c to b
id
1 1 1 0 1 0
2 0 1 2 0 0
3 0 0 0 1 2
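The NaN behaviour this relies on can be shown in isolation: with the default `na_rep=None`, `Series.str.cat` propagates NaN, so group-boundary rows never form a label, and `pd.crosstab` then drops them.

```python
import numpy as np
import pandas as pd

cur = pd.Series(['a', 'b'])
nxt = pd.Series(['x', np.nan])  # e.g. the result of a grouped shift(-1)

# NaN in `others` propagates, so boundary rows get no transition label
labels = cur.str.cat(' to ' + nxt)
print(labels.tolist())  # ['a to x', nan]
```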
This might be another way:
g = foo.groupby(["id"])
s = (g['cat'].shift().where(g['time'].diff().ge(1)).fillna('')).add(foo['cat'])
t = foo[['id']].assign(k=s)
out = t[t['k'].str.len()>1].groupby("id")['k'].value_counts().unstack(fill_value=0)
print(out.rename_axis(None,axis=1))
aa ab bb bc cb
id
1 1 1 0 1 0
2 0 1 2 0 0
3 0 0 0 1 2
You can use collections.Counter and itertools.pairwise:
from collections import Counter
from itertools import pairwise
foo.groupby('id')['cat'].apply(lambda x: Counter(pairwise(x))).unstack(level=0).fillna(0)
output:
id 1 2 3
(a, a) 1.0 0.0 0.0
(a, b) 1.0 1.0 0.0
(b, c) 1.0 0.0 1.0
(b, b) 0.0 2.0 0.0
(c, b) 0.0 0.0 2.0
NB. pairwise requires python ≥ 3.10, for versions below use the recipe from the documentation.
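For older Python versions, the documented itertools recipe is short enough to inline:

```python
from itertools import tee

def pairwise(iterable):
    # itertools recipe: s -> (s0, s1), (s1, s2), (s2, s3), ...
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

print(list(pairwise('abcd')))  # [('a', 'b'), ('b', 'c'), ('c', 'd')]
```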
We can use pandas.Series.shift to shift the values, then pandas.DataFrame.groupby to group the distinct values, and count.
# Store the next category in next_cat (shifted across the whole frame)
foo['next_cat'] = foo['cat'].shift(-1)
# If instead each row's next category should come from the same id,
# use a grouped shift in place of the line above:
# foo['next_cat'] = foo.groupby('id')['cat'].shift(-1)
# Keep only rows where next_cat is not NaN (this drops the final row),
# then group by category, next category and id, and count.
foo.loc[foo['next_cat'].notna()].groupby(['cat', 'next_cat', 'id']).count()
This outputs:
time
cat next_cat id
a a 1 1
b 1 1
2 1
b b 2 2
c 1 1
2 1
3 1
c a 1 1
b 3 2
We can then drop the index and pivot in order to achieve your ideal shape using pandas.DataFrame.pivot_table:
# This time around we store the result back into foo.
foo = foo.loc[foo['next_cat'].notna()].groupby(['cat', 'next_cat', 'id']).count()
# Reset the index so we can pivot using these columns.
foo = foo.reset_index()
foo.pivot_table(columns=foo['cat'] + " to " + foo['next_cat'], values=['time'], index=['id']).fillna(0).astype(int)
This outputs:
time
a to a a to b b to b b to c c to a c to b
id
1 1 1 0 1 1 0
2 0 1 2 1 0 0
3 0 0 0 1 0 2
If I have a dataframe:
>>> import pandas as pd
>>> df = pd.DataFrame([
... ['A', 'B', 'C', 'D'],
... ['E', 'B', 'C']
... ])
>>> df
0 1 2 3
0 A B C D
1 E B C None
>>>
I should transform the dataframe to a two-column format:
x, y
-----
A, B
B, C
C, D
E, B
B, C
For each row, from left to right, take two neighboring values and make a pair of them.
It is a kind of from-to relation if you consider each row as a path.
How to do the transformation?
We can do explode with zip
s = pd.DataFrame(df.apply(lambda x: list(zip(x.dropna()[:-1], x.dropna()[1:])),
                          axis=1).explode().tolist())
Out[336]:
0 1
0 A B
1 B C
2 C D
3 E B
4 B C
Update
s = df.apply(lambda x: list(zip(x.dropna()[:-1], x.dropna()[1:])), axis=1).explode()
s = pd.DataFrame(s.tolist(), index=s.index)
s
Out[340]:
0 1
0 A B
0 B C
0 C D
1 E B
1 B C
Pre-preparing the data could help too:
import pandas as pd
inp = [['A', 'B', 'C', 'D'],
['E', 'B', 'C']]
# Convert beforehand
inp2 = [[i[k], i[k+1]] for i in inp for k in range(len(i)-1)]
df = pd.DataFrame(inp2)
print(df)
Output:
0 1
0 A B
1 B C
2 C D
3 E B
4 B C
A column in my dataframe contains indices of values in a list, like:
id | idx
A | 0
B | 0
C | 2
D | 1
list = ['a', 'b', 'c', 'd']
I want to replace each value in idx column greater than 0 by value in list of corresponding index, so that:
id | idx
A | 0
B | 0
C | c # list[2]
D | b # list[1]
I tried to do this with a loop, but it does nothing... although if I move ['idx'], it replaces all values in that row:
for index in df.idx.values:
    if index >= 1:
        df[df.idx == index]['idx'] = list[index]
Don't use list as a variable name, because it shadows the Python builtin.
Then use Series.map with enumerate inside Series.mask:
L = ['a', 'b', 'c', 'd']
df['idx'] = df['idx'].mask(df['idx'] >=1, df['idx'].map(dict(enumerate(L))))
print (df)
id idx
0 A 0
1 B 0
2 C c
3 D b
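`Series.mask(cond, other)` keeps the original value where the condition is False and takes `other` where it is True, which is why the zeros survive. A minimal sketch:

```python
import pandas as pd

L = ['a', 'b', 'c', 'd']
s = pd.Series([0, 0, 2, 1])

# Map every index through the enumerate dict, but only keep the
# mapped value where the index is >= 1
out = s.mask(s >= 1, s.map(dict(enumerate(L))))
print(out.tolist())  # [0, 0, 'c', 'b']
```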
Similar idea is processing only matched rows by mask:
L = ['a', 'b', 'c', 'd']
m = df['idx'] >=1
df.loc[m,'idx'] = df.loc[m,'idx'].map(dict(enumerate(L)))
print (df)
id idx
0 A 0
1 B 0
2 C c
3 D b
Create a dictionary for items where the index is greater than 0, then use the mapping with replace to get your output :
mapping = dict((key,val) for key,val in enumerate(l) if key > 0)
print(mapping)
{1: 'b', 2: 'c', 3: 'd'}
df.replace(mapping)
id idx
0 A 0
1 B 0
2 C c
3 D b
Note : I changed the list variable name to l