Replace values in a column that come after a specific value - python

I would like to replace values in a column, but only the values that come after a specific value.
For example, I have the following dataset:
In [108]: df = pd.DataFrame([[12,13,14,15,16,17],[4,10,5,6,1,3],[1,3,5,4,9,1],[2,4,1,8,3,4],[4,2,6,7,1,8]], index=['ID','time','A','B','C']).T
In [109]: df
Out[109]:
ID time A B C
0 12 4 1 2 4
1 13 10 3 4 2
2 14 5 5 1 6
3 15 6 4 8 7
4 16 1 9 3 1
5 17 3 1 4 8
and I want to change, in column "A", all the values that come after a 5 to a 1; in column "B", all the values that come after a 1 to a 6; and in column "C", all the values that come after a 7 to a 5, so it will look like this:
ID time A B C
0 12 4 1 2 4
1 13 10 3 4 2
2 14 5 5 1 6
3 15 6 1 6 7
4 16 1 1 6 5
5 17 3 1 6 5
I know that I could use where to get a similar effect, but a condition like df["A"] = np.where(x!=5,1,x) would obviously change the values before the 5 as well. I can't think of anything else at the moment.
Thanks for the help.

Use DataFrame.mask with the values shifted by DataFrame.shift and compared to the dictionary with eq; DataFrame.cummax then turns every position after the first True into True as well:
df = pd.DataFrame([[12,13,14,15,16,17],[4,10,5,6,1,3],
                   [1,3,5,4,9,1],[2,4,1,8,3,4],[4,2,6,7,1,8]],
                  index=['ID','time','A','B','C']).T

# trigger value and replacement value per column
after = {'A':5, 'B':1, 'C': 7}
new = {'A':1, 'B':6, 'C': 5}

cols = list(after.keys())
s = pd.Series(new)
# shift down, flag the row right after the trigger, propagate with cummax
df[cols] = df[cols].mask(df[cols].shift().eq(after).cummax(), s, axis=1)
print(df)
ID time A B C
0 12 4 1 2 4
1 13 10 3 4 2
2 14 5 5 1 6
3 15 6 1 6 7
4 16 1 1 6 5
5 17 3 1 6 5
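To see why this works, inspect the intermediate mask for the example data: shift() moves each column down one row, eq(after) flags the row directly after the trigger value, and cummax() carries that True down to every following row:
print(df[cols].shift().eq(after).cummax())
       A      B      C
0  False  False  False
1  False  False  False
2  False  False  False
3   True   True  False
4   True   True   True
5   True   True   True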

How to get the cumulative count based on two columns

Let's say we have the following dataframe. If we wanted to find the count of consecutive 1's, we could use the code below.
col
0 0
1 1
2 1
3 1
4 0
5 0
6 1
7 1
8 0
9 1
10 1
11 1
12 1
13 0
14 1
15 1
df['col'].groupby(df['col'].diff().ne(0).cumsum()).cumsum()
But the problem I see is when you need to use groupby with an id field. If we add an id field to the dataframe (below), it becomes more complicated and we can no longer use the solution above.
id col
0 B 0
1 B 1
2 B 1
3 B 1
4 A 0
5 A 0
6 B 1
7 B 1
8 B 0
9 B 1
10 B 1
11 A 1
12 A 1
13 A 0
14 A 1
15 A 1
When presented with this issue, I've seen the case made for building a helper series to use in the groupby, like this:
s = df['col'].eq(0).groupby(df['id']).cumsum()
df['col'].groupby([df['id'],s]).cumsum()
This works, but the problem is that the first group contains the first row, which does not fit the criteria. That usually isn't an issue, but it is if we want to find the count: replacing cumsum() at the end of the last groupby() with .transform('count') would give us 6 instead of 5 for the count of consecutive 1's in the first B group.
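For instance, a quick check (using the id/col data above) shows the inflated count:
s = df['col'].eq(0).groupby(df['id']).cumsum()
# rows 0,1,2,3,6,7 all land in group (B, 1): the leading 0 at row 0 is
# counted together with the five consecutive 1's
df['col'].groupby([df['id'], s]).transform('count')[0]  # -> 6, not 5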
The only solution I can come up with for this problem is the following code:
df['col'].groupby([df['id'],df.groupby('id')['col'].transform(lambda x: x.diff().ne(0).astype(int).cumsum())]).transform('count')
Expected output:
0 1
1 5
2 5
3 5
4 2
5 2
6 5
7 5
8 1
9 2
10 2
11 2
12 2
13 1
14 2
15 2
This works, but it uses transform() twice, which I've heard isn't the fastest. It is the only solution I can think of that uses diff().ne(0) to get the "real" groups.
Indexes 1, 2, 3, 6 and 7 are all id B with the same value in the 'col' column, so the count should not reset; they should all be part of the same group.
Can this be done without using multiple .transform()?
The following code uses only one .transform(), and relies on sorting the index to get the correct counts.
The original index is kept, so the final result can be reindexed back to the original order.
Use cum_counts['cum_counts'] to get the exact desired output, without the other column.
import pandas as pd
# test data as shown in OP
df = pd.DataFrame({'id': ['B', 'B', 'B', 'B', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'A', 'A', 'A', 'A', 'A'], 'col': [0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1]})
# reset the index, then set the index and sort
df = df.reset_index().set_index(['index', 'id']).sort_index(level=1)
col
index id
4 A 0
5 A 0
11 A 1
12 A 1
13 A 0
14 A 1
15 A 1
0 B 0
1 B 1
2 B 1
3 B 1
6 B 1
7 B 1
8 B 0
9 B 1
10 B 1
# label each run of consecutive equal values with its own group id
g = df.col.ne(df.col.shift()).cumsum()
# use g to groupby and use only 1 transform to get the counts
cum_counts = df['col'].groupby(g).transform('count').reset_index(level=1, name='cum_counts').sort_index()
id cum_counts
index
0 B 1
1 B 5
2 B 5
3 B 5
4 A 2
5 A 2
6 B 5
7 B 5
8 B 1
9 B 2
10 B 2
11 A 2
12 A 2
13 A 1
14 A 2
15 A 2
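As noted above, selecting just the counts column reproduces the expected output from the question:
cum_counts['cum_counts']  # Series indexed 0..15 holding the counts 1, 5, 5, 5, 2, ...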
After looking at @TrentonMcKinney's solution, I came up with:
df = df.sort_values(['id'])
grp = (df[['id','col']] != df[['id','col']].shift()).any(axis=1).cumsum()
df['count'] = df.groupby(grp)['id'].transform('count')
df.sort_index()
Output:
id col count
0 B 0 1
1 B 1 5
2 B 1 5
3 B 1 5
4 A 0 2
5 A 0 2
6 B 1 5
7 B 1 5
8 B 0 1
9 B 1 2
10 B 1 2
11 A 1 2
12 A 1 2
13 A 0 1
14 A 1 2
15 A 1 2
IIUC, do you want this? (Note: without sorting by id first, runs of the same id that are interrupted by other ids count as separate groups, which is why these counts differ from the expected output.)
grp = (df[['id', 'col']] != df[['id', 'col']].shift()).any(axis = 1).cumsum()
df['count'] = df.groupby(grp)['id'].transform('count')
df
Output:
id col count
0 B 0 1
1 B 1 3
2 B 1 3
3 B 1 3
4 A 0 2
5 A 0 2
6 B 1 2
7 B 1 2
8 B 0 1
9 B 1 2
10 B 1 2
11 A 1 2
12 A 1 2
13 A 0 1
14 A 1 2
15 A 1 2

Construct a df such that every number within a range gets the name 'A' assigned, given the start and end values of the range that belongs to 'A'

Suppose I have the following Pandas dataframe:
In [285]: df = pd.DataFrame({'Name':['A','B'], 'Start': [1,6], 'End': [4,12]})
In [286]: df
Out[286]:
Name Start End
0 A 1 4
1 B 6 12
Now I would like to construct the dataframe as follows:
Name Number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 6
5 B 7
6 B 8
7 B 9
8 B 10
9 B 11
10 B 12
My biggest struggle is in getting the 'Name' column right. Is there a smart way to do this in Python?
I would do pd.concat on a generator expression:
pd.concat(pd.DataFrame({'Number': np.arange(s, e+1)})
            .assign(Name=n)
          for n, s, e in zip(df['Name'], df['Start'], df['End']))
Output:
Number Name
0 1 A
1 2 A
2 3 A
3 4 A
0 6 B
1 7 B
2 8 B
3 9 B
4 10 B
5 11 B
6 12 B
Update: As commented by @rafaelc:
pd.concat(pd.DataFrame({'Number': np.arange(s, e+1), 'Name': n})
          for n, s, e in zip(df['Name'], df['Start'], df['End']))
works just fine.
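One small addition (not from the original answer): either variant keeps the per-block indices, as visible in the output above (0..3 for A, 0..6 for B); passing ignore_index=True to pd.concat gives a clean 0..10 RangeIndex instead:
pd.concat((pd.DataFrame({'Number': np.arange(s, e+1), 'Name': n})
           for n, s, e in zip(df['Name'], df['Start'], df['End'])),
          ignore_index=True)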
Let us do it with this example (with 3 names):
import pandas as pd
df = pd.DataFrame({'Name':['A','B','C'], 'Start': [1,6,18], 'End': [4,12,20]})
You may create the target columns first, using list comprehensions:
name = [row.Name for i, row in df.iterrows() for _ in range(row.End - row.Start + 1)]
number = [k for i, row in df.iterrows() for k in range(row.Start, row.End + 1)]
And then you can create the target DataFrame:
expanded = pd.DataFrame({"Name": name, "Number": number})
You get:
Name Number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 6
5 B 7
6 B 8
7 B 9
8 B 10
9 B 11
10 B 12
11 C 18
12 C 19
13 C 20
I'd take advantage of loc and index.repeat for a vectorized solution.
base = df.loc[df.index.repeat(df['End'] - df['Start'] + 1), ['Name', 'Start']]
base['Start'] += base.groupby(level=0).cumcount()
Name Start
0 A 1
0 A 2
0 A 3
0 A 4
1 B 6
1 B 7
1 B 8
1 B 9
1 B 10
1 B 11
1 B 12
Of course we can rename the columns and reset the index at the end, for a nicer showing.
base.rename(columns={'Start': 'Number'}).reset_index(drop=True)
Name Number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 6
5 B 7
6 B 8
7 B 9
8 B 10
9 B 11
10 B 12
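As an aside, on pandas 0.25+ DataFrame.explode offers yet another route; a sketch, not part of the original answers:
# build a list-valued column with each row's range, then explode it into rows
out = df.assign(Number=[list(range(s, e + 1)) for s, e in zip(df['Start'], df['End'])])
out = out.explode('Number')[['Name', 'Number']].reset_index(drop=True)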

How to set value to a cell filtered by rows in python DataFrame?

import pandas as pd
df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9],[10,11,12]],columns=['A','B','C'])
df[df['B']%2 ==0]['C'] = 5
I am expecting this code to change the value of column C to 5 wherever B is even, but it is not working.
It returns the table as follows:
A B C
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
I am expecting it to return
A B C
0 1 2 5
1 4 5 6
2 7 8 5
3 10 11 12
If you need to change values of a column in a DataFrame based on a condition, use DataFrame.loc with the condition and the column name:
df.loc[df['B']%2 ==0, 'C'] = 5
print(df)
A B C
0 1 2 5
1 4 5 6
2 7 8 5
3 10 11 12
Your solution is a nice example of chained indexing - see the docs.
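To see what goes wrong, here is a minimal sketch of what the chained expression actually does (tmp is just an illustrative name):
# df[df['B'] % 2 == 0] builds a new, filtered DataFrame (a copy) ...
tmp = df[df['B'] % 2 == 0]
# ... so this assignment lands on the copy; the original df is untouched
tmp['C'] = 5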
You could just change the order to:
df['C'][df['B']%2 == 0] = 5
And it also works, although note that this is still chained assignment: it happens to modify df here because df['C'] is a view on the original data, but that is not guaranteed, and it will not work under pandas' copy-on-write behavior.
Using numpy where
df['C'] = np.where(df['B']%2 == 0, 5, df['C'])
Output
A B C
0 1 2 5
1 4 5 6
2 7 8 5
3 10 11 12
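For completeness, the pandas counterpart of that numpy pattern is Series.mask, which replaces values where the condition holds (equivalent here):
df['C'] = df['C'].mask(df['B'] % 2 == 0, 5)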

Pandas - Giving all rows (particularly) duplicate rows a unique identifier

Let's say I have a DF with 5 columns and I want to make a unique 'key' for each row.
a b c d e
1 1 2 3 4 5
2 1 2 3 4 6
3 1 2 3 4 7
4 1 2 2 5 6
5 2 3 4 5 6
6 2 3 4 5 6
7 3 4 5 6 7
I'd like to create a 'key' column as follows:
a b c d e key
1 1 2 3 4 5 12345
2 1 2 3 4 6 12346
3 1 2 3 4 7 12347
4 1 2 2 5 6 12256
5 2 3 4 5 6 23456
6 2 3 4 5 6 23456
7 3 4 5 6 7 34567
Now the problem with this of course is that row 5 & 6 are duplicates.
I'd like to be able to create unique keys like so:
a b c d e key
1 1 2 3 4 5 12345_1
2 1 2 3 4 6 12346_1
3 1 2 3 4 7 12347_1
4 1 2 2 5 6 12256_1
5 2 3 4 5 6 23456_1
6 2 3 4 5 6 23456_2
7 3 4 5 6 7 34567_1
Not sure how to do this or if this is the best method - appreciate any help.
Thanks
Edit: Columns will be mostly strings, not numeric.
One way is to hash the tuple of each row:
In [11]: df.apply(lambda x: hash(tuple(x)), axis=1)
Out[11]:
1 -2898633648302616629
2 -2898619338595901633
3 -2898621714079554433
4 -9151203046966584651
5 1657626630271466437
6 1657626630271466437
7 3771657657075408722
dtype: int64
In [12]: df['key'] = df.apply(lambda x: hash(tuple(x)), axis=1)
In [13]: df['key'].astype(str) + '_' + (df.groupby('key').cumcount() + 1).astype(str)
Out[13]:
1 -2898633648302616629_1
2 -2898619338595901633_1
3 -2898621714079554433_1
4 -9151203046966584651_1
5 1657626630271466437_1
6 1657626630271466437_2
7 3771657657075408722_1
dtype: object
Note: Generally you don't need to be doing this (it's unclear why you'd want to!).
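As an aside (not part of the original answer): Python's hash() of strings is randomized per interpreter process, so these keys are not reproducible across runs. pandas ships a vectorized row hasher that is stable and avoids the apply loop:
# stable uint64 hash per row, computed without apply
df['key'] = pd.util.hash_pandas_object(df, index=False)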
Try this:
df['key'] = df.astype(str).apply('-'.join, axis=1)
m = ~df['key'].duplicated()
s = (df.groupby(m.cumsum()).cumcount()).astype(str)
df['key'] = df['key'] + '_' + s
print(df)
Output:
a b c d e key
0 1 2 3 4 5 1-2-3-4-5_0
1 1 2 3 4 6 1-2-3-4-6_0
2 1 2 3 4 7 1-2-3-4-7_0
3 1 2 2 5 6 1-2-2-5-6_0
4 2 3 4 5 6 2-3-4-5-6_0
5 2 3 4 5 6 2-3-4-5-6_1
6 3 4 5 6 7 3-4-5-6-7_0
7 1 2 3 4 5 1-2-3-4-5_1
Another much simpler way:
df['key']=df['key']+'_'+(df.groupby('key').cumcount()).astype(str)
Explanation:
First, create the unique id by joining the row values.
Then create a sequence s using duplicated() plus cumsum(), restarting the count whenever a new value is found.
Finally, concatenate the key and the sequence s.
Maybe you can do something like the following:
import uuid
df['uuid'] = [uuid.uuid4() for __ in range(df.index.size)]
Another approach would be to use np.random.choice(range(10000,99999), len(df), replace=False) to generate unique random numbers without replacement for each row in your df:
df = pd.DataFrame(columns = ['a', 'b', 'c', 'd', 'e'],
data = [[1, 2, 3, 4, 5],[1, 2, 3, 4, 6],[1, 2, 3, 4, 7],[1, 2, 2, 5, 6],[2, 3, 4, 5, 6],[2, 3, 4, 5, 6],[3, 4, 5, 6, 7]])
df['key'] = np.random.choice(range(10000,99999), len(df), replace=False)
df
a b c d e key
0 1 2 3 4 5 10560
1 1 2 3 4 6 79547
2 1 2 3 4 7 24762
3 1 2 2 5 6 95221
4 2 3 4 5 6 79460
5 2 3 4 5 6 62820
6 3 4 5 6 7 82964

Multiple pandas columns

If a have pandas dataframe with 4 columns like this:
A B C D
0 2 4 1 9
1 3 2 9 7
2 1 6 9 2
3 8 6 5 4
is it possible to apply df.cumsum() in some way to get the results in a new column next to existing column like this:
A AA B BB C CC D DD
0 2 2 4 4 1 1 9 9
1 3 5 2 6 9 10 7 16
2 1 6 6 12 9 19 2 18
3 8 14 6 18 5 24 4 22
You can create new columns using assign:
result = df.assign(**{col*2: df[col].cumsum() for col in df})  # col*2 doubles the name: 'A' -> 'AA'
and order the columns with sort_index:
result.sort_index(axis=1)
# A AA B BB C CC D DD
# 0 2 2 4 4 1 1 9 9
# 1 3 5 2 6 9 10 7 16
# 2 1 6 6 12 9 19 2 18
# 3 8 14 6 18 5 24 4 22
Note that depending on the column names, sorting may not produce the desired order. In that case, using reindex is a more robust way of ensuring you obtain the desired column order:
result = df.assign(**{col*2:df[col].cumsum() for col in df})
result = result.reindex(columns=[item for col in df for item in (col, col*2)])
Here is an example which demonstrates the difference:
import pandas as pd
df = pd.DataFrame({'A': [2, 3, 1, 8], 'A A': [4, 2, 6, 6], 'C': [1, 9, 9, 5], 'D': [9, 7, 2, 4]})
result = df.assign(**{col*2:df[col].cumsum() for col in df})
print(result.sort_index(axis=1))
# A A A A AA A AA C CC D DD
# 0 2 4 4 2 1 1 9 9
# 1 3 2 6 5 9 10 7 16
# 2 1 6 12 6 9 19 2 18
# 3 8 6 18 14 5 24 4 22
result = result.reindex(columns=[item for col in df for item in (col, col*2)])
print(result)
# A AA A A A AA A C CC D DD
# 0 2 2 4 4 1 1 9 9
# 1 3 5 2 6 9 10 7 16
# 2 1 6 6 12 9 19 2 18
# 3 8 14 6 18 5 24 4 22
@unutbu's way certainly works, but using insert reads better to me. Plus you don't need to worry about sorting/reindexing!
for i, col_name in enumerate(df):
    df.insert(i * 2 + 1, col_name * 2, df[col_name].cumsum())
df
returns
A AA B BB C CC D DD
0 2 2 4 4 1 1 9 9
1 3 5 2 6 9 10 7 16
2 1 6 6 12 9 19 2 18
3 8 14 6 18 5 24 4 22
