How to use dictionary on np.where clause in pandas - python

I have the following dataframe
import pandas as pd
foo = pd.DataFrame({'id': [1,1,1,2,2,2],
'time': [1,2,3,1,2,3],
'col_id': ['ffp','ffp','ffp', 'hie', 'hie', 'ttt'],
'col_a': [1,2,3,4,5,6],
'col_b': [-1,-2,-3,-4,-5,-6],
'col_c': [10,20,30,40,50,60]})
id time col_id col_a col_b col_c
0 1 1 ffp 1 -1 10
1 1 2 ffp 2 -2 20
2 1 3 ffp 3 -3 30
3 2 1 hie 4 -4 40
4 2 2 hie 5 -5 50
5 2 3 ttt 6 -6 60
I would like to create a new col in foo, which will take the value of either col_a or col_b or col_c, depending on the value of col_id.
I am doing the following:
import numpy as np
foo['col'] = np.where(foo.col_id == "ffp", foo.col_a,
                      np.where(foo.col_id == "hie", foo.col_b, foo.col_c))
which gives
id time col_id col_a col_b col_c col
0 1 1 ffp 1 -1 10 1
1 1 2 ffp 2 -2 20 2
2 1 3 ffp 3 -3 30 3
3 2 1 hie 4 -4 40 -4
4 2 2 hie 5 -5 50 -5
5 2 3 ttt 6 -6 60 60
Since I have a lot of columns, I was wondering whether there is a cleaner way to do this, for example with a dictionary:
dict_cols_matching = {"ffp" : "col_a", "hie": "col_b", "ttt": "col_c"}
Any ideas ?

You can map the values of the dictionary on col_id, then perform indexing lookup:
import numpy as np
idx, cols = pd.factorize(foo['col_id'].map(dict_cols_matching))
foo['col'] = foo.reindex(cols, axis=1).to_numpy()[np.arange(len(foo)), idx]
Output:
id time col_id col_a col_b col_c col
0 1 1 ffp 1 -1 10 1
1 1 2 ffp 2 -2 20 2
2 1 3 ffp 3 -3 30 3
3 2 1 hie 4 -4 40 -4
4 2 2 hie 5 -5 50 -5
5 2 3 ttt 6 -6 60 60
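To make the one-liner above easier to follow, here is the same lookup split into steps. This is only a sketch on the sample data from the question; the intermediate names target and values are introduced here for illustration.

```python
import numpy as np
import pandas as pd

foo = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2],
                    'time': [1, 2, 3, 1, 2, 3],
                    'col_id': ['ffp', 'ffp', 'ffp', 'hie', 'hie', 'ttt'],
                    'col_a': [1, 2, 3, 4, 5, 6],
                    'col_b': [-1, -2, -3, -4, -5, -6],
                    'col_c': [10, 20, 30, 40, 50, 60]})
dict_cols_matching = {"ffp": "col_a", "hie": "col_b", "ttt": "col_c"}

# Step 1: translate each col_id into the name of the column to read from.
target = foo['col_id'].map(dict_cols_matching)

# Step 2: factorize returns an integer code per row and the distinct column names.
idx, cols = pd.factorize(target)

# Step 3: build a 2-D array holding only those columns, in the order of cols.
values = foo.reindex(cols, axis=1).to_numpy()

# Step 4: for row i, pick the entry in column idx[i].
foo['col'] = values[np.arange(len(foo)), idx]
print(foo)
```

One thing to watch: a col_id that is missing from the dictionary maps to NaN, which factorize encodes as -1, so the row would silently read from the last column. Checking target.isna().any() before the lookup is a cheap safeguard.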

You can use np.select, which maps a list of conditions to a list of choices:
foo['col'] = np.select([foo.col_id.eq("ffp"), foo.col_id.eq("hie"), foo.col_id.eq("ttt")],
[foo.col_a, foo.col_b, foo.col_c])
id time col_id col_a col_b col_c col
0 1 1 ffp 1 -1 10 1
1 1 2 ffp 2 -2 20 2
2 1 3 ffp 3 -3 30 3
3 2 1 hie 4 -4 40 -4
4 2 2 hie 5 -5 50 -5
5 2 3 ttt 6 -6 60 60
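Since the question asks for a dictionary-driven version, the condition and choice lists can also be generated from dict_cols_matching rather than written by hand. A minimal sketch, assuming every key of the dictionary is a possible col_id and every value is an existing column:

```python
import numpy as np
import pandas as pd

foo = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2],
                    'time': [1, 2, 3, 1, 2, 3],
                    'col_id': ['ffp', 'ffp', 'ffp', 'hie', 'hie', 'ttt'],
                    'col_a': [1, 2, 3, 4, 5, 6],
                    'col_b': [-1, -2, -3, -4, -5, -6],
                    'col_c': [10, 20, 30, 40, 50, 60]})
dict_cols_matching = {"ffp": "col_a", "hie": "col_b", "ttt": "col_c"}

# One boolean mask per dictionary key, one source column per dictionary value.
conditions = [foo['col_id'].eq(key) for key in dict_cols_matching]
choices = [foo[col] for col in dict_cols_matching.values()]

# Rows matching no condition get np.select's default value, which is 0.
foo['col'] = np.select(conditions, choices)
print(foo)
```

Adding a new mapping then only requires a new dictionary entry, not another nested condition.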

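You can use a lambda function to select the column based on col_id. Note that this method relies on the sorted unique col_id values lining up with the order of the value columns, and on those columns starting at position 3; adjust the offset 3 if you change the column order.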
import pandas as pd
import numpy as np
foo = pd.DataFrame({'id': [1,1,1,2,2,2],
'time': [1,2,3,1,2,3],
'col_id': ['ffp','ffp','ffp', 'hie', 'hie', 'ttt'],
'col_a': [1,2,3,4,5,6],
'col_b': [-1,-2,-3,-4,-5,-6],
'col_c': [10,20,30,40,50,60]})
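# Sorted unique ids: ['ffp', 'hie', 'ttt'], in the same order as col_a, col_b, col_c
idSet = np.unique(foo['col_id'].to_numpy()).tolist()
# Use .iloc for positional access; the value columns start at position 3
foo['col'] = foo.apply(lambda x: x.iloc[idSet.index(x.col_id) + 3], axis=1)
print(foo)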
Output
id time col_id col_a col_b col_c col
0 1 1 ffp 1 -1 10 1
1 1 2 ffp 2 -2 20 2
2 1 3 ffp 3 -3 30 3
3 2 1 hie 4 -4 40 -4
4 2 2 hie 5 -5 50 -5
5 2 3 ttt 6 -6 60 60

You might use reset_index in combination with a row-wise apply, assigning the result to the new column:
foo['col'] = foo[["col_id"]].reset_index().apply(lambda u: foo.loc[u["index"], dict_cols_matching[u["col_id"]]], axis=1)

Related

Python pandas how to add an accumulation counter to a neighboring column?

I have a csv file with something like this column
Comulative
-1
-3
-4
1
2
5
-1
-4
-8
1
3
5
10
I would like to add a column that counts the rows within each run of same-sign values and restarts at every sign change,
to get something like this:
Comulative Score
-1 1
-3 2
-4 3
1 1
2 2
5 3
-1 1
-4 2
-8 3
1 1
3 2
5 3
10 4
In my original CSV file, the Comulative column usually keeps the same sign for about 100 to 500 lines; here it changes this often only for clarity.
What is the best way to do this?
Get the sign with numpy.sign, then use a custom groupby with cumcount:
# get sign
s = np.sign(df['Comulative'])
# group by consecutive signs
group = s.ne(s.shift()).cumsum()
# enumerate
df['Score'] = s.groupby(group).cumcount().add(1)
NB. if you want to consider 0 as part of the positive numbers, use s = df['Comulative'].ge(0) (with .gt(0), zeros would be grouped with the negative numbers instead).
output:
Comulative Score
0 -1 1
1 -3 2
2 -4 3
3 1 1
4 2 2
5 5 3
6 -1 1
7 -4 2
8 -8 3
9 1 1
10 3 2
11 5 3
12 10 4
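To see why this works, it helps to look at the intermediate series. The sketch below rebuilds the sample column from the question; the name changed is introduced here for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Comulative': [-1, -3, -4, 1, 2, 5, -1, -4, -8, 1, 3, 5, 10]})

# -1, 0 or 1 for every row.
s = np.sign(df['Comulative'])
# True wherever the sign differs from the previous row (the first row is always True).
changed = s.ne(s.shift())
# Running total of the changes: every run of equal signs gets its own id.
group = changed.cumsum()
# Number the rows inside each run, starting at 1.
df['Score'] = s.groupby(group).cumcount().add(1)

print(pd.DataFrame({'sign': s, 'changed': changed, 'group': group, 'Score': df['Score']}))
```

Because the counter depends only on group, runs of a few hundred rows, as in the real file, are handled the same way as the short runs here.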
mozway's usage of np.sign is far prettier, but for a more drawn-out answer you could do something like this:
# Marks True every time the previous value is positive
# and the current one is negative, or vice versa.
groups = (df.Comulative.lt(0) & df.Comulative.shift().gt(0)
| df.Comulative.gt(0) & df.Comulative.shift().lt(0)).cumsum()
df['Score'] = df.groupby(groups).cumcount().add(1)
Output:
Comulative Score
0 -1 1
1 -3 2
2 -4 3
3 1 1
4 2 2
5 5 3
6 -1 1
7 -4 2
8 -8 3
9 1 1
10 3 2
11 5 3
12 10 4

Create a new column that counts backwards from a specific point

I would like to look at an outcome in the time prior to a change in product and after a change in product. Here is an example df:
import pandas as pd
ids = [1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2]
date = ["11/4/2020", "12/5/2020", "01/5/2021", "02/5/2020", "03/5/2020", "04/5/2020", "05/5/2020", "06/5/2020", "07/5/2020", "08/5/2020", "09/5/2020",
"01/3/2019", "02/3/2019", "03/3/2019", "04/3/2019", "05/3/2019", "06/3/2019", "07/3/2019", "08/3/2019", "09/3/2019", "10/3/2019"]
months = [0,1,2,3,4,0,1,2,3,4,5,0,1,2,3,4,0,1,2,3,4]
df = pd.DataFrame({'ids': ids,
'date': date,
'months': months
})
df
ids date months
0 1 11/4/2020 0
1 1 12/5/2020 1
2 1 01/5/2021 2
3 1 02/5/2020 3
4 1 03/5/2020 4
5 1 04/5/2020 0
6 1 05/5/2020 1
7 1 06/5/2020 2
8 1 07/5/2020 3
9 1 08/5/2020 4
10 1 09/5/2020 5
11 2 01/3/2019 0
12 2 02/3/2019 1
13 2 03/3/2019 2
14 2 04/3/2019 3
15 2 05/3/2019 4
16 2 06/3/2019 0
17 2 07/3/2019 1
18 2 08/3/2019 2
19 2 09/3/2019 3
20 2 10/3/2019 4
This is what I would like the end result to be:
ids date months new_col
0 1 11/4/2020 0 -5
1 1 12/5/2020 1 -4
2 1 01/5/2021 2 -3
3 1 02/5/2020 3 -2
4 1 03/5/2020 4 -1
5 1 04/5/2020 0 0
6 1 05/5/2020 1 1
7 1 06/5/2020 2 2
8 1 07/5/2020 3 3
9 1 08/5/2020 4 4
10 1 09/5/2020 5 5
11 2 01/3/2019 0 -5
12 2 02/3/2019 1 -4
13 2 03/3/2019 2 -3
14 2 04/3/2019 3 -2
15 2 05/3/2019 4 -1
16 2 06/3/2019 0 0
17 2 07/3/2019 1 1
18 2 08/3/2019 2 2
19 2 09/3/2019 3 3
20 2 10/3/2019 4 4
In other words I would like to add a column that finds the second instance of months = 0 for a specific ID and counts backwards from that so I can look at outcomes before that point (all the negative numbers) vs the outcomes after that point (all the positive numbers).
Is there a simple way to do this in pandas?
Thanks in advance
This assumes there are exactly two rows with months = 0 per id. In that case the ids can be ignored, because the zeros simply alternate between starting a negative counter and starting a positive one:
(id1, first 0) -> negative counter,
(id1, second 0) -> positive counter,
(id2, first 0) -> negative counter,
(id2, second 0) -> positive counter, and so on.
Create virtual groups from the cumulative count of zeros; the parity of the group number tells you which counter to build:
odd group: negative counter
even group: positive counter
df['new_col'] = (
df.assign(new_col=df['months'].eq(0).cumsum())
.groupby('new_col')['new_col']
.apply(lambda x: range(-len(x), 0, 1) if x.name % 2 else range(len(x)))
.explode().values
)
Output:
>>> df
ids date months new_col
0 1 11/4/2020 0 -5
1 1 12/5/2020 1 -4
2 1 01/5/2021 2 -3
3 1 02/5/2020 3 -2
4 1 03/5/2020 4 -1
5 1 04/5/2020 0 0
6 1 05/5/2020 1 1
7 1 06/5/2020 2 2
8 1 07/5/2020 3 3
9 1 08/5/2020 4 4
10 1 09/5/2020 5 5
11 2 01/3/2019 0 -5
12 2 02/3/2019 1 -4
13 2 03/3/2019 2 -3
14 2 04/3/2019 3 -2
15 2 05/3/2019 4 -1
16 2 06/3/2019 0 0
17 2 07/3/2019 1 1
18 2 08/3/2019 2 2
19 2 09/3/2019 3 3
20 2 10/3/2019 4 4
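If you would rather not rely on the zeros alternating across ids, the same result can be obtained by grouping on ids explicitly. This is a sketch of an alternative, not the method above: it assumes every id has at least two rows with months = 0, the names is_zero, zero_rank, pos and pivot are made up for illustration, and the date column is left out for brevity.

```python
import pandas as pd

ids = [1] * 11 + [2] * 10
months = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
df = pd.DataFrame({'ids': ids, 'months': months})

is_zero = df['months'].eq(0)
# Running count of zeros within each id: 1 before the second zero, 2 from it onward.
zero_rank = is_zero.astype(int).groupby(df['ids']).cumsum()
# Row position within each id (0, 1, 2, ...).
pos = df.groupby('ids').cumcount()
# Position of the second zero, broadcast to every row of the same id.
pivot = pos.where(is_zero & zero_rank.eq(2)).groupby(df['ids']).transform('first')
# Signed distance from the second zero.
df['new_col'] = (pos - pivot).astype(int)
print(df)
```

An id without a second zero would leave NaN in pivot and make the final astype(int) fail, which at least surfaces the broken assumption instead of hiding it.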

How to get balance value in a new column pandas

here is my data frame
data={'first':[5,4,3,2,3], 'second':[1,2,3,4,5]}
df= pd.DataFrame(data)
first second
5 1
4 2
3 3
2 4
3 5
and I want a third column holding a running balance: start from 0, subtract first and add second on every row, e.g. 0-5+1 = -4, -4-4+2 = -6, -6-3+3 = -6, and so on. Sorry, my English is not very good.
first second third
5 1 -4 #0-5(first)+1(second)= balance -4
4 2 -6 #-4(balance)-4(first)+2(second)= balance -6
3 3 -6
2 4 -4
3 5 -2
You can subtract second from first and take the cumsum (cumulated sum):
df['third'] = (df['second']-df['first']).cumsum()
output:
first second third
0 5 1 -4
1 4 2 -6
2 3 3 -6
3 2 4 -4
4 3 5 -2
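As a sanity check, the cumulative sum can be compared with the running balance written out as an explicit loop, the way the question describes it. A small sketch; the names balance and expected are only used here.

```python
import pandas as pd

df = pd.DataFrame({'first': [5, 4, 3, 2, 3], 'second': [1, 2, 3, 4, 5]})

# Explicit running balance: previous balance - first + second.
balance = 0
expected = []
for first, second in zip(df['first'], df['second']):
    balance = balance - first + second
    expected.append(balance)

# Vectorised equivalent: cumulative sum of the per-row change.
df['third'] = (df['second'] - df['first']).cumsum()

print(expected)
print(df['third'].tolist())
```

Both give the same numbers because each row changes the balance by second - first, and cumsum accumulates exactly those changes.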

Subtraction of elements column-wise, in pandas

I have the following dataframe:
frame=pd.DataFrame({"col1":[1,5,9,4,7,3],"col2":[5,8,7,9,3,4],"col3":[3,4,2,7,9,1],
"col4":[2,4,7,4,9,0],"col5":[3,4,5,2,1,1],"col6":[8,7,5,4,1,2]})
it results in the following output:
col1 col2 col3 col4 col5 col6
0 1 5 3 2 3 8
1 5 8 4 4 4 7
2 9 7 2 7 5 5
3 4 9 7 4 2 4
4 7 3 9 9 1 1
5 3 4 1 0 1 2
I want to create a new dataframe containing the differences col1 - col2, col3 - col4 and col5 - col6.
The expected output looks like this:
col1-col2 col3-col4 col5-col6
0 -4 1 -5
1 -3 0 -3
2 2 -5 0
3 -5 3 -2
4 4 0 0
5 -1 1 -1
Thanks in advance
dfr = pd.DataFrame({'col1-col2': frame.col1 - frame.col2,
'col3-col4': frame.col3 - frame.col4,
'col5-col6': frame.col5 - frame.col6})
If there are many columns, use a general solution: select the columns at even and odd positions, convert them to numpy arrays, subtract, and create a new DataFrame with the constructor:
#pandas 0.24+
arr = frame.iloc[:, ::2].to_numpy() - frame.iloc[:, 1::2].to_numpy()
#pandas below
#arr = frame.iloc[:, ::2].values - frame.iloc[:, 1::2].values
c = [f'{a}-{b}' for a, b in zip(frame.columns[::2], frame.columns[1::2])]
df = pd.DataFrame(arr, columns=c)
print (df)
col1-col2 col3-col4 col5-col6
0 -4 1 -5
1 -3 0 -3
2 2 -5 0
3 -5 3 -2
4 4 0 0
5 -1 1 -1
If performance is important, convert the whole DataFrame to a numpy array first, store it in a variable, and then index the array:
#pandas 0.24+
arr = frame.to_numpy()
#pandas below
#arr = frame.values
c = [f'{a}-{b}' for a, b in zip(frame.columns[::2], frame.columns[1::2])]
df = pd.DataFrame(arr[:, ::2] - arr[:, 1::2], columns=c)
Another option is a row-wise apply, followed by renaming the resulting columns:
df = pd.DataFrame(frame.apply(lambda x: [x['col1']-x['col2'],x['col3']-x['col4'],x['col5']-x['col6']],axis=1).tolist())
df.rename({0:'col1-col2',1:'col3-col4',2:'col5-col6'},axis=1)
col1-col2 col3-col4 col5-col6
0 -4 1 -5
1 -3 0 -3
2 2 -5 0
3 -5 3 -2
4 4 0 0
5 -1 1 -1
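When the columns to subtract are not neatly interleaved, the pairs can be listed explicitly and the result built with a dict comprehension. A minimal sketch; the pairs list and the name result are assumptions made for this example.

```python
import pandas as pd

frame = pd.DataFrame({"col1": [1, 5, 9, 4, 7, 3], "col2": [5, 8, 7, 9, 3, 4],
                      "col3": [3, 4, 2, 7, 9, 1], "col4": [2, 4, 7, 4, 9, 0],
                      "col5": [3, 4, 5, 2, 1, 1], "col6": [8, 7, 5, 4, 1, 2]})

# Each tuple is (minuend, subtrahend); edit this list to change the pairing.
pairs = [('col1', 'col2'), ('col3', 'col4'), ('col5', 'col6')]

# One output column per pair, named after the two inputs.
result = pd.DataFrame({f'{a}-{b}': frame[a] - frame[b] for a, b in pairs})
print(result)
```

This keeps the column labels and the arithmetic in one place, so a label cannot drift out of sync with the columns it was computed from.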

Pandas max value in column and subtract

I have a pandas dataframe like:
df = pd.DataFrame({'A':[1,1,1,2,2,2,3,3,3],
'B':[3,2,20,1,6,2,3,22,1]})
I would like to find the max value in column 'B', subtract it from every value in column 'B', and store the result in a new column 'C'. For the dataframe above the max is 22, so the expected result is:
A B C
2 1 3 -19
1 1 2 -20
0 1 20 -2
3 2 1 -21
5 2 6 -16
4 2 2 -20
8 3 3 -19
7 3 22 0
6 3 1 -21
You can assign to your new column the result of subtracting the max of column 'B' from column 'B':
In [25]:
df['C'] = df['B'] - df['B'].max()
df
Out[25]:
A B C
0 1 3 -19
1 1 2 -20
2 1 20 -2
3 2 1 -21
4 2 6 -16
5 2 2 -20
6 3 3 -19
7 3 22 0
8 3 1 -21
Use sub to subtract the max value of column B:
df['C'] = df['B'].sub(df['B'].max())
print (df)
A B C
0 1 3 -19
1 1 2 -20
2 1 20 -2
3 2 1 -21
4 2 6 -16
5 2 2 -20
6 3 3 -19
7 3 22 0
8 3 1 -21
Another solution with assign:
df = df.assign(C=df['B'].sub(df['B'].max()))
print (df)
A B C
0 1 3 -19
1 1 2 -20
2 1 20 -2
3 2 1 -21
4 2 6 -16
5 2 2 -20
6 3 3 -19
7 3 22 0
8 3 1 -21
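The three variants above differ only in spelling. A quick sketch that checks they produce the same values on the sample data; the names by_operator, by_sub and by_assign are made up for this comparison.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'B': [3, 2, 20, 1, 6, 2, 3, 22, 1]})

# Plain operator, Series.sub, and DataFrame.assign all compute B - max(B).
by_operator = df['B'] - df['B'].max()
by_sub = df['B'].sub(df['B'].max())
by_assign = df.assign(C=df['B'].sub(df['B'].max()))['C']

print(by_operator.tolist())
print(by_operator.tolist() == by_sub.tolist() == by_assign.tolist())
```

The practical difference is that assign returns a new DataFrame and leaves the original untouched, while df['C'] = ... modifies df in place.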
