I have a pandas dataframe like:
df = pd.DataFrame({'A':[1,1,1,2,2,2,3,3,3],
'B':[3,2,20,1,6,2,3,22,1]})
I would like to find the max value in column 'B', subtract it from every value in column 'B', and create a new column 'C' with the result. The max is 22 for the df below.
A B C
2 1 3 -19
1 1 2 -20
0 1 20 -2
3 2 1 -21
5 2 6 -16
4 2 2 -20
8 3 3 -19
7 3 22 0
6 3 1 -21
You can assign your new column as the result of subtracting the max of column 'B' from column 'B':
In [25]:
df['C'] = df['B'] - df['B'].max()
df
Out[25]:
A B C
0 1 3 -19
1 1 2 -20
2 1 20 -2
3 2 1 -21
4 2 6 -16
5 2 2 -20
6 3 3 -19
7 3 22 0
8 3 1 -21
Use sub to subtract the max value of column B:
df['C'] = df['B'].sub(df['B'].max())
print (df)
A B C
0 1 3 -19
1 1 2 -20
2 1 20 -2
3 2 1 -21
4 2 6 -16
5 2 2 -20
6 3 3 -19
7 3 22 0
8 3 1 -21
Another solution with assign:
df = df.assign(C=df['B'].sub(df['B'].max()))
print (df)
A B C
0 1 3 -19
1 1 2 -20
2 1 20 -2
3 2 1 -21
4 2 6 -16
5 2 2 -20
6 3 3 -19
7 3 22 0
8 3 1 -21
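All three approaches build the same column; a minimal self-contained sketch (data taken from the question) checking that they agree:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'B': [3, 2, 20, 1, 6, 2, 3, 22, 1]})

# Three equivalent ways to compute C = B - max(B)
c1 = df['B'] - df['B'].max()                      # operator form
c2 = df['B'].sub(df['B'].max())                   # Series.sub
c3 = df.assign(C=df['B'] - df['B'].max())['C']    # assign returns a copy

assert list(c1) == list(c2) == list(c3) == [-19, -20, -2, -21, -16, -20, -19, 0, -21]
```

The only practical difference is that assign returns a new DataFrame instead of mutating df in place.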
I would like to look at an outcome in the period before a change in product and in the period after the change. Here is an example df:
import pandas as pd
ids = [1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2]
date = ["11/4/2020", "12/5/2020", "01/5/2021", "02/5/2020", "03/5/2020", "04/5/2020", "05/5/2020", "06/5/2020", "07/5/2020", "08/5/2020", "09/5/2020",
"01/3/2019", "02/3/2019", "03/3/2019", "04/3/2019", "05/3/2019", "06/3/2019", "07/3/2019", "08/3/2019", "09/3/2019", "10/3/2019"]
months = [0,1,2,3,4,0,1,2,3,4,5,0,1,2,3,4,0,1,2,3,4]
df = pd.DataFrame({'ids': ids,
'date': date,
'months': months
})
df
ids date months
0 1 11/4/2020 0
1 1 12/5/2020 1
2 1 01/5/2021 2
3 1 02/5/2020 3
4 1 03/5/2020 4
5 1 04/5/2020 0
6 1 05/5/2020 1
7 1 06/5/2020 2
8 1 07/5/2020 3
9 1 08/5/2020 4
10 1 09/5/2020 5
11 2 01/3/2019 0
12 2 02/3/2019 1
13 2 03/3/2019 2
14 2 04/3/2019 3
15 2 05/3/2019 4
16 2 06/3/2019 0
17 2 07/3/2019 1
18 2 08/3/2019 2
19 2 09/3/2019 3
20 2 10/3/2019 4
This is what I would like the end result to be:
ids date months new_col
0 1 11/4/2020 0 -5
1 1 12/5/2020 1 -4
2 1 01/5/2021 2 -3
3 1 02/5/2020 3 -2
4 1 03/5/2020 4 -1
5 1 04/5/2020 0 0
6 1 05/5/2020 1 1
7 1 06/5/2020 2 2
8 1 07/5/2020 3 3
9 1 08/5/2020 4 4
10 1 09/5/2020 5 5
11 2 01/3/2019 0 -5
12 2 02/3/2019 1 -4
13 2 03/3/2019 2 -3
14 2 04/3/2019 3 -2
15 2 05/3/2019 4 -1
16 2 06/3/2019 0 0
17 2 07/3/2019 1 1
18 2 08/3/2019 2 2
19 2 09/3/2019 3 3
20 2 10/3/2019 4 4
In other words I would like to add a column that finds the second instance of months = 0 for a specific ID and counts backwards from that so I can look at outcomes before that point (all the negative numbers) vs the outcomes after that point (all the positive numbers).
Is there a simple way to do this in pandas?
Thanks in advance
Assuming there are exactly 2 instances of 0 per group, the ids column itself isn't needed, because:
(id1, first 0) -> negative counter,
(id1, second 0) -> positive counter,
(id2, first 0) -> negative counter,
(id2, second 0) -> positive counter, and so on.
Create virtual groups to decide whether to build a negative or a positive counter:
odd group: negative counter
even group: positive counter
df['new_col'] = (
df.assign(new_col=df['months'].eq(0).cumsum())
.groupby('new_col')['new_col']
.apply(lambda x: range(-len(x), 0, 1) if x.name % 2 else range(len(x)))
.explode().values
)
Output:
>>> df
ids date months new_col
0 1 11/4/2020 0 -5
1 1 12/5/2020 1 -4
2 1 01/5/2021 2 -3
3 1 02/5/2020 3 -2
4 1 03/5/2020 4 -1
5 1 04/5/2020 0 0
6 1 05/5/2020 1 1
7 1 06/5/2020 2 2
8 1 07/5/2020 3 3
9 1 08/5/2020 4 4
10 1 09/5/2020 5 5
11 2 01/3/2019 0 -5
12 2 02/3/2019 1 -4
13 2 03/3/2019 2 -3
14 2 04/3/2019 3 -2
15 2 05/3/2019 4 -1
16 2 06/3/2019 0 0
17 2 07/3/2019 1 1
18 2 08/3/2019 2 2
19 2 09/3/2019 3 3
20 2 10/3/2019 4 4
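To see why the odd/even trick works, it helps to look at the intermediate cumulative count of zeros; a small sketch using just the months column from the question:

```python
import pandas as pd

months = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
df = pd.DataFrame({'months': months})

# Each 0 starts a new virtual group: 1, 2, 3, 4, ...
group = df['months'].eq(0).cumsum()
# Odd groups (1, 3, ...) get the negative countdown,
# even groups (2, 4, ...) get the positive count-up.
print(group.tolist())
# [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4]
```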
I have a dataframe with repeat values in column A. I want to drop duplicates, keeping the rows whose value in column B is > 0.
So this:
A B
1 20
1 10
1 -3
2 30
2 -9
2 40
3 10
Should turn into this:
A B
1 20
1 10
2 30
2 40
3 10
Any suggestions on how this can be achieved? I shall be grateful!
In the sample data there are no duplicates, so just use:
df = df[df['B'].gt(0)]
print (df)
A B
0 1 20
1 1 10
3 2 30
5 2 40
6 3 10
If there are duplicates:
print (df)
A B
0 1 20
1 1 10
2 1 10
3 1 10
4 1 -3
5 2 30
6 2 -9
7 2 40
8 3 10
df = df[df['B'].gt(0) & ~df.duplicated()]
print (df)
A B
0 1 20
1 1 10
5 2 30
7 2 40
8 3 10
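The same result can also be reached by filtering first and then chaining DataFrame.drop_duplicates; a sketch assuming the duplicated sample above:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 1, 1, 2, 2, 2, 3],
                   'B': [20, 10, 10, 10, -3, 30, -9, 40, 10]})

# Keep positive B first, then drop fully duplicated rows
out = df[df['B'].gt(0)].drop_duplicates()
print(out)
```

This keeps the first occurrence of each duplicated row, matching the mask-based answer above.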
I have a dataframe-
data={'a':[1,2,3,6],'b':[5,6,7,6],'c':[45,77,88,99]}
df=pd.DataFrame(data)
Now I want to add a column whose values start two rows down in the dataframe.
The updated dataframe should look like-
l=[4,5] #column to add
a b c d
0 1 5 45 0
1 2 6 77 0
2 3 7 88 4
3 6 6 99 5
I tried this, but it doesn't put the values where I want:
df.loc[:2,'f'] = pd.Series(l)
The idea is to add a Series whose index matches the last len(l) rows of the DataFrame:
df['d'] = pd.Series(l, index=df.index[-len(l):])
print (df)
a b c d
0 1 5 45 NaN
1 2 6 77 NaN
2 3 7 88 4.0
3 6 6 99 5.0
Finally, to get 0 instead of NaN, chain Series.reindex with the original index:
df['d'] = pd.Series(l, index=df.index[-len(l):]).reindex(df.index, fill_value=0)
print (df)
a b c d
0 1 5 45 0
1 2 6 77 0
2 3 7 88 4
3 6 6 99 5
Another idea is to repeat 0 for the difference in lengths and concatenate l:
df['d'] = [0] * (len(df) - len(l)) + l
print (df)
a b c d
0 1 5 45 0
1 2 6 77 0
2 3 7 88 4
3 6 6 99 5
You can add a column of 0s and then overwrite the last positions with iloc:
>>> df
a b c
0 1 5 45
1 2 6 77
2 3 7 88
3 6 6 99
>>> df['d'] = 0
>>> df.iloc[-2:, df.columns.get_loc('d')] = [4,5]
>>> df
a b c d
0 1 5 45 0
1 2 6 77 0
2 3 7 88 4
3 6 6 99 5
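Both variants fill the top rows with 0 and place l at the bottom; a sketch (data from the question) checking that the reindex and iloc approaches agree:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 6], 'b': [5, 6, 7, 6], 'c': [45, 77, 88, 99]})
l = [4, 5]

# Variant 1: align a Series to the last len(l) index labels, fill the rest with 0
d1 = pd.Series(l, index=df.index[-len(l):]).reindex(df.index, fill_value=0)

# Variant 2: start from zeros and overwrite the last len(l) positions
d2 = pd.Series(0, index=df.index)
d2.iloc[-len(l):] = l

assert d1.tolist() == d2.tolist() == [0, 0, 4, 5]
```

The reindex variant relies on index labels, the iloc variant on positions; they only coincide when the index is the default RangeIndex, as here.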
I want to obtain the second highest value of a certain section for each row from a dataframe. How do I do this?
I have tried the following code but it doesn't work:
df.iloc[:, 5:-3].nlargest(2)(axis=1, level=2)
Is there any other way to obtain this?
Using apply with axis=1 you can find the second-largest value for each row, by taking the 2 largest and then keeping the last of them:
df.iloc[:, 5:-3].apply(lambda row: row.nlargest(2).values[-1],axis=1)
Example
The code below finds the second-largest value in each row of df.
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: df = pd.DataFrame({'Col{}'.format(i):np.random.randint(0,100,5) for i in range(5)})
In [4]: df
Out[4]:
Col0 Col1 Col2 Col3 Col4
0 82 32 14 62 90
1 62 32 74 62 72
2 31 79 22 17 3
3 42 54 66 93 50
4 13 88 6 46 69
In [5]: df.apply(lambda row: row.nlargest(2).values[-1],axis=1)
Out[5]:
0 82
1 72
2 31
3 66
4 69
dtype: int64
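apply with axis=1 calls the lambda once per row, which can be slow on wide data; as an alternative not shown in the answer, numpy.partition finds the second-largest value per row in a single vectorized call (a sketch on the same sample values):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col0': [82, 62, 31, 42, 13],
                   'Col1': [32, 32, 79, 54, 88],
                   'Col2': [14, 74, 22, 66, 6],
                   'Col3': [62, 62, 17, 93, 46],
                   'Col4': [90, 72, 3, 50, 69]})

# partition places the 2nd-largest value at position -2 of each row
second = np.partition(df.to_numpy(), -2, axis=1)[:, -2]
print(second)  # [82 72 31 66 69]
```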
I think you need to sort per row and then select the second-to-last column:
a = np.sort(df.iloc[:, 5:-3], axis=1)[:, -2]
Sample:
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(10,10)))
print (df)
0 1 2 3 4 5 6 7 8 9
0 8 8 3 7 7 0 4 2 5 2
1 2 2 1 0 8 4 0 9 6 2
2 4 1 5 3 4 4 3 7 1 1
3 7 7 0 2 9 9 3 2 5 8
4 1 0 7 6 2 0 8 2 5 1
5 8 1 5 4 2 8 3 5 0 9
6 3 6 3 4 7 6 3 9 0 4
7 4 5 7 6 6 2 4 2 7 1
8 6 6 0 7 2 3 5 4 2 4
9 3 7 9 0 0 5 9 6 6 5
print (df.iloc[:, 5:-3])
5 6
0 0 4
1 4 0
2 4 3
3 9 3
4 0 8
5 8 3
6 6 3
7 2 4
8 3 5
9 5 9
a = np.sort(df.iloc[:, 5:-3], axis=1)[:, -2]
print (a)
[0 0 3 3 0 3 3 2 3 5]
If you need both values:
a = df.iloc[:, 5:-3].values
b = pd.DataFrame(a[np.arange(len(a))[:, None], np.argsort(a, axis=1)])
print (b)
0 1
0 0 4
1 0 4
2 3 4
3 3 9
4 0 8
5 3 8
6 3 6
7 2 4
8 3 5
9 5 9
You need to sort the rows with numpy.sort() and then take the second-highest value, which sits at index -2 of each sorted row:
import numpy as np
second = np.sort(df.iloc[:, 5:-3], axis=1)[:, -2]
I have a dataframe that looks like below.
T$QOOR
3
14
12
-6
-19
9
I want to split the positive and negative values into new columns.
sls_item['SALES'] = sls_item['T$QOOR'].apply(lambda x: x if x >= 0 else 0)
sls_item['RETURN'] = sls_item['T$QOOR'].apply(lambda x: x*-1 if x < 0 else 0)
The result will be as below.
T$QOOR SALES RETURN
3 3 0
14 14 0
12 12 0
-6 0 6
-19 0 19
9 9 0
Any better and cleaner way to do so other than using apply?
Solution with clip (in older pandas clip_lower and clip_upper; both were removed in pandas 1.0), plus mul for multiplying by -1:
sls_item['SALES'] = sls_item['T$QOOR'].clip(lower=0)
sls_item['RETURN'] = sls_item['T$QOOR'].clip(upper=0).mul(-1)
print (sls_item)
T$QOOR SALES RETURN
0 3 3 0
1 14 14 0
2 12 12 0
3 -6 0 6
4 -19 0 19
5 9 9 0
Use where or numpy.where:
sls_item['SALES'] = sls_item['T$QOOR'].where(lambda x: x >= 0, 0)
sls_item['RETURN'] = sls_item['T$QOOR'].where(lambda x: x < 0, 0) * -1
print (sls_item)
T$QOOR SALES RETURN
0 3 3 0
1 14 14 0
2 12 12 0
3 -6 0 6
4 -19 0 19
5 9 9 0
mask = sls_item['T$QOOR'] >=0
sls_item['SALES'] = np.where(mask, sls_item['T$QOOR'], 0)
sls_item['RETURN'] = np.where(~mask, sls_item['T$QOOR'] * -1, 0)
print (sls_item)
T$QOOR SALES RETURN
0 3 3 0
1 14 14 0
2 12 12 0
3 -6 0 6
4 -19 0 19
5 9 9 0
assign + where
df.assign(po=df.where(df['T$QOOR']>0,0),ne=df.where(df['T$QOOR']<0,0))
Out[1355]:
T$QOOR ne po
0 3 0 3
1 14 0 14
2 12 0 12
3 -6 -6 0
4 -19 -19 0
5 9 0 9
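All of the answers above make the same split; a quick self-contained check (sls_item rebuilt from the question's data) that clip, where, and numpy.where agree:

```python
import numpy as np
import pandas as pd

sls_item = pd.DataFrame({'T$QOOR': [3, 14, 12, -6, -19, 9]})
s = sls_item['T$QOOR']

sales_clip = s.clip(lower=0)            # negatives -> 0
ret_clip = s.clip(upper=0).mul(-1)      # positives -> 0, then flip sign

sales_where = s.where(s >= 0, 0)
ret_where = s.where(s < 0, 0) * -1

mask = s >= 0
sales_np = np.where(mask, s, 0)
ret_np = np.where(~mask, s * -1, 0)

assert sales_clip.tolist() == sales_where.tolist() == sales_np.tolist() == [3, 14, 12, 0, 0, 9]
assert ret_clip.tolist() == ret_where.tolist() == ret_np.tolist() == [0, 0, 0, 6, 19, 0]
```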