Multiple IF condition based on previous and the current row - Pandas - python

I want to create a new column G that contains 1 if the previous row of column D contains Chrg or Float and the current row contains Dischrg. The first row should contain 0.
I want the pandas equivalent of the following Excel formula:
=IF(AND(D2="Chrg",D3="Dischrg"),G2+1,IF(AND(D2="Float",D3="Dischrg"),G2+1,0))
D G
Chrg 0
Dischrg 1
Float 0
Dischrg 0
Float 0
Dischrg 1

Use Series.shift with Series.isin:
m1 = df['D'].shift().isin(['Chrg', 'Float']) # Chrg or Float previous row
m2 = df['D'].eq('Dischrg') # Current row Dischrg
df['G'] = (m1&m2).astype(int)
D G
0 Chrg 0
1 Dischrg 1
2 Float 0
3 Dischrg 1
4 Float 0
5 Dischrg 1
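For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same shift/isin idea. The frame is rebuilt from the sample in the question, and the mask names prev_is_charge and curr_is_discharge are just illustrative.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({"D": ["Chrg", "Dischrg", "Float", "Dischrg", "Float", "Dischrg"]})

prev_is_charge = df["D"].shift().isin(["Chrg", "Float"])  # previous row is Chrg or Float
curr_is_discharge = df["D"].eq("Dischrg")                 # current row is Dischrg

# The first row has no previous value (missing after shift), so it is False and becomes 0
df["G"] = (prev_is_charge & curr_is_discharge).astype(int)
print(df)
```

Both masks are plain boolean Series, so combining them with & stays vectorised and avoids a Python loop.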

You can shift the column so you have the previous value in the same row:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([{"a": 1, "b": 1}, {"a": 2, "b": 2}, {"a": 3, "b": 3}])
In [3]: df
Out[3]:
a b
0 1 1
1 2 2
2 3 3
In [4]: df["prev"] = df.a.shift(1)
In [5]: df
Out[5]:
a b prev
0 1 1 NaN
1 2 2 1.0
2 3 3 2.0
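As the output shows, the first row has no predecessor, so it becomes NaN and the column is upcast to float. If that is unwanted, a possible variant (assuming a pandas version that supports the fill_value argument of shift) is sketched here; the column name prev_filled is made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3]})

# Default shift: the first row has no predecessor, so it is NaN and the column turns float
df["prev"] = df["a"].shift(1)

# With fill_value the gap is filled explicitly and the integer dtype is kept
df["prev_filled"] = df["a"].shift(1, fill_value=0)
print(df)
```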

Related

How to conditionally add one hot vector to a Pandas DataFrame

I have the following Pandas DataFrame in Python:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([[1, 2, 3], [3, 2, 1], [2, 1, 1]]),
columns=['a', 'b', 'c'])
df
It looks like this when you output it:
a b c
0 1 2 3
1 3 2 1
2 2 1 1
I need to add 3 new columns, as column "d", column "e", and column "f".
Values in each new column will be determined based on the values of column "b" and column "c".
In a given row:
If the value of column "b" is bigger than the value of column "c", columns [d, e, f] will have the values [1, 0, 0].
If the value of column "b" is equal to the value of column "c", columns [d, e, f] will have the values [0, 1, 0].
If the value of column "b" is smaller than the value of column "c", columns [d, e, f] will have the values [0, 0, 1].
After this operation, the DataFrame needs to look like this:
a b c d e f
0 1 2 3 0 0 1 # Since b smaller than c
1 3 2 1 1 0 0 # Since b bigger than c
2 2 1 1 0 1 0 # Since b = c
My original DataFrame is much bigger than the one in this example.
Is there a good way of doing this in Python without looping through the DataFrame?
You can use np.where to build a label column from the conditions, then use str.get_dummies to turn the labels into dummy columns:
df['vec'] = np.where(df.b>df.c, 'd', np.where(df.b == df.c, 'e', 'f'))
df = df.assign(**df['vec'].str.get_dummies()).drop(columns='vec')
a b c d e f
0 1 2 3 0 0 1
1 3 2 1 1 0 0
2 2 1 1 0 1 0
Let us try np.sign with get_dummies, -1 is c<b, 0 is c=b, 1 is c>b
df=df.join(np.sign(df.eval('c-b')).map({-1:'d',0:'e',1:'f'}).astype(str).str.get_dummies())
df
Out[29]:
a b c d e f
0 1 2 3 0 0 1
1 3 2 1 1 0 0
2 2 1 1 0 1 0
You simply harness the Boolean conditions you've already specified.
df["d"] = np.where(df.b > df.c, 1, 0)
df["e"] = np.where(df.b == df.c, 1, 0)
df["f"] = np.where(df.b < df.c, 1, 0)
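The three comparisons can also be stacked into one array and joined back in a single step. This is only a sketch on the sample data from the question; the names onehot and out are illustrative, and it assumes b and c contain no NaN (otherwise a row could end up all zeros).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [3, 2, 1], [2, 1, 1]]), columns=["a", "b", "c"])

# Exactly one of the three comparisons is True per row (assuming no NaN),
# so stacking them column-wise gives a one-hot row
onehot = np.column_stack([df["b"] > df["c"], df["b"] == df["c"], df["b"] < df["c"]])

# Wrap the 0/1 array in a frame that shares df's index and join it on
out = df.join(pd.DataFrame(onehot.astype(int), columns=["d", "e", "f"], index=df.index))
print(out)
```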

How to get maximum column condensed in one row with Pandas?

For the following dataframe
df = pd.DataFrame({"a": [1, 0, 11], "b": [7, 0, 0], "c": [0,10,0], "d": [1,0,0],
"e": [0,0,0], "name":["b","c","a"]})
print(df)
a b c d e name
0 1 7 0 1 0 b
1 0 0 10 0 0 c
2 11 0 0 0 0 a
I would like to get one row back that comprises the maximum values of each column plus the name of that column.
E.g. in this case:
a b c d e name
11 7 10 1 0 a
How can this be performed?
First get the maximum values into a one-row DataFrame with max, to_frame and a transpose (T), then get the name of the row holding the overall maximum with idxmax:
a = df.max().to_frame().T
a.loc[0, 'name'] = df.set_index('name').max(axis=1).idxmax()
print (a)
a b c d e name
0 11 7 10 1 0 a
Detail:
print (df.set_index('name').max(axis=1))
name
b 7
c 10
a 11
dtype: int64
print (df.set_index('name').max(axis=1).idxmax())
a
Use df.max(), wrap the result in a DataFrame and transpose it (note that name is then simply the maximum of the name column, here c):
pd.DataFrame(df.max()).T
a b c d e name
0 11 7 10 1 0 c
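If the name should come from the row that holds the overall largest value (a in the question's expected output), one possible sketch is to leave the string column out of the max and look the name up afterwards. The variable names numeric and result are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 0, 11], "b": [7, 0, 0], "c": [0, 10, 0], "d": [1, 0, 0],
                   "e": [0, 0, 0], "name": ["b", "c", "a"]})

numeric = df.drop(columns="name")      # leave the string column out of the max
result = numeric.max().to_frame().T    # one row holding each column's maximum

# Name taken from the row that contains the overall largest value
result["name"] = df.loc[numeric.max(axis=1).idxmax(), "name"]
print(result)
```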

How to delete rows in a pandas df

I am trying to remove a row in a pandas df plus the following row. For the df below I want to remove the row where the value in Code is equal to X, and I also want to remove the row after it.
import pandas as pd
d = ({
'Code' : ['A','A','B','C','X','A','B','A'],
'Int' : [0,1,1,2,3,3,4,5],
})
df = pd.DataFrame(d)
If I use this code it removes the desired row. But I can't use the same for value A as there are other rows that contain A, which are required.
df = df[df.Code != 'X']
So my intended output is:
Code Int
0 A 0
1 A 1
2 B 1
3 C 2
4 B 4
5 A 5
I need something like df = df[df.Code != 'X'] +1
Using shift
df.loc[(df.Code != 'X') & (df.Code.shift() != 'X')]
Out[99]:
Code Int
0 A 0
1 A 1
2 B 1
3 C 2
6 B 4
7 A 5
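You need to find the position of the row you want to delete, and then you can simply delete at that position twice (this assumes a single X that is not in the last row):
>>> i = df.Code.eq('X').to_numpy().argmax()  # position of the first 'X'
>>> df.drop(df.index[i], inplace=True)  # removes the 'X' row
>>> df.drop(df.index[i], inplace=True)  # the next row has now moved up to position i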
>>> df
Code Int
0 A 0
1 A 1
2 B 1
3 C 2
6 B 4
7 A 5
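To get the 0..5 index shown in the question's intended output, the shift-based mask can be followed by reset_index. A small sketch on the question's data, which also covers more than one X; the names keep and out are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Code": ["A", "A", "B", "C", "X", "A", "B", "A"],
    "Int": [0, 1, 1, 2, 3, 3, 4, 5],
})

# Keep a row only if it is not 'X' and the row directly above it is not 'X'
keep = df["Code"].ne("X") & df["Code"].shift().ne("X")

# reset_index(drop=True) renumbers the surviving rows 0..n-1
out = df[keep].reset_index(drop=True)
print(out)
```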

How do I insert a column at a specific column index in pandas?

Can I insert a column at a specific column index in pandas?
import pandas as pd
df = pd.DataFrame({'l':['a','b','c','d'], 'v':[1,2,1,2]})
df['n'] = 0
This will put column n as the last column of df, but isn't there a way to tell df to put n at the beginning?
See the docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.insert.html
Using loc = 0 will insert at the beginning:
df.insert(loc, column, value)
df = pd.DataFrame({'B': [1, 2, 3], 'C': [4, 5, 6]})
df
Out:
B C
0 1 4
1 2 5
2 3 6
idx = 0
new_col = [7, 8, 9] # can be a list, a Series, an array or a scalar
df.insert(loc=idx, column='A', value=new_col)
df
Out:
A B C
0 7 1 4
1 8 2 5
2 9 3 6
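If the target position is known by a neighbouring column's name rather than by number, the position can be looked up first. A small sketch, assuming the column labels are unique (so that get_loc returns a single integer); it reuses the frame from the question.

```python
import pandas as pd

df = pd.DataFrame({"l": ["a", "b", "c", "d"], "v": [1, 2, 1, 2]})

# get_loc returns the integer position of an existing (unique) column label
pos = df.columns.get_loc("v")

# 'n' now sits where 'v' was; 'v' moves one place to the right
df.insert(pos, "n", 0)
print(df.columns.tolist())
```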
If you want a single value for all rows:
df.insert(0,'name_of_column','')
df['name_of_column'] = value
Edit:
You can also:
df.insert(0,'name_of_column',value)
df.insert(loc, column_name, value)
This works if there is no other column with the same name. If a column with the name you provide already exists in the DataFrame, it raises a ValueError.
You can pass the optional parameter allow_duplicates=True to create a new column with an already existing column name.
Here is an example:
>>> df = pd.DataFrame({'b': [1, 2], 'c': [3,4]})
>>> df
b c
0 1 3
1 2 4
>>> df.insert(0, 'a', -1)
>>> df
a b c
0 -1 1 3
1 -1 2 4
>>> df.insert(0, 'a', -2)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python39\lib\site-packages\pandas\core\frame.py", line 3760, in insert
self._mgr.insert(loc, column, value, allow_duplicates=allow_duplicates)
File "C:\Python39\lib\site-packages\pandas\core\internals\managers.py", line 1191, in insert
raise ValueError(f"cannot insert {item}, already exists")
ValueError: cannot insert a, already exists
>>> df.insert(0, 'a', -2, allow_duplicates = True)
>>> df
a a b c
0 -2 -1 1 3
1 -2 -1 2 4
You could try to extract columns as list, massage this as you want, and reindex your dataframe:
>>> cols = df.columns.tolist()
>>> cols = [cols[-1]]+cols[:-1] # or whatever change you need
>>> df.reindex(columns=cols)
n l v
0 0 a 1
1 0 b 2
2 0 c 1
3 0 d 2
EDIT: this can be done in one line, although it looks a bit ugly. Maybe a cleaner proposal will come along...
>>> df.reindex(columns=['n']+df.columns[:-1].tolist())
n l v
0 0 a 1
1 0 b 2
2 0 c 1
3 0 d 2
Here is a very simple answer to this (only one line).
You can do it after you have added the 'n' column to your df, as follows.
import pandas as pd
df = pd.DataFrame({'l':['a','b','c','d'], 'v':[1,2,1,2]})
df['n'] = 0
df
l v n
0 a 1 0
1 b 2 0
2 c 1 0
3 d 2 0
# here you can add the below code and it should work.
df = df[list('nlv')]
df
n l v
0 0 a 1
1 0 b 2
2 0 c 1
3 0 d 2
Note that list('nlv') only works because every column name is a single character. If your column names are longer words, pass the names as a list (or tuple) inside the indexing brackets:
import pandas as pd
df = pd.DataFrame({'Upper':['a','b','c','d'], 'Lower':[1,2,1,2]})
df['Net'] = 0
df['Mid'] = 2
df['Zsore'] = 2
df
Upper Lower Net Mid Zsore
0 a 1 0 2 2
1 b 2 0 2 2
2 c 1 0 2 2
3 d 2 0 2 2
# here you can add below line and it should work
df = df[list(('Mid','Upper', 'Lower', 'Net','Zsore'))]
df
Mid Upper Lower Net Zsore
0 2 a 1 0 2
1 2 b 2 0 2
2 2 c 1 0 2
3 2 d 2 0 2
A general 4-line routine
You can have the following 4-line routine whenever you want to create a new column and insert into a specific location loc.
df['new_column'] = ... #new column's definition
col = df.columns.tolist()
col.insert(loc, col.pop()) #loc is the column's index you want to insert into
df = df[col]
In your example, it is simple:
df['n'] = 0
col = df.columns.tolist()
col.insert(0, col.pop())
df = df[col]

Assign to selection in pandas

I have a pandas dataframe and I want to create a new column, that is computed differently for different groups of rows. Here is a quick example:
import pandas as pd
data = {'foo': list('aaade'), 'bar': range(5)}
df = pd.DataFrame(data)
The dataframe looks like this:
bar foo
0 0 a
1 1 a
2 2 a
3 3 d
4 4 e
Now I am adding a new column and trying to assign some values to selected rows:
df['xyz'] = 0
df.loc[(df['foo'] == 'a'), 'xyz'] = df.loc[(df['foo'] == 'a')].apply(lambda x: x['bar'] * 2, axis=1)
The dataframe has not changed. What I would expect is the dataframe to look like this:
bar foo xyz
0 0 a 0
1 1 a 2
2 2 a 4
3 3 d 0
4 4 e 0
In my real-world problem, the 'xyz' column is also computed for the other rows, but using a different function. In fact, I am also using different columns for the computation. So my questions:
Why does the assignment in the above example not work?
Is it necessary to do df.loc[(df['foo'] == 'a')] twice (as I am doing it now)?
You're changing a copy of df (a boolean mask of the DataFrame is a copy, see docs).
Another way to achieve the desired result is as follows:
In [11]: df.apply(lambda row: (row['bar']*2 if row['foo'] == 'a' else row['xyz']), axis=1)
Out[11]:
0 0
1 2
2 4
3 0
4 0
dtype: int64
In [12]: df['xyz'] = df.apply(lambda row: (row['bar']*2 if row['foo'] == 'a' else row['xyz']), axis=1)
In [13]: df
Out[13]:
bar foo xyz
0 0 a 0
1 1 a 2
2 2 a 4
3 3 d 0
4 4 e 0
Perhaps a neater way is just to multiply by the boolean mask:
In [21]: 2 * df.bar * (df.foo == 'a')
Out[21]:
0 0
1 2
2 4
3 0
4 0
dtype: int64
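On the second question (repeating the selection): as far as I can tell, with a reasonably recent pandas a single mask can be stored once and used on both sides of a .loc assignment, and the right-hand side can be vectorised instead of using apply. A minimal sketch on the question's data, where mask is just an illustrative name:

```python
import pandas as pd

df = pd.DataFrame({"foo": list("aaade"), "bar": range(5)})
df["xyz"] = 0

# Build the selection once and reuse it
mask = df["foo"] == "a"

# Label-based assignment; the right-hand side is aligned on the index
df.loc[mask, "xyz"] = df.loc[mask, "bar"] * 2
print(df)
```

Other groups of rows can be handled the same way with their own mask and their own expression.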
