Assign new DataFrame column based on list condition - python

I'd like to assign a new column to my DataFrame based on a condition: whether row.id is one of the bad_cat values.
bad_cat = [71,84]
df = pd.DataFrame({'name' : ['a','b','c','d','e'], 'id' : [1,2,71,5,84]})
df['type'] = df[df.id in bad_cat]
Output:
name  id  type
a     1   False
b     2   False
c     71  True
d     5   False
e     84  True
It seems my code doesn't work; could you explain how to do it?

The most intuitive answer would be the one provided by Quang Hoang using the .isin method. This creates a mask, a series of booleans:
df['type'] = df['id'].isin(bad_cat)
The other approach is to use the index; this can be faster under some circumstances. After setting the index to the column that will be checked against the values in the list, you can use .loc to slice and set type to True for values that match the list.
df.set_index('id', inplace=True)
df['type'] = False
df.loc[bad_cat, 'type'] = True  # single .loc call avoids chained assignment
For both solutions the output will be:
    name  type
id
1   a     False
2   b     False
71  c     True
5   d     False
84  e     True
Note that the values in the column that serves as the index do not have to be unique.
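Putting the .isin answer together as a self-contained, runnable snippet with the question's data:

```python
import pandas as pd

bad_cat = [71, 84]
df = pd.DataFrame({'name': list('abcde'), 'id': [1, 2, 71, 5, 84]})

# boolean mask: True where id appears in bad_cat
df['type'] = df['id'].isin(bad_cat)
```

This produces True only for the rows with id 71 and 84, matching the output the question asks for.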

Related

Iterate on rows of dataframe with conditional flag value in python

I'm doing a cross-check between 2 dataframes to assign a value to a flag. If a specific key is present in both dataframes with a different value, the flag will be set to "change" for that row. If the value is the same, the flag will be set to "no change". However if a specific key is present more than once in only one of the 2 dataframes, then the value of the flag will be "add". Let me give an example to make it clearer:
df 1:
key   value  key value present in df 2
abcd  1      False
wxyz  5      True
df 2:
key   value  key value present in df 1
abcd  2      False
wxyz  5      True
Then the result for dataframe 1 will be:
df 1:
key   value  key value present in df 2  xcheck_flag
abcd  1      False                      change
wxyz  5      True                       no change
To get this result I use the following logic:
def changeType(df1):
    def condition_check(row):
        if row['key value present in df 2'] == False:
            return 'change'
        else:
            return 'no change'
    df1['xcheck_flag'] = df1.apply(condition_check, axis=1)
Now this is rather straightforward, right? Well I have a complication which I haven't been able to solve, yet.
Imagine the following use case:
df 1:
key   value  key value present in df 2
abcd  1      False
wxyz  5      True
abcd  3      False
df 2:
key   value  key value present in df 1
abcd  2      False
wxyz  5      True
In this case, the key abcd appears twice in df 1 and only once in df 2. When this happens, I need to apply the following logic during the cross-dataframe check: the first time the key is matched against dataframe 2, set the flag to "change" as in the previous case; each subsequent time it is matched, set the flag to "additional change". It doesn't matter which row from df 1 gets "change" and which gets "additional change"; the only condition is that exactly one key-value pair is assigned "change" and all the others are assigned "additional change".
This gives us:
df 1:
key   value  key value present in df 2  xcheck_flag
abcd  1      False                      change
wxyz  5      True                       no change
abcd  3      True                       additional change
I've been trying to adapt my initial function to include this behaviour but without success.
If you have any hint, it would be greatly welcomed!
I would probably do something like this:
import pandas as pd
df1 = pd.DataFrame({'key': ['abcd', 'wxyz', 'abcd'], 'value': [1, 5, 3]})
df2 = pd.DataFrame({'key': ['abcd', 'wxyz'], 'value': [2, 5]})
df1['key_duplicated'] = df1.duplicated('key', keep='first')
df3 = df1.join(df2.set_index(['key']), rsuffix='_2', on=['key'])
which gives you a dataframe which I think contains all the columns you need to calculate the flags you're interested in:
key value key_duplicated value_2
0 abcd 1 False 2
1 wxyz 5 False 5
2 abcd 3 True 2
Note that if the key is not present in df2, value_2 will be NaN.
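Building on those columns, the flags could be derived in a single vectorized step. This is only a sketch, and it assumes each key appears at most once in df2 (otherwise the merge would duplicate rows):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'key': ['abcd', 'wxyz', 'abcd'], 'value': [1, 5, 3]})
df2 = pd.DataFrame({'key': ['abcd', 'wxyz'], 'value': [2, 5]})

# left-join df2's value alongside df1's
merged = df1.merge(df2, on='key', how='left', suffixes=('', '_2'))
# mark repeat occurrences of a key within df1
dup = merged.duplicated('key', keep='first')
# first occurrence: compare values; repeats: always "additional change"
merged['xcheck_flag'] = np.select(
    [dup, merged['value'] == merged['value_2']],
    ['additional change', 'no change'],
    default='change',
)
```

np.select checks the conditions in order, so a duplicated key always wins over the value comparison.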
One solution could be using a dictionary to store the number of occurrences of each key:
def check(key, value, df2):
    # assumes every key in df1 is also present in df2
    if seen[key] > 0:
        flag = 'additional change'
    else:
        if value == df2[df2['key'] == key]['value'].tolist()[0]:
            flag = 'no change'
        else:
            flag = 'change'
    seen[key] += 1
    return flag

seen = {k: 0 for k in df1['key'].tolist()}
df1['flag'] = df1.apply(lambda row: check(row['key'], row['value'], df2), axis=1)

Pandas - iloc - comparing value to the cell below

For a table with a city column, I would like to achieve the desired_output column: TRUE when the value below the current cell is different, otherwise FALSE.
I have tried the following code, but an error occurs:
df['desired_output']=df.two.apply(lambda x: True if df.iloc[int(x),1]==df.iloc[int(x+1),1] else False)
df['desired_output'] = df['city'].shift().bfill() != df['city']
Compare using Series.ne against the Series.shift-ed values, with the first (missing) shifted value replaced by the original value:
df = pd.DataFrame({'city':list('mmmssb')})
df['out'] = df['city'].ne(df['city'].shift(fill_value=df['city'].iat[0]))
print (df)
city out
0 m False
1 m False
2 m False
3 s True
4 s False
5 b True
For older pandas versions (without the fill_value parameter), if there are no missing values in the city column, replace the first shifted NaN using Series.fillna:
df['out'] = df['city'].ne(df['city'].shift().fillna(df['city']))
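The question literally asks about the cell below the current one; if that is what is wanted, shift in the opposite direction. A sketch using shift(-1), where the last row, having no row below, is treated as "no change":

```python
import pandas as pd

df = pd.DataFrame({'city': list('mmmssb')})
# shift(-1) pulls the next row's value up; fill the last row with
# its own value so it compares equal (False / "no change")
df['out'] = df['city'].ne(df['city'].shift(-1, fill_value=df['city'].iat[-1]))
```

Here 'out' is True on the last 'm' and the last 's', the rows whose following value differs.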

Compare columns in a dictionary of dataframes

I have a dictionary of dataframes (Di_1). Each dataframe has the same number of columns, column names, number of rows and row indexes. I also have a list of the names of the dataframes (dfs). I would like to compare the contents of one of the columns (A) in each dataframe with those of the last dataframe in the list to see whether they are the same. For example:
df_A = pd.DataFrame({'A': [1,0,1,0]})
df_B = pd.DataFrame({'A': [1,1,0,0]})
Di_1 = {'X': df_A, 'Y': df_B}
dfs = ['X','Y']
I tried:
for df in dfs:
    Di_1[str(df)]['True'] = Di_1[str(df)]['A'].equals(Di_1[str(dfs[-1])]['A'])
I got:
[0,0,0,0]
I would like to get:
[1,0,0,1]
My attempt checks whether the whole column is the same, but I would instead like it to go through each dataframe row by row.
I think you are making things too complicated here. You can simply write:
series_last = Di_1[dfs[-1]]['A']
for df in map(Di_1.get, dfs):
    df['True'] = df['A'] == series_last
This will produce as result:
>>> df_A
A True
0 1 True
1 0 False
2 1 False
3 0 True
>>> df_B
A True
0 1 True
1 1 True
2 0 True
3 0 True
So each df_i has an extra column named 'True' (perhaps you had better use a different name) that checks, for each row, whether the value is the same as the one in series_last.
In case dfs contains something other than strings, we can first convert these to strings:
series_last = Di_1[str(dfs[-1])]['A']
for df in map(Di_1.get, map(str, dfs)):
    df['True'] = df['A'] == series_last
Create a list:
l=[Di_1[i] for i in dfs]
Then, using isin(), you can compare the first and last df:
l[0].isin(l[-1]).astype(int)
A
0 1
1 0
2 0
3 1
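For reference, the first answer assembled into a self-contained snippet, using 'match' as the new column name rather than 'True' as the answer suggests:

```python
import pandas as pd

df_A = pd.DataFrame({'A': [1, 0, 1, 0]})
df_B = pd.DataFrame({'A': [1, 1, 0, 0]})
Di_1 = {'X': df_A, 'Y': df_B}
dfs = ['X', 'Y']

# compare every dataframe's column 'A' with the last dataframe's, row by row
series_last = Di_1[dfs[-1]]['A']
for df in map(Di_1.get, dfs):
    df['match'] = df['A'] == series_last
```

The last dataframe is compared against itself, so its 'match' column is all True.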

Pandas: Trying to edit data in a row for a list of dataframes

I have a list of 3 DataFrames x, where each DataFrame has 3 columns. It looks like
1 2 T/F
4 7 False
4 11 True
4 20 False
4 25 True
4 40 False
What I want to do is set the value of each row in column 'T/F' to False for each DataFrame in list x
I attempted to do this with the following code
rang = list(range(len(x)))  # rang = [0, 1, 2]
for i in rang:
    x[i].iloc[:len(x), 'T/F'] = False
The code compiled, but it didn't appear to work.
Much simpler: just iterate over the actual dataframes in the list and update the column:
for df in x:
    df['T/F'] = False
Also note that DataFrame.iloc is an integer-location based indexer. If you want to index using the column name, use .loc.
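A minimal runnable sketch of that loop, with hypothetical data shaped like the question's list x:

```python
import pandas as pd

# three DataFrames shaped like the question's example
x = [
    pd.DataFrame({'1': [4, 4, 4], '2': [7, 11, 20], 'T/F': [False, True, False]})
    for _ in range(3)
]

for df in x:
    df['T/F'] = False  # overwrite the whole column in place
```

Because the loop variable df is a reference to each DataFrame in the list, the assignment mutates the original objects; no index arithmetic is needed.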

lookup from multiple columns pandas

I have 2 dataframes, df1 and df2, as given below:
df1:
a
T11552
T11559
T11566
T11567
T11569
T11594
T11604
T11625
df2:
a b
T11552 T11555
T11560 T11559
T11566 T11562
T11568 T11565
T11569 T11560
T11590 T11594
T11604 T11610
T11621 T11625
T11633 T11631
T11635 T11634
T13149 T13140
I want a new dataframe df3 where each value of df1 is searched for in df2 (in either column). If the value is present in df2, I want a new column returning True/False, as shown below.
df3:
a v
T11552 TRUE
T11559 TRUE
T11566 TRUE
T11567 FALSE
T11569 TRUE
T11594 TRUE
T11604 TRUE
T11625 TRUE
T11633 TRUE
T11634 TRUE
Use assign to build the new DataFrame with isin, converting all of df2's values to a flattened array with ravel. To improve performance it is possible to check only the unique values, or to check with in1d:
import numpy as np

df3 = df1.assign(v = lambda x: x['a'].isin(np.unique(df2.values.ravel())))
# alternative solution
# df3 = df1.assign(v = lambda x: np.in1d(x['a'], np.unique(df2[['a','b']].values.ravel())))
# if you need to specify the columns of df2 to check
df3 = df1.assign(v = lambda x: x['a'].isin(np.unique(df2[['a','b']].values.ravel())))
print (df3)
a v
0 T11552 True
1 T11559 True
2 T11566 True
3 T11567 False
4 T11569 True
5 T11594 True
6 T11604 True
7 T11625 True
Try this:
df3 = df1[['a']].copy()
df3['v'] = df3['a'].isin(set(df2.values.ravel()))
The above code will:
Create a new dataframe using column 'a' from df1.
Create a Boolean column 'v' testing the existence of each value of column 'a' versus values in df2 via set and numpy.ravel.
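A trimmed, runnable version of that second answer, using a small subset of the data above:

```python
import pandas as pd

df1 = pd.DataFrame({'a': ['T11552', 'T11559', 'T11567']})
df2 = pd.DataFrame({'a': ['T11552', 'T11560'], 'b': ['T11555', 'T11559']})

df3 = df1[['a']].copy()
# flatten both columns of df2 into one set for fast membership tests
df3['v'] = df3['a'].isin(set(df2.values.ravel()))
```

T11552 appears in df2's column a and T11559 in column b, so both are True; T11567 appears in neither and is False.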
