I'm doing a cross-check between two dataframes to assign a value to a flag. If a specific key is present in both dataframes with a different value, the flag is set to "change" for that row. If the value is the same, the flag is set to "no change". However, if a specific key is present more than once in only one of the two dataframes, the flag value will be "add". Let me give an example to make it clearer:
df 1:

key    value    key value present in df 2
abcd   1        False
wxyz   5        True
df 2:

key    value    key value present in df 1
abcd   2        False
wxyz   5        True
Then the result will be for dataframe 1:
df 1:

key    value    key value present in df 2    xcheck_flag
abcd   1        False                        change
wxyz   5        True                         no change
To get this result I use the following logic:
def changeType(df1):
    def condition_check(row):
        if row['key value present in df 2'] == False:
            return 'change'
        else:
            return 'no change'
    df1['xcheck_flag'] = df1.apply(condition_check, axis=1)
Now this is rather straightforward, right? Well, I have a complication which I haven't been able to solve yet.
Imagine the following use case:
df 1:

key    value    key value present in df 2
abcd   1        False
wxyz   5        True
abcd   3        False
df 2:

key    value    key value present in df 1
abcd   2        False
wxyz   5        True
In this case, the key abcd appears twice in df 1 and only once in df 2. When this happens, I need to apply the following logic in the cross-dataframe check: the first time the key matches dataframe 2, set the flag to "change" as in the previous case; for every later match of the same key, set the flag to "additional change". It doesn't matter which row from df 1 gets "change" and which gets "additional change". The only condition is that, in such a case, exactly one key-value pair is assigned "change" and all the others are assigned "additional change".
This gives us:

df 1:

key    value    key value present in df 2    xcheck_flag
abcd   1        False                        change
wxyz   5        True                         no change
abcd   3        False                        additional change
I've been trying to adapt my initial function to include this behaviour, but without success.
Any hint would be greatly welcome!
I would probably do something like this:
import pandas as pd
df1 = pd.DataFrame({'key': ['abcd', 'wxyz', 'abcd'], 'value': [1, 5, 3]})
df2 = pd.DataFrame({'key': ['abcd', 'wxyz'], 'value': [2, 5]})
df1['key_duplicated'] = df1.duplicated('key', keep='first')
df3 = df1.join(df2.set_index(['key']), rsuffix='_2', on=['key'])
which gives you a dataframe which I think contains all the columns you need to calculate the flags you're interested in:
key value key_duplicated value_2
0 abcd 1 False 2
1 wxyz 5 False 5
2 abcd 3 True 2
Note: if a key is not present in df2, value_2 will be NaN.
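From here, one possible way to derive the flags (a sketch using numpy.select, assuming the df1 and df2 defined above) is:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'key': ['abcd', 'wxyz', 'abcd'], 'value': [1, 5, 3]})
df2 = pd.DataFrame({'key': ['abcd', 'wxyz'], 'value': [2, 5]})

df1['key_duplicated'] = df1.duplicated('key', keep='first')
df3 = df1.join(df2.set_index('key'), rsuffix='_2', on='key')

# Repeated keys (beyond the first occurrence) -> 'additional change';
# otherwise compare the two values to choose 'no change' or 'change'.
df3['xcheck_flag'] = np.select(
    [df3['key_duplicated'], df3['value'] == df3['value_2']],
    ['additional change', 'no change'],
    default='change',
)
print(df3['xcheck_flag'].tolist())  # ['change', 'no change', 'additional change']
```

np.select evaluates the conditions in order, so a duplicated key wins over the value comparison, which matches the "only one row gets change" requirement.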
One solution could be to use a dictionary that stores the number of occurrences of each key seen so far:

def check(key, value, df2):
    # NOTE: assumes the key exists in df2; otherwise the lookup below raises IndexError
    if seen[key] > 0:
        flag = 'additional change'
    elif value == df2.loc[df2['key'] == key, 'value'].iloc[0]:
        flag = 'no change'
    else:
        flag = 'change'
    seen[key] += 1
    return flag

seen = {k: 0 for k in df1['key'].tolist()}
df1['flag'] = df1.apply(lambda row: check(row['key'], row['value'], df2), axis=1)
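A fully vectorized alternative (my sketch, not part of the answer above): left-merge the two frames, number each repetition of a key with groupby().cumcount(), then pick the flag with numpy.select. This assumes the keys in df2 are unique, otherwise the merge multiplies rows.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'key': ['abcd', 'wxyz', 'abcd'], 'value': [1, 5, 3]})
df2 = pd.DataFrame({'key': ['abcd', 'wxyz'], 'value': [2, 5]})

merged = df1.merge(df2, on='key', how='left', suffixes=('', '_2'))
occurrence = merged.groupby('key').cumcount()  # 0 for the first occurrence of a key, 1, 2, ... after

merged['flag'] = np.select(
    [occurrence > 0, merged['value'] == merged['value_2']],
    ['additional change', 'no change'],
    default='change',
)
print(merged['flag'].tolist())  # ['change', 'no change', 'additional change']
```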
Related
I'd like to assign a new column to my DataFrame based on a condition: whether row.id is one of the bad_cat values.
bad_cat = [71,84]
df = pd.DataFrame({'name' : ['a','b','c','d','e'], 'id' : [1,2,71,5,84]})
df['type'] = df[df.id in bad_cat]
Output:
name id type
a 1 False
b 2 False
c 71 True
d 5 False
e 84 True
It seems my code doesn't work - could you explain how to do it.
The most intuitive answer is the one provided by Quang Hoang using the .isin method. This creates a mask resulting in a series of bools:
df['type'] = df['id'].isin(bad_cat)
The other approach is to use the index; this can be faster under some circumstances. After setting the index to the column that will be checked against the values in the list, you can use .loc for slicing and set type to True for values that match those in the list.
df.set_index('id', inplace=True)
df['type'] = False
df.loc[bad_cat, 'type'] = True  # df.loc[rows, col] avoids chained indexing (df['type'].loc[...])
For both solutions the output will be:
name type
id
1 a False
2 b False
71 c True
5 d False
84 e True
Note that the values in the column serving as the index do not have to be unique.
I have a dictionary of dataframes (Di_1). Each dataframe has the same number of columns, column names, number of rows and row indexes. I also have a list of the names of the dataframes (dfs). I would like to compare the contents of one of the columns (A) in each dataframe with those of the last dataframe in the list to see whether they are the same. For example:
df_A = pd.DataFrame({'A': [1,0,1,0]})
df_B = pd.DataFrame({'A': [1,1,0,0]})
Di_1 = {'X': df_A, 'Y': df_B}
dfs = ['X','Y']
I tried:
for df in dfs:
    Di_1[str(df)]['True'] = Di_1[str(df)]['A'].equals(Di_1[str(dfs[-1])]['A'])
I got:
[0,0,0,0]
I would like to get:
[1,0,0,1]
My attempt checks whether the whole column is the same; instead, I would like it to compare the dataframes row by row.
I think you are making things too complicated here. You can do:

series_last = Di_1[dfs[-1]]['A']
for df in map(Di_1.get, dfs):
    df['True'] = df['A'] == series_last
This will produce as result:
>>> df_A
A True
0 1 True
1 0 False
2 1 False
3 0 True
>>> df_B
A True
0 1 True
1 1 True
2 0 True
3 0 True
So each df_i gets an extra column named 'True' (you should probably use a different name) that checks, for each row, whether the value is the same as the one in series_last.
In case dfs contains something other than strings, we can first convert these to strings:

series_last = Di_1[str(dfs[-1])]['A']
for df in map(Di_1.get, map(str, dfs)):
    df['True'] = df['A'] == series_last
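Putting this together with the setup from the question, a runnable sketch (I use 'same' as the column name here to avoid the awkward 'True' label):

```python
import pandas as pd

df_A = pd.DataFrame({'A': [1, 0, 1, 0]})
df_B = pd.DataFrame({'A': [1, 1, 0, 0]})
Di_1 = {'X': df_A, 'Y': df_B}
dfs = ['X', 'Y']

# Compare each dataframe's column 'A' element-wise with the last dataframe's
series_last = Di_1[dfs[-1]]['A']
for df in map(Di_1.get, dfs):
    df['same'] = df['A'] == series_last

print(df_A['same'].astype(int).tolist())  # [1, 0, 0, 1]
```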
Create a list:
l=[Di_1[i] for i in dfs]
Then using isin() you can compare the first and last df
l[0].isin(l[-1]).astype(int)
A
0 1
1 0
2 0
3 1
This question already has answers here:
Pandas: compare list objects in Series
(5 answers)
Closed 3 years ago.
So I have a function that sets a value in a column of a dataframe based on whether or not some string in the dataframe contains values from a list.
I then want to get a count of how many rows in the dataframe have that value, but I am getting an error.
If certain conditions are met, the 'tag' column is set to a list, ['date','must','glucose']. Not all rows meet the condition for this to happen. I want to find the number of rows where it IS met, by analyzing the dataframe.
I have tried this:
df = data[data['tag'] == ['date','must','glucose']]
print(df)
...but that yields:
ValueError: Lengths must match to compare
I also tried this but that yields the same error:
df = data.tag == ['date','must','glucose']
If I was just comparing values, that would work, but having a list in the cell instead of a value is blowing it up. Like if the value was just 'four' and I was doing this, it wouldn't give me an error:
df = data[data.tag=='four']
Is there a way to accomplish this? Thank you!
You can use the apply function for this:
df = df[df['tag'].apply(lambda x : x == ['date','must','glucose'])]
You can also convert the lists into tuples and compare.
source: Pandas: compare list objects in Series
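A sketch of the tuple approach (the data frame here is hypothetical): tuples are hashable, so after converting each cell's list you can test membership with isin.

```python
import pandas as pd

# Hypothetical data: each cell of 'tag' holds a list
data = pd.DataFrame({'tag': [['date', 'must', 'glucose'],
                             ['other'],
                             ['date', 'must', 'glucose']]})

# Convert each list to a (hashable) tuple, then test membership with isin
mask = data['tag'].apply(tuple).isin([('date', 'must', 'glucose')])
print(mask.sum())  # number of rows whose tag equals the list -> 2
```

Note: comparing the tuple Series directly with `== ('date','must','glucose')` is risky, because pandas may treat a tuple of matching length as an array-like and compare element-wise; isin avoids that.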
EDITING ANSWER
You need to use isin() to accomplish that. Consider:
>>> data = pd.DataFrame({'sample col1': [1,2,3,4,5], 'sample col2': ['a','b','c','d','e'], 'tag': ['some text', 'some value','date','must','glucose']})
>>> data
sample col1 sample col2 tag
0 1 a some text
1 2 b some value
2 3 c date
3 4 d must
4 5 e glucose
>>> df = data[~data['tag'].isin(['date','must','glucose'])]
>>> df
sample col1 sample col2 tag
0 1 a some text
1 2 b some value
On your case:
>>> df.reset_index(inplace = True, drop =True)
>>> df['map'] = 'True'
>>> df
sample col1 sample col2 tag map
0 1 a some text True
1 2 b some value True
>>> map_dict = dict(zip(df['tag'], df['map']))
>>> data['Not in your list?'] = data['tag'].map(map_dict).fillna(value = 'False')
>>> data
sample col1 sample col2 tag Not in your list?
0 1 a some text True
1 2 b some value True
2 3 c date False
3 4 d must False
4 5 e glucose False
Hope this helps :D
I have two data frames: df1 (35k records) and df2 (100k records). df1['col1'] and df2['col3'] contain unique ids. I want to match df1['col1'] with df2['col3']: if they match, add a column df1['Match'] with the value True for that row, and False otherwise.
I am using the .isin() function. I get the correct match and non-match counts, but I am not able to map them correctly.
Match = df1['col1'].isin(df2['col3'])
df1['match'] = Match
I have also tried the merge function, passing the parameter how='right', but did not get the results.
You can simply do as follows:
df1['Match'] = df1['col1'].isin(df2['col3'])
For instance:
import pandas as pd
data1 = [1,2,3,4,5]
data2 = [2,3,5]
df1 = pd.DataFrame(data1, columns=['a'])
df2 = pd.DataFrame(data2,columns=['c'])
print (df1)
print (df2)
df1['Match'] = df1['a'].isin(df2['c']) # if matches it returns True else False
print (df1)
Output:
a
0 1
1 2
2 3
3 4
4 5
c
0 2
1 3
2 5
a Match
0 1 False
1 2 True
2 3 True
3 4 False
4 5 True
Use df.loc indexing:
df1['Match'] = False
df1.loc[df1['col1'].isin(df2['col3']), 'Match'] = True
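If you do want the merge route mentioned in the question, a sketch using indicator=True (column names taken from the question; this assumes df2['col3'] values are unique, as stated):

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'col3': [2, 3, 5]})

# A left merge with indicator=True marks each df1 row as 'both' or 'left_only'
merged = df1.merge(df2, left_on='col1', right_on='col3', how='left', indicator=True)
df1['Match'] = merged['_merge'].eq('both').to_numpy()
print(df1['Match'].tolist())  # [False, True, True, False, True]
```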
I'm trying to add a column to my dataframe (DF) based on another column's value and whether that value appears elsewhere in my DF or not.
Example:
>>> d = { 'one' : pd.Series(['aa', 'bb', 'cc', 'aa-01', 'bb-02', 'dd']) }
>>> df = pd.DataFrame(d)
>>> df
one
0 aa
1 bb
2 cc
3 aa-01
4 bb-02
5 dd
I would like to add a column that is True when another element exists equal to the current element with -01 or -02 appended.
Example: in this dataframe only the elements 'aa' and 'bb' have counterparts with the appended value ('aa-01' and 'bb-02'), so only 'aa' and 'bb' will have the value True in the new column.
Expected result:
>>> expected_df
one two
0 aa True
1 bb True
2 cc False
3 aa-01 False
4 bb-02 False
5 dd False
I believe I have to use isin() with apply(), but I can't figure out a way to modify the row and use isin at the same time within the function passed as argument to apply.
Use str.endswith to build a boolean mask of the strings ending with the given suffixes. Then strip the last three characters from the masked values and feed the result to isin:
mask = df['one'].str.endswith(('-01','-02'))
df['two'] = df['one'].isin(df[mask].squeeze().str[:-3])
df
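An alternative sketch (mine, not part of the answer above) that avoids squeeze(), so it keeps working if the frame gains more columns: extract the base name with a regex and test membership against that set.

```python
import pandas as pd

df = pd.DataFrame({'one': ['aa', 'bb', 'cc', 'aa-01', 'bb-02', 'dd']})

# Base names that occur somewhere in the column with a '-01' or '-02' suffix
bases = set(df['one'].str.extract(r'^(.*)-0[12]$', expand=False).dropna())
df['two'] = df['one'].isin(bases)
print(df['two'].tolist())  # [True, True, False, False, False, False]
```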