I want to match two dataframe columns in Python

I have two data frames, df1 (35k records) and df2 (100k records). In df1['col1'] and df2['col3'] I have unique IDs. I want to match df1['col1'] with df2['col3']. If they match, I want to add one more column to df1, say df1['Match'], with the value True, and if they don't match, with the value False. I want to map these True and False values against the matching and non-matching records only.
I am using the .isin() function; I am getting the correct match and non-match counts but am not able to map them correctly.
Match = df1['col1'].isin(df2['col3'])
df1['match'] = Match
I have also tried the merge function, passing the parameter how='right', but did not get the desired results.

You can simply do as follows:
df1['Match'] = df1['col1'].isin(df2['col3'])
For instance:
import pandas as pd
data1 = [1, 2, 3, 4, 5]
data2 = [2, 3, 5]
df1 = pd.DataFrame(data1, columns=['a'])
df2 = pd.DataFrame(data2, columns=['c'])
print(df1)
print(df2)
df1['Match'] = df1['a'].isin(df2['c'])  # True if the value matches, else False
print(df1)
Output:
   a
0  1
1  2
2  3
3  4
4  5
   c
0  2
1  3
2  5
   a  Match
0  1  False
1  2   True
2  3   True
3  4  False
4  5   True
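If you want literal labels such as 'Matching'/'Non-matching' instead of booleans, as the question suggests, a minimal sketch on the same toy frames (the label strings are just an assumption) uses numpy.where:
import numpy as np
# replace the boolean mask with string labels (hypothetical label names)
df1['Match'] = np.where(df1['a'].isin(df2['c']), 'Matching', 'Non-matching')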

Use df.loc indexing:
df1['Match'] = False
df1.loc[df1['col1'].isin(df2['col3']), 'Match'] = True
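Since the question also mentions trying merge, here is a hedged sketch of the same flag computed via a left merge with indicator=True (how='left' keeps every df1 row, unlike the how='right' attempt):
# deduplicate df2's ids so the merge cannot multiply df1 rows
merged = df1.merge(df2[['col3']].drop_duplicates(),
                   left_on='col1', right_on='col3',
                   how='left', indicator=True)
# '_merge' is 'both' exactly where col1 found a match in col3
df1['Match'] = merged['_merge'].eq('both').to_numpy()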

Related

Get a count of occurrence of string in each row and column of pandas dataframe

import pandas as pd
# list of paragraphs from judicial opinions
# rows are opinions
# columns are paragraphs from the opinion
opinion1 = ['sentenced to life','sentenced to death. The sentence ...','', 'sentencing Appellant for a term of life imprisonment']
opinion2 = ['Justice Smith','This concerns a sentencing hearing.', 'The third sentence read ...', 'Defendant rested.']
opinion3 = ['sentence sentencing sentenced','New matters ...', 'The clear weight of the evidence', 'A death sentence']
data = [opinion1, opinion2, opinion3]
df = pd.DataFrame(data, columns = ['p1','p2','p3','p4'])
# This works for one column. I have 300+ in the real data set.
df['p2'].str.contains('sentenc')
How do I determine whether 'sentenc' is in columns 'p1' through 'p4'?
Desired output would be something like:
True True False True
False True True False
True False False True
How do I retrieve a count of the number of times that 'sentenc' appears in each cell?
Desired output would be a count for each cell of the number of times 'sentenc' appears:
1 2 0 1
0 1 1 0
3 0 0 1
Thank you!
Use pd.Series.str.count:
counts = df.apply(lambda col: col.str.count('sentenc'))
Output:
>>> counts
   p1  p2  p3  p4
0   1   2   0   1
1   0   1   1   0
2   3   0   0   1
To get it in boolean form, use .str.contains instead, or call .astype(bool) on the result of the code above:
bools = df.apply(lambda col: col.str.contains('sentenc'))
or
bools = df.apply(lambda col: col.str.count('sentenc')).astype(bool)
Both will work just fine.
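As a small follow-on (a sketch using the counts and bools frames built above), the usual axis reductions give per-opinion summaries:
# total occurrences of 'sentenc' in each opinion (row): 4, 2, 4 for this data
row_totals = counts.sum(axis=1)
# whether 'sentenc' appears anywhere in the row
row_any = bools.any(axis=1)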

Pandas check if a value exists using multiple conditions within group and count value if true

I've created a boolean column based on criteria I've identified. I'd like to take it a step further by counting the True values per group.
I have
group = df.groupby('id')
df.loc[:,'Match'] = (group['flag'].transform(lambda x: x.eq(0).any()))&(group['flag'].transform(lambda x: x.eq(1).any()))
Which gives me True and False values. How can I then count the # of True values that are populated per id?
Sample data:
id   flag  Match  Count Match
123  0     True   3
123  1     True   3
123  1     True   3
567  0     False  0
567  0     False  0
The Match column is created above, then I'd like to create the Count Match column.
Is it:
df['Count Match'] = df['Match'].astype(int).groupby(df['id']).transform('sum')
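A quick check of that one-liner on the sample data (a minimal reconstruction, assuming the Match column has already been computed as above) confirms the counts:
import pandas as pd
df = pd.DataFrame({'id':    [123, 123, 123, 567, 567],
                   'flag':  [0, 1, 1, 0, 0],
                   'Match': [True, True, True, False, False]})
df['Count Match'] = df['Match'].astype(int).groupby(df['id']).transform('sum')
# id 123 -> 3, id 567 -> 0, matching the desired output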

Compare columns in a dictionary of dataframes

I have a dictionary of dataframes (Di_1). Each dataframe has the same number of columns, column names, number of rows and row indexes. I also have a list of the names of the dataframes (dfs). I would like to compare the contents of one of the columns (A) in each dataframe with those of the last dataframe in the list to see whether they are the same. For example:
df_A = pd.DataFrame({'A': [1,0,1,0]})
df_B = pd.DataFrame({'A': [1,1,0,0]})
Di_1 = {'X': df_A, 'Y': df_B}
dfs = ['X','Y']
I tried:
for df in dfs:
    Di_1[str(df)]['True'] = Di_1[str(df)]['A'].equals(Di_1[str(dfs[-1])]['A'])
I got:
[0,0,0,0]
I would like to get:
[1,0,0,1]
My attempt checks whether the whole column is the same, but I would instead like it to go through each dataframe row by row.
I think you are making things too complicated here. You can simply write:
series_last = Di_1[dfs[-1]]['A']
for df in map(Di_1.get, dfs):
    df['True'] = df['A'] == series_last
This will produce the following result:
>>> df_A
   A   True
0  1   True
1  0  False
2  1  False
3  0   True
>>> df_B
   A  True
0  1  True
1  1  True
2  0  True
3  0  True
So each df_i has an extra column named 'True' (perhaps you had better use a different name) that checks whether, for a specific row, the value is the same as the corresponding one in series_last.
In case dfs contains something other than strings, we can first convert these to strings:
series_last = Di_1[str(dfs[-1])]['A']
for df in map(Di_1.get, map(str, dfs)):
    df['True'] = df['A'] == series_last
Create a list:
l = [Di_1[i] for i in dfs]
Then, using isin(), you can compare the first and last df:
l[0].isin(l[-1]).astype(int)
   A
0  1
1  0
2  0
3  1
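Note that this works row by row because, when isin is given a DataFrame, pandas compares element-wise wherever the index and column labels match; on these frames it is effectively the same as:
# element-wise equality on the aligned 'A' columns
(l[0]['A'] == l[-1]['A']).astype(int)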

Pandas empty dataframe resulting from an isin function that keeps objects with an ID if the ID is present in a DataFrame of just IDs

I've got 2 data frames: one with 1 column, currentWorkspaceGuid (workspacesDF), and another with 4 columns, currentWorkspaceGuid, modelGuid, memoryUsage, lastModified (extrasDF). I'm trying to get isin to produce a dataframe that shows the values from the second dataframe only if the workspaceGuid exists in workspacesDF. It gives me an empty dataframe when I use the following code:
import pandas as pd
extrasDF = pd.read_csv("~/downloads/Extras.csv")
workspacesDF = pd.read_csv("~/downloads/workspaces.csv")
not_in_workspaces = extrasDF[(extrasDF.currentWorkspaceGuid.isin(workspacesDF))]
print(not_in_workspaces)
I tried adding print statements to verify that the column matches when it should and doesn't when it shouldn't, but it still returns nothing.
Once I can get this to work correctly, my end goal is to return a list of the items that don't exist in workspacesDF, which I think I can do just by adding ~ to the front of the isin statement, which is why I'm not doing a join or merge.
EDIT:
Adding example data from both files for clarification:
from workspaces.csv:
currentWorkspaceGuid
8a81b09c56cdf89c0157345759d75644
8a81948240d60b1901417a266a536462
402882f738cf7433013b612dc5f60bbd
8a8194884c860a53014ca1f6596d54e9
8a8194884a34d3ff014a4f31bea3705a
from Extras.csv:
currentWorkspaceGuid,modelGuid,memoryUsage,lastModified
8a81b09c56cdf89c0157345759d75644,635D5FAAC46D4856AAFD21AC6386DDCA,1191785,"2018-08-08 17:57:45"
8a81948240d60b1901417a266a536462,4076B1A8B1E34D549FFFE9F5FFE4538A,5400000,"2016-09-13 18:32:50"
402882f738cf7433013b612dc5f60bbd,4CA3CDC12CD349ABA8658365480073CA,550000,"2017-11-23 16:26:10"
8a8194884c860a53014ca1f6596d54e9,15E3E6B6087A4CA6838616A418E9657A,830000,"2018-05-22 17:35:50"
8a8194884a34d3ff014a4f31bea3705a,C47D186A479140BFAB24AF8D24E8B2BA,816686,"2018-07-31 09:39:16"
I think you need to compare the columns (Series):
mask = extrasDF['currentWorkspaceGuid'].isin(workspacesDF['currentWorkspaceGuid'])
in_workspaces = extrasDF[mask]
print(in_workspaces)
               currentWorkspaceGuid                         modelGuid  \
0  8a81b09c56cdf89c0157345759d75644  635D5FAAC46D4856AAFD21AC6386DDCA
1  8a81948240d60b1901417a266a536462  4076B1A8B1E34D549FFFE9F5FFE4538A
2  402882f738cf7433013b612dc5f60bbd  4CA3CDC12CD349ABA8658365480073CA
3  8a8194884c860a53014ca1f6596d54e9  15E3E6B6087A4CA6838616A418E9657A
4  8a8194884a34d3ff014a4f31bea3705a  C47D186A479140BFAB24AF8D24E8B2BA

   memoryUsage         lastModified
0      1191785  2018-08-08 17:57:45
1      5400000  2016-09-13 18:32:50
2       550000  2017-11-23 16:26:10
3       830000  2018-05-22 17:35:50
4       816686  2018-07-31 09:39:16
To filter the non-matched values, add ~ to invert the boolean mask:
not_in_workspaces = extrasDF[~mask]
print(not_in_workspaces)
Empty DataFrame
Columns: [currentWorkspaceGuid, modelGuid, memoryUsage, lastModified]
Index: []
Details:
print(mask)
0    True
1    True
2    True
3    True
4    True
Name: currentWorkspaceGuid, dtype: bool
print(~mask)
0    False
1    False
2    False
3    False
4    False
Name: currentWorkspaceGuid, dtype: bool
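If the end goal really is the rows that are not in workspacesDF, an equivalent sketch using a left merge with indicator=True (an anti-join; an alternative to the mask above, not part of the answer) would be:
# rows of extrasDF whose GUID never appears in workspacesDF
merged = extrasDF.merge(workspacesDF, on='currentWorkspaceGuid',
                        how='left', indicator=True)
not_in_workspaces = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')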

Python Pandas mark all but one specific duplicate row

I have a Pandas dataframe that has already been reduced to duplicates only and sorted.
Duplicates are identified by the column "HASH" and then sorted by "HASH" and "SIZE":
df_out['is_duplicated'] = df.duplicated(['HASH'], keep=False)  # keep=False: mark all duplicates as True
df_out = df_out.loc[df_out['is_duplicated']]  # keep only duplicate records (.loc instead of the deprecated .ix)
df_out = df_out.sort_values(['HASH', 'SIZE'], ascending=[True, False])  # sort by "HASH", then by "SIZE"
Result:
HASH  SIZE  is_duplicated
1     5     True
1     3     True
1     2     True
9     7     True
9     5     True
I would like to add 2 more columns.
The first column would identify rows with the same "HASH" by an ID: the first set of rows with the same "HASH" would be 1, the next set would be 2, etc.
The second column would mark the single row in each group that has the largest "SIZE":
HASH  SIZE  ID  KEEP
1     5     1   True
1     3     1   False
1     2     1   False
9     7     2   True
9     5     2   False
Perhaps use dicts and list comprehensions:
import pandas as pd
df = pd.DataFrame([[1, 1, 1, 9, 9], [5, 3, 2, 7, 5]]).T
df.columns = ['HASH', 'SIZE']
# map each unique HASH to a 1-based group ID, in order of first appearance
hash_dict = dict(zip(df.HASH.unique(), range(1, df.HASH.nunique() + 1)))
df['ID'] = [hash_dict[k] for k in df.HASH]
# flag rows whose SIZE equals their group's maximum
max_dict = dict(df.groupby('HASH')['SIZE'].max())
df['KEEP'] = [b == max_dict[a] for a, b in zip(df.HASH, df.SIZE)]
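A more pandas-native sketch of the same two columns (an alternative phrasing, not the answer above) uses factorize and groupby().idxmax():
# 1-based group id per HASH, in order of first appearance
df['ID'] = pd.factorize(df['HASH'])[0] + 1
# mark exactly one max-SIZE row per HASH group, even in case of ties
df['KEEP'] = False
df.loc[df.groupby('HASH')['SIZE'].idxmax(), 'KEEP'] = True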
