Suppose I have two pandas DataFrames, namely df1 and df2:
df1 = {name : [tom, jerry, jennifer, hafiz, kitty]}
df2 = {name : [tom, jerry, alex, hafiz, samdin, unnar]}
From these two datasets, I want to generate
good_boy = [tom, jerry, hafiz] # present in both datasets
bad_boy = [jennifer, kitty] # present in df1 but not in df2
new_boy = [alex, samdin, unnar] # in df2 but not in df1
The actual dataset is very large, with millions of rows. I tried an iterative check, but it is very slow. Is there any trick (e.g. vectorized or parallel processing) already built into Pandas?
Please help me solve this problem; my concern is time. Thank you.
As said by @QuangHoang in the comments, the key here is merge. The indicator=True option adds an extra _merge column indicating whether each row is present in only one of the dataframes (and which one) or in both:
df1 = pd.DataFrame({'name' : ['tom', 'jerry', 'jennifer', 'hafiz', 'kitty']})
df2 = pd.DataFrame({'name' : ['tom', 'jerry', 'alex', 'hafiz', 'samdin', 'unnar']})
tmp = pd.merge(df1, df2, how='outer', on='name', indicator=True)
good_boy = tmp.loc[tmp['_merge']=='both', 'name'].to_list()
bad_boy = tmp.loc[tmp['_merge']=='left_only', 'name'].to_list()
new_boy = tmp.loc[tmp['_merge']=='right_only', 'name'].to_list()
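Since your concern is time and you only need the lists of names rather than whole rows, a set-based sketch may be even faster on millions of rows (this assumes the name columns shown above):

# Plain Python set algebra on the two name columns.
s1, s2 = set(df1['name']), set(df2['name'])
good_boy = list(s1 & s2)  # present in both datasets
bad_boy = list(s1 - s2)   # present in df1 but not in df2
new_boy = list(s2 - s1)   # present in df2 but not in df1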
You can use DataFrame.join:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html
good_boy = df1.join(df2.set_index('name'), on='name', how='inner')[['name']]
(Setting df2's name column as the index avoids the overlapping-column error that a plain df1.join(df2, on='name') would raise.)
bad_boy = df1[~df1['name'].isin(df2['name'])]
new_boy = df2[~df2['name'].isin(df1['name'])]
I have two dataframes. The 'names' column of df2 contains "sub-categorical" data that I want to match against the "categories" column of df1.
I listed the matching categories below in the matching_cat variable. I don't know if this is the right way to associate my values with a category.
I want to create a new dataframe which compares these two dataframes on a common id and on the columns categories and names.
import pandas as pd

data1 = {'key_col': ['12563', '12564', '12565', '12566'],
         'categories': ['bird', 'dog', 'insect', 'insect'],
         'column3': ['some', 'other', 'data', 'there']}
df1 = pd.DataFrame(data1)
df1

data2 = {'key_col': ['12563', '12564', '12565', '12566', '12567'],
         'names': ['falcon', 'golden retriever', 'doberman', 'spider', 'spider'],
         'column_randomn': ['some', 'data', 'here', 'here', 'here']}
df2 = pd.DataFrame(data2)
df2

matching_cat = {'bird': ['falcon', 'toucan', 'eagle'],
                'dog': ['golden retriever', 'doberman'],
                'insect': ['spider', 'mosquito']}
So, here are my two dataframes, and I want to be able to "map" the names to their categories and output which rows match.
OK, using your example, here is what I came up with:
import pandas as pd

data1 = {'key_col': ['12563', '12564', '12565', '12566'],
         'categories': ['bird', 'dog', 'insect', 'insect'],
         'column3': ['some', 'other', 'data', 'there']}
df1 = pd.DataFrame(data1)

data2 = {'key_col': ['12563', '12564', '12565', '12566', '12567'],
         'names': ['falcon', 'golden retriever', 'doberman', 'spider', 'spider'],
         'column_randomn': ['some', 'data', 'here', 'here', 'here']}
df2 = pd.DataFrame(data2)

matching_category = {'bird': ['falcon', 'toucan', 'eagle'],
                     'dog': ['golden retriever', 'doberman'],
                     'insect': ['spider', 'mosquito']}
# Function to compare each row against matching_category
def test(row):
    try:
        if row['names'] in matching_category[row['categories']]:
            val = True
        else:
            val = False
    except KeyError:  # e.g. categories is NaN after the outer merge
        val = False
    return val
# Merge the two dataframes based on 'key_col'
df3 = df1.merge(df2, how='outer', on='key_col')
# Call the test function on each row in the new dataframe (df3)
df3['new_col'] = df3.apply(test, axis=1)
# Drop unwanted columns
df3 = df3.drop(['column3', 'column_randomn'], axis=1)
# Create new dataframes based on whether output matches or not
output_matches = df3[df3['new_col']]
output_mismatches = df3[~df3['new_col']]
# Display the dataframes
print('OUTPUT MATCHES:')
print('================================================')
print(output_matches)
print("")
print('OUTPUT MIS-MATCHES:')
print('================================================')
print(output_mismatches)
OUTPUT:
OUTPUT MATCHES:
================================================
key_col categories names new_col
0 12563 bird falcon True
1 12564 dog golden retriever True
3 12566 insect spider True
OUTPUT MIS-MATCHES:
================================================
key_col categories names new_col
2 12565 insect doberman False
4 12567 NaN spider False
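If apply turns out to be slow on bigger data, a vectorized sketch (reusing the column names from this example) is to flatten matching_category into a lookup table and merge, so the membership test becomes a join:

# Build a long lookup table: one row per (category, name) pair.
pairs = pd.DataFrame([(cat, name)
                      for cat, names in matching_category.items()
                      for name in names],
                     columns=['categories', 'names'])
pairs['new_col'] = True
df3 = df1.merge(df2, how='outer', on='key_col')
df3 = df3.merge(pairs, how='left', on=['categories', 'names'])
df3['new_col'] = df3['new_col'].fillna(False)  # unmatched rows are mismatches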
I have a question related to Pandas.
In df1 I have the id of each seller and their respective names.
In df2 I have the id of the salesmen and their respective sales.
I would like to add two new columns to df2 with the first and last names of the salesmen.
PS: in df2, one of the sales is shared between two vendors.
import pandas as pd
vendors = {'first_name': ['Montgomery', 'Dagmar', 'Reeba', 'Shalom', 'Broddy', 'Aurelia'],
'last_name': ['Humes', 'Elstow', 'Wattisham', 'Alen', 'Keningham', 'Brechin'],
'id_vendor': [127, 241, 329, 333, 212, 233]}
sales = {'id_vendor': [['127'], ['241'], ['329', '333'], ['212'], ['233']],
'sales': [1233, 25000, 8555, 4333, 3222]}
df1 = pd.DataFrame(vendors)
df2 = pd.DataFrame(sales)
I attach the code. Any suggestions?
Thank you in advance.
You can merge df1 with df2 after exploding the id_vendor column, and use GroupBy.agg when grouping by sales to obtain the columns you want:
transform_names = lambda x: ', '.join(list(x))

res = (df1.merge(df2.explode('id_vendor'))
          .groupby('sales')
          .agg({'first_name': transform_names,
                'last_name': transform_names,
                'id_vendor': list}))
print(res)
first_name last_name id_vendor
sales
1233 Montgomery Humes [127]
3222 Aurelia Brechin [233]
4333 Broddy Keningham [212]
8555 Reeba, Shalom Wattisham, Alen [329, 333]
25000 Dagmar Elstow [241]
Note:
In your example, id_vendor in df2 is populated by lists of strings, but since id_vendor in df1 is of integer type, I assume that was a typo. If id_vendor really does contain lists of strings, you also need to convert the strings to integers:
transform_names = lambda x: ', '.join(list(x))

# Notice the .astype(int) call.
res = (df1.merge(df2.explode('id_vendor').astype(int))
          .groupby('sales')
          .agg({'first_name': transform_names,
                'last_name': transform_names,
                'id_vendor': list}))
print(res)
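As a side note, if you only want the vendors' names attached to each sale without grouping, a minimal sketch is a direct lookup (assuming id_vendor in df2 holds lists of integers):

# Map each vendor id to "First Last" through a Series indexed by id_vendor.
lookup = (df1.assign(full_name=df1['first_name'] + ' ' + df1['last_name'])
             .set_index('id_vendor')['full_name'])
df2['vendor_names'] = df2['id_vendor'].map(lambda ids: ', '.join(lookup[i] for i in ids))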
I have a Python project with:
df_testR with columns {'Name', 'City', 'Licence', 'Amount'}
df_testF with columns {'Name', 'City', 'Licence', 'Amount'}
I want to compare both dfs. The result should be a df where I see the Name, City, Licence and Amount. Normally, df_testR and df_testF should be exactly the same.
In case they are not, I want to see the difference as Amount_R vs Amount_F.
I referred to: Diff between two dataframes in pandas
But I receive a table with TRUE and FALSE only:
Name   City   Licence   Amount
True   True   True      False
But I'd like to get a table that lists ONLY the lines where differences occur and shows those differences, like this:
Name   City   Licence   Amount_R   Amount_F
Paul   NY     YES       200        500
Here, both tables contain Paul, NY and Licence = YES, but table R contains 200 as the amount and table F contains 500. I want to receive a table from my analysis that captures only the lines where such differences occur.
Could someone help?
import copy
import pandas as pd
data1 = {'Name': ['A', 'B', 'C'], 'City': ['SF', 'LA', 'NY'], 'Licence': ['YES', 'NO', 'NO'], 'Amount': [100, 200, 300]}
data2 = copy.deepcopy(data1)
data2.update({'Amount': [500, 200, 300]})
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df2.drop(1, inplace=True)
First find the missing rows and print them:
matching = df1.isin(df2)
meta_data_columns = ['Name', 'City', 'Licence']
metadata_match = matching[meta_data_columns].copy()
metadata_match['check'] = metadata_match.all(axis=1)
missing_rows = list(metadata_match.index[~metadata_match['check']])
if missing_rows:
    print('Some rows are missing from df2:')
    print(df1.iloc[missing_rows, :])
Then drop these rows and merge:
df3 = pd.merge(df2, df1.drop(missing_rows), on=meta_data_columns)
Now remove the rows that have the same amount:
df_different_amounts = df3.loc[df3['Amount_x'] != df3['Amount_y'], :]
I assumed the DFs are sorted.
If you're dealing with very large DFs it might be better to first filter the DFs to make the merge faster.
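For reference, the whole comparison can also be done in one pass; a sketch assuming the three key columns uniquely identify rows in both frames:

# Merge with explicit suffixes, then keep only the rows whose amounts differ.
merged = df_testR.merge(df_testF, on=['Name', 'City', 'Licence'], suffixes=('_R', '_F'))
diffs = merged.loc[merged['Amount_R'] != merged['Amount_F']]
print(diffs)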
I have the following two dataframes:
df1 = pd.DataFrame([["blala Amazon", '02/30/2017', 'Amazon'], ["blala Amazon", '04/28/2017', 'Amazon'], ['blabla Netflix', '06/28/2017', 'Netflix']], columns=['text', 'date', 'keyword'])
df2 = pd.DataFrame([['01/28/2017', '3.4', '10.2'], ['02/30/2017', '3.7', '10.5'], ['03/28/2017', '6.0', '10.9']], columns=['dates', 'ReturnOnAssets.1', 'ReturnOnAssets.2'])
(perhaps it's clearer in the screenshots here: https://imgur.com/a/YNrWpR2)
The df2 is much larger than shown here - it contains columns for 100 companies. So for example, for the 10th company, the column names are: ReturnOnAssets.10, etc.
I have created a dictionary which maps the company names to the column names:
stocks = {'Microsoft':'','Apple' :'1', 'Amazon':'2', 'Facebook':'3',
'Berkshire Hathaway':'4', 'Johnson & Johnson':'5',
'JPMorgan' :'6', 'Alphabet': '7'}
and so on.
Now, what I am trying to achieve is adding a column "ReturnOnAssets" from df2 to df1, but for a specific company and a specific date. So looking at df1, the first tweet (i.e. "text") contains the keyword "Amazon" and was posted on 02/30/2017. I now need to go to the relevant column in df2 for Amazon (i.e. "ReturnOnAssets.2") and fetch the value for that date.
So what I expect looks like this:
df1 = pd.DataFrame([["blala Amazon", '02/30/2017', 'Amazon', **'10.5'**], ["blala Amazon", '04/28/2017', 'Amazon', 'x'], ["blabla Netflix', '06/28/2017', 'Netflix', 'x']], columns=['text', 'date', 'keyword', 'ReturnOnAssets'])
By x I mean values which were not included in the example df1 and df2.
I am fairly new to pandas and I can't wrap my head around it. I tried:
keyword = df1['keyword']
txt = 'ReturnOnAssets.'+ stocks[keyword]
df1['ReturnOnAssets'] = df2[txt]
But I don't know how to fetch the relevant date, and this also gives me an error: "'Series' objects are mutable, thus they cannot be hashed", which probably comes from the fact that I cannot add a whole column of keywords to the text string.
I don't know how to achieve the operation I need to do, so I would appreciate help.
Here is a working approach. It can probably be shortened, and you can add if statements to deal with missing values.
import pandas as pd
import numpy as np
df1 = pd.DataFrame([["blala Amazon", '05/28/2017', 'Amazon'], ["blala Facebook", '04/28/2017', 'Facebook'], ['blabla Netflix', '06/28/2017', 'Netflix']], columns=['text', 'dates', 'keyword'])
df1
df2 = pd.DataFrame([['06/28/2017', '3.4', '10.2'], ['05/28/2017', '3.7', '10.5'], ['04/28/2017', '6.0', '10.9']], columns=['dates', 'ReturnOnAsset.1', 'ReturnOnAsset.2'])
# creating myself a bigger df2 to cover all the way to Netflix
for i in range(9):
    df2['ReturnOnAsset.' + str(i)] = np.random.randint(1, 1000, df1.shape[0])
stocks = {'Microsoft':'0','Apple' :'1', 'Amazon':'2', 'Facebook':'3',
'Berkshire Hathaway':'4', 'Johnson & Johnson':'5',
'JPMorgan' :'6', 'Alphabet': '7', 'Netflix': '8'}
# new column in which to store the values
df1['ReturnOnAsset'] = np.nan

for index, row in df1.iterrows():
    colname = 'ReturnOnAsset.' + stocks[row['keyword']]
    df1.loc[index, 'ReturnOnAsset'] = df2.loc[df2['dates'] == row['dates'], colname].iloc[0]
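If the row-by-row loop proves slow, a vectorized sketch using the same column names is to reshape df2 to long form with melt and merge on date and keyword:

# Invert the stocks mapping: column name -> keyword.
inv_stocks = {'ReturnOnAsset.' + num: kw for kw, num in stocks.items()}
long2 = df2.melt(id_vars='dates', var_name='col', value_name='ReturnOnAsset')
long2['keyword'] = long2['col'].map(inv_stocks)
df1 = (df1.drop(columns='ReturnOnAsset')
          .merge(long2[['dates', 'keyword', 'ReturnOnAsset']],
                 on=['dates', 'keyword'], how='left'))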
Next time, please give us correct test data; I modified your dates and dictionary so that the first and second columns match (the Netflix and Amazon values).
This code will work if and only if all dates from df1 are in df2. (Note that in df1 the column name is date and in df2 it is dates.)
df1 = pd.DataFrame([["blala Amazon", '02/30/2017', 'Amazon'], ["blala Amazon", '04/28/2017', 'Amazon'], ['blabla Netflix', '02/30/2017', 'Netflix']], columns=['text', 'date', 'keyword'])
df2 = pd.DataFrame([['04/28/2017', '3.4', '10.2'], ['02/30/2017', '3.7', '10.5'], ['03/28/2017', '6.0', '10.9']], columns=['dates', 'ReturnOnAssets.1', 'ReturnOnAssets.2'])
stocks = {'Microsoft':'','Apple' :'5', 'Amazon':'2', 'Facebook':'3',
'Berkshire Hathaway':'4', 'Netflix':'1',
'JPMorgan' :'6', 'Alphabet': '7'}
df1["ReturnOnAssets"]= [ df2["ReturnOnAssets." + stocks[ df1[ "keyword" ][ index ] ] ][ df2.index[ df2["dates"] == df1["date"][index] ][0] ] for index in range(len(df1)) ]
df1
I am writing because I am having an issue with a for loop that fills a dataframe when it is empty. Unfortunately, the posts Filling empty python dataframe using loops, Appending to an empty data frame in Pandas?, and Creating an empty Pandas DataFrame, then filling it? did not help me solve it.
My attempt first finds the empty dataframes in the list listDataframe and then fills them with some chosen columns. I believe my code is clearer than my explanation. What I can't do is save the new dataframe under its original name. Here is my attempt:
for k, j in zip(listOwner, listDataframe):
    for y in j:
        if y.empty:
            data = pd.DataFrame({"Event Date": list_test_2, "Site Group Name": k, "Impressions": 0})
            y = pd.concat([data, y])
            #y = y.append(data)
where "listOwner", "listDataframe" and "list_test_2" are, respectively, given by:
listOwner = ['OWNER ONE', 'OWNER TWO', 'OWNER THREE', 'OWNER FOUR']
listDataframe = [df_a,df_b,df_c,df_d]
with
df_a = [df_ap_1, df_di_1, df_er_diret_1, df_er_s_1]
df_b = [df_ap_2, df_di_2, df_er_diret_2, df_er_s_2]
df_c = [df_ap_3, df_di_3, df_er_diret_3, df_er_s_3]
df_d = [df_ap_4, df_di_4, df_er_diret_4, df_er_s_4]
and

from datetime import datetime, timedelta

list_test_2 = []
for i in range(1, 8):
    f = (datetime.today() - timedelta(days=i)).date()
    list_test_2.append(datetime.combine(f, datetime.min.time()))
The empty dataframes were df_ap_1 and df_ap_3. After running the above lines (using both concat and append), these two dataframes are still empty when I inspect them. Any idea why that happens and how to overcome this issue?
UPDATE
In order to avoid both append and concat, I tried the following (again with no success):
for k, j in zip(listOwner, listDataframe):
    for y in j:
        if y.empty:
            y = pd.DataFrame({"Event Date": list_test_2, "Site Group Name": k, "Impressions": 0})
The two desired results should be dataframes with those columns filled, where the first should be called df_ap_1 and the second df_ap_3.
Thanks in advance.
Drigo
Here's a way to do it:
import pandas as pd
columns = ['Event Date', 'Site Group Name', 'Impressions']
df_ap_1 = pd.DataFrame(columns=columns) #empty dataframe
df_di_1 = pd.DataFrame(columns=columns) #empty dataframe
df_ap_2 = pd.DataFrame({'Event Date':[1], 'Site Group Name':[2], 'Impressions': [3]}) #non-empty dataframe
df_di_2 = pd.DataFrame(columns=columns) #empty dataframe
df_a = [df_ap_1, df_di_1]
df_b = [df_ap_2, df_di_2]
listDataframe = [df_a,df_b]
list_test_2 = 'foo'
listOwner = ['OWNER ONE', 'OWNER TWO']
def appendOwner(df, owner, list_test_2):
    # append a single row for this owner to the dataframe, in place
    new_row = {'Event Date': list_test_2,
               'Site Group Name': owner,
               'Impressions': 0,
               }
    df.loc[len(df)] = new_row
for owner, dfList in zip(listOwner, listDataframe):
    for df in dfList:
        if df.empty:
            appendOwner(df, owner, list_test_2)
print(listDataframe)
You can use the appendOwner function to append a row for each owner to the empty dataframes. Because appendOwner mutates each dataframe in place (via df.loc), the original objects in listDataframe are updated; reassigning the loop variable y, as in your attempt, only rebinds a local name and leaves the list untouched.
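If you would rather keep your original loop structure, a minimal sketch that replaces the empty frames inside each inner list (assuming list_test_2 is a list of dates, as in the question):

for k, j in zip(listOwner, listDataframe):
    for i, y in enumerate(j):
        if y.empty:
            # Assigning through the index mutates the list itself.
            j[i] = pd.DataFrame({'Event Date': list_test_2,
                                 'Site Group Name': k,
                                 'Impressions': 0})

Keep in mind this updates listDataframe, but the standalone names like df_ap_1 would still point to the old empty frames; mutating in place, as appendOwner does above, is what updates those as well.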