I have two DataFrames, A and B. I want to iterate through certain columns of df B, check whether the values in each of its rows exist in one of the columns of df A, and if so fill df B's null values with values from df A's other columns.
df A:
country  region  product
USA      NY      apple
USA      NY      orange
UK       LON     banana
UK       LON     chocolate
CANADA   TOR     syrup
CANADA   TOR     fish
df B:
country  ID   product1     product2     product3     product4     region
USA      123  other stuff  other stuff  apple        NA           NA
USA      456  orange       other stuff  other stuff  NA           NA
UK       234  banana       other stuff  other stuff  NA           NA
UK       766  other stuff  other stuff  chocolate    NA           NA
CANADA   877  other stuff  other stuff  syrup        NA           NA
CANADA   109  NA           fish         NA           other stuff  NA
So I want to iterate through df B and, for example, see if dfA.product (apple) appears in any of the columns dfB.product1 through dfB.product4. If it does, as in the first row of df B, I want to copy the region value from dfA.region into df B's region column, which is currently NA.
Here is the code I have; I am not sure if it is right:
import pandas as pd
from tqdm import tqdm

def fill_null_value(dfA, dfB):
    for i, row_a in tqdm(dfA.iterrows()):
        for index, row_b in tqdm(dfB.iterrows()):
            # compare this row of df B against df A's product and,
            # on a match, fill only this row's region with .loc
            if row_b['product1'] == row_a['product']:
                dfB.loc[index, 'region'] = row_a['region']
            elif row_b['product2'] == row_a['product']:
                dfB.loc[index, 'region'] = row_a['region']
            elif row_b['product3'] == row_a['product']:
                dfB.loc[index, 'region'] = row_a['region']
            elif row_b['product4'] == row_a['product']:
                dfB.loc[index, 'region'] = row_a['region']
    # mark rows that never matched, instead of overwriting matches inside the loop
    dfB['region'] = dfB['region'].fillna('not found')
    print('outputting data')
    dfB.to_excel('test.xlsx')
    return dfB
If I were you, I would create some joins, then concat them and drop duplicates:
df_1 = df_A.merge(df_B, left_on=['country', 'product'], right_on=['country', 'product1'], how='right')
df_2 = df_A.merge(df_B, left_on=['country', 'product'], right_on=['country', 'product2'], how='right')
df_3 = df_A.merge(df_B, left_on=['country', 'product'], right_on=['country', 'product3'], how='right')
df_4 = df_A.merge(df_B, left_on=['country', 'product'], right_on=['country', 'product4'], how='right')
df = pd.concat([df_1, df_2, df_3, df_4]).drop_duplicates()
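One wrinkle worth noting: both frames carry a region column, so after these merges df_A's value lands in region_x and df_B's original value in region_y (pandas' default suffixes). A hedged sketch of the final coalesce-and-dedupe step, assuming df_B's missing regions are real NaN rather than the string "NA":

# prefer df_B's existing region, fall back to the match found via df_A
df['region'] = df['region_y'].fillna(df['region_x'])
# keep one row per ID, preferring rows where a region was found
df = (df.sort_values('region', na_position='last')
        .drop_duplicates(subset='ID')
        .drop(columns=['region_x', 'region_y', 'product']))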
The main issue here seems to be finding a single product column in your second data set that you can do your join on. It's not clear how exactly you are deciding which values in the various product columns of df_b are meant to be used as lookup keys and which ones are ignored.
Assuming, though, that your df_a contains an exhaustive list of product values, and that each of those values only ever occurs once in a row, you could do something like this (simplifying your example):
import pandas as pd
df_a = pd.DataFrame({'Region':['USA', 'Canada'], 'Product': ['apple', 'banana']})
df_b = pd.DataFrame({'product1': ['apple', 'xyz'], 'product2': ['xyz', 'banana']})
product_cols = ['product1', 'product2']
df_b['Product'] = df_b[product_cols].apply(lambda x: x[x.isin(df_a.Product)].iloc[0], axis=1)
df_b = df_b.merge(df_a, on='Product')
The big thing here is generating a column that you can join on for your lookup.
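As a hedged refinement (assuming some rows of df_b may contain no known product at all, in which case the lambda above raises an IndexError), you can fall back to NaN and make the lookup a left join so those rows survive:

import numpy as np

def first_known(row):
    # keep only the values that appear in df_a's product list
    hits = row[row.isin(df_a.Product)]
    return hits.iloc[0] if len(hits) else np.nan

df_b['Product'] = df_b[product_cols].apply(first_known, axis=1)
df_b = df_b.merge(df_a, on='Product', how='left')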
Related
I wanted to ask: in SQL I can do JOIN ON CASE WHEN; is there a way to do this in Pandas?
disease = [
{"City":"CH","Case_Recorded":5300,"Recovered":2839,"Deaths":2461},
{"City":"NY","Case_Recorded":1311,"Recovered":521,"Deaths":790},
{"City":"TX","Case_Recorded":1991,"Recovered":1413,"Deaths":578},
{"City":"AT","Case_Recorded":3381,"Recovered":3112,"Deaths":269},
{"City":"TX","Case_Recorded":3991,"Recovered":2810,"Deaths":1311},
{"City":"LA","Case_Recorded":2647,"Recovered":2344,"Deaths":303},
{"City":"LA","Case_Recorded":4410,"Recovered":3344,"Deaths":1066}
]
region = {"North": ["AT"], "West":["TX","LA"]}
So what I have is two dummy dicts, which I have already converted to DataFrames. The first holds the names of the cities with their cases, and I'm trying to figure out which region each city belongs to.
Region|City
North|AT
West|TX
West|LA
None|NY
None|CH
What I thought of in SQL was a left join with CASE WHEN: if the result is null when joining on the North region, then join on the West region.
But if there are 15 or 30 regions in some country, I think that would be a problem.
Use:
#get City without duplicates
df1 = pd.DataFrame(disease)[['City']].drop_duplicates()
#create DataFrame from region dictionary
region = {"North": ["AT"], "West":["TX","LA"]}
df2 = pd.DataFrame([(k, x) for k, v in region.items() for x in v],
columns=['Region','City'])
#append not matched cities to df2
out = pd.concat([df2, df1[~df1['City'].isin(df2['City'])]])
print (out)
Region City
0 North AT
1 West TX
2 West LA
0 NaN CH
1 NaN NY
If order is not important:
out = df2.merge(df1, how = 'right')
print (out)
Region City
0 NaN CH
1 NaN NY
2 West TX
3 North AT
4 West LA
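If you want the literal None from the expected output rather than NaN, a purely cosmetic follow-up on either variant is:

out['Region'] = out['Region'].fillna('None')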
I'm sorry, I'm not exactly sure what your expected result is; can you explain more? If your expected result is just getting each city's region, there is no need for conditional joining. For example, you can transform the city-region table into one row per city per region and directly join it with the main df:
disease = [
{"City":"CH","Case_Recorded":5300,"Recovered":2839,"Deaths":2461},
{"City":"NY","Case_Recorded":1311,"Recovered":521,"Deaths":790},
{"City":"TX","Case_Recorded":1991,"Recovered":1413,"Deaths":578},
{"City":"AT","Case_Recorded":3381,"Recovered":3112,"Deaths":269},
{"City":"TX","Case_Recorded":3991,"Recovered":2810,"Deaths":1311},
{"City":"LA","Case_Recorded":2647,"Recovered":2344,"Deaths":303},
{"City":"LA","Case_Recorded":4410,"Recovered":3344,"Deaths":1066}
]
region = [
{'City':'AT','Region':"North"},
{'City':'TX','Region':"West"},
{'City':'LA','Region':"West"}
]
df = pd.DataFrame(disease)
df_reg = pd.DataFrame(region)
df.merge(df_reg, on='City', how='left')
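If the region lookup starts out as the nested dict from the question rather than a hand-written list, one way to get it into the one-row-per-city shape used above (a sketch; region_nested is just the question's dict renamed):

region_nested = {"North": ["AT"], "West": ["TX", "LA"]}
df_reg = pd.DataFrame(
    [{'City': city, 'Region': reg}
     for reg, cities in region_nested.items()
     for city in cities])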
I want to add another header under two columns of a table.
I tried this way but I got a SyntaxError:
table = {
'Color':color,
'Continent':continent,
{'Country':country,
'Nation':season,}
'Name':name,
'Surname':surname,
}
You could use a MultiIndex as described here. Given the structure you show in your dictionary, this should do:
df.columns = pd.MultiIndex.from_tuples([('Color',), ('Continent',), ('Country', 'Nation'), ('Country', 'Name'), ('Country', 'Surname')])
For example:
df = pd.DataFrame({'Color':['Red','Green'],
'Continent':['Asia','Africa'],
'Nation':['China','Egypt'],
'Name':['X','Y'],
'Surname':['A','B']})
df.columns = pd.MultiIndex.from_tuples([('Color',), ('Continent',), ('Country', 'Nation'), ('Country', 'Name'), ('Country', 'Surname')])
print(df)
Outputs:
Color Continent Country
NaN NaN Nation Name Surname
0 Red Asia China X A
1 Green Africa Egypt Y B
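Once the MultiIndex is in place, the grouped columns can be addressed together or individually, for example:

print(df['Country'])              # the three sub-columns grouped under Country
print(df[('Country', 'Nation')])  # a single sub-column as a Series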
I have 2 dataframes:
df1 = pd.read_excel(path1)
df2 = pd.read_excel(path2)
df1c:
date_of_financing   lead_investors
2012-10-01 n/a
2014-06-01 NCB, CSC, Health
2014-02-01 National Cancer Institute
2013-09-01 n/a
2012-09-01 Maryland Venture Fund
...
2021-06-01 Karista and White Fund
2021-07-01 Zepp Health
names = ['3E Bioventures', '3SBio', '3V SourceOne Capital',...]
df1['date_of_financing'] = pd.to_datetime(df1['date_of_financing'])
df1 = df1.set_index('date_of_financing')
df1c = df1.fillna('n/a')
names = df2['investor_name'].unique().tolist()
The investor names from df2 were put in a list. I want to iterate over the names in the df2['investor_name'] list, look for them in a column of df1c, df1c['lead_investors'], and create a new column df1c['investor_continent'] where, for every match between df1c and df2, I write 'asia', in this way:
for x in names:
    match_asia = df1c['lead_investors'].str.contains(x, na=False).any()
    if match_asia > 0:
        df1c['investor_continent'] = 'asia'
The loop returns the correct boolean result, but df1c['investor_continent'] = 'asia' is obviously wrong because it writes 'asia' in every row.
What's the correct way to write 'asia' when there is a match, and 'other' when there is no match?
I partially solved it with:

import numpy as np

conditions = list(map(df1c['lead_investors'].str.contains, names))
df1c['continents'] = np.select(conditions, names, 'unknown')
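For the 'asia' vs 'other' version the question asks for, a minimal sketch (assuming every name in names should map to 'asia') is to build one regex alternation and feed the resulting mask to np.where, escaping the names since str.contains treats them as patterns:

import re
import numpy as np

pattern = '|'.join(map(re.escape, names))
mask = df1c['lead_investors'].str.contains(pattern, na=False)
df1c['investor_continent'] = np.where(mask, 'asia', 'other')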
Is there a way to use a combination of a column-name dictionary and a value dictionary to filter a pandas DataFrame?
Example dataframe:
df = pd.DataFrame({
"name": ["Ann", "Jana", "Yi", "Robin", "Amal", "Nori"],
"city": ["Chicago", "Prague", "Shanghai", "Manchester", "Chicago", "Osaka"],
"age": [28, 33, 34, 38, 31, 37],
"score": [79.0, 81.0, 80.0, 68.0, 61.0, 84.0],
})
column_dict = {0:"city", 1:"score"}
value_dict = {0:"Chicago", 1:61}
The goal would be to use the matching keys column and value dictionaries to filter the dataframe.
In this example, the city would be filtered to Chicago and the score would be filtered to 61, with the filtered dataframe being:
name city age score
4 Amal Chicago 31 61.0
keep_rows = pd.Series(True, index=df.index)
for k, col in column_dict.items():
    value = value_dict[k]
    keep_rows &= df[col] == value
>>> df[keep_rows]
name city age score
4 Amal Chicago 31 61.0
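Because the loop accumulates a single boolean mask, the frame is only indexed once at the end. The same mask can also be collapsed into one expression (a variant, assuming both dicts share the same keys):

import numpy as np

keep_rows = np.logical_and.reduce(
    [df[col] == value_dict[k] for k, col in column_dict.items()])
df[keep_rows]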
It's a bit funny to use two different dicts to store keys and values. You're better off with something like this:
filter_dict = {"city":"Chicago", "score":61}
df_filt = df
for k, v in filter_dict.items():
    df_filt = df_filt[df_filt[k] == v]
output:
name city age score
4 Amal Chicago 31 61.0
Use merge:
# create filter DataFrame from column_dict and value_dict
df_filter = pd.DataFrame({value: [value_dict[key]] for key, value in column_dict.items()})
# use merge with df_filter
res = df.merge(df_filter, on=['city', 'score'])
print(res)
Output
name city age score
0 Amal Chicago 31 61.0
I've got a dataframe df in Pandas that looks like this:
stores          product  discount
Westminster     102141   T
Westminster     102142   F
City of London  102141   T
City of London  102142   F
City of London  102143   T
And I'd like to end up with a dataset that looks like this:
stores          product_1  discount_1  product_2  discount_2  product_3  discount_3
Westminster     102141     T           102142     F
City of London  102141     T           102142     F           102143     T
How do I do this in pandas?
I think this is some kind of pivot on the stores column, but with multiple value columns. Or perhaps it's an "unmelt" rather than a "pivot"?
I tried:
df.pivot("stores", ["product", "discount"], ["product", "discount"])
But I get TypeError: MultiIndex.name must be a hashable type.
Use DataFrame.unstack for the reshape; it is only necessary to create a counter per store with GroupBy.cumcount first, then change the ordering of the second level and flatten the MultiIndex in columns by map:
df = (df.set_index(['stores', df.groupby('stores').cumcount().add(1)])
.unstack()
.sort_index(axis=1, level=1))
df.columns = df.columns.map('{0[0]}_{0[1]}'.format)
df = df.reset_index()
print (df)
stores discount_1 product_1 discount_2 product_2 discount_3 \
0 City of London T 102141.0 F 102142.0 T
1 Westminster T 102141.0 F 102142.0 NaN
product_3
0 102143.0
1 NaN
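If you want product_N ahead of discount_N as in the question's target layout, a small follow-up (assuming three groups, as in this example) reorders the flattened columns:

ordered = ['stores'] + [f'{name}_{i}'
                        for i in (1, 2, 3)
                        for name in ('product', 'discount')]
df = df.reindex(columns=ordered)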