LEFT JOIN ON CASE WHEN in Pandas - python

In SQL I can do a JOIN ON CASE WHEN; is there a way to do this in Pandas?
disease = [
    {"City": "CH", "Case_Recorded": 5300, "Recovered": 2839, "Deaths": 2461},
    {"City": "NY", "Case_Recorded": 1311, "Recovered": 521, "Deaths": 790},
    {"City": "TX", "Case_Recorded": 1991, "Recovered": 1413, "Deaths": 578},
    {"City": "AT", "Case_Recorded": 3381, "Recovered": 3112, "Deaths": 269},
    {"City": "TX", "Case_Recorded": 3991, "Recovered": 2810, "Deaths": 1311},
    {"City": "LA", "Case_Recorded": 2647, "Recovered": 2344, "Deaths": 303},
    {"City": "LA", "Case_Recorded": 4410, "Recovered": 3344, "Deaths": 1066},
]
region = {"North": ["AT"], "West": ["TX", "LA"]}
So what I have is two dummy dicts, which I have already converted to DataFrames. The first holds the cities with their case counts, and I'm trying to figure out which region each city belongs to. The expected result:
Region|City
North|AT
West|TX
West|LA
None|NY
None|CH
What I thought of in SQL was a LEFT JOIN ON CASE WHEN: if the result is null when joined with the North region, then join with the West region. But if a country has 15 or 30 regions, I think that approach becomes a problem.

Use:
import pandas as pd

# get City without duplicates
df1 = pd.DataFrame(disease)[['City']].drop_duplicates()

# create DataFrame from the region dictionary
region = {"North": ["AT"], "West": ["TX", "LA"]}
df2 = pd.DataFrame([(k, x) for k, v in region.items() for x in v],
                   columns=['Region', 'City'])

# append not matched cities to df2
out = pd.concat([df2, df1[~df1['City'].isin(df2['City'])]])
print (out)
Region City
0 North AT
1 West TX
2 West LA
0 NaN CH
1 NaN NY
If order is not important:
out = df2.merge(df1, how = 'right')
print (out)
Region City
0 NaN CH
1 NaN NY
2 West TX
3 North AT
4 West LA
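If the eventual goal is just to label every row of the disease data with its region, a dictionary lookup scales to any number of regions with no conditional joins at all. A minimal sketch of my own, assuming the disease list and region dict defined in the question, with 'unknown' as a placeholder I chose for unmatched cities:
import pandas as pd

# invert {region: [cities]} into {city: region} for direct lookup
city_to_region = {city: reg for reg, cities in region.items() for city in cities}

df = pd.DataFrame(disease)
df['Region'] = df['City'].map(city_to_region).fillna('unknown')
print(df[['Region', 'City']].drop_duplicates())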

I'm sorry, I'm not exactly sure what your expected result is; could you elaborate? If the expected result is just getting each city's region, there is no need for a conditional join. For example, you can transform the city-region table into one row per city per region and join it directly with the main df:
disease = [
    {"City": "CH", "Case_Recorded": 5300, "Recovered": 2839, "Deaths": 2461},
    {"City": "NY", "Case_Recorded": 1311, "Recovered": 521, "Deaths": 790},
    {"City": "TX", "Case_Recorded": 1991, "Recovered": 1413, "Deaths": 578},
    {"City": "AT", "Case_Recorded": 3381, "Recovered": 3112, "Deaths": 269},
    {"City": "TX", "Case_Recorded": 3991, "Recovered": 2810, "Deaths": 1311},
    {"City": "LA", "Case_Recorded": 2647, "Recovered": 2344, "Deaths": 303},
    {"City": "LA", "Case_Recorded": 4410, "Recovered": 3344, "Deaths": 1066},
]
region = [
    {'City': 'AT', 'Region': "North"},
    {'City': 'TX', 'Region': "West"},
    {'City': 'LA', 'Region': "West"},
]
df = pd.DataFrame(disease)
df_reg = pd.DataFrame(region)
df.merge(df_reg, on='City', how='left')
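The left merge leaves NaN in Region for cities with no match (CH and NY here). If a placeholder is preferred, a small addition of my own, assuming 'unknown' as the filler value:
out = df.merge(df_reg, on='City', how='left')
out['Region'] = out['Region'].fillna('unknown')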

Related

Add header under two columns of a table

I want to add another header under two columns of a table.
I tried this way but I got invalid syntax (a SyntaxError):
table = {
    'Color': color,
    'Continent': continent,
    {'Country': country,
     'Nation': season,}
    'Name': name,
    'Surname': surname,
}
You could use a MultiIndex as described here. Given the structure you show in your dictionary, this should do:
df.columns = pd.MultiIndex.from_tuples([('Color',), ('Continent',), ('Country', 'Nation'), ('Country', 'Name'), ('Country', 'Surname')])
For example:
df = pd.DataFrame({'Color': ['Red', 'Green'],
                   'Continent': ['Asia', 'Africa'],
                   'Nation': ['China', 'Egypt'],
                   'Name': ['X', 'Y'],
                   'Surname': ['A', 'B']})
df.columns = pd.MultiIndex.from_tuples([('Color',), ('Continent',), ('Country', 'Nation'), ('Country', 'Name'), ('Country', 'Surname')])
print(df)
Outputs:
Color Continent Country
NaN NaN Nation Name Surname
0 Red Asia China X A
1 Green Africa Egypt Y B
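A note of my own on the single-element tuples: from_tuples pads shorter tuples with NaN, which is why Color and Continent show NaN in the second header level. Once the MultiIndex is in place, the grouped columns can be selected together:
# select every column under the 'Country' header
print(df['Country'])
#   Nation Name Surname
# 0  China    X       A
# 1  Egypt    Y       B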

Pandas conditional match loop between two dataframes

I have 2 dataframes:
df1 = pd.read_excel(path1)
df2 = pd.read_excel(path2)
df1c:
                                lead_investors
date_of_financing
2012-10-01                                 n/a
2014-06-01                    NCB, CSC, Health
2014-02-01           National Cancer Institute
2013-09-01                                 n/a
2012-09-01               Maryland Venture Fund
...
2021-06-01              Karista and White Fund
2021-07-01                         Zepp Health
names = ['3E Bioventures', '3SBio', '3V SourceOne Capital',...]
df1['date_of_financing'] = pd.to_datetime(df1['date_of_financing'])
df1 = df1.set_index('date_of_financing')
df1c = df1.fillna('n/a')
names = df2['investor_name'].unique().tolist()
The investor names in df2 were put in a list. I want to check each name from the df2['investor_name'] list against a column of df1, df1c['lead_investors'], and create a new column df1c['investor_continent'] where, for every match between df1c and df2, I write 'asia', in this way:
for x in names:
    match_asia = df1c['lead_investors'].str.contains(x, na=False).any()
    if match_asia > 0:
        df1c['investor_continent'] = 'asia'
The loop returns the correct boolean result, but df1c['investor_continent'] = 'asia' is obviously wrong because it writes 'asia' into every row.
What is the correct way to write 'asia' when there is a match and 'other' when there is no match?
I partially solved it with:
import numpy as np

conditions = list(map(df1c['lead_investors'].str.contains, names))
df1c['continents'] = np.select(conditions, names, 'unknown')
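Note that np.select here fills the new column with the matching investor name, not a continent. If the goal is simply 'asia' on any match and 'other' otherwise, a sketch of my own, assuming every name in names is an Asian investor, is a single vectorized regex test instead of a loop:
import re
import numpy as np

# one alternation matching any investor name; escape regex metacharacters
pattern = '|'.join(map(re.escape, names))
mask = df1c['lead_investors'].str.contains(pattern, na=False)
df1c['investor_continent'] = np.where(mask, 'asia', 'other')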

Groupby and fill data into a text template in Python

Given a small dataset as follow:
   id city district  price  area
0   1    a       hd     12     6
1   2    a       cy      3     7
2   3    a       tz      3     8
3   4    b       hp      2     9
4   5    b       pd      5     6
I would like to groupby city and fill the data into a text template as follows:
For 【】 city, it has 【】 district, 【】 district and 【】 district, the price is respectively 【】 dollars, 【】 dollars and 【】 dollars, the area sold is respectively 【】 ㎡, 【】 ㎡ and 【】 ㎡.
Code:
df.groupby('city')['district'].apply(list)
Out[14]:
city
a    [hd, cy, tz]
b    [hp, pd]
Name: district, dtype: object
df.groupby('city')['price'].apply(list)
Out[15]:
city
a    [12, 3, 3]
b    [2, 5]
Name: price, dtype: object
df.groupby('city')['area'].apply(list)
Out[16]:
city
a    [6, 7, 8]
b    [9, 6]
Name: area, dtype: object
For example, the result will be like this:
For a city, it has hd district, cy district and tz district, the price is respectively 12 dollars, 3 dollars and 3 dollars, the area sold is respectively 6 ㎡, 7 ㎡ and 8 ㎡.
Is it possible to get an approximate result (it does not need to be exactly the same) as above with Python? Many thanks in advance for any Python or Pandas help.
import pandas as pd

df = pd.DataFrame(
    dict(id = [1, 2, 3, 4, 5,],
         city = list("a" * 3) + list("b" * 2),
         district = ["hd", "cy", "tz", "hp", "pd",],
         price = [12, 3, 3, 2, 5,],
         area = [6, 7, 8, 9, 6,],)
)

# We can set a few initial variables to help the process out.
target = ["city",]
ignore = ["id",]

# This will produce -> ['district', 'price', 'area']
groupers = [i for i in df.columns if not i in tuple(target + ignore)]

# Iterate through each unique city value.
for c in df["city"].unique():
    # Start our message.
    msg = f"For city '{c}',"  # I tweaked the formatting here.
    # Subset of data based on target variable (in this case, 'city')
    # Use `.drop_duplicates()` to retain unique rows.
    dft = df.loc[df["city"] == c, groupers].drop_duplicates()
    # --- OR, the following to use the `target` variable value. --- #
    # dft = df.loc[df[target[0]] == c, groupers].drop_duplicates()
    # Iterate a transposed index
    for idx in dft.T.index:
        # Make a quick value variable for clarity
        vals = dft.T.loc[idx].values
        # Do some ad hoc formatting for each specific field, if required.
        # Add your desired message start to the respective variable.
        # `field` will be what is output to the message string.
        if idx == "price":
            msg_start = "the price is respectively "
            field = "dollars"
        elif idx == "area":
            msg_start = "the area sold is respectively "
            field = "m\u00b2"
        else:
            msg_start = " it has\n"
            field = idx
        # Add the message start section
        msg += msg_start
        # Use .join() with conditions to determine if the item is the last one in the list.
        msg += ", ".join([f"{i} {field}" if i != vals[-1] else f"and {i} {field}" for i in vals])
        # Add a newline for separation between each set of items.
        msg += "\n"
    print(msg)
Output:
For city 'a', it has
hd district, cy district, and tz district
the price is respectively 12 dollars, and 3 dollars, and 3 dollars
the area sold is respectively 6 m², 7 m², and 8 m²
For city 'b', it has
hp district, and pd district
the price is respectively 2 dollars, and 5 dollars
the area sold is respectively 9 m², and 6 m²
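Since the question only asks for an approximate match to the template, a more compact variant of my own builds each sentence directly from the grouped lists, using the df defined above:
# one sentence per city, joining the grouped values with 'and'
for city, g in df.groupby('city'):
    districts = ' and '.join(f"{d} district" for d in g['district'])
    prices = ' and '.join(f"{p} dollars" for p in g['price'])
    areas = ' and '.join(f"{a} \u33A1" for a in g['area'])
    print(f"For {city} city, it has {districts}, "
          f"the price is respectively {prices}, "
          f"the area sold is respectively {areas}.")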

How to add categorical variable to dataframe?

I'm working on a World Happiness Report project that includes datasets from 2015 to 2019. I concatenated them into a final DataFrame to get the average of parameters (economy, health, etc.) for every country across that time span. But what I forgot to add was the respective region that each country is in (e.g. England - Western Europe). How could I add the 'Region' column to my final DataFrame and be sure that each region matches its respective country?
Not sure if this is what you are looking for.
You may want to do something like this:
df['Region'] = df['Country'].map(region_df.set_index('Country')['Region'])
Or you can also use a merge. The assumption is that for each country there is a region it can map to.
df = pd.merge(df, region_df, how='left', on=['Country'])
Make sure you have indexed both DataFrames on Country before you merge to get the best performance.
Data setup:
import pandas as pd
c = ['Country','Happiness Score','Other_fields']
d = [['Denmark', 7.5460, 1.25],
     ['Norway', 7.5410, 1.50],
     ['Finland', 7.5378, 1.85]]
region_cols = ['Country','Region']
region_data = [['Denmark','R1'],['Norway','R2'],['Finland','R3']]
df = pd.DataFrame(data = d, columns = c)
region_df = pd.DataFrame(data = region_data, columns = region_cols)
Based on the lookup DataFrame, you can do a map to check for Country and assign Region to df.
df['Region'] = df['Country'].map(region_df.set_index('Country')['Region'])
print (df)
Your result will be as follows:
Base DataFrame:
Country Happiness Score Other_fields
0 Denmark 7.5460 1.25
1 Norway 7.5410 1.50
2 Finland 7.5378 1.85
Lookup DataFrame:
Country Region
0 Denmark R1
1 Norway R2
2 Finland R3
Updated DataFrame:
Country Happiness Score Other_fields Region
0 Denmark 7.5460 1.25 R1
1 Norway 7.5410 1.50 R2
2 Finland 7.5378 1.85 R3
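One caveat worth noting (my addition, not part of the answer above): with map or a left merge, any country missing from the lookup table ends up with NaN in Region, which can be filled explicitly:
# fill countries absent from the lookup with a placeholder
df['Region'] = df['Country'].map(region_df.set_index('Country')['Region']).fillna('Unknown')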

Pandas: How to find whether address in one dataframe is from city and state in another dataframe?

I have a dataframe of addresses as below:
main_df =
address
0 3, my_street, Mumbai, Maharashtra
1 Bangalore Karnataka 45th Avenue
2 TelanganaHyderabad some_street, some apartment
And I have a dataframe with city and state as below (note that a few states have cities with the same names too):
city_state_df =
city state
0 Mumbai Maharashtra
1 Ahmednagar Maharashtra
2 Ahmednagar Bihar
3 Bangalore Karnataka
4 Hyderabad Telangana
I want to have the matching city and state next to each address. I am able to do so with iterrows() and nested for loops, but both take more than an hour each for a mere 15k records. What is the optimal way of achieving this, considering that addresses are written in no fixed order and multiple states have cities with the same name?
My code below:
import numpy as np
import pandas as pd

main_df = pd.DataFrame({'address': ['3, my_street, Mumbai, Maharashtra', 'Bangalore Karnataka 45th Avenue', 'TelanganaHyderabad some_street, some apartment']})
city_state_df = pd.DataFrame({'city': ['Mumbai', 'Ahmednagar', 'Ahmednagar', 'Bangalore', 'Hyderabad'],
                              'state': ['Maharashtra', 'Maharashtra', 'Bihar', 'Karnataka', 'Telangana']})
main_df['city'] = np.nan
main_df['state'] = np.nan
for i, df_row in main_df.iterrows():
    for j, city_row in city_state_df.iterrows():
        if city_row['city'] in df_row['address']:
            city_filtered = city_state_df[city_state_df['city'] == city_row['city']]
            for k, fil_row in city_filtered.iterrows():
                if fil_row['state'] in df_row['address']:
                    df_row['city'] = fil_row['city']
                    df_row['state'] = fil_row['state']
                    break
            break
Hello, maybe something like this:
main_df = main_df.reindex(columns=[*main_df.columns.tolist(), 'state', 'city'], fill_value=None)
for i, row in city_state_df.iterrows():
    main_df.loc[(main_df.address.str.contains(row.city)) &
                (main_df.address.str.contains(row.state)),
                ['city', 'state']] = [row.city, row.state]
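Since iterrows was the bottleneck in the question, a vectorized sketch of my own, starting again from the question's data and assuming city names contain no regex metacharacters, extracts the city in one pass and then keeps the candidate state that actually appears in the address (addresses whose state cannot be found are dropped in this sketch):
import pandas as pd

main_df = pd.DataFrame({'address': ['3, my_street, Mumbai, Maharashtra',
                                    'Bangalore Karnataka 45th Avenue',
                                    'TelanganaHyderabad some_street, some apartment']})
city_state_df = pd.DataFrame({'city': ['Mumbai', 'Ahmednagar', 'Ahmednagar', 'Bangalore', 'Hyderabad'],
                              'state': ['Maharashtra', 'Maharashtra', 'Bihar', 'Karnataka', 'Telangana']})

# extract the first known city name occurring in each address (single vectorized pass)
pattern = '({})'.format('|'.join(city_state_df['city'].unique()))
main_df['city'] = main_df['address'].str.extract(pattern, expand=False)

# attach all candidate states for that city, then keep the one named in the address
cand = main_df.merge(city_state_df, on='city', how='left')
ok = [s in a if isinstance(s, str) else True for a, s in zip(cand['address'], cand['state'])]
result = cand[ok].drop_duplicates(subset='address')
print(result)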
