Python: looping over two different DataFrames to create a new column

I want to add a new column to a dataframe by referencing another dataframe.
I want to use the startswith method in an if statement to match the df1['BSI'] column against df2['initial'], assign the corresponding df2['marker'], and give df1 a new column of markers, which I will use for cartopy marker styles.
I am having trouble looping over df2 inside a df1 loop: I can't figure out how to pass the current df1 item into the df2 loop to compare it against the df2 items.
df1 looks like this:
BSI Shelter_Number Location Latitude Longitude
0 AA-010 1085 SUSSEX (N SIDE) & RIDEAU FALLS 45.439571 -75.695694
1 AA-030 3690 SUSSEX (E SIDE) & ALEXANDER NS 45.442795 -75.692322
2 AA-180 279 CRICHTON (E SIDE) & BEECHWOOD FS 45.439556 -75.676849
3 AA-200 2018 BEECHWOOD (S SIDE) & CHARLEVOIX NS 45.441154 -75.673622
4 AA-220 3301 BEECHWOOD (S SIDE) & MAISONNEUVE NS 45.442188 -75.671356
df2 looks like this:
initial marker
0 AA bo
1 AB bv
2 AC b^
3 AD b<
4 AE b>
desired output is:
BSI Shelter_Number Location Latitude Longitude marker
0 AA-010 1085 SUSSEX (N SIDE) & RIDEAU FALLS 45.439571 -75.695694 bo
1 AA-030 3690 SUSSEX (E SIDE) & ALEXANDER NS 45.442795 -75.692322 bo
2 AA-180 279 CRICHTON (E SIDE) & BEECHWOOD FS 45.439556 -75.676849 bo
3 AA-200 2018 BEECHWOOD (S SIDE) & CHARLEVOIX NS 45.441154 -75.673622 bo
4 AA-220 3301 BEECHWOOD (S SIDE) & MAISONNEUVE NS 45.442188 -75.671356 bo

Use map. In fact, there are many similar answers using map; the only difference here is that you are matching on only part of the BSI column in df1:
df1['marker'] = df1['BSI'].str.extract('(.*)-', expand=False).map(df2.set_index('initial').marker)
BSI Shelter_Number Location Latitude Longitude marker
0 AA-010 1085 SUSSEX (N SIDE) & RIDEAU FALLS 45.439571 -75.695694 bo
1 AA-030 3690 SUSSEX (E SIDE) & ALEXANDER NS 45.442795 -75.692322 bo
2 AA-180 279 CRICHTON (E SIDE) & BEECHWOOD FS 45.439556 -75.676849 bo
3 AA-200 2018 BEECHWOOD (S SIDE) & CHARLEVOIX NS 45.441154 -75.673622 bo
4 AA-220 3301 BEECHWOOD (S SIDE) & MAISONNEUVE NS 45.442188 -75.671356 bo
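For clarity, str.extract('(.*)-', expand=False) captures everything up to the last hyphen as a plain Series (so 'AA-010' yields 'AA'), and .map then looks each prefix up against the Series indexed by initial. A quick standalone check (a sketch, not part of the original answer):
import pandas as pd
print(pd.Series(['AA-010', 'AB-030']).str.extract('(.*)-', expand=False))
# 0    AA
# 1    AB
# dtype: object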

You can create a dictionary from your df2 and then map df1 to create the new column. If all of your entries in BSI have the same format as shown, it's simple to just select the first 2 letters. If it needs to be more complicated, like taking everything before the first hyphen, then you can use regex (see the sketch after the output below).
Here's some test data
import pandas as pd
df1 = pd.DataFrame({'BSI': ['AA-010', 'AA-030', 'AA-180', 'AA-200', 'AA-220'],
                    'Latitude': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'initial': ['AA', 'AB', 'AC', 'AD', 'AE'],
                    'marker': ['bo', 'bv', 'b^', 'b<', 'b>']})
Here's the mapping
dct = pd.Series(df2.marker.values, index=df2.initial).to_dict()
df1['marker'] = df1['BSI'].str[0:2].map(dct)
BSI Latitude marker
0 AA-010 1 bo
1 AA-030 2 bo
2 AA-180 3 bo
3 AA-200 4 bo
4 AA-220 5 bo
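If the prefix is not always two characters, the regex route mentioned above can grab everything before the first hyphen instead. A minimal sketch, reusing the dct built above:
# Capture the leading run of non-hyphen characters, then map as before
df1['marker'] = df1['BSI'].str.extract(r'^([^-]+)', expand=False).map(dct)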

Related

LEFT ON Case When in Pandas

I wanted to ask: in SQL I can do JOIN ON CASE WHEN; is there a way to do this in Pandas?
disease = [
{"City":"CH","Case_Recorded":5300,"Recovered":2839,"Deaths":2461},
{"City":"NY","Case_Recorded":1311,"Recovered":521,"Deaths":790},
{"City":"TX","Case_Recorded":1991,"Recovered":1413,"Deaths":578},
{"City":"AT","Case_Recorded":3381,"Recovered":3112,"Deaths":269},
{"City":"TX","Case_Recorded":3991,"Recovered":2810,"Deaths":1311},
{"City":"LA","Case_Recorded":2647,"Recovered":2344,"Deaths":303},
{"City":"LA","Case_Recorded":4410,"Recovered":3344,"Deaths":1066}
]
region = {"North": ["AT"], "West":["TX","LA"]}
So what I have is 2 dummy dicts, which I have already converted to dataframes. The first is the name of each city with its cases, and I'm trying to figure out which region each city belongs to.
Region|City
North|AT
West|TX
West|LA
None|NY
None|CH
What I thought of in SQL was a LEFT JOIN ON CASE WHEN: if the result is null when joining with the North region, then join with the West region.
But if there are 15 or 30 regions in some country, that would be a problem, I think.
Use:
#get City without duplicates
df1 = pd.DataFrame(disease)[['City']].drop_duplicates()
#create DataFrame from region dictionary
region = {"North": ["AT"], "West":["TX","LA"]}
df2 = pd.DataFrame([(k, x) for k, v in region.items() for x in v],
                   columns=['Region','City'])
#append not matched cities to df2
out = pd.concat([df2, df1[~df1['City'].isin(df2['City'])]])
print (out)
Region City
0 North AT
1 West TX
2 West LA
0 NaN CH
1 NaN NY
If order is not important:
out = df2.merge(df1, how='right')
print (out)
Region City
0 NaN CH
1 NaN NY
2 West TX
3 North AT
4 West LA
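The question's expected output shows None rather than NaN for unmatched cities; if that literal string matters, a fillna step could follow (an assumption about the desired representation, not part of the original answer):
out['Region'] = out['Region'].fillna('None')  # assumes the literal string 'None' is wanted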
I'm sorry, I'm not exactly sure what your expected result is; can you explain more? If your expected result is just each city's region, there is no need for conditional joining. For example, you can transform the city-region table into one row per city per region and directly join it with the main df:
disease = [
{"City":"CH","Case_Recorded":5300,"Recovered":2839,"Deaths":2461},
{"City":"NY","Case_Recorded":1311,"Recovered":521,"Deaths":790},
{"City":"TX","Case_Recorded":1991,"Recovered":1413,"Deaths":578},
{"City":"AT","Case_Recorded":3381,"Recovered":3112,"Deaths":269},
{"City":"TX","Case_Recorded":3991,"Recovered":2810,"Deaths":1311},
{"City":"LA","Case_Recorded":2647,"Recovered":2344,"Deaths":303},
{"City":"LA","Case_Recorded":4410,"Recovered":3344,"Deaths":1066}
]
region = [
{'City':'AT','Region':"North"},
{'City':'TX','Region':"West"},
{'City':'LA','Region':"West"}
]
df = pd.DataFrame(disease)
df_reg = pd.DataFrame(region)
df.merge(df_reg, on='City', how='left')
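For reference, this left merge keeps every disease row and attaches the Region where the city matches, leaving NaN for CH and NY. A sketch of the city-to-region part of the result:
res = df.merge(df_reg, on='City', how='left')
print(res[['City', 'Region']].drop_duplicates())
#   City Region
# 0   CH    NaN
# 1   NY    NaN
# 2   TX   West
# 3   AT  North
# 5   LA   West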

Unable to get the groupby column of same numeric column

Below is the dataframe
df = pd.DataFrame({'Cust_Pincode': [487551, 487551, 639207, 452001, 484661, 484661],
                   'REGIONAL_GROUPING': ['WEST I', 'WEST II', 'TN II', 'WEST I', 'WEST I', 'WEST II'],
                   'C_LATITUDE': [22.89831, 23.74881, 10.72208, 22.69875, 23.88280, 23.88280],
                   'C_LONGITUDE': [78.75441, 79.48472, 77.94168, 75.88575, 80.98250, 80.98250],
                   'Region_dist_lim': [33.577743, 33.577743, 36.812093, 33.577743, 33.577743, 33.577743]})
Cust_Pincode REGIONAL_GROUPING C_LATITUDE C_LONGITUDE Region_dist_lim
0 487551 WEST I 22.89831 78.75441 33.577743
1 487551 WEST II 23.74881 79.48472 33.577743
2 639207 TN II 10.72208 77.94168 36.812093
3 452001 WEST I 22.69875 75.88575 33.577743
4 484661 WEST I 23.88280 80.98250 33.577743
5 484661 WEST II 23.88280 80.98250 33.577743
I'm trying to write code that returns the Cust_Pincode values that have more than one REGIONAL_GROUPING: group by Cust_Pincode and REGIONAL_GROUPING and return the dataframe where a Cust_Pincode has multiple regional grouping values. Below is the expected output dataframe:
Cust_Pincode REGIONAL_GROUPING
0 487551     WEST I
             WEST II
1 484661     WEST I
             WEST II
The code which I've written is below:
df.groupby(['Cust_Pincode','REGIONAL_GROUPING']).filter(lambda x: len(x) > 1)
The above code is not giving any output
You can try this solution
df = df.groupby(['Cust_Pincode']).filter(lambda x: len(x) > 1)
print(df.groupby(['Cust_Pincode', 'REGIONAL_GROUPING']).first())
Why use filter()?
You can just use first() like this:
df.groupby(['Cust_Pincode','REGIONAL_GROUPING']).first()
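Putting the two steps together on the sample data: the filter drops the pincodes that occur only once, and first() then collapses each (pincode, grouping) pair to a single row. A sketch of the expected result:
dup = df.groupby('Cust_Pincode').filter(lambda x: len(x) > 1)
print(dup.groupby(['Cust_Pincode', 'REGIONAL_GROUPING']).first())
#                                 C_LATITUDE  C_LONGITUDE  Region_dist_lim
# Cust_Pincode REGIONAL_GROUPING
# 484661       WEST I               23.88280     80.98250        33.577743
#              WEST II              23.88280     80.98250        33.577743
# 487551       WEST I               22.89831     78.75441        33.577743
#              WEST II              23.74881     79.48472        33.577743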

Calculation in a dataframe column

I have a dataframe similar to:
City %SC Team
0 London 50.5 A
1 London 40.1 B
2 London 9.4 C
3 Birmingham 31.3 B
4 Birmingham 27.1 A
5 Birmingham 23.7 D
6 Birmingham 17.9 C
7 York 40.1 A
8 York 38.8 C
9 York 21.1 B
.
.
.
I want to separate the cities in Clear win, Marginal win, Extremely Marginal win based on the difference of the top 2 teams.
I have tried the following code:
df = pd.read_csv('file.csv')
Clear, Marginal, EMarginal = [], [], []
for i in file['%SC']:
    if i[0] - i[1] >= 10:
        Clear.append('City','Team')
    elif i[0] - i[1] < 10 and i[0] - i[1] >= 2:
        Marginal.append('City','Team')
    else:
        EMarginal.append('City','Team')
Expected output:
Clear = [London , A]
Marginal = [Birmingham , B]
EMarginal = [York , A]
My approach doesn't seem right; can anyone suggest a way I could achieve the desired result? Many thanks.
If I understand correctly, you want to divide the cities into groups according to the gap between the top two teams.
def classify(city):
    largest = city.nlargest(2, '%SC')
    diff = largest['%SC'].iloc[0] - largest['%SC'].iloc[1]
    if diff >= 10:
        return 'Clear'
    elif diff < 10 and diff >= 2:
        return 'Marginal'
    return 'EMarginal'
groups = df.groupby("City").apply(classify)
# groups is the following table:
# City
# Birmingham Marginal
# London Clear
# York EMarginal
# dtype: object
If you insist on having them as a list, you can call
groups.groupby(groups).apply(lambda g: list(g.index)).to_dict()
# Output:
# {'Clear': ['London'], 'EMarginal': ['York'], 'Marginal': ['Birmingham']}
If you still insist on including the winning team in each city, you can call
groups.name = "Margin"
df.join(groups, on="City")\
  .groupby("Margin")\
  .apply(
      lambda g: list(
          g.nlargest(1, "%SC")
          .apply(lambda row: (row["City"], row["Team"]), axis=1)
      )
  ).to_dict()
# Output
# {'Clear': [('London', 'A')], 'EMarginal': [('York', 'A')], 'Marginal': [('Birmingham', 'B')]}
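A vectorized alternative to the row-wise classify is also possible (a sketch, assuming every city has at least two teams): take the top two %SC values per city, diff them, and bin the differences with pd.cut.
top2 = df.sort_values('%SC', ascending=False).groupby('City').head(2)
diff = top2.groupby('City')['%SC'].agg(lambda s: s.iloc[0] - s.iloc[1])
margins = pd.cut(diff, bins=[float('-inf'), 2, 10, float('inf')],
                 labels=['EMarginal', 'Marginal', 'Clear'], right=False)
# margins now maps each city to 'Clear', 'Marginal' or 'EMarginal'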

Fuzzy String match and merge database - Dataframe

I have two dataframes (with strings) that I am trying to compare. One has a list of areas; the other has a list of areas with long/lat info as well. I am struggling to write code to perform the following:
1) Check if a string in df1 matches (or partially matches) an area name in df2; if so, merge and carry over the long/lat columns.
2) If a df1 entry does not match anything in df2, the new columns will have NaN or zero.
Code:
import pandas as pd
df1 = pd.read_csv('Dubai Communities1.csv')
df1.head()
CNAME_E1
0 Abu Hail
1 Al Asbaq
2 Al Aweer First
3 Al Aweer Second
4 Al Bada
df2 = pd.read_csv('Dubai Communities2.csv')
df2.head()
COMM_NUM CNAME_E2 Latitude Longitude
0 315 UMM HURAIR 55.3237 25.2364
1 917 AL MARMOOM 55.4518 24.9756
2 624 WARSAN 55.4034 25.1424
3 123 AL MUTEENA 55.3228 25.2739
4 813 AL ROWAIYAH 55.3981 25.1053
The output after search and join will look like this:
CName_E1 CName_E3 Latitude Longitude
0 Area1 Area1 22 7.25
1 Area2 Area2 38 71.83
2 Area3 NaN NaN NaN
3 Area4 Area4 35 8.05
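One way to attempt the fuzzy matching is difflib.get_close_matches from the standard library: find the closest df2 name for each df1 name, then left-merge so non-matches come out as NaN. A sketch, not a tested solution; the 0.8 cutoff and the upper-casing are assumptions to tune:
import difflib

def best_match(name, choices, cutoff=0.8):
    # Closest fuzzy match from choices, or None if nothing scores above the cutoff
    hits = difflib.get_close_matches(name.upper(), choices, n=1, cutoff=cutoff)
    return hits[0] if hits else None

choices = df2['CNAME_E2'].str.upper().tolist()
df1['CNAME_E2'] = df1['CNAME_E1'].apply(lambda n: best_match(n, choices))
# Left merge: rows with no fuzzy match keep NaN for Latitude/Longitude
merged = df1.merge(df2[['CNAME_E2', 'Latitude', 'Longitude']], on='CNAME_E2', how='left')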

Pandas DataFrame [cell=(label,value)], split into 2 separate dataframes

I found an awesome way to parse HTML with pandas. My data is in kind of a weird format (attached below). I want to split this data into 2 separate dataframes.
Notice how each cell holds both a label and a value; is there any really efficient method to split all of these cells and create 2 dataframes, one for the labels and one for the (value) in parentheses?
NumPy has all those ufuncs; is there a way I can use them on string dtypes, since they can be converted to np.array with DF.as_matrix()? I'm trying to steer clear of for loops. I could iterate through all the indices and populate an empty array, but that's pretty barbaric.
I'm using Beaker Notebook btw, it's really cool (HIGHLY RECOMMENDED)
import pandas as pd

#Set URL Destination
url = "http://www.reef.org/print/db/stats"
#Process raw table
DF_raw = pd.read_html(url)[0]
#Get start/end indices of table
start_label = "10 Most Frequent Species"; start_idx = (DF_raw.iloc[:,0] == start_label).argmax()
end_label = "Top 10 Sites for Species Richness"; end_idx = (DF_raw.iloc[:,0] == end_label).argmax()
#Process table (note: as_matrix() was removed in later pandas; .to_numpy() is the modern equivalent)
DF_freqSpecies = pd.DataFrame(
    DF_raw.as_matrix()[(start_idx + 1):end_idx, :],
    columns=DF_raw.iloc[0, :]
)
DF_freqSpecies
#Split these into 2 separate DataFrames
Here's my naive way of doing such:
import re
import numpy as np

DF_species = pd.DataFrame(np.zeros_like(DF_freqSpecies), columns=DF_freqSpecies.columns)
DF_freq = pd.DataFrame(np.zeros_like(DF_freqSpecies).astype(str), columns=DF_freqSpecies.columns)
dims = DF_freqSpecies.shape
for i in range(dims[0]):
    for j in range(dims[1]):
        #Parse current cell: split "Species (value)" at the " (" that precedes a digit
        #(the lookahead keeps the digit, which the original pattern consumed)
        species, freq = re.split(r"\s\((?=\d)", DF_freqSpecies.iloc[i,j])
        freq = float(freq[:-1])  #drop the trailing ")"
        #Populate split DataFrames
        DF_species.iloc[i,j] = species
        DF_freq.iloc[i,j] = freq
I want these 2 dataframes as my output:
(1) Species;
and (2) Frequencies
you can do it this way:
DF1:
In [182]: df1 = DF_freqSpecies.replace(r'\s*\(\d+\.*\d*\)', '', regex=True)
In [183]: df1.head()
Out[183]:
0 Tropical Western Atlantic California, Pacific Northwest and Alaska \
0 Bluehead Copper Rockfish
1 Blue Tang Lingcod
2 Stoplight Parrotfish Painted Greenling
3 Bicolor Damselfish Sunflower Star
4 French Grunt Plumose Anemone
0 Hawaii Tropical Eastern Pacific \
0 Saddle Wrasse King Angelfish
1 Hawaiian Whitespotted Toby Mexican Hogfish
2 Raccoon Butterflyfish Barberfish
3 Manybar Goatfish Flag Cabrilla
4 Moorish Idol Panamic Sergeant Major
0 South Pacific Northeast US and Eastern Canada \
0 Regal Angelfish Cunner
1 Bluestreak Cleaner Wrasse Winter Flounder
2 Manybar Goatfish Rock Gunnel
3 Brushtail Tang Pollock
4 Two-spined Angelfish Grubby Sculpin
0 South Atlantic States Central Indo-Pacific
0 Slippery Dick Moorish Idol
1 Belted Sandfish Three-spot Dascyllus
2 Black Sea Bass Bluestreak Cleaner Wrasse
3 Tomtate Blacklip Butterflyfish
4 Cubbyu Clark's Anemonefish
and DF2
In [193]: df2 = DF_freqSpecies.replace(r'.*\((\d+\.*\d*)\).*', r'\1', regex=True)
In [194]: df2.head()
Out[194]:
0 Tropical Western Atlantic California, Pacific Northwest and Alaska Hawaii \
0 85 54.6 92
1 84.8 53.2 85.8
2 81 50.8 85.7
3 79.9 50.2 85.7
4 74.8 49.7 82.9
0 Tropical Eastern Pacific South Pacific Northeast US and Eastern Canada \
0 85.7 79 67.4
1 82.5 77.3 46.6
2 75.2 73.9 26.2
3 68.9 73.3 25.2
4 67.9 72.8 23.7
0 South Atlantic States Central Indo-Pacific
0 79.7 80.1
1 78.5 75.6
2 78.5 73.5
3 72.7 71.4
4 65.7 70.2
RegEx debugging and explanation:
we basically want to remove everything except the number in parentheses:
(\d+\.*\d*) - group(1) - it's our number
\((\d+\.*\d*)\) - our number in parentheses
.*\((\d+\.*\d*)\).* - the whole thing: anything before '(', then '(', our number, ')', and anything up to the end of the cell
it will be replaced with group(1) - our number
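One caveat: after the replace, df2 still holds strings rather than numbers, so a numeric conversion may be needed before doing any math (a small follow-up, not part of the original answer):
df2 = df2.apply(pd.to_numeric, errors='coerce')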
