How to find college wise rankings and global rankings using pandas? - python

I have the following data frame:
Agent_Name college_name score college_local_ranking global_ranking
Anna Harvard 60 1 4
Mathew oxford 99 1 1
Angel IIT 65 3 6
I'm able to find the global ranking using the rank function.
df['global_ranking'] = df['score'].rank(ascending=False)
Please help me find each agent's local ranking based on the scores within their college.
I tried this, but I'm getting an error:
df['college_local_ranking'] = df['score'].groupby(by = ['college_name']).rank()

Your command:
df['college_local_ranking'] = df['score'].groupby(by = ['college_name']).rank()
will fail because you first subset the dataframe with df['score'] and then apply groupby on college_name, which is not present in that subset.
The correct command would be:
df['college_local_ranking'] = df.groupby('college_name')['score'].rank()
Pass ascending=False to rank() if the highest score should get rank 1, as in your global ranking.

Here is what could work for you:
df['college_local_ranking'] = df.groupby("college_name")["score"].rank(ascending=False)
Hope that helps.
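As a minimal sketch of the grouped rank described in the answers above: the rows below are hypothetical (the table in the question has only one agent per college, so extra agents are invented here to make the per-college ranks visible). The last line shows an alternative that keeps the df['score'] subset and passes the college column itself as the grouping key.

```python
import pandas as pd

# Hypothetical rows, invented for illustration only.
df = pd.DataFrame({
    "Agent_Name": ["Anna", "Ben", "Mathew", "Cara", "Angel"],
    "college_name": ["Harvard", "Harvard", "oxford", "oxford", "IIT"],
    "score": [60, 75, 99, 80, 65],
})

# Rank over the whole frame: the highest score gets rank 1.
df["global_ranking"] = df["score"].rank(ascending=False)

# Rank within each college: group first, then select the score column.
df["college_local_ranking"] = (
    df.groupby("college_name")["score"].rank(ascending=False)
)

# Equivalent form: group the score Series by the college Series itself.
alt = df["score"].groupby(df["college_name"]).rank(ascending=False)

print(df)
```

By default, rank() gives tied scores the average of their ranks; method="min" or method="dense" are available if ties should share a whole-number rank.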

Related

Issue when importing excel file with thousands of lines- ValueError: Passed header=1 but only 1 lines in file

I am trying to import an Excel file with headers on the second row:
selected_columns=['Student name','Age','Faculty']
data = pd.read_excel(path_in + 'Results\\' + 'Survey_data.xlsx', header = 1,usecols = selected_columns).rename(columns={'Student Name':'First name'}).drop_duplicates()
Currently, the Excel file looks something like this:
Student name Surname Faculty Major Scholarship Age L1 TFM Date Failed subjects
Ana Ruiz Economics Finance N 20 N 0
Linda Peterson Mathematics Mathematics Y 22 N 2021-12-04 0
Gregory Olsen Engineering Industrial Engineering N 21 N 0
Ana Watson Business Marketing N 22 N 0
I have tried including the last column in the selected_columns list, but it returns the same error. I would greatly appreciate it if someone could tell me why Python is not reading all the lines.
Thanks in advance.

pandas operations inside a for-loop

Here is a sample of my data
threats =
binomial_name
continent threat_type
Africa Agriculture & Aquaculture 143
Biological Resource Use 102
Climate Change 3
Commercial Development 36
Energy Production & Mining 30
... ... ...
South America Human Intrusions 1
Invasive Species 3
Natural System Modifications 1
Transportation Corridor 2
Unknown 38
I want to use a for loop and obtain and append together the top 5 values of each continent into a data frame.
Here is my code -
continents = threats.continent.unique()
for i in continents:
    continen = (threats
                .query('continent == i')
                .groupby(['continent','threat_type'])
                .sort_values(by=('binomial_name'), ascending=False)
                .head())
    top5 = appended_data.append(continen)
I am however getting the error - KeyError: 'i'
Where am I going wrong?
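The error comes from query('continent == i'): inside the query string, i is not resolved to your loop variable. To refer to a local Python variable, prefix it with @, as in query('continent == @i').
That said, the canonical way to do this does not need a loop:
threats.groupby('continent', as_index=False).apply(
    lambda grp: grp.nlargest(5, 'binomial_name'))
If you want to do this in a loop, replace this part:
frames = []
for i in continents:
    continen = threats[threats['continent'] == i].nlargest(5, 'binomial_name')
    frames.append(continen)
top5 = pd.concat(frames)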
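A small runnable sketch of both approaches, under some assumptions: the data below is made up, and continent and threat_type are ordinary columns (if they are index levels, as the printed sample suggests, calling threats.reset_index() first would give this shape). The loop uses @i so that query() picks up the Python loop variable, and the pieces are collected in a list and combined with pd.concat.

```python
import pandas as pd

# Hypothetical sample; the counts are invented for illustration.
threats = pd.DataFrame({
    "continent": ["Africa"] * 6 + ["Asia"] * 3,
    "threat_type": ["A", "B", "C", "D", "E", "F", "A", "B", "C"],
    "binomial_name": [143, 102, 3, 36, 30, 12, 7, 50, 1],
})

# Loop version: @i refers to the local variable i inside the query string.
frames = []
for i in threats["continent"].unique():
    frames.append(
        threats.query("continent == @i").nlargest(5, "binomial_name")
    )
top5_loop = pd.concat(frames)

# Loop-free version: sort once, then keep the first 5 rows of each group.
top5 = (threats
        .sort_values("binomial_name", ascending=False)
        .groupby("continent")
        .head(5))

print(top5_loop)
```

Both results contain the same rows; only the row order differs, because the loop-free version is sorted across all continents.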

simply selecting a column in a specific row

OK, my frustration has hit epic proportions. I am new to pandas and trying to use it on an Excel database I have; however, I cannot seem to figure out what should be a VERY simple action.
I have a dataframe as such:
ID UID NAME STATE
1 123 Bob NY
1 123 Bob PA
2 124 Jim NY
2 124 Jim PA
3 125 Sue NY
All I need is to be able to locate and print the ID of a record by the unique combination of UID and STATE.
The closest I can come up with is this:
temp_db = fd_db.loc[(fd_db['UID'] == "1") & (fd_db['STATE'] == "NY")]
but this still grabs every row for that UID and not ONLY the one with the given STATE.
Then, when I try to print the result
temp_db.ID.values
prints this:
['1', '1']
I need just the data and not the structure.
My end result needs to be just to print to the screen : 1
Any help is much appreciated.
I think it's because your UID condition is wrong: the UID column holds integers, but you are comparing it with a string.
For example, when I run this:
df.loc[(df['UID'] == "123") & (df['STATE'] == 'NY')]
The output is:
Empty DataFrame
Columns: [ID, UID, NAME, STATE]
Index: []
But when I treat UID as an integer:
df.loc[(df['UID'] == 123) & (df['STATE'] == 'NY')]
The output is:
ID UID NAME STATE
0 1 123 Bob NY
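I hope that helps!
To get just the ID value rather than a DataFrame, select the ID column in the same .loc call and take the first element:
fd_db.loc[(fd_db['UID'] == 123) & (fd_db['STATE'] == 'NY'), 'ID'].iloc[0]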
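Putting the answers above together, here is a minimal sketch that rebuilds the sample table and prints only the ID. It assumes UID and ID are stored as integers; if your Excel import produced strings instead (fd_db.dtypes will show object), compare against "123" rather than 123.

```python
import pandas as pd

# Sample table from the question, with numeric ID and UID columns (an assumption).
fd_db = pd.DataFrame({
    "ID": [1, 1, 2, 2, 3],
    "UID": [123, 123, 124, 124, 125],
    "NAME": ["Bob", "Bob", "Jim", "Jim", "Sue"],
    "STATE": ["NY", "PA", "NY", "PA", "NY"],
})

# Check the column types first: comparing an integer column with a string matches nothing.
print(fd_db.dtypes)

# Build the row filter once, then pick the ID column and its first value.
mask = (fd_db["UID"] == 123) & (fd_db["STATE"] == "NY")
record_id = fd_db.loc[mask, "ID"].iloc[0]

print(record_id)  # prints 1
```

Note that .iloc[0] raises an IndexError when no row matches, so it may be worth checking mask.any() before extracting the value.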

Use contains to merge data frame

I have two separate files, one from our service providers and the other internal (HR).
The service providers write the names of our employees in different ways: some use the firstname lastname format, some use the first letter of the first name followed by the last name, and some use lastname firstname, while the HR file stores the first and last name separately.
DF1
Full Name
0 B.pitt
1 Mr Nickolson Jacl
2 Johnny, Deep
3 Streep Meryl
DF2
First Last
0 Brad Pitt
1 Jack Nicklson
2 Johnny Deep
3 Streep Meryl
My idea is to use str.contains to look for the first letter of the first name and the last name. I've managed to do it with static values using the following code:
df1[['Full Name']][df1['Full Name'].str.contains('B')
& df1['Full Name'].str.contains('pitt')]
Which gives the following result:
Full Name
0 B.pitt
The challenge is comparing the two datasets... Any advice on that, please?
Regards
If you are just checking whether a name exists, this could be useful:
Because it is rare for two people to have exactly the same family name, I recommend splitting the names in df1 and comparing family names first; to be sure, you can then compare first names too.
You can easily do it with a for loop:
for full_name in df1['Full Name']:
    if 'family name you are searching for' in full_name.lower():
        print("yes")
If you need to compare in other respects, just let me know.
I suggest using the nameparser module for parsing names:
pip install nameparser
Then you can process your data frames :
from nameparser import HumanName
import pandas as pd
df1 = pd.DataFrame({'Full Name':['B.pitt','Mr Nickolson Jack','Johnny, Deep','Streep Meryl']})
df2 = pd.DataFrame({'First':['Brad', 'Jack','Johnny', 'Streep'],'Last':['Pitt','Nicklson','Deep','Meryl']})
names1 = [HumanName(name) for name in df1['Full Name']]
names2 = [HumanName(row['First'] + " " + row['Last']) for i, row in df2.iterrows()]
After that you can try comparing the HumanName instances, which have parsed fields. A parsed name looks like this:
<HumanName : [
title: ''
first: 'Brad'
middle: ''
last: 'Pitt'
suffix: ''
nickname: '' ]
I have used this approach for processing thousands of names and merging them to same names from other documents and results were good.
More about the module can be found at https://nameparser.readthedocs.io/en/latest/
Hey, you could use fuzzy string matching with fuzzywuzzy (from fuzzywuzzy import fuzz).
First create Full Name for df2:
df2_ = (df2['First'] + ' ' + df2['Last']).rename('Full Name').to_frame()
Then merge the two dataframes by index
merged_df = df2_.merge(df1, left_index=True, right_index=True)
Now you can apply fuzz.token_sort_ratio so you get the similarity
merged_df['similarity'] = merged_df[['Full Name_x', 'Full Name_y']].apply(lambda r: fuzz.token_sort_ratio(*r), axis=1)
This results in the following dataframe. You can now filter or sort it by similarity.
Full Name_x Full Name_y similarity
0 Brad Pitt B.pitt 80
1 Jack Nicklson Mr Nickolson Jacl 80
2 Johnny Deep Johnny, Deep 100
3 Streep Meryl Streep Meryl 100
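The answer above compares rows that already line up by index. If the two files are not in the same order, one option is to score every provider name against every HR name and keep the best candidate. The sketch below does that with difflib from the standard library swapped in for fuzzywuzzy, so its scores will not be identical to fuzz.token_sort_ratio; the helper names normalize, similarity, and best_match are hypothetical.

```python
import re
from difflib import SequenceMatcher

import pandas as pd

df1 = pd.DataFrame({"Full Name": ["B.pitt", "Mr Nickolson Jacl",
                                  "Johnny, Deep", "Streep Meryl"]})
df2 = pd.DataFrame({"First": ["Brad", "Jack", "Johnny", "Streep"],
                    "Last": ["Pitt", "Nicklson", "Deep", "Meryl"]})


def normalize(name):
    """Lowercase, replace punctuation with spaces, and sort the tokens."""
    tokens = re.sub(r"[^a-z0-9]+", " ", name.lower()).split()
    return " ".join(sorted(tokens))


def similarity(a, b):
    """Return a 0-100 score from difflib (not fuzzywuzzy's exact scale)."""
    return round(100 * SequenceMatcher(None, normalize(a), normalize(b)).ratio())


hr_names = df2["First"] + " " + df2["Last"]


def best_match(name):
    """Score one provider name against every HR name and keep the best."""
    scores = hr_names.map(lambda hr: similarity(name, hr))
    return pd.Series({"match": hr_names[scores.idxmax()], "score": scores.max()})


result = df1.join(df1["Full Name"].apply(best_match))
print(result)
```

Comparing every pair grows quadratically with the number of names, so for large files it may help to restrict candidates first, for example by the first letter of the last name.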

pandas: column formatting issues causing merge problems

I have the following two databases:
url='https://raw.githubusercontent.com/108michael/ms_thesis/master/rgdp_catcode.merge'
df=pd.read_csv(url, index_col=0)
df.head(1)
naics catcode GeoName Description ComponentName year GDP state
0 22 E1600',\t'E1620',\t'A4000',\t'E5000',\t'E3000'... Alabama Utilities Real GDP by state 2004 5205 AL
url='https://raw.githubusercontent.com/108michael/ms_thesis/master/mpl.Bspons.merge'
df1=pd.read_csv(url, index_col=0)
df1.head(1)
state year unemployment log_diff_unemployment id.thomas party type date bills id.fec years_exp session name disposition catcode
0 AK 2006 6.6 -0.044452 1440 Republican sen 2006-05-01 s2686-109 S2AK00010 39 109 National Cable & Telecommunications Association support C4500
Regarding df, I had to manually input the catcode values. I think that is why the formatting is off. What I would like is to simply have the values without the \t prefix. I want to merge the dfs on catcode, state, year. I made a test earlier wherein a df1.catcode with only one value per cell was matched with the values in another df.catcode that had more than one value per cell and it worked.
So technically, all I need to do is lose the \t before each consecutive value in df.catcode, but additionally, if anyone has ever done a merge of this sort before, any 'caveats' learned through experience would be appreciated. My merge code looks like this:
mplmerge=pd.merge(df1,df, on=(['catcode', 'state', 'year']), how='left' )
I think this can be done with the regex method, I'm looking at the documentation now.
Cleaning catcode column in df is rather straightforward:
catcode_fixed = df.catcode.str.findall('[A-Z][0-9]{4}')
This will produce a series with a list of catcodes in every row:
catcode_fixed.head(3)
Out[195]:
0 [E1600, E1620, A4000, E5000, E3000, E1000]
1 [X3000, X3200, L1400, H6000, X5000]
2 [X3000, X3200, L1400, H6000, X5000]
Name: catcode, dtype: object
If I understand correctly what you want, then you need to "ungroup" these lists. Here is the trick, in short:
catcode_fixed = catcode_fixed.apply(pd.Series).stack()
catcode_fixed.index = catcode_fixed.index.droplevel(-1)
So, we've got (note the index values):
catcode_fixed.head(12)
Out[206]:
0 E1600
0 E1620
0 A4000
0 E5000
0 E3000
0 E1000
1 X3000
1 X3200
1 L1400
1 H6000
1 X5000
2 X3000
dtype: object
Now, dropping the old catcode and joining in the new one:
df.drop('catcode',axis = 1, inplace = True)
catcode_fixed.name = 'catcode'
df = df.join(catcode_fixed)
By the way, you may also need to use df1.reset_index() when merging the data frames.
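In newer pandas versions (0.25 and later), DataFrame.explode can do the "ungrouping" step directly. The following is a sketch under assumptions, not the original answer's method: the two small frames are hypothetical stand-ins for df and df1, and indicator=True is added to the merge only to show which rows found a match.

```python
import pandas as pd

# Hypothetical stand-in for df; the real data comes from the URL in the question.
df = pd.DataFrame({
    "naics": [22, 23],
    "catcode": ["E1600',\t'E1620',\t'A4000", "X3000',\t'X3200"],
    "state": ["AL", "AL"],
    "year": [2004, 2004],
})

# Extract every code into a list, then give each code its own row.
df_long = (df
           .assign(catcode=df["catcode"].str.findall(r"[A-Z][0-9]{4}"))
           .explode("catcode")
           .reset_index(drop=True))

# Hypothetical stand-in for df1, with one catcode per row.
df1 = pd.DataFrame({
    "catcode": ["E1620", "C4500"],
    "state": ["AL", "AK"],
    "year": [2004, 2006],
})

# The _merge column shows which df1 rows found a partner in df_long.
merged = pd.merge(df1, df_long, on=["catcode", "state", "year"],
                  how="left", indicator=True)
print(merged)
```

One caveat with this kind of merge: because each original row becomes several rows, any later sum over GDP per state and year will count the same value once per catcode unless duplicates are dropped first.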
