Fill a dataframe from another one based on two conditions - python

I am a little stuck on a small project I am working on and I would appreciate your help.
I have two data frames.
The first one is larger and it is the one I want to use for my final analyses.
It contains ISINs for bonds along with industry and region, and it has ratings from S&P and Moody's.
ISIN
Industry
Region
SP
MD
The second data frame has Industry, Region, and ratings (S&P and Moody's), as well as an estimated rating based on financial information such as investments, spending on R&D, etc.
Industry
Region
SP
MD
Internal Estimate
I would like to extract into a new column in the first data frame the internal rating labeled "Internal Estimate", based on the Industry, Region, and Rating.
A merge wouldn't work because within an industry you can have several S&P and Moody's ratings, or sometimes those ratings are missing.
That is why I have written code with the following conditions:
for i in range(len(Bond_Rating)):
    if Bond_Rating['MD'] == '' and Bond_Rating['SP'] == '':
        Bond_Rating['Internal Estimate'] = ''
    elif Bond_Rating['MD'] == '' and Bond_Rating['SP'] != '':
        Bond_Rating['Internal Estimate'] = Bond_Rating.lookup[concat('BicId', 'RegionName', 'SP'), Internal_Estimate_Table['InternalEstimate']]
    elif Bond_Rating['MD'] != '' and Bond_Rating['SP'] == '':
        Bond_Rating['Internal Estimate'] = Bond_Rating.lookup[concat('BicId', 'RegionName', 'MD'), Internal_Estimate_Table['InternalEstimate']]
    elif Bond_Rating['MD'] != '' and Bond_Rating['SP'] != '':
        Bond_Rating['Internal Estimate'] = Bond_Rating.lookup[concat('BicId', 'RegionName', 'MD', 'SP'), Internal_Estimate_Table['InternalEstimate']]
However, I am unsure why my code doesn’t work. I keep getting errors.
I would appreciate your assistance.
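For what it's worth, here is a minimal sketch of one loop-free way to do this (my own assumption about the setup, not the code above): left-merge on the most specific key first, then fall back to partial keys for rows where one rating is missing. It assumes the frames are named bonds and estimates, both use the column names Industry, Region, SP and MD, and both carry a default RangeIndex:

import numpy as np
import pandas as pd

def add_internal_estimate(bonds, estimates):
    out = bonds.copy()
    out['Internal Estimate'] = np.nan
    # Try the most specific key first, then fall back to partial keys
    # for rows where one of the ratings is missing.
    for keys in (['Industry', 'Region', 'SP', 'MD'],
                 ['Industry', 'Region', 'SP'],
                 ['Industry', 'Region', 'MD']):
        est = estimates.drop_duplicates(subset=keys)[keys + ['Internal Estimate']]
        merged = out.merge(est, on=keys, how='left', suffixes=('', '_est'))
        # A left merge preserves row order, so with a default RangeIndex
        # fillna() lines up row for row and only fills still-empty rows.
        out['Internal Estimate'] = out['Internal Estimate'].fillna(
            merged['Internal Estimate_est'])
    return out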

Related

How could I classify data groups based on different parameters?

Today, I faced an issue while attempting to classify a large amount of data (600K rows) based on several different behaviors.
For example:
I've listed 6 possible scenarios for the classification. It should consider the status and the niche market and, according to the matching scenario, fill a market indication with the corresponding information.
In the first scenario, when the status is 'Stand By' and the niche market has changed relative to the previous row, the market indication is a market transition.
In the second scenario, during a 'Stand By' status, if we have both a status change and a niche market change, all those movements indicate a market transition.
The 3rd scenario is similar to the first, but the status changes.
In the 4th scenario, the niche market does not change but the status changes, so here we consider it a strategic operation movement.
The 5th scenario is similar to the 2nd, but the niche market does not change.
The 6th scenario is similar to the 5th, but the status is 'Invested'.
My main objective is to classify this large amount of data, identify each scenario's particularities, and create a column identifying which scenario it is.
I've tried to map some conditions like this:
for i in range(len(dfcopy)):
    if dfcopy.at[i, 'Status'] == 'Stand By':
        if dfcopy.at[i-1, 'Niche Market'] != dfcopy.at[i, 'Niche Market']:
            while dfcopy.iat[i, 9] == 'Stand by':
                dfcopy['Market Type'] = 'Market Transition'
                i = i + 1
        elif dfcopy.at[i, 'Niche Market'] != dfcopy.at[i+1, 'Niche Market']:
            while dfcopy.at[i, 'Status'] == 'Stand By':
                dfcopy['Market Type'] = 'Market Transition'
                i = i + 1
        else:
            dfcopy['Market Type'] = 'Strategic Operation'
But it doesn't work.
Does anyone have an idea of how I could map those behaviors?
Thanks a lot!!
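A sketch of one possible direction (my own assumption about the scenarios, not the asker's code): compare each row with its neighbour via shift() and classify all rows at once with np.select(), instead of advancing an index inside nested loops. The conditions below cover only the broad cases and would need to be refined to match all six scenarios:

import numpy as np

# Row-over-row changes, assuming dfcopy is sorted chronologically.
status_changed = dfcopy['Status'] != dfcopy['Status'].shift()
niche_changed = dfcopy['Niche Market'] != dfcopy['Niche Market'].shift()
standby = dfcopy['Status'] == 'Stand By'

conditions = [
    standby & niche_changed,                    # scenarios 1-3: niche market moved
    standby & status_changed & ~niche_changed,  # scenarios 4-5: only the status moved
]
choices = ['Market Transition', 'Strategic Operation']
dfcopy['Market Type'] = np.select(conditions, choices, default='Unclassified')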

How to add a new column to CSV according to certain equality of other columns?

Here is my question. I am working with CSV data sets in Python using pandas. I am comparing crime rates in NYC neighborhoods with Airbnb rents in those neighborhoods, using two different data sets. What I want to do is check whether the neighborhood names are the same and, if so, add the crime rate column next to the price column of the Airbnb df. However, the indexes are not the same: there are 500 rows for Upper East Side houses while there is only one crime number for the Upper East Side. So how can I combine this information? Help would be much appreciated, as I have a report due tonight. Thanks.
So far I have done:
I have only loaded both CSV files as dataframes. I then thought about creating a dictionary from the crime-rate data, mapping neighbourhoods to rates; when an Airbnb location matches a dictionary key, I want to append the corresponding crime-rate value to an empty list. After doing this, I believe the list will be ordered to match the Airbnb locations, so I can add it as a new column to the Airbnb CSV. Sorry, my code is not in a proper state, so I can't post it here. I am also stuck on adding the proper dictionary value to the empty list by finding the same locations in the two CSVs.
datasets:
http://app.coredata.nyc/?mlb=false&ntii=crime_all_rt&ntr=Community%20District&mz=14&vtl=https%3A%2F%2Fthefurmancenter.carto.com%2Fu%2Fnyufc%2Fapi%2Fv2%2Fviz%2F98d1f16e-95fd-4e52-a2b1-b7abaf634828%2Fviz.json&mln=true&mlp=true&mlat=40.718&ptsb=&nty=2018&mb=roadmap&pf=%7B%22subsidies%22%3Atrue%7D&md=table&mlv=false&mlng=-73.996&btl=Borough&atp=neighborhoods
https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data
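The dictionary idea maps directly onto pandas' Series.map(). A minimal sketch, assuming hypothetical file and column names ('neighbourhood' in the Airbnb file; 'neighborhood' and 'crime_rate' in the crime file), since the real headers depend on the downloads above:

import pandas as pd

airbnb = pd.read_csv('airbnb.csv')       # hypothetical file names
crime = pd.read_csv('crime_rates.csv')

# Build a neighborhood -> crime-rate dictionary from the crime data...
rate_by_hood = dict(zip(crime['neighborhood'], crime['crime_rate']))

# ...then map it onto every Airbnb row. map() looks each value up by
# key, so the two frames do not need matching lengths or indexes.
airbnb['crime_rate'] = airbnb['neighbourhood'].map(rate_by_hood)
airbnb.to_csv('airbnb_with_crime.csv', index=False)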

I want to create new dataframe columns looping over rows of a specific column

I am trying to implement the Gale-Shapley algorithm in Python, to deliver stable matches between doctors and hospitals. To do so, I gave every doctor and every hospital a random preference represented by a number.
(Screenshot: dataframe of preferences.)
Afterwards I created a function that rates every hospital for one specific doctor (represented by an ID), followed by a ranking of this rating, creating two new columns. To rate a match, I took the absolute value of the difference between the preferences, where a lower absolute value is a better match. This is the formula for the first doctor:
doctors_sorted_by_preference['Rating of Hospital by Doctor 1']=abs(doctors_sorted_by_preference['Preference Doctor'].iloc[0]-doctors_sorted_by_preference['Preference Hospital'])
doctors_sorted_by_preference['Rank of Hospital by Doctor 1']=doctors_sorted_by_preference["Rating of Hospital by Doctor 1"].rank()
which leads to the following table:
(Screenshot: dataframe of preferences plus doctor 1's rating and ranking.)
Hence, doctor 1 prefers the first hospital over all other hospitals as represented by the ranking.
Now I want to repeat this function for every different doctor by creating a loop (creating two new columns for every doctor and adding them to my dataframe), but I don't know how to do this. I could type out the same function for all the 10 different doctors, but if I increase the dataset to include 1000 doctors and hospitals this would become impossible, there must be a better way...
This would be the same function for doctor 2:
doctors_sorted_by_preference['Rating of Hospital by Doctor 2']=abs(doctors_sorted_by_preference['Preference Doctor'].iloc[1]-doctors_sorted_by_preference['Preference Hospital'])
doctors_sorted_by_preference['Rank of Hospital by Doctor 2']=doctors_sorted_by_preference["Rating of Hospital by Doctor 2"].rank()
Thank you in advance!
You can also append the values to a list and then write it to the dataframe. Appending to lists is faster if you have a large dataset.
I named my dataframe df for ease of viewing:
for i in range(len(df['Preference Doctor'])):
    list1 = []
    for j in df['Preference Hospital']:
        list1.append(abs(df['Preference Doctor'].iloc[i] - j))
    df['Rating of Hospital by Doctor_' + str(i+1)] = pd.DataFrame(list1)
    df['Rank of Hospital by Doctor_' + str(i+1)] = df['Rating of Hospital by Doctor_' + str(i+1)].rank()
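For larger inputs, here is a sketch of a broadcasting variant (my own addition, not part of the answer above): NumPy can compute all doctor-hospital rating pairs in one step, leaving only the column assignment in a loop.

import numpy as np

# All pairwise |doctor_pref - hospital_pref| values at once:
# rows index hospitals, columns index doctors.
doc = df['Preference Doctor'].to_numpy()
hosp = df['Preference Hospital'].to_numpy()
ratings = np.abs(hosp[:, np.newaxis] - doc[np.newaxis, :])

for i in range(len(doc)):
    df['Rating of Hospital by Doctor_' + str(i+1)] = ratings[:, i]
    df['Rank of Hospital by Doctor_' + str(i+1)] = df['Rating of Hospital by Doctor_' + str(i+1)].rank()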

how to use apply function/for loop on dataframe column content in python

For context, I'm looking at a dataset of data scientist job titles and job descriptions, and I'm trying to identify how often each degree level is cited in those job descriptions.
I was able to get the code to work on one particular job description, but now I need a for loop (or equivalent) to iterate through the description column and cumulatively count the number of times each level of education is cited.
sentence = set(data_scientist_filtered.description.iloc[30].split())
degree_level = {'level_1': {'bachelors', 'bachelor', 'ba'},
                'level_2': {'masters', 'ms', 'm.s', "master's", 'master of science'},
                'level_3': {'phd', 'p.h.d'}}
results = {}
for key, words in degree_level.items():
    results[key] = len(words.intersection(sentence))
results
Sample string would be something like this:
data_scientist_filtered.description.iloc[30]=
'the team: the data science team is a newly formed applied research team within s&p global ratings that will be responsible for building and executing a bold vision around using machine learning, natural language processing, data science, knowledge engineering, and human computer interfaces for augmenting various business processes.\n\nthe impact: this role will have a significant impact on the success of our data science projects ranging from choosing which projects should be undertaken, to delivering highest quality solution, ultimately enabling our business processes and products with ai and data science solutions.\n\nwhat’s in it for you: this is a high visibility team with an opportunity to make a very meaningful impact on the future direction of the company. you will work with senior leaders in the organization to help define, build, and transform our business. you will work closely with other senior scientists to create state of the art augmented intelligence, data science and machine learning solutions.\n\nresponsibilities: as a data scientist you will be responsible for building ai and data science models. you will need to rapidly prototype various algorithmic implementations and test their efficacy using appropriate experimental design and hypothesis validation.\n\nbasic qualifications: bs in computer science, computational linguistics, artificial intelligence, statistics, or related field with 5+ years of relevant industry experience.\n\npreferred qualifications:\nms in computer science, statistics, computational linguistics, artificial intelligence or related field with 3+ years of relevant industry experience.\nexperience with financial data sets, or s&p’s credit ratings process is highly preferred.
Sample dataframe:
position        company         description             location
data scientist  Xpert Staffing  this job is for..       Atlanta, GA
data scientist  Cotiviti        great opportunity of..  Atlanta, GA
I'd suggest using the isin() method here, then getting the sum.
import pandas as pd

data = [['John', 'ba'], ['Harry', 'ms'], ['Bill', 'phd'], ['Mary', 'bachelors']]
df = pd.DataFrame(data, columns=['name', 'description'])
degree_level = {
    'level_1': {'bachelors', 'bachelor', 'ba'},
    'level_2': {'masters', 'ms', 'm.s', "master's", 'master of science'},
    'level_3': {'phd', 'p.h.d'}
}
results = {}
for level, values in degree_level.items():
    # isin() compares whole cell values, so each description must equal
    # one of the keywords exactly for it to count
    results[level] = df['description'].isin(values).sum()
print(results)
# {'level_1': 2, 'level_2': 1, 'level_3': 1}
Edit
The for loop can be replaced by a comprehension, just FYI.
def num_of_degrees(values):
    return df['description'].isin(values).sum()

results = {level: num_of_degrees(values) for level, values in degree_level.items()}
Edit 2
Now that you've shown what the df looks like, I see what the issue is.
You need to filter the rows on substring matches (str.contains) and then count them.
# just cleaning some unnecessary values from degree_level
degree_level = {
    'level_1': {'bachelor', ' ba '},
    'level_2': {'masters', ' ms ', ' m.s ', "master's"},
    'level_3': {'phd', 'p.h.d'}}
results = {}
for level, values in degree_level.items():
    # one regex alternation per level; count the rows whose description
    # contains any of the level's keywords
    pattern = '|'.join(values)
    results[level] = df['description'].str.contains(pattern, case=False, regex=True).sum()
Something like that should work.
A simple way to do this breakup of text is to use an n-gram comparison of the text, column by column.
Create a list of possible position, company, and location values to be found.
Then compare the list column by column and save the results in a data frame, which can be combined at the end.
text1 = "Growing company located in the Atlanta, GA area is currently looking to add a Data Scientist to their team. The Data Scientist will analyze business level data to produce actionable insights utilizing analytics tools"
text2 = "Data scientist data analyst"
bigrams1 = ngrams(text1.lower().split(), n) # For description
bigrams2 = ngrams(text2.lower().split(), n) # For position dictionary
def compare(bigrams1, bigrams2):
common=[]
for grams in bigrams2:
if grams in bigrams1:
common.append(grams)
return common
compare(bigrams1, bigrams2)
Output:
Out[140]: [('data', 'scientist')]

filtering data from Pandas dataframes

Background: I am trying to use data from a CSV file to ask questions and draw conclusions based on the data. The data is a log of patient visits from a clinic in Brazil, including additional patient data and whether the patient was a no-show. I have chosen to examine correlations between patient age and the no-show data.
Problem: Given the visit number, patient ID, age, and no-show data, how do I compile an array of ages correlated with each unique patient ID (so that I can evaluate the mean age of the unique patients visiting the clinic)?
My code:
import pandas as pd

# data set of no-shows at a clinic in Brazil
noshow_data = pd.read_csv('noshowappointments-kagglev2-may-2016.csv')
noshow_df = pd.DataFrame(noshow_data)  # note: read_csv already returns a DataFrame, so this wrapper is redundant
Here is the beginning of the code; the head of the whole dataframe from the CSV is shown.
# Next I construct a dataframe with only the data I'm interested in:
ptid = noshow_df['PatientId']
ages = noshow_df['Age']
noshow = noshow_df['No-show']
ptid_ages_noshow = pd.DataFrame({'PatientId': ptid, 'Ages': ages,
                                 'No_show': noshow})
ptid_ages_noshow
Here I have sorted the data to show the multiple visits of a unique patient
# Now, I know how to determine the total number of unique patients:
# total number of unique patients
num_unique_pts = noshow_df.PatientId.unique()
len(num_unique_pts)
If I want to find the mean age of all the patients during the course of all visits I would use:
# mean age of all visits
ages = noshow_data['Age']
ages.mean()
So my question is this, how could I find the mean age of all the unique patients?
You can simply use the groupby function available in pandas, restricted to the relevant columns:
ptid_ages_noshow[['PatientId','Ages']].groupby('PatientId').mean()
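That gives one mean age per patient; to collapse it to a single overall number across unique patients (my reading of the question, not stated above), you could then average those per-patient means:

# one mean per patient, then the mean of those means
ptid_ages_noshow.groupby('PatientId')['Ages'].mean().mean()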
So you only want to keep one appointment per patient for the calculation? This is how to do it:
noshow_df.drop_duplicates('PatientId')['Age'].mean()
Keep in mind that the age of people changes over time. You need to decide how you want to handle this.
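For example, here is a sketch of one such choice (my own assumption, including the AppointmentDay column name): keep only each patient's most recent visit before averaging.

# keep each patient's latest visit, then average the ages
latest = noshow_df.sort_values('AppointmentDay').drop_duplicates('PatientId', keep='last')
latest['Age'].mean()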
