Given a dataframe with the following structure and values:

json_path                                                                   Reporting Group   Entity/Grouping
data.attributes.total.children.[0]                                          Christian Family  Abraham Family
data.attributes.total.children.[0].children.[0]                             Christian Family  In Estate
data.attributes.total.children.[0].children.[0].children.[0].children.[0]   Christian Family  Cash
data.attributes.total.children.[0].children.[0].children.[1].children.[0]   Christian Family  Investment Grade Fixed Income
How would I filter on the json_path rows which contain "children" four times? I.e., I want to filter down to index positions 2-3:
json_path                                                                   Reporting Group   Entity/Grouping
data.attributes.total.children.[0].children.[0].children.[0].children.[0]   Christian Family  Cash
data.attributes.total.children.[0].children.[0].children.[1].children.[0]   Christian Family  Investment Grade Fixed Income
I know how to obtain a partial match; however, the integers in the square brackets will be inconsistent, so my instinct tells me to somehow count the instances of "children" (i.e., "children" appearing 4x) and use that as the basis for the filter.
Any suggestions or resources on how I can achieve this?
As you said, a naive approach would be to count the occurrences of .children and compare the count with 4 to create a boolean mask, which can then be used to filter the rows:
df[df['json_path'].str.count(r'\.children').eq(4)]
A more robust approach is to check for four consecutive occurrences of a .children.[n] segment (a non-capturing group avoids pandas' "match groups" UserWarning):

df[df['json_path'].str.contains(r'(?:\.children\.\[\d+\]){4}')]
json_path Reporting Group Entity/Grouping
2 data.attributes.total.children.[0].children.[0].children.[0].children.[0] Christian Family Cash
3 data.attributes.total.children.[0].children.[0].children.[1].children.[0] Christian Family Investment Grade Fixed Income
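For reference, a minimal self-contained sketch of both filters, using a reduced stand-in for the frame above (on this data both masks select the same rows):

import pandas as pd

df = pd.DataFrame({
    'json_path': [
        'data.attributes.total.children.[0]',
        'data.attributes.total.children.[0].children.[0]',
        'data.attributes.total.children.[0].children.[0].children.[0].children.[0]',
        'data.attributes.total.children.[0].children.[0].children.[1].children.[0]',
    ],
    'Reporting Group': ['Christian Family'] * 4,
    'Entity/Grouping': ['Abraham Family', 'In Estate', 'Cash',
                        'Investment Grade Fixed Income'],
})

# Naive: count occurrences of '.children'
mask_count = df['json_path'].str.count(r'\.children').eq(4)

# Robust: require four consecutive '.children.[n]' segments
mask_consecutive = df['json_path'].str.contains(r'(?:\.children\.\[\d+\]){4}')

print(df[mask_consecutive])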
I have two dataframes. Dataframe A contains course information, including the ISBN number for required textbooks:
Course Abbreviation  Course Number  Section Number  Course Name                 Course Instructor  Course Seats  ISBN No
ACCT                 205            101             Intro Financial Accounting                     30            9780357617977
ACCT                 205            102             Intro Financial Accounting  Grant              30            9780357617977
ACCT                 205            901             Intro Financial Accounting  Grant              35            9780357617977
Dataframe B contains book purchasing info and also includes the ISBN number:
Title                                                                               ISBN         Binding  Edition  US_List
7 HABITS OF HIGHLY EFFECTIVE TEENS: THE ULTIMATE TEENAGE SUCCESS GUIDE.             9.78148E+12  Paper             17.99 USD
7 HABITS OF HIGHLY EFFECTIVE TEENS: THE ULTIMATE TEENAGE SUCCESS GUIDE.             9.78148E+12  eBook
ADOBE AUDITION CC: CLASSROOM IN A BOOK: THE OFFICIAL TRAINING WORKBOOK FROM ADOBE.  9.78014E+12  Paper    2ND ED.  59.99 USD
I am able to merge the two dataframes so that the course info is available along with the book purchasing info. However, Dataframe B contains many different listings for the same book. I would like to bring the course info over to matching titles where the ISBN isn't the same. So in the example below, even though the ISBNs are different, the course info would appear for both versions of the title:
Course Abbreviation  Course Number  Section Number  Course Name            Course Instructor  Course Seats  ISBN No        Title
CTEC                 107            825.0           Skills for IT Success  Lott               20.0          9781476764665  7 HABITS OF HIGHLY EFFECTIVE TEENS: THE ULTIMATE TEENAGE SUCCESS GUIDE.
NaN                  NaN            NaN             NaN                    NaN                NaN           NaN            7 HABITS OF HIGHLY EFFECTIVE TEENS: THE ULTIMATE TEENAGE SUCCESS GUIDE.
What would be the best way to do this? The rows that need course info filled in are not always in the same location in relation to the rows that do have course info, so I don't think ffill or bfill will work.
Sorting by ISBN No will push the nulls to the bottom; then you can group by Title and ffill the data:
df.sort_values(by='ISBN No').groupby('Title').ffill()
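A minimal sketch of that idea on a hypothetical two-row stand-in for the merged frame (note that groupby(...).ffill() drops the grouping column, so Title is reattached afterwards):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Course Name': ['Skills for IT Success', np.nan],
    'ISBN No': ['9781476764665', np.nan],
    'Title': ['7 HABITS OF HIGHLY EFFECTIVE TEENS: '
              'THE ULTIMATE TEENAGE SUCCESS GUIDE.'] * 2,
})

# Rows that have an ISBN sort first, so ffill inside each title group
# copies the course info onto the ISBN-less listings
filled = df.sort_values(by='ISBN No').groupby('Title').ffill()
filled['Title'] = df['Title']  # reattach the grouping column
print(filled)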
I have a dataset with the distances between 4 cities. Each city has a sales store from the same company, and in another dataset I have the number of sales from the last month for each store.
What I want to know is the best possible route between the cities to maximize profit (each product is sold for 5), knowing that I only produce in the first city and then have a truck with a maximum truckload of 5000 to supply the other 3 cities.
I can't find anything similar to my problem; the closest I could find were search algorithms. Can someone tell me what approach to take?
Sorry if my question is a bit confusing.
I have a sample of companies with financial figures which I would like to compare. My data looks like this:
Cusip9 Issuer IPO Year Total Assets Long-Term Debt Sales SIC-Code
1 783755101 Ryerson Tull Inc 1996 9322000.0 2632000.0 633000.0 3661
2 826170102 Siebel Sys Inc 1996 995010.0 0.0 50250.0 2456
3 894363100 Travis Boats & Motors Inc 1996 313500.0 43340.0 23830.0 3661
4 159186105 Channell Commercial Corp 1996 426580.0 3380.0 111100.0 7483
5 742580103 Printware Inc 1996 145750.0 0.0 23830.0 8473
For every company I want to calculate a "similarity score" that indicates its comparability with other companies across different financial figures. Comparability should be expressed as the Euclidean distance, the square root of the sum of the squared differences between the financial figures, to the "closest company". So I need to calculate the distance to every company that fits these conditions, but only need the closest score: Assets of Company 1 minus Assets of Company 2, plus Debt of Company 1 minus Debt of Company 2...
√((x_1 - y_1)^2 + (x_2 - y_2)^2)
This should only be computed for companies with the same SIC-Code, and the IPO Year of the comparable companies should be smaller than that of the company for which the similarity score is computed; I only want to compare these companies with already-listed companies.
Hopefully my point is clear. Does someone have any idea where I can start? I am just starting with programming and completely lost with this.
Thanks in advance.
I would first create separate dataframes according to the SIC code, so that every new dataframe only contains companies with the same SIC code. Then, for each of those dataframes, double loop over the companies, compute the scores, and store them in a matrix. (You'll end up with a symmetric matrix of scores; a sketch follows.)
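A minimal sketch of that idea, using two companies from the question's sample as a stand-in (the double loop is O(n²) per SIC group, which is fine for a small sample):

import numpy as np
import pandas as pd

# Reduced stand-in for the question's frame
df = pd.DataFrame({
    'Issuer': ['Ryerson Tull Inc', 'Travis Boats & Motors Inc'],
    'Total Assets': [9322000.0, 313500.0],
    'Long-Term Debt': [2632000.0, 43340.0],
    'SIC-Code': [3661, 3661],
})

# One score matrix per SIC code; scores[i, j] is the Euclidean distance
# between company i and company j within that group
for sic, group in df.groupby('SIC-Code'):
    n = len(group)
    scores = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            a, b = group.iloc[i], group.iloc[j]
            scores[i, j] = np.sqrt((a['Total Assets'] - b['Total Assets'])**2
                                   + (a['Long-Term Debt'] - b['Long-Term Debt'])**2)
    print(sic, scores)  # symmetric, with a zero diagonal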
Try this. Here I compare against companies whose IPO Year is equal to or smaller (since you didn't give any company record with a strictly smaller IPO year); you can change it to strictly smaller (<) in the Group = df[...] statement:
def closestCompany(companyRecord):
    # Candidates: same SIC code, IPO year not later, and not the company itself
    Group = df[(df['SIC-Code'] == companyRecord['SIC-Code'])
               & (df['IPO Year'] <= companyRecord['IPO Year'])
               & (df['Issuer'] != companyRecord['Issuer'])]
    # Euclidean distance on the two financial figures; keep only the closest
    return (((Group['Total Assets'] - companyRecord['Total Assets'])**2
             + (Group['Long-Term Debt'] - companyRecord['Long-Term Debt'])**2)**0.5).min()

df['Closest Company Similarity Score'] = df.apply(closestCompany, axis=1)
df
I am trying to create a Gale-Shapley algorithm in Python that delivers stable matches of doctors and hospitals. To do so, I gave every doctor and every hospital a random preference, represented by a number.
[Dataframe consisting of preferences]
Afterwards I created a function that rates every hospital for one specific doctor (represented by ID), followed by a ranking of this rating, creating two new columns. To rate a match, I took the absolute value of the difference between the preferences, where a lower absolute value is a better match. This is the formula for the first doctor:
doctors_sorted_by_preference['Rating of Hospital by Doctor 1'] = abs(
    doctors_sorted_by_preference['Preference Doctor'].iloc[0]
    - doctors_sorted_by_preference['Preference Hospital'])
doctors_sorted_by_preference['Rank of Hospital by Doctor 1'] = (
    doctors_sorted_by_preference['Rating of Hospital by Doctor 1'].rank())
which leads to the following table:
[Dataframe consisting of preferences and rating + ranking of doctor]
Hence, doctor 1 prefers the first hospital over all other hospitals as represented by the ranking.
Now I want to repeat this for every doctor by creating a loop (adding two new columns to my dataframe per doctor), but I don't know how to do this. I could type out the same function for all 10 doctors, but if I increase the dataset to 1000 doctors and hospitals this would become impossible; there must be a better way...
This would be the same function for doctor 2:
doctors_sorted_by_preference['Rating of Hospital by Doctor 2'] = abs(
    doctors_sorted_by_preference['Preference Doctor'].iloc[1]
    - doctors_sorted_by_preference['Preference Hospital'])
doctors_sorted_by_preference['Rank of Hospital by Doctor 2'] = (
    doctors_sorted_by_preference['Rating of Hospital by Doctor 2'].rank())
Thank you in advance!
You can also append the values to a list and then write that to the dataframe; appending to lists is faster if you have a large dataset. I named my dataframe df for ease of viewing:
for i in range(len(df['Preference Doctor'])):
    list1 = []
    for j in df['Preference Hospital']:
        list1.append(abs(df['Preference Doctor'].iloc[i] - j))
    # Assign the list directly as a new column, then rank it
    df['Rating of Hospital by Doctor_' + str(i+1)] = list1
    df['Rank of Hospital by Doctor_' + str(i+1)] = \
        df['Rating of Hospital by Doctor_' + str(i+1)].rank()
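If the frame grows to thousands of doctors, here is a hedged sketch of a vectorized alternative: NumPy broadcasting computes every |doctor − hospital| difference in one shot (column names assumed to match the question; the preference values below are made up):

import numpy as np
import pandas as pd

# Hypothetical preferences; the question's frame has one row per hospital
df = pd.DataFrame({'Preference Doctor': [3, 7, 1],
                   'Preference Hospital': [4, 2, 9]})

# ratings[i, j] = |preference of doctor i - preference of hospital j|
ratings = np.abs(df['Preference Doctor'].to_numpy()[:, None]
                 - df['Preference Hospital'].to_numpy()[None, :])

for i, row in enumerate(ratings, start=1):
    df['Rating of Hospital by Doctor_' + str(i)] = row
    df['Rank of Hospital by Doctor_' + str(i)] = pd.Series(row).rank().to_numpy()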
I've got a column of lists called "author_background" which I would like to analyze. The actual column consists of 8,000 rows. My aim is to get an overview of how many different elements there are in total (across all lists in the column) and to count how many lists each element occurs in.
What my column looks like:
df.author_background
0 [Professor for Business Administration, Harvard Business School]
1 [Professor for Industrial Engineering, University of Oakland]
2 [Harvard Business School]
3 [CEO, SpaceX]
Desired output:
0 Harvard Business School 2
1 Professor for Business Administration 1
2 Professor for Industrial Engineering 1
3 CEO 1
4 University of Oakland 1
5 SpaceX 1
I would like to know how often "Professor for Business Administration", "Professor for Industrial Engineering", "Harvard Business School", etc. occur in the column. There are many more titles I don't know about.
Basically, I would like to use pd.value_counts on the column. However, that's not possible because each value is a list.
Is there another way to count the occurrences of each element?
If that's more helpful: I also have a flat (non-nested) list which contains all the elements of the lists.
Turn it all into a single series by list flattening:
pd.Series([bg for bgs in df.author_background for bg in bgs])
Now you can call value_counts() to get your result.
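A minimal end-to-end sketch, using a small stand-in for the column:

import pandas as pd

df = pd.DataFrame({'author_background': [
    ['Professor for Business Administration', 'Harvard Business School'],
    ['Professor for Industrial Engineering', 'University of Oakland'],
    ['Harvard Business School'],
    ['CEO', 'SpaceX'],
]})

# Flatten the lists into one long Series, then count each element
flat = pd.Series([bg for bgs in df.author_background for bg in bgs])
print(flat.value_counts())
# 'Harvard Business School' appears twice; every other element once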
You can try this:
el = pd.Series([item for sublist in df.author_background for item in sublist])
df = el.groupby(el).size().rename_axis('author_background').reset_index(name='counter')