Python Pandas replace NaN with data from another row

I have two dataframes. Dataframe A contains course information, including the ISBN number for required textbooks:
Course Abbreviation  Course Number  Section Number  Course Name                 Course Instructor  Course Seats  ISBN No
ACCT                 205            101             Intro Financial Accounting                     30            9780357617977
ACCT                 205            102             Intro Financial Accounting  Grant              30            9780357617977
ACCT                 205            901             Intro Financial Accounting  Grant              35            9780357617977
Dataframe B contains book purchasing info and also includes the ISBN number:
Title                                                                               ISBN         Binding  Edition  US_List
7 HABITS OF HIGHLY EFFECTIVE TEENS: THE ULTIMATE TEENAGE SUCCESS GUIDE.             9.78148E+12  Paper             17.99 USD
7 HABITS OF HIGHLY EFFECTIVE TEENS: THE ULTIMATE TEENAGE SUCCESS GUIDE.             9.78148E+12  eBook
ADOBE AUDITION CC: CLASSROOM IN A BOOK: THE OFFICIAL TRAINING WORKBOOK FROM ADOBE.  9.78014E+12  Paper    2ND ED.  59.99 USD
I am able to merge the two dataframes so that the course info is available along with the book purchasing info. However, Dataframe B contains many different listings for the same book. I would like to bring the course info over to matching titles where the ISBN isn't the same. So in the example below, even though the ISBNs are different, the course info would appear for both versions of the title:
Course Abbreviation  Course Number  Section Number  Course Name            Course Instructor  Course Seats  ISBN No        Title
CTEC                 107            825.0           Skills for IT Success  Lott               20.0          9781476764665  7 HABITS OF HIGHLY EFFECTIVE TEENS: THE ULTIMATE TEENAGE SUCCESS GUIDE.
NaN                  NaN            NaN             NaN                    NaN                NaN           NaN            7 HABITS OF HIGHLY EFFECTIVE TEENS: THE ULTIMATE TEENAGE SUCCESS GUIDE.
What would be the best way to do this? The rows that need course info filled in are not always in the same location in relation to the rows that do have course info, so I don't think ffill or bfill will work.

Sorting by ISBN No will push the nulls to the bottom; then you can group by Title and ffill the data.
df.sort_values(by='ISBN No').groupby('Title').ffill()
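A minimal sketch of that approach, using the column names from the question (the title is shortened here for readability). Note that GroupBy.ffill returns only the non-grouping columns, so the filled values are written back into the original frame by index:

import pandas as pd

df = pd.DataFrame({
    'Title': ['7 HABITS OF HIGHLY EFFECTIVE TEENS', '7 HABITS OF HIGHLY EFFECTIVE TEENS'],
    'ISBN No': ['9781476764665', None],
    'Course Name': ['Skills for IT Success', None],
})

# Rows with a populated ISBN sort first (nulls go last by default), then
# ffill copies the course info down to the NaN rows within each Title group.
filled = df.sort_values(by='ISBN No').groupby('Title').ffill()
df[filled.columns] = filled  # aligns back to the original index
print(df)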

Related

Complex partial string matching in pandas

Given a dataframe with the following structure and values, where json_path is a column:
json_path                                                                   Reporting Group   Entity/Grouping
data.attributes.total.children.[0]                                          Christian Family  Abraham Family
data.attributes.total.children.[0].children.[0]                             Christian Family  In Estate
data.attributes.total.children.[0].children.[0].children.[0].children.[0]   Christian Family  Cash
data.attributes.total.children.[0].children.[0].children.[1].children.[0]   Christian Family  Investment Grade Fixed Income
How would I filter on the json_path rows which contain children four times? i.e., I want to keep the rows at index positions 2-3:
json_path                                                                   Reporting Group   Entity/Grouping
data.attributes.total.children.[0].children.[0].children.[0].children.[0]   Christian Family  Cash
data.attributes.total.children.[0].children.[0].children.[1].children.[0]   Christian Family  Investment Grade Fixed Income
I know how to obtain a partial match; however, the integers in the square brackets will be inconsistent, so my instinct is to count the instances of children (i.e., children appearing 4x) and use that as a basis to filter.
Any suggestions or resources on how I can achieve this?
As you said, a naive approach is to count the occurrences of .children and compare the count with 4 to create a boolean mask, which can then be used to filter the rows:
df[df['json_path'].str.count(r'\.children').eq(4)]
A more robust approach is to check for four consecutive occurrences of .children.[n]:
df[df['json_path'].str.contains(r'(?:\.children\.\[\d+\]){4}')]
json_path Reporting Group Entity/Grouping
2 data.attributes.total.children.[0].children.[0].children.[0].children.[0] Christian Family Cash
3 data.attributes.total.children.[0].children.[0].children.[1].children.[0] Christian Family Investment Grade Fixed Income
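For completeness, a small runnable sketch of both filters, with the dataframe rebuilt from the sample rows above:

import pandas as pd

df = pd.DataFrame({
    'json_path': [
        'data.attributes.total.children.[0]',
        'data.attributes.total.children.[0].children.[0]',
        'data.attributes.total.children.[0].children.[0].children.[0].children.[0]',
        'data.attributes.total.children.[0].children.[0].children.[1].children.[0]',
    ],
    'Reporting Group': ['Christian Family'] * 4,
    'Entity/Grouping': ['Abraham Family', 'In Estate', 'Cash', 'Investment Grade Fixed Income'],
})

# Count-based filter: keep rows whose path mentions ".children" exactly four times.
print(df[df['json_path'].str.count(r'\.children').eq(4)])

# Pattern-based filter: require four consecutive ".children.[n]" segments.
print(df[df['json_path'].str.contains(r'(?:\.children\.\[\d+\]){4}')])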

Python: Return Average on unique values based on multiple columns

I have a data set that looks like this, and there are instances in which there are multiple duplicates (ex. Gone Girl is repeated twice). I am not proficient at Python, so I don't have any code at the moment and have tried looking all over Stack Overflow.
But my objective is:
Remove all the duplicates and count how many times the author has written a book (ex. JK Rowling has written 2 different books and the others have all written 1)
Based on the unique books, what is the average rating of the author (ex. JK Rowling would be (4.9 + 4.7) / 2 = 4.8)
Appreciate any help
Book Name                                   Author         User Rating
The Casual Vacancy                          JK Rowling     4.9
Cabin Fever (Diary of a Wimpy Kid, Book 6)  Jeff Kinney    3.9
Harry Potter                                JK Rowling     4.7
Cabin Fever (Diary of a Wimpy Kid, Book 6)  Jeff Kinney    4.4
Gone Girl                                   Gillian Flynn  4.0
Gone Girl                                   Gillian Flynn  4.0
The Girl on the Train                       Paula Hawkins  4.1
I'm not sure how you'd like the output format, but here is a way to drop duplicate books by the same author, and return the average score of the author after duplicate removal using the pandas library:
import pandas as pd
df = pd.read_csv('mydata.txt', sep='\t') # use this if it is a tab delimited text file
df = pd.read_csv('mydata.csv') # use this if it is a comma separated value file
subset = df.drop_duplicates(subset=['Book Name', 'Author']).groupby('Author').agg({"User Rating": "mean"})
print(subset)
outputs:
User Rating
Author
Gillian Flynn 4.0
JK Rowling 4.8
Jeff Kinney 3.9
Paula Hawkins 4.1
Explanations:
First, I create a pandas dataframe. If the data is in tab-delimited text format, use the first line, df = pd.read_csv('mydata.txt', sep='\t'), to read it in; if it is a comma-separated value file, use the second line, df = pd.read_csv('mydata.csv').
Then, df.drop_duplicates drops duplicate entries in the dataframe. If you pass a subset, it looks for duplicates only in those columns; in this case I passed the two columns ['Book Name', 'Author']. When you pass multiple columns, all of them have to be identical for a row to be counted as a duplicate.
Finally, I group by the 'Author' column and use agg to take the mean of the 'User Rating' column for each author.
I would use df.groupby(...).mean() twice, to stay consistent when an author has multiple books, some of which have multiple ratings. But the specifications may differ:
Calculate the mean rating for each ('Author', 'Book Name') pair.
If the user ratings for a pair are all identical, this wastes a few resources but does no harm.
Calculate the mean rating by author.
The code is:
df.groupby(['Author', 'Book Name']).mean().groupby(['Author']).mean()
with output:
User Rating
Author
Gillian Flynn 4.00
JK Rowling 4.80
Jeff Kinney 4.15
Paula Hawkins 4.10

Panel Data Research & Development Capitalisation

I am working with a panel data containing many companies' research and development expenses throughout the years.
What I would like to do is capitalise these expenses as if they were assets. For those who are not familiar with financial terminology: I am trying to accumulate each year's R&D expense with those of the following years, decaying (or "depreciating") its value every time period by the corresponding depreciation rate.
The dataframe looks something like this:
fyear tic rd_tot rd_dep
0 1979 AMFD 1.345 0.200
1 1980 AMFD 0.789 0.200
.. .. .. .. ..
211339 2017 ACA 3.567 0.340
211340 2018 ACA 2.990 0.340
211341 2018 CTRM 0.054 0.234
Where fyear is the fiscal year, tic is the company specific letter code, rd_tot is the total R&D expenditure for the year and rd_dep is the applicable depreciation rate.
So far I was able to come up with this:
df['r&d_capital'] = [(df['rd_tot'].iloc[:i] * (1 - df['rd_dep'].iloc[:i]*np.arange(i)[::-1])).sum() for i in range(1, len(df)+1)]
However, the problem is that the code just runs through the entire column without taking into consideration that the R&D expense needs to be capitalised in a company-specific (or tic-specific) way. I also tried using .groupby(['tic']) but it did not work.
Therefore, I am looking for help to solve this problem, so that I can get each year's R&D expenses capitalised in a COMPANY-SPECIFIC way.
Thank you very much for your help!
This solution breaks the initial dataframe into separate ones (one for each 'tic' group) and applies the R&D capital calculation formula to each of them.
Finally, we re-construct the dataframe using pd.concat.
tic_dfs = [tic_group for _, tic_group in df.groupby('tic')]

for tic_df in tic_dfs:
    tic_df['r&d_capital'] = [(tic_df['rd_tot'].iloc[:i] * (1 - tic_df['rd_dep'].iloc[:i]*np.arange(i)[::-1])).sum() for i in range(1, len(tic_df)+1)]

result = pd.concat(tic_dfs).sort_index()
Note: "_" is the mask for the group name e.g. "ACA", "AMFD" etc, while tic_group is the actual data body.

Python Example for KNN or K-Means Clustering

I am looking at some sample data such as this:
Data:
ID         Name                        ParValue  Coupon  Maturity  Issuer                      Moodys    S&P_Fitch  Grade       Risk
37833100   Apple_Inc.                  1049      95      2030      Apple_Inc.                  Aaa       AAA        Investment  Highest_Quality
02079K107  Alphabet_Inc.               1055      99      2030      Alphabet_Inc.               Aa        AA         Investment  High_Quality
11659109   Alaska_Air_Group            996       98      2030      Alaska_Air_Group            A         A          Investment  Strong
931142103  Walmart_Stores,_Inc.        1195      99      2030      Walmart_Stores,_Inc.        Baa       BBB        Investment  Medium_Grade
495734523  Corp._Takeover              1108      97      2021      Corp._Takeover              Ba,_B     BB,_B      Junk        Speculative
193467211  Toys_R_Us                   1109      105     2021      Toys_R_Us                   Caa/Ca/C  CCC/CC/C   Junk        Highly_Speculative
576300972  Enron                       1062      102     2021      Enron                       C         D          Junk        In_Default
983457823  Economic_Consultants_Inc.                               Economic_Consultants_Inc.   Baa       BBB        Investment  Medium_Grade
894652378  Forecast_Backtesters_Corp.                              Forecast_Backtesters_Corp.  Aaa       AAA        Investment  Highest_Quality
Image: (the table above, with the missing ParValue, Coupon, and Maturity cells highlighted in pink)
So, if WalMart has Baa, BBB, Investment, and Medium_Grade (for Moodys, S&P_Fitch, Grade, and Risk) and Economic_Consultants_Inc. has these same attributes, I can infer that Economic_Consultants_Inc. has 1195, 99, and 2030 (for ParValue, Coupon, and Maturity), even though these data points are missing.
This is probably a KNN problem, but I'm thinking K-Means could be useful too. Basically, I'm trying to figure out how to update missing data points (ParValue, Coupon, & Maturity), like the ones colored pink in the image above, based on similar attributes. Then, I want to group similar items together (K-Means problem). Has someone here come across a good online example of how to do this? I looked online today and found some examples using randomly generated numbers, but my data sets will NOT have randomly generated numbers. I would appreciate any insight into how to solve this problem.
What you seem to be missing is pandas.
I suggest you go through the 10 min tutorial to get started.
The approach should be to:
Load the data into a dataframe using pandas.
Use the apply method to fill the missing values, based on the conditions you stated above.
This answer is similar to what you might have to do.
You can also use missing-value imputation from the impyute package.
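As a rough sketch of the fill step (not this answer's code; the column names are taken from the sample data and ParValue, Coupon, and Maturity are assumed to be numeric), rows sharing the same rating profile can lend their values to rows where those fields are missing:

import pandas as pd

keys = ['Moodys', 'S&P_Fitch', 'Grade', 'Risk']
targets = ['ParValue', 'Coupon', 'Maturity']

# Within each group of identical rating attributes, replace missing numbers
# with the group mean (for a group with a single complete row this is just
# that row's value).
df[targets] = df.groupby(keys)[targets].transform(lambda col: col.fillna(col.mean()))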

Predicting user ratings

I have a prediction problem that I am working on and need some help on how to approach it. I have a CSV with two columns, user_id and ratings, where a user is giving a rating on something in the ratings column. A user can repeat in the user_id column with different unique ratings. For example:
user_id rating
1 5
4 6
1 6
7 6
2 7
4 7
Now the prediction data set has users who have already given previous ratings, similar to the ones above:
user_id rating
11 6
12 10
13 8
13 9
14 4
14 5
The goal is to predict what these specific users will rate next time. Secondly, if we add a user '15' with no rating history, how can we predict the first two ratings that user will provide, in order?
I'm not sure how to train a model, with just user_id and ratings, which also happens to be the target column. Any help is appreciated!
First and foremost, you need to state what the user is rating, i.e. the category. For example, in a movie-rating system you can record that for movie A, an action film, the user gives a rating of 1, meaning the user hates action, while for movie B, a comedy, the user gives a 9, meaning the user loves comedy. The next time a movie of a similar category comes along, you can predict the user's rating easily. You can include many movie categories (thriller, romance, drama, etc.) and other features such as movie length, lead actor, director, and language, since all of these broadly influence a user's rating.
If you do not provide the basis on which the user is rating, prediction is very hard and of little use. For example, if I am a user and my past ratings are 1, 5, 2, 6, 8, 1, 9, 3, 4, 10, can you predict my next rating? No, because it looks like a random generator between 0 and 10. In the movie case, however, where my past ratings clearly show that I love comedy and hate action, you can easily predict my rating when a new comedy comes out.
Still, if your problem really is just this, you can use simple statistical methods: take the mean of the user's previous ratings and round to the nearest integer, or take the mode.
You could also plot the ratings for a user and look for a pattern, e.g. the rating rises to a peak, falls to a minimum, and rises again (believe me, this is going to be very impractical given your constraints), and predict on that basis.
The best of these options is a statistical model that gives a high weight to the last rating, a smaller weight to the second-to-last, smaller still to the one before, and so on, e.g.
predict_rating = w1*(last_rating) + w2*(second_last_rating) + w3*(third_last_rating) + ...
with the weights chosen to sum to 1, so this is a weighted mean.
This will give you very good results, and it is indeed machine learning: finding the best-suited weights is multivariate linear regression, and for the given constraints this is the best model.
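A minimal sketch of that weighted-recency idea, using the user_id/rating columns from the question and illustrative (not fitted) weights:

import numpy as np
import pandas as pd

# Illustrative weights, newest rating first; in practice these would be
# fitted (e.g. by linear regression) rather than chosen by hand.
weights = np.array([0.5, 0.3, 0.2])

def predict_next(ratings):
    # Weighted mean of up to the last three ratings, newest first.
    recent = np.asarray(ratings, dtype=float)[::-1][:len(weights)]
    w = weights[:len(recent)]
    return float(np.dot(w, recent) / w.sum())

history = pd.DataFrame({'user_id': [11, 12, 13, 13, 14, 14],
                        'rating':  [6, 10, 8, 9, 4, 5]})
predictions = history.groupby('user_id')['rating'].apply(lambda s: predict_next(s.to_numpy()))
print(predictions)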
