I am looking at some sample data such as this:
Data:
ID Name ParValue Coupon Maturity Issuer Moodys S&P_Fitch Grade Risk
37833100 Apple_Inc. 1049 95 2030 Apple_Inc. Aaa AAA Investment Highest_Quality
02079K107 Alphabet_Inc. 1055 99 2030 Alphabet_Inc. Aa AA Investment High_Quality
11659109 Alaska_Air_Group 996 98 2030 Alaska_Air_Group A A Investment Strong
931142103 Walmart_Stores,_Inc. 1195 99 2030 Walmart_Stores,_Inc. Baa BBB Investment Medium_Grade
495734523 Corp._Takeover 1108 97 2021 Corp._Takeover Ba,_B BB,_B Junk Speculative
193467211 Toys_R_Us 1109 105 2021 Toys_R_Us Caa/Ca/C CCC/CC/C Junk Highly_Speculative
576300972 Enron 1062 102 2021 Enron C D Junk In_Default
983457823 Economic_Consultants_Inc. Economic_Consultants_Inc. Baa BBB Investment Medium_Grade
894652378 Forecast_Backtesters_Corp. Forecast_Backtesters_Corp. Aaa AAA Investment Highest_Quality
Image: the table above with the missing ParValue, Coupon and Maturity cells highlighted in pink
So, if Walmart has Baa, BBB, Investment, and Medium_Grade (for Moodys, S&P_Fitch, Grade, and Risk) and Economic_Consultants_Inc. has those same attributes, I can infer that Economic_Consultants_Inc. also has 1195, 99, and 2030 (for ParValue, Coupon, and Maturity), even though those data points are missing.
This is probably a KNN problem, but I'm thinking K-Means could be useful too. Basically, I'm trying to figure out how to fill in the missing data points (ParValue, Coupon, and Maturity), like the ones colored pink in the image above, based on similar attributes. Then I want to group similar items together (the K-Means part). Has anyone here come across a good online example of how to do this? I looked online today and found some examples using randomly generated numbers, but my data sets will NOT have randomly generated numbers. I would appreciate any insight into how to solve this problem.
What you seem to be missing is pandas.
I suggest you go through the 10 min tutorial to get started.
The approach should be
Load the data into a dataframe using pandas,
Use the apply method to fill the missing values, based on the conditions you stated above.
This answer is similar to what you might have to do.
You can also use missing-value imputation from the impyute package.
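Rather than apply row by row, one way to express your rule is a groupby/transform; a minimal sketch, assuming the sample has already been loaded into a dataframe called bonds with the columns shown above and NaN in the missing ParValue/Coupon/Maturity cells:
# Rows that share the same rating attributes get their missing values filled
# with the (mean of the) values already present in that group.
rating_cols = ["Moodys", "S&P_Fitch", "Grade", "Risk"]
value_cols = ["ParValue", "Coupon", "Maturity"]
bonds[value_cols] = (
    bonds.groupby(rating_cols)[value_cols]
         .transform(lambda col: col.fillna(col.mean()))
)
With only one complete row per rating group (as in your Walmart vs. Economic_Consultants_Inc. example) the group mean is simply that row's values, which matches the inference you describe; impyute's KNN-based imputation is a more general version of the same idea.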
So I have this problem. Because of the size of the dataframe I am working on, I clearly cannot upload it, but it has the following structure:
    country    coastline    EU     highest
1   Norway     yes          yes    1500
2   Turkey     yes          no     20100
..  ...        ...          ...    ...
41  Bolivia    no           no     999
42  Japan      yes          no     89
I have to solve several exercises with pandas. One of them, for example, is showing the country with the maximum "highest", the country with the minimum, and the average "highest", but only for the countries that do belong to the EU. I already solved the maximum and the minimum, but for the average I thought about creating a new dataframe built from only the rows that contain a "yes" in the EU column. I've tried a few things, but they haven't worked.
I thought this would be the best way to solve it, but if anyone has another idea, I'm looking forward to reading it.
By the way, these are the examples that I said I was able to solve:
print('Minimum outside the EU')
paises[(paises.EU == "no")].sort_values(by=['highest'], ascending=[True]).head(1)
Which gives me this:
   country    coastline    EU    highest
3  Belarus    no           no    345
As a last condition, this must be solved using pandas, since it is basically the chapter that we are working on in classes.
If you want to create a new dataframe that is based off of a filter on your first, you can do this:
new_df = df[df['EU'] == 'yes'].copy()
This will look at the 'EU' column in the original dataframe df and return only the rows where it is 'yes'. I think it is good practice to add the .copy(), since we can sometimes get strange side-effects if we then make changes to new_df (it probably wouldn't matter here).
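From there the average you are after is just a column mean on the filtered frame; a minimal sketch, assuming your dataframe is called paises as in your snippet:
# Keep only EU members, then take the mean of the 'highest' column.
new_df = paises[paises['EU'] == 'yes'].copy()
print('Average "highest" inside the EU:', new_df['highest'].mean())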
I have two dataframes. Dataframe A contains course information, including the ISBN number for required textbooks:
Course Abbreviation  Course Number  Section Number  Course Name                 Course Instructor  Course Seats  ISBN No
ACCT                 205            101             Intro Financial Accounting                     30            9780357617977
ACCT                 205            102             Intro Financial Accounting  Grant              30            9780357617977
ACCT                 205            901             Intro Financial Accounting  Grant              35            9780357617977
Dataframe B contains book purchasing info and also includes the ISBN number:
Title                                                                                ISBN         Binding  Edition  US_List
7 HABITS OF HIGHLY EFFECTIVE TEENS: THE ULTIMATE TEENAGE SUCCESS GUIDE.              9.78148E+12  Paper             17.99 USD
7 HABITS OF HIGHLY EFFECTIVE TEENS: THE ULTIMATE TEENAGE SUCCESS GUIDE.              9.78148E+12  eBook
ADOBE AUDITION CC: CLASSROOM IN A BOOK: THE OFFICIAL TRAINING WORKBOOK FROM ADOBE.   9.78014E+12  Paper    2ND ED.  59.99 USD
I am able to merge the two dataframes so that the course info is available along with the book purchasing info. However, Dataframe B contains many different listings for the same book. I would like to bring the course info over to matching titles where the ISBN isn't the same. So in the example below, even though the ISBNs are different, the course info would appear for both versions of the title:
Course Abbreviation  Course Number  Section Number  Course Name            Course Instructor  Course Seats  ISBN No        Title
CTEC                 107            825.0           Skills for IT Success  Lott               20.0          9781476764665  7 HABITS OF HIGHLY EFFECTIVE TEENS: THE ULTIMATE TEENAGE SUCCESS GUIDE.
NaN                  NaN            NaN             NaN                    NaN                NaN           NaN            7 HABITS OF HIGHLY EFFECTIVE TEENS: THE ULTIMATE TEENAGE SUCCESS GUIDE.
What would be the best way to do this? The rows that need course info filled in are not always in the same location in relation to the rows that do have course info, so I don't think ffill or bfill will work.
Sorting by ISBN No will push the nulls to the bottom; then you can group by Title and ffill the data.
df.sort_values(by='ISBN No').groupby('Title').ffill()
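If you want to keep the result, a sketch of assigning the filled columns back (assuming the merged frame is called df and the course columns are named as in your example):
# Rows with a non-null ISBN No come first within each Title, then the course
# columns are forward-filled within each Title group and written back.
df = df.sort_values(by='ISBN No')
course_cols = ['Course Abbreviation', 'Course Number', 'Section Number',
               'Course Name', 'Course Instructor', 'Course Seats']
df[course_cols] = df.groupby('Title')[course_cols].ffill()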
I'm experimenting with Machine Learning and LSTM models for river level prediction based on the current level and rainfall within the upstream catchment. I'm currently using TensorFlow and a Keras LSTM model.
I have hourly rainfall data in mm from eight sites within the catchment, and the river stage (level) in meters at a site downstream of the catchment area. The problem I face is that every now and then the rainfall sites are tested by pouring water into them. This creates a significant spike in the data that I would like to filter out.
Here's an example of what a typical rainfall event looks like within the dataframe:
DataFrame showing a typical sequence of rainfall and river stage data
And here is an example of what it looks like when two sites have been tested
DataFrame showing abnormal rainfall data due to two sites being tested
I've come across several ways to statistically cluster data and identify outliers, but none of these really worked on a timed sequence of events. Also, the rainfall site columns are listed in the DataFrame in the order in which they are located within the catchment, so there is a loose spatial coupling moving across the columns.
I was thinking of using something a little like a 3x3 or 3x4 convolution matrix, but rather than calculating new cell values it would find outliers by comparing the values from the central cells with the values in the outer cells. Any significant difference would indicate abnormal data.
The Pandas DataFrame API is quite large and I'm still getting familiar with it. Any suggestions on specific functions or methods I should be looking at would be much appreciated.
In the following example the 10:00:00 reading for Site 2 would be an obvious anomaly.
Timestamp  Site 1  Site 2  Site 3
09:00:00   0       0       0
10:00:00   0       20      0
11:00:00   0       0       0
20 mm of rainfall at one site, with zero rainfall at the adjacent sites or at the same site for the hour before and the hour after, is a very clear and obvious case.
This is what a normal rainfall pattern might look like:
Timestamp  Site 1  Site 2  Site 3
09:00:00   6       4       0
10:00:00   0       20      2
11:00:00   0       0       11
This is a less obvious example:
Timestamp  Site 1  Site 2  Site 3
09:00:00   1       0       0
10:00:00   0       20      2
11:00:00   0       3       1
One possibility might be to compare the central cell value to the maximum of the surrounding cell values and flag it as abnormal if the difference is greater than 15 (or some other arbitrary threshold value).
The exact criteria will probably change as I discover more about the data. The mechanism of how to apply the criteria to the dataframe is what I'm looking for. For example, if the criteria were implemented as a lambda function that could reference adjacent cells, is there a way to apply that lambda function across the DataFrame?
An extra complication is how to deal with checking values for Site 1, where there is no preceding site to the left, and Site 3, where there is no following site to the right.
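To make the question concrete, this is roughly the shift-based comparison I have in mind (a rough sketch, assuming the hourly readings are in a dataframe called df with the site columns in spatial order):
import numpy as np

site_cols = ['Site 1', 'Site 2', 'Site 3']   # in spatial order across the catchment
sites = df[site_cols]

# Same site, one hour before and one hour after.
prev_hour = sites.shift(1)
next_hour = sites.shift(-1)

# Adjacent sites at the same hour (shift along the columns).
left_site = sites.shift(1, axis=1)
right_site = sites.shift(-1, axis=1)

# Element-wise maximum of the surrounding cells; np.fmax ignores NaN, so the
# edge cases (Site 1 has no site to its left, Site 3 none to its right, and the
# first/last rows have no neighbour in time) simply use fewer neighbours.
neighbour_max = np.fmax(np.fmax(prev_hour, next_hour),
                        np.fmax(left_site, right_site))

# Flag cells that exceed their best neighbour by more than the threshold.
threshold = 15
anomalies = (sites - neighbour_max) > threshold
Whether a single threshold like 15 separates the "normal" and "less obvious" patterns above is something I would still need to tune against the data.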
I have a dataset of a sensor (station) for several years with this structure:
station Direction year month day dayOfweek hour volume
1009 3 2015 1 1 5 0 37
1009 3 2015 1 1 5 1 20
1009 3 2015 1 1 5 2 24
... . .. .. .. .. .. ..
There are plenty of gaps (missing values) in this data. For example, there might be a month or several days missing. I fill the missing volumes with 0. I want to predict volume based on previous data. I used an LSTM, and the mean absolute percent error (MAPE) is quite high, around 20, and I need to reduce it.
The main problem I have is that even for training I have gaps. Is there any other technique in deep learning for this kind of data?
There are multiple ways to handle missing values as listed here (https://machinelearningmastery.com/handle-missing-data-python/).
If I have enough data, I just omit the rows with missing data. If I do not have enough data and/or have to predict on cases where data is missing, I normally try these two approaches and choose the one with the higher accuracy.
The first is the same as yours: choose a distinct value that does not occur in the dataset, like 0 in your case, and fill with that value. The other approach is to use the mean or median of the training set. I use the same value (calculated on the training set) in my validation/test set. The median is better than the mean if the mean does not make sense in the current context (2014.5 as a year, for example).
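A minimal sketch of the mean/median approach, assuming the missing hours already appear as NaN rows and that train/valid are placeholder names for your own split:
# Fill value computed on the training set only, then reused on the
# validation/test split so no information leaks from those rows.
fill_value = train['volume'].median()
train['volume'] = train['volume'].fillna(fill_value)
valid['volume'] = valid['volume'].fillna(fill_value)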
I have a weird problem concerning a weird dataset.
Basically I have 25 replicates of a model; they all have fire sizes, and those fire sizes are summed cumulatively.
So a short example summary of the data would be:
fires_size rep cumsum
0 1 rep_9 1
1 1 rep_9 2
2 1 rep_9 3
....
50 59 rep_9 4000
51 75 rep_9 4075
....
150 1 rep_20 1
151 1 rep_20 2
152 1 rep_20 3
....
200 12 rep_20 3500
201 70 rep_20 3570
So when I plot this pandas dataframe, with fire sizes as x and cumulative area burnt as y, I get something like this (the blue lines, as I have two different datasets).
Image linked there, as I can't upload a picture.
So now a cool thing would be to create an average replicate that could be drawn on top of my other reps to show the average distribution; even better would be to calculate the standard deviation and use fill_between to show the variability.
My problem is that since my fire sizes (x axis) are not consistent across reps (and repeated, as y is cumulative), I have no idea how I could do that. I tried a few trendlines, tried lowess and things like that, but it never gives me a really good result.
So is there an easy way to do that? I am lacking some basic statistics knowledge here, and can't really find any answer that I can relate to, as I don't even know how to describe my dataset.
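The closest I can get to describing what I want is this rough sketch, where every rep is interpolated onto a common grid of fire sizes before averaging (I don't know if this is statistically sound, which is part of my question; it assumes the data is in a dataframe called df with the columns shown above):
import numpy as np
import matplotlib.pyplot as plt

# Common grid of fire sizes on which to evaluate every replicate.
common_x = np.linspace(df['fires_size'].min(), df['fires_size'].max(), 200)

curves = []
for rep, grp in df.groupby('rep'):
    # Collapse duplicate fire sizes (keep the largest cumulative value) so the
    # x values are strictly increasing, then interpolate onto the common grid.
    grp = grp.groupby('fires_size', as_index=False)['cumsum'].max()
    curves.append(np.interp(common_x, grp['fires_size'], grp['cumsum']))

curves = np.array(curves)
mean_curve = curves.mean(axis=0)
std_curve = curves.std(axis=0)

plt.plot(common_x, mean_curve, color='black', label='average replicate')
plt.fill_between(common_x, mean_curve - std_curve, mean_curve + std_curve,
                 alpha=0.3, label='± 1 standard deviation')
plt.xlabel('fire size')
plt.ylabel('cumulative area burnt')
plt.legend()
plt.show()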
Thank you so much!
I will post the link to the full data in a comment as I can't post more than one link