Python: How to use Multinomial Logistic Regression using sklearn

I have a test dataset and a train dataset as below. I have provided a sample with only a few records, but my real data has thousands of records. Here E is my target variable, which I need to predict using an algorithm. It has only four categories (1, 2, 3, 4), and it can take only one of these values.
Training Dataset:
A B C D E
1 20 30 1 1
2 22 12 33 2
3 45 65 77 3
12 43 55 65 4
11 25 30 1 1
22 23 19 31 2
31 41 11 70 3
1 48 23 60 4
Test Dataset:
A B C D
11 21 12 11
1 2 3 4
5 6 7 8
99 87 65 34
11 21 24 12
Since E has only four categories, I thought of predicting it using multinomial logistic regression (one-vs-rest logic). I am trying to implement it in Python.
I know the logic: we need to put these target values in a variable and use an algorithm to predict one of them:
output = [1,2,3,4]
But I am stuck on how to do this in Python (sklearn): how do I loop through these values, and which algorithm should I use to predict the output? Any help would be greatly appreciated.

You could try
LogisticRegression(multi_class='multinomial', solver='newton-cg').fit(X_train, y_train)

LogisticRegression can handle multiple classes out-of-the-box.
from sklearn.linear_model import LogisticRegression

X = df[['A', 'B', 'C', 'D']]   # features
y = df['E']                    # target with the four classes
lr = LogisticRegression()
lr.fit(X, y)
preds = lr.predict(X)          # array of integer class labels (1, 2, 3, 4)
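For completeness, here is a minimal end-to-end sketch on the sample data from the question (the two tables are typed in as DataFrames purely for illustration; the solver choice follows the suggestion above, and in practice you would load your real CSVs instead):
import pandas as pd
from sklearn.linear_model import LogisticRegression

# training and test tables from the question
train = pd.DataFrame({'A': [1, 2, 3, 12, 11, 22, 31, 1],
                      'B': [20, 22, 45, 43, 25, 23, 41, 48],
                      'C': [30, 12, 65, 55, 30, 19, 11, 23],
                      'D': [1, 33, 77, 65, 1, 31, 70, 60],
                      'E': [1, 2, 3, 4, 1, 2, 3, 4]})
test = pd.DataFrame({'A': [11, 1, 5, 99, 11],
                     'B': [21, 2, 6, 87, 21],
                     'C': [12, 3, 7, 65, 24],
                     'D': [11, 4, 8, 34, 12]})

X_train, y_train = train[['A', 'B', 'C', 'D']], train['E']
model = LogisticRegression(multi_class='multinomial', solver='newton-cg')
model.fit(X_train, y_train)
print(model.predict(test))  # each prediction is one of the classes 1, 2, 3, 4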

Related

KNN From Scratch in Python: How Do I create and Go to the Next Test Instance?

I have the following KNN function, which produces erroneous predictions. I believe it is because the code is only using the single test instance created. How do I adjust this code, as per the comments in the function, to pick the next test instance and repeat the process until the loop terminates?
import numpy as np
from sklearn.metrics import pairwise as metrics  # the metrics.euclidean_distances used below presumably refers to sklearn.metrics.pairwise

def knn_predict(X_train, y_train, X_test, k=5):
    y_pred = []
    for i in range(0, len(X_test)):
        # grab a test instance from X_test
        test_instance = np.array([X_test.iloc[i]])
        # find distances between the test instance and all training instances
        d = metrics.euclidean_distances(X_train, test_instance)
        # stack the distances with y_train to get a matrix
        stacked = np.stack((d.flatten(), y_train.Time.values), axis=1)
        # sort the matrix by the distance column and pick k y_train values,
        # where k is much less than the length of the training set
        y_train_nearest_k = stacked[np.argsort(stacked[:, -1])][0:k, 0]
        # make a predicted value for the test instance and append it to y_pred
        y_pred.append(np.mean(y_train_nearest_k))
        # pick the next instance; repeat the process until the loop terminates
    return y_pred
Calling the function as-is gives these results:
y_delivery_test_pred = knn_predict(X_delivery_train, y_delivery_train, X_delivery_test, k=5)
y_delivery_test_pred[0:5]
[6.603648852515093,
19.02562968007764,
34.00249960949702,
24.003332407921455,
24.330669863436253]
The correct results (implemented using sklearn KNeighborsRegressor) should be more like below:
array([[5.14],
[6.5 ],
[6.32],
[6.2 ],
[9.16]])
Data Sample:
X_train:
Miles Deliveries
0 100 4
1 50 3
2 100 4
3 100 2
4 50 2
5 80 2
6 75 3
7 65 4
8 90 3
9 90 2
10 50 5
y_train:
Time
0 9.3
1 4.8
2 8.9
3 6.5
4 4.2
5 6.2
6 7.4
7 6.0
8 7.6
9 6.1
10 7.0
X_test:
Miles Deliveries
0 50 3
1 65 2
2 80 1
3 70 1
4 70 5
5 95 6
6 50 6
7 90 3
8 60 3
9 80 1
10 95 6
Thanks!
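Side note: the loop already visits every test instance, so the iteration itself is probably not the issue. What looks off is the neighbour selection: stacked gets sorted by its last column (the Time values) and the mean is then taken over column 0 (the distances). A minimal sketch with that selection corrected and everything else unchanged (knn_predict_fixed is just an illustrative name):
import numpy as np
from sklearn.metrics import pairwise

def knn_predict_fixed(X_train, y_train, X_test, k=5):
    y_pred = []
    for i in range(len(X_test)):
        test_instance = np.array([X_test.iloc[i]])
        d = pairwise.euclidean_distances(X_train, test_instance)
        stacked = np.stack((d.flatten(), y_train.Time.values), axis=1)
        # sort by the distance column (0), keep the k nearest rows,
        # and average their Time values (column 1)
        nearest_times = stacked[np.argsort(stacked[:, 0])][:k, 1]
        y_pred.append(np.mean(nearest_times))
    return y_pred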

Sample dataframe by value in column and keep all rows

I want to sample a Pandas dataframe using values in a certain column, but I want to keep all rows with values that are in the sample.
For example, in the dataframe below I want to randomly sample some fraction of the values in b, but keep all corresponding rows in a and c.
d = pd.DataFrame({'a': range(1, 101, 1), 'b': list(range(0, 100, 4))*4, 'c': list(range(0, 100, 2))*2})
Desired example output from a 16% sample:
Out[66]:
a b c
0 1 0 0
1 26 0 50
2 51 0 0
3 76 0 50
4 4 12 6
5 29 12 56
6 54 12 6
7 79 12 56
8 18 68 34
9 43 68 84
10 68 68 34
11 93 68 84
12 19 72 36
13 44 72 86
14 69 72 36
15 94 72 86
I've tried sampling the series and merging back to the main data, like this:
In [66]: pd.merge(d, d.b.sample(int(.16 * d.b.nunique())))
This creates the desired output, but it seems inefficient. My real dataset has millions of values in b and hundreds of millions of rows. I know I could also use some version of isin, but that is also slow.
Is there a more efficient way to do this?
I really doubt that isin is slow:
import numpy as np

uniques = df.b.unique()
# this may be the bottleneck
samples = np.random.choice(uniques, replace=False, size=int(0.16 * len(uniques)))
# keep every row whose b value was sampled
df[df.b.isin(samples)]
You can profile the steps above. In case samples=... is slow, you can try:
idx = np.random.rand(len(uniques))
samples = uniques[idx<0.16]
Those took about 100 ms on my system on 10 million rows.
Note: d.b.sample(int(.16 * d.b.nunique())) does not sample 16% of the unique values in b; it draws that many rows from the b column itself (duplicates included), not from the distinct values.
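Putting the second variant together on the question's example frame (a sketch; the d name and the 16% fraction are taken from the question):
import numpy as np
import pandas as pd

d = pd.DataFrame({'a': range(1, 101, 1), 'b': list(range(0, 100, 4))*4, 'c': list(range(0, 100, 2))*2})

uniques = d.b.unique()
sampled_b = uniques[np.random.rand(len(uniques)) < 0.16]  # keep roughly 16% of the distinct b values
result = d[d.b.isin(sampled_b)]                           # every row whose b value was drawn
print(result)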

Predicted values all form a row instead of a column

I have a dataframe like this called test based on feature selection:
Spin Seek Power
0 92 50 99
1 88 20 90
2 56 100 90
3 87 20 100
4 67 30 45
The original data frame, called hdd_new, looked like this:
serial_number Spin Seek Power
0 W3015JSX 92 50 99
1 ZA10Q2F7 88 20 90
2 9VYC10JY 56 100 90
3 S301LJ5G 87 20 100
4 Z305D4X6 67 30 45
After building my model, I decided to test it on new data that comes in a .csv file.
df_test = hdd_new['serial_number']
y_pred = model.predict(test)
df_test['failure'] = y_pred
df_test[['serial_number','failure']].to_csv('predictions.csv', index=False)
df_test = pd.DataFrame(df_test)
df_test
Output:
serial_number
0 W3015JSX
1 ZA10Q2F7
2 9VYC10JY
3 S301LJ5G
4 Z305D4X6
failure [0,1,0,0,1]
What I want to achieve:
serial_number failure
0 W3015JSX 0
1 ZA10Q2F7 1
2 9VYC10JY 0
3 S301LJ5G 0
4 Z305D4X6 1
I don't know what I am doing wrong. Please help.
Just looking at what you've shared, and without knowing the details of your model: hdd_new['serial_number'] returns a Series, so df_test['failure'] = y_pred adds a single entry labelled 'failure' (holding the whole prediction array) rather than a new column. You could perhaps re-organise your code like this:
df_test = pd.DataFrame()
df_test['serial_number'] = hdd_new['serial_number']
y_pred = model.predict(test)
df_test['failure'] = y_pred
df_test[['serial_number','failure']].to_csv('predictions.csv', index=False)
Note: Unless df_test contains other columns which are not included in this scenario, the last line can simply read:
df_test.to_csv('predictions.csv', index=False)
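Equivalently, assuming hdd_new and test line up row for row, the output frame can be built in one step (a sketch reusing the names from the question):
df_test = pd.DataFrame({'serial_number': hdd_new['serial_number'].values,
                        'failure': model.predict(test)})
df_test.to_csv('predictions.csv', index=False)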

Finding row with closest numerical proximity within Pandas DataFrame

I have a Pandas DataFrame with the following hypothetical data:
ID Time X-coord Y-coord
0 1 5 68 5
1 2 8 72 78
2 3 1 15 23
3 4 4 81 59
4 5 9 78 99
5 6 12 55 12
6 7 5 85 14
7 8 7 58 17
8 9 13 91 47
9 10 10 29 87
For each row (or ID), I want to find the ID with the closest proximity in time and space (X & Y) within this dataframe. Bonus: Time should have priority over XY.
Ideally, in the end I would like to have a new column called "Closest_ID" containing the most proximal ID within the dataframe.
I'm having trouble coming up with a function for this.
I would really appreciate any help or hint that points me in the right direction!
Thanks a lot!
Let's denote df as our dataframe. Then you can do something like:
import numpy as np
from sklearn.metrics import pairwise_distances

space_vals = df[['X-coord', 'Y-coord']]
time_vals = df[['Time']]                        # 2-D, as pairwise_distances expects
space_distance = pairwise_distances(space_vals)
time_distance = pairwise_distances(time_vals)
space_distance[space_distance == 0] = 1e9       # mask self-distances with an arbitrary large number
time_distance[time_distance == 0] = 1e9         # same here
closest_space_id = np.argmin(space_distance, axis=0)
closest_time_id = np.argmin(time_distance, axis=0)
Then, you can store the last two results in two columns, or somehow decide which one is closer.
Note: this code hasn't been checked, and it might have a few bugs...
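Continuing from the arrays computed above, the positional indices can be mapped back to the ID column to get the requested "Closest_ID" (the time-based result is used here, since time has priority per the bonus requirement):
# closest_time_id[i] holds the row position of the point nearest in time to row i
df['Closest_ID'] = df['ID'].values[closest_time_id]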

Reshaping Data in R vs python

I'm trying to restructure a data-frame in R for k-means. Presently the data is structured like this:
Subject Posture s1 s2 s3....sn
1 45 45 43 42 ...
2 90 35 45 42 ..
3 0 3 56 98
4 45 ....
and so on. I'd like to collapse all the sn variables into a single column and create an additional variable with the s-number:
Subject Posture sn dv
1 45 1 45
2 90 2 35
3 0 3 31
4 45 4 45
Is this possible within R, or am I better off reshaping the csv directly in python?
Any help is greatly appreciated.
Here's the standard approach in base R (though using "reshape2" is probably the more common practice).
Assuming we're starting with "mydf", defined as:
mydf <- data.frame(Subject = 1:3, Posture = c(45, 90, 0),
s1 = c(45, 35, 3), s2 = c(43, 45, 56), s3 = c(42, 42, 98))
You can reshape with:
reshape(mydf, direction = "long", idvar=c("Subject", "Posture"),
varying = 3:ncol(mydf), sep = "", timevar="sn")
# Subject Posture sn s
# 1.45.1 1 45 1 45
# 2.90.1 2 90 1 35
# 3.0.1 3 0 1 3
# 1.45.2 1 45 2 43
# 2.90.2 2 90 2 45
# 3.0.2 3 0 2 56
# 1.45.3 1 45 3 42
# 2.90.3 2 90 3 42
# 3.0.3 3 0 3 98
require(reshape2)
melt(df, id.vars = c("Subject", "Posture"))
Where df is the data.frame you presented. Next time please use dput() to provide actual data.
I think this will work for you.
EDIT:
Make sure to install the reshape2 package first of course.
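If you do end up reshaping in Python instead, pandas.melt is the direct equivalent (a sketch on the same small example; the var_name/value_name arguments mirror the sn/dv layout asked for):
import pandas as pd

mydf = pd.DataFrame({'Subject': [1, 2, 3], 'Posture': [45, 90, 0],
                     's1': [45, 35, 3], 's2': [43, 45, 56], 's3': [42, 42, 98]})

long = pd.melt(mydf, id_vars=['Subject', 'Posture'], var_name='sn', value_name='dv')
long['sn'] = long['sn'].str.lstrip('s').astype(int)  # keep just the s-number
print(long.sort_values(['Subject', 'sn']))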
