I have a dataframe that looks something like the one below.
Spent Products bought Target Variable
0 2300 Car/Mortgage/Leisure 0
1 1500 Car/Education 0
2 150 Groceries 1
3 700 Groceries/Education 1
4 900 Mortgage 1
5 180 Education/Sports 1
6 1800 Car/Mortgage/Others 0
7 900 Sports/Groceries 1
8 1000 Self-Enrichment/Car 1
9 140 Car/Groceries 1
I used pd.get_dummies to one-hot encode the "Products bought" column. Now I have a shape of (5000, 150).
I did a train/test split and then applied PCA: I fit_transform the train set and applied only transform to the test set. After that I used a decision tree classifier to predict, which got me 90% accuracy.
Now here comes the problem. I have a new set of data. My model was trained on a shape of (, 150), and this new data only has a shape of (150, 28) after applying encoding with pd.get_dummies.
I know merging the new data with the old dataset is not a solution. I'm kind of stuck and not sure how to go about solving this. Does anyone have any input? Thanks.
Edit: I tried reindexing the new dataset but it did not work. There are more unique values in the "Products bought" column of my training set than in my new dataset.
The new dataframe looks something like the one below.
Spent Products bought Target Variable
0 230 Leisure 1
1 150 Others 1
2 100 Groceries 1
3 700 Education 1
4 900 Mortgage 0
5 180 Education/Sports 1
6 1800 Car/Mortgage 0
7 400 Groceries 1
8 4000 Car 1
9 140 Car/Groceries 1
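A note on the reindexing approach: it only works if the new dummies are reindexed against the training columns explicitly. A minimal sketch; the frame names train_df/new_df and the use of str.get_dummies to split the "/"-separated values are assumptions, not the actual code:

import pandas as pd

# Hypothetical raw frames matching the tables above.
train_X = train_df['Products bought'].str.get_dummies(sep='/')
new_X = new_df['Products bought'].str.get_dummies(sep='/')

# Align the new data to the training columns: categories seen only in
# training become all-zero columns; categories unseen in training are dropped.
new_X = new_X.reindex(columns=train_X.columns, fill_value=0)

# new_X now has the same 150 columns and can go through the already-fitted
# PCA and classifier, e.g. clf.predict(pca.transform(new_X))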
My original data looks like this.
id season home_team away_team home_goals away_goals result winner
0 0 2006-07 Shu Liv 1 1 D NaN
1 1 2006-07 Ars Avl 1 1 D NaN
2 2 2006-07 Eve Wat 2 1 H Eve
3 3 2006-07 New Wig 2 1 H New
4 4 2006-07 Por Bla 3 0 H Por
The purpose is to build a model that predicts the outcome probabilities, e.g.:
Home Team Win 55%
Draw 13%
Away Team Win 32%
I selected these 3 columns and label encoded them:
home_team, away_team, winner
Then I created these new classes/labels:
df.loc[df["winner"]==df["home_team"],"home_team_win"]=1
df.loc[df["winner"]!=df["home_team"],"home_team_win"]=0
df.loc[df["result"]=='D',"draw"]=1
df.loc[df["result"]!='D',"draw"]=0
df.loc[df["winner"]==df["away_team"],"away_team_win"]=1
df.loc[df["winner"]!=df["away_team"],"away_team_win"]=0
Now the encoded data looks like this:
home_team away_team home_team_win away_team_win draw
0 28 19 0 0 1
1 1 2 0 0 1
2 14 34 1 0 0
3 23 37 1 0 0
4 25 4 1 0 0
Initially, I used the code below for the single label 'home_team_win' and it worked fine, but it doesn't support multiple classes/labels.
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X = prediction_df.drop(['home_team_win'], axis=1)
y = prediction_df['home_team_win']

logReg = LogisticRegression(solver='lbfgs')
rfe = RFE(logReg, 20)  # select the 20 best features
rfe = rfe.fit(X, y.values.ravel())
How do I do multi-label or multi-class classification for this problem?
The target binary variables home_team_win, away_team_win, and draw are mutually exclusive. It does not seem like a good idea to use multi-label methods for this problem since, in general, they are designed to exploit dependencies among labels, and this dataset has none to exploit.
I suggest modelling it as a multi-class problem in its most common form, where there is a single target column with three classes: 0, 1, and 2 (representing home_team_win, draw, and away_team_win).
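As a hedged sketch of building that single column (the column names are taken from your prediction_df; the 0/1/2 mapping is an arbitrary but fixed choice):

import numpy as np

# Collapse the three mutually exclusive flags into one 3-class target:
# 0 = home win, 1 = draw, 2 = away win.
Y = np.select([prediction_df['home_team_win'] == 1,
               prediction_df['draw'] == 1,
               prediction_df['away_team_win'] == 1],
              [0, 1, 2])

X = prediction_df[['home_team', 'away_team']]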
Many implementations of classifiers in scikit-learn can work directly in this manner. Logistic Regression is one of them:
from sklearn.linear_model import LogisticRegression

logReg = LogisticRegression(solver='lbfgs', multi_class='ovr')
logReg.fit(X, Y)
logReg.predict_proba(X)  # one probability per class, per row
This code will output the desired probabilities for each class, for each row of X. In particular, it trains one Logistic Regression per class (this is what the multi_class='ovr' parameter does).
Take a look at https://scikit-learn.org/stable/supervised_learning.html for other classifiers that work directly with the multi-class form I suggested.
I am looking to predict values based on seasonal data. Bonuses are paid monthly/quarterly/annually, and the amount usually goes up after a couple of time periods. The data is given below. I have converted the bonus event to a numerical value (Yes = 1, No = 0). I have tried Excel's forecast functions, but they were not useful.
Is there a package with which I can predict the next bonus month and amount, giving recent data points higher weight than older ones?
My dataset has about 10 years of data for about 10,000 personnel, so it is not possible to predict both month and amount manually. I am trying to predict the next bonus month and amount.
Date    Bonus  Amount
Jan-15  0      000
Feb-15  0      000
Mar-15  1      100
Apr-15  0      000
May-15  0      000
Jun-15  1      100
Jul-15  0      000
Aug-15  0      000
Sep-15  1      145
Oct-15  0      000
Nov-15  0      000
Dec-15  1      145
Jan-16  0      000
Feb-16  0      000
Mar-16  1      145
Apr-16  0      000
May-16  1      150
Jun-16  0      000
Jul-16  0      000
Aug-16  1      150
Sep-16  0      000
Oct-16  0      000
Nov-16  1      150
Dec-16  0      000
Thanks for the help.
Have a look at the pandas package (https://pypi.org/project/pandas/). As far as I can tell, your problem is fundamental time-series analysis, and there are many guides online on how to implement it (just search for ARIMA pandas).
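A minimal sketch of that route using statsmodels (the seasonal order below is a guess, not a tuned model, and the Amount series is transcribed by hand from the table above):

import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# One Amount value per month, Jan-15 through Dec-16.
idx = pd.date_range('2015-01-01', periods=24, freq='MS')
amount = pd.Series([0, 0, 100, 0, 0, 100, 0, 0, 145, 0, 0, 145,
                    0, 0, 145, 0, 150, 0, 0, 150, 0, 0, 150, 0],
                   index=idx)

# Seasonal ARIMA with a 3-month season; in practice you would select the
# order per person (and a weighted/exponential-smoothing model may better
# capture "recent points matter more").
fit = SARIMAX(amount, order=(0, 0, 0),
              seasonal_order=(1, 1, 0, 3)).fit(disp=False)
print(fit.forecast(steps=3))  # predicted amounts for the next 3 months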
I have a data file containing different foetal ultrasound measurements. The measurements are collected at different points during pregnancy, like so:
PregnancyID MotherID gestationalAgeInWeeks abdomCirc
0 0 14 150
0 0 21 200
1 1 20 294
1 1 25 315
1 1 30 350
2 2 8 170
2 2 9 180
2 2 18 NaN
As you can see from the table above, I have multiple measurements per pregnancy (between 1 and 26 observations each).
I want to summarise the ultrasound measurements somehow such that I can replace the multiple measurements with a fixed amount of features per pregnancy. So I thought of creating 3 new features, one for each trimester of pregnancy that would hold the maximum measurement recorded during that trimester:
abdomCirc1st: this feature would hold the maximum value of all abdominal circumference measurements measured between 0 to 13 Weeks
abdomCirc2nd: this feature would hold the maximum value of all abdominal circumference measurements measured between 14 to 26 Weeks
abdomCirc3rd: this feature would hold the maximum value of all abdominal circumference measurements measured between 27 to 40 Weeks
So my final dataset would look like this:
PregnancyID MotherID abdomCirc1st abdomCirc2nd abdomCirc3rd
0 0 NaN 200 NaN
1 1 NaN 315 350
2 2 180 NaN NaN
The reason for using the maximum here is that a larger abdominal circumference is associated with the adverse outcome I am trying to predict.
But I am quite confused about how to go about this. I have used the groupby function previously to derive certain statistical features from the multiple measurements, however this is a more complex task.
What I want to do is the following:
Group all abdominal circumference measurements that belong to the same pregnancy into 3 trimesters based on gestationalAgeInWeeks value
Compute the maximum value of all abdominal circumference measurements within each trimester, and assign this value to the relevant feature: abdomCirc1st, abdomCirc2nd or abdomCirc3rd.
I think I have to do something along the lines of:
df["abdomCirc1st"] = df.groupby(['MotherID', 'PregnancyID', 'gestationalAgeInWeeks'])["abdomCirc"].transform('max')
But this code does not check what trimester the measurement was taken in (gestationalAgeInWeeks). I would appreciate some help with this task.
You can try this. It is a bit of a complicated query, but it seems to work:
(df.groupby(['MotherID', 'PregnancyID'])
   .apply(lambda d: d.assign(tm=(d['gestationalAgeInWeeks'] + 13 - 1) // 13)
                     .groupby('tm')['abdomCirc']
                     .max())
   .unstack()
)
produces
tm                        1      2      3
MotherID PregnancyID
0        0              NaN  200.0    NaN
1        1              NaN  315.0  350.0
2        2            180.0    NaN    NaN
Let's unpick this a bit. First we groupby on MotherID, PregnancyID. Then we apply a function to each grouped dataframe (d).
For each d, we create a 'trimester' column tm via assign (I assume I got the math right here, but correct it if it is wrong!), then we groupby 'tm' and take the max. For each sub-dataframe d we then obtain a Series mapping tm to max(abdomCirc).
Finally, unstack() moves tm into the column names.
You may want to rename these columns later, but I did not bother.
Solution 2
Come to think of it, you can simplify the above a bit:
(df.assign(tm=(df['gestationalAgeInWeeks'] + 13 - 1) // 13)
   .drop(columns='gestationalAgeInWeeks')
   .groupby(['MotherID', 'PregnancyID', 'tm'])
   .agg('max')
   .unstack()
)
Similar idea, same output.
There is a handy method called query. This should do the job for now:
abdomCirc1st = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks <= 13')['abdomCirc'].max()
abdomCirc2nd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 14 and gestationalAgeInWeeks <= 26')['abdomCirc'].max()
abdomCirc3rd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 27 and gestationalAgeInWeeks <= 40')['abdomCirc'].max()
If you want something more automatic (not manually changing the values of your IDs, MotherID and PregnancyID, for each different group of rows), you have to combine it with groupby, as you did on your own; a sketch of that combination follows after the link below.
Check this as well: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html
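For completeness, a minimal sketch of that combination, using pd.cut to bin the weeks into the three trimesters (bin edges and column labels are taken from the question):

import pandas as pd

# Bin gestational age into trimesters: (0, 13], (13, 26], (26, 40].
tm = pd.cut(df['gestationalAgeInWeeks'],
            bins=[0, 13, 26, 40],
            labels=['abdomCirc1st', 'abdomCirc2nd', 'abdomCirc3rd'])

result = (df.assign(trimester=tm)
            .groupby(['MotherID', 'PregnancyID', 'trimester'], observed=False)
            ['abdomCirc']
            .max()
            .unstack())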
In this particular issue, I have an imaginary city divided into squares: basically an MxN grid of square cells covering the city. M and N can be relatively big, so I have cases with more than 40,000 cells overall.
I have a number of customers Z distributed across this grid; some cells contain many customers while others are empty. I would like to find a way to place the minimum number of shops (only one per cell) able to serve all customers, with the restriction that every customer must be "in reach" of a shop and no customer may be left out.
As an additional couple of twists, I have these constraints/issues:
There is a maximum distance a customer can travel: if the shop is in a cell too far away, then the customer cannot be associated with that shop. Edit: it's not really a distance, it's a measure of how easy it is for a customer to reach a shop, so I can't use circles...
While respecting condition (1) above, there may well be multiple shops within reach of the same customer. In this case, the closest shop should win.
At the moment I'm trying to ignore the issue of costs (many customers mean bigger shops and larger costs), but maybe at some point I'll think about that too. The problem is, I have no idea what this problem is called nor what algorithmic solutions exist for it: can it be solved as a Linear Programming problem?
I normally code in Python, so any suggestions on a possible algorithmic approach and/or some code/libraries to solve it would be very much appreciated.
Thank you in advance.
Edit: as a follow-up, I found out I could solve this as a MINLP "uncapacitated facility location problem", but all the information I have found is way too complex: I don't care to know which customer is served by which shop, I only care whether and where a shop is built. I have a secondary way, as post-processing, to associate each customer with the most appropriate shop.
All the code I have found sets up a monstrous linear system with one constraint per customer per shop (as "explained" here: https://en.m.wikipedia.org/wiki/Facility_location_problem#Uncapacitated_facility_location), so in a situation like mine I could easily end up with a linear system with millions of rows and columns, which with integer/binary variables will take about the age of the universe to solve.
There must be an easier way to handle this...
I think this can be formulated as a set covering problem (https://en.wikipedia.org/wiki/Set_cover_problem).
You say:
"in a situation like mine I could easily end up with a linear system with millions of rows and columns, which with integer/binary variables will take about the age of the universe to solve"
So let's see if that is even remotely true.
Step 1: generate some data
I generated a grid of 200 x 200, yielding 40,000 cells, and placed 500 customers at random. This looks like:
---- 22 PARAMETER cloc customer locations
x y
cust1 35 75
cust2 169 84
cust3 111 18
cust4 61 163
cust5 59 102
...
cust497 196 148
cust498 115 136
cust499 63 101
cust500 92 87
Step 2: calculate reach of each customer
The next step is to determine, for each customer c, the allowed locations (i,j) within reach. I created a large sparse boolean matrix reach(c,i,j) for this, using the rule: if the Manhattan distance satisfies
|i - cloc(c,'x')| + |j - cloc(c,'y')| <= 10
then a store at (i,j) can service customer c. (Zeros are not stored.) This data structure has 106k elements.
Step 3: Form MIP model
We form a simple set-covering MIP model:

  minimize    numStores = sum over all cells (i,j) of placeStore(i,j)
  subject to  sum over (i,j) in reach(c) of placeStore(i,j) >= 1   for each customer c
              placeStore(i,j) in {0, 1}

The inequality constraint says: we need at least one store within reach of each customer. This is a very simple model to formulate and to implement.
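For readers who want a Python rendition (the listings above come from a GAMS model), here is a minimal sketch of the same model with the PuLP library; the data generation is hypothetical:

import random
import pulp

M = N = 200  # grid size
random.seed(1)
customers = [(random.randrange(M), random.randrange(N)) for _ in range(500)]

def reach(cx, cy, dmax=10):
    # Cells within Manhattan distance dmax of customer (cx, cy).
    return [(i, j)
            for i in range(max(0, cx - dmax), min(M, cx + dmax + 1))
            for j in range(max(0, cy - dmax), min(N, cy + dmax + 1))
            if abs(i - cx) + abs(j - cy) <= dmax]

prob = pulp.LpProblem("set_cover", pulp.LpMinimize)
place = pulp.LpVariable.dicts("place", (range(M), range(N)), cat="Binary")

# Objective: total number of stores placed.
prob += pulp.lpSum(place[i][j] for i in range(M) for j in range(N))

# Each customer must have at least one store within reach.
for cx, cy in customers:
    prob += pulp.lpSum(place[i][j] for i, j in reach(cx, cy)) >= 1

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print("stores:", int(pulp.value(prob.objective)))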
Step 4: Solve
This is a large but easy MIP. It has 40,000 binary variables. It solves very fast. On my laptop it took less than 1 second with a commercial solver (3 seconds with open-source solver CBC).
The solution looks like:
---- 47 VARIABLE numStores.L = 113 number of stores
---- 47 VARIABLE placeStore.L store locations
j1 j6 j7 j8 j9 j15 j16 j17 j18
i4 1
i18 1
i40 1
i70 1
i79 1
i80 1
i107 1
i118 1
i136 1
i157 1
i167 1
i193 1
+ j21 j23 j26 j28 j29 j31 j32 j36 j38
i10 1
i28 1
i54 1
i72 1
i96 1
i113 1
i147 1
i158 1
i179 1
i184 1
i198 1
+ j39 j44 j45 j46 j49 j50 j56 j58 j59
i5 1
i18 1
i39 1
i62 1
i85 1
i102 1
i104 1
i133 1
i166 1
i195 1
+ j62 j66 j67 j68 j69 j73 j74 j76 j80
i11 1
i16 1
i36 1
i61 1
i76 1
i105 1
i112 1
i117 1
i128 1
i146 1
i190 1
+ j82 j84 j85 j88 j90 j92 j95 j96 j97
i17 1
i26 1
i35 1
i48 1
i68 1
i79 1
i97 1
i136 1
i156 1
i170 1
i183 1
i191 1
+ j98 j102 j107 j111 j112 j114 j115 j116 j118
i4 1
i22 1
i36 1
i56 1
i63 1
i68 1
i88 1
i100 1
i101 1
i111 1
i129 1
i140 1
+ j119 j121 j126 j127 j132 j133 j134 j136 j139
i11 1
i30 1
i53 1
i72 1
i111 1
i129 1
i144 1
i159 1
i183 1
i191 1
+ j140 j147 j149 j150 j152 j153 j154 j156 j158
i14 1
i35 1
i48 1
i83 1
i98 1
i117 1
i158 1
i174 1
i194 1
+ j161 j162 j163 j164 j166 j170 j172 j174 j175
i5 1
i32 1
i42 1
i61 1
i69 1
i103 1
i143 1
i145 1
i158 1
i192 1
i198 1
+ j176 j178 j179 j180 j182 j183 j184 j188 j191
i6 1
i13 1
i23 1
i47 1
i61 1
i81 1
i93 1
i103 1
i125 1
i182 1
i193 1
+ j192 j193 j196
i73 1
i120 1
i138 1
i167 1
I think we have debunked your statement that a MIP model is not a feasible approach to this problem.
Note that the age of the universe is 13.7 billion years or 4.3e17 seconds. So we have achieved a speed-up of about 1e17. This is a record for me.
Note that this model does not find the best locations for the stores beyond minimizing their count: it finds a configuration that minimizes the number of stores needed to service all customers. It is optimal in that sense, but the solution will not minimize the distances between customers and stores.
I have a dataframe with multiple columns
import pandas as pd

df = pd.DataFrame({"cylinders": [2, 2, 1, 1],
                   "horsepower": [120, 100, 80, 70],
                   "weight": [5400, 6200, 7200, 1200]})
cylinders horsepower weight
0 2 120 5400
1 2 100 6200
2 1 80 7200
3 1 70 1200
I would like to create a new dataframe with two subcolumns of weight, the median and the mean, while grouping by cylinders.
Example:
weight
cylinders horsepower median mean
0 1 100 5299 5000
1 1 120 5100 5200
2 2 70 7200 6500
3 2 80 1200 1000
The values in my example table are random. I can't manage to achieve that layout.
I know how to get the median and mean; that is described in this Stack Overflow question:
df.weight.median()
df.weight.mean()
df.groupby('cylinders')  # group by cylinders
But how do I create these subcolumns?
The following code fragment adds the two requested columns. It groups the rows by cylinders, calculates the mean and median of weight, and joins the result back onto the original dataframe (the on='cylinders' ensures each row receives the statistics of its own group):
result = df.join(df.groupby('cylinders')['weight']
                   .agg(['mean', 'median']),
                 on='cylinders')

#    cylinders  horsepower  weight    mean  median
# 0          2         120    5400  5800.0  5800.0
# 1          2         100    6200  5800.0  5800.0
# 2          1          80    7200  4200.0  4200.0
# 3          1          70    1200  4200.0  4200.0
You cannot have "subcolumns" for only some columns in pandas: if one column has subcolumns, then all other columns must have subcolumns too. This is called multiindexing (a MultiIndex on the columns).
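For illustration, a minimal sketch of such a column MultiIndex built with groupby/agg (note this aggregates to one row per cylinders value, which may or may not be the layout you want):

result = df.groupby('cylinders').agg({'horsepower': 'first',
                                      'weight': ['median', 'mean']})

#           horsepower  weight
#                first  median    mean
# cylinders
# 1                 80  4200.0  4200.0
# 2                120  5800.0  5800.0

Every top-level column now carries a subcolumn, which is the point made above.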