I have a data file containing different foetal ultrasound measurements. The measurements are collected at different points during pregnancy, like so:
PregnancyID MotherID gestationalAgeInWeeks abdomCirc
0 0 14 150
0 0 21 200
1 1 20 294
1 1 25 315
1 1 30 350
2 2 8 170
2 2 9 180
2 2 18 NaN
As you can see from the table above, I have multiple measurements per pregnancy (between 1 and 26 observations each).
I want to summarise the ultrasound measurements somehow so that I can replace the multiple measurements with a fixed number of features per pregnancy. So I thought of creating 3 new features, one for each trimester of pregnancy, each holding the maximum measurement recorded during that trimester:
abdomCirc1st: this feature would hold the maximum value of all abdominal circumference measurements measured between 0 to 13 Weeks
abdomCirc2nd: this feature would hold the maximum value of all abdominal circumference measurements measured between 14 to 26 Weeks
abdomCirc3rd: this feature would hold the maximum value of all abdominal circumference measurements measured between 27 to 40 Weeks
So my final dataset would look like this:
PregnancyID MotherID abdomCirc1st abdomCirc2nd abdomCirc3rd
0 0 NaN 200 NaN
1 1 NaN 315 350
2 2 180 NaN NaN
The reason for using the maximum here is that a larger abdominal circumference is associated with the adverse outcome I am trying to predict.
But I am quite confused about how to go about this. I have used the groupby function previously to derive certain statistical features from the multiple measurements; however, this is a more complex task.
What I want to do is the following:
Group all abdominal circumference measurements that belong to the same pregnancy into 3 trimesters based on gestationalAgeInWeeks value
Compute the maximum value of all abdominal circumference measurements within each trimester, and assign this value to the relevant feature: abdomCirc1st, abdomCirc2nd or abdomCirc3rd.
I think I have to do something along the lines of:
df["abdomCirc1st"] = df.groupby(['MotherID', 'PregnancyID', 'gestationalAgeInWeeks'])["abdomCirc"].transform('max')
But this code does not check what trimester the measurement was taken in (gestationalAgeInWeeks). I would appreciate some help with this task.
You can try this. It's a bit of a complicated query, but it seems to work:
(df.groupby(['MotherID', 'PregnancyID'])
   .apply(lambda d: d.assign(tm=(d['gestationalAgeInWeeks'] + 13 - 1) // 13)
                     .groupby('tm')['abdomCirc']
                     .apply(max))
   .unstack()
)
produces
tm 1 2 3
MotherID PregnancyID
0 0 NaN 200.0 NaN
1 1 NaN 315.0 350.0
2 2 180.0 NaN NaN
Let's unpick this a bit. First we group by MotherID and PregnancyID. Then we apply a function to each grouped dataframe (d).
For each d, we create a 'trimester' column 'tm' via assign (I assume I got the math right here, but correct it if it is wrong!), then we group by 'tm' and apply max. For each sub-dataframe d we thus obtain a Series mapping tm to max(abdomCirc).
Then unstack() moves tm into the column names.
You may want to rename these columns later, but I did not bother.
Solution 2
Come to think of it, you can simplify the above a bit:
(df.assign(tm=(df['gestationalAgeInWeeks'] + 13 - 1) // 13)
   .drop(columns='gestationalAgeInWeeks')
   .groupby(['MotherID', 'PregnancyID', 'tm'])
   .agg('max')
   .unstack()
)
Similar idea, same output.
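If you then want the column names from your desired output rather than the numeric trimester labels, you can rename the unstacked columns afterwards. A minimal sketch, assuming the result of Solution 2 is assigned to res and all three trimesters occur somewhere in the data (otherwise unstack produces fewer columns):
res = (df.assign(tm=(df['gestationalAgeInWeeks'] + 13 - 1) // 13)
         .drop(columns='gestationalAgeInWeeks')
         .groupby(['MotherID', 'PregnancyID', 'tm'])
         .agg('max')
         .unstack())
# collapse the ('abdomCirc', tm) MultiIndex columns into the requested feature names
res.columns = ['abdomCirc1st', 'abdomCirc2nd', 'abdomCirc3rd']
res = res.reset_index()  # MotherID and PregnancyID back as ordinary columns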
There is a handy method called query. This should do the work for now:
abdomCirc1st = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks <= 13')['abdomCirc'].max()
abdomCirc2nd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 14 and gestationalAgeInWeeks <= 26')['abdomCirc'].max()
abdomCirc3rd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 27 and gestationalAgeInWeeks <= 40')['abdomCirc'].max()
If you want something more automatic (so that you don't have to manually change the ID values, MotherID and PregnancyID, for every different group of rows), you have to combine it with groupby (as you did on your own); see the sketch below.
Check this as well: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html
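One way to make it automatic is to bin the weeks into trimesters with pd.cut and then group; this is a sketch of my own (the bin edges and column labels follow the question, not the query calls above):
import pandas as pd

# bin gestational age into the three trimesters, then take the max per pregnancy and trimester
trimester = pd.cut(df['gestationalAgeInWeeks'], bins=[0, 13, 26, 40],
                   labels=['abdomCirc1st', 'abdomCirc2nd', 'abdomCirc3rd'])
out = (df.assign(trimester=trimester)
         .groupby(['MotherID', 'PregnancyID', 'trimester'], observed=False)['abdomCirc']
         .max()
         .unstack())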
I'm building a report in Python to automate a lot of manual transformation we do in Excel at the moment. I'm able to extract the data and pivot it, to get something like this
Date      Category 1  Category 2  Category 3  Misc
01/01/21          40          30          30    10
02/01/21          30          20          50    20
Is it possible to divide the Misc total for each date into the other categories by ratio? So I would end up with the below:
Date      Category 1  Category 2  Category 3
01/01/21          44          33          33
02/01/21          36          24          60
The only way I can think of is to split the misc values off to their own table, work out the ratios of the other categories, and then add misc * ratio to each category value, but I just wondered if there was a function I could use to condense the working on this?
Thanks
I think your solution hits the nail on the head. However, it can already be done quite concisely:
>>> cat = df.filter(regex='Category')
>>> df.update(cat + cat.mul(df['Misc'] / cat.sum(axis=1), axis=0))
>>> df.drop(columns=['Misc'])
Date Category 1 Category 2 Category 3
0 01/01/21 44.0 33.0 33.0
1 02/01/21 36.0 24.0 60.0
cat.mul(df['Misc'] / cat.sum(axis=1), axis=0) gets you the reallocated Misc values per row, since you multiply each category value by Misc and divide it by the row total. .mul() lets you do the multiplication while specifying the axis; the rest is about having the right columns.
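For reference, here is a self-contained version of the same idea, assuming the pivoted frame looks like the table in the question (column names taken from there):
import pandas as pd

df = pd.DataFrame({'Date': ['01/01/21', '02/01/21'],
                   'Category 1': [40, 30],
                   'Category 2': [30, 20],
                   'Category 3': [30, 50],
                   'Misc': [10, 20]})

cat = df.filter(regex='Category')
# spread each row's Misc over the categories in proportion to the category totals
df.update(cat + cat.mul(df['Misc'] / cat.sum(axis=1), axis=0))
print(df.drop(columns=['Misc']))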
In this particular issue, I have an imaginary city divided into squares - basically a MxN grid of squares covering the city. M and N can be relatively big, so I have cases with more than 40,000 square cells overall.
I have a number of customers Z distributed in this grid, some cells will contain many customers while others will be empty. I would like to find a way to place the minimum number of shops (only one per cell) to be able to serve all customers, with the restriction that all customers must be “in reach” of one shop and all customers need to be included.
As an additional couple of twist, I have these constraints/issues:
There is a maximum distance that a customer can travel - if the shop is in a cell too far away then the customer cannot be associated with that shop. Edit: it’s not really a distance, it’s a measure of how easy it is for a customer to reach a shop, so I can’t use circles...
While respecting the condition (1) above, there may well be multiple shops in reaching distance of the same customer. In this case, the closest shop should win.
At the moment I’m trying to ignore the issue of costs - many customers means bigger shops and larger costs - but maybe at some point I’ll think about that too. The problem is, I have no idea of the name of the problem I’m looking at nor about possible algorithmic solutions for it: can this be solved as a Linear Programming problem?
I normally code in Python, so any suggestions on a possible algorithmic approach and/or some code/libraries to solve it would be very much appreciated.
Thank you in advance.
Edit: as a follow-up, I kind of found out I could solve this problem as a MINLP "uncapacitated facility problem", but all the information I have found is way too complex: I don't care to know which customer is served by which shop, I only care to know if and where a shop is built. I have a secondary way - as post processing - to associate a customer to the most appropriate shop.
All the codes I found set up this monstrous linear system associating a constraint per customer per shop (as “explained” here: https://en.m.wikipedia.org/wiki/Facility_location_problem#Uncapacitated_facility_location), so in a situation like mine I could easily end up with a linear system with millions of rows and columns, which with integer/binary variables will take about the age of the universe to solve.
There must be an easier way to handle this...
I think this can be formulated as a set covering problem.
You say:
in a situation like mine I could easily end up with a linear system
with millions of rows and columns, which with integer/binary variables
will take about the age of the universe to solve
So let's see if that is even remotely true.
Step 1: generate some data
I generated a grid of 200 x 200, yielding 40,000 cells. I place at random M=500 customers. This looks like:
---- 22 PARAMETER cloc customer locations
x y
cust1 35 75
cust2 169 84
cust3 111 18
cust4 61 163
cust5 59 102
...
cust497 196 148
cust498 115 136
cust499 63 101
cust500 92 87
Step 2: calculate reach of each customer
The next step is to determine for each customer c the allowed locations (i,j) within reach. I created a large sparse boolean matrix reach(c,i,j) for this. I used the rule: if the Manhattan distance satisfies
|i-cloc(c,'x')| + |j-cloc(c,'y')| <= 10
then the store at (i,j) can service customer c. Only the nonzero entries are stored; this data structure has 106k elements.
Step 3: Form MIP model
We form a simple MIP model:

    minimize    numStores = sum((i,j), placeStore(i,j))
    subject to  sum((i,j) | reach(c,i,j), placeStore(i,j)) >= 1   for every customer c
                placeStore(i,j) in {0, 1}

The inequality constraint says: for every customer we need at least one store that is within reach. This is a very simple model to formulate and to implement.
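For completeness, here is a minimal, self-contained sketch of the same set-covering model in Python using PuLP with the bundled CBC solver (my own choice of library; the answer above used GAMS with a commercial solver, and the random data generation mirrors Steps 1 and 2):
import numpy as np
import pulp

rng = np.random.default_rng(0)
n = 200                                    # n x n grid of cells
cust = rng.integers(0, n, size=(500, 2))   # random (x, y) cell per customer

# reach[c] = set of cells (i, j) within Manhattan distance 10 of customer c
reach = {
    c: {(i, j)
        for i in range(max(0, x - 10), min(n, x + 11))
        for j in range(max(0, y - 10), min(n, y + 11))
        if abs(i - x) + abs(j - y) <= 10}
    for c, (x, y) in enumerate(cust)
}

cells = [(i, j) for i in range(n) for j in range(n)]
model = pulp.LpProblem("min_stores", pulp.LpMinimize)

# one binary variable per cell: 1 if a store is placed there
place = pulp.LpVariable.dicts("place", cells, cat="Binary")

# objective: minimise the total number of stores
model += pulp.lpSum(place[cell] for cell in cells)

# cover constraint: every customer needs at least one store within reach
for c, cells_in_reach in reach.items():
    model += pulp.lpSum(place[cell] for cell in cells_in_reach) >= 1

model.solve(pulp.PULP_CBC_CMD(msg=False))
print("stores needed:", sum(int(place[cell].value()) for cell in cells))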
Step 4: Solve
This is a large but easy MIP. It has 40,000 binary variables. It solves very fast. On my laptop it took less than 1 second with a commercial solver (3 seconds with open-source solver CBC).
The solution looks like:
---- 47 VARIABLE numStores.L = 113 number of stores
---- 47 VARIABLE placeStore.L store locations
j1 j6 j7 j8 j9 j15 j16 j17 j18
i4 1
i18 1
i40 1
i70 1
i79 1
i80 1
i107 1
i118 1
i136 1
i157 1
i167 1
i193 1
+ j21 j23 j26 j28 j29 j31 j32 j36 j38
i10 1
i28 1
i54 1
i72 1
i96 1
i113 1
i147 1
i158 1
i179 1
i184 1
i198 1
+ j39 j44 j45 j46 j49 j50 j56 j58 j59
i5 1
i18 1
i39 1
i62 1
i85 1
i102 1
i104 1
i133 1
i166 1
i195 1
+ j62 j66 j67 j68 j69 j73 j74 j76 j80
i11 1
i16 1
i36 1
i61 1
i76 1
i105 1
i112 1
i117 1
i128 1
i146 1
i190 1
+ j82 j84 j85 j88 j90 j92 j95 j96 j97
i17 1
i26 1
i35 1
i48 1
i68 1
i79 1
i97 1
i136 1
i156 1
i170 1
i183 1
i191 1
+ j98 j102 j107 j111 j112 j114 j115 j116 j118
i4 1
i22 1
i36 1
i56 1
i63 1
i68 1
i88 1
i100 1
i101 1
i111 1
i129 1
i140 1
+ j119 j121 j126 j127 j132 j133 j134 j136 j139
i11 1
i30 1
i53 1
i72 1
i111 1
i129 1
i144 1
i159 1
i183 1
i191 1
+ j140 j147 j149 j150 j152 j153 j154 j156 j158
i14 1
i35 1
i48 1
i83 1
i98 1
i117 1
i158 1
i174 1
i194 1
+ j161 j162 j163 j164 j166 j170 j172 j174 j175
i5 1
i32 1
i42 1
i61 1
i69 1
i103 1
i143 1
i145 1
i158 1
i192 1
i198 1
+ j176 j178 j179 j180 j182 j183 j184 j188 j191
i6 1
i13 1
i23 1
i47 1
i61 1
i81 1
i93 1
i103 1
i125 1
i182 1
i193 1
+ j192 j193 j196
i73 1
i120 1
i138 1
i167 1
I think we have debunked your statement that a MIP model is not a feasible approach to this problem.
Note that the age of the universe is 13.7 billion years or 4.3e17 seconds. So we have achieved a speed-up of about 1e17. This is a record for me.
Note that this model does not find the optimal locations for the stores, but only a configuration that minimizes the number of stores needed to service all customers. It is optimal in that sense. But the solution will not minimize the distances between customers and stores.
I want to have an extra column with the maximum relative difference [-] between the row values and the mean of those rows:
The df is filled with energy use data for several years.
The theoretical formula that should get me this is as follows:
df['max_rel_dif'] = MAX[ ABS(highest energy use - mean energy use), ABS(lowest energy use - mean energy use) ] / mean energy use
Initial dataframe:
ID y_2010 y_2011 y_2012 y_2013 y_2014
0 23 22631 21954.0 22314.0 22032 21843
1 43 27456 29654.0 28159.0 28654 2000
2 36 61200 NaN NaN 31895 1600
3 87 87621 86542.0 87542.0 88456 86961
4 90 58951 57486.0 2000.0 0 0
5 98 24587 25478.0 NaN 24896 25461
Desired dataframe:
ID y_2010 y_2011 y_2012 y_2013 y_2014 max_rel_dif
0 23 22631 21954.0 22314.0 22032 21843 0.02149
1 43 27456 29654.0 28159.0 28654 2000 0.91373
2 36 61200 NaN NaN 31895 1600 0.94931
3 87 87621 86542.0 87542.0 88456 86961 0.01179
4 90 58951 57486.0 2000.0 0 0 1.48870
5 98 24587 25478.0 NaN 24896 25461 0.02065
tried code:
import pandas as pd
import numpy as np
df = pd.DataFrame({"ID": [23,43,36,87,90,98],
"y_2010": [22631,27456,61200,87621,58951,24587],
"y_2011": [21954,29654,np.nan,86542,57486,25478],
"y_2012": [22314,28159,np.nan,87542,2000,np.nan],
"y_2013": [22032,28654,31895,88456,0,24896,],
"y_2014": [21843,2000,1600,86961,0,25461]})
print(df)
a = df.loc[:, ['y_2010','y_2011','y_2012','y_2013', 'y_2014']]
# calculate mean
mean = a.mean(1)
# calculate max_rel_dif
df['max_rel_dif'] = (((df.max(axis=1).sub(mean)).abs(),(df.min(axis=1).sub(mean)).abs()).max()).div(mean)
# AttributeError: 'tuple' object has no attribute 'max'
-> I'm obviously doing the wrong thing with the tuple; I just don't know how to get the maximum values from the tuple and then divide them by the mean in the proper Pythonic way.
I feel like the whole calculation can be:
s=df.filter(like='y')
s.sub(s.mean(1),axis=0).abs().max(1)/s.mean(1)
0 0.021494
1 0.913736
2 0.949311
3 0.011800
4 1.488707
5 0.020653
dtype: float64
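If you want that result straight back in the dataframe as the requested column, it is just the same expression assigned to df['max_rel_dif']:
s = df.filter(like='y')
df['max_rel_dif'] = s.sub(s.mean(1), axis=0).abs().max(1) / s.mean(1)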
I have a dataframe with multiple columns
df = pd.DataFrame({"cylinders":[2,2,1,1],
"horsepower":[120,100,89,70],
"weight":[5400,6200,7200,1200]})
cylinders horsepower weight
0 2 120 5400
1 2 100 6200
2 1 80 7200
3 1 70 1200
I would like to create a new dataframe with two subcolumns under weight, holding the median and the mean, while grouping by cylinders.
example:
weight
cylinders horsepower median mean
0 1 100 5299 5000
1 1 120 5100 5200
2 2 70 7200 6500
3 2 80 1200 1000
For my example table I have used random values. I can't manage to achieve that.
I know how to get the median and mean; it's described in this Stack Overflow question:
df.weight.median()
df.weight.mean()
df.groupby('cylinders') #groupby cylinders
But how to create this subcolumn?
The following code fragment adds the two requested columns. It groups the rows by cylinders, calculates the mean and median of weight, and combines the original dataframe and the result:
result = (df.join(df.groupby('cylinders')['weight']
                    .agg(['mean', 'median']),
                  on='cylinders')
            .sort_values(['cylinders', 'mean']))
#    cylinders  horsepower  weight    mean  median
# 2          1          80    7200  4200.0  4200.0
# 3          1          70    1200  4200.0  4200.0
# 0          2         120    5400  5800.0  5800.0
# 1          2         100    6200  5800.0  5800.0
You cannot have "subcolumns" for only some columns in pandas: if one column has subcolumns, all the other columns must have them too. This column structure is called a MultiIndex.
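If a grouped summary with median and mean as subcolumns under weight is what you are after, groupby plus agg with a dict gives you exactly that MultiIndex column layout. A minimal sketch (values computed from the dataframe in the question):
out = df.groupby('cylinders').agg({'weight': ['median', 'mean']})
print(out)
#            weight
#            median    mean
# cylinders
# 1          4200.0  4200.0
# 2          5800.0  5800.0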
I have some experimental data collected from a number of samples at set time intervals, in a dataframe organised like so:
Studynumber Time Concentration
1 20 80
1 40 60
1 60 40
2 15 95
2 44 70
2 65 30
Although the time intervals are supposed to be fixed, there is some variation in the data based on when they were actually collected. I want to create bins of the Time column, calculate an 'average' concentration, and then compare the difference between actual concentration and average concentration for each studynumber, at each time.
To do this, I created a column called 'roundtime', then used a groupby to calculate the mean:
data['roundtime']=data['Time'].round(decimals=-1)
meanconc = data.groupby('roundtime')['Concentration'].mean()
This gives a pandas series of the mean concentrations, with roundtime as the index. Then I want to get this back into the main frame to calculate the difference between each actual concentration and the mean concentration:
data['meanconcentration']=meanconc.loc[data['roundtime']].reset_index()['Concentration']
This works for the first 60 or so values, but then returns NaN for each entry, I think because the index of data is longer than the index of meanconcentration.
On the one hand, this looks like an indexing issue - equally, it could be that I'm just approaching this the wrong way. So my question is: a) can this method work? and b) is there another/better way of doing it? All advice welcome!
Use transform to add a column from a groupby aggregation; this will create a Series with its index aligned to the original df, so you can assign it back correctly:
In [4]:
df['meanconcentration'] = df.groupby('roundtime')['Concentration'].transform('mean')
df
Out[4]:
Studynumber Time Concentration roundtime meanconcentration
0 1 20 80 20 87.5
1 1 40 60 40 65.0
2 1 60 40 60 35.0
3 2 15 95 20 87.5
4 2 44 70 40 65.0
5 2 65 30 60 35.0
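From there, the comparison you described is just a column subtraction (the column name diff is my own choice, not from the question):
df['diff'] = df['Concentration'] - df['meanconcentration']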