I have a pandas dataframe with a column of continuous values. I need to convert them into 3 bins, such that the first bin covers values below the 20th percentile, the second covers values between the 20th and 80th percentiles, and the last covers values above the 80th percentile.
I am trying to do this by first computing the bin boundaries for those percentiles and then using the pandas cut function. The problem is that I get an odd result: only the middle bin is encoded. Please see below:
test = [x for x in range(0,100)]
a = pd.DataFrame(test)
np.percentile(a, [20, 80])
Out[52]: array([ 19.8, 79.2])
pd.cut(a[0], np.percentile(a[0], [20, 80]))
...
15 NaN
16 NaN
17 NaN
18 NaN
19 NaN
20 (19.8, 79.2]
21 (19.8, 79.2]
22 (19.8, 79.2]
...
78 (19.8, 79.2]
79 (19.8, 79.2]
80 NaN
Why is that so? I thought pandas cut requires you to supply the boundaries of the bins you want. By supplying 2 boundaries I expected to get 3 bins, but it seems it doesn't work this way.
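If you need 3 bins, then you need 4 bin edges: cut only assigns values that fall between the first and the last edge, so the minimum and maximum have to be included.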
test = [x for x in range(0,100)]
a = pd.DataFrame(test)
np.percentile(a, [0,20, 80,100])
Out[527]: array([ 0. , 19.8, 79.2, 99. ])
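# include_lowest=True keeps the minimum (0), which the default right-closed first bin would leave out
pd.cut(a[0], np.percentile(a[0], [0,20, 80,100]), include_lowest=True)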
Also, pandas has qcut, which takes the quantiles directly, so you do not need to get the bin edges from numpy:
pd.qcut(a[0],[0,0.2,0.8,1])
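As a minimal sketch of both approaches on the same 0 to 99 data (the labels low/mid/high are illustrative names, not from the question): qcut takes the quantiles directly, while cut can be given infinite outer edges so that nothing outside the two percentile boundaries is dropped.

```python
import numpy as np
import pandas as pd

a = pd.Series(range(100))

# qcut takes quantiles directly and includes the minimum in the first bin
by_qcut = pd.qcut(a, [0, 0.2, 0.8, 1], labels=["low", "mid", "high"])

# cut needs explicit edges; infinite outer edges cover everything
# outside the two percentile boundaries
lo, hi = np.percentile(a, [20, 80])
by_cut = pd.cut(a, [-np.inf, lo, hi, np.inf], labels=["low", "mid", "high"])

counts = by_qcut.value_counts().to_dict()
print(counts)
print((by_qcut.astype(str) == by_cut.astype(str)).all())
```

Both calls should put every row into one of the three bins, with no NaN at either end.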
I want to add an extra column holding the maximum relative difference [-] between the values in each row and that row's mean:
The df is filled with energy use data for several years.
The theoretical formula that should get me this is as follows:
df['max_rel_dif'] = MAX [ ABS(highest energy use – mean energy use), ABS(lowest energy use – mean energy use)] / mean energy use
Initial dataframe:
ID y_2010 y_2011 y_2012 y_2013 y_2014
0 23 22631 21954.0 22314.0 22032 21843
1 43 27456 29654.0 28159.0 28654 2000
2 36 61200 NaN NaN 31895 1600
3 87 87621 86542.0 87542.0 88456 86961
4 90 58951 57486.0 2000.0 0 0
5 98 24587 25478.0 NaN 24896 25461
Desired dataframe:
ID y_2010 y_2011 y_2012 y_2013 y_2014 max_rel_dif
0 23 22631 21954.0 22314.0 22032 21843 0.02149
1 43 27456 29654.0 28159.0 28654 2000 0.91373
2 36 61200 NaN NaN 31895 1600 0.94931
3 87 87621 86542.0 87542.0 88456 86961 0.01179
4 90 58951 57486.0 2000.0 0 0 1.48870
5 98 24587 25478.0 NaN 24896 25461 0.02065
Code I tried:
import pandas as pd
import numpy as np
df = pd.DataFrame({"ID": [23,43,36,87,90,98],
"y_2010": [22631,27456,61200,87621,58951,24587],
"y_2011": [21954,29654,np.nan,86542,57486,25478],
"y_2012": [22314,28159,np.nan,87542,2000,np.nan],
"y_2013": [22032,28654,31895,88456,0,24896,],
"y_2014": [21843,2000,1600,86961,0,25461]})
print(df)
a = df.loc[:, ['y_2010','y_2011','y_2012','y_2013', 'y_2014']]
# calculate mean
mean = a.mean(1)
# calculate max_rel_dif
df['max_rel_dif'] = (((df.max(axis=1).sub(mean)).abs(),(df.min(axis=1).sub(mean)).abs()).max()).div(mean)
# AttributeError: 'tuple' object has no attribute 'max'
-> I'm obviously doing the wrong thing with the tuple; I just don't know how to get the maximum values
from the tuples and then divide them by the mean in a proper Pythonic way.
I think the whole calculation can be reduced to:
s=df.filter(like='y')
s.sub(s.mean(1),axis=0).abs().max(1)/s.mean(1)
0 0.021494
1 0.913736
2 0.949311
3 0.011800
4 1.488707
5 0.020653
dtype: float64
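For comparison, here is a sketch that follows the formula from the question term by term. np.maximum takes the element-wise maximum of the two deviation Series, which is what the tuple in the attempted code was meant to do. It assumes the year columns are selected explicitly, so that ID does not enter the max, min, or mean.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": [23, 43, 36, 87, 90, 98],
                   "y_2010": [22631, 27456, 61200, 87621, 58951, 24587],
                   "y_2011": [21954, 29654, np.nan, 86542, 57486, 25478],
                   "y_2012": [22314, 28159, np.nan, 87542, 2000, np.nan],
                   "y_2013": [22032, 28654, 31895, 88456, 0, 24896],
                   "y_2014": [21843, 2000, 1600, 86961, 0, 25461]})

years = df.filter(like="y_")        # year columns only, ID excluded
mean = years.mean(axis=1)           # NaN values are skipped by default

hi_dev = (years.max(axis=1) - mean).abs()   # |highest - mean|
lo_dev = (years.min(axis=1) - mean).abs()   # |lowest - mean|

# Element-wise maximum of the two deviations, relative to the row mean
df["max_rel_dif"] = np.maximum(hi_dev, lo_dev) / mean
print(df["max_rel_dif"].round(5))
```

This is longer than the one-liner but keeps each term of the formula visible, which can make it easier to check.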
I'm trying to implement a paper where PIMA Indians Diabetes dataset is used. This is the dataset after imputing missing values:
Preg Glucose BP SkinThickness Insulin BMI Pedigree Age Outcome
0 1 148.0 72.000000 35.00000 155.548223 33.600000 0.627 50 1
1 1 85.0 66.000000 29.00000 155.548223 26.600000 0.351 31 0
2 1 183.0 64.000000 29.15342 155.548223 23.300000 0.672 32 1
3 1 89.0 66.000000 23.00000 94.000000 28.100000 0.167 21 0
4 0 137.0 40.000000 35.00000 168.000000 43.100000 2.288 33 1
5 1 116.0 74.000000 29.15342 155.548223 25.600000 0.201 30 0
The description:
df.describe()
Preg Glucose BP SkinThickness Insulin BMI Pedigree Age
count 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000
mean 0.855469 121.686763 72.405184 29.153420 155.548223 32.457464 0.471876 33.240885
std 0.351857 30.435949 12.096346 8.790942 85.021108 6.875151 0.331329 11.760232
min 0.000000 44.000000 24.000000 7.000000 14.000000 18.200000 0.078000 21.000000
25% 1.000000 99.750000 64.000000 25.000000 121.500000 27.500000 0.243750 24.000000
50% 1.000000 117.000000 72.202592 29.153420 155.548223 32.400000 0.372500 29.000000
75% 1.000000 140.250000 80.000000 32.000000 155.548223 36.600000 0.626250 41.000000
max 1.000000 199.000000 122.000000 99.000000 846.000000 67.100000 2.420000 81.000000
The description of normalization from the paper is as follows:
As part of our data preprocessing, the original data values are scaled so as to fall within a small specified range of [0,1] values by performing normalization of the dataset. This will improve speed and reduce runtime complexity. Using the Z-Score we normalize our value set V to obtain a new set of normalized values V’ with the equation below:
V' = (V - Y) / Z
where V’= New normalized value, V=previous value, Y=mean and Z=standard deviation
z=scipy.stats.zscore(df)
But when I try to run the code above, I'm getting negative values and values greater than one i.e., not in the range [0,1].
There are several points to note here.
Firstly, z-score normalisation will not result in features in the range [0, 1] unless the input data has very specific characteristics.
Secondly, as others have noted, two of the most common ways of normalising data are standardisation and min-max scaling.
Set up data
import string
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv')
# For the purposes of this exercise, we'll just use the alphabet as column names
df.columns = list(string.ascii_lowercase)[:len(df.columns)]
$ print(df.head())
a b c d e f g h i
0 1 85 66 29 0 26.6 0.351 31 0
1 8 183 64 0 0 23.3 0.672 32 1
2 1 89 66 23 94 28.1 0.167 21 0
3 0 137 40 35 168 43.1 2.288 33 1
4 5 116 74 0 0 25.6 0.201 30 0
Standardisation
# print the minimum and maximum values in the entire dataset with a little formatting
$ print(f"Min: {standardised.min().min():4.3f} Max: {standardised.max().max():4.3f}")
Min: -4.055 Max: 845.307
As you can see, the values are far from being in [0, 1]. Note the range of the resulting data from z-score normalisation will vary depending on the distribution of the input data.
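The line that defines standardised is not shown above. As a sketch, assuming a per-column z-score with explicit parentheses, (V - Y) / Z, a few Pedigree and Age values taken from the question already land on both sides of [0, 1]. The extremes on the full dataset will differ from this toy sample.

```python
import pandas as pd

# A few rows taken from the question's imputed data, for illustration only
df = pd.DataFrame({
    "Pedigree": [0.627, 0.351, 0.672, 0.167, 2.288, 0.201],
    "Age": [50, 31, 32, 21, 33, 30],
})

# z-score per column: subtract the column mean, then divide by the column
# standard deviation (ddof=0 matches the default of scipy.stats.zscore)
standardised = (df - df.mean()) / df.std(ddof=0)

print(f"Min: {standardised.min().min():4.3f} Max: {standardised.max().max():4.3f}")
```

Each standardised column has mean 0 and standard deviation 1, so any non-constant column necessarily contains negative values.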
Min-max scaling
min_max = (df - df.values.min()) / (df.values.max() - df.values.min())
# print the minimum and maximum values in the entire dataset with a little formatting
$ print(f"Min: {min_max.min().min():4.3f} Max: {min_max.max().max():4.3f}")
Min: 0.000 Max: 1.000
Here we do indeed get values in [0, 1].
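Note that the expression above uses the single global minimum and maximum of the whole frame. A sketch of the per-column variant, which is what min-max scaling of features usually refers to, on a few Glucose and Age values from the question:

```python
import pandas as pd

# A few rows taken from the question's data, for illustration only
df = pd.DataFrame({
    "Glucose": [148.0, 85.0, 183.0, 89.0, 137.0, 116.0],
    "Age": [50, 31, 32, 21, 33, 30],
})

# df.min() and df.max() are computed per column, and the arithmetic
# aligns on column labels, so each feature is scaled independently
min_max = (df - df.min()) / (df.max() - df.min())

print(min_max.min().tolist(), min_max.max().tolist())
```

With the per-column form every feature spans the full [0, 1] range, whereas with global scaling a column with small values is squeezed close to 0.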
Discussion
These and a number of other scalers exist in the sklearn preprocessing module. I recommend reading the sklearn documentation and using these instead of doing it manually, for various reasons:
There are fewer chances of making a mistake as you have to do less typing.
sklearn will be at least as computationally efficient and often more so.
You should use the same scaling parameters from training on the test data to avoid leakage of test data information. (In most real world uses, this is unlikely to be significant but it is good practice.) By using sklearn you don't need to store the min/max/mean/SD etc. from scaling training data to reuse subsequently on test data. Instead, you can just use scaler.fit_transform(X_train) and scaler.transform(X_test).
If you want to reverse the scaling later on, you can use scaler.inverse_transform(data).
I'm sure there are other reasons, but these are the main ones that come to mind.
Your standardization formula is not meant to put values in the range [0, 1].
If you want to normalize the data into that range, you can use the following formula:
z = (actual_value - min_value_in_database)/(max_value_in_database - min_value_in_database)
You are also not obliged to do this manually: the sklearn library offers several standardization and normalization methods in its preprocessing module.
Assuming your original dataframe is df and it has no invalid float values, this should work
df2 = (df - df.values.min()) / (df.values.max()-df.values.min())
I have some experimental data collected from a number of samples at set time intervals, in a dataframe organised like so:
Studynumber Time Concentration
1 20 80
1 40 60
1 60 40
2 15 95
2 44 70
2 65 30
Although the time intervals are supposed to be fixed, there is some variation in the data based on when they were actually collected. I want to create bins of the Time column, calculate an 'average' concentration, and then compare the difference between actual concentration and average concentration for each studynumber, at each time.
To do this, I created a column called 'roundtime', then used a groupby to calculate the mean:
data['roundtime']=data['Time'].round(decimals=-1)
meanconc = data.groupby('roundtime')['Concentration'].mean()
This gives a pandas series of the mean concentrations, with roundtime as the index. Then I want to get this back into the main frame to calculate the difference between each actual concentration and the mean concentration:
data['meanconcentration']=meanconc.loc[data['roundtime']].reset_index()['Concentration']
This works for the first 60 or so values, but then returns NaN for each entry, I think because the index of data is longer than the index of meanconcentration.
On the one hand, this looks like an indexing issue - equally, it could be that I'm just approaching this the wrong way. So my question is: a) can this method work? and b) is there another/better way of doing it? All advice welcome!
Use transform to add a column from a groupby aggregation. This creates a Series with its index aligned to the original df, so you can assign it back correctly:
In [4]:
df['meanconcentration'] = df.groupby('roundtime')['Concentration'].transform('mean')
df
Out[4]:
Studynumber Time Concentration roundtime meanconcentration
0 1 20 80 20 87.5
1 1 40 60 40 65.0
2 1 60 40 60 35.0
3 2 15 95 20 87.5
4 2 44 70 40 65.0
5 2 65 30 60 35.0
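Putting the steps together on the sample data from the question, as a sketch: the column name diff is an assumption introduced here for the difference between each measurement and its bin mean.

```python
import pandas as pd

data = pd.DataFrame({
    "Studynumber": [1, 1, 1, 2, 2, 2],
    "Time": [20, 40, 60, 15, 44, 65],
    "Concentration": [80, 60, 40, 95, 70, 30],
})

# Bin the times to the nearest 10, as in the question
data["roundtime"] = data["Time"].round(decimals=-1)

# transform returns one value per original row, aligned on data's index
data["meanconcentration"] = (
    data.groupby("roundtime")["Concentration"].transform("mean")
)

# Difference between each measurement and the mean of its time bin
data["diff"] = data["Concentration"] - data["meanconcentration"]
print(data)
```

Because transform preserves the original index, no reset_index or label lookup is needed, which avoids the NaN values described in the question.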
I have an Ebola dataset with 499 records. I am trying to find the number of observations in each quintile based on prob (the probability variable). The observations should fall into the categories 0-20%, 20-40%, etc. The code I think should do this is:
test = pd.qcut(ebola.prob,5).value_counts()
this returns
[0.044, 0.094] 111
(0.122, 0.146] 104
(0.106, 0.122] 103
(0.146, 0.212] 92
(0.094, 0.106] 89
My question is how do I sort this to return the correct number of observations for 0-20%, 20-40% 40-60% 60-80% 80-100%?
I have tried
test.value_counts(sort=False)
This returns
104 1
89 1
92 1
103 1
111 1
Is this the order 104,89,92,103,111? for each quintile?
I am confused because if I look at the probability outputs from my first piece of code it looks like it should be 111,89,103,104,92?
What you're doing is essentially correct but you might have two issues:
I think you are using pd.cut() instead of pd.qcut().
You are applying value_counts() one too many times.
(1) You can reference this question here; when you use pd.qcut(), you should have the same number of records in each bin (assuming that your total number of records is evenly divisible by the number of bins), which you do not. Check that you are using the function you intended to use.
Here is some random data to illustrate (2):
>>> np.random.seed(1234)
>>> arr = np.random.randn(100).reshape(100,1)
>>> df = pd.DataFrame(arr, columns=['prob'])
>>> pd.cut(df.prob, 5).value_counts()
(0.00917, 1.2] 47
(-1.182, 0.00917] 34
(1.2, 2.391] 9
(-2.373, -1.182] 8
(-3.569, -2.373] 2
Passing sort=False will get you what you want:
>>> pd.cut(df.prob, 5).value_counts(sort=False)
(-3.569, -2.373] 2
(-2.373, -1.182] 8
(-1.182, 0.00917] 34
(0.00917, 1.2] 47
(1.2, 2.391] 9
or with pd.qcut()
>>> pd.qcut(df.prob, 5).value_counts(sort=False)
[-3.564, -0.64] 20
(-0.64, -0.0895] 20
(-0.0895, 0.297] 20
(0.297, 0.845] 20
(0.845, 2.391] 20
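Applied to the question, a sketch of the fix is to call value_counts() once and then order the result by bin rather than by count. sort_index() is used here as an alternative to sort=False; it orders the interval bins from lowest to highest. The data is random stand-in data, not the Ebola dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1234)
prob = pd.Series(rng.random(500), name="prob")  # stand-in for ebola.prob

# One value_counts() call, then sort by bin (the index), not by count
counts = pd.qcut(prob, 5).value_counts().sort_index()
print(counts)
```

Calling value_counts() a second time on this result would count how often each count occurs, which is why the question's second output shows a 1 next to every number.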
I am trying to do spline interpolation between two arrays in Python. My data set looks like this:
| 5 15
-------+--------------------
1 32.68 29.16
2 32.73 27.20
3 32.78 28.24
4 32.83 27.27
5 32.88 25.27
6 32.93 31.35
7 32.98 27.39
8 33.03 26.42
9 33.08 27.46
10 33.13 30.50
11 33.18 27.53
12 33.23 29.57
13 33.23 27.99
14 33.23 28.64
15 33.23 26.68
16 33.23 29.72
And I am trying to do a spline interpolation between the two points and produce the values for 10, something that will eventually look like this (but spline interpolated):
| 10
-----+--------
1 30.92
2 29.965
3 30.51
4 30.05
5 29.075
6 32.14
7 30.185
8 29.725
9 30.27
10 31.815
11 30.355
12 31.4
13 30.61
14 30.935
15 29.955
16 31.475
I have been looking at examples of using scipy.interpolate.InterpolatedUnivariateSpline, but it seems to take only one array for x and one for y, and I can't figure out how to make it interpolate these two arrays.
Can someone please help point me in the right direction?
With the amount of data you have, only two points for each x value, piecewise linear interpolation is the most practical tool. Taking your two arrays to be v5 and v15 (values along y=5 line and y=15 line), and the x-values to be 1,2, ..., 16, we can create a piecewise linear interpolant like this:
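import numpy as np
# Note: interp2d was deprecated in SciPy 1.10 and removed in 1.14, so this needs an older SciPy
from scipy.interpolate import interp2d
f = interp2d(np.arange(1, 17), [5, 15], np.stack((v5, v15)), kind='linear')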
This can be evaluated in the usual way: for example, f(np.arange(1, 17), 10) returns precisely the numbers you expected.
[ 30.92 , 29.965, 30.51 , 30.05 , 29.075, 32.14 , 30.185,
29.725, 30.27 , 31.815, 30.355, 31.4 , 30.61 , 30.935,
29.955, 31.475]
interp2d can also create a cubic bivariate spline, but not from this data: your grid is too small in the y-direction.
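The same linear interpolation can also be written with NumPy alone, which avoids depending on interp2d. This is a sketch: the names v5 and v15 follow the answer, and t is a helper introduced here for the fractional position of y=10 between the two lines.

```python
import numpy as np

v5 = np.array([32.68, 32.73, 32.78, 32.83, 32.88, 32.93, 32.98, 33.03,
               33.08, 33.13, 33.18, 33.23, 33.23, 33.23, 33.23, 33.23])
v15 = np.array([29.16, 27.20, 28.24, 27.27, 25.27, 31.35, 27.39, 26.42,
                27.46, 30.50, 27.53, 29.57, 27.99, 28.64, 26.68, 29.72])

y0, y1, y_new = 5, 15, 10
t = (y_new - y0) / (y1 - y0)   # fractional position between the two lines

# Linear interpolation row by row: equals v5 at t=0 and v15 at t=1
v10 = v5 + t * (v15 - v5)
print(np.round(v10, 3))
```

Changing y_new gives the values along any other line between y=5 and y=15. With only two lines of data there is nothing more for a spline to fit in that direction.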