Automatic detection of column unit and unit conversion - python

I need to automatically detect the units of the columns, convert the values to the correct unit, and rename the columns.
The standard units should be:
'Inch' (In) for the 'Height' column
'Degree Celsius' (°C) for the 'Temperature' column.
Sample tables where units need to be converted are shown below. A table could have a mixture of units, some that need to be converted and some that don't. Any idea?
Height (mm) Temperature(°F)
16 27
12 30
17 32
20 23
14 43
Height (mm) Temperature(°C)
14 31
13 42
19 50
22 28
18 36
1 mm = 0.0393701 inch
T(°C) = (T(°F) - 32) × 5/9

You can filter the column names with the .filter method, then apply your conversion, and finally rename:
import pandas as pd

df = pd.DataFrame({'Height (mm)': [16, 12, 17, 20, 14],
                   'Temperature(°F)': [27, 30, 32, 23, 43]})
print('before conversion', df)

# Convert every Fahrenheit column to Celsius and rename it
for x in df.filter(like='°F', axis=1).keys():
    df[x] = (df[x] - 32) * 5 / 9
    df.rename(columns={x: x.replace('°F', '°C')}, inplace=True)

# Convert every millimetre column to inches and rename it
for x in df.filter(like='(mm)', axis=1).keys():
    df[x] = df[x] * 0.0393701
    df.rename(columns={x: x.replace('(mm)', '(In)')}, inplace=True)

print('After conversion', df)
The output looks like:
before conversion
Height (mm) Temperature(°F)
0 16 27
1 12 30
2 17 32
3 20 23
4 14 43
After conversion
Height (In) Temperature(°C)
0 0.629922 -2.777778
1 0.472441 -1.111111
2 0.669292 0.000000
3 0.787402 -5.000000
4 0.551181 6.111111
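If you have more unit pairs to handle, a table-driven variant keeps the loops generic. This is a minimal sketch, assuming the source unit always appears in the column name exactly as listed; the CONVERSIONS table and the convert_units helper are illustrative names, not part of pandas:
import pandas as pd

# (source unit, target unit, conversion function) — extend as needed
CONVERSIONS = [
    ('(mm)', '(In)', lambda s: s * 0.0393701),
    ('(°F)', '(°C)', lambda s: (s - 32) * 5 / 9),
]

def convert_units(df):
    """Convert and rename every column whose name contains a known source unit."""
    df = df.copy()
    for src, dst, fn in CONVERSIONS:
        for col in df.filter(like=src, axis=1).columns:
            df[col] = fn(df[col])
            df.rename(columns={col: col.replace(src, dst)}, inplace=True)
    return df

df = pd.DataFrame({'Height (mm)': [16, 12], 'Temperature(°F)': [27, 30]})
print(convert_units(df))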


Sort rows of curve shaped data in python

I have a dataset that consists of 5 rows of points that each form a curve. I want to separate the inner row from the others, or if possible each row, and store them in separate arrays. Is there any way to do this, like somehow flattening the curved data and sorting it afterwards based on the x and y values?
I would like to assign each row numbers from left to right, from 0 up to the maximum of the row. Right now the labels for each dot are not useful to me and I can't change the labels.
Here are the first 50 data points of my data set:
x y
0 -6.4165 0.3716
1 -4.0227 2.63
2 -7.206 3.0652
3 -3.2584 -0.0392
4 -0.7565 2.1039
5 -0.0498 -0.5159
6 2.363 1.5329
7 -10.7253 3.4654
8 -8.0621 5.9083
9 -4.6328 5.3028
10 -1.4237 4.8455
11 1.8047 4.2297
12 4.8147 3.6074
13 -5.3504 8.1889
14 -1.7743 7.6165
15 1.1783 6.9698
16 4.3471 6.2411
17 7.4067 5.5988
18 -2.6037 10.4623
19 0.8613 9.7628
20 3.8054 9.0202
21 7.023 8.1962
22 9.9776 7.5563
23 0.1733 12.6547
24 3.7137 11.9097
25 6.4672 10.9363
26 9.6489 10.1246
27 12.5674 9.3369
28 3.2124 14.7492
29 6.4983 13.7562
30 9.2606 12.7241
31 12.4003 11.878
32 15.3578 11.0027
33 6.3128 16.7014
34 9.7676 15.6557
35 12.2103 14.4967
36 15.3182 13.5166
37 18.2495 12.5836
38 9.3947 18.5506
39 12.496 17.2993
40 15.3987 16.2716
41 18.2212 15.1871
42 21.1241 14.0893
43 12.3548 20.2538
44 15.3682 18.9439
45 18.357 17.8862
46 21.0834 16.6258
47 23.9992 15.4145
48 15.3776 21.9402
49 18.3568 20.5803
50 21.1733 19.3041
It seems that your curves have a pattern, so you can select the curve of interest using slicing. I had to offset the selection slightly to get the five curves, because the first 8 points are not in the same order as the rest of the data. So the initial 8 data points are discarded, but they can be added back in afterwards if required.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'x': [-6.4165, -4.0227, -7.206, -3.2584, -0.7565, -0.0498, 2.363, -10.7253, -8.0621, -4.6328, -1.4237, 1.8047, 4.8147, -5.3504, -1.7743, 1.1783, 4.3471, 7.4067, -2.6037, 0.8613, 3.8054, 7.023, 9.9776, 0.1733, 3.7137, 6.4672, 9.6489, 12.5674, 3.2124, 6.4983, 9.2606, 12.4003, 15.3578, 6.3128, 9.7676, 12.2103, 15.3182, 18.2495, 9.3947, 12.496, 15.3987, 18.2212, 21.1241, 12.3548, 15.3682, 18.357, 21.0834, 23.9992, 15.3776, 18.3568, 21.1733],
                   'y': [0.3716, 2.63, 3.0652, -0.0392, 2.1039, -0.5159, 1.5329, 3.4654, 5.9083, 5.3028, 4.8455, 4.2297, 3.6074, 8.1889, 7.6165, 6.9698, 6.2411, 5.5988, 10.4623, 9.7628, 9.0202, 8.1962, 7.5563, 12.6547, 11.9097, 10.9363, 10.1246, 9.3369, 14.7492, 13.7562, 12.7241, 11.878, 11.0027, 16.7014, 15.6557, 14.4967, 13.5166, 12.5836, 18.5506, 17.2993, 16.2716, 15.1871, 14.0893, 20.2538, 18.9439, 17.8862, 16.6258, 15.4145, 21.9402, 20.5803, 19.3041]})

# Generate the 5 dataframes: after the first 8 points, every 5th row belongs to the same curve
df_list = [df.iloc[i + 8::5, :] for i in range(5)]

# Generate the plot
fig = plt.figure()
for frame in df_list:
    plt.scatter(frame['x'], frame['y'])
plt.show()

# Print the data of the innermost curve
print(df_list[4])
OUTPUT:
The 5th dataframe df_list[4] contains the data of the innermost plot.
x y
12 4.8147 3.6074
17 7.4067 5.5988
22 9.9776 7.5563
27 12.5674 9.3369
32 15.3578 11.0027
37 18.2495 12.5836
42 21.1241 14.0893
47 23.9992 15.4145
You can then add the missing data like this:
# Retrieve the two missing points of the inner curve
inner_curve = pd.concat([df_list[4], df[5:7]]).sort_index(ascending=True)
print(inner_curve)
# Plot the inner curve only
fig2 = plt.figure()
plt.scatter(inner_curve['x'], inner_curve['y'], color = '#9467BD')
plt.show()
OUTPUT: inner curve
x y
5 -0.0498 -0.5159
6 2.3630 1.5329
12 4.8147 3.6074
17 7.4067 5.5988
22 9.9776 7.5563
27 12.5674 9.3369
32 15.3578 11.0027
37 18.2495 12.5836
42 21.1241 14.0893
47 23.9992 15.4145
Complete Inner Curve
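If you also want the left-to-right numbering from 0 that the question mentions, a small follow-up sketch: sort each extracted curve by x and reset the index. This assumes the df_list and inner_curve objects from the code above:
# Renumber each curve from left to right, 0..len(curve)-1
numbered = [frame.sort_values('x').reset_index(drop=True) for frame in df_list]
print(numbered[4])

# The same works for the completed inner curve
inner_numbered = inner_curve.sort_values('x').reset_index(drop=True)
print(inner_numbered)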

Given a SciPy discrete random variable distribution, how do I round a number to the closest value in that distribution? [duplicate]

What I ultimately want to do is round the expected value of a discrete random variable distribution to a valid number in the distribution. For example, if I am drawing evenly from the numbers [1, 5, 6], the expected value is 4, but I want to return the closest number to that (i.e., 5).
import numpy as np
from scipy.stats import rv_discrete

xk = (1, 5, 6)
pk = np.ones(len(xk)) / len(xk)
custom = rv_discrete(name='custom', values=(xk, pk))
print(custom.expect())
# 4.0

def round_discrete(discrete_rv_dist, val):
    # do something here
    return answer

print(round_discrete(custom, custom.expect()))
# 5.0
I don't know a priori what distribution will be used (i.e. it might not be integers, it might be an unbounded distribution), so I'm really struggling to think of an algorithm that is sufficiently generic. Edit: I just learned that rv_discrete doesn't work on non-integer xk values.
As for why I want to do this: I'm putting together a Monte Carlo simulation and want a "nominal" value for each distribution. I think the EV is the most physically appropriate choice rather than the mode or median. I might have values in the downstream simulation that have to be one of several discrete choices, so passing a value that is not within that set is not acceptable.
If there's already a nice way to do this in Python, that would be great; otherwise I can interpret math into code.
Here is R code that I think will do what you want, using Poisson data to illustrate:
set.seed(322)
x = rpois(100, 7) # 100 obs from POIS(7)
a = mean(x); a
[1] 7.16 # so 7 is the value we want
d = min(abs(x-a)); d # min distance btw a and actual Pois val
[1] 0.16
u = unique(x); u # unique Pois values observed
[1] 7 5 4 10 2 9 8 6 11 3 13 14 12 15
v = u[abs(u-a)==d]; v # unique val closest to a
[1] 7
Hope you can translate it to Python.
Another run:
set.seed(323)
x = rpois(100, 20)
a = mean(x); a
[1] 20.32
d = min(abs(x-a)); d
[1] 0.32
u = unique(x)
v = u[abs(u-a)==d]; v
[1] 20
x
[1] 17 16 20 23 23 20 19 23 21 19 21 20 22 25 13 15 19 19 14 27 19 30 17 19 23
[26] 16 23 26 33 16 11 23 14 21 24 12 18 20 20 19 26 12 22 24 20 22 17 23 11 19
[51] 19 26 17 17 11 17 23 21 26 13 18 28 22 14 17 25 28 24 16 15 25 26 22 15 23
[76] 27 19 21 17 23 21 24 23 22 23 18 25 14 24 25 19 19 21 22 16 28 18 11 25 23
u
[1] 17 16 20 23 19 21 22 25 13 15 14 27 30 26 33 11 24 12 18 28
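For convenience, here is a direct Python translation of the R approach above; a sketch assuming only NumPy (the seed will not reproduce the exact R numbers, since the generators differ):
import numpy as np

rng = np.random.default_rng(322)
x = rng.poisson(7, 100)        # 100 obs from Pois(7)
a = x.mean()                   # the value we want to round
u = np.unique(x)               # unique Poisson values observed
d = np.abs(u - a).min()        # min distance between a and an observed value
v = u[np.abs(u - a) == d]      # unique value(s) closest to a
print(a, v)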
Figured it out, and tested it working. If I plug my value X into the cdf, then I can plug that probability P = cdf(X) into the ppf. The values at ppf(P ± epsilon) will give me the closest values in the set to X.
Or more geometrically: for a discrete pmf, the point (X, P) will lie on a horizontal portion of the corresponding cdf. When you invert the cdf, (P, X) is now on a vertical section of the ppf. Taking P ± eps will give you the two nearest flat portions of the ppf connected to that vertical jump, which correspond to the valid values X1, X2. You can then do a simple difference to figure out which is closer to your target value.
import numpy as np

eps = np.finfo(float).eps

# `custom` is the rv_discrete instance defined in the question
ev = custom.expect()
p = custom.cdf(ev)

# ppf evaluated just below and just above p lands on the valid values bracketing ev
ev_candidates = custom.ppf([p - eps, p, p + eps])
ev_candidates_distance = abs(ev_candidates - ev)
ev_closest = ev_candidates[np.argmin(ev_candidates_distance)]
print(ev_closest)
# 5.0
Terms:
pmf - probability mass function
cdf - cumulative distribution function (cumulative sum of the pmf)
ppf - percentage point function (inverse of the cdf)
eps - epsilon (smallest possible increment)
Would the function ceil from the math library help? For example:
from math import ceil
print(float(ceil(3.333333333333333)))

How can I multiply a numpy array with pandas series?

I have a numpy array of shape (50,):
array([1.01255569e+00, 1.04166667e+00, 1.07158165e+00, 1.10229277e+00,
1.13430127e+00, 1.16387337e+00, 1.20365912e+00, 1.24007937e+00,
1.27877238e+00, 1.31856540e+00, 1.35281385e+00, 1.40291807e+00,
1.45180023e+00, 1.49700599e+00, 1.55183116e+00, 1.60051216e+00,
1.66002656e+00, 1.73370319e+00, 1.80115274e+00, 1.87687688e+00,
1.95312500e+00, 2.04750205e+00, 2.14961307e+00, 2.23613596e+00,
2.34082397e+00, 2.48015873e+00, 2.61780105e+00, 2.75027503e+00,
2.91715286e+00, 3.07881773e+00, 3.31564987e+00, 3.57142857e+00,
3.81679389e+00, 4.17362270e+00, 4.51263538e+00, 4.95049505e+00,
5.59284116e+00, 6.17283951e+00, 7.02247191e+00, 8.03858521e+00,
9.72762646e+00, 1.17370892e+01, 1.47928994e+01, 2.10084034e+01,
3.12500000e+01, 4.90196078e+01, 9.25925926e+01, 2.08333333e+02,
5.00000000e+02, 1.25000000e+03])
And I have a pandas dataframe of length 50 as well.
x
0 9.999740e-01
1 9.981870e-01
2 9.804506e-01
3 9.187764e-01
4 8.031568e-01
5 6.544660e-01
6 5.032716e-01
7 3.707446e-01
8 2.650768e-01
9 1.857835e-01
10 1.285488e-01
11 8.824506e-02
12 6.030141e-02
13 4.111080e-02
14 2.800453e-02
15 1.907999e-02
16 1.301045e-02
17 8.882996e-03
18 6.074386e-03
19 4.161024e-03
20 2.855636e-03
21 1.963543e-03
22 1.352791e-03
23 9.338596e-04
24 6.459459e-04
25 4.476854e-04
26 3.108912e-04
27 2.163201e-04
28 1.508106e-04
29 1.053430e-04
30 7.372442e-05
31 5.169401e-05
32 3.631486e-05
33 2.555852e-05
34 1.802129e-05
35 1.272995e-05
36 9.008454e-06
37 6.386289e-06
38 4.535381e-06
39 3.226546e-06
40 2.299394e-06
41 1.641469e-06
42 1.173785e-06
43 8.407618e-07
44 6.032249e-07
45 4.335110e-07
46 3.120531e-07
47 2.249870e-07
48 1.624726e-07
49 1.175140e-07
And I want to multiply every numpy cell with the corresponding pandas cell, element-wise.
Example:
1.01255569e+00*9.999740e-01
1.04166667e+00*9.981870e-01
Desired output: a numpy array of the same size.
You can just use the .values property of the 'x' series in your Pandas dataframe:
df['x'].values * arr
where df is your dataframe and arr is your array.
The above expression will return the result as a Numpy array. If you want a Pandas Series instead, you can omit the use of .values:
df['x'] * arr
Or np.multiply, multiplying the array n with p['x'].values (where n is your numpy array and p is your dataframe):
print(np.multiply(n,p['x'].values))
Or pd.Series.multiply:
print(np.array(p['x'].multiply(n)))
Or pd.Series.mul:
print(np.array(p['x'].mul(n)))
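For completeness, a minimal self-contained sketch (df/p refer to the dataframe and arr/n to the numpy array in the answers above; the three-element data here is a stand-in for the 50-element inputs):
import numpy as np
import pandas as pd

arr = np.array([1.01255569, 1.04166667, 1.07158165])                  # the numpy array
df = pd.DataFrame({'x': [9.999740e-01, 9.981870e-01, 9.804506e-01]})  # the dataframe

result = df['x'].values * arr   # element-wise product, returns a numpy array
print(result)                   # approximately [1.0125 1.0398 1.0506]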

Plot histogram using two columns (values, counts) in python dataframe

I have a dataframe having multiple columns in pairs: if one column is values then the adjacent column is the corresponding counts. I want to plot a histogram using values as x variable and counts as the frequency.
For example, I have the following columns:
Age Counts
60 1204
45 700
21 400
. .
. .
34 56
10 150
I want my code to bin the Age values in ten-year intervals between the maximum and minimum values, get the cumulative frequencies for each interval from the Counts column, and then plot a histogram. Is there a way to do this using matplotlib?
I have tried the following but in vain:
patient_dets.plot(x='PatientAge', y='PatientAgecounts', kind='hist')
(patient_dets is the dataframe with 'PatientAge' and 'PatientAgecounts' as columns)
I think you need Series.plot.bar:
patient_dets.set_index('PatientAge')['PatientAgecounts'].plot.bar()
If need bins, one possible solution is with pd.cut:
# helper df with min and max ages
df1 = pd.DataFrame({'G': ['14 yo and younger','15-19','20-24','25-29','30-34',
                          '35-39','40-44','45-49','50-54','55-59','60-64','65+'],
                    'Min': [0,15,20,25,30,35,40,45,50,55,60,65],
                    'Max': [14,19,24,29,34,39,44,49,54,59,64,120]})
print (df1)
G Max Min
0 14 yo and younger 14 0
1 15-19 19 15
2 20-24 24 20
3 25-29 29 25
4 30-34 34 30
5 35-39 39 35
6 40-44 44 40
7 45-49 49 45
8 50-54 54 50
9 55-59 59 55
10 60-64 64 60
11 65+ 120 65
cutoff = np.hstack([np.array(df1.Min[0]), df1.Max.values])
labels = df1.G.values
patient_dets['Groups'] = pd.cut(patient_dets.PatientAge, bins=cutoff, labels=labels, right=True, include_lowest=True)
print (patient_dets)
PatientAge PatientAgecounts Groups
0 60 1204 60-64
1 45 700 45-49
2 21 400 20-24
3 34 56 30-34
4 10 150 14 yo and younger
patient_dets.groupby(['PatientAge','Groups'])['PatientAgecounts'].sum().plot.bar()
You can use pd.cut() to bin your data and then plot with plot(kind='bar'):
import numpy as np
nBins = 10
# bin edges spanning the observed age range
my_bins = np.linspace(patient_dets.Age.min(), patient_dets.Age.max(), nBins)
patient_dets.groupby(pd.cut(patient_dets.Age, bins=my_bins, include_lowest=True)).sum()['Counts'].plot(kind='bar')
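Alternatively, since the Counts column already holds frequencies, you can pass it straight to matplotlib through the weights argument of plt.hist. A minimal sketch, assuming the Age/Counts column names from the sample table and ten-year-wide bins as requested:
import numpy as np
import matplotlib.pyplot as plt

ages = patient_dets['Age']
counts = patient_dets['Counts']

# ten-year bin edges covering the observed age range, e.g. [10, 20, ..., 70]
lo = ages.min() // 10 * 10
hi = ages.max() // 10 * 10 + 10
edges = np.arange(lo, hi + 1, 10)

plt.hist(ages, bins=edges, weights=counts)
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()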

Pandas: Creating another column with row column multiplication

Priority Expected Actual
High 47 30
Medium 22 14
Required 16 5
I'm trying to create two more columns: 'Expected_values', where the High row gets 47*5, the Medium row 22*3, and the Required row 16*10; and 'Actual_values', where the High row gets 30*5, the Medium row 14*3, and the Required row 5*10,
like this
Priority Expected Actual Expected_values Actual_values
Required 16 5 160 50
High 47 30 235 150
Medium 22 14 66 42
Any simple way to do that in pandas or numpy?
Try:
import numpy as np

# multipliers in the same row order as the dataframe: High, Medium, Required
a = np.array([5, 3, 10])
df['Expected_values'] = df.Expected * a
df['Actual_values'] = df.Actual * a
print(df)

   Priority  Expected  Actual  Expected_values  Actual_values
0      High        47      30              235            150
1    Medium        22      14               66             42
2  Required        16       5              160             50
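A variant that does not depend on row order maps the multiplier from the Priority label itself (a sketch; the multipliers dict just restates the numbers from the question):
import pandas as pd

df = pd.DataFrame({'Priority': ['High', 'Medium', 'Required'],
                   'Expected': [47, 22, 16],
                   'Actual': [30, 14, 5]})

# multiplier per priority level, taken from the question
multipliers = {'High': 5, 'Medium': 3, 'Required': 10}

m = df['Priority'].map(multipliers)
df['Expected_values'] = df['Expected'] * m
df['Actual_values'] = df['Actual'] * m
print(df)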
