Interpolate between two columns of a Dataframe - python

I am attempting to interpolate a value based on a number's position in a different column. Take this column for instance:
Coupon Price
9.5 109.04
9.375 108.79
9.25 108.54
9.125 108.29
9 108.04
8.875 107.79
8.75 107.54
8.625 107.29
8.5 107.04
8.375 106.79
8.25 106.54
Let's say I have a number like 107. I want to find 107's relative distance from both 107.04 and 106.79, then interpolate the value that sits at the same relative distance between 8.5 and 8.375, the coupon values at the same indices. Is this possible? I can solve this in Excel using the FORECAST function, but I want to know if it can be done in Python.

Welcome to Stack Overflow.
We need to make a custom function for this, unless there's a standard library function I'm unaware of, which is entirely possible. I'm going to make a function that lets you enter a bond by price and inserts it into the dataframe with the appropriate coupon.
Assuming we are starting with a sorted dataframe.
print(df)
Coupon Price
0 9.500 109.04
1 9.375 108.79
2 9.250 108.54
3 9.125 108.29
4 9.000 108.04
5 8.875 107.79
6 8.750 107.54
7 8.625 107.29
8 8.500 107.04
9 8.375 106.79
10 8.250 106.54
I've inserted comments into the function.
import numpy as np
def add_bond(Price, df):
    # Assumes Price lies strictly between the lowest and highest existing prices
    # Add row (Coupon is unknown for now)
    df.loc[df.shape[0]] = [np.nan, Price]
    df = df.sort_values('Price', ascending=False).reset_index(drop=True)
    # Get index
    idx = df[df['Price'] == Price].head(1).index.tolist()[0]
    # Get the distance between Prices from previous row to next row
    span = abs(df.iloc[idx-1, 1] - df.iloc[idx+1, 1]).round(4)
    # Get the distance and direction of Price from previous row to new value
    terp = (df.iloc[idx, 1] - df.iloc[idx-1, 1]).round(4)
    # Find the movement from previous as a fraction of the span
    moved = terp / span
    # Finally calculate the move from the previous for Coupon
    df.iloc[idx, 0] = df.iloc[idx-1, 0] + (abs(df.iloc[idx-1, 0] - df.iloc[idx+1, 0]) * moved)
    return df
A function to calculate the Coupon of a new bond using Price in the DataFrame.
# Add 107
df = add_bond(107, df)
print(df)
Coupon Price
0 9.500 109.04
1 9.375 108.79
2 9.250 108.54
3 9.125 108.29
4 9.000 108.04
5 8.875 107.79
6 8.750 107.54
7 8.625 107.29
8 8.500 107.04
9 8.480 107.00
10 8.375 106.79
11 8.250 106.54
Add one more.
# Add 107.9
df = add_bond(107.9, df)
print(df)
Coupon Price
0 9.500 109.04
1 9.375 108.79
2 9.250 108.54
3 9.125 108.29
4 9.000 108.04
5 8.930 107.90
6 8.875 107.79
7 8.750 107.54
8 8.625 107.29
9 8.500 107.04
10 8.480 107.00
11 8.375 106.79
12 8.250 106.54
If this answer meets your needs, please remember to mark it as the accepted answer. Thanks.

There is probably a function somewhere that does the work for you, but my advice is to program it yourself; it's not difficult at all and it's a nice programming exercise. Just find the slope in that segment and use the equation of a straight line:
(y-y0) = ((y1-y0)/(x1-x0))*(x-x0) -> y = ((y1-y0)/(x1-x0))*(x-x0) + y0
Where:
x -> Your given value (107)
x1 & x0 -> The values right above and below (107.04 & 106.79)
y1 & y0 -> The corresponding values to x1 & x0 (8.5 & 8.375)
y -> Your target value.
Just basic high-school maths ;-)
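For reference, NumPy's np.interp performs this same piecewise-linear calculation. The following is a minimal sketch, assuming the table from the question is rebuilt by hand; the helper name coupon_for_price is made up for illustration. np.interp expects the x-values (here Price) in increasing order, so the table is sorted first.

```python
import numpy as np
import pandas as pd

# Same Coupon/Price table as in the question, typed in by hand.
df = pd.DataFrame({
    "Coupon": [9.5, 9.375, 9.25, 9.125, 9.0, 8.875, 8.75, 8.625, 8.5, 8.375, 8.25],
    "Price": [109.04, 108.79, 108.54, 108.29, 108.04, 107.79,
              107.54, 107.29, 107.04, 106.79, 106.54],
})

# np.interp needs increasing x-coordinates, so sort by Price ascending.
table = df.sort_values("Price")

def coupon_for_price(price):
    """Linearly interpolate Coupon at the given Price (hypothetical helper)."""
    return float(np.interp(price,
                           table["Price"].to_numpy(),
                           table["Coupon"].to_numpy()))

print(round(coupon_for_price(107), 3))    # 8.48
print(round(coupon_for_price(107.9), 3))  # 8.93
```

Note that np.interp does not extrapolate: a price outside the table's range returns the coupon at the nearest end of the table.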

Related

Pandas select a specific column from each row [duplicate]

This question already has answers here:
Lookup Values by Corresponding Column Header in Pandas 1.2.0 or newer
(4 answers)
Closed 11 months ago.
First of all: thank you for all the questions and answers. So far, I have always found a solution to my problems here. However, I'm stuck on the following problem:
I have a dataframe as this:
Jan_x Feb_x Mar_x Apr_x ... driest driest_rr DMAI Station_id
0 -433 -398 -18 508 ... Mar_x 2684 37.189000 2
1 -95 -102 164 631 ... Mar_x 2732 30.568445 10
2 59 272 691 1165 ... Jan_x 1970 40.237462 12
3 30 239 696 1108 ... Feb_x 3548 43.941148 13
4 -1128 -1193 -985 -667 ... Feb_x 12715 334.828246 15
(995 rows in total)
The first 12 columns are monthly mean temperature values (in 0.01 degrees), the last column ('Station_id') is an identifier for climate stations. From another dataframe containing precipitation data I got the driest month ('driest') and its precipitation amount ('driest_rr'; in 0.01 mm). Finally, 'DMAI' is an annual aridity index already calculated in the step before.
Now I want to compute another Aridity Index (for meteorologists/climate scientists: the Pinna Combinative Index) that includes both annual mean temperature and precipitation (already included in 'DMAI') and mean temperature and precipitation of the driest month. The equation is:
DMAI = P / (T + 10)
PCI = 0.5 * (DMAI + 12 * Pd / (Td + 10))
with P, T the annual precipitation and annual mean temperature
and Pd, Td the precipitation and mean temperature of the driest month
(in mm and °C respectively)
I already have:
df['PCI'] = 0.5 * (df.loc[:,'DMAI'] + (12*(df.loc[:,'driest_rr']/100))/(df.loc[:,'Mar_x']+10))
which works. However, the driest month is not always March, I need the one specified in the column 'driest'.
df['PCI'] = 0.5 * (df.loc[:,'DMAI'] + (12*(df.loc[:,'driest_rr']/100))/(df.loc[:,df_dmai.loc[:,'driest']]+10))
does not work however.
Is there a way to solve this?
I found a few similar questions, like this one here:
How can I select a specific column from each row in a Pandas DataFrame?
However, the answers that I found use either the deprecated df.lookup() or a numpy workaround, so they don't help me in this case.
pandas has a lot of numpy behind it, and so the workaround from the pandas docs is very easy to plug right back into your DataFrame:
In [27]: df = pd.DataFrame({'select': ['a', 'b', 'c', 'b', 'c', 'a'], 'a': range(6), 'b': range(6, 12), 'c': range(12, 18)})
In [28]: idx, cols = pd.factorize(df['select'])
In [29]: df['chosen'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
In [30]: df
Out[30]:
select a b c chosen
0 a 0 6 12 0
1 b 1 7 13 7
2 c 2 8 14 14
3 b 3 9 15 9
4 c 4 10 16 16
5 a 5 11 17 5
You can use the loc or iloc method. You can find these methods by typing a . after your dataframe name and then pressing Tab.
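Applied to the aridity-index question, the factorize approach could look like the sketch below. The numbers are made up and only three month columns are included; the division by 100 is an assumption based on the stated units (0.01 °C and 0.01 mm), and the names td_raw, td and pd_mm are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up values; only the column layout mirrors the question.
df = pd.DataFrame({
    "Jan_x": [-433, 59],        # monthly mean temperature in 0.01 °C
    "Feb_x": [-398, 272],
    "Mar_x": [-18, 691],
    "driest": ["Mar_x", "Jan_x"],
    "driest_rr": [2684, 1970],  # precipitation in 0.01 mm
    "DMAI": [37.189, 40.237462],
})

# For every row, pick the value from the column named in 'driest'.
idx, cols = pd.factorize(df["driest"])
td_raw = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]

# Assumption: convert both quantities to °C and mm before applying the formula.
td = td_raw / 100
pd_mm = df["driest_rr"] / 100
df["PCI"] = 0.5 * (df["DMAI"] + 12 * pd_mm / (td + 10))
print(df[["driest", "PCI"]])
```

Here pd.factorize turns the column names into integer positions, so the final indexing step selects one element per row without a Python loop.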

Summarising features with multiple values in Python for Machine Learning model

I have a data file containing different foetal ultrasound measurements. The measurements are collected at different points during pregnancy, like so:
PregnancyID MotherID gestationalAgeInWeeks abdomCirc
0 0 14 150
0 0 21 200
1 1 20 294
1 1 25 315
1 1 30 350
2 2 8 170
2 2 9 180
2 2 18 NaN
As you can see from the table above, I have multiple measurements per pregnancy (between 1 and 26 observations each).
I want to summarise the ultrasound measurements somehow such that I can replace the multiple measurements with a fixed amount of features per pregnancy. So I thought of creating 3 new features, one for each trimester of pregnancy that would hold the maximum measurement recorded during that trimester:
abdomCirc1st: this feature would hold the maximum value of all abdominal circumference measurements measured between 0 to 13 Weeks
abdomCirc2nd: this feature would hold the maximum value of all abdominal circumference measurements measured between 14 to 26 Weeks
abdomCirc3rd: this feature would hold the maximum value of all abdominal circumference measurements measured between 27 to 40 Weeks
So my final dataset would look like this:
PregnancyID MotherID abdomCirc1st abdomCirc2nd abdomCirc3rd
0 0 NaN 200 NaN
1 1 NaN 315 350
2 2 180 NaN NaN
The reason for using the maximum here is that a larger abdominal circumference is associated with the adverse outcome I am trying to predict.
But I am quite confused about how to go about this. I have used the groupby function previously to derive certain statistical features from the multiple measurements, however this is a more complex task.
What I want to do is the following:
Group all abdominal circumference measurements that belong to the same pregnancy into 3 trimesters based on gestationalAgeInWeeks value
Compute the maximum value of all abdominal circumference measurements within each trimester, and assign this value to the relevant feature: abdomCirc1st, abdomCirc2nd or abdomCirc3rd.
I think I have to do something along the lines of:
df["abdomCirc1st"] = df.groupby(['MotherID', 'PregnancyID', 'gestationalAgeInWeeks'])["abdomCirc"].transform('max')
But this code does not check what trimester the measurement was taken in (gestationalAgeInWeeks). I would appreciate some help with this task.
You can try this. It is a bit of a complicated query, but it seems to work:
(df.groupby(['MotherID', 'PregnancyID'])
   .apply(lambda d: d.assign(tm=(d['gestationalAgeInWeeks'] + 13 - 1) // 13)
                     .groupby('tm')['abdomCirc']
                     .apply(max))
   .unstack()
)
produces
tm 1 2 3
MotherID PregnancyID
0 0 NaN 200.0 NaN
1 1 NaN 315.0 350.0
2 2 180.0 NaN NaN
Let's unpick this a bit. First we group by MotherID and PregnancyID. Then we apply a function to each grouped dataframe (d).
For each d, we create a 'trimester' column 'tm' via assign (I assume I got the math right here, but correct it if it is wrong!), then we group by 'tm' and apply max. For each sub-dataframe d we then obtain a Series mapping tm to max(abdomCirc).
Then unstack() moves tm to the column names.
You may want to rename these columns later, but I did not bother.
Solution 2
Come to think of it you can simplify the above a bit:
(df.assign(tm=(df['gestationalAgeInWeeks'] + 13 - 1) // 13)
   .drop(columns='gestationalAgeInWeeks')
   .groupby(['MotherID', 'PregnancyID', 'tm'])
   .agg('max')
   .unstack()
)
Similar idea, same values (the columns just get an extra 'abdomCirc' level).
pandas has a handy method called query. This should do the job for now:
abdomCirc1st = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks <= 13')['abdomCirc'].max()
abdomCirc2nd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 14 and gestationalAgeInWeeks <= 26')['abdomCirc'].max()
abdomCirc3rd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 27 and gestationalAgeInWeeks <= 40')['abdomCirc'].max()
If you want something more automatic (rather than manually changing the values of MotherID and PregnancyID for each group of rows), you have to combine it with groupby, as you did on your own.
Check this as well: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html
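One possible way to get the trimester bins exactly as defined in the question (0 to 13, 14 to 26, 27 to 40 weeks) is pd.cut with explicit edges. This matters at the boundaries: the formula (weeks + 13 - 1) // 13 maps week 0 to 0 and week 40 to 4. The sketch below rebuilds the sample data by hand; the variable names labels and features are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "PregnancyID": [0, 0, 1, 1, 1, 2, 2, 2],
    "MotherID": [0, 0, 1, 1, 1, 2, 2, 2],
    "gestationalAgeInWeeks": [14, 21, 20, 25, 30, 8, 9, 18],
    "abdomCirc": [150, 200, 294, 315, 350, 170, 180, np.nan],
})

# Bin edges follow the question: [0, 13], (13, 26], (26, 40] weeks.
labels = ["abdomCirc1st", "abdomCirc2nd", "abdomCirc3rd"]
df["trimester"] = pd.cut(
    df["gestationalAgeInWeeks"],
    bins=[0, 13, 26, 40],
    labels=labels,
    include_lowest=True,   # so that week 0 falls into the first bin
).astype(object)           # plain labels keep the later reshaping simple

features = (
    df.groupby(["PregnancyID", "MotherID", "trimester"])["abdomCirc"]
      .max()                        # maximum per pregnancy and trimester
      .unstack("trimester")         # one column per trimester
      .reindex(columns=labels)      # guarantee all three columns exist
      .rename_axis(columns=None)
      .reset_index()
)
print(features)
```

Weeks outside 0 to 40 get no label and are dropped by the groupby, so they would need separate handling if they occur in the real data.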

Average between points based on time

I'm trying to use Python to get time taken, as well as average speed between an object traveling between points.
The data looks somewhat like this,
location initialtime id speed distance
1 2020-09-18T12:03:14.485952Z car_uno 72 9km
2 2020-09-18T12:10:14.485952Z car_uno 83 8km
3 2020-09-18T11:59:14.484781Z car_duo 70 9km
7 2020-09-18T12:00:14.484653Z car_trio 85 8km
8 2020-09-18T12:12:14.484653Z car_trio 70 7.5km
The function I'm using currently is essentially like this,
Speeds.index = pd.to_datetime(Speeds.index)
..etc
Now if I were doing this usually, I would just take the unique values of the id's,
for x in speeds.id.unique():
Speeds[speeds.id=="x"]...
But this method really isn't working.
What is the best approach for simply seeing if there are multiple id points over time, then taking the average of the speeds by that time given? Otherwise just returning the speed itself if there are not multiple values.
Is there a simpler pandas filter I could use?
Expected output is simply,
area - id - initial time - journey time - average speed.
The point is to get the average speed and journey time for a vehicle going past two points.
To get the average speed and journey times you can use groupby() and pass in the columns that determine one complete journey, like id or area.
import pandas as pd
from io import StringIO
data = StringIO("""
area initialtime id speed
1 2020-09-18T12:03:14.485952Z car_uno 72
2 2020-09-18T12:10:14.485952Z car_uno 83
3 2020-09-18T11:59:14.484781Z car_duo 70
7 2020-09-18T12:00:14.484653Z car_trio 85
8 2020-09-18T12:12:14.484653Z car_trio 70
""")
df = pd.read_csv(data, sep=r"\s+")  # delim_whitespace is deprecated in recent pandas
df["initialtime"] = pd.to_datetime(df["initialtime"])
# change to ["id", "area"] if need more granular aggregation
group_cols = ["id"]
time = df.groupby(group_cols)["initialtime"].agg(["max", "min"]).eval('max-min').reset_index(name="journey_time")
speed = df.groupby(group_cols)["speed"].mean().reset_index(name="average_speed")
pd.merge(time, speed, on=group_cols)
id journey_time average_speed
0 car_duo 00:00:00 70.0
1 car_trio 00:12:00 77.5
2 car_uno 00:07:00 77.5
I tried to use a very intuitive solution. I'm assuming the data has already been loaded into df.
df['initialtime'] = pd.to_datetime(df['initialtime'])
result = []
for car in df['id'].unique():
    _df = df[df['id'] == car].sort_values('initialtime', ascending=True)
    # Where the car is leaving "from" and where it's heading "to"
    _df['From'] = _df['location']
    _df['To'] = _df['location'].shift(-1, fill_value=_df['location'].iloc[0])
    # Auxiliary columns
    _df['end_time'] = _df['initialtime'].shift(-1, fill_value=_df['initialtime'].iloc[0])
    _df['end_speed'] = _df['speed'].shift(-1, fill_value=_df['speed'].iloc[0])
    # Desired columns
    _df['journey_time'] = _df['end_time'] - _df['initialtime']
    _df['avg_speed'] = (_df['speed'] + _df['end_speed']) / 2
    _df = _df[_df['journey_time'] >= pd.Timedelta(0)]
    _df.drop(['location', 'distance', 'speed', 'end_time', 'end_speed'],
             axis=1, inplace=True)
    result.append(_df)
final_df = pd.concat(result).reset_index(drop=True)
The final DataFrame is as follows:
initialtime id From To journey_time avg_speed
0 2020-09-18 12:03:14.485952+00:00 car_uno 1 2 0 days 00:07:00 77.5
1 2020-09-18 11:59:14.484781+00:00 car_duo 3 3 0 days 00:00:00 70.0
2 2020-09-18 12:00:14.484653+00:00 car_trio 7 8 0 days 00:12:00 77.5
Here is another approach. My results differ from those in the other answers, so I may have misunderstood the requirements. In brief, I calculated each average speed as total distance divided by total time (for each car).
from io import StringIO
import pandas as pd
# speed in km / hour; distance in km
data = '''location initial-time id speed distance
1 2020-09-18T12:03:14.485952Z car_uno 72 9
2 2020-09-18T12:10:14.485952Z car_uno 83 8
3 2020-09-18T11:59:14.484781Z car_duo 70 9
7 2020-09-18T12:00:14.484653Z car_trio 85 8
8 2020-09-18T12:12:14.484653Z car_trio 70 7.5
'''
Now create data frame and perform calculations
# create data frame
df = pd.read_csv(StringIO(data), sep=r'\s+')  # delim_whitespace is deprecated in recent pandas
df['elapsed-time'] = df['distance'] / df['speed'] # in hours
# utility function
def hours_to_hms(elapsed):
    ''' Convert `elapsed` (in hours) to hh:mm:ss (round to nearest sec)'''
    h, m = divmod(elapsed, 1)
    m *= 60
    _, s = divmod(m, 1)
    s *= 60
    hms = '{:02d}:{:02d}:{:02d}'.format(int(h), int(m), int(round(s, 0)))
    return hms
# perform calculations
start_time = df.groupby('id')['initial-time'].min()
journey_hrs = df.groupby('id')['elapsed-time'].sum().rename('elapsed-hrs')
hms = journey_hrs.apply(lambda x: hours_to_hms(x)).rename('hh:mm:ss')
ave_speed = ((df.groupby('id')['distance'].sum()
/ df.groupby('id')['elapsed-time'].sum())
.rename('ave speed (km/hr)')
.round(2))
# assemble results
result = pd.concat([start_time, journey_hrs, hms, ave_speed], axis=1)
print(result)
initial-time elapsed-hrs hh:mm:ss \
id
car_duo 2020-09-18T11:59:14.484781Z 0.128571 00:07:43
car_trio 2020-09-18T12:00:14.484653Z 0.201261 00:12:05
car_uno 2020-09-18T12:03:14.485952Z 0.221386 00:13:17
ave speed (km/hr)
id
car_duo 70.00
car_trio 77.01
car_uno 76.79
You should provide a better dataset (i.e. with identical time points) so that we can better understand the inputs, and an example of the expected output so that we can understand how the average speed is computed.
Thus I'm just guessing that you may be looking for df.groupby('initialtime')['speed'].mean() if df is a dataframe containing your input data.
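If the goal is one row per vehicle with its first timestamp, journey time and mean speed, a single groupby with named aggregation is one way to do it. This is a sketch under that reading of the question: it takes "average speed" to mean the plain mean of the recorded speeds, and the output column names are illustrative.

```python
from io import StringIO

import pandas as pd

data = StringIO("""area initialtime id speed
1 2020-09-18T12:03:14.485952Z car_uno 72
2 2020-09-18T12:10:14.485952Z car_uno 83
3 2020-09-18T11:59:14.484781Z car_duo 70
7 2020-09-18T12:00:14.484653Z car_trio 85
8 2020-09-18T12:12:14.484653Z car_trio 70
""")
df = pd.read_csv(data, sep=r"\s+")
df["initialtime"] = pd.to_datetime(df["initialtime"])

summary = (
    df.groupby("id")
      .agg(
          initial_time=("initialtime", "min"),   # first sighting
          last_time=("initialtime", "max"),      # last sighting
          average_speed=("speed", "mean"),       # mean of recorded speeds
      )
      .assign(journey_time=lambda d: d["last_time"] - d["initial_time"])
      .drop(columns="last_time")
      .reset_index()
)
print(summary)
```

A vehicle seen at only one point gets a journey time of zero and its single recorded speed as the average.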

How to compare value in Pandas DataFrame against a value in the previous row AND the previous column?

I have a dataframe consisting of two columns filled with float values. I need to calculate all the values of 'h' minus all the values of 'c', at the index previous to the current 'h' value.
So for instance, for 'h' in row 1, I need to calculate 1.17322 - 1.17285 (the value of 'c' in the previous row)
I have tried several different methods to accomplish this, including the use of: .iloc, .shift(), .groupby(), and .diff(), but I cannot get exactly what I'm looking for.
If anybody could help, it would be greatly appreciated
c h
0 1.17285 1.17310
1 1.17287 1.17322
2 1.17298 1.17340
3 1.17346 1.17348
4 1.17478 1.17511
5 1.17595 1.17700
6 1.17508 1.17633
7 1.17474 1.17545
8 1.17463 1.17546
9 1.17224 1.17468
10 1.17437 1.17456
11 1.17552 1.17641
12 1.17750 1.17784
13 1.17694 1.17770
Try this using shift, for example:
df['c_shift'] = df['c'].shift()
df['diff'] = df['h'] - df['c_shift']
print(df)
Output:
c h c_shift diff
0 1.17285 1.17310 NaN NaN
1 1.17287 1.17322 1.17285 0.00037
2 1.17298 1.17340 1.17287 0.00053
3 1.17346 1.17348 1.17298 0.00050
4 1.17478 1.17511 1.17346 0.00165
5 1.17595 1.17700 1.17478 0.00222
6 1.17508 1.17633 1.17595 0.00038
7 1.17474 1.17545 1.17508 0.00037
8 1.17463 1.17546 1.17474 0.00072
9 1.17224 1.17468 1.17463 0.00005
10 1.17437 1.17456 1.17224 0.00232
11 1.17552 1.17641 1.17437 0.00204
12 1.17750 1.17784 1.17552 0.00232
13 1.17694 1.17770 1.17750 0.00020
Of course, you can do this in one step:
df['diff'] = df['h'] - df['c'].shift()

Modifying values in a dataframe

I am trying to iterate through the rows of a dataframe and modify some values as I iterate. The dataframe looks like this:
Time WindSpeed SkyCover Temp DewPt RH Press Precip
3 21:53 11 Light Snow -1.7 -6.1 72% 1003.1 0
4 20:53 N 11 Mostly Cloudy -2.2 -6.1 75% 1002.8 0
5 19:53 Calm Mostly Cloudy -2.8 -6.7 75% 1002.7 0
6 18:53 Calm Overcast -1.7 -6.7 69% 1002.4 0
7 17:53 N 5 Overcast -1.7 -7.2 66% 1002.6 0
8 16:53 NE 8 Overcast -1.1 -7.2 64% 1002.5 0
…
I have written the following loop to go through the dataframe and alter the windspeed column. This column is a vector when windspeed is greater than 1 KPH and a text value 'Calm' when below that threshold. I am wanting this loop to look at the column values row by row and if it is calm, put '1' in its place but if it is greater than one, remove the direction and keep only the scalar value.
for i in df.index:
    if df.at[i, 2] == 'Calm':
        df.at[i, 2] = 1
    else:
        df.at[i, 2] = re.findall('[0-9]+', df.at[i, 2])[0]
As you can see in the above dataframe, this loop has worked on the first row of data but does not continue past that. I am not receiving any error messages as to why it is stopping after the first row.
Use apply:
df.WindSpeed = df.WindSpeed.apply(lambda x: 1 if x == 'Calm' else re.findall(r'[0-9]+',x)[0])
Adding another way of doing it:
import numpy as np
df['WindSpeed'] = np.where(df['WindSpeed'] == 'Calm', '1', df['WindSpeed'].str.extract(r'(\d+)', expand=False))
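import re
def modify(x):
    # 'Calm' becomes 1; otherwise keep only the first run of digits
    if x == 'Calm':
        return 1
    return re.findall('[0-9]+', x)[0]
# The function must be defined before it is passed to apply
df['WindSpeed'] = df['WindSpeed'].apply(modify)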
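The same replacement can also be done without a row-by-row loop or apply. The sketch below uses made-up WindSpeed values modelled on the table in the question and assumes the column holds only strings such as 'Calm', '11' or 'N 11'; the names is_calm and digits are illustrative.

```python
import pandas as pd

# Made-up column modelled on the WindSpeed values shown in the question.
df = pd.DataFrame({"WindSpeed": ["11", "N 11", "Calm", "Calm", "N 5", "NE 8"]})

# Flag calm rows before overwriting anything.
is_calm = df["WindSpeed"].eq("Calm")

# expand=False makes str.extract return a Series instead of a DataFrame.
digits = df["WindSpeed"].astype(str).str.extract(r"(\d+)", expand=False)

# Keep the extracted digits, but use "1" where the original value was 'Calm'.
df["WindSpeed"] = digits.where(~is_calm, "1").astype(int)
print(df)
```

The final astype(int) raises an error if any row is neither 'Calm' nor contains digits, which makes unexpected values easy to spot.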
