linear interpolation between values stored in separate dataframes - python

I have numbers stored in 2 data frames (real ones are much bigger) which look like
df1
A B C T Z
13/03/2017 1.321674 3.1790 3.774602 30.898 13.22
06/02/2017 1.306358 3.1387 3.712554 30.847 13.36
09/01/2017 1.361103 3.2280 3.738500 32.062 13.75
05/12/2016 1.339258 3.4560 3.548593 31.978 13.81
07/11/2016 1.295137 3.2323 3.188161 31.463 13.43
df2
A B C T Z
13/03/2017 1.320829 3.1530 3.7418 30.933 13.1450
06/02/2017 1.305483 3.1160 3.6839 30.870 13.2985
09/01/2017 1.359989 3.1969 3.7129 32.098 13.6700
05/12/2016 1.338151 3.4215 3.5231 32.035 13.7243
07/11/2016 1.293996 3.2020 3.1681 31.480 13.3587
and a list where I have stored all daily dates from 13/03/2017 to 7/11/2016.
I would like to create a dataframe with the following features:
the list of daily dates is the row index
I would like to create columns (in this case from A to Z) and, for each row/day, compute the linear interpolation between the value in df1 and the value in df2 one row further down (i.e. at the next earlier date). For example, in the row '12/03/2017' for column A I want to compute [(34/35)*1.321674] + [(1/35)*1.305483] = 1.3212114, where 35 is the number of days between 13/03/2017 and 06/02/2017, 1.321674 is the value in df1 for column A on 13/03/2017, and 1.305483 is the value in df2 for column A on 06/02/2017. For 11/03/2017, column A, I want [(33/35)*1.321674] + [(2/35)*1.305483] = 1.3207488. The values 1.321674 and 1.305483 stay fixed over the whole interval down to 06/02/2017, where the result should simply be 1.305483.
Finally, the interpolation should move to the next pair of values once the row's date falls into the next time interval. For example, once I reach 05/02/2017, the interpolation should be between 1.306358 (df1, column A, 06/02/2017) and 1.359989 (df2, column A, 09/01/2017), that is, shifted one position down.
For clarity, date format is 'dd/mm/yyyy'
I would greatly appreciate any piece of advice or suggestion, I am aware it's a lot of work so any hint is valued!
Please let me know if you need more clarification.
Thanks!
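A minimal sketch of one way this could be approached, restricted to column A of the sample data and assuming the 'dd/mm/yyyy' dates are parsed with dayfirst=True into a shared DatetimeIndex sorted ascending (an illustrative loop, not necessarily the fastest way):
import pandas as pd

# Rebuild a small version of the two frames (column A only)
dates = pd.to_datetime(['07/11/2016', '05/12/2016', '09/01/2017',
                        '06/02/2017', '13/03/2017'], dayfirst=True)
df1 = pd.DataFrame({'A': [1.295137, 1.339258, 1.361103, 1.306358, 1.321674]}, index=dates)
df2 = pd.DataFrame({'A': [1.293996, 1.338151, 1.359989, 1.305483, 1.320829]}, index=dates)

daily = pd.date_range(dates.min(), dates.max(), freq='D')
out = pd.DataFrame(index=daily, columns=df1.columns, dtype=float)

# In each interval between consecutive anchor dates, weight df1 at the later
# date against df2 at the earlier date, proportionally to the distance in days.
for t_lower, t_upper in zip(dates[:-1], dates[1:]):
    span = (t_upper - t_lower).days
    for d in daily[(daily >= t_lower) & (daily <= t_upper)]:
        w = (d - t_lower).days / span  # weight on df1.loc[t_upper]
        out.loc[d] = w * df1.loc[t_upper] + (1 - w) * df2.loc[t_lower]

out.loc['2017-03-12', 'A']  # 1.32121..., matching the worked example above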

Related

Conditional average of values in a row, depending on data qualifiers

I hope you're all doing well.
So I've been working with Excel my whole life and I'm now switching to Python & Pandas. The learning curve is proving to be quite steep for me, so please bear with me.
Day after day it's getting better. I've already managed to aggregate values, input/output from csv/excel, drop "na" values and much more. However, I've stumbled upon a wall too high for me to climb right now...
I created an extract of the dataframe I'm working with. You can download it here, so you can understand what I'll be writing about: https://filetransfer.io/data-package/pWE9L29S#link
df_example
t_stamp,1_wind,2_wind,3_wind,4_wind,5_wind,6_wind,7_wind,1_wind_Q,2_wind_Q,3_wind_Q,4_wind_Q,5_wind_Q,6_wind_Q,7_wind_Q
2021-06-06 18:20:00,12.14397093693768,12.14570426940918,10.97993184016605,11.16468568605988,9.961717914791588,10.34653735907099,11.6856901451427,True,False,True,True,True,True,True
2021-05-10 19:00:00,8.045154709031468,8.572511270557484,8.499070711427668,7.949358210396142,8.252115912454919,7.116505042782365,8.815732567915179,True,True,True,True,True,True,True
2021-05-27 22:20:00,8.38946901817802,6.713454777683985,7.269814675171176,7.141862659613969,12.38239256540934,14.23146015260278,11.26704854500004,True,True,True,True,False,False,False
2021-05-05 18:20:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,False,False,False,False,False,False,False
2021-06-06 12:20:00,11.95525872119988,12.14570426940918,12.26086164116684,12.89527716859738,11.77172234144684,12.12409015586662,12.52180822809299,True,False,True,True,True,True,True
2021-06-04 03:30:00,14.72553364088618,12.72900662616056,10.59386275508178,10.96070182287055,12.38239256540934,12.07846616943932,10.58384464064597,True,True,True,True,False,True,True
2021-05-05 13:00:00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,False,False,False,False,False,False,False
2021-05-24 18:10:00,17.12270521348523,16.22721748967324,14.15318916689965,19.35395873243158,17.60747853230812,17.18577813727543,17.70745523935796,False,False,False,False,True,True,True
2021-05-07 19:00:00,13.94341927008482,10.95456999345216,13.36533234604886,0.0,3.782910539990379,10.86996953698871,13.45072022532649,True,True,True,False,False,True,True
2021-05-13 00:40:00,10.70940582779898,10.22222264510213,9.043496015164536,9.03805802580422,11.53775481234347,10.09538681656049,10.19345618536208,True,True,True,True,True,True,True
2021-05-27 19:40:00,10.8317678500958,7.929683248532885,8.264301219025942,8.184133252794958,12.38239256540934,14.23146015260278,11.26704854500004,True,True,True,True,False,False,False
2021-05-09 12:00:00,10.55571650269678,7.635778078425459,10.43683108425784,7.847532146733346,8.100127641989639,7.770247510198059,8.040702032061867,True,True,True,True,True,True,True
2021-05-19 19:00:00,2.322496225799398,2.193219010982461,2.301622604435732,2.204278609893358,2.285408405883714,1.813280858368885,1.667207419773053,True,True,True,True,True,True,True
2021-05-30 12:30:00,5.776450801637788,8.488826231951345,10.98525552709715,7.03016556196849,12.38239256540934,14.23146015260278,11.26704854500004,True,True,True,True,False,False,False
2021-05-24 14:10:00,17.12270521348523,16.22721748967324,14.15318916689965,19.35395873243158,17.93466266883504,17.04697174496121,17.0739475214739,False,False,False,False,True,False,True
What you are looking at:
"n" represents the number of measuring points.
First column: Timestamp of values
Columns index 1 to "n": Average windspeed at different points, of the last 10 minutes
Columns index "n+1" to last (-1): Qualifies if the value of the respective point is valid (True) or invalid (False). So to the value "1_wind", the qualifier "1_wind_Q" applies
What I'm trying to achieve:
The goal is to create a new column called "Avg_WS" which, for every row, calculates the following:
Average of the value ranges, ONLY if the corresponding Qualifier is TRUE
Example: So if in a given row, the column "4_wind_Q" is "False", the value "4_wind" should be excluded from the average on that given row.
Extra: If all Qualifiers are "False" in a given row, "Avg_WS" should equal to "NaN" in that same row.
I've tried to use apply, but I can't figure out how to match the value-qualifier pairs.
Thank you so much in advance!
I tried using mask for this.
quals = ['1_wind_Q','2_wind_Q','3_wind_Q','4_wind_Q','5_wind_Q','6_wind_Q','7_wind_Q']
fields = ['1_wind', '2_wind', '3_wind', '4_wind', '5_wind', '6_wind', '7_wind']
df[fields].mask( ~df[quals].values ).mean( axis=1 )
# output
0 11.047089
1 8.178635
2 7.378650
3 NaN
4 12.254836
5 11.945236
6 NaN
7 17.500237
8 12.516802
9 10.119969
10 8.802471
11 8.626705
12 2.112502
13 8.070175
14 17.504305
dtype: float64
# assign this to the dataframe
df.loc[ :, 'Avg_WS' ] = df[fields].mask( ~df[quals].values ).mean( axis=1 )
mask works by applying a boolean mask to each of the "fields" columns; the caveat is that the mask must have the same shape as the data you apply it to (i.e. the same n x m dimensions), which is why .values is used to strip the qualifier column labels.
mean(axis=1) tells the dataframe to apply the mean across each row (rather than down each column, which axis=0 would imply).
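As a variation on the same idea, the .values call can be avoided by renaming the qualifier columns so that the boolean DataFrame aligns by label; where (the complement of mask) then keeps only the qualified values. A sketch, reusing the quals and fields lists defined above:
# Rename the qualifier columns to match the value columns, so pandas aligns by label;
# where() keeps values where the qualifier is True and puts NaN elsewhere.
quals_as_fields = df[quals].rename(columns=dict(zip(quals, fields)))
df['Avg_WS'] = df[fields].where(quals_as_fields).mean(axis=1)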

Summing values of (dropped) duplicate rows Pandas DataFrame

For a time series analysis, I have to drop instances that occur on the same date, but keep some of the 'deleted' information and add it to the remaining 'duplicate' instance. Below is a short example of part of my dataset.
z = pd.DataFrame({'lat':[49.125,49.125], 'lon':[-114.125 ,-114.125 ], 'time':[np.datetime64('2005-08-09'),np.datetime64('2005-08-09')], 'duration':[3,6],'size':[4,10]})
lat lon time duration size
0 49.125 -114.125 2005-08-09 3 4
1 49.125 -114.125 2005-08-09 6 10
I would like to drop the (duplicate) instance which has the lowest 'duration' value but at the same time sum the 'size' variables. Output would look like:
lat lon time duration size
0 49.125 -114.125 2005-08-09 6 14
Does anyone know how I would be able to tackle such a problem? Furthermore, for another variable, I would like to take the mean of these values, though I think the process would be similar to summing the values.
edit: so far I know how to get the highest duration value to remain using:
z.sort_values(by='duration', ascending=False).drop_duplicates(subset=['lat', 'lon', 'time'], keep='first')
If those are all the columns in your dataframe, you can get your result using a groupby on your time column and passing in your aggregations for each column.
More specifically, you can drop the (duplicate) instance which has the lowest 'duration' by keeping the max() duration, and at the same time sum the 'size' variables by using sum() on your size column.
res = z.groupby('time').agg({'lat': 'first',
                             'lon': 'first',
                             'duration': 'max',
                             'size': 'sum'}).reset_index()
res
time lat lon duration size
0 2005-08-09 49.125 -114.125 6 14
The only difference is that 'time' is now your first column, which you can quickly fix.
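For the other variable the asker mentions, taking the mean instead of the sum is just one more entry in the same agg dict; using a hypothetical column name 'other' for illustration:
res = z.groupby('time').agg({'lat': 'first',
                             'lon': 'first',
                             'duration': 'max',
                             'size': 'sum',
                             'other': 'mean'}).reset_index()  # 'other' is a placeholder column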
Group by to get the sum, then merge it back onto the deduplicated df on the unique lat/lon/time combinations:
import pandas as pd
import numpy as np
z = pd.DataFrame({'lat':[49.125,49.125], 'lon':[-114.125 ,-114.125 ], 'time':[np.datetime64('2005-08-09'),np.datetime64('2005-08-09')], 'duration':[3,6],'size':[4,10]}) # original data
gp = z.groupby(['lat', 'lon','time'], as_index=False)[['size']].sum() # getting the sum of 'size' for unique combination of lat, lon, time
df = z.sort_values(by='duration', ascending=True).drop_duplicates(subset=['lat', 'lon','time'], keep='last') # dropping duplicates
pd.merge(df[['lat', 'lon', 'time', 'duration']], gp, on=['lat', 'lon', 'time']) # adding the columns summed onto the df without duplicates
lat lon time duration size
0 49.125 -114.125 2005-08-09 6 14
Another way, based on sophocles' answer:
res = z.sort_values(by='duration', ascending=False).groupby(['time', 'lat', 'lon']).agg({
    'duration': 'first',  # same as 'max' since we've sorted the data by duration DESC
    'size': 'sum'})
This one could become less readable if you have several columns you want to keep (you'd have a lot of 'first' entries in the agg function).

Find Gaps in a Pandas Dataframe

I have a dataframe which has a column for minutes and a correlated value; the frequency is about 79 seconds, but sometimes data is missing for a period (no rows at all). I want to detect whether there is a gap of 25 or more minutes and, if so, delete the dataset.
How do I test whether such a gap exists?
The dataframe looks like this:
INDEX minutes data
0 23.000 1.456
1 24.185 1.223
2 27.250 0.931
3 55.700 2.513
4 56.790 1.446
... ... ...
So there is an irregular but short gap and one that exceeds 25 minutes. In this case I want the dataset to be empty.
I am quite new to Python, especially to Pandas so an explanation would be helpful to learn.
You can use numpy.roll to create a column with shifted values (i.e. the first value from the original column becomes the second value, the second becomes the third, etc., and the last wraps around to the first):
import pandas as pd
import numpy as np
df = pd.DataFrame({'minutes': [23.000, 24.185, 27.250, 55.700, 56.790]})
np.roll(df['minutes'], 1)
# output: array([56.79 , 23. , 24.185, 27.25 , 55.7 ])
Add this as a new column to your dataframe and subtract the new column from the original column.
We also drop the first row beforehand, since we don't want the difference between your first timepoint in the original column and your last timepoint, which got rolled to the start of the new column.
Then we just ask if any of the values resulting from the subtraction is above your threshold:
df['rolled_minutes'] = np.roll(df['minutes'], 1)
dropped_df = df.drop(index=0)
diff = dropped_df['minutes'] - dropped_df['rolled_minutes']
(diff > 25).any()
# output: True
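For comparison, a slightly more idiomatic pandas sketch of the same check uses diff(), which computes the row-to-row difference directly and leaves NaN in the first row, so there is no wrapped-around value to drop:
import pandas as pd

df = pd.DataFrame({'minutes': [23.000, 24.185, 27.250, 55.700, 56.790]})
has_gap = (df['minutes'].diff() > 25).any()
# has_gap: True
if has_gap:
    df = df.iloc[0:0]  # empty the dataset, as requested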

How can I create a multi-indexed pandas dataframe within a for loop?

I have a weekly time series of multiple variables and I am trying to see at what percentile rank the last 26-week correlation would sit versus all previous 26-week correlations.
I can generate a correlation matrix for the first 26-week period using the corr function in pandas, but I don't know how to loop through all previous periods to find the values of these correlations and then rank them.
I hope there is a better way to achieve this; if so, please let me know.
I have tried calculating parallel dataframes, but I couldn't write a formula to rank the most recent, so I believe the solution lies with multi-indexing.
import numpy as np
import pandas as pd

daterange = pd.date_range('20160701', periods=100, freq='1w')
np.random.seed(120)
df_corr = pd.DataFrame(np.random.rand(100, 5), index=daterange, columns=list('abcde'))
df_corr_chg = df_corr.diff()
df_corr_chg = df_corr_chg[1:]
df_corr_chg = df_corr_chg.replace(0, 0.01)
d = df_corr_chg.shape[0]
df_CCC = df_corr_chg[::-1]
for s in range(0, d - 26):
    i = df_CCC.iloc[s:26 + s]
I am looking for a multi-indexed table showing the correlations at different times
Example of output
e.g. (formatting issues)
a b
a 1 1 -0.101713
2 1 -0.031109
n 1 0.471764
b 1 -0.101713 1
2 -0.031109 1
n 0.471764 1
Here is a recipe for how you could approach the problem.
I assume you have one price per week (otherwise just pre-aggregate your dataframe).
# in case you your weeks are not numbered
# Sort your dataframe for symbol (EUR, SPX, ...) and week descending.
df.sort_values(['symbol', 'date'], ascending=False, inplace=True)
# Now build a boolean indexer marking the 26 most recent rows per symbol
indexer = df.groupby('symbol').cumcount() < 26
df.loc[indexer, 'pricecolumn'].corr()
One more hint, in case you need to pre-aggregate your dataframe: you could add another auxiliary column with the week number to your frame, like:
df['week_number']=df['datefield'].dt.week
Then I guess you would like to have the last price of each week. You could do that as follows:
df_last= df.sort_values(['symbol', 'week_number', 'date'], ascending=True, inplace=False).groupby(['symbol', 'week_number']).aggregate('last')
df_last.reset_index(inplace=True)
Then use df_last in place of the df above. Please check/change the field names I assumed.
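Building on the asker's own rolling-window loop (and not on either answer above), one possible sketch of collecting every 26-week correlation matrix into a single MultiIndexed frame with pd.concat, keyed by each window's end date:
import numpy as np
import pandas as pd

# Reuse the asker's setup
daterange = pd.date_range('20160701', periods=100, freq='1w')
np.random.seed(120)
df_corr = pd.DataFrame(np.random.rand(100, 5), index=daterange, columns=list('abcde'))
df_corr_chg = df_corr.diff().iloc[1:].replace(0, 0.01)

# One correlation matrix per 26-week window, keyed by the window's last date
windows = {}
for s in range(len(df_corr_chg) - 25):
    win = df_corr_chg.iloc[s:s + 26]
    windows[win.index[-1]] = win.corr()

# Stack into one frame with a (window_end, variable) MultiIndex
all_corr = pd.concat(windows, names=['window_end', 'variable'])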

Normalize data by first value in the group

I have a DataFrame of 6 million rows of intraday data that looks like such:
closingDate Time Last
1997-09-09 11:30:00-04:00 1997-09-09 11:30:00 100
1997-09-09 11:31:00-04:00 1997-09-09 11:31:00 105
I want to normalize my Last column in a vectorized manner by dividing every row by the price on the first row that contains that day. This is my attempt:
df['Last']/df.groupby('closingDate').first()['Last']
The denominator looks like such:
closingDate
1997-09-09 943.25
1997-09-10 942.50
1997-09-11 928.00
1997-09-12 915.75
1997-09-14 933.00
1997-09-15 933.00
However, this division just gives me a column of nans. How can I associate the division to be broadcasted across my DateTime index?
Usually, this is a good use case for transform:
df['Last'] /= df.groupby('closingDate')['Last'].transform('first')
The groupby result is broadcast back to the shape of the original DataFrame, so the division is aligned row by row.
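A minimal runnable illustration on the two sample rows from the question, with the prices simplified to 100 and 105:
import pandas as pd

df = pd.DataFrame({'closingDate': ['1997-09-09', '1997-09-09'],
                   'Last': [100.0, 105.0]})
df['Last'] /= df.groupby('closingDate')['Last'].transform('first')
# df['Last'] is now [1.00, 1.05]: each price divided by that day's first price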
