Preprocessing multivariate time-series - python

Assume that I have indexed by time observations of a group of variables.
A B C D ... V day
16.50 14.00 53.00 45.70 ... 6.39 1
17.45 16.00 64.00 46.30 ... 6.00 2
18.40 12.00 51.00 47.30 ... 6.57 3
19.35 7.00 42.00 48.40 ... 5.84 4
20.30 9.00 34.00 49.50 ... 6.36 5
20.72 10.00 42.00 50.60 ... 5.78 6
21.14 6.00 45.00 51.90 ... 5.16 7
21.56 9.00 38.00 52.60 ... 5.62 8
21.98 2.00 32.00 53.50 ... 4.94 9
22.78 8.00 29.00 53.80 ... 6.25 10
...
Based on this data frame, I would like to construct a predictive model for a target Variable V with predictors being some of the other variables from previous observations, say with a time-horizon . That is, I would like to fit a model of this type
,
where is the error, and is chosen from a certain class of functions (e.g. a some kind of a neural network or a tree based method).
Question Are there packages that allow me a flexible choice of a learning algorithm, possibly from other packages, while also doing necessary preprocessing? Something in the spirit of what Caret does.
Thoughts Of course I could add a bunch of columns to the data frame, say A1, A2, ..., AK, etc, where say A1 is the value of A on the previous day, A2 is the value of A from 2 days before, etc. After that, I could apply any package I'd like. The problem is that if my initial data frame is large, this new data frame is going to be many times larger (about times, to be precise). This makes such an approach very inefficient if is large.
Caret package does have the function called createTimeSlices, but it seems to be designed for univariate time-series. This question is similar to this older one from 2009. The packages recommended there do not quite seem to do what I described above. I was wondering whether other tools have appeared since then. I would also appreciate any information if something like this exist in python.

Related

How to perform calculation in string column pandas

df[concentration]
'4.0±0.0'
'2.5±0.2'
'5.8±0.1'
'45.0'
'23'
'26.07'
'64'
I want result as:
4.00
2.70
5.90
45.00
23.00
26.07
64.00
which is (4.0+0.0) taking highest possible value.
However my concentration column is not float then how can I perform calculation on such type of data?
Any kind of help will be appreciated.
Thank you.
Use split(regex, expand), convert outcome digits to float and add row wise using lambda.
Data
df=pd.DataFrame({'concentration':['4.0±0.0','2.5±0.2','5.8±0.1','45.0','23','26.07','64']})
Solution
df['concentration']=df.concentration.str.split('±', expand=True).apply(lambda x: x.astype(float)).sum(1)
concentration
0 4.00
1 2.70
2 5.90
3 45.00
4 23.00
5 26.07
6 64.00

Pandas Way of Weighted Average in a Large DataFrame

I do have a large dataset (around 8 million rows x 25 columns) in Pandas and I am struggling to find a way to compute weighted average on this dataframe which in turn creates another data frame.
Here is how my dataset looks like (very simplified version of it):
prec temp
location_id hours
135 1 12.0 4.0
2 14.0 4.1
3 14.3 3.5
4 15.0 4.5
5 15.0 4.2
6 15.0 4.7
7 15.5 5.1
136 1 12.0 4.0
2 14.0 4.1
3 14.3 3.5
4 15.0 4.5
5 15.0 4.2
6 15.0 4.7
7 15.5 5.1
I have a multi-index on [location_id, hours]. I have around 60k locations and 140 hours for each location (making up the 8 million rows).
The rest of the data is numeric (float) or categorical. I have only included 2 columns here, normally there are around 20 columns.
What I am willing to do is to create a new data frame that is basically a weighted average of this data frame. The requirements indicate that 12 of these location_ids should be averaged out by a specified weight to form the combined_location_id values.
For example, location_ids 1,3,5,7,9,11,13,15,17,19,21,23 with their appropriate weights (separate data coming in from another data frame) should be weighted averaged to from the combined_location_id CL_1's data.
That is a lot of data to handle and I wasn't able to find a completely Pandas way of solving it. Therefore, I went with a for loop approach. It is extremely slow and I am sure this is not the right way to do it:
def __weighted(self, ds, weights):
return np.average(ds, weights=weights)
f = {'hours': 'first', 'location_id': 'first',
'temp': lambda x: self.__weighted(x, weights), 'prec': lambda x: self.__weighted(x, weights)}
data_frames = []
for combined_location in all_combined_locations:
mapped_location_ids = combined_location.location_ids
weights = combined_location.weights_of_location_ids
data_for_this_combined_location = pd.concat(df_data.loc[df_data.index.get_level_values(0) == location_id] for location_id in mapped_location_ids)
data_grouped_by_distance = data_for_this_combined_location.groupby("hours", as_index=False)
data_grouped_by_distance = data_grouped_by_distance.agg(f)
data_frames.append(data_grouped_by_distance)
df_combined_location_data = pd.concat(data_frames)
df_combined_location_data.set_index(['location_id', 'hours'], inplace=True)
This works well functionally, however the performance and the memory consumption is horrible. It is taking over 2 hours on my dataset and that is currently not acceptable. The existence of the for loop is an indicator that this could be handled better.
Is there a better/faster way to implement this?
From what I saw you can reduce one for loop with mapped_location_ids
data_for_this_combined_location = df_data.loc[df_data.index.get_level_values(0).isin(mapped_location_ids)]

Select value from dataframe based on other dataframe

i try to calculate the position of an object based on a timestamp. For this I have two dataframes in pandas. One for the measurement data and one for the position. All the movement is a straightforward acceleration.
Dataframe 1 contains the measurement data:
ms force ... ... ...
1 5 20
2 10 20
3 15 25
4 20 30
5 25 20
..... (~ 6000 lines)
Dataframe 2 contains "positioning data"
ms speed (m/s)
1 0 0.66
2 4500 0.66
3 8000 1.3
4 16000 3.0
5 20000 3.0
.....(~300 lines)
Now I want to calculate the position of the first dataframe with the data from secound dataframe
In Excel I solved the problem by using an array formular but now I have to use Python/Pandas and I cant find a way how to select the correct row from dataframe 2.
My idea is to make something like this: if
In the end I want to display a graph "force <-> way" and not "force <-> time"
Thank you in andvance
==========================================================================
Update:
In the meantime I could almost solve my issue. Now my Data look like this:
Dataframe 2 (Speed Data):
pos v a t t-end t-start
0 -3.000 0.666667 0.000000 4.500000 4.500000 0.000000
1 0.000 0.666667 0.187037 0.071287 4.571287 4.500000
2 0.048 0.680000 0.650794 0.010244 4.581531 4.571287
3 0.055 0.686667 0.205432 0.064904 4.646435 4.581531
...
15 0.055 0.686667 0.5 0.064904 23.0 20.0
...
28 0.055 0.686667 0.6 0.064904 35.0 34.0
...
30 0.055 0.686667 0.9 0.064904 44.0 39.0
And Dataframe 1 (time based measurement):
Fx Fy Fz abs_t expected output ('a' from DF1)
0 -13.9 170.3 45.0 0.005 0.000000
1 -14.1 151.6 38.2 0.010 0.000000
...
200 -14.1 131.4 30.4 20.015 0.5
...
300 -14.3 111.9 21.1 34.01 0.6
...
400 -14.5 95.6 13.2 40.025
So i want to check the time(abs_t) from DF1 and search for the corract 'a' in DF2
So somthing like this (pseudo code):
if (DF1['t_abs'] between (DF2['t-start'], DF2['t-end']):
DF1['a'] = DF2['a']
I could make two for loops but it looks like the wrong way and is very very slow.
I hope you understand my problem; to provide a running sample is very hard.
In Excel I did like this:
I found a very slow solution but atleast its working :(
df1['a'] = 0
for index, row in df2.iterrows():
start = row['t-start']
end = row ['t-end']
a = row ['a']
df1.loc[(df1['tabs']>start)&(df1['tabs']<end), 'a'] = a

Summing rows in Python Dataframe

I just started learning Python so forgive me if this question has already been answered somewhere else. I want to create a new column called "Sum", which will simply be the previous columns added up.
Risk_Parity.tail()
VCIT VCLT PCY RWR IJR XLU EWL
Date
2017-01-31 21.704155 11.733716 9.588649 8.278629 5.061788 7.010918 7.951747
2017-02-28 19.839319 10.748690 9.582891 7.548530 5.066478 7.453951 7.950232
2017-03-31 19.986782 10.754507 9.593623 7.370828 5.024079 7.402774 7.654366
2017-04-30 18.897307 11.102380 10.021139 9.666693 5.901137 7.398604 11.284331
2017-05-31 63.962659 23.670240 46.018698 9.917160 15.234977 12.344524 20.405587
The table columns are a little off but all I need is (21.70 + 11.73...+7.95)
I can only get as far as creating the column Risk_Parity['sum'] = , but then I'm lost.
I'd rather not do have to do Risk_Parity['sum] = Risk_Parity['VCIT'] + Risk_Parity['VCLT']...
After creating the sum column, I want to divide each column by the sum column and make that into a new dataframe, which wouldn't include the sum column.
If anyone could help, I'd greatly appreciate it. Please try to dumb your answers down as much as possible lol.
Thanks!
Tom
Use sum with the parameter axis=1 to specify summation over rows
Risk_Parity['Sum'] = Risk_Parity.sum(1)
To create a new copy of Risk_Parity without writing a new column to the original
Risk_Parity.assign(Sum= Risk_Parity.sum(1))
Notice also, that I named the column Sum and not sum. I did this to avoid colliding with the very same method named sum I used to create the column.
To only include numeric columns... however, sum knows to skip non-numeric columns anyway.
RiskParity.assign(Sum=RiskParity.select_dtypes(['number']).sum(1))
# same as
# RiskParity.assign(Sum=RiskParity.sum(1))
VCIT VCLT PCY RWR IJR XLU EWL Sum
Date
2017-01-31 21.70 11.73 9.59 8.28 5.06 7.01 7.95 71.33
2017-02-28 19.84 10.75 9.58 7.55 5.07 7.45 7.95 68.19
2017-03-31 19.99 10.75 9.59 7.37 5.02 7.40 7.65 67.79
2017-04-30 18.90 11.10 10.02 9.67 5.90 7.40 11.28 74.27
2017-05-31 63.96 23.67 46.02 9.92 15.23 12.34 20.41 191.55
l = ['VCIT' , VCLT' ,PCY' ... 'EWL']
Risk_Parity['sum'] = 0
for item in l:
Risk_Parity['sum'] += Risk_Parity[item]

Appending data row from one dataframe to another with respect to date

I am brand new to pandas and working with two dataframes. My goal is to append the non-date values of df_ls (below) column-wise to their nearest respective date in df_1. Is the only way to do this with a traditional for-loop or is their some more effective built-in method/function. I have googled this extensively without any luck and have only found ways to append blocks of dataframes to other dataframes. I haven't found a way to search through a dataframe and append a row in another dataframe at the nearest respective date. See example below:
Example of first dataframe (lets call it df_ls):
DATE ALBEDO_SUR B13_RATIO B23_RATIO B1_RAW B2_RAW
0 1999-07-04 0.070771 1.606958 1.292280 0.128069 0.103018
1 1999-07-20 0.030795 2.326290 1.728147 0.099020 0.073595
2 1999-08-21 0.022819 2.492871 1.762536 0.096888 0.068502
3 1999-09-06 0.014613 2.792271 1.894225 0.090590 0.061445
4 1999-10-08 0.004978 2.781847 1.790768 0.089291 0.057521
5 1999-10-24 0.003144 2.818474 1.805257 0.090623 0.058054
6 1999-11-09 0.000859 3.146100 1.993941 0.092787 0.058823
7 1999-12-11 0.000912 2.913604 1.656642 0.097239 0.055357
8 1999-12-27 0.000877 2.974692 1.799949 0.098282 0.059427
9 2000-01-28 0.000758 3.092533 1.782112 0.095153 0.054809
10 2000-03-16 0.002933 2.969185 1.727465 0.083059 0.048322
11 2000-04-01 0.016814 2.366437 1.514110 0.089720 0.057398
12 2000-05-03 0.047370 1.847763 1.401930 0.109767 0.083290
13 2000-05-19 0.089432 1.402798 1.178798 0.137965 0.115936
14 2000-06-04 0.056340 1.807828 1.422489 0.118601 0.093328
Example of second dataframe (let's call it df_1)
Sample Date Value
0 2000-05-09 1.68
1 2000-05-09 1.68
2 2000-05-18 1.75
3 2000-05-18 1.75
4 2000-05-31 1.40
5 2000-05-31 1.40
6 2000-06-13 1.07
7 2000-06-13 1.07
8 2000-06-27 1.49
9 2000-06-27 1.49
10 2000-07-11 2.29
11 2000-07-11 2.29
In the end, my goal is to have something like this (Note the appended values are values closest to the Sample Date, even though they dont match up perfectly):
Sample Date Value ALBEDO_SUR B13_RATIO B23_RATIO B1_RAW B2_RAW
0 2000-05-09 1.68 0.047370 1.847763 1.401930 0.109767 0.083290
1 2000-05-09 1.68 0.047370 1.847763 1.401930 0.109767 0.083290
2 2000-05-18 1.75 0.089432 1.402798 1.178798 0.137965 0.115936
3 2000-05-18 1.75 0.089432 1.402798 1.178798 0.137965 0.115936
4 2000-05-31 1.40 0.056340 1.807828 1.422489 0.118601 0.093328
5 2000-05-31 1.40 0.056340 1.807828 1.422489 0.118601 0.093328
6 2000-06-13 1.07 ETC.... ETC.... ETC ...
7 2000-06-13 1.07
8 2000-06-27 1.49
9 2000-06-27 1.49
10 2000-07-11 2.29
11 2000-07-11 2.29
Thanks for any and all help. As I said I am new to this and I have experience with this sort of thing in MATLAB but PANDAS is a new to me.
Thanks

Categories

Resources