Expand time series data in pandas dataframe - python

I am attempting to interpolate between time points for all data in a pandas dataframe. My current data is in time increments of 0.04 seconds, and I want it in increments of 0.01 seconds to match another data set. I realize I can use the DataFrame.interpolate() function for this. However, I am stuck on how to efficiently insert 3 rows of NaN between every pair of rows in my dataframe.
import pandas as pd
import numpy as np

df = pd.DataFrame(data={"Time": [0.0, 0.04, 0.08, 0.12],
                        "Pulse": [76, 74, 77, 80],
                        "O2": [99, 100, 99, 98]})
df_ins = pd.DataFrame(data={"Time": [np.nan, np.nan, np.nan],
                            "Pulse": [np.nan, np.nan, np.nan],
                            "O2": [np.nan, np.nan, np.nan]})
I want df to transform from this:
Time Pulse O2
0 0.00 76 99
1 0.04 74 100
2 0.08 77 99
3 0.12 80 98
To something like this:
Time Pulse O2
0 0.00 76 99
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 0.04 74 100
5 NaN NaN NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 0.08 77 99
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 0.12 80 98
I could then call
df = df.interpolate()
which would yield something like this (I'm making up the numbers here):
Time Pulse O2
0 0.00 76 99
1 0.01 76 99
2 0.02 75 99
3 0.03 74 100
4 0.04 74 100
5 0.05 75 100
6 0.06 76 99
7 0.07 77 99
8 0.08 77 99
9 0.09 77 99
10 0.10 78 98
11 0.11 79 98
12 0.12 80 98
I attempted to use an iterrows technique by inserting the df_ins frame after every row. But my index was thrown off during the iteration. I also tried slicing df and concatenating the df slices and df_ins, but once again the indexes were thrown off by the loop.
Does anyone have any recommendations on how to do this efficiently?
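For reference, the NaN-row insertion asked about here can be done without any loop by spreading the existing rows over every fourth index position and reindexing (a sketch, assuming the 4-row frame above):

```python
import pandas as pd

df = pd.DataFrame({"Time": [0.0, 0.04, 0.08, 0.12],
                   "Pulse": [76, 74, 77, 80],
                   "O2": [99, 100, 99, 98]})

# Place each existing row at every 4th position, then reindex so the
# gaps become all-NaN rows, ready for df.interpolate().
df.index = range(0, 4 * len(df), 4)
expanded = df.reindex(range(4 * len(df) - 3))
```

Calling expanded.interpolate() then fills the inserted rows linearly.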

Use resample here (replace ffill with your desired fill behavior, or use interpolate as shown further down):
df["Time"] = pd.to_timedelta(df["Time"], unit="s")
df.set_index("Time").resample("10ms").ffill()
Pulse O2
Time
00:00:00 76 99
00:00:00.010000 76 99
00:00:00.020000 76 99
00:00:00.030000 76 99
00:00:00.040000 74 100
00:00:00.050000 74 100
00:00:00.060000 74 100
00:00:00.070000 74 100
00:00:00.080000 77 99
00:00:00.090000 77 99
00:00:00.100000 77 99
00:00:00.110000 77 99
00:00:00.120000 80 98
If you do want to interpolate:
df.set_index("Time").resample("10ms").interpolate()
Pulse O2
Time
00:00:00 76.00 99.00
00:00:00.010000 75.50 99.25
00:00:00.020000 75.00 99.50
00:00:00.030000 74.50 99.75
00:00:00.040000 74.00 100.00
00:00:00.050000 74.75 99.75
00:00:00.060000 75.50 99.50
00:00:00.070000 76.25 99.25
00:00:00.080000 77.00 99.00
00:00:00.090000 77.75 98.75
00:00:00.100000 78.50 98.50
00:00:00.110000 79.25 98.25
00:00:00.120000 80.00 98.00
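If the other data set keys on a plain float column rather than timedeltas, the resampled result can be converted back to fractional seconds; a sketch assuming the frame from the question:

```python
import pandas as pd

df = pd.DataFrame({"Time": [0.0, 0.04, 0.08, 0.12],
                   "Pulse": [76, 74, 77, 80],
                   "O2": [99, 100, 99, 98]})

# Upsample to 10 ms via a timedelta index, interpolate, then turn the
# index back into a float column of seconds.
df["Time"] = pd.to_timedelta(df["Time"], unit="s")
out = df.set_index("Time").resample("10ms").interpolate().reset_index()
out["Time"] = out["Time"].dt.total_seconds()
```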

I believe using np.linspace and processing the columns one by one should be faster than resample/interpolate (and it works even if your Time column is not in a datetime/timedelta format):
import numpy as np
import pandas as pd

new_dict = {}
for c in df.columns:
    arr = df[c].to_numpy()
    ret = []
    for i in range(1, len(arr)):
        # 3 evenly spaced values strictly between arr[i-1] and arr[i]
        ret.append(np.linspace(arr[i - 1], arr[i], 4, endpoint=False)[1:])
    new_dict[c] = np.concatenate(ret)

pd.concat([df, pd.DataFrame(new_dict)]).sort_values('Time').reset_index(drop=True)
Time Pulse O2
0 0.00 76.00 99.00
1 0.01 75.50 99.25
2 0.02 75.00 99.50
3 0.03 74.50 99.75
4 0.04 74.00 100.00
5 0.05 74.75 99.75
6 0.06 75.50 99.50
7 0.07 76.25 99.25
8 0.08 77.00 99.00
9 0.09 77.75 98.75
10 0.10 78.50 98.50
11 0.11 79.25 98.25
12 0.12 80.00 98.00
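A compact variant of the same idea keeps Time numeric throughout: set it as the index, reindex onto the finer grid, and interpolate against the index values (a sketch; the rounding guards against floating-point mismatches on the 0.01 grid):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Time": [0.0, 0.04, 0.08, 0.12],
                   "Pulse": [76, 74, 77, 80],
                   "O2": [99, 100, 99, 98]})

# Build the 0.01-second grid and align both it and Time to 2 decimals.
grid = np.round(np.arange(13) * 0.01, 2)
out = (df.set_index(df["Time"].round(2))
         .reindex(grid)
         .interpolate(method="index")   # weight by actual index spacing
         .reset_index(drop=True))
out["Time"] = grid
```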

Related

Replace rows by index in a Pandas Dataframe with values with corresponding index in another Dataframe

I am trying to write a function that, given a user-specified number of time steps, overwrites the value at each corresponding index with a value from another dataframe. For example:
df1
index date skew
92 2019-09-02 0
93 2019-09-03 0
94 2019-09-04 0
95 2019-09-05 0
96 2019-09-06 0
97 2019-09-09 0
df2
index change
13 0.63
14 0.61
15 0.98
16 0.11
17 0.43
The result I am after:
result_df
index date skew
92 2019-09-02 0
93 2019-09-03 0.63
94 2019-09-04 0.61
95 2019-09-05 0.98
96 2019-09-06 0.11
97 2019-09-09 0.43
Using a for loop and df1.at[-i, 'skew'] = df2.loc[-i, 'change']
I am getting the following result:
index date skew
92 2019-09-02 0
93 2019-09-03 0
94 2019-09-04 0
95 2019-09-05 0
96 2019-09-06 0
97 2019-09-09 0
-5 NaT 0.63
-4 NaT 0.61
-3 NaT 0.98
-2 NaT 0.11
-1 NaT 0.43
my current function:
num_timesteps = 5

def append_changes(df1, df2, num_timesteps):
    # Reverse loop to start from index df1.iloc[-num_timesteps:]
    for i in range(num_timesteps, 0, -1):
        df1.at[-i:, 'filler'] = df2.loc[-i:, 'change']
    return df1
I expect the row values under the skew column from index -5 (as per num_timesteps) to the end of the dataframe to be replaced with those values from the 'change' column in df2 at the same index.
I think no loop is necessary. Use DataFrame.iloc with the column positions obtained from Index.get_loc to select and set the new values; to avoid index alignment during the assignment, assign the underlying numpy array obtained with .values:
num_timesteps = 5

def append_changes(df1, df2, num_timesteps):
    arr = df2.iloc[-num_timesteps:, df2.columns.get_loc('change')].values
    df1.iloc[-num_timesteps:, df1.columns.get_loc('skew')] = arr
    return df1

print(append_changes(df1, df2, num_timesteps))
date skew
index
92 2019-09-02 0.00
93 2019-09-03 0.63
94 2019-09-04 0.61
95 2019-09-05 0.98
96 2019-09-06 0.11
97 2019-09-09 0.43
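A minimal end-to-end reproduction (column names as in the question, df2 values as shown above) demonstrating that the positional assignment sidesteps the mismatched indexes:

```python
import pandas as pd

df1 = pd.DataFrame({"date": pd.to_datetime(["2019-09-02", "2019-09-03",
                                            "2019-09-04", "2019-09-05",
                                            "2019-09-06", "2019-09-09"]),
                    "skew": [0.0] * 6},
                   index=range(92, 98))
df2 = pd.DataFrame({"change": [0.63, 0.61, 0.98, 0.11, 0.43]},
                   index=range(13, 18))

num_timesteps = 5
# iloc works purely by position, so the 92..97 vs 13..17 indexes never meet.
arr = df2.iloc[-num_timesteps:, df2.columns.get_loc("change")].to_numpy()
df1.iloc[-num_timesteps:, df1.columns.get_loc("skew")] = arr
```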

How to create multiple spacing CSV from pandas?

I have a pandas dataframe and want to output a text file whose columns are separated by varying amounts of whitespace, as input to another model. How can I do that?
A sample of the desired OUTPUT text file follows (each column in the text file corresponds to a column in df):
SO HOUREMIS 92 5 1 1 MC12 386.91 389.8 11.45
SO HOUREMIS 92 5 1 1 MC3 0.00 0.1 0.10
SO HOUREMIS 92 5 1 1 MC4 0.00 0.1 0.10
SO HOUREMIS 92 5 1 1 ED1 0.00 0.1 0.10
SO HOUREMIS 92 5 1 1 ED2 322.00 397.4 13.00
SO HOUREMIS 92 5 1 1 HL2 25.55 464.3 7.46
SO HOUREMIS 92 5 1 1 WC1 0.00 0.1 0.10
SO HOUREMIS 92 5 1 1 WC2 0.00 0.1 0.10
SO HOUREMIS 92 5 1 2 MC12 405.35 389.3 11.54
SO HOUREMIS 92 5 1 2 MC3 0.00 0.1 0.10
SO HOUREMIS 92 5 1 2 MC4 0.00 0.1 0.10
SO HOUREMIS 92 5 1 2 ED1 0.00 0.1 0.10
SO HOUREMIS 92 5 1 2 ED2 319.90 396.3 13.00
After referring to this post, I found a solution:
fmt = '%0s %+1s %+1s %+2s %+2s %+2s %+6s %+15s'
np.savetxt('test.txt', data.values[0:10], fmt=fmt)
This lets me format each column and specify its width and alignment.
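For reference, a self-contained sketch (column names and values invented) of how such a %-style format string drives np.savetxt: %-Ns left-aligns a string in an N-character field, and %N.Mf right-aligns a float:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"kw": ["SO", "SO"],
                   "rec": ["HOUREMIS", "HOUREMIS"],
                   "src": ["MC12", "MC3"],
                   "v1": [386.91, 0.00],
                   "v2": [389.8, 0.1]})

# One %-spec per column: strings left-aligned, floats right-aligned
# in fixed-width fields.
fmt = "%-2s %-8s %-4s %8.2f %8.1f"
np.savetxt("fixed_width.txt", df.values, fmt=fmt)
```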

Pandas: Perform operation on various columns and create, rename new columns

We have a dataframe 'A' with 5 columns, and we want to add the rolling mean of each column, we could do:
A = pd.DataFrame(np.random.randint(100, size=(5, 5)))
for i in range(0, 5):
    A[i + 6] = A[i].rolling(3).mean()
If however 'A' has column named 'A', 'B'...'E':
A = pd.DataFrame(np.random.randint(100, size=(5, 5)), columns = ['A', 'B',
'C', 'D', 'E'])
How could we neatly add 5 columns with the rolling mean, and each name being ['A_mean', 'B_mean', ....'E_mean']?
try this:
for col in A:
    A[col + '_mean'] = A[col].rolling(3).mean()
Output with your way:
0 1 2 3 4 6 7 8 9 10
0 16 53 9 16 67 NaN NaN NaN NaN NaN
1 55 37 93 92 21 NaN NaN NaN NaN NaN
2 10 5 93 99 27 27.0 31.666667 65.000000 69.000000 38.333333
3 94 32 81 91 34 53.0 24.666667 89.000000 94.000000 27.333333
4 37 46 20 18 10 47.0 27.666667 64.666667 69.333333 23.666667
and Output with mine:
A B C D E A_mean B_mean C_mean D_mean E_mean
0 16 53 9 16 67 NaN NaN NaN NaN NaN
1 55 37 93 92 21 NaN NaN NaN NaN NaN
2 10 5 93 99 27 27.0 31.666667 65.000000 69.000000 38.333333
3 94 32 81 91 34 53.0 24.666667 89.000000 94.000000 27.333333
4 37 46 20 18 10 47.0 27.666667 64.666667 69.333333 23.666667
Without loops:
pd.concat([A, A.apply(lambda x:x.rolling(3).mean()).rename(
columns={col: str(col) + '_mean' for col in A})], axis=1)
A B C D E A_mean B_mean C_mean D_mean E_mean
0 67 54 85 61 62 NaN NaN NaN NaN NaN
1 44 53 30 80 58 NaN NaN NaN NaN NaN
2 10 59 14 39 12 40.333333 55.333333 43.0 60.000000 44.000000
3 47 25 58 93 38 33.666667 45.666667 34.0 70.666667 36.000000
4 73 80 30 51 77 43.333333 54.666667 34.0 61.000000 42.333333
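The rename dict can also be avoided: rolling works on the whole frame at once, and add_suffix renames every column in one go (a sketch of the same idea):

```python
import numpy as np
import pandas as pd

A = pd.DataFrame(np.random.randint(100, size=(5, 5)),
                 columns=["A", "B", "C", "D", "E"])

# rolling(3).mean() is applied column-wise to the whole frame;
# add_suffix turns A -> A_mean, B -> B_mean, ...
out = pd.concat([A, A.rolling(3).mean().add_suffix("_mean")], axis=1)
```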

Python/Scikit-learn/regressions - from pandas Dataframes to Scikit prediction

I have the following pandas DataFrame, called main_frame:
target_var input1 input2 input3 input4 input5 input6
Date
2013-09-01 13.0 NaN NaN NaN NaN NaN NaN
2013-10-01 13.0 NaN NaN NaN NaN NaN NaN
2013-11-01 12.2 NaN NaN NaN NaN NaN NaN
2013-12-01 10.9 NaN NaN NaN NaN NaN NaN
2014-01-01 11.7 0 13 42 0 0 16
2014-02-01 12.0 13 8 58 0 0 14
2014-03-01 12.8 13 15 100 0 0 24
2014-04-01 13.1 0 11 50 34 0 18
2014-05-01 12.2 12 14 56 30 71 18
2014-06-01 11.7 13 16 43 44 0 22
2014-07-01 11.2 0 19 45 35 0 18
2014-08-01 11.4 12 16 37 31 0 24
2014-09-01 10.9 14 14 47 30 56 20
2014-10-01 10.5 15 17 54 24 56 22
2014-11-01 10.7 12 18 60 41 63 21
2014-12-01 9.6 12 14 42 29 53 16
2015-01-01 10.2 10 16 37 31 0 20
2015-02-01 10.7 11 20 39 28 0 19
2015-03-01 10.9 10 17 75 27 87 22
2015-04-01 10.8 14 17 73 30 43 25
2015-05-01 10.2 10 17 55 31 52 24
I've been having trouble exploring the dataset in scikit-learn, and I'm not sure whether the problem is the pandas DataFrame, the dates as index, the NaNs/Infs/zeros (which I don't know how to handle), all of the above, or something else I haven't been able to track down.
I want to build a simple regression to predict the next target_var item based on the variables named "Input" (1,2,3..).
Note that there are a lot of zeros and NaN's in the time series, and eventually we might find Inf's as well.
You should first remove any row containing an Inf, -Inf or NaN value (other approaches include filling in the NaNs with, for example, the mean value of the feature):
df = df.replace(to_replace=[np.inf, -np.inf], value=np.nan)
df = df.dropna()
Now, create a numpy matrix of your features and a vector of your targets. Given that your target variable is in the first column, you can use integer-based indexing as follows:
X = df.iloc[:, 1:].values
y = df.iloc[:, 0].values
Then create and fit your model:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X=X, y=y)
Now you can observe your estimates:
>>> model.intercept_
12.109583092421092
>>> model.coef_
array([-0.05269033, -0.17723251, 0.03627883, 0.02219596, -0.01377465,
0.0111017 ])
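Once fitted, new rows with the same six inputs can be scored with model.predict. The numbers above come from the asker's data; here is a sketch with synthetic, noise-free data, where the coefficients are recovered exactly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.random((20, 6))
# Exact linear target: known coefficients plus an intercept of 4.
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0, -1.0]) + 4.0

model = LinearRegression()
model.fit(X, y)
pred = model.predict(X[:1])   # predict the target for one new observation
```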

Efficiently expand lines from pandas DataFrame

I'm new to pandas and I'm trying to read a strangely formatted file into a DataFrame.
The original file looks like this:
; No Time Date MoistAve MatTemp TDRConduct TDRAve DeltaCount tpAve Moist1 Moist2 Moist3 Moist4 TDR1 TDR2 TDR3 TDR4
1 11:38:17 11.07.2012 11.37 48.20 5.15 88.87 15 344.50 11.84 11.35 11.59 15.25 89.0 89.0 89.0 88.0
2 11:38:18 11.07.2012 11.44 48.20 5.13 88.88 2 346.22 12.08 11.83 -1.00 -1.00 89.0 89.0 -1.0 -1.0
3 11:38:19 11.07.2012 11.10 48.20 4.96 89.00 3 337.84 11.83 11.59 10.62 -1.00 89.0 89.0 89.0 -1.0
4 11:38:19 11.07.2012 11.82 48.20 5.54 88.60 3 355.92 11.10 13.54 12.32 -1.00 89.0 88.0 88.0 -1.0
I managed to get an equally structured DataFrame with:
In [42]: date_spec = {'FetchTime': [1, 2]}
In [43]: df = pd.read_csv('MeasureCK32450-20120711114050.mck', header=7, sep=r'\s\s+',
parse_dates=date_spec, na_values=['-1.0', '-1.00'])
In [44]: df
Out[52]:
FetchTime ; No MoistAve MatTemp TDRConduct TDRAve DeltaCount tpAve Moist1 Moist2 Moist3 Moist4 TDR1 TDR2 TDR3 TDR4
0 2012-11-07 11:38:17 1 11.37 48.2 5.15 88.87 15 344.50 11.84 11.35 11.59 15.25 89 89 89 88
1 2012-11-07 11:38:18 2 11.44 48.2 5.13 88.88 2 346.22 12.08 11.83 NaN NaN 89 89 NaN NaN
2 2012-11-07 11:38:19 3 11.10 48.2 4.96 89.00 3 337.84 11.83 11.59 10.62 NaN 89 89 89 NaN
3 2012-11-07 11:38:19 4 11.82 48.2 5.54 88.60 3 355.92 11.10 13.54 12.32 NaN 89 88 88 NaN
But now I have to expand each line of this DataFrame
.... Moist1 Moist2 Moist3 Moist4 TDR1 TDR2 TDR3 TDR4
1 .... 11.84 11.35 11.59 15.25 89 89 89 88
2 .... 12.08 11.83 NaN NaN 89 89 NaN NaN
into four lines (with three indexes No, FetchTime, and MeasureNo):
.... Moist TDR
No FetchTime MeasureNo
0 2012-11-07 11:38:17 1 .... 11.84 89 # from line 1, Moist1 and TDR1
1 2 .... 11.35 89 # from line 1, Moist2 and TDR2
2 3 .... 11.59 89 # from line 1, Moist3 and TDR3
3 4 .... 15.25 88 # from line 1, Moist4 and TDR4
4 2012-11-07 11:38:18 1 .... 12.08 89 # from line 2, Moist1 and TDR1
5 2 .... 11.83 89 # from line 2, Moist2 and TDR2
6 3 .... NaN NaN # from line 2, Moist3 and TDR3
7 4 .... NaN NaN # from line 2, Moist4 and TDR4
by preserving the other columns and, most importantly, preserving the order of the entries. I know I can iterate through each line with for row in df.iterrows(): ..., but I've read this is not very fast. My first approach was this:
In [54]: data = []
In [55]: for d in range(1,5):
....: temp = df.ix[:, ['FetchTime', 'MoistAve', 'MatTemp', 'TDRConduct', 'TDRAve', 'DeltaCount', 'tpAve', 'Moist%d' % d, 'TDR%d' % d]]
....: temp.columns = ['FetchTime', 'MoistAve', 'MatTemp', 'TDRConduct', 'TDRAve', 'DeltaCount', 'tpAve', 'RawMoist', 'RawTDR']
....: temp['MeasureNo'] = d
....: data.append(temp)
....:
In [56]: test = pd.concat(data, ignore_index=True)
In [62]: test.head()
Out[62]:
FetchTime MoistAve MatTemp TDRConduct TDRAve DeltaCount tpAve RawMoist RawTDR MeasureNo
0 2012-11-07 11:38:17 11.37 48.2 5.15 88.87 15 344.50 11.84 89 1
1 2012-11-07 11:38:18 11.44 48.2 5.13 88.88 2 346.22 12.08 89 1
2 2012-11-07 11:38:19 11.10 48.2 4.96 89.00 3 337.84 11.83 89 1
3 2012-11-07 11:38:19 11.82 48.2 5.54 88.60 3 355.92 11.10 89 1
4 2012-11-07 11:38:20 12.61 48.2 5.87 88.38 3 375.72 12.80 89 1
But I don't see a way to influence the concatenation to get the order I need ...
Is there another way to get the resulting DataFrame I need?
Here is a solution based on numpy's repeat and array indexing to build de-stacked values, and pandas' merge to output the concatenated result.
First, load a sample of your data into a DataFrame (with slightly changed read_csv arguments):
from io import StringIO

import numpy as np
import pandas as pd

data = """; No Time Date MoistAve MatTemp TDRConduct TDRAve DeltaCount tpAve Moist1 Moist2 Moist3 Moist4 TDR1 TDR2 TDR3 TDR4
1 11:38:17 11.07.2012 11.37 48.20 5.15 88.87 15 344.50 11.84 11.35 11.59 15.25 89.0 89.0 89.0 88.0
2 11:38:18 11.07.2012 11.44 48.20 5.13 88.88 2 346.22 12.08 11.83 -1.00 -1.00 89.0 89.0 -1.0 -1.0
3 11:38:19 11.07.2012 11.10 48.20 4.96 89.00 3 337.84 11.83 11.59 10.62 -1.00 89.0 89.0 89.0 -1.0
4 11:38:19 11.07.2012 11.82 48.20 5.54 88.60 3 355.92 11.10 13.54 12.32 -1.00 89.0 88.0 88.0 -1.0
"""
date_spec = {'FetchTime': [1, 2]}
df = pd.read_csv(StringIO(data), header=0, sep=r'\s\s+', engine='python',
                 parse_dates=date_spec, na_values=['-1.0', '-1.00'])
Then build a de-stacked vector of TDRs and merge it with the original data frame:
stacked_col_names = ['TDR1', 'TDR2', 'TDR3', 'TDR4']
repeated_row_indexes = np.repeat(np.arange(df.shape[0]), 4)
# tile the four column positions so they pair up element-wise with the repeated row indexes
repeated_col_indexes = np.tile([np.where(df.columns == c)[0][0] for c in stacked_col_names], df.shape[0])
destacked_tdrs = pd.DataFrame(data=df.values[repeated_row_indexes, repeated_col_indexes],
                              index=df.index[repeated_row_indexes], columns=['TDR'])
output = pd.merge(left_index=True, right_index=True, left=df, right=destacked_tdrs)
With the desired output:
output.loc[:, ['TDR1', 'TDR2', 'TDR3', 'TDR4', 'TDR']]
TDR1 TDR2 TDR3 TDR4 TDR
0 89 89 89 88 89
0 89 89 89 88 89
0 89 89 89 88 89
0 89 89 89 88 88
1 89 89 NaN NaN 89
1 89 89 NaN NaN 89
1 89 89 NaN NaN NaN
1 89 89 NaN NaN NaN
2 89 89 89 NaN 89
2 89 89 89 NaN 89
2 89 89 89 NaN 89
2 89 89 89 NaN NaN
3 89 88 88 NaN 89
3 89 88 88 NaN 88
3 89 88 88 NaN 88
3 89 88 88 NaN NaN
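On current pandas the same de-stacking can be written declaratively with pd.wide_to_long, which splits the numbered Moist1..Moist4 / TDR1..TDR4 columns into a MeasureNo level (a sketch on a hand-built two-row subset of the data above):

```python
import pandas as pd

df = pd.DataFrame({"No": [1, 2],
                   "MoistAve": [11.37, 11.44],
                   "Moist1": [11.84, 12.08], "Moist2": [11.35, 11.83],
                   "Moist3": [11.59, None], "Moist4": [15.25, None],
                   "TDR1": [89.0, 89.0], "TDR2": [89.0, 89.0],
                   "TDR3": [89.0, None], "TDR4": [88.0, None]})

# Stub names Moist/TDR are split on their numeric suffix; the other
# columns (No, MoistAve) are repeated for every MeasureNo.
long_df = (pd.wide_to_long(df, stubnames=["Moist", "TDR"],
                           i="No", j="MeasureNo")
             .sort_index()
             .reset_index())
```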
This gives every fourth row in test, starting at row i:
test.iloc[i::4]
Using the same basic loop as above, just append the set of every fourth row, starting at 0 through 3, after you run your code above:
data = []
for i in range(4):
    data.append(test.iloc[i::4])
test2 = pd.concat(data, ignore_index=True)
Update:
I realize now that what you'd want isn't every fourth row but every mth row, so this would just be the loop suggestion above. Sorry.
Update 2:
Maybe not. We can take advantage of the fact that even though concatenate doesn't return the order you want, what it does return has a fixed mapping to what you do want. Let d be the number of rows per timestamp and m the number of timestamps.
You seem to want the rows from test as follows:
[0,m,2m,3m,1,m+1,2m+1,3m+1,2,m+2,2m+2,3m+2,...,m-1,2m-1,3m-1,4m-1]
I'm sure there are much nicer ways to generate that list of indices, but this worked for me:
d = 4
m = 10
small = np.arange(0, m).reshape(m, 1).repeat(d, 1).T.reshape(-1, 1)
shifter = np.arange(0, d).repeat(m).reshape(-1, 1).T * m
NewIndex = (shifter.reshape(d, -1) + small.reshape(d, -1)).T.reshape(-1, 1)
NewIndex = NewIndex.reshape(-1)
test = test.iloc[NewIndex]
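The three-step construction can be collapsed into one reshape (a sketch, with d rows per timestamp and m timestamps): laying 0..d*m-1 out as a d-by-m grid and reading it column by column yields exactly [0, m, 2m, 3m, 1, m+1, ...]:

```python
import numpy as np

d, m = 4, 10
# d rows of m consecutive integers, transposed and flattened:
# walks the grid column-first, giving [0, m, 2m, 3m, 1, m+1, ...].
new_index = np.arange(d * m).reshape(d, m).T.ravel()
```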
