I am trying to write a function that, given a user-specified number of time steps, will overwrite the values at the corresponding indices with values from another dataframe. For example:
df1
index date skew
92 2019-09-02 0
93 2019-09-03 0
94 2019-09-04 0
95 2019-09-05 0
96 2019-09-06 0
97 2019-09-09 0
df2
index change
13 0.63
14 0.61
15 0.98
16 0.11
17 0.43
The result I am after:
result_df
index date skew
92 2019-09-02 0
93 2019-09-03 0.63
94 2019-09-04 0.61
95 2019-09-05 0.98
96 2019-09-06 0.11
97 2019-09-09 0.43
Using a for loop and df1.at[-i, 'skew'] = df2.loc[-i, 'change']
I am getting the following result:
index date skew
92 2019-09-02 0
93 2019-09-03 0
94 2019-09-04 0
95 2019-09-05 0
96 2019-09-06 0
97 2019-09-09 0
-5 NaT 0.63
-4 NaT 0.61
-3 NaT 0.98
-2 NaT 0.11
-1 NaT 0.43
My current function:
num_timesteps = 5

def append_changes(df1, df2, num_timesteps):
    # reverse loop to start from index df1.iloc[-num_timesteps:]
    for i in range(num_timesteps, 0, -1):
        df1.at[-i, 'skew'] = df2.loc[-i, 'change']
    return df1
I expect the row values under the skew column from index -5 (as per num_timesteps) to the end of the dataframe to be replaced with those values from the 'change' column in df2 at the same index.
I think no loop is necessary here. Use DataFrame.iloc with the positional index of each column obtained from Index.get_loc to select and set the new values; to avoid index alignment, assign the underlying numpy array created by .values:
num_timesteps = 5

def append_changes(df1, df2, num_timesteps):
    arr = df2.iloc[-num_timesteps:, df2.columns.get_loc('change')].values
    df1.iloc[-num_timesteps:, df1.columns.get_loc('skew')] = arr
    return df1
print (append_changes(df1, df2, num_timesteps))
date skew
index
92 2019-09-02 0.00
93 2019-09-03 0.63
94 2019-09-04 0.61
95 2019-09-05 0.98
96 2019-09-06 0.11
97 2019-09-09 0.43
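For reference, here is a self-contained sketch of the same positional assignment on toy frames shaped like the example above (the dates and index values are reconstructed for illustration, not the asker's real data):

```python
import pandas as pd

# toy frames mirroring the example (business days 2019-09-02 .. 2019-09-09)
df1 = pd.DataFrame({'date': pd.date_range('2019-09-02', periods=6, freq='B'),
                    'skew': [0.0] * 6}, index=range(92, 98))
df2 = pd.DataFrame({'change': [0.63, 0.61, 0.98, 0.11, 0.43]},
                   index=range(13, 18))

num_timesteps = 5

def append_changes(df1, df2, num_timesteps):
    # select and assign by position, bypassing index alignment via .values
    arr = df2.iloc[-num_timesteps:, df2.columns.get_loc('change')].values
    df1.iloc[-num_timesteps:, df1.columns.get_loc('skew')] = arr
    return df1

result = append_changes(df1, df2, num_timesteps)
print(result['skew'].tolist())  # [0.0, 0.63, 0.61, 0.98, 0.11, 0.43]
```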
I have a pandas dataframe and want to output a text file separated by varying spacing for input to another model. How can I do that?
The sample OUTPUT text file is as follows (each column in the text file corresponds to a column in df):
SO HOUREMIS 92 5 1 1 MC12 386.91 389.8 11.45
SO HOUREMIS 92 5 1 1 MC3 0.00 0.1 0.10
SO HOUREMIS 92 5 1 1 MC4 0.00 0.1 0.10
SO HOUREMIS 92 5 1 1 ED1 0.00 0.1 0.10
SO HOUREMIS 92 5 1 1 ED2 322.00 397.4 13.00
SO HOUREMIS 92 5 1 1 HL2 25.55 464.3 7.46
SO HOUREMIS 92 5 1 1 WC1 0.00 0.1 0.10
SO HOUREMIS 92 5 1 1 WC2 0.00 0.1 0.10
SO HOUREMIS 92 5 1 2 MC12 405.35 389.3 11.54
SO HOUREMIS 92 5 1 2 MC3 0.00 0.1 0.10
SO HOUREMIS 92 5 1 2 MC4 0.00 0.1 0.10
SO HOUREMIS 92 5 1 2 ED1 0.00 0.1 0.10
SO HOUREMIS 92 5 1 2 ED2 319.90 396.3 13.00
After referring to this post, I found the solution:
fmt = '%0s %+1s %+1s %+2s %+2s %+2s %+6s %+15s'
np.savetxt('test.txt', data.values[0:10], fmt=fmt)
I can format each column and specify the width and alignment of each.
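As a sketch of what this looks like end to end (the column names and widths below are invented for illustration, not the asker's real layout):

```python
import numpy as np
import pandas as pd

# toy frame standing in for the model-input data (names/widths are assumptions)
df = pd.DataFrame({'tag':  ['SO', 'SO'],
                   'card': ['HOUREMIS', 'HOUREMIS'],
                   'year': [92, 92],
                   'src':  ['MC12', 'MC3'],
                   'emis': [386.91, 0.00]})

# one printf-style spec per column: %-Ns left-aligns a string in N characters,
# %N.2f right-aligns a float with two decimals in N characters
fmt = '%-2s %-8s %3d %-4s %8.2f'
np.savetxt('test.txt', df.values, fmt=fmt)
```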
If we have a dataframe 'A' with 5 columns and want to add the rolling mean of each column, we could do:
A = pd.DataFrame(np.random.randint(100, size=(5, 5)))
for i in range(0, 5):
    A[i + 6] = A[i].rolling(3).mean()
If however 'A' has column named 'A', 'B'...'E':
A = pd.DataFrame(np.random.randint(100, size=(5, 5)), columns = ['A', 'B',
'C', 'D', 'E'])
How could we neatly add 5 columns with the rolling mean, and each name being ['A_mean', 'B_mean', ....'E_mean']?
Try this:
for col in A:
    A[col + '_mean'] = A[col].rolling(3).mean()
Output with your way:
0 1 2 3 4 6 7 8 9 10
0 16 53 9 16 67 NaN NaN NaN NaN NaN
1 55 37 93 92 21 NaN NaN NaN NaN NaN
2 10 5 93 99 27 27.0 31.666667 65.000000 69.000000 38.333333
3 94 32 81 91 34 53.0 24.666667 89.000000 94.000000 27.333333
4 37 46 20 18 10 47.0 27.666667 64.666667 69.333333 23.666667
and Output with mine:
A B C D E A_mean B_mean C_mean D_mean E_mean
0 16 53 9 16 67 NaN NaN NaN NaN NaN
1 55 37 93 92 21 NaN NaN NaN NaN NaN
2 10 5 93 99 27 27.0 31.666667 65.000000 69.000000 38.333333
3 94 32 81 91 34 53.0 24.666667 89.000000 94.000000 27.333333
4 37 46 20 18 10 47.0 27.666667 64.666667 69.333333 23.666667
Without loops:
pd.concat([A, A.apply(lambda x:x.rolling(3).mean()).rename(
columns={col: str(col) + '_mean' for col in A})], axis=1)
A B C D E A_mean B_mean C_mean D_mean E_mean
0 67 54 85 61 62 NaN NaN NaN NaN NaN
1 44 53 30 80 58 NaN NaN NaN NaN NaN
2 10 59 14 39 12 40.333333 55.333333 43.0 60.000000 44.000000
3 47 25 58 93 38 33.666667 45.666667 34.0 70.666667 36.000000
4 73 80 30 51 77 43.333333 54.666667 34.0 61.000000 42.333333
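The rename step can also be written with DataFrame.add_suffix, which appends a suffix to every column name in one call (equivalent result, a bit shorter):

```python
import numpy as np
import pandas as pd

A = pd.DataFrame(np.random.randint(100, size=(5, 5)),
                 columns=['A', 'B', 'C', 'D', 'E'])

# rolling(3).mean() keeps the original column names; add_suffix renames them
out = pd.concat([A, A.rolling(3).mean().add_suffix('_mean')], axis=1)
print(out.columns.tolist())
# ['A', 'B', 'C', 'D', 'E', 'A_mean', 'B_mean', 'C_mean', 'D_mean', 'E_mean']
```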
I have the following pandas DataFrame, called main_frame:
target_var input1 input2 input3 input4 input5 input6
Date
2013-09-01 13.0 NaN NaN NaN NaN NaN NaN
2013-10-01 13.0 NaN NaN NaN NaN NaN NaN
2013-11-01 12.2 NaN NaN NaN NaN NaN NaN
2013-12-01 10.9 NaN NaN NaN NaN NaN NaN
2014-01-01 11.7 0 13 42 0 0 16
2014-02-01 12.0 13 8 58 0 0 14
2014-03-01 12.8 13 15 100 0 0 24
2014-04-01 13.1 0 11 50 34 0 18
2014-05-01 12.2 12 14 56 30 71 18
2014-06-01 11.7 13 16 43 44 0 22
2014-07-01 11.2 0 19 45 35 0 18
2014-08-01 11.4 12 16 37 31 0 24
2014-09-01 10.9 14 14 47 30 56 20
2014-10-01 10.5 15 17 54 24 56 22
2014-11-01 10.7 12 18 60 41 63 21
2014-12-01 9.6 12 14 42 29 53 16
2015-01-01 10.2 10 16 37 31 0 20
2015-02-01 10.7 11 20 39 28 0 19
2015-03-01 10.9 10 17 75 27 87 22
2015-04-01 10.8 14 17 73 30 43 25
2015-05-01 10.2 10 17 55 31 52 24
I've been having trouble exploring the dataset with scikit-learn, and I'm not sure whether the problem is the pandas DataFrame, the dates as index, the NaNs/Infs/zeros (which I don't know how to handle), all of these, or something else I wasn't able to track down.
I want to build a simple regression to predict the next target_var item based on the variables named "Input" (1,2,3..).
Note that there are a lot of zeros and NaN's in the time series, and eventually we might find Inf's as well.
You should first try to remove any row with an Inf, -Inf or NaN value (other methods include filling in the NaNs with, for example, the mean value of the feature).
df = df.replace(to_replace=[np.Inf, -np.Inf], value=np.NaN)
df = df.dropna()
Now, create a numpy matrix of your features and a vector of your targets. Given that your target variable is in the first column, you can use integer-based indexing as follows:
X = df.iloc[:, 1:].values
y = df.iloc[:, 0].values
Then create and fit your model:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X=X, y=y)
Now you can observe your estimates:
>>> model.intercept_
12.109583092421092
>>> model.coef_
array([-0.05269033, -0.17723251, 0.03627883, 0.02219596, -0.01377465,
0.0111017 ])
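If scikit-learn is unavailable, the same ordinary-least-squares fit can be sketched with plain numpy (the numbers below are a toy stand-in for main_frame after the cleaning step, not the real data):

```python
import numpy as np
import pandas as pd

# toy stand-in for main_frame with the target in the first column
df = pd.DataFrame({'target_var': [11.7, 12.0, 12.8, 13.1, 12.2],
                   'input1': [0, 13, 13, 0, 12],
                   'input2': [13, 8, 15, 11, 14]})

X = df.iloc[:, 1:].values
y = df.iloc[:, 0].values

# prepend a column of ones so lstsq also estimates the intercept,
# matching what LinearRegression does with fit_intercept=True
X1 = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
intercept, slopes = coef[0], coef[1:]
print(slopes.shape)  # (2,)
```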
I'm new to pandas and I'm trying to read a strangely formatted file into a DataFrame.
The original file looks like this:
; No Time Date MoistAve MatTemp TDRConduct TDRAve DeltaCount tpAve Moist1 Moist2 Moist3 Moist4 TDR1 TDR2 TDR3 TDR4
1 11:38:17 11.07.2012 11.37 48.20 5.15 88.87 15 344.50 11.84 11.35 11.59 15.25 89.0 89.0 89.0 88.0
2 11:38:18 11.07.2012 11.44 48.20 5.13 88.88 2 346.22 12.08 11.83 -1.00 -1.00 89.0 89.0 -1.0 -1.0
3 11:38:19 11.07.2012 11.10 48.20 4.96 89.00 3 337.84 11.83 11.59 10.62 -1.00 89.0 89.0 89.0 -1.0
4 11:38:19 11.07.2012 11.82 48.20 5.54 88.60 3 355.92 11.10 13.54 12.32 -1.00 89.0 88.0 88.0 -1.0
I managed to get an equally structured DataFrame with:
In [42]: date_spec = {'FetchTime': [1, 2]}
In [43]: df = pd.read_csv('MeasureCK32450-20120711114050.mck', header=7, sep='\s\s+',
parse_dates=date_spec, na_values=['-1.0', '-1.00'])
In [44]: df
Out[52]:
FetchTime ; No MoistAve MatTemp TDRConduct TDRAve DeltaCount tpAve Moist1 Moist2 Moist3 Moist4 TDR1 TDR2 TDR3 TDR4
0 2012-11-07 11:38:17 1 11.37 48.2 5.15 88.87 15 344.50 11.84 11.35 11.59 15.25 89 89 89 88
1 2012-11-07 11:38:18 2 11.44 48.2 5.13 88.88 2 346.22 12.08 11.83 NaN NaN 89 89 NaN NaN
2 2012-11-07 11:38:19 3 11.10 48.2 4.96 89.00 3 337.84 11.83 11.59 10.62 NaN 89 89 89 NaN
3 2012-11-07 11:38:19 4 11.82 48.2 5.54 88.60 3 355.92 11.10 13.54 12.32 NaN 89 88 88 NaN
But now I have to expand each line of this DataFrame
.... Moist1 Moist2 Moist3 Moist4 TDR1 TDR2 TDR3 TDR4
1 .... 11.84 11.35 11.59 15.25 89 89 89 88
2 .... 12.08 11.83 NaN NaN 89 89 NaN NaN
into four lines (with three indexes No, FetchTime, and MeasureNo):
.... Moist TDR
No FetchTime MeasureNo
0 2012-11-07 11:38:17 1 .... 11.84 89 # from line 1, Moist1 and TDR1
1 2 .... 11.35 89 # from line 1, Moist2 and TDR2
2 3 .... 11.59 89 # from line 1, Moist3 and TDR3
3 4 .... 15.25 88 # from line 1, Moist4 and TDR4
4 2012-11-07 11:38:18 1 .... 12.08 89 # from line 2, Moist1 and TDR1
5 2 .... 11.83 89 # from line 2, Moist2 and TDR2
6 3 .... NaN NaN # from line 2, Moist3 and TDR3
7 4 .... NaN NaN # from line 2, Moist4 and TDR4
while preserving the other columns and, MOST importantly, preserving the order of the entries. I know I can iterate through each line with for row in df.iterrows(): ... but I read this is not very fast. My first approach was this:
In [54]: data = []
In [55]: for d in range(1,5):
....: temp = df.ix[:, ['FetchTime', 'MoistAve', 'MatTemp', 'TDRConduct', 'TDRAve', 'DeltaCount', 'tpAve', 'Moist%d' % d, 'TDR%d' % d]]
....: temp.columns = ['FetchTime', 'MoistAve', 'MatTemp', 'TDRConduct', 'TDRAve', 'DeltaCount', 'tpAve', 'RawMoist', 'RawTDR']
....: temp['MeasureNo'] = d
....: data.append(temp)
....:
In [56]: test = pd.concat(data, ignore_index=True)
In [62]: test.head()
Out[62]:
FetchTime MoistAve MatTemp TDRConduct TDRAve DeltaCount tpAve RawMoist RawTDR MeasureNo
0 2012-11-07 11:38:17 11.37 48.2 5.15 88.87 15 344.50 11.84 89 1
1 2012-11-07 11:38:18 11.44 48.2 5.13 88.88 2 346.22 12.08 89 1
2 2012-11-07 11:38:19 11.10 48.2 4.96 89.00 3 337.84 11.83 89 1
3 2012-11-07 11:38:19 11.82 48.2 5.54 88.60 3 355.92 11.10 89 1
4 2012-11-07 11:38:20 12.61 48.2 5.87 88.38 3 375.72 12.80 89 1
But I don't see a way to influence the concatenation to get the order I need ...
Is there another way to get the resulting DataFrame I need?
Here is a solution based on numpy's repeat and array indexing to build de-stacked values, and pandas' merge to output the concatenated result.
First load a sample of your data into a DataFrame (with slightly changed read_csv arguments).
from cStringIO import StringIO
data = """; No Time Date MoistAve MatTemp TDRConduct TDRAve DeltaCount tpAve Moist1 Moist2 Moist3 Moist4 TDR1 TDR2 TDR3 TDR4
1 11:38:17 11.07.2012 11.37 48.20 5.15 88.87 15 344.50 11.84 11.35 11.59 15.25 89.0 89.0 89.0 88.0
2 11:38:18 11.07.2012 11.44 48.20 5.13 88.88 2 346.22 12.08 11.83 -1.00 -1.00 89.0 89.0 -1.0 -1.0
3 11:38:19 11.07.2012 11.10 48.20 4.96 89.00 3 337.84 11.83 11.59 10.62 -1.00 89.0 89.0 89.0 -1.0
4 11:38:19 11.07.2012 11.82 48.20 5.54 88.60 3 355.92 11.10 13.54 12.32 -1.00 89.0 88.0 88.0 -1.0
"""
date_spec = {'FetchTime': [1, 2]}
df = pd.read_csv(StringIO(data), header=0, sep='\s\s+',parse_dates=date_spec, na_values=['-1.0', '-1.00'])
Then build a de-stacked vector of TDRs and merge it with the original data frame
stacked_col_names = ['TDR1', 'TDR2', 'TDR3', 'TDR4']
repeated_row_indexes = np.repeat(np.arange(df.shape[0]), 4)
# tile the column positions so they line up element-wise with the repeated rows
repeated_col_indexes = np.tile([np.where(df.columns == c)[0][0] for c in stacked_col_names], df.shape[0])
destacked_tdrs = pd.DataFrame(data=df.values[repeated_row_indexes, repeated_col_indexes],
                              index=df.index[repeated_row_indexes], columns=['TDR'])
output = pd.merge(left_index=True, right_index=True, left=df, right=destacked_tdrs)
With the desired output :
output.ix[:,['TDR1','TDR2','TDR3','TDR4','TDR']]
TDR1 TDR2 TDR3 TDR4 TDR
0 89 89 89 88 89
0 89 89 89 88 89
0 89 89 89 88 89
0 89 89 89 88 88
1 89 89 NaN NaN 89
1 89 89 NaN NaN 89
1 89 89 NaN NaN NaN
1 89 89 NaN NaN NaN
2 89 89 89 NaN 89
2 89 89 89 NaN 89
2 89 89 89 NaN 89
2 89 89 89 NaN NaN
3 89 88 88 NaN 89
3 89 88 88 NaN 88
3 89 88 88 NaN 88
3 89 88 88 NaN NaN
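The de-stacking hinges on numpy's element-wise fancy indexing: repeated row indices are paired with column indices of the same length (tiled once per original row), so each output row picks exactly one TDR. A minimal sketch on a tiny made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'TDR1': [89.0, 89.0], 'TDR2': [89.0, 88.0]})

rows = np.repeat(np.arange(df.shape[0]), 2)                    # [0 0 1 1]
cols = np.tile([df.columns.get_loc(c) for c in ['TDR1', 'TDR2']],
               df.shape[0])                                    # [0 1 0 1]
destacked = pd.DataFrame({'TDR': df.values[rows, cols]},
                         index=df.index[rows])
print(destacked['TDR'].tolist())  # [89.0, 89.0, 89.0, 88.0]
```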
This gives every fourth row in test starting at 'i':
test.ix[i::4]
Using the same basic loop as above, just append the set of every fourth row starting at 0 through 3 after you run your code above.
data = []
for i in range(0, 4):
    temp = test.ix[i::4]
    data.append(temp)
test2 = pd.concat(data, ignore_index=True)
Update:
I realize now that what you'd want isn't every fourth row but every mth row, so this would just be the loop suggestion above. Sorry.
Update 2:
Maybe not. We can take advantage of the fact that even though concatenate doesn't return the order you want, what it does return has a fixed mapping to what you do want. Here d is the number of rows per timestamp and m is the number of timestamps.
You seem to want the rows from test as follows:
[0,m,2m,3m,1,m+1,2m+1,3m+1,2,m+2,2m+2,3m+2,...,m-1,2m-1,3m-1,4m-1]
I'm sure there are much nicer ways to generate that list of indices, but this worked for me
d = 4
m = 10
small = (np.arange(0,m).reshape(m,1).repeat(d,1).T.reshape(-1,1))
shifter = (np.arange(0,d).repeat(m).reshape(-1,1).T * m)
NewIndex = (shifter.reshape(d,-1) + small.reshape(d,-1)).T.reshape(-1,1)
NewIndex = NewIndex.reshape(-1)
test = test.ix[NewIndex]
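For what it's worth, the same interleaved index can be generated with a single reshape and transpose (an equivalent, shorter construction):

```python
import numpy as np

d, m = 4, 10  # d measurements per timestamp, m timestamps

# arange(d*m).reshape(d, m) has row i equal to [i*m, ..., (i+1)*m - 1];
# transposing and flattening interleaves them as [0, m, 2m, 3m, 1, m+1, ...]
new_index = np.arange(d * m).reshape(d, m).T.reshape(-1)
print(new_index[:8].tolist())  # [0, 10, 20, 30, 1, 11, 21, 31]
```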