I have the following pandas DataFrame, called main_frame:
target_var input1 input2 input3 input4 input5 input6
Date
2013-09-01 13.0 NaN NaN NaN NaN NaN NaN
2013-10-01 13.0 NaN NaN NaN NaN NaN NaN
2013-11-01 12.2 NaN NaN NaN NaN NaN NaN
2013-12-01 10.9 NaN NaN NaN NaN NaN NaN
2014-01-01 11.7 0 13 42 0 0 16
2014-02-01 12.0 13 8 58 0 0 14
2014-03-01 12.8 13 15 100 0 0 24
2014-04-01 13.1 0 11 50 34 0 18
2014-05-01 12.2 12 14 56 30 71 18
2014-06-01 11.7 13 16 43 44 0 22
2014-07-01 11.2 0 19 45 35 0 18
2014-08-01 11.4 12 16 37 31 0 24
2014-09-01 10.9 14 14 47 30 56 20
2014-10-01 10.5 15 17 54 24 56 22
2014-11-01 10.7 12 18 60 41 63 21
2014-12-01 9.6 12 14 42 29 53 16
2015-01-01 10.2 10 16 37 31 0 20
2015-02-01 10.7 11 20 39 28 0 19
2015-03-01 10.9 10 17 75 27 87 22
2015-04-01 10.8 14 17 73 30 43 25
2015-05-01 10.2 10 17 55 31 52 24
I've been having trouble exploring the dataset with scikit-learn, and I'm not sure whether the problem is the pandas DataFrame itself, the dates as index, the NaNs/Infs/zeros (which I don't know how to handle), all of the above, or something else I haven't been able to track down.
I want to build a simple regression to predict the next target_var value based on the variables named input (1, 2, 3, ...).
Note that there are a lot of zeros and NaNs in the time series, and eventually we might find Infs as well.
You should first try removing any row with an Inf, -Inf, or NaN value (other approaches include filling in the NaNs with, for example, the mean value of the feature):
df = df.replace(to_replace=[np.inf, -np.inf], value=np.nan)
df = df.dropna()
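If you would rather keep the rows, the mean-fill alternative mentioned above can be sketched like this (on a tiny made-up frame, not your data):

```python
import numpy as np
import pandas as pd

# made-up miniature of the situation: NaNs in a feature column
df = pd.DataFrame({"target_var": [13.0, 11.7, 12.0],
                   "input1": [np.nan, 0.0, 13.0]})

# replace infinities with NaN, then fill each column's NaNs with that column's mean
df = df.replace([np.inf, -np.inf], np.nan)
df = df.fillna(df.mean())
```

Here `input1`'s NaN becomes 6.5, the mean of the remaining values 0 and 13.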
Now, create a NumPy matrix of your features and a vector of your targets. Given that your target variable is in the first column, you can use integer-based indexing as follows:
X = df.iloc[:, 1:].values
y = df.iloc[:, 0].values
Then create and fit your model:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X=X, y=y)
Now you can observe your estimates:
>>> model.intercept_
12.109583092421092
>>> model.coef_
array([-0.05269033, -0.17723251, 0.03627883, 0.02219596, -0.01377465,
0.0111017 ])
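To actually predict the next target_var, call model.predict on a new row of inputs. A self-contained sketch on toy data (not your frame), just to show the call:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# toy stand-in for the cleaned data: y = 2*x + 1 exactly
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([3.0, 5.0, 7.0])

model = LinearRegression().fit(X, y)

# predict for a new, unseen input; shape must be (n_samples, n_features)
next_y = model.predict(np.array([[4.0]]))
```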
I have this dataframe and I wish to replace all the commas with dots, so that, for example, 50,5 and 81,5 become 50.5 and 81.5.
Unnamed: 0 NB Ppt Resale 5 yrs 10 yrs 15 yrs 20 yrs
1 VLCC 120 114 87 64 50,5 37
3 SUEZMAX 81,5 80 62 45 36 24
5 LR 2 69 72 57 42 32 20
7 AFRAMAX 66 68 55 40,5 30,5 19
9 LR 1 58 58 40 28 21 13,5
11 MR2 44 44,5 38 29 21 13
Since the dtypes of all the columns are object, I tried
df_useful[['NB', 'Ppt Resale ', '5 yrs', '10 yrs', '15 yrs',
'20 yrs']] = df_useful[['NB', 'Ppt Resale ', '5 yrs', '10 yrs', '15 yrs',
'20 yrs']].apply(pd.to_numeric, errors='coerce')
but then the numbers containing a comma become NaN.
A simple way:
out = df.replace(',', '.', regex=True)
Output:
Unnamed: 0 NB Ppt Resale 5 yrs 10 yrs 15 yrs 20 yrs
1 VLCC 120 114 87 64 50.5 37
3 SUEZMAX 81.5 80 62 45 36 24
5 LR 2 69 72 57 42 32 20
7 AFRAMAX 66 68 55 40.5 30.5 19
9 LR 1 58 58 40 28 21 13.5
11 MR2 44 44.5 38 29 21 13
If your goal is to convert to numeric automatically, you can use:
df2 = (df
.drop(columns='Unnamed: 0')
.select_dtypes(exclude='number')
.apply(lambda s: pd.to_numeric(s.str.replace(',', '.'),
errors='coerce'))
)
df[list(df2)] = df2
Output:
Unnamed: 0 NB Ppt Resale 5 yrs 10 yrs 15 yrs 20 yrs
1 VLCC 120.0 114.0 87 64.0 50.5 37.0
3 SUEZMAX 81.5 80.0 62 45.0 36.0 24.0
5 LR 2 69.0 72.0 57 42.0 32.0 20.0
7 AFRAMAX 66.0 68.0 55 40.5 30.5 19.0
9 LR 1 58.0 58.0 40 28.0 21.0 13.5
11 MR2 44.0 44.5 38 29.0 21.0 13.0
dtypes:
print(df.dtypes)
Unnamed: 0 object
NB float64
Ppt Resale float64
5 yrs int64
10 yrs float64
15 yrs float64
20 yrs float64
dtype: object
Another possible solution, based on the following idea:
Convert the dataframe to CSV format and then read the CSV string back, using the decimal separator parameter of pd.read_csv to have decimal dots instead of decimal commas.
from io import StringIO
pd.read_csv(StringIO(df.to_csv()), decimal=',', index_col=0)
Output:
Unnamed: 0 NB Ppt Resale 5 yrs 10 yrs 15 yrs 20 yrs
1 VLCC 120.0 114.0 87 64.0 50.5 37.0
3 SUEZMAX 81.5 80.0 62 45.0 36.0 24.0
5 LR 2 69.0 72.0 57 42.0 32.0 20.0
7 AFRAMAX 66.0 68.0 55 40.5 30.5 19.0
9 LR 1 58.0 58.0 40 28.0 21.0 13.5
11 MR2 44.0 44.5 38 29.0 21.0 13.0
Coming from R, where I've mostly worked with the tidyverse, I wonder how pandas groupby and aggregation work. I have this code and the results are heartbreaking to me.
import pandas as pd
df = pd.read_csv('https://gist.githubusercontent.com/ZeccaLehn/4e06d2575eb9589dbe8c365d61cb056c/raw/64f1660f38ef523b2a1a13be77b002b98665cdfe/mtcars.csv')
df.rename(columns={'Unnamed: 0':'brand'}, inplace=True)
Now I would like to calculate the average displacement (disp) by cylinders, like that:
df['avg_disp'] = df.groupby('cyl').disp.mean()
Which results in something like:
cyl disp avg_disp
31 4 121.0 NaN
2 4 108.0 NaN
27 4 95.1 NaN
26 4 120.3 NaN
25 4 79.0 NaN
20 4 120.1 NaN
7 4 146.7 NaN
8 4 140.8 353.100000
19 4 71.1 NaN
18 4 75.7 NaN
17 4 78.7 NaN
29 6 145.0 NaN
0 6 160.0 NaN
1 6 160.0 NaN
3 6 258.0 NaN
10 6 167.6 NaN
9 6 167.6 NaN
5 6 225.0 NaN
13 8 275.8 NaN
28 8 351.0 NaN
4 8 360.0 105.136364
24 8 400.0 NaN
23 8 350.0 NaN
22 8 304.0 NaN
21 8 318.0 NaN
6 8 360.0 183.314286
11 8 275.8 NaN
16 8 440.0 NaN
30 8 301.0 NaN
14 8 472.0 NaN
12 8 275.8 NaN
15 8 460.0 NaN
After searching for a while, I discovered the transform function, which produces the correct avg_disp by assigning each group's mean to every row according to the grouping variable cyl.
My point is: why can't this be done simply with the mean function, instead of using .transform('mean') on the grouped data frame?
If you want to add the results back to the ungrouped dataframe, you can use .transform, which per the documentation will:
... and return a DataFrame having the same indexes as the original object filled with the transformed values.
df['avg_disp'] = df.groupby('cyl').disp.transform('mean')
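The reason the plain .mean() version produces NaNs: groupby(...).mean() returns one value per group, indexed by the group keys, so assigning it back aligns on the row index rather than on cyl. A minimal illustration with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({"cyl": [4, 4, 6, 6],
                   "disp": [100.0, 120.0, 160.0, 200.0]})

# one value per group, indexed by cyl (4 and 6) -- not by the original rows
agg = df.groupby("cyl").disp.mean()

# one value per original row, same index as df -- safe to assign back
per_row = df.groupby("cyl").disp.transform("mean")
```

Assigning `agg` to a new column matches its index labels (4 and 6) against the frame's row labels (0..3), which is why most rows come out NaN.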
I have two dataframes
dt AAPL AMC AMZN ASO ATH ... SPCE SRNE TH TSLA VIAC WKHS
0 2021-04-12 36 28 6 20 1 ... 5 0 0 50 23 0
1 2021-04-13 46 15 5 16 6 ... 5 0 0 122 12 1
2 2021-04-14 12 4 1 5 2 ... 2 0 0 39 1 0
3 2021-04-15 30 23 3 14 2 ... 15 0 0 101 9 0
dt AAPL AMC AMZN ASO ATH ... SPCE SRNE TH TSLA VIAC WKHS
0 2021-04-12 41 28 4 33 10 ... 5 0 0 56 14 3
1 2021-04-13 76 22 7 12 29 ... 4 0 0 134 8 2
2 2021-04-14 21 15 2 7 16 ... 2 0 0 61 3 0
3 2021-04-15 54 43 9 2 31 ... 16 0 0 83 13 1
I want to remove the values lower than 10 from both dataframes. If a value is deleted from one dataframe, the same cell should be removed in the other dataframe, and vice versa.
I'd appreciate your help.
Use a mask:
## pre-requisite
df1 = df1.set_index('dt')
df2 = df2.set_index('dt')
## processing
mask = df1.lt(10) | df2.lt(10)
df1 = df1.mask(mask)
df2 = df2.mask(mask)
output:
>>> df1
AAPL AMC AMZN ASO ATH SPCE SRNE TH TSLA VIAC WKHS
dt
2021-04-12 36 28.0 NaN 20.0 NaN NaN NaN NaN 50 23.0 NaN
2021-04-13 46 15.0 NaN 16.0 NaN NaN NaN NaN 122 NaN NaN
2021-04-14 12 NaN NaN NaN NaN NaN NaN NaN 39 NaN NaN
2021-04-15 30 23.0 NaN NaN NaN 15.0 NaN NaN 101 NaN NaN
>>> df2
AAPL AMC AMZN ASO ATH SPCE SRNE TH TSLA VIAC WKHS
dt
2021-04-12 41 28.0 NaN 33.0 NaN NaN NaN NaN 56 14.0 NaN
2021-04-13 76 22.0 NaN 12.0 NaN NaN NaN NaN 134 NaN NaN
2021-04-14 21 NaN NaN NaN NaN NaN NaN NaN 61 NaN NaN
2021-04-15 54 43.0 NaN NaN NaN 16.0 NaN NaN 83 NaN NaN
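On a minimal made-up pair of frames, you can see that the combined mask blanks the same cells in both, regardless of which frame held the small value:

```python
import pandas as pd

df1 = pd.DataFrame({"A": [36, 5], "B": [20, 16]})
df2 = pd.DataFrame({"A": [41, 12], "B": [9, 12]})

# True wherever EITHER frame is below 10
mask = df1.lt(10) | df2.lt(10)

out1 = df1.mask(mask)  # (1, "A") blanked because df1 had 5
out2 = df2.mask(mask)  # (0, "B") blanked in both because df2 had 9
```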
I want to move all values in the column val by 3 days ahead in the following dataframe:
datetime val val_b
12/20/2010 23
12/21/2010 12
12/22/2010 23 27
12/23/2010 26
12/24/2010 28
12/25/2010 17
12/26/2010 26
12/27/2010 21 14
12/28/2010 20
12/29/2010 18
12/30/2010 15 22
12/31/2010 20
1/1/2011 13
1/2/2011 12 30
1/3/2011 25
1/4/2011 15
1/5/2011 19
1/6/2011 14
I tried using pd.DateOffset, but that moves all the columns ahead, and I do not want that.
First create a DatetimeIndex, then use shift with the freq parameter:
df['datetime'] = pd.to_datetime(df['datetime'])
df = df.set_index('datetime')
df['val'] = df['val'].shift(3, freq='d')
print (df)
val val_b
datetime
2010-12-20 NaN 23
2010-12-21 NaN 12
2010-12-22 NaN 27
2010-12-23 NaN 26
2010-12-24 NaN 28
2010-12-25 23.0 17
2010-12-26 NaN 26
2010-12-27 NaN 14
2010-12-28 NaN 20
2010-12-29 NaN 18
2010-12-30 21.0 22
2010-12-31 NaN 20
2011-01-01 NaN 13
2011-01-02 15.0 30
2011-01-03 NaN 25
2011-01-04 NaN 15
2011-01-05 12.0 19
2011-01-06 NaN 14
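The key point is that shift(..., freq=...) moves the index timestamps rather than the positions; when the shifted column is assigned back, the values realign by date, which is why only every third row keeps a value in the output above. A minimal sketch:

```python
import pandas as pd

idx = pd.date_range("2010-12-20", periods=5, freq="D")
s = pd.Series([23, 12, 23, 26, 28], index=idx)

# the values stay attached to their labels; the labels move 3 days forward
shifted = s.shift(3, freq="D")
```

`shifted` now starts at 2010-12-23 with the value 23, so assigning it into the original frame fills only the dates that both indexes share.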
We have a dataframe A with 5 columns, and we want to add the rolling mean of each column; we could do:
A = pd.DataFrame(np.random.randint(100, size=(5, 5)))
for i in range(0, 5):
    A[i + 6] = A[i].rolling(3).mean()
If however 'A' has column named 'A', 'B'...'E':
A = pd.DataFrame(np.random.randint(100, size=(5, 5)), columns = ['A', 'B',
'C', 'D', 'E'])
How could we neatly add 5 columns with the rolling mean, and each name being ['A_mean', 'B_mean', ....'E_mean']?
Try this:
for col in A:
    A[col + '_mean'] = A[col].rolling(3).mean()
Output with your way:
0 1 2 3 4 6 7 8 9 10
0 16 53 9 16 67 NaN NaN NaN NaN NaN
1 55 37 93 92 21 NaN NaN NaN NaN NaN
2 10 5 93 99 27 27.0 31.666667 65.000000 69.000000 38.333333
3 94 32 81 91 34 53.0 24.666667 89.000000 94.000000 27.333333
4 37 46 20 18 10 47.0 27.666667 64.666667 69.333333 23.666667
and Output with mine:
A B C D E A_mean B_mean C_mean D_mean E_mean
0 16 53 9 16 67 NaN NaN NaN NaN NaN
1 55 37 93 92 21 NaN NaN NaN NaN NaN
2 10 5 93 99 27 27.0 31.666667 65.000000 69.000000 38.333333
3 94 32 81 91 34 53.0 24.666667 89.000000 94.000000 27.333333
4 37 46 20 18 10 47.0 27.666667 64.666667 69.333333 23.666667
Without loops:
pd.concat([A, A.apply(lambda x:x.rolling(3).mean()).rename(
columns={col: str(col) + '_mean' for col in A})], axis=1)
A B C D E A_mean B_mean C_mean D_mean E_mean
0 67 54 85 61 62 NaN NaN NaN NaN NaN
1 44 53 30 80 58 NaN NaN NaN NaN NaN
2 10 59 14 39 12 40.333333 55.333333 43.0 60.000000 44.000000
3 47 25 58 93 38 33.666667 45.666667 34.0 70.666667 36.000000
4 73 80 30 51 77 43.333333 54.666667 34.0 61.000000 42.333333
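A related loop-free variant calls rolling on the whole frame and renames with add_suffix, which gives the same columns with less ceremony:

```python
import numpy as np
import pandas as pd

A = pd.DataFrame(np.random.randint(100, size=(5, 5)),
                 columns=list("ABCDE"))

# rolling(3).mean() applies column-wise across the frame;
# add_suffix renames the result to A_mean ... E_mean before joining
out = A.join(A.rolling(3).mean().add_suffix("_mean"))
```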