How to replace the comma in numbers in a dataframe by a dot? - python

I have this dataframe in which I want to replace every decimal comma with a dot, so that, for example, 50,5 and 81,5 become 50.5 and 81.5.
Unnamed: 0 NB Ppt Resale 5 yrs 10 yrs 15 yrs 20 yrs
1 VLCC 120 114 87 64 50,5 37
3 SUEZMAX 81,5 80 62 45 36 24
5 LR 2 69 72 57 42 32 20
7 AFRAMAX 66 68 55 40,5 30,5 19
9 LR 1 58 58 40 28 21 13,5
11 MR2 44 44,5 38 29 21 13
As dtypes for all the columns are object, I tried
df_useful[['NB', 'Ppt Resale ', '5 yrs', '10 yrs', '15 yrs',
'20 yrs']] = df_useful[['NB', 'Ppt Resale ', '5 yrs', '10 yrs', '15 yrs',
'20 yrs']].apply(pd.to_numeric, errors='coerce')
but then the numbers containing a comma become NaN.

A simple way:
out = df.replace(',', '.', regex=True)
Output:
Unnamed: 0 NB Ppt Resale 5 yrs 10 yrs 15 yrs 20 yrs
1 VLCC 120 114 87 64 50.5 37
3 SUEZMAX 81.5 80 62 45 36 24
5 LR 2 69 72 57 42 32 20
7 AFRAMAX 66 68 55 40.5 30.5 19
9 LR 1 58 58 40 28 21 13.5
11 MR2 44 44.5 38 29 21 13
If your goal is to convert to numeric automatically, you can use:
df2 = (df
       .drop(columns='Unnamed: 0')
       .select_dtypes(exclude='number')
       .apply(lambda s: pd.to_numeric(s.str.replace(',', '.'),
                                      errors='coerce'))
      )
df[list(df2)] = df2
Output:
Unnamed: 0 NB Ppt Resale 5 yrs 10 yrs 15 yrs 20 yrs
1 VLCC 120.0 114.0 87 64.0 50.5 37.0
3 SUEZMAX 81.5 80.0 62 45.0 36.0 24.0
5 LR 2 69.0 72.0 57 42.0 32.0 20.0
7 AFRAMAX 66.0 68.0 55 40.5 30.5 19.0
9 LR 1 58.0 58.0 40 28.0 21.0 13.5
11 MR2 44.0 44.5 38 29.0 21.0 13.0
dtypes:
print(df.dtypes)
Unnamed: 0 object
NB float64
Ppt Resale float64
5 yrs int64
10 yrs float64
15 yrs float64
20 yrs float64
dtype: object
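For readers who want to try the conversion end to end, here is a minimal, self-contained sketch on three rows taken from the question. The explicit `value_cols` list is an assumption made for illustration; it is not part of either snippet above.

```python
import pandas as pd

# Three rows from the question, stored as strings with decimal commas.
df = pd.DataFrame(
    {
        "Unnamed: 0": ["VLCC", "SUEZMAX", "LR 2"],
        "NB": ["120", "81,5", "69"],
        "15 yrs": ["50,5", "36", "32"],
    }
)

# Hypothetical list of the columns that should become numeric.
value_cols = ["NB", "15 yrs"]

for col in value_cols:
    # Replace the literal comma, then parse; unparseable cells become NaN.
    df[col] = pd.to_numeric(df[col].str.replace(",", ".", regex=False), errors="coerce")

print(df.dtypes)
```

Passing `regex=False` treats the comma as a literal character, and `errors='coerce'` turns anything that still fails to parse into NaN instead of raising.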

Another possible solution, based on the following idea:
Convert the dataframe to CSV format and then read the CSV string back, using the decimal separator parameter of pd.read_csv to have decimal dots instead of decimal commas.
from io import StringIO
pd.read_csv(StringIO(df.to_csv()), decimal=',', index_col=0)
Output:
Unnamed: 0 NB Ppt Resale 5 yrs 10 yrs 15 yrs 20 yrs
1 VLCC 120.0 114.0 87 64.0 50.5 37.0
3 SUEZMAX 81.5 80.0 62 45.0 36.0 24.0
5 LR 2 69.0 72.0 57 42.0 32.0 20.0
7 AFRAMAX 66.0 68.0 55 40.5 30.5 19.0
9 LR 1 58.0 58.0 40 28.0 21.0 13.5
11 MR2 44.0 44.5 38 29.0 21.0 13.0
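If the data originally comes from a text file, the same `decimal` parameter can probably be applied when the file is first read, which avoids the round trip. The sketch below assumes a semicolon-delimited source; the file content and the `name` column are made up for illustration.

```python
from io import StringIO

import pandas as pd

# Made-up file content: semicolon as field separator, comma as decimal mark.
raw = "name;NB;15 yrs\nVLCC;120;50,5\nSUEZMAX;81,5;36\n"

# decimal=',' tells the parser to read 81,5 as 81.5.
df = pd.read_csv(StringIO(raw), sep=";", decimal=",")
print(df.dtypes)
```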

Related

New column based on last time row value equals some numbers in Pandas dataframe

I have a dataframe sorted in descending order date that records the Rank of students in class and the predicted score.
Date Student_ID Rank Predicted_Score
4/7/2021 33 2 87
13/6/2021 33 4 88
31/3/2021 33 7 88
28/2/2021 33 2 86
14/2/2021 33 10 86
31/1/2021 33 8 86
23/12/2020 33 1 81
8/11/2020 33 3 80
21/10/2020 33 3 80
23/9/2020 33 4 80
20/5/2020 33 3 80
29/4/2020 33 4 80
15/4/2020 33 2 79
26/2/2020 33 3 79
12/2/2020 33 5 79
29/1/2020 33 1 70
I want to create a column called Recent_Predicted_Score that records the most recent earlier Predicted_Score from a date on which the student actually ranked in the top 3. So the desired outcome looks like
Date Student_ID Rank Predicted_Score Recent_Predicted_Score
4/7/2021 33 2 87 86
13/6/2021 33 4 88 86
31/3/2021 33 7 88 86
28/2/2021 33 2 86 81
14/2/2021 33 10 86 81
31/1/2021 33 8 86 81
23/12/2020 33 1 81 80
8/11/2020 33 3 80 80
21/10/2020 33 3 80 80
23/9/2020 33 4 80 80
20/5/2020 33 3 80 79
29/4/2020 33 4 80 79
15/4/2020 33 2 79 79
26/2/2020 33 3 79 70
12/2/2020 33 5 79 70
29/1/2020 33 1 70
Here's what I have tried but it doesn't quite work, not sure if I am on the right track:
df.sort_values(by = ['Student_ID', 'Date'], ascending = [True, False], inplace = True)
lp1 = df['Predicted_Score'].where(df['Rank'].isin([1,2,3])).groupby(df['Student_ID']).bfill()
lp2 = df.groupby(['Student_ID', 'Rank'])['Predicted_Score'].shift(-1)
df = df.assign(Recent_Predicted_Score=lp1.mask(df['Rank'].isin([1,2,3]), lp2))
Thanks in advance.
Try:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df = df.sort_values(['Student_ID', 'Date'])
df['Recent_Predicted_Score'] = np.where(df['Rank'].isin([1, 2, 3]), df['Predicted_Score'], np.nan)
df['Recent_Predicted_Score'] = df.groupby('Student_ID', group_keys=False)['Recent_Predicted_Score'].apply(lambda x: x.ffill().shift().fillna(''))
df = df.sort_values(['Student_ID', 'Date'], ascending = [True, False])
print(df)
Prints:
Date Student_ID Rank Predicted_Score Recent_Predicted_Score
0 2021-07-04 33 2 87 86.0
1 2021-06-13 33 4 88 86.0
2 2021-03-31 33 7 88 86.0
3 2021-02-28 33 2 86 81.0
4 2021-02-14 33 10 86 81.0
5 2021-01-31 33 8 86 81.0
6 2020-12-23 33 1 81 80.0
7 2020-11-08 33 3 80 80.0
8 2020-10-21 33 3 80 80.0
9 2020-09-23 33 4 80 80.0
10 2020-05-20 33 3 80 79.0
11 2020-04-29 33 4 80 79.0
12 2020-04-15 33 2 79 79.0
13 2020-02-26 33 3 79 70.0
14 2020-02-12 33 5 79 70.0
15 2020-01-29 33 1 70
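A self-contained sketch of this chronological approach on the seven most recent rows of the question. It is a variation rather than the exact code above: it leaves missing values as NaN instead of filling them with an empty string, and it relies on index alignment to write the result back without re-sorting the frame.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Date": ["4/7/2021", "13/6/2021", "31/3/2021", "28/2/2021",
                 "14/2/2021", "31/1/2021", "23/12/2020"],
        "Student_ID": [33] * 7,
        "Rank": [2, 4, 7, 2, 10, 8, 1],
        "Predicted_Score": [87, 88, 88, 86, 86, 86, 81],
    }
)
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)

# Work on a chronologically sorted view; the original row order of df is kept.
chrono = df.sort_values(["Student_ID", "Date"])

# Keep the score only on dates where the student ranked in the top 3.
top3 = chrono["Predicted_Score"].where(chrono["Rank"] <= 3)

# Carry the last top-3 score forward, then shift so a row never sees its own score.
df["Recent_Predicted_Score"] = top3.groupby(chrono["Student_ID"]).transform(
    lambda s: s.ffill().shift()
)
print(df)
```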
Mask the scores where Rank is greater than 3, then group the masked column by Student_ID, shift it up by one row and backward-fill to propagate the last top-3 predicted score:
c = 'Recent_Predicted_Score'
df[c] = df['Predicted_Score'].mask(df['Rank'].gt(3))
df[c] = df.groupby('Student_ID', group_keys=False)[c].apply(lambda s: s.shift(-1).bfill())
Result
Date Student_ID Rank Predicted_Score Recent_Predicted_Score
0 4/7/2021 33 2 87 86.0
1 13/6/2021 33 4 88 86.0
2 31/3/2021 33 7 88 86.0
3 28/2/2021 33 2 86 81.0
4 14/2/2021 33 10 86 81.0
5 31/1/2021 33 8 86 81.0
6 23/12/2020 33 1 81 80.0
7 8/11/2020 33 3 80 80.0
8 21/10/2020 33 3 80 80.0
9 23/9/2020 33 4 80 80.0
10 20/5/2020 33 3 80 79.0
11 29/4/2020 33 4 80 79.0
12 15/4/2020 33 2 79 79.0
13 26/2/2020 33 3 79 70.0
14 12/2/2020 33 5 79 70.0
15 29/1/2020 33 1 70 NaN
Note: Make sure your dataframe is sorted on Date in descending order.
Let's assume:
there may be more than one unique Student_ID
the rows are ordered by descending Date as indicated by OP, but may not be ordered by Student_ID
we want to preserve the index of the original dataframe
Subject to these assumptions, here's a way to do what your question asks:
df['Recent_Predicted_Score'] = df.loc[df.Rank <= 3, 'Predicted_Score']
df['Recent_Predicted_Score'] = (
    df.groupby('Student_ID', sort=False)['Recent_Predicted_Score']
      .transform(lambda s: s.shift(-1).bfill())
)
Explanation:
create a new column Recent_Predicted_Score containing Predicted_Score where Rank is in the top 3 and NaN otherwise
use groupby() on Student_ID with the sort argument set to False for better performance (note that groupby() preserves the order of rows within each group, so the existing descending order by Date is not affected)
within each group, do shift(-1) and bfill() to get the desired result for Recent_Predicted_Score.
Sample input (with two distinct Student_ID values):
Date Student_ID Rank Predicted_Score
0 2021-07-04 33 2 87
1 2021-07-04 66 2 87
2 2021-06-13 33 4 88
3 2021-06-13 66 4 88
4 2021-03-31 33 7 88
5 2021-03-31 66 7 88
6 2021-02-28 33 2 86
7 2021-02-28 66 2 86
8 2021-02-14 33 10 86
9 2021-02-14 66 10 86
10 2021-01-31 33 8 86
11 2021-01-31 66 8 86
12 2020-12-23 33 1 81
13 2020-12-23 66 1 81
14 2020-11-08 33 3 80
15 2020-11-08 66 3 80
16 2020-10-21 33 3 80
17 2020-10-21 66 3 80
18 2020-09-23 33 4 80
19 2020-09-23 66 4 80
20 2020-05-20 33 3 80
21 2020-05-20 66 3 80
22 2020-04-29 33 4 80
23 2020-04-29 66 4 80
24 2020-04-15 33 2 79
25 2020-04-15 66 2 79
26 2020-02-26 33 3 79
27 2020-02-26 66 3 79
28 2020-02-12 33 5 79
29 2020-02-12 66 5 79
30 2020-01-29 33 1 70
31 2020-01-29 66 1 70
Output:
Date Student_ID Rank Predicted_Score Recent_Predicted_Score
0 2021-07-04 33 2 87 86.0
1 2021-07-04 66 2 87 86.0
2 2021-06-13 33 4 88 86.0
3 2021-06-13 66 4 88 86.0
4 2021-03-31 33 7 88 86.0
5 2021-03-31 66 7 88 86.0
6 2021-02-28 33 2 86 81.0
7 2021-02-28 66 2 86 81.0
8 2021-02-14 33 10 86 81.0
9 2021-02-14 66 10 86 81.0
10 2021-01-31 33 8 86 81.0
11 2021-01-31 66 8 86 81.0
12 2020-12-23 33 1 81 80.0
13 2020-12-23 66 1 81 80.0
14 2020-11-08 33 3 80 80.0
15 2020-11-08 66 3 80 80.0
16 2020-10-21 33 3 80 80.0
17 2020-10-21 66 3 80 80.0
18 2020-09-23 33 4 80 80.0
19 2020-09-23 66 4 80 80.0
20 2020-05-20 33 3 80 79.0
21 2020-05-20 66 3 80 79.0
22 2020-04-29 33 4 80 79.0
23 2020-04-29 66 4 80 79.0
24 2020-04-15 33 2 79 79.0
25 2020-04-15 66 2 79 79.0
26 2020-02-26 33 3 79 70.0
27 2020-02-26 66 3 79 70.0
28 2020-02-12 33 5 79 70.0
29 2020-02-12 66 5 79 70.0
30 2020-01-29 33 1 70 NaN
31 2020-01-29 66 1 70 NaN
Output sorted by Student_ID, Date for easier inspection:
Date Student_ID Rank Predicted_Score Recent_Predicted_Score
0 2021-07-04 33 2 87 86.0
2 2021-06-13 33 4 88 86.0
4 2021-03-31 33 7 88 86.0
6 2021-02-28 33 2 86 81.0
8 2021-02-14 33 10 86 81.0
10 2021-01-31 33 8 86 81.0
12 2020-12-23 33 1 81 80.0
14 2020-11-08 33 3 80 80.0
16 2020-10-21 33 3 80 80.0
18 2020-09-23 33 4 80 80.0
20 2020-05-20 33 3 80 79.0
22 2020-04-29 33 4 80 79.0
24 2020-04-15 33 2 79 79.0
26 2020-02-26 33 3 79 70.0
28 2020-02-12 33 5 79 70.0
30 2020-01-29 33 1 70 NaN
1 2021-07-04 66 2 87 86.0
3 2021-06-13 66 4 88 86.0
5 2021-03-31 66 7 88 86.0
7 2021-02-28 66 2 86 81.0
9 2021-02-14 66 10 86 81.0
11 2021-01-31 66 8 86 81.0
13 2020-12-23 66 1 81 80.0
15 2020-11-08 66 3 80 80.0
17 2020-10-21 66 3 80 80.0
19 2020-09-23 66 4 80 80.0
21 2020-05-20 66 3 80 79.0
23 2020-04-29 66 4 80 79.0
25 2020-04-15 66 2 79 79.0
27 2020-02-26 66 3 79 70.0
29 2020-02-12 66 5 79 70.0
31 2020-01-29 66 1 70 NaN
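To check the interleaved case on something small, here is a sketch with two students whose rows alternate. The scores are made up, and the rows are assumed to be in descending date order already, so no Date column is needed.

```python
import pandas as pd

# Rows alternate between students 33 and 66, newest first (made-up values).
df = pd.DataFrame(
    {
        "Student_ID": [33, 66, 33, 66, 33, 66],
        "Rank": [4, 2, 2, 5, 1, 3],
        "Predicted_Score": [88, 90, 86, 91, 81, 75],
    }
)

# Score where the student ranked in the top 3, NaN elsewhere.
top3 = df["Predicted_Score"].where(df["Rank"] <= 3)

# Per student: look one row further back in time, then back-fill the gaps.
df["Recent_Predicted_Score"] = top3.groupby(df["Student_ID"], sort=False).transform(
    lambda s: s.shift(-1).bfill()
)
print(df)
```

Because `transform` returns a result aligned to the original index, the rows stay interleaved and each student's values are computed independently.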

Substituting values with conditions

I have a dataframe like this one below
Air Station Code Humidity Temperature Latitude Longitude
St.1 20 10 10.00 10.00
St.2 4 15 25.00 30.00
St.3 16 21 8.00 15.00
St.4 38 8 31.00 40.00
St.5 10 18 10.00 10.00
St.6 40 4 25.00 30.00
St.7 10 13 8.00 15.00
St.8 46 11 31.00 40.00
St.9 28 9 10.00 10.00
St.10 14 22 25.00 30.00
St.11 5 40 8.00 15.00
St.12 11 10 31.00 40.00
...
St.89 61 35 10.00 10.00
St.90 23 29 25.00 30.00
St.91 35 12 8.00 15.00
St.92 31 7 31.00 40.00
I want to change the station codes by matching the coordinates, substituting the codes so that the first 4 codes repeat, obtaining this
Air Station Code Humidity Temperature Latitude Longitude
St.1 20 10 10.00 10.00
St.2 4 15 25.00 30.00
St.3 16 21 8.00 15.00
St.4 38 8 31.00 40.00
St.1 10 18 10.00 10.00
St.2 40 4 25.00 30.00
St.3 10 13 8.00 15.00
St.4 46 11 31.00 40.00
St.1 28 9 10.00 10.00
St.2 14 22 25.00 30.00
St.3 5 40 8.00 15.00
St.4 11 10 31.00 40.00
...
St.1 61 35 10.00 10.00
St.2 23 29 25.00 30.00
St.3 35 12 8.00 15.00
St.4 31 7 31.00 40.00
Is there some way to implement an "if/else" substitution on the whole dataframe without going manually over every observation in python?
df['Air Station Code'] = 'St.' + pd.Series(df[['Latitude','Longitude']].astype(str).agg(sum, axis=1).factorize()[0] + 1).astype(str)
df
Out[77]:
Air Station Code Humidity Temperature Latitude Longitude
0 St.1 20 10 10.0 10.0
1 St.2 4 15 25.0 30.0
2 St.3 16 21 8.0 15.0
3 St.4 38 8 31.0 40.0
4 St.1 10 18 10.0 10.0
5 St.2 40 4 25.0 30.0
6 St.3 10 13 8.0 15.0
7 St.4 46 11 31.0 40.0
8 St.1 28 9 10.0 10.0
9 St.2 14 22 25.0 30.0
10 St.3 5 40 8.0 15.0
11 St.4 11 10 31.0 40.0
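A possible variation on the one-liner above, sketched under the assumption that numbering the coordinate pairs in order of first appearance is what is wanted: `groupby(...).ngroup()` numbers the pairs directly, so the two coordinates never have to be concatenated as strings (concatenation without a separator could, in principle, make distinct pairs look identical).

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Air Station Code": ["St.1", "St.2", "St.3", "St.5", "St.6", "St.7"],
        "Latitude": [10.0, 25.0, 8.0, 10.0, 25.0, 8.0],
        "Longitude": [10.0, 30.0, 15.0, 10.0, 30.0, 15.0],
    }
)

# sort=False numbers the coordinate pairs in order of first appearance (0, 1, 2, ...).
group_no = df.groupby(["Latitude", "Longitude"], sort=False).ngroup() + 1
df["Air Station Code"] = "St." + group_no.astype(str)
print(df)
```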
There may be a better way to do this, but this solves the problem. I create a second dataframe without the duplicates, which keeps the first occurrence of each lat/long. I make lat/long the index and drop the other columns. I then do a "join", adding a new column with the matching lat/long. I then overwrite the original station code with the looked up one.
import pandas as pd
data = [
["St.1", 20, 10, 10.00, 10.00],
["St.2", 4, 15, 25.00, 30.00],
["St.3", 16, 21, 8.00, 15.00],
["St.4", 38, 8, 31.00, 40.00],
["St.5", 10, 18, 10.00, 10.00],
["St.6", 40, 4, 25.00, 30.00],
["St.7", 10, 13, 8.00, 15.00],
["St.8", 46, 11, 31.00, 40.00],
["St.9", 28, 9, 10.00, 10.00],
["St.10", 14, 22, 25.00, 30.00],
["St.11", 5, 40, 8.00, 15.00],
["St.12", 11, 10, 31.00, 40.00],
["St.89", 61, 35, 10.00, 10.00],
["St.90", 23, 29, 25.00, 30.00],
["St.91", 35, 12, 8.00, 15.00],
["St.92", 31, 7, 31.00, 40.00],
]
column = "Air_Station_Code Humidity Temperature Latitude Longitude".split()
df = pd.DataFrame(data,columns=column)
print(df)
df1 = df.drop_duplicates(['Latitude','Longitude'])
df1 = df1[['Air_Station_Code','Latitude','Longitude']]
df1.set_index(['Latitude','Longitude'], inplace=True)
print(df1)
df2 = df.join( df1, on=['Latitude','Longitude'], rsuffix='R' )
print(df2)
df['Air_Station_Code'] = df2['Air_Station_CodeR']
print(df)
Output:
Air_Station_Code Humidity Temperature Latitude Longitude
0 St.1 20 10 10.0 10.0
1 St.2 4 15 25.0 30.0
2 St.3 16 21 8.0 15.0
3 St.4 38 8 31.0 40.0
4 St.5 10 18 10.0 10.0
5 St.6 40 4 25.0 30.0
6 St.7 10 13 8.0 15.0
7 St.8 46 11 31.0 40.0
8 St.9 28 9 10.0 10.0
9 St.10 14 22 25.0 30.0
10 St.11 5 40 8.0 15.0
11 St.12 11 10 31.0 40.0
12 St.89 61 35 10.0 10.0
13 St.90 23 29 25.0 30.0
14 St.91 35 12 8.0 15.0
15 St.92 31 7 31.0 40.0
Air_Station_Code
Latitude Longitude
10.0 10.0 St.1
25.0 30.0 St.2
8.0 15.0 St.3
31.0 40.0 St.4
Air_Station_Code Humidity ... Longitude Air_Station_CodeR
0 St.1 20 ... 10.0 St.1
1 St.2 4 ... 30.0 St.2
2 St.3 16 ... 15.0 St.3
3 St.4 38 ... 40.0 St.4
4 St.5 10 ... 10.0 St.1
5 St.6 40 ... 30.0 St.2
6 St.7 10 ... 15.0 St.3
7 St.8 46 ... 40.0 St.4
8 St.9 28 ... 10.0 St.1
9 St.10 14 ... 30.0 St.2
10 St.11 5 ... 15.0 St.3
11 St.12 11 ... 40.0 St.4
12 St.89 61 ... 10.0 St.1
13 St.90 23 ... 30.0 St.2
14 St.91 35 ... 15.0 St.3
15 St.92 31 ... 40.0 St.4
[16 rows x 6 columns]
Air_Station_Code Humidity Temperature Latitude Longitude
0 St.1 20 10 10.0 10.0
1 St.2 4 15 25.0 30.0
2 St.3 16 21 8.0 15.0
3 St.4 38 8 31.0 40.0
4 St.1 10 18 10.0 10.0
5 St.2 40 4 25.0 30.0
6 St.3 10 13 8.0 15.0
7 St.4 46 11 31.0 40.0
8 St.1 28 9 10.0 10.0
9 St.2 14 22 25.0 30.0
10 St.3 5 40 8.0 15.0
11 St.4 11 10 31.0 40.0
12 St.1 61 35 10.0 10.0
13 St.2 23 29 25.0 30.0
14 St.3 35 12 8.0 15.0
15 St.4 31 7 31.0 40.0
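The same lookup can likely be written more compactly with a grouped `transform`, which gives every row the first station code seen at its coordinates. This is a sketch of an alternative, not the join-based method above; the five-row frame is a cut-down version of the example data.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Air_Station_Code": ["St.1", "St.2", "St.5", "St.6", "St.9"],
        "Latitude": [10.0, 25.0, 10.0, 25.0, 10.0],
        "Longitude": [10.0, 30.0, 10.0, 30.0, 10.0],
    }
)

# For each (Latitude, Longitude) pair, broadcast the first code to all its rows.
df["Air_Station_Code"] = (
    df.groupby(["Latitude", "Longitude"])["Air_Station_Code"].transform("first")
)
print(df)
```

Unlike renumbering, this keeps whatever label the first station at each location had, even if the labels are not consecutive.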

Reading multiple DataFrames from a given input

I have a couple of data frames given this way :
38 47 7 20 35
45 76 63 96 24
98 53 2 87 80
83 86 92 48 1
73 60 26 94 6

80 50 29 53 92
66 90 79 98 46
40 21 58 38 60
35 13 72 28 6
48 76 51 96 12

79 80 24 37 51
86 70 1 22 71
52 69 10 83 13
12 40 3 0 30
46 50 48 76 5
Could you please tell me how it is possible to add them to a list of dataframes?
Thanks a lot!
First read all values into one DataFrame, keeping the blank lines so that they become rows of missing values that act as separators:
df = pd.read_csv(file, header=None, skip_blank_lines=False)
print (df)
0 1 2 3 4
0 38.0 47.0 7.0 20.0 35.0
1 45.0 76.0 63.0 96.0 24.0
2 98.0 53.0 2.0 87.0 80.0
3 83.0 86.0 92.0 48.0 1.0
4 73.0 60.0 26.0 94.0 6.0
5 NaN NaN NaN NaN NaN
6 80.0 50.0 29.0 53.0 92.0
7 66.0 90.0 79.0 98.0 46.0
8 40.0 21.0 58.0 38.0 60.0
9 35.0 13.0 72.0 28.0 6.0
10 48.0 76.0 51.0 96.0 12.0
11 NaN NaN NaN NaN NaN
12 79.0 80.0 24.0 37.0 51.0
13 86.0 70.0 1.0 22.0 71.0
14 52.0 69.0 10.0 83.0 13.0
15 12.0 40.0 3.0 0.0 30.0
16 46.0 50.0 48.0 76.0 5.0
Then build the list of smaller DataFrames with a list comprehension, starting a new group at each all-NaN row and dropping that separator row:
dfs = [g.dropna().astype(int).reset_index(drop=True)
       for _, g in df.groupby(df[0].isna().cumsum())]
print (dfs[1])
0 1 2 3 4
0 80 50 29 53 92
1 66 90 79 98 46
2 40 21 58 38 60
3 35 13 72 28 6
4 48 76 51 96 12
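A runnable sketch of the whole procedure, assuming the input is comma-separated text with blank lines between the blocks (the separator in the real file may differ, in which case `sep` would need adjusting). The numbers are a shortened version of the question's data.

```python
from io import StringIO

import pandas as pd

# Assumed layout: comma-separated values, one blank line between blocks.
text = "38,47\n45,76\n\n80,50\n66,90\n\n79,80\n86,70"

# Keep blank lines so that each one becomes an all-NaN separator row.
raw = pd.read_csv(StringIO(text), header=None, skip_blank_lines=False)

dfs = []
# The running count of separator rows labels each block.
for _, g in raw.groupby(raw[0].isna().cumsum()):
    block = g.dropna()  # drop the separator row itself
    if not block.empty:  # skip empty blocks, e.g. from consecutive blank lines
        dfs.append(block.astype(int).reset_index(drop=True))

print(len(dfs))
print(dfs[1])
```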

Appending or Adding Rows in Pandas Dataframe

In the following DataFrame I would like to add rows if the count of values in the column A is less than 10.
For example, in the following table group 60 in column A appears 12 times, whereas group 61 appears only 9 times. I would like to add rows after the last record of group 61 and copy the values in columns B, C and D from the corresponding rows of group 60, and do the same for group 62 and so on.
A B C D
0 60 0.235 4 7.86
1 60 1.235 5 8.86
2 60 2.235 6 9.86
3 60 3.235 7 10.86
4 60 4.235 8 11.86
5 60 5.235 9 12.86
6 60 6.235 10 13.86
7 60 7.235 11 14.86
8 60 8.235 12 15.86
9 60 9.235 13 16.86
10 60 10.235 14 17.86
11 60 11.235 15 18.86
12 61 12.235 16 19.86
13 61 13.235 17 20.86
14 61 14.235 18 21.86
15 61 15.235 19 22.86
16 61 16.235 20 23.86
17 61 17.235 21 24.86
18 61 18.235 22 25.86
19 61 19.235 23 26.86
20 61 20.235 24 27.86
21 62 20.235 24 28.86
22 62 20.235 24 29.86
23 62 20.235 24 30.86
24 62 20.235 24 31.86
25 62 20.235 24 32.86
You can use:
# cumulative count per group
df['G'] = df.groupby('A').cumcount()
df = (df.groupby(['A','G'])
        .first()                          # aggregate first
        .unstack()                        # reshape DataFrame
        .ffill()                          # same as fillna(method='ffill')
        .stack()                          # get original shape
        .reset_index(drop=True, level=1)  # remove level G in index
        .reset_index())
print (df)
A B C D
0 60 0.235 4.0 7.86
1 60 1.235 5.0 8.86
2 60 2.235 6.0 9.86
3 60 3.235 7.0 10.86
4 60 4.235 8.0 11.86
5 60 5.235 9.0 12.86
6 60 6.235 10.0 13.86
7 60 7.235 11.0 14.86
8 60 8.235 12.0 15.86
9 60 9.235 13.0 16.86
10 60 10.235 14.0 17.86
11 60 11.235 15.0 18.86
12 61 12.235 16.0 19.86
13 61 13.235 17.0 20.86
14 61 14.235 18.0 21.86
15 61 15.235 19.0 22.86
16 61 16.235 20.0 23.86
17 61 17.235 21.0 24.86
18 61 18.235 22.0 25.86
19 61 19.235 23.0 26.86
20 61 20.235 24.0 27.86
21 61 9.235 13.0 16.86
22 61 10.235 14.0 17.86
23 61 11.235 15.0 18.86
24 62 20.235 24.0 28.86
25 62 20.235 24.0 29.86
26 62 20.235 24.0 30.86
27 62 20.235 24.0 31.86
28 62 20.235 24.0 32.86
29 62 17.235 21.0 24.86
30 62 18.235 22.0 25.86
31 62 19.235 23.0 26.86
32 62 20.235 24.0 27.86
33 62 9.235 13.0 16.86
34 62 10.235 14.0 17.86
35 62 11.235 15.0 18.86
Another solution with pivot_table:
df['G'] = df.groupby('A').cumcount()
df = (df.pivot_table(index='A', columns='G')
        .ffill()
        .stack()
        .reset_index(drop=True, level=1)
        .reset_index())
print (df)
A B C D
0 60 0.235 4.0 7.86
1 60 1.235 5.0 8.86
2 60 2.235 6.0 9.86
3 60 3.235 7.0 10.86
4 60 4.235 8.0 11.86
5 60 5.235 9.0 12.86
6 60 6.235 10.0 13.86
7 60 7.235 11.0 14.86
8 60 8.235 12.0 15.86
9 60 9.235 13.0 16.86
10 60 10.235 14.0 17.86
11 60 11.235 15.0 18.86
12 61 12.235 16.0 19.86
13 61 13.235 17.0 20.86
14 61 14.235 18.0 21.86
15 61 15.235 19.0 22.86
16 61 16.235 20.0 23.86
17 61 17.235 21.0 24.86
18 61 18.235 22.0 25.86
19 61 19.235 23.0 26.86
20 61 20.235 24.0 27.86
21 61 9.235 13.0 16.86
22 61 10.235 14.0 17.86
23 61 11.235 15.0 18.86
24 62 20.235 24.0 28.86
25 62 20.235 24.0 29.86
26 62 20.235 24.0 30.86
27 62 20.235 24.0 31.86
28 62 20.235 24.0 32.86
29 62 17.235 21.0 24.86
30 62 18.235 22.0 25.86
31 62 19.235 23.0 26.86
32 62 20.235 24.0 27.86
33 62 9.235 13.0 16.86
34 62 10.235 14.0 17.86
35 62 11.235 15.0 18.86
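To see the reshaping steps on something small, here is a sketch with two groups and a single value column. The data is made up; the intermediate name `wide` is only there so the padded layout can be inspected.

```python
import pandas as pd

# Group 1 has three rows, group 2 only two (made-up values).
df = pd.DataFrame({"A": [1, 1, 1, 2, 2], "B": [10.0, 11.0, 12.0, 20.0, 21.0]})

# Position of each row inside its group: 0, 1, 2, ...
df["G"] = df.groupby("A").cumcount()

# One row per group, one column per position; missing positions are NaN.
wide = df.groupby(["A", "G"]).first().unstack()

padded = (
    wide.ffill()                          # copy missing positions from the group above
        .stack()                          # back to one row per (A, G)
        .reset_index(level=1, drop=True)  # drop the helper level G
        .reset_index()
)
print(padded)
```

Group 2 ends up with a third row whose value is taken from the third row of group 1. Note that the forward fill only works downwards, so a short first group would keep its gaps.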

Python/Scikit-learn/regressions - from pandas Dataframes to Scikit prediction

I have the following pandas DataFrame, called main_frame:
target_var input1 input2 input3 input4 input5 input6
Date
2013-09-01 13.0 NaN NaN NaN NaN NaN NaN
2013-10-01 13.0 NaN NaN NaN NaN NaN NaN
2013-11-01 12.2 NaN NaN NaN NaN NaN NaN
2013-12-01 10.9 NaN NaN NaN NaN NaN NaN
2014-01-01 11.7 0 13 42 0 0 16
2014-02-01 12.0 13 8 58 0 0 14
2014-03-01 12.8 13 15 100 0 0 24
2014-04-01 13.1 0 11 50 34 0 18
2014-05-01 12.2 12 14 56 30 71 18
2014-06-01 11.7 13 16 43 44 0 22
2014-07-01 11.2 0 19 45 35 0 18
2014-08-01 11.4 12 16 37 31 0 24
2014-09-01 10.9 14 14 47 30 56 20
2014-10-01 10.5 15 17 54 24 56 22
2014-11-01 10.7 12 18 60 41 63 21
2014-12-01 9.6 12 14 42 29 53 16
2015-01-01 10.2 10 16 37 31 0 20
2015-02-01 10.7 11 20 39 28 0 19
2015-03-01 10.9 10 17 75 27 87 22
2015-04-01 10.8 14 17 73 30 43 25
2015-05-01 10.2 10 17 55 31 52 24
I've been having trouble exploring this dataset with Scikit-learn, and I'm not sure whether the problem is the pandas DataFrame, the dates as index, the NaNs/Infs/zeros (which I don't know how to handle), all of these, or something else I wasn't able to track down.
I want to build a simple regression to predict the next target_var item based on the variables named "Input" (1,2,3..).
Note that there are a lot of zeros and NaN's in the time series, and eventually we might find Inf's as well.
You should first try to remove any row containing Inf, -Inf or NaN values (other methods include filling in the NaNs with, for example, the mean value of the feature).
import numpy as np
df = main_frame.replace(to_replace=[np.inf, -np.inf], value=np.nan)
df = df.dropna()
Now, create a numpy matrix of your features and a vector of your targets. Given that your target variable is in the first column, you can use integer-based indexing as follows:
X = df.iloc[:, 1:].values
y = df.iloc[:, 0].values
Then create and fit your model:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X=X, y=y)
Now you can observe your estimates:
>>> model.intercept_
12.109583092421092
>>> model.coef_
array([-0.05269033, -0.17723251, 0.03627883, 0.02219596, -0.01377465,
0.0111017 ])
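The question asks for a prediction of the next target_var item, whereas the fit above pairs each row's inputs with the target of the same row. One way to set up the "next item" version, sketched here on a tiny made-up frame with only two input columns, is to shift the target up by one row before cleaning. The column name `next_target` is an assumption introduced for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame in the same layout as main_frame (values are illustrative only).
main_frame = pd.DataFrame(
    {
        "target_var": [13.0, 12.2, 11.7, 12.0, 12.8, 13.1],
        "input1": [np.nan, np.nan, 0, 13, np.inf, 0],
        "input2": [np.nan, np.nan, 13, 8, 15, 11],
    },
    index=pd.to_datetime(
        ["2013-10-01", "2013-11-01", "2014-01-01",
         "2014-02-01", "2014-03-01", "2014-04-01"]
    ),
)

df = main_frame.copy()
# Pair each row's inputs with the target of the following row.
df["next_target"] = df["target_var"].shift(-1)
# Treat infinities as missing, then drop every incomplete row.
df = df.replace([np.inf, -np.inf], np.nan).dropna()

X = df[["input1", "input2"]].to_numpy()
y = df["next_target"].to_numpy()
print(X.shape, y)
```

The resulting `X` and `y` can then be passed to `model.fit` in the same way as above. The last row is always dropped because it has no following target.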
