Hi, I want to use one of scikit-learn's functions for cross-validation. What I want is for the splitting of the folds to be determined by one of the indexes. For example, let's say I have this data with "Month" and "Day" being the indexes:
Month Day Feature_1
January 1 10
2 20
February 1 30
2 40
March 1 50
2 60
3 70
April 1 80
2 90
Let's say I want 1/4 of the data as the test set for each validation, and I want this fold separation to be done by the first index, which is the month. In that case the test set will be one of the months and the remaining 3 months will be the training set. As an example, one of the train/test splits will look like this:
TEST SET:
Month Day Feature_1
January 1 10
2 20
TRAINING SET:
Month Day Feature_1
February 1 30
2 40
March 1 50
2 60
3 70
April 1 80
2 90
How can I do this? Thank you.
This is called splitting by a group. Check out the user guide in scikit-learn here to understand more about it:
...
To measure this, we need to ensure that all the samples in the
validation fold come from groups that are not represented at all in
the paired training fold.
...
You can use GroupKFold or other strategies that have "Group" in the name. A sample would be:
# reset_index() moves each index level into its own column,
# so "Month" and "Day" become regular columns
df = df.reset_index()
print(df)
print(df)
      Month  Day  Feature_1
0   January    1         10
1   January    2         20
2  February    1         30
3  February    2         40
4     March    1         50
5     March    2         60
6     March    3         70
7     April    1         80
8     April    2         90
from sklearn.model_selection import GroupKFold

groups = df['Month']
gkf = GroupKFold(n_splits=3)
# X is your feature matrix and y your target, built from df as usual
for train, test in gkf.split(X, y, groups=groups):
    # "train" and "test" are positional (row-number) indices,
    # so you need "iloc" to get the actual rows
    print("%s %s" % (train, test))
    print(df.iloc[train, :])
    print(df.iloc[test, :])
Update: to pass this into the cross-validation helpers, just pass the month data to their groups parameter, like below:
from sklearn.model_selection import GroupKFold, cross_val_predict

gkf = GroupKFold(n_splits=3)
y_pred = cross_val_predict(estimator, X_train, y_train, cv=gkf, groups=df['Month'])
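For the exact "1/4 of the data per fold" requirement in the question, here is a minimal self-contained sketch (the frame is rebuilt from the question; with n_splits=4, each of the 4 months becomes the test fold exactly once):
import pandas as pd
from sklearn.model_selection import GroupKFold

# Rebuild the question's data as plain columns (i.e. after reset_index)
df = pd.DataFrame({
    'Month': ['January', 'January', 'February', 'February',
              'March', 'March', 'March', 'April', 'April'],
    'Day': [1, 2, 1, 2, 1, 2, 3, 1, 2],
    'Feature_1': [10, 20, 30, 40, 50, 60, 70, 80, 90],
})

X = df[['Day', 'Feature_1']]  # stand-in features; y is omitted for brevity
gkf = GroupKFold(n_splits=4)  # 4 months -> each fold holds out one month
for train, test in gkf.split(X, groups=df['Month']):
    print(df.iloc[test, :])   # the held-out month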
Use:
import numpy as np

indices = df.index.levels[0]
train_indices = np.random.choice(indices, size=int(len(indices) * 0.75), replace=False)
test_indices = np.setdiff1d(indices, train_indices)
train = df[np.in1d(df.index.get_level_values(0), train_indices)]
test = df[np.in1d(df.index.get_level_values(0), test_indices)]
Output
Train
Feature_1
Month Day
January 1 10
2 20
February 1 30
2 40
March 1 50
2 60
3 70
Test
Feature_1
Month Day
April 1 80
2 90
Explanation
indices = df.index.levels[0] takes all the unique values from the level-0 index: Index(['April', 'February', 'January', 'March'], dtype='object', name='Month')
train_indices = np.random.choice(indices, size=int(len(indices)*0.75), replace=False) samples 75% of the indices obtained in the previous step.
Next, we take the remaining indices as test_indices.
Finally, we split train and test accordingly.
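If you prefer to stay inside scikit-learn, GroupShuffleSplit does the same random group-wise split without the manual numpy bookkeeping; a short sketch, assuming the same MultiIndexed df:
from sklearn.model_selection import GroupShuffleSplit

gss = GroupShuffleSplit(n_splits=1, train_size=0.75, random_state=0)
months = df.index.get_level_values(0)
train_idx, test_idx = next(gss.split(df, groups=months))
train, test = df.iloc[train_idx], df.iloc[test_idx]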
Related
I have a dataframe like this:
year  count_yes  count_no
1900          5         7
1903          5         3
1915         14         6
1919          6        14
I want to have two bins, independent of the values themselves.
How can I group those categories and sum their values?
Expected result:
year  count_yes  count_no
1900         10        10
1910         20        20
Logic: group the first two rows (1900 and 1903) and the last two rows (1915 and 1919), and sum the values of each category.
I want to create a stacked percentage column chart, so 1900 would be 50/50% and 1910 would also be 50/50%.
I've already created the function to build this chart; I just need to adjust the dataframe into bins to get a better distribution and visualization.
This is a way to do what you need, if you are OK with using the decades as the index:
df['year'] = (df.year//10)*10
df_group = df.groupby('year').sum()
Output:
>>> df_group
count_yes count_no
year
1900 10 10
1910 20 20
You can bin the years with pandas.cut and aggregate with groupby+sum:
import pandas as pd

bins = list(range(1900, df['year'].max() + 10, 10))
group = pd.cut(df['year'], bins=bins, labels=bins[:-1], right=False)
df.drop('year', axis=1).groupby(group).sum().reset_index()
If you only want to specify the number of bins, compute group with:
group = pd.cut(df['year'], bins=2, right=False)
Output:
year count_yes count_no
0 1900 10 10
1 1910 20 20
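If the end goal is the stacked percentage chart mentioned in the question, a hedged follow-up on either result (df_group is the decade-summed frame from the first answer) could be:
# Normalize each decade's counts to row percentages for a stacked bar chart
pct = df_group.div(df_group.sum(axis=1), axis=0) * 100
pct.plot(kind='bar', stacked=True)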
I have a sample dataframe df below:
Step 1 Step 2 Step 3 Step 4
0 1/1 2/13 3/23 4/7
1 1/6 2/27 3/26 4/11
2 1/9 3/2 4/1 4/18
I would like to get the difference in days between each successive step, all at once, and create a new column for each difference, like so:
Step 1 Step 2 Step 3 Step 4 diff_btwn_1_2 diff_btwn_2_3 diff_btwn_3_4
0 1/1 2/13 3/23 4/7 43 38 15
1 1/6 2/27 3/26 4/11 52 27 16
2 1/9 3/2 4/1 4/18 52 30 17
Is there a way to do this efficiently in Python? I am running into some complications trying to loop through columns and dynamically name the variables based on the integer value associated with the step.
Convert your dates with to_datetime, which will give a mostly useless year value of 1900 (or assign your own year if you know what it should be). This then allows you to use diff to calculate the date difference between successive columns. Finally, join the results back.
To keep your initial formatted dates, I'll work with a copy of your DataFrame that has datetime values.
Also, since you don't have year information you may need to be careful with the days depending upon whether you expect it to be a leap year or not.
import pandas as pd
dfdates = df.apply(pd.to_datetime, format='%m/%d')
# Get difference in days, drop first `NaN` diff.
diffs = dfdates.diff(axis=1).apply(lambda x: x.dt.days).iloc[:, 1:]
# Rename columns
diffs.columns = 'diff_' + dfdates.columns[:-1] + '-' + diffs.columns
df = pd.concat([df, diffs], axis=1)
print(df)
Step 1 Step 2 Step 3 Step 4 diff_Step 1-Step 2 diff_Step 2-Step 3 diff_Step 3-Step 4
0 1/1 2/13 3/23 4/7 43 38 15
1 1/6 2/27 3/26 4/11 52 27 16
2 1/9 3/2 4/1 4/18 52 30 17
I figured out a solution. A couple of things: first, my data originally does have dates with the year; therefore, in my solution below, you will see I don't convert the fields to datetimes (I excluded the year in the original post to save space). Secondly, there are NaT values in my original dataset, and the solution posted by ALollz was causing unexpected behavior (but it works perfectly if there are no NaTs). I can't drop the NaT values from the original process_df because there would be almost no data in the later steps.
def create_days_diff_dataframe(data):
    '''Creates a days-difference column for every sequential milestone
    and adds it to a new dataframe'''
    days_diff_df = pd.DataFrame()
    for i in range(len(data.columns[:-1])):
        days_diff_btwn_milestones = (data.iloc[:, i + 1] - data.iloc[:, i]).dt.days
        days_diff_df['days_diff_' + data.columns[i] + ' - ' + data.columns[i + 1]] = days_diff_btwn_milestones
    return days_diff_df
days_diff_df = create_days_diff_dataframe(process_df)
Typically when we have a data frame we split it into train and test. For example, imagine my data frame is something like this:
> df.head()
Date y wind temperature
1 2019-10-03 00:00:00 33 12 15
2 2019-10-03 01:00:00 10 5 6
3 2019-10-03 02:00:00 39 6 5
4 2019-10-03 03:00:00 60 13 4
5 2019-10-03 04:00:00 21 3 7
I want to predict y based on the wind and temperature. We then do a split something like this:
df_train = df.loc[df.index <= split_date].copy()
df_test = df.loc[df.index > split_date].copy()

X1 = df_train[['wind', 'temperature']]
y1 = df_train['y']
X2 = df_test[['wind', 'temperature']]
y2 = df_test['y']

X_train, y_train = X1, y1
X_test, y_test = X2, y2

model.fit(X_train, y_train)
And we then predict on our test data. However, this uses the wind and temperature features in the test data frame. If I want to predict tomorrow's (unknown) y without knowing tomorrow's hourly temperature and wind, does the method no longer work? (For LSTM or XGBoost, for example)
The way you train your model, each row is considered an independent sample, regardless of the order, i.e. which values are observed earlier or later. If you have reason to believe that the chronological order is relevant to predicting y from wind speed and temperature, you will need to change your model.
You could try, e.g., adding another column with the values for wind speed and temperature one hour before (shift it by one row), or, if you believe that y might depend on the weekday, computing the weekday from the date and adding that as an input feature.
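A minimal sketch of those two ideas, assuming df is indexed by the Date column (as the split above implies) and has the wind and temperature columns from the question:
# Lag features: the previous hour's weather (the first row becomes NaN)
df['wind_lag1'] = df['wind'].shift(1)
df['temperature_lag1'] = df['temperature'].shift(1)

# Calendar feature: day of week derived from the datetime index
df['weekday'] = df.index.dayofweek

df = df.dropna()  # drop the row that has no lagged values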
I have a dataset with 84 monthly sales values (from 01/2013 to 12/2019) - just months, not days.
Month 01 | Sale 1
Month 02 | Sale 2
Month 03 | Sale 3
.... | ...
Month 84 | Sale 84
By visualization, it looks like the model fits very well... but I need to check it...
What I understood is that cross-validation does not support months, so what I did was convert it to use days (although there is no day info in my original df)...
I wanted to try my model with the first five years (60 months) and leave the two remaining years (24 months) to see how well the model is predicting...
So I did something like:
from fbprophet.diagnostics import cross_validation  # prophet.diagnostics in newer versions

cv_results = cross_validation(model=prophet, initial='1825 days', period='30 days', horizon='60 days')
Does this make sense?
I did not get the concept of cutoff dates and forecast periods.
I struggled with this for a while as well, but here is how it works. The initial model will be trained on the first 1,825 days of data. It will forecast the next 60 days of data (because horizon is set to 60). The model will then train on the initial period plus the period (1,825 + 30 days in this case) and forecast the next 60 days. It will continue like this, adding another 30 days to the training data and then forecasting the next 60, until there is no longer enough data to do this.
In summary, period is how much data to add to the training data set in every iteration of cross-validation, and horizon is how far out it will forecast.
Here is the setup:
The total number of data points is 700 days
Initial is 365 days
The period is 10 days
The horizon is 20 days
On the 1st iteration, it will train on days 1-365 and forecast days 366 to 385.
On the 2nd iteration, it will train on days 1-375 (the cutoff moves forward by the period, but training still starts from day 1) and forecast days 376 to 395, etc.
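A hedged sketch of that setup in code (m stands for an already-fitted Prophet model; the import path is fbprophet.diagnostics in older versions, prophet.diagnostics in newer ones):
from prophet.diagnostics import cross_validation

cv_results = cross_validation(
    model=m,
    initial='365 days',  # first training window: days 1-365
    period='10 days',    # each cutoff moves forward 10 days
    horizon='20 days',   # each fold forecasts the next 20 days
)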
I'm creating a binary classifier in Python 3.5
So I have a number of features (x1..xn) and target value y, just like this:
x1 x2 x3 y
Monday 10 12 1
Tuesday 18 20 0
Monday 12 22 1
Wednesday 19 19 0
Thursday 10 11 1
Thursday 10 12 1
Friday 19 12 0
Friday 18 21 0
Friday 12 10 1
So there is no problem for me doing the classifier (and all the needed steps such as data preprocessing, cross-validation, and evaluation).
My question: how do I estimate whether there is any significant variation of the y variable depending on the day of the week (Monday-Friday), column x1?
I know some techniques such as feature importance, but using them I can only understand which exact feature (x1, x2 or x3) is the most valuable for the predictor.
How can I understand the importance of a distinct value within a column (x1, days of the week) for the target variable?
Thanks!
The values from the x1 column could be transformed into columns with binary values ([0;1]) in them, i.e. one-hot encoded. Then the feature importance techniques could be applied; see the sketch after the link below.
http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html
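A minimal sketch of that idea, assuming a tree-based classifier and the column names from the question (df holds x1, x2, x3 and y):
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# One-hot encode the weekday column: x1 -> x1_Monday, x1_Tuesday, ...
X = pd.get_dummies(df[['x1', 'x2', 'x3']], columns=['x1'])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, df['y'])

# Each weekday dummy now has its own importance score
for name, importance in zip(X.columns, clf.feature_importances_):
    print(name, round(importance, 3))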