I don't understand how to use the BaggingClassifier from sklearn.
Let's say I have a DataFrame of inputs and a DataFrame of targets (with Date as the index):
traininginputs:
Date A B
2015-01-02 5 1
2015-01-02 6 2
2015-01-02 4 3
2015-01-02 1 2
2015-01-02 3 2
2015-01-03 1 1
trainingtarget:
Date t
2015-01-02 1
2015-01-02 -1
2015-01-02 1
2015-01-02 1
2015-01-02 1
2015-01-03 -1
If I do the following:
from sklearn import svm
from sklearn.ensemble import BaggingClassifier

clf1 = svm.SVC(probability=True)
model = BaggingClassifier(base_estimator=clf1)
model.fit(traininginputs.values, trainingtarget.values)
model.predict(testinputs)
with
testinputs:
Date A B
2015-01-02 5 1
2015-01-02 6 2
2015-01-02 4 3
2015-01-02 1 2
2015-01-02 3 2
2015-01-03 1 1
Why doesn't it work?
I feel like I am missing something about how to use BaggingClassifier.
In fit(), y needs to be a 1d array, but trainingtarget.values returns a column vector of shape (n, 1). You can flatten it with ravel():
model.fit(traininginputs.values, trainingtarget.values.ravel())
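For completeness, a minimal runnable sketch of the fix, using made-up NumPy arrays in place of the question's dataframes (the values below are random stand-ins, not the real data):
import numpy as np
from sklearn import svm
from sklearn.ensemble import BaggingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))            # stand-in for traininginputs.values
y = rng.choice([-1, 1], size=(100, 1))   # stand-in for trainingtarget.values: an (n, 1) column vector

clf1 = svm.SVC(probability=True)
model = BaggingClassifier(base_estimator=clf1)  # the parameter is renamed to estimator= in scikit-learn >= 1.2
model.fit(X, y.ravel())                  # ravel() flattens (n, 1) into the 1d shape (n,) that fit() expects
print(model.predict(X))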
I have a DataFrame with a timestamp column:
from datetime import datetime
from pandas import DataFrame

d1 = DataFrame({'a': [datetime(2015,1,1,20,2,1), datetime(2015,1,1,20,14,58),
                      datetime(2015,1,1,20,17,5), datetime(2015,1,1,20,31,5),
                      datetime(2015,1,1,20,34,28), datetime(2015,1,1,20,37,51),
                      datetime(2015,1,1,20,41,19), datetime(2015,1,1,20,49,4),
                      datetime(2015,1,1,20,59,21)],
                'b': [2, 4, 26, 22, 45, 3, 8, 121, 34]})
a b
0 2015-01-01 20:02:01 2
1 2015-01-01 20:14:58 4
2 2015-01-01 20:17:05 26
3 2015-01-01 20:31:05 22
4 2015-01-01 20:34:28 45
5 2015-01-01 20:37:51 3
6 2015-01-01 20:41:19 8
7 2015-01-01 20:49:04 121
8 2015-01-01 20:59:21 34
I can group by 15-minute intervals with these operations:
d2=d1.set_index('a')
d3=d2.groupby(pd.TimeGrouper('15Min'))
The number of rows per group is given by
d3.size()
a
2015-01-01 20:00:00 2
2015-01-01 20:15:00 1
2015-01-01 20:30:00 4
2015-01-01 20:45:00 2
I want my original DataFrame to gain a column holding, for each row, the ordinal number of the 15-minute group that the row belongs to. For example, the first group
2015-01-01 20:00:00
has 2 rows, so the first two rows of the new column in d1 should have the number 1;
the second group
2015-01-01 20:15:00
has 1 row, so the third row of the new column in d1 should have the number 2;
the third group
2015-01-01 20:30:00
has 4 rows, so the fourth, fifth, sixth, and seventh rows of the new column in d1 should have the number 3.
I want my new DataFrame to look like this
a b c
0 2015-01-01 20:02:01 2 1
1 2015-01-01 20:14:58 4 1
2 2015-01-01 20:17:05 26 2
3 2015-01-01 20:31:05 22 3
4 2015-01-01 20:34:28 45 3
5 2015-01-01 20:37:51 3 3
6 2015-01-01 20:41:19 8 3
7 2015-01-01 20:49:04 121 4
8 2015-01-01 20:59:21 34 4
Use .transform() on your groupby object with an itertools.count iterator:
from datetime import datetime
from itertools import count
import pandas as pd
d1 = pd.DataFrame({'a': [datetime(2015,1,1,20,2,1), datetime(2015,1,1,20,14,58),
datetime(2015,1,1,20,17,5), datetime(2015,1,1,20,31,5),
datetime(2015,1,1,20,34,28), datetime(2015,1,1,20,37,51),
datetime(2015,1,1,20,41,19), datetime(2015,1,1,20,49,4),
datetime(2015,1,1,20,59,21)],
'b': [2, 4, 26, 22, 45, 3, 8, 121, 34]})
d2 = d1.set_index('a')
counter = count(1)  # yields 1, 2, 3, ... one value per group
d2['c'] = (d2.groupby(pd.Grouper(freq='15Min'))['b']  # pd.Grouper replaces the removed pd.TimeGrouper
           .transform(lambda x: next(counter)))
print(d2)
Output:
b c
a
2015-01-01 20:02:01 2 1
2015-01-01 20:14:58 4 1
2015-01-01 20:17:05 26 2
2015-01-01 20:31:05 22 3
2015-01-01 20:34:28 45 3
2015-01-01 20:37:51 3 3
2015-01-01 20:41:19 8 3
2015-01-01 20:49:04 121 4
2015-01-01 20:59:21 34 4
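As an aside, on pandas 0.20.2 and later the same numbering can be obtained without a manual counter via GroupBy.ngroup(), which labels each row with its group's ordinal number in key order; a short sketch on the same d2 frame:
# ngroup() is 0-based, so add 1 to match the desired 1, 2, 3, ... labels
d2['c'] = d2.groupby(pd.Grouper(freq='15Min')).ngroup() + 1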
With this DataFrame:
import pandas as pd
df = pd.DataFrame([[1, 1], [1, 2], [1, 3], [1, 5], [1, 7], [1, 9]],
                  index=pd.date_range('2015-01-01', periods=6),
                  columns=['a', 'b'])
i.e.
a b
2015-01-01 1 1
2015-01-02 1 2
2015-01-03 1 3
2015-01-04 1 5
2015-01-05 1 7
2015-01-06 1 9
using df = df.groupby(df.b // 4).last() makes the datetime index disappear. Why?
a b
b
0 1 3
1 1 7
2 1 9
Expected result instead:
a b
2015-01-03 1 3
2015-01-05 1 7
2015-01-06 1 9
With groupby, the index of the result always comes from the grouping values. In your case, you can reset_index before grouping and set_index back afterwards:
df['c'] = df.b // 4
result = df.reset_index().groupby('c').last().set_index('index')
In [349]: result
Out[349]:
a b
index
2015-01-03 1 3
2015-01-05 1 7
2015-01-06 1 9
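As an alternative sketch: if you only ever need the last row of each group, GroupBy.tail(1) returns those rows with the original datetime index intact, avoiding the reset_index round-trip:
result = df.groupby(df.b // 4).tail(1)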
I have the following data frame in pandas. I want to generate a sub data frame for a Name whenever certain values appear in its Activity column. For example, I want a data frame with all the rows for Name A if the Activity column for A contains the value 3 or 5.
Name Date Activity
A 01-02-2015 1
A 01-03-2015 2
A 01-04-2015 3
A 01-04-2015 1
B 01-02-2015 1
B 01-02-2015 2
B 01-03-2015 1
B 01-04-2015 5
C 01-31-2015 1
C 01-31-2015 2
C 01-31-2015 2
So for the above data, I want to get
df_A as
Name Date Activity
A 01-02-2015 1
A 01-03-2015 2
A 01-04-2015 3
A 01-04-2015 1
df_B as
B 01-02-2015 1
B 01-02-2015 2
B 01-03-2015 1
B 01-04-2015 5
Since Name C does not have 3 or 5 in the column Activity, I do not want to get this data frame.
Also, the names in the data frame can vary with each input file.
Once I have these data frames separated, I want to plot a time series.
You can group the dataframe by column Name, apply a custom function f, and then select dataframes df_A and df_B:
print(df)
Name Date Activity
0 A 2015-01-02 1
1 A 2015-01-03 2
2 A 2015-01-04 3
3 A 2015-01-04 1
4 B 2015-01-02 1
5 B 2015-01-02 2
6 B 2015-01-03 1
7 B 2015-01-04 5
8 C 2015-01-31 1
9 C 2015-01-31 2
10 C 2015-01-31 2
def f(df):
    # keep the whole group if any Activity value is 3 or 5
    if ((df['Activity'] == 3) | (df['Activity'] == 5)).any():
        return df

g = df.groupby('Name').apply(f).reset_index(drop=True)
df_A = g.loc[g.Name == 'A']
print(df_A)
Name Date Activity
0 A 2015-01-02 1
1 A 2015-01-03 2
2 A 2015-01-04 3
3 A 2015-01-04 1
df_B = g.loc[g.Name == 'B']
print(df_B)
Name Date Activity
4 B 2015-01-02 1
5 B 2015-01-02 2
6 B 2015-01-03 1
7 B 2015-01-04 5
df_A.plot()
df_B.plot()
In the end you can plot each frame with plot(), as shown above.
EDIT:
If you want to create the dataframes dynamically, you can find all unique values of column Name with drop_duplicates:
for name in g.Name.drop_duplicates():
    print(g.loc[g.Name == name])
Name Date Activity
0 A 2015-01-02 1
1 A 2015-01-03 2
2 A 2015-01-04 3
3 A 2015-01-04 1
Name Date Activity
4 B 2015-01-02 1
5 B 2015-01-02 2
6 B 2015-01-03 1
7 B 2015-01-04 5
You can use a dictionary comprehension to create a sub dataframe for each Name with an Activity value of 3 or 5.
active_names = df[df.Activity.isin([3, 5])].Name.unique().tolist()
dfs = {name: df.loc[df.Name == name, :] for name in active_names}
>>> dfs['A']
Name Date Activity
0 A 01-02-2015 1
1 A 01-03-2015 2
2 A 01-04-2015 3
3 A 01-04-2015 1
>>> dfs['B']
Name Date Activity
4 B 01-02-2015 1
5 B 01-02-2015 2
6 B 01-03-2015 1
7 B 01-04-2015 5
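Since the end goal is a time series plot per name, here is one hedged sketch of that last step, assuming the Date strings should be parsed as datetimes and matplotlib is installed:
import matplotlib.pyplot as plt

for name, sub in dfs.items():
    sub = sub.assign(Date=pd.to_datetime(sub['Date']))  # parse the MM-DD-YYYY strings
    sub.set_index('Date')['Activity'].plot(label=name)  # one line per name

plt.legend()
plt.show()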
I have a simple dataframe with an id column and a date column (constructed below).
I would like to use groupby to group by id, find the earliest date within each group, subtract it from every date in that group, and then column-bind the differences back to the dataframe as a new days-since-earliest column.
The groupby is straightforward,
grouped = DF.groupby('id')
and finding the earliest date is straightforward,
mindates = grouped['date'].min()
But I'm not sure how to proceed. How do I apply the date subtraction operation, then combine?
Thanks for reading this far.
My dataframe is:
import pandas as pd
from pandas import DataFrame

dates = pd.to_datetime(['2015-01-01', '2015-02-01', '2015-03-01', '2015-04-01',
                        '2015-05-01', '2015-01-01', '2015-01-02', '2015-01-03',
                        '2015-01-04', '2015-01-05'])
DF = DataFrame({'id': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2], 'date': dates})
cols = ['id', 'date']
DF = DF[cols]
EDIT:
Both answers below are awesome. I wish I could accept them both.
You can use apply like this:
earliest_by_id = DF.groupby('id')['date'].min()

def since_earliest(row):
    # difference between this row's date and the earliest date for its id
    return row.date - earliest_by_id[row.id]

DF['days_since_earliest'] = DF.apply(since_earliest, axis=1)
print(DF)
id date days_since_earliest
0 1 2015-01-01 0 days
1 1 2015-02-01 31 days
2 1 2015-03-01 59 days
3 1 2015-04-01 90 days
4 1 2015-05-01 120 days
5 2 2015-01-01 0 days
6 2 2015-01-02 1 days
7 2 2015-01-03 2 days
8 2 2015-01-04 3 days
9 2 2015-01-05 4 days
edit: to get integer days instead of Timedelta objects, you can cast the result (or use the .dt.days accessor, as in the answer below):
DF['days_since_earliest'] = DF.apply(since_earliest, axis=1).astype('timedelta64[D]')
print(DF)
id date days_since_earliest
0 1 2015-01-01 0
1 1 2015-02-01 31
2 1 2015-03-01 59
3 1 2015-04-01 90
4 1 2015-05-01 120
5 2 2015-01-01 0
6 2 2015-01-02 1
7 2 2015-01-03 2
8 2 2015-01-04 3
9 2 2015-01-05 4
FWIW, using transform can often be simpler (and usually faster) than apply. transform takes the results of a groupby operation and broadcasts it up to the original index:
>>> df["dse"] = df["date"] - df.groupby("id")["date"].transform(min)
>>> df
id date dse
0 1 2015-01-01 0 days
1 1 2015-02-01 31 days
2 1 2015-03-01 59 days
3 1 2015-04-01 90 days
4 1 2015-05-01 120 days
5 2 2015-01-01 0 days
6 2 2015-01-02 1 days
7 2 2015-01-03 2 days
8 2 2015-01-04 3 days
9 2 2015-01-05 4 days
If you'd prefer integer days instead of timedelta objects, you can use the dt.days accessor:
>>> df["dse"] = df["dse"].dt.days
>>> df
id date dse
0 1 2015-01-01 0
1 1 2015-02-01 31
2 1 2015-03-01 59
3 1 2015-04-01 90
4 1 2015-05-01 120
5 2 2015-01-01 0
6 2 2015-01-02 1
7 2 2015-01-03 2
8 2 2015-01-04 3
9 2 2015-01-05 4
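The two steps can also be combined into a single chained expression with the same result:
df["dse"] = (df["date"] - df.groupby("id")["date"].transform("min")).dt.days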