How to use BaggingClassifier - python

I don't understand how to use the BaggingClassifier from sklearn.
Let's say I have a dataframe of inputs of shape (10,5) and a dataframe of targets of shape (10,1):
traininginputs:
Date A B
2015-01-02 5 1
2015-01-02 6 2
2015-01-02 4 3
2015-01-02 1 2
2015-01-02 3 2
2015-01-03 1 1
trainingtarget:
Date t
2015-01-02 1
2015-01-02 -1
2015-01-02 1
2015-01-02 1
2015-01-02 1
2015-01-03 -1
If I do the following:
from sklearn import svm
from sklearn.ensemble import BaggingClassifier
clf1 = svm.SVC(probability=True)
model = BaggingClassifier(base_estimator=clf1)
model.fit(traininginputs.values, trainingtarget.values)
model.predict(testinputs)
with
testinputs:
Date A B
2015-01-02 5 1
2015-01-02 6 2
2015-01-02 4 3
2015-01-02 1 2
2015-01-02 3 2
2015-01-03 1 1
Why doesn't it work?
I feel like I am missing something about how to use BaggingClassifier.

In fit(), y needs to be a 1d vector, but trainingtarget.values returns a column vector of shape (n_samples, 1). You can flatten it using ravel():
model.fit(traininginputs.values, trainingtarget.values.ravel())
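A minimal end-to-end sketch of the ravel() fix follows. The data here is synthetic and purely illustrative (two well-separated clusters of 20 points each, larger than the question's sample so that every bootstrap sample is practically certain to contain both classes). The estimator is passed positionally because the keyword name for it differs between scikit-learn versions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the question's data (an assumption, not the asker's frame).
rng = np.random.default_rng(0)
features = np.vstack([
    rng.normal(loc=2.0, size=(20, 2)),   # class  1
    rng.normal(loc=-2.0, size=(20, 2)),  # class -1
])
traininginputs = pd.DataFrame(features, columns=["A", "B"])
trainingtarget = pd.DataFrame({"t": [1] * 20 + [-1] * 20})

# A one-column DataFrame yields a 2-D column vector; ravel() flattens it to 1-D.
print(trainingtarget.values.shape)          # (40, 1)
print(trainingtarget.values.ravel().shape)  # (40,)

# Estimator passed positionally to stay independent of the keyword name.
model = BaggingClassifier(SVC(probability=True), n_estimators=5, random_state=0)
model.fit(traininginputs.values, trainingtarget.values.ravel())
predictions = model.predict(traininginputs.values)
print(predictions.shape)                    # (40,)
```

Note that the Date column is left out of the feature matrix here; only numeric columns are passed to fit() and predict().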

Related

Pandas pivoting/stacking/reshaping

I'm trying to import data to a pandas DataFrame with columns being date string, label, value. My data looks like the following (just with 4 dates and 5 labels)
from numpy import random
import numpy as np
import pandas as pd
# Creating the data
dates = ("2015-01-01", "2015-01-02", "2015-01-03", "2015-01-04")
values = [random.rand(5) for _ in range(4)]
data = dict(zip(dates,values))
So, the data is a dictionary where the keys are dates and the values are arrays of numbers whose position (index) is the label.
Loading this data structure into a DataFrame
df1 = pd.DataFrame(data)
gives me the dates as columns, the label as index, and the value as the value.
An alternative loading would be
df2 = pd.DataFrame.from_dict(data, orient='index')
where the dates are index, and columns are labels.
In neither case do I manage to pivot or stack the data into my preferred view.
How should I approach the pivoting/stacking to get the view I want? Or should I change my data structure before loading it into a DataFrame? In particular, I'd like to avoid having to create all the rows of the table beforehand with a bunch of calls to zip.
IIUC:
Option 1
pd.DataFrame.stack
pd.DataFrame(data).stack() \
.rename('value').rename_axis(['label', 'date']).reset_index()
label date value
0 0 2015-01-01 0.345109
1 0 2015-01-02 0.815948
2 0 2015-01-03 0.758709
3 0 2015-01-04 0.461838
4 1 2015-01-01 0.584527
5 1 2015-01-02 0.823529
6 1 2015-01-03 0.714700
7 1 2015-01-04 0.160735
8 2 2015-01-01 0.779006
9 2 2015-01-02 0.721576
10 2 2015-01-03 0.246975
11 2 2015-01-04 0.270491
12 3 2015-01-01 0.465495
13 3 2015-01-02 0.622024
14 3 2015-01-03 0.227865
15 3 2015-01-04 0.638772
16 4 2015-01-01 0.266322
17 4 2015-01-02 0.575298
18 4 2015-01-03 0.335095
19 4 2015-01-04 0.761181
Option 2
comprehension
pd.DataFrame(
[[i, d, v] for d, l in data.items() for i, v in enumerate(l)],
columns=['label', 'date', 'value']
)
label date value
0 0 2015-01-01 0.345109
1 1 2015-01-01 0.584527
2 2 2015-01-01 0.779006
3 3 2015-01-01 0.465495
4 4 2015-01-01 0.266322
5 0 2015-01-02 0.815948
6 1 2015-01-02 0.823529
7 2 2015-01-02 0.721576
8 3 2015-01-02 0.622024
9 4 2015-01-02 0.575298
10 0 2015-01-03 0.758709
11 1 2015-01-03 0.714700
12 2 2015-01-03 0.246975
13 3 2015-01-03 0.227865
14 4 2015-01-03 0.335095
15 0 2015-01-04 0.461838
16 1 2015-01-04 0.160735
17 2 2015-01-04 0.270491
18 3 2015-01-04 0.638772
19 4 2015-01-04 0.761181
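A third possibility, sketched here under the assumption that the date-indexed layout from from_dict(orient='index') is the starting point, is to reshape with melt. The seeded generator is only there to make the example reproducible.

```python
import numpy as np
import pandas as pd

# Same structure as in the question: date string -> array of 5 values.
rng = np.random.default_rng(0)
dates = ("2015-01-01", "2015-01-02", "2015-01-03", "2015-01-04")
data = {date: rng.random(5) for date in dates}

# Dates become the index, labels 0..4 become the columns.
wide = pd.DataFrame.from_dict(data, orient="index")

# Move the dates into a column, then unpivot the label columns into rows.
long = (
    wide.rename_axis("date")
        .reset_index()
        .melt(id_vars="date", var_name="label", value_name="value")
)
print(long.head())
```

The result has one row per (date, label) pair with the columns date, label and value.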

groupby DataFrame with new column representing the group

I have a DataFrame with a timestamp column
d1=DataFrame({'a':[datetime(2015,1,1,20,2,1),datetime(2015,1,1,20,14,58),
datetime(2015,1,1,20,17,5),datetime(2015,1,1,20,31,5),
datetime(2015,1,1,20,34,28),datetime(2015,1,1,20,37,51),datetime(2015,1,1,20,41,19),
datetime(2015,1,1,20,49,4),datetime(2015,1,1,20,59,21)], 'b':[2,4,26,22,45,3,8,121,34]})
a b
0 2015-01-01 20:02:01 2
1 2015-01-01 20:14:58 4
2 2015-01-01 20:17:05 26
3 2015-01-01 20:31:05 22
4 2015-01-01 20:34:28 45
5 2015-01-01 20:37:51 3
6 2015-01-01 20:41:19 8
7 2015-01-01 20:49:04 121
8 2015-01-01 20:59:21 34
I can group by 15 minute intervals by doing these operations
d2=d1.set_index('a')
d3=d2.groupby(pd.Grouper(freq='15Min'))
The number of rows by group is found by
d3.size()
a
2015-01-01 20:00:00 2
2015-01-01 20:15:00 1
2015-01-01 20:30:00 4
2015-01-01 20:45:00 2
I want my original DataFrame to have a column holding the sequence number of the group that each row belongs to. For example, the first group
2015-01-01 20:00:00
has 2 rows, so the first two rows of my new column in d1 should have the number 1.
The second group
2015-01-01 20:15:00
has 1 row, so the third row of my new column in d1 should have the number 2.
The third group
2015-01-01 20:30:00
has 4 rows, so the fourth, fifth, sixth, and seventh rows of my new column in d1 should have the number 3.
I want my new DataFrame to look like this
a b c
0 2015-01-01 20:02:01 2 1
1 2015-01-01 20:14:58 4 1
2 2015-01-01 20:17:05 26 2
3 2015-01-01 20:31:05 22 3
4 2015-01-01 20:34:28 45 3
5 2015-01-01 20:37:51 3 3
6 2015-01-01 20:41:19 8 3
7 2015-01-01 20:49:04 121 4
8 2015-01-01 20:59:21 34 4
Use .transform() on your groupby object with an itertools.count iterator:
from datetime import datetime
from itertools import count
import pandas as pd
d1 = pd.DataFrame({'a': [datetime(2015,1,1,20,2,1), datetime(2015,1,1,20,14,58),
                         datetime(2015,1,1,20,17,5), datetime(2015,1,1,20,31,5),
                         datetime(2015,1,1,20,34,28), datetime(2015,1,1,20,37,51),
                         datetime(2015,1,1,20,41,19), datetime(2015,1,1,20,49,4),
                         datetime(2015,1,1,20,59,21)],
                   'b': [2, 4, 26, 22, 45, 3, 8, 121, 34]})
d2 = d1.set_index('a')
counter = count(1)
# pd.Grouper(freq=...) replaces pd.TimeGrouper, which newer pandas versions no longer provide
d2['c'] = (d2.groupby(pd.Grouper(freq='15Min'))['b']
             .transform(lambda x: next(counter)))
print(d2)
Output:
b c
a
2015-01-01 20:02:01 2 1
2015-01-01 20:14:58 4 1
2015-01-01 20:17:05 26 2
2015-01-01 20:31:05 22 3
2015-01-01 20:34:28 45 3
2015-01-01 20:37:51 3 3
2015-01-01 20:41:19 8 3
2015-01-01 20:49:04 121 4
2015-01-01 20:59:21 34 4
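An alternative that avoids a stateful counter is sketched below: floor each timestamp to its 15-minute bin and number the distinct bins. This is a different technique from the transform approach above, and it numbers only bins that actually contain rows.

```python
from datetime import datetime

import pandas as pd

d1 = pd.DataFrame({
    "a": [datetime(2015, 1, 1, 20, 2, 1), datetime(2015, 1, 1, 20, 14, 58),
          datetime(2015, 1, 1, 20, 17, 5), datetime(2015, 1, 1, 20, 31, 5),
          datetime(2015, 1, 1, 20, 34, 28), datetime(2015, 1, 1, 20, 37, 51),
          datetime(2015, 1, 1, 20, 41, 19), datetime(2015, 1, 1, 20, 49, 4),
          datetime(2015, 1, 1, 20, 59, 21)],
    "b": [2, 4, 26, 22, 45, 3, 8, 121, 34],
})

# Round each timestamp down to the start of its 15-minute interval.
bins = d1["a"].dt.floor("15min")

# factorize assigns 0, 1, 2, ... to the distinct bins; sort=True orders them by time.
codes, _ = pd.factorize(bins, sort=True)
d1["c"] = codes + 1
print(d1)
```

Because the original index is untouched, the new column lands directly on d1 rather than on a re-indexed copy.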

GroupBy makes time index disappear

With this DataFrame:
import pandas as pd
df = pd.DataFrame([[1,1],[1,2],[1,3],[1,5],[1,7],[1,9]], index=pd.date_range('2015-01-01', periods=6), columns=['a', 'b'])
i.e.
a b
2015-01-01 1 1
2015-01-02 1 2
2015-01-03 1 3
2015-01-04 1 5
2015-01-05 1 7
2015-01-06 1 9
Using df = df.groupby(df.b // 4).last() makes the datetime index disappear. Why?
a b
b
0 1 3
1 1 7
2 1 9
Expected result instead:
a b
2015-01-03 1 3
2015-01-05 1 7
2015-01-06 1 9
With groupby, the index of the result is always built from the grouping values. In your case you could call reset_index before grouping and set_index afterwards:
df['c'] = df.b // 4
result = df.reset_index().groupby('c').last().set_index('index')
In [349]: result
Out[349]:
a b
index
2015-01-03 1 3
2015-01-05 1 7
2015-01-06 1 9
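If the goal is simply the last row of each group with its original index, a shorter sketch uses tail(1), which selects rows rather than aggregating. One difference to keep in mind: last() picks the last non-null value per column, whereas tail(1) returns the last row as it is.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 1], [1, 2], [1, 3], [1, 5], [1, 7], [1, 9]],
    index=pd.date_range("2015-01-01", periods=6),
    columns=["a", "b"],
)

# tail(1) keeps the last row of every group and preserves the datetime index.
result = df.groupby(df.b // 4).tail(1)
print(result)
```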

Generating sub data frame based on a value in an column

I have the following data frame in pandas. I want to generate a sub data frame for a Name if a given value appears in its Activity column. For example, I want a data frame with all the data for Name A if its Activity column contains the value 3 or 5.
Name Date Activity
A 01-02-2015 1
A 01-03-2015 2
A 01-04-2015 3
A 01-04-2015 1
B 01-02-2015 1
B 01-02-2015 2
B 01-03-2015 1
B 01-04-2015 5
C 01-31-2015 1
C 01-31-2015 2
C 01-31-2015 2
So for the above data, I want to get
df_A as
Name Date Activity
A 01-02-2015 1
A 01-03-2015 2
A 01-04-2015 3
A 01-04-2015 1
df_B as
B 01-02-2015 1
B 01-02-2015 2
B 01-03-2015 1
B 01-04-2015 5
Since Name C does not have 3 or 5 in the column Activity, I do not want to get this data frame.
Also, the names in the data frame can vary with each input file.
Once I have these data frame separated, I want to plot a time series.
You can group the dataframe by column Name, apply a custom function f, and then select the dataframes df_A and df_B:
print(df)
Name Date Activity
0 A 2015-01-02 1
1 A 2015-01-03 2
2 A 2015-01-04 3
3 A 2015-01-04 1
4 B 2015-01-02 1
5 B 2015-01-02 2
6 B 2015-01-03 1
7 B 2015-01-04 5
8 C 2015-01-31 1
9 C 2015-01-31 2
10 C 2015-01-31 2
def f(df):
    # Return the group only if it contains Activity 3 or 5 (otherwise None, which is dropped)
    if ((df['Activity'] == 3) | (df['Activity'] == 5)).any():
        return df
g = df.groupby('Name').apply(f).reset_index(drop=True)
df_A = g.loc[g.Name == 'A']
print(df_A)
Name Date Activity
0 A 2015-01-02 1
1 A 2015-01-03 2
2 A 2015-01-04 3
3 A 2015-01-04 1
df_B = g.loc[g.Name == 'B']
print(df_B)
Name Date Activity
4 B 2015-01-02 1
5 B 2015-01-02 2
6 B 2015-01-03 1
7 B 2015-01-04 5
Finally, you can plot each dataframe with plot:
df_A.plot()
df_B.plot()
EDIT:
If you want to create the dataframes dynamically, you can find all unique values of column Name with drop_duplicates:
for name in g.Name.drop_duplicates():
    print(g.loc[g.Name == name])
Name Date Activity
0 A 2015-01-02 1
1 A 2015-01-03 2
2 A 2015-01-04 3
3 A 2015-01-04 1
Name Date Activity
4 B 2015-01-02 1
5 B 2015-01-02 2
6 B 2015-01-03 1
7 B 2015-01-04 5
You can use a dictionary comprehension to create a sub dataframe for each Name with an Activity value of 3 or 5.
active_names = df[df.Activity.isin([3, 5])].Name.unique().tolist()
dfs = {name: df.loc[df.Name == name, :] for name in active_names}
>>> dfs['A']
Name Date Activity
0 A 01-02-2015 1
1 A 01-03-2015 2
2 A 01-04-2015 3
3 A 01-04-2015 1
>>> dfs['B']
Name Date Activity
4 B 01-02-2015 1
5 B 01-02-2015 2
6 B 01-03-2015 1
7 B 01-04-2015 5
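Another way to express the same selection, sketched here with the question's sample data, is groupby().filter(), which keeps whole groups for which a predicate is true. The dictionary of per-name frames is then built from the filtered result.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": list("AAAABBBBCCC"),
    "Date": pd.to_datetime(
        ["01-02-2015", "01-03-2015", "01-04-2015", "01-04-2015",
         "01-02-2015", "01-02-2015", "01-03-2015", "01-04-2015",
         "01-31-2015", "01-31-2015", "01-31-2015"],
        format="%m-%d-%Y",
    ),
    "Activity": [1, 2, 3, 1, 1, 2, 1, 5, 1, 2, 2],
})

# Keep every row of a Name whose Activity column contains 3 or 5.
kept = df.groupby("Name").filter(lambda group: group["Activity"].isin([3, 5]).any())

# One sub dataframe per remaining Name, without hard-coding the names.
dfs = {name: group for name, group in kept.groupby("Name")}
print(sorted(dfs))
```

Since the names are discovered from the data, this works unchanged when the input file contains different names.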

split, groupby, combine in Pandas to find a difference in dates

I have a simple dataframe that looks like this:
I would like to use groupby to group by id, then find some way to difference the dates, and then column bind them back to the dataframe, so I end up with this:
The groupby is straightforward,
grouped = DF.groupby('id')
and finding the earliest date is straightforward,
mindates = grouped['date'].min()
But I'm not sure how to proceed. How do I apply the date subtraction operation, then combine?
There is a similar question here.
Thanks for reading this far.
My dataframe is:
dates=pd.to_datetime(['2015-01-01', '2015-02-01', '2015-03-01', '2015-04-01', '2015-05-01', '2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04', '2015-01-05'])
DF = DataFrame({'id':[1,1,1,1,1,2,2,2,2,2], 'date':dates})
cols = ['id', 'date']
DF=DF[cols]
EDIT:
Both answers below are awesome. I wish I could accept them both.
You can use apply like this:
earliest_by_id = DF.groupby('id')['date'].min()
def since_earliest(row):
    return row.date - earliest_by_id[row.id]
DF['days_since_earliest'] = DF.apply(since_earliest, axis=1)
print(DF)
id date days_since_earliest
0 1 2015-01-01 0 days
1 1 2015-02-01 31 days
2 1 2015-03-01 59 days
3 1 2015-04-01 90 days
4 1 2015-05-01 120 days
5 2 2015-01-01 0 days
6 2 2015-01-02 1 days
7 2 2015-01-03 2 days
8 2 2015-01-04 3 days
9 2 2015-01-05 4 days
edit: to get integer days instead of timedeltas, use the dt.days accessor:
DF['days_since_earliest'] = DF.apply(since_earliest, axis=1).dt.days
print(DF)
id date days_since_earliest
0 1 2015-01-01 0
1 1 2015-02-01 31
2 1 2015-03-01 59
3 1 2015-04-01 90
4 1 2015-05-01 120
5 2 2015-01-01 0
6 2 2015-01-02 1
7 2 2015-01-03 2
8 2 2015-01-04 3
9 2 2015-01-05 4
FWIW, using transform can often be simpler (and usually faster) than apply. transform takes the result of a groupby operation and broadcasts it back to the original index:
>>> df["dse"] = df["date"] - df.groupby("id")["date"].transform("min")
>>> df
id date dse
0 1 2015-01-01 0 days
1 1 2015-02-01 31 days
2 1 2015-03-01 59 days
3 1 2015-04-01 90 days
4 1 2015-05-01 120 days
5 2 2015-01-01 0 days
6 2 2015-01-02 1 days
7 2 2015-01-03 2 days
8 2 2015-01-04 3 days
9 2 2015-01-05 4 days
If you'd prefer integer days instead of timedelta objects, you can use the dt.days accessor:
>>> df["dse"] = df["dse"].dt.days
>>> df
id date dse
0 1 2015-01-01 0
1 1 2015-02-01 31
2 1 2015-03-01 59
3 1 2015-04-01 90
4 1 2015-05-01 120
5 2 2015-01-01 0
6 2 2015-01-02 1
7 2 2015-01-03 2
8 2 2015-01-04 3
9 2 2015-01-05 4
