Create a DataFrame for MultiClass classification - python

I have a data frame with n rows, and I want to randomly assign a class to every row from m classes such that the proportions of all classes are the same.
Example:
>>> classes = ['c1','c2','c3','c4']
>>> df = pd.DataFrame(np.random.randn(100, 5), columns = list("abcde"))
>>> df
a b c d e
0 -0.341559 1.499159 0.269614 -0.198663 -1.081290
1 -1.966477 1.902292 -0.092296 -1.730710 -1.342866
2 1.188634 -2.851902 1.130480 -0.495677 -0.569557
3 -0.816190 1.205463 1.157507 -0.217025 -0.160752
4 -2.001114 -0.818852 -0.696057 -0.874615 -0.577101
.. ... ... ... ... ...
95 0.502192 0.434275 0.358244 -0.763562 -0.787102
96 -1.071011 0.045387 0.297905 -0.120974 0.185418
97 2.458274 -1.852953 -0.049336 -0.150604 -0.292824
98 1.992513 -0.431639 0.566920 -1.289439 0.626914
99 0.685915 -0.723009 -0.168497 1.630057 1.587378
[100 rows x 5 columns]
Expected output:
>>> df
a b c d e class
0 -0.341559 1.499159 0.269614 -0.198663 -1.081290 c3
1 -1.966477 1.902292 -0.092296 -1.730710 -1.342866 c4
2 1.188634 -2.851902 1.130480 -0.495677 -0.569557 c2
3 -0.816190 1.205463 1.157507 -0.217025 -0.160752 c3
4 -2.001114 -0.818852 -0.696057 -0.874615 -0.577101 c1
.. ... ... ... ... ... ...
95 0.502192 0.434275 0.358244 -0.763562 -0.787102 c1
96 -1.071011 0.045387 0.297905 -0.120974 0.185418 c3
97 2.458274 -1.852953 -0.049336 -0.150604 -0.292824 c2
98 1.992513 -0.431639 0.566920 -1.289439 0.626914 c1
99 0.685915 -0.723009 -0.168497 1.630057 1.587378 c2
[100 rows x 6 columns]
With the class proportions being the same

This should do the job:
classes = ['c1', 'c2', 'c3', 'c4']
df = pd.DataFrame(np.random.randn(100, 5), columns=list("abcde"))
# Repeat each label 100 // 4 = 25 times so the proportions are exactly equal,
# then shuffle in place for random assignment (note the integer division:
# np.repeat rejects float repeat counts)
labels = np.repeat(classes, df.shape[0] // len(classes))
np.random.shuffle(labels)
df['class'] = labels
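If the row count is not an exact multiple of the number of classes, np.repeat alone yields too few labels and the assignment fails. A minimal sketch of one way to handle the remainder (padding with a no-replacement random draw is my own assumption, not part of the answer above):

import numpy as np
import pandas as pd

classes = ['c1', 'c2', 'c3', 'c4']
df = pd.DataFrame(np.random.randn(102, 5), columns=list("abcde"))  # 102 % 4 != 0

# Repeat each class floor(n/m) times, then top up with distinct random classes
labels = np.repeat(classes, len(df) // len(classes))
extra = np.random.choice(classes, size=len(df) - len(labels), replace=False)
labels = np.concatenate([labels, extra])

np.random.shuffle(labels)
df['class'] = labels
print(df['class'].value_counts())  # proportions equal up to the remainder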

Divide a group into n and add block numbers for each group in python

I have the following table:
ColumnA  ColumnB
A        12
B        32
C        44
D        76
E        99
F        123
G        65
H        87
I        76
J        231
k        80
l        55
m        27
n        67
I would like to divide this table into 'n' (n = 4, here) groups and add another column with the group number. The output should look like the following:
ColumnA  ColumnB  ColumnC
A        12       1
B        32       1
C        44       1
D        76       1
E        99       2
F        123      2
G        65       2
H        87       2
I        76       3
J        231      3
k        80       3
l        55       4
m        27       4
n        67       4
What I tried so far:
TGn = 4
idx = set(df.index // TGn)
treatment_groups = [i for i in range(1, n+1)]
df['columnC'] = (df.index // TGn).map(dict(zip(idx, treatment_groups)))
This does not split the groups properly, and I'm not sure where I went wrong. How do I correct it?
In your attempt, a set has no guaranteed iteration order and n is undefined where TGn was meant, so the zip-built mapping can scramble the group numbers. Assuming that your sample size is exactly divisible by n (i.e. sample_size % n is 0):
import numpy as np
n = 4  # number of groups
groups = range(1, n + 1)
df['columnC'] = np.repeat(groups, len(df) // n)
If your sample size is not exactly divisible by n (i.e. sample_size % n is not 0):
# Assigning the remaining len(df) % n rows to random groups
# (high is exclusive in np.random.randint, so n + 1 includes group n)
df['columnC'] = np.concatenate(
    [np.repeat(groups, len(df) // n),
     np.random.randint(1, high=n + 1, size=len(df) % n)])
# Assigning the remaining rows to a fixed group m instead
df['columnC'] = np.concatenate(
    [np.repeat(groups, len(df) // n),
     np.repeat([m], len(df) % n)])
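For what it's worth, a hedged alternative that reproduces the question's 4/4/3/3 split without padding, by sizing the chunks the way np.array_split would (the DataFrame literal below is just the sample data re-typed):

import numpy as np
import pandas as pd

df = pd.DataFrame({'ColumnA': list('ABCDEFGHIJklmn'),
                   'ColumnB': [12, 32, 44, 76, 99, 123, 65, 87, 76, 231, 80, 55, 27, 67]})
n = 4

# The first len(df) % n groups get one extra row, giving sizes [4, 4, 3, 3] here
sizes = [len(chunk) for chunk in np.array_split(np.arange(len(df)), n)]
df['ColumnC'] = np.repeat(range(1, n + 1), sizes)
print(df)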

pandas outcome variable is NaN

I have set the outcome variable y as a column in a csv. It loads properly and works when I print just y, but when I use y = y[x:] I start getting NaN as values.
y = previous_games_stats['Unnamed: 7'] #outcome variable (win/loss)
y = y[9:] #causes NaN for outcome variables
Then later in the file I print the outcome column. final_df is a dataframe which does not yet have the outcome variable set, so I set it below:
final_df['outcome'] = y
print(final_df['outcome'])
But the outcome is:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 L
It looks like the last value is correct (they should all be 'W' or 'L').
How can I line up my data frames properly so I do not get NaN?
Entire Code:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
import time
import matplotlib.pyplot as plt
np.random.seed(0)
from array import array
iris=load_iris()
previous_games_stats = pd.read_csv('stats/2016-2017 CANUCKS STATS.csv', header=1)
numGamesToLookBack = 10;
axis=1) #Predictor variables
X = previous_games_stats[['GF', 'GA']]
count = 0
final_df = pd.DataFrame(columns=['GF', 'GA'])
#final_y = pd.DataFrame(columns=['Unnamed: 7'])
y = previous_games_stats['Unnamed: 7'] #outcome variable (win/loss)
y = y[numGamesToLookBack-1:]
for game in range(0, 10):
    X = previous_games_stats[['GF', 'GA']]
    X = X[count:numGamesToLookBack] #num games to look back
    stats_feature_names = list(X.columns.values)
    df = pd.DataFrame(iris.data, columns=iris.feature_names)
    stats_df = pd.DataFrame(X, columns=stats_feature_names).sum().to_frame().T
    final_df = final_df.append(stats_df, ignore_index=True)
    count+=1
    numGamesToLookBack+=1
print("final_df:\n", final_df)
stats_target_names = np.array(['Win', 'Loss']) #don't need?...just a label it looks like
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
final_df['outcome'] = y
final_df['outcome'].update(y) #ADDED UPDATE TO FIX NaN
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75 #for iris
final_df['is_train'] = np.random.uniform(0, 1, len(final_df)) <= .65
train, test = df[df['is_train']==True], df[df['is_train']==False]
stats_train = final_df[final_df['is_train']==True]
stats_test = final_df[final_df['is_train']==False]
features = df.columns[:4]
stats_features = final_df.columns[:2]
y = pd.factorize(train['species'])[0]
stats_y = pd.factorize(stats_train['outcome'])[0]
clf = RandomForestClassifier(n_jobs=2, random_state=0)
stats_clf = RandomForestClassifier(n_jobs=2, random_state=0)
clf.fit(train[features], y)
stats_clf.fit(stats_train[stats_features], stats_y)
stats_clf.predict_proba(stats_test[stats_features])[0:10]
preds = iris.target_names[clf.predict(test[features])]
stats_preds = stats_target_names[stats_clf.predict(stats_test[stats_features])]
pd.crosstab(stats_test['outcome'], stats_preds, rownames=['Actual Outcome'], colnames=['Predicted Outcome'])
print("~~~confusion matrix~~~\nColumns represent what we predicted for the outcome of the game, and rows represent the actual outcome of the game.\n")
print(pd.crosstab(stats_test['outcome'], stats_preds, rownames=['Actual Outcome'], colnames=['Predicted Outcome']))
This is expected: after the slice, y has no values for index labels 0-8, and column assignment aligns on the index, so those rows get NaN.
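A minimal repro of that alignment, with made-up data:

import pandas as pd

y = pd.Series(['W', 'L'] * 6)               # index labels 0..11
y = y[9:]                                    # only labels 9..11 survive
final_df = pd.DataFrame({'GF': range(10)})   # index labels 0..9
final_df['outcome'] = y                      # aligns on index: only label 9 matches
print(final_df['outcome'])                   # rows 0-8 are NaN, row 9 is 'L'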
If the column is new and the length of y equals the length of df, assign the underlying numpy array:
final_df['outcome'] = y.values
But if the lengths differ, it is a bit more complicated, because the lengths need to match:
df = pd.DataFrame({'a':range(10), 'b':range(20,30)}).astype(str).radd('a')
print (df)
a b
0 a0 a20
1 a1 a21
2 a2 a22
3 a3 a23
4 a4 a24
5 a5 a25
6 a6 a26
7 a7 a27
8 a8 a28
9 a9 a29
y = df['a']
y = y[4:]
print (y)
4 a4
5 a5
6 a6
7 a7
8 a8
9 a9
Name: a, dtype: object
If len(final_df) < len(y):
Trim y to the length of final_df, then convert to a numpy array so the indices do not align:
final_df = pd.DataFrame({'new':range(100, 105)})
final_df['s'] = y.iloc[:len(final_df)].values
print (final_df)
new s
0 100 a4
1 101 a5
2 102 a6
3 103 a7
4 104 a8
If len(final_df) > len(y):
Create a new Series from y's values, using the first len(y) index labels of final_df:
final_df1 = pd.DataFrame({'new':range(100, 110)})
final_df1['s'] = pd.Series(y.values, index=final_df1.index[:len(y)])
print (final_df1)
new s
0 100 a4
1 101 a5
2 102 a6
3 103 a7
4 104 a8
5 105 a9
6 106 NaN
7 107 NaN
8 108 NaN
9 109 NaN

Subtracting many columns in a df by one column in another df

I'm trying to subtract a df "p_df" (144 rows x 1 col) from a df "stock_returns" (144 rows x 517 cols).
I have tried:
stock_returns - p_df
stock_returns.rsub(p_df,axis=1)
stock_returns.subtract(p_df)
But none of them work, and they all return NaN values.
I'm passing it through this function and using the for loop to supply the arguments:
def disp_calc(returns, p, wi): #apply(disp_calc, rows = ...)
    wi = wi/np.sum(wi)
    rp = (col_len(returns)*(returns-p)**2).sum() #returns - p causing problems
    return np.sqrt(rp)

for i in sectors:
    stock_returns = returns_rolling[sectordict[i]]#.apply(np.mean,axis=1)
    portfolio_return = returns_rolling[i]; p_df = portfolio_return.to_frame()
    disp_df[i] = stock_returns.apply(disp_calc,args=(portfolio_return,wi))
My expected output is the single column in p_df subtracted from all 517 columns in the first df, so the final result would still have 517 columns. Thanks
You're almost there, just need to set axis=0 to subtract along the indexes:
>>> stock_returns = pd.DataFrame([[10,100,200],
[15, 115, 215],
[20,120, 220],
[25,125,225],
[30,130,230]], columns=['A', 'B', 'C'])
>>> stock_returns
A B C
0 10 100 200
1 15 115 215
2 20 120 220
3 25 125 225
4 30 130 230
>>> p_df = pd.DataFrame([1,2,3,4,5], columns=['P'])
>>> p_df
P
0 1
1 2
2 3
3 4
4 5
>>> stock_returns.sub(p_df['P'], axis=0)
A B C
0 9 99 199
1 13 113 213
2 17 117 217
3 21 121 221
4 25 125 225
If you just need to subtract one column from another within the same DataFrame:
data['new_col3'] = data['col1'] - data['col2']
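For context, the original attempts return NaN because DataFrame-with-DataFrame arithmetic aligns on column labels, and p_df's single column shares no label with stock_returns' 517 columns. A small sketch of the behaviour (the column names are made up):

import pandas as pd

left = pd.DataFrame({'A': [10, 15], 'B': [100, 115]})
right = pd.DataFrame({'P': [1, 2]})

print(left - right)                  # all NaN: columns A, B and P never align
print(left.sub(right['P'], axis=0))  # subtract the Series row-wise instead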

How to combine boolean indexer with multi-index in pandas?

I have a multi-indexed dataframe and I wish to extract a subset based on index values and on a boolean criterion. I also wish to overwrite the values of a specific column with new values, using multi-index keys and boolean indexers to select the records to modify.
import pandas as pd
import numpy as np
years = [1994,1995,1996]
householdIDs = [ id for id in range(1,100) ]
midx = pd.MultiIndex.from_product( [years, householdIDs], names = ['Year', 'HouseholdID'] )
householdIncomes = np.random.randint( 10000,100000, size = len(years)*len(householdIDs) )
householdSize = np.random.randint( 1,5, size = len(years)*len(householdIDs) )
df = pd.DataFrame( {'HouseholdIncome':householdIncomes, 'HouseholdSize':householdSize}, index = midx )
df.sort_index(inplace = True)
Here's what the sample data looks like...
df.head()
=> HouseholdIncome HouseholdSize
Year HouseholdID
1994 1 23866 3
2 57956 3
3 21644 3
4 71912 4
5 83663 3
I'm able to successfully query the dataframe using the indices and column labels.
This example gives me the HouseholdSize for household 3 in year 1996
df.loc[ (1996,3 ) , 'HouseholdSize' ]
=> 1
However, I'm unable to combine boolean selection with multi-index queries...
The pandas docs on MultiIndexing say there is a way to combine boolean indexing with multi-indexing, and give an example...
In [52]: idx = pd.IndexSlice
In [56]: mask = dfmi[('a','foo')]>200
In [57]: dfmi.loc[idx[mask,:,['C1','C3']],idx[:,'foo']]
Out[57]:
lvl0 a b
lvl1 foo foo
A3 B0 C1 D1 204 206
C3 D0 216 218
D1 220 222
B1 C1 D0 232 234
D1 236 238
C3 D0 248 250
D1 252 254
...which I can't seem to replicate on my dataframe
idx = pd.IndexSlice
housholdSizeAbove2 = ( df.HouseholdSize > 2 )
df.loc[ idx[ housholdSizeAbove2, 1996, :] , 'HouseholdSize' ]
Traceback (most recent call last):
File "python", line 1, in <module>
KeyError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (3), lexsort depth (2)'
In this example I would want to see all the households in 1996 with HouseholdSize above 2.
DataFrame.query() should work in this case:
df.query("Year == 1996 and HouseholdID > 2")
Demo:
In [326]: with pd.option_context('display.max_rows',20):
...: print(df.query("Year == 1996 and HouseholdID > 2"))
...:
HouseholdIncome HouseholdSize
Year HouseholdID
1996 3 28664 4
4 11057 1
5 36321 2
6 89469 4
7 35711 2
8 85741 1
9 34758 3
10 56085 2
11 32275 4
12 77096 4
... ... ...
90 40276 4
91 10594 2
92 61080 4
93 65334 2
94 21477 4
95 83112 4
96 25627 2
97 24830 4
98 85693 1
99 84653 4
[97 rows x 2 columns]
UPDATE:
Is there a way to select a specific column?
In [333]: df.loc[df.eval("Year == 1996 and HouseholdID > 2"), 'HouseholdIncome']
Out[333]:
Year HouseholdID
1996 3 28664
4 11057
5 36321
6 89469
7 35711
8 85741
9 34758
10 56085
11 32275
12 77096
...
90 40276
91 10594
92 61080
93 65334
94 21477
95 83112
96 25627
97 24830
98 85693
99 84653
Name: HouseholdIncome, dtype: int32
and ultimately I want to overwrite the data on the dataframe.
In [331]: df.loc[df.eval("Year == 1996 and HouseholdID > 2"), 'HouseholdSize'] *= 10
In [332]: df.loc[df.eval("Year == 1996 and HouseholdID > 2")]
Out[332]:
HouseholdIncome HouseholdSize
Year HouseholdID
1996 3 28664 40
4 11057 10
5 36321 20
6 89469 40
7 35711 20
8 85741 10
9 34758 30
10 56085 20
11 32275 40
12 77096 40
... ... ...
90 40276 40
91 10594 20
92 61080 40
93 65334 20
94 21477 40
95 83112 40
96 25627 20
97 24830 40
98 85693 10
99 84653 40
[97 rows x 2 columns]
UPDATE2:
I want to pass a variable year instead of a specific value. Is there
a cleaner way to do it than "Year == " + str(year) + " and HouseholdID > " + str(householdSize)?
In [5]: year = 1996
In [6]: household_ids = [1, 2, 98, 99]
In [7]: df.loc[df.eval("Year == @year and HouseholdID in @household_ids")]
Out[7]:
HouseholdIncome HouseholdSize
Year HouseholdID
1996 1 42217 1
2 66009 3
98 33121 4
99 45489 3
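If you prefer plain .loc over query()/eval(), here is a sketch of an equivalent boolean mask built from the index level (an alternative I'm adding here, not part of the answer above):

# Combine a data condition with an index-level condition in one boolean mask
mask = (df['HouseholdSize'] > 2) & (df.index.get_level_values('Year') == 1996)
print(df.loc[mask, 'HouseholdSize'])
df.loc[mask, 'HouseholdSize'] *= 10  # assignment works the same way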

python compare strings in a table and return the best string

I have a table with 4 columns delimited by whitespace:
A1 3445 1 24
A1 3445 1 214
A2 3603 2 45
A2 3603 2 144
A0 3314 3 8
A0 3314 3 134
A0 3314 4 46
For each ID in the first column (e.g. A1), I would like to compare the values in the last column and keep the row with the biggest number. So the end result will look like this:
A1 3445 1 214
A2 3603 2 144
A0 3314 3 134
I have gotten as far as splitting the lines, but I don't see how to compare them.
Any help would be nice.
Use the sorted function, giving the last column as the key:
with open('a.txt', 'r') as a:  # 'a.txt' is your file
    table = []
    for line in a:
        table.append(line.split())
s = sorted(table, key=lambda x: int(x[-1]), reverse=True)
for r in s:
    print('\t'.join(r))
Result:
A1 3445 1 214
A2 3603 2 144
A0 3314 3 134
A0 3314 4 46
A2 3603 2 45
A1 3445 1 24
A0 3314 3 8
dataDic = {}
for data in open('1.txt').readlines():
    id, a, b, num = data.split()
    if id not in dataDic:
        dataDic[id] = [a, b, int(num)]
    else:
        if int(num) >= dataDic[id][-1]:
            dataDic[id] = [a, b, int(num)]
print(dataDic)
I think this result may be what you want.
data = [('A1',3445,1,24), ('A1',3445,1,214), ('A2',3603,2,45),
        ('A2',3603,2,144), ('A0',3314,3,8), ('A0',3314,3,134),
        ('A0',3314,4,46)]
from itertools import groupby
for key, group in groupby(data, lambda x: x[0]):
    print(sorted(group, key=lambda x: x[-1], reverse=True)[0])
The output is:
('A1', 3445, 1, 214)
('A2', 3603, 2, 144)
('A0', 3314, 3, 134)
This uses itertools.groupby, which groups consecutive items sharing a key, so the rows must already be ordered by ID (as they are here).
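Since the rest of this page leans on pandas, here is a hedged pandas sketch of the same task (the file name 'a.txt' and the column names are assumptions):

import pandas as pd

# Read the whitespace-delimited table, then keep, per ID, the row whose
# last column is largest (row order may differ from the input)
df = pd.read_csv('a.txt', sep=r'\s+', header=None,
                 names=['id', 'a', 'b', 'num'])
best = df.loc[df.groupby('id')['num'].idxmax()]
print(best.to_string(index=False, header=False))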
