I have set the outcome variable y as a column in a csv. It loads properly and works when I print just y, but when I use y = y[x:] I start getting NaN as values.
y = previous_games_stats['Unnamed: 7'] #outcome variable (win/loss)
y = y[9:] #causes NaN for outcome variables
Then later in the file I print the outcome column. final_df is a dataframe which does not yet have the outcome variable set, so I set it below:
final_df['outcome'] = y
print(final_df['outcome'])
But the outcome is:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 L
It looks like the last value is correct (they should all be 'W' or 'L').
How can I line up my data frames properly so I do not get NaN?
Entire Code:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
import time
import matplotlib.pyplot as plt
np.random.seed(0)
from array import array
iris=load_iris()
previous_games_stats = pd.read_csv('stats/2016-2017 CANUCKS STATS.csv', header=1)
numGamesToLookBack = 10
# Predictor variables
X = previous_games_stats[['GF', 'GA']]
count = 0
final_df = pd.DataFrame(columns=['GF', 'GA'])
#final_y = pd.DataFrame(columns=['Unnamed: 7'])
y = previous_games_stats['Unnamed: 7'] #outcome variable (win/loss)
y = y[numGamesToLookBack-1:]
for game in range(0, 10):
    X = previous_games_stats[['GF', 'GA']]
    X = X[count:numGamesToLookBack] #num games to look back
    stats_feature_names = list(X.columns.values)
    df = pd.DataFrame(iris.data, columns=iris.feature_names)
    stats_df = pd.DataFrame(X, columns=stats_feature_names).sum().to_frame().T
    final_df = final_df.append(stats_df, ignore_index=True)
    count+=1
    numGamesToLookBack+=1
print("final_df:\n", final_df)
stats_target_names = np.array(['Win', 'Loss']) #don't need?...just a label it looks like
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
final_df['outcome'] = y
final_df['outcome'].update(y) #ADDED UPDATE TO FIX NaN
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75 #for iris
final_df['is_train'] = np.random.uniform(0, 1, len(final_df)) <= .65
train, test = df[df['is_train']==True], df[df['is_train']==False]
stats_train = final_df[final_df['is_train']==True]
stats_test = final_df[final_df['is_train']==False]
features = df.columns[:4]
stats_features = final_df.columns[:2]
y = pd.factorize(train['species'])[0]
stats_y = pd.factorize(stats_train['outcome'])[0]
clf = RandomForestClassifier(n_jobs=2, random_state=0)
stats_clf = RandomForestClassifier(n_jobs=2, random_state=0)
clf.fit(train[features], y)
stats_clf.fit(stats_train[stats_features], stats_y)
stats_clf.predict_proba(stats_test[stats_features])[0:10]
preds = iris.target_names[clf.predict(test[features])]
stats_preds = stats_target_names[stats_clf.predict(stats_test[stats_features])]
pd.crosstab(stats_test['outcome'], stats_preds, rownames=['Actual Outcome'], colnames=['Predicted Outcome'])
print("~~~confusion matrix~~~\nColumns represent what we predicted for the outcome of the game, and rows represent the actual outcome of the game.\n")
print(pd.crosstab(stats_test['outcome'], stats_preds, rownames=['Actual Outcome'], colnames=['Predicted Outcome']))
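This is expected. After slicing, y keeps its original index labels (9 onward), so it has no entries for labels 0 to 8. When you assign it to final_df, pandas aligns on the index, and every label of final_df that is missing from y becomes NaN.
If the column is new and y has the same length as final_df, assign the underlying NumPy array instead, which skips index alignment: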
final_df['outcome'] = y.values
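To see the alignment in isolation, here is a minimal sketch with made-up data. The frames stats and final_df below are hypothetical stand-ins for illustration, not the CSV from the question:

```python
import pandas as pd

# Hypothetical stand-in for the stats file: 12 games, alternating W/L.
stats = pd.DataFrame({"result": list("WLWLWLWLWLWL")})

y = stats["result"][9:]                      # keeps index labels 9, 10, 11
final_df = pd.DataFrame({"GF": [1, 2, 3]})   # index labels 0, 1, 2

# Assigning the Series aligns on labels: none of 0, 1, 2 exist in y.
aligned = final_df.assign(outcome=y)

# Assigning the raw values is purely positional.
positional = final_df.assign(outcome=y.values)

print(aligned)
print(positional)
```

The first frame has only NaN in outcome because the labels do not overlap. The second one gets the three values in order.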
If the lengths differ, it is a bit more complicated, because the assigned values must match the length of the DataFrame:
df = pd.DataFrame({'a':range(10), 'b':range(20,30)}).astype(str).radd('a')
print (df)
a b
0 a0 a20
1 a1 a21
2 a2 a22
3 a3 a23
4 a4 a24
5 a5 a25
6 a6 a26
7 a7 a27
8 a8 a28
9 a9 a29
y = df['a']
y = y[4:]
print (y)
4 a4
5 a5
6 a6
7 a7
8 a8
9 a9
Name: a, dtype: object
len(final_df) < len(y):
Truncate y to the length of final_df, then convert it to a NumPy array so that the indices are not aligned:
final_df = pd.DataFrame({'new':range(100, 105)})
final_df['s'] = y.iloc[:len(final_df)].values
print (final_df)
new s
0 100 a4
1 101 a5
2 102 a6
3 103 a7
4 104 a8
len(final_df) > len(y):
Create a new Series whose index is the first len(y) index values of the DataFrame:
final_df1 = pd.DataFrame({'new':range(100, 110)})
final_df1['s'] = pd.Series(y.values, index=final_df1.index[:len(y)])
print (final_df1)
new s
0 100 a4
1 101 a5
2 102 a6
3 103 a7
4 104 a8
5 105 a9
6 106 NaN
7 107 NaN
8 108 NaN
9 109 NaN
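Both cases can be folded into one helper. This is only a sketch: the function name assign_by_position is made up for illustration, and it assumes you always want the values placed row by row from the top, with NaN padding when y is too short.

```python
import pandas as pd

def assign_by_position(frame, values, name):
    """Return a copy of frame with `values` placed row by row, ignoring labels.

    Extra values are dropped; missing ones are filled with NaN.
    (Hypothetical helper, not part of pandas.)
    """
    # Fresh 0..n-1 index, then truncate or pad to the frame's length.
    col = pd.Series(values.to_numpy()).reindex(range(len(frame)))
    out = frame.copy()
    out[name] = col.to_numpy()   # positional assignment, no alignment
    return out

y = pd.Series([f"a{i}" for i in range(10)])[4:]   # labels 4..9

short = assign_by_position(pd.DataFrame({"new": range(100, 105)}), y, "s")
long_ = assign_by_position(pd.DataFrame({"new": range(100, 110)}), y, "s")
print(short)
print(long_)
```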
I'm trying to do SMOTE oversampling from imblearn. This is my code:
X = data[['a','b','c']]
y = data['targets']
oversampler = SMOTE(random_state=42)
X_over, y_over = oversampler.fit_resample(X,y)
The last line, X_over, y_over = oversampler.fit_resample(X,y), raises the error "setting an array element with a sequence".
I am sure the reason is because of the shape of my 'X'.
X is a dataframe where each row of column 'a' is a list of length 118, each row of column 'b' a list of length 15 and column 'c' is an integer column.
For example:
a(length - 118) b(length -15) c
[1,2,3,4,.....0] [4,7,8,9...0] 3
Now, how do I convert this dataframe X into an array of shape (n_samples, n_features), as the documentation requires?
Could someone please help me transform the input dataframe to get rid of this error?
You can expand the list columns into separate columns. First check that the lists in each column all have the same length:
import pandas as pd
import numpy as np
from imblearn.over_sampling import SMOTE
data = pd.DataFrame({'targets':np.random.binomial(1,0.15,100),
'a':np.random.randint(0,10,(100,2)).tolist(),
'b':np.random.randint(11,20,(100,3)).tolist(),
'c':np.random.randint(0,100,100)
})
data['a'].apply(len).value_counts()
2 100
Here is a function to expand the columns. The new columns are named e.g. a0..aN, and the original list columns are dropped:
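def expand_cols(da, col_list):
    da = da.copy()  # work on a copy so the caller's frame is left untouched
    for C in col_list:
        ix = [C + str(i) for i in range(len(da[C].iloc[0]))]
        # use the argument `da` here, not the global `data`
        da[ix] = pd.DataFrame(da[C].tolist(), columns=ix, index=da.index)
    da = da.drop(col_list, axis=1)
    return da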
This is your code, with X expanded when we fit:
X = data[['a','b','c']]
y = data['targets']
oversampler = SMOTE(random_state=42)
X_over, y_over = oversampler.fit_resample(expand_cols(X,['a','b']),y)
Looks like this:
X_over.head()
c a0 a1 b0 b1 b2
0 67 4 0 19 15 16
1 12 3 7 12 17 19
2 41 8 9 15 18 18
3 35 8 0 11 13 11
4 46 0 5 12 12 12
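If you only need the numeric matrix and not named columns, a NumPy-only variant is also possible. The sketch below assumes every list in a column has the same length, as checked above. It uses random stand-in data and stops before the SMOTE call.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Stand-in data: 'a' holds lists of length 2, 'b' lists of length 3.
data = pd.DataFrame({
    "a": rng.integers(0, 10, (100, 2)).tolist(),
    "b": rng.integers(11, 20, (100, 3)).tolist(),
    "c": rng.integers(0, 100, 100),
})

# Turn each list column into a 2-D block and stack the blocks side by side.
X_flat = np.hstack([
    np.array(data["a"].tolist()),   # shape (100, 2)
    np.array(data["b"].tolist()),   # shape (100, 3)
    data[["c"]].to_numpy(),         # shape (100, 1)
])

print(X_flat.shape)
```

The result has one row per sample and one column per list element plus c, which is the (n_samples, n_features) layout the question asks about.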
I have a data frame with n rows. I want to randomly assign one of m classes to every row such that all classes have the same proportion.
Example:
>>> classes = ['c1','c2','c3','c4']
>>> df = pd.DataFrame(np.random.randn(100, 5), columns = list("abcde"))
>>> df
a b c d e
0 -0.341559 1.499159 0.269614 -0.198663 -1.081290
1 -1.966477 1.902292 -0.092296 -1.730710 -1.342866
2 1.188634 -2.851902 1.130480 -0.495677 -0.569557
3 -0.816190 1.205463 1.157507 -0.217025 -0.160752
4 -2.001114 -0.818852 -0.696057 -0.874615 -0.577101
.. ... ... ... ... ...
95 0.502192 0.434275 0.358244 -0.763562 -0.787102
96 -1.071011 0.045387 0.297905 -0.120974 0.185418
97 2.458274 -1.852953 -0.049336 -0.150604 -0.292824
98 1.992513 -0.431639 0.566920 -1.289439 0.626914
99 0.685915 -0.723009 -0.168497 1.630057 1.587378
[100 rows x 5 columns]
Expected output:
>>> df
a b c d e class
0 -0.341559 1.499159 0.269614 -0.198663 -1.081290 c3
1 -1.966477 1.902292 -0.092296 -1.730710 -1.342866 c4
2 1.188634 -2.851902 1.130480 -0.495677 -0.569557 c2
3 -0.816190 1.205463 1.157507 -0.217025 -0.160752 c3
4 -2.001114 -0.818852 -0.696057 -0.874615 -0.577101 c1
.. ... ... ... ... ... ...
95 0.502192 0.434275 0.358244 -0.763562 -0.787102 c1
96 -1.071011 0.045387 0.297905 -0.120974 0.185418 c3
97 2.458274 -1.852953 -0.049336 -0.150604 -0.292824 c2
98 1.992513 -0.431639 0.566920 -1.289439 0.626914 c1
99 0.685915 -0.723009 -0.168497 1.630057 1.587378 c2
[100 rows x 6 columns]
With the class proportions being the same
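This should do the job, provided the number of rows is divisible by the number of classes:
classes = ['c1','c2','c3','c4']
df = pd.DataFrame(np.random.randn(100, 5), columns = list("abcde"))
classes = np.repeat(classes, df.shape[0] // len(classes))  # integer division: np.repeat needs an int count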
np.random.shuffle(classes)
df['class'] = classes
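When the number of rows is not a multiple of the number of classes, the repeated array no longer matches the frame's length. One possible workaround, sketched here with a 10-row frame chosen only for illustration, is to cycle through the classes with np.resize and then shuffle, so class counts differ by at most one:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
classes = ["c1", "c2", "c3", "c4"]
df = pd.DataFrame(rng.standard_normal((10, 5)), columns=list("abcde"))

# np.resize repeats the input cyclically until it has len(df) elements.
labels = np.resize(classes, len(df))

# Shuffle so that the assignment is random rather than cyclic.
df["class"] = rng.permutation(labels)

print(df["class"].value_counts())
```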
I am trying to perform a nested IF (together with the AND and OR functions) in pandas. I have the following two data frames:
dF1
TR_ID C_ID Code Check1 Check2
1 101 P1 N Y
2 102 P2 Y Y
3 103 P3 N Y
4 104 P4 Y N
5 105 P5 N N
6 106 P6 Y Y
7 107 P7 N N
8 108 P8 N N
9 109 P9 Y Y
10 110 P10 Y N
dF2
C_ID CC
101 A1
102 A2
103 A3
104 A4
105 A5
106 A6
107 A7
108 A8
109 A9
110 A10
I am trying to create a new column 'Result' in dF1 using the Excel formula below. I am fairly new to coding in pandas.
Excel Formula =
IF(AND(OR($D2="P2",$D2="P4",$D2="P6",$D2="P9"),$E2="Y",$F2="Y"),"A11",VLOOKUP($C2,$J$2:$K$11,2,0))
The resulting data frame should look like this
TR_ID C_ID Code Check1 Check2 RESULT
1 101 P1 N Y A1
2 102 P2 Y Y A11
3 103 P3 N Y A3
4 104 P4 Y N A4
5 105 P5 N N A5
6 106 P6 Y Y A11
7 107 P7 N N A7
8 108 P8 N N A8
9 109 P9 Y Y A11
10 110 P10 Y N A10
I am trying this code in Python: df1['CC'] = df1['Code'].apply(lambda x: 'A11' if x in ('P2','P4','P6','P9') else 'N')
But I am unable to incorporate the Check1 and Check2 criteria, and the VLOOKUP in the else branch is not working.
Any suggestion is greatly appreciated.
Try this:
# This is the first part of your IF statement
cond = (
df1['Code'].isin(['P2', 'P4', 'P6', 'P9'])
& df1['Check1'].eq('Y')
& df1['Check2'].eq('Y')
)
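# And the VLOOKUP
# (but don't name your dataframe `vlookup` in production code please)
# how='left' keeps every row of df1, in order, even if a C_ID has no match
vlookup = df1[['C_ID']].merge(df2, on='C_ID', how='left')
# Combining the two ('A11' is the value from the Excel formula)
df1['RESULT'] = np.where(cond, 'A11', vlookup['CC'])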
Unlike Excel, which does not treat worksheets or cell ranges as data set objects, pandas lets you work with data through named columns and attributes.
Therefore, consider using DataFrame.merge followed by conditional logic such as Series.where, which plays the role of the IF formula. Note that Series.where keeps values where the condition is True and replaces the rest, so the ~ operator is used to negate the condition.
p_list = ['P2', 'P4', 'P6', 'P9']
final_df = dF1.merge(dF2, on = "C_ID")
final_df['Result'] = final_df['CC'].where(~((final_df['Code'].isin(p_list))
& (final_df['Check1'] == 'Y')
& (final_df['Check2'] == 'Y')
), 'A11')
print(final_df)
# TR_ID C_ID Code Check1 Check2 CC Result
# 0 1 101 P1 N Y A1 A1
# 1 2 102 P2 Y Y A2 A11
# 2 3 103 P3 N Y A3 A3
# 3 4 104 P4 Y N A4 A4
# 4 5 105 P5 N N A5 A5
# 5 6 106 P6 Y Y A6 A11
# 6 7 107 P7 N N A7 A7
# 7 8 108 P8 N N A8 A8
# 8 9 109 P9 Y Y A9 A11
# 9 10 110 P10 Y N A10 A10
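Another way to express the VLOOKUP part is Series.map, which avoids the merge. The sketch below rebuilds the two frames from the question so it can run on its own; the variable name lookup is just an illustrative choice.

```python
import numpy as np
import pandas as pd

# The two frames from the question, rebuilt by hand.
df1 = pd.DataFrame({
    "TR_ID": range(1, 11),
    "C_ID": range(101, 111),
    "Code": [f"P{i}" for i in range(1, 11)],
    "Check1": list("NYNYNYNNYY"),
    "Check2": list("YYYNNYNNYN"),
})
df2 = pd.DataFrame({
    "C_ID": range(101, 111),
    "CC": [f"A{i}" for i in range(1, 11)],
})

# VLOOKUP: a Series indexed by the key behaves like a lookup table.
lookup = df2.set_index("C_ID")["CC"]

# IF(AND(OR(...), Check1 = "Y", Check2 = "Y"), ...)
cond = (
    df1["Code"].isin(["P2", "P4", "P6", "P9"])
    & df1["Check1"].eq("Y")
    & df1["Check2"].eq("Y")
)

df1["RESULT"] = np.where(cond, "A11", df1["C_ID"].map(lookup))
print(df1)
```

Keys that are missing from df2 would come out as NaN, which is roughly what #N/A is in Excel.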
How can I change multiple columns on a subset of rows selected by some conditions in a pandas dataframe?
For example given the input data:
import pandas as pd
dat = pd.DataFrame({"y": ("441912", "abc", "121", "4455")})
dat['leny'] = dat['y'].str.len()
dat['yfoo'] = None
dat
y leny yfoo
1: 441912 6 NA
2: abc 3 NA
3: 121 3 NA
4: 4455 4 NA
Then subset the rows for which y starts with 44 and has a length of 4 or 5. For those rows, strip the 44 from the beginning of y, subtract 2 from leny and set yfoo to False, resulting in the following output:
y leny yfoo
1: 441912 6 NA
2: abc 3 NA
3: 121 3 NA
4: 55 2 FALSE
My attempt at doing this:
# pandas struggle follows
dat[dat.leny.isin((4, 5)) & dat.y.str.match('^44', na=False)]
What do I do next?
Create a mask:
m = dat.leny.isin((4, 5)) & dat.y.str.startswith('44')
Now, use loc and perform your operations.
dat.loc[m, 'y'] = dat.loc[m, 'y'].str[2:]
dat.loc[m, 'leny'] -= 2
dat.loc[m, 'yfoo'] = False
dat
y leny yfoo
0 441912 6 None
1 abc 3 None
2 121 3 None
3 55 2 False
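One detail worth checking on real data is missing values in y. The question's attempt passes na=False to str.match, and str.startswith accepts the same argument. The following sketch replaces one value with None (an assumption made only to show the effect) so that the mask stays strictly boolean:

```python
import pandas as pd

# Same shape as the question's data, but with one missing value in y.
dat = pd.DataFrame({"y": ["441912", "abc", None, "4455"]})
dat["leny"] = dat["y"].str.len()   # NaN for the missing row
dat["yfoo"] = None

# na=False makes missing strings count as "does not start with 44".
m = dat["leny"].isin((4, 5)) & dat["y"].str.startswith("44", na=False)

dat.loc[m, "y"] = dat.loc[m, "y"].str[2:]
dat.loc[m, "leny"] -= 2
dat.loc[m, "yfoo"] = False
print(dat)
```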
Alternatively, use a comprehension to gather the data (this starts from the original dat and needs import numpy as np):
y = dat.y.values.tolist()
dat2 = np.array([
[x[2:], len(x) - 2, False, i]
for i, x in enumerate(y)
if x.startswith('44') and (len(x) // 2 == 2)
], object)
dat.iloc[dat2[:, -1].astype(int), :] = dat2[:, :-1]
dat
y leny yfoo
0 441912 6 None
1 abc 3 None
2 121 3 None
3 55 2 False
I would like to change the format of my output for the following code.
import pandas as pd
x= pd.read_csv('x.csv')
y= pd.read_csv('y.csv')
z= pd.read_csv('z.csv')
list = pd.merge(x, y, how='left', on=['xx'])
list = pd.merge(list, z, how='left', on=['xx'])
columns_to_keep = ['yy','zz', 'uu']
list = list.set_index(['xx'])
list = list[columns_to_keep]
list = list.sort_index(axis=0, level=None, ascending=True, inplace=False,
sort_remaining=True, by=None)
with open('write.csv','w') as f:
    list.to_csv(f,header=True, index=True, index_label='xx')
from this:
id date user_id user_name
1 8/13/2007 1 a1
2 1/8/2007 2 a2
2 1/8/2007 3 a3
3 12/14/2007 4 a4
4 3/6/2008 5 a5
4 4/14/2009 6 a6
4 5/30/2008 7 a7
4 5/30/2008 8 a8
5 6/17/2007 9 a9
to this:
id date user_id user_name
1 8/13/2007 1 a1
2 1/8/2007 2;3 a2;a3
3 12/14/2007 4 a4
4 3/6/2008 5;6;7;8 a5;a6;a7;a8
5 6/17/2007 9 a9
I think the following should work on the final dataframe (list), though I would suggest not using "list" as a variable name: it is a Python built-in, and you might want to use it somewhere else. So in my code I use "df" instead of "list":
ind = sorted(set(df.index))  # unique index values
finaldf = pd.DataFrame(columns=list(df.columns))
for val in ind:
    tempDF = df.loc[[val]]  # list indexer keeps a DataFrame even for a single row
    print(tempDF)
    for i in range(tempDF.shape[0]):
        for jloc, j in enumerate(list(df.columns)):
            if i != 0 and j != 'date':
                finaldf.loc[val, j] += (";" + str(tempDF.iloc[i, jloc]))
            elif i == 0:
                finaldf.loc[val, j] = str(tempDF.iloc[i, jloc])
print(finaldf)
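The same result can probably be reached more directly with groupby and agg. The sketch below rebuilds the sample table from the question by hand; join_values is a hypothetical helper name. It keeps the first date per id, as the expected output does.

```python
import pandas as pd

# Sample data from the question, indexed by id.
df = pd.DataFrame({
    "id": [1, 2, 2, 3, 4, 4, 4, 4, 5],
    "date": ["8/13/2007", "1/8/2007", "1/8/2007", "12/14/2007", "3/6/2008",
             "4/14/2009", "5/30/2008", "5/30/2008", "6/17/2007"],
    "user_id": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "user_name": ["a1", "a2", "a3", "a4", "a5", "a6", "a7", "a8", "a9"],
}).set_index("id")

def join_values(s):
    """Join all values of a group into one ';'-separated string."""
    return ";".join(s.astype(str))

# One row per id: first date, joined user ids and names.
out = df.groupby(level=0).agg({
    "date": "first",
    "user_id": join_values,
    "user_name": join_values,
})
print(out)
```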