Machine learning random forest classifier - python

data=pd.DataFrame({'gender':['m','f','m'],'icds':[['i10'],['i20','i30'],['i40']],'med':[[1,2,4,5],[3,4,6],[5,6,7]]})
Which machine learning algorithm should I use for this type of data? My concern is the inconsistent length of the arrays in the med column, which causes problems whenever I try to pass it to a random forest classifier; the med column is basically the labels.

Yes, you are right: random forest should work, and logistic regression would also be a good choice. The issue is the inconsistent length of the data in the 'med' column. If keeping the individual values is not necessary, you can use the following functions to sum or average out the numerical data in the 'med' column's arrays:
import numpy as np
import pandas as pd

def sum_out(x):
    return np.nansum(x)

def avg_out(x):
    return np.nanmean(x)

data = pd.DataFrame({'gender': ['m','f','m'], 'icds': [['i10'],['i20','i30'],['i40']], 'med': [[1,2,4,5],[3,4,6],[5,6,7]]})
data['med_sum'] = data['med'].map(sum_out)
data['med_avg'] = data['med'].map(avg_out)
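For reference, a quick check of the aggregated columns this produces (values computed from the toy data above):
print(data[['med_sum', 'med_avg']])
# med_sum: 12, 13, 18
# med_avg: 3.0, ~4.33, 6.0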

You can actually add those meds as features, something like this:
from itertools import chain

data = pd.DataFrame({'gender': ['m','f','m'], 'icds': [['i10'],['i20','i30'],['i40']], 'med': [['xanex','isotopin'],['cz3','hicet','t-montair'],['t-montair','xanex']]})
# flatten the lists in 'med' and keep the unique medication names
all_med = list(np.unique(list(chain.from_iterable(data['med'].values))))
for med in all_med:
    med_list = []
    for i in range(len(data)):
        d = data['med'][i]
        if med in d:
            med_list.append(1)
        else:
            med_list.append(0)
    data[med] = med_list
Output:
gender icds med cz3 hicet isotopin \
0 m [i10] [xanex, isotopin] 0 0 1
1 f [i20, i30] [cz3, hicet, t-montair] 1 1 0
2 m [i40] [t-montair, xanex] 0 0 0
t-montair xanex
0 0 1
1 1 0
2 1 1
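A more compact way to build the same indicator columns, and to feed them to a classifier, is scikit-learn's MultiLabelBinarizer. This is a sketch rather than the original answer's code; the label vector y below is hypothetical, since the thread does not spell out the target:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.ensemble import RandomForestClassifier

# expand the list-valued 'med' column into 0/1 indicator columns
mlb = MultiLabelBinarizer()
med_features = pd.DataFrame(mlb.fit_transform(data['med']), columns=mlb.classes_, index=data.index)

# combine with one-hot encoded gender and fit a random forest
X = pd.concat([pd.get_dummies(data['gender']), med_features], axis=1)
y = [0, 1, 0]  # hypothetical labels, only to make the sketch runnable
clf = RandomForestClassifier(n_estimators=100).fit(X, y)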

Related

Analyzing and Visualizing Similarity with Binary Data (Dimensionality Reduction)

I have a dataset of users and locations they are affiliated with, binary encoded as columns.
I'd like to visualize each user on a 2-D axis based on the similarity of their affiliated locations. The closer they are in the vector space, the more similar they are in terms of the locations they're affiliated with.
Here is an example of what I'd like to create... where each dot represents a user, and users are closer or farther apart based on their location profile.
I am trying to think through methods to collapse the location information (many dimensions) into 2 dimensions.
The ask:
Are there any techniques that are well suited for this problem?
A few ideas so far:
1) PCA (or similar): Conduct dimensionality reduction via PCA, with an eye for techniques that work with binary features (looking into Kernel PCA).
2) Neural network embedding: Apply techniques similar to word embeddings to this problem. Create an embedding layer where each user is translated into a continuous vector space (which can then be reduced to 2 dimensions).
Reproducible data below. The actual dataset is ~5k users and 50 locations, but I'd like the solution to be scalable.
import names
import pandas as pd
import numpy as np

names_list = []
for i in range(1, 100):
    single_name = names.get_full_name()
    names_list.append(single_name)

df = pd.DataFrame(names_list, columns=['Names'])
df['Var1'] = np.random.randint(0, 2, size=len(df))
df['Var2'] = np.random.randint(0, 2, size=len(df))
df['Var100'] = np.random.randint(0, 2, size=len(df))
print(df)
#sample data
Names Var1 Var2 Var100
0 Clayton Stocks 1 1 1
1 Gary Beavers 0 0 1
2 Kristal Feagin 0 1 1
3 Crystal Barb 0 0 1
4 William Wilburn 1 0 0
.. ... ... ... ...
94 Jennifer Cool 0 0 0
95 Roberta Larsen 0 0 0
96 Malcom Mosley 1 0 1
97 Hazel Wilkins 1 1 0
98 Chanell Jaremka 1 0 1
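A minimal sketch of idea 1) on the reproducible data above, projecting the binary location columns onto two components for plotting; PCA is used here purely as the simplest baseline (Kernel PCA, MDS on Jaccard distances, or an embedding could be swapped in):
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

X = df.drop(columns=['Names']).values         # binary user-by-location matrix
coords = PCA(n_components=2).fit_transform(X)

plt.scatter(coords[:, 0], coords[:, 1])
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.title('Users positioned by location profile')
plt.show()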

I am using Google Colab, everything is up to date, and I still get this error: TypeError: drop() got an unexpected keyword argument 'axis'

I am using Google Colab and everything is up to date, but I still get this error: TypeError: drop() got an unexpected keyword argument 'axis'. What am I doing wrong? The error is coming from the last two lines of code. If I build the pandas DataFrame from an array instead, it works fine. Here's my code, followed by the error output.
import numpy as np
import pandas as pd
import io
from google.colab import files
uploaded = files.upload()
admissions = pd.read_csv(io.BytesIO(uploaded['student_data.csv']))
# Make Dummy variables for rank
data = pd.concat([admissions, pd.get_dummies(admissions['rank'], prefix='rank')], axis=1)
# Drop the column from which the dummy variables were created
data.pop('rank')
# Standardize features
for field in ['gre', 'gpa']:
    # get the mean and standard deviations
    mean, std = data[field].mean(), data[field].std()
    # get the ...
    data.loc[:, field] = (data[field] - mean) / std
# split off a random 10% of the data for testing
np.random.seed(42)
sample = np.random.choice(data.index, size=int(len(data)*0.9), replace=False)  # takes away 90% of the data
data, test_data = data.index[sample], data.drop(sample)  # removes the 90% and stores the remaining 10% into test_data
# Split into features and targets
features, targets = data.drop('admit', axis=1), data['admit'] # takes the admit column away and store the remaining in the features and the # admin in the targets
features_test, targets_test = test_data.drop('admit', axis=1), test_data['admit']
Here's the error output:
TypeError Traceback (most recent call last)
<ipython-input-1-9293df238858> in <module>()
31
32 # Split into features and targets
---> 33 features, targets = data.drop('admit', axis=1), data['admit'] # takes the admit column away and store the remaining in the features and the # admin in the targets
34 features_test, targets_test = test_data.drop('admit', axis=1), test_data['admit']
TypeError: drop() got an unexpected keyword argument 'axis'
Can you please try this alternative?
data.drop(columns=['admit'])
In general, this should return the same thing as your version, but from time to time when I have hit the same error, this form works.
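For reference, a minimal illustration (not from the thread) showing that the two calls drop the same column on a toy frame:
import pandas as pd

toy = pd.DataFrame({'admit': [0, 1], 'gre': [500, 620]})
a = toy.drop('admit', axis=1)      # positional label plus axis keyword
b = toy.drop(columns=['admit'])    # explicit columns= keyword
print(a.equals(b))                 # True: both return the frame without 'admit'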
I tried to reproduce your code this way in Colab:
import pandas as pd
import numpy as np
admissions = pd.DataFrame({"gre": [5, 6, 5, 1, 21, 5, 8],
                           "gpa": [1, 2, 2, 2, 3, 1, 4],
                           "rank": ['a', 'a', 'b', 'c', 'a', 'b', 'a'],
                           "admit": [0, 1, 1, 1, 0, 1, 1]})
test_data = pd.DataFrame({"gre": [5, 6, 5],
                          "gpa": [3, 1, 4],
                          "rank": ['a', 'b', 'a'],
                          "admit": [0, 1, 1]})
Then I copied your code from # Make Dummy variables for rank to the end, and it produces no error. I would suggest adding print(data.head()) before # Split into features and targets to see what the data looks like. In my case, this code
print(data)
# Split into features and targets
features, targets = data.drop('admit', axis=1), data['admit'] # takes the admit column away and store the remaining in the features and the # admin in the targets
features_test, targets_test = test_data.drop('admit', axis=1), test_data['admit']
print(features)
print(targets)
produces output like this:
gre gpa admit rank_a rank_b rank_c
0 -0.357384 -1.069045 0 1 0 0
1 -0.201028 -0.133631 1 1 0 0
2 -0.357384 -0.133631 1 0 1 0
3 -0.982806 -0.133631 1 0 0 1
4 2.144304 0.801784 0 1 0 0
5 -0.357384 -1.069045 1 0 1 0
6 0.111682 1.737198 1 1 0 0
gre gpa rank_a rank_b rank_c
0 -0.357384 -1.069045 1 0 0
1 -0.201028 -0.133631 1 0 0
2 -0.357384 -0.133631 0 1 0
3 -0.982806 -0.133631 0 0 1
4 2.144304 0.801784 1 0 0
5 -0.357384 -1.069045 0 1 0
6 0.111682 1.737198 1 0 0
0 0
1 1
2 1
3 1
4 0
5 1
6 1

Should I use tf.keras.utils.normalize to normalize my targets?

I'm working on a machine learning regression problem that predicts a score.
Usually, when using a scaler for normalization, for example MinMaxScaler, you keep a reference to the scaler so that later you can invert your data back to its original values.
When using tf.keras.utils.normalize, which (as far as I understand it) is an L2 normalization, for the following data:
val target
0 1 10
1 2 20
2 3 30
3 4 40
4 5 50
You get this output:
val target
0 0.13484 0.13484
1 0.26968 0.26968
2 0.40452 0.40452
3 0.53936 0.53936
4 0.67420 0.67420
So I see no way to get back to the original series of 10, 20, 30, 40, 50.
Q: If I want to invert the predicted targets back to their original scale, should I normalize the scores separately using MinMaxScaler?
Neural networks generally train better when their inputs are normalized; normalizing the inputs to the nodes in a network helps prevent the so-called vanishing (and exploding) gradients.
Generally, Batch Normalization is applied to layer inputs, but it has its own drawbacks, such as slower predictions due to the extra computation. Instead, you can use any other normalizing technique, as you mentioned.
In your example, instead of normalizing both the input and the target, normalize only the input, as shown below.
Dataframe:
val target
0 1 10
1 2 20
2 3 30
3 4 40
4 5 50
Normalizing input:
df["val"] = tf.keras.utils.normalize(df["val"].values,axis=-1, order=2 )[0]
Input Normalized Dataframe:
val target
0 0.13484 10
1 0.26968 20
2 0.40452 30
3 0.53936 40
4 0.67420 50
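If the targets do need to be scaled as well, a minimal sketch of the MinMaxScaler route the question mentions, which keeps the fitted scaler around so predictions can be mapped back with inverse_transform (variable names are illustrative):
import numpy as np
from sklearn.preprocessing import MinMaxScaler

y = np.array([10, 20, 30, 40, 50], dtype=float).reshape(-1, 1)

target_scaler = MinMaxScaler()
y_scaled = target_scaler.fit_transform(y)        # train the model on y_scaled

# after predicting, map the scores back to the original scale
y_pred_scaled = y_scaled                         # stand-in for model predictions
y_pred = target_scaler.inverse_transform(y_pred_scaled)
print(y_pred.ravel())                            # [10. 20. 30. 40. 50.]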

How do I improve model accuracy when predicting a categorical outcome using categorical predictors?

I'm trying to predict Para using Cols. My data is in this format:
Record ID Para Col2 Col3
1 A x a
1 A x b
2 B y a
2 B y b
1 A z c
1 C x a
So far, I have tried applying One Hot Encoding (OHE) and running algorithms on the following transformed data:
Record Para a b c x y z
1 A 1 1 1 1 0 1
1 C 1 1 1 1 0 1
2 B 1 1 0 0 1 0
The accuracy has been shoddy, the highest being 27% with Logistic Regression. I also tried kNN, Random Forest, and Decision Tree.
Next, I tried encoding the Cols as ordinal variables and reran the algorithms (except Logistic Regression), with similarly poor results.
Am I doing something incorrectly? How can I improve the accuracy?
The raw data is 249681 rows × 9 columns. Both outcome and predictor columns are categorical. When doing OHE, the data is 5534 rows × 865 columns.
One thing I'd like to try is Naive Bayes, which calculates P(Outcome|Predictors) and then assigns the outcome with the highest probability. Is that a reasonable approach to take?
If your categories are mutually exclusive, you should probably take a look at Softmax Regression:
Softmax regression (or multinomial logistic regression) is a generalization of logistic regression to the case where we want to handle multiple classes. In logistic regression we assumed that the labels were binary: y(i)∈{0,1}. We used such a classifier to distinguish between two kinds of hand-written digits. Softmax regression allows us to handle y(i)∈{1,…,K} where K is the number of classes.
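A minimal sketch of softmax (multinomial logistic) regression on one-hot encoded predictors with scikit-learn, using the three OHE rows from the question as toy data; it assumes a reasonably recent scikit-learn, where LogisticRegression with the lbfgs solver fits a multinomial model for more than two classes:
import pandas as pd
from sklearn.linear_model import LogisticRegression

# toy data mirroring the one-hot encoded table above
X = pd.DataFrame({'a': [1, 1, 1], 'b': [1, 1, 1], 'c': [1, 1, 0],
                  'x': [1, 1, 0], 'y': [0, 0, 1], 'z': [1, 1, 0]})
y = ['A', 'C', 'B']

clf = LogisticRegression(solver='lbfgs', max_iter=1000)
clf.fit(X, y)
print(clf.predict(X))
print(clf.predict_proba(X))   # per-class (softmax) probabilities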

Memory usage in creating Term Density Matrix from pandas dataFrame

I have a DataFrame which I save/read from a csv file, and I want to create a Term Density Matrix DataFrame from it. Following herrfz's suggestion here, I use CountVectorizer from sklearn. I wrapped that code in a function:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import coo_matrix, csc_matrix, hstack

countvec = CountVectorizer()

def df2tdm(df, titleColumn, placementColumn):
    '''
    Takes in a DataFrame with at least two columns, and returns a DataFrame with the term density matrix
    of the words appearing in the titleColumn.
    Inputs: df, a DataFrame containing titleColumn, placementColumn among other columns
    Outputs: tdm_df, a DataFrame containing placementColumn and columns with all the words appearing in df.titleColumn
    Credits:
    https://stackoverflow.com/questions/22205845/efficient-way-to-create-term-density-matrix-from-pandas-dataframe
    '''
    tdm_df = pd.DataFrame(countvec.fit_transform(df[titleColumn]).toarray(),
                          columns=countvec.get_feature_names())
    tdm_df = tdm_df.join(pd.DataFrame(df[placementColumn]))
    return tdm_df
Which returns the TDM as a DataFrame, for example:
df = pd.DataFrame({'title':['Delicious boiled egg','Fried egg ', 'Potato salad', 'Split orange','Something else'], 'page':[1, 1, 2, 3, 4]})
print(df.head())
tdm_df = df2tdm(df,'title','page')
tdm_df.head()
boiled delicious egg else fried orange potato salad something \
0 1 1 1 0 0 0 0 0 0
1 0 0 1 0 1 0 0 0 0
2 0 0 0 0 0 0 1 1 0
3 0 0 0 0 0 1 0 0 0
4 0 0 0 1 0 0 0 0 1
split page
0 0 1
1 0 1
2 0 2
3 1 3
4 0 4
This implementation suffers from bad memory scaling: When I use a DataFrame which occupies 190 kB saved as utf8, the function uses ~200 MB to create the TDM dataframe. When the csv file is 600 kB, the function uses 700 MB, and when the csv is 3.8 MB the function uses up all of my memory and swap file (8 GB) and crashes.
I also made an implementation using sparse matrices and sparse DataFrames (below), but the memory usage is pretty much the same, only it is considerably slower
def df2tdm_sparse(df, titleColumn, placementColumn):
    '''
    Takes in a DataFrame with at least two columns, and returns a DataFrame with the term density matrix
    of the words appearing in the titleColumn. This implementation uses sparse DataFrames.
    Inputs: df, a DataFrame containing titleColumn, placementColumn among other columns
    Outputs: tdm_df, a DataFrame containing placementColumn and columns with all the words appearing in df.titleColumn
    Credits:
    https://stackoverflow.com/questions/22205845/efficient-way-to-create-term-density-matrix-from-pandas-dataframe
    https://stackoverflow.com/questions/17818783/populate-a-pandas-sparsedataframe-from-a-scipy-sparse-matrix
    https://stackoverflow.com/questions/6844998/is-there-an-efficient-way-of-concatenating-scipy-sparse-matrices
    '''
    pm = df[[placementColumn]].values
    tm = countvec.fit_transform(df[titleColumn])  # .toarray()
    m = csc_matrix(hstack([pm, tm]))
    dfout = pd.SparseDataFrame([pd.SparseSeries(m[i].toarray().ravel())
                                for i in np.arange(m.shape[0])])
    dfout.columns = [placementColumn] + countvec.get_feature_names()
    return dfout
Any suggestions on how to improve memory usage? I wonder if this is related to the memory issues of scikit, e.g. here
I also think that the problem might be with the conversion from sparse matrix to sparse DataFrame.
Try this function (or something similar):
def SparseMatrixToSparseDF(xSparseMatrix):
    import numpy as np
    import pandas as pd
    def ElementsToNA(x):
        # replace zeros with NaN so the SparseSeries actually stores them sparsely
        x = x.astype(float)
        x[x == 0] = np.nan
        return x
    xdf1 = pd.SparseDataFrame([pd.SparseSeries(ElementsToNA(xSparseMatrix[i].toarray().ravel()))
                               for i in np.arange(xSparseMatrix.shape[0])])
    return xdf1
You can check that it reduces the size by looking at the density attribute:
df1.density
I hope it helps.
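Separately, a sketch (not from the thread) that avoids densifying the term matrix at all; it assumes a recent pandas (0.25+) with the DataFrame.sparse accessor and a recent scikit-learn exposing get_feature_names_out:
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

def df2tdm_sparse2(df, titleColumn, placementColumn):
    countvec = CountVectorizer()
    tm = countvec.fit_transform(df[titleColumn])      # stays sparse, no .toarray()
    pm = csr_matrix(df[[placementColumn]].values)     # placement column as a sparse column
    m = hstack([pm, tm]).tocsr()
    cols = [placementColumn] + list(countvec.get_feature_names_out())
    # build a DataFrame backed by sparse columns, with no dense intermediate
    return pd.DataFrame.sparse.from_spmatrix(m, columns=cols)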
