I have a dataset of users and locations they are affiliated with, binary encoded as columns.
I'd like to visualize each user on a 2-D axis based on the similarity of their affiliated locations. The closer two users are in the vector space, the more similar the sets of locations they're affiliated with.
Here is an example of what I'd like to create...where each dot represents a user and they are closer or further based on their location profile.
I am trying to think through methods to collapse the location information (many dimensions) into 2 dimensions.
The ask:
Are there any techniques that are well suited for this problem?
A few ideas so far:
1) PCA (or similar): Conduct dimensionality reduction via PCA, with an eye toward techniques that work with binary features (looking into Kernel PCA).
2) Neural Network Embedding: Apply techniques similar to word embeddings to this problem. Create an embedding layer where each user is translated into a continuous vector space (which can be reduced down to 2 dimensions).
Reproducible data below. The actual dataset is ~5k users and 50 locations, but I'd like the solution to be scalable.
import names
import pandas as pd
import numpy as np
names_list = []
for i in range(1,100):
    single_name = names.get_full_name()
    names_list.append(single_name)
df = pd.DataFrame(names_list,columns=['Names'])
df['Var1'] = np.random.randint(0,2, size=len(df))
df['Var2'] = np.random.randint(0,2, size=len(df))
df['Var100'] = np.random.randint(0,2, size=len(df))
print(df)
#sample data
Names Var1 Var2 Var100
0 Clayton Stocks 1 1 1
1 Gary Beavers 0 0 1
2 Kristal Feagin 0 1 1
3 Crystal Barb 0 0 1
4 William Wilburn 1 0 0
.. ... ... ... ...
94 Jennifer Cool 0 0 0
95 Roberta Larsen 0 0 0
96 Malcom Mosley 1 0 1
97 Hazel Wilkins 1 1 0
98 Chanell Jaremka 1 0 1
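For reference, idea 1 could be sketched roughly as follows. This assumes scikit-learn is available and uses a random binary matrix as a stand-in for the real user/location data; the names `locations` and `coords` are only illustrative, and plain PCA is just one candidate here, not necessarily the best fit for binary features.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Stand-in for the real data: 99 users x 50 binary location columns.
locations = rng.integers(0, 2, size=(99, 50))

# Project each user's location profile onto the first two principal components.
coords = PCA(n_components=2).fit_transform(locations)

print(coords.shape)
print(coords[:5])
```

Each row of `coords` could then be drawn as one dot on a scatter plot.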
I'm working on a regression model that predicts a score.
Usually, when using a scaler for normalization, for example MinMaxScaler, you keep a reference to the scaler so that you can later invert the data back to its original values.
When using tf.keras.utils.normalize, which (as far as I understand it) performs an L2 normalization, on the following data:
val target
0 1 10
1 2 20
2 3 30
3 4 40
4 5 50
You get this output:
val target
0 0.13484 0.13484
1 0.26968 0.26968
2 0.40452 0.40452
3 0.53936 0.53936
4 0.67420 0.67420
So I see no possible way to go back to the original series of 10, 20, 30, 40, 50.
Q: If I want to invert the predicted targets back to their original scale, should I normalize the scores separately using MinMaxScaler?
Neural networks generally train better when their inputs are normalized. Normalizing the inputs to the nodes of a network helps prevent the so-called vanishing (and exploding) gradients.
Batch Normalization is often used for this, but it has its own drawbacks, such as slower predictions due to the extra computation. Instead, you can use any other normalization technique, as you mentioned.
In your example, instead of normalizing both the input and the target, normalize only the input, as shown below.
Dataframe:
val target
0 1 10
1 2 20
2 3 30
3 4 40
4 5 50
Normalizing input:
df["val"] = tf.keras.utils.normalize(df["val"].values,axis=-1, order=2 )[0]
Input Normalized Dataframe:
val target
0 0.13484 10
1 0.26968 20
2 0.40452 30
3 0.53936 40
4 0.67420 50
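If you do decide to scale the target as well, one option is the approach raised in the question: fit a separate scaler on the target and keep a reference to it. The following is a minimal sketch, assuming scikit-learn's MinMaxScaler; the variable names are illustrative, and the "predictions" are simply the scaled targets standing in for real model output.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Target values from the example, as a column vector (scikit-learn expects 2-D input).
target = np.array([10, 20, 30, 40, 50], dtype=float).reshape(-1, 1)

# Fit a scaler on the target only and keep the reference to it.
target_scaler = MinMaxScaler()
scaled_target = target_scaler.fit_transform(target)

# Stand-in for model output: here we simply reuse the scaled targets.
predictions_scaled = scaled_target.copy()

# Map the predictions back to the original units.
predictions = target_scaler.inverse_transform(predictions_scaled)

print(scaled_target.ravel())
print(predictions.ravel())
```

Because the scaler stores the minimum and maximum it saw during fitting, the inverse transform recovers the original scale, which is not possible with the L2-normalized values alone.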
I'm using a simple RandomForestRegressor script to predict a target variable. I'm trying to write a new CSV based on my training / validation data to include the actual value and the predicted value. However, when I export the data, the "Predicted Values" column is missing about half the values, and the values that do show up don't correlate well with the features / actual values. It seems like the values are randomized and then assigned to the first half of the rows.
To test, I've tried not splitting the data between validation and training data in the first place. I'm still finding the same problem.
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
#file path
My_File_Path = "//path.csv"
#read the file
My_Data = pd.read_csv(My_File_Path)
#drop the null values
My_Data = My_Data.dropna(axis=0)
#define the target variable
y = My_Data.Annualized_2018_Payments
my_features = ['feature1','feature2','feature3']
#define the features
x = My_Data[my_features]
#set the split data
train_x, val_x, train_y, val_y = train_test_split(x, y, random_state = 1)
forest_model = RandomForestRegressor(random_state = 1)
forest_model.fit(train_x, train_y)
WA_My_preds = forest_model.predict(val_x)
print("MAE for validation data is ", mean_absolute_error(val_y, WA_My_preds))
#print(My_Data.columns)
My_Data_Predicted = My_Data
#My_Data_Predicted.append(prediction_column, ignore_index = False, sort=None)
My_Data_Predicted['Predicted_Value'] = pd.DataFrame(data = forest_model.predict(My_Data_Predicted[my_features]))
print("The average predicted value is ", My_Data_Predicted['Predicted_Value'].mean())
print("The average true value is ", My_Data_Predicted['Annualized_2018_Payments'].mean())
#write to csv
My_Data_Predicted.to_csv("//path….Preds.csv")
I expect every row to have a column that reads "predicted values" with the values predicted by the random forest regressor. But the last half of the rows are missing that value.
For a short answer and resolution:
Based on testing your code, you should try this line instead:
My_Data_Predicted['Predicted_Value'] = forest_model.predict(My_Data_Predicted[my_features])
And here's why this is happening:
I tested this using my own dataset and it looks like the issue is this line:
My_Data_Predicted['Predicted_Value'] = pd.DataFrame(data = forest_model.predict(My_Data_Predicted[my_features]))
What is happening, it would seem, is that when you drop the null rows here:
My_Data = My_Data.dropna(axis=0)
you are also dropping the index labels along with the rows, which is not wrong, but important for your issue. To test this, compare My_Data_Predicted.index.max() (the highest index label) with My_Data_Predicted.shape[0] (the number of rows), and you will see that some index labels are skipped.
The reason this is a problem is that, by wrapping your predictions in a DataFrame instead of assigning the raw array, you make pandas align the new data on the index. The original dataframe has a higher max index with some gaps, while the new predictions dataframe has a sequential index, so some of your predictions are dropped during alignment and the rows whose index labels have no match get NaN.
Here is a simplified example of what's going on (pay attention to the indexes):
My_Data_Predicted      predictions     My_Data_Predicted (merged)
index  a  b  c         index  d        index  a  b  c  d
0      1  4  3         0      1        0      1  4  3  1
3      3  2  7         1      2        3      3  2  7  4
4      2  2  2         2      3        4      2  2  2  5
6      4  3  5         3      4        6      4  3  5  NaN
8      6  2  1         4      5        8      6  2  1  NaN
Notice that in the merged dataframe the last two are NaN because there is no index 6 or 8 in the predictions dataframe.
All of this should be resolved by passing in the result of the prediction directly, like this:
My_Data_Predicted['Predicted_Value'] = forest_model.predict(My_Data_Predicted[my_features])
since the result is a NumPy array, which pandas assigns by position instead of aligning on the index.
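The difference can be reproduced on a tiny frame. This is only a sketch: `frame` and `preds` are made-up stand-ins for the real data and for the output of `forest_model.predict(...)`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose index has gaps, as it would after dropna().
frame = pd.DataFrame({"a": [1, 3, 2, 4, 6]}, index=[0, 3, 4, 6, 8])
preds = np.array([1, 2, 3, 4, 5])  # stand-in for model.predict(...)

# Wrapping in a DataFrame gives the predictions a fresh 0..4 index,
# so the assignment aligns on labels and unmatched rows become NaN.
frame["aligned"] = pd.DataFrame(data=preds)

# Assigning the raw array is positional: every row gets a value.
frame["positional"] = preds

print(frame)
```

The `aligned` column is missing values for index labels 6 and 8, while the `positional` column is fully populated.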
Regression algorithms seem to be working on features represented as numbers.
For example:
This data set doesn't contain categorical features/variables. It's quite clear how to do regression on this data and predict price.
But now I want to do a regression analysis on data that contain categorical features:
There are 5 features: District, Condition, Material, Security, Type
How can I do a regression on this data? Do I have to transform all the string/categorical data to numbers manually? That is, do I have to create some encoding rules and transform all the data to numeric values according to those rules?
Is there any simple way to transform string data to numbers without having to create my own encoding rules manually? Maybe there are some libraries in Python that can be used for that? Are there some risks that the regression model will be somehow incorrect due to "bad encoding"?
Yes, you will have to convert everything to numbers. That requires thinking about what these attributes represent.
Usually there are three possibilities:
One-Hot encoding for categorical data
Numbers that preserve the order for ordinal data
Use something like group means for categorical data (e.g. mean prices for city districts).
You have to be careful not to introduce information that you will not have when the model is actually applied.
One hot encoding
If you have categorical data, you can create dummy variables with 0/1 values for each possible value.
E. g.
idx color
0 blue
1 green
2 green
3 red
to
idx blue green red
0 1 0 0
1 0 1 0
2 0 1 0
3 0 0 1
This can easily be done with pandas:
import pandas as pd
data = pd.DataFrame({'color': ['blue', 'green', 'green', 'red']})
print(pd.get_dummies(data, dtype=int))  # dtype=int gives 0/1 columns; newer pandas defaults to True/False
will result in:
color_blue color_green color_red
0 1 0 0
1 0 1 0
2 0 1 0
3 0 0 1
Numbers for ordinal data
Create a mapping of your sortable categories, e. g.
old < renovated < new → 0, 1, 2
This is also possible with pandas:
data = pd.DataFrame({'q': ['old', 'new', 'new', 'ren']})
data['q'] = data['q'].astype('category')
data['q'] = data['q'].cat.reorder_categories(['old', 'ren', 'new'], ordered=True)
data['q'] = data['q'].cat.codes
print(data['q'])
Result:
0 0
1 2
2 2
3 1
Name: q, dtype: int8
Using categorical data for groupby operations
You could use the mean for each category over past (known) events.
Say you have a DataFrame with the last known mean prices for cities:
prices = pd.DataFrame({
    'city': ['A', 'A', 'A', 'B', 'B', 'C'],
    'price': [1, 1, 1, 2, 2, 3],
})
mean_price = prices.groupby('city').mean()
data = pd.DataFrame({'city': ['A', 'B', 'C', 'A', 'B', 'A']})
print(data.merge(mean_price, on='city', how='left'))
Result:
city price
0 A 1.0
1 B 2.0
2 C 3.0
3 A 1.0
4 B 2.0
5 A 1.0
In linear regression with categorical variables you should be careful of the Dummy Variable Trap. The Dummy Variable Trap is a scenario in which the independent variables are multicollinear, that is, two or more variables are highly correlated; in simple terms, one variable can be predicted from the others. This can make the model singular, meaning your model just won't work. Read about it here
The idea is to use dummy variable encoding with drop_first=True, which omits one column from each category after converting the categorical variables into dummy/indicator variables. You WILL NOT lose any relevant information by doing that, simply because every point in the dataset can still be fully explained by the rest of the features.
Here is complete code on how you can do it for your housing dataset
So you have categorical features:
District, Condition, Material, Security, Type
And one numerical feature that you are trying to predict:
Price
First you need to split your initial dataset into input variables and the prediction target; assuming it's a pandas DataFrame, it would look like this:
Input variables:
X = housing[['District','Condition','Material','Security','Type']]
Prediction:
Y = housing['Price']
Convert categorical variable into dummy/indicator variables and drop one in each category:
X = pd.get_dummies(data=X, drop_first=True)
So now if you check the shape of X with drop_first=True, you will see that it has 5 fewer columns, one for each of your categorical variables.
You can now continue to use them in your linear model. For scikit-learn implementation it could look like this:
from sklearn import linear_model
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = .20, random_state = 40)
regr = linear_model.LinearRegression() # Do not use fit_intercept = False if you have removed 1 column after dummy encoding
regr.fit(X_train, Y_train)
predicted = regr.predict(X_test)
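The effect of drop_first=True can be checked on a small example. The sketch below uses a hypothetical two-variable `housing` frame with made-up values (the real dataset is not shown here), so only the column counts are meaningful.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data; the values are made up purely for illustration.
housing = pd.DataFrame({
    "District": ["North", "South", "East", "North", "South", "East"],
    "Condition": ["old", "new", "new", "old", "new", "old"],
    "Price": [100, 180, 160, 110, 175, 120],
})

X_raw = housing[["District", "Condition"]]
Y = housing["Price"]

full = pd.get_dummies(X_raw, dtype=int)                      # one column per level
reduced = pd.get_dummies(X_raw, drop_first=True, dtype=int)  # one level dropped per variable

# Exactly one column is dropped for each categorical variable.
print(full.shape[1] - reduced.shape[1])

# Keep the intercept so that the dropped levels act as the baseline.
model = LinearRegression().fit(reduced, Y)
print(dict(zip(reduced.columns, model.coef_.round(2))))
```

Each remaining coefficient is then read relative to the dropped level of its variable.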
You can use "Dummy Coding" in this case.
There are Python libraries for dummy coding; you have a few options:
You may use the scikit-learn library. Take a look here.
Or, if you are working with pandas, it has a built-in function to create dummy variables.
An example with pandas is below:
import pandas as pd
sample_data = [[1,2,'a'],[3,4,'b'],[5,6,'c'],[7,8,'b']]
df = pd.DataFrame(sample_data, columns=['numeric1','numeric2','categorical'])
dummies = pd.get_dummies(df.categorical)
df.join(dummies)
One way to do regression with categorical variables as independent variables is, as mentioned above, to use encoding.
Another way is to use an R-like statistical formula with the statsmodels library. Here is a code snippet:
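import seaborn as sns
from statsmodels.formula.api import ols
tips = sns.load_dataset("tips")
# C(...) marks a column as categorical; statsmodels dummy-codes it automatically
model = ols('tip ~ total_bill + C(sex) + C(day) + size', data=tips)
fitted_model = model.fit()
print(fitted_model.summary())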
Dataset
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
Summary of regression
I'm trying to predict Para using Cols. My data is in this format:
Record ID Para Col2 Col3
1 A x a
1 A x b
2 B y a
2 B y b
1 A z c
1 C x a
So far, I have tried applying One Hot Encoding (OHE) and running algorithms on the following transformed data:
Record Para a b c x y z
1 A 1 1 1 1 0 1
1 C 1 1 1 1 0 1
2 B 1 1 0 0 1 0
The accuracy has been poor, with a high of 27% using Logistic Regression. I also tried kNN, Random Forest, and Decision Tree.
Next, I tried encoding the Cols to ordinal variables and then reran the algorithms (except Logistic Regression). Similarly poor results.
Am I doing something incorrectly? How can I improve the accuracy?
The raw data is 249681 rows × 9 columns. Both outcome and predictor columns are categorical. When doing OHE, the data is 5534 rows × 865 columns.
One thing that I'd like to try is Naive Bayes, which calculates P(Outcome|Predictor) and then assigns the outcome with the highest probability. Is that a reasonable approach to take?
If your categories are mutually exclusive, you should probably take a look at Softmax Regression:
Softmax regression (or multinomial logistic regression) is a generalization of logistic regression to the case where we want to handle multiple classes. In logistic regression we assumed that the labels were binary: y(i)∈{0,1}. We used such a classifier to distinguish between two kinds of hand-written digits. Softmax regression allows us to handle y(i)∈{1,…,K} where K is the number of classes.
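As a rough sketch of what this could look like in practice: in recent scikit-learn versions, LogisticRegression with its default solver fits a multinomial (softmax) model when the target has more than two classes. The toy data below is hypothetical and only mimics the shape of the data in the question; the names `raw`, `clf`, and `proba` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: categorical predictors and a 3-class outcome.
raw = pd.DataFrame({
    "Col2": ["x", "x", "y", "y", "z", "z", "x", "y", "z"],
    "Col3": ["a", "b", "a", "b", "c", "c", "a", "b", "c"],
    "Para": ["A", "A", "B", "B", "C", "C", "A", "B", "C"],
})

# One-hot encode the predictors; the outcome stays as class labels.
X = pd.get_dummies(raw[["Col2", "Col3"]], dtype=int)
y = raw["Para"]

clf = LogisticRegression(max_iter=1000).fit(X, y)

# One probability per class for each row; each row sums to 1.
proba = clf.predict_proba(X)
print(clf.classes_)
print(proba.round(3))
print(proba.sum(axis=1))
```

The predicted class for a row is the one with the highest probability, which is what `clf.predict(X)` returns.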