I have x1 = Job Level (numerical), x2 = Job Code (categorical), and y = Stock Value (numerical). For a data set of 3 x 500 I have 250 NaN values in Stock Value.
What do I need to change in my code below to read x2 as a categorical value and rerun the program to find the coefficients? Data set example
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_excel("stats.xlsx")

df_nonull = df.dropna()
X_train = df_nonull[['Job Code', 'Job Level']]
y_train = df_nonull[['Stock Value']]

X_test = df[['Job Code', 'Job Level']]
y_test = df[['Stock Value']]

regressor = LinearRegression()
model = regressor.fit(X_train, y_train)

# display coefficients
print(regressor.coef_)
This is a straightforward model-training problem. Your available training data (observations) are the rows with Stock Value present; your later "real" data are the rows without.
Categorical features are perfectly legitimate in such cases. In fact, you might try declaring Job Level as categorical as well, since it's discrete; that frees you from any assumption of linearity (although it also discards the ordering of the level codes).
Your task is to choose a model type that suits your data. This requires research and experimentation; welcome to Data Science. Since you haven't discussed your data's shape, density, connectivity, clustering, etc., there's really not much we can explore with you. Six observations on three features (note that Job Code and Job Title are not 100% coupled) are not enough for educated speculation.
Try adding some polynomial terms to your "linear" regression: perhaps a squared term and a square root for each input. That's often the first attempt for such a task.
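For the specific change asked about (reading Job Code as categorical), here is a minimal sketch using one-hot encoding with pandas; the squared and square-root terms are only the experiment suggested above, not a prescription, and the column names are taken from the question.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_excel("stats.xlsx")

# One-hot encode the categorical Job Code; Job Level stays numeric
X = pd.get_dummies(df[["Job Level", "Job Code"]], columns=["Job Code"])

# Optional polynomial terms for the numeric feature, as suggested above
X["Job Level^2"] = df["Job Level"] ** 2
X["Job Level^0.5"] = np.sqrt(df["Job Level"])

# Train only on rows where the target is present, then predict the rest
train_mask = df["Stock Value"].notna()
regressor = LinearRegression()
regressor.fit(X[train_mask], df.loc[train_mask, "Stock Value"])

print(regressor.coef_)
predicted = regressor.predict(X[~train_mask])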
I am performing regression analysis on some reasonably large vectors (for now, working with numpy and other scientific tools is OK if I leave the computer working overnight), but they will eventually grow by several factors, so I was looking to improve performance by moving the implementation to pytorch.
The regression is fairly simple. I have two matrices, predictions and betas, with dimensions (750, 6340) and (750, 4313) respectively. The least squares solution I am looking for is predictions * x = betas, where x would have dimensions (6340, 4313), but I have to account for intercepts in the regression. With numpy I solved this by iterating through the second dimension of predictions, creating a vector from each column plus a column of ones, and passing that as the first argument:
for candidate in range(predictions.shape[1]):  # each column is a candidate
    prediction = predictions[:, candidate]
    # allow for an intercept by adding a column of ones
    prediction = np.vstack([prediction, np.ones(prediction.shape[0])]).T
    sol = np.linalg.lstsq(prediction, betas, rcond=-1)
Question number 1 would be: is there a way to avoid iterating over each candidate in order to allow the least squares calculation to account for an intercept? That would improve computation time by a lot.
I tried using statsmodels.regression.linear_model.ols, which allows for this by default (you can add a -1 to the formula if you want the intercept removed), but using this approach either forces me to iterate over each candidate (using apply was appealing but didn't really improve computation time noticeably) or there is something I'm missing. Question 1.5 then: can I use this tool in such a way, or is that all there is to it?
Similarly in pytorch I would do
t_predictions = torch.tensor(predictions, dtype=torch.float)
t_betas_roi = torch.tensor(betas, dtype=torch.float)
t_sol = torch.linalg.lstsq(t_predictions, t_betas_roi)
And it's fast indeed, but I'm missing the intercept here. I reckon that doing this with numpy instead of iterating would also be much faster, but either way, if question 1 has a solution, I imagine it could be applied here similarly, right?
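(Not part of the original post, but for question 1 one possible vectorization, sketched under the assumption that each candidate is regressed independently with its own intercept, exactly as in the loop above: with a single predictor plus an intercept, the least-squares slope and intercept have a closed form in terms of centered data, which removes the Python loop entirely.)
import numpy as np

# predictions: (n_samples, n_candidates), betas: (n_samples, n_targets)
Xc = predictions - predictions.mean(axis=0)   # centered candidate columns
Yc = betas - betas.mean(axis=0)               # centered targets

# slopes[j, k] and intercepts[j, k] for candidate j regressed against target k
slopes = (Xc.T @ Yc) / (Xc ** 2).sum(axis=0)[:, None]
intercepts = betas.mean(axis=0)[None, :] - slopes * predictions.mean(axis=0)[:, None]
The same arithmetic carries over to PyTorch tensors (using dim= in place of axis=), so the loop can be avoided there as well.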
When do I apply PCA: is it after preprocessing (i.e., removing null values, encoding, etc.) the entire dataset, or before? After I've completely preprocessed my dataset,
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train[:,0:14] = sc.fit_transform(x_train[:,0:14])
x_test[:,0:14] = sc.transform(x_test[:,0:14])
I'm left with data of shape 113126 x 91.
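(For reference, a minimal sketch of how PCA would typically follow the scaling step above: fit on the training split only, then reuse that fit on the test split; n_components is just an illustrative value.)
from sklearn.decomposition import PCA

# Fit PCA on the scaled training data only, then apply the same
# projection to the test data
pca = PCA(n_components=30)
x_train_pca = pca.fit_transform(x_train)
x_test_pca = pca.transform(x_test)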
Applying PCA is better on scaled data because you won't face the "large vs. tiny" problem between features.
The "large vs. tiny" problem means that the variances of the features differ widely. For example, in a dataset one feature might have a range of (-5, +5) while another lies in the range of (-10000, +10000). Features with larger values can dominate the process.
PCA is a dimensionality reduction technique used to reduce the dimensionality of large data sets by transforming a large set of variables into a smaller one that still contains most of the information of the original set. To reduce dimensions, PCA takes the eigenvectors with the highest eigenvalues and maps your data points onto those vectors; hence the dimensionality is reduced.
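To make that concrete, here is a tiny sketch of the eigenvector view described above (illustrative only, not sklearn's implementation; in practice you would just call PCA as shown below).
import numpy as np

X = np.random.normal(size=(100, 5))
Xc = X - X.mean(axis=0)                    # center the data
cov = np.cov(Xc, rowvar=False)             # 5 x 5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: for symmetric matrices
top2 = eigvecs[:, np.argsort(eigvals)[::-1][:2]]  # two largest-eigenvalue directions
X_reduced = Xc @ top2                      # project onto those directions -> (100, 2)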
Let me give you an example of how applying PCA after scaling will be helpful.
Let me import the packages that we will be using for this example.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale, normalize
import matplotlib.pyplot as plt
# For reproducibility
np.random.seed(123)
Let me make a dummy data set on which we will see the effect of applying PCA before and after scaling.
rows = 100
features = 7
X = np.random.normal(size=[rows, features])
X = np.append(X, 3*np.random.choice(2, size = [rows,1]), axis = 1)
A dummy dataset is created in variable X with 100 examples and 8 features (7 Gaussian features plus one binary feature scaled by 3). Now let's apply PCA to it without scaling and plot the data.
pca = PCA(2)
low_x = pca.fit_transform(X)
plt.scatter(low_x[:,0], low_x[:,1])
Here is a plot of the data after reducing the number of features from 8 to 2 without scaling the dataset. You can see that the data points are bunched together and messy: one feature has a much higher variance than the others. For further processing or modeling, this will affect the results.
Let's apply feature scaling first and then apply PCA to the dataset.
X_normalized = normalize(X)
pca = PCA(2)
low_x = pca.fit_transform(X_normalized)
plt.scatter(low_x[:,0], low_x[:,1])
In the following plot, the data is clear and well scattered. There is no big difference between the variances of the two features.
Hence, it is always better to apply normalization before applying PCA to a dataset.
But always remember one thing: data science is mostly trial and error for developers. Try this; if it doesn't help your results, you can always try a different approach.
Can the dimension of the data be reduced to only one principal component?
I tried it on the iris data set-
from sklearn.decomposition import PCA
import pandas as pd
import matplotlib.pyplot as plt
pca = PCA(n_components=1)
pca_X = pca.fit_transform(X) #X = standardized iris data
pca_df = pd.DataFrame(pca_X, columns=["PCA1"])
plt.plot(pca_df["PCA1"], "o")
We can see three different clusters. So can the dimension be reduced to 1?
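(For reproducibility, X above can be built like this, assuming the standard scikit-learn iris loader:)
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)   # standardized iris features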
You can choose to reduce the dimensions to 1 using PCA; the only thing it promises is that the resulting principal component lies in the direction of highest variance in the data.
If you are reducing the dimensions in order to improve classification, you can use Linear Discriminant Analysis (LDA) instead, which gives you the direction of maximum separation between the classes, as sketched below.
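A minimal sketch of that LDA alternative on the same data (assuming y holds the iris class labels):
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# LDA is supervised: it uses the class labels y (the iris species, assumed
# available) to find the direction that best separates the classes
lda = LinearDiscriminantAnalysis(n_components=1)
lda_X = lda.fit_transform(X, y)
plt.plot(lda_X[:, 0], "o")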
Yes, the dimension can be reduced to 1, which is exactly what you have done in your example.
The y-axis in your plot shows the coordinate of each observation with respect to the first principal component.
The three clusters likely relate to the three species in the Iris dataset and have nothing to do with the number of components.
I have two-dimensional panel data: the frequency of certain crime types in certain regions and the corresponding house prices over the years. I want to understand the possible association between crime frequency in certain regions and house price fluctuation. Initially I tried to use linear regression for that, but it didn't work well. Now I want to try PCA on my data, but it is still not giving me meaningful results. How can I perform an efficient PCA analysis on panel data for the purpose of regression? Is there any efficient workaround to make this happen? Thanks.
Data:
Because my data is a bit long in terms of dimensions, it is difficult to make a reproducible example here, so let's look at how the panel data is laid out: here is a cloud link where you can browse the input panel data: example data snippet.
Update: my attempt:
Since @flyingmeatball pointed out that using PCA is not a good idea, I tried a simple linear regression, but it didn't help me capture the relation between crime frequencies and house prices. Here is what I did:
import re
import urllib.request

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# download data from the cloud
u = "https://filebin.net/ml0sjn455gr8pvh3/crime_realEstate?t=7dkm15wq"
urllib.request.urlretrieve(u, "crime_realEstate.csv")
# or just manually download the data first and read it
crime_realEstate = pd.read_csv('crime_realEstate.csv')

cols_2012 = crime_realEstate.filter(regex='_2012').columns
crime_realEstate['Area_Name'] = crime_realEstate['Area_Name'].apply(lambda x: re.sub(' ', '_', str(x)))
regDF_2012 = crime_realEstate[cols_2012]
regDF_2012 = regDF_2012.assign(community_code=crime_realEstate['community_area'])
regDF_2012.dropna(inplace=True)

X_feats = regDF_2012.drop(['Avg_Price_2012'], axis=1)
y_label = regDF_2012['Avg_Price_2012'].values

poly = PolynomialFeatures(degree=2)
sc_y = StandardScaler()
X = poly.fit_transform(X_feats)
y = sc_y.fit_transform(y_label.reshape(-1, 1)).flatten()
# note: np.log yields -inf/NaN for zero or negative values (e.g. the standardized y)
X = np.log(X)
y = np.log(y)

regModel = LinearRegression()
regModel.fit(X, y)
The above code doesn't help me, because I want to see which features contributed to house price fluctuation over the years. Any thoughts on how to make this happen?
Goal:
What I am trying to achieve is to build a model that explains the dynamics between crime frequency in certain regions and the respective house price fluctuation. Any efficient workaround to make this happen?
Update:
If PCA is not a good idea, is there any regression model that can capture the relation between crime frequencies in certain community areas and house price fluctuation? Any ideas?
A couple of thoughts:
1) Please post complete code. I don't see where crime_realEstate is defined anywhere. If you leave out the line where you read your data into that variable, it makes it really hard to reproduce your error, and you're less likely to get help. Also, you should organize all of your import statements so they are at the top of your code. It isn't really a functional thing, more of a convention that everyone expects and that makes the code easier to read.
2) When you reference panel data, are you really talking about a pandas DataFrame? That is the "typical" way to store this kind of data for analysis. You may want to get in the habit of referring to data as dataframes so it's clearer to your audience. You should also post the full error traceback so we can see exactly which line of code is failing.
3) I think you may be misunderstanding PCA, or at least what it is for. PCA (principal component analysis) is a data transformation method in which you capture variation that is spread across multiple variables and restate that data as fewer components that capture the same amount (or less, depending on how many components you keep) of variability. Once you run PCA, you won't be able to see which features are contributing to house prices, because they will be replaced by totally new components. If it is important to identify the features that are correlated with house prices, then PCA is a bad idea.
Please fix the items above.
EDIT
I'm not saying PCA is wrong, I'm just saying that the question you asked above ("how do I apply PCA and why is my code bombing"), isn't really the right question. PCA should be used if you think that you have many correlated variables that need to be reduced to a lower level of dimensionality. I wouldn't start there though - see what kind of accuracy you can get without doing that. You've now reformulated to a much broader question of "how do I make a predictive model for this data, preferably using a regression?", which should probably go to https://datascience.stackexchange.com/ instead, but I'll give you a starting point of how I would approach coding that solution.
First - PCA is probably not the ideal starting point because from just looking at the data/columns, your problem isn't dimensionality. You basically have 10 different crimes over 5 years. You also only have 58 different rows...or is that just the sample data? Also, your data is a bit weird - you have the same prices for multiple rows, but the crimes differ. I can't tell if it's just because you're posting sample data. If this is indeed the full dataset, stop your analysis now and get more data/go do something else.
I made some executive decisions about how I would approach the problem; all of these are just for demonstrating how to code the regression. I summed crime across all years (maybe you want the average? The highest? The change? Those are all design decisions for you). My metric was the change in price from 2012 to 2016, the timeframe for which you have crime data. I normalized crime counts by type of crime and didn't scale the target variable.
Here's how I would start:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
from sklearn.preprocessing import scale
import pandas as pd
# Load data
filePath = 'L:\\crime_realEstate.txt'
crime_df = pd.read_csv(filePath, sep = '\t').drop(['Unnamed: 0','community_area'],axis = 1)
#calculate price change between 2016 and 2012 - same timeframe you have crime data
crime_df['price_change'] = crime_df['Avg_Price_2016'] - crime_df['Avg_Price_2012']
crime_df.drop(['Avg_Price_2012','Avg_Price_2013','Avg_Price_2014','Avg_Price_2015','Avg_Price_2016','Avg_Price_2017','Avg_Price_2018','Avg_Price_2019'],axis = 1,inplace = True)
#split years if they are data over time
crime_df.columns = pd.MultiIndex.from_tuples([(x.split('_20')[1] if '_20' in x else x ,x.split('_20')[0]) for x in crime_df.columns])
#sum across years for crimeFields
crime_df = crime_df.groupby(level=[1],axis = 1).sum()
#split out tgt var
price_growth = crime_df['price_change']
#create dummy variable from area name
dummy_df = pd.get_dummies(crime_df['Area_Name'])
crime_df.drop(['Area_Name','price_change'],axis = 1,inplace = True)
#scales crime variables
scaler = StandardScaler()
crime_df[crime_df.columns] = scaler.fit_transform(crime_df)
crime_df = pd.merge(crime_df,dummy_df,left_index = True, right_index = True)
regModel = LinearRegression()
#split to training testing
train_df = crime_df.sample(frac=0.8,random_state=200)
test_df = crime_df.drop(train_df.index)
regModel.fit(train_df, price_growth[train_df.index])
#R2
r2_score(price_growth.drop(train_df.index),regModel.predict(test_df))
0.7355837132941521
Simpler answer to your analysis: wherever the white people live in Chicago, the property is expensive.
I took a look at your data. Here are my 2 cents on a few preprocessing steps:
You need to rearrange the data such that Y is Price_For_Area_Year; e.g., your first record transforms accordingly (see the sketch after this list).
One-hot encode the area / area code.
Impute missing values using some standard method.
Take care of multicollinearity, e.g., using PCA; the independent variables have high correlation.
I think you should then get some meaningful linear correlation. If not, try transforming some of the variables into ranks. Do share how that works out.
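A hypothetical sketch of the suggested reshaping and encoding (column names follow the Avg_Price_<year> pattern seen in the dataset; the exact design is up to you):
import pandas as pd

# Reshape so each row is one (area, year) pair with that year's price as the target
price_cols = [c for c in crime_realEstate.columns if c.startswith('Avg_Price_')]
long_df = crime_realEstate.melt(
    id_vars=['Area_Name'],
    value_vars=price_cols,
    var_name='year',
    value_name='Price_For_Area_Year',
)
long_df['year'] = long_df['year'].str.replace('Avg_Price_', '').astype(int)

# One-hot encode the area, as suggested in the second point
long_df = pd.get_dummies(long_df, columns=['Area_Name'])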
I want to fit an autoregressive model to some data stored in a dataframe, with 96 data points per day. The data is the value of solar irradiance in some region, and I know it has a 1-day seasonality. I want to obtain a simple linear model using scikit-learn's LinearRegression, and I want to specify which lagged data points to use: the last 10 data points, plus the data point with a lag of 97, which corresponds to the data point from 24 hours earlier. How can I specify the lagged coefficients that I want to use? I don't want 97 coefficients; I just want to use 11 of them: the previous 10 data points and the data point 97 positions back.
Just make a dataset X with 11 columns: [x(t-97), x(t-10), x(t-9), ..., x(t-1)]. Then the series of x(t) will be your target Y.
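A minimal sketch of that construction with pandas (assuming the series lives in df["irradiance"], a hypothetical column name):
import pandas as pd
from sklearn.linear_model import LinearRegression

# Build the lagged design matrix: the 10 most recent lags plus lag 97
lags = list(range(1, 11)) + [97]
X = pd.concat({f"lag_{k}": df["irradiance"].shift(k) for k in lags}, axis=1)
data = pd.concat([X, df["irradiance"].rename("target")], axis=1).dropna()

model = LinearRegression()
model.fit(data[X.columns], data["target"])
print(model.coef_, model.intercept_)   # 11 lag coefficients plus the intercept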