Any efficient way to build a regression model on panel data? - python

I have two-dimensional data covering the frequency of certain crime types in different regions and the corresponding house prices over the years. I want to understand the possible association between crime frequency in a region and fluctuations in its house prices. Initially I tried linear regression, but it didn't work well. Now I want to try PCA on my data, but I still can't extract meaningful results from it. How can I perform an effective PCA on panel data for the purpose of regression? Is there an efficient workaround to make this happen? Thanks.
Data:
Because my data has quite a few dimensions, it is difficult to build a fully reproducible example here, so here is what the panel data looks like:
Here is a cloud link where you can browse the input panel data: example data snippet.
Update: my attempt
Since @flyingmeatball pointed out that using PCA is not a good idea, I tried simple linear regression, but it didn't help me capture the relation between crime frequencies and house prices. Here is what I did:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
import re
import urllib.request
import numpy as np
import pandas as pd
# download data from the cloud link
u = "https://filebin.net/ml0sjn455gr8pvh3/crime_realEstate?t=7dkm15wq"
urllib.request.urlretrieve(u, "crime_realEstate.csv")
# or just manually download the data first and read it
crime_realEstate = pd.read_csv('crime_realEstate.csv')
# keep only the 2012 columns and normalize area names
cols_2012 = crime_realEstate.filter(regex='_2012').columns
crime_realEstate['Area_Name'] = crime_realEstate['Area_Name'].apply(lambda x: re.sub(' ', '_', str(x)))
regDF_2012 = crime_realEstate[cols_2012]
regDF_2012 = regDF_2012.assign(community_code=crime_realEstate['community_area'])
regDF_2012.dropna(inplace=True)
# features / target for 2012
X_feats = regDF_2012.drop(['Avg_Price_2012'], axis=1)
y_label = regDF_2012['Avg_Price_2012'].values
poly = PolynomialFeatures(degree=2)
sc_y = StandardScaler()
X = poly.fit_transform(X_feats)
y = sc_y.fit_transform(y_label.reshape(-1, 1)).flatten()
# note: np.log produces NaN / -inf for non-positive values, so these transforms are questionable here
X = np.log(X)
y = np.log(y)
regModel = LinearRegression()
regModel.fit(X, y)
The above code doesn't help me, because I want to see which features contributed to house price fluctuations over the years. Any thoughts on how to make this happen?
Goal:
What I am trying to achieve is a model that explains the dynamics between crime frequency in a region and the respective house price fluctuations. Is there an efficient way to make this happen?
Update:
If PCA is not a good idea, is there a regression model that can capture the relation between crime frequencies in a community area and house price fluctuations? Any ideas?

A couple thoughts:
1) Please post complete code. I don't see where crime_realEstate is defined anywhere. If you leave out the line where you read your data into that variable, it makes it really hard to reproduce your error, and you're less likely to get help. Also, you should organize all of your import statements so they are at the top of your code. It isn't really a functional thing, more of a convention that everyone expects and that makes the code easier to read.
2) When you reference panel data, are you really talking about a pandas DataFrame? That is sort of the "typical" way to store this kind of stuff for analysis. You may want to get in the habit of referring to data as dataframes so it's clearer to your audience. You should also post the full error traceback so we can see what line of code is bombing exactly.
3) I think you may be misunderstanding PCA, or at least what it is for. PCA (principal component analysis) is a data transformation method, where you capture variation that is spread across multiple variables and restate the data as fewer components that capture the same amount (or less, depending on how many components you keep) of variability. Once you run PCA, you won't be able to see which features are contributing to crime, because they will be replaced by totally new components (see the short sketch below these points). If it is important to identify the features that are correlated with crime, then PCA is a bad idea.
Please fix items above.
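To make point 3 concrete, here is a minimal sketch (on made-up standardized features, not your actual data) showing that after PCA you are left with components rather than the original columns; the closest you can get to interpreting them is the loading matrix in pca.components_:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
# toy standardized feature matrix standing in for the crime columns
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.normal(size=(100, 4)),
                 columns=['theft', 'assault', 'burglary', 'robbery'])
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)  # shape (100, 2): the original columns are gone
# the only link back to the original features is the loading matrix
loadings = pd.DataFrame(pca.components_, columns=X.columns, index=['PC1', 'PC2'])
print(loadings)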
EDIT
I'm not saying PCA is wrong, I'm just saying that the question you asked above ("how do I apply PCA and why is my code bombing"), isn't really the right question. PCA should be used if you think that you have many correlated variables that need to be reduced to a lower level of dimensionality. I wouldn't start there though - see what kind of accuracy you can get without doing that. You've now reformulated to a much broader question of "how do I make a predictive model for this data, preferably using a regression?", which should probably go to https://datascience.stackexchange.com/ instead, but I'll give you a starting point of how I would approach coding that solution.
First - PCA is probably not the ideal starting point because from just looking at the data/columns, your problem isn't dimensionality. You basically have 10 different crimes over 5 years. You also only have 58 different rows...or is that just the sample data? Also, your data is a bit weird - you have the same prices for multiple rows, but the crimes differ. I can't tell if it's just because you're posting sample data. If this is indeed the full dataset, stop your analysis now and get more data/go do something else.
I made some executive decisions about how I would approach the problem; all of these are just for demonstrating how to code the regression. I summed crime across all years (maybe you want the average? the highest? the change? those are all design decisions for you). My target metric was the change in price from 2012 to 2016, the timeframe for which you have crime data. I normalized crime counts by type of crime and didn't scale the target variable.
Here's how I would start:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
import pandas as pd
# Load data
filePath = 'L:\\crime_realEstate.txt'
crime_df = pd.read_csv(filePath, sep='\t').drop(['Unnamed: 0', 'community_area'], axis=1)
# calculate price change between 2016 and 2012 - same timeframe you have crime data
crime_df['price_change'] = crime_df['Avg_Price_2016'] - crime_df['Avg_Price_2012']
crime_df.drop(['Avg_Price_2012', 'Avg_Price_2013', 'Avg_Price_2014', 'Avg_Price_2015',
               'Avg_Price_2016', 'Avg_Price_2017', 'Avg_Price_2018', 'Avg_Price_2019'],
              axis=1, inplace=True)
# split the year suffix out of the column names
crime_df.columns = pd.MultiIndex.from_tuples(
    [(x.split('_20')[1] if '_20' in x else x, x.split('_20')[0]) for x in crime_df.columns])
# sum across years for the crime fields
crime_df = crime_df.groupby(level=[1], axis=1).sum()
# split out the target variable
price_growth = crime_df['price_change']
# create dummy variables from the area name
dummy_df = pd.get_dummies(crime_df['Area_Name'])
crime_df.drop(['Area_Name', 'price_change'], axis=1, inplace=True)
# scale the crime variables
scaler = StandardScaler()
crime_df[crime_df.columns] = scaler.fit_transform(crime_df)
crime_df = pd.merge(crime_df, dummy_df, left_index=True, right_index=True)
regModel = LinearRegression()
# split into training and testing sets
train_df = crime_df.sample(frac=0.8, random_state=200)
test_df = crime_df.drop(train_df.index)
regModel.fit(train_df, price_growth[train_df.index])
# R2 on the held-out rows
r2_score(price_growth.drop(train_df.index), regModel.predict(test_df))
# 0.7355837132941521
Simpler answer to your analysis: wherever the white people live in Chicago, the property is expensive.
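Since the original question was about which features drive the price change, a hedged follow-up (reusing the regModel and train_df names from the sketch above): you can pair the fitted coefficients with the column names to see which scaled crime counts carry the most weight.
import pandas as pd
# map the fitted coefficients back to the (scaled) feature names; because the
# crime columns were standardized, their magnitudes are roughly comparable
coefs = pd.Series(regModel.coef_, index=train_df.columns)
print(coefs.reindex(coefs.abs().sort_values(ascending=False).index).head(10))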

I took a look at your data. Here are my two cents on a few preprocessing steps:
Rearrange the data so that Y is Price_For_Area_Year, i.e. one row per area and year; your first record would then be split into one row per year (a rough sketch of this step and the next follows below).
One-hot encode the area / area code.
Impute missing values using some standard method.
Take care of multicollinearity, using PCA etc.; the independent variables have high correlation.
I think you should then get some meaningful linear correlation. If not, try transforming some of the variables into ranks. Do share how that works out.
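A minimal sketch of the first two steps (wide-to-long reshaping plus one-hot encoding), assuming column names in the style of the posted data (Avg_Price_2012, theft_2012, ...), which may differ from the actual file:
import pandas as pd
# hypothetical wide-format row in the style of the posted data
wide = pd.DataFrame({
    'Area_Name': ['Rogers_Park'],
    'theft_2012': [120], 'theft_2013': [98],
    'Avg_Price_2012': [210000.0], 'Avg_Price_2013': [215000.0],
})
# wide -> long: one row per (area, year)
long_df = pd.wide_to_long(wide, stubnames=['theft', 'Avg_Price'],
                          i='Area_Name', j='year', sep='_').reset_index()
# one-hot encode the area so it can enter a regression
long_df = pd.get_dummies(long_df, columns=['Area_Name'])
print(long_df)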

Related

When to apply Principal Component Analysis (PCA) [closed]

When do I apply PCA: before or after preprocessing (i.e., removing null values, encoding, etc.) the entire dataset? After I've completely preprocessed my dataset,
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train[:,0:14] = sc.fit_transform(x_train[:,0:14])
x_test[:,0:14] = sc.transform(x_test[:,0:14])
I'm left with a dataset of shape 113126 x 91.
Applying PCA is better on scaled data, because otherwise you face the "large vs. tiny" problem between features.
The "large vs. tiny" problem means that the variances of the features differ. For example, one feature may have a range of (-5, +5) while another lies in the range of (-10000, +10000); features with larger values can dominate the process.
PCA is a dimensionality reduction technique used to reduce the dimensionality of large data sets by transforming a large collection of variables into a smaller one that still contains most of the information in the original set. To reduce dimensions, PCA takes the eigenvectors with the highest eigenvalues and maps your data points onto those vectors; hence dimensionality is reduced.
Let me give you an example of how applying PCA after scaling helps.
First, import the things we will be using for this example:
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale, normalize
import matplotlib.pyplot as plt
# For reproducibility
np.random.seed(123)
Let me make a dummy data set on which we will see the effect of applying PCA before and after scaling.
rows = 100
features = 7
X = np.random.normal(size=[rows, features])
X = np.append(X, 3*np.random.choice(2, size = [rows,1]), axis = 1)
A dummy dataset is created in variable X: 100 examples with 7 standard-normal features, plus an appended eighth column that takes the values 0 or 3 and therefore has a larger spread than the others. Now let's apply PCA to it without scaling and plot the data.
pca = PCA(2)
low_x = pca.fit_transform(X)
plt.scatter(low_x[:,0], low_x[:,1])
Here is a plot of the data after reducing the number of features from 8 to 2 without scaling the dataset. You can see that the data points are bunched together and messy, and one component has a much higher variance than the other. For further processing or modeling, this will affect the results.
Let's apply feature scaling first and then apply PCA to the dataset.
X_normalized = normalize(X)
# note: normalize() rescales each sample (row) to unit norm; for per-feature
# standardization as described above, scale() or StandardScaler could be used instead
pca = PCA(2)
low_x = pca.fit_transform(X_normalized)
plt.scatter(low_x[:,0], low_x[:,1])
In the resulting plot the data is clearer and more evenly scattered; there is no big difference between the variances of the two components.
Hence, it is always better to apply normalization (or feature scaling) before applying PCA to a dataset.
But always remember one thing: data science is mostly trial and error. Try this, and if it doesn't help your results, you can always try a different approach.
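As a hedged follow-up to the example above, you can also quantify the effect instead of eyeballing the plots, by comparing explained_variance_ratio_ with and without the rescaling step (reusing the X and X_normalized arrays defined above):
# variance captured by the first two components, before vs. after rescaling;
# without rescaling, the appended larger-scale column tends to dominate PC1
print(PCA(2).fit(X).explained_variance_ratio_)
print(PCA(2).fit(X_normalized).explained_variance_ratio_)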

Techniques for analyzing clusters after performing k-means clustering on dataset

I was recently introduced to clustering techniques because I was given the task of finding "profiles" or "patterns" of professors at my university based on a survey they had to answer. I've been studying some of the available options for this and came across the k-means clustering algorithm. Since most of my data is categorical, I had to perform one-hot encoding (transforming each categorical variable into 0-1 single-column vectors), and right after that I did a correlation analysis in Excel in order to exclude some redundant variables. After this I used Python with the pandas, numpy, matplotlib and sklearn libraries to check the optimal cluster number (elbow method) and then, finally, run k-means.
This is the code I used to import the .csv with the data from the professors' survey and to run the elbow method:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# loads the .csv into a dataframe (df)
df = pd.read_csv('./dados_selecionados.csv', sep=",")
# prints the df
print(df)
# list for the sums of squared distances
SQD = []
# maximum cluster number to test in the elbow method
num_clusters = 10
# runs k-means for each cluster number
for k in range(1, num_clusters + 1):
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(df)
    SQD.append(kmeans.inertia_)
# sets up the plot and shows it
plt.figure(figsize=(16, 8))
plt.plot(range(1, num_clusters + 1), SQD, 'bx-')
plt.xlabel('Number of clusters')
plt.ylabel('Sum of squared distances from each point to its cluster center')
plt.title('Elbow method')
plt.show()
According to the figure I decided to go with 3 clusters. After that I ran k-means for 3 clusters and wrote the cluster data to a .xlsx with the following code:
# runs k-means with 3 clusters
kmeans = KMeans(n_clusters=3, max_iter=100, verbose=2)
kmeans.fit(df)
clusters = kmeans.fit_predict(df)
# list to store each row's cluster label
cluster_dict = []
for c in clusters:
    cluster_dict.append(c)
# prints the cluster labels
print(cluster_dict)
# adds the cluster information as a column in the df
df['cluster'] = cluster_dict
# saves the df as a .xlsx
df.to_excel("3_clusters_k_means_selecionado.xlsx")
# shows the resulting df
print(df)
# shows the first rows of each separate cluster
for c in sorted(set(clusters)):
    print(df[df['cluster'] == c].head(10))
My main doubt right now is how to perform a reasonable analysis of each cluster's data to understand how the clusters were formed. I began by taking means of each variable and using conditional formatting in Excel to see if some patterns would show up, and they kind of did, but I don't think this is the best option (a rough pandas version of that profiling is sketched below).
I'm also going to use this post to ask for any recommendations on the whole method; maybe some of the steps I took were not the best.
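For reference, here is a minimal pandas sketch of the per-cluster profiling I described doing in Excel, assuming df already carries the cluster column added above:
# mean of every (one-hot encoded) survey variable within each cluster
cluster_profiles = df.groupby('cluster').mean()
print(cluster_profiles)
# variables whose means differ most across clusters are the ones that
# best characterize each "profile"
spread = cluster_profiles.max() - cluster_profiles.min()
print(spread.sort_values(ascending=False).head(10))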
If you're using scikit-learn's KMeans, there is a parameter called n_init, which is the number of times the k-means algorithm will be run with different centroid seeds. By default it is set to 10, so essentially it does 10 different runs and outputs the single result with the lowest sum of squares. Another parameter you could mess around with is random_state, which is a seed used to initialize the centroids randomly. This may give you better reproducibility because you choose the seed, so if you see an optimal result you know which seed corresponds to it (see the short sketch below).
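A short sketch of those two parameters in use (df is assumed to be the one-hot encoded survey dataframe from the question):
from sklearn.cluster import KMeans
# 25 random restarts instead of the default 10, with a fixed seed so the
# best-of-25 result is reproducible
kmeans = KMeans(n_clusters=3, n_init=25, random_state=42)
labels = kmeans.fit_predict(df)
print(kmeans.inertia_)  # sum of squared distances of the best run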
You may want to consider testing several different clustering algos. Here is a list of some of the popular ones.
https://scikit-learn.org/stable/modules/clustering.html
I think there are over 100 different clustering algos out there now.
Also, some clustering algos will automatically select the optimal number of clusters for you, so you don't have to 'guess'. I say 'guess' because the silhouette and elbow techniques will help quantify the choice of K for you, but you yourself still need to do some kind of guesswork (a silhouette-based sketch follows below).
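For completeness, a hedged sketch of using the silhouette score alongside the elbow method to compare candidate values of K (again assuming the encoded survey dataframe df):
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# average silhouette for a range of candidate K values; higher is better,
# but treat it as guidance rather than ground truth
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(df)
    print(k, silhouette_score(df, labels))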

Feature agglomeration: How to retrieve the features that make up the clusters?

I am using scikit-learn's FeatureAgglomeration to run a hierarchical clustering procedure on features rather than on the observations.
This is my code:
from sklearn import cluster
import pandas as pd
#load the data
df = pd.read_csv('C:/Documents/data.csv')
agglo = cluster.FeatureAgglomeration(n_clusters=5)
agglo.fit(df)
df_reduced = agglo.transform(df)
My original df had the shape (990, 15); after using feature agglomeration, df_reduced has the shape (990, 5).
How do I now find out how the original 15 features have been clustered together? In other words, which original features from df make up each of the 5 new features in df_reduced?
How the features within each cluster are combined during transform is determined by the way you perform the hierarchical clustering. The reduced feature set simply consists of the n_clusters cluster centers (which are n_samples-dimensional vectors). For certain applications you might compute the centers manually using a different definition of the cluster center (e.g. the median instead of the mean, to avoid the influence of outliers, etc.).
import numpy as np
n_features = 15
n_clusters = 5  # as used when fitting FeatureAgglomeration above
feature_identifier = np.arange(n_features)
feature_groups = [feature_identifier[agglo.labels_ == i] for i in range(n_clusters)]
new_features = [df.loc[:, df.keys()[group]].mean(0) for group in feature_groups]
Don't forget to standardize the features beforehand (for example using sklearn's StandardScaler). Otherwise you end up grouping the scales of the quantities rather than clustering similar behavior.
Hope that helps!
Haven't tested the code. Let me know if there are problems.
After fitting the clusterer, agglo.labels_ contains an array that tells, for each feature in the original dataset, which cluster in the reduced dataset it belongs to.
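A small sketch, assuming the fitted agglo and the dataframe df from the question, that turns agglo.labels_ into a readable cluster-to-feature mapping:
import pandas as pd
# group the original column names by the cluster they were assigned to
mapping = pd.Series(df.columns, index=agglo.labels_).groupby(level=0).apply(list)
print(mapping)  # e.g. cluster 0 -> list of original features merged into reduced feature 0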

PYMC3 Bayesian Prediction Cones

I'm still learning PYMC3, but I cannot find anything on the following problem in the docs. Consider the Bayesian Structure Time Series (BSTS) model from this question with no seasonality. This can be modeled in PYMC3 as follows:
import pymc3, numpy, matplotlib.pyplot
# generate some test data
t = numpy.linspace(0,2*numpy.pi,100)
y_full = numpy.cos(5*t)
y_train = y_full[:90]
y_test = y_full[90:]
# specify the model
with pymc3.Model() as model:
    grw = pymc3.GaussianRandomWalk('grw', mu=0, sd=1, shape=y_train.size)
    y = pymc3.Normal('y', mu=grw, sd=1, observed=y_train)
    trace = pymc3.sample(1000)
y_mean_pred = pymc3.sample_ppc(trace,samples=1000,model=model)['y'].mean(axis=0)
fig = matplotlib.pyplot.figure(dpi=100)
ax = fig.add_subplot(111)
ax.plot(t,y_full,c='b')
ax.plot(t[:90],y_mean_pred,c='r')
matplotlib.pyplot.show()
Now I would like to predict the behavior for the next 10 time steps, i.e., y_test. I would also like to include credible regions over this area to produce a Bayesian cone, e.g., see here. Unfortunately the mechanism for producing the cones in the aforementioned link is a little vague. In a more conventional AR model one could learn the mean regression coefficients and manually extend the mean curve. However, in this BSTS model there is no obvious way to do this. Alternatively, if there were regressors, then I could use a theano.shared and update it with a finer/extended grid to impute and extrapolate with sample_ppc, but that's not really an option in this setting. Perhaps sample_ppc is a red herring here, but it's unclear how else to proceed. Any help would be welcome.
I think the following works. However, it's super clunky and requires that I know how far in advance I want to predict before I train (in particular, it precludes streaming usage or simple EDA). I suspect there is a better way, and I would much rather accept a better solution from someone with more PyMC3 experience.
import numpy, pymc3, matplotlib.pyplot, seaborn
# generate some data
t = numpy.linspace(0,2*numpy.pi,100)
y_full = numpy.cos(5*t)
# mask the data that I want to predict (requires knowledge
# that one might not always have at training time).
cutoff_idx = 80
y_obs = numpy.ma.MaskedArray(y_full,numpy.arange(t.size)>cutoff_idx)
# specify and train the model, used the masked array to supply only
# the observed data
with pymc3.Model() as model:
    grw = pymc3.GaussianRandomWalk('grw', mu=0, sd=1, shape=y_obs.size)
    y = pymc3.Normal('y', mu=grw, sd=1, observed=y_obs)
    trace = pymc3.sample(5000)
y_pred = pymc3.sample_ppc(trace,samples=20000,model=model)['y']
y_pred_mean = y_pred.mean(axis=0)
# compute percentiles
dfp = numpy.percentile(y_pred,[2.5,25,50,70,97.5],axis=0)
# plot actual data and summary posterior information
pal = seaborn.color_palette('Purples')
fig = matplotlib.pyplot.figure(dpi=100)
ax = fig.add_subplot(111)
ax.plot(t,y_full,c='g',label='true value',alpha=0.5)
ax.plot(t,y_pred_mean,c=pal[5],label='posterior mean',alpha=0.5)
ax.plot(t,dfp[2,:],alpha=0.75,color=pal[3],label='posterior median')
ax.fill_between(t,dfp[0,:],dfp[4,:],alpha=0.5,color=pal[1],label='CR 95%')
ax.fill_between(t,dfp[1,:],dfp[3,:],alpha=0.4,color=pal[2],label='CR 50%')
ax.axvline(x=t[cutoff_idx],linestyle='--',color='r',alpha=0.25)
ax.legend()
matplotlib.pyplot.show()
This outputs the following plot, which seems like a really bad prediction, but at least the code is supplying out-of-sample values.

Does sklearn silhouette_samples depend on the order of the data?

I’ve been trying to determine the silhouette scores for each sample in a data set, which contains two different classes. However, the distribution and sample values change depending on how I’ve sorted my data ahead of time. For example, if I sort my dataframe by the class labels (0 & 1) in ascending vs descending order prior to calling silhouette_samples(), the silhouette scores change.
Can someone help me figure out what’s going on? I’d like to know
Is there a bug in my code that I'm not aware of?
Is this normal behavior of the sklearn silhouette_samples function that I'm ignorant of?
Or is this a bug in sklearn's silhouette_samples?
The effect occurs with the following code:
import numpy as np
import pandas as pd
from sklearn.metrics import silhouette_samples
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
df_myfeatures  # data frame containing features and class labels
'''data frame sorted by output labels in ascending order'''
df1 = df_myfeatures.copy().sort_values(['output_label'], ascending = True)
'''data frame sorted by output labels in descending order'''
df2 = df_myfeatures.copy().sort_values(['output_label'], ascending = False)
'''
standardize the features ahead of time since they’re on different scales
the X matrix has 26k rows (observations) and 9 columns (features)
'''
standard_scaler = StandardScaler()
X1 = standard_scaler.fit_transform(df1[cols]) #cols is just a list of columns for fitting
X2 = standard_scaler.fit_transform(df2[cols])
y1 = df1['output_label']
y2 = df2['output_label']
'''find the silhouette scores'''
ss1 = silhouette_samples(X1,y1)
ss2 = silhouette_samples(X2,y2)
'''plot the distribution'''
plt.hist(ss1, bins = np.linspace(-1,1,21), alpha = 0.3, label = 'sorted ascending')
plt.hist(ss2, bins = np.linspace(-1,1,21), alpha = 0.3, label = 'sorted descending')
plt.legend()
plt.title('distribution of silhouette scores')
Which generates the following distributions of scores:
(histogram of silhouette scores)
As you can see, the distribution of scores changes depending on the order of the data. I've verified that there isn't an issue with StandardScaler producing different results depending on the order, and that there isn't an issue with pandas sorting the data and somehow messing up the alignment of rows across columns (roughly along the lines of the check sketched below). I'm completely stumped; as far as I was aware, there shouldn't be any order effects in the calculation of silhouette scores.
Please, help me understand this behavior! Thanks!
Note: I’m running on a Windows 10 machine, using Anaconda 4.0 (64 bit), Python v3.5.1, sklearn v0.17.1 and Pandas v0.18.0
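For reference, a minimal sketch of the kind of alignment check described above (the names df1, df2, ss1 and ss2 are the ones from the code in the question); if the scores really were order-independent, realigning the two result sets on the original index should make them match:
import numpy as np
import pandas as pd
# attach the per-sample scores back to the row index of each sorted frame
s1 = pd.Series(ss1, index=df1.index)
s2 = pd.Series(ss2, index=df2.index)
# compare the scores row by row after realigning on the original index
print(np.allclose(s1.sort_index().values, s2.sort_index().values))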
