How to find two principal components with PCA in Python

I have two arrays that look like this:
GarageMudroomLights [kW] DiningRoomOutlets [kW]
0 0.001159 0.000000
1 0.001223 0.000000
2 0.001281 0.000000
3 0.001525 0.000000
4 0.001549 0.000000
5 0.001490 0.000000
... ... ...
39796 0.003750 0.001783
39797 0.003717 0.001850
39798 0.003617 0.001867
I need to find the principal components, i.e. two orthogonal vectors.
I've tried PCA in Python, but I can't work out where the components are in the output, how to get the rotation angle between the vectors (I need both the vectors and the angle, as in the picture), or how it works with only two arrays.
So far I've standardized the arrays and don't understand what to do next:
import pandas
from sklearn.preprocessing import StandardScaler
s_i_t = pandas.read_csv('dataset.csv', sep=',')
s_i_t_std = StandardScaler().fit_transform(s_i_t)
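The missing step is `sklearn.decomposition.PCA`: after standardizing, fit PCA and read `components_`, whose rows are the two orthogonal unit vectors; the rotation angle can be recovered with `arctan2`. A minimal sketch with bogus stand-in data (the column names are taken from the question):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Bogus two-column data standing in for dataset.csv
rng = np.random.default_rng(0)
x = rng.normal(size=500)
s_i_t = pd.DataFrame({
    'GarageMudroomLights [kW]': x,
    'DiningRoomOutlets [kW]': 0.5 * x + rng.normal(scale=0.3, size=500),
})

s_i_t_std = StandardScaler().fit_transform(s_i_t)

pca = PCA(n_components=2)
pca.fit(s_i_t_std)

# Each row of components_ is one principal direction: unit length
# and mutually orthogonal
pc1, pc2 = pca.components_

# Rotation angle of the first component relative to the x-axis
angle = np.degrees(np.arctan2(pc1[1], pc1[0]))
print(pc1, pc2, angle)
```

`explained_variance_ratio_` additionally tells you how much of the variance each of the two directions captures.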

Related

Table of differences between observed and expected counts

I have data where I'm modeling a binary dependent variable. There are 5 other categorical predictor variables and I have the chi-square test for independence for each of them, vs. the dependent variable. All came up with very low p-values.
Now, I'd like to create a chart that displays all of the differences between the observed and expected counts. It seems like this should be part of the scipy chi2_contingency function but I can't figure it out.
The only thing I can think of is that the chi2_contingency function will output an array of expected counts, so I guess I need to figure out how to convert my cross tab table of observed counts into an array and then subtract the two.
## Gender & Income: cross-tabulation table and chi-square
ct_sex_income=pd.crosstab(adult_df.sex, adult_df.income, margins=True)
ct_sex_income
## Run Chi-Square test
scipy.stats.chi2_contingency(ct_sex_income)
## try to subtract them
ct_sex_income.observed - chi2_contingency(ct_sex_income)[4]
Error I get is "AttributeError: 'DataFrame' object has no attribute 'observed'"
I'd like just an array that shows the differences.
TIA for any help
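One thing worth checking (an assumption, since the full data isn't shown): `pd.crosstab(..., margins=True)` appends an "All" row and column of totals, and feeding those into `chi2_contingency` distorts both the test and the expected counts. A sketch of the subtraction with the margins dropped and the expected array wrapped back into a labelled DataFrame (`adult_df` here is bogus stand-in data):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Bogus stand-in for adult_df
adult_df = pd.DataFrame({
    'sex':    ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'M'],
    'income': ['<=50K', '<=50K', '>50K', '<=50K', '>50K', '>50K', '<=50K', '>50K'],
})

# No margins=True: the totals must not go into the chi-square test
observed = pd.crosstab(adult_df.sex, adult_df.income)
chi2, p, dof, expected = chi2_contingency(observed)

# expected is a plain ndarray; wrap it so the subtraction keeps labels
diff = observed - pd.DataFrame(expected, index=observed.index,
                               columns=observed.columns)
print(diff)
```

Because expected counts are built from the same row and column totals as the observed table, every row and column of `diff` sums to zero.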
I don't know your data, and the crosstab result simply has no `.observed` attribute, which is what the AttributeError is telling you. From what I can gather you're predicting people's income from their marital status, so here is one possible solution with bogus data.
import pandas as pd
import numpy as np
import scipy.stats as stats
from scipy.stats import chi2_contingency
# some bogus data
data = [['single','30k-35k'],['divorced','40k-45k'],['married','25k-30k'],
['single','25k-30k'],['married','40k-45k'],['divorced','40k-35k'],
['single','30k-35k'],['married','30k-35k'],['divorced','30k-35k'],
['single','30k-35k'],['married','40k-45k'],['divorced','25k-30k'],
['single','40k-45k'],['married','30k-35k'],['divorced','30k-35k'],
]
adult_df = pd.DataFrame(data,columns=['marital','income'])
X = adult_df['marital'] #variable
Y = adult_df['income'] #prediction
dfObserved = pd.crosstab(Y,X)
# chi2_contingency returns the chi-square statistic, p-value,
# degrees of freedom and the expected frequencies
chi2, pv, free, efreq = stats.chi2_contingency(dfObserved.values)
dfExpected = pd.DataFrame(efreq, columns=dfObserved.columns, index = dfObserved.index)
print(dfExpected)
"""
marital divorced married single
income
25k-30k 1.000000 1.000000 1.000000
30k-35k 2.333333 2.333333 2.333333
40k-35k 0.333333 0.333333 0.333333
40k-45k 1.333333 1.333333 1.333333
"""
print(dfObserved)
"""
marital divorced married single
income
25k-30k 1 1 1
30k-35k 2 2 3
40k-35k 1 0 0
40k-45k 1 2 1
"""
difference = dfObserved - dfExpected
print(difference)
""""
marital divorced married single
income
25k-30k 0.000000 0.000000 0.000000
30k-35k -0.333333 -0.333333 0.666667
40k-35k 0.666667 -0.333333 -0.333333
40k-45k -0.333333 0.666667 -0.333333
"""
I hope this helps.

ValueError: ndarray is not contiguous

When I build a matrix using the last row of my dataframe:
x = w.iloc[-1, :]
a = np.mat(x).T
it raises:
ValueError: ndarray is not contiguous
Printing x shows the following (I have 61 columns in my dataframe):
print(x)
cdl2crows 0.000000
cdl3blackcrows 0.000000
cdl3inside 0.000000
cdl3linestrike 0.000000
cdl3outside 0.191465
cdl3starsinsouth 0.000000
cdl3whitesoldiers_x 0.000000
cdl3whitesoldiers_y 0.000000
cdladvanceblock 0.000000
cdlhighwave 0.233690
cdlhikkake 0.218209
cdlhikkakemod 0.000000
...
cdlidentical3crows 0.000000
cdlinneck 0.000000
cdlinvertedhammer 0.351235
cdlkicking 0.000000
cdlkickingbylength 0.000000
cdlladderbottom 0.002259
cdllongleggeddoji 0.629053
cdllongline 0.588480
cdlmarubozu 0.065362
cdlmatchinglow 0.032838
cdlmathold 0.000000
cdlmorningdojistar 0.000000
cdlmorningstar 0.327749
cdlonneck 0.000000
cdlpiercing 0.251690
cdlrickshawman 0.471466
cdlrisefall3methods 0.000000
Name: 2010-01-04, Length: 61, dtype: float64
How can I solve this? Thanks so much.
np.mat expects array-like input (refer to the np.mat doc).
So your code should be:
x = w.iloc[-1, :].values
a = np.mat(x).T
.values gives you the dataframe values as a numpy array, so np.mat will work.
Use np.array instead of np.mat:
a = np.array(x).T
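Since `np.matrix` is discouraged in current NumPy, here is an alternative sketch that builds the column vector with plain arrays (`w` is a bogus stand-in for the question's dataframe):

```python
import numpy as np
import pandas as pd

# Bogus stand-in for the question's dataframe `w`
w = pd.DataFrame(np.random.rand(3, 4), columns=list('abcd'))

x = w.iloc[-1, :].to_numpy()  # last row as a contiguous ndarray
a = x.reshape(-1, 1)          # column vector, replacing np.mat(x).T
print(a.shape)
```

Note that transposing a 1-D array is a no-op; `reshape(-1, 1)` is what actually produces the column shape.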

token-pattern for numbers in tfidfvectorizer sklearn in python

I need to calculate the tfidf matrix for a few sentences. The sentences include both numbers and words.
I am using the code below to do so:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
data1=['1/8 wire','4 tube','1-1/4 brush']
dataset=pd.DataFrame(data1, columns=['des'])
vectorizer1 = TfidfVectorizer(lowercase=False)
tf_idf_matrix = pd.DataFrame(vectorizer1.fit_transform(dataset['des']).toarray(),columns=vectorizer1.get_feature_names())
The Tfidf function considers only words as its vocabulary, i.e.
Out[3]: ['brush', 'tube', 'wire']
but I need the numbers to be part of the tokens as well.
expected
Out[3]: ['brush', 'tube', 'wire','1/8','4','1-1/4']
After reading the TfidfVectorizer documentation, I learned that I have to change the token_pattern or tokenizer parameter, but I don't understand how to change it so that numbers and punctuation are included.
Can anyone please tell me how to change the parameters?
You're right that token_pattern requires a custom regex. Pass a pattern that treats any run of one or more non-whitespace characters as a single token:
tfidf = TfidfVectorizer(lowercase=False, token_pattern=r'\S+')
tf_idf_matrix = pd.DataFrame(
tfidf.fit_transform(dataset['des']).toarray(),
columns=tfidf.get_feature_names_out()  # get_feature_names() in scikit-learn < 1.0
)
print(tf_idf_matrix)
1-1/4 1/8 4 brush tube wire
0 0.000000 0.707107 0.000000 0.000000 0.000000 0.707107
1 0.000000 0.000000 0.707107 0.000000 0.707107 0.000000
2 0.707107 0.000000 0.000000 0.707107 0.000000 0.000000
You can also explicitly list in the token_pattern parameter the symbols you would like to parse:
token_pattern_ = r'([a-zA-Z0-9-/]{1,})'
where {1,} indicates the minimum number of characters a token must contain. Then you pass this as the token_pattern parameter:
tfidf = TfidfVectorizer(token_pattern = token_pattern_)
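Applied to the toy data from the question, this pattern keeps the numeric tokens (note that, unlike the default pattern, `{1,}` also admits single-character tokens such as '4'):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

data1 = ['1/8 wire', '4 tube', '1-1/4 brush']

# Tokens are runs of letters, digits, '-' and '/'
token_pattern_ = r'([a-zA-Z0-9-/]{1,})'
tfidf = TfidfVectorizer(token_pattern=token_pattern_)
tfidf.fit_transform(data1)
print(sorted(tfidf.vocabulary_))
# ['1-1/4', '1/8', '4', 'brush', 'tube', 'wire']
```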

Writing XY coordinates from an ASCII file with no feature part ID using Python

I need to read an ASCII file containing X and Y coordinates as well as a Z value using Python. These will be written as features in a feature class in ArcMap. Each point makes up a polygon, and each feature is separated by a row containing '999.0 999.0 999.0' as shown in the example. I'm wondering what the best way is to separate each feature, as there is no feature ID column.
329462.713287 8981177.910780 0.000000
331660.441771 8981187.405700 0.000000
331669.945462 8978975.695090 0.000000
329472.340912 8978966.180280 0.000000
329462.713287 8981177.910780 0.000000
999.0 999.0 999.0
297517.590475 8981318.596530 0.000000
299715.649732 8981329.876880 0.000000
299726.953175 8979117.630860 0.000000
297529.017922 8979106.326860 0.000000
297517.590475 8981318.596530 0.000000
999.0 999.0 999.0
Simply iterate over the data line by line, check whether the line contains your magic triplet, and when you hit that line start a new feature.
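A minimal sketch of that approach, assuming the rows are whitespace-separated floats; `lines` stands in for the open ASCII file:

```python
SENTINEL = (999.0, 999.0, 999.0)

def split_features(lines):
    """Group (x, y, z) rows into features, splitting on the sentinel row."""
    features = [[]]
    for line in lines:
        parts = line.split()
        if not parts:
            continue                # skip blank lines
        xyz = tuple(map(float, parts))
        if xyz == SENTINEL:
            features.append([])     # sentinel row: start the next feature
        else:
            features[-1].append(xyz)
    return [f for f in features if f]  # drop a trailing empty feature

# Usage:
# with open('polygons.txt') as fh:
#     polygons = split_features(fh)
```

Each resulting list of (x, y, z) tuples is one closed polygon ring (first point equals last in the example data) and can then be turned into geometry with arcpy.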

Is there any nipype interface for avscale (FSL script)?

I am trying to use nipype to analyze transformation matrices that were created by FSL.
FSL has a script called "avscale" that analyzes those transformation matrices (*.mat files).
I was wondering whether nipype has any interface that wraps that script and enables working with its output.
Thanks
Based on the docs and the current source, the answer is no. avscale also hasn't been mentioned on the nipy-devel mailing list since at least last February. It's possible that Nipype already wraps something else that does this (perhaps with a matlab wrapper?). You could try opening an issue or asking on the mailing list.
Since you're trying to use Python (with nipype and all), maybe the philosophy of the nipype project is that you should just use numpy/scipy for this? Just a guess; I don't know the functions to replicate this output with those tools. It's also possible that no one has gotten around to adding it yet.
For the uninitiated, avscale takes this affine matrix:
1.00614 -8.39414e-06 0 -0.757356
0 1.00511 -0.00317841 -0.412038
0 0.0019063 1.00735 -0.953364
0 0 0 1
and yields this or similar output:
Rotation & Translation Matrix:
1.000000 0.000000 0.000000 -0.757356
0.000000 0.999998 -0.001897 -0.412038
0.000000 0.001897 0.999998 -0.953364
0.000000 0.000000 0.000000 1.000000
Scales (x,y,z) = 1.006140 1.005112 1.007354
Skews (xy,xz,yz) = -0.000008 0.000000 -0.001259
Average scaling = 1.0062
Determinant = 1.01872
Left-Right orientation: preserved
Forward half transform =
1.003065 -0.000004 -0.000000 -0.378099
0.000000 1.002552 -0.001583 -0.206133
0.000000 0.000951 1.003669 -0.475711
0.000000 0.000000 0.000000 1.000000
Backward half transform =
0.996944 0.000004 0.000000 0.376944
0.000000 0.997452 0.001575 0.206357
0.000000 -0.000944 0.996343 0.473777
0.000000 0.000000 0.000000 1.000000
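Some of those quantities can be recovered from the affine with plain numpy. This is only a rough sketch, not a replacement for avscale's full output (e.g. the column-norm scales match avscale's convention for this matrix, but the half transforms and skews need a proper decomposition):

```python
import numpy as np

# The example affine from above
M = np.array([
    [1.00614, -8.39414e-06, 0.0,        -0.757356],
    [0.0,      1.00511,    -0.00317841, -0.412038],
    [0.0,      0.0019063,   1.00735,    -0.953364],
    [0.0,      0.0,         0.0,         1.0],
])

A = M[:3, :3]                        # linear part (rotation * scale * skew)
scales = np.linalg.norm(A, axis=0)   # column norms ~ scales (x, y, z)
avg_scale = scales.mean()            # "Average scaling"
det = np.linalg.det(A)               # "Determinant"
orientation = 'preserved' if det > 0 else 'flipped'  # "Left-Right orientation"
print(scales, avg_scale, det, orientation)
```

For this matrix the scales come out as roughly 1.00614, 1.00511, 1.00735 and the determinant as about 1.0187, matching the avscale output shown above.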
