How to label dots in python PCA analysis? - python

I have tried PCA analysis with this script.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn
from sklearn.preprocessing import StandardScaler
raw_data_frame =
pd.read_table('/content/drive/MyDrive/BI/colab_input_output/16samples_vaf_df_forpca.csv',
sep=",", header=0, index_col=0)
data_scaler = StandardScaler()
data_scaler.fit(raw_data_frame)
scaled_data_frame = data_scaler.transform(raw_data_frame)
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
pca.fit(scaled_data_frame)
x_pca = pca.transform(scaled_data_frame)
plt.figure(figsize=(10, 7))
plt.scatter(x_pca[:,0],x_pca[:,1], c=raw_data_frame['target'], cmap='viridis')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
And the output is
I want to label the dots with the information in the dataframe.
The format of the dataframe is
('1', 187963806) ('19', 49972822) ('8', 14555764) ('11', 127666530) ('18', 67693298) target
15_R71_epi 0.310344828 0.227272727 0.217391304 0.149253731 0 1
15_R21_epi 0.1875 0.228070175 0.173913043 0.25862069 0 1
15_L133_epi 0.078947368 0.085714286 0.145454545 0.119047619 0 1
15_L58_epi 0.222222222 0.19047619 0.302325581 0.333333333 0 1
15_C5_epi 0.267326733 0.132075472 0.275362319 0.220779221 0 1
15_Lt_Nasal_derm 0.359375 0.039215686 0.274509804 0.192982456 0 2
15-H-21 0.322580645 0.255319149 0.238095238 0.380952381 0 3
15_H-55 0.446808511 0.27027027 0.387755102 0.347826087 0 3
15_H-49 0.30952381 0.236363636 0.266666667 0.235294118 0 3
15_H-3 0.12962963 0.153846154 0.085106383 0.205479452 0 3
15_H-33 0.349206349 0.263157895 0.298245614 0.328571429 0 3
15-RK-62 0.235294118 0.152173913 0.191780822 0.2 0 4
15_RK-29 0.078431373 0.094339623 0.175438596 0.121212121 0 4
15_LK-168 0.185185185 0.132075472 0.12 0.2 0 5
15_LK-114 0.173076923 0.075 0.14893617 0.237288136 0 5
15_LK-176 0.253968254 0.113207547 0.127272727 0.291666667 0.035087719 5
(This looks bad, but if you copy, it would be in a good form)
The color of the dots correspond with the numbers in the column "target"
But in the figure I can't distinguish the names of the samples.
How can I do?

Related

Erasing outliers from a dataframe in python

For an assignment I have to erase the outliers of a csv based on the different method
I tried working with the variable 'height' of the csv after opening the csv into a panda dataframe, but it keeps giving me errors or not touching the outliers at all, all this trying to use KNN method in python
The code that I wrote is the following
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_blobs
df = pd.read_csv("data.csv")
print(df.describe())
print(df.columns)
df['height'].plot(kind='hist')
print(df['height'].value_counts())
data= pd.DataFrame(df['height'],df['active'])
k=1
knn = NearestNeighbors(n_neighbors=k)
knn.fit([df['height']])
neighbors_and_distances = knn.kneighbors([df['height']])
knn_distances = neighbors_and_distances[0]
tnn_distance = np.mean(knn_distances, axis=1)
print(knn_distances)
PCM = df.plot(kind='scatter', x='x', y='y', c=tnn_distance, colormap='viridis')
plt.show()
And the data it something like this:
id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,18393,2,168,62.0,110,80,1,1,0,0,1,0
1,20228,1,156,85.0,140,90,3,1,0,0,1,1
2,18857,1,50,64.0,130,70,3,1,0,0,0,1
3,17623,2,250,82.0,150,100,1,1,0,0,1,1
I dont know what Im missing or doing wrong
df = pd.read_csv("data.csv")
X = df[['height', 'weight']]
X.plot(kind='scatter', x='weight', y='height', colormap='viridis')
plt.show()
knn = NearestNeighbors(n_neighbors=2).fit(X)
distances, indices = knn.kneighbors(X)
X['distances'] = distances[:,1]
X.distances
0 1.000000
1 1.000000
2 1.000000
3 3.000000
4 1.000000
5 1.000000
6 133.958949
7 100.344407
...
X.plot(kind='scatter', x='weight', y='height', c='distances', colormap='viridis')
plt.show()
MAX_DIST = 10
X[distances < MAX_DIST]
height weight
0 162 78.0
1 162 78.0
2 151 76.0
3 151 76.0
4 171 84.0
...
And finally to filter out all the outliers:
MAX_DIST = 10
X = X[X.distances < MAX_DIST]

How to select columns of a data base to call a linear regression (OLS and lasso) in sklearn

I am not comfortable with Python - much less intimidated and at ease with R. So indulge me on a silly question that is taking me a ton of searches without success.
I want to fit in a regression model with sklearn both with OLS and lasso. In particular, I like the mtcars dataset that is so easy to call in R, and, as it turns out, also very accessible in Python:
import statsmodels.api as sm
import pandas as pd
import statsmodels.formula.api as smf
mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
df = pd.DataFrame(mtcars)
It looks like this:
mpg cyl disp hp drat ... qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 ... 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 ... 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 ... 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 ... 19.44 1 0 3 1
In trying to use LinearRegression() the usual structure found is
import numpy as np
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(x, y)
but to do so, I need to select several columns of df to fit into the regressors x, and a column to be the independent variable y. For example, I'd like to get an x matrix that includes a column of 1's (for the intercept) as well as the disp and qsec (numerical variables), as well as cyl (categorical variable). On the side of the independent variable, I'd like to use mpg.
It would look if it were possible to word this way as
model = LinearRegression().fit(mpg ~['disp', 'qsec', C('cyl')], data=df)
But how do I go about the syntax for it?
Similarly, how can I do the same with lasso:
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.001)
lasso.fit(mpg ~['disp', 'qsec', C('cyl')], data=df)
but again this is not the right syntax.
I did find that you can get the actual regression (OLS or lasso) by turning the dataframe into a matrix. However, the names of the columns are gone, and it is hard to read the variable corresponding to each coefficients. And I still haven't found a simple method to run diagnostic values, like p-values, or the r-square to begin with.
You can maybe try patsy which is used by statsmodels:
import statsmodels.api as sm
import pandas as pd
import statsmodels.formula.api as smf
from patsy import dmatrix
mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
mat = dmatrix("disp + qsec + C(cyl)", mtcars)
Looks like this, we can omit first column intercept since it is included in sklearn:
mat
DesignMatrix with shape (32, 5)
Intercept C(cyl)[T.6] C(cyl)[T.8] disp qsec
1 1 0 160.0 16.46
1 1 0 160.0 17.02
1 0 0 108.0 18.61
1 1 0 258.0 19.44
1 0 1 360.0 17.02
X = pd.DataFrame(mat[:,1:],columns = mat.design_info.column_names[1:])
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X,mtcars['mpg'])
But the parameters names in model.coef_ will not be named. You just have to put them into a series to read them maybe:
pd.Series(model.coef_,index = X.columns)
C(cyl)[T.6] -5.087564
C(cyl)[T.8] -5.535554
disp -0.025860
qsec -0.162425
Pvalues from sklearn linear regression, there's no ready method to do it, you can check out these answers, maybe one of them is what you are looking for.
Here are two ways - unsatisfactory, especially because the variables labels seem to be gone once the regression gets going:
import statsmodels.api as sm
import pandas as pd
import statsmodels.formula.api as smf
mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
df = pd.DataFrame(mtcars)
import numpy as np
from sklearn.linear_model import LinearRegression
Single variable regression mpg (i.v.) ~ hp (d.v.):
lm = LinearRegression()
mat = np.matrix(df)
lmFit = lm.fit(mat[:,3], mat[:,0])
print(lmFit.coef_)
print(lmFit.intercept_)
For multiple regression drat ~ wt + cyl + carb:
lmm = LinearRegression()
wt = np.array(df['wt'])
cyl = np.array(df['cyl'])
carb = np.array(df['carb'])
stack = np.column_stack((cyl,wt,carb))
stackmat = np.matrix(stack)
lmFit2 = lmm.fit(stackmat,mat[:,4])
print(lmFit2.coef_)
print(lmFit2.intercept_)

How to feed multiple files to pandas to filter data and concatenate all the results

I have written a code to perform some data cleaning to get the final columns and values from a tab spaced file.
import matplotlib.image as image
import numpy as np
import tkinter as tk
import matplotlib.ticker as ticker
from tkinter import filedialog
import matplotlib.pyplot as plt
root = tk.Tk()
root.withdraw()
root.call('wm', 'attributes', '.', '-topmost', True)
files1 = filedialog.askopenfilename(multiple=True)
files = root.tk.splitlist(files1)
List = list(files)
%gui tk
for i,file in enumerate(List,1):
d = pd.read_csv(file,sep=None,engine='python')
h = d.drop(d.index[19:])
transpose = h.T
header =transpose.iloc[0]
df = transpose[1:]
df.columns =header
df.columns = df.columns.str.strip()
all_columns = list(df)
df[all_columns] = df[all_columns].astype(str)
k =df.drop(columns =['Op:','Comment:','Mod Type:', 'PN', 'Irradiance:','Irr Correct:', 'Lamp Voltage:','Corrected To:', 'MCCC:', 'Rseries:', 'Rshunt:'], axis=1)
k.head()
I want to run this code to multiple files and do the same and concatenate all the results to one data frame.
for eg, If I select 20 files, then new data frame with one line of header and all the 20 results below with increasing order of the value from the column['Module Temp:'].
It would be great if someone could provide a solution to this problem
Please find the link to sample data:https://drive.google.com/drive/folders/1sL2-CwCGeGm0-fvcpzMVzgFnYzN3wzVb?usp=sharing
The following code shows how to parse the files and extract the data. It doesn't show the tkinter GUI component. files will represent your selected files.
Assumptions:
The first 92 rows of the files are always the measurement parameters
Rows from 93 are the measurements.
The 'Module Temp' for each file is different
The lists will be sorted based on the sort order of mod_temp, so the data will be in order in the DataFrame.
The list sorting uses the accepted answer to Sorting list based on values from another list?
import pandas as p
from patlib import Path
# set path to files
path_ = Path('e:/PythonProjects/stack_overflow/data/so_data/2020-11-16')
# select the correct files
files = path_.glob('*.ivc')
# create lists for metrics
measurement_params = list()
mod_temp = list()
measurements = list()
# iterate through the files
for f in files:
# get the first 92 rows with the measurement parameters
mp = pd.read_csv(f, sep='\t', nrows=91, index_col=0)
# remove the whitespace and : from the end of the index names
mp.index = mp.index.str.replace(':', '').str.strip().str.replace('\\s+', '_')
# get the column header
col = mp.columns[0]
# get the module temp
mt = mp.loc['Module_Temp', col]
# add Modult_Temp to mod_temp
mod_temp.append(float(mt))
# get the measurements
m = pd.read_csv(f, sep='\t', skiprows=92, nrows=3512)
# remove the whitespace and : from the end of the column names
m.columns = m.columns.str.replace(':', '').str.strip()
# add Module_Temp column
m['mod_temp'] = mt
# store the measure parameters
measurement_params.append(mp.T)
# store the measurements
measurements.append(m)
# sort lists based on mod_temp sort order
measurement_params = [x for _, x in sorted(zip(mod_temp, measurement_params))]
measurements = [x for _, x in sorted(zip(mod_temp, measurements))]
# create a dataframe for the measurement parameters
df_mp = pd.concat(measurement_params)
# create a dataframe for the measurements
df_m = pd.concat(measurements).reset_index(drop=True)
df_mp
Title: Comment Op ID Mod_Type PN Date Time Irradiance IrrCorr Irr_Correct Lamp_Voltage Module_Temp Corrected_To MCCC Voc Isc Rseries Rshunt Pmax Vpm Ipm Fill_Factor Active_Eff Aperture_Eff Segment_Area Segs_in_Ser Segs_in_Par Panel_Area Vload Ivld Pvld Frequency SweepDelay SweepLength SweepSlope SweepDir MCCC2 MCCC3 MCCC4 LampI IntV IntV2 IntV3 IntV4 LoadV PulseWidth1 PulseWidth2 PulseWidth3 PulseWidth4 TRef1 TRef2 TRef3 TRef4 MCMode Irradiance2 IrrCorr2 Voc2 Isc2 Pmax2 Vpm2 Ipm2 Fill_Factor2 Active_Eff2 ApertureEff2 LoadV2 PulseWidth12 PulseWidth22 Irradiance3 IrrCorr3 Voc3 Isc3 Pmax3 Vpm3 Ipm3 Fill_Factor3 Active_Eff3 ApertureEff3 LoadV3 PulseWidth13 PulseWidth23 RefCellID RefCellTemp RefCellIrrMM RefCelIscRaw RefCellIsc VTempCoeff ITempCoeff PTempCoeff MismatchCorr Serial_No Soft_Ver
Nease 345W N345M72 STC Admin MCIND2021-058 ModuleType1 NaN 10-09-2020 19:12:52 100.007 100 Ref Cell 2400 25.2787 25 1.3669 46.4379 9.13215 0.43411 294.467 331.924 38.3403 8.65732 0.78269 1.89434 1.7106 243.36 72 1 19404 0 0 0 218000 10 100 0.025 0 1 1.155 1.155 20.4736 6.87023 6.8645 6 6 6.76 107.683 109.977 0 0 27.2224 0 0 0 False -1.#INF 70 0 0 0 0 0 0 0 0 5 107.683 109.977 -1.#INF 40 0 0 0 0 0 0 0 0 5 107.683 109.977 WPVS mono C-Si Ref Cell 25.9834 1001.86 0.15142 0.15135 -0.31 0.05 -0.4 0.9985 S91-00052 5.5.1
Solarium SGE24P330 STC Admin MCIND_2021_0074 ModuleType1 NaN 17-09-2020 15:06:12 99.3671 100 Ref Cell 2400 25.3380 25 1.3669 45.2903 8.87987 0.48667 216.763 311.031 36.9665 8.41388 0.77338 1.77510 1.60292 243.36 72 1 19404 0 0 0 218000 10 100 0.025 0 1 1.155 1.155 20.405 6.82362 6.8212 6 6 6.6 107.660 109.977 0 0 25.9418 0 0 0 False -1.#INF 70 0 0 0 0 0 0 0 0 4.943 107.660 109.977 -1.#INF 40 0 0 0 0 0 0 0 0 4.943 107.660 109.977 WPVS mono C-Si Ref Cell 25.3315 998.370 0.15085 0.15082 -0.31 0.05 -0.4 0.9985 S91-00052 5.5.1
Nease 345W N345M72 STC Admin MCIND2021-058 ModuleType1 NaN 10-09-2020 19:11:32 100.010 100 Ref Cell 2400 25.3557 25 1.3669 46.4381 9.11368 0.41608 299.758 331.418 38.3876 8.63345 0.78308 1.89144 1.70798 243.36 72 1 19404 0 0 0 218000 10 100 0.025 0 1 1.155 1.155 20.3820 6.87018 6.8645 6 6 6.76 107.683 109.977 0 0 27.2535 0 0 0 False -1.#INF 70 0 0 0 0 0 0 0 0 5 107.683 109.977 -1.#INF 40 0 0 0 0 0 0 0 0 5 107.683 109.977 WPVS mono C-Si Ref Cell 25.9614 1003.80 0.15171 0.15164 -0.31 0.05 -0.4 0.9985 S91-00052 5.5.1
Nease 345W N345M72 STC Admin MCIND2021-058 ModuleType1 NaN 10-09-2020 19:14:09 99.9925 100 Ref Cell 2400 25.4279 25 1.3669 46.4445 9.14115 0.43428 291.524 332.156 38.2767 8.67776 0.78236 1.89566 1.71179 243.36 72 1 19404 0 0 0 218000 10 100 0.025 0 1 1.155 1.155 20.5044 6.87042 6.8645 6 6 6.76 107.660 109.977 0 0 27.1989 0 0 0 False -1.#INF 70 0 0 0 0 0 0 0 0 5 107.660 109.977 -1.#INF 40 0 0 0 0 0 0 0 0 5 107.660 109.977 WPVS mono C-Si Ref Cell 26.0274 1000.93 0.15128 0.15121 -0.31 0.05 -0.4 0.9985 S91-00052 5.5.1
df_m.head()
Voltage Current mod_temp
0 -1.193405 9.202885 25.2787
1 -1.196560 9.202489 25.2787
2 -1.193403 9.201693 25.2787
3 -1.196558 9.201298 25.2787
4 -1.199711 9.200106 25.2787
df_m.tail()
Voltage Current mod_temp
14043 46.30869 0.315269 25.4279
14044 46.31411 0.302567 25.4279
14045 46.31949 0.289468 25.4279
14046 46.32181 0.277163 25.4279
14047 46.33039 0.265255 25.4279
Plot
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(20, 8))
sns.scatterplot(x='Current', y='Voltage', data=df_m, hue='mod_temp', s=10)
plt.show()
Note
After doing this, I was having trouble plotting the data because the columns were not float type. However, an error occurred when trying to set the type. Looking back at the data, after row 92, there are multiple headers throughout the two columns.
Row 93: Voltage: Current:
Row 3631: Ref Cell: Lamp I:
Row 7169: Voltage2: Current2:
Row 11971: Ref Cell2: Lamp I2:
Row 16773: Voltage3: Current3:
Row 21575: Ref Cell3: Lamp I3:
Row 26377: Raw Voltage: Raw Current :
Row 29915: WPVS Voltage: WPVS Current:
I went back and used the nrows parameter when creating m, so only the first set of headers and associated measurements are extracted from the file.
I recommend writing a script using the csv module to read each file, and create a new file beginning at each blank row, this will make the files have consistent types of measurements.
This should be a new question, if needed.
There are various ways to do it. You can append one dataframe to another (basically stack one on top of the other), and you can do it in the loop. Here is an example. I use fake dfs but you will use your own
import pandas as pd
import numpy as np
combined = None
for _ in range(5):
# stub df creation -- you will use your real code here
df = pd.DataFrame(columns = ['Module Temp','A', 'B'], data = np.random.random((5,3)))
if combined is None:
# initialize with the first one
combined = df.copy()
else:
# add the next one
combined = combined.append(df, sort = False, ignore_index = True)
combined.sort_values('Module Temp', inplace = True)
Here combined will have all the dfs, sorted by 'Module Temp'

How to apply text/color formatting to multi index Python Chart?

I am trying to create a chart with a simple yes/no indication (in this case, using seaborn's heatmap), but with multiple indexes--the hierarchy of which I establish with calling the MultiIndex.from_arrays method. The code listed below, which pulls from the example data listed in the link below, produces this heatmap: CurrentGraph
What is the easiest way to 1) Delete the outermost index labels ("Type of Category....", "None-None") 2) Change the display options and/or color formatting of the next inner index labels? (Category A, Category B, Loc1, Loc2, Loc3)
Currently, the way they are displayed is not very easy on the eyes; I'd like to make it so each label's text in this layer only shows up once and is centered along the rows/columns that were matched in the MultiIndex.from_arrays method, rather than spliced onto the beginning of each entry. If possible, I'd also like to put a different background color underneath each of these "sections" (Category A, Category B, Loc1, Loc2, Loc3) so that they appear more distinct to the eye on the graph.
Is there an easy way to do this sort of editing with heatmap/Seaborn, or is this something I would have to make from scratch? Any help on this is appreciated, as I'm still rather green with this.
http://www.filedropper.com/heatmapdata
EDIT: added sample data as text below, before code
Type of Category Variables Division 01 Division 02 Division 03 Division 04 Division 05 Division 06 Division 07 \\
Category A V1 0 1 1 0 0 0 1 \\
Category A V2 1 1 1 0 0 0 1 \\
Category A V3 1 1 1 0 0 1 1 \\
Category A V4 0 0 1 1 1 1 0 \\
Category A V5 0 1 1 1 0 0 0 \\
Category A V6 1 1 1 1 0 0 1 \\
Category B V7 1 0 0 1 0 0 1
import seaborn as sns; sns.set()
import pandas as pd
import matplotlib.pyplot as plt
metrics = pd.read_excel(r'EnterYourFileLocationHere\Heatmap data.xlsx', sheet_name='Example_Data')
metrics.set_index([metrics.columns[0], metrics.columns[1]], inplace=True)
metrics.columns = pd.MultiIndex.from_arrays([['Loc1', 'Loc1', 'Loc1', 'Loc2',
'Loc2', 'Loc3', 'Loc3'],
['Division 01', 'Division 02',
'Division 03', 'Division 04',
'Divsion 05', 'Division06',
'Division07']])
graph = sns.heatmap(metrics, annot=True, fmt = "d", linewidth=0.5, cmap="Blues", cbar=False)
I couldn't find an example of decorating the background of the label, so I've tried to show the category names as an example of visualization in the annotations.
import seaborn as sns
sns.set()
import pandas as pd
import matplotlib.pyplot as plt
# metrics = pd.read_excel(r'EnterYourFileLocationHere\Heatmap data.xlsx', sheet_name='Example_Data')
metrics.set_index([metrics.columns[0], metrics.columns[1]], inplace=True)
metrics.columns = pd.MultiIndex.from_arrays([['Loc1', 'Loc1', 'Loc1', 'Loc2',
'Loc2', 'Loc3', 'Loc3'],
['Division 01', 'Division 02',
'Division 03', 'Division 04',
'Divsion 05', 'Division06',
'Division07']])
graph = sns.heatmap(metrics, annot=False, fmt = "d", linewidth=0.5, cmap="Blues", cbar=False)
for y in range(metrics.shape[1]):
for x in range(metrics.shape[0]):
if metrics.iloc[x,y] == 1:
graph.annotate(metrics.columns[y][0]+'\n '+metrics.columns[y][1][-2:], xy = (y+0.25, x+0.75), size=12, color="#ffd700")

plotting a timeseries graph in python using matplotlib from a csv file

I have some csv data in the following format.
Ln Dr Tag Lab 0:01 0:02 0:03 0:04 0:05 0:06 0:07 0:08 0:09
L0 St vT 4R 0 0 0 0 0 0 0 0 0
L2 Tx st 4R 8 8 8 8 8 8 8 8 8
L2 Tx ss 4R 1 1 9 6 1 0 0 6 7
I want to plot a timeseries graph using the columns (Ln , Dr, Tg,Lab) as the keys and the 0:0n field as values on a timeseries graph.
I have the following code.
#!/usr/bin/env python
import matplotlib.pyplot as plt
import datetime
import numpy as np
import csv
import sys
with open("test.csv", 'r', newline='') as fin:
reader = csv.DictReader(fin)
for row in reader:
key = (row['Ln'], row['Dr'], row['Tg'],row['Lab'])
#code to extract the values and plot a timeseries.
How do I extract all the values in columns 0:0n without induviduall specifying each one of them. I want all the timeseries to be plotted on a single timeseries?
I'd suggest using pandas:
import pandas as pd
a=pd.read_csv('yourfile.txt',delim_whitespace=True)
for x in a.iterrows():
x[1][4:].plot(label=str(x[1][0])+str(x[1][1])+str(x[1][2])+str(x[1][3]))
plt.ylim(-1,10)
plt.legend()
I'm not really sure exactly what you want to do but np.loadtxt is the way to go here. make sure to set the delimiter correctly for your file
data = np.loadtxt(fname="test.csv",delimiter=',',skiprows=1)
now the n-th column of data is the n-th column of the file and same for rows.
you can access data by line: data[n] or by column: data[:,n]

Categories

Resources