Erasing outliers from a dataframe in Python

For an assignment I have to remove the outliers from a CSV file using several different methods.
I tried working with the 'height' variable after loading the CSV into a pandas dataframe, but my attempt with the KNN method in Python either raises errors or leaves the outliers untouched.
The code that I wrote is the following:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_blobs
df = pd.read_csv("data.csv")
print(df.describe())
print(df.columns)
df['height'].plot(kind='hist')
print(df['height'].value_counts())
data= pd.DataFrame(df['height'],df['active'])
k=1
knn = NearestNeighbors(n_neighbors=k)
knn.fit([df['height']])
neighbors_and_distances = knn.kneighbors([df['height']])
knn_distances = neighbors_and_distances[0]
tnn_distance = np.mean(knn_distances, axis=1)
print(knn_distances)
PCM = df.plot(kind='scatter', x='x', y='y', c=tnn_distance, colormap='viridis')
plt.show()
And the data looks something like this:
id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
0,18393,2,168,62.0,110,80,1,1,0,0,1,0
1,20228,1,156,85.0,140,90,3,1,0,0,1,1
2,18857,1,50,64.0,130,70,3,1,0,0,0,1
3,17623,2,250,82.0,150,100,1,1,0,0,1,1
I don't know what I'm missing or doing wrong.

df = pd.read_csv("data.csv")
X = df[['height', 'weight']].copy()  # copy, so adding a column later does not trigger SettingWithCopyWarning
X.plot(kind='scatter', x='weight', y='height', colormap='viridis')
plt.show()
knn = NearestNeighbors(n_neighbors=2).fit(X)
distances, indices = knn.kneighbors(X)
X['distances'] = distances[:,1]
X.distances
0 1.000000
1 1.000000
2 1.000000
3 3.000000
4 1.000000
5 1.000000
6 133.958949
7 100.344407
...
X.plot(kind='scatter', x='weight', y='height', c='distances', colormap='viridis')
plt.show()
MAX_DIST = 10
X.loc[X.distances < MAX_DIST, ['height', 'weight']]  # mask with the per-row column; `distances` itself is an (n, 2) array
height weight
0 162 78.0
1 162 78.0
2 151 76.0
3 151 76.0
4 171 84.0
...
And finally to filter out all the outliers:
MAX_DIST = 10
X = X[X.distances < MAX_DIST]
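The same idea can be wrapped in a small helper. This is only a sketch under stated assumptions: the data is synthetic (the real data.csv is not available here), the helper name drop_knn_outliers and its parameters k and max_dist are made up for illustration, and it averages the distance to the k nearest neighbours instead of using only the single nearest one.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for the CSV: 200 plausible rows plus two obvious outliers.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "height": rng.normal(165, 8, 200).round(),
    "weight": rng.normal(70, 10, 200).round(),
})
outliers = pd.DataFrame({"height": [50.0, 250.0], "weight": [64.0, 82.0]})
df = pd.concat([df, outliers], ignore_index=True)


def drop_knn_outliers(frame, columns, k=3, max_dist=10.0):
    """Keep rows whose mean distance to their k nearest neighbours is below max_dist."""
    X = frame[columns].to_numpy(dtype=float)
    # Ask for k + 1 neighbours: the closest match of every point is the point
    # itself (or an exact duplicate) at distance 0, so column 0 is skipped.
    knn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    distances, _ = knn.kneighbors(X)
    score = distances[:, 1:].mean(axis=1)
    return frame[score < max_dist]


clean = drop_knn_outliers(df, ["height", "weight"])
print(len(df), "->", len(clean))
```

The threshold is in the units of the data, so a value that suits height in centimetres may not suit other columns; scaling the features first is a common alternative.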

Related

How to label dots in python PCA analysis?

I have tried PCA analysis with this script.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn
from sklearn.preprocessing import StandardScaler
raw_data_frame = pd.read_table(
    '/content/drive/MyDrive/BI/colab_input_output/16samples_vaf_df_forpca.csv',
    sep=",", header=0, index_col=0)
data_scaler = StandardScaler()
data_scaler.fit(raw_data_frame)
scaled_data_frame = data_scaler.transform(raw_data_frame)
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
pca.fit(scaled_data_frame)
x_pca = pca.transform(scaled_data_frame)
plt.figure(figsize=(10, 7))
plt.scatter(x_pca[:,0],x_pca[:,1], c=raw_data_frame['target'], cmap='viridis')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
And the output is
I want to label the dots with the information in the dataframe.
The format of the dataframe is
('1', 187963806) ('19', 49972822) ('8', 14555764) ('11', 127666530) ('18', 67693298) target
15_R71_epi 0.310344828 0.227272727 0.217391304 0.149253731 0 1
15_R21_epi 0.1875 0.228070175 0.173913043 0.25862069 0 1
15_L133_epi 0.078947368 0.085714286 0.145454545 0.119047619 0 1
15_L58_epi 0.222222222 0.19047619 0.302325581 0.333333333 0 1
15_C5_epi 0.267326733 0.132075472 0.275362319 0.220779221 0 1
15_Lt_Nasal_derm 0.359375 0.039215686 0.274509804 0.192982456 0 2
15-H-21 0.322580645 0.255319149 0.238095238 0.380952381 0 3
15_H-55 0.446808511 0.27027027 0.387755102 0.347826087 0 3
15_H-49 0.30952381 0.236363636 0.266666667 0.235294118 0 3
15_H-3 0.12962963 0.153846154 0.085106383 0.205479452 0 3
15_H-33 0.349206349 0.263157895 0.298245614 0.328571429 0 3
15-RK-62 0.235294118 0.152173913 0.191780822 0.2 0 4
15_RK-29 0.078431373 0.094339623 0.175438596 0.121212121 0 4
15_LK-168 0.185185185 0.132075472 0.12 0.2 0 5
15_LK-114 0.173076923 0.075 0.14893617 0.237288136 0 5
15_LK-176 0.253968254 0.113207547 0.127272727 0.291666667 0.035087719 5
(This looks bad, but if you copy, it would be in a good form)
The color of each dot corresponds to the number in the "target" column.
But in the figure I can't tell which dot belongs to which sample.
How can I do this?
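One possible approach, sketched here with made-up numbers, is to call matplotlib's annotate once per point and use the dataframe index as the label text. The frame below only imitates the shape of the one in the question (sample names in the index, a 'target' column for colour), and unlike the script above it leaves 'target' out of the PCA input; both choices are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Small made-up frame: sample names in the index, numeric feature columns,
# and a 'target' column that is used only for the colour of the dots.
rng = np.random.default_rng(0)
names = ["15_R71_epi", "15_R21_epi", "15_H-55", "15_H-49", "15_RK-29", "15_LK-168"]
frame = pd.DataFrame(rng.random((6, 4)), index=names, columns=list("abcd"))
frame["target"] = [1, 1, 3, 3, 4, 5]

features = frame.drop(columns="target")
x_pca = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(features))

fig, ax = plt.subplots(figsize=(10, 7))
ax.scatter(x_pca[:, 0], x_pca[:, 1], c=frame["target"], cmap="viridis")
ax.set_xlabel("First Principal Component")
ax.set_ylabel("Second Principal Component")

# One text label per point, nudged a few points away from the marker.
labels = [
    ax.annotate(name, xy=(x, y), xytext=(4, 4), textcoords="offset points", fontsize=8)
    for name, (x, y) in zip(frame.index, x_pca)
]
# Call plt.show() or fig.savefig(...) here to view the result.
```

With only 16 samples the labels should stay readable; for denser plots the labels may overlap and need manual offsets.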

How to select columns of a data base to call a linear regression (OLS and lasso) in sklearn

I am not comfortable with Python; I am much less intimidated by, and more at ease with, R. So indulge me on a silly question that has cost me a ton of searches without success.
I want to fit a regression model with sklearn, both with OLS and lasso. In particular, I like the mtcars dataset that is so easy to call in R, and, as it turns out, also very accessible in Python:
import statsmodels.api as sm
import pandas as pd
import statsmodels.formula.api as smf
mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
df = pd.DataFrame(mtcars)
It looks like this:
mpg cyl disp hp drat ... qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 ... 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 ... 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 ... 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 ... 19.44 1 0 3 1
In trying to use LinearRegression() the usual structure found is
import numpy as np
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(x, y)
but to do so, I need to select several columns of df to use as the regressors x, and a column to be the dependent variable y. For example, I'd like to get an x matrix that includes a column of 1's (for the intercept) as well as disp and qsec (numerical variables) and cyl (categorical variable). For the dependent variable, I'd like to use mpg.
If it were possible to write it this way, it would look like
model = LinearRegression().fit(mpg ~['disp', 'qsec', C('cyl')], data=df)
But how do I go about the syntax for it?
Similarly, how can I do the same with lasso:
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.001)
lasso.fit(mpg ~['disp', 'qsec', C('cyl')], data=df)
but again this is not the right syntax.
I did find that you can get the actual regression (OLS or lasso) by turning the dataframe into a matrix. However, the names of the columns are gone, and it is hard to tell which variable corresponds to each coefficient. And I still haven't found a simple method to get diagnostic values, like p-values, or even the R-squared.
You can maybe try patsy which is used by statsmodels:
import statsmodels.api as sm
import pandas as pd
import statsmodels.formula.api as smf
from patsy import dmatrix
mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
mat = dmatrix("disp + qsec + C(cyl)", mtcars)
It looks like this. We can omit the first column (the intercept), since sklearn fits an intercept by default:
mat
DesignMatrix with shape (32, 5)
Intercept C(cyl)[T.6] C(cyl)[T.8] disp qsec
1 1 0 160.0 16.46
1 1 0 160.0 17.02
1 0 0 108.0 18.61
1 1 0 258.0 19.44
1 0 1 360.0 17.02
X = pd.DataFrame(mat[:,1:],columns = mat.design_info.column_names[1:])
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X,mtcars['mpg'])
The coefficients in model.coef_ are not labelled, though. You can put them into a Series to read them:
pd.Series(model.coef_,index = X.columns)
C(cyl)[T.6] -5.087564
C(cyl)[T.8] -5.535554
disp -0.025860
qsec -0.162425
As for p-values, sklearn's linear regression has no ready-made method for them; you can check out these answers, maybe one of them is what you are looking for.
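If patsy is not at hand, a similar design matrix can be built with pandas alone. The sketch below is an illustration, not the method used above: it runs on made-up data whose column names merely mirror mtcars (so it works offline), uses pd.get_dummies with drop_first=True in place of C(cyl), and fits both LinearRegression and Lasso on the same labelled columns.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso, LinearRegression

# Made-up stand-in for mtcars; only the column names mirror the real dataset.
rng = np.random.default_rng(0)
n = 60
cars = pd.DataFrame({
    "disp": rng.uniform(70, 470, n),
    "qsec": rng.uniform(14, 23, n),
    "cyl": rng.choice([4, 6, 8], n),
})
cars["mpg"] = (40 - 0.03 * cars["disp"] - 0.2 * cars["qsec"]
               - 0.5 * cars["cyl"] + rng.normal(0, 1, n))

# One-hot encode cyl and drop the first level (4) as the baseline.
# No column of ones is added: sklearn fits the intercept itself.
X = pd.get_dummies(cars[["disp", "qsec", "cyl"]], columns=["cyl"],
                   drop_first=True, dtype=float)
y = cars["mpg"]

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.001).fit(X, y)

# Index the coefficients by column name so each one can be read off directly.
coefs = pd.DataFrame({"ols": ols.coef_, "lasso": lasso.coef_}, index=X.columns)
print(coefs)
print("R^2 (OLS):", ols.score(X, y))
```

The score method returns the R-squared on the data passed to it, which covers at least that one diagnostic without leaving sklearn.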
Here are two ways - unsatisfactory, especially because the variables labels seem to be gone once the regression gets going:
import statsmodels.api as sm
import pandas as pd
import statsmodels.formula.api as smf
mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
df = pd.DataFrame(mtcars)
import numpy as np
from sklearn.linear_model import LinearRegression
Single-variable regression mpg (d.v.) ~ hp (i.v.):
lm = LinearRegression()
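mat = df.to_numpy()  # plain ndarray; np.matrix is deprecated and rejected by recent scikit-learn versions
lmFit = lm.fit(mat[:, [3]], mat[:, 0])  # X must be 2-D: column 3 is hp, column 0 is mpg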
print(lmFit.coef_)
print(lmFit.intercept_)
For multiple regression drat ~ wt + cyl + carb:
lmm = LinearRegression()
wt = np.array(df['wt'])
cyl = np.array(df['cyl'])
carb = np.array(df['carb'])
stack = np.column_stack((cyl, wt, carb))  # already a 2-D ndarray, no np.matrix needed
lmFit2 = lmm.fit(stack, df['drat'])
print(lmFit2.coef_)
print(lmFit2.intercept_)

Scikit learn for predicting likelihood based on two values

New to python, building a classifier that predicts likelihood of vaccination if trust in government (trustingov) and trust in public health (poptrusthealth) from the dataset is greater than a certain percentage. Not sure how to get both as classes.
UPDATE: Concatenated the dataframe values, but why is the accuracy of the model 1.0?
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os
df = pd.read_csv("covidpopulation2.csv")
print(df.head())
99853 8254 219 0.649999976 0.80763793
0 99853 8254 219 0.65 0.807638
1 48490 4007 227 0.49 0.357625
2 190179 8927 107 0.54 0.853186
3 190179 8927 107 0.54 0.853186
4 190179 8927 107 0.54 0.853186
print(df.describe())
99853 8254 219 0.649999976 0.80763793
count 1.342500e+04 13425.000000 13425.000000 13425.000000 13425.000000
mean 3.095292e+05 20555.570056 225.864655 0.473157 0.684484
std 5.070872e+05 28547.608184 218.078176 0.184501 0.167985
min 1.225700e+04 26.000000 2.000000 0.000000 0.357625
25% 5.456200e+04 1674.000000 28.000000 0.370000 0.563528
50% 1.581740e+05 8254.000000 148.000000 0.490000 0.660156
75% 2.992510e+05 29575.000000 453.000000 0.630000 0.838449
max 2.234475e+06 119941.000000 621.000000 0.770000 0.983146
df = pd.read_csv("covidpopulation2.csv", na_values = ['?'], names = ['covidcases','coviddeaths','mortalityperm','trustngov','poptrusthealth'])
print(df.head())
covidcases coviddeaths mortalityperm trustngov poptrusthealth
0 99853 8254 219 0.65 0.807638
1 99853 8254 219 0.65 0.807638
2 48490 4007 227 0.49 0.357625
3 190179 8927 107 0.54 0.853186
4 190179 8927 107 0.54 0.853186
print(df.describe())
covidcases coviddeaths mortalityperm trustngov poptrusthealth
count 1.342600e+04 13426.000000 13426.000000 13426.00000 13426.000000
mean 3.095136e+05 20554.653806 225.864144 0.47317 0.684493
std 5.070715e+05 28546.742358 218.070062 0.18450 0.167982
min 1.225700e+04 26.000000 2.000000 0.00000 0.357625
25% 5.456200e+04 1674.000000 28.000000 0.37000 0.563528
50% 1.581740e+05 8254.000000 148.000000 0.49000 0.660156
75% 2.992510e+05 29575.000000 453.000000 0.63000 0.838449
max 2.234475e+06 119941.000000 621.000000 0.77000 0.983146
df.dropna(inplace=True)
print(df.describe())
covidcases coviddeaths mortalityperm trustngov poptrusthealth
count 1.342600e+04 13426.000000 13426.000000 13426.00000 13426.000000
mean 3.095136e+05 20554.653806 225.864144 0.47317 0.684493
std 5.070715e+05 28546.742358 218.070062 0.18450 0.167982
min 1.225700e+04 26.000000 2.000000 0.00000 0.357625
25% 5.456200e+04 1674.000000 28.000000 0.37000 0.563528
50% 1.581740e+05 8254.000000 148.000000 0.49000 0.660156
75% 2.992510e+05 29575.000000 453.000000 0.63000 0.838449
max 2.234475e+06 119941.000000 621.000000 0.77000 0.983146
all_features = df[['covidcases',
'coviddeaths',
'mortalityperm',
'trustngov',
'poptrusthealth',]].values
all_classes = (df['poptrusthealth'].values + df['trustngov'].values)
willing = 0
unwilling = 0
label = [None] * 13426
for i in range(len(all_classes)):
    if all_classes[i] > 0.70:
        willing += 1
        label[i] = 1
    else:
        unwilling = unwilling + 1
        label[i] = 0
print(willing)
print(unwilling)
all_classes = label
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
all_features_scaled = scaler.fit_transform(all_features)
from sklearn.model_selection import train_test_split
np.random.seed(1234)
(training_inputs,testing_inputs,training_classes,testing_classes) = train_test_split(all_features_scaled,all_classes,train_size = 0.8,test_size = 0.2,random_state = 1)
from sklearn.tree import DecisionTreeClassifier
clf=DecisionTreeClassifier(random_state=1)
clf.fit(training_inputs, training_classes)
DecisionTreeClassifier(random_state=1)
print(clf)
DecisionTreeClassifier(random_state=1)
print('the accuracy of the decision tree is:',clf.score(testing_inputs, testing_classes))
the accuracy of the decision tree is: 1.0
import pydotplus
from sklearn import tree
import collections
import graphviz
feature_names = ['covidcases','coviddeaths', 'mortalityperm','trustngov',
'poptrusthealth']
dot_data = tree.export_graphviz(clf, feature_names = feature_names, out_file =None, filled = True, rounded = True)
graph = pydotplus.graph_from_dot_data(dot_data)
colors = ('turquoise','orange')
edges = collections.defaultdict(list)
for edge in graph.get_edge_list():
    edges[edge.get_source()].append(int(edge.get_destination()))
for edge in edges:
    edges[edge].sort()
    for i in range(2):
        dest = graph.get_node(str(edges[edge][i]))[0]
        dest.set_fillcolor(colors[i])
graph.write_png('tree.png')
Any help or ideas would be appreciated.
Sorry, but this makes no sense from a machine learning point of view. Your label is directly created from the input features. That's why the model accuracy is 100%.
Here is your final classifier (without needing any machine learning):
if trustingov + poptrusthealth > 0.7 predict 1, otherwise predict 0.
It is perfectly possible to get 100% accuracy on the training data, because the ML algorithm has already seen it.
You have to apply your model to data that was not used during the learning phase. This is usually done by splitting the data into a training set and a test set.
Then you train/fit the model on the training data only, and test it and calculate the accuracy on the test data. The test accuracy tells you whether your model is well trained and working.
Held-out test data is essential for a sound evaluation: it gives you an unbiased estimate of the accuracy.
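The leakage described in the first answer can be reproduced on synthetic data. The sketch below makes several assumptions that are not in the original post: the data is random, the threshold of 1.0 is a hypothetical cut-off chosen so that both classes are common, and the helper holdout_accuracy is made up for illustration. It trains the same kind of tree twice, once with the two trust columns that define the label and once without them.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 2000
trust_gov = rng.uniform(0.0, 0.8, n)
trust_health = rng.uniform(0.35, 1.0, n)
unrelated = rng.normal(size=(n, 3))  # stand-ins for the case and death columns

THRESHOLD = 1.0  # hypothetical cut-off, picked so both classes are common
label = (trust_gov + trust_health > THRESHOLD).astype(int)


def holdout_accuracy(features):
    """Fit a decision tree on 80% of the rows and score it on the other 20%."""
    X_train, X_test, y_train, y_test = train_test_split(
        features, label, test_size=0.2, random_state=1)
    clf = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
    return clf.score(X_test, y_test)


leaky = holdout_accuracy(np.column_stack([unrelated, trust_gov, trust_health]))
honest = holdout_accuracy(unrelated)
print(f"with the two trust columns: {leaky:.3f}")
print(f"without them:               {honest:.3f}")
```

Even on held-out rows the first score stays close to perfect, because the tree only has to rediscover the rule that produced the label; without those columns it has nothing to learn from.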

GeoPandas - grid scattered data and reproject

I need to grid scattered data in a GeoPandas dataframe to a regular grid (e.g. 1 degree) and get the mean values of the individual grid boxes and secondly plot this data with various projections.
The first point I managed to achieve using the gpd_lite_toolbox.
This result I can plot on a simple lat lon map, however trying to convert this to any other projection fails.
Here is a small example with some artificial data showing my issue:
import gpd_lite_toolbox as glt
import geopandas as gpd
import matplotlib.pyplot as plt
import pandas as pd
from shapely import wkt
# creating the artificial df
df = pd.DataFrame(
{'data': [20, 15, 17.5, 11.25, 16],
'Coordinates': ['POINT(-58.66 -34.58)', 'POINT(-47.91 -15.78)',
'POINT(-70.66 -33.45)', 'POINT(-74.08 4.60)',
'POINT(-66.86 10.48)']})
# converting the df to a gdf with projection
df['Coordinates'] = df['Coordinates'].apply(wkt.loads)
crs = {'init': 'epsg:4326'}
gdf = gpd.GeoDataFrame(df, crs=crs, geometry='Coordinates')
# gridding the data using the gridify_data function from the toolbox and setting grids without data to nan
g1 = glt.gridify_data(gdf, 1, 'data', cut=False)
g1 = g1.where(g1['data'] > 1)
# simple plot of the gridded data
fig, ax = plt.subplots(ncols=1, figsize=(20, 10))
g1.plot(ax=ax, column='data', cmap='jet')
# trying to convert to (any) other projection
g2 = g1.to_crs({'init': 'epsg:3395'})
# I get the following error
---------------------------------------------------------------------------
AttributeError: 'float' object has no attribute 'is_empty'
I would also be happy to use different gridding function if this solves the problem
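Your g1 contains too many NaN values. where() keeps every row and sets the rows that fail the condition to NaN, including their geometry, and to_crs() cannot reproject a NaN geometry: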
g1 = g1.where(g1['data'] > 1)
print(g1)
geometry data
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 POLYGON ((-74.08 5.48, -73.08 5.48, -73.08 4.4... 11.25
...
You should use g1[g1['data'] > 1] instead of g1.where(g1['data'] > 1).
g1 = g1[g1['data'] > 1]
print(g1)
geometry data
5 POLYGON ((-74.08 5.48, -73.08 5.48, -73.08 4.4... 11.25
181 POLYGON ((-71.08 -32.52, -70.08 -32.52, -70.08... 17.50
322 POLYGON ((-67.08 10.48, -66.08 10.48, -66.08 9... 16.00
735 POLYGON ((-59.08 -34.52, -58.08 -34.52, -58.08... 20.00
1222 POLYGON ((-48.08 -15.52, -47.08 -15.52, -47.08... 15.00
g2 = g1.to_crs({'init': 'epsg:3395'})
print(g2)
geometry data
5 POLYGON ((-8246547.877965705 606885.3761893312... 11.25
181 POLYGON ((-7912589.405585884 -3808795.10464339... 17.50
322 POLYGON ((-7467311.442412791 1165421.424891677... 16.00
735 POLYGON ((-6576755.516066602 -4074627.00861716... 20.00
1222 POLYGON ((-5352241.117340593 -1737775.44359649... 15.00
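The difference between the two calls does not depend on GeoPandas and can be seen on a plain pandas frame. A minimal sketch, with placeholder strings standing in for the polygons:

```python
import numpy as np
import pandas as pd

grid = pd.DataFrame({
    "geometry": ["cell A", "cell B", "cell C", "cell D"],  # placeholders for polygons
    "data": [np.nan, 11.25, np.nan, 17.5],
})

masked = grid.where(grid["data"] > 1)  # same shape; rows that fail become all-NaN
filtered = grid[grid["data"] > 1]      # rows that fail are dropped

print(masked)
print(filtered)
```

In the masked frame the geometry of the empty cells is NaN as well, and that NaN float is what the reprojection trips over.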

How to sum up Y values for bins instead of averaging?

I have the following dataframe data:
import pandas as pd
from io import StringIO
data = pd.read_table(StringIO("""time_diff avg_trips_per_day
631 1.0
231 1.0
431 1.0
7031 1.0
17231 1.0
20000 20.0
21000 15.0
22000 10.0"""), sep=r"\s+")  # delim_whitespace is deprecated in recent pandas
I create a bar chart as follows:
import seaborn as sns
data['timegroup'] = pd.qcut(data['time_diff'], 3)
sns.barplot(x='timegroup', y='avg_trips_per_day', data=data)
Currently it takes the values of avg_trips_per_day for each bin (timegroup) and calculates a mean avg_trips_per_day.
However, I want to sum-up the values of avg_trips_per_day for each bin timegroup instead of using mean. How can I do this?
Use the estimator parameter of barplot:
sns.barplot(x='timegroup', y='avg_trips_per_day', data=data, estimator=sum)
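To check the bar heights without drawing anything, the same aggregation can be done with a pandas groupby. This sketch reuses the data from the question; the variable name summed is just for illustration.

```python
from io import StringIO

import pandas as pd

data = pd.read_csv(StringIO("""time_diff avg_trips_per_day
631 1.0
231 1.0
431 1.0
7031 1.0
17231 1.0
20000 20.0
21000 15.0
22000 10.0"""), sep=r"\s+")
data["timegroup"] = pd.qcut(data["time_diff"], 3)

# Sum per bin; this is what estimator=sum makes barplot draw.
summed = data.groupby("timegroup", observed=True)["avg_trips_per_day"].sum()
print(summed)
```

The three bins sum to 3.0, 2.0 and 45.0, whereas the default mean estimator would give 1.0, 1.0 and 15.0.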
