Subclassing pandas DataFrame and setting a field in the constructor - python

I'm trying to subclass a pandas data structure. If I set a field on the instance after creating it, it works fine.
import seaborn as sns
import pandas as pd
df = sns.load_dataset('iris')
class Results(pd.DataFrame):
    def __init__(self, *args, **kwargs):
        # use the __init__ method from DataFrame to ensure
        # that we're inheriting the correct behavior
        super(Results, self).__init__(*args, **kwargs)

    @property
    def _constructor(self):
        return Results
result_object = Results(df)
result_object['scheme'] = 'not_default'
print(result_object.head(5))
>>> sepal_length sepal_width petal_length petal_width species scheme
0 5.1 3.5 1.4 0.2 setosa not_default
1 4.9 3.0 1.4 0.2 setosa not_default
2 4.7 3.2 1.3 0.2 setosa not_default
3 4.6 3.1 1.5 0.2 setosa not_default
4 5.0 3.6 1.4 0.2 setosa not_default
However, when I try to set the field inside the constructor, it does not work, and I don't understand the _constructor method under the hood well enough to tell why.
import seaborn as sns
import pandas as pd
df = sns.load_dataset('iris')
class Results(pd.DataFrame):
    def __init__(self, *args, scheme='default', **kwargs):
        # use the __init__ method from DataFrame to ensure
        # that we're inheriting the correct behavior
        super(Results, self).__init__(*args, **kwargs)
        self['scheme'] = scheme

    @property
    def _constructor(self):
        return Results
result_object = Results(df.copy(),scheme='not_default')
print(result_object.head(5))
>>>
# scheme is still 'default'
sepal_length sepal_width petal_length petal_width species scheme
0 5.1 3.5 1.4 0.2 setosa default
1 4.9 3.0 1.4 0.2 setosa default
2 4.7 3.2 1.3 0.2 setosa default
3 4.6 3.1 1.5 0.2 setosa default
4 5.0 3.6 1.4 0.2 setosa default
Notice the scheme field still says default.
Is there any way to set a field in the instance constructor?

Your current version creates scheme as an attribute (like .index, .columns):
result_object.scheme
# 0 not_default
# 1 not_default
# ...
# 148 not_default
# 149 not_default
# Name: scheme, Length: 150, dtype: object
To make it a proper column, you can modify the incoming data before sending it to super():
class Results(pd.DataFrame):
    def __init__(self, data=None, *args, scheme='default', **kwargs):
        # add column to incoming data
        if isinstance(data, pd.DataFrame):
            data['scheme'] = scheme
        # pass data positionally so it cannot clash with *args
        super(Results, self).__init__(data, *args, **kwargs)

    @property
    def _constructor(self):
        return Results
df = sns.load_dataset('iris')
result_object = Results(df.copy(), scheme='not_default')
# sepal_length sepal_width petal_length petal_width species scheme
# 0 5.1 3.5 1.4 0.2 setosa not_default
# 1 4.9 3.0 1.4 0.2 setosa not_default
# 2 4.7 3.2 1.3 0.2 setosa not_default
# 3 4.6 3.1 1.5 0.2 setosa not_default
# ... ... ... ... ... ... ...
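As for why the constructor version in the question appears to fail: one plausible explanation (an assumption here, since the internals differ between pandas versions) is that methods such as head() build their result through _constructor, which calls Results(...) again without the scheme argument, so the column is overwritten with 'default' in the returned object. A minimal sketch that sidesteps this by writing the column only when scheme is passed explicitly; the toy frame and the None default are illustrative choices, not part of the original code:

```python
import pandas as pd


class Results(pd.DataFrame):
    def __init__(self, data=None, *args, scheme=None, **kwargs):
        super().__init__(data, *args, **kwargs)
        # Write the column only when the caller asked for it. Objects that
        # pandas re-creates internally via _constructor arrive with
        # scheme=None, so an existing 'scheme' column is left untouched.
        if scheme is not None:
            self["scheme"] = scheme

    @property
    def _constructor(self):
        return Results


# Small stand-in for the iris data (hypothetical values).
toy = pd.DataFrame({"sepal_length": [5.1, 4.9, 4.7], "species": ["setosa"] * 3})
result_object = Results(toy.copy(), scheme="not_default")

head = result_object.head(2)
print(type(head).__name__)
print(head["scheme"].tolist())
```

The trade-off is that a Results built without scheme has no such column at all, so code that relies on it should check for the column first.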

Related

Sklearn OneHotEncoding inside pipeline is converting all data types not only categorical/object ones

I have been experimenting with Scikit Learn's Pipeline class and the Iris dataset. A short summary of each section of my code is as follows:
df:
Id slengthCm sWidthCm pLengthCm PWidthCm Species
0 1 5.1 3.5 1.4 0.2 Iris-setosa
1 2 4.9 3.0 1.4 0.2 Iris-setosa
2 3 4.7 3.2 1.3 0.2 Iris-setosa
3 4 4.6 3.1 1.5 0.2 Iris-setosa
4 5 5.0 3.6 1.4 0.2 Iris-setosa
... ... ... ... ... ... ...
145 146 6.7 3.0 5.2 2.3 Iris-virginica
146 147 6.3 2.5 5.0 1.9 Iris-virginica
147 148 6.5 3.0 5.2 2.0 Iris-virginica
dtypes:
Id int64
SepalLengthCm float64
SepalWidthCm float64
PetalLengthCm float64
PetalWidthCm float64
Species object
dtype: object
pipeline elements:
class Debug(BaseEstimator, TransformerMixin):
    def transform(self, X):
        print(pd.DataFrame(X).head())
        print(X.shape)
        self.X = X
        self.df = pd.DataFrame(self.X)
        return X

    def fit(self, X, y=None, **fit_params):
        return self

pipeline = Pipeline(steps=[('one_hot_encoding', OneHotEncoder(sparse=False)),
                           ('debug_1', Debug()),
                           ('standard_scaler', StandardScaler(with_mean=False)),
                           ('debug_2', Debug()),
                           ('kmeans_clustering', KMeans())])
Now if I fit this pipeline then view the content of the first debug step:
pipeline.fit_transform(df.values)
pipeline.named_steps["debug_2"].df
It seems that the one_hot_encoding step has 0-1 encoded all the values of the df instead of only the Species (object type) column.
Is there a way to make OHE inside a pipeline apply only to specified columns, or only to categorical/object ones?
You're looking for the ColumnTransformer, possibly with the helper make_column_selector for the specification of which columns to give to each transformer. For example,
preproc = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(with_mean=False), make_column_selector(dtype_include=np.number)),
        ('obj', OneHotEncoder(), make_column_selector(dtype_include=object)),
    ],
)
or, being more explicit about columns,
preproc = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(with_mean=False), ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]),
        ('obj', OneHotEncoder(), ["Species"]),
    ],
)
Then
pipeline = Pipeline(steps=[('preproc', preproc),
                           ('debug', Debug()),
                           ('kmeans_clustering', KMeans())])
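To see the effect in isolation, here is a small self-contained sketch of the explicit-columns variant. The four-row frame is made up for illustration, and the ColumnTransformer must be given a DataFrame (not df.values) so that columns can be selected by name:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny stand-in for the iris frame (hypothetical values).
toy = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 6.7, 6.3],
    "PetalLengthCm": [1.4, 1.4, 5.2, 5.0],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-virginica", "Iris-virginica"],
})

preproc = ColumnTransformer(
    transformers=[
        # numeric columns are scaled only
        ("num", StandardScaler(), ["SepalLengthCm", "PetalLengthCm"]),
        # only the Species column is one-hot encoded
        ("obj", OneHotEncoder(), ["Species"]),
    ],
)

encoded = preproc.fit_transform(toy)
print(encoded.shape)  # two scaled columns plus one column per species
```

The output columns follow the order of the transformers list, so the scaled numeric columns come first and the one-hot columns last.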

how to generate a list within a list delimited by a space

How do I replicate the structure of the result of itertools.product?
As you know, itertools.product gives us an iterator object, and we need to put its items in a list before we can print them,
something like this, right?
import itertools
import numpy as np
CN = np.asarray(list(itertools.product([0, 1], repeat=5)))
print(CN)
I want to be able to make something like that, but with the data coming from a CSV file, so I want something like this:
#PSEUDOCODE
import pandas as pd
df = pd.read_csv('csv here')
#a b c d are the columns that i want to get
x = list(df['a'] df['b'] df['c'] df['d'])
print(x)
so the result will be something like this
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5. 3.6 1.4 0.2]
[5.4 3.9 1.7 0.4]
[4.6 3.4 1.4 0.3]
[5. 3.4 1.5 0.2]
[4.4 2.9 1.4 0.2]
[4.9 3.1 1.5 0.1]]
How can I do that?
EDIT:
I am trying to learn how to do recursive feature elimination, and some of the code I found through Google uses the iris data set:
from sklearn import datasets
dataset = datasets.load_iris()
x = dataset.data
print(x)
and when printed it looked something like this
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5. 3.6 1.4 0.2]
[5.4 3.9 1.7 0.4]
[4.6 3.4 1.4 0.3]
[5. 3.4 1.5 0.2]
[4.4 2.9 1.4 0.2]
[4.9 3.1 1.5 0.1]]
How could I make my dataset look like that so I can use this RFE template?
# Recursive Feature Elimination
from sklearn import datasets
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# load the iris datasets
dataset = datasets.load_iris()
# create a base classifier used to evaluate a subset of attributes
model = LogisticRegression()
# create the RFE model and select 3 attributes
rfe = RFE(model, n_features_to_select=3)
print(rfe)
rfe = rfe.fit(dataset.data, dataset.target)
print("features:",dataset.data)
print("target:",dataset.target)
print(rfe)
# summarize the selection of the attributes
print(rfe.support_)
print(rfe.ranking_)
You don't have to. If you want to use the rfe.fit function, you only need to pass the features and the target separately.
So if your df is like:
a b c d target
0 5.1 3.5 1.4 0.2 1
1 4.9 3.0 1.4 0.2 1
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
4 5.0 3.6 1.4 0.2 1
5 5.4 3.9 1.7 0.4 1
6 4.6 3.4 1.4 0.3 0
7 5.0 3.4 1.5 0.2 0
8 4.4 2.9 1.4 0.2 1
9 4.9 3.1 1.5 0.1 1
you can use:
...
rfe = rfe.fit(df[['a', 'b', 'c', 'd']], df['target'])
...
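If you do want the same array-of-rows structure that dataset.data has, selecting the columns and calling to_numpy() should give it. A minimal sketch, using a three-row frame built inline in place of the CSV file (the values and the target column are illustrative):

```python
import pandas as pd

# Stand-in for pd.read_csv(...); values are illustrative.
df = pd.DataFrame({
    "a": [5.1, 4.9, 4.7],
    "b": [3.5, 3.0, 3.2],
    "c": [1.4, 1.4, 1.3],
    "d": [0.2, 0.2, 0.2],
    "target": [1, 1, 0],
})

# 2D array with one row per record, like dataset.data
X = df[["a", "b", "c", "d"]].to_numpy()
# 1D array of labels, like dataset.target
y = df["target"].to_numpy()

print(X)
print(y)
```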

Multiple hover_name for 3D plot in Python Plotly

I would like to add more hover text using multiple columns for my 3D plot model.
For example:
df:
sepal_length sepal_width petal_length petal_width species species_id
0 5.1 3.5 1.4 0.2 setosa 1
1 4.9 3.0 1.4 0.2 setosa 1
2 4.7 3.2 1.3 0.2 setosa 1
3 4.6 3.1 1.5 0.2 setosa 1
4 5.0 3.6 1.4 0.2 setosa 1
5 5.4 3.9 1.7 0.4 setosa 1
Code:
import plotly.express as px
df = px.data.iris()
fig = px.scatter_3d(df, x='sepal_length', y='sepal_width', z='petal_width',
color='petal_length', symbol='species', hover_name="species")
fig.show()
produced plot
In the plot, the hover_name="species" shows only species in the hover_name. How can I include species_id in hover_name as well?
Simply pass the additional columns through the hover_data argument, as below:
import plotly.express as px
df = px.data.iris()
fig = px.scatter_3d(df, x='sepal_length', y='sepal_width', z='petal_width',
                    color='petal_length', symbol='species', hover_name="species",
                    hover_data=["species", "species_id"])
fig.show()
For details, see the Plotly documentation page "Customizing Hover text with Plotly Express".

Find the range of all columns (difference between maximum and minimum) while gracefully handling string columns

I have a scenario where I have to find the range of all the columns in a dataset which contains multiple columns with numeric value but one column has string values.
Please find sample records from my data set below:
import seaborn as sns
iris = sns.load_dataset('iris')
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
The maximum and minimum of these columns are given by
sepal_length 7.9
sepal_width 4.4
petal_length 6.9
petal_width 2.5
species virginica
dtype: object
and
sepal_length 4.3
sepal_width 2
petal_length 1
petal_width 0.1
species setosa
dtype: object
...respectively. To find the range of all the columns I can use the below code:
iris.max() - iris.min()
But as the column 'species' has string values, the above code is throwing the below error:
TypeError: unsupported operand type(s) for -: 'str' and 'str'
If the above error occurs, I want to print the value as the
"{max string value}" - "{min string value}"
IOW, my expected output would be something like:
sepal_length 3.6
sepal_width 2.4
petal_length 5.9
petal_width 2.4
species virginica - setosa
How do I resolve this issue?
Handle the numeric and string columns separately. You can select these using df.select_dtypes. Finally, concat the result.
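import numpy as np
import pandas as pd

# numeric columns: range = max - min
u = iris.select_dtypes(include=[np.number])
# U = u.apply(np.ptp, axis=0)
U = u.max() - u.min()
# string columns: "max - min" as text
v = iris.select_dtypes(include=[object])
V = v.max() + ' - ' + v.min()
pd.concat([U, V])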
sepal_length 3.6
sepal_width 2.4
petal_length 5.9
petal_width 2.4
species virginica - setosa
dtype: object
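An alternative that keeps the original column order is to decide per column. A minimal sketch, assuming a small made-up frame in place of iris; column_range is a hypothetical helper name:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def column_range(col):
    """Return max - min for numeric columns, 'max - min' as text otherwise."""
    if is_numeric_dtype(col):
        return col.max() - col.min()
    return f"{col.max()} - {col.min()}"


# Illustrative data only.
toy = pd.DataFrame({
    "sepal_length": [5.1, 4.3, 7.9],
    "species": ["setosa", "virginica", "versicolor"],
})

ranges = pd.Series({name: column_range(toy[name]) for name in toy.columns})
print(ranges)
```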

Matplotlib: need to find source data from class attributes

I have a lines object which was created with the following:
junk = plt.plot([xxxx], [yyyy])
for x in junk:
    print(type(x))
<class 'matplotlib.lines.Line2D'>
I need to find the names of the two lists 'xxxx' and 'yyyy'. How can I get them from the class attributes?
You can use dir to see the contents of an object in Python, or check the docs for the class. I guess the data you are looking for are xdata and ydata (although I'm a bit confused: in your post you ask for the names of the lists?)
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 5, 0.1)
y = np.sin(x)
junk = plt.plot(x, y)
for line in junk:
    # print(dir(line))
    print(line.get_xdata())
    print(line.get_ydata())
[ 0. 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1. 1.1 1.2 1.3 1.4
1.5 1.6 1.7 1.8 1.9 2. 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9
3. 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4. 4.1 4.2 4.3 4.4
4.5 4.6 4.7 4.8 4.9]
[ 0. 0.09983342 0.19866933 0.29552021 0.38941834 0.47942554
0.56464247 0.64421769 0.71735609 0.78332691 0.84147098 0.89120736
0.93203909 0.96355819 0.98544973 0.99749499 0.9995736 0.99166481
0.97384763 0.94630009 0.90929743 0.86320937 0.8084964 0.74570521
0.67546318 0.59847214 0.51550137 0.42737988 0.33498815 0.23924933
0.14112001 0.04158066 -0.05837414 -0.15774569 -0.2555411 -0.35078323
-0.44252044 -0.52983614 -0.61185789 -0.68776616 -0.7568025 -0.81827711
-0.87157577 -0.91616594 -0.95160207 -0.97753012 -0.993691 -0.99992326
-0.99616461 -0.98245261]
Hope it helps.
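On the names themselves: a Line2D only stores the data it was given, and a Python object has no record of the variable names that once pointed at it, so 'xxxx' and 'yyyy' cannot be recovered from the line. If you need a name later, one option is to attach it yourself through the label. A small sketch of that idea (the label text is an arbitrary choice):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

xxxx = np.arange(0, 5, 0.1)
yyyy = np.sin(xxxx)

# Store a human-readable name on the line at plot time.
lines = plt.plot(xxxx, yyyy, label="yyyy vs xxxx")
line = lines[0]

recovered_x = line.get_xdata()
recovered_y = line.get_ydata()
print(line.get_label())
```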
