Multiple hover_name for 3D plot in Python Plotly - python

I would like to add more hover text using multiple columns for my 3D plot model.
For example:
df:
sepal_length sepal_width petal_length petal_width species species_id
0 5.1 3.5 1.4 0.2 setosa 1
1 4.9 3.0 1.4 0.2 setosa 1
2 4.7 3.2 1.3 0.2 setosa 1
3 4.6 3.1 1.5 0.2 setosa 1
4 5.0 3.6 1.4 0.2 setosa 1
5 5.4 3.9 1.7 0.4 setosa 1
Code:
import plotly.express as px
df = px.data.iris()
fig = px.scatter_3d(df, x='sepal_length', y='sepal_width', z='petal_width',
color='petal_length', symbol='species', hover_name="species")
fig.show()
produced plot
In the plot, hover_name="species" shows only the species in the hover label. How can I include species_id in hover_name as well?

Simply pass the additional columns through the hover_data argument:
import plotly.express as px
df = px.data.iris()
fig = px.scatter_3d(df, x='sepal_length', y='sepal_width', z='petal_width',
color='petal_length', symbol='species', hover_name="species", hover_data=["species", "species_id"])
fig.show()
The docs can be found here: Customizing Hover text with Plotly Express

Related

Subclassing pandas dataframe and setting field in constructor

I'm trying to subclass pandas data structure. If I set a field on the instance, it works fine.
import seaborn as sns
import pandas as pd
df = sns.load_dataset('iris')
class Results(pd.DataFrame):
    def __init__(self, *args, **kwargs):
        # use the __init__ method from DataFrame to ensure
        # that we're inheriting the correct behavior
        super(Results, self).__init__(*args, **kwargs)
    @property
    def _constructor(self):
        return Results
result_object = Results(df)
result_object['scheme'] = 'not_default'
print(result_object.head(5))
>>> sepal_length sepal_width petal_length petal_width species scheme
0 5.1 3.5 1.4 0.2 setosa not_default
1 4.9 3.0 1.4 0.2 setosa not_default
2 4.7 3.2 1.3 0.2 setosa not_default
3 4.6 3.1 1.5 0.2 setosa not_default
4 5.0 3.6 1.4 0.2 setosa not_default
I don't quite understand the _constructor method under the hood well enough to tell why this does not work.
import seaborn as sns
import pandas as pd
df = sns.load_dataset('iris')
class Results(pd.DataFrame):
    def __init__(self, *args, scheme='default', **kwargs):
        # use the __init__ method from DataFrame to ensure
        # that we're inheriting the correct behavior
        super(Results, self).__init__(*args, **kwargs)
        self['scheme'] = scheme
    @property
    def _constructor(self):
        return Results
result_object = Results(df.copy(),scheme='not_default')
print(result_object.head(5))
>>>
# scheme is still 'default'
sepal_length sepal_width petal_length petal_width species scheme
0 5.1 3.5 1.4 0.2 setosa default
1 4.9 3.0 1.4 0.2 setosa default
2 4.7 3.2 1.3 0.2 setosa default
3 4.6 3.1 1.5 0.2 setosa default
4 5.0 3.6 1.4 0.2 setosa default
Notice the scheme field still says default.
Is there any way to set a field in the instance constructor?
Your current version creates scheme as an attribute (like .index, .columns):
result_object.scheme
# 0 not_default
# 1 not_default
# ...
# 148 not_default
# 149 not_default
# Name: scheme, Length: 150, dtype: object
To make it a proper column, you can modify the incoming data before sending it to super():
class Results(pd.DataFrame):
    def __init__(self, data=None, *args, scheme='default', **kwargs):
        # add column to incoming data
        if isinstance(data, pd.DataFrame):
            data['scheme'] = scheme
        super(Results, self).__init__(data=data, *args, **kwargs)
    @property
    def _constructor(self):
        return Results
df = sns.load_dataset('iris')
result_object = Results(df.copy(), scheme='not_default')
# sepal_length sepal_width petal_length petal_width species scheme
# 0 5.1 3.5 1.4 0.2 setosa not_default
# 1 4.9 3.0 1.4 0.2 setosa not_default
# 2 4.7 3.2 1.3 0.2 setosa not_default
# 3 4.6 3.1 1.5 0.2 setosa not_default
# ... ... ... ... ... ... ...
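As far as I can tell, pandas goes through _constructor whenever it builds a derived frame (for example in head()), so an __init__ that assigns a column may run again there with the default scheme. A sketch of one way to sidestep that, assuming it is acceptable to add the column in a separate factory method instead of __init__ (with_scheme is a hypothetical name, and the small frame stands in for the iris data):

```python
import pandas as pd


class Results(pd.DataFrame):
    @property
    def _constructor(self):
        # keep derived frames (head(), slices, ...) as Results
        return Results

    @classmethod
    def with_scheme(cls, data, scheme="default"):
        # hypothetical factory: assign() returns a new frame, so the
        # caller's data is left untouched and __init__ stays unmodified
        return cls(pd.DataFrame(data).assign(scheme=scheme))


df = pd.DataFrame({"sepal_length": [5.1, 4.9, 4.7], "species": ["setosa"] * 3})
result_object = Results.with_scheme(df, scheme="not_default")
head = result_object.head(2)
print(head)
```

Because __init__ is not overridden, nothing rewrites the scheme column when pandas re-creates the object internally.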

Adding column names and values to statistic output in Python?

Background:
I'm currently developing some data profiling in SQL Server. This consists of calculating aggregate statistics on the values in targeted columns.
I'm using SQL for most of the heavy lifting, but calling Python for some of the statistics that SQL is poor at calculating. I'm leveraging the Pandas package through SQL Server Machine Learning Services.
However, I'm currently developing this script in Visual Studio. The SQL portion is irrelevant other than as background.
Problem:
My issue is that when I call one of the Python statistics functions, it produces the output as a series with the labels seemingly not part of the data. I cannot access the labels at all. I need the values of these labels, and I need to normalize the data and insert a column with static values describing which calculation was performed on that row.
Constraints:
I will need to normalize each statistic so I can union the datasets and pass the values back to SQL for further processing. All output needs to accept dynamic schemas, so no hardcoding labels etc.
Attempted solutions:
I've tried explicitly coercing output to dataframes. This just results in a series with label "0".
I've also tried adding static values to the columns. This just adds the target column name as one of the inaccessible labels, and the intended static value as part of the series.
I've searched many times for a solution, and couldn't find anything relevant to the problem.
Code and results below. Using the iris dataset as an example.
###########################
## AGG STATS TEST SCRIPT
##
###########################
#LOAD MODULES
import pandas as pds
#GET SAMPLE DATASET
iris = pds.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
#CENTRAL TENDENCY
mode1 = iris.mode()
stat_mode = pds.melt(
mode1
)
stat_median = iris.median()
stat_median['STAT_NAME'] = 'STAT_MEDIAN' #Try to add a column with the value 'STAT_MEDIAN'
#AGGREGATE STATS
stat_describe = iris.describe()
#PRINT RESULTS
print(iris)
print(stat_median)
print(stat_describe)
###########################
## OUTPUT
##
###########################
>>> #PRINT RESULTS
... print(iris) #ORIGINAL DATASET
...
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
.. ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica
[150 rows x 5 columns]
>>> print(stat_median) #YOU CAN SEE THAT IT INSERTED COLUMN INTO ROW LABELS, VALUE INTO RESULTS SERIES
sepal_length 5.8
sepal_width 3
petal_length 4.35
petal_width 1.3
STAT_NAME STAT_MEDIAN
dtype: object
>>> print(stat_describe) #BASIC DESCRIPTIVE STATS, NEED TO LABEL THE STATISTIC NAMES TO UNPIVOT THIS
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.057333 3.758000 1.199333
std 0.828066 0.435866 1.765298 0.762238
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
>>>
Any assistance is greatly appreciated. Thank you!
I figured it out. There's a method called reset_index that converts the index to a column and creates a new numerical index.
stat_median = pds.DataFrame(stat_median)
stat_median.reset_index(inplace=True)
stat_median = stat_median.rename(columns={'index' : 'fieldname', 0: 'value'})
stat_median['stat_name'] = 'median'
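The same idea can be extended to the describe() output so that every statistic ends up in one long table. The following is a rough sketch rather than a tested production script: normalize_series is a hypothetical helper, the column names fieldname, value, and stat_name follow the snippet above, and the tiny inline frame stands in for the iris data.

```python
import pandas as pd

iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0],
    "sepal_width": [3.5, 3.0, 3.2, 3.1, 3.6],
    "species": ["setosa"] * 5,
})


def normalize_series(stat, stat_name):
    # hypothetical helper: turn the index labels into a real column
    out = stat.rename_axis("fieldname").reset_index(name="value")
    out["stat_name"] = stat_name
    return out


stat_median = normalize_series(iris.median(numeric_only=True), "median")

# describe() keeps the statistic names in the index; move them into a
# column, then unpivot so there is one row per (statistic, field) pair
stat_describe = (
    iris.describe()
    .rename_axis("stat_name")
    .reset_index()
    .melt(id_vars="stat_name", var_name="fieldname", value_name="value")
)

columns = ["fieldname", "value", "stat_name"]
combined = pd.concat([stat_median[columns], stat_describe[columns]], ignore_index=True)
print(combined)
```

Nothing here hardcodes the field names, so the same code should work for other schemas as long as the statistics are computed on numeric columns.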

Find the range of all columns (difference between maximum and minimum) while gracefully handling string columns

I have a scenario where I have to find the range of every column in a dataset in which most columns hold numeric values but one column holds strings.
Please find sample records from my data set below:
import seaborn as sns
iris = sns.load_dataset('iris')
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
The maximum and minimum of these columns are given by
sepal_length 7.9
sepal_width 4.4
petal_length 6.9
petal_width 2.5
species virginica
dtype: object
and
sepal_length 4.3
sepal_width 2
petal_length 1
petal_width 0.1
species setosa
dtype: object
...respectively. To find the range of all the columns I can use the below code:
iris.max() - iris.min()
But as the column 'species' has string values, the above code is throwing the below error:
TypeError: unsupported operand type(s) for -: 'str' and 'str'
For a string column, where this error occurs, I want to print the value as
"{max string value}" - "{min string value}"
IOW, my expected output would be something like:
sepal_length 3.6
sepal_width 2.4
petal_length 5.9
petal_width 2.4
species virginica - setosa
How do I resolve this issue?
Handle the numeric and string columns separately. You can select these using df.select_dtypes. Finally, concat the result.
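import numpy as np
import pandas as pd
u = iris.select_dtypes(include=[np.number])
# U = u.apply(np.ptp, axis=0)
U = u.max() - u.min()
v = iris.select_dtypes(include=[object])
V = v.max() + ' - ' + v.min()
# Series.append was removed in pandas 2.0; pd.concat does the same job
pd.concat([U, V])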
sepal_length 3.6
sepal_width 2.4
petal_length 5.9
petal_width 2.4
species virginica - setosa
dtype: object
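An alternative sketch that handles both kinds of column in a single pass, assuming a per-column function is acceptable (column_range is a hypothetical helper name, and the three-row frame stands in for the iris data):

```python
import pandas as pd

iris = pd.DataFrame({
    "sepal_length": [5.1, 7.9, 4.3],
    "petal_width": [0.2, 2.5, 0.1],
    "species": ["setosa", "virginica", "versicolor"],
})


def column_range(col):
    # numeric columns: plain max - min
    if pd.api.types.is_numeric_dtype(col):
        return col.max() - col.min()
    # anything else: fall back to the "max - min" string form
    return f"{col.max()} - {col.min()}"


# apply() calls column_range once per column and collects the results
ranges = iris.apply(column_range)
print(ranges)
```

This avoids the separate select_dtypes calls, at the cost of one Python-level function call per column.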

Plot all columns from two separate dataframes by group in Python

I have two dataframes with lots of columns and rows with the same structure as the examples below:
df1:
Time Group Var1 Var2 Var3
1/1/2016 A 0.1 1.1 2.1
2/1/2016 A 0.2 1.2 2.2
1/1/2016 B 3.5 4.5 5.5
2/1/2016 B 3.6 4.6 5.6
df2:
Time Group Var1 Var2 Var3
1/1/2016 A 0.3 1.3 2.3
2/1/2016 A 0.4 1.4 2.4
1/1/2016 B 3.7 4.7 5.7
2/1/2016 B 3.8 4.8 5.8
I would like to write code that creates one plot per column for each group, with Time on the x-axis, plotting the columns with the same name from both dataframes on the same plot.
I was able to write code that does this, but without varying by group:
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
df1_agg = df1.groupby(by=['Time']).sum()
df2_agg = df2.groupby(by=['Time']).sum()
def plots_all_columns(col_names, filename):
    with PdfPages(filename) as pdf:
        for i in col_names:
            plt.figure()
            df1_agg[i].plot(label="df1", legend=True, title=i)
            df2_agg[i].plot(label="df2", legend=True)
            pdf.savefig()  # one PDF page per column
            plt.close('all')
How could I do the same as above but keeping the Group dimension in my dataframe? I have lots of groups so I would need a separate plot for each group category.
Thank you.

Series imported but unused error Python

import numpy as np
from pandas import Series, DataFrame
import pandas as pd
import matplotlib.pyplot as plt
iris_df = DataFrame()
iris_data_path = 'Z:\WORK\Programming\Python\irisdata.csv'
iris_df = pd.read_csv(iris_data_path,index_col=False,header=None,encoding='utf-8')
iris_df.columns = ['sepal length','sepal width','petal length','petal width','class']
print iris_df.columns.values
print iris_df.head()
print iris_df.tail()
irisX = irisdata[['sepal length','sepal width','petal length','petal width']]
print irisX.tail()
irisy = irisdata['class']
print irisy.head()
print irisy.tail()
colors = ['red','green','blue']
markers = ['o','>','x']
irisyn = np.where(irisy=='Iris-setosa',0,np.where(irisy=='Iris-virginica',2,1))
Col0 = irisdata['sepal length']
Col1 = irisdata['sepal width']
Col2 = irisdata['petal length']
Col3 = irisdata['petal width']
plt.figure(num=1,figsize=(16,10))
plt.subplot(2,3,1)
for i in range(len(colors)):
    xs = Col0[irisyn==i]
    xy = Col1[irisyn==i]
    plt.scatter(xs,xy,color=colors[i],marker=markers[i])
plt.legend( ('Iris-setosa', 'Iris-versicolor', 'Iris-virginica') )
plt.xlabel(irisdata.columns[0])
plt.ylabel(irisdata.columns[1])
plt.subplot(2,3,2)
for i in range(len(colors)):
    xs = Col0[irisyn==i]
    xy = Col2[irisyn==i]
    plt.scatter(xs,xy,color=colors[i],marker=markers[i])
plt.xlabel(irisdata.columns[0])
plt.ylabel(irisdata.columns[2])
plt.subplot(2,3,3)
for i in range(len(colors)):
    xs = Col0[irisyn==i]
    xy = Col3[irisyn==i]
    plt.scatter(xs,xy,color=colors[i],marker=markers[i])
plt.xlabel(irisdata.columns[0])
plt.ylabel(irisdata.columns[3])
plt.subplot(2,3,4)
for i in range(len(colors)):
    xs = Col1[irisyn==i]
    xy = Col2[irisyn==i]
    plt.scatter(xs,xy,color=colors[i],marker=markers[i])
plt.xlabel(irisdata.columns[1])
plt.ylabel(irisdata.columns[2])
plt.subplot(2,3,5)
for i in range(len(colors)):
    xs = Col1[irisyn==i]
    xy = Col3[irisyn==i]
    plt.scatter(xs,xy,color=colors[i],marker=markers[i])
plt.xlabel(irisdata.columns[1])
plt.ylabel(irisdata.columns[3])
plt.subplot(2,3,6)
for i in range(len(colors)):
    xs = Col2[irisyn==i]
    xy = Col3[irisyn==i]
    plt.scatter(xs,xy,color=colors[i],marker=markers[i])
plt.xlabel(irisdata.columns[2])
plt.ylabel(irisdata.columns[3])
plt.show()
This is code from Howard Bandy's book Quantitative Technical Analysis. The problem is that it is giving me errors even though I typed it out exactly like it is in the book.
I still get the series imported but unused and undefined name irisdata errors/warnings.
This is in the console:
Code:
runfile('Z:/WORK/Programming/Python/Scripts/irisplotpairsdata2.py', wdir='//AMN/annex/WORK/Programming/Python/Scripts')
['sepal length' 'sepal width' 'petal length' 'petal width' 'class']
sepal length sepal width petal length petal width class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
sepal length sepal width petal length petal width class
145 6.7 3.0 5.2 2.3 Iris-virginica
146 6.3 2.5 5.0 1.9 Iris-virginica
147 6.5 3.0 5.2 2.0 Iris-virginica
148 6.2 3.4 5.4 2.3 Iris-virginica
149 5.9 3.0 5.1 1.8 Iris-virginica
Traceback (most recent call last):
File "<ipython-input-100-f0b2002668bd>", line 1, in <module>
runfile('Z:/WORK/Programming/Python/Scripts/irisplotpairsdata2.py', wdir='//AMN/annex/WORK/Programming/Python/Scripts')
File "C:\MyPrograms\Spyder(Python)\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 685, in runfile
execfile(filename, namespace)
File "C:\MyPrograms\Spyder(Python)\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 71, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "Z:/WORK/Programming/Python/Scripts/irisplotpairsdata2.py", line 24, in <module>
irisX = irisdata[['sepal length','sepal width','petal length','petal width']]
TypeError: list indices must be integers, not list
Obviously, the program does not run.
I'm using Spyder with Python 2.7, which is the platform he was using in the book.
Thanks for any insight.
Well, Python is not wrong. You imported Series but never used it; that is only a warning and does not cause the crash. The crash happens because you are using a variable, irisdata, that your script never defines. (Ctrl+F irisdata in your code and take a look.) Judging by your code, irisdata probably needs to contain the parsed data of Z:\WORK\Programming\Python\irisdata.csv, doesn't it? So you need to parse that file and assign the result to irisdata. See this post
e.g.
import csv
...
irisdata = list(csv.reader(open(iris_data_path, 'rb')))
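One caveat: csv.reader gives back a plain list of rows, and indexing a list with a list of column names fails with the kind of TypeError shown in the traceback. Below is a minimal sketch of an alternative, assuming irisdata was simply meant to be the DataFrame the script already loads as iris_df. It is written for Python 3, and the inline CSV text is a stand-in for irisdata.csv.

```python
import io

import pandas as pd

# stand-in for the contents of irisdata.csv (no header row)
csv_text = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

iris_df = pd.read_csv(io.StringIO(csv_text), index_col=False, header=None)
iris_df.columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'class']

# assumption: the rest of the script expects the same data under this name
irisdata = iris_df

# a DataFrame accepts a list of column names, unlike a plain list
irisX = irisdata[['sepal length', 'sepal width', 'petal length', 'petal width']]
irisy = irisdata['class']
print(irisX.tail())
print(irisy.head())
```

With irisdata bound to the DataFrame, the later irisdata['sepal length'] and irisdata.columns lookups in the script should work as well.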
