Seaborn pairplot not displaying in matrix form - python

I have a DataFrame with the variables below. I am trying to find the relationship by plotting "profit" with other variables excluding "Date".
Date
Billable_Fixed Bid
Billable_Time_Material
Billable_Transaction_Based
Non_Billable
Indirect_Costs
Unbilled_CP_and_AM
Direct_Costs
Profit
Code:
cols = [
'Billable_Fixed Bid',
'Billable_Time_Material',
'Billable_Transaction_Based',
'Non_Billable',
'Unbilled_CP_and_AM',
'Direct_Costs'
]
sns.pairplot(data1, x_vars=cols, y_vars='Profit', height=5, kind='reg')  # 'size' was renamed to 'height' in seaborn 0.9
The problem is that all the plots are displayed in a single row, which makes them too small to read.
I want two plots per row so that each one is clearly visible.
Can anyone help?

As per the comment: Using FacetGrid with col_wrap=2 will solve your problem. Check the examples in the documentation.

Related

Ipywidgets and plotly interaction

I'm trying to make an interactive plot with ipywidgets and plotly, but I'm afraid I'm missing something.
I have a dataframe with coordinates and some other columns. I want to plot it as a scatterplot so that coord1=x, coord2=y, and each marker is colored by the value of a column selected interactively.
Additionally, when I change the column with the interactive menu, the color of every point should change to reflect the column I selected, and the colorbar should rescale to the min and max of the new column.
Furthermore, when I change another selector (selector2), the plot should display only the subset of my dataframe that matches a certain ID: big_grid[big_grid["id_col"]==selector2.value].
Lastly, there should be a range slider widget to adjust the color range of the colorbar.
So far I have this:
big_grid=pd.DataFrame(data=dict(id_col=[1,2,3,4,5],
col1=[0.1,0.2,0.3,0.4,0.5],
col2=[10,20,30,40,50],
coord1=[6,7,8,9,10],
coord2=[6,7,8,9,10]))
list_elem=["col1","col2"]
list_id=big_grid.id_col.values
dropm_elem=widgets.Dropdown(options=list(list_elem))
dropm_id=widgets.SelectMultiple(
options=list_id,
description="Active",
disabled=False
)
rangewidg=widgets.FloatRangeSlider(value=[big_grid[dropm_elem.value].min(),big_grid[dropm_elem.value].max()],
min=big_grid[dropm_elem.value].min(),
max=big_grid[dropm_elem.value].max(),
step=0.001,
readout_format='.3f',
description="Color Range",
continuous_update=False)
fig = go.FigureWidget(data=px.scatter(big_grid,
x="coord1",
y="coord2",
color=big_grid[dropm_elem.value],
color_continuous_scale="Turbo",)
)
def handle_id_change(change):
    fig.data[0]['x']=big_grid[big_grid['id_col'].isin(dropm_id.value)]["coord1"]
    fig.data[0]['y']=big_grid[big_grid['id_col'].isin(dropm_id.value)]["coord2"]
    fig.data[0]['marker']['color']=big_grid[big_grid['id_col'].isin(dropm_id.value)][dropm_elem.value]
    fig.data[0]['marker']['cmin']=big_grid[big_grid['id_col'].isin(dropm_id.value)][dropm_elem.value].min()
    fig.data[0]['marker']['cmax']=big_grid[big_grid['id_col'].isin(dropm_id.value)][dropm_elem.value].max()
def handle_elem_change(change):
    fig.data[0]['marker']['color']=big_grid[big_grid['id_col'].isin(dropm_id.value)][dropm_elem.value]
dropm_elem.observe(handle_elem_change,names='value')
dropm_id.observe(handle_id_change,names='value')
right_box1 =widgets.HBox([fig])
right_box2=widgets.VBox([dropm_elem,dropm_id,rangewidg])
box=widgets.HBox([right_box1,right_box2])
box
With this, selecting the subset (from dropm_id) works, but the range widget and the hover labels are broken. When I change dropm_elem, the color doesn't adjust as I expect; instead it becomes dark and uniform. Also, if you change the column and hover over the points, the tooltip lists the values of col2, but the label still says col1.
I'm afraid that I'm overcomplicating things and that there is an easier way. Could someone enlighten me?
EDIT: If I take a different approach and use a global variable to define the subset to plot, a plotting function, and the widgets.interact function, I can make it work. The problem is that in this case the plot is not a widget, so I cannot put it into a VBox or HBox.
It also still feels wrong, and using global variables is not good practice. I'll provide the code anyway for reference:
def plot(elem,rang):
    fig = px.scatter(subset, x="coord1", y="coord2", color=elem,color_continuous_scale="Turbo",range_color=rang)
    fig.show()
def handle_elem_change(change):
    # Hold notifications while updating: otherwise max is set first, and if
    # max < min the slider raises an error. This way everything is set first
    # and any error notifications are sent afterwards.
    with rangewidg.hold_trait_notifications():
        rangewidg.max=big_grid[dropm_elem.value].max()
        rangewidg.min=big_grid[dropm_elem.value].min()
        rangewidg.value=[big_grid[dropm_elem.value].min(),big_grid[dropm_elem.value].max()]
def handle_id_change(change):
    global subset
    subset=big_grid[big_grid['id_col'].isin(dropm_id.value)]
big_grid=pd.DataFrame(data=dict(id_col=[1,2,3,4,5],
col1=[0.1,0.2,0.3,0.4,0.5],
col2=[10,20,30,40,50],
coord1=[6,7,8,9,10],
coord2=[6,7,8,9,10]))
subset=big_grid
list_elem=["col1","col2"]
list_id=big_grid.id_col.values
dropm_elem=widgets.Dropdown(options=list(list_elem))
dropm_id=widgets.SelectMultiple(
options=list_id,
description="Active",
disabled=False
)
rangewidg=widgets.FloatRangeSlider(value=[big_grid[dropm_elem.value].min(),big_grid[dropm_elem.value].max()],
min=big_grid[dropm_elem.value].min(),
max=big_grid[dropm_elem.value].max(),
step=0.001,
readout_format='.3f',
description="Color Range",
continuous_update=False)
dropm_elem.observe(handle_elem_change,names='value')
dropm_id.observe(handle_id_change,names='value')
display(dropm_id)
widgets.interact(plot,elem=dropm_elem,rang=rangewidg)
So I would like the behaviour of this second version, but inside a widgets.HBox, and if possible without global variables.
UPDATE: I managed to get a working version using the following code:
def handle_elem_change(change):
    # Hold notifications while updating: otherwise max is set first, and if
    # max < min the slider raises an error. This way everything is set first
    # and any error notifications are sent afterwards.
    with rangewidg.hold_trait_notifications():
        rangewidg.max=big_grid[dropm_elem.value].max()
        rangewidg.min=big_grid[dropm_elem.value].min()
        rangewidg.value=[big_grid[dropm_elem.value].min(),big_grid[dropm_elem.value].max()]
def plot_change(change):
    df=big_grid[big_grid['id_col'].isin(dropm_id.value)]
    output.clear_output(wait=True)
    with output:
        fig = px.scatter(df, x="coord1", y="coord2", color=dropm_elem.value,hover_data=["info"],
                         width=500,height=800, color_continuous_scale="Turbo",range_color=rangewidg.value)
        fig.show()
#define the widgets dropm_elem and rangewidg, which are the possible df.columns and the color range
#used in the function plot.
big_grid=pd.DataFrame(data=dict(id_col=[1,2,3,4,5],
col1=[0.1,0.2,0.3,0.4,0.5],
col2=[10,20,30,40,50],
coord1=[6,7,8,9,10],
coord2=[6,7,8,9,10],
info=["info1","info2","info3","info4","info5",]))
list_elem=["col1","col2","info"]
list_id=big_grid.id_col.values
dropm_elem=widgets.Dropdown(options=list_elem) #creates a widget dropdown with all the _ppms
dropm_id=widgets.SelectMultiple(
options=list_id,
description="Active Jobs",
disabled=False
)
rangewidg=widgets.FloatRangeSlider(value=[big_grid[dropm_elem.value].min(),big_grid[dropm_elem.value].max()],
min=big_grid[dropm_elem.value].min(),
max=big_grid[dropm_elem.value].max(),
step=0.001,
readout_format='.3f',
description="Color Scale Range",
continuous_update=False)
output=widgets.Output()
# This line is crucial: whenever the dropdown widget changes, call
# handle_elem_change, which in turn updates the values of rangewidg.
dropm_elem.observe(handle_elem_change,names='value')
dropm_elem.observe(plot_change,names='value')
dropm_id.observe(plot_change,names='value')
rangewidg.observe(plot_change,names='value')
# This comment refers to the commented-out widgets.interact call below: it links the dropdown and
# rangewidg widgets to the function plot, assigning their values to its parameters elem and rang.
left_box = widgets.VBox([output])
right_box =widgets.VBox([dropm_elem,rangewidg,dropm_id])
tbox=widgets.HBox([left_box,right_box])
# widgets.interact(plot,elem=dropm_elem,rang=rangewidg)
display(tbox)
This way everything works, but I basically need to create a new dataframe every time that I move anything. It might not be very efficient for big dataframes, but it runs.

dataframe line plot is not plotting a line with column values

I think there is something wrong with the data in my dataframe, but I am having a hard time coming to a conclusion. I think there might be some missing datetime values, which is the index of the dataframe. Given that there are over 1000 rows, it isn't possible for me to check each row manually. Here is a picture of my data and the corresponding line plot. Clearly this isn't a line plot!
Is there any way to supplement the possible missing values in my dataframe somehow?
I also did a line plot in seaborn to get another perspective, but it wasn't immediately helpful.
You have effectively done the same as what I have simulated below. You really have a multi-index of date and age_group; plotting both groups together means the line jumps between the two. Separate them out and plot them as separate lines, and the result is what you expect.
d = pd.date_range("1-jan-2020", "16-mar-2021")
df = pd.concat([pd.DataFrame({"daily_percent":np.sort(np.random.uniform(0.5,1, len(d)))}, index=d).assign(age_group="0-9 Years"),
pd.DataFrame({"daily_percent":np.sort(np.random.uniform(0,0.5, len(d)))}, index=d).assign(age_group="20-29 Years")])
df.plot(kind="line", y="daily_percent", color="red")
df.set_index("age_group", append=True).unstack(1).droplevel(0, axis=1).plot(kind="line", color=["red","blue"])
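To check the original suspicion about missing dates without reading every row, something along these lines may help. It is a sketch on synthetic data shaped like the simulation above, with one date deliberately removed; the names full_range, missing and wide are illustrative, and linear interpolation is only one possible way to fill a gap.

```python
import numpy as np
import pandas as pd

# Synthetic data in the same shape as above, with 2020-01-05 removed on purpose.
d = pd.date_range("2020-01-01", "2020-01-10")
df = pd.concat([
    pd.DataFrame({"daily_percent": np.linspace(0.5, 1.0, len(d))}, index=d)
      .assign(age_group="0-9 Years"),
    pd.DataFrame({"daily_percent": np.linspace(0.0, 0.5, len(d))}, index=d)
      .assign(age_group="20-29 Years"),
]).drop(pd.Timestamp("2020-01-05"))

# Repeated index values are the sign that several series share one index.
has_repeats = df.index.duplicated().any()

# Dates inside the covered span that never appear in the index.
full_range = pd.date_range(df.index.min(), df.index.max(), freq="D")
missing = full_range.difference(df.index)

# One column per age group, reindexed to the full range, with gaps interpolated.
wide = df.pivot(columns="age_group", values="daily_percent")
wide = wide.reindex(full_range).interpolate()
```

Calling wide.plot(kind="line") then draws one line per age group.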

Altair: Sorting faceted "text" chart not reflecting expectation

This is a direct follow-up to "Sorting based on the alt.Color field in Altair", using the same dataframe (included here for ease of reference). I asked a follow-up in the comments section, but after giving it a whirl on my own and getting close, I am creating a new question.
Unnamed: 0,Species,Unknown,group,LDA Score,p value
11,a,3.474929757,bad,3.07502591,5.67e-05
16,b,3.109308852,bad,2.739744898,0.000651725
31,c,3.16979865,bad,2.697247855,0.03310557
38,d,0.06730106400000001,bad,2.347746497,0.013009626000000002
56,e,2.788383183,good,2.223874347,0.0027407140000000004
65,f,2.644346144,bad,2.311106698,0.00541244
67,g,3.626001112,good,2.980960068,0.038597163
74,h,3.132399759,good,2.849798377,0.007021518000000001
117,i,3.192113412,good,2.861299028,8.19e-06
124,j,0.6140430960000001,bad,2.221483531,0.0022149739999999998
147,k,2.873671544,bad,2.390164757,0.002270102
184,l,3.003479213,bad,2.667274876,0.008129727
188,m,2.46344998,good,2.182085465,0.001657861
256,n,0.048663767,bad,2.952260299,0.013009626000000002
285,o,2.783848855,good,2.387345098,0.00092491
286,p,3.636218971,good,3.094046639,0.001584756
The follow-up question was: after grouping by "color", how can I do a subsequent ordering within the groups by "LDA Score" (essentially by bar length) and have the text column sort by LDA Score as well? I didn't know how to incorporate a second level of ordering in the code I was using, so I opted to turn the groups into facets and then try sorting by LDA Score for both the bar charts and the text column. I am getting the proper sorting by LDA Score on the bar charts, but I can't seem to make it work for the text column. I am pasting the code and the image. As you can see, I am telling it to use LDA Score as the sorting field for the "text" chart (which shows the p value), but it is still sorting alphabetically by species. Any thoughts? To be honest, I feel like I'm heading down a rabbit hole, forcing a solution to work in the current code, so if you think a different strategy altogether is the better way to go, let me know.
FYI, there are some formatting issues (like redundant labels on axes) that you can ignore for now.
bars = alt.Chart(df).mark_bar().encode(
alt.X('LDA Score'),
alt.Y("Species:N", sort='-x'),
color='group:N',
row='group:N'
).resolve_scale(y='independent'
)
text = alt.Chart(df).mark_text().encode(
alt.Text('p value:Q', format='.2e'),
alt.Y('Species:N', sort=alt.EncodingSortField(field='LDA Score', op='count', order='descending')),
row='group:N'
).resolve_scale(y='independent'
).properties(width=50)
#bars | text
alt.hconcat(bars, text, spacing=0)
Drop op="count". The count in each row is exactly 1 (there is one data point in each row). It sounds like you want to instead sort by the data value.
It also would make sense in this context to use this same sort expression for both y encodings, since they're meant to match:
y_sort = alt.EncodingSortField(field='LDA Score', order='descending')
bars = alt.Chart(df).mark_bar().encode(
alt.X('LDA Score'),
alt.Y("Species:N", sort=y_sort),
color='group:N',
row='group:N'
).resolve_scale(
y='independent'
)
text = alt.Chart(df).mark_text().encode(
alt.Text('p value:Q', format='.2e'),
alt.Y("Species:N", sort=y_sort, axis=None),
alt.Row('group:N', header=alt.Header(title=None, labelFontSize=0))
).resolve_scale(
y='independent'
).properties(width=50)
alt.hconcat(bars, text, spacing=0)
(labelFontSize is a workaround because there is a bug with labels=False)

Python: how to create a stacked graph with the Titanic dataset

I'm new to Python and am working with the Titanic dataset to create a stacked chart using a for loop. Can anyone suggest how to convert the chart from bar to stacked? What should be changed in the code below?
df.drop(["name","ticket","cabin","boat","body","home.dest"], axis=1,inplace=True)
df.embarked = df.embarked.fillna(df.embarked.mode()[0])
es_grp1=df.groupby(['embarked','survived'])
for i in es_grp1.groups.keys():
    plt.bar(str(i),es_grp1.get_group(i).embarked.size)
    plt.text(str(i),es_grp1.get_group(i).embarked.size,es_grp1.get_group(i).embarked.size)
plt.show()
It's difficult to judge without seeing exactly how your data are structured, but based on the link I've provided this should work ok. If you can show what the data actually look like then that might make this clearer.
df.drop(["name","ticket","cabin","boat","body","home.dest"], axis=1,inplace=True)
df.embarked = df.embarked.fillna(df.embarked.mode()[0])
es_grp1=df.groupby(['embarked','survived'])
value_sum = 0  # running height of the stack
for i in es_grp1.groups.keys():
    size = es_grp1.get_group(i).embarked.size
    # draw each group on top of the previous ones, all at x=0
    plt.bar(0, size, bottom=value_sum, label=str(i))
    # write the count at the vertical midpoint of its segment
    plt.text(0, value_sum + size / 2, str(size), ha='center', va='center')
    value_sum += size
plt.legend()
plt.show()
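The loop above stacks every group into a single bar at x=0. If the goal is one stacked bar per port of embarkation, an alternative sketch without a loop is to count the groups with pd.crosstab and let pandas do the stacking. The small DataFrame here is a synthetic stand-in for the two Titanic columns used above, not the real data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs outside a notebook
import pandas as pd

# Synthetic stand-in for the 'embarked' and 'survived' columns (assumption).
df = pd.DataFrame({
    "embarked": ["S", "S", "C", "Q", "S", "C", "Q", "S"],
    "survived": [0, 1, 1, 0, 0, 0, 1, 1],
})

# Rows: port of embarkation; columns: survived (0/1); values: passenger counts.
counts = pd.crosstab(df["embarked"], df["survived"])

# stacked=True places the 'survived' segments on top of each other per port.
ax = counts.plot(kind="bar", stacked=True)
ax.set_ylabel("passengers")
```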

Formatting in HoverTool

I love how easy it is to set up basic hover feedback with HoverTool, but I'm wrestling with a couple aspects of the display. I have time-series data, with measurements that represent amounts in US$. This data starts out life as a pandas.Series. Legible plotting is easy (following assumes jupyter notebook):
p = figure(title='Example currency', x_axis_type='datetime',
plot_height=200, plot_width=600, tools='')
p.line(my_data.index, my_data)
p.yaxis[0].formatter = NumeralTickFormatter(format='$0,0')
show(p)
This shows me the time-series, with date-formatting on the x-axis and y-axis values that look like "$150,000", "$200,000", "$250,000", etc. I have two questions about HoverTool behavior:
1. Controlling formatting for $x and $y.
2. Accessing the name of the dataset under the cursor.
Simply adding a HoverTool allows me to see values, but in unhelpful units:
p.add_tools(HoverTool())
The corresponding tooltip values with these defaults show "1.468e+5" rather than "$146,800" (or even "146800", the underlying Series value); similarly, the date value appears as "1459728000000" rather than (say) "2016-04-04". I can manually work around this display issue by making my pandas.Series into a ColumnDataSource and adding string columns with the desired formatting:
# Make sure Series and its index have `name`, before converting to DataFrame
my_data.name = 'revenue'
my_data.index.name = 'day'
df = my_data.reset_index()
# Add str columns for tooltip display
df['daystr'] = df['day'].dt.strftime('%m %b %Y')
df['revstr'] = df['revenue'].apply(lambda x: '${:,d}'.format(int(x)))
cds = ColumnDataSource(df)
p = figure(title='Example currency', x_axis_type='datetime',
plot_height=200, plot_width=600, tools='')
p.line('day', 'revenue', source=cds)
p.yaxis[0].formatter = NumeralTickFormatter(format='$0,0')
p.add_tools(HoverTool(tooltips=[('Amount', '@revstr'), ('Day', '@daystr')]))
show(p)
but is there a way to handle the formatting in the HoverTool configuration instead? That seems much more desirable than all the data-set transformation that's required above. I looked through the documentation and (quickly) scanned through the source, and didn't see anything obvious that would save me from building the "output" columns as above.
Related to that, when I have several lines in a single plot, is there any way for me to access the name (or perhaps legend value) of each line within HoverTool.tooltips? It would be extremely helpful to include something in the tooltip to differentiate which dataset values are coming from, rather than needing to rely on (say) line-color in conjunction with the tool-tip display. For now, I've added an additional column to the ColumnDataSource that's just the string value I want to show; that obviously only works for datasets that include a single measurement column. When multiple lines are sharing an underlying ColumnDataSource, it would be sufficient to access the column-name that's provided to y.
I know this is two years late, but this is for other people who come across this question:
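p.add_tools(HoverTool(
    tooltips=[
        ('Date', '@Date{%F}'),
        ('Value', '@Value{int}')],
    # Note: recent Bokeh versions expect the "@" prefix in the formatter
    # keys as well, e.g. '@Date' and '@Value'.
    formatters={
        'Date':'datetime',
        'Value':'numeral'},mode='vline'
))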
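Applied to the column names from the question, the same idea might look like the sketch below. It assumes a recent Bokeh release, where formatter keys carry the "@" prefix, and it uses the special fields $name (the name given to the glyph renderer) and @$name (the value of the column with that name) to address the second part of the question. The figures in df and the "costs" column are made up for illustration.

```python
import pandas as pd
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure

# Made-up sample data; 'costs' is a hypothetical second measurement column.
df = pd.DataFrame({
    "day": pd.date_range("2016-04-01", periods=5, freq="D"),
    "revenue": [146800, 150250, 149000, 152300, 155100],
    "costs": [90100, 91500, 93000, 92200, 94800],
})
source = ColumnDataSource(df)

p = figure(title="Example currency", x_axis_type="datetime", tools="")
# Naming each renderer after its y column lets the tooltip look the value up.
p.line("day", "revenue", source=source, name="revenue")
p.line("day", "costs", source=source, name="costs", color="firebrick")

hover = HoverTool(
    tooltips=[
        ("Series", "$name"),            # name of the renderer under the cursor
        ("Day", "@day{%F}"),            # strftime-style date format
        ("Amount", "@$name{$0,0}"),     # column named by the renderer, as currency
    ],
    formatters={"@day": "datetime"},
)
p.add_tools(hover)
```

With this setup no extra string columns are needed in the ColumnDataSource; show(p) displays the plot in a notebook.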
