I'm new to Python and am working with the Titanic dataset to create a stacked bar chart using a for loop. Can anyone suggest how to convert my bar chart into a stacked one? What should be changed in the code below?
df.drop(["name","ticket","cabin","boat","body","home.dest"], axis=1,inplace=True)
df.embarked = df.embarked.fillna(df.embarked.mode()[0])
es_grp1=df.groupby(['embarked','survived'])
for i in es_grp1.groups.keys():
    plt.bar(str(i),es_grp1.get_group(i).embarked.size)
    plt.text(str(i),es_grp1.get_group(i).embarked.size,es_grp1.get_group(i).embarked.size)
plt.show()
It's difficult to judge without seeing exactly how your data are structured, but based on the link I've provided this should work ok. If you can show what the data actually look like then that might make this clearer.
df.drop(["name","ticket","cabin","boat","body","home.dest"], axis=1,inplace=True)
df.embarked = df.embarked.fillna(df.embarked.mode()[0])
es_grp1=df.groupby(['embarked','survived'])
value_sum = 0
for i in es_grp1.groups.keys():
    size = es_grp1.get_group(i).embarked.size
    # Draw each group on top of the previous ones in a single bar at x=0
    plt.bar(0, size, bottom=value_sum)
    # Label the segment at its vertical midpoint
    plt.text(0, value_sum + size / 2, str(size), ha='center', va='center')
    value_sum += size
plt.show()
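The loop above stacks every group into one bar. If the goal is instead one bar per port of embarkation, with the survived counts stacked inside it, the `bottom` offset has to be tracked per bar rather than globally. A minimal sketch of that bookkeeping, using invented counts (not the real Titanic numbers) and a hypothetical helper name `stacked_segments`:

```python
from collections import defaultdict

# Hypothetical counts per (embarked, survived) pair; NOT the real Titanic numbers
counts = {('C', 0): 5, ('C', 1): 7, ('Q', 0): 4, ('Q', 1): 2, ('S', 0): 9, ('S', 1): 6}

def stacked_segments(counts):
    """Return (x_label, series, height, bottom) tuples, one per bar segment."""
    running = defaultdict(int)  # current top of the stack for each x label
    segments = []
    for (x_label, series), height in sorted(counts.items()):
        segments.append((x_label, series, height, running[x_label]))
        running[x_label] += height
    return segments

for segment in stacked_segments(counts):
    print(segment)
```

Each tuple maps onto one call of the form plt.bar(x_label, height, bottom=bottom), with the color chosen from the series value.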
I'm trying to create a chart where it is possible to select combinations of different columns of the data by toggling checkboxes. However I don't always want to display the checkboxes for all the columns. So I want to add the selections to the chart in a 'dynamic' way.
The thing I want to accomplish is that I want to make a pre-selection of which categories I want to visualize (this is done before the altair chart is created). These categories are then added as checkboxes in altair. However the only way I could find to do this is by adding them in a hardcoded way like the "sel1[0] & sel1[1] & sel1[2] & sel1[3] & sel1[4]" in the code below:
sel1 = [
    alt.selection_single(
        bind=alt.binding_checkbox(name=field),
        fields=[field],
        init={field: False}
    )
    for field in category_selection
]
transform_args = {str(col): f'toBoolean(datum.{col})' for col in category_selection}
alt.Chart(df1).transform_calculate(**transform_args).mark_point(filled=True).encode(
    x='dim1',
    y='dim2',
    opacity=alt.condition(
        sel1[0] & sel1[1] & sel1[2] & sel1[3] & sel1[4],
        alt.value(1), alt.value(0)
    )
).add_selection(
    *sel1
)
I have tried doing it like this:
alt.Chart(df1).transform_calculate(**transform_args).mark_point(filled=True).encode(
    x='dim1',
    y='dim2',
    opacity=alt.condition(
        {'and': sel[:2]},
        alt.value(1), alt.value(0)
    )
).add_selection(
    *sel1[:2]
)
But that does not work.
I can't seem to figure out how to achieve something like this with altair. Could someone provide an example on how to do this with checkboxes or help me find another method to achieve the same thing?
TLDR: I basically want to support a variable amount of categories that also supports the ability to create combinations of the categories.
EDIT: Tried to make it clearer what I'm trying to achieve with the code.
It sounds like you want to write the equivalent of this without knowing the length of sel:
sel = [alt.selection_single() for i in range(3)]
combined = sel[0] & sel[1] & sel[2]
For Python operators in general, you can do so like this:
import operator
import functools
combined = functools.reduce(operator.and_, sel)
In Altair, you can alternatively construct the resulting Vega-Lite specification directly:
combined = {"selection": {"and": [s.name for s in sel]}}
Any of these three approaches should lead to identical results when used within an Altair chart.
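To see the reduce pattern in isolation, here is a small sketch that uses plain Python sets as stand-ins for the selections, since sets also support the & operator. The values are invented for illustration; only functools.reduce and operator.and_ come from the answer above.

```python
import functools
import operator

# Stand-in operands: plain sets support `&`, just as Altair selections do.
# The values are invented for illustration.
sel = [{1, 2, 3, 4}, {2, 3, 4, 5}, {3, 4, 6}]

# Equivalent to sel[0] & sel[1] & sel[2], for a list of any non-zero length
combined = functools.reduce(operator.and_, sel)

# The same call works for a subset chosen at run time
first_two = functools.reduce(operator.and_, sel[:2])

print(combined)
print(first_two)
```

Note that reduce raises a TypeError on an empty list when no initial value is given, so the case where no category is pre-selected needs its own guard.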
This is a direct follow-up to "Sorting based on the alt.Color field in Altair", using the same dataframe (included here for ease of reference). I asked a follow-up in the comments section, but after giving it a whirl on my own and getting close, I am creating a new question.
Unnamed: 0,Species,Unknown,group,LDA Score,p value
11,a,3.474929757,bad,3.07502591,5.67e-05
16,b,3.109308852,bad,2.739744898,0.000651725
31,c,3.16979865,bad,2.697247855,0.03310557
38,d,0.06730106400000001,bad,2.347746497,0.013009626000000002
56,e,2.788383183,good,2.223874347,0.0027407140000000004
65,f,2.644346144,bad,2.311106698,0.00541244
67,g,3.626001112,good,2.980960068,0.038597163
74,h,3.132399759,good,2.849798377,0.007021518000000001
117,i,3.192113412,good,2.861299028,8.19e-06
124,j,0.6140430960000001,bad,2.221483531,0.0022149739999999998
147,k,2.873671544,bad,2.390164757,0.002270102
184,l,3.003479213,bad,2.667274876,0.008129727
188,m,2.46344998,good,2.182085465,0.001657861
256,n,0.048663767,bad,2.952260299,0.013009626000000002
285,o,2.783848855,good,2.387345098,0.00092491
286,p,3.636218971,good,3.094046639,0.001584756
The follow-up question was: after grouping by "color", how can I do a subsequent ordering within the groups by "LDA Score" (essentially by bar length) and have the text column sort by LDA Score as well? I didn't know how to incorporate a second level of ordering in the code I was using, so I opted to turn the groups into facets and then try sorting by LDA Score for both the bar charts and the text column. I am getting the proper sorting by LDA Score on the charts, but I can't seem to make it work for the text column. I am pasting the code and the image. As you can see, I am telling it to use LDA Score as the sorting field for the "text" chart (which shows the p value), but it is still sorting alphabetically by species. Any thoughts? To be honest, I feel like I'm heading down the rabbit hole where I'm forcing a solution to work in the current code, so if you think a different strategy altogether is the better way to go, let me know.
FYI, there are some formatting issues (like redundant labels on axes) that you can ignore for now.
bars = alt.Chart(df).mark_bar().encode(
    alt.X('LDA Score'),
    alt.Y("Species:N", sort='-x'),
    color='group:N',
    row='group:N'
).resolve_scale(
    y='independent'
)
text = alt.Chart(df).mark_text().encode(
    alt.Text('p value:Q', format='.2e'),
    alt.Y('Species:N', sort=alt.EncodingSortField(field='LDA Score', op='count', order='descending')),
    row='group:N'
).resolve_scale(
    y='independent'
).properties(width=50)
#bars | text
alt.hconcat(bars, text, spacing=0)
Drop op="count". The count in each row is exactly 1 (there is one data point in each row). It sounds like you want to instead sort by the data value.
It also would make sense in this context to use this same sort expression for both y encodings, since they're meant to match:
y_sort = alt.EncodingSortField(field='LDA Score', order='descending')
bars = alt.Chart(df).mark_bar().encode(
    alt.X('LDA Score'),
    alt.Y("Species:N", sort=y_sort),
    color='group:N',
    row='group:N'
).resolve_scale(
    y='independent'
)
text = alt.Chart(df).mark_text().encode(
    alt.Text('p value:Q', format='.2e'),
    alt.Y("Species:N", sort=y_sort, axis=None),
    alt.Row('group:N', header=alt.Header(title=None, labelFontSize=0))
).resolve_scale(
    y='independent'
).properties(width=50)
alt.hconcat(bars, text, spacing=0)
(labelFontSize is a workaround because there is a bug with labels=False)
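The point about op="count" can be checked outside Altair. The sketch below takes four rows of the "good" group from the question's table and compares the two sort keys in plain Python. It only illustrates the reasoning and is not how Vega-Lite computes the sort internally.

```python
from collections import Counter

# Four rows from the "good" group of the question's table: (Species, LDA Score)
rows = [("e", 2.223874347), ("g", 2.980960068), ("h", 2.849798377), ("i", 2.861299028)]

# Aggregating with "count" yields 1 for every species, so every sort key ties
counts = Counter(species for species, _ in rows)

# Sorting by the value itself gives the intended descending order
by_score = [species for species, score in sorted(rows, key=lambda r: r[1], reverse=True)]

print(dict(counts))
print(by_score)
```

Because every count is 1, a sort on the count cannot distinguish the species, whereas the score gives a well-defined order.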
I've been searching this topic for a couple hours now, and still can't get my code to work. I am trying to generate a heat map from a pivot table I've created using pandas. I'm very new to coding and am probably not using the right terminology, but I'll try my best.
My table looks like this:
[image of the pivot table]
It has many more rows as well. I am trying to generate a plotly heat map with the countries on the y axis, the 4 ownership types on the x, and the numeric values being used as the z values. I've been getting a lot of errors, but I think I'm getting close because it gets to my last line and says "TypeError: Object of type 'DataFrame' is not JSON serializable." I've searched this error but can't find anything that I can understand. I set up the table like so, and am having trouble with the z, x, and y inputs:
data = [go.Heatmap(z=[Country_Ownership_df[['Company Owned', 'Franchise', 'Joint Venture', 'Licensed']]],
                   y=[Country_Ownership_df['Country']],
                   x=['Company Owned', 'Franchise', 'Joint Venture', 'Licensed'],
                   colorscale=[[0.0, 'white'], [0.000001, 'rgb(191, 0, 0)'], [.001, 'rgb(209, 95, 2)'], [.005, 'rgb(244, 131, 67)'], [.015, 'rgb(253,174,97)'], [.03, 'rgb(249, 214, 137)'], [.05, 'rgb(224,243,248)'], [0.1, 'rgb(116,173,209)'], [0.3, 'rgb(69,117,180)'], [1, 'rgb(49,54,149)']])]
layout = go.Layout(
    margin = dict(t=30,r=260,b=30,l=260),
    title='Ownership',
    xaxis = dict(ticks=''),
    yaxis = dict(ticks='', nticks=0 )
)
fig = go.Figure(data=data, layout=layout)
#iplot(fig)
plotly.offline.plot(fig, filename= 'tempfig3.html')
It's probably a fairly simple task, I just haven't learned much with coding and appreciate any help you could offer.
Plotly takes the data arguments as lists, and doesn't support pandas DataFrames. To convert a DataFrame that is already in the correct format, namely:
data as the values ('z' in Plotly notation),
x-values as columns,
y-values as index,
the following function works:
def df_to_plotly(df):
    return {'z': df.values.tolist(),
            'x': df.columns.tolist(),
            'y': df.index.tolist()}
As it returns a dict, you can pass it directly as an argument to go.Heatmap:
import plotly.graph_objects as go
fig = go.Figure(data=go.Heatmap(df_to_plotly(df)))
fig.show()
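In the question the country names sit in a regular 'Country' column rather than in the index, so one extra step is probably needed before calling the helper. A sketch with made-up numbers (the country labels 'AA' and 'BB' and all values are placeholders), assuming the column names from the question:

```python
import pandas as pd

def df_to_plotly(df):
    # Same helper as in the answer: values -> z, columns -> x, index -> y
    return {'z': df.values.tolist(),
            'x': df.columns.tolist(),
            'y': df.index.tolist()}

# Made-up numbers, only to show the shapes involved
Country_Ownership_df = pd.DataFrame({
    'Country': ['AA', 'BB'],
    'Company Owned': [10, 0],
    'Franchise': [0, 5],
    'Joint Venture': [2, 0],
    'Licensed': [7, 3],
})

# Move the labels into the index so that only numeric columns remain
wide = Country_Ownership_df.set_index('Country')
heatmap_args = df_to_plotly(wide)

print(heatmap_args['x'])
print(heatmap_args['y'])
print(heatmap_args['z'])
```

Each inner list of z is one row of the heatmap (one country), and its entries line up with the column names in x.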
Apparently Plotly doesn't directly support DataFrames. But you can turn your DataFrames into dictionaries of lists like this:
Country_Ownership_df[['foo', 'bar']].to_dict('list')
Then non-Pandas tools like Plotly should work, because dicts and lists are JSON serializable by default.
I have a DataFrame with the variables below. I am trying to find the relationship by plotting "profit" with other variables excluding "Date".
Date
Billable_Fixed Bid
Billable_Time_Material
Billable_Transaction_Based
Non_Billable
Indirect_Costs
Unbilled_CP_and_AM
Direct_Costs
Profit
Code:
cols = [
    'Billable_Fixed Bid',
    'Billable_Time_Material',
    'Billable_Transaction_Based',
    'Non_Billable',
    'Unbilled_CP_and_AM',
    'Direct_Costs'
]
sns.pairplot(data1,x_vars=cols,y_vars='Profit',size =5,kind='reg')
The problem is that the plots are all displayed in a single row, which makes them hard to read.
I want 2 plots per row so that they are clearly visible.
Can anyone help?
As per the comment: Using FacetGrid with col_wrap=2 will solve your problem. Check the examples in the documentation.
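Faceting with col_wrap needs one column that names the variable for each panel, so the wide frame has to be reshaped to long form first. A sketch of that reshaping step with an invented three-row frame; the names `long_df`, 'variable' and 'value' are arbitrary choices:

```python
import pandas as pd

# Made-up frame with two of the question's columns plus Profit
data1 = pd.DataFrame({
    'Non_Billable': [1.0, 2.0, 3.0],
    'Direct_Costs': [4.0, 5.0, 6.0],
    'Profit': [10.0, 20.0, 30.0],
})
cols = ['Non_Billable', 'Direct_Costs']

# One row per (observation, variable) pair; Profit is repeated for each variable
long_df = data1.melt(id_vars='Profit', value_vars=cols,
                     var_name='variable', value_name='value')

print(long_df.shape)
print(list(long_df.columns))
```

With the data in this shape, a faceted call such as sns.lmplot(data=long_df, x='value', y='Profit', col='variable', col_wrap=2) should lay the regression panels out two per row. Passing facet_kws={'sharex': False} is probably worth adding, since the variables are unlikely to share a range.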
I'm plotting:
df['close'].plot(legend=True,figsize=(10,4))
The original data series comes in descending order, so I then did:
df.sort_values(['quote_date'])
The table now looks good and sorted in the desired manner, but the graph is still the same, showing today first and then going back in time.
Does .plot() order by index? If so, how can I fix this?
Alternatively, I'm importing the data with:
df = pd.read_csv(url1)
Can I somehow sort the data there already?
There are two problems with this code:
1) df.sort_values(['quote_date']) does not sort in place. It returns a sorted DataFrame and leaves df unchanged, so assign the result back:
df = df.sort_values(['quote_date'])
2) Yes, the plot() method uses the index for the x-axis by default, but you can change this behavior with the keyword use_index:
df['close'].plot(use_index=False, legend=True,figsize=(10,4))
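As for sorting at import time: read_csv has no sort option as far as I know, but the sort can be chained directly onto it. A sketch that uses an in-memory CSV as a stand-in for url1 (the rows are invented) and assumes the columns are named quote_date and close as in the question:

```python
import io
import pandas as pd

# Stand-in for the CSV at url1; the rows are invented and newest-first
csv_text = "quote_date,close\n2020-01-03,12.0\n2020-01-02,11.0\n2020-01-01,10.0\n"

df = (
    pd.read_csv(io.StringIO(csv_text), parse_dates=['quote_date'])
      .sort_values('quote_date')   # returns a new, sorted frame
      .set_index('quote_date')     # dates become the x-axis when plotting
)

print(df['close'].tolist())
```

With the dates as a sorted index, df['close'].plot(legend=True, figsize=(10,4)) should run oldest to newest without needing use_index=False.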