Question about plotting mean across segments in Matplotlib and Seaborn - python

I'm trying to plot some dataframes and have run into problems.
I want to see from a chart in which segments the best performers do best and in which they do worst (I would have used a line plot, but a bar chart would work as well).
The values are in percent.
I first filtered the dataframe to keep the rows with a mean over 80%:
best_performers_numbers = best_performers_numbers[best_performers_numbers['MEAN'] > 80]
Then I created a pivot table:
best_performers_pivot = pd.pivot_table(best_performers_numbers, values=metrics_no_target, index=['MEAN'],
aggfunc={np.mean})
best_performers_pivot.sort_values(by='MEAN', ascending=False, inplace=True)
Now my df looks like this:
I now want to plot it to see in which segments (e.g. COM) the best performers do best and in which they do worst.
I started with matplotlib, gave up, and switched to seaborn, but I'm quite lost now: I get a ValueError saying the lengths do not match.
sns.lineplot(data=best_performers_pivot, x=best_performers_pivot.index, y=best_performers_pivot.columns[0:])
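A possible way forward, as a rough sketch rather than a definitive answer: the call above likely fails because `y` receives the column labels themselves, a vector whose length differs from the index. If the goal is to compare segments, it may be simpler to average each segment column over the best performers and draw one bar per segment. The segment names (`COM`, `FIN`, `OPS`) and all values below are made up for illustration; only the `MEAN` column and the 80% threshold come from the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: one row per performer, one column per segment (percent).
df = pd.DataFrame({
    "COM": [90, 85, 70, 95],
    "FIN": [80, 88, 60, 91],
    "OPS": [85, 82, 65, 99],
})
df["MEAN"] = df[["COM", "FIN", "OPS"]].mean(axis=1)

# Keep only the best performers (row mean above 80%).
best = df[df["MEAN"] > 80]

# Average each segment over the best performers, sorted from best to worst.
segment_means = best.drop(columns="MEAN").mean().sort_values(ascending=False)

# One bar per segment: the tallest bar is where the best performers do best.
ax = segment_means.plot.bar()
ax.set_ylabel("Mean score (%)")
ax.set_title("Best performers: mean per segment")
plt.tight_layout()
```

Because `segment_means` is a plain Series indexed by segment name, `segment_means.plot.line()` would give the line-plot variant with no other changes.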

Related

Altair: Controlling tick counts for binned axis

I'm trying to generate a histogram in Altair, but I'm having trouble controlling the tick count for the axis corresponding to the binned variable (x-axis). I'm new to Altair, so apologies if I'm missing something obvious here. I looked for others who had faced this kind of issue but didn't find an exact match.
The code to generate the histogram is
alt.Chart(df_test).mark_bar().encode(
x=alt.X('x:Q', bin=alt.Bin(step=0.1), scale=alt.Scale(domain=[8.9, 11.6])),
y=alt.Y('count(y):Q', title='Count(Y)')
).configure_axis(labelLimit=0, tickCount=3)
df_test is a Pandas dataframe - the data for which is available here.
The above code generates the following histogram. Changing tickCount changes the y-axis tick counts, but not the x-axis.
Any guidance is appreciated.
There might be a more convenient way to do this using bin=, but one approach is to use transform_bin with mark_rect, since this does not change the axis into a binned axis (which are more difficult to customize):
import altair as alt
from vega_datasets import data
source = data.movies.url
alt.Chart(source).mark_rect(stroke='white').encode(
x=alt.X('x1:Q', title='IMDB Rating', axis=alt.Axis(tickCount=3)),
x2='x2:Q',
y='count()',
).transform_bin(
['x1', 'x2'], field='IMDB_Rating'
)
You might notice that you don't get the exact number of ticks. This is because the ticks are rounded to "nice" values, such as multiples of 5. I couldn't turn this off even when setting nice=False on the scale, so another approach in those cases is to pass the exact tick values via values=:
alt.Chart(source).mark_rect(stroke='white').encode(
x=alt.X('x1:Q', title='IMDB Rating', axis=alt.Axis(values=[0, 3, 6, 9])),
x2='x2:Q',
y='count()',
).transform_bin(
['x1', 'x2'], field='IMDB_Rating'
)
Be careful with decimal values: these are automatically displayed as integers (even with tickRound=False), but in the wrong position. This seems like a bug to me, so if you investigate it further you might want to report it on the Vega-Lite issue tracker.

Visualize NaN-Values in Features of a Class via Pandas GroupBy

Thanks to this kind and helpful community, I solved the first problem I had in my work, which you can see here: Basic Problem (necessary for understanding what follows).
After I used this, I wanted to visualize the distribution of the classes and the NaN values in the features, so I plotted them in a bar chart. With a few classes this is pretty handy.
The problem is that I have about 120 different classes and 50,000 data objects in total, and the plots are not readable with this amount of data.
Therefore I wanted to split the visualization.
For each class there should be a subplot showing the sum of the NaN values of each feature.
Data:
CLASS FEATURE1 FEATURE2 FEATURE3
X 1 1 2
B 0 0 0
C 2 3 1
Actual Plot:
Expected Plots:
None of my approaches has worked so far.
I tried df.groupby('Class').plot(kind="barh", subplots=True), which completely destroyed the layout and plotted per feature rather than per class.
I tried this approach, but if I write my groupby result into the variable 'grouped', I can print it in a perfect format with all the information, yet I cannot access it the way it is done in the solution. I always get the error: 'string indices must be integers'
My approach:
grouped = df.groupby('Class')
for name, group in grouped:
group.plot.bar()
EDIT - Further Information
The data I use is completely categorical, with no numerical values. I want to display the number of NaN values in the different features of the classes (labels) of my dataset.
A solution using seaborn
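import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
# Reshape to long format: one row per (CLASS, feature) pair
ndf = pd.melt(df, id_vars="CLASS", var_name="feature", value_name="val")
# One bar-chart facet per class, stacked vertically (keyword x/y for recent seaborn)
sns.catplot(x="feature", y="val", col="CLASS", data=ndf, kind="bar", col_wrap=1)
plt.show()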
Grouping is the way to go; just set the labels:
for name, grp in df3.groupby('CLASS'):
ax = grp.plot.bar()
ax.set_xticks([])
ax.set_xlabel(name)
With the solution provided by #meW I was able to achieve a result that is close to my goal.
I had to take two steps to actually use his solution.
Cast the GroupBy object to a DataFrame via df = pd.DataFrame(df.groupby('Class').count().rsub(df.groupby('Class').size(), axis=0))
The groupby query turned the Class column into the index, so I had to turn it back into a column via grouped['class'] = grouped.index
Two questions arise from this solution. First, is it possible to fit the ticks to the different amounts of NaN values? There are classes with only 5-10 NaN values in the features and classes with over 1000 NaN values (see pictures).
Second, the feature names are only shown in the last plot. Is there a way to add them to the x-axis of every plot?
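Both follow-up points can probably be handled by giving every class its own independent axes. The sketch below is one way to do that with plain matplotlib, not the approach from the answers above; the data frame is a small made-up stand-in whose NaN counts match the example table in the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical categorical data with missing values.
df = pd.DataFrame({
    "CLASS":    ["X", "X", "B", "C", "C", "C"],
    "FEATURE1": ["a", np.nan, "b", np.nan, np.nan, "c"],
    "FEATURE2": [np.nan, "a", "b", np.nan, np.nan, np.nan],
    "FEATURE3": [np.nan, np.nan, "b", "c", np.nan, "c"],
})

# NaN count per class and feature; the class labels become the index.
nan_counts = df.drop(columns="CLASS").isna().groupby(df["CLASS"]).sum()

# One subplot per class. sharey=False lets each subplot pick its own tick range,
# and separate axes mean the feature names are drawn under every plot.
fig, axes = plt.subplots(len(nan_counts), 1, sharey=False, squeeze=False,
                         figsize=(6, 2.5 * len(nan_counts)))
for ax, (name, row) in zip(axes.ravel(), nan_counts.iterrows()):
    row.plot.bar(ax=ax, rot=0)
    ax.set_title(f"Class {name}")
    ax.set_ylabel("NaN count")
fig.tight_layout()
```

With around 120 classes a single figure would get very tall, so it may be more practical to loop over slices of `nan_counts` and save one figure per batch.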

Plotting Pandas Series only showing partial values

I'm trying to plot a Pandas Series with lots of samples:
In [1]: vp_series = pd.Series(data=raw_df.Count, index=raw_df.Timestamp)
In [2]: len(vp_series)
Out[2]: 17499650
In [3]: vp_series.index[-1]
Out[3]: 559888625359
When I try to plot this series, the produced plot looks like this:
In [4]: vp_series.plot()
Clearly not all data points are plotted, and max value on the x axis is only about 1.75e7 instead of 5.59e11.
However, when I try to plot the same data in Julia (using Plots and the PyPlot backend) it produces the correct figure:
What should I do here to make the plot contain all the data points? I tried to search in the doc of matplotlib and Pandas.Series but found nothing.
I found the reason is that the way I used to create the pandas.Series is wrong. Instead of
vp_series = pd.Series(data=raw_df.Count, index=raw_df.Timestamp)
I should be using
vp_series = pd.Series(data=raw_df.Count.values, index=raw_df.Timestamp)
The first way causes my series to contain a lot of missing values (NaN), which are not plotted, because pandas aligns the existing Series to the new index by label instead of by position. The reason is explained well here.
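The difference between the two constructions can be reproduced on a tiny frame. This is only an illustration: `raw_df` here is a made-up four-row stand-in, not the original data.

```python
import pandas as pd

# Hypothetical stand-in for raw_df: default index 0..3, large timestamps.
raw_df = pd.DataFrame({
    "Timestamp": [1000, 2000, 3000, 4000],
    "Count": [5, 7, 9, 11],
})

# Passing a Series as data together with a new index makes pandas reindex by label.
# The labels 1000..4000 do not exist in raw_df.Count's index (0..3), so every value is NaN.
aligned = pd.Series(data=raw_df.Count, index=raw_df.Timestamp)

# Passing the raw values skips alignment and pairs values with timestamps by position.
positional = pd.Series(data=raw_df.Count.values, index=raw_df.Timestamp)

print(aligned.isna().sum())   # 4
print(positional.tolist())    # [5, 7, 9, 11]
```

In the original data some timestamps presumably coincided with existing row labels, which would explain why part of the series was plotted. An equivalent way to build the series would be `raw_df.set_index("Timestamp")["Count"]`.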
I know I didn't ask my question properly and I appreciate all the comments.

Pandas histogram of dates with empty bins

My use case is very similar to this post, but my data is not continuous through each bin. I'm attempting to create multiple figures over the same time span to show activity (or lack thereof) over 18 months. I thought I hit the jackpot with the df.groupby(df.date.month).count() approach, but since my data is irregular I get different bins per dataset.
My question, then, is how would I go about creating some kind of master x-axis with fixed bins (month,year) and plot each dataset against these bins. I think I'm missing some fundamental understanding of either Pandas or MPL, and I apologize for what I'm sure is a silly question. First post, go easy...
Since I can't comment yet, I'll edit here:
I have 18 months generated with pd.period_range. I also have a DataFrame full of observations with timestamps within those months. Some of the months have zero observations. How do I effectively count and chart the observations by month?
Have you tried the suggestions here?
You can also try this sort of approach to manually define the bin boundaries:
bins = [0, 30, 60, 90, 120]
labels = [1, 2, 3, 4]
df['new_bin'] = pd.cut(df['existing_value'], bins=bins, labels=labels)
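For the month-based case described in the question, another option is to count per month and then reindex onto a fixed list of months, so that empty months show up as zero. A minimal sketch, assuming a datetime column named `date` (the column name and the dates are made up for illustration):

```python
import pandas as pd

# Hypothetical observations; no rows fall in February, April or June 2020.
df = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-05", "2020-01-20", "2020-03-02", "2020-05-15"]),
})

# The fixed "master" axis: every month in the span, whether or not it has data.
months = pd.period_range("2020-01", "2020-06", freq="M")

# Count per month, then reindex onto the master axis so empty months become 0.
counts = (
    df.groupby(df["date"].dt.to_period("M"))
      .size()
      .reindex(months, fill_value=0)
)
print(counts.tolist())  # [2, 0, 1, 0, 1, 0]
```

Reindexing every dataset onto the same `months` range gives each one identical bins, so calling `counts.plot.bar()` for each dataset should produce figures with matching x-axes.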

Using fill_between() with a Pandas Data Series

I have graphed (using matplotlib) a time series and its associated upper and lower confidence interval bounds (which I calculated in Stata). I used Pandas to read the stata.csv output file and so the series are of type pandas.core.series.Series.
Matplotlib allows me to graph these three series on the same plot, but I wish to shade between the upper and lower confidence bounds to generate a visual confidence interval. Unfortunately I get an error, and the shading doesn't work. I think this is to do with the fact that the functions between which I wish to fill are pandas.core.series.Series.
Another post on here suggests that passing my_series.value instead of my_series will fix this problem; however I cannot get this to work. I'd really appreciate an example.
As long as you don't have NaN values in your data, you should be okay:
In [78]: x = Series(linspace(0, 2 * pi, 10000))
In [79]: y = sin(x)
In [80]: fill_between(x.values, y.min(), y.values, alpha=0.5)
Which yields:
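For the confidence-interval case in the question, a self-contained sketch might look like the following. The three series are synthetic stand-ins for the Stata output, and the names `estimate`, `lower` and `upper` are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical series standing in for the estimate and its confidence bounds.
t = pd.Series(np.arange(50))
estimate = pd.Series(np.sin(t / 8.0))
lower = estimate - 0.2
upper = estimate + 0.2

fig, ax = plt.subplots()
ax.plot(t, estimate, label="estimate")

# .to_numpy() hands matplotlib plain arrays; the older spelling is .values (not .value).
band = ax.fill_between(t.to_numpy(), lower.to_numpy(), upper.to_numpy(),
                       alpha=0.3, label="confidence band")
ax.legend()
```

If the real bounds contain NaN values, dropping or interpolating them before the `fill_between` call is one way to avoid gaps in the shaded band.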
