Plotting Pandas Series only showing partial values - python

I'm trying to plot a Pandas Series with lots of samples:
In [1]: vp_series = pd.Series(data=raw_df.Count, index=raw_df.Timestamp)
In [2]: len(vp_series)
Out[2]: 17499650
In [3]: vp_series.index[-1]
Out[3]: 559888625359
When I try to plot this series, the produced plot looks like this:
In [4]: vp_series.plot()
Clearly not all data points are plotted, and max value on the x axis is only about 1.75e7 instead of 5.59e11.
However, when I try to plot the same data in Julia (using Plots and the PyPlot backend) it produces the correct figure:
What should I do to make the plot contain all the data points? I searched the matplotlib and pandas.Series documentation but found nothing.

I found that the problem was the way I created the pandas.Series. Instead of
vp_series = pd.Series(data=raw_df.Count, index=raw_df.Timestamp)
I should be using
vp_series = pd.Series(data=raw_df.Count.values, index=raw_df.Timestamp)
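The first form fills my series with missing values (NaN), which are not plotted. Because raw_df.Count is itself a Series with its own index, pandas aligns its values to the new index by label instead of assigning them by position, so every timestamp that is not also a label in the original index becomes NaN. The reason is well explained here.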
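The label alignment behind this can be reproduced on a tiny frame. The sketch below uses made-up data (only the column names Timestamp and Count come from the question) and compares the two constructions:

```python
import pandas as pd

# Made-up stand-in for raw_df; its default index is 0, 1, 2, 3.
raw_df = pd.DataFrame({"Timestamp": [10, 20, 30, 40], "Count": [1, 2, 3, 4]})

# Passing a Series as data: pandas looks up each new index label (10, 20, ...)
# in the old index (0..3), finds nothing, and fills the result with NaN.
aligned = pd.Series(data=raw_df.Count, index=raw_df.Timestamp)

# Passing the bare values: they are assigned to the new index by position.
positional = pd.Series(data=raw_df.Count.values, index=raw_df.Timestamp)

n_missing_aligned = int(aligned.isna().sum())
n_missing_positional = int(positional.isna().sum())
```

An alternative that avoids the issue without touching .values is raw_df.set_index("Timestamp")["Count"], which also keeps the counts attached to their timestamps.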
I know I didn't ask my question properly and I appreciate all the comments.

Related

Altair: Controlling tick counts for binned axis

I'm trying to generate a histogram in Altair, but I'm having trouble controlling the tick count for the axis corresponding to the binned variable (x-axis). I'm new to Altair, so apologies if I'm missing something obvious here. I looked for others who had faced this kind of issue but didn't find an exact match.
The code to generate the histogram is
alt.Chart(df_test).mark_bar().encode(
    x=alt.X('x:Q', bin=alt.Bin(step=0.1), scale=alt.Scale(domain=[8.9, 11.6])),
    y=alt.Y('count(y):Q', title='Count(Y)')
).configure_axis(labelLimit=0, tickCount=3)
df_test is a Pandas dataframe - the data for which is available here.
The above code generates the following histogram. Changing tickCount changes the y-axis tick counts, but not the x-axis.
Any guidance is appreciated.
There might be a more convenient way to do this using bin=, but one approach is to use transform_bin with mark_rect, since this does not turn the axis into a binned axis (which is more difficult to customize):
import altair as alt
from vega_datasets import data
source = data.movies.url
alt.Chart(source).mark_rect(stroke='white').encode(
    x=alt.X('x1:Q', title='IMDB Rating', axis=alt.Axis(tickCount=3)),
    x2='x2:Q',
    y='count()',
).transform_bin(
    ['x1', 'x2'], field='IMDB_Rating'
)
You might notice that you don't get the exact number of ticks. This is because tick positions are rounded to "nice" values, such as multiples of 5. I couldn't turn this off even when setting nice=False on the scale, so another approach in those cases is to pass the exact tick values via values=.
alt.Chart(source).mark_rect(stroke='white').encode(
    x=alt.X('x1:Q', title='IMDB Rating', axis=alt.Axis(values=[0, 3, 6, 9])),
    x2='x2:Q',
    y='count()',
).transform_bin(
    ['x1', 'x2'], field='IMDB_Rating'
)
Be careful with decimal values: these are automatically displayed as integers (even with tickRound=False), but in the wrong position. This seems like a bug to me, so if you investigate it further you might want to report it on the Vega-Lite issue tracker.

How to modify time interval in altair line graph

I have a simple line graph of stock returns.
I have been trying to format the x axis so that the time interval is in years instead of months, as it is now. But when I use the timeUnit attribute, it produces a stunted line graph of stock returns in years.
Code:
alt.Chart(data).mark_line().encode(
    x = alt.X('Date', timeUnit = 'year'),
    y = alt.Y('Cumul_R', axis = alt.Axis(format='%', orient='right')),
    color = 'Stock')
What I'm trying to produce is a graph that looks like the first graph, but with intervals expressed in years, like 06-2010, 06-2011, etc., without compressing the graph as in the second picture. In other words, how do I show only some of the tick labels rather than all of them?
I've seen answers to similar questions, but they deal with absolute values using tickCount or tickMinStep, not with datetime values. There is apparently an Altair class called TimeInterval (https://altair-viz.github.io/user_guide/generated/core/altair.TimeInterval.html#altair.TimeInterval.init) that may solve the problem, but I'm not sure how to use it.
Appreciate all help on the matter. Thank you!
It appears that you are plotting your dates as nominal typed values, when you should probably be plotting them as temporal.
You should change x = alt.X('Date') to x = alt.X('Date:T') to specify that the x channel is temporal. When you do that, the renderer will use a temporal axis label that is probably closer to what you had in mind.
See Encoding Data Types in the documentation for more information.

How to plot normal distribution-like histogram?

I have data like [A,A,A,B,B,B,B,B,B,C,C,C,C,D,D,D,...]
which I convert into a numerical list like [1,1,1,2,2,2,2,2,2,3,3,3,3,4,4,4,...].
Each element has its own frequency; for example, A shows up 3 times.
When I plot a histogram, I get something like this:
The third element (probably C as a character) shows up most often.
I would like to place the bar for that most frequent element in the center,
and place the second and third most frequent elements next to it, to get a normal-distribution-like arrangement.
In short, I would like to see whether the distribution of the data has a normal shape or not.
I checked this with a Q-Q plot, but I would also like to see it in a histogram of the actual data.
If I understood your goal correctly, I would recommend seaborn's distplot function (deprecated since seaborn 0.11 in favor of histplot and displot). You will get both the distribution and the histogram!
You have asked many questions in a single post. I will answer the one about plotting the frequency of occurrence. Suppose your list contains strings. You can use Counter from the collections module to compute the frequencies, then plot the items and their frequencies directly with plt.bar().
from collections import Counter
import matplotlib.pyplot as plt
lst = ['A','A','A','B','B','B','B','B','B','C','C','C','C','D','D','D','E', 'E','E','E']
counts = Counter(lst)
plt.bar(counts.keys(), counts.values())
plt.show()
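The centered arrangement the question asks for is not covered by the answers above. One possible sketch, using only the standard library, is to walk through the labels from most to least frequent and place them alternately to the right and left of the peak. The helper name center_by_frequency is made up for this example:

```python
from collections import Counter, deque

def center_by_frequency(items):
    """Return (label, count) pairs with the most frequent label near the middle."""
    ordered = deque()
    for i, pair in enumerate(Counter(items).most_common()):
        # Alternate sides so the counts fall off on both sides of the peak.
        if i % 2 == 0:
            ordered.append(pair)
        else:
            ordered.appendleft(pair)
    return list(ordered)

lst = ['A'] * 3 + ['B'] * 6 + ['C'] * 4 + ['D'] * 3
arranged = center_by_frequency(lst)
labels = [label for label, _ in arranged]
counts = [count for _, count in arranged]
```

The resulting labels and counts can be passed to plt.bar(labels, counts). Note that this ordering always produces a single peak by construction, so for unordered categories it only changes the appearance and says nothing about normality.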

Visualize NaN-Values in Features of a Class via Pandas GroupBy

Thanks to this kind and helpful community, I solved the first problem I had in my work, which you can see here: Basic Problem (necessary for understanding what follows).
After that, I wanted to visualize the distribution of the classes and of the NaN values in the features, so I plotted them in a bar chart. With a few classes this is pretty handy.
The problem is that I have about 120 different classes and 50,000 data objects in total, and the plots are not readable with this amount of data.
Therefore I wanted to split the visualization:
for each class there should be a subplot showing the sum of the NaN values of each feature.
Data:
CLASS  FEATURE1  FEATURE2  FEATURE3
X      1         1         2
B      0         0         0
C      2         3         1
Actual Plot:
Expected Plots:
None of my approaches has worked so far.
I tried df.groupby('Class').plot(kind="barh", subplots=True), but it completely destroyed the layout and plotted per feature, not per class.
I also tried this approach: if I store my groupby result in the variable 'grouped', I can print it in a perfect format with all the information, but I cannot access it the way the solution does. I always get the error: 'string indices must be integers'
My approach:
grouped = df.groupby('Class')
for name, group in grouped:
    group.plot.bar()
EDIT - Further Information
The data I use is completely categorical, with no numerical values. I want to display the number of NaN values in the different features for each class (label) of my dataset.
A solution using seaborn
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
ndf = pd.melt(df, id_vars="CLASS", var_name="feature", value_name="val")
sns.catplot(x="feature", y="val", col="CLASS", data=ndf, kind="bar", col_wrap=1)
plt.show()
Grouping is the way to go, just set the labels
for name, grp in df3.groupby('CLASS'):
    ax = grp.plot.bar()
    ax.set_xticks([])
    ax.set_xlabel(name)
With the solution provided by @meW I was able to achieve a result that is close to my goal.
I had to take two steps to actually use his solution:
Convert the GroupBy object to a DataFrame via df = pd.DataFrame(df.groupby('Class').count().rsub(df.groupby('Class').size(), axis=0))
The groupby query turned the Class column into the index, so I had to turn it back into a column via grouped['class'] = grouped.index
Two questions arise from this solution. First, is it possible to fit the ticks to the different amounts of NaN? There are classes with only 5-10 NaN values in the features and classes with over 1000 NaN values (see pictures).
Second, the feature names are only shown in the last plot. Is there a way to add them to the x-axis of every plot?
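For reference, the per-class NaN counts can also be computed in one step by grouping the boolean isna() mask. This is a minimal sketch on invented categorical data; only the CLASS and FEATURE column names follow the question:

```python
import numpy as np
import pandas as pd

# Invented categorical data with some missing entries.
df = pd.DataFrame({
    "CLASS": ["X", "X", "B", "C", "C", "C"],
    "FEATURE1": ["a", np.nan, "b", np.nan, np.nan, "c"],
    "FEATURE2": [np.nan, "a", "b", np.nan, "c", np.nan],
})

# True where a value is missing; summing the mask per class counts the NaNs.
# The result has one row per class and one column per feature.
nan_counts = df.drop(columns="CLASS").isna().groupby(df["CLASS"]).sum()
```

Regarding the two follow-up questions, calling nan_counts.T.plot(kind="bar", subplots=True, sharex=False, sharey=False) should give one subplot per class, each with its own y scale and its own feature labels on the x-axis, though the layout for around 120 classes would still need tuning.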

Seaborn pairplot: how to change legend label text

I'm making a simple pairplot with Seaborn in Python that shows different levels of a categorical variable by the color of plot elements across variables in a Pandas DataFrame. Although the plot comes out exactly as I want it, the categorical variable is binary, which makes the legend quite meaningless to an audience not familiar with the data (categories are naturally labeled as 0 & 1).
An example of my code:
g = sns.pairplot(df, hue='categorical_var', palette='Set3')
Is there a way to change legend label text with pairplot? Or should I use PairGrid, and if so how would I approach this?
Found it! It was answered here: Edit seaborn legend
g = sns.pairplot(df, hue='categorical_var', palette='Set3')
new_title = 'My title'  # placeholder legend title
g._legend.set_title(new_title)
Since you don't provide a full code example or mock data, I will use my own code to answer.
First solution
The easiest way is probably to keep your binary labels for analysis and to create a column with proper names for plotting. Here is some sample code of mine; you should get the idea:
def transconum(morph):
    if (morph == 'S'):
        return 1.0
    else:
        return 0.0
CompactGroups['MorphNum'] = CompactGroups['MorphGal'].apply(transconum)
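Applied to the question's situation, the same idea runs in the other direction: keep the 0/1 column and add a descriptive copy for plotting. The sketch below uses invented data, and the column name group and the labels "control" and "treated" are hypothetical placeholders:

```python
import pandas as pd

# Toy frame standing in for the asker's data; the values are made up.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [4.0, 3.0, 2.0, 1.0],
    "categorical_var": [0, 1, 0, 1],
})

# Hypothetical readable names for the two levels.
label_names = {0: "control", 1: "treated"}

# Keep the 0/1 column for analysis and add a labelled copy for plotting.
plot_df = df.assign(group=df["categorical_var"].map(label_names))
```

Passing plot_df.drop(columns="categorical_var") to sns.pairplot with hue="group" should then put the descriptive names in the legend instead of 0 and 1.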
Second solution
Another way would be to overwrite the labels on the fly. Here is some sample code of mine, which works perfectly:
grid = sns.jointplot(x="MorphNum", y="PropS", data=CompactGroups, kind="reg")
grid.set_axis_labels("Central type", "Spiral proportion among satellites")
grid.ax_joint.set_xticks([0, 1])
plt.xticks(range(2), ('$Red$', '$S$'))
