Seaborn tsplot - ValueError: Index contains duplicate entries, cannot reshape tsplot - python

I am trying to use seaborn's tsplot (sns.tsplot) as I want to show the variation in values of a series for the different months over 5 years (as well as the mean).
I have attached an image of my dataframe. This is the code I am using:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.tsplot(data = df_monthly, time = 'month', value = 'pm_local', unit="dayofweek");
plt.show()
I cannot work out why I receive the following error message since the index just goes from 0 onwards:
ValueError: Index contains duplicate entries, cannot reshape
Any suggestions of why this might be and how I can get the tsplot working?
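One likely cause, assuming tsplot reshapes the long-form data with a pandas pivot on the unit and time columns: the error appears whenever the same (unit, time) pair occurs more than once, regardless of what the row index looks like. With five years of data, each (dayofweek, month) pair occurs many times. The sketch below reproduces the error with a plain pandas pivot and shows one way around it, averaging the duplicates first. The sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical long-form data: the (dayofweek, month) pair (0, 1) appears twice,
# as it would when several years are stacked in one frame.
df_monthly = pd.DataFrame({
    "month":     [1, 1, 2, 1, 2],
    "dayofweek": [0, 0, 0, 1, 1],
    "pm_local":  [10.0, 14.0, 11.0, 9.0, 13.0],
})

# Pivoting on duplicated (index, columns) pairs raises the error from the question.
try:
    df_monthly.pivot(index="dayofweek", columns="month", values="pm_local")
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Averaging the duplicates first leaves one value per (dayofweek, month) pair,
# so the same pivot succeeds.
deduped = df_monthly.groupby(["dayofweek", "month"], as_index=False)["pm_local"].mean()
wide = deduped.pivot(index="dayofweek", columns="month", values="pm_local")
print(wide)
```

If each year should be its own trace instead, another option is to add a year column and use it as the unit, so that every (unit, time) pair is unique. tsplot was deprecated in later seaborn releases in favour of lineplot, which aggregates repeated x values itself.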

Related

Matplotlib shows NaN on X axis [duplicate]

I'm learning Python, specifically pandas and Matplotlib at the moment. I have a dataset of Premier League hat-trick scorers and have been using pandas to do some basic analysis. I now want to produce a bar chart from an extract of this data. I have been able to create a bar chart, but the x-axis shows 'nan' instead of the player names.
My code to extract the data...
import matplotlib.pyplot as plt
import pandas as pd
top10 = df.groupby(['Player'])[['Goals']].count().sort_values(['Goals'],ascending=False).head(10)
This produces the following, which I know is a pandas DataFrame because printing the type of 'top10' gives:
<class 'pandas.core.frame.DataFrame'>
This produces the following if printed out...
I tried to create a chart directly from this DataFrame, but got the error message 'KeyError: Player'.
So I made a new DataFrame and plotted that, which partly worked, but it displayed 'nan' on the x-axis:
top10df = pd.DataFrame(top10,columns=['Player','Goals'])
top10df.plot(x ='Player', y='Goals', kind='bar')
plt.show()
When I created a DataFrame manually, it worked, so I'm unsure where to go from here. I tried Googling and searching Stack Overflow with no success. Any ideas, please?
You could plot directly using the groupby results in the following way:
top10.plot(kind='bar', title='Your Title', ylabel='Goals',
xlabel='Player', figsize=(6, 5))
A dummy example, since you did not supply your data (next time it's best to do so):
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'category': list('XYZXY'),
'sex': list('mfmff'),
'ThisColumnIsNotUsed': range(5,10)})
x = df.groupby('sex').count()
x.plot(kind='bar', ylabel='category', xlabel='sex')
This gives a bar chart with the two 'sex' groups on the x-axis.
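As for why the question's second attempt showed 'nan': after the groupby, 'Player' is the index of top10, not a column. Passing columns=['Player','Goals'] to pd.DataFrame then creates a 'Player' column that has no data behind it. A minimal sketch of this, with made-up player names standing in for the real dataset, and reset_index() as one way to get a real 'Player' column:

```python
import pandas as pd

# Hypothetical stand-in for the hat-trick dataset: one row per hat-trick.
df = pd.DataFrame({
    "Player": ["A", "B", "A", "C", "A", "B"],
    "Goals":  [3, 3, 3, 3, 3, 3],
})

top10 = (df.groupby(["Player"])[["Goals"]]
           .count()
           .sort_values(["Goals"], ascending=False)
           .head(10))

# 'Player' lives in the index, so asking for it as a column yields only NaN.
broken = pd.DataFrame(top10, columns=["Player", "Goals"])
print(broken)

# reset_index() moves the index into an ordinary 'Player' column.
fixed = top10.reset_index()
print(fixed)
```

With the fixed frame, fixed.plot(x='Player', y='Goals', kind='bar') has real names to put on the x-axis.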

Seaborn boxplot showing number on x-axis, not the name of pd.Series object

Problem: I want my seaborn boxplot to show the names of the pd.Series objects (Group A, Group B) on the x-axis, but it only shows numbers: 0 for the first pd.Series and 1 for the second.
My code is as follows.
import pandas as pd
import seaborn as sns
Group_A=pd.Series([26,21,22,26,19,22,26,25,24,21,23,23,18,29,22])
Group_B=pd.Series([18,23,21,20,20,29,20,16,20,26,21,25,17,18,19])
sns.set(style="whitegrid")
ax=sns.boxplot(data=[Group_A, Group_B], palette='Set2')
Result :
You can concatenate the two series into a dataframe. There are a lot of options to do so, here is one example which will produce nice names:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
Group_A=pd.Series([26,21,22,26,19,22,26,25,24,21,23,23,18,29,22])
Group_B=pd.Series([18,23,21,20,20,29,20,16,20,26,21,25,17,18,19])
df = pd.DataFrame({"ColumnA" : Group_A, "ColumnB" : Group_B})
sns.set(style="whitegrid")
ax=sns.boxplot(data=df , palette='Set2')
plt.show()

Avoid plotting missing values in Seaborn

Problem: I have time series data covering several days, and I use seaborn's sns.FacetGrid to plot it in facets. In several cases, I found that seaborn joins the readings on either side of a run of consecutive missing values (NaN values) with a continuous line, whereas matplotlib shows the missing values as a gap, which makes more sense. A demo example:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# create timeseries data for 3 days such that day two contains NaN values
time_duration1 = pd.date_range('1/1/2018', periods=24,freq='H')
data1 = np.random.randn(len(time_duration1))
ds1 = pd.Series(data=data1,index=time_duration1)
time_duration2 = pd.date_range('1/2/2018',periods=24,freq='H')
data2 = [float('nan')]*len(time_duration2)
ds2 = pd.Series(data=data2,index=time_duration2)
time_duration3 = pd.date_range('1/3/2018', periods=24,freq='H')
data3 = np.random.randn(len(time_duration3))
ds3 = pd.Series(data=data3,index=time_duration3)
# combine all three days series and then convert series into pandas dataframe
DS = pd.concat([ds1,ds2,ds3])
DF = DS.to_frame()
DF.plot()
This results in the following plot.
The Matplotlib plot above shows the missing values as a gap.
Now let us prepare the same data for the seaborn function:
DF['col'] = np.ones(DF.shape[0])# dummy column but required for facets
DF['timestamp'] = DF.index
DF.columns = ['data_val','col','timestamp']
g = sns.FacetGrid(DF,col='col',col_wrap=1,size=2.5)
g.map_dataframe(plt.plot,'timestamp','data_val')
Notice how the seaborn plot draws a line across the missing data. How can I force seaborn not to plot NaN values with such a line?
Note: This is a dummy example, and I need facet grid in any case to plot my data.
By default, FacetGrid removes NaN values from the data. The reason is that some functions inside seaborn would not work properly with NaNs (especially some of the statistical functions, I'd say).
To keep the NaN values in the data, pass the dropna=False argument to FacetGrid:
g = sns.FacetGrid(DF,... , dropna=False)

Missing tick label from a plot of data indexed with a PeriodIndex

I'm trying to plot two series and have the x-axis ticks labeled every 5 years. If I index the data with a PeriodIndex, for some reason I get ticks every 10 years. If I index with a list of integers instead, it works fine. Is there a way to get the right tick labels with a PeriodIndex?
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
np.random.seed(0)
idx = pd.PeriodIndex(range(2000,2021),freq='A')
data = pd.DataFrame(np.random.normal(size=(len(idx),2)),index=idx)
fig,ax = plt.subplots(1,2,figsize=(10,5))
data.loc[:,0].plot(ax=ax[0])
data.iloc[9:,1].plot(ax=ax[1])
ax[1].xaxis.set_major_locator(mpl.ticker.MultipleLocator(5))
plt.show()
# Using plain integers for idx instead of the PeriodIndex gives the expected ticks:
idx = range(2000,2021)
The workaround I know is to convert the PeriodIndex to a DatetimeIndex, then to an array of datetime.datetime objects, and to use plt.plot_date() to plot and mpl.dates.YearLocator(5) to place the ticks. This seems overly complicated.
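A somewhat shorter variant of that workaround can be sketched as follows, assuming a matplotlib version that accepts datetime64 values directly. It converts the PeriodIndex with to_timestamp() and plots with ax.plot, so it sidesteps the PeriodIndex plotting path rather than fixing the tick behaviour of data.plot() itself. The 'Y' frequency alias and the Agg backend line are choices made for this sketch, not part of the original code.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(0)
idx = pd.period_range("2000", "2020", freq="Y")
data = pd.DataFrame(np.random.normal(size=(len(idx), 2)), index=idx)

# Convert each annual period to a timestamp at the start of that year.
ts_index = data.index.to_timestamp()

fig, ax = plt.subplots(figsize=(5, 5))
ax.plot(ts_index[9:], data.iloc[9:, 1].to_numpy())

# Place a major tick on every fifth year and label it with the year only.
ax.xaxis.set_major_locator(mdates.YearLocator(5))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y"))
```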

Plotting datetimeindex on x-axis with matplotlib creates wrong ticks in pandas 0.15 in contrast to 0.14

I create a simple pandas dataframe with some random values and a DatetimeIndex like so:
import pandas as pd
from numpy.random import randint
import datetime as dt
import matplotlib.pyplot as plt
# create a random dataframe with datetimeindex
dateRange = pd.date_range('1/1/2011', '3/30/2011', freq='D')
randomInts = randint(1, 50, len(dateRange))
df = pd.DataFrame({'RandomValues' : randomInts}, index=dateRange)
Then I plot it in two different ways:
# plot with pandas own matplotlib wrapper
df.plot()
# plot directly with matplotlib pyplot
plt.plot(df.index, df.RandomValues)
plt.show()
(Do not use both statements at the same time as they plot on the same figure.)
I use Python 3.4 64bit and matplotlib 1.4. With pandas 0.14, both statements give me the expected plot (they use slightly different formatting of the x-axis which is okay; note that data is random so the plots do not look the same):
However, when using pandas 0.15, the pandas plot looks alright but the matplotlib plot has some strange tick format on the x-axis:
Is there any good reason for this behaviour and why it has changed from pandas 0.14 to 0.15?
Note that this bug was fixed in pandas 0.15.1 (https://github.com/pandas-dev/pandas/pull/8693), and plt.plot(df.index, df.RandomValues) now just works again.
The reason for this change in behaviour is that starting from 0.15, the pandas Index object is no longer a numpy ndarray subclass. But the real reason is that matplotlib does not support the datetime64 dtype.
As a workaround, if you want to use the matplotlib plot function, you can convert the index to Python datetime objects using to_pydatetime:
plt.plot(df.index.to_pydatetime(), df.RandomValues)
A more detailed explanation:
Because Index is no longer an ndarray subclass, matplotlib converts the index to a numpy array with datetime64 dtype. Previously it retained the Index object, whose scalars are returned as Timestamp values, a subclass of datetime.datetime, which matplotlib can handle. The plot function calls np.atleast_1d() on its input, which now returns a datetime64 array, and matplotlib treats that array as integers.
I opened an issue about this (as this gets possibly a lot of use): https://github.com/pydata/pandas/issues/8614
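The difference between the two inputs can be inspected without plotting anything. The short sketch below only looks at dtypes and scalar types; the variable names are chosen for illustration.

```python
import datetime as dt

import numpy as np
import pandas as pd

date_range = pd.date_range("1/1/2011", "3/30/2011", freq="D")

# What np.atleast_1d() produces from the index: a plain datetime64 array.
as_array = np.atleast_1d(date_range)
print(as_array.dtype)

# What the workaround passes instead: an object array of datetime.datetime.
as_pydatetime = date_range.to_pydatetime()
print(as_pydatetime.dtype, type(as_pydatetime[0]))

# Scalars taken from the index itself are Timestamp values,
# which subclass datetime.datetime.
first = date_range[0]
print(isinstance(first, dt.datetime))
```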
With matplotlib 1.5.0 this 'just works':
import pandas as pd
from numpy.random import randint
import datetime as dt
import matplotlib.pyplot as plt
# create a random dataframe with datetimeindex
dateRange = pd.date_range('1/1/2011', '3/30/2011', freq='D')
randomInts = randint(1, 50, len(dateRange))
df = pd.DataFrame({'RandomValues' : randomInts}, index=dateRange)
fig, ax = plt.subplots()
ax.plot('RandomValues', data=df)
