How to create year-over-year plots in Python

I have the following df
player season pts
A 2017 6
A 2018 5
A 2019 9
B 2017 2
B 2018 1
B 2019 3
C 2017 10
C 2018 8
C 2019 7
I would like to make a plot to look at the stability of pts year over year. That is, I want to see how correlated pts are on a year-to-year basis. I have tried various ways to plot this, but can't seem to get it quite right. Here is what I tried initially:
fig, ax = plt.subplots(figsize=(15,10))
for i in df.season:
    sns.scatterplot(df.pts.iloc[i],df.pts.iloc[i]+1)
plt.xlabel('WOPR Year n')
plt.ylabel('WOPR Year n+1')
IndexError: single positional indexer is out-of-bounds
I thought about it some more, and thought something like this may work:
fig, ax = plt.subplots(figsize=(15,10))
seasons = [2017,2018,2019]
for i in seasons:
    sns.scatterplot(df.pts.loc[df.season==i],df.pts.loc[df.season==i+1])
plt.xlabel('WOPR Year n')
plt.ylabel('WOPR Year n+1')
This didn't return an error, but just gave me a blank plot. I think I am close here. Any help is appreciated. Thanks!
To clarify, I want each player to be plotted twice: once for x=2017 and y=2018, and again for x=2018 and y=2019 (hence the year n+1).
EDIT: a sns.regplot() would probably be better here than sns.scatterplot, as I could leverage the trendline to my liking.
The below image captures the stability of the desired metric from year to year.

I think you can do a self-merge:
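# Shift a copy forward one season so each row is matched with the same player's previous season
sns.lineplot(data=df.merge(df.assign(season=df.season+1),
                           on=['player','season'],
                           suffixes=('_current','_last')),  # left frame = current season, shifted copy = last season
             x='pts_last', y='pts_current', hue='player')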
Output:
Note: If you don't care for players, then you could drop hue. Also, use scatterplot instead of lineplot if it fits you better.
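For what it's worth, here is a minimal, self-contained sketch of the self-merge idea using the sample data from the question. The names prev, pairs, pts_prev and pts_next are my own choices for illustration, not anything from the original post. It builds one row per player per pair of consecutive seasons and then computes the Pearson correlation, which is one way to put a number on "stability".

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "player": list("AAABBBCCC"),
    "season": [2017, 2018, 2019] * 3,
    "pts": [6, 5, 9, 2, 1, 3, 10, 8, 7],
})

# Shift a copy forward by one season, so that a player's 2017 row
# carries the label 2018 and lines up with that player's real 2018 row.
prev = df.assign(season=df["season"] + 1)

# Left frame = season n+1 (suffix _next), shifted copy = season n (suffix _prev).
pairs = df.merge(prev, on=["player", "season"], suffixes=("_next", "_prev"))

# Pearson correlation between consecutive seasons
r = pairs["pts_prev"].corr(pairs["pts_next"])
print(pairs)
print(f"year-over-year correlation: {r:.3f}")
```

If seaborn is available, the pairs frame can then be passed to something like sns.regplot(data=pairs, x="pts_prev", y="pts_next") to get the trendline mentioned in the question.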

Based on your second idea:
for i in seasons[:-1]:
    # Recent seaborn versions require x and y to be passed as keywords
    sns.scatterplot(x=df.pts.loc[df.season==i].tolist(), y=df.pts.loc[df.season==(i+1)].tolist())
It seems there were two issues. First, the two Series keep their original index labels, and the rows for season n and season n+1 sit at different index positions, so Seaborn cannot pair them up; converting each Series to a list drops the index, and the values are then matched by position. Second, you need to exclude the last element of seasons, since you're plotting n against n+1.
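To make the index point concrete, here is a small sketch with the question's data (the variable names aligned and positional are mine). It assumes, as in the sample, that players appear in the same order within every season; if that is not guaranteed, sort by player first or use a merge instead.

```python
import pandas as pd

df = pd.DataFrame({
    "player": list("AAABBBCCC"),
    "season": [2017, 2018, 2019] * 3,
    "pts": [6, 5, 9, 2, 1, 3, 10, 8, 7],
})

x = df.loc[df["season"] == 2017, "pts"]  # index labels 0, 3, 6
y = df.loc[df["season"] == 2018, "pts"]  # index labels 1, 4, 7

# Aligning on the index leaves no row where both x and y are present,
# which is consistent with the blank plot described in the question.
aligned = pd.concat([x, y], axis=1, keys=["x", "y"])
print(aligned)

# Dropping the index pairs the values by position instead.
positional = pd.DataFrame({"x": x.to_numpy(), "y": y.to_numpy()})
print(positional)
```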

Related

Plot numbers from different years which are different columns python

I have the following df:
Country 2013 2014 2015 2016 2017
0 USA 40 30 20 30 30
1 Chile 1 2 4 6 1
So I need to plot the total Infected (the numbers in each year column) over time, year by year.
So I did:
grid = sns.FacetGrid(data=df, col="Country", col_wrap=5, hue="Country")
grid.map(plt.plot,)
But this is not going to work because each year is a column and I cannot pass that to the grid.map
Any ideas on how to do this?
Not sure exactly what kind of plot you wanted, but this is one way I got around your problem:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'Country':['USA', 'Chile'],
                   '2013':[40,1],
                   '2014':[30,2],
                   '2015':[20,4],
                   '2016':[30,6],
                   '2017':[30,1]})
df = df.T # This will transpose our df: see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transpose.html
df.columns = df.iloc[0] #Set the row [0] as our header
df.drop(['Country'], inplace=True, axis=0) # Drop row [0] since we don't want it.
Right now, this is what our df looks like:
From our df we can call:
df.plot.bar()
plt.xticks(rotation=0)
And we get the desired plot:
This code is one way of solving it, but you can definitely approach this with a different method. Remember that the plot is based on matplotlib, so you can customize it accordingly.
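As one such different method, here is a hedged sketch that reshapes the wide table into long format with melt, which is the shape FacetGrid expects. The column names year and infected are my own labels, not from the question.

```python
import pandas as pd

df = pd.DataFrame({'Country': ['USA', 'Chile'],
                   '2013': [40, 1],
                   '2014': [30, 2],
                   '2015': [20, 4],
                   '2016': [30, 6],
                   '2017': [30, 1]})

# One row per (Country, year) instead of one column per year
long_df = df.melt(id_vars='Country', var_name='year', value_name='infected')
long_df['year'] = long_df['year'].astype(int)  # numeric x axis
print(long_df)
```

With the data in this shape, something along the lines of sns.FacetGrid(long_df, col="Country").map(plt.plot, "year", "infected") should give one panel per country, which is closer to what the question was attempting.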

.groupby function in Python

I am trying to create a pie chart in Python. I have a dataset with 2137 responses to a question, with answer choices ranging from 1 to 5. I am trying to produce a pie chart with the percentage of responses for each answer choice, but when I run my code, it produces a pie chart with one slice per respondent (so 2137 pieces of the pie). I am thinking I need to use the .groupby function, but I am not entirely sure how to do it correctly.
df3 = pd.DataFrame(df, columns=['Q78']).groupby(['Q78'])
df3.plot.pie(subplots=True)
Here is what I have tried. (PS I am just starting to learn Python, so sorry if this is a dumb question!!)
One possible solution:
s = df.Q78
s.groupby(s).size().plot.pie(autopct='%1.1f%%');
To test my code I created a DataFrame limited to just 8 answers:
Q78
0 A
1 B
2 C
3 D
4 E
5 A
6 B
7 A
and I got the following picture:
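A related sketch, assuming the column is named Q78 as in the question and reusing the 8 test answers above: value_counts gives the same per-answer counts as the groupby, and normalize=True returns the shares directly if you want to check the percentages before plotting.

```python
import pandas as pd

# The 8 test answers from above
df = pd.DataFrame({"Q78": list("ABCDEABA")})

# Number of responses per answer choice, ordered by answer
counts = df["Q78"].value_counts().sort_index()

# Same thing as a fraction of all responses
shares = df["Q78"].value_counts(normalize=True).sort_index()

print(counts)
print(shares)
```

Calling counts.plot.pie(autopct='%1.1f%%') on the result should then draw one slice per answer choice rather than one per respondent.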

matplotlib set xticks to column, labels to corresponding index

I am pretty new to matplotlib and I find the tick locators and labels confusing, so please bear with me. I swear I've been googling for hours.
I have a dataframe 'frame' like this (relevant columns):
         dayofweek       sla
weekday
Mon              1  0.889734
Tue              2  0.895131
Wed              3  0.879747
Thu              4  0.935000
Fri              5  0.967742
Sat              6  0.852941
Sun              7  1.000000
where the weekday name is the index and the weekday number is a column. There are no datetime objects in this frame.
I turn this into a plt.figure
fig=plt.figure(figsize=(7,5))
ax=plt.subplot(111)
I need to have my x-axis as numeric values, because I want to add a scatter plot later, which is not possible with string values.
x_=frame.dayofweek.values
anbar=ax.bar(x_,y_an,width=0.8,color=an_c,label='angekommen')
This works ok
So basically I want my xticks to be the 'dayofweek' column and their labels to be the corresponding index.
Now if I set_xticklabels manually by
ax.set_xticklabels(frame.index)
the labels start from position 0 on the axis.
I can work around this by rearranging the list of labels, but there should be a 'correct' way to use the Locators or Formatter, but (see above) this is quite confusing for me.
Can someone point me to how I make the labels correspond to their index?
The straightforward solution is to set not only the xticklabels but also the ticks themselves:
ax.set_xticks(frame.dayofweek.values)
ax.set_xticklabels(frame.index)
The same can be accomplished with a FixedLocator and a FixedFormatter,
import matplotlib.ticker
ax.xaxis.set_major_locator(matplotlib.ticker.FixedLocator(frame.dayofweek.values))
ax.xaxis.set_major_formatter(matplotlib.ticker.FixedFormatter(frame.index))
but that seems quite unnecessary for this simple task.
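Here is a small end-to-end sketch of the set_xticks / set_xticklabels approach, rebuilt from the frame shown in the question. The Agg backend and the black scatter overlay are my own assumptions, added only so the example runs without a display and shows the numeric x axis being reused.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Frame rebuilt from the question: weekday name as index, weekday number as column
frame = pd.DataFrame(
    {"dayofweek": [1, 2, 3, 4, 5, 6, 7],
     "sla": [0.889734, 0.895131, 0.879747, 0.935000, 0.967742, 0.852941, 1.000000]},
    index=pd.Index(["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"], name="weekday"),
)

fig, ax = plt.subplots(figsize=(7, 5))
ax.bar(frame["dayofweek"], frame["sla"], width=0.8)
# A scatter plot can share the same numeric x positions
ax.scatter(frame["dayofweek"], frame["sla"], color="black", zorder=3)

# Put the ticks at the numeric positions, then label them with the index
ax.set_xticks(frame["dayofweek"].to_numpy())
ax.set_xticklabels(list(frame.index))
```

In an interactive session, plt.show() (with the default backend) would display the figure.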

Python Pandas - Don't sort bar graph on y axis values

I am a beginner in Python. I have a Series with the date and the count of some observation, as below:
Date Count
2003 10
2005 50
2015 12
2004 12
2003 15
2008 10
2004 05
I wanted to plot a graph to find out the count against the year with a Bar graph (x axis as year and y axis being count). I am using the below code
import pandas as pd
pd.value_counts(sfdf.Date_year).plot(kind='bar')
I am getting a bar graph that is automatically sorted by count, so I am not able to clearly see how the count is distributed over the years. Is there any way to stop sorting the bar graph by count and instead sort by the x axis values (i.e., year)?
I know this is an old question, but here is another answer in case someone is still looking.
I solved this by adding .sort_index(axis=0).
So, instead of this:
pd.value_counts(sfdf.Date_year).plot(kind='bar')
you can write this:
pd.value_counts(sfdf.Date_year).sort_index(axis=0).plot(kind='bar')
Hope this helps.
The following code uses groupby() to join the multiple instances of the same year together, and then calls sum() on the groupby() object to sum them up. By default groupby() moves the grouping column to the DataFrame index and sorts the group keys (sort=True), so sort_index(axis=0) is only there as a safeguard. All that then remains is to plot. All in one line:
df = pd.DataFrame([[2003,10],[2005,50],[2015,12],[2004,12],[2003,15],[2008,10],[2004,5]],columns=['Date','Count'])
df.groupby('Date').sum().sort_index(axis=0).plot(kind='bar')
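The two answers above compute slightly different things, so here is a short sketch with the question's sample data that shows both side by side. The names rows_per_year and total_per_year are mine.

```python
import pandas as pd

df = pd.DataFrame(
    [[2003, 10], [2005, 50], [2015, 12], [2004, 12], [2003, 15], [2008, 10], [2004, 5]],
    columns=["Date", "Count"],
)

# How many rows mention each year, ordered by year rather than by frequency
rows_per_year = df["Date"].value_counts().sort_index()

# Sum of the Count column per year; groupby sorts the keys by default
total_per_year = df.groupby("Date")["Count"].sum()

print(rows_per_year)
print(total_per_year)
```

Either result can be followed by .plot(kind='bar') to get bars in year order.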

How to change axis limits for time in Matplotlib?

I have a data set stored in a Pandas dataframe object, and the first column of the dataframe is a datetime type, which looks like this:
0 2013-09-09 10:35:42.640000
1 2013-09-09 10:35:42.660000
2 2013-09-09 10:35:42.680000
3 2013-09-09 10:35:42.700000
I also have another column called eventno, which looks like this:
0 0
1 0
2 0
3 0
I am trying to create a scatter plot with Matplotlib, and once I have the scatter plot ready, I would like to change the range in the date axis (x-axis) to focus on certain times in the data. My problem is, I could not find a way to change the range the data will be plotted over in the x axis. I tried this below, but I get a Not implemented for this type error.
plt.figure(figsize=(13,7), dpi=200)
ax.set_xlim(['2013-09-09 10:35:00','2013-09-09 10:36:00'])
scatter(df2['datetime'][df.eventno<11],df2['eventno'][df.eventno<11])
If I comment out the ax.set_xlim line, I get the scatter plot, but with some default x axis range that does not even match my dates.
Do I have to tell matplotlib that my data is of datetime type, as well? If so, then how can I do it? Assuming this part is somehow accomplished, then how can I change the range of my data to be plotted?
Thanks!
PS: I tried uploading the picture, but I got a "Framing not allowed" error. Oh well... It just plots it from Jan 22 1970 to Jan 27 1970. No clue how it comes up with that :)
Try putting ax.set_xlim after the scatter command.
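A minimal sketch of that ordering, under a few assumptions of mine: the sample data is made up to resemble the question, the limits are converted with pd.to_datetime instead of being passed as raw strings, and the Agg backend is used so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data shaped like the question: timestamps 20 ms apart plus an event number
df2 = pd.DataFrame({
    "datetime": pd.date_range("2013-09-09 10:35:42.640", periods=50, freq="20ms"),
    "eventno": [i // 10 for i in range(50)],
})

fig, ax = plt.subplots(figsize=(13, 7))

# Plot first, so the x axis is already set up for dates
ax.scatter(df2["datetime"], df2["eventno"])

# Then set the limits, using datetime objects rather than strings
lo = pd.to_datetime("2013-09-09 10:35:00")
hi = pd.to_datetime("2013-09-09 10:36:00")
ax.set_xlim(lo, hi)

# Optional: show only the time of day on the tick labels
ax.xaxis.set_major_formatter(mdates.DateFormatter("%H:%M:%S"))
```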
