Plot numbers from different years which are different columns in Python

I have the following df:
Country 2013 2014 2015 2016 2017
0 USA 40 30 20 30 30
1 Chile 1 2 4 6 1
So I need to plot the total Infected (the numbers in each year column) over time, year by year.
So I did:
grid = sns.FacetGrid(data=df, col="Country", col_wrap=5, hue="Country")
grid.map(plt.plot,)
But this is not going to work because each year is a column and I cannot pass that to the grid.map
Any ideas on how to do this?

I'm not sure exactly what kind of plot you wanted, but this is one way I got around your problem:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'Country':['USA', 'Chile'],
'2013':[40,1],
'2014':[30,2],
'2015':[20,4],
'2016':[30,6],
'2017':[30,1]})
df = df.T # This will transpose our df: see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transpose.html
df.columns = df.iloc[0] #Set the row [0] as our header
df.drop(['Country'], inplace=True, axis=0) # Drop row [0] since we don't want it.
Right now, this is what our df looks like:
Country USA Chile
2013 40 1
2014 30 2
2015 20 4
2016 30 6
2017 30 1
From our df we can call:
df.plot.bar()
plt.xticks(rotation=0)
And we get the desired plot: a grouped bar chart with the years on the x-axis and one bar per country.
P.S. I can't post pictures yet, so please take a look at the links StackOverflow provides for them.
This code is one way of solving it, but you can definitely approach it with a different method. Remember that the plot is built on matplotlib, so you can customize it accordingly.
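If you would rather keep the FacetGrid idea from the question, one option is to reshape the wide table into long format first, so that the year becomes an ordinary column. This is only a sketch: the column names "Year" and "Infected" are chosen here for illustration and are not in the original df.

```python
import pandas as pd

df = pd.DataFrame({'Country': ['USA', 'Chile'],
                   '2013': [40, 1],
                   '2014': [30, 2],
                   '2015': [20, 4],
                   '2016': [30, 6],
                   '2017': [30, 1]})

# Turn the year columns into rows: one row per (Country, Year) pair.
# "Year" and "Infected" are illustrative names, not part of the original data.
long_df = df.melt(id_vars='Country', var_name='Year', value_name='Infected')

# The year labels were column names (strings); convert them to integers
long_df['Year'] = long_df['Year'].astype(int)

print(long_df.head())
```

With the data in this shape, something like sns.relplot(data=long_df, x='Year', y='Infected', col='Country', kind='line') should draw one panel per country, because the year is now a column that can be mapped to the x-axis.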

Related

How to create year-over-year plots in python

I have the following df
player season pts
A 2017 6
A 2018 5
A 2019 9
B 2017 2
B 2018 1
B 2019 3
C 2017 10
C 2018 8
C 2019 7
I would like to make a plot to look at the stability of pts year-over-year. That is, I want to see how correlated pts are on a year-to-year basis. I have tried various ways to plot this, but can't seem to get it quite right. Here is what I tried initially:
fig, ax = plt.subplots(figsize=(15,10))
for i in df.season:
    sns.scatterplot(df.pts.iloc[i],df.pts.iloc[i]+1)
plt.xlabel('WOPR Year n')
plt.ylabel('WOPR Year n+1')
IndexError: single positional indexer is out-of-bounds
I thought about it some more, and thought something like this may work:
fig, ax = plt.subplots(figsize=(15,10))
seasons = [2017,2018,2019]
for i in seasons:
    sns.scatterplot(df.pts.loc[df.season==i],df.pts.loc[df.season==i+1])
plt.xlabel('WOPR Year n')
plt.ylabel('WOPR Year n+1')
This didn't return an error, but just gave me a blank plot. I think I am close here. Any help is appreciated. Thanks!
To clarify, I want each player to be plotted twice: once for x=2017 and y=2018, and again for x=2018 and y=2019 (hence the year n+1).
EDIT: a sns.regplot() would probably be better here than sns.scatterplot, as I could leverage the trendline to my liking. The below image captures the stability of the desired metric from year to year.
I think you can do a self-merge:
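# The left frame holds the current season; the right frame is shifted forward
# one season, so its pts are last season's values. Suffixes follow that order.
sns.lineplot(data=df.merge(df.assign(season=df.season+1),
                           on=['player','season'],
                           suffixes=('_current','_last')),
             x='pts_last', y='pts_current', hue='player')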
Output:
Note: If you don't care for players, then you could drop hue. Also, use scatterplot instead of lineplot if it fits you better.
Based on your second idea:
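for i in seasons[:-1]:
    # Recent seaborn versions require x and y to be passed as keywords
    sns.scatterplot(x=df.pts.loc[df.season==i].tolist(), y=df.pts.loc[df.season==(i+1)].tolist())
It seems there were two issues. The first is that the two Series carry different index labels, so seaborn cannot pair the values up; converting each Series to a list drops the index, so the values are paired by position. Note that this relies on the players appearing in the same order in every season. The second is that you need to exclude the last element of seasons, since you're plotting n against n+1.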
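Since the question is ultimately about how correlated pts are from one year to the next, the merged pairs can also be used to compute that number directly. The sketch below rebuilds the sample data from the question; the names previous, pairs and year_over_year_corr are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'player': list('AAABBBCCC'),
    'season': [2017, 2018, 2019] * 3,
    'pts': [6, 5, 9, 2, 1, 3, 10, 8, 7],
})

# Relabel each row with the following season so it lines up with next year's row
previous = df.assign(season=df['season'] + 1)

# Left frame = current season, right frame = previous season
pairs = df.merge(previous, on=['player', 'season'],
                 suffixes=('_current', '_last'))

# Pearson correlation between last year's and this year's pts
year_over_year_corr = pairs['pts_last'].corr(pairs['pts_current'])

print(pairs)
print(year_over_year_corr)
```

Each player appears twice in pairs (2017 to 2018 and 2018 to 2019), which matches the clarification in the question. Passing the same frame to sns.regplot(data=pairs, x='pts_last', y='pts_current') should give the scatter with a trendline.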

Pandas Rolling Correlation Introduces Gaps

I have a relatively clean data set with two columns and no gaps, a snapshot is shown below:
I run the following line of code:
correlation = pd.rolling_corr(data['A'], data['B'], window=120)
and for some reason, this outputs a dataframe (shown as a plot below) with large gaps in it:
I haven't personally seen this issue before, and am not sure after reviewing the data (more than the code) what the issue could be?
This happens because of missing dates in the time series (weekends and so on). You can see this in your example in the jump from 7/2/2003 to 10/2/2003. One solution is to fill in these gaps by re-indexing the time series dataframe.
df.index = pd.DatetimeIndex(df.index) # required
df = df.asfreq('D') # reindex will include missing days
df = df.bfill() # back-fill the NaNs created for the new rows (or interpolate instead)
corr = df.A.rolling(30).corr(df.B) # no gaps
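To see what the re-indexing step does, here is a small sketch on a made-up series that skips a weekend. The dates and values are placeholders, not the asker's data.

```python
import pandas as pd

# Placeholder data: seven trading days with a weekend gap (8 and 9 Feb 2003)
dates = pd.to_datetime(['2003-02-03', '2003-02-04', '2003-02-05', '2003-02-06',
                        '2003-02-07', '2003-02-10', '2003-02-11'])
df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
                   'B': [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0]}, index=dates)

# asfreq('D') inserts a row for every calendar day; the new rows hold NaN
daily = df.asfreq('D')
n_missing = int(daily['A'].isna().sum())

# Back-fill the inserted rows from the next available observation
filled = daily.bfill()

print(n_missing)
print(filled)
```

Whether back-filling, forward-filling or interpolating is appropriate depends on the data; for prices, forward-filling the last known value is a common alternative.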
You are getting NaN values in your correlation variable for the leading rows, where fewer observations are available than the window size.
import pandas as pd
import numpy as np
data = pd.DataFrame({'A':np.random.randn(10), 'B':np.random.randn(10)})
# pd.rolling_corr has been removed from recent pandas versions; use .rolling(...).corr(...)
correlation = data['A'].rolling(window=3).corr(data['B'])
print(correlation)
0 NaN
1 NaN
2 0.852602
3 0.020681
4 -0.915110
5 -0.741857
6 0.173987
7 0.874049
8 -0.874258
9 -0.835340
The docs for this function warn about this in the description of the min_periods parameter: "Minimum number of observations in window required to have a value (otherwise result is NA)."
The default of None does not mean "no minimum": for an integer window, min_periods defaults to the window size, which is why the first window - 1 results are NaN unless you set a smaller value.
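A small sketch of how min_periods affects the leading NaN values, using fixed made-up numbers instead of random ones so the result is reproducible. The names default and relaxed are illustrative.

```python
import pandas as pd

# Made-up values; every window contains some variation in both columns
data = pd.DataFrame({'A': [1, 2, 4, 3, 5, 7, 6, 8],
                     'B': [2, 1, 3, 5, 4, 6, 8, 7]}, dtype=float)

# Default: min_periods is not set, so a full window of 3 rows is required
default = data['A'].rolling(window=3).corr(data['B'])

# Relaxed: a value is produced as soon as 2 observations are available
relaxed = data['A'].rolling(window=3, min_periods=2).corr(data['B'])

print(default)
print(relaxed)
```

With the default, the first two results are NaN; with min_periods=2 only the very first one is, since a correlation needs at least two points. Early values computed from so few observations are noisy, so lowering min_periods is a trade-off rather than a fix.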

Acquire the data from a row in a Pandas

Instructions given by Professor:
1. Using the list of countries by continent from World Atlas data, load in the countries.csv file into a pandas DataFrame and name this data set as countries.
2. Using the data available on Gapminder, load in the Income per person (GDP/capita, PPP$ inflation-adjusted) as a pandas DataFrame and name this data set as income.
3. Transform the data set to have years as the rows and countries as the columns. Show the head of this data set when it is loaded.
4. Graphically display the distribution of income per person across all countries in the world for any given year (e.g. 2000). What kind of plot would be best?
In the code below, I have some of these tasks completed, but I'm having a hard time understanding how to acquire data from a DataFrame row. I want to be able to acquire data from a row and then plot it. It may seem like a trivial concept, but I've been at it for a while and need assistance please.
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
countries = pd.read_csv('2014_data/countries.csv')
countries.head(n=3)
income = pd.read_excel('indicator gapminder gdp_per_capita_ppp.xlsx')
income = income.T
def graph_per_year(year):
    stryear = str(year)
    dfList = income[stryear].tolist()
graph_per_year(1801)
Pandas uses three types of indexing.
If you are looking to use integer indexing, you will need to use .iloc
df_1
Out[5]:
consId fan-cnt
0 1155696024483 34.0
1 1155699007557 34.0
2 1155694005571 34.0
3 1155691016680 12.0
4 1155697016945 34.0
df_1.iloc[1,:] #go to the row with index 1 and select all the columns
Out[8]:
consId 1.155699e+12
fan-cnt 3.400000e+01
Name: 1, dtype: float64
And to go to a particular cell, you can use something along the following lines,
df_1.iloc[1, 1] # row at position 1, column at position 1
Out[9]: 34.0
You should also go through the documentation for label-based indexing with .loc, as suggested by sohier-dane. The older .ix indexer has been removed from pandas, so use .loc or .iloc instead.
To answer your first question, a bar graph with a year selector would be best. You'll have to keep countries on the x axis and per capita income on the y axis, perhaps with a dropdown to select the year for which the graph is drawn.
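Applied to the question: once the income table is transposed, the years are row labels, so income[stryear] looks for a column and fails. A label-based row lookup with .loc is one way around it. The sketch below uses a tiny stand-in table with placeholder numbers (not real Gapminder figures) and assumes the year labels are integers; if they were loaded as strings, the lookup key would need to be a string too.

```python
import pandas as pd

# Stand-in for the transposed table: years as rows, countries as columns.
# The values are placeholders for illustration only.
income = pd.DataFrame(
    {'Chile': [1000.0, 1010.0], 'Sweden': [1500.0, 1520.0], 'Japan': [1200.0, 1215.0]},
    index=[1800, 1801],
)

def income_for_year(frame, year):
    """Return one row (income per country) for the given year label."""
    # .loc selects by row label; dropna() removes countries with no data that year
    return frame.loc[year].dropna()

row_1801 = income_for_year(income, 1801)
print(row_1801)
```

The result is a Series indexed by country, so row_1801.plot(kind='hist') would show the distribution for that year, and row_1801.plot(kind='bar') would show one bar per country.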

Convert categorical data into various columns for plotting in pandas

I'm new to python and pandas but I've used R in the past for data analysis. I have a simple dataset:
df.head()
Sequence Level Count
1 Easy 5
1 Medium 7
1 Hard 9
I would like to convert this to:
Sequence Easy Medium Hard
1 5 7 9
In R, I could simply do this by using the reshape2 package. In python it seems like one of my options is to create dummy variables using get_dummies but that would still generate multiple rows for the same Sequence in my case. Is there an easy way of achieving my resultset?
I'm finally trying to plot it using:
import matplotlib.pyplot as plt
df.plot(kind='bar', stacked=True)
plt.show()
Any help would be appreciated.
You could use pandas pivot_table:
In [1436]: pd.pivot_table(df, index='Sequence', columns='Level', values='Count')
Out[1436]:
Level Easy Hard Medium
Sequence
1 5 9 7
Then you could plot it:
df1 = pd.pivot_table(df, index='Sequence', columns='Level', values='Count')
df1.plot(kind='bar', stacked=True)
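Note that the pivoted columns come out in alphabetical order (Easy, Hard, Medium). If the stacked bars should follow the natural order instead, the columns can be reordered after pivoting. A minimal sketch, assuming each Sequence/Level pair occurs only once so that plain pivot is enough:

```python
import pandas as pd

df = pd.DataFrame({'Sequence': [1, 1, 1],
                   'Level': ['Easy', 'Medium', 'Hard'],
                   'Count': [5, 7, 9]})

# pivot reshapes without aggregating, so the counts stay integers
wide = df.pivot(index='Sequence', columns='Level', values='Count')

# Put the levels in their natural order instead of alphabetical order
wide = wide[['Easy', 'Medium', 'Hard']]

print(wide)
```

Calling wide.plot(kind='bar', stacked=True) then stacks the segments in that order. If a Sequence/Level pair can repeat, pivot raises an error and pivot_table with an explicit aggfunc is the safer choice.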

Python Pandas - Don't sort bar graph on y axis values

I am a beginner in Python. I have a Series with Date and count of some observation as below
Date Count
2003 10
2005 50
2015 12
2004 12
2003 15
2008 10
2004 05
I wanted to plot a graph to find out the count against the year with a Bar graph (x axis as year and y axis being count). I am using the below code
import pandas as pd
pd.value_counts(sfdf.Date_year).plot(kind='bar')
I am getting a bar graph that is automatically sorted by count, so I cannot clearly see how the count is distributed over the years. Is there any way to stop the bar graph from being sorted by count and instead sort it by the x axis values (i.e. year)?
I know this is an old question, but in case someone is still looking for another answer.
I solved this by adding .sort_index(axis=0)
So, instead of this:
pd.value_counts(sfdf.Date_year).plot(kind='bar')
you can write this:
pd.value_counts(sfdf.Date_year).sort_index(axis=0).plot(kind='bar')
Hope, this helps.
The following code uses groupby() to join the multiple instances of the same year together, and then calls sum() on the groupby() object to sum it up. By default groupby() moves the grouping column into the dataframe index and sorts the group keys, but just in case, sort_index() will sort the index explicitly. All that then remains is to plot:
df = pd.DataFrame([[2003,10],[2005,50],[2015,12],[2004,12],[2003,15],[2008,10],[2004,5]],columns=['Date','Count'])
df.groupby('Date').sum().sort_index().plot(kind='bar') # DataFrame.sort() no longer exists; use sort_index()
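The two answers compute different things, which is worth checking before plotting: value_counts() counts how many rows each year has, while groupby().sum() adds up the Count column per year. A short sketch on the sample data from the question (the names totals and occurrences are illustrative):

```python
import pandas as pd

df = pd.DataFrame([[2003, 10], [2005, 50], [2015, 12], [2004, 12],
                   [2003, 15], [2008, 10], [2004, 5]],
                  columns=['Date', 'Count'])

# Sum of Count per year; groupby sorts the year keys by default
totals = df.groupby('Date')['Count'].sum()

# Number of rows per year, re-sorted by year instead of by frequency
occurrences = df['Date'].value_counts().sort_index()

print(totals)
print(occurrences)
```

Either Series can be passed to .plot(kind='bar'), and in both cases the bars appear in year order.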
