Stacked histogram by decade from dataframe - python

I have a dataframe covering 1950-2019 that contains the date of each snowstorm and a ranking (category) for that storm. I want to create a stacked histogram where the x-axis is the decade and the y-axis is the count of snowstorms by category.
An example of what I am trying to create is listed below.
I am having trouble understanding how exactly to aggregate the data in such a fashion that would allow me to plot something like I shared.
For example, here is a snippet of the 1950s dataframe:
Start End Category Year count
1959-03-12 1959-03-14 2 1950 13
1958-03-18 1958-03-23 3 1950 6
1958-02-12 1958-02-18 3 1950 6
1957-12-03 1957-12-05 1 1950 32
1956-03-18 1956-03-20 1 1950 32
I have all of the counts for each category, but how do I turn that into data that is plottable on stacked histogram?

Aggregate your data first, then plot with the argument stacked=True.
pivot_table
df.pivot_table('count', 'Year', 'Category', 'sum').plot.bar(stacked=True)
groupby
df.groupby(['Year', 'Category'])['count'].sum().unstack().plot.bar(stacked=True)
Keep in mind that you can change the aggregation to something else.
df.pivot_table('count', 'Year', 'Category', 'first').plot.bar(stacked=True)
df.groupby(['Year', 'Category'])['count'].first().unstack().plot.bar(stacked=True)
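The choice of aggregation matters for the snippet in the question. A minimal sketch, assuming (as the snippet suggests) that the count column already holds the total for each category and is repeated on every row of that category:

```python
import pandas as pd

# The five rows from the question. Assumption: "count" is a precomputed
# per-category total, repeated on each row of that category.
df = pd.DataFrame({
    "Category": [2, 3, 3, 1, 1],
    "Year": [1950, 1950, 1950, 1950, 1950],
    "count": [13, 6, 6, 32, 32],
})

# "sum" adds the repeated totals together, which double-counts them here.
by_sum = df.pivot_table(values="count", index="Year",
                        columns="Category", aggfunc="sum")

# "first" keeps a single total per (Year, Category) pair.
by_first = df.pivot_table(values="count", index="Year",
                          columns="Category", aggfunc="first")

print(by_sum)
print(by_first)
```

Under that assumption 'first' is the aggregation that reproduces the stored totals; 'sum' would only be appropriate if each row carried a count of its own.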
Also, you can drop duplicates prior.
(
df.drop_duplicates(['Year', 'Category'])
.pivot_table('count', 'Year', 'Category')
.plot.bar(stacked=True)
)
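If the count column were not available, the same table could be built from the raw rows. The sketch below is one possible approach, not part of the answer above: it derives the decade from the Start date and counts storms with pd.crosstab. The two 1960s rows are invented purely so that a second decade appears in the output.

```python
import pandas as pd

storms = pd.DataFrame({
    "Start": pd.to_datetime([
        "1959-03-12", "1958-03-18", "1958-02-12", "1957-12-03", "1956-03-18",
        "1961-02-03", "1964-01-12",  # hypothetical rows, for illustration only
    ]),
    "Category": [2, 3, 3, 1, 1, 2, 1],
})

# Integer-divide the year by 10 and multiply back to get 1950, 1960, ...
decade = (storms["Start"].dt.year // 10 * 10).rename("Decade")

# One row per decade, one column per category, cells are storm counts.
counts = pd.crosstab(decade, storms["Category"])
print(counts)
```

Calling counts.plot.bar(stacked=True) on the result should then draw one stacked bar per decade.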

Related

How to get calendar years as column names and month and day as index for one timeseries

I have looked for solutions but seem to find none that point me in the right direction, hopefully, someone on here can help. I have a stock price data set, with a frequency of Month Start. I am trying to get an output where the calendar years are the column names, and the day and month will be the index (there will only be 12 rows since it is monthly data). The rows will be filled with the stock prices corresponding to the year and month. I, unfortunately, have no code since I have looked at for loops, groupby, etc but can't seem to figure this one out.
You might want to split the date into month and year and to apply a pivot:
s = pd.to_datetime(df.index)
out = (df
.assign(year=s.year, month=s.month)
.pivot_table(index='month', columns='year', values='Close', fill_value=0)
)
output:
year 2003 2004
month
1 0 2
2 0 3
3 0 4
12 1 0
Used input:
df = pd.DataFrame({'Close': [1,2,3,4]},
index=['2003-12-01', '2004-01-01', '2004-02-01', '2004-03-01'])
You need multiple steps to do that.
First split your column into the right format.
Then convert this column into two separate columns.
Then pivot the table accordingly.
import pandas as pd
# Test Dataframe
df = pd.DataFrame({'Date': ['2003-12-01', '2004-01-01', '2004-02-01', '2004-12-01'],
'Close': [6.661, 7.053, 6.625, 8.999]})
# Split datestring into list of form [year, month-day]
df = df.assign(Date=df.Date.str.split(pat='-', n=1))
# Separate date-list column into two columns
df = pd.DataFrame(df.Date.to_list(), columns=['Year', 'Date'], index=df.index).join(df.Close)
# Pivot the table
df = df.pivot(columns='Year', index='Date')
df
Output:
Close
Year 2003 2004
Date
01-01 NaN 7.053
02-01 NaN 6.625
12-01 6.661 8.999
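A variant of the same idea that parses the dates instead of splitting strings is sketched below. It reuses the test data from the answer above; the column name MonthDay is an arbitrary choice made for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2003-12-01", "2004-01-01", "2004-02-01", "2004-12-01"],
    "Close": [6.661, 7.053, 6.625, 8.999],
})

# Parse once, then derive the year and a month-day label from the datetimes.
dates = pd.to_datetime(df["Date"])
out = (
    df.assign(Year=dates.dt.year, MonthDay=dates.dt.strftime("%m-%d"))
      .pivot(index="MonthDay", columns="Year", values="Close")
)
print(out)
```

Passing values='Close' to pivot keeps the column labels as plain years rather than a two-level (Close, Year) header.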

Seaborn barplot: isna is not defined for MultiIndex

I would like to use seaborn barplot() to create a bar chart from a multi-indexed Series. I have grouped my dataset by two variables:
module_7_a_df = module_7_df.groupby(by=['Reported Race "MONRACE"', 'Hispanic Origin "HISPORIG"'])['SENTENCE CAP "SENSPCAP"'].count()
Grouping the dataframe creates a Series. This is what the resulting Series looks like:
When I try to create a barplot, I keep getting an error stating 'isna is not defined for MultiIndex.' The code for the barplot is:
sns.barplot(x=module_7_a_df.values, y=module_7_a_df.index)
This code works for Series created where the data has only been grouped by one column.
Does anyone know how to deal with this error?
Remove all NaN values from the columns you group by before you group them. Note that dropna with subset must be called on the original dataframe, not on the grouped Series:
module_7_df = module_7_df.dropna(subset=['Reported Race "MONRACE"', 'Hispanic Origin "HISPORIG"'])
When you have a multi-index, you need to call reset_index and then use hue= to enable the grouping. Using an example dataset:
import pandas as pd
import seaborn as sns
df = sns.load_dataset("tips")
counts = df.groupby(['time','day']).size()
counts
time day
Lunch Thur 61
Fri 7
Sat 0
Sun 0
Dinner Thur 1
Fri 12
Sat 87
Sun 76
dtype: int64
Then with the following:
counts = counts.to_frame('counts').reset_index()
sns.barplot(data = counts, x = "time",y="counts",hue="day")
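The reshaping step can be tried without downloading the tips dataset. This is a small sketch on an invented frame; the column names race, hispanic and sentence are placeholders, not the columns from the question.

```python
import pandas as pd

# Hypothetical stand-in for the question's dataframe.
df = pd.DataFrame({
    "race": ["A", "A", "B", "B", "B"],
    "hispanic": ["yes", "no", "no", "no", "yes"],
    "sentence": [1, 2, 3, 4, 5],
})

# Grouping by two columns yields a Series with a two-level MultiIndex.
counts = df.groupby(["race", "hispanic"])["sentence"].count()

# reset_index turns both index levels into ordinary columns and names
# the value column, giving the long-form table seaborn expects.
tidy = counts.reset_index(name="n")
print(tidy)
```

With the table in this shape, a call such as sns.barplot(data=tidy, x="race", y="n", hue="hispanic") receives plain columns rather than a MultiIndex.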

stacked bar chart for grouped pandas series

I have a dataframe that looks like this:
ID Date Total_Amount priority
1 2007 4488 High
2 2007 40981 Low
3 2017 450 Medium
4 2008 1000 Low
Each row is a new person (ID), and the rows show how much they spent per year (Total_Amount).
I want to create a bar chart with the years on the x-axis and Total_Amount as the bar height, stacked by priority. For example, if 10 people spent money in 2007 and their Total_Amount sum is £100,000, the height of the bar needs to be 100,000, stacked by priority (e.g. 5 may have been High, 4 Low and 1 Medium).
I tried using crosstab with date as row and priority as columns but I don't get a dataframe for Total_Amount spent, I get one for the number of people in each priority.
You can use groupby() and then unstack():
df2 = df.groupby(['Date','priority'])['Total_Amount'].sum().unstack('priority').fillna(0)
df2.plot(kind='bar', stacked=True)
Produces:
Almost the same, but still using crosstab:
pd.crosstab(index=df.Date,columns=df.priority,values=df.Total_Amount,aggfunc='sum')\
.fillna(0).plot(kind='bar', stacked=True)
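To see what the aggregated table looks like before plotting, here is a short sketch that applies the groupby approach to the four sample rows from the question. The variable names totals and bar_heights are chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Date": [2007, 2007, 2017, 2008],
    "Total_Amount": [4488, 40981, 450, 1000],
    "priority": ["High", "Low", "Medium", "Low"],
})

# Sum the spend per (year, priority), then spread priorities into columns.
totals = (
    df.groupby(["Date", "priority"])["Total_Amount"]
      .sum()
      .unstack("priority")
      .fillna(0)
)

# The full height of each stacked bar is the row sum.
bar_heights = totals.sum(axis=1)

print(totals)
print(bar_heights)
```

Each row of totals becomes one bar and each column one stacked segment when totals.plot(kind='bar', stacked=True) is called.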

How to plot by category over time

I have two columns, categorical and year, that I am trying to plot. I am trying to take the sum total of each categorical per year to create a multi-class time series plot.
ax = data[data.categorical=="cat1"]["categorical"].plot(label='cat1')
data[data.categorical=="cat2"]["categorical"].plot(ax=ax, label='cat2')
data[data.categorical=="cat3"]["categorical"].plot(ax=ax, label='cat3')
plt.xlabel("Year")
plt.ylabel("Number per category")
sns.despine()
But I am getting an error stating that there is no numeric data to plot. I am looking for something similar to the above, perhaps with data[data.categorical=="cat3"]["categorical"].lambda x : (1 for x in data.categorical)
I will use the following lists as examples.
categorical = ["cat1","cat1","cat2","cat3","cat2","cat1","cat3","cat2","cat1","cat3","cat3","cat3","cat2","cat1","cat2","cat3","cat2","cat2","cat3","cat1","cat1","cat1","cat3"]
year = [2013,2014,2013,2015,2014,2014,2013,2014,2014,2015,2015,2013,2014,2014,2013,2014,2015,2015,2015,2013,2014,2015,2013]
My goal is to obtain something similar to the following picture
I'm hesitant to call this a "solution", as it's basically just a summary of basic Pandas functionality, which is explained in the same documentation where you found the time series plot you've placed in your post. But seeing as there's some confusion around groupby and plotting, a demo may help clear things up.
We can use two calls to groupby().
The first groupby() gets a count of category appearances per year, using the count aggregation.
The second groupby() is used to plot the time series for each category.
To start, generate a sample data frame:
import pandas as pd
categorical = ["cat1","cat1","cat2","cat3","cat2","cat1","cat3","cat2",
"cat1","cat3","cat3","cat3","cat2","cat1","cat2","cat3",
"cat2","cat2","cat3","cat1","cat1","cat1","cat3"]
year = [2013,2014,2013,2015,2014,2014,2013,2014,2014,2015,2015,2013,
2014,2014,2013,2014,2015,2015,2015,2013,2014,2015,2013]
df = pd.DataFrame({'categorical':categorical,
'year':year})
categorical year
0 cat1 2013
1 cat1 2014
...
21 cat1 2015
22 cat3 2013
Now get counts per category, per year:
# reset_index() gives a column for counting, after groupby uses year and category
ctdf = (df.reset_index()
.groupby(['year','categorical'], as_index=False)
.count()
# rename isn't strictly necessary here, it's just for readability
.rename(columns={'index':'ct'})
)
year categorical ct
0 2013 cat1 2
1 2013 cat2 2
2 2013 cat3 3
3 2014 cat1 5
4 2014 cat2 3
5 2014 cat3 1
6 2015 cat1 1
7 2015 cat2 2
8 2015 cat3 4
Finally, plot time series for each category, keyed by color:
from matplotlib import pyplot as plt
fig, ax = plt.subplots()
# key gives the group name (i.e. category), data gives the actual values
for key, data in ctdf.groupby('categorical'):
data.plot(x='year', y='ct', ax=ax, label=key)
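The two groupby steps above can also be condensed with pd.crosstab, which counts occurrences directly. This is an alternative sketch rather than the method used in the answer; it uses the lists from the question.

```python
import pandas as pd

categorical = ["cat1", "cat1", "cat2", "cat3", "cat2", "cat1", "cat3", "cat2",
               "cat1", "cat3", "cat3", "cat3", "cat2", "cat1", "cat2", "cat3",
               "cat2", "cat2", "cat3", "cat1", "cat1", "cat1", "cat3"]
year = [2013, 2014, 2013, 2015, 2014, 2014, 2013, 2014, 2014, 2015, 2015, 2013,
        2014, 2014, 2013, 2014, 2015, 2015, 2015, 2013, 2014, 2015, 2013]
df = pd.DataFrame({"categorical": categorical, "year": year})

# Rows are years, columns are categories, cells are occurrence counts.
wide = pd.crosstab(df["year"], df["categorical"])
print(wide)
```

Because the result is numeric with one column per category, wide.plot() should draw one line per category over the years.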
Have you tried groupby?
df.groupby(["year","categorical"]).size()

pandas: How to format timestamp axis labels nicely in df.plt()?

I have a dataset that looks like this:
bnf_code month items cost
0 040201060AAAIAI 2016-05-01 5 572.20
1 040201060AAAKAK 2016-05-01 164 14805.19
2 040201060AAALAL 2016-05-01 13465 14486.07
Doing df.dtypes shows that the month column is a datetime64[ns] type.
I am now trying to plot the cost per month for a particular product:
df[df.bnf_code=='040201060AAAIAI'][['month', 'cost']].plot()
plt.show()
This works, but the x-axis isn't a timestamp as I'd expect:
How can I format the x-axis labels nicely, with month and year labels?
Update: I also tried this, to get a bar chart, which does output timestamps on the x-axis, but in a very long unwieldy format:
df[df.bnf_code=='040201060AAAIAI'].plot.bar(x='month', y='cost', title='Spending on 040201060AAAIAI')
If you set the dates as index, the x-axis should be labelled properly:
df[df.bnf_code=='040201060AAAIAI'][['month', 'cost']].set_index('month').plot()
I have simply added set_index to your code.
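For the bar chart variant mentioned in the question's update, one option is to turn the timestamps into short strings before plotting. The sketch below assumes the frame has already been filtered to a single product; only the first cost value comes from the question, and the other two rows are made up for illustration.

```python
import pandas as pd

one_product = pd.DataFrame({
    "month": pd.to_datetime(["2016-05-01", "2016-06-01", "2016-07-01"]),
    # 572.20 is from the question; the later values are hypothetical.
    "cost": [572.20, 601.35, 588.10],
})

# Format each timestamp as year-month and use it as the index, so the
# bar chart shows these short strings as its tick labels.
labelled = (
    one_product
    .assign(label=one_product["month"].dt.strftime("%Y-%m"))
    .set_index("label")
)
print(labelled["cost"])
```

Calling labelled["cost"].plot.bar() then labels the bars with the formatted strings instead of full timestamps.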
