How to plot by category over time - python

I have two columns, categorical and year, that I am trying to plot. I am trying to take the sum total of each categorical per year to create a multi-class time series plot.
ax = data[data.categorical=="cat1"]["categorical"].plot(label='cat1')
data[data.categorical=="cat2"]["categorical"].plot(ax=ax, label='cat2')
data[data.categorical=="cat3"]["categorical"].plot(ax=ax, label='cat3')
plt.xlabel("Year")
plt.ylabel("Number per category")
sns.despine()
But I am getting an error stating that there is no numeric data to plot. I am looking for something similar to the above, perhaps with data[data.categorical=="cat3"]["categorical"].lambda x : (1 for x in data.categorical)
I will use the following lists as examples.
categorical = ["cat1","cat1","cat2","cat3","cat2","cat1","cat3","cat2","cat1","cat3","cat3","cat3","cat2","cat1","cat2","cat3","cat2","cat2","cat3","cat1","cat1","cat1","cat3"]
year = [2013,2014,2013,2015,2014,2014,2013,2014,2014,2015,2015,2013,2014,2014,2013,2014,2015,2015,2015,2013,2014,2015,2013]
My goal is to obtain something similar to the following picture

I'm hesitant to call this a "solution", as it's basically just a summary of basic Pandas functionality, which is explained in the same documentation where you found the time series plot you've placed in your post. But seeing as there's some confusion around groupby and plotting, a demo may help clear things up.
We can use two calls to groupby().
The first groupby() gets a count of category appearances per year, using the count aggregation.
The second groupby() is used to plot the time series for each category.
To start, generate a sample data frame:
import pandas as pd
categorical = ["cat1","cat1","cat2","cat3","cat2","cat1","cat3","cat2",
"cat1","cat3","cat3","cat3","cat2","cat1","cat2","cat3",
"cat2","cat2","cat3","cat1","cat1","cat1","cat3"]
year = [2013,2014,2013,2015,2014,2014,2013,2014,2014,2015,2015,2013,
2014,2014,2013,2014,2015,2015,2015,2013,2014,2015,2013]
df = pd.DataFrame({'categorical':categorical,
'year':year})
categorical year
0 cat1 2013
1 cat1 2014
...
21 cat1 2015
22 cat3 2013
Now get counts per category, per year:
# reset_index() gives a column for counting, after groupby uses year and category
ctdf = (df.reset_index()
.groupby(['year','categorical'], as_index=False)
.count()
# rename isn't strictly necessary here, it's just for readability
.rename(columns={'index':'ct'})
)
year categorical ct
0 2013 cat1 2
1 2013 cat2 2
2 2013 cat3 3
3 2014 cat1 5
4 2014 cat2 3
5 2014 cat3 1
6 2015 cat1 1
7 2015 cat2 2
8 2015 cat3 4
Finally, plot time series for each category, keyed by color:
from matplotlib import pyplot as plt
fig, ax = plt.subplots()
# key gives the group name (i.e. category), data gives the actual values
for key, data in ctdf.groupby('categorical'):
    data.plot(x='year', y='ct', ax=ax, label=key)

Have you tried groupby?
df.groupby(["year","categorical"]).count()
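One caveat with the one-liner above: on a frame that holds only the two grouping columns, count() has no remaining columns to count, so size() is the safer aggregation. A minimal sketch, assuming the sample lists from the question; the names counts and plot_counts are illustrative, and unstack() is just one way to get one column per category:

```python
import pandas as pd

categorical = ["cat1", "cat1", "cat2", "cat3", "cat2", "cat1", "cat3", "cat2",
               "cat1", "cat3", "cat3", "cat3", "cat2", "cat1", "cat2", "cat3",
               "cat2", "cat2", "cat3", "cat1", "cat1", "cat1", "cat3"]
year = [2013, 2014, 2013, 2015, 2014, 2014, 2013, 2014, 2014, 2015, 2015, 2013,
        2014, 2014, 2013, 2014, 2015, 2015, 2015, 2013, 2014, 2015, 2013]
df = pd.DataFrame({"categorical": categorical, "year": year})

# size() counts rows per (year, category) pair; unstack() moves the
# categories into columns, leaving one row per year.
counts = df.groupby(["year", "categorical"]).size().unstack(fill_value=0)
print(counts)


def plot_counts(counts):
    """Draw one line per category with year on the x-axis."""
    # Imported here so the counting step runs even without matplotlib.
    import matplotlib.pyplot as plt

    ax = counts.plot(marker="o", xticks=counts.index)
    ax.set_xlabel("Year")
    ax.set_ylabel("Number per category")
    plt.show()
```

Calling plot_counts(counts) should give a multi-line time series with one line per category.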

Related

Seaborn barplot: isna is not defined for MultiIndex

I would like to use seaborn barplot() to create a bar chart from a multi-indexed Series. I have grouped my dataset by two variables:
module_7_a_df = module_7_df.groupby(by=['Reported Race "MONRACE"', 'Hispanic Origin "HISPORIG"'])['SENTENCE CAP "SENSPCAP"'].count()
Grouping the dataframe creates a Series. This is what the resulting Series looks like:
When I try to create a barplot, I keep getting an error stating 'isna is not defined for MultiIndex.' The code for the barplot is:
sns.barplot(x=module_7_a_df.values, y=module_7_a_df.index)
This code works for Series created where the data has only been grouped by one column.
Can someone help me understand how to deal with this error?
Remove all NaN values from the columns you group by before grouping.
module_7_df = module_7_df.dropna(subset=['Reported Race "MONRACE"', 'Hispanic Origin "HISPORIG"'])
When you have a multi-index, you need to call reset_index and then use hue= to enable the grouping. Using an example dataset:
import pandas as pd
import seaborn as sns
df = sns.load_dataset("tips")
counts = df.groupby(['time','day']).size()
counts
time day
Lunch Thur 61
Fri 7
Sat 0
Sun 0
Dinner Thur 1
Fri 12
Sat 87
Sun 76
dtype: int64
Then with the following:
counts = counts.to_frame('counts').reset_index()
sns.barplot(data = counts, x = "time",y="counts",hue="day")
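The same flattening step can also be written without to_frame, since Series.reset_index accepts a name argument for the values column. A small sketch on made-up data; the column names race, origin, and n are hypothetical stand-ins, not the question's actual columns:

```python
import pandas as pd

# Made-up stand-in for the grouped survey data
df = pd.DataFrame({
    "race": ["A", "A", "B", "B", "B"],
    "origin": ["x", "y", "x", "x", "y"],
})

# Grouping by two columns yields a Series with a two-level MultiIndex
grouped = df.groupby(["race", "origin"]).size()

# reset_index(name=...) turns both index levels into ordinary columns
# and stores the counts in a column called "n"
flat = grouped.reset_index(name="n")
print(flat)
```

The resulting flat frame can then be passed as data= to sns.barplot, with x, y, and hue set to column names as in the answer above.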

How can i create multiple pie chart using matplotlib

I have a Pandas DataFrame that looks like this:
Year EventCode CityName EventCount
2015 10 Jakarta 12
2015 10 Yogjakarta 15
2015 10 Padang 27
...
2015 13 Jayapura 34
2015 14 Jakarta 24
2015 14 Yogjaarta 15
...
2019 14 Jayapura 12
I want to visualize the top 5 cities with the largest EventCount as pie charts, grouped by EventCode, for every year.
How can I do that?
This could be achieved by restructuring your data with pivot_table, filtering on top cities using sort_values and the DataFrame.plot.pie method with subplots parameter:
# Pivot your data
df_piv = df.pivot_table(index='EventCode', columns='CityName',
values='EventCount', aggfunc='sum', fill_value=0)
# Get top 5 cities by total EventCount
plot_cities = df_piv.sum().sort_values(ascending=False).head(5).index
# Plot
df_piv.reindex(columns=plot_cities).plot.pie(subplots=True,
figsize=(10, 7),
layout=(-1, 3))
[out]
Pandas supports plotting each column into a subplot automatically. So you want to select the CityName as index, make EventCode as column and plot.
(df.sort_values('EventCount', ascending=False) # sort descending by `EventCount`
.groupby('EventCode', as_index=False)
.head(5) # get 5 most count within `EventCode`
.pivot(index='CityName', # pivot for plot.pie
columns='EventCode',
values='EventCount'
)
.plot.pie(subplots=True, # plot with some options
figsize=(10,6),
layout=(2,3))
)
Output:
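Since the question asks for the top cities per event code in every year, the selection step may need to group on both Year and EventCode. A minimal sketch of that selection, assuming made-up rows in the question's layout; TOP_N and top are illustrative names, and TOP_N is set to 2 here only to keep the sample small:

```python
import pandas as pd

# Hypothetical sample rows in the shape shown in the question
df = pd.DataFrame({
    "Year": [2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016],
    "EventCode": [10, 10, 10, 14, 10, 10, 14, 14],
    "CityName": ["Jakarta", "Padang", "Medan", "Jakarta",
                 "Jakarta", "Padang", "Padang", "Medan"],
    "EventCount": [12, 27, 5, 24, 30, 8, 9, 11],
})

TOP_N = 2  # the question asks for 5

# Sort so the largest counts come first, then keep the first TOP_N rows
# of every (Year, EventCode) group.
top = (df.sort_values("EventCount", ascending=False)
         .groupby(["Year", "EventCode"])
         .head(TOP_N)
         .sort_values(["Year", "EventCode", "EventCount"],
                      ascending=[True, True, False])
         .reset_index(drop=True))
print(top)
```

Each (Year, EventCode) group of top could then be drawn on its own axes, for example with group.set_index("CityName")["EventCount"].plot.pie(ax=...).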

Search column for specific phrases and count the amount of times they appear in the column and plot to bar graph

Search column for each month of the year. Column is organized like this "01-Jan-2018". I want to find how many times "Jan-2018" appears in the column. Basically count it and plot it on a bar graph. I want it to show all the quantities for "Jan-2018" , "Feb-2018", etc. Should be 12 bars on the graph. Maybe using count or sum. I am pulling the data from a CSV using pandas and python.
I have tried printing it to the console with some success, but I am confused about the correct way to search on only part of the date.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import csv
import seaborn as sns
data = pd.read_csv(r'C:\Users\rmond\Downloads\PS_csvFile1.csv', error_bad_lines=False, encoding="ISO-8859-1", skiprows=6)
cols = data.columns
cols = cols.map(lambda x: x.replace(' ', '_') if isinstance(x, (str)) else x)
data.columns = cols
print(data.groupby('Case_Date').mean().plot(kind='bar'))
I am expecting a bar graph that shows the total quantity for each month, so there should be 12 bars. But I am not sure how to search the column 12 times, each time looking only at the data for one month, ignoring the day and matching only the month and year.
IIUC, this is what you need.
Let's work with the below dataframe as input dataframe.
date
0 1/31/2018
1 2/28/2018
2 2/28/2018
3 3/31/2018
4 4/30/2018
5 5/31/2018
6 6/30/2018
7 6/30/2018
8 7/31/2018
9 8/31/2018
10 9/30/2018
11 9/30/2018
12 9/30/2018
13 9/30/2018
14 10/31/2018
15 11/30/2018
16 12/31/2018
The lines of code below plot the count for each month as a bar graph. When a column holds datetime objects, many operations become much easier and the column's contents are much more flexible, so you don't need to search for the month name as a string.
df['date'] = pd.to_datetime(df['date'])
df['my']=df.date.dt.strftime('%b-%Y')
# size() counts rows per month-year; sort=False keeps the order of first appearance
counts = df.groupby('my', sort=False).size()
ax = counts.plot(kind='bar', rot=90)
Output
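For dates written in the "01-Jan-2018" style from the question, a variation that does not rely on the rows already being in chronological order is to group on a monthly period. A minimal sketch, assuming made-up values and an English locale for the %b month abbreviations; the names s, dates, and monthly are illustrative:

```python
import pandas as pd

# Hypothetical values in the "01-Jan-2018" style described in the question
s = pd.Series(["01-Jan-2018", "15-Jan-2018", "03-Feb-2018",
               "28-Mar-2018", "09-Feb-2018", "11-Jan-2018"])

# Parse the strings into real datetimes
dates = pd.to_datetime(s, format="%d-%b-%Y")

# Group on the month-year period; the result is sorted chronologically
monthly = dates.groupby(dates.dt.to_period("M")).size()

# Relabel the periods as "Jan-2018", "Feb-2018", ... for display
monthly.index = monthly.index.strftime("%b-%Y")
print(monthly)
```

monthly.plot(kind="bar") would then draw one bar per month that occurs in the data.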

Plotting histogram for column by grouping two column in pandas

I am new to pandas and matplotlib. I have a CSV file that covers the years 2012 to 2018. For each month of each year, I have Rain data. I want to use a histogram to analyze which month of the year has the maximum rainfall. Here is my dataset.
year month Temp Rain
2012 1 10 100
2012 2 20 200
2012 3 30 300
.. .. .. ..
2012 12 40 400
2013 1 50 300
2013 2 60 200
.. .. .. ..
2018 12 70 400
I was not able to plot a histogram. I tried plotting a bar chart but did not get the desired result. Here is what I have tried:
import pandas as pd
import numpy as npy
import matplotlib.pyplot as plt
df2=pd.read_csv('Monthly.csv')
df2.groupby(['year','month'])['Rain'].count().plot(kind="bar",figsize=(20,10))
Here is the output I got:
Please suggest an approach to plot a histogram showing in which month the maximum rainfall occurs, grouped by year.
Probably you don't want to see the count per group but
df2.groupby(['year','month'])['Rain'].first().plot(kind="bar",figsize=(20,10))
or maybe
df2.groupby(['month'])['Rain'].sum().plot(kind="bar",figsize=(20,10))
You are close to the solution: use max() instead of count().
df2.groupby(['year','month'])['Rain'].max().plot(kind="bar",figsize=(20,10))
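If the goal is to identify the wettest month within each year, rather than to plot every year-month pair, idxmax() on a per-year group may be closer to what is wanted. A minimal sketch on made-up rainfall values in the question's layout; wettest and month_counts are illustrative names:

```python
import pandas as pd

# Hypothetical rainfall values in the question's layout
df2 = pd.DataFrame({
    "year":  [2012, 2012, 2012, 2013, 2013, 2013],
    "month": [1, 2, 3, 1, 2, 3],
    "Rain":  [100, 200, 300, 300, 200, 150],
})

# idxmax() returns the row label of the largest Rain value in each year;
# .loc then pulls those rows out of the original frame.
wettest = df2.loc[df2.groupby("year")["Rain"].idxmax(), ["year", "month", "Rain"]]
print(wettest)

# How often each month was the wettest of its year
month_counts = wettest["month"].value_counts().sort_index()
print(month_counts)
```

month_counts.plot(kind="bar") would then show how many years each month came out on top.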
First group by year and month as you already did, but keep only the maximum rainfall.
series_df2 = df2.groupby(['year','month'], sort=False)['Rain'].max()
Then unstack the series, transpose it and plot it.
series_df2.unstack().T.plot(kind='bar', subplots=False, layout=(2,2))
This will give you an output that looks like this for your sample data:

Groupby and plot bar graph

I want to plot a bar graph of sales per year, with 'year' on the x-axis and the sum of weekly sales per year on the y-axis. While plotting I am getting KeyError: 'year'. I guess it's because 'year' became the index during the group by.
Below is the sample content from csv file:
Store year Weekly_Sales
1 2014 24924.5
1 2010 46039.49
1 2015 41595.55
1 2010 19403.54
1 2015 21827.9
1 2010 21043.39
1 2014 22136.64
1 2010 26229.21
1 2014 57258.43
1 2010 42960.91
Below is the code I used to group by
storeDetail_df = pd.read_csv('Details.csv')
result_group_year= storeDetail_df.groupby(['year'])
total_by_year = result_group_year['Weekly_Sales'].agg([np.sum])
total_by_year.plot(kind='bar' ,x='year',y='sum',rot=0)
Updated the Code and below is the output:
DataFrame output:
year sum
0 2010 42843534.38
1 2011 45349314.40
2 2012 35445927.76
3 2013 0.00
Below is the graph I am getting:
When reading your CSV file, you need to treat whitespace as the delimiter by passing delim_whitespace=True, and then reset the index after summing Weekly_Sales. Below is the working code:
storeDetail_df = pd.read_csv('Details.csv', delim_whitespace=True)
result_group_year= storeDetail_df.groupby(['year'])
total_by_year = result_group_year['Weekly_Sales'].agg([np.sum]).reset_index()
total_by_year.plot(kind='bar' ,x='year',y='sum',rot=0, legend=False)
Output
If year is becoming your index because of the group by, you need to turn it back into a column before plotting.
Try
total_by_year = total_by_year.reset_index(drop=False)
You might want to try this
storeDetail_df = pd.read_csv('Details.csv')
result_group_year= storeDetail_df.groupby(['year'])['Weekly_Sales'].sum()
result_group_year = result_group_year.reset_index(drop=False)
result_group_year.plot.bar(x='year', y='Weekly_Sales')
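Another way to avoid the KeyError is to keep year as a column from the start with as_index=False. A minimal sketch using the sample rows from the question; io.StringIO stands in for Details.csv, and sep=r"\s+" is used here as the regex form of whitespace splitting (both are assumptions about how the file is laid out):

```python
import io

import pandas as pd

# The sample rows from the question, separated by whitespace
raw = """Store year Weekly_Sales
1 2014 24924.5
1 2010 46039.49
1 2015 41595.55
1 2010 19403.54
1 2015 21827.9
1 2010 21043.39
1 2014 22136.64
1 2010 26229.21
1 2014 57258.43
1 2010 42960.91
"""
storeDetail_df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# as_index=False keeps 'year' as a regular column, so no reset_index is needed
total_by_year = storeDetail_df.groupby("year", as_index=False)["Weekly_Sales"].sum()
print(total_by_year)
```

With year available as a column, total_by_year.plot(kind='bar', x='year', y='Weekly_Sales', rot=0) should no longer raise the KeyError.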
