I want to plot a bar graph of sales per year, with 'year' on the x-axis and the sum of weekly sales for each year on the y-axis. While plotting I am getting KeyError: 'year'. I guess it's because 'year' became the index during the groupby.
Below is the sample content from the CSV file:
Store  year  Weekly_Sales
1      2014  24924.5
1      2010  46039.49
1      2015  41595.55
1      2010  19403.54
1      2015  21827.9
1      2010  21043.39
1      2014  22136.64
1      2010  26229.21
1      2014  57258.43
1      2010  42960.91
Below is the code I used for the grouping:
import pandas as pd
import numpy as np

storeDetail_df = pd.read_csv('Details.csv')
result_group_year = storeDetail_df.groupby(['year'])
total_by_year = result_group_year['Weekly_Sales'].agg([np.sum])
total_by_year.plot(kind='bar', x='year', y='sum', rot=0)
I updated the code, and below is the output:
DataFrame output:
year sum
0 2010 42843534.38
1 2011 45349314.40
2 2012 35445927.76
3 2013 0.00
Below is the graph I am getting:
When reading your CSV file, you need to use whitespace as the delimiter (delim_whitespace=True) and then reset the index after summing up Weekly_Sales. Below is the working code:
import pandas as pd
import numpy as np

storeDetail_df = pd.read_csv('Details.csv', delim_whitespace=True)
result_group_year = storeDetail_df.groupby(['year'])
total_by_year = result_group_year['Weekly_Sales'].agg([np.sum]).reset_index()
total_by_year.plot(kind='bar', x='year', y='sum', rot=0, legend=False)
Output
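Note: recent pandas versions deprecate delim_whitespace in favor of the equivalent regex separator, so on newer pandas the read step can also be written as:
storeDetail_df = pd.read_csv('Details.csv', sep=r'\s+')  # same effect as delim_whitespace=True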
The groupby makes year your index, so you need to remove it from the index before plotting.
Try
total_by_year = total_by_year.reset_index(drop=False)
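One pitfall to watch for: reset_index(inplace=True) returns None, so assigning its result back would overwrite your DataFrame with None. Use one form or the other, not both:
total_by_year = total_by_year.reset_index(drop=False)   # returns a new DataFrame
# or, modifying in place (returns None, so don't assign the result):
# total_by_year.reset_index(drop=False, inplace=True)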
You might want to try this:
import pandas as pd

storeDetail_df = pd.read_csv('Details.csv')
result_group_year = storeDetail_df.groupby(['year'])['Weekly_Sales'].sum()
result_group_year = result_group_year.reset_index(drop=False)
result_group_year.plot.bar(x='year', y='Weekly_Sales')
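As a side note, the reset_index step is optional if you only need the plot: the grouped Series already has year as its index, and Series.plot.bar() uses the index for the x-axis automatically.
storeDetail_df.groupby('year')['Weekly_Sales'].sum().plot.bar(rot=0)  # year index becomes the x-axis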
Month Year Open High Low Close/Price Volume
6 2019 86.78 87.11 86.06 86.55 1507828
6 2019 86.63 87.23 84.81 85.06 2481284
6 2019 85.38 85.81 84.75 85.33 2034693
6 2019 85.65 86.86 85.13 86.43 1394847
6 2019 86.66 87.74 86.66 87.55 3025379
7 2019 88.84 89.72 87.77 88.45 4017249
7 2019 89.21 90 87.95 88.87 2237183
7 2019 89.14 91.08 89.14 90.67 1647124
7 2019 90.39 90.95 89.07 90.59 3227673
I want to get the monthly average of: Open High Low Close/Price
How do I set two values (Month, Year) as parameters to get a value that is in another column?
import pandas as pd

df = pd.read_excel('DatosUnited.xlsx')
month = df.groupby('Month')
year = df.groupby('Year')
june2019 = month.get_group(6)
year2019 = year.get_group(2019)
I tried something like this, but I don't know how to use both as a filter simultaneously.
You can use .groupby() with multiple columns, and then you can use .mean() to get the desired averages:
df.groupby(["Month", "Year"]).mean()
This outputs:
Open High Low Close/Price Volume
Month Year
6 2019 86.220 86.9500 85.4820 86.184 2088806.20
7 2019 89.395 90.4375 88.4825 89.645 2782307.25
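If you also want to pull out a single (Month, Year) combination rather than averaging all of them, get_group() accepts a tuple when grouping by multiple columns. A minimal sketch, assuming Month and Year are stored as integers:
grouped = df.groupby(["Month", "Year"])
june2019 = grouped.get_group((6, 2019))                   # rows where Month == 6 and Year == 2019
june2019[["Open", "High", "Low", "Close/Price"]].mean()   # monthly averages for that group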
I'm trying to finish my work project, but I'm getting stuck at a certain point.
Part of the dataframe I have is this:
year_month  year  month
2007-01     2007  1
2009-07     2009  7
2010-03     2010  3
However, I want to add a "season" column. I'm illustrating soccer seasons, and the season column needs to show which season the player plays in. So if month is equal to or smaller than 3, the "season" column should correspond to (year-1) + "/" + year, and if larger, to year + "/" + (year+1).
The table should look like this:
year_month  year  month  season
2007-01     2007  1      2006/2007
2009-07     2009  7      2009/2010
2010-03     2010  3      2009/2010
Hopefully someone else can help me with this problem.
Here is the code to create the first table:
import pandas as pd

df = pd.DataFrame({'year_month': ["2007-01", "2009-07", "2010-03"],
                   'year': [2007, 2009, 2010],
                   'month': [1, 7, 3]})

# convert the 'year_month' column to datetime format
df['year_month'] = pd.to_datetime(df['year_month'])
Thanks in advance!
You can use np.where() to specify the condition and pick the corresponding string depending on whether the condition is True or False, as follows:
import numpy as np

df['season'] = np.where(df['month'] <= 3,
                        (df['year'] - 1).astype(str) + '/' + df['year'].astype(str),
                        df['year'].astype(str) + '/' + (df['year'] + 1).astype(str))
Result:
year_month year month season
0 2007-01-01 2007 1 2006/2007
1 2009-07-01 2009 7 2009/2010
2 2010-03-01 2010 3 2009/2010
You can use a lambda function with a conditional and axis=1 to apply it to each row. Using f-strings keeps the code short when turning values from the year column into the strings needed for your new season column.
df['season'] = df.apply(lambda x: f"{x['year'] - 1}/{x['year']}" if x['month'] <= 3
                        else f"{x['year']}/{x['year'] + 1}", axis=1)
Output:
year_month year month season
0 2007-01 2007 1 2006/2007
1 2009-07 2009 7 2009/2010
2 2010-03 2010 3 2009/2010
I have a DataFrame object named df, and I want to generate a list of properly formatted dates. (datetime module is properly imported)
I wrote:
dates = [datetime.date(df.at(index, "year"), df.at(index, "month"), df.at(index, "day")) for index in df.index]
which raises a TypeError saying the object is not callable.
If it helps, this is the value of df.head():
year month day smoothed trend
0 2011 1 1 391.26 389.76
1 2011 1 2 391.29 389.77
2 2011 1 3 391.33 389.78
3 2011 1 4 391.36 389.78
4 2011 1 5 391.39 389.79
(This is new to me, so I have likely misinterpreted the docs)
df.at is not callable; it is a property that supports indexing. So change the parentheses around it to square brackets:
df.at[index, "year"]
i.e. ( to [, and similarly for the closing one.
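With that change, the original list comprehension works as intended:
dates = [datetime.date(df.at[index, "year"], df.at[index, "month"], df.at[index, "day"]) for index in df.index]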
Apart from using [ instead of (, you can achieve your goal simply with
pd.to_datetime(df[['year', 'month', 'day']])
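pd.to_datetime() recognizes the year, month, and day columns as date components and returns a datetime Series. If you specifically want a list of datetime.date objects, as in the original attempt, you can convert afterwards:
dates = pd.to_datetime(df[['year', 'month', 'day']]).dt.date.tolist()  # list of datetime.date objects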
I have a Pandas DataFrame that looks like this:
Year EventCode CityName EventCount
2015 10 Jakarta 12
2015 10 Yogjakarta 15
2015 10 Padang 27
...
2015 13 Jayapura 34
2015 14 Jakarta 24
2015 14 Yogjaarta 15
...
2019 14 Jayapura 12
I want to visualize the top 5 cities that have the biggest EventCount (with a pie chart), grouped by EventCode in every year.
How can I do that?
This can be achieved by restructuring your data with pivot_table, filtering to the top cities using sort_values, and using the DataFrame.plot.pie method with the subplots parameter:
# Pivot your data
df_piv = df.pivot_table(index='EventCode', columns='CityName',
                        values='EventCount', aggfunc='sum', fill_value=0)

# Get the top 5 cities by total EventCount
plot_cities = df_piv.sum().sort_values(ascending=False).head(5).index

# Plot
df_piv.reindex(columns=plot_cities).plot.pie(subplots=True,
                                             figsize=(10, 7),
                                             layout=(-1, 3))
Output:
Pandas supports plotting each column into its own subplot automatically. So you want to make CityName the index, EventCode the columns, and then plot.
(df.sort_values('EventCount', ascending=False)  # sort descending by EventCount
   .groupby('EventCode', as_index=False)
   .head(5)                                     # keep the 5 largest counts within each EventCode
   .pivot(index='CityName',                     # pivot for plot.pie
          columns='EventCode',
          values='EventCount')
   .plot.pie(subplots=True,                     # plot with some options
             figsize=(10, 6),
             layout=(2, 3))
)
Output:
I am really stuck on how to approach adding columns to Pandas dynamically. I've been trying to search for an answer to work through this; however, I am afraid that when searching I may be using the wrong terminology to describe what I am attempting to do.
I have a dataframe returned from a query that looks like the following:
department action date
marketing close 09-01-2017
marketing close 07-01-2018
marketing close 06-01-2017
marketing close 10-21-2019
marketing open 08-01-2018
marketing other 07-14-2018
sales open 02-01-2019
sales open 02-01-2017
sales close 02-22-2019
The ultimate goal is a count of each action type, grouped within particular date ranges.
My DESIRED output is something along the lines of:
department 01/01/2017-12/31/2017 01/01/2018-12/31/2018 01/01/2019-12/31/2019
open close other open close other open close other
marketing 0 2 0 1 1 1 0 1 0
sales 1 0 0 0 0 0 1 1 0
"Department" would be my index, then the contents would be filtered by date ranges specified in a list I provide, followed by the action taken (with counts). Being newer to this, I am confused as to what approach I should take - for example should I use Python (should I be looping or iterating), or should the heavy lifting be done in PANDAS. If in PANDAS, I am having difficulty determining what function to use (I've been looking at get_dummy() etc.).
I'd imagine this would be accomplished with either 1. Some type or FOR loop iterating through, 2. Adding a column to the dataframe based on the list then filtering the data underneath based on the value(s), or 3. using a function I am not aware of in Pandas
I have explained more of my thought process in this question, but I am not sure whether the question is unclear, which may be why it is unanswered:
Building a dataframe with dynamic date ranges using filtered results from another dataframe
There are quite a few concepts you need at once here.
First, you don't yet have the count. From your desired output I gather that you want it yearly, but you can specify any time frame you want. Then just count with groupby() and count():
In [66]: df2 = df.groupby([pd.to_datetime(df.date).dt.year, "action", "department"]).count().squeeze().rename("count")
Out[66]:
date action department
2017 close marketing 2
open sales 1
2018 close marketing 1
open marketing 1
other marketing 1
2019 close marketing 1
sales 1
open sales 1
Name: count, dtype: int64
The squeeze() and rename() are there because afterwards both the count column and the year would be called date, and you would get a name conflict. You could equivalently use rename(columns={'date': 'count'}) and not squeeze down to a Series.
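Spelled out, that alternative looks like this (a sketch; df2_frame is a hypothetical name for the resulting one-column DataFrame):
df2_frame = df.groupby([pd.to_datetime(df.date).dt.year, "action", "department"]).count().rename(columns={'date': 'count'})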
The second step is a pivot_table. This creates column names from values. Because there are combinations of date and action without a corresponding value, you need pivot_table.
In [62]: df2.reset_index().pivot_table(index="department", columns=["date", "action"])
Out[62]:
count
date 2017 2018 2019
action close open close open other close open
department
marketing 2.0 NaN 1.0 1.0 1.0 1.0 NaN
sales NaN 1.0 NaN NaN NaN 1.0 1.0
Because NaN is internally represented as floating point, your counts were also converted to floating point. To fix that, just append fillna() and convert back to int.
In [65]: df2.reset_index().pivot_table(index="department", columns=["date", "action"]).fillna(0).astype(int)
Out[65]:
count
date 2017 2018 2019
action close open close open other close open
department
marketing 2 0 1 1 1 1 0
sales 0 1 0 0 0 1 1
To get exactly your output you would need to modify pd.to_datetime(df.date).dt.year. You can do this with strftime (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.strftime.html). Furthermore, the column ["2017", "other"] was dropped because there was no value. If this creates problems, you need to include the values beforehand. After the first step, a reindex and a fillna should do the trick.
EDIT: Yes, it does.
In [77]: new_index = pd.MultiIndex.from_product([[2017, 2018, 2019], ["close", "open", "other"], ['marketing', 'sales']], names=['date', 'action', 'department'])
In [78]: df3 = df2.reindex(new_index).fillna(0).astype(int).reset_index()
Out[78]:
date action department count
0 2017 close marketing 2
1 2017 close sales 0
2 2017 open marketing 0
3 2017 open sales 1
4 2017 other marketing 0
5 2017 other sales 0
6 2018 close marketing 1
.. ... ... ... ...
11 2018 other sales 0
12 2019 close marketing 1
13 2019 close sales 1
14 2019 open marketing 0
15 2019 open sales 1
16 2019 other marketing 0
17 2019 other sales 0
In [79]: df3.pivot_table(index="department", columns=["date", "action"])
Out[79]:
count
date 2017 2018 2019
action close open other close open other close open other
department
marketing 2 0 0 1 1 1 1 0 0
sales 0 1 0 0 0 0 1 1 0
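If you want the exact range headers from the desired output (e.g. 01/01/2017-12/31/2017) instead of bare years, one option is to build the labels before grouping. A sketch, assuming calendar-year ranges:
# map each row's year to a '01/01/YYYY-12/31/YYYY' label (hypothetical label format)
date_range = pd.to_datetime(df.date).dt.year.map(lambda y: f"01/01/{y}-12/31/{y}")
df2 = df.groupby([date_range, "action", "department"]).count().squeeze().rename("count")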