Include 0-value days on bar chart - Python

I currently have a grouped dataframe of dates and values that I am creating a bar chart of:
date | value
--------|--------
7-9-19 | 250
7-14-19 | 400
7-20-19 | 500
7-20-19 | 300
7-21-19 | 200
7-30-19 | 142
When I plot the df, I get back a bar chart only showing the days that have a value. Is there a way for me to easily plot a bar chart with all the days of the month, without inserting dates and 0 values for all the missing days in the dataframe?
**Edit: I left out that certain dates may have more than one entry, so adding the missing dates by re-indexing throws a duplicate axis error.
*** Solution - I ended up using just the day of the month to avoid dealing with the datetime objects, i.e. 7-9-19 => 9. After a helpful suggestion by Quang Hoang below, I realized I could do this a little more easily using just the day number:
ind = range(1,32)
df = df.reindex(ind, fill_value=0)
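For reference, a fuller version of that day-of-month approach might look like the sketch below; summing the duplicate dates mentioned in the edit is an assumption on my part, not something stated in the original snippet:
import pandas as pd

# assumes df has the 'date' / 'value' columns shown above
df['date'] = pd.to_datetime(df['date'], format='%m-%d-%y')
daily = df.groupby(df['date'].dt.day)['value'].sum()   # collapse duplicate days first (avoids the duplicate axis error)
daily = daily.reindex(range(1, 32), fill_value=0)      # one entry per day of the month, 0 where missing
daily.plot.bar()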

You could use reindex; remember to set date as the index:
# convert to datetime (skip if it already is)
df['date'] = pd.to_datetime(df['date'], format='%m-%d-%y')
(df.set_index('date')
   .reindex(pd.date_range('2019-07-01', '2019-07-31', freq='D'))
   .plot.bar()
)
Output:

Related

Plotting datetime index

I am running into an error with a grouped by date dataframe:
byDate = df.groupby('Date').count()
Date Value
2019-08-15 2
2019-08-19 1
2019-08-23 7
2019-08-28 4
2019-09-04 7
2019-09-09 2
I know that type(df["Date"].iloc[0]) returns datetime.date.
I want to plot the data in such a way that days for which no value is available are shown as 0.
I have played around with
ax = sns.lineplot(x=byDate.index.fillna(0), y="Value", data=byDate)
However, I am only able to get this output, where the line is not drawn down to 0 for days for which no value is available.
Have you ever tried creating a new DataFrame object indexed by all the dates ranging from startDate to endDate and then filling in the missing values with 0.0?
The code would look something like this:
import pandas as pd
import seaborn as sns

dates = pd.to_datetime(['2019-08-15','2019-08-19','2019-08-23','2019-08-28','2019-09-04','2019-09-09']).date
byDate = pd.DataFrame({'Value': [2, 1, 7, 4, 7, 2]}, index=dates)
startDate = byDate.index.min()
endDate = byDate.index.max()
# one entry per calendar day between the first and last date
newDates = pd.date_range(startDate, endDate, freq='D').date.tolist()
newDatesDf = pd.DataFrame(index=newDates)
# align on the full daily index; missing days become NaN, then 0
newByDate = pd.concat([newDatesDf, byDate], axis=1).fillna(0)
sns.lineplot(x=newByDate.index, y="Value", data=newByDate)
output
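As a side note, the same gap-filling can probably be done in one step with asfreq, assuming the index is first converted to a proper DatetimeIndex (the snippet above uses plain datetime.date objects):
byDate.index = pd.to_datetime(byDate.index)
filled = byDate.asfreq('D', fill_value=0)   # insert one row per missing day, valued 0
sns.lineplot(x=filled.index, y="Value", data=filled)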

Linear Regression in Pandas Groupby with freq='W-MON'

I have data over the timespan of over a year. I am interested in grouping the data by week, and getting the slope of two variables by week. Here is what the data looks like:
Date                | Total_Sales | Products
2015-12-30 07:42:50 | 2900        | 24
2015-12-30 09:10:10 | 3400        | 20
2016-02-07 07:07:07 | 5400        | 25
2016-02-07 07:08:08 | 1000        | 64
So ideally I would like to perform a linear regression on total_sales and products for each week of this data and record the slope. This works when each week is represented in the data, but I have problems when some weeks are skipped. I know I could do this by turning the date into a week number, but I feel like the result will be skewed because there is over a year's worth of data.
Here is the code I have so far:
df['Date']=pd.to_datetime(vals['EventDate']) - pd.to_timedelta(7,unit='d')
df.groupby(pd.Grouper(key='Week', freq='W-MON')).apply(lambda v: linregress(v.Total_Sales, v.Products)[0]).reset_index()
However, I get the following error:
ValueError: Inputs must not be empty.
I expect the output to look like this:
Date | Slope
2015-12-28 | -0.008
2016-02-01 | -0.008
I assume this is happening because Python is unable to group properly and also cannot recognise the datetime as a key, since the Date column has varying timestamps too.
Try the following code. It worked for me:
import pandas as pd
from datetime import timedelta
from scipy import stats

df['Date'] = pd.to_datetime(df['Date'])  # convert the Date column to pandas datetime
# day of the week as an integer, where Monday is 0 and Sunday is 6
df['daysoffset'] = df['Date'].apply(lambda x: x.weekday())
# x['Date'].date() drops the timestamp; subtracting the offset gives the date of
# that week's Monday, which is stored in 'week_start'
df['week_start'] = df.apply(lambda x: x['Date'].date() - timedelta(days=x['daysoffset']), axis=1)
df.groupby('week_start').apply(lambda v: stats.linregress(v.Total_Sales, v.Products)[0]).reset_index()
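If you would rather keep the pd.Grouper / freq='W-MON' approach from the question, the "Inputs must not be empty" error can also be avoided by skipping the empty weeks that the grouper creates for gaps in the data. A rough sketch (the weekly_slope helper and the len < 2 guard are assumptions, not from the original answer):
import pandas as pd
from scipy.stats import linregress

def weekly_slope(g):
    # weeks with no rows (or a single row) cannot be regressed; return None so they can be dropped
    if len(g) < 2:
        return None
    return linregress(g['Total_Sales'], g['Products']).slope

df['Date'] = pd.to_datetime(df['Date'])
slopes = (df.groupby(pd.Grouper(key='Date', freq='W-MON'))
            .apply(weekly_slope)
            .dropna()
            .reset_index(name='Slope'))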

Pandas group values and get mean by date range

I have a DataFrame like this
df = pd.DataFrame( data = numpy_data, columns=['value','date'])
value date
0 64.885 2018-01-11
1 74.839 2018-01-15
2 41.481 2018-01-17
3 22.027 2018-01-17
4 53.747 2018-01-18
... ... ...
514 61.017 2018-12-22
515 68.376 2018-12-21
516 79.079 2018-12-26
517 73.975 2018-12-26
518 76.923 2018-12-26
519 rows × 2 columns
And I want to plot this value vs date and I am using this
df.plot( x='date',y='value')
And I get this
The point here is that this plot has too many fluctuations, and I want to smooth it. My idea is to group the values by date intervals and take the mean, for example every 10 days: the mean between July 1 and July 10 would become the point at July 5.
The long way would be to get the date range, split it into N ranges with start and end dates, filter the data by date, calculate the mean, and put the results in another DataFrame.
Is there a short way to do that?
PS: Ignore the peaks.
One thing you could do for instance is to take the rolling mean of the dataframe, using DataFrame.rolling along with mean:
df = df.set_index(df.date).drop('date', axis=1)
df.rolling(3).mean().plot()
For the example dataframe you have, directly plotting the dataframe would result in:
And having taken the rolling mean, you would have:
Here I chose a window of 3, but this will depend on how smooth you want it to be.
Based on yatu's answer
The problem with that answer is that the rolling function treats the window by position in the index, not by date. With a few transformations, rolling can read the Timestamp index and use time as the window [pandas.rolling]:
df = pd.DataFrame(data=numpy_data, columns=['value', 'date'])
df['date'] = df.apply(lambda row: pd.Timestamp(row.date), axis=1)  # make the column a Timestamp
df = df.set_index(df.date).drop('date', axis=1)
df.sort_index(inplace=True)  # a time-based window needs a sorted DatetimeIndex
df.rolling('10d').mean().plot(ylim=(30, 100), figsize=(16, 5), grid=True)
Final results
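For completeness, the fixed-interval mean originally described in the question (one averaged point per 10-day bucket, rather than a sliding window) could be done with resample; a minimal sketch, assuming the datetime-indexed df from the snippet above:
# one point per 10-day bin, labelled by the bin start date
binned = df.resample('10D')['value'].mean()
binned.plot(figsize=(16, 5), grid=True)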

Slice, combine, and map fiscal year dates to calendar year dates to new column

I have the following pandas data frame:
Shortcut_Dimension_4_Code Stage_Code
10225003 2
8225003 1
8225004 3
8225005 4
It is part of a much larger dataset that I need to be able to filter by month and year. I need to pull the fiscal year from the first two digits of Shortcut_Dimension_4_Code for values larger than 9999999, and from the first digit for values less than or equal to 9999999. That value needs to be prefixed with "20" to produce a year, i.e. "20" + "8" => 2008 and "20" + "10" => 2010.
That year "2008, 2010" needs to be combined with the stage code value (1-12) to produce a month/year, i.e. 02/2010.
The date 02/2010 then needs to be converted from a fiscal year date to a calendar year date, i.e. fiscal year date 02/2010 = calendar year date 08/2009. The resulting date needs to be presented in a new column. The resulting df would end up looking like this:
Shortcut_Dimension_4_Code Stage_Code Date
10225003 2 08/2009
8225003 1 07/2007
8225004 3 09/2007
8225005 4 10/2007
I am new to pandas and python and could use some help. I am beginning with this:
Shortcut_Dimension_4_Code Stage_Code CY_Month Fiscal_Year
0 10225003 2 8.0 10
1 8225003 1 7.0 82
2 8225003 1 7.0 82
3 8225003 1 7.0 82
4 8225003 1 7.0 82
I used .map and .str methods to produce this df, but have not been able to figure out how to get the fiscal years right for FY 2008-2009.
In the code below, I'll assume Shortcut_Dimension_4_Code is an integer. If it's a string you can convert it or slice it like this: df['Shortcut_Dimension_4_Code'].str[:-6]. More explanation is in the comments alongside the code.
That should work as long as you don't have to deal with empty values.
import pandas as pd
import numpy as np
from datetime import date
from dateutil.relativedelta import relativedelta
fiscal_month_offset = 6
input_df = pd.DataFrame(
    [[10225003, 2],
     [8225003, 1],
     [8225004, 3],
     [8225005, 4]],
    columns=['Shortcut_Dimension_4_Code', 'Stage_Code'])
# make a copy of input dataframe to avoid modifying it
df = input_df.copy()
# numpy will help us with numeric operations on large collections
df['fiscal_year'] = 2000 + np.floor_divide(df['Shortcut_Dimension_4_Code'], 1000000)
# loop with `apply` to create `date` objects from available columns
# day is a required field in date, so we'll just use 1
df['fiscal_date'] = df.apply(lambda row: date(row['fiscal_year'], row['Stage_Code'], 1), axis=1)
df['calendar_date'] = df['fiscal_date'] - relativedelta(months=fiscal_month_offset)
# by default python dates will be saved as Object type in pandas. You can verify with `df.info()`
# to use clever things pandas can do with dates we need to convert it
df['calendar_date'] = pd.to_datetime(df['calendar_date'])
# I would just keep date as datetime type so I could access year and month
# but to create same representation as in question, let's format it as string
df['Date'] = df['calendar_date'].dt.strftime('%m/%Y')
# copy important columns into output dataframe
output_df = df[['Shortcut_Dimension_4_Code', 'Stage_Code', 'Date']].copy()
print(output_df)
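As a follow-up, the same result can probably be obtained without the per-row apply by assembling the dates column-wise. A hedged sketch under the same integer-column assumption (fiscal_dates is a name introduced here, not part of the answer above):
# fiscal year from the leading digits, exactly as above but vectorized
df['fiscal_year'] = 2000 + df['Shortcut_Dimension_4_Code'] // 1_000_000
# assemble a datetime from year/month columns (day fixed to 1), then shift back 6 months
fiscal_dates = pd.to_datetime(dict(year=df['fiscal_year'], month=df['Stage_Code'], day=1))
df['Date'] = (fiscal_dates - pd.DateOffset(months=6)).dt.strftime('%m/%Y')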

pandas: How to format timestamp axis labels nicely in df.plot()?

I have a dataset that looks like this:
prod_code month items cost
0 040201060AAAIAI 2016-05-01 5 572.20
1 040201060AAAKAK 2016-05-01 164 14805.19
2 040201060AAALAL 2016-05-01 13465 14486.07
Doing df.dtypes shows that the month column is a datetime64[ns] type.
I am now trying to plot the cost per month for a particular product:
df[df.bnf_code=='040201060AAAIAI'][['month', 'cost']].plot()
plt.show()
This works, but the x-axis isn't a timestamp as I'd expect:
How can I format the x-axis labels nicely, with month and year labels?
Update: I also tried this, to get a bar chart, which does output timestamps on the x-axis, but in a very long unwieldy format:
df[df.bnf_code=='040201060AAAIAI'].plot.bar(x='month', y='cost', title='Spending on 040201060AAAIAI')
If you set the dates as index, the x-axis should be labelled properly:
df[df.bnf_code=='040201060AAAIAI'][['month', 'cost']].set_index('month').plot()
I have simply added set_index to your code.
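If you want explicit month-and-year labels rather than whatever pandas picks by default, one option is matplotlib's date tickers. A sketch, where the 3-month tick interval is an arbitrary choice and x_compat=True makes pandas use standard matplotlib date units:
import matplotlib.dates as mdates
import matplotlib.pyplot as plt

ax = df[df.bnf_code == '040201060AAAIAI'].set_index('month')['cost'].plot(x_compat=True)
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=3))   # one tick every three months
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b %Y'))   # e.g. "May 2016"
plt.show()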
