I am running into an error with a grouped by date dataframe:
byDate = df.groupby('Date').count()
Date Value
2019-08-15 2
2019-08-19 1
2019-08-23 7
2019-08-28 4
2019-09-04 7
2019-09-09 2
I know that type(df["Date"].iloc[0])
returns datetime.date
I want to plot the data so that days for which no value is available are shown as 0.
I have played around with
ax = sns.lineplot(x=byDate.index.fillna(0), y="Value", data=byDate)
I am however only able to get this output, where the y-axis indicates that a line is not drawn to 0 for days for which no value is available.
Have you tried creating a new DataFrame indexed by every date from startDate to endDate and then filling in the missing values with 0?
The code would look something like this:
dates = pd.to_datetime(['2019-08-15','2019-08-19','2019-08-23','2019-08-28','2019-09-04','2019-09-09']).date
byDate = pd.DataFrame({'Value':[2,1,7,4,7,2]},index=dates)
startDate = byDate.index.min()
endDate = byDate.index.max()
newDates = pd.date_range(startDate, endDate, freq='D').date.tolist()
newDatesDf = pd.DataFrame(index=newDates)
newByDate = pd.concat([newDatesDf, byDate], axis=1).fillna(0)
sns.lineplot(x=newByDate.index, y="Value", data=newByDate)
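A shorter route to the same result is to reindex the grouped frame against a complete daily range. The sketch below assumes the index is a DatetimeIndex rather than datetime.date objects (a change from the question's data), and the names by_date, full_range and filled are only illustrative:

```python
import pandas as pd

# Sample data from the question, indexed by a DatetimeIndex (an assumption).
dates = pd.to_datetime(["2019-08-15", "2019-08-19", "2019-08-23",
                        "2019-08-28", "2019-09-04", "2019-09-09"])
by_date = pd.DataFrame({"Value": [2, 1, 7, 4, 7, 2]}, index=dates)

# One entry per calendar day between the first and last observed date.
full_range = pd.date_range(by_date.index.min(), by_date.index.max(), freq="D")

# Days absent from the original index get Value 0.
filled = by_date.reindex(full_range, fill_value=0)
```

The filled frame can then be passed to sns.lineplot in the same way as newByDate above.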
I have the data frame below with dates ranging from 2016-01-01 to 2021-03-27
timestamp close circulating_supply issuance_native
0 2016-01-01 0.944695 7.389026e+07 26070.31250
1 2016-01-02 0.931646 7.391764e+07 27383.90625
2 2016-01-03 0.962863 7.394532e+07 27675.78125
3 2016-01-04 0.944515 7.397274e+07 27420.62500
4 2016-01-05 0.950312 7.400058e+07 27839.21875
I'm looking to filter this dataframe by Month & Day to look at the circulating supply on December 31st for each year.
Here are the dtypes of the data frame:
timestamp datetime64[ns]
close float64
circulating_supply float64
issuance_native float64
dtype: object
I'm able to pull single rows using this:
ts = pd.to_datetime('2016-12-31')
df.loc[df['timestamp'] == ts]
but no luck passing in a list of datetimes inside df.loc[]
The result should look like this, showing the rows for December 31st of each year:
timestamp close circulating_supply issuance_native
0 2016-12-31 0.944695 7.389026e+07 26070.31250
1 2017-12-31 0.931646 7.391764e+07 27383.90625
2 2018-12-31 0.962863 7.394532e+07 27675.78125
3 2019-12-31 0.944515 7.397274e+07 27420.62500
4 2020-12-31 0.950312 7.400058e+07 27839.21875
This is the closest I've gotten, but I get this error:
#query dataframe for the circulating supply at the end of the year
circulating_supply = df.query("timestamp == '2016-12-31' or timestamp =='2017-12-31' or timestamp =='2018-12-31' or timestamp =='2019-12-31' or timestamp =='2020-12-31' or timestamp =='2021-03-01'")
circulating_supply.drop(columns=['close', 'issuance_native'], inplace=True)
circulating_supply.copy()
circulating_supply.head()
/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/frame.py:4308: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
return super().drop(
Try something like this:
end_of_year = [
    pd.to_datetime(ts)
    for ts in [
        "2016-12-31",
        "2017-12-31",
        "2018-12-31",
        "2019-12-31",
        "2020-12-31",
        "2021-03-01",
    ]
]
end_of_year_df = df.loc[df["timestamp"].isin(end_of_year), :]
circulating_supply = end_of_year_df.drop(columns=["close", "issuance_native"])
circulating_supply.head()
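Since the question asks to filter by month and day, another option is to build a boolean mask from the .dt accessor instead of listing every date by hand. This is only a sketch: the circulating_supply values below are placeholders, and the names is_dec_31 and end_of_year_rows are made up for illustration.

```python
import pandas as pd

# Placeholder frame covering the same date span as the question.
timestamps = pd.date_range("2016-01-01", "2021-03-27", freq="D")
df = pd.DataFrame({
    "timestamp": timestamps,
    "circulating_supply": range(len(timestamps)),  # dummy values
})

# True only for rows that fall on December 31st, whatever the year.
is_dec_31 = (df["timestamp"].dt.month == 12) & (df["timestamp"].dt.day == 31)

# .copy() makes the result an independent frame, so later in-place changes
# to it should not raise SettingWithCopyWarning.
end_of_year_rows = df.loc[is_dec_31, ["timestamp", "circulating_supply"]].copy()
```

Unlike the explicit list, this does not pick up the extra 2021-03-01 row; that date would have to be added to the mask separately.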
I was able to solve this by ignoring the error I got when using the .drop() function on my df.query result
#query dataframe for the circulating supply at the end of the year
circulating_supply = df.query("timestamp == '2016-12-31' or timestamp =='2017-12-31' or timestamp =='2018-12-31' or timestamp =='2019-12-31' or timestamp =='2020-12-31' or timestamp =='2021-03-01'")
circulating_supply.drop(columns=['close', 'issuance_native'], inplace=True)
circulating_supply.copy() #not sure if this did anything
circulating_supply.head()
#add the column
yearly_issuance['EOY Supply'] = circulating_supply['circulating_supply'].values
yearly_issuance.head()
I have a hypothetical time series data frame with some missing observations (the assumption is that the data frame should contain every date of the year with a corresponding value). As the head and tail below show, some dates and their values are missing (30th Jan and 29th Dec). There are many more such gaps in the data frame, sometimes spanning more than one consecutive date.
Is there a way that missing dates are detected and inserted into the data frame and corresponding values are filled with a rolling average with one week window (this would naturally increase the number of rows of the data frame)? Appreciate inputs.
df.head(3)
date value
0 2020-01-28 25
1 2020-01-29 32
2 2020-01-31 45
df.tail(3)
date value
3 2020-12-28 24
4 2020-12-30 35
5 2020-12-31 37
df.dtypes
date object
value int64
dtype: object
Create a DatetimeIndex, then use DataFrame.asfreq with rolling and mean:
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date').asfreq('D').rolling('7D').mean()
If you need every date of the year, use:
df['date'] = pd.to_datetime(df['date'])
idx = pd.date_range('2020-01-01','2020-12-31')
df = df.set_index('date').reindex(idx).rolling('7D').mean()
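Note that both snippets above replace every value with its rolling mean. If the intent is to keep the observed values and fill only the inserted dates, one possible sketch is shown below. It assumes a trailing 7-day window (the current day plus the six days before it), and the names daily and rolling_mean are illustrative:

```python
import pandas as pd

# First rows from the question; 2020-01-30 is missing.
df = pd.DataFrame({
    "date": ["2020-01-28", "2020-01-29", "2020-01-31"],
    "value": [25, 32, 45],
})
df["date"] = pd.to_datetime(df["date"])

# Insert the missing calendar days as NaN rows.
daily = df.set_index("date").asfreq("D")

# Trailing 7-day mean; NaN rows are skipped when averaging.
rolling_mean = daily["value"].rolling("7D", min_periods=1).mean()

# Keep observed values, fill only the gaps with the rolling mean.
daily["value"] = daily["value"].fillna(rolling_mean)
```

Here the inserted 2020-01-30 row receives the mean of the two preceding observations, while the original three values are left as they were.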
Edit: Title changed to reflect map not being more efficient than a for loop.
Original title: Replacing a for loop with map when comparing dates
I have a list of sequential dates, date_list, and a data frame df which, for present purposes, contains one column named Event Date holding the date on which an event occurred:
Index Event Date
0 02-01-20
1 03-01-20
2 03-01-20
I want to know how many events have happened by a given date in the format:
Date Events
01-01-20 0
02-01-20 1
03-01-20 3
My current method for doing so is as follows:
pre_df_list = []
for date in date_list:
    # True for events that happened on or before this date
    event_rows = df.apply(lambda x: True if x['Event Date'] <= date else False, axis=1)
    event_count = len(event_rows[event_rows == True].index)
    temp = [date, event_count]
    pre_df_list.append(temp)
Where the list pre_df_list is later converted to a dataframe.
This method is slow and seems inelegant but I am struggling to find a method that works.
I think it should be something along the lines of:
map(lambda x,y: True if x > y else False, df['Event Date'],date_list)
but that would compare each item in the list in pairs which is not what I'm looking for.
I appreciate it might be odd asking for help when I have working code, but I'm trying to cut down my reliance on loops, as they are somewhat of a crutch for me at the moment. Also, I have multiple different events to track in the full data, and looping through ~1000 dates for each one will be unsatisfyingly slow.
Use groupby() and size() to get counts per date and cumsum() to get a cumulative sum, i.e. include all the dates before a particular row.
from datetime import date, timedelta
import random
import pandas as pd
# example data
dates = [date(2020, 1, 1) + timedelta(days=random.randrange(1, 100, 1)) for _ in range(1000)]
df = pd.DataFrame({'Event Date': dates})
# count events <= t
event_counts = df.groupby('Event Date').size().cumsum().reset_index()
event_counts.columns = ['Date', 'Events']
event_counts
Date Events
0 2020-01-02 13
1 2020-01-03 23
2 2020-01-04 34
3 2020-01-05 42
4 2020-01-06 51
.. ... ...
94 2020-04-05 972
95 2020-04-06 981
96 2020-04-07 989
97 2020-04-08 995
98 2020-04-09 1000
Then, if there are dates in your date_list that don't exist in your dataframe, convert date_list into a dataframe and merge it with the previous results. ffill() fills gaps in the middle of the data, while the final fillna(0) covers gaps at the start of the column.
date_list = [date(2020, 1, 1) + timedelta(days=x) for x in range(150)]
date_df = pd.DataFrame({'Date': date_list})
merged_df = pd.merge(date_df, event_counts, how='left', on='Date')
merged_df.columns = ['Date', 'Events']
merged_df = merged_df.ffill().fillna(0)
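The same counting idea can be written without a merge by reindexing the per-date counts against date_list before taking the cumulative sum. The sketch below uses the three events from the question, assuming the dates are in day-month-year order and stored as datetime64 values; the names events and result are illustrative:

```python
import pandas as pd

# Events from the question: one on 2 Jan 2020, two on 3 Jan 2020.
df = pd.DataFrame({
    "Event Date": pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-03"])
})
date_list = pd.date_range("2020-01-01", "2020-01-03", freq="D")

events = (
    df["Event Date"].value_counts()      # events per date
    .reindex(date_list, fill_value=0)    # add dates with no events, in order
    .cumsum()                            # running total up to each date
)
result = events.rename_axis("Date").reset_index(name="Events")
```

For this sample, result holds the cumulative counts 0, 1 and 3, which matches the table requested in the question.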
Unless I am mistaken about your objective, it seems to me that you can simply use pandas DataFrames' ability to compare against a single value and slice the dataframe like so:
>>> df = pd.DataFrame({'event_date': [date(2020,9, 1), date(2020, 9, 2), date(2020, 9, 3)]})
>>> df
event_date
0 2020-09-01
1 2020-09-02
2 2020-09-03
>>> df[df.event_date > date(2020, 9, 1)]
event_date
1 2020-09-02
2 2020-09-03
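To turn that comparison into a count for a single date, the boolean mask can be summed directly. A minimal sketch, assuming the column holds datetime64 values and using a hypothetical cutoff variable:

```python
import pandas as pd

df = pd.DataFrame({
    "event_date": pd.to_datetime(["2020-09-01", "2020-09-02", "2020-09-03"])
})
cutoff = pd.Timestamp("2020-09-02")

# Each True in the mask counts as 1 when summed.
events_by_cutoff = int((df["event_date"] <= cutoff).sum())
```

This still needs one comparison per date, so for a long date_list a cumulative sum over grouped counts is likely to scale better.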
I currently have a grouped dataframe of dates and values that I am creating a bar chart of:
date | value
--------|--------
7-9-19 | 250
7-14-19 | 400
7-20-19 | 500
7-20-19 | 300
7-21-19 | 200
7-30-19 | 142
When I plot the df, I get back a bar chart showing only the days that have a value. Is there a way for me to easily plot a bar chart with all the days of the month without inserting dates and 0 values for all the missing days in the dataframe?
**Edit: I left out that certain dates may have more than one entry, so adding the missing dates by re-indexing throws a duplicate axis error.
*** Solution - I ended up using just the day of the month to simplify having to deal with the datetime objs. ie, 7-9-19 => 9 . After a helpful suggestion by Quang Hoang below, I realized I could do this a little bit easier using just the day #:
ind = range(1,32)
df = df.reindex(ind, fill_value=0)
You could use reindex, remember to set date as index:
# convert to datetime
# skip if already is
df['date'] = pd.to_datetime(df['date'], format='%m-%d-%y')
(df.set_index('date')
.reindex(pd.date_range('2019-07-01','2019-07-31', freq='D'))
.plot.bar()
)
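Because the question's edit mentions duplicate dates, reindexing directly will fail with a duplicate axis error. One way around it is to aggregate per date first. The sketch below assumes that summing is the right way to combine entries for the same day (the question does not say), and daily_totals and full_month are illustrative names:

```python
import pandas as pd

# Data from the question, including the duplicated 7-20-19 entry.
df = pd.DataFrame({
    "date": ["7-9-19", "7-14-19", "7-20-19", "7-20-19", "7-21-19", "7-30-19"],
    "value": [250, 400, 500, 300, 200, 142],
})
df["date"] = pd.to_datetime(df["date"], format="%m-%d-%y")

# Collapse duplicate dates so the index is unique (sum is an assumption).
daily_totals = df.groupby("date")["value"].sum()

# Every day in July 2019; days with no entries become 0.
july = pd.date_range("2019-07-01", "2019-07-31", freq="D")
full_month = daily_totals.reindex(july, fill_value=0)
```

Calling full_month.plot.bar() should then draw one bar per day of the month.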
We have some readily available sales data for certain periods, like 1 week, 1 month ... 1 year:
time_pillars = pd.Series(['1W', '1M', '3M', '1Y'])
sales = pd.Series([4.75, 5.00, 5.10, 5.75])
data = {'time_pillar': time_pillars, 'sales': sales}
df = pd.DataFrame(data)
I would like to do two operations.
Firstly, create a new column of date type, df['date'], that corresponds to the actual date of 1week, 1month..1year from now.
Then, I'd like to create another column df['days_from_now'], taking how many days are on these pillars (1week would be 7days, 1month would be around 30days..1year around 365days).
The goal is then to use any day as input to a simple linear_interpolation_method() to obtain sales data for that day (e.g., what are sales for 4 October 2018? ---> We would interpolate between 3 months and 1 year).
Many thanks.
I'm not exactly sure what you mean regarding your interpolation, but here is a way to make your dataframe in pandas (starting from your original df you provided in your post):
from datetime import datetime
from dateutil.relativedelta import relativedelta
def create_dates(df):
    df['date'] = [i.date() for i in
                  [d+delt for d,delt in zip([datetime.now()] * 4 ,
                  [relativedelta(weeks=1), relativedelta(months=1),
                   relativedelta(months=3), relativedelta(years=1)])]]
    df['days_from_now'] = df['date'] - datetime.now().date()
    return df

create_dates(df)
sales time_pillar date days_from_now
0 4.75 1W 2018-04-11 7 days
1 5.00 1M 2018-05-04 30 days
2 5.10 3M 2018-07-04 91 days
3 5.75 1Y 2019-04-04 365 days
I wrapped it in a function so that you can call it on any given day and get your results for 1 week, 1 month, etc. from that exact day.
Note: if you want your days_from_now to simply be an integer of the number of days, use df['days_from_now'] = [i.days for i in df['date'] - datetime.now().date()] in the function, instead of df['days_from_now'] = df['date'] - datetime.now().date()
Explanation:
df['date'] = [i.date() for i in
[d+delt for d,delt in zip([datetime.now()] * 4 ,
[relativedelta(weeks=1), relativedelta(months=1),
relativedelta(months=3), relativedelta(years=1)])]]
Takes a list of the date today (datetime.now()) repeated 4 times, and adds a relativedelta (a time difference) of 1 week, 1 month, 3 months, and 1 year, respectively, extracts the date (i.date() for ...), finally creating a new column using the resulting list.
df['days_from_now'] = df['date'] - datetime.now().date()
is much more straightforward: it simply subtracts today's date from the new dates computed above. The result is a timedelta object, which pandas conveniently formats as "n days".
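For the interpolation step mentioned in the question, one possible sketch uses numpy.interp over the integer day offsets. It fixes the reference date to 2018-04-04 so the numbers match the table above, and the names today, offsets and interpolate_sales are hypothetical, not part of the original post:

```python
from datetime import date

import numpy as np
import pandas as pd
from dateutil.relativedelta import relativedelta

today = date(2018, 4, 4)  # fixed reference date (assumption, for reproducibility)

# Map each pillar label to its offset from today.
offsets = {
    "1W": relativedelta(weeks=1),
    "1M": relativedelta(months=1),
    "3M": relativedelta(months=3),
    "1Y": relativedelta(years=1),
}

df = pd.DataFrame({
    "time_pillar": ["1W", "1M", "3M", "1Y"],
    "sales": [4.75, 5.00, 5.10, 5.75],
})
df["date"] = [today + offsets[p] for p in df["time_pillar"]]
df["days_from_now"] = [(d - today).days for d in df["date"]]


def interpolate_sales(target, frame=df, start=today):
    """Linearly interpolate sales for target between the two nearest pillars."""
    days = (target - start).days
    return float(np.interp(days, frame["days_from_now"], frame["sales"]))


october_sales = interpolate_sales(date(2018, 10, 4))
```

Note that numpy.interp clamps to the first or last pillar value for dates outside the 1W to 1Y range, which may or may not be the behaviour you want.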