I have the following dataframe:
Date Time Quantity
20171003 5:00 2
20171003 5:15 5
....
20171005 5:00 1
20171005 5:15 9
I need to create a new column containing the quantity of the same day of the previous week, that is:
Date Time Quantity Quantity-1
20171003 5:00 2 NaN
20171003 5:15 5 NaN
....
20171005 5:00 1 2
20171005 5:15 9 5
I figured out how to get the same day of the last week by using for example:
last_week = today() + relativedelta(weeks=-1, weekday=now.weekday())
How to apply this to my dataframe?
Thank you in advance!
Does your index have a pattern? If yes, you could use shift(). The periods parameter would be the number of rows per week in your df. For example, assuming your Time column is always either 5:00 or 5:15, and that you have every calendar day, your period would be 7 * 2 = 14
df['Quantity-1'] = df['Quantity'].shift(14)
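As a runnable sketch on a cut-down version of the question's frame (only two calendar days present, so the period here is 2 rather than 14; the principle is identical):

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['20171003', '20171003', '20171005', '20171005'],
    'Time': ['5:00', '5:15', '5:00', '5:15'],
    'Quantity': [2, 5, 1, 9],
})
# two readings per day and only two days in this toy frame,
# so the earlier day's reading sits 2 rows back; with every
# calendar day present, one week back would be 7 * 2 = 14 rows
df['Quantity-1'] = df['Quantity'].shift(2)
print(df)
```

Note this only lines up if every (Date, Time) slot is present; any missing row silently misaligns the shift.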
If the data is collected at exactly the same times every day, using shift as @EricB mentioned should be perfect.
Alternatively, you can create a new dataframe whose dates are shifted forward by 14 days and then merge it back to the original dataframe on the date and time columns (note this assumes you want the quantity at the same time of day, 14 days earlier).
import datetime
import pandas as pd

df = pd.DataFrame([
    ['20171003', '5:00', 2],
    ['20171003', '5:15', 5],
    ['20171005', '5:00', 1],
    ['20171005', '5:15', 9],
    ['20171019', '5:00', 8]],
    columns=['date', 'time', 'quantity'])
df.loc[:, 'date'] = pd.to_datetime(df.date)
df2 = df[['date', 'time', 'quantity']].copy()
df2.loc[:, 'date'] = df2.date + datetime.timedelta(weeks=2)  # shift forward by 2 weeks
df_shift = df.merge(df2, on=['time', 'date'], how='left')
Output of df_shift
+-----------+----+----------+----------+
| date|time|quantity_x|quantity_y|
+-----------+----+----------+----------+
|2017-10-03 |5:00| 2| |
|2017-10-03 |5:15| 5| |
|2017-10-05 |5:00| 1| |
|2017-10-05 |5:15| 9| |
|2017-10-19 |5:00| 8| 1|
+-----------+----+----------+----------+
Adding to @titipata's solution, there is another way to do it without having to merge.
The approach in a nutshell:
First: get the datetime 1 day/week/month after the first value.
Second: from that datetime onwards, take the value from 1 day/week/month before.
So for example, if your dataset starts at 1/10/2021 00:00:00 (that's the 1st of October for you Americans), you will have these values:
1 day after: 2/10/2021
1 week after: 8/10/2021
1 month after: 1/11/2021
Then, starting from 2/10/2021, you fill in the previous day's values, and likewise for the week and month lags.
Hope someone finds this helpful
from pandas import DateOffset

def add_past_values(df):
    df = df.set_index('datetime')
    firstvalue = df.index[0]
    # 1. get the datetime 1 day/week/month after the first value
    secondday = firstvalue + DateOffset(days=1)
    secondweek = firstvalue + DateOffset(weeks=1)
    secondmonth = firstvalue + DateOffset(months=1)
    # 2. from that datetime onwards, take the value 1 day/week/month before
    df.loc[secondday:, 'lag_day_1'] = df.loc[df.loc[secondday:].index - DateOffset(days=1), 'myvalue'].values
    df.loc[secondweek:, 'lag_week_1'] = df.loc[df.loc[secondweek:].index - DateOffset(weeks=1), 'myvalue'].values
    df.loc[secondmonth:, 'lag_month_1'] = df.loc[df.loc[secondmonth:].index - DateOffset(months=1), 'myvalue'].values
    df = df.reset_index()
    return df
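For reference, here is a stripped-down, runnable sketch of just the day-lag line on made-up hourly data (myvalue is simply the row number so the lag is easy to check):

```python
import pandas as pd
from pandas import DateOffset

# five days of hourly toy data; myvalue is just the row number
idx = pd.date_range('2021-10-01', periods=24 * 5, freq='h')
df = pd.DataFrame({'myvalue': range(len(idx))}, index=idx)

second_day = idx[0] + DateOffset(days=1)
# for every timestamp from second_day onwards, look up the value 24 hours earlier
df.loc[second_day:, 'lag_day_1'] = df.loc[df.loc[second_day:].index - DateOffset(days=1), 'myvalue'].values
```

This relies on every lagged timestamp actually existing in the index; with gaps in the data, a merge (as in the previous answer) is safer.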
I have a dataframe with three columns lets say
Name Address Date
faraz xyz 2022-01-01
Abdul abc 2022-06-06
Zara qrs 2021-02-25
I want to compare each date in the Date column with all the other dates in the Date column and keep only those rows whose date lies within 6 months of at least one of the other dates.
for example: (2022-01-01 - 2022-06-06) = 5 months so we keep both these dates
but,
(2022-06-06 - 2021-02-25) and (2022-01-01 - 2021-02-25) exceed the 6 month limit
so we will drop that row.
Desired Output:
Name Address Date
faraz xyz 2022-01-01
Abdul abc 2022-06-06
I have tried a couple of approaches, such as nested loops, but I have 1 million+ entries and it takes forever to run that loop. Some of the dates repeat too; not all are unique.
for index, row in dupes_df.iterrows():
    for date in uniq_dates_list:
        format_date = datetime.strptime(date, '%d/%m/%y')
        if ((format_date.year - row['JournalDate'].year) * 12 + (format_date.month - row['JournalDate'].month) <= 6):
            print("here here")
            break
    else:
        dupes_df.drop(index, inplace=True)
I need a much more optimal solution for it. I've studied lambda functions, but couldn't get to the depths of them.
IIUC, this should work for you:
import pandas as pd
import itertools
from io import StringIO
data = StringIO("""Name;Address;Date
faraz;xyz;2022-01-01
Abdul;abc;2022-06-06
Zara;qrs;2021-02-25
""")
df = pd.read_csv(data, sep=';', parse_dates=['Date'])
df_date = pd.DataFrame([sorted(pair, reverse=True) for pair in itertools.combinations(df['Date'], 2)], columns=['Date1', 'Date2'])
df_date['diff'] = (df_date['Date1'] - df_date['Date2']).dt.days
# keep rows whose date occurs in at least one pair at most ~6 months (180 days) apart
df[df.Date.isin(df_date.loc[df_date['diff'] <= 180, ['Date1', 'Date2']].values.ravel())]
Output:
Name Address Date
0 faraz xyz 2022-01-01
1 Abdul abc 2022-06-06
First, I think it'd be easier if you use relativedelta from dateutil.
Reference: https://pynative.com/python-difference-between-two-dates-in-months/
Second, I think you need to add a column; let's call it score.
In the inner loop, if the delta is <= 6 months, set score = 1 and continue.
This way each row is compared to all rows.
Finally, delete all rows that have score == 0.
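That sketch might look like this (a rough, still-quadratic illustration on the three-row frame from the question; months_between is a helper name I'm making up here):

```python
import pandas as pd
from dateutil.relativedelta import relativedelta

df = pd.DataFrame({
    'Name': ['faraz', 'Abdul', 'Zara'],
    'Address': ['xyz', 'abc', 'qrs'],
    'Date': pd.to_datetime(['2022-01-01', '2022-06-06', '2021-02-25']),
})

def months_between(a, b):
    # whole months between two dates, ignoring the day remainder
    r = relativedelta(max(a, b), min(a, b))
    return r.years * 12 + r.months

dates = list(df['Date'])
# score = 1 if the row is within 6 months of at least one other row
df['score'] = [
    int(any(months_between(dates[i], dates[j]) <= 6
            for j in range(len(dates)) if j != i))
    for i in range(len(dates))
]
result = df[df['score'] == 1].drop(columns='score')
```

For 1 million+ rows this is still O(n²) pairwise work; sorting the dates first and comparing only neighbours would cut it down considerably.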
Edit: Title changed to reflect map not being more efficient than a for loop.
Original title: Replacing a for loop with map when comparing dates
I have a list of sequential dates date_list and a dataframe df which, for present purposes, contains one column named Event Date holding the date that an event occurred:
Index Event Date
0 02-01-20
1 03-01-20
2 03-01-20
I want to know how many events have happened by a given date in the format:
Date Events
01-01-20 0
02-01-20 1
03-01-20 3
My current method for doing so is as follows:
pre_df_list = []
for date in date_list:
    event_rows = df.apply(lambda x: True if x['Event Date'] > date else False, axis=1)
    event_count = len(event_rows[event_rows == True].index)
    temp = [date, event_count]
    pre_df_list.append(temp)
Where the list pre_df_list is later converted to a dataframe.
This method is slow and seems inelegant but I am struggling to find a method that works.
I think it should be something along the lines of:
map(lambda x,y: True if x > y else False, df['Event Date'],date_list)
but that would compare each item in the list in pairs which is not what I'm looking for.
I appreciate it might be odd asking for help when I have working code, but I'm trying to cut down my reliance on loops as they are somewhat of a crutch for me at the moment. Also, I have multiple different events to track in the full data, and looping through ~1000 dates for each one will be unsatisfyingly slow.
Use groupby() and size() to get counts per date and cumsum() to get a cumulative sum, i.e. include all the dates before a particular row.
from datetime import date, timedelta
import random
import pandas as pd
# example data
dates = [date(2020, 1, 1) + timedelta(days=random.randrange(1, 100, 1)) for _ in range(1000)]
df = pd.DataFrame({'Event Date': dates})
# count events <= t
event_counts = df.groupby('Event Date').size().cumsum().reset_index()
event_counts.columns = ['Date', 'Events']
event_counts
Date Events
0 2020-01-02 13
1 2020-01-03 23
2 2020-01-04 34
3 2020-01-05 42
4 2020-01-06 51
.. ... ...
94 2020-04-05 972
95 2020-04-06 981
96 2020-04-07 989
97 2020-04-08 995
98 2020-04-09 1000
Then, if there are dates in your date_list that don't exist in your dataframe, convert date_list into a dataframe and merge the previous results in. The fillna(method='ffill') fills gaps in the middle of the data, while the final fillna(0) handles gaps at the start of the column.
date_list = [date(2020, 1, 1) + timedelta(days=x) for x in range(150)]
date_df = pd.DataFrame({'Date': date_list})
merged_df = pd.merge(date_df, event_counts, how='left', on='Date')
merged_df.columns = ['Date', 'Events']
merged_df = merged_df.fillna(method='ffill').fillna(0)
Unless I am mistaken about your objective, it seems to me that you can simply use pandas DataFrames' ability to compare against a single value and slice the dataframe like so:
>>> from datetime import date
>>> df = pd.DataFrame({'event_date': [date(2020, 9, 1), date(2020, 9, 2), date(2020, 9, 3)]})
>>> df
event_date
0 2020-09-01
1 2020-09-02
2 2020-09-03
>>> df[df.event_date > date(2020, 9, 1)]
event_date
1 2020-09-02
2 2020-09-03
I need to get the month-end balance from a series of entries.
Sample data:
date contrib totalShrs
0 2009-04-23 5220.00 10000.000
1 2009-04-24 10210.00 20000.000
2 2009-04-27 16710.00 30000.000
3 2009-04-30 22610.00 40000.000
4 2009-05-05 28909.00 50000.000
5 2009-05-20 38409.00 60000.000
6 2009-05-28 46508.00 70000.000
7 2009-05-29 56308.00 80000.000
8 2009-06-01 66108.00 90000.000
9 2009-06-02 78108.00 100000.000
10 2009-06-12 86606.00 110000.000
11 2009-08-03 95606.00 120000.000
The output would look something like this:
2009-04-30 40000
2009-05-31 80000
2009-06-30 110000
2009-07-31 110000
2009-08-31 120000
Is there a simple Pandas method?
I don't see how I can do this with something like a groupby?
Or would I have to do something like iterrows, find all the monthly entries, order them by date and pick the last one?
Thanks.
Use pd.Grouper with GroupBy.last, forward fill the missing months with ffill, and turn the index back into a column with reset_index:
#if necessary
#df['date'] = pd.to_datetime(df['date'])
df = df.groupby(pd.Grouper(freq='M', key='date'))['totalShrs'].last().ffill().reset_index()
#alternative
#df = df.resample('M', on='date')['totalShrs'].last().ffill().reset_index()
print (df)
date totalShrs
0 2009-04-30 40000.0
1 2009-05-31 80000.0
2 2009-06-30 110000.0
3 2009-07-31 110000.0
4 2009-08-31 120000.0
The following gives you the information you want, i.e. end-of-month values, though the format is not exactly what you asked for:
df['month'] = df['date'].str.split('-', expand=True)[1]  # split date column to get month column
newdf = pd.DataFrame(columns=df.columns)  # create a new dataframe for output
grouped = df.groupby('month')  # get grouped values
for g in grouped:  # for each group, take the last row
    gdf = pd.DataFrame(data=g[1])
    newdf.loc[len(newdf), :] = gdf.iloc[-1, :]  # fill new dataframe with that row
newdf = newdf.drop('date', axis=1)  # drop date column, since month column is there
print(newdf)
Output:
contrib totalShrs month
0 22610 40000 04
1 56308 80000 05
2 86606 110000 06
3 95606 120000 08
I am trying to group hospital staff working hours bi-monthly. I have raw data on a daily basis which looks like below.
date hours_spent emp_id
9/11/2016 8 1
15/11/2016 8 1
22/11/2016 8 2
23/11/2016 8 1
What I want after grouping is:
cycle hours_spent emp_id
1/11/2016-15/11/2016 16 1
16/11/2016-30/11/2016 8 2
16/11/2016-30/11/2016 8 1
I am trying to do the same with a Grouper and a frequency in pandas, something like below.
data.set_index('date', inplace=True)
print(data.head())
dt = data.groupby(['emp_id', pd.Grouper(key='date', freq='MS')])['hours_spent'].sum().reset_index().sort_values('date')
#df.resample('10d').mean().interpolate(method='linear',axis=0)
print(dt.resample('SMS').sum())
I also tried resampling
df1 = dt.resample('MS', loffset=pd.Timedelta(15, 'd')).sum()
data.set_index('date',inplace=True)
df1 = data.resample('MS', loffset=pd.Timedelta(15, 'd')).sum()
But this is giving 15-day intervals, not 1st-to-15th and 16th-to-end-of-month.
Please let me know what I am doing wrong here.
You were almost there. This will do it -
dt = df.groupby(['emp_id', pd.Grouper(key='date', freq='SM')])['hours_spent'].sum().reset_index().sort_values('date')
emp_id date hours_spent
1 2016-10-31 8
1 2016-11-15 16
2 2016-11-15 8
The freq='SM' means semi-month: the grouper anchors on the 15th and the last day of every month
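You can verify the anchor points 'SM' uses with a quick date_range (a small check, separate from the answer's code):

```python
import pandas as pd

# 'SM' (semi-month end) anchors on the 15th and the last day of each month
anchors = pd.date_range('2016-11-01', '2016-12-31', freq='SM')
print([d.strftime('%Y-%m-%d') for d in anchors])
```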
Put DateTime-Values into Bins
If I got you right, you basically want to put the values in your date column into bins. For this, pandas has the pd.cut() function, which does exactly what you want.
Here's an approach which might help you:
import pandas as pd
from datetime import datetime

df = pd.DataFrame({
    'hours': 8,
    'emp_id': [1, 1, 2, 1],
    'date': [datetime(2016, 11, 9),
             datetime(2016, 11, 15),
             datetime(2016, 11, 22),
             datetime(2016, 11, 23)]
})
bins_dt = pd.date_range('2016-10-16', freq='SM', periods=3)
cycle = pd.cut(df.date, bins_dt)
df.groupby([cycle, 'emp_id']).sum()
Which gets you:
cycle emp_id hours
------------------------ ------ ------
(2016-10-31, 2016-11-15] 1 16
2 NaN
(2016-11-15, 2016-11-30] 1 8
2 8
Had a similar question, here was my solution:
df1['BiMonth'] = df1['Date'] + pd.DateOffset(days=-1) + pd.offsets.SemiMonthEnd()
df1['BiMonth'] = df1['BiMonth'].dt.to_period('D')
The df1['Date'] + pd.DateOffset(days=-1) part takes whatever is in the date column and subtracts one day.
Adding pd.offsets.SemiMonthEnd() then rolls the result forward into a bi-monthly basket (the 15th or the month end); without first subtracting a day, dates already sitting on a basket boundary would be pushed into the next basket.
The .dt.to_period('D') call cleans out the time component so you just have days.
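A quick sanity check of that bucketing on three made-up dates from this question's range:

```python
import pandas as pd

d = pd.Series(pd.to_datetime(['2016-11-09', '2016-11-15', '2016-11-22']))
# subtract one day first so the 15th itself stays in the first half-month
bucket = (d + pd.DateOffset(days=-1) + pd.offsets.SemiMonthEnd()).dt.to_period('D')
print(bucket.astype(str).tolist())
```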
I have a data frame that contains the following columns:
ID Scheduled Date
241 10/9/2018
423 9/25/2018
126 9/30/2018
123 8/13/2018
132 8/16/2018
143 10/6/2018
I want to count the total number of IDs by week. Specifically, I want the week to always start on Monday and always end on Sunday.
I achieved this in Jupyter Notebook already:
weekly_count_output = df.resample('W-Mon', on='Scheduled Date', label='left', closed='left').sum().query('count_row > 0')
weekly_count_output = weekly_count_output.reset_index()
weekly_count_output = weekly_count_output[['Scheduled Date', 'count_row']]
weekly_count_output = weekly_count_output.rename(columns = {'count_row': 'Total Count'})
But I don't know how to write the above code in PySpark. I want my resulting output to look like this:
Scheduled Date Total Count
8/13/2018 2
9/24/2018 2
10/1/2018 1
10/8/2018 1
Please note the Scheduled Date is always a Monday (indicating beginning of week) and the total count goes from Monday to Sunday of that week.
Thanks to Get Last Monday in Spark for defining the function previous_day.
Firstly import,
from pyspark.sql.functions import *
from datetime import datetime
Assuming your input data as in my df (DataFrame)
cols = ['id', 'scheduled_date']
vals = [
(241, '10/09/2018'),
(423, '09/25/2018'),
(126, '09/30/2018'),
(123, '08/13/2018'),
(132, '08/16/2018'),
(143, '10/06/2018')
]
df = spark.createDataFrame(vals, cols)
This is the function defined (it returns the Monday on or before the given date when called with 'monday'):
def previous_day(date, dayOfWeek):
    return date_sub(next_day(date, dayOfWeek), 7)
# Converting the string column to a 'yyyy-MM-dd' date string.
df = df.withColumn('scheduled_date', date_format(unix_timestamp('scheduled_date', 'MM/dd/yyyy') \
    .cast('timestamp'), 'yyyy-MM-dd'))
df.show()
+---+--------------+
| id|scheduled_date|
+---+--------------+
|241| 2018-10-09|
|423| 2018-09-25|
|126| 2018-09-30|
|123| 2018-08-13|
|132| 2018-08-16|
|143| 2018-10-06|
+---+--------------+
# Map each date back to the Monday of its week
df_mon = df.withColumn("scheduled_date", previous_day('scheduled_date', 'monday'))
df_mon.show()
+---+--------------+
| id|scheduled_date|
+---+--------------+
|241| 2018-10-08|
|423| 2018-09-24|
|126| 2018-09-24|
|123| 2018-08-13|
|132| 2018-08-13|
|143| 2018-10-01|
+---+--------------+
# You can groupBy and do agg count of 'id'.
df_mon_grp = df_mon.groupBy('scheduled_date').agg(count('id')).orderBy('scheduled_date')
# Reformatting to match your resulting output.
df_mon_grp = df_mon_grp.withColumn('scheduled_date', date_format(unix_timestamp('scheduled_date', "yyyy-MM-dd") \
.cast('timestamp'), 'MM/dd/yyyy'))
df_mon_grp.show()
+--------------+---------+
|scheduled_date|count(id)|
+--------------+---------+
| 08/13/2018| 2|
| 09/24/2018| 2|
| 10/01/2018| 1|
| 10/08/2018| 1|
+--------------+---------+