I fully understand there are a few versions of this question out there, but none seem to get at the core of my problem. I have a pandas DataFrame with roughly 72,000 rows from 2015 to now. I am using a calculation that finds the most impactful words for a given set of text (tf-idf). This calculation does not account for time, so I need to break my main DataFrame down into time-based segments, ideally every 15 or 30 days (really any n days, not weeks/months), then run the calculation on each time-segmented DataFrame in order to see and plot which words come up more and less over time.
I have been able to build part of this out semi-manually with the following:
def dateRange():
    start = input("Enter a start date (MM-DD-YYYY) or '30' for last 30 days: ")
    if start != '30':
        datetime.strptime(start, '%m-%d-%Y')
        end = input("Enter an end date (MM-DD-YYYY): ")
        datetime.strptime(end, '%m-%d-%Y')
        dataTime = data[(data['STATUSDATE'] > start) & (data['STATUSDATE'] <= end)]
    else:
        dataTime = data[data.STATUSDATE > datetime.now() - pd.to_timedelta('30day')]
    return dataTime
dataTime = dateRange()
dataTime2 = dateRange()
def calcForDateRange(dateRangeFrame):
    ##### LONG FUNCTION ####
    return word and number
calcForDateRange(dataTime)
calcForDateRange(dataTime2)
This works - however, I have to manually create the two dates, which is expected since I created this as a test. How can I split the DataFrame into increments and run the calculation on each piece?
Dicts are allegedly the way to do this. I tried:
dict_of_dfs = {}
for n, g in data.groupby(data['STATUSDATE']):
    dict_of_dfs[n] = g

for frame in dict_of_dfs:
    calcForDateRange(frame)
The dict result was 2015-01-02: DataFrame, with no frame. How can I break this down into 100 or so DataFrames to run my function on?
Also, I do not fully understand how to break ['STATUSDATE'] down by a specific number of days.
I would like to avoid iterating as much as possible, but I know I will probably have to somewhere.
Thank you.
Let us assume you have a data frame like this:
date = pd.date_range(start='1/1/2018', end='31/12/2018', normalize=True)
x = np.random.randint(0, 1000, size=365)
df = pd.DataFrame(x, columns = ["X"])
df['Date'] = date
df.head()
Output:
X Date
0 328 2018-01-01
1 188 2018-01-02
2 709 2018-01-03
3 259 2018-01-04
4 131 2018-01-05
So this data frame has 365 rows, one for each day of the year.
Now if you want to group this data into intervals of 20 days and assign each group to a dict, you can do the following
df_dict = {}
for k, v in df.groupby(pd.Grouper(key="Date", freq='20D')):
    df_dict[k.strftime("%Y-%m-%d")] = pd.DataFrame(v)
print(df_dict)
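For the original question you could then loop over the dict and run the calculation on each sub-frame. A minimal sketch, assuming calcForDateRange from the question accepts one of these slices and returns a result you want to keep per period:
period_results = {}
for period_start, frame in df_dict.items():
    # each value is the 20-day slice of rows starting at period_start
    period_results[period_start] = calcForDateRange(frame)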
How about something like this? It creates a dictionary of non-empty DataFrames keyed on the starting date of each period.
import datetime as dt
start = '12-31-2017'
interval_days = 30
start_date = pd.Timestamp(start)
end_date = pd.Timestamp(dt.date.today() + dt.timedelta(days=1))
dates = pd.date_range(start=start_date, end=end_date, freq=f'{interval_days}d')
sub_dfs = {d1.strftime('%Y%m%d'): df.loc[df.dates.ge(d1) & df.dates.lt(d2)]
for d1, d2 in zip(dates, dates[1:])}
# Remove empty dataframes.
sub_dfs = {k: v for k, v in sub_dfs.items() if not v.empty}
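Note that df.dates assumes the date column is literally named dates. For the question's frame it would presumably be the STATUSDATE column instead, e.g. (a sketch, assuming STATUSDATE is already a datetime column):
sub_dfs = {d1.strftime('%Y%m%d'): data.loc[data['STATUSDATE'].ge(d1) & data['STATUSDATE'].lt(d2)]
           for d1, d2 in zip(dates, dates[1:])}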
I have the following problem: I have data (table called 'answers') of a quiz application including the answered questions per user with the respective answering date (one answer per line), e.g.:
UserID  Time                 Term      QuestionID  Answer
1       2019-12-28 18:25:15  Winter19  345         a
2       2019-12-29 20:15:13  Winter19  734         b
I would like to write an algorithm to determine whether a user has used the quiz application several days in a row (a so-called 'streak'). Therefore, I want to create a table ('appData') with the following information:
UserID  Term      HighestStreak
1       Winter19  7
2       Winter19  10
For this table I need to compute the variable 'HighestStreak'. I managed to do so with the following code:
for userid, term in zip(appData.userid, appData.term):
    final_streak = 1
    for i in answers[(answers.userid==userid) & (answers.term==term)].time.dt.date.unique():
        temp_streak = 1
        while i + pd.DateOffset(days=1) in answers[(answers.userid==userid) & (answers.term==term)].time.dt.date.unique():
            i += pd.DateOffset(days=1)
            temp_streak += 1
        if temp_streak > final_streak:
            final_streak = temp_streak
    appData.loc[(appData.userid==userid) & (appData.term==term), 'HighestStreak'] = final_streak
Unfortunately, running this code takes about 45 minutes. The table 'answers' has about 4,000 lines. Is there any structural 'mistake' in my code that makes it so slow or do processes like this take that amount of time?
Any help would be highly appreciated!
EDIT:
I managed to increase the speed from 45 minutes to 2 minutes with the following change:
I first filtered the data to students who answered at least one question and set the streak to 0 for the rest (since the streak for 0 answers is 0 in every case):
appData.loc[appData.totAnswers==0, 'highestStreak'] = 0
appDataActive = appData[appData.totAnswers!=0]
Furthermore I moved the filtered list out of the loop, so the algorithm does not need to filter twice, resulting in the following new code:
appData.loc[appData.totAnswers==0, 'highestStreak'] = 0
appDataActive = appData[appData.totAnswers!=0]

for userid, term in zip(appData.userid, appData.term):
    activeDays = answers[(answers.userid==userid) & (answers.term==term)].time.dt.date.unique()
    final_streak = 1
    for day in activeDays:
        temp_streak = 1
        while day + pd.DateOffset(days=1) in activeDays:
            day += pd.DateOffset(days=1)
            temp_streak += 1
        if temp_streak > final_streak:
            final_streak = temp_streak
    appData.loc[(appData.userid==userid) & (appData.term==term), 'HighestStreak'] = final_streak
Of course, 2 minutes is much better than 45 minutes. But are there any more tips?
My attempt, which borrows some key ideas from the connected-components problem, a fairly early problem encountered when studying graphs.
First I create a random DataFrame with some user IDs and some dates.
import datetime
import random
import pandas
import numpy

# generate a basic dataframe of users and answer dates
def sample_data_frame():
    users = ['A' + str(x) for x in range(10000)]  # generate user ids
    date_range = pandas.Series(pandas.date_range(datetime.date.today() - datetime.timedelta(days=364), datetime.date.today()),
                               name='date')
    users = pandas.Series(users, name='user')
    df = pandas.merge(date_range, users, how='cross')
    removals = numpy.random.randint(0, len(df), int(len(df)/4))  # remove a random quarter of entries
    df.drop(removals, inplace=True)
    return df

def sample_data_frame_v2():  # pandas version <1.2
    users = ['A' + str(x) for x in range(10000)]  # generate user ids
    date_range = pandas.DataFrame(pandas.date_range(datetime.date.today() - datetime.timedelta(days=364), datetime.date.today()), columns=['date'])
    users = pandas.DataFrame(users, columns=['user'])
    date_range['key'] = 1
    users['key'] = 1
    df = users.merge(date_range, on='key')
    df = df.drop(labels='key', axis=1)
    removals = numpy.random.randint(0, len(df), int(len(df)/4))  # remove a random quarter of entries
    df.drop(removals, inplace=True)
    return df
Put your DataFrame in sorted order, first by user and then by date, so that within each user the next row is the next answer day.
Create two new columns holding the previous row's user id and date (shifted down by one row).
If the previous row has the same user and its date is exactly one day before the current date, the streak continues, so set the result column to False (numerically 0); otherwise a new streak starts, so set it to True (numerically 1).
Cumulatively sum the results, which groups your streaks.
Finally, count how many entries exist per group and find the max for each user.
For 10k users over 364 days' worth of answers, my running time is about 1 second.
df = sample_data_frame()
df = df.sort_values(by=['user', 'date']).reset_index(drop = True)
df['shift_date'] = df['date'].shift()
df['shift_user'] = df['user'].shift()
df['result'] = ~((df['shift_date'] == df['date'] - datetime.timedelta(days=1)) & (df['shift_user'] == df['user']))
df['group'] = df['result'].cumsum()
summary = (df.groupby(by=['user', 'group']).count()['result'].max(level='user'))
summary.sort_values(ascending = False) #print user with highest streak
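If you then want the result shaped like the appData table in the question (one row per user with its highest streak), a small sketch building on the summary Series above:
highest = summary.rename('HighestStreak').reset_index()  # one row per user: user, HighestStreak
print(highest.head())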
I have a dataset which has a list of subjects, a start date, and an end date. I'm trying to write a loop so that for each subject I have a list of dates between the start date and end date. I've tried so many ways to do this based on previous posts but am still having issues.
An example of the DataFrame:
Participant #  Start_Date  End_Date
1              23-04-19    25-04-19
An example of the output I want:
Participant #  Range
1              23-04-19
1              24-04-19
1              25-04-19
Right now my code looks like this:
subjs_490 = tracksheet_490['Participant #']
for subj_490 in subjs_490:
    temp_a = tracksheet_490[tracksheet_490['Participant #'].isin([subj_490])]
    start = temp_a['Start_Date']
    end = temp_a['End_Date']
    start_dates = pd.to_datetime(pd.Series(start), format='%d-%m-%y')
    end_dates = pd.to_datetime(pd.Series(end), format='%d-%m-%y')
    date_range = pd.date_range(start_dates, end_dates).tolist()
With this method I'm getting the following error:
Cannot convert input [1 2016-05-03 Name: Start_Date, dtype: datetime64[ns]] of type <class 'pandas.core.series.Series'> to Timestamp
Expanding ranges tends to be a slow process. You can create the date_range and then explode it to get what you want. Moving 'Participant #' to the index makes sure it's repeated for all rows that are exploded.
df = (df.set_index('Participant #')
        .apply(lambda x: pd.date_range(x['Start_Date'], x['End_Date']), axis=1)  # :( slow
        .rename('Range')
        .explode()
        .reset_index())
Participant # Range
0 1 2019-04-23
1 1 2019-04-24
2 1 2019-04-25
If you can't use explode another option is to create a separate DataFrame for each row and then concat them all together.
pd.concat([pd.DataFrame({'Participant #': par, 'Range': pd.date_range(start, end)})
           for par, start, end in zip(df['Participant #'], df['Start_Date'], df['End_Date'])],
          ignore_index=True)
I have been spending hours trying to write a function to detect a trend in a time series by taking the past 4 months of data prior to today. I organized my monthly data with dt.month, but the issue is that I cannot get the previous year's 12th month if today is January. Here is a toy dataset:
data1 = pd.DataFrame({'Id' : ['001','001','001','001','001','001','001','001','001',
'002','002','002','002','002','002','002','002','002',],
'Date': ['2020-01-12', '2019-12-30', '2019-12-01','2019-11-01', '2019-08-04', '2019-08-04', '2019-08-01', '2019-07-20', '2019-06-04',
'2020-01-11', '2019-12-12', '2019-12-01','2019-12-01', '2019-09-10', '2019-08-10', '2019-08-01', '2019-06-20', '2019-06-01'],
'Quantity' :[3,5,6,72,1,5,6,3,9,3,6,7,3,2,5,74,3,4]
})
and my data cleaning to get the format that I need is this:
data1['Date'] =pd.to_datetime(data1['Date'], format='%Y-%m')
data2 = data1.groupby('Id').apply(lambda x: x.set_index('Date').resample('M').sum())['Quantity'].reset_index()
data2['M'] =pd.to_datetime(data2['Date']).dt.month
data2['Y'] =pd.to_datetime(data2['Date']).dt.year
data = pd.DataFrame(data2.groupby(['Id','Date','M','Y'])['Quantity'].sum())
data = data.rename(columns={0 : 'Quantity'})
and my function looks like this:
def check_trend():
    today_month = int(time.strftime("%-m"))
    data['n3-n4'] = data['Quantity'].loc[data['M'] == (today_month - 3)] - data['Quantity'].loc[data['M'] == (today_month - 4)]
    data['n2-n3'] = data['Quantity'].loc[data['M'] == (today_month - 2)] - data['Quantity'].loc[data['M'] == (today_month - 3)]
    data['n2-n1'] = data['Quantity'].loc[data['M'] == (today_month - 2)] - data['Quantity'].loc[data['M'] == (today_month - 1)]
    if data['n3-n4'] < 0 and data['n2-n3'] < 0 and data['n2-n1'] < 0:
        data['Trend'] = 'Yes'
    elif data['n3-n4'] > 0 and data['n2-n3'] > 0 and data['n2-n1'] > 0:
        data['Trend'] = 'Yes'
    else:
        data['Trend'] = 'No'
print(check_trend())
I have looked at this: Get (year,month) for the last X months but it does not seem to be working for a specific groupby object.
I would really appreciate a hint! At least I would love to know if this method to identify trend in a dataset is a good one. After that I plan on using exponential smoothing if there is no trend and Holt's method if there is trend.
UPDATE: thanks to @Vorsprung durch Technik, I have the function working well, but I still struggle to incorporate the result into a new DataFrame containing the Ids from data2:
forecast = pd.DataFrame()
forecast['Id'] = data1['Id'].unique()
for k, g in data2.groupby(level='Id'):
    forecast['trendup'] = g.tail(5)['Quantity'].is_monotonic_increasing
    forecast['trendown'] = g.tail(5)['Quantity'].is_monotonic_decreasing
This returns the same value for each row of the dataset, as if it were doing the calculation for only one of the Ids. How can I ensure that it gets calculated for each Id value?
I don't think you need check_trend().
There are built-in functions for this:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.is_monotonic_increasing.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.is_monotonic_decreasing.html
Let me know if this does what you need:
data2 = data1.groupby('Id').apply(lambda x: x.set_index('Date').resample('M').sum())
for k, g in data2.groupby(level='Id'):
    print(g.tail(4)['Quantity'].is_monotonic_increasing)
    print(g.tail(4)['Quantity'].is_monotonic_decreasing)
This is what is returned by g.tail(4):
                Quantity
Id  Date
001 2019-10-31         0
    2019-11-30        72
    2019-12-31        11
    2020-01-31         3

                Quantity
Id  Date
002 2019-10-31         0
    2019-11-30         0
    2019-12-31        16
    2020-01-31         3
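Regarding the question's update: the loop assigns to the whole forecast['trendup'] / forecast['trendown'] columns on every iteration, so the last group's flags overwrite everything. A sketch that collects one row per Id instead (assuming data2 is the resampled frame built above):
rows = []
for k, g in data2.groupby(level='Id'):
    rows.append({'Id': k,
                 'trendup': g.tail(5)['Quantity'].is_monotonic_increasing,
                 'trendown': g.tail(5)['Quantity'].is_monotonic_decreasing})
forecast = pd.DataFrame(rows)  # one row per Id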
I have two data frames:
The first data frame is:
import pandas as pd
df1 = pd.DataFrame({'serialNo':['aaaa','bbbb','cccc','ffff','aaaa','bbbb','aaaa'],
'Name':['Sayonti','Ruchi','Tony','Gowtam','Toffee','Tom','Sayonti'],
'testName': [4402, 3747 ,5555,8754,1234,9876,3602],
'moduleName': ['singing', 'dance','booze', 'vocals','drama','paint','singing'],
'endResult': ['WARNING', 'FAILED', 'WARNING', 'FAILED','WARNING','FAILED','WARNING'],
'Date':['2018-10-5','2018-10-6','2018-10-7','2018-10-8','2018-10-9','2018-10-10','2018-10-8'],
'Time_df1':['23:26:39','22:50:31','22:15:28','21:40:19','21:04:15','20:29:11','19:54:03']})
The second data frame is:
df2 = pd.DataFrame({'serialNo':['aaaa','bbbb','aaaa','ffff','xyzy','aaaa'],
'Food':['Strawberry','Coke','Pepsi','Nuts','Apple','Candy'],
'Work': ['AP', 'TC','OD', 'PU','NO','PM'],
'Date':['2018-10-1','2018-10-6','2018-10-2','2018-10-3','2018-10-5','2018-10-10'],
'Time_df2':['09:00:00','10:00:00','11:00:00','12:00:00','13:00:00','14:00:00']
})
I am joining the two based on serial number:
df1['Date'] = pd.to_datetime(df1['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])
result = pd.merge(df1,df2,on=['serialNo'],how='inner')
Now I want Date_y to lie within 3 days of Date_x, starting from Date_x, which means Date_x + (1, 2, or 3 days) should equal Date_y. I can get that as below, but I also want to check the time range, which I do not know how to achieve:
result = result[result.Date_x.sub(result.Date_y).dt.days.between(0,3)]
I want to check the time such that Time_df2 is within 6 hours of the start time, Time_df1. Please help.
You could have a column within your dataframe that combines the date and the time. Here's an example of combining a single row in the dataframe:
import datetime

# Combining Date_x and Time_df1
value_1_x = datetime.datetime.combine(result['Date_x'][0].date(),
                                      datetime.datetime.strptime(result['Time_df1'][0], '%H:%M:%S').time())
# Combining Date_y and Time_df2
value_2_y = datetime.datetime.combine(result['Date_y'][0].date(),
                                      datetime.datetime.strptime(result['Time_df2'][0], '%H:%M:%S').time())
Then given two datetime objects, you can simply subtract to find the difference you are looking for:
difference = value_1_x - value_2_y
print(difference)
Which gives the output:
4 days, 14:26:39
My understanding is that you are looking to see if something is within 3 days and 6 hours (or a total of 78 hours). You can convert this to hours easily, and then make the desired comparison:
hours_difference = abs(value_1_x - value_2_y).total_seconds() / 3600.0
print(hours_difference)
Which gives the output:
110.44416666666666
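To do the same comparison for every row of the merged frame rather than one row at a time, a vectorized sketch (assuming the column names from the question and the 78-hour reading above) could be:
start = pd.to_datetime(result['Date_x'].dt.strftime('%Y-%m-%d') + ' ' + result['Time_df1'])
end = pd.to_datetime(result['Date_y'].dt.strftime('%Y-%m-%d') + ' ' + result['Time_df2'])
hours_diff = (start - end).abs().dt.total_seconds() / 3600.0
result_within = result[hours_diff <= 78]  # keep rows within 3 days plus 6 hours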
Hope that helps!
I have a data frame df which has dates in it:
df['Survey_Date'].head(4)
Out[65]:
0 1990-09-28
1 1991-07-26
2 1991-11-23
3 1992-10-15
I am interested in calculating a metric between two of the dates, using a separate data frame flow_df.
flow_df looks like:
date flow
0 1989-01-01 7480
1 1989-01-02 5070
2 1989-01-03 6410
3 1989-01-04 10900
4 1989-01-05 11700
For instance, I would like to query another data frame based on the current_date and early_date. The first time period of interest would be:
current_date = 1991-07-26
early_date = 1990-09-28
I have written a clunky for loop and it gets the job done, but I am sure there is a more elegant way:
My approach with a counter and for loop:
def find_peak(early_date, current_date, flow_df):
    mask = (flow_df['date'] >= early_date) & (flow_df['date'] < current_date)
    query = flow_df.loc[mask]
    peak_flow = np.max(query['flow'])*0.3048**3
    return peak_flow
n = 0
for thing in df['Survey_Date'][1:]:
    early_date = df['Survey_Date'][n]
    current_date = thing
    peak_flow = find_peak(early_date, current_date, flow_df)
    n += 1
    df['Avg_Stage'][n] = peak_flow
How can I do this without a counter and for loop?
The desired output looks like:
Survey_Date Avg_Stage
0 1990-09-28
1 1991-07-26 574.831986
2 1991-11-23 526.693347
3 1992-10-15 458.732915
4 1993-04-01 855.168767
5 1993-11-17 470.059653
6 1994-04-07 419.089330
7 1994-10-21 450.237861
8 1995-04-24 498.376500
9 1995-06-23 506.871554
You can define a new variable that identifies the survey period and use pandas.DataFrame.groupby to avoid the for loop. It should be much faster when flow_df is large.
#convert both to datetime, if they are not
df['Survey_Date'] = pd.to_datetime(df['Survey_Date'])
flow_df['date'] = pd.to_datetime(flow_df['date'])
#Merge Survey_Date to flow_df. Most rows of flow_df['Survey_Date'] should be NaT
flow_df = flow_df.merge(df, left_on='date', right_on='Survey_Date', how='outer')
# In case not all Survey_Date in flow_df['date'] or data not sorted by date.
flow_df['date'].fillna(flow_df['Survey_Date'], inplace=True)
flow_df.sort_values('date', inplace=True)
#Identify survey period. In your example: [1990-09-28, 1991-07-26) is represented by 0; [1991-07-26, 1991-11-23) = 1; etc.
flow_df['survey_period'] = flow_df['Survey_Date'].notnull().cumsum()
#calc Avg_Stage in each survey_period. I did .shift(1) because you want to align period [1990-09-28, 1991-07-26) to 1991-07-26
df['Avg_Stage'] = (flow_df.groupby('survey_period')['flow'].max()*0.3048**3).shift(1)
You can use zip():
for early_date, current_date in zip(df['Survey_Date'], df['Survey_Date'][1:]):
    # do whatever you want.
Of course you can put it into a list comprehension:
[some_metric(early_date, current_date) for early_date, current_date in zip(df['Survey_Date'], df['Survey_Date'][1:])]
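Combined with the question's find_peak, that could look like the following sketch (the first survey date has no preceding period, hence the leading NaN):
import numpy as np

df['Avg_Stage'] = [np.nan] + [find_peak(early_date, current_date, flow_df)
                              for early_date, current_date
                              in zip(df['Survey_Date'], df['Survey_Date'][1:])]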