I have a data frame df which has dates in it:
df['Survey_Date'].head(4)
Out[65]:
0 1990-09-28
1 1991-07-26
2 1991-11-23
3 1992-10-15
I am interested in calculating a metric between two of the dates, using a separate data frame flow_df.
flow_df looks like:
date flow
0 1989-01-01 7480
1 1989-01-02 5070
2 1989-01-03 6410
3 1989-01-04 10900
4 1989-01-05 11700
For instance, I would like to query another data frame based on the current_date and early_date. The first time period of interest would be:
current_date = 1991-07-26
early_date = 1990-09-28
I have written a clunky for loop and it gets the job done, but I am sure there is a more elegant way:
My approach with a counter and for loop:
def find_peak(early_date, current_date, flow_df):
    mask = (flow_df['date'] >= early_date) & (flow_df['date'] < current_date)
    query = flow_df.loc[mask]
    peak_flow = np.max(query['flow'])*0.3048**3
    return peak_flow

n = 0
for thing in df['Survey_Date'][1:]:
    early_date = df['Survey_Date'][n]
    current_date = thing
    peak_flow = find_peak(early_date, current_date, flow_df)
    n += 1
    df['Avg_Stage'][n] = peak_flow
How can I do this without a counter and for loop?
The desired output looks like:
Survey_Date Avg_Stage
0 1990-09-28
1 1991-07-26 574.831986
2 1991-11-23 526.693347
3 1992-10-15 458.732915
4 1993-04-01 855.168767
5 1993-11-17 470.059653
6 1994-04-07 419.089330
7 1994-10-21 450.237861
8 1995-04-24 498.376500
9 1995-06-23 506.871554
You can define a new variable that identifies the survey period and use pandas.DataFrame.groupby to avoid the for loop. It should be much faster when flow_df is large.
#convert both to datetime, if they are not
df['Survey_Date'] = pd.to_datetime(df['Survey_Date'])
flow_df['date'] = pd.to_datetime(flow_df['date'])
#Merge Survey_Date to flow_df. Most rows of flow_df['Survey_Date'] should be NaT
flow_df = flow_df.merge(df, left_on='date', right_on='Survey_Date', how='outer')
# In case not all Survey_Dates appear in flow_df['date'], or the data is not sorted by date.
flow_df['date'].fillna(flow_df['Survey_Date'], inplace=True)
flow_df.sort_values('date', inplace=True)
#Identify survey period. With the example data, everything before 1990-09-28 is period 0, [1990-09-28, 1991-07-26) is period 1, [1991-07-26, 1991-11-23) is period 2, etc.
flow_df['survey_period'] = flow_df['Survey_Date'].notnull().cumsum()
#Calc Avg_Stage in each survey_period. Period k ends at the k-th survey date, so the grouped maxima align directly with df's index; the first survey date has no preceding survey, so leave it empty.
df['Avg_Stage'] = flow_df.groupby('survey_period')['flow'].max()*0.3048**3
df.loc[0, 'Avg_Stage'] = np.nan
You can use zip():
for early_date, current_date in zip(df['Survey_Date'], df['Survey_Date'][1:]):
    # do whatever you want
Of course you can put it into a list comprehension:
[some_metric(early_date, current_date) for early_date, current_date in zip(df['Survey_Date'], df['Survey_Date'][1:])]
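For instance, a sketch combining zip() with the asker's own find_peak (this assumes df, flow_df and find_peak as defined in the question, with both date columns already converted to datetime):
import numpy as np

# Pair each survey date with the previous one and reuse the question's find_peak.
peaks = [find_peak(early, current, flow_df)
         for early, current in zip(df['Survey_Date'], df['Survey_Date'][1:])]

# The first survey date has no preceding period, so pad the front with NaN.
df['Avg_Stage'] = [np.nan] + peaks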
Related
I have a Data Frame which contains a column like this:
pct_change
0 NaN
1 -0.029767
2 0.039884 # period of one
3 -0.026398
4 0.044498 # period of two
5 0.061383 # period of two
6 -0.006618
7 0.028240 # period of one
8 -0.009859
9 -0.012233
10 0.035714 # period of three
11 0.042547 # period of three
12 0.027874 # period of three
13 -0.008823
14 -0.000131
15 0.044907 # period of one
I want to get all the periods where the pct change was positive into a list, so with the example column it will be:
raise_periods = [1,2,1,3,1]
Assuming that the column of your dataframe is a series called y which contains the pct_changes, the following code provides a vectorized solution without loops.
y = df['pct_change']
raise_periods = (y < 0).cumsum()[y > 0]
raise_periods.groupby(raise_periods).count()
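For reference, a minimal self-contained run of this approach on the question's example values (a sketch; only pandas is assumed):
import pandas as pd

y = pd.Series([None, -0.029767, 0.039884, -0.026398, 0.044498, 0.061383,
               -0.006618, 0.028240, -0.009859, -0.012233, 0.035714, 0.042547,
               0.027874, -0.008823, -0.000131, 0.044907], dtype=float)

# Each negative value starts a new group label; keeping only the positive rows
# and counting per label yields the lengths of the consecutive positive runs.
groups = (y < 0).cumsum()[y > 0]
raise_periods = groups.groupby(groups).count().tolist()
print(raise_periods)  # [1, 2, 1, 3, 1]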
Eventually, the answer provided by @gioxc88 didn't get me where I wanted, but it did point me in the right direction.
What I ended up doing is this:
def get_rise_avg_period(cls, df):
    df[COMPOUND_DIFF] = df[NEWS_COMPOUND].diff()
    df[CONSECUTIVE_COMPOUND] = df[COMPOUND_DIFF].apply(lambda x: 1 if x > 0 else 0)
    # group together the runs of rising and falling changes
    unfiltered_periods = [list(group) for key, group in itertools.groupby(df.consecutive_high.values.tolist())]
    # keep only the rising runs
    positive_periods = [li for li in unfiltered_periods if 0 not in li]
I wanted to get the average length of these positive periods, so I added this at the end:
period = round(np.mean(positive_periods_lens))
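For context, here is a minimal self-contained sketch of the same run-length idea applied directly to the question's pct_change column; it also shows where a list like positive_periods_lens (used above) would come from. The column names in my snippet above are project-specific constants, so this is only an illustration, not the original code.
import itertools
import numpy as np

# 1 where pct_change is positive, 0 otherwise (NaN counts as not positive)
positive = (df['pct_change'] > 0).astype(int).tolist()

# split the flags into consecutive runs and keep only the all-positive runs
runs = [list(group) for _, group in itertools.groupby(positive)]
positive_periods = [run for run in runs if 0 not in run]

positive_periods_lens = [len(run) for run in positive_periods]  # [1, 2, 1, 3, 1]
period = round(np.mean(positive_periods_lens))                  # average run length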
I have a frame like:
id title date
0 1211 jingle bells 2019-01-15
1 1212 jingle bells 2019-01-15
2 1225 tom boat 2019-06-15
3 2112 tom boat 2019-06-15
4 3122 tom boat 2017-03-15
5 1762 tom boat 2017-03-15
An item is defined as a group of ids sharing the same title, with dates within 70 days of the first. I need a dictionary of ids grouped by title if the dates are within 70 days of each other. The expected outcome here is:
d = {0: [1211,1212], 1: [1225,2112], 2: [3122,1762]}
Any given title can have an uncapped number of dictionary entries, or just one. ids are unique to one title. At the moment, I do something like:
itemlist = []
for i in list(df.title):
    dates = list(df.loc[df.title==i, 'date'])
    if (max(dates)-min(dates)).days > 70:
        items = []
        while len(dates) > 0:
            extract = [i for i in dates if (i-min(dates)).days < 70]
            items.append(list(df.loc[(df.title==i) & (df.date.isin(extract)), 'id']))
            dates = [i for i in dates if i not in extract]
    else:
        items = [list(df.loc[df.title==i, 'id'])]
    itemlist += items
d = {j: i for i in range(len(itemlist)) for j in itemlist[i]}
It doesn't quite work yet; I'm still bugfixing. That said, this feels like a lot of iteration. Any ideas on how to do this better?
Another acceptable output would be a list of dataframes, one per item.
I think sorting your dataframe can help you solve the problem much more efficiently.
df = df.sort_values(['title', 'date'])
itemlist = []
counter = 0  # to get items at constant time
for title in df.title.unique():  # unique() keeps the sorted order, so counter stays in sync
    dates = df.loc[df['title']==title].date.tolist()
    item = []
    min_date = dates[0]
    for date in dates:
        if (date-min_date).days > 70:  # we need a new item
            itemlist.append(item)  # append original item
            item = [df.iloc[counter, 0]]  # new item
            min_date = date
        else:
            item.append(df.iloc[counter, 0])
        counter += 1
    itemlist.append(item)
d = {i: j for i, j in enumerate(itemlist)}
print(d)
Even though the code became a bit long, there are only two loops (plus the final comprehension that turns the list into a dict), and together they iterate n_rows times in total, which means every row is looked at only once.
The counter is there so we can use df.iloc, which uses a positional index (instead of labels or boolean conditions like df.loc) and is therefore an O(1) lookup.
I have a dataset which has a list of subjects, a start date, and an end date. I'm trying to write a loop so that for each subject I get a list of dates between the start date and the end date. I've tried many approaches based on previous posts but I'm still having issues.
an example of the dataframe:
Participant # Start_Date End_Date
1 23-04-19 25-04-19
An example of the output I want:
Participant # Range
1 23-04-19
1 24-04-19
1 25-04-19
Right now my code looks like this:
subjs_490 = tracksheet_490['Participant #']
for subj_490 in subjs_490:
    temp_a = tracksheet_490[tracksheet_490['Participant #'].isin([subj_490])]
    start = temp_a['Start_Date']
    end = temp_a['End_Date']
    start_dates = pd.to_datetime(pd.Series(start), format='%d-%m-%y')
    end_dates = pd.to_datetime(pd.Series(end), format='%d-%m-%y')
    date_range = pd.date_range(start_dates, end_dates).tolist()
With this method I'm getting the following error:
Cannot convert input [1   2016-05-03 Name: Start_Date, dtype: datetime64[ns]] of type <class 'pandas.core.series.Series'> to Timestamp
Expanding ranges tends to be a slow process. You can create the date_range and then explode it to get what you want. Moving 'Participant #' to the index makes sure it's repeated for all rows that are exploded.
df = (df.set_index('Participant #')
        .apply(lambda x: pd.date_range(x.start_date, x.end_date), axis=1)  # :( slow
        .rename('Range')
        .explode()
        .reset_index())
Participant # Range
0 1 2019-04-23
1 1 2019-04-24
2 1 2019-04-25
If you can't use explode another option is to create a separate DataFrame for each row and then concat them all together.
pd.concat([pd.DataFrame({'Participant #': par, 'Range': pd.date_range(start, end)})
           for par, start, end in zip(df['Participant #'], df['start_date'], df['end_date'])],
          ignore_index=True)
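One caveat worth adding: both snippets assume the date columns are real datetimes. With the DD-MM-YY strings shown in the question, a conversion like this would come first (column names as in the question; the snippets above use lowercase start_date/end_date, so adjust to whichever your frame uses):
import pandas as pd

# Parse the question's DD-MM-YY strings into proper datetimes first.
df['Start_Date'] = pd.to_datetime(df['Start_Date'], format='%d-%m-%y')
df['End_Date'] = pd.to_datetime(df['End_Date'], format='%d-%m-%y')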
I fully understand there are a few versions of this questions out there, but none seem to get at the core of my problem. I have a pandas Dataframe with roughly 72,000 rows from 2015 to now. I am using a calculation that finds the most impactful words for a given set of text (tf_idf). This calculation does not account for time, so I need to break my main Dataframe down into time-based segments, ideally every 15 and 30 days (or n days really, not week/month), then run the calculation on each time-segmented Dataframe in order to see and plot what words come up more and less over time.
I have been able to build part of this out semi-manually with the following:
def dateRange():
    start = input("Enter a start date (MM-DD-YYYY) or '30' for last 30 days: ")
    if (start != '30'):
        datetime.strptime(start, '%m-%d-%Y')
        end = input("Enter a end date (MM-DD-YYYY): ")
        datetime.strptime(end, '%m-%d-%Y')
        dataTime = data[(data['STATUSDATE'] > start) & (data['STATUSDATE'] <= end)]
    else:
        dataTime = data[data.STATUSDATE > datetime.now() - pd.to_timedelta('30day')]
    return dataTime

dataTime = dateRange()
dataTime2 = dateRange()

def calcForDateRange(dateRangeFrame):
    ##### LONG FUNCTION ####
    return word and number

calcForDateRange(dataTime)
calcForDateRange(dataTime2)
This works - however, I have to manually create the 2 dates which is expected as I created this as a test. How can I split the Dataframe by increments and run the calculation for each dataframe?
dicts are allegedly the way to do this. I tried:
dict_of_dfs = {}
for n, g in data.groupby(data['STATUSDATE']):
    dict_of_dfs[n] = g

for frame in dict_of_dfs:
    calcForDateRange(frame)
The resulting dict was keyed by single dates like 2015-01-02, and my function never received an actual frame. How can I break this down into 100 or so DataFrames to run my function on?
Also, I do not fully understand how to break down ['STATUSDATE'] by a specific number of days.
I would like to avoid iterating as much as possible, but I know I probably will have to somewhere.
Thank you
Let us assume you have a data frame like this:
date = pd.date_range(start='1/1/2018', end='31/12/2018', normalize=True)
x = np.random.randint(0, 1000, size=365)
df = pd.DataFrame(x, columns = ["X"])
df['Date'] = date
df.head()
Output:
X Date
0 328 2018-01-01
1 188 2018-01-02
2 709 2018-01-03
3 259 2018-01-04
4 131 2018-01-05
So this data frame has 365 rows, one for each day of the year.
Now, if you want to group this data into intervals of 20 days and assign each group to a dict, you can do the following:
df_dict = {}
for k, v in df.groupby(pd.Grouper(key="Date", freq='20D')):
    df_dict[k.strftime("%Y-%m-%d")] = pd.DataFrame(v)
print(df_dict)
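To connect this back to the question, a sketch of running the existing calculation over each segment. calcForDateRange is the question's own function and is assumed to accept a DataFrame, and data['STATUSDATE'] is assumed to already be a datetime column; change freq to '30D' or any other interval as needed:
# Run the question's calculation once per 15-day segment and collect the results.
results = {}
for start, frame in data.groupby(pd.Grouper(key='STATUSDATE', freq='15D')):
    if not frame.empty:
        results[start.strftime('%Y-%m-%d')] = calcForDateRange(frame)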
How about something like this? It creates a dictionary of non-empty dataframes keyed on the starting date of each period.
import datetime as dt
start = '12-31-2017'
interval_days = 30
start_date = pd.Timestamp(start)
end_date = pd.Timestamp(dt.date.today() + dt.timedelta(days=1))
dates = pd.date_range(start=start_date, end=end_date, freq=f'{interval_days}d')
sub_dfs = {d1.strftime('%Y%m%d'): df.loc[df.dates.ge(d1) & df.dates.lt(d2)]
           for d1, d2 in zip(dates, dates[1:])}
# Remove empty dataframes.
sub_dfs = {k: v for k, v in sub_dfs.items() if not v.empty}
I have the following issue:
I have a dataframe with 3 columns:
The first is UserID, the second is InvoiceType, and the third is CreateTime, the time of creation of the invoice.
df = pd.read_csv('invoice.csv')
Output: UserID InvoiceType CreateTime
1 a 2018-01-01 12:31:00
2 b 2018-01-01 12:34:12
3 a 2018-01-01 12:40:13
1 c 2018-01-09 14:12:25
2 a 2018-01-12 14:12:29
1 b 2018-02-08 11:15:00
2 c 2018-02-12 10:12:12
I am trying to plot the invoice cycle for each user. I need to create 2 new columns, time_diff and time_diff_wrt_first_invoice. time_diff will hold the time difference between consecutive invoices for each user, and time_diff_wrt_first_invoice will hold the time difference between each invoice and the first invoice, which is what I want for plotting purposes. This is my code:
"""
********** Exploding a variable that is a list in each dataframe cell
"""
def explode_list(df,x):
return (df[x].apply(pd.Series)
.stack()
.reset_index(level = 1, drop=True)
.to_frame(x))
"""
****** applying explode_list to all the columns ******
"""
def explode_listDF(df):
exploaded_df = pd.DataFrame()
for x in df.columns.tolist():
exploaded_df = pd.concat([exploaded_df, explode_list(df,x)],
axis = 1)
return exploaded_df
"""
******** Getting the time difference column in pivot table format
"""
def pivoted_diffTime(df1, _freq=60):
# _ freq is 1 for minutes frequency
# _freq is 60 for hour frequency
# _ freq is 60*24 for daily frequency
# _freq is 60*24*30 for monthly frequency
df = df.sort_values(['UserID', 'CreateTime'])
df_pivot = df.pivot_table(index = 'UserID',
aggfunc= lambda x : list(v for v in x)
)
df_pivot['time_diff'] = [[0]]*len(df_pivot)
for user in df_pivot.index:
try:
_list = [0]+[math.floor((x - y).total_seconds()/(60*_freq))
for x,y in zip(df_pivot.loc[user, 'CreateTime'][1:],
df_pivot.loc[user, 'CreateTime'][:-1])]
df_pivot.loc[user, 'time_diff'] = _list
except:
print('There is a prob here :', user)
return df_pivot
"""
***** Pipelining the two functions to obtain an exploaded dataframe
with time difference ******
"""
def get_timeDiff(df, _frequency):
df = explode_listDF(pivoted_diffTime(df, _freq=_frequency))
return df
And once I have time_diff, I create time_diff_wrt_first_invoice this way:
# We initialize this variable
df_with_timeDiff['time_diff_wrt_first_invoice'] = [[0]]*len(df_with_timeDiff)
# Then we loop over users and apply a cumulative sum over time_diff
for user in df_with_timeDiff.UserID.unique():
    df_with_timeDiff.loc[df_with_timeDiff.UserID==user, 'time_diff_wrt_first_invoice'] = np.cumsum(df_with_timeDiff.loc[df_with_timeDiff.UserID==user, 'time_diff'])
The problem is that I have a dataframe with hundreds of thousands of users, and this is very time consuming. I am wondering if there is a solution that better fits my needs.
Check out .loc[] for pandas.
df_1 = pd.DataFrame(some_stuff)
df_2 = df_1.loc[df_1['column'] >= some_condition, 'specific_column']
You can access specific columns and filter on conditions without writing a loop; if you add a comma after the condition followed by a column name, only that column is returned.
I'm not 100% sure this answers your question, because I didn't actually see one, but it seemed like you were running a lot of for loops to isolate columns, which is what .loc[] is for.
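To make that concrete, a tiny standalone example of boolean filtering with .loc (all names here are made up for illustration):
import pandas as pd

df_1 = pd.DataFrame({'ticker': ['A', 'B', 'C'], 'price': [10, 25, 40]})

# Rows where the condition holds; the second argument selects a single column.
df_2 = df_1.loc[df_1['price'] >= 20, 'ticker']
print(df_2)  # B and C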
I have found a better solution. Here's my code:
def next_diff(x):
    return [0] + [(b-a).total_seconds()/3600 for b, a in zip(x[1:], x[:-1])]

def create_timediff(df):
    df.sort_values(['UserID', 'CreateTime'], inplace=True)
    a = df.groupby('UserID').agg({'CreateTime': lambda x: list(v for v in x)}).CreateTime.apply(next_diff)
    b = a.apply(np.cumsum)
    a = a.reset_index()
    b = b.reset_index()
    # Here I explode the lists inside the cells
    rows1 = []
    _ = a.apply(lambda row: [rows1.append([row['UserID'], nn])
                             for nn in row.CreateTime], axis=1)
    rows2 = []
    __ = b.apply(lambda row: [rows2.append([row['UserID'], nn])
                              for nn in row.CreateTime], axis=1)
    df1_new = pd.DataFrame(rows1, columns=a.columns).set_index(['UserID'])
    df2_new = pd.DataFrame(rows2, columns=b.columns).set_index(['UserID'])
    df = df.set_index('UserID')
    df['time_diff'] = df1_new['CreateTime']
    df['time_diff_wrt_first_invoice'] = df2_new['CreateTime']
    df.reset_index(inplace=True)
    return df
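As an aside, the same two columns can also be computed without building lists at all, using groupby().diff() and a grouped cumulative sum. A minimal sketch, assuming CreateTime is already a datetime column and the result is wanted in hours:
import pandas as pd

def create_timediff_vectorized(df):
    # Hours since the previous invoice for the same user (0 for the first one).
    df = df.sort_values(['UserID', 'CreateTime'])
    gaps = (df.groupby('UserID')['CreateTime'].diff()
              .dt.total_seconds().div(3600).fillna(0))
    df['time_diff'] = gaps
    # Cumulative sum of the gaps = hours since each user's first invoice.
    df['time_diff_wrt_first_invoice'] = gaps.groupby(df['UserID']).cumsum()
    return df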