Working on a problem, I have the following dataframe in Python:
week hour week_hr store_code baskets
0 201616 106 201616106 505 0
1 201616 107 201616107 505 0
2 201616 108 201616108 505 0
3 201616 109 201616109 505 18
4 201616 110 201616110 505 0
5 201616 106 201616108 910 0
6 201616 107 201616106 910 0
7 201616 108 201616107 910 2
8 201616 109 201616108 910 3
9 201616 110 201616109 910 10
Here "hour" variable is a concat of "weekday" and "hour of shop", example weekday is monday=1 and hour of shop is 6am then hour variable = 106, similarly cal_hr is a concat of week and hour. I want to get those rows where i see a trend of no baskets , i.e 0 baskets for rolling 3 weeks. in the above case i will only get the first 3 rows. i.e. for store 505 there is a continuous cycle of 1 baskets from 106 to 108. But i do not want the rows (4,5,6) because even though there are 0 baskets for 3 continuous hours but the hours are actually NOT continuous. 110 -> 106 -> 107 . For the hours to be continuous they should lie in the range of 106 - 110.. Essentially i want all stores and the respective rows if it has 0 baskets for continuous 3 hours on any given day. Dummy output
week hour week_hr store_code baskets
0 201616 106 201616106 505 0
1 201616 107 201616107 505 0
2 201616 108 201616108 505 0
Can I do this in Python using pandas and loops? The dataset requires sorting by store and hour. I'm completely new to Python.
Do the following:
Sort by store_code, week_hr
Filter by 0
Store the difference df['week_hr'][1:].values - df['week_hr'][:-1].values so you know which consecutive rows are continuous.
Now assign a group id to each continuous run and filter as you want.
import numpy as np
import pandas as pd

# 1 - sort by store and hour
t1 = df.sort_values(['store_code', 'week_hr'])
# 2 - keep only the zero-basket rows
t2 = t1[t1['baskets'] == 0].copy()
# 3 - a run continues while consecutive week_hr values differ by exactly 1
continuous = t2['week_hr'][1:].values - t2['week_hr'][:-1].values == 1
groups = np.cumsum(np.hstack([False, continuous == False]))
t2['groups'] = groups
# 4 - keep only the groups with 3 or more consecutive zero-basket hours
t3 = t2.groupby(['store_code', 'groups'], as_index=False)['week_hr'].count()
t4 = t3[t3.week_hr > 2]
print(pd.merge(t2, t4[['store_code', 'groups']]))
There's no need for looping!
You can solve it with the following steps:
Sort by store_code, week_hr
Filter by 0
Group by store_code
Find continuous
Code:
t1 = df.sort_values(['store_code', 'week_hr'])
t2 = t1[t1['baskets'] == 0]
grouped = t2.groupby('store_code')['week_hr'].apply(lambda x: x.tolist())
for store_code, week_hrs in grouped.items():  # .iteritems() on older pandas
    print(store_code, week_hrs)
    # find the runs of 3 or more consecutive hours here (see the sketch below)
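One way to finish that last step without an explicit per-row check is sketched below; it continues from t2 above and assumes, as in the sample data, that consecutive shop hours within a day differ by exactly 1 in week_hr:
# Start a new run whenever the store changes or week_hr does not increase by 1
new_run = (t2['store_code'] != t2['store_code'].shift()) | \
          (t2['week_hr'] - t2['week_hr'].shift() != 1)
t2 = t2.assign(run_id=new_run.cumsum())
# Keep only rows that belong to runs of 3 or more consecutive zero-basket hours
run_sizes = t2.groupby('run_id')['week_hr'].transform('size')
result = t2[run_sizes >= 3].drop(columns='run_id')
print(result)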
Related
When I group by 'time_interval_code', the 'vehicles_real' values in the result file are correct only for the first group, but not for the others. Once a 'time_interval_code' has been used in a previous group, it seems to be left out of the sum of the new group. How do I make sure the 'vehicles_real' values are available to sum in every group?
The idea of 'time_interval_code' was to get rid of the time format. I have 8 time intervals in the morning (07:00 - 07:15 is 1, 07:15 - 07:30 is 2, and so on up to 8).
I want to find the maximum flow rate in an hour, sliding forward by 15 minutes each time, for every junction and every direction from which cars enter the junction. The measurements are given in 15-minute intervals, so I need to sum 4 intervals each time. The result should contain 'junction_id', 'source_direction' and the sum of 'vehicles_real' for that junction, direction and group of 'time_interval_code'.
To solve this I created groups that contain 4 time intervals. The problem is that when I group by 'time_interval_code', the 'vehicles_real' values in the result file are correct only for the first group (1,2,3,4), but not for the others.
import pandas as pd

data = pd.read_excel("traffic.xlsx")
# Create a DataFrame from the list of data
df = pd.DataFrame(data)

# Define a function to get the morning groups for each time interval code
def get_morning_group(time_interval_code):
    morning_groups = [(1, 2, 3, 4), (2, 3, 4, 5), (3, 4, 5, 6), (4, 5, 6, 7), (5, 6, 7, 8)]
    for group in morning_groups:
        if time_interval_code in group:
            return group

# Add a new column to the DataFrame that contains the morning groups for each time interval code
df['morning_groups'] = df['time_interval_code'].apply(get_morning_group)
# Group data by values
grouped_data = df.groupby(['junction_id', 'source_direction', 'morning_groups'])
# Calculate the sum of the vehicles_real values for each group
grouped_data = grouped_data['vehicles_real'].sum()
# Convert the grouped data back into a DataFrame
df = grouped_data.reset_index()
# Create the pivot table
pivot_table = df.pivot_table(index=['junction_id', 'source_direction'], columns=['morning_groups'], values='vehicles_real')
# Save the pivot table to a new Excel file
pivot_table.to_excel('max_flow_rate.xlsx')
The traffic.xlsx file has ca. 140k records. Every junction has at least 2 'source_direction' values, and each junction/'source_direction' combination has 'vehicles_real' values for every 'time_interval_code'. The file looks like this:
id   time_interval_code   junction_id   source_direction   vehicles_real
1    3                    1001          N                  140
2    1                    2002          E                  10
18   2                    2011          W                  41
21   5                    2030          S                  2
33   8                    2030          N                  140
35   7                    2150          E                  10
41   6                    2150          W                  41
52   5                    2150          S                  2
The output has the right shape, but the values are correct only for (1,2,3,4).
junction_id   source_direction   (1,2,3,4)   (2,3,4,5)   (3,4,5,6)   (4,5,6,7)   (5,6,7,8)
1001          N                  257         95          69          61          59
1001          S                  456         120         136         153         111
1002          N                  2597        676         670         619         645
1002          S                  2571        552         641         656         595
1003          N                  586         181         148         127         142
1003          S                  711         174         147         157         141
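For reference, a likely explanation: get_morning_group returns only the first tuple that contains a given code, so each row is assigned to exactly one window, and every window after (1, 2, 3, 4) only receives the intervals that no earlier window claimed. One way around this is to sum vehicles per interval first and then add up each overlapping window of four interval columns. A minimal sketch, assuming the column names shown above:
import pandas as pd

df = pd.read_excel("traffic.xlsx")

# Total vehicles per junction, direction and 15-minute interval; missing
# intervals become 0 so every window has four addends.
per_interval = (df.groupby(['junction_id', 'source_direction', 'time_interval_code'])
                  ['vehicles_real'].sum()
                  .unstack('time_interval_code')
                  .reindex(columns=range(1, 9), fill_value=0)
                  .fillna(0))

# One column per overlapping one-hour window: (1,2,3,4), (2,3,4,5), ..., (5,6,7,8).
hourly = pd.DataFrame(index=per_interval.index)
for start in range(1, 6):
    window = list(range(start, start + 4))
    hourly[str(tuple(window))] = per_interval[window].sum(axis=1)

hourly.to_excel('max_flow_rate.xlsx')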
I am trying to work out the time delta between values in a grouped pandas df.
My df looks like this:
Location ID Item Qty Time
0 7 202545942 100130 1 07:19:46
1 8 202545943 100130 1 07:20:08
2 11 202545950 100130 1 07:20:31
3 13 202545955 100130 1 07:21:08
4 15 202545958 100130 1 07:21:18
5 18 202545963 100130 3 07:21:53
6 217 202546320 100130 1 07:22:43
7 219 202546324 100130 1 07:22:54
8 229 202546351 100130 1 07:23:32
9 246 202546376 100130 1 07:24:09
10 273 202546438 100130 1 07:24:37
11 286 202546464 100130 1 07:24:59
12 296 202546490 100130 1 07:25:16
13 297 202546491 100130 1 07:25:24
14 310 202546516 100130 1 07:25:59
15 321 202546538 100130 1 07:26:17
16 329 202546549 100130 1 07:28:09
17 388 202546669 100130 1 07:29:02
18 420 202546717 100130 2 07:30:01
19 451 202546766 100130 1 07:30:19
20 456 202546773 100130 1 07:30:27
(...)
42688 458 202546777 999969 1 06:51:16
42689 509 202546884 999969 1 06:53:09
42690 567 202546977 999969 1 06:54:21
42691 656 202547104 999969 1 06:57:27
I have grouped this using the following method:
ndf = df.groupby(['ID','Location','Time'])
If I add .size() to the end of the above and print(ndf) I get the following output:
(...)
ID Location Time
995812 696 07:10:36 1
730 07:11:41 1
761 07:12:30 1
771 07:20:49 1
995820 381 06:55:07 1
761 07:12:44 1
(...)
This is as desired.
My challenge is that I need to work out the time delta between each time per Item and add this as a column in the dataframe grouping. It should give me the following:
ID Location Time Delta
(...)
995812 696 07:10:36 0
730 07:11:41 00:01:05
761 07:12:30 00:00:49
771 07:20:49 00:08:19
995820 381 06:55:07 0
761 07:12:44 00:17:37
(...)
I am pulling my hair out trying to work out a method of doing this, so I'm turning to the greats.
Please help. Thanks in advance.
Convert the Time column to timedeltas with to_timedelta, sort by all 3 columns with DataFrame.sort_values, get the difference per group with DataFrameGroupBy.diff, and replace the resulting missing values with a 0 timedelta using Series.fillna:
# if Time already contains strings, the astype(str) can be omitted
df['Time'] = pd.to_timedelta(df['Time'].astype(str))
df = df.sort_values(['ID','Location','Time'])
df['Delta'] = df.groupby('ID')['Time'].diff().fillna(pd.Timedelta(0))
It is also possible to convert the timedeltas to seconds by adding Series.dt.total_seconds:
df['Delta_sec'] = df.groupby('ID')['Time'].diff().dt.total_seconds().fillna(0)
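As a quick self-contained check on a few of the rows from the question (just a sketch, with the values copied from the sample data):
import pandas as pd

sample = pd.DataFrame({
    'ID': [995812, 995812, 995812, 995812, 995820, 995820],
    'Location': [696, 730, 761, 771, 381, 761],
    'Time': ['07:10:36', '07:11:41', '07:12:30', '07:20:49', '06:55:07', '07:12:44'],
})
sample['Time'] = pd.to_timedelta(sample['Time'])
sample = sample.sort_values(['ID', 'Location', 'Time'])
sample['Delta'] = sample.groupby('ID')['Time'].diff().fillna(pd.Timedelta(0))
print(sample)
# Delta: 0, 00:01:05, 00:00:49, 00:08:19 for ID 995812 and 0, 00:17:37 for ID 995820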
If you just want to iterate over the groupby object (based on your original question title), you can do it like this:
for (x, y) in df.groupby(['ID','Location','Time']):
    print("{0}, {1}".format(x, y))
    # your logic
However, while this works for 10,000 or 100,000 rows, it does not perform well for 10^6 rows or more.
My data frame looks like this. My goal is to predict event_id 3 based on the data of event_id 1 and event_id 2.
ds tickets_sold y event_id
3/12/19 90 90 1
3/13/19 40 130 1
3/14/19 13 143 1
3/15/19 8 151 1
3/16/19 13 164 1
3/17/19 14 178 1
3/20/19 10 188 1
3/20/19 15 203 1
3/20/19 13 216 1
3/21/19 6 222 1
3/22/19 11 233 1
3/23/19 12 245 1
3/12/19 30 30 2
3/13/19 23 53 2
3/14/19 43 96 2
3/15/19 24 120 2
3/16/19 3 123 2
3/17/19 5 128 2
3/20/19 3 131 2
3/20/19 25 156 2
3/20/19 64 220 2
3/21/19 6 226 2
3/22/19 4 230 2
3/23/19 63 293 2
I want to predict sales for the next 10 days of that data:
ds tickets_sold y event_id
3/24/19 20 20 3
3/25/19 30 50 3
3/26/19 20 70 3
3/27/19 12 82 3
3/28/19 12 94 3
3/29/19 12 106 3
3/30/19 12 118 3
So far my model is the one below. However, I am not telling the model that these are two separate events, even though it would be useful to consider the data from different events together: they belong to the same organizer and therefore provide more information than a single event would. Is that kind of fitting possible with Prophet?
import pandas as pd
from prophet import Prophet  # on older installs: from fbprophet import Prophet

# Load data
df = pd.read_csv('event_data_prophet.csv')
df.drop(columns=['tickets_sold'], inplace=True)
df.head()
# The important things to note are that cap must be specified for every row in the dataframe,
# and that it does not have to be constant. If the market size is growing, then cap can be an increasing sequence.
df['cap'] = 500
# growth: String 'linear' or 'logistic' to specify a linear or logistic trend.
m = Prophet(growth='linear')
m.fit(df)
# periods is the amount of days that I look in the future
future = m.make_future_dataframe(periods=20)
future['cap'] = 500
future.tail()
forecast = m.predict(future)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
fig1 = m.plot(forecast)
Start dates of events seem to cause peaks. You can use holidays for this by setting the starting date of each event as a holiday; this informs Prophet about the events (and their peaks). I noticed that events 1 and 2 overlap. You have multiple options to deal with this, and you need to ask yourself what the predictive value of each event is in relation to event 3. You don't have much data, which will be the main issue. If the events have equal value, you could change the date of one event, for example to 11 days earlier. In the unequal-value scenario you could drop one of the events.
events = pd.DataFrame({
    'holiday': 'events',
    'ds': pd.to_datetime(['2019-03-24', '2019-03-12', '2019-03-01']),
    'lower_window': 0,
    'upper_window': 1,
})
m = Prophet(growth='linear', holidays=events)
m.fit(df)
Also, I noticed that you forecast on the cumulative sum. I think your events are stationary, so Prophet will probably benefit from forecasting the daily ticket sales rather than the cumulative sum.
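A minimal sketch of that last suggestion, fitting on the daily tickets_sold instead of the cumulative y (this assumes the CSV columns from the question and reuses the events frame defined above):
import pandas as pd
from prophet import Prophet  # on older installs: from fbprophet import Prophet

df = pd.read_csv('event_data_prophet.csv')
# Use the daily ticket sales as the target instead of the running total.
daily = df[['ds', 'tickets_sold']].rename(columns={'tickets_sold': 'y'})
# Note: events 1 and 2 share dates, so shifting one of them (as suggested above)
# keeps the ds values unique.

m = Prophet(growth='linear', holidays=events)
m.fit(daily)

future = m.make_future_dataframe(periods=10)
forecast = m.predict(future)
# A cumulative curve comparable to the original y can be rebuilt afterwards,
# e.g. forecast['yhat'].clip(lower=0).cumsum() over event 3's dates.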
I have a DataFrame (df) like the one shown below, where each column is sorted from largest to smallest for frequency analysis. That leaves some values as either zeros or NaN values, since each column has a different length.
08FB006 08FC001 08FC003 08FC005 08GD004
----------------------------------------------
0 253 872 256 11.80 2660
1 250 850 255 10.60 2510
2 246 850 241 10.30 2130
3 241 827 235 9.32 1970
4 241 821 229 9.17 1900
5 232 0 228 8.93 1840
6 231 0 225 8.05 1710
7 0 0 225 0 1610
8 0 0 224 0 1590
9 0 0 0 0 1590
10 0 0 0 0 1550
I need to perform the following calculation as if each column had a different length or number of records (i.e. ignoring the zero values). I have tried using NaN, but for some reason operations on NaN values are not possible.
Here is what I am trying to do with my df columns:
shape_list1 = []
location_list1 = []
scale_list1 = []
for column in df.columns:
    shape1, location1, scale1 = stats.genpareto.fit(df[column])
    shape_list1.append(shape1)
    location_list1.append(location1)
    scale_list1.append(scale1)
Assuming all values are positive (as it seems from your example and description), try:
stats.genpareto.fit(df[df[column] > 0][column])
This filters every column to operate just on the positive values.
Or, if negative values are allowed,
stats.genpareto.fit(df[df[column] != 0][column])
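For reference, a minimal sketch of the whole loop with this filter in place (assuming the df and list names from the question); if the zeros are really just padding, df[column].replace(0, np.nan).dropna() would give the same values:
from scipy import stats

shape_list1, location_list1, scale_list1 = [], [], []
for column in df.columns:
    values = df[column][df[column] > 0]  # keep only the positive values
    shape1, location1, scale1 = stats.genpareto.fit(values)
    shape_list1.append(shape1)
    location_list1.append(location1)
    scale_list1.append(scale1)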
The syntax is messy, but change
shape1, location1, scale1=stats.genpareto.fit(df[column])
to
shape1, location1, scale1=stats.genpareto.fit(df[column][df[column].nonzero()[0]])
Explanation: df[column].nonzero() returns a tuple of size (1,) whose only element, element [0], is a NumPy array holding the positions at which the column is nonzero, so you can index df[column] by them as df[column][df[column].nonzero()[0]]. Note that Series.nonzero() was deprecated and later removed in newer pandas versions, and the positional lookup only matches the labels when the index is the default RangeIndex, so the boolean-mask approach above is the safer choice there.
I have a pandas data frame that looks like this:
duration distance speed hincome fi_cost type
0 359 1601 4 3 40.00 cycling
1 625 3440 6 3 86.00 cycling
2 827 4096 5 3 102.00 cycling
3 1144 5704 5 2 143.00 cycling
If I use the following I export a new csv that pulls only those records with a distance less than 5000.
distance_1 = all_results[all_results.distance < 5000]
distance_1.to_csv('./distance_1.csv',",")
Now, I wish to export a csv with values from 5001 to 10000. I can't seem to get the syntax right...
distance_2 = all_results[10000 > all_results.distance < 5001]
distance_2.to_csv('./distance_2.csv',",")
Unfortunately, because of how Python chained comparisons work, we can't use the 50 < x < 100 syntax when x is some vector-like quantity. You have several options.
You could create two boolean Series and use & to combine them:
>>> all_results[(all_results.distance > 3000) & (all_results.distance < 5000)]
duration distance speed hincome fi_cost type
1 625 3440 6 3 86 cycling
2 827 4096 5 3 102 cycling
Use between to create a boolean Series and then use that to index (note that it is inclusive of both endpoints by default):
>>> all_results[all_results.distance.between(3000, 5000)] # inclusive by default
duration distance speed hincome fi_cost type
1 625 3440 6 3 86 cycling
2 827 4096 5 3 102 cycling
Or finally you could use .query:
>>> all_results.query("3000 < distance < 5000")
duration distance speed hincome fi_cost type
1 625 3440 6 3 86 cycling
2 827 4096 5 3 102 cycling
(For completeness: the chained form 5001 < all_results.distance < 10000 raises "ValueError: The truth value of a Series is ambiguous", which is why one of the approaches above is needed.)
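For the exact ranges in the question, between (inclusive of both endpoints) covers 5001 through 10000 directly; a minimal sketch:
distance_2 = all_results[all_results.distance.between(5001, 10000)]
distance_2.to_csv('./distance_2.csv', sep=',')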