Iterating pandas dataframe and changing values - python

I'm looking to predict the number of customers in a restaurant at a certain time. My data preprocessing is almost finished - I have acquired the arrival time of each customer, stored in acchour. weekday is the day of the week, 0 being Monday and 6 Sunday. Now I'd like to calculate the number of customers in the restaurant at each of those times. I figured I have to loop through the dataframe in reverse and keep adding arriving customers to the customer count at a given time, while simultaneously keeping track of when previous customers leave. As there is no data on this, we will simply assume every guest stays for an hour.
My sketch looks like this:
exp = []  # keep track of the expiring customers
for row in reversed(df['customers']):  # start from the earliest time
    if row != 1:  # skip the 1st row
        res = len(exp) + 1  # amount of customers
        for i in range(len(exp) - 1, -1, -1):  # loop exp sensibly while deleting
            if df['acchour'] > exp[i] + 1:  # if the current time is more than an hour past the customer's arrival time
                res -= 1
                del exp[i]
        exp.append(df['acchour'])
        row = res
However, I can see that df['acchour'] is not a sensible expression there, and I was wondering how to properly reference a different column on the same row. Altogether, if someone can come up with a more convenient way to solve the problem, I'd hugely appreciate it!
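For illustration, a minimal, hedged rewrite of that sketch using itertuples(), which makes the acchour value on the same row directly accessible; it assumes the frame is ordered latest-first (hence the reverse iteration), ignores the weekday boundaries, and keeps the one-hour stay assumption:

# Hedged sketch only: assumes an 'acchour' column with arrival times in hours
# and that every guest stays exactly one hour.
import pandas as pd

exp = []      # arrival times of guests assumed to still be inside
counts = {}   # original row index -> customers present at that arrival
for row in df[::-1].itertuples():            # iterate from the earliest arrival
    now = row.acchour
    exp = [t for t in exp if now <= t + 1]   # drop guests whose hour is up
    exp.append(now)
    counts[row.Index] = len(exp)
df['customers'] = pd.Series(counts)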

So you can get the total customers visiting at a specific time like so:
df.groupby(['weekday','time', 'hour'], as_index=False).sum()
Then maybe you can calculate the difference between each time window you want?
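A hedged sketch of that idea using only the columns named in the question (weekday and acchour): count the arrivals per time slot, then take the difference between consecutive slots.

# Hedged sketch: arrivals per (weekday, acchour) slot and the change between
# consecutive slots; the column names come from the question and may differ.
arrivals = (df.groupby(['weekday', 'acchour'])
              .size()
              .reset_index(name='arrivals'))
arrivals['change'] = arrivals.groupby('weekday')['arrivals'].diff()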

Related

Is there a way in pandas (python) to calculate the days of inventory grouped by material number

With the given data frame, the calculated measure would be the DOI (i.e. how many days into the future the inventory will last, based on the demand). Note: the DOI figures need to be calculated programmatically and grouped by material.
Calculation of DOI: let us take the first row, belonging to material A1. The dates are on a weekly basis.
Inventory = 1000
Days into the future until the inventory runs out: 300 + 400 + part of 500. This means the DOI is 7 + 7 + (1000 - 300 - 400)/500 = 14.6, the two full weeks being 19.01.2023 - 26.01.2023 and 02.02.2023 - 09.02.2023.
An important point to note is that the demand figure of the concerned row is NOT taken into account while calculating DOI.
I have tried to calculate the cumulative demand without taking the first row for each material (here A1 and B1).
inv_network['cum_demand'] = 0
for i in range(inv_network.shape[0] - 1):
    if inv_network.loc[i+1, 'Period'] > inv_network.loc[0, 'Period']:
        inv_network.loc[i+1, 'cum_demand'] = inv_network.loc[i, 'cum_demand'] + inv_network.loc[i+1, 'Primary Demand']
print(inv_network)
However, this piece of code takes a lot of time as the number of records increases.
As the next step, when I try to calculate the DOI, I run into issues getting the right value.
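For what it's worth, a hedged, vectorised sketch of that cumulative-demand step; it assumes the frame is sorted by period within each material and that the columns are named Material, Period and Primary Demand as in the snippet above (the real names may differ):

# Hedged sketch: cumulative demand per material, excluding each material's
# first row, without a Python-level loop.
inv_network = inv_network.sort_values(['Material', 'Period']).reset_index(drop=True)

grouped = inv_network.groupby('Material')['Primary Demand']
inv_network['cum_demand'] = grouped.cumsum() - grouped.transform('first')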

Loop Through Days of the Week in Pandas Dataframe

I have a Pandas DataFrame with a start column of dtype datetime64[ns, UTC], and the DataFrame is sorted in ascending order on the start column. From this DataFrame I used the following to create a new (updated) DataFrame with a column indicating the day of the week for start:
format_datetime_df['day_of_week'] = format_datetime_df['start'].dt.dayofweek
I want to pass the DataFrame into a function. The function needs to loop through the days of the week, so from 0 to 6, and keep a running total of the distance (kept in column 'distance') covered. If the distance covered is greater than 15, then a counter is incremented. It needs to do this for all rows of the DataFrame. The return of the function will be the total number of weeks over 15.
I am getting stuck on how to implement this as my 'day_of_week' column starts as follows
3
3
5
1
5
So, week 1 would be comprised of 3, 3, 5 and week 2 would be comprised of 1, 5, ...
I want to do something like
number_of_weeks_over_10km = format_datetime_df.groupby().apply(weeks_over_10km)
but am not really sure what should go in the groupby() function. I also feel like I am overcomplicating this.
It was complicated, but I figured it out. Here is the basic flow of what I did
# Create a helper index that allows iteration by week while also considering the year
# Function to return the total distance for each week
# Create a NumPy array to store the total distance for each week
# Append the total distance for each week to the array
# Count the number of times the total distance for each week was > x (in km)
The helper index that allowed for iteration by week while also considering the year came from another post here on Stack Overflow (Iterate over pd df with date column by week python). This had a consequence, though: I had to create and append to the NumPy array outside of the function in order to get everything to work.
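A rough, hedged sketch of that flow, assuming the start column from above, the 'distance' column in km and the 15 km threshold mentioned in the question:

# Hedged sketch: build a (year, week) helper key, collect the weekly distance
# totals into a NumPy array, then count the weeks over 15 km.
import numpy as np

iso = format_datetime_df['start'].dt.isocalendar()
format_datetime_df['year_week'] = list(zip(iso['year'], iso['week']))

weekly_totals = np.array([
    week_df['distance'].sum()
    for _, week_df in format_datetime_df.groupby('year_week')
])

number_of_weeks_over_15km = int((weekly_totals > 15).sum())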
I guess you can solve that using Pandas without functions. Just determine year and week using
df["isoweek"] = (df["start"].dt.isocalendar()["year"].astype(str)
+ " "
+ df["start"].dt.isocalendar()["week"].astype(str)
)
Then you determine the distance using a groupby and count the entries above 15:
weeks_above_15 = (df.groupby("isoweek")["distance"].sum() > 15).sum()

remove/isolate days when there is no change (in pandas)

I have annual hourly energy data for two AC systems in two hotel rooms. I want to figure out when the rooms were occupied or not by isolating/removing the days when the AC was not used for 24 hours.
I did df[df.Meter2334Diff > 0.1] for one room, which gives me all the hours when the AC was turned on; however, it also removes the hours of days when the room was most likely occupied but the AC was turned off. This is where my knowledge stops, so I'm asking the oracles of the internet for assistance.
my dataframe above
results after df[df.Meter2334Diff > 0.1]
If I've interpreted your question correctly, you want to extract all the days from the dataframe where the Meter2334Diff value was zero?
As your data currently has a frequency of one hour, we can resample it in pandas using the resample() function. To resample() we can pass the freq parameter, which tells pandas at what time interval to aggregate the data. There are lots of options (see the docs), but in your case we can set freq='D' to group by day.
Then we can calculate the daily sum of the Meter2334Diff column and keep the days whose sum == 0 (obviously, without knowledge of your dataset I don't know whether 0 is the right cutoff).
total_daily_meter_diff = df.resample('D')['Meter2334Diff'].sum()
days_less_than_cutoff = total_daily_meter_diff[total_daily_meter_diff == 0]
We can then use these days to filter in the original dataset:
df.loc[df.index.floor('D').isin(days_less_than_cutoff.index), :]
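And if you instead want to drop those idle days from the original data, keeping only the hours of the days that were likely occupied, the same mask can be negated:

# Keep only the hours belonging to days with some AC usage
df.loc[~df.index.floor('D').isin(days_less_than_cutoff.index), :]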

Python/Pandas: sort by date and compute two week (rolling?) average

So far I've read in two CSVs and merged them based on a common element. I take the output of the merged CSV and iterate through the unique element they've been merged on. While I have them separated, I want to generate a daily count line and a two-week rolling average from the current date going backward. I cannot index based on the 'Date Opened' field, but I still need my outputs organized by this with the most recent first. Once these are sorted by date, my daily count plotting issue will be rectified. My remaining task is to compute a two-week rolling average of the count. I've looked into the Pandas documentation and I think rolling_mean will work, but the parameters of this function don't really make sense to me. I've tried biwk_avg = pd.rolling_mean(open_dt, 28) but that doesn't seem to work. I know there is an easier way to do this, but I think I've hit a roadblock with the documentation available. The end result should look something like this graph. Right now my daily count graph isn't sorted (even though I think I've instructed it to) and is unusable in line form.
def data_sort():
    data_merge = data_extract()
    domains = data_merge.groupby('PWx Domain')
    for domain in domains.groups.items():
        dsort = data_merge.loc[domain[1]]
        print(dsort.head())
        open_dt = pd.to_datetime(dsort['Date Opened']).dt.date
        #open_dt.to_csv('output\''+str(domain)+'_out.csv', sep = ',')
        open_ct = open_dt.value_counts(sort=False)
        biwk_avg = pd.rolling_mean(open_ct, 28)
        plt.plot(open_ct, 'bo')
        plt.show()

data_sort()
Rolling mean alone is not enough in your case; you need a combination of resampling (to group the data by days) followed by a 14-day rolling mean (why do you use 28 in your code?). Something like this:
for _, domain in data_merge.groupby('PWx Domain'):
    # Convert the date to the index
    domain.index = pd.to_datetime(domain['Date Opened'])
    # Sort by dates
    domain.sort_index(inplace=True)
    # Do the averaging: daily resample, then a 14-day rolling mean
    rolling = domain.resample('1D').mean().rolling(14).mean()
    plt.plot(rolling, 'bo')
    plt.show()

Looping over three days

In my Postgres table I have users and the time they took a picture. I would like to define different tracks for the users. To start with, I say that everything within 3 days for the same user belongs to the same track, and then the next one starts. I would like to add a column to my table where I save the unique number of the track. I have figured out how to define the start and end day of every user, but I am not sure how to divide that into chunks of three days.
cur.execute("""SELECT users From table;""")
dbrows = cur.fetchall()
for k in set(dbrows):
cur.execute("""SELECT time From table where users ="""+k+""";""")
dbrows_time = cur.fetchall()
time_period = []
for i in dbrows_time:
new_date_time = datetime.fromtimestamp(int(i)).strftime('%Y-%m-%d %H:%M:%S')
new_date = new_date_time[0:10]
time_period.append(new_date)
period = len(set(time_period))
beginning = time_period[0]
end = time_period[-1]
beginning = datetime.strptime(beginning, "%Y-%m-%d")
end = datetime.strptime(end, "%Y-%m-%d")
delta=timedelta(days=3)
Here should come some loop along the lines of "for every three days create a new unique number". Or maybe it should be done in SQL when I insert into the table?
INSERT INTO table_tracks the number of each track
conn.commit()
I can insert some number every three entries like this, but I am not sure how to do the three days thing.
cur.execute("INSERT INTO table_track (id, users, link, tags, time, track, geom) select id, users, link, tags, time, ((row_number() over (ORDER BY users DESC) - 1) / 3) + 1, geom from table;""")
The data is basically in the form:
user  time        geometry
1     01.03.2015  geometry
1     02.03.2015  geometry
......
2     01.03.2015  geometry
......
Every user has a unique id, and I convert the time to the datetime format. The desired result is:
user  time        geometry  track_number
1     01.03.2015  geometry  1
1     02.03.2015  geometry  1
......
2     01.03.2015  geometry  15
......
The track_number is unique per user and per three days. Let's say the user takes a picture on the 1st, 2nd and 3rd of March and then on the 4th, 5th and 6th. Those would be two different tracks (which may not be ideal in the real world). But it is especially interesting if the pictures are on the 1st and 2nd of March and the 4th, 5th and 8th. Then I would say these should be three tracks: the 1st and 2nd; the 4th and 5th; and the 8th. So I mean a timedelta of three days. If there is only one picture within three days, then that by itself is one track. Something like that. Thank you!!
Any ideas or directions are highly appreciated!
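A hedged sketch of one possible direction, done in pandas after pulling the rows out of Postgres; it assumes columns named users and time (already converted to datetime), and the window/track column names are only placeholders:

# Hedged sketch: number three-day windows per user, counted from each user's
# first picture, then turn (user, window) pairs into one running track number.
import pandas as pd

df = df.sort_values(['users', 'time'])

first_time = df.groupby('users')['time'].transform('min')
df['window'] = (df['time'] - first_time).dt.days // 3

df['track'] = df.groupby(['users', 'window']).ngroup() + 1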
