Complete pandas dataframe with zero values for large datasets - python

I have a dataframe that looks like this:
>> df
index week day hour count
5 10 2 10 70
5 10 3 11 80
7 10 2 18 15
7 10 2 19 12
where week is the week of the year, day is day of the week (0-6), and hour is hour of the day (0-23). However, since I plan to convert this to a 3D array (week x day x hour) later, I have to include hours where there are no items in the count column. Example:
>> target_df
index week day hour count
5 10 0 0 0
5 10 0 1 0
...
5 10 2 10 70
5 10 2 11 0
...
7 10 0 0 0
...
...
and so on. What I do is generate a dummy dataframe containing all possible index-week-day-hour combinations (basically target_df without the count column):
>> dummy_df
index week day hour
5 10 0 0
5 10 0 1
...
5 10 2 10
5 10 2 11
...
7 10 0 0
...
...
and then using
target_df = pd.merge(df, dummy_df, on=['index','week','day','hour'], how='outer').fillna(0)
This works fine for small datasets, but I'm working with a lot of rows. With the case I'm working on now, I get 82M rows for dummy_df and target_df, and it's painfully slow.
EDIT: The slowest part is actually constructing dummy_df! I can generate the individual lists quickly, but combining them into a pandas DataFrame is what takes the longest.
num_weeks = len(week_list)
num_idxs = len(df['index'].unique())

print('creating dummies')
_dummy_idxs = list(itertools.chain.from_iterable(
    itertools.repeat(x, 24*7*num_weeks) for x in df['index'].unique()))
print('\t_dummy_idxs')
_dummy_weeks = list(itertools.chain.from_iterable(
    itertools.repeat(x, 24*7) for x in week_list)) * num_idxs
print('\t_dummy_weeks')
_dummy_days = list(itertools.chain.from_iterable(
    itertools.repeat(x, 24) for x in range(0, 7))) * num_weeks * num_idxs
print('\t_dummy_days')
_dummy_hours = list(range(0, 24)) * 7 * num_weeks * num_idxs
print('\t_dummy_hours')

print('Creating dummy_hour_df with {0} rows...'.format(len(_dummy_hours)))
# the part below takes the longest time
dummy_hour_df = pd.DataFrame({'index': _dummy_idxs, 'week': _dummy_weeks,
                              'day': _dummy_days, 'hour': _dummy_hours})
print('dummy_hour_df completed')
Is there a faster way to do this?

As an alternative, you can use itertools.product to create dummy_df as the Cartesian product of the lists:
import itertools
index = range(100)
weeks = range(53)
days = range(7)
hours = range(24)
dummy_df = pd.DataFrame(list(itertools.product(index, weeks, days, hours)), columns=['index','week','day','hour'])
dummy_df.head()
index week day hour
0 0 0 0 0
1 0 0 0 1
2 0 0 0 2
3 0 0 0 3
4 0 0 0 4
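If merging the product-built dummy_df back into df is still slow, another sketch worth trying builds the full grid directly as a MultiIndex and reindexes the original frame onto it, skipping the per-column Python lists entirely. This assumes week_list and df are as defined in the question and that each index-week-day-hour combination appears at most once in df; it has not been benchmarked at the 82M-row scale:
import pandas as pd

# Build the full index-week-day-hour grid as a MultiIndex, then align the
# original frame onto it, filling the missing counts with 0.
full_grid = pd.MultiIndex.from_product(
    [df['index'].unique(), week_list, range(7), range(24)],
    names=['index', 'week', 'day', 'hour'])

target_df = (df.set_index(['index', 'week', 'day', 'hour'])
               .reindex(full_grid, fill_value=0)
               .reset_index())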

Related

cumsum based on taking first value from another column then creating a new calculation

I was hoping to get some help with a calculation I'm struggling a bit with. I'm working with some data (copied below) and I need to create a calculation that takes the first value > 0 from another column, computes a new series based on that value, and then aggregates the numbers into a cumulative sum. My raw data looks like this:
d = {'Final Account': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A'],
     'Date': ['Jun-21', 'Jul-21', 'Aug-21', 'Sep-21', 'Oct-21', 'Nov-21', 'Dec-21',
              'Jan-22', 'Feb-22', 'Mar-22', 'Apr-22', 'May-22', 'Jun-22'],
     'Units': [0, 0, 0, 0, 10, 0, 20, 0, 0, 7, 12, 35, 0]}
df = pd.DataFrame(data=d)
Account Date Units
A Jun-21 0
A Jul-21 0
A Aug-21 0
A Sep-21 0
A Oct-21 10
A Nov-21 0
A Dec-21 20
A Jan-22 0
A Feb-22 0
A Mar-22 7
A Apr-22 12
A May-22 35
A Jun-22 0
To this table I apply an initial conversion to my data:
df['Conv'] = df['Units'].apply(lambda x: x // 5)
This adds a new column to my table like this:
Account Date Units Conv
A Jun-21 0 0
A Jul-21 0 0
A Aug-21 0 0
A Sep-21 0 0
A Oct-21 10 2
A Nov-21 0 0
A Dec-21 20 4
A Jan-22 0 0
A Feb-22 0 0
A Mar-22 7 1
A Apr-22 12 2
A May-22 35 7
A Jun-22 0 0
The steps after this are where I begin to run into issues. I need to calculate a new field that takes the first value of the Conv field > 0, at the same index position, begins a new calculation based on the previous rows' cumulative sum, and then adds the result back into the cumulative sum. Outside of Python this is done by creating two columns. One calculates the new units as:
(Units - (previous row cumsum of existing units * 2))/5
The other, existing units, is just the cumulative sum of the values determined to be new units. The desired output should look something like this:
Account Date Units Conv New Units Existing Units (cumsum of new units)
A Jun-21 0 0 0 0
A Jul-21 0 0 0 0
A Aug-21 0 0 0 0
A Sep-21 0 0 0 0
A Oct-21 10 2 2 2
A Nov-21 0 0 0 2
A Dec-21 20 4 3 5
A Jan-22 0 0 0 5
A Feb-22 0 0 0 5
A Mar-22 7 1 0 5
A Apr-22 12 2 0 5
A May-22 35 7 5 10
A Jun-22 0 0 0 10
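To make the formula concrete, here is a quick worked check against the Dec-21 row (using integer division, consistent with the Conv column):
# Dec-21: Units = 20, previous cumulative sum of Existing Units = 2
new_units = (20 - 2 * 2) // 5    # = 3, matching 'New Units'
existing_units = 2 + new_units   # = 5, matching 'Existing Units'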
The main issue I'm struggling with is grabbing the first value > 0 from the "Conv" column and being able to create a new cumsum based on that initial value that can be applied to the "New Units" calculation. Any guidance is much appreciated; despite reading around a lot I've hit a bit of a brick wall! If you need me to explain better please do ask! :)
Much appreciated in advance!
I'm not sure that I completely understand what you are trying to achieve. Nevertheless, here's an attempt to reproduce your expected results. For your example frame this
groups = (df['Units'].eq(0) & df['Units'].shift().ne(0)).cumsum()
df['New Units'] = 0
last = 0
for _, group in df['Units'].groupby(groups):
    i, unit = group.index[-1], group.iloc[-1]
    if unit != 0:
        new_unit = (unit - last * 2) // 5
        last = df.at[i, 'New Units'] = new_unit
does result in
Final Account Date Units New Units
0 A Jun-21 0 0
1 A Jul-21 0 0
2 A Aug-21 0 0
3 A Sep-21 0 0
4 A Oct-21 10 2
5 A Nov-21 0 0
6 A Dec-21 20 3
7 A Jan-22 0 0
8 A Feb-22 0 0
9 A Mar-22 7 0
10 A Apr-22 12 0
11 A May-22 35 5
12 A Jun-22 0 0
The first step identifies the blocks in column Units whose last item is relevant for building the new units: Successive zeros, followed by non-zeros, until the first zero. This
groups = (df['Units'].eq(0) & df['Units'].shift().ne(0)).cumsum()
results in
0 1
1 1
2 1
3 1
4 1
5 2
6 2
7 3
8 3
9 3
10 3
11 3
12 4
Then group column Units along these blocks, grab the last item of each block if it is non-zero (zero can only happen in the last block), build the new unit (according to the given formula) and store it in the new column New Units.
(If you actually need the column Existing Units then just use .cumsum() on the column New Units.)
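For completeness, a minimal sketch of that last step, assuming the New Units column built above:
# running total of the per-row new units gives the existing units
df['Existing Units'] = df['New Units'].cumsum()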
If there are multiple accounts (indicated in the comments), then one way to apply the procedure to each account separately would be to pack it into a function (here new_units), .groupby() over the Final Account column, and .apply() the function to the groups:
def new_units(sdf):
    groups = (sdf['Units'].eq(0) & sdf['Units'].shift().ne(0)).cumsum()
    last = 0
    for _, group in sdf['Units'].groupby(groups):
        i, unit = group.index[-1], group.iloc[-1]
        if unit != 0:
            new_unit = (unit - last * 2) // 5
            last = sdf.at[i, 'New Units'] = new_unit
    return sdf

df['New Units'] = 0
df = df.groupby('Final Account').apply(new_units)
Try using a for loop that performs the sample calculation you provided:
# initialize the new columns to zero
df['new u'] = 0
df['ext u'] = 0
# set the first row's existing units
df.loc[0, 'ext u'] = df.loc[0, 'Units'] // 5
# loop through the data frame to perform the calculations
for i in range(1, len(df)):
    # calculate new units
    df.loc[i, 'new u'] = (df.loc[i, 'Units'] - 2 * df.loc[i - 1, 'ext u']) // 5
    # calculate existing units
    df.loc[i, 'ext u'] = df.loc[i - 1, 'ext u'] + df.loc[i, 'new u']
I'm not certain that those are the exact expressions you are looking for, but hopefully this gets you on your way to a solution. Worth noting that this does not take care of the whole "first value > 0" thing because (feel free to correct me but) it seems like before that you will just be adding up zeros, which won't affect anything. Hope this helps!

Check certain conditions looking back x hours (pandas)

I have some data like this:
import pandas as pd
dates = ["12/25/2021 07:47:01", "12/25/2021 08:02:32", "12/25/2021 13:57:40", "12/25/2021 14:17:11", "12/25/2021 17:23:01", "12/25/2021 23:48:55", "12/26/2021 08:22:32", "12/26/2021 11:11:11", "12/26/2021 14:53:40", "12/26/2021 16:07:07", "12/26/2021 23:56:07"]
is_manual = [0,0,0,0,1,1,0,0,0,0,1]
is_problem = [0,0,0,0,1,1,0,0,0,1,1]
df = pd.DataFrame({'dates': dates,
                   'manual_entry': is_manual,
                   'problem_entry': is_problem})
dates manual_entry problem_entry
0 12/25/2021 07:47:01 0 0
1 12/25/2021 08:02:32 0 0
2 12/25/2021 13:57:40 0 0
3 12/25/2021 14:17:11 0 0
4 12/25/2021 17:23:01 1 1
5 12/25/2021 23:48:55 1 1
6 12/26/2021 08:22:32 0 0
7 12/26/2021 11:11:11 0 0
8 12/26/2021 14:53:40 0 0
9 12/26/2021 16:07:07 0 1
10 12/26/2021 23:56:07 1 1
What I would like to do is take every row where problem_entry == 1 and check whether every row in the 24 hours prior to that row has manual_entry == 0.
While I know you can create a rolling lookback window over a certain number of rows, the rows here are not evenly spaced in time, so I'm wondering how to look back 24 hours and determine whether the criterion above is met.
Thanks in advance
EDIT: Expected output:
dates manual_entry problem_entry
4 12/25/2021 17:23:01 1 1
10 12/26/2021 23:56:07 1 1
Try the following. The rolling 24-hour sum of 'manual_entry' is collected into a separate variable. If that sum equals 1 for a row whose own 'manual_entry' is 1, then there were no other manual entries during the preceding day. The dataframe is then filtered to the rows where 'problem_entry' is 1 and the rolling sum equals 1.
df['dates'] = pd.to_datetime(df['dates'])
a = (df.rolling("86400s", on='dates', min_periods=1).sum()).loc[:, 'manual_entry']
print(df.loc[(df['problem_entry'] == 1) & (a == 1)])
Output:
dates manual_entry problem_entry
4 2021-12-25 17:23:01 1 1
10 2021-12-26 23:56:07 1 1

Extract a window of rows following a set of ones in a pandas dataframe

I have a pandas dataframe that looks like the following:
Day val
Day1 0
Day2 0
Day3 0
Day4 0
Day5 1
Day6 1
Day7 1
Day8 1
Day9 0
Day10 0
Day11 0
Day12 1
Day13 1
Day14 1
Day15 1
Day16 0
Day17 0
Day18 0
Day19 0
Day20 0
Day21 1
Day22 0
Day23 1
Day24 1
Day25 1
I am looking to extract at most 2 rows where val = 0, but only those where the preceding rows are a set of 1's.
For example:
There is a set of ones from Day5 to Day8 (an event). I would need to look at the at most two rows after the end of the event. Here those are Day9 and Day10.
Similarly, Day21 is a single-day event, and I need to look only at Day22, since it is the single zero that follows the event.
For the table data above, the output would be the following:
Day val
day9 0
Day10 0
Day16 0
Day17 0
Day22 0
We can simplify the condition for each row to:
The val value should be 0
The previous day or the day before that should have a val of 1
In code:
cond = (df['val'].shift(1) == 1) | (df['val'].shift(2) == 1)
df.loc[(df['val'] == 0) & cond]
Result:
Day val
8 Day9 0
9 Day10 0
15 Day16 0
16 Day17 0
21 Day22 0
Note: If more than 2 days should be considered this can easily be added to the condition cond. In this case, cond can be constructed with a list comprehension and np.any(), for example:
n = 2
cond = np.any([df['val'].shift(s) == 1 for s in range(1, n+1)], axis=0)
df.loc[(df['val'] == 0) & cond]
You can compute a mask on the rolling max per group where the groups start for each 1->0 transition and combine it with a second mask where the values are 0:
N = 2
o2z = df['val'].diff().eq(-1)
m1 = o2z.groupby(o2z.cumsum()).rolling(N, min_periods=1).max().astype(bool).values
m2 = df['val'].eq(0)
df[m1&m2]
Output:
Day val
8 Day9 0
9 Day10 0
15 Day16 0
16 Day17 0
21 Day22 0

How do I create a while loop for this df that has moving average in every stage? [duplicate]

This question already has an answer here:
For loop that adds and deducts from pandas columns
(1 answer)
Closed 1 year ago.
So I want to distribute the shipments per ID in the group one by one, using the average (BAL/SALES) to determine which store to give each unit to.
Here's my dataframe:
ID STOREID BAL SALES SHIP
1 STR1 50 5 18
1 STR2 6 7 18
1 STR3 74 4 18
2 STR1 35 3 500
2 STR2 5 4 500
2 STR3 54 7 500
While SHIP (grouped by ID) is greater than 0, calculate AVG (BAL/SALES), and for the row with the lowest AVG in each group add +1 to its BAL column and +1 to its Final column. Then repeat the process until SHIP is 0. The AVG is different at every stage, which is why I wanted it to be a while loop.
A sample output of the first round is below. Keep doing this until SHIP is 0 and the sum of Final per ID equals SHIP:
ID STOREID BAL SALES SHIP AVG Final
1 STR1 50 5 18 10 0
1 STR2 6 4 18 1.5 1
1 STR3 8 4 18 2 0
2 STR1 35 3 500 11.67 0
2 STR2 5 4 500 1.25 1
2 STR3 54 7 500 7.71 0
I've tried a couple of ways in SQL; I thought it would be better to do it in Python, but I haven't been doing a great job with my loop. Here's what I've tried so far:
df['AVG'] = 0
df['FINAL'] = 0
for i in df.groupby(["ID"])['SHIP']:
    if i > 0:
        df['AVG'] = df['BAL'] / df['SALES']
        df['SHIP'] = df.groupby(["ID"])['SHIP'] - 1
        total = df.groupby(["ID"])["FINAL"].transform("cumsum")
        df['FINAL'] = + 1
        df['A'] = + 1
    else:
        df['FINAL'] = 0
This was challenging because more than one row in the group can have the same average, which then throws off the allocation.
This works on the example dataframe, if I understood you correctly.
d = {'ID': [1, 1, 1, 2, 2, 2],
     'STOREID': ['str1', 'str2', 'str3', 'str1', 'str2', 'str3'],
     'BAL': [50, 6, 74, 35, 5, 54],
     'SALES': [5, 7, 4, 3, 4, 7],
     'SHIP': [18, 18, 18, 500, 500, 500]}
df = pd.DataFrame(data=d)
df['AVG'] = 0
df['FINAL'] = 0

def calc_something(x):
    # allocate one unit per iteration, at most SHIP times (capped at 500)
    for i in range(x.iloc[0]['SHIP'])[0:500]:
        x['AVG'] = x['BAL'] / x['SALES']
        x['SHIP'] = x['SHIP'] - 1
        x = x.sort_values('AVG').reset_index(drop=True)
        # row 0 now has the lowest AVG; column 2 is BAL, column 6 is FINAL
        x.iloc[0, 2] = x['BAL'][0] + 1
        x.iloc[0, 6] = x['FINAL'][0] + 1
    return x

df_final = df.groupby('ID').apply(calc_something).reset_index(drop=True).sort_values(['ID', 'STOREID'])
df_final
df_final
ID STOREID BAL SALES SHIP AVG FINAL
1 1 STR1 50 5 0 10.000 0
0 1 STR2 24 7 0 3.286 18
2 1 STR3 74 4 0 18.500 0
4 2 STR1 127 3 0 42.333 92
5 2 STR2 170 4 0 42.500 165
3 2 STR3 297 7 0 42.286 243

Sample from dataframe with conditions

I have a large dataset and I want to sample from it, but with a condition. What I need is a new dataframe with almost the same count of values for each class of a boolean target column of 0 and 1.
What I have:
df['target'].value_counts()
0 = 4000
1 = 120000
What I need:
new_df['target'].value_counts()
0 = 4000
1 = 6000
I know I can use df.sample, but I don't know how to apply the condition.
Thanks
Since pandas 1.1.0, you can use groupby.sample if you need the same number of rows for each group:
df.groupby('target').sample(4000)
Demo:
df = pd.DataFrame({'x': [0] * 10 + [1] * 25})
df.groupby('x').sample(5)
x
8 0
6 0
7 0
2 0
9 0
18 1
33 1
24 1
32 1
15 1
If you need to sample conditionally based on the group value, you can do:
df.groupby('target', group_keys=False).apply(
    lambda g: g.sample(4000 if g.name == 0 else 6000)
)
Demo:
df.groupby('x', group_keys=False).apply(
    lambda g: g.sample(4 if g.name == 0 else 6)
)
x
7 0
8 0
2 0
1 0
18 1
12 1
17 1
22 1
30 1
28 1
Assuming the following input and using the values 4/6 instead of 4000/6000:
df = pd.DataFrame({'target': [0,1,1,1,0,1,1,1,0,1,1,1,0,1,1,1]})
You could groupby your target and sample to take at most N values per group:
df.groupby('target', group_keys=False).apply(lambda g: g.sample(min(len(g), 6)))
example output:
target
4 0
0 0
8 0
12 0
10 1
14 1
1 1
7 1
11 1
13 1
If you want the same size you can simply use df.groupby('target').sample(n=4)
