Reduce output time - count with groupby (python, pandas) - python

I have a loop that calculates the "total_count" of a group of elements from multiple periods. Is there a way to optimize the script to reduce its run time? The dataframe is 33 MB and each loop iteration takes over 300 ms. The actual script runs over 50k iterations, which takes over 2 days to complete.
import numpy as np
import pandas as pd
#sample df with similar output time
df = pd.DataFrame(np.random.randint(3, size=(400000,1)), columns=['type'])
df['class'] = np.random.randint(1, 7, df.shape[0])
df['country'] = np.random.randint(1, 12, df.shape[0])
df['period'] = np.random.randint(2010, 2018, df.shape[0])
df['season'] = np.random.randint(1, 4, df.shape[0])
%%time
#period
tr1_sta = 2011
tr1_end = 2016
h0 = 'type'
h1 = 'class'
h2 = 'country'
holder = [h0,h1,h2]
df = (df.set_index(holder).assign(tr1_tc = df[(df['period'].between(tr1_sta, tr1_end))].groupby(holder)['season'].count()).reset_index())
Kindly advise
Thank you

Your code took 75.1ms on my computer. The code below runs in 14.6ms:
df['tr1_tc'] = (df.assign(tr1_tc=df['period'].between(tr1_sta, tr1_end))
.groupby(holder)['tr1_tc'].transform('sum'))
print(df)
# Output
type class country period season tr1_tc
0 1 5 1 2016 2 1343
1 1 5 9 2014 1 1302
2 2 3 4 2013 2 1299
3 0 6 1 2014 1 1326
4 2 4 5 2012 3 1367
... ... ... ... ... ... ...
349995 2 1 3 2010 3 1332
349996 0 1 8 2015 1 1362
349997 1 3 1 2013 3 1283
349998 1 6 7 2015 3 1305
349999 0 6 9 2017 2 1250
[350000 rows x 6 columns]
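To check that the two approaches agree, here is a minimal sketch on a smaller random frame. The names `in_period`, `slow` and `fast` are made up for this illustration, and the data is not the asker's. One difference to be aware of: the join-based version yields NaN for a group with no rows inside the period, while the transform version yields 0.

```python
import numpy as np
import pandas as pd

# Smaller random frame with the same columns as the sample in the question
rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({
    'type': rng.integers(0, 3, n),
    'class': rng.integers(1, 7, n),
    'country': rng.integers(1, 12, n),
    'period': rng.integers(2010, 2018, n),
    'season': rng.integers(1, 4, n),
})
holder = ['type', 'class', 'country']
tr1_sta, tr1_end = 2011, 2016

# Boolean mask: True for rows whose period lies inside the window
in_period = df['period'].between(tr1_sta, tr1_end)

# Question's idea: count in-period rows per group, then join back on the keys
counts = df[in_period].groupby(holder)['season'].count().rename('tr1_tc')
slow = df.join(counts, on=holder)['tr1_tc']

# Answer's idea: sum the mask per group and broadcast the result to every row
fast = (df.assign(in_period=in_period)
          .groupby(holder)['in_period'].transform('sum'))

print((slow.fillna(0).astype(int) == fast.astype(int)).all())
```

Because `transform` returns a result aligned to the original rows, no `set_index`/`reset_index` round trip is needed, which is likely where most of the saving comes from.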

Related

Multidimensional array restructuring like in pandas.stack

Consider the following code to create a dummy dataset
import numpy as np
from scipy.stats import norm
import pandas as pd
np.random.seed(10)
n=3
space= norm(20, 5).rvs(n)
time= norm(10,2).rvs(n)
values = np.kron(space, time).reshape(n,n) + norm(1,1).rvs([n,n])
### Output
array([[267.39784458, 300.81493866, 229.19163206],
[236.1940266 , 266.49469945, 204.01294305],
[122.55912977, 140.00957047, 106.28339745]])
I can put these data in a pandas dataframe using
space_names = ['A','B','C']
time_names = [2000,2001,2002]
df = pd.DataFrame(values, index=space_names, columns=time_names)
df
### Output
2000 2001 2002
A 267.397845 300.814939 229.191632
B 236.194027 266.494699 204.012943
C 122.559130 140.009570 106.283397
This is considered a wide dataset, where each observation lies in a table with 2 variables that act as coordinates to identify it.
To make it a long, tidy dataset we can use the .stack method of the pandas dataframe
df.columns.name = 'time'
df.index.name = 'space'
df.stack().rename('value').reset_index()
### Output
space time value
0 A 2000 267.397845
1 A 2001 300.814939
2 A 2002 229.191632
3 B 2000 236.194027
4 B 2001 266.494699
5 B 2002 204.012943
6 C 2000 122.559130
7 C 2001 140.009570
8 C 2002 106.283397
My question is: how do I do exactly this thing but for a 3-dimensional dataset?
Let's imagine I have 2 observations for each space-time pair
s = 3
t = 4
r = 2
space_mus = norm(20, 5).rvs(s)
time_mus = norm(10,2).rvs(t)
values = np.kron(space_mus, time_mus)
values = values.repeat(r).reshape(s,t,r) + norm(0,1).rvs([s,t,r])
values
### Output
array([[[286.50322099, 288.51266345],
[176.64303485, 175.38175877],
[136.01675917, 134.44328617]],
[[187.07608546, 185.4068411 ],
[112.86398438, 111.983463 ],
[ 85.99035255, 86.67236986]],
[[267.66833894, 269.45295404],
[162.30044715, 162.50564386],
[124.6374401 , 126.2315447 ]]])
How can I obtain the same structure for the dataframe as above?
Ugly solution
Personally, I don't like this solution, and I think it could be done in a more elegant and Pythonic way, but it might still be useful for someone else, so I will post it.
labels = ['{}{}{}'.format(i,j,k) for i in range(s) for j in range(t) for k in range(r)] #space, time, repetition
def flatten3d(k):
    return [i for l in k for s in l for i in s]
value_series = pd.Series(flatten3d(values)).rename('y')
split_labels= [[i for i in l] for l in labels]
df = pd.DataFrame(split_labels, columns=['s','t','r'])
pd.concat([df, value_series], axis=1)
### Output
s t r y
0 0 0 0 266.2408815208753
1 0 0 1 266.13662442609433
2 0 1 0 299.53178992512954
3 0 1 1 300.13941632567605
4 0 2 0 229.39037800681405
5 0 2 1 227.22227496248507
6 0 3 0 281.76357915411995
7 0 3 1 280.9639352062619
8 1 0 0 235.8137644198259
9 1 0 1 234.23202459516452
10 1 1 0 265.19681013560034
11 1 1 1 266.5462102589883
12 1 2 0 200.730100791878
13 1 2 1 199.83217739700535
14 1 3 0 246.54018839875374
15 1 3 1 248.5496308586532
16 2 0 0 124.90916276929234
17 2 0 1 123.64788669199066
18 2 1 0 139.65391860786775
19 2 1 1 138.08044561039517
20 2 2 0 106.45276370157518
21 2 2 1 104.78351933651582
22 2 3 0 129.86043618610572
23 2 3 1 128.97991481257253
This does not use stack, but maybe it is acceptable for your problem:
import numpy as np
import pandas as pd
values = np.arange(18).reshape(3, 3, 2) # Your values here
space_names = ['A', 'B', 'C']
time_names = [2000, 2001] # one label per entry along the last axis
index = pd.MultiIndex.from_product([space_names, space_names, time_names], names=["space1", "space2", "time"])
df = pd.DataFrame({"value": values.ravel()}, index=index).reset_index()
# df:
# space1 space2 time value
# 0 A A 2000 0
# 1 A A 2001 1
# 2 A B 2000 2
# 3 A B 2001 3
# 4 A C 2000 4
# 5 A C 2001 5
# 6 B A 2000 6
# 7 B A 2001 7
# 8 B B 2000 8
# 9 B B 2001 9
# 10 B C 2000 10
# 11 B C 2001 11
# 12 C A 2000 12
# 13 C A 2001 13
# 14 C B 2000 14
# 15 C B 2001 15
# 16 C C 2000 16
# 17 C C 2001 17
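Applied to the (space, time, repetition) layout from the question, the same idea might look like the sketch below. The placeholder values, the label lists and the level name `rep` are assumptions made for illustration; the only requirement is that each label list has the same length as the corresponding array axis.

```python
import numpy as np
import pandas as pd

s, t, r = 3, 4, 2
# Placeholder data; substitute the real (s, t, r) array here
values = np.arange(s * t * r, dtype=float).reshape(s, t, r)

space_names = ['A', 'B', 'C']            # length s
time_names = [2000, 2001, 2002, 2003]    # length t
rep_names = range(r)                     # length r

# from_product enumerates labels in the same C order that ravel() uses
index = pd.MultiIndex.from_product(
    [space_names, time_names, rep_names], names=['space', 'time', 'rep'])
long_df = pd.Series(values.ravel(), index=index, name='value').reset_index()
print(long_df.head())
```

This relies on `ravel()` and `MultiIndex.from_product` both iterating with the last axis varying fastest, so row `i*t*r + j*r + k` of the result holds `values[i, j, k]`.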

Rewriting a column's cell values in a dataframe based on when the value changes, without using if statements

I have a column with faulty values. It is supposed to count cycles, but the device the data comes from resets the count after 50, so I was left with, for example, [1,1,1,1,2,2,2,3,3,3,3,...,50,50,50,1,1,1,2,2,2,2,3,3,3,...,50,50,.....,50]
My attempted solution, which I can't even get to work, is below (for simplicity I made the data reset after 10 cycles):
data = {'Cyc-Count':[1,1,2,2,2,3,4,5,6,7,7,7,8,9,10,1,1,1,2,3,3,3,3,
4,4,5,6,6,6,7,8,8,8,8,9,10]}
df = pd.DataFrame(data)
x=0
count=0
old_value=df.at[x,'Cyc-Count']
for x in range(x,len(df)-1):
    if df.at[x,'Cyc-Count']==df.at[x+1,'Cyc-Count']:
        old_value=df.at[x+1,'Cyc-Count']
        df.at[x+1,'Cyc-Count']=count
    else:
        old_value=df.at[x+1,'Cyc-Count']
        count+=1
        df.at[x+1,'Cyc-Count']=count
I need to fix this, preferably without using if statements at all.
The desired output for the example above should be:
data = {'Cyc-Count':[1,1,2,2,2,3,4,5,6,7,7,7,8,9,10,11,11,11,12,13,13,13,13,
14,14,15,16,16,16,17,18,18,18,18,19,20]}
Hint: a big issue with my method is that the last value is hard to change, since when comparing it with the value at index+1, that element doesn't even exist.
IIUC, you want to continue the count when the counter decreases.
You can use vectorized code:
s = df['Cyc-Count'].shift()
df['Cyc-Count2'] = (df['Cyc-Count']
+ s.where(s.gt(df['Cyc-Count']))
.fillna(0, downcast='infer')
.cumsum()
)
Or, to modify the column in place:
s = df['Cyc-Count'].shift()
df['Cyc-Count'] += (s.where(s.gt(df['Cyc-Count']))
.fillna(0, downcast='infer').cumsum()
)
output:
Cyc-Count Cyc-Count2
0 1 1
1 1 1
2 1 1
3 1 1
4 2 2
5 2 2
6 2 2
7 3 3
8 3 3
9 3 3
10 3 3
11 4 4
12 5 5
13 5 5
14 5 5
15 1 6
16 1 6
17 1 6
18 2 7
19 2 7
20 2 7
21 2 7
22 3 8
23 3 8
24 3 8
25 4 9
26 5 10
27 5 10
28 1 11
29 2 12
30 2 12
31 3 13
32 4 14
33 5 15
34 5 15
used input:
l = [1,1,1,1,2,2,2,3,3,3,3,4,5,5,5,1,1,1,2,2,2,2,3,3,3,4,5,5,1,2,2,3,4,5,5]
df = pd.DataFrame({'Cyc-Count': l})
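The shift-and-cumsum idea can also be written without the `downcast` argument of `fillna`, which newer pandas releases flag as deprecated as far as I know. The sketch below runs on the data from the question; `prev` and `carry` are names chosen for this illustration.

```python
import pandas as pd

data = {'Cyc-Count': [1,1,2,2,2,3,4,5,6,7,7,7,8,9,10,1,1,1,2,3,3,3,3,
                      4,4,5,6,6,6,7,8,8,8,8,9,10]}
df = pd.DataFrame(data)

# Previous row's counter; fill_value=0 keeps the integer dtype on the first row
prev = df['Cyc-Count'].shift(fill_value=0)

# Wherever the counter drops, the previous value is the offset to carry forward;
# cumsum accumulates those offsets across every reset
carry = prev.where(prev > df['Cyc-Count'], 0).cumsum()

df['Cyc-Count2'] = df['Cyc-Count'] + carry
print(df['Cyc-Count2'].tolist())
```

On this input the new column matches the desired output listed in the question. It assumes the counter only ever decreases at a reset.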
You can use df.loc to access a group of rows and columns by label(s) or a boolean array.
syntax: df.loc[df['column name'] condition, 'column name or the new one'] = 'value if condition is met'
for example:
import pandas as pd
numbers = {'set_of_numbers': [1,2,3,4,5,6,7,8,9,10,0,0]}
df = pd.DataFrame(numbers,columns=['set_of_numbers'])
print (df)
df.loc[df['set_of_numbers'] == 0, 'set_of_numbers'] = 999
df.loc[df['set_of_numbers'] == 5, 'set_of_numbers'] = 555
print (df)
Before: 'set_of_numbers': [1,2,3,4,5,6,7,8,9,10,0,0]
After: 'set_of_numbers': [1,2,3,4,555,6,7,8,9,10,999,999]

Pandas dataframe column wise calculation

I have below dataframe columns:
Index(['Location', 'Dec-2021_x', 'Jan-2022_x', 'Feb-2022_x', 'Mar-2022_x',
'Apr-2022_x', 'May-2022_x', 'Jun-2022_x', 'Jul-2022_x', 'Aug-2022_x',
'Sep-2022_x', 'Oct-2022_x', 'Nov-2022_x', 'Dec-2022_x', 'Jan-2023_x',
'Feb-2023_x', 'Mar-2023_x', 'Apr-2023_x', 'May-2023_x', 'Jun-2023_x',
'Jul-2023_x', 'Aug-2023_x', 'Sep-2023_x', 'Oct-2023_x', 'Nov-2023_x',
'Dec-2023_x', 'Jan-2024_x', 'Feb-2024_x', 'Mar-2024_x', 'Apr-2024_x',
'May-2024_x', 'Jun-2024_x', 'Jul-2024_x', 'Aug-2024_x', 'Sep-2024_x',
'Oct-2024_x', 'Nov-2024_x', 'Dec-2024_x',
'sum_val',
'Dec-2021_y', 'Jan-2022_y', 'Feb-2022_y',
'Mar-2022_y', 'Apr-2022_y', 'May-2022_y', 'Jun-2022_y', 'Jul-2022_y',
'Aug-2022_y', 'Sep-2022_y', 'Oct-2022_y', 'Nov-2022_y', 'Dec-2022_y',
'Jan-2023_y', 'Feb-2023_y', 'Mar-2023_y', 'Apr-2023_y', 'May-2023_y',
'Jun-2023_y', 'Jul-2023_y', 'Aug-2023_y', 'Sep-2023_y', 'Oct-2023_y',
'Nov-2023_y', 'Dec-2023_y', 'Jan-2024_y', 'Feb-2024_y', 'Mar-2024_y',
'Apr-2024_y', 'May-2024_y', 'Jun-2024_y', 'Jul-2024_y', 'Aug-2024_y',
'Sep-2024_y', 'Oct-2024_y', 'Nov-2024_y', 'Dec-2024_y'],
dtype='object')
Sample dataframe with reduced columns looks like this:
df:
Location Dec-2021_x Jan-2022_x sum_val Dec-2021_y Jan-2022_y
A 212 315 1000 12 13
B 312 612 1100 13 17
C 242 712 1010 15 15
D 215 382 1001 16 17
E 252 319 1110 17 18
I have to create a resultant dataframe which will be in the below format:
Index(['Location', 'Dec-2021', 'Jan-2022', 'Feb-2022', 'Mar-2022',
'Apr-2022', 'May-2022', 'Jun-2022', 'Jul-2022', 'Aug-2022',
'Sep-2022', 'Oct-2022', 'Nov-2022', 'Dec-2022', 'Jan-2023',
'Feb-2023', 'Mar-2023', 'Apr-2023', 'May-2023', 'Jun-2023',
'Jul-2023', 'Aug-2023', 'Sep-2023', 'Oct-2023', 'Nov-2023',
'Dec-2023', 'Jan-2024', 'Feb-2024', 'Mar-2024', 'Apr-2024',
'May-2024', 'Jun-2024', 'Jul-2024', 'Aug-2024', 'Sep-2024',
'Oct-2024', 'Nov-2024', 'Dec-2024'],
dtype='object')
The way we do this is using the formula:
'Dec-2021' = 'Dec-2021_x' * sum_val * 'Dec-2021_y' (these are all numeric columns)
and a similar way for all the months. There are 36 months to be precise. Is there any way to do this in a loop manner for each column in the month-year combination? There are around 65000+ rows here so do not want to overwhelm the system.
Use:
#sample data
np.random.seed(2022)
c = ['Location', 'Dec-2021_x', 'Jan-2022_x', 'Feb-2022_x', 'Mar-2022_x',
'Apr-2022_x','sum_val', 'Dec-2021_y', 'Jan-2022_y', 'Feb-2022_y',
'Mar-2022_y', 'Apr-2022_y']
df = (pd.DataFrame(np.random.randint(10, size=(5, len(c))), columns=c)
.assign(Location=list('abcde')))
print (df)
Location Dec-2021_x Jan-2022_x Feb-2022_x Mar-2022_x Apr-2022_x \
0 a 1 1 0 7 8
1 b 8 0 3 6 8
2 c 1 7 5 5 4
3 d 0 7 5 5 8
4 e 8 0 3 9 5
sum_val Dec-2021_y Jan-2022_y Feb-2022_y Mar-2022_y Apr-2022_y
0 2 8 0 5 9 1
1 0 1 2 0 5 7
2 8 2 3 1 0 4
3 2 4 0 9 4 9
4 2 1 7 2 1 7
#remove unnecessary columns
df1 = df.drop(['sum_val'], axis=1)
#move columns that should be kept in the output (here Location) to the index
df1 = df1.set_index('Location')
#split columns names by last _
df1.columns = df1.columns.str.rsplit('_', n=1, expand=True)
#select x and y DataFrames by second level and multiply
df2 = (df1.xs('x', axis=1, level=1).mul(df['sum_val'].to_numpy(), axis= 0) *
df1.xs('y', axis=1, level=1))
print (df2)
Dec-2021 Jan-2022 Feb-2022 Mar-2022 Apr-2022
Location
a 16 0 0 126 16
b 0 0 0 0 0
c 16 168 40 0 128
d 0 0 90 40 144
e 16 0 12 18 70
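Since the question asked for a loop over the month columns, here is a minimal sketch of that alternative using the reduced sample from the question. It assumes every month has both an `_x` and a `_y` column. The loop runs once per month (36 times on the full data), and each iteration is a vectorized column product, so the row count should not be a concern.

```python
import pandas as pd

df = pd.DataFrame({
    'Location': list('ABCDE'),
    'Dec-2021_x': [212, 312, 242, 215, 252],
    'Jan-2022_x': [315, 612, 712, 382, 319],
    'sum_val': [1000, 1100, 1010, 1001, 1110],
    'Dec-2021_y': [12, 13, 15, 16, 17],
    'Jan-2022_y': [13, 17, 15, 17, 18],
})

# Month labels are the '_x' column names with the suffix stripped
months = [c[:-2] for c in df.columns if c.endswith('_x')]

result = df[['Location']].copy()
for m in months:
    # 'Dec-2021' = 'Dec-2021_x' * sum_val * 'Dec-2021_y', and so on
    result[m] = df[f'{m}_x'] * df['sum_val'] * df[f'{m}_y']

print(result)
```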

Pandas sum for the rest of month

I have a dataframe that looks like this:
import pandas as pd
date = ['28-01-2017','29-01-2017','30-01-2017','31-01-2017','01-02-2017','02-02-2017','...']
sales = [1,2,3,4,1,2,'...']
days_left_in_m = [3,2,1,0,29,28,'...']
df_test = pd.DataFrame({'date': date,'days_left_in_m':days_left_in_m,'sales':sales})
df_test
I am trying to find sales for the rest of the month.
So, for 28th of Jan 2017 it will calculate sum of the next 3 days,
for 29th of Jan - sum of the next 2 days and so on...
The outcome should look like the "required" column below.
date days_left_in_m sales required
0 28-01-2017 3 1 10
1 29-01-2017 2 2 9
2 30-01-2017 1 3 7
3 31-01-2017 0 4 4
4 01-02-2017 29 1 3
5 02-02-2017 28 2 2
6 ... ... ... ...
My current solution is really ugly; it uses a non-Pythonic loop:
for i in range(lenght_of_t_series):
    days_left = data_in.loc[i].days_left_in_m
    if days_left == 0:
        sales_temp_list.append(0)
    else:
        if (i+days_left) <= lenght_of_t_series:
            sales_temp_list.append(sum(data_in.loc[(i+1):(i+days_left)].sales))
        else:
            sales_temp_list.append(np.nan)
I guess a much better way of doing this would be to use df['sales'].rolling(n).sum()
However, each row has a different window.
Please advise on the best way of doing this...
I think you need DataFrame.sort_values with GroupBy.cumsum.
If you do not want to take into account the current day you can
use groupby.shift (see commented code).
First you could convert date column to datetime in order to use Series.dt.month
df_test['date'] = pd.to_datetime(df_test['date'],format = '%d-%m-%Y')
Then we can use:
months = df_test['date'].dt.month
df_test['required'] = (df_test.sort_values('date',ascending = False)
.groupby(months)['sales'].cumsum()
#.groupby(months).shift(fill_value = 0)
)
print(df_test)
Output
date days_left_in_m sales required
0 2017-01-28 3 1 10
1 2017-01-29 2 2 9
2 2017-01-30 1 3 7
3 2017-01-31 0 4 4
4 2017-02-01 29 1 3
5 2017-02-02 28 2 2
If you don't want to convert the date column to datetime, use:
months = pd.to_datetime(df_test['date'],format = '%d-%m-%Y').dt.month
df_test['required'] = (df_test.sort_values('date',ascending = False)
.groupby(months)['sales'].cumsum()
#.groupby(months).shift(fill_value = 0)
)
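One caveat: grouping on `dt.month` alone would merge, say, January 2017 with January 2018 if the data spans several years. A minimal sketch that groups on the year-month period instead is shown below, using the six concrete rows from the question. The names `month_key` and `rest_excl_today` are made up for this illustration.

```python
import pandas as pd

df_test = pd.DataFrame({
    'date': ['28-01-2017', '29-01-2017', '30-01-2017',
             '31-01-2017', '01-02-2017', '02-02-2017'],
    'sales': [1, 2, 3, 4, 1, 2],
})
df_test['date'] = pd.to_datetime(df_test['date'], format='%d-%m-%Y')

# Year-month key, so the same month in different years stays separate
month_key = df_test['date'].dt.to_period('M')

# Reverse cumulative sum within each calendar month (current day included)
df_test['required'] = (df_test.sort_values('date', ascending=False)
                              .groupby(month_key)['sales'].cumsum())

# Variant that excludes the current day's own sales
df_test['rest_excl_today'] = df_test['required'] - df_test['sales']
print(df_test)
```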

How to duplicate entries in a dataframe

I have a dataframe of the form:
df2 = pd.DataFrame({'Date': np.array([2018,2017,2016,2015]),
'Rev': np.array([4000,5000,6000,7000]),
'Other': np.array([0,0,0,0]),
'High':np.array([75.11,70.93,48.63,43.59]),
'Low':np.array([60.42,45.74,34.15,33.12]),
'Mean':np.array([67.765,58.335,41.390,39.355]) #mean of high/low columns
})
This looks like:
I want to convert this dataframe to something that looks like:
Basically, each row is copied two more times. The high, low, and mean values are then placed in a single 'price' column, and a new 'category' column keeps track of whether each price came from high, low, or mean (0 meaning high, 1 meaning low, and 2 meaning mean).
This is a simple melt (wide to long) problem:
# convert df2 from wide to long, melting the High, Low and Mean cols
df3 = df2.melt(df2.columns.difference(['High', 'Low', 'Mean']).tolist(),
var_name='category',
value_name='price')
# remap "category" to integer
df3['category'] = pd.factorize(df3['category'])[0]
# sort and display
df3.sort_values('Date', ascending=False)
Date Other Rev category price
0 2018 0 4000 0 75.110
4 2018 0 4000 1 60.420
8 2018 0 4000 2 67.765
1 2017 0 5000 0 70.930
5 2017 0 5000 1 45.740
9 2017 0 5000 2 58.335
2 2016 0 6000 0 48.630
6 2016 0 6000 1 34.150
10 2016 0 6000 2 41.390
3 2015 0 7000 0 43.590
7 2015 0 7000 1 33.120
11 2015 0 7000 2 39.355
Instead of melt, you can use stack, which saves you the sort_values:
new_df = (df2.set_index(['Date','Rev', 'Other'])
.stack()
.to_frame(name='price')
.reset_index()
)
output:
Date Rev Other level_3 price
0 2018 4000 0 High 75.110
1 2018 4000 0 Low 60.420
2 2018 4000 0 Mean 67.765
3 2017 5000 0 High 70.930
4 2017 5000 0 Low 45.740
5 2017 5000 0 Mean 58.335
6 2016 6000 0 High 48.630
7 2016 6000 0 Low 34.150
8 2016 6000 0 Mean 41.390
9 2015 7000 0 High 43.590
10 2015 7000 0 Low 33.120
11 2015 7000 0 Mean 39.355
and if you want the category column:
new_df['category'] = new_df['level_3'].map({'High':0, 'Low':1, 'Mean':2})
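For comparison, here is a minimal sketch that combines melt with an explicit mapping, so the integer codes do not depend on the order in which the columns happen to be melted. The dictionary `order` and the two-row sample are assumptions made for illustration.

```python
import pandas as pd

df2 = pd.DataFrame({'Date': [2018, 2017],
                    'Rev': [4000, 5000],
                    'Other': [0, 0],
                    'High': [75.11, 70.93],
                    'Low': [60.42, 45.74],
                    'Mean': [67.765, 58.335]})

# Explicit category codes: 0 = high, 1 = low, 2 = mean
order = {'High': 0, 'Low': 1, 'Mean': 2}

long_df = df2.melt(id_vars=['Date', 'Rev', 'Other'],
                   value_vars=list(order),
                   var_name='category', value_name='price')
long_df['category'] = long_df['category'].map(order)

# Newest year first, then high/low/mean within each year
long_df = long_df.sort_values(['Date', 'category'],
                              ascending=[False, True], ignore_index=True)
print(long_df)
```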
Here's another version:
import pandas as pd
import numpy as np
df2 = pd.DataFrame({'Date': np.array([2018,2017,2016,2015]),
'Rev': np.array([4000,5000,6000,7000]),
'Other': np.array([0,0,0,0]),
'High':np.array([75.11,70.93,48.63,43.59]),
'Low':np.array([60.42,45.74,34.15,33.12]),
'Mean':np.array([67.765,58.335,41.390,39.355]) #mean of high/low columns
})
#create one dataframe per category
df_high = df2[['Date', 'Other', 'Rev', 'High']]
df_mean = df2[['Date', 'Other', 'Rev', 'Mean']]
df_low = df2[['Date', 'Other', 'Rev', 'Low']]
#rename the category column to price
df_high = df_high.rename(index = str, columns = {'High': 'price'})
df_mean = df_mean.rename(index = str, columns = {'Mean': 'price'})
df_low = df_low.rename(index = str, columns = {'Low': 'price'})
#create new category column
df_high['category'] = 0
df_mean['category'] = 2
df_low['category'] = 1
#concatenate the dataframes together
frames = [df_high, df_mean, df_low]
df_concat = pd.concat(frames)
#sort values per example
df_concat = df_concat.sort_values(by = ['Date', 'category'], ascending = [False, True])
#print result
print(df_concat)
Result:
Date Other Rev price category
0 2018 0 4000 75.110 0
0 2018 0 4000 60.420 1
0 2018 0 4000 67.765 2
1 2017 0 5000 70.930 0
1 2017 0 5000 45.740 1
1 2017 0 5000 58.335 2
2 2016 0 6000 48.630 0
2 2016 0 6000 34.150 1
2 2016 0 6000 41.390 2
3 2015 0 7000 43.590 0
3 2015 0 7000 33.120 1
3 2015 0 7000 39.355 2
