looping through groupby in Python dataframe - python

I am new to Python. I am trying to write code that loops through a pandas DataFrame. Below is my initial data:
A B C Start Date End Date
1 2 5 01/01/15 1/31/15
1 2 4 02/01/15 2/28/15
1 2 7 02/25/15 3/15/15
1 2 9 03/11/15 3/30/15
1 2 8 03/14/15 4/5/15
1 2 3 03/31/15 4/10/15
1 2 4 04/05/15 4/27/15
1 2 11 04/15/15 4/20/15
4 5 23 5/6/16 6/6/16
4 5 12 6/10/16 7/10/16
I want to create a new column called Forward_C. Forward_C should hold the value of C from the first later row (by Start Date) that satisfies these conditions:
Columns A and B are equal to those of the current row.
The Start Date of that row is greater than both the Start Date and the End Date of the current row.
The expected output is :
A B C Start Date End Date Forward_C
1 2 5 01/01/15 1/31/15 4
1 2 4 02/01/15 2/28/15 9
1 2 7 02/25/15 3/15/15 3
1 2 9 03/11/15 3/30/15 3
1 2 8 03/14/15 4/5/15 11
1 2 3 03/31/15 4/10/15 11
1 2 4 04/05/15 4/27/15 0
1 2 11 04/15/15 4/20/15 0
4 5 23 5/6/16 6/6/16 12
4 5 12 6/10/16 7/10/16 0
I wrote the code below to achieve this:
df = data.groupby(['A', 'B'], as_index=False).apply(
    lambda x: x.sort_values(['Start Date', 'End Date'], ascending=True))

for i, j in df.iterrows():
    for index, row in df.iterrows():
        if (j['A'] == row['A']) and (j['B'] == row['B']) and \
           (row['Start Date'] > j['End Date']) and (j['Start Date'] < row['Start Date']):
            df.loc[i, 'Forward_C'] = row['C']
            break
I was wondering if there is a more efficient way to do this in Python.
Right now my code iterates through every row for each record, which will be slow, since I will be dealing with more than 10 million records.
Your input is appreciated.
Thanks in advance.
Regards,
RD

I was not entirely clear on the question; based on my understanding, this is what I could come up with. I am using a cross join instead of a loop.
import pandas

data = ...  # actual DataFrame

# add a constant key so a self-merge behaves like a cross join
data['Join'] = "CrossJoinColumn"
df1 = pandas.merge(data, data, how="left", on="Join", suffixes=["", "_2"])
df1 = (df1[(df1['A'] == df1['A_2']) & (df1['B'] == df1['B_2']) &
           (df1['Start Date'] < df1['Start Date_2']) &
           (df1['End Date'] < df1['Start Date_2'])]
       .groupby(by=['A', 'B', 'C', 'Start Date', 'End Date'])
       .first()
       .reset_index()[['A', 'B', 'C', 'Start Date', 'End Date', 'C_2']])
df1 = pandas.merge(data, df1, how="left", on=['A', 'B', 'C', 'Start Date', 'End Date'])
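A different, hedged sketch (not from the answer above): with the dates parsed as datetimes, pandas.merge_asof with direction='forward' can pick, per (A, B) group, the first row whose Start Date falls strictly after the current row's End Date, without materialising a full cross join. Column names follow the question; the %m/%d/%y date format is an assumption.
import pandas as pd

# minimal sketch, assuming `data` is the frame shown above with string dates
for col in ['Start Date', 'End Date']:
    data[col] = pd.to_datetime(data[col], format='%m/%d/%y')

# both sides must be sorted by their asof keys
left = data.sort_values('End Date')
right = (data[['A', 'B', 'Start Date', 'C']]
         .rename(columns={'C': 'Forward_C'})
         .sort_values('Start Date'))

out = pd.merge_asof(left, right,
                    left_on='End Date', right_on='Start Date',
                    by=['A', 'B'],
                    direction='forward',        # first match at or after the key
                    allow_exact_matches=False,  # strictly after End Date
                    suffixes=('', '_next'))
out['Forward_C'] = out['Forward_C'].fillna(0).astype(int)
out = out.drop(columns='Start Date_next').sort_values(['A', 'B', 'Start Date'])
On the sample data this should reproduce the expected Forward_C column, and it scales roughly linearly with the number of rows rather than quadratically.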

Related

Aggregating the counts on a certain day of the week in python

I have this data frame:
id date count
1 8/31/22 1
1 9/1/22 2
1 9/2/22 8
1 9/3/22 0
1 9/4/22 3
1 9/5/22 5
1 9/6/22 1
1 9/7/22 6
1 9/8/22 5
1 9/9/22 7
1 9/10/22 1
2 8/31/22 0
2 9/1/22 2
2 9/2/22 0
2 9/3/22 5
2 9/4/22 1
2 9/5/22 6
2 9/6/22 1
2 9/7/22 1
2 9/8/22 2
2 9/9/22 2
2 9/10/22 0
I want to aggregate the count by id and date to get the sum of quantities. Details:
Date: all counts in a week should be aggregated on Saturday. A week starts on Sunday and ends on Saturday. The time period (the first day and the last day of counts) is fixed for all of the ids.
The desired output is given below:
id date count
1 9/3/22 11
1 9/10/22 28
2 9/3/22 7
2 9/10/22 13
I already have the following code for this and it does work, but it is not efficient: it takes a long time to run on a large database. I am looking for a much faster and more efficient way to get the output:
df['day_name'] = new_df['date'].dt.day_name()
df_week_count = pd.DataFrame(columns=['id', 'date', 'count'])
for id in ids:
    # make a dataframe for each id
    df_id = new_df.loc[new_df['id'] == id]
    df_id.reset_index(drop=True, inplace=True)
    # find the Saturday indices
    saturday_indices = df_id.loc[df_id['day_name'] == 'Saturday'].index
    j = 0
    sat_index = 0
    while j < len(df_id):
        # find sum of count between j and saturday_indices[sat_index]
        sum_count = df_id.loc[j:saturday_indices[sat_index], 'count'].sum()
        # add id, date, sum_count to df_week_count
        temp_df = pd.DataFrame([[id, df_id.loc[saturday_indices[sat_index], 'date'], sum_count]],
                               columns=['id', 'date', 'count'])
        df_week_count = pd.concat([df_week_count, temp_df], ignore_index=True)
        j = saturday_indices[sat_index] + 1
        sat_index += 1
        if sat_index >= len(saturday_indices):
            break
    if j < len(df_id):
        sum_count = df_id.loc[j:, 'count'].sum()
        temp_df = pd.DataFrame([[id, df_id.loc[len(df_id) - 1, 'date'], sum_count]],
                               columns=['id', 'date', 'count'])
        df_week_count = pd.concat([df_week_count, temp_df], ignore_index=True)
df_final = df_week_count.copy(deep=True)
Create a grouping factor from the dates.
week = pd.to_datetime(df['date'].to_numpy()).strftime('%U %y')
df.groupby(['id',week]).agg({'date':max, 'count':sum}).reset_index()
id level_1 date count
0 1 35 22 9/3/22 11
1 1 36 22 9/9/22 28
2 2 35 22 9/3/22 7
3 2 36 22 9/9/22 13
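A further sketch (not part of the answers here): once date is a real datetime, a weekly resample anchored on Saturday expresses the Sunday-to-Saturday week directly; the %m/%d/%y format below is an assumption.
import pandas as pd

# minimal sketch, assuming `df` has the columns id, date, count shown above
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%y')

out = (df.set_index('date')
         .groupby('id')['count']
         .resample('W-SAT')   # weekly bins labelled by their closing Saturday
         .sum()
         .reset_index())
# expected to match the desired output:
#    id       date  count
# 0   1 2022-09-03     11
# 1   1 2022-09-10     28
# 2   2 2022-09-03      7
# 3   2 2022-09-10     13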
I tried to understand as much as I can :)
Here is my process:
# reading data
import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO(data), sep=' ')
# data type fix
df['date'] = pd.to_datetime(df['date'])
# initial grouping
df = df.groupby(['id', 'date'])['count'].sum().to_frame().reset_index()
df.sort_values(by=['date', 'id'], inplace=True)
df.reset_index(drop=True, inplace=True)
# getting name of the day
df['day_name'] = df.date.dt.day_name()
# getting week number
df['week'] = df.date.dt.isocalendar().week
# adjusting week number to make saturday the last day of the week
df.loc[df.day_name == 'Sunday','week'] = df.loc[df.day_name == 'Sunday', 'week'] + 1
What I think you are looking for:
df.groupby(['id','week']).agg(count=('count','sum'), date=('date','max')).reset_index()
   id  week  count                date
0   1    35     11 2022-09-03 00:00:00
1   1    36     28 2022-09-10 00:00:00
2   2    35      7 2022-09-03 00:00:00
3   2    36     13 2022-09-10 00:00:00

Python: Calculate mathematical values in new row in dataframe based on few specific previous rows

I have the below pandas dataframe:
Input:
A B C
Expense 2 3
Sales 5 6
Travel 8 9
My Expected Output is:
A B C
Expense 2 3
Sales 5 6
Travel 8 9
Total Exp 10 12
The last row is basically the total of row 1 and row 3. This is a very simplified example; I actually have to perform complex calculations on a huge dataframe.
Is there a way in Python to perform such a calculation?
You can select rows by position with DataFrame.iloc and sum them, then assign the result to a new row:
df.loc[len(df.index)] = df.iloc[0] + df.iloc[2]
Or:
df.loc[len(df.index)] = df.iloc[[0,2]].sum()
print (df)
A B C
0 1 2 3
1 4 5 6
2 7 8 9
3 8 10 12
EDIT: The first idea is to create an index from the A column, so you can use loc with a new value of A; the last step is to convert the index back into a column with reset_index:
df = df.set_index('A')
df.loc['Total Exp'] = df.iloc[[0,2]].sum()
df = df.reset_index()
print (df)
A B C
0 Expense 2 3
1 Sales 5 6
2 Travel 8 9
3 Total Exp 10 12
Something similar is possible by selecting with loc by labels, here Expense and Travel:
df = df.set_index('A')
df.loc['Total Exp'] = df.loc[['Expense', 'Travel']].sum()
df = df.reset_index()
print (df)
A B C
0 Expense 2 3
1 Sales 5 6
2 Travel 8 9
3 Total Exp 10 12
Or it is possible to exclude the first column with 1: and add the A value back with Series.reindex:
df.loc[len(df.index)] = df.iloc[[0,2], 1:].sum().reindex(df.columns, fill_value='Total Exp')
print (df)
A B C
0 Expense 2 3
1 Sales 5 6
2 Travel 8 9
3 Total Exp 10 12
Or you can set the value of A separately:
s = df.iloc[[0,2]].sum()
s.loc['A'] = 'Total Exp'
df.loc[len(df.index)] = s
print (df)
A B C
0 Expense 2 3
1 Sales 5 6
2 Travel 8 9
3 Total Exp 10 12

python pandas - remove duplicates in a column and keep rows according to a complex criteria

Suppose I have this DF:
s1 = pd.Series([1,1,2,2,2,3,3,3,4])
s2 = pd.Series([10,20,10,5,10,7,7,3,10])
s3 = pd.Series([0,0,0,0,1,1,0,2,0])
df = pd.DataFrame([s1,s2,s3]).transpose()
df.columns = ['id','qual','nm']
df
id qual nm
0 1 10 0
1 1 20 0
2 2 10 0
3 2 5 0
4 2 10 1
5 3 7 1
6 3 7 0
7 3 3 2
8 4 10 0
I want to get a new DF in which there are no duplicate ids, so there should be 4 rows with ids 1, 2, 3, 4. The row to keep should be chosen based on the following criteria: take the one with the smallest nm; if equal, take the one with the largest qual; if still equal, just choose one.
I figure that my code should look something like:
df.groupby('id').apply(lambda x: ???)
And it should return:
id qual nm
0 1 20 0
1 2 10 0
2 3 7 0
3 4 10 0
But not sure what my function should take and return.
Or possibly there is an easier way?
Thanks!
Use boolean indexing with GroupBy.transform to keep the rows with the minimum nm per group, then the rows with the maximum qual, and last, if there are still dupes, remove them with DataFrame.drop_duplicates:
#get minimal nm
df1 = df[df['nm'] == df.groupby('id')['nm'].transform('min')]
#get maximal qual
df1 = df1[df1['qual'] == df1.groupby('id')['qual'].transform('max')]
#if still dupes get first id
df1 = df1.drop_duplicates('id')
print (df1)
id qual nm
1 1 20 0
2 2 10 0
6 3 7 0
8 4 10 0
Use:
grouper = df.groupby(['id'])
df.loc[(grouper['nm'].transform(min) == df['nm']) &
       (grouper['qual'].transform(max) == df['qual']), :].drop_duplicates(subset=['id'])
Output
id qual nm
1 1 20 0
2 2 10 0
6 3 7 0
8 4 10 0
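An alternative sketch (not from either answer): sorting by the tie-break criteria and keeping the first row per id expresses the same priority order in a single chain.
import pandas as pd

# minimal sketch, assuming df has the columns id, qual, nm from the question;
# smallest nm wins, then largest qual, then whichever row comes first
out = (df.sort_values(['id', 'nm', 'qual'], ascending=[True, True, False])
         .drop_duplicates('id')
         .reset_index(drop=True))
On the sample data this should return the same four rows as above.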

Python - Pandas dataframe - iteration through a column

I have a pandas dataframe and I would like to add an empty column (named nb_trades). Then I would like to fill this new column in increments of 5, so I should get a column with the values 5, 10, 15, 20, ...
The code below assigns the same value (the last value of i) to the whole column, which is not what I wanted:
big_df["nb_trade"]='0'
for i in range(big_df.shape[0]):
big_df['nb_trade']=5*(i+1)
Can someone help me?
Use range or np.arange:
df = pd.DataFrame({'a':[1,2,3]})
print (df)
a
0 1
1 2
2 3
df['new'] = range(5, len(df.index) * 5 + 5, 5)
print (df)
a new
0 1 5
1 2 10
2 3 15
df['new'] = np.arange(5, df.shape[0] * 5 + 5, 5)
print (df)
a new
0 1 5
1 2 10
2 3 15
John Galt's solution from the comments:
df['new'] = np.arange(5, 5*(df.shape[0]+1), 5)
print (df)
a new
0 1 5
1 2 10
2 3 15

Pandas Calculate Sum of Multiple Columns Given Multiple Conditions

I have a wide table in a format as follows (for up to 10 people):
person1_status | person2_status | person3_status | person1_type | person2_type | person3_type
0 | 1 | 0 | 7 | 4 | 6
Status can be a 0 or a 1 (first 3 columns).
Type can be a number ranging from 4 to 7. Each type corresponds to a value specified in another table. So...
Type | Value
4 | 10
5 | 20
6 | 30
7 | 40
I need to calculate two columns, 'A', and 'B', where:
A is the sum of values of each person's type (in that row) where
status = 0.
B is the sum of values of each person's type (in that row) where
status = 1.
For example, the resulting columns 'A', and 'B' would be as follows:
A | B
70 | 10
An explanation of this:
'A' has the value 70 because person1 and person3 have status 0 and have types 7 and 6, which correspond to values 40 and 30.
Similarly, there should be another column 'B' that has the value 10, because only person2 has status 1 and their type is 4 (which has a corresponding value of 10).
This is probably a stupid question, but how do I do this in a vectorized way? I don't want to use a for loop or anything since it'll be less efficient.
I hope that made sense... could anyone help me? I think I'm brain dead trying to figure this out.
For simpler calculated columns I was getting away with just np.where, but I'm a little stuck here since I need to calculate the sum of values from multiple columns given certain conditions, while pulling in those values from a separate table.
Use the filter method, which selects the columns whose names contain a given string.
Make a lookup table for the values, other_table, and set its index to the type column.
df_status = df.filter(like = 'status')
df_type = df.filter(like = 'type')
df_type_lookup = df_type.applymap(lambda x: other_table.loc[x]).values
df['A'] = np.sum((df_status == 0).values * df_type_lookup, 1)
df['B'] = np.sum((df_status == 1).values * df_type_lookup, 1)
Full example below:
Create fake data
df = pd.DataFrame({'person_1_status':np.random.randint(0, 2,1000) ,
'person_2_status':np.random.randint(0, 2,1000),
'person_3_status':np.random.randint(0, 2,1000),
'person_1_type':np.random.randint(4, 8,1000),
'person_2_type':np.random.randint(4, 8,1000),
'person_3_type':np.random.randint(4, 8,1000)},
columns= ['person_1_status', 'person_2_status', 'person_3_status',
'person_1_type', 'person_2_type', 'person_3_type'])
person_1_status person_2_status person_3_status person_1_type \
0 1 0 0 7
1 0 1 0 6
2 1 0 1 7
3 0 0 0 7
4 0 0 1 4
person_2_type person_3_type
0 5 5
1 7 7
2 7 7
3 7 7
4 7 7
Make other_table
other_table = pd.Series({4:10, 5:20, 6:30, 7:40})
4 10
5 20
6 30
7 40
dtype: int64
Filter out status and type columns to their own dataframes
df_status = df.filter(like = 'status')
df_type = df.filter(like = 'type')
Make lookup table
df_type_lookup = df_type.applymap(lambda x: other_table.loc[x]).values
Apply element-wise multiplication with the boolean status masks and sum across rows.
df['A'] = np.sum((df_status == 0).values * df_type_lookup, 1)
df['B'] = np.sum((df_status == 1).values * df_type_lookup, 1)
Output
person_1_status person_2_status person_3_status person_1_type \
0 0 0 1 7
1 0 1 0 4
2 0 1 1 7
3 0 1 0 6
4 0 0 1 5
person_2_type person_3_type A B
0 7 5 80 20
1 6 4 20 30
2 5 5 40 40
3 6 4 40 30
4 7 5 60 20
Consider the dataframe df:
mux = pd.MultiIndex.from_product([['status', 'type'], ['p%i' % i for i in range(1, 6)]])
data = np.concatenate([np.random.choice((0, 1), (10, 5)), np.random.rand(10, 5)], axis=1)
df = pd.DataFrame(data, columns=mux)
df
The way this is structured, we can compute the sum for status == 1 with
df.status.mul(df.type).sum(1)
0 0.935290
1 1.252478
2 1.354461
3 1.399357
4 2.102277
5 1.589710
6 0.434147
7 2.553792
8 1.205599
9 1.022305
dtype: float64
and for status == 0
df.status.rsub(1).mul(df.type).sum(1)
0 1.867986
1 1.068045
2 0.653943
3 2.239459
4 0.214523
5 0.734449
6 1.291228
7 0.614539
8 0.849644
9 1.109086
dtype: float64
You can get your columns in this format using the following code
df.columns = df.columns.str.split('_', expand=True)
df = df.swaplevel(0, 1, 1)
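A follow-up sketch (not from the answer): combining that column reshaping with the question's type-to-value table, assuming the columns are consistently named like person1_status and person1_type, gives A and B directly.
import pandas as pd

# minimal sketch; `df` is assumed to be the wide frame from the question
other_table = {4: 10, 5: 20, 6: 30, 7: 40}   # type -> value

df.columns = df.columns.str.split('_', expand=True)   # ('person1', 'status'), ...
df = df.swaplevel(0, 1, axis=1)                        # ('status', 'person1'), ...

status = df['status']                       # one column per person
values = df['type'].replace(other_table)    # map each type to its value

A = values.where(status.eq(0), 0).sum(axis=1)   # row-wise sum where status == 0
B = values.where(status.eq(1), 0).sum(axis=1)   # row-wise sum where status == 1
For the question's example row (statuses 0, 1, 0 and types 7, 4, 6) this yields A = 70 and B = 10.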
