I have the monthly performance of students for several years across all subjects. The DataFrame has the following columns: [Name, Subject, Month, Year, Marks], as shown below:
Name Month Year Subject Marks
0 A 1 2022 Math 80
1 A 2 2022 Math 80
2 A 3 2022 Math 80
3 A 4 2022 Math 70
4 A 5 2022 Math 80
5 A 6 2022 Math 80
6 A 7 2022 Math 80
Now I want to combine consecutive rows having the same marks for a given student and subject, as shown below:
Name Subject Marks Time_Period
0 A Math 80 1.2022-3.2022
1 A Math 70 4.2022-4.2022
2 A Math 80 5.2022-7.2022
I have tried grouping the DataFrame and extracting Min/Max(Month) and Min/Max(Year), but this gives a wrong result if the student has a different mark in a month in between.
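A minimal sketch of that attempt (assuming the DataFrame above is df) shows why:
# one group per (Name, Subject, Marks) -- ignores gaps between runs
naive = (df.groupby(['Name', 'Subject', 'Marks'], as_index=False)
           .agg(first_month=('Month', 'min'), last_month=('Month', 'max')))
# Marks == 80 collapses to months 1-7, hiding the Marks == 70 row in month 4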
You can use a custom groupby.agg:
# identify consecutive marks
group = df['Marks'].ne(df['Marks'].shift()).cumsum()
out = (df.assign(Time_Period=lambda d: d['Month'].astype(str)
+'.'+d['Year'].astype(str))
.groupby(['Name', 'Subject', 'Marks', group],
sort=False, as_index=False)
['Time_Period']
.agg(lambda x: '-'.join([x.iloc[0], x.iloc[-1]]))
)
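To make the grouping concrete, the helper series for the sample data evaluates to (a quick check):
print(group.tolist())
# [1, 1, 1, 2, 3, 3, 3] -> three runs of consecutive equal marks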
If you want to start a new group when a month is missing:
# identify consecutive marks
group1 = df['Marks'].ne(df['Marks'].shift()).cumsum()
# group by successive months within a year
group2 = df.groupby('Year')['Month'].diff().ne(1).cumsum()
out = (df.assign(Time_Period=lambda d: d['Month'].astype(str)
+'.'+d['Year'].astype(str))
.groupby(['Name', 'Subject', 'Marks', group1, group2],
sort=False, as_index=False)
['Time_Period']
.agg(lambda x: '-'.join([x.iloc[0], x.iloc[-1]]))
)
Output:
Name Subject Marks Time_Period
0 A Math 80 1.2022-3.2022
1 A Math 70 4.2022-4.2022
2 A Math 80 5.2022-7.2022
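To see why the cumsum matters in group2, consider a hypothetical year where month 2 is missing (Year grouping omitted for brevity):
m = pd.Series([1, 3, 4])
print(m.diff().ne(1).cumsum().tolist())
# [1, 2, 2] -> month 1 and months 3-4 land in separate groups despite equal marks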
I have data with 3 columns: date, id, sales.
My first task was filtering sales above 100, which I did.
The second task is grouping ids by consecutive days.
index  date        id  sales
0      01/01/2018  03  101
1      01/01/2018  07  178
2      02/01/2018  03  120
3      03/01/2018  03  150
4      05/01/2018  07  205
The result should be:
index  id  count
0      03  3
1      07  1
2      07  1
I need to do this task without using pandas/DataFrames, but right now I can't imagine from which side to attack this problem.
Just for effort, I tried the solution suggested in count consecutive days python dataframe, but the ids are not grouped.
Here is my code:
data = df[df['sales'] >= 100]
data['date'] = pd.to_datetime(data['date']).dt.date
s = data.groupby('id').date.diff().dt.days.ne(1).cumsum()
new_frame = data.groupby(['id', s]).size().reset_index(level=0, drop=True)
It is very important that new_frame has a "count" column, because afterwards I need to count ids by ranges of those consecutive-day counts, e.g. the count of ids in the range of 0-7 days, 7-12 days, etc., but that is not part of my question.
Thank you a lot
Your code is close, but needs some fine-tuning, as follows:
# keep only rows with sales of at least 100
data = df[df['sales'] >= 100]
# parse the dd/mm/yyyy dates
data['date'] = pd.to_datetime(data['date'], dayfirst=True)
# sort so that diff() compares chronologically adjacent dates per id
df2 = data.sort_values(['id', 'date'])
# start a new group whenever the gap to the previous date is not exactly 1 day
s = df2.groupby('id').date.diff().dt.days.ne(1).cumsum()
new_frame = df2.groupby(['id', s]).size().reset_index(level=1, drop=True).reset_index(name='count')
Result:
print(new_frame)
id count
0 3 3
1 7 1
2 7 1
Summary of changes:
As your dates are in dd/mm/yyyy instead of the default mm/dd/yyyy, you have to specify dayfirst=True in pd.to_datetime(). Otherwise, 02/01/2018 is parsed as 2018-02-01 instead of the expected 2018-01-02, and the day diff with adjacent entries is around 30 rather than 1.
We added a sort on id and date so that the diff() used to build the series s compares chronologically adjacent dates within each id.
In the last groupby(), reset_index(level=0, drop=True) should drop level=1 instead, since level=0 is the id field, which we want to keep.
In the last groupby(), we add an extra .reset_index(name='count') to turn the resulting Series back into a DataFrame and name the new column count.
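With the sample data, the helper series s comes out as (a quick check):
print(s.tolist())
# [1, 1, 1, 2, 3] -> id 03 forms one 3-day run, id 07 two single-day runs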
I have a dataframe containing two columns of dates: start date and end date. I need to build a dataframe where each month of the year gets its own column, based on the start/end date intervals, so I can sum the values from another column for each month per name.
To illustrate:
Original df:
Start Date End Date Name Value
10/22/20 01/25/21 John 100
10/12/20 04/30/21 John 50
02/25/21 None John 20
Desired df:
Name Oct_20 Nov_20 Dec_20 Jan_21 Feb_21 Mar_21 Apr_21 May_21 Jun_21 Jul_21 Aug_21 ...
John 150 150 150 150 70 70 70 20 20 20 20 ...
Any suggestions or pointers on how I could achieve that result would be greatly appreciated!
First convert the values to datetimes, coercing non-datetimes to missing values and filling those with some end date. Then use a list comprehension to expand each row into a Series of all months in its range, which is used for pivoting with DataFrame.pivot_table:
end = '2021-12-31'
df['Start'] = pd.to_datetime(df['Start Date'])
# 'None' end dates become NaT and are filled with the chosen end date
df['End'] = pd.to_datetime(df['End Date'], errors='coerce').fillna(end)
# one entry per month end in each row's interval, keyed by the row index
s = pd.concat([pd.Series(r.Index, pd.date_range(r.Start, r.End, freq='M'))
               for r in df.itertuples()])
# attach the month dates and join the original columns back
df1 = pd.DataFrame({'Date': s.index}, s).join(df)
df2 = df1.pivot_table(index='Name',
columns='Date',
values='Value',
aggfunc='sum',
fill_value=0)
df2.columns = df2.columns.strftime('%b_%y')
print (df2)
Date Oct_20 Nov_20 Dec_20 Jan_21 Feb_21 Mar_21 Apr_21 May_21 Jun_21 \
Name
John 150 150 150 50 70 70 70 20 20
Date Jul_21 Aug_21 Sep_21 Oct_21 Nov_21 Dec_21
Name
John 20 20 20 20 20 20
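To see what the expansion produces, here is the month Series for the first row alone (a quick sketch; freq='M' marks month ends, so a partial final month such as January 25 is not counted, which is why Jan_21 shows 50 rather than the 150 in the desired output):
pd.Series(0, pd.date_range('2020-10-22', '2021-01-25', freq='M'))
# 2020-10-31    0
# 2020-11-30    0
# 2020-12-31    0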
Here is a sample dataset.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'VipNo':np.repeat( range(3), 2 ),
'Quantity': np.random.randint(200,size=6),
'OrderDate': np.random.choice( pd.date_range('1/1/2020', periods=365, freq='D'), 6, replace=False)})
print(df)
So I have a couple of steps to do. I want to create a new column named qtywithin1mon/totalqty. First I want to group by VipNo (each number represents an individual), because a person may have made multiple purchases. Then I want to see if the order date is within a certain range (let's say 2020/03/01 - 2020/03/31). If so, I want to divide the quantity on that day by the total quantity this customer purchased. My dataset is big, so a customer may have ordered twice within the time range, in which case I want the sum of the two orders divided by the total quantity. How can I achieve this goal? I really have no idea where to start.
Thank you so much!
You can create a new column that masks the quantity to the given date range, then groupby:
start, end = pd.to_datetime(['2020/03/01','2020/03/31'])
(df.assign(QuantitySub=df['OrderDate'].between(start,end)*df.Quantity)  # zero outside the range
   .groupby('VipNo')[['Quantity','QuantitySub']]
   .sum()
   .assign(output=lambda x: x['QuantitySub']/x['Quantity'])  # in-range share of total
   .drop('QuantitySub', axis=1)
)
With a data frame:
VipNo Quantity OrderDate
0 0 105 2020-01-07
1 0 56 2020-03-04
2 1 167 2020-09-05
3 1 18 2020-05-08
4 2 151 2020-11-01
5 2 14 2020-03-17
The output is:
Quantity output
VipNo
0 161 0.347826
1 185 0.000000
2 165 0.084848
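For clarity, the intermediate masked column for this frame is (a quick check):
print((df['OrderDate'].between(start, end) * df.Quantity).tolist())
# [0, 56, 0, 0, 0, 14] -> only the two March orders survive the mask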
I'm creating charts in Periscope Data and using pandas to derive our results. I'm facing difficulties removing duplicates from the results.
This is what our data looks like in the final dataframe after calculating:
vendor_ID date opening purchase paid closing
B2345 01/01/2015 5 20 10 15
B2345 01/01/2015 15 50 20 45
B2345 02/01/2015 45 4 30 19
I want to remove the duplicate entries based on vendor_ID and date, but keep the first entry's opening and the last entry's closing (with purchase and paid summed).
i.e. the expected result is:
vendor_ID date opening purchase paid closing
B2345 01/01/2015 5 70 30 45
B2345 02/01/2015 45 4 30 19
I've tried the code below to remove the duplicates, but it does not give the result I want:
df.drop_duplicates(subset=["vendor_ID", "date"], keep="last", inplace=True)
How do I write this so that the duplicates are removed while keeping the first and last values as in the example above?
Use GroupBy.agg with GroupBy.first, GroupBy.last and GroupBy.sum specified per output column:
Note (thanks @Erfan): if you need the minimum and maximum of the columns instead of the first and last values, change the dict to {'opening':'min','purchase':'sum','paid':'sum', 'closing':'max'}
df1 = (df.groupby(["vendor_ID", "date"], as_index=False)
.agg({'opening':'first','purchase':'sum','paid':'sum', 'closing':'last'}))
print (df1)
vendor_ID date opening purchase paid closing
0 B2345 01/01/2015 5 70 30 45
1 B2345 02/01/2015 45 4 30 19
Also, if you are not sure the datetimes are sorted:
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df = df.sort_values(["vendor_ID", "date"])
df1 = (df.groupby(["vendor_ID", "date"], as_index=False)
.agg({'opening':'first','purchase':'sum','paid':'sum', 'closing':'last'}))
print (df1)
vendor_ID date opening purchase paid closing
0 B2345 2015-01-01 5 70 30 45
1 B2345 2015-01-02 45 4 30 19
You can also build the dictionary dynamically: sum all columns except the first two and those used for first and last:
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df = df.sort_values(["vendor_ID", "date"])
d = {'opening':'first', 'closing':'last'}
sum_cols = df.columns.difference(list(d.keys()) + ['vendor_ID','date'])
final_d = {**dict.fromkeys(sum_cols,'sum'), **d}
df1 = df.groupby(["vendor_ID", "date"], as_index=False).agg(final_d).reindex(df.columns,axis=1)
print (df1)
vendor_ID date opening purchase paid closing
0 B2345 2015-01-01 5 70 30 45
1 B2345 2015-01-02 45 4 30 19
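With the sample columns, the dynamically built dictionary evaluates to (a quick check):
print(final_d)
# {'paid': 'sum', 'purchase': 'sum', 'opening': 'first', 'closing': 'last'}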
I have a dataframe containing dates and prices. I need to sum all prices belonging to a given week, e.g. 17/12 to 23/12, and put the total next to a new label corresponding to that week.
Date Price
12/17/2015 10
12/18/2015 20
12/19/2015 30
12/21/2015 40
12/24/2015 50
I want the output to be the following
week total
17/12-23/12 100
24/12-30/12 50
I tried different datetime and groupby functions but was not able to get this output. Please help.
What about this approach?
In [19]: df.groupby(df.Date.dt.weekofyear)['Price'].sum().rename_axis('week_no').reset_index(name='total')
Out[19]:
week_no total
0 51 60
1 52 90
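Note: in recent pandas versions dt.weekofyear has been removed; dt.isocalendar().week is the replacement, so an equivalent call would be (a sketch):
df.groupby(df.Date.dt.isocalendar().week)['Price'].sum()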
UPDATE:
In [49]: df.resample(on='Date', rule='7D', base='4D').sum().rename_axis('week_from') \
           .reset_index()
Out[49]:
week_from Price
0 2015-12-17 100
1 2015-12-24 50
UPDATE2:
x = (df.resample(on='Date', rule='7D', base='4D')
.sum()
.reset_index()
.rename(columns={'Price':'total'}))
x = x.assign(week=x['Date'].dt.strftime('%d/%m')
+'-'
+(x.pop('Date')+pd.DateOffset(days=7)).dt.strftime('%d/%m'))
In [127]: x
Out[127]:
total week
0 100 17/12-24/12
1 50 24/12-31/12
Using resample
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
df = df.resample('W').sum()
Price
Date
2015-12-20 60
2015-12-27 90
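If you also want week labels in the question's 17/12-23/12 style, a possible follow-up (a sketch; note that 'W' bins end on Sundays, so the ranges differ from the ones in the question):
df.index = ((df.index - pd.Timedelta(days=6)).strftime('%d/%m')
            + '-' + df.index.strftime('%d/%m'))
# 14/12-20/12    60
# 21/12-27/12    90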