Pandas dataframe join, the calculation process through groupby(Cumulative weighted average) - python

Thankfully, the contents of the last question were solved well, so I was making a dataset without any problems.(Thank you Ecker!!)
But, there was a new problem.
In the process of calculating the cumulative weighted average through the same process,
When the dataset and type were changed, there were cases where it was not possible to join.
For example, in the dataset below
firm
date
reviewer
compound
A
2021-01-01
a
0.6531
A
2021-01-01
b
-0.7213
A
2021-01-01
c
-0.3168
A
2021-01-02
d
0.3548
A
2021-01-02
e
0.5783
A
2021-01-03
f
0.4298
A
2021-01-04
g
0.8769
B
2021-01-01
h
0.7895
B
2021-01-01
i
-0.4924
B
2021-01-02
j
0.0245
B
2021-01-02
k
0.4982
B
2021-01-03
a
0.1597
B
2021-01-04
b
0.6254
‘The compound value’ is a real number (float64) including a decimal point between -1 and 1.
‘The count number’ is the number of reviewers on a specific date.(int64)
I would like to add a column that calculates the cumulative weighted average as shown in the table below.
firm
date
reviewer
rate
cum_avg_compound
A
2021-01-01
a
0.6531
-0.12833
A
2021-01-01
b
-0.7213
-0.12833
A
2021-01-01
c
-0.3168
-0.12833
A
2021-01-02
d
0.3548
0.10962
A
2021-01-02
e
0.5783
0.10962
A
2021-01-03
f
0.4298
0.162983
A
2021-01-04
g
0.8769
0.264971
B
2021-01-01
h
0.7895
0.14855
B
2021-01-01
i
-0.4924
0.14855
B
2021-01-02
j
0.0245
0.20495
B
2021-01-02
k
0.4982
0.20495
B
2021-01-03
a
0.1597
0.1959
B
2021-01-04
b
0.6254
0.26748
Even if this is converted to the same float64 format, it cannot be combined using join.
The code I tried is as follows.
g = (
df.groupby(['firm', 'date'])['compound']
.agg(['sum', 'count'])
.groupby(level='sid').cumsum()
)
df = df.join(
g ['sum'].div(g_text_emo['count']).rename('cum_avg_compound'),
on=['firm', 'date']
)
Is there any way to solve this problem?
Thank you in advance.

Related

Select Rows Based on Time Difference [Before or After] In Columns

I have the following dataset of students taking 2 different exams:
df = pd.DataFrame({'student': 'A B C D E'.split(),
'sat_date': [datetime.datetime(2013,4,1),datetime.datetime(2013,5,1),
datetime.datetime(2013,5,2),datetime.datetime(2013,7,15),
datetime.datetime(2013,8,1)],
'act_date': [datetime.datetime(2013,4,12),datetime.datetime(2013,5,2),
datetime.datetime(2013,4,12), datetime.datetime(2013,7,1),
datetime.datetime(2013,8,2)]})
print(df)
student sat_date act_date
0 A 2013-04-01 2013-04-12
1 B 2013-05-01 2013-05-02
2 C 2013-05-02 2013-04-12
3 D 2013-07-15 2013-07-01
4 E 2013-08-01 2013-08-02
I want to select those students whose two exams are 10 days apart from each other in either direction.
I am trying Timedelta, but I'm not sure if it's optimal.
df[(df['sat_date'] >= df['act_date'] + pd.Timedelta(days=10)) | (df['sat_date'] <= df['act_date'] - pd.Timedelta(days=10))]
Desired Output:
student sat_date act_date
0 A 2013-04-01 2013-04-12
2 C 2013-05-02 2013-04-12
3 D 2013-07-15 2013-07-01
Is there any better way of getting the desired output? Any suggestions would be appreciated. Thanks!
I would probably look at the absolute value of the difference between the two dates is greater to or equal than 10.
df.loc[abs((df['sat_date']-df['act_date']).dt.days).ge(10)]
Try as follows:
result = df.loc[abs(df.sat_date - df.act_date).dt.days>=10]
print(result)
student sat_date act_date
0 A 2013-04-01 2013-04-12
2 C 2013-05-02 2013-04-12
3 D 2013-07-15 2013-07-01
Or maybe nicer:
df.loc[abs(df.sat_date - df.act_date).ge(pd.Timedelta(days=10))]

Slicing across a timeseries range in a multiindex DataFrame

I have a DataFrame that tracks the 'Adj Closing' price for several global markets causing there to be repeating dates. To clean this up I use .set_index(['Index Ticker', 'Date']).
DataFrame sample
My issue is that the Closing Prices run as far back as 1997-07-02 but I only need 2020-01-01 and forward. I tried using idx = pd.IndexSlice followed by df.loc[idx[ :, '2020-01-01':], :] as well as df.loc[(slice(None), '2020-01-01':), :], but both methods return a syntax error on the : that I'm using to slice across a range of dates. Any tips on getting the data I need past a specific date? Thank you in advance!
Try:
# create dataframe to approximate your data
df = pd.DataFrame({'ticker' : ['A']*5 + ['M']*5,
'Date' : pd.date_range(start='2021-01-01', periods=5).tolist() + pd.date_range(start='2021-01-01', periods=5).tolist(),
'high' : range(10)}
).groupby(['ticker', 'Date']).sum()
high
ticker Date
A 2021-01-01 0
2021-01-02 1
2021-01-03 2
2021-01-04 3
2021-01-05 4
M 2021-01-01 5
2021-01-02 6
2021-01-03 7
2021-01-04 8
2021-01-05 9
# evaluate conditions against level 1 (Date) of your multiIndex; level 0 is ticker
df[df.index.get_level_values(1) > '2021-01-03']
high
ticker Date
A 2021-01-04 3
2021-01-05 4
M 2021-01-04 8
2021-01-05 9
Alternatively, if possible, remove the unwanted dates prior to setting your multiIndex.

Groupby, and convert colum to list for nth recent dates

I have:
Name Purchase_Date Item
0 Peter 2021-01-01 Car
1 Peter 2021-02-01 Keys
2 Peter 2021-03-01 Chocolate
3 Erika 2021-01-02 Horse
4 Erika 2021-02-02 Water
5 Erika 2021-02-02 Laptop
I want to get in a column the list of the most recent two(for the sake of the example) purchases (not if Purchase_Date repeats, take both).
So the output would look like:
Name. Purchase_Date. Items_List
Peter 2021-01-01    [Car]
Peter 2021-02-01 [Keys]
Erika. 2021-01-02 [Horse]
Erika 2021-02-02 [Water, Laptop]
As you can see Peter's purchase (chocolate) is not there, its the 3rd and Erika has two items because the date 2021-02-02 repeats.
Tried some group_by and flatten list and all the stuff but not sorting it out.
Code for df:
data = pd.DataFrame({'Name':['Peter','Peter','Peter','Erika','Erika','Erika'],
'Purchase_Date':['01/01/2021','01/02/2021','01/03/2021','02/01/2021','02/02/2021','02/02/2021'],
'Item':['Car','Keys','Chocolate','Horse','Water','Laptop']})
Try:
# Convert to datetime
# df['Purchase_Date'] = pd.to_datetime(df['Purchase_Date']
>>> df.groupby(['Name', 'Purchase_Date'], sort=True) \
.agg({'Item': list}).groupby('Name').head(2)
Item
Name Purchase_Date
Erika 2021-01-02 [Horse]
2021-02-02 [Water, Laptop]
Peter 2021-01-01 [Car]
2021-02-01 [Keys]
You can use groupby twice to achieve this. I'm assuming your dataframe is already sorted as desired prior to running the below line of code.
(data.groupby(['Name', 'Purchase_Date'], as_index=False, sort=False).agg(list)
.groupby('Name').head(2))
Out[1]:
Name Purchase_Date Item
0 Peter 01/01/2021 [Car]
1 Peter 01/02/2021 [Keys]
3 Erika 02/01/2021 [Horse]
4 Erika 02/02/2021 [Water, Laptop]
Convert your dates to_datetime then you can use groupby + rank to keep up to the first two 'Purchase_Dates' for each person. Then groupby + agg(list)
import pandas as pd
# If not `datetime64[ns]`
df['Purchase_Date'] = pd.to_datetime(df['Purchase_Date'])
df = df.sort_values('Purchase_Date')
df1 = (df[df.groupby('Name')['Purchase_Date'].rank(method='dense').le(2)]
.groupby(['Name', 'Purchase_Date']).agg(list))
print(df1)
Item
Name Purchase_Date
Erika 2021-02-01 [Horse]
2021-02-02 [Water, Laptop]
Peter 2021-01-01 [Car]
2021-01-02 [Keys]

Rolling windows without NaN at the beginning

Given the following code with rolling windows of 3:
import pandas as pd
df=pd.DataFrame({"date":['1/1/2021','1/2/2021','1/3/2021','1/4/2021','1/5/2021','1/1/2021','1/2/2021','1/3/2021','1/4/2021','1/5/2021'],"item_name":["bracelet","bracelet","bracelet","bracelet","bracelet","earring","earring","earring","earring","earring"],"quantity_sold":[1,2,3,4,5,100,200,300,400,500]})
df['date']=pd.to_datetime(df['date'])
display(df)
#sort on the right fields before the calculation
df=df.sort_values(['date','item_name'])
#sum of quantity for last 3 days (curr_day-2,curr_day-1,curr_day)
display(df.set_index("date").groupby("item_name").rolling(3).agg('sum'))
the result is:
Is it possible to have the first two value calculated without NaN - e.g., on bracelet, 2021-01-01, since we have 1 element, we use rolling windows of 1 and get value=1; on bracelet, 2021-01-02, since we have 2 element, we use rolling windows of 2 and get value=3?
(So similarly we have bracelet, 2021-01-01 with value =100 and bracelet, 2021-01-02 with value =300)
You can use the min_periods keyword from the documentation of rolling():
df.set_index("date").groupby("item_name").rolling(3,min_periods=1).agg('sum')
min_periods: int, default None
Minimum number of observations in window
required to have a value (otherwise result is NA). For a window that
is specified by an offset, min_periods will default to 1. Otherwise,
min_periods will default to the size of the window.
This will give you:
quantity_sold
item_name date
bracelet 2021-01-01 1.0
2021-01-02 3.0
2021-01-03 6.0
2021-01-04 9.0
2021-01-05 12.0
earring 2021-01-01 100.0
2021-01-02 300.0
2021-01-03 600.0
2021-01-04 900.0
2021-01-05 1200.0

Python Pandas : how to combine trip segments into a journey with Transport smart card data

Currently working with an interesting transport smart card dataset. Each line in the current data represent a trip (e.g. bus trip from A to B). Any trips within 60 min needs to be grouped into journey.
The current table:
CustomerID SegmentID Origin Dest StartTime EndTime Fare Type
0 A001 101 A B 7:30am 7:45am 1.5 Bus
1 A001 102 B C 7:50am 8:30am 3.5 Train
2 A001 103 C B 17:10pm 18:00pm 3.5 Train
3 A001 104 B A 18:10pm 18:30pm 1.5 Bus
4 A002 105 K Y 11:30am 12:30pm 3.0 Train
5 A003 106 P O 10:23am 11:13am 4.0 Ferrie
and covert into something like:
CustomerID JourneyID Origin Dest Start Time End Time Fare Type NumTrips
0 A001 1 A C 7:30am 8:30am 5 Intermodal 2
1 A001 2 C A 17:10pm 18:30pm 5 Intermodal 2
2 A002 6 K Y 11:30am 12:30pm 3 Train 1
3 A003 8 P O 10:23am 11:13am 4 Ferrie 1
I'm new to Python and Pandas and have no idea how to start, so any guidance would be appreciated.
Here's a fairly complete answer. You didn't fully specify the concept of a single journey so I took a guess. You could adjust mask below to better suit your own definition.
# get rid of am/pm and convert to proper datetime
# converts to year 1900 b/c it's not specified, doesn't matter here
df['StTime'] = pd.to_datetime( df.StartTime.str[:-2], format='%H:%M' )
df['EndTime'] = pd.to_datetime( df.EndTime.str[:-2], format='%H:%M' )
# some of the later processing is easier if you use duration
# instead of arrival time
df['Duration'] = df.EndTime-df.StTime
# get rid of some nuisance variables for clarity
df = df[['CustomerID','Origin','Dest','StTime','Duration','Fare','Type']]
First, we need to figure out a way group the rows. As this is not well specified in the question, I'll group by Customer ID where Start Times are within 1 hr. Note that for tri-modal trips this actually implies that start times of the first and third trips could differ by more than one hour as long as first+second and second+third are each individaully under 1 hour. This seems like a natural way to do it, but for your actual use case you'd have to adjust this for your desired definition. There are quite a few ways you could proceed here.
mask1 = df.StTime - df.StTime.shift(1) <= pd.Timedelta('01:00:00')
mask2 = (df.CustomerID == df.CustomerID.shift(1))
mask = ( mask1 & mask2 )
Now we can use the mask with cumsum to generate a tripID:
df['JourneyID'] = 1
df.ix[mask,'JourneyID'] = 0
df['JourneyID'] = df['JourneyID'].cumsum()
df['NumTrips'] = 1
df[['CustomerID','StTime','Fare','JourneyID']]
CustomerID StTime Fare JourneyID
0 A001 1900-01-01 07:30:00 1.5 1
1 A001 1900-01-01 07:50:00 3.5 1
2 A001 1900-01-01 17:10:00 3.5 2
3 A001 1900-01-01 18:10:00 1.5 2
4 A002 1900-01-01 11:30:00 3.0 3
5 A003 1900-01-01 10:23:00 4.0 4
Now, for each column just aggregate appropriately:
df2 = df.groupby('JourneyID').agg({ 'Origin' : sum, 'CustomerID' : min,
'Dest' : sum, 'StTime' : min,
'Fare' : sum, 'Duration' : sum,
'Type' : sum, 'NumTrips' : sum })
StTime Dest Origin Fare Duration Type CustomerID NumTrips
JourneyID
1 1900-01-01 07:30:00 BC AB 5 00:55:00 BusTrain A001 2
2 1900-01-01 17:10:00 BA CB 5 01:10:00 TrainBus A001 2
3 1900-01-01 11:30:00 Y K 3 01:00:00 Train A002 1
4 1900-01-01 10:23:00 O P 4 00:50:00 Ferrie A003 1
Note that Duration includes only travel time and not the time in-between trips (e.g. if start time of second trip is later than the end time of first trip).

Categories

Resources