Group by dataframe on values from other dataframe - python

I have a df1 with the following columns that I want to group:
retailer object
store_id int64
visit_date datetime64[ns]
categories_euclidean float64
probes_euclidean float64
I want to group it by the following df2:
pk start_date end_date
name
Cycle 01 1 2016-02-24 2016-03-13
Cycle 02 2 2016-03-14 2016-03-27
Cycle 03 3 2016-03-28 2016-04-10
Cycle 04 4 2016-04-11 2016-04-24
Cycle 05 5 2016-04-25 2016-05-08
Cycle 06 6 2016-05-09 2016-05-22
The filtering condition is df2.start_date <= df1.visit_date <= df2.end_date,
and the group names should be the df2 names (its index).
Any idea how to do this?

Try:
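def filt(x):
    # Boolean masks over df2: cycles that start on or before x and end on or after x
    cond1 = df2.start_date <= x
    cond2 = df2.end_date >= x
    # Index label (the cycle name) of the first match; raises IndexError if no cycle matches
    return df2[cond1 & cond2].index[0]
df1['name'] = df1.visit_date.apply(filt)
df1.groupby('name')
I couldn't test this, but it might work.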
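On larger frames a row-wise apply can be slow. Below is a minimal sketch of a vectorized alternative, assuming the cycles in df2 never overlap. The small df1 and df2 built here are invented for illustration and only mimic the shapes shown in the question. The idea is to build a pd.IntervalIndex from the start and end dates and look up all visit dates in one call.

```python
import pandas as pd

# Hypothetical sample data shaped like the frames in the question.
df2 = pd.DataFrame(
    {
        "pk": [1, 2, 3],
        "start_date": pd.to_datetime(["2016-02-24", "2016-03-14", "2016-03-28"]),
        "end_date": pd.to_datetime(["2016-03-13", "2016-03-27", "2016-04-10"]),
    },
    index=pd.Index(["Cycle 01", "Cycle 02", "Cycle 03"], name="name"),
)
df1 = pd.DataFrame(
    {
        "store_id": [10, 11, 12, 13],
        "visit_date": pd.to_datetime(
            ["2016-02-24", "2016-03-20", "2016-04-10", "2017-01-01"]
        ),
        "probes_euclidean": [0.1, 0.2, 0.3, 0.4],
    }
)

# Closed on both sides so that start_date <= visit_date <= end_date.
intervals = pd.IntervalIndex.from_arrays(
    df2["start_date"], df2["end_date"], closed="both"
)

# Position of the matching cycle for each visit, or -1 when no cycle matches.
pos = intervals.get_indexer(df1["visit_date"])

# Map positions to cycle names, then blank out the unmatched rows.
cycle_names = df2.index.to_numpy()[pos]
df1["name"] = pd.Series(cycle_names, index=df1.index).where(pos >= 0)

# groupby drops rows whose key is missing by default.
grouped_mean = df1.groupby("name")["probes_euclidean"].mean()
```

Unlike the apply version, a visit date that falls outside every cycle gets a missing name here instead of raising an error.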

Related

Check if there is a particular value in column in given time window in Pandas, Python

I have the following Pandas data frame:
id date event
01 2000-01-01 start
01 2000-01-02 a
01 2000-01-03 a
01 2000-01-04 b
02 2000-02-01 start
02 2000-02-10 a
02 2000-02-11 a
03 2000-01-05 start
03 2000-01-08 b
03 2000-02-05 a
03 2000-02-15 a
04 2000-04-07 start
I'd like to know which are the ids where there is a b event in the 5-day time window starting from the start event (start is always the first in the event sequences belonging to a particular user).
What is the proper query for the desired result?
You can merge the "start" rows with the "b" rows per "id", keep the rows where the difference is at most 5 days using loc, and then take the unique matching ids:
# assumes df['date'] already has a datetime dtype
threshold = '5days'
m1 = df['event'].eq('start')
m2 = df['event'].eq('b')
ids = (df[m1].merge(df[m2], on='id')
       .loc[lambda d: d['date_y'].sub(d['date_x']).le(threshold)]
       ['id'].unique().tolist()
      )
Output: ['01', '03']
NB. If there are multiple "b" events, the id will match if any of them is within the 5-day threshold.
Intermediate merge:
df[m1].merge(df[m2], on='id')
id date_x event_x date_y event_y
0 01 2000-01-01 start 2000-01-04 b
1 03 2000-01-05 start 2000-01-08 b
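A self-contained sketch of the merge approach, rebuilding the sample frame from the question so it can be run directly. Writing the threshold as pd.Timedelta(days=5) is a choice made for this sketch, not the answer's original string form.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "id": ["01", "01", "01", "01", "02", "02", "02",
               "03", "03", "03", "03", "04"],
        "date": pd.to_datetime(
            ["2000-01-01", "2000-01-02", "2000-01-03", "2000-01-04",
             "2000-02-01", "2000-02-10", "2000-02-11",
             "2000-01-05", "2000-01-08", "2000-02-05", "2000-02-15",
             "2000-04-07"]
        ),
        "event": ["start", "a", "a", "b", "start", "a", "a",
                  "start", "b", "a", "a", "start"],
    }
)

threshold = pd.Timedelta(days=5)
starts = df[df["event"].eq("start")]
b_events = df[df["event"].eq("b")]

# One row per (start, b) pair sharing an id; the date columns
# become date_x (start) and date_y (b) after the merge.
pairs = starts.merge(b_events, on="id")
within_window = (pairs["date_y"] - pairs["date_x"]) <= threshold
ids = pairs.loc[within_window, "id"].unique().tolist()
print(ids)
```

Ids without any "b" event (02 and 04 here) drop out at the merge step, because the default merge is an inner join.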

Python - Extract year and month from a single column of different year and month arrangements

I would like to create two columns "Year" and "Month" from a Date column that contains different year and month arrangements. Some are YY-Mmm and the others are Mmm-YY.
import pandas as pd
dataSet = {
    "Date": ["18-Jan", "18-Jan", "18-Feb", "18-Feb", "Oct-17", "Oct-17"],
    "Quantity": [3476, 20, 789, 409, 81, 640],
}
df = pd.DataFrame(dataSet, columns=["Date", "Quantity"])
My attempt is as follows:
Date1 = []
Date2 = []
for dt in df.Date:
    Date1.append(dt.split("-")[0])
    Date2.append(dt.split("-")[1])
Year = []
try:
    for yr in Date1:
        Year.append(int(yr.Date1))
except:
    for yr in Date2:
        Year.append(int(yr.Date2))
You can use the str.extract string method to split the date strings up. Since the year can precede or follow the month, we can get a bit creative and capture a Year1 column and a Year2 column, one for each position. Then use np.where to create a single Year column that pulls from whichever of the two is populated.
For example:
import numpy as np
split_dates = df["Date"].str.extract(r"(?P<Year1>\d+)?-?(?P<Month>\w+)-?(?P<Year2>\d+)?")
split_dates["Year"] = np.where(
    split_dates["Year1"].notna(),
    split_dates["Year1"],
    split_dates["Year2"],
)
split_dates = split_dates[["Year", "Month"]]
With result for split_dates:
Year Month
0 18 Jan
1 18 Jan
2 18 Feb
3 18 Feb
4 17 Oct
5 17 Oct
Then you can merge back with your original dataframe with pd.merge, like so:
pd.merge(df, split_dates, how="inner", left_index=True, right_index=True)
Which yields:
Date Quantity Year Month
0 18-Jan 3476 18 Jan
1 18-Jan 20 18 Jan
2 18-Feb 789 18 Feb
3 18-Feb 409 18 Feb
4 Oct-17 81 17 Oct
5 Oct-17 640 17 Oct
Thank you for your help. I managed to get it working with what I've learned so far, i.e. for loop, if-else and split() and with the help of another expert.
# Split the Date column and store it in an array
dA = []
for dP in df.Date:
    dA.append(dP.split("-"))
# Append month and year to respective lists based on if conditions
Month = []
Year = []
for moYr in dA:
    if len(moYr[0]) == 2:
        Month.append(moYr[1])
        Year.append(moYr[0])
    else:
        Month.append(moYr[0])
        Year.append(moYr[1])
This took me hours!
Try using Python datetime strptime(<date>, "%y-%b") on the date column to convert it to a Python datetime.
from datetime import datetime
def parse_dt(x):
    # Try the YY-Mmm layout first, then fall back to Mmm-YY
    try:
        return datetime.strptime(x, "%y-%b")
    except ValueError:
        return datetime.strptime(x, "%b-%y")
df['timestamp'] = df['Date'].apply(parse_dt)
df
Date Quantity timestamp
0 18-Jan 3476 2018-01-01
1 18-Jan 20 2018-01-01
2 18-Feb 789 2018-02-01
3 18-Feb 409 2018-02-01
4 Oct-17 81 2017-10-01
5 Oct-17 640 2017-10-01
Then you can just use .month and .year attributes, or if you prefer the month as its abbreviated form, use Python datetime.strftime('%b').
df['year'] = df.timestamp.apply(lambda x: x.year)
df['month'] = df.timestamp.apply(lambda x: x.strftime('%b'))
df
Date Quantity timestamp year month
0 18-Jan 3476 2018-01-01 2018 Jan
1 18-Jan 20 2018-01-01 2018 Jan
2 18-Feb 789 2018-02-01 2018 Feb
3 18-Feb 409 2018-02-01 2018 Feb
4 Oct-17 81 2017-10-01 2017 Oct
5 Oct-17 640 2017-10-01 2017 Oct
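The same two-format idea can also be written without a row-wise apply. This is a minimal sketch, assuming an English locale for the %b month abbreviations: parse the whole column once per format with errors="coerce", then fill the gaps left by the first pass with the second.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Date": ["18-Jan", "18-Jan", "18-Feb", "18-Feb", "Oct-17", "Oct-17"],
        "Quantity": [3476, 20, 789, 409, 81, 640],
    }
)

# Each pass yields NaT for the strings that do not fit its format.
yy_first = pd.to_datetime(df["Date"], format="%y-%b", errors="coerce")
mon_first = pd.to_datetime(df["Date"], format="%b-%y", errors="coerce")

# Keep the first successful parse for every row.
timestamp = yy_first.fillna(mon_first)

df["Year"] = timestamp.dt.year
df["Month"] = timestamp.dt.strftime("%b")
print(df)
```

Any string that matches neither layout ends up as NaT rather than raising, so it is worth checking timestamp.isna() on real data.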

How to get a DataFrame from the DataFrame with one column as a sum of values of other rows?

I have a DataFrame like this:
shop_id item_price item_cnt_day day month year
59 9.00 1.0 02 01 2013
59 8.00 2.0 02 01 2013
25 10.00 4.0 05 02 2013
25 17.0 1.0 06 01 2013
25 10.00 1.0 15 01 2013
I am trying to get a result like the following DataFrame:
shop_id all_revenue month year
59 25.00 01 2013
25 27.00 01 2013
That is, I want each shop's revenue for January 2013.
However, I don't know how to write this in Pandas. Any help would be appreciated.
eval + groupby + sum
You can assign a series via eval, then use groupby:
res = df.eval('revenue=item_price * item_cnt_day')\
        .groupby(['shop_id', 'month', 'year'], as_index=False)['revenue'].sum()
You can, if you wish, query for January 2013 (before or after the above operations):
res = res.query('month == 1 & year == 2013')
print(res)
shop_id month year revenue
0 25 1 2013 27.0
2 59 1 2013 25.0
I like filtering the dataframe first, to reduce the number of unnecessary calculations. Passing a lambda to assign makes the product run on the filtered rows only:
df.query('month == 1 and year == 2013')\
  .assign(all_revenue=lambda d: d.item_price * d.item_cnt_day)\
  .groupby(['shop_id','month','year'], as_index=False)['all_revenue'].sum()
Output:
shop_id month year all_revenue
0 25 1 2013 27.0
1 59 1 2013 25.0
Note: Because your column names are "friendly" (no spaces or special characters), you can use the query method. If that doesn't work with your column names, use boolean indexing instead:
df[(df['month'] == 1) & (df['year'] == 2013)]\
  .assign(all_revenue=lambda d: d.item_price * d.item_cnt_day)\
  .groupby(['shop_id','month','year'], as_index=False)['all_revenue'].sum()
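For reference, a runnable sketch of the filter-first variant on the sample rows from the question. It assumes month and year are stored as integers, which the question does not state.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "shop_id": [59, 59, 25, 25, 25],
        "item_price": [9.0, 8.0, 10.0, 17.0, 10.0],
        "item_cnt_day": [1.0, 2.0, 4.0, 1.0, 1.0],
        "day": [2, 2, 5, 6, 15],
        "month": [1, 1, 2, 1, 1],
        "year": [2013, 2013, 2013, 2013, 2013],
    }
)

# Keep only January 2013 before doing any arithmetic.
january = df[(df["month"] == 1) & (df["year"] == 2013)]

res = (
    january
    # The lambda receives the filtered frame, so only its rows are multiplied.
    .assign(all_revenue=lambda d: d["item_price"] * d["item_cnt_day"])
    .groupby(["shop_id", "month", "year"], as_index=False)["all_revenue"]
    .sum()
)
print(res)
```

If the month and year columns are actually strings such as "01", the comparison values need to be strings as well.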

python - Fill in missing dates with respect to a specific attribute in pandas

My data looks like below:
id, date, target
1,2016-10-24,22
1,2016-10-25,31
1,2016-10-27,44
1,2016-10-28,12
2,2016-10-21,22
2,2016-10-22,31
2,2016-10-25,44
2,2016-10-27,12
I want to fill in the missing dates within each id.
For example, the date range of id=1 is 2016-10-24 ~ 2016-10-28, and 2016-10-26 is missing. Moreover, the date range of id=2 is 2016-10-21 ~ 2016-10-27, and 2016-10-23, 2016-10-24 and 2016-10-26 are missing.
I want to fill in the missing dates and fill in the target value as 0.
Therefore, I want my data to be as below:
id, date, target
1,2016-10-24,22
1,2016-10-25,31
1,2016-10-26,0
1,2016-10-27,44
1,2016-10-28,12
2,2016-10-21,22
2,2016-10-22,31
2,2016-10-23,0
2,2016-10-24,0
2,2016-10-25,44
2,2016-10-26,0
2,2016-10-27,12
Can somebody help me?
Thanks in advance.
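You can use groupby with resample. Calling fillna directly on the resampler does not fill with a constant, so call asfreq first to materialise the missing dates as NaN and then fillna(0):
#if necessary convert to datetime
df.date = pd.to_datetime(df.date)
df = df.set_index('date')
#'D' is the daily frequency alias
df = df.groupby('id').resample('D')['target'].asfreq().fillna(0).astype(int).reset_index()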
print (df)
id date target
0 1 2016-10-24 22
1 1 2016-10-25 31
2 1 2016-10-26 0
3 1 2016-10-27 44
4 1 2016-10-28 12
5 2 2016-10-21 22
6 2 2016-10-22 31
7 2 2016-10-23 0
8 2 2016-10-24 0
9 2 2016-10-25 44
10 2 2016-10-26 0
11 2 2016-10-27 12
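An alternative sketch that avoids resample: reindex each id against its own full daily range. It assumes each id has at most one row per date (reindex raises on duplicate index labels) and rebuilds the sample data from the question.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "id": [1, 1, 1, 1, 2, 2, 2, 2],
        "date": pd.to_datetime(
            ["2016-10-24", "2016-10-25", "2016-10-27", "2016-10-28",
             "2016-10-21", "2016-10-22", "2016-10-25", "2016-10-27"]
        ),
        "target": [22, 31, 44, 12, 22, 31, 44, 12],
    }
)

pieces = []
for key, group in df.groupby("id"):
    # Every calendar day between this id's first and last date.
    full_range = pd.date_range(group["date"].min(), group["date"].max(), freq="D")
    # Days absent from the data get target 0.
    target = group.set_index("date")["target"].reindex(full_range, fill_value=0)
    pieces.append(
        pd.DataFrame({"id": key, "date": full_range, "target": target.to_numpy()})
    )

filled = pd.concat(pieces, ignore_index=True)
print(filled)
```

Because fill_value is supplied at reindex time, the target column stays an integer and no astype(int) step is needed.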

pandas between statement for two dates from another dataset

What formula can I use to determine based off of a certain date the pay period number that falls between a start date and end date? In other words, from df1 I have a date that I need to compare against df2 start dates and end dates and then produce the pay period that those dates fall within in a new data set.
The formula I've used so far is the following. Keep in mind I'm not a seasoned programmer:
1st try:
def calc(row):
    if pf2(row['actn_dt']) >= pp_calendar([0], and pf2(row['actn_dt']) <= df2([1]):
        return df2([2,3])
2nd try:
pf2['pay'] = np.where (pf2['actn_dt'] >= df2[0]) | (pf2['actn_dt'] <= pf2[1]), pp_calendar[2]
3rd try:
def calc(row):
    if pf2(row['actn_dt']) >= df2(row[1]) | pf2(row['actn_dt']) <= df2(row[2]):
        return df2(row[3])
pf2['pay'] = pf2.apply (lambda row: calc(row), axis=1)
print df:
actn_dt
16 2008-09-28 00:00:00
17 2008-03-16 00:00:00
18 2009-08-30 00:00:00
43 2008-06-22 00:00:00
89 2009-08-16 00:00:00
106 2009-03-29 00:00:00
244 2009-08-30 00:00:00
371 2009-09-13 00:00:00
400 2009-07-19 00:00:00
439 2007-12-23 00:00:00
print df2:
START_DATE END_DATE PAY_PERIOD CALENDAR_YEAR
0 2008-09-28 2008-10-11 10 2008
1 2008-03-16 2008-03-16 06 2008
2 2009-08-30 2009-09-12 18 2009
3 2008-06-22 2008-06-22 13 2008
4 2009-03-29 2009-04-11 07 2009
Expected Result:
actn_dt START_DATE END_DATE PAY_PERIOD CALENDAR_YEAR
16 2008-09-28 2008-09-28 2008-10-11 10 2008
17 2008-03-16 2008-03-16 2008-03-29 06 2008
18 2009-08-30 2009-08-30 2009-09-12 18 2009
43 2008-06-22 2008-06-22 2008-07-05 13 2008
89 2009-08-16 2009-08-16 2008-08-29 17 2009
106 2009-03-29 2009-03-29 2009-04-11 07 2009
244 2009-08-30 2009-08-30 2009-09-12 18 2009
Thank you for your knowledge and time!
Headline solution
def find_pay_period(date, df):
    df = df[(df.START_DATE <= date) & (date <= df.END_DATE)].iloc[0, :]
    df['actn_dt'] = date
    return df
df1.actn_dt.apply(lambda x: find_pay_period(x, df2))
Explanation
Start with the apply call:
# apply() will take date in df1 and find the first row in df2
# such that the date is between START_DATE and END_DATE and
# then return the row.
df1.actn_dt.apply(lambda x: find_pay_period(x, df2))
Now the helper:
def find_pay_period(date, df):
    # df[] use boolean mask to filter
    # .iloc[0, :] grabs first row of filtered DataFrame.
    # Keep in mind this is a Series.
    df = df[(df.START_DATE <= date) & (date <= df.END_DATE)].iloc[0, :]
    # add back your date
    df['actn_dt'] = date
    return df
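The apply-based solution raises an IndexError as soon as one date falls outside every pay period. A minimal sketch of a join-based alternative using pd.merge_asof, assuming the pay periods do not overlap. The frames below are a hypothetical subset invented for illustration. Each date is first matched to the latest START_DATE at or before it, and rows whose date lies past that period's END_DATE are then dropped.

```python
import pandas as pd

# Hypothetical subset of the frames in the question.
df1 = pd.DataFrame(
    {"actn_dt": pd.to_datetime(
        ["2008-09-28", "2009-09-01", "2009-05-01", "2007-12-23"])},
    index=[16, 18, 106, 439],
)
df2 = pd.DataFrame(
    {
        "START_DATE": pd.to_datetime(["2008-09-28", "2009-08-30", "2009-03-29"]),
        "END_DATE": pd.to_datetime(["2008-10-11", "2009-09-12", "2009-04-11"]),
        "PAY_PERIOD": ["10", "18", "07"],
        "CALENDAR_YEAR": [2008, 2009, 2009],
    }
)

# merge_asof needs both sides sorted on their join keys.
left = df1.reset_index().sort_values("actn_dt")
right = df2.sort_values("START_DATE")

# For each actn_dt, take the last period that started on or before it.
merged = pd.merge_asof(left, right, left_on="actn_dt", right_on="START_DATE")

# Discard dates past the matched period's end (or with no match at all;
# comparisons against NaT evaluate to False).
in_period = merged["actn_dt"] <= merged["END_DATE"]
matched = merged[in_period].set_index("index").sort_index()
print(matched)
```

Replacing the final filter with merged.where(in_period) style masking would keep unmatched dates as empty rows instead of dropping them, if that is the desired output.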
