I’ve got a dataframe like this one:
df = pd.DataFrame({"ID": [123214, 123214, 321455, 321455, 234325, 234325, 234325, 234325, 132134, 132134, 132134],
"DATETIME": ["2020-05-28", "2020-06-12", "2020-01-06", "2020-01-10", "2020-01-11", "2020-02-06", "2020-07-24", "2020-10-14", "2020-03-04", "2020-09-11", "2020-10-17"],
"CATEGORY": ["computer technology", "early childhood", "early childhood", "shoes and bags", "early childhood", "garden and gardening", "musical instruments", "handmade products", "musical instruments", "early childhood", "beauty"]})
I’d like to:
Group by ID
Where CATEGORY == "early childhood" (input), select the next item bought (next row)
The result should be:
321455 "2020-01-10" "shoes and bags"
234325 "2020-02-06" "garden and gardening"
132134 "2020-10-17" "beauty"
The shift function for Pandas is what I need but I can’t make it work while grouping.
Thanks!
You can create a mask by testing CATEGORY with Series.eq, shifting it within each ID group with GroupBy.shift (using fill_value=False so the first row of each group gets False instead of a missing value), and passing the mask to boolean indexing:
#if necessary convert to datetimes and sorting
#df['DATETIME'] = pd.to_datetime(df['DATETIME'])
#df = df.sort_values(['ID','DATETIME'])
mask = df['CATEGORY'].eq('early childhood').groupby(df['ID']).shift(fill_value=False)
df = df[mask]
print(df)
ID DATETIME CATEGORY
3 321455 2020-01-10 shoes and bags
5 234325 2020-02-06 garden and gardening
10 132134 2020-10-17 beauty
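As a side note, the same result can also be read off by shifting the other columns backwards within each group instead of shifting the mask; a minimal sketch (not part of the answer above, and starting from the original df before the boolean indexing), which keeps the index of the 'early childhood' row rather than the row that follows it:
# Sketch: pull the *next* purchase onto each 'early childhood' row.
nxt = df.groupby('ID')[['DATETIME', 'CATEGORY']].shift(-1)
alt = pd.concat([df['ID'], nxt], axis=1)
alt = alt[df['CATEGORY'].eq('early childhood')].dropna(subset=['CATEGORY'])
print(alt)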
Related
I have a dataframe that I would like to split into multiple dataframes using the value in my Date column. Ideally, I would like to split my dataframe by decades. Do I need to use the np.array_split method, or is there a method that does not require NumPy?
My Dataframe looks like a larger version of this:
Date Name
0 1746-06-02 Borcke (#p1)
1 1746-09-02 Jordan (#p31)
2 1747-06-02 Sa Majesté (#p32)
3 1752-01-26 Maupertuis (#p4)
4 1755-06-02 Jordan (#p31)
And so I would ideally want in this scenario two data frames like these:
Date Name
0 1746-06-02 Borcke (#p1)
1 1746-09-02 Jordan (#p31)
2 1747-06-02 Sa Majesté (#p32)
Date Name
0 1752-01-26 Maupertuis (#p4)
1 1755-06-02 Jordan (#p31)
Building on mozway's answer for getting the decades.
d = {
"Date": [
"1746-06-02",
"1746-09-02",
"1747-06-02",
"1752-01-26",
"1755-06-02",
],
"Name": [
"Borcke (#p1)",
"Jordan (#p31)",
"Sa Majesté (#p32)",
"Maupertuis (#p4)",
"Jord (#p31)",
],
}
import pandas as pd
import math
df = pd.DataFrame(d)
df["years"] = df['Date'].str.extract(r'(^\d{4})', expand=False).astype(int)
df["decades"] = (df["years"] / 10).apply(math.floor) *10
dfs = [g for _,g in df.groupby(df['decades'])]
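A quick sanity check of what this produces on the sample data (groupby returns the groups in ascending decade order):
print(len(dfs))                 # 2 -> one frame for the 1740s, one for the 1750s
print(dfs[0]['Name'].tolist())
# ['Borcke (#p1)', 'Jordan (#p31)', 'Sa Majesté (#p32)']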
Using groupby, you can generate a list of DataFrames:
dfs = [g for _, g in df.groupby(df['Date'].str.extract(r'(^\d{3})', expand=False))]
Or, validating the dates:
dfs = [g for _,g in df.groupby(pd.to_datetime(df['Date']).dt.year//10)]
If you prefer a dictionary for indexing by decade:
dfs = dict(list(df.groupby(pd.to_datetime(df['Date']).dt.year//10*10)))
NB. I initially missed that you wanted decades, not years. I updated the answer. The logic remains unchanged.
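For completeness, a small usage sketch of the dictionary variant above (assuming the sample data): the keys are the decade start years, so a decade can be looked up directly.
dfs = dict(list(df.groupby(pd.to_datetime(df['Date']).dt.year // 10 * 10)))
print(sorted(dfs))                  # [1740, 1750]
print(dfs[1740]['Name'].tolist())
# ['Borcke (#p1)', 'Jordan (#p31)', 'Sa Majesté (#p32)']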
Why does this for loop not work?
I want to get a new column, Delivery Year, derived from these columns. However, there are a lot of NaNs, so the logic is that the for loop goes through the columns and returns the first non-NaN value. The preferred source is Delivery Date; when that is missing, fall back to Build Year; and if even that is missing, use at least In Service Date, when the machine was put into service.
df = pd.DataFrame({'Platform ID' : [1,2,3,4], "Delivery Date" : [str(2009), float("nan"), float("nan"), float("nan")],
"Build Year" : [float("nan"),str(2009),float("nan"), float("nan")],
"In Service Date" : [float("nan"),str("14-11-2010"), str("14-11-2009"), float("nan")]})
df.dtypes
df
def delivery_year(delivery_year, build_year, service_year):
    out = []
    for i in range(0, len(delivery_year)):
        if delivery_year.notna():
            out[i].append(delivery_year)
        if (delivery_year[i].isna() and build_year[i].notna()):
            out[i].append(build_year)
        elif build_year[i].isna():
            out[i].append(service_year.str.strip().str[-4:])
        else:
            out[i].append(float("nan"))
    return out
df["Delivery Year"] = delivery_year(df["Delivery Date"], df["Build Year"], df["In Service Date"])
When I run this function I get this error and I do not know why...
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
The expected output (column Delivery Year):
Update 3
I rewrote your function in the same manner as yours, without changing the logic or the types of your columns. This lets you compare the two versions:
def delivery_year(delivery_date, build_year, service_year):
    out = []
    for i in range(len(delivery_date)):
        if pd.notna(delivery_date[i]):
            out.append(delivery_date[i])
        elif pd.isna(delivery_date[i]) and pd.notna(build_year[i]):
            out.append(build_year[i])
        elif pd.isna(build_year[i]) and pd.notna(service_year[i]):
            out.append(service_year[i].strip()[-4:])
        else:
            out.append(float("nan"))
    return out
df["Delivery Year"] = delivery_year(df["Delivery Date"],
df["Build Year"],
df["In Service Date"])
Notes:
I changed the name of your first parameter because delivery_year is also the name of your function, so it can be confusing.
I also replaced the .isna() and .notna() methods with their equivalent functions: pd.isna(...) and pd.notna(...).
The second if became elif.
Update 2
Use combine_first to replace your function. combine_first fills the NaN values of the first series ('Delivery Date') with the values of the second series. You can chain calls to fill your 'Delivery Year'.
df['Delivery Year'] = df['Delivery Date'] \
.combine_first(df['Build Year']) \
.combine_first(df['In Service Date'].str[-4:])
Output:
>>> df
Platform ID Delivery Date Build Year In Service Date Delivery Year
0 1 2009 NaN NaN 2009
1 2 NaN 2009 14-11-2010 2009
2 3 NaN NaN 14-11-2009 2009
3 4 NaN NaN NaN NaN
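As a side note, a similar one-pass result can be obtained with DataFrame.bfill across the columns instead of chaining combine_first; a minimal sketch under the same assumptions (this is an alternative, not the combine_first approach above, and the In Service Date is trimmed to its year first):
# Sketch: back-fill across the three candidate columns and keep the first one.
tmp = pd.concat(
    [df['Delivery Date'], df['Build Year'], df['In Service Date'].str[-4:]],
    axis=1)
df['Delivery Year'] = tmp.bfill(axis=1).iloc[:, 0]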
Update
You forgot the [i]:
if delivery_year[i].notna():
The truth value of a Series is ambiguous:
>>> delivery_year.notna()
0 True # <- 2009
1 False # <- NaN
2 False
3 False
Name: Delivery Date, dtype: bool
Should Pandas consider the series True (because of the 2009) or False (because of the NaN values)?
You have to aggregate the result with .any() or .all():
>>> delivery_year.notna().any()
True # because there is at least one non nan-value.
>>> delivery_year.notna().all()
False # because all values are not nan.
One of the reasons for the error is that although your columns Delivery Date, Build Year and In Service Date are of type object, the NaN values in them are of type float (see screenshot below).
One of the ways to solve this would be to convert the three columns into str type:
df["Delivery Date"] = df["Delivery Date"].astype(str)
df["Build Year"] = df["Build Year"].astype(str)
df["In Service Date"] = df["In Service Date"].astype(str)
And then I have modified your function as follows:
def delivery_year(delivery_year, build_year, service_year):
    out = []
    for i in range(0, len(delivery_year)):
        if len(delivery_year[i]) >= 4:
            out.append(delivery_year[i])
        elif (len(delivery_year[i]) < 4) & (len(build_year[i]) >= 4):
            out.append(build_year[i])
        elif (len(build_year[i]) < 4 and len(service_year[i]) >= 4):
            out.append(service_year[i].split("-")[-1])
        else:
            out.append(float("nan"))
    return out
df["Delivery Year"] = delivery_year(df["Delivery Date"], df["Build Year"], df["In Service Date"])
I am checking whether the length is at least 4 because, after the astype(str) conversion, the NaN values become the string "nan", whose length is 3, while a real year has length 4. This returns the desired additional column, as shown in the attached screenshot.
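For reference, a tiny runnable illustration of that conversion behaviour (a sketch, not part of the original answer):
import pandas as pd

s = pd.Series([float("nan"), "2009"]).astype(str)
print(s.tolist())            # ['nan', '2009']
print(s.str.len().tolist())  # [3, 4]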
Your error message contains the solution, as far as I can see.
If you want to trigger the if statements on the first occurrence of non-NaN values, use .any() like this:
if delivery_year.notna().any():
out[i].append(delivery_year)
You have to specify whether you want 'any' value or 'all' values from the specific columns to satisfy the condition.
:)
I have the following DataFrame, and I'm trying to find, for each customer, the first date (in ascending order) on which the flag column equals Y.
df = pd.DataFrame({
    "customer_key": ["1", "1", "1", "2", "2", "2"],
    "date": ["2020-09-30", "2020-01-31", "2020-06-30", "2020-01-31", "2020-02-29", "2020-03-31"],
    "flag": ["Y", "N", "Y", "N", "N", "Y"]
})
Results expected:
For customer 1 it would be 2020-06-30.
For customer 2 it would be 2020-03-31.
So first I'm sorting by the date.
df.sort_values('date', inplace=True)
Here is where I get stuck: I know I need to group by customer key and then find the first occurrence where the flag equals Y; I'm just not sure how to do this pythonically.
df['first_occurence_date'] = df.groupby(by='customer_key') ## i dunno...
Try with
out = df.loc[df['flag'].eq('Y')].groupby('customer_key').date.min()
customer_key
1 2020-06-30
2 2020-03-31
Name: date, dtype: object
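If you want the first Y date back as a column on the original frame (as in your first_occurence_date attempt), a minimal sketch using a grouped transform; converting the dates with pd.to_datetime is an extra assumption here, added so the minimum ignores the masked-out rows:
# Keep only the flagged dates, then broadcast the per-customer minimum to every row.
flagged = pd.to_datetime(df['date']).where(df['flag'].eq('Y'))
df['first_occurence_date'] = flagged.groupby(df['customer_key']).transform('min')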
I have been attempting to solve a problem for hours and am stuck on it. Here is the problem outline:
import numpy as np
import pandas as pd
df = pd.DataFrame({'orderid': [10315, 10318, 10321, 10473, 10621, 10253, 10541, 10645],
'customerid': ['ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'HANAR', 'HANAR', 'HANAR'],
'orderdate': ['1996-09-26', '1996-10-01', '1996-10-03', '1997-03-13', '1997-08-05', '1996-07-10', '1997-05-19', '1997-08-26']})
df
orderid customerid orderdate
0 10315 ISLAT 1996-09-26
1 10318 ISLAT 1996-10-01
2 10321 ISLAT 1996-10-03
3 10473 ISLAT 1997-03-13
4 10621 ISLAT 1997-08-05
5 10253 HANAR 1996-07-10
6 10541 HANAR 1997-05-19
7 10645 HANAR 1997-08-26
I would like to select all the customers who have ordered items more than once WITHIN 5 DAYS.
For example, here only one customer (ISLAT) ordered within a 5-day period, and he has done it twice.
I would like to get the output in the following format:
Required Output
customerid initial_order_id initial_order_date nextorderid nextorderdate daysbetween
ISLAT 10315 1996-09-26 10318 1996-10-01 5
ISLAT 10318 1996-10-01 10321 1996-10-03 2
First, to be able to count the difference in days, convert the orderdate column to datetime:
df.orderdate = pd.to_datetime(df.orderdate)
Then define the following function:
def fn(grp):
    return grp[(grp.orderdate.shift(-1) - grp.orderdate) / np.timedelta64(1, 'D') <= 5]
And finally apply it:
df.sort_values(['customerid', 'orderdate']).groupby('customerid').apply(fn)
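If you also need the layout from the "Required Output" section (next order id/date and the day difference as separate columns), a minimal sketch building on the same grouped shift idea; it assumes orderdate has already been converted with pd.to_datetime, and the initial_order_id / nextorderid names are taken from the question:
s = df.sort_values(['customerid', 'orderdate'])
g = s.groupby('customerid')
s = s.assign(nextorderid=g['orderid'].shift(-1),
             nextorderdate=g['orderdate'].shift(-1))
s['daysbetween'] = (s['nextorderdate'] - s['orderdate']).dt.days
result = s[s['daysbetween'] <= 5].rename(
    columns={'orderid': 'initial_order_id', 'orderdate': 'initial_order_date'})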
It is a bit tricky because there can be any number of purchase pairs within 5-day windows. It is a good use case for merge_asof, which allows you to do approximate-but-not-exact matching of a dataframe with itself.
Input data
import pandas as pd
df = pd.DataFrame({'orderid': [10315, 10318, 10321, 10473, 10621, 10253, 10541, 10645],
'customerid': ['ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'HANAR', 'HANAR', 'HANAR'],
'orderdate': ['1996-09-26', '1996-10-01', '1996-10-03', '1997-03-13', '1997-08-05', '1996-07-10', '1997-05-19', '1997-08-26']})
Define a function that computes the pairs of purchases, given data for a customer.
def compute_purchase_pairs(df):
    # Approximate self join on the date, but not exact
    df_combined = pd.merge_asof(df, df, left_index=True, right_index=True,
                                suffixes=('_first', '_second'),
                                allow_exact_matches=False)
    # Compute difference
    df_combined['timedelta'] = df_combined['orderdate_first'] - df_combined['orderdate_second']
    return df_combined
Do the preprocessing and compute the pairs
# Convert to datetime
df['orderdate'] = pd.to_datetime(df['orderdate'])
# Sort dataframe from oldest to newest purchase (merge_asof needs ascending keys,
# and groupby will preserve this order)
df2 = df.sort_values(by='orderdate')
# Create an index for joining
df2 = df2.set_index('orderdate', drop=False)
# Compute purchase pairs for each customer
df_differences = df2.groupby('customerid').apply(compute_purchase_pairs)
# Show only the ones we care about
result = df_differences[df_differences['timedelta'].dt.days<=5]
result.reset_index(drop=True)
Result
orderid_first customerid_first orderdate_first orderid_second \
0 10318 ISLAT 1996-10-01 10315.0
1 10321 ISLAT 1996-10-03 10318.0
customerid_second orderdate_second timedelta
0 ISLAT 1996-09-26 5 days
1 ISLAT 1996-10-01 2 days
You can create the column 'daysbetween' with sort_values and diff. Then, to get the layout below, you can join df with a grouped-by-customerid, shifted copy of itself. Finally, query the rows where the condition on 'daysbetween_next' is met:
# assumes orderdate has already been converted with pd.to_datetime
df['daysbetween'] = df.sort_values(['customerid', 'orderdate'])['orderdate'].diff().dt.days

df_final = df.join(df.groupby('customerid').shift(-1),
                   lsuffix='_initial', rsuffix='_next')\
             .drop('daysbetween_initial', axis=1)\
             .query('daysbetween_next <= 5 and daysbetween_next >= 0')
It's quite simple. Let's write down the requirements one at a time and build on them.
First, I guess that the customer has a unique id since it's not specified. We'll use that id for identifying customers.
Second, I assume it does not matter if the customer bought 5 days before or after.
My solution is to use a simple filter. Note that this solution could also be implemented in a SQL database.
As a condition, we require the user to be the same. We can achieve this as follows:
new_df = df[df["ID"] == df["ID"].shift(1)]
We create a new DataFrame, namely new_df, with all rows such that the x-th row has the same user ID as the (x-1)-th row (i.e. the previous row).
Now, let's search for purchases within the 5 days, by adding the condition to the previous piece of code
new_df = df[df["ID"] == df["ID"].shift(1) & (df["Date"] - df["Date"].shift(1)) <= 5]
This should do the work. I cannot test it right now, so some fixes may be needed. I'll try to test it as soon as I can.
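Applied to the column names actually used in the question, a sketch of the same filter (assuming orderdate is a datetime column):
df = df.sort_values(['customerid', 'orderdate'])
same_customer = df['customerid'].eq(df['customerid'].shift(1))
within_5_days = (df['orderdate'] - df['orderdate'].shift(1)).dt.days <= 5
new_df = df[same_customer & within_5_days]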
I have a dataset whose first three columns are Basket ID (a unique identifier), Sale amount (dollars), and the date of the transaction. I want to calculate the following columns for each row of the dataset, and I would like to do it in Python:
Previous Sale of the same basket (if any); Sale Count to date for the current basket; Mean To Date for the current basket (if available); Max To Date for the current basket (if available)
Basket Sale Date PrevSale SaleCount MeanToDate MaxToDate
88 $15 3/01/2012 1
88 $30 11/02/2012 $15 2 $23 $30
88 $16 16/08/2012 $30 3 $20 $30
123 $90 18/06/2012 1
477 $77 19/08/2012 1
477 $57 11/12/2012 $77 2 $67 $77
566 $90 6/07/2012 1
I'm pretty new to Python, and I'm struggling to find a clean way to do this. I've sorted the data (as above) by BasketID and Date, so I can get the previous sale in bulk by shifting forward by one within each basket. But I have no clue how to get MeanToDate and MaxToDate efficiently apart from looping... any ideas?
This should do the trick:
from pandas import concat
from pandas.stats.moments import expanding_mean, expanding_count
def handler(grouped):
    se = grouped.set_index('Date')['Sale'].sort_index()
    # se is the (ordered) time series of sales restricted to a single basket
    # we can now create a dataframe by combining different metrics
    # pandas has a function for each of the ones you are interested in!
    return concat(
        {
            'MeanToDate': expanding_mean(se),   # cumulative mean
            'MaxToDate': se.cummax(),           # cumulative max
            'SaleCount': expanding_count(se),   # cumulative count
            'Sale': se,                         # simple copy
            'PrevSale': se.shift(1)             # previous sale
        },
        axis=1
    )
# we then apply this handler to all the groups and pandas combines them
# back into a single dataframe indexed by (Basket, Date)
# we simply need to reset the index to get the shape you mention in your question
new_df = df.groupby('Basket').apply(handler).reset_index()
You can read more about grouping/aggregating in the pandas groupby documentation.
import pandas as pd
pd.__version__ # u'0.24.2'
from pandas import concat
def handler(grouped):
    se = grouped.set_index('Date')['Sale'].sort_index()
    return concat(
        {
            'MeanToDate': se.expanding().mean(),   # cumulative mean
            'MaxToDate': se.expanding().max(),     # cumulative max
            'SaleCount': se.expanding().count(),   # cumulative count
            'Sale': se,                            # simple copy
            'PrevSale': se.shift(1)                # previous sale
        },
        axis=1
    )
###########################
from datetime import datetime
df = pd.DataFrame({'Basket': [88, 88, 88, 123, 477, 477, 566],
                   'Sale': [15, 30, 16, 90, 77, 57, 90],
                   'Date': [datetime.strptime(ds, '%d/%m/%Y')
                            for ds in ['3/01/2012', '11/02/2012', '16/08/2012', '18/06/2012',
                                       '19/08/2012', '11/12/2012', '6/07/2012']]})
#########
new_df = df.groupby('Basket').apply(handler).reset_index()
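As a side note, the same columns can also be built without apply, using grouped cumulative operations; a minimal sketch assuming the df constructed above:
ordered = df.sort_values(['Basket', 'Date'])
g = ordered.groupby('Basket')['Sale']

ordered['PrevSale'] = g.shift(1)               # previous sale within the basket
ordered['SaleCount'] = g.cumcount() + 1        # running count
ordered['MaxToDate'] = g.cummax()              # running max
ordered['MeanToDate'] = (g.expanding().mean()  # running mean
                          .reset_index(level=0, drop=True))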