I'm wondering how to optimize part of my code by removing a loop that takes forever, since I have around 350,000 IDs.
Here is the current code, which is not optimal and takes quite a while.
I'm trying to make it faster and, if possible, remove the loop.
The dataset has 4 columns: ID, start_date, end_date and value. Several rows can share the same ID but have different values. The main issue is that in some rows the dates are missing. In that case we have to find the earliest start_date and the latest end_date for that ID and fill them into the row where they are missing.
ID start_date end_date value
ABC 12/10/2010 12/12/2020 8
ABC 01/01/2020 01/04/2021 9
ABC 43
BCD 14/02/2020 14/03/2020 8
So the third row should get start_date 12/10/2010 and end_date 01/04/2021. It isn't visible in this example, but keep in mind that BCD's start_date could be earlier than ABC's; you would still use 12/10/2010, because the dates are tied to the ID.
for x in df['ID'].unique():
    tmp = df.loc[df['ID'] == x].reset_index()
    df.loc[(df['ID'] == x) & (df['start_date'].isna()), 'start_date'] = tmp['start_date'].min()
    df.loc[(df['ID'] == x) & (df['end_date'].isna()), 'end_date'] = tmp['end_date'].max()
I hope the code makes clear what I am trying to do.
If you have any questions, don't hesitate to post them; I'll do my best to answer.
set up the job
import pandas as pd
data = { 'ID': ['ABC','ABC','ABC','BCD'], 'start_date' : ['12/10/2010', '01/01/2020',None ,'14/02/2020'], 'end_date': ['12/12/2020', '01/01/2021',None ,'14/03/2020'], 'value': [8,9,43,8]}
df = pd.DataFrame(data)
# the dates are day/month/year, so parse them day-first
df['start_date'] = pd.to_datetime(df['start_date'], dayfirst=True)
df['end_date'] = pd.to_datetime(df['end_date'], dayfirst=True)
we get this result
ID start_date end_date value
0 ABC 2010-10-12 2020-12-12 8
1 ABC 2020-01-01 2021-01-01 9
2 ABC NaT NaT 43
3 BCD 2020-02-14 2020-03-14 8
do the work
# group_keys=False keeps the original row index so the result aligns with df
df['start_date'] = df.groupby('ID', group_keys=False)['start_date'].apply(lambda x: x.fillna(x.min()))
df['end_date'] = df.groupby('ID', group_keys=False)['end_date'].apply(lambda x: x.fillna(x.max()))
we get this result
ID start_date end_date value
0 ABC 2010-10-12 2020-12-12 8
1 ABC 2020-01-01 2021-01-01 9
2 ABC 2010-10-12 2021-01-01 43
3 BCD 2020-02-14 2020-03-14 8
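As a possible alternative to the lambda-based apply, here is a minimal sketch using groupby(...).transform, which broadcasts each group's min/max back onto the original rows without a Python-level function per group. It assumes the dates are day/month/year strings (hence the explicit format) and uses the end date 01/04/2021 from the question's example; the names earliest and latest are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["ABC", "ABC", "ABC", "BCD"],
    "start_date": ["12/10/2010", "01/01/2020", None, "14/02/2020"],
    "end_date": ["12/12/2020", "01/04/2021", None, "14/03/2020"],
    "value": [8, 9, 43, 8],
})

# Assumption: dates are day/month/year; missing values become NaT.
for col in ["start_date", "end_date"]:
    df[col] = pd.to_datetime(df[col], format="%d/%m/%Y")

# Per-ID earliest start and latest end, aligned with the original rows.
earliest = df.groupby("ID")["start_date"].transform("min")
latest = df.groupby("ID")["end_date"].transform("max")

# Only the missing dates are filled; existing dates are left untouched.
df["start_date"] = df["start_date"].fillna(earliest)
df["end_date"] = df["end_date"].fillna(latest)

print(df)
```

Since there is no per-group Python callback here, this may scale better to a few hundred thousand IDs, though that is worth timing on the real data.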
Given data:
id date
1 10/20/2019
2 11/02/2019
3 12/12/2019
1 02/06/2019
1 05/14/2018
3 5/13/2019
2 07/20/2018
3 08/23/2019
2 06/25/2018
I want it in this format:
id date1 date2 date3
1 05/14/2018 02/06/2019 10/20/2019
2 06/25/2018 07/20/2018 11/02/2019
3 05/13/2019 08/23/2019 12/12/2019
I am using a for loop to do this on 400,000+ unique IDs and it is time-consuming. Is there an easier method?
Each policy number has multiple dates; I want them arranged from min to max in one row, in separate columns, as in the second table.
I am using this code:
f = pd.DataFrame()
for i in range(0, len(uni_pol)):
    d = ct.loc[ct["Policy_no"] == uni_pol[i]]
    t = d.sort_values("DATE", ascending=True).T
    df = pd.DataFrame(t)
    a = df.loc['Policy_no']
    col = df.columns
    df['Policy_no'] = a.loc[col[0]]
    for j in range(0, len(col)):
        nn = str(j + 1)
        name = "Paydt" + nn
        df[name] = df[col[j]]
        cc = col[j]
        df = df.drop([cc], axis=1)
        j = j + 1
    # note: DataFrame.append was removed in pandas 2.0 (pd.concat replaces it)
    f = f.append(df.loc['DATE'])
Here's one approach:
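Sort by "date" with sort_values, then groupby "id" and aggregate the dates into lists; this builds a Series of lists. Then create a DataFrame from those lists. The column names are derived from the length of the first list, so this assumes the first id has as many dates as any other id (for example, when every id has the same number of dates):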
df['date'] = pd.to_datetime(df['date'])
s = df.sort_values(by='date').groupby('id')['date'].agg(list)
out = pd.DataFrame(s.tolist(), index=s.index, columns=[f'date{i}' for i in range(1,len(s.iat[0])+1)]).reset_index()
Output:
id date1 date2 date3
0 1 2018-05-14 2019-02-06 2019-10-20
1 2 2018-06-25 2018-07-20 2019-11-02
2 3 2019-05-13 2019-08-23 2019-12-12
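If the ids do not all have the same number of dates, one option is to number the dates within each id and pivot on that number. The following is a sketch on the question's sample data, not the answer's method; the helper column n is an illustrative name, and the dates are assumed to be month/day/year.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 1, 1, 3, 2, 3, 2],
    "date": ["10/20/2019", "11/02/2019", "12/12/2019", "02/06/2019", "05/14/2018",
             "05/13/2019", "07/20/2018", "08/23/2019", "06/25/2018"],
})
# Assumption: dates are month/day/year.
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")

# Sort so that the running counter follows date order within each id.
df = df.sort_values(["id", "date"])
df["n"] = df.groupby("id").cumcount() + 1  # 1, 2, 3, ... per id

# One row per id, one column per position; ids with fewer dates get NaT.
out = (df.pivot(index="id", columns="n", values="date")
         .add_prefix("date")
         .reset_index())
print(out)
```

Here pivot pads shorter groups with NaT instead of requiring equal-length lists.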
I would like to calculate how many customers there were in each month of the past year. My dataframe contains the customer ID, a start date (when the customer became a customer) and an end date (when the customer stopped being a customer):
Customer_ID StartDate EndDate
1 01/01/2019 NAT
2 25/10/2017 01/06/2020
2 13/06/2012 15/07/2015
2 20/12/2015 03/01/2016
2 25/03/2016 14/06/2017
3 05/06/2018 05/06/2019
3 12/12/2019 NAT
The result I would like is the number of customers that were "active" per month-year combination:
MONTH YEAR NUMB_CUSTOMERS
01 2013 1
02 2013 1
03 2013 1
04 2013 1
...
01 2019 2
...
09 2020 2
I would like to avoid for-loops, as they take too long (I have a table of over 100,000 rows).
Does anyone have an idea how to do this neatly and quickly?
Thanks!
First, read the data and make it digestible for the program (keep only the month/year part of each date)
import pandas as pd
import datetime
df = pd.read_csv("table.csv")
func = lambda x: x.split('/', maxsplit=1)[1]
df["StartDate"] = df["StartDate"].apply(func)
mask = df["EndDate"] != "NAT"
df.loc[mask, "EndDate"] = df.loc[mask, "EndDate"].apply(func)
Then, count the changes in the number of clients per month (you basically get a derivative of your data)
customers_gained = df[["Customer_ID", "StartDate"]].groupby("StartDate").agg("count")
customers_lost = df[["Customer_ID", "EndDate"]].groupby("EndDate").agg("count")
customers_lost.drop("NAT",inplace=True)
Make a zero-filled monthly series to hold all changes in the number of clients
def make_time_table(start, end):
    start_date = datetime.datetime.strptime(start, "%d/%m/%Y")
    end_date = datetime.datetime.strptime(end, "%d/%m/%Y")
    data_range = pd.date_range(start_date, end_date, freq="M")
    string_range = [el.strftime("%m/%Y") for el in data_range]
    ser = pd.Series([0]*data_range.size, index=string_range)
    return ser
Next, write the changes into time_table and "integrate" them with a cumulative sum
time_table = make_time_table("01/01/2012", "01/12/2020")
time_table[customers_gained.index] = customers_gained["Customer_ID"]
time_table[customers_lost.index] -= customers_lost["Customer_ID"]
result = time_table.cumsum()
print(result)
Outputs:
01/2012 0
02/2012 0
03/2012 0
04/2012 0
05/2012 0
06/2012 1
07/2012 1
...
10/2019 2
11/2019 2
12/2019 3
01/2020 3
02/2020 3
03/2020 3
04/2020 3
05/2020 3
06/2020 2
07/2020 2
08/2020 2
09/2020 2
10/2020 2
11/2020 2
dtype: int64
table.csv
Customer_ID,StartDate,EndDate
1,01/01/2019,NAT
2,25/10/2017,01/06/2020
2,13/06/2012,15/07/2015
2,20/12/2015,03/01/2016
2,25/03/2016,14/06/2017
3,25/03/2016,05/06/2019
3,12/12/2019,NAT
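For comparison, here is a sketch of a different technique: a broadcast interval-overlap test on real datetimes instead of month/year strings. It uses the question's sample rows and rests on two assumptions that are not stated in the original: a customer counts as active in a month if any of their periods overlaps that month, and a missing EndDate means the customer is still active. The month range and all variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Customer_ID": [1, 2, 2, 2, 2, 3, 3],
    "StartDate": ["01/01/2019", "25/10/2017", "13/06/2012", "20/12/2015",
                  "25/03/2016", "05/06/2018", "12/12/2019"],
    "EndDate": [None, "01/06/2020", "15/07/2015", "03/01/2016",
                "14/06/2017", "05/06/2019", None],
})
start = pd.to_datetime(df["StartDate"], format="%d/%m/%Y")
end = pd.to_datetime(df["EndDate"], format="%d/%m/%Y")  # missing -> NaT

# Months to report on (illustrative range).
months = pd.period_range("2012-06", "2020-09", freq="M")
month_first = months.to_timestamp(how="start").to_numpy()
month_last = months.to_timestamp(how="end").to_numpy()

# Rows x months boolean matrix: does the period overlap the month?
started = start.to_numpy()[:, None] <= month_last
not_ended = end.isna().to_numpy()[:, None] | (end.to_numpy()[:, None] >= month_first)
active = started & not_ended

# Collapse rows to customers, then count active customers per month.
per_customer = pd.DataFrame(active, index=df["Customer_ID"], columns=months)
counts = per_customer.groupby(level=0).any().sum()
print(counts)
```

The intermediate matrix has one cell per (row, month) pair, so for 100,000 rows and a few years of months it stays in the low millions of booleans.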
I want to compute a rolling median of the price column over the current day and the 4 days before it, with the data grouped by date. So basically I want to take the prices for a given day plus all prices from the 4 preceding days and calculate the median of these values.
Here are the sample data:
id date price
1637027 2020-01-21 7045204.0
280955 2020-01-11 3590000.0
782078 2020-01-28 2600000.0
1921717 2020-02-17 5500000.0
1280579 2020-01-23 869000.0
2113506 2020-01-23 628869.0
580638 2020-01-25 650000.0
1843598 2020-02-29 969000.0
2300960 2020-01-24 5401530.0
1921380 2020-02-19 1220000.0
853202 2020-02-02 2990000.0
1024595 2020-01-27 3300000.0
565202 2020-01-25 3540000.0
703824 2020-01-18 3990000.0
426016 2020-01-26 830000.0
I got close with combining rolling and groupby:
df.groupby('date').rolling(window = 4, on = 'date')['price'].median()
But this seems to return one row per original row, and since a median of medians is not the overall median, I can't simply merge these rows into one result per date.
Result now looks like this:
date date
2020-01-10 2020-01-10 NaN
2020-01-10 NaN
2020-01-10 NaN
2020-01-10 3070000.0
2020-01-10 4890000.0
...
2020-03-11 2020-03-11 4290000.0
2020-03-11 3745000.0
2020-03-11 3149500.0
2020-03-11 3149500.0
2020-03-11 3149500.0
Name: price, Length: 389716, dtype: float64
It seems it just deleted 3 first values and then just printed price value.
Is it possible to get one lagged / moving median value per one date?
You can use rolling with a time-based window of 5 days ('5D') to cover the current day and the previous 4 days, then drop_duplicates to keep the last row per day. First create a copy (if you want to keep the original), sort it by date with sort_values, and make sure the date column is datetime:
#sort and change to datetime
df_f = df[['date','price']].copy().sort_values('date')
df_f['date'] = pd.to_datetime(df_f['date'])
#create the column rolling
df_f['price'] = df_f.rolling('5D', on='date')['price'].median()
#drop_duplicates and keep the last row per day
df_f = df_f.drop_duplicates(['date'], keep='last').reset_index(drop=True)
print (df_f)
date price
0 2020-01-11 3590000.0
1 2020-01-18 3990000.0
2 2020-01-21 5517602.0
3 2020-01-23 869000.0
4 2020-01-24 3135265.0
5 2020-01-25 2204500.0
6 2020-01-26 849500.0
7 2020-01-27 869000.0
8 2020-01-28 2950000.0
9 2020-02-02 2990000.0
10 2020-02-17 5500000.0
11 2020-02-19 3360000.0
12 2020-02-29 969000.0
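To see why keeping the last row per day matters, here is a small sketch on made-up prices (the four rows are illustrative, not from the question). With a time-based window, each row only sees the rows up to and including itself, so only the last row of a day has seen all of that day's prices.

```python
import pandas as pd

# Illustrative data: two prices on 1 Jan, one on 3 Jan, one on 10 Jan.
df = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-03", "2020-01-10"]),
    "price": [10.0, 30.0, 50.0, 70.0],
}).sort_values("date")

# '5D' covers the current day and the 4 calendar days before it.
df["median_5d"] = df.rolling("5D", on="date")["price"].median()

# The first row of 1 Jan has only seen one price; the last row has seen both.
daily = df.drop_duplicates("date", keep="last").set_index("date")["median_5d"]
print(daily)
```

The value for 10 Jan is based on that day's price alone, because the earlier rows fall outside the 5-day window.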
This is a step by step process. There are probably more efficient methods of getting what you want. Note, if you have time information for your dates, you would need to drop that information before grouping by date.
import pandas as pd
import statistics as stat
import numpy as np
# Replace with your data import
df = pd.read_csv('random_dates_prices.csv')
# Convert your date to a datetime
df['date'] = pd.to_datetime(df['date'])
# Sort your data by date
df = df.sort_values(by = ['date'])
# Create group by object
dates = df.groupby('date')
# Reformat dataframe for one row per day, with prices in a nested list
df = pd.DataFrame(dates['price'].apply(lambda s: s.tolist()))
# Extract price lists to a separate list
prices = df['price'].tolist()
# Initialize list to store past four days of prices for current day
four_days = []
# Loop over the prices list to combine the current and previous three dates into a single list
for i in range(3, len(prices), 1):
    x = i - 1
    y = i - 2
    z = i - 3
    four_days.append(prices[i] + prices[x] + prices[y] + prices[z])
# Initialize a list to store median values
medians = []
# Loop through the four_days list and calculate the median of the last four days for the current date
for i in range(len(four_days)):
    medians.append(stat.median(four_days[i]))
# Pad both lists with dummy zeros so their lengths match the data frame
four_days.insert(0, 0)
four_days.insert(0, 0)
four_days.insert(0, 0)
medians.insert(0, 0)
medians.insert(0, 0)
medians.insert(0, 0)
# Add both new lists to data frames
df['last_four_day_prices'] = four_days
df['last_four_days_median'] = medians
# Replace dummy zeros with np.nan
df[['last_four_day_prices', 'last_four_days_median']] = df[['last_four_day_prices', 'last_four_days_median']].replace(0, np.nan)
# Clean the data frame so it only has one row per date with the median value for the past four days
df_clean = df.drop(['price', 'last_four_day_prices'], axis=1)
I need to get the month-end balance from a series of entries.
Sample data:
date contrib totalShrs
0 2009-04-23 5220.00 10000.000
1 2009-04-24 10210.00 20000.000
2 2009-04-27 16710.00 30000.000
3 2009-04-30 22610.00 40000.000
4 2009-05-05 28909.00 50000.000
5 2009-05-20 38409.00 60000.000
6 2009-05-28 46508.00 70000.000
7 2009-05-29 56308.00 80000.000
8 2009-06-01 66108.00 90000.000
9 2009-06-02 78108.00 100000.000
10 2009-06-12 86606.00 110000.000
11 2009-08-03 95606.00 120000.000
The output would look something like this:
2009-04-30 40000
2009-05-31 80000
2009-06-30 110000
2009-07-31 110000
2009-08-31 120000
Is there a simple Pandas method?
I don't see how I can do this with something like a groupby?
Or would I have to do something like iterrows, find all the monthly entries, order them by date and pick the last one?
Thanks.
Use Grouper with GroupBy.last, forward filling missing values by ffill with Series.reset_index:
#if necessary
#df['date'] = pd.to_datetime(df['date'])
df = df.groupby(pd.Grouper(freq='m',key='date'))['totalShrs'].last().ffill().reset_index()
#alternative
#df = df.resample('m',on='date')['totalShrs'].last().ffill().reset_index()
print (df)
date totalShrs
0 2009-04-30 40000.0
1 2009-05-31 80000.0
2 2009-06-30 110000.0
3 2009-07-31 110000.0
4 2009-08-31 120000.0
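A variant that avoids frequency aliases for month end (their spelling has changed between pandas versions) is to group on monthly periods instead. This is a sketch on the question's sample data; the names month_end_bal and all_months are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime([
        "2009-04-23", "2009-04-24", "2009-04-27", "2009-04-30",
        "2009-05-05", "2009-05-20", "2009-05-28", "2009-05-29",
        "2009-06-01", "2009-06-02", "2009-06-12", "2009-08-03",
    ]),
    "totalShrs": [10000.0, 20000.0, 30000.0, 40000.0, 50000.0, 60000.0,
                  70000.0, 80000.0, 90000.0, 100000.0, 110000.0, 120000.0],
}).sort_values("date")

# Last balance recorded in each calendar month.
month_end_bal = df.groupby(df["date"].dt.to_period("M"))["totalShrs"].last()

# Add months without entries (July here) and carry the balance forward.
all_months = pd.period_range(month_end_bal.index.min(),
                             month_end_bal.index.max(), freq="M")
month_end_bal = month_end_bal.reindex(all_months).ffill()
print(month_end_bal)
```

The index holds monthly periods such as 2009-07 rather than month-end timestamps; to_timestamp(how="end") can convert them if actual dates are needed.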
The following gives you the information you want, i.e. the end-of-month values, though the format is not exactly what you asked for:
df['month'] = df['date'].str.split('-', expand = True)[1] # split date column to get month column
newdf = pd.DataFrame(columns=df.columns) # create a new dataframe for output
grouped = df.groupby('month') # get grouped values
for g in grouped: # for each group, get last row
    gdf = pd.DataFrame(data=g[1])
    newdf.loc[len(newdf),:] = gdf.iloc[-1,:] # fill new dataframe with last row obtained
newdf = newdf.drop('date', axis=1) # drop date column, since month column is there
print(newdf)
Output:
contrib totalShrs month
0 22610 40000 04
1 56308 80000 05
2 86606 110000 06
3 95606 120000 08
I'm trying to use Python Pandas to count daily returning visitors to my website over a time period.
Example data:
df1 = pd.DataFrame({'user_id':[1,2,3,1,3], 'date':['2012-09-29','2012-09-30','2012-09-30','2012-10-01','2012-10-01']})
print(df1)
date user_id
0 2012-09-29 1
1 2012-09-30 2
2 2012-09-30 3
3 2012-10-01 1
4 2012-10-01 3
What I'd like to have as final result:
df1_result = pd.DataFrame({'count_new':[1,2,0], 'date':['2012-09-29','2012-09-30','2012-10-01']})
print(df1_result)
count_new date
0 1 2012-09-29
1 2 2012-09-30
2 0 2012-10-01
In the first day there is 1 new user because user 1 appears for the first time.
In the second day there are 2 new users: user 2 and user 3 both appear for the first time.
Finally in the third day there are 0 new users: user 1 and user 3 have both already appeared in previous periods.
So far I have been looking into merging two copies of same dataframe and shifting one by a date, but without success:
pd.merge(df1, df1.user_id.shift(-date), on = 'date').groupby('date')['user_id_y'].nunique()
Any help would be much appreciated,
Thanks
>>> (df1
.groupby(['user_id'], as_index=False)['date'] # Group by `user_id` and get first date.
.first()
.groupby(['date']) # Group result on `date` and take counts.
.count()
.reindex(df1['date'].unique()) # Reindex on original dates.
.fillna(0)) # Fill null values with zero.
user_id
date
2012-09-29 1
2012-09-30 2
2012-10-01 0
It is better to add a new column Isreturning (in case you need to analyse returning customers in the future)
df['Isreturning']=df.groupby('user_id').cumcount()
Only show new customers
df.loc[df.Isreturning==0,:].groupby('date')['user_id'].count()
Out[840]:
date
2012-09-29 1
2012-09-30 2
Name: user_id, dtype: int64
Or, to keep the days with zero new customers as well:
df.groupby('date')['Isreturning'].apply(lambda x : len(x[x==0]))
Out[843]:
date
2012-09-29 1
2012-09-30 2
2012-10-01 0
Name: Isreturning, dtype: int64
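Another way to express the same idea is to find each user's first visit date and count how many users share each first date. This is a sketch on the question's data; first_seen and new_per_day are illustrative names, and it relies on the ISO-formatted date strings sorting chronologically.

```python
import pandas as pd

df1 = pd.DataFrame({
    "user_id": [1, 2, 3, 1, 3],
    "date": ["2012-09-29", "2012-09-30", "2012-09-30", "2012-10-01", "2012-10-01"],
})

# First date on which each user appears (ISO strings sort chronologically).
first_seen = df1.groupby("user_id")["date"].min()

# Number of users whose first visit falls on each date; days with no new users get 0.
all_days = sorted(df1["date"].unique())
new_per_day = first_seen.value_counts().reindex(all_days, fill_value=0)
print(new_per_day)
```

The reindex step is what brings back 2012-10-01, a day with visits but no first-time users.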