Filter df within same date range - python

I have a weekly report, where I need to find duplicates of IDs in one column (Customer ID) happening within the same date range (Submit Date).
import pandas as pd
from datetime import timedelta
report = pd.read_excel('report.xlsx', engine='openpyxl', parse_dates=['Submit Date'])
customer = report['Customer ID']
submit_date = report['Submit Date']
submit_date = submit_date.dt.date
submit_date.sort_index(inplace=True)
mask1 = customer.duplicated(keep=False) == True
report = report[mask1]
This part is easy: I get all the duplicated IDs. However, I'm not sure how I should approach limiting it to the same day, e.g. 2021-04-12, so that I get only the duplicates within that particular date. I tried writing a for loop with an if statement checking
if day < day + timedelta(days=1)
but that didn't bring any results. I could hard-code the dates and create a separate mask for every date of the week, but I'd like to keep the report automated.
Thanks in advance for all ideas!

I actually managed to sort it out:
while start_date <= end_date:
    # Rows submitted on the day currently being checked
    same_date_values = report.loc[start_date == submit_date]
    print(f" ################# DATE CHECKED: {start_date} #################")
    # Keep only the Customer IDs that occur more than once on that day
    mask1 = same_date_values['Customer ID'].duplicated(keep=False)
    same_date_values = same_date_values[mask1]
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    df = pd.concat([df, same_date_values])
    start_date += timedelta(days=1)
It's a single while loop that iterates through every day (12-04, 13-04, etc.) and breaks the report into smaller dataframes that share the same date.
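As an alternative to looping over days, the same result can probably be obtained in one step by treating the pair (Customer ID, calendar day) as the duplicate key. This is only a sketch: the sample data and the helper column name "Submit Day" are made up for illustration.

```python
import pandas as pd

# Hypothetical sample standing in for the weekly report.
report = pd.DataFrame({
    "Customer ID": [1, 1, 2, 2, 3, 1],
    "Submit Date": pd.to_datetime([
        "2021-04-12 09:00", "2021-04-12 15:30", "2021-04-12 10:00",
        "2021-04-13 11:00", "2021-04-13 12:00", "2021-04-13 08:00",
    ]),
})

# Calendar day of each submission, with the time of day dropped.
report["Submit Day"] = report["Submit Date"].dt.normalize()

# Keep rows whose (Customer ID, day) pair occurs more than once.
same_day_dupes = report[
    report.duplicated(subset=["Customer ID", "Submit Day"], keep=False)
]
print(same_day_dupes)
```

Because the day is part of the key, no start or end date has to be supplied, so the report stays automated from week to week.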

Related

how to filter date based on last circulated data

I would like to filter the data based on the date of the last report received (the working data is cleared from the path daily). I save the data in SQLite: I filter the dataframe by date (day-1) and append the result to the database. In some cases, however, the data is not received every day, and on those days filtering on day-1 is not enough, because two or more days of data may need to be filtered. For automation purposes, how can I filter the dates that have not yet been uploaded to the database? What I do now is take the day-of-month number of the date and subtract it from the last working date's number. This does not work at the very beginning of the month.
dt_yesday = datetime.strftime(datetime.now() - timedelta(1), '%Y-%m-%d')
wk_dt = int(datetime.strftime(datetime.now() - timedelta(1), '%e'))
day_1 = df[df['date'] == dt_yesday]
latest_report_data = str(df['date_number'].iloc[-1])
date_subtract = wk_dt - int(last_working_date.iloc[-1])
if date_subtract == 0:
    day_1 = df[df['date'] == dt_yesday]
elif date_subtract == 1:
    day_1 = df[df['date'] > dt_Last2Day]
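One way to avoid the day-number arithmetic might be to ask the database for the latest date it already holds and keep every row newer than that. The sketch below assumes dates are stored as ISO 'YYYY-MM-DD' text; the table name reports and the value column are hypothetical.

```python
import sqlite3
import pandas as pd

# Hypothetical database that already holds data up to 2021-04-28.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reports (date TEXT, value INTEGER)")
conn.execute("INSERT INTO reports VALUES ('2021-04-28', 10)")

# Hypothetical incoming data that spans a month boundary.
df = pd.DataFrame({
    "date": ["2021-04-28", "2021-04-29", "2021-04-30", "2021-05-01"],
    "value": [10, 11, 12, 13],
})

# Latest date already stored; ISO date strings sort chronologically.
last_loaded = conn.execute("SELECT MAX(date) FROM reports").fetchone()[0]

# Everything newer still has to be uploaded, however many days it covers.
new_rows = df if last_loaded is None else df[df["date"] > last_loaded]
new_rows.to_sql("reports", conn, if_exists="append", index=False)
```

Since the comparison is on full dates rather than day-of-month numbers, the start of a new month needs no special handling.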

Iterating a groupby datetime over several weeks

I'm trying to group my data by a week that I predefined using to_datetime and timedelta. However, after copying my script a few times, I was hoping there was a way to iterate this process over multiple weeks. Is this something that can be done?
The data set that I'm working with lists sales revenue and spend by day for each data source and its corresponding id.
Below is what I have so far but my knowledge of loops is pretty limited due to being self-taught.
Let me know if what I'm asking is feasible or if I have to continue to copy my code every week.
Code
import pandas as pd
from datetime import datetime, timedelta,date
startdate = '2021-09-26'
enddate = pd.to_datetime(startdate) + timedelta(days=6)
last7 = (df.date >= startdate) & (df.date <= enddate)
df = df.loc[last7,['datasource_name','id','revenue','spend']]
df.groupby(by=['datasource_name','id'],as_index=False).sum()
df['start_date'] = startdate
df['end_date'] = enddate
df
If I have understood your issue correctly, you are basically trying to aggregate daily data into weekly data. You can try the following code:
import datetime as dt
import pandas as pd
#Get weekend date for each date
df['week_end_date']=df['date'].apply(lambda x: pd.Period(x,freq='W').end_time.date().strftime('%Y-%m-%d'))
#Aggregate sales and revenue at weekly level
df_agg = df.groupby(['datasource_name','id','week_end_date']).agg({'revenue':'sum','spend':'sum'}).reset_index()
df_agg will have all your revenue and spend numbers aggregated by the week-ending date of the corresponding dates.
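For a larger data set, the row-by-row apply could likely be replaced with a vectorized period conversion. The sketch below assumes weeks run Sunday to Saturday, as the question's start date 2021-09-26 (a Sunday) plus six days suggests, hence the 'W-SAT' frequency; the sample rows and the week_end column name are invented.

```python
import pandas as pd

# Hypothetical daily data for one data source and id.
df = pd.DataFrame({
    "datasource_name": ["web", "web", "web", "web"],
    "id": [7, 7, 7, 7],
    "date": pd.to_datetime(["2021-09-26", "2021-09-28", "2021-10-02", "2021-10-03"]),
    "revenue": [1, 2, 3, 4],
    "spend": [10, 20, 30, 40],
})

# Map every date to the Saturday that closes its Sunday-to-Saturday week.
df["week_end"] = df["date"].dt.to_period("W-SAT").dt.end_time.dt.normalize()

# One row per data source, id and week, so no per-week copy of the script is needed.
weekly = df.groupby(["datasource_name", "id", "week_end"], as_index=False)[
    ["revenue", "spend"]
].sum()
print(weekly)
```

Every week present in the data gets its own row, so adding more weeks requires no changes to the code.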

Python: TypeError: Invalid comparison between dtype=datetime64[ns] and date

For a current project, I am planning to filter a JSON file by time ranges, running several loops, each time with a slightly shifted range. The code below, however, yields the error TypeError: Invalid comparison between dtype=datetime64[ns] and date for the line after_start_date = df["Date"] >= start_date.
I have already tried to modify the formatting of the dates both within the Python code as well as the corresponding JSON file. Is there any smart tweak to align the date types/formats?
The JSON file has the following format:
[
{"No":"121","Stock Symbol":"A","Date":"05/11/2017","Text Main":"Sample text"}
]
And the corresponding code looks like this:
import string
import json
import pandas as pd
import datetime
from dateutil.relativedelta import *
# Loading and reading dataset
file = open("Glassdoor_A.json", "r")
data = json.load(file)
df = pd.json_normalize(data)
df['Date'] = pd.to_datetime(df['Date'])
# Create an empty dictionary
d = dict()
# Filtering by date
start_date = datetime.date.fromisoformat('2017-01-01')
end_date = datetime.date.fromisoformat('2017-01-31')
for i in df.iterrows():
    start_date += relativedelta(months=+3)
    end_date += relativedelta(months=+3)
    print(start_date)
    print(end_date)
    after_start_date = df["Date"] >= start_date
    before_end_date = df["Date"] <= end_date
    between_two_dates = after_start_date & before_end_date
    filtered_dates = df.loc[between_two_dates]
    print(filtered_dates)
You can use pd.to_datetime('2017-01-31') instead of datetime.date.fromisoformat('2017-01-31').
I hope this helps!
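To illustrate the suggestion, here is a small sketch with made-up dates. It shows that the comparison works once both bounds are pandas Timestamps, whether they are built from a string or from an existing datetime.date.

```python
import datetime
import pandas as pd

# Hypothetical stand-in for the parsed "Date" column.
df = pd.DataFrame({"Date": pd.to_datetime(["2017-01-15", "2017-02-20", "2017-05-11"])})

# An existing datetime.date can be wrapped in pd.Timestamp ...
start_date = pd.Timestamp(datetime.date.fromisoformat("2017-01-01"))
# ... or the bound can be parsed directly by pandas.
end_date = pd.to_datetime("2017-01-31")

# Both sides are now datetime-like, so the comparison no longer raises.
in_range = df[(df["Date"] >= start_date) & (df["Date"] <= end_date)]
print(in_range)
```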
My general advice is not to use the datetime module here.
Rather, use built-in pandas methods and classes such as pd.to_datetime and pd.DateOffset.
You should also close the input file as soon as it is no longer needed, e.g.:
with open('Glassdoor_A.json', 'r') as file:
    data = json.load(file)
Other weird points in your code:
- You wrote a loop iterating over rows, for i in df.iterrows():, but never use i (the control variable of this loop).
- Your loop works in a time-step (not "row by row") mode, so it should rather be something like "while end_date <= last_end_date:".
- The difference between start_date and end_date is just 1 month (they are actually the start and end dates of some month), but in the loop you increase both dates by 3 months.
Below is an example of code that looks for rows in consecutive three-month ranges, up to some final date, and prints the rows from the current range, if any:
start_date = pd.to_datetime('2017-01-01')
end_date = pd.to_datetime('2017-03-31')
last_end_date = pd.to_datetime('2017-12-31')
mnthBeg = pd.offsets.MonthBegin(3)
mnthEnd = pd.offsets.MonthEnd(3)
while end_date <= last_end_date:
    filtered_rows = df[df.Date.between(start_date, end_date)]
    n = len(filtered_rows.index)
    print(f'Date range: {start_date.strftime("%Y-%m-%d")} - {end_date.strftime("%Y-%m-%d")}, {n} rows.')
    if n > 0:
        print(filtered_rows)
    start_date += mnthBeg
    end_date += mnthEnd
You can compare your dates using the following method
from datetime import datetime
df_subset = df.loc[(df['Start_Date'] > datetime.strptime('2018-12-31', '%Y-%m-%d'))]

Python Dataframe Date plus months variable which comes from the other column

I have a dataframe with the date and month_diff variable. I would like to get a new date (name it as Target_Date) based on the following logic:
For example, the date is 2/13/2019, month_diff is 3, then the target date should be the month-end of the original date plus 3 months, which is 5/31/2019
I tried the following method to get the target date first:
df["Target_Date"] = df["Date"] + pd.DateOffset(months = df["month_diff"])
But it failed; as far as I know, the parameter of DateOffset should be a single variable or a fixed number.
I also tried:
df["Target_Date"] = df["Date"] + relativedelta(months = df["month_diff"])
It fails too.
Can anyone help? Thank you.
edit:
This is a large dataset with millions of rows.
You could try this
import pandas as pd
from dateutil.relativedelta import relativedelta
df = pd.DataFrame({'Date': [pd.Timestamp(2019,1,1), pd.Timestamp(2019,2,1)], 'month_diff': [1,2]})
df.apply(lambda row: row.Date + relativedelta(months=row.month_diff), axis=1)
Or list comprehension
[date + relativedelta(months=month_diff) for date, month_diff in df[['Date', 'month_diff']].values]
I would take the following approach to compute your "target_date":
1. Apply the target month offset (in your case +3 months), using your pd.DateOffset.
2. Get the last day of that target month (using, for example, calendar.monthrange; see also "Get last day of the month"). This gives you the "flexible" part of the date offset.
3. Apply the flexible day offset, obtained by comparing the results of step 1 and step 2. This could be a new pd.DateOffset.
A solution could look something like this:
import calendar
from dateutil.relativedelta import relativedelta
for ii in df.index:
    new_ = df.at[ii, 'start_date'] + relativedelta(months=df.at[ii, 'month_diff'])
    max_date = calendar.monthrange(new_.year, new_.month)[1]
    end_ = new_ + relativedelta(days=max_date - new_.day)
    print(end_)
Further "cleaning" into a function and / or list comprehension will probably make it much faster
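Since the question mentions millions of rows, a vectorized variant may be worth trying. The sketch below is one possible way, not the method of any answer here: it uses NumPy month arithmetic to jump to the first day of the month after the target month and then steps back one day. The sample data is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the first row is the example from the question.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2019-02-13", "2019-11-30", "2020-01-31"]),
    "month_diff": [3, 3, 1],
})

# Truncate each date to its month, then move forward month_diff + 1 months
# to reach the first day of the month that follows the target month.
first_of_month = df["Date"].to_numpy().astype("datetime64[M]")
step = (df["month_diff"].to_numpy() + 1).astype("timedelta64[M]")
first_of_next = (first_of_month + step).astype("datetime64[D]")

# One day earlier is the last day of the target month.
month_end = first_of_next - np.timedelta64(1, "D")
df["Target_Date"] = month_end.astype("datetime64[ns]")
print(df)
```

All steps operate on whole arrays, so no Python-level loop or apply is involved.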
import pandas as pd
from datetime import datetime
from datetime import timedelta
This is my approach to solving your issue.
However, for some reason I am getting a semantic error in my output, even though I am sure this is the correct way. Please correct me if you notice something wrong.
today = datetime.now()
today = today.strftime("%d/%m/%Y")
month_diff =[30,5,7]
n = 30
for i in month_diff:
    b = {'Date': today, 'month_diff':month_diff,"Target_Date": datetime.now()+timedelta(days=i*n)}
    df = pd.DataFrame(data=b)
Output:
For some reason the i is not getting updated.
I was looking for a solution I could write in a single line, and apply does the job. However, by default apply operates on each column, so you have to remember to specify the correct axis: axis=1.
from datetime import datetime
from dateutil.relativedelta import relativedelta
# Create a new column with date adjusted by number of months from 'month_diff' column and later adjust to the last day of month
df['Target_Date'] = df.apply(
    lambda row: row.Date                    # start from the current date
    + relativedelta(months=row.month_diff)  # add month_diff months
    + relativedelta(day=31),                # and adjust to the last day of that month
    axis=1,                                 # 1 or 'columns': apply the function to each row
)
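The month-end adjustment relies on how relativedelta treats an absolute day argument: day=31 sets the day of the month and clips it to the month's actual length. A small sketch on plain dates, using the example from the question plus a made-up leap-year case:

```python
from datetime import date
from dateutil.relativedelta import relativedelta

start = date(2019, 2, 13)
# months=3 moves to 2019-05-13; day=31 then sets the day, clipped to the month's length.
target = start + relativedelta(months=3) + relativedelta(day=31)

# In a shorter month the same day=31 lands on the real last day.
feb_end = date(2020, 1, 31) + relativedelta(months=1) + relativedelta(day=31)

print(target, feb_end)
```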

PYTHON Numpy where time condition

I have the following target: I need to compare two date columns in the same table and create a 3rd column based on the result of the comparison. I do not know how to compare dates in a np.where statement.
This is my current code:
now = datetime.datetime.now() #set the date to compare
delta = datetime.timedelta(days=7) #set delta
time_delta = now+delta #now+7 days
And here is the np.where statement:
DB['s_date'] = np.where((DB['Start Date']<=time_delta | DB['Start Date'] = (None,"")),DB['Start Date'],RW['date'])
There is an OR condition to take into account the possibility that Start Date column might be empty
Would lambda apply work for you Filippo? It looks at a series row-wise, then applies a function of your choice to every value of the row. Whatever is returned in the function will fill up the series with the values it returns.
def compare(date):
    # Test for None first: comparing None with a datetime raises a TypeError
    if date is None or date <= time_delta:
        pass  # return something
    else:
        pass  # return something else
DB['s_date'] = DB['Start Date'].apply(lambda x: compare(x))
EDIT: This will work as well (thanks EyuelDK)
DB['s_date'] = DB['Start Date'].apply(compare)
Thank you for the insights. I updated the code as follows (adjusted for my purposes) and it works:
now = datetime.datetime.now() #set the date to compare
delta = datetime.timedelta(days=7) #set delta
time_delta = now+delta #now+7 days
DB['Start'] = np.where(((DB['Start Date']<=time_delta) | (DB['Start Date'].isnull()) | (DB['Start Date'] == "")),DB['Start'],DB['Start Date'])
The key was to wrap each condition separated by | in parentheses. Because | binds more tightly than comparison operators such as <=, the version without parentheses raised an error from comparing two different data types.
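The effect of the parentheses can be reproduced on a tiny frame. This is a sketch with made-up dates, a fixed cutoff in place of now + 7 days, and a hypothetical bucket column; it treats an empty Start Date as a missing value (NaT).

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second Start Date is missing.
DB = pd.DataFrame({
    "Start Date": pd.to_datetime(["2024-01-01", None, "2024-03-01"]),
})
cutoff = pd.Timestamp("2024-01-15")  # stands in for now + 7 days

# Each condition sits in its own parentheses before being combined with |.
cond = (DB["Start Date"] <= cutoff) | (DB["Start Date"].isna())

# np.where picks the first value where cond is True, the second elsewhere.
DB["bucket"] = np.where(cond, "due_or_missing", "later")
print(DB)
```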
