Problem:
I have a bunch of files that were downloaded from an org. Halfway through their data directory the org changed the naming convention (reasons unknown). I am looking to create a script that will take the files in a directory and rename the file the same way, but simply "go back one day".
Here is a sample of how one file is named: org2015365_res_version.asc
What I need is logic to only change the year day (2015365) in this case to 2015364. This logic needs to span a few years so 2015001 would be 2014365.
I'm not sure this is possible, since it's not working with the current date, so a module like datetime does not seem applicable.
Partial logic I came up with. I know it is rudimentary at best, but wanted to take a stab at it.
# open all files
all_data = glob.glob('/somedir/org*.asc')
# empty array to be appended to
day = []
year = []
# loop through all files
for f in all_data:
    # get first part of string, renders org2015365
    f_split = f.split('_')[0]
    # get only year day - renders 2015365
    year_day = f_split.replace(f_split[:3], '')
    # get only day - renders 365
    days = year_day.replace(year_day[0:4], '')
    day.append(days)
    # get only year - renders 2015
    years = year_day.replace(year_day[4:], '')
    year.append(years)
# convert to int for easier processing
day = [int(i) for i in day]
year = [int(i) for i in year]
if day == 001 & year == 2016:
    day = 365
    year = 2015
elif day == 001 & year == 2015:
    day = 365
    year = 2014
else:
    day = day - 1
Apart from the logic above I also came across the function below from this post, I am not sure what would be the best way to combine that with the partial logic above. Thoughts?
import glob
import os
def rename(dir, pattern, titlePattern):
    for pathAndFilename in glob.iglob(os.path.join(dir, pattern)):
        title, ext = os.path.splitext(os.path.basename(pathAndFilename))
        os.rename(pathAndFilename,
                  os.path.join(dir, titlePattern % title + ext))

rename(r'c:\temp\xx', r'*.doc', r'new(%s)')
Help me, stackoverflow. You're my only hope.
You can use the datetime module:
import datetime
# First argument - string like 2015365, second argument - format (%j is the day of the year)
dt = datetime.datetime.strptime(year_day, '%Y%j')
# Time shift
dt = dt + datetime.timedelta(days=-1)
# Year with shift
nyear = dt.year
# Day in year with shift
nday = dt.timetuple().tm_yday
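As a quick check that '%Y%j' parsing copes with year boundaries and leap years, the steps above can be wrapped in a small function. This is only a minimal sketch; the name shift_year_day is hypothetical and not part of the original answer.

```python
import datetime

def shift_year_day(year_day, days=-1):
    """Shift a 'YYYYDDD' (year + day-of-year) string by a number of days."""
    dt = datetime.datetime.strptime(year_day, '%Y%j')
    dt += datetime.timedelta(days=days)
    # %j formats the day of the year zero-padded to three digits
    return dt.strftime('%Y%j')

print(shift_year_day('2015365'))  # 2015364
print(shift_year_day('2015001'))  # 2014365
print(shift_year_day('2017001'))  # 2016366 (2016 is a leap year)
```

Formatting with '%Y%j' also avoids padding the day by hand with zfill(3).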
Based on feedback from the community I was able to get the logic needed to fix the files downloaded from the org! The logic was the biggest hurdle. It turns out that the datetime module can be used after all; I need to read up more on it.
I combined the logic with batch renaming using the os module, and I'm posting the code below to help future users who may have a similar question!
import datetime
import glob
import os

# threshold where the org changed its naming convention;
# only files dated later than the threshold are renamed
threshold = '2014336'
th = datetime.datetime.strptime(threshold, '%Y%j')
# loop through all files, oldest first, so that a renamed file
# never overwrites one that has not been processed yet
for f in sorted(glob.glob('/some_dir/org*.asc')):
    # file name without the directory, e.g. org2015365_res_version.asc
    name = os.path.basename(f)
    # get first part of the name, renders org2015365
    f_split = name.split('_')[0]
    # get only year day (drop the 'org' prefix) - renders 2015365
    year_day = f_split[3:]
    # first argument - string 2015365, second argument - format used to parse it
    dt = datetime.datetime.strptime(year_day, '%Y%j')
    if dt > th:
        # Time shift - go back one day
        dt = dt + datetime.timedelta(days=-1)
        # Year with shift
        nyear = dt.year
        # Day in year with shift
        nday = dt.timetuple().tm_yday
        # rename the file, keeping it in the same directory
        f_output = 'org' + str(nyear) + str(nday).zfill(3) + '_res_version.asc'
        os.rename(f, os.path.join(os.path.dirname(f), f_output))
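Before touching files on disk, it can help to compute the new names separately and inspect them. The following is a hedged sketch: the helper shifted_name and its regular expression are assumptions based on the sample name org2015365_res_version.asc, and nothing is renamed here.

```python
import datetime
import re

# Assumed pattern: 'org', a 7-digit year-day, then the rest of the name
NAME_RE = re.compile(r'^(org)(\d{7})(_.*)$')

def shifted_name(name, threshold='2014336'):
    """Return the name with its year-day moved back one day, or the name
    unchanged if it does not match or is not later than the threshold."""
    m = NAME_RE.match(name)
    if m is None:
        return name
    dt = datetime.datetime.strptime(m.group(2), '%Y%j')
    if dt <= datetime.datetime.strptime(threshold, '%Y%j'):
        return name
    dt -= datetime.timedelta(days=1)
    return m.group(1) + dt.strftime('%Y%j') + m.group(3)

for name in ['org2014336_res_version.asc', 'org2015001_res_version.asc']:
    print(name, '->', shifted_name(name))
```

Printing the old and new names first makes it easy to spot collisions before calling os.rename.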
Related
I have many folders (in Microsoft Azure data lake), each folder is named with a date as the form "ddmmyyyy". Generally, I used the regex to extract all files of all folders of an exact month of a year in the way
path_data="/mnt/data/[0-9]*032022/data_[0-9]*.json" # all folders of all days of month 03 of 2022
result=spark.read.json(path_data)
My problem now is to extract all folders that match exactly one year before a given date
For example: for the date 14-03-2022; I need a regex to automatically read all files of all folders between 14-03-2021 and 14-03-2022.
I tried to extract the month and year in vars using strings, then using those two strings in a regex respecting the conditions ( for the showed example month should be greater than 03 when year equal to 2021 and less than 03 when the year is equal to 2022). I tried something similar to (while replacing the vars with 03, 2021 and 2022).
date_regex="([0-9]{2}[03-12]2021)|([0-9]{2}[01-03]2022)"
Is there any hint how I can perform such a task!
Thanks in advance
If I understand your question correctly, you can match dates of the form ??-03-2021 or ??-03-2022 in the file name field with the following regex:
date_regex="([0-9]{2}-03-2021)|([0-9]{2}-03-2022)"
If you want to customize it further, you can start from the pattern at the link below and adapt it:
https://regex101.com/r/AgqFfH/1
Update: extract any folder named with a date between 14032021 and 14032022.
Solution: first extract the date in ddmmyyyy format with a regex, then parse it and compare it against the two bounds, assuming the format is correct and such a string is found in the name.
import re, datetime
date_regex = r"(0[1-9]|[12][0-9]|3[01])(0[1-9]|1[012])([0-9]{4})"  # ddmmyyyy
m = re.search(date_regex, folder_name)  # folder_name: placeholder for one folder name
d = datetime.date(int(m.group(3)), int(m.group(2)), int(m.group(1)))
if datetime.date(2021, 3, 14) < d < datetime.date(2022, 3, 14):
    ...  # do any operation
The above code is just a rough sketch to give an overview of the method.
I hope this solution helps.
It certainly isn't pretty, but here you go:
#input
day = "14"; month = "03"; startYear = "2021";
#day construction
sameTensAfter = '(' + day[0] + '[' + day[1] + '-9])';
theDaysAfter = '([' + chr(ord(day[0])+1) + '-9][0-9])';
sameTensBefore = '(' + day[0] + '[0-' + day[1] + '])';
theDaysBefore = '';
if day[0] != '0':
    theDaysBefore = '([0-' + chr(ord(day[0])-1) + '][0-9])';
#build the part for the dates with the same month as query
afterDayPart = '%s|%s' %(sameTensAfter, theDaysAfter);
beforeDayPart = '%s|%s' %(sameTensBefore, theDaysBefore);
theMonthAfter = str(int(month) + 1).zfill(2);
afterMonthPart = theMonthAfter[0] + '([' + theMonthAfter[1] + '-9])';
if theMonthAfter[0] == '0':
    afterMonthPart += '|(1[0-2])';
theMonthBefore = str(int(month) - 1).zfill(2);
beforeMonthPart = theMonthBefore[0] + '([0-' + theMonthBefore[1] + '])';
if theMonthBefore[0] == '1':
    beforeMonthPart = '(0[0-9])|' + beforeMonthPart;
#4 kinds of matches:
startDateRange = '((%s)(%s)(%s))' %(afterDayPart, month, startYear);
anyDayAfterMonth = '((%s)(%s)(%s))' %('[0-9]{2}', afterMonthPart, startYear);
endDateRange = '((%s)(%s)(%s))' %(beforeDayPart, month, int(startYear)+1);
anyDayBeforeMonth = '((%s)(%s)(%s))' %('[0-9]{2}', beforeMonthPart, int(startYear)+1);
#print regex
date_regex = startDateRange + '|' + anyDayAfterMonth + '|' + endDateRange + '|' + anyDayBeforeMonth;
print(date_regex)
#this prints:
#(((1[4-9])|([2-9][0-9]))(03)(2021))|(([0-9]{2})(0([4-9])|(1[0-2]))(2021))|(((1[0-4])|([0-0][0-9]))(03)(2022))|(([0-9]{2})(0([0-2]))(2022))
startDateRange: the month is the same and it's the starting year, this will take all the days including and after.
anyDayAfterMonth: the month is greater and it's the starting year, this will take any day.
endDateRange: the month is the same and it's the ending year, this will take all the days including and before.
anyDayBeforeMonth: the month is less than and it's the ending year, this will take any day.
Here's an example: https://regex101.com/r/i76s58/1
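To see the four kinds of matches in action, the printed pattern can be checked against a few folder names. This is only an illustrative sketch: the sample names and the helper in_range are made up, and re.fullmatch (Python 3.4+) is used so that the whole name has to match.

```python
import re

# The pattern printed above for day=14, month=03, startYear=2021
date_regex = (
    r'(((1[4-9])|([2-9][0-9]))(03)(2021))'
    r'|(([0-9]{2})(0([4-9])|(1[0-2]))(2021))'
    r'|(((1[0-4])|([0-0][0-9]))(03)(2022))'
    r'|(([0-9]{2})(0([0-2]))(2022))'
)

def in_range(folder):
    """True if the whole folder name matches the generated pattern."""
    return re.fullmatch(date_regex, folder) is not None

samples = ['13032021', '14032021', '25122021', '09022022', '14032022', '15032022']
for folder in samples:
    print(folder, in_range(folder))
```

Note that the pattern only bounds the range; it does not reject impossible days, so a name such as 99032021 also matches.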
To compare dates, use the datetime module, as in the example below.
You can then select only the folders that satisfy your condition.
# importing datetime module
import datetime
# date in yyyy/mm/dd format
d1 = datetime.datetime(2018, 5, 3)
d2 = datetime.datetime(2018, 6, 1)
# Comparing the dates will return
# either True or False
print("d1 is greater than d2 : ", d1 > d2)
print("d1 is less than d2 : ", d1 < d2)
print("d1 is not equal to d2 : ", d1 != d2)
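Applied to the folder question, the same comparison might look like the sketch below. It assumes folder names are plain ddmmyyyy strings; the function folders_between and the sample names are hypothetical.

```python
import datetime

def folders_between(names, start, end):
    """Keep folder names (ddmmyyyy) whose date lies in [start, end]."""
    kept = []
    for name in names:
        try:
            d = datetime.datetime.strptime(name, '%d%m%Y').date()
        except ValueError:
            continue  # not a valid ddmmyyyy date, skip it
        if start <= d <= end:
            kept.append(name)
    return kept

end = datetime.date(2022, 3, 14)
start = end.replace(year=end.year - 1)  # would raise for 29 February
names = ['13032021', '14032021', '01012022', '15032022', 'logs']
print(folders_between(names, start, end))  # ['14032021', '01012022']
```

The resulting names can then be turned into paths, which avoids encoding the date range in a regex at all.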
Basically, I'm trying to check whether a date, e.g. 2021-07-08, is in the next week, or the week after that, or neither.
import datetime
from datetime import timedelta
tday = datetime.date.today()
# I can get the start and end dates of the current week
start = tday - timedelta(days=tday.weekday())
end = start + timedelta(days=6)
print("Today: " + str(tday))
print("Start: " + str(start))
print("End: " + str(end))
# and I can get the current week number.
curr_week = datetime.date.today().strftime("%V")
print(curr_week)
Is there a better way than getting a list of dates in curr_week + 1 and then checking whether date is in in that list?
Thanks so much
GENERAL ANSWER
It is best to stick to datetime and timedelta, since this handles all edge cases like year changes, years with 53 weeks etc.
So find the week number of next week and compare it with the week number of the date you want to check.
import datetime
# Date to check in date format:
check_date = datetime.datetime.strptime("2021-09-08", "%Y-%m-%d").date()
# Current week number:
curr_week = datetime.date.today().strftime("%V")
# number of next week
next_week = (datetime.date.today()+datetime.timedelta(weeks=1)).strftime("%V")
# number of the week after that
week_after_next_week = (datetime.date.today()+datetime.timedelta(weeks=2)).strftime("%V")
# Compare week numbers of next weeks to the week number of the date to check:
if next_week == check_date.strftime("%V"):
    # Date is within next week, put code here
    pass
elif week_after_next_week == check_date.strftime("%V"):
    # Date is the week after next week, put code here
    pass
OLD ANSWER
This messes up around year changes, and modulo doesn't fix it because there are years with 53 weeks.
You can compare the week numbers by converting them to integers. You don't need to create a list of all dates within the next week.
import datetime
# Date to check in date format:
check_date = datetime.datetime.strptime("2021-07-08", "%Y-%m-%d").date()
# Current week number as an integer:
curr_week = int(datetime.date.today().strftime("%V"))
# Compare week numbers:
if curr_week == (int(check_date.strftime("%V"))-1):
    # Date is within next week, put code here
    pass
elif curr_week == (int(check_date.strftime("%V"))-2):
    # Date is the week after next week, put code here
    pass
You can parse the date you want to check into a datetime, and then compare the week numbers.
import datetime
# date you want to check
date = datetime.datetime.strptime("2021-07-08","%Y-%m-%d")
# current date
tday = datetime.date.today()
# compare the weeks
print(date.strftime("%V"))
print(tday.strftime("%V"))
27
32
[see Alfred's answer]
You can get the week number directly as an integer from the IsoCalendarDate representation of each date.
from datetime import datetime
date_format = '%Y-%m-%d'
t_now = datetime.strptime('2021-08-11', date_format)
target_date = datetime.strptime('2021-08-18', date_format)
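The snippet above stops before the comparison itself. One possible continuation, as a sketch only, compares (ISO year, ISO week) pairs obtained from isocalendar(), which also keeps dates in the same week number of a different year apart. The helpers iso_year_week and weeks_ahead are hypothetical names, not from the original answer.

```python
from datetime import datetime, timedelta

date_format = '%Y-%m-%d'
t_now = datetime.strptime('2021-08-11', date_format)
target_date = datetime.strptime('2021-08-18', date_format)

def iso_year_week(d):
    """(ISO year, ISO week number) as a tuple of integers."""
    iso = d.isocalendar()
    return (iso[0], iso[1])

def weeks_ahead(now, target, max_weeks=2):
    """Return 1 if target is in next week, 2 for the week after, else None."""
    for n in range(1, max_weeks + 1):
        if iso_year_week(now + timedelta(weeks=n)) == iso_year_week(target):
            return n
    return None

print(weeks_ahead(t_now, target_date))  # 1
```

Indexing the result of isocalendar() works on every Python 3 version, whereas the named fields are only available from Python 3.9.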
Just compare the dates with datetime:
from datetime import datetime, timedelta
def in_next_week(date):
    """ -1: before; 0: in; 1: after next week;"""
    today = datetime.today()
    this_monday = today.date() - timedelta(today.weekday())
    start = this_monday + timedelta(weeks=1)
    end = this_monday + timedelta(weeks=2)
    return -1 if date < start else 0 if date < end else 1
Test cases:
for i in range(14):
    dt = datetime.today().date() + timedelta(days=i)
    print(dt, in_next_week(dt))
I've written this function to get the last Thursday of the month
def last_thurs_date(date):
    month=date.dt.month
    year=date.dt.year
    cal = calendar.monthcalendar(year, month)
    last_thurs_date = cal[4][4]
    if month < 10:
        thurday_date = str(year)+'-0'+ str(month)+'-' + str(last_thurs_date)
    else:
        thurday_date = str(year) + '-' + str(month) + '-' + str(last_thurs_date)
    return thurday_date
But it's not working with the lambda function.
datelist['Date'].map(lambda x: last_thurs_date(x))
Where datelist is
datelist = pd.DataFrame(pd.date_range(start = pd.to_datetime('01-01-2014',format='%d-%m-%Y')
, end = pd.to_datetime('06-03-2019',format='%d-%m-%Y'),freq='D').tolist()).rename(columns={0:'Date'})
datelist['Date']=pd.to_datetime(datelist['Date'])
Jpp already added the solution, but just to add a slightly more readable formatted string - see this awesome website.
import calendar
def last_thurs_date(date):
    year, month = date.year, date.month
    cal = calendar.monthcalendar(year, month)
    # each row is a week starting on Monday, so Thursday is column calendar.THURSDAY (3);
    # days outside the month are 0, so the largest value in that column is the last Thursday
    last_thurs_date = max(week[calendar.THURSDAY] for week in cal)
    return f'{year}-{month:02d}-{last_thurs_date}'
Also added a bit of logic: the original returns e.g. 2019-02-0, because cal[4][4] is the Friday of the fifth calendar row, which is empty (0) in February 2019.
Scalar datetime objects don't have a dt accessor; series do: see pd.Series.dt. Removing it fixes the error. The key is understanding that pd.Series.map and pd.Series.apply pass scalars to your custom function via a loop, not an entire series. Note also that monthcalendar rows start on Monday, so Thursday is column calendar.THURSDAY (3), and the fifth row does not always contain a Thursday; taking the largest value in that column is safer than cal[4][4].
import calendar
def last_thurs_date(date):
    month = date.month
    year = date.year
    cal = calendar.monthcalendar(year, month)
    last_thurs_date = max(week[calendar.THURSDAY] for week in cal)
    if month < 10:
        thurday_date = str(year)+'-0'+ str(month)+'-' + str(last_thurs_date)
    else:
        thurday_date = str(year) + '-' + str(month) + '-' + str(last_thurs_date)
    return thurday_date
You can rewrite your logic more succinctly via f-strings (Python 3.6+) and a conditional expression:
def last_thurs_date(date):
    month = date.month
    year = date.year
    # largest day number in the Thursday column (empty cells are 0)
    last_thurs_date = max(week[calendar.THURSDAY] for week in calendar.monthcalendar(year, month))
    return f'{year}{"-0" if month < 10 else "-"}{month}-{last_thurs_date}'
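For reference, here is a standalone sketch of the "last Thursday" lookup that does not depend on pandas. It assumes the calendar module's default first weekday (Monday), so that column index 3 (calendar.THURSDAY) is Thursday; the function name last_weekday_of_month is hypothetical.

```python
import calendar
import datetime

def last_weekday_of_month(year, month, weekday=calendar.THURSDAY):
    """Date of the last given weekday in a month (Monday-first calendar)."""
    weeks = calendar.monthcalendar(year, month)
    # days outside the month are 0, so max() picks the last real occurrence
    day = max(week[weekday] for week in weeks)
    return datetime.date(year, month, day)

print(last_weekday_of_month(2019, 2))  # 2019-02-28
print(last_weekday_of_month(2021, 2))  # 2021-02-25
```

Taking the maximum over the column avoids guessing which calendar row holds the last occurrence, since months span four to six rows.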
I know that a lot of time has passed since the date of this post, but I think it would be worth adding another option if someone came across this thread
Even though I use pandas every day at work, in this case my suggestion would be to just use the dateutil library. The solution is a simple one-liner, without unnecessary complications.
from dateutil.rrule import rrule, MONTHLY, FR, SA
from datetime import datetime as dt
import pandas as pd
# monthly options expiration dates calculated for 2022
monthly_options = list(rrule(MONTHLY, count=12, byweekday=FR, bysetpos=3, dtstart=dt(2022,1,1)))
# last Saturday of the month
last_saturdays = list(rrule(MONTHLY, count=12, byweekday=SA, bysetpos=-1, dtstart=dt(2022,1,1)))
and then of course:
pd.DataFrame({'LAST_ST':last_saturdays}) #or whatever you need
This question is answered by Calculate Last Friday of Month in Pandas.
This can be modified by selecting the appropriate day of the week, here freq='W-FRI'
I think the easiest way is to create a pandas.DataFrame using pandas.date_range and specifying freq='W-FRI'.
W-FRI is Weekly Fridays
pd.date_range(df.Date.min(), df.Date.max(), freq='W-FRI')
Creates all the Fridays in the date range between the min and max of the dates in df
Use a .groupby on year and month, and select .last(), to get the last Friday of every month for every year in the date range.
Because this method finds all the Fridays for every month in the range and then chooses .last() for each month, there's not an issue with trying to figure out which week of the month has the last Friday.
With this, use pandas: Boolean Indexing to find values in the Date column of the dataframe that are in last_fridays_in_daterange.
Use the .isin method to determine containment.
pandas: DateOffset objects
import pandas as pd
# test data: given a dataframe with a datetime column
df = pd.DataFrame({'Date': pd.date_range(start=pd.to_datetime('2014-01-01'), end=pd.to_datetime('2020-08-31'), freq='D')})
# create a dataframe with all Fridays in the date range between the min and max of df.Date
fridays = pd.DataFrame({'datetime': pd.date_range(df.Date.min(), df.Date.max(), freq='W-FRI')})
# use groupby and last to get the last Friday of each month into a list
last_fridays_in_daterange = fridays.groupby([fridays.datetime.dt.year, fridays.datetime.dt.month]).last()['datetime'].tolist()
# find the data for the last Friday of the month
df[df.Date.isin(last_fridays_in_daterange)]
Currently I am trying to trim the current date into day, month and year with the following code.
#Code from my local machine
from datetime import datetime
from datetime import timedelta
five_days_ago = datetime.now()-timedelta(days=5)
# result: 2017-07-14 19:52:15.847476
get_date = str(five_days_ago).rpartition(' ')[0]
#result: 2017-07-14
#Extract the day
day = get_date.rpartition('-')[2]
# result: 14
#Extract the year
year = get_date.rpartition('-')[0]
# result: 2017-07
I am not a Python professional, since I only picked up the language a couple of months ago, but I want to understand a few things here:
Why did I receive 2017-07 if str.rpartition() is supposed to split a string on the separator you give it (-, /, " ")? I was expecting to receive 2017...
Is there an efficient way to separate day, month and year? I do not want to repeat the same mistakes with my insecure code.
I tried my code in the following tech. setups:
local machine with Python 3.5.2 (x64), Python 3.6.1 (x64) and repl.it with Python 3.6.1
Try the code online, copy and paste the line codes
Try the following:
from datetime import date, timedelta
five_days_ago = date.today() - timedelta(days=5)
day = five_days_ago.day
year = five_days_ago.year
If what you want is a date (not a date and time), use date instead of datetime. Then, the day and year are simply properties on the date object.
As to your question regarding rpartition, it works by splitting on the rightmost separator (in your case, the hyphen between the month and the day) - that's what the r in rpartition means. So get_date.rpartition('-') returns the tuple ('2017-07', '-', '14').
If you want to persist with your approach, your year code would work if you replace rpartition with partition, e.g.:
year = get_date.partition('-')[0]
# result: 2017
However, there's also a related (better) approach - use split:
parts = get_date.split('-')
year = parts[0]
month = parts[1]
day = parts[2]
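To make the difference between the three string methods concrete, here is a small illustration on a fixed date string (the value of get_date is hard-coded for the example):

```python
get_date = '2017-07-14'

print(get_date.partition('-'))   # ('2017', '-', '07-14')
print(get_date.rpartition('-'))  # ('2017-07', '-', '14')
print(get_date.split('-'))       # ['2017', '07', '14']

# Unpack and convert to integers in one step
year, month, day = (int(part) for part in get_date.split('-'))
print(year, month, day)          # 2017 7 14
```

partition splits at the first separator and rpartition at the last one, while split breaks the string at every separator.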
Please tell me how I can list next 24 months' start dates with python,
such as:
01May2014
01June2014
.
.
.
01Aug2015
and so on
I tried:
import datetime
this_month_start = datetime.datetime.now().replace(day=1)
for i in range(24):
    print((this_month_start + i*datetime.timedelta(40)).replace(day=1))
But it skips some months.
Just increment the month value; I used datetime.date() types here as that's more than enough:
import datetime
current = datetime.date.today().replace(day=1)
for i in range(24):
    new_month = current.month % 12 + 1
    new_year = current.year + current.month // 12
    current = current.replace(month=new_month, year=new_year)
    print(current)
The new month calculation picks the next month based on the last calculated month, and the year is incremented every time the previous month reached December.
By manipulating a current object, you simplify the calculations; you can do it with i as an offset as well, but the calculation gets a little more complicated.
It'll work with datetime.datetime() too.
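For comparison, the offset-based variant mentioned above could look like the following sketch. It counts months from a fixed base and uses divmod to split the total back into a year and a month; the function name month_starts is hypothetical.

```python
import datetime

def month_starts(start, count):
    """First day of each of the `count` months following `start`."""
    base = start.year * 12 + (start.month - 1)  # months counted from year 0
    result = []
    for i in range(1, count + 1):
        year, month0 = divmod(base + i, 12)  # month0 runs from 0 to 11
        result.append(datetime.date(year, month0 + 1, 1))
    return result

dates = month_starts(datetime.date(2014, 4, 15), 24)
print(dates[0], dates[-1])  # 2014-05-01 2016-04-01
```

Because each date is computed directly from the offset i, no state has to be carried from one iteration to the next.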
To simplify the arithmetic, try/except could be used:
from datetime import date
current = date.today().replace(day=1)
for _ in range(24):
    try:
        current = current.replace(month=current.month + 1)
    except ValueError:  # new year
        current = current.replace(month=1, year=current.year + 1)
    print(current.strftime('%d%b%Y'))