Regex with date as String in Azure path - python

I have many folders (in Microsoft Azure Data Lake), each named with a date in the form "ddmmyyyy". Usually I use a regex-style path pattern to read all files in all folders for a given month of a year, like this:
path_data="/mnt/data/[0-9]*032022/data_[0-9]*.json" # all folders of all days of month 03 of 2022
result=spark.read.json(path_data)
My problem now is to read all folders whose dates fall within exactly one year before a given date.
For example, for the date 14-03-2022, I need a regex that automatically reads all files in all folders between 14-03-2021 and 14-03-2022.
I tried extracting the month and year into string variables, then using those strings in a regex that respects the conditions (for the example shown, the month should be 03 or later when the year is 2021, and 03 or earlier when the year is 2022). I tried something like the following (with the variables replaced by 03, 2021 and 2022):
date_regex="([0-9]{2}[03-12]2021)|([0-9]{2}[01-03]2022)"
Is there any hint on how I can perform such a task?
Thanks in advance

If I understand your question correctly, to find dates between ??-03-2021 and ??-03-2022 in the file name, you can use the following regex:
date_regex="([0-9]{2}-03-2021)|([0-9]{2}-03-2022)"
If you need something more customized, you can adapt the pattern at the link below:
https://regex101.com/r/AgqFfH/1
Update: extract any folder named with a date between 14032021 and 14032022.
Solution: first extract the date in ddmmyyyy format with a regex, then parse it and compare it against the two bounds. This assumes the format is correct and such a date string is present in the name.
import re
from datetime import datetime
m = re.search(r"\d{8}", folder_name)  # folder_name: the name being tested
# compare as dates: a ddmmyyyy string does not order correctly as an integer
if m and datetime(2021, 3, 14) <= datetime.strptime(m.group(), "%d%m%Y") <= datetime(2022, 3, 14):
    ..do any operation..
The code above is only a rough sketch to give you an overview of the approach.
I hope this solution helps.

It certainly isn't pretty, but here you go:
#input
day = "14"; month = "03"; startYear = "2021";
#day construction
sameTensAfter = '(' + day[0] + '[' + day[1] + '-9])';
theDaysAfter = '([' + chr(ord(day[0])+1) + '-9][0-9])';
sameTensBefore = '(' + day[0] + '[0-' + day[1] + '])';
theDaysBefore = '';
if day[0] != '0':
    theDaysBefore = '([0-' + chr(ord(day[0])-1) + '][0-9])';
#build the part for the dates with the same month as query
afterDayPart = '%s|%s' %(sameTensAfter, theDaysAfter);
beforeDayPart = '%s|%s' %(sameTensBefore, theDaysBefore);
theMonthAfter = str(int(month) + 1).zfill(2);
afterMonthPart = theMonthAfter[0] + '([' + theMonthAfter[1] + '-9])';
if theMonthAfter[0] == '0':
    afterMonthPart += '|(1[0-2])';
theMonthBefore = str(int(month) - 1).zfill(2);
beforeMonthPart = theMonthBefore[0] + '([0-' + theMonthBefore[1] + '])';
if theMonthBefore[0] == '1':
    beforeMonthPart = '(0[0-9])|' + beforeMonthPart;
#4 kinds of matches:
startDateRange = '((%s)(%s)(%s))' %(afterDayPart, month, startYear);
anyDayAfterMonth = '((%s)(%s)(%s))' %('[0-9]{2}', afterMonthPart, startYear);
endDateRange = '((%s)(%s)(%s))' %(beforeDayPart, month, int(startYear)+1);
anyDayBeforeMonth = '((%s)(%s)(%s))' %('[0-9]{2}', beforeMonthPart, int(startYear)+1);
#print regex
date_regex = startDateRange + '|' + anyDayAfterMonth + '|' + endDateRange + '|' + anyDayBeforeMonth;
print(date_regex)
#this prints:
#(((1[4-9])|([2-9][0-9]))(03)(2021))|(([0-9]{2})(0([4-9])|(1[0-2]))(2021))|(((1[0-4])|([0-0][0-9]))(03)(2022))|(([0-9]{2})(0([0-2]))(2022))
startDateRange: the month is the same and it's the starting year, this will take all the days including and after.
anyDayAfterMonth: the month is greater and it's the starting year, this will take any day.
endDateRange: the month is the same and it's the ending year, this will take all the days including and before.
anyDayBeforeMonth: the month is less than and it's the ending year, this will take any day.
Here's an example: https://regex101.com/r/i76s58/1

To compare dates, use the datetime module, as in the example below.
You can then keep only the folders that satisfy your condition.
# importing datetime module
import datetime
# date in yyyy/mm/dd format
d1 = datetime.datetime(2018, 5, 3)
d2 = datetime.datetime(2018, 6, 1)
# Comparing the dates will return
# either True or False
print("d1 is greater than d2 : ", d1 > d2)
print("d1 is less than d2 : ", d1 < d2)
print("d1 is not equal to d2 : ", d1 != d2)
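Since a single regex for a date range gets complicated, one option that follows from the datetime suggestion is to generate the folder name for every day in the range and build one path pattern per day. A minimal sketch, assuming the `/mnt/data/<ddmmyyyy>/data_*.json` layout from the question; `folders_in_year_before` is a hypothetical helper name, not part of any library:

```python
from datetime import date, timedelta

def folders_in_year_before(end, base="/mnt/data"):
    """Return one path pattern per day, from one year before `end` through `end`."""
    try:
        start = end.replace(year=end.year - 1)
    except ValueError:
        # 29 Feb has no counterpart in the previous year; fall back to 28 Feb
        start = end.replace(year=end.year - 1, day=28)
    n_days = (end - start).days
    return [
        f"{base}/{(start + timedelta(days=i)).strftime('%d%m%Y')}/data_[0-9]*.json"
        for i in range(n_days + 1)
    ]

paths = folders_in_year_before(date(2022, 3, 14))
print(paths[0])    # /mnt/data/14032021/data_[0-9]*.json
print(paths[-1])   # /mnt/data/14032022/data_[0-9]*.json
print(len(paths))  # 366
```

`spark.read.json` accepts a list of paths, so `spark.read.json(paths)` should read the whole range. Folders that do not exist in the lake may need to be filtered out of the list first.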

Related

How to automate the generation of dates in this special situation?

I have a result file generated from software and it looks like this:
,0,1,2,3,4,5,6,7,8,9
0,Month,Decade,Stage,Kc,ETc,ETc,Eff,rain,Irr.,Req.
1,coeff,mm/day,mm/dec,mm/dec,mm/dec,,,,,
2,Sep,1,Init,0.50,1.85,18.5,21.8,0.0,,
3,Sep,2,Init,0.50,1.77,17.7,30.3,0.0,,
4,Sep,3,Init,0.50,1.72,17.2,37.1,0.0,,
5,Oct,1,Deve,0.61,2.05,20.5,49.5,0.0,,
6,Oct,2,Deve,0.82,2.66,26.6,59.3,0.0,,
7,Oct,3,Deve,1.03,3.24,35.6,43.0,0.0,,
8,Nov,1,Mid,1.20,3.63,36.3,20.9,15.4,,
9,Nov,2,Mid,1.21,3.53,35.3,6.0,29.2,,
10,Nov,3,Mid,1.21,3.70,37.0,4.0,33.0,,
11,Dec,1,Mid,1.21,3.87,38.7,0.1,38.6,,
12,Dec,2,Late,1.18,3.92,39.2,0.0,39.2,,
13,Dec,3,Late,1.00,3.58,39.4,0.0,39.4,,
14,Jan,1,Late,0.88,3.36,10.1,0.0,10.1,,
15,,,,,,,,,,
16,372.1,272.2,204.9,,,,,,,
As one can observe, the months vary from September to January. Each month is divided into three divisions or decades. To be exact, the months vary from September 2017 to 1st decade of January 2018. Now, I have to generate dates with the starting date of each decade in a month in this format: 01-Sep-2017. So I will have 01-Sep-2017, 11-Sep-2017, 21-Sep-2017, ..., 01-Jan-2018. How to generate these dates? I will share the code that I have written until now.
years = [2017, 2018, 2019]
temp = pd.read_csv(folder_link) # Reading the particular result file
Month = temp['0'][2:] # First column = Month (Jul, Aug, ..)
Decade = temp['1'][2:]
for year in years:
    for j in range(2,len(Decade)): # First two lines are headers, so removed them
        if(int(Decade[j]) == 1): # First decade = 1-10 days of month
            Date = "1" + "-" + Month[j] + "-" + str(year) # Writing the date as 1-Jan-2017
            Dates.append(Date)
        if(int(Decade[j]) == 2): # Second decade = 11-20 days of month
            Date = "11" + "-" + Month[j] + "-" + str(year)
            Dates.append(Date)
        if(int(Decade[j]) == 3): # Third decade = 21-28 or 21-30 or 21-31 days of month
            Date = "21" + "-" + Month[j] + "-" + str(year)
            Dates.append(Date)
The problem with this code is I will get 01-Sep-2017, 11-Sep-2017, 21-Sep-2017, ..., 01-Jan-2017 (instead of 2018). I need a generalized solution that could work for all months, not just for January. I have some results ranging from Sep 2017 - Aug 2018. Any help?
First you could start by setting your columns and index right while reading the csv file. Then you can use a formula to deduce the day from decade.
Increment the year only when switching from December to January (you can extend the condition if there are cases where January and/or December are missing).
The code becomes much easier to read and understand once you apply these:
temp = pd.read_csv(folder_link, header=1, index_col=0)
Dates = []
year = 2017
# skip the units row at the top and the blank and totals rows at the bottom
for index, row in temp.iloc[1:-2].iterrows():
    month = row["Month"]
    if month == "Jan" and temp.at[index-1, "Month"] == "Dec":
        year += 1 # incrementing year if row is january while preceding row is december
    day = (int(row["Decade"]) - 1) * 10 + 1
    Dates.append(f"{day}-{month}-{year}")
print(Dates)
Output:
['1-Sep-2017', '11-Sep-2017', '21-Sep-2017', '1-Oct-2017', '11-Oct-2017', '21-Oct-2017', '1-Nov-2017', '11-Nov-2017', '21-Nov-2017', '1-Dec-2017', '11-Dec-2017', '21-Dec-2017', '1-Jan-2018']
If you want to stay with the iteration approach (there may be a more efficient one using pandas functions), here is a simple way to do it:
dates = []
year = 2017
month_list = ['Jan', 'Sep', 'Oct', 'Nov', 'Dec']
temp = pd.read_csv("data.csv") # Reading the particular result file
for index, row in temp.iterrows():
    # First two lines are headers, so skip them. Same for last two lines.
    if index > 1 and row[1] in month_list:
        if row[1] == 'Jan':
            year += 1
        if(int(row[2]) == 1): # First decade = 1-10 days of month
            date = "1" + "-" + row[1] + "-" + str(year) # Writing the date as 1-Jan-2017
            dates.append(date)
        elif(int(row[2]) == 2): # Second decade = 11-20 days of month
            date = "11" + "-" + row[1] + "-" + str(year)
            dates.append(date)
        elif(int(row[2]) == 3): # Third decade = 21-28 or 21-30 or 21-31 days of month
            date = "21" + "-" + row[1] + "-" + str(year)
            dates.append(date)
        else:
            print("Unrecognized value for decade {}".format(row[2]))
            pass
print(dates)
Explanation :
use iterrows to iterate over your dataframe rows
then, skip headers and check you are parsing actual data by looking at month value (using a predefined list)
finally, just increment year when your month value is Jan
*Note : this solution assumes that your data is a time series with rows ordered in time.
P.S: only use capital letters for classes in Python, not variables.
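Both answers detect the year change from the row order. A variation that is not tied to January specifically is to bump the year whenever the month number goes down from one row to the next. A minimal sketch in plain Python, assuming the rows are already reduced to (month abbreviation, decade) pairs in chronological order and that month names are English; `decade_start_dates` is a hypothetical helper name:

```python
from datetime import date, datetime

def decade_start_dates(rows, start_year):
    """rows: (month_abbr, decade) pairs in chronological order."""
    dates = []
    year = start_year
    prev_month = None
    for month_abbr, decade in rows:
        month = datetime.strptime(month_abbr, "%b").month
        if prev_month is not None and month < prev_month:
            year += 1  # the month number went down, so a new year has started
        prev_month = month
        day = (int(decade) - 1) * 10 + 1  # decade 1, 2, 3 -> day 1, 11, 21
        dates.append(date(year, month, day).strftime("%d-%b-%Y"))
    return dates

rows = [("Sep", 1), ("Sep", 2), ("Sep", 3), ("Dec", 3), ("Jan", 1)]
print(decade_start_dates(rows, 2017))
# ['01-Sep-2017', '11-Sep-2017', '21-Sep-2017', '21-Dec-2017', '01-Jan-2018']
```

Because the check compares consecutive months rather than looking for "Jan", a series running from Sep 2017 to Aug 2018 rolls over once, even if the January rows are missing.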

Find next Quarter/Month/Year/Bi-annual Date from Pandas Timestamp

I want to find a way that could give me next month/quarter/year/bi-annual date given a Pandas timestamp.
If the timestamp is already an end-of-month/quarter/year/bi-annual date, then I can get the next quarter date as follows:
pd.Timestamp('1999-12-31') + pd.tseries.offsets.DateOffset(months=3)
But if the timestamp is pd.Timestamp('1999-12-30'), the above won't work.
Expected output
input = pd.Timestamp('1999-12-30')
next_quarter_end = '2000-03-31'
next_month_end = '2000-01-31'
next_year_end = '2000-12-31'
next_biannual_end = '2000-06-30'
This works. I used pandas.tseries.offsets.QuarterEnd, .MonthEnd, and .YearEnd, multiplied by specific factors that change based on the input, to achieve the four values you're looking for.
date = pd.Timestamp('1999-12-31')
month_factor = 1 if date.day == date.days_in_month else 2
year_factor = 1 if date.day == date.days_in_month and date.month == 12 else 2
next_month_end = date + pd.tseries.offsets.MonthEnd() * month_factor
next_quarter_end = date + (pd.tseries.offsets.QuarterEnd() * month_factor)
next_year_end = date + pd.tseries.offsets.YearEnd() * year_factor
next_biannual_end = date + pd.tseries.offsets.DateOffset(months=6)
Technically, the next quarter end after Timestamp('1999-12-30') is Timestamp('1999-12-31 00:00:00')
You can use pandas.tseries.offsets.QuarterEnd
>>> pd.Timestamp('1999-12-30') + pd.tseries.offsets.QuarterEnd()
Timestamp('1999-12-31 00:00:00')
>>> pd.Timestamp('1999-12-30') + pd.tseries.offsets.QuarterEnd()*2
Timestamp('2000-03-31 00:00:00')
Similarly, use pandas.tseries.offsets.MonthEnd() and pandas.tseries.offsets.YearEnd()
For biannual, I guess you can take 2*QuarterEnd().
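The factors above can be avoided by first rolling the timestamp forward to the end of its current period and then adding one period. A minimal sketch, assuming "next" means the end of the period after the one containing the input (which is what the expected output implies) and that half-years end on 30 June and 31 December; `next_period_ends` is a hypothetical helper name:

```python
import pandas as pd

def next_period_ends(ts):
    """Return the end of the month/quarter/half-year/year following ts's current one."""
    # rollforward leaves ts unchanged when it is already a period end
    month_end = pd.offsets.MonthEnd().rollforward(ts)
    quarter_end = pd.offsets.QuarterEnd().rollforward(ts)
    year_end = pd.offsets.YearEnd().rollforward(ts)
    # the current half-year ends at the next quarter end that falls in June or December
    if quarter_end.month in (6, 12):
        half_end = quarter_end
    else:
        half_end = quarter_end + pd.offsets.QuarterEnd(1)
    return {
        "month": month_end + pd.offsets.MonthEnd(1),
        "quarter": quarter_end + pd.offsets.QuarterEnd(1),
        "biannual": half_end + pd.offsets.QuarterEnd(2),
        "year": year_end + pd.offsets.YearEnd(1),
    }

for name, value in next_period_ends(pd.Timestamp("1999-12-30")).items():
    print(name, value.date())
```

With this approach `pd.Timestamp('1999-12-30')` and `pd.Timestamp('1999-12-31')` give the same four results, so no special case is needed for inputs that already sit on a period end.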

SQLite query not functioning properly

I have made every attempt that I know of to make this work, but at this point I think I am just running in circles.
I am taking user input and using that to query a database. The caveat is that there are dates within the database that need to have days added to them, and to make sure that the user is seeing all the UPDATED information between the dates they chose, I changed the user's start date so that it includes two months beforehand.
At this point, the information is passed into a dataframe where it is then filtered to contain only relevant information as well as adjusting the dates that need to be adjusted. After that, it's passed through a mask on the dataframe to make sure that the user is seeing the updated information only, and not dates that are outside of their chosen range that originally weren't.
There were a few points throughout this process that my code was running properly, but I kept realizing there were changes that needed to be made. As to be expected, those changes caused my code to break and I've not been able to figure out how to fix it.
One issue is that the SQL queries are not returning the correct information. It seems that the chosen start date will allow any entries past that date, but the chosen end date will only include database entries if the end date is very near to the highest date in the database. The problem with that is that the user may not always know what the highest value in the database is, therefore they need to be able to choose an arbitrary value to query by.
There is also an issue where the query seems to work only some of the time. On two separate occasions I ran exactly the same queries, and they worked one time and not the other.
Here is my code:
self.StartDate = (self.StartMonth.get() + " " + self.StartDay.get() + "," + " " + self.StartYear.get())
self.StartDate = datetime.strptime(self.StartDate, '%b %d, %Y').date()
self.StartDate = self.StartDate - timedelta(days = 60)
self.StartDate = self.StartDate.strftime('%b %d, %Y')
self.EndDate = (self.EndMonth.get() + " " + self.EndDay.get() + "," + " " + self.EndYear.get())
self.EndDate = datetime.strptime(self.EndDate, '%b %d, %Y').date()
self.EndDate = self.EndDate.strftime('%b %d, %Y')
JobType = self.JobType.get()
if JobType == 'All':
    self.cursor.execute('''
        SELECT
            *
        FROM
            MainTable
        WHERE
            ETADate >= ? and
            ETADate <= ?
        ''',
        (self.StartDate, self.EndDate,)
    )
    self.data = self.cursor.fetchall()
else:
    self.cursor.execute('''
        SELECT
            *
        FROM
            MainTable
        WHERE
            ETADate BETWEEN
            ? AND ?
            AND EndUse = ?
        ''',
        (self.StartDate, self.EndDate, JobType,)
    )
    self.data = self.cursor.fetchall()
self.Data_Cleanup()
def Data_Cleanup(self):
    self.df = pd.DataFrame(
        self.data,
        columns = [
            'id',
            'JobNumber',
            'ETADate',
            'Balance',
            'EndUse',
            'PayType'
        ]
    )
    remove = ['id', 'JobNumber']
    self.df = self.df.drop(columns = remove)
    self.df['ETADate'] = pd.to_datetime(self.df['ETADate'])
    self.df.loc[self.df['PayType'] == '14 Days', 'ETADate'] = self.df['ETADate'] + timedelta(days = 14)
    self.df.loc[self.df['PayType'] == '30 Days', 'ETADate'] = self.df['ETADate'] + timedelta(days = 30)
    self.df['ETADate'] = self.df['ETADate'].astype('category')
    self.df['EndUse'] = self.df['EndUse'].astype('category')
    self.df['PayType'] = self.df['PayType'].astype('category')
    mask = (self.df['ETADate'] >= self.StartDate) & (self.df['ETADate'] <= self.EndDate)
    print(self.df.loc[mask])
Ideally, the data would be updated before it is added to the database, but unfortunately the source of this data isn't capable of updating it correctly.
I appreciate any help.
You are storing dates as a string, formatted like Jan 02, 2021. That means you'll compare the month first, alphabetically, then the day numerically, then the year. Or, to take a few random dates, the sort order looks like this:
Dec 23, 2021
Jan 01, 2021
Nov 07, 2026
Nov 16, 2025
If you want a query that makes sense, you'll either need quite a bit of SQL logic to parse these dates on the SQLite side, or preferably, just store the dates using a format that sorts correctly as a string. If you use .strftime("%Y-%m-%d") those same dates will sort in order:
2021-01-01
2021-12-23
2025-11-16
2026-11-07
This will require changing the format of the columns in your database, of course.
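To illustrate the point, here is a small sketch using an in-memory SQLite database. It assumes a simplified `MainTable` with only an `ETADate` text column and English month abbreviations; `to_iso` is a hypothetical helper for converting the existing strings:

```python
import sqlite3
from datetime import datetime

def to_iso(text):
    """Convert 'Jan 02, 2021' to '2021-01-02'."""
    return datetime.strptime(text, "%b %d, %Y").strftime("%Y-%m-%d")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MainTable (id INTEGER PRIMARY KEY, ETADate TEXT)")
raw = ["Dec 23, 2021", "Jan 01, 2021", "Nov 07, 2026", "Nov 16, 2025"]
conn.executemany(
    "INSERT INTO MainTable (ETADate) VALUES (?)",
    [(to_iso(d),) for d in raw],
)

# With ISO strings, text comparison and chronological comparison agree
rows = conn.execute(
    "SELECT ETADate FROM MainTable WHERE ETADate BETWEEN ? AND ? ORDER BY ETADate",
    (to_iso("Jan 01, 2021"), to_iso("Dec 31, 2025")),
).fetchall()
result = [r[0] for r in rows]
print(result)  # ['2021-01-01', '2021-12-23', '2025-11-16']
```

The query parameters go through the same conversion as the stored values, so the user can pick any start and end date without knowing what is in the table.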

Metacharacters python extracting dates

I want to extract dates in the format Month Date Year.
For example: 14 January, 2005 or Feb 29 1982
The code I'm using:
date = re.findall(r'\d{1,3} Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec|January|February|March|April|May|June|July|August|September|October|November|December \d{1,3}[, ]\d{4}',line)
Python interprets this as 1-3 digits followed by Jan, or any one of the other month names on its own. So it matches only "Feb" or "12 Jan", but not the rest of the expression.
So how do I group ONLY the months, so that the | applies to the months but not to the rest of the expression?
Answering your question directly, you can make two regexps for your "Day Month Year" and "Month Day Year" formats, then check them separately.
import datetime
# Make months using list comp
months_shrt = [datetime.date(1,m,1).strftime('%b') for m in range(1,13)]
months_long = [datetime.date(1,m,1).strftime('%B') for m in range(1,13)]
# Join together
months = months_shrt + months_long
months_or = f'({"|".join(months)})'
# raw strings keep \d from being treated as a string escape
expr_dmy = r'\d{1,3},? ' + months_or + r',? \d{4}'
expr_mdy = months_or + r',? \d{1,3},? \d{4}'
You can try both out and see which one matches. However, you'll still need to inspect it and convert it to your favourite flavour of date format.
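For the grouping question itself, a non-capturing group `(?:...)` confines the `|` to the month names, and it also keeps `re.findall` returning the whole match rather than just the group. A minimal sketch with the month list written out by hand; `find_dates` is a hypothetical helper name, and the day is assumed to be 1-2 digits:

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|August|"
    "September|October|November|December|"
    "Jan|Feb|Mar|Apr|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
)

# Either "<day> <month>" or "<month> <day>", then an optional comma and a 4-digit year.
# (?:...) groups without capturing, so the alternation stops at the parentheses.
DATE_RE = re.compile(
    rf"\b(?:\d{{1,2}},? (?:{MONTHS})|(?:{MONTHS}) \d{{1,2}}),? \d{{4}}\b"
)

def find_dates(text):
    """Return every date-like substring found in text."""
    return DATE_RE.findall(text)

print(find_dates("Signed 14 January, 2005 and again on Feb 29 1982."))
# ['14 January, 2005', 'Feb 29 1982']
```

The pattern only checks the shape of the text, so an impossible date such as "Feb 31 1982" still matches and needs validating afterwards.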
Instead, I would advise not using regexp at all, and simply try different date formats.
str_a = ' ,'
str_b = ' ,'
base_fmts = [('%d', '%b', '%Y'),
             ('%d', '%B', '%Y'),
             ('%b', '%d', '%Y'),
             ('%B', '%d', '%Y')]
def my_formatter(s):
    for o in base_fmts:
        for i in range(2):
            for j in range(2):
                # Concatenate
                fmt = f'{o[0]}{str_a[i]} '
                fmt += f'{o[1]}{str_b[j]} '
                fmt += f'{o[2]}'
                try:
                    d = datetime.datetime.strptime(s, fmt)
                except ValueError:
                    continue
                else:
                    return d
The function above will take a string and return a datetime.datetime object. You can use standard datetime.datetime methods to get your day, month and year back.
>>> d = my_formatter('Jan 15, 2009')
>>> (d.month, d.day, d.year)
(1, 15, 2009)

Python - rename files incrementally based on julian day

Problem:
I have a bunch of files that were downloaded from an org. Halfway through their data directory the org changed the naming convention (reasons unknown). I am looking to create a script that will take the files in a directory and rename the file the same way, but simply "go back one day".
Here is a sample of how one file is named: org2015365_res_version.asc
What I need is logic to only change the year day (2015365) in this case to 2015364. This logic needs to span a few years so 2015001 would be 2014365.
I guess I'm not sure this is possible since its not working with the current date so using a module like datetime does not seem applicable.
Partial logic I came up with. I know it is rudimentary at best, but wanted to take a stab at it.
# open all files
all_data = glob.glob('/somedir/org*.asc')
# empty array to be appended to
day = []
year = []
# loop through all files
for f in all_data:
    # get first part of string, renders org2015365
    f_split = f.split('_')[0]
    # get only year day - renders 2015365
    year_day = f_split.replace(f_split[:3], '')
    # get only day - renders 365
    days = year_day.replace(year_day[0:4], '')
    # get only year - renders 2015
    day.append(days)
    years = year_day.replace(year_day[4:], '')
    year.append(years)
# convert to int for easier processing
day = [int(i) for i in day]
year = [int(i) for i in year]
if day == 001 & year == 2016:
    day = 365
    year = 2015
elif day == 001 & year == 2015:
    day = 365
    year = 2014
else:
    day = day - 1
Apart from the logic above I also came across the function below from this post, I am not sure what would be the best way to combine that with the partial logic above. Thoughts?
import glob
import os
def rename(dir, pattern, titlePattern):
    for pathAndFilename in glob.iglob(os.path.join(dir, pattern)):
        title, ext = os.path.splitext(os.path.basename(pathAndFilename))
        os.rename(pathAndFilename,
                  os.path.join(dir, titlePattern % title + ext))
rename(r'c:\temp\xx', r'*.doc', r'new(%s)')
Help me, stackoverflow. You're my only hope.
You can use datetime module:
#First argument - string like 2015365, second argument - format
dt = datetime.datetime.strptime(year_day,'%Y%j')
#Time shift
dt = dt + datetime.timedelta(days=-1)
#Year with shift
nyear = dt.year
#Day in year with shift
nday = dt.timetuple().tm_yday
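The same steps can be wrapped into a small function that takes and returns the year-day string, which makes the year boundary and leap years easy to check. A minimal sketch; `shift_year_day` is a hypothetical helper name:

```python
from datetime import datetime, timedelta

def shift_year_day(year_day, days=-1):
    """Shift a 'YYYYDDD' (year + day-of-year) string by a number of days."""
    dt = datetime.strptime(year_day, "%Y%j") + timedelta(days=days)
    # %j is the zero-padded day of the year, so no manual zfill is needed
    return dt.strftime("%Y%j")

print(shift_year_day("2015365"))  # 2015364
print(shift_year_day("2015001"))  # 2014365
print(shift_year_day("2017001"))  # 2016366 (2016 was a leap year)
```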
Based on feedback from the community I was able to get the logic needed to fix the files downloaded from the org! The logic was the biggest hurdle. It turns out that the datetime module can be used, I need to read up more on that.
I combined the logic with the batch renaming using the os module, I put the code below to help future users who may have a similar question!
import datetime
import glob
import os
# open all files, oldest first, so a file is renamed only after the previous day's file has been moved
all_data = sorted(glob.glob('/some_dir/org*.asc'))
# loop through
for f in all_data:
    # get first part of the file name, renders org2015365
    f_split = os.path.basename(f).split('_')[0]
    # get only year day - renders 2015365
    year_day = f_split[3:]
    # first argument - string 2015365, second argument - format the string to datetime
    dt = datetime.datetime.strptime(year_day, '%Y%j')
    # create a threshold where version changes its naming convention
    # only rename files greater than threshold
    threshold = '2014336'
    th = datetime.datetime.strptime(threshold, '%Y%j')
    if dt > th:
        # Time shift - go back one day
        dt = dt + datetime.timedelta(days=-1)
        # Year with shift
        nyear = dt.year
        # Day in year with shift
        nday = dt.timetuple().tm_yday
        # rename files correctly
        f_output = 'org' + str(nyear) + str(nday).zfill(3) + '_res_version.asc'
        os.rename(f, '/some_dir/' + f_output)
    else:
        pass
