Metacharacters python extracting dates - python

I want to extract dates written as day, month, year or as month, day, year.
For example: 14 January, 2005 or Feb 29 1982.
The code I'm using:
date = re.findall(r'\d{1,3} Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec|January|February|March|April|May|June|July|August|September|October|November|December \d{1,3}[, ]\d{4}',line)
Python interprets this as 1-3 digits followed by Jan, or any one of the other month names on its own, or December followed by the rest of the pattern. So it matches only "Feb" or "12 Jan", but not the rest of the date.
How do I group ONLY the months, so that the | applies to the months but not to the rest of the expression?

Answering your question directly, you can make two regexps for your "Day Month Year" and "Month Day Year" formats, then check them separately.
import datetime
# Make months using list comp
months_shrt = [datetime.date(1,m,1).strftime('%b') for m in range(1,13)]
months_long = [datetime.date(1,m,1).strftime('%B') for m in range(1,13)]
# Join together
months = months_shrt + months_long
months_or = f'({"|".join(months)})'
# Raw strings keep \d from being treated as a string escape
expr_dmy = r'\d{1,3},? ' + months_or + r',? \d{4}'
expr_mdy = months_or + r',? \d{1,3},? \d{4}'
You can try both out and see which one matches. However, you'll still need to inspect it and convert it to your favourite flavour of date format.
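To answer the grouping question itself, one option is a non-capturing group, (?:...), which limits the | to the month names. The sketch below is only an illustration: the names MONTHS, DATE_RE and find_dates are made up for this example, and day numbers are assumed to have one or two digits. Because the pattern contains no capturing groups, re.findall returns the whole matched date rather than just the month.

```python
import re

# Each optional suffix covers both spellings, e.g. Jan(?:uary)? matches
# "Jan" and "January". The outer (?:...) confines the | to the months.
MONTHS = (
    r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|"
    r"Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|"
    r"Dec(?:ember)?)"
)

# Either "day month year" or "month day year", with optional commas.
DATE_RE = re.compile(
    rf"\b(?:\d{{1,2}},? {MONTHS},? \d{{4}}|{MONTHS},? \d{{1,2}},? \d{{4}})\b"
)


def find_dates(text):
    """Return every substring of text that looks like one of the two formats."""
    return DATE_RE.findall(text)


print(find_dates("14 January, 2005 or Feb 29 1982"))
```

Note that the pattern only checks the shape of the text, so an impossible date such as Feb 29 1982 still matches.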
Instead, I would advise not using a regexp at all, and simply trying different date formats.
str_a = ' ,'
str_b = ' ,'
base_fmts = [('%d', '%b', '%Y'),
             ('%d', '%B', '%Y'),
             ('%b', '%d', '%Y'),
             ('%B', '%d', '%Y')]
def my_formatter(s):
    for o in base_fmts:
        for i in range(2):
            for j in range(2):
                # Concatenate
                fmt = f'{o[0]}{str_a[i]} '
                fmt += f'{o[1]}{str_b[j]} '
                fmt += f'{o[2]}'
                try:
                    d = datetime.datetime.strptime(s, fmt)
                except ValueError:
                    continue
                else:
                    return d
The function above will take a string and return a datetime.datetime object. You can use standard datetime.datetime methods to get your day, month and year back.
>>> d = my_formatter('Jan 15, 2009')
>>> (d.month, d.day, d.year)
(1, 15, 2009)

Related

Regex with date as String in Azure path

I have many folders (in Microsoft Azure Data Lake), each named with a date in the form "ddmmyyyy". So far, I have used a pattern like the following to read all files in all folders of a given month and year:
path_data="/mnt/data/[0-9]*032022/data_[0-9]*.json" # all folders of all days of month 03 of 2022
result=spark.read.json(path_data)
My problem now is to read all folders that fall within exactly one year before a given date.
For example, for the date 14-03-2022, I need a regex to automatically read all files of all folders between 14-03-2021 and 14-03-2022.
I tried extracting the month and year into string variables, then using them in a regex that respects the conditions (for the example shown, the month should be greater than 03 when the year is 2021 and less than 03 when the year is 2022). I tried something similar to the following (with the variables replaced by 03, 2021 and 2022):
date_regex="([0-9]{2}[03-12]2021)|([0-9]{2}[01-03]2022)"
Is there any hint on how I can perform such a task?
Thanks in advance
If I understand your question correctly, to find dates between ??-03-2021 and ??-03-2022 in the file name field, you can use the following regex:
date_regex="([0-9]{2}-03-2021)|([0-9]{2}-03-2022)"
If you need something more customized, you can adapt the pattern at the link below:
https://regex101.com/r/AgqFfH/1
Update: extract any folder named with a date between 14032021 and 14032022.
Solution: first extract the date in ddmmyyyy format with a regex, then keep the folder if that date lies between the two bounds, assuming the format is correct and such a date is found in the name.
import re
from datetime import date
date_regex = r"(0[1-9]|[12][0-9]|3[01])(0[1-9]|1[012])([0-9]{4})"
m = re.search(date_regex, folder_name)  # folder_name: one folder name, e.g. "14032021"
if m and date(2021, 3, 14) <= date(int(m.group(3)), int(m.group(2)), int(m.group(1))) <= date(2022, 3, 14):
    ...  # do any operation
The above code is just a rough sketch to give you an overview of the method.
I hope this solution helps.
It certainly isn't pretty, but here you go:
#input
day = "14"; month = "03"; startYear = "2021";
#day construction
sameTensAfter = '(' + day[0] + '[' + day[1] + '-9])';
theDaysAfter = '([' + chr(ord(day[0])+1) + '-9][0-9])';
sameTensBefore = '(' + day[0] + '[0-' + day[1] + '])';
theDaysBefore = '';
if day[0] != '0':
    theDaysBefore = '|([0-' + chr(ord(day[0])-1) + '][0-9])';
#build the part for the dates with the same month as query
afterDayPart = '%s|%s' %(sameTensAfter, theDaysAfter);
#the '|' is only added when there are earlier tens, to avoid an empty alternative
beforeDayPart = sameTensBefore + theDaysBefore;
theMonthAfter = str(int(month) + 1).zfill(2);
afterMonthPart = theMonthAfter[0] + '([' + theMonthAfter[1] + '-9])';
if theMonthAfter[0] == '0':
    afterMonthPart += '|(1[0-2])';
theMonthBefore = str(int(month) - 1).zfill(2);
beforeMonthPart = theMonthBefore[0] + '([0-' + theMonthBefore[1] + '])';
if theMonthBefore[0] == '1':
    beforeMonthPart = '(0[0-9])|' + beforeMonthPart;
#4 kinds of matches:
startDateRange = '((%s)(%s)(%s))' %(afterDayPart, month, startYear);
anyDayAfterMonth = '((%s)(%s)(%s))' %('[0-9]{2}', afterMonthPart, startYear);
endDateRange = '((%s)(%s)(%s))' %(beforeDayPart, month, int(startYear)+1);
anyDayBeforeMonth = '((%s)(%s)(%s))' %('[0-9]{2}', beforeMonthPart, int(startYear)+1);
#print regex
date_regex = startDateRange + '|' + anyDayAfterMonth + '|' + endDateRange + '|' + anyDayBeforeMonth;
print(date_regex);
#this prints:
#(((1[4-9])|([2-9][0-9]))(03)(2021))|(([0-9]{2})(0([4-9])|(1[0-2]))(2021))|(((1[0-4])|([0-0][0-9]))(03)(2022))|(([0-9]{2})(0([0-2]))(2022))
startDateRange: the month is the same and it's the starting year, this will take all the days including and after.
anyDayAfterMonth: the month is greater and it's the starting year, this will take any day.
endDateRange: the month is the same and it's the ending year, this will take all the days including and before.
anyDayBeforeMonth: the month is less than and it's the ending year, this will take any day.
Here's an example: https://regex101.com/r/i76s58/1
To compare the dates, use the datetime module, as in the example below.
Then you can extract only the folders that satisfy your condition.
# importing datetime module
import datetime
# date in yyyy/mm/dd format
d1 = datetime.datetime(2018, 5, 3)
d2 = datetime.datetime(2018, 6, 1)
# Comparing the dates will return
# either True or False
print("d1 is greater than d2 : ", d1 > d2)
print("d1 is less than d2 : ", d1 < d2)
print("d1 is not equal to d2 : ", d1 != d2)
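Following the datetime idea, another option is to skip the regex and generate the folder names for the whole range directly. This is a minimal sketch: folder_names is a hypothetical helper, and it assumes the path accepts Hadoop-style {a,b,c} alternation in the same way it accepts [0-9]*. If that assumption does not hold, the individual paths can be passed to spark.read.json as a list instead.

```python
from datetime import date, timedelta


def folder_names(start, end):
    """Return the ddmmyyyy folder name of every day from start to end, inclusive."""
    days = (end - start).days
    return [(start + timedelta(days=i)).strftime("%d%m%Y") for i in range(days + 1)]


names = folder_names(date(2021, 3, 14), date(2022, 3, 14))

# Assumption: brace alternation is understood by the path globbing in use.
path_data = "/mnt/data/{" + ",".join(names) + "}/data_[0-9]*.json"

print(len(names), names[0], names[-1])
```

Working with real dates avoids the month and day boundary cases that make the hand-built regex so long.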

For-loop for days over years, python script for targeting data

If I want to add a loop to constrain days as well, what is the easiest way to do it, considering the different lengths of months, leap years, etc.?
This is the script with years and months:
import calendar
yearStart = 2010
yearEnd = 2017
monthStart = 1
monthEnd = 12
for year in list(range(yearStart, yearEnd + 1)):
    for month in list(range(monthStart, monthEnd + 1)):
        startDate = '%04d%02d%02d' % (year, month, 1)
        numberOfDays = calendar.monthrange(year, month)[1]
        lastDate = '%04d%02d%02d' % (year, month, numberOfDays)
If you want only the days then this code, using the pendulum library, is probably the easiest.
>>> import pendulum
>>> first_date = pendulum.Pendulum(2010, 1, 1)
>>> end_date = pendulum.Pendulum(2018, 1, 1)
>>> for day in pendulum.period(first_date, end_date).range('days'):
...     print(day)
...     break
...
2010-01-01T00:00:00+00:00
pendulum has many other nice features. For one thing, it's a drop-in replacement for datetime. Therefore, many of the properties and methods that you are familiar with using for that class will also be available to you.
You may want to use datetime in addition to the calendar library. I am not exactly sure of the requirements, but it appears you want the first and last date of a given month and year, and then to loop through those dates. The following function gives you the first day and last day of a month. You can then loop between those two dates in whichever way you want.
import datetime
import calendar
def get_first_last_day(month, year):
    date = datetime.datetime(year=year, month=month, day=1)
    first_day = date.replace(day = 1)
    last_day = date.replace(day = calendar.monthrange(date.year, date.month)[1])
    return first_day, last_day
Adding the logic for looping between the two dates as well.
first_day, last_day = get_first_last_day(1, 2010)  # example month and year
d = first_day
delta = datetime.timedelta(days=1)
while d <= last_day:
    print(d.strftime("%Y-%m-%d"))
    d += delta
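If the goal is simply every day across the whole range of years, the month loop can be dropped altogether. The sketch below uses only the standard library; iter_days is a hypothetical name, and the start and end dates are taken from the yearStart and yearEnd values in the question. Adding a timedelta of one day handles month lengths and leap years automatically.

```python
from datetime import date, timedelta


def iter_days(start, end):
    """Yield every date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


# Same YYYYMMDD string format as startDate and lastDate in the question.
days = [d.strftime("%Y%m%d") for d in iter_days(date(2010, 1, 1), date(2017, 12, 31))]

print(days[0], days[-1], len(days))
```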

Parsing a string and converting a date using Python

I am trying to parse this "For The Year Ending December 31, 2015" and convert it to 2015-12-31 using the datetime lib. How would I go about partitioning and then converting the date? My program looks through an Excel file with multiple sheets and combines them into one; however, there is now a need to add a date column, and I can only get it to write the full value to the cell. So my date column currently has "For The Year Ending December 31, 2015" in all the rows.
Thanks in advance!
Here is the code block that is working now. Thanks all! edited to account for text that could vary.
if rx > (options.startrow-1):
    ws.write(rowcount, 0, sheet.name)
    date_value = sheet.cell_value(4,0)
    s = date_value.split(" ")
    del s[-1]
    del s[-1]
    del s[-1]
    string = ' '.join(s)
    d = datetime.strptime(date_value, string + " %B %d, %Y")
    result = datetime.strftime(d, '%Y-%m-%d')
    ws.write(rowcount, 9, result)
    for cx in range(sheet.ncols):
Simply include the hard-coded portion and then use the proper identifiers:
>>> import datetime
>>> s = "For The Year Ending December 31, 2015"
>>> d = datetime.datetime.strptime(s, 'For The Year Ending %B %d, %Y')
>>> result = datetime.datetime.strftime(d, '%Y-%m-%d')
>>> print(result)
2015-12-31
from datetime import datetime
date_string = 'For The Year Ending December 31, 2015'
date_string_format = 'For The Year Ending %B %d, %Y'
date_print_class = datetime.strptime(date_string, date_string_format)
wanted_date = datetime.strftime(date_print_class, '%Y-%m-%d')
print(wanted_date)
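When the leading text can vary, as the question's edit mentions, one possible approach is to pull out only the trailing date with a regex and hand that to strptime. This is a rough sketch, not the asker's code: TRAILING_DATE and extract_iso_date are hypothetical names, the date is assumed to sit at the end of the string in "Month D, YYYY" form, and %B assumes English month names in the current locale.

```python
import re
from datetime import datetime

# A month word, a 1-2 digit day, a comma, and a 4-digit year at the end.
TRAILING_DATE = re.compile(r"([A-Za-z]+ \d{1,2}, \d{4})\s*$")


def extract_iso_date(text):
    """Return the trailing date of text as YYYY-MM-DD, or None if absent."""
    match = TRAILING_DATE.search(text)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%B %d, %Y").strftime("%Y-%m-%d")


print(extract_iso_date("For The Year Ending December 31, 2015"))
print(extract_iso_date("For The Quarter Ending March 31, 2016"))
```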

How to parse string dates with 2-digit year?

I need to parse strings representing 6-digit dates in the format yymmdd where yy ranges from 59 to 05 (1959 to 2005). According to the time module docs, Python's default pivot year is 1969 which won't work for me.
Is there an easy way to override the pivot year, or can you suggest some other solution? I am using Python 2.7. Thanks!
I'd use datetime and parse it out normally. Then I'd use datetime.datetime.replace on the object if it is past your ceiling date -- Adjusting it back 100 yrs.:
import datetime
dd = datetime.datetime.strptime(date,'%y%m%d')
if dd.year > 2005:
    dd = dd.replace(year=dd.year-100)
Prepend the century to your date using your own pivot:
year = int(date[0:2])
if 59 <= year <= 99:
    date = '19' + date
else:
    date = '20' + date
and then use strptime with the %Y directive instead of %y.
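Put together, the century-prefix idea might look like the sketch below. The function name parse_yymmdd and its pivot parameter are made up for this illustration; the default pivot of 59 comes from the range given in the question.

```python
from datetime import datetime


def parse_yymmdd(text, pivot=59):
    """Parse a yymmdd string, mapping yy >= pivot to 19yy and yy < pivot to 20yy."""
    century = "19" if int(text[:2]) >= pivot else "20"
    # With the century prepended, %Y can be used instead of %y.
    return datetime.strptime(century + text, "%Y%m%d")


print(parse_yymmdd("590101").date())
print(parse_yymmdd("051231").date())
```

Because the pivot is an ordinary argument, the same function can be reused for data sets with a different cutoff year.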
import datetime
date = '20-Apr-53'
dt = datetime.datetime.strptime( date, '%d-%b-%y' )
if dt.year > 2000:
    dt = dt.replace( year=dt.year-100 )  # 2053 -> 1953
print(dt.strftime( '%Y-%m-%d' ))
You can also perform the following:
today=datetime.datetime.today().strftime("%m/%d/%Y")
today=today[:-4]+today[-2:]
Recently had a similar case, ended up with this basic calculation and logic:
pivotyear = 1969
century = int(str(pivotyear)[:2]) * 100
def year_2to4_digit(year):
    return century + year if century + year > pivotyear else (century + 100) + year
If you are dealing with very recent dates as well as very old dates and want to use the current date as a pivot (not just the current year), try this code:
import datetime
def parse_date(date_str):
    parsed = datetime.datetime.strptime(date_str,'%y%m%d')
    current_date = datetime.datetime.now()
    if parsed > current_date:
        parsed = parsed.replace(year=parsed.year - 100)
    return parsed

Convert date Python

I have MMDDYY dates, e.g. today is 111609.
How do I convert this to 11/16/2009 in Python?
I suggest the following:
import datetime
date = datetime.datetime.strptime("111609", "%m%d%y")
print(date.strftime("%m/%d/%Y"))
This will convert 010199 to 01/01/1999 and 010109 to 01/01/2009.
date = '111609'
# Assumes every year is in the 2000s
new_date = date[0:2] + '/' + date[2:4] + '/20' + date[4:6]
>>> s = '111609'
>>> d = datetime.date(int('20' + s[4:6]), int(s[0:2]), int(s[2:4]))
>>> # or, alternatively (and better!), as initially shown here, by Tim
>>> # d = datetime.datetime.strptime(s, "%m%d%y")
>>> d.strftime('%m/%d/%Y')
'11/16/2009'
One of the reasons why the strptime() approach is better is that it interprets two-digit years with a window around contemporary times, so it properly handles dates in both the latter part of the 20th century and the early part of the 21st century.
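The window can be seen with a small check. In this sketch, to_slashed is a hypothetical helper name; the behaviour shown is the documented %y rule, where values 69 to 99 map to 1969 to 1999 and values 0 to 68 map to 2000 to 2068.

```python
from datetime import datetime


def to_slashed(mmddyy):
    """Convert an MMDDYY string to MM/DD/YYYY using strptime's %y window."""
    return datetime.strptime(mmddyy, "%m%d%y").strftime("%m/%d/%Y")


print(to_slashed("111609"))  # 11/16/2009
print(to_slashed("010169"))  # 01/01/1969
print(to_slashed("010168"))  # 01/01/2068
```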
Incomplete but here's an idea
# target: 11/16/2009
date_var = "11/16/2009"
d = datetime.date(year=2009, month=11, day=16)
