How to convert a date like 'Monday 1st' to 'Monday 1' in Python?

I have tried a lot. Banning words doesn't help, removing certain characters doesn't help.
The datetime module doesn't have a directive for this. It has things like %d which will give you today's day, for example 24.
I have a date in the format of 'Tuesday 24th January' but I need it to be 'Tuesday 24 January'.
Is there a way to remove st, nd, rd, th? Or is there an even better way?
EDIT: even removing rd would remove it from Saturday. So that doesn't work either.

You can use a regex:
import re
d = 'Tuesday 24th January'
d = re.sub(r'(\d+)(st|nd|rd|th)', r'\1', d) # \1 to restore the captured day
print(d)
# Output
Tuesday 24 January
For Saturday 21st January:
d = 'Saturday 21st January'
d = re.sub(r'(\d+)(st|nd|rd|th)', r'\1', d)
print(d)
# Output
Saturday 21 January
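As a small variation on the regex above, here is a sketch that only removes a suffix when it directly follows a digit and ends at a word boundary, so the "rd" in Saturday is never touched. The helper name strip_ordinal is made up for illustration.

```python
import re

# Match st/nd/rd/th only when preceded by a digit and followed by a word boundary.
ORDINAL_RE = re.compile(r"(?<=\d)(?:st|nd|rd|th)\b")

def strip_ordinal(text):
    """Return text with ordinal suffixes removed from day numbers."""
    return ORDINAL_RE.sub("", text)

for sample in ("Monday 1st", "Tuesday 24th January", "Saturday 21st January"):
    print(strip_ordinal(sample))
```

The lookbehind keeps the digits in place, so no backreference is needed in the replacement.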

Related

How to extract non-standard dates from text in Python?

I have a dataframe similar to the following one:
df = pd.DataFrame({'Text': ['Hello I would like to get only the date which is 12-13 December 2018 amid this text.', 'Ciao, what I would like to do is to keep dates, e.g. 11-14 October 2019, and remove all the rest.','Hi, SO can you help me delete everything but 10 January 2011. I found it hard doing it myself.']})
I would like to extract only dates from the text. The problem is that it is hard to find patterns. The only rule I can find there is: keep 2/3 objects before a four-digit number (i.e. the year).
I tried many convoluted solutions but I am not able to get what I need.
The result should look like this:
["12-13 December 2018"
"11-14 October 2019"
"10 January 2011"]
Can anyone help me?
Thanks!
If "keep 2/3 object before a four-digit number (i.e. the year)" is a reliable rule then you could use the following:
import re
data = {'Text': ['Hello I would like to get only the date which is 12-13 December 2018 amid this text.', 'Ciao, what I would like to do is to keep dates, e.g. 11-14 October 2019, and remove all the rest.','Hi, SO can you help me delete everything but 10 January 2011. I found it hard doing it myself.']}
date_strings = []
for string in data['Text']:  # loop through each string
    words = string.split()  # split string on whitespace
    for index in range(len(words)):
        if re.search(r'(\d){4}', words[index]):  # if the 'word' contains 4 consecutive digits
            date_strings.append(' '.join(words[index-2:index+1]))  # extract that word & the preceding 2
            break
print(date_strings)
To get:
['12-13 December 2018', '11-14 October 2019,', '10 January 2011.']
Some assumptions:
the dates are always 3 'words' long
the years are always at the end of the dates
as pointed out in the comments, the only 4-digit number in the text is the year
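The output above still carries the trailing comma and full stop. A possible refinement, under the same assumptions, is to strip punctuation from each word before testing it. This is only a sketch; the function name extract_date is hypothetical.

```python
import re
import string

texts = [
    'Hello I would like to get only the date which is 12-13 December 2018 amid this text.',
    'Ciao, what I would like to do is to keep dates, e.g. 11-14 October 2019, and remove all the rest.',
    'Hi, SO can you help me delete everything but 10 January 2011. I found it hard doing it myself.',
]

def extract_date(text):
    """Return the year word plus the two words before it, or None."""
    words = text.split()
    for i, word in enumerate(words):
        cleaned = word.strip(string.punctuation)  # drop leading/trailing punctuation
        if i >= 2 and re.fullmatch(r"\d{4}", cleaned):  # exactly four digits
            return " ".join(words[i - 2:i] + [cleaned])
    return None

print([extract_date(t) for t in texts])
```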
Here is a potential solution using a regex:
from calendar import month_name
months = '|'.join(list(month_name)[1:])
df['Text'].str.extract(r'([0-9-]+ (?:%s) \d{4})' % months)[0]
alternative regex: r'((?:\d+-)?\d+ (?:%s) \d{4})' % months
output:
0 12-13 December 2018
1 11-14 October 2019
2 10 January 2011
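If you don't have pandas at hand, the same pattern works with the plain re module. This is a sketch assuming English month names (calendar.month_name follows the current locale); find_date is a made-up helper name.

```python
import re
from calendar import month_name

# Build "January|February|...|December"; index 0 of month_name is an empty string.
months = "|".join(list(month_name)[1:])
DATE_RE = re.compile(r"(?:\d+-)?\d+ (?:%s) \d{4}" % months)

def find_date(text):
    """Return the first 'day[-day] Month year' found in text, or None."""
    m = DATE_RE.search(text)
    return m.group(0) if m else None

print(find_date("keep dates, e.g. 11-14 October 2019, and remove all the rest."))
print(find_date("delete everything but 10 January 2011. I found it hard."))
```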

Slice does not work after using .split() function

I'm trying to create a function that takes the "day" part of a Y-M-D date string.
For example:
Input: ["2022 November 23,2023 April 9"]
Output: 23
I have tried to do this by using the .split() function to split the string up at the comma, then slicing the last 2 indexes out to get the day. However, while I can get the last term of the new split string easily, I cannot get the 2nd-to-last term.
Ex:
y_m_d="2022 November 20,2023 April 9"
split_ymd=y_m_d.split(",")
first_value=split_ymd[0]
print(split_ymd[-1]) #This prints "2023 April 9"
However, adding the 2nd argument to the slice command breaks it
y_m_d="2022 November 20,2023 April 9"
split_ymd=y_m_d.split(",")
first_value=split_ymd[0]
print(split_ymd[-1:-2]) #This prints "[]"
I understand that some of the terminology above might not be correct, as I am new to Python and to programming in general, and that the code above is very messy, but I just need help understanding why the slice command above does not work. I am open to suggestions on improving the code itself, but I really just want to know why the slice does not work in this situation.
Thanks!
The slice doesn't work in this situation because the slice doesn't work in any situation. It's unrelated to .split() or anything else you're doing.
Consider this simpler test case:
>>> [1,2,3,4][-1]
4
>>> [1,2,3,4][-1:-2]
[]
This happens because -1 refers to index 3 and -2 refers to index 2, and the span [3,2) is backwards so it's treated as empty.
You can swap them if you actually want a range:
>>> [1,2,3,4][-2:-1]
[3]
Or you can just use -2 if you want the second-from-last element:
>>> [1,2,3,4][-2]
3
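Applied to the date string from the question, the same rules look like this. It's just an illustrative sketch; the variable names are mine.

```python
first_value = "2022 November 20"
parts = first_value.split()   # ['2022', 'November', '20']

print(parts[-1])        # last element
print(parts[-1:-2])     # start is after stop, so the slice is empty
print(parts[-2:-1])     # one-element slice holding the second-to-last item
print(parts[-2:])       # last two elements
print(first_value[-2:]) # slicing the string itself gives its last two characters
```

A slice always runs from start up to (but not including) stop, so with the default step the start has to come before the stop.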
You can split your initial list on the comma to create a list containing multiple strings, each representing a date.
Then iterate over those dates, splitting them by spaces. The last value in the subsequent list is the day value you are looking for.
Let me know if this isn't clear.
It looks something like this:
list_of_dates = ["2022 November 23, 2023 April 9"]
# This separates all dates by splitting on the comma
dates = "".join(list_of_dates).split(",")
days = []
for d in dates:
    # This splits each date on the space
    temp = d.split(" ")
    days.append(temp[-1])
print(days)
# Output: ['23', '9']
So, I have two recommendations here:
Try learning how python slicing works (both negative and positive numbers)
For your actual solution, I see a list of dates. Comma separating it and then parsing the date into a datetime object might make things easier here.
# For example,
import datetime

date_str_list = "2022 November 20,2023 April 9"
for date_str in date_str_list.split(","):
    date = datetime.datetime.strptime(date_str, "%Y %B %d")
    day = date.day
See https://docs.python.org/3/library/datetime.html for more details on how you can control the string format of a datetime object.
The Amazing datetime Module!
First lesson: Let the built-ins do the hard work for you!
Here is an example of your function which parses a date string and returns the day. Additionally, here is an example implementation of how you can use the function.
Hope this helps!
The Function:
from datetime import datetime as dt

def extract_day(date_string, mask='%Y %B %d'):
    """Extract and return the day.

    Args:
        date_string (str): Date text as a string.
        mask (str): Format of the provided date string,
            used for parsing.
    """
    day = dt.strptime(date_string, mask).day
    return day
Implementation:
dates1 = "2022 November 20,2023 April 9"
dates2 = "20 Nov 2022,9 Apr 2023"
date_strings1 = dates1.split(',')
date_strings2 = dates2.split(',')
for date in date_strings1:
    day = extract_day(date)
    print('\nOriginal string: {}'.format(date))
    print('Extracted day: {}'.format(day))
for date in date_strings2:
    day = extract_day(date, mask='%d %b %Y')
    print('\nOriginal string: {}'.format(date))
    print('Extracted day: {}'.format(day))
Output:
Original string: 2022 November 20
Extracted day: 20
Original string: 2023 April 9
Extracted day: 9
Original string: 20 Nov 2022
Extracted day: 20
Original string: 9 Apr 2023
Extracted day: 9
How about:
ymd="2022 November 20,2023 April 9"
lchar = ymd.find(',')
fchar = lchar-2
d_int = int(ymd[fchar:lchar])
print(d_int)
If there is only one comma after the date this should give you what you want.
You can tackle this in a few different ways:
Using the len function:
date = "2013 November 20,2023 April 10"
splitted = date.split(',')
splitted[0][len(splitted[0])-2:]
Use negative indexing:
date = "2013 November 20,2023 April 10"
splitted = date.split(',')
splitted[0][-2:]
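Note that a fixed two-character slice also picks up the space in front of a single-digit day such as "9". A sketch that avoids this by splitting on the last space instead (day_of is a hypothetical helper name):

```python
def day_of(date_text):
    """Return the day as an int from a 'YYYY Month D' string."""
    # rsplit with maxsplit=1 cuts only at the last space.
    return int(date_text.rsplit(" ", 1)[-1])

dates = "2013 November 20,2023 April 9"
days = [day_of(part.strip()) for part in dates.split(",")]
print(days)
```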

How to cut out text in a string that varies?

I was just curious if there was a way to delete text from a string or only capture specific text when the string varies in info.
Examples of the strings I'm working with:
3/5/2019 12:38 PM
10/30/2019 6:32 AM
9/12/2019 9:53 AM
I want to be able to extract the date and the hour of the day separately and append them to a list. However, those obviously vary, and even the index of the hour can change: the day, month, or hour can each become two digits, which can push it back by up to three positions.
import re
s = "3/5/2019 12:38 PM"
result = re.compile(r"[\s\/:]").split(s)
result:
['3', '5', '2019', '12', '38', 'PM']
This should solve your problem, assuming the delimiters in the incoming strings are always the same.
You can use regular expressions.
Something like this:
import re
m = re.match(r"(\d+/\d+/\d+) (\d+:\d+) (\wM)", "3/5/2019 12:38 PM")
print(m.groups())
This will print a tuple whose first item is the date, second item is the time, and third item is AM or PM: ('3/5/2019', '12:38', 'PM'), which you can easily parse yourself.
Edit
you can also use the datetime module to parse the date string:
import datetime
dt = datetime.datetime.strptime("3/5/2019 12:38 PM", "%m/%d/%Y %I:%M %p")  # month/day/year, as in 10/30/2019
print(dt.date(), dt.hour)
This gives you a datetime object from which you can get all the information you need.
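To collect the date and the hour into separate lists, as the question asks, something along these lines might work. It's a sketch that assumes the strings are month/day/year (the sample 10/30/2019 suggests so) and an English locale for %p; the list names are mine.

```python
from datetime import datetime

samples = ["3/5/2019 12:38 PM", "10/30/2019 6:32 AM", "9/12/2019 9:53 AM"]

dates, hours = [], []
for s in samples:
    # strptime accepts single-digit month, day, and hour values.
    parsed = datetime.strptime(s, "%m/%d/%Y %I:%M %p")
    dates.append(parsed.date().isoformat())
    hours.append(parsed.hour)  # 24-hour clock

print(dates)
print(hours)
```

Because the parser handles the variable widths, no index arithmetic on the string is needed.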

How to write a Regex to validate date format of type DAY, MONTH dd, yyyy?

I have a date string like Thursday, December 13, 2018 i.e., DAY, MONTH dd, yyyy and I need to validate it with a regular expression.
The regex should not validate incorrect day or month. For example, Muesday, December 13, 2018 and Thursday, December 32, 2018 should be marked invalid.
What I could do so far is write expressions for the ", ", "dd", and "yyyy". I don't understand how will I customize the regex in such a way that it would accept only correct day's and month's name.
My attempt:
^([something would come over here for day name]day)([\,]|[\, ])(something would come over here for month name)(0?[1-9]|[12][0-9]|3[01])([\,]|[\, ])([12][0-9]\d\d)$
Thanks.
EDIT: I have only included years starting from year 1000 - year 2999. Validating leap years does not matter.
You can try a library that already implements the matching for "complex" cases like yours. It is called datefinder.
Its author has done the work of finding many kinds of dates in text for you:
https://github.com/akoumjian/datefinder
To install : pip install datefinder
import datefinder
string_with_dates = """
entries are due by January 4th, 2017 at 8:00pm
created 01/15/2005 by ACME Inc. and associates.
"""
matches = datefinder.find_dates(string_with_dates)
for match in matches:
    print(match)
# Output
2017-01-04 20:00:00
2005-01-15 00:00:00
To detect misspelled words like "Muesday", you can filter your text with a spellchecker like PyEnchant:
>>> import enchant
>>> d = enchant.Dict("en_US")
>>> print(d.check("Monday"))
True
>>> print(d.check("Muesday"))
False
>>> print(d.suggest("Muesday"))
['Tuesday', 'Domesday', 'Muesli', 'Wednesday', 'Mesdames']
Regex is not the way to go to solve your problem!
But here is some example code showing how the "something would come over here for day name" section of your pattern could be written. I also added an example of how to use strptime(), which is a much better solution in your case:
import re
from datetime import datetime
s = """
Thursday, December 13, 2018
Muesday, December 13, 2018
Monday, January 13, 2018
Thursday, December 32, 2018
"""
pat = r"""
    ^
    (Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)
    ,[ ]    # [ ] is a literal space (plain whitespace is ignored in verbose mode)
    (January|February|March|April|May|June|July|August|September|October|November|December)
    [ ]
    (0?[1-9]|[12][0-9]|3[01])
    ,[ ]
    ([12][0-9]\d\d)
    $
"""
for match in re.finditer(pat, s, re.VERBOSE | re.MULTILINE):
    print(match.group(0))
for row in s.split('\n'):
    try:
        match = datetime.strptime(row, '%A, %B %d, %Y')
        print(match)
    except ValueError:
        print("'%s' is not valid" % row)
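Building on the strptime() idea, a minimal validator could look like the sketch below. The function name is_valid_date is made up for illustration, it assumes an English locale for %A and %B, and it doesn't enforce the 1000-2999 year range from the question. As far as I know, strptime() doesn't check that the weekday name agrees with the calendar date, so the sketch compares the two explicitly.

```python
from datetime import datetime

def is_valid_date(text, fmt="%A, %B %d, %Y"):
    """Return True if text parses with fmt and the weekday matches the date."""
    try:
        parsed = datetime.strptime(text, fmt)
    except ValueError:
        return False  # unknown day/month name or impossible day number
    # Compare the weekday written in the text with the real weekday of the date.
    written_day = text.split(",")[0].strip()
    return parsed.strftime("%A").casefold() == written_day.casefold()

for row in ("Thursday, December 13, 2018",
            "Muesday, December 13, 2018",
            "Monday, January 13, 2018",
            "Thursday, December 32, 2018"):
    print(row, "->", is_valid_date(row))
```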

Extracting dates from a string in python

I have a string as
fmt_string2 = "I want to apply for leaves from 12/12/2017 to 12/18/2017"
Here I want to extract the dates. But my code needs to be robust, as a date can be in any format, e.g. 12 January 2017 or 12 Jan 17, and its position can also change.
For the above code I have tried doing:
''.join(fmt_string2.split()[-1].split('.')[::-10])
But here I am giving the position of my date, which I don't want.
Can anyone help in making robust code for extracting dates?
If 12/12/2017, 12 January 2017, and 12 Jan 17 are the only possible patterns then the following code that uses regex should be enough.
import re
string = 'I want to apply for leaves from 12/12/2017 to 12/18/2017 I want to apply for leaves from 12 January 2017 to ' \
'12/18/2017 I want to apply for leaves from 12/12/2017 to 12 Jan 17 '
matches = re.findall(r'(\d{2}[\/ ](\d{2}|January|Jan|February|Feb|March|Mar|April|Apr|May|June|Jun|July|Jul|August|Aug|September|Sep|October|Oct|November|Nov|December|Dec)[\/ ]\d{2,4})', string)
for match in matches:
print(match[0])
Output:
12/12/2017
12/18/2017
12 January 2017
12/18/2017
12/12/2017
12 Jan 17
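A possible variation, in case single-digit days or months can occur: the sketch below uses \d{1,2} and a non-capturing group, so findall() returns the full date strings directly instead of tuples. The names MONTHS and DATE_RE are mine, and the pattern still only covers the three formats listed above.

```python
import re

MONTHS = ("January|Jan|February|Feb|March|Mar|April|Apr|May|June|Jun|"
          "July|Jul|August|Aug|September|Sep|October|Oct|November|Nov|December|Dec")

# day, separator, numeric or named month, separator, 2-4 digit year
DATE_RE = re.compile(r"\b\d{1,2}[/ ](?:\d{1,2}|%s)[/ ]\d{2,4}\b" % MONTHS)

text = ("I want to apply for leaves from 12/12/2017 to 12/18/2017, "
        "then from 5 January 2017 to 12 Jan 17")
found = DATE_RE.findall(text)
print(found)
```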
To understand the regex, play with it on regex101.
Using Regular Expressions
Rather than going through regex completely, I suggest the following approach:
import re
from dateutil.parser import parse
Sample Text
text = """
I want to apply for leaves from 12/12/2017 to 12/18/2017
then later from 12 January 2018 to 18 January 2018
then lastly from 12 Feb 2018 to 18 Feb 2018
"""
A regular expression to find anything of the form "from A to B". The advantage here is that I don't have to handle each and every date format and keep extending my regex; the pattern stays generic.
pattern = re.compile(r'from (.*) to (.*)')
matches = re.findall(pattern, text)
The matches from the above regex for the text are:
[('12/12/2017', '12/18/2017'), ('12 January 2018', '18 January 2018'), ('12 Feb 2018', '18 Feb 2018')]
For each match I parse the dates. An exception is thrown for a value that isn't a date, so the except block skips it.
for val in matches:
    try:
        dt_from = parse(val[0])
        dt_to = parse(val[1])
        print("Leave applied from", dt_from.strftime('%d/%b/%Y'), "to", dt_to.strftime('%d/%b/%Y'))
    except ValueError:
        print("skipping", val)
Output:
Leave applied from 12/Dec/2017 to 18/Dec/2017
Leave applied from 12/Jan/2018 to 18/Jan/2018
Leave applied from 12/Feb/2018 to 18/Feb/2018
Using pyparsing
Regular expressions have the limitation that they can become very complex once they have to be dynamic enough to handle less straightforward input, e.g.
text = """
I want to apply for leaves from start 12/12/2017 to end date 12/18/2017 some random text
then later from 12 January 2018 to 18 January 2018 some random text
then lastly from 12 Feb 2018 to 18 Feb 2018 some random text
"""
So, Python's pyparsing module is a better fit here.
import calendar
import pyparsing as pp
Here approach is to create a dictionary that can parse the entire text.
Create keywords for month names that can be used as pyparsing keyword
months_list= []
for month_idx in range(1, 13):
    months_list.append(calendar.month_name[month_idx])
    months_list.append(calendar.month_abbr[month_idx])
# join the list to use it as pyparsing keyword
month_keywords = " ".join(months_list)
Dictionary for parsing:
# date separator - can be one of '/', '.', or ' '
separator = pp.Word("/. ")
# Dictionary for numeric date e.g. 12/12/2018
numeric_date = pp.Combine(pp.Word(pp.nums, max=2) + separator + pp.Word(pp.nums, max=2) + separator + pp.Word(pp.nums, max=4))
# Dictionary for text date e.g. 12/Jan/2018
text_date = pp.Combine(pp.Word(pp.nums, max=2) + separator + pp.oneOf(month_keywords) + separator + pp.Word(pp.nums, max=4))
# Either numeric or text date
date_pattern = numeric_date | text_date
# Final dictionary - from x to y
pattern = pp.Suppress(pp.SkipTo("from") + pp.Word("from") + pp.Optional("start") + pp.Optional("date")) + date_pattern
pattern += pp.Suppress(pp.Word("to") + pp.Optional("end") + pp.Optional("date")) + date_pattern
# Group the pattern, also it can be multiple
pattern = pp.OneOrMore(pp.Group(pattern))
Parse the input text:
result = pattern.parseString(text)
# Print result
for match in result:
    print("from", match[0], "to", match[1])
Output:
from 12/12/2017 to 12/18/2017
from 12 January 2018 to 18 January 2018
from 12 Feb 2018 to 18 Feb 2018
