How to extract non-standard dates from text in Python? - python

I have a dataframe similar to the following one:
df = pd.DataFrame({'Text': ['Hello I would like to get only the date which is 12-13 December 2018 amid this text.', 'Ciao, what I would like to do is to keep dates, e.g. 11-14 October 2019, and remove all the rest.','Hi, SO can you help me delete everything but 10 January 2011. I found it hard doing it myself.']})
I would like to extract only dates from the text. The problem is that it is hard to find patterns. The only rule I can find there is: keep 2/3 objects before a four-digit number (i.e. the year).
I tried many convoluted solutions but I am not able to get what I need.
The result should look like this:
["12-13 December 2018"
"11-14 October 2019"
"10 January 2011"]
Can anyone help me?
Thanks!

If "keep 2/3 object before a four-digit number (i.e. the year)" is a reliable rule then you could use the following:
import re
data = {'Text': ['Hello I would like to get only the date which is 12-13 December 2018 amid this text.', 'Ciao, what I would like to do is to keep dates, e.g. 11-14 October 2019, and remove all the rest.','Hi, SO can you help me delete everything but 10 January 2011. I found it hard doing it myself.']}
date_strings = []
for string in data['Text']:  # loop through each string
    words = string.split()  # split the string on whitespace
    for index in range(len(words)):
        if re.search(r'(\d){4}', words[index]):  # if the 'word' contains 4 consecutive digits
            date_strings.append(' '.join(words[index-2:index+1]))  # extract that word & the preceding 2
            break
print(date_strings)
To get:
['12-13 December 2018', '11-14 October 2019,', '10 January 2011.']
Some assumptions:
the dates are always 3 'words' long
the years are always at the end of the dates
as pointed out in the comments, the only 4-digit number in the text is the year
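A slightly more defensive variant of the loop above could look like the sketch below. It relies on the same assumptions (the year is the only 4-digit number and comes last). The helper name extract_date and its n_before parameter are made up for illustration. Punctuation is stripped from the year token, so the results no longer end in ',' or '.'.

```python
import re
import string

texts = [
    'Hello I would like to get only the date which is 12-13 December 2018 amid this text.',
    'Ciao, what I would like to do is to keep dates, e.g. 11-14 October 2019, and remove all the rest.',
    'Hi, SO can you help me delete everything but 10 January 2011. I found it hard doing it myself.',
]

def extract_date(text, n_before=2):
    """Return the first 4-digit 'word' plus the n_before words before it, or None."""
    words = text.split()
    for i, word in enumerate(words):
        token = word.strip(string.punctuation)  # '2019,' becomes '2019'
        if re.fullmatch(r'\d{4}', token):       # the whole token must be 4 digits
            start = max(i - n_before, 0)        # avoid a negative start index
            return ' '.join(words[start:i] + [token])
    return None

date_strings = [extract_date(t) for t in texts]
print(date_strings)
```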

Here is a potential solution using a regex:
from calendar import month_name
months = '|'.join(list(month_name)[1:])
df['Text'].str.extract(r'([0-9-]+ (?:%s) \d{4})' % months)[0]
alternative regex: r'((?:\d+-)?\d+ (?:%s) \d{4})' % months
output:
0 12-13 December 2018
1 11-14 October 2019
2 10 January 2011
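To try the pattern outside pandas, the same regex can be applied with re alone. This is a minimal sketch, assuming English month names from calendar.month_name (which follows the active locale). extract_dates is a hypothetical helper name, not part of any library.

```python
import re
from calendar import month_name

# month_name[0] is an empty string, so skip it
months = '|'.join(list(month_name)[1:])
# optional "dd-" range prefix, day, month name, 4-digit year
pattern = re.compile(r'(?:\d+-)?\d+ (?:%s) \d{4}' % months)

def extract_dates(text):
    """Return every 'dd[-dd] Month yyyy' substring found in text."""
    return pattern.findall(text)

texts = [
    'Hello I would like to get only the date which is 12-13 December 2018 amid this text.',
    'Ciao, what I would like to do is to keep dates, e.g. 11-14 October 2019, and remove all the rest.',
    'Hi, SO can you help me delete everything but 10 January 2011. I found it hard doing it myself.',
]
for t in texts:
    print(extract_dates(t))
```

Unlike str.extract, findall returns all matches in a string, so a text with two dates yields a list of two entries.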

Related

How to convert date as 'Monday 1st' to 'Monday 1' in python?

I have tried a lot. Banning words doesn't help, removing certain characters doesn't help.
The datetime module doesn't have a directive for this. It has things like %d which will give you today's day, for example 24.
I have a date in the format of 'Tuesday 24th January' but I need it to be 'Tuesday 24 January'.
Is there a way to remove st, nd, rd, th? Or is there an even better way?
EDIT: even removing rd would remove it from Saturday. So that doesn't work either.
You can use a regex:
import re
d = 'Tuesday 24th January'
d = re.sub(r'(\d+)(st|nd|rd|th)', r'\1', d) # \1 to restore the captured day
print(d)
# Output
Tuesday 24 January
For Saturday 21st January:
d = 'Saturday 21st January'
d = re.sub(r'(\d+)(st|nd|rd|th)', r'\1', d)
print(d)
# Output
Saturday 21 January
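A variant of the same idea, sketched with a lookbehind and a word boundary instead of a capture group. The suffix is removed only when it directly follows a digit and ends the token, which is why the "rd" in Saturday is left alone. strip_ordinal is a name chosen for this example.

```python
import re

# suffix must follow a digit and be followed by a word boundary
ORDINAL_RE = re.compile(r'(?<=\d)(?:st|nd|rd|th)\b')

def strip_ordinal(text):
    """Remove English ordinal suffixes that follow a number."""
    return ORDINAL_RE.sub('', text)

for d in ['Tuesday 24th January', 'Saturday 21st January', 'Monday 1st']:
    print(strip_ordinal(d))
```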

python datefinder.find_dates does not work if only years are present e.g '2014 - 2018'

I am trying to extract dates from a string when only years are present, e.g. the following string:
'2014 - 2018'
should return the following dates:
2014/01/01
2018/01/01
I am using the Python library datefinder, and it's brilliant when another element such as a month is specified, but it fails when only years are present in a date.
I need to recognise all sorts of incomplete and complete dates:
2014
May 2014
08/2014
03/10/2018
01 March 2013
Any idea how to recognise a date in a string when only the year is present?
Thank you
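One possible fallback, sketched with the standard library only (this is not a datefinder feature): look for standalone four-digit numbers and map each to 1 January of that year. The assumptions are that any number from 1000 to 2999 is a year and that this is used only for strings where nothing more complete was found, because it ignores any month or day next to the year. find_years is a hypothetical helper name.

```python
import re
from datetime import datetime

# a 4-digit number starting with 1 or 2, not embedded in a longer digit run
YEAR_RE = re.compile(r'(?<!\d)([12]\d{3})(?!\d)')

def find_years(text):
    """Return datetime(year, 1, 1) for each standalone 4-digit year in text."""
    return [datetime(int(y), 1, 1) for y in YEAR_RE.findall(text)]

for d in find_years('2014 - 2018'):
    print(d.strftime('%Y/%m/%d'))
```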

Slice does not work after using .split() function

I'm trying to create a function that takes the "day" part of a Y-M-D date string.
For example:
Input: ["2022 November 23,2023 April 9"]
Output: 23
I have tried to do this by using the .split() function to split the string up at the comma, then slicing the last 2 indexes out to get the day. However, while I can get the last term of the new split string easily, I cannot get the 2nd-to-last term.
Ex:
y_m_d="2022 November 20,2023 April 9"
split_ymd=y_m_d.split(",")
first_value=split_ymd[0]
print(first_value[-1]) #This prints "0"
However, adding the 2nd argument to the slice command breaks it
y_m_d="2022 November 20,2023 April 9"
split_ymd=y_m_d.split(",")
first_value=split_ymd[0]
print(split_ymd[-1:-2]) #This prints "[]"
I understand that some of the terminology above might not be correct, as I am new to Python and to programming in general, and that the code above is very messy, but I just need help understanding why the slice above does not work. I am open to suggestions on improving the code itself, but I really just want to know why the slice does not work in this situation.
Thanks!
The slice doesn't work in this situation because the slice doesn't work in any situation. It's unrelated to .split() or anything else you're doing.
Consider this simpler test case:
>>> [1,2,3,4][-1]
4
>>> [1,2,3,4][-1:-2]
[]
This happens because -1 refers to index 3 and -2 refers to index 2, and the span [3,2) is backwards so it's treated as empty.
You can swap them if you actually want a range:
>>> [1,2,3,4][-2:-1]
[3]
Or you can just use -2 if you want the second-from-last element:
>>> [1,2,3,4][-2]
3
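The same rules apply to strings, so they can be tried directly on the date text from the question. A small sketch (the variable names other than y_m_d and first_value are made up here):

```python
y_m_d = "2022 November 20,2023 April 9"
first_value = y_m_d.split(",")[0]   # "2022 November 20"

last_char = first_value[-1]         # single index: the last character
last_two = first_value[-2:]         # slice from second-to-last to the end
backwards = first_value[-1:-2]      # start comes after stop, so the slice is empty
day = int(first_value.split()[-1])  # last space-separated field, works for 1-digit days too

print(last_char, last_two, repr(backwards), day)
# prints: 0 20 '' 20
```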
You can split your initial list on the comma to create a list containing multiple strings, each representing a date.
Then iterate over those dates, splitting them by spaces. The last value in the subsequent list is the day value you are looking for.
Let me know if this isn't clear.
It looks something like this:
list_of_dates = ["2022 November 23, 2023 April 9"]
# This separates all dates by splitting on the comma
dates = "".join(list_of_dates).split(",")
days = []
for d in dates:
    # This splits each date on the space
    temp = d.split(" ")
    days.append(temp[-1])
print(days)
# Output: ['23', '9']
So, I have two recommendations here:
Try learning how python slicing works (both negative and positive numbers)
For your actual solution, I see a list of dates. Comma separating it and then parsing the date into a datetime object might make things easier here.
# For example,
import datetime

date_str_list = "2022 November 20,2023 April 9"
for date_str in date_str_list.split(","):
    date = datetime.datetime.strptime(date_str, "%Y %B %d")
    day = date.day
See https://docs.python.org/3/library/datetime.html for more details on how you can control the string format of a datetime object.
The Amazing datetime Module!
First lesson: Let the built-ins do the hard work for you!
Here is an example of your function which parses a date string and returns the day. Additionally, here is an example implementation of how you can use the function.
Hope this helps!
The Function:
from datetime import datetime as dt
def extract_day(date_string, mask='%Y %B %d'):
    """Extract and return the day.

    Args:
        date_string (str): Date text as a string.
        mask (str): Format of the provided date string,
            used for parsing.
    """
    day = dt.strptime(date_string, mask).day
    return day
Implementation:
dates1 = "2022 November 20,2023 April 9"
dates2 = "20 Nov 2022,9 Apr 2023"
date_strings1 = dates1.split(',')
date_strings2 = dates2.split(',')
for date in date_strings1:
    day = extract_day(date)
    print('\nOriginal string: {}'.format(date))
    print('Extracted day: {}'.format(day))
for date in date_strings2:
    day = extract_day(date, mask='%d %b %Y')
    print('\nOriginal string: {}'.format(date))
    print('Extracted day: {}'.format(day))
Output:
Original string: 2022 November 20
Extracted day: 20
Original string: 2023 April 9
Extracted day: 9
Original string: 20 Nov 2022
Extracted day: 20
Original string: 9 Apr 2023
Extracted day: 9
How about:
ymd="2022 November 20,2023 April 9"
lchar = ymd.find(',')
fchar = lchar-2
d_int = int(ymd[fchar:lchar])
print(d_int)
If there is only one comma after the date this should give you what you want.
You can tackle this in a few different ways:
Using the len function:
date = "2013 November 20,2023 April 10"
splitted = date.split(',')
splitted[0][len(splitted[0])-2:]
Use negative indexing:
date = "2013 November 20,2023 April 10"
splitted = date.split(',')
splitted[0][-2:]

How to write a Regex to validate date format of type DAY, MONTH dd, yyyy?

I have a date string like Thursday, December 13, 2018 i.e., DAY, MONTH dd, yyyy and I need to validate it with a regular expression.
The regex should not validate incorrect day or month. For example, Muesday, December 13, 2018 and Thursday, December 32, 2018 should be marked invalid.
What I could do so far is write expressions for the ", ", "dd", and "yyyy". I don't understand how to customize the regex so that it accepts only correct day and month names.
My attempt:
^([something would come over here for day name]day)([\,]|[\, ])(something would come over here for month name)(0?[1-9]|[12][0-9]|3[01])([\,]|[\, ])([12][0-9]\d\d)$
Thanks.
EDIT: I have only included years starting from year 1000 - year 2999. Validating leap years does not matter.
You can try a library that implements the regexes for "complex" cases like yours. It is called datefinder.
Its author has already done the work of finding any kind of date in text:
https://github.com/akoumjian/datefinder
To install: pip install datefinder
import datefinder
string_with_dates = """entries are due by January 4th, 2017 at 8:00pm
created 01/15/2005 by ACME Inc. and associates."""
matches = datefinder.find_dates(string_with_dates)
for match in matches:
    print(match)
# Output
2017-01-04 20:00:00
2005-01-15 00:00:00
To detect wrong words like "Muesday", you can filter your text with a spellchecker like PyEnchant:
>>> import enchant
>>> d = enchant.Dict("en_US")
>>> print(d.check("Monday"))
True
>>> print(d.check("Muesday"))
False
>>> print(d.suggest("Muesday"))
['Tuesday', 'Domesday', 'Muesli', 'Wednesday', 'Mesdames']
Regex is not the way to go to solve your problem!
But here is some example code showing how the "something would come over here for day name" section of your pattern could be written. I also added an example of how to use strptime(), which is a much better solution in your case:
import re
from datetime import datetime
s = """
Thursday, December 13, 2018
Muesday, December 13, 2018
Monday, January 13, 2018
Thursday, December 32, 2018
"""
pat = r"""
^
(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)
,[ ]
(January|February|March|April|May|June|July|August|September|October|November|December)
[ ]
(0?[1-9]|[12][0-9]|3[01])
,[ ]
([12][0-9]\d\d)
$
"""
# With re.VERBOSE, whitespace in the pattern is ignored, so literal spaces are written as [ ]
for match in re.finditer(pat, s, re.VERBOSE | re.MULTILINE):
    print(match.group(0))
# strptime rejects unknown day/month names and out-of-range day numbers
for row in s.strip().split('\n'):
    try:
        match = datetime.strptime(row, '%A, %B %d, %Y')
        print(match)
    except ValueError:
        print("'%s' is not valid" % row)
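Note that strptime() accepts any valid weekday name, even when it does not fall on the given date (for example, 13 January 2018 was a Saturday, not a Monday). If the weekday should also agree with the date, which the question does not strictly ask for, a check along these lines could be added. This is a sketch assuming English day and month names in the active locale. is_valid_date and check_weekday are names invented for this example.

```python
from datetime import datetime

def is_valid_date(text, fmt='%A, %B %d, %Y', check_weekday=True):
    """Return True if text parses with fmt and, optionally, the weekday is consistent."""
    try:
        parsed = datetime.strptime(text, fmt)
    except ValueError:
        # unknown day/month name, day out of range, or wrong layout
        return False
    if check_weekday:
        # compare the weekday written in the text with the real one for that date
        return parsed.strftime('%A') == text.split(',')[0]
    return True

for row in ['Thursday, December 13, 2018',
            'Muesday, December 13, 2018',
            'Monday, January 13, 2018',
            'Thursday, December 32, 2018']:
    print(row, '->', is_valid_date(row))
```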

Calendar in python/django

I am working on a website with weekly votations, but only the first three weeks of the month, so in each instance I have a start_date and end_date field.
I'd like to know if there's a way to automatically create these instances based on the current date, for instance:
Today is the 6th of March, and votations end tomorrow, so a function should be run (tomorrow) that, taking into account this month's calendar, would fill in the appropriate dates for the next votations. Which calendar do you recommend, and how should I do it?
(Never mind the automatically run part, I'll go with celery).
Thanks!
I am not sure what your problem is and I don't know what votations are. But as a general direction of thinking: there is the timeboard library, which can generate rule-based schedules (calendars) and do calculations over them (DISCLAIMER: I am the author).
The code below designates, for every month of 2018, the days of the first three weeks of the month as 'on-duty' (i.e. 'active', 'usable') and the rest as 'off-duty':
>>> import timeboard as tb
>>> weekly = tb.Organizer(marker='W', structure=[[1],[1],[1],[0],[0],[0]])
>>> monthly = tb.Organizer(marker='M', structure=[weekly])
>>> clnd = tb.Timeboard(base_unit_freq='D',
... start='01 Jan 2018', end='31 Dec 2018',
... layout=monthly)
For example, in March 2018, the days from Thursday, 1st, through Sunday, 18th, are marked 'on-duty', and the days 19-31 are marked 'off-duty'.
Now you can move along the calendar picking only on-duty days. For example, adding 1 to March 17 gives you March 18:
>>> (clnd('17 Mar 2018') + 1).to_timestamp()
Timestamp('2018-03-18 00:00:00')
However, adding 2 carries you over to April 1, as March 19 is NOT within the first 3 weeks of March:
>>> (clnd('17 Mar 2018') + 2).to_timestamp()
Timestamp('2018-04-01 00:00:00')
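If only the start_date and end_date of each of the three periods are needed, a plain-datetime sketch may be enough. It does not use timeboard. It assumes, as in the example above, that weeks run Monday to Sunday and that the partial first week of the month counts as week one. first_three_weeks is a hypothetical helper name.

```python
from datetime import date, timedelta

def first_three_weeks(year, month):
    """Return (start_date, end_date) pairs for the first three Mon-Sun weeks of a month."""
    start = date(year, month, 1)
    periods = []
    for _ in range(3):
        # weekday(): Monday is 0 and Sunday is 6, so this lands on the next Sunday
        end = start + timedelta(days=6 - start.weekday())
        periods.append((start, end))
        start = end + timedelta(days=1)  # the following Monday
    return periods

for start_date, end_date in first_three_weeks(2018, 3):
    print(start_date, end_date)
```

For March 2018 this gives 1-4, 5-11 and 12-18 March, which matches the on-duty days in the timeboard example.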
