find matching string in a file using Python - python

Using Python, I would like to find strings in a file that match the format YYYY-MM-DD.
Here is what my sample file looks like:
I want to find date 2016-01-01 ,2016-01-05
then I want to find 2016-01-17
then I want to find this date 2016-01-04
Output should be
2016-01-01
2016-01-05
2016-01-17
2016-01-04
Below is the code I am currently using, but it does not find any matching records. Any help would be appreciated.
#!/usr/bin/python
import sys
import csv
import re
pattern = re.compile("^([0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9])$")
for i, line in enumerate(open('C:\\Work\\scripts\\logs\\CSI.txt')):
    for match in re.finditer(pattern, line):
        print 'Found on line %s: %s' % (i+1, match.groups())

I would remove the ^ and $ anchors (and the capturing group), because your dates are embedded in other text rather than standing alone on a line:
re.compile("[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]")
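A minimal sketch of the unanchored pattern applied line by line. The sample lines are copied from the question, the in-memory `sample` list stands in for the log file, and `find_dates` is a hypothetical helper name, not part of the original code.

```python
import re

# Unanchored pattern: matches a YYYY-MM-DD shape anywhere in a line.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")

def find_dates(lines):
    """Return (line_number, date_string) pairs for every match."""
    found = []
    for line_number, line in enumerate(lines, start=1):
        for match in DATE_PATTERN.finditer(line):
            found.append((line_number, match.group()))
    return found

# Stand-in for the file contents shown in the question.
sample = [
    "I want to find date 2016-01-01 ,2016-01-05",
    "then I want to find 2016-01-17",
    "then I want to find this date 2016-01-04",
]

for line_number, date_string in find_dates(sample):
    print("Found on line %d: %s" % (line_number, date_string))
```

To scan a real file, pass the open file object instead of `sample`, since iterating over a file yields its lines.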

You can use regex and datetime to get valid dates from a string:
import re
from datetime import datetime
string = "I want to find date 2016-01-01 ,2016-01-05"
pattern = re.compile(r"\d{4}-\d{2}-\d{2}")
raw_dates = pattern.findall(string)
parsed_dates = []
for date in raw_dates:
    try:
        # strptime raises ValueError for strings that are not real dates
        d = datetime.strptime(date, "%Y-%m-%d")
        parsed_dates.append(d)
    except ValueError:
        pass
print(parsed_dates)
Output:
[datetime.datetime(2016, 1, 1, 0, 0), datetime.datetime(2016, 1, 5, 0, 0)]
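The regex only checks the shape of the text; it is the `strptime` call that rejects impossible calendar dates. A small sketch of that filtering step, assuming a made-up input string and a hypothetical helper named `valid_dates`:

```python
import re
from datetime import datetime

DATE_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}")

def valid_dates(text):
    """Return the YYYY-MM-DD substrings of text that are real calendar dates."""
    result = []
    for candidate in DATE_SHAPE.findall(text):
        try:
            datetime.strptime(candidate, "%Y-%m-%d")
        except ValueError:
            continue  # right shape, but not a real date (e.g. month 13)
        result.append(candidate)
    return result

# Hypothetical input: the second value has the right shape but month 13.
print(valid_dates("dates: 2016-01-05, 2016-13-45, 2016-02-29"))
```

This variant returns the original strings rather than datetime objects, which is convenient when the dates only need to be printed.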

Related

Getting exception : Unknown string format:

I am trying to extract dates from text using dateutil library (Python 3.7)
I want to extract all dates from text using code below.
import datetime
import dateutil.parser as dparser
text = 'First date is 10 JANUARY 2000 and second date is 31/3/2000'
dt = dparser.parse(text, fuzzy=True, dayfirst=True, default=datetime.datetime(1800, 1, 1))
But I get the following exception:
Unknown string format: First date is 10 JANUARY 2000 and second date is 31/3/2000
Please let me know any way to extract multiple dates in the text.
How about using datefinder?
import datefinder
string_with_dates = '''
First date is 10 JANUARY 2000 and second date is 31/3/2000
'''
matches = datefinder.find_dates(string_with_dates)
for match in matches:
    print(match)
Output:
2000-01-10 00:00:00
2000-03-31 00:00:00
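If installing a third-party package is not an option, a standard-library-only sketch is possible when the date shapes are known in advance. The two shapes below are assumptions taken from the sample text, `extract_dates` is a hypothetical name, and `%B` relies on English month names in the current locale.

```python
import re
from datetime import datetime

# Each entry pairs a regex for one date shape with the strptime format
# that parses it. Both shapes are assumptions based on the sample text.
SHAPES = [
    (re.compile(r"\b\d{1,2} [A-Za-z]+ \d{4}\b"), "%d %B %Y"),
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"), "%d/%m/%Y"),
]

def extract_dates(text):
    """Return all dates found in text, in order of appearance."""
    found = []
    for shape, fmt in SHAPES:
        for match in shape.finditer(text):
            try:
                parsed = datetime.strptime(match.group(), fmt)
            except ValueError:
                continue  # looked like a date but is not one
            found.append((match.start(), parsed))
    # Sort by position so results follow the order in the text.
    return [parsed for _, parsed in sorted(found)]

text = 'First date is 10 JANUARY 2000 and second date is 31/3/2000'
for d in extract_dates(text):
    print(d)
```

Unlike a fuzzy parser, this approach only recognizes the formats listed in `SHAPES`, so each new format has to be added explicitly.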

Python regex for date and numbers, find the date format

How to extract dates alone from text file using regex in Python 3?
Below is my current code:
import datetime
from datetime import date
import re
s = "birthday on 20/12/2018 and wedding aniversry on 04/01/1997 and dob is on 09/07/1897"
match = re.search(r'\d{2}/\d{2}/\d{4}', s)
date = datetime.datetime.strptime(match.group(), '%Y-%m-%d').date()
print (date)
Expected Output is
20/12/2018
04/01/1997
09/07/1897
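Building on DirtyBit's answer: I found that a minor change makes the pattern pick up multiple date formats. Replace each forward slash in the pattern with a dot. An unescaped dot in a regex matches any single character, so the pattern accepts ".", "-", and "/" as separators.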
import re
s = "birthday on 20.12.2018 and wedding anniversary on 04-01-1997 and dob is on 09/07/1897"
pattern = r'\d{2}.\d{2}.\d{4}'
print("\n".join(re.findall(pattern,s)))
Output
20.12.2018
04-01-1997
09/07/1897
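Because an unescaped dot matches any character, that pattern would also match a run of ten digits. A stricter sketch, offered as one possible refinement rather than the original answer's method, lists the allowed separators explicitly and uses a backreference so both separators must be the same. The trailing "ref 12345678901" in the sample string is a made-up addition to show the difference.

```python
import re

# ([./-]) captures the first separator; \1 requires the second one to match it.
pattern = re.compile(r"\b\d{2}([./-])\d{2}\1\d{4}\b")

s = ("birthday on 20.12.2018 and wedding anniversary on 04-01-1997 "
     "and dob is on 09/07/1897, ref 12345678901")

# finditer is used because findall would return only the captured separator.
dates = [m.group(0) for m in pattern.finditer(s)]
print("\n".join(dates))
```

The digit run at the end is ignored here, whereas the any-character version would report part of it as a date.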
Your format string '%Y-%m-%d' does not match the input. Looking at the date you provided, birthday on 20/12/2018 (dd/mm/yyyy), it should be '%d/%m/%Y'.
Change this:
date = datetime.datetime.strptime(match.group(), '%Y-%m-%d').date()
With this:
date = datetime.datetime.strptime(match.group(), '%d/%m/%Y').date()
Your Fix:
import datetime
from datetime import date
import re
s = "birthday on 20/12/2018"
match = re.search(r'\d{2}/\d{2}/\d{4}', s)
date = datetime.datetime.strptime(match.group(), '%d/%m/%Y').date()
print (date)
But:
Why go to all that trouble when there are easier and more elegant ways?
Using dparser:
import dateutil.parser as dparser
dt_1 = "birthday on 20/12/2018"
print("Date: {}".format(dparser.parse(dt_1,fuzzy=True).date()))
OUTPUT:
Date: 2018-12-20
EDIT:
With your edited question which now has multiple dates, you could extract them using regex:
import re
s = "birthday on 20/12/2018 and wedding aniversry on 04/01/1997 and dob is on 09/07/1897"
pattern = r'\d{2}/\d{2}/\d{4}'
print("\n".join(re.findall(pattern,s)))
OUTPUT:
20/12/2018
04/01/1997
09/07/1897
OR
Using dateutil:
from dateutil.parser import parse
for token in s.split():
    try:
        # dayfirst=True because the dates are written dd/mm/yyyy
        print(parse(token, dayfirst=True))
    except ValueError:
        pass
OUTPUT:
2018-12-20 00:00:00
1997-01-04 00:00:00
1897-07-09 00:00:00
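A standard-library-only sketch that combines the two ideas: the regex finds every dd/mm/yyyy candidate and `strptime` turns each one into a date object. The format string states day-first explicitly, so 04/01/1997 is read as 4 January. The variable name `parsed` is illustrative.

```python
import re
from datetime import datetime

s = ("birthday on 20/12/2018 and wedding aniversry on 04/01/1997 "
     "and dob is on 09/07/1897")

# Find every dd/mm/yyyy candidate, then parse each with an explicit format.
parsed = [
    datetime.strptime(text, "%d/%m/%Y").date()
    for text in re.findall(r"\d{2}/\d{2}/\d{4}", s)
]

for d in parsed:
    print(d.isoformat())
```

Having real date objects makes later steps such as sorting or subtracting dates straightforward.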
You are doing everything right except for this line, which should be:
date = datetime.datetime.strptime(match.group(), '%d/%m/%Y').date()
The format string passed to datetime.strptime must match the format of your input:
'%Y-%m-%d' >> 2018-12-20
'%d/%m/%Y' >> 20/12/2018
Edit
If you do not need datetime objects, you can do this:
results = re.findall(r'\d{2}/\d{2}/\d{4}', s)
print('\n'.join(results))
Output
In [20]: results = re.findall(r'\d{2}/\d{2}/\d{4}', s)
In [21]: print('\n'.join(results))
20/12/2018
04/01/1997
09/07/1897

Python: Extract two dates from string

I have a string s which contains two dates in it and I am trying to extract these two dates in order to subtract them from each other to count the number of days in between. In the end I am aiming to get a string like this: s = "o4_24d_20170708_20170801"
At the company where I work we can't install additional packages, so I am looking for a solution using native Python. Below is what I have so far using the datetime package, which only extracts one date. How can I get both dates out of the string?
import re, datetime
s = "o4_20170708_20170801"
match = re.search('\d{4}\d{2}\d{2}', s)
date = datetime.datetime.strptime(match.group(), '%Y%m%d').date()
print date
from datetime import datetime
import re
s = "o4_20170708_20170801"
pattern = re.compile(r'(\d{8})_(\d{8})')
dates = pattern.search(s)
# dates[0] is full match, dates[1] and dates[2] are captured groups
start = datetime.strptime(dates[1], '%Y%m%d')
end = datetime.strptime(dates[2], '%Y%m%d')
difference = end - start
print(difference.days)
will print
24
Then you could do something like:
days = '{}d_'.format(difference.days)
match_index = dates.start()
new_name = s[:match_index] + days + s[match_index:]
print(new_name)
to get
o4_24d_20170708_20170801
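The steps above can be wrapped in one function. This is a sketch using only the standard library; `add_day_count` is a hypothetical name, and it builds the "24d" form requested in the question.

```python
import re
from datetime import datetime

def add_day_count(name):
    """Insert '<days>d_' before the first YYYYMMDD_YYYYMMDD pair in name."""
    match = re.search(r"(\d{8})_(\d{8})", name)
    if match is None:
        return name  # no date pair found; leave the name untouched
    start = datetime.strptime(match.group(1), "%Y%m%d")
    end = datetime.strptime(match.group(2), "%Y%m%d")
    days = (end - start).days
    return "{}{}d_{}".format(name[:match.start()], days, name[match.start():])

print(add_day_count("o4_20170708_20170801"))
```

Returning the name unchanged when no date pair is present is a design choice for this sketch; raising an error would be equally reasonable.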
import re, datetime
s = "o4_20170708_20170801"
match = re.findall(r'\d{4}\d{2}\d{2}', s)
for a_date in match:
    date = datetime.datetime.strptime(a_date, '%Y%m%d').date()
    print(date)
This will print:
2017-07-08
2017-08-01
Your regex was working correctly at regexpal

How to format date in python

I made a crawler using Python, but it returns the date in this format:
s = page_ad.findAll('script')[25].text.replace('\'', '"')
s = re.search(r'\{.+\}', s, re.DOTALL).group() # get json data
s = re.sub(r'//.+\n', '', s) # replace comment
s = re.sub(r'\s+', '', s) # strip whitespace
s = re.sub(r',}', '}', s) # get rid of last , in the dict
dataLayer = json.loads(s)
print dataLayer["page"]["adDetail"]["adDate"]
2017-01-1412:28:07
I want only the date without the time (2017-01-14). How can I get just the date when there is no whitespace between the two parts?
Use string slicing:
>>> date ="2017-01-1412:28:07"
>>> datestr= date[:-8]
>>> datestr
'2017-01-14'
>>>
As this is not a standard date format, just slice the end.
st = "2017-01-1412:28:07"
res = st[:10]
print res
>>>2017-01-14
try this code:
In [2]: from datetime import datetime
In [3]: now = datetime.now()
In [4]: now.strftime('%Y-%m-%d')
Out[4]: '2017-01-24'
Update
I suggest parsing the date into a datetime object first and then extracting the parts you need from it. A library makes this easier; I use dateparser for such tasks. Example usage:
import dateparser
date = dateparser.parse('12/12/12')
date.strftime('%Y-%m-%d')
Use datetime as follows to first convert the string into a datetime object, and then format the output as required using the strftime() function:
from datetime import datetime
ad_date = dataLayer["page"]["adDetail"]["adDate"]
print datetime.strptime(ad_date, "%Y-%m-%d%H:%M:%S").strftime("%Y-%m-%d")
This will print:
2017-01-14
This method also gives you the flexibility to display other items. For example, adding %A to the end of the output format gives you the day of the week:
print datetime.strptime(ad_date, "%Y-%m-%d%H:%M:%S").strftime("%Y-%m-%d %A")
e.g.
2017-01-14 Saturday
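A small self-contained sketch of the parse-then-format approach, using the sample value from the question in place of the crawler output. `date_only` is a hypothetical helper name.

```python
from datetime import datetime

def date_only(stamp):
    """Return the YYYY-MM-DD part of a 'YYYY-MM-DDHH:MM:SS' string."""
    # strptime raises ValueError if the string does not fit the format,
    # which plain slicing would not detect.
    return datetime.strptime(stamp, "%Y-%m-%d%H:%M:%S").date().isoformat()

print(date_only("2017-01-1412:28:07"))
```

Compared with slicing the first ten characters, parsing costs a little more but fails loudly when the crawler returns something unexpected.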

Python filtering and selecting from list

I need to write a Python function that opens a file, reads in the text, and then outputs on the Python GUI any entries that contain dates. Examples of valid dates include "1/30/10", "1/30/2010", "1-30-2010", "01-30-2010", "30.1.2010", "30. 1. 2010", and "2010-01-30". It should produce few false positives: strings such as "13010", "01302010", or "30-30-10" should not be reported as dates.
What I have so far is this
import sys
def main():
    infile = open('testdate.txt', 'r')
    for line in infile:
        words = line.split()
        for date in words:
            if ____ in date:
                print date
    infile.close()
main()
I know that the line.split() function is able to separate all entries in the text file. What I'm unsure about is how to loop through this new list and ONLY take in dates. How would I go about filtering only dates out?
Find out all possible formats and try to parse those. This may help:
>>> from datetime import datetime
>>> possible_fmts = ["%m/%d/%y","%m/%d/%Y","%m-%d-%y","%m-%d-%Y","%d.%m.%Y","%d. %m. %Y","%Y-%m-%d"]
>>> test_text = "1/30/10,1/30/2010,1-30-2010,01-30-2010,30.1.2010,30. 1. 2010,2010-01-30"
>>> for date_token in test_text.split(','):
...     for fmt in possible_fmts:
...         try:
...             print(datetime.strptime(date_token, fmt))
...             break
...         except ValueError:
...             pass
...
2010-01-30 00:00:00
2010-01-30 00:00:00
2010-01-30 00:00:00
2010-01-30 00:00:00
2010-01-30 00:00:00
2010-01-30 00:00:00
2010-01-30 00:00:00
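To apply this to free text rather than a comma-separated list, one option is to find candidates with a regex first and then validate each with the format list. This is a sketch under that assumption; `CANDIDATE`, `parse_date`, and `find_dates` are hypothetical names, and the sample sentence is made up from the examples in the question. A regex is used instead of `line.split()` because "30. 1. 2010" contains spaces.

```python
import re
from datetime import datetime

FORMATS = ["%m/%d/%y", "%m/%d/%Y", "%m-%d-%y", "%m-%d-%Y",
           "%d.%m.%Y", "%d. %m. %Y", "%Y-%m-%d"]

# Three digit groups joined by "/", "-", or "." (a dot may be followed by a space).
CANDIDATE = re.compile(
    r"(?<!\d)\d{1,4}(?:[/-]|\. ?)\d{1,2}(?:[/-]|\. ?)\d{1,4}(?!\d)")

def parse_date(token):
    """Return a date if token fits one of FORMATS, else None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(token, fmt).date()
        except ValueError:
            continue
    return None

def find_dates(text):
    """Return the substrings of text that parse as dates."""
    return [m.group() for m in CANDIDATE.finditer(text)
            if parse_date(m.group()) is not None]

sample = ("paid 1/30/10 and 1/30/2010, then 1-30-2010; 01-30-2010; "
          "30.1.2010; 30. 1. 2010; 2010-01-30. "
          "Not dates: 13010, 01302010, 30-30-10")
for found in find_dates(sample):
    print(found)
```

The digit-only strings never become candidates because they lack separators, and "30-30-10" is rejected by `strptime` since no format accepts month 30.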
