I am trying to extract dates from text using the dateutil library (Python 3.7).
I want to extract all the dates from the text using the code below.
import datetime
import dateutil.parser as dparser
text = 'First date is 10 JANUARY 2000 and second date is 31/3/2000'
dt = dparser.parse(text, fuzzy=True, dayfirst=True, default=datetime.datetime(1800, 1, 1))
But I get the following exception:
Unknown string format: First date is 10 JANUARY 2000 and second date is 31/3/2000
Is there a way to extract multiple dates from the text?
How about using datefinder?
import datefinder
string_with_dates = '''
First date is 10 JANUARY 2000 and second date is 31/3/2000
'''
matches = datefinder.find_dates(string_with_dates)
for match in matches:
    print(match)
Output:
2000-01-10 00:00:00
2000-03-31 00:00:00
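If adding a third-party dependency is not an option, a rough standard-library sketch of the same idea is to pair a regex for each expected date shape with the strptime format that parses it. The two patterns and the helper name find_dates_simple are assumptions that cover only the sample sentence above; this is not a general replacement for datefinder.

```python
import re
from datetime import datetime

# Each regex is paired with the strptime format that parses what it matches.
# Both patterns are assumptions chosen for the sample sentence only.
PATTERNS = [
    (re.compile(r"\b\d{1,2} [A-Za-z]+ \d{4}\b"), "%d %B %Y"),   # 10 JANUARY 2000
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"), "%d/%m/%Y"),     # 31/3/2000 (day first)
]

def find_dates_simple(text):
    """Return every date found in text, in order of appearance."""
    found = []
    for regex, fmt in PATTERNS:
        for match in regex.finditer(text):
            try:
                found.append((match.start(), datetime.strptime(match.group(), fmt)))
            except ValueError:
                # Looked like a date but is not valid (e.g. 45/13/2000): skip it
                pass
    return [date for _, date in sorted(found)]

text = 'First date is 10 JANUARY 2000 and second date is 31/3/2000'
for date in find_dates_simple(text):
    print(date)
# 2000-01-10 00:00:00
# 2000-03-31 00:00:00
```

Every new date layout needs its own pattern and format pair, which is the price of not relying on a fuzzy parser.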
I have a column in a Pandas dataframe that contains the year and the week number (1 up to 52) in one string in this format: '2017_03' (meaning the 3rd week of year 2017).
I want to convert the column to datetime and I am using the pd.to_datetime() function. However I get an exception:
pd.to_datetime('2017_01',format = '%Y_%W')
ValueError: Cannot use '%W' or '%U' without day and year
On the other hand the strftime documentation mentions that:
I am not sure what I am doing wrong.
You also need to specify the day of the week:
a = pd.to_datetime('2017_01_0',format = '%Y_%W_%w')
print (a)
2017-01-08 00:00:00
a = pd.to_datetime('2017_01_1',format = '%Y_%W_%w')
print (a)
2017-01-02 00:00:00
a = pd.to_datetime('2017_01_2',format = '%Y_%W_%w')
print (a)
2017-01-03 00:00:00
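The same rule holds for the standard library's strptime, which uses the same %W and %w directives. As a minimal sketch, assuming you want each week represented by its Monday, you can append a weekday to the string before parsing. The helper name week_start is hypothetical.

```python
from datetime import datetime

def week_start(year_week, fmt="%Y_%W"):
    """Parse a 'YYYY_WW' string and return the Monday of that week."""
    # %W alone is ambiguous, so append "_1" (Monday) and parse it with %w
    return datetime.strptime(year_week + "_1", fmt + "_%w")

print(week_start("2017_01"))  # 2017-01-02 00:00:00
print(week_start("2017_03"))  # 2017-01-16 00:00:00
```

For a whole dataframe column, the same trick amounts to concatenating '_1' to the column before passing it to pd.to_datetime with the '%Y_%W_%w' format used above.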
I'm doing a project that involves analyzing WhatsApp log data.
After preprocessing the log file I have a table that looks like this:
DD/MM/YY | hh:mm | name | text |
Using a chat with a friend of mine, I was able to plot the number of messages per month and the mean number of words per month, but I have some problems:
If we didn't exchange any messages in a given month, the algorithm skips that month; in the graph I want to see that month with 0 messages.
Is there a better way to work with dates and times in Python? Using them as strings isn't very intuitive, but I couldn't find anything useful online.
This is the GitLab page of my project.
def wapp_split(line):
    splitted = line.split(',')
    Data['date'].append(splitted[0])
    splitted = splitted[1].split(' - ')
    Data['time'].append(splitted[0])
    splitted = splitted[1].split(':')
    Data['name'].append(splitted[0])
    Data['msg'].append(splitted[1][0:-1])

def wapp_parsing(file):
    with open(file) as f:
        data = f.readlines()
    for line in data:
        if (line[17:].find(':')!= -1):
            if (line[0] in numbers) and (line[1]in numbers):
                prev = line[0:35]
                wapp_split(line)
            else:
                line = prev + line
                wapp_split(line)
Those are the main functions of the script. The WhatsApp log is formatted like so:
DD/MM/YY, hh:mm - Name Surname: This is a text sent using WhatsApp
The parsing function just takes the file and sends each line to the split function. The if statements in the parsing function only prevent messages generated by WhatsApp itself, rather than by the people in the chat, from being parsed.
Suppose that the table you have is a .csv file that looks like this (call it msgs.csv):
date;time;name;text
22/10/2018;11:30;Maria;Hello how are you
23/10/2018;11:30;Justin;Check this
23/10/2018;11:31;Justin;link
22/11/2018;11:30;Maria;Hello how are you
23/11/2018;11:30;Justin;Check this
23/12/2018;11:31;Justin;link
22/12/2018;11:30;Maria;Hello how are you
23/12/2018;11:30;Justin;Check this
23/01/2019;11:31;Justin;link
23/04/2019;11:30;Justin;Check this
23/07/2019;11:31;Justin;link
Now you can use pandas to import this csv in a table format that will recognise both date and time as a timestamp object and then for your calculations you can group the data by month.
import pandas as pd
from datetime import datetime

# pd.datetime is deprecated and has been removed from recent pandas releases,
# so use the standard library datetime instead
dateparse = lambda x: datetime.strptime(x, '%d/%m/%Y %H:%M')
df = pd.read_csv('msgs.csv', delimiter=';', parse_dates=[['date', 'time']], date_parser=dateparse)
per = df.date_time.dt.to_period("M")
g = df.groupby(per)
for i in g:
    print('#######')
    print('year: {year} ; month: {month} ; number of messages: {n_msgs}'
          .format(year=i[0].year, month=i[0].month, n_msgs=len(i[1])))
EDIT - no information about specific month = 0 messages:
In order to get 0 for months in which no messages were sent, you can do it like this (it also looks better than the code above):
import pandas as pd
from datetime import datetime

# pd.datetime is deprecated and has been removed from recent pandas releases
dateparse = lambda x: datetime.strptime(x, '%d/%m/%Y %H:%M')
df = pd.read_csv('msgs.csv', delimiter=';', parse_dates=[['date', 'time']], date_parser=dateparse)
# create date range from oldest message to newest message
dates = pd.date_range(*(pd.to_datetime([df.date_time.min(), df.date_time.max()]) + pd.offsets.MonthEnd()), freq='M')
for i in dates:
    df_aux = df[(df.date_time.dt.month == i.month) & (df.date_time.dt.year == i.year)]
    print('year: {year} ; month: {month} ; number of messages: {n_msgs}'
          .format(year=i.year, month=i.month, n_msgs=len(df_aux)))
EDIT 2: parse logs into a pandas dataframe:
import re
import pandas as pd

df = pd.DataFrame({'logs': ['DD/MM/YY, hh:mm - Name Surname: This is a text sent using WhatsApp',
                            'DD/MM/YY, hh:mm - Name Surname: This is a text sent using WhatsApp']})
# the named groups become the columns of the extracted dataframe
pat = re.compile(r"(?P<date>.*?), (?P<time>.*?) - (?P<name>.*?): (?P<message>.*)")
df_parsed = df.logs.str.extractall(pat)
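The same named-group idea also works line by line without pandas. Below is a minimal sketch, assuming the log format given in the question (DD/MM/YY, hh:mm - Name Surname: text). The stricter pattern and the helper name parse_log_line are assumptions, not part of the original script.

```python
import re
from datetime import datetime

# Stricter variant of the pattern above: digits only for date and time,
# and a sender name that cannot contain a colon (assumption).
LINE_RE = re.compile(
    r"(?P<date>\d{2}/\d{2}/\d{2}), (?P<time>\d{2}:\d{2}) - "
    r"(?P<name>[^:]+): (?P<message>.*)"
)

def parse_log_line(line):
    """Return (timestamp, name, message), or None for lines that are not chat messages."""
    match = LINE_RE.match(line)
    if match is None:
        return None  # system notice or continuation of a multi-line message
    stamp = datetime.strptime(
        match.group("date") + " " + match.group("time"), "%d/%m/%y %H:%M"
    )
    return stamp, match.group("name"), match.group("message")

print(parse_log_line("22/10/18, 11:30 - Maria Rossi: Hello, how are you?"))
```

Because the message group is matched last, commas and colons inside the message text are preserved, unlike with successive split() calls.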
It's best to convert the strings into datetime objects
from datetime import datetime
datetime_object = datetime.strptime('22/10/18', '%d/%m/%y')
When converting from a string, make sure the separators in the format template (e.g. "-" or "/") match those in the date string, and that the format letters match the fields of the date string. Full details on the meaning of the letters can be found at Python strptime() Method
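Once the strings are datetime objects, counting messages per month, including months with no messages, needs only the standard library. This is a rough sketch, assuming dates in the DD/MM/YYYY form of the sample table; the helper names month_range and messages_per_month are hypothetical.

```python
from collections import Counter
from datetime import datetime

def month_range(first, last):
    """Yield (year, month) pairs from first to last, inclusive."""
    year, month = first
    while (year, month) <= last:
        yield year, month
        month += 1
        if month > 12:
            year, month = year + 1, 1

def messages_per_month(date_strings, fmt="%d/%m/%Y"):
    """Count messages per (year, month), with 0 for months without messages."""
    dates = [datetime.strptime(s, fmt) for s in date_strings]
    if not dates:
        return {}
    counts = Counter((d.year, d.month) for d in dates)
    return {ym: counts.get(ym, 0) for ym in month_range(min(counts), max(counts))}

print(messages_per_month(["22/10/2018", "23/10/2018", "23/01/2019"]))
# {(2018, 10): 2, (2018, 11): 0, (2018, 12): 0, (2019, 1): 1}
```

The resulting dictionary is ordered by month, so its keys and values can be passed straight to a plotting function.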
A simple solution for adding missing dates and plotting the mean value of msg_len is to create the date range you're interested in and then reindex:
df.set_index('date', inplace=True)
df1 = df[['msg_len','year']]
df1.index = df1.index.to_period('m')
         msg_len  year
date
2016-08       11  2016
2016-08        4  2016
2016-08        3  2016
2016-08        4  2016
2016-08       15  2016
2016-10       10  2016
# look for date range between 7/2016 and 11/2016
idx = pd.date_range('7-01-2016','12-01-2016',freq='M').to_period('m')
new_df = pd.DataFrame(df1.groupby(df1.index)['msg_len'].mean()).reindex(idx, fill_value=0)
new_df.plot()
         msg_len
2016-07      0.0
2016-08      7.4
2016-09      0.0
2016-10     10.0
2016-11      0.0
You can replace mean with another aggregation (count, for example) if you want the number of messages for a given month instead.
I have a date column in a pandas.DataFrame with values in various date formats, each stored as a list object, like the following:
date
1 [May 23rd, 2011]
2 [January 1st, 2010]
...
99 [Apr. 15, 2008]
100 [07-11-2013]
...
256 [9/01/1995]
257 [04/15/2000]
258 [11/22/68]
...
360 [12/1997]
361 [08/2002]
...
463 [2014]
464 [2016]
For the sake of convenience, I want to convert them all to MM/DD/YYYY format. It doesn't seem possible to use a regex replace() to do this, since that operation cannot be applied to list objects. Also, calling strptime() on each cell would be too time-consuming.
What is an easier way to convert them all to the desired MM/DD/YYYY format? I find it very hard to do this on list objects within a dataframe.
Note: for cell values of the form [YYYY] (e.g., [2014] and [2016]), I will assume they are the first day of that year (e.g., January 1, 2014), and for cell values such as [08/2002] (or [8/2002]), I will assume they are the first day of that month (i.e., August 1, 2002).
Given your sample data, with the addition of a NaT, this works:
Code:
df.date.apply(lambda x: pd.to_datetime(x).strftime('%m/%d/%Y')[0])
Test Code:
import pandas as pd
df = pd.DataFrame([
    [['']],
    [['May 23rd, 2011']],
    [['January 1st, 2010']],
    [['Apr. 15, 2008']],
    [['07-11-2013']],
    [['9/01/1995']],
    [['04/15/2000']],
    [['11/22/68']],
    [['12/1997']],
    [['08/2002']],
    [['2014']],
    [['2016']],
], columns=['date'])
df['clean_date'] = df.date.apply(
    lambda x: pd.to_datetime(x).strftime('%m/%d/%Y')[0])
print(df)
Results:
                   date  clean_date
0                    []         NaT
1      [May 23rd, 2011]  05/23/2011
2   [January 1st, 2010]  01/01/2010
3       [Apr. 15, 2008]  04/15/2008
4          [07-11-2013]  07/11/2013
5           [9/01/1995]  09/01/1995
6          [04/15/2000]  04/15/2000
7            [11/22/68]  11/22/1968
8             [12/1997]  12/01/1997
9             [08/2002]  08/01/2002
10               [2014]  01/01/2014
11               [2016]  01/01/2016
It would be better to use the following: it converts the column to datetime, reading ambiguous dates month-first, and you can then apply strftime to get the output format you want:
df['Date_ColumnName'] = pd.to_datetime(df['Date_ColumnName'], dayfirst = False, yearfirst = False)
The code below handles the following scenarios:
Change the date format from M/D/YY to MM/DD/YY (5/2/2009 to 05/02/2009)
Change from any format to MM/DD/YY
import pandas as pd

'''
 * checking provided input file date format correct or not
 * if format is correct change date format from M/D/YY to MM/DD/YY
 * else date format is not correct in input file
   Date format change from ANY FORMAT to MM/DD/YY
'''
input_file_name = 'C:/Users/Admin/Desktop/SarenderReddy/predictions.csv'
dest_file_name = 'C:/Users/Admin/Desktop/SarenderReddy/Enrich.csv'
#input_file_name = 'C:/Users/Admin/Desktop/SarenderReddy/enrichment.csv'
read_data = pd.read_csv(input_file_name)
print(pd.to_datetime(read_data['Date'], format='%m/%d/%Y', errors='coerce').notnull().all())
if pd.to_datetime(read_data['Date'], format='%m/%d/%Y', errors='coerce').notnull().all():
    print("Provided correct input date format in input file....!")
    read_data['Date'] = pd.to_datetime(read_data['Date'],format='%m/%d/%Y')
    read_data['Date'] = read_data['Date'].dt.strftime('%m/%d/%Y')
    read_data.to_csv(dest_file_name,index=False)
    print(read_data['Date'])
else:
    print("NOT... Provided correct input date format in input file....!")
    data_format = pd.read_csv(input_file_name,parse_dates=['Date'], dayfirst=True)
    #print(df['Date'])
    data_format['Date'] = pd.to_datetime(data_format['Date'],format='%m/%d/%Y')
    data_format['Date'] = data_format['Date'].dt.strftime('%m/%d/%Y')
    data_format.to_csv(dest_file_name,index=False)
    print(data_format['Date'])
I need to find the phone bill due date in an SMS using Python 3.4. I have tried dateutil.parser and datefinder, but neither works for my use case.
Example: sms_text = "Your phone bill for Jun'17 of Rs.72.23 due on 15-07-2017 has been sent to your regd email ID abc#xyz.com. Pls check Inbox"
Code 1:
import datefinder
due_dates = datefinder.find_dates(sms_text)
for match in due_dates:
    print(match)
Result: 2017-07-17 00:00:00
Code 2:
import dateutil.parser as dparser
due_date = dparser.parse(sms_text,fuzzy=True)
print(due_date)
Result: ValueError probably because of multiple dates in the text
How can I pick the due date from such texts? The date format is not fixed, but there will be 2 dates in the text: the first is the month for which the bill is generated and the second is the due date, in that order. Even a regular expression that parses the text would be great.
More sample texts:
Hello! Your phone billed outstanding is 293.72 due date is 03rd Jul.
Bill dated 06-JUN-17 for Rs 219 is due today for your phone No. 1234567890
Bill dated 06-JUN-17 for Rs 219 is due on Jul 5 for your phone No. 1234567890
Bill dated 27-Jun-17 for your operator fixedline/broadband ID 1234567890 has been sent at abc#xyz.com from xyz#abc.com. Due amount: Rs 3,764.53, due date: 16-Jul-17.
Details of bill dated 21-JUN-2017 for phone no. 1234567890: Total Due: Rs 374.12, Due Date: 09-JUL-2017, Bill Delivery Date: 25-Jun-2017,
Greetings! Bill for your mobile 1234567890, dtd 18-Jun-17, payment due date 06-Jul-17 has been sent on abc#xyz.com
Dear customer, your phone bill of Rs.191.24 was due on 25-Jun-2017
Hi! Your phone bill for Rs. 560.41 is due on 03-07-2017.
An idea for using dateutil.parser:
from dateutil.parser import parse
for s in sms_text.split():
    try:
        print(parse(s))
    except ValueError:
        pass
Two things prevent datefinder from parsing your samples correctly:
the bill amount: numbers are interpreted as years, so if they have 3 or 4 digits, datefinder builds a date from them
characters that datefinder treats as delimiters might prevent it from finding a suitable date format (in this case ':')
The idea is to first sanitize the text by removing the parts that prevent datefinder from identifying all the dates. Unfortunately, this takes a bit of trial and error, as the regex used by this package is too big for me to analyze thoroughly.
import re
import datefinder

def extract_duedate(text):
    # Sanitize the text for datefinder by replacing the tricky parts
    # with a non delimiter character
    text = re.sub(r':|Rs[\d,\. ]+', '|', text, flags=re.IGNORECASE)
    return list(datefinder.find_dates(text))[-1]
Rs[\d,\. ]+ will remove the bill amount so it is not mistaken as part of a date. It will match strings of the form 'Rs[.][ ][12,]345[.67]' (actually more variations but this is just to illustrate).
Obviously, this is a raw example function.
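To see what the substitution described above does on its own, here is a small standard-library sketch that applies the same pattern to two of the sample texts. The helper name sanitize is hypothetical; the pattern is the one used in extract_duedate.

```python
import re

# Same pattern as in extract_duedate above, written as a raw string.
SANITIZE_RE = re.compile(r":|Rs[\d,\. ]+", flags=re.IGNORECASE)

def sanitize(text):
    """Replace ':' and 'Rs <amount>' runs with '|' so they cannot be read as date parts."""
    return SANITIZE_RE.sub("|", text)

print(sanitize("Dear customer, your phone bill of Rs.191.24 was due on 25-Jun-2017"))
# Dear customer, your phone bill of |was due on 25-Jun-2017
print(sanitize("Due amount: Rs 3,764.53, due date: 16-Jul-17."))
# Due amount| |due date| 16-Jul-17.
```

Note that with re.IGNORECASE the pattern also matches any word ending in "rs" that is followed by a space, digit, comma or period (for example "numbers "), so other message templates may need a tighter expression.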
Here are the results I get:
1 : 2017-07-03 00:00:00
2 : 2017-06-06 00:00:00 # Wrong result: first date instead of today
3 : 2017-07-05 00:00:00
4 : 2017-07-16 00:00:00
5 : 2017-06-25 00:00:00
6 : 2017-07-06 00:00:00
7 : 2017-06-25 00:00:00
8 : 2017-03-07 00:00:00
There is one problem with sample 2: 'today' on its own is not recognized by datefinder
Example:
>>> list(datefinder.find_dates('Rs 219 is due today'))
[datetime.datetime(219, 7, 13, 0, 0)]
>>> list(datefinder.find_dates('is due today'))
[]
So, to handle this case, we could simply replace the token 'today' by the current date as a first step. This would give the following function:
import datetime
import re
import datefinder

def extract_duedate(text):
    if 'today' in text:
        text = text.replace('today', datetime.date.today().isoformat())
    # Sanitize the text for datefinder by replacing the tricky parts
    # with a non delimiter character
    text = re.sub(r':|Rs[\d,\. ]+', '|', text, flags=re.IGNORECASE)
    return list(datefinder.find_dates(text))[-1]
Now the results are good for all samples:
1 : 2017-07-03 00:00:00
2 : 2017-07-18 00:00:00 # Well, this is the date of my test
3 : 2017-07-05 00:00:00
4 : 2017-07-16 00:00:00
5 : 2017-06-25 00:00:00
6 : 2017-07-06 00:00:00
7 : 2017-06-25 00:00:00
8 : 2017-03-07 00:00:00
If you need, you can let the function return all dates and they should all be correct.
Why not just use a regex? If your input strings always contain the substrings "due on ... has been", you can do something like this:
import re
from datetime import datetime

string = """Your phone bill for Jun'17 of Rs.72.23 due on 15-07-2017 has been
sent to your regd email ID abc#xyz.com. Pls check Inbox"""

match_obj = re.search(r'due on (.*) has been', string)
if match_obj:
    date_str = match_obj.group(1)
    try:
        # DD-MM-YYYY
        print(datetime.strptime(date_str, "%d-%m-%Y"))
    except ValueError:
        # try another format
        try:
            print(datetime.strptime(date_str, "%Y-%m-%d"))
        except ValueError:
            try:
                print(datetime.strptime(date_str, "%m-%d"))
            except ValueError:
                ...
else:
    # only try to parse when the regex matched, otherwise date_str is undefined
    print("No match!!")
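The nested try/except blocks can be flattened into a loop over candidate formats. The following is a minimal sketch of that idea; the list of formats and the helper names parse_with_formats and due_date_between are assumptions, so extend the tuple to match the messages you actually receive.

```python
import re
from datetime import datetime

# Candidate formats are an assumption; they are tried in order.
CANDIDATE_FORMATS = ("%d-%m-%Y", "%d-%b-%Y", "%d-%b-%y", "%Y-%m-%d")

def parse_with_formats(date_str, formats=CANDIDATE_FORMATS):
    """Return the first successful strptime result, or None if no format fits."""
    for fmt in formats:
        try:
            return datetime.strptime(date_str.strip(), fmt)
        except ValueError:
            continue
    return None

def due_date_between(text, start="due on", end="has been"):
    """Parse whatever sits between the start and end markers as a date."""
    pattern = re.escape(start) + r"\s+(.*?)\s+" + re.escape(end)
    match = re.search(pattern, text, flags=re.DOTALL)
    if match is None:
        return None
    return parse_with_formats(match.group(1))

sms_text = ("Your phone bill for Jun'17 of Rs.72.23 due on 15-07-2017 has been "
            "sent to your regd email ID abc#xyz.com. Pls check Inbox")
print(due_date_between(sms_text))  # 2017-07-15 00:00:00
```

Returning None instead of raising lets the caller decide what to do with messages that do not follow the expected template.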
Having a text message as the example you have provided:
sms_text = "Your phone bill for Jun'17 of Rs.72.23 due on 15-07-2017 has been sent to your regd email ID abc#xyz.com. Pls check Inbox"
You could use Python's built-in regex module to split the string on the 'due on' and 'has been' parts.
import re
sms_text = "Your phone bill for Jun'17 of Rs.72.23 due on 15-07-2017 has been sent to your regd email ID abc#xyz.com. Pls check Inbox"
# strip() removes the spaces left around the date by the two splits
due_date = re.split('due on', re.split('has been', sms_text)[0])[1].strip()
print(due_date)
Result: 15-07-2017
With this approach the date format does not matter, but it is important that the words you are splitting the string on are consistent.
I'm just starting out with Python, pandas and matplotlib. I'm working with data that has over a million entries, and I'm trying to change the date format. In the CSV file the date format is 23-JUN-11. Later I would like to use the dates to plot the amount of donations for each candidate. How do I convert the dates to a format that pandas can read?
Here is the link to a cut-down file with 149 entries
My code:
%matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
First candidate
reader_bachmann = pd.read_csv('P00000001-ALL.csv' ,converters={'cand_id': lambda x: str(x)[1:]},parse_dates=True, squeeze=True, low_memory=False, nrows=411 )
date_frame = pd.DataFrame(reader_bachmann, columns = ['contb_receipt_dt'])
Data slice
s = date_frame.iloc[:,0]
date_slice = pd.Series([s])
date_strip = date_slice.str.replace('JUN','6')
Trying to convert to new date format
date = pd.to_datetime(s, format='%d%b%Y')
print(date_slice)
Here is the error message
ValueError: could not convert string to float: '05-JUL-11'
You need to use a different date format string:
format='%d-%b-%y'
Why?
The error message gives a clue as to what is wrong:
ValueError: could not convert string to float: '05-JUL-11'
The format string controls the conversion, and is currently:
format='%d%b%Y'
And the fields needed are:
%y - year without a century (range 00 to 99)
%b - abbreviated month name
%d - day of the month (01 to 31)
What is missing are the - characters that separate the fields in your data string, and a lowercase y for a two-digit year instead of the current Y for a four-digit year.
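The difference between the two format strings can be checked with the standard library's strptime, which uses the same directives. A minimal sketch, using the value from the error message:

```python
from datetime import datetime

raw = "05-JUL-11"

# Separators and a two-digit year (%y) match the data, so this parses
parsed = datetime.strptime(raw, "%d-%b-%y")
print(parsed)  # 2011-07-05 00:00:00

# The original format has no separators and expects a four-digit year (%Y)
try:
    datetime.strptime(raw, "%d%b%Y")
    original_format_ok = True
except ValueError as exc:
    original_format_ok = False
    print("rejected:", exc)
```

The month abbreviation is matched case-insensitively, so 'JUL' and 'Jul' both work with %b.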
As an alternative, you can use dateutil.parser to parse the date strings directly. I have created a sample dataframe for the demo.
import pandas as pd

l = []
for i in range(100):
    l.append('23-JUN-11')
B = pd.DataFrame({'Date':l})
Now let's import dateutil.parser and apply it to our date column
import dateutil.parser
B['Date2'] = B['Date'].apply(lambda x : dateutil.parser.parse(x))
B.head()
Out[106]:
Date Date2
0 23-JUN-11 2011-06-23
1 23-JUN-11 2011-06-23
2 23-JUN-11 2011-06-23
3 23-JUN-11 2011-06-23
4 23-JUN-11 2011-06-23