I have a string like this:
Sale: \t\t\t5 Jan \u2013 10 Jan
I want to extract the start and the end of the sale. A straightforward approach would be to make several splits, but I want to do it using regular expressions.
As a result, I want to get:
start = "5 Jan"
end = "10 Jan"
Is it possible to do that using regex?
This should help.
import re
s = "Sale: \t\t\t5 Jan \u2013 10 Jan"
f = re.findall(r"\d+ \w{3}", s)
print(f)
Output:
['5 Jan', '10 Jan']
This may not be the most optimised approach, but it works, assuming the string pattern remains the same.
import re
s = 'Sale: \t\t\t5 Jan \u2013 10 Jan'
start, end = [part.strip() for part in re.search(r'Sale:(.*)', s).group(1).split('\u2013')]
# start <- 5 Jan
# end <- 10 Jan
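A single pattern with two named groups can also capture both dates in one pass. This is only a sketch: it assumes the day is one or two digits, the month is a three-letter abbreviation, and the separator is a hyphen or an en dash. The group names start and end are chosen here for illustration.

```python
import re

s = "Sale: \t\t\t5 Jan \u2013 10 Jan"

# Two named groups separated by a hyphen or an en dash (U+2013),
# with optional whitespace on either side of the separator.
pattern = re.compile(
    r"(?P<start>\d{1,2} [A-Za-z]{3})\s*[-\u2013]\s*(?P<end>\d{1,2} [A-Za-z]{3})"
)

match = pattern.search(s)
if match:
    start, end = match.group("start"), match.group("end")
    print(start)
    print(end)
```

Checking that match is not None before reading the groups avoids an AttributeError when the string does not follow the expected pattern.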
Related
I have scraped some data, and some of the opening hours are in 12-hour format. The string looks like this: Mon - Fri:,10:00 am - 7:00 pm. So I need to extract the times 10:00 am and 7:00 pm and then convert them to 24-hour format. The final string I want to produce is:
Mon - Fri:,10:00 - 19:00
Any help would be appreciated in this regard. I have tried the following:
import re
txt = 'Mon - Fri:,10:00 am - 7:00 pm'
data = re.findall(r'\s(\d{2}\:\d{2}\s?(?:AM|PM|am|pm))', txt)
print(data)
But this regex and any other that I tried to use didn't do the task.
Your regex enforces a whitespace before the leading digit, which prevents ,10:00 am from matching, and requires two digits before the colon, which fails to match 7:00 pm. r"(?i)(\d?\d:\d\d (?:a|p)m)" seems like the most precise option.
After that, parse the match using datetime.strptime and convert it to 24-hour (military) time using the "%H:%M" format string. Invalid times like 10:67 will raise a ValueError (if you anticipate strings that should be ignored instead, adjust the regex to strictly match valid 12-hour times).
import re
from datetime import datetime
def to_military_time(x):
    return datetime.strptime(x.group(), "%I:%M %p").strftime("%H:%M")
txt = "Mon - Fri:,10:00 am - 7:00 pm"
data = re.sub(r"(?i)(\d?\d:\d\d (?:a|p)m)", to_military_time, txt)
print(data) # => Mon - Fri:,10:00 - 19:00
Your regex looks only for two digit hours (\d{2}) with white space before them (\s). The following captures also one digit hours, with a possible comma instead of the space.
data = re.findall(r'[\s,](\d{1,2}\:\d{2}\s?(?:AM|PM|am|pm))', txt)
However, you might want to consider all punctuation as valid:
data = re.findall(r'[\s!"#$%&\'\(\)*+,-./:;\<=\>?@\[\\\]^_`\{|\}~](\d{1,2}\:\d{2}\s?(?:AM|PM|am|pm))', txt)
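Instead of listing every character that may precede the time, a negative lookbehind can simply require that the match is not preceded by another digit. The following is a minimal sketch of that idea, assuming the times always look like h:mm or hh:mm followed by am or pm; the variable name times is chosen here for illustration.

```python
import re

txt = "Mon - Fri:,10:00 am - 7:00 pm"

# (?<!\d) asserts that the character before the match is not a digit,
# so a space, a comma, or the start of the string are all accepted.
times = re.findall(r"(?<!\d)\d{1,2}:\d{2}\s?(?:am|pm)", txt, flags=re.IGNORECASE)
print(times)
```

With re.IGNORECASE there is no need to spell out both AM|PM and am|pm in the pattern.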
The regex needs to change as follows.
import re
text = 'Mon - Fri:,10:00 am - 7:00 pm'
result = re.match(r'\D* - \D*:,([\d\s\w:]+) - ([\d\s\w:]+)', text)
print(result.group(1))
# it will print 10:00 am
print(result.group(2))
# it will print 7:00 pm
You need a quantifier such as '+' or '*' to tell the regex to match multiple characters; if you only use \s, it will match exactly one character.
You can learn more about regex here.
https://regexr.com/
And here you can try regex online.
https://regex101.com/
Why not use the time module?
import time
data = "Mon - Fri:,10:00 am - 7:00 pm"
parts = data.split(",")
days = parts[0]
hours = parts[1]
parts = hours.split("-")
t1 = time.strptime(parts[0].strip(), "%I:%M %p")
t2 = time.strptime(parts[1].strip(), "%I:%M %p")
result = days + "," + time.strftime("%H:%M", t1) + " - " + time.strftime("%H:%M", t2)
Output:
Mon - Fri:,10:00 - 19:00
I've been trying to extract the strings that follow the "dot" character in a text file, but only for lines that follow the pattern below, that is, where the dot comes after the date and time:
09 May 2018 10:37AM • 6PR, Perth (Mornings)
The problem is that the date and time change on each of those lines, so the only common pattern is that AM or PM appears right before the "dot".
However, if I search for "AM" or "PM", the lines are not recognized because the "AM" and "PM" are attached to the time.
This is my current code:
for i,s in enumerate(open(file)):
    for words in ['PM','AM']:
        if re.findall(r'\b' + words + r'\b', s):
            source=s.split('•')[0]
Any idea how to get around this problem? Thank you.
I guess your regex is the problem here.
for i, s in enumerate(open(file)):
    if re.findall(r'\d{2}[AP]M', s):
        source = s.split('•')[0]
        # 09 May 2018 10:37AM
If you are trying to extract the datetime try using regex.
Ex:
import re
s = "09 May 2018 10:37AM • 6PR, Perth (Mornings)"
m = re.search(r"(?P<datetime>\d{2}\s+(January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{4}\s+\d{2}\:\d{2}(AM|PM))", s)
if m:
    print(m.group("datetime"))
Output:
09 May 2018 10:37AM
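If both the timestamp and the text after the bullet are needed, one pattern can capture them together, and datetime.strptime can then turn the timestamp into a datetime object. This is a sketch under a few assumptions: the bullet is U+2022, the month is a full English month name, and the default (English) locale is in use. The group names stamp and source are chosen here for illustration.

```python
import re
from datetime import datetime

line = "09 May 2018 10:37AM \u2022 6PR, Perth (Mornings)"

# Left of the bullet: day, month name, year, and a time with AM/PM attached.
# Right of the bullet: everything else on the line.
m = re.match(
    r"(?P<stamp>\d{2} \w+ \d{4} \d{1,2}:\d{2}[AP]M)\s*\u2022\s*(?P<source>.+)",
    line,
)
if m:
    when = datetime.strptime(m.group("stamp"), "%d %B %Y %I:%M%p")
    source = m.group("source").strip()
    print(when.isoformat())
    print(source)
```

Lines that do not start with a timestamp in this shape simply produce no match and can be skipped.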
I'm new to Python and am having difficulty removing words from a string:
9 - Saturday, 19 May 2012
Above is my string. I would like to strip it down to
19 May 2012
so I can easily convert it to an SQL date.
Here is the code that I tried:
new_s = re.sub(',', '', '9 - Saturday, 19 May 2012')
But it only removes the "," in the string. Any help?
You can use string.split(',')
and you will get
['9 - Saturday', '19 May 2012']
You are missing the .* (matching any number of chars) before the , (and a space after it, which you probably also want to remove):
>>> new_s = re.sub('.*, ', '', '9 - Saturday, 19 May 2012')
>>> new_s
'19 May 2012'
Your regex matches only a single comma, hence that is the only thing it removes.
You may use a negated character class, i.e. [^,]*, to match everything up to a comma, and then match the comma and trailing whitespace to remove it, like this:
>>> print(re.sub('[^,]*, *', '', '9 - Saturday, 19 May 2012'))
19 May 2012
Regex is great, but for this you could also use .split()
test_string = "9 - Saturday, 19 May 2012"
splt_string = test_string.split(",")
out_String = splt_string[1]
print(out_String)
Outputs:
19 May 2012
If the leading ' ' is a problem, you can remedy this with out_String.lstrip()
Try this:
a = "9 - Saturday, 19 May 2012"
f = a.find("19 May 2012")
b = a[f:]
print(b)
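Since the stated goal is an SQL date, the extracted text can be parsed and reformatted with datetime. This is a minimal sketch, assuming the date always follows the first comma, uses a full English month name, and that the target format is YYYY-MM-DD. The names raw, date_text, and sql_date are chosen here for illustration.

```python
from datetime import datetime

raw = "9 - Saturday, 19 May 2012"

# Keep only the part after the first comma and drop surrounding whitespace.
date_text = raw.split(",", 1)[1].strip()

# Parse "19 May 2012" and render it in ISO (SQL-style) date format.
sql_date = datetime.strptime(date_text, "%d %B %Y").strftime("%Y-%m-%d")
print(sql_date)
```

strptime raises a ValueError if the text after the comma is not a valid date, which makes malformed rows easy to detect.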
I have a string like
"Originally Posted on 09 May, 2016. By query 3 j...."
How can I extract the date using Python?
I tried this code:
dStr = "Originally Posted on 09 May, 2016. By query 3 j...."
date_st = re.findall("(\d+\ \w+,)", dStr)
Printing date_st, I got:
['09 May,']
What should I do for the year?
You forgot to match the year after the ','. You only need to add ' \d+\.' to the pattern.
re.findall(r"(\d+\ \w+, \d+\.)", dStr)
You will get this:
['09 May, 2016.']
You were almost there. Just add 4 digits for the year after the ,. It is better to use [a-z]+ instead of \w+ to match the month names, as \w also matches _ and 0-9 (along with letters).
re.findall(r'\d+\s[a-z]+,\s\d{4}', dStr, re.I)
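Once the text is matched, it can be turned into a real date object with datetime.strptime. The sketch below assumes the date always has the form "DD Month, YYYY" with a full English month name; the variable name posted is chosen here for illustration.

```python
import re
from datetime import datetime

dStr = "Originally Posted on 09 May, 2016. By query 3 j...."

# Day, month name, comma, four-digit year; the trailing period is not captured.
m = re.search(r"\d{1,2}\s[A-Za-z]+,\s\d{4}", dStr)
if m:
    posted = datetime.strptime(m.group(), "%d %B, %Y").date()
    print(posted)
```

Using re.search instead of re.findall returns only the first date, which is enough when the string contains a single posting date.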
I would like to extract the dates from the following text:
Some text some more text October 12 - 2010
The result would be:
yyyy-mm-dd: 2010-10-12
How can I tell the regex that the month is a word such as "january", "february", etc., followed by a single space, [a group of 1-2 characters], a space, and finally [a group of four digits \d{4}]?
Writing out the actual names of the months in the regex makes for a very readable and maintainable expression, which I feel is important when it comes to regexes. Like so:
(january|february|march|april|may|june|july|august|september|october|november|december)\s\d{1,2}\s\d{4}
Using a variant of the above regular expression that also allows the " - " separator, and the calendar library to look up month names, you can proceed as follows.
import calendar
import re
# Map capitalized month names to zero-padded month numbers, e.g. 'October' -> '10'
month_num = {name: '%02d' % num for num, name in enumerate(calendar.month_name) if name}
apattern = r'(january|february|march|april|may|june|july|august|september|october|november|december)\s(\d{1,2})\s-\s(\d{4})'
def to_iso(m):
    month, day, year = m.groups()
    return 'yyyy-mm-dd: %s-%s-%02d' % (year, month_num[month.capitalize()], int(day))
print(re.sub(apattern, to_iso, 'october 12 - 2010'))  # yyyy-mm-dd: 2010-10-12
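An alternative is to let datetime.strptime do the month-name lookup, which avoids building a month table by hand. This is a sketch, assuming the date always appears as "Month D - YYYY" with a full English month name and that the default (English) locale is in use; the names text and iso_date are chosen here for illustration.

```python
import re
from datetime import datetime

text = "Some text some more text October 12 - 2010"

# A word, a 1-2 digit day, the " - " separator, and a four-digit year.
m = re.search(r"[A-Za-z]+ \d{1,2} - \d{4}", text)
if m:
    # %B matches the full month name case-insensitively.
    iso_date = datetime.strptime(m.group(), "%B %d - %Y").strftime("%Y-%m-%d")
    print("yyyy-mm-dd:", iso_date)
```

If the matched word is not a real month name, strptime raises a ValueError, so the loose [A-Za-z]+ in the pattern does not let invalid dates through silently.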