I have this string:
Sun 10:00am - 10:00pm<br>Mon 10:00am - 10:00pm<br>Tue 10:00am - 10:00pm<br>Wed 10:00am - 10:00pm<br>Thu 10:00am - 10:00pm<br>Fri 10:00am - 10:00pm<br>Sat 10:00am - 10:00pm
I want to extract only the first two times that appear (10:00am and 10:00pm).
I have tried slicing and splitting, but without success.
Regex:
(?<=\s)\d{2}:\d{2}[ap]m
matches every HH:MMam/pm time in the string; with re.findall, take the first two with list slicing [:2].
Without Regex:
Split on the <br> tag and take the first piece, split that on whitespace, then take the second and last elements:
out = str_.split('<br>')[0].split()
[out[1], out[-1]]
Example:
In [56]: str_ = 'Sun 10:00am - 10:00pm<br>Mon 10:00am - 10:00pm<br>Tue 10:00am - 10:00pm<br>Wed 10:00am - 10:00pm<br>Thu 10:00am - 10:00pm<br>Fri 10:00am - 10:00pm<br>Sat 10:00am - 10:00pm'
In [57]: re.findall(r'(?<=\s)\d{2}:\d{2}[ap]m', str_)[:2]
Out[57]: ['10:00am', '10:00pm']
In [58]: out = str_.split('<br>')[0].split()
In [59]: [out[1], out[-1]]
Out[59]: ['10:00am', '10:00pm']
I thought this regex would do:
import re
s= 'Sun 10:00am - 10:00pm<br>Mon 10:00am - 10:00pm<br>Tue 10:00am - 10:00pm<br>Wed 10:00am - 10:00pm<br>Thu 10:00am - 10:00pm<br>Fri 10:00am - 10:00pm<br>Sat 10:00am - 10:00pm'
pattern = r'\d{2}:\d{2}[AaPp][Mm]'
timestamps = re.findall(pattern, s)[:2]
print(timestamps)
No need for regex:
s = "Sun 10:00am - 10:00pm<br>Mon 10:00am - 10:00pm<br>Tue 10:00am - 10:00pm<br>Wed 10:00am - 10:00pm<br>Thu 10:00am - 10:00pm<br>Fri 10:00am - 10:00pm<br>Sat 10:00am - 10:00pm"
spl = s.split("<br>")  # split at <br> into the days
d = {}  # empty dict
for s in spl:  # for each day
    d.setdefault(s.split(" ")[0], []).extend(
        [x for x in s.split(" ") if x != '-'][1:])
print(d)
Output:
{'Wed': ['10:00am', '10:00pm'],
'Sun': ['10:00am', '10:00pm'],
'Fri': ['10:00am', '10:00pm'],
'Tue': ['10:00am', '10:00pm'],
'Mon': ['10:00am', '10:00pm'],
'Thu': ['10:00am', '10:00pm'],
'Sat': ['10:00am', '10:00pm']}
It splits your data into days (at <br>), then splits each day into its weekday (used as the key) and a list of the two times, dropping the weekday itself, which is already the key, and the - between the times.
You get the list of times for Tuesday with tueTime = d['Tue'] and can access the entries with [0] or [1], or unpack them: open, close = tueTime.
If you only need the first day, use spl = s.split("<br>")[0] instead. On Python versions before 3.7 a dict does not preserve insertion order (as the output above shows), so you cannot tell from the dict which day came first in your data string.
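If the order of the days matters, a list of tuples sidesteps the question of dict ordering entirely. This is only a minimal sketch, assuming every day has the form "Day open - close"; the name parse_hours is made up for illustration and the sample string is a shortened copy of the one in the question:

```python
def parse_hours(s):
    """Return [(day, opens, closes), ...] in the order the days appear in s."""
    rows = []
    for part in s.split("<br>"):
        # Each part looks like "Sun 10:00am - 10:00pm": four whitespace-separated tokens.
        day, opens, _dash, closes = part.split()
        rows.append((day, opens, closes))
    return rows

s = "Sun 10:00am - 10:00pm<br>Mon 10:00am - 10:00pm"
hours = parse_hours(s)
first_open, first_close = hours[0][1:]
print(first_open, first_close)  # 10:00am 10:00pm
```

A part that does not split into exactly four tokens raises a ValueError, which may or may not be what you want for malformed input.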
I'm building a bot that should receive Telegram messages and forward them to another group after applying a regex to the received message.
I tested the regex in PHP and on several online testers, so in theory it should work.
regex = r"⏰ Timeframe M5((\s+(.*)\s+){1,})⏰ Timeframe M15"
text_example = '''⏰ Timeframe M5
04:05 - EURUSD - PUT
05:15 - EURJPY - PUT
06:35 - EURGBP - PUT
07:10 - EURUSD - PUT
08:15 - EURJPY - PUT
⏰ Timeframe M15
06:35 - EURGBP - PUT
07:10 - EURUSD - PUT
08:15 - EURJPY - PUT '''
reg = re.findall(regex, example_text)
print(reg)
It returns:
[ ]
I've run out of ideas.
I've used regex in other situations without problems; in this one I don't know why it doesn't work.
The pattern does not work for the example data because the group (\s+(.*)\s+) requires one or more whitespace characters both before AND after .* on every repetition. That fails when consecutive lines are separated by a single newline only, without extra whitespace such as the trailing spaces you appear to have in this regex example: https://regex101.com/r/gvulvK/3
Note also that the variable is named text_example, but the call to re.findall uses example_text.
Instead, you could match the start and end markers and, in between, match every line that does not start with ⏰:
⏰ Timeframe M5\s*((?:\n(?!⏰).*)+)\n⏰ Timeframe M15\b
See this regex101 demo and this regex101 demo
import re
regex = r"⏰ Timeframe M5\s*((?:\n(?!⏰).*)+)\n⏰ Timeframe M15\b"
text_example = '''⏰ Timeframe M5
04:05 - EURUSD - PUT
05:15 - EURJPY - PUT
06:35 - EURGBP - PUT
07:10 - EURUSD - PUT
08:15 - EURJPY - PUT
⏰ Timeframe M15
06:35 - EURGBP - PUT
07:10 - EURUSD - PUT
08:15 - EURJPY - PUT '''
reg = re.findall(regex, text_example)
print(reg)
Output
['\n04:05 - EURUSD - PUT\n05:15 - EURJPY - PUT\n06:35 - EURGBP - PUT\n07:10 - EURUSD - PUT\n08:15 - EURJPY - PUT\n']
If you slightly change your regex and also run it in dot all mode, it should work:
regex = r'⏰ Timeframe M5.*?⏰ Timeframe M15'
matches = re.findall(regex, text_example, flags=re.S)
print(matches)
This matches the following text (print shows it as a one-element list):
⏰ Timeframe M5
04:05 - EURUSD - PUT
05:15 - EURJPY - PUT
06:35 - EURGBP - PUT
07:10 - EURUSD - PUT
08:15 - EURJPY - PUT
⏰ Timeframe M15
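If only the rows between the two headers are wanted, a capture group can be added to the dot-all version. The following is a sketch rather than a definitive pattern: it assumes the headers appear exactly as in the question, uses a shortened copy of the sample text, and the name rows is just illustrative.

```python
import re

text_example = """⏰ Timeframe M5
04:05 - EURUSD - PUT
05:15 - EURJPY - PUT
⏰ Timeframe M15
06:35 - EURGBP - PUT"""

# re.S lets "." match newlines; the lazy group stops at the first M15 header,
# and the surrounding \s* keeps leading/trailing newlines out of the group.
match = re.search(r"⏰ Timeframe M5\s*(.*?)\s*⏰ Timeframe M15", text_example, flags=re.S)
rows = match.group(1).splitlines() if match else []
print(rows)  # ['04:05 - EURUSD - PUT', '05:15 - EURJPY - PUT']
```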
I recently acquired data for my local gym and I'm attempting to normalize the data so that a "gym signup" object can be created, which contains all the people that signed up for that session.
The text file looks like this:
https://pastebin.com/YcnSJiA7
Sep 30th '20 at 9:00AM Until Sep 30th '20 at 10:00AM
JD John Doe
AW Alice Wonderland
IM Iron Man
Sep 30th '20 at 8:00AM Until Sep 30th '20 at 9:00AM
JD John Doe
AW Alice Wonderland
IM Iron Man
I've been able to use pandas to separate the sign-ups into columns [initials of name, name], but I have no idea how to detect when a line is a time slot rather than a person signing up.
After the program runs, every line should consist of the columns [initials of name, name, timeslot].
The easiest format for me to work with would be:
JD John Doe Sep 30th '20 at 9:00AM Until Sep 30th '20 at 10:00AM
AW Alice Wonderland Sep 30th '20 at 9:00AM Until Sep 30th '20 at 10:00AM
IM Iron Man Sep 30th '20 at 9:00AM Until Sep 30th '20 at 10:00AM
JD John Doe Sep 30th '20 at 8:00AM Until Sep 30th '20 at 9:00AM
AW Alice Wonderland Sep 30th '20 at 8:00AM Until Sep 30th '20 at 9:00AM
IM Iron Man Sep 30th '20 at 8:00AM Until Sep 30th '20 at 9:00AM
I attempted to iterate through every line and once a time slot line comes up, then I append that line to the next ones, until a new time slot appears.
def testSort():
    with open("1-weak-gym.txt") as fp:
        id= []
        totalSheet=[]
        timeSlot = []
        lastLine=[]
        for ln in fp:
            if ln.startswith("Sep"): ##this is a time slot
                timeSlot.clear()
                timeSlot.append(ln[0:]) ##save that time slot as the lastDate variable
            else:
                if (timeSlot):
                    totalSheet.append(timeSlot) ##append the time slot
                    totalSheet.append(ln[0:]) ##append the name line
                else:
                    print('Hello eror')
        print(totalSheet, file=open("newOuput.txt","a"))
You can try this approach (if you have a strong pattern, with a time at the end of each header row):
import re
# Compiled once; matches a 12-hour time such as 9:00AM or 10:00pm at the start of a string.
TIME_RE = re.compile(r'\b((1[0-2]|0?[1-9]):([0-5][0-9])([AaPp][Mm]))')
def is_time_format(s):
    return bool(TIME_RE.match(s))
new_lines = []
extra_info = ''
with open("1-weak-gym.txt") as fp:
    for line in fp:
        last_bit = line.split(' ')[-1]
        if is_time_format(last_bit):
            # Header row: remember the time slot for the sign-ups that follow.
            extra_info = line
        else:
            # Sign-up row: append the current time slot (which keeps its newline).
            new_lines.append(line.rstrip() + '\t' + extra_info)
with open("newOutput", 'w') as out:
    out.writelines(new_lines)
Then you will get a file in the proper format.
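The same idea can be sketched without file I/O, which makes it easier to try out. This is only an illustration under stated assumptions: a time-slot line is taken to be any line that contains "Until" and ends with a time such as 10:00AM, and the names SLOT_RE and attach_timeslots are hypothetical.

```python
import re

# Assumption: a time-slot line contains "Until" and ends with a time like 10:00AM.
SLOT_RE = re.compile(r"\bUntil\b.*\d{1,2}:\d{2}[AP]M\s*$")

def attach_timeslots(lines):
    """Return [initials, name, timeslot] rows, one per sign-up line."""
    rows = []
    current_slot = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if SLOT_RE.search(line):
            current_slot = line  # header row: remember it for the names that follow
        elif current_slot is not None:
            # Assumes "<initials> <full name>"; a one-word line would raise ValueError.
            initials, name = line.split(maxsplit=1)
            rows.append([initials, name, current_slot])
    return rows

sample = [
    "Sep 30th '20 at 9:00AM Until Sep 30th '20 at 10:00AM",
    "JD John Doe",
    "AW Alice Wonderland",
    "Sep 30th '20 at 8:00AM Until Sep 30th '20 at 9:00AM",
    "IM Iron Man",
]
for row in attach_timeslots(sample):
    print("\t".join(row))
```

Passing an open file object instead of the sample list should work as well, since the function only iterates over lines.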
I have scraped some data from the yellow pages using scrapy.
The hours of the business provided from scraping are in a 12-hour format and I need to convert it into 24 hours.
The format for the business hours I scraped are:
Mon - Fri:,10:00 am - 7:00 pm.
I need to extract the two values for opening and closing time, convert them both into 24-hour format and then concatenate the string back together again.
As a result, I need to devise a regex that will extract the time and then change it into a 24 hour format.
The final string (as per the previous example) should be:
Mon - Fri:,10:00 - 19:00
I have tried different regex. I tried the following:
import re
txt = 'Mon - Fri:,10:00 am - 7:00 pm'
data = re.findall(r'\s(\d{2}\:\d{2}\s?(?:AM|PM|am|pm))', txt)
print(data)
I am not a Python developer, but here is how it can be done in JavaScript; you can port the logic to Python.
This converts the times to military (24-hour) time, using Moment.js:
https://jsfiddle.net/1hxojLdf/2/
let text='Mon - Fri:,10:00 am - 7:00 pm';
const regex=/(\w+\s-\s\w+:.)(\d{1,2}:\d{1,2}\s(am|pm))\s-\s(\d{1,2}:\d{1,2}\s(am|pm))/;
const result=text.match(regex);
let timeone=result[2];
let timetwo=result[4];
timeone= moment(timeone,"h:mm A").format('HH:mm');
timetwo= moment(timetwo,"h:mm A").format('HH:mm');
text=result[1]+timeone+" - "+timetwo;
alert(text)
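For reference, the same logic might look like this in Python. It is a hedged sketch, not a port of the snippet above: it does the 12-hour arithmetic by hand instead of relying on a date library, and the names TIME_RE and to_24h are invented for the example.

```python
import re

# Matches times such as "10:00 am" or "7:00pm" (optional space, any case).
TIME_RE = re.compile(r"(\d{1,2}):(\d{2})\s?([ap]m)", re.IGNORECASE)

def to_24h(match):
    """Replacement callback: turn one 12-hour match into HH:MM."""
    hour = int(match.group(1))
    minute = match.group(2)
    meridiem = match.group(3).lower()
    if meridiem == "pm" and hour != 12:
        hour += 12      # 1pm..11pm -> 13..23
    elif meridiem == "am" and hour == 12:
        hour = 0        # 12am is midnight
    return f"{hour:02d}:{minute}"

txt = "Mon - Fri:,10:00 am - 7:00 pm"
converted = TIME_RE.sub(to_24h, txt)
print(converted)  # Mon - Fri:,10:00 - 19:00
```

Because re.sub replaces each time in place, the rest of the string (the day range and the " - " separator) is carried over unchanged, so no manual concatenation is needed.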
I am trying to print out the operating hours of a stall. If the operating hours are the same for every day, I should print
Monday to Sunday: 0800 - 2200
Otherwise, the output should be broken up by the different operating hours:
Monday to Friday: 0800 - 2200
Saturday to Sunday: 1100 - 2000
The values in the list depend on the stall. For example, one of the stalls has the following operating hours:
operating_hours_list = [['MONDAY', '0800 - 2200'], ['TUESDAY', '0800 - 2200'], ['WEDNESDAY', '0800 - 2200'], ['THURSDAY', '0800 - 2200'], ['FRIDAY', '0800 - 2200'], ['SATURDAY', '1100 - 2000'], ['SUNDAY', '1100 - 2000']]
Thank you!
You can use itertools.groupby() to group the list by consecutive elements which have the same hours (index 1 of the nested list):
import itertools
operating_hours_list = [['MONDAY', '0800 - 2200'], ['TUESDAY', '0800 - 2200'], ['WEDNESDAY', '0800 - 2200'], ['THURSDAY', '0800 - 2200'], ['FRIDAY', '0800 - 2200'], ['SATURDAY', '1100 - 2000'], ['SUNDAY', '1100 - 2000']]
groups = itertools.groupby(operating_hours_list, lambda x: x[1])
# groups looks like
# [('0800 - 2200', <itertools._grouper object at 0x1174c8470>), ('1100 - 2000', <itertools._grouper object at 0x1176f3fd0>)]
# where each <itertools._grouper> object contains elements in the original list
for hours, days in groups:
    day_list = list(days)
    # if there's only one unique day with these hours, then just print that day
    # e.g. day_list = [['MONDAY', '0800 - 2200']], so we need to take the first element of the first element
    # we additionally call .title() on it to turn it to 'Title Case' instead of all-caps
    if len(day_list) == 1:
        print("{}: {}".format(day_list[0][0].title(), hours))
    # otherwise, we get both the first day with these hours (index 0),
    # and the last day with these hours (index -1).
    else:
        print("{} to {}: {}".format(day_list[0][0].title(), day_list[-1][0].title(), hours))
This outputs:
Monday to Friday: 0800 - 2200
Saturday to Sunday: 1100 - 2000
Since you're looking for guidance I'll help you with some pseudo-code (as giving you the answer won't be helpful for learning)
First you'll need something to store the set of open times. This could be one or multiple sets, and each set will have a range of days and a time. Therefore a 2D list is an option:
openings = [[]]
We know that we can start with Monday and its time and go until we find a day with a different time or reach the end of the week. So we can start it as such:
openings = [[operating_hours_list[0][0], operating_hours_list[0][1]]]
This will start us with
[['MONDAY', '0800 - 2200']]
Now loop from the next day until you find a different time (this is now pseudo-code - try translating to Python)
for d from 1 to 6: # Tuesday to Sunday indices
    if operating_hours_list[d][1] == openings[end][1]:
        keep going
    else: # found a day with a different time
        append (' to ' + operating_hours_list[d-1][0]) to openings[end][0]
        append a new opening set to openings with day and times from operating_hours_list
Now this will get you started, and then think about how you'll handle when you get to the end of the week. Keep in mind you can index the last item in a list with [-1], and do it in steps and run and test so you can fix problems as they come up. Hope this helps and you learn a bit from this problem!
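For comparison once you have tried it yourself, one possible Python translation is sketched below. It deviates slightly from the pseudo-code: each entry stores the first day, the last day and the hours separately instead of appending ' to ...' to a string, which also takes care of the end of the week. The function name summarise_hours is made up for this example.

```python
def summarise_hours(operating_hours_list):
    """Collapse consecutive days with identical hours into printable lines."""
    # Each entry is [first_day, last_day, hours]; start with the first day of the week.
    first_day, first_hours = operating_hours_list[0]
    openings = [[first_day, first_day, first_hours]]
    for day, hours in operating_hours_list[1:]:
        if hours == openings[-1][2]:
            openings[-1][1] = day               # same hours: extend the current range
        else:
            openings.append([day, day, hours])  # different hours: start a new range
    lines = []
    for start, end, hours in openings:
        if start == end:
            label = start.title()
        else:
            label = f"{start.title()} to {end.title()}"
        lines.append(f"{label}: {hours}")
    return lines

operating_hours_list = [['MONDAY', '0800 - 2200'], ['TUESDAY', '0800 - 2200'],
                        ['WEDNESDAY', '0800 - 2200'], ['THURSDAY', '0800 - 2200'],
                        ['FRIDAY', '0800 - 2200'], ['SATURDAY', '1100 - 2000'],
                        ['SUNDAY', '1100 - 2000']]
print("\n".join(summarise_hours(operating_hours_list)))
```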
I want to insert thousands separators into the prices, as shown below. Do you know a good way?
import re
if __name__ == '__main__':
    sample = "eventA 12:30 - 14:00 5200yen / eventB 15:30 - 17:00 10200yen enjoy!"
    i_want_to_generate = "eventA 12:30 - 14:00 5,200yen / eventB 15:30 - 17:00 10,200yen enjoy!"
    replaced = re.sub("(\d{1,3})(\d{3})", "$1,$2", sample) # Wrong.
    print(replaced) # eventA 12:30 - 14:00 $1,$2yen / eventB 15:30 - 17:00 $1,$2yen enjoy!
You're not using the correct notation for your back-reference(s). You could also add a positive lookahead assertion containing the currency to ensure only those after the 'yen' are changed:
replaced = re.sub(r"(\d{1,3})(\d{3})(?=yen)", r"\1,\2", sample)
print(replaced)
# eventA 12:30 - 14:00 5,200yen / eventB 15:30 - 17:00 10,200yen enjoy!
Use \1 instead of $1 for substitution
Check: https://regex101.com/r/T2sbD2/1
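The two-group pattern above inserts a single comma, so it covers prices up to six digits. If larger amounts can occur, one option is a replacement function that formats the whole number. This is a sketch under the assumption that every price is a run of digits directly followed by the unit; add_separators and its unit parameter are hypothetical names.

```python
import re

def add_separators(text, unit="yen"):
    """Insert thousands separators into every number directly followed by unit."""
    # The lookahead keeps the unit out of the match, so times like 12:30 are untouched.
    pattern = rf"\d+(?={re.escape(unit)})"
    # Note: int() drops leading zeros, e.g. "0500yen" becomes "500yen".
    return re.sub(pattern, lambda m: f"{int(m.group(0)):,}", text)

sample = "eventA 12:30 - 14:00 5200yen / eventB 15:30 - 17:00 10200yen enjoy!"
replaced = add_separators(sample)
print(replaced)
# eventA 12:30 - 14:00 5,200yen / eventB 15:30 - 17:00 10,200yen enjoy!
```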