RegEx: How can I match timecodes above a certain time? - python

I'm writing a script to scour the metadata of YouTube videos and grab timecodes out of them, if any.
import json
import urllib.request
with urllib.request.urlopen("https://www.googleapis.com/youtube/v3/videos?id=m65QTeKRWNg&key=AIzaSyDls3PGTAKqbr5CqSmxt71fzZTNHZCQzO8&part=snippet") as url:
    data = json.loads(url.read().decode())
description = json.dumps(data['items'][0]['snippet']['description'], indent=4, sort_keys=True)
print(description)
This works fine, so I go ahead and find the timecodes.
# finds timecodes like 00:00
timeBeforeHour = re.findall(r'[\d\.-]+:[\d.-]+', description)
>>['0:00', '6:00', '9:30', '14:55', '19:00', '23:23', '28:18', '33:33', '37:44', '40:04', '44:15', '48:00', '54:00', '58:18', '1:02', '1:06', '1:08', '1:12', '1:17', '1:20']
It also grabs times past 59:00, but incorrectly: each one is cut off at the second ":" (1:02:40 comes back as 1:02), so I grab the remaining set separately:
# finds timecodes like 00:00:00
timePastHour = re.findall(r'[\d\.-]+:[\d.-]+:[\d\.-]+', description)
>>['1:02:40', '1:06:10', '1:08:15', '1:12:25', '1:17:08', '1:20:34']
I want to concatenate them, but still have the issue of the incorrect times in the first regex.
How can I stop the range of the first regex going above an hour, i.e. 59:59?
I look at regex and my head explodes a bit, any clarification would be super!
edit:
I've tried this:
description = re.findall(r'?<!\d:)(?<!\d)[0-5]\d:[0-5]\d(?!:?\d', description)
and this:
description = re.findall(r'^|[^\d:])([0-5]?[0-9]:[0-5][0-9])([^\d:]|$', description)
but I'm entering them wrong. What is each part of the regex doing?
Also for context, this is part of the sample I'm trying to strip:
Naked\n1:02:40 Marvel 83' - Genesis\n1:06:10 Ward-Iz - The Chase\n1:08:15 Epoch - Formula\n1:12:25 Perturbator - Night Business\n1:17:08 Murkula - Death Code\n1:20:34 LAZERPUNK - Revenge\n\nPhotography by Jezael Melgoza"

Use
results = re.findall(r'(?<!\d:)(?<!\d)[0-5]?\d:[0-5]\d(?!:?\d)', description)
See the regex demo.
It will match a time string when not inside a longer colon-separated digit string (like 11:22:22:33).
Explanation:
(?<!\d:) - a negative lookbehind that matches a location that is not immediately preceded with a digit and :
(?<!\d) - a negative lookbehind that matches a location that is not immediately preceded with a digit (a separate lookbehind is necessary because Python re lookbehind only accepts a fixed-width pattern)
[0-5]?\d - an optional digit from 0 to 5 and then any 1 digit
: - a colon
[0-5]\d - a digit from 0 to 5 and then any 1 digit
(?!:?\d) - a negative lookahead that matches a location that is not immediately followed with an optional : and a digit.
Python online demo:
import re
description = "Tracks\n======\n0:00 Tonebox - Frozen Code\n6:00 SHIKIMO & DOOMROAR - Getaway\n9:30 d.notive - Streets of Passion\n14:55 Perturbator - Neo Tokyo"
results = re.findall(r'(?<!\d:)(?<!\d)[0-5]?\d:[0-5]\d(?!:?\d)', description)
print(results)
# => ['0:00', '6:00', '9:30', '14:55']
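The question also asks how to combine the under-an-hour and over-an-hour results. One possible sketch, which is an extension of the pattern above and not part of the original answer, adds an optional hours part so both formats come back from a single findall call in document order. The sample string is an assumption modeled on the description shown in the question.

```python
import re

# Assumed sample mixing both formats, modeled on the description in the question.
description = "58:18 Naked\n1:02:40 Marvel 83' - Genesis\n1:06:10 Ward-Iz - The Chase"

# (?:\d+:)? optionally consumes an hours part, so h:mm:ss is matched whole,
# while plain m:ss / mm:ss timecodes still match on their own.
timecode = re.compile(r'(?<!\d:)(?<!\d)(?:\d+:)?[0-5]?\d:[0-5]\d(?!:?\d)')

timecodes = timecode.findall(description)
print(timecodes)
# => ['58:18', '1:02:40', '1:06:10']
```

Because the hours group is non-capturing, findall returns the full matched strings rather than group contents.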

I think this is what you are looking for:
(^|[^\d:])([0-5]?[0-9]:[0-5][0-9])([^\d:]|$)
https://regex101.com/r/yERoPi/1
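A minimal sketch of how this pattern might be used from Python. Since it has three capturing groups, re.findall would return tuples, so the example reads group 2 from finditer instead. The sample string is an assumption based on the description in the question.

```python
import re

pattern = re.compile(r'(^|[^\d:])([0-5]?[0-9]:[0-5][0-9])([^\d:]|$)')

# Assumed sample, modeled on the description in the question.
description = "0:00 Tonebox - Frozen Code\n58:18 Naked\n1:02:40 Marvel 83' - Genesis"

# Group 2 holds the timecode; groups 1 and 3 hold the neighbouring characters.
times = [m.group(2) for m in pattern.finditer(description)]
print(times)
# => ['0:00', '58:18']
```

One caveat with this approach: groups 1 and 3 consume the characters on either side of the timecode, so two timecodes separated by a single character (for example "0:00 6:00") would yield only the first. The lookaround version does not consume its neighbours and avoids that.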

Related

How to use regex in string partition using python?

I have a string like the one shown below from a pandas DataFrame column
string = "insulin MixTARD 30/70 - inJECTable 20 unit(s) SC (SubCutaneous) - Hypoglycaemia Protocol if Blood Glucose Level (mmol) < 4 - Call Doctor if Blood Glucose Level (mmol) > 22"
I am trying to get the output shown below (everything before the 2nd hyphen is returned)
insulin MixTARD 30/70 - inJECTable 20 unit(s) SC (SubCutaneous)
So, I tried the below code
string.partition(' -')[0] # though this produces the output, not reliable
Meaning, I always want everything before the 2nd Hyphen (-).
Instead of manually specifying the spaces, I would like to write something like the line below, though I am not sure it is right. Can you help me get everything before the 2nd hyphen?
string.partition(r'\s{2,6}-')[0]
Can you help me get the expected output using the partition method and regex?
You could use re.sub here for a one-liner solution:
import re
string = "insulin MixTARD 30/70 - inJECTable 20 unit(s) SC (SubCutaneous) - Hypoglycaemia Protocol if Blood Glucose Level (mmol) < 4 - Call Doctor if Blood Glucose Level (mmol) > 22"
output = re.sub(r'^([^-]+?-[^-]+?)(?=\s*-).*$', '\\1', string)
print(output)
This prints:
insulin MixTARD 30/70 - inJECTable 20 unit(s) SC (SubCutaneous)
Explanation of regex:
^ from the start of the input
( capture
[^-]+? all content up to
- the first hyphen
[^-]+? all content up to, but not including, the second hyphen (and any whitespace before it)
) end capture
(?=\s*-) lookahead: zero or more whitespace characters followed by the second hyphen must come next
.* then match the remainder of the input
$ end of the input
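A small variation, offered as a sketch and not as part of the answer above: when the input has no second hyphen, re.sub leaves the string unchanged, which can hide the failure. Using re.match with the same capture group makes that case explicit. The helper name before_second_hyphen is hypothetical.

```python
import re

def before_second_hyphen(value):
    """Return the text before the second hyphen, or None if there is no second hyphen."""
    match = re.match(r'^([^-]+?-[^-]+?)(?=\s*-)', value)
    return match.group(1) if match else None

string = "insulin MixTARD 30/70 - inJECTable 20 unit(s) SC (SubCutaneous) - Hypoglycaemia Protocol if Blood Glucose Level (mmol) < 4 - Call Doctor if Blood Glucose Level (mmol) > 22"
print(before_second_hyphen(string))
# insulin MixTARD 30/70 - inJECTable 20 unit(s) SC (SubCutaneous)
```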
Try using re.split instead of string.partition:
re.split(r'\s{2,6}-', string)[0]
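If the amount of whitespace around the hyphens varies, one option is to let the split pattern absorb it and cap the number of splits. This is a sketch under the assumption that every hyphen in the column is a field separator. It uses only re.split with its maxsplit argument, and it normalizes the spacing around the first hyphen to " - ".

```python
import re

string = "insulin MixTARD 30/70 - inJECTable 20 unit(s) SC (SubCutaneous) - Hypoglycaemia Protocol if Blood Glucose Level (mmol) < 4 - Call Doctor if Blood Glucose Level (mmol) > 22"

# Split on a hyphen plus any surrounding whitespace, at most twice,
# then rejoin the first two fields.
parts = re.split(r'\s*-\s*', string, maxsplit=2)
output = " - ".join(parts[:2])
print(output)
# insulin MixTARD 30/70 - inJECTable 20 unit(s) SC (SubCutaneous)
```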
Simple solution with split and join:
"-".join(string.split("-")[0:2])

Finding Standardized Text Pattern In String

We are looking through a very large set of strings for standard number patterns in order to locate drawing sheet numbers. For example, valid sheet numbers are: A-101, A101, C-101, C102, E-101, A1, C1, A-100-A, etc.
They may be contained in a string such as "The sheet number is A-101 first floor plan"
The sheet numbers are always composed of similar sequences of character types (numbers, letters and separators (-, space, _)). If we convert all valid numbers to a pattern indicating the character type (A-101=ASNNN, A101=ANNN, A1=AN, etc.), there are only ~100 valid patterns.
Our plan is to convert each character in the string to its character type and then search for a valid pattern. So the question is: what is the best way to search through "AAASAAAAASAAAAAASAASASNNNSAAAAASAAAAASAAAA" to find one of 100 valid character type patterns? We considered doing 100 text searches, one for each pattern, but it seems there could be a better way to find a candidate pattern and then check whether it is one of the 100 valid patterns.
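The plan described above can be sketched roughly as follows. This is an illustration under stated assumptions, not a definitive implementation: the helper names char_type and find_sheet_numbers are hypothetical, only three of the ~100 patterns are listed, and any character that is not a letter, digit or one of the three separators is mapped to '?'. Because the type string has the same length as the text, match positions map straight back to the original characters.

```python
import re

def char_type(ch):
    """Map one character to its type code: A (letter), N (digit), S (separator)."""
    if ch.isalpha():
        return 'A'
    if ch.isdigit():
        return 'N'
    if ch in '- _':
        return 'S'
    return '?'  # assumption: anything else is not part of a sheet number

# Assumed subset of the valid type patterns; longest first so that the
# alternation prefers the most specific pattern.
valid_patterns = sorted(['ASNNN', 'ANNN', 'AN'], key=len, reverse=True)

# A candidate must not be glued to a neighbouring letter or digit.
finder = re.compile(r'(?<![AN])(?:' + '|'.join(valid_patterns) + r')(?![AN])')

def find_sheet_numbers(text):
    types = ''.join(char_type(c) for c in text)
    # Positions in the type string are positions in the original text.
    return [text[m.start():m.end()] for m in finder.finditer(types)]

print(find_sheet_numbers("The sheet number is A-101 first floor plan"))
# ['A-101']
```

A single alternation over all valid patterns means one pass over each string instead of one search per pattern.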
Solution
Is this what you want?
import re
pattern_dict = {
    'S': r'[ _-]',
    'A': r'[A-Z]',
    'N': r'[0-9]',
}
patterns = [
    'ASNNN',
    'ANNN',
    'AN',
]
text = "A-1 A2 B-345 C678 D900 E80"
for pattern in patterns:
    converted = ''.join(pattern_dict[c] for c in pattern)
    print(pattern, re.findall(rf'\b{converted}\b', text))
output:
ASNNN ['B-345']
ANNN ['C678', 'D900']
AN ['A2']
Explanation
rf'some\b {string}': Combination of r-string and f-string.
r'some\b': Raw string. It stops Python from processing backslash escapes, so it is the same as 'some\\b'.
f'{string}': Formatted string literal. Python 3.6+ supports this syntax. It is similar to '{}'.format(string).
So you can change rf'\b{converted}\b' to '\\b' + converted + '\\b'.
\b in regex: It matches a word boundary.
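With around 100 patterns, running one findall per pattern means scanning each string 100 times. A possible alternative, sketched here as an assumption and not as part of the answer above, is to join the converted patterns into one alternation and scan once. Sorting longest first keeps a short pattern such as AN from shadowing a longer one.

```python
import re

pattern_dict = {
    'S': r'[ _-]',
    'A': r'[A-Z]',
    'N': r'[0-9]',
}
patterns = ['ASNNN', 'ANNN', 'AN']

def to_regex(type_pattern):
    """Translate a type pattern such as 'ASNNN' into regex character classes."""
    return ''.join(pattern_dict[c] for c in type_pattern)

# One alternation covering every valid pattern, longest pattern first.
alternatives = '|'.join(to_regex(p) for p in sorted(patterns, key=len, reverse=True))
combined = re.compile(rf'\b(?:{alternatives})\b')

text = "A-1 A2 B-345 C678 D900 E80"
found = combined.findall(text)
print(found)
# ['A2', 'B-345', 'C678', 'D900']
```

The trade-off is that this version reports the matched values only, not which type pattern produced each one.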
import re
bookmark_strings = []
bookmark_strings.append("I-111 - INTERIOR FINISH PLAN & FINISH SCHEDULE")
bookmark_strings.append("M0.01 SCHEDULES & CALCULATIONS")
bookmark_strings.append("M-1 HVAC PLAN - OH Maple Heights PERMIT")
bookmark_strings.append("P-2 - PLUMBING DEMOLITION")
pattern_dict = {
    'S': r'[. _-]',
    'A': r'[A-Z]',
    'N': r'[0-9]',
}
patterns = [
    'ASNNN',
    'ANSNN',
    'ASN',
    'ANNN',
]
for bookmark in bookmark_strings:
    for pattern in patterns:
        converted = ''.join(pattern_dict[c] for c in pattern)
        # Run the search once and reuse the result in the message.
        matches = re.findall(rf'\b{converted}\b', bookmark)
        if matches:
            print("We found a match for pattern - {}, value = {} in bookmark {}".format(pattern, matches, bookmark))
Output:
We found a match for pattern - ASNNN, value = ['I-111'] in bookmark I-111 - INTERIOR FINISH PLAN & FINISH SCHEDULE
We found a match for pattern - ANSNN, value = ['M0.01'] in bookmark M0.01 SCHEDULES & CALCULATIONS
We found a match for pattern - ASN, value = ['M-1'] in bookmark M-1 HVAC PLAN - OH Maple Heights PERMIT
We found a match for pattern - ASN, value = ['P-2'] in bookmark P-2 - PLUMBING DEMOLITION
Use a regex:
import re
re.findall("[A-Z][-_ ]?[0-9]+",text)

Python Regex - Extract text between (multiple) expressions in a textfile

I am a Python beginner and would be very thankful if you could help me with my text extraction problem.
I want to extract all text that lies between two expressions in a text file (the beginning and the end of a letter). For both the beginning and the end of the letter there are multiple possible expressions (defined in the lists "letter_begin" and "letter_end", e.g. "Dear", "to our", etc.). I want to analyze this for a bunch of files; below is an example of what such a text file looks like. Here I want to extract all text from "Dear" to "Douglas". If no letter_end expression is found, the output should start at the letter_begin match and run to the very end of the text file.
Edit: the end of the recorded text has to be after the match of "letter_end" and before the first line with 20 characters or more (as is the case for "Random text here as well", len=24).
"""Some random text here
 
Dear Shareholders We
are pleased to provide you with this semiannual report for Fund for the six-month period ended April 30, 2018. For additional information about the Fund, please visit our website a, where you can access quarterly commentaries. We value the trust that you place in us and look forward to serving your investment needs in the years to come.
Best regards
Douglas
Random text here as well"""
This is my code so far, but it cannot flexibly catch the text between the expressions (there can be anything (lines, text, numbers, signs, etc.) before the "letter_begin" and after the "letter_end"):
import re
letter_begin = ["dear", "to our", "estimated"] # All expressions for "beginning" of letter
openings = "|".join(letter_begin)
letter_end = ["sincerely", "yours", "best regards"] # All expressions for "ending" of Letter
closings = "|".join(letter_end)
regex = r"(?:" + openings + r")\s+.*?" + r"(?:" + closings + r"),\n\S+"
with open(filename, 'r', encoding="utf-8") as infile:
    text = infile.read()
text = str(text)
output = re.findall(regex, text, re.MULTILINE|re.DOTALL|re.IGNORECASE) # record all text between Regex (Beginning and End Expressions)
print (output)
I am very thankful for any help!
You may use
regex = r"(?:{})[\s\S]*?(?:{}).*(?:\n.*){{0,2}}".format(openings, closings)
This pattern will result in a regex like
(?:dear|to our|estimated)[\s\S]*?(?:sincerely|yours|best regards).*(?:\n.*){0,2}
See the regex demo. Note you should not use re.DOTALL with this pattern, and the re.MULTILINE option is also redundant.
Details
(?:dear|to our|estimated) - any of the three values
[\s\S]*? - any 0+ chars, as few as possible
(?:sincerely|yours|best regards) - any of the three values
.* - any 0+ chars other than newline
(?:\n.*){0,2} - zero, one or two repetitions of a newline followed with any 0+ chars other than newline.
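The pattern above takes up to two lines after the closing phrase. The question's edit asks for something slightly different: stop before the first line of 20 or more characters, and run to the end of the text when no closing phrase is found. One way this might be expressed, as a sketch that goes beyond the answer above, is shown below. The sample text is a hypothetical shortened version of the letter, and re.escape is added on the assumption that the phrases should be matched literally.

```python
import re

letter_begin = ["dear", "to our", "estimated"]
letter_end = ["sincerely", "yours", "best regards"]
openings = "|".join(map(re.escape, letter_begin))
closings = "|".join(map(re.escape, letter_end))

# After the closing phrase: finish that line, then keep taking following lines
# only while each one is at most 19 characters long. If no closing phrase
# exists, the \Z branch lets the lazy [\s\S]*? run to the end of the text.
regex = re.compile(
    rf"(?:{openings})[\s\S]*?"
    rf"(?:(?:{closings}).*(?:\n.{{0,19}}(?=\n|\Z))*|\Z)",
    re.IGNORECASE,
)

# Hypothetical shortened sample, not the full letter from the question.
text = (
    "Some random text here\n"
    "Dear Shareholders We\n"
    "are pleased to provide you with this report.\n"
    "Best regards\n"
    "Douglas\n"
    "Random text here as well"
)
letters = regex.findall(text)
print(letters)
```

Here the match ends after "Douglas", because the next line has 24 characters. Short or empty lines after the signature are still included, which may or may not be wanted.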
Python demo code:
import re
text="""Some random text here
Dear Shareholders We
are pleased to provide you with this semiannual report for Fund for the six-month period ended April 30, 2018. For additional information about the Fund, please visit our website a, where you can access quarterly commentaries. We value the trust that you place in us and look forward to serving your investment needs in the years to come.
Best regards
Douglas
Random text here as well"""
letter_begin = ["dear", "to our", "estimated"] # All expressions for "beginning" of letter
openings = "|".join(letter_begin)
letter_end = ["sincerely", "yours", "best regards"] # All expressions for "ending" of Letter
closings = "|".join(letter_end)
regex = r"(?:{})[\s\S]*?(?:{}).*(?:\n.*){{0,2}}".format(openings, closings)
print(regex)
print(re.findall(regex, text, re.IGNORECASE))
Output:
['Dear Shareholders We\nare pleased to provide you with this semiannual report for Fund for the six-month period ended April 30, 2018. For additional information about the Fund, please visit our website a, where you can access quarterly commentaries. We value the trust that you place in us and look forward to serving your investment needs in the years to come.\nBest regards \nDouglas\n']

Python RegEx for Australian Phone Numbers - False negative - 2 matches in the same substring

I am trying to extract phone numbers from a web page using Python & RegEx
Australian number format
+61 (international code - shown below as 'i')
02, 03, 07 or 08 (state codes - shown below as 's')
1234-5678 (8 digit local number - shown below as 'x')
Common variations of format (in order of commonality):
Format 1: ss xxxx xxxx (e.g. 02 1234 5678)
Format 2: +ii s xxxx xxxx (e.g. +61 2 1234 5678) (note the first 's' digit is removed here)
Format 3: (seen rarely) +ii (s)s xxxx-xxxx (e.g. +61 (0)2 1234 5678)
My RegEx:
re.findall(r'[0][2]\d{8}|[0][3]\d{8}|[0][7]\d{8}|[0][8]\d{8}|[6][1][2]\d{8}|[6][1][3]\d{8}|[6][1][7]\d{8}|[6][1][8]\d{8}|[0][4]\d{8}|[6][1][4]\d{8}|[1][3][0][0]\d{6}|[1][8][0][0]\d{6}', re.sub(r'\W+', '', sample_text))
works well on a simple sample_text:
sample_text = "610212345678ABC##610312345678ABC##610712345678ABC##610812345678ABC##0212345678ABC##0312345678ABC##0712345678ABC##0812345678ABC##61212345678ABC##61312345678ABC##61712345678ABC##61812345678ABC##0412345678ABC##61412345678ABC##130012345678ABC##180012345678ABC##"
Result:
['0212345678', '0312345678', '0712345678', '0812345678',
'0212345678', '0312345678', '0712345678', '0812345678',
'61212345678', '61312345678', '61712345678', '61812345678',
'0412345678', '61412345678', '1300123456', '1800123456']
The Goal
Using http://www.outware.com.au/contact as an example ...
The 2 actual numbers on the page are:
+61 (0)3 8684 9912 and +61 (0)2 8064 7043 (both numbers appear twice - once in the main section of the page and once in the footer)
The Problem
#take HTML markup from body tags
b = driver.find_element_by_css_selector('body').text
#remove all non-word characters (punctuation and white space)
b = re.sub(r'\W+', '', b)
Result:
"PORTFOLIOINNOVATIONSERVICESCAREERSINSIGHTSNEWSABOUTCONTACTCONTACTOUTWAREMelbourneLe......AFRFast100Nov92017EXPLOREOUTWAREPortfolioInnovationWorkingatOutwareAboutSitemapCONNECTMELBOURNELevel3469LaTrobeStMelbourneVIC3000610386849912SYDNEYLevel41SmailStUltimoNSW2007610280647043"
Now if I apply my regex to this string
re.findall(r'[0][2]\d{8}|[0][3]\d{8}|[0][7]\d{8}|[0][8]\d{8}|[6][1][2]\d{8}|[6][1][3]\d{8}|[6][1][7]\d{8}|[6][1][8]\d{8}|[0][4]\d{8}|[6][1][4]\d{8}|[1][3][0][0]\d{6}|[1][8][0][0]\d{6}', re.sub(r'\W+', '', b))
Result:
[u'0386849912', u'0761028064', u'0386849912', u'0761028064']
I am getting a false positive because I have concatenated a postcode "NSW2007" onto the start of the phone number.
I presume because the regex has parsed the first part of "NSW2007610280647043" matching "0761028064" it doesn't then match "0280647043" which is also part of the same substring
I actually don't mind the false positive (i.e. getting "0761028064") but I do need to solve the false negative (i.e. not getting "0280647043")
I know there's some RegEx gurus here who can help on this. :-)
Please help!!
Don't search/replace any text prior to using the regex. That will make your input unusable. Try this:
(?:(?:\+?61 )?(?:0|\(0\))?)?[2378] \d{4}[ -]?\d{4}
https://regex101.com/r/1Q4HuD/3
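A short sketch of running this pattern on text that has not been stripped of spaces and punctuation. The sample sentence is an assumption that reuses number formats from the question.

```python
import re

phone = re.compile(r'(?:(?:\+?61 )?(?:0|\(0\))?)?[2378] \d{4}[ -]?\d{4}')

# Assumed sample text using formats listed in the question.
page_text = "Call +61 (0)3 8684 9912 or 02 8064 7043 today"

numbers = phone.findall(page_text)
print(numbers)
# ['+61 (0)3 8684 9912', '02 8064 7043']
```

Since all groups are non-capturing, findall returns the full matched numbers with their original formatting.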
It might help to use a negative lookahead to make sure the character following the match is not a digit. For example: (?!\d).
This could create a problem though if some data following a phone number starts with a digit.
The lookahead looks like this when implemented in your regex:
(02\d{8}|03\d{8}|07\d{8}|08\d{8}|612\d{8}|613\d{8}|617\d{8}|618\d{8}|04\d{8}|614\d{8}|1300\d{6}|1800\d{6})(?!\d)
(I removed the square brackets as you do not need them when trying to match a single character)
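The false negative in the question comes from re.findall returning only non-overlapping matches: once "0761028064" has been consumed, the scan resumes after it and "0280647043" can no longer be found. If keeping the false positive is acceptable, as the question says, one possible sketch is to wrap the whole pattern in a lookahead with a capturing group, so that nothing is consumed and every starting position is tried. The condensed prefix pattern below is an assumption meant to cover the same prefixes as the original alternation.

```python
import re

# Tail of the squashed page text from the question.
squashed = "MelbourneVIC3000610386849912SYDNEYLevel41SmailStUltimoNSW2007610280647043"

# Same prefixes as the original alternation, written more compactly (assumption).
number = r'(?:0[23478]|61[23478])\d{8}|1[38]00\d{6}'

# The lookahead itself matches the empty string, so overlapping candidates
# are all reported through the capturing group.
overlapping = re.findall(rf'(?=({number}))', squashed)
print(overlapping)
# ['0386849912', '0761028064', '0280647043']
```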
This answer should be a comment, it isn't because of my low reputation!
I've seen you're updating the regex and I think this variation can help you. It should match very uncommon formats!
(\+61 )?(?:0|\(0\))?[2378] (?:[\s-]*\d){8}

regex to find image sequences from list of filenames

I need some help with a regex string to pull any filename that looks like it might be part of a frame sequence out of a previously generated list of filenames.
Frames in a sequence will generally have a minimum padding of 3 and will be preceded by either a '.' or a '_'. An exception is a filename made up of only a number and the .jpg extension (e.g. 0001.jpg, 0002.jpg, etc.). I'd like to capture all these in one line of regex, if possible.
Here's what I have so far:
(.*?)(.|_)(\d{3,})(.*)\.jpg
Now I know this doesn't do the "preceded by . or _" bit and instead just finds a . or _ anywhere in the string to return a positive. I've tried a bit of negative lookbehind testing, but can't get the syntax to work.
A sample of data is:
test_canon_shot02.jpg
test_shot01-04.jpg
test_shot02-03.jpg
test_shot02-02.jpg
test_shot01-03.jpg
test_canon_shot03.jpg
test_shot01-02.jpg
test_shot02.jpg
test_canon_shot02.jpg
test_shot01.jpg
GameAssets_.00024.jpg
GameAssets_.00023.jpg
GameAssets_.00022.jpg
GameAssets_.00021.jpg
GameAssets_.00020.jpg
GameAssets_.00019.jpg
GameAssets_.00018.jpg
GameAssets_.00017.jpg
GameAssets_.00016.jpg
GameAssets_.00015.jpg
GameAssets_.00014.jpg
GameAssets_.00013.jpg
GameAssets_.00012.jpg
GameAssets_.00011.jpg
GameAssets_.00010.jpg
GameAssets_.00009.jpg
GameAssets_.00008.jpg
GameAssets_.00007.jpg
GameAssets_.00006.jpg
GameAssets_.00005.jpg
GameAssets_.00004.jpg
GameAssets_.00003.jpg
GameAssets_.00002.jpg
GameAssets_.00001.jpg
GameAssets_.00000.jpg
OrangeXmas2015_Print_A ct2.jpg
sh120_HF_V01-01.jpg
sh120_HF_V01-02.jpg
sh200_DMP_v04.jpg
sh120_HF_V04.jpg
sh120_HF_V03.jpg
sh120_HF_V02.jpg
blah_v02.jpg
blah_v01.jpg
blah_Capture0 4.jpg
blah_Capture03 .jpg
blah_Capture01. jpg
blah_Capture02.jpg
Wall_GraniteBlock_G rey_TC041813.jpg
Renders10_wire.jpg
Renders10.jpg
Renders09_wire.jpg
Renders09.jpg
Renders08_wire.jpg
Renders08.jpg
Renders07_wire.jpg
Renders07.jpg
Renders06_wire.jpg
Renders06.jpg
Renders05_wire.jpg
Renders05.jpg
Renders04_wire.jpg
Renders04.jpg
Renders03_wire.jpg
Renders03.jpg
Renders02_wire.jpg
Renders02.jpg
Renders01_wire.jpg
Renders01.jpg
archmodels58_057_carpinusbetulus_leaf_diffuse.jpg
archmodels58_042_bark_bump.jpg
archmodels58_023_leaf_diffuse.jpg
WINDY TECHNICZNE-reflect00.jpg
archmodels58_057_leaf_opacity.jpg
archmodels58_057_bark_reflect.jpg
archmodels58_057_bark_bump.jpg
blahC-00-oknaka.jpg
bed
debt
cab
00018.jpg
00017.jpg
00016.jpg
00015.jpg
00014.jpg
00013.jpg
00012.jpg
00011.jpg
00010.jpg
00009.jpg
00008.jpg
00007.jpg
00006.jpg
00005.jpg
00004.jpg
00003.jpg
00002.jpg
00001.jpg
00000.jpg
The result I'm after is 2 sequences identified:
GameAssets_.00000.jpg to GameAssets_.00024.jpg
00000.jpg to 00018.jpg
Based on the rules you specified in your question, this pattern should accomplish what you need:
(^|\r?\n|.*_|.*\.)\d{3,}.*\.jpg
for item in re.findall(r'.*?[._]?0{3,}.*', data):
    print(item)
GameAssets_.00024.jpg
GameAssets_.00023.jpg
GameAssets_.00022.jpg
GameAssets_.00021.jpg
GameAssets_.00020.jpg
GameAssets_.00019.jpg
GameAssets_.00018.jpg
GameAssets_.00017.jpg
GameAssets_.00016.jpg
GameAssets_.00015.jpg
GameAssets_.00014.jpg
GameAssets_.00013.jpg
GameAssets_.00012.jpg
GameAssets_.00011.jpg
GameAssets_.00010.jpg
GameAssets_.00009.jpg
GameAssets_.00008.jpg
GameAssets_.00007.jpg
GameAssets_.00006.jpg
GameAssets_.00005.jpg
GameAssets_.00004.jpg
GameAssets_.00003.jpg
GameAssets_.00002.jpg
GameAssets_.00001.jpg
GameAssets_.00000.jpg
00018.jpg
00017.jpg
00016.jpg
00015.jpg
00014.jpg
00013.jpg
00012.jpg
00011.jpg
00010.jpg
00009.jpg
00008.jpg
00007.jpg
00006.jpg
00005.jpg
00004.jpg
00003.jpg
00002.jpg
00001.jpg
00000.jpg
Try
(.*?)(\.|_?)(000\d{0,})(.*)\.jpg
Notice that I had to escape the '.' in the second group. Also, I had to make the search for '.' and '_' optional in the second group. Finally, I had to add the minimum padding to the third group.
I used regex101.com to test and refine the regex.
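For comparison, here is a sketch that follows the stated rules more literally and then groups the hits into sequences. It rests on an assumption not stated in the question: the frame number sits directly before the .jpg extension. That excludes names such as archmodels58_057_leaf_opacity.jpg, where the padded number is in the middle. The function name find_sequences and the shortened file list are hypothetical.

```python
import re
from collections import defaultdict

# Optional prefix ending in '.' or '_', then 3+ digits, then the extension.
FRAME = re.compile(r'^(?P<prefix>(?:.*[._])?)(?P<frame>\d{3,})\.jpg$')

def find_sequences(filenames):
    """Group frame files by prefix; return {prefix: (first_name, last_name)}."""
    groups = defaultdict(list)
    for name in filenames:
        match = FRAME.match(name)
        if match:
            groups[match.group('prefix')].append((int(match.group('frame')), name))
    sequences = {}
    for prefix, frames in groups.items():
        if len(frames) > 1:  # a single file is not treated as a sequence
            frames.sort()
            sequences[prefix] = (frames[0][1], frames[-1][1])
    return sequences

# Shortened, hypothetical subset of the sample data.
names = [
    "test_shot01.jpg", "GameAssets_.00001.jpg", "GameAssets_.00000.jpg",
    "GameAssets_.00002.jpg", "archmodels58_057_leaf_opacity.jpg",
    "Renders10.jpg", "00001.jpg", "00000.jpg", "sh120_HF_V04.jpg",
]
sequences = find_sequences(names)
print(sequences)
```

Unlike the patterns keyed on "000", this does not depend on the frame number starting with zeros, so frame 1000 and above would still be picked up.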
