How to use regex in string partition using python? - python

I have a string like the one shown below, taken from a pandas DataFrame column:
string = "insulin MixTARD 30/70 - inJECTable 20 unit(s) SC (SubCutaneous) - Hypoglycaemia Protocol if Blood Glucose Level (mmol) < 4 - Call Doctor if Blood Glucose Level (mmol) > 22"
I am trying to get the output shown below (everything before the 2nd hyphen is returned):
insulin MixTARD 30/70 - inJECTable 20 unit(s) SC (SubCutaneous)
So, I tried the below code
string.partition(' -')[0] # though this produces the output, not reliable
Meaning, I always want everything before the 2nd hyphen (-).
Instead of hard-coding the number of spaces, I would like to write something like the line below, though I am not sure it is right. Can you help me get everything before the 2nd hyphen?
string.partition(r'\s{2,6}-')[0]
Can you help me get the expected output using the partition method and a regex?

You could use re.sub here for a one-liner solution:
import re
string = "insulin MixTARD 30/70 - inJECTable 20 unit(s) SC (SubCutaneous) - Hypoglycaemia Protocol if Blood Glucose Level (mmol) < 4 - Call Doctor if Blood Glucose Level (mmol) > 22"
output = re.sub(r'^([^-]+?-[^-]+?)(?=\s*-).*$', '\\1', string)
print(output)
This prints:
insulin MixTARD 30/70 - inJECTable 20 unit(s) SC (SubCutaneous)
Explanation of regex:
^ from the start of the input
( capture
[^-]+? all content up to
- the first hyphen
[^-]+? all content after it, up to but not including the second hyphen
) end capture
(?=\s*-) lookahead asserting that zero or more whitespace characters and the second hyphen follow
.* then match the remainder of the input
$ end of the input

Try using re.split instead of string.partition:
re.split(r'\s{2,6}-', string)[0]
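Note that str.partition treats its argument as a literal separator, not as a pattern, which is why the regex attempt in the question hands back the whole string. Below is a minimal sketch of the re.split route, assuming the hyphens of interest may have any amount of whitespace around them. The names text, unchanged, parts and before_second_hyphen are illustrative only and do not come from the question.

```python
import re

text = ("insulin MixTARD 30/70 - inJECTable 20 unit(s) SC (SubCutaneous) - "
        "Hypoglycaemia Protocol if Blood Glucose Level (mmol) < 4 - "
        "Call Doctor if Blood Glucose Level (mmol) > 22")

# str.partition searches for the literal characters "\s{2,6}-", finds none,
# and therefore returns the whole string as the first tuple element.
unchanged = text.partition(r'\s{2,6}-')[0]

# Split on a hyphen with optional surrounding whitespace, at most twice,
# then glue the first two pieces back together.
parts = re.split(r'\s*-\s*', text, maxsplit=2)
before_second_hyphen = " - ".join(parts[:2])
print(before_second_hyphen)
```

One consequence of this choice is that the spacing around the first hyphen is normalised to a single space on each side, which may or may not be what you want.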

Simple solution with split and join:
"-".join(string.split("-")[0:2]).strip()

Related

Regex that matches all German and Austrian mobile phone numbers

I need a Python regex that matches mobile phone numbers from Germany and Austria.
In order to do so, we first have to understand the structure of a phone number:
a mobile number can be written with a country calling code in the beginning. However, this code is optional!
if we use the country calling code the trunk prefix is redundant!
The prefix is composed out of the trunk prefix and the company code
The prefix is followed by an individual, unique number with 7 or 8 digits.
List of German prefixes:
0151, 0160, 0170, 0171, 0175, 0152, 0162, 0172, 0173, 0174, 0155, 0157, 0159, 0163, 0176, 0177, 0178, 0179, 0164, 0168, 0169
List of Austrian prefixes:
0664, 0680, 0688, 0681, 0699, 0664, 0667, 0650, 0678, 0650, 0677, 0676, 0660, 0699, 0690, 0665, 0686, 0670
Now that we know all the rules needed to build a regex, we have to consider that humans sometimes write numbers in very strange ways, with multiple whitespaces, / or (). For example:
0176 98 600 18 9
+49 17698600189
+(49) 17698600189
0176/98600189
0176 / 98600189
many more ways to write the same number
I am looking for a Python regex which can match all Austrian and German mobile numbers.
What I have so far is this:
^(?:\+4[39]|004[39]|0|\+\(49\)|\(\+49\))\s?(?=(?:[^\d\n]*\d){10,11}(?!\d))(\()?[19][1567]\d{1,2}(?(1)\))\s?\d(?:[ /-]?\d)+
You can use
(?x)^ # Free spacing mode on and start of string
(?: # A container group:
(\+49|0049|\+\(49\)|\(\+49\))? [ ()\/-]* # German: country code
(?(1)|0)1(?:5[12579]|6[023489]|7[0-9]) # trunk prefix and company code
| # or
(\+43|0043|\+\(43\)|\(\+43\))? [ ()\/-]* # Austrian: country code
(?(2)|0)6(?:64|(?:50|6[0457]|7[0678]|8[0168]|9[09])) # trunk prefix and company code
)
[ ()\/-]* # zero or more spaces, parens, / and -
\d(?:[ \/-]*\d){6,7} # a digit and then six or seven occurrences of space, / or - and a digit
\s* # zero or more whitespaces
$ # end of string
See the regex demo.
A one-line version of the pattern is
^(?:(\+49|0049|\+\(49\)|\(\+49\))?[ ()\/-]*(?(1)|0)1(?:5[12579]|6[023489]|7[0-9])|(\+43|0043|\+\(43\)|\(\+43\))?[ ()\/-]*(?(2)|0)6(?:64|(?:50|6[0457]|7[0678]|8[0168]|9[09])))[ ()\/-]*\d(?:[ \/-]*\d){6,7}\s*$
See this demo.
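As a rough sketch of how the one-line pattern can be used from Python, the snippet below compiles it (split across three raw-string fragments purely for line length) and checks a few inputs. MOBILE_RX, is_mobile and the Austrian sample number are hypothetical names and data chosen for illustration; the German samples are the ones listed in the question.

```python
import re

# The one-line pattern from above, split into raw-string fragments.
MOBILE_RX = re.compile(
    r'^(?:(\+49|0049|\+\(49\)|\(\+49\))?[ ()\/-]*(?(1)|0)1(?:5[12579]|6[023489]|7[0-9])'
    r'|(\+43|0043|\+\(43\)|\(\+43\))?[ ()\/-]*(?(2)|0)6(?:64|(?:50|6[0457]|7[0678]|8[0168]|9[09])))'
    r'[ ()\/-]*\d(?:[ \/-]*\d){6,7}\s*$'
)

def is_mobile(number):
    """Return True if the whole string looks like a German or Austrian mobile number."""
    return MOBILE_RX.match(number) is not None

samples = [
    "0176 98 600 18 9",
    "+49 17698600189",
    "+(49) 17698600189",
    "0176/98600189",
    "0176 / 98600189",
    "+43 664 1234567",     # made-up Austrian number
    "0123 4567890",        # prefix not in either list
    "+49 0176 98600189",   # country code and trunk prefix together
]
for s in samples:
    print(s, is_mobile(s))
```

The conditional (?(1)|0) is what makes the trunk prefix 0 mandatory only when no country code was captured, so the last sample is rejected.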
How to create company code regex
Go to the Optimize long lists of fixed string alternatives in regex
Click the Run code snippet button at the bottom of the answer to run the last code snippet
Re-size the input box if you wish
Get the list of your supported numbers, either comma or linebreak separated and paste it into the field
Click Generate button, and grab the pattern that will appear below.

Regular expression in Python, 2-3 numbers then 2 letters

I am trying to do autodetection of bra size in a list of clothes. While I managed to extract only the bra items, I am now looking at extracting the size information and I think I am almost there (thanks to the stackoverflow community). However, there is a particular case that I could not find on another post.
I am using:
regexp = re.compile(r' \d{2,3} ?[a-fA-F]([^bce-zBCE-Z]|$)')
So
Possible white space if not at the beginning of the description
two or three numbers
Another possible white space or not
Any letters (lower or upper case) between A and F
and then another letter for the two special case AA and FF or the end of the string.
My question is: is there a way to require the second letter to be the same as the first one (AA or FF)? At the moment my code returns sizes such as BA and CA, which do not exist.
Examples:
Not working:
"bh sexig top matchande h&m genomskinligt parti svart detaljer 42 basic plain" return "42 ba" instead of not found
"puma, sport-bh, strl: 34cd, svart/grå", I guess the customer meant c/d
Working fine:
"victoria's secret, bh, strl: 32c, gul/vit" returns "32 c"
"pink victorias secret bh 75dd burgundy" returns "75 dd"
Thanks!
You might use
\d{2,3} ?([a-fA-F])\1?(?![a-fA-F])
Explanation
\d{2,3} ? Match 2-3 digits and an optional space
([a-fA-F])\1? Capture a-fA-F in group 1 followed by an optional backreference to group 1
(?![a-fA-F]) Negative lookahead, assert what is on the right is not a-fA-F
Regex demo
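A small sketch of how the suggested pattern behaves on descriptions like those in the question, assuming re.search is used and the whole match is wanted. SIZE_RX and find_size are hypothetical names introduced for this example.

```python
import re

# Pattern from the answer: digits, optional space, a letter A-F,
# optionally the same letter again, and no further A-F letter after it.
SIZE_RX = re.compile(r'\d{2,3} ?([a-fA-F])\1?(?![a-fA-F])')

def find_size(description):
    """Return the first size-like match, or None when nothing matches."""
    m = SIZE_RX.search(description)
    return m.group(0) if m else None

print(find_size("victoria's secret, bh, strl: 32c, gul/vit"))
print(find_size("pink victorias secret bh 75dd burgundy"))
print(find_size("bh sexig top matchande h&m svart detaljer 42 basic plain"))
print(find_size("puma, sport-bh, strl: 34cd, svart"))
```

Because the backreference \1 is case-sensitive, a mixed-case double letter such as "Dd" would be rejected by the lookahead unless the pattern is compiled with re.I.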

RegEx: How can I match timecodes above a certain time?

I'm writing a script to scour the metadata of YouTube videos and grab timecodes out of them, if any.
with urllib.request.urlopen("https://www.googleapis.com/youtube/v3/videos?id=m65QTeKRWNg&key=AIzaSyDls3PGTAKqbr5CqSmxt71fzZTNHZCQzO8&part=snippet") as url:
    data = json.loads(url.read().decode())
description = json.dumps(data['items'][0]['snippet']['description'], indent=4, sort_keys=True)
print(description)
This works fine, so I go ahead and find the timecodes.
# finds timecodes like 00:00
timeBeforeHour = re.findall(r'[\d\.-]+:[\d.-]+', description)
>>['0:00', '6:00', '9:30', '14:55', '19:00', '23:23', '28:18', '33:33', '37:44', '40:04', '44:15', '48:00', '54:00', '58:18', '1:02', '1:06', '1:08', '1:12', '1:17', '1:20']
It goes beyond and grabs times after 59:00, but not correctly as it misses the final ":", so I grab the remaining set:
# finds timecodes like 00:00:00
timePastHour = re.findall(r'[\d\.-]+:[\d.-]+:[\d\.-]+', description)
>>['1:02:40', '1:06:10', '1:08:15', '1:12:25', '1:17:08', '1:20:34']
I want to concatenate them, but still have the issue of the incorrect times in the first regex.
How can I stop the first regex from matching times above an hour, i.e. above 59:59?
I look at regex and my head explodes a bit, any clarification would be super!
edit:
I've tried this:
description = re.findall(r'?<!\d:)(?<!\d)[0-5]\d:[0-5]\d(?!:?\d', description)
and this:
description = re.findall(r'^|[^\d:])([0-5]?[0-9]:[0-5][0-9])([^\d:]|$', description)
but I'm entering them wrong. What is wrong with where I am placing the regex?
Also for context, this is part of the sample I'm trying to strip:
Naked\n1:02:40 Marvel 83' - Genesis\n1:06:10 Ward-Iz - The Chase\n1:08:15 Epoch - Formula\n1:12:25 Perturbator - Night Business\n1:17:08 Murkula - Death Code\n1:20:34 LAZERPUNK - Revenge\n\nPhotography by Jezael Melgoza"
Use
results = re.findall(r'(?<!\d:)(?<!\d)[0-5]?\d:[0-5]\d(?!:?\d)', description)
See the regex demo.
It will match a time string when not inside a longer colon-separated digit string (like 11:22:22:33).
Explanation:
(?<!\d:) - a negative lookbehind that matches a location that is not immediately preceded with a digit and :
(?<!\d) - a negative lookbehind that matches a location that is not immediately preceded with a digit (a separate lookbehind is necessary because Python re lookbehind only accepts a fixed-width pattern)
[0-5]?\d - an optional digit from 0 to 5 and then any 1 digit
: - a colon
[0-5]\d - a digit from 0 to 5 and then any 1 digit
(?!:?\d) - a negative lookahead that matches a location that is not immediately followed with an optional : and a digit.
Python online demo:
import re
description = "Tracks\n======\n0:00 Tonebox - Frozen Code\n6:00 SHIKIMO & DOOMROAR - Getaway\n9:30 d.notive - Streets of Passion\n14:55 Perturbator - Neo Tokyo"
results = re.findall(r'(?<!\d:)(?<!\d)[0-5]?\d:[0-5]\d(?!:?\d)', description)
print(results)
# => ['0:00', '6:00', '9:30', '14:55']
I think this is what you are looking for:
(^|[^\d:])([0-5]?[0-9]:[0-5][0-9])([^\d:]|$)
https://regex101.com/r/yERoPi/1
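Since this last pattern has three capturing groups, re.findall would return tuples rather than the timecodes themselves. A minimal sketch, assuming only the middle group is wanted; CONSUMING_RX, LOOKAROUND_RX, under_an_hour and the sample strings are names and data made up for this example.

```python
import re

# Variant with consuming groups: the timecode itself is group 2.
CONSUMING_RX = re.compile(r'(^|[^\d:])([0-5]?[0-9]:[0-5][0-9])([^\d:]|$)')
# Lookaround variant from the other answer, kept here for comparison.
LOOKAROUND_RX = re.compile(r'(?<!\d:)(?<!\d)[0-5]?\d:[0-5]\d(?!:?\d)')

def under_an_hour(text):
    """Collect group 2 of every match of the consuming pattern."""
    return [m.group(2) for m in CONSUMING_RX.finditer(text)]

sample = "0:00 Tonebox\n58:18 Epoch\n1:02:40 Marvel 83'"
print(under_an_hour(sample))

# Two timecodes separated by a single character: the consuming pattern
# swallows the separator, so the second timecode has no leading character left.
adjacent = "0:00 6:00"
print(under_an_hour(adjacent))
print(LOOKAROUND_RX.findall(adjacent))
```

The second half illustrates one trade-off: because the surrounding characters are consumed, timecodes separated by just one character can be skipped, which the lookaround version avoids.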

Python RegEx for Australian Phone Numbers - False negative - 2 matches in the same substring

I am trying to extract phone numbers from a web page using Python & RegEx
Australian number format
+61 (international code - shown below as 'i')
02, 03, 07 or 08 (state codes - shown below as 's')
1234-5678 (8 digit local number - shown below as 'x')
Common variations of format (in order of commonality):
Format 1: ss xxxx xxxx (e.g. 02 1234 5678)
Format 2: +ii s xxxx xxxx (e.g. +61 2 1234 5678) (note the first 's' digit is removed here)
Format 3: (seen rarely) +ii (s)s xxxx-xxxx (e.g. +61 (0)2 1234 5678)
My RegEx:
re.findall(r'[0][2]\d{8}|[0][3]\d{8}|[0][7]\d{8}|[0][8]\d{8}|[6][1][2]\d{8}|[6][1][3]\d{8}|[6][1][7]\d{8}|[6][1][8]\d{8}|[0][4]\d{8}|[6][1][4]\d{8}|[1][3][0][0]\d{6}|[1][8][0][0]\d{6}', re.sub(r'\W+', '', sample_text))
works well on a simple sample_text:
sample_text = "610212345678ABC##610312345678ABC##610712345678ABC##610812345678ABC##0212345678ABC##0312345678ABC##0712345678ABC##0812345678ABC##61212345678ABC##61312345678ABC##61712345678ABC##61812345678ABC##0412345678ABC##61412345678ABC##130012345678ABC##180012345678ABC##"
Result:
['0212345678', '0312345678', '0712345678', '0812345678',
'0212345678', '0312345678', '0712345678', '0812345678',
'61212345678', '61312345678', '61712345678', '61812345678',
'0412345678', '61412345678', '1300123456', '1800123456']
The Goal
Using http://www.outware.com.au/contact as an example ...
The 2 actual numbers on the page are:
+61 (0)3 8684 9912 and +61 (0)2 8064 7043 (both numbers appear twice - once in the main section of the page and once in the footer)
The Problem
#take HTML markup from body tags
b = driver.find_element_by_css_selector('body').text
#remove all non-alpha + white space.
b = re.sub(r'\W+', '', b)
Result:
"PORTFOLIOINNOVATIONSERVICESCAREERSINSIGHTSNEWSABOUTCONTACTCONTACTOUTWAREMelbourneLe......AFRFast100Nov92017EXPLOREOUTWAREPortfolioInnovationWorkingatOutwareAboutSitemapCONNECTMELBOURNELevel3469LaTrobeStMelbourneVIC3000610386849912SYDNEYLevel41SmailStUltimoNSW2007610280647043"
Now if I apply my regex to this string
re.findall(r'[0][2]\d{8}|[0][3]\d{8}|[0][7]\d{8}|[0][8]\d{8}|[6][1][2]\d{8}|[6][1][3]\d{8}|[6][1][7]\d{8}|[6][1][8]\d{8}|[0][4]\d{8}|[6][1][4]\d{8}|[1][3][0][0]\d{6}|[1][8][0][0]\d{6}', re.sub(r'\W+', '', b))
Result:
[u'0386849912', u'0761028064', u'0386849912', u'0761028064']
I am getting a false positive because I have concatenated a postcode "NSW2007" onto the start of the phone number.
I presume because the regex has parsed the first part of "NSW2007610280647043" matching "0761028064" it doesn't then match "0280647043" which is also part of the same substring
I actually don't mind the false positive (i.e. getting "0761028064") but I do need to solve the false negative (i.e. not getting "0280647043")
I know there's some RegEx gurus here who can help on this. :-)
Please help!!
Don't search/replace any text prior to using the regex. That will make your input unusable. Try this:
(?:(?:\+?61 )?(?:0|\(0\))?)?[2378] \d{4}[ -]?\d{4}
https://regex101.com/r/1Q4HuD/3
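A short sketch of that suggestion in Python, run on text that still has its spaces and punctuation. AU_RX, page_text and numbers are illustrative names, and page_text is a made-up stand-in for the page body built around the two numbers quoted in the question.

```python
import re

# Pattern from the answer above; all groups are non-capturing,
# so findall returns the full matched numbers.
AU_RX = re.compile(r'(?:(?:\+?61 )?(?:0|\(0\))?)?[2378] \d{4}[ -]?\d{4}')

# Hypothetical page text, left unmodified (no \W+ stripping beforehand).
page_text = ("Melbourne VIC 3000 +61 (0)3 8684 9912 "
             "Sydney NSW 2007 +61 (0)2 8064 7043")

numbers = AU_RX.findall(page_text)
print(numbers)
```

Keeping the separators in place means the postcode 2007 can no longer run into the digits of the phone number that follows it.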
It might help to use a negative lookahead to make sure the character following the match is not a digit. For example: (?!\d).
This could create a problem though if some data following a phone number starts with a number.
The lookahead looks like this when implemented in your regex:
(02\d{8}|03\d{8}|07\d{8}|08\d{8}|612\d{8}|613\d{8}|617\d{8}|618\d{8}|04\d{8}|614\d{8}|1300\d{6}|1800\d{6})(?!\d)
(I removed the square brackets as you do not need them when trying to match a single character)
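To see the effect on the concatenated string from the question, here is a minimal sketch using the lookahead version of the pattern. AU_DIGITS_RX, squashed and found are hypothetical names, and squashed is only the tail of the stripped page text shown in the question.

```python
import re

# Lookahead version of the pattern, split over two raw-string fragments.
AU_DIGITS_RX = re.compile(
    r'(02\d{8}|03\d{8}|07\d{8}|08\d{8}|612\d{8}|613\d{8}|617\d{8}|618\d{8}'
    r'|04\d{8}|614\d{8}|1300\d{6}|1800\d{6})(?!\d)'
)

# Tail of the stripped page text, where a postcode runs into a phone number.
squashed = ("MelbourneVIC3000610386849912SYDNEYLevel41SmailStUltimoNSW2007610280647043")

found = AU_DIGITS_RX.findall(squashed)
print(found)
```

Here the candidate starting at the postcode is followed by another digit, so the lookahead rejects it and the engine moves on to the real number further right.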
This answer should be a comment, it isn't because of my low reputation!
I've seen you're updating the regex and I think this variation can help you. It should match very uncommon formats!
(\+61 )?(?:0|\(0\))?[2378] (?:[\s-]*\d){8}

joining multiple regular expression for readability

I need to match dates that can be in either of the following formats:
mm/dd/yyyy or dd Mon YYYY
Few examples are shown below
04/20/2009 and 24 Jan 2001
To handle this I have written regular expression as below
A few text scenarios are mentioned below
txt1 = 'Lithium 0.25 (7/11/77). LFTS wnl. Urine tox neg. Serum tox
+ fluoxetine 500; otherwise neg. TSH 3.28. BUN/Cr: 16/0.83. Lipids unremarkable. B12 363, Folate >20. CBC: 4.9/36/308 Pertinent Medical
Review of Systems Constitutional:'
txt2 = "s The patient is a 44 year old married Caucasian woman,
unemployed Decorator, living with husband and caring for two young
children, who is referred by Capitol Hill Hospital PCP, Dr. Heather
Zubia, for urgent evaluation/treatment till first visit with Dr. Toney
Winkler IN EIGHT WEEKS on 24 Jan 2001."
date = re.findall(r'(?:\b(?<!\.)[\d{0,2}]+)'
'(?:[/-]\d{0,}[/-]\d{2,4}) | (?:\b(?<!\.)[\d{1,2}]+)[th|st|nd]*'
' (?:[Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec][a-z]*) \d{2,4}', txtData)
I am not getting 24 Jan 2001, whereas if I run (?:\b(?<!\.)[\d{1,2}]+)[th|st|nd]* (?:[Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec][a-z]*) \d{2,4} on its own I do get the output.
Question 1: What is bug in above expression?
Question 2: I want to combine both to make more readable as I have to parse any other formats so I used join as shown below
RE1 = '(?:\b(?<!\.)[\d{0,2}]+) (?:[/-]\d{0,}[/-]\d{2,4})'
RE2 = '(?:\b(?<!\.)[\d{1,2}]+)[th|st|nd]* (?:[Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec][a-z]*) \d{2,4}'
regex_all = '|'.join([RE1, RE2])
regex_all = re.compile(regex_all)
date = regex_all.findall(txtData)  # notice here txtData can be any one of the above strings.
I am getting output as NaN in case of above for date.
Please suggest what is the mistake if I join.
Thanks for your help.
Note that it is a very bad idea to join such long patterns that also match at the same location within the string. That can cause the regex engine to backtrack excessively, leading to slowdowns or even crashes. If there is a way to rewrite the alternations so that they can only match at different locations, or to get rid of them completely, do it.
Besides, you should use grouping constructs (...) to group sequences of patterns, and only use [...] character classes when you need to match specific chars.
Also, your alternatives overlap, so you can combine them easily. See the fixed regex:
\b(?<!\.)\d{1,2}(?:[/-]\d+[/-]|(?:th|st|[nr]d)?\s*(?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*))\s*(?:\d{4}|\d{2})\b
See the regex demo.
Details
\b - a word boundary
(?<!\.) - no . immediately to the left of the current location
\d{1,2} - 1 or 2 digits
(?: - start of a non-capturing alternation group:
[/-]\d+[/-] - / or -, 1+ digits, - or /
| - or
(?:th|st|[nr]d)?\s*(?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)) - th, st, nd or rd (optionally), followed with 0+ whitespaces, and then month names
\s* - 0+ whitespaces
(?:\d{4}|\d{2}) - 2 or 4 digits
\b - trailing word boundary.
Another note: if you want to match the date-like strings with two matching delimiters, you will need to capture the first one, and use a backreference to match the second one, see this regex demo. In Python, you would need a re.finditer to get those matches.
See this Python demo:
import re
rx = r"\b(?<!\.)\d{1,2}(?:([/-])\d+\1|(?:th|st|[nr]d)?\s*(?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*))\s*(?:\d{4}|\d{2})\b"
s = "Lithium 0.25 (7/11/77). LFTS wnl. Urine tox neg. Serum tox\nfluoxetine 500; otherwise neg. TSH 3.28. BUN/Cr: 16/0.83. Lipids unremarkable. B12 363, Folate >20. CBC: 4.9/36/308 Pertinent Medical\nReview of Systems Constitutional:\n\nThe patient is a 44 year old married Caucasian woman, unemployed Decorator, living with husband and caring for two young children, who is referred by Capitol Hill Hospital PCP, Dr. Heather Zubia, for urgent evaluation/treatment till first visit with Dr. Toney Winkler IN EIGHT WEEKS on 24 Jan 2001"
print([x.group(0) for x in re.finditer(rx, s, re.I)])
# => ['7/11/77', '24 Jan 2001']
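If readability is the main goal, one option is to build the same combined pattern from named raw-string fragments. This is only a sketch of that idea: DAY, MONTHS, NUMERIC_MID, NAMED_MID, YEAR, DATE_RX and find_dates are hypothetical names, and a named group (sep) replaces the numbered backreference so that joining fragments cannot shift group numbers.

```python
import re

# Each fragment is a raw string, so backslashes reach the re module unchanged.
DAY = r'\b(?<!\.)\d{1,2}'
MONTHS = r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*'
NUMERIC_MID = r'(?P<sep>[/-])\d+(?P=sep)'     # same delimiter on both sides
NAMED_MID = r'(?:th|st|[nr]d)?\s*' + MONTHS    # optional ordinal suffix, then month
YEAR = r'\s*(?:\d{4}|\d{2})\b'

# The alternation sits in the middle, wrapped in one non-capturing group.
DATE_RX = re.compile(
    DAY + '(?:' + '|'.join([NUMERIC_MID, NAMED_MID]) + ')' + YEAR,
    re.I,
)

def find_dates(text):
    """Return every full date-like match found in text."""
    return [m.group(0) for m in DATE_RX.finditer(text)]

print(find_dates("Lithium 0.25 (7/11/77). CBC: 4.9/36/308. First visit on 24 Jan 2001."))
```

Only the part that actually differs between the two formats is alternated, which keeps the shared day and year pieces from being tried twice at the same position.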
I think your approach is too complicated. I suggest using a combination of a simple regex and strptime().
import re
from datetime import datetime
date_formats = ['%m/%d/%Y', '%d %b %Y']
pattern = re.compile(r'\b(\d\d?/\d\d?/\d{4}|\d\d? \w{3} \d{4})\b')
data = "... your string ..."
for match in re.findall(pattern, data):
    print("Trying to parse '%s'" % match)
    for fmt in date_formats:
        try:
            date = datetime.strptime(match, fmt)
            print(" OK:", date)
            break
        except ValueError:
            pass
The advantage of this approach is, besides a much more manageable regex, that it won't pick dates that look plausible but do not exist, like 2/29/2001 (whereas 2/29/2004 works).
r'(?:\b(?<!\.)[\d{0,2}]+)'
'(?:[/-]\d{0,}[/-]\d{2,4}) | (?:\b(?<!\.)[\d{1,2}]+)[th|st|nd]*'
' (?:[Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec][a-z]*) \d{2,4}'
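You should use raw strings (r'foo') for each string fragment, not only the first one. Without the r prefix, Python itself interprets escapes such as \b (which becomes a backspace character) before the re library ever sees them; in a raw string the backslash is kept as-is and reaches re unchanged.
[abc|def] matches any single character listed between the [], while (one|two|three) matches any of the whole expressions (one, two, or three)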
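Both points can be checked quickly in an interpreter. The snippet below is a small illustration with made-up variable names (plain, raw, text and the *_hits lists), not code from the question.

```python
import re

# In a normal string literal '\b' is the backspace character (0x08);
# in a raw string it stays backslash + 'b', which re reads as a word boundary.
plain = '\b24'
raw = r'\b24'

text = "on 24 Jan 2001"
plain_hits = re.findall(plain, text)   # searches for a literal backspace before "24"
raw_hits = re.findall(raw, text)       # word boundary, then "24"
print(plain_hits, raw_hits)

# A character class matches ONE character from the set;
# a group with | matches one of the whole alternatives.
class_hits = re.findall(r'[Jan|Feb]', "Feb")
group_hits = re.findall(r'(?:Jan|Feb)', "Feb")
print(class_hits, group_hits)
```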
