Regular expression in Python, 2-3 numbers then 2 letters - python

I am trying to auto-detect bra sizes in a list of clothing items. I have managed to extract only the bra items, and I am now trying to extract the size information. I think I am almost there (thanks to the Stack Overflow community), but there is one particular case that I could not find covered in another post.
I am using:
regexp = re.compile(r' \d{2,3} ?[a-fA-F]([^bce-zBCE-Z]|$)')
So
Possible white space if not at the beginning of the description
two or three numbers
Another possible white space or not
Any letters (lower or upper case) between A and F
and then another letter for the two special cases AA and FF, or the end of the string.
My question is: is there a way to make the second letter match the first letter (AA or FF)? In my case, my code outputs sizes such as BA and CA, which do not exist.
Examples:
Not working:
"bh sexig top matchande h&m genomskinligt parti svart detaljer 42 basic plain" returns "42 ba" instead of no match
"puma, sport-bh, strl: 34cd, svart/grå", I guess the customer meant c/d
Working fine:
"victoria's secret, bh, strl: 32c, gul/vit" returns "32 c"
"pink victorias secret bh 75dd burgundy" returns "75 dd"
Thanks!

You might use
\d{2,3} ?([a-fA-F])\1?(?![a-fA-F])
Explanation
\d{2,3} ? Match 2-3 digits and an optional space
([a-fA-F])\1? Capture a-fA-F in group 1 followed by an optional backreference to group 1
(?![a-fA-F]) Negative lookahead, assert what is on the right is not a-fA-F
Regex demo
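As a quick check, here is a minimal sketch that runs the pattern above against the four example descriptions from the question. The helper name find_size is hypothetical and only used for illustration.

```python
import re

# Pattern from the answer: 2-3 digits, optional space, a letter a-f,
# optionally the same letter again, and no further a-f letter after it.
SIZE_RE = re.compile(r'\d{2,3} ?([a-fA-F])\1?(?![a-fA-F])')

def find_size(description):
    """Return the matched size text, or None when no size is found."""
    m = SIZE_RE.search(description)
    return m.group(0) if m else None

samples = [
    "bh sexig top matchande h&m genomskinligt parti svart detaljer 42 basic plain",
    "puma, sport-bh, strl: 34cd, svart/grå",
    "victoria's secret, bh, strl: 32c, gul/vit",
    "pink victorias secret bh 75dd burgundy",
]

for text in samples:
    print(repr(find_size(text)))
```

Note that the backreference \1 is case-sensitive unless re.IGNORECASE is passed, so a mixed-case size such as "75Dd" would not be treated as a doubled letter.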


Regex that matches all German and Austrian mobile phone numbers

I need a Python regex which matches mobile phone numbers from Germany and Austria.
In order to do so, we first have to understand the structure of a phone number:
A mobile number can be written with a country calling code at the beginning. However, this code is optional!
If we use the country calling code, the trunk prefix is redundant!
The prefix is composed of the trunk prefix and the company code.
The prefix is followed by an individual and unique number with 7 or 8 digits.
List of German prefixes:
0151, 0160, 0170, 0171, 0175, 0152, 0162, 0172, 0173, 0174, 0155, 0157, 0159, 0163, 0176, 0177, 0178, 0179, 0164, 0168, 0169
List of Austrian prefixes:
0664, 0680, 0688, 0681, 0699, 0664, 0667, 0650, 0678, 0650, 0677, 0676, 0660, 0699, 0690, 0665, 0686, 0670
Now that we know all the rules for building a regex, we have to consider that humans sometimes write numbers in very strange ways, with multiple whitespaces, / or (). For example:
0176 98 600 18 9
+49 17698600189
+(49) 17698600189
0176/98600189
0176 / 98600189
many more ways to write the same number
I am looking for a Python regex which can match all Austrian and German mobile numbers.
What I have so far is this:
^(?:\+4[39]|004[39]|0|\+\(49\)|\(\+49\))\s?(?=(?:[^\d\n]*\d){10,11}(?!\d))(\()?[19][1567]\d{1,2}(?(1)\))\s?\d(?:[ /-]?\d)+
You can use
(?x)^ # Free spacing mode on and start of string
(?: # A container group:
(\+49|0049|\+\(49\)|\(\+49\))? [ ()\/-]* # German: country code
(?(1)|0)1(?:5[12579]|6[023489]|7[0-9]) # trunk prefix and company code
| # or
(\+43|0043|\+\(43\)|\(\+43\))? [ ()\/-]* # Austrian: country code
(?(2)|0)6(?:64|(?:50|6[0457]|7[0678]|8[0168]|9[09])) # trunk prefix and company code
)
[ ()\/-]* # zero or more spaces, parens, / and -
\d(?:[ \/-]*\d){6,7} # a digit and then six or seven occurrences of space, / or - and a digit
\s* # zero or more whites
$ # end of string
See the regex demo.
A one-line version of the pattern is
^(?:(\+49|0049|\+\(49\)|\(\+49\))?[ ()\/-]*(?(1)|0)1(?:5[12579]|6[023489]|7[0-9])|(\+43|0043|\+\(43\)|\(\+43\))?[ ()\/-]*(?(2)|0)6(?:64|(?:50|6[0457]|7[0678]|8[0168]|9[09])))[ ()\/-]*\d(?:[ \/-]*\d){6,7}\s*$
See this demo.
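A minimal sketch of how the one-line pattern could be used from Python, checked against the example spellings from the question. The helper name is_mobile and the extra test inputs (the Austrian number and the two non-matching strings) are made up for illustration.

```python
import re

# One-line pattern from above, split over adjacent string literals for readability.
PHONE_RE = re.compile(
    r'^(?:(\+49|0049|\+\(49\)|\(\+49\))?[ ()\/-]*(?(1)|0)1(?:5[12579]|6[023489]|7[0-9])'
    r'|(\+43|0043|\+\(43\)|\(\+43\))?[ ()\/-]*(?(2)|0)6(?:64|(?:50|6[0457]|7[0678]|8[0168]|9[09])))'
    r'[ ()\/-]*\d(?:[ \/-]*\d){6,7}\s*$'
)

def is_mobile(number):
    """Return True when the whole string looks like a German or Austrian mobile number."""
    return PHONE_RE.match(number) is not None

examples = [
    "0176 98 600 18 9",
    "+49 17698600189",
    "+(49) 17698600189",
    "0176/98600189",
    "0176 / 98600189",
]

for number in examples:
    print(number, is_mobile(number))
```

The conditional (?(1)|0) is what enforces the "trunk prefix only without country code" rule: it requires a 0 exactly when group 1 (the country code) did not match.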
How to create company code regex
Go to the Optimize long lists of fixed string alternatives in regex
Click the Run code snippet button at the bottom of the answer to run the last code snippet
Re-size the input box if you wish
Get the list of your supported numbers, either comma or linebreak separated and paste it into the field
Click Generate button, and grab the pattern that will appear below.
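If you would rather generate such a pattern in Python, the same idea can be sketched with a small trie: shared leading digits are factored out and the remainders become alternatives. This is only an illustrative sketch, not the code behind the linked tool, and the function names build_prefix_regex and _to_pattern are hypothetical. The generated pattern is equivalent to, but not formatted like, the hand-written character classes above.

```python
import re

GERMAN_PREFIXES = [
    "0151", "0160", "0170", "0171", "0175", "0152", "0162", "0172", "0173",
    "0174", "0155", "0157", "0159", "0163", "0176", "0177", "0178", "0179",
    "0164", "0168", "0169",
]

def _to_pattern(node):
    """Turn one trie node into a regex fragment."""
    optional = '' in node  # '' marks "a word may end here"
    alts = [re.escape(ch) + _to_pattern(node[ch]) for ch in sorted(node) if ch != '']
    if not alts:
        return ''
    if len(alts) == 1 and not optional:
        return alts[0]
    body = '(?:' + '|'.join(alts) + ')'
    return body + '?' if optional else body

def build_prefix_regex(words):
    """Build a regex that matches exactly the given fixed strings."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[''] = {}  # end-of-word marker
    return _to_pattern(trie)

pattern = build_prefix_regex(GERMAN_PREFIXES)
print(pattern)
```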

Regex to match strings in quotes that contain only 3 or less capitalized words

I've searched and searched, but can't find any relief for my regex woes.
I wrote the following dummy sentence:
Watch Joe Smith Jr. and Saul "Canelo" Alvarez fight Oscar de la Hoya and Genaddy Triple-G Golovkin for the WBO belt GGG. Canelo Alvarez and Floyd 'Money' Mayweather fight in Atlantic City, New Jersey. Conor MacGregor will be there along with Adonis Superman Stevenson and Mr. Sugar Ray Robinson. "Here Goes a String". 'Money Mayweather'. "this is not a-string", "this is not A string", "This IS a" "Three Word String".
I'm looking for a regular expression that will return the following when used in Python 3.6:
Canelo, Money, Money Mayweather, Three Word String
The regex that has gotten me the closest is:
(["'])[A-Z](\\?.)*?\1
I want it to match only strings of 3 capitalized words or fewer, immediately surrounded by single or double quotes. Unfortunately, so far it seems to match any string in quotes, no matter what the length, no matter what the content, as long as it begins with a capital letter.
I've put a lot of time into trying to hack through it myself, but I've hit a wall. Can anyone with stronger regex kung-fu give me an idea of where I'm going wrong here?
Try to use this one: (["'])((?:[A-Z][a-z]+ ?){1,3})\1
(["']) - opening quote
([A-Z][a-z]+ ?){1,3} - Capitalized word repeating 1 to 3 times separated by space
[A-Z] - capital char (word beginning char)
[a-z]+ - non-capital chars (end of word)
_? - space separator of capitalized words (_ is a space), ? for single word w/o ending space
{1,3} - 1 to 3 times
\1 - closing quote, same as opening
Group 2 is what you want.
Match 1
Full match 29-37 `"Canelo"`
Group 1. 29-30 `"`
Group 2. 30-36 `Canelo`
Match 2
Full match 146-153 `'Money'`
Group 1. 146-147 `'`
Group 2. 147-152 `Money`
Match 3
Full match 318-336 `'Money Mayweather'`
Group 1. 318-319 `'`
Group 2. 319-335 `Money Mayweather`
Match 4
Full match 398-417 `"Three Word String"`
Group 1. 398-399 `"`
Group 2. 399-416 `Three Word String`
RegEx101 Demo: https://regex101.com/r/VMuVae/4
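In Python, the same result can be reproduced with re.finditer, taking group 2 of each match. A minimal sketch using the dummy sentence from the question (the variable names are just for illustration):

```python
import re

txt = (
    'Watch Joe Smith Jr. and Saul "Canelo" Alvarez fight Oscar de la Hoya '
    'and Genaddy Triple-G Golovkin for the WBO belt GGG. '
    "Canelo Alvarez and Floyd 'Money' Mayweather fight in Atlantic City, New Jersey. "
    'Conor MacGregor will be there along with Adonis Superman Stevenson '
    'and Mr. Sugar Ray Robinson. "Here Goes a String". '
    "'Money Mayweather'. "
    '"this is not a-string", "this is not A string", "This IS a" "Three Word String".'
)

# Group 1 is the opening quote, group 2 the 1-3 capitalized words,
# and \1 requires the closing quote to be the same character as the opening one.
QUOTED_RE = re.compile(r'''(["'])((?:[A-Z][a-z]+ ?){1,3})\1''')

names = [m.group(2) for m in QUOTED_RE.finditer(txt)]
print(names)
```

Because the space after each word is optional, a quoted string with a trailing space before the closing quote would also match, with the space included in group 2.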
Working with the text you've provided, I would try to use regular expression lookaround to get the words surrounded by quotes and then apply some conditions on those matches to determine which ones meet your criterion. The following is what I would do:
[p for p in re.findall(r'(?<=[\'"])[\w ]{2,}(?=[\'"])', txt) if all(x.istitle() for x in p.split(' ')) and len(p.split(' ')) <= 3]
txt is the text you've provided here. The output is the following:
# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']
Cleaner:
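matches = []
# Raw string so that \w reaches the regex engine unchanged
for m in re.findall(r'(?<=[\'"])[\w ]{2,}(?=[\'"])', txt):
    # Keep only matches of at most 3 words, all of them capitalized
    if all(x.istitle() for x in m.split(' ')) and len(m.split(' ')) <= 3:
        matches.append(m)
print(matches)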
# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']
Here's my go at it: ([\"'])(([A-Z][^ ]*? ?){1,3})\1

Pattern of regular expressions while using Look Behind or Look Ahead Functions to find a match

I am trying to split a sentence correctly based on normal grammatical rules in Python.
The sentence I want to split is
s = """Mr. Smith bought cheapsite.com for 1.5 million dollars,
i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a
probability of .9 it isn't."""
The expected output is
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with a probability of .9 it isn't.
To achieve this I am using regular expressions. After a lot of searching I came upon the following regex, which does the trick. (new_str is just s with the \n characters removed.)
m = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s',new_str)
for i in m:
    print (i)
Mr. Smith bought cheapsite.com for 1.5 million dollars,i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with aprobability of .9 it isn't.
So the way I understand the regex is that we are first selecting
1) All the characters like i.e
2) From the filtered spaces from the first selection, we select those characters
which don't have words like Mr. Mrs. etc
3) From the filtered 2nd step we select only those subjects where we have either a dot or a question mark and are preceded by a space.
So I tried to change the order as below
1) Filter out all the titles first.
2) From the filtered step select those that are preceded by space
3) remove all phrases like i.e
but when I do that the blank after is also split
m = re.split(r'(?<![A-Z][a-z]\.)(?<=\.|\?)\s(?<!\w\.\w.)',new_str)
for i in m:
    print (i)
Mr. Smith bought cheapsite.com for 1.5 million dollars,i.e.
he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with aprobability of .9 it isn't.
Shouldn't the last step in the modified procedure be capable of identifying phrases like i.e.? Why is it failing to detect it?
First, the last . in (?<!\w\.\w.) looks suspicious, if you need to match a literal dot with it, escape it ((?<!\w\.\w\.)).
Coming back to the question: when you use the r'(?<![A-Z][a-z]\.)(?<=\.|\?)\s(?<!\w\.\w.)' regex, the last negative lookbehind checks that the position after the whitespace is not preceded by a word char, a dot, a word char and any char (since the . is unescaped). That check always passes here, because the four characters before that position are a dot, e, another dot and the space itself, which do not fit that pattern.
To make the lookbehind work the same way as it did before \s, put the \s into the lookbehind pattern, too:
(?<![A-Z][a-z]\.)(?<=\.|\?)\s(?<!\w\.\w.\s)
See the regex demo
Another enhancement can be using a character class in the second lookbehind: (?<=\.|\?) -> (?<=[.?]).
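Putting the suggestions together (escaped dot, \s inside the last lookbehind, character class in the second lookbehind), here is a minimal sketch. It assumes the line breaks in s are replaced by single spaces, rather than removed outright as in the question, so that the words around each break stay separated.

```python
import re

s = """Mr. Smith bought cheapsite.com for 1.5 million dollars,
i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a
probability of .9 it isn't."""

# Assumption: collapse every run of whitespace (including \n) into one space.
new_str = ' '.join(s.split())

# Split on whitespace that follows . or ?, unless it follows a title such as
# "Mr." or an abbreviation such as "i.e." (the latter checked after the \s).
SPLIT_RE = re.compile(r'(?<![A-Z][a-z]\.)(?<=[.?])\s(?<!\w\.\w\.\s)')

sentences = SPLIT_RE.split(new_str)
for sentence in sentences:
    print(sentence)
```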

regex to find image sequences from list of filenames

I need some help with a regex string to pull any filename that looks like it might be part of a frame sequence out of a previously generated list of filenames.
Frames in a sequence will generally have a minimum padding of 3 and will be preceded by either a '.' or a '_'. An exception is: if the filename is only made up of a number and the .jpg extension (e.g. 0001.jpg, 0002.jpg, etc.). I'd like to capture all these in one line of regex, if possible.
Here's what I have so far:
(.*?)(.|_)(\d{3,})(.*)\.jpg
Now I know this doesn't do the "preceded by . or _" bit and instead just finds a . or _ anywhere in the string to return a positive. I've tried a bit of negative lookbehind testing, but can't get the syntax to work.
A sample of data is:
test_canon_shot02.jpg
test_shot01-04.jpg
test_shot02-03.jpg
test_shot02-02.jpg
test_shot01-03.jpg
test_canon_shot03.jpg
test_shot01-02.jpg
test_shot02.jpg
test_canon_shot02.jpg
test_shot01.jpg
GameAssets_.00024.jpg
GameAssets_.00023.jpg
GameAssets_.00022.jpg
GameAssets_.00021.jpg
GameAssets_.00020.jpg
GameAssets_.00019.jpg
GameAssets_.00018.jpg
GameAssets_.00017.jpg
GameAssets_.00016.jpg
GameAssets_.00015.jpg
GameAssets_.00014.jpg
GameAssets_.00013.jpg
GameAssets_.00012.jpg
GameAssets_.00011.jpg
GameAssets_.00010.jpg
GameAssets_.00009.jpg
GameAssets_.00008.jpg
GameAssets_.00007.jpg
GameAssets_.00006.jpg
GameAssets_.00005.jpg
GameAssets_.00004.jpg
GameAssets_.00003.jpg
GameAssets_.00002.jpg
GameAssets_.00001.jpg
GameAssets_.00000.jpg
OrangeXmas2015_Print_A ct2.jpg
sh120_HF_V01-01.jpg
sh120_HF_V01-02.jpg
sh200_DMP_v04.jpg
sh120_HF_V04.jpg
sh120_HF_V03.jpg
sh120_HF_V02.jpg
blah_v02.jpg
blah_v01.jpg
blah_Capture0 4.jpg
blah_Capture03 .jpg
blah_Capture01. jpg
blah_Capture02.jpg
Wall_GraniteBlock_G rey_TC041813.jpg
Renders10_wire.jpg
Renders10.jpg
Renders09_wire.jpg
Renders09.jpg
Renders08_wire.jpg
Renders08.jpg
Renders07_wire.jpg
Renders07.jpg
Renders06_wire.jpg
Renders06.jpg
Renders05_wire.jpg
Renders05.jpg
Renders04_wire.jpg
Renders04.jpg
Renders03_wire.jpg
Renders03.jpg
Renders02_wire.jpg
Renders02.jpg
Renders01_wire.jpg
Renders01.jpg
archmodels58_057_carpinusbetulus_leaf_diffuse.jpg
archmodels58_042_bark_bump.jpg
archmodels58_023_leaf_diffuse.jpg
WINDY TECHNICZNE-reflect00.jpg
archmodels58_057_leaf_opacity.jpg
archmodels58_057_bark_reflect.jpg
archmodels58_057_bark_bump.jpg
blahC-00-oknaka.jpg
bed
debt
cab
00018.jpg
00017.jpg
00016.jpg
00015.jpg
00014.jpg
00013.jpg
00012.jpg
00011.jpg
00010.jpg
00009.jpg
00008.jpg
00007.jpg
00006.jpg
00005.jpg
00004.jpg
00003.jpg
00002.jpg
00001.jpg
00000.jpg
The result I'm after is 2 sequences identified:
GameAssets_.00000.jpg to GameAssets_.00024.jpg
00000.jpg to 00018.jpg
Based on the rules you specified in your question, this pattern should accomplish what you need:
(^|\r?\n|.*_|.*\.)\d{3,}.*\.jpg
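To go from individual matches to the two sequences the question asks for, one option is to match each filename on its own and group the hits by whatever precedes the frame number. The sketch below is an illustration under my own assumptions, not the pattern above: it uses a stricter variant that requires the frame number to sit directly before .jpg, and the names FRAME_RE and find_sequences are hypothetical. Only a handful of the sample filenames are used.

```python
import re
from collections import defaultdict

# Assumed stricter variant: optional prefix ending in '.' or '_',
# then 3+ digits directly before the .jpg extension.
FRAME_RE = re.compile(r'^(.*[._])?(\d{3,})\.jpg$')

def find_sequences(filenames):
    """Group frame-like filenames by prefix; keep prefixes with more than one frame."""
    groups = defaultdict(list)
    for name in filenames:
        m = FRAME_RE.match(name)
        if m:
            groups[m.group(1) or ''].append(name)
    return {prefix: sorted(names) for prefix, names in groups.items() if len(names) > 1}

names = [
    "test_shot01-04.jpg",
    "GameAssets_.00001.jpg",
    "GameAssets_.00000.jpg",
    "archmodels58_057_leaf_opacity.jpg",
    "Renders10.jpg",
    "sh120_HF_V01-01.jpg",
    "00001.jpg",
    "00000.jpg",
]

for prefix, frames in find_sequences(names).items():
    print(repr(prefix), frames[0], "to", frames[-1])
```

Sorting by name is enough here only because the frame numbers are zero-padded to the same width.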
for item in re.findall(r'.*?[._]?0{3,}.*',data):
    print(item)
GameAssets_.00024.jpg
GameAssets_.00023.jpg
GameAssets_.00022.jpg
GameAssets_.00021.jpg
GameAssets_.00020.jpg
GameAssets_.00019.jpg
GameAssets_.00018.jpg
GameAssets_.00017.jpg
GameAssets_.00016.jpg
GameAssets_.00015.jpg
GameAssets_.00014.jpg
GameAssets_.00013.jpg
GameAssets_.00012.jpg
GameAssets_.00011.jpg
GameAssets_.00010.jpg
GameAssets_.00009.jpg
GameAssets_.00008.jpg
GameAssets_.00007.jpg
GameAssets_.00006.jpg
GameAssets_.00005.jpg
GameAssets_.00004.jpg
GameAssets_.00003.jpg
GameAssets_.00002.jpg
GameAssets_.00001.jpg
GameAssets_.00000.jpg
00018.jpg
00017.jpg
00016.jpg
00015.jpg
00014.jpg
00013.jpg
00012.jpg
00011.jpg
00010.jpg
00009.jpg
00008.jpg
00007.jpg
00006.jpg
00005.jpg
00004.jpg
00003.jpg
00002.jpg
00001.jpg
00000.jpg
Try
(.*?)(\.|_?)(000\d{0,})(.*)\.jpg
Notice that I had to escape the '.' in the second group. Also, I had to make the search for '.' and '_' optional in the second group. Finally, I had to add the minimum padding to the third group.
I used regex101.com to test and refine the regex: regex101

Python regex string groups capture

I have a number of medical reports, from each of which I am trying to capture 6 groups (groups 5 and 6 are optional):
(clinical details | clinical indication) + (text1) + (result|report) + (text2) + (interpretation|conclusion) + (text3).
The regex I am using is:
reportPat=re.compile(r'(Clinical details|indication)(.*?)(result|description|report)(.*?)(Interpretation|conclusion)(.*)',re.IGNORECASE|re.DOTALL)
This works, except that it fails on strings missing the optional groups. I have tried putting a question mark after group 5, like so: (Interpretation|conclusion)?(.*), but then this group gets merged into group 4. I am pasting two conflicting strings (one containing groups 5 and 6, the other without them) for people to test their regex. Thanks for helping.
text 1 (all groups present)
Technical Report:\nAdministrations:\n1.04 ml of Fluorine 18, fluorodeoxyglucose with aco - Bronchus and lung\nJA - Staging\n\nClinical Details:\nSquamous cell lung cancer, histology confirmed ?stage\nResult:\nAn FDG scan was acquired from skull base to upper thighs together with a low dose CT scan for attenuation correction and image fusion. \n\nThere is a large mass noted in the left upper lobe proximally, with lower grade uptake within a collapsed left upper lobe. This lesi\n\nInterpretation: \nThe scan findings are in keeping with the known lung primary in the left upper lobe and involvement of the lymph nodes as dThere is no evidence of distant metastatic disease.
text 2 (without group 5 and 6)
Technical Report:\nAdministrations:\n0.81 ml of Fluorine 18, fluorodeoxyglucose with activity 312.79\nScanner: 3D Static\nPatient Position: Supine, Head First. Arms up\n\n\nDiagnosis Codes:\n- Bronchus and lung\nJA - Staging\n\nClinical Indication:\nNewly diagnosed primary lung cancer with cranial metastasis. PET scan to assess any further metastatic disease.\n\nScanner DST 3D\n\nSession 1 - \n\n.\n\nResult:\nAn FDG scan was acquired from skull base to upper thighs together with a low dose CT scan for attenuation correction and image fusion.\n\nThere is increased FDG uptake in the right lower lobe mass abutting the medial and posterior pleura with central necrosis (maximum SUV 18.2). small nodule at the right paracolic gutte
It seems that what you were missing is an end-of-pattern anchor, which keeps the matches in check when combined with the optional presence of groups 5 and 6. This regexp should do the trick, maintaining your current group numbering:
reportPat=re.compile(
r'(Clinical details|indication)(.*)'
r'(result|description|report)(.*?)'
r'(?:(Interpretation|conclusion)(.*))?$',
re.IGNORECASE|re.DOTALL)
The changes are adding the $ at the end, and enclosing the last two groups in an optional non-capturing group, (?: ... )?. Also note how you can easily make the entire regexp more readable by splitting it across lines (the interpreter concatenates adjacent string literals automatically).
Added: When reviewing the result of the matches I saw some :\n or : \n, which can easily be cleaned up by adding (?:[:\s]*)? in between the header and text groups. This is an optional non-capturing group of colons and whitespace. Your regexp then looks like this:
reportPat=re.compile(
r'(Clinical details|indication)(?:[:\s]*)?(.*)'
r'(result|description|report)(?:[:\s]*)?(.*?)'
r'(?:(Interpretation|conclusion)(?:[:\s]*)?(.*))?$',
re.IGNORECASE|re.DOTALL)
Added 2: At this link: https://regex101.com/r/gU9eV7/3, you can see the regex in action. I've also added some unit test cases to verify that it works against both texts, and that the optional groups match for text1 and are empty for text2. I used this in parallel with direct editing in a Python script to verify my answer.
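For a quick local check without the full reports, here is a minimal sketch that applies the regexp above to two short, made-up stand-in texts (one with an Interpretation section, one without). The sample strings are assumptions for illustration only, not the reports from the question.

```python
import re

reportPat = re.compile(
    r'(Clinical details|indication)(?:[:\s]*)?(.*)'
    r'(result|description|report)(?:[:\s]*)?(.*?)'
    r'(?:(Interpretation|conclusion)(?:[:\s]*)?(.*))?$',
    re.IGNORECASE | re.DOTALL)

# Hypothetical short reports standing in for text 1 and text 2.
with_interp = ("Clinical Details:\nLung cancer staging\n"
               "Result:\nLarge mass in left upper lobe.\n\n"
               "Interpretation: \nNo distant metastases.")
without_interp = ("Clinical Indication:\nNewly diagnosed lung cancer.\n"
                  "Result:\nIncreased FDG uptake in right lower lobe.")

m1 = reportPat.search(with_interp)
m2 = reportPat.search(without_interp)

print(m1.groups())  # groups 5 and 6 are filled
print(m2.groups())  # groups 5 and 6 are None
```

Keep in mind that group 2 is greedy, so if a report contains several occurrences of result, description or report after the clinical details, the split happens at the last one.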
The following pattern works for both your test cases though given the format of the data you're having to parse I wouldn't be confident that the pattern will work for all cases (for example I've added : after each of the keyword matches to try to prevent inadvertent matches against more common words like result or description):
re.compile(
r'(Clinical details|indication):(.+?)(result|description|report):(.+?)((Interpretation|conclusion):(.+?)){0,1}\Z',
re.IGNORECASE|re.DOTALL
)
I grouped the last 2 groups and marked them as optional using {0,1}. This means the output groups will differ a little from your original pattern: you'll have an extra group. Counting from 1, group 5 now contains the combined output of the last 2 groups, and their individual data is in groups 6 and 7.
