Python regex capture issue - python

I have a regular expression that captures text from a text file. The regex can be viewed at the following URL:
https://regex101.com/r/wwEjTF/1
In my Python code I would like to extract only the text matched by the regex from all the other text in the file.
I have the following Python code for matching the regex and storing the result in a variable.
match = re.findall(r'test\s.+\n\sdescription\s\"(.+)\"', text, re.S)
I am expecting all the matches to be in the match variable, returned as a list.
But when I do print(match) I get an empty list. I do not understand why it comes up empty. How do I capture the matched part of the regex into the variable?
Thanks for your help.
In case there is an issue with the above URL, here are the regex and the sample text:
test\s.+\n\sdescription\s\"(.+)\"
some random text
test 111.333.555.666
description "text10"
some random text
some random text
test 22.44.55.66
description "text12"
some random text
some random text
test 77.77.88.99
description "text13"
some random text
some random text
test 14.22.55.99
description "text16"
some random text
some random text
test 13.33.55.66
description "text17"
some random text

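As I said in my comment, try omitting re.S: it makes the '.' special character match any character at all, including a newline, so the greedy .+ no longer stops at the end of the line.
Also, \n\s is too strict: it allows exactly one whitespace character between the newline and description. Since \s also matches a newline, you can use \s+ instead.
Your regex will be: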
match = re.findall(r'test\s.+\s+description\s\"(.+)\"', text)
Ideone Demo
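To see the difference, here is a minimal sketch that runs both patterns on a shortened copy of the sample text. The four-space indentation in front of description is an assumption (the question's sample lost its leading whitespace), and the variable names original and fixed are only for illustration.

```python
import re

# Shortened sample; the indentation before "description" is assumed.
text = (
    'some random text\n'
    'test 111.333.555.666\n'
    '    description "text10"\n'
    'some random text\n'
    'test 22.44.55.66\n'
    '    description "text12"\n'
    'some random text\n'
)

# Pattern from the question: exactly one whitespace character is allowed
# between the newline and "description", so nothing matches here.
original = re.findall(r'test\s.+\n\sdescription\s\"(.+)\"', text, re.S)

# Suggested pattern: \s+ absorbs the newline and any indentation, and
# without re.S the .+ stays on a single line.
fixed = re.findall(r'test\s.+\s+description\s\"(.+)\"', text)

print(original)  # []
print(fixed)     # ['text10', 'text12']
```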

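The example on regex101 uses a single tab in front of description, so a single \s works there. Replace it with a repetition, and drop re.S so that the greedy .+ stops at the end of the line instead of running on to the last description in the text:
match = re.findall(r'test\s.+\n\s+description\s\"(.+)\"', text)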

Related

How to use re.sub to match and substitute any numbers after - extension?

I want to change 'datasets/4/image-3.jpg' to 'datasets/4/image-1.jpg'. Is there a way to do it with re.sub, or should I try something else like .split("/")[-1]? I tried the code below but ended up getting 'datasets/1/image-1.jpg'; I want to keep the /4/ instead of getting /1/.
My Code
import re
employee_name = 'datasets/4/image-3.jpg'
label = re.sub('[0-9]', "1", employee_name)
print(label)
Output
datasets/1/image-1.jpg
Expected Input
datasets/4/image-3.jpg
Expected Output
datasets/4/image-1.jpg
You can use
re.sub(r'-\d+\.jpg', '-1.jpg', text)
Note: If the match is always at the end of string, append the $ anchor at the end of the regex. If there can be other text after and you need to make sure there are no word chars after jpg, add a word boundary, \b. If you want to match other extensions, use a group, e.g. (?:jpe?g|png).
Regex details
-\d+ - a hyphen followed by one or more digits
\.jpg - .jpg string.
See the regex demo (code generator link).
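A short sketch of the suggestion applied to the path from the question, with the $ anchor from the note added. The second pattern is an illustrative variant for several extensions; the .png file name is made up.

```python
import re

path = 'datasets/4/image-3.jpg'

# Replace only the number between the hyphen and the extension;
# the digit in /4/ is untouched because it is not preceded by '-'.
label = re.sub(r'-\d+\.jpg$', '-1.jpg', path)
print(label)  # datasets/4/image-1.jpg

# Variant for several extensions: capture the extension and reuse it.
pattern = re.compile(r'-\d+(\.(?:jpe?g|png))$')
other = pattern.sub(r'-1\g<1>', 'datasets/4/image-27.png')
print(other)  # datasets/4/image-1.png
```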

python re: Need to replace numbers with links to websites

I have strings like this:
this is SBN: 1245897 some text some text SBN4589364 some text some
text SBN-7859157, some text some text Part Num: 2615, some text SBN:7859148 asasas
YYY-485 some text some text...
I need to get only the SBN-related numbers and replace them with links to some site. I'm trying this code, but it does not work:
p = re.compile('SBN[:- ] ( \d+ )')
p.sub(r'\1', string)
Please help..
All whitespace counts in a regular expression! Only include whitespace exactly when you would want whitespace in your match as well. You can limit it to just spaces ( ), but generally 'all whitespace' is more useful (\s).
I would also suggest always writing a regular expression as a raw string.
This regex catches all examples in your question (the separator is optional, so SBN4589364 is matched too):
p = re.compile(r'SBN[:-]?\s*(\d+)')
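A minimal sketch of the replacement step itself. The link target https://example.com/sbn/ is a placeholder, and the names to_link, linked and numbers are hypothetical. The separator is optional here ([:-]?) so that SBN4589364, which has none, is picked up as well.

```python
import re

text = ("this is SBN: 1245897 some text some text SBN4589364 some text some "
        "text SBN-7859157, some text some text Part Num: 2615, some text "
        "SBN:7859148 asasas YYY-485 some text some text...")

p = re.compile(r'SBN[:-]?\s*(\d+)')

def to_link(match):
    # group(1) is the number, group(0) the whole "SBN..." token.
    number = match.group(1)
    # Placeholder URL; substitute the real site here.
    return '<a href="https://example.com/sbn/{0}">{1}</a>'.format(
        number, match.group(0))

numbers = p.findall(text)
linked = p.sub(to_link, text)

print(numbers)  # ['1245897', '4589364', '7859157', '7859148']
print(linked)
```

Passing a function to sub gives access to both the full match and the captured number, so the link text can keep the original "SBN..." spelling.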

What would be the easiest way to extract specific information from ()'s?

I have lines I am iterating through that look like this:
random text and A08524SDD here (00-04) more random text
lame text (junk data) more text (08-12) more text 4000 5553
random text and numbers 44553349 (2008)
random text (2005) junk text (junk)
nothing important (13-15) not important (not important)
I am trying to figure out how to pull ONLY the dates (range or single year) from the parenthesis without pulling the other random junk from the parenthesis.
Currently using this, but it is returning the random text as well:
date = re.findall('\(([^)]+)', line)
Edit: Each line in the string I am iterating over 1 line at a time. It is not one single string. I have a for loop that is searching each line and trying to extract the date range from each line. Also, there is random numbers included in the random text, so I cannot just search for ##-## or #### in the entire string. It will have to be encased in ()'s.
Edit2: @CarySwoveland has answered my original question. As a bonus, I do have a few lines that look like this; it would be nice if they could also be included.
random text and numbers 44553349 (2008 important text)
random text (2005 important text) junk text (junk) 55555555 (08-09 important text)
nothing important (13-15) not important (not important)(2008 important text)
In the lines with more than 1 () that both start with a ##-## or a #### I need to grab both of them WITH the text. Out of about 35,000 lines of text, only about ~50 or so have these random issues, and I do not mind doing them by hand. But if a solution exists, it would be nice to implement.
THANK YOU TO ALL WHO HAVE POSTED! THIS HAS HELPED ME OUT GREATLY!!!!
As per both your question and your added comments I would suggest the following pattern:
(?<=\()\d\d-?\d\d.*?(?=\))
This would cater for all patterns of interest, like: (####), (##-##) and (##-## text) and possibly (#### text). Here is an online demo
From left to right:
(?<=\() - Positive lookbehind for an opening parenthesis
\d\d-?\d\d - Two digits followed by an optional hyphen and again two more digits
.*? - Match any character except newlines but non-greedy
(?=\)) - A positive lookahead to check for a closing parenthesis.
If you want to be very explicit about a closing parenthesis after the 4th digit and the possibility of text, you could extend the pattern. For example (?<=\()\d\d-?\d\d(?:\s\w+)?(?=\)), where we have a non-capturing group (?:...) in which we check for a space \s followed by one or more word characters \w+. The non-capturing group is optional ...)? and is followed by the same positive lookahead as above.
Don't forget, using these patterns in Python, you'd need to make sure to use them as raw strings.
Note: I escaped both the opening and closing parentheses in the lookarounds with a backslash, e.g. \( and \), to use them as literals. Not doing so would prematurely open/close another (non-)capturing group!
A Python example:
import re
lines = ['random text and A08524SDD here (00-04) more random text',
'lame text (junk data) more text (08-12) more text 4000 5553',
'random text and numbers 44553349 (2008)',
'random text (2005) junk text (junk)',
'nothing important (13-15) not important (not important)',
'random text and numbers 44553349 (2008 important text)',
'random text (2005 important text) junk text (junk) 55555555 (08-09 important text)',
'nothing important (13-15) not important (not important)(2008 important text)']
for line in lines:
    print(re.findall(r'(?<=\()\d\d-?\d\d.*?(?=\))', line))
Returns:
['00-04']
['08-12']
['2008']
['2005']
['13-15']
['2008 important text']
['2005 important text', '08-09 important text']
['13-15', '2008 important text']
You can use the following regular expression.
(?m)(?<=\()(?:\d{4}|\d{2}-\d{2})(?=\))
Regex Demo <¯\_(ツ)_/¯> Python demo
Python's regex engine performs the following operations.
(?m) multiline mode
(?<=\() match is preceded by '(' (positive lookbehind)
(?: begin non-capture group
\d{4} match 4 digits
| or
\d{2}-\d{2} match 2 digits, a hyphen, 2 digits
) end non-capture group
(?=\)) match is followed by ')' (positive lookahead)
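As a quick check, this pattern can be run against the five lines from the question. This is only a sketch; the names DATE_IN_PARENS and dates are illustrative, and the (?m) flag is left out because each line is searched on its own.

```python
import re

# Lookbehind/lookahead require the parentheses but keep them out of the match.
DATE_IN_PARENS = re.compile(r'(?<=\()(?:\d{4}|\d{2}-\d{2})(?=\))')

lines = [
    'random text and A08524SDD here (00-04) more random text',
    'lame text (junk data) more text (08-12) more text 4000 5553',
    'random text and numbers 44553349 (2008)',
    'random text (2005) junk text (junk)',
    'nothing important (13-15) not important (not important)',
]

dates = [DATE_IN_PARENS.findall(line) for line in lines]
print(dates)  # [['00-04'], ['08-12'], ['2008'], ['2005'], ['13-15']]
```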
Does something like this work for you?
This assumes strings is a list of your lines:
import re
def getter(string):
    return re.search(r'(\(\d{4}\)|\(\d{2}-\d{2}\))', string).group()
list(map(getter, strings))
Output:
['(00-04)', '(08-12)', '(2008)', '(2005)', '(13-15)']
As per your edit: if you are looping, just apply the function to each line inside the loop.
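One caveat: re.search returns None when a line contains no date, so calling .group() on the result would raise AttributeError for such lines. A hedged sketch with a guard, assuming that lines without a date should simply yield None (the name first_date is illustrative and not part of the answer):

```python
import re

# The capturing group keeps the date without the surrounding parentheses.
DATE = re.compile(r'\((\d{4}|\d{2}-\d{2})\)')

def first_date(line):
    """Return the first parenthesised date in line, or None if there is none."""
    m = DATE.search(line)
    return m.group(1) if m else None

print(first_date('random text (2005) junk text (junk)'))  # 2005
print(first_date('no dates here (junk)'))                 # None
```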

How to write a Regex for centrally aligned upper case words?

                    THE COMPANY
ABCD is a new company in the field of Marketing. bla bla bla.
                    DESCRIPTION
xyz....
                    SUMMARY
asdf.......
For text like above, I want to pick up headings like THE COMPANY and SUMMARY. I would like to write a code in Python. I have been trying to use RegEx but have not found a way to write a pattern that matches centrally aligned words. I am open to any new method as long as it can be implemented in Python.
If I understand you correctly, you want to match lines composed of some sort of indentation, followed by some number of upper case words.
If so, the following regex should do the trick:
(?m)^(?: +)[A-Z\s]+$
Let's take that piece by piece.
(?m) tells the regex matcher to treat ^ and $ as the beginning and end of the line instead of the beginning and end of the string.
^ matches the beginning of the line.
(?: +) is a non-capturing group of one or more spaces. It consumes the indentation in front of the heading. Note that a non-capturing group is still part of the overall match, so re.findall returns the leading spaces along with the text; strip them afterwards, or put a capturing group around the heading part to get only the text. If the indentation may contain tabs as well, use [ \t]+ instead.
[A-Z\s]+ matches one or more uppercase letters or whitespace characters.
$ matches the end of the line.
Putting it all together (and into Python) we get:
import re
headers = re.findall(r'(?m)^(?: +)[A-Z\s]+$', your_string)
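A variation on this idea, offered as a sketch and not as the pattern from the answer: it puts the heading in a capturing group so the indentation is not returned, and uses [A-Z ] in place of [A-Z\s] so a match cannot run across a newline. The indentation width in the sample is an assumption, and the names HEADING and headers are illustrative.

```python
import re

# Sample text; the amount of leading whitespace is assumed.
text = (
    "                THE COMPANY\n"
    "ABCD is a new company in the field of Marketing. bla bla bla.\n"
    "                DESCRIPTION\n"
    "xyz....\n"
    "                SUMMARY\n"
    "asdf.......\n"
)

# [ \t]+ consumes the indentation; the group captures only the heading;
# the trailing [ \t]* tolerates whitespace at the end of the line.
HEADING = re.compile(r'^[ \t]+([A-Z][A-Z ]*?)[ \t]*$', re.MULTILINE)

headers = HEADING.findall(text)
print(headers)  # ['THE COMPANY', 'DESCRIPTION', 'SUMMARY']
```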
I prefer to use re.match:
import re
example = '''                    THE COMPANY
ABCD is a new company in the field of Marketing. bla bla bla.
                    DESCRIPTION
xyz....
                    SUMMARY
asdf.......'''
headlines = []
for line in example.split('\n'):
    m = re.match(r'^\s{4,}([A-Z0-9 \t\._-]+)', line)
    if m:
        headlines.append(m.group(1))
print(headlines)
Another way is re.findall:
headlines = [x.lstrip(' \n') for x in re.findall(r'^\s{4,}[A-Z0-9 \t\._-]+', example, re.M)]
print(headlines)

Python regular expressions non-greedy match

I have this code:
import re
a = r'<b>1234</b><b>56text78</b><b>9012</b>'
print(re.search(r'<b>.*?text.*?</b>', a).group())
and I am trying to match a minimal block between <b> and </b> which contains 'text' anywhere in between. This code is the best I could come up with, but it matches:
<b>1234</b><b>56text78</b>
while I need:
<b>56text78</b>
Instead of .*? use this:
print(re.search(r'<b>[^<]*text[^<]*</b>', a).group())
Here [^<]* says: match any characters except "<".
Why do you get <b>1234</b><b>56text78</b> when using the <b>.*?text.*?</b> regex?
The regex engine scans the input from left to right. It first takes the literal <b> from the pattern and matches it at the first <b> tag in the input. It then takes the next part of the pattern, .*?text, and extends the lazy .*? one character at a time until text can match. Laziness only means that the match stops at the first text after that <b> (if there were several text strings, it would stop at the first one); it does not move the starting point. So <b>1234</b><b>56text is matched. Finally the engine takes the last part, .*?</b>, and matches up to the first </b>, so <b>1234</b><b>56text78</b> is the result.
With the <b>[^<]*text[^<]*</b> regex, the characters between <b> and text, and between text and </b>, can be anything except <. This prevents the match from running across tags.
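The two patterns can be compared side by side on the string from the question. A minimal sketch; the variable names lazy and negated are only for illustration.

```python
import re

a = '<b>1234</b><b>56text78</b><b>9012</b>'

# Lazy dot: the match still starts at the first <b> in the string.
lazy = re.search(r'<b>.*?text.*?</b>', a).group()

# Negated class: the match cannot cross a '<', so it stays inside one tag pair.
negated = re.search(r'<b>[^<]*text[^<]*</b>', a).group()

print(lazy)     # <b>1234</b><b>56text78</b>
print(negated)  # <b>56text78</b>
```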
Why doesn't <b>.*?text produce the desired output?
This is what the regexp engine does:
1. It takes the first character of the search pattern, which is <, and finds it in the string, then takes the second, then the third, until it has matched <b>.
2. It then takes the whole .*?text pattern and tries to find it in the string. That is because .*? without the text part would make no sense, as it would match 0 characters. It matches the 1234</b><b>56text part and adds it to the <b> found in step 1.
It actually does produce a non-greedy output, it's just non-obvious in this case. If the string was:
<b>1234</b><b>56text78text</b><b>9012</b>
then the greedy '<b>.*text' match would be:
<b>1234</b><b>56text78text
and the non-greedy one '<b>.*?text' would produce the one I was getting:
<b>1234</b><b>56text
So to answer the initial question, the correct solution is to exclude the '<>' characters from the search:
import re
a = r'<b>1234</b><b>56text78</b><b>9012</b>'
print(re.search(r'<b>[^<>]*text.*?</b>', a).group())
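The greedy versus non-greedy behaviour described above can be checked on the longer example string. A small sketch; the names greedy, non_greedy and blocks are illustrative.

```python
import re

s = '<b>1234</b><b>56text78text</b><b>9012</b>'

# Greedy: .* runs as far as possible, up to the last 'text'.
greedy = re.search(r'<b>.*text', s).group()

# Non-greedy: .*? stops at the first 'text', but still starts at the first <b>.
non_greedy = re.search(r'<b>.*?text', s).group()

# Excluding '<' and '>' keeps the match inside a single <b>...</b> block.
blocks = re.findall(r'<b>[^<>]*text.*?</b>', s)

print(greedy)      # <b>1234</b><b>56text78text
print(non_greedy)  # <b>1234</b><b>56text
print(blocks)      # ['<b>56text78text</b>']
```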
