I need to build a program that can read multiple lines of code, and extract the right information from each line.
Example text:
no matches
one match <'found'>
<'one'> match <found>
<'three'><'matches'><'found'>
For this case, the program should detect <'found'>, <'one'>, <'three'>, <'matches'> and <'found'> as matches because they all have "<" and "'".
However, I cannot work out a system using regex to account for multiple matches on the same line. I was using something like:
re.search('^<.*>$')
But if there are multiple matches on one line, the "'>" and "<'" in between are swallowed by the .* instead of being treated as the boundaries of separate matches. How do I fix this?
This works -
>>> r = re.compile(r"\<\'.*?\'\>")
>>> r.findall(s)
["<'found'>", "<'one'>", "<'three'>", "<'matches'>", "<'found'>"]
Use findall instead of search:
re.findall( r"<'.*?'>", str )
You can use re.findall and match on non > characters inside of the angle brackets:
>>> re.findall('<[^>]*>', "<'three'><'matches'><'found'>")
["<'three'>", "<'matches'>", "<'found'>"]
Non-greedy quantifier '?' as suggested by anubhava is also an option.
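To make the difference between the greedy and non-greedy patterns concrete, here is a small sketch run against the sample text from the question. The variable names (text, greedy, lazy, negated) are made up for illustration.

```python
import re

# Sample text from the question, one candidate line per row.
text = """no matches
one match <'found'>
<'one'> match <found>
<'three'><'matches'><'found'>"""

# Greedy: .* runs to the last '> on the line, merging adjacent matches.
greedy = re.findall(r"<'.*'>", text)

# Non-greedy: .*? stops at the first '> it can, so each tag is separate.
lazy = re.findall(r"<'.*?'>", text)

# Negated character class: same result here, without relying on laziness.
negated = re.findall(r"<'[^'>]*'>", text)

print(greedy)
print(lazy)
```

Note that <found> on the third line is skipped by all three patterns because it has no single quotes.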
I have a string "F(foo)", and I'd like to replace that string with "F('foo')". I know we can also use regular expression in the second parameter and do this replacement using re.sub(r"F\(foo\)", r"F\('foo'\)",str). But the problem here is, foo is a dynamic string variable. It is different every time we want to do this replacement. Is it possible by some sort of regex, to do such replacement in a cleaner way?
I remember one way to extract foo using () and then .group(1). But this would require me to define one more temporary variable just to store foo. I'm curious whether there is a way to replace "F(foo)" with "F('foo')" in a single line, or in other words, in a cleaner way.
Examples :
F(name) should be replaced with F('name').
F(id) should be replaced with F('id').
G(name) should not be replaced.
So, the regex would be r"F\((\w)+\)" to find such strings.
Using re.sub
Ex:
import re
s = "F(foo)"
print(re.sub(r"\((.*)\)", r"('\1')", s))
Output:
F('foo')
The following regex encloses valid [Python|C|Java] identifiers after F and in parentheses in single quotation marks:
re.sub(r"F\(([_a-z][_a-z0-9]*)\)", r"F('\1')", s, flags=re.I)
#"F('foo')"
There are several ways, depending on what foo actually is.
If it can't contain ( or ), you can just replace ( with (' and ) with '). Otherwise, try using
re.sub(r"F\((.*)\)", r"F('\1')", yourstring)
where the \1 in the replacement part will reference the (.*) capture group in the search regex
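One caveat with the greedy (.*) version: if a string holds more than one F(...) call, .* stretches to the last closing parenthesis. A minimal sketch, using a made-up input string, compares it with a stricter \w+ capture:

```python
import re

s = "F(a) + F(b)"  # hypothetical input containing two calls

# Greedy .* captures everything between the first "(" and the last ")".
greedy = re.sub(r"F\((.*)\)", r"F('\1')", s)

# \w+ cannot cross the ")" so each call is rewritten on its own.
strict = re.sub(r"F\((\w+)\)", r"F('\1')", s)

print(greedy)
print(strict)
```

Which one is appropriate depends on what foo may contain, as noted above.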
In your pattern F\((\w)+\) you are almost there, you just need to put the quantifier + after the \w to repeat matching 1+ word characters.
If you put it after the capturing group, you repeat the group which will give you the value of the last iteration in the capturing group which would be the second o in foo.
You could update your expression to:
F\((\w+)\)
And in the replacement refer to the capturing group using \1
F('\1')
For example:
import re
s = "F(foo)"  # avoid naming the variable str, which shadows the built-in
print(re.sub(r"F\((\w+)\)", r"F('\1')", s)) # F('foo')
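The difference between repeating the group and repeating inside the group can be checked directly. This is only a sketch; the input string and variable names are illustrative.

```python
import re

s = "F(foo) G(name) F(id)"

# (\w)+ repeats the group: group 1 keeps only the last character matched.
repeated_group = re.search(r"F\((\w)+\)", s).group(1)

# (\w+) repeats inside the group: group 1 holds the whole word.
whole_word = re.search(r"F\((\w+)\)", s).group(1)

# G(name) is left alone because the pattern requires a literal F.
result = re.sub(r"F\((\w+)\)", r"F('\1')", s)

print(repeated_group, whole_word)
print(result)
```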
I have a bunch of quotes scraped from Goodreads stored in a bs4.element.ResultSet, with each element of type bs4.element.Tag. I'm trying to use regex with the re module in python 3.6.3 to clean the quotes and get just the text. When I iterate and print using [print(q.text) for q in quotes] some quotes look like this
“Don't cry because it's over, smile because it happened.”
―
while others look like this:
“If you want to know what a man's like, take a good look at how he
treats his inferiors, not his equals.”
―
,
Each also has some extra blank lines at the end. My thought was I could iterate through quotes and call re.match on each quote as follows:
cleaned_quotes = []
for q in quotes:
    match = re.match(r'“[A-Z].+$”', str(q))
    cleaned_quotes.append(match.group())
I'm guessing my regex pattern didn't match anything because I'm getting the following error:
AttributeError: 'NoneType' object has no attribute 'group'
Not surprisingly, printing the list gives me a list of None objects. Any ideas on what I might be doing wrong?
As you requested this for learning purpose, here's the regex answer:
(?<=“)[\s\S]+?(?=”)
Explanation:
We use a positive lookbehind and a positive lookahead to mark the beginning and end of the pattern, which removes the quotation marks from the result at the same time.
Inside the quotes we lazily match anything, including newlines, with [\s\S]+?
Sample Code:
import re
regex = r"(?<=“)[\s\S]+?(?=”)"
cleaned_quotes = []
for q in quotes:
    m = re.search(regex, str(q))
    if m:
        cleaned_quotes.append(m.group())
Arguably, we do not need any regex flags here. Python's re module has no g (global) flag; use re.findall() or re.finditer() to get multiple matches. The m (re.MULTILINE) flag is only needed to process matches line by line (in such a scenario it could be required to use [\s\S] instead of the dot to get results that span lines).
That flag also changes the behavior of the positional anchors ^ and $, which then match at the start and end of each line instead of only the string. Therefore, placing such an anchor in the middle of the pattern, as in the original expression, is just wrong.
One more thing, I use re.search() since re.match() matches only from the beginning of the string. A common gotcha. See the documentation.
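As a rough illustration of the re.match() versus re.search() gotcha with the lookaround pattern, here is a sketch on a made-up string. The variable raw is a hypothetical stand-in for q.text, not the actual scraped data.

```python
import re

# Hypothetical stand-in for q.text: leading whitespace, a quote spanning
# two lines, then the attribution dash.
raw = ("\n  “If you want to know what a man's like, take a good look at how he\n"
       "treats his inferiors, not his equals.”\n  ―\n")

pattern = r"(?<=“)[\s\S]+?(?=”)"

# re.match only tries position 0, which is not preceded by “, so it fails.
anchored = re.match(pattern, raw)

# re.search scans forward until the pattern fits somewhere in the string.
found = re.search(pattern, raw)

cleaned = found.group() if found else None
print(anchored)
print(cleaned)
```

The extracted text still contains the line break from the source; whether to collapse it is a separate decision.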
First of all, in your expression r'“[A-Z].+$”' the end-of-line anchor $ is placed before the closing ”, which is logically not possible: nothing can follow the end of the line.
To use $ in regexes for multiline strings, you should also specify the re.MULTILINE flag.
Second, re.match only matches at the beginning of the string; it does not look for a part of the string that matches the regular expression.
This means re.search should do what you initially expected to accomplish.
So the resulting regex could be (for quotes that span several lines, re.DOTALL would also be needed so that . matches newlines):
re.search(r'“[A-Z].+”$', str(q), re.MULTILINE)
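The effect of re.MULTILINE on $ can be seen with a tiny experiment. The two-line input below is invented purely for illustration.

```python
import re

text = "alpha one\nbeta two"  # made-up two-line input

# Without the flag, $ matches only at the end of the string
# (or just before a trailing newline at the end).
end_of_string = re.findall(r"\w+$", text)

# With re.MULTILINE, $ also matches just before each newline.
end_of_each_line = re.findall(r"\w+$", text, re.MULTILINE)

print(end_of_string)
print(end_of_each_line)
```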
I'm using python 2.7 re library to find all numbers written in scientific form in a string. I'm using the following code:
import re
y = re.findall(".([0-9]+\.[0-9]+[eE][-+]?[0-9]+).","{8.25e+07|8.26206e+07}")
print y
However, the output is only ['8.25e+07'] while I'm expecting something like [('8.25e+07'),(8.26206e+07)]. I've been trying around but couldn't find where the problem is. If I input y = re.findall(".([0-9]+\.[0-9]+[eE][-+]?[0-9]+).","|8.26206e+07}") then it gives ['8.26206e+07'] so the pattern is matching the second number but I don't get it why it doesn't match both at the same time.
You are slightly overcomplicating your regex by misusing the . which matches any character while not actually needing it and using a capturing group () without really using it.
With your pattern you are looking for a number in scientific notation which has to be BOTH preceded and followed by exactly one character.
{8.25e+07|8.26206e+07}
[--------]
After re.findall traverses your string from the beginning it finds your defined pattern, which then drops the { and the | because of your capturing group (..) and saves this as a match. It then continues but only has 8.26206e+07} left. That now does not satisfy your pattern, because it is missing one "any" character for your first ., and no further match is found. Note that findall only looks for non-overlapping matches[1].
To illustrate, change your input string by duplicating your separator |:
>>> p = ".([0-9]+\.[0-9]+[eE][-+]?[0-9]+)."
>>> s = "{8.25e+07||8.26206e+07}"
>>> print(re.findall(p, s))
['8.25e+07', '8.26206e+07']
To satisfy your two .s you need two separators between any two numbers.
Two things I would change in your pattern, (1) remove the .s and (2) remove your capturing group ( ), you have no need for it:
p = r"[0-9]+\.[0-9]+[eE][-+]?[0-9]+"
Capturing groups can be very useful if you need to refer to specific captured groups again later, but your task at hand has no need for them.
[1] https://docs.python.org/2/library/re.html?highlight=findall#re.findall
Because findall is documented to
... Return all non-overlapping matches of pattern in string, as a list of strings.
But your patterns overlap: the leading . of the second match would have to be the | character, but that was already consumed by the trailing . of the first match.
Just remove those non-captured .s at the start and end of your regex.
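The overlap problem can be reproduced side by side. This is a small sketch using the input from the question; the variable names are illustrative, and the conversion to float at the end is an optional extra, not something the question asked for.

```python
import re

s = "{8.25e+07|8.26206e+07}"

# With the surrounding dots, the "|" is consumed by the first match,
# so the second number has no leading character left to satisfy ".".
with_dots = re.findall(r".([0-9]+\.[0-9]+[eE][-+]?[0-9]+).", s)

# Without the dots, the two matches no longer compete for the separator.
without_dots = re.findall(r"[0-9]+\.[0-9]+[eE][-+]?[0-9]+", s)

# The matched strings can be converted to numbers if needed.
values = [float(x) for x in without_dots]

print(with_dots)
print(without_dots)
```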
I think you have extra dots.
Try the following:
import re
y = re.findall("([0-9]+\.[0-9]+[eE][-+]?[0-9]+)","{8.25e+07|8.26206e+07}")
print (y)
When you match with re.findall, it returns all non-overlapping matches. With the dots at both the beginning and the end of the pattern, consecutive matches would have to overlap.
"([0-9]+\.[0-9]+[eE][-+]?[0-9]+)"
should work
I'm struggling to do multiline regex with multiple matches.
I have data separated by newlines/line breaks like below. My pattern matches each of these lines if I test it separately. How can I match all the occurrences (specifically the numbers)?
I've read that i could/should use DOTALL somehow (possibly with MULTILINE). This seems to match any character (newlines also) but not sure of any eventual side effects. Don't want to have it match an integer or something and give me malformed data in the end.
Any info on this would be great.
What i really need though, is some assistance in making this example code work. I only need to fetch the numbers from the data.
I used re.fullmatch when i only needed one specific match in a previous case and not entirely sure which function i should use now by the way (finditer, findall, search etc.).
Thank you for any and all help :)
data = """http://store.steampowered.com/app/254060/
http://www.store.steampowered.com/app/254061/
https://www.store.steampowered.com/app/254062
store.steampowered.com/app/254063
254064"""
regPattern = '^\s*(?:https?:\/\/)?(?:www\.)?(?:store\.steampowered\.com\/app\/)?([0-9]+)\/?\s*$'
evaluateData = re.search(regPattern, data, re.DOTALL | re.MULTILINE)
if evaluateData is not None:
    print('do stuff')
else:
    print('found no match')
import re
p = re.compile(r'^\s*(?:https?:\/\/)?(?:www\.)?(?:store\.steampowered\.com\/app\/)?([0-9]+)\/?\s*$', re.MULTILINE)
test_str = "http://store.steampowered.com/app/254060/\nhttp://www.store.steampowered.com/app/254061/\nhttps://www.store.steampowered.com/app/254062\nstore.steampowered.com/app/254063\n254064"
re.findall(p, test_str)
https://regex101.com/r/rC9rI0/1
This gives ['254060', '254061', '254062', '254063', '254064'].
Are you trying to return those specific integers?
re.search stops at the first occurrence.
You should use this instead:
re.findall(regPattern, data, re.MULTILINE)
['254060', '254061', '254062', '254063', '254064']
Note: search was not working for me (Python 2.7.9). It just returned the match from the first line of data.
/ has no special meaning in Python regexes, so you do not have to escape it (and in non-raw strings you would have to escape every \).
Try this:
regPattern = r'^\s*(?:https?://)?(?:www\.)?(?:store\.steampowered\.com/app/)?([0-9]+)/?\s*$'
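Putting the answers together, the following sketch contrasts search, findall and finditer on the data from the question. It assumes the unescaped-slash pattern above; the variable names are illustrative. Since re.DOTALL only changes what an unescaped . matches and this pattern contains none, that flag would make no difference here and is left out.

```python
import re

data = """http://store.steampowered.com/app/254060/
http://www.store.steampowered.com/app/254061/
https://www.store.steampowered.com/app/254062
store.steampowered.com/app/254063
254064"""

pattern = re.compile(
    r'^\s*(?:https?://)?(?:www\.)?(?:store\.steampowered\.com/app/)?([0-9]+)/?\s*$',
    re.MULTILINE,  # let ^ and $ match at every line boundary
)

# search returns only the first match in the string.
first_id = pattern.search(data).group(1)

# findall returns the captured group of every non-overlapping match.
all_ids = pattern.findall(data)

# finditer yields match objects, useful when positions are also needed.
line_starts = [m.start(1) for m in pattern.finditer(data)]

print(first_id)
print(all_ids)
```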
This might be a simple one :)
I am trying to convert the following:
<gallery>File:ReDescribe.jpg|Photo by:J. K.File:redescribe_still1.pngFile:redescribe_still2.jpegFile:redescribe_still3.jpgFile:redescribe_still4.jpgFile:redescribe_still5.jpg</gallery>
into:
[[File:ReDescribe.jpg|photo by: J K]][[File:redescribe_still1.png]] [[File:redescribe_still2.jpeg]] [[File:redescribe_still3.jpg]] [[File:redescribe_still4.jpg]] [[File:redescribe_still5.jpg]]
And to start with, I am looking for a Python regex that selects each File:filename.ext
So far I thought of 'File:(.*?)File', but this expression excludes the last File: entry since it is not followed by another File.
See it in the regex tester: https://regex101.com/r/iV1mD9/1
How could the expression also match the last File: which is followed by </gallery>?
File:(.*?)(?=File:|<\/gallery>)
Try this. See the demo. Use a lookahead to make sure the last File: is also captured.
https://regex101.com/r/sJ9gM7/94#python
First remove the gallery tag and then apply the below positive lookahead based regex.
>>> s = '''<gallery>File:ReDescribe.jpg|Photo by:J. K.File:redescribe_still1.pngFile:redescribe_still2.jpegFile:redescribe_still3.jpgFile:redescribe_still4.jpgFile:redescribe_still5.jpg</gallery>'''
>>> re.sub(r'(File:.+?)(?=File:|$)', r'[[\1]]', re.sub(r'</?gallery>', '', s))
'[[File:ReDescribe.jpg|Photo by:J. K.]][[File:redescribe_still1.png]][[File:redescribe_still2.jpeg]][[File:redescribe_still3.jpg]][[File:redescribe_still4.jpg]][[File:redescribe_still5.jpg]]'
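For comparison, the lookahead can also be used with re.findall to pull out each entry first and build the wiki markup afterwards. This is a sketch on a shortened version of the input; the names entries and wiki are made up, and the caption is left untouched rather than reformatted.

```python
import re

# Shortened version of the gallery string from the question.
s = ("<gallery>File:ReDescribe.jpg|Photo by:J. K."
     "File:redescribe_still1.pngFile:redescribe_still2.jpeg</gallery>")

# Lazily capture up to the next "File:" or the closing tag, without
# consuming either, so the last entry is matched as well.
entries = re.findall(r"File:(.*?)(?=File:|</gallery>)", s)

# Wrap every entry in wiki link brackets.
wiki = "".join("[[File:{}]]".format(e) for e in entries)

print(entries)
print(wiki)
```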