RegEx Find all xml attribute value [duplicate] - python

I'm trying to extract all attribute values from a large XML file:
import re
s = '<some id="Foo" menu="BAAR"></some>'
output = re.findall(r'="(.*)"', s)
print(output)
I'm expecting the output:
['Foo','BAAR']
However, I'm getting:
['Foo" menu="BAAR']
Can anyone please help me by pointing out what I'm doing wrong?

In regular expressions, * is greedy: it matches as much as it can, so .* runs from the first quote to the last one on the line. Use the non-greedy version, *?, instead:
import re
s = '<some id="Foo" menu="BAAR"></some>'
output = re.findall(r'="(.*?)"', s)
print(output)
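As a rough sketch of two alternatives, a negated character class avoids the greediness question entirely, and the standard library's xml.etree.ElementTree reads the attributes without any regex. The variable names below are illustrative, and the sample string is the one from the question; a real "large xml file" may need ET.parse or ET.iterparse instead of ET.fromstring.

```python
import re
import xml.etree.ElementTree as ET

s = '<some id="Foo" menu="BAAR"></some>'

# Non-greedy quantifier: stop at the first closing quote.
lazy = re.findall(r'="(.*?)"', s)

# Negated character class: match anything except a double quote.
negated = re.findall(r'="([^"]*)"', s)

# Parser-based approach: collect the attribute values of every element.
root = ET.fromstring(s)
values = [value for elem in root.iter() for value in elem.attrib.values()]

print(lazy)
print(negated)
print(values)
```

The regex versions assume double-quoted attribute values; the parser version does not care which quote style the file uses.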

String pattern in Python [duplicate]

I'm trying to check whether a user's input follows the pattern integer/integer/integer (like month/day/year), but I don't know exactly how to use the match function to specify that the pattern contains a number, then "/", then another number and "/", and so on.
Check out https://regex101.com/, a neat website for testing your regex. In Python, regular expressions are provided by the re library: https://docs.python.org/3/library/re.html
In your case, the pattern would be [0-9]{1,2}\/[0-9]{1,2}\/[0-9]{2,4} (the backslashes before the slashes are optional in Python).
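A minimal sketch of how that pattern might be applied, assuming the whole input must match (hence re.fullmatch rather than re.match). The function names are hypothetical. Note that the regex only checks the shape of the input, so something like 99/99/2020 passes; the second helper uses datetime.strptime to check that the date actually exists, and it assumes a four-digit year.

```python
import re
from datetime import datetime

# Same pattern as above, without the optional backslashes.
DATE_PATTERN = re.compile(r'[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}')

def looks_like_date(text):
    """Return True if text has the shape number/number/number."""
    return DATE_PATTERN.fullmatch(text) is not None

def is_real_date(text):
    """Return True if text is a valid month/day/year date (4-digit year)."""
    try:
        datetime.strptime(text, '%m/%d/%Y')
    except ValueError:
        return False
    return True

print(looks_like_date('12/31/2020'))
print(looks_like_date('99/99/2020'))
print(is_real_date('99/99/2020'))
```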

URL Regex for Python [duplicate]

I am trying to compare webpage URLs using regex. I am using the method below.
regex_url = r'https://www.website.com/books/\w{8}$'
is_read = re.match(regex_url, request.url) is not None
if not is_read:
    add_to_read(token)
Everything works well with the above regex. But there is now a new URL pattern for which I can't seem to get the regex right.
The new URL pattern is
https://www.website.com/books/Ab7us83xI?varient=web
9 characters followed by a question mark and then the word 'varient' and then '=web'. Can anyone help me get the correct regex for this?
Only the first 9 characters change every time. Apologies if this is a stupid question.
Many thanks.
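Is this what you need?
https://www\.website\.com/books/\w{9}\?varient=web$
\. - match a literal dot (an unescaped . matches any character)
\w{9} - match 9 word characters (letters, digits, or underscore)
\? - match a literal question mark
varient=web - match varient=web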
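A quick way to check a pattern like this is to run it against the sample URL from the question. The sketch below uses a hypothetical helper name, is_new_style, and escapes the dots in the host name, which is an addition to the pattern as originally asked about.

```python
import re

# Dots escaped so they only match a literal '.', and '?' escaped so it is
# not treated as a quantifier.
NEW_URL = re.compile(r'https://www\.website\.com/books/\w{9}\?varient=web$')

def is_new_style(url):
    """Return True if url has the 9-character id plus ?varient=web form."""
    return NEW_URL.match(url) is not None

print(is_new_style('https://www.website.com/books/Ab7us83xI?varient=web'))
print(is_new_style('https://www.website.com/books/Ab7us83x?varient=web'))
```

If both the old and the new URL shapes should count as "read", the two patterns can be tried one after the other, or combined with an optional group.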

Python regex with multiple matches in the same string [duplicate]

import re
test = '<tag>part1</tag><tag can have random stuff here>part2</tag>'
print(re.findall("<tag.*>(.*)</tag>", test))
It outputs:
['part2']
The text can have any amount of "parts". I want to return all of them, not only the last one. What's the best way to do it?
You could change each .* to .*? so that it is non-greedy. That will make your original example work:
import re
test = '<tag>part1</tag><tag can have random stuff here>part2</tag>'
print(re.findall(r'<tag.*?>(.*?)</tag>', test))
Output:
['part1', 'part2']
Though it would probably be best to not try to parse this with just regex, but instead use a proper HTML parser library.
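For illustration, here is one way the parser route could look using only the standard library's html.parser. The class name TagTextCollector is made up for this sketch, and it assumes the elements of interest are literally named tag, as in the question; a third-party parser may be more convenient for real HTML.

```python
from html.parser import HTMLParser

class TagTextCollector(HTMLParser):
    """Collect the text found inside each <tag>...</tag> element."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == 'tag':
            self._inside = True
            self.parts.append('')  # start a new part for this element

    def handle_endtag(self, tag):
        if tag == 'tag':
            self._inside = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current part.
        if self._inside:
            self.parts[-1] += data

test = '<tag>part1</tag><tag can have random stuff here>part2</tag>'
collector = TagTextCollector()
collector.feed(test)
collector.close()
parts = collector.parts
print(parts)
```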

Find all strings that start with "Sanskar:214" and end with "<SP>" using re in Python [duplicate]

I am trying to do something like
term3_pattern = re.compile(r'(Sanskar:214) * <SP>')
and then check it like this:
term3_pattern.match(i)
Can someone help me with the regex pattern?
You can try this:
import re
re.findall(r'Sanskar:214.*<SP>', s)  # s is your string
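One caveat worth sketching: with a greedy .*, a string that contains several Sanskar:214 ... <SP> runs comes back as a single match from the first start to the last end. Assuming each run should be returned separately, a non-greedy .*? does that. The sample string below is invented for the demonstration.

```python
import re

# Hypothetical input with two separate runs.
s = 'Sanskar:214 first <SP> noise Sanskar:214 second <SP>'

# Greedy: one match spanning from the first start to the last <SP>.
greedy = re.findall(r'Sanskar:214.*<SP>', s)

# Non-greedy: each run ends at the nearest following <SP>.
lazy = re.findall(r'Sanskar:214.*?<SP>', s)

print(greedy)
print(lazy)
```

By default . does not match a newline, so runs that span several lines would also need the re.DOTALL flag.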

Python regex parsing on variable [duplicate]

I'm not sure why this is not matching. It appears something is wrong with the regex, since it doesn't match even though I tested it in an online regex tester:
import re
current_name = "bob[0]"
regex_match = re.compile('%s'%current_name)
if re.match(regex_match, current_name):
    print("matched")
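import re
current_name = "bob[0]"
# Escape the opening bracket so it is matched literally
regex_match = re.compile('%s'%current_name.replace('[', r'\['))
if re.match(regex_match, current_name):
    print("matched")
The opening square bracket was causing the issue: unescaped, [0] is a character class that matches the single character 0, so the pattern matches "bob0" rather than "bob[0]". With the bracket escaped, this will print "matched".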
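A more general option, sketched here under the assumption that the whole variable should be treated as literal text, is re.escape, which escapes every regex metacharacter rather than just the opening bracket.

```python
import re

current_name = "bob[0]"

# Unescaped: [0] is a character class, so this pattern matches "bob0".
unescaped = re.compile(current_name)

# re.escape backslash-escapes all special characters in the string.
escaped = re.compile(re.escape(current_name))

print(unescaped.match(current_name))    # no match for the literal name
print(unescaped.match("bob0"))          # matches the unintended string
print(escaped.match(current_name))      # matches the literal name
```

This avoids having to remember which characters are special when the name comes from user input or another variable.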
