BeautifulSoup, simple regex issue

I just hit a snag with regex and have no idea why this isn't working.
Here is what the BeautifulSoup documentation says:
soup.find_all(class_=re.compile("itl"))
# [<p class="title"><b>The Dormouse's story</b></p>]
Here is my html:
Aouate</span><span class="pos_text pos3_l_4">
and I'm trying to match the span tag (last position).
>>> if soup.find(class_=re.compile("pos_text pos3_l_\d{1}")):
...     print("Yes")
# prints nothing - indicating there is no such pattern in the html
So, I'm just repeating the BS4 docs, except my regex is not working. Sure enough, if I replace the \d{1} with 4 (as originally in the html), it succeeds.

Try "\\d" in your regex. It's probably interpreting "\d" as trying to escape 'd'.
Alternatively, a raw string ought to work. Just put an 'r' in front of the regex, like this:
re.compile(r"pos_text pos3_l_\d{1}")
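To see what a raw string actually changes, here is a minimal sketch using only the re module. It uses \b rather than \d because \b is a real string escape (backspace), so the difference is visible; the sample text is made up for illustration. For \d, a doubled backslash and a raw string produce the same pattern text, so if the match still fails the cause may lie elsewhere, for example in how the class attribute is matched.

```python
import re

# In a normal string "\b" is a backspace character; in a raw string it stays
# as backslash + "b", which the regex engine reads as a word boundary.
plain = "\bcat\b"
raw = r"\bcat\b"

print(len(plain), len(raw))                  # 5 7
print(re.search(plain, "a cat sat"))         # None
print(re.search(raw, "a cat sat").group(0))  # cat

# Doubling the backslash and using a raw string give the same pattern text.
print("\\d" == r"\d")                        # True
```

Using raw strings for every pattern is a cheap habit that avoids having to remember which escapes Python itself interprets.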

I'm not entirely sure, but this worked for me:
soup.find(attrs={'class':re.compile('pos_text pos3_l_\d{1}')})

You are matching not a single class but a specific combination of classes in a specific order.
From the documentation:
You can also search for the exact string value of the class attribute:
css_soup.find_all("p", class_="body strikeout")
# [<p class="body strikeout"></p>]
But searching for variants of the string value won’t work:
css_soup.find_all("p", class_="strikeout body")
# []
So you should probably first match on pos_text and then, within those results, use a regexp to match the remaining class.
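A minimal sketch of that two-step idea, using only the re module so it runs without Beautiful Soup. The helper name has_classes is hypothetical. It treats the class attribute as a list of separate tokens, which is also how Beautiful Soup exposes it (tag["class"] is a list), so the same check could be applied to each tag returned by a search for pos_text.

```python
import re

def has_classes(class_attr, required, pattern):
    """Return True if the class attribute contains the `required` class and
    at least one class token that fully matches `pattern`."""
    tokens = class_attr.split()
    return required in tokens and any(pattern.fullmatch(t) for t in tokens)

pos_class = re.compile(r"pos3_l_\d")

print(has_classes("pos_text pos3_l_4", "pos_text", pos_class))  # True
print(has_classes("pos3_l_4 pos_text", "pos_text", pos_class))  # True, order does not matter
print(has_classes("pos_text other", "pos_text", pos_class))     # False
```

Checking each class token separately avoids depending on the order in which the classes happen to be written in the HTML.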

Related

Regex search For HTML Tag with UUID

I'm trying to match a single HTML tag with an id attribute which is a UUID. I tested it with an external resource to make sure the regex is correct with the same input string. The UUID is extracted dynamically so the string replacement is necessary.
The output I would expect is for the last line to print:
<tr class="ref_row" id="b9060ff1-015d-4089-a193-8fef57e7c2ef">
This is the code I tried:
content = '<tbody><tr class="ref_row" id="b9060ff1-015d-4089-a193-8fef57e7c2ef"><td><b>01/08/2016 14:41:00</b></td>'
ref = 'b9060ff1-015d-4089-a193-8fef57e7c2ef'
regex = '<[^>]+?id=\"%s\"[^<]*?>' % ref
element_to_link = re.search(regex, content)
print(element_to_link.string)
The output I get when printing is the whole input string, which would suggest the regex is incorrect. What's going on here?
Please don't suggest that I use Beautiful Soup, this should be possible with regular expressions.

Why not use the group method? This works for me:
element_to_link.group(0)

According to the Python re module documentation, the MatchObject.string property returns "The string passed to match() or search()." To get the matched text, use one of the methods of MatchObject such as group(), groups() or groupdict().
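The difference can be checked with the data from the question. This is a sketch rather than the asker's exact code: it adds re.escape around the UUID as a precaution, which is an addition not present in the original.

```python
import re

content = ('<tbody><tr class="ref_row" id="b9060ff1-015d-4089-a193-8fef57e7c2ef">'
           '<td><b>01/08/2016 14:41:00</b></td>')
ref = 'b9060ff1-015d-4089-a193-8fef57e7c2ef'

# re.escape keeps any special characters in the id from being read as regex syntax.
regex = r'<[^>]+?id="%s"[^<]*?>' % re.escape(ref)
match = re.search(regex, content)

print(match.string == content)  # True: .string is the whole input
print(match.group(0))           # <tr class="ref_row" id="b9060ff1-015d-4089-a193-8fef57e7c2ef">
```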

How to return everything in a string that is not matched by a regex?

I have a string and a regular expression that matches portions of the string. I want to return a string representing what's left of the original string after all matches have been removed.
import re
string = '<font size="2px" face="Tahoma"><br>Good Morning, </font><div><br></div><div>As per last email'
pattern = r'<[a-zA-Z0-9 ="/\-:;.]*>'
re.findall(pattern, string)
['<font size="2px" face="Tahoma">',
'<br>',
'</font>',
'<div>',
'<br>',
'</div>',
'<div>']
desired_string = "Good Morning, As per last email"

Instead of re.findall, use re.sub to replace each match with an empty string.
re.sub(pattern, "", string)
While that's the literal answer to your general question about removing patterns from a string, it appears that your specific problem is related to manipulating HTML. It's generally a bad idea to try to manipulate HTML with regular expressions. For more information see this answer to a similar question: https://stackoverflow.com/a/1732454/7432
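For completeness, here is a runnable sketch of the substitution on the sample input. It uses re.subn, a variant of re.sub that also reports how many replacements were made; the variable name html_fragment is chosen here for illustration.

```python
import re

html_fragment = ('<font size="2px" face="Tahoma"><br>Good Morning, </font>'
                 '<div><br></div><div>As per last email')
pattern = r'<[a-zA-Z0-9 ="/\-:;.]*>'

# subn returns the cleaned string and the number of matches removed.
cleaned, count = re.subn(pattern, "", html_fragment)

print(cleaned)  # Good Morning, As per last email
print(count)    # 7, the same seven tags that re.findall listed
```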

Instead of a regular expression, use an HTML parser like BeautifulSoup. It looks like you are trying to strip the HTML elements and get the underlying text.
from bs4 import BeautifulSoup
string="""<font size="2px" face="Tahoma"><br>Good Morning, </font><div><br></div><div>As per last email"""
soup = BeautifulSoup(string, 'lxml')
print(soup.get_text())
This outputs:
Good Morning, As per last email
One thing to notice is that the &nbsp; was changed to a regular space using this method.
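If installing BeautifulSoup is not an option, a similar result can be sketched with the standard library's html.parser. This is a swapped-in alternative, not what the answer above uses. The class name TextExtractor and the helper strip_tags are hypothetical, and the &nbsp; in the sample input is an assumption added to show how entities are handled.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text between tags."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True by default, so entities are decoded
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)

html = ('<font size="2px" face="Tahoma"><br>Good Morning,&nbsp;</font>'
        '<div><br></div><div>As per last email')
text = strip_tags(html)

print(repr(text))  # 'Good Morning,\xa0As per last email'
```

With this parser the entity is decoded to the non-breaking space character U+00A0, which prints like a space but does not compare equal to one, so it may need a text.replace("\xa0", " ") afterwards.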

Alternatives to Python's re.search

I am using re.search to check if a string of text is found in an HTML page. Sometimes it does not find the string although it is definitely there. For example, I would like to find: <div class="dlInfo-Speed"> Does anyone know how to create a regex to find that string?
Does anyone know of any good alternatives to re.search?
Thanks

If you just want to determine if a substring is present, you can use in for that.
if some_substring in some_string:
    do_something_exciting()

As for a regex, this is the best I got right now:
if re.search(r'<[dD][iI][vV]\s+.*?class="dlInfo-Speed".*?>(.*?)</[dD][iI][vV]>',
             html_doc,
             re.DOTALL):
    print("found")
else:
    print("not found")
http://regexr.com?37iqr
I found that regexes are usually not the best solution for 99% of problems like this.
My alternative is BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
Here's how to solve it with bs4:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)
tag = soup.find("div", class_="dlInfo-Speed")
print(tag.string)  # one way to get the contents

As noted, it is possible that the string is not found because other HTML is mixed in with it. It's also possible that it's formatted in such a way that there are newlines in between the tag attributes, like:
some text goes here <div
class="dlInfo-Speed"> More text
or even
some text goes here <div class="dlInfo-Speed"
> More text
You can write a regex that will account for whitespace (including newlines and tabs) in all the places it may occur:
re.search(r'<div \s+ class="dlInfo-Speed" \s* >', text, re.VERBOSE)
But overall I strongly agree with the comment that for anything more than very simple, well-defined searches, it is usually best to parse the HTML properly and walk the document tree to find what you're looking for.
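As a quick check of the whitespace-tolerant pattern, the sketch below runs it against the three layouts shown above. The name DIV_RE and the sample strings are chosen here for illustration. In re.VERBOSE mode, whitespace inside the pattern is ignored and # starts a comment, so the pattern can be spread over several lines.

```python
import re

DIV_RE = re.compile(r'''
    <div \s+                # tag name followed by at least one whitespace character
    class="dlInfo-Speed"    # the attribute exactly as written
    \s* >                   # optional whitespace before the closing bracket
''', re.VERBOSE)

samples = [
    'some text goes here <div class="dlInfo-Speed"> More text',
    'some text goes here <div\nclass="dlInfo-Speed"> More text',
    'some text goes here <div class="dlInfo-Speed"\n> More text',
]

for sample in samples:
    print(bool(DIV_RE.search(sample)))  # True for all three layouts

print(bool(DIV_RE.search('<div class="other">')))  # False
```

Because \s matches newlines and tabs as well as spaces, no extra flag such as re.DOTALL is needed for this pattern.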

There is a chance that the string that fails to be found is mixed with some html tags:
<div>string you are <span class="x">looking</span> for</div>
Maybe you should try removing html tags (unless they contain the string you search for) so the text is easier to search through. A simple way to do it using regex:
text = re.sub('<[^<]+?>', '', html_page)
if some_substring in text:
    do_something(text)
As for re.search alternatives, you can use string index method.
try:
    index = html_data.index(some_substring)
    do_something(html_data)
except ValueError:
    # string not found
    pass
or even find method:
if html_data.find(some_substring) >= 0:
    do_something(html_data)
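Putting those pieces together, here is a small sketch with the example markup from this answer. The variable names html_page and needle are chosen for illustration.

```python
import re

html_page = '<div>string you are <span class="x">looking</span> for</div>'
needle = "string you are looking for"

# The raw HTML does not contain the phrase because a <span> interrupts it.
print(needle in html_page)  # False

# Strip the tags, then search the remaining text.
text = re.sub(r'<[^<]+?>', '', html_page)
print(text)                  # string you are looking for
print(needle in text)        # True
print(text.find(needle))     # 0, the index where the match starts
print(text.find("missing"))  # -1 when the substring is absent

try:
    text.index("missing")    # index() raises instead of returning -1
except ValueError:
    print("not found")
```

The in operator is usually the clearest choice when only a yes/no answer is needed; find and index are useful when the position matters.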

python regex to find any link that contains the text 'abc123'

I am using Beautiful Soup to find all <a> tags by their href attribute.
links = myhtml.findAll('a', href=re.compile('????'))
I need to find all links that have 'abc123' in the href text.
I need help with the regex; see '????' in my code snippet.

If 'abc123' is literally what you want to search for, anywhere in the href, then re.compile('abc123') as suggested by other answers is correct. If the actual string you want to match contains punctuation, e.g. 'abc123.com', then use instead
re.compile(re.escape('abc123.com'))
The re.escape part will "escape" any punctuation so that it's taken literally, just like alphanumerics are; without it, some punctuation gets interpreted in various ways by RE's engine, for example the dot ('.') in the above example would be taken as "any single character whatsoever", so re.compile('abc123.com') would match, e.g. 'abc123zcom' (and many other strings of a similar nature).
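A short sketch of the difference, using only the re module. The list of hrefs is made up for illustration.

```python
import re

unescaped = re.compile('abc123.com')           # '.' matches any character
escaped = re.compile(re.escape('abc123.com'))  # '.' matches only a literal dot

hrefs = [
    'http://abc123.com/page',
    'http://abc123zcom/page',
    'http://example.org/',
]

print([h for h in hrefs if unescaped.search(h)])
# ['http://abc123.com/page', 'http://abc123zcom/page']
print([h for h in hrefs if escaped.search(h)])
# ['http://abc123.com/page']
```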

"abc123" should give you what you want
If that doesn't work, then BS is probably using re.match, in which case you would want ".*abc123.*"

If you want all the links whose href contains 'abc123', you can simply put:
links = myhtml.findAll('a', href=re.compile('abc123'))

Python and "re"

A tutorial I have on regex in Python explains how to use the re module. I wanted to grab the URL out of an A tag, so, knowing regex, I wrote the correct expression, tested it in my regex testing app of choice, and ensured it worked. When placed into Python it failed:
result = re.match("a_regex_of_pure_awesomeness", "a string containing the awesomeness")
# result is None
After much head scratching I found the issue: re.match automatically expects your pattern to be at the start of the string. I have found a fix but I would like to know how to change:
regex = ".*(a_regex_of_pure_awesomeness)"
into
regex = "a_regex_of_pure_awesomeness"
Okay, it's a standard URL regex but I wanted to avoid any potential confusion about what I wanted to get rid of and possibly pretend to be funny.

In Python, there's a distinction between "match" and "search"; match only looks for the pattern at the start of the string, and search looks for the pattern starting at any location within the string.
Python regex docs
Matching vs searching

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(your_html)
for a in soup.findAll('a', href=True):
    # do something with `a` w/ href attribute
    print(a['href'])

>>> import re
>>> pattern = re.compile("url")
>>> string = " url"
>>> pattern.match(string)
>>> pattern.search(string)
<_sre.SRE_Match object at 0xb7f7a6e8>

Are you using the re.match() or re.search() method? My understanding is that re.match() assumes a "^" at the beginning of your expression and will only search at the beginning of the text, while re.search() acts more like the Perl regular expressions and will only match the beginning of the text if you include a "^" at the beginning of your expression. Hope that helps.
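To tie this back to the original goal of pulling a URL out of an A tag, here is a minimal sketch. The sample HTML and the simple href pattern are assumptions for illustration, not the asker's actual regex.

```python
import re

html = 'Visit <a href="http://example.com/page">the page</a> today'
href_re = re.compile(r'href="([^"]+)"')

# match() only tries at position 0, where the text starts with "Visit".
print(href_re.match(html))            # None

# search() scans forward, so no leading ".*" is needed.
print(href_re.search(html).group(1))  # http://example.com/page

# The ".*" workaround from the question gives the same result with match().
workaround = re.match(r'.*' + href_re.pattern, html)
print(workaround.group(1))            # http://example.com/page
```

Switching from re.match to re.search lets the pattern stay in its original form, which is the change the question asks about.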
