Newbie with Beautiful Soup would appreciate any pointers.
I'm working with a page which has a lot of:
<p data-v-04dd08f2> .. </p>
elements. Inside the p is a string value, which I need and an embedded span.
Question might be very simple... I am trying to use find_all to 'get' a list of all those elements which I would subsequently parse out to get the tokens I need from inside.
Can someone put me out of my misery and tell me how the find_all should be structured to get these?
I've tried:
find_all('p', {'data': 'v-04dd08f2'})  # nope
find_all('p', {'attributes': 'v-04dd08f2'})  # nope
and lots of other combinations all to no avail.
Thanks!
If you are willing to use CSS selectors instead (which I personally prefer to BeautifulSoup's find_* methods), and the paragraph tags are in fact exactly what you indicated, i.e. "data-v-04dd08f2" is an attribute of the tag, then the following should do the trick:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<p data-v-04dd08f2> .. </p>', 'html.parser')
p_tags = soup.select('p[data-v-04dd08f2]')
print(p_tags)
#[<p data-v-04dd08f2=""> .. </p>]
bs4 uses SoupSieve to implement CSS selectors. The SoupSieve docs for selecting based on attributes are here. Note that, based on your attempts, I suspect you might actually be looking for p tags that have a data attribute equal to 'v-04dd08f2'. If that's the case, the select string should be soup.select('p[data=v-04dd08f2]').
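For illustration, here's a minimal sketch (with made-up sample markup) showing the difference between the two selectors:

from bs4 import BeautifulSoup

html = '<p data-v-04dd08f2>first</p><p data="v-04dd08f2">second</p>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.select('p[data-v-04dd08f2]'))   # matches the first p (attribute *named* data-v-04dd08f2)
print(soup.select('p[data=v-04dd08f2]'))   # matches the second p (data attribute *valued* v-04dd08f2)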
This will return all elements that have an attribute name starting with "data-v-":
match_pattern = 'data-v-'
m = soup.findAll(lambda tag: any(attr.startswith(match_pattern) for attr in tag.attrs.keys()))
element.attrs is a key-value structure, {attribute_name: attribute_value}
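As a self-contained sketch (the sample markup is made up for illustration):

from bs4 import BeautifulSoup

html = '<div><p data-v-04dd08f2>one <span>x</span></p><p data-v-99>two</p><p>no match</p></div>'
soup = BeautifulSoup(html, 'html.parser')

match_pattern = 'data-v-'
# keep any tag that has at least one attribute whose name starts with "data-v-"
m = soup.find_all(lambda tag: any(attr.startswith(match_pattern) for attr in tag.attrs))
print(m)   # the first two <p> tags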
Related
Suppose I want to parse some HTML using BeautifulSoup, and I want to use CSS selectors to find specific tags. I would "soupify" it by doing:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
If I wanted to find a tag whose "id" attribute has a value of "abc" I can do
soup.select('#abc')
If I wanted to find all "a" child tags under our current tag, we could do
soup.select('#abc a')
But now, suppose I want to find all "a" tags whose 'href' attributes have values that end in "xyz". I would want to use regex for that, and I was hoping for something along the lines of
soup.select('#abc a[href] = re.compile(r"xyz$")')
I cannot seem to find anything that says BeautifulSoup's .select() method supports regex.
The soup.select() function only supports CSS syntax; regular expressions are not part of that.
You can use such syntax to match attributes ending with text:
soup.select('#abc a[href$="xyz"]')
See the CSS attribute selectors documentation over on MDN.
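For example (the sample markup below is made up for illustration):

from bs4 import BeautifulSoup

html = '<div id="abc"><a href="/page-xyz">match</a><a href="/other">no match</a></div>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.select('#abc a[href$="xyz"]'))   # only the first link, whose href ends in "xyz"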
You can always use the results of a CSS selector to continue the search:
import re

for element in soup.select('#abc'):
    child_elements = element.find_all(href=re.compile(r'^http://example.com/\d+.html'))
Note that, as the element.select() documentation states:
This is a convenience for users who know the CSS selector syntax. You can do all this stuff with the Beautiful Soup API. And if CSS selectors are all you need, you might as well use lxml directly: it’s a lot faster, and it supports more CSS selectors. But this lets you combine simple CSS selectors with the Beautiful Soup API.
Emphasis mine.
I am also open to solutions other than regex.
Would checking for angle brackets be enough?
Any suggestions? Thanks!
Edit: what I need is NOT to parse the HTML tags but just to check whether the string contains any tags at all.
You can use the BeautifulSoup parser and check whether there are any tags by iterating over the BeautifulSoup object and checking whether at least one element is a Tag:
from bs4 import BeautifulSoup, Tag
l = ['test', 'test <br>', '<br>']
for item in l:
    soup = BeautifulSoup(item, 'html.parser')
    print(item, any(isinstance(element, Tag) for element in soup))
prints:
test False
test <br> True
<br> True
Hope that helps.
I highly recommend lxml.html for anything related to parsing (XML, HTML, XHTML, ...).
To get the whole idea, just have a quick look at these graphs and you will know what I am talking about ;)
For a more detailed comparison, have a look here.
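If you go that route, here is a rough sketch of the same tag check done with lxml.html (assuming lxml is installed; fragments_fromstring returns any leading text as a plain string and markup as Element objects):

import lxml.html

def has_tags(text):
    # if anything other than plain text comes back, the string contains markup
    fragments = lxml.html.fragments_fromstring(text)
    return any(not isinstance(f, str) for f in fragments)

for item in ['test', 'test <br>', '<br>']:
    print(item, has_tags(item))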
I'm doing some HTML cleaning with BeautifulSoup. Noob to both Python & BeautifulSoup. I've got tags being removed correctly as follows, based on an answer I found elsewhere on Stackoverflow:
[s.extract() for s in soup('script')]
But how to remove inline styles? For instance the following:
<p class="author" id="author_id" name="author_name" style="color:red;">Text</p>
<img class="some_image" href="somewhere.com">
Should become:
<p>Text</p>
<img href="somewhere.com">
How to delete the inline class, id, name & style attributes of all elements?
Answers to other similar questions I could find all mentioned using a CSS parser to handle this, rather than BeautifulSoup, but as the task is simply to remove rather than manipulate the attributes, and is a blanket rule for all tags, I was hoping to find a way to do it all within BeautifulSoup.
You don't need to parse any CSS if you just want to remove it all. BeautifulSoup provides a way to remove entire attributes like so:
for tag in soup():
    for attribute in ["class", "id", "name", "style"]:
        del tag[attribute]
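Applied to the markup from the question, a complete runnable sketch would look roughly like this:

from bs4 import BeautifulSoup

html = '<p class="author" id="author_id" name="author_name" style="color:red;">Text</p>'
soup = BeautifulSoup(html, 'html.parser')

for tag in soup():
    for attribute in ["class", "id", "name", "style"]:
        del tag[attribute]

print(soup)   # <p>Text</p>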
Also, if you just want to delete entire tags (and their contents), you don't need extract(), which returns the tag. You just need decompose():
[tag.decompose() for tag in soup("script")]
Not a big difference, but just something else I found while looking at the docs. You can find more details about the API in the BeautifulSoup documentation, with many examples.
I wouldn't do this in BeautifulSoup - you'll spend a lot of time trying, testing, and working around edge cases.
Bleach does exactly this for you. http://pypi.python.org/pypi/bleach
If you were to do this in BeautifulSoup, I'd suggest you go with the "whitelist" approach, like Bleach does. Decide which tags may have which attributes, and strip every tag/attribute that doesn't match.
Here's my solution for Python3 and BeautifulSoup4:
def remove_attrs(soup, whitelist=tuple()):
    for tag in soup.findAll(True):
        for attr in [attr for attr in tag.attrs if attr not in whitelist]:
            del tag[attr]
    return soup
It supports a whitelist of attributes which should be kept. :) If no whitelist is supplied all the attributes get removed.
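For example, keeping only href on the markup from the question:

from bs4 import BeautifulSoup

html = ('<p class="author" id="author_id" name="author_name" style="color:red;">Text</p>'
        '<img class="some_image" href="somewhere.com">')
soup = BeautifulSoup(html, 'html.parser')

print(remove_attrs(soup, whitelist=('href',)))
# roughly: <p>Text</p><img href="somewhere.com"/>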
What about lxml's Cleaner?
from lxml.html.clean import Cleaner
content_without_styles = Cleaner(style=True).clean_html(content)
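For instance (note that on recent lxml releases the clean module may require the separate lxml_html_clean package):

from lxml.html.clean import Cleaner

content = '<p class="author" id="author_id" name="author_name" style="color:red;">Text</p>'
print(Cleaner(style=True).clean_html(content))
# the style attribute is dropped; Cleaner may also wrap the fragment in a containing element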
Based on jmk's function, I use this function to remove attributes based on a whitelist:
Works in Python 2, BeautifulSoup 3
def clean(tag, whitelist=[]):
    tag.attrs = None
    for e in tag.findAll(True):
        # iterate over a copy, since del mutates e.attrs while we loop
        for attribute in list(e.attrs):
            if attribute[0] not in whitelist:
                del e[attribute[0]]
        #e.attrs = None  # delete all attributes
    return tag

#example to keep only title and href
clean(soup, ["title", "href"])
Not perfect but short:
' '.join([el.text for tag in soup for el in tag.findAllNext(whitelist)])
I achieved this using the re module and a regex.
import re

def removeStyle(html):
    style = re.compile(r' style=.*?".*?"')
    html = re.sub(style, '', html)
    return html
html = '<p class="author" id="author_id" name="author_name" style="color:red;">Text</p>'
removeStyle(html)
Output: <p class="author" id="author_id" name="author_name">Text</p>
You can use this to strip any inline attribute by replacing "style" in the regex with the attribute's name.
Can someone direct me as how to pull the value of a tag using BeautifulSoup? I read the documentation but had a hard time navigating through it. For example, if I had:
<span title="Funstuff" class="thisClass">Fun Text</span>
How would I just pull "Funstuff" using BeautifulSoup/Python?
Edit: I am using version 3.2.1
You need to have something to identify the element you're looking for, and it's hard to tell what it is in this question.
For example, both of these will print out 'Funstuff' in BeautifulSoup 3. One looks for a span element and gets the title, another looks for spans with the given class. Many other valid ways to get to this point are possible.
import BeautifulSoup
soup = BeautifulSoup.BeautifulSoup('<html><body><span title="Funstuff" class="thisClass">Fun Text</span></body></html>')
print soup.html.body.span['title']
print soup.find('span', {"class": "thisClass"})['title']
A tag's children are available via .contents
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#contents-and-children
In your case you can find the tag by using its CSS class and extract the contents:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<span title="Funstuff" class="thisClass">Fun Text</span>', 'html.parser')
soup.select('.thisClass')[0].contents[0]
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors has all the necessary details
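Note that the question asks for the value of the title attribute rather than the tag's text; with the same selector you can index into the tag's attributes:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<span title="Funstuff" class="thisClass">Fun Text</span>', 'html.parser')
print(soup.select('.thisClass')[0]['title'])   # Funstuff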
Given an HTML link like
<a href="some_url">text</a>
how can I isolate the url and the text?
Updates
I'm using Beautiful Soup, and am unable to figure out how to do that.
I did
soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url))
links = soup.findAll('a')
for link in links:
print "link content:", link.content," and attr:",link.attrs
i get
*link content: None and attr: [(u'href', u'_redirectGeneric.asp?genericURL=/root /support.asp')]* ...
...
Why am i missing the content?
edit: elaborated on 'stuck' as advised :)
Use Beautiful Soup. Doing it yourself is harder than it looks; you'll be better off using a tried and tested module.
EDIT:
I think you want:
soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url).read())
By the way, it's a bad idea to open the URL inline like that with no error handling; if the request fails it could get ugly.
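A rough sketch of what I mean, in the same Python 2 / urllib style:

import urllib
from BeautifulSoup import BeautifulSoup

url = "http://www.example.com/index.html"
try:
    source = urllib.urlopen(url).read()
except IOError, e:
    print "Could not fetch", url, "-", e
else:
    soup = BeautifulSoup(source)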
EDIT 2:
This should show you all the links in a page:
import urlparse, urllib
from BeautifulSoup import BeautifulSoup
url = "http://www.example.com/index.html"
source = urllib.urlopen(url).read()
soup = BeautifulSoup(source)
for item in soup.findAll('a'):
    try:
        link = urlparse.urlparse(item['href'].lower())
    except KeyError:
        # No href attribute, so not a valid link
        pass
    else:
        print link
Here's a code example, showing getting the attributes and contents of the links:
soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url))
for link in soup.findAll('a'):
    print link.attrs, link.contents
Looks like you have two issues there:
link.contents, not link.content
attrs holds the attribute name/value pairs for an HTML element (a list of (name, value) tuples in BeautifulSoup 3, a dict in bs4), not a string. link['href'] will get you what you appear to be looking for, but you'd want to wrap that in a check in case you come across an a tag without an href attribute.
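For example, a rough sketch in the BeautifulSoup 3 style used in the question, with Tag.get() falling back to None when the attribute is missing:

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup('<a href="/support.asp">text</a><a name="anchor-only">no href</a>')
for link in soup.findAll('a'):
    href = link.get('href')       # None if the <a> has no href attribute
    if href is not None:
        print href, link.contents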
Though I suppose the others might be correct in pointing you to using Beautiful Soup, they might not, and using an external library might be massively over-the-top for your purposes. Here's a regex which will do what you ask.
/<a\s+[^>]*?href="([^"]*)".*?>(.*?)<\/a>/
Here's what it matches:
'<a href="url">text</a>'
// Parts: "url", "text"
'<a href="url">text<span>something</span></a>'
// Parts: "url", "text<span>something</span>"
If you wanted to get just the text (e.g. "textsomething" in the second example above), I'd just run another regex over it to strip anything between angle brackets.
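That follow-up step could look something like this (Python, using the captured text from the second example):

import re

inner = 'text<span>something</span>'          # captured link text from the second example
print(re.sub(r'<[^>]*>', '', inner))          # textsomething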