I have been playing with BeautifulSoup, which is great. My end goal is to get just the text from a page: the text from the body, with a special case of also getting the title and/or alt attributes from <a> or <img> tags.
So far I have this (edited and updated) current code:
from BeautifulSoup import BeautifulSoup, Comment

soup = BeautifulSoup(page)
# strip out HTML comments first
comments = soup.findAll(text=lambda text: isinstance(text, Comment))
[comment.extract() for comment in comments]
# then join the remaining text nodes and collapse whitespace
page = ''.join(soup.findAll(text=True))
page = ' '.join(page.split())
print page
1) What do you suggest is the best way, for my special case, to NOT exclude those attributes from the two tags I listed above? If it is too complex to do this, it isn't as important as #2.
2) I would like to strip <!-- --> comment tags and everything in between them. How would I go about that?
QUESTION EDIT @jathanism: Here are some comment tags that I have tried to strip, but which remain even when I use your example:
<!-- Begin function popUp(URL) { day = new Date(); id = day.getTime(); eval("page" + id + " = window.open(URL, '" + id + "', 'toolbar=0,scrollbars=0,location=0,statusbar=0,menubar=0,resizable=0,width=300,height=330,left = 774,top = 518');"); } // End -->
<!-- var MenuBar1 = new Spry.Widget.MenuBar("MenuBar1", {imgDown:"SpryAssets/SpryMenuBarDownHover.gif", imgRight:"SpryAssets/SpryMenuBarRightHover.gif"}); //--> <!-- var MenuBar1 = new Spry.Widget.MenuBar("MenuBar1", {imgDown:"SpryAssets/SpryMenuBarDownHover.gif", imgRight:"SpryAssets/SpryMenuBarRightHover.gif"}); //--> <!-- var whichlink=0 var whichimage=0 var blenddelay=(ie)? document.images.slide.filters[0].duration*1000 : 0 function slideit(){ if (!document.images) return if (ie) document.images.slide.filters[0].apply() document.images.slide.src=imageholder[whichimage].src if (ie) document.images.slide.filters[0].play() whichlink=whichimage whichimage=(whichimage<slideimages.length-1)? whichimage+1 : 0 setTimeout("slideit()",slidespeed+blenddelay) } slideit() //-->
Straight from the documentation for BeautifulSoup, you can easily strip comments (or anything) using extract():
from BeautifulSoup import BeautifulSoup, Comment
soup = BeautifulSoup("""1<!--The loneliest number-->
<a>2<!--Can be as bad as one--><b>3""")
comments = soup.findAll(text=lambda text:isinstance(text, Comment))
[comment.extract() for comment in comments]
print soup
# 1
# <a>2<b>3</b></a>
I am still trying to figure out why it doesn't find and strip tags like this: <!-- //-->. Those forward slashes cause certain tags to be overlooked.
This may be a problem with the underlying SGML parser: see http://www.crummy.com/software/BeautifulSoup/documentation.html#Sanitizing%20Bad%20Data%20with%20Regexps. You can override it by using a markupMassage regex -- straight from the docs:
import re, copy
from BeautifulSoup import BeautifulSoup

# the malformed input from the docs
badString = "Foo<!-This comment is malformed.-->Bar<br/>Baz"

myMassage = [(re.compile('<!-([^-])'), lambda match: '<!--' + match.group(1))]
myNewMassage = copy.copy(BeautifulSoup.MARKUP_MASSAGE)
myNewMassage.extend(myMassage)
print BeautifulSoup(badString, markupMassage=myNewMassage)
# Foo<!--This comment is malformed.-->Bar<br />Baz
If you are looking for a solution in BeautifulSoup version 3, see the BS3 docs on Comment:
soup = BeautifulSoup("""Hello! <!--I've got to be nice to get what I want.-->""")
comment = soup.find(text=re.compile("if"))
Comment=comment.__class__
for element in soup(text=lambda text: isinstance(text, Comment)):
element.extract()
print soup.prettify()
If mutation isn't your bag, you can build a list of just the non-comment text nodes instead:
[t for t in soup.find_all(text=True) if not isinstance(t, Comment)]
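As a follow-up, here is a small sketch in bs4 syntax (assuming Comment is imported from bs4) that collapses the surviving text nodes back into a single whitespace-normalized string, much like the code in the question does:
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup("""Hello! <!--I've got to be nice to get what I want.-->""", 'html.parser')
texts = [t for t in soup.find_all(text=True) if not isinstance(t, Comment)]
print(' '.join(' '.join(texts).split()))
# Hello!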
Related
I am trying to pull static_token, but I am definitely not on the right track. I pulled all of the JavaScript, and I figured my next step was to read it with JSON or turn it into JSON format, as this is what I was told on my previous question. I also tried to use regex to pull this, with no luck on that end either; as pointed out in the comments, I don't know what I'm doing. Can you point me in the right direction to pull the static_token, with a brief description of what I actually need to do in these instances?
CODE
import requests
import json, re
from bs4 import BeautifulSoup
url =''' '''
response = requests.get(url)
soup1 = BeautifulSoup(response.text,'lxml')
html1 = soup1.find_all('script')[1] #.text
print(html1)
soup2 = BeautifulSoup(b'html1', 'lxml')
var = soup2.get('var static_token')
print(var)
My attempt with using regex -
static_token = re.findall(('static_token":"([^"]+)'), soup.text)
print(static_token)
The source I'm trying to pull the info from:
<script type="text/javascript">var CUSTOMIZE_TEXTFIELD = 1;
var FancyboxI18nClose = 'Close';
var FancyboxI18nNext = 'Next';
var FancyboxI18nPrev = 'Previous';
var contentOnly = false;
var roundMode = 2;
var static_token = '850424cb3feab732e072a3d84e378908';
var taxEnabled = 1;
var titleDelivery = '<h3 class="page-subheading">Your delivery address</h3>';
var titleInvoice = '<h3 class="page-subheading">Your billing address</h3>';
var token = '0c7a8347a263972696d6514dcfa24741';
var txtConditionsIsNotNeeded = 'You do not need to accept the Terms of Service for this order.';
var txtDeliveryAddress = 'Delivery address';
var txtErrors = 'Error(s)';
var txtFree = 'Free';
var txtThereis = 'There is';
var txtWithTax = '(tax incl.)';
var txtWithoutTax = '(tax excl.)';
var usingSecureMode = true;
var vat_management = 0;</script>
You are confusing a lot of data types in this question. This is not the only way to approach this (or the most robust) but it is simple and can point you in the right direction.
You seem to be able to read the html and extract the script tags from it using BeautifulSoup into html1.
You need to look at the documentation to understand the data types being used. I would also recommend adding statements like this in your code to help.
html1 = soup.find_all('script')
print('html1', type(html1), html1) # the type is bs4.element.ResultSet
This variable contains all of the script tags from your document. You can iterate over the tags and read the text field of each one, but it is NOT a JSON-formatted type.
Once you have a string and you want part of that string - regex should be one of your first thoughts. You don't want to use regex on the original html - that's the whole point of the BS4 library and others like it. HTML is often mangled, messy, and not well-suited for simple regular expressions. Use BeautifulSoup to parse the html and find the section you want, THEN use regex to parse the string.
for tag in html1:
    print('tag', type(tag), tag)                 # bs4.element.Tag
    print()
    print('tag.text', type(tag.text), tag.text)  # str
    print()
    toks = re.findall(r"var static_token =\s*'(\w+)'", tag.text)
    print('toks', type(toks), toks)              # list
    print()
    if toks:  # only the script tag that defines static_token will match
        static_token = toks[0]
        print('static_token', type(static_token), static_token)  # str
        print()
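As an alternative (a sketch of my own, not part of the answer above), you can let BeautifulSoup find the script tag whose text mentions static_token and then apply the regex once; the HTML below is a truncated copy of the script from the question:
import re
from bs4 import BeautifulSoup

html = """<script type="text/javascript">var CUSTOMIZE_TEXTFIELD = 1;
var static_token = '850424cb3feab732e072a3d84e378908';
var token = '0c7a8347a263972696d6514dcfa24741';</script>"""

soup = BeautifulSoup(html, 'html.parser')
script = soup.find('script', text=re.compile('static_token'))  # the tag whose text matches
match = re.search(r"var static_token\s*=\s*'(\w+)'", script.text)
print(match.group(1))
# 850424cb3feab732e072a3d84e378908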
Is there a way to remove all HTML tags from a string, but leave some links and change their representation? Example:
description: <p>Animation params. For other animations, see myA.animation and the animation parameter under the API methods. The following properties are supported:</p>
<dl>
<dt>duration</dt>
<dd>The duration of the animation in milliseconds.</dd>
<dt>easing</dt>
<dd>A string reference to an easing function set on the <code>Math</code> object. See demo.</dd>
</dl>
<p>
and I want to replace the myA.animation link with only 'myA.animation', but the demo link with 'demo: http://example.com'
EDIT:
Now it seems to be working:
def cleanComment(comment):
    soup = BeautifulSoup(comment, 'html.parser')
    for m in soup.find_all('a'):
        if str(m) in comment:
            if not m['href'].startswith("#"):
                comment = comment.replace(str(m), m['href'] + " : " + m.__dict__['next_element'])
    soup = BeautifulSoup(comment, 'html.parser')
    comment = soup.get_text()
    return comment
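For reference, a hypothetical usage sketch of this function (the href values are illustrative, since the question shows the rendered link text rather than the raw anchors; it assumes from bs4 import BeautifulSoup is in scope):
description = ('<p>See <a href="#myA.animation">myA.animation</a> and the '
               '<a href="http://example.com">demo</a>.</p>')
print(cleanComment(description))
# See myA.animation and the http://example.com : demo.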
This regex should work for you: (?=href="http)(?=(?=.*?">(.*?)<)(?=.*?"(https?:\/\/.*?)"))|"#(.*?)"
In Python:
import re
text = ''
with open('textfile', 'r') as file:
text = file.read()
matches = re.findall(r'(?=href="http)(?=(?=.*?">(.*?)<)(?=.*?"(https?:\/\/.*?)"))|"#(.*?)"', text)
strings = []
for m in matches:
m = filter(bool, m)
strings.append(': '.join(m))
print(strings)
The result will look like: ['myA.animation', 'demo: http://example.com']
Trying to get my head around html construction with BS.
I'm trying to insert a new tag:
self.new_soup.body.insert(3, """<div id="file_history"></div>""")
when I check the result, I get:
<div id="file_histor"y></div>
So the string I'm inserting is being sanitised into web-safe HTML.
What I expect to see is:
<div id="file_history"></div>
How do I insert a new div tag in position 3 with the id file_history?
See the documentation on how to append a tag:
soup = BeautifulSoup("<b></b>")
original_tag = soup.b
new_tag = soup.new_tag("a", href="http://www.example.com")
original_tag.append(new_tag)
original_tag
# <b></b>
new_tag.string = "Link text."
original_tag
# <b>Link text.</b>
Use the factory method to create new elements:
new_tag = self.new_soup.new_tag('div', id='file_history')
and insert it:
self.new_soup.body.insert(3, new_tag)
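Putting those two lines together, here is a minimal standalone sketch (the three paragraphs stand in for whatever self.new_soup already contains):
from bs4 import BeautifulSoup

new_soup = BeautifulSoup("<html><body><p>a</p><p>b</p><p>c</p></body></html>", "html.parser")
new_tag = new_soup.new_tag("div", id="file_history")
new_soup.body.insert(3, new_tag)
print(new_soup.body)
# <body><p>a</p><p>b</p><p>c</p><div id="file_history"></div></body>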
The other answers are straight from the documentation. Here is a shortcut:
from bs4 import BeautifulSoup
temp_soup = BeautifulSoup('<div id="file_history"></div>')
# BeautifulSoup automatically adds <html> and <body> tags
# There is only one 'div' tag, so it's the only member of the 'contents' list
div_tag = temp_soup.html.body.contents[0]
# Or, more simply:
div_tag = temp_soup.html.body.div
your_new_soup.body.insert(3, div_tag)
This is the HTML:
<div><div id="NhsjLK">
<li class="EditableListItem NavListItem FollowersNavItem NavItem not_removable">
Followers <span class="list_count">92</span></li></div></div>
I want to extract the text 92, convert it to an integer, and print it in Python 2. How can I do that?
Code:
i = soup.find('div', id='NhsjLK')
print "Followers :", i.find('span', id='list_count').text
I'd not go with getting it by the class directly, since I think "list_count" is too broad a class value and might be used for other things on the page.
There are definitely several different options judging by this HTML snippet alone, but one of the nicest, from my point of view, is to use that "Followers" text/label and get the next sibling of it:
from bs4 import BeautifulSoup
data = """
<div><div id="NhsjLK">
<li class="EditableListItem NavListItem FollowersNavItem NavItem not_removable">
Followers <span class="list_count">92</span></li></div></div>"""
soup = BeautifulSoup(data, "html.parser")
count = soup.find(text=lambda text: text and text.strip().startswith('Followers')).next_sibling.get_text()
count = int(count)
print(count)
Or, another very concise and reliable approach would be to use a partial match (the *= part below) on the href value of the parent a element:
count = int(soup.select_one("a[href*=followers] .list_count").get_text())
Or, you might check the class value of the parent li element:
count = int(soup.select_one("li.FollowersNavItem .list_count").get_text())
Hi I need to pass a variable to the soup.find() function, but it doesn't work :(
Does anyone know a solution for this?
from bs4 import BeautifulSoup
html = '''<div> blabla
<p class='findme'> p-tag content</p>
</div>'''
sources = {'source1': '\'p\', class_=\'findme\'',
           'source2': '\'span\', class_=\'findme2\'',
           'source3': '\'div\', class_=\'findme3\''}
test = BeautifulSoup(html)
# this works
#print(test.find('p', class_='findme'))
# >>> <p class="findme"> p-tag content</p>
# this doesn't work
tag = '\'p\' class_=\'findme\''
# a source gets passed
print(test.find(sources[source]))
# >>> None
I am trying to split it up as suggested like this:
pattern = '"p", {"class": "findme"}'
tag = pattern.split(', ')
tag1 = tag[0]
filter = tag[1]
date = test.find(tag1, filter)
I don't get errors, just None for date. The problem is probably the content of tag1 and filter. The PyCharm debugger gives me:
tag1 = '"p"'
filter = '{"class": "findme"}'
Printing them doesn't show these extra quotes. Is it possible to remove them?
The first argument is a tag name, and your string doesn't contain just a tag name. BeautifulSoup (or Python, generally) won't parse out a string like that; it cannot guess that you put some arbitrary Python syntax in that value.
Separate out the components:
tag = 'p'
filter = {'class_': 'findme'}
test.find(tag, **filter)
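Applied to the question, one possible sketch (the tuple layout for sources is my own suggestion, not something from the original code) stores each source as a (tag name, keyword arguments) pair and unpacks it into find():
from bs4 import BeautifulSoup

html = '''<div> blabla
<p class='findme'> p-tag content</p>
</div>'''

sources = {'source1': ('p', {'class_': 'findme'}),
           'source2': ('span', {'class_': 'findme2'}),
           'source3': ('div', {'class_': 'findme3'})}

test = BeautifulSoup(html, 'html.parser')
name, kwargs = sources['source1']
print(test.find(name, **kwargs))
# <p class="findme"> p-tag content</p>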
Okay I got it, thanks again.
dic_date = {'source1': 'p, class:findme'}  # plus the other sources ...
pattern = dic_date[source]
tag = pattern.split(', ')
if len(tag) == 2:
    att = tag[1].split(':')   # getting the attribute
    att = {att[0]: att[1]}    # building a dictionary for the attributes
    date = soup.find(tag[0], att)
else:
    date = soup.find(tag[0])  # if there is only a tag without an attribute
Well it doesn't look very nice but it's working :)