Confusion regarding BeautifulSoup's append method - python

I tried the code below and noticed something puzzling:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<a>Foo</a>")
>>> soup.a.append("Bar")
>>> soup
<a>FooBar</a>
>>> soup.a.contents
[u'Foo', u'Bar']
>>>
Why does contents come back as [u'Foo', u'Bar'] instead of [u'FooBar']?
Can you help me understand this?

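append adds "Bar" as a second child string next to "Foo"; it does not merge the two. If you want a single string, rebuild the contents yourself:
>>> from bs4 import BeautifulSoup, NavigableString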
>>> soup = BeautifulSoup("<a>Foo</a>")
>>> soup.a.contents = [NavigableString(str(soup.a.contents[0]) + 'Bar')]
>>> soup
<a>FooBar</a>
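To see what append actually does, here is a minimal sketch. It assumes Python 3 and bs4 with the built-in "html.parser"; the question names neither. It shows the two separate string nodes, the joined text, and one way to collapse them into a single node by assigning to .string.

```python
from bs4 import BeautifulSoup

# The parser choice is an assumption; the original session did not name one.
soup = BeautifulSoup("<a>Foo</a>", "html.parser")
soup.a.append("Bar")

# append() added a second child string next to "Foo"
children = list(soup.a.contents)
print(children)

# get_text() concatenates every string found under the tag
joined = soup.a.get_text()
print(joined)

# Assigning to .string replaces all children with one string node
soup.a.string = joined
merged = list(soup.a.contents)
print(merged)
```

The rendered markup is the same in both cases; only the number of child nodes differs.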

Related

re.sub isn't matching when it seems it should

Can anyone help me understand why this regex isn't matching <td>\n etc.? I tested it successfully on pythex.org. Basically, I'm just trying to clean up the output so it just says myfile.doc. I also tried (<td>)?\\n\s+(</td>)?
>>> from bs4 import BeautifulSoup
>>> from pprint import pprint
>>> import re
>>> soup = BeautifulSoup(open("/home/user/message_tracking.html"), "html.parser")
>>>
>>> filename = str(soup.findAll("td", text=re.compile(r"\.[a-z]{3,}")))
>>> print filename
[<td>\n myfile.doc\n </td>]
>>> duh = re.sub("(<td>)?\n\s+(</td>)?", '', filename)
>>> print duh
[<td>\n myfile.doc\n </td>]
It's hard to tell without seeing repr(filename), but I think your problem is confusing real newline characters with escaped ones (a literal backslash followed by the letter n).
Compare and contrast the examples below:
>>> pattern = "(<td>)?\n\s+(</td>)?"
>>> filename1 = '[<td>\n myfile.doc\n </td>]'
>>> filename2 = r'[<td>\n myfile.doc\n </td>]'
>>>
>>> re.sub(pattern, '', filename1)
'[myfile.doc]'
>>> re.sub(pattern, '', filename2)
'[<td>\\n myfile.doc\\n </td>]'
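The same comparison can be run as a small script. This is a sketch assuming Python 3. The second pattern, named either here, is my own illustration and not part of the original answer: it accepts a real newline or a backslash followed by n, so it cleans both forms.

```python
import re

real = "[<td>\n myfile.doc\n </td>]"        # contains real newline characters
escaped = r"[<td>\n myfile.doc\n </td>]"    # contains backslash + "n" pairs

# repr() makes the difference visible: \n versus \\n
print(repr(real))
print(repr(escaped))

# The pattern from the question only matches real newlines
pattern = re.compile(r"(<td>)?\n\s+(</td>)?")
cleaned_real = pattern.sub("", real)
unchanged = pattern.sub("", escaped)
print(cleaned_real)
print(unchanged)

# Hypothetical variant: match a real newline OR a literal backslash-n
either = re.compile(r"(<td>)?(?:\n|\\n)\s+(</td>)?")
cleaned_escaped = either.sub("", escaped)
print(cleaned_escaped)
```

If repr(filename) shows doubled backslashes, the string holds escaped newlines and only the second pattern will match.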
If your goal is just to get the stripped string from within the <td> tag, you can let BeautifulSoup do it for you through the tag's stripped_strings attribute, a generator that yields the tag's strings with surrounding whitespace removed:
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("/home/user/message_tracking.html"), "html.parser")
filename_tag = soup.find("td", text=re.compile(r"\.[a-z]{3,}"))  # first td whose text matches the pattern
filename_string = " ".join(filename_tag.stripped_strings)  # join the generator's pieces into one string
print filename_string
If you want to extract further strings from tags of the same type, call findNext on the current tag to get the next matching td after it:
filename_tag = filename_tag.findNext("td", text=re.compile(r"\.[a-z]{3,}"))  # next matching td after the current one
filename_string = " ".join(filename_tag.stripped_strings)
print filename_string
And then loop through...
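One possible way to write that loop is to let find_all collect every matching td at once. This is a sketch assuming Python 3 and a bs4 version that accepts the string= argument (the newer spelling of text=). The sample markup is made up for illustration and stands in for the real file.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for message_tracking.html
html = (
    "<table><tr>"
    "<td>\n myfile.doc\n </td>"
    "<td>42</td>"
    "<td>\n notes.txt\n </td>"
    "</tr></table>"
)
soup = BeautifulSoup(html, "html.parser")

# Same pattern as in the question: a dot followed by 3+ lowercase letters
pattern = re.compile(r"\.[a-z]{3,}")

# find_all returns every td whose string matches; get_text(strip=True)
# removes the surrounding newlines and spaces
filenames = [td.get_text(strip=True) for td in soup.find_all("td", string=pattern)]
print(filenames)
```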

How could I get (print) all inner html from node which I select using python's lxml etree and xpath?

How can I get all the inner HTML of a node that I select using an etree XPath query?
>>> from lxml import etree
>>> from StringIO import StringIO
>>> doc = '<foo><bar><div>привет привет</div></bar></foo>'
>>> hparser = etree.HTMLParser()
>>> htree = etree.parse(StringIO(doc), hparser)
>>> foo_element = htree.xpath("//foo")
How can I now print foo_element's inner HTML as text? I need to get this:
<bar><div>привет привет</div></bar>
BTW, when I tried to use lxml.html.tostring I got strange output:
>>> import lxml.html
>>> lxml.html.tostring(foo_element[0])
'<foo><bar><div>привет првиет</div></bar></foo>'
You can apply the same technique shown in this other SO post. Here is an example in the context of this question:
>>> from lxml import etree
>>> from lxml import html
>>> from StringIO import StringIO
>>> doc = '<foo><bar><div>TEST NODE</div></bar></foo>'
>>> hparser = etree.HTMLParser()
>>> htree = etree.parse(StringIO(doc), hparser)
>>> foo_element = htree.xpath("//foo")
>>> print ''.join(html.tostring(e) for e in foo_element[0])
<bar><div>TEST NODE</div></bar>
Or, to handle the case where the element may also contain a text-node child:
>>> doc = '<foo>text node child<bar><div>TEST NODE</div></bar></foo>'
>>> htree = etree.parse(StringIO(doc), hparser)
>>> foo_element = htree.xpath("//foo")
>>> print (foo_element[0].text or '') + ''.join(html.tostring(e) for e in foo_element[0])
text node child<bar><div>TEST NODE</div></bar>
For real-world use, refactoring the code into a separate function, as shown in the linked post, is strongly advised.
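Such a function might look like the sketch below. To keep it runnable without extra packages, it uses the standard library's xml.etree.ElementTree as a stand-in for lxml, and assumes Python 3. The name inner_html is hypothetical. lxml elements expose the same .text attribute and child iteration, so with lxml you would swap ET.tostring for lxml.html.tostring.

```python
import xml.etree.ElementTree as ET


def inner_html(element):
    """Return the markup between an element's opening and closing tags."""
    # .text is the text before the first child; it is None when absent
    parts = [element.text or ""]
    # Serializing each child also includes the text that follows it (its tail)
    for child in element:
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


root = ET.fromstring("<foo>text node child<bar><div>TEST NODE</div></bar></foo>")
result = inner_html(root)
print(result)
```

Passing encoding="unicode" asks for a text string instead of encoded bytes, which also keeps non-ASCII text such as Cyrillic readable in the output.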

Can't seem to select root tag with BeautifulSoup

I'm trying to use select to select tags with BeautifulSoup, but it seems that BeautifulSoup will select a root tag if it's part of a BeautifulSoup object, but not if it's just in a tag object.
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup()
>>> a = soup.new_tag("a")
>>> a = a.wrap(soup.new_tag("b"))
>>> soup.append(a)
>>> soup
<b><a></a></b>
>>> a
<b><a></a></b>
>>> soup.select("b")
[<b><a></a></b>]
>>> a.select("b")
[]
>>> a.select("a")
[<a></a>]
Short of creating a new BeautifulSoup object that consists only of a, is there a way to get this to work?
First, parse your string into a BeautifulSoup object:
>>> from bs4 import BeautifulSoup
>>> a='<b><a></a></b>'
>>> a=BeautifulSoup(a)
>>> a
<html><body><b><a></a></b></body></html>
>>> a.select("b")
[<b><a></a></b>]
>>> a.select("a")
[<a></a>]
>>>
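That's how BeautifulSoup works: select searches only the descendants of the object it is called on, so a tag never matches itself, while a BeautifulSoup object sits above the root tag and can match it.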
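If the tag is already attached to a tree, one workaround is to run the selector from its parent and check whether the tag itself is among the results. This is only a sketch, assuming Python 3 and bs4 with "html.parser". The helper name select_with_self is hypothetical, and the approach does nothing for a detached tag that has no parent.

```python
from bs4 import BeautifulSoup


def select_with_self(tag, selector):
    """Like tag.select(selector), but also return tag itself if it matches."""
    matches = list(tag.select(selector))
    parent = tag.parent
    # select() skips the tag it is called on, so ask the parent instead and
    # compare by identity (bs4 tags compare equal by content, not identity)
    if parent is not None and any(m is tag for m in parent.select(selector)):
        matches.insert(0, tag)
    return matches


soup = BeautifulSoup("<b><a></a></b>", "html.parser")
b = soup.b

plain = b.select("b")
with_self = select_with_self(b, "b")
inner = select_with_self(b, "a")
print(plain, with_self, inner)
```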

If I have this string in Python, how do I decode it?

s = 'Tara%2520Stiles%2520Living'
How do I turn it into:
Tara Stiles Living
You need to use urllib.unquote, but it appears you need to use it twice:
>>> import urllib
>>> s = 'Tara%2520Stiles%2520Living'
>>> urllib.unquote(urllib.unquote(s))
'Tara Stiles Living'
After unquoting once, your "%2520" turns into "%20", which unquoting again turns into " " (a space).
Use:
urllib.unquote(string)
http://docs.python.org/library/urllib.html
>>> import urllib
>>> s = 'Tara%2520Stiles%2520Living'
>>> t=urllib.unquote_plus(s)
>>> print t
Tara%20Stiles%20Living
>>> urllib.unquote_plus(t)
'Tara Stiles Living'
>>>
import urllib
s = 'Tara%2520Stiles%2520Living'
t = urllib.unquote(s)
while s != t:  # keep unquoting until the string stops changing
    s, t = t, urllib.unquote(t)
If you are using Python 3, you should use urllib.parse.unquote(url), as in the following code:
import urllib.parse
url = "http://url-with-quoted-char:%3Cspan%3E%20%3C/span%3E"
print(url)
print(urllib.parse.unquote(url))
This code will output the following:
>>> print(url)
http://url-with-quoted-char:%3Cspan%3E%20%3C/span%3E
>>> print(urllib.parse.unquote(url))
http://url-with-quoted-char:<span> </span>
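The repeated-unquote idea can be combined with the Python 3 function in a small helper. This is a sketch: the name unquote_fully and the max_rounds safety limit are my own additions, not part of urllib.

```python
from urllib.parse import unquote


def unquote_fully(value, max_rounds=10):
    """Unquote repeatedly until the string stops changing."""
    for _ in range(max_rounds):
        decoded = unquote(value)
        if decoded == value:
            # Nothing left to decode
            break
        value = decoded
    return value


s = 'Tara%2520Stiles%2520Living'
result = unquote_fully(s)
print(result)
```

Be careful with this on arbitrary input: a string that is meant to contain a literal percent sequence after one round of decoding will be decoded further than intended.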

Python: Convert those TinyURL (bit.ly, tinyurl, ow.ly) to full URLS

I am just learning Python and am interested in how this can be accomplished. While searching for an answer, I came across this service: http://www.longurlplease.com
For example:
http://bit.ly/rgCbf can be converted to:
http://webdesignledger.com/freebies/the-best-social-media-icons-all-in-one-place
I did some inspecting with Firefox and saw that the original URL is not in the header.
Enter urllib2, which offers the easiest way of doing this:
>>> import urllib2
>>> fp = urllib2.urlopen('http://bit.ly/rgCbf')
>>> fp.geturl()
'http://webdesignledger.com/freebies/the-best-social-media-icons-all-in-one-place'
For reference's sake, however, note that this is also possible with httplib:
>>> import httplib
>>> conn = httplib.HTTPConnection('bit.ly')
>>> conn.request('HEAD', '/rgCbf')
>>> response = conn.getresponse()
>>> response.getheader('location')
'http://webdesignledger.com/freebies/the-best-social-media-icons-all-in-one-place'
And with PycURL, although I'm not sure this is the best way to do it with that library:
>>> import pycurl
>>> conn = pycurl.Curl()
>>> conn.setopt(pycurl.URL, "http://bit.ly/rgCbf")
>>> conn.setopt(pycurl.FOLLOWLOCATION, 1)
>>> conn.setopt(pycurl.CUSTOMREQUEST, 'HEAD')
>>> conn.setopt(pycurl.NOBODY, True)
>>> conn.perform()
>>> conn.getinfo(pycurl.EFFECTIVE_URL)
'http://webdesignledger.com/freebies/the-best-social-media-icons-all-in-one-place'
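In Python 3, urllib2 was folded into urllib.request, so the first approach might be sketched as below. The function name expand_url and its opener parameter are hypothetical. The parameter only exists so the function can be exercised without network access, and it defaults to the real urlopen.

```python
import urllib.request


def expand_url(short_url, opener=urllib.request.urlopen, timeout=10):
    """Follow redirects from short_url and return the final URL."""
    # urlopen follows HTTP redirects by default; geturl() reports
    # the URL of the resource that was finally retrieved
    response = opener(short_url, timeout=timeout)
    try:
        return response.geturl()
    finally:
        response.close()


# Example call (needs network access, so it is left commented out):
# print(expand_url('http://bit.ly/rgCbf'))
```

Unlike the httplib version, this performs a GET and follows every redirect in the chain, not just the first Location header.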
