I'm getting "AttributeError: 'NoneType' object has no attribute 'string'" when I run the following. however, when the same tasks are performed on a block string variable; it works.
Any Ideas as to what I'm doing wrong?
from BeautifulSoup import BeautifulSoup
from urllib import urlopen
url = ("https://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Albert%20Einstein&explaintext")
print BeautifulSoup(urlopen(url).read()).find('extract').string.split("\n", 1)[0]
from BeautifulSoup import BeautifulSoup
from urllib import urlopen
url = ("https://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Albert%20Einstein&explaintext")
soup = BeautifulSoup(urlopen(url).read())
print soup.find('extract') # returns None
The find method is not finding anything with the tag 'extract'. If you want to see it work, give it an HTML tag that exists in the document, like 'pre' or 'html'.
'extract' looks like an XML tag. You might want to try reading the BeautifulSoup documentation on parsing XML: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Parsing%20XML. Also, there is a newer version of BeautifulSoup out there (bs4); I find its API much nicer.
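Here's a sketch of how you might pull that tag out with bs4 and requests (this assumes adding format=xml to the query so the API returns XML, and that lxml is installed for the "xml" parser):
import requests
from bs4 import BeautifulSoup

url = ("https://en.wikipedia.org/w/api.php?action=query&prop=extracts"
       "&titles=Albert%20Einstein&explaintext&format=xml")
soup = BeautifulSoup(requests.get(url).text, "xml")  # parse as XML, not HTML

extract = soup.find('extract')
if extract is not None:  # guard against the original AttributeError
    print(extract.get_text().split("\n", 1)[0])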
I am trying to extract the 'meanings' section of a dictionary entry from an HTML file using BeautifulSoup, but it is giving me some trouble. Here is a summary of what I have tried so far:
I right-click on the dictionary entry page below and save the webpage to my Python directory as 'aufmachen.html':
https://www.duden.de/rechtschreibung/aufmachen
Within the source code of this webpage, the section that I am trying to extract starts on line 1042 with the expression <div class="division " id="bedeutungen">.
I wrote the code below, but neither tags nor Bedeutungen contains any search results.
import requests
import pandas as pd
import urllib.request
from bs4 import BeautifulSoup
with open("aufmachen.html",encoding="utf8") as f:
doc = BeautifulSoup(f,"html.parser")
tags = doc.body.findAll(text = '<div class="division " id="bedeutungen">')
print(tags)
Bedeutungen = doc.body.findAll("div", {"id": "bedeutungen"})
print(Bedeutungen)
Could you please help me with this problem?
Thanks for your time in advance.
One bug in your code is that you send BS a file object, not a string; call .read() on your file to get one. (BeautifulSoup can also accept an open file handle, but the real problem with your search is explained below.)
with open("aufmachen.html", "r",encoding="utf8") as f:
doc = BeautifulSoup(f.read(),"html.parser")
However, it seems you want to pull in the HTML from a URL, not from a file on your computer. That can be done like this:
from bs4 import BeautifulSoup
import requests
url = "https://www.duden.de/rechtschreibung/aufmachen"
req = requests.get(url)
soup = BeautifulSoup(req.text, "html.parser")
Bedeutungen = soup.body.findAll("div", {"id": "bedeutungen"})
print(Bedeutungen)
Your first call to .findAll() didn't work because the text kwarg searches the text inside tags, not the tags themselves. The following works, but there's no particular reason to prefer it over the approach shown above.
tags = soup.body.findAll("div", class_="division", id="bedeutungen")
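If you then want the visible text of the meanings section, a possible follow-up (assuming Duden still serves a div with id="bedeutungen"):
if Bedeutungen:
    # join the text nodes with newlines and trim surrounding whitespace
    print(Bedeutungen[0].get_text(separator="\n", strip=True))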
I am using the lxml and requests modules and am just trying to parse the article from a website. I tried using find_all from BeautifulSoup but still came up empty:
from lxml import html
import requests
page = requests.get('https://www.thehindu.com/news/national/karnataka/kumaraswamy-congress-leaders-meet-to-discuss-cabinet-reshuffle/article27283040.ece')
tree = html.fromstring(page.content)
article = tree.xpath('//div[@class="article"]/text()')
Once I print article, I get a list of ['\n','\n','\n','\n','\n'], rather than the body of the article. Where exactly am I going wrong?
I would use bs4 and the class name in a CSS selector with select_one:
import requests
from bs4 import BeautifulSoup as bs
page = requests.get('https://www.thehindu.com/news/national/karnataka/kumaraswamy-congress-leaders-meet-to-discuss-cabinet-reshuffle/article27283040.ece')
soup = bs(page.content, 'lxml')
print(soup.select_one('.article').text)
If you use
article = tree.xpath('//div[@class="article"]//text()')
you get a list that still contains all the \n entries but also the actual text; the whitespace can be handled with re.sub or conditional logic.
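For example, one way to drop the whitespace-only nodes (a sketch using a list comprehension rather than re.sub):
# keep only nodes that contain non-whitespace text
texts = [t.strip() for t in tree.xpath('//div[@class="article"]//text()') if t.strip()]
print(texts)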
I seem to be doing something wrong. I have an HTML source that I pull using urllib. Based on this HTML file, I use BeautifulSoup to findAll elements with an ID from a specified array. This works for me; however, the output is messy and includes line breaks ("\n").
Python: 2.7.12
BeautifulSoup: bs4
I have tried to use prettify() to correct the output but always get an error:
AttributeError: 'ResultSet' object has no attribute 'prettify'
import urllib
import re
from bs4 import BeautifulSoup

cfile = open("test.txt")
clist = cfile.read()
clist = clist.split('\n')

i = 0
while i < len(clist):
    url = "https://example.com/" + clist[i]
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    soup = BeautifulSoup(htmltext, "html.parser")
    soup = soup.findAll(id=["id1", "id2", "id3"])
    print soup.prettify()
    i += 1
I'm sure there is something simple I am overlooking with this line:
soup = soup.findAll(id=["id1", "id2", "id3"])
I'm just not sure what. Sorry if this is a stupid question. I've only been using Python and Beautiful Soup for a few days.
You are reassigning the soup variable to the result of .findAll(), which is a ResultSet object (basically, a list of tags), and a ResultSet does not have a prettify() method.
The solution is to keep the soup variable pointing to the BeautifulSoup instance.
You can call prettify() on the top-level BeautifulSoup object, or on any of its Tag objects:
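A sketch in the asker's Python 2 setup (iterate over the ResultSet and call prettify() on each Tag):
soup = BeautifulSoup(htmltext, "html.parser")
print soup.prettify()  # works: soup is the top-level BeautifulSoup object
for tag in soup.findAll(id=["id1", "id2", "id3"]):
    print tag.prettify()  # works: each item in the ResultSet is a Tag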
findAll returns a list of matched tags, so your code is equivalent to [tag1, tag2, ...].prettify(), and that will not work.
I am trying to get my feet wet with BS.
I tried to work my way through the documentation, but at the very first step I already ran into a problem.
This is my code:
from bs4 import BeautifulSoup
soup = BeautifulSoup('https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=5....1b&per_page=250&accuracy=1&has_geo=1&extras=geo,tags,views,description')
print(soup.prettify())
This is the response I get:
Warning (from warnings module):
  File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/bs4/__init__.py", line 189
    '"%s" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client to get the document behind the URL, and feed that document to Beautiful Soup.' % markup)
UserWarning: "https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=5...b&per_page=250&accuracy=1&has_geo=1&extras=geo,tags,views,description" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client to get the document behind the URL, and feed that document to Beautiful Soup.
https://api.flickr.com/services/rest/?method=flickr.photos.search&api;_key=5...b&per;_page=250&accuracy;=1&has;_geo=1&extras;=geo,tags,views,description
Is it because I'm trying to call an https URL, or is it another problem?
Thanks for your help!
You are passing the URL in as a string. Instead, you need to get the page source via urllib2 or requests:
from urllib2 import urlopen # for Python 3: from urllib.request import urlopen
from bs4 import BeautifulSoup
URL = 'https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=5....1b&per_page=250&accuracy=1&has_geo=1&extras=geo,tags,views,description'
soup = BeautifulSoup(urlopen(URL))
Note that you don't need to call read() on the result of urlopen(): BeautifulSoup accepts a file-like object as its first argument, and urlopen() returns one.
The error says everything, you are passing a URL to Beautiful Soup. You need to first get the website content, and only then pass the content to BS.
To download the content you can use urllib2:
import urllib2
response = urllib2.urlopen('http://www.example.com/')
html = response.read()
and later
soup = BeautifulSoup(html)
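For Python 3, the equivalent would be (a sketch; urllib2 was split into urllib.request):
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.example.com/').read()
soup = BeautifulSoup(html, "html.parser")  # pass an explicit parser to avoid a warning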
I am using BeautifulSoup to scrape a URL, and I had the following code to find the td tag whose class is 'empformbody':
import urllib
import urllib2
from BeautifulSoup import BeautifulSoup
url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
req = urllib2.Request(url)
response = urllib2.urlopen(req)
the_page = response.read()
soup = BeautifulSoup(the_page)
soup.findAll('td',attrs={'class':'empformbody'})
Now in the above code we can use findAll to get tags and information related to them, but I want to use XPath. Is it possible to use XPath with BeautifulSoup? If possible, please provide me example code.
Nope, BeautifulSoup, by itself, does not support XPath expressions.
An alternative library, lxml, does support XPath 1.0. It has a BeautifulSoup compatible mode where it'll try and parse broken HTML the way Soup does. However, the default lxml HTML parser does just as good a job of parsing broken HTML, and I believe is faster.
Once you've parsed your document into an lxml tree, you can use the .xpath() method to search for elements.
try:
    # Python 2
    from urllib2 import urlopen
except ImportError:
    # Python 3
    from urllib.request import urlopen

from lxml import etree

url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = urlopen(url)
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)
# xpathselector is your XPath string, e.g. '//td[@class="empformbody"]'
tree.xpath(xpathselector)
There is also a dedicated lxml.html module with additional functionality.
Note that in the above example I passed the response object directly to lxml, as having the parser read directly from the stream is more efficient than reading the response into a large string first. To do the same with the requests library, you want to set stream=True and pass in the response.raw object after enabling transparent transport decompression:
import lxml.html
import requests
url = "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = requests.get(url, stream=True)
response.raw.decode_content = True
tree = lxml.html.parse(response.raw)
Of possible interest to you is the CSS Selector support; the CSSSelector class translates CSS statements into XPath expressions, making your search for td.empformbody that much easier:
from lxml.cssselect import CSSSelector

td_empformbody = CSSSelector('td.empformbody')
for elem in td_empformbody(tree):
    # Do something with these table cells.
    print(elem.text_content())
Coming full circle: BeautifulSoup itself does have very complete CSS selector support:
for cell in soup.select('table#foobar td.empformbody'):
    # Do something with these table cells.
    print(cell.get_text())
I can confirm that there is no XPath support within Beautiful Soup.
As others have said, BeautifulSoup doesn't have XPath support. There are probably a number of ways to get something from an XPath expression, including using Selenium. However, here's a solution that works in either Python 2 or 3:
from lxml import html
import requests
page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(page.content)
#This will create a list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
#This will create a list of prices
prices = tree.xpath('//span[@class="item-price"]/text()')
print('Buyers: ', buyers)
print('Prices: ', prices)
I used this as a reference.
BeautifulSoup has a method named findNext that searches forward from the current element, so you can chain it:
father.findNext('div', {'class': 'class_value'}).findNext('div', {'id': 'id_value'}).findAll('a')
The code above imitates the following XPath:
//div[@class="class_value"]//div[@id="id_value"]
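A runnable sketch of that chain (the markup and the class/id values here are made up for illustration):
from bs4 import BeautifulSoup

html = '''
<span id="start">anchor</span>
<div class="class_value">
  <div id="id_value"><a href="/a">x</a><a href="/b">y</a></div>
</div>
'''
soup = BeautifulSoup(html, "html.parser")
father = soup.find('span', id='start')  # hypothetical starting element
links = father.findNext('div', {'class': 'class_value'}).findNext('div', {'id': 'id_value'}).findAll('a')
print([a['href'] for a in links])  # -> ['/a', '/b']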
from lxml import etree
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('path of your localfile.html'), 'html.parser')
dom = etree.HTML(str(soup))
print dom.xpath('//*[@id="BGINP01_S1"]/section/div/font/text()')
The above combines the Soup object with lxml, so you can extract values using XPath.
When you use lxml, it's all simple:
tree = lxml.html.fromstring(html)
i_need_element = tree.xpath('//a[@class="shared-components"]/@href')
But when you use BeautifulSoup (bs4), it's all simple too:
first, remove "//" and "@"
second, add a star before "="
Try this magic:
soup = BeautifulSoup(html, "lxml")
i_need_element = soup.select('a[class*="shared-components"]')
As you can see, this does not support sub-tags, so I removed the "/@href" part.
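To get the href values that the removed "/@href" part would have returned, one possible follow-up (a sketch):
# pull the href attribute off each matched <a> element
hrefs = [a.get('href') for a in soup.select('a[class*="shared-components"]')]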
I've searched through their docs and it seems there is no XPath option.
Also, as you can see here in a similar question on SO, the OP is asking for a translation from XPath to BeautifulSoup, so my conclusion would be: no, there is no XPath parsing available.
Maybe you can try the following without XPath:
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = '''
<html>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.</p>
<p>More information...</p>
</div>
</body>
</html>
'''
# What XPath can do, this can do too
doc = SimplifiedDoc(html)
# The result is the same as doc.getElementByTag('body').getElementByTag('div').getElementByTag('h1').text
print (doc.body.div.h1.text)
print (doc.div.h1.text)
print (doc.h1.text) # Shorter paths will be faster
print (doc.div.getChildren())
print (doc.div.getChildren('p'))
This is a pretty old thread, but there is a workaround solution now, which may not have been in BeautifulSoup at the time.
Here is an example of what I did. I use the requests module to read an RSS feed and get its text content in a variable called rss_text. With that, I run it through BeautifulSoup, search for the XPath /rss/channel/title, and retrieve its contents. It's not exactly XPath in all its glory (wildcards, multiple paths, etc.), but if you just have a basic path you want to locate, this works.
from bs4 import BeautifulSoup
rss_obj = BeautifulSoup(rss_text, 'xml')
cls.title = rss_obj.rss.channel.title.get_text()
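For completeness, a sketch of how rss_text might be fetched beforehand (the feed URL here is hypothetical):
import requests

rss_text = requests.get('https://example.com/feed.xml').text  # hypothetical feed URL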
Use soup.find(class_='myclass'); the class_ keyword has a trailing underscore because class is a reserved word in Python.
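For example (a minimal sketch; the markup here is made up):
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="myclass">hello</div>', 'html.parser')
print(soup.find(class_='myclass').text)  # -> hello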