cannot run BeautifulSoup using requests.get(url) - python

start_url = requests.get('http://www.delicious.com/golisoda')
soup = BeautifulSoup(start_url)
This code displays the following error:
Traceback (most recent call last):
File "test2_requests.py", line 10, in <module>
soup=BeautifulSoup(start_url)
File "/usr/local/lib/python2.7/dist-packages/bs4/__init__.py", line 169, in __init__
self.builder.prepare_markup(markup, from_encoding))
File "/usr/local/lib/python2.7/dist-packages/bs4/builder/_lxml.py", line 68, in prepare_markup
dammit = UnicodeDammit(markup, try_encodings, is_html=True)
File "/usr/local/lib/python2.7/dist-packages/bs4/dammit.py", line 203, in __init__
self._detectEncoding(markup, is_html)
File "/usr/local/lib/python2.7/dist-packages/bs4/dammit.py", line 373, in _detectEncoding
xml_encoding_match = xml_encoding_re.match(xml_data)
TypeError: expected string or buffer

Use the .content of the response:
start_url = requests.get('http://www.delicious.com/golisoda')
soup = BeautifulSoup(start_url.content)
Alternatively, you can use the decoded unicode text:
start_url = requests.get('http://www.delicious.com/golisoda')
soup = BeautifulSoup(start_url.text)
See the Response content section of the documentation.

You probably need to use
soup = BeautifulSoup(start_url.content)
or
soup = BeautifulSoup(start_url.text)
(a requests Response has no .read() method; .content gives the raw bytes and .text the decoded unicode).

from BeautifulSoup import BeautifulSoup  # BeautifulSoup version 3
import urllib2

# Fetch the page with urllib2 and hand the raw data to BeautifulSoup
data = urllib2.urlopen('http://www.delicious.com/golisoda').read()
soup = BeautifulSoup(data)

Related

I am facing the error "xml.parsers.expat.ExpatError: not well-formed (invalid token)" while parsing URL data using minidom

I am facing the error xml.parsers.expat.ExpatError: syntax error: line 1, column 0 while parsing data from a URL using minidom. Can anyone help me with this?
Here is my code:
from xml.dom import minidom
import urllib2

url = 'http://www.awgp.org/about_us'
openurl = urllib2.urlopen(url)
doc = minidom.parse("about_us.xml")
Error:
Traceback (most recent call last):
File "test3.py", line 11, in <module>
doc=minidom.parse("about_us.xml")
File "C:\Python27\lib\xml\dom\minidom.py", line 1918, in parse
return expatbuilder.parse(file)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 924, in parse
result = builder.parseFile(fp)
File "C:\Python27\lib\xml\dom\expatbuilder.py", line 211, in parseFile
parser.Parse("", True)
xml.parsers.expat.ExpatError: syntax error: line 1, column 0
parser.Parse("", True)
xml.parsers.expat.ExpatError: syntax error: line 1, column 0
The above from your traceback indicates to me that your "about_us.xml" file is empty.
You have openurl but you have not shown that you've ever called openurl.read() to actually get at the data.
Nor have you shown where or how you've written said data to your "about_us.xml" file.
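To make the missing step concrete, here is a minimal sketch of read-then-write-then-parse (the XML string is a hypothetical stand-in for `openurl.read()`; with the real URL you would use that call instead):

```python
from xml.dom import minidom

# Hypothetical stand-in for data = openurl.read(); the point is that the
# fetched bytes must be written to the file before minidom.parse() reads it.
data = "<about><title>About Us</title></about>"

with open("about_us.xml", "w") as f:
    f.write(data)

doc = minidom.parse("about_us.xml")
print(doc.getElementsByTagName("title")[0].firstChild.data)  # About Us
```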
from xml.dom import minidom
import urllib2

url = 'http://www.awgp.org/about_us'
openurl = urllib2.urlopen(url)
doc = minidom.parse(openurl)
print doc
gives me
Traceback (most recent call last):
File "main.py", line 5, in <module>
doc=minidom.parse(openurl)
File "/usr/local/lib/python2.7/xml/dom/minidom.py", line 1918, in parse
return expatbuilder.parse(file)
File "/usr/local/lib/python2.7/xml/dom/expatbuilder.py", line 928, in parse
result = builder.parseFile(file)
File "/usr/local/lib/python2.7/xml/dom/expatbuilder.py", line 207, in parseFile
parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 51, column 81
which indicates that the page you are trying to parse as XML is not well-formed. Try using Beautiful Soup instead, which, from memory, is very forgiving.
from BeautifulSoup import BeautifulSoup
import urllib2

url = 'http://www.awgp.org/about_us'
openurl = urllib2.urlopen(url)
soup = BeautifulSoup(openurl.read())
for a in soup.findAll('a'):
    print (a.text, a.get('href'))
BTW, the import above is for version 3 of Beautiful Soup; bs4 also supports Python 2.7, in which case the import would be from bs4 import BeautifulSoup.

Python: BeautifulSoup UnboundLocalError

I am trying to remove the HTML tags from some documents in .txt format. However, there seems to be an error in bs4 as far as I understand. The error I am getting is the following:
Traceback (most recent call last):
File "E:/Google Drive1/Thesis stuff/Python/database/get_missing_10ks.py", line 13, in <module>
text = BeautifulSoup(file_read, "html.parser")
File "C:\Users\Adrian PC\AppData\Local\Programs\Python\Python37\lib\site-packages\bs4\__init__.py", line 282, in __init__
self._feed()
File "C:\Users\Adrian PC\AppData\Local\Programs\Python\Python37\lib\site-packages\bs4\__init__.py", line 343, in _feed
self.builder.feed(self.markup)
File "C:\Users\Adrian PC\AppData\Local\Programs\Python\Python37\lib\site-packages\bs4\builder\_htmlparser.py", line 247, in feed
parser.feed(markup)
File "C:\Users\Adrian PC\AppData\Local\Programs\Python\Python37\lib\html\parser.py", line 111, in feed
self.goahead(0)
File "C:\Users\Adrian PC\AppData\Local\Programs\Python\Python37\lib\html\parser.py", line 179, in goahead
k = self.parse_html_declaration(i)
File "C:\Users\Adrian PC\AppData\Local\Programs\Python\Python37\lib\html\parser.py", line 264, in parse_html_declaration
return self.parse_marked_section(i)
File "C:\Users\Adrian PC\AppData\Local\Programs\Python\Python37\lib\_markupbase.py", line 160, in parse_marked_section
if not match:
UnboundLocalError: local variable 'match' referenced before assignment
And the code that I am using is the following:
import os
from bs4 import BeautifulSoup

path_to_10k = "D:/10ks/list_missing_10k/"
path_to_saved_10k = "D:/10ks/list_missing_10kp/"
list_txt = os.listdir(path_to_10k)
for name in list_txt:
    file = open(path_to_10k + name, "r+", encoding="utf-8")
    file_read = file.read()
    text = BeautifulSoup(file_read, "html.parser")
    text = text.get_text("\n")
    file2 = open(path_to_saved_10k + name, "w+", encoding="utf-8")
    file2.write(str(text))
    file2.close()
    file.close()
The thing is that I have used this method on 51320 documents and it worked just fine; however, there are a few documents it cannot handle. When I open those HTML documents they look the same to me. If anyone has any indication of what the problem could be and how to fix it, it would be great. Thank you!
EXAMPLE OF FILE: https://files.fm/u/2s45uafp
https://github.com/scrapy/w3lib
https://w3lib.readthedocs.io/en/latest/
pip install w3lib
and then
from w3lib.html import remove_tags
remove_tags(data) returns the data with the HTML tags removed.
Here is a solution which uses a regular expression to remove the HTML tags.
import re

TAG_RE = re.compile(r'<[^>]+>')

def remove_Htmltags(text):
    return TAG_RE.sub('', text)

f = open(r"C:\Temp\Data.txt", "r")
strHtml = f.read()
strClearText = remove_Htmltags(strHtml)
print(strClearText)
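One caveat worth sketching: the regex strips everything between `<` and `>`, so a quoted attribute value containing `>` breaks it. The standard library's HTMLParser tokenizes tags properly; collecting only the text nodes gives a more robust stripper (the sample markup below is hypothetical):

```python
# Stdlib alternative to the regex: HTMLParser tokenizes tags properly,
# so a '>' inside a quoted attribute value does not end the tag early.
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2

class TextExtractor(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.parts = []

    def handle_data(self, data):
        # Called only for text between tags, never for the tags themselves.
        self.parts.append(data)

extractor = TextExtractor()
extractor.feed('<p title="a>b">Hello <b>world</b></p>')
print("".join(extractor.parts))  # Hello world
```

On the same input, TAG_RE.sub('', ...) would return 'b">Hello world' because the first match ends at the `>` inside the attribute value.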

py2exe SSLError in making exe file for windows7

I am trying to make an exe file with py2exe. Here is my code:
import requests
from bs4 import BeautifulSoup
import csv

def get_html(url):
    r = requests.get(url)
    return r.text

url = 'https://loadtv.info/'
html = get_html(url)
soup = BeautifulSoup(html, 'html.parser')
article = soup.findAll('p')
text = []
article_text = ''
for element in article:
    text.append(element.getText().encode('utf8'))
pages = soup.findAll('a', class_='more-link')
links = []
for page in pages:
    links.append(page.get('href'))
zo = []
for i in links:
    z = get_html(i)
    soup = BeautifulSoup(z, 'html.parser')
    t = []
    r = soup.findAll('p')
    w = []
    tex = []
    for i in r:
        w.append(i.getText().encode('utf8'))
    for i in range(3, 6):
        tex.append(w[i])
    zo.append(tex)
video = []
for i in links:
    z = get_html(i)
    soup = BeautifulSoup(z, 'html.parser')
    vid = soup.find("iframe").get('src')
    video.append(vid)
da = {'links': links, 'text': zo, 'video': video}
with open('C:\\2\\123.txt', 'wb') as f:
    writer = csv.writer(f)
    for i in da:
        for rows in da[i]:
            writer.writerow([rows])
I get the following error in log file after compilation:
Traceback (most recent call last):
File "films.py", line 12, in <module>
File "films.py", line 9, in resource_path
NameError: global name 'os' is not defined
Traceback (most recent call last):
File "films.py", line 15, in <module>
File "films.py", line 9, in get_html
File "requests\api.pyc", line 70, in get
File "requests\api.pyc", line 56, in request
File "requests\sessions.pyc", line 488, in request
File "requests\sessions.pyc", line 609, in send
File "requests\adapters.pyc", line 497, in send
requests.exceptions.SSLError: [Errno 2] No such file or directory
I built the exe by running python setup.py py2exe in cmd.
I have no idea how to fix it. The problem seems related to the SSL certificate, but I have already added the modules to my setup script. Please help.
setup.py:
from distutils.core import setup
import py2exe

setup(
    windows=[{"script": "films.py"}],
    options={"py2exe": {"includes": ["bs4", "requests", "csv"]}},
    zipfile=None
)
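A common cause of this exact error: at runtime, requests loads its CA bundle (cacert.pem) from disk via the certifi package, and py2exe does not copy data files automatically, so the frozen exe cannot find it. A hedged sketch of one fix (assumes certifi is installed) is to ship the bundle next to the exe:

```python
# setup.py - sketch: bundle certifi's CA file so requests can find it at runtime
from distutils.core import setup
import py2exe
import certifi  # provides the cacert.pem file that requests relies on

setup(
    windows=[{"script": "films.py"}],
    options={"py2exe": {"includes": ["bs4", "requests", "csv"]}},
    # copy cacert.pem into the dist directory, next to films.exe
    data_files=[("", [certifi.where()])],
    zipfile=None,
)
```

Then, in films.py, point requests at the bundled file explicitly, e.g. requests.get(url, verify="cacert.pem"), or set the REQUESTS_CA_BUNDLE environment variable to its path.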

robust DOM parsing with getElementsByTagName

The following (from "Dive into Python")
from xml.dom import minidom
xmldoc = minidom.parse('/path/to/index.html')
reflist = xmldoc.getElementsByTagName('img')
failed with
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/path/to/htmlToNumEmbedded.py", line 2, in <module>
xmldoc = minidom.parse('/path/to/index.html')
File "/usr/lib/python2.7/xml/dom/minidom.py", line 1918, in parse
return expatbuilder.parse(file)
File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 924, in parse
result = builder.parseFile(fp)
File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 207, in parseFile
parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: mismatched tag: line 12, column 4
Using lxml, which is recommended by http://www.ianbicking.org/blog/2008/12/lxml-an-underappreciated-web-scraping-library.html, allows you to parse the document, but it does not seem to have a getElementsByTagName. The following works:
from lxml import html

xmldoc = html.parse('/path/to/index.html')
root = xmldoc.getroot()
for i in root.iter("img"):
    print i
but seems kludgey: is there a built-in function that I overlooked?
Or another more elegant way to have robust DOM parsing with getElementsByTagName?
If you want a list of elements instead of iterating over the return value of Element.iter, call list() on it:
from lxml import html
reflist = list(html.parse('/path/to/index.html').iter('img'))
You can use BeautifulSoup for this:
from bs4 import BeautifulSoup

with open('/path/to/index.html') as f:
    soup = BeautifulSoup(f)

soup.find_all("img")
See Going through HTML DOM in Python

Why do I get a recursion error with BeautifulSoup and IDLE?

I am following a tutorial to try to learn how to use BeautifulSoup. I am trying to pull names and URLs from an HTML page I downloaded. I have it working great up to this point:
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("43rd-congress.html"))
final_link = soup.p.a
final_link.decompose()
links = soup.find_all('a')
for link in links:
    print link
but when I enter this next part
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("43rd-congress.html"))
final_link = soup.p.a
final_link.decompose()
links = soup.find_all('a')
for link in links:
    names = link.contents[0]
    fullLink = link.get('href')
    print names
    print fullLink
I get this error
Traceback (most recent call last):
File "C:/Python27/python tutorials/soupexample.py", line 13, in <module>
print names
File "C:\Python27\lib\idlelib\PyShell.py", line 1325, in write
return self.shell.write(s, self.tags)
File "C:\Python27\lib\idlelib\rpc.py", line 595, in __call__
value = self.sockio.remotecall(self.oid, self.name, args, kwargs)
File "C:\Python27\lib\idlelib\rpc.py", line 210, in remotecall
seq = self.asynccall(oid, methodname, args, kwargs)
File "C:\Python27\lib\idlelib\rpc.py", line 225, in asynccall
self.putmessage((seq, request))
File "C:\Python27\lib\idlelib\rpc.py", line 324, in putmessage
s = pickle.dumps(message)
File "C:\Python27\lib\copy_reg.py", line 74, in _reduce_ex
getstate = self.__getstate__
RuntimeError: maximum recursion depth exceeded
This is a buggy interaction between IDLE and BeautifulSoup's NavigableString objects (which subclass unicode). See issue 1757057; it's been around for a while.
The work-around is to convert the object to a plain unicode value first:
print unicode(names)
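Applied to the loop above, the workaround looks like this. A minimal sketch using an inline HTML snippet in place of the downloaded file; on Python 3, str() plays the role of unicode():

```python
from bs4 import BeautifulSoup

# Inline stand-in for open("43rd-congress.html")
soup = BeautifulSoup("<p><a href='/member/1'>Alice</a></p>", "html.parser")

for link in soup.find_all('a'):
    names = link.contents[0]      # a NavigableString (a unicode/str subclass)
    fullLink = link.get('href')
    # Converting to a plain string keeps IDLE from trying to pickle the
    # NavigableString, which is what triggers the recursion error.
    print(str(names))
    print(fullLink)
```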
