I am trying to remove the HTML tags from some documents in .txt format. However, as far as I understand, there seems to be an error in bs4. The error that I am getting is the following:
Traceback (most recent call last):
File "E:/Google Drive1/Thesis stuff/Python/database/get_missing_10ks.py", line 13, in <module>
text = BeautifulSoup(file_read, "html.parser")
File "C:\Users\Adrian PC\AppData\Local\Programs\Python\Python37\lib\site-packages\bs4\__init__.py", line 282, in __init__
self._feed()
File "C:\Users\Adrian PC\AppData\Local\Programs\Python\Python37\lib\site-packages\bs4\__init__.py", line 343, in _feed
self.builder.feed(self.markup)
File "C:\Users\Adrian PC\AppData\Local\Programs\Python\Python37\lib\site-packages\bs4\builder\_htmlparser.py", line 247, in feed
parser.feed(markup)
File "C:\Users\Adrian PC\AppData\Local\Programs\Python\Python37\lib\html\parser.py", line 111, in feed
self.goahead(0)
File "C:\Users\Adrian PC\AppData\Local\Programs\Python\Python37\lib\html\parser.py", line 179, in goahead
k = self.parse_html_declaration(i)
File "C:\Users\Adrian PC\AppData\Local\Programs\Python\Python37\lib\html\parser.py", line 264, in parse_html_declaration
return self.parse_marked_section(i)
File "C:\Users\Adrian PC\AppData\Local\Programs\Python\Python37\lib\_markupbase.py", line 160, in parse_marked_section
if not match:
UnboundLocalError: local variable 'match' referenced before assignment
And the code that I am using is the following:
import os
from bs4 import BeautifulSoup

path_to_10k = "D:/10ks/list_missing_10k/"
path_to_saved_10k = "D:/10ks/list_missing_10kp/"

list_txt = os.listdir(path_to_10k)

for name in list_txt:
    file = open(path_to_10k + name, "r+", encoding="utf-8")
    file_read = file.read()
    text = BeautifulSoup(file_read, "html.parser")
    text = text.get_text("\n")
    file2 = open(path_to_saved_10k + name, "w+", encoding="utf-8")
    file2.write(str(text))
    file2.close()
    file.close()
The thing is that I have used this method on 51,320 documents and it worked just fine; however, there are a few documents it cannot handle. When I open those HTML documents they look the same as the others to me. If anyone has any indication of what the problem could be and how to fix it, that would be great. Thank you!
EXAMPLE OF FILE: https://files.fm/u/2s45uafp
You could use w3lib to strip the tags instead of bs4:

https://github.com/scrapy/w3lib
https://w3lib.readthedocs.io/en/latest/

pip install w3lib

and

from w3lib.html import remove_tags

Then remove_tags(data) returns the data with all tags removed.
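For context, a minimal sketch of how this could replace the BeautifulSoup call in the loop from the question, reusing its path variables (illustrative only, not tested against those files):

from w3lib.html import remove_tags

with open(path_to_10k + name, "r", encoding="utf-8") as f:
    file_read = f.read()
clean_text = remove_tags(file_read)  # strips every tag, keeps the text between them
with open(path_to_saved_10k + name, "w", encoding="utf-8") as f:
    f.write(clean_text)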
Here is a solution that uses a regular expression to remove the HTML tags:
import re

# matches a '<', one or more non-'>' characters, then '>'
TAG_RE = re.compile(r'<[^>]+>')

def remove_Htmltags(text):
    return TAG_RE.sub('', text)

f = open(r"C:\Temp\Data.txt", "r")  # raw string so the backslashes stay literal
strHtml = f.read()
strClearText = remove_Htmltags(strHtml)
print(strClearText)
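Note that a regex this blunt never invokes html.parser, so it sidesteps crashes like the one above, but it will also delete anything that merely looks like a tag (for example a stray '<' in the body text), so it is safest on well-formed markup.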
I've installed BeautifulSoup (a package folder named bs4) into my pythonProject folder, which is the same folder as the Python file I am running. The .py file contains the following code, and for input I am using this URL to a simple page with one link, which the code is supposed to retrieve.
URL used as url input: http://data.pr4e.org/page1.htm
.py code:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = input('Enter - ')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
# Retrieve all of the anchor tags
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))
Though I could be wrong, it appears to me that bs4 imports correctly, because my IDE suggests BeautifulSoup when I begin typing it. After all, it is installed in the same directory as the .py file. However, it spits out the following error when I run it with the previously provided URL:
Traceback (most recent call last):
File "C:\Users\Thomas\PycharmProjects\pythonProject\main.py", line 16, in <module>
soup = BeautifulSoup(html, 'html.parser')
File "C:\Users\Thomas\PycharmProjects\pythonProject\bs4\__init__.py", line 215, in __init__
self._feed()
File "C:\Users\Thomas\PycharmProjects\pythonProject\bs4\__init__.py", line 241, in _feed
self.endData()
File "C:\Users\Thomas\PycharmProjects\pythonProject\bs4\__init__.py", line 315, in endData
self.object_was_parsed(o)
File "C:\Users\Thomas\PycharmProjects\pythonProject\bs4\__init__.py", line 320, in
object_was_parsed
previous_element = most_recent_element or self._most_recent_element
File "C:\Users\Thomas\PycharmProjects\pythonProject\bs4\element.py", line 1001, in __getattr__
return self.find(tag)
File "C:\Users\Thomas\PycharmProjects\pythonProject\bs4\element.py", line 1238, in find
l = self.find_all(name, attrs, recursive, text, 1, **kwargs)
File "C:\Users\Thomas\PycharmProjects\pythonProject\bs4\element.py", line 1259, in find_all
return self._find_all(name, attrs, text, limit, generator, **kwargs)
File "C:\Users\Thomas\PycharmProjects\pythonProject\bs4\element.py", line 516, in _find_all
strainer = SoupStrainer(name, attrs, text, **kwargs)
File "C:\Users\Thomas\PycharmProjects\pythonProject\bs4\element.py", line 1560, in __init__
self.text = self._normalize_search_value(text)
File "C:\Users\Thomas\PycharmProjects\pythonProject\bs4\element.py", line 1565, in _
normalize_search_value
if (isinstance(value, str) or isinstance(value, collections.Callable) or hasattr(value,
'match')
AttributeError: module 'collections' has no attribute 'Callable'
Process finished with exit code 1
The lines referred to in the error messages are from files inside bs4 that were downloaded as part of it. I haven't edited or even touched any of the files bs4 contains. Can anyone help me figure out why bs4 isn't working?
Are you using Python 3.10? It looks like the beautifulsoup library is using deprecated aliases to the Collections Abstract Base Classes that were removed in 3.10. More info here: https://docs.python.org/3/whatsnew/3.10.html#removed
A quick fix is to paste these 2 lines just below your imports:
import collections
collections.Callable = collections.abc.Callable
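The patch must run before the first BeautifulSoup call. Longer term, note that a bs4 folder copied into the project directory shadows anything pip installs; deleting that local copy and installing a current release, which references collections.abc.Callable directly, removes the need for the patch:

pip install beautifulsoup4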
I'm trying to write a really basic scraper for YouTube video titles, using a CSV of video links and Beautiful Soup. The script as it currently stands is:
#!/usr/bin/python

from bs4 import BeautifulSoup
import urllib
import csv

with open('url-titles-list.csv', 'wb') as csv_out:
    fieldnames = ['url', 'title']
    writer = csv.DictWriter(csv_out, fieldnames=fieldnames)
    with open('url-nohttps-list.csv', 'rb') as csv_in:
        reader = csv.DictReader(csv_in, fieldnames=['linkurls'])
        writer.writeheader()
        for row in reader:
            link = row['linkurls']
            with urllib.urlopen(link) as response:
                html = response.read()
                soup = BeautifulSoup(html, "html.parser")
                name = soup.title.string
                writer.writerow({'url': row['linkurls'], 'title': name})
This breaks at urllib.urlopen(link), with the following traceback making it look like the URL type is not being recognised correctly and it is trying to open the links as local files:
Traceback (most recent call last):
File "/Users/clarapouletty/Desktop/operation_find_yuzusho/fetcher.py", line 15, in <module>
with urllib.urlopen(link) as response:
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 87, in urlopen
return opener.open(url)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 213, in open
return getattr(self, name)(url)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 469, in open_file
return self.open_local_file(url)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 483, in open_local_file
raise IOError(e.errno, e.strerror, e.filename)
IOError: [Errno 2] No such file or directory: 'linkurls'
Process finished with exit code 1
Any assistance much appreciated!
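A hint is in the last line of the traceback: the "file" that cannot be found is literally named 'linkurls', the header row of the CSV. csv.DictReader only skips the first row when fieldnames is omitted; because fieldnames=['linkurls'] is passed explicitly, the header comes back as a data row, so the first "URL" handed to urllib.urlopen is the bare string 'linkurls', which urllib treats as a relative local path. A minimal sketch of the inner part with the header skipped, assuming Python 2 as in the traceback:

reader = csv.DictReader(csv_in, fieldnames=['linkurls'])
next(reader)  # discard the header row; DictReader keeps it when fieldnames is given
writer.writeheader()
for row in reader:
    response = urllib.urlopen(row['linkurls'])
    html = response.read()
    response.close()  # urlopen's result is not a context manager on Python 2
    soup = BeautifulSoup(html, "html.parser")
    writer.writerow({'url': row['linkurls'], 'title': soup.title.string})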
How can I set the "Unicode strings encoding declaration" in the lxml.html.clean.Cleaner module? I'm looking to read the plaintext of a website, and have used lxml in the past as a way of doing this, scraping out the HTML and JavaScript. For some pages, I'm starting to get some weird errors about encoding, but can't make sense from the documentation of how to set this parameter correctly.
import requests
from lxml.html.clean import Cleaner

cleaner = Cleaner()
cleaner.javascript = True
cleaner.style = True
cleaner.html = True

url = 'http://www.princeton.edu'
r = requests.get(url)
lx = r.text.replace('\t', ' ').replace('\n', ' ').replace('\r', ' ')
#lx = r.text
lxclean = cleaner.clean_html(lx)
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/home/username/gh/venv/local/lib/python2.7/site-packages/lxml/html/clean.py", line 501, in clean_html
doc = fromstring(html)
File "/home/username/gh/venv/local/lib/python2.7/site-packages/lxml/html/__init__.py", line 672, in fromstring
doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
File "/home/username/gh/venv/local/lib/python2.7/site-packages/lxml/html/__init__.py", line 568, in document_fromstring
value = etree.fromstring(html, parser, **kw)
File "lxml.etree.pyx", line 2997, in lxml.etree.fromstring (src/lxml/lxml.etree.c:63276)
File "parser.pxi", line 1607, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:93592)
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
However, it works for other URLs, like 'http://www.google.com'.
This seems suddenly broken for pages like this one. Use beautifulsoup's html parser instead, or pass lxml bytes rather than decoded text.
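The ValueError comes from handing lxml a decoded unicode string whose markup still contains an <?xml ... encoding=... ?> declaration; as the message itself suggests, bytes input avoids the problem. A minimal sketch of that route, keeping the Cleaner from the question (untested against that exact page):

r = requests.get(url)
lxclean = cleaner.clean_html(r.content)  # r.content is bytes, so lxml applies the declared encoding itself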
I am following a tutorial to try to learn how to use BeautifulSoup. I am trying to pull names out of the URLs on an HTML page I downloaded. I have it working great up to this point.
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("43rd-congress.html"))
final_link = soup.p.a
final_link.decompose()
links = soup.find_all('a')
for link in links:
    print link
but when I enter this next part
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("43rd-congress.html"))
final_link = soup.p.a
final_link.decompose()
links = soup.find_all('a')
for link in links:
    names = link.contents[0]
    fullLink = link.get('href')
    print names
    print fullLink
I get this error
Traceback (most recent call last):
File "C:/Python27/python tutorials/soupexample.py", line 13, in <module>
print names
File "C:\Python27\lib\idlelib\PyShell.py", line 1325, in write
return self.shell.write(s, self.tags)
File "C:\Python27\lib\idlelib\rpc.py", line 595, in __call__
value = self.sockio.remotecall(self.oid, self.name, args, kwargs)
File "C:\Python27\lib\idlelib\rpc.py", line 210, in remotecall
seq = self.asynccall(oid, methodname, args, kwargs)
File "C:\Python27\lib\idlelib\rpc.py", line 225, in asynccall
self.putmessage((seq, request))
File "C:\Python27\lib\idlelib\rpc.py", line 324, in putmessage
s = pickle.dumps(message)
File "C:\Python27\lib\copy_reg.py", line 74, in _reduce_ex
getstate = self.__getstate__
RuntimeError: maximum recursion depth exceeded
This is a buggy interaction between IDLE and BeautifulSoup's NavigableString objects (which subclass unicode). See issue 1757057; it's been around for a while.
The work-around is to convert the object to a plain unicode value first:
print unicode(names)
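In context, the loop from the question becomes (Python 2, as in the traceback):

for link in links:
    names = link.contents[0]
    fullLink = link.get('href')
    print unicode(names)  # a plain unicode value, which IDLE can ship across its RPC layer
    print fullLink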
Hi, I'm running Python 2.7.1 and BeautifulSoup 3.2.0.
If I try to load an XML feed using
ifile = open(os.path.join(self.path,str(self.FEED_ID)+'.xml'), 'r')
file_data = BeautifulStoneSoup(ifile,
convertEntities=BeautifulStoneSoup.XHTML_ENTITIES)
I'm getting the following error:
File "C:\dev\Python27\lib\site-packages\BeautifulSoup.py", line 1144, in __ini
t__
self._feed(isHTML=isHTML)
File "C:\dev\Python27\lib\site-packages\BeautifulSoup.py", line 1186, in _feed
SGMLParser.feed(self, markup)
File "C:\dev\Python27\lib\sgmllib.py", line 103, in feed
self.rawdata = self.rawdata + data
TypeError: cannot concatenate 'str' and 'NoneType' objects
I've tried looking everywhere but with no success... please advise.
With the example ...
from BeautifulSoup import BeautifulStoneSoup
xml = "<doc><tag1>Contents 1<tag2>Contents 2<tag1>Contents 3"
soup = BeautifulStoneSoup(xml)
print soup.prettify()
(...)
from here, I infer that you need to pass a string as the first parameter instead of the file object ifile; try:
file_data = BeautifulStoneSoup(ifile.read(),
convertEntities=BeautifulStoneSoup.XHTML_ENTITIES)
I had this error too. This worked for me:
from unidecode import unidecode
file_data = BeautifulSoup(unidecode(ifile.read()))
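(For reference, unidecode transliterates any non-ASCII characters in the input to ASCII approximations, which can sidestep encoding trouble at the cost of altering the text.)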