I am scraping a webpage that contains HTML that looks like this in the browser
<td>LGG® MAX multispecies probiotic consisting of four bacterial trains</td>
<td>LGG® MAX helps to reduce gastro-intestinal discomfort</td>
Taking just the LGG®, in the first instance it is LGG® In the second instance, ® is written as ® in the source code.
I am using Python 2.7, mechanize and BeautifulSoup.
My difficulty is that the ® is uplifted by mechanize, and carried through and is ultimately printed out or written to file.
There are many other special characters. Some are 'converted' on output and the ® are converted to a muddle.
The webpage is declared as UTF-8 and the only reference I make to encoding is when I open my out file. I've declared UTF-8. If I don't the writing to file bombs on other characters.
I am working on Windows 7. Other details:
>>> sys.getdefaultencoding()
'ascii'
>>> sys.stdout.encoding
'cp850'
>>> locale.getdefaultlocale()
('en_GB', 'cp1252')
>>>
Can anyone give me any tips on the best way to handle the special characters? Or should they be called HTML entities? This must be a fairly common problem but I haven't been able to find any straightforward explanations on the web.
UPDATE: I've made some progress here.
The basic algorithm is
Read the webpage in mechanize
Use beautiful soup to do what.. as i write it down i have no idea
what this pre-processing stage is for, exactly.
Use beautiful soup to extract information from a table that is
orderly other than for the treatment of special characters.
Write the information to file delimited by | to account for
punctuation in long cell entries and to allow for importing into
Excel etc.
The progress is in stage 3. I've used some regex and htmlentityrefs to change the code cell entry by cell entry. See this blog post.
Remaining difficulty: the code written to file (and printed to screen) is still incorrect but it appears that the problem is now a matter of specifying the coding correctly. The problem seems smaller at least.
To answer the question from the title:
# -*- coding: utf-8 -*-
from BeautifulSoup import BeautifulSoup
html = u"""
<td>LGG® MAX multispecies probiotic consisting of four bacterial trains</td>
<td>LGG® MAX helps to reduce gastro-intestinal discomfort</td>
"""
soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)
print(''.join(soup('td', text=True)))
Output
LGG® MAX multispecies probiotic consisting of four bacterial trains
LGG® MAX helps to reduce gastro-intestinal discomfort
Related
I'm trying to read a webpage and output the formatted text to a text file. The code below prints to the shell with formatting but when I write it to the file it puts it on one line (with the linebreaks /n present in the text).
I have tried a variety of things such as not converting it to a string, using prettify from beautiful soup but none seem to produce a text file with formatting. I am presuming I am missing something fairly basic. Any help or guidance would be much appreciated.
# Import
from urllib.request import urlopen
from bs4 import BeautifulSoup
#The actual code
URL = "https://simple.wikipedia.org/wiki/castle" #The target URL
html = urlopen(URL).read() # Reads the url to variable html
soup = BeautifulSoup(html, "lxml") # Uses BS4 to create the soup using the lxml parser
soup = soup.get_text() # Extracts the text
print(soup) # Prints to python 3.5.1 shell, formatted as I would expect
# Now writing what I have extracted to a text file
file = open("TextOutput.txt", 'w') # Creates the file and opens as write (w)
file.writelines(str(soup.encode('UTF-8'))) # Tried file.write/lines(soup), convertion to string and encoding as UTF-8 needed to avoid errors
file.close()
A sample of the file output looks like:
b'\n\n\nCastle - Simple English Wikipedia, the free encyclopedia\ndocument.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );\n(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Castle","wgTitle":"Castle","wgCurRevisionId":5333370,"wgRevisionId":5333370,"wgArticleId":15933,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":[""],"wgCategories":["Castles"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"wgRelevantPageName":"Castle","wgRelevantArticleId":15933,"wgRequestId":"VxUR5gpAIDAAAEXY6FMAAACC","wgIsProbablyEditable":true,"wgRestrictionEdit":[],"wgRestrictionMove":[],"wgWikiEditorEnabledModules":{"toolbar":true,"dialogs":true,"preview":false,"publish":false},"wgBetaFeaturesFeatures":[],"wgMediaViewerOnClick":true,"wgMediaViewerEnabledByDefault":true,"wgVisualEditor":{"pageLanguageCode":"en","pageLanguageDir":"ltr","usePageImages":true,"usePageDescriptions":true},"wgPreferredVariant":"en","wgRelatedArticles":null,"wgRelatedArticlesUseCirrusSearch":true,"wgRelatedArticlesOnlyUseCirrusSearch":false,"wgULSAcceptLanguageList":[],"wgULSCurrentAutonym":"English","wgCategoryTreePageCategoryOptions":"{\"mode\":0,\"hideprefix\":20,\"showcount\":true,\"namespaces\":false}","wgNoticeProject":"wikipedia","wgCentralNoticeCategoriesUsingLegacy":["Fundraising","fundraising"],"wgCentralAuthMobileDomain":false,"wgWikibaseItemId":"Q23413","wgVisualEditorToolbarScrollOffset":0});mw.loader.implement("user.options",function($,jQuery){mw.user.options.set({"variant":"en"});});mw.loader.implement("user.tokens",function ( $, jQuery ) {\nmw.user.tokens.set({"editToken":"+\\","patrolToken":"+\\","watchToken":"+\\","csrfToken":"+\\"});/#nomin*/;\n\n});mw.loader.load(["mw.MediaWikiPlayer.loader","mw.PopUpMediaTransform","mw.TMHGalleryHook.js","mediawiki.page.startup","mediawiki.legacy.wikibits","ext.centralauth.centralautologin","mmv.head","ext.visualEditor.desktopArticleTarget.init","ext.uls.init","ext.uls.interface","ext.centralNotice.bannerController","skins.vector.js"]);});\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCastle\n\nFrom Wikipedia, the free encyclopedia\n\n\n\t\t\t\t\tJump to:\t\t\t\t\tnavigation, \t\t\t\t\tsearch\n\n\n\n\n\nBodiam Castle in England surrounded by a water-filled moat.\n\n\n\n\n\n\nLichtenstein Castle\n\n\nA castle (from the Latin word castellum) is a fortified structure made in Europe and the Middle East during the Middle Ages. People argue about what the word castle means. However, it usually means a private structure of a lord or noble. This is different from a fortress, which is not a home, and from a fortified town, which was a public defence. For about 900\xc2\xa0years that castles were built they had many different shapes and different details.\nCastles began in Europe in the 9th and 10th centuries. They controlled the places surrounding them, and could both help in attacking and defending. Weapons could be fired from castles, or people could be protected from enemies in castles. However, castles were also a symbol of power. They could be used to control the people and roads around it.\nMany castles were built with earth and wood at first often using manual labour, and then had their defences replaced by stone instead. Early castles often used nature for protection, and did not have towers. By the late 12th and early 13th centuries, though, castles became longer and more complex.\n
file.writelines(str(soup.encode('UTF-8'))) is kind of insane, it's:
Encoding text (str) to binary (bytes)
Getting the text representation of that by wrapping in str (so it's what you'd type to recreate the binary bytes, but it's not the raw binary)
Writing that result one character at a time (writelines iterates what you give it, and strs iterate by character)
Step #3 is silly and inefficient, but mostly harmless. Step #1 would be fine if you then wrote the raw binary to a file opened for binary write and actually wrote the bytes object. But #1 and #2 together mean that stuff like a new line gets converted to a literal \n in the output, rather than actually breaking a line. Non-ASCII stuff like é is output as \xc3\xa9, and the whole thing is wrapped in b'' (or b"").
You want something like:
# open with UTF-8 encoding (in case your system defaults to something else)
with open("TextOutput.txt", 'w', encoding='utf-8') as file:
# Get the text and write it as a single block
file.write(soup.get_text())
I have this short example to demonstrate my problem:
from lxml import html
post = """<p>This a page with URLs.
This goes to
Google<br/>
This
goes to Yahoo!<br/>
<a
href="http://example.com">This is invalid due to that
line feed character</p>
"""
doc = html.fromstring(post)
for link in doc.xpath('//a'):
print link.get('href')
This outputs:
http://google.com
http://yahoo.com
None
The problem is that my data has
characters embedded in it. For my last link, it is embedded directly between the anchor and the href attribute. The line feeds outside of the elements are important to me.
doc.xpath('//a') correctly saw the <a
href="http://example.com"> as a link, but it can't access the href attribute when I do link.get('href').
How can I clean the data if link.get('href') returns None, so that I can still retrieve the discovered href attribute?
I can't strip all of the
characters from the entire post element as the ones in the text are important.
Module unidecode
Since you need the data outside of the tags, you could try using unidecode. It doesn't tackle Chinese and Korean, but it'll do things like change left and right quotes to ASCII quotes. It should help with these
characters as well, changing them to spaces instead of non-breaking spaces. Hopefully that's all you need in regards to preserving the other data. str.replace(u"\#xa", u" ") is less heavy handed if the ascii space is okay.
import unidecode, urllib2
from lxml import html
html_text = urllib2.urlopen("http://www.yourwebsite.com")
ascii_text = unidecode.unidecode(html_text)
html.fromstring(ascii_text)
Explanation of issue
There seems to be a known issue with this in several versions of Python. And it's C# as well. A related closed issue seems to indicate that the issue was closed because XML attribute tags aren't built to support carriage returns, so escaping it in all xml contexts would be silly. As it turns out, the W3C spec requires that the unicode be put in when parsing (see sec. 1).
All line breaks must have been normalized on input to #xA as described in 2.11 End-of-Line Handling, so the rest of this algorithm operates on text normalized in this way.
You may solve your specific problem with:
post = post.replace('
', '\n')
Resulting test program:
from lxml import html
post = """<p>This a page with URLs.
This goes to
Google<br/>
This
goes to Yahoo!<br/>
<a
href="http://example.com">This is invalid due to that
line feed character</p>
"""
post = post.replace('
', '\n')
doc = html.fromstring(post)
for link in doc.xpath('//a'):
print link.get('href')
Output:
http://google.com
http://yahoo.com
http://example.com
I am using BeautifulSoup to scrape data from a webpage. I want to compare the website data with text that is in a .txt document. However, I seem to be having encoding issues.
The website has the text "heat oven to 400°" The text also appears like this in "view source" (no html entities.)
The website is read using beautifulSoup:
source = "my url".read()
....
soup = BeautifulSoup(source)
The text document was created by making a new text doc encoded as "Encode in UTF-8 without BOM". I then copy-pasted "heat oven to 400°" from the website into the text doc and saved.
The text file is read as
f = codecs.open('myfilename', encoding='utf-8')
When I compare the two strings, they are not equal, but I want them to be.
To see what is going on: In Eclipse, I split the two texts and, looking at the variables in debug mode, I see that the degree sign from BeautifulSoup appears as \xc2 \xb0. The degree sign from the text doc just appears as \xb0.
Why, and how do I fix it? I'm having this issue with many special chars so I need a general solution. Also, I will be copy-pasting data from several sites into the text doc.
Looks like Beautiful Soup doesn't have what it needs in order to detect the encoding correctly. You can give a hint by replacing BeautifulSoup(source) with BeautifulSoup(source, fromEncoding='UTF-8'). More options and information are online at "Beautiful Soup Gives You Unicode, Dammit".
The bytes '\xc2\xb0' are what you get when the UTF-8 encoding of Unicode code point U+00B0 is mistaken for Beautiful Soup's last-resort guess at the encoding, which is Windows 1252.
I'm trying to build an xml document from scratch using xml.dom.minidom. Everything was going well until I tried to make a text node with a ® (Registered Trademark) symbol in. My objective is for when I finally hit print mydoc.toxml() this particular node will actually contain a ® symbol.
First I tried:
import xml.dom.minidom as mdom
data = '®'
which gives the rather obvious error of:
File "C:\src\python\HTMLGen\test2.py", line 3
SyntaxError: Non-ASCII character '\xae' in file C:\src\python\HTMLGen\test2.py on line 3, but no encoding declared; see http://www.python.or
g/peps/pep-0263.html for details
I have of course also tried changing the encoding of my python script to 'utf-8' using the opening line comment method, but this didn't help.
So I thought
import xml.dom.minidom as mdom
data = '®' #Both accepted xml encodings for registered trademark
data = '®'
text = mdom.Text()
text.data = data
print data
print text.toxml()
But because when I print text.toxml(), the ampersands are being escaped, I get this output:
®
®
My question is, does anybody know of a way that I can force the ampersands not to be escaped in the output, so that I can have my special character reference carry through to the XML document?
Basically, for this node, I want print text.toxml() to produce output of ® or ® in a happy and cooperative way!
EDIT 1:
By the way, if minidom actually doesn't have this capacity, I am perfectly happy using another module that you can recommend which does.
EDIT 2:
As Hugh suggested, I tried using data = u'®' (while also using data # -*- coding: utf-8 -*- Python source tags). This almost helped in the sense that it actually caused the ® symbol itself to be outputted to my xml. This is actually not the result I am looking for. As you may have guessed by now (and perhaps I should have specified earlier) this xml document happens to be an HTML page, which needs to work in a browser. So having ® in the document ends up causing rubbish in the browser (® to be precise!).
I also tried:
data = unichr(174)
text.data = data.encode('ascii','xmlcharrefreplace')
print text.toxml()
But of course this lead to the same origional problem where all that happens is the ampersand gets escaped by .toxml().
My ideal scenario would be some way of escaping the ampersand so that the XML printing function won't "escape" it on my behalf for the document (in other words, achieving my original goal of having ® or ® appear in the document).
Seems like soon I'm going to have to resort to regular expressions!
EDIT 2a:
Or perhaps not. Seems like getting my html meta information correct <META http-equiv="Content-Type" Content="text/html; charset=UTF-8"> could help, but I'm not sure yet how this fits in with the xml structure...
Two options that work, one with the escaping ® and the other without. It's not really obvious why you want escaping ... it's 6 bytes instead of the 2 or 3 bytes for non-CJK characters.
import xml.dom.minidom as mdom
text = mdom.Text()
# Start with unicode
text.data = u'\xae'
f = open('reg1.html', 'w')
f.write("header saying the file is ascii")
uxml = text.toxml()
bxml = uxml.encode('ascii', 'xmlcharrefreplace')
f.write(bxml)
f.close()
f = open('reg2.html', 'w')
f.write("header saying the file is UTF-8")
xml = text.toxml(encoding='UTF-8')
f.write(xml)
f.close()
If I understand correctly, what you really want is to be able to create a text node from a unicode object (e.g. u'®' or u'\u00ae') and then have toxml() output unicode characters encoded as entities (e.g. ®). Looking at the source of minidom.py, however, it seems that minidom doesn't support entity encoding on output except the special cases of &, ", < and >.
You also ask about alternative modules that could help, however. There are several possible candidates, but ElementTree (xml.etree) seems to do the appropriate encoding. For example, if you take the first example from this blog post by Doug Hellmann but replace:
child_with_tail.text = 'This child has regular text.'
... with:
child_with_tail.text = u'This child has regular text \u00ae.'
... and run the script, you should see the output contains:
This child has regular text®.
You could also use the lxml implementation of ElementTree in that example just by replacing the import statement with:
from lxml.etree import Element, SubElement, Comment, tostring
Update: the alternative answer from John Machin takes the nice approach of running .encode('ascii', 'xmlcharrefreplace') on the output from minidom's toxml(), which converts any non-ASCII characters to their equivalent XML numeric character references.
Default unescape:
from xml.sax.saxutils import unescape
unescape("< & >")
The result is,
'< & >'
And, unescape more:
unescape("' "", {"'": "'", """: '"'})
Check details here, https://wiki.python.org/moin/EscapingXml
This question already has answers here:
Decode HTML entities in Python string?
(6 answers)
Closed 7 years ago.
I want to scrape some information off a football (soccer) web page using simple python regexp's. The problem is that players such as the first chap, ÄÄRITALO, comes out as ÄÄRITALO!
That is, html uses escaped markup for the special characters, such as Ä
Is there a simple way of reading the html into the correct python string? If it was XML/XHTML it would be easy, the parser would do it.
I would recommend BeautifulSoup for HTML scraping. You also need to tell it to convert HTML entities to the corresponding Unicode characters, like so:
>>> from BeautifulSoup import BeautifulSoup
>>> html = "<html>ÄÄRITALO!</html>"
>>> soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)
>>> print soup.contents[0].string
ÄÄRITALO!
(It would be nice if the standard codecs module included a codec for this, such that you could do "some_string".decode('html_entities') but unfortunately it doesn't!)
EDIT:
Another solution:
Python developer Fredrik Lundh (author of elementtree, among other things) has a function to unsecape HTML entities on his website, which works with decimal, hex and named entities (BeautifulSoup will not work with the hex ones).
Try using BeautifulSoup. It should do the trick and give you a nicely formatted DOM to work with as well.
This blog entry seems to have had some success with it.
I haven't tried it myself, but have you tried
http://zesty.ca/python/scrape.html ?
It seems to have a method htmldecode(text) which would do what you want.