Right now my output to a file is like:
<b>Nov 22–24</b> <b>Nov 29–Dec 1</b> <b>Dec 6–8</b> <b>Dec 13–15</b> <b>Dec 20–22</b> <b>Dec 27–29</b> <b>Jan 3–5</b> <b>Jan 10–12</b> <b>Jan 17–19</b> <b><i>Jan 17–20</i></b> <b>Jan 24–26</b> <b>Jan 31–Feb 2</b> <b>Feb 7–9</b> <b>Feb 14–16</b> <b><i>Feb 14–17</i></b> <b>Feb 21–23</b> <b>Feb 28–Mar 2</b> <b>Mar 7–9</b> <b>Mar 14–16</b> <b>Mar 21–23</b> <b>Mar 28–30</b>
I want to remove all the "Â" characters and the HTML tags (<b>, </b>). I tried using the .remove and .replace functions but I get an error:
SyntaxError: Non-ASCII character '\xc2' in file -- FILE NAME-- on line 70, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
The output above is in a list, which I get from a webcrawling function:
import requests
from bs4 import BeautifulSoup

def getWeekend(item_url):
    href = item_url[:37] + "page=weekend&" + item_url[37:]
    response = requests.get(href)
    soup = BeautifulSoup(response.content, "lxml")  # or BeautifulSoup(response.content, "html5lib")
    dates = soup.select('table.chart-wide > tr > td > nobr > font > a > b')
    return dates
I write it to a file like so:
for item in listOfDate:
    wr.writerow(item)
How can I remove all the tags so that only the date is left?
I'm not sure, but I think re.sub('toFind', 'toReplace', aString) from the re module should work (plain Python strings have no regex-replace method). Either that or write it to a file and then run sed on it, like: sed -i 's/toFind/toReplace/g' filename
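A quick sketch of that re.sub idea, applied to one entry of the sample output above (the entry string is taken from the question):

import re

entry = u'<b>Nov 22\u201324</b>'
# Strip both the opening <b> and closing </b> tags in one pass.
print(re.sub(r'</?b>', u'', entry))  # Nov 22-24 (still with an en-dash)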
You already got a working solution, but for the future:
Use get_text() to get rid of the tags. Note that soup.select() returns a list of elements, so call get_text() on each one:
dates = [b.get_text() for b in soup.select('table.chart-wide > tr > td > nobr > font > a > b')]
Use .replace(u'\xc2', u'') to get rid of the Â. The u prefix makes u'\xc2' a unicode string. (This might take some futzing around with encodings, but for me get_text() already returns a unicode object.)
(Additionally, consider .replace(u'\u2013', u'-'), because right now you have an en-dash :P.)
dates = [d.replace(u'\xc2', u'').replace(u'\u2013', u'-') for d in dates]
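Putting it together with the CSV writing from the question, a sketch (it assumes soup from getWeekend() and an already-open csv.writer named wr, as in the question's code):

dates = [b.get_text().replace(u'\xc2', u'').replace(u'\u2013', u'-')
         for b in soup.select('table.chart-wide > tr > td > nobr > font > a > b')]
for item in dates:
    wr.writerow([item])  # wrap the string in a list: writerow expects a sequence of fields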
The problem is that you don't have an ASCII string from the website. You need to convert the non-ASCII text into something Python can understand before manipulating it.
Python will use Unicode when given a chance. There's plenty of information out there if you just have a look. For example, you can find more help from other questions on this website:
Python: Converting from ISO-8859-1/latin1 to UTF-8
python: unicode in Windows terminal, encoding used?
What is the difference between encode/decode?
If your Python 2 source code has literal non-ASCII characters such as Â, then you should declare the source-code encoding as the error message says. Put at the top of your Python file:
# -*- coding: utf-8 -*-
Make sure the file is saved using the UTF-8 encoding, and use Unicode strings to work with the text.
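A minimal sketch of a complete Python 2 file (the sample string is made up for illustration): with the declaration on the first line and the file saved as UTF-8, the literal Â is legal in the source.

# -*- coding: utf-8 -*-
s = u'Â Nov 22'
print s.replace(u'\xc2', u'').strip()  # prints: Nov 22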
My scrapy project is giving me a strange encoding for items when using CSS selectors.
Here is the relevant code:
Once the scrapy request is made and the webpage is downloaded, parse_page is called with the response...
def parse_page(self, response):
    # Using Selenium WebDriver to select elements
    records = self.driver.find_elements_by_css_selector('#searchResultsTable > tbody > tr')
    for record in records:
        # Convert the Selenium object into a scrapy.Selector object (necessary to use the .add_css methods)
        sel = Selector(text=record.get_attribute('outerHTML'))
        # Instantiate RecordLoader (a custom item loader)
        il = RecordLoader(item=Record(), selector=sel)
        # Select the element and pass it to example_field's input processor
        il.add_css('example_field', 'td:nth-child(2) > a::text')
il.add_css() passes the result of the CSS selector to example_field's input processor, which for demonstration purposes contains only print statements and shows the issue...
def example_field_input_processor(text_html):
    print(text_html)
    print(type(text_html))
    print(text_html.encode('utf-8'))
Output:
'\xa0\xa004/29/2020 10:50:24 AM,\xa0\xa0\xa0'
<class 'str'>
b'\xc2\xa0\xc2\xa004/29/2020 10:50:24 AM,\xc2\xa0\xc2\xa0\xc2\xa0'
Here are my questions:
1) Why is it that the CSS selector didn't just give me a normal Python string? Does it have to do with the CSS selector casting to text with ::text? Is it because the webpage is in a different encoding? I checked whether there was a <meta> tag that specified the site's encoding, but there wasn't one.
2) When I force an encoding of 'utf-8', why don't I get a normal Python string instead of a bytes string that shows all the Unicode characters?
3) My goal is to have just a normal Python string (no prepended b, no weird characters) that I can parse. How?
While scraping, you sometimes have to clean your results of stray Unicode characters.
They are usually the result of non-breaking spaces, tabs, and sometimes <span> elements.
As a common practice, clean all text you scrape:
def string_cleaner(rogue_text):
    return "".join(rogue_text.strip()).encode('ascii', 'ignore').decode('utf-8')
Explanation:
Use strip() and join() to clear the surrounding whitespace and reassemble the characters:
This part of the code: "".join(rogue_text.strip())
Then encode it to ASCII, ignoring anything non-ASCII, and decode it back to remove the special characters:
This part of the code: .encode('ascii', 'ignore').decode('utf-8')
How you would apply it in your code:
print(string_cleaner(text_html))
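An alternative sketch: \xa0 is a non-breaking space (U+00A0), so a targeted replacement keeps legitimate non-ASCII text instead of dropping it wholesale. Applied to the exact output shown above:

def example_field_input_processor(text_html):
    # Turn non-breaking spaces into ordinary spaces, then trim spaces/commas.
    return text_html.replace(u'\xa0', u' ').strip(' ,')

print(example_field_input_processor(u'\xa0\xa004/29/2020 10:50:24 AM,\xa0\xa0\xa0'))
# 04/29/2020 10:50:24 AM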
How can I remove unwanted characters from a long text using .replace() or anything of that sort? The symbols I wish to kick out of the text are ', {, }, [ and ] (commas are not included). My existing text is:
{'SearchText':'319 lizzie','ResultList':[{'PropertyQuickRefID':'R016698','PropertyType':'Real'}],'TaxYear':2018}
I tried with the below code:
content='''
{'SearchText':'319 lizzie','ResultList':[{'PropertyQuickRefID':'R016698','PropertyType':'Real'}],'TaxYear':2018}
'''
print(content.replace("'",""))
Output I got (by the way, if I keep chaining .replace().replace() with different symbols it works, but I wish to do the same in a single instance if possible):
{SearchText:319 lizzie,ResultList:[{PropertyQuickRefID:R016698,PropertyType:Real}],TaxYear:2018}
I wish I could use the replace function like .replace("',{,},[,]", ""). However, I'm not after any solution derived from regex; string manipulation is what I expected. Thanks in advance.
content=r"{'SearchText':'319 lizzie','ResultList':[{'PropertyQuickRefID':'R016698','PropertyType':'Real'}],'TaxYear':2018}"
igno = "{}[]''´´``''"
cleaned = ''.join([x for x in content if x not in igno])
print(cleaned)
PyFiddle 3.6:
SearchText:319 lizzie,ResultList:PropertyQuickRefID:R016698,PropertyType:Real,TaxYear:2018
In 2.7 I get an error:
Non-ASCII character '\xc2' in file main.py on line 3, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
which can be fixed by adding # This Python file uses the following encoding: utf-8 as the first line of the source code, which then gives identical output.
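Another single-pass, non-regex option is str.translate with a deletion table, which removes every listed character at once (Python 3 syntax; a sketch using the question's string):

content = "{'SearchText':'319 lizzie','ResultList':[{'PropertyQuickRefID':'R016698','PropertyType':'Real'}],'TaxYear':2018}"
# str.maketrans('', '', chars) builds a table that deletes each char in chars.
print(content.translate(str.maketrans('', '', "'{}[]")))
# SearchText:319 lizzie,ResultList:PropertyQuickRefID:R016698,PropertyType:Real,TaxYear:2018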
I am trying to analyze xml data, and encountered an issue with regard to HTML entities when I use
import xml.etree.ElementTree as ET

tree = ET.parse(my_xml_file)
root = tree.getroot()
for regex_rule in root.findall('.//regex_rule'):
    print(regex_rule.get('input'))  # this .get() method turns &lt; into <, but I want to get &lt; as written
    print(regex_rule.get('input') == "(?&lt;!\S)hello(?!\S)")  # prints False because ElementTree's get() turns &lt; into <, is that right?
And here are the XML file contents:
<rules>
    <regex_rule input="(?&lt;!\S)hello(?!\S)" output="world"/>
</rules>
I would appreciate it if anybody could direct me to getting the string as-is from the XML attribute for the input, without converting &lt; into <.
xml.etree.ElementTree is doing exactly the standards-compliant thing, which is to decode XML character entities with the understanding that they do in fact encode the referenced character and should be interpreted as such.
The preferred course of action, if you do need the literal text &lt; to survive parsing, is to change your input file to use &amp;lt; instead (i.e. we XML-encode the &).
If you can't change your input file format then you'll probably need to use a different module, or write your own parser: xml.etree.ElementTree translates entities well before you can do anything meaningful with the output.
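That said, if what you actually need is the entity-encoded form back after parsing, one sketch (using only the standard library) is to re-escape the decoded value with xml.sax.saxutils.escape:

import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

root = ET.fromstring('<rules><regex_rule input="(?&lt;!\\S)hello(?!\\S)" output="world"/></rules>')
raw = root.find('regex_rule').get('input')
print(raw)          # (?<!\S)hello(?!\S)    -- entities already decoded by the parser
print(escape(raw))  # (?&lt;!\S)hello(?!\S) -- the escaped form, recovered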
Update:
My code works fine on most Hebrew pages, but fails on about 10% of them. I was unfortunate enough to start with two 'bad' ones.
Here's an example of a 'good' page: http://m.sport5.co.il/Pages/Article.aspx?articleId=154765,
and this is a 'bad' one: http://www.havoda.org.il/Web/Default.aspx.
I still need to deal with the bad ones, and I still don't know how...
Original question:
I'm using lxml.html to parse HTML, and extract only text (to be later used for text classification). I couldn't manage to properly deal with unicode (Hebrew text, in my case).
The tree elements don't seem to be encoded correctly: when I look at element[i].text, where type(element[i].text) is UnicodeType, I see something like this: u'\xd7\x9e\xd7\xa9\xd7\x94 \xd7\xa9\xd7\xa8\xd7\xaa (1955-1954)', and this is not right: this entity cannot be encoded or decoded! (Or I haven't found how...) Printing it produces, of course, something like this: "××©× ×©×¨×ª (1955-1954)", and that's not Hebrew...
A workable text string should look like:
1. u'\u05de\u05e9\u05d4 \u05e9\u05e8\u05ea (1955-1954)' - a proper unicode string; or:
2. '\xd7\x9e\xd7\xa9\xd7\x94 \xd7\xa9\xd7\xa8\xd7\xaa (1955-1954)' - unicode encoded into a regular text string; but not:
3. u'\xd7\x9e\xd7\xa9\xd7\x94 \xd7\xa9\xd7\xa8\xd7\xaa (1955-1954)' - a useless hybrid entity ('ascii' codec can't decode byte...)
What do I do to solve it? What am I doing wrong? Here's the code I'm using:
from urllib2 import urlopen  # missing from the original snippet; this is Python 2 code
import lxml.html as lh
from types import StringType, UnicodeType

f = urlopen(url)
html = f.read()
root = lh.fromstring(html)
all_elements = root.cssselect('*')
all_text = ''
for i in range(len(all_elements)):
    if all_elements[i].tag not in ['script', 'style']:
        if type(all_elements[i].text) in [StringType, UnicodeType]:
            all_text = all_text + all_elements[i].text.strip() + ' '
Everything works just fine with pure English (non-Unicode) HTML.
Almost all of the answers here refer to lxml.etree, and not the lxml.html that I'm using. Do I have to switch? (I don't want to...)
Probably (though it's hard to know for sure without having the data), the page is UTF-8 encoded, but the HTML parser defaults to iso-8859-1 (as opposed to the XML parser, which defaults to UTF-8).
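A sketch of how to act on that, assuming the page really is UTF-8: either decode the bytes yourself before handing them to lxml, or pass a parser with an explicit encoding, instead of letting the HTML parser guess.

import lxml.html as lh

# Option 1: decode the raw bytes first (html is the bytes from f.read()).
root = lh.fromstring(html.decode('utf-8'))

# Option 2: the same idea with an explicit parser object.
parser = lh.HTMLParser(encoding='utf-8')
root = lh.fromstring(html, parser=parser)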
I'm trying to build an XML document from scratch using xml.dom.minidom. Everything was going well until I tried to make a text node with a ® (registered trademark) symbol in it. My objective is that when I finally call print mydoc.toxml(), this particular node will actually contain a ® symbol.
First I tried:
import xml.dom.minidom as mdom
data = '®'
which gives the rather obvious error of:
File "C:\src\python\HTMLGen\test2.py", line 3
SyntaxError: Non-ASCII character '\xae' in file C:\src\python\HTMLGen\test2.py on line 3, but no encoding declared; see http://www.python.or
g/peps/pep-0263.html for details
I have of course also tried changing the encoding of my python script to 'utf-8' using the opening line comment method, but this didn't help.
So I thought:
import xml.dom.minidom as mdom
data = '&#174;'  # Both accepted XML encodings for registered trademark
data = '&#xAE;'
text = mdom.Text()
text.data = data
print data
print text.toxml()
But because the ampersands are escaped when I print text.toxml(), I get this output:
&#xAE;
&amp;#xAE;
My question is, does anybody know of a way to force the ampersands not to be escaped in the output, so that my character reference carries through to the XML document?
Basically, for this node, I want print text.toxml() to produce output of &#174; or &#xAE; in a happy and cooperative way!
EDIT 1:
By the way, if minidom actually doesn't have this capacity, I am perfectly happy using another module that you can recommend which does.
EDIT 2:
As Hugh suggested, I tried using data = u'®' (while also using the # -*- coding: utf-8 -*- Python source tag). This almost helped, in the sense that it actually caused the ® symbol itself to be output to my XML. This is actually not the result I am looking for. As you may have guessed by now (and perhaps I should have specified earlier), this XML document happens to be an HTML page, which needs to work in a browser. So having ® in the document ends up causing rubbish in the browser (Â® to be precise!).
I also tried:
data = unichr(174)
text.data = data.encode('ascii', 'xmlcharrefreplace')
print text.toxml()
But of course this led to the same original problem, where the ampersand just gets escaped by .toxml().
My ideal scenario would be some way of escaping the ampersand so that the XML printing function won't "escape" it on my behalf (in other words, achieving my original goal of having &#174; or &#xAE; appear in the document).
Seems like soon I'm going to have to resort to regular expressions!
EDIT 2a:
Or perhaps not. It seems like getting my HTML meta information correct (<META http-equiv="Content-Type" Content="text/html; charset=UTF-8">) could help, but I'm not sure yet how this fits in with the XML structure...
Two options that work, one with the escaping (&#174;) and the other without. It's not really obvious why you want escaping ... it's 6 bytes instead of the 2 or 3 bytes that most non-CJK characters take.
import xml.dom.minidom as mdom

text = mdom.Text()
# Start with unicode
text.data = u'\xae'

f = open('reg1.html', 'w')
f.write("header saying the file is ascii")
uxml = text.toxml()
bxml = uxml.encode('ascii', 'xmlcharrefreplace')
f.write(bxml)
f.close()

f = open('reg2.html', 'w')
f.write("header saying the file is UTF-8")
xml = text.toxml(encoding='UTF-8')
f.write(xml)
f.close()
If I understand correctly, what you really want is to be able to create a text node from a unicode object (e.g. u'®' or u'\u00ae') and then have toxml() output Unicode characters encoded as entities (e.g. &#174;). Looking at the source of minidom.py, however, it seems that minidom doesn't support entity encoding on output, except for the special cases of &, ", < and >.
You also ask about alternative modules that could help, however. There are several possible candidates, but ElementTree (xml.etree) seems to do the appropriate encoding. For example, if you take the first example from this blog post by Doug Hellmann but replace:
child_with_tail.text = 'This child has regular text.'
... with:
child_with_tail.text = u'This child has regular text \u00ae.'
... and run the script, you should see the output contains:
This child has regular text &#174;.
You could also use the lxml implementation of ElementTree in that example just by replacing the import statement with:
from lxml.etree import Element, SubElement, Comment, tostring
Update: the alternative answer from John Machin takes the nice approach of running .encode('ascii', 'xmlcharrefreplace') on the output of minidom's toxml(), which converts any non-ASCII characters into their equivalent XML numeric character references.
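A self-contained sketch of the ElementTree behaviour described above (the element name is made up for illustration; serialising to an ASCII encoding forces non-ASCII characters into numeric references):

from xml.etree.ElementTree import Element, tostring

el = Element('p')
el.text = u'This child has regular text \u00ae.'
print(tostring(el, encoding='us-ascii'))
# b"<?xml version='1.0' encoding='us-ascii'?>\n<p>This child has regular text &#174;.</p>"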
Default unescape:
from xml.sax.saxutils import unescape
unescape("< & >")
The result is,
'< & >'
And, unescape more:
unescape("' "", {"'": "'", """: '"'})
Check details here, https://wiki.python.org/moin/EscapingXml