terminaltables AsciiTable and encoding - Python

So I am trying to figure out how to properly encode text for the .table property in terminaltables, but so far no luck.
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import re
from terminaltables import AsciiTable
html = """<a href="#" class="link" title="Some title B">
Some text with åäö</a>"""
soup = BeautifulSoup(html, "html.parser")
test = soup.find('a').get_text(strip=True).encode('utf-8')
test_list = [test]
test_table = AsciiTable(test_list)
print test_list
print test
print test_table.table
This will print
['Some text with \xc3\xa5\xc3\xa4\xc3\xb6']
Some text with åäö
And this is the traceback for test_table.table:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 0: unexpected end of data

The documentation for terminaltables states that the argument to AsciiTable should be a list of lists of strings.
Changing your definition of test_list to test_list = [[test]] prints out a table for me.
Dropping the encode('utf-8') from the definition of test makes it show up nicely on my terminal.
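One plausible explanation for the traceback: when handed encoded bytes, the library ends up decoding a partial UTF-8 sequence. The same error can be reproduced with the stdlib alone (Python 3 syntax; the slicing below only imitates what may happen internally):

```python
s = 'Some text with åäö'
data = s.encode('utf-8')          # 'å' becomes the two bytes 0xc3 0xa5

# Decoding only the first byte of a two-byte sequence fails the same way:
try:
    data[15:16].decode('utf-8')   # first half of 'å', cut mid-character
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0xc3 ...: unexpected end of data
```

Keeping the text as a unicode string (no .encode call) avoids the issue entirely.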

Related

UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 6233: ordinal not in range(128)

I'm working on a new project but I can't fix the error in the title.
Here's the code:
#!/usr/bin/env python3
import urllib.request, urllib.parse

def start(url):
    source_code = urllib.request.urlopen(url).read()
    info = urllib.parse.parse_qs(source_code)
    print(info)

start('https://www.youtube.com/watch?v=YfRLJQlpMNw')
The error occurred because the byte string returned by read() was being decoded implicitly with the ASCII codec. So we need to convert the byte string to a unicode string ourselves, e.g. with
.decode('unicode_escape')
(note that unicode_escape is not a real character encoding and can mangle non-ASCII text; it merely avoids the exception here).
So the code will be:
#!/usr/bin/env python3
import urllib.request, urllib.parse

def start(url):
    source_code = urllib.request.urlopen(url).read()
    info = urllib.parse.parse_qs(source_code.decode('unicode_escape'))
    print(info)

start('https://www.youtube.com/watch?v=YfRLJQlpMNw')
Try this
source_code = urllib.request.urlopen(url).read().decode('utf-8')
The error message is self-explanatory: there is a byte 0xf0 in an input string that was expected to be an ASCII string.
You should have given the exact error message and the line on which it happened, but I can guess that it happened on info = urllib.parse.parse_qs(source_code), because parse_qs expects either a unicode string or an ASCII byte string.
The first question is why you call parse_qs on data coming from YouTube at all, because the Python Standard Library documentation says:
Parse a query string given as a string argument (data of type application/x-www-form-urlencoded). Data are returned as a dictionary. The dictionary keys are the unique query variable names and the values are lists of values for each name.
So you are going to split this on the = and & characters, interpreting it as a query string of the form key1=value11&key2=value2&key1=value12 and producing {'key1': ['value11', 'value12'], 'key2': ['value2']}.
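A quick stdlib demonstration of that behaviour (Python 3 module names):

```python
from urllib.parse import parse_qs

# parse_qs is designed for query strings, not for whole HTML pages
qs = 'key1=value11&key2=value2&key1=value12'
print(parse_qs(qs))
# {'key1': ['value11', 'value12'], 'key2': ['value2']}
```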
If you are sure that is what you want, you should first decode the byte string into a unicode string using the proper encoding, or, if unsure, Latin-1, which accepts any byte:
def start(url):
    source_code = urllib.request.urlopen(url).read().decode('latin1')
    info = urllib.parse.parse_qs(source_code)
    print(info)
This code is rather weird indeed. You are using a query-string parser to parse the contents of a whole web page.
So instead of parse_qs you should be using a proper HTML parser.
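As a sketch of that direction, the stdlib html.parser module can pull text out of markup without any query-string machinery (the tag choice and input string here are invented for illustration):

```python
from html.parser import HTMLParser

class TitleGrabber(HTMLParser):
    """Collects the text inside the first <title> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

p = TitleGrabber()
p.feed('<html><head><title>Demo page</title></head></html>')
print(p.title)  # Demo page
```

For real scraping work a dedicated library such as BeautifulSoup is usually more convenient, but the point is the same: pick a parser that matches the data format.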

hi § symbol unrecognized

Good morning.
I'm trying to do this, but it won't work.
Can you help me?
Thank you very much.
soup = BeautifulSoup(html_page)
titulo=soup.find('h3').get_text()
titulo=titulo.replace('§','')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
Define the coding and operate with unicode strings:
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
html_page = u"<h3>§ title here</h3>"
soup = BeautifulSoup(html_page, "html.parser")
titulo = soup.find('h3').get_text()
titulo = titulo.replace(u'§', '')
print(titulo)
Prints title here (a leading space remains where § was; .strip() removes it).
I'll explain clearly what the problem is:
by default, Python 2 assumes your source file is ASCII, so it rejects literal characters like "à" or "ò" in the script. To make Python accept those characters you have to put this at the top of your script:
# -*- coding: utf-8 -*-
This declaration tells the interpreter which encoding the source file uses, so such characters become valid in string literals.
Another (strongly discouraged) workaround is to change the interpreter's default encoding via the sys module:
import sys
reload(sys)  # reload restores setdefaultencoding, which site.py removes at startup
sys.setdefaultencoding('UTF8')  # here you choose the default encoding
Be aware that this hack can break libraries that rely on the default ASCII codec, which is why the function is removed in the first place.
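For completeness: in Python 3 this whole class of problem disappears, because every str is already a unicode string and no coding tricks are needed (a minimal sketch):

```python
# Python 3: str is unicode, so non-ASCII literals and .replace just work
titulo = "§ title here"
titulo = titulo.replace("§", "").strip()
print(titulo)  # title here
```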

Regex on unicode string

I am trying to download a few hundred Korean pages like this one:
http://homeplusexpress.com/store/store_view.asp?cd_express=3
For each page, I want to use a regex to extract the "address" field, which in the above page looks like:
*주소 : 서울시 광진구 구의1동 236-53
So I do this:
>>> import requests
>>> resp=requests.get('http://homeplusexpress.com/store/store_view.asp?cd_express=3')
>>> resp.encoding
'ISO-8859-1'
>>> # I wonder why it's ISO-8859-1, since I thought that is for Latin text (Latin-1).
>>> html = resp.text
>>> type(html)
<type 'unicode'>
>>> html
(outputs a long string that contains many escaped byte sequences like \xc3\xb7\xaf\xbd\xba\xc0\xcd\xbd\xba\xc7\xc1\xb7\xb9)
I then wrote a script. I set # -*- coding: utf-8 -*- on the .py file and put this:
address = re.search('주소', html)
However, re.search is returning None. I tried with and without the u prefix on the regex string.
Usually I can solve issues like this with a call to .encode or .decode but I tried a few things and am stuck. Any pointers on what I'm missing?
According to the meta tag in the HTML document header:
<meta http-equiv="Content-Type" content="text/html; charset=euc-kr">
the web page uses the euc-kr encoding.
I wrote this code:
# -*- coding: euc-kr -*-
import re
import requests
resp=requests.get('http://homeplusexpress.com/store/store_view.asp?cd_express=3')
html = resp.text
address = re.search('주소', html)
print address
Then I saved it in gedit using the euc-kr encoding.
I got a match.
But actually there is an even better solution! You can keep the utf-8 encoding for your files.
# -*- coding: utf-8 -*-
import re
import requests
resp=requests.get('http://homeplusexpress.com/store/store_view.asp?cd_express=3')
resp.encoding = 'euc-kr'
# we need to specify what the encoding is because the
# requests library couldn't detect it correctly
html = resp.text
# now html is a proper unicode string, decoded from euc-kr
print type(html)
# we use the re.search functions with unicode strings
address = re.search(u'주소', html)
print address
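The effect of choosing the right codec can be simulated with a stdlib-only round trip, no network involved (the sample text below is encoded locally rather than fetched from the site; Python 3 syntax):

```python
import re

raw = '*주소 : 서울시 광진구'.encode('euc-kr')  # what the server sends
html = raw.decode('euc-kr')                     # what resp.text should hold
match = re.search('주소', html)
print(match.group(0))  # 주소
```

Decoded with any other codec (or left as bytes), the same search would fail, which is exactly what the question observed.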
From the requests documentation: "When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers."
If you check the website, you can see there is no charset in the server's response headers.
I think the only option in this case is to specify the encoding to use directly:
# -*- coding: utf-8 -*-
import requests
import re
r = requests.get('http://homeplusexpress.com/store/store_view.asp?cd_express=3')
r.encoding = 'euc-kr'
print re.search(ur'주소', r.text, re.UNICODE)

Handling Indian Languages in BeautifulSoup

I'm trying to scrape the NDTV website for news titles. This is the page I'm using as an HTML source. I'm using BeautifulSoup (bs4) to handle the HTML code, and I've got everything working, except that my code breaks when it encounters the Hindi titles on the page I linked to.
My code so far is:
import urllib2
from bs4 import BeautifulSoup

htmlUrl = "http://archives.ndtv.com/articles/2012-01.html"
FileName = "NDTV_2012_01.txt"

fptr = open(FileName, "w")
fptr.seek(0)

page = urllib2.urlopen(htmlUrl)
soup = BeautifulSoup(page, from_encoding="UTF-8")

li = soup.findAll('li')
for link_tag in li:
    hypref = link_tag.find('a').contents[0]
    strhyp = str(hypref)
    fptr.write(strhyp)
    fptr.write("\n")
The error I get is:
Traceback (most recent call last):
File "./ScrapeTemplate.py", line 30, in <module>
strhyp = str(hypref)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)
I got the same error even when I didn't include the from_encoding parameter. I initially used it as fromEncoding, but Python warned me that this usage is deprecated.
How do I fix this? From what I've read I need to either avoid the hindi titles or explicitly encode it into non-ascii text, but I don't know how to do that. Any help would be greatly appreciated!
What you see is a NavigableString instance (which is derived from the Python unicode type):
(Pdb) hypref.encode('utf-8')
'NDTV'
(Pdb) hypref.__class__
<class 'bs4.element.NavigableString'>
(Pdb) hypref.__class__.__bases__
(<type 'unicode'>, <class 'bs4.element.PageElement'>)
You need to encode it to UTF-8 before writing it to the file:
strhyp = hypref.encode('utf-8')
http://joelonsoftware.com/articles/Unicode.html
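An alternative, if you would rather not sprinkle encode() calls around, is to let the file object do the encoding with io.open (works on Python 2 and 3; the title and file path below are invented for illustration):

```python
import io
import os
import tempfile

title = 'नमस्ते दुनिया'  # a Hindi title, as a unicode string
path = os.path.join(tempfile.gettempdir(), 'NDTV_titles.txt')

# The file object encodes to UTF-8 on every write; no str()/encode() needed
with io.open(path, 'w', encoding='utf-8') as fptr:
    fptr.write(title + '\n')

with io.open(path, 'r', encoding='utf-8') as fptr:
    print(fptr.read())
```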

Python encoding error when turning MySQL query to XML

I generate custom XML files that have to be in a certain format with this script. It queries a database and turns the results into one big XML file. I do this with multiple databases, ranging from inventory parts lists to employee records.
import csv
import StringIO
import time
import MySQLdb
import lxml.etree
import lxml.builder
import string
import shutil
import glob
import os
import sys
import logging
from datetime import datetime
from time import sleep
from lxml import etree
from lxml.builder import E as buildE

def logWrite(message):
    logging.basicConfig(
        filename="C:\\logs\\XMLSyncOut.log",
        level=logging.DEBUG,
        format='%(asctime)s %(message)s',
        datefmt='%m/%d/%Y %I:%M:%S %p'
    )
    logging.debug(message)
def buildTag(tag, parent=None, content=None):
    element = buildE(tag)
    if content is not None:
        element.text = unicode(content)
    if parent is not None:
        parent.append(element)
    return element

def fetchXML(cursor):
    logWrite("constructing XML from cursor")
    fields = [x[0] for x in cursor.description]
    doc = buildTag('DATA')
    for record in cursor.fetchall():
        r = buildTag('ROW', parent=doc)
        for (k, v) in zip(fields, record):
            buildTag(k, content=v, parent=r)
    return doc
def updateDatabase1():
    try:
        conn = MySQLdb.connect(host='host', user='user', passwd='passwd', db='database')
        cursor = conn.cursor()
    except MySQLdb.Error:
        logWrite("Cannot connect to database - quitting!")  # log before exiting
        sys.exit(1)
    cursor.execute("SELECT * FROM database.table")
    logWrite("Dumping fields from database.table into cursor")
    xmlFile = open("results.xml", "w")
    doc = fetchXML(cursor)
    xmlFile.write(etree.tostring(doc, pretty_print=True))
    logWrite("Writing XML results.xml")
    xmlFile.close()
For some reason one of the new databases I imported from an Excel spreadsheet is hitting an encoding error that the others aren't. This is the error:
element.text = unicode(content)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x96 in position 21: ordinal not in range(128)
I tried explicitly encoding to ascii by changing the buildTag function to look like this:
def buildTag(tag, parent=None, content=None):
    element = buildE(tag)
    if content is not None:
        content = str(content).encode('ascii', 'ignore')
        element.text = content
    if parent is not None:
        parent.append(element)
    return element
This still didn't work.
Any ideas on what I can do to stop this? I can't escape them because I can't have "\x92" showing up in records as output.
I think your problem is a Windows encoding. You can try it in a shell:
In:  print '\x92'.decode('cp1251')
Out: ’
I'm focusing on
element.text = unicode(content)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x96 in position 21: ordinal not in range(128)
I am assuming that content is of type str, i.e. that it contains encoded bytes (Python 2). You have to know which encoding was used to generate those bytes. Then, in order to create a unicode object from them, you have to explicitly tell Python how to decode them, e.g.:
element.text = content.decode("utf-8")
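If the bytes came out of Excel on Windows, cp1252 is a likely candidate: byte 0x96 decodes there to an en dash rather than being dropped. A stdlib sketch (the sample bytes are invented; the actual encoding must be confirmed against your data):

```python
raw = b'Inventory \x96 parts'  # 0x96 is an en dash in cp1252
text = raw.decode('cp1252')
print(text)  # Inventory – parts
```

Decoding with the correct codec keeps the character in the XML output instead of silently discarding it, which addresses the "can't have \x92 showing up" concern.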
