Getting text with accented characters using Python and Selenium - python

I made a scraping script with python and selenium. It scrapes data from a Spanish language website:
for i, line in enumerate(browser.find_elements_by_xpath(xpath)):
    tds = line.find_elements_by_tag_name('td')  # takes the <td> tags from the line
    print tds[0].text  # FIRST PRINT
    if len(tds) % 2 == 0:  # takes data only from lines with an even number of cells
        data.append([u"".join(tds[0].text), u"".join(tds[1].text), ])
        print data  # SECOND PRINT
The first print statement gives me a normal Spanish string. But the second print gives me a string like this: "Data de Distribui\u00e7\u00e3o".
What's the reason for this?

You are mixing string types:
u''  # unicode string
b''  # byte string
More to the point, the second print outputs a whole list. When Python 2 prints a list, it shows each element's repr(), and the repr of a unicode string displays non-ASCII characters as \uXXXX escape sequences. The first print outputs the string directly, so the accented characters are rendered normally.

To use accented characters reliably, you may need to encode or decode them explicitly. In Python 2:
accent_char = "ôâ"
name = accent_char.decode('utf-8')
print(name)
The code above decodes the UTF-8 byte string into a unicode string. (In Python 3, str objects are already Unicode, so no decode step is needed.)
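For reference, here is a minimal sketch of the same round trip in Python 3, where str is already Unicode and the explicit step is encoding to bytes:

```python
# Python 3: str is Unicode; encode() goes to bytes, decode() comes back.
accented = 'ôâ'
as_bytes = accented.encode('utf-8')   # str -> bytes
print(as_bytes)                       # b'\xc3\xb4\xc3\xa2'
print(as_bytes.decode('utf-8'))       # bytes -> str: ôâ
```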

Related

How to change the coding for python array?

I use the following code to scrape a table from a Chinese website. It works fine. But it seems that the contents I stored in the list are not shown properly.
import requests
from bs4 import BeautifulSoup
import pandas as pd

x = requests.get('http://www.sohu.com/a/79780904_126549')
bs = BeautifulSoup(x.text, 'lxml')
clg_list = []
for tr in bs.find_all('tr'):
    tds = tr.find_all('td')
    for i in range(len(tds)):
        clg_list.append(tds[i].text)
        print(tds[i].text)
When I print the text, it shows Chinese characters. But when I print out the list, it's showing \u4e00\u671f\uff0834\u6240\uff09'. I am not sure if I should change the encoding or something else is wrong.
There is nothing wrong in this case.
When you print a Python list, Python calls repr on each of the list's elements. In Python 2, the repr of a unicode string shows the Unicode code points of the characters that make up the string.
>>> c = clg_list[0]
>>> c # Ask the interpreter to display the repr of c
u'\u201c985\u201d\u5de5\u7a0b\u5927\u5b66\u540d\u5355\uff08\u622a\u6b62\u52302011\u5e743\u670831\u65e5\uff09'
However, if you print the string itself, Python encodes the unicode string with a text encoding (for example, UTF-8) and your terminal displays the matching characters.
>>> print c
“985”工程大学名单(截止到2011年3月31日)
Note that in Python 3, printing the list will show the Chinese characters as you expect, thanks to Python 3's better Unicode handling.
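A short sketch of the repr-vs-print distinction; in Python 3 the built-in ascii() reproduces the escaped form that Python 2's repr showed:

```python
# repr vs print for non-ASCII strings (Python 3).
s = '名单'        # "name list" in Chinese
print(s)          # prints the characters themselves: 名单
print(repr(s))    # Python 3 repr keeps printable non-ASCII: '名单'
print(ascii(s))   # escaped form, like Python 2's repr: '\u540d\u5355'
```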

Python XML Compatible String

I am writing an XML file using lxml and am having issues with control characters. I am reading text from a file to assign to an element that contains control characters. When I run the script I receive this error:
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
So I wrote a small function to replace the control characters with a '?'. When I look at the generated XML, it appears the control characters are newlines (0x0A). With this knowledge I wrote a function to encode these control characters:
def encodeXMLText(text):
    text = text.replace("&", "&amp;")
    text = text.replace("\"", "&quot;")
    text = text.replace("'", "&apos;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    text = text.replace("\n", "&#10;")
    text = text.replace("\r", "&#13;")
    return text
This still returns the same error as before. I want to preserve the new lines so simply stripping them isn't a valid option for me. No idea what I am doing wrong at this point. I am looking for a way to do this with lxml, similar to this:
ruleTitle = ET.SubElement(rule,'title')
ruleTitle.text = encodeXMLText(titleText)
The other questions I have read either don't use lxml or don't address newline (\n) and carriage return (\r) characters as control characters.
I printed out the string to see which specific characters were causing the issue and noticed these bytes in the text: \xe2\x80\x99 (the UTF-8 encoding of a right single quotation mark, U+2019). So the issue was the encoding; changing the code as follows fixed my issue:
ruleTitle = ET.SubElement(rule,'title')
ruleTitle.text = titleText.decode('UTF-8')
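A minimal self-contained sketch of that fix, using the stdlib ElementTree here rather than lxml (element names are hypothetical): decode the raw UTF-8 bytes into a text string before assigning it to the element.

```python
import xml.etree.ElementTree as ET

rule = ET.Element('rule')
title = ET.SubElement(rule, 'title')

raw = b'Rule \xe2\x80\x99quoted\xe2\x80\x99 title'  # UTF-8 bytes, e.g. read from a file
title.text = raw.decode('utf-8')                    # decode to a text string first

print(ET.tostring(rule, encoding='unicode'))
# <rule><title>Rule ’quoted’ title</title></rule>
```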

Unable to decode utf8 in python to display hindi language in the shell [duplicate]

I'm having a hard time trying to generate a list from a string with a proper UTF-8 encoding using Python (I'm just learning to program, so bear with my silly question/terrible coding).
The source file is a tweet feed (JSON format). After parsing it successfully and extracting the tweet message from all the rest, I manage to get the text with the right encoding only after a print (as a string). If I try to put it back into list form, it goes back to the escaped u'\uXXXX' form.
My code is:
import json

with open("file_name.txt") as tweets_file:
    tweets_list = []
    for a in tweets_file:
        b = json.loads(a)
        tweets_list.append(b)

tweet = []
for i in tweets_list:
    key = "text"
    if key in i:
        t = i["text"]
        tweet.append(t)

for k in tweet:
    print k.encode("utf-8")
As an alternative, I tried to have the encoding at the beginning (when fetching the file):
import json
import codecs

tweets_file = codecs.open("file_name.txt", "r", "utf-8")
tweets_list = []
for a in tweets_file:
    b = json.loads(a)
    tweets_list.append(b)
tweets_file.close()

tweet = []
for i in tweets_list:
    key = "text"
    if key in i:
        t = i["text"]
        tweet.append(t)

for k in tweet:
    print k
My question is: how can I put the resulting k strings, into a list? With each k string as an item?
You are getting confused by the Python string representation.
When you print a Python list (or any other standard Python container), the contents are shown in a special representation to make debugging easier; each value is shown as the result of calling the repr() function on that value. For string values, that means the result is a string-literal representation, and that is not the same thing as what you see when the string is printed directly.
Unicode and byte strings, when shown like that, are presented as string literals: quoted values that you can copy and paste straight back into Python code without having to worry about encoding. Anything that is not a printable ASCII character is shown in escaped form. Unicode code points beyond the Latin-1 range are shown as '\u....' escape sequences; characters in the Latin-1 range use the '\x..' escape sequence. Many control characters are shown in their one-letter escape form, such as \n and \t.
The Python interactive prompt does the same thing; when you echo a value at the prompt without using print, the value is 'represented', i.e. shown in its repr() form:
>>> print u'\u2036Hello World!\u2033'
‶Hello World!″
>>> u'\u2036Hello World!\u2033'
u'\u2036Hello World!\u2033'
>>> [u'\u2036Hello World!\u2033', u'Another\nstring']
[u'\u2036Hello World!\u2033', u'Another\nstring']
>>> print _[1]
Another
string
This is entirely normal behaviour. In other words, your code works; nothing is broken.
To come back to your code, if you want to extract just the 'text' key from the tweet JSON structures, filter while reading the file, don't bother with looping twice:
import json

with open("file_name.txt") as tweets_file:
    tweets = []
    for line in tweets_file:
        data = json.loads(line)
        if 'text' in data:
            tweets.append(data['text'])

issue with converting html entities and encoding

I'm using this function to unescape HTML entities:
import re, htmlentitydefs

##
# Removes HTML or XML character references and entities from a text string.
#
# @param text The HTML (or XML) source text.
# @return The plain text, as a Unicode string, if necessary.
def unescape(text):
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # character reference
            try:
                if text[:3] == "&#x":
                    return unichr(int(text[3:-1], 16))
                else:
                    return unichr(int(text[2:-1]))
            except ValueError:
                pass
        else:
            # named entity
            try:
                text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
            except KeyError:
                pass
        return text  # leave as is
    return re.sub(r"&#?\w+;", fixup, text)
but when I try to process some text I get this error (most of the text works, but Python throws this for some inputs):
File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xae' in position 348: character maps to <undefined>
I have tried encoding the text string a million different ways, and nothing is working so far: ASCII, UTF-8, Unicode... all that stuff which I really don't understand.
Based on the error message, it looks like you may be attempting to convert a unicode string into CP 437 (an IBM PC character set). This doesn't appear to be occurring in your function, but could happen when attempting to print the resulting string to your console. I ran a quick test with the input string "® some text" and was able to reproduce the failure when printing the resulting string:
print unescape("® some text")
You can avoid this by specifying the encoding you want to convert the unicode string to:
print unescape("® some text").encode('utf-8')
You'll see garbled characters if you print this UTF-8-encoded string to a console that uses a different encoding; however, if you write it to a file and open it in a viewer that supports UTF-8 encoded documents, you should see the characters you expect.
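To illustrate the failure mode, here is a small Python 3 sketch: CP437 simply has no mapping for U+00AE, while UTF-8 can represent it.

```python
# CP437 (a legacy IBM PC code page) cannot represent U+00AE (®).
s = '\xae some text'
try:
    s.encode('cp437')
except UnicodeEncodeError as exc:
    print('cp437 failed:', exc)

print(s.encode('utf-8'))  # b'\xc2\xae some text'
```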
You need to post the FULL traceback so that we can see where in YOUR code the error happens. You also need to show us repr(a SMALL piece of data that has this problem) -- your data is at least 348 bytes long.
Based on the initially-supplied information:
You are crashing trying to encode a unicode character using cp437 ...
Either (1) the error is happening somewhere in your displayed code and somebody has kludged your default encoding to be cp437 (don't do that)
or (2) the error is not happening anywhere in the code that you have shown us, it is happening when you try to print some of the results of your function, you are running in a Windows "Command Prompt" window, and so your sys.stdout.encoding is set to some legacy MS-DOS encoding which doesn't support the U+00AE character.
You need to convert the result using the encode method, applying an encoding such as 'utf-8'. For example:
strdata = result.encode('utf-8')
print strdata
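As an aside, in Python 3 the unescaping recipe at the top of this question is covered by the standard library: html.unescape resolves both named and numeric character references.

```python
import html

# html.unescape handles named and numeric (decimal and hex) references.
print(html.unescape('&reg; some text &#8217;'))  # ® some text ’
print(html.unescape('&#x2019;'))                 # ’
```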

Python encoding problems

So, I've read a lot about Python encoding and such (maybe not enough, but I've been working on this for two days) and I'm still getting nowhere. I'll try to be as clear as I can. The main thing is that I'm trying to remove all accents and characters such as #, !, %, &...
The thing is, I do a query search on Twitter Search API with this call:
query = urllib2.urlopen(settings.SEARCH_URL + '?%s' % params)
Then, I call a method (avaliar_pesquisa()) to evaluate the results I've got, based on the tags (or terms) of the input:
dados = avaliar_pesquisa(simplejson.loads(query.read()), str(tags))
On avaliar_pesquisa(), the following happens:
def avaliar_pesquisa(dados, tags):
    resultados = []
    # Iterate over the results
    for i in dados['results']:
        resultados.append({'texto': i['text'],
                           'imagem': i['profile_image_url'],
                           'classificacao': avaliar_texto(i['text'], tags),
                           'timestamp': i['created_at'],
                           })
Note the avaliar_texto() which evaluates the Tweet text. And there's exactly the problem on the following lines:
def avaliar_texto(texto, tags):
    # Remove accents
    from unicodedata import normalize
    def strip_accents(txt):
        return normalize('NFKD', txt.decode('utf-8'))

    # Split
    texto_split = strip_accents(texto)
    texto_split = texto.lower().split()

    # Remove non-alpha characters
    import re
    pattern = re.compile('[\W_]+')
    texto_aux = []
    for i in texto_split:
        texto_aux.append(pattern.sub('', i))
    texto_split = texto_aux
The split doesn't really matter here.
The thing is, if I print the type of the variable texto in this last method, I may get str or unicode as the answer. If there is any kind of accent in the text, it comes through as unicode.
So, I get this error running the application, which receives 100 tweets max as a response:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 17: ordinal not in range(128)
For the following text:
Text: Agora o problema é com o speedy.
type 'unicode'
Any ideas?
See this page.
The decode() method is to be applied to a str object, not a unicode object. Given a unicode string as input, it first tries to encode it to a str using the ascii codec, then decode as utf-8, which fails.
Try return normalize('NFKD', unicode(txt)).
This is what I used in my code to discard accents, etc.
text = unicodedata.normalize('NFD', text).encode('ascii','ignore')
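A Python 3 version of this accent-stripping idiom (in Python 3 there is no separate unicode type, so no decode step is needed):

```python
import unicodedata

def strip_accents(text):
    # Decompose accented characters, then drop the non-ASCII combining marks.
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

print(strip_accents('Agora o problema é com o speedy.'))
# Agora o problema e com o speedy.
```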
Try placing:
# -*- coding: utf-8 -*-
at the beginning of your python script containing the code.
