Good morning.
I'm trying to do this, but it won't let me.
Can you help me?
Thank you very much.
soup = BeautifulSoup(html_page)
titulo=soup.find('h3').get_text()
titulo=titulo.replace('§','')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
Declare the source encoding and operate on unicode strings:
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
html_page = u"<h3>§ title here</h3>"
soup = BeautifulSoup(html_page, "html.parser")
titulo = soup.find('h3').get_text()
titulo = titulo.replace(u'§', '')
print(titulo)
Prints title here.
Let me explain clearly what the problem is:
By default, Python 2 assumes source files are plain ASCII, so it does not recognize characters like "à" or "ò" in your literals. To make Python recognize those characters you have to put this at the top of your script:
# -*- coding: utf-8 -*-
This declaration tells Python which encoding the source file uses, so characters that are not recognized by default parse correctly.
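For example, here is a minimal sketch of a script that only compiles because of that first line:
# -*- coding: utf-8 -*-
# Without the declaration above, Python 2 refuses to compile this file:
# SyntaxError: Non-ASCII character '\xc3' in file ... but no encoding declared
titulo = u"à and ò"
print titulo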
Another way to set the encoding is the sys module hack:
import sys
reload(sys)  # reload() restores sys.setdefaultencoding, which is removed at startup
sys.setdefaultencoding('UTF8')  # here you choose the default encoding
So I am trying to figure out how to properly encode the .table function in terminaltables but so far no luck.
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import re
from terminaltables import AsciiTable
html = """<a href="#" class="link" title="Some title B">
Some text with åäö</a>"""
soup = BeautifulSoup(html, "html.parser")
test = soup.find('a').get_text(strip=True).encode('utf-8')
test_list = [test]
test_table = AsciiTable(test_list)
print test_list
print test
print test_table.table
This will print
['Some text with \xc3\xa5\xc3\xa4\xc3\xb6']
Some text with åäö
And the traceback for the test_table.table
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 0: unexpected end of data
The documentation for terminaltables states that the argument to AsciiTable should be a list of lists of strings.
Changing your definition of test_list to test_list = [[test]] prints out a table for me.
Dropping the encode('utf-8') from the definition of test makes it show up nicely on my terminal.
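Putting both fixes together, a sketch of the corrected script (same sample HTML as above):
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
from terminaltables import AsciiTable

html = u"""<a href="#" class="link" title="Some title B">
Some text with åäö</a>"""

soup = BeautifulSoup(html, "html.parser")
test = soup.find('a').get_text(strip=True)  # keep it as a unicode string

test_table = AsciiTable([[test]])  # a list of rows, each row a list of cells
print test_table.table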
I'm new to Python and have to build an application to get historical data from Twitter. I can see the tweets in my console and all the information I need!
But the problem I have now is that I need to write this information to a .csv file but I encounter the following error while running the code:
"UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)". I know that the problem is that the Tweets I'm collecting are written in Swedish so its a frequent use of the letter "ÅÄÖ".
Can anyone please help me with this or have any pointers as to where i should start looking for a solution?
#!/usr/bin/python
# -*- coding: utf-8 -*-
from TwitterSearch import *
import csv
import codecs

try:
    tuo = TwitterUserOrder('Vismaspcs')  # create a TwitterUserOrder
    ts = TwitterSearch(
        consumer_key='',
        consumer_secret='',
        access_token='',
        access_token_secret=''
    )
    # start asking Twitter about the timeline
    for tweet in ts.search_tweets_iterable(tuo):
        print('#%s tweeted: %s' % (tweet['user']['screen_name'], tweet['text']))
        print(tweet['created_at'], tweet['favorite_count'], tweet['retweet_count'])
        with open('visma.csv', 'w') as fout:
            writer = csv.writer(fout)
            writer.writerows([tweet['user']['screen_name'], tweet['text'], tweet['created_at'], tweet['favorite_count'], tweet['retweet_count']])
except TwitterSearchException as e:  # catch all those ugly errors
    print(e)
The csv module cannot handle Unicode in Python 2. From the documentation:
Note: This version of the csv module doesn't support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe; see the examples in the Examples section.
You can use tweet['user']['screen_name'].encode("utf-8")...
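For example, a sketch of the write loop that encodes every unicode field before handing the row to the writer, reusing ts and tuo from the question's script (the to_utf8 helper is mine, not part of the question's code; note also that writerow writes a single row, while writerows expects a list of rows):
def to_utf8(value):
    # csv in Python 2 only handles bytes, so encode unicode fields
    if isinstance(value, unicode):
        return value.encode('utf-8')
    return value

with open('visma.csv', 'w') as fout:
    writer = csv.writer(fout)
    for tweet in ts.search_tweets_iterable(tuo):
        row = [tweet['user']['screen_name'], tweet['text'],
               tweet['created_at'], tweet['favorite_count'],
               tweet['retweet_count']]
        writer.writerow([to_utf8(v) for v in row])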
Thanks!
I use it as file.write(word['value'].encode("utf-8")) and it works too :)
You can also call .encode('utf8') on a string even when you're not writing it to a file.
I am trying to download a few hundred Korean pages like this one:
http://homeplusexpress.com/store/store_view.asp?cd_express=3
For each page, I want to use a regex to extract the "address" field, which in the above page looks like:
*주소 : 서울시 광진구 구의1동 236-53
So I do this:
>>> import requests
>>> resp=requests.get('http://homeplusexpress.com/store/store_view.asp?cd_express=3')
>>> resp.encoding
'ISO-8859-1'
>>> # I wonder why it's ISO-8859-1, since I thought that is for Latin text (Latin-1).
>>> html = resp.text
>>> type(html)
<type 'unicode'>
>>> html
(outputs a long string that contains many characters like \xc3\xb7\xaf\xbd\xba \xc0\xcd\xbd\xba\xc7\xc1\xb7\xb9)
I then wrote a script. I put # -*- coding: utf-8 -*- at the top of the .py file and added this:
address = re.search('주소', html)
However, re.search is returning None. I tried with and without the u prefix on the regex string.
Usually I can solve issues like this with a call to .encode or .decode but I tried a few things and am stuck. Any pointers on what I'm missing?
According to the meta tag in the HTML document's header:
<meta http-equiv="Content-Type" content="text/html; charset=euc-kr">
the web page uses the euc-kr encoding.
I wrote this code:
# -*- coding: euc-kr -*-
import re
import requests
resp=requests.get('http://homeplusexpress.com/store/store_view.asp?cd_express=3')
html = resp.text
address = re.search('주소', html)
print address
Then I saved it in gedit using the euc-kr encoding.
I got a match.
But actually there is an even better solution! You can keep the utf-8 encoding for your files.
# -*- coding: utf-8 -*-
import re
import requests
resp=requests.get('http://homeplusexpress.com/store/store_view.asp?cd_express=3')
resp.encoding = 'euc-kr'
# we need to specify what the encoding is because the
# requests library couldn't detect it correctly
html = resp.text
# now the html variable is a unicode instance (decoded from euc-kr)
print type(html)
# we use the re.search function with unicode strings
address = re.search(u'주소', html)
print address
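If you need the actual address text rather than just a match on the label, you can extend the pattern. A sketch (the capture group assumes the address runs up to the next tag, which may need adjusting for other pages):
m = re.search(u'주소\s*:\s*([^<]+)', html)
if m:
    print m.group(1).strip()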
From the requests documentation: "When you make a request, Requests makes educated guesses about the encoding of the response based on the HTTP headers."
If you check the website, you can see that the server response does not declare an encoding.
I think the only option in this case is to directly specify which encoding to use:
# -*- coding: utf-8 -*-
import requests
import re
r = requests.get('http://homeplusexpress.com/store/store_view.asp?cd_express=3')
r.encoding = 'euc-kr'
print re.search(ur'주소', r.text, re.UNICODE)
I have:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from urllib2 import urlopen
page2 = urlopen('http://pogoda.yandex.ru/moscow/').read().decode('utf-8')
page = urlopen('http://yasko.by/').read().decode('utf-8')
The "page2 = ..." line works, but the "page = ..." line raises "UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 32: invalid continuation byte". Why?
Cyrillic characters start at position 32 in the yasko.by page; how do I decode them correctly?
Thanks!
The content of http://yasko.by/ is encoded with windows-1251, while the content of http://pogoda.yandex.ru/moscow/ is encoded with utf-8.
The page = ... line should become:
page = urlopen('http://yasko.by/').read().decode('windows-1251')
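If you'd rather not hard-code the charset, you can read it from the Content-Type response header when the server sends one. A sketch (falls back to utf-8 when no charset is declared):
from urllib2 import urlopen

resp = urlopen('http://yasko.by/')
# e.g. "Content-Type: text/html; charset=windows-1251" -> "windows-1251"
charset = resp.headers.getparam('charset') or 'utf-8'
page = resp.read().decode(charset)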
This is the example from pycurl's SourceForge page. What if the URL contains non-ASCII characters, such as Chinese? How should we handle that, given that pycurl does not support unicode?
import pycurl
c = pycurl.Curl()
c.setopt(pycurl.URL, "http://www.python.org/")
c.setopt(pycurl.HTTPHEADER, ["Accept:"])
import StringIO
b = StringIO.StringIO()
c.setopt(pycurl.WRITEFUNCTION, b.write)
c.setopt(pycurl.FOLLOWLOCATION, 1)
c.setopt(pycurl.MAXREDIRS, 5)
c.perform()
print b.getvalue()
Here's a script that demonstrates three separate issues:
non-ascii characters in Python source code
non-ascii characters in the url
non-ascii characters in the html content
# -*- coding: utf-8 -*-
import urllib
from StringIO import StringIO
import pycurl
title = u"UNIX时间" # 1
url = "https://zh.wikipedia.org/wiki/" + urllib.quote(title.encode('utf-8')) # 2
c = pycurl.Curl()
c.setopt(pycurl.URL, url)
c.setopt(pycurl.HTTPHEADER, ["Accept:"])
b = StringIO()
c.setopt(pycurl.WRITEFUNCTION, b.write)
c.setopt(pycurl.FOLLOWLOCATION, 1)
c.setopt(pycurl.MAXREDIRS, 5)
c.perform()
data = b.getvalue() # bytes
print len(data), repr(data[:200])
html_page_charset = "utf-8" # 3
html_text = data.decode(html_page_charset)
print html_text[:200] # 4
Note: all the utf-8 occurrences in the code are completely independent of each other.
Unicode literals use whatever character encoding you declared at the top of the file; make sure your text editor respects that setting.
The path in the URL should be encoded as utf-8 before it is percent-encoded (urlencoded).
There are several ways to find out an HTML page's charset; see Character encodings in HTML. Some libraries, such as requests mentioned by @Oz123, do it automatically:
# -*- coding: utf-8 -*-
import requests
r = requests.get(u"https://zh.wikipedia.org/wiki/UNIX时间")
print len(r.content), repr(r.content[:200]) # bytes
print r.encoding
print r.text[:200] # Unicode
To print Unicode to the console you can set the PYTHONIOENCODING environment variable to a character encoding that your terminal understands.
See also The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) and Python-specific Pragmatic Unicode.
Try urllib.quote, which will replace non-ASCII characters by an escape sequence:
import urllib
# quote() expects bytes in Python 2, so encode the unicode URL first
url_to_fetch = urllib.quote(unicode_url.encode('utf-8'))
Edit: only the path should be quoted; you will have to split the complete URL with urlparse, quote the path, and then use urlunparse to obtain the final URL to fetch, as in the sketch below.
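A sketch of that split, quote, and reassemble approach in Python 2 (the example URL is mine, with non-ASCII characters only in the path):
# -*- coding: utf-8 -*-
import urllib
import urlparse

unicode_url = u"https://zh.wikipedia.org/wiki/UNIX时间"

parts = urlparse.urlparse(unicode_url)
# percent-encode only the path, after encoding it to utf-8 bytes
quoted_path = urllib.quote(parts.path.encode('utf-8'))
url_to_fetch = urlparse.urlunparse((parts.scheme, parts.netloc, quoted_path,
                                    parts.params, parts.query, parts.fragment))
print url_to_fetch  # https://zh.wikipedia.org/wiki/UNIX%E6%97%B6%E9%97%B4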
Just encode your URL as "utf-8" and everything will be fine. From the docs [1]:
Under Python 3, the bytes type holds arbitrary encoded byte strings. PycURL will accept bytes values for all options where libcurl specifies a “string” argument:
>>> import pycurl
>>> c = pycurl.Curl()
>>> c.setopt(c.USERAGENT, b'Foo\xa9')
# ok
The str type holds Unicode data. PycURL will accept str values containing ASCII code points only:
>>> c.setopt(c.USERAGENT, 'Foo')
# ok
>>> c.setopt(c.USERAGENT, 'Foo\xa9')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xa9' in position 3:
ordinal not in range(128)
>>> c.setopt(c.USERAGENT, 'Foo\xa9'.encode('iso-8859-1'))
# ok
[1] http://pycurl.io/docs/latest/unicode.html