I use the following code to scrape a table from a Chinese website. It works fine. But it seems that the contents I stored in the list are not shown properly.
import requests
from bs4 import BeautifulSoup
import pandas as pd
x = requests.get('http://www.sohu.com/a/79780904_126549')
bs = BeautifulSoup(x.text,'lxml')
clg_list = []
for tr in bs.find_all('tr'):
    tds = tr.find_all('td')
    for i in range(len(tds)):
        clg_list.append(tds[i].text)
        print(tds[i].text)
When I print the text, it shows Chinese characters. But when I print out the list, it shows escaped text like u'\u4e00\u671f\uff0834\u6240\uff09' instead. I am not sure if I should change the encoding or something else is wrong.
There is nothing wrong in this case.
When you print a Python list, Python calls repr() on each of the list's elements. In Python 2, the repr of a unicode string shows the Unicode code points for the characters that make up the string.
>>> c = clg_list[0]
>>> c # Ask the interpreter to display the repr of c
u'\u201c985\u201d\u5de5\u7a0b\u5927\u5b66\u540d\u5355\uff08\u622a\u6b62\u52302011\u5e743\u670831\u65e5\uff09'
However, if you print the string, python encodes the unicode string with a text encoding (for example, utf-8) and your computer displays the characters that match the encoding.
>>> print c
“985”工程大学名单（截止到2011年3月31日）
Note that in Python 3, printing the list will show the Chinese characters as you expect, because of Python 3's better Unicode handling.
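For comparison, here is a minimal Python 3 sketch of the same scrape; the list display shows the characters directly because Python 3's repr() no longer escapes printable non-ASCII text:
import requests
from bs4 import BeautifulSoup

x = requests.get('http://www.sohu.com/a/79780904_126549')
bs = BeautifulSoup(x.text, 'lxml')
# Collect the text of every <td>, as in the question, just as a comprehension
clg_list = [td.text for tr in bs.find_all('tr') for td in tr.find_all('td')]
print(clg_list[:1])  # e.g. ['“985”工程大学名单（截止到2011年3月31日）']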
Related
Decoding normal URL-escaped characters is a fairly easy task with Python.
If you want to decode something like: Wikivoyage:%E5%88%A0%E9%99%A4%E8%A1%A8%E5%86%B3
All you need to use is:
import urllib.parse
urllib.parse.unquote('Wikivoyage:%E5%88%A0%E9%99%A4%E8%A1%A8%E5%86%B3')
And you get: 'Wikivoyage:删除表决'
However, I have identified some characters which this does not work with, namely 4-digit % encoded strings:
For example: %25D8
This apparently decodes to ◘
But if you use the urllib function I demonstrated previously, you get: %D8
I understand why this happens, the unquote command reads the %25 as a '%', which is what it normally translates to. Is there any way to get Python to read this properly? Especially in a string of similar characters?
The actual problem
In a comment you posted the real examples:
The data I am pulling from is just a list of url-encoded strings. One of the example strings I am trying to decode is represented as: %25D8%25A5%25D8%25B2%25D8%25A7%25D9%2584%25D8%25A9_%25D8%25A7%25D9%2584%25D8%25B4%25D8%25B9%25D8%25B1_%25D8%25A8%25D8%25A7%25D9%2584%25D9%2584%25D9%258A%25D8%25B2%25D8%25B1 This is the raw form of it. Other strings are normal url escapes such as: %D8%A5%D9%88%D8%B2
The first one is double-quoted, as wim pointed out. So they unquote as: إزالة_الشعر_بالليزر and إوز (which are Arabic for "laser hair removal" and "geese").
So you were mistaken about the unquoting and ◘ is a red herring.
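You can reproduce the diagnosis by unquoting the double-encoded form of the second example twice (Python 3 shown; the double-encoded string below is constructed just for illustration):
import urllib.parse

s = '%25D8%25A5%25D9%2588%25D8%25B2'      # double-encoded form of the second example
once = urllib.parse.unquote(s)            # '%D8%A5%D9%88%D8%B2'
twice = urllib.parse.unquote(once)        # 'إوز'
print(once, twice)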
Solution
Ideally you would fix whatever gave you this inconsistent data, but if nothing else, you could try detecting double-quoted strings, for example, by checking if the number of % equals the number of %25.
def unquote_possibly_double_quoted(s: str) -> str:
    if s.count('%') == s.count('%25'):
        # Double
        s = urllib.parse.unquote(s)
    return urllib.parse.unquote(s)
>>> s = '%25D8%25A5%25D8%25B2%25D8%25A7%25D9%2584%25D8%25A9_%25D8%25A7%25D9%2584%25D8%25B4%25D8%25B9%25D8%25B1_%25D8%25A8%25D8%25A7%25D9%2584%25D9%2584%25D9%258A%25D8%25B2%25D8%25B1'
>>> unquote_possibly_double_quoted(s)
'إزالة_الشعر_بالليزر'
>>> unquote_possibly_double_quoted('%D8%A5%D9%88%D8%B2')
'إوز'
You might want to add some checks to this, like for example, s.count('%') > 0 (or '%' in s).
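A variant with that kind of guard might look like this (just a sketch; adapt it to whatever invariants your data actually has):
import urllib.parse

def unquote_possibly_double_quoted(s: str) -> str:
    # Only undo an extra level if the string contains any '%' at all
    # and every '%' is part of a '%25' sequence.
    if '%' in s and s.count('%') == s.count('%25'):
        s = urllib.parse.unquote(s)
    return urllib.parse.unquote(s)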
Please help because this flipping program is my ongoing nightmare!
I have several files that include some base64 encoded strings.
Part of one file for examples reads as follows:
charset=utf-8;base64,I2JhY2tydW5uZXJfUV81c3R7aGVpZ2h0OjkzcHg7fWJhY2tydW5uZXJfUV81c3R7ZGlzcGxheTpibG9jayFpbXBvcnRhbnQ7fQ=="
They are always in the format "ANYTHINGbase64,STRING"
It is HTML but I am treating it as one large string and using BeautifulSoup elsewhere. I am using a regex ('base' in the code below) to extract the base64 string, then using the base64 module to decode it in my function "debase".
This seems to work OK up to a point, but the output of b64decode for some reason adds unnecessary stuff around it:
b'#backrunner_Q_5st{height:93px;}backrunner_Q_5st{display:block!important;}' where the string I want is the part in the middle.
I'm guessing this means it's in bytes, so I have tried getting my function to encode this as UTF-8, but basically I am out of my depth.
The end result that I want is for all "base64,STRING" in my html to be decoded and replaced with DECODEDSTRING.
Please help!
import os, sys, bs4, re, base64, codecs
from bs4 import BeautifulSoup
def debase(instr):
    outstring = base64.b64decode(instr)
    outstring = codecs.utf_8_encode(str(outstring))
    outstring.split("'")[1]
    return outstring
base = re.compile('base64,(.*?)"')
for eachArg in sys.argv[1:]:
    a = open(eachArg, 'r', encoding='utf8')
    presoup = a.read()
    b = re.findall(base, presoup)
    for value in b:
        re.sub('base64,.*?"', debase(value))
        print(debase(value))
    soup = BeautifulSoup(presoup, 'lxml')
    bname = str(eachArg).split('.')[0]
    a.close()
    [s.extract() for s in soup('script')]
    os.remove(eachArg)
    b = open(bname + '.html', 'w', encoding='utf8')
    b.write(soup.prettify())
    b.close()
Your input is a bit oddly formatted (with a trailing unmatched double quote, for instance), so make sure you're not doing unnecessary work or parsing content in a weird way.
Anyway, assuming you have your input in the form it's given, you have to decode it using base64 in the way you just did, then decode using the given encoding to get a string rather than a bytestring:
import base64
inp = 'charset=utf-8;base64,I2JhY2tydW5uZXJfUV81c3R7aGVpZ2h0OjkzcHg7fWJhY2tydW5uZXJfUV81c3R7ZGlzcGxheTpibG9jayFpbXBvcnRhbnQ7fQ=="'
head,tail = inp.split(';')
_,enc = head.split('=') # TODO: check if the beginning is "charset"
_,msg = tail.split(',') # TODO: check that the beginning is "base64"
plaintext_bytes = base64.b64decode(msg)
plaintext_str = plaintext_bytes.decode(enc)
Now the two results are
>>> plaintext_bytes
b'#backrunner_Q_5st{height:93px;}backrunner_Q_5st{display:block!important;}'
>>> plaintext_str
'#backrunner_Q_5st{height:93px;}backrunner_Q_5st{display:block!important;}'
As you can see, the content of the bytes was already readable; that's because it is plain ASCII. Also note that I didn't remove the trailing quote from your string: base64 is smart enough to ignore what comes after the two equals signs in the content.
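You can check that tolerance in isolation: by default b64decode() simply discards characters that are not in the base64 alphabet (pass validate=True if you want it to raise instead). A quick demonstration with a made-up payload:
>>> import base64
>>> base64.b64decode('aGVsbG8gd29ybGQ=' + '"')  # the stray quote is discarded
b'hello world'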
In a nutshell, strings are a somewhat abstract representation of text in python 3, and you need a specific encoding if you want to represent the text with a stream of ones and zeros (which you need when you transfer data from one place to another). When you get a string in bytes, you have to know how it was encoded in order to decode it and obtain a proper string. If the string is ASCII-compatible then the encoding is fairly trivial, but once more general characters appear your code will break if you use the wrong encoding.
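A tiny illustration of that last point, using a made-up string with one non-ASCII character:
s = 'café'                   # str: abstract text
b = s.encode('utf-8')        # bytes: b'caf\xc3\xa9'
print(b.decode('utf-8'))     # café   (the right encoding round-trips)
print(b.decode('latin-1'))   # cafÃ©  (the wrong encoding gives mojibake)
# b.decode('ascii') would raise UnicodeDecodeError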
I made a scraping script with python and selenium. It scrapes data from a Spanish language website:
for i, line in enumerate(browser.find_elements_by_xpath(xpath)):
    tds = line.find_elements_by_tag_name('td')  # takes <td> tags from line
    print tds[0].text  # FIRST PRINT
    if len(tds) % 2 == 0:  # takes data from lines with even quantity of cells only
        data.append([u"".join(tds[0].text), u"".join(tds[1].text), ])
        print data  # SECOND PRINT
The first print statement gives me a normal Spanish string. But the second print gives me a string like this: "Data de Distribui\u00e7\u00e3o".
What's the reason for this?
You are mixing encodings:
u'' # unicode string
b'' # byte string
The text property of tds[0] is a byte string, which is encoding-agnostic, and in the second print you are operating with unicode strings, thus mixing the encodings.
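If that is what's happening in your Python 2 setup (i.e. the .text values really do come back as UTF-8 byte strings), decoding them before building the list keeps everything as unicode; a rough sketch:
# Python 2 sketch, assuming tds[i].text may be a UTF-8 byte string
def as_unicode(value):
    if isinstance(value, str):        # byte string in Python 2
        return value.decode('utf-8')
    return value                      # already a unicode string

data.append([as_unicode(tds[0].text), as_unicode(tds[1].text)])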
To use any type of accented character, you have to encode or decode it before using it:
accent_char = "ôâ"
name = accent_char.decode('utf-8')
print(name)
The above code will work for decoding the characters (in Python 2, where byte strings have a decode() method).
In parsing an HTML response to extract data with Python 3.4 on Kubuntu 15.10 in the Bash CLI, using print() I am getting output that looks like this:
\u05ea\u05d4 \u05e0\u05e9\u05de\u05e2 \u05de\u05e6\u05d5\u05d9\u05df
How would I output the actual text itself in my application?
This is the code generating the string:
response = requests.get(url)
messages = json.loads( extract_json(response.text) )
for k, v in messages.items():
    for message in v['foo']['bar']:
        print("\nFoobar: %s" % (message['body'],))
Here is the function which returns the JSON from the HTML page:
def extract_json(input_):
    """
    Get the JSON out of a webpage.
    The line of interest looks like this:
    foobar = ["{\"name\":\"dotan\",\"age\":38}"]
    """
    for line in input_.split('\n'):
        if 'foobar' in line:
            return line[line.find('"')+1:-2].replace(r'\"', r'"')
    return None
In googling the issue I've found quite a bit of information relating to Python 2; however, Python 3 has completely changed how strings, and especially Unicode, are handled.
How can I convert the example string (\u05ea) to characters (ת) in Python 3?
Addendum:
Here is some information regarding message['body']:
print(type(message['body']))
# Prints: <class 'str'>
print(message['body'])
# Prints: \u05ea\u05d4 \u05e0\u05e9\u05de\u05e2 \u05de\u05e6\u05d5\u05d9\u05df
print(repr(message['body']))
# Prints: '\\u05ea\\u05d4 \\u05e0\\u05e9\\u05de\\u05e2 \\u05de\\u05e6\\u05d5\\u05d9\\u05df'
print(message['body'].encode().decode())
# Prints: \u05ea\u05d4 \u05e0\u05e9\u05de\u05e2 \u05de\u05e6\u05d5\u05d9\u05df
print(message['body'].encode().decode('unicode-escape'))
# Prints: תה נשמע מצוין
Note that the last line does work as expected, but it has a few issues:
Decoding string literals with unicode-escape is the wrong thing as Python escapes are different to JSON escapes for many characters. (Thank you bobince)
encode() relies on the default encoding, which is a bad thing.(Thank you bobince)
The encode() fails on some newer Unicode characters, such as \ud83d\ude03, with UnicodeEncodeError "surrogates not allowed".
It appears your input uses backslash as an escape character. You should unescape the text before passing it to json:
>>> foobar = '{\\"body\\": \\"\\\\u05e9\\"}'
>>> import re
>>> json_text = re.sub(r'\\(.)', r'\1', foobar) # unescape
>>> import json
>>> print(json.loads(json_text)['body'])
ש
Don't use 'unicode-escape' encoding on JSON text; it may produce different results:
>>> import json
>>> json_text = '["\\ud83d\\ude02"]'
>>> json.loads(json_text)
['😂']
>>> json_text.encode('ascii', 'strict').decode('unicode-escape') #XXX don't do it
'["\ud83d\ude02"]'
'😂' == '\U0001F602' is U+1F602 (FACE WITH TEARS OF JOY).
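Applied to the extract_json() helper from the question, the same idea could look roughly like this (a sketch that keeps your line-scanning and slicing approach and only swaps the .replace() call for the regex unescape):
import re

def extract_json(input_):
    """Pull the foobar JSON payload out of the page (sketch)."""
    for line in input_.split('\n'):
        if 'foobar' in line:
            raw = line[line.find('"') + 1:-2]    # the quoted payload, as before
            return re.sub(r'\\(.)', r'\1', raw)  # unescape \" -> " and \\ -> \
    return None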
I'm having a hard time trying to generate a list from a string with proper UTF-8 encoding in Python (I'm just learning to program, so bear with my silly question/terrible coding).
The source file is a tweet feed (JSON format). After parsing it successfully and extracting the tweet message from all the rest, I manage to get the text with the right encoding only after a print (as a string). If I try to put it back into a list, it goes back to the escaped u'\uXXXX' form.
My code is:
import json
with open("file_name.txt") as tweets_file:
    tweets_list = []
    for a in tweets_file:
        b = json.loads(a)
        tweets_list.append(b)
tweet = []
for i in tweets_list:
    key = "text"
    if key in i:
        t = i["text"]
        tweet.append(t)
for k in tweet:
    print k.encode("utf-8")
As an alternative, I tried to have the encoding at the beginning (when fetching the file):
import json
import codecs
tweets_file = codecs.open("file_name.txt", "r", "utf-8")
tweets_list = []
for a in tweets_file:
    b = json.loads(a)
    tweets_list.append(b)
tweets_file.close()
tweet = []
for i in tweets_list:
    key = "text"
    if key in i:
        t = i["text"]
        tweet.append(t)
for k in tweet:
    print k
My question is: how can I put the resulting k strings, into a list? With each k string as an item?
You are getting confused by the Python string representation.
When you print a Python list (or any other standard Python container), the contents are shown in a special representation to make debugging easier; each value is shown as the result of calling the repr() function on that value. For string values, that means the result is a string representation, and that is not the same thing as what you see when the string is printed directly.
Unicode and byte strings, when shown like that, are presented as string literals: quoted values that you can copy and paste straight back into Python code without having to worry about encoding; anything that is not a printable ASCII character is shown in escaped form. Unicode code points beyond the Latin-1 range are shown as '\u....' escape sequences. Characters in the Latin-1 range use the '\x..' escape sequence. Many control characters are shown in their 1-letter escape form, such as \n and \t.
The Python interactive prompt does the same thing; when you echo a value at the prompt without using print, the value is 'represented', shown in its repr() form:
>>> print u'\u2036Hello World!\u2033'
‶Hello World!″
>>> u'\u2036Hello World!\u2033'
u'\u2036Hello World!\u2033'
>>> [u'\u2036Hello World!\u2033', u'Another\nstring']
[u'\u2036Hello World!\u2033', u'Another\nstring']
>>> print _[1]
Another
string
This is entirely normal behaviour. In other words, your code works, nothing is broken.
To come back to your code: if you want to extract just the 'text' key from the tweet JSON structures, filter while reading the file; don't bother with looping twice:
import json
with open("file_name.txt") as tweets_file:
    tweets = []
    for line in tweets_file:
        data = json.loads(line)
        if 'text' in data:
            tweets.append(data['text'])
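If you then want to see the actual characters rather than the escaped list representation, print the items (or a joined string) instead of echoing the list, for example (Python 2):
for text in tweets:
    print text

# or as one block of output:
print u'\n'.join(tweets)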