I want to know how to decode certain text, and have found some text like this which I want to decode:
\xe2\x80\x93
I know that printing it will display the character, but I am building a web crawler, hence I need to build an index (a dictionary) containing words with a list of URLs where each word appears.
Hence I want to do something like this:
dic = {}
dic['\xe2\x80\x93'] = 'http://example.com'  # this is the URL where the word appears
... but when I do:
print dic
I get:
'\xe2\x80\x93'
... instead of –.
But when I do print dic['\xe2\x80\x93'] I successfully get –.
How can I get – from print dic as well?
When you see \xhh, that is a character escape sequence. In this case, it is showing you the hex value of the character (see: lexical analysis: string-literals).
The reason you see \xhh sometimes, and you see the actual characters when you use print is related to the difference between __str__ and __repr__ in Python.
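A short Python 2 interpreter sketch (assuming a UTF-8 terminal) of the difference:

>>> s = '\xe2\x80\x93'
>>> print s                           # str(): the raw bytes reach the terminal, which renders the dash
–
>>> print {s: 'http://example.com'}   # containers display their items with repr(), hence the escapes
{'\xe2\x80\x93': 'http://example.com'}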
I'm getting data from a CSV file, doing something with it and then writing it to a text template.
The problem occurs when I come across characters that I cannot encode.
For example, when I come across a value written in Chinese, the selected field is blank when I open it with some kind of CSV editor (e.g. LibreOffice Calc for Linux).
But when I get the data via csv.reader in my script, I can see that it is actually a string that hasn't been decoded properly.
And when I try to write it to a template, I get this weird SUB string.
Here is the breakdown of the problem:
import csv
from string import Template

for row in csv.DictReader(csvfile):
    # take value from the row and store it in a dictionary
    ....

# take the values from the dictionary and write them to a template
with open('template.txt', 'r+') as template:
    src = Template(template.read())
    content = src.substitute(rec)

with open('myoutput.txt', 'w') as bill:
    bill.write(content)
And the template.txt looks like this:
$name
$address
$city
...
All of this generates txt files like this:
Bill
North Grove 14
Scottsdale
...
If any of the dictionary values are empty, e.g. an empty string '', my template rendering function ignores the tag; so, for example, if the address attribute were missing from a particular row, the output would be
Bill
Scottsdale
...
When I try to do that with my Chinese data, my function does write the data, because the strings in question are not empty. And when I write them to a template, the end result looks like this:
SUB
SUB
Hong Kong
...
How can I display my data properly? Also, is there a way to skip that data? For example, something that tries to decode the data and, if it's not successful, converts it to an empty string.
P.S. try/except won't work here, because mystring.encode('utf-8') or mystring.encode('latin-1') will encode the string, but it will still come out as garbage.
EDIT
After printing out the problem row, the output of the problematic values is the following:
{'Name': '\x1a \x1a\x1a', 'State': '\x1a\x1a\x1a'}
\x1a is the ASCII substitute character. This is the reason why you see "SUB" in your output. This character is generally used as a replacement by programs that try to decode bytes but fail.
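A quick interpreter check (Python 2) confirming that the substitute character is not printable:

>>> import string
>>> '\x1a' in string.printable
False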
Your CSV file does not contain valid data. It was probably generated from a source containing valid data, but the file itself no longer does.
Just guessing: did you perhaps open the file with LibreOffice and then save it?
If you want to check whether your string contains ASCII unprintable characters, use this:
import string

def is_printable(data):
    return all(c in string.printable for c in data)
If you want to remove ASCII unprintable characters:
def strip_unprintable(data):
    return ''.join(c for c in data if c in string.printable)
If you want to deal with Unicode strings, then replace c in string.printable with:
ord(c) > 0x1f and ord(c) != 0x7f and not (0x80 <= ord(c) <= 0x9f)
(Credit goes to What is the range of Unicode Printable Characters?)
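Spelled out as a helper, a minimal sketch (the function name is my own):

def is_printable_unicode(data):
    # a character counts as printable if it is above the C0 controls (0x00-0x1F),
    # is not DEL (0x7F), and is not in the C1 control range (0x80-0x9F)
    return all(ord(c) > 0x1f and ord(c) != 0x7f and not (0x80 <= ord(c) <= 0x9f)
               for c in data)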
Thanks to @Andrea Corbellini, your answer helped me find a solution.
def stringcheck(line):
    for letter in line:
        if letter not in string.printable:
            return 0
    return 1
However I don't think this is the most pythonic way of doing this, so any suggestions on how to make this better would be much appreciated.
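For example, using all() as in the answer above, the check collapses to a single expression that returns a proper bool:

def stringcheck(line):
    # same check as the loop version, but with all()
    return all(letter in string.printable for letter in line)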
I'm building a program that is able to replace characters in a message with characters the user has entered into a dictionary. Some of the characters are given in a text file. So, to import them, I used this code:
d = {}
with open("dictionary.txt") as f:  # don't reuse the name d for the file handle
    for line in f:
        (key, val) = line.split()
        d[str(key)] = val
It works well, except it adds "ï»¿" to the start of the dictionary. The array of to-be-replaced text is called 'words'. This is the code I have for that:
for each in d:
    words = ";".join(words)
    words = words.replace(d[each], each)
    words = words.split(";")
print words
When I hit F5, however, I get a load of gobbledygook. Here's an example:
\xef\xbb\xbf\xef\xbb\xbfA+/084&
I'm just a newbie at Python, so any help would be appreciated.
Make sure to save your dictionary file in UTF-8.
With Notepad++ (on Windows) there are conversion functions if your existing file is not UTF-8.
The "ï»¿" pattern comes from reading the UTF-8 byte order mark (BOM) as latin-1 (you won't have it if you use UTF-8 encoding consistently).
Then, instead of str(key), use key.encode("utf-8") to avoid possible other errors in the future.
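If the stray characters are a UTF-8 byte order mark (which the \xef\xbb\xbf bytes in your output suggest), another option is to read the file with the utf-8-sig codec, which strips a leading BOM automatically. A minimal sketch, assuming the file is otherwise valid UTF-8:

import codecs

d = {}
# utf-8-sig decodes UTF-8 and silently drops a leading BOM if present
with codecs.open("dictionary.txt", encoding="utf-8-sig") as f:
    for line in f:
        key, val = line.split()
        d[key.encode("utf-8")] = val.encode("utf-8")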
If you want to know more, you can take a look at the good Python documentation about this: http://docs.python.org/2/howto/unicode.html
I am new to python and have a question:
I have checked similar questions, the Dive Into Python tutorial, the Python documentation, Google and Bing searches, similar Stack Overflow questions and a dozen other tutorials.
I have a section of python code that reads a text file containing 20 tweets. I am able to extract these 20 tweets using the following code:
import json

data = []
with open('output.txt') as fp:
    for line in iter(fp.readline, ''):
        Tweets = json.loads(line)
        data.append(Tweets.get('text'))

i = 0
while i < len(data):
    print data[i]
    i = i + 1
The above while loop iterates perfectly and prints out the 20 tweets (lines) from output.txt.
However, these 20 lines contain non-English character data like "Los ladillo a los dos, soy maaaala o maloooooooooooo", URLs like "http://t.co/57LdpK", the string "None", and photo entries with a URL, like "Photo: http://t.co/kxpaaaaa" (I have edited this for privacy).
I would like to purge the output of this (which is a list), and exclude the following:
The None entries
Anything beginning with the string "Photo:"
It would be a bonus also if I can exclude non-unicode data
I have tried the following bits of code
Using data.remove("None:") but I get the error list.remove(x): x not in list.
Reading the items I do not want into a set and then doing a comparison on the output but no luck.
Researching into list comprehensions, but wonder if I am looking at the right solution here.
I am from an Oracle background, where there are functions to chop out any wanted/unwanted section of output, so I have really gone round in circles on this in the last two hours. Any help greatly appreciated!
Try something like this:
def legit(string):
    if string.startswith("Photo:") or "None" in string:
        return False
    else:
        return True

whatyouwant = [x for x in data if legit(x)]
I'm not sure if this will work out of the box for your data, but you get the idea. If you're not familiar with it, [x for x in data if legit(x)] is called a list comprehension.
First of all, only add Tweet.get('text') if there is a text entry:
with open('output.txt') as fp:
    for line in iter(fp.readline, ''):
        Tweets = json.loads(line)
        if 'text' in Tweets:
            data.append(Tweets['text'])
That won't add None entries (.get() returns None if the 'text' key is not present in the dictionary).
I'm assuming that you want to further process the data list you are building here. If not, you can dispense with the for entry in data: loops below and stick to one loop with if statements. Tweets['text'] is the same value as entry in the for entry in data loops.
Next, you are looping over python unicode values, so use the methods provided on those objects to filter out what you don't want:
for entry in data:
    if not entry.startswith("Photo:"):
        print entry
You can use a list comprehension here; the following would print all entries too, in one go:
print '\n'.join([entry for entry in data if not entry.startswith("Photo:")])
In this case that doesn't really buy you much, as you are building one big string just to print it; you may as well just print the individual strings and avoid the string building cost.
Note that all your data is Unicode data. What you perhaps wanted is to filter out text that uses codepoints beyond ASCII. You could use a regular expression to detect codepoints beyond ASCII in your text:
import re
nonascii = re.compile(ur'[^\x00-\x7f]', re.UNICODE)  # all codepoints beyond 0x7F are non-ASCII

for entry in data:
    if entry.startswith("Photo:") or nonascii.search(entry):
        continue  # skip the rest of this iteration, continue to the next
    print entry
Short demo of the non-ASCII expression:
>>> import re
>>> nonascii = re.compile(ur'[^\x00-\x7f]', re.UNICODE)
>>> nonascii.search(u'All you see is ASCII')
>>> nonascii.search(u'All you see is ASCII plus a little more unicode, like the EM DASH codepoint: \u2014')
<_sre.SRE_Match object at 0x1086275e0>
with open('output.txt') as fp:
    for line in fp.readlines():
        Tweets = json.loads(line)
        if 'text' not in Tweets:
            continue
        txt = Tweets.get('text')
        if txt.replace('.', '').replace('?', '').replace(' ', '').isalnum():
            data.append(txt)
            print txt
Small and simple.
Basic principle, one loop, if data matches your "OK" criteria add it and print it.
As Martijn pointed out, 'text' might not be in all the Tweets data.
A regexp replacement for the .replace() chain would go something along the lines of: if re.match(r'^[\w\- ]+$', txt) is not None: (it will not work for all whitespace etc., so use it with care).
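A quick sketch of how that regexp behaves on values like the ones in the question:

>>> import re
>>> re.match(r'^[\w\- ]+$', u'Los ladillo a los dos') is not None
True
>>> re.match(r'^[\w\- ]+$', u'Photo: http://t.co/kxp') is not None
False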
I'd suggest something like the following:
# use itertools.ifilter to remove items from a list according to a function
from itertools import ifilter
import json
import re

# write a function to filter out entries you don't want
def my_filter(value):
    if not value or value.startswith('Photo:'):
        return False
    # exclude entries containing non-ASCII chars anywhere
    # (re.search, not re.match, since match only checks the start of the string)
    if re.search(r'[^\x00-\x7F]', value):
        return False
    return True

# reading the data can be simplified with a list comprehension
with open('output.txt') as fp:
    data = [json.loads(line).get('text') for line in fp]

# do the filtering
data = list(ifilter(my_filter, data))

# print the output
for line in data:
    print line
Regarding Unicode: assuming you're using Python 2.x, the open function won't read data as unicode; it'll be read as the str type. You might want to convert it if you know the encoding, or read the file with a given encoding using codecs.open.
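For example, a minimal sketch using codecs.open (assuming the file is UTF-8):

import codecs
import json

# codecs.open decodes each line to unicode as it is read
with codecs.open('output.txt', encoding='utf-8') as fp:
    data = [json.loads(line).get('text') for line in fp]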
Try this:
with open('output.txt') as fp:
    for line in iter(fp.readline, ''):
        Tweets = json.loads(line)
        data.append(Tweets.get('text'))

i = 0
while i < len(data):
    # these conditions will skip (continue) over the iterations
    # matching your first two conditions
    if data[i] is None or data[i].startswith("Photo"):
        i = i + 1  # increment before continue, or the loop never advances
        continue
    print data[i]
    i = i + 1
I am trying to split a Unicode string into words (simplistic), like this:
print re.findall(r'(?u)\w+', "раз два три")
What I expect to see is:
['раз','два','три']
But what I really get is:
['\xd1', '\xd0', '\xd0', '\xd0', '\xd0\xb2\xd0', '\xd1', '\xd1', '\xd0']
What am I doing wrong?
Edit:
If I use u in front of the string:
print re.findall(r'(?u)\w+', u"раз два три")
I get:
[u'\u0440\u0430\u0437', u'\u0434\u0432\u0430', u'\u0442\u0440\u0438']
Edit 2:
Aaaaand it seems like I should have read docs first:
print re.findall(r'(?u)\w+', u"раз два три")[0].encode('utf-8')
Will give me:
раз
Just to make sure though, does that sound like a proper way of approaching it?
You're actually getting the stuff you expect in the unicode case. You only think you are not because of the weird escaping, due to the fact that you're looking at the reprs of the strings, not their unescaped values. (This is just how lists are displayed.)
>>> words = [u'\u0440\u0430\u0437', u'\u0434\u0432\u0430', u'\u0442\u0440\u0438']
>>> for w in words:
... print w # This uses the terminal encoding -- _only_ utilize interactively
...
раз
два
три
>>> u'раз' == u'\u0440\u0430\u0437'
True
Don't miss my remark about printing these unicode strings. Normally if you were going to send them to screen, a file, over the wire, etc. you need to manually encode them into the correct encoding. When you use print, Python tries to leverage your terminal's encoding, but it can only do that if there is a terminal. Because you don't generally know if there is one, you should only rely on this in the interactive interpreter, and always encode to the right encoding explicitly otherwise.
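For instance, when sending them to a file you would encode explicitly (a sketch; the filename and the UTF-8 target encoding are assumptions):

>>> with open('words.txt', 'w') as f:
...     for w in words:
...         f.write(w.encode('utf-8') + '\n')  # encode explicitly instead of relying on the terminal
...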
In this simple splitting-on-whitespace approach, you might not want to use a regex at all but simply use the unicode.split method.
>>> u"раз два три".split()
[u'\u0440\u0430\u0437', u'\u0434\u0432\u0430', u'\u0442\u0440\u0438']
Your top (bytestring) example does not work because re basically assumes all bytestrings are ASCII for its semantics, but yours was not. Using unicode strings allows you to get the right semantics for your alphabet and locale. As much as possible, textual data should always be represented using unicode rather than str.
I already asked about this problem, but after some testing I decided to create a new question with some more specific info:
I am reading user accounts with python-ldap (and Python 2.7) from our Active Directory. This does work well, but I have problems with special chars. They do look like UTF-8 encoded strings when printed on the console. The goal is to write them into a MySQL DB, but I don't get those strings into proper UTF-8 from the beginning.
Example (fullentries is my array with all the AD entries):
fullentries[23][1].decode('utf-8', 'ignore')
print fullentries[23][1].encode('utf-8', 'ignore')
print fullentries[23][1].encode('latin1', 'ignore')
print repr(fullentries[23][1])
A second test with a string inserted by hand as follows:
testentry = "M\xc3\xbcller"
testentry.decode('utf-8', 'ignore')
print testentry.encode('utf-8', 'ignore')
print testentry.encode('latin1', 'ignore')
print repr(testentry)
The output of the first example is:
M\xc3\xbcller
M\xc3\xbcller
u'M\\xc3\\xbcller'
Edit: If I try to replace the double backslashes with .replace('\\\\', '\\') the output remains the same.
The output of the second example:
Müller
M�ller
'M\xc3\xbcller'
Is there any way to get the AD output properly encoded? I already read a lot of documentation, but it all states that LDAPv3 gives you strictly UTF-8 encoded strings. Active Directory uses LDAPv3.
My older question on this topic is here: Writing UTF-8 String to MySQL with Python
Edit: Added repr(s) infos
First, know that printing to a Windows console is often the step that garbles data, so for your tests, you should print repr(s) to see the precise bytes you have in your string.
You need to find out how the data from AD is encoded. Again, print repr(s) will let you see the content of the data.
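For instance, repr() distinguishes a str holding UTF-8 bytes from a real unicode object (illustrative values):

>>> s = 'M\xc3\xbcller'  # a str holding UTF-8 bytes
>>> print repr(s)
'M\xc3\xbcller'
>>> u = u'M\xfcller'     # a unicode object with one codepoint, U+00FC
>>> print repr(u)
u'M\xfcller'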
UPDATED:
OK, it looks like you're getting strange strings somehow. There might be a way to get them better, but you can adapt in any case, though it isn't pretty:
u.decode('unicode_escape').encode('iso8859-1').decode('utf8')
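A quick interpreter check of that chain against the repr shown earlier:

>>> u = u'M\\xc3\\xbcller'  # the odd value from repr() above
>>> u.decode('unicode_escape').encode('iso8859-1').decode('utf8')
u'M\xfcller'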
You might want to look into whether you can get the data in a more natural format.