I'm getting data from a CSV file, doing something with it, and then writing it to a text template.
The problem occurs when I come across characters that I cannot encode.
For example, when I come across a value written in Chinese, the selected field is blank when I open the file with a CSV editor (e.g. LibreOffice Calc on Linux).
But when I get the data via csv.reader in my script, I can see that it is actually a string that hasn't been decoded properly.
And when I try to write it to a template, I get this weird SUB string.
Here is the breakdown of the problem:
import csv
from string import Template

for row in csv.DictReader(csvfile):
    # take values from the row and store them in a dictionary
    ....

# take the values from the dictionary and write them to a template
with open('template.txt', 'r+') as template:
    src = Template(template.read())
    content = src.substitute(rec)

with open('myoutput.txt', 'w') as bill:
    bill.write(content)
And the template.txt looks like this:
$name
$address
$city
...
All of this generates txt files like this:
Bill
North Grove 14
Scottsdale
...
If any of the dictionary values are empty, e.g. an empty string '', my template rendering function ignores the tag. For example, if the address attribute were missing from a particular row, the output would be:
Bill
Scottsdale
...
When I try to do that with my Chinese data, my function does write the data, because the strings in question are not empty. And when I write them to a template, the end result looks like this:
SUB
SUB
Hong Kong
...
How can I display my data properly? Also, is there a way to skip that data, for example something that tries to decode the data and, if it's not successful, converts it to an empty string?
P.S. A try/except won't work here, because mystring.encode('utf-8') or mystring.encode('latin-1') will encode the string, but it will still come out as garbage.
EDIT
After printing out the problem row, the output of the problematic values is the following:
{'Name': '\x1a \x1a\x1a', 'State': '\x1a\x1a\x1a'}
\x1a is the ASCII substitute character. This is the reason why you see "SUB" in your output. This character is generally used as a replacement by programs that try to decode bytes but fail.
Your CSV file does not contain valid data. Probably it was generated starting from a source containing valid data, but the file itself does not contain valid data anymore.
Just guessing: did you perhaps open the file with LibreOffice and then save it?
If you want to check whether your string contains only printable ASCII characters, use this:
import string

def is_printable(data):
    return all(c in string.printable for c in data)
If you want to remove ASCII unprintable characters:
def strip_unprintable(data):
    return ''.join(c for c in data if c in string.printable)
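Applied to the problem row above, for example, strip_unprintable('\x1a \x1a\x1a') would return ' ': only the space survives.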
If you want to deal with Unicode strings, then replace c in string.printable with:
ord(c) > 0x1f and ord(c) != 0x7f and not (0x80 <= ord(c) <= 0x9f)
(Credit goes to What is the range of Unicode Printable Characters?)
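Putting that together, a minimal sketch of a Unicode-aware variant (the function name is mine):
def is_printable_unicode(data):
    # reject C0 controls (0x00-0x1f), DEL (0x7f) and the C1 range (0x80-0x9f)
    return all(ord(c) > 0x1f and ord(c) != 0x7f and not (0x80 <= ord(c) <= 0x9f)
               for c in data)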
Thanks to @Andrea Corbellini, your answer helped me find a solution.
def stringcheck(line):
    for letter in line:
        if letter not in string.printable:
            return 0
    return 1
However I don't think this is the most pythonic way of doing this, so any suggestions on how to make this better would be much appreciated.
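One option, mirroring the all() form of Andrea Corbellini's is_printable above, would be to collapse the loop into a single expression that returns a real boolean:
import string

def stringcheck(line):
    return all(letter in string.printable for letter in line)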
Related
I have a program in Python which analyses file headers and decides which file type it is. (https://github.com/LeoGSA/Browser-Cache-Grabber)
The problem is the following:
I read the first 24 bytes of the file:
with open(from_folder + "/" + i, "rb") as myfile:
    header = str(myfile.read(24))
then I look for a pattern in it:
if y[1] in header:
    shutil.move(from_folder + "/" + i, to_folder + y[2] + i + y[3])
where y = ['/video', r'\x47\x40\x00', '/video/', '.ts']
y[1] is the pattern, i.e. r'\x47\x40\x00'
The file does contain this byte sequence.
But the program does NOT find this pattern (r'\x47\x40\x00') in the file header.
So I tried to print header. Python shows it as 'G@' instead of '\x47\x40',
and if I search for 'G@' + r'\x00' in header, everything is OK: it finds it.
Question: What am I doing wrong? I want to look for r'\x47\x40\x00' and find it, not for some strange 'G@' + r'\x00'.
OR
Why does Python see the first two bytes as 'G@' and not as '\x47\x40', even though it shows the rest of the header in hex? Is there a way to fix it?
import binascii

with open(from_folder + "/" + i, "rb") as myfile:
    header = myfile.read(24)
header = str(binascii.hexlify(header))[2:-1]  # [2:-1] strips the b' and ' from the repr
the result I get is:
4740001b0000b00d0001c100000001efff3690e23dffffff
and I can work with that.
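For completeness, here's a sketch of an alternative that avoids the hex round-trip entirely (my own suggestion, reusing the question's from_folder, i and y variables): keep header as bytes and search with a bytes literal, since Python 3 supports subsequence tests on bytes with the in operator.
import shutil

pattern = b'\x47\x40\x00'  # the actual byte values, not the raw string r'\x47\x40\x00'

with open(from_folder + "/" + i, "rb") as myfile:
    header = myfile.read(24)  # keep it as bytes; no str() conversion

if pattern in header:  # bytes-in-bytes subsequence test
    shutil.move(from_folder + "/" + i, to_folder + y[2] + i + y[3])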
P.S. But anyway, if anybody can explain what the problem was with the first 2 bytes, I would be grateful.
In Python 3 you'll get bytes from a binary read, rather than a string.
There's no need to convert it to a string with str.
Print will try to convert bytes to something human readable.
If you don't want that, convert your bytes to e.g. hex representations of the integer values of the bytes by:
aBytes = b'\x00\x47\x40\x00\x13\x00\x00\xb0'
print(aBytes)
print(''.join([hex(aByte) for aByte in aBytes]))
Output as redirected from the console:
b'\x00G@\x00\x13\x00\x00\xb0'
0x00x470x400x00x130x00x00xb0
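(Note that hex() does not zero-pad, which is why single-digit bytes like 0x0 run together ambiguously with their neighbours; the fixed-width %0.2x version below avoids that.)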
You can't search aBytes for the raw string r'\x47\x40\x00' with the in operator, since aBytes isn't a string but a sequence of bytes; in needs a bytes pattern there (b'\x47\x40\x00' would match, though).
If you want to apply a string search on '\x00\x47\x40', use:
aBytes = b'\x00\x47\x40\x00\x13\x00\x00\xb0'
print(aBytes)
print(r'\x'.join([''] + ['%0.2x' % aByte for aByte in aBytes]))
Which will give you:
b'\x00G@\x00\x13\x00\x00\xb0'
\x00\x47\x40\x00\x13\x00\x00\xb0
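With that fixed-width representation, an ordinary string search for the question's pattern works:
aBytes = b'\x00\x47\x40\x00\x13\x00\x00\xb0'
needle = r'\x47\x40\x00'
haystack = r'\x'.join([''] + ['%0.2x' % aByte for aByte in aBytes])
print(needle in haystack)  # True: every byte occupies exactly four characters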
So there are a number of separate issues at play here:
print tries to print something human readable, which succeeds only for the printable bytes (here 'G' and '@'); everything else is shown as a \x escape.
You can't search for a str pattern in a bytes object with in; either search with a bytes pattern, or convert the bytes to a string of fixed-length hex representations and search in that, as shown.
So I'm trying to parse a bunch of citations from a text file using the re module in Python 3.4 (on, if it matters, a Mac running Mavericks). Here's some minimal code. Note that there are two commented lines: they represent two alternative searches. (Obviously, the little one, r'Rawls', is the one that works.)
import re

def makeRefList(reffile):
    print(reffile)
    # namepattern = r'(^[A-Z1][A-Za-z1]*-?[A-Za-z1]*),.*( \(?\d\d\d\d[a-z]?[.)])'
    # namepattern = r'Rawls'
    refsTuplesList = re.findall(namepattern, reffile, re.MULTILINE)
    print(refsTuplesList)
The string in question is ugly, and so I stuck it in a gist: https://gist.github.com/paultopia/6c48c398a42d4834f2ae
As noted, the search string r'Rawls' produces expected output ['Rawls', 'Rawls']. However, the other search string just produces an empty list.
I've confirmed this regex (partially) works using the regex101 tester. Confirmation here: https://regex101.com/r/kP4nO0/1 -- it matches what I expect it to match. Since it works in the tester, it should work in the code, right?
(n.b. I copied the text from terminal output from the first print command, then manually replaced \n characters in the string with carriage returns for regex101.)
One possible issue is that Python has prepended the b prefix (is the little b called a "flag"?) to the string. This is an artifact of my attempt to convert the text from UTF-8 to ASCII, and I haven't figured out how to make it go away.
Yet re clearly is able to parse strings in that form. I know this because I'm converting two text files from utf-8 to ascii, and the following code works perfectly fine on the other string, converted from the other text file, which also has a little b in front of it:
import re

def makeCiteList(citefile):
    print(citefile)
    citepattern = r'[\s(][A-Z1][A-Za-z1]*-?[A-Za-z1]*[ ,]? \(?\d\d\d\d[a-z]?[\s.,)]'
    rawCitelist = re.findall(citepattern, citefile)
    cleanCitelist = cleanup(rawCitelist)  # cleanup() is defined elsewhere in my code
    finalCiteList = list(set(cleanCitelist))
    print(finalCiteList)
    return finalCiteList
The other chunk of text, which the code immediately above matches correctly: https://gist.github.com/paultopia/a12eba2752638389b2ee
The only hypothesis I can come up with is that the first, broken, regex is choking on the combination of newline characters and the string being treated as a bytes object, even though (a) I know the regex handles newlines correctly (confirmed on regex101), and (b) I know it can match these strings (confirmed by the successful match on the other string).
If that's true, though, I don't know what to do about it.
Thus, questions:
1) Is my hypothesis right that it's the combination of newlines and the b prefix that blows up my regex? If not, what is?
2) How do I fix that?
a) replace the newlines with something in the string?
b) rewrite the regex somehow?
c) somehow get rid of that b and make it into a normal string again? (How?)
thanks!
Addition
In case this is a problem I need to fix upstream, here's the code I'm using to get the text files and convert to ascii, replacing non-ascii characters:
This function gets called on UTF-8 .txt files saved by TextWrangler in Mavericks:
def makeCorpoi(citefile, reffile):
    citebox = open(citefile, 'r')
    refbox = open(reffile, 'r')
    citecorpus = citebox.read()
    refcorpus = refbox.read()
    citebox.close()
    refbox.close()
    corpoi = [str(citecorpus), str(refcorpus)]
    return corpoi
and then this function gets called on each element of the list the above function returns.
import codecs

def conv2ASCII(bigstring):
    def convHandler(error):
        return ('1FOREIGN', error.start + 1)
    codecs.register_error('foreign', convHandler)
    bigstring = bigstring.encode('ascii', 'foreign')
    stringstring = str(bigstring)
    return stringstring
Aah. I've tracked it down and answered my own question. Apparently one needs to call a decode method on the encoded bytes. The following code produces an actual string, with newlines and everything, out the other end (though now I have to fix a bunch of other bugs before I can figure out whether the final output is as expected):
import codecs

def conv2ASCII(bigstring):
    def convHandler(error):
        return ('1FOREIGN', error.start + 1)
    codecs.register_error('foreign', convHandler)
    bigstring = bigstring.encode('ascii', 'foreign')
    newstring = bigstring.decode('ascii', 'foreign')
    return newstring
Apparently the str() function doesn't do the same job, for reasons that are mysterious to me. This is despite an answer here (How to make new line commands work in a .txt file opened from the internet?) which suggests that it does.
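A short illustration of the difference (my own example): str() on a bytes object produces its repr, with the b prefix and a literal backslash-n, while .decode() produces real text with a real newline.
raw = b'line one\nline two'
print(str(raw))             # b'line one\nline two' -- the repr, backslash-n and all
print(raw.decode('ascii'))  # two actual lines of text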
I need to convert a file list into a string and save it to a database.
A file list looks like:
[
    # [name, length]
    # name is in bytes
    ['111.txt', '1024'],
    ['english.txt', '2048'],
    ['some CJK words.log', '2048'],
    ....
]
Note:
all file names are legal;
file names are not user input.
now, I use:
if fs:
    files = []
    file_names = fs[0]
    file_lengths = fs[1]
    for i in xrange(len(file_names)):
        files.append(file_names[i] + '\#' + file_lengths[i])
    files = '\n'.join(files)
    save_to_mysql(files)
I think a file name stored as bytes cannot contain \n or \#, but I am not quite sure. Is it safe to use \# and \n as separators in my situation?
The proper solution for this is to use a character that cannot appear in the texts.
But if this is not possible and this character does appear in the text, then it has to be marked somehow, that is, escaped.
There are already solutions that do exactly that: you can use the C string syntax, or XML, or JSON, or YAML...
But if you feel particularly lazy, I've sometimes used the character U+0080, because it is not used anywhere. But note that if in the future you want to encode a list of strings as an element of your list... you'll have a problem! Also, you'll have to check the input strings, in case some malicious user injects this U+0080 character into your strings and starts breaking things.
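For example, here is a minimal sketch of the JSON route (serialize_file_list is a name I made up, and it assumes the byte-string names decode as UTF-8, since json.dumps will fail otherwise):
import json

def serialize_file_list(fs):
    # fs[0] is the list of names, fs[1] the matching list of lengths
    pairs = [[name, length] for name, length in zip(fs[0], fs[1])]
    return json.dumps(pairs)

if fs:
    save_to_mysql(serialize_file_list(fs))  # save_to_mysql is the asker's function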
I am new to Python and have a question:
I have checked similar questions, the Dive Into Python tutorial, the Python documentation, Google/Bing, similar Stack Overflow questions, and a dozen other tutorials.
I have a section of Python code that reads a text file containing 20 tweets. I am able to extract these 20 tweets using the following code:
import json

data = []
with open('output.txt') as fp:
    for line in iter(fp.readline, ''):
        Tweets = json.loads(line)
        data.append(Tweets.get('text'))

i = 0
while i < len(data):
    print data[i]
    i = i + 1
The above while loop iterates perfectly and prints out the 20 tweets (lines) from output.txt.
However, these 20 lines contain non-English character data like "Los ladillo a los dos, soy maaaala o maloooooooooooo", URLs like "http://t.co/57LdpK", the string "None", and photos with a URL like "Photo: http://t.co/kxpaaaaa" (I have edited this for privacy).
I would like to purge the output of this (which is a list), and exclude the following:
The None entries
Anything beginning with the string "Photo:"
It would be a bonus also if I can exclude non-unicode data
I have tried the following bits of code:
Using data.remove("None:"), but I get the error list.remove(x): x not in list.
Reading the items I do not want into a set and then doing a comparison on the output, but no luck.
Researching list comprehensions, but I wonder if I am looking at the right solution here.
I am from an Oracle background, where there are functions to chop out any wanted/unwanted section of output, so I've really gone round in circles in the last 2 hours on this. Any help greatly appreciated!
Try something like this:
def legit(string):
    # guard against actual None entries as well as the unwanted strings
    if string is None or string.startswith("Photo:") or "None" in string:
        return False
    else:
        return True

whatyouwant = [x for x in data if legit(x)]
I'm not sure if this will work out of the box for your data, but you get the idea. If you're not familiar, [x for x in data if legit(x)] is called a list comprehension
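A tiny illustration of the same pattern with made-up values:
nums = [1, -2, 3, -4]
positives = [n for n in nums if n > 0]  # keep only the items passing the condition
print positives  # [1, 3]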
First of all, only add Tweets.get('text') if there is a text entry:
with open('output.txt') as fp:
    for line in iter(fp.readline, ''):
        Tweets = json.loads(line)
        if 'text' in Tweets:
            data.append(Tweets['text'])
That'll not add None entries (.get() returns None if the 'text' key is not present in the dictionary).
I'm assuming that you want to further process the data list you are building. If not, you can dispense with the for entry in data: loops below and stick to one loop with if statements. Tweets['text'] is the same value as entry in the for entry in data loops.
Next, you are looping over python unicode values, so use the methods provided on those objects to filter out what you don't want:
for entry in data:
    if not entry.startswith("Photo:"):
        print entry
You can use a list comprehension here; the following would print all entries too, in one go:
print '\n'.join([entry for entry in data if not entry.startswith("Photo:")])
In this case that doesn't really buy you much, as you are building one big string just to print it; you may as well just print the individual strings and avoid the string building cost.
Note that all your data is Unicode data. What you perhaps want is to filter out text that uses codepoints beyond ASCII. You could use a regular expression to detect codepoints beyond ASCII in your text:
import re
nonascii = re.compile(ur'[^\x00-\x7f]', re.UNICODE)  # all codepoints beyond 0x7F are non-ASCII

for entry in data:
    if entry.startswith("Photo:") or nonascii.search(entry):
        continue  # skip the rest of this iteration, continue to the next
    print entry
Short demo of the non-ASCII expression:
>>> import re
>>> nonascii = re.compile(ur'[^\x00-\x7f]', re.UNICODE)
>>> nonascii.search(u'All you see is ASCII')
>>> nonascii.search(u'All you see is ASCII plus a little more unicode, like the EM DASH codepoint: \u2014')
<_sre.SRE_Match object at 0x1086275e0>
with open('output.txt') as fp:
    for line in fp.readlines():
        Tweets = json.loads(line)
        if 'text' not in Tweets:
            continue
        txt = Tweets.get('text')
        if txt.replace('.', '').replace('?', '').replace(' ', '').isalnum():
            data.append(txt)
            print txt
Small and simple.
Basic principle: one loop; if the data matches your "OK" criteria, add it and print it.
As Martijn pointed out, 'text' might not be in all the Tweets data.
A regex replacement for the .replace() chain would go something along the lines of: if re.match(r'^[\w\- ]+$', txt) is not None: (it will not work for blank space etc., so, yeah, as mentioned below...)
I'd suggest something like the following:
# use itertools.ifilter to remove items from a list according to a function
from itertools import ifilter
import json
import re

# write a function to filter out entries you don't want
def my_filter(value):
    if not value or value.startswith('Photo:'):
        return False
    # exclude unwanted chars; use re.search (not re.match, which only looks
    # at the start of the string) so a non-ASCII character anywhere is caught
    if re.search('[^\x00-\x7F]', value):
        return False
    return True

# reading the data can be simplified with a list comprehension
with open('output.txt') as fp:
    data = [json.loads(line).get('text') for line in fp]

# do the filtering
data = list(ifilter(my_filter, data))

# print the output
for line in data:
    print line
Regarding Unicode: assuming you're using Python 2.x, the open function won't read data as Unicode; it'll be read as the str type. You might want to convert it if you know the encoding, or read the file with a given encoding using codecs.open.
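For example, a minimal sketch assuming the file is UTF-8 encoded:
import codecs
import json

with codecs.open('output.txt', 'r', encoding='utf-8') as fp:
    data = [json.loads(line).get('text') for line in fp]  # entries are unicode (or None)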
Try this:
with open('output.txt') as fp:
    for line in iter(fp.readline, ''):
        Tweets = json.loads(line)
        data.append(Tweets.get('text'))

i = 0
while i < len(data):
    # these conditions skip (continue) over the iterations
    # matching your first two conditions
    if data[i] is None or data[i].startswith("Photo"):
        i = i + 1  # advance before continuing, otherwise the loop never progresses
        continue
    print data[i]
    i = i + 1
I want to know how to decode certain text, and have found some text like this which I want to decode:
\xe2\x80\x93
I know that printing it will solve it, but I am building a web crawler, hence I need to build an index (a dictionary) containing words with a list of URLs where each word appears.
Hence I want to do something like this:
dic = {}
dic['\xe2\x80\x93'] = 'http://example.com'  # this is the URL where the word appears
... but when I do:
print dic
I get:
'\xe2\x80\x93'
... instead of –.
But when I do print dic['\xe2\x80\x93'] I successfully get –.
How can I get – from print dic as well?
When you see \xhh, that is a character escape sequence. In this case, it is showing you the hex value of the character (see: lexical analysis: string literals).
The reason you see \xhh sometimes, and you see the actual characters when you use print is related to the difference between __str__ and __repr__ in Python.
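A quick demonstration (Python 2, my own example values): containers render their items with repr(), which is why print dic shows escape sequences while print dic['\xe2\x80\x93'] shows the dash.
s = '\xe2\x80\x93'  # the UTF-8 byte sequence for an en dash
print s             # str(): the raw bytes go to the terminal, which renders the dash
print repr(s)       # '\xe2\x80\x93' -- the escaped form
print {s: 'http://example.com'}  # dict printing uses repr() for keys and values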