Munging non-printable characters to dots using string.translate() - python

So I've done this before, and it's a surprisingly ugly bit of code for such a seemingly simple task.
The goal is to translate any non-printable character into a . (dot). For my purposes, "printable" excludes the last few characters of string.printable (newlines, tabs, and so on). This is for printing things like the old MS-DOS debug "hex dump" format, or anything similar (where extra whitespace would mangle the intended dump layout).
I know I can use string.translate() and, to use that, I need a translation table. So I use string.maketrans() for that. Here's the best I could come up with:
filter = string.maketrans(
    string.translate(string.maketrans('', ''),
                     string.maketrans('', ''), string.printable[:-5]),
    '.' * len(string.translate(string.maketrans('', ''),
                               string.maketrans('', ''), string.printable[:-5])))
... which is an unreadable mess (though it does work).
From there you can use something like:
for each_line in sometext:
    print string.translate(each_line, filter)
... and be happy. (So long as you don't look under the hood).
Now it is more readable if I break that horrid expression into separate statements:
ascii = string.maketrans('', '')  # identity table: all 256 byte values
nonprintable = string.translate(ascii, ascii, string.printable[:-5])  # optional delchars arg strips the printables
filter = string.maketrans(nonprintable, '.' * len(nonprintable))
And it's tempting to do that just for legibility.
However, I keep thinking there has to be a more elegant way to express this!
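For what it's worth, on Python 3 the string.maketrans/string.translate functions are gone, but str.maketrans accepts a dict keyed by code point, which makes the table far more direct to build. A minimal sketch of the same idea (an illustration, not the original code):

import string

printable = set(string.printable[:-5])  # drop '\t\n\r\x0b\x0c', keep the space

# Map only the non-printables to '.'; code points absent from the table
# pass through str.translate unchanged.
table = {i: '.' for i in range(256) if chr(i) not in printable}

print('abc\x00\x01def\n'.translate(table))  # -> 'abc..def.'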

Here's another approach using a list comprehension:
filter = ''.join([['.', chr(x)][chr(x) in string.printable[:-5]] for x in xrange(256)])

This is the loosest possible use of "ascii" here (it covers all 256 byte values), but you get the idea:
>>> import string
>>> ascii="".join(map(chr,range(256)))
>>> filter="".join(('.',x)[x in string.printable[:-5]] for x in ascii)
>>> ascii.translate(filter)
'................................ !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~.................................................................................................................................'
If I were golfing, I'd probably use something like this:
filter='.'*32+"".join(map(chr,range(32,127)))+'.'*129
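A quick sanity check, since a translate table must be exactly 256 characters long:

assert len(filter) == 256  # 32 + 95 + 129: bytes 0-31 and 127-255 all become dots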

For actual code golf, I imagine you'd avoid string.maketrans entirely:
s = set(string.printable[:-5])
newstring = ''.join(x if x in s else '.' for x in oldstring)
or
newstring = re.sub('[^' + re.escape(string.printable[:-5]) + ']', '.', oldstring)

I don't find this solution ugly; it is certainly more efficient than any regex-based solution. Here is a slightly shorter version, but it only works in Python 2.6+:
nonprintable = string.maketrans('','').translate(None, string.printable[:-5])
filter = string.maketrans(nonprintable, '.' * len(nonprintable))
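And to tie it back to the original motivation, here is a rough sketch of a debug-style hex dump built on that two-liner (Python 2; the column widths here are arbitrary, not the real MS-DOS debug layout):

import string

nonprintable = string.maketrans('', '').translate(None, string.printable[:-5])
dots = string.maketrans(nonprintable, '.' * len(nonprintable))

def hexdump(data, width=16):
    for offset in xrange(0, len(data), width):
        chunk = data[offset:offset + width]
        hexpart = ' '.join('%02X' % ord(c) for c in chunk)
        # Pad the hex column so the text column lines up on short rows.
        print '%08X  %-*s  %s' % (offset, width * 3 - 1, hexpart,
                                  string.translate(chunk, dots))

hexdump('Hello\x00\x01world!\n')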

Related

Constructing a long Python string using variables

I have a Python string which is basically a concatenation of 3 variables. I am using f-strings to make it a string. It looks like this now:
my_string = f'{getattr(RequestMethodVerbMapping, self.request_method).value} {self.serializer.Meta.model.__name__} {self.request_data["name"]}'
which gives me the output:
Create IPGroup test-create-demo-098
Exactly the output that I want. However, as is obvious the line is too long and now Pylint starts complaining so I try to break up the line using multiline f-strings as follows:
my_string = f'''{getattr(RequestMethodVerbMapping, self.request_method).value}
{self.serializer.Meta.model.__name__} {self.request_data['name']}'''
Pylint is now happy but my string looks like this:
Create
IPGroup test-create-demo-098
What is the best way of doing this so that I get my string in one line and also silence Pylint for the line being longer than 120 chars?
This is a long line by itself and, instead of trying to fit a lot into a single line, I would break it down and apply the "Extract Variable" refactoring method for readability:
request_method_name = getattr(RequestMethodVerbMapping, self.request_method).value
model_name = self.serializer.Meta.model.__name__
name = self.request_data['name']
my_string = f'{request_method_name} {model_name} {name}'
I think the following pieces of wisdom from the Zen of Python fit our problem here well:
Sparse is better than dense.
Readability counts.
It is possible to concatenate f-strings with just whitespace in between, just like ordinary string literals. Just parenthesize the whole thing:
my_string = (f'{getattr(RequestMethodVerbMapping, self.request_method).value}'
             f' {self.serializer.Meta.model.__name__}'
             f' {self.request_data["name"]}')
It will compile to exactly the same code as your one-line string.
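If you want to convince yourself, compare the compiled bytecode of the two spellings (a rough CPython check; exact opcodes vary by version):

one_line = compile('s = f"{a} {b}"', '<demo>', 'exec')
split_up = compile('s = (f"{a}"\n     f" {b}")', '<demo>', 'exec')
print(one_line.co_code == split_up.co_code)  # True: identical bytecode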
It is even possible to interleave f-strings and non-f-strings:
>>> print(f"'foo'" '"{bar}"' f'{1 + 42}')
'foo'"{bar}"43

How to search for a string like \x60\xe2\x4b (indicating an emoticon) using regular expressions in python

import re
string="b'#DerkGently #seanferg85 #Umbertobaggio #EL4JC and he already had Popular support.. most people know this already. A\xe2\x80\xa6 '"
print(re.findall(r"\x[0-9a-z]{2}",string))
The list returned by the findall() function is empty :(
The problem here is that your string is the Python representation of a Python bytes object, which is pretty much useless.
Most likely, you had a bytes object, like this:
b = b'#DerkGently #seanferg85 #Umbertobaggio #EL4JC and he already had Popular support.. most people know this already. A\xe2\x80\xa6 '
… and you converted it to a string, like this:
s = str(b)
Don't do that. Instead, decode it:
s = b.decode('utf-8')
That will get you the actual characters, which you can then match easily, instead of trying to match the characters in the string representation of the bytes representation and then reconstructing the actual characters laboriously from the results.
However, it's worth noting that \xe2\x80\xa6 is not an emoji, it's a horizontal ellipsis character, …. If that isn't what you wanted, you already corrupted the data before this point.
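A minimal demonstration of the decode-first approach (those three bytes are the UTF-8 encoding of U+2026, HORIZONTAL ELLIPSIS):

import re

b = b'most people know this already. A\xe2\x80\xa6 '
s = b.decode('utf-8')
print(re.findall('\u2026', s))  # ['…'] -- one real character, trivial to match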
Not a regexp per se, but it might help you out anyway.

def emojis(s):
    # U+1F600..U+1F64F is the Unicode "Emoticons" block; range() excludes
    # its endpoint, hence 0x1F650.
    return [c for c in s if ord(c) in range(0x1F600, 0x1F650)]

print(emojis("hello world 😊"))  # sample usage
You need to compile a Unicode regex, e.g. re.compile(ur'A\xe2\x80\xa6', re.UNICODE), and use that pattern for your find, findall, sub, etc.
Try this. I joined the string in your question with the one in your title to make the final search string:
import re
k = r"#DerkGently #seanferg85 #Umbertobaggio #EL4JC and he already had Popular support.. most people know this already. A\xe2\x80\xa6 for a string like \x60\xe2\x4b(indicating a emoticon) using regular expression in python"
print(k)
print()
p = re.findall(r"((\\x[a-z0-9]{1,}){1,})", k)
for each in p:
    print(each[0])
Output
#DerkGently #seanferg85 #Umbertobaggio #EL4JC and he already had Popular support.. most people know this already. A\xe2\x80\xa6 for a string like \x60\xe2\x4b(indicating a emoticon) using regular expression in python
\xe2\x80\xa6
\x60\xe2\x4b

Appending '0x' before the hex numbers in a string

I'm parsing an XML file in which I get basic expressions (like id*10+2). What I am trying to do is evaluate the expression to actually get the value. To do so, I use the eval() method, which works very well.
The only thing is, the numbers are in fact hexadecimal numbers. The eval() method would work fine if every hex number were prefixed with '0x', but I could not find a way to do that, nor could I find a similar question here. How would this be done in a clean way?
Use the re module.
>>> import re
>>> re.sub(r'([\dA-F]+)', r'0x\1', 'id*A+2')
'id*0xA+0x2'
>>> eval(re.sub(r'([\dA-F]+)', r'0x\1', 'CAFE+BABE'))
99772
Be warned though: with invalid input, eval won't work, and there are many risks in using eval.
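If you do stick with eval, one common mitigation (still not a real sandbox) is to empty out the builtins and pass in only the names you mean to expose:

import re

expr = re.sub(r'([\dA-F]+)', r'0x\1', 'id*A+2')     # -> 'id*0xA+0x2'
print(eval(expr, {'__builtins__': {}}, {'id': 2}))  # 2*0xA+0x2 == 22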
If your hex numbers have lowercase letters, then you could use this:
>>> re.sub(r'(?<!i)([\da-fA-F]+)', r'0x\1', 'id*a+b')
'id*0xa+0xb'
This uses a negative lookbehind assertion to ensure that the letter i does not precede the section it is trying to convert (preventing 'id' from turning into 'i0xd'). Replace i with I if the variable is Id.
If you can parse the expression into individual numbers, then I would suggest using the int function:
>>> int("CAFE", 16)
51966
Be careful with eval! Never use it on untrusted input.
If it's just simple arithmetic, I'd use a custom parser (there are tons of examples out in the wild)... And using parser generators (flex/bison, antlr, etc.) is a skill that is useful and easily forgotten, so it could be a good chance to refresh or learn it.
One option is to use the parser module:
import parser, token, re

def hexify(ast):
    if not isinstance(ast, list):
        return ast
    if ast[0] in (token.NAME, token.NUMBER) and re.match('[0-9a-fA-F]+$', ast[1]):
        return [token.NUMBER, '0x' + ast[1]]
    return map(hexify, ast)

def hexified_eval(expr, *args):
    ast = parser.sequence2st(hexify(parser.expr(expr).tolist()))
    return eval(ast.compile(), *args)
>>> hexified_eval('id*10 + BABE', {'id':0xcafe})
567466
This is somewhat cleaner than a regex solution in that it only attempts to replace tokens that have been positively identified as either names or numbers (and look like hex numbers). It also correctly handles more general python expressions such as id*10 + len('BABE') (it won't replace 'BABE' with '0xBABE').
OTOH, the regex solution is simpler and might cover all the cases you need to deal with anyway.
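As an aside for modern Python: the parser module was removed in 3.10, but the same token-level rewrite can be sketched with the tokenize module (a rough equivalent, not the answer's original code; untokenize does not guarantee the original spacing):

import io
import re
import token
import tokenize

def hexify3(expr):
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(expr).readline):
        s = tok.string
        # Prefix anything that was tokenized as a name or number and
        # looks like a bare hex literal.
        if tok.type in (token.NAME, token.NUMBER) and re.fullmatch('[0-9a-fA-F]+', s):
            s = '0x' + s
        out.append((tok.type, s))
    return tokenize.untokenize(out)

print(hexify3('id*10 + BABE'))  # equivalent to id*0x10 + 0xBABE (spacing may differ)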

adding regexp to yaml python

Is there any way to store and read this regexp in YAML by using python:
regular: /<title [^>]*lang=("|')wo("|')>/
Does anyone have an idea or a solution for this?
I have the following error:
yaml.scanner.ScannerError: while scanning for the next token
found character '|' that cannot start any token
  in "test.yaml", line 10, column 49
My code:
def test2():
    clueAppconf = open('test.yaml')
    clueContext = yaml.load(clueAppconf)
    print clueContext['webApp']
Ok, it looks like the problem is the type of scalar you have chosen to represent this regex. If you're married to scalars (YAML strings), you'll need to use double-quoted scalars and escape codes for the special characters it chokes on. So your YAML should look something like this:
regular: "/<title [^>]*lang=(\"\x7C')wo(\"\x7C')>/"
I've only escaped the character it was choking on, to maintain some semblance of readability; you may need to escape more depending on whether it throws further errors. (Note the inner double quotes must also be escaped as \" once the whole scalar is double-quoted.) Alternatively, you could use Unicode escape codes. That would look like this:
regular: "/<title [^>]*lang=(\"\u007C')wo(\"\u007C')>/"
I'm a little rusty on my YAML knowledge, so I don't know a way to keep the special characters and their readability in the YAML. Based on my cursory scan of the YAML documentation, this was the best I could find.
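A quick round-trip check of the double-quoted form (assumes PyYAML; note the inner double quotes escaped as \"):

import yaml

doc = '''regular: "/<title [^>]*lang=(\\"\\x7C')wo(\\"\\x7C')>/"'''
print(yaml.safe_load(doc)['regular'])
# -> /<title [^>]*lang=("|')wo("|')>/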

splitting unicode string into words

I am trying to split a Unicode string into words (simplistic), like this:
print re.findall(r'(?u)\w+', "раз два три")
What I expect to see is:
['раз','два','три']
But what I really get is:
['\xd1', '\xd0', '\xd0', '\xd0', '\xd0\xb2\xd0', '\xd1', '\xd1', '\xd0']
What am I doing wrong?
Edit:
If I use u in front of the string:
print re.findall(r'(?u)\w+', u"раз два три")
I get:
[u'\u0440\u0430\u0437', u'\u0434\u0432\u0430', u'\u0442\u0440\u0438']
Edit 2:
Aaaaand it seems like I should have read docs first:
print re.findall(r'(?u)\w+', u"раз два три")[0].encode('utf-8')
Will give me:
раз
Just to make sure though, does that sound like a proper way of approaching it?
You're actually getting the stuff you expect in the unicode case. You only think you are not because of the weird escaping, which comes from looking at the reprs of the strings rather than printing their unescaped values. (This is just how lists are displayed.)
>>> words = [u'\u0440\u0430\u0437', u'\u0434\u0432\u0430', u'\u0442\u0440\u0438']
>>> for w in words:
...     print w  # uses the terminal's encoding -- only rely on this interactively
...
раз
два
три
>>> u'раз' == u'\u0440\u0430\u0437'
True
Don't miss my remark about printing these unicode strings. Normally if you were going to send them to screen, a file, over the wire, etc. you need to manually encode them into the correct encoding. When you use print, Python tries to leverage your terminal's encoding, but it can only do that if there is a terminal. Because you don't generally know if there is one, you should only rely on this in the interactive interpreter, and always encode to the right encoding explicitly otherwise.
For this simple split-on-whitespace approach, you might not want to use a regex at all; simply use the unicode.split method.
>>> u"раз два три".split()
[u'\u0440\u0430\u0437', u'\u0434\u0432\u0430', u'\u0442\u0440\u0438']
Your top (bytestring) example does not work because re basically assumes all bytestrings are ASCII for its semantics, but yours was not. Using unicode strings allows you to get the right semantics for your alphabet and locale. As much as possible, textual data should always be represented using unicode rather than str.
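If your input really does arrive as UTF-8 bytes, decode first and then match; a minimal Python 2 sketch:

# -*- coding: utf-8 -*-
import re

s = "раз два три"         # a UTF-8 byte string in the source file
u = s.decode('utf-8')     # make it unicode before matching
print re.findall(r'(?u)\w+', u)
# [u'\u0440\u0430\u0437', u'\u0434\u0432\u0430', u'\u0442\u0440\u0438']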
