Python unicode escape for RethinkDB match (regex) query

I am trying to run a RethinkDB match query with an escaped, user-provided Unicode search parameter:
import re
from rethinkdb import RethinkDB
r = RethinkDB()
search_value = u"\u05e5"  # provided by user via flask
search_value_escaped = re.escape(search_value)  # results in u'\\\u05e5' ->
# when encoded with "utf-8" gives "\ץ" as expected.
conn = r.connect(...)
results_cursor_a = r.db(...).table(...).order_by(index="id").filter(
    lambda doc: doc.coerce_to("string").match(search_value)
).run(conn)  # search_value works fine
results_cursor_b = r.db(...).table(...).order_by(index="id").filter(
    lambda doc: doc.coerce_to("string").match(search_value_escaped)
).run(conn)  # search_value_escaped spits an error
The error for search_value_escaped is the following:
ReqlQueryLogicError: Error in regexp `\ץ` (portion `\ץ`): invalid escape sequence: \ץ in:
r.db(...).table(...).order_by(index="id").filter(lambda var_1: var_1.coerce_to('string').match(u'\\\u05e5m'))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I tried encoding with "utf-8" before and after re.escape(), but got the same result with different errors. What am I missing? Is it something in my code or some kind of bug?
EDIT: .coerce_to('string') converts the document to a "utf-8" encoded string. RethinkDB also converts the query to "utf-8" and then matches the two, which is why the first query works even though it looks like a Unicode match inside a string.

RethinkDB appears to reject escaped Unicode characters, so I wrote a simple workaround: a custom escape that still delegates to re.escape() for ASCII characters instead of implementing my own replacement logic (for fear of missing a character and creating a security issue).
import re
def no_unicode_escape(u):
    escaped_list = []
    for i in u:
        if ord(i) < 128:
            # ASCII: let re.escape decide what needs a backslash
            escaped_list.append(re.escape(i))
        else:
            # non-ASCII: pass through unescaped
            escaped_list.append(i)
    rv = "".join(escaped_list)
    return rv
or as a one-liner:
import re
def no_unicode_escape(u):
    return "".join(re.escape(i) if ord(i) < 128 else i for i in u)
This yields the required result of escaping "dangerous" characters and works with RethinkDB as I wanted.
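A quick way to sanity-check the workaround without a database is to run the escaped pattern through Python's own re module. This is only a sketch: the sample input is made up, and it says nothing about how RethinkDB's regex engine behaves. Note also that, as far as I know, re.escape() on Python 3.7 and later only escapes regex special characters, so the original error may not reproduce on newer interpreters.

```python
import re


def no_unicode_escape(text):
    """Escape ASCII characters with re.escape; pass non-ASCII ones through."""
    return "".join(re.escape(ch) if ord(ch) < 128 else ch for ch in text)


# Hypothetical user input: an ASCII dot (regex metacharacter) plus a Hebrew letter.
pattern = no_unicode_escape(u"a.b\u05e5")
print(pattern)

# The dot is now literal, so only the exact text matches.
print(bool(re.fullmatch(pattern, u"a.b\u05e5")))
print(bool(re.fullmatch(pattern, u"aXb\u05e5")))
```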

Related

Format String of Dictionary

I have a dictionary serialized as a string, as follows:
CREDENTIALS = "{\"aaaUser\": {\"attributes\": {\"pwd\": \"cisco123\", \"name\": \"admin\"}}}"
Now I want to format this string to replace the pwd and name dynamically. What I've tried is:
CREDENTIALS = "{\"aaaUser\": {\"attributes\": {\"pwd\": \"{0}\", \"name\": \"{1}\"}}}".format('password', 'username')
But this gives following error:
Traceback (most recent call last):
  File ".\ll.py", line 4, in <module>
    CREDENTIALS = "{\"aaaUser\": {\"attributes\": {\"pwd\": \"{0}\", \"name\": \"{1}\"}}}".format('password', 'username')
KeyError: '"aaaUser"'
It is possible to load the string as a dict using json.loads() and then set the attributes as required, but this is not what I want. I want to format the string so that I can use it in other files/modules.
What am I missing here? Any help would be appreciated.

Don't try to work with the JSON string directly; decode it, update the data structure, and re-encode it:
import json
# Use single quotes instead of escaping all the double quotes
CREDENTIALS = '{"aaaUser": {"attributes": {"pwd": "cisco123", "name": "admin"}}}'
d = json.loads(CREDENTIALS)
attributes = d["aaaUser"]["attributes"]
# username and password are whatever values you want to substitute
attributes["name"] = username
attributes["pwd"] = password
CREDENTIALS = json.dumps(d)
With string formatting, you would need to change your string to look like
CREDENTIALS = '{{"aaaUser": {{"attributes": {{"pwd": "{0}", "name": "{1}"}}}}}}'
doubling all the literal braces so that the format method doesn't mistake them for placeholders.
However, formatting also means that the password needs to be pre-escaped if it contains anything that could be mistaken for JSON syntax, such as a double quote.
# This produces invalid JSON
NEW_CREDENTIALS = CREDENTIALS.format('new"password', 'bob')
# This produces valid JSON
NEW_CREDENTIALS = CREDENTIALS.format('new\\"password', 'bob')
It's far easier and safer to just decode and re-encode.
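As a minimal sketch of the decode/re-encode approach with an awkward password, the helper below (the name with_credentials and the sample values are made up for illustration) shows that json.dumps takes care of escaping the embedded double quote:

```python
import json

TEMPLATE = '{"aaaUser": {"attributes": {"pwd": "cisco123", "name": "admin"}}}'


def with_credentials(template, username, password):
    """Return a JSON string like template, with name and pwd replaced."""
    data = json.loads(template)
    attributes = data["aaaUser"]["attributes"]
    attributes["name"] = username
    attributes["pwd"] = password
    return json.dumps(data)


# The password contains a double quote; no manual pre-escaping is needed.
result = with_credentials(TEMPLATE, "bob", 'new"password')
print(result)

# Round-trips cleanly back to the original Python values.
print(json.loads(result)["aaaUser"]["attributes"]["pwd"])
```

The resulting string can still be shared with other files or modules, which was the reason given for wanting a formatted string in the first place.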

str.format treats text enclosed in braces {} as a replacement field. Here CREDENTIALS starts with an opening brace, so str.format treats what follows as a replacement field and looks up "aaaUser" (including the quotes) as a field name, which is why it raises the KeyError.
The string on which this method is called can contain literal text or replacement fields delimited by braces {}
To keep a brace literal and substitute only the intended fields, double it:
'{{ Hey Escape }} {0}'.format(12) # O/P '{ Hey Escape } 12'
If you double every enclosing literal brace, at each nesting level, it will work.
Example:
'{{Escape Me {n} }}'.format(n='Yes') # O/P '{Escape Me Yes }'
So, following the str.format rule, I escape each literal brace of the enclosing structure by doubling it:
"{{\"aaaUser\": {{\"attributes\": {{\"pwd\": \"{0}\", \"name\": \"{1}\"}}}}}}".format('password', 'username')
#O/P '{"aaaUser": {"attributes": {"pwd": "password", "name": "username"}}}'
There is another way to make string formatting work. However, it is not recommended in your case, because you have to make sure the string always has the format you showed and contains nothing else that looks like a placeholder; otherwise the result could change drastically.
The approach is to convert each placeholder from {0} to %(0)s, so that %-style formatting can be used, which does not care about braces.
'Hello %(0)s' % {'0': 'World'} # Hello World
Here I use re.sub to replace all occurrences:
import re
def myReplace(obj):
    found = obj.group(0)
    if found:
        found = found.replace('{', '%(')
        found = found.replace('}', ')s')
    return found
CREDENTIALS = re.sub(r'\{\d\}', myReplace, "{\"aaaUser\": {\"attributes\": {\"pwd\": \"{0}\", \"name\": \"{1}\"}}}") % {'0': 'password', '1': 'username'}
print(CREDENTIALS)  # {"aaaUser": {"attributes": {"pwd": "password", "name": "username"}}}

How to check the Emoji property of a character in Python?

In Unicode, a character can have an Emoji property.
Is there a standard way in Python to determine if a character is an Emoji?
I know of unicodedata, but it doesn't appear to expose all these extra character details.
Note: I'm asking about the specific attribute called "Emoji" in the Unicode standard, as provided in the link. I don't want to have an arbitrary list of pattern ranges, and I would prefer to use a standard library.

This is the code I ended up creating to load the Emoji information. The get_emoji function gets the data file, parses it, and calls the enumeration callback. The rest of the code uses this to produce a JSON file of the information I needed.
#!/usr/bin/env python3
# Generates a list of emoji characters and names in JS format
import urllib.request
import unicodedata
import re, json

def get_emoji(on_emoji, attribute):
    '''
    Enumerates the Emoji characters that match an attribute from the Unicode standard (the Emoji list).
    @param on_emoji A callback that is called with each found character. Signature `on_emoji( code_point_value )`
    @param attribute The attribute that is desired, such as `Emoji` or `Emoji_Presentation`
    '''
    with urllib.request.urlopen('http://www.unicode.org/Public/emoji/5.0/emoji-data.txt') as f:
        content = f.read().decode(f.headers.get_content_charset())
    # Groups: 1 = first code point, 3 = optional end of range, 5 = property name
    cldr = re.compile(r'^([0-9A-F]+)(\.\.([0-9A-F]+))?([^;]*);([^#]*)#(.*)$')
    for line in content.splitlines():
        m = cldr.match(line)
        if m is None:
            continue
        line_attribute = m.group(5).strip()
        if line_attribute != attribute:
            continue
        code_point = int(m.group(1), 16)
        if m.group(3) is None:
            on_emoji(code_point)
        else:
            to_code_point = int(m.group(3), 16)
            for i in range(code_point, to_code_point + 1):
                on_emoji(i)

# Dumps the values into a JSON format
def print_emoji(value):
    c = chr(value)
    try:
        obj = {
            'code': value,
            'name': unicodedata.name(c).lower(),
        }
        print(json.dumps(obj), ',')
    except ValueError:
        # unicodedata.name() found no name:
        # Unicode DB is likely outdated in installed Python
        pass

print("module.exports = [")
get_emoji(print_emoji, "Emoji_Presentation")
print("]")
That solved my original problem. To answer the question itself it'd just be a matter of sticking the results into a dictionary and doing a lookup.
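The "stick the results into a dictionary and do a lookup" idea could look roughly like the sketch below. To stay self-contained it parses a few hand-written lines in the same layout as emoji-data.txt rather than downloading the real file; the sample lines, the helper names load_property and is_emoji, and the simplified regex are all assumptions for illustration.

```python
import re

# Simplified line pattern: code point, optional "..end", then "; Property".
LINE_RE = re.compile(r'^([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s*(\w+)')


def load_property(lines, attribute):
    """Collect every code point that the given lines tag with attribute."""
    code_points = set()
    for line in lines:
        m = LINE_RE.match(line)
        if m is None or m.group(3) != attribute:
            continue  # comment, blank line, or a different property
        first = int(m.group(1), 16)
        last = int(m.group(2), 16) if m.group(2) else first
        code_points.update(range(first, last + 1))
    return code_points


# Illustrative sample in the file's format, not a copy of the real data file.
SAMPLE = """\
# sample comment line
0023          ; Emoji                # number sign
1F600..1F64F  ; Emoji                # emoticons
1F600..1F64F  ; Emoji_Presentation   # emoticons
""".splitlines()

EMOJI = load_property(SAMPLE, "Emoji")


def is_emoji(ch):
    """Look up a single character in the loaded set."""
    return ord(ch) in EMOJI


print(is_emoji(chr(0x1F600)), is_emoji("A"))
```

In real use the lines would come from the downloaded file, as in get_emoji above; a set is enough when only membership matters, and a dict keyed by code point would fit if names are needed too.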

I have used the following regex pattern successfully before
import re
emoji_pattern = re.compile("["
u"\U0001F600-\U0001F64F" # emoticons
u"\U0001F300-\U0001F5FF" # symbols & pictographs
u"\U0001F680-\U0001F6FF" # transport & map symbols
u"\U0001F1E0-\U0001F1FF" # flags (iOS)
"]+", flags=re.UNICODE)
Also check out this question: removing emojis from a string in Python

Bottle wildcard filter for hex color code

I am trying to add a filter for hex color codes (it should accept formats like 0xFF0000 or FF0000) to my Bottle application.
I followed this bottle tutorial https://bottlepy.org/docs/dev/routing.html:
You can add your own filters to the router. All you need is a function that returns three elements: A regular expression string, a callable to convert the URL fragment to a python value, and a callable that does the opposite. The filter function is called with the configuration string as the only parameter and may parse it as needed:
But every time I call my function:
@app.route('/<color:hexa>')
def call(color):
    ....
I receive a 404:
Not found: '/0x0000FF'
Maybe I am blind, but I just don't know what I am missing. Here is my filter:
def hexa_filter(config):
    regexp = r'^(0[xX])?[a-fA-F0-9]+$'
    def to_python(match):
        return int(match, 0)
    def to_url(hexNum):
        return str(hexNum)
    return regexp, to_python, to_url
app.router.add_filter('hexa', hexa_filter)

The problem is the ^ (and possibly the $).
Your regex is used as part of a bigger regex that matches the full URL, so a ^ (and sometimes a $) inside that bigger regex makes no sense.
from bottle import *
app = Bottle()
def hexa_filter(config):
    # No anchors: this fragment is embedded in the router's full-URL pattern
    regexp = r'(0[xX])?[a-fA-F0-9]+'
    def to_python(match):
        return int(match, 16)
    def to_url(hexNum):
        return str(hexNum)
    return regexp, to_python, to_url
app.router.add_filter('hexa', hexa_filter)
@app.route('/<color:hexa>')
def call(color):
    return 'color: ' + str(color)
app.run(host='localhost', port=8000)
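Why the anchors break the route can be sketched with plain re. The route_pattern helper below is a made-up stand-in that wraps the filter's fragment in a larger full-URL pattern; it is an assumption about the general idea, not Bottle's actual router code.

```python
import re


def route_pattern(fragment):
    """Hypothetical stand-in for a router: embed fragment in a full-URL regex."""
    return re.compile(r"^/(?P<color>" + fragment + r")$")


anchored = route_pattern(r"^(0[xX])?[a-fA-F0-9]+$")
unanchored = route_pattern(r"(0[xX])?[a-fA-F0-9]+")

# The inner ^ sits after the leading "/", where start-of-string can never match.
print(anchored.match("/0x0000FF"))

# Without the anchors, the fragment matches and can be converted.
color = unanchored.match("/0x0000FF").group("color")
print(color, int(color, 16))
```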

urllib.quote() throws KeyError

To encode the URI, I used urllib.quote("schönefeld"), but when the string contains non-ASCII characters it throws
KeyError: u'\xe9'
Code: return ''.join(map(quoter, s))
My input strings are köln, brønshøj, schönefeld etc.
It works when I just print the result on Windows (using Python 2.7, PyScripter IDE), but on Linux it raises the exception (I guess the platform doesn't matter).
This is what I am trying:
from commands import getstatusoutput
from urllib import quote
queryParams = "schönefeld"
cmdString = "http://baseurl" + quote(queryParams)
print getstatusoutput(cmdString)
Exploring the cause of the issue:
In urllib.quote(), the exception is actually thrown at return ''.join(map(quoter, s)).
The code in urllib is:
def quote(s, safe='/'):
    if not s:
        if s is None:
            raise TypeError('None object cannot be quoted')
        return s
    cachekey = (safe, always_safe)
    try:
        (quoter, safe) = _safe_quoters[cachekey]
    except KeyError:
        safe_map = _safe_map.copy()
        safe_map.update([(c, c) for c in safe])
        quoter = safe_map.__getitem__
        safe = always_safe + safe
        _safe_quoters[cachekey] = (quoter, safe)
    if not s.rstrip(safe):
        return s
    return ''.join(map(quoter, s))
The exception comes from ''.join(map(quoter, s)): the quoter function is called for every element in s, and the resulting list is joined with '' and returned.
For the non-ASCII character è, the equivalent value is %E8, which is present in the _safe_map variable. But when I call quote('è'), it searches for the key \xe8, which does not exist, so the exception is thrown.
So I added s = [el.upper().replace("\\X","%") for el in s] before calling ''.join(map(quoter, s)) within a try-except block. Now it works fine.
But I am wondering whether what I have done is the correct approach, or whether it will create other issues.
Also, I have 200+ Linux instances, which makes it very tough to deploy this fix everywhere.

You are trying to quote Unicode data, so you need to decide how to turn that into URL-safe bytes.
Encode the string to bytes first. UTF-8 is often used:
>>> import urllib
>>> urllib.quote(u'sch\xe9nefeld')
/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py:1268: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
return ''.join(map(quoter, s))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 1268, in quote
return ''.join(map(quoter, s))
KeyError: u'\xe9'
>>> urllib.quote(u'sch\xe9nefeld'.encode('utf8'))
'sch%C3%A9nefeld'
However, the encoding depends on what the server will accept. It's best to stick to the encoding the original form was sent with.
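The thread is about Python 2's urllib.quote. As a hedged sketch of the same "encode to bytes first" idea on Python 3, where the counterpart is urllib.parse.quote, the snippet below percent-encodes the same name under two encodings to show how the choice changes the result. Which one is right depends on the server, as noted above.

```python
from urllib.parse import quote

name = "sch\u00f6nefeld"  # "schönefeld"

# Encode to bytes explicitly, then percent-encode those bytes.
as_utf8 = quote(name.encode("utf-8"))
as_latin1 = quote(name.encode("latin-1"))

print(as_utf8)    # ö becomes two bytes in UTF-8
print(as_latin1)  # ö is a single byte in Latin-1
```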

I resolved the issue by just converting the string to unicode.
Here is the snippet:
try:
    unicode(mystring, "ascii")
except UnicodeError:
    mystring = unicode(mystring, "utf-8")
else:
    pass
Detailed description of solution can be found at http://effbot.org/pyfaq/what-does-unicodeerror-ascii-decoding-encoding-error-ordinal-not-in-range-128-mean.htm

I had the exact same error as @underscore, but in my case the problem was that map(quoter, s) tried to look up the key u'\xe9', which was not in the _safe_map. However, \xe9 was, so I solved the issue by replacing u'\xe9' with \xe9 in s.
Moreover, shouldn't the return statement be within the try/except? I also had to change this to completely solve the problem.

Unable to encode/decode pprint output

This question is based on a side-effect of that one.
My .py files all have the # -*- coding: utf-8 -*- encoding declaration on the first line, like my api.py.
As I mentioned in the related question, I use HttpResponse to return the API documentation. Since I defined the encoding with:
HttpResponse(cy_content, content_type='text/plain; charset=utf-8')
everything is OK, and when I call my API service there are no encoding problems, except for the string that pprint builds from a dictionary.
Since I am using Turkish characters in some values in my dict, pprint converts them to escape sequences, like:
API_STATUS = {
    1: 'müşteri',
    2: 'some other status message'
}
my_str = 'Here is the documentation part that contains Turkish chars like işüğçö'
my_str += pprint.pformat(API_STATUS, indent=4, width=1)
return HttpResponse(my_str, content_type='text/plain; charset=utf-8')
And my plain text output is like:
Here is the documentation part that contains Turkish chars like işüğçö
{
1: 'm\xc3\xbc\xc5\x9fteri',
2: 'some other status message'
}
I tried to decode or encode the pprint output with different encodings, with no success. What is the best practice to overcome this problem?

pprint appears to use repr by default, you can work around this by overriding PrettyPrinter.format:
# coding=utf8
import pprint
class MyPrettyPrinter(pprint.PrettyPrinter):
    def format(self, object, context, maxlevels, level):
        if isinstance(object, unicode):
            return (object.encode('utf8'), True, False)
        return pprint.PrettyPrinter.format(self, object, context, maxlevels, level)
d = {'foo': u'işüğçö'}
pprint.pprint(d) # {'foo': u'i\u015f\xfc\u011f\xe7\xf6'}
MyPrettyPrinter().pprint(d) # {'foo': işüğçö}

You should use unicode strings instead of 8-bit ones:
API_STATUS = {
    1: u'müşteri',
    2: u'some other status message'
}
my_str = u'Here is the documentation part that contains Turkish chars like işüğçö'
my_str += pprint.pformat(API_STATUS, indent=4, width=1)
The pprint module is designed to print all possible kinds of nested structures in a readable way. To do that, it prints each object's representation rather than converting it to a string, so you'll end up with the escape syntax whether you use unicode strings or not. But if you're using unicode in your document, then you really should be using unicode literals!
Anyway, thg435 has given you a solution for changing this behaviour of pformat.
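The point that pprint prints representations can be checked directly. The sketch below is for Python 3, whereas the thread is about Python 2: on Python 3, repr() keeps printable non-ASCII characters, so the escapes from the question no longer appear, while the built-in ascii() gives an escaped form similar to the Python 2 output.

```python
import pprint

value = "müşteri"

# pformat of a short string is simply its repr()
formatted = pprint.pformat(value)
print(formatted == repr(value))

# Python 3's repr keeps the Turkish characters readable...
print(formatted)

# ...while ascii() escapes them, much like Python 2's repr of a unicode string.
escaped = ascii(value)
print(escaped)
```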
