"Unescaping" backslashes in a string [duplicate] - python

This question already has an answer here:
Reversing Python's re.escape
(1 answer)
Closed 7 months ago.
TL;DR;
I want to transform a string (representing a regex) like "\\." into "\." in a clean and resilient way (something akin to sed 's/\\\\/\\/g', I don't know if this could break on edge cases though)
val.decode('string-escape') is not an option since I'm using python3.
What I tried so far:
variations of val.replace('\\\\', '\\')
looked at the answers to these two
questions but couldn't get them to work in my case
variations of val.encode().decode('unicode-escape')
had a look at the docs for strings but
couldn't find a solution
I am sure that I missed a relevant part, because string escaping (and unescaping) seems like a fairly common and basic problem, but I haven't found a solution yet =/
Full Story:
I have a YAML-File like so
- !Scheme
barcode: _([ACGTacgt]+)[_.]
lane: _L(\d\d\d)[_.]
name: RKI
read: _R(\d)+[_.]
sample_name: ^(.+)(?:_.+){5}
set: _S(\d+)[_.]
user: _U([a-zA-Z0-9\-]+)[_.]
validation: .*/(?:[a-zA-Z0-9\-]+_)+(?:[a-zA-Z0-9])+\.fastq.*
...
that describes a "Scheme" Object.
The 'name' key is an identifier and the rest describe regexes.
I want to be able to parse an object from that YAML so I wrote a from_yaml class method:
scheme = Scheme()
loaded_mapping = loader.construct_mapping(node) # load yaml-node as dictionary WARNING! loads str escaped
# re.compile all keys except name, adding name as regular string and
# unescaping escaped sequences (like '\') in the process
for key, val in loaded_mapping.items():
if key == 'name':
processed_val = val
else:
processed_val = re.compile(val) # backslashes in val are escaped
scheme.__dict__[key] = processed_val
the problem is that loader.construct_mapping(node) loads the strings with backslashes escaped, so the regex is not correct anymore.
I tried several variations of val.encode().decode('unicode-escape') and val.replace('\\\\', '\\'),
but had no luck with it
If anyone has an idea how to handle this I'd appreciate it very much! I am not married to this specific way of doing things and open to alternative approaches.
Kind Regards!

Assuming I have this super simple YAML file
lane: _L(\d\d\d)[_.]
and load it with PyYAML like this:
import yaml
import re
with open('test.yaml', 'rb') as stream:
data = yaml.safe_load(stream)
lane_pattern = data['lane']
print(lane_pattern)
lane_expr = re.compile(data['lane'])
print(lane_expr)
Then the result is exactly as one would expect:
_L(\d\d\d)[_.]
re.compile('_L(\\d\\d\\d)[_.]')
There is no double escaping of strings going on when YAML is parsed, so there is nothing for you to unescape.

Related

How to convert a regular string to a raw string? [duplicate]

I have a string s, its contents are variable. How can I make it a raw string? I'm looking for something similar to the r'' method.
i believe what you're looking for is the str.encode("string-escape") function. For example, if you have a variable that you want to 'raw string':
a = '\x89'
a.encode('unicode_escape')
'\\x89'
Note: Use string-escape for python 2.x and older versions
I was searching for a similar solution and found the solution via:
casting raw strings python
Raw strings are not a different kind of string. They are a different way of describing a string in your source code. Once the string is created, it is what it is.
Since strings in Python are immutable, you cannot "make it" anything different. You can however, create a new raw string from s, like this:
raw_s = r'{}'.format(s)
As of Python 3.6, you can use the following (similar to #slashCoder):
def to_raw(string):
return fr"{string}"
my_dir ="C:\data\projects"
to_raw(my_dir)
yields 'C:\\data\\projects'. I'm using it on a Windows 10 machine to pass directories to functions.
raw strings apply only to string literals. they exist so that you can more conveniently express strings that would be modified by escape sequence processing. This is most especially useful when writing out regular expressions, or other forms of code in string literals. if you want a unicode string without escape processing, just prefix it with ur, like ur'somestring'.
For Python 3, the way to do this that doesn't add double backslashes and simply preserves \n, \t, etc. is:
a = 'hello\nbobby\nsally\n'
a.encode('unicode-escape').decode().replace('\\\\', '\\')
print(a)
Which gives a value that can be written as CSV:
hello\nbobby\nsally\n
There doesn't seem to be a solution for other special characters, however, that may get a single \ before them. It's a bummer. Solving that would be complex.
For example, to serialize a pandas.Series containing a list of strings with special characters in to a textfile in the format BERT expects with a CR between each sentence and a blank line between each document:
with open('sentences.csv', 'w') as f:
current_idx = 0
for idx, doc in sentences.items():
# Insert a newline to separate documents
if idx != current_idx:
f.write('\n')
# Write each sentence exactly as it appared to one line each
for sentence in doc:
f.write(sentence.encode('unicode-escape').decode().replace('\\\\', '\\') + '\n')
This outputs (for the Github CodeSearchNet docstrings for all languages tokenized into sentences):
Makes sure the fast-path emits in order.
#param value the value to emit or queue up\n#param delayError if true, errors are delayed until the source has terminated\n#param disposable the resource to dispose if the drain terminates
Mirrors the one ObservableSource in an Iterable of several ObservableSources that first either emits an item or sends\na termination notification.
Scheduler:\n{#code amb} does not operate by default on a particular {#link Scheduler}.
#param the common element type\n#param sources\nan Iterable of ObservableSource sources competing to react first.
A subscription to each source will\noccur in the same order as in the Iterable.
#return an Observable that emits the same sequence as whichever of the source ObservableSources first\nemitted an item or sent a termination notification\n#see ReactiveX operators documentation: Amb
...
Just format like that:
s = "your string"; raw_s = r'{0}'.format(s)
With a little bit correcting #Jolly1234's Answer:
here is the code:
raw_string=path.encode('unicode_escape').decode()
s = "hel\nlo"
raws = '%r'%s #coversion to raw string
#print(raws) will print 'hel\nlo' with single quotes.
print(raws[1:-1]) # will print hel\nlo without single quotes.
#raws[1:-1] string slicing is performed
The solution, which worked for me was:
fr"{orignal_string}"
Suggested in comments by #ChemEnger
I suppose repr function can help you:
s = 't\n'
repr(s)
"'t\\n'"
repr(s)[1:-1]
't\\n'
Just simply use the encode function.
my_var = 'hello'
my_var_bytes = my_var.encode()
print(my_var_bytes)
And then to convert it back to a regular string do this
my_var_bytes = 'hello'
my_var = my_var_bytes.decode()
print(my_var)
--EDIT--
The following does not make the string raw but instead encodes it to bytes and decodes it.

How can I filter Emoji characters from my input so I can save in MySQL <5.5?

I have a Django app that takes tweet data from Twitter's API and saves it in a MySQL database. As far as I know (I'm still getting my head around the finer points of character encoding) I'm using UTF-8 everywhere, including MySQL encoding and collation, which works fine except when a tweet contains Emoji characters, which I understand use a four-byte encoding. Trying to save them produces the following warnings from Django:
/home/biggleszx/.virtualenvs/myvirtualenv/lib/python2.6/site-packages/django/db/backends/mysql/base.py:86: Warning: Incorrect string value: '\xF0\x9F\x98\xAD I...' for column 'text' at row 1
return self.cursor.execute(query, args)
I'm using MySQL 5.1, so using utf8mb4 isn't an option unless I upgrade to 5.5, which I'd rather not just yet (also from what I've read, Django's support for this isn't quite production-ready, though this might no longer be accurate). I've also seen folks advising the use of BLOB instead of TEXT on affected columns, which I'd also rather not do as I figure it would harm performance.
My question is, then, assuming I'm not too bothered about 100% preservation of the tweet contents, is there a way I can filter out all Emoji characters and replace them with a non-multibyte character, such as the venerable WHITE MEDIUM SMALL SQUARE (U+25FD)? I figure this is the easiest way to save that data given my current setup, though if I'm missing another obvious solution, I'd love to hear it!
FYI, I'm using the stock Python 2.6.5 on Ubuntu 10.04.4 LTS. sys.maxunicode is 1114111, so it's a UCS-4 build.
Thanks for reading.
So it turns out this has been answered a few times, I just hadn't quite got the right Google-fu to find the existing questions.
Python, convert 4-byte char to avoid MySQL error "Incorrect string value:"
Warning raised by inserting 4-byte unicode to mysql
Thanks to Martijn Pieters, the solution came from the world of regular expressions, specifically this code (based on his answer to the first link above):
import re
try:
# UCS-4
highpoints = re.compile(u'[\U00010000-\U0010ffff]')
except re.error:
# UCS-2
highpoints = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
# mytext = u'<some string containing 4-byte chars>'
mytext = highpoints.sub(u'\u25FD', mytext)
The character I'm replacing with is the WHITE MEDIUM SMALL SQUARE (U+25FD), FYI, but could be anything.
For those unfamiliar with UCS, like me, this is a system for Unicode conversion and a given build of Python will include support for either the UCS-2 or UCS-4 variant, each of which has a different upper bound on character support.
With the addition of this code, the strings seem to persist in MySQL 5.1 just fine.
Hope this helps anyone else in the same situation!
I tryied the solution by BigglesZX and its wasn't woring for the emoji of the heart (❤) after reading the [emoji's wikipedia article][1] I've seen that the regular expression is not covering all the emojis while also covering other range of unicode that are not emojis.
The following code create the 5 regular expressions that cover the 5 emoji blocks in the standard:
emoji_symbols_pictograms = re.compile(u'[\U0001f300-\U0001f5fF]')
emoji_emoticons = re.compile(u'[\U0001f600-\U0001f64F]')
emoji_transport_maps = re.compile(u'[\U0001f680-\U0001f6FF]')
emoji_symbols = re.compile(u'[\U00002600-\U000026FF]')
emoji_dingbats = re.compile(u'[\U00002700-\U000027BF]')
Those blocks could be merged in three blocks (UCS-4):
emoji_block0 = re.compile(u'[\U00002600-\U000027BF]')
emoji_block1 = re.compile(u'[\U0001f300-\U0001f64F]')
emoji_block2 = re.compile(u'[\U0001f680-\U0001f6FF]')
Their equivalents in UCS-2 are:
emoji_block0 = re.compile(u'[\u2600-\u27BF]')
emoji_block1 = compile(u'[\uD83C][\uDF00-\uDFFF]')
emoji_block1b = compile(u'[\uD83D][\uDC00-\uDE4F]')
emoji_block2 = re.compile(u'[\uD83D][\uDE80-\uDEFF]')
So finally we can define a single regular expression with all the cases together:
import re
try:
# UCS-4
highpoints = re.compile(u'([\U00002600-\U000027BF])|([\U0001f300-\U0001f64F])|([\U0001f680-\U0001f6FF])')
except re.error:
# UCS-2
highpoints = re.compile(u'([\u2600-\u27BF])|([\uD83C][\uDF00-\uDFFF])|([\uD83D][\uDC00-\uDE4F])|([\uD83D][\uDE80-\uDEFF])')
# mytext = u'<some string containing 4-byte chars>'
mytext = highpoints.sub(u'\u25FD', mytext)
I found out there another regular expresion that is able to identify the emojis.
This the regex is provided by the team at instagram-enginnering blog
u"(?<!&)#(\w|(?:[\xA9\xAE\u203C\u2049\u2122\u2139\u2194-\u2199\u21A9\u21AA\u231A\u231B\u2328\u2388\u23CF\u23E9-\u23F3\u23F8-\u23FA\u24C2\u25AA\u25AB\u25B6\u25C0\u25FB-\u25FE\u2600-\u2604\u260E\u2611\u2614\u2615\u2618\u261D\u2620\u2622\u2623\u2626\u262A\u262E\u262F\u2638-\u263A\u2648-\u2653\u2660\u2663\u2665\u2666\u2668\u267B\u267F\u2692-\u2694\u2696\u2697\u2699\u269B\u269C\u26A0\u26A1\u26AA\u26AB\u26B0\u26B1\u26BD\u26BE\u26C4\u26C5\u26C8\u26CE\u26CF\u26D1\u26D3\u26D4\u26E9\u26EA\u26F0-\u26F5\u26F7-\u26FA\u26FD\u2702\u2705\u2708-\u270D\u270F\u2712\u2714\u2716\u271D\u2721\u2728\u2733\u2734\u2744\u2747\u274C\u274E\u2753-\u2755\u2757\u2763\u2764\u2795-\u2797\u27A1\u27B0\u27BF\u2934\u2935\u2B05-\u2B07\u2B1B\u2B1C\u2B50\u2B55\u3030\u303D\u3297\u3299]|\uD83C[\uDC04\uDCCF\uDD70\uDD71\uDD7E\uDD7F\uDD8E\uDD91-\uDD9A\uDE01\uDE02\uDE1A\uDE2F\uDE32-\uDE3A\uDE50\uDE51\uDF00-\uDF21\uDF24-\uDF93\uDF96\uDF97\uDF99-\uDF9B\uDF9E-\uDFF0\uDFF3-\uDFF5\uDFF7-\uDFFF]|\uD83D[\uDC00-\uDCFD\uDCFF-\uDD3D\uDD49-\uDD4E\uDD50-\uDD67\uDD6F\uDD70\uDD73-\uDD79\uDD87\uDD8A-\uDD8D\uDD90\uDD95\uDD96\uDDA5\uDDA8\uDDB1\uDDB2\uDDBC\uDDC2-\uDDC4\uDDD1-\uDDD3\uDDDC-\uDDDE\uDDE1\uDDE3\uDDEF\uDDF3\uDDFA-\uDE4F\uDE80-\uDEC5\uDECB-\uDED0\uDEE0-\uDEE5\uDEE9\uDEEB\uDEEC\uDEF0\uDEF3]|\uD83E[\uDD10-\uDD18\uDD80-\uDD84\uDDC0]|(?:0\u20E3|1\u20E3|2\u20E3|3\u20E3|4\u20E3|5\u20E3|6\u20E3|7\u20E3|8\u20E3|9\u20E3|#\u20E3|\\*\u20E3|\uD83C(?:\uDDE6\uD83C(?:\uDDEB|\uDDFD|\uDDF1|\uDDF8|\uDDE9|\uDDF4|\uDDEE|\uDDF6|\uDDEC|\uDDF7|\uDDF2|\uDDFC|\uDDE8|\uDDFA|\uDDF9|\uDDFF|\uDDEA)|\uDDE7\uD83C(?:\uDDF8|\uDDED|\uDDE9|\uDDE7|\uDDFE|\uDDEA|\uDDFF|\uDDEF|\uDDF2|\uDDF9|\uDDF4|\uDDE6|\uDDFC|\uDDFB|\uDDF7|\uDDF3|\uDDEC|\uDDEB|\uDDEE|\uDDF6|\uDDF1)|\uDDE8\uD83C(?:\uDDF2|\uDDE6|\uDDFB|\uDDEB|\uDDF1|\uDDF3|\uDDFD|\uDDF5|\uDDE8|\uDDF4|\uDDEC|\uDDE9|\uDDF0|\uDDF7|\uDDEE|\uDDFA|\uDDFC|\uDDFE|\uDDFF|\uDDED)|\uDDE9\uD83C(?:\uDDFF|\uDDF0|\uDDEC|\uDDEF|\uDDF2|\uDDF4|\uDDEA)|\uDDEA\uD83C(?:\uDDE6|\uDDE8|\uDDEC|\uDDF7|\uDDEA|\uDDF9|\uDDFA|\uDDF8|\uDDED)|\uDDEB\uD83C(?:\uDDF0|\uDDF4|\uDDEF|\uDDEE|\uDDF7|\uDDF2)|\uDDEC\uD83C(?:\uDDF6|\uDDEB|\uDDE6|\uDDF2|\uDDEA|\uDDED|\uDDEE|\uDDF7|\uDDF1|\uDDE9|\uDDF5|\uDDFA|\uDDF9|\uDDEC|\uDDF3|\uDDFC|\uDDFE|\uDDF8|\uDDE7)|\uDDED\uD83C(?:\uDDF7|\uDDF9|\uDDF2|\uDDF3|\uDDF0|\uDDFA)|\uDDEE\uD83C(?:\uDDF4|\uDDE8|\uDDF8|\uDDF3|\uDDE9|\uDDF7|\uDDF6|\uDDEA|\uDDF2|\uDDF1|\uDDF9)|\uDDEF\uD83C(?:\uDDF2|\uDDF5|\uDDEA|\uDDF4)|\uDDF0\uD83C(?:\uDDED|\uDDFE|\uDDF2|\uDDFF|\uDDEA|\uDDEE|\uDDFC|\uDDEC|\uDDF5|\uDDF7|\uDDF3)|\uDDF1\uD83C(?:\uDDE6|\uDDFB|\uDDE7|\uDDF8|\uDDF7|\uDDFE|\uDDEE|\uDDF9|\uDDFA|\uDDF0|\uDDE8)|\uDDF2\uD83C(?:\uDDF4|\uDDF0|\uDDEC|\uDDFC|\uDDFE|\uDDFB|\uDDF1|\uDDF9|\uDDED|\uDDF6|\uDDF7|\uDDFA|\uDDFD|\uDDE9|\uDDE8|\uDDF3|\uDDEA|\uDDF8|\uDDE6|\uDDFF|\uDDF2|\uDDF5|\uDDEB)|\uDDF3\uD83C(?:\uDDE6|\uDDF7|\uDDF5|\uDDF1|\uDDE8|\uDDFF|\uDDEE|\uDDEA|\uDDEC|\uDDFA|\uDDEB|\uDDF4)|\uDDF4\uD83C\uDDF2|\uDDF5\uD83C(?:\uDDEB|\uDDF0|\uDDFC|\uDDF8|\uDDE6|\uDDEC|\uDDFE|\uDDEA|\uDDED|\uDDF3|\uDDF1|\uDDF9|\uDDF7|\uDDF2)|\uDDF6\uD83C\uDDE6|\uDDF7\uD83C(?:\uDDEA|\uDDF4|\uDDFA|\uDDFC|\uDDF8)|\uDDF8\uD83C(?:\uDDFB|\uDDF2|\uDDF9|\uDDE6|\uDDF3|\uDDE8|\uDDF1|\uDDEC|\uDDFD|\uDDF0|\uDDEE|\uDDE7|\uDDF4|\uDDF8|\uDDED|\uDDE9|\uDDF7|\uDDEF|\uDDFF|\uDDEA|\uDDFE)|\uDDF9\uD83C(?:\uDDE9|\uDDEB|\uDDFC|\uDDEF|\uDDFF|\uDDED|\uDDF1|\uDDEC|\uDDF0|\uDDF4|\uDDF9|\uDDE6|\uDDF3|\uDDF7|\uDDF2|\uDDE8|\uDDFB)|\uDDFA\uD83C(?:\uDDEC|\uDDE6|\uDDF8|\uDDFE|\uDDF2|\uDDFF)|\uDDFB\uD83C(?:\uDDEC|\uDDE8|\uDDEE|\uDDFA|\uDDE6|\uDDEA|\uDDF3)|\uDDFC\uD83C(?:\uDDF8|\uDDEB)|\uDDFD\uD83C\uDDF0|\uDDFE\uD83C(?:\uDDF9|\uDDEA)|\uDDFF\uD83C(?:\uDDE6|\uDDF2|\uDDFC))))[\ufe00-\ufe0f\u200d]?)+
Source:
http://instagram-engineering.tumblr.com/post/118304328152/emojineering-part-2-implementing-hashtag-emoji
note: I add another answer as this one is not complemetary to my previous answer here.
i am using json encoder function that encode the input.
this function is used for dict encoding (to convert it to string) on json.dumps. (so we need to do some edit to the response - remove the ' " ')
this enabled me to save emoji to mysql, and present it (at web):
# encode input
from json.encoder import py_encode_basestring_ascii
name = py_encode_basestring_ascii(name)[1:-1]
# save
YourModel.name = name
name.save()

adding regexp to yaml python

Is there any way to store and read this regexp in YAML by using python:
regular: /<title [^>]*lang=("|')wo("|')>/
Anyone have any idea or some solution for this ?
I have the following error:
% ch.encode('utf-8'), self.get_mark())
yaml.scanner.ScannerError: while scanning for the next token
found character '|' that cannot start any token
in "test.yaml", line 10, column 49
My code:
def test2():
clueAppconf = open('test.yaml')
clueContext = yaml.load(clueAppconf)
print clueContext['webApp']
Ok, it looks like the problem is the type of scalar you have chosen to represent this regex. If you're married to scalars (yaml strings), you'll need to use double quoted scalars and escape codes for your special characters that it chokes on. So, your yaml should look something like this:
regular: "/<title [^>]*lang=("\x7C')wo("\x7C')>/"
I've only escaped the character that it was choking on to maintain some semblance of readability, however you may need to escape additional ones depending on whether it throws more errors. Additionally, you could use unicode escape codes. That would look like this:
regular: "/<title [^>]*lang=("\u007C')wo("\u007C')>/"
I'm a little out on my yaml knowledge, so I don't know a way to maintain the special characters and their readability in the yaml. Based on my cursory scan of the yaml documentation, this was the best I could find.

python string good practise: ' vs " [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
Single quotes vs. double quotes in Python
I have seen that when i have to work with string in Python both of the following sintax are accepted:
mystring1 = "here is my string 1"
mystring2 = 'here is my string 2'
Is anyway there any difference?
Is it by any reason better use one solution rather than the other?
Cheers,
No, there isn't. When the string contains a single quote, it's easier to enclose it in double quotes, and vice versa. Other than this, my advice would be to pick a style and stick to it.
Another useful type of string literals are triple-quoted strings that can span multiple lines:
s = """string literal...
...continues on second line...
...and ends here"""
Again, it's up to you whether to use single or double quotes for this.
Lastly, I'd like to mention "raw string literals". These are enclosed in r"..." or r'...' and prevent escape sequences (such as \n) from being parsed as such. Among other things, raw string literals are very handy for specifying regular expressions.
Read more about Python string literals here.
While it's true that there is no difference between one and the other, I encountered a lot of the following behavior in the opensource community:
" for text that is supposed to be read (email, feeback, execption, etc)
' for data text (key dict, function arguments, etc)
triple " for any docstring or text that includes " and '
No. A matter of style only. Just be consistent.
I tend to using " simply because that's what most other programming languages use.
So, habit, really.
There's no difference.
What's better is arguable. I use "..." for text strings and '...' for characters, because that's consistent with other languages and may save you some keypresses when porting to/from different language. For regexps and SQL queries, I always use r'''...''', because they frequently end up containing backslashes and both types of quotes.
Python is all about the least amount of code to get the most effect. The shorter the better. And ' is, in a way, one dot shorter than " which is why I prefer it. :)
As everyone's pointed out, they're functionally identical. However, PEP 257 (Docstring Conventions) suggests always using """ around docstrings just for the purposes of consistency. No one's likely to yell at you or think poorly of you if you don't, but there it is.

[Python]How to deal with a string ending with one backslash?

I'm getting some content from Twitter API, and I have a little problem, indeed I sometimes get a tweet ending with only one backslash.
More precisely, I'm using simplejson to parse Twitter stream.
How can I escape this backslash ?
From what I have read, such raw string shouldn't exist ...
Even if I add one backslash (with two in fact) I still get an error as I suspected (since I have a odd number of backslashes)
Any idea ?
I can just forget about these tweets too, but I'm still curious about that.
Thanks : )
Prepending the string with r (stands for "raw") will escape all characters inside the string. For example:
print r'\b\n\\'
will output
\b\n\\
Have I understood the question correctly?
I guess you are looking a method similar to stripslashes in PHP. So, here you go:
Python version of PHP's stripslashes
You can try using raw strings by prepending an r (so nothing has to be escaped) to the string or re.escape().
I'm not really sure what you need considering I haven't seen the text of the response. If none of the methods you come up with on your own or get from here work, you may have to forget about those tweets.
Unless you update your question and come back with a real problem, I'm asserting that you don't have an issue except confusion.
You get the string from the Tweeter API, ergo the string does not show up in your code. “Raw strings” exist only in your code, and it is “raw strings” in code that can't end in a backslash.
Consider this:
def some_obscure_api():
"This exists in a library, so you don't know what it does"
return r"hello" + "\\" # addition just for fun
my_string = some_obscure_api()
print(my_string)
See? my_string happily ends in a backslash and your code couldn't care less.

Categories

Resources