Is there any way to store this regexp in YAML and read it back using Python?
regular: /<title [^>]*lang=("|')wo("|')>/
Does anyone have an idea or a solution for this?
I have the following error:
% ch.encode('utf-8'), self.get_mark())
yaml.scanner.ScannerError: while scanning for the next token
found character '|' that cannot start any token
in "test.yaml", line 10, column 49
My code:
import yaml
def test2():
    clueAppconf = open('test.yaml')
    clueContext = yaml.load(clueAppconf)
    print clueContext['webApp']
OK, it looks like the problem is the type of scalar you have chosen to represent this regex. If you're married to scalars (YAML strings), you'll need to use a double-quoted scalar and escape codes for the special characters it chokes on. Inside a double-quoted scalar the embedded double quotes must also be escaped as \", otherwise they end the string early. So your YAML should look something like this:
regular: "/<title [^>]*lang=(\"\x7C')wo(\"\x7C')>/"
I've only escaped the characters it was choking on, to maintain some semblance of readability; you may need to escape additional ones if it throws more errors. Alternatively, you could use Unicode escape codes. That would look like this:
regular: "/<title [^>]*lang=(\"\u007C')wo(\"\u007C')>/"
I'm a little rusty on YAML, so I don't know a way to keep the special characters readable in the YAML itself. Based on my cursory scan of the YAML documentation, this was the best I could find.
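For what it's worth, a single-quoted scalar might be a more readable option: inside single quotes YAML applies no escape processing at all, and the only special case is that a literal single quote is written as two single quotes. Here is a minimal sketch, assuming PyYAML and an inline document standing in for test.yaml (the variable names are mine, not from the question):

```python
import re
import yaml  # PyYAML, the same library the question uses

# Single-quoted YAML scalar: | and " are literal, '' stands for one '.
document = """
regular: '/<title [^>]*lang=("|'')wo("|'')>/'
"""

config = yaml.safe_load(document)
regex_text = config["regular"]
print(regex_text)

# Python's re module has no /.../ delimiters, so strip them before compiling.
pattern = re.compile(regex_text.strip("/"))
print(bool(pattern.search('<title lang="wo">')))
print(bool(pattern.search("<title lang='wo'>")))
```

The value comes back from the loader exactly as the regex was written, so nothing needs to be unescaped afterwards.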
TL;DR;
I want to transform a string (representing a regex) like "\\." into "\." in a clean and resilient way (something akin to sed 's/\\\\/\\/g', though I don't know if this could break on edge cases).
val.decode('string-escape') is not an option since I'm using Python 3.
What I tried so far:
variations of val.replace('\\\\', '\\')
looked at the answers to these two questions but couldn't get them to work in my case
variations of val.encode().decode('unicode-escape')
had a look at the docs for strings but couldn't find a solution
I am sure that I missed a relevant part, because string escaping (and unescaping) seems like a fairly common and basic problem, but I haven't found a solution yet =/
Full Story:
I have a YAML file like so:
- !Scheme
  barcode: _([ACGTacgt]+)[_.]
  lane: _L(\d\d\d)[_.]
  name: RKI
  read: _R(\d)+[_.]
  sample_name: ^(.+)(?:_.+){5}
  set: _S(\d+)[_.]
  user: _U([a-zA-Z0-9\-]+)[_.]
  validation: .*/(?:[a-zA-Z0-9\-]+_)+(?:[a-zA-Z0-9])+\.fastq.*
...
that describes a "Scheme" Object.
The 'name' key is an identifier and the rest describe regexes.
I want to be able to parse an object from that YAML so I wrote a from_yaml class method:
scheme = Scheme()
loaded_mapping = loader.construct_mapping(node)  # load yaml-node as dictionary WARNING! loads str escaped
# re.compile all keys except name, adding name as regular string and
# unescaping escaped sequences (like '\') in the process
for key, val in loaded_mapping.items():
    if key == 'name':
        processed_val = val
    else:
        processed_val = re.compile(val)  # backslashes in val are escaped
    scheme.__dict__[key] = processed_val
The problem is that loader.construct_mapping(node) seems to load the strings with backslashes escaped, so the regex is not correct anymore.
I tried several variations of val.encode().decode('unicode-escape') and val.replace('\\\\', '\\'), but had no luck with either.
If anyone has an idea how to handle this I'd appreciate it very much! I am not married to this specific way of doing things and open to alternative approaches.
Kind Regards!
Assuming I have this super simple YAML file
lane: _L(\d\d\d)[_.]
and load it with PyYAML like this:
import yaml
import re
with open('test.yaml', 'rb') as stream:
    data = yaml.safe_load(stream)
lane_pattern = data['lane']
print(lane_pattern)
lane_expr = re.compile(data['lane'])
print(lane_expr)
Then the result is exactly as one would expect:
_L(\d\d\d)[_.]
re.compile('_L(\\d\\d\\d)[_.]')
There is no double escaping of strings going on when YAML is parsed, so there is nothing for you to unescape.
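The doubled backslashes in the re.compile(...) line are only how Python displays the pattern as a string literal; the string itself still holds one backslash per escape. A small sketch that checks this without any YAML involved (the file name in the sample is made up for illustration):

```python
import re

# The same text the YAML loader returned for the 'lane' key.
pattern_text = r"_L(\d\d\d)[_.]"

print(pattern_text)        # the actual characters
print(repr(pattern_text))  # the literal form, with each backslash doubled

# Count the backslashes in the string versus in its displayed literal.
backslashes_in_string = pattern_text.count("\\")
backslashes_in_repr = repr(pattern_text).count("\\")
print(backslashes_in_string, backslashes_in_repr)

# The pattern works as written; hypothetical file name for illustration.
match = re.search(pattern_text, "sample_L001_R1.fastq")
print(match.group(1))
```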
Given two nearly identical text files (plain text, created in MacVim), I get different results when reading them into a variable in Python. I want to know why this is and how I can produce consistent behavior.
For example, f1.txt looks like this:
This isn't a great example, but it works.
And f2.txt looks like this:
This isn't a great example, but it wasn't meant to be.
"But doesn't it demonstrate the problem?," she said.
When I read these files in, using something like the following:
f = open("f1.txt","r")
x = f.read()
I get the following when I look at the variables in the console. f1.txt:
>>> x
"This isn't a great example, but it works.\n\n"
And f2.txt:
>>> y
'This isn\'t a great example, but it wasn\'t meant to be. \n"But doesn\'t it demonstrate the problem?," she said.\n\n'
In other words, f1 comes in with only escaped newlines, while f2 also has its single quotes escaped.
repr() shows what's going on. First for f1:
>>> repr(x)
'"This isn\'t a great example, but it works.\\n\\n"'
And f2:
>>> repr(y)
'\'This isn\\\'t a great example, but it wasn\\\'t meant to be. \\n"But doesn\\\'t it demonstrate the problem?," she said.\\n\\n\''
This kind of behavior is driving me crazy. What's going on and how do I make it consistent? If it matters, I'm trying to read in plain text, manipulate it, and eventually write it out so that it shows the properly escaped characters (for pasting into Javascript code).
Python is giving you a string literal which, if you gave it back to Python, would result in the same string. This is known as the repr() (short for "representation") of the string. This may not (probably won't, in fact) match the string as it was originally specified, since there are so many ways to do that, and Python does not record anything about how it was originally specified.
It uses double quotes around your first example, which works fine because that string contains single quotes but no double quotes. The second string contains both kinds of quote, so double quotes would not avoid escaping. Instead it falls back to the default single-quote delimiter and uses backslashes to escape the single quotes in the string (it doesn't have to escape the double quotes this way).
There is no reason for this behavior to drive you crazy and no need to try to make it consistent. You only get the repr() of a string when you are peeking at values in Python's interactive mode. When you actually print or otherwise use the string, you get the string itself, not a reconstituted string literal.
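To see the difference between the two views, here is a short Python 3 sketch (the sample text is made up) that shows the literal form, the string itself, and that the literal evaluates back to the same string:

```python
import ast

text = 'This isn\'t a "great" example.\n'

# What the interactive console shows: a literal that would recreate the string.
literal = repr(text)
print(literal)

# What print or a file write actually emits: the characters themselves.
print(text, end="")

# Feeding the literal back to Python gives the original string again.
round_tripped = ast.literal_eval(literal)
print(round_tripped == text)
```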
If you want to get a JavaScript string literal, the easiest way is to use the json module:
import json
print(json.dumps('I said, "Hello, world!"'))
Both f1 and f2 contain perfectly normal, unescaped single quotes.
The fact that their repr looks different is meaningless.
There are a variety of different ways to represent the same string. For example, these are all equivalent literals:
"abc'def'ghi"
'abc\'def\'ghi'
'''abc'def'ghi'''
r"abc'def'ghi"
The repr function on a string always just generates some literal that is a valid representation of that string, but you shouldn't depend on exactly which one it generates. (In fact, you should rarely use it for anything but debugging purposes in the first place.)
Since the language doesn't define anywhere what algorithm it uses to generate a repr, it could be different for each version of each implementation.
Most of them will try to be clever, using single or double quotes to avoid as many escaped internal quotes as possible, but even that isn't guaranteed. If you really want to know the algorithm for a particular implementation and version, you pretty much have to look at the source. For example, in CPython 3.3, inside unicode_repr, it counts the number of quotes of each type; then if there are single quotes but no double quotes, it uses " instead of '.
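A quick way to observe that rule is to compare the delimiter chosen for three small strings. This is a sketch of what current CPython 3 does, not something the language guarantees, and the sample strings are my own:

```python
only_single = "isn't"          # single quotes only
only_double = 'say "hi"'       # double quotes only
both = 'isn\'t "hi"'           # both kinds

for sample in (only_single, only_double, both):
    print(repr(sample))

# Record which delimiter CPython picked in each case.
delimiters = [repr(sample)[0] for sample in (only_single, only_double, both)]
print(delimiters)
```

Only the first string gets double-quote delimiters; the other two fall back to single quotes, and the mixed one has its inner single quote escaped.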
If you want "the" representation of a string, you're out of luck, because there is no such thing. But if you want some particular representation of a string, that's no problem. You just have to know what format you want; for most formats, someone's already written the code, and often it's in the standard library. You can make C literal strings, JSON-encoded strings, strings that can fit into ASCII RFC822 headers… But all of those formats have different rules from each other (and from Python literals), so you have to use the right function for the job.
Conclusion: It's impossible to override or disable Python's built-in escape sequence processing so that you can skip the raw prefix specifier. I dug into Python's internals to figure this out. So if anyone designs objects that work on complex strings (like regexes) as part of some kind of framework, make sure to specify in the docstrings that string arguments to the object's __init__() MUST include the r prefix!
Original question: I am finding it a bit difficult to force Python to not "change" anything about a user-inputted string, which may contain among other things, regex or escaped hexadecimal sequences. I've already tried various combinations of raw strings, .encode('string-escape') (and its decode counterpart), but I can't find the right approach.
Given an escaped, hexadecimal representation of the Documentation IPv6 address 2001:0db8:85a3:0000:0000:8a2e:0370:7334, using .encode(), this small script (called x.py):
#!/usr/bin/env python
class foo(object):
    __slots__ = ("_bar",)
    def __init__(self, input):
        if input is not None:
            self._bar = input.encode('string-escape')
        else:
            self._bar = "qux?"
    def _get_bar(self): return self._bar
    bar = property(_get_bar)
#
x = foo("\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34")
print x.bar
Will yield the following output when executed:
$ ./x.py
\x01\r\xb8\x85\xa3\x00\x00\x00\x00\x8a.\x03ps4
Note the \x20 got converted to an ASCII space character, along with a few others. This is basically correct due to Python processing the escaped hex sequences and converting them to their printable ASCII values.
This can be solved if the initializer to foo() was treated as a raw string (and the .encode() call removed), like this:
x = foo(r"\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34")
However, my end goal is to create a kind of framework that can be used and I want to hide these kinds of "implementation details" from the end user. If they called foo() with the above IPv6 address in escaped hexadecimal form (without the raw specifier) and immediately print it back out, they should get back exactly what they put in w/o knowing or using the raw specifier. So I need to find a way to have foo's __init__() do whatever processing is necessary to enable that.
Edit: Per this SO question, it seems it's a defect of Python, in that it always performs some kind of escape sequence processing. There does not appear to be any kind of facility to completely turn off escape sequence processing, even temporarily. Sucks. I guess I am going to have to research subclassing str to create something like rawstr that intelligently determines what escape sequences Python processed in a string, and convert them back to their original format. This is not going to be fun...
Edit2: Another example, given the sample regex below:
"^.{0}\xcb\x00\x71[\x00-\xff]"
If I assign this to a var or pass it to a function without using the raw specifier, the \x71 gets converted to the letter q. Even if I add .encode('string-escape') or .replace('\\', '\\\\'), the escape sequences are still processed, resulting in this output:
"^.{0}\xcb\x00q[\x00-\xff]"
How can I stop this, again, without using the raw specifier? Is there some way to "turn off" the escape sequence processing or "revert" it after the fact so that the q turns back into \x71? Is there a way to process the string and escape the backslashes before the escape sequence processing happens?
I think you have an understandable confusion about the difference between Python string literals (source code representation), Python string objects in memory, and how those objects can be printed (in what format they can be represented in the output).
If you read some bytes from a file into a bytestring, you can write them back as is.
r"" exists only in source code; there is no such thing at runtime. That is, r"\x" and "\\x" are equal, and they may even be the exact same string object in memory.
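A short Python 3 sketch of that point, using the \x71 escape from the question: the raw prefix only changes how the source text is read, and once the string object exists nothing remembers which spelling produced it.

```python
raw = r"\x71"        # four characters: backslash, x, 7, 1
escaped = "\\x71"    # the same four characters, spelled without the r prefix
processed = "\x71"   # one character: the escape is resolved by the compiler

print(raw == escaped)    # same string object contents
print(len(raw), len(processed))
print(processed)
```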
To see that input is not corrupted, you could print each byte as an integer:
print " ".join(str(ord(c)) for c in raw_input("input something"))
Or just echo as is (there could be a difference but it is unrelated to your "string-escape" issue):
print raw_input("input something")
Identity function:
def identity(obj):
    return obj
If you do nothing to the string then your users will receive the exact same object back. In the docs you can provide examples of what you consider a concise, readable way to represent the input string as Python literals. If you find it confusing to work with binary strings such as "\x20\x01", you could accept an ASCII hex representation instead: "2001" (you could use binascii.hexlify/unhexlify to convert one to the other).
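A minimal Python 3 sketch of the hex round trip mentioned above, using the address bytes from the question (the variable names are mine):

```python
import binascii

# The 16 address bytes from the question, as a bytes object.
packed = b"\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34"

# Bytes -> plain ASCII hex text that contains no escape sequences at all.
hex_text = binascii.hexlify(packed).decode("ascii")
print(hex_text)

# Hex text -> the original bytes.
restored = binascii.unhexlify(hex_text)
print(restored == packed)
```

Users can then type the hex text without worrying about escape processing, and the framework converts it internally.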
The regex case is more complex because there are two languages:
Escape sequences are interpreted by Python according to its string literal syntax
The regex engine then interprets the resulting string object as a regex pattern, which has its own escape sequences
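The two layers can be seen with the \x71 escape from the question. In this Python 3 sketch the two patterns are different strings, yet both end up matching the letter q, because the escape is resolved by one layer or the other:

```python
import re

# Layer 1: Python's literal syntax resolves \x71, so re only ever sees "q".
resolved_by_python = "\x71"

# Layer 2: the raw literal keeps the backslash, and the regex engine
# resolves \x71 itself when it compiles the pattern.
resolved_by_regex = r"\x71"

print(resolved_by_python == resolved_by_regex)
print(bool(re.fullmatch(resolved_by_python, "q")))
print(bool(re.fullmatch(resolved_by_regex, "q")))
```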
I think you will have to go the join route.
Here's an example:
>>> m = {chr(c): '\\x{0}'.format(hex(c)[2:].zfill(2)) for c in xrange(0,256)}
>>>
>>> x = "\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34"
>>> print ''.join(map(m.get, x))
\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34
I'm not entirely sure why you need that though. If your code needs to interact with other pieces of code, I'd suggest that you agree on a defined format, and stick to it.
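The session above is Python 2. For reference, a rough Python 3 equivalent might look like the following, assuming the data arrives as a bytes object; the round trip at the end is my own addition to check that nothing is lost:

```python
data = b"\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34"

# Iterating over bytes yields integers, so each one can be formatted directly.
escaped = "".join("\\x{:02x}".format(byte) for byte in data)
print(escaped)

# Going back: resolve the \xhh escapes, then map code points 0-255 to bytes.
restored = escaped.encode("ascii").decode("unicode_escape").encode("latin-1")
print(restored == data)
```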
I'm getting some content from the Twitter API and I have a small problem: I sometimes get a tweet ending with a single backslash.
More precisely, I'm using simplejson to parse the Twitter stream.
How can I escape this backslash?
From what I have read, such a raw string shouldn't exist...
Even if I add one backslash (two, in fact) I still get an error, as I suspected (since I have an odd number of backslashes).
Any ideas?
I can just forget about these tweets too, but I'm still curious about that.
Thanks : )
Prefixing a string literal with r (which stands for "raw") makes Python treat the backslashes inside it literally instead of as the start of an escape sequence. For example:
print(r'\b\n\\')
will output
\b\n\\
Have I understood the question correctly?
I guess you are looking for a method similar to stripslashes in PHP. So, here you go:
Python version of PHP's stripslashes
You can try using raw strings by prepending an r (so nothing has to be escaped) to the string or re.escape().
I'm not really sure what you need considering I haven't seen the text of the response. If none of the methods you come up with on your own or get from here work, you may have to forget about those tweets.
Unless you update your question and come back with a real problem, I'm asserting that you don't have an issue except confusion.
You get the string from the Twitter API, ergo the string does not show up in your code. “Raw strings” exist only in your code, and it is only “raw string” literals in code that can't end in a backslash.
Consider this:
def some_obscure_api():
    "This exists in a library, so you don't know what it does"
    return r"hello" + "\\"  # addition just for fun
my_string = some_obscure_api()
print(my_string)
See? my_string happily ends in a backslash and your code couldn't care less.
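The same holds for text that arrives as JSON. Here is a sketch using the standard library json module in place of simplejson (the two expose the same loads/dumps functions) and a made-up payload, since the real response isn't shown in the question:

```python
import json

# Hypothetical payload: inside JSON text, one backslash is written as \\.
payload = r'{"text": "this tweet ends with a backslash \\"}'

tweet = json.loads(payload)
text = tweet["text"]
print(text)

# The decoded string really does end in a single backslash.
print(text.endswith("\\"))

# Encoding it again escapes the backslash for you.
round_tripped = json.dumps(text)
print(round_tripped)
```

If parsing fails on such a tweet, the likelier culprit is a payload that was truncated or altered before it reached the parser, not the backslash itself.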
I'm using minidom to parse an XML file and it threw an error indicating that the data is not well formed. I figured out that some of the pages have characters like ไà¸à¹€à¸Ÿà¸¥ &, causing the parser to hiccup. Is there an easy way to clean the file before I start parsing it? Right now I'm using a regular expression to throw away anything that isn't an alphanumeric character or one of the </> characters, but it isn't quite working.
Try
xmltext = re.sub(u"[^\x20-\x7f]+",u"",xmltext)
It will get rid of everything except 0x20-0x7F range.
You may start from \x01 if you want to keep control characters like tabs and line breaks.
xmltext = re.sub(u"[^\x01-\x7f]+",u"",xmltext)
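In Python 3 the same idea might be written as below. This is a sketch with a made-up sample string; it keeps printable ASCII plus tab, newline and carriage return, which is a slightly narrower range than the two patterns above:

```python
import re

# Hypothetical input containing a Latin accent and two Thai characters.
xmltext = "<p>caf\u00e9 \u0e44\u0e2d &amp; ok</p>"

# Drop everything except printable ASCII and the usual whitespace controls.
cleaned = re.sub(r"[^\x20-\x7e\t\n\r]+", "", xmltext)
print(cleaned)

# For this input, encoding with errors="ignore" gives the same result.
also_cleaned = xmltext.encode("ascii", "ignore").decode("ascii")
print(also_cleaned == cleaned)
```

Note that either approach silently discards data, so it only makes sense when the non-ASCII text is not needed.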
Take a look at µTidyLib, a Python wrapper to TidyLib.
If you do need the data with the strange characters you could, instead of just stripping them, convert them to codes the XML parser can understand.
You could have a look at the unicodedata package, especially the normalize method.
I haven't used it myself, so I can't tell you all that much, but you could ask again here on SO if you decide you're going to convert and keep that data.
>>> import unicodedata
>>> unicodedata.normalize("NFKD" , u"ไภเฟล &")
u'a\u03001\u201ea\u0300 \u0327 a\u03001\u20aca\u0300 \u0327Y\u0308a\u0300 \u0327\xa5 &'
It looks like you're dealing with data that was saved with some kind of encoding but is being treated "as if" it were ASCII. An XML file should normally be UTF-8, and expat (the underlying parser used by minidom) should handle that, so it looks like something is wrong in that part of the processing chain. Instead of focusing on "cleaning up", I'd first try to make sure the encoding is correct and correctly recognized. Maybe a broken XML declaration? Can you edit your question to show the first few lines of the file, especially the <?xml ... declaration at the very start?
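To illustrate what a wrongly recognized encoding does, here is a Python 3 sketch. The Thai sample word and the choice of Latin-1 as the wrong codec are assumptions for illustration only; the actual file may involve a different pair of encodings.

```python
from xml.dom import minidom

# Hypothetical sample text (five Thai characters), written with escapes.
sample = "\u0e44\u0e2d\u0e40\u0e1f\u0e25"

# UTF-8 bytes read with the wrong codec turn into accented-Latin gibberish.
utf8_bytes = sample.encode("utf-8")
mojibake = utf8_bytes.decode("latin-1")
print(len(sample), len(mojibake))

# Reversing the wrong step recovers the original text.
repaired = mojibake.encode("latin-1").decode("utf-8")
print(repaired == sample)

# With a correct declaration, minidom parses the UTF-8 bytes directly.
xml_bytes = ('<?xml version="1.0" encoding="utf-8"?><title>'
             + sample + "</title>").encode("utf-8")
doc = minidom.parseString(xml_bytes)
parsed = doc.documentElement.firstChild.data
print(parsed == sample)
```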
I'd throw out all non-ASCII characters which can be identified by having the 8th bit (0x80) set (128 .. 255 respectively 0x80 .. 0xff).
You could read the file into a Python string named old_str.
Then keep only the characters whose code point is below 128:
new_str = "".join(c for c in old_str if ord(c) < 128)
Finally, parse new_str.
Many ways exist to accomplish stripping non-ASCII characters from a string.
This question might be related: How to check if a string in Python is in ASCII?