I have a function that parses HTML code so it is easy to read and write with. To do this I must split the string on multiple delimiters, and as you can see I have used re.split() and cannot find a better solution. However, when I submit some HTML such as this, it has absolutely no effect. This has led me to believe that my regular expression is incorrectly written. What should be there instead?
import re

def parsed(data):
    """Removes junk from the data so it can be easily processed."""
    data = str(data)
    # This checks for cruft and removes it if it exists.
    if re.search("b'", data):
        data = data[2:-1]
    lines = re.split(r'\r|\n', data)  # This clarifies the lines for writing.
    return lines
This isn't a duplicate: if you find a similar question, know that I've been crawling around for ages and none of its answers work for me.
You are converting a bytes value to string:
data = str(data)
# This checks for a cruft and removes it if it exists.
if re.search("b'", data):
    data = data[2:-1]
which means that all line delimiters have been converted to their Python escape codes:
>>> str(b'\n')
"b'\\n'"
That is a literal b, literal quote, literal \ backslash, literal n, literal quote. You would have to split on r'(\\n|\\r)' instead, but most of all, you shouldn't turn bytes values to string representations here. Python produced the representation of the bytes value as a literal string you can paste back into your Python interpreter, which is not the same thing as the value contained in the object.
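To make the failure concrete, here is a small sketch (using a made-up two-line byte string) showing that the original pattern finds nothing to split in the str() representation, while the doubled-backslash pattern does:

```python
import re

# str() of a bytes value produces its literal representation, escapes and all.
s = str(b'line1\nline2')          # the 15-character string  b'line1\nline2'

print(re.split(r'\r|\n', s))      # ["b'line1\\nline2'"] - no real newline to split on
print(re.split(r'\\r|\\n', s))    # ["b'line1", "line2'"] - splits on literal backslash-n
```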
You want to decode to string instead:
if isinstance(data, bytes):
    data = data.decode('utf8')
where I am assuming that the data is encoded as UTF-8. If this is data from a web request, the response headers often include the character set used to encode the body in the Content-Type header; look for the charset= parameter.
A response produced by the urllib.request module has an .info() method, and the character set can be extracted (if provided) with:
charset = response.info().get_param('charset')
where the return value is None if no character set was provided.
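This can be demonstrated without a network call: response.info() returns an email.message.Message subclass, so get_param() behaves as below (the header values here are made-up examples standing in for a real response):

```python
from email.message import Message

# Hypothetical headers standing in for response.info().
headers = Message()
headers['Content-Type'] = 'text/html; charset=utf-8'

# get_param() looks in the Content-Type header by default.
print(headers.get_param('charset'))   # utf-8

# When no charset parameter is present, get_param() returns None.
bare = Message()
bare['Content-Type'] = 'text/html'
print(bare.get_param('charset'))      # None
```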
You don't need to use a regular expression to split lines, the str type has a dedicated method, str.splitlines():
Return a list of the lines in the string, breaking at line boundaries. This method uses the universal newlines approach to splitting lines. Line breaks are not included in the resulting list unless keepends is given and true.
For example, 'ab c\n\nde fg\rkl\r\n'.splitlines() returns ['ab c', '', 'de fg', 'kl'], while the same call with splitlines(True) returns ['ab c\n', '\n', 'de fg\r', 'kl\r\n'].
Related
How to remove those "\x00\x00" in a string ?
I have many of those strings (example shown below). I can use re.sub to replace those "\x00". But I am wondering whether there is a better way to do that? Converting between unicode, bytes and string is always confusing.
'Hello\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'.
Use rstrip
>>> text = 'Hello\x00\x00\x00\x00'
>>> text.rstrip('\x00')
'Hello'
It removes all \x00 characters at the end of the string.
>>> a = 'Hello\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
>>> a.replace('\x00','')
'Hello'
I think the more general solution is to use:
cleanstring = nullterminatedstring.split('\x00',1)[0]
which will split the string using \x00 as the delimiter at most 1 time. split(...) returns a 2-element list: everything before the null and everything after it (the delimiter itself is removed). Indexing with [0] keeps only the portion of the string before the first null (\x00) character, which I believe is what you're looking for.
The convention in some languages, specifically C-like, is that a single null character marks the end of the string. For example, you should also expect to see strings that look like:
'Hello\x00dpiecesofsomeoldstring\x00\x00\x00'
The answer supplied here will handle that situation as well as the other examples.
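A quick sketch of that C-string case, using nothing beyond str.split(), shows why the split approach beats rstrip() when stale buffer contents follow the terminator:

```python
s = 'Hello\x00dpiecesofsomeoldstring\x00\x00\x00'

# Everything after the first NUL is stale buffer content; keep only the prefix.
print(s.split('\x00', 1)[0])       # Hello

# rstrip('\x00') only trims trailing NULs, so the stale junk survives.
print(repr(s.rstrip('\x00')))      # 'Hello\x00dpiecesofsomeoldstring'
```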
Building on the answers supplied, I suggest that strip() is more generic than rstrip() for cleaning up a data packet, as strip() removes chars from the beginning and the end of the supplied string, whereas rstrip() simply removes chars from the end of the string.
However, NUL chars are not treated as whitespace by strip() by default, so you need to specify them explicitly. This can catch you out, as print() will of course not show the NUL chars. The solution I used was to clean the string with .strip().strip('\x00'):
>>> arbBytesFromSocket = b'\x00\x00\x00\x00hello\x00\x00\x00\x00'
>>> arbBytesAsString = arbBytesFromSocket.decode('ascii')
>>> print(arbBytesAsString)
hello
>>> str(arbBytesAsString)
'\x00\x00\x00\x00hello\x00\x00\x00\x00'
>>> arbBytesAsString = arbBytesFromSocket.decode('ascii').strip().strip('\x00')
>>> str(arbBytesAsString)
'hello'
>>>
This gives you the string/byte array required without the NUL chars on each end, while preserving any NUL chars inside the "data packet", which is useful for received byte data that may contain valid NUL chars (e.g. a C-type structure). NB: in this case the packet must be "wrapped", i.e. surrounded by non-NUL chars (prefix and suffix), for correct detection, so that only the unwanted NUL chars are stripped.
Neil wrote, '...you might want to put some thought into why you have them in the first place.'
For my own issue with this error code, this comment led me to the problem: the saved file I was reading from was in Unicode. Once I re-saved the file as plain ASCII text, the problem was solved.
I tried strip and rstrip and they didn't work, but this did:
Use split and then join the result list:
if '\x00' in name:
    name = ' '.join(name.split('\x00'))
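Note that this substitutes one space per NUL rather than deleting the NULs outright; a quick comparison (with a made-up name value):

```python
name = 'Hello\x00\x00World'

# split()/join() turns every NUL into a space...
print(repr(' '.join(name.split('\x00'))))   # 'Hello  World'

# ...whereas replace() removes them entirely.
print(repr(name.replace('\x00', '')))       # 'HelloWorld'
```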
I ran into this problem copying lists out of Excel. The process was:
Copy a list of ID numbers sent to me in Excel
Run a set of Python code that:
Read the clipboard as text
txt.split('\n') to give a list
Processed each element in the list
(updating the production system as required)
The problem was that reading the clipboard intermittently returned multiple '\x00' characters at the end of the text.
I changed from win32clipboard to pyperclip for reading the clipboard, and that seems to have resolved the problem.
Decoding normal URL escaped characters is a fairly easy task with python.
If you want to decode something like: Wikivoyage:%E5%88%A0%E9%99%A4%E8%A1%A8%E5%86%B3
All you need to use is:
import urllib
urllib.parse.unquote('Wikivoyage:%E5%88%A0%E9%99%A4%E8%A1%A8%E5%86%B3')
And you get: 'Wikivoyage:删除表决'
However, I have identified some characters which this does not work with, namely 4-digit % encoded strings:
For example: %25D8
This apparently decodes to ◘
But if you use the urllib function I demonstrated previously, you get: %D8
I understand why this happens: unquote reads the %25 as a '%', which is what it normally translates to. Is there any way to get Python to read this properly, especially in a string of similar characters?
The actual problem
In a comment you posted the real examples:
The data I am pulling from is just a list of url-encoded strings. One of the example strings I am trying to decode is represented as: %25D8%25A5%25D8%25B2%25D8%25A7%25D9%2584%25D8%25A9_%25D8%25A7%25D9%2584%25D8%25B4%25D8%25B9%25D8%25B1_%25D8%25A8%25D8%25A7%25D9%2584%25D9%2584%25D9%258A%25D8%25B2%25D8%25B1 This is the raw form of it. Other strings are normal url escapes such as: %D8%A5%D9%88%D8%B2
The first one is double-quoted, as wim pointed out. So they unquote as: إزالة_الشعر_بالليزر and إوز (which are Arabic for "laser hair removal" and "geese").
So you were mistaken about the unquoting and ◘ is a red herring.
Solution
Ideally you would fix whatever gave you this inconsistent data, but if nothing else, you could try detecting double-quoted strings, for example, by checking if the number of % equals the number of %25.
import urllib.parse

def unquote_possibly_double_quoted(s: str) -> str:
    if s.count('%') == s.count('%25'):
        # Every '%' is part of a '%25', so the string is double-quoted.
        s = urllib.parse.unquote(s)
    return urllib.parse.unquote(s)
>>> s = '%25D8%25A5%25D8%25B2%25D8%25A7%25D9%2584%25D8%25A9_%25D8%25A7%25D9%2584%25D8%25B4%25D8%25B9%25D8%25B1_%25D8%25A8%25D8%25A7%25D9%2584%25D9%2584%25D9%258A%25D8%25B2%25D8%25B1'
>>> unquote_possibly_double_quoted(s)
'إزالة_الشعر_بالليزر'
>>> unquote_possibly_double_quoted('%D8%A5%D9%88%D8%B2')
'إوز'
You might want to add some checks to this, like for example, s.count('%') > 0 (or '%' in s).
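For instance, folding that non-empty check in (a sketch with a hypothetical function name; the count-based heuristic is unchanged):

```python
import urllib.parse

def unquote_maybe_double(s: str) -> str:
    # Only apply the extra unquote when there is at least one escape and
    # every '%' belongs to a '%25', i.e. the string looks double-quoted.
    if '%' in s and s.count('%') == s.count('%25'):
        s = urllib.parse.unquote(s)
    return urllib.parse.unquote(s)

print(unquote_maybe_double('%D8%A5%D9%88%D8%B2'))   # single-quoted input, one unquote
print(unquote_maybe_double('plain_text'))           # no escapes, returned unchanged
```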
I am using an API to convert LaTeX to PNG format. The strings I am converting are written in LaTeX (.tex), and so they contain sequences such as '\n'.
One example of a string that I have is
query = "$\displaystyle \binom n r = \dfrac{n!}{r!(n-r)!}$"
However, Python recognizes the \b in \binom and thus the string is recognized as having a line break, even though all I want it to do is just recognize the individual characters.
If at all possible, I would like to not have to modify the string itself, as the string too is taken from an API. So is there any way to ignore these string literals such as '\b' or '\n'?
Use r"$\displaystyle \binom n r = \dfrac{n!}{r!(n-r)!}$". This is called a raw string; you can read more about raw strings in the Python documentation.
Generally, you can use raw strings in the following format:
Normal string:
'Hi\nHow are you?'
output:
Hi
How are you?
Raw string:
r'Hi\nHow are you?'
output:
Hi\nHow are you?
I've updated my answer for clarity.
If the string comes directly from an API then it should already be in a raw format (or the rawest that will be accessible to you), such as r"$\displaystyle \binom n r = \dfrac{n!}{r!(n-r)!}$". Therefore, Python won't be assuming escaped characters and there shouldn't be an issue.
To answer your other question about raw strings - to print a string as a raw string in Python try the repr function, which returns a printable representational string of the given object.
query = "$\displaystyle \binom n r = \dfrac{n!}{r!(n-r)!}$"
print(repr(query))
Here is the output:
'$\\displaystyle \x08inom n r = \\dfrac{n!}{r!(n-r)!}$'
Note that in the true raw data of query above, the \b character is still technically stored as the \b escape (i.e. \x08), not as two separate characters. Why isn't \d stored as an escape, you may ask? Because \d is not a valid escape sequence, so it is overlooked and Python treats the \ as a literal character. (Ahh... silently ignoring parsing errors, isn't this why we love Python?)
Then what about this example?
query = r"$\displaystyle \binom n r = \dfrac{n!}{r!(n-r)!}$"
print(repr(query))
Looks good, but wait, Python prints '$\\displaystyle \\binom n r = \\dfrac{n!}{r!(n-r)!}$'.
Why the \\? Well, the repr function returns a printable representational string of the given object, so to avoid any confusion the \ character is properly escaped with \, creating \\.
All of that comes full circle back to your question: if the value of a string comes directly from an API call, then the string data should already be translated from a binary encoding, and things like escape sequences shouldn't be an issue (because they aren't in the raw data). But in the example you provided you declared a string in the query = "st\ring" format, which unfortunately isn't equivalent to retrieving a string from an API, and the obvious solution is to use the query = r"st\ring" format.
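One way to see the difference directly is to probe for the backspace character, using the two declarations from above:

```python
# Declared without r'': \b collapses to a single backspace character (\x08).
query = "$\displaystyle \binom n r = \dfrac{n!}{r!(n-r)!}$"
print('\x08' in query)        # True - the 'b' of \binom was consumed by the escape
print('\\binom' in query)     # False

# Declared as a raw string: every backslash survives intact.
raw = r"$\displaystyle \binom n r = \dfrac{n!}{r!(n-r)!}$"
print('\x08' in raw)          # False
print('\\binom' in raw)       # True
```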
I'm trying to process some Bibtex entries converted to an XML tree via Pybtex. I'd like to go ahead and process all the special characters from their LaTeX specials to unicode characters, via latexcodec. Via question Does pybtex support accent/special characters in .bib file? and the documentation I have checked the syntax, however, I am not getting the correct output.
>>> import latexcodec
>>> name = 'Br\"{u}derle'
>>> name.decode('latex')
u'Br"{u}derle'
I have tested this across different strings and special characters, and it always just strips off the first backslash without translating the character. Should I be using latexcodec differently to get the correct output?
Your backslash is not included in the string at all because it is treated as a string escape, so the codec never sees it:
>>> print 'Br\"{u}derle'
Br"{u}derle
Use a raw string:
name = r'Br\"{u}derle'
Alternatively, try reading actual data from a file, in which case the raw/non-raw distinction will not matter. (The distinction only applies to literal strings in Python source code.)
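The distinction is easy to verify by inspecting the string contents directly (a sketch; no latexcodec needed, only the two literal forms from above):

```python
plain = 'Br\"{u}derle'   # \" is just an escaped quote: no backslash survives
raw = r'Br\"{u}derle'    # raw string: the backslash is a real character

print('\\' in plain)           # False - a codec would never see a backslash here
print('\\' in raw)             # True
print(len(raw) - len(plain))   # 1 - exactly the extra backslash
```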