Write metacharacters to a list - python

I am trying to put the regex metacharacters into a list:
In [1]: mc = ['^', '$', '[', ']', '{', '}', '-', '?', '*', '+', '(', ')', '|', '\']
When I enter this, I get an error:
SyntaxError: EOL while scanning string literal
How can I resolve the problem?

The problem is the backslash, which is an escape character. The correct representation of a single backslash would be '\\' or "\\".
While all the answers above seem to work, for readability it might be better to write
mc = list("^$[]{}-?*+()|\\")
This makes it much easier to see which characters are being used, reducing visual clutter at very little cost.
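A quick check, as a minimal sketch, that the compact form really produces the same list as the explicit one. The names explicit and compact are only illustrative:

```python
# The explicit list, with the backslash written as an escaped '\\'.
explicit = ['^', '$', '[', ']', '{', '}', '-', '?', '*', '+', '(', ')', '|', '\\']

# The same characters written once as a string and split by list().
compact = list("^$[]{}-?*+()|\\")

print(compact == explicit)  # True
print(len('\\'))            # 1: the literal '\\' is a single backslash character
```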

It should be:
mc = ['^', '$', '[', ']', '{', '}', '-', '?', '*', '+', '(', ')', '|', '\\']
You need to escape the final backslash \ with another one, as in the list above \\.

You need to escape the final backslash:
mc = ['^', '$', '[', ']', '{', '}', '-', '?', '*', '+', '(', ')', '|', '\\']
In your example, the backslash escapes the closing quote of the last string, so the literal is never terminated and the code is not valid Python.

A backslash directly before a quote (\') is an escape sequence, so the backslash itself has to be doubled:
In [1]: mc = ['^', '$', '[', ']', '{', '}', '-', '?', '*', '+', '(', ')', '|', '\\']

Related

How to split duplicated separator in Python

I have a string with the format
exp = '(( 200 + (4 * 3.14)) / ( 2 ** 3 ))'
I would like to separate the string into tokens using re.split() and include the separators as well. However, I am not able to keep ** together; it ends up being split on * instead.
This is my code: tokens = re.split(r'([+|-|**?|/|(|)])',exp)
My Output (wrong):
['(', '(', '200', '+', '(', '4', '*', '3.14', ')', ')', '/', '(', '2', '*', '*', '3', ')', ')']
Is there a way to treat * and ** as different separators? Thank you so much!
Desired Output:
['(', '(', '200', '+', '(', '4', '*', '3.14', ')', ')', '/', '(', '2', '**', '3', ')', ')']
Using the [...] notation only allows you to specify individual characters. To get alternate patterns of varying length you need to use the | operator outside of these brackets. This also means that you need to escape the regular expression operators and that you need to place the longer patterns before the shorter ones (i.e. ** before *):
tokens = re.split(r'(\*\*|\*|\+|\-|/|\(|\))',exp)
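or even shorter (the - goes first in the character class so that it is a literal hyphen; written as +-/ it would be a range that also matches , and . and would split 3.14):
tokens = re.split(r'(\*\*|[-*+/()])',exp)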
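For reference, a minimal runnable sketch of this approach. Note that re.split also returns the whitespace and empty strings between separators, so this sketch strips and filters the pieces afterwards; that filtering step is my addition, not part of the answer above:

```python
import re

exp = '(( 200 + (4 * 3.14)) / ( 2 ** 3 ))'

# '**' comes before the character class so it wins over a single '*'.
# The '-' is first in the class so it is a literal hyphen, not a range.
pattern = r'(\*\*|[-*+/()])'

# The capturing group keeps the separators in the result; strip each piece
# and drop the ones that are empty or whitespace only.
tokens = [t.strip() for t in re.split(pattern, exp) if t.strip()]
print(tokens)
```

This prints the desired list from the question, with '**' as a single token and '3.14' intact.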

Python Telegram Bot Markdown symbol '[' or ']'

How can I get the symbols '[' and ']' in send_message if I use parse_mode = 'Markdown'? Currently the characters are replaced with a space.
In markdown, special characters can be escaped with a backslash
\[\]
For parse_mode = 'MarkdownV2', the list of characters you must escape is ('_', '*', '[', ']', '(', ')', '~', '`', '>', '#', '+', '-', '=', '|', '{', '}', '.', '!')
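If you prefer not to add a dependency, escaping by hand is short. This is only a sketch: escape_special and SPECIAL_CHARS are hypothetical names, the character set is simply the list above, and it does not handle backslashes already present in the text or the different rules that apply inside code and link entities.

```python
# Characters taken from the list above.
SPECIAL_CHARS = '_*[]()~`>#+-=|{}.!'

def escape_special(text):
    """Prefix every character found in SPECIAL_CHARS with a backslash."""
    return ''.join('\\' + ch if ch in SPECIAL_CHARS else ch for ch in text)

print(escape_special('[text to escape]'))  # \[text to escape\]
```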
For Telegram-specific escaping, you can use PlainText from telegram-text:
from telegram_text import PlainText
element = PlainText("[text to escape]")
escaped_text = element.to_markdown()
escaped_text
'\\[text to escape\\]'

Multiple symbols replace not working

I need to check a string for some symbols and replace them with a whitespace. My code:
string = 'so\bad'
symbols = ['•', '!', '"', '#', '$', '%', '&', '\'', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '>', '=', '?', '#', '[', ']', '\\', '^', '_', '`', '{', '}', '~', '|', '"', '⌐', '¬', '«', '»', '£', '$', '°', '§', '–', '—']
for symbol in symbols:
    string = string.replace(symbol, ' ')
print(string)
>> sad
Why does it replace the \b with nothing?
This is because \b is the ASCII backspace character:
>>> string = 'so\bad'
>>> print(string)
sad
You can find it and all the other escape sequences in the Python Reference Manual.
To get the behavior you expect, escape the backslash character or use a raw string:
# Both give 'so bad' once the backslash is replaced
string = 'so\\bad'
string = r'so\bad'
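To see the difference without relying on how a terminal renders the backspace, compare the length and repr() of each literal. A small sketch; the variable names are only for illustration:

```python
plain = 'so\bad'     # \b is read as one backspace character
escaped = 'so\\bad'  # escaped backslash
raw = r'so\bad'      # raw string literal

print(len(plain), repr(plain))      # 5 'so\x08ad'
print(len(escaped), repr(escaped))  # 6 'so\\bad'
print(escaped == raw)               # True
print(escaped.replace('\\', ' '))   # so bad
```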
The issue you are facing is the use of \ as an escape character.
\b is a special character (backspace).
Use a string literal with the prefix r.
With the r prefix, backslashes \ are treated as literal characters:
string = r'so\bad'
You are not replacing anything. "\b" is a backspace, which moves your cursor one step to the left.
Note that even if you omit the symbols list and your for symbol in symbols: code, you will always get the result "sad" when you print string. This is because the two characters \b are interpreted together as a single ASCII control character, not as a backslash followed by b.
Check out this stackoverflow answer for a solution on how to work around this issue: How can I print out the string "\b" in Python

re.split on multiple characters (and maintaining the characters) produces a list containing also empty strings

I need to split a mathematical expression based on the delimiters. The delimiters are (, ), +, -, *, /, ^ and space. I came up with the following regular expression
"([\\s\\(\\)\\-\\+\\*/\\^])"
which also keeps the delimiters in the resulting list (which is what I want), but it also produces empty string "" elements, which I don't want. I hardly ever use regular expressions (unfortunately), so I am not sure if it is possible to avoid this.
Here's an example of the problem:
>>> import re
>>> e = "((12*x^3+4 * 3)*3)"
>>> re.split("([\\s\\(\\)\\-\\+\\*/\\^])", e)
['', '(', '', '(', '12', '*', 'x', '^', '3', '+', '4',
' ', '', ' ', '', ' ', '', '*', '', ' ', '3', ')', '', '*', '3', ')', '']
Is there a way to not produce those empty strings, maybe by modifying my regular expression? Of course I can remove them using for example filter, but the idea would be not to produce them at all.
Edit
I would also need to not include spaces. If you can help also in that matter, it would be great.
You could add \w+, remove the \s and do a findall:
import re
e = "((12*x^3+44 * 3)*3)"
print(re.findall(r"(\w+|[()\-+*/^])", e))
Output:
['(', '(', '12', '*', 'x', '^', '3', '+', '44', '*', '3', ')', '*', '3', ')']
Depending on what you want you can change the regex:
e = "((12a*x^3+44 * 3)*3)"
print(re.findall(r"(\d+|[a-z()\-+*/^])", e))
print(re.findall(r"(\w+|[()\-+*/^])", e))
The first treats 12a as two tokens, the second as one:
['(', '(', '12', 'a', '*', 'x', '^', '3', '+', '44', '*', '3', ')', '*', '3', ')']
['(', '(', '12a', '*', 'x', '^', '3', '+', '44', '*', '3', ')', '*', '3', ')']
Just strip/filter them out in a comprehension.
result = [item for item in re.split("([\\s\\(\\)\\-\\+\\*/\\^])", e) if item.strip()]
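Both suggestions can be compared side by side on the original expression. A minimal sketch using raw strings so the backslashes need no doubling; found and kept are just illustrative names:

```python
import re

e = "((12*x^3+4 * 3)*3)"

# Approach 1: match the tokens directly; whitespace is never matched.
found = re.findall(r"\w+|[()\-+*/^]", e)

# Approach 2: split on the delimiters, then drop empty or whitespace-only pieces.
kept = [item for item in re.split(r"([\s()\-+*/^])", e) if item.strip()]

print(found)
print(found == kept)  # True
```

For this input the two give the same tokens. They can differ on input containing characters that are neither word characters nor delimiters, since findall silently skips those while split keeps them.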

Repeatedly remove characters from string

>>> split=['((((a','b','+b']
>>> [ (w[1:] if w.startswith((' ','!', '#', '#', '$', '%', '^', '&', '*', "(", ")", '-', '_', '+', '=', '~', ':', "'", ';', ',', '.', '?', '|', '\\', '/', '<', '>', '{', '}', '[', ']', '"')) else w) for w in split]
['(((a','b','b']
I wanted ['a', 'b', 'b'] instead.
I want to create a function that repeats the command, so that all the '(' are cleared from the start of each word. If my split list is longer, I still want to clear every leading '(((' from the words. I don't use replace because it would also change a '(' in the middle of a word.
E.g. if the '(' is in the middle of a word like 'aa(aa', I don't want to change this.
There is no need to repeat your expression; you are just not using the right tool. You are looking for the str.lstrip() method:
[w.lstrip(' !##$%^&*()-_+=~:\';,.?|\\/<>{}[]"') for w in split]
The method treats the string argument as a set of characters and does exactly what you tried to do in your code; repeatedly remove the left-most character if it is part of that set.
There is a corresponding str.rstrip() for removing characters from the end, and str.strip() to remove them from both ends.
Demo:
>>> split=['((((a', 'b', '+b']
>>> [w.lstrip(' !##$%^&*()-_+=~:\';,.?|\\/<>{}[]"') for w in split]
['a', 'b', 'b']
If you really needed to repeat an expression, you could just create a new function for that task:
def strip_left(w):
    while w.startswith((' ','!', '#', '#', '$', '%', '^', '&', '*', "(", ")", '-', '_', '+', '=', '~', ':', "'", ';', ',', '.', '?', '|', '\\', '/', '<', '>', '{', '}', '[', ']', '"')):
        w = w[1:]
    return w
[strip_left(w) for w in split]
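As a sanity check, the loop version and str.lstrip() can be run against the same inputs, including a word with '(' in the middle and a word made only of strippable characters. A sketch; CHARS and words are illustrative names, and the character set is the one used above:

```python
CHARS = ' !#$%^&*()-_+=~:\';,.?|\\/<>{}[]"'

def strip_left(w):
    # str.startswith accepts a tuple of prefixes; on an empty string it
    # returns False, so the loop ends once everything has been removed.
    while w.startswith(tuple(CHARS)):
        w = w[1:]
    return w

words = ['((((a', 'b', '+b', 'aa(aa', '(((']
print([strip_left(w) for w in words])    # ['a', 'b', 'b', 'aa(aa', '']
print([w.lstrip(CHARS) for w in words])  # ['a', 'b', 'b', 'aa(aa', '']
```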
