How to split strings with multiple delimiters while keeping the delimiters in Python

For example, I have a string section 213(d)-456(c)
How can I split it to get a list of strings:
['section', '213', '(', 'd', ')', '-', '456', '(', 'c', ')'].
Thank you!

You can do so using Regex.
import re
text = "section 213(d)-456(c)"
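output = re.split(r"(\W)", text)
Output: ['section', ' ', '213', '(', 'd', ')', '', '-', '456', '(', 'c', ')', '']
Here \W matches any non-word character. Note that the result still contains the space and two empty strings, which you would need to filter out.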

You can come close with
re.split(r'([-\s()])', 'section 213(d)-456(c)')
When the delimiter contains a capture group, the result includes the captured text.
However, this will also include the space delimiters in the result:
['section', ' ', '213', '(', 'd', ')', '', '-', '456', '(', 'c', ')', '']
You can easily remove these afterward.
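One way to do that removal, as a minimal sketch: filter the split result and keep only the pieces that are non-empty after stripping whitespace. The variable names parts and tokens are illustrative only.

```python
import re

text = "section 213(d)-456(c)"

# The capture group keeps the delimiters in the result.
parts = re.split(r"([-\s()])", text)

# Drop empty strings and whitespace-only pieces.
tokens = [p for p in parts if p.strip()]

print(tokens)
# ['section', '213', '(', 'd', ')', '-', '456', '(', 'c', ')']
```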

Related

How to split duplicated separator in Python

I have a string with the format
exp = '(( 200 + (4 * 3.14)) / ( 2 ** 3 ))'
I would like to separate the string into tokens by using re.split() and include the separators as well. However, I am not able to keep ** together as one token; it ends up being split into two * tokens instead.
This is my code: tokens = re.split(r'([+|-|**?|/|(|)])',exp)
My Output (wrong):
['(', '(', '200', '+', '(', '4', '*', '3.14', ')', ')', '/', '(', '2', '*', '*', '3', ')', ')']
Is there a way to make the split distinguish between * and **? Thank you so much!
Desired Output:
['(', '(', '200', '+', '(', '4', '*', '3.14', ')', ')', '/', '(', '2', '**', '3', ')', ')']
Using the [...] notation only allows you to specify individual characters. To get alternate patterns of different lengths, you need to use the | operator outside of these brackets. This also means that you need to escape the regular expression operators, and that you need to place the longer patterns before the shorter ones (i.e. ** before *):
tokens = re.split(r'(\*\*|\*|\+|\-|/|\(|\))',exp)
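or even shorter (the hyphen goes first in the character class, because +-/ would be read as a range that also matches ',' and '.', splitting 3.14):
tokens = re.split(r'(\*\*|[-*+/()])',exp)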
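As a rough sketch of the alternation approach on the expression from the question: the split below puts ** ahead of the single-character operators, then strips whitespace and drops empty pieces, which the patterns above leave in the list. The names raw and tokens are illustrative.

```python
import re

exp = '(( 200 + (4 * 3.14)) / ( 2 ** 3 ))'

# '**' is tried before the single-character class, so it stays one token.
# The hyphen is listed first so it is a literal, not part of a range.
raw = re.split(r'(\*\*|[-*+/()])', exp)

# Pieces between operators still carry spaces; strip them and drop empties.
tokens = [t.strip() for t in raw if t.strip()]

print(tokens)
# ['(', '(', '200', '+', '(', '4', '*', '3.14', ')', ')', '/', '(', '2', '**', '3', ')', ')']
```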

Python, Split the input string on elements of other list and remove digits from it

I have had some trouble with this problem, and I need your help.
I have to write a Python function (mySplit(x)) which takes an input list (containing a single string as its only element) and splits that string on digits and on the elements of another list.
I use Python 3.6
So here is an example:
l=['I am learning']
l1=['____-----This4ex5ample---aint___ea5sy;782']
banned=['-', '+' , ',', '#', '.', '!', '?', ':', '_', ' ', ';']
The returned lists should be like this:
mySplit(l)=['I', 'am', 'learning']
mySplit(l1)=['This', 'ex', 'ample', 'aint', 'ea', 'sy']
I have tried the following, but I always get stuck:
def mySplit(x):
    l=['-', '+' , ',', '#', '.', '!', '?', ':', '_', ';'] #Banned chars
    l2=[i for i in x if i not in l] #Removing chars from input list
    l2=",".join(l2)
    l3=[i for i in l2 if not i.isdigit()] #Removes all the digits
    l4=[i for i in l3 if i is not ',']
    l5=[",".join(l4)]
    l6=l5[0].split(' ')
    return l6
and
mySplit(l1)
mySplit(l)
returns:
['T,h,i,s,e,x,a,m,p,l,e,a,i,n,t,e,a,s,y']
['I,', ',a,m,', ',l,e,a,r,n,i,n,g']
Use re.split() for this task:
import re
w_list = [i for i in re.split(r'[^a-zA-Z]',
'____-----This4ex5ample---aint___ea5sy;782') if i ]
Out[12]: ['This', 'ex', 'ample', 'aint', 'ea', 'sy']
I would import the punctuation marks from string and proceed with regular expressions as follows.
l=['I am learning']
l1=['____-----This4ex5ample---aint___ea5sy;782']
import re
from string import punctuation
punctuation # to see the punctuation marks.
>>> '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
' '.join([re.sub('[!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~\d]',' ', w) for w in l]).split()
Here is the output:
>>> ['I', 'am', 'learning']
Notice the \d attached at the end of the punctuation marks to remove any digits.
Similarly,
' '.join([re.sub('[!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~\d]',' ', w) for w in l1]).split()
Yields
>>> ['This', 'ex', 'ample', 'aint', 'ea', 'sy']
You can also modify your function as follows:
def mySplit(x):
    banned = ['-', '+' , ',', '#', '.', '!', '?', ':', '_', ';'] + list('0123456789') # Banned chars
    # Replace every banned character with a space, then split on whitespace
    return ''.join([ch if ch not in banned else ' ' for ch in x[0]]).split()
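The regex and banned-list ideas above can also be combined. The following is only a sketch, assuming the banned list from the question: it builds a character class from that list with re.escape, adds \d for digits, and splits on runs of those characters. The names BANNED and my_split are hypothetical.

```python
import re

# Banned characters taken from the question; digits are added in the pattern.
BANNED = ['-', '+', ',', '#', '.', '!', '?', ':', '_', ' ', ';']

def my_split(x, banned=BANNED):
    """Split the single string in x on banned characters and digits."""
    # re.escape makes characters such as '-' and '.' literal inside the class.
    pattern = '[' + ''.join(re.escape(c) for c in banned) + r'\d]+'
    # Leading or trailing delimiters produce empty strings; drop them.
    return [part for part in re.split(pattern, x[0]) if part]

print(my_split(['I am learning']))
# ['I', 'am', 'learning']
print(my_split(['____-----This4ex5ample---aint___ea5sy;782']))
# ['This', 'ex', 'ample', 'aint', 'ea', 'sy']
```

Building the pattern from the list keeps the banned characters in one place, so changing the list does not require editing the regular expression by hand.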

Write metacharacters to a list

I am trying to put the regex metacharacters into a list:
In [1]: mc = ['^', '$', '[', ']', '{', '}', '-', '?', '*', '+', '(', ')', '|', '\']
When I enter this, I get an error:
SyntaxError: EOL while scanning string literal
How to resolve the problem?
The problem is the backslash, which is an escape character. The correct representation of a single backslash would be '\\' or "\\".
While all the answers above seem to work, for readability it might be better to write
mc = list("^$[]{}-?*+()|\\")
This makes it much easier to see which characters are being used, reducing visual clutter at very little cost.
It should be:
mc = ['^', '$', '[', ']', '{', '}', '-', '?', '*', '+', '(', ')', '|', '\\']
You need to escape the final backslash \ with another one, as in the list above \\.
You need to escape the final backslash:
mc = ['^', '$', '[', ']', '{', '}', '-', '?', '*', '+', '(', ')', '|', '\\']
In your example, the backslash is escaping the last quote, so it's not valid python.
A backslash directly before a quote (\') forms an escape sequence, so the string is never closed. Escape the backslash itself:
In [1]: mc = ['^', '$', '[', ']', '{', '}', '-', '?', '*', '+', '(', ')', '|', '\\']
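A quick check, as a sketch, that the explicit list and the list("...") form shown above describe the same characters, and that '\\' in source code is a single backslash at runtime. The names explicit and compact are illustrative.

```python
explicit = ['^', '$', '[', ']', '{', '}', '-', '?', '*', '+', '(', ')', '|', '\\']
compact = list("^$[]{}-?*+()|\\")

# Both spellings produce the same 14 one-character strings.
print(explicit == compact)   # True
print(len(explicit))         # 14

# The escaped backslash is one character long once parsed.
print(len(explicit[-1]))     # 1
```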

re.split on multiple characters (and maintaining the characters) produces a list containing also empty strings

I need to split a mathematical expression based on the delimiters. The delimiters are (, ), +, -, *, /, ^ and space. I came up with the following regular expression
"([\\s\\(\\)\\-\\+\\*/\\^])"
which also keeps the delimiters in the resulting list (which is what I want), but it also produces empty string elements (""), which I don't want. I hardly ever use regular expressions (unfortunately), so I am not sure whether it is possible to avoid this.
Here's an example of the problem:
>>> import re
>>> e = "((12*x^3+4   * 3)*3)"
>>> re.split("([\\s\\(\\)\\-\\+\\*/\\^])", e)
['', '(', '', '(', '12', '*', 'x', '^', '3', '+', '4',
' ', '', ' ', '', ' ', '', '*', '', ' ', '3', ')', '', '*', '3', ')', '']
Is there a way to not produce those empty strings, maybe by modifying my regular expression? Of course I can remove them using for example filter, but the idea would be not to produce them at all.
Edit
I would also need to not include spaces. If you can help also in that matter, it would be great.
You could add \w+, remove the \s and do a findall:
import re
e = "((12*x^3+44 * 3)*3)"
print(re.findall(r"(\w+|[()\-+*/^])", e))
Output:
['(', '(', '12', '*', 'x', '^', '3', '+', '44', '*', '3', ')', '*', '3', ')']
Depending on what you want you can change the regex:
e = "((12a*x^3+44 * 3)*3)"
print(re.findall(r"(\d+|[a-z()\-+*/^])", e))
print(re.findall(r"(\w+|[()\-+*/^])", e))
The first pattern treats 12a as two tokens, the second as one:
['(', '(', '12', 'a', '*', 'x', '^', '3', '+', '44', '*', '3', ')', '*', '3', ')']
['(', '(', '12a', '*', 'x', '^', '3', '+', '44', '*', '3', ')', '*', '3', ')']
Just strip/filter them out in a comprehension.
result = [item for item in re.split("([\\s\\(\\)\\-\\+\\*/\\^])", e) if item.strip()]
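The findall approach can be extended in the same spirit. The sketch below goes beyond the original question by assuming decimal numbers such as 4.5 may appear, and tries a decimal pattern before \w+. The names TOKEN_RE and tokenize are hypothetical.

```python
import re

# Alternatives are tried left to right: decimals first, then words or
# integers, then single-character operators. Whitespace matches nothing
# and is therefore skipped.
TOKEN_RE = re.compile(r"\d+\.\d+|\w+|[()\-+*/^]")

def tokenize(expression):
    """Return the numbers, names and operators found in expression."""
    return TOKEN_RE.findall(expression)

print(tokenize("((12*x^3+4.5   * 3)*3)"))
# ['(', '(', '12', '*', 'x', '^', '3', '+', '4.5', '*', '3', ')', '*', '3', ')']
```

Because findall only returns what matches, no empty strings or spaces are produced in the first place, so no filtering step is needed. Characters that match none of the alternatives are silently skipped, which may or may not be what you want.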

Repeatedly remove characters from string

>>> split=['((((a','b','+b']
>>> [ (w[1:] if w.startswith((' ','!', '@', '#', '$', '%', '^', '&', '*', "(", ")", '-', '_', '+', '=', '~', ':', "'", ';', ',', '.', '?', '|', '\\', '/', '<', '>', '{', '}', '[', ']', '"')) else w) for w in split]
['(((a','b','b']
I wanted ['a', 'b', 'b'] instead.
I want to repeat this operation until all the leading '(' characters are gone. If my list is longer, I want to remove every leading ((( from each word. I don't use replace because it would also change a '(' in the middle of a word.
E.g. if the '(' is in the middle of a word like 'aa(aa', I don't want to change this.
There is no need to repeat your expression; you are just not using the right tool. You are looking for the str.lstrip() method:
[w.lstrip(' !@#$%^&*()-_+=~:\';,.?|\\/<>{}[]"') for w in split]
The method treats the string argument as a set of characters and does exactly what you tried to do in your code; repeatedly remove the left-most character if it is part of that set.
There is a corresponding str.rstrip() for removing characters from the end, and str.strip() to remove them from both ends.
Demo:
>>> split=['((((a', 'b', '+b']
>>> [w.lstrip(' !@#$%^&*()-_+=~:\';,.?|\\/<>{}[]"') for w in split]
['a', 'b', 'b']
If you really needed to repeat an expression, you could just create a new function for that task:
def strip_left(w):
    # Keep dropping the first character while it is one of the unwanted ones
    while w.startswith((' ','!', '@', '#', '$', '%', '^', '&', '*', "(", ")", '-', '_', '+', '=', '~', ':', "'", ';', ',', '.', '?', '|', '\\', '/', '<', '>', '{', '}', '[', ']', '"')):
        w = w[1:]
    return w

[strip_left(w) for w in split]
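To illustrate the difference between lstrip() and strip() mentioned above, here is a small sketch. It assumes string.punctuation plus a space is an acceptable stand-in for the hand-typed character set; that is a swap for brevity, not the set used in the answer. The names words, chars, left and both are illustrative.

```python
import string

words = ['((((a', 'aa(aa', '+b))', '...c!']

# Assumption: all ASCII punctuation plus the space character.
chars = string.punctuation + ' '

# lstrip removes matching characters from the left end only.
left = [w.lstrip(chars) for w in words]
# strip removes them from both ends; characters inside a word are untouched.
both = [w.strip(chars) for w in words]

print(left)   # ['a', 'aa(aa', 'b))', 'c!']
print(both)   # ['a', 'aa(aa', 'b', 'c']
```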
