How to properly split this list of strings? - python

I have a list of strings such as this:
['z+2-44', '4+55+z+88']
How can I split the strings in the list so that the result looks like
[['z','+','2','-','44'],['4','+','55','+','z','+','88']]
I have already tried using the split method, but that splits the 44 into 4 and 4, and I am not sure what else to try.

You can use regex:
import re
lst = ['z+2-44', '4+55+z+88']
[re.findall(r'\w+|\W+', s) for s in lst]
# [['z', '+', '2', '-', '44'], ['4', '+', '55', '+', 'z', '+', '88']]
\w+|\W+ matches runs consisting either of word characters (the alphanumeric values in your case) or of non-word characters (the + and - signs in your case).

This will work, using itertools.groupby:
import itertools

z = ['z+2-44', '4+55+z+88']
print([["".join(x) for k, x in itertools.groupby(i, str.isalnum)] for i in z])
output:
[['z', '+', '2', '-', '44'], ['4', '+', '55', '+', 'z', '+', '88']]
It groups the characters according to whether or not they are alphanumeric, then joins each group back together in a list comprehension.
EDIT: the general case of a calculator with parentheses has been asked as a follow-up question here. If z is as follows:
z = ['z+2-44', '4+55+((z+88))']
then with the previous grouping we get:
[['z', '+', '2', '-', '44'], ['4', '+', '55', '+((', 'z', '+', '88', '))']]
which is not easy to parse in terms of tokens. So a change would be to join only when the group is alphanumeric, leave it as a list otherwise, and flatten at the end using chain.from_iterable:
print([list(itertools.chain.from_iterable(["".join(x)] if k else x for k,x in itertools.groupby(i,str.isalnum))) for i in z])
which yields:
[['z', '+', '2', '-', '44'], ['4', '+', '55', '+', '(', '(', 'z', '+', '88', ')', ')']]
(Note that the alternate regex answer can also be adapted like this: [re.findall(r'\w+|\W', s) for s in lst]; note the lack of + after \W.)
Also, "".join(list(x)) is slightly faster than "".join(x), but I'll leave that to you so as not to make an already complex expression even harder to read; a rough way to check it is sketched below.

Alternative solution using the re.split function:
import re

l = ['z+2-44', '4+55+z+88']
print([list(filter(None, re.split(r'(\w+)', i))) for i in l])
The output:
[['z', '+', '2', '-', '44'], ['4', '+', '55', '+', 'z', '+', '88']]

You could use only the str.replace() and str.split() built-in string methods within a list comprehension:
In [34]: lst = ['z+2-44', '4+55+z+88']
In [35]: [s.replace('+', ' + ').replace('-', ' - ').split() for s in lst]
Out[35]: [['z', '+', '2', '-', '44'], ['4', '+', '55', '+', 'z', '+', '88']]
But note that this needs one pass over the string per operator, so it becomes less attractive as the number of distinct operators (or the length of the string) grows; in that case regex is the better way to go. A rough way to compare the two on your own data is sketched below.
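If you want to check that trade-off yourself, a comparison along these lines may help (an illustrative sketch only; the test string, repeat count and timings are arbitrary and machine-dependent):
import re
import timeit

# Compare the chained-replace approach with the regex approach on a longer string.
s = 'z+2-44+4+55+z+88' * 1000

t_replace = timeit.timeit(
    lambda: s.replace('+', ' + ').replace('-', ' - ').split(), number=200)
t_regex = timeit.timeit(
    lambda: re.findall(r'\w+|\W+', s), number=200)

print('replace/split:', t_replace)
print('regex findall:', t_regex)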
As another Pythonic way, you can also use the tokenize module:
In [56]: from io import StringIO
In [57]: import tokenize
In [59]: [[t.string for t in tokenize.generate_tokens(StringIO(i).readline) if t.string] for i in lst]
Out[59]: [['z', '+', '2', '-', '44'], ['4', '+', '55', '+', 'z', '+', '88']]
The tokenize module provides a lexical scanner for Python source code, implemented in Python. The scanner in this module returns comments as tokens as well, making it useful for implementing “pretty-printers,” including colorizers for on-screen displays.
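A small aside (not part of the original question's requirements): tokenize also keeps multi-character Python operators such as ** together as a single token, which the simple character-based regexes above do not:
from io import StringIO
import tokenize

expr = '2**3 + z'
# Filter out the empty NEWLINE/ENDMARKER tokens that tokenize appends.
tokens = [t.string for t in tokenize.generate_tokens(StringIO(expr).readline)
          if t.string.strip()]
print(tokens)  # ['2', '**', '3', '+', 'z']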

If you want to stick with split (hence avoiding regex), you can provide it with an optional character to split on:
>>> testing = 'z+2-44'
>>> testing.split('+')
['z', '2-44']
>>> testing.split('-')
['z+2', '44']
So you could whip something up by chaining split calls, as sketched below.
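For example, one way of chaining the splits (a sketch; like the single splits above, it discards the delimiters):
testing = 'z+2-44'
# Split on '+' first, then split each resulting chunk on '-'.
pieces = [part for chunk in testing.split('+') for part in chunk.split('-')]
print(pieces)  # ['z', '2', '44']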
However, using regular expressions would probably be more readable:
import re
>>> re.split(r'\+|\-', testing)
['z', '2', '44']
This just says "split the string at any + or - character" (the backslashes escape the + and -; strictly only the + needs escaping here, since - is only special inside a character class, but the extra escape is harmless).
Lastly, in this particular case, I imagine the goal is something along the lines of "split at every non-alphanumeric character", in which case regex can still save the day:
>>> re.split('[^a-zA-Z0-9]', testing)
['z', '2', '44']
It is of course worth noting that there are a million other solutions, as discussed in some other SO discussions.
Python: Split string with multiple delimiters
Split Strings with Multiple Delimiters?
My answers here are targeted towards simple, readable code and not performance, in honor of Donald Knuth.


How to split duplicated separator in Python

I have a string with the format
exp = '(( 200 + (4 * 3.14)) / ( 2 ** 3 ))'
I would like to separate the string into tokens using re.split() and keep the separators as well. However, ** is not kept together as a single token; it ends up being split into two separate * tokens instead.
This is my code: tokens = re.split(r'([+|-|**?|/|(|)])',exp)
My Output (wrong):
['(', '(', '200', '+', '(', '4', '*', '3.14', ')', ')', '/', '(', '2', '*', '*', '3', ')', ')']
Is there a way to make the split distinguish between * and **? Thank you so much!
Desired Output:
['(', '(', '200', '+', '(', '4', '*', '3.14', ')', ')', '/', '(', '2', '**', '3', ')', ')']
Using the [...] notation only allows you to specify individual characters. To get variable-sized alternative patterns you need to use the | operator outside of those brackets. This also means that you need to escape the regular expression operators and that you need to place the longer patterns before the shorter ones (i.e. ** before *):
tokens = re.split(r'(\*\*|\*|\+|\-|/|\(|\))', exp)
or even shorter:
tokens = re.split(r'(\*\*|[*+\-/()])', exp)

Regex [] vs () in Python with respect to re.split() [duplicate]

This question already has answers here:
Using alternation or character class for single character matching?
What is the difference between [,\.] and (,|\.) when used as a pattern in re.split(pattern, string)? Can someone please explain with respect to this example in Python:
import re
regex_pattern1 = r"[,\.]"
regex_pattern2 = r"(,|\.)"
print(re.split(regex_pattern1, '100,000.00')) #['100', '000', '00']
print(re.split(regex_pattern2, '100,000.00')) #['100', ',', '000', '.', '00']
[,\.] is equivalent to ,|\.
(,|\.) is equivalent to ([,\.]).
() creates a capture, and re.split returns captured text as well as the text separated by the pattern.
>>> import re
>>> re.split(r'([,\.])', '100,000.00')
['100', ',', '000', '.', '00']
>>> re.split(r'(,|\.)', '100,000.00')
['100', ',', '000', '.', '00']
>>> re.split(r',|\.', '100,000.00')
['100', '000', '00']
>>> re.split(r'(?:,|\.)', '100,000.00')
['100', '000', '00']
>>> re.split(r'[,\.]', '100,000.00')
['100', '000', '00']
You might sometimes need (?:,|\.) to limit what counts as the operands of | when you embed it in a larger pattern, though; a small example is sketched below.
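For instance (an illustrative pattern, not taken from the question), embedding the alternation in a larger pattern behaves differently with and without the non-capturing group:
import re

text = 'price 1,000 and 3.14'
# With the non-capturing group, | only alternates between ',' and '.'.
print(re.findall(r'\d+(?:,|\.)\d+', text))  # ['1,000', '3.14']
# Without it, | splits the whole pattern into r'\d+,' and r'\.\d+'.
print(re.findall(r'\d+,|\.\d+', text))      # ['1,', '.14']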

Python regular expression split string into numbers and text/symbols

I would like to split a string into sections of numbers and sections of text/symbols.
My current code doesn't handle negative numbers or decimals, and it behaves weirdly, adding an empty element at the end of the output:
import re
mystring = 'AD%5(6ag 0.33--9.5'
newlist = re.split('([0-9]+)', mystring)
print (newlist)
current output:
['AD%', '5', '(', '6', 'ag ', '0', '.', '33', '--', '9', '.', '5', '']
desired output:
['AD%', '5', '(', '6', 'ag ', '0.33', '-', '-9.5']
Your issue comes from the fact that your regex captures one or more digits and adds them to the resulting list, while those digits also act as the delimiter, so the parts before and after each match are kept. If the string ends with digits, the part after the last match is the empty string, which therefore gets added to the resulting list.
You may split with a regex that matches float or integer numbers with an optional minus sign and then remove empty values:
result = re.split(r'(-?\d*\.?\d+)', mystring)
result = list(filter(None, result))
To match negative/positive numbers with exponents, use the pattern below (a quick usage sketch follows the breakdown):
r'([+-]?\d*\.?\d+(?:[eE][-+]?\d+)?)'
The -?\d*\.?\d+ regex matches:
-? - an optional minus
\d* - 0+ digits
\.? - an optional literal dot
\d+ - one or more digits.
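A quick usage sketch for the exponent-aware variant (the input string here is made up so that it contains an exponent):
import re

s = 'AD%5(6ag 0.33--9.5e-2'
parts = [p for p in re.split(r'([+-]?\d*\.?\d+(?:[eE][-+]?\d+)?)', s) if p]
print(parts)  # ['AD%', '5', '(', '6', 'ag ', '0.33', '-', '-9.5e-2']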
Unfortunately, re.split() does not offer an "ignore empty strings" option. However, to retrieve your numbers, you could easily use re.findall() with a different pattern:
import re
string = "AD%5(6ag0.33-9.5"
rx = re.compile(r'-?\d+(?:\.\d+)?')
numbers = rx.findall(string)
print(numbers)
# ['5', '6', '0.33', '-9.5']
As mentioned here before, there is no option to ignore the empty strings in re.split(), but you can easily construct a new list in the following way:
import re
mystring = "AD%5(6ag0.33--9.5"
newlist = [x for x in re.split(r'(-?\d+\.?\d*)', mystring) if x != '']
print(newlist)
output:
['AD%', '5', '(', '6', 'ag', '0.33', '-', '-9.5']

pyparsing: setResultsName for multiple elements get combined

Here is the text I'm parsing:
x ~ normal(mu, 1)
y ~ normal(mu2, 1)
The parser matches those lines using:
model_definition = Group(identifier.setResultsName('random_variable_name') + '~' + expression).setResultsName('model_definition')
The problem is that when there are two model definitions, they aren't named separately in the ParseResults object.
It looks like the first one gets overridden by the second. The reason I'm naming them is to make executing the lines easier - this way I (hopefully) don't have to figure out what is going on at evaluation time - the parser has already labelled everything. How can I get both model_definitions labelled? It would be nice if model_definition held a list of every model definition found.
Just in case, here is some more of my code:
model_definition = Group(identifier.setResultsName('random_variable_name') + '~' + expression).setResultsName('model_definition')
expression << Or([function_application, number, identifier, list_literal, probability_expression])
statement = Optional(newline) + Or([model_definition, assignment, function_application]) + Optional(newline)
line = OneOrMore('\n').suppress()
comment = Group('#' + SkipTo(newline)).suppress()
program = OneOrMore(Or([line, statement, comment]))
ast = program.parseString(input_string)
return ast
Not documented that I know of, but I found something in pyparsing.py:
I changed .setResultsName('model_definition') to .setResultsName('model_definition*') and they listed correctly!
Edit: it is documented, but it is a flag you pass to setResultsName:
setResultsName( string, listAllMatches=False ) - name to be given to tokens matching the element; if multiple tokens within a repetition group (such as ZeroOrMore or delimitedList) the default is to return only the last matching token - if listAllMatches is set to True, then a list of matching tokens is returned.
Here is enough of your code to get things to work:
from pyparsing import *
# fake in the bare minimum to parse the given test strings
identifier = Word(alphas, alphanums)
integer = Word(nums)
function_call = identifier + '(' + Optional(delimitedList(identifier | integer)) + ')'
expression = function_call
model_definition = Group(identifier.setResultsName('random_variable_name') + '~' + expression)
sample = """
x ~ normal(mu, 1)
y ~ normal(mu2, 1)
"""
The trailing '*' is there in setResultsName for those cases where you use the short form of setResultsName: expr("name*") vs expr.setResultsName("name", listAllMatches=True). If you prefer calling setResultsName, then I would not use the '*' notation, but would pass the listAllMatches argument.
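To illustrate that equivalence, here is a minimal, self-contained sketch using a made-up grammar rather than the question's:
from pyparsing import Word, nums, OneOrMore

num = Word(nums)
# The two spellings below should behave the same way.
grammar_star = OneOrMore(num("value*"))
grammar_kwarg = OneOrMore(num.setResultsName("value", listAllMatches=True))

print(grammar_star.parseString("1 2 3")["value"])   # ['1', '2', '3']
print(grammar_kwarg.parseString("1 2 3")["value"])  # ['1', '2', '3']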
If you are getting names that step on each other, you may need to add a level of Grouping. Here is your solution using listAllMatches=True, by virtue of the trailing '*' notation:
model_definition1 = model_definition('model_definition*')
print(OneOrMore(model_definition1).parseString(sample).dump())
It returns this parse result:
[['x', '~', 'normal', '(', 'mu', '1', ')'], ['y', '~', 'normal', '(', 'mu2', '1', ')']]
- model_definition: [['x', '~', 'normal', '(', 'mu', '1', ')'], ['y', '~', 'normal', '(', 'mu2', '1', ')']]
[0]:
['x', '~', 'normal', '(', 'mu', '1', ')']
- random_variable_name: x
[1]:
['y', '~', 'normal', '(', 'mu2', '1', ')']
- random_variable_name: y
Here is a variation that does not use listAllMatches, but adds another level of Group:
model_definition2 = model_definition('model_definition')
print(OneOrMore(Group(model_definition2)).parseString(sample).dump())
gives:
[[['x', '~', 'normal', '(', 'mu', '1', ')']], [['y', '~', 'normal', '(', 'mu2', '1', ')']]]
[0]:
[['x', '~', 'normal', '(', 'mu', '1', ')']]
- model_definition: ['x', '~', 'normal', '(', 'mu', '1', ')']
- random_variable_name: x
[1]:
[['y', '~', 'normal', '(', 'mu2', '1', ')']]
- model_definition: ['y', '~', 'normal', '(', 'mu2', '1', ')']
- random_variable_name: y
In both cases, I see the full content being returned, so I don't quite understand what you mean by "if you return multiple, it fails to split out each child."

re.split on multiple characters (and maintaining the characters) produces a list containing also empty strings

I need to split a mathematical expression based on the delimiters. The delimiters are (, ), +, -, *, /, ^ and space. I came up with the following regular expression
"([\\s\\(\\)\\-\\+\\*/\\^])"
which also keeps the delimiters in the resulting list (which is what I want), but it also produces empty string "" elements, which I don't want. I hardly ever use regular expressions (unfortunately), so I am not sure if it is possible to avoid this.
Here's an example of the problem:
>>> import re
>>> e = "((12*x^3+4 * 3)*3)"
>>> re.split("([\\s\\(\\)\\-\\+\\*/\\^])", e)
['', '(', '', '(', '12', '*', 'x', '^', '3', '+', '4', ' ', '', '*', '', ' ', '3', ')', '', '*', '3', ')', '']
Is there a way to not produce those empty strings, maybe by modifying my regular expression? Of course I can remove them using, for example, filter, but the idea would be not to produce them at all.
Edit
I would also need the spaces not to be included in the result. If you can help with that as well, it would be great.
You could add \w+, remove the \s and do a findall:
import re
e = "((12*x^3+44 * 3)*3)"
print re.findall("(\w+|[()\-+*/^])", e)
Output:
['(', '(', '12', '*', 'x', '^', '3', '+', '44', '*', '3', ')', '*', '3', ')']
Depending on what you want you can change the regex:
e = "((12a*x^3+44 * 3)*3)"
print re.findall("(\d+|[a-z()\-+*/^])", e)
print re.findall("(\w+|[()\-+*/^])", e)
The first considers 12a to be two strings, the latter one string:
['(', '(', '12', 'a', '*', 'x', '^', '3', '+', '44', '*', '3', ')', '*', '3', ')']
['(', '(', '12a', '*', 'x', '^', '3', '+', '44', '*', '3', ')', '*', '3', ')']
Just strip/filter them out in a comprehension.
result = [item for item in re.split("([\\s\\(\\)\\-\\+\\*/\\^])", e) if item.strip()]
