Regex [] vs () in Python with respect to re.split() [duplicate]

What is the difference between [,\.] and (,|\.) when used as a pattern in re.split(pattern, string)? Can someone please explain with respect to this example in Python:
import re
regex_pattern1 = r"[,\.]"
regex_pattern2 = r"(,|\.)"
print(re.split(regex_pattern1, '100,000.00')) #['100', '000', '00']
print(re.split(regex_pattern2, '100,000.00')) #['100', ',', '000', '.', '00']

[,\.] is equivalent to ,|\. when it is used as the whole pattern.
(,|\.) is equivalent to ([,\.]).
() creates a capture, and re.split returns captured text as well as the text separated by the pattern.
>>> import re
>>> re.split(r'([,\.])', '100,000.00')
['100', ',', '000', '.', '00']
>>> re.split(r'(,|\.)', '100,000.00')
['100', ',', '000', '.', '00']
>>> re.split(r',|\.', '100,000.00')
['100', '000', '00']
>>> re.split(r'(?:,|\.)', '100,000.00')
['100', '000', '00']
>>> re.split(r'[,\.]', '100,000.00')
['100', '000', '00']
You might sometimes need (?:,|\.) to limit what is considered the operands of | when you embed it in a larger pattern, though.
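A minimal sketch of that last point, assuming a made-up input with spaces around the separators (the sample string and variable names are illustrative only). Without a group, | splits the entire pattern into two alternatives; with (?:...), only the comma and the dot are alternatives.

```python
import re

text = "100 , 000 . 00"  # hypothetical input, not from the question

# Ungrouped: the alternatives are "\s*," and "\.\s*", so some spaces survive.
ungrouped = re.split(r"\s*,|\.\s*", text)

# Grouped: "\s*" applies on both sides of either separator.
grouped = re.split(r"\s*(?:,|\.)\s*", text)

print(ungrouped)  # ['100', ' 000 ', '00']
print(grouped)    # ['100', '000', '00']
```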

Related

Python: Sort list of similar strings based on another string

I have a string
deplete mineral resources , from 123 in x 123 in x 19 ft , on 24 ft t shaped hole
and a list of strings
['123', '123', '19', '24', 'in', 'in', 'ft', 'ft', 'deplete mineral', 't', 'resources', 'shaped hole']
I want to sort this list based on the given string. When I use sorted(l, key=s.index), I get this output:
['deplete mineral', 't', 'in', 'in', 'resources', '123', '123', '19', 'ft', 'ft', '24', 'shaped hole']
But my desired output is:
['deplete mineral', 'resources', '123', 'in' , '123', 'in' , '19', 'ft', '24', 'ft', 't' , 'shaped hole']
The list should be sorted exactly as the string given. Is there an efficient way to achieve this?
This produces the desired pattern. It's not technically a sort though - just a regular expression search of the sort string.
>>> import re
>>>
>>> sort_str = "deplete mineral resources , from 123 in x 123 in x " \
... "19 ft , on 24 ft t shaped hole"
>>>
>>> str_list = ['123', '123', '19', '24', 'in', 'in', 'ft', 'ft',
... 'deplete mineral', 't', 'resources', 'shaped hole']
>>>
>>> re.findall('|'.join(str_list), sort_str)
['deplete mineral', 'resources', '123', 'in', '123', 'in', '19',
'ft', '24', 'ft', 't', 'shaped hole']
>>>
>>>
>>> desired = ['deplete mineral', 'resources', '123', 'in' , '123',
... 'in' , '19', 'ft', '24', 'ft', 't' , 'shaped hole']
>>> desired == re.findall('|'.join(str_list), sort_str)
True
The regular expression is simple. It's of the form "alt_1|alt_2|alt_3". What that OR-like expression produces is a pattern matcher that scans a string looking for the substrings "alt_1", "alt_2", or "alt_3".
str_list is joined together to form this OR-like expression in this simple fashion:
>>> '|'.join(str_list)
'123|123|19|24|in|in|ft|ft|deplete mineral|t|resources|shaped hole'
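For this input the ordering of the alternatives isn't important - they could be in any order. In general, though, alternation is tried left to right, so if one item is a prefix of another, the longer one should come first.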
This string expression is turned into a regular expression internally when passed in as the first parameter to re.findall() and used to find all matching substrings in sort_str with the following line:
>>> re.findall('|'.join(str_list), sort_str)
re.findall() scans sort_str from beginning to end looking for substrings that are part of str_list. Each occurrence is added to the list it returns.
So the substrings matched will be in the same order as the words in sort_str.
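A slightly more defensive variant, as a sketch rather than a drop-in replacement: it assumes the list items might contain regex metacharacters or be prefixes of one another, so it escapes each item with re.escape and puts longer items first. The names alternatives, pattern and ordered are illustrative.

```python
import re

sort_str = ("deplete mineral resources , from 123 in x 123 in x "
            "19 ft , on 24 ft t shaped hole")
str_list = ['123', '123', '19', '24', 'in', 'in', 'ft', 'ft',
            'deplete mineral', 't', 'resources', 'shaped hole']

# Drop duplicates and try longer alternatives first, so a short item
# cannot win over a longer item that starts with the same characters.
alternatives = sorted(set(str_list), key=len, reverse=True)

# Escape each item so characters such as "." or "+" match literally.
pattern = "|".join(re.escape(item) for item in alternatives)

ordered = re.findall(pattern, sort_str)
print(ordered)
```

For the data in the question this gives the same list as the unescaped version.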

Split string on punctuation or number in Python

I'm trying to split strings every time I encounter a punctuation mark or a number, such as:
toSplit = 'I2eat!Apples22becauseilike?Them'
result = re.sub('[0123456789,.?:;~!##$%^&*()]', ' \1',toSplit).split()
The desired output would be:
['I', '2', 'eat', '!', 'Apples', '22', 'becauseilike', '?', 'Them']
However, the code above (although it properly splits where it's supposed to) removes all the numbers and punctuation marks.
Any clarification would be greatly appreciated.
Use re.split with capture group:
toSplit = 'I2eat!Apples22becauseilike?Them'
result = re.split('([0-9,.?:;~!##$%^&*()])', toSplit)
result
Output:
['I', '2', 'eat', '!', 'Apples', '2', '', '2', 'becauseilike', '?', 'Them']
If you want runs of consecutive digits or punctuation to be kept together as a single token, add +:
result = re.split('([0-9,.?:;~!##$%^&*()]+)', toSplit)
result
Output:
['I', '2', 'eat', '!', 'Apples', '22', 'becauseilike', '?', 'Them']
You can tokenize strings like yours into runs of digits, runs of letters, and runs of other characters (anything that is not whitespace, a letter, or a digit) using
re.findall(r'\d+|(?:[^\w\s]|_)+|[^\W\d_]+', toSplit)
Here,
\d+ - 1+ digits
(?:[^\w\s]|_)+ - 1+ chars other than word and whitespace chars or _
[^\W\d_]+ - any 1+ Unicode letters.
See the regex demo.
The matching approach is more flexible than splitting, as it also allows tokenizing more complex structures. Say you also want to tokenize decimal (float, double...) numbers. You just need to use \d+(?:\.\d+)? instead of \d+:
re.findall(r'\d+(?:\.\d+)?|(?:[^\w\s]|_)+|[^\W\d_]+', toSplit)
              ^^^^^^^^^^^^^
See this regex demo.
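As a quick check of the decimal-aware pattern, here is a sketch run on a made-up variant of the input that contains a decimal number (the sample string and the name TOKEN_RE are assumptions for illustration):

```python
import re

# Decimal numbers, then runs of punctuation/underscore, then runs of letters.
TOKEN_RE = re.compile(r'\d+(?:\.\d+)?|(?:[^\w\s]|_)+|[^\W\d_]+')

sample = 'I2eat!Apples22.5because?Them'  # hypothetical input
tokens = TOKEN_RE.findall(sample)
print(tokens)
# ['I', '2', 'eat', '!', 'Apples', '22.5', 'because', '?', 'Them']
```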
Use re.split to split wherever a run of alphabetic characters is found:
>>> import re
>>> re.split(r'([A-Za-z]+)', toSplit)
['', 'I', '2', 'eat', '!', 'Apples', '22', 'becauseilike', '?', 'Them', '']
>>>
>>> ' '.join(re.split(r'([A-Za-z]+)', toSplit)).split()
['I', '2', 'eat', '!', 'Apples', '22', 'becauseilike', '?', 'Them']

How to properly split this list of strings?

I have a list of strings such as this :
['z+2-44', '4+55+z+88']
How can I split the strings in this list so that the result would be something like
[['z','+','2','-','44'],['4','+','55','+','z','+','88']]
I have tried using the split method already however that splits the 44 into 4 and 4, and am not sure what else to try.
You can use regex:
import re
lst = ['z+2-44', '4+55+z+88']
[re.findall(r'\w+|\W+', s) for s in lst]
# [['z', '+', '2', '-', '44'], ['4', '+', '55', '+', 'z', '+', '88']]
\w+|\W+ matches a pattern that consists either of word characters (alphanumeric values in your case) or non word characters (+- signs in your case).
This will work, using itertools.groupby:
import itertools
z = ['z+2-44', '4+55+z+88']
print([["".join(x) for k,x in itertools.groupby(i,str.isalnum)] for i in z])
output:
[['z', '+', '2', '-', '44'], ['4', '+', '55', '+', 'z', '+', '88']]
It just groups the chars if they're alphanumerical (or not), just join them back in a list comprehension.
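To make the grouping visible, this small sketch prints the key that groupby computes for each run alongside the joined run itself (the names expr and runs are illustrative):

```python
import itertools

expr = 'z+2-44'

# groupby starts a new group whenever str.isalnum changes value,
# so each group is a run of alphanumerics or a run of everything else.
runs = [(is_alnum, "".join(chars))
        for is_alnum, chars in itertools.groupby(expr, str.isalnum)]

print(runs)
# [(True, 'z'), (False, '+'), (True, '2'), (False, '-'), (True, '44')]
```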
EDIT: the general case of a calculator with parenthesis has been asked as a follow-up question here. If z is as follows:
z = ['z+2-44', '4+55+((z+88))']
then with the previous grouping we get:
[['z', '+', '2', '-', '44'], ['4', '+', '55', '+((', 'z', '+', '88', '))']]
That is not easy to parse in terms of tokens. So one change would be to join a group only if it is alphanumeric, and leave it as a list of single characters if not, flattening at the end using chain.from_iterable:
print([list(itertools.chain.from_iterable(["".join(x)] if k else x for k,x in itertools.groupby(i,str.isalnum))) for i in z])
which yields:
[['z', '+', '2', '-', '44'], ['4', '+', '55', '+', '(', '(', 'z', '+', '88', ')', ')']]
Note that the alternative regex answer can also be adapted like this: [re.findall(r'\w+|\W', s) for s in lst] (note the lack of + after \W).
Also, "".join(list(x)) is slightly faster than "".join(x), but I'll leave that for you to add, to avoid hurting the readability of that already complex expression.
Alternative solution using re.split function:
l = ['z+2-44', '4+55+z+88']
print([list(filter(None, re.split(r'(\w+)', i))) for i in l])
The output:
[['z', '+', '2', '-', '44'], ['4', '+', '55', '+', 'z', '+', '88']]
You could also use only the built-in str.replace() and str.split() methods within a list comprehension:
In [34]: lst = ['z+2-44', '4+55+z+88']
In [35]: [s.replace('+', ' + ').replace('-', ' - ').split() for s in lst]
Out[35]: [['z', '+', '2', '-', '44'], ['4', '+', '55', '+', 'z', '+', '88']]
But note that this is not an efficient approach for longer strings. In that case the best way to go is using regex.
As another Pythonic option, you can also use the tokenize module:
In [56]: from io import StringIO
In [57]: import tokenize
In [59]: [[t.string for t in tokenize.generate_tokens(StringIO(i).readline)][:-1] for i in lst]
Out[59]: [['z', '+', '2', '-', '44'], ['4', '+', '55', '+', 'z', '+', '88']]
The tokenize module provides a lexical scanner for Python source code, implemented in Python. The scanner in this module returns comments as tokens as well, making it useful for implementing “pretty-printers,” including colorizers for on-screen displays.
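One caveat: depending on the Python version, the tokenizer may emit an implicit NEWLINE token for input that lacks a trailing newline, in which case slicing off only the last token can leave an empty string behind. A sketch that filters by token type instead (the helper name python_tokens and the SKIP set are assumptions, not part of the original answer):

```python
import io
import tokenize

lst = ['z+2-44', '4+55+z+88']

# Structural token types that carry no text of interest here.
SKIP = {tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER}

def python_tokens(expr):
    """Return the token strings of expr, dropping structural tokens."""
    readline = io.StringIO(expr).readline
    return [tok.string
            for tok in tokenize.generate_tokens(readline)
            if tok.type not in SKIP]

tokens = [python_tokens(s) for s in lst]
print(tokens)
```

Because this relies on Python's own lexer, it only suits input that is valid Python expression syntax.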
If you want to stick with split (hence avoiding regex), you can provide it with an optional character to split on:
>>> testing = 'z+2-44'
>>> testing.split('+')
['z', '2-44']
>>> testing.split('-')
['z+2', '44']
So, you could whip something up by chaining the split commands.
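One way the chaining could look, as a rough sketch (note that, like the plain split calls above, it discards the operators themselves):

```python
testing = 'z+2-44'

# Split on '+' first, then split each resulting chunk on '-'.
parts = [piece
         for chunk in testing.split('+')
         for piece in chunk.split('-')]

print(parts)  # ['z', '2', '44']
```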
However, using regular expressions would probably be more readable:
>>> import re
>>> re.split(r'\+|\-', testing)
['z', '2', '44']
This is just saying "split the string at any + or - character". The backslashes are escape characters: + is a quantifier in a regex, so it must be escaped; - is only special inside a character class, so escaping it here is harmless but not required.
Lastly, in this particular case, I imagine the goal is something along the lines of "split at every non-alpha numeric character", in which case regex can still save the day:
>>> re.split('[^a-zA-Z0-9]', testing)
['z', '2', '44']
It is of course worth noting that there are a million other solutions, as discussed in some other SO discussions.
Python: Split string with multiple delimiters
Split Strings with Multiple Delimiters?
My answers here are targeted towards simple, readable code and not performance, in honor of Donald Knuth

python split string number in digits

I would like to split the string "1234" into ['1', '2', '3', '4'] in Python.
My current approach is using re module
import re
re.compile(r'(\d)').split("1234")
['', '1', '', '2', '', '3', '', '4', '']
But I get some extra empty strings. I am not an expert in regular expressions; what would be a proper regular expression in Python to accomplish this?
Any advice would be appreciated.
Simply use list function, like this
>>> list("1234")
['1', '2', '3', '4']
The list function iterates the string, and creates a new list with all the characters in it.
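If a regex is still wanted, for example because the input may contain non-digit characters to skip, re.findall returns only the matches, so no empty strings appear. A minimal sketch, with a second made-up input to show the skipping:

```python
import re

# findall collects each single-digit match; nothing is returned for the gaps.
digits = re.findall(r"\d", "1234")
mixed = re.findall(r"\d", "12-34")  # hypothetical input with a separator

print(digits)  # ['1', '2', '3', '4']
print(mixed)   # ['1', '2', '3', '4']
```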
Strings are iterable sequences of characters, so you can loop over them or slice them directly:
>>> nums = "1234"
>>> for i in nums:
...     print(i)
...
1
2
3
4
>>> nums[:-1]
'123'

Slicing a string into a list [duplicate]

Say I have a string
s = '1234567890ABCDEF'
How can I slice (or maybe split is the correct term?) this string into a list consisting of strings containing 2 characters each?
desired_result = ['12', '34', '56', '78', '90', 'AB', 'CD', 'EF']
Not sure if this is relevant, but I'm parsing a string of hex characters and the final result I need is a list of bytes, created from the list above (for instance, by using int(desired_result[i], 16))
>>> bytes.fromhex('1234567890ABCDEF')  # Python 3
b'\x124Vx\x90\xab\xcd\xef'
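You could use binascii (the output below is from Python 2; in Python 3, unhexlify returns a bytes object, and calling list() on it yields integers):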
>>> from binascii import unhexlify
>>> unhexlify(s)
'\x124Vx\x90\xab\xcd\xef'
Then:
>>> list(_)
['\x12', '4', 'V', 'x', '\x90', '\xab', '\xcd', '\xef']
>>> s = '1234567890ABCDEF'
>>> iter_s = iter(s)
>>> [a + next(iter_s) for a in iter_s]
['12', '34', '56', '78', '90', 'AB', 'CD', 'EF']
>>>
>>> s = '1234567890ABCDEF'
>>> [char0+char1 for (char0, char1) in zip(s[::2], s[1::2])]
['12', '34', '56', '78', '90', 'AB', 'CD', 'EF']
But, as others have noted, there are more direct solutions to the more general problem of converting hexadecimal numbers to bytes.
Also note that Robert Kings's solution is more efficient, in general, as it essentially has a zero memory footprint (at the cost of less legible code).
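For completeness, a plain slicing sketch of the chunking the question asks for, followed by the int(..., 16) conversion the question mentions. It assumes nothing beyond the s from the question; the names chunks and byte_values are illustrative.

```python
s = '1234567890ABCDEF'

# Step through the string two characters at a time.
chunks = [s[i:i + 2] for i in range(0, len(s), 2)]

# Interpret each two-character chunk as a hexadecimal byte value.
byte_values = [int(chunk, 16) for chunk in chunks]

print(chunks)       # ['12', '34', '56', '78', '90', 'AB', 'CD', 'EF']
print(byte_values)  # [18, 52, 86, 120, 144, 171, 205, 239]
```

If the string has an odd length, the last chunk simply contains a single character.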
