Separate number/letter tokens in Python - python

I'm using re.split() to separate a string into tokens. Currently the pattern I'm using as the argument is [^\dA-Za-z], which retrieves alphanumeric tokens from the string.
However, what I need is to also split tokens that have both numbers and letters into tokens with only one or the other, eg.
re.split(pattern, "my t0kens")
would return ["my", "t", "0", "kens"].
I'm guessing I might need to use lookahead/lookbehind, but I'm not sure if that's actually necessary or if there's a better way to do it.

Try the findall method instead.
>>> print re.findall ('[^\d ]+', "my t0kens");
['my', 't', 'kens']
>>> print re.findall ('[\d]+', "my t0kens");
['0']
>>>
Edit: Better way from Bart's comment below.
>>> print re.findall('[a-zA-Z]+|\\d+', "my t0kens")
['my', 't', '0', 'kens']
>>>

>>> [x for x in re.split(r'\s+|(\d+)',"my t0kens") if x]
['my', 't', '0', 'kens']
By using capturing parenthesis within the pattern, the tokens will also be return. Since you only want to maintain digits and not the spaces, I've left the \s outside the parenthesis so None is returned which can then be filtered out using a simple loop.

Should be one line of code
re.findall('[a-z]+|[\d]+', 'my t0kens')

Not perfect, but removing space from the list below is easy :-)
re.split('([\d ])', 'my t0kens')
['my', ' ', 't', '0', 'kens']
docs: "Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list."

Related

Regex expression for a given string

I have a small issue i am running into. I need a regular expression that would split a passed string with numbers separately and anything chunk of characters within square brackets separately and regular set of string separately.
for example if I have a strings that resembles
s = 2[abc]3[cd]ef
i need a list with lst = ['2','abc','3','cd','ef']
I have a code so far that has this..
import re
s = "2[abc]3[cd]ef"
s_final = ""
res = re.findall("(\d+)\[([^[\]]*)\]", s)
print(res)
This is outputting a list of tuples that looks like this.
[('2', 'abc'), ('3', 'cd')]
I am very new to regular expression and learning.. Sorry if this is an easy one.
Thanks!
The immediate fix is getting rid of the capturing groups and using alternation to match either digits or chars other than square bracket chars:
import re
s = "2[abc]3[cd]ef"
res = re.findall(r"\d+|[^][]+", s)
print(res)
# => ['2', 'abc', '3', 'cd', 'ef']
See the regex demo and the Python demo. Details:
\d+ - one or more digits
| - or
[^][]+ - one or more chars other than [ and ]
Other solutions that might help are:
re.findall(r'\w+', s)
re.findall(r'\d+|[^\W\d_]+', s)
where \w+ matches one or more letters, digits, underscores and some more connector punctuation with diacritics and [^\W\d_]+ matches any one or more Unicode letters.
See this Python demo.
Don't try a regex that will find all part in the string, but rather a regex that is able to match each block, and \w (meaning [a-zA-Z0-9_]) feats well
s = "2[abc]3[cd]ef"
print(re.findall(r"\w+", s)) # ['2', 'abc', '3', 'cd', 'ef']
Or split on brackets
print(re.split(r"[\[\]]", s)) # ['2', 'abc', '3', 'cd', 'ef ']
Regex is intended to be used as a Regular Expression, your string is Irregular.
regex is being mostly used to find a specific pattern in a long text, text validation, extract things from text.
for example, in order to find a phone number in a string, I would use RegEx, but when I want to build a calculator and I need to extract operators/digits I would not, but I would rather want to write a python code to do that.

python re.split with maxsplit argument

when using re.split I'd expect the maxsplit to be the length of the returned list (-1).
The examples in the docs suggest so.
But when there is a capture group (and maybe some other cases) then I don't understand how the maxsplit argument works.
>>> re.split("(\W+)", "Words, words, words.", maxsplit=1)
['Words', ', ', 'words, words.']
>>> re.split("(:)", ":a:b::c", maxsplit=2)
['', ':', 'a', ':', 'b::c']
>>> re.split("((:))", ":a:b::c", maxsplit=2)
['', ':', ':', 'a', ':', ':', 'b::c']
What am I missing?
It's not about maxsplit, it's about you using parentheses in the regular expression:
If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.
DOCS: https://docs.python.org/3/library/re.html#re.split
So what I'm guessing is that maxsplit determines the number of splits, and the parentheses return additional groups.
Example
":a:b::c" with maxsplit=2 splits your string in three parts:
"", "a", "b::c"
But because the pattern "(:)" also contains a captured group, it's returned in between the parts:
"", ":", "a", ":", "b::c"
If the pattern is "((:))", then each colon is returned twice in between the parts

Python regex that separates to words unless it is inside quotes

My goal is to achieve this:
Input:
Hi, Are you happy? I am "extremely happy" today
Output:
['Hi,', 'Are', 'you', 'happy?', 'I', 'am', 'extremely happy', 'today']
Is there a straight-forward approach to achieve this? I tried using another pattern I found:
pattern = r'"([A-Za-z0-9_\./\\-]*)"'
I assume this should find the text inside the quote, but did not manage to find a way to nail it.
EDIT
I also tried splitting using the next regex, but this obviously only gives me spaces separation which cuts my text inside quotes to segments:
tokens = [token for token in re.split(r"(\W)", text) if token.strip()]
Is there a way to combine the pattern I supplied with this for loop such that it return an array that each word in a different cell unless it is quoted and then whats inside the quotes gets its own cell?
You could use shlex.split instead of regex
import shlex
print(shlex.split('input: Hi, Are you happy? I am "extremely happy" today'))
result:
['input:', 'Hi,', 'Are', 'you', 'happy?', 'I', 'am', 'extremely happy', 'today']
Another fun way to do it: First split on quotes, then split every non-quoted part (every other):
str = 'I am "super happy" today'
ss = str.split('"')
res = sum(([w] if i%2 else w.split() for i,w in enumerate(ss)), [])
To remove punctuation, you need to replace split() on last line with a proper regexp, but I think you had that covered already.
This will not remove punctuation inside quotes of course, and you cannot nest quotes. So you cannot be "super "super" happy" :)

Split string by position not character

We know that anchors, word boundaries, and lookaround match at a position, rather than matching a character.
Is it possible to split a string by one of the preceding ways with regex (specifically in python)?
For example consider the following string:
"ThisisAtestForchEck,Match IngwithPosition."
So i want the following result (the sub-strings that start with uppercase letter but not precede by space ):
['Thisis', 'Atest', 'Forch' ,'Eck,' ,'Match Ingwith' ,'Position.']
If i split with grouping i get:
>>> re.split(r'([A-Z])',s)
['', 'T', 'hisis', 'A', 'test', 'F', 'orch', 'E', 'ck,', 'M', 'atchingwith', 'P', 'osition.']
And this is the result with look-around :
>>> re.split(r'(?<=[A-Z])',s)
['ThisisAtestForchEck,MatchingwithPosition.']
>>> re.split(r'((?<=[A-Z]))',s)
['ThisisAtestForchEck,MatchingwithPosition.']
>>> re.split(r'((?<=[A-Z])?)',s)
['ThisisAtestForchEck,MatchingwithPosition.']
Note that if i want to split by sub-strings that start with uppercase and are preceded by a space, e.g.:
['Thisis', 'Atest', 'Forch' ,'Eck,' ,'Match ', Ingwith' ,'Position.']
I can use re.findall, viz.:
>>> re.findall(r'([A-Z][^A-Z]*)',s)
['Thisis', 'Atest', 'Forch', 'Eck,', 'Match ', 'Ingwith', 'Position.']
But what about the first example: is it possible to solve it with re.findall?
A way with re.findall:
re.findall(r'(?:[A-Z]|^[^A-Z\s])[^A-Z\s]*(?:\s+[A-Z][^A-Z]*)*',s)
When you decide to change your approach from split to findall, the first job consists to reformulate your requirements: "I want to split the string on each uppercase letter non preceded by a space" => "I want to find one or more substrings separed by space that begins with an uppercase letter except from the start of the string (if the string doesn't start with an uppercase letter)"
(?<!\s)(?=[A-Z])
You can use this to split with regex module as re does not support split at 0 width assertions.
import regex
x="ThisisAtestForchEck,Match IngwithPosition."
print regex.split(r"(?<![\s])(?=[A-Z])",x,flags=regex.VERSION1)
or
print [i for i in regex.split(r"(?<![\s])(?=[A-Z])",x,flags=regex.VERSION1) if i]
See demo.
https://regex101.com/r/sJ9gM7/65
I know this might be less convenient because of the tuple nature of the result. But I think that this findall finds what you need:
re.findall(r'((?<!\s)[A-Z]([^A-Z]|(?<=\s)[A-Z])*)', s)
## returns [('Thisis', 's'), ('Atest', 't'), ('Forch', 'h'), ('Eck,', ','), ('Match Ingwith', 'h'), ('Position.', '.')]
This can be used in the following list comprehension to give the desired output:
[val[0] for val in re.findall(r'((?<!\s)[A-Z]([^A-Z]|(?<=\s)[A-Z])*)', s)]
## returns ['Thisis', 'Atest', 'Forch', 'Eck,', 'Match Ingwith', 'Position.']
And here is a hack that uses split:
re.split(r'((?<!\s)[A-Z]([^A-Z]|(?<=\s)[A-Z])*)', s)[1::3]
## returns ['Thisis', 'Atest', 'Forch', 'Eck,', 'Match Ingwith', 'Position.']
try capture using this pattern
([A-Z][a-z]*(?: [A-Z][a-z]*)*)
Demo

Decompose a Python string into its characters

I want to break a Python string into its characters.
sequenceOfAlphabets = list( string.uppercase )
works.
However, why does not
sequenceOfAlphabets = re.split( '.', string.uppercase )
work?
All I get are empty, albeit expected count of elements
The '.' matches every character and re.split returns everything that wasn't matched, that's why you're getting the empty list.
Using list is usually the way to handle something like this but if you want to use regular expressions just use re.findall
sequenceOfAlphabets = re.findall( '.', string.uppercase )
That should give you ['A', 'B', 'C', .... ,'Z']
Because the delimiter character used by split does not appear in the resulting list. This allows it be used like:
re.split(',', "foo,bar,baz")
['foo', 'bar', 'baz']
Also, you will find the resulting list from your split code actually contains one extra element, since split returns one more than the number of delimiters found. The above has two commas, so it returns a three-element list.
If you can do something with both a built-in function and with regexes, then usually the built-in approach will be faster and more legible.
The regex world is a maze of twisty little passages, populated by purveyors of almost-truths like """The '.' matches every character""" ... which it does, but only when you use the re.DOTALL flag. This information is not cunningly concealed in the fine print of the documentation; it's right there as the FIRST entry of "special characters":
'.'
(Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.
>>> import re
>>> re.findall(".", "fu\nbar")
['f', 'u', 'b', 'a', 'r']
>>>
Just an FYI, this also works:
sequenceOfAlphabets = [a for a in string.uppercase]
...but that does exactly what list() would do so I don't think it would be any faster (I could be wrong).
You can also create an empty set and use the update method, like so:
destroy_string = set()
destroy_string.update('Stack Overflow')
destroy_string
{'k', ' ', 'S', 'c', 'v', 'o', 'r', 't', 'w', 'e', 'f', 'O', 'l', 'a'}
Albeit, it will become unordered and the duplicates will be lost in the set, however, this is still a valid way to decompose a string into a set of its individual members.
From the documentation:
If capturing parentheses are used in
pattern, then the text of all groups
in the pattern are also returned as
part of the resulting list.
Also note:
If there are capturing groups in the
separator and it matches at the start
of the string, the result will start
with an empty string. The same holds
for the end of the string.
So, use re.split( '(.)', string.uppercase)[1:-1] instead.

Categories

Resources