Why are there empty strings in my re.split() result? - python

I want to extract the strings enclosed in brackets and single quotes from a given string, e.g. given ['this'], extract this.
However, the following example and its result keep puzzling me:
import re
target_string = "['this']['current']"
result = re.split(r'[\[|\]|\']+', target_string)
print(result)
I got
['', 'this', 'current', '']
# I expect ['this', 'current']
Now I really don't understand where the first and last '' in the result come from. I guarantee that the input target_string has no leading or trailing space, so I don't expect them to appear in the result.
Can anybody help me fix this, please?

re.split splits the string every time the pattern is found. Since your string starts and ends with the pattern, it outputs a '' at the beginning and at the end, so that joining the output with the delimiters reproduces the original string.
If you want to capture, why not use re.findall instead of re.split? The pattern is very simple if you only have one word per bracket.
target_string = "['this']['current']"
re.findall(r"\w+", target_string)
output
['this', 'current']
Note that the above will not work when a bracket contains several words, e.g. when you expect:
['this is the', 'current']
For such a case you can use a lookbehind (?<=...) and a lookahead (?=...) and capture everything in between in a non-greedy way with .+?
target_string = "['this is the']['current']"
re.findall(r"(?<=\[').+?(?='\])", target_string)  # this pattern is equivalent to r"\['(.+?)'\]"
output:
['this is the', 'current']
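To make the explanation of the empty strings concrete, here is a minimal sketch (the names parts and cleaned are only illustrative). It shows that re.split emits an empty string wherever a run of delimiters sits at the very start or end of the input, and that filtering those out gives the list the question expected:

```python
import re

target_string = "['this']['current']"

# The string starts and ends with delimiter characters, so re.split
# reports the (empty) text before the first and after the last delimiter.
parts = re.split(r"[\[\]']+", target_string)
print(parts)

# Dropping the empty strings leaves only the extracted words.
cleaned = [p for p in parts if p]
print(cleaned)
```

Note that | is not needed inside a character class; there it is just a literal pipe character.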

Related

python re split at all space and punctuation except for the apostrophe

I want to split a string by all spaces and punctuation except for the apostrophe sign. Preferably a single quote should still be used as a delimiter except when it is an apostrophe. I also want to keep the delimiters.
example string
words = """hello my name is 'joe.' what's your's"""
Here is my re pattern thus far:
splitted = re.split(r"[^'-\w]",words.lower())
I tried throwing the single quote after the ^ character but it is not working.
My desired output is this:
splitted = [hello,my,name,is,joe,.,what's,your's]
It might be simpler to process your list after splitting, without accounting for the quotes at first:
>>> import re
>>> words = """hello my name is 'joe.' what's your's"""
>>> split_words = re.split(r"[ ,.!?]", words.lower())  # add punctuation you want to split on
>>> split_words
['hello', 'my', 'name', 'is', "'joe", "'", "what's", "your's"]
>>> [word.strip("'") for word in split_words if word.strip("'")]
['hello', 'my', 'name', 'is', 'joe', "what's", "your's"]
One option is to use lookarounds to split at the desired positions, and a capture group for what you want to keep in the split.
After the split, you can remove the empty entries from the resulting list.
\s+|(?<=\s)'|'(?=\s)|(?<=\w)([,.!?])
The pattern matches
\s+ Match 1 or more whitespace chars
| Or
(?<=\s)' Match ' preceded by a whitespace char
| Or
'(?=\s) Match ' when followed by a whitespace char
| Or
(?<=\w)([,.!?]) Capture one of , . ! ? in group 1, when preceded by a word character
See a regex demo and a Python demo.
Example
import re
pattern = r"\s+|(?<=\s)'|'(?=\s)|(?<=\w)([,.!?])"
words = """hello my name is 'joe.' what's your's"""
result = [s for s in re.split(pattern, words) if s]
print(result)
Output
['hello', 'my', 'name', 'is', 'joe', '.', "what's", "your's"]
I love regex golf!
words = """hello my name is 'joe.' what's your's"""
splitted = re.findall(r"\b(?:\w'\w|\w)+\b", words)
The part in parentheses is a group that matches either an apostrophe surrounded by word characters or a single word character.
EDIT:
This is more flexible:
re.findall(r"\b(?:(?<=\w)'(?=\w)|\w)+\b", words)
It's getting a bit unreadable at this point though, in practice you should probably use Woodford's answer.

Python regex that separates to words unless it is inside quotes

My goal is to achieve this:
Input:
Hi, Are you happy? I am "extremely happy" today
Output:
['Hi,', 'Are', 'you', 'happy?', 'I', 'am', 'extremely happy', 'today']
Is there a straight-forward approach to achieve this? I tried using another pattern I found:
pattern = r'"([A-Za-z0-9_\./\\-]*)"'
I assume this should find the text inside the quote, but did not manage to find a way to nail it.
EDIT
I also tried splitting using the next regex, but this obviously only gives me spaces separation which cuts my text inside quotes to segments:
tokens = [token for token in re.split(r"(\W)", text) if token.strip()]
Is there a way to combine the pattern I supplied with this list comprehension so that it returns a list with each word in its own cell, unless it is quoted, in which case what's inside the quotes gets its own cell?
You could use shlex.split instead of regex
import shlex
print(shlex.split('input: Hi, Are you happy? I am "extremely happy" today'))
result:
['input:', 'Hi,', 'Are', 'you', 'happy?', 'I', 'am', 'extremely happy', 'today']
Another fun way to do it: First split on quotes, then split every non-quoted part (every other):
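text = 'I am "super happy" today'  # renamed from str to avoid shadowing the built-in
ss = text.split('"')
# odd-indexed parts were inside quotes: keep them whole; split the rest on whitespace
res = sum(([w] if i % 2 else w.split() for i, w in enumerate(ss)), [])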
To remove punctuation, you need to replace split() on last line with a proper regexp, but I think you had that covered already.
This will not remove punctuation inside quotes of course, and you cannot nest quotes. So you cannot be "super "super" happy" :)
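For comparison, here is a hedged regex-only sketch of the same idea. The helper name split_keep_quoted is hypothetical, and it assumes plain double quotes that are balanced and not nested: each match is either a quoted phrase (kept without its quotes) or a run of non-whitespace characters.

```python
import re

def split_keep_quoted(text):
    """Split on whitespace, but keep double-quoted phrases as single tokens."""
    tokens = []
    # First alternative: a quoted phrase; second: any run of non-whitespace.
    for match in re.finditer(r'"([^"]*)"|(\S+)', text):
        quoted, bare = match.groups()
        tokens.append(quoted if quoted is not None else bare)
    return tokens

print(split_keep_quoted('Hi, Are you happy? I am "extremely happy" today'))
```

Like the approach above, this keeps punctuation attached to the words and does not handle nested quotes.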

How can I split at word boundaries with regexes?

I'm trying to do this:
import re
sentence = "How are you?"
print(re.split(r'\b', sentence))
The result being
[u'How are you?']
I want something like [u'How', u'are', u'you', u'?']. How can this be achieved?
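Unfortunately, before Python 3.7, re.split could not split on empty matches.
To get around this, you would need to use findall instead of split.
Actually \b just means word boundary.
It is roughly equivalent to (?<=\w)(?=\W)|(?<=\W)(?=\w), except that \b also matches at the start or end of the string when a word character is adjacent.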
That means, the following code would work:
import re
sentence = "How are you?"
print(re.findall(r'\w+|\W+', sentence))
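On Python 3.7 and later, re.split does accept patterns that can match the empty string, so splitting on \b works there. The sketch below assumes such a version; the variable names are illustrative. It splits at every word boundary and then drops the empty and whitespace-only pieces:

```python
import re

sentence = "How are you?"

# Python 3.7+: re.split also splits on empty matches such as \b.
pieces = re.split(r"\b", sentence)
print(pieces)

# Keep only the pieces that contain something other than whitespace.
tokens = [p for p in pieces if p.strip()]
print(tokens)
```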
import re
split = re.findall(r"[\w']+|[.,!?;]", "How are you?")
print(split)
Output:
['How', 'are', 'you', '?']
Regex Explanation:
"[\w']+|[.,!?;]"
1st Alternative: [\w']+
[\w']+ match a single character present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
\w match any word character [a-zA-Z0-9_]
' the literal character '
2nd Alternative: [.,!?;]
[.,!?;] match a single character present in the list below
.,!?; a single character in the list .,!?; literally
Here is my approach to split on word boundaries:
re.split(r"\b\W\b", "How are you?") # Reprocess list to split on special characters.
# Result: ['How', 'are', 'you?']
and using findall on word boundaries
re.findall(r"\b\w+\b", "How are you?")
# Result: ['How', 'are', 'you']

Regex one-liner for matching only what comes after a certain word?

I want to extract song names from a list like this: 'some text here, songs: song1, song2, song3, fro: othenkl' and get ['song1', 'song2', 'song3']. So I try to do it in one regex:
result = re.findall(r'[Ss]ongs?:?.*', 'songs: songname1, songname2,')
print(re.findall(r'(?:(\w+),)*', result[0]))
This matches perfectly: ['', '', '', '', '', '', '', 'songname1', '', 'songname2', ''] (except for the empty strings, but nbd).
But I want to do it in one line, so I do the following:
print(re.findall(r'[Ss]ongs?:?(?:(\w+),)*', 'songs: songname1, songname2,'))
But I do not understand why this is unable to capture the same as the two regexes above:
['', 'name1', 'name2']
Is there a way to accomplish this in one line? It would be useful to be concise here. thanks.
You don't need re.findall in this case; it is better to use re.search to find the sequence of songs and then split the result on the comma. You also don't need the character class [Ss] to match the capital letter; you can use the ignore-case flag (re.I):
>>> s ='some text here, songs: song1, song2, song3, fro: othenkl'
>>> re.search(r'(?<=songs:)(.+),', s,flags=re.I).group(1).split(',')
[' song1', ' song2', ' song3']
(?<=songs:) is a positive lookbehind, which makes the regex engine match only text preceded by songs:, and (.+), matches the longest string after songs: that is followed by a comma, which is your sequence of songs.
As a more general alternative, instead of requiring a comma at the end of the regex, you can rely on the fact that the song names are followed by the pattern \s\w+:.
>>> re.search(r'(?<=songs:)(.+)(?=\s\w+:)', s).group(1).split(',')
[' song1', ' song2', ' song3', '']
No, you can't do it in one pattern with the re module.
What you can do is to use the regex module instead with this pattern:
regex.findall(r'(?:\G(?!\A), |\msongs: )(\w++)(?!:)', s)
Where \G is the position after the previous match, \A the start of the string, \m a word boundary followed by word characters, and ++ a possessive quantifier.
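If a single pattern is not a hard requirement, the search-then-split idea can be wrapped in a small helper using only the standard re module. This is a sketch under assumptions: the name extract_songs is hypothetical, and it assumes the list ends either at the next "label:" style word or at the end of the string.

```python
import re

def extract_songs(text):
    """Return the comma-separated names that follow 'song:' or 'songs:'."""
    # Lazily capture everything after the label, stopping before the next
    # whitespace-separated "word:" label or at the end of the string.
    match = re.search(r"songs?:(.*?)(?=\s\w+:|$)", text, flags=re.I)
    if match is None:
        return []
    # Strip surrounding whitespace and drop empty entries left by trailing commas.
    return [name.strip() for name in match.group(1).split(",") if name.strip()]

print(extract_songs('some text here, songs: song1, song2, song3, fro: othenkl'))
```

Stripping after the split avoids the leading spaces and the trailing empty string seen in the outputs above.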

Confusing Behaviour of regex in Python

I'm trying to match a specific pattern using the re module in python.
I wish to match a full sentence (More correctly I would say that they are alphanumeric string sequences separated by spaces and/or punctuation)
Eg.
"This is a regular sentence."
"this is also valid"
"so is This ONE"
I've tried various combinations of regular expressions, but I am unable to grasp how the patterns work, with each expression giving me a different yet inexplicable result (I do admit I am a beginner, but still).
I've tried:
"((\w+)(\s?))*"
To the best of my knowledge this should match one or more alphanumerics greedily, followed by one or no white-space character, and then it should match this entire pattern greedily. This is not what it seems to do, so clearly I am wrong, but I would like to know why. (I expected this to return the entire sentence as the result.)
The result I get for the first sample string mentioned above is [('sentence', 'sentence', ''), ('', '', ''), ('', '', ''), ('', '', '')].
"(\w+ ?)*"
I'm not even sure how this one should work. The official documentation (python help('re')) says that *, +, ? match x or x (greedy) repetitions of the preceding RE.
In such a case, is simply the space the preceding RE for '?', or is '\w+ ' the preceding RE? And what will be the RE for the '*' operator? The output I get with this is ['sentence'].
Others such as "(\w+\s?)+)" ; "((\w*)(\s??)) etc. which are basically variation of the same idea that the sentence is a set of alpha numerics followed by a single/finite number of white spaces and this pattern is repeated over and over.
Can someone tell me where I go wrong and why, and why the above expressions do not work the way I was expecting them to?
P.S I eventually got "[ \w]+" to work for me but With this I cannot limit the number of white-space characters in continuation.
Your reasoning about the regex is correct, your problem is coming from using capturing groups with *. Here's an alternative:
>>> s="This is a regular sentence."
>>> import re
>>> re.findall(r'\w+\s?', s)
['This ', 'is ', 'a ', 'regular ', 'sentence']
In this case it might make more sense for you to use \b in order to match word boundaries.
>>> re.findall(r'\w+\b', s)
['This', 'is', 'a', 'regular', 'sentence']
Alternatively, you can match the entire sentence via re.match and call group(0) on the resulting match object to get the whole match:
>>> r = r"((\w+)(\s?))*"
>>> s = "This is a regular sentence."
>>> import re
>>> m = re.match(r, s)
>>> m.group(0)
'This is a regular sentence'
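The point about capturing groups with * can be seen directly: a repeated capturing group only remembers its last repetition, and re.findall returns the groups rather than the whole match when the pattern contains any. A minimal sketch (variable names are illustrative):

```python
import re

s = "This is a regular sentence."

# The group is repeated five times, but it only keeps the final repetition.
m = re.match(r"(\w+\s?)*", s)
whole = m.group(0)
last_repetition = m.group(1)
print(whole, "|", last_repetition)

# With a non-capturing group, findall returns the whole match instead.
whole_run = re.findall(r"(?:\w+\s?)+", s)
print(whole_run)
```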
Here's an awesome Regular Expression tutorial website:
http://regexone.com/
Here's a Regular Expression that will match the examples given:
([a-zA-Z0-9,\. ]+)
Why do you want to limit the number of white-space characters in a row? A sentence can have any number of words (sequences of alphanumeric characters) and spaces in a row; a sentence is rather the stretch of text that ends with a punctuation mark, or with anything else that is not in the above set of characters.
([a-zA-Z0-9\s])*
The above regex will match a sentence as a series of alphanumeric characters or spaces, repeated zero or more times. You can refine it to be the following, though:
([a-zA-Z0-9])([a-zA-Z0-9\s])*
This simply states that the above sequence must be preceded by an alphanumeric character.
Hope this is what you were looking for.
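If limiting consecutive white space does matter, as the question's P.S. suggests, one possible sketch is to require exactly one white-space character between words. The names SENTENCE and is_sentence are hypothetical, and the optional trailing [.!?] is an assumption about which end punctuation to allow.

```python
import re

# One word, then any number of (single whitespace + word) pairs,
# optionally followed by one end-of-sentence punctuation mark.
SENTENCE = re.compile(r"\w+(?:\s\w+)*[.!?]?")

def is_sentence(text):
    """Return True if text is words separated by single whitespace characters."""
    return SENTENCE.fullmatch(text) is not None

for sample in ["This is a regular sentence.", "this is also valid",
               "so is This ONE", "two  spaces"]:
    print(repr(sample), is_sentence(sample))
```

Using a non-capturing group here also sidesteps the repeated-capture behaviour discussed earlier.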
Maybe this will help:
import re

source = """
This is a regular sentence.
this is also valid
so is This ONE
how about this one  followed by this one
"""

# A sentence starts with anything but a space, newline, or period and ends
# at a period, a newline, or a run of two or more spaces.
re_sentence = re.compile(r'[^ \n.].*?(\.|\n|  +)')

def main():
    for i, s in enumerate(re_sentence.finditer(source)):
        print("%d:%s" % (i, s.group(0)))

if __name__ == '__main__':
    main()
I am using alternation in the expression (\.|\n| +) to describe the end-of-sentence condition. Note the use of two spaces in the third alternation. The second space has the '+' meta-character so that two or more spaces in a row will be an end-of-sentence.
