Python regex that splits into words unless they are inside quotes - python

My goal is to achieve this:
Input:
Hi, Are you happy? I am "extremely happy" today
Output:
['Hi,', 'Are', 'you', 'happy?', 'I', 'am', 'extremely happy', 'today']
Is there a straightforward approach to achieve this? I tried using another pattern I found:
pattern = r'"([A-Za-z0-9_\./\\-]*)"'
I assume this should find the text inside the quotes, but I did not manage to make it work.
EDIT
I also tried splitting using the next regex, but this obviously only gives me spaces separation which cuts my text inside quotes to segments:
tokens = [token for token in re.split(r"(\W)", text) if token.strip()]
Is there a way to combine the pattern I supplied with this for loop so that it returns a list where each word gets its own cell, unless it is quoted, in which case whatever is inside the quotes gets its own cell?

You could use shlex.split instead of a regex:
import shlex
print(shlex.split('input: Hi, Are you happy? I am "extremely happy" today'))
result:
['input:', 'Hi,', 'Are', 'you', 'happy?', 'I', 'am', 'extremely happy', 'today']

Another fun way to do it: first split on quotes, then split every non-quoted part (every other piece):
s = 'I am "super happy" today'
parts = s.split('"')  # quoted text ends up at the odd indices
res = sum(([w] if i % 2 else w.split() for i, w in enumerate(parts)), [])
To remove punctuation, you need to replace split() on the last line with a proper regexp, but I think you had that covered already.
This will not remove punctuation inside quotes of course, and you cannot nest quotes. So you cannot be "super "super" happy" :)
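For completeness, a minimal pure-regex sketch of the same idea (assuming quotes are always balanced and never nested):
import re

text = 'Hi, Are you happy? I am "extremely happy" today'
# Match either a quoted phrase (capturing its contents) or a run of
# non-space characters; for quoted matches keep only the captured contents.
tokens = [m.group(1) if m.group(1) is not None else m.group(0)
          for m in re.finditer(r'"([^"]*)"|\S+', text)]
print(tokens)
# ['Hi,', 'Are', 'you', 'happy?', 'I', 'am', 'extremely happy', 'today']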

Related

Why are there empty strings in my re.split() result?

I want to extract the strings inside the brackets and single quotes in a given string, e.g. given ['this'], extract this. Yet the following example and its result keep haunting me:
import re
target_string = "['this']['current']"
result = re.split(r'[\[|\]|\']+', target_string)
print(result)
I got
['', 'this', 'current', '']
# I expect ['this', 'current']
Now I really don't understand where the first and last '' in the result are coming from. I guarantee that the input target_string has nothing leading or trailing, so I did not expect them in the result.
Can anybody help me fix this, please?
re.split splits every time the pattern is found, and since your string both starts and ends with the pattern, it outputs an empty string '' at the beginning and the end. That way, joining the fields with the separators would reconstruct the original string.
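A minimal sketch of that behaviour and the usual fix, filtering out the empty fields:
import re

target_string = "['this']['current']"
parts = re.split(r"[\[\]']+", target_string)
print(parts)                    # ['', 'this', 'current', '']
print([p for p in parts if p])  # ['this', 'current'] after dropping the empties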
If you want to capture text, why not use re.findall instead of re.split? Your use case is very simple if you only have one word per bracket.
target_string = "['this']['current']"
re.findall("\w", target_string)
output
['this', 'current']
Note the above will not work for:
['this is the']['current']
For such a case you can use lookahead (?=...) and lookbehind (?<=...) and capture everything between them in a non-greedy way with .+?
target_string = "['this is the', 'current']"
re.findall("(?<=\[\').+?(?=\'\])", target_string) # this patter is equivalent "\[\'(.+)\'\]"
output:
['this is the', 'current']

How to split a sentence string into words, but also make punctuation a separate element

I'm currently trying to tokenize some language data using Python and was curious if there was an efficient or built-in method for splitting strings of sentences into separate words and also separate punctuation characters. For example:
'Hello, my name is John. What's your name?'
If I used split() on this sentence then I would get
['Hello,', 'my', 'name', 'is', 'John.', "What's", 'your', 'name?']
What I want to get is:
['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']
I've tried methods such as searching the string, finding punctuation, storing their indices, removing them from the string, splitting the string, and then inserting the punctuation back in, but this seems too inefficient, especially when dealing with large corpora.
Does anybody know if there's a more efficient way to do this?
Thank you.
You can do a trick:
text = "Hello, my name is John. What's your name?"
text = text.replace(",", " , ") # Add an space before and after the comma
text = text.replace(".", " . ") # Add an space before and after the point
text = text.replace(" ", " ") # Remove possible double spaces
mListtext.split(" ") # Generates your list
Or just this with input:
mList = input().replace(",", " , ").replace(".", " . ")replace(" ", " ").split(" ")
Here is an approach using re.finditer which at least seems to work with the sample data you provided:
inp = "Hello, my name is John. What's your name?"
parts = []
for match in re.finditer(r'[^.,?!\s]+|[.,?!]', inp):
parts.append(match.group())
print(parts)
Output:
['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']
The idea here is to match one of the following two patterns:
[^.,?!\s]+ which matches one or more characters that are neither punctuation nor whitespace
[.,?!] which matches a single punctuation character
Presumably anything which is not whitespace or punctuation should be a matching word/term in the sentence.
Note that the really nice way to solve this problem would be a regex split on the zero-width boundary between words and punctuation. But before Python 3.7, re.split did not support splitting on zero-width lookarounds, so re.finditer is the more portable choice.
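On Python 3.7 or newer, though, zero-width splits are allowed; a minimal sketch of that split-based variant:
import re

inp = "Hello, my name is John. What's your name?"
# Split on whitespace, or at the zero-width boundary between a word
# character and one of the punctuation marks (in either direction).
parts = re.split(r'\s+|(?<=\w)(?=[.,?!])|(?<=[.,?!])(?=\w)', inp)
print(parts)
# ['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']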
Word tokenisation is not as trivial as it sounds. The previous answers using regular expressions or string replacement won't always deal with things like acronyms or abbreviations (e.g. a.m., p.m., N.Y., D.I.Y., A.D., B.C., e.g., etc., i.e., Mr., Ms., Dr.). These will be split into separate tokens (e.g. B, ., C, .) by such approaches unless you write more complex patterns to deal with such cases (but there will always be annoying exceptions). You will also have to decide what to do with other punctuation like " and ', $, %, with things such as email addresses and URLs, sequences of digits (e.g. 5,000.99, 33.3%), hyphenated words (e.g. pre-processing, avant-garde), names that include punctuation (e.g. O'Neill), contractions (e.g. aren't, can't, let's), the English possessive marker ('s), and so on.
I recommend using an NLP library to do this, as they should be set up to deal with most of these issues for you (although they do still make "mistakes" that you can try to fix). See:
spaCy (especially geared towards efficiency on large corpora)
NLTK
Stanford CoreNLP
TreeTagger
The first three are full toolkits with many functionalities besides tokenisation. The last is a part-of-speech tagger that tokenises the text. These are just a few and there are other options out there, so try some out and see which works best for you. They will all tokenise your text differently, but in most cases (not sure about TreeTagger) you can modify their tokenisation decisions to correct mistakes.
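As a quick illustration, a minimal sketch using NLTK's word_tokenize (this assumes nltk is installed and the punkt tokenizer models have been downloaded; note it makes a different decision about the contraction than the regex answers above):
import nltk
nltk.download('punkt')  # one-time download of the tokenizer models
from nltk.tokenize import word_tokenize

print(word_tokenize("Hello, my name is John. What's your name?"))
# ['Hello', ',', 'my', 'name', 'is', 'John', '.', 'What', "'s", 'your', 'name', '?']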
You can use re.sub to put a space in front of every string.punctuation character that is not directly attached to a following word character (that's the \B), and then use str.split to split the words:
>>> s = "Hello, my name is John. What's your name?"
>>>
>>> import string, re
>>> re.sub(fr'([{string.punctuation}])\B', r' \1', s).split()
['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']
In Python 2:
>>> re.sub(r'([%s])\B' % string.punctuation, r' \1', s).split()
['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']
TweetTokenizer from nltk can also be used for this:
from nltk.tokenize import TweetTokenizer
tokenizer = TweetTokenizer()
tokenizer.tokenize('''Hello, my name is John. What's your name?''')
# output
['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']

How can I split at word boundaries with regexes?

I'm trying to do this:
import re
sentence = "How are you?"
print(re.split(r'\b', sentence))
The result being
[u'How are you?']
I want something like [u'How', u'are', u'you', u'?']. How can this be achieved?
Unfortunately, before Python 3.7, re.split could not split on zero-width (empty) matches such as \b.
To get around this on older versions, you need to use findall instead of split.
Actually \b just means word boundary.
It is roughly equivalent to (?<=\w)(?=\W)|(?<=\W)(?=\w), plus the start and end of the string where they touch a word character.
That means the following code would work (note that it keeps the whitespace runs as tokens):
import re
sentence = "How are you?"
print(re.findall(r'\w+|\W+', sentence))
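Since this also returns the whitespace runs, a small filtering step (assuming you do not want them) gets to the desired list:
import re

sentence = "How are you?"
tokens = [t for t in re.findall(r'\w+|\W+', sentence) if t.strip()]
print(tokens)  # ['How', 'are', 'you', '?']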
import re
split = re.findall(r"[\w']+|[.,!?;]", "How are you?")
print(split)
Output:
['How', 'are', 'you', '?']
Regex explanation:
"[\w']+|[.,!?;]"
1st alternative, [\w']+: matches one or more (+, greedy) characters from the set:
\w - any word character [a-zA-Z0-9_]
' - the literal apostrophe
2nd alternative, [.,!?;]: matches a single character from the set .,!?; literally
Here is my approach to split on word boundaries:
re.split(r"\b\W\b", "How are you?") # Reprocess list to split on special characters.
# Result: ['How', 'are', 'you?']
and using findall on word boundaries
re.findall(r"\b\w+\b", "How are you?")
# Result: ['How', 'are', 'you']

Python Regular Expression help. Combinations

I'm trying to read from a text file and create a list of seed words that begin a sentence and a second list containing all adjacent words excluding the seed words.
The problem I'm encountering is that words containing an apostrophe get split after the apostrophe and the rest of the word omitted. How would you keep them as they appear in the file?
Text contained in file:
This doesn't seem to work. Is findall or sub the correct approach? Or neither?
CODE:
import re

my_string = open('sample.txt', 'r').read()
starter = list(set(re.findall(r"(?<![a-z]\s)[A-Z]\w+", my_string)))
adjacent = re.findall(r"(?<!(?<![a-z]\s))\w+", my_string)
print(adjacent)
RESULT:
['doesn', 'seem', 'to', 'work', 'sub', 'or', 'findall', 'the', 'correct', 'approach', 'neither']
It is easier with two regexes:
import re
txt="""\
This doesn't seem to work. Is findall or sub the correct approach? Or neither? Isn't it grand?
"""
first_words = set(re.findall(r"(?:^|[.!?]\s)(\b[a-zA-Z']+)", txt))
rest = {word for word in re.findall(r"(\b[a-zA-Z']+)", txt) if word not in first_words}
print(first_words)
# {'This', 'Is', 'Or', "Isn't"}
print(rest)
# {"doesn't", 'sub', 'grand', 'the', 'work', 'it', 'findall', 'to', 'neither', 'correct', 'seem', 'approach', 'or'}
The problem I'm encountering is that words containing an apostrophe get split after the apostrophe and the rest of the word omitted.
The \w+ isn't your friend. It is a shortcut for letters, digits, and underscores. It does not include hyphens or apostrophes.
Use a character range instead. That way you can include apostrophes and exclude numbers and underscores:
r"[A-Za-z\']+" # works better than \w+

Separate number/letter tokens in Python

I'm using re.split() to separate a string into tokens. Currently the pattern I'm using as the argument is [^\dA-Za-z], which retrieves alphanumeric tokens from the string.
However, what I need is to also split tokens that have both numbers and letters into tokens with only one or the other, e.g.
re.split(pattern, "my t0kens")
would return ["my", "t", "0", "kens"].
I'm guessing I might need to use lookahead/lookbehind, but I'm not sure if that's actually necessary or if there's a better way to do it.
Try the findall method instead.
>>> import re
>>> re.findall(r'[^\d ]+', "my t0kens")
['my', 't', 'kens']
>>> re.findall(r'\d+', "my t0kens")
['0']
Edit: Better way from Bart's comment below.
>>> re.findall(r'[a-zA-Z]+|\d+', "my t0kens")
['my', 't', '0', 'kens']
>>> [x for x in re.split(r'\s+|(\d+)',"my t0kens") if x]
['my', 't', '0', 'kens']
By using capturing parentheses within the pattern, the matched separators are also returned. Since you want to keep the digits but not the spaces, I've left \s+ outside the parentheses, so those splits yield None, which is then filtered out with a simple comprehension.
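As an aside, the lookahead/lookbehind route the asker guessed at also works, at least on Python 3.7+ where re.split accepts zero-width matches; a minimal sketch:
import re

# Split on whitespace, or at the zero-width boundary between a letter
# and a digit in either direction (zero-width splits need Python 3.7+).
tokens = re.split(r'\s+|(?<=[A-Za-z])(?=\d)|(?<=\d)(?=[A-Za-z])', "my t0kens")
print(tokens)  # ['my', 't', '0', 'kens']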
It can be done in one line of code:
re.findall(r'[a-z]+|\d+', 'my t0kens')
Not perfect, but removing the spaces from the list below is easy :-)
re.split(r'([\d ])', 'my t0kens')
['my', ' ', 't', '0', 'kens']
docs: "Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list."
