Python Regular Expression help. Combinations

I'm trying to read from a text file and create a list of seed words that begin a sentence and a second list containing all adjacent words excluding the seed words.
The problem I'm encountering is that words containing an apostrophe get split after the apostrophe and the rest of the word omitted. How would you keep them as they appear in the file?
Text contained in file:
This doesn't seem to work. Is findall or sub the correct approach? Or neither?
CODE:
import re

my_string = open('sample.txt', 'r').read()
starter = list(set(re.findall(r"(?<![a-z]\s)[A-Z]\w+", my_string)))
adjacent = re.findall(r"(?<!(?<![a-z]\s))\w+", my_string)
print(adjacent)
RESULT:
['doesn', 'seem', 'to', 'work', 'sub', 'or', 'findall', 'the', 'correct', 'approach', 'neither']

It is easier with two regexes:
import re
txt="""\
This doesn't seem to work. Is findall or sub the correct approach? Or neither? Isn't it grand?
"""
first_words=set(re.findall(r'(?:^|(?:[.!?]\s))(\b[a-zA-Z\']+)', txt))
rest={word for word in re.findall(r'(\b[a-zA-Z\']+)', txt) if word not in first_words}
print(first_words)
# {'This', 'Is', 'Or', "Isn't"}
print(rest)
# {"doesn't", 'sub', 'grand', 'the', 'work', 'it', 'findall', 'to', 'neither', 'correct', 'seem', 'approach', 'or'}

The problem I'm encountering is that words containing an apostrophe get split after the apostrophe and the rest of the word omitted.
The \w+ shortcut isn't your friend here. \w matches letters, digits, and underscores; it does not include hyphens or apostrophes.
Use a character range instead. That way you can include apostrophes and exclude numbers and underscores:
r"[A-Za-z\']+" # works better than \w+

Related

How to split a sentence string into words, but also make punctuation a separate element

I'm currently trying to tokenize some language data using Python and was curious if there was an efficient or built-in method for splitting strings of sentences into separate words and also separate punctuation characters. For example:
'Hello, my name is John. What's your name?'
If I used split() on this sentence then I would get
['Hello,', 'my', 'name', 'is', 'John.', "What's", 'your', 'name?']
What I want to get is:
['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']
I've tried methods such as searching the string for punctuation, storing the indices, removing the punctuation from the string, splitting the string, and then inserting the punctuation back in the right places, but this seems too inefficient, especially when dealing with large corpora.
Does anybody know if there's a more efficient way to do this?
Thank you.
You can do a trick:
text = "Hello, my name is John. What's your name?"
text = text.replace(",", " , ") # Add an space before and after the comma
text = text.replace(".", " . ") # Add an space before and after the point
text = text.replace(" ", " ") # Remove possible double spaces
mListtext.split(" ") # Generates your list
Or just this with input:
mList = input().replace(",", " , ").replace(".", " . ").replace("  ", " ").split(" ")
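For what it's worth, applying the corrected snippet to the sample sentence gives the following; note that the trailing ? stays attached, since only commas and full stops are handled here:

text = "Hello, my name is John. What's your name?"
text = text.replace(",", " , ").replace(".", " . ").replace("  ", " ")
print(text.split(" "))
# ['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name?']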
Here is an approach using re.finditer which at least seems to work with the sample data you provided:
inp = "Hello, my name is John. What's your name?"
parts = []
for match in re.finditer(r'[^.,?!\s]+|[.,?!]', inp):
parts.append(match.group())
print(parts)
Output:
['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']
The idea here is to match one of the following two patterns:
[^.,?!\s]+ which matches a run of one or more characters that are neither punctuation nor whitespace
[.,?!] which matches a single punctuation character
Presumably anything which is not whitespace or punctuation should be a matching word/term in the sentence.
Note that the really nice way to solve this problem would be to do a regex split on punctuation or whitespace. But re.split could not split on zero-width lookarounds before Python 3.7, so we are forced to use re.finditer instead.
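If you do want a split-based approach, one workaround (just a sketch) is to split on whitespace or on a captured punctuation character; re.split keeps captured separators in the result, so you only have to filter out the empty and None pieces:

import re

inp = "Hello, my name is John. What's your name?"
# Whitespace separators are discarded; the captured punctuation marks are kept.
parts = [p for p in re.split(r'\s+|([.,?!])', inp) if p]
print(parts)
# ['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']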
Word tokenisation is not as trivial as it sounds. The previous answers using regular expressions or string replacement won't always deal with such things as acronyms or abbreviations (e.g. a.m., p.m., N.Y., D.I.Y., A.D., B.C., e.g., etc., i.e., Mr., Ms., Dr.). These will be split into separate tokens (e.g. B, ., C, .) by such approaches unless you write more complex patterns to deal with such cases (but there will always be annoying exceptions). You will also have to decide what to do with other punctuation like " and ', $, %, such things as email addresses and URLs, sequences of digits (e.g. 5,000.99, 33.3%), hyphenated words (e.g. pre-processing, avant-garde), names that include punctuation (e.g. O'Neill), contractions (e.g. aren't, can't, let's), the English possessive marker ('s), etc., etc., etc.
I recommend using an NLP library to do this, as they should be set up to deal with most of these issues for you (although they do still make "mistakes" that you can try to fix). See:
spaCy (especially geared towards efficiency on large corpora)
NLTK
Stanford CoreNLP
TreeTagger
The first three are full toolkits with many functionalities besides tokenisation. The last is a part-of-speech tagger that tokenises the text. These are just a few and there are other options out there, so try some out and see which works best for you. They will all tokenise your text differently, but in most cases (not sure about TreeTagger) you can modify their tokenisation decisions to correct mistakes.
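As a small illustration (assuming NLTK is installed and the 'punkt' tokenizer data has been downloaded), NLTK's word_tokenize handles the sample sentence like this; note that it splits "What's" into "What" and "'s", which differs from the output the question asked for:

from nltk.tokenize import word_tokenize

print(word_tokenize("Hello, my name is John. What's your name?"))
# ['Hello', ',', 'my', 'name', 'is', 'John', '.', 'What', "'s", 'your', 'name', '?']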
You can use re.sub to put a space in front of every punctuation character from string.punctuation that is not immediately followed by a word character (the \B takes care of that, so the apostrophe inside What's is left alone), and then use str.split to split the result into words:
>>> s = "Hello, my name is John. What's your name?"
>>>
>>> import string, re
>>> re.sub(fr'([{string.punctuation}])\B', r' \1', s).split()
['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']
In Python 2:
>>> re.sub(r'([%s])\B' % string.punctuation, r' \1', s).split()
['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']
TweetTokenizer from nltk can also be used for this.
from nltk.tokenize import TweetTokenizer
tokenizer = TweetTokenizer()
tokenizer.tokenize('''Hello, my name is John. What's your name?''')
# output
['Hello', ',', 'my', 'name', 'is', 'John', '.', "What's", 'your', 'name', '?']

Regex to find words starting with capital letters not at beginning of sentence

I've managed to find the words beginning with capital letters but can't figure out a regex to filter out the ones at the beginning of a sentence.
Each sentence ends with a full stop and a space.
Test_string = "This is a Test sentence. The sentence is Supposed to Ignore the Words at the beginning of the Sentence."
Desired output = ['Test', 'Supposed', 'Ignore', 'Words', 'Sentence']
I'm coding in Python.
Will be glad if someone can help me out with the regex :)
You may use the following expression:
(?<!^)(?<!\. )[A-Z][a-z]+
import re
mystr="This is a Test sentence. The sentence is Supposed to Ignore the Words at the beginning of the Sentence."
print(re.findall(r'(?<!^)(?<!\. )[A-Z][a-z]+',mystr))
Prints:
['Test', 'Supposed', 'Ignore', 'Words', 'Sentence']
A very basic option.
[^.]\s([A-Z]\w+)
import re
s = 'This is a Test sentence. The sentence is Supposed to Ignore the Words at the beginning of the Sentence, And others.'
print(re.findall(r'[^.]\s([A-Z]\w+)', s))
output
['Test', 'Supposed', 'Ignore', 'Words', 'Sentence', 'And']

How to use .split() while setting an argument to skip a specific tag

I am trying to split all the text from a readme file in order to get all the individual words; however, words written inside the markdown syntax for embedding a URL, [](), are showing unwanted results.
So, if I use .split() on this sentence
The site uses the [stackoverlow api](https://api.stackexchange.com/docs) to fetch all existing tags and create a
.split() will treat [stackoverflow api](.. as two words and yield this result
>>> r = "The site uses the [stackoverlow api](https://api.stackexchange.com/docs) to fetch"
>>> print(r.split())
['The', 'site', 'uses', 'the', '[stackoverlow', 'api](https://api.stackexchange.com/docs)', 'to', 'fetch']
>>>
Since this is unintended, is there way to ignore anything inside []() or treat it as a single word?
The solution using re.findall() function:
import re
s = "The site uses the [stackoverlow api](https://api.stackexchange.com/docs) to fetch"
result = re.findall(r'\[[^]]+\]\([^)]+\)|\S+', s)
print(result)
The output:
['The', 'site', 'uses', 'the', '[stackoverlow api](https://api.stackexchange.com/docs)', 'to', 'fetch']
\[[^]]+\]\([^)]+\) - matches sequence [...](...) as a single item
\S+ - matches non-whitespace character sequence (word)
If I understand correctly, a simple solution would be to replace all instances of "[" and "]" with spaces and then split:
st = "The site uses the [stackoverlow api](https://api.stackexchange.com/docs) to fetch all existing tags and create a"
st.replace("["," ").replace("]", " ").split()
This will give you:
['The', 'site', 'uses', 'the', 'stackoverlow', 'api', '(https://api.stackexchange.com/docs)', 'to', 'fetch', 'all', 'existing', 'tags', 'and', 'create', 'a']
Of course, you can also replace "(" and ")" or apply any other manipulation to split the URL as well.

How can I split at word boundaries with regexes?

I'm trying to do this:
import re
sentence = "How are you?"
print(re.split(r'\b', sentence))
The result being
[u'How are you?']
I want something like [u'How', u'are', u'you', u'?']. How can this be achieved?
Unfortunately, re.split could not split on empty (zero-width) matches before Python 3.7.
To get around this, you can use findall instead of split.
Actually, \b just means word boundary.
It is roughly equivalent to (?<=\w)(?=\W)|(?<=\W)(?=\w), except that it also matches at the start and end of the string next to a word character.
That means, the following code would work:
import re
sentence = "How are you?"
print(re.findall(r'\w+|\W+', sentence))
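As an aside, since Python 3.7 re.split does accept patterns that can match an empty string, so on a modern interpreter you can split directly on \b and just drop the whitespace-only pieces:

import re

sentence = "How are you?"
# re.split(r'\b', sentence) gives ['', 'How', ' ', 'are', ' ', 'you', '?'] on 3.7+;
# filtering out the empty and whitespace-only pieces leaves the tokens.
print([p for p in re.split(r'\b', sentence) if p.strip()])
# ['How', 'are', 'you', '?']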
import re
split = re.findall(r"[\w']+|[.,!?;]", "How are you?")
print(split)
Output:
['How', 'are', 'you', '?']
Regex Explanation:
"[\w']+|[.,!?;]"
1st Alternative: [\w']+
[\w']+ match a single character present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
\w match any word character [a-zA-Z0-9_]
' the literal character '
2nd Alternative: [.,!?;]
[.,!?;] match a single character present in the list below
.,!?; a single character in the list .,!?; literally
Here is my approach to split on word boundaries:
re.split(r"\b\W\b", "How are you?") # Reprocess list to split on special characters.
# Result: ['How', 'are', 'you?']
and using findall on word boundaries
re.findall(r"\b\w+\b", "How are you?")
# Result: ['How', 'are', 'you']

How to tokenize contractions in Python?

I have sentences that I want to tokenize, including the punctuation. But I need to handle contractions so that words which are something+not, like "can't", are tokenized into "ca" and "n't", where the split is one character before the apostrophe, while the rest of the contraction words split at the apostrophe, so "you've" and "It's" turn into "you" "'ve" and "It" "'s". This is where I'm stuck. Basically I want something roughly equivalent to how NLTK's TreebankWordTokenizer behaves:
NLTK Word Tokenization Demo
I've been using one of the solutions proposed here, which doesn't handle contractions the way I want it to:
re.findall("'\w+|[\w]+|[^\s\w]", "Hello, I'm a string! Please don't kill me? It's his car.")
and I get this result:
['Hello', ',', 'I', "'m", 'a', 'string', '!', 'Please', 'don', "'t", 'kill', 'me', '?', 'It', "'s", 'his', 'car', '.']
This handles the apostrophes correctly except in the don't case, where it should be "do" and "n't". Anyone know how to fix that?
I can only use the standard library, so the NLTK is not an option in this case.
Regex:
\w+(?=n't)|n't|\w+(?=')|'\w+|\w+
Usage
match_list = re.findall(r"\w+(?=n't)|n't|\w+(?=')|'\w+|\w+","you've it's couldn't don't", re.IGNORECASE | re.DOTALL)
Matches:
['you', "'ve", "it", "'s", 'could', "n't", "do", "n't"]
Try:
r"[n]'[\w]+|[\w]+(?!')(?:[A-Za-mo-z](?='))?|(?<=\s)[\w](?=')|[^\s\w']"
The first alternative matches an n followed by ' followed by more word characters, and because it comes first in the alternation it wins even when the other patterns could also match.
Catch n't and \w+(?=n't) before \w+
r"'\w+|n't|\w+(?=n't)|\w+|[^\s\w]"
