I am using a RegexpTokenizer on a text passage, and I would like to extract all words that have only whitespace before and after them. Here is my code:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"[0-9a-z][^\s']*[a-z]")
For instance, the sentence "we don't have 500 dollars" ends up becoming "we don have dollars". I would like to have "don" eliminated, since it is not followed by whitespace. How do I do so?
You can use positive lookahead and lookbehind assertions to achieve this.
Code:
import re
pattern = r"(?:(?<=^)|(?<=\s))([a-zA-Z0-9]+)(?:(?=\s)|(?=$))"
print(re.findall(pattern, "we don't have 500 dollars"))
print(re.findall(pattern, "Your money's no good here, Mr. Torrance"))
Output:
['we', 'have', '500', 'dollars']
['Your', 'no', 'good', 'Torrance']
You can play around with this here
https://regex101.com/r/IeLC88/3
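If you would rather keep using NLTK's RegexpTokenizer than call re.findall directly, the same idea can be written with negative lookarounds so the pattern contains no capturing group. This is a minimal sketch, assuming the tokenizer compiles the pattern with the standard re module (which accepts lookbehind/lookahead):
from nltk.tokenize import RegexpTokenizer

# (?<!\S) = not preceded by a non-space character, (?!\S) = not followed by one,
# so only runs of letters/digits bounded by whitespace (or the string ends) are kept
tokenizer = RegexpTokenizer(r"(?<!\S)[a-zA-Z0-9]+(?!\S)")
print(tokenizer.tokenize("we don't have 500 dollars"))
# ['we', 'have', '500', 'dollars']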
Related
import re
random_regex = re.compile(r'^\w')
x = random_regex.findall('RoboCop eats baby food. BABY FOOD.')
print(x)
This is the code I have, following along with Automate the Boring Stuff with Python. However, I side-tracked a bit and wanted to see if I could get a list of all the words in the string passed to random_regex.findall(), so I wrote \w for the regex pattern. For some reason my output only prints "R" and not the rest of the letters in the string. Could anyone explain why, or tell me how to fix this?
import re
random_regex = re.compile(r'\w+')
x = random_regex.findall('RoboCop eats baby food. BABY FOOD.')
print(x)
A regex findall should work here:
inp = "RoboCop eats baby food. BABY FOOD."
words = re.findall(r'\w+', inp)
print(words) # ['RoboCop', 'eats', 'baby', 'food', 'BABY', 'FOOD']
^ anchors the match to the start of the string, and \w matches a single word character, so you only get the "R" at the start of "RoboCop". Use \w+ to match whole words. You can test your regex at regex101.com.
random_regex = re.compile(r'\w+')
x = random_regex.findall('RoboCop eats baby food. BABY FOOD.')
to get x:
['RoboCop', 'eats', 'baby', 'food', 'BABY', 'FOOD']
I'd like to know how to create a regular expression to delete the whitespace after a newline; for example, if my text is like this:
So she refused to ex-
change the feather and the rock be-
cause she was afraid.
how can I create something to get:
["so","she","refused","to","exchange", "the","feather","and","the","rock","because","she","was","afraid" ]
I've tried to use replace("-\n", "") to join them together, but I only get something like:
["be","cause"] and ["ex","change"]
Any suggestion? Thanks!!
import re
s = '''So she refused to ex-
change the feather and the rock be-
cause she was afraid.'''.lower()
s = re.sub(r'-\n\s*', '', s) # join hyphens
s = re.sub(r'[^\w\s]', '', s) # remove punctuation
print(s.split())
\s* means 0 or more whitespace characters.
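For the sample text, this prints the list from the question:
['so', 'she', 'refused', 'to', 'exchange', 'the', 'feather', 'and', 'the', 'rock', 'because', 'she', 'was', 'afraid']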
From what I can tell, Alex Hall's answer more adequately answers your question (both explicitly, in that it uses regex, and implicitly, in that it adjusts capitalization and removes punctuation), but this jumped out as a good candidate for a generator.
Here, using a generator to join tokens popped from a stack-like list:
s = '''So she refused to ex-
change the feather and the rock be-
cause she was afraid.'''
def condense(lst):
    while lst:
        tok = lst.pop(0)
        if tok.endswith('-'):
            # glue the hyphenated fragment onto the next token
            yield tok[:-1] + lst.pop(0)
        else:
            yield tok

print(list(condense(s.split())))
# Result:
# ['So', 'she', 'refused', 'to', 'exchange', 'the', 'feather',
# 'and', 'the', 'rock', 'because', 'she', 'was', 'afraid.']
import re
s = s.replace('-\n', '')    # join the hyphenated halves: 'ex-\nchange' -> 'exchange'
s = re.sub(r'\s+', ' ', s)  # collapse any remaining newlines and repeated whitespace into single spaces
# s now looks like 'So she refused to exchange the feather and the rock because she was afraid.'
You could use a pattern with an optional hyphen:
-?\n\s*
This needs to be replaced by nothing; see a demo on regex101.com.
For the second part, I'd suggest nltk so that you end up having:
import re
from nltk import word_tokenize
string = """
So she refused to ex-
change the feather and the rock be-
cause she was afraid.
"""
rx = re.compile(r'-?\n\s*')
words = word_tokenize(rx.sub('', string))
print(words)
# ['So', 'she', 'refused', 'to', 'exchange', 'the', 'feather', 'and', 'the', 'rock', 'because', 'she', 'was', 'afraid', '.']
There's a ton available about removing punctuation, but I can't seem to find anything about keeping it.
If I do:
from nltk import word_tokenize
test_str = "Some Co Inc. Other Co L.P."
word_tokenize(test_str)
Out[1]: ['Some', 'Co', 'Inc.', 'Other', 'Co', 'L.P', '.']
the last "." is pushed into its own token. However, if instead there is another word at the end, the last "." is preserved:
from nltk import word_tokenize
test_str = "Some Co Inc. Other Co L.P. Another Co"
word_tokenize(test_str)
Out[1]: ['Some', 'Co', 'Inc.', 'Other', 'Co', 'L.P.', 'Another', 'Co']
I'd like this to always perform as the second case. For now, I'm hackishly doing:
from nltk import word_tokenize
test_str = "Some Co Inc. Other Co L.P."
word_tokenize(test_str + " |||")
since I feel pretty confident in throwing away "|||" at any given time, but I don't know what other punctuation I might want to preserve that could get dropped. Is there a better way to accomplish this?
It is a quirk of spelling that if a sentence ends with an abbreviated word, we only write one period, not two. NLTK's tokenizer doesn't "remove" it; it splits it off, because sentence structure ("a sentence must end with a period or other suitable punctuation") is more important to NLP tools than consistent representation of abbreviations. The tokenizer is smart enough to recognize most abbreviations, so it doesn't separate the period in "L.P." mid-sentence.
Your solution with ||| results in inconsistent sentence structure, since you now have no sentence-final punctuation. A better solution would be to add the missing period only after abbreviations. Here's one way to do this, ugly but as reliable as the tokenizer's own abbreviation recognizer:
import nltk

toks = nltk.word_tokenize(test_str + " .")
if len(toks) > 1 and len(toks[-2]) > 1 and toks[-2].endswith("."):
    pass  # the sentence ended with an abbreviation, so keep the added period
else:
    toks = toks[:-1]  # the added period is redundant; drop it
PS. The solution you have accepted will completely change the tokenization, leaving all punctuation attached to the adjacent word (along with other undesirable effects like introducing empty tokens). This is most likely not what you want.
Could you use re?
import re
test_str = "Some Co Inc. Other Co L.P."
print(re.split(r'\s', test_str))
This will split the input string on whitespace, retaining your punctuation.
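For the test_str above, this keeps each period attached to its word:
['Some', 'Co', 'Inc.', 'Other', 'Co', 'L.P.']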
I am trying to match items (single words) from one list with items (full sentences) from a second list. This is my code:
tokens=['Time','Fun','Python']
sentences=['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]
for word in tokens:
    for line in sentences:
        if word in line:
            print(word, line)
The problem now is that my code outputs substrings, so when looking for a sentence in which 'Python' occurs, I am also getting 'Pythons'; similarly, I am getting 'Funny' when I only want the sentence containing the word 'Fun'.
I have tried adding spaces surrounding the words in the list, but this is not an ideal solution, because the sentences may contain punctuation, and the code does not return a match.
Desired output:
- Time, Time is High
- Fun, That's Fun!
- Python, Python is nice
Since you want exact matches, it'd be better to use == instead of in.
import string

tokens = ['Time', 'Fun', 'Python']
sentences = ['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

for word in tokens:
    for line in sentences:
        for wrd in line.split():
            if wrd.strip(string.punctuation) == word:  # strip() removes any punctuation from both ends of wrd
                print(word, line)
It is not as easy (it requires more lines of code) to retrieve "That's Fun!" for Fun while at the same time not retrieving "Pythons" for Python. It can be done, of course, but your rules are not very clear to me at this point. Have a look at this, though:
tokens = ['Time', 'Fun', 'Python']
sentences = ['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]
print([(word, phrase) for phrase in sentences for word in tokens if word in phrase.split()])
# prints: [('Time', 'Time is High'), ('Python', 'Python is Nice')]
Below you get exactly the same thing, only this time with good old for loops instead of a list comprehension. I thought it might help you understand the code above more easily.
a = []
for phrase in sentences:
    words_in_phrase = phrase.split()
    for words in tokens:
        if words in words_in_phrase:
            a.append((words, phrase))

print(a)
# prints: [('Time', 'Time is High'), ('Python', 'Python is Nice')]
What is happening here is that the code returns the token it found and the phrase in which it found it. The phrases in the sentences list are split on whitespace, so "Pythons" and "Python" are not the same, as you wanted, but neither are "Fun!" and "Fun". This is also case sensitive.
You might want to use dynamically generated regular expressions, i.e. for "Python" the regexp would look like r'\bPython\b', where \b is a word boundary:
import re

tokens = ['Time', 'Fun', 'Python']
sentences = ['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

for word in tokens:
    regexp = re.compile(r'\b' + re.escape(word) + r'\b')  # raw strings so \b stays a word boundary, not a backspace
    for line in sentences:
        if regexp.search(line):  # search() looks anywhere in the line; match() would only look at the start
            print(word, line)
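With the word boundaries in place, this prints the same three matches:
Time Time is High
Fun That's Fun!
Python Python is Nice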
Tokenizing the sentence is better than splitting it on spaces, since the tokenizer will separate punctuation.
for example:
>>> import nltk
>>> sentence = 'this is a test.'
>>> 'test' in sentence.split(' ')
False
>>> nltk.word_tokenize(sentence)
['this', 'is', 'a', 'test', '.']
Code:
import nltk

tokens = ['Time', 'Fun', 'Python']
sentences = ['Time is High', "Who's Funny", 'Pythons', 'Python is Nice', "That's Fun!"]

for sentence in sentences:
    for token in tokens:
        if token in nltk.word_tokenize(sentence):
            print(token, sentence)
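Since the outer loop is over the sentences, the matches come out in sentence order:
Time Time is High
Python Python is Nice
Fun That's Fun!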
I want to find so-called acronyms in text. Is this the correct way of defining the regex for it?
My idea is that if something starts with a capital letter and ends with a capital letter, it is an acronym. Is this correct?
import re

test_string = ("Department of Something is called DOS, "
               "or DoS, or (DiS) or D.O.S. in United State of America, U.S.A./ USA")
pattern3 = r'([A-Z][a-zA-Z]*[A-Z]|(?:[A-Z]\.)+)'
print(re.findall(pattern3, test_string))
and the output is:
['DOS', 'DoS', 'DiS', 'D.O.S.', 'U.S.A.', 'USA']
I think you can use the word boundary anchor \b for what you want to do:
>>> regex = r"\b[A-Z][a-zA-Z\.]*[A-Z]\b\.?"
>>> re.findall(regex, "AbIA AoP U.S.A.")
['AbIA', 'AoP', 'U.S.A.']
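Applied to the test_string from the question, this word-boundary pattern appears to give the same list as pattern3:
>>> re.findall(r"\b[A-Z][a-zA-Z\.]*[A-Z]\b\.?", test_string)
['DOS', 'DoS', 'DiS', 'D.O.S.', 'U.S.A.', 'USA']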