Python re.findall() Capitalized Words including Apostrophes

I'm having trouble completing a regex tutorial. It went from "\w+ matches words" straight to this problem: "Find all capitalized words in my_string and print the result", where some of the words have apostrophes.
Original String:
In [1]: my_string
Out[1]: "Let's write RegEx! Won't that be fun? I sure think so. Can you
find 4 sentences? Or perhaps, all 19 words?"
Current Attempt:
# Import the regex module
import re
# Find all capitalized words in my_string and print the result
capitalized_words = r"((?:[A-Z][a-z]+ ?)+)"
print(re.findall(capitalized_words, my_string))
Current Result:
['Let', 'RegEx', 'Won', 'Can ', 'Or ']
What I think the desired outcome is:
["Let's", 'RegEx', "Won't", 'Can', 'Or']
How do you go from r"((?:[A-Z][a-z]+ ?)+)" to also selecting the 's and 't at the end of Let's and Won't, when not everything we're trying to catch is expected to have an apostrophe?

Just add an apostrophe to the second bracket group:
capitalized_words = r"((?:[A-Z][a-z']+)+)"
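For reference, running the adjusted pattern against my_string from the question should give:
import re
capitalized_words = r"((?:[A-Z][a-z']+)+)"
print(re.findall(capitalized_words, my_string))
# ["Let's", 'RegEx', "Won't", 'Can', 'Or']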

I suppose you can add a little apostrophe in the group [a-z'].
So it will be like ((?:[A-Z][a-z']+ ?)+)
Hope that works

While you do have your answer, I'd like to provide a more "real-world" solution using nltk:
from nltk import sent_tokenize, regexp_tokenize
my_string = """Let's write RegEx! Won't that be fun? I sure think so. Can you
find 4 sentences? Or perhaps, all 19 words?"""
sent = sent_tokenize(my_string)
print(len(sent))
# 5
pattern = r"(?i)\b[a-z][\w']*"
print(len(regexp_tokenize(my_string, pattern)))
# 19
And IMO, these are 5 sentences, not 4, unless there's some special definition of a sentence in play.
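If you'd rather stick with plain re, roughly the same word count can be reproduced with a pattern that allows apostrophes inside words (a sketch; it deliberately skips the standalone numbers):
import re

words = re.findall(r"[A-Za-z][\w']*", my_string)
print(len(words))
# 19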


Split sentences based on different patterns in Python 3

I need to split strings based on a sequence of regex patterns. I am able to apply each split individually, but the issue is splitting the resulting sentences recursively.
For example I have this sentence:
"I want to be splitted using different patterns. It is a complex task, and not easy to solve; so, I would need help."
I would need to split the sentence based on ",", ";" and ".".
The result should be 5 sentences like:
"I want to be splitted using different patterns."
"It is a complex task,"
"and not easy to solve;"
"so,"
"I would need help."
My code so far:
import re
sample_sentence = "I want to be splitted using different patterns. It is a complex task, and not easy to solve; so, I would need help."
patterns = [re.compile(r'(?<=\.) '),
            re.compile(r'(?<=,) '),
            re.compile(r'(?<=;) ')]
for pattern in patterns:
    splitted_sentences = pattern.split(sample_sentence)
    print(f'Pattern used: {pattern}')
How can I apply the different patterns without losing the results and get the expected result?
Edit: I need to run each pattern one by one, as I need to do some checks on the result of every pattern, running it in a sort of tree algorithm. Sorry for not explaining it entirely; in my head it was clear, but I did not think it would have side effects.
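One way to honour that constraint is to keep a list of pieces and re-split each piece with the next pattern, doing whatever checks are needed between passes; a rough sketch:
import re

sample_sentence = ("I want to be splitted using different patterns. It is a complex task, "
                   "and not easy to solve; so, I would need help.")
patterns = [re.compile(r'(?<=\.) '), re.compile(r'(?<=,) '), re.compile(r'(?<=;) ')]

pieces = [sample_sentence]
for pattern in patterns:
    new_pieces = []
    for piece in pieces:
        # split this piece; any checks on the parts would go here
        new_pieces.extend(pattern.split(piece))
    pieces = new_pieces

print(pieces)
# ['I want to be splitted using different patterns.', 'It is a complex task,',
#  'and not easy to solve;', 'so,', 'I would need help.']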
You can join each pattern with |:
import re
s = "I want to be splitted using different patterns. It is a complex task, and not easy to solve; so, I would need help."
result = re.split(r'(?<=\.)\s|,\s*|;\s*', s)
Output:
['I want to be splitted using different patterns.', 'It is a complex task', 'and not easy to solve', 'so', 'I would need help.']
Python has this built into re. Try:
re.split(r'; |, |\. ', ourString)
(Note that the period needs escaping; an unescaped ' . ' would match any character surrounded by spaces.)
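For what it's worth, on the sample sentence that gives (note that the delimiters themselves are consumed, unlike in the expected output above):
import re
s = "I want to be splitted using different patterns. It is a complex task, and not easy to solve; so, I would need help."
print(re.split(r'; |, |\. ', s))
# ['I want to be splitted using different patterns', 'It is a complex task',
#  'and not easy to solve', 'so', 'I would need help.']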
I can't think of a single regex to do this. So, what you can do is replace all the different types of delimiters with a custom-defined delimiter, say $DELIMITER$, and then split your sentence based on this delimiter.
new_sent = re.sub('[.,;]', '$DELIMITER$', sent)
new_sent.split('$DELIMITER$')
This will result in the following:
['I want to be splitted using different patterns',
' It is a complex task',
' and not easy to solve',
' so',
' I would need help',
'']
NOTE: The above output has an additional empty string. This is because there is a period at the end of the sentence. To avoid this, you can either remove that empty element from the list or you can substitute the custom defined delimiter if it occurs at the end of the sentence.
new_sent = re.sub('[.,;]', '$DELIMITER$', sent)
new_sent = re.sub(r'\$DELIMITER\$$', '', new_sent)
new_sent.split('$DELIMITER$')
In case you have a list of delimiters, you can make you regex pattern using the following code:
delimiter_list = [',', '.', ':', ';']
pattern = '[' + ''.join(delimiter_list) + ']' #will result in [,.:;]
new_sent = re.sub(pattern, '$DELIMITER$', sent)
new_sent = re.sub(r'\$DELIMITER\$$', '', new_sent)
new_sent.split('$DELIMITER$')
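If the delimiter list might ever contain characters that are special even inside a character class (for example ] or ^), re.escape keeps the pattern safe; a small sketch of that variant:
import re
delimiter_list = [',', '.', ':', ';']
# escape each delimiter so any regex metacharacter is treated literally
pattern = '[' + ''.join(re.escape(d) for d in delimiter_list) + ']'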
I hope this helps!!!
Use a lookbehind with a character class:
import re
s = "I want to be splitted using different patterns. It is a complex task, and not easy to solve; so, I would need help."
result = re.split(r'(?<=[.,;])\s', s)
print(result)
Output:
['I want to be splitted using different patterns.',
'It is a complex task,',
'and not easy to solve;',
'so,',
'I would need help.']

Why is regex cutting my word?

My previous effort was something like this with Python NLTK
from nltk.tokenize import RegexpTokenizer
a = "miris ribe na balkanu"
capt1 = RegexpTokenizer('[a-b-c]\w+')
capt1.tokenize(a)
['be', 'balkanu']
This was not what I wanted; "ribe" was cut down to "be", starting from the "b". The following was suggested by Tanzeel but doesn't help:
>>> capt1
RegexpTokenizer(pattern='\x08[abc]\\w+', gaps=False, discard_empty=True, flags=56)
>>> a
'miris ribe na balkanu'
>>> capt1.tokenize(a)
[]
>>> capt1 = RegexpTokenizer('\b[a-b-c]\w+')
>>> capt1.tokenize(a)
[]
How do I change this to keep just the last word?
What you probably need is a word-boundary \b in your regex to match the start of a word.
Updating your regex to \b[abc]\w+ should work.
Update:
Since the OP could not get the regex with the word-boundary to work with NLTK (the word-boundary \b is a valid regex meta-character) I downloaded and tested the regex with NLTK myself.
This updated regex works now (?<=\s)[abc]\w+ and it returns the result ['balkanu'] as you'd expect.
Have not worked with NLTK before so I can't explain why the word-boundary didn't work.
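For what it's worth, the likely culprit is Python string escaping rather than NLTK itself: in a plain string literal '\b' is a backspace character, which is exactly the \x08 visible in the pattern repr in the question. Passing the pattern as a raw string should make the word boundary behave; a quick sketch:
from nltk.tokenize import RegexpTokenizer

a = "miris ribe na balkanu"
# the raw string keeps \b as a regex word boundary instead of a backspace
capt1 = RegexpTokenizer(r'\b[abc]\w+')
print(capt1.tokenize(a))
# expected: ['balkanu']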
The purpose of the RegexpTokenizer is not to pull out selected words from your input, but to break it up into tokens according to your rules. To find all words that begin with a, b or c, use this:
import re
bwords = re.findall(r"\b[abc]\w*", 'miris ribe na balkanu')
I'm not too sure what you are after, so if your goal was actually to extract the last word in a string, use this:
word = re.findall(r"\b\w+$", 'miris ribe na balkanu')[0]
This matches the string of letters between a word boundary and the end of the string.
I think you are mixing up the notions of matching and tokenizing.
This line
capt1 = RegexpTokenizer('[abc]\w+')
(don't do [a-b-c]) says that the tokenizer should look for a, b or c and count everything following, up to the end of the word, as a token.
I think what you want to do is tokenize your data, then discard any tokens that don't begin with a or b or c.
That is a separate step.
>>> capt1 = RegexpTokenizer(r'\w+')
>>> tokens = capt1.tokenize(a)
# --> ['miris', 'ribe', 'na', 'balkanu']
>>> selection = [t for t in tokens if t.startswith(('a','b','c'))]
# --> ['balkanu']
I've used str.startswith() here because it is simple, but you could always use a regular expression for that too. But not in the same one as the tokenizer is using.

Is there a library for splitting sentence into a list of words in it?

I'm looking at NLTK for Python, but it splits (tokenizes) won't as ['wo', "n't"]. Are there libraries that do this more robustly?
I know I can build a regex of some sort to solve this problem, but I'm looking for a library/tool because it would be a more directed approach. For example, after a basic regex with periods and commas, I realized words like 'Mr. ' will break the system.
(@artsiom) If the sentence was "you won't?", split() will give me ["you", "won't?"], so there's an extra '?' that I have to deal with.
I'm looking for a tried and tested method that does away with kinks like the one mentioned above, and the many other exceptions I'm sure exist. Of course, I'll resort to split() with a regex if I don't find any.
The Natural Language Toolkit (NLTK) is probably what you need.
>>> from nltk.tokenize import word_tokenize
>>> word_tokenize("'Hello. This is a test. It works!")
["'Hello", '.', 'This', 'is', 'a', 'test', '.', 'It', 'works', '!']
>>> word_tokenize("I won't fix your computer")
['I', 'wo', "n't", 'fix', 'your', 'computer']
nltk.tokenize.word_tokenize by default uses the TreebankWordTokenizer, a word tokenizer that tokenizes sentences following the Penn Treebank conventions.
Note that this tokenizer assumes that the text has already been segmented into
sentences.
You can test some of the various tokenizers provided by NLTK (i.e. WordPunctTokenizer, WhitespaceTokenizer...) on this page.
Despite what you say, NLTK is by far your best bet. You will not find a more 'tried and tested' method than the tokenizers in there (since some are based on classifiers trained especially for this). You just need to pick the right tokenizer for your needs. Let's take the following sentence:
I am a happy teapot that won't do stuff?
Here is how the various tokenizers in NLTK will split it up.
TreebankWordTokenizer
I am a happy teapot that wo n't do stuff ?
WordPunctTokenizer
I am a happy teapot that won ' t do stuff ?
PunktWordTokenizer
I am a happy teapot that won 't do stuff ?
WhitespaceTokenizer
I am a happy teapot that won't do stuff?
Your best bet might be a combination of approaches. For example, you might use the PunktSentenceTokenizer to tokenize your sentences first, since it tends to be extremely accurate. Then, for each sentence, remove any punctuation characters at the end and run the WhitespaceTokenizer on what's left. That way you avoid the final punctuation/word combination (e.g. stuff?), since you've stripped the final punctuation from each sentence, you still know where the sentences are delimited (e.g. store them in an array), and you won't have words such as won't broken up in unexpected ways.
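A rough sketch of that combination, using sent_tokenize for the sentence step (assumes NLTK's punkt data is installed; the trailing-punctuation stripping here is deliberately simple):
from nltk import sent_tokenize
from nltk.tokenize import WhitespaceTokenizer

text = "I am a happy teapot that won't do stuff? And I won't fix your computer."
words = []
for sentence in sent_tokenize(text):
    # strip sentence-final punctuation so it doesn't stick to the last word
    words.extend(WhitespaceTokenizer().tokenize(sentence.rstrip('.!?')))
print(words)
# expected: ['I', 'am', 'a', 'happy', 'teapot', 'that', "won't", 'do', 'stuff',
#            'And', 'I', "won't", 'fix', 'your', 'computer']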
@Karthick, here is a simple algorithm I used long ago to split a text into a word list:
Take the input text and iterate through it character by character. If the current character is in "alphabet" (which here includes the apostrophe), append it to the current word; otherwise, add the word built so far to the list and start a new word.
# the apostrophe is included so contractions like "won't" stay in one piece
alphabet = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'")
text = "I won't answer this question!"
word = ''
wordlist = []
for c in text:
    if c in alphabet:
        word += c
    elif word:
        wordlist.append(word)
        word = ''
print(wordlist)
['I', "won't", 'answer', 'this', 'question']
It's just a launchpad and you can definitely modify this algorithm to make it smarter :)
NLTK comes with a number of different tokenizers, and you can see demos for each online at text-processing.com word tokenization demo. For your case, it looks like the WhitespaceTokenizer is best, which is essentially the same as doing string.split().
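A quick illustration of that equivalence on simple whitespace-separated input:
from nltk.tokenize import WhitespaceTokenizer

s = "I won't fix your computer"
print(WhitespaceTokenizer().tokenize(s))  # ['I', "won't", 'fix', 'your', 'computer']
print(s.split())                          # same result for this input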
You can try this:
op = []
string_big = "One of Python's coolest features is the string format operator This operator is unique to strings"
position_start = 0
while position_start < len(string_big):
    if ' ' in string_big:
        space_found = string_big.index(' ')
        print(string_big[position_start:space_found])
        op.append(string_big[position_start:space_found])
        # keep only the remainder after the space just found
        string_big = string_big[space_found + 1:]
    else:
        op.append(string_big[position_start:])
        break
print(op)

splitting merged words in python

I am working with a text where all "\n"s have been deleted (which merges two words into one, like "I like bananasAnd this is a new line.And another one.") What I would like to do now is tell Python to look for combinations of a small letter followed by capital letter/punctuation followed by capital letter and insert a whitespace.
I thought this would be easy with regular expressions, but it is not - I couldn't find an "insert" function or anything, and the string commands don't seem to be helpful either. How do I do this?
Any help would be greatly appreciated, I am despairing over here...
Thanks, patrick
Try the following:
re.sub(r"([a-z\.!?])([A-Z])", r"\1 \2", your_string)
For example:
import re
lines = "I like bananasAnd this is a new line.And another one."
print(re.sub(r"([a-z\.!?])([A-Z])", r"\1 \2", lines))
# I like bananas And this is a new line. And another one.
If you want to insert a newline instead of a space, change the replacement to r"\1\n\2".
Using re.sub you should be able to make a pattern that grabs a lowercase and uppercase letter and substitutes them for the same two letters, but with a space in between:
import re
re.sub(r'([a-z][.?]?)([A-Z])', r'\1 \2', mystring)
You're looking for the sub function. See http://docs.python.org/library/re.html for documentation.
Hmm, interesting. You can use regular expressions to replace text with the sub() function:
>>> import re
>>> string = 'fooBar'
>>> re.sub(r'([a-z][.!?]*)([A-Z])', r'\1 \2', string)
'foo Bar'
If you really don't have any caps except at the beginning of a sentence, it will probably be easiest to just loop through the string.
>>> import string
>>> s = "a word endsA new sentence"
>>> lastend = 0
>>> sentences = list()
>>> for i in range(len(s)):
...     if s[i] in string.ascii_uppercase:
...         sentences.append(s[lastend:i])
...         lastend = i
...
>>> sentences.append(s[lastend:])
>>> print(sentences)
['a word ends', 'A new sentence']
Here's another approach, which avoids regular expressions and does not use any imported libraries, just built-ins...
s = "I like bananasAnd this is a new line.And another one."
with_whitespace = ''
last_was_upper = True
for c in s:
if c.isupper():
if not last_was_upper:
with_whitespace += ' '
last_was_upper = True
else:
last_was_upper = False
with_whitespace += c
print with_whitespace
Yields:
I like bananas And this is a new line. And another one.

Create a Reg Exp to search for __word__?

In a program I'm making in python and I want all words formatted like __word__ to stand out. How could I search for words like these using a regex?
Perhaps something like
\b__(\S+)__\b
>>> import re
>>> re.findall(r"\b__(\S+)__\b","Here __is__ a __test__ sentence")
['is', 'test']
>>> re.findall(r"\b__(\S+)__\b","__Here__ is a test __sentence__")
['Here', 'sentence']
>>> re.findall(r"\b__(\S+)__\b","__Here's__ a test __sentence__")
["Here's", 'sentence']
or you can put tags around the word like this
>>> print(re.sub(r"\b(__)(\S+)(__)\b", r"<b>\2</b>", "__Here__ is a test __sentence__"))
<b>Here</b> is a test <b>sentence</b>
If you need more fine-grained control over the legal word characters, it's best to be explicit:
\b__([a-zA-Z0-9_':]+)__\b ### count "'" and ":" as part of words
>>> re.findall(r"\b__([a-zA-Z0-9_']+)__\b","__Here's__ a test __sentence:__")
["Here's"]
>>> re.findall(r"\b__([a-zA-Z0-9_':]+)__\b","__Here's__ a test __sentence:__")
["Here's", 'sentence:']
Take a squizz here: http://docs.python.org/library/re.html
That should show you syntax and examples from which you can build a check for word(s) pre- and post-pended with 2 underscores.
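For instance, a minimal check along those lines might look like this (just a sketch, not taken from the docs):
import re

def is_marked_word(token):
    # True when the whole token has the form __word__
    return re.fullmatch(r'__\w+__', token) is not None

print(is_marked_word('__word__'))  # True
print(is_marked_word('word'))      # False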
The simplest regex for this would be
__.+__
If you want access to the word itself from your code, you should use
__(.+)__
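One caveat with .+ here: if a string contains several __word__ occurrences, the greedy .+ spans from the first __ to the last, so the non-greedy .+? is usually what you want. A quick check:
import re

s = "What __word__ you search __for__"
print(re.findall(r'__(.+)__', s))   # ['word__ you search __for']  (greedy, spans too far)
print(re.findall(r'__(.+?)__', s))  # ['word', 'for']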
This will give you a list with all such words
>>> import re
>>> m = re.findall("(__\w+__)", "What __word__ you search __for__")
>>> print(m)
['__word__', '__for__']
\b(__\w+__)\b
\b word boundary
\w+ one or more word characters - [a-zA-Z0-9_]
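Applied with re.findall, for example:
import re
print(re.findall(r'\b(__\w+__)\b', "What __word__ you search __for__"))
# ['__word__', '__for__']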
simple string functions. no regex
>>> mystring="blah __word__ blah __word2__"
>>> for item in mystring.split():
...     if item.startswith("__") and item.endswith("__"):
...         print(item)
...
__word__
__word2__
