I started learning Python a few days ago and I'm practicing on Codewars. One of the exercises was to count how many times the given words appear in a sentence. I solved it my own way, but among the posted solutions some people do it this way:
import re
def sum_of_a_beach(beach):
    return len(re.findall('Sand|Water|Fish|Sun', beach, re.IGNORECASE))
I understand most of it, but I don't understand why len() is used.
re.findall('Sand|Water|Fish|Sun', beach, re.IGNORECASE) finds all the occurrences of the words (no word boundary, that is...).
len just counts those occurrences.
Using count on beach would work too, but you'd have to lowercase the string and loop over the words. The regex avoids the lowercase conversion (thanks to re.IGNORECASE), and the alternation | takes the place of the loop.
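For comparison, the count-based version mentioned above could look roughly like this. It is only a sketch: the function name sum_of_a_beach_count and the WORDS tuple are made up for illustration, and like the regex it counts substrings, not whole words.

```python
# Hypothetical count-based variant of the solution (names are illustrative).
WORDS = ("sand", "water", "fish", "sun")

def sum_of_a_beach_count(beach):
    # Lowercase once, then add up the non-overlapping occurrences of each word.
    lowered = beach.lower()
    return sum(lowered.count(word) for word in WORDS)

print(sum_of_a_beach_count("WAtErSlIde"))  # "water" appears once
```

The generator expression is the explicit loop that the | alternation hides in the regex version.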
If you test it with:
s = "The sand is touching the water, it's fishy"
You'll get 3 occurrences. Maybe it's not what you want. So while you're using regular expressions, maybe you want to add the "word only" feature:
def sum_of_a_beach(beach):
    return len(re.findall(r'\b(Sand|Water|Fish|Sun)\b', beach, re.IGNORECASE))
This will only match whole words, thanks to the \b word-boundary delimiter.
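To see the difference on the test sentence, here is a small side-by-side check. The variable names loose and strict are just for illustration, and a non-capturing group (?:...) is used so that findall returns the matched words in both cases.

```python
import re

s = "The sand is touching the water, it's fishy"

# Without word boundaries, "fish" inside "fishy" is counted as well.
loose = re.findall(r'Sand|Water|Fish|Sun', s, re.IGNORECASE)
# With \b on both sides, only standalone words are counted.
strict = re.findall(r'\b(?:Sand|Water|Fish|Sun)\b', s, re.IGNORECASE)

print(len(loose), loose)
print(len(strict), strict)
```

The first call finds three matches and the second only two, because "fishy" no longer qualifies.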
I have been analyzing large amounts of text data. This is what I got so far:
(([A-Z][\w-]*)+\s+(\b(Study|Test)\b)(\s[A-Z][\w-]*)*)|(\b(Study|Test)\b)(\s[A-Z][\w-]*)+
Types of phrases I would like to capture:
Europe National Longitudinal Study
Longitudinal Study
Study Initiative
Longitudinal Study Initiative
I want to capture the word 'Study' or 'Test' ONLY if it is surrounded by words starting with a capital letter. The ideal regex would achieve all of this and would also ignore/escape certain words like 'of' or 'the'.
*The above regex is super slow with the str.findall function; I guess there must be a better solution.
** I used https://regex101.com for testing and then ran it in Jupyter, Python 3.
You can use 2 capture groups instead, and match a single word starting with a capital A-Z on the left or on the right.
Using [^\S\r\n] will match a whitespace char without a newline, as \s can match a newline
\b[A-Z]\w*[^\S\r\n]+(Test|Study)\b|\b(Test|Study)[^\S\r\n]+[A-Z]\w*
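A quick way to check what this pattern returns is to run it over the sample phrases with the standard re module. The helper full_matches and the last sample string are my own additions for illustration.

```python
import re

# Pattern from the answer above: one capitalized word on the left OR on the right.
pattern = re.compile(
    r'\b[A-Z]\w*[^\S\r\n]+(Test|Study)\b|\b(Test|Study)[^\S\r\n]+[A-Z]\w*'
)

def full_matches(text):
    # group(0) is the whole match, regardless of which alternative matched.
    return [m.group(0) for m in pattern.finditer(text)]

samples = [
    "Europe National Longitudinal Study",
    "Longitudinal Study",
    "Study Initiative",
    "Longitudinal Study Initiative",
    "the Study of the data",  # no capitalized neighbor, so no match
]
for sample in samples:
    print(repr(sample), '->', full_matches(sample))
```

Note that only a single neighboring word is included, so the first sample yields "Longitudinal Study" rather than the whole four-word phrase.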
Ok, this is possibly way out of the actual scope but you could use the newer regex module with subroutines:
(?(DEFINE)
(?<marker>\b[A-Z][-\w]*\b)
(?<ws>[\ \t]+)
(?<needle>\b(?:Study|Test))
(?<pre>(?:(?&marker)(?&ws))+)
(?<post>(?:(?&ws)(?&marker))+)
(?<before>(?&pre)(?&needle))
(?<after>(?&needle)(?&post))
(?<both>(?&pre)(?&needle)(?&post))
)
(?&both)|(?&before)|(?&after)
See a demo on regex101.com (and mind the modifiers!).
In actual code, this could be:
import regex as re
# Raw string, so the backslashes in the sample text are kept literally.
junk = r"""
I have been analyzing large amounts of text data. This is what I got so far:
(([A-Z][\w-]*)+\s+(\b(Study|Test)\b)(\s[A-Z][\w-]*)*)|(\b(Study|Test)\b)(\s[A-Z][\w-]*)+
Types of phrases I would like to capture:
Europe National Longitudinal Study
Longitudinal Study
Study Initiative
Longitudinal Study Initiative
I want to capture the word 'Study' or 'Test' ONLY if it is surrounded by the words starting with a capital letter. The ideal regex would achieve all of this + it would ignore\escape certain words like 'of' or 'the'.
*the above regex is super slow with the str.findall function, I guess there must be a better solution
** I used https://regex101.com for testing and then run it in Jupyter, Python 3
"""
pattern = re.compile(r'''
(?(DEFINE)
(?<marker>\b[A-Z][-\w]*\b)
(?<ws>[\ \t]+)
(?<needle>\b(?:Study|Test))
(?<pre>(?:(?&marker)(?&ws))+)
(?<post>(?:(?&ws)(?&marker))+)
(?<before>(?&pre)(?&needle))
(?<after>(?&needle)(?&post))
(?<both>(?&pre)(?&needle)(?&post))
)
(?&both)|(?&before)|(?&after)''', re.VERBOSE)
for match in pattern.finditer(junk):
    print(match.group(0))
And would yield
Europe National Longitudinal Study
Longitudinal Study
Study Initiative
Longitudinal Study Initiative
((?:[A-Z]\w+\s+){0,5}\bStudy\b\s*(?:[A-Z]\w+\b\s*){0,5})
Test
I still have to test it further to check whether it works for all of the possible real-world scenarios. I might need to adjust the '5' in the expression to a lower or higher number to optimize my algorithm's performance, though. I have already tested it on some real datasets and the results have been promising so far. It is fast.
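As a rough check of the bounded pattern above, here is a sketch using the standard re module. The helper study_phrases is hypothetical, and the strip() call is my addition, since the trailing \s* can swallow whitespace after the phrase.

```python
import re

# Bounded pattern from above: up to five capitalized words on either side of "Study".
pattern = re.compile(r'((?:[A-Z]\w+\s+){0,5}\bStudy\b\s*(?:[A-Z]\w+\b\s*){0,5})')

def study_phrases(text):
    # strip() removes whitespace that the final \s* may have consumed.
    return [m.group(1).strip() for m in pattern.finditer(text)]

print(study_phrases("Europe National Longitudinal Study"))
print(study_phrases("the Longitudinal Study Initiative was large"))
print(study_phrases("we ran a Study last year"))
```

Because both quantifiers allow zero repetitions, a bare "Study" with no capitalized neighbors is matched too, as the last call shows; that case may need filtering out afterwards.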
In my program I'm using count=strr2.lower().count("ieee") to count the number of occurrences in the following string:
"i love ieee and ieeextream is the best coding competition ever"
Here "ieeextream" is also counted as an occurrence, which is not the result I expect. The expected output is count=1.
So is there a method that checks only for the word "ieee", or can the same code be changed to a different implementation? Thanks for your time.
If you are trying to find the substring only where it appears as a whole word in the original string, then I guess this is what you need:
count=strr2.lower().split().count("ieee")
If you want to count only whole words, you can use a regular expression, wrapping the word to be found in word-boundary characters \b. This will also work if the word is surrounded by punctuation.
>>> import re
>>> s = "i love IEEE, and ieeextream is the best coding competition ever"
>>> len(re.findall(r"\bieee\b", s.lower()))
1
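The three approaches discussed in this thread behave differently once punctuation is involved. The following comparison uses the same sample string as the session above; the variable names are only for illustration.

```python
import re

s = "i love IEEE, and ieeextream is the best coding competition ever"
lowered = s.lower()

# Plain substring count: also counts the "ieee" inside "ieeextream".
substring_count = lowered.count("ieee")
# Token count: misses "ieee," because the comma stays attached to the token.
token_count = lowered.split().count("ieee")
# Word-boundary regex: counts only the standalone word.
word_count = len(re.findall(r"\bieee\b", lowered))

print(substring_count, token_count, word_count)
```

Only the \b version returns 1 for this input.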
I have a very long string in Python and I'm trying to break it up into a list of sentences. However, some of these sentences are missing punctuation and the spaces between them.
Example
I have 9 sheep in my garageVideo games are super cool.
I can't figure out the regex to separate the two! It's driving me nuts.
There are properly punctuated sentences as well, so I thought I'd make several different regex patterns, each splitting off a different style of combination.
Input
I have 9 sheep in my garageVideo games are super cool. Some peanuts can sing, though they taste a whole lot better than they sound!
Output
['I have 9 sheep in my garage',
'Video games are super cool.',
'Some peanuts can sing, though they taste a whole lot better than they sound!']
Thanks!
Position Split: Use the regex module
I will give you both a "Split" and a "Match All" option. Let's start with "Split".
In many engines, you can split at a position defined by a zero-width match. Python's built-in re module only gained this ability in Python 3.7; before that, re.split would not split on an empty match.
In Python, to split on a position, I would use Matthew Barnett's outstanding regex module, whose features far outstrip those of Python's default re engine. That is my default regex engine in Python.
With your input, you can use this regex:
(?V1)(?<=[a-z])(?=[A-Z])|(?<=[.!?]) +(?=[A-Z])
Note that if you had strangely-formatted acronyms such as B. B. C., we would need to tweak this.
Sample Python Code:
import regex  # third-party module: pip install regex

string = "I have 9 sheep in my garageVideo games are super cool. Some peanuts can sing, though they taste a whole lot better than they sound!"
result = regex.split(r"(?V1)(?<=[a-z])(?=[A-Z])|(?<=[.!?]) +(?=[A-Z])", string)
print(result)
Output:
['I have 9 sheep in my garage',
'Video games are super cool.',
'Some peanuts can sing, though they taste a whole lot better than they sound!']
Explanation
(?V1) instructs the engine to use the new behavior, where we can split on zero-width matches.
(?<=[a-z])(?=[A-Z]) matches a position where the lookbehind (?<=[a-z]) can assert that what precedes is a lower-case letter and the lookahead (?=[A-Z]) can assert that what follows is an uppercase letter.
| OR...
(?<=[.!?]) +(?=[A-Z]) matches one or more spaces ( +) where the lookbehind (?<=[.!?]) can assert that what precedes is a dot, bang or question mark, and where the lookahead (?=[A-Z]) can assert that what follows is a capital letter.
Option 2: Use findall (again with the regex module)
Since the "Split" and "Match All" operations are two sides of the same coin, you can do this:
print(regex.findall(r".+?(?:(?<=[.!?])|(?<=[a-z])(?=[A-Z]))",string))
Again, this would not work with re (which would skip the V that starts the second sentence Video).
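On Python 3.7 and later, the standard re module also splits on empty matches, so the same two alternatives can be tried without the third-party module. This is a sketch under that version assumption; the (?V1) flag is specific to the regex module and is dropped here.

```python
import re

text = ("I have 9 sheep in my garageVideo games are super cool. "
        "Some peanuts can sing, though they taste a whole lot better than they sound!")

# Split between a lowercase and an uppercase letter (zero-width),
# or on spaces that follow ., ! or ? and precede a capital letter.
sentences = re.split(r"(?<=[a-z])(?=[A-Z])|(?<=[.!?]) +(?=[A-Z])", text)
for sentence in sentences:
    print(sentence)
```

On older Python versions the zero-width alternative is ignored by re.split, which is where the regex module remains the practical choice.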
I want to look for a phrase, match up to a few words following it, but stop early if I find another specific phrase.
For example, I want to match up to three words following "going to the", but stop the matching process if I encounter "to try". So "going to the luna park" should yield "luna park"; "going to the capital city of Peru" should yield "capital city of"; and "going to the moon to try some cheesecake" should yield "moon".
Can it be done with a single, simple regular expression (preferably in Python)? I've tried all the combinations I could think of, but failed miserably :).
This one matches up to 3 ({1,3}) words following going to the as long as they are not followed by to try ((?!to try)):
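import re

# Read the candidate sentences, one per line, from a file named "input".
with open("input", "r") as infile:
    for line in infile:
        # Capture up to three words, stopping before "to try".
        m = re.match(r"going to the ((?:\w+\s*(?!to try)){1,3})", line)
        if m:
            print(m.group(1).rstrip())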
Output
luna park
capital city of
moon
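The same pattern can be checked without an input file by wrapping it in a small function. The name words_after is hypothetical and only used here for illustration.

```python
import re

# Up to three words after "going to the", stopping before "to try".
PATTERN = re.compile(r"going to the ((?:\w+\s*(?!to try)){1,3})")

def words_after(line):
    # Return the captured words without trailing whitespace, or None if no match.
    m = PATTERN.match(line)
    return m.group(1).rstrip() if m else None

for line in ("going to the luna park",
             "going to the capital city of Peru",
             "going to the moon to try some cheesecake"):
    print(words_after(line))
```

The negative lookahead is tested after each word, so the repetition stops as soon as "to try" would be the next thing consumed.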
I think you are looking for a way to extract proper nouns from sentences. You should look at NLTK for a proper approach. Regular expressions can only help with a limited, context-free grammar, whereas you seem to be asking for the ability to parse human language, which is non-trivial (for computers).
I have a sentence. I want to find all occurrences of a word that start with a specific character in that sentence. I am very new to programming and Python, but from the little I know, this sounds like a Regex question.
What is the pattern match code that will let me find all words that match my pattern?
Many thanks in advance,
Brock
import re
print(re.findall(r'\bv\w+', thesentence))
will print every word in the sentence that starts with 'v', for example.
Using the split method of strings, as another answer suggests, would not identify words, but space-separated chunks that may include punctuation. This re-based solution does identify words (letters and digits, net of punctuation).
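The difference between the two approaches shows up as soon as the sentence contains punctuation. The sample sentence and variable names below are made up for illustration.

```python
import re

sentence = "A very vivid view, viewed from the van."

# Regex: words that start with 'v', made of letters, digits or underscores.
by_regex = re.findall(r'\bv\w+', sentence)
# split(): whitespace-separated chunks, punctuation still attached.
by_split = [chunk for chunk in sentence.split() if chunk.startswith('v')]

print(by_regex)
print(by_split)
```

The regex result contains 'view' and 'van', while the split result contains 'view,' and 'van.'.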
I second the Dive Into Python recommendation. But it's basically:
m = re.findall(r'\bf.*?\b', 'a fast and friendly dog')
print(m)
\b means word boundary, and .*? ensures we store the whole word, but back off to avoid going too far (technically, ? is called a lazy operator).
You could do (doesn't use re though):
matching_words = [x for x in sentence.split() if x.startswith(CHAR_TO_FIND)]
Regular expressions work too (see the other answers) but I think this solution will be a little more readable, and as a beginner learning Python, you'll find list comprehensions (like the solution above) important to gain a comfort level with.
>>> sentence = "a quick brown fox for you"
>>> pattern = "fo"
>>> for word in sentence.split():
...     if word.startswith(pattern):
...         print(word)
...
fox
for
Split the sentence on spaces, use a loop to search for the pattern and print them out.
import re
s = "Your sentence that contains the word ROAD"
s = re.sub(r'\bROAD', 'RD.', s)
print(s)
Read: http://diveintopython3.org/regular-expressions.html
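One caveat with a pattern like \bROAD: it only anchors the start of the word, so longer words beginning with ROAD are rewritten as well. A small sketch, with a made-up sample address, showing the effect of adding a trailing \b:

```python
import re

s = "100 ROADSIDE ROAD"

# Leading boundary only: the start of ROADSIDE is replaced too.
prefix_only = re.sub(r'\bROAD', 'RD.', s)
# Boundaries on both sides: only the standalone word is replaced.
whole_word = re.sub(r'\bROAD\b', 'RD.', s)

print(prefix_only)
print(whole_word)
```

Which variant is appropriate depends on whether words that merely start with the pattern should count.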