I am using Python to write a program that counts how many times a word appears. However, in order to count, the program needs to look at the beginning of a sentence and only count words in sentences that start with %. For example, given
%act: <dur> pours peanut on plate
and I want to count the word peanut, the program should return 1, while
*CHI: peanut.
would return 0 because it starts with *.
So I used findall()
findall('\%.*?' + "peanut", website_html)
But if a sentence contains "peanut" twice, the pattern match only counts it once. For example,
%act: <bef> gives peanut . eats . <dur> gives peanut . <aft> gives raisin
would only return 1.
How can I make it return 2?
Thanks
I'd recommend breaking it down into two parts. I.e., something like:
import re

num_peanuts = 0
for sentence in re.findall(r'(?m)^%.*', website_html):
    num_peanuts += len(re.findall(r'\bpeanut\b', sentence))
I'm not sure what the right regexp would be for selecting "a sentence that begins with %" -- here I assume that it's a line whose first character is %. (Note that by default . does not match newlines; the (?m) puts the regexp in multiline mode; and ^ is a zero-width assertion that matches the beginning of a line.)
I'll also note that the \b's in my peanut-related regexp are there to make sure that the word peanut is not a substring of some larger word (e.g. peanuts). You may or may not want them, depending on the details of your task.
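Putting the two pieces together on the example input from the question (website_html here is just a stand-in string for the asker's variable):

```python
import re

website_html = """%act: <bef> gives peanut . eats . <dur> gives peanut . <aft> gives raisin
*CHI: peanut."""

num_peanuts = 0
for sentence in re.findall(r'(?m)^%.*', website_html):
    # accumulate across %-lines; \b keeps "peanuts" etc. from matching
    num_peanuts += len(re.findall(r'\bpeanut\b', sentence))

print(num_peanuts)  # 2 -- both hits on the %act line; the *CHI line is skipped
```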
I have a regex job to search for a pattern
(This) <some words> (is a/was a) <some words> (0-4 digit number) <some words> (word)
where <some words> can be any number of words/characters, including spaces.
I used the following to achieve this:
(^|\W)This(?=\W).*?(?<=\W)(is a|was a)(?=\W).*?(?<=\W)(\d{1,4})((?=\W).*?(?<=\W))*(word)(?=\W)
I also have another constraint: the total length of the match should be less than 30 characters.
Currently, my search works for all lengths and matches all sets of words.
Is there an option in regex that I can use to enforce this constraint within the regex string itself?
I am currently doing this by checking the length of the matched regex objects. Because the regex matches strings longer than the required length, it misses some detections that are within the length constraint.
For example, the string:
"hi This is a alpha, bravo Charley, delta, echo, fox, golf, this is a 12 word finish."
has 2 matches:
"This is a alpha, bravo Charley, delta, echo, fox, golf, this is 12
word"
"this is a 12 word"
My search captures the first one and misses the second. But the second one matches my length criteria.
If the first match is within the length constraint, then I can ignore the second match.
I am using re.sub() to replace those strings, with a repl function inside sub() to check the length. My dataset is large, so the search takes a lot of time. The most important thing to me is to do the search efficiently, including the length constraint, so as to avoid these incorrect matches.
I am using Python 3.
Thanks in advance
The regex engine doesn't provide a method to do exactly what you're asking for; you'd need to use regex in conjunction with another tool to get the result you want.
Building on some of the comments on your question, the following regex captures the entire match in group 1 (everything from 'This' through 'word'), using a zero-width lookahead so that overlapping candidates are all found:
\b(?=([Tt]his\b.+?\b(?:i|wa)s a\b.+?\b\d{1,4}\b.+?\bword))\b
You can then filter the results to only produce the output you're looking for.
import re
string = 'hi This is a alpha, bravo Charley, delta, echo, fox, golf, this is a 12 word finish.'
pat = re.compile(r'\b(?=([Tt]his\b.*?\b(?:i|wa)s a\b.*?\b\d{1,4}\b.*?\bword))\b')
# returns ['this is a 12 word']
[x[1] for x in pat.finditer(string) if len(x[1]) < 30]
I'm currently trying to clean a 1-gram file. Some of the words are as follows:
word - basic word, classical case
word. - basic word but with a dot
w.s.f.w. - (word stands for word) - correct acronym
w.s.f.w - incorrect acronym (missing the last dot)
My current implementation uses two different RegExes because I haven't succeeded in combining them into one. The first RegEx recognises basic words:
find_word_pattern = re.compile(r'[A-Za-z]', flags=re.UNICODE)
The second one is used in order to recognise acronyms:
find_acronym_pattern = re.compile(r'([A-Za-z]+(?:\.))', flags=re.UNICODE)
Let's say that I have an input_word as a sequence of characters. The output is obtained with:
"".join(re.findall(pattern, input_word))
Then I choose which output to use based on the length: the longer the output the better. My strategy works well with case no. 1 where both patterns return the same length.
Case no. 2 is problematic because my approach produces word. (with dot) but I need it to return word (without dot). Currently the case is decided in favour of find_acronym_pattern that produces longer sequence.
The case no. 3 works as expected.
The case no. 4: find_acronym_pattern misses the last character meaning that it produces w.s.f. whereas find_word_pattern produces wsfw.
I'm looking for a RegEx (preferably one instead of two that are currently used) that:
given word returns word
given word. returns word
given w.s.f.w. returns w.s.f.w.
given w.s.f.w returns w.s.f.w.
given m.in returns m.in.
A regular expression will never return what is not there, so you can forget about requirement 5. What you can do is always drop the final period, and add it back if the result contains embedded periods. That will give you the result you want, and it's pretty straightforward:
import re

found = re.findall(r"\w+(?:\.\w+)*", input_word)[0]
if "." in found:
    found += "."
As you see, I match a word plus any number of ".part" suffixes. Like your version, this matches not only single-letter acronyms but longer abbreviations like Ph.D., Prof.Dr., or whatever.
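A quick check of this approach against the five cases from the question (wrapping the snippet in a helper function, here called clean, for convenience):

```python
import re

def clean(input_word):
    # keep the word plus any number of ".part" suffixes,
    # then re-append a final period only for dotted tokens
    found = re.findall(r"\w+(?:\.\w+)*", input_word)[0]
    if "." in found:
        found += "."
    return found

for token in ["word", "word.", "w.s.f.w.", "w.s.f.w", "m.in"]:
    print(token, "->", clean(token))
# word -> word
# word. -> word
# w.s.f.w. -> w.s.f.w.
# w.s.f.w -> w.s.f.w.
# m.in -> m.in.
```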
If you want one regex, you can use something like this:
((?:[A-Za-z](\.))*[A-Za-z]+)\.?
And substitute with:
\1\2
Python 3 example:
import re

regex = r"((?:[A-Za-z](\.))*[A-Za-z]+)\.?"
test_str = ("word\n" "word.\n" "w.s.f.w.\n" "w.s.f.w\n" "m.in")
subst = r"\1\2"  # unmatched groups substitute as '' (Python 3.5+)
result = re.sub(regex, subst, test_str, flags=re.MULTILINE)
if result:
    print(result)
Output:
word
word
w.s.f.w.
w.s.f.w.
m.in.
I have a script that gives me sentences that contain one of a specified list of key words. A sentence is defined as anything between 2 periods.
Now I want to use it to select all of a sentence like 'Put 1.5 grams of powder in' where if powder was a key word it would get the whole sentence and not '5 grams of powder'
I am trying to figure out how to express that a sentence lies between two sequences of "period then space". My new filter is:
def iterphrases(text):
    return ifilter(None, imap(lambda m: m.group(1), finditer(r'([^\.\s]+)', text)))
However, now I no longer print any sentences, just pieces/phrases of words (including my key word). I am very confused as to what I am doing wrong.
if you don't HAVE to use an iterator, re.split would be a bit simpler for your use case (custom definition of a sentence):
re.split(r'\.\s', text)
Note the last sentence will include . or will be empty (if the text ends with whitespace after the last period); to fix that:
re.split(r'\.\s', re.sub(r'\.\s*$', '', text))
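For example (a sketch with a made-up text; note that the decimal point in 1.5 survives because the split pattern requires whitespace after the period):

```python
import re

text = "Put 1.5 grams of powder in the cup. Stir it well. Serve warm. "
# strip the trailing ". ", then split on "period then whitespace"
sentences = re.split(r'\.\s', re.sub(r'\.\s*$', '', text))
print(sentences)
# ['Put 1.5 grams of powder in the cup', 'Stir it well', 'Serve warm']
```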
also have a look at a bit more general case in the answer for Python - RegEx for splitting text into sentences (sentence-tokenizing)
and for a completely general solution you would need a proper sentence tokenizer, such as nltk.tokenize
nltk.tokenize.sent_tokenize(text)
Here you get it as an iterator. Works with my test cases. It considers a sentence to be anything (non-greedy) up to a period that is followed by either a space or the end of the line.
import re

sentence = re.compile(r"\w.*?\.(?= |$)", re.MULTILINE)

def iterphrases(text):
    return (match.group(0) for match in sentence.finditer(text))
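Trying this on the asker's decimal-number example shows that the period in 1.5 does not end the sentence, since it is not followed by a space:

```python
import re

sentence = re.compile(r"\w.*?\.(?= |$)", re.MULTILINE)

def iterphrases(text):
    return (match.group(0) for match in sentence.finditer(text))

print(list(iterphrases("Put 1.5 grams of powder in. Then stir.")))
# ['Put 1.5 grams of powder in.', 'Then stir.']
```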
If you are sure that . is used for nothing besides sentences delimiters and that every relevant sentence ends with a period, then the following may be useful:
matches = re.finditer(r'([^.]*?(powder|keyword2|keyword3).*?)\.', text)
result = [m.group() for m in matches]
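A quick illustration (the sample text and the strip() to drop the leading space are my additions; powder stands in for the asker's keyword list):

```python
import re

text = "Add the sugar. Mix the powder well. Serve."
# [^.]*? keeps the match from crossing an earlier sentence boundary
matches = re.finditer(r'([^.]*?(powder|keyword2|keyword3).*?)\.', text)
result = [m.group(1).strip() for m in matches]
print(result)  # ['Mix the powder well']
```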
I am creating a regex to match a sentence if it has at least 5 capital letters (each preceded by a space) in the first 10 words. My regex is as follows:
(^(?:\w+\s(?= [A-Z]{5})){10}.*(?:\n|$))
My idea is:
^ Match start of string
?: look for a word followed by a boundary, i.e. a space
?= Match if capital letters are preceded by a space
.* - match everything till line end / end of string.
I guess I need to restructure this one but I don't know how. {10} was meant for the first 10 words but it looks wrongly placed.
Example strings:
Match -- Lets Search For Water somewhere Because I am thirsty and i really am , wishing for a desert rain
Don't match -- fully lowercase or maybe One UPPERCASE but there are actually two uppercase letters that are preceded by a space.
Are you locked into using regex? If not:
# Python 2.7
def checkCaps(text):
    words = text.split()
    caps = 0
    for word in words[:10]:
        if word[0].isupper(): caps += 1
    return caps >= 5
Edited to reflect the good feedback from @Kevin and @KarlKnechtel (and remove cruft).
Tried it out in the interpreter:
>>> checkCaps('Lets Search For Water somewhere Because I am thirsty and i really am , wishing for a desert rain')
True
>>> checkCaps('fully lowercase or maybe One UPPERCASE but there are actually two uppercase letters that are preceded by a space.')
False
Regular expressions are really not built for this task, I agree. You can look for a certain number of consecutive matches, but getting several matches interspersed with other stuff is hard, especially if you need to keep count of the "other stuff".
Your task is conceptually oriented around words, so an approach that treats the string as words (by first cutting it up into words) makes much more sense, as @rchang showed. Making it a little more powerful, adding documentation, and doing the counting a little more elegantly (simple approaches are fine, too, but I really dislike explicit for loops for counting and building lists these days):
def enough_capitalized_words(text, required, limit):
    """Determine if the first `limit` words of the `text`
    contain the `required` number of capitalized words."""
    return sum(
        word[0].isupper()
        for word in text.split()[:limit]
    ) >= required
from functools import reduce  # reduce is no longer a builtin in Python 3

reduce(lambda count, word: count + word[0].isupper(), text.split()[:10], 0) >= 5
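Checking the sum-based helper against example strings in the spirit of the question:

```python
def enough_capitalized_words(text, required, limit):
    """Determine if the first `limit` words of `text`
    contain at least `required` capitalized words."""
    return sum(word[0].isupper() for word in text.split()[:limit]) >= required

# 6 of the first 10 words are capitalized -> True
print(enough_capitalized_words(
    'Lets Search For Water somewhere Because I am thirsty and i really am', 5, 10))
# only 2 capitalized words -> False
print(enough_capitalized_words('fully lowercase text with One Two capitals', 5, 10))
```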
I am trying to identify a particular word and then count it. I need to save the count for each identifier.
For example, a document may contain as below:
risk risk risk free interest rate
asterisk risk risk
market risk risk [risk
I need to count 'risk', not asterisk. There could be other risk-related words, so don't stick to the above example. What I need to find is 'risk'. If risk ends with or starts with anything like < [ ( or . ! * > ] ) etc., I need to count it as well. But if risk is a component of a word like asterisk, then I should not count it.
Here is what I have so far. However, it returns a count for asterisk and [risk as well as risk. I tried to use a regular expression but keep getting errors. Plus, I am a beginner in Python. If anyone has any idea, please help me!!^^ Thanks.
from collections import defaultdict

word_dict = defaultdict(int)
for line in mylist:
    words = line.lower().split()  # converted all words to lower case
    for word in words:
        word_dict[word] += 1

for word in word_dict:
    if 'risk' in word:
        print(word, word_dict[word])
It's actually quite easy to do this with regular expressions:
import re

haystack = "risk asterisk risk brisk risk"
prog = re.compile(r'\brisk\b')
result = re.findall(prog, haystack)
print(len(result))
This outputs "3".
The \b in the regexp matches a word boundary, which includes the beginning/end of the line.
The regular expression (?<![a-zA-Z])risk(?![a-zA-Z]) should match "risk" if it's not preceded or followed by another letter. For example:
>>> len(re.findall('(?<![a-zA-Z])risk(?![a-zA-Z])','risk? 1risk asterisk risky'))
2
Here's the breakdown of this re:
(?<![a-zA-Z]) This negative lookbehind assertion says that the match will only happen if it is not preceded by a match for [a-zA-Z], which in turn just matches a letter.
risk This is the central re that matches "risk"; nothing fancy here...
(?![a-zA-Z]) This is similar to the first part. It is a negative lookahead assertion that makes the match happen only if it is not followed by a letter.
So, say you also don't want to match things like "1risk" that have numbers before them. You can just change the [a-zA-Z] portion of the re to [a-zA-Z0-9]. E.g.:
>>> len(re.findall('(?<![a-zA-Z0-9])risk(?![a-zA-Z0-9])','risk? 1risk asterisk risky'))
1
Update:
In response to your question "How to replace words, count a word, and save the count", I now get what you are asking for. You can use the same type of structure I have shown you, but modified to include all of these words:
risk
risked
riskier
riskiest
riskily
riskiness
risking
risks
risky
There are a couple of ways to modify the original re; the most intuitive is probably to just use the re OR (|) and add \- to the negative lookahead to prevent matching on "risk-free" and such. For example:
>>> words = '|'.join(["risk","risked","riskier","riskiest","riskily","riskiness","risking","risks","risky"])
>>> len(re.findall('(?<![a-zA-Z])(%s)(?![a-zA-Z\-])' % words, 'risk? 1risk risky risk-free'))
3
Just test for equality instead of substring containment:

if 'risk' == word:
    print(word, word_dict[word])