Regex pattern counting with repetitive words [duplicate] - python

This question already has answers here:
Find substring in string but only if whole words?
(8 answers)
Closed 2 years ago.
I am trying to write a Python function that counts how often a specific word occurs in a string.
My regex pattern doesn't work when the word I want to count is repeated several times in a row. Otherwise the pattern seems to work well.
Here is my function:
import re
def word_count(word, text):
    return len(re.findall('(^|\s|\b)'+re.escape(word)+'(\,|\s|\b|\.|$)', text, re.IGNORECASE))
When I test it with a random string, it works:
>>> word_count('Linux', "Linux, Word, Linux")
2
When the word I want to count is adjacent to itself, it undercounts:
>>> word_count('Linux', "Linux Linux")
1

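The problem is in your regex. First, the pattern is not a raw string, so '\b' is read as a backspace character rather than a word boundary, and only the alternatives that consume a character (\s, the comma, the dot) can actually separate two words. In "Linux Linux" the single space is consumed by the first match, so the second Linux has nothing left to match in front of it. Second, your regex uses 2 capture groups, and re.findall returns the capture groups instead of the whole match when groups are present; use non-capturing groups, (?:...), if you need grouping.
Besides, there is no reason to use (^|\s|\b): \b, the word boundary, suffices on its own and covers all these cases, and \b is zero-width, so it consumes nothing.
In the same way, (\,|\s|\b|\.|$) can be reduced to \b.
So you can just use: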
def word_count(word, text):
    return len(re.findall(r'\b' + re.escape(word) + r'\b', text, re.I))
This will give:
>>> word_count('Linux', "Linux, Word, Linux")
2
>>> word_count('Linux', "Linux Linux")
2
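A small sketch of why the zero-width \b matters when the word repeats back to back. The helper names count_consuming and count_whole_word are illustrative and not part of the original post; count_consuming only mimics delimiters that consume a character.

```python
import re

# In a normal string literal '\b' is a backspace; in a raw string it stays
# a backslash plus 'b', which the re module reads as a word boundary.
assert '\b' == '\x08'
assert r'\b' == '\\b'

def count_consuming(word, text):
    # Hypothetical variant: the delimiters consume a character, so one space
    # cannot both close one match and open the next.
    pattern = r'(?:^|\s)' + re.escape(word) + r'(?:\s|$)'
    return len(re.findall(pattern, text, re.IGNORECASE))

def count_whole_word(word, text):
    # Zero-width word boundaries consume nothing, so adjacent words all match.
    pattern = r'\b' + re.escape(word) + r'\b'
    return len(re.findall(pattern, text, re.IGNORECASE))

print(count_consuming('Linux', 'Linux Linux'))   # 1
print(count_whole_word('Linux', 'Linux Linux'))  # 2
```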

I am not sure this is 100% what you want, because I don't understand why you pass the function the word to search for when you are just looking for words that repeat in a string. So maybe consider...
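import re
pattern = r'\b(\w+)( \1\b)+'
def word_count(text):
    split_words = text.split(' ')
    count = 0
    # Note: every iteration re-scans the whole text, so the result is
    # (number of words) x (number of runs of repeated words).
    for split_word in split_words:
        count = count + len(re.findall(pattern, text, re.IGNORECASE))
    return count
print(word_count('Linux Linux Linux Linux'))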
Output:
4
Maybe it helps.
UPDATE: Based on comment below...
def word_count(word, text):
    # Note: str.count counts substrings and is case-sensitive
    count = text.count(word)
    return count
print(word_count('Linux', "Linux, Word, Linux"))
Output:
2

Related

Remove tuple based on character count

I have a dataset in which each row holds a tuple of words. I want to remove the words that contain fewer than 4 characters, but I could not figure out how to iterate over the data in my code.
Here is a sample of my data:
   content                    clean4Char
0  [yes, no, never]           [never]
1  [to, every, contacts]      [every, contacts]
2  [words, tried, describe]   [words, tried, describe]
3  [word, you, go]            [word]
Here is the code that I'm working with (it keeps raising an error):
def remove_single_char(text):
    text = [word for word in text]
    return re.sub(r"\b\w{1,3}\b"," ", word)
df['clean4Char'] = df['content'].apply(lambda x: remove_single_char(x))
df.head(3)
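The problem is with your remove_single_char function: it calls re.sub on word, a name that only exists inside the list comprehension, instead of processing each word in the list.
Also, there is no need for the lambda, since you can pass the function to apply directly. This will do the job: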
def remove(words):
    # Keep only the words with at least 4 characters
    return list(filter(lambda x: len(x) >= 4, words))
df['clean4Char'] = df['content'].apply(remove)
df.head(3)
If the content column holds plain strings rather than lists, we can use str.replace here for a Pandas option:
df["clean4Char"] = df["content"].str.replace(r'\b\w{1,3}\b,?\s*', '', regex=True)
The regex used here says to match:
\b a word boundary (only match entire words)
\w{1,3} a word with no more than 3 characters
\b closing word boundary
,? optional comma
\s* optional whitespace
We then replace with empty string to effectively remove the 3 letter or less matching words along with optional trailing whitespace and comma.
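As a rough sketch, the same pattern can be tried on a plain string with re.sub, next to a list-based filter for rows that hold lists or tuples. The helper names drop_short_words and keep_long_words, and the min_len parameter, are assumptions made for illustration only.

```python
import re

# Pattern from the answer above: a 1-3 character word, an optional comma,
# and any trailing whitespace.
SHORT_WORD = re.compile(r'\b\w{1,3}\b,?\s*')

def drop_short_words(text):
    # Hypothetical helper for a comma-separated string
    return SHORT_WORD.sub('', text).strip()

def keep_long_words(words, min_len=4):
    # Hypothetical helper for a list or tuple of words
    return [w for w in words if len(w) >= min_len]

print(drop_short_words("to, every, contacts"))
print(keep_long_words(["word", "you", "go"]))
```

One caveat of the string version: when the short words come last, a comma can be left behind, so drop_short_words("word, you, go") gives "word,". The list version has no such issue.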

Split a string with RegEx

Good day,
I am currently a bit stuck on a challenge.
I have to count the words in a phrase, splitting it on whitespace or on any special characters present.
import re
def word_count(string):
    counts = dict()
    regex = re.split(r" +|[\s+,._:+!&#$%^🖖]",string)
    for word in regex:
        word = str(word) if word.isdigit() else word
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1
    return counts
However, I am stuck on the regex part.
When splitting, empty strings are also taken into account.
I started with
for word in string.split():
but it does not pass the tests with phrases such as:
"car : carpet as java : javascript!!&#$%^&"
"hey,my_spacebar_is_broken."
'до🖖свидания!'
Hence, if I understand correctly, a regex is needed.
Thank you very much in advance!
Thanks to Olvin Roght for his suggestions. Your function can be elegantly reduced to this.
import re
from collections import Counter
def word_count(text):
    count=Counter(re.split(r"[\W_]+",text))
    del count['']
    return count
See Ryszard Czech's answer for an equivalent one liner.
Use
import re
from collections import Counter
def word_count(text):
    return Counter(re.findall(r"[^\W_]+",text))
[^\W_]+ matches one or more characters different from non-word and underscore chars. This matches one or more letters or digits in effect.
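A quick check of the findall approach against the three phrases from the question. This is only a sketch; the phrases list is an illustrative name.

```python
import re
from collections import Counter

def word_count(text):
    # [^\W_]+ = runs of letters and digits; everything else acts as a separator
    return Counter(re.findall(r"[^\W_]+", text))

phrases = [
    "car : carpet as java : javascript!!&#$%^&",
    "hey,my_spacebar_is_broken.",
    "до🖖свидания!",
]
for phrase in phrases:
    print(dict(word_count(phrase)))
```

Because findall collects the words themselves rather than splitting on separators, no empty strings appear in the result and nothing has to be deleted afterwards.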
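Change the regex pattern as shown below. There is no need for ' +|' in the pattern, since the character class already contains \s. Also note the trailing +, which makes a run of consecutive separators count as a single split point.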
regex = re.split(r"[\s+,._:+!&#$%^🖖]+", string)

How to extract substring between two keywords with exceptional cases? [duplicate]

This question already has answers here:
RegExp exclusion, looking for a word not followed by another
(3 answers)
Closed 3 years ago.
I want to extract the substring between apple and each in a string. However, if each is followed by box, I want the result to be an empty string.
In detail, this means:
1)apple costs 5 dollars each -> costs 5 dollars
2)apple costs 5 dollars each box -> ``
I tried re.findall('(?<=apple)(.*?)(?=each)').
It can tackle 1) but not 2).
How to solve the problem?
Thanks.
You could add a negative lookahead, asserting what is on the right is not box. For a match only you can omit the capturing group.
(?<=apple).*?(?=each(?! box))
If you don't want to match the leading space, you could add that to the lookarounds
import re
s = "apple costs 5 dollars each"
print(re.findall(r'(?<=apple ).*?(?= each(?! box))', s))
Output
['costs 5 dollars']
You can also use a capturing group without the positive lookaheads and use the negative lookahead only. The value is in the first capturing group.
You could make use of word boundaries \b to prevent the word being part of a larger word.
\bapple\b(.*?)\beach\b(?! box)
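A minimal sketch of the capturing-group variant, covering both cases from the question. The function name between and the choice to strip the surrounding spaces are assumptions for illustration.

```python
import re

# Capturing-group pattern with word boundaries and a negative lookahead
PATTERN = re.compile(r'\bapple\b(.*?)\beach\b(?! box)')

def between(s):
    # Return the text between "apple" and "each", or '' when there is no
    # match, which includes the case where "each" is followed by " box".
    m = PATTERN.search(s)
    return m.group(1).strip() if m else ''

print(repr(between("apple costs 5 dollars each")))
print(repr(between("apple costs 5 dollars each box")))
```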
Try this without using regex:
myString = "apple costs 5 dollars each box"
myList = myString.split(" ")
storeString = []
for x in myList:
    if x == "apple":
        continue
    elif x == "each":
        # stops at the first "each"; it does not check whether "box" follows
        break
    else:
        storeString.append(x)
# join the collected words back into a single string
listToStr = ' '.join(map(str, storeString))
print(listToStr)
Output:

How to use Regular Expressions to find duplicate non-consecutive words?

I need to reprint the lines of a poem that match specific rules. The rule I am having trouble with is reprinting a line if it contains a word that appears more than once.
For example, "I have to go out with Jane" would not print, whereas "I have to go out to the movies with Jane" would print, since the word "to" is repeated in the line.
Rules = ['']
Yip = open('poem.txt', 'r')
Lines = Yip.read().split('\n')
n = 1
for r in Rules:
    i = 1
    print("\nMatching rule", n)
    for ln in Lines:
        if re.search(r, ln):
            print(i, end = ", ")
        i = i + 1
    n = n + 1
I've got the pattern '(?i)\\b([a-z]+)\\b(?:\\s+\\1\\b)+', which finds duplicate words, but only consecutive ones.
I've also got '^(?=(.*?to){2}).*$', which I believe is my closest attempt. It prints the line above because it finds both instances of 'to', but the problem is that it only works for 'to'.
I'm trying to figure out whether there is a way to write a pattern that prints the line if it finds a non-consecutive duplicate of any word, so that it works on any given line.
The general regex that matches consecutive and non-consecutive duplicate words is
\b(\w+)\b(?=.*?\b\1\b)
To make the pattern search for duplicate words across lines, make sure . matches line break chars, for example:
(?s)\b(\w+)\b(?=.*?\b\1\b)
^^^^
Or, use re.S or re.DOTALL in Python re.
To make it case insensitive, add i modifier, or use re.I / re.IGNORECASE:
(?si)\b(\w+)\b(?=.*?\b\1\b)
^^^^^
Pattern details
\b - word boundary
(\w+) - Group 1: one or more word chars (letters, digits, _)
\b - word boundary
(?=.*?\b\1\b) - a positive lookahead that matches a location immediately followed with
  .*? - any 0+ chars, as few as possible
  \b\1\b - Group 1 value as whole word (we need to use \b word boundaries again here since \1 does not "remember" the context where (\w+) matched).
Python demo:
import re
strs = ['I have to go out with Jane','I have to go out to the movies with Jane']
rx = re.compile(r'(?si)\b(\w+)\b(?=.*?\b\1\b)')
for s in strs:
    print(s, "=>", rx.findall(s))
Output:
I have to go out with Jane => []
I have to go out to the movies with Jane => ['to']
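To connect this back to the question, here is a sketch that uses the same pattern with search to pick out the numbered lines that contain a repeated word. The function name lines_with_repeats is hypothetical, and the lines are passed in as a list rather than read from poem.txt.

```python
import re

# Same idea as above: a whole word that occurs again later on the same line
DUPLICATE_WORD = re.compile(r'\b(\w+)\b(?=.*?\b\1\b)', re.IGNORECASE)

def lines_with_repeats(lines):
    # Return (line_number, line) pairs, numbering lines from 1
    return [(i, ln) for i, ln in enumerate(lines, start=1)
            if DUPLICATE_WORD.search(ln)]

poem = [
    'I have to go out with Jane',
    'I have to go out to the movies with Jane',
]
for number, line in lines_with_repeats(poem):
    print(number, line)
```

Since each line is checked separately, the (?s) flag is not needed here; search is enough because only the existence of a duplicate matters, not the list of duplicated words.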

regex for repeating words in a string in Python

I have a good regexp for replacing repeating characters in a string. But now I also need to replace repeating words: three or more repetitions of a word should be replaced by two.
Like
bye! bye! bye!
should become
bye! bye!
My code so far:
def replaceThreeOrMoreCharachetrsWithTwoCharacters(string):
    # pattern to look for three or more repetitions of any character, including newlines.
    pattern = re.compile(r"(.)\1{2,}", re.DOTALL)
    return pattern.sub(r"\1\1", string)
Assuming that a "word" in your requirements is one or more non-whitespace characters surrounded by whitespace or by the start or end of the string, you can try this pattern:
re.sub(r'(?<!\S)((\S+)(?:\s+\2))(?:\s+\2)+(?!\S)', r'\1', s)
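You could also try the regex below. Note that the lookbehind (?<= |^) is not fixed-width, so Python's built-in re rejects it; the sample code uses the third-party regex module instead.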
(?<= |^)(\S+)(?: \1){2,}(?= |$)
Sample code,
>>> import regex
>>> s = "hi hi hi hi some words words words which'll repeat repeat repeat repeat repeat"
>>> m = regex.sub(r'(?<= |^)(\S+)(?: \1){2,}(?= |$)', r'\1 \1', s)
>>> m
"hi hi some words words which'll repeat repeat"
I know you were after a regular expression but you could use a simple loop to achieve the same thing:
def max_repeats(s, max=2):
    last = ''
    out = []
    for word in s.split():
        same = 0 if word != last else same + 1
        if same < max: out.append(word)
        last = word
    return ' '.join(out)
As a bonus, I have allowed a different maximum number of repeats to be specified (the default is 2). If there is more than one space between each word, it will be lost. It's up to you whether you consider that to be a bug or a feature :)
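Try the following:
import re
s = "bye! bye! bye!"  # your string
# the lookarounds keep the match on whole words and leave the surrounding spaces intact
s = re.sub(r'(?<!\S)(\S+)(?: \1){2,}(?!\S)', r'\1 \1', s)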
You can see a sample code here: http://codepad.org/YyS9JCLO
def replaceThreeOrMoreWordsWithTwoWords(string):
    # Pattern to look for three or more repetitions of any words.
    pattern = re.compile(r"(?<!\S)((\S+)(?:\s+\2))(?:\s+\2)+(?!\S)", re.DOTALL)
    return pattern.sub(r"\1", string)
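A short usage sketch for the pattern above, run on the example from the question. The shorter name collapse_repeated_words is only illustrative.

```python
import re

# Same pattern as above: group 2 is the word, group 1 is its first two copies.
REPEATED = re.compile(r"(?<!\S)((\S+)(?:\s+\2))(?:\s+\2)+(?!\S)")

def collapse_repeated_words(text):
    # Replace a run of three or more identical words with its first two copies
    return REPEATED.sub(r"\1", text)

print(collapse_repeated_words("bye! bye! bye!"))
print(collapse_repeated_words("hi hi hi hi some words words words"))
```

The (?<!\S) and (?!\S) lookarounds keep the match on whole whitespace-delimited words, so a string such as "a a ab" is left unchanged.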
