Split a string with RegEx

Split a string with RegEx - python

Good day,
I am currently a little stuck on a challenge.
I have to count the words in a phrase, splitting it on whitespace and on any special characters present.
import re
def word_count(string):
    counts = dict()
    regex = re.split(r" +|[\s+,._:+!&#$%^🖖]",string)
    for word in regex:
        word = str(word) if word.isdigit() else word
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1
    return counts
However, I am stuck on the regex part.
When splitting, empty strings are also taken into account.
I started with
for word in string.split():
but it does not pass the tests with phrases such as:
"car : carpet as java : javascript!!&#$%^&"
"hey,my_spacebar_is_broken."
'до🖖свидания!'
Hence, if I understand correctly, a regex is needed.
Thank you very much in advance!

Thanks to Olvin Roght for his suggestions. Your function can be elegantly reduced to this.
import re
from collections import Counter
def word_count(text):
    count = Counter(re.split(r"[\W_]+", text))
    del count['']
    return count
See Ryszard Czech's answer for an equivalent one liner.
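A minimal sketch of why the `del count['']` step is there: when the text starts or ends with a separator, `re.split` returns an empty string at that edge. The helper name `split_words` is only illustrative and is not part of the answer.

```python
import re
from collections import Counter

def split_words(text):
    # Split on runs of non-word characters and underscores.
    return re.split(r"[\W_]+", text)

parts = split_words("hey,my_spacebar_is_broken.")
print(parts)  # ['hey', 'my', 'spacebar', 'is', 'broken', '']

counts = Counter(parts)
del counts[""]  # Counter does not raise if "" happens to be absent
print(counts)
```

The trailing `'.'` is a separator with nothing after it, which is where the final `''` comes from.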

Use
import re
from collections import Counter
def word_count(text):
    return Counter(re.findall(r"[^\W_]+", text))
[^\W_]+ matches one or more characters that are neither non-word characters nor underscores. In effect, this matches one or more letters or digits (including non-ASCII letters, since \w is Unicode-aware for str patterns in Python 3).
See regex proof.
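As a quick check, here is the function above run against the three phrases from the question. This is a sketch for illustration; the `phrases` list is just the question's test strings collected in one place.

```python
import re
from collections import Counter

def word_count(text):
    # Match the words directly instead of splitting on separators,
    # so no empty strings are produced.
    return Counter(re.findall(r"[^\W_]+", text))

phrases = [
    "car : carpet as java : javascript!!&#$%^&",
    "hey,my_spacebar_is_broken.",
    "до🖖свидания!",
]

for phrase in phrases:
    print(dict(word_count(phrase)))
```

Because `findall` collects matches rather than the gaps between them, leading or trailing punctuation never yields an empty entry.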

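Change the regex pattern as shown below. There is no need for ' +|' in the pattern because the character class already contains '\s'. Also note the '+' after the character class, which treats a run of consecutive separators as a single split point.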
regex = re.split(r"[\s+,._:+!&#$%^🖖]+", string)

Related

Regex pattern counting with repetitive words [duplicate]

I am trying to write a Python function that counts a specific word in a string.
My regex pattern doesn't work when the word I want to count is repeated multiple times in a row. The pattern seems to work well otherwise.
Here is my function:
import re
def word_count(word, text):
    return len(re.findall('(^|\s|\b)'+re.escape(word)+'(\,|\s|\b|\.|$)', text, re.IGNORECASE))
When I test it with a random string
>>> word_count('Linux', "Linux, Word, Linux")
2
When the word I want to count is adjacent to itself
>>> word_count('Linux', "Linux Linux")
1

The problem is in your regex. It uses two capture groups, and re.findall returns the capture groups when any are present. Those should be changed to non-capturing groups using (?:...).
Besides, there is no reason to use (^|\s|\b): \b (a word boundary) suffices, since it covers all of these cases and is zero-width.
In the same way, (\,|\s|\b|\.|$) can be changed to \b.
So you can just use:
def word_count(word, text):
    return len(re.findall(r'\b' + re.escape(word) + r'\b', text, re.I))
This will give:
>>> word_count('Linux', "Linux, Word, Linux")
2
>>> word_count('Linux', "Linux Linux")
2
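Two details are worth seeing in isolation. This is a sketch that goes slightly beyond what the answer states: in a non-raw string literal `"\b"` is a backspace character rather than a word boundary, and a pattern that consumes the whitespace after one match leaves nothing in front of the next adjacent word. The simplified pattern `(^|\s)Linux(\s|$)` is an illustrative stand-in, not the question's exact regex.

```python
import re

# In a non-raw string literal, "\b" is a backspace character (ASCII 8).
assert "\b" == "\x08"
# In a raw string it stays as backslash + "b", which re reads as a word boundary.
assert r"\b" == "\\b"

text = "Linux Linux"

# The first match consumes the space, so the second "Linux" is preceded by
# neither the start of the string nor whitespace and is not found.
consuming = re.findall(r"(^|\s)Linux(\s|$)", text)
print(consuming)  # [('', ' ')]

# \b is zero-width, so adjacent words do not compete for the separator.
boundary = re.findall(r"\bLinux\b", text)
print(boundary)  # ['Linux', 'Linux']
```

Note also that with capture groups present, `findall` returns tuples of the groups rather than the matched text, as the first result shows.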

I am not sure this is 100% because I don't understand the part about passing the function the word to search for when you are just looking for words that repeat in a string. So maybe consider...
import re
pattern = r'\b(\w+)( \1\b)+'
def word_count(text):
    split_words = text.split(' ')
    count = 0
    for split_word in split_words:
        count = count + len(re.findall(pattern, text, re.IGNORECASE))
    return count
word_count('Linux Linux Linux Linux')
Output:
4
Maybe it helps.
UPDATE: Based on comment below...
def word_count(word, text):
    count = text.count(word)
    return count
word_count('Linux', "Linux, Word, Linux")
Output:
2
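One caveat with `str.count`, shown as a small sketch: it counts substrings, so a word embedded in a longer word is counted too. The function names `count_substring` and `count_whole_word` are hypothetical and used only for this comparison.

```python
import re

def count_substring(word, text):
    # Counts every non-overlapping occurrence, even inside longer words.
    return text.count(word)

def count_whole_word(word, text):
    # Counts only occurrences delimited by word boundaries.
    return len(re.findall(r"\b" + re.escape(word) + r"\b", text))

text = "Linuxes run Linux"
print(count_substring("Linux", text))   # 2
print(count_whole_word("Linux", text))  # 1
```

Whether the substring behaviour is acceptable depends on the input; for the question's example both approaches return 2.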

Spacing words in a text file with Regex

I'm having a hard time separating the words in a txt document into a list with regex. I have tried ".split" and ".readlines". My document consists of words like "HelloPleaseHelpMeUnderstand": the words are capitalized but not separated by spaces, so I'm at a loss as to how to get them into a list.
This is what I have currently, but it only returns a single word.
import re
file1 = open("file.txt","r")
strData = file1.readline()
listWords = re.findall(r"[A-Za-z]+", strData)
print(listWords)
One of my goals in doing this is to search for another word within the elements of the list, but for now I just wish to know how to list them so that I can continue my work.
If anyone can guide me to a solution, I would be grateful.

A regular regex based on lookarounds to insert spaces between glued letter words is
import re
text = "HelloPleaseHelpMeUnderstand"
print( re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])|(?<=[a-z])(?=[A-Z])", " ", text) )
# => Hello Please Help Me Understand
See the regex demo. Note adjustments will be necessary to account for numbers, or single letter uppercase words like I, A, etc.
Regarding your current code, you need to make sure you read the whole file into a variable (using file1.read(), you are reading just the first line with readline()) and use a [A-Z]+[a-z]* regex to match all the words glued the way you show:
import re
with open("file.txt","r") as file1:
    strData = file1.read()
listWords = re.findall(r"[A-Z]+[a-z]*", strData)
print(listWords)
See the Python demo
Pattern details
[A-Z]+ - one or more uppercase letters
[a-z]* - zero or more lowercase letters.
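The two patterns from this answer behave differently once a run of capitals appears. The sketch below compares them on the question's sample and on a made-up acronym input (`"ParseHTTPResponse"` is an assumption for illustration, as are the function names).

```python
import re

def words_by_findall(text):
    # One or more uppercase letters followed by zero or more lowercase letters.
    return re.findall(r"[A-Z]+[a-z]*", text)

def words_by_lookarounds(text):
    # Insert a space at each word boundary found by the lookarounds, then split.
    spaced = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])|(?<=[a-z])(?=[A-Z])", " ", text)
    return spaced.split()

sample = "HelloPleaseHelpMeUnderstand"
print(words_by_findall(sample))
print(words_by_lookarounds(sample))

acronym = "ParseHTTPResponse"
print(words_by_findall(acronym))      # ['Parse', 'HTTPResponse']
print(words_by_lookarounds(acronym))  # ['Parse', 'HTTP', 'Response']
```

For plain capitalized words both give the same list; the lookaround version additionally separates an acronym from the word that follows it.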

How about this:
import re
strData = """HelloPleaseHelpMeUnderstand
And here not in
HereIn"""
listWords = re.findall(r"(([A-Z][a-z]+){2,})", strData)
result = [i[0] for i in listWords]
print(result)
# ['HelloPleaseHelpMeUnderstand', 'HereIn']

print(re.sub(r"\B([A-Z])", r" \1", "DoIThinkThisIsABetterAnswer?"))
# => Do I Think This Is A Better Answer?

regex for repeating words in a string in Python

I have a good regexp for replacing repeating characters in a string. But now I also need to replace repeating words: three or more repetitions of a word should be replaced by two.
Like
bye! bye! bye!
should become
bye! bye!
My code so far:
import re
def replaceThreeOrMoreCharachetrsWithTwoCharacters(string):
    # pattern to look for three or more repetitions of any character, including newlines.
    pattern = re.compile(r"(.)\1{2,}", re.DOTALL)
    return pattern.sub(r"\1\1", string)

Assuming that what is called a "word" in your requirements is one or more non-whitespace characters surrounded by whitespace or by the start or end of the string, you can try this pattern:
re.sub(r'(?<!\S)((\S+)(?:\s+\2))(?:\s+\2)+(?!\S)', r'\1', s)

You could try the below regex also,
(?<= |^)(\S+)(?: \1){2,}(?= |$)
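Sample code (this uses the third-party regex module, since the standard re module rejects variable-width lookbehinds such as (?<= |^)),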
>>> import regex
>>> s = "hi hi hi hi some words words words which'll repeat repeat repeat repeat repeat"
>>> m = regex.sub(r'(?<= |^)(\S+)(?: \1){2,}(?= |$)', r'\1 \1', s)
>>> m
"hi hi some words words which'll repeat repeat"
DEMO

I know you were after a regular expression but you could use a simple loop to achieve the same thing:
def max_repeats(s, max=2):
    last = ''
    out = []
    for word in s.split():
        same = 0 if word != last else same + 1
        if same < max: out.append(word)
        last = word
    return ' '.join(out)
As a bonus, I have allowed a different maximum number of repeats to be specified (the default is 2). If there is more than one space between each word, it will be lost. It's up to you whether you consider that to be a bug or a feature :)

Try the following:
import re
s = "bye! bye! bye!"  # your string here
s = re.sub( r'(\S+) (?:\1 ?){2,}', r'\1 \1', s )
You can see a sample code here: http://codepad.org/YyS9JCLO

import re
def replaceThreeOrMoreWordsWithTwoWords(string):
    # Pattern to look for three or more repetitions of any words.
    pattern = re.compile(r"(?<!\S)((\S+)(?:\s+\2))(?:\s+\2)+(?!\S)", re.DOTALL)
    return pattern.sub(r"\1", string)
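A short sketch exercising the same pattern on a few inputs, restated so the snippet runs on its own. The name `collapse_repeated_words` and the sample strings other than the question's "bye! bye! bye!" are assumptions for illustration.

```python
import re

# Same pattern as above: group 2 is the word, group 1 is the word repeated twice.
PATTERN = re.compile(r"(?<!\S)((\S+)(?:\s+\2))(?:\s+\2)+(?!\S)")

def collapse_repeated_words(text):
    # Replace three or more repetitions of a word with just two.
    return PATTERN.sub(r"\1", text)

print(collapse_repeated_words("bye! bye! bye!"))     # bye! bye!
print(collapse_repeated_words("hi hi hi hi there"))  # hi hi there
print(collapse_repeated_words("a a b"))              # a a b
print(collapse_repeated_words("no no nonsense"))     # no no nonsense
```

The `(?<!\S)` and `(?!\S)` lookarounds are what keep "nonsense" from being treated as a third "no".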

Would regex be the better way to write code involving words and sentences?

I want to define a function that takes a sentence and returns the words that are at least 4 characters long, in lowercase. The problem is, I'm pretty new to Python and I'm not quite certain how to write code dealing with words instead of integers. My current code is as follows:
def my_function(s):
    sentence = []
    for word in s.split():
        if len(word) >=4:
            return (word.lower())
If I call my_function("Bill's dog was born in 2010") I expect ["bill","born"], whereas my code outputs "bill's".
From what I've seen on StackOverflow and in the Python tutorial, regular expression would help me but I do not fully understand what is going on in the module. Can you guys explain how regex could help, if it can at all?

Your requirements are slightly inconsistent, so I'll go with your example as the reference.
In [27]: import re
In [28]: s = "Bill's dog was born in 2010"
In [29]: [w.lower() for w in re.findall(r'\b[A-Za-z]{4,}\b', s)]
Out[29]: ['bill', 'born']
Let's take a look at the regular expression, r'\b[A-Za-z]{4,}\b'.
The r'...' is not part of the regular expression. It's a Python construct called a raw string. It's like a normal string literal except backslash sequences like \b don't have their usual meaning.
The two \b look for a word boundary (that is, the start or the end of a word).
The [A-Za-z]{4,} looks for a sequence of four or more letters. The [A-Za-z] is called a character class and consists of letters A through Z and a through z. The {4,} is a repetition operator that requires that the character class is matched at least four times.
Finally, the list comprehension, [w.lower() for w in ...], converts the words to lowercase.
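To make the difference from a plain `split()` concrete, here is a small side-by-side sketch. The function names `long_words_regex` and `long_words_split` are hypothetical; the regex is the one explained above.

```python
import re

def long_words_regex(s):
    # Runs of four or more ASCII letters, delimited by word boundaries.
    return [w.lower() for w in re.findall(r"\b[A-Za-z]{4,}\b", s)]

def long_words_split(s):
    # Whitespace-separated tokens of length four or more, punctuation included.
    return [w.lower() for w in s.split() if len(w) >= 4]

s = "Bill's dog was born in 2010"
print(long_words_regex(s))  # ['bill', 'born']
print(long_words_split(s))  # ["bill's", 'born', '2010']
```

The apostrophe is a non-word character, so `\b` matches between "Bill" and "'s", and "2010" is skipped because the character class only contains letters.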

Yes, Regex would be the simplest and easiest approach to achieve what you want.
Try this regex:
matches = re.findall(r"\b[a-zA-Z]{4,}\b", "Put Your String Here")  # matches ['Your', 'String', 'Here']

You return the first word that is 4 chars or longer, instead of all such words. Append to sentence and return that instead:
def my_function(s):
    sentence = []
    for word in s.split():
        if len(word) >=4:
            sentence.append(word.lower())
    return sentence
You can simplify that with a list comprehension:
def my_function(s):
    return [word.lower() for word in s.split() if len(word) >= 4]
Yes, a regular expression could do this too, but for your case that may be overkill.

You forgot to accumulate the long words in 'sentence';) You're instead returning the first one

Using re.split
>>> import re
>>> a='Hi, how are you today?'
>>> [x for x in re.split('[^a-z]', a.lower()) if len(x)>=4]
['today']
>>>

Identifying lines with consecutive upper case letters

I'm looking for logic that searches for capital words in a line in Python. For example, I have a *.txt file:
aaa
adadad
DDD_AAA
Dasdf Daa
I would like to find only the lines which have two or more consecutive capital words (in the above case, DDD_AAA).

Regexes are the way to go:
import re
pattern = "([A-Z]+_[A-Z]+)" # matches CAPITALS_CAPITALS only
match = re.search(pattern, text)  # text is the line being searched
if match: print(match.group(0))
You have to figure out what exactly you are looking for though.

Presuming your definition of a "capital word" is a string of two or more uppercase alphabet (non-numeric) characters, i.e. [A-Z], and assuming that what separates one "capital word" from another is not quite the complementary set ([^A-Z]) but rather the complementary set to the alphanumeric characters, i.e. [^a-zA-Z0-9], you're looking for a regex like
\b[A-Z]{2,}\b.*\b[A-Z]{2,}\b
I say like because the above is not exactly correct: \b counts the underscore _ as a word character. Replace the \bs with [^a-zA-Z0-9]s wrapped in lookaround assertions (to make them zero-width, like \b), and you have the correct regex:
(?<=[^a-zA-Z0-9]|^)[A-Z]{2,}(?=[^a-zA-Z0-9]).*(?<=[^a-zA-Z0-9])[A-Z]{2,}(?=[^a-zA-Z0-9]|$)
Here's a Rubular demo.
Finally, if you consider a one-character word, a "word", then simply do away with the {2,} quantifiers:
(?<=[^a-zA-Z0-9]|^)[A-Z]+(?=[^a-zA-Z0-9]).*(?<=[^a-zA-Z0-9])[A-Z]+(?=[^a-zA-Z0-9]|$)
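The patterns above rely on lookbehinds with alternatives of different widths, which Python's standard `re` module rejects (it requires fixed-width lookbehinds). A possible Python adaptation, offered as a sketch rather than as the answer's own code, swaps them for negative lookarounds, which express the same "no alphanumeric neighbour" condition. The function name `has_two_capital_words` is hypothetical.

```python
import re

# "Not preceded/followed by an alphanumeric" also covers start/end of string.
TWO_CAPITAL_WORDS = re.compile(
    r"(?<![a-zA-Z0-9])[A-Z]{2,}(?![a-zA-Z0-9])"
    r".*"
    r"(?<![a-zA-Z0-9])[A-Z]{2,}(?![a-zA-Z0-9])"
)

def has_two_capital_words(line):
    return TWO_CAPITAL_WORDS.search(line) is not None

lines = ["aaa", "adadad", "DDD_AAA", "Dasdf Daa"]
print([line for line in lines if has_two_capital_words(line)])  # ['DDD_AAA']
```

Dropping the `{2,}` quantifiers in favour of `+` gives the one-character-word variant in the same way.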

print(re.findall(r"[A-Z][a-zA-Z]*\s[A-Z][a-zA-Z]", search_text))
should work to match two words that both start with a capital letter.
For your specific example:
lines = []
for line in file:
    if re.findall(r"[A-Z][a-zA-Z]*\s[A-Z][a-zA-Z]", line): lines.append(line)
print(lines)
Basically, look into regexes!

Here you go:
import re
lines = open("r1.txt").readlines()
for line in lines:
    if re.match(r'[^\w]*[A-Z]+[ _][A-Z]+[^\w]*', line) is not None:
        print(line.strip("\n"))
Output:
DDD_AAA
