Spacing words in a text file with Regex

I'm having a hard time separating the words in a .txt document into a list
with regex. I have tried ".split" and ".readlines". My document
consists of words like "HelloPleaseHelpMeUnderstand": the words are
capitalized but not spaced, so I'm at a loss on how to get them into a list.
This is what I have currently, but it only returns a single word.
import re
file1 = open("file.txt","r")
strData = file1.readline()
listWords = re.findall(r"[A-Za-z]+", strData)
print(listWords)
One of my goals in doing this is to search for another word within the elements of the list, but for now I just wish to know how to list them so I can continue my work.
If anyone can guide me to a solution, I would be grateful.

A common lookaround-based regex for inserting spaces between glued words is
import re
text = "HelloPleaseHelpMeUnderstand"
print( re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])|(?<=[a-z])(?=[A-Z])", " ", text) )
# => Hello Please Help Me Understand
Note that adjustments will be necessary to account for numbers or for single-letter uppercase words like I, A, etc.
Regarding your current code, you need to make sure you read the whole file into a variable (using file1.read(), you are reading just the first line with readline()) and use a [A-Z]+[a-z]* regex to match all the words glued the way you show:
import re
with open("file.txt","r") as file1:
    strData = file1.read()
listWords = re.findall(r"[A-Z]+[a-z]*", strData)
print(listWords)
Pattern details
[A-Z]+ - one or more uppercase letters
[a-z]* - zero or more lowercase letters.
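The note above about numbers and single-letter uppercase words can be handled with a slightly wider alternation. The following is only a minimal sketch and not part of the original answer: the helper name split_glued and its pattern are assumptions made for illustration. It also shows the follow-up step from the question, checking whether a word is among the list elements.

```python
import re

# Hypothetical helper (not from the original answer): split glued words,
# keeping acronyms and digit runs together.
GLUED_WORD = re.compile(
    r"[A-Z]+(?![a-z])"   # capitals not followed by lowercase: "HTTP", "I"
    r"|[A-Z][a-z]*"      # capitalized word: "Hello"
    r"|[a-z]+"           # leading lowercase word: "camel" in "camelCase"
    r"|\d+"              # run of digits
)

def split_glued(text):
    """Return the words found in a string of glued capitalized words."""
    return GLUED_WORD.findall(text)

words = split_glued("HelloPleaseHelpMeUnderstand")
print(words)
# => ['Hello', 'Please', 'Help', 'Me', 'Understand']

# Searching for a word among the list elements.
print("Help" in words)
# => True

print(split_glued("HTTPServer2Go"))
# => ['HTTP', 'Server', '2', 'Go']
```

The first alternative is tried before the plain capitalized-word alternative so that a run of capitals is kept together, while its last capital is released whenever a lowercase letter follows it.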

How about this:
import re
strData = """HelloPleaseHelpMeUnderstand
And here not in
HereIn"""
listWords = re.findall(r"(([A-Z][a-z]+){2,})", strData)
result = [i[0] for i in listWords]
print(result)
# ['HelloPleaseHelpMeUnderstand', 'HereIn']

print(re.sub(r"\B([A-Z])", r" \1", "DoIThinkThisIsABetterAnswer?"))
# => Do I Think This Is A Better Answer?
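One thing to keep in mind with the \B approach is how it treats acronyms. The comparison below is a small sketch on a made-up input string (ParseHTTPResponse is an assumption, not taken from the question), set against the lookaround regex from the first answer.

```python
import re

text = "ParseHTTPResponse"  # made-up example input

# \B-based substitution: a space goes before every non-initial capital.
spaced_every_capital = re.sub(r"\B([A-Z])", r" \1", text)
print(spaced_every_capital)
# => Parse H T T P Response

# Lookaround-based substitution: runs of capitals stay together.
spaced_by_lookaround = re.sub(
    r"(?<=[A-Z])(?=[A-Z][a-z])|(?<=[a-z])(?=[A-Z])", " ", text
)
print(spaced_by_lookaround)
# => Parse HTTP Response
```

Which behaviour is preferable depends on whether the file contains acronyms or only single-letter words such as I and A.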

Related

Split a string with RegEx

Good day,
I am currently a little bit stuck on a challenge.
I have to count the words in a phrase, splitting on whitespace or on any special characters present.
import re
def word_count(string):
    counts = dict()
    regex = re.split(r" +|[\s+,._:+!&#$%^🖖]",string)
    for word in regex:
        word = str(word) if word.isdigit() else word
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1
    return counts
However, I am stuck at the regex part.
While splitting, empty strings are also taken into account.
I started with using
for word in string.split():
But it does not pass the tests with phrases such as:
"car : carpet as java : javascript!!&#$%^&"
"hey,my_spacebar_is_broken."
'до🖖свидания!'
Hence, if I understand, RegEx is needed.
Thank you very much in advance!

Thanks to Olvin Roght for his suggestions. Your function can be elegantly reduced to this.
import re
from collections import Counter
def word_count(text):
    count = Counter(re.split(r"[\W_]+", text))
    del count['']
    return count
See Ryszard Czech's answer for an equivalent one liner.

Use
import re
from collections import Counter
def word_count(text):
    return Counter(re.findall(r"[^\W_]+", text))
[^\W_]+ matches one or more characters that are neither non-word characters nor underscores. In effect, this matches one or more letters or digits.
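As a quick check, the findall-based function can be run against the three phrases from the question. This is only a sketch: the expected results in the comments follow from the pattern itself, not from the challenge's own test suite.

```python
import re
from collections import Counter

def word_count(text):
    # Words are runs of letters or digits; underscores act as separators.
    return Counter(re.findall(r"[^\W_]+", text))

print(dict(word_count("car : carpet as java : javascript!!&#$%^&")))
# => {'car': 1, 'carpet': 1, 'as': 1, 'java': 1, 'javascript': 1}

print(dict(word_count("hey,my_spacebar_is_broken.")))
# => {'hey': 1, 'my': 1, 'spacebar': 1, 'is': 1, 'broken': 1}

print(dict(word_count("до🖖свидания!")))
# => {'до': 1, 'свидания': 1}
```

Because findall collects the words instead of splitting on separators, no empty strings appear and nothing has to be deleted afterwards.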

Change the regex pattern as below. There is no need for " +|" in the pattern because "\s" already covers spaces. Also note the "+" after the character class, which collapses consecutive separators into one split point.
regex = re.split(r"[\s+,._:+!&#$%^🖖]+", string)
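Even with the trailing +, re.split can still return an empty string when the phrase starts or ends with a separator. A minimal sketch of filtering those out follows; the variable names are chosen here for illustration.

```python
import re

SEPARATORS = r"[\s+,._:+!&#$%^🖖]+"  # character class from the line above

parts = re.split(SEPARATORS, "hey,my_spacebar_is_broken.")
print(parts)
# => ['hey', 'my', 'spacebar', 'is', 'broken', '']

# Drop the empty strings left at the edges of the phrase.
words = [part for part in parts if part]
print(words)
# => ['hey', 'my', 'spacebar', 'is', 'broken']
```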

Match the words using re compile in Python

I'm new to Python. I have a text file that contains punctuation and other words. How can I use re.compile to extract specific matching text?
The actual file has more than 100 sentences like the ones below:
file.txt
copy() {
foundation.d.k("cloud control")
this.is.a(context),reality, new point {"copy.control.ZOOM_CONTROL", "copy.control.ACTIVITY_CONTROL"},
context control
I just want the output something like this
copy.control.ZOOM_CONTROL
copy.control.ACTIVITY_CONTROL
i coded something like this
file=(./data/.txt)
data=re.compile('copy.control. (.*?)', re.DOTALL | re.IGNORECASE).findall(file)
res= str("|".join(data))
The above regex doesn't produce my required output. Please help me with this issue. Thanks in advance.

You need to open and read the file first, then apply the re.findall method:
import re
data = []
with open('./data/.txt', 'r') as file:
    data = re.findall(r'\bcopy\.control\.(\w+)', file.read())
The \bcopy\.control\.(\w+) regex matches
\bcopy\.control\. - a copy.control. string as a whole word (\b is a word boundary)
(\w+) - Capturing group 1 (the output of re.findall): 1 or more letters, digits or _
Then, you may print the matches:
for m in data:
    print(m)
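The desired output in the question includes the copy.control. prefix, while a capturing group makes re.findall return only the captured suffix. The sketch below shows both variants; to stay self-contained it inlines the sample lines from the question instead of reading the file, which is an assumption of this example.

```python
import re

# Sample lines copied from the question; a real script would read the file.
text = '''copy() {
foundation.d.k("cloud control")
this.is.a(context),reality, new point {"copy.control.ZOOM_CONTROL", "copy.control.ACTIVITY_CONTROL"},
context control'''

# With a capturing group, findall returns only the captured suffix.
suffixes = re.findall(r"\bcopy\.control\.(\w+)", text)
print(suffixes)
# => ['ZOOM_CONTROL', 'ACTIVITY_CONTROL']

# Without a group, findall returns the whole match.
full_names = re.findall(r"\bcopy\.control\.\w+", text)
print("\n".join(full_names))
# => copy.control.ZOOM_CONTROL
# => copy.control.ACTIVITY_CONTROL
```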

Regular Expression for strings with underscore

I want to catch the following line in a parsed file using regex,
types == "EQUAL_NUM_SEQUENTIAL_LBAS":
for this, I am using the following code
variable = 'types'
for i in data:
    if re.search(re.escape(variable) + r"\s\=\=\s^[A-Z_]+$", i):
        print("yyy")
        break
where data here is the list of lines in the parsed file. What is wrong in the expression I have written?

If you want to match a string consisting only of uppercase letters possibly separated by underscores, then use:
^[A-Z]+(?:_[A-Z]+)*$
Sample script:
import re
inp = "EQUAL_NUM_SEQUENTIAL_LBAS"
if re.search(r'^[A-Z]+(?:_[A-Z]+)*$', inp):
    print("MATCH")
The regex pattern, read left to right, matches a word made only of capital letters, followed by zero or more repetitions of an underscore and another such word.
To capture such words appearing anywhere in a larger text/document, use:
inp = "Here is one ABC_DEF word and another EQUAL_NUM_SEQUENTIAL_LBAS here"
words = re.findall(r'\b[A-Z]+(?:_[A-Z]+)*\b', inp)
print(words)
This prints:
['ABC_DEF', 'EQUAL_NUM_SEQUENTIAL_LBAS']

Remove the ^ character from the pattern:
r"\s\=\=\s[A-Z_]+$"
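In the line shown in the question, the value is wrapped in double quotes and followed by a colon, so a pattern that ends in [A-Z_]+$ may still fail to match it. The following is a minimal sketch under the assumption that the value is always double-quoted, as in that line; the variable names are illustrative.

```python
import re

variable = "types"
line = 'types == "EQUAL_NUM_SEQUENTIAL_LBAS":'

# Assumption: the value is always wrapped in double quotes, as in the question.
pattern = re.escape(variable) + r'\s*==\s*"([A-Z]+(?:_[A-Z]+)*)"'

match = re.search(pattern, line)
if match:
    print(match.group(1))
    # => EQUAL_NUM_SEQUENTIAL_LBAS

# The question's pattern cannot match here: ^ after other text demands the
# start of the string at a position that is not the start.
original = re.search(re.escape(variable) + r"\s\=\=\s^[A-Z_]+$", line)
print(original)
# => None
```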

Nested squared brackets

I have a pattern like this:
word word one/two/three word
I want to match the groups that are separated by /. My thinking is as follows:
[\w]+ # match any words
[\w]+/[\w]+ # followed by / and another word
[\w]+[/[\w]+]+ # repeat the latter
But this does not work: as soon as I add ], it seems to close not the innermost [ but the outermost [.
How can I work with nested square brackets?

Here is one option using re.findall:
import re
input = "word word one/two/three's word apple/banana"
r1 = re.findall(r"[A-Za-z0-9'.,:;]+(?:/[A-Za-z0-9'.,:;]+)+", input)
print(r1)
["one/two/three's", 'apple/banana']

I suggest the following answer, which uses a simple regex very close to yours and to @Tim Biegeleisen's answer, but not exactly the same:
import re
words = "word word one/two/three's word other/test"
result = re.findall(r"[\w']+/[\w'/]+", words)
print(result) # ["one/two/three's", 'other/test']
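On the nesting question itself: in regex syntax, square brackets define a character class (a set of single characters), so they neither group nor nest. Repeating a sequence such as "a slash followed by a word" is done with parentheses instead. A minimal sketch using a non-capturing group on the sample line from the question:

```python
import re

text = "word word one/two/three word"

# [\w]+ is the same as \w+: the brackets only define a set of characters.
# To repeat "/ followed by a word", group it with (?:...) instead of [...].
pattern = r"\w+(?:/\w+)+"

matches = re.findall(pattern, text)
print(matches)
# => ['one/two/three']

# Splitting each match on "/" gives the individual parts.
parts = [m.split("/") for m in matches]
print(parts)
# => [['one', 'two', 'three']]
```

The group is non-capturing so that re.findall returns the whole match rather than only the last repetition of the group.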

Identifying lines with consecutive upper case letters

I'm looking for logic that searches for capitalized words in a line in Python. For example, I have a *.txt file containing:
aaa
adadad
DDD_AAA
Dasdf Daa
I would like to find only the lines that have 2 or more capital words one after another (in the above case DDD_AAA).

Regexes are the way to go:
import re
pattern = r"([A-Z]+_[A-Z]+)" # matches CAPITALS_CAPITALS only
# text is the line being tested (it is not defined in this snippet)
match = re.search(pattern, text)
if match:
    print(match.group(0))
You have to figure out what exactly you are looking for though.

Presuming your definition of a "capital word" is a string of two or more uppercase alphabet (non-numeric) characters, i.e. [A-Z], and assuming that what separates one "capital word" from another is not quite the complementary set ([^A-Z]) but rather the complementary set to the alphanumeric characters, i.e. [^a-zA-Z0-9], you're looking for a regex like
\b[A-Z]{2,}\b.*\b[A-Z]{2,}\b
I say like because the above is not exactly correct: \b counts the underscore _ as a word character. Replace the \bs with [^a-zA-Z0-9]s wrapped in lookaround assertions (to make them zero-width, like \b), and you have the correct regex:
(?<=[^a-zA-Z0-9]|^)[A-Z]{2,}(?=[^a-zA-Z0-9]).*(?<=[^a-zA-Z0-9])[A-Z]{2,}(?=[^a-zA-Z0-9]|$)
Here's a Rubular demo.
Finally, if you consider a one-character word, a "word", then simply do away with the {2,} quantifiers:
(?<=[^a-zA-Z0-9]|^)[A-Z]+(?=[^a-zA-Z0-9]).*(?<=[^a-zA-Z0-9])[A-Z]+(?=[^a-zA-Z0-9]|$)
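Python's built-in re module requires look-behind patterns to have a fixed width, so the (?<=[^a-zA-Z0-9]|^) alternation above is likely to be rejected there. One possible Python equivalent, shown as a sketch rather than as the answer's own method, swaps the positive lookarounds for negative ones, which also cover the start and end of the line. The name CAPITAL_PAIR is an assumption.

```python
import re

# Negative lookarounds stand in for "(?<=[^a-zA-Z0-9]|^)" and
# "(?=[^a-zA-Z0-9]|$)": no letter or digit directly before or after the word.
CAPITAL_PAIR = re.compile(
    r"(?<![a-zA-Z0-9])[A-Z]{2,}(?![a-zA-Z0-9])"
    r".*"
    r"(?<![a-zA-Z0-9])[A-Z]{2,}(?![a-zA-Z0-9])"
)

lines = ["aaa", "adadad", "DDD_AAA", "Dasdf Daa"]  # sample from the question
matching = [line for line in lines if CAPITAL_PAIR.search(line)]
print(matching)
# => ['DDD_AAA']
```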

print(re.findall(r"[A-Z][a-zA-Z]*\s[A-Z][a-zA-Z]", search_text))
should match two consecutive words that both start with a capital letter.
For your specific example:
lines = []
for line in file:
    if re.findall(r"[A-Z][a-zA-Z]*\s[A-Z][a-zA-Z]", line):
        lines.append(line)
print(lines)
Basically, look into regexes!

Here you go:
import re
with open("r1.txt") as f:
    lines = f.readlines()
for line in lines:
    if re.match(r'[^\w]*[A-Z]+[ _][A-Z]+[^\w]*', line) is not None:
        print(line.strip("\n"))
Output:
DDD_AAA
