I have a pattern like this:
word word one/two/three word
I want to match the groups that are separated by /. My thought is as follows:
[\w]+ # match any words
[\w]+/[\w]+ # followed by / and another word
[\w]+[/[\w]+]+ # repeat the latter
But this does not work: it seems that as soon as I add a ], it closes the outermost [ rather than the innermost one.
How can I work with nested square brackets?
Here is one option using re.findall:
import re
input = "word word one/two/three's word apple/banana"
r1 = re.findall(r"[A-Za-z0-9'.,:;]+(?:/[A-Za-z0-9'.,:;]+)+", input)
print(r1)
["one/two/three's", 'apple/banana']
I suggest the following answer, which uses a simple regex very close to yours and to Tim Biegeleisen's answer, but not exactly the same:
import re
words = "word word one/two/three's word other/test"
result = re.findall(r"[\w']+/[\w'/]+", words)
print(result) # ["one/two/three's", 'other/test']
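On the original question itself: square brackets define a character class (a set of single characters) and do not nest, so a ] generally closes the class opened by the nearest unclosed [. Repeating a multi-character unit needs a group, written with parentheses. The following is a minimal sketch using the sample string from the question; it assumes the parts consist only of \w characters, and the variable names are illustrative.

```python
import re

text = "word word one/two/three word"

# In the attempted pattern, [/[\w] is a single character class containing
# '/', '[' and word characters, so nothing is grouped.
# A non-capturing group (?:...) repeats the "/word" unit as a whole.
pattern = r"\w+(?:/\w+)+"

matches = re.findall(pattern, text)
print(matches)  # ['one/two/three']

# Splitting each match on '/' yields the individual parts.
parts = [m.split("/") for m in matches]
print(parts)  # [['one', 'two', 'three']]
```

The group is non-capturing so that re.findall returns the whole match rather than only the last repetition of the group.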
I am trying to write a regex that excludes square brackets and the text inside them.
My sample text looks like this: 'WordA, WordB, WordC, [WordD]'
I want to match each text item in the string except '[WordD]'. I've tried using a negative lookahead, something like... [A-Z][A-Za-z]+(?!\[[A-Z]+\]) but doing so is still matching the text inside the brackets.
Is negative lookahead the best approach? If so, where am I going wrong?
Rather than a regex, you might consider splitting by commas and then filtering by whether the word starts with [ (here text holds the input string):
output = [word for word in text.split(', ') if word[0] != '[']
If you use a regex, you can match either the beginning of the string, or lookbehind for a space:
re.findall(r'(?:^|(?<= ))[A-Z][A-Za-z]+', text)
Or you could negative lookahead for ] at the end, after a word boundary:
output = re.findall(r'[A-Z][A-Za-z]+\b(?!\])', text)
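To see where the original attempt goes wrong, it can help to run the patterns side by side. A lookahead placed after the word only inspects what follows the word, so it cannot see the [ in front of WordD. This sketch uses the sample string from the question; the variable names are illustrative.

```python
import re

text = "WordA, WordB, WordC, [WordD]"

# Original attempt: the lookahead checks for "[LETTERS]" *after* each word,
# which never occurs here, so WordD is still matched.
original = re.findall(r"[A-Z][A-Za-z]+(?!\[[A-Z]+\])", text)
print(original)  # ['WordA', 'WordB', 'WordC', 'WordD']

# Negative lookahead for a closing bracket right after the complete word.
fixed = re.findall(r"[A-Z][A-Za-z]+\b(?!\])", text)
print(fixed)  # ['WordA', 'WordB', 'WordC']

# Split-and-filter alternative without a regex.
filtered = [w for w in text.split(", ") if not w.startswith("[")]
print(filtered)  # ['WordA', 'WordB', 'WordC']
```

The \b in the second pattern matters: without it, the engine could backtrack to a shorter match such as "Word", which is not followed by ].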
This can be as simple as
(\w+),
Retrieve value of Group 1 for desired result.
I'm guessing that maybe you were trying to write some expression similar to:
[A-Z][a-z]*[A-Z](?=,|$)
or,
[A-Z][a-z]+[A-Z](?=,|$)
Test
import re
regex = r"[A-Z][a-z]*[A-Z](?=,|$)"
string = """
WordA, WordB, WordC, [WordD]
WordA, WordB, WordC, [WordD], WordE
"""
print(re.findall(regex, string))
Output
['WordA', 'WordB', 'WordC', 'WordA', 'WordB', 'WordC', 'WordE']
If you wish to simplify, modify, or explore the expression, regex101.com explains each part of it in its top-right panel and shows how it matches against sample inputs.
I'm currently having a hard time separating the words in a txt document into a list with regex. I have tried ".split" and ".readlines". My document consists of words like "HelloPleaseHelpMeUnderstand": the words are capitalized but not spaced, so I'm at a loss on how to get them into a list.
This is what I have currently, but it only returns a single word.
import re
file1 = open("file.txt","r")
strData = file1.readline()
listWords = re.findall(r"[A-Za-z]+", strData)
print(listWords)
One of my goals in doing this is to search for another word within the elements of the list, but for now I just wish to know how to list them so I can continue my work.
If anyone can guide me to a solution, I would be grateful.
A regular regex based on lookarounds to insert spaces between glued letter words is
import re
text = "HelloPleaseHelpMeUnderstand"
print( re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])|(?<=[a-z])(?=[A-Z])", " ", text) )
# => Hello Please Help Me Understand
Note that adjustments will be necessary to account for numbers, or single-letter uppercase words like I, A, etc.
Regarding your current code, you need to make sure you read the whole file into a variable (using file1.read(), you are reading just the first line with readline()) and use a [A-Z]+[a-z]* regex to match all the words glued the way you show:
import re
with open("file.txt", "r") as file1:
    strData = file1.read()
listWords = re.findall(r"[A-Z]+[a-z]*", strData)
print(listWords)
Pattern details
[A-Z]+ - one or more uppercase letters
[a-z]* - zero or more lowercase letters.
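As a quick check of the pattern without needing a file, here is a small sketch on the sample string from the question. The membership test at the end is only a hypothetical illustration of the "search for another word" goal, and the "HTMLParser" input is an invented example showing one limitation.

```python
import re

text = "HelloPleaseHelpMeUnderstand"

# Each match is a run of uppercase letters followed by any lowercase letters.
words = re.findall(r"[A-Z]+[a-z]*", text)
print(words)  # ['Hello', 'Please', 'Help', 'Me', 'Understand']

# Hypothetical follow-up: check whether a given word is in the list.
print("Help" in words)  # True

# Limitation: consecutive capitals stay glued to the following word.
acronym = re.findall(r"[A-Z]+[a-z]*", "HTMLParser")
print(acronym)  # ['HTMLParser']
```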
How about this:
import re
strData = """HelloPleaseHelpMeUnderstand
And here not in
HereIn"""
listWords = re.findall(r"(([A-Z][a-z]+){2,})", strData)
result = [i[0] for i in listWords]
print(result)
# ['HelloPleaseHelpMeUnderstand', 'HereIn']
print(re.sub(r"\B([A-Z])", r" \1", "DoIThinkThisIsABetterAnswer?"))
Do I Think This Is A Better Answer?
I want to split strings only by suffixes. For example, I would like to be able to split "dord word" into [dor, wor].
I thought that \wd would search for words that end with d. However, this does not produce the expected results:
import re
re.split(r'\wd',"dord word")
['do', ' wo', '']
How can I split by suffixes?
import re
x = 'dord word'
print(re.split(r"d\b", x))
or
print([i for i in re.split(r"d\b", x) if i])  # if you don't want empty strings
Try this.
A better way is to use re.findall with r'\b(\w+)d\b' as the regex to capture the part of each word before the final d:
>>> s = "dord word"
>>> re.findall(r'\b(\w+)d\b',s)
['dor', 'wor']
Since \w also captures digits and underscore, I would define a word consisting of just letters with a [a-zA-Z] character class:
print([x.group(1) for x in re.finditer(r"\b([a-zA-Z]+)d\b", "dord word")])
If you're wondering why your original approach didn't work,
re.split(r'\wd',"dord word")
It finds all instances of a letter/number/underscore before a "d" and splits on what it finds. So it did this:
do[rd] wo[rd]
and split on the strings in brackets, removing them.
Also note that this could split in the middle of words, so:
re.split(r'\wd', "said tendentious")
would split the second word in two.
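The behaviour described above can be checked directly. This is a small sketch on the strings from this thread; the final list comprehension is one possible way to clean up the pieces, not the only one.

```python
import re

text = "dord word"

# \wd consumes the character before each d, wherever it occurs.
first = re.split(r"\wd", text)
print(first)  # ['do', ' wo', '']
mid_word = re.split(r"\wd", "said tendentious")
print(mid_word)  # ['sa', ' te', 'entious']

# d\b only matches a d at the end of a word, but the separating space
# stays attached to the next piece and a trailing empty string remains.
suffix = re.split(r"d\b", text)
print(suffix)  # ['dor', ' wor', '']

# Stripping whitespace and dropping empty pieces gives the stems.
stems = [p.strip() for p in suffix if p.strip()]
print(stems)  # ['dor', 'wor']
```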
I want to define a function that takes a sentence and returns the words that are at least 4 characters long, in lowercase. The problem is, I'm pretty new to Python and I'm not quite certain how to write code that deals with words instead of integers. My current code is as follows:
def my_function(s):
    sentence = []
    for word in s.split():
        if len(word) >= 4:
            return (word.lower())
If I call my_function("Bill's dog was born in 2010"), I expect ["bill","born"], whereas my code outputs "bill's".
From what I've seen on StackOverflow and in the Python tutorial, regular expression would help me but I do not fully understand what is going on in the module. Can you guys explain how regex could help, if it can at all?
Your requirements are slightly inconsistent, so I'll go with your example as the reference.
In [27]: import re
In [28]: s = "Bill's dog was born in 2010"
In [29]: [w.lower() for w in re.findall(r'\b[A-Za-z]{4,}\b', s)]
Out[29]: ['bill', 'born']
Let's take a look at the regular expression, r'\b[A-Za-z]{4,}\b'.
The r'...' is not part of the regular expression. It's a Python construct called a raw string. It's like a normal string literal except backslash sequences like \b don't have their usual meaning.
The two \b look for a word boundary (that is, the start or the end of a word).
The [A-Za-z]{4,} looks for a sequence of four or more letters. The [A-Za-z] is called a character class and consists of letters A through Z and a through z. The {4,} is a repetition operator that requires that the character class is matched at least four times.
Finally, the list comprehension, [w.lower() for w in ...], converts the words to lowercase.
Yes, Regex would be the simplest and easiest approach to achieve what you want.
Try this regex:
matches = re.findall(r"\b[a-zA-Z]{4,}\b", "Put Your String Here")  # matches ['Your', 'String', 'Here']
You return the first word that is 4 chars or longer, instead of all such words. Append to sentence and return that instead:
def my_function(s):
    sentence = []
    for word in s.split():
        if len(word) >= 4:
            sentence.append(word.lower())
    return sentence
You can simplify that with a list comprehension:
def my_function(s):
    return [word.lower() for word in s.split() if len(word) >= 4]
Yes, a regular expression could do this too, but for your case that may be overkill.
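Whether the regex is overkill depends on how punctuation and digits should be treated. The sketch below compares the two approaches on the sentence from the question; the function names long_words_split and long_words_regex are made up for this illustration.

```python
import re


def long_words_split(s):
    """Whitespace split: punctuation and digits count towards the length."""
    return [word.lower() for word in s.split() if len(word) >= 4]


def long_words_regex(s):
    """Regex: only runs of four or more letters between word boundaries."""
    return [w.lower() for w in re.findall(r"\b[A-Za-z]{4,}\b", s)]


sentence = "Bill's dog was born in 2010"
print(long_words_split(sentence))  # ["bill's", 'born', '2010']
print(long_words_regex(sentence))  # ['bill', 'born']
```

Only the regex version produces the ["bill", "born"] result the question asks for, because the split version keeps the apostrophe and treats "2010" as a word.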
You forgot to accumulate the long words in 'sentence';) You're instead returning the first one
Using re.split
>>> import re
>>> a='Hi, how are you today?'
>>> [x for x in re.split('[^a-z]', a.lower()) if len(x)>=4]
['today']
>>>
I'm looking for logic that searches for capitalized words in a line in Python. For example, I have a *.txt file:
aaa
adadad
DDD_AAA
Dasdf Daa
I would like to search only for the lines which have 2 or more capital words after each other (in the above case DDD_AAA).
Regexes are the way to go:
import re
text = "DDD_AAA"  # example line taken from the question
pattern = r"([A-Z]+_[A-Z]+)"  # matches CAPITALS_CAPITALS only
match = re.search(pattern, text)
if match:
    print(match.group(0))
You have to figure out what exactly you are looking for though.
Presuming your definition of a "capital word" is a string of two or more uppercase alphabet (non-numeric) characters, i.e. [A-Z], and assuming that what separates one "capital word" from another is not quite the complementary set ([^A-Z]) but rather the complementary set to the alphanumeric characters, i.e. [^a-zA-Z0-9], you're looking for a regex like
\b[A-Z]{2,}\b.*\b[A-Z]{2,}\b
I say like because the above is not exactly correct: \b counts the underscore _ as a word character. Replace the \bs with [^a-zA-Z0-9]s wrapped in lookaround assertions (to make them zero-width, like \b), and you have the correct regex:
(?<=[^a-zA-Z0-9]|^)[A-Z]{2,}(?=[^a-zA-Z0-9]).*(?<=[^a-zA-Z0-9])[A-Z]{2,}(?=[^a-zA-Z0-9]|$)
Here's a Rubular demo.
Finally, if you consider a one-character word, a "word", then simply do away with the {2,} quantifiers:
(?<=[^a-zA-Z0-9]|^)[A-Z]+(?=[^a-zA-Z0-9]).*(?<=[^a-zA-Z0-9])[A-Z]+(?=[^a-zA-Z0-9]|$)
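One caveat for Python users: the re module requires lookbehinds to have a fixed width, so (?<=[^a-zA-Z0-9]|^), which can be either one character or zero characters wide, is rejected when compiled. One possible adaptation, offered as a sketch rather than a drop-in equivalent, replaces the positive lookarounds with negative ones ("not preceded or followed by an alphanumeric"). The name CAP_WORD and the list of lines (taken from the question's sample file) are illustrative.

```python
import re

# Negative lookarounds also succeed at the start and end of the string,
# so the |^ and |$ alternatives are not needed.
CAP_WORD = r"(?<![a-zA-Z0-9])[A-Z]{2,}(?![a-zA-Z0-9])"
pattern = re.compile(CAP_WORD + r".*" + CAP_WORD)

lines = ["aaa", "adadad", "DDD_AAA", "Dasdf Daa"]
hits = [line for line in lines if pattern.search(line)]
print(hits)  # ['DDD_AAA']
```

Changing {2,} to + in CAP_WORD gives the one-character-word variant.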
print(re.findall(r"[A-Z][a-zA-Z]*\s[A-Z][a-zA-Z]", search_text))
should work to match 2 words that both start with a capital letter.
For your specific example:
lines = []
for line in file:
    if re.findall(r"[A-Z][a-zA-Z]*\s[A-Z][a-zA-Z]", line):
        lines.append(line)
print(lines)
Basically, look into regexes!
Here you go:
import re
lines = open("r1.txt").readlines()
for line in lines:
    if re.match(r'[^\w]*[A-Z]+[ _][A-Z]+[^\w]*', line) is not None:
        print(line.strip("\n"))
Output:
DDD_AAA