Identifying lines with consecutive upper case letters - python

I'm looking for logic that finds capitalized words in a line in Python. For example, I have a *.txt file containing:
aaa
adadad
DDD_AAA
Dasdf Daa
I would like to search only for the lines which have 2 or more capital words after each other (in the above case DDD_AAA).

Regexes are the way to go:
import re
pattern = r"([A-Z]+_[A-Z]+)"  # matches CAPITALS_CAPITALS only
match = re.search(pattern, text)  # text is one line of the file
if match:
    print(match.group(0))
You have to figure out what exactly you are looking for though.

Presuming your definition of a "capital word" is a string of two or more uppercase alphabet (non-numeric) characters, i.e. [A-Z], and assuming that what separates one "capital word" from another is not quite the complementary set ([^A-Z]) but rather the complementary set to the alphanumeric characters, i.e. [^a-zA-Z0-9], you're looking for a regex like
\b[A-Z]{2,}\b.*\b[A-Z]{2,}\b
I say like because the above is not exactly correct: \b counts the underscore _ as a word character. Replace the \bs with [^a-zA-Z0-9]s wrapped in lookaround assertions (to make them zero-width, like \b), and you have the correct regex:
(?<=[^a-zA-Z0-9]|^)[A-Z]{2,}(?=[^a-zA-Z0-9]).*(?<=[^a-zA-Z0-9])[A-Z]{2,}(?=[^a-zA-Z0-9]|$)
Here's a Rubular demo.
Finally, if you also count a one-character word as a "word", simply replace the {2,} quantifiers with +:
(?<=[^a-zA-Z0-9]|^)[A-Z]+(?=[^a-zA-Z0-9]).*(?<=[^a-zA-Z0-9])[A-Z]+(?=[^a-zA-Z0-9]|$)
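Note that Python's re module rejects the variable-width lookbehind (?<=[^a-zA-Z0-9]|^) used above, while Rubular tests against Ruby's engine, which accepts it. Below is a minimal sketch of one Python-friendly equivalent, assuming negative lookarounds are an acceptable substitute. The sample lines come from the question, and the name CAPS_PAIR is only illustrative:

```python
import re

# Negative lookarounds are fixed-width and also succeed at the string edges,
# so they stand in for (?<=[^a-zA-Z0-9]|^) and (?=[^a-zA-Z0-9]|$).
CAPS_PAIR = re.compile(
    r"(?<![a-zA-Z0-9])[A-Z]{2,}(?![a-zA-Z0-9])"
    r".*"
    r"(?<![a-zA-Z0-9])[A-Z]{2,}(?![a-zA-Z0-9])"
)

sample_lines = ["aaa", "adadad", "DDD_AAA", "Dasdf Daa"]
matching = [line for line in sample_lines if CAPS_PAIR.search(line)]
print(matching)
```

Only DDD_AAA survives the filter, because it is the only line with two separate runs of at least two capitals.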

print(re.findall(r"[A-Z][a-zA-Z]*\s[A-Z][a-zA-Z]*", search_text))
should match two consecutive words that both start with a capital letter.
For your specific example:
lines = []
for line in file:
    if re.findall(r"[A-Z][a-zA-Z]*\s[A-Z][a-zA-Z]*", line):
        lines.append(line)
print(lines)
Basically, look into regexes!

Here you go:
import re
lines = open("r1.txt").readlines()
for line in lines:
    # two runs of capitals separated by a space or an underscore
    if re.match(r'[^\w]*[A-Z]+[ _][A-Z]+[^\w]*', line) is not None:
        print(line.strip("\n"))
Output:
DDD_AAA

Related

Writing a regex expression that finds 'zz' in a word but not at the start and the end

I am having some difficulty writing a regex expression that finds words in a text that contain 'zz', but not at the start and the end of the text. These are two of my many attempts:
pattern = re.compile(r'(?!(?:z){2})[a-z]*zz[a-z]*(?!(?:z){2})')
pattern = re.compile(r'\b[^z\s\d_]{2}[a-z]*zz[a-y][a-z]*(?!(?:zz))\b')
Thanks
Well, the direct translation would be
\b(?!zz)(?:(?!zz\b)\w)+zz(?:(?!zz\b)\w)+\b
See a demo on regex101.com.
Programmatically, you could use
text = "lorem ipsum buzz mezzo mix zztop but this is all"
words = [word
         for word in text.split()
         if not (word.startswith("zz") or word.endswith("zz")) and "zz" in word]
print(words)
Which yields
['mezzo']
See a demo on ideone.com.
Another idea is to use non-word boundaries.
\B matches at any position between two word characters as well as at any position between two non-word characters ...
\w*\Bzz\B\w*
See this demo at regex101
Be aware that the above also matches words with more than two consecutive z's. For exactly two:
\w*(?<=[^\Wz])zz(?=[^\Wz])\w*
Another demo at regex101
Use any of those patterns with (?i) flag for caseless matching if needed.
You can use lookarounds:
\b(?!zz)\w+?zz\w+\b(?<!zz)
demo
or not:
\bz?[^\Wz]\w*?zz\w*[^\Wz]z?\b
demo
Limited to ASCII letters this last pattern can also be written:
\bz?[a-y][a-z]*?zz[a-z]*[a-y]z?\b
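A quick way to compare the patterns suggested so far is to run them over the sample sentence from the earlier answer. This is only a sketch, and the dictionary keys are made-up labels rather than established names:

```python
import re

text = "lorem ipsum buzz mezzo mix zztop but this is all"

# Labels are illustrative; the patterns are copied from the answers above.
patterns = {
    "tempered": r"\b(?!zz)(?:(?!zz\b)\w)+zz(?:(?!zz\b)\w)+\b",
    "non_word_boundary": r"\w*\Bzz\B\w*",
    "lookarounds": r"\b(?!zz)\w+?zz\w+\b(?<!zz)",
    "ascii_only": r"\bz?[a-y][a-z]*?zz[a-z]*[a-y]z?\b",
}

results = {name: re.findall(rx, text) for name, rx in patterns.items()}
for name, found in results.items():
    print(name, found)
```

On this sentence every pattern keeps mezzo and drops buzz and zztop. They can still differ on other inputs, for example words with three or more consecutive z's.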
Your criteria just means that the first and last letter cannot be z. So we simply have to make sure the first and last letter is not z, and then we have a zz somewhere in the text.
Something like
^[^z].*zz.*[^z]$
should work

Regular Expression for strings with underscore

I want to catch the following line in a parsed file using regex,
types == "EQUAL_NUM_SEQUENTIAL_LBAS":
for this, I am using the following code
variable = 'types'
for i in data:
if re.search(re.escape(variable) + r"\s\=\=\s^[A-Z_]+$", i):
print "yyy"
break
where data here is the list of lines in the parsed file. What is wrong in the expression I have written?
If you want to match a string consisting only of uppercase letters possibly separated by underscores, then use:
^[A-Z]+(?:_[A-Z]+)*$
Sample script:
import re

inp = "EQUAL_NUM_SEQUENTIAL_LBAS"
if re.search(r'^[A-Z]+(?:_[A-Z]+)*$', inp):
    print("MATCH")
Read in order, the pattern matches a word made up only of capital letters, followed by zero or more groups consisting of an underscore and another such word.
To capture such words appearing anywhere in a larger text/document, use:
inp = "Here is one ABC_DEF word and another EQUAL_NUM_SEQUENTIAL_LBAS here"
words = re.findall(r'\b[A-Z]+(?:_[A-Z]+)*\b', inp)
print(words)
This prints:
['ABC_DEF', 'EQUAL_NUM_SEQUENTIAL_LBAS']
Remove the ^ character from the pattern. It only matches at the start of the string, so it can never succeed after \s\=\=\s:
r"\s\=\=\s[A-Z_]+$"
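Since the target line in the question wraps the value in double quotes and ends with a colon, a pattern ending in [A-Z_]+$ would still not match it literally. Here is a hedged sketch that accounts for both, assuming the line always has that shape. The sample data list is invented for illustration:

```python
import re

variable = "types"
# Hypothetical stand-in for the list of lines in the parsed file
data = [
    "x = 1",
    'types == "EQUAL_NUM_SEQUENTIAL_LBAS":',
]

# Assumption: the value is always double-quoted and followed by a colon.
pattern = re.compile(re.escape(variable) + r'\s*==\s*"([A-Z]+(?:_[A-Z]+)*)":')

found = None
for line in data:
    m = pattern.search(line)
    if m:
        found = m.group(1)  # the name between the quotes
        break
print(found)
```

Capturing the name in a group also makes it available for later use, instead of only signalling that the line exists.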

Split by suffix with Python regular expression

I want to split strings only by suffixes. For example, I would like to be able to split dord word to [dor,wor].
I thought that \wd would search for words that end with d. However, this does not produce the expected results:
import re
re.split(r'\wd',"dord word")
['do', ' wo', '']
How can I split by suffixes?
import re
x = 'dord word'
print(re.split(r"d\b", x))
or
print([i for i in re.split(r"d\b", x) if i])  # if you don't want empty strings
Try this.
A better way is to use re.findall with r'\b(\w+)d\b' as your regex, which captures the part of each word before the final d:
>>> re.findall(r'\b(\w+)d\b', "dord word")
['dor', 'wor']
Since \w also captures digits and underscore, I would define a word consisting of just letters with a [a-zA-Z] character class:
print([x.group(1) for x in re.finditer(r"\b([a-zA-Z]+)d\b", "dord word")])
See demo
If you're wondering why your original approach didn't work,
re.split(r'\wd',"dord word")
It finds all instances of a letter/number/underscore before a "d" and splits on what it finds. So it did this:
do[rd] wo[rd]
and split on the strings in brackets, removing them.
Also note that this could split in the middle of words, so:
re.split(r'\wd', "said tendentious")
would split the second word in two.
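The difference between the approaches can be checked directly. This is a small sketch using only the strings already mentioned in this thread:

```python
import re

# \wd consumes the character before each d as well, wherever it occurs
naive_first = re.split(r"\wd", "dord word")
naive_second = re.split(r"\wd", "said tendentious")
print(naive_first)
print(naive_second)

# Requiring \b after the d restricts the match to a d at the end of a word;
# capturing the part before it avoids leftover spaces and empty strings
stems = re.findall(r"\b([a-zA-Z]+)d\b", "dord word")
print(stems)
```

The second split shows the mid-word break: the "nd" inside "tendentious" is treated as a separator.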

How to extract numbers from string using regular expression in Python except when within brackets?

Here are my test strings:
Word word word; 123-125
Word word (1000-1000)
Word word word (1000-1000); 99-999
Word word word word
What regular expression should I use to extract only those numbers (format: \d+-\d+) that are not within brackets (the ones in bold above)?
I've tried this:
(\d+-\d+)(?!\))
But it's matching:
Word word word; 123-125
Word word (1000-1000)
Word word word (1000-1000); 99-999
Word word word word
Note the last digit before the second bracket.
I was trying to drop any match that is followed by a bracket, but it's only dropping one digit rather than the whole match! What am I missing here?
Any help will be greatly appreciated.
You can use a negative look-ahead to get only those values you need like this:
(?![^()]*\))(\d+-\d+)
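The (?![^()]*\)) look-ahead checks that, scanning forward from the start of the number, no closing round bracket appears before any other bracket. In other words, it rejects numbers that sit inside a pair of brackets.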
See demo
Sample code:
import re
p = re.compile(r'(?![^()]*\))(\d+-\d+)')
test_str = "Word word word; 123-125\nWord word (1000-1000)\nWord word word (1000-1000); 99-999\nWord word word word"
print(re.findall(p, test_str))
Output of the sample program:
['123-125', '99-999']
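As for why the attempt in the question drops only one digit: when the look-ahead fails, the regex engine backtracks and gives up the last digit, so the next character is a digit rather than a bracket. A short sketch on one of the question's test strings:

```python
import re

test_str = "Word word (1000-1000)"

# The asker's attempt: the second \d+ backtracks by one digit so that
# the next character is "0" rather than ")", and the lookahead then passes.
attempt = re.findall(r"(\d+-\d+)(?!\))", test_str)
print(attempt)

# Placing the lookahead before the number and letting it scan ahead to the
# next bracket rejects every starting position inside the parentheses.
fixed = re.findall(r"(?![^()]*\))(\d+-\d+)", test_str)
print(fixed)
```

The first call returns the truncated 1000-100, while the second returns an empty list for this line.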
Another way is to describe everything you don't want:
[^(\d]*(?:\([^)]*\)[^(\d]*)*
Then you can rely on an assertion that is always true: the digits you want are always preceded by zero or more characters that are not digits, possibly interleaved with bracketed groups.
You only need to capture the digits in a group:
p = re.compile(r'[^(\d]*(?:\([^)]*\)[^(\d]*)*(\d+-\d+)')
The advantage of this approach is that you don't need to test a lookahead at each position in the string, so it is a fast pattern. The drawback is that it consumes a little more memory, because each whole match is a longer string.
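Here is a minimal sketch of this consuming pattern applied to the test strings from the question. Because the pattern has a single capture group, findall returns only the captured numbers:

```python
import re

p = re.compile(r"[^(\d]*(?:\([^)]*\)[^(\d]*)*(\d+-\d+)")
test_str = (
    "Word word word; 123-125\n"
    "Word word (1000-1000)\n"
    "Word word word (1000-1000); 99-999\n"
    "Word word word word"
)

# Each match swallows the unwanted text, bracketed groups included,
# and the capture group keeps just the number that follows it.
numbers = p.findall(test_str)
print(numbers)
```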

Regex? Match part of or whole word

I was wondering if it's possible to use regex with python to capture a word, or a part of the word (if it's at the end of the string).
Eg:
target word - potato
string - "this is a sentence about a potato"
string - "this is a sentence about a potat"
string - "this is another sentence about a pota"
Thanks!
import re

def get_matcher(word, minchars):
    # Alternation of prefixes, longest first, e.g. potato|potat|pota
    reg = '|'.join([word[0:i] for i in range(len(word), minchars - 1, -1)])
    return re.compile('(%s)$' % (reg))

matcher = get_matcher('potato', 4)
for s in ["this is a sentence about a potato", "this is a sentence about a potat", "this is another sentence about a pota"]:
    print(matcher.search(s).groups())
OUTPUT
('potato',)
('potat',)
('pota',)
I don't know how to apply a regex in Python, but the regex would be:
"\bp$|\bpo$|\bpot$|\bpota$|\bpotat$|\bpotato$"
This matches anything from p to potato if it is the last word in the string, and it does not match something like "foopotato", if that is what you want.
The | denotes an alternative, the \b is a "word boundary", so it matches a position (not a character) between a word- and a non-word character. And the $ matches the end of the string (also a position).
Use the $ to match at the end of a string. For example, the following would match 'potato' only at the end of a string (first example):
"potato$"
This would match all of your examples:
"pota[to]{0,2}$"
However, there is some risk of also matching "potao" or "potaot".
import re
patt = re.compile(r'(p|po|pot|pota|potat|potato)$')
patt.search(string)
I was tempted to use r'po?t?a?t?o?$', but that would also match poto or pott.
No, you can't do that with a regex as far as I know, without pointless (p|po|pot ...) matches which are excessive. Instead, just pick off the last word, and match that using a substring:
match = re.search(r'\S+$', haystack)
if match and needle.startswith(match.group(0)):
    print("matches")
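The last-word approach can be wrapped in a small helper. This is only a sketch: the function name trailing_prefix and the minchars parameter are hypothetical, the latter borrowed from the get_matcher answer above:

```python
import re

def trailing_prefix(haystack, needle, minchars=1):
    """Return the last word of haystack if it is a prefix of needle, else None."""
    match = re.search(r"\S+$", haystack)
    if match is None:
        return None  # empty string or trailing whitespace
    last = match.group(0)
    if len(last) >= minchars and needle.startswith(last):
        return last
    return None

for s in ["this is a sentence about a potato",
          "this is a sentence about a potat",
          "this is another sentence about a pota"]:
    print(trailing_prefix(s, "potato", minchars=4))
```

Because the comparison is a plain prefix check, strings such as "foopotato" or "poto" are rejected without building an alternation.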
