How to match one character word? - python

How do I match only words of character length one? Or do I have to check the length of the match after I performed the match operation? My filter looks like this:
sw = r'\w+,\s+([A-Za-z]){1}'
So it should match
rs =re.match(sw,'Herb, A')
But shouldn't match
rs =re.match(sw,'Herb, Abc')

If you use \b\w\b you will match only a single word character that stands on its own. So your expression would be
sw = r'\w+,\s+\w\b'
(since \w is preceded by at least one \s you don't need the first \b)
Verification:
>>> sw = r'\w+,\s+\w\b'
>>> print re.match(sw,'Herb, A')
<_sre.SRE_Match object at 0xb7242058>
>>> print re.match(sw,'Herb, Abc')
None

You can use
(?<=\s|^)\p{L}(?=[\s,.!?]|$)
which will match a single letter that is preceded by a whitespace character or the start of the string, and followed by a whitespace character, one of the listed punctuation marks, or the end of the string. The lookahead is augmented with punctuation marks as well; this all depends a bit on your input data. You could also do a lookahead on a non-letter, but that raises the question of whether “a123” is really a one-letter word. Or “I'm”.
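Note that Python's built-in re module does not support \p{L} and only accepts fixed-width lookbehinds, so the pattern above will not compile there as written. A rough equivalent for re, as a sketch only: it assumes that [^\W\d_] is an acceptable stand-in for "any letter" and that (?<!\S) is an acceptable stand-in for "preceded by whitespace or the start of the string". The name single_letter is just for illustration.

```python
import re

# Assumption: [^\W\d_] (a word character that is neither a digit nor an
# underscore) stands in for \p{L}; (?<!\S) stands in for (?<=\s|^).
single_letter = re.compile(r'(?<!\S)[^\W\d_](?=[\s,.!?]|$)')

one = single_letter.findall('Herb, A')        # the lone "A" qualifies
none = single_letter.findall('Herb, Abc')     # "A" is followed by "b"
mixed = single_letter.findall("a123 I'm b.")  # only "b" stands alone

print(one, none, mixed)
```

With this version neither “a123” nor “I'm” counts as a one-letter word, because the letter has to be followed by whitespace, punctuation from the list, or the end of the string.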

Related

Match characters and whitespaces, but not numbers

I am trying to create a regex that will match characters, whitespaces, but not numbers.
So hello 123 will not match, but hell o will.
I tried this:
[^\d\w]
but, I cannot find a way to add whitespaces here. I have to use \w, because my strings can contain Unicode characters.
Brief
It's unclear what exactly characters refers to, but, assuming you mean alpha characters (based on your input), this regex should work for you.
Code
^(?:(?!\d)[\w ])+$
Note: This regex uses the mu flags for multiline and Unicode (multiline only necessary if input is separated by newline characters)
Results
Input
ÀÇÆ some words
ÀÇÆ some words 123
Output
This only shows matches
ÀÇÆ some words
Explanation
^ Assert position at the start of the line
(?:(?!\d)[\w ])+ Match the following one or more times (tempered greedy token)
(?!\d) Negative lookahead ensuring what follows doesn't match a digit. You can change this to (?![\d_]) if you want to ensure _ is also not used.
[\w ] Match any word character or space (matches Unicode word characters with u flag)
$ Assert position at the end of the line
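In Python 3 the same pattern can be tried as below. This is a sketch, assuming Python 3, where str patterns are Unicode-aware by default, so only the multiline flag has to be passed. The variable names are illustrative.

```python
import re

# Tempered token: each consumed character must be a word character or a
# space, and must not be a digit.
pattern = re.compile(r'^(?:(?!\d)[\w ])+$', re.MULTILINE)

text = "ÀÇÆ some words\nÀÇÆ some words 123"

# No capture groups, so findall returns the full matched lines.
matches = pattern.findall(text)
print(matches)
```

Only the first line should come back, since the second one contains digits.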
You can use a lookahead:
(?=^\D+$)[\w\s]+
In Python:
import re
strings = ['hello 123', 'hell o']
rx = re.compile(r'(?=^\D+$)[\w\s]+')
new_strings = [string for string in strings if rx.match(string)]
print(new_strings)
# ['hell o']

Strange regex behavior when matching with strings that include dots

I have a following case where the matching seems to not work properly:
import re
test_case1 = u"I will meet you at 2 pm"
test_case2 = u"I will meet you at 2 p.m."
test_case3 = u"I will meet you at 2 p.m. "
test_case4 = u"I will meet you at 2 p.m. pm "
list_of_words = ['p.m.', 'pm'] # list of words that can be enlarged
# join all words into an or expression and escape all punctuation
joined_words = '|'.join([re.escape(x) for x in list_of_words])
# create a regex that will match a word from the list of words only if it is
# at the start/end of the sentence or it is between two word boundaries
match_regex = r'(^|\b)('+joined_words+r')(\b|$)'
comp_regex = re.compile(match_regex, re.IGNORECASE) # compile the final regex
print comp_regex.findall(test_case1), len(comp_regex.findall(test_case1))
print comp_regex.findall(test_case2), len(comp_regex.findall(test_case2))
print comp_regex.findall(test_case3), len(comp_regex.findall(test_case3))
print comp_regex.findall(test_case4), len(comp_regex.findall(test_case4))
I get the following results for the 4 test cases:
[(u'', u'pm', u'')] 1
[(u'', u'p.m.', u'')] 1
[] 0
[(u'', u'pm', u'')] 1
The 1st and 2nd cases seem to work fine, 3rd doesn't match "p.m." if there is space after it, even though I have used "\b" word boundary in the regex.
The 4th case doesn't seem to match the "p.m." at all and only matches the "pm".
I can't seem to understand where the problem lies, any help is appreciated.
Python docs state following about \b:
Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string, so the precise set of characters deemed to be alphanumeric depends on the values of the UNICODE and LOCALE flags. For example, r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'. Inside a character range, \b represents the backspace character, for compatibility with Python’s string literals.
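According to that definition, \b matches only between a word character and a non-word character (or between a word character and the start/end of the string). In test cases 3 and 4, p.m. is followed by a space, so the position after the final . lies between two non-word characters and \b does not match there. Test case 2 succeeds only because $ matches at the end of the string. If you make the following change to your regex, you get the behavior you expect: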
match_regex = r'(^|\b)('+joined_words+r')(\s|$)'
You could try:
match_regex = r'(^|\b)('+joined_words+r')(\s|$)'
if you want both p.m. and pm to match in the last test case, or
match_regex = r'(^|\s)('+joined_words+r')(\s|$)'
if you want only the first p.m. to match (the whitespace consumed after p.m. is then no longer available in front of pm).
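Another option, sketched here as an alternative to the patterns above rather than as the method from either answer, is to replace \b with lookarounds that only check for the absence of a neighbouring word character. They consume nothing, so adjacent words can each match. The list and variable names follow the question; the results dictionary is only for illustration.

```python
import re

list_of_words = ['p.m.', 'pm']
joined_words = '|'.join(re.escape(x) for x in list_of_words)

# (?<!\w) and (?!\w): no word character directly before or after the word.
comp_regex = re.compile(r'(?<!\w)(?:' + joined_words + r')(?!\w)', re.IGNORECASE)

test_cases = [
    "I will meet you at 2 pm",
    "I will meet you at 2 p.m.",
    "I will meet you at 2 p.m. ",
    "I will meet you at 2 p.m. pm ",
]

results = {case: comp_regex.findall(case) for case in test_cases}
for case, found in results.items():
    print(repr(case), found, len(found))
```

Because the group is non-capturing, findall returns the matched words themselves instead of tuples.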

How to change a quantifier in a Regex based on a condition?

I would like to find words of length >= 1 which may contain a ' or a - within. Here is a test string:
a quake-prone area- (aujourd'hui-
In Python, I'm currently using this regex:
string = "a quake-prone area- (aujourd'hui-"
RE_WORDS = re.compile(r'[a-z]+[-\']?[a-z]+')
words = RE_WORDS.findall(string)
I would like to get this result:
>>> words
[u'a', u'quake-prone', u'area', u"aujourd'hui"]
but I get this instead:
>>> words
[u'quake-prone', u'area', u"aujourd'hui"]
Unfortunately, because of the last + quantifier, it skips all words of length 1. If I use the * quantifier, it will find a but also area- instead of area.
So how could I create a conditional regex saying: if the word contains an apostrophe or a hyphen, use the + quantifier, else use the * quantifier?
I suggest making the last [-\']?[a-z]+ part optional by putting it into a group and adding a ? quantifier after that group.
>>> string = "a quake-prone area- (aujourd'hui-"
>>> RE_WORDS = re.compile(r'[a-z]+(?:[-\'][a-z]+)?')
>>> RE_WORDS.findall(string)
['a', 'quake-prone', 'area', "aujourd'hui"]
The a is missing from your output because your regex contains two [a-z]+ parts, which together require at least two lowercase letters in every match.
Note that the regex above won't match area- because the optional group (?:[-\'][a-z]+)? requires at least one lowercase letter right after the - or '. Since no lowercase letter follows the hyphen in area-, the optional group matches nothing and the match ends just before the hyphen, so you get area in the output instead of area-.

Python Alphanumeric Regex

Below I have the following regex:
alphanumeric = compile(r'^[\w\d ]+$')
I'm running the current data against this regex:
Tomkiewicz Zigomalas Andrade Mcwalters
I have a separate regex to identify alpha characters only, yet the data above still matches the alphanumeric criteria.
Edit: How do I stop alpha-only data from matching the regex above?
Description: An acceptable string can take one of two forms:
It starts with one or more digits, then at least one letter, followed by any number of alphanumeric characters.
It starts with one or more letters, then at least one digit, followed by any number of alphanumeric characters.
Demo:
>>> an_re = r"(?:\d+[A-Z]|[A-Z]+\d)[\dA-Z]*"
>>> re.search(an_re, '12345', re.I) # not acceptable string
>>> re.search(an_re, 'abcd', re.I) # not acceptable string
>>> re.search(an_re, 'abc1', re.I) # acceptable string
<_sre.SRE_Match object at 0x14153e8>
>>> re.search(an_re, '1abc', re.I)
<_sre.SRE_Match object at 0x14153e8>
Use a lookahead to assert the condition that at least one alpha and at least one digit are present:
(?=.*[a-zA-Z])(?=.*[0-9])^[\w\d ]+$
The above RegEx utilizes two lookaheads to first check the entire string for each condition. The lookaheads search up until a single character in the specified range is found. If the assertion matches then it moves on to the next one. The last part I borrowed from the OP's original attempt and just ensures that the entire string is composed of one or more lower/upper alphas, underscores, digits, or spaces.
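A quick way to try the lookahead version in Python, as a sketch: the sample strings are made up for illustration apart from Tomkiewicz, which comes from the question, and \d is dropped from the character class because \w already covers digits.

```python
import re

# Two lookaheads: at least one letter and at least one digit somewhere in
# the string; the rest allows only word characters and spaces.
pattern = re.compile(r'^(?=.*[a-zA-Z])(?=.*[0-9])[\w ]+$')

samples = ['Tomkiewicz', 'abc1', '12345', '1 abc']  # illustrative inputs
matched = [s for s in samples if pattern.match(s)]
print(matched)
```

The alpha-only and digit-only strings are rejected; the two mixed ones pass.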

Using Regex to find words with characters that are the same or that are different

I have a list of words such as:
l = """abca
bcab
aaba
cccc
cbac
babb
"""
I want to find the words that have the same first and last character, and that the two middle characters are different from the first/last character.
The desired final result:
['abca', 'bcab', 'cbac']
I tried this:
re.findall('^(.)..\\1$', l, re.MULTILINE)
But it returns all of the unwanted words as well.
I thought of using [^...] somehow, but I couldn't figure it out.
There's a way of doing this with sets (to filter the results from the search above), but I'm looking for a regex.
Is it possible?
Edit: fixed to use negative lookahead assertions instead of negative lookbehind assertions. Read comments for @AlanMoore and @bukzor explanations.
>>> [s for s in l.splitlines() if re.search(r'^(.)(?!\1).(?!\1).\1$', s)]
['abca', 'bcab', 'cbac']
The solution uses negative lookahead assertions, which mean 'match at the current position only if what follows does not match the given pattern.' Now, take a look at the lookahead assertion (?!\1) followed by a dot. All this means is 'match the next character only if it is not the same as the first character (captured as group 1).'
There are lots of ways to do this. Here's probably the simplest:
re.findall(r'''
\b #The beginning of a word (a word boundary)
([a-z]) #One letter
(?!\w*\1\B) #The rest of this word may not contain the starting letter except at the end of the word
[a-z]* #Any number of other letters
\1 #The starting letter we captured in step 2
\b #The end of the word (another word boundary)
''', l, re.IGNORECASE | re.VERBOSE)
If you want, you can loosen the requirements a bit by replacing [a-z] with \w. That will allow numbers and underscores as well as letters. You can also restrict it to 4-character words by changing the last * in the pattern to {2}.
Note also that I'm not very familiar with Python, so I'm assuming your usage of findall is correct.
Are you required to use regexes? This is a much more pythonic way to do the same thing:
l = """abca
bcab
aaba
cccc
cbac
babb
"""
for word in l.split():
    # same first and last character, and that character does not appear in between
    if word[-1] == word[0] and word[0] not in word[1:-1]:
        print(word)
Here's how I would do it:
result = re.findall(r"\b([a-z])(?:(?!\1)[a-z]){2}\1\b", subject)
This is similar to Justin's answer, except where that one does a one-time lookahead, this one checks each letter as it's consumed.
\b
([a-z]) # Capture the first letter.
(?:
(?!\1) # Unless it's the same as the first letter...
[a-z] # ...consume another letter.
){2}
\1
\b
I don't know what your real data looks like, so chose [a-z] arbitrarily because it works with your sample data. I limited the length to four characters for the same reason. As with Justin's answer, you may want to change the {2} to *, + or some other quantifier.
To heck with regexes.
[
    word
    for word in l.split()  # split() skips the empty string after the trailing newline
    if word[0] == word[-1]
    and word[0] not in word[1:-1]
]
You can do this with negative lookahead or lookbehind assertions; see http://docs.python.org/library/re.html for details.
Not a Python guru, but maybe this
re.findall(r'^(.)(?:(?!\1).)*\1$', l, re.MULTILINE)
expanded (use multi-line modifier):
^ # begin of line
(.) # capture grp 1, any char except newline
(?: # grouping
(?!\1) # Lookahead assertion, not what was in capture group 1 (backref to 1)
. # this is ok, grab any char except newline
)* # end grouping, do 0 or more times (could force length with {2} instead of *)
\1 # backref to group 1, this character must be the same
$ # end of line
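One caveat when trying this in Python: re.findall returns the contents of capture group 1 (the first letter) when the pattern has a single group, not the whole word. A sketch that collects the full words with finditer instead, assuming the four-letter sample data from the question and the {2} variant mentioned in the comments above:

```python
import re

l = """abca
bcab
aaba
cccc
cbac
babb
"""

# Raw string so that \1 is a backreference and not the character chr(1).
pattern = re.compile(r'^(.)(?:(?!\1).){2}\1$', re.MULTILINE)

# group(0) is the whole match, i.e. the full word on that line.
words = [m.group(0) for m in pattern.finditer(l)]
print(words)
```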
