Regex complete words pattern

Regex complete words pattern - python

I want to get patterns involving complete words, not pieces of words.
E.g. 12345 [some word] 1234567 [some word] 123 1679. Random text and the pattern appears again 1111 123 [word] 555.
This should return
[[12345, 1234567, 123, 1679],[1111, 123, 555]]
I am only tolerating one word between the numbers otherwise the whole string would match.
Also note that it is important to capture that 2 matches were found and so a two-element list was returned.
I am running this in python3.
I have tried:
\b(\d+)\b\s\b(\w+)?\b\s\b(\d+)\b
but I am not sure how to scale this to an unrestricted number of matches.
re.findall('\b(\d+)\b\s\b(\w+)?\b\s\b(\d+)\b', string)
This matches [number] [word] [number] but not any number that might follow with or without a word in between.

Are you expecting re.findall() to return a list of lists? It will only return a list - no matter what regex you use.
One approach is to split your input string into sentences and then loop through them
import re
inputArray = re.split('<pattern>',inputText)
outputArray = []
for item in inputArray:
outputArray.append(re.findall('\b(\d+)\b\s\b(\w+)?\b\s\b(\d+)\b', item))
the trick is to find a <pattern> to split your input.

You can't do this in one operation with the Python re engine.
But you could match the sequence with one match, then extract the
digits with another.
This matches the sequence
r"(?<!\w)\d+(?:(?:[^\S\r\n]+[a-zA-Z](?:\w*[a-zA-Z])*)?[^\S\r\n]+\d+)*(?!\w)"
https://regex101.com/r/73AYLU/1
Explained
(?<! \w ) # Not a word behind
\d+ # Many digits
(?: # Optional word block
(?: # Optional words
[^\S\r\n]+ # Horizontal whitespace
[a-zA-Z] # Starts with a letter
(?: \w* [a-zA-Z] )* # Can be digits in middle, ends with a letter
)? # End words, do once
[^\S\r\n]+ # Horizontal whitespace
\d+ # Many digits
)* # End word block, do many times
(?! \w ) # Not a word ahead
This gets the array of digits from the sequence matched above (use findall)
r"(?<!\S)(\d+)(?!\S)"
https://regex101.com/r/BHov38/1
Explained
(?<! \S ) # Whitespace boundary
( \d+ ) # (1)
(?! \S ) # Whitespace boundary

This is a bit complicated, maybe this expression would be just something to look into:
(((\d+)\s*)*(?:\s*\[.*?\]\s*)((\d+)\s*)*)|([A-za-z\s]+)
and script the rest of the problem for a valid solution.
Demo

Related

Exclude words with pattern 'xyyx' but include words that start & ends with same letter

I have a regex to match words that starts and ends with the same letter (excluding single characters like 'a', '1' )
(^.).*\1$
and another regex to avoid matching any strings with the format 'xyyx' (e.g 'otto', 'trillion', 'xxxx', '-[[-', 'fitting')
^(?!.*(.)(.)\2\1)
How do I construct a single regex to meet both of the requirements?

You can start the pattern with the negative lookahead followed by the pattern for the match. But note to change the backreference to \3 for the last pattern as the lookahead already uses group 1 and group 2.
Note that the . also matches a space, so if you don't want to match spaces you can use \S to match non whitespace chars instead.
^(?!.*(.)(.)\2\1)(.).*\3$
Regex demo

I would place the negative look-ahead after the initial character, and let it exclude the final character (as those two should be part of a positive capture):
^(.)(?!.*(.)\2.).*\1$
Note that the negative check concerns characters between the start and ending character, and so these words would not be rejected:
oopso
livewell

Optional group except when it precede with a match

I want to match any string that starts with . and word and then optionally any character after a space.
r"^\.(\w+)(?:\s+(.+)\b)?"
eg:
should match
.just one two
.just
.blah one#nine
.blah
.jargon blah
should not match
.jargon
I want this second group mandatory if first group is jargon

Using Python you can exclude matching only jargon using a negative lookahead, and then match 1 or more word characters
Then optionally match 1 or more whitespace characters excluding newlines followed by at least 1 or more characters without newlines.
^\.(?!jargon$)\w+(?:[^\S\n]+.+)?$
The pattern matches:
^ Start of string
\. Match a dot
(?!jargon$) Exlude matching jargon as the only word on the line
\w+ Match 1+ word characters
(?: Non capture group
[^\S\n]+.+ match 1+ whitespace chars excluding newline and then 1+ chars except newlines
)? Close non capture group and make it optional
$ End of string
See a regex demo and a Python demo.
Example
import re
strings = [
".just one two",
".just",
".blah one#nine",
".blah",
".jargon blah",
".jargon"
]
for s in strings:
m = re.match(r"\.(?!jargon$)\w+(?:[^\S\n]+.+)?$", s)
if m:
print(m.group())
Output
.just one two
.just
.blah one#nine
.blah
.jargon blah

One approach would be to phrase your requirement using an alternation:
^\.(?:(?!jargon\b)\w+(?: \S+)*|jargon(?: \S+)+)$
This pattern says to match:
^ from the start of the input
\. match dot
(?:
(?!jargon\b)\w+ match a first term which is NOT "jargon"
(?: \S+)* then match optional following terms zero or more times
| OR
jargon match "jargon" as the first term
(?: \S+)+ then match mandatory one or more terms
)
$ end of the input
Here is a sample Python script:
inp = [".just one two", ".just", ".blah one#nine", ".blah", ".jargon blah", "jargon"]
matches = [x for x in inp if re.search(r'^\.(?:(?!jargon\b)\w+(?: \S+)*|jargon(?: \S+)+)$', x)]
print(matches) # ['.just one two', '.just', '.blah one#nine', '.blah', '.jargon blah']

You could attempt to match the following regular expression:
^\.(?!jargon$)\w+(?= .|$).*
Demo
If successful, this will match the entire string. If one simply wants to know if the string conforms to the requirements .* can be dropped.
(?!jargon$) is a negative lookahead that asserts that the period is not immediately followed by 'jargon' at the end of the string.
(?= .|$) is a positive lookahead that asserts that the string of word characters is followed by a space followed by any character or they terminate the string.

How to extract string between space and symbol '>'?

String 1:
[impro:0,grp:00,time:0xac,magic:0x00ac] CAR<7:5>|BIKE<4:0>,orig:0x8c,new:0x97
String 2:
[impro:0,grp:00,time:0xbc,magic:0x00bc] CAKE<4:0>,orig:0x0d,new:0x17
In string 1, I want to extract CAR<7:5 and BIKE<4:0,
In string 2, I want to extract CAKE<4:0
Any regex for this in Python?

You can use \w+<[^>]+
DEMO
\w matches any word character (equivalent to [a-zA-Z0-9_])
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy).
< matches the character <
[^>] Match a single character not present in the list
+ matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)

We can use re.findall here with the pattern (\w+.*?)>:
inp = ["[impro:0,grp:00,time:0xac,magic:0x00ac] CAR<7:5>|BIKE<4:0>,orig:0x8c,new:0x97", "[impro:0,grp:00,time:0xbc,magic:0x00bc] CAKE<4:0>,orig:0x0d,new:0x17"]
for i in inp:
matches = re.findall(r'(\w+<.*?)>', i)
print(matches)
This prints:
['CAR<7:5', 'BIKE<4:0']
['CAKE<4:0']

In the first example, the BIKE part has no leading space but a pipe char.
A bit more precise match might be asserting either a space or pipe to the left, and match the digits separated by a colon and assert the > to the right.
(?<=[ |])[A-Z]+<\d+:\d+(?=>)
In parts, the pattern matches:
(?<=[ |]) Positive lookbehind, assert either a space or a pipe directly to the left
[A-Z]+ Match 1+ chars A-Z
<\d+:\d+ Match < and 1+ digits betqeen :
(?=>) Positive lookahead, assert > directly to the right
Regex demo
Or the capture group variant:
(?:[ |])([A-Z]+<\d+:\d)>
Regex demo

Match word boundary before non-alphanumerical character

I want to find words starting with a single non-alphanumerical character, say '$', in a string with re.findall
Example of matching words
$Python
$foo
$any_word123
Example of non-matching words
$$Python
foo
foo$bar
Why \b does not work
If the first character were to be alphanumerical, I could do this.
re.findall(r'\bA\w+', s)
But this does not work for a pattern like \b\$\w+ because \b matches the empty string only between a \w and a \W.
# The line below matches only the last '$baz' which is the one that should not be matched
re.findall(r'\b\$\w+', '$foo $bar x$baz').
The above outputs ['$baz'], but the desired pattern should output ['$foo', '$bar'].
I tried replacing \b by a positive lookbehind with pattern ^|\s, but this does not work because lookarounds must be fixed in length.
What is the correct way to handle this pattern?

The following will match a word starting with a single non-alphanumerical character.
re.findall(r'''
(?: # start non-capturing group
^ # start of string
| # or
\s # space character
) # end non-capturing group
( # start capturing group
[^\w\s] # character that is not a word or space character
\w+ # one or more word characters
) # end capturing group
''', s, re.X)
or just:
re.findall(r'(?:^|\s)([^\w\s]\w+)', s, re.X)
results in:
'$a $b a$c $$d' -> ['$a', '$b']

One way is to use a negative lookbehind with the non-whitespace metacharacter \S.
s = '$Python $foo foo$bar baz'
re.findall(r'(?<!\S)\$\w+', s) # output: ['$Python', '$foo']

Using Regex to find words with characters that are the same or that are different

I have a list of words such as:
l = """abca
bcab
aaba
cccc
cbac
babb
"""
I want to find the words that have the same first and last character, and that the two middle characters are different from the first/last character.
The desired final result:
['abca', 'bcab', 'cbac']
I tried this:
re.findall('^(.)..\\1$', l, re.MULTILINE)
But it returns all of the unwanted words as well.
I thought of using [^...] somehow, but I couldn't figure it out.
There's a way of doing this with sets (to filter the results from the search above), but I'm looking for a regex.
Is it possible?

Edit: fixed to use negative lookahead assertions instead of negative lookbehind assertions. Read comments for #AlanMoore and #bukzor explanations.
>>> [s for s in l.splitlines() if re.search(r'^(.)(?!\1).(?!\1).\1$', s)]
['abca', 'bcab', 'cbac']
The solution uses negative lookahead assertions which means 'match the current position only if it isn't followed by a match for something else.' Now, take a look at the lookahead assertion - (?!\1). All this means is 'match the current character only if it isn't followed by the first character.'

There are lots of ways to do this. Here's probably the simplest:
re.findall(r'''
\b #The beginning of a word (a word boundary)
([a-z]) #One letter
(?!\w*\1\B) #The rest of this word may not contain the starting letter except at the end of the word
[a-z]* #Any number of other letters
\1 #The starting letter we captured in step 2
\b #The end of the word (another word boundary)
''', l, re.IGNORECASE | re.VERBOSE)
If you want, you can loosen the requirements a bit by replacing [a-z] with \w. That will allow numbers and underscores as well as letters. You can also restrict it to 4-character words by changing the last * in the pattern to {2}.
Note also that I'm not very familiar with Python, so I'm assuming your usage of findall is correct.

Are you required to use regexes? This is a much more pythonic way to do the same thing:
l = """abca
bcab
aaba
cccc
cbac
babb
"""
for word in l.split():
if word[-1] == word[0] and word[0] not in word[1:-1]:
print word

Here's how I would do it:
result = re.findall(r"\b([a-z])(?:(?!\1)[a-z]){2}\1\b", subject)
This is similar to Justin's answer, except where that one does a one-time lookahead, this one checks each letter as it's consumed.
\b
([a-z]) # Capture the first letter.
(?:
(?!\1) # Unless it's the same as the first letter...
[a-z] # ...consume another letter.
){2}
\1
\b
I don't know what your real data looks like, so chose [a-z] arbitrarily because it works with your sample data. I limited the length to four characters for the same reason. As with Justin's answer, you may want to change the {2} to *, + or some other quantifier.

To heck with regexes.
[
word
for word in words.split('\n')
if word[0] == word[-1]
and word[0] not in word[1:-1]
]

You can do this with negative lookahead or lookbehind assertions; see http://docs.python.org/library/re.html for details.

Not a Python guru, but maybe this
re.findall('^(.)(?:(?!\1).)*\1$', l, re.MULTILINE)
expanded (use multi-line modifier):
^ # begin of line
(.) # capture grp 1, any char except newline
(?: # grouping
(?!\1) # Lookahead assertion, not what was in capture group 1 (backref to 1)
. # this is ok, grab any char except newline
)* # end grouping, do 0 or more times (could force length with {2} instead of *)
\1 # backref to group 1, this character must be the same
$ # end of line

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Regex complete words pattern - python

This is a bit complicated, maybe this expression would be just something to look into: (((\d+)\s)(?:\s\[.?\]\s)((\d+)\s)*)|([A-za-z\s]+) and script the rest of the problem for a valid solution. Demo

Related

Exclude words with pattern 'xyyx' but include words that start & ends with same letter

Optional group except when it precede with a match

How to extract string between space and symbol '>'?

Match word boundary before non-alphanumerical character

Using Regex to find words with characters that are the same or that are different

Categories

Resources

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Regex complete words pattern - python

This is a bit complicated, maybe this expression would be just something to look into: (((\d+)\s*)*(?:\s*\[.*?\]\s*)((\d+)\s*)*)|([A-za-z\s]+) and script the rest of the problem for a valid solution. Demo

Related

Exclude words with pattern 'xyyx' but include words that start & ends with same letter

Optional group except when it precede with a match

How to extract string between space and symbol '>'?

Match word boundary before non-alphanumerical character

Using Regex to find words with characters that are the same or that are different

Categories

Resources

This is a bit complicated, maybe this expression would be just something to look into: (((\d+)\s)(?:\s\[.?\]\s)((\d+)\s)*)|([A-za-z\s]+) and script the rest of the problem for a valid solution. Demo