Python Regex Picking "not include" word - python

I am trying to find words in the string that do not contain any "a" characters. I wrote the code below but it does not work. How can I say to regex "do not include"? Can't I use "^" sign as "not"?
import re
string2 = "asfdba12312sssdr1 12şljş1 kf"
t = re.findall(r'([^a]\w*) | \w*[^a] ', string2 )
print(t)
The result of that code is "['sfdba12312sssdr1', '12şljş1']"

You need to use a regex with word boundaries with a re.UNICODE flag:
r = re.compile(ur'\b[^\Wa]+\b', re.UNICODE)
The \W and \b will become Unicode aware then.
See the regex demo
[^\Wa] matches any Unicode letter, digit or inderscore, but not a. Add a re.I flag to make it case-insensitive.
If you do not want to match words with digits, add \d to the char class: [^\W\da].
See Python demo:
# -*- coding: utf-8 -*-
import re
p = re.compile(ur'\b[^\Wa]+\b', re.UNICODE)
s = u"asfdba12312sssdr1 12şljş1 kf"
res = [x.encode('utf8') for x in p.findall(s)]
print(res)

[^a] is the single non-a character. [^a]\w* is a single non-a character followed by any number of word-characters. Note that a space is a non-a character, and word-characters can also include a...
The easiest and the most intuitive way to do this in Python is not using re.findall at all:
[word for word in string2.split() if not 'a' in word]

Related

how to match either word or sentence in this Python regex?

I have a decent familiarity with regex but this is tricky. I need to find instances like this from a SQL case statement:
when col_name = 'this can be a word or sentence'
I can match the above when it's just one word, but when it's more than one word it's not working.
s = """when col_name = 'a sentence of words'"""
x = re.search("when\s(\w+)\s*=\s*\'(\w+)", s)
if x:
print(x.group(1)) # this returns "col_name"
print(x.group(2)) # this returns "a"
I want group(2) to return "a sentence of words" but I'm just getting the first word. That part could either be one word or several. How to do it?
When I add in the second \', then I get no match:
x = re.search("when\s(\w+)\s*=\s*\'(\w+)\'", s)
You may match all characters other than single quotation mark rather than matching letters, digits and connector punctuation ("word" chars) with the Group 2 pattern:
import re
s = """when col_name = 'a sentence of words'"""
x = re.search(r"when\s+(\w+)\s*=\s*'([^']+)", s)
if x:
print(x.group(1)) # this returns "col_name"
print(x.group(2)) # this returns "a sentence of words"
See the Python demo
The [^'] is a negated character class that matches any char but a single quotation mark, see the regex demo.
In case the string can contain escaped single quotes, you may consider replacing [^'] with
If the escape char is ': ([^']*(?:''[^']*)*)
If the escape char is \: ([^\\']*(?:\\.[^'\\]*)*).
Note the use of the raw string literal to define the regex pattern (all backslashes are treated as literal backslashes inside it).

Match charactes and whitespaces, but not numbers

I am trying to create a regex that will match characters, whitespaces, but not numbers.
So hello 123 will not match, but hell o will.
I tried this:
[^\d\w]
but, I cannot find a way to add whitespaces here. I have to use \w, because my strings can contain Unicode characters.
Brief
It's unclear what exactly characters refers to, but, assuming you mean alpha characters (based on your input), this regex should work for you.
Code
See regex in use here
^(?:(?!\d)[\w ])+$
Note: This regex uses the mu flags for multiline and Unicode (multiline only necessary if input is separated by newline characters)
Results
Input
ÀÇÆ some words
ÀÇÆ some words 123
Output
This only shows matches
ÀÇÆ some words
Explanation
^ Assert position at the start of the line
(?:(?!\d)[\w ])+ Match the following one or more times (tempered greedy token)
(?!\d) Negative lookahead ensuring what follows doesn't match a digit. You can change this to (?![\d_]) if you want to ensure _ is also not used.
[\w ] Match any word character or space (matches Unicode word characters with u flag)`
$ Assert position at the end of the line
You can use a lookahead:
(?=^\D+$)[\w\s]+
In Python:
import re
strings = ['hello 123', 'hell o']
rx = re.compile(r'(?=^\D+$)[\w\s]+')
new_strings = [string for string in strings if rx.match(string)]
print(new_strings)
# ['hell o']

python regex - replace newline (\n) to something else

I'm trying to convert multiple continuous newline characters followed by a Capital Letter to "____" so that I can parse them.
For example,
i = "Inc\n\nContact"
i = re.sub(r'([\n]+)([A-Z])+', r"____\2", i)
In [25]: i
Out [25]: 'Inc____Contact'
This string works fine. I can parse them using ____ later.
However it doesn't work on this particular string.
i = "(2 months)\n\nML"
i = re.sub(r'([\n]+)([A-Z])+', r"____\2", i)
Out [31]: '(2 months)____L'
It ate capital M.
What am I missing here?
EDIT To replace multiple continuous newline characters (\n) to ____, this should do:
>>> import re
>>> i = "(2 months)\n\nML"
>>> re.sub(r'(\n+)(?=[A-Z])', r'____', i)
'(2 months)____ML'
(?=[A-Z]) is to assert "newline characters followed by Capital Letter". REGEX DEMO.
Well let's take a look at your regex ([\n]+)([A-Z])+ - the first part ([\n]+) is fine, matching multiple occurences of a newline into one group (note - this wont match the carriage return \r). However the second part ([A-Z])+ leeds to your error it matches a single uppercase letter into a capturing group - multiple times, if there are multiple Uppercase letter, which will reset the group to the last matched uppercase letter, which is then used for the replace.
Try the following and see what happens
import re
i = "Inc\n\nABRAXAS"
i = re.sub(r'([\n]+)([A-Z])+', r"____\2", i)
You could simply place the + inside the capturing group, so multiple uppercase letters are matched into it. You could also just leave it out, as it doesn't make a difference, how many of these uppercase letters follow.
import re
i = "Inc\n\nABRAXAS"
i = re.sub(r'(\n+)([A-Z])', r"____\2", i)
If you want to replace any sequence of linebreaks, no matter what follows - drop the ([A-Z]) completely and try
import re
i = "Inc\n\nABRAXAS"
i = re.sub(r'(\n+)', r"____", i)
You could also use ([\r\n]+) as pattern, if you want to consider carriage returns
Try:
import re
p = re.compile(ur'[\r?\n]')
test_str = u"(2 months)\n\nML"
subst = u"_"
result = re.sub(p, subst, test_str)
It will reduce string to
(2 months)__ML
See Demo

RegEx match word in string containing + and - using re.findall() Python

myreg = r"\babcb\"
mystr = "sdf ddabc"
mystr1 = "sdf abc"
print(re.findall(myreg,mystr))=[]
print(re.findall(myreg,mystr1))=[abc]
Until now everything works as expected but if i change my reg and my str to.
myreg = r"\b\+abcb\"
mystr = "sdf +abc"
print(re.findall(myreg,mystr)) = [] but i would like to get [+abc]
I have noticed that using the following works as expected.
myreg = "^\\+abc$"
mystr = "+abc"
mystr1 = "-+abc"
My question: Is it possible to achieve the same results as above without splitting the string?
Best regards,
Gabriel
Your problem is the following:
\b is defined as the boundary between a \w and a \W character
(or vice versa).
\w contains the character set [a-zA-Z0-9_]
\W contains the character set [^a-zA-Z0-9_], which means all characters except [a-zA-Z0-9_]
'+' is not contained in \w so you won't match the boundary between the whitespace and the '+'.
To get what you want, you should remove the first \b from your pattern:
import re
string = "sdf +abc"
pattern = r"\+abc\b"
matches = re.findall(pattern, string)
print matches
['+abc']
There are two problems
Before your + in +abc, there is no word boundary, so \b cannot match.
Your regex \b\+abcb\ tries to match a literal b character after abc (typo).
Word Boundaries
The word boundary \b matches at a position between a word character (letters, digits and underscore) and a non-word character (or a line beginning or ending). For instance, there is a word boundary between the + and the a
Solution: Make your Own boundary
If you want to match +abc but only when it is not preceded by a word character (for instance, you don't want it inside def+abc), then you can make your own boundary with a lookbehind:
(?<!\w)\+abc
This says "match +abc if it is not preceded by a word character (letter, digit, underscore)".

regular expression python `r'*(.|:|?|!)*'`

I want to accept words that include these characters: '.:?!'
I tried this as pattern: r'*(.|:|?|!)*' but does not work.
How I implement this by using python re.
the code I'd like to run is like this:
import re
pattern = r'*(.|:|?|!)*'
word = '.Flask'
match = re.match(pattern, word)
if match:
print('yes')
for example, I want to accept these words:'.flask', 'flask.','!flask','flask!'....
and even non ascii character. So I want to include those words as well:.日本語, 日本語.
that's why I wanted to use * symbol.
If you want to match words starting or ending with one of these characters, this regex should fit your needs:
pattern = r'[.:?!][^ ]*|[^ ]*[.:?!]'
Or even better with word boundary:
pattern = r'\b[.:?!][^ ]*\b|\b[^ ]*[.:?!]\b'
Explenation:
[.:?!][^ ]* matches words beginning with [.:?!] followed by all characters except whitespace.
[^ ]*[.:?!] matches all words beginning Wirth any character except whitespace ending with a [.:?!] character.
\b matches the word boundaries.

Categories

Resources