import re
strg = "what is AM&I"
z= re.sub(r'&', '', strg)
# pattern = r'(?:[A-Z]\.)+'
# pattern = r'\b(?:[A-Z][a-z]*){2,}'
regex = re.compile(r'[#_!#$%^&*()<>?/\|}{~:]')
print(regex)
print(regex.search(strg))
I need to get the output as AM&I.
This regular expression
regex = re.compile(r'(\S*[#_!#$%^&*()<>?/\|}{~:]\S*)')
will look for "words" (meaning strings of nonblank characters) that contain at least one of the special characters you are looking for.
>>> strg = "what is AM&I"
>>> m=regex.search(strg)
>>> m.group(1)
'AM&I'
With only one example to go on, though, it is quite likely that this will miss things you are looking for (false negatives) or return things you are not really looking for (false positives).
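To see how the pattern behaves beyond the single example, here is a small sketch using re.findall. The second sample sentence and the helper name find_special_words are made up for illustration; they are not from the question.

```python
import re

# Same character class as in the answer (duplicate '#' dropped), as a raw string.
SPECIAL_WORD = re.compile(r'\S*[#_!$%^&*()<>?/\|}{~:]\S*')

def find_special_words(text):
    """Return every run of non-blank characters that contains a listed special character."""
    return SPECIAL_WORD.findall(text)

print(find_special_words("what is AM&I"))
print(find_special_words("R&D costs rose at AT&T, why?"))
```

The second call returns 'AT&T,' with its trailing comma and also 'why?', which is the kind of false positive mentioned above.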
I have a good regexp for replacing repeating characters in a string. But now I also need to replace repeating words: three or more repetitions of a word should be replaced by two.
For example,
bye! bye! bye!
should become
bye! bye!
My code so far:
def replaceThreeOrMoreCharachetrsWithTwoCharacters(string):
    # pattern to look for three or more repetitions of any character, including newlines.
    pattern = re.compile(r"(.)\1{2,}", re.DOTALL)
    return pattern.sub(r"\1\1", string)
Assuming that a "word" in your requirements is one or more non-whitespace characters surrounded by whitespace or by the string boundaries, you can try this pattern:
re.sub(r'(?<!\S)((\S+)(?:\s+\2))(?:\s+\2)+(?!\S)', r'\1', s)
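A short sketch of that substitution in use, under the same definition of "word". The function name collapse_repeated_words and the sample strings other than "bye! bye! bye!" are assumptions added for illustration.

```python
import re

# Group 2 is the word and group 1 is that word written twice. Any further
# repetitions are matched outside group 1, so replacing with \1 drops them.
REPEATED_WORDS = re.compile(r'(?<!\S)((\S+)(?:\s+\2))(?:\s+\2)+(?!\S)')

def collapse_repeated_words(s):
    """Reduce three or more consecutive copies of a word to two."""
    return REPEATED_WORDS.sub(r'\1', s)

print(collapse_repeated_words("bye! bye! bye!"))
print(collapse_repeated_words("hi hi hi hi some words words words"))
print(collapse_repeated_words("no no"))
```

A word that appears only twice, as in the last call, is left untouched because the trailing (?:\s+\2)+ needs at least a third copy.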
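You could also try the regex below. It needs the third-party regex module, because the standard re module rejects the variable-width lookbehind (?<= |^):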
(?<= |^)(\S+)(?: \1){2,}(?= |$)
Sample code,
>>> import regex
>>> s = "hi hi hi hi some words words words which'll repeat repeat repeat repeat repeat"
>>> m = regex.sub(r'(?<= |^)(\S+)(?: \1){2,}(?= |$)', r'\1 \1', s)
>>> m
"hi hi some words words which'll repeat repeat"
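If installing the regex module is not an option, a roughly equivalent pattern can be written for the standard re module. This is a sketch, not the answer's own pattern: it assumes that (?<!\S) and (?!\S) are acceptable substitutes for the lookarounds, which means tabs and newlines also count as word boundaries.

```python
import re

# (?<!\S) and (?!\S) stand in for the (?<= |^) and (?= |$) lookarounds.
pattern = re.compile(r'(?<!\S)(\S+)(?: \1){2,}(?!\S)')

s = "hi hi hi hi some words words words which'll repeat repeat repeat repeat repeat"
result = pattern.sub(r'\1 \1', s)
print(result)
```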
I know you were after a regular expression but you could use a simple loop to achieve the same thing:
def max_repeats(s, max=2):
    last = ''
    out = []
    for word in s.split():
        same = 0 if word != last else same + 1
        if same < max: out.append(word)
        last = word
    return ' '.join(out)
As a bonus, I have allowed a different maximum number of repeats to be specified (the default is 2). If there is more than one space between each word, it will be lost. It's up to you whether you consider that to be a bug or a feature :)
Try the following:
import re
s = "bye! bye! bye!"  # replace with your string
s = re.sub( r'(\S+) (?:\1 ?){2,}', r'\1 \1', s )
You can see a sample code here: http://codepad.org/YyS9JCLO
def replaceThreeOrMoreWordsWithTwoWords(string):
    # Pattern to look for three or more repetitions of any words.
    pattern = re.compile(r"(?<!\S)((\S+)(?:\s+\2))(?:\s+\2)+(?!\S)", re.DOTALL)
    return pattern.sub(r"\1", string)
I want to write a regular expression that matches strings like "(2000)", that is, years in parentheses, so that I can check whether a string contains such a substring.
For example, the regex should match (2000) but not 2000, (20000), or (200).
That is to say: a match must have exactly four digits, the first digit must be 1 or 2, and the parentheses must be included.
Also, 2000 is just an example; I really want the regex to cover all possible years.
You have to escape the opening and closing parentheses:
>>> import re
>>> str = """foo(2000)bar(1000)foobar2000"""
>>> regex = r'\(2000\)'
>>> result = re.findall(regex, str)
>>> print(result)
['(2000)']
OR
>>> import re
>>> str = """foo(2000)bar(1000)foobar(2014)barfoo(2020)"""
>>> regex = r'\([0-9]{4}\)'
>>> result = re.findall(regex, str)
>>> print(result)
['(2000)', '(1000)', '(2014)', '(2020)']
It matches every four-digit number (year) enclosed in parentheses.
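The question also asks that the first digit be 1 or 2. A minimal sketch of a pattern with that restriction follows; the sample text and the name YEAR_IN_PARENS are made up for illustration.

```python
import re

# \( and \) match literal parentheses. [12] restricts the first digit,
# and \d{3} requires exactly three more digits before the closing parenthesis.
YEAR_IN_PARENS = re.compile(r'\([12]\d{3}\)')

text = "foo(2000)bar(20000)baz(200)qux2000(1999)(3000)"
years = YEAR_IN_PARENS.findall(text)
print(years)
```

Only (2000) and (1999) are returned here: the five-digit and three-digit numbers, the bare 2000, and (3000) are all rejected.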
Special characters need to be escaped with a backslash. A parenthesis ( becomes \(. Therefore (2000) becomes \(2000\).
Then you can do something like:
if re.search(r"\(2000\)", subject):
    # Successful match
    pass
else:
    # Match attempt failed
    pass
>>> import re
>>> x = re.match(r'\((\d*?)\)', "(2000)")
>>> x.group(1)
'2000'
I've messed around with regex a little bit but am pretty unfamiliar with it for the most part. The string will be in the format:
\n\n*text here, can be any spaces, etc. etc.*
The string that I will get will have two line breaks, followed by an asterisk, followed by text, and then ending with another asterisk.
I want to exclude the beginning \n\n from the returned text. This is the pattern that I've come up with so far and it seems to work:
pattern = r"(?<=\n\n)\*(.*)(\*)"
match = re.search(pattern, string)
if match:
    text = match.group()
    print(text)
else:
    print("Nothing")
I'm wondering if there is a better way to go about matching this pattern or if the way I'm handling it is okay.
Thanks.
You can avoid capturing groups and have the whole match as result using:
pattern = r'(?<=\n\n\*)[^*]*(?=\*)'
Example:
import re
print(re.findall(r'(?<=\n\n\*)[^*]*(?=\*)', '\n\n*text here, can be any spaces, etc. etc.*'))
If you want to include the asterisk in the result you can use instead:
pattern = r'(?<=\n\n)\*[^*]*\*'
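The two lookaround patterns can be compared side by side on the sample string from the question. This is only a sketch; the variable names inner and with_stars are assumptions.

```python
import re

s = "\n\n*text here, can be any spaces, etc. etc.*"

# Lookbehind and lookahead keep the asterisks out of the match.
inner = re.search(r'(?<=\n\n\*)[^*]*(?=\*)', s)
# Here only the two newlines are excluded; the asterisks are part of the match.
with_stars = re.search(r'(?<=\n\n)\*[^*]*\*', s)

print(inner.group())
print(with_stars.group())
```

Note that [^*]* stops at the first asterisk it meets, so unlike the .* in the question it will not run past an asterisk inside the text.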
Regular expressions are overkill in a case like this -- if the delimiters are always static and at the head/tail of the string:
>>> s = "\n\n*text here, can be any spaces, etc. etc.*"
>>> def CheckString(s):
...     if s.startswith("\n\n*") and s.endswith("*"):
...         return s[3:-1]
...     else:
...         return "(nothing)"
...
>>> CheckString(s)
'text here, can be any spaces, etc. etc.'
>>> CheckString("no delimiters")
'(nothing)'
Adjust the slice indexes as needed; it wasn't clear to me whether you want to keep the leading/trailing '*' characters. If you want to keep them, change the slice to
return s[2:]
Say I have three strings:
abc534loif
tvd645kgjf
tv96fjbd_gfgf
and three lists:
beginning captures just the first part of the string "the name"
middle captures just the number
end contains only the rest of the characters that are after the number portion
How do I accomplish this in the most efficient way?
Use regular expressions?
>>> import re
>>> strings = 'abc534loif tvd645kgjf tv96fjbd_gfgf'.split()
>>> for s in strings:
...     for match in re.finditer(r'\b([a-z]+)(\d+)(.+?)\b', s):
...         print(match.groups())
...
('abc', '534', 'loif')
('tvd', '645', 'kgjf')
('tv', '96', 'fjbd_gfgf')
This is a language-agnostic approach that aims at higher efficiency:
find first digit in the string and save its position p0
find last digit in the string and save its position p1
extract substring from 0 to p0-1 into beginning
extract substring from p0 to p1 into middle
extract substring from p1+1 to length-1 into end
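The steps above can be sketched in Python as follows. The function name split_on_digits and the handling of strings without digits are assumptions, not part of the answer.

```python
def split_on_digits(s):
    """Split s into (beginning, middle, end) around its first and last digit."""
    digit_positions = [i for i, ch in enumerate(s) if ch.isdigit()]
    if not digit_positions:
        # Assumption: with no digits at all, everything counts as "beginning".
        return s, '', ''
    p0, p1 = digit_positions[0], digit_positions[-1]
    # Python slices exclude the end index, hence p1 + 1.
    return s[:p0], s[p0:p1 + 1], s[p1 + 1:]

beginning, middle, end = [], [], []
for s in ['abc534loif', 'tvd645kgjf', 'tv96fjbd_gfgf']:
    b, m, e = split_on_digits(s)
    beginning.append(b)
    middle.append(m)
    end.append(e)

print(beginning, middle, end)
```

If a string contains digits in more than one place, everything between the first and last digit ends up in middle, so this only suits inputs with a single run of digits.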
I guess you're looking for re.findall:
strs = """
abc534loif
tvd645kgjf
tv96fjbd_gfgf
"""
import re
print(re.findall(r'\b(\w+?)(\d+)(\w+)', strs))
# prints [('abc', '534', 'loif'), ('tvd', '645', 'kgjf'), ('tv', '96', 'fjbd_gfgf')]
>>> import itertools as it
>>> s="abc534loif"
>>> [''.join(j) for i,j in it.groupby(s, key=str.isdigit)]
['abc', '534', 'loif']
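One caveat with the groupby approach, shown as a small sketch: it splits at every switch between digits and non-digits, so it only yields exactly three pieces when the string has a single run of digits. The helper name digit_runs and the second sample string are made up for illustration.

```python
import itertools as it

def digit_runs(s):
    """Split s into alternating runs of digit and non-digit characters."""
    return [''.join(group) for _, group in it.groupby(s, key=str.isdigit)]

print(digit_runs("tv96fjbd_gfgf"))
print(digit_runs("ab12cd34ef"))
```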
I'd do something like this:
>>> import re
>>> l = ['abc534loif', 'tvd645kgjf', 'tv96fjbd_gfgf']
>>> regex = re.compile(r'([a-z_]+)(\d+)([a-z_]+)')
>>> beginning, middle, end = zip(*[regex.match(s).groups() for s in l])
>>> beginning
('abc', 'tvd', 'tv')
>>> middle
('534', '645', '96')
>>> end
('loif', 'kgjf', 'fjbd_gfgf')
I would use a regular expression like:
(?P<beginning>[^0-9]*)(?P<middle>[0-9]*)(?P<end>[^0-9]*)
and pull out the three matching sections.
import re
m = re.match(r"(?P<beginning>[^0-9]*)(?P<middle>[0-9]*)(?P<end>[^0-9]*)", "abc534loif")
m.group('beginning')  # 'abc'
m.group('middle')     # '534'
m.group('end')        # 'loif'
import re  # You want to match a string against a pattern, so import the regular expressions module 're'
mystring = "abc1234def"  # Just a string to test with
# Our regular expression. Everything between parentheses is 'captured', meaning it is
# accessible as one of the groups of the returned match object. ^ matches at the
# beginning of the string and $ at the end. [0-9] is a character range that matches
# any digit character; \D is any non-digit character.
match = re.match(r"^(\D+)([0-9]+)(\D+)$", mystring)
if match:  # match is None if the string didn't match the pattern, and None has no .group
    beginning = match.group(1)
    middle = match.group(2)
    end = match.group(3)
I am trying to remove characters from a Unicode string. I have a whitelist of allowed Unicode characters and I would like to remove everything that is not on the list.
allowed_list = ur'[\u0041-\u005A]|[\u0061-\u007A]|[\u00C0-\u00D6]|[\u00D8-\u00F6]|[\u00F8-\u012F]|\u0131|[\u0386]|[\u0388-\u038A]'
negated_list = ur'[^\u0041-\u005A]|[^\u0061-\u007A]|[^\u00C0-\u00D6]|[^\u00D8-\u00F6]|[^\u00F8-\u012F]|^\u0131|[^\u0386]|[^\u0388-\u038A]'
I am testing it with a subset of my list and I don't get why it is not working.
This removes all but lowercase latin chars:
>>> mystr = 'Arugg^]T'
>>> myre = re.compile(ur'[^\u0061-\u007A]', re.UNICODE)
>>> result = myre.sub('', mystr)
>>> result
'rugg'
This removes all but uppercase latin chars:
>>> mystr = 'Arugg^]T'
>>> myre = re.compile(ur'[^\u0041-\u005A]', re.UNICODE)
>>> result = myre.sub('', mystr)
>>> result
'AT'
But when I combine them, all chars get removed:
>>> mystr = 'Arugg^]T'
>>> myre = re.compile(ur'[^\u0041-\u005A]|[^\u0061-\u007A]', re.UNICODE)
>>> result = myre.sub('', mystr)
>>> result
''
When I tested the regex [^\u0041-\u005A]|[^\u0061-\u007A] on https://pythex.org/ it does what I am expecting, but when I attempt to use it in my code, it does not do what I want. What am I missing?
Thank you in advance!
Your regex is not correct: | is alternation, so a character matches if it satisfies either alternative, and every character lies outside at least one of the two ranges.
You need a single character class containing multiple ranges:
[^\u0041-\u005A\u0061-\u007A] matches any character that is outside both \u0041-\u005A and \u0061-\u007A.
import re
regex = r"[^\u0041-\u005A\u0061-\u007A]"
test_str = "Arugg^]T"
myre = re.compile(regex, re.UNICODE)
result = myre.sub('', test_str)
print(result)
# output: AruggT
In a positive character class, the items are implicitly OR'd together.
Your allowed list is then the same as
[\u0041-\u005a\u0061-\u007a\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u012f\u0131\u0386\u0388-\u038a]
For a negated character class [^...], however, the items are individually negated and then AND'ed together.
The negated list should therefore be
[^\u0041-\u005a\u0061-\u007a\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u012f\u0131\u0386\u0388-\u038a]
which is logically the same as
[^\u0041-\u005A] and [^\u0061-\u007A] and [^\u00C0-\u00D6] and [^\u00D8-\u00F6] and [^\u00F8-\u012F] and [^\u0131] and [^\u0386] and [^\u0388-\u038A]
What you tried to do was negate each item and then OR them together, which is not the same thing.
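The difference is easy to check on the test string from the question. This is a sketch for Python 3, where plain r'...' strings replace the Python 2 ur'...' literals; the variable names and the accented sample string are assumptions.

```python
import re

test_str = "Arugg^]T"

# OR of two negated classes: every character is outside at least one range,
# so every character matches and gets removed.
either_negated = re.compile(r'[^\u0041-\u005A]|[^\u0061-\u007A]')
# One negated class holding both ranges: only non-letters match.
both_negated = re.compile(r'[^\u0041-\u005A\u0061-\u007A]')

removed_all = either_negated.sub('', test_str)
letters_only = both_negated.sub('', test_str)
print(repr(removed_all))
print(letters_only)

# The full whitelist from the question, folded into a single negated class.
allowed = (r'\u0041-\u005A\u0061-\u007A\u00C0-\u00D6\u00D8-\u00F6'
           r'\u00F8-\u012F\u0131\u0386\u0388-\u038A')
not_allowed = re.compile('[^' + allowed + ']')
cleaned = not_allowed.sub('', "na\u00efve caf\u00e9!")
print(cleaned)
```

The last call keeps the accented letters and drops only the space and the exclamation mark, since both accented characters fall inside the \u00D8-\u00F6 range.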
You are replacing every character that matches [^\u0041-\u005A] or matches [^\u0061-\u007A], that is, every character that falls outside the first range or outside the second range (due to the ^).
No character can be inside both ranges at once, so at least one alternative is always true and every character gets replaced by ''.
Use ur'[^\u0041-\u005A\u0061-\u007A]' instead (both ranges inside one [...]).