Regex conditionally remove text within parens - python

I was wondering how to remove the following text from a string in python using regular expressions.
string = "Hello (John)"
(magic regex)
string = "Hello "
However, I only want to remove the text in the parens if it contains the substring "John". So for example,
string = "Hello (Sally)"
(magic regex)
string = "Hello (Sally)"
Is this possible? Thanks!

This should be the gist of what you want:
>>> from re import sub
>>> mystr = "Hello (John)"
>>> sub("(?s)\(.*?John.*?\)", "", mystr)
'Hello '
>>> mystr = "Hello (Sally)"
>>> sub("(?s)\(.*?John.*?\)", "", mystr)
'Hello (Sally)'
>>> mystr = "Hello (John) My John (Sally)"
>>> sub("(?s)\(.*?John.*?\)", "", mystr)
'Hello My John (Sally)'
>>>
Breakdown:
(?s) # Dot-all flag to have . match newline characters
\( # Opening parenthesis
.*? # Zero or more characters matching non-greedily
John # Target
.*? # Zero or more characters matching non-greedily
\) # Closing parenthesis

If you to just remove all instances of John, you can do:
string = "Hello (John)"
string.replace("(John)", "")
print(string) # Prints "Hello "

import re
REGEX = re.compile(r'\(([^)]+)\)')
def replace(match):
if 'John' in match.groups()[0]:
return ''
return '(' + match.groups()[0] + ')'
my_string = 'Hello (John)'
print REGEX.sub(replace, my_string)
my_string = 'Hello (test John string)'
print REGEX.sub(replace, my_string)
my_string = 'Hello (Sally)'
print REGEX.sub(replace, my_string)
Hello
Hello
Hello (Sally)

Related

RegEx for matching capital letters and numbers

Hi I have a lot of corpus I parse them to extract all patterns:
like how to extract all patterns like: AP70, ML71, GR55, etc..
and all patterns for a sequence of words that start with capital letter like: Hello Little Monkey, How Are You, etc..
For the first case I did this regexp and don't get all matches:
>>> p = re.compile("[A-Z]+[0-9]+")
>>> res = p.search("aze azeaz GR55 AP1 PM89")
>>> res
<re.Match object; span=(10, 14), match='GR55'>
and for the second one:
>>> s = re.compile("[A-Z]+[a-z]+\s[A-Z]+[a-z]+\s[A-Z]+[a-z]+")
>>> resu = s.search("this is a test string, Hello Little Monkey, How Are You ?")
>>> resu
<re.Match object; span=(23, 42), match='Hello Little Monkey'>
>>> resu.group()
'Hello Little Monkey'
it's seems working but I want to get all matches when parsing a whole 'big' line.
Try these 2 regex:
(for safety, they are enclosed by whitespace/comma boundary's)
>>> import re
>>> teststr = "aze azeaz GR55 AP1 PM89"
>>> res = re.findall(r"(?<![^\s,])[A-Z]+[0-9]+(?![^\s,])", teststr)
>>> print(res)
['GR55', 'AP1', 'PM89']
>>>
Readable regex
(?<! [^\s,] )
[A-Z]+ [0-9]+
(?! [^\s,] )
and
>>> import re
>>> teststr = "this is a test string, ,Hello Little Monkey, How Are You ?"
>>> res = re.findall(r"(?<![^\s,])[A-Z]+[a-z]+(?:\s[A-Z]+[a-z]+){1,}(?![^\s,])", teststr)
>>> print(res)
['Hello Little Monkey', 'How Are You']
>>>
Readable regex
(?<! [^\s,] )
[A-Z]+ [a-z]+
(?: \s [A-Z]+ [a-z]+ ){1,}
(?! [^\s,] )
This expression might help you to do so, or design one. It seems you wish that your expression would contain at least one [A-Z] and at least one [0-9]:
(?=[A-Z])(?=.+[0-9])([A-Z0-9]+)
Graph
This graph shows how your expression would work, and you can test more in this link:
Example Code:
This code shows how the expression would work in Python:
# -*- coding: UTF-8 -*-
import re
string = "aze azeaz GR55 AP1 PM89"
expression = r'(?=[A-Z])(?=.+[0-9])([A-Z0-9]+)'
match = re.search(expression, string)
if match:
print("YAAAY! \"" + match.group(1) + "\" is a match 💚💚💚 ")
else:
print('🙀 Sorry! No matches! Something is not right! Call 911 👮')
Example Output
YAAAY! "GR55" is a match 💚💚💚
Performance
This JavaScript snippet shows the performance of your expression using a simple 1-million times for loop.
repeat = 1000000;
start = Date.now();
for (var i = repeat; i >= 0; i--) {
var string = 'aze azeaz GR55 AP1 PM89';
var regex = /(.*?)(?=[A-Z])(?=.+[0-9])([A-Z0-9]+)/g;
var match = string.replace(regex, "$2 ");
}
end = Date.now() - start;
console.log("YAAAY! \"" + match + "\" is a match 💚💚💚 ");
console.log(end / 1000 + " is the runtime of " + repeat + " times benchmark test. 😳 ");

regex between some character

input:
[1] string1 [2] string 2[3]string3 [5]string4
output:
string1
string 2
string 4
how to resolve the case as "input" and the result "output" with "regex" python?
Your instructions are a bit vague, that said - using the example input you provided:
import re
line = '[1] string1 [2] string 2[3]string3 [5]string4'
matches = re.match(r'\[1\] (.*?)\[2\] (.*?)\[3\](.*?)\[5\](.*)', line)
print matches.group(1) # prints string1
print matches.group(2) # prints string 2
print matches.group(3) # prints string3
print matches.group(4) # prints string4
It seems like you don't want the value of an index if it is preceded by a character other than white space or a starting anchor ,
>>> s = "[1] string1 [2] string 2[3]string3 [5]string4"
>>> import regex
>>> m = regex.findall(r'(?<= |^)\[\d+\]\s*([^\[]*)', s)
>>> for i in m:
... print i
...
string1
string 2
string4

Python: Modify Part of a String

I am taking an input string that is all one continuous group of letters and splitting it into a sentence. The problem is that as a beginner I can't figure out how to modify the string to ONLY capitalize the first letter and convert the others to lowercase. I know the string.lower but that converts everything to lowercase. Any ideas?
# This program asks user for a string run together
# with each word capitalized and gives back the words
# separated and only the first word capitalized
import re
def main():
# ask the user for a string
string = input( 'Enter some words each one capitalized, run together without spaces ')
for ch in string:
if ch.isupper() and not ch.islower():
newstr = re.sub('[A-Z]',addspace,string)
print(newstr)
def addspace(m) :
return ' ' + m.group(0)
#call the main function
main()
You can use capitalize():
Return a copy of the string with its first character capitalized and
the rest lowercased.
>>> s = "hello world"
>>> s.capitalize()
'Hello world'
>>> s = "hello World"
>>> s.capitalize()
'Hello world'
>>> s = "hELLO WORLD"
>>> s.capitalize()
'Hello world'
Unrelated example. To capitalize only the first letter you can do:
>>> s = 'hello'
>>> s = s[0].upper()+s[1:]
>>> print s
Hello
>>> s = 'heLLO'
>>> s = s[0].upper()+s[1:]
>>> print s
HeLLO
For a whole string, you can do
>>> s = 'what is your name'
>>> print ' '.join(i[0].upper()+i[1:] for i in s.split())
What Is Your Name
[EDIT]
You can also do:
>>> s = 'Hello What Is Your Name'
>>> s = ''.join(j.lower() if i>0 else j for i,j in enumerate(s))
>>> print s
Hello what is your name
If you only want to capitalize the start of sentences (and your string has multiple sentences), you can do something like:
>>> sentences = "this is sentence one. this is sentence two. and SENTENCE three."
>>> split_sentences = sentences.split('.')
>>> '. '.join([s.strip().capitalize() for s in split_sentences])
'This is sentence one. This is sentence two. And sentence three. '
If you don't want to change the case of the letters that don't start the sentence, then you can define your own capitalize function:
>>> def my_capitalize(s):
if s: # check that s is not ''
return s[0].upper() + s[1:]
return s
and then:
>>> '. '.join([my_capitalize(s.strip()) for s in split_sentences])
'This is sentence one. This is sentence two. And SENTENCE three. '

Python - How to clear spaces from a text

In Python, I have a lot of strings, containing spaces.
I would like to clear all spaces from the text, except if it is in quotation marks.
Example input:
This is "an example text" containing spaces.
And I want to get:
Thisis"an example text"containingspaces.
line.split() is not good, I think, because it clears all of spaces from the text.
What do you recommend?
For the simple case that only " are used as quotes:
>>> import re
>>> s = 'This is "an example text" containing spaces.'
>>> re.sub(r' (?=(?:[^"]*"[^"]*")*[^"]*$)', "", s)
'Thisis"an example text"containingspaces.'
Explanation:
[ ] # Match a space
(?= # only if an even number of spaces follows --> lookahead
(?: # This is true when the following can be matched:
[^"]*" # Any number of non-quote characters, then a quote, then
[^"]*" # the same thing again to get an even number of quotes.
)* # Repeat zero or more times.
[^"]* # Match any remaining non-quote characters
$ # and then the end of the string.
) # End of lookahead.
There is probably a more elegant solution than this, but:
>>> test = "This is \"an example text\" containing spaces."
>>> '"'.join([x if i % 2 else "".join(x.split())
for i, x in enumerate(test.split('"'))])
'Thisis"an example text"containingspaces.'
We split the text on quotes, then iterate through them in a list comprehension. We remove the spaces by splitting and rejoining if the index is odd (not inside quotes), and don't if it is even (inside quotes). We then rejoin the whole thing with quotes.
Using re.findall is probably the more easily understood/flexible method:
>>> s = 'This is "an example text" containing spaces.'
>>> ''.join(re.findall(r'(?:".*?")|(?:\S+)', s))
'Thisis"an example text"containingspaces.'
You could (ab)use the csv.reader:
>>> import csv
>>> ''.join(next(csv.reader([s.replace('"', '"""')], delimiter=' ')))
'Thisis"an example text"containingspaces.'
Or using re.split:
>>> ''.join(filter(None, re.split(r'(?:\s*(".*?")\s*)|[ ]', s)))
'Thisis"an example text"containingspaces.'
Use regular expressions!
import cStringIO, re
result = cStringIO.StringIO()
regex = re.compile('("[^"]*")')
text = 'This is "an example text" containing spaces.'
for part in regex.split(text):
if part and part[0] == '"':
result.write(part)
else:
result.write(part.replace(" ", ""))
return result.getvalue()
You can do this with csv as well:
import csv
out=[]
for e in csv.reader('This is "an example text" containing spaces. '):
e=''.join(e)
if e==' ': continue
if ' ' in e: out.extend('"'+e+'"')
else: out.extend(e)
print ''.join(out)
Prints Thisis"an example text"containingspaces.
'"'.join(v if i%2 else v.replace(' ', '') for i, v in enumerate(line.split('"')))
quotation_mark = '"'
space = " "
example = 'foo choo boo "blaee blahhh" didneid ei did '
formated_example = ''
if example[0] == quotation_mark:
inside_quotes = True
else:
inside_quotes = False
for character in example:
if inside_quotes != True:
formated_example += character
else:
if character != space:
formated_example += character
if character == quotation_mark:
if inside_quotes == True:
inside_quotes = False
else:
inside_quotes = True
print formated_example

How do I trim whitespace from a string?

How do I remove leading and trailing whitespace from a string in Python?
" Hello world " --> "Hello world"
" Hello world" --> "Hello world"
"Hello world " --> "Hello world"
"Hello world" --> "Hello world"
To remove all whitespace surrounding a string, use .strip(). Examples:
>>> ' Hello '.strip()
'Hello'
>>> ' Hello'.strip()
'Hello'
>>> 'Bob has a cat'.strip()
'Bob has a cat'
>>> ' Hello '.strip() # ALL consecutive spaces at both ends removed
'Hello'
Note that str.strip() removes all whitespace characters, including tabs and newlines. To remove only spaces, specify the specific character to remove as an argument to strip:
>>> " Hello\n ".strip(" ")
'Hello\n'
To remove only one space at most:
def strip_one_space(s):
if s.endswith(" "): s = s[:-1]
if s.startswith(" "): s = s[1:]
return s
>>> strip_one_space(" Hello ")
' Hello'
As pointed out in answers above
my_string.strip()
will remove all the leading and trailing whitespace characters such as \n, \r, \t, \f, space .
For more flexibility use the following
Removes only leading whitespace chars: my_string.lstrip()
Removes only trailing whitespace chars: my_string.rstrip()
Removes specific whitespace chars: my_string.strip('\n') or my_string.lstrip('\n\r') or my_string.rstrip('\n\t') and so on.
More details are available in the docs.
strip is not limited to whitespace characters either:
# remove all leading/trailing commas, periods and hyphens
title = title.strip(',.-')
This will remove all leading and trailing whitespace in myString:
myString.strip()
You want strip():
myphrases = [" Hello ", " Hello", "Hello ", "Bob has a cat"]
for phrase in myphrases:
print(phrase.strip())
This can also be done with a regular expression
import re
input = " Hello "
output = re.sub(r'^\s+|\s+$', '', input)
# output = 'Hello'
Well seeing this thread as a beginner got my head spinning. Hence came up with a simple shortcut.
Though str.strip() works to remove leading & trailing spaces it does nothing for spaces between characters.
words=input("Enter the word to test")
# If I have a user enter discontinous threads it becomes a problem
# input = " he llo, ho w are y ou "
n=words.strip()
print(n)
# output "he llo, ho w are y ou" - only leading & trailing spaces are removed
Instead use str.replace() to make more sense plus less error & more to the point.
The following code can generalize the use of str.replace()
def whitespace(words):
r=words.replace(' ','') # removes all whitespace
n=r.replace(',','|') # other uses of replace
return n
def run():
words=input("Enter the word to test") # take user input
m=whitespace(words) #encase the def in run() to imporve usability on various functions
o=m.count('f') # for testing
return m,o
print(run())
output- ('hello|howareyou', 0)
Can be helpful while inheriting the same in diff. functions.
In order to remove "Whitespace" which causes plenty of indentation errors when running your finished code or programs in Pyhton. Just do the following;obviously if Python keeps telling that the error(s) is indentation in line 1,2,3,4,5, etc..., just fix that line back and forth.
However, if you still get problems about the program that are related to typing mistakes, operators, etc, make sure you read why error Python is yelling at you:
The first thing to check is that you have your
indentation right. If you do, then check to see if you have
mixed tabs with spaces in your code.
Remember: the code
will look fine (to you), but the interpreter refuses to run it. If
you suspect this, a quick fix is to bring your code into an
IDLE edit window, then choose Edit..."Select All from the
menu system, before choosing Format..."Untabify Region.
If you’ve mixed tabs with spaces, this will convert all your
tabs to spaces in one go (and fix any indentation issues).
I could not find a solution to what I was looking for so I created some custom functions. You can try them out.
def cleansed(s: str):
""":param s: String to be cleansed"""
assert s is not (None or "")
# return trimmed(s.replace('"', '').replace("'", ""))
return trimmed(s)
def trimmed(s: str):
""":param s: String to be cleansed"""
assert s is not (None or "")
ss = trim_start_and_end(s).replace(' ', ' ')
while ' ' in ss:
ss = ss.replace(' ', ' ')
return ss
def trim_start_and_end(s: str):
""":param s: String to be cleansed"""
assert s is not (None or "")
return trim_start(trim_end(s))
def trim_start(s: str):
""":param s: String to be cleansed"""
assert s is not (None or "")
chars = []
for c in s:
if c is not ' ' or len(chars) > 0:
chars.append(c)
return "".join(chars).lower()
def trim_end(s: str):
""":param s: String to be cleansed"""
assert s is not (None or "")
chars = []
for c in reversed(s):
if c is not ' ' or len(chars) > 0:
chars.append(c)
return "".join(reversed(chars)).lower()
s1 = ' b Beer '
s2 = 'Beer b '
s3 = ' Beer b '
s4 = ' bread butter Beer b '
cdd = trim_start(s1)
cddd = trim_end(s2)
clean1 = cleansed(s3)
clean2 = cleansed(s4)
print("\nStr: {0} Len: {1} Cleansed: {2} Len: {3}".format(s1, len(s1), cdd, len(cdd)))
print("\nStr: {0} Len: {1} Cleansed: {2} Len: {3}".format(s2, len(s2), cddd, len(cddd)))
print("\nStr: {0} Len: {1} Cleansed: {2} Len: {3}".format(s3, len(s3), clean1, len(clean1)))
print("\nStr: {0} Len: {1} Cleansed: {2} Len: {3}".format(s4, len(s4), clean2, len(clean2)))
If you want to trim specified number of spaces from left and right, you could do this:
def remove_outer_spaces(text, num_of_leading, num_of_trailing):
text = list(text)
for i in range(num_of_leading):
if text[i] == " ":
text[i] = ""
else:
break
for i in range(1, num_of_trailing+1):
if text[-i] == " ":
text[-i] = ""
else:
break
return ''.join(text)
txt1 = " MY name is "
print(remove_outer_spaces(txt1, 1, 1)) # result is: " MY name is "
print(remove_outer_spaces(txt1, 2, 3)) # result is: " MY name is "
print(remove_outer_spaces(txt1, 6, 8)) # result is: "MY name is"
How do I remove leading and trailing whitespace from a string in Python?
So below solution will remove leading and trailing whitespaces as well as intermediate whitespaces too. Like if you need to get a clear string values without multiple whitespaces.
>>> str_1 = ' Hello World'
>>> print(' '.join(str_1.split()))
Hello World
>>>
>>>
>>> str_2 = ' Hello World'
>>> print(' '.join(str_2.split()))
Hello World
>>>
>>>
>>> str_3 = 'Hello World '
>>> print(' '.join(str_3.split()))
Hello World
>>>
>>>
>>> str_4 = 'Hello World '
>>> print(' '.join(str_4.split()))
Hello World
>>>
>>>
>>> str_5 = ' Hello World '
>>> print(' '.join(str_5.split()))
Hello World
>>>
>>>
>>> str_6 = ' Hello World '
>>> print(' '.join(str_6.split()))
Hello World
>>>
>>>
>>> str_7 = 'Hello World'
>>> print(' '.join(str_7.split()))
Hello World
As you can see this will remove all the multiple whitespace in the string(output is Hello World for all). Location doesn't matter. But if you really need leading and trailing whitespaces, then strip() would be find.
One way is to use the .strip() method (removing all surrounding whitespaces)
str = " Hello World "
str = str.strip()
**result: str = "Hello World"**
Note that .strip() returns a copy of the string and doesn't change the underline object (since strings are immutable).
Should you wish to remove all whitespace (not only trimming the edges):
str = ' abcd efgh ijk '
str = str.replace(' ', '')
**result: str = 'abcdefghijk'
I wanted to remove the too-much spaces in a string (also in between the string, not only in the beginning or end). I made this, because I don't know how to do it otherwise:
string = "Name : David Account: 1234 Another thing: something "
ready = False
while ready == False:
pos = string.find(" ")
if pos != -1:
string = string.replace(" "," ")
else:
ready = True
print(string)
This replaces double spaces in one space until you have no double spaces any more

Categories

Resources