Replace a substring selectively inside a string - python

I have a string like this:
a = "\"java jobs in delhi\" delhi"
I want to replace delhi with "". But only delhi which lies outside the double-quotes. So, the output should look like this:
"\"java jobs in delhi\""
This is just a sample string. The substring is not necessarily "delhi", and it can occur anywhere in the input string. The order and number of quoted and unquoted parts in the string are not fixed.
.replace() replaces both occurrences of delhi. I can't use rstrip either, as the substring won't necessarily appear at the end of the string. How can I do this?

Use re.sub:
>>> import re
>>> a = "\"java jobs in delhi\" delhi"
>>> re.sub(r'\bdelhi\b(?=(?:"[^"]*"|[^"])*$)', r'', a)
'"java jobs in delhi" '
>>> re.sub(r'\bdelhi\b(?=(?:"[^"]*"|[^"])*$)', r'', a).strip()
'"java jobs in delhi"'
OR
>>> re.sub(r'("[^"]*")|delhi', lambda m: m.group(1) if m.group(1) else "", a)
'"java jobs in delhi" '
>>> re.sub(r'("[^"]*")|delhi', lambda m: m.group(1) if m.group(1) else "", a).strip()
'"java jobs in delhi"'
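The second pattern extends naturally to any search term. Here is a minimal sketch, assuming the quotes are balanced and never escaped. The helper name replace_outside_quotes is made up for illustration and is not part of the answer above:

```python
import re

def replace_outside_quotes(text, target, replacement=""):
    """Replace `target` only where it is not inside a "..." span.

    Hypothetical helper; assumes balanced, unescaped double quotes.
    """
    # Quoted spans are tried first and captured in group 1 so they can be
    # put back untouched; any other match is the bare target.
    pattern = re.compile(r'("[^"]*")|' + re.escape(target))
    return pattern.sub(
        lambda m: m.group(1) if m.group(1) is not None else replacement,
        text,
    )

a = '"java jobs in delhi" delhi'
print(replace_outside_quotes(a, "delhi").strip())
```

re.escape is used so that a search term containing regex metacharacters (a dot, for example) is still matched literally.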

As a general approach, you can use re.split and a list comprehension:
>>> import re
>>> a = "\"java jobs in delhi\" delhi \"another text\" and this"
>>> sp=re.split(r'(\"[^"]*?\")',a)
>>> ''.join([i.replace('delhi','') if '"' not in i else i for i in sp])
'"java jobs in delhi"  "another text" and this'
The re.split() function splits the text around the substrings enclosed in double quotes and, because of the capturing group, keeps those substrings in the result:
['', '"java jobs in delhi"', ' delhi ', '"another text"', ' and this']
Then you can replace delhi only in the pieces that are not enclosed in double quotes.
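The same split idea can also rely on position instead of testing each piece for a quote character. This is only a sketch under the assumption that quotes are balanced and unescaped, and the function name replace_in_unquoted_parts is hypothetical:

```python
import re

def replace_in_unquoted_parts(text, target, replacement=""):
    # With a capturing group, re.split keeps the delimiters: even indexes
    # hold the unquoted text, odd indexes hold the quoted spans.
    parts = re.split(r'("[^"]*")', text)
    parts[0::2] = [p.replace(target, replacement) for p in parts[0::2]]
    return "".join(parts)

a = '"java jobs in delhi" delhi "another text" and this'
print(replace_in_unquoted_parts(a, "delhi"))
```

Note that the space on each side of the removed word is kept, so the result contains a double space where the word used to be.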

Here is another alternative. This is a generic solution to remove any unquoted text:
def only_quoted_text(text):
    output = []
    in_quotes = False
    for letter in text:  # iterate over the argument, not the global a
        if letter == '"':
            in_quotes = not in_quotes
            output.append(letter)
        elif in_quotes:
            output.append(letter)
    return "".join(output)
a = "list of \"java jobs in delhi\" delhi and \" python jobs in mumbai \" mumbai"
print(only_quoted_text(a))
The output would be:
"java jobs in delhi"" python jobs in mumbai "
If the final closing quote is missing, everything after the last opening quote is kept as well.

Related

How to replace characters in a text by space except for list of words in python

I want to replace all characters in a text by spaces, but I want to leave a list of words.
For instance:
text = "John Thomas bought 300 shares of Acme Corp. in 2006."
list_of_words = ['Acme Corp.', 'John Thomas']
My wanted output would be:
output_text = "***********                      **********         "
I would like to change unwanted characters to spaces before I do the * replacement:
"John Thomas                      Acme Corp.         "
Right now I know how to replace only the list of words, but cannot come up with the spaces part.
rep = {key: len(key)*'*' for key in list_of_words}
rep = dict((re.escape(k), v) for k, v in rep.items())
pattern = re.compile("|".join(rep.keys()))
pattern.sub(lambda m: rep[re.escape(m.group(0))], text)
You may build a pattern like
(?s)word1|word2|wordN|(.)
When Group 1 matches, replace with a space, else, replace with the same amount of asterisks as the match text length:
import re
text = "John Thomas bought 300 shares of Acme Corp. in 2006."
list_of_words = ['Acme Corp.', 'John Thomas']
pat = "|".join(sorted(map(re.escape, list_of_words), key=len, reverse=True))
pattern = re.compile(f'{pat}|(.)', re.S)
print(pattern.sub(lambda m: " " if m.group(1) else len(m.group(0))*"*", text))
=> '***********                      **********         '
See the Python demo
Details
sorted(map(re.escape, list_of_words), key=len, reverse=True) - escapes words in list_of_words and sorts the list by length in descending order (this matters when one item is a prefix of another, which can happen with multiword items, so that the longer alternative is tried first)
"|".join(...) - build the alternatives out of list_of_words items
lambda m: " " if m.group(1) else len(m.group(0))*"*" - if Group 1 matches, replace with a space, else with the asterisks of the same length as the match length.
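To see why the length sort matters, here is a small sketch that wraps the steps above in a function and feeds it two items where one is a prefix of the other. The name mask_keep_words and its parameters are assumptions for illustration, and the sketch expects a non-empty word list:

```python
import re

def mask_keep_words(text, words, mask="*", filler=" "):
    # Longest first, so "Acme Corp." is tried before its prefix "Acme".
    alternatives = "|".join(sorted(map(re.escape, words), key=len, reverse=True))
    # Group 1 only takes part in the match when none of the words matched.
    pattern = re.compile(f"{alternatives}|(.)", re.S)
    return pattern.sub(
        lambda m: filler if m.group(1) else mask * len(m.group(0)),
        text,
    )

masked = mask_keep_words("Acme Corp. rose", ["Acme", "Acme Corp."])
print(repr(masked))
```

Because every match is replaced by a string of the same length, the output always has the same length as the input.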

Remove All Commas Between Quotes

I'm trying to remove all commas that are inside quotes (") with python:
'please,remove all the commas between quotes,"like in here, here, here!"'
                                                          ^     ^
I tried this, but it only removes the first comma inside the quotes:
re.sub(r'(".*?),(.*?")',r'\1\2','please,remove all the commas between quotes,"like in here, here, here!"')
Output:
'please,remove all the commas between quotes,"like in here here, here!"'
How can I make it remove all the commas inside the quotes?
Assuming you don't have unbalanced or escaped quotes, you can use this regex based on negative lookahead:
>>> str = r'foo,bar,"foobar, barfoo, foobarfoobar"'
>>> re.sub(r'(?!(([^"]*"){2})*[^"]*$),', '', str)
'foo,bar,"foobar barfoo foobarfoobar"'
This regex matches a comma only if it is inside double quotes, using a negative lookahead to assert that the comma is NOT followed by an even number of quotes.
Notes about the lookahead (?!...):
([^"]*"){2} finds a pair of quotes
(([^"]*"){2})* finds 0 or more pairs of quotes
[^"]*$ makes sure we don't have any more quotes after the last matched quote
So (?!...) asserts that there is not an even number of quotes ahead, thus matching only commas inside a quoted string.
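The parity argument can be checked directly. The sketch below, which assumes balanced and unescaped quotes like the answer does, lists each matched comma together with the number of quotes that follow it. The variable names are made up for illustration:

```python
import re

text = 'foo,bar,"foobar, barfoo, foobarfoobar"'

# Same lookahead as above, written with non-capturing groups.
inside = re.compile(r',(?!(?:(?:[^"]*"){2})*[^"]*$)')

# For every matched comma, record its index and how many quotes come after it.
matches = [(m.start(), text.count('"', m.end())) for m in inside.finditer(text)]
print(matches)
```

Each matched comma has an odd number of quotes after it, while the two commas before the opening quote have two quotes ahead and are skipped.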
You can pass a function as the repl argument instead of a replacement string. Just get the entire quoted string and do a simple string replace on the commas.
>>> s = 'foo,bar,"foobar, barfoo, foobarfoobar"'
>>> re.sub(r'"[^"]*"', lambda m: m.group(0).replace(',', ''), s)
'foo,bar,"foobar barfoo foobarfoobar"'
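Since the replacement function receives the whole quoted span, it can do more than one simple replace. As a rough sketch (Python 3 only, assuming balanced and unescaped quotes, with a hypothetical helper name remove_in_quotes), this deletes any of several characters inside quotes:

```python
import re

def remove_in_quotes(text, chars):
    """Delete every character listed in `chars`, but only inside "..." spans."""
    # str.maketrans with a third argument builds a table that deletes those characters.
    table = str.maketrans("", "", chars)
    return re.sub(r'"[^"]*"', lambda m: m.group(0).translate(table), text)

s = 'foo,bar,"foobar, barfoo, foobarfoobar"'
print(remove_in_quotes(s, ","))
```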
Here is another option I came up with if you don't want to use regex.
input_str = 'please,remove all the commas between quotes,"like in here, here, here!"'
def noCommas(string):
    quotes = False  # True while we are between a pair of double quotes
    output = ''
    for char in string:
        if char == '"':
            quotes = not quotes  # toggle on every quote, opening or closing
        if not quotes or char != ',':
            output += char
    return output
print(noCommas(input_str))
What about doing it without regex?
input_str = '...'
first_slice = input_str.split('"')
# Odd-indexed pieces were between quotes; remove commas only from those.
second_slice = [slc.replace(',', '') if i % 2 else slc for i, slc in enumerate(first_slice)]
result = '"'.join(second_slice)
The for-loop answer above, which walks the string character by character, is very slow if you apply it to something like a 5 MB CSV file.
The following seems to be reasonably fast and works the same way as the for loop (in this example it replaces ; with , inside the quotes):
#!/bin/python3
data = 'hoko foko; moko soko; "aaa mo; bia"; "ee mo"; "eka koka"; "koni; masa"; "co co"; ehe mo; "bi; ko"; ko ma\n "ka ku"; "ki; ko"\n "ko;ma"; "ki ma"\n"ehe;";koko'
first_split=data.split('"')
split01=[]
split02=[]
for slc in first_split[0::2]:
    split01.append(slc)
for slc in first_split[1::2]:
    slc_new=",".join(slc.split(";"))
    split02.append(slc_new)
resultlist = [item for sublist in zip(split01, split02) for item in sublist]
if len(split01) > len (split02):
    resultlist.append(split01[-1])
if len(split01) < len (split02):
    resultlist.append(split02[-1])
result='"'.join(resultlist)
print(data)
print(split01)
print(split02)
print(result)
Results in:
hoko foko; moko soko; "aaa mo; bia"; "ee mo"; "eka koka"; "koni; masa"; "co co"; ehe mo; "bi; ko"; ko ma
"ka ku"; "ki; ko"
"ko;ma"; "ki ma"
"ehe;";koko
['hoko foko; moko soko; ', '; ', '; ', '; ', '; ', '; ehe mo; ', '; ko ma\n ', '; ', '\n ', '; ', '\n', ';koko']
['aaa mo, bia', 'ee mo', 'eka koka', 'koni, masa', 'co co', 'bi, ko', 'ka ku', 'ki, ko', 'ko,ma', 'ki ma', 'ehe,']
hoko foko; moko soko; "aaa mo, bia"; "ee mo"; "eka koka"; "koni, masa"; "co co"; ehe mo; "bi, ko"; ko ma
"ka ku"; "ki, ko"
"ko,ma"; "ki ma"
"ehe,";koko

remove multiple substrings inside a string

Here's hoping somebody can shed some light on this question because it has me stumped. I have a string that looks like this:
s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"
I want this result:
abcdef ghijk lmnop qrs tuv wxyz 0123456789
Having reviewed numerous questions and answers here, the closest I have come to a solution is:
s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"
s = re.sub('\[\[.*?\|', '', s)
s = re.sub('[\]\]]', '', s)
--> abcdef ghijk lmnop wxyz 0123456789
Since not every substring within double brackets contains a pipe, the re.sub removes everything from '[[' to next '|' instead of checking within each set of double brackets.
Any assistance would be most appreciated.
What about this:
In [187]: re.sub(r'([\[|\]])|((?<=\[)\w+\s+\w+(?=\|))', '', s)
Out[187]: 'abcdef ghijk lmnop qrs tuv wxyz 0123456789'
I propose the opposite approach: instead of removing what you don't want, capture the patterns you do want. I think this makes the intent of the code clearer.
There are two patterns you wish to catch:
Case: words outside [[...]]
Pattern: any word that is either preceded by ']] ' or followed by ' [['.
Regex: (?<=\]\]\s)\w+|\w+(?=\s\[\[)
Case: words inside [[...]]
Pattern: any word that is followed by ']]'
Regex: \w+(?=\]\])
Example code
#!/usr/bin/env python
import re
s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789 "
# Raw string so the backslashes reach the regex engine unchanged.
p = re.compile(r'(?<=\]\]\s)\w+|\w+(?=\s\[\[)|\w+(?=\]\])')
print(p.findall(s))
Result:
['abcdef', 'ghijk', 'lmnop', 'qrs', 'tuv', 'wxyz', '0123456789']
>>> import re
>>> s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"
>>> re.sub(r'(\[\[[^]]+?\|)|([\[\]])', '', s)
'abcdef ghijk lmnop qrs tuv wxyz 0123456789'
This searches for and removes the following two items:
Two opening brackets followed by a bunch of stuff that isn't a closing bracket followed by a pipe.
Opening or closing brackets.
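A different technique, not taken from the answers above, is to match each [[...]] block as a whole and keep only the part after the optional pipe by using a backreference. This is a sketch that assumes the blocks are never nested and contain at most one pipe:

```python
import re

s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"

# \[\[              opening brackets
# (?:[^\]|]*\|)?    optional "label|" prefix, which is dropped
# ([^\]]*)          the text to keep (group 1)
# \]\]              closing brackets
link = re.compile(r'\[\[(?:[^\]|]*\|)?([^\]]*)\]\]')

result = link.sub(r'\1', s)
print(result)
```

Because the prefix is optional inside a single bracket pair, a block without a pipe such as [[qrs]] cannot swallow text up to a pipe in a later block.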
As a general approach with the built-in re module, you can use the following regex, which relies on look-arounds:
(?<!\[\[)\b([\w]+)\b(?!\|)|(?<=\[\[)[^|]*(?=\]\])
You can use re.finditer to get the desired result:
>>> g=re.finditer(r'(?<!\[\[)\b([\w]+)\b(?!\|)|(?<=\[\[)[^|]*(?=\]\])',s)
>>> [j.group() for j in g]
['abcdef', 'ghijk', 'lmnop', 'qrs', 'tuv', 'wxyz', '0123456789']
The regex has two parts. The first is:
(?<!\[\[)\b([\w]+)\b(?!\|)
which matches any run of word characters that is not followed by | and not preceded by [[.
The second part is:
(?<=\[\[)[^|]*(?=\]\])
which matches anything between the double brackets that contains no |.

Replacing spaces " " with \s using .replace()

I have a set of two names with spaces in them. I want to do a regex search for "George Bush" or "Barack Obama". Following this example I tried this, which gets the desired output
p = "(George\sBush|Barack\sObama)"
s = "recent Presidents George Bush and Barack Obama"
print re.findall(p,s) #Prints George Bush and Barack Obama
However, now I want to go from a list ["George Bush", "Barack Obama"] to the pattern shown above.
I tried this:
for l in list:
    p = p + "|" + l
p = p.strip("|")
p = ('.{75}(' + p + ').{75}').replace(" ", "\s")
But it gives : '.{75}(George\\sBush|Barack\\sObama).{75}'
How can I replace space characters with just "\s" instead of "\\s"?
You already have. The backslash is special and must be escaped in the representation (and should be escaped in the string), but you really do have "\s". Try printing the string instead.
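A quick way to convince yourself is to compare the repr of the string with what print shows. This is a small sketch in Python 3 syntax; the variable names are only for illustration:

```python
import re

names = ["George Bush", "Barack Obama"]

# r"\s" is two characters: a backslash followed by the letter s.
pattern = "(" + "|".join(name.replace(" ", r"\s") for name in names) + ")"

print(repr(pattern))  # the repr doubles each backslash
print(pattern)        # the real contents, with single backslashes
found = re.findall(pattern, "recent Presidents George Bush and Barack Obama")
print(found)
```

The doubled backslash only appears in the representation shown by the interactive prompt; the regex engine receives a single backslash followed by s.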

Regex pattern for matching entire word if it have a ; in the word in python

I am trying to remove some garbage from a text and would like to remove all words that have ";" between two characters. I have tried both expressions below:
r'\s.*;.*\s' and r'\s.*\W.*\s'
in this text
'the cat as;asas was wjdwi;qs at home'
And it seems to miss some white spaces, returning
'cat as;asas was wjdwi;qs at '
When I needed
'the cat was at home'
Simple solution is to not use a regex:
s = 'the cat as;asas was wjdwi;qs at home'
res = ' '.join(w for w in s.split() if ';' not in w)
# the cat was at home
You might need a more complicated check, but split it into "words" first, then apply a check to each "word"...
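As a sketch of what a more complicated check could look like, the per-word test can be passed in as a function. The names drop_words and has_inner_semicolon are hypothetical, and the predicate shown is just one possible reading of "a ; in the middle of two characters":

```python
import re

def drop_words(text, is_garbage):
    # Split on whitespace and keep only the words the predicate does not flag.
    return " ".join(word for word in text.split() if not is_garbage(word))

# Example predicate: a ';' with at least one character on each side.
has_inner_semicolon = re.compile(r'.;.').search

s = 'the cat as;asas was wjdwi;qs at home'
print(drop_words(s, has_inner_semicolon))
```

Splitting and re-joining also normalises runs of whitespace to single spaces, which may or may not be what you want.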
You can use this:
re.sub(r'(?i)\s?[a-z]+;[a-z]+\s?', ' ', yourstr)
