I want to replace all characters in a text with spaces, but keep the words from a given list.
For instance:
text = "John Thomas bought 300 shares of Acme Corp. in 2006."
list_of_words = ['Acme Corp.', 'John Thomas']
My desired output would be:
output_text = "***********                      **********         "
I would like to change unwanted characters to spaces before I do the * replacement:
"John Thomas                      Acme Corp.         "
Right now I know how to replace only the listed words, but I cannot work out the spaces part.
rep = {key: len(key)*'*' for key in list_of_words}
rep = dict((re.escape(k), v) for k, v in rep.items())
pattern = re.compile("|".join(rep.keys()))
pattern.sub(lambda m: rep[re.escape(m.group(0))], text)
You may build a pattern like
(?s)word1|word2|wordN|(.)
When Group 1 matches, replace with a space, else, replace with the same amount of asterisks as the match text length:
import re
text = "John Thomas bought 300 shares of Acme Corp. in 2006."
list_of_words = ['Acme Corp.', 'John Thomas']
pat = "|".join(sorted(map(re.escape, list_of_words), key=len, reverse=True))
pattern = re.compile(f'{pat}|(.)', re.S)
print(pattern.sub(lambda m: " " if m.group(1) else len(m.group(0))*"*", text))
=> '***********                      **********         '
Details
sorted(map(re.escape, list_of_words), key=len, reverse=True) - escapes the words in list_of_words and sorts them by length in descending order (this matters when one item is a prefix of another, because the regex engine takes the first alternative that matches)
"|".join(...) - build the alternatives out of list_of_words items
lambda m: " " if m.group(1) else len(m.group(0))*"*" - if Group 1 matches, replace with a space, else with the asterisks of the same length as the match length.
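The approach above can be wrapped in a small helper. This is only a minimal sketch: the function name mask_except and its mask and filler parameters are hypothetical and do not come from the answer, and the early return for an empty list is an added assumption about what the caller would want.

```python
import re

def mask_except(text, keep, mask="*", filler=" "):
    """Mask every kept phrase and turn every other character into filler."""
    if not keep:
        # Without any phrase, the alternation would start with an empty branch.
        return filler * len(text)
    # Longest phrases first, so a phrase wins over any shorter phrase it starts with.
    alternatives = "|".join(sorted(map(re.escape, keep), key=len, reverse=True))
    # Group 1 catches any single character that is not part of a kept phrase.
    pattern = re.compile(f"{alternatives}|(.)", re.S)
    return pattern.sub(
        lambda m: filler if m.group(1) is not None else mask * len(m.group(0)),
        text,
    )

text = "John Thomas bought 300 shares of Acme Corp. in 2006."
masked = mask_except(text, ["Acme Corp.", "John Thomas"])
print(repr(masked))
```

Because every character is replaced one for one, the result always has the same length as the input.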
I'm trying to remove all commas that are inside quotes (") with Python:
'please,remove all the commas between quotes,"like in here, here, here!"'
^ ^
I tried this, but it only removes the first comma inside the quotes:
re.sub(r'(".*?),(.*?")',r'\1\2','please,remove all the commas between quotes,"like in here, here, here!"')
Output:
'please,remove all the commas between quotes,"like in here here, here!"'
How can I make it remove all the commas inside the quotes?
Assuming you don't have unbalanced or escaped quotes, you can use this regex based on negative lookahead:
>>> import re
>>> s = r'foo,bar,"foobar, barfoo, foobarfoobar"'
>>> re.sub(r'(?!(([^"]*"){2})*[^"]*$),', '', s)
'foo,bar,"foobar barfoo foobarfoobar"'
This regex matches a comma only when it is inside double quotes: the negative lookahead asserts that the comma is NOT followed by an even number of quotes.
Notes on the lookahead (?!...):
([^"]*"){2} matches a pair of quotes
(([^"]*"){2})* matches 0 or more pairs of quotes
[^"]*$ makes sure there are no more quotes after the last matched quote
So (?!...) asserts that an even number of quotes does not lie ahead, and therefore it matches only commas inside a quoted string.
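To see the lookahead at work on the string from the question, here is a short sketch. The variable names are illustrative, and the inner groups are written as non-capturing, which does not change what is matched.

```python
import re

s = 'please,remove all the commas between quotes,"like in here, here, here!"'

# A comma matches only when an odd number of quotes follows it, that is,
# when "an even number of quotes up to the end of the string" cannot be matched.
comma_in_quotes = re.compile(r'(?!(?:(?:[^"]*"){2})*[^"]*$),')

cleaned = comma_in_quotes.sub('', s)
print(cleaned)
```

As in the answer, this assumes the quotes are balanced and never escaped.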
You can pass a function as the repl argument instead of a replacement string. Just get the entire quoted string and do a simple string replace on the commas.
>>> s = 'foo,bar,"foobar, barfoo, foobarfoobar"'
>>> re.sub(r'"[^"]*"', lambda m: m.group(0).replace(',', ''), s)
'foo,bar,"foobar barfoo foobarfoobar"'
Here is another option I came up with if you don't want to use regex.
input_str = 'please,remove all the commas between quotes,"like in here, here, here!"'

def noCommas(string):
    quotes = False  # True while we are inside a quoted section
    output = ''
    for char in string:
        if char == '"':
            quotes = not quotes  # toggle on every quote, opening or closing
        # Drop a comma only when it is inside quotes
        if char != ',' or not quotes:
            output += char
    return output

print(noCommas(input_str))
What about doing it without regex?
input_str = '...'
first_slice = input_str.split('"')
second_slice = []
for i, slc in enumerate(first_slice):
    # Odd-indexed pieces sit between quotes (assuming balanced quotes)
    second_slice.append(slc.replace(',', '') if i % 2 else slc)
result = '"'.join(second_slice)
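One way to gain some confidence in the split-based idea is to compare it with the regex version on the string from the question. The sketch below does that; both helper names are hypothetical, and balanced, unescaped quotes are assumed.

```python
import re

def strip_commas_split(text):
    parts = text.split('"')
    # Odd-indexed parts are inside quotes when the quotes are balanced.
    parts[1::2] = [part.replace(',', '') for part in parts[1::2]]
    return '"'.join(parts)

def strip_commas_regex(text):
    # Rewrite each quoted span with a function passed as the repl argument.
    return re.sub(r'"[^"]*"', lambda m: m.group(0).replace(',', ''), text)

sample = 'please,remove all the commas between quotes,"like in here, here, here!"'
print(strip_commas_split(sample))
print(strip_commas_split(sample) == strip_commas_regex(sample))
```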
The answer above that loops over the string character by character is very slow if you want to apply it to a 5 MB CSV file.
This seems to be reasonably fast and provides the same result as the for loop:
#!/bin/python3
data = 'hoko foko; moko soko; "aaa mo; bia"; "ee mo"; "eka koka"; "koni; masa"; "co co"; ehe mo; "bi; ko"; ko ma\n "ka ku"; "ki; ko"\n "ko;ma"; "ki ma"\n"ehe;";koko'
first_split = data.split('"')
split01 = []
split02 = []
for slc in first_split[0::2]:
    split01.append(slc)
for slc in first_split[1::2]:
    slc_new = ",".join(slc.split(";"))
    split02.append(slc_new)
resultlist = [item for sublist in zip(split01, split02) for item in sublist]
if len(split01) > len(split02):
    resultlist.append(split01[-1])
if len(split01) < len(split02):
    resultlist.append(split02[-1])
result = '"'.join(resultlist)
print(data)
print(split01)
print(split02)
print(result)
Results in:
hoko foko; moko soko; "aaa mo; bia"; "ee mo"; "eka koka"; "koni; masa"; "co co"; ehe mo; "bi; ko"; ko ma
"ka ku"; "ki; ko"
"ko;ma"; "ki ma"
"ehe;";koko
['hoko foko; moko soko; ', '; ', '; ', '; ', '; ', '; ehe mo; ', '; ko ma\n ', '; ', '\n ', '; ', '\n', ';koko']
['aaa mo, bia', 'ee mo', 'eka koka', 'koni, masa', 'co co', 'bi, ko', 'ka ku', 'ki, ko', 'ko,ma', 'ki ma', 'ehe,']
hoko foko; moko soko; "aaa mo, bia"; "ee mo"; "eka koka"; "koni, masa"; "co co"; ehe mo; "bi, ko"; ko ma
"ka ku"; "ki, ko"
"ko,ma"; "ki ma"
"ehe,";koko
Here's hoping somebody can shed some light on this question because it has me stumped. I have a string that looks like this:
s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"
I want this result:
abcdef ghijk lmnop qrs tuv wxyz 0123456789
Having reviewed numerous questions and answers here, the closest I have come to a solution is:
s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"
s = re.sub(r'\[\[.*?\|', '', s)
s = re.sub(r'[\]\]]', '', s)
--> abcdef ghijk lmnop wxyz 0123456789
Since not every substring within double brackets contains a pipe, the re.sub removes everything from '[[' to next '|' instead of checking within each set of double brackets.
Any assistance would be most appreciated.
What about this:
In [187]: re.sub(r'([\[|\]])|((?<=\[)\w+\s+\w+(?=\|))', '', s)
Out[187]: 'abcdef ghijk lmnop qrs tuv wxyz 0123456789'
I propose the opposite approach: instead of removing what you don't want, capture the patterns you do want. I think this makes the intent of the code clearer.
There are two patterns you wish to catch:
Case: words outside [[...]]
Pattern: any word that is either preceded by ']] ' or followed by ' [['.
Regex: (?<=\]\]\s)\w+|\w+(?=\s\[\[)
Case: words inside [[...]]
Pattern: any word that is followed by ']]'
Regex: \w+(?=\]\])
Example code
#!/usr/bin/env python
import re

s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789 "

p = re.compile(r'(?<=\]\]\s)\w+|\w+(?=\s\[\[)|\w+(?=\]\])')
print(p.findall(s))
Result:
['abcdef', 'ghijk', 'lmnop', 'qrs', 'tuv', 'wxyz', '0123456789']
>>> import re
>>> s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"
>>> re.sub(r'(\[\[[^]]+?\|)|([\[\]])', '', s)
'abcdef ghijk lmnop qrs tuv wxyz 0123456789'
This searches for and removes the following two items:
Two opening brackets followed by a bunch of stuff that isn't a closing bracket followed by a pipe.
Opening or closing brackets.
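A different way to get the same result, not taken from any of the answers, is to match each [[...]] block as a whole and keep only its last part through a capture group. The sketch below assumes blocks are never nested and contain no ] character; the name link is illustrative.

```python
import re

s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"

# \[\[            opening brackets
# (?:[^\]|]*\|)?  optional "prefix|" part, which is discarded
# ([^\]]*)        the text to keep (group 1)
# \]\]            closing brackets
link = re.compile(r'\[\[(?:[^\]|]*\|)?([^\]]*)\]\]')

result = link.sub(r'\1', s)
print(result)
```

Since the pipe is optional inside a single block, a block without a pipe such as [[qrs]] cannot swallow text up to the pipe of a later block.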
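As a more general regex for the built-in re module, you can use the following, which relies on look-arounds:
(?<!\[\[)\b([\w]+)\b(?!\|)|\[\[([^|]*)\]\]
You can use re.finditer to get the desired result (here the second alternative is written with look-arounds so that the brackets are left out of the match):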
>>> g=re.finditer(r'(?<!\[\[)\b([\w]+)\b(?!\|)|(?<=\[\[)[^|]*(?=\]\])',s)
>>> [j.group() for j in g]
['abcdef', 'ghijk', 'lmnop', 'qrs', 'tuv', 'wxyz', '0123456789']
The preceding regex consists of two parts. The first is:
(?<!\[\[)\b([\w]+)\b(?!\|)
which matches any run of word characters that is not followed by | and not preceded by [[.
The second part is:
(?<=\[\[)[^|]*(?=\]\])
which matches anything between [[ and ]] that does not contain |.
I have a set of two names with spaces in them. I want to do a regex search for "George Bush" or "Barack Obama". Following this example I tried this, which gets the desired output
p = "(George\sBush|Barack\sObama)"
s = "recent Presidents George Bush and Barack Obama"
print(re.findall(p, s))  # prints ['George Bush', 'Barack Obama']
However, now I want to go from a list ["George Bush", "Barack Obama"] to the pattern shown above.
I tried this:
for l in list:
    p = p + "|" + l
p = p.strip("|")
p = ('.{75}(' + p + ').{75}').replace(" ", "\s")
But it gives: '.{75}(George\\sBush|Barack\\sObama).{75}'
How can I replace space characters with just "\s" instead of "\\s"?
You already have. The backslash is special and must be escaped in the representation (and should be escaped in the string), but you really do have "\s". Try printing the string instead.
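A short sketch makes the difference visible and also shows one way to build the alternation from a list. The variable names are illustrative, and escaping each word with re.escape is an addition that the question did not ask for.

```python
import re

names = ["George Bush", "Barack Obama"]

# Escape each word separately, then join the words of a name with \s
# and the names with |.
alternatives = "|".join(r"\s".join(map(re.escape, name.split())) for name in names)
pattern = "(" + alternatives + ")"

print(repr(pattern))  # the representation doubles each backslash
print(pattern)        # the actual characters: one backslash per \s
print(re.findall(pattern, "recent Presidents George Bush and Barack Obama"))
```

The string holds a single backslash before each s; only its repr, which is what the interactive prompt echoes, shows two.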
I am trying to remove some garbage from a text and would like to remove all words that have ";" between two characters. I have tried both expressions below
r'\s.*;.*\s' and r'\s.*\W.*\s'
in this text
'the cat as;asas was wjdwi;qs at home'
And it seems to miss some white spaces, returning
'cat as;asas was wjdwi;qs at '
When I needed
'the cat was at home'
A simple solution is to not use a regex:
s = 'the cat as;asas was wjdwi;qs at home'
res = ' '.join(w for w in s.split() if ';' not in w)
# the cat was at home
You might need a more complicated check, but split it into "words" first, then apply a check to each "word"...
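As a sketch of what such a more complicated check could look like, the version below treats a word as garbage when any non-word character sits between two word characters. That rule and the names GARBAGE and drop_garbage are assumptions made for illustration; note that the rule would also drop words such as "don't" or "well-known".

```python
import re

# Hypothetical rule: a non-word character between two word characters,
# as in "as;asas", marks the whole word as garbage.
GARBAGE = re.compile(r"\w\W\w")

def drop_garbage(text):
    # Split into whitespace-separated words first, then test each word.
    return " ".join(w for w in text.split() if not GARBAGE.search(w))

print(drop_garbage('the cat as;asas was wjdwi;qs at home'))
```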
You can use this:
re.sub(r'(?i)\s?[a-z]+;[a-z]+\s?', ' ', yourstr)