remove multiple substrings inside a string - python

Here's hoping somebody can shed some light on this question because it has me stumped. I have a string that looks like this:
s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"
I want this result:
abcdef ghijk lmnop qrs tuv wxyz 0123456789
Having reviewed numerous questions and answers here, the closest I have come to a solution is:
s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"
s = re.sub('\[\[.*?\|', '', s)
s = re.sub('[\]\]]', '', s)
--> abcdef ghijk lmnop wxyz 0123456789
Since not every substring within double brackets contains a pipe, the re.sub removes everything from '[[' to next '|' instead of checking within each set of double brackets.
Any assistance would be most appreciated.

What about this:
In [187]: re.sub(r'([\[|\]])|((?<=\[)\w+\s+\w+(?=\|))', '', s)
Out[187]: 'abcdef ghijk lmnop qrs tuv wxyz 0123456789'

I propose the opposite approach: instead of removing the unwanted parts, capture the patterns you want. I think this makes the intent of the code clearer.
There are two patterns you wish to catch:
Case: words outside [[...]]
Pattern: any word that is either preceded by ']] ' or followed by ' [['.
Regex: (?<=\]\]\s)\w+|\w+(?=\s\[\[)
Case: words inside [[...]]
Pattern: any word that is followed by ']]'.
Regex: \w+(?=\]\])
Example code
#!/usr/bin/env python
import re

s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789 "

# Match words after ']] ', words before ' [[', and words directly before ']]'.
p = re.compile(r'(?<=\]\]\s)\w+|\w+(?=\s\[\[)|\w+(?=\]\])')
print(p.findall(s))
Result:
['abcdef', 'ghijk', 'lmnop', 'qrs', 'tuv', 'wxyz', '0123456789']

>>> import re
>>> s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"
>>> re.sub(r'(\[\[[^]]+?\|)|([\[\]])', '', s)
'abcdef ghijk lmnop qrs tuv wxyz 0123456789'
This searches for and removes the following two items:
Two opening brackets, then one or more characters that aren't a closing bracket, then a pipe.
Opening or closing brackets.
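A possible variation on the same idea, offered as a sketch and not as part of the answer above: match each [[...]] block as a whole and keep only the text after the optional pipe. The function name unwrap_links is made up for illustration.

```python
import re

def unwrap_links(text):
    # Hypothetical helper: replace each [[prefix|label]] or [[label]]
    # with just the label. Assumes blocks are not nested.
    # (?:[^\]|]*\|)?  optionally consumes "prefix|"
    # ([^\]]*)        captures the label that is kept
    return re.sub(r'\[\[(?:[^\]|]*\|)?([^\]]*)\]\]', r'\1', text)

s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"
result = unwrap_links(s)
print(result)
```

Because the brackets are only removed as part of a complete [[...]] match, a stray single bracket elsewhere in the string is left alone, which the two-alternative version would strip.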

As a general solution using the built-in re module, you can use the following regex, which relies on look-arounds:
(?<!\[\[)\b([\w]+)\b(?!\|)|(?<=\[\[)[^|]*(?=\]\])
You can use re.finditer to get the desired result:
>>> g=re.finditer(r'(?<!\[\[)\b([\w]+)\b(?!\|)|(?<=\[\[)[^|]*(?=\]\])',s)
>>> [j.group() for j in g]
['abcdef', 'ghijk', 'lmnop', 'qrs', 'tuv', 'wxyz', '0123456789']
The preceding regex consists of two parts. The first is:
(?<!\[\[)\b([\w]+)\b(?!\|)
which matches any run of word characters that is not followed by | and not preceded by [[.
The second part is:
(?<=\[\[)[^|]*(?=\]\])
which matches anything between [[ and ]] that contains no |.

Related

RegEx or Pygrok Pattern match

I have text like this
Example:
"visa code: ab c master number: efg discover: i j k"
Output should be like this:
abc, efg, ijk
Is there a way to use a Grok pattern match or a regex to get the 3 characters after each ":" (ignoring spaces)?
You can start with this:
>>> import re
>>> p = re.compile(r"\b((?:\w\s*){2}\w)\b")
>>> re.findall(p, "visa code: ab c master number: efg discover: i j k")
['ab c', 'efg', 'i j k']
But you have more work to do. For example, nobody can guess what you mean - exactly - by "characters".
Beyond that, pattern matching systems match strings, but do not convert them. You'll have to remove spaces you don't want via some other means (which should be easy).
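As a rough sketch of that second step, assuming "characters" means word characters (\w) and that the wanted group always follows a colon, the match can be anchored on ":" and the spaces removed afterwards. The variable names here are illustrative only.

```python
import re

text = "visa code: ab c master number: efg discover: i j k"

# Assumption: take three word characters after each colon,
# allowing optional whitespace between them.
raw = re.findall(r":\s*((?:\w\s*){2}\w)", text)

# The regex only locates the text; whitespace is removed in a separate step.
codes = ["".join(chunk.split()) for chunk in raw]
print(", ".join(codes))
```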

Python Regex Findall non-greedy

I am relatively new to regex and I seem to be struggling to understand the greedy vs non-greedy search (if that is indeed the issue here). Let's say I have a simple text such as this:
# numbers: 4 A 3 B
My goal would be to run a findall to get something like the following output:
['# numbers:', '4 A 3 B', ' 4 A', ' 3 B']
So if I use the following regex with findall, I would expect it to work:
matches = re.findall(r"(# numbers:)(((?:\s\d)(?:\s\D))*)", "# numbers: 4 A 3 B")
However, the actual output is this:
[('# numbers:', ' 4 A 3 B', ' 3 B')]
Can someone explain why the group ((?:\s\d)(?:\s\D)) is only matching ' 3 B' and not also ' 4 A'? I assume it has something to do with the greedy vs. non-greedy behaviour of *. Is that true? And if so, could you explain how to solve this issue?
Thanks in advance!
I would use re.findall here, twice. First, extract the digit/non digit text series, then use re.findall a second time to find the tuples:
import re
inp = "# numbers: 4 A 3 B"
text = re.findall(r'^# numbers:\s+(.*)$', inp)[0]
matches = re.findall(r'(\d+)\s+(\D+)', text)
print(matches) # [('4', 'A '), ('3', 'B')]
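On the "why": this is not a greedy vs. non-greedy issue. When a capturing group sits inside a repetition, Python's re keeps only the last text that group matched, so ' 4 A' is overwritten by ' 3 B'. A minimal sketch that gets close to the list asked for in the question, by capturing the whole repeated run once and then splitting it in a second pass (the variable names are illustrative):

```python
import re

inp = "# numbers: 4 A 3 B"

# Capture the header and the whole repeated run; the inner part is non-capturing.
m = re.match(r"(# numbers:)((?:\s\d\s\D)*)", inp)
header, body = m.group(1), m.group(2)

# Second pass: pull each " digit letter" pair out of the run.
pairs = re.findall(r"\s\d\s\D", body)

result = [header, body] + pairs
print(result)
```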

Using Regular expressions to match a portion of the string?(python)

What regular expression can I use to match the individual genes in the following gene list string:
GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8
I tried: GENE_List:((( \w+).(\w+));)+* but it only captures the last gene.
Given:
>>> s="GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8"
You can use Python string methods to do:
>>> s.split(': ')[1].split('; ')
['F59A7.7', 'T25D3.3', 'F13B12.4', 'cysl-1', 'cysl-2', 'cysl-3', 'cysl-4', 'F01D4.8']
For a regex:
(?<=[:;]\s)([^\s;]+)
Or, in Python:
>>> re.findall(r'(?<=[:;]\s)([^\s;]+)', s)
['F59A7.7', 'T25D3.3', 'F13B12.4', 'cysl-1', 'cysl-2', 'cysl-3', 'cysl-4', 'F01D4.8']
You can use the following:
\s([^;\s]+)
The captured group, ([^;\s]+), will contain the desired substrings, each of which is preceded by whitespace (\s).
>>> s = 'GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8'
>>> re.findall(r'\s([^;\s]+)', s)
['F59A7.7', 'T25D3.3', 'F13B12.4', 'cysl-1', 'cysl-2', 'cysl-3', 'cysl-4', 'F01D4.8']
UPDATE
It's in fact much simpler:
[^\s;]+
However, first take a substring so that only the part you need remains (the genes, without the GENE_LIST: prefix).
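A short sketch of that two-step idea, assuming the line always starts with the literal prefix "GENE_LIST:" (the names prefix and gene_part are made up for illustration):

```python
import re

s = "GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8"

# Step 1: cut off the label so that it cannot be matched as a gene.
prefix = "GENE_LIST:"
gene_part = s[len(prefix):] if s.startswith(prefix) else s

# Step 2: every run of characters that is neither whitespace nor ';' is a gene.
genes = re.findall(r"[^\s;]+", gene_part)
print(genes)
```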
import re
string = "GENE_LIST: F59A7.7; T25D3.3; F13B12.4; cysl-1; cysl-2; cysl-3; cysl-4; F01D4.8"
re.findall(r"([^;\s]+)(?:;|$)", string)
The output is:
['F59A7.7',
'T25D3.3',
'F13B12.4',
'cysl-1',
'cysl-2',
'cysl-3',
'cysl-4',
'F01D4.8']

What is the regex to remove the content inside brackets?

I want to do something like this,
Alice in the Wonderland [1865] [Charles Lutwidge Dodgson] Rating 4.5/5
to
Alice in the Wonderland Rating 4.5/5
What is the regex command to achieve this ?
You want to escape the brackets and use the non-greedy modifier ? with the catch-all expression .+.
>>> s = 'Alice in the Wonderland [1865] [Charles Lutwidge Dodgson] Rating 4.5/5'
>>> re.sub(r'\[.+?\]\s*', '', s)
'Alice in the Wonderland Rating 4.5/5'
Explanations:
The . means any character and + one or more occurrences. This expression is "greedy" and will match everything (the rest of the string including any closing bracket) so you need the non-greedy modifier ? to make it stop at the closing bracket. Note that x? means zero or one occurrences of "x", so context matters.
Change it to .* if you also want to catch "[]"; * means zero or more occurrences.
The \s matches any whitespace character.
You can use the "negated" character class instead of .+? - the [^x] means not "x", but the resulting expression is harder to read: \[[^\]]+\].
Justhalf's observation is very pertinent: this one works as long as brackets are not nested.
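To see the greedy vs. non-greedy difference from the explanations above, here is a small sketch on a made-up string (not the one from the question, where both forms happen to give the same result because nothing sits between the two bracketed parts):

```python
import re

sample = "a [1] b [2] c"  # hypothetical input, chosen so the two forms differ

# Greedy: .+ runs to the LAST closing bracket, swallowing " b " as well.
greedy = re.sub(r"\[.+\]\s*", "", sample)

# Non-greedy: .+? stops at the FIRST closing bracket it can.
lazy = re.sub(r"\[.+?\]\s*", "", sample)

print(repr(greedy))
print(repr(lazy))
```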
Regex is not good at matching an arbitrary number of nested opening and closing brackets, but if they are not nested, it can be done with this regex:
import re
string = 'Alice in the Wonderland [1865] [Charles Lutwidge Dodgson] Rating 4.5/5'
re.sub(r'\[[^\]]+\]\s*', '', string)
Note that it will also remove any space after the brackets.
You could use re.sub:
>>> re.sub(r'\[[^]]*\]\s?' , '', 'Alice in the Wonderland [1865] [Charles Lutwidge Dodgson] Rating 4.5/5')
'Alice in the Wonderland Rating 4.5/5'
>>>
If you prefer lots of [] in your regex :)
>>> import re
>>> s = 'Alice in the Wonderland [1865] [Charles Lutwidge Dodgson] Rating 4.5/5'
>>> re.sub(r'[[].*?[]]\s*', '', s)
'Alice in the Wonderland Rating 4.5/5'
>>> re.sub(r'[[][^]]*.\s*', '', s)
'Alice in the Wonderland Rating 4.5/5'
Reiterating what @justhalf said: Python regexes are no good for nested brackets.
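If nesting does turn up, one way around the limitation is to skip regex and track bracket depth by hand. This is only a sketch under the assumption that brackets are balanced; the function name remove_bracketed is invented for this example.

```python
def remove_bracketed(text):
    # Hypothetical helper: drop everything inside [...] at any nesting depth.
    depth = 0
    kept = []
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]" and depth > 0:
            depth -= 1
        elif depth == 0:
            kept.append(ch)
    # Collapse the runs of spaces left behind by the removed parts.
    return " ".join("".join(kept).split())

s = "Alice in the Wonderland [1865 [first edition]] [Charles Lutwidge Dodgson] Rating 4.5/5"
cleaned = remove_bracketed(s)
print(cleaned)
```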

Python regex to remove all words which contains number

I am trying to write a Python regex that removes every word containing a number from a string.
For example:
in = "ABCD abcd AB55 55CD A55D 5555"
out = "ABCD abcd"
The regex for deleting the digits alone is trivial:
print(re.sub(r'[0-9]','','Paris a55a b55 55c 555 aaa'))
But I don't know how to delete the entire word and not just the number.
Could you help me please?
Do you need a regex? You can do something like
>>> words = "ABCD abcd AB55 55CD A55D 5555"
>>> ' '.join(s for s in words.split() if not any(c.isdigit() for c in s))
'ABCD abcd'
If you really want to use regex, you can try \w*\d\w*:
>>> re.sub(r'\w*\d\w*', '', words).strip()
'ABCD abcd'
Here's my approach:
>>> import re
>>> s = "ABCD abcd AB55 55CD A55D 5555"
>>> re.sub(r"\S*\d\S*", "", s).strip()
'ABCD abcd'
>>>
The snippet below removes only the words that mix letters and digits:
import re
string='1i am20 years old, weight is in between 65-70 kg '
string=re.sub(r"[A-Za-z]+\d+|\d+[A-Za-z]+",'',string).strip()
print(string)
OUTPUT:
years old, weight is in between 65-70 kg
