Python regex to remove all words which contains number - python

I am trying to make a Python regex which allows me to remove all worlds of a string containing a number.
For example:
in = "ABCD abcd AB55 55CD A55D 5555"
out = "ABCD abcd"
The regex for delete number is trivial:
print(re.sub(r'[1-9]','','Paris a55a b55 55c 555 aaa'))
But I don't know how to delete the entire word and not just the number.
Could you help me please?

Do you need a regex? You can do something like
>>> words = "ABCD abcd AB55 55CD A55D 5555"
>>> ' '.join(s for s in words.split() if not any(c.isdigit() for c in s))
'ABCD abcd'
If you really want to use regex, you can try \w*\d\w*:
>>> re.sub(r'\w*\d\w*', '', words).strip()
'ABCD abcd'

Here's my approach:
>>> import re
>>> s = "ABCD abcd AB55 55CD A55D 5555"
>>> re.sub("\S*\d\S*", "", s).strip()
'ABCD abcd'
>>>

The below code of snippet removes the words mixed with digits only
string='1i am20 years old, weight is in between 65-70 kg '
string=re.sub(r"[A-Za-z]+\d+|\d+[A-Za-z]+",'',string).strip()
print(s)
OUTPUT:
years old and weight is in between 65-70 kg name

Related

RegEx or Pygrok Pattern match

I have text like this
Example:
"visa code: ab c master number: efg discover: i j k"
Output should be like this:
abc, efg, ijk
Is there a way, I can use Grok pattern match or Reg EX to get 3 characters after the ":" (not considering space) ?
You can start with this:
>>> import re
>>> p = re.compile(r"\b((?:\w\s*){2}\w)\b")
>>> re.findall(p, "visa code: ab c master number: efg discover: i j k")
['ab c', 'efg', 'i j k']
But you have more work to do. For example, nobody can guess what you mean - exactly - by "characters".
Beyond that, pattern matching systems match strings, but do not convert them. You'll have to remove spaces you don't want via some other means (which should be easy).

Finding words in phrases using regular expression

I wanna use regular expression to find phrases that contains
1 - One of the N words (any)
2 - All the N words (all )
>>> import re
>>> reg = re.compile(r'.country.|.place')
>>> phrases = ["This is an place", "France is a European country, and a wonderful place to visit", "Paris is a place, it s the capital of the country.side"]
>>> for phrase in phrases:
... found = re.findall(reg,phrase)
... print found
...
Result:
[' place']
[' country,', ' place']
[' place', ' country.']
It seems that I am messing around, I need to specify that I need to find a word, not just a part of word in both cases.
Can anyone help by pointing to the issue ?
Because you are trying to match entire words, use \b to match word boundaries:
reg = re.compile(r'\bcountry\b|\bplace\b')

remove multiple substrings inside a string

Here's hoping somebody can shed some light on this question because it has me stumped. I have a string that looks like this:
s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"
I want this result:
abcdef ghijk lmnop qrs tuv wxyz 0123456789
Having reviewed numerous questions and answers here, the closest I have come to a solution is:
s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"
s = re.sub('\[\[.*?\|', '', s)
s = re.sub('[\]\]]', '', s)
--> abcdef ghijk lmnop wxyz 0123456789
Since not every substring within double brackets contains a pipe, the re.sub removes everything from '[[' to next '|' instead of checking within each set of double brackets.
Any assistance would be most appreciated.
What about this:
In [187]: re.sub(r'([\[|\]])|((?<=\[)\w+\s+\w+(?=|))', '', s)
Out[187]: 'abcdef ghijk lmnop qrs tuv wxyz 0123456789'
I purpose you a contrary method, instead of remove it you can just catch patterns you want. I think this way can make your code more semantics.
There are two patterns you wish to catch:
Case: words outside [[...]]
Pattern: Any words are either leaded by ']] ' or trailed by ' [['.
Regex: (?<=\]\]\s)\w+|\w+(?=\s\[\[)
Case: words inside [[...]]
Pattern: Any words are trailed by ']]'
Regex: \w+(?=\]\])
Example code
1 #!/usr/bin/env python
2 import re
3
4 s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789 "
5
6 p = re.compile('(?<=\]\]\s)\w+|\w+(?=\s\[\[)|\w+(?=\]\])')
7 print p.findall(s)
Result:
['abcdef', 'ghijk', 'lmnop', 'qrs', 'tuv', 'wxyz', '0123456789']
>>> import re
>>> s = "abcdef [[xxxx xxx|ghijk]] lmnop [[qrs]] tuv [[xx xxxx|wxyz]] 0123456789"
>>> re.sub(r'(\[\[[^]]+?\|)|([\[\]])', '', s)
'abcdef ghijk lmnop qrs tuv wxyz 0123456789'
This searches for and removes the following two items:
Two opening brackets followed by a bunch of stuff that isn't a closing bracket followed by a pipe.
Opening or closing brackets.
As a general regex using built-in re module you can use follwing regex that used look-around:
(?<!\[\[)\b([\w]+)\b(?!\|)|\[\[([^|]*)\]\]
you can use re.finditer to get the desire result :
>>> g=re.finditer(r'(?<!\[\[)\b([\w]+)\b(?!\|)|(?<=\[\[)[^|]*(?=\]\])',s)
>>> [j.group() for j in g]
['abcdef', 'ghijk', 'lmnop', 'qrs', 'tuv', 'wxyz', '0123456789']
The preceding regex contains from 2 part one is :
(?<=\[\[)[^|]*(?=\]\])
which match any combinations of word characters that not followed by | and not precede by [[.
the second part is :
\[\[([^|]*)\]\]
that will match any thing between 2 brackets except |.

Replace a substring selectively inside a string

I have a string like this:
a = "\"java jobs in delhi\" delhi"
I want to replace delhi with "". But only delhi which lies outside the double-quotes. So, the output should look like this:
"\"java jobs in delhi\""
The string is a sample string.The substring not necessarily be "delhi".The substring to replace can occur anywhere in the input string. The order and number of quoted and unquoted parts in the string is not fixed
.replace() replaces both the delhi substrings. I can't use rstrip either as it wont necessarily appear at the end of the string. How can I do this?
Use re.sub
>>> a = "\"java jobs in delhi\" delhi"
>>> re.sub(r'\bdelhi\b(?=(?:"[^"]*"|[^"])*$)', r'', a)
'"java jobs in delhi" '
>>> re.sub(r'\bdelhi\b(?=(?:"[^"]*"|[^"])*$)', r'', a).strip()
'"java jobs in delhi"'
OR
>>> re.sub(r'("[^"]*")|delhi', lambda m: m.group(1) if m.group(1) else "", a)
'"java jobs in delhi" '
>>> re.sub(r'("[^"]*")|delhi', lambda m: m.group(1) if m.group(1) else "", a).strip()
'"java jobs in delhi"'
As a general way you can use re.split and a list comprehension :
>>> a = "\"java jobs in delhi\" delhi \"another text\" and this"
>>> sp=re.split(r'(\"[^"]*?\")',a)
>>> ''.join([i.replace('dehli','') if '"' in i else i for i in sp])
'"java jobs in delhi" delhi "another text" and this'
The re.split() function split your text based on sub-strings that has been surrounded with " :
['', '"java jobs in delhi"', ' delhi ', '"another text"', ' and this']
Then you can replace the dehli words which doesn't surrounded with 2 double quote!
Here is another alternative. This is a generic solution to remove any unquoted text:
def only_quoted_text(text):
output = []
in_quotes=False
for letter in a:
if letter == '"':
in_quotes = not in_quotes
output.append(letter)
elif in_quotes:
output.append(letter)
return "".join(output)
a = "list of \"java jobs in delhi\" delhi and \" python jobs in mumbai \" mumbai"
print only_quoted_text(a)
The output would be:
"java jobs in delhi"" python jobs in mumbai "
It also displays text if the final quote is missing.

Python - Don't Understand The Returned Results of This Concatenated Regex Pattern

I am a Python newb trying to get more understanding of regex. Just when I think I got a good grasp of the basics, something throws me - such as the following:
>>> import re
>>> text = "Some nouns like eggs egg bacon what a lovely donkey"
>>> noun_list = ['eggs', 'bacon', 'donkey', 'dog']
>>> noun_patt = r'\s' + '|'.join(noun_list) + r'\s'
>>> found = re.findall(noun_patt, text)
>>> found
[' eggs', 'bacon', 'donkey']
Since I set the regex pattern to find 'whitespace' + 'pipe joined list of nouns' + 'whitespace' - how come:
' eggs' was found with a space before it and not after it?
'bacon' was found with no spaces either side of it?
'donkey' was found with no spaces either side of it and the fact there is no whitespace after it?
The result I was expecting: [' eggs ', ' bacon ']
I am using Python 2.7
You misunderstand the pattern. There is no group around the joint list of nouns, so the first \s is part of the eggs option, the bacon and donkey options have no spaces, and the dog option includes the final \s meta character .
You want to put a group around the nouns to delimit what the | option applies to:
noun_patt = r'\s(?:{})\s'.format('|'.join(noun_list))
The non-capturing group here ((?:...)) puts a limit on what the | options apply to. The \s spaces are now outside of the group and are thus not part of the 4 choices.
You need to use a non-capturing group because if you were to use a regular (capturing) group .findall() would return just the noun, not the spaces.
Demo:
>>> text = "Some nouns like eggs egg bacon what a lovely donkey"
>>> import re
>>> text = "Some nouns like eggs egg bacon what a lovely donkey"
>>> noun_list = ['eggs', 'bacon', 'donkey', 'dog']
>>> noun_patt = r'\s(?:{})\s'.format('|'.join(noun_list))
>>> re.findall(noun_patt, text)
[' eggs ', ' bacon ']
Now both spaces are part of the output.

Categories

Resources