Removing numbers from string while keeping alphanumeric words - python

I want to remove numeric values without losing numbers that are part of alphanumeric words from a string.
String = "Jobid JD123 has been abended with code 346"
Result = "Jobid JD123 has been abended with code"
I am using the following code:
result = ''.join([i for i in String if not i.isdigit()])
which gives me the result as 'Jobid JD has been abended with code'
Is there any way to remove the words that contain only digits, while retaining those that contain a mix of letters and digits?

You can use regex to find runs of one or more digits \d+ between two word boundaries \b, and replace them with nothing.
>>> import re
>>> string = "Jobid JD123 has been abended with code 346"
>>> re.sub(r"\b\d+\b", "", string).strip()
'Jobid JD123 has been abended with code'
Note that the regex removes only the digits, not the space between "code" and the digits, so you need to strip() the result of re.sub() to drop the trailing space.
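When a number sits in the middle of the string, removing it leaves two adjacent spaces, and strip() only trims the ends. A minimal sketch that also collapses the leftover whitespace (the function name drop_numeric_words and the extra "42" in the sample are assumptions for illustration):

```python
import re

def drop_numeric_words(text):
    # Remove digit runs that form a whole word on their own.
    without_numbers = re.sub(r"\b\d+\b", "", text)
    # split() with no arguments discards every run of whitespace,
    # so joining with one space normalizes the gaps left behind.
    return " ".join(without_numbers.split())

print(drop_numeric_words("Jobid 42 JD123 has been abended with code 346"))
```

This prints the sentence with "42" and "346" removed and single spaces throughout. Be aware that \b also sees a boundary at punctuation, so a value like 3.14 would lose its digits and keep the dot.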

Use .isnumeric() to remove any word that contains only numbers:
s = "Jobid JD123 has been abended with code 346"
result = ' '.join(c for c in s.split() if not c.isnumeric())
print(result)
This outputs:
Jobid JD123 has been abended with code

Split the string into words and check whether the entire word is numeric:
Result = " ".join(word for word in String.split() if not word.isnumeric())
>>> Result
'Jobid JD123 has been abended with code'
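One thing to keep in mind with this approach: isnumeric() tests every character of the word, so signed or decimal values such as -5 or 3.14 do not count as numeric and are kept. A small sketch of that behavior (the sample sentence is made up for illustration):

```python
s = "Jobid JD123 abended with code 346 after 3.14 s and -5 retries"

# Only "346" consists entirely of numeric characters, so only it is dropped.
kept = [w for w in s.split() if not w.isnumeric()]
print(" ".join(kept))
```

If such values should be removed too, the check would need to be replaced with something stricter, for example a try/except around float(word).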

Related

python re split at all space and punctuation except for the apostrophe

I want to split a string on all spaces and punctuation except the apostrophe. Preferably a single quote should still be used as a delimiter except when it is an apostrophe. I also want to keep the delimiters.
Example string:
words = """hello my name is 'joe.' what's your's"""
Here is my re pattern thus far splitted = re.split(r"[^'-\w]",words.lower())
I tried throwing the single quote after the ^ character but it is not working.
My desired output is this. splitted = [hello,my,name,is,joe,.,what's,your's]
It might be simpler to ignore the quotes while splitting and clean up the list afterwards:
>>> import re
>>> words = """hello my name is 'joe.' what's your's"""
>>> split_words = re.split(r"[ ,.!?]", words.lower()) # add punctuation you want to split on
>>> split_words
['hello', 'my', 'name', 'is', "'joe", "'", "what's", "your's"]
>>> [w for w in (word.strip("'") for word in split_words) if w] # strip quotes, drop empty entries
['hello', 'my', 'name', 'is', 'joe', "what's", "your's"]
One option is to use lookarounds to split at the desired positions, and a capture group for what you want to keep in the split.
After the split, you can remove the empty entries from the resulting list.
\s+|(?<=\s)'|'(?=\s)|(?<=\w)([,.!?])
The pattern matches
\s+ Match 1 or more whitespace chars
| Or
(?<=\s)' Match ' preceded by a whitespace char
| Or
'(?=\s) Match ' when followed by a whitespace char
| Or
(?<=\w)([,.!?]) Capture one of , . ! ? in group 1, when preceded by a word character
Example
import re
pattern = r"\s+|(?<=\s)'|'(?=\s)|(?<=\w)([,.!?])"
words = """hello my name is 'joe.' what's your's"""
result = [s for s in re.split(pattern, words) if s]
print(result)
Output
['hello', 'my', 'name', 'is', 'joe', '.', "what's", "your's"]
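The filtering step is needed because re.split() inserts the capture group's value into the result for every match, and that value is None whenever the match came from an alternative outside the group. A reduced sketch with a simpler pattern and a made-up input shows the raw list before filtering:

```python
import re

# Split on whitespace (dropped) or on punctuation (kept via the capture group).
parts = re.split(r"\s+|([,.!?])", "a b.")
print(parts)    # ['a', None, 'b', '.', '']

# Filtering on truthiness removes both the None placeholders and empty strings.
cleaned = [p for p in parts if p]
print(cleaned)  # ['a', 'b', '.']
```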
I love regex golf!
import re
words = """hello my name is 'joe.' what's your's"""
splitted = re.findall(r"\b(?:\w'\w|\w)+\b", words)
The part in parentheses is a group that matches either an apostrophe surrounded by word characters or a single word character.
EDIT:
This is more flexible:
re.findall(r"\b(?:(?<=\w)'(?=\w)|\w)+\b", words)
It's getting a bit unreadable at this point though; in practice you should probably use Woodford's answer.
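To see where the two findall patterns differ, here is a small comparison on a word with two apostrophes (the sample word is an assumption, not from the question). The first pattern consumes the characters on both sides of an apostrophe, so it cannot chain two apostrophes that share a neighbor; the lookaround version only inspects the neighbors.

```python
import re

consuming = r"\b(?:\w'\w|\w)+\b"
lookaround = r"\b(?:(?<=\w)'(?=\w)|\w)+\b"

sample = "rock'n'roll"
first = re.findall(consuming, sample)
second = re.findall(lookaround, sample)
print(first)   # ["rock'n", 'roll']
print(second)  # ["rock'n'roll"]
```

Note that neither pattern returns punctuation such as the "." in the desired output, since findall only reports the word-like matches.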

Can string replace be written in list comprehension?

I have a text and a list.
text = "Some texts [remove me] that I want to [and remove me] replace"
remove_list = ["[remove me]", "[and remove me]"]
I want to remove every element of the list from the string. I can do this:
for element in remove_list:
    text = text.replace(element, '')
I can also use regex.
But can this be done in a list comprehension or any one-liner?
You can use functools.reduce:
from functools import reduce
text = reduce(lambda x, y: x.replace(y, ''), remove_list, text)
# 'Some texts  that I want to  replace'
I would do this with re.sub to remove all the substrings in one pass:
>>> import re
>>> regex = '|'.join(map(re.escape, remove_list))
>>> re.sub(regex, '', text)
'Some texts  that I want to  replace'
Note that the result has two spaces instead of one where each part was removed. If you want each occurrence to leave just one space, you can use a slightly more complicated regex:
>>> re.sub(r'\s*(' + regex + r')', '', text)
'Some texts that I want to replace'
There are other ways to write similar regexes; this one will remove the space preceding a match, but you could alternatively remove the space following a match instead. Which behaviour you want will depend on your use-case.
You can do this with a regex by building a regex from an alternation of the words to remove, taking care to escape the strings so that the [ and ] in them don't get treated as special characters:
import re
text = "Some texts [remove me] that I want to [and remove me] replace"
remove_list = ["[remove me]", "[and remove me]"]
regex = re.compile('|'.join(re.escape(r) for r in remove_list))
text = regex.sub('', text)
print(text)
Output:
Some texts  that I want to  replace
Since this may result in double spaces in the result string, you can remove them with replace, e.g.
text = regex.sub('', text).replace('  ', ' ')
Output:
Some texts that I want to replace
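Two details are easy to miss when building an alternation from a list: the regex engine takes the first alternative that matches, so a shorter item that is a prefix of a longer one should come later, and removal can leave whitespace runs of any length. A hedged sketch combining both (the helper name remove_all is hypothetical, and collapsing all whitespace is an assumption about the desired result):

```python
import re

def remove_all(text, substrings):
    # Longest first, so "foo" never pre-empts "foobar" in the alternation.
    ordered = sorted(substrings, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(s) for s in ordered))
    # split()/join collapses any run of whitespace left by the removals.
    return " ".join(pattern.sub("", text).split())

text = "Some texts [remove me] that I want to [and remove me] replace"
print(remove_all(text, ["[remove me]", "[and remove me]"]))
```

The whitespace collapsing also changes spacing elsewhere in the string, so it only fits when the original spacing does not need to be preserved.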

Substring regex from characters to end of word

I am looking for a regex that will capture a substring beginning with a certain sequence of characters (http in my case) and running up to the next whitespace.
I am doing the problem in python, working over a list of strings and replacing the 'bad' substring with ''.
The difficulty is that this sequence does not necessarily start a word. In the example below, the parts I want to capture are the runs that start at http and end at the next whitespace:
"Pasforcémenthttpwwwsudouestfr20101129lesyndromedeliledererevientdanslactualite2525391381php merci httpswwwgooglecomsilvous "
Thank you
Use findall:
>>> text = '''Pasforcémenthttpwwwsudouestfr20101129lesyndromedeliledererevientdanslactualite2525391381php merci httpswwwgooglecomsilvous '''
>>> import re
>>> re.findall(r'http\S+', text)
['httpwwwsudouestfr20101129lesyndromedeliledererevientdanslactualite2525391381php', 'httpswwwgooglecomsilvous']
For substitution (if memory is not an issue):
>>> rep = re.compile(r'http\S+')
>>> rep.sub('', text)
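Since the question works over a list of strings, the compiled pattern can be applied in a list comprehension. A sketch with made-up sample strings and the hypothetical name URL_LIKE:

```python
import re

# "http" followed by everything up to the next whitespace.
URL_LIKE = re.compile(r"http\S+")

strings = ["voirhttpwwwexamplecom merci", "rien ici"]
cleaned = [URL_LIKE.sub("", s) for s in strings]
print(cleaned)  # ['voir merci', 'rien ici']
```

The match starts wherever "http" appears, even in the middle of a word, which is the behavior the question asks for.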
You can try this:
import re
strings = [] # your list of strings goes here
# match from "http" to the first "php", or to the end of the string if there is none
new_strings = [re.sub(r"http.*?php|http.*?$", '', i) for i in strings]

Remove a character in string if it doesn't belong to a group of matching pattern in Python

Given a string containing many words, I want to remove the closing parenthesis from a word unless the word starts with _.
Example input:
this is an example to _remove) brackets under certain) conditions.
Output:
this is an example to _remove) brackets under certain conditions.
How can I do that without splitting the words using re.sub?
re.sub accepts a callable as the second parameter, which comes in handy here:
>>> import re
>>> s = 'this is an example to _remove) brackets under certain) conditions.'
>>> re.sub(r'(\w+)\)', lambda m: m.group(0) if m.group(0).startswith('_') else m.group(1), s)
'this is an example to _remove) brackets under certain conditions.'
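If the lambda feels dense, the same idea can be written with a named replacement function. This is only a restatement of the approach above; the function name keep_paren_for_underscore is hypothetical:

```python
import re

def keep_paren_for_underscore(match):
    word = match.group(1)  # the word without the closing parenthesis
    # Return the full match (word plus ")") only for words starting with "_".
    return match.group(0) if word.startswith("_") else word

s = 'this is an example to _remove) brackets under certain) conditions.'
result = re.sub(r"(\w+)\)", keep_paren_for_underscore, s)
print(result)
```

re.sub calls the function once per match and inserts whatever string it returns, so the decision logic can grow without making the pattern itself harder to read.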
I wouldn't use regex here when a list comprehension can do it.
result = ' '.join([word.rstrip(")") if not word.startswith("_") else word
                   for word in words.split(" ")])  # words is the input string
If you have possible input like:
someword))
that you want to turn into:
someword)
Then you'll have to do:
result = ' '.join([word[:-1] if word.endswith(")") and not word.startswith("_") else word
                   for word in words.split(" ")])
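The difference between the two versions comes down to how rstrip() and slicing treat repeated trailing characters, which a two-line check makes concrete:

```python
word = "someword))"

stripped = word.rstrip(")")  # removes every trailing ")"
sliced = word[:-1]           # removes exactly one character

print(stripped)  # someword
print(sliced)    # someword)
```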

regex for repeating words in a string in Python

I have a good regexp for replacing repeating characters in a string. But now I also need to replace repeating words: three or more repetitions of a word should be replaced by two.
For example,
bye! bye! bye!
should become
bye! bye!
My code so far:
import re
def replaceThreeOrMoreCharachetrsWithTwoCharacters(string):
    # Pattern to look for three or more repetitions of any character, including newlines.
    pattern = re.compile(r"(.)\1{2,}", re.DOTALL)
    return pattern.sub(r"\1\1", string)
Assuming that a "word" in your requirements is one or more non-whitespace characters surrounded by whitespace or the string boundaries, you can try this pattern:
re.sub(r'(?<!\S)((\S+)(?:\s+\2))(?:\s+\2)+(?!\S)', r'\1', s)
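A runnable sketch of that pattern on two inputs (the wrapper name cap_repeated_words and the second sample are assumptions). Group 2 captures the word, group 1 captures the first two occurrences, and the trailing repetitions are discarded by substituting only group 1:

```python
import re

REPEATS = re.compile(r"(?<!\S)((\S+)(?:\s+\2))(?:\s+\2)+(?!\S)")

def cap_repeated_words(s):
    # Keep group 1 (the word twice) and drop the third and later repeats.
    return REPEATS.sub(r"\1", s)

print(cap_repeated_words("bye! bye! bye!"))
print(cap_repeated_words("hi hi hi hi some words words words"))
```

The (?<!\S) and (?!\S) guards make sure the repeats are whole words, so "bye! bye! bye!x" would be left untouched.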
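You could also try the regex below. It uses the third-party regex module, since the standard re module rejects the lookbehind (?<= |^), whose alternatives differ in width.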
(?<= |^)(\S+)(?: \1){2,}(?= |$)
Sample code,
>>> import regex
>>> s = "hi hi hi hi some words words words which'll repeat repeat repeat repeat repeat"
>>> m = regex.sub(r'(?<= |^)(\S+)(?: \1){2,}(?= |$)', r'\1 \1', s)
>>> m
"hi hi some words words which'll repeat repeat"
I know you were after a regular expression but you could use a simple loop to achieve the same thing:
def max_repeats(s, max=2):
    last = ''
    out = []
    for word in s.split():
        same = 0 if word != last else same + 1
        if same < max: out.append(word)
        last = word
    return ' '.join(out)
As a bonus, I have allowed a different maximum number of repeats to be specified (the default is 2). If there is more than one space between each word, it will be lost. It's up to you whether you consider that to be a bug or a feature :)
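The same loop idea can also be expressed with itertools.groupby, which groups consecutive equal words. This is a different technique from the loop above, offered only as a sketch; the name cap_runs and its limit parameter are hypothetical:

```python
from itertools import groupby, islice

def cap_runs(s, limit=2):
    # groupby yields one group per run of identical consecutive words;
    # islice keeps at most `limit` words from each run.
    return " ".join(
        word
        for _, run in groupby(s.split())
        for word in islice(run, limit)
    )

print(cap_runs("hi hi hi hi some words words words which'll repeat repeat repeat"))
```

Like the loop version, it normalizes all whitespace to single spaces because it splits and rejoins the string.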
Try the following:
import re
s = "bye! bye! bye!"  # your string
s = re.sub( r'(\S+) (?:\1 ?){2,}', r'\1 \1', s )
You can see a sample code here: http://codepad.org/YyS9JCLO
def replaceThreeOrMoreWordsWithTwoWords(string):
    # Pattern to look for three or more repetitions of any words.
    pattern = re.compile(r"(?<!\S)((\S+)(?:\s+\2))(?:\s+\2)+(?!\S)", re.DOTALL)
    return pattern.sub(r"\1", string)
