Remove words of length less than 4 from string [duplicate] - python

This question already has answers here:
Remove small words using Python
(4 answers)
Closed 8 years ago.
I am trying to remove words of length less than 4 from a string.
I use this regex:
re.sub(' \w{1,3} ', ' ', c)
This removes some words, but it fails when two or three words shorter than 4 characters appear in a row. For example:
I am in a bank.
It gives me:
I in bank.
How can I resolve this?

Don't include the spaces; use \b word boundary anchors instead:
re.sub(r'\b\w{1,3}\b', '', c)
This removes words of up to 3 characters entirely, but leaves the spaces around them in place:
>>> import re
>>> re.sub(r'\b\w{1,3}\b', '', 'The quick brown fox jumps over the lazy dog')
' quick brown  jumps over  lazy '
>>> re.sub(r'\b\w{1,3}\b', '', 'I am in a bank.')
'    bank.'

If you want an alternative to regex:
new_string = ' '.join([w for w in old_string.split() if len(w)>3])
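One difference between the two approaches is worth checking before you pick one: str.split() leaves punctuation attached to the words, so the length test counts it, while \b treats punctuation as a boundary. A small comparison follows; the function names and the sample sentence are made up for illustration.

```python
import re

def drop_short_regex(text):
    # Remove words of 1-3 word characters, then collapse the leftover spaces.
    return ' '.join(re.sub(r'\b\w{1,3}\b', '', text).split())

def drop_short_split(text):
    # Whitespace-delimited tokens; trailing punctuation counts toward the length.
    return ' '.join(w for w in text.split() if len(w) > 3)

sample = 'I am in a bank. The dog ran.'
print(drop_short_regex(sample))  # bank. .
print(drop_short_split(sample))  # bank. ran.
```

Here 'ran.' survives the split version because the period makes it 4 characters long, while the regex version removes 'ran' and leaves the period behind.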

Martijn has already answered, but I wanted to explain why your regex doesn't work. The pattern ' \w{1,3} ' matches a space, followed by 1-3 word characters, followed by another space. The I isn't matched because there is no space in front of it. The am is replaced, and the regex engine then resumes at the next unconsumed character: the i in in. The space before in was already consumed as the trailing space of the previous match, and matches cannot overlap, so in is skipped. The next match it finds is a, which produces your output string.
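To see this for yourself, you can list the matches the original pattern actually finds. The second half is a sketch of one possible fix that keeps the "surrounded by whitespace" idea but uses lookarounds, which check the neighbouring characters without consuming them. The variable names are only for illustration.

```python
import re

c = 'I am in a bank.'

# Each match consumes its trailing space, so the word right after it
# has no leading space left to match.
matches = [m.group(0) for m in re.finditer(r' \w{1,3} ', c)]
print(matches)  # [' am ', ' a ']

# (?<!\S) and (?!\S) assert "no non-space character here" without consuming.
without_short = re.sub(r'(?<!\S)\w{1,3}(?!\S)', '', c)
cleaned = ' '.join(without_short.split())
print(cleaned)  # bank.
```

Unlike the \b version, this one does not remove a short word that has punctuation directly attached, such as "it.".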


Split string at a specific number that is also contained in a larger number in the same string [duplicate]

This question already has answers here:
How do I check for an exact word or phrase in a string in Python
(8 answers)
Closed 2 years ago.
I have the string: 'This line is 14 1400'
I would like to split it keeping everything to the right of 14.
I have tried:
split2 = re.split('14', string)[2]
This returns: 00
I would like it to return 1400
How would I modify this to get that output? I've experimented with regular expression operations to match only 14, but can't get this to work.

To split only on 14 and not on 1400, use the word boundary metacharacter \b.
Make sure to use a raw string so that the backslash does not need escaping.
>>> split2 = re.split(r'\b14\b', string)
>>> split2
['This line is ', ' 1400']
>>> split2[1]
' 1400'
Alternatively, to also get rid of the leading space in ' 1400', do not split only on 14 but also on any spaces surrounding it:
>>> re.split(r'\s*\b14\b\s*', string)
['This line is', '1400']
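If the goal is "everything to the right of the first standalone 14", splitting only once avoids cutting the remainder again when 14 shows up later. A minimal sketch, assuming that is the intent; right_of is a hypothetical helper name, not something from the question.

```python
import re

def right_of(number, text):
    # Split once on the standalone number plus any whitespace around it.
    pattern = r'\s*\b{}\b\s*'.format(re.escape(str(number)))
    parts = re.split(pattern, text, maxsplit=1)
    # No standalone occurrence means there is no right-hand side.
    return parts[1] if len(parts) == 2 else None

print(right_of(14, 'This line is 14 1400'))  # 1400
print(right_of(14, 'a 14 b 14 c'))           # b 14 c
print(right_of(14, 'This line is 1400'))     # None
```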

In Python, ignore digits in string but remove pure digits [duplicate]

This question already has answers here:
How to match a whole word with a regular expression?
(4 answers)
Closed 2 years ago.
I am processing a string like
This is python3 and learning it takes 100 hours
I want to remove only standalone numbers like 100, but keep digits that are part of something else, like python3.
I am trying the regex
text = re.sub('[0-9]', '', text)
but it is not working as expected. Help is appreciated.

You can just add a space to both sides of your regex and use a single space as the replacement. Remember to also add a + to match one or more digits:
import re
text = 'This is python3 and learning it takes 100 hours'
text = re.sub(r' [0-9]+ ', ' ', text)
print(text)
Output:
This is python3 and learning it takes hours

Try this:
text = re.sub(' [0-9]{1,} ', ' ', text)

You can use the \b word boundary (\d is shorthand for [0-9]):
import re

def clean(value):
    return re.sub(r"\b\d+\b", "", value)

if __name__ == "__main__":
    print(clean("This is python3 and learning it takes 100 hours"))  # This is python3 and learning it takes  hours (two spaces remain)
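The space-delimited patterns and the \b pattern behave differently at the edges of the string and when two numbers sit next to each other. A quick comparison, with helper names and a sample string invented for illustration:

```python
import re

def strip_numbers_spaces(text):
    # Needs a literal space on both sides of the number.
    return re.sub(r' [0-9]+ ', ' ', text)

def strip_numbers_boundary(text):
    # Remove standalone digit runs, then normalise the whitespace.
    return ' '.join(re.sub(r'\b\d+\b', '', text).split())

sample = '100 hours of python3 cost 20 30 dollars and 5'
print(strip_numbers_spaces(sample))    # 100 hours of python3 cost 30 dollars and 5
print(strip_numbers_boundary(sample))  # hours of python3 cost dollars and
```

The space version misses the number at the start, the one at the end, and the second of two adjacent numbers, because the space between 20 and 30 is consumed by the first match.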

Not getting required output using findall in python

Earlier, I could not put the exact question. My apologies.
Below is what I am looking for:
I am reading a string from a file as below, and there can be multiple such strings in the file.
" VEGETABLE 1
POTATOE_PRODUCE 1.1 1SIMLA(INDIA)
BANANA 1.2 A_BRAZIL(OR INDIA)
CARROT_PRODUCE 1.3 A_BRAZIL/AFRICA"
I want to capture the entire string as output using findall only.
My script:
import re

with open('log.txt') as f:
    contents = f.read()

output = re.findall(r'(VEGETABLE.*)(\s+\w+\s+.*)+', contents)
print(output)
The script above gives this output:
[('VEGETABLE 1', '\n CARROT_PRODUCE 1.3 A_BRAZIL/AFRICA')]
But the lines in between are missing.

The solution is in the last snippet of this answer.
>>> import re
>>> str2='d1 talk walk joke'
>>> re.findall('(\d\s+)(\w+\s)+',str2)
[('1 ', 'walk ')]
The output is a list with only one occurrence of the given pattern. The tuple in the list contains two strings, one for each of the two groups written with () in the pattern. Because the second group is repeated with +, only its last repetition ('walk ') is kept.
Experiment 1
Removing the final '+' makes the second group capture the first word after the number instead of the last repetition:
>>> re.findall('(\d\s+)(\w+\s)',str2)
[('1 ', 'talk ')]
Experiment 2
Adding one more group captures a further word followed by whitespace. But if the string has more words followed by spaces than the pattern has groups, this still captures only as many as there are groups.
>>> re.findall('(\d\s+)(\w+\s)(\w+\s)',str2)
[('1 ', 'talk ', 'walk ')]
Experiment 3
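Using '|' to match the pattern multiple times. Note that the tuples have disappeared, because the pattern no longer contains groups. Also note that the first match does not contain only the number: \w is a superset of \d, so \w+\s+ already matches 'd1 ' starting at the first character, before \d\s+ gets a chance to match at the 1.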
>>> re.findall('\d\s+|\w+\s+',str2)
['d1 ', 'talk ', 'walk ']
Final Experiment
>>> re.findall('\d\s+|[a-z]+\s+',str2)
['1 ', 'talk ', 'walk ']
Hope this helps.
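Applied to the original VEGETABLE question, the same observation suggests making the repeated group non-capturing, so that findall returns the whole match instead of the last repetition. This is only a sketch under a few assumptions: the sample text is inlined instead of read from log.txt, the second block (VEGETABLE 2) is invented to show several blocks in one file, and every block is assumed to start with the word VEGETABLE.

```python
import re

contents = (
    ' VEGETABLE 1\n'
    ' POTATOE_PRODUCE 1.1 1SIMLA(INDIA)\n'
    ' BANANA 1.2 A_BRAZIL(OR INDIA)\n'
    ' CARROT_PRODUCE 1.3 A_BRAZIL/AFRICA\n'
    ' VEGETABLE 2\n'                          # invented second block
    ' CARROT_PRODUCE 2.1 A_BRAZIL/AFRICA\n'
)

# (?:...) repeats without capturing, so findall returns full matches.
# (?!VEGETABLE) stops a block from swallowing the next header line.
pattern = r'VEGETABLE.*(?:\s+(?!VEGETABLE)\w+\s+.*)+'
blocks = re.findall(pattern, contents)

for block in blocks:
    print(repr(block))
```

Each element of blocks is one complete multi-line string, including the lines that were missing from the original output.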

Python (2.7) - Replacing multiple patterns in a string using re

I am trying to think of a more elegant way of replacing multiple patterns in a given string using re. The underlying problem is to remove from a given string all substrings consisting of more than two spaces, and to fix all places where a letter follows a period without any space in between. So the sentence
'This is a strange sentence. There are too many spaces.And.Some periods are not. placed properly.'
should be corrected to:
'This is a strange sentence. There are too many spaces. And. Some periods are not. placed properly.'
My solution, below, seems a bit messy. I was wondering whether there was a nicer way of doing this, as in a one-liner regex.
def correct( astring ):
    import re
    bstring = re.sub( r' +', ' ', astring )
    letters = [frag.strip( '.' ) for frag in re.findall( r'\.\w', bstring )]
    for letter in letters:
        bstring = re.sub( r'\.{}'.format( letter ), '. {}'.format( letter ), bstring )
    return bstring

s = 'This is a strange sentence. There are too many spaces.And.Some periods are not. placed properly.'
print(re.sub("\s+"," ",s).replace(".",". ").rstrip())
This is a strange sentence. There are too many spaces. And. Some periods are not. placed properly.
You could use re.sub function like below. This would add exactly two spaces next to the dot except the last dot and it also replaces one or more spaces except the one after dot with a single space.
>>> s = 'This is a strange sentence. There are too many spaces.And.Some periods are not. placed properly.'
>>> re.sub(r'(?<!\.)\s+', ' ' ,re.sub(r'\.\s*(?!$)', r'. ', s))
'This is a strange sentence. There are too many spaces. And. Some periods are not. placed properly.'
OR
>>> re.sub(r'\.\s*(?!$)', r'. ', re.sub(r'\s+', ' ', s))
'This is a strange sentence. There are too many spaces. And. Some periods are not. placed properly.'
An approach without using any RegEX
>>> ' '.join(s.split()).replace('.','. ')[:-1]
'This is a strange sentence. There are too many spaces. And. Some periods are not. placed properly.'
What pure regex? Like this?
>>> import re
>>> s = 'This is a strange sentence. There are too many spaces.And.Some periods are not. placed properly.'
>>> re.sub('\s+$', '', re.sub('\s+', ' ', re.sub('\.', '. ', s)))
'This is a strange sentence. There are too many spaces. And. Some periods are not. placed properly.'
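Another option is two substitutions with a lookahead, which only touches periods that are directly followed by a letter and so leaves things like 1.3 alone. This is a sketch, assuming the target is exactly one space after each period and ASCII letters only. The extra spaces in the sample are my own guess, since the spacing in the quoted sentence is not visible, and correct_spacing is a hypothetical name.

```python
import re

def correct_spacing(text):
    # Step 1: collapse every run of whitespace into a single space.
    collapsed = re.sub(r'\s+', ' ', text)
    # Step 2: add a space after a period that is directly followed by a letter.
    return re.sub(r'\.(?=[A-Za-z])', '. ', collapsed)

messy = ('This  is a   strange sentence.  There are too many spaces.And.'
         'Some periods are not. placed properly.')
print(correct_spacing(messy))
print(correct_spacing('Version 1.3 is out.And ready.'))  # Version 1.3 is out. And ready.
```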

Confusing Behaviour of regex in Python

I'm trying to match a specific pattern using the re module in python.
I wish to match a full sentence (more precisely, alphanumeric string sequences separated by spaces and/or punctuation).
Eg.
"This is a regular sentence."
"this is also valid"
"so is This ONE"
I've tried various combinations of regular expressions, but I cannot grasp how the patterns work; each expression gives me a different yet inexplicable result (I do admit I am a beginner, but still).
I tried:
"((\w+)(\s?))*"
To the best of my knowledge, this should greedily match one or more alphanumerics followed by one or no white-space character, and then greedily repeat that entire pattern. This is not what it seems to do, so clearly I am wrong, but I would like to know why. (I expected this to return the entire sentence as the result.)
The result I get for the first sample string mentioned above is [('sentence', 'sentence', ''), ('', '', ''), ('', '', ''), ('', '', '')].
"(\w+ ?)*"
I'm not even sure how this one should work. The official documentation (help('re') in Python) says that *, + and ? match x or x (greedy) repetitions of the preceding RE.
In that case, is the space alone the preceding RE for '?', or is '\w+ ' the preceding RE? And what is the preceding RE for the '*' operator? The output I get with this is ['sentence'].
I also tried others such as "(\w+\s?)+)" ; "((\w*)(\s??)) etc., which are basically variations of the same idea: a sentence is a set of alphanumerics followed by a single/finite number of white spaces, and this pattern is repeated over and over.
Can someone tell me where I go wrong and why, and why the above expressions do not work the way I was expecting them to?
P.S. I eventually got "[ \w]+" to work for me, but with this I cannot limit the number of consecutive white-space characters.

Your reasoning about the regex is correct; your problem comes from using capturing groups with *. Here's an alternative:
>>> s="This is a regular sentence."
>>> import re
>>> re.findall(r'\w+\s?', s)
['This ', 'is ', 'a ', 'regular ', 'sentence']
In this case it might make more sense for you to use \b in order to match word boundaries.
>>> re.findall(r'\w+\b', s)
['This', 'is', 'a', 'regular', 'sentence']
Alternatively you can match the entire sentence via re.match and call group(0) on the resulting match object to get the whole match:
>>> r = r"((\w+)(\s?))*"
>>> s = "This is a regular sentence."
>>> import re
>>> m = re.match(r, s)
>>> m.group(0)
'This is a regular sentence'
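To make the capturing-group point concrete: findall returns the groups rather than the whole match, and a repeated group only remembers its last repetition. Switching to a non-capturing group gives back the full text. As for the question about quantifiers, ? applies only to the single item right before it (here the \s), while the * or + after the closing parenthesis applies to the whole group. A short sketch:

```python
import re

s = 'This is a regular sentence.'

# One capturing group, repeated: findall reports only the last repetition.
last_only = re.findall(r'(\w+\s?)*', s)[0]
print(last_only)  # sentence

# Non-capturing group: findall reports the whole match.
whole = re.findall(r'(?:\w+\s?)+', s)
print(whole)  # ['This is a regular sentence']
```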
Here's an awesome Regular Expression tutorial website:
http://regexone.com/
Here's a Regular Expression that will match the examples given:
([a-zA-Z0-9,\. ]+)
Why do you want to limit the number of consecutive white-space characters? A sentence can contain any number of words (sequences of alphanumeric characters) and spaces in a row; what defines a sentence is that it ends at a punctuation mark, or more generally at any character that is not in the above set, white space included.
([a-zA-Z0-9\s])*
The above regex matches a sentence as a series of alphanumeric characters or white space, repeated zero or more times. You can refine it as follows:
([a-zA-Z0-9])([a-zA-Z0-9\s])*
This simply requires the sequence to begin with an alphanumeric character.
Hope this is what you were looking for.
Maybe this will help:
import re

source = """
This is a regular sentence.
this is also valid
so is This ONE
how about this one  followed by this one
"""

# A sentence ends at a period, a newline, or two or more spaces.
re_sentence = re.compile(r'[^ \n.].*?(\.|\n|  +)')

def main():
    i = 0
    for s in re_sentence.finditer(source):
        print("%d:%s" % (i, s.group(0)))
        i += 1

if __name__ == '__main__':
    main()
I am using alternation in the expression (\.|\n|  +) to describe the end-of-sentence condition. Note the use of two spaces in the third alternative. The second space carries the '+' quantifier, so two or more spaces in a row count as an end of sentence.
