Simple Python Regex Find pattern - python

I have a sentence. I want to find all occurrences of a word that start with a specific character in that sentence. I am very new to programming and Python, but from the little I know, this sounds like a Regex question.
What is the pattern match code that will let me find all words that match my pattern?
Many thanks in advance,
Brock

import re
print(re.findall(r'\bv\w+', thesentence))
will print every word in the sentence that starts with 'v', for example.
Using the split method of strings, as another answer suggests, would not identify words, but whitespace-separated chunks that may include punctuation. This re-based solution does identify words (runs of letters, digits, and underscores, without punctuation).
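To make the difference concrete, here is a minimal sketch comparing the two approaches. The sample sentence is a made-up assumption, chosen only because it contains punctuation.

```python
import re

# Hypothetical sample sentence, chosen to include punctuation.
sentence = "Very vivid views, (vast) valleys; even v8 engines."

# Regex approach: \b anchors at a word boundary, \w+ consumes the rest of the word.
regex_words = re.findall(r'\bv\w+', sentence)

# split() approach: chunks keep their punctuation, and "(vast)" is missed
# because the chunk starts with "(" rather than "v".
split_words = [w for w in sentence.split() if w.startswith('v')]

print(regex_words)  # ['vivid', 'views', 'vast', 'valleys', 'v8']
print(split_words)  # ['vivid', 'views,', 'valleys;', 'v8']
```

Note that the match is case-sensitive, so "Very" is skipped by both approaches.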

I second the Dive Into Python recommendation. But it's basically:
m = re.findall(r'\bf.*?\b', 'a fast and friendly dog')
print(m)
\b means word boundary, and .*? ensures we store the whole word, but back off to avoid going too far (technically, ? is called a lazy operator).

You could do (doesn't use re though):
matching_words = [x for x in sentence.split() if x.startswith(CHAR_TO_FIND)]
Regular expressions work too (see the other answers) but I think this solution will be a little more readable, and as a beginner learning Python, you'll find list comprehensions (like the solution above) important to gain a comfort level with.

>>> sentence="a quick brown fox for you"
>>> pattern="fo"
>>> for word in sentence.split():
...     if word.startswith(pattern):
...         print(word)
...
fox
for
Split the sentence on whitespace, then loop over the words and print each one that starts with the pattern.

import re
s = "Your sentence that contains the word ROAD"
s = re.sub(r'\bROAD', 'RD.', s)
print(s)
Read: http://diveintopython3.org/regular-expressions.html

Related

Counting the occurrences of a substring in a string python

In my program I'm using count=strr2.lower().count("ieee") to calculate the number of occurrences in the following string,
"i love ieee and ieeextream is the best coding competition ever"
Here "ieeextream" is also counted as an occurrence, which is not the result I expected. The expected output is count=1.
Is there a method that checks only for the word "ieee", or can the same code be changed to a different implementation? Thanks for your time.
If you are trying to find the substring as a whole word in the original string, then I think this is what you need:
count=strr2.lower().split().count("ieee")
If you want to count only whole words, you can use a regular expression, wrapping the word to be found in word-boundary characters \b. This will also work if the word is surrounded by punctuation.
>>> import re
>>> s = "i love IEEE, and ieeextream is the best coding competition ever"
>>> len(re.findall(r"\bieee\b", s.lower()))
1
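The whole-word count can be wrapped in a small helper. This is only a sketch: the function name count_whole_word is hypothetical, and re.escape is added on the assumption that the search word might contain regex metacharacters.

```python
import re

def count_whole_word(word, text):
    """Count case-insensitive whole-word occurrences of `word` in `text`."""
    # re.escape keeps characters such as "." or "+" in `word` literal.
    pattern = r'\b' + re.escape(word) + r'\b'
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

s = "i love IEEE, and ieeextream is the best coding competition ever"
whole = count_whole_word("ieee", s)
naive = s.lower().count("ieee")

print(whole)  # 1
print(naive)  # 2, because "ieeextream" also contains "ieee"
```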

Find all word with a #

I want to find all words which have a # attached to them.
I tried:
import re
text = "I was searching my #source to make a big desk yesterday."
re.findall(r'\b#\w+', text)
but it does not work...
Here's a small regex to do that:
>>> import re
>>> s = "I was searching my #source to make a big desk yesterday."
>>> re.findall(r"#(\w+)", s)
['source']
If you want to include the hashtag then use:
>>> re.findall(r"#\w+", s)
['#source']
You can use:
re.findall(r"#.+?\b", text)
which gives:
['#source']
Here is a link to regex101 which gives in-depth insight into what each part does.
Basically what is happening is:
the # matches the '#' character literally
the . matches any character
the + means match one or more of them
the ? makes the + non-greedy, so it matches as few characters as possible
the \b is a word boundary and signifies where to stop the match
Update
As pointed out by @AnthonySottile, there is a case where the above regex will fail, namely:
hello#fred
where a match is made when it shouldn't be.
To get around this problem, a \s could be added to the front of the regex to make sure the # comes after some whitespace, but this fails when the hashtag comes right at the start of the string. A \b also won't suffice, because # is not a word character, so there is no word boundary between whitespace and the #.
So, to get around these, I came up with this rather ugly solution of adding a space to the start of the string before doing the findall:
re.findall(r"\s(#.+?)\b", " " + text)
It's not very neat, I know, but I could not find a cleaner way. I tried using an OR at the start to match either whitespace or the start of the string, as in (^|\s), but that extra capturing group makes re.findall return a list of tuples, which would require some post-processing and is even less neat.
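One possible way around the tuple problem, offered as a sketch rather than as part of the original answer, is to make the leading alternative a non-capturing group with (?:...). With only one capturing group left, re.findall returns plain strings. The sample text is a made-up assumption that covers the start-of-string case and the hello#fred case.

```python
import re

# Hypothetical sample: a hashtag at the very start, one embedded in a word,
# and one in the middle of the sentence.
text = "#start hello#fred and my #source to make a desk."

# (?:^|\s) requires the start of the string or whitespace before the '#',
# but does not create a group, so findall returns only the (#\w+) capture.
tags = re.findall(r"(?:^|\s)(#\w+)", text)

print(tags)  # ['#start', '#source']
```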
You do not need regex to solve this problem:
text = "I was searching my #source to make a big desk yesterday."
final_text = [i for i in text.split() if i.startswith('#')]
Output:
['#source']
However, this regex will work:
import re
text = "I was searching my #source to make a big desk yesterday."
final_text = re.findall(r'(?<!\S)#\w+(?!\S)', text)
Output:
['#source']

Regex: How to match words without consecutive vowels?

I'm really new to regex. I've been able to find a regex that matches words with consecutive vowels quite easily, but I am unsure how to match only the words without them.
I have a .txt file with words like
sheep
fleece
eggs
meat
potato
I want to make a regular expression that matches words in which vowels are not repeated consecutively, so it would return eggs meat potato.
I'm not very experienced with regex and I've been unable to find anything about how to do this online, so it'd be awesome if someone with more experience could help me out. Thanks!
I'm using python and have been testing my regex with https://regex101.com.
Thanks!
EDIT: provided incorrect examples of results for the regular expression. Fixed.
Note that, since the desired output includes meat but not fleece, desired words are allowed to have consecutive vowels, just not the same vowel twice in a row.
To select lines with no repeated vowel:
>>> [w for w in open('file.txt') if not re.search(r'([aeiou])\1', w)]
['eggs\n', 'meat\n', 'potato\n']
The regex [aeiou] matches any vowel (you can include y if you like). The regex ([aeiou])\1 matches any vowel followed by the same vowel. Thus, not re.search(r'([aeiou])\1', w) is true only for strings w that contain no repeated vowels.
Addendum
If we wanted to exclude meat because it has two vowels in a row, even though they are not the same vowel, then:
>>> [w for w in open('file.txt') if not re.search(r'[aeiou]{2}', w)]
['eggs\n', 'potato\n']
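The two filters can be compared side by side without a file. This sketch uses an in-memory list holding the question's sample words; the variable names are illustrative assumptions.

```python
import re

words = ["sheep", "fleece", "eggs", "meat", "potato"]

# ([aeiou])\1 : a vowel followed by the same vowel (backreference to group 1).
same_vowel_twice = re.compile(r'([aeiou])\1')
# [aeiou]{2} : any two vowels in a row, identical or not.
any_two_vowels = re.compile(r'[aeiou]{2}')

no_doubled = [w for w in words if not same_vowel_twice.search(w)]
no_adjacent = [w for w in words if not any_two_vowels.search(w)]

print(no_doubled)   # ['eggs', 'meat', 'potato']
print(no_adjacent)  # ['eggs', 'potato']
```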
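@John1024's answer should work.
I would also try the following, with the case-insensitive and global flags. Note that it matches the words that do contain a doubled vowel, so you would keep the words it does not match:
"\w*(a{2,}|e{2,}|i{2,}|o{2,}|u{2,})\w*"ig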

Multiple punctuation stripping

I tried multiple solutions here, and although they strip some punctuation, they don't seem to work on multiple punctuation marks, e.g. "[ or ',
This code:
regex = re.compile('[%s]' % re.escape(string.punctuation))
for i in words:
    while regex.match(i):
        regex.sub('', i)
which I got from "Best way to strip punctuation from a string in Python", works well, but I still run into problems with doubled punctuation.
I added a while loop in the hope of iterating over each word to remove multiple punctuation marks, but that does not seem to work: it gets stuck on the first item, "[, and never exits.
Am I just missing some obvious piece that I am oblivious to?
I solved the problem by adding redundancy and looping over my lists twice, but this takes an extremely long time (well into the minutes) because the sets are fairly large.
I use Python 2.7
Your code doesn't work because regex.match only matches at the beginning of the string.
Also, you did not do anything with the return value of regex.sub(). sub does not modify the string in place; you need to assign its result to something.
regex.search returns a match if the pattern is found anywhere in the string and works as expected:
import re
import string
words = ['a.bc,,', 'cdd,gf.f.d,fe']
regex = re.compile('[%s]' % re.escape(string.punctuation))
for i in words:
    while regex.search(i):
        i = regex.sub('', i)
    print(i)
Edit: As pointed out below by @senderle, the while clause isn't necessary and can be left out completely, because sub already replaces every occurrence in one call.
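Without the while loop, the whole thing reduces to one sub call per word. A minimal sketch follows; the third sample word is an added assumption to cover the "[ case from the question.

```python
import re
import string

words = ['a.bc,,', 'cdd,gf.f.d,fe', '"[quoted]"']

# One character class containing every ASCII punctuation character.
punct = re.compile('[%s]' % re.escape(string.punctuation))

# sub() replaces all matches in a single pass, so no inner loop is needed.
cleaned = [punct.sub('', w) for w in words]

print(cleaned)  # ['abc', 'cddgffdfe', 'quoted']
```

A single pass per word should also avoid the slowdown described in the question, since each string is scanned only once.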
This removes everything that is not alphanumeric or a space (note that the spaces on either side of a removed character are kept):
re.sub("[^a-zA-Z0-9 ]","",my_text)
>>> re.sub("[^a-zA-Z0-9 ]","","A [Black. Cat' On a Hot , tin roof!")
'A Black Cat On a Hot  tin roof'
Here is a simple way:
>>> print str.translate("My&& Dog's {{{%!@#%!@#$L&&&ove Sal*mon", None, '~`!@#$%^&*()_+=-[]\|}{;:/><,.?\"\'')
My Dogs Love Salmon
Using str.translate this way eliminates the punctuation. Note that this three-argument form is Python 2 only. I usually use this for eliminating numbers from DNA sequence reads.
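For readers on Python 3, a rough equivalent builds a deletion table with str.maketrans. This is a sketch that assumes string.punctuation is an acceptable stand-in for the hand-written character list above.

```python
import string

s = "My&& Dog's {{{%!@#%!@#$L&&&ove Sal*mon"

# The third argument of str.maketrans lists characters to delete.
table = str.maketrans('', '', string.punctuation)
result = s.translate(table)

print(result)  # My Dogs Love Salmon
```

To strip digits instead, as in the DNA use case, string.digits could be passed in place of string.punctuation.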

extract a sentence using python

I would like to extract the exact sentence if a particular word is present in that sentence. Could anyone let me know how to do it with python. I used concordance() but it only prints lines where the word matches.
Just a quick reminder: sentence breaking is actually a pretty complex thing. There are exceptions to the period rule, such as "Mr." or "Dr.", and there is a variety of sentence-ending punctuation marks. There are also exceptions to the exceptions (if the next word is capitalized and is not a proper noun, then "Dr." can end a sentence, for example).
If you're interested in this (it's a natural language processing topic) you could check out:
the natural language tool kit's (nltk) punkt module.
If you have each sentence in a string you can use find() on your word and if found return the sentence. Otherwise you could use a regex, something like this
pattern = r"\.?(?P<sentence>.*?good.*?)\."
match = re.search(pattern, yourwholetext)
if match is not None:
    sentence = match.group("sentence")
I haven't tested this, but something along those lines should work.
My test:
import re
text = "muffins are good, cookies are bad. sauce is awesome, veggies too. fmooo mfasss, fdssaaaa."
pattern = r"\.?(?P<sentence>.*?good.*?)\."
match = re.search(pattern, text)
if match is not None:
    print(match.group("sentence"))
dutt did a good job answering this. I just wanted to add a couple of things:
import re
text = "go directly to jail. do not cross go. do not collect $200."
pattern = r"\.(?P<sentence>.*?(go).*?)\."
match = re.search(pattern, text)
if match is not None:
    sentence = match.group("sentence")
Obviously, you'll need to import the regex library (import re) before you begin. Here is a teardown of what the regular expression actually does (more info can be found at the Python re library page):
\. # matches a literal period preceding the sentence.
(?P<sentence>...) # captures the enclosed match in a group named "sentence".
.*? # matches any text (non-greedy) up to the word "go".
Again, the link to the library reference page is key.
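The patterns above return only the first match. As a rough sketch of collecting every sentence that contains a word, the text can be split first and then filtered. The helper name sentences_containing is hypothetical, and the split rule (sentence-ending punctuation followed by whitespace) is a deliberately naive assumption that will mis-handle abbreviations such as "Dr.".

```python
import re

def sentences_containing(word, text):
    """Return the sentences of `text` that contain `word` as a whole word."""
    # Naive split: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    word_re = re.compile(r'\b' + re.escape(word) + r'\b', re.IGNORECASE)
    return [s for s in sentences if word_re.search(s)]

text = "go directly to jail. do not cross go. do not collect $200."
found = sentences_containing("go", text)

print(found)  # ['go directly to jail.', 'do not cross go.']
```

Unlike the pattern that starts with \., this also finds a match in the first sentence, which has no preceding period.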
