I would like to extract the exact sentence in which a particular word appears. Could anyone let me know how to do this in Python? I used concordance(), but it only prints the lines where the word matches.
Just a quick reminder: sentence breaking is actually pretty complex. There are exceptions to the period rule, such as "Mr." or "Dr.", and there is a variety of sentence-ending punctuation marks. There are also exceptions to the exceptions (if the next word is capitalized and is not a proper noun, then "Dr." can end a sentence, for example).
If you're interested in this (it's a natural language processing topic), you could check out the Natural Language Toolkit's (NLTK) punkt module.
If you have each sentence in a string, you can use find() on your word and, if it is found, return the sentence. Otherwise you could use a regex, something like this:
import re
pattern = r"\.?(?P<sentence>.*?good.*?)\."
match = re.search(pattern, yourwholetext)
if match is not None:
    sentence = match.group("sentence")
I haven't tested this, but something along those lines.
My test:
import re
text = "muffins are good, cookies are bad. sauce is awesome, veggies too. fmooo mfasss, fdssaaaa."
pattern = r"\.?(?P<sentence>.*?good.*?)\."
match = re.search(pattern, text)
if match is not None:
    print(match.group("sentence"))
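re.search returns only the first match. If every sentence containing the word is wanted, one option is to split the text first and then test each piece. The following is a minimal sketch under a loud assumption: sentences end with ".", "!" or "?" followed by whitespace, so abbreviations such as "Dr." will be split incorrectly. The helper name sentences_containing is made up for illustration.

```python
import re

def sentences_containing(text, word):
    """Return every (naively split) sentence that contains `word` as a whole word."""
    # Naive assumption: a sentence ends at ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # \b keeps "good" from matching inside "goodness"; re.escape guards special characters.
    word_re = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [s for s in sentences if word_re.search(s)]

text = "muffins are good, cookies are bad. sauce is awesome, veggies too. fmooo mfasss, fdssaaaa."
print(sentences_containing(text, "good"))
# ['muffins are good, cookies are bad.']
```

For real-world text, a proper sentence tokenizer is likely to give better results than this split.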
dutt did a good job answering this. I just wanted to add a couple of things:
import re
text = "go directly to jail. do not cross go. do not collect $200."
pattern = r"\.(?P<sentence>.*?(go).*?)\."
match = re.search(pattern, text)
if match is not None:
    sentence = match.group("sentence")
Obviously, you'll need to import the regex library (import re) before you begin. Here is a teardown of what the regular expression actually does (more info can be found on the Python re library page):
\. # looks for a period preceding the sentence (so the first sentence of the text cannot match).
(?P<sentence>...) # names the captured group "sentence".
.*? # matches any text (non-greedy) up to the word "go".
Again, the link to the library reference page is key.
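Since the pattern above requires a period before the sentence, "go directly to jail" is never returned. One possible variant, sketched here rather than taken from the answer, drops the leading period, matches runs of non-period characters around the word, and uses finditer to collect every hit. It assumes a sentence ends only with "." and treats "go" as a whole word.

```python
import re

text = "go directly to jail. do not cross go. do not collect $200."

# [^.]* = any run of characters that are not a period; \bgo\b = "go" as a whole word.
pattern = re.compile(r"[^.]*\bgo\b[^.]*\.")

matches = [m.group(0).strip() for m in pattern.finditer(text)]
print(matches)
# ['go directly to jail.', 'do not cross go.']
```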
I am after getting the last words of a text up to a stopword.
Imagine I have the text:
first_part = "This is a text that with the blue paper"
Going backwards from the end, I would like to get "blue paper".
In order to do that, I use the regex module:
import regex as re
print(first_part)
result=re.search(r"(?r)(?<=(\s*\b(an|a|the|for)\b\s*))(?P<feature>.*?)(?=\s*)$",first_part)
print(result)
Regex explanation:
(?r) = search in reverse
(?<=(\s*\b(an|a|the|for)\b\s*)) = lookbehind for any of the stop words, delimited by word boundaries \b
(?P<feature>.*?) = basically anything (.)
$ = anchored at the end of the string
This works just fine.
However, I am using the regex module in order to be able to use "(?r)", meaning reverse.
Does anyone know whether it would be possible to do this using re?
I need to implement this using only the standard library.
If you add a greedy match in front and a lazy one in the back, you will get just the last words. I'm not 100% sure this is what you want, though.
>>> import re
>>> first_part = "This is a text that with the blue paper"
>>> m = re.match(r"(?:.*)(?:an|a|the|for)\W(.+?)$", first_part)
>>> m[1]
'blue paper'
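The pattern above has no word boundaries, so a stop word at the end of a longer word (the final "a" of "data", say) would also count. A hedged variant using only the standard re module, with \b around the stop words, might look like the sketch below. The function name and the second test sentence are made up for illustration.

```python
import re

# Greedy .* pushes the match to the LAST stop word; \b keeps "a" from matching inside "data".
STOPWORD_TAIL = re.compile(r".*\b(?:an|a|the|for)\b\s*(?P<feature>.*)$")

def words_after_last_stopword(text):
    """Return the text following the last stop word, or None if there is none."""
    match = STOPWORD_TAIL.match(text)
    return match.group("feature") if match else None

print(words_after_last_stopword("This is a text that with the blue paper"))
# blue paper
print(words_after_last_stopword("look for a data point"))
# data point
```

Note that . does not match newlines by default, so this sketch assumes single-line input.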
I want to find all words which have a # attached to them.
I tried:
import re
text = "I was searching my #source to make a big desk yesterday."
re.findall(r'\b#\w+', text)
but it does not work...
Here's a small regex to do that:
>>> import re
>>> s = "I was searching my #source to make a big desk yesterday."
>>> re.findall(r"#(\w+)", s)
['source']
If you want to include the hashtag then use:
>>> re.findall(r"#\w+", s)
['#source']
You can use:
re.findall(r"#.+?\b", text)
which gives:
['#source']
Here is a link to regex101 which gives in-depth insight into what each part does.
Basically what is happening is:
the # matches the '#' character literally
then . matches any character
and + means one or more of them
then ? makes the + non-greedy, so as few characters as possible are matched
the \b is a word boundary and marks where the match stops
Update
As pointed out by @AnthonySottile, there is a case where the above regex will fail, namely:
hello#fred
where a match is made when it shouldn't be.
To get around this problem, a \s could be added to the front of the regex to make sure the # comes after some whitespace, but this fails when the hashtag comes right at the start of the string. A \b also won't suffice, because # is not a word character, so there is no word boundary between whitespace (or the start of the string) and the #.
So, to get around these, I came up with this rather ugly solution of adding a space to the start of the string before doing the findall:
re.findall(r"\s(#.+?)\b", " " + text)
It's not very neat, I know, but I couldn't find a cleaner way of doing it. I tried using an OR at the start to match a whitespace or the start of the string, as in (^|\s), but this produces multiple groups (as tuples) in the list that is returned from re.findall, so it would require some post-processing, which is even less neat.
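One way around the multiple-groups problem, offered as a sketch rather than as part of the original answer, is to make the alternation non-capturing with (?:...). re.findall then returns only the single remaining group. The sketch also swaps .+?\b for \w+, on the assumption that a hashtag consists of word characters only.

```python
import re

# (?:^|\s) is non-capturing, so findall returns just the (#\w+) group.
HASHTAG = re.compile(r"(?:^|\s)(#\w+)")

def find_hashtags(text):
    """Return hashtags that start the string or follow whitespace."""
    return HASHTAG.findall(text)

print(find_hashtags("I was searching my #source to make a big desk yesterday."))
# ['#source']
print(find_hashtags("hello#fred"))
# []
print(find_hashtags("#first word and #second"))
# ['#first', '#second']
```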
You do not need regex to solve this problem:
text = "I was searching my #source to make a big desk yesterday."
final_text = [i for i in text.split() if i.startswith('#')]
Output:
['#source']
However, this regex will work:
import re
text = "I was searching my #source to make a big desk yesterday."
final_text = re.findall(r'(?<!\S)#\w+(?!\S)', text)
Output:
['#source']
Is there a way to get regexp to match as much of a specific word as is possible? For example, if I am looking for the following words: yesterday, today, tomorrow
I want the following full words to be extracted:
yest
yesterday
tod
toda
today
tom
tomor
tomorrow
The following whole words should fail to match (basically, spelling mistakes):
yesteray
tomorow
tommorrow
tody
The best I could come up with so far is:
\b((tod(a(y)?)?)|(tom(o(r(r(o(w)?)?)?)?)?)|(yest(e(r(d(a(y)?)?)?)?)?))\b (Example)
Note: I could implement this using a finite state machine but thought it would be a giggle to get regexp to do this. Unfortunately, anything I come up with is ridiculously complex and I'm hoping that I've just missed something.
The regex you are looking for should include optional groups with alternations.
\b(yest(?:e(?:r(?:d(?:ay?)?)?)?)?|tod(?:ay?)?|tom(?:o(?:r(?:r(?:ow?)?)?)?)?)\b
See demo
Note that \b word boundaries are very important since you want to match whole words only.
Regex explanation:
\b - leading word boundary
(yest(?:e(?:r(?:d(?:ay?)?)?)?)?|tod(?:ay?)?|tom(?:o(?:r(?:r(?:ow?)?)?)?)?) - a capturing group matching
yest(?:e(?:r(?:d(?:ay?)?)?)?)? - yest, yeste, yester, yesterd, yesterda or yesterday
tod(?:ay?)? - tod or toda or today
tom(?:o(?:r(?:r(?:ow?)?)?)?)? - tom, tomo, tomor, tomorr, tomorro, or tomorrow
\b - trailing word boundary
See Python demo:
import re
p = re.compile(r'\b(yest(?:e(?:r(?:d(?:ay?)?)?)?)?|tod(?:ay?)?|tom(?:o(?:r(?:r(?:ow?)?)?)?)?)\b', re.IGNORECASE)
test_str = "yest\nyeste\nyester\nyesterd\nyesterda\nyesterday\ntod\ntoda\ntoday\ntom\ntomo\ntomor\ntomorr\ntomorro\ntomorrow\n\nyesteray\ntomorow\ntommorrow\ntody\nyesteday"
print(p.findall(test_str))
# => ['yest', 'yeste', 'yester', 'yesterd', 'yesterda', 'yesterday', 'tod', 'toda', 'today', 'tom', 'tomo', 'tomor', 'tomorr', 'tomorro', 'tomorrow']
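Writing the nested optional groups by hand gets error-prone for longer words. As a sketch, the same kind of pattern can be generated from each word plus a minimum prefix length. The helper name prefix_pattern and the min_len values are assumptions chosen to match the examples in the question.

```python
import re

def prefix_pattern(word, min_len):
    """Build a pattern matching any prefix of `word` that is at least `min_len` long."""
    head, tail = word[:min_len], word[min_len:]
    nested = ""
    # Wrap from the last letter inward: (?:y)?  ->  (?:a(?:y)?)?  ->  ...
    for ch in reversed(tail):
        nested = "(?:" + re.escape(ch) + nested + ")?"
    return re.escape(head) + nested

# Hypothetical minimum prefix lengths: "yest", "tod" and "tom" are the shortest accepted forms.
words = {"yesterday": 4, "today": 3, "tomorrow": 3}
pattern = re.compile(
    r"\b(?:" + "|".join(prefix_pattern(w, n) for w, n in words.items()) + r")\b"
)

print(prefix_pattern("today", 3))
# tod(?:a(?:y)?)?
print(pattern.findall("yest tod tomor tomorrow yesteray tomorow tody"))
# ['yest', 'tod', 'tomor', 'tomorrow']
```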
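Pipe-separate all the valid words or word prefixes, as below. This matches only the spellings that are listed, so every accepted prefix has to be spelled out explicitly.
^(?|yest|yesterday|tod|today)\b
I tested this at https://regex101.com/. Note that (?|...) is PCRE's branch-reset group, which Python's re module does not support; in Python, use a non-capturing group (?:...) instead.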
I have a Twitter bot that needs to ignore tweets that contain certain blacklisted words.
This works, but only if the words in the tweet are exactly as they're seen in the list of blacklisted words.
timeline = filter(lambda status: not any(word in status.text.split() for word in wordBlacklist), timeline)
I want to make sure that tweets can't bypass this by putting symbols or adding additional characters around a word, such as bypassing blacklisted word "face" by appending "book" to the end of it, like so "facebook".
How do I do this in a way that fits within my filter's lambda?
You can make use of re here.
import re
timeline = filter(lambda status: not any(re.findall(r"[a-zA-Z0-9]*"+word+r"[a-zA-Z0-9]*",status.text) for word in wordBlacklist), timeline)
You can also apply re.escape() to word if it can contain regex metacharacters.
If you expect symbols as well, try:
timeline = filter(lambda status: not any(re.findall(r"\S*"+word+r"\S*",status.text) for word in wordBlacklist), timeline)
You can construct a regular expression based on the blacklist:
from itertools import filterfalse  # named ifilterfalse in Python 2
import re
wordBlacklist = ['face', 'hello']
r = re.compile('|'.join(map(re.escape, wordBlacklist)))
...
timeline = list(filterfalse(lambda status: r.search(status.text), timeline))
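To make the idea concrete, here is a small self-contained sketch. The Status namedtuple is a hypothetical stand-in for whatever tweet object the bot library returns (only a .text attribute is assumed), and re.IGNORECASE is an addition so that "Facebook" is caught as well.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for a tweet object; only the .text attribute is assumed.
Status = namedtuple("Status", "text")

def build_blacklist_regex(words):
    """Compile one case-insensitive regex matching any blacklisted substring."""
    # Assumes `words` is non-empty; an empty pattern would match every tweet.
    return re.compile("|".join(map(re.escape, words)), re.IGNORECASE)

word_blacklist = ["face", "hello"]
blacklist_re = build_blacklist_regex(word_blacklist)

timeline = [
    Status("I love Facebook"),
    Status("nice weather today"),
    Status("#HELLO world"),
]
timeline = [s for s in timeline if not blacklist_re.search(s.text)]
print([s.text for s in timeline])
# ['nice weather today']
```

Substring matching like this also blocks innocent words such as "preface", so the trade-off is worth checking against real tweets.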
Instead of filter, you can use a list comprehension, which is the same idea with a slightly different syntax, and then use regular expressions for the filtering, as your example is beyond the capabilities of string operations:
import re
blacklist = re.compile('face|friend|advertisement')
timeline = [word for word in status.split() if not blacklist.search(word)]
# filter version of this command:
timeline = filter(lambda word: not blacklist.search(word), status.split())
Now timeline will be a list of words that contain no match for your blacklist, so "facebook" would be blocked because it contains "face", "friendly" would be blocked because it contains "friend", and so on. However, you are going to need to get fancier for things like "f*acebook" or other tricks; these would currently bypass the filter. Try out regex and get comfortable with them, and you can make pretty fancy filters. Here is a good practice site for regex.
I have a sentence. I want to find all occurrences of a word that start with a specific character in that sentence. I am very new to programming and Python, but from the little I know, this sounds like a Regex question.
What is the pattern match code that will let me find all words that match my pattern?
Many thanks in advance,
Brock
import re
print(re.findall(r'\bv\w+', thesentence))
will print every word in the sentence that starts with 'v', for example.
Using the split method of strings, as another answer suggests, would not identify words, but space-separated chunks that may include punctuation. This re-based solution does identify words (letters and digits, net of punctuation).
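The difference is easiest to see side by side. A small sketch, with a made-up sentence:

```python
import re

sentence = "The vet visited; (very) vocal visitors, obviously."

# split() yields whitespace-separated chunks, punctuation included.
chunks = [w for w in sentence.split() if w.startswith("v")]
print(chunks)
# ['vet', 'visited;', 'vocal', 'visitors,']

# \b finds the start of each word, and \w+ stops before any punctuation.
words = re.findall(r"\bv\w+", sentence)
print(words)
# ['vet', 'visited', 'very', 'vocal', 'visitors']
```

The split-based version misses "(very)" because the chunk starts with a parenthesis, and it keeps the trailing ";" and ",".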
I second the Dive Into Python recommendation. But it's basically:
m = re.findall(r'\bf.*?\b', 'a fast and friendly dog')
print(m)
\b means word boundary, and .*? ensures we store the whole word, but back off to avoid going too far (technically, ? is called a lazy operator).
You could do (doesn't use re though):
matching_words = [x for x in sentence.split() if x.startswith(CHAR_TO_FIND)]
Regular expressions work too (see the other answers) but I think this solution will be a little more readable, and as a beginner learning Python, you'll find list comprehensions (like the solution above) important to gain a comfort level with.
>>> sentence="a quick brown fox for you"
>>> pattern="fo"
>>> for word in sentence.split():
...     if word.startswith(pattern):
...         print(word)
...
fox
for
Split the sentence on spaces, use a loop to search for the pattern and print them out.
import re
s = "Your sentence that contains the word ROAD"
s = re.sub(r'\bROAD', 'RD.', s)
print(s)
Read: http://diveintopython3.org/regular-expressions.html