Extract the first occurrence of a number before a substring [duplicate] - python

There is a sentence "I have 5 kg apples and 6 kg pears".
I just want to extract the weight of the apples.
So I use
import re
sentence = "I have 5 kg apples and 6 kg pears"
number = re.findall(r'(\d+) kg apples', sentence)
print(number)
However, this only works for integers. What should I do if the number I want to extract is 5.5?

You can try something like this:
import re
sentence = ["I have 5.5 kg apples and 6 kg pears",
"I have 5 kg apples and 6 kg pears"]
for sen in sentence:
    print(re.findall(r'(\d+(?:\.\d+)?) kg apples', sen))
Output:
['5.5']
['5']

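? marks the preceding segment of a regex as optional. The inner group is non-capturing so that findall returns plain strings rather than tuples.
re.findall(r'((?:\d+\.)?\d+)', sentence)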

You can use number = re.findall(r'(\d+\.?\d*) kg apples', sentence)

Change your regex so that it also matches decimals:
(\d+(?:\.\d+)?)
\.\d+ matches a dot followed by at least one digit. That part is optional, so a plain integer such as 5 still matches through the leading \d+.
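Since the question asks for the first occurrence only, re.search may be a closer fit than re.findall. A minimal sketch, assuming a hypothetical helper name first_weight and assuming the unit and item name should be matched literally:

```python
import re

def first_weight(sentence, item, unit="kg"):
    """Return the first number followed by '<unit> <item>' as a float, or None."""
    # re.escape guards against regex metacharacters in the unit or item name.
    pattern = r'(\d+(?:\.\d+)?) ' + re.escape(unit) + ' ' + re.escape(item)
    match = re.search(pattern, sentence)
    return float(match.group(1)) if match else None

print(first_weight("I have 5.5 kg apples and 6 kg pears", "apples"))  # 5.5
print(first_weight("I have 5 kg apples and 6 kg pears", "pears"))     # 6.0
print(first_weight("I have no fruit", "apples"))                      # None
```

Returning None instead of an empty list makes the "no match" case explicit for the caller.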

re.findall(r'[-+]?[0-9]*\.?[0-9]+', sentence)

Non-regex solution
sentence = "I have 5.5 kg apples and 6 kg pears"
words = sentence.split(" ")
[words[idx-1] for idx, word in enumerate(words) if word == "kg"]
# => ['5.5', '6']
You can then check whether each element is a valid float using
try:
    float(element)
except ValueError:
    print("Not a float")
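Putting the two steps together, here is a sketch of the non-regex approach that also checks the word after kg, so that only the apples are picked up. The function name weight_of is only illustrative:

```python
def weight_of(sentence, item, unit="kg"):
    """Return the number before '<unit> <item>' as a float, or None."""
    words = sentence.split()
    # Start at 1 so that words[idx - 1] never wraps around to the last word.
    for idx in range(1, len(words) - 1):
        if words[idx] == unit and words[idx + 1] == item:
            try:
                return float(words[idx - 1])
            except ValueError:
                return None  # the token before the unit is not a number
    return None

print(weight_of("I have 5.5 kg apples and 6 kg pears", "apples"))  # 5.5
print(weight_of("I have 5.5 kg apples and 6 kg pears", "pears"))   # 6.0
```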

The regex you need should look like this (note the escaped dot):
(\d+\.?\d*) kg apples
You can use it as follows:
number = re.findall(r'(\d+\.?\d*) kg apples', sentence)
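With patterns of this shape it is worth checking that the dot between the digit groups is escaped, because an unescaped . matches any character. A small sketch on a made-up input shows the difference:

```python
import re

sentence = "I have 5x5 kg apples"

# Unescaped dot: '.' matches any character, so 'x' is accepted.
loose = re.findall(r'(\d+.?\d*) kg apples', sentence)
print(loose)   # ['5x5']

# Escaped dot: only a literal '.' may separate the digits.
strict = re.findall(r'(\d+\.?\d*) kg apples', sentence)
print(strict)  # ['5']
```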

Related

Count the word but ignore when it has a word with first letter capitalized before

I am trying to determine whether the word "McDonald" is in the cell. However, I want to ignore the case where the word before "McDonald" starts with a capital letter, as in 'Kevin McDonald'. Any suggestions on how to get this right with a regex in a pandas DataFrame?
import pandas as pd
data = {'text':["Kevin McDonald has bought a burger.",
                "The best burger in McDonald is cheeze buger."]}
df = pd.DataFrame(data)
long_list = ['McDonald', 'Five Guys']
# matching any of the words; the group makes \b apply to every alternative
pattern = r'\b(?:{})\b'.format('|'.join(long_list))
df['count'] = df.text.str.count(pattern)
   text
0  Kevin McDonald has bought a burger.
1  The best burger in McDonald is cheeze buger.
Expected output:
   text                                          count
0  Kevin McDonald has bought a burger.           0
1  The best burger in McDonald is cheeze buger.  1
You can try this pattern:
pattern = r'\b[a-z].*?\b {}'.format('|'.join(long_list))
df['count'] = df.text.str.count(pattern)
IIUC, the goal is to skip a match when the preceding word is capitalized. Requiring a non-capitalized word before the name would exclude many legitimate cases.
Here is a regex that also covers a few more cases (start of the string, or a non-word character before the name):
regex = '|'.join(fr'(?:\b[^A-Z]\S*\s+|[^\w\s] ?|^){i}' for i in long_list)
df['count'] = df['text'].str.count(regex)
Example:
   text                                          count
0  Kevin McDonald has bought a burger.           0
1  The best burger in McDonald is cheeze buger.  1
2  McDonald's restaurants.                       1
3  Blah. McDonald's restaurants.                 1
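Because Series.str.count counts regex matches per element, the pattern can be sanity-checked without a DataFrame. A sketch using plain re.findall on the four example strings (the list name texts is just for illustration):

```python
import re

long_list = ['McDonald', 'Five Guys']
regex = '|'.join(fr'(?:\b[^A-Z]\S*\s+|[^\w\s] ?|^){i}' for i in long_list)

texts = [
    "Kevin McDonald has bought a burger.",
    "The best burger in McDonald is cheeze buger.",
    "McDonald's restaurants.",
    "Blah. McDonald's restaurants.",
]
# len(re.findall(...)) counts the non-overlapping matches in each string.
counts = [len(re.findall(regex, t)) for t in texts]
print(counts)  # [0, 1, 1, 1]
```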

How to get all sentences that contain multiple words in Python

I am trying to write a regular expression that returns all sentences containing two words (the order doesn't matter), but I can't find a solution.
"Supermarket. This apple costs 0.99."
I want to get back the following sentence:
This apple costs 0.99.
I tried:
([^.]*?(apple)*?(costs)[^.]*\.)
I have problems because the price contains a dot. Also, this expression returns results that contain only one of the words.
Approach: for each phrase, find the sentences that contain all of its words. For each word in the phrase, we check whether a sentence contains it, and we do this for every sentence. The lookup is faster if the words of each sentence are stored in a set instead of a list.
Below is an implementation of the above approach in Python:
def getRes(sent, ph):
    sentHash = dict()
    # Loop for adding hashed sentences to sentHash
    for s in range(1, len(sent)+1):
        sentHash[s] = set(sent[s-1].split())
    # For each phrase
    for p in range(0, len(ph)):
        print("Phrase"+str(p + 1)+":")
        # Get the list of words
        wordList = ph[p].split()
        res = []
        # Then check in every sentence
        for s in range(1, len(sentHash)+1):
            wCount = len(wordList)
            # Every word in the phrase
            for w in wordList:
                if w in sentHash[s]:
                    wCount -= 1
            # If every word in the phrase matches
            if wCount == 0:
                # Add the sentence index to the result list
                res.append(s)
        if len(res) == 0:
            print("NONE")
        else:
            print('% s' % ' '.join(map(str, res)))
# Driver function
def main():
    sent = ["Strings are an array of characters",
            "Sentences are an array of words"]
    ph = ["an array of", "sentences are strings"]
    getRes(sent, ph)
main()
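The same idea can be written more compactly with set comparison. This is only a sketch: the function name sentences_with_all_words is made up, and it returns the indices instead of printing them. Matching stays case-sensitive, as in the code above.

```python
def sentences_with_all_words(sentences, phrases):
    """For each phrase, list the 1-based indices of sentences containing all its words."""
    sentence_sets = [set(s.split()) for s in sentences]
    results = []
    for phrase in phrases:
        wanted = set(phrase.split())
        # wanted <= words is True when every phrase word occurs in the sentence.
        results.append([i for i, words in enumerate(sentence_sets, start=1)
                        if wanted <= words])
    return results

sent = ["Strings are an array of characters",
        "Sentences are an array of words"]
ph = ["an array of", "sentences are strings"]
print(sentences_with_all_words(sent, ph))  # [[1, 2], []]
```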
You use a negated character class [^.], which matches any character except a dot.
But in your example data Supermarket. This apple costs 0.99. there are 2 dots before the dot at the end, so [^.]* cannot cross the dot after Supermarket. to reach apple.
You could, for example, match up to the first dot, then assert costs with a lookahead, and use a capture group to match the part that contains apple and ends with a dot at the end of the line.
Because one word is asserted and the other is matched, the pattern finds the two words in either order.
^[^.]*\.\s*(?=.*\bcosts\b)(.*\bapple\b.*\.)$
Explanation
^[^.]*\. From the start of the string, match until and including the first dot
\s* Match 0+ whitespace characters
(?=.*\bcosts\b) Positive lookahead, assert costs at the right
( Capture group 1 (this has the desired value)
.*\bapple\b.*\. Match the rest of the line that includes apple and ends with a dot
) Close group 1
$ Assert end of string
import re
regex = r"^[^.]*\.\s*(?=.*\bcosts\b)(.*\bapple\b.*\.)$"
test_str = ("Supermarket. This apple costs 0.99.\n"
"Supermarket. This costs apple 0.99.\n"
"Supermarket. This apple is 0.99.\n"
"Supermarket. This orange costs 0.99.")
print(re.findall(regex, test_str, re.MULTILINE))
Output
['This apple costs 0.99.', 'This costs apple 0.99.']
I also suggest first extracting the sentences and then finding the sentences that contain both words.
However, splitting text into sentences is a hard problem because of abbreviations, unusual names, etc. One way to do it is with the nltk.tokenize.punkt module.
You'll need to install NLTK and then run this in Python:
import nltk
nltk.download('punkt')
After that you can use the English sentence tokenizer together with a small regex helper:
TEXT = 'Mr. Bean is in supermarket. iPhone 12 by Apple Inc. costs $999.99.'
WORD1 = 'apple'
WORD2 = 'costs'
import nltk.data, re
# Regex helper: case-insensitive search for w delimited by non-word characters
find_word = lambda w, s: re.search(r'(^|\W)' + w + r'(\W|$)', s, re.I)
eng_sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
for sent in eng_sent_detector.tokenize(TEXT):
    if find_word(WORD1, sent) and find_word(WORD2, sent):
        print(sent, "\n----")
Output:
iPhone 12 by Apple Inc. costs $999.99.
----
Notice that it handles numbers and abbreviations for you.
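If installing NLTK is not an option, a much cruder splitter may be enough for simple input. The sketch below is an assumption-laden stand-in, not a replacement for punkt: it splits only where sentence-ending punctuation is followed by whitespace, so a price like 0.99 survives, but an abbreviation such as "Mr. Bean" would be split incorrectly. The function name sentences_with_words is made up.

```python
import re

def sentences_with_words(text, words):
    """Return sentences that contain every word in `words` (case-insensitive)."""
    # Naive split: sentence-ending punctuation followed by whitespace.
    # A decimal such as 0.99 is not split because no space follows its dot.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return [s for s in sentences
            if all(re.search(r'\b' + re.escape(w) + r'\b', s, re.I) for w in words)]

print(sentences_with_words("Supermarket. This apple costs 0.99.", ["apple", "costs"]))
# ['This apple costs 0.99.']
```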

Regex to match strings in quotes that contain only 3 or less capitalized words

I've searched and searched, but can't find any relief for my regex woes.
I wrote the following dummy sentence:
Watch Joe Smith Jr. and Saul "Canelo" Alvarez fight Oscar de la Hoya and Genaddy Triple-G Golovkin for the WBO belt GGG. Canelo Alvarez and Floyd 'Money' Mayweather fight in Atlantic City, New Jersey. Conor MacGregor will be there along with Adonis Superman Stevenson and Mr. Sugar Ray Robinson. "Here Goes a String". 'Money Mayweather'. "this is not a-string", "this is not A string", "This IS a" "Three Word String".
I'm looking for a regular expression that will return the following when used in Python 3.6:
Canelo, Money, Money Mayweather, Three Word String
The regex that has gotten me the closest is:
(["'])[A-Z](\\?.)*?\1
I want it to match only strings of 3 capitalized words or fewer that are immediately surrounded by single or double quotes. Unfortunately, so far it seems to match any string in quotes, no matter what the length or content, as long as it begins with a capital letter.
I've put a lot of time into trying to hack through it myself, but I've hit a wall. Can anyone with stronger regex kung-fu give me an idea of where I'm going wrong here?
Try this one: (["'])((?:[A-Z][a-z]+ ?){1,3})\1
(["']) - opening quote
(?:[A-Z][a-z]+ ?){1,3} - capitalized word, repeated 1 to 3 times, separated by spaces
[A-Z] - capital letter (first character of the word)
[a-z]+ - lowercase letters (rest of the word)
_? - optional space separating the capitalized words (_ stands for a space); optional so that the last word needs no trailing space
{1,3} - 1 to 3 times
\1 - closing quote, same as the opening one
Group 2 is what you want.
Match 1
Full match 29-37 `"Canelo"`
Group 1. 29-30 `"`
Group 2. 30-36 `Canelo`
Match 2
Full match 146-153 `'Money'`
Group 1. 146-147 `'`
Group 2. 147-152 `Money`
Match 3
Full match 318-336 `'Money Mayweather'`
Group 1. 318-319 `'`
Group 2. 319-335 `Money Mayweather`
Match 4
Full match 398-417 `"Three Word String"`
Group 1. 398-399 `"`
Group 2. 399-416 `Three Word String`
RegEx101 Demo: https://regex101.com/r/VMuVae/4
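In Python, the same pattern can be applied with re.finditer, taking group 2 of each match. A minimal sketch on a shortened excerpt of the sample text (the excerpt is an assumption made to keep the example small):

```python
import re

pattern = re.compile(r'''(["'])((?:[A-Z][a-z]+ ?){1,3})\1''')
txt = ('Saul "Canelo" Alvarez. "Here Goes a String". \'Money Mayweather\'. '
       '"this is not A string", "This IS a" "Three Word String".')
# Group 1 is the quote character; group 2 is the text between the quotes.
matches = [m.group(2) for m in pattern.finditer(txt)]
print(matches)  # ['Canelo', 'Money Mayweather', 'Three Word String']
```

Using finditer rather than findall avoids getting (quote, text) tuples back, since the pattern has two capture groups.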
Working with the text you've provided, I would try to use regular expression lookaround to get the words surrounded by quotes and then apply some conditions on those matches to determine which ones meet your criterion. The following is what I would do:
[p for p in re.findall(r'''(?<=['"])[\w ]{2,}(?=['"])''', txt) if all(x.istitle() for x in p.split(' ')) and len(p.split(' ')) <= 3]
txt is the text you've provided here. The output is the following:
# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']
Cleaner:
matches = []
for m in re.findall(r'''(?<=['"])[\w ]{2,}(?=['"])''', txt):
    if all(x.istitle() for x in m.split(' ')) and len(m.split(' ')) <= 3:
        matches.append(m)
print(matches)
# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']
Here's my go at it: ([\"'])(([A-Z][^ ]*? ?){1,3})\1

Python, find words from array in string

I just want to ask how I can find words from an array in a string.
I need a filter that finds the words saved in my array in text that a user types into a text box on my website.
The array (or list, or similar) needs to hold 30+ words.
The user types text into the text box.
Then the script should find all of the words.
Something like a spam filter, I guess.
Thanks
import re
words = ['word1', 'word2', 'word4']
s = 'Word1 qwerty word2, word3 word44'
r = re.compile('|'.join([r'\b%s\b' % w for w in words]), flags=re.I)
r.findall(s)
>> ['Word1', 'word2']
Solution 1 uses the regex approach which will return all instances of the keyword found in the data. Solution 2 will return the indexes of all instances of the keyword found in the data
import re
dataString = '''Life morning don't were in multiply yielding multiply gathered from it. She'd of evening kind creature lesser years us every, without Abundantly fly land there there sixth creature it. All form every for a signs without very grass. Behold our bring can't one So itself fill bring together their rule from, let, given winged our. Creepeth Sixth earth saying also unto to his kind midst of. Living male without for fruitful earth open fruit for. Lesser beast replenish evening gathering.
Behold own, don't place, winged. After said without of divide female signs blessed subdue wherein all were meat shall that living his tree morning cattle divide cattle creeping rule morning. Light he which he sea from fill. Of shall shall. Creature blessed.
Our. Days under form stars so over shall which seed doesn't lesser rule waters. Saying whose. Seasons, place may brought over. All she'd thing male Stars their won't firmament above make earth to blessed set man shall two it abundantly in bring living green creepeth all air make stars under for let a great divided Void Wherein night light image fish one. Fowl, thing. Moved fruit i fill saw likeness seas Tree won't Don't moving days seed darkness.
'''
keyWords = ['Life', 'stars', 'seed', 'rule']
#---------------------- SOLUTION 1
print('Solution 1 output:')
for keyWord in keyWords:
    print(re.findall(keyWord, dataString))
#---------------------- SOLUTION 2
print('\nSolution 2 output:')
for keyWord in keyWords:
    index = 0
    indexes = []
    indexFound = 0
    # Advance the start position one character at a time until find() returns -1
    while indexFound != -1:
        indexFound = dataString.find(keyWord, index)
        if indexFound not in indexes:
            indexes.append(indexFound)
        index += 1
    # Drop the trailing -1 that marks "not found"
    indexes.pop(-1)
    print(indexes)
Output:
Solution 1 output:
['Life']
['stars', 'stars']
['seed', 'seed']
['rule', 'rule', 'rule']
Solution 2 output:
[0]
[765, 1024]
[791, 1180]
[295, 663, 811]
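The index search in Solution 2 can also be expressed with re.finditer, which yields one match object per occurrence. A sketch with a made-up helper name find_indexes and a short made-up sample string; like str.find, it matches substrings, so "ruler" counts as a hit for "rule":

```python
import re

def find_indexes(text, keyword):
    """Return the start index of every non-overlapping occurrence of keyword."""
    # re.escape treats the keyword as literal text, not as a pattern.
    return [m.start() for m in re.finditer(re.escape(keyword), text)]

sample = "rule one, rule two, no ruler"
print(find_indexes(sample, "rule"))   # [0, 10, 23]
print(find_indexes(sample, "seed"))   # []
```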
Try
words = ['word1', 'word2', 'word4']
s = 'word1 qwerty word2, word3 word44'
count = 0
for x in s.split(" "):
    x = x.strip(",.")  # drop trailing punctuation so 'word2,' matches
    if x in words:
        print(x)
        count += 1
print("count is " + str(count))
output
word1
word2
count is 2
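For a list of 30+ words, a set makes each membership test cheap, and tokenizing with a regex avoids listing punctuation characters by hand. A sketch under those assumptions; the function name find_listed_words is hypothetical and matching is case-insensitive:

```python
import re

def find_listed_words(text, words):
    """Return the words from `words` that occur in `text`, in order of appearance."""
    wanted = {w.lower() for w in words}          # set lookup instead of list scan
    tokens = re.findall(r'\w+', text.lower())    # runs of word characters only
    return [t for t in tokens if t in wanted]

found = find_listed_words('Word1 qwerty word2, word3 word44', ['word1', 'word2', 'word4'])
print(found)                   # ['word1', 'word2']
print("count is", len(found))  # count is 2
```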

Writing an If condition to filter out the first word

I have a string:
Father ate a banana and slept on a feather
Part of my code shown below:
...
if word.endswith(('ther')):
    print(word)
This prints feather and also Father.
But I want to modify this if condition so that it does not apply to the first word of the sentence. The result should print only feather.
I tried adding an and clause, but it didn't work:
...
if word.endswith(('ther')) and word[0].endswith(('ther')):
    print(word)
This doesn't work. Help!
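You can use a slice to skip the first word and apply endswith() to the remaining words, like:
s = 'Father ate a banana and slept on a feather'
[w for w in s.split()[1:] if w.endswith('ther')]
You can also build a regex. Note that [1:] here drops the first match rather than the first word, so it only gives the right result when the first word itself ends in ther:
import re
re.findall(r'(\w*ther)',s)[1:]
['feather']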
If I understand your question correctly, you don't want it to print the word if it's the first word in the string. So, you could copy the string and drop the first word.
I'll walk you through it. Say you have this string:
s = "Father ate a banana and slept on a feather"
You can split it by running s.split() and catching that output:
['Father', 'ate', 'a', 'banana', 'and', 'slept', 'on', 'a', 'feather']
So if you want all the words, except the first, you can use the index [1:]. And you can combine the list of words by joining with a space.
s1 = "Father ate a banana and slept on a feather"
s2 = " ".join(s1.split()[1:])
The string s2 will now be the following:
ate a banana and slept on a feather
You can use that string and iterate over the words like you did above.
If you want to avoid making a temporary string, filter on the index instead (i is 0, and therefore falsy, for the first word):
[w for i, w in enumerate(s.split()) if w.endswith('ther') and i]
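The answers above can be wrapped in a small function. The sketch below also strips surrounding punctuation, which is an addition not in the original answers, so that a sentence ending in "feather." still matches; the name words_ending_with is illustrative only.

```python
import string

def words_ending_with(sentence, suffix):
    """Return words ending with `suffix`, ignoring the first word of the sentence."""
    result = []
    for word in sentence.split()[1:]:              # [1:] skips the first word
        cleaned = word.strip(string.punctuation)   # so 'feather.' still matches
        if cleaned.endswith(suffix):
            result.append(cleaned)
    return result

print(words_ending_with("Father ate a banana and slept on a feather.", "ther"))
# ['feather']
print(words_ending_with("Mother and father met another brother", "ther"))
# ['father', 'another', 'brother']
```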
