Extract weight of an item from its description using regex in python - python

I have a list of product descriptions. for example:
items = ['avuhovi Grillikaapeli 320g','Savuhovi Kisamakkara 320g',
'Savuhovi Raivo 250g', 'AitoMaku str.garl.sal.dres.330ml', 'Rydbergs
225ml Hollandaise sauce']
I want to extract the weights that is, 320g, 320g, 250ml, 330ml. I know we can use regex for this but do not know how to buil regex to extract that. You can see that weights are sometimes in the middle of the description and sometimes having dot(.) as separator rather than space. So, I am confused how to extract.
Thanks for help in advance :)

Here is one solution that may work (using search and group suggested by Wiktor):
>>> for t in items :
... re.search(r'([0-9]+(g|ml))', t).group(1)
...
'320g'
'320g'
'250g'
'330ml'
'225ml'
Indeed a better solution (thanks Wiktor) would be to test if there is a match :
>>> res = []
>>> for t in items :
... m = re.search(r'(\d+(g|ml))', t)
... if m:
... res.append(m.group(1))
print res

https://regex101.com/r/gy5YTp/4
Match any digit with \d+ then create a matching but non selecting group with (?:ml|g) this will match ml or g.
import re
items = ['avuhovi Grillikaapeli 320g', 'Savuhovi 333ml Kisamakkara 320g', 'Savuhovi Raivo 250g', 'AitoMaku str.garl.sal.dres.330ml', 'Rydbergs 225ml Hollandaise sauce']
groupedWeights = [re.findall('(\d+(?:ml|g))', i) for i in items]
flattenedWeights = [y for x in groupedWeights for y in x]
print(flattenedWeights)
The match that we make returns a list of lists of weights found so we need to flatten that with [y for x in groupedWeights for y in x]
That is if you ever have more than one weight in an element. Otherwise we can take the first element of each list like this.
weights = [re.findall('(\d+(?:ml|g))', i)[0] for i in items]

Related

python if/else list comprehension

I was wondering if it's possible to use list comprehension in the following case, or if it should be left as a for loop.
temp = []
for value in my_dataframe[my_col]:
match = my_regex.search(value)
if match:
temp.append(value.replace(match.group(1),'')
else:
temp.append(value)
I believe I can do it with the if/else section, but the 'match' line throws me off. This is close but not exactly it.
temp = [value.replace(match.group(1),'') if (match) else value for
value in my_dataframe[my_col] if my_regex.search(value)]
Single-statement approach:
result = [
value.replace(match.group(1), '') if match else value
for value, match in (
(value, my_regex.search(value))
for value in my_dataframe[my_col])]
Functional approach - python 2:
data = my_dataframe[my_col]
gen = zip(data, map(my_regex.search, data))
fix = lambda (v, m): v.replace(m.group(1), '') if m else v
result = map(fix, gen)
Functional approach - python 3:
from itertools import starmap
data = my_dataframe[my_col]
gen = zip(data, map(my_regex.search, data))
fix = lambda v, m: v.replace(m.group(1), '') if m else v
result = list(starmap(fix, gen))
Pragmatic approach:
def fix_string(value):
match = my_regex.search(value)
return value.replace(match.group(1), '') if match else value
result = [fix_string(value) for value in my_dataframe[my_col]]
This is actually a good example of a list comprehension that performs worse than its corresponding for-loop and is (far) less readable.
If you wanted to do it, this would be the way:
temp = [value.replace(my_regex.search(value).group(1),'') if my_regex.search(value) else value for value in my_dataframe[my_col]]
# ^ ^
Note that there is no place for us to define match inside the comprehension and as a result we have to call my_regex.search(value) twice.. This is of course inefficient.
As a result, stick to the for-loop!
use a regular expression pattern with a sub group pattern looking for any word until an space plus character and characters he plus character is found and a space plus character and el is found plus any character . repeat the sub group pattern
paragraph="""either the well was very deep, or she fell very slowly, for she had
plenty of time as she went down to look about her and to wonder what was
going to happen next. first, she tried to look down and make out what
she was coming to, but it was too dark to see anything; then she
looked at the sides of the well, and noticed that they were filled with
cupboards and book-shelves; here and there she saw maps and pictures
hung upon pegs. she took down a jar from one of the shelves as
she passed; it was labelled 'orange marmalade', but to her great
disappointment it was empty: she did not like to drop the jar for fear
of killing somebody, so managed to put it into one of the cupboards as
she fell past it."""
sentences=paragraph.split(".")
pattern="\w+\s+((\whe)\s+(\w+el\w+)){1}\s+\w+"
temp=[]
for sentence in sentences:
result=re.findall(pattern,sentence)
for item in result:
temp.append("".join(item[0]).replace(' ',''))
print(temp)
output:
['thewell', 'shefell', 'theshelves', 'shefell']

Match unique patterns in string - Python

I have a list of strings called txtFreeForm:
['Add roth Sweep non vested money after 5 years of termination',
'Add roth in-plan to the 401k plan.]
I need to check if only 'Add roth' exists in the sentence. To do that i used this
for each_line in txtFreeForm:
match = re.search('add roth',each_line.lower())
if match is not None:
print(each_line)
But this obviously returns both the strings in my list as both contain 'add roth'. Is there a way to exclusively search for 'Add roth' in a sentence, because i have a bunch of these patterns to search in strings.
Thanks for your help!
Can you fix this problem by using the .Length property of strings? I'm not an experienced Python programmer, but here is how I think it should work:
for each_line in txtFreeForm:
match = re.search('add roth',each_line.lower())
if (match is not None) and (len(txtFreeForm) == len("Add Roth")):
print(each_line)
Basically, if the text is in the string, AND the length of the string is exactly to the length of the string "Add Roth", then it must ONLY contain "Add Roth".
I hope this was helpful.
EDIT:
I misunderstood what you were asking. You want to print out sentences that contain "Add Roth", but not sentences that contain "Add Roth in plan". Is this correct?
How about this code?
for each_line in txtFreeForm:
match_AR = re.search('add roth',each_line.lower())
match_ARIP = re.search('add roth in plan',each_line.lower())
if (match_AR is True) and (match_ARIP is None):
print(each_line)
This seems like it should fix the problem. You can exclude any strings (like "in plan") by searching for them too and adding them to the comparison.
You're close :) Give this a shot:
for each_line in txtFreeForm:
match = re.search('add roth (?!in[-]plan)',each_line.lower())
if match is not None:
print(each_line[match.end():])
EDIT:
Ahhh I misread... you have a LOT of these. This calls for some more aggressive magic.
import re
from functools import partial
txtFreeForm = ['Add roth Sweep non vested money after 5 years of termination',
'Add roth in-plan to the 401k plan.']
def roths(rows):
for row in rows:
match = re.search('add roth\s*', row.lower())
if match:
yield row, row[match.end():]
def filter_pattern(pattern):
return partial(lazy_filter_out, pattern)
def lazy_filter(pattern):
return partial(lazy_filter, pattern)
def lazy_filter_out(pattern, rows):
for row, rest in rows:
if not re.match(pattern, rest):
yield row, rest
def magical_transducer(bad_words, nice_rows):
magical_sentences = reduce(lambda x, y: y(x), [roths] + map(filter_pattern, bad_words), nice_rows)
for row, _ in magical_sentences:
yield row
def main():
magic = magical_transducer(['in[-]plan'], txtFreeForm)
print(list(magic))
if __name__ == '__main__':
main()
To explain a bit about what's happening hear, you mentioned you have a LOT of these words to process. The traditional way you might compare two groups of items is with nested for-loops. So,
results = []
for word in words:
for pattern in patterns:
data = do_something(word_pattern)
results.append(data)
for item in data:
for thing in item:
and so on...
and so fourth...
I'm using a few different techniques to attempt to achieve a "flatter" implementation and avoid the nested loops. I'll do my best to describe them.
**Function compositions**
# You will often see patterns that look like this:
x = foo(a)
y = bar(b)
z = baz(y)
# You may also see patterns that look like this:
z = baz(bar(foo(a)))
# an alternative way to do this is to use a functional composition
# the technique works like this:
z = reduce(lambda x, y: y(x), [foo, bar, baz], a)

regex sbustitute only specific hit sequence

i have multiple string variations: "gr_shoulder_r_tmp", "r_shoulder_tmp"
i need to substitute:
"r_" to l_, here:
"gr_shoulder_r_tmp" > "gr_shoulder_l_tmp"
"r_shoulder_tmp" > "l_shoulder_tmp"
in other words i need to subustitute 3rd coinsidence in frist example
and 1st in second example of stirngs
im started digging myself...
and came up into halfesolved result, which bore one more interesting question:
a) Find index of right hit
[i for i, x in enumerate(re.findall("(.?)(r_)", "gr_shoulder_r_tmp")) if filter(None, x).__len__() == 1]
which gives me indx = 2
?) how to use that hit index :[
while wrote this i found straight simple solution..
b) split by underscore, replace standalone letter, and join back
findtag = "r"
newtag = "l"
itemA = "gr_shoulder_r_tmp"
itemB = "r_shoulderr_tmp"
spl_str = itemA.split("_")
hit = spl_str.index(findtag)
spl_str[hit] = newtag
new_item = "_".join(spl_str)
both itemA,itemB gives me what i need.. but im not happy of it, too heavy and so rough
A simple regex will do this job.
re.sub(r'(?<![a-zA-Z])r_', 'l_', s)
(?<![a-zA-Z]) negative lookbehind which asserts that the match would be preceeded by any but not a letter.
Example:
>>> re.sub(r'(?<![a-zA-Z])r_', 'l_',"gr_shoulder_r_tmp")
'gr_shoulder_l_tmp'
>>> re.sub(r'(?<![a-zA-Z])r_', 'l_',"r_shoulder_tmp")
'l_shoulder_tmp'

Python - delete part of link in array

I have a list of links in an array, such as
results = [link1/1254245,
'q%(random part)cache:link2/1254245& (random part) Dclnk',
'link3/1254245]
whereas link = http://www.whatever.com.
I want to replace the term q%3(random part)cache and &(random part)Dclnk with nothing so that the "clean" link2 is "cut" out and left over among the other "clean" links. The random part changes always in content and length. The q%3 : and & Dclnk stay the same.
How do I do that? I could not find a straight answer to that so far.
You could achieve this through re.sub and list comprehension.
>>> l = ['link1/1254245', 'q%(random part)cache:link2/1254245& (random part) Dclnk', 'link3/1254245']
>>> [re.sub(r'q%[^(]*\([^()]*\)cache:|&\s*\([^()]*\)\s*Dclnk', r'', i) for i in l]
['link1/1254245', 'link2/1254245', 'link3/1254245']
[^()]* matches any character but not of ( or ) zero or more times. Specify an | alteration operator to use multiple patterns.

Finding max value of float in a string - regex

I'm confronted with such a challenge right now. I've read some web classes and Dive Into Python on regex and nothing I found on my issue, thus I'm not sure if this is even possible to achieve.
Given this dict-alike string:
"Mon.":[11.76,7.13],"Tue.":[11.76,7.19],"Wed.":[11.91,6.94]
I'd like to compare values in brackets at corresponding positions and take only the greatest one. So comparing 11.76, 11.76, 11.91 should result in 11.91.
My alternative is to get all the values and compare them afterwards but I'm wondering whether regex could cope?
>>> import ast
>>> text = '''"Mon.":[11.76,7.13],"Tue.":[11.76,7.19],"Wed.":[11.91,6.94]'''
>>> rows = ast.literal_eval('{' + text + '}').values()
>>> [max(col) for col in zip(*rows)]
[11.91, 7.19]
Try this:
import re
text = '''"Mon.":[11.76,7.13],"Tue.":[11.76,7.19],"Wed.":[11.91,6.94]'''
values = re.findall(r'\[(.*?)\]', text)
values = map(lambda x: x.split(','), values)
values = zip(*values)
print max(map(float, values[0]))
print max(map(float, values[1]))
Output:
11.91
7.19

Categories

Resources