Python Regex Failure....? - python

I tried various solutions all day yesterday before giving up and going to bed. After coming back today and taking another look, I still cannot understand what is wrong with my regex.
I am trying to search my inventory based on a simple name and return an item index and the amount of that item that I have.
For instance, instead of knife my inventory could contain bloody_knife[9] at index 0, and the script should return 9 and 0 for the query knife.
The code:
import re
inventory = ["knife", "bottle[1]", "rope", "flashlight"]
def search_inventory(item):
    numbered_item = '.*' + item + '\[([0-9]*)\].*'
    print(numbered_item) #create regex statement
    regex_num_item = re.compile(numbered_item)
    print(regex_num_item) #compiled regex statement
    for x in item:
        match1 = regex_num_item.match(x) #regex match....
        print(match1) #seems to be producing nothing.
        if match1: #since it produces nothing the code fails.
            num_item = match1.group()
            count = match1.group(1)
            print(count)
            index = inventory.index(num_item)
        else: #eventually this part will expand to include "item not in inventory"
            print("code is wrong")
    return count, index
num_of_item, item_index = search_inventory("knife")
print(num_of_item)
print(item_index)
The output:
.*knife\[([0-9]*)\].*
re.compile('.*knife.*\\[([0-9]*)\\].*')
None
code is wrong
One thing I cannot figure out is what happens when Python takes the string in my numbered_item variable and uses it in re.compile(). Why is it adding additional escapes when I have already escaped the [] as needed?
Has anyone run into something like this before?

Your issue is here:
for x in item:
That iterates over every character of your item, "knife", so your regex was running on k, then n, and so on, which is not what you want. If you want to see it for yourself, add a print(x):
for x in item:
    print(x) #add this line
    match1 = regex_num_item.match(x) #regex match....
    print(match1) #seems to be producing nothing.
You'll see that it prints each letter of the item. That's what you're matching against in match1 = regex_num_item.match(x), so obviously it won't work.
You want to iterate over the inventory.
So you want:
for x in inventory: #meaning, for every item in inventory
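Putting that fix together, here is a minimal sketch of the search, not the asker's final code. It assumes the inventory holds bloody_knife[9] at index 0 as described in the question, takes the inventory as a second parameter (an addition for illustration), and uses re.escape in case the query contains regex metacharacters. As for the extra backslashes in the printed re.compile(...) line: that is only the repr of the pattern string, so the compiled pattern itself has not changed.

```python
import re

# Sample inventory assumed from the question's description
inventory = ["bloody_knife[9]", "bottle[1]", "rope", "flashlight"]

def search_inventory(item, inventory):
    """Return (count, index) for the first entry like '<anything><item>[N]', else None."""
    # Raw strings keep the backslashes literal; re.escape protects the query text
    pattern = re.compile(r".*" + re.escape(item) + r"\[([0-9]+)\]")
    for index, entry in enumerate(inventory):  # iterate over the inventory, not the query
        match = pattern.match(entry)
        if match:
            return int(match.group(1)), index
    return None  # item not in inventory, or it has no [N] suffix

print(search_inventory("knife", inventory))  # (9, 0)
print(search_inventory("rope", inventory))   # None
```

Using enumerate gives the index directly, so there is no need for a second inventory.index() lookup.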
Is the index important to you? Because you can change the inventory into a dictionary and you don't have to use regex:
inventory = {'knife':8, 'bottle':1, 'rope':1, 'flashlight':0, 'bloody_knife':1}
And then, if you wanted to find every item that has the word knife and how many you have of it:
for item in inventory:
    if "knife" in item:
        itemcount = inventory[item] #in a dictionary, we get the 'value' of the key this way
        print("Item Name: " + item + ", Count: " + str(itemcount))
Output:
Item Name: knife, Count: 8
Item Name: bloody_knife, Count: 1

Related

Find Similar Elements in List using Python

I need to look for similar items in a list using Python (e.g. 'Limits' is similar to 'Limit', or 'Download ICD file' is similar to 'Download ICD zip file').
I really want my results to be similar based on characters, not digits (e.g. 'Angle 1' is similar to 'Angle 2'). All the strings in my list end with '\0'.
What I am trying to do is split every item at blanks and check whether any part consists of a digit.
But somehow it is not working the way I want.
Here is my code example:
for k in range(len(split)): # split already consists of splitted list entry
    replace = split[k].replace(
        "\\0", ""
    ) # replace \0 at every line ending to guarantee it is only a digit
    is_num = lambda q: q.replace(
        ".", "", 1
    ).isdigit() # lambda i found somewhere on the internet
    check = is_num(replace)
    if check == True: # break if it is a digit and split next entry of list
        break
    elif check == False: # i know, else would be fine too
        seq = difflib.SequenceMatcher(a=List[i].lower(), b=List[j].lower())
        if seq.ratio() > 0.9:
            print(Element1, "is similar to", Element2, "\t")
            break
Try this; it uses get_close_matches from difflib instead of SequenceMatcher.
from difflib import get_close_matches
a = ["abc/0", "efg/0", "bc/0"]
b = []
for i in a:
    # note: rstrip("/0") removes any trailing '/' and '0' characters, not just the literal suffix "/0"
    x = i.rstrip("/0")
    b.append(x)
for i in range(len(b)):
    print(get_close_matches(b[i], b))
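For comparison, here is a rough sketch that stays closer to the question's own approach. It assumes the terminator is a literal backslash followed by a zero, and that the intent is to skip any entry containing a purely numeric token (as the question's loop does with break). The helper names clean, has_numeric_token and similar_pairs are made up for this example.

```python
from difflib import SequenceMatcher
from itertools import combinations

TERMINATOR = "\\0"  # assumption: a literal backslash followed by '0'

def clean(text):
    """Drop the terminator only if it is an exact suffix."""
    return text[:-len(TERMINATOR)] if text.endswith(TERMINATOR) else text

def has_numeric_token(text):
    """True if any whitespace-separated part is a number such as '1' or '2.5'."""
    return any(tok.replace(".", "", 1).isdigit() for tok in text.split())

def similar_pairs(items, threshold=0.9):
    cleaned = [clean(s) for s in items]
    pairs = []
    for a, b in combinations(cleaned, 2):  # each unordered pair once
        if has_numeric_token(a) or has_numeric_token(b):
            continue  # skip entries like 'Angle 1'
        if SequenceMatcher(a=a.lower(), b=b.lower()).ratio() > threshold:
            pairs.append((a, b))
    return pairs

items = ["Limits\\0", "Limit\\0", "Angle 1\\0", "Angle 2\\0"]
print(similar_pairs(items))  # [('Limits', 'Limit')]
```

The 0.9 threshold is taken from the question; longer pairs with an inserted word may need a lower value.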

How To Speed Up Item Descriptions Matching Using FuzzyWuzzy in Python

I have a list of item descriptions that I read from a csv column, and I am trying to look for matches on another list of item descriptions. But it is currently extremely slow as it is trying to match each item on list 1 to every item on list 2.
Here is an example of the item descriptions:
Item description list 1 = [BAR EVENING DREAM INTENSE DARK 3.5 OZ GHRDLLI]
Item description list 2 = [GHIARDELLI EVENING DREAM INTENSE DARK BAR 3.5 OZ 60% (60716)]
this shows the closest match.
Here is part of my code, which uses FuzzyWuzzy's extractOne with the token_sort_ratio scorer:
import csv
import re
from re import findall
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
regex_size = r'([0-9]+((\.\d+)?)+(OZ|CT|oz|ct|(\sOZ)|(\sCT)|(\soz)|(\sct)))'
regex_size_oz_check = r'(?=(OZ|oz|(\sOZ)|(\soz)))'
regex_size_ct_check = r'(?=(CT|ct|(\sCT)|(\sct)))'
with open('IC_ITM_CROSS_REF', 'r') as hosts:
    reader = csv.reader(hosts, delimiter='|')
    #iterate through each
    for row in reader:
        #row[2] = size, can be in OZ or CT
        list1_item_desc = row[1] + " " + row[2] + " " + row[4]
        #look for matching REPORT_UPC_CODE
        if row[7] in UPC:
            message = "UPC match"
        #if items UPC doesn't match, check if item descriptions match or not
        else:
            #look for matching Item Desc (return the highest percentage of item desc matching)
            #more defined search on the item size
            highest = process.extractOne(list1_item_desc,list(all_other_item_desc),scorer=fuzz.token_sort_ratio)
            if highest[1] > 80:
                #check if the size match
                other_item_size = list(map(lambda x: x[0], findall(regex_size, highest[0])))
                other_item_size_lower = list(map(lambda x:x.replace(" ", "").lower(),other_item_size))
                if(row[2].replace(" ", "").lower() in other_item_size_lower) or not other_item_size:
                    print("MATCH")
The way the code currently works is that it first checks whether the items' UPC codes match. If they do not, it looks at the item descriptions. For each item description in list 1, it tries to pull the one item description from the other_item_description list that it most closely matches.
Currently, I have thousands of items in list1 and thousands of items in the other list, so it is extremely slow; it can take a couple of hours to finish running. Is there a way to speed this up? I'm still new to Python programming and any suggestions would be helpful. Thanks!
There are a couple of things you can do to speed this up.
You can replace FuzzyWuzzy with rapidfuzz (I am the author), which does the same but is faster.
Right now your strings get preprocessed when calling extractOne, so they are e.g. lowercased before comparing them. For the list of choices this can be done once, before your loop.
Besides this, I replaced your map constructs with a set comprehension, which should be slightly faster, but above all is simpler to read.
UPC should be a set as well, so that you get constant lookup time; with a list, Python has to iterate over the whole list until it finds the item (which is slow when working with big lists like yours).
I could not test this, since I do not have access to the required data, but these changes should give you quite a big performance improvement.
import csv
import re
from re import findall
from rapidfuzz import process, fuzz, utils
regex_size = r'([0-9]+((\.\d+)?)+(OZ|CT|oz|ct|(\sOZ)|(\sCT)|(\soz)|(\sct)))'
regex_size_oz_check = r'(?=(OZ|oz|(\sOZ)|(\soz)))'
regex_size_ct_check = r'(?=(CT|ct|(\sCT)|(\sct)))'
with open('IC_ITM_CROSS_REF', 'r') as hosts:
    reader = csv.reader(hosts, delimiter='|')
    #preprocess every choice once, keyed by the original description
    choice_mappings = {choice: utils.default_process(choice) for choice in all_other_item_desc}
    #iterate through each
    for row in reader:
        #row[2] = size, can be in OZ or CT
        list1_item_desc = row[1] + " " + row[2] + " " + row[4]
        #look for matching REPORT_UPC_CODE
        if row[7] in UPC:
            message = "UPC match"
        #if items UPC doesn't match, check if item descriptions match or not
        else:
            match = process.extractOne(
                utils.default_process(list1_item_desc),
                choice_mappings,
                processor=None,
                scorer=fuzz.token_sort_ratio,
                score_cutoff=80)
            if match:
                #match[2] is the dictionary key, i.e. the original description
                other_item_size = {x[0].replace(" ", "").lower() for x in findall(regex_size, match[2])}
                if(row[2].replace(" ", "").lower() in other_item_size) or not other_item_size:
                    print("MATCH")
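Two of the suggestions above, hoisting the preprocessing out of the loop and using a set for the UPC lookup, do not depend on any fuzzy-matching library. The following standard-library sketch illustrates only those two ideas: normalize is a hypothetical stand-in for a real preprocessor, the sample data is made up, and an exact comparison replaces the fuzzy scorer to keep the example self-contained.

```python
def normalize(text):
    """Hypothetical preprocessor: lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())

# Made-up sample data, for illustration only
upc_list = ["0001", "0002", "0003"]
choices = ["GHIARDELLI  Intense Dark BAR", "Rope 10 FT"]
rows = [("0002", "anything"), ("9999", "rope 10 ft"), ("9998", "unknown item")]

upc_set = set(upc_list)                          # average O(1) membership test
normalized = {normalize(c) for c in choices}     # computed once, before the loop

results = []
for upc, desc in rows:
    if upc in upc_set:
        results.append("UPC match")
    elif normalize(desc) in normalized:          # exact match stands in for fuzzy scoring
        results.append("description match")
    else:
        results.append("no match")

print(results)  # ['UPC match', 'description match', 'no match']
```

With a list, each `in` test scans the elements one by one, so with thousands of rows and thousands of UPC codes the difference adds up.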

Unable to get desired output from python list

I have a small issue with the following Python code.
I have a list which contains the following elements. This list could also be empty, with no contents.
As you can see, This is my first stock I bought 01.27.2019 is encoded and I have to decode it to remove the b''. When I perform the split operation on mylist, I get '' as the first item in the list, and I am not sure why.
mylist = "$17$b'This is my first stock I bought 01.27.2019'"
The fields in mylist are separated by a $ rather than a ,
tmp_list = mylist.split('$')
print (tmp_list) # ['', '17', "b'This is my first stock I bought 01.27.2019'"] ---> Not sure why I have the '' as the first item in the tmp_list
tmp_iter = iter(tmp_list)
res['myinfo']= '{' + '},{'.join(f'{n},{s}' for n, s in zip(tmp_iter, tmp_iter)) + '}'
I want my res['myinfo'] to be "{{17, This is my first stock I bought 01.27.2019}, ...many more {,}}".
At times, res['myinfo'] could just be "{}" if mylist = [""].
I am not sure how to fix my code; any help would be appreciated.
First of all
mylist = "$17$b'This is my first stock I bought 01.27.2019'"
Should be
mylist = "17$b'This is my first stock I bought 01.27.2019'"
Just as @takendarkk said:
That's probably because the first character in the string is your
separator ($). What comes before that? Nothing. – takendarkk
As for decoding your strings: assuming they all look like "b'some string'" and each has a number before it, you have to decode every odd-indexed element in your list. A very straightforward solution could be:
def decodeString(stringToDecode):
    decodedList = []
    for i in range(2, len(stringToDecode)-1): #every character except the first 2 and the last one
        decodedList.append(stringToDecode[i])
    decodedString = ''.join(decodedList)
    return decodedString
for i in range(len(tmp_list)):
    if i % 2 == 1: #every odd indexed element
        decodedString = decodeString(tmp_list[i])
        tmp_list[i] = decodedString
And also it seems like you need 1 more layer of '{}' here:
res['myinfo']= '{' + '},{'.join(f'{n},{s}' for n, s in zip(tmp_iter, tmp_iter)) + '}'
This way your res['myinfo'] is {17,This is my first stock I bought 01.27.2019},{18,This is my second stock I bought 01.28.2019}, but if you want it to be '{{17,This is my first stock I bought 01.27.2019},{18,This is my second stock I bought 01.28.2019}}', as you said, you need:
res['myinfo']= '{' + '{' + '},{'.join(f'{n},{s}' for n, s in zip(tmp_iter, tmp_iter)) + '}' + '}'
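The pieces above can be combined into one small function. This is a sketch under a few assumptions: fields never contain a $ themselves, every description has the form b'...', and a leading or trailing $ should simply be ignored. The names strip_bytes_repr and build_info are invented for the example.

```python
def strip_bytes_repr(text):
    """Turn "b'abc'" into "abc"; leave any other string unchanged."""
    if text.startswith("b'") and text.endswith("'"):
        return text[2:-1]  # drop the leading b' and the trailing '
    return text

def build_info(raw):
    """Build '{{n,text},{n,text},...}' from a '$'-separated string."""
    trimmed = raw.strip("$")                  # a leading '$' would yield an empty first field
    fields = trimmed.split("$") if trimmed else []
    it = iter(fields)
    # zip(it, it) walks the same iterator twice, pairing each number with its text
    pairs = ",".join("{" + f"{n},{strip_bytes_repr(s)}" + "}" for n, s in zip(it, it))
    return "{" + pairs + "}"

print(build_info("$17$b'This is my first stock I bought 01.27.2019'"))
# {{17,This is my first stock I bought 01.27.2019}}
print(build_info(""))
# {}
```

An empty input produces "{}", which matches the empty case described in the question.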

Exception not including

I have a file at /location/data.txt. In this file I have entries like:
aaa:xxx:abc.com:1857:xxx1:rel5t2:y
ifa:yyy:xyz.com:1858:yyy1:rel5t2:y
I want to look up 'aaa' from my code. Whether I type aaa in upper or lower case as input, my Python code should report that aaa is a valid item.
But I want to add one exception: if I give the input with an -mc suffix (aaa-mc), in either lower or upper case, the code should ignore the -mc.
Below is my code, along with the output I am getting now.
def pITEMName():
    global ITEMList,fITEMList
    pITEMList = []
    fITEMList = []
    ITEMList = str(raw_input('Enter pipe separated list of ITEMS : ')).upper().strip()
    items = ITEMList.split("|")
    count = len(items)
    print 'Total Distint ITEM Count : ', count
    pipelst = [i.split('-mc')[0] for i in ITEMList.split('|')]
    filepath = '/location/data.txt'
    f = open(filepath, 'r')
    for lns in f:
        split_pipe = lns.split(':', 1)
        if split_pipe[0] in pipelst:
            index = pipelst.index(split_pipe[0])
            pITEMList=split_pipe[0]+"|"
            fITEMList.append(pITEMList)
            del pipelst[index]
    for lns in pipelst:
        print bcolors.red + lns,' is wrong ITEM Name' + bcolors.ENDC
    f.close()
When I execute the above code, it prompts me with:
Enter pipe separated list of ITEMS :
If I provide the list like this:
Enter pipe separated list of ITEMS : aaa-mc|ifa
it gives me this result:
Total Distint item Count : 2
AAA-MC is wrong item Name
items Belonging to other :
Other center :
item Count From Other center = 0
items Belonging to Current Centers :
Active items in US1 :
^IFA$
Active items in US2 :
^AAA$
Ignored item Count From Current center = 0
You Have Entered itemList belonging to this center as: ^IFA$|^AAA$
Active item Count : 2
Do You Want To Continue [YES|Y|NO|N] :
As you can see in the above result, aaa is counted as valid (Active item Count : 2) because it is available in the /location/data.txt file, but it is also reported as "AAA-MC is wrong item Name" (second line of the above result). I want '-mc' or '-MC' to be ignored for any item, whether or not it is present in /location/data.txt.
Please let me know what is wrong with my code.
The issue you're having is that your code expects the "-mc" suffix to appear in lowercase, but you're calling the upper() method on the input string, resulting in text that is all upper case. You need to change one of those so that they match (it doesn't really matter which one).
Either replace the upper() call with lower(), or replace the string "-mc" with "-MC", and your code should work better (I'm not certain I understand all of it, so there may be other issues).
The way you are constructing ITEMList is by reading in a string, converting it to upper case (with upper()), and stripping leading and trailing whitespace. Therefore, something like 'aaa-mc' is converted to 'AAA-MC'. You later split this uppercase string on the token '-mc', which it cannot contain, so the suffix is never removed.
I'd recommend either replacing upper() with lower() when you read your string in, or doing a hard replace on both forms of '-mc', so instead of
i.split('-mc')[0]
try using
i.replace('-mc','').replace('-MC','')
in your list comprehension.
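As a variation on the same idea, the suffix can be removed without caring about letter case at all. This is only a sketch: the helper name strip_mc_suffix is invented, the sample input is made up, and it assumes the names in the data file are lowercase, as in the sample lines shown in the question.

```python
def strip_mc_suffix(name):
    """Remove a trailing '-mc' regardless of letter case; leave other names alone."""
    return name[:-3] if name.lower().endswith("-mc") else name

raw = "aaa-mc|ifa| BBB-MC "  # made-up input with mixed case and stray spaces
names = [strip_mc_suffix(part.strip()).lower() for part in raw.split("|")]
print(names)  # ['aaa', 'ifa', 'bbb']
```

Unlike replace('-mc',''), this only touches the end of the name, so an item that happens to contain "-mc" in the middle is left intact.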

python -- for with an if statement

I don't understand why, when I run my code, the for loop under the if statement isn't run, even when found is greater than 0!
def findpattern(commit_msg):
    pattern = re.compile("\w\w*-\d\d*")
    group = pattern.finditer(commit_msg)
    found = getIterLength(group)
    print found
    if found > 0:
        issues = 0
        for match in group:
            print " print matched issues:"
            auth = soap.login(jirauser,passwd)
            print match.group(0)
            getIssue(auth,match.group(0))
            issues = issues + 1
    else:
        sys.exit("No issue patterns found.")
    print "Retrieved issues: " + str(issues)
Any help would be appreciated, I have been banging my head on this for an hour.
Your getIterLength() function is finding the length by exhausting the iterator returned by finditer(). You would then need a new iterator instance for the for loop. Instead, I would restructure your code like this:
def findpattern(commit_msg):
    pattern = re.compile("\w\w*-\d\d*")
    group = pattern.finditer(commit_msg)
    found = 0
    issues = 0
    for match in group:
        print " print matched issues:"
        auth = soap.login(jirauser,passwd)
        print match.group(0)
        getIssue(auth,match.group(0))
        issues = issues + 1
        found += 1
    if found == 0:
        sys.exit("No issue patterns found.")
    print "Retrieved issues: " + str(issues)
Or you could use the findall() method instead of finditer(). It gives you a list (which is an iterable, not an iterator), so you can call len(group) to get the size and then iterate over it in your for loop.
Check your code formatting: it looks like you have a double tab instead of a single tab under the for. Remember that Python is very picky about indentation.
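The exhaustion effect is easy to reproduce in isolation. The sketch below uses Python 3 syntax and a made-up commit message; it does not call the question's getIterLength, soap or getIssue helpers. Note that findall() returns plain strings here, so there is no match.group(0) to call on its items.

```python
import re

pattern = re.compile(r"\w+-\d+")        # same idea as \w\w*-\d\d*
commit_msg = "Fix ABC-12 and XYZ-7"     # made-up sample message

matches = pattern.finditer(commit_msg)
first_pass = sum(1 for _ in matches)    # counting consumes the iterator
second_pass = sum(1 for _ in matches)   # nothing is left the second time
print(first_pass, second_pass)          # 2 0

issue_keys = pattern.findall(commit_msg)  # a list can be measured and re-iterated
print(len(issue_keys), issue_keys)        # 2 ['ABC-12', 'XYZ-7']
```

If the match objects themselves are needed, list(pattern.finditer(commit_msg)) gives a reusable list of them.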
