I have a list of item descriptions that I read from a CSV column, and I am trying to look for matches in another list of item descriptions. But it is currently extremely slow, because it tries to match each item on list 1 against every item on list 2.
Here is an example of the item descriptions:
Item description list 1 = [BAR EVENING DREAM INTENSE DARK 3.5 OZ GHRDLLI]
Item description list 2 = [GHIARDELLI EVENING DREAM INTENSE DARK BAR 3.5 OZ 60% (60716)]
This shows the closest match.
Here is a little bit of my code, which uses FuzzyWuzzy's extractOne with token_sort_ratio:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import csv   # needed for csv.reader below
import re
from re import findall

regex_size = '([0-9]+((\.\d+)?)+(OZ|CT|oz|ct|(\sOZ)|(\sCT)|(\soz)|(\sct)))'
regex_size_oz_check = '(?=(OZ|oz|(\sOZ)|(\soz)))'
regex_size_ct_check = '(?=(CT|ct|(\sCT)|(\sct)))'

with open('IC_ITM_CROSS_REF', 'r') as hosts:
    reader = csv.reader(hosts, delimiter='|')
    #iterate through each row
    for row in reader:
        #row[2] = size, can be in OZ or CT
        list1_item_desc = row[1] + " " + row[2] + " " + row[4]
        #look for matching REPORT_UPC_CODE
        if row[7] in UPC:
            message = "UPC match"
        #if the item's UPC doesn't match, check whether the item descriptions match
        else:
            #look for the closest matching item description (highest match percentage),
            #then do a more refined check on the item size
            highest = process.extractOne(list1_item_desc, list(all_other_item_desc), scorer=fuzz.token_sort_ratio)
            if highest[1] > 80:
                #check if the sizes match
                other_item_size = list(map(lambda x: x[0], findall(regex_size, highest[0])))
                other_item_size_lower = list(map(lambda x: x.replace(" ", "").lower(), other_item_size))
                if (row[2].replace(" ", "").lower() in other_item_size_lower) or not other_item_size:
                    print("MATCH")
The way the code currently works is that it first checks whether the items' UPC codes match. If they do not, it looks at the item descriptions: for each item description on list 1, it tries to pull the one item description from the list of other item descriptions that it most closely matches.
Currently, I have thousands of items in list 1 and thousands of items in the other list, so it is extremely slow; it can take a couple of hours to finish running. Is there a way to speed this up? I'm still new to Python programming and any suggestions would be helpful. Thanks!
There are a couple of things you can do to speed this up.
You can replace FuzzyWuzzy with rapidfuzz (I am the author), which does the same but is faster.
Right now your strings get preprocessed when calling extractOne, so they are e.g. lowercased before being compared. For the list of choices this can be done once, up front, instead of on every call.
Besides this, I replaced your map constructs with a set comprehension, which should be slightly faster, but is especially simpler to read.
UPC should be a set as well, so you get constant-time lookups; with a list, Python has to iterate over the whole list until it finds the item (which is slow when working with lists as big as yours).
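For example, a minimal sketch (assuming UPC is currently built as a list of UPC codes somewhere before the loop):
UPC = set(UPC)  # one-time conversion; `row[7] in UPC` is then a constant-time lookup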
I could not test this, since I do not have access to the required data, but these changes should give you quite a big performance improvement.
from rapidfuzz import process, fuzz, utils
import csv   # needed for csv.reader below
import re
from re import findall

regex_size = '([0-9]+((\.\d+)?)+(OZ|CT|oz|ct|(\sOZ)|(\sCT)|(\soz)|(\sct)))'
regex_size_oz_check = '(?=(OZ|oz|(\sOZ)|(\soz)))'
regex_size_ct_check = '(?=(CT|ct|(\sCT)|(\sct)))'

with open('IC_ITM_CROSS_REF', 'r') as hosts:
    reader = csv.reader(hosts, delimiter='|')
    # preprocess the choices once, so extractOne does not have to do it on every call
    choice_mappings = {choice: utils.default_process(choice) for choice in all_other_item_desc}
    #iterate through each row
    for row in reader:
        #row[2] = size, can be in OZ or CT
        list1_item_desc = row[1] + " " + row[2] + " " + row[4]
        #look for matching REPORT_UPC_CODE
        if row[7] in UPC:
            message = "UPC match"
        #if the item's UPC doesn't match, check whether the item descriptions match
        else:
            match = process.extractOne(
                utils.default_process(list1_item_desc),
                choice_mappings,
                processor=None,
                scorer=fuzz.token_sort_ratio,
                score_cutoff=80)
            if match:
                # match[2] is the dictionary key, i.e. the original (unprocessed) description
                other_item_size = {x[0].replace(" ", "").lower() for x in findall(regex_size, match[2])}
                if (row[2].replace(" ", "").lower() in other_item_size) or not other_item_size:
                    print("MATCH")
I was trying various solutions all day yesterday before I hung it up and went to bed. After coming back today and taking another look at it, I still cannot understand what is wrong with my regex statement.
I am trying to search my inventory based on a simple name and return an item index and the amount of that item that I have.
For instance, instead of knife my inventory could have bloody_knife[9] at index 0, and the script should return 9 and 0 based on the query knife.
The code:
import re

inventory = ["knife", "bottle[1]", "rope", "flashlight"]

def search_inventory(item):
    numbered_item = '.*' + item + '\[([0-9]*)\].*'
    print(numbered_item)    #create regex statement
    regex_num_item = re.compile(numbered_item)
    print(regex_num_item)   #compiled regex statement
    for x in item:
        match1 = regex_num_item.match(x)    #regex match....
        print(match1)   #seems to be producing nothing.
        if match1:      #since it produces nothing the code fails.
            num_item = match1.group()
            count = match1.group(1)
            print(count)
            index = inventory.index(num_item)
        else:   #eventually this part will expand to include "item not in inventory"
            print("code is wrong")
        return count, index

num_of_item, item_index = search_inventory("knife")
print(num_of_item)
print(item_index)
The output:
.*knife\[([0-9]*)\].*
re.compile('.*knife.*\\[([0-9]*)\\].*')
None
code is wrong
One thing that I cannot seem to settle with is how Python takes the pattern in my numbered_item variable and uses it in the re.compile() function. Why is it adding additional escapes when I already have the necessary [] escaped?
Has anyone run into something like this before?
Your issue is here:
for x in item:
That is looping over every character in your item "knife". So your regex was running against k, then n, and so on, which of course is not what you want. If you still want to "see it", add a print(x):
for x in item:
    print(x)    #add this line
    match1 = regex_num_item.match(x)    #regex match....
    print(match1)   #seems to be producing nothing.
You'll see that it will print each letter of the item. That's what you're matching against in your match1 = regex_num_item.match(x), so obviously it won't work.
You want to iterate over the inventory.
So you want:
for x in inventory: #meaning, for every item in inventory
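Putting that together, a minimal corrected sketch of your function could look like this (I'm using an inventory with the bloody_knife[9] entry from your example; the not-found handling is only a placeholder):
import re

inventory = ["bloody_knife[9]", "bottle[1]", "rope", "flashlight"]

def search_inventory(item):
    numbered_item = '.*' + item + r'\[([0-9]*)\].*'
    regex_num_item = re.compile(numbered_item)
    for x in inventory:                      # iterate over the inventory, not the query string
        match1 = regex_num_item.match(x)
        if match1:
            count = match1.group(1)          # the digits inside the brackets, as a string
            index = inventory.index(x)       # x is the full inventory entry that matched
            return count, index
    return None, None                        # placeholder for "item not in inventory"

print(search_inventory("knife"))             # prints ('9', 0) with the inventory above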
Is the index important to you? Because you can change the inventory into a dictionary and you don't have to use regex:
inventory = {'knife':8, 'bottle':1, 'rope':1, 'flashlight':0, 'bloody_knife':1}
And then, if you wanted to find every item that has the word knife and how many you have of it:
for item in inventory:
    if "knife" in item:
        itemcount = inventory[item]   #in a dictionary, we get the 'value' of the key this way
        print("Item Name: " + item + ", Count: " + str(itemcount))
Output:
Item Name: bloody_knife, Count: 1
Item Name: knife, Count: 8
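As for the "additional escapes" part of the question: re.compile() isn't adding anything to your pattern. The doubled backslashes are just how Python's repr displays a string that contains backslashes; the pattern itself still has single backslashes. A small example:
pattern = r'.*knife\[([0-9]*)\].*'
print(pattern)          # prints: .*knife\[([0-9]*)\].*
print(repr(pattern))    # prints: '.*knife\\[([0-9]*)\\].*'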
I am currently working with a text file that contains a list of DNA sequences (contigs), each with a header line followed by lines of nucleotides. There are 120 contigs; each entry is marked by a line that starts with ">" to denote the sequence information, and the lines after it contain the nucleotides of that sequence.
example:
>gi|571136972|ref|XM_006625214.1| Plasmodium chabaudi chabaudi small subunit ribosomal protein 5 (Rps5) (rps5) mRNA, complete cds
ATGAGAAATATTTTATTAAAGAAAAAATTATATAATAGTAAAAATATTTATATTTTATATTATATTTTAATAATATTTAAAAGTATTTTTATTATTTTATTTAATAGTAAATATAATGTGAATTATTATTTATATAATAAAATTTATAATTTATTTATTATATATATAAAATTATATTATATTATAAATAATATATATTATAATAATAATTATTATTATATATATAATATGAATTATATA
TATTTTTATATTTATAAATATAATAGTTTAAATAATA
>gi|571136996|ref|XM_006625226.1| Plasmodium chabaudi chabaudi small subunit ribosomal protein 2 (Rps2) (rps2) mRNA, complete cds
ATGTTTATTACATTTAAAGATTTATTAAAATCTAAAATATATATAGGAAATAATTATAAAAATATTTATATTAATAATTATAAATTTATATATAAAATAAAATATAATTATTGTATTTTAAATTTTACATTAATTATATTATATTTATATAAATTATATTTATATATTTATAATATATCTATATTTAATAATAAAATTTTATTTATTATTAATAATAATTTAATTACAAATTTAATTATT
AATATATGTAATTTAACTAATAATTTTTATATTATTA
What I would like to do is make a list of every contig length. My problem is that I do not know the syntax needed to tell Python to:
find the line after the line that starts with ">"
take a count of all of the characters in the lines of that sequence
return each value to a list of all contig lengths (a list that gives the length of every contig, i.e. 126, 300, 25, ...)
make sure the last contig (which has no following ">" to mark its end) is counted.
I would like a list of integers, so that I can calculate things like the mean length of the contigs, standard deviation, cool gene equations etc.
I am relatively new to programming. If I am unclear or further information is needed, please let me know.
Don't reinvent the wheel; use Biopython as Martin has suggested. Here's a start for you that will print each sequence ID and length to the terminal. You can install Biopython with pip, i.e. pip install biopython.
from Bio import SeqIO
import sys

FileIn = sys.argv[1]
handle = open(FileIn, 'rU')
SeqRecords = SeqIO.parse(handle, 'fasta')
for record in SeqRecords:      #loop through each fasta entry
    length = len(record.seq)   #get sequence length
    print "%s: %i bp" % (record.id, length)   #print sequence ID: seq length
Or you could store the results in a dictionary:
handle = open(FileIn, 'rU')
sequence_lengths = {}
SeqRecords = SeqIO.parse(handle, 'fasta')
for record in SeqRecords:      #loop through each fasta entry
    length = len(record.seq)   #get sequence length
    sequence_lengths[record.id] = length

#access dictionary outside of loop
print sequence_lengths
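Once the lengths are in that dictionary, the statistics you mentioned (mean, standard deviation, and so on) are only a few lines more. A small sketch, assuming the sequence_lengths dictionary built above:
lengths = sequence_lengths.values()                 # list of contig lengths (Python 2)
mean_length = sum(lengths) / float(len(lengths))
variance = sum((l - mean_length) ** 2 for l in lengths) / len(lengths)
print "number of contigs: %i" % len(lengths)
print "mean contig length: %.1f bp" % mean_length
print "standard deviation: %.1f bp" % (variance ** 0.5)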
This might work for you: it prints the number of ACGT's in the lines that follow a line that includes >:
import re

with open("input.txt") as input_file:
    data = input_file.read()
data = re.split(r">.*", data)[1:]
data = [sum(1 for ch in datum if ch in 'ACGT') for datum in data]
print(data)
Thanks for all the help. I have looked at the Biopython stuff and am excited to understand it and incorporate it. The overall goal of this assignment was to teach me how to understand Python rather than to find the solution outright, or at least, if I find the solution, I have to be able to explain it in my own words.
Anyway, I have created a code incorporating that element as well as others. I have a few more things to do, and if I am confused, I will return to ask.
Here is my first working code, made and understood outside of working directly with my supervisor or tutorials (woo!):
import re

with open("COPYFORTESTINGplastid.1.rna.fna") as fasta:
    contigs = 0
    for line in fasta:
        if line.strip().startswith('>'):
            contigs = contigs + 1

with open("COPYFORTESTINGplastid.1.rna.fna") as fasta:
    data = fasta.read()
data = re.split(r">.*", data)[1:]
data = [sum(1 for ch in datum if ch in 'ACGT') for datum in data]

print "Total number of contigs: %s" % contigs

total_contigs = sum(data)
N50 = sum(data)/2
print "number used to determine N50 = %s" % N50

total = 0
for n in data:
    total = total + n
mean = total / len(data)
print "mean length of contigs: %s" % mean
print "total nucleotides in fasta = %s" % total_contigs

#print "list of contigs by length: %s" % sorted([data])
l = data
l.sort(reverse=True)
print "list of contigs by length: %s" % l
This does what I want it to do, but if you have any comments or advice, I would love to hear them.
Next up: determining N50 with this sweet sweet list. Thanks again!
I created a function to calculate N50 and it seems to work nicely. I can parse the command line and run any .fa file through the program:
def calc_n50(array):
    array.sort(reverse=True)
    n50 = 0     #running sum of lengths
    n = 0       #the N50 sequence length
    half = sum(array)/2
    for val in array:
        n50 += val
        if n50 >= half:
            n = val
            break   #breaks the loop when the condition is met
    print "N50 is", n
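A quick sanity check with some made-up contig lengths (not the real data):
data = [300, 126, 25, 80, 210]   # hypothetical contig lengths
calc_n50(data)                    # prints: N50 is 210  (300 + 210 >= 741/2)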
This is more about finding the fastest way to do it.
I have a file1 which contains about one million strings (length 6-40), one per line. I want to search for each of them in another file2, which contains about 80,000 strings, and count occurrences (if a small string is found in one string multiple times, the occurrence of that string still counts as 1). If anyone is interested in comparing performance, there is a link to download file1 and file2.
dropbox.com/sh/oj62918p83h8kus/sY2WejWmhu?m
What I am doing now is constructing a dictionary for file2, using the string IDs as keys and the strings as values (because the strings in file2 have duplicate values, only the string ID is unique).
My code is:
for line in file1:
    substring=line[:-1].split("\t")
    for ID in dictionary.keys():
        bigstring=dictionary[ID]
        IDlist=[]
        if bigstring.find(substring)!=-1:
            IDlist.append(ID)
    output.write("%s\t%s\n" % (substring,str(len(IDlist))))
My code will take hours to finish. Can anyone suggest a faster way to do it?
Both file1 and file2 are just around 50 MB, and my PC has 8 GB of memory; you can use as much memory as you need to make it faster. Any method that can finish in one hour is acceptable :)
After trying some of the suggestions from the comments and answers below, here is a performance comparison: first comes the code, then its run time.
Some improvements suggested by Mark Amery and other people:
import sys
from Bio import SeqIO

#first I load the strings in file2 into a dictionary called var_seq
var_seq={}
handle=SeqIO.parse(file2,'fasta')
for record in handle:
    var_seq[record.id]=str(record.seq)

print len(var_seq) #Here it prints out 76827, which is the right number. Loading file2 into var_seq doesn't take long (about 1 second), so you should not focus here to improve performance

output=open(outputfilename,'w')
icount=0
input1=open(file1,'r')
for line in input1:
    icount+=1
    row=line[:-1].split("\t")
    ensp=row[0]     #ensp is just the peptide ID
    peptide=row[1]  #peptide is the substring I want to search for in file2
    num=0
    for ID,bigstring in var_seq.iteritems():
        if peptide in bigstring:
            num+=1
    newline="%s\t%s\t%s\n" % (ensp,peptide,str(num))
    output.write(newline)
    if icount%1000==0:
        break
input1.close()
handle.close()
output.close()
It takes 1m4s to finish, an improvement of 20s over my old version.
#######NEXT METHOD suggested by entropy
from collections import defaultdict

var_seq=defaultdict(int)
handle=SeqIO.parse(file2,'fasta')
for record in handle:
    var_seq[str(record.seq)]+=1

print len(var_seq) # here it prints out 59502; duplicates are removed, but the number of occurrences of each duplicate is stored as the value
handle.close()

output=open(outputfilename,'w')
icount=0
with open(file1) as fd:
    for line in fd:
        icount+=1
        row=line[:-1].split("\t")
        ensp=row[0]
        peptide=row[1]
        num=0
        for varseq,num_occurrences in var_seq.items():
            if peptide in varseq:
                num+=num_occurrences
        newline="%s\t%s\t%s\n" % (ensp,peptide,str(num))
        output.write(newline)
        if icount%1000==0:
            break
output.close()
This one takes 1m10s, which is not faster as I expected, even though it avoids searching duplicates; I don't understand why.
The haystack-and-needle method suggested by Mark Amery turned out to be the fastest. The problem with this method is that the counting result for every substring is 0, which I don't understand yet.
Here is the code in which I implemented his method.
class Node(object):
    def __init__(self):
        self.words = set()
        self.links = {}

base = Node()

def search_haystack_tree(needle):
    current_node = base
    for char in needle:
        try:
            current_node = current_node.links[char]
        except KeyError:
            return 0
    return len(current_node.words)

input1=open(file1,'r')
needles={}
for line in input1:
    row=line[:-1].split("\t")
    needles[row[1]]=row[0]
print len(needles)

handle=SeqIO.parse(file2,'fasta')
haystacks={}
for record in handle:
    haystacks[record.id]=str(record.seq)
print len(haystacks)

for haystack_id, haystack in haystacks.iteritems(): #should be the same as enumerate(list)
    for i in xrange(len(haystack)):
        current_node = base
        for char in haystack[i:]:
            current_node = current_node.links.setdefault(char, Node())
            current_node.words.add(haystack_id)

icount=0
output=open(outputfilename,'w')
for needle in needles:
    icount+=1
    count = search_haystack_tree(needle)
    newline="%s\t%s\t%s\n" % (needles[needle],needle,str(count))
    output.write(newline)
    if icount%1000==0:
        break
input1.close()
handle.close()
output.close()
It takes only 0m11s to finish, which is much faster than the other methods. However, I don't know whether it is a mistake of mine that makes every counting result 0, or whether there is a flaw in Mark's method.
Your code doesn't seem like it works (are you sure you didn't just quote it from memory instead of pasting the actual code?).
For example, this line:
substring=line[:-1].split("\t")
will cause substring to be a list. But later you do:
if bigstring.find(substring)!=-1:
That would raise an error, since you'd be calling str.find() with a list.
In any case, you are building lists uselessly in your innermost loop. This:
IDlist=[]
if bigstring.find(substring)!=-1:
    IDlist.append(ID)
#later
len(IDlist)
That will uselessly allocate and free lists, which will cause memory thrashing as well as bogging everything down.
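For the record, once substring actually is a single string, that whole snippet can be replaced by a plain count with no list at all; a one-line sketch using the names from your code:
count = sum(1 for bigstring in dictionary.values() if substring in bigstring)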
This is code that should work and uses more efficient means to do the counting:
from collections import defaultdict

dictionary = defaultdict(int)
with open(file2) as fd:
    for line in fd:
        for s in line.split("\t"):
            dictionary[s.strip()] += 1

with open(file1) as fd:
    for line in fd:
        for substring in line.split('\t'):
            count = 0
            for bigstring,num_occurrences in dictionary.items():
                if substring in bigstring:
                    count += num_occurrences
            print substring, count
PS: I am assuming that you have multiple words per line that are tab-split because you do line.split("\t") at some point. If that is wrong, it should be easy to revise the code.
PPS: If this ends up being too slow for your use (you'd have to try it, but my guess is it should run in ~10 min given the number of strings you said you have), you'll have to use suffix trees as one of the comments suggested.
Edit: Amended the code so that it handles multiple occurrences of the same string in file2 without negatively affecting performance
Edit 2: Trading maximum space for time.
Below is code that will consume quite a bit of memory and take a while to build the dictionary. However, once that's done, each search out of the million strings to search for should complete in the time it takes for a single hashtable lookup, that is O(1).
Note, I have added some statements to log the time each step of the process takes. You should keep those, so you know which part of the time is taken up by searching. Since you are testing with only 1000 strings, this matters a lot: if 90% of the cost is the build time rather than the search time, then when you test with 1M strings you will still only be building the dictionary once, so it won't matter.
Also note that I have amended my code to parse file1 and file2 as you do, so you should be able to just plug this in and test it:
from Bio import SeqIO
from collections import defaultdict
from datetime import datetime

def all_substrings(s):
    result = set()
    for length in range(1,len(s)+1):
        for offset in range(len(s)-length+1):
            result.add(s[offset:offset+length])
    return result

print "Building dictionary...."
build_start = datetime.now()

dictionary = defaultdict(int)
handle = SeqIO.parse(file2, 'fasta')
for record in handle:
    for sub in all_substrings(str(record.seq).strip()):
        dictionary[sub] += 1

build_end = datetime.now()
print "Dictionary built in: %gs" % (build_end-build_start).total_seconds()

print "Searching...\n"
search_start = datetime.now()

with open(file1) as fd:
    for line in fd:
        substring = line.strip().split("\t")[1]
        count = dictionary[substring]
        print substring, count

search_end = datetime.now()
print "Search done in: %gs" % (search_end-search_start).total_seconds()
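One caveat on memory: all_substrings() stores every substring of every sequence, which grows roughly quadratically with sequence length. Since your search strings are only 6-40 characters long (as stated in the question), you could index only substrings in that length range and shrink the dictionary considerably; a sketch you could swap in for all_substrings in the build loop above:
def bounded_substrings(s, min_len=6, max_len=40):
    # only substrings whose length could actually match one of the search strings
    result = set()
    for length in range(min_len, min(max_len, len(s)) + 1):
        for offset in range(len(s) - length + 1):
            result.add(s[offset:offset + length])
    return result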
I'm not an algorithms whiz, but I reckon this should give you a healthy performance boost. You need to set 'haystacks' to be a list of the big words you want to look in, and 'needles' to be a list of the substrings you're looking for (either can contain duplicates), which I'll let you implement on your end. It'd be great if you could post your list of needles and list of haystacks so that we can easily compare performance of proposed solutions.
haystacks = <some list of words>
needles = <some list of words>

class Node(object):
    def __init__(self):
        self.words = set()
        self.links = {}

base = Node()

for haystack_id, haystack in enumerate(haystacks):
    for i in xrange(len(haystack)):
        current_node = base
        for char in haystack[i:]:
            current_node = current_node.links.setdefault(char, Node())
            current_node.words.add(haystack_id)

def search_haystack_tree(needle):
    current_node = base
    for char in needle:
        try:
            current_node = current_node.links[char]
        except KeyError:
            return 0
    return len(current_node.words)

for needle in needles:
    count = search_haystack_tree(needle)
    print "Count for %s: %i" % (needle, count)
You can probably figure out what's going on by looking at the code, but just to put it in words: I construct a huge tree of substrings of the haystack words, such that given any needle, you can navigate the tree character by character and end up at a node which has attached to it the set of all haystack ids of haystacks containing that substring. Then for each needle we just go through the tree and count the size of the set at the end.
I'm stuck in a script I have to write and can't find a way out...
I have two files with partly overlapping information. Based on the information in one file I have to extract info from the other and save it into multiple new files.
The first is simply a table with IDs and group information (which is used for the splitting).
The other contains the same IDs, but each twice with slightly different information.
What I'm doing:
I create a list of lists with ID and group information, like this:
table = [[ID, group], [ID, group], [ID, group], ...]
Then, because the second file is huge and not sorted in the same way as the first, I want to create a dictionary as an index. In this index, I would like to save the ID and where it can be found inside the file, so I can quickly jump there later. The problem, of course, is that every ID appears twice. My simple solution (though I'm in doubt about it) is adding an -a or -b to the ID:
index = {"ID-a": [FPos, length], "ID-b": [FPOS, length], "ID-a": [FPos, length], ...}
The code for this:
for line in file:
    read = (line.split("\t"))[0]
    if not (read+"-a") in indices:
        index = read + "-a"
        length = len(line)
        indices[index] = [FPos, length]
    else:
        index = read + "-b"
        length = len(line)
        indices[index] = [FPos, length]
    FPos += length
What I am wondering now is if the next step is actually valid (I don't get errors, but I have some doubts about the output files).
for name in table:
    head = name[0]
    ## first round
    (FPos,length) = indices[head+"-a"]
    file.seek(FPos)
    line = file.read(length)
    line = line.rstrip()
    items = line.split("\t")
    output = ["#" + head +" "+ "1:N:0:" +"\n"+ items[9] +"\n"+ "+" +"\n"+ items[10] +"\n"]
    name.append(output)
    ## second round
    (FPos,length) = indices[head+"-b"]
    file.seek(FPos)
    line = file.read(length)
    line = line.rstrip()
    items = line.split("\t")
    output = ["#" + head +" "+ "2:N:0:" +"\n"+ items[9] +"\n"+ "+" +"\n"+ items[10] +"\n"]
    name.append(output)
Is it ok to use a for loop like that?
Is there a better, cleaner way to do this?
Use a defaultdict(list) to save all your file offsets by ID:
from collections import defaultdict

index = defaultdict(list)
for line in file:
    # ...code that loops through the file finding ID lines...
    index[id_value].append((fileposn,length))
The defaultdict will take care of initializing to an empty list on the first occurrence of a given id_value, and then the (fileposn,length) tuple will be appended to it.
This will accumulate all references to each id into the index, whether there are 1, 2, or 20 references. Then you can just search through the given fileposn's for the related data.
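With that index, the lookup side from the question could then become something like this sketch (keeping the question's names and output format; it simply loops over however many (FPos, length) entries were recorded per ID, instead of hard-coding -a and -b):
for name in table:
    head = name[0]
    # index[head] holds one (FPos, length) tuple per occurrence of the ID
    for pair_no, (FPos, length) in enumerate(index[head], 1):
        file.seek(FPos)
        items = file.read(length).rstrip().split("\t")
        output = ["#" + head + " " + "%i:N:0:" % pair_no + "\n" + items[9] + "\n" + "+" + "\n" + items[10] + "\n"]
        name.append(output)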
My code is below. Basically, I've got a CSV file and a text file, "input.txt". I'm trying to create a Python application which will take the input from "input.txt" and search through the CSV file for a match; if a match is found, it should return the first column of the CSV file.
import csv

csv_file = csv.reader(open('some_csv_file.csv', 'r'), delimiter = ",")
header = csv_file.next()
data = list(csv_file)

input_file = open("input.txt", "r")
lines = input_file.readlines()

for row in lines:
    inputs = row.strip().split(" ")
    for input in inputs:
        input = input.lower()
        for row in data:
            if any(input in terms.lower() for terms in row):
                print row[0]
Say my CSV file looks like this:
book title, author
The Rock, Herry Putter
Business Economics, Herry Putter
Yogurt, Daniel Putter
Short Story, Rick Pan
And say my input.txt looks like this:
Herry
Putter
Therefore when I run my program, it prints:
The Rock
Business Economics
The Rock
Business Economics
Yogurt
This is because it searches for all titles with "Herry" first, and then searches all over again for "Putter". So in the end, I have duplicates of the book titles. I'm trying to figure out a way to remove them, so if anyone can help, that would be greatly appreciated.
If the original order does not matter, then stick the results into a set first, and then print them out at the end. But your example is small enough that speed does not matter much.
Stick the results in a set (which is like a list but only contains unique elements), and print at the end.
Something like:
if any(input in terms.lower() for terms in row):
    if not row[0] in my_set:
        my_set.add(row[0])
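In context, that could look something like this (a sketch reusing the lines and data variables from your code; the set has to be created before the loops and printed afterwards):
my_set = set()
for row in lines:
    for input in row.strip().split(" "):
        input = input.lower()
        for data_row in data:
            if any(input in terms.lower() for terms in data_row):
                my_set.add(data_row[0])

for title in my_set:
    print title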
During the search, stick results into a list, and only add a new result after first checking the list to see if it is already there. Then, after the search is done, print the list.
First, get the set of search terms you want to look for in a single list. We use set(...) here to eliminate duplicate search terms:
search_terms = set(open("input.txt", "r").read().lower().split())
Next, iterate over the rows in the data table, selecting each one that matches the search terms. Here, I'm preserving the behavior of the original code, in that we search for the case-normalized search term in any column of each row. If you just wanted to search e.g. the author column, then this would need to be tweaked (there's a sketch of that at the end of this answer):
results = [row for row in data
           if any(search_term in item.lower()
                  for item in row
                  for search_term in search_terms)]
Finally, print the results.
for row in results:
print row[0]
If you wanted, you could also list the authors or any other info in the table. E.g.:
for row in results:
print '%30s (by %s)' % (row[0], row[1])
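And if you only wanted to match against the author column (the second column in your sample CSV), the comprehension from above could be narrowed to something like this sketch:
results = [row for row in data
           if any(search_term in row[1].lower()
                  for search_term in search_terms)]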