Smart filter with python

Hi,
I need to filter out all rows that don't contain symbols from a huge "necessary" list. Example code:
import re

def any_it(iterable):
    for element in iterable:
        if element: return True
    return False

regexp = re.compile(r'fruit=([A-Z]+)')
necessary = ['YELLOW', 'GREEN', 'RED', ...] # huge list of 10 000 members
f = open("huge_file", "r") ## file with > 100 000 lines
lines = f.readlines()
f.close()

## File rows like, let's say:
# 1 djhds fruit=REDSOMETHING sdkjld
# 2 sdhfkjk fruit=GREENORANGE lkjfldk
# 3 dskjldsj fruit=YELLOWDOG sldkfjsdl
# 4 gfhfg fruit=REDSOMETHINGELSE fgdgdfg

filtered = (line for line in lines if any_it(regexp.findall(line)[0].startswith(x) for x in necessary))
I have Python 2.4, so I can't use the built-in any().
This filtering takes a long time. Is there some way to optimize it? For example, rows 1 and 4 contain the "RED.." pattern; once we have found that the "RED.." pattern is ok, can we skip the search through the 10,000-member list for the same pattern in row 4?
Is there some other way to optimize the filtering?
Thank you.
...edited...
UPD: See real example data in the comments to this post. I'm also interested in sorting the result by "fruits". Thanks!
...end edited...

If you organized the necessary list as a trie, then you could look in that trie to check if the fruit starts with a valid prefix. That should be faster than comparing the fruit against every prefix.
For example (only mildly tested):
import bisect
import re

class Node(object):
    def __init__(self):
        self.children = []
        self.children_values = []
        self.exists = False

    # Based on code at http://docs.python.org/library/bisect.html
    def _index_of(self, ch):
        i = bisect.bisect_left(self.children_values, ch)
        if i != len(self.children_values) and self.children_values[i] == ch:
            return (i, self.children[i])
        return (i, None)

    def add(self, value):
        if len(value) == 0:
            self.exists = True
            return
        i, child = self._index_of(value[0])
        if not child:
            child = Node()
            self.children.insert(i, child)
            self.children_values.insert(i, value[0])
        child.add(value[1:])

    def contains_prefix_of(self, value):
        if self.exists:
            return True
        if len(value) == 0:   # ran out of characters before matching a full prefix
            return False
        i, child = self._index_of(value[0])
        if not child:
            return False
        return child.contains_prefix_of(value[1:])

necessary = ['RED', 'GREEN', 'BLUE', 'ORANGE', 'BLACK',
             'LIGHTRED', 'LIGHTGREEN', 'GRAY']

trie = Node()
for value in necessary:
    trie.add(value)

# Find lines that match values in the trie
filtered = []
regexp = re.compile(r'fruit=([A-Z]+)')
for line in open('whatever-file'):
    fruit = regexp.findall(line)[0]
    if trie.contains_prefix_of(fruit):
        filtered.append(line)
This changes your algorithm from O(N * k), where N is the number of elements of necessary and k is the length of fruit, to just O(k) (more or less). It does take more memory though, but that might be a worthwhile trade-off for your case.
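For instance, a quick sanity check of the prefix lookup (hypothetical values, not part of the original answer):

trie = Node()
for value in ['RED', 'LIGHTRED']:
    trie.add(value)
print trie.contains_prefix_of('REDSOMETHING')    # True: 'RED' is a prefix
print trie.contains_prefix_of('GREENORANGE')     # False: no prefix matches
print trie.contains_prefix_of('LIGHTREDDOG')     # True: 'LIGHTRED' is a prefix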

I'm convinced Zach's answer is on the right track. Out of curiosity, I've implemented another version (incorporating Zach's comments about using a dict instead of bisect) and folded it into a solution that matches your example.
#!/usr/bin/env python
import re
from trieMatch import PrefixMatch # https://gist.github.com/736416

pm = PrefixMatch(['YELLOW', 'GREEN', 'RED', ]) # huge list of 10 000 members
# if the list is static, it might be worth pickling "pm" to avoid rebuilding it each time

f = open("huge_file.txt", "r") ## file with > 100 000 lines
lines = f.readlines()
f.close()

regexp = re.compile(r'^.*?fruit=([A-Z]+)')
filtered = (line for line in lines if pm.match(regexp.match(line).group(1)))
For brevity, the implementation of PrefixMatch is published in the gist linked above.
If your list of necessary prefixes is static or changes infrequently, you can speed up subsequent runs by pickling and reusing the PrefixMatch object instead of rebuilding it each time.
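A hedged sketch of that pickling idea (the cache filename is arbitrary; PrefixMatch as in the gist above):

import cPickle
try:
    # reuse a previously built matcher if the cache file exists
    pm = cPickle.load(open('prefixes.pickle', 'rb'))
except IOError:
    pm = PrefixMatch(['YELLOW', 'GREEN', 'RED', ])   # rebuild once
    cPickle.dump(pm, open('prefixes.pickle', 'wb'), cPickle.HIGHEST_PROTOCOL)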
update (on sorted results)
According to the changelog for Python 2.4:
key should be a single-parameter function that takes a list element and returns a comparison key for the element. The list is then sorted using the comparison keys.
also, in the source code, line 1792:
/* Special wrapper to support stable sorting using the decorate-sort-undecorate
   pattern. Holds a key which is used for comparisons and the original record
   which is returned during the undecorate phase. By exposing only the key
   .... */
This means that your regex pattern is only evaluated once for each entry (not once for each comparison), hence it should not be too expensive to do:
sorted_lines = sorted(filtered, key=lambda line: regexp.match(line).group(1))

I personally like your code as is, since you treat "fruit=COLOR" as a pattern, which the others do not. I think you want some solution like memoization, which would enable you to skip the test for an already-solved problem, but I guess that is not the case here.
from itertools import ifilter

def any_it(iterable):
    for element in iterable:
        if element: return True
    return False

necessary = ['YELLOW', 'GREEN', 'RED', ...]
predicate = lambda line: any_it("fruit=" + color in line for color in necessary)
filtered = ifilter(predicate, open("testest"))

Tested (but unbenchmarked) code:
import re
import fileinput

regexp = re.compile(r'^.*?fruit=([A-Z]+)')
necessary = ['YELLOW', 'GREEN', 'RED', ]
filtered = []

for line in fileinput.input(["test.txt"]):
    try:
        key = regexp.match(line).group(1)
    except AttributeError:
        continue # no match
    for p in necessary:
        if key.startswith(p):
            filtered.append(line)
            break

# "filtered" now holds your results
print "".join(filtered)
Differences from the code in the question:
We do not first load the whole file into memory (as happens when you use file.readlines()). Instead, we process each line as the file is read in. I use the fileinput module here for brevity, but one can also use line = file.readline() and a while line: loop.
We stop iterating through the necessary list once a match is found.
We modified the regex pattern and use re.match instead of re.findall. This assumes that each line contains only one "fruit=..." entry.
update
If the format of the input file is consistent, you can squeeze out a little more performance by getting rid of regex altogether.
try:
    # with line = "2 asdasd fruit=SOMETHING asdasd...."
    key = line.split(" ", 3)[2].split("=")[1]
except IndexError:
    continue # no match
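For context, a minimal sketch (my assembly, reusing the answer's own loop above) of the full loop with the regex-free extraction:

filtered = []
for line in fileinput.input(["test.txt"]):
    try:
        # with line = "2 asdasd fruit=SOMETHING asdasd...."
        key = line.split(" ", 3)[2].split("=")[1]
    except IndexError:
        continue # no match
    for p in necessary:
        if key.startswith(p):
            filtered.append(line)
            break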

filtered = []
for line in open('huge_file'):
    found = regexp.findall(line)
    if found:
        fruit = found[0]
        for x in necessary:
            if fruit.startswith(x):
                filtered.append(line)
                break
or maybe:
necessary = ['fruit=%s' % x for x in necessary]

filtered = []
for line in open('huge_file'):
    for x in necessary:
        if x in line:
            filtered.append(line)
            break

I'd make a simple list like ['fruit=RED', 'fruit=GREEN', ...] with ['fruit=' + n for n in necessary], then use in rather than a regex to test them. I don't think there's any way to do it really quickly, though.
filtered = (line for line in f if any(a in line for a in necessary_simple))
(The any() function is doing the same thing as your any_it() function)
Oh, and get rid of file.readlines(), just iterate over the file.
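For example, a minimal sketch of the whole approach (using the question's any_it() helper, since Python 2.4 lacks the built-in any()):

necessary_simple = ['fruit=' + n for n in necessary]
filtered = []
f = open('huge_file', 'r')
for line in f:   # iterates lazily; the file is never fully loaded into memory
    if any_it(a in line for a in necessary_simple):
        filtered.append(line)
f.close()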

Untested code:

filtered = []
for line in lines:
    value = line.split('=', 1)[1].split(' ', 1)[0]
    if value in necessary:   # note: exact match, not prefix; make "necessary" a set for O(1) lookups
        filtered.append(line)

That should be faster than matching 10,000 patterns against each line.
Possibly there are even faster ways. :)

It shouldn't take too long to iterate through 100,000 strings, but I see you have a 10,000-string list, which means you iterate 10,000 * 100,000 = 1,000,000,000 times over the strings, so I don't know what you expected...
As for your question: if you encounter a word from the list and you only need one or more matches (if you want exactly one you need to iterate through the whole list), you can skip the rest, which should speed up the search operation.
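For illustration, a minimal sketch of that early exit (variable names taken from the question):

filtered = []
for line in lines:
    fruit = regexp.findall(line)[0]
    for prefix in necessary:
        if fruit.startswith(prefix):
            filtered.append(line)
            break   # one hit is enough; skip the rest of the 10,000-member list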

Related

Python3 dictionary values being overwritten

I'm having a problem with a dictionary. I'm using Python 3. I'm sure there's something easy that I'm just not seeing.
I'm reading lines from a file to create a dictionary. The first 3 characters of each line are used as keys (they are unique). From there, I create a list from the information in the rest of the line. Each 4 characters make up a member of the list. Once I've created the list, I write to the dictionary with the list being the value and the first three characters of the line being the key.
The problem is, each time I add a new key:value pair to the dictionary, it seems to overlay (or update) the values in the previously written dictionary entries. The keys are fine, just the values are changed. So, in the end, all of the keys have a value equivalent to the list made from the last line in the file.
I hope this is clear. Any thoughts would be greatly appreciated.
A snippet of the code is below
formatDict = dict()
sectionList = list()
for usableLine in formatFileHandle:
    lineLen = len(usableLine)
    section = usableLine[:3]
    x = 3
    sectionList.clear()
    while x < lineLen:
        sectionList.append(usableLine[x:x+4])
        x += 4
    formatDict[section] = sectionList
for k, v in formatDict.items():
    print ("for key= ", k, "value =", v)
formatFileHandle.close()
You always clear, then append to, and then insert the same sectionList; that's why it always overwrites the entries: the program does exactly what you told it to do.
Always remember: In Python assignment never makes a copy!
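A tiny demonstration of that point (my example):

a = [1, 2]
b = a           # no copy; b and a name the same list object
b.append(3)
print(a)        # [1, 2, 3] - the change is visible through "a" as well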
Simple fix
Just insert a copy:
formatDict[section] = sectionList.copy() # changed here
Instead of inserting a reference:
formatDict[section] = sectionList
Complicated fix
There are lots of things going on here, and you could make it "better" by using functions for subtasks like the grouping. Files should also be opened with with so that the file is closed automatically even if an exception occurs, and while loops whose endpoint is known in advance are better written as for loops.
Personally I would use code like this:
def groups(seq, width):
    """Group a sequence (seq) into width-sized blocks. The last block may be shorter."""
    length = len(seq)
    for i in range(0, length, width): # range supports a step argument!
        yield seq[i:i+width]

# Printing the dictionary could be useful in other places as well -> so
# I also created a function for this.
def print_dict_line_by_line(dct):
    """Print dictionary where each key-value pair is on one line."""
    for key, value in dct.items():
        print("for key =", key, "value =", value)

def mytask(filename):
    formatDict = {}
    with open(filename) as formatFileHandle:
        # I don't "strip" each line (remove leading and trailing whitespaces/newlines)
        # but if you need that you could also use:
        #     for usableLine in (line.strip() for line in formatFileHandle):
        # instead.
        for usableLine in formatFileHandle:
            section = usableLine[:3]
            sectionList = list(groups(usableLine[3:], 4))
            formatDict[section] = sectionList
    # upon exiting the "with" scope the file is closed automatically!
    print_dict_line_by_line(formatDict)

if __name__ == '__main__':
    mytask('insert your filename here')
You could simplify your code here by using a with statement to auto close the file and chunk the remainder of the line into groups of four, avoiding the re-use of a single list.
from itertools import islice

with open('somefile') as fin:
    stripped = (line.strip() for line in fin)
    format_dict = {
        line[:3]: list(iter(lambda it=iter(line[3:]): ''.join(islice(it, 4)), ''))
        for line in stripped
    }

for key, value in format_dict.items():
    print('key=', key, 'value=', value)

What's an elegant way for one loop iteration to affect another?

I needed to process a config file just now. Because of the way it was generated, it contains lines like this:
---(more 15%)---
The first step is to strip these unwanted lines out. As a slight twist, each of these lines is followed by a blank line, which I also want to strip. I created a quick Python script to do this:
import sys

skip_next = False
for line in sys.stdin:
    if skip_next:
        skip_next = False
        continue
    if line.startswith('---(more'):
        skip_next = True
        continue
    print line,
Now, this works, but it's more hacky than I'd hoped. The difficulty is that when looping through the lines, we want the content of one line to affect the subsequent line. Hence my question: What's an elegant way for one loop iteration to affect another?
The reason this feels awkward is that you're fundamentally Doing It Wrong. A for loop is supposed to be a sequential iteration over each element of a series. If you're doing something that's calling continue without even looking at the current element, based on something that happened in a previous element of a series, you're breaking that basic abstraction. You're then introducing awkwardness with the extra moving parts required to take care of the square-peg-in-round-hole solution you're setting up.
Instead, try keeping the action close to the condition that causes it. We know that a for loop is just syntactic sugar for a special case of a while loop, so let's use that. Pseudocode, since I'm not familiar with Python's I/O subsystem:
while not sys.stdin.eof: //or whatever
    line = sys.stdin.ReadLine()
    if line.startswith('---(more'):
        sys.stdin.ReadLine() //read the next line and ignore it
        continue
    print line
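For reference, a rough Python 2 rendering of that pseudocode (my sketch, not part of the original answer):

import sys

line = sys.stdin.readline()
while line:
    if line.startswith('---(more'):
        sys.stdin.readline()   # read the next line and ignore it
    else:
        print line,
    line = sys.stdin.readline()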
Another way to do this is with itertools.tee, which allows you to split the iterator into two. You can then pair each line with the line before it, so that each step of the for loop can look at both the previous line and the current line (I prepend an empty string with chain so that the first line has a "previous line" too):

from itertools import chain, izip, tee

in1, in2 = tee(sys.stdin, 2)
# pair each line with its predecessor ('' for the very first line)
pairs = izip(chain([''], in1), in2)
for prevline, line in pairs:
    if line.startswith('---(more') or prevline.startswith('---(more'):
        continue
    print line,
This could also be done as an equivalent generator expression:

from itertools import chain, izip, tee

in1, in2 = tee(sys.stdin, 2)
pairs = izip(chain([''], in1), in2)
res = (line for prevline, line in pairs
       if not line.startswith('---(more') and not prevline.startswith('---(more'))
for line in res:
    print line,
Or you could use filter, which allows you to drop iterator items when a condition is not true:

from itertools import chain, izip, tee

in1, in2 = tee(sys.stdin, 2)
pairs = izip(chain([''], in1), in2)
cond = lambda pair: not pair[0].startswith('---(more') and not pair[1].startswith('---(more')
res = filter(cond, pairs)
for prevline, line in res:
    print line,
If you are willing to go outside the Python standard library, the toolz package makes this even easier. It provides a sliding_window function, which turns an iterator such as a b c d e f into something like (a,b), (b,c), (c,d), (d,e), (e,f). This does basically the same thing as the tee approach above, it just combines the setup into one line (cons prepends the empty "previous line" again):

from toolz.itertoolz import cons, sliding_window

for prevline, line in sliding_window(2, cons('', sys.stdin)):
    if line.startswith('---(more') or prevline.startswith('---(more'):
        continue
    print line,
You could additionally use remove, which is basically the opposite of filter, to drop the matching pairs:

from toolz.itertoolz import cons, remove, sliding_window

pairs = sliding_window(2, cons('', sys.stdin))
cond = lambda pair: pair[0].startswith('---(more') or pair[1].startswith('---(more')
res = remove(cond, pairs)
for prevline, line in res:
    print line,
In this case, we can skip a line by manually advancing the iterator. This results in code that is somewhat similar to Mason Wheeler's solution, but still uses the iteration syntax:

import sys

for line in sys.stdin:
    if line.startswith('---(more'):
        sys.stdin.next()   # read and discard the following blank line
        continue
    print line,

Parsing sequences from a FASTA file in python

I have a text file:
>name_1
data_1
>name_2
data_2
>name_3
data_3
>name_4
data_4
>name_5
data_5
I want to store header (name_1, name_2....) in one list and data (data_1, data_2....) in another list in a Python program.
def parse_fasta_file(fasta):
    desc = []
    seq = []
    seq_strings = fasta.strip().split('>')
    for s in seq_strings:
        if len(s):
            sects = s.split()
            k = sects[0]
            v = ''.join(sects[1:])
            desc.append(k)
            seq.append(v)

for l in sys.stdin:
    data = open('D:\python\input.txt').read().strip()
    parse_fasta_file(data)
    print seq
This is the code I have tried, but I am not able to get the answer.
The most fundamental error is trying to access a variable outside of its scope.
def function (stuff):
    seq = whatever

function('data')
print seq ############ error
You cannot access seq outside of function. The usual way to do this is to have function return a value, and capture it in a variable within the caller.
def function (stuff):
    seq = whatever
    return seq

s = function('data')
print s
(I have deliberately used different variable names inside the function and outside. Inside function you cannot access s or data, and outside, you cannot access stuff or seq. Incidentally, it would be quite okay, but confusing to a beginner, to use a different variable with the same name seq in the mainline code.)
With that out of the way, we can attempt to write a function which returns a list of sequences and a list of descriptions for them.
def parse_fasta (lines):
    descs = []
    seqs = []
    data = ''
    for line in lines:
        if line.startswith('>'):
            if data:   # have collected a sequence, push to seqs
                seqs.append(data)
                data = ''
            descs.append(line[1:])   # Trim '>' from beginning
        else:
            data += line.rstrip('\r\n')
    # there will be yet one more to push when we run out
    seqs.append(data)
    return descs, seqs
This isn't particularly elegant, but should get you started. A better design would be to return a list of (description, data) tuples where the description and its data are closely coupled together.
descriptions, sequences = parse_fasta(open('file', 'r').read().split('\n'))
The sys.stdin loop in your code does not appear to do anything useful.
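For illustration, a minimal sketch of that tuple-based design (the function name is my own):

def parse_fasta_pairs(lines):
    """Return a list of (description, sequence) tuples."""
    pairs = []
    desc = None
    data = ''
    for line in lines:
        if line.startswith('>'):
            if desc is not None:   # push the finished record
                pairs.append((desc, data))
            desc = line[1:].rstrip('\r\n')
            data = ''
        else:
            data += line.rstrip('\r\n')
    if desc is not None:           # push the final record
        pairs.append((desc, data))
    return pairs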

Pythonic way to find any item from a list within a string

The problem is simple enough. I'm writing a function which returns true if the string contains any of the illegal names.
line is a string.
illegal_names is an iterable.
def has_illegal_item(line, illegal_names):
    illegal_items_found = False
    for item in illegal_names:
        if item in line:
            illegal_items_found = True
            break
    return illegal_items_found
So this works fine, but it seems a bit clumsy and long-winded. Can anyone suggest a neater and more pythonic way to write this? Could I use a list comprehension or regex?
any(name in line for name in illegal_names)
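Wrapped into your original function, for example:

def has_illegal_item(line, illegal_names):
    return any(name in line for name in illegal_names)

print(has_illegal_item('this line mentions FOO', ['FOO', 'BAR']))   # True
print(has_illegal_item('this line is clean', ['FOO', 'BAR']))       # False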
Let's have a look at the following sample:

line = "This is my sample line with an illegal item: FOO"
illegal_names = ['FOO', 'BAR']

If you now want to find out whether your line contains one of the illegal names, you can do it with the widespread list comprehension and a length check:

is_correct = not bool(len([hit for hit in illegal_names if hit in line]))
# or
is_correct = len([hit for hit in illegal_names if hit in line]) == 0

Pretty easy, short, and compared to the lambda version, easy to read and understand.
I'd go with @larsmans' solution, but there are other ways of doing it.
def check_names(line, names):
    # names should be a set
    parsed_line = set(line.replace(',', '').split())
    return bool(parsed_line.intersection(names))

This can also be written using a lambda:

In [57]: x = lambda line, names: bool(set(line.replace(",", "").split()).intersection(names))
In [58]: x("python runs in linux", set(["python"]))
Out[58]: True
The lambda version can be overkill in terms of readability.

Which of these is more python-like?

I'm doing some exploring of various languages I hadn't used before, using a simple Perl script as a basis for what I want to accomplish. I have a couple of versions of something, and I'm curious which is the preferred method when using Python -- or if neither is, what is?
Version 1:
workflowname = []
paramname = []
value = []

for line in lines:
    wfn, pn, v = line.split(",")
    workflowname.append(wfn)
    paramname.append(pn)
    value.append(v)
Version 2:
workflowname = []
paramname = []
value = []

i = 0
for line in lines:
    workflowname.append("")
    paramname.append("")
    value.append("")
    workflowname[i], paramname[i], value[i] = line.split(",")
    i = i + 1
Personally, I prefer the second, but, as I said, I'm curious what someone who really knows Python would prefer.
A Pythonic solution might look a bit like @Bogdan's, but using zip and argument unpacking:
workflowname, paramname, value = zip(*[line.split(',') for line in lines])
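To see what the zip(*...) produces (hypothetical input; note that zip yields tuples, so wrap them in list(...) if you need mutable lists):

lines = ['wf1,p1,v1', 'wf2,p2,v2']
workflowname, paramname, value = zip(*[line.split(',') for line in lines])
# workflowname == ('wf1', 'wf2'), paramname == ('p1', 'p2'), value == ('v1', 'v2')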
If you're determined to use a for construct, though, the 1st is better.
Of your two attempts, the 2nd one doesn't make any sense to me. Maybe in other languages it would. So of your two proposed approaches, the 1st one is better.
Still I think the pythonic way would be something like Matt Luongo suggested.
Bogdan's answer is best. In general, if you need a loop counter (which you don't in this case), you should use enumerate instead of incrementing a counter:
for index, value in enumerate(lines):
    # do something with the value and the index
Version 1 is definitely better than version 2 (why put something in a list if you're just going to replace it?) but depending on what you're planning to do later, neither one may be a good idea. Parallel lists are almost never more convenient than lists of objects or tuples, so I'd consider:
# list of (workflow, paramname, value) tuples
items = []
for line in lines:
    items.append(line.split(","))
Or:
class WorkflowItem(object):
    def __init__(self, workflow, paramname, value):
        self.workflow = workflow
        self.paramname = paramname
        self.value = value

# list of objects
items = []
for line in lines:
    items.append(WorkflowItem(*line.split(",")))
(Also, nitpick: 4-space tabs are preferable to 8-space.)
