Which of these is more Python-like?

I'm doing some exploring of various languages I hadn't used before, using a simple Perl script as a basis for what I want to accomplish. I have a couple of versions of something, and I'm curious which is the preferred method when using Python -- or if neither is, what is?
Version 1:
workflowname = []
paramname = []
value = []
for line in lines:
    wfn, pn, v = line.split(",")
    workflowname.append(wfn)
    paramname.append(pn)
    value.append(v)
Version 2:
workflowname = []
paramname = []
value = []
i = 0
for line in lines:
    workflowname.append("")
    paramname.append("")
    value.append("")
    workflowname[i], paramname[i], value[i] = line.split(",")
    i = i + 1
Personally, I prefer the second, but, as I said, I'm curious what someone who really knows Python would prefer.

A Pythonic solution might look a bit like Bogdan's, but using zip and argument unpacking:
workflowname, paramname, value = zip(*[line.split(',') for line in lines])
If you're determined to use a for construct, though, the 1st is better.
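For illustration, here is how that one-liner behaves on a few made-up sample lines (the data below is invented, not from the question); note that zip hands back tuples, so wrap them in list() if you need mutable lists:
lines = ["wf1,alpha,1", "wf2,beta,2", "wf3,gamma,3"]   # hypothetical sample data
workflowname, paramname, value = zip(*[line.split(',') for line in lines])
print(workflowname)   # ('wf1', 'wf2', 'wf3')
print(paramname)      # ('alpha', 'beta', 'gamma')
print(value)          # ('1', '2', '3')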

Of your two attempts, the second one doesn't make any sense to me. Maybe in other languages it would, but of your two proposed approaches the first one is better.
Still, I think the Pythonic way would be something like what Matt Luongo suggested.

Bogdan's answer is best. In general, if you need a loop counter (which you don't in this case), you should use enumerate instead of incrementing a counter:
for index, value in enumerate(lines):
    # do something with the value and the index

Version 1 is definitely better than version 2 (why put something in a list if you're just going to replace it?) but depending on what you're planning to do later, neither one may be a good idea. Parallel lists are almost never more convenient than lists of objects or tuples, so I'd consider:
# list of (workflow, paramname, value) tuples
items = []
for line in lines:
    items.append(line.split(","))
Or:
class WorkflowItem(object):
    def __init__(self, workflow, paramname, value):
        self.workflow = workflow
        self.paramname = paramname
        self.value = value

# list of objects
items = []
for line in lines:
    items.append(WorkflowItem(*line.split(",")))
(Also, a nitpick: 4-space indentation is preferable to 8-space.)
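In the same spirit (not part of the original answer, just a sketch), collections.namedtuple gives you named-field access without writing the class by hand:
from collections import namedtuple

# Hypothetical equivalent of the WorkflowItem class above
WorkflowItem = namedtuple("WorkflowItem", ["workflow", "paramname", "value"])

items = []
for line in lines:
    items.append(WorkflowItem(*line.split(",")))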


Python3 dictionary values being overwritten

I'm having a problem with a dictionary. I'm using Python 3. I'm sure there's something easy that I'm just not seeing.
I'm reading lines from a file to create a dictionary. The first 3 characters of each line are used as keys (they are unique). From there, I create a list from the information in the rest of the line: every 4 characters make up a member of the list. Once I've created the list, I write to the dictionary, with the list being the value and the first three characters of the line being the key.
The problem is, each time I add a new key:value pair to the dictionary, it seems to overlay (or update) the values in the previously written dictionary entries. The keys are fine, just the values are changed. So, in the end, all of the keys have a value equivalent to the list made from the last line in the file.
I hope this is clear. Any thoughts would be greatly appreciated.
A snippet of the code is below
formatDict = dict()
sectionList = list()
for usableLine in formatFileHandle:
    lineLen = len(usableLine)
    section = usableLine[:3]
    x = 3
    sectionList.clear()
    while x < lineLen:
        sectionList.append(usableLine[x:x+4])
        x += 4
    formatDict[section] = sectionList
for k, v in formatDict.items():
    print("for key= ", k, "value =", v)
formatFileHandle.close()
You always clear, append to, and then insert the same sectionList; that's why every entry gets overwritten: you told the program to store the same list object under every key.
Always remember: In Python assignment never makes a copy!
Simple fix
Just insert a copy:
formatDict[section] = sectionList.copy() # changed here
Instead of inserting a reference:
formatDict[section] = sectionList
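To make the reference semantics concrete, here is a minimal sketch (the names are invented for illustration):
shared = list()
d = dict()
d['a'] = shared           # stores a reference to the list, not a snapshot
d['b'] = shared           # stores the very same reference
shared.append(1)
print(d)                  # {'a': [1], 'b': [1]} -- both values changed together
print(d['a'] is d['b'])   # True: one list object, two references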
Complicated fix
There are lots of things going on here. You could make it "better" by using functions for subtasks like the grouping; files should be opened with a with statement so that the file is closed automatically even if an exception occurs; and while loops whose end is known in advance are better avoided.
Personally I would use code like this:
def groups(seq, width):
    """Group a sequence (seq) into width-sized blocks. The last block may be shorter."""
    length = len(seq)
    for i in range(0, length, width):  # range supports a step argument!
        yield seq[i:i+width]

# Printing the dictionary could be useful in other places as well -> so
# I also created a function for this.
def print_dict_line_by_line(dct):
    """Print dictionary where each key-value pair is on one line."""
    for key, value in dct.items():
        print("for key =", key, "value =", value)

def mytask(filename):
    formatDict = {}
    with open(filename) as formatFileHandle:
        # I don't "strip" each line (remove leading and trailing whitespace/newlines)
        # but if you need that you could also use:
        #     for usableLine in (line.strip() for line in formatFileHandle):
        # instead.
        for usableLine in formatFileHandle:
            section = usableLine[:3]
            sectionList = list(groups(usableLine[3:], 4))  # groups of 4 characters
            formatDict[section] = sectionList
    # upon exiting the "with" block the file is closed automatically!
    print_dict_line_by_line(formatDict)

if __name__ == '__main__':
    mytask('insert your filename here')
You could simplify your code here by using a with statement to auto close the file and chunk the remainder of the line into groups of four, avoiding the re-use of a single list.
from itertools import islice

with open('somefile') as fin:
    stripped = (line.strip() for line in fin)
    format_dict = {
        line[:3]: list(iter(lambda it=iter(line[3:]): ''.join(islice(it, 4)), ''))
        for line in stripped
    }

for key, value in format_dict.items():
    print('key=', key, 'value=', value)
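The iter(callable, sentinel) idiom above is compact but easy to misread; here is a small sketch of what it does to one string (sample data invented):
from itertools import islice

s = "ABCDEFGH"
it = iter(s)
# iter() keeps calling the lambda until it returns '' (the sentinel),
# i.e. until the underlying character iterator is exhausted.
chunks = list(iter(lambda: ''.join(islice(it, 4)), ''))
print(chunks)   # ['ABCD', 'EFGH']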

Creating a list of dictionaries

I have code that generates a list of 28 dictionaries. It cycles through 28 files and links data points from each file in the appropriate dictionary. In order to make my code more flexible, I wanted to use:
tegDics = [dict() for x in range(len(files))]
But when I run the code the first 27 dictionaries are blank and only the last, tegDics[27], has data. Below is the code including the clumsy, yet functional, code I'm having to use that generates the dictionaries:
x = 0
import os
files = os.listdir("DirPath")
os.chdir("DirPath")
tegDics = [{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{}]  # THIS WORKS!!!
#tegDics = [dict() for x in range(len(files))] - THIS WON'T WORK!!!
allRads = []
while x < len(tegDics):  # now builds dictionaries
    for line in open(files[x]):
        z = line.split('\t')
        allRads.append(z[2])
        tegDics[x][z[2]] = z[4]  # pairs catNo with locNo
    x += 1
Does anybody know why the more elegant code doesn't work?
Since you're using x within the list comprehension, it will no longer be zero by the time you reach the while loop - it will be len(files)-1 instead. I suggest changing the variable you use to something else. It's traditional to use a single underscore for a value you don't care about.
tegDics = [dict() for _ in range(len(files))]
It could be useful to eliminate your use of x entirely. It's customary in python to iterate directly over the objects in a sequence, rather than using a counter variable. You might do something like:
for tegDic in tegDics:
    # do stuff with tegDic here
It's slightly trickier in your case, since you want to iterate through tegDics and files at the same time. You can use zip to do that.
import os

files = os.listdir("DirPath")
os.chdir("DirPath")
tegDics = [dict() for _ in range(len(files))]
allRads = []
for file, tegDic in zip(files, tegDics):
    for line in open(file):
        z = line.split('\t')
        allRads.append(z[2])
        tegDic[z[2]] = z[4]  # pairs catNo with locNo
Anyway, there is a seemingly simpler way:
tegDics = [{}]*len(files)
but note that this repeats a reference to one single dictionary len(files) times, so it would reintroduce the original problem; the list comprehension with dict() is the safe choice.
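A quick illustration of the difference (made-up example, not from the original answers):
shared = [{}] * 3                         # three references to ONE dict
shared[0]['k'] = 1
print(shared)                             # [{'k': 1}, {'k': 1}, {'k': 1}]

independent = [dict() for _ in range(3)]  # three separate dicts
independent[0]['k'] = 1
print(independent)                        # [{'k': 1}, {}, {}]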

Custom sort method in Python is not sorting list properly

I'm a student in a Computing class and we have to write a program which contains file handling and a sort. I've got the file handling done and I wrote out my sort (it's a simple sort) but it doesn't sort the list. My code is this:
namelist = []
scorelist = []
hs = open("hst.txt", "r")
namelist = hs.read().splitlines()
hss = open("hstscore.txt", "r")
for line in hss:
    scorelist.append(int(line))
scorelength = len(scorelist)
for i in range(scorelength):
    for j in range(scorelength + 1):
        if scorelist[i] > scorelist[j]:
            temp = scorelist[i]
            scorelist[i] = scorelist[j]
            scorelist[j] = temp
return scorelist
I've not been doing Python for very long so I know the code may not be efficient but I really don't want to use a completely different method for sorting it and we're not allowed to use .sort() or .sorted() since we have to write our own sort function. Is there something I'm doing wrong?
def super_simple_sort(my_list):
    switched = True
    while switched:
        switched = False
        for i in range(len(my_list) - 1):
            if my_list[i] > my_list[i+1]:
                my_list[i], my_list[i+1] = my_list[i+1], my_list[i]
                switched = True

super_simple_sort(some_list)
print(some_list)
This is a very simple sorting implementation that is equivalent to yours but takes advantage of a few things to speed it up: we only need one for loop, we only need to keep repeating while the list is still out of order, and Python doesn't require a temp variable for swapping values.
Since it changes the actual list values in place, you don't even need to return anything.
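Since the question reads names and scores from two parallel files, you may also want the names to follow their scores when sorting. Here is a sketch under the assumption that namelist[i] belongs to scorelist[i] (that pairing is not stated explicitly in the question):
def sort_scores_with_names(scores, names):
    # Same bubble sort as above, but the matching name is swapped alongside each score.
    switched = True
    while switched:
        switched = False
        for i in range(len(scores) - 1):
            if scores[i] > scores[i + 1]:
                scores[i], scores[i + 1] = scores[i + 1], scores[i]
                names[i], names[i + 1] = names[i + 1], names[i]
                switched = True

sort_scores_with_names(scorelist, namelist)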

Is this the fastest way to build dict?

I am reading a list of elements from an XML file and turning the data into two dictionaries.
Is this the fastest way? (I don't think this is the best; you guys always surprise me. ;-)
ADict = {}
BDict = {}
for x in fields:
    key = x.get('key')
    ADict[key] = x.find('A').text
    BDict[key] = x.find('B').text
I think adding them one by one is a bad idea; you could write it in a single line, i.e. in a more Pythonic way, like this:
ADict,BDict = [dict(k) for k in zip(*([(x.get('key'),x.find('A').text),(x.get('key'),x.find('B').text)] for x in fields))]
I don't think it's actually better, for two reasons: first, x.get('key') is called twice; second, it creates too many temporary tuples.
Not tested, but should work
ADict = dict((x.get('key'), x.find('A').text) for x in fields)
BDict = dict((x.get('key'), x.find('B').text) for x in fields)
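On Python 2.7+ or 3, the same idea reads slightly more cleanly as dict comprehensions (equivalent to the generator version above, still two passes over fields):
ADict = {x.get('key'): x.find('A').text for x in fields}
BDict = {x.get('key'): x.find('B').text for x in fields}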

Smart filter with python

Hi,
I need to filter out all rows that don't contain symbols from a huge "necessary" list. Example code:
def any_it(iterable):
    for element in iterable:
        if element: return True
    return False

regexp = re.compile(r'fruit=([A-Z]+)')
necessary = ['YELLOW', 'GREEN', 'RED', ...]  # huge list of 10 000 members
f = open("huge_file", "r")  ## file with > 100 000 lines
lines = f.readlines()
f.close()
## File rows like, let's say:
# 1 djhds fruit=REDSOMETHING sdkjld
# 2 sdhfkjk fruit=GREENORANGE lkjfldk
# 3 dskjldsj fruit=YELLOWDOG sldkfjsdl
# 4 gfhfg fruit=REDSOMETHINGELSE fgdgdfg
filtered = (line for line in lines if any_it(regexp.findall(line)[0].startswith(x) for x in necessary))
I have Python 2.4, so I can't use the built-in any().
I wait a long time for this filtering; is there some way to optimize it? For example, rows 1 and 4 contain the "RED..." pattern; once we have found that the "RED..." pattern is OK, can we skip the search through the 10,000-member list when row 4 has the same pattern?
Is there some other way to optimize the filtering?
Thank you.
...edited...
UPD: See real example data in the comments to this post. I'm also interested in sorting the result by "fruits". Thanks!
...end edited...
If you organized the necessary list as a trie, then you could look in that trie to check if the fruit starts with a valid prefix. That should be faster than comparing the fruit against every prefix.
For example (only mildly tested):
import bisect
import re

class Node(object):
    def __init__(self):
        self.children = []
        self.children_values = []
        self.exists = False

    # Based on code at http://docs.python.org/library/bisect.html
    def _index_of(self, ch):
        i = bisect.bisect_left(self.children_values, ch)
        if i != len(self.children_values) and self.children_values[i] == ch:
            return (i, self.children[i])
        return (i, None)

    def add(self, value):
        if len(value) == 0:
            self.exists = True
            return
        i, child = self._index_of(value[0])
        if not child:
            child = Node()
            self.children.insert(i, child)
            self.children_values.insert(i, value[0])
        child.add(value[1:])

    def contains_prefix_of(self, value):
        if self.exists:
            return True
        if len(value) == 0:  # value exhausted without reaching a stored prefix
            return False
        i, child = self._index_of(value[0])
        if not child:
            return False
        return child.contains_prefix_of(value[1:])

necessary = ['RED', 'GREEN', 'BLUE', 'ORANGE', 'BLACK',
             'LIGHTRED', 'LIGHTGREEN', 'GRAY']

trie = Node()
for value in necessary:
    trie.add(value)

# Find lines that match values in the trie
filtered = []
regexp = re.compile(r'fruit=([A-Z]+)')
for line in open('whatever-file'):
    fruit = regexp.findall(line)[0]
    if trie.contains_prefix_of(fruit):
        filtered.append(line)
This changes your algorithm from O(N * k), where N is the number of elements of necessary and k is the length of fruit, to just O(k) (more or less). It does take more memory though, but that might be a worthwhile trade-off for your case.
I'm convinced Zach's answer is on the right track. Out of curiosity, I've implemented another version (incorporating Zach's comments about using a dict instead of bisect) and folded it into a solution that matches your example.
#!/usr/bin/env python
import re
from trieMatch import PrefixMatch  # https://gist.github.com/736416

pm = PrefixMatch(['YELLOW', 'GREEN', 'RED', ])  # huge list of 10 000 members
# if the list is static, it might be worth pickling "pm" to avoid rebuilding it each time

f = open("huge_file.txt", "r")  ## file with > 100 000 lines
lines = f.readlines()
f.close()

regexp = re.compile(r'^.*?fruit=([A-Z]+)')
filtered = (line for line in lines if pm.match(regexp.match(line).group(1)))
For brevity, the implementation of PrefixMatch is published in the gist linked above.
If your list of necessary prefixes is static or changes infrequently, you can speed up subsequent runs by pickling and reusing the PrefixMatch object instead of rebuilding it each time.
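A minimal sketch of that (untested, and assuming the PrefixMatch object is picklable):
import pickle

# Build the matcher once and save it ...
pm = PrefixMatch(['YELLOW', 'GREEN', 'RED'])
fh = open("prefix_match.pickle", "wb")
pickle.dump(pm, fh)
fh.close()

# ... then on later runs, load it instead of rebuilding
fh = open("prefix_match.pickle", "rb")
pm = pickle.load(fh)
fh.close()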
update (on sorted results)
According to the changelog for Python 2.4: "key should be a single-parameter function that takes a list element and returns a comparison key for the element. The list is then sorted using the comparison keys."
also, in the source code, line 1792:
/* Special wrapper to support stable sorting using the decorate-sort-undecorate
pattern. Holds a key which is used for comparisons and the original record
which is returned during the undecorate phase. By exposing only the key
.... */
This means that your regex pattern is only evaluated once for each entry (not once for each comparison), hence it should not be too expensive to do:
sorted_results = sorted(filtered, key=lambda line: regexp.match(line).group(1))
I personally like your code as it is, since you treat "fruit=COLOR" as a pattern, which the others do not. I think you want some solution like memoization which lets you skip the test for an already-solved case, but I guess that is not applicable here.
from itertools import ifilter

def any_it(iterable):
    for element in iterable:
        if element: return True
    return False

necessary = ['YELLOW', 'GREEN', 'RED', ...]
predicate = lambda line: any_it("fruit=" + color in line for color in necessary)
filtered = ifilter(predicate, open("testest"))
Tested (but unbenchmarked) code:
import re
import fileinput

regexp = re.compile(r'^.*?fruit=([A-Z]+)')
necessary = ['YELLOW', 'GREEN', 'RED', ]

filtered = []
for line in fileinput.input(["test.txt"]):
    try:
        key = regexp.match(line).group(1)
    except AttributeError:
        continue  # no match
    for p in necessary:
        if key.startswith(p):
            filtered.append(line)
            break

# "filtered" now holds your results
print "".join(filtered)
Diff to code in question:
We do not first load the whole file into memory (as happens when you use file.readlines()). Instead, we process each line as the file is read in. I use the fileinput module here for brevity, but one can also use line = file.readline() and a while line: loop (a sketch of that pattern follows this list).
We stop iterating through the necessary list once a match is found.
We modify the regex pattern and use re.match instead of re.findall, assuming that each line contains only one "fruit=..." entry.
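For reference, the readline-based alternative mentioned in the first point would look roughly like this (a sketch, not benchmarked):
f = open("test.txt", "r")
line = f.readline()
while line:
    # ... process the line exactly as in the for-loop version above ...
    line = f.readline()
f.close()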
update
If the format of the input file is consistent, you can squeeze out a little more performance by getting rid of regex altogether.
try:
    # with line = "2 asdasd fruit=SOMETHING asdasd...."
    key = line.split(" ", 3)[2].split("=")[1]
except:
    continue  # no match
filtered = []
for line in open('huge_file'):
    found = regexp.findall(line)
    if found:
        fruit = found[0]
        for x in necessary:
            if fruit.startswith(x):
                filtered.append(line)
                break
or maybe:
necessary = ['fruit=%s' % x for x in necessary]

filtered = []
for line in open('huge_file'):
    for x in necessary:
        if x in line:
            filtered.append(line)
            break
I'd make a simple list like ['fruit=RED', 'fruit=GREEN', ...] etc. with ['fruit='+n for n in necessary], then use in rather than a regex to test them. I don't think there's any way to do it really quickly, though.
filtered = (line for line in f if any(a in line for a in necessary_simple))
(The any() function is doing the same thing as your any_it() function)
Oh, and get rid of file.readlines(), just iterate over the file.
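Putting those two suggestions together, a sketch (untested; on Python 2.4 substitute the any_it() helper from the question for the built-in any()):
necessary_simple = ['fruit=' + n for n in necessary]

f = open("huge_file", "r")
filtered = [line for line in f if any(a in line for a in necessary_simple)]
f.close()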
Untested code:
filtered = []
for line in lines:
    value = line.split('=', 1)[1].split(' ', 1)[0]
    if value not in necessary:
        filtered.append(line)
That should be faster than pattern matching 10 000 patterns onto a line.
Possibly there are even faster ways. :)
It shouldn't take too long to iterate through 100,000 strings, but I see you have a 10,000-string list, which means you iterate 10,000 * 100,000 = 1,000,000,000 times over the strings, so I don't know what you expected...
As for your question: if you encounter a word from the list and you only need one or more matches (if you want exactly one you need to iterate through the whole list), you can skip the rest, which should optimize the search operation.
