Hi, I have the following procedure.
Questions:
- How can I make it more elegant, readable, and compact?
- How can I extract the common loops into another method?
Assumptions:
From a given rootDir, the dirs are organized as in the example below.
What the proc does:
If the input is 200, it deletes all dirs that are OLDER than 200 days: not based on mtime, but on the dir structure and dir names. [I will later delete each old dir by brute force, e.g. "rm -Rf".]
e.g. dir structure:
- 2009 (year dirs) [will force-delete these, e.g. "rm -Rf", later]
- 2010
  - 01 ... (month dirs)
  - 05
    - 01 ... (day dirs)
      - many files [I won't check mtime at the file level - takes more time]
    - 31
  - 12
- 2011
- 2012 ...
Code that I have:
def get_dirs_to_remove(dirRoot, olderThanDays):
    # is_int() is a small helper of mine; problemList lives at an outer scope
    today = datetime.datetime.now()
    oldestDayToKeep = today - datetime.timedelta(days=olderThanDays)
    oldKeepYear = int(oldestDayToKeep.year)
    oldKeepMonth = int(oldestDayToKeep.month)
    oldKeepDay = int(oldestDayToKeep.day)
    for yearDir in os.listdir(dirRoot):
        # iterate year dirs
        yrPath = os.path.join(dirRoot, yearDir)
        if not is_int(yearDir):
            problemList.append(yrPath)  # can't convert year to an int, store and report later
            continue
        if int(yearDir) < oldKeepYear:
            print "old Yr dir: " + yrPath
            #deleteList.append(yrPath)  # to be brute-force deleted, e.g. "rm -Rf"
            yield yrPath
            continue
        elif int(yearDir) == oldKeepYear:
            # iterate month dirs
            print "process Yr dir: " + yrPath
            for monthDir in os.listdir(yrPath):
                monthPath = os.path.join(yrPath, monthDir)
                if not is_int(monthDir):
                    problemList.append(monthPath)
                    continue
                if int(monthDir) < oldKeepMonth:
                    print "old month dir: " + monthPath
                    #deleteList.append(monthPath)
                    yield monthPath
                    continue
                elif int(monthDir) == oldKeepMonth:
                    # iterate day dirs
                    print "process Month dir: " + monthPath
                    for dayDir in os.listdir(monthPath):
                        dayPath = os.path.join(monthPath, dayDir)
                        if not is_int(dayDir):
                            problemList.append(dayPath)
                            continue
                        if int(dayDir) < oldKeepDay:
                            print "old day dir: " + dayPath
                            #deleteList.append(dayPath)
                            yield dayPath
                            continue
print [ x for x in get_dirs_to_remove(dirRoot, olderThanDays)]
print "probList" % problemList # how can I get this list also from the same proc?
This actually looks pretty nice, except for the one big thing mentioned in this comment:
print "probList" % problemList # how can I get this list also from the same proc?
It sounds like you're storing problemList in a global variable or something, and you'd like to fix that. Here are a few ways to do this:
Yield both delete paths and problem paths: e.g., yield a tuple where the first member says which kind it is, and the second is the path (see the sketch after this list).
Take the problemList as a parameter. Remember that lists are mutable, so appending to the argument will be visible to the caller.
yield the problemList at the end—which means you need to restructure the way you use the generator, because it's no longer just a simple iterator.
Code the generator as a class instead of a function, and store problemList as a member variable.
Peek at the internal generator information and cram problemList in there, so the caller can retrieve it.
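For instance, here's a minimal sketch of the first option (the classify name, the kind tag, and the isdigit() test are mine, purely for illustration): tag each yielded item, and let the caller split the stream back apart:

def classify(names):
    # Illustrative stand-in for the real directory walk: tag each result
    # so the caller can tell deletions apart from problems.
    for name in names:
        if name.isdigit():
            yield ('delete', name)
        else:
            yield ('problem', name)

deleteList, problemList = [], []
for kind, value in classify(['2009', 'lost+found', '2010']):
    (deleteList if kind == 'delete' else problemList).append(value)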
Meanwhile, there are a few ways you could make the code more compact and readable.
Most trivially:
print [ x for x in get_dirs_to_remove(dirRoot, olderThanDays)]
This list comprehension does exactly what list() does, so you can write it more simply as:
print list(get_dirs_to_remove(dirRoot, olderThanDays))
As for the algorithm itself, you could partition the listdir, and then just use the partitioned lists. You could do it lazily:
yearDirs = os.listdir(dirRoot)
problemList.extend(yearDir for yearDir in yearDirs if not is_int(yearDir))
yield from (yearDir for yearDir in yearDirs
            if is_int(yearDir) and int(yearDir) < oldKeepYear)
for year in (yearDir for yearDir in yearDirs
             if is_int(yearDir) and int(yearDir) == oldKeepYear):
    # next level down
Or strictly:
yearDirs = os.listdir(dirRoot)
problems, older, eq, newer = partitionDirs(yearDirs, oldKeepYear)
problemList.extend(problems)
yield from older
for year in eq:
    # next level down
The latter probably makes more sense, especially given that yearDirs is already a list, and isn't likely to be that big anyway.
Of course you need to write that partitionDirs function—but the nice thing is, you get to use it again in the months and days levels. And it's pretty simple. In fact, I might actually do the partitioning by sorting, because it makes the logic so obvious, even if it's more verbose:
def partitionDirs(dirs, keyvalue):
    problems = [dir for dir in dirs if not is_int(dir)]
    values = sorted(int(dir) for dir in dirs if is_int(dir))
    older, eq, newer = partitionSortedListAt(values, keyvalue)
    return problems, older, eq, newer
If you look around (maybe search "python partition sorted list"?), you can find lots of ways to implement the partitionSortedListAt function, but here's a sketch of something that I think is easy to understand for someone who hasn't thought of the problem this way:
i = bisect.bisect_left(vals, keyvalue)
if i < len(vals) and vals[i] == keyvalue:
    return vals[:i], [vals[i]], vals[i+1:]
else:
    return vals[:i], [], vals[i:]
If you search for "python split predicate" you can also find other ways to implement the initial split—although keep in mind that most people are either concerned with being able to partition arbitrary iterables (which you don't need here), or, rightly or not, worried about efficiency (which you don't care about here either). So, don't look for the answer that someone says is "best"; look at all of the answers, and pick the one that seems most readable to you.
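As one illustration (my own, not taken from any particular answer), a perfectly readable split is just two comprehensions over the same small list, reusing the poster's is_int() helper:

def split_by_predicate(pred, items):
    # Two passes over items, but it's a small listdir() result, so clarity wins.
    return ([x for x in items if pred(x)],
            [x for x in items if not pred(x)])

ints, problems = split_by_predicate(is_int, ['2009', 'junk', '2010'])
# ints == ['2009', '2010'], problems == ['junk']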
Finally, you may notice that you end up with three levels that look almost identical:
yearDirs = os.listdir(dirRoot)
problems, older, eq, newer = partitionDirs(yearDirs, oldKeepYear)
problemList.extend(problems)
yield from older
for year in eq:
    monthDirs = os.listdir(os.path.join(dirRoot, str(year)))
    problems, older, eq, newer = partitionDirs(monthDirs, oldKeepMonth)
    problemList.extend(problems)
    yield from older
    for month in eq:
        dayDirs = os.listdir(os.path.join(dirRoot, str(year), str(month)))
        problems, older, eq, newer = partitionDirs(dayDirs, oldKeepDay)
        problemList.extend(problems)
        yield from older
        # the day equal to the cutoff is kept, matching the original code
You can simplify this further through recursion: pass down the path so far, plus the list of further levels to check, and the three near-identical blocks collapse into one. Whether that's more readable or not depends on how well you manage to encode the information to pass down and the appropriate yield from. Here's a sketch of the idea:
def doLevel(pathSoFar, dateComponentsLeft):
    if not dateComponentsLeft:
        return
    dirs = os.listdir(pathSoFar)
    problems, older, eq, newer = partitionDirs(dirs, dateComponentsLeft[0])
    problemList.extend(problems)
    yield from older
    if eq:
        yield from doLevel(os.path.join(pathSoFar, str(eq[0])), dateComponentsLeft[1:])

yield from doLevel(dirRoot, [oldKeepYear, oldKeepMonth, oldKeepDay])
If you're on an older Python version that doesn't have yield from, the earlier stuff is almost trivial to transform; the recursive version as written will be uglier and more painful. But there's really no way to avoid this when dealing with recursive generators, because a sub-generator cannot "yield through" a calling generator.
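For the flat versions above, that transformation is just the classic expansion of each yield from into an explicit loop, e.g. (wrapped in a throwaway generator name so the fragment stands alone):

def remove_candidates(older):
    # Pre-3.3 equivalent of: yield from older
    for path in older:
        yield path

print(list(remove_candidates(['2009', '2010'])))  # ['2009', '2010']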
I would suggest not using generators unless you are absolutely sure you need them. In this case, you don't need them.
In the below, newer_list isn't strictly needed. While categorizeSubdirs could be made recursive, I don't feel that the increase in complexity is worth the repetition savings (but that's just a personal style issue; I only use recursion when it's unclear how many levels of recursion are needed or the number is fixed but large; three isn't enough IMO).
def categorizeSubdirs(keep_int, base_path):
    older_list = []
    equal_list = []
    newer_list = []
    problem_list = []
    for subdir_str in os.listdir(base_path):
        subdir_path = os.path.join(base_path, subdir_str)
        try:
            subdir_int = int(subdir_str)
        except ValueError:
            problem_list.append(subdir_path)
        else:
            if subdir_int < keep_int:
                older_list.append(subdir_path)
            elif subdir_int > keep_int:
                newer_list.append(subdir_path)
            else:
                equal_list.append(subdir_path)
    # Note that for your case you don't need newer_list,
    # and it's not clear whether you need problem_list
    return older_list, equal_list, newer_list, problem_list
def get_dirs_to_remove(dir_path, olderThanDays):
    oldest_dt = datetime.datetime.now() - datetime.timedelta(days=olderThanDays)
    remove_list = []
    problem_list = []
    olderYear_list, equalYear_list, newerYear_list, problemYear_list = categorizeSubdirs(oldest_dt.year, dir_path)
    remove_list.extend(olderYear_list)
    problem_list.extend(problemYear_list)
    for equalYear_path in equalYear_list:
        olderMonth_list, equalMonth_list, newerMonth_list, problemMonth_list = categorizeSubdirs(oldest_dt.month, equalYear_path)
        remove_list.extend(olderMonth_list)
        problem_list.extend(problemMonth_list)
        for equalMonth_path in equalMonth_list:
            olderDay_list, equalDay_list, newerDay_list, problemDay_list = categorizeSubdirs(oldest_dt.day, equalMonth_path)
            remove_list.extend(olderDay_list)
            problem_list.extend(problemDay_list)
    return remove_list, problem_list
The three nested loops at the end could be made less repetitive at the cost of code complexity. I don't think that it's worth it, though reasonable people can disagree. All else being equal, I prefer simpler code to slightly more clever code; as they say, reading code is harder than writing it, so if you write the most clever code you can, you're not going to be clever enough to read it. :/
So I am currently preparing for a competition (the Australian Informatics Olympiad), and in the training hub there is a problem from AIO 2018 intermediate called Castle Cavalry. I finished it:
input = open("cavalryin.txt").read()
output = open("cavalryout.txt", "w")

squad = input.split()
total = squad[0]
squad.remove(squad[0])
squad_sizes = squad.copy()
squad_sizes = list(set(squad))
yn = []

for i in range(len(squad_sizes)):
    n = squad.count(squad_sizes[i])
    if int(squad_sizes[i]) == 1 and int(n) == int(total):
        yn.append(1)
    elif int(n) == int(squad_sizes[i]):
        yn.append(1)
    elif int(n) != int(squad_sizes[i]):
        yn.append(2)

ynn = list(set(yn))
if len(ynn) == 1 and int(ynn[0]) == 1:
    output.write("YES")
else:
    output.write("NO")
output.close()
I submitted this code and it didn't pass because it was too slow: 1.952 s against a time limit of 1.000 s. I'm not sure how to shorten it; to me it looks fine. PLEASE keep in mind that I am still learning and am only an amateur. I started coding only this year, so if the answer is quite obvious, sorry for wasting your time 😅.
Thank you for helping me out!
One performance issue is calling int() over and over on the same entity, or on things that are already int:
if int(squad_sizes[i]) == 1 and int(n) == int(total):
elif int(n) == int(squad_sizes[i]):
elif int(n) != int(squad_sizes[i]):
if len(ynn) == 1 and int(ynn[0]) == 1:
But the real problem is your code doesn't work. And making it faster won't change that. Consider the input:
4
2
2
2
2
Your code will output "NO" (with a missing newline) despite this being a valid configuration. That is due to your collapsing the squad sizes with set() early in your code: you've thrown away vital information and are only really testing a subset of the data. For comparison, here's my complete rewrite, which I believe handles the input correctly:
with open("cavalryin.txt") as input_file:
string = input_file.read()
total, *squad_sizes = map(int, string.split())
success = True
while squad_sizes:
squad_size = squad_sizes.pop()
for _ in range(1, squad_size):
try:
squad_sizes.remove(squad_size) # eliminate n - 1 others like me
except ValueError:
success = False
break
else: # no break
continue
break
with open("cavalryout.txt", "w") as output_file:
print("YES" if success else "NO", file=output_file)
Note that I convert all the input to int early on so I don't have to consider that issue again. I don't know whether this will meet AIO's timing constraints.
I can see some things in there that might be inefficient, but the best way to optimize code is to profile it: run it with a profiler and sample data.
You can easily waste time trying to speed up parts that don't need it without having much effect. Read up on the cProfile module in the standard library to see how to do this and interpret the output. A profiling tutorial is probably too long to reproduce here.
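For reference, the minimal invocation looks something like this (main here is assumed to be a function you wrap your current top-level code in):

import cProfile

def main():
    pass  # your processing loop goes here

cProfile.run('main()', sort='cumulative')  # most expensive call paths first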
My suggestions, without profiling:
squad.remove(squad[0])
Removing the start of a big list is slow, because the rest of the list has to be copied as it is shifted down. (Removing from the end of the list is faster, because lists are typically backed by arrays that are overallocated (more slots than elements) anyway, to keep .append()s fast; the list only has to decrease its length and can keep the same array.)
It would be better to set this to a dummy value and remove it when you convert it to a set (sets are backed by hash tables, so removals are fast), e.g.
dummy = object()
squad[0] = dummy # len() didn't change. No shifting required.
...
squad_sizes = set(squad)
squad_sizes.remove(dummy) # Fast lookup by hash code.
Since we know these will all be strings, you can just use None instead of a dummy object, but the above technique works even when your list might contain Nones.
squad_sizes = squad.copy()
This line isn't required; it's just doing extra work. The set() already makes a shallow copy.
n = squad.count(squad_sizes[i])
This line might be the real bottleneck. It's effectively a loop inside a loop, so it basically has to scan the whole list for each outer loop. Consider using collections.Counter for this task instead. You generate the count table once outside the loop, and then just look up the numbers for each string.
You can also avoid generating the set altogether if you do this. Just use the Counter object's keys for your set.
Another point unrelated to performance. It's unpythonic to use indexes like [i] when you don't need them. A for loop can get elements from an iterable and assign them to variables in one step:
from collections import Counter
...
count_table = Counter(squad)
for squad_size, n in count_table.items():
    ...
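Putting those pieces together, the reworked loop might look something like this (my assembly of the suggestions above, not a tested solution):

from collections import Counter

squad = ['2', '2', '2', '2']   # stand-in for the sizes read from the file
count_table = Counter(squad)   # one pass; replaces every squad.count() call
for squad_size, n in count_table.items():
    # count_table's keys already serve as the de-duplicated squad_sizes
    print(squad_size, n)       # prints: 2 4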
You can collect all occurrences of the preferred number for each knight in a dictionary.
Then test whether the number of knights with a given preferred number is divisible by that number.
with open('cavalryin.txt', 'r') as f:
    lines = f.readlines()

# convert to int
list_int = [int(a) for a in lines]

# initialise counting dictionary: key: preferred number,
# item: empty list to collect all knights with that preferred number.
collect_dict = {a: [] for a in range(1, 1 + max(list_int[1:]))}
print(collect_dict)

# loop through list, ignoring first entry.
for a in list_int[1:]:
    collect_dict[a].append(a)

# initialise output
out = 'YES'
for key, item in collect_dict.items():
    # check that the number of items with a preference for this number
    # is divisible by that number
    if item:  # if list has entries:
        if (len(item) % key) > 0:
            out = 'NO'
            break

with open('cavalryout.txt', 'w') as f:
    f.write(out)
I'm processing strings using regexes in a bunch of files in a directory. To each line in a file I apply a series of try-statements to match a pattern, and if it matches, I transform the input. After I have analyzed each line, I write it to a new file. I have a lot of these try/except blocks followed by if-statements (I only included two here as an illustration). My issue is that after processing a few files, the script slows down so much that it almost stalls completely. I don't know what in my code is causing the slowdown, but I have a feeling it is the combination of try/except and if-statements. How can I streamline the transformations so that the data is processed at a reasonable speed?
Or is it that I need a more efficient iterator that does not tax memory to the same extent?
Any feedback would be much appreciated!
import re
import glob

fileCounter = 0

for infile in glob.iglob(r'\input-files\*.txt'):
    fileCounter += 1
    outfile = r'\output-files\output_%s.txt' % fileCounter
    with open(infile, "rb") as inList, open(outfile, "wb") as outlist:
        for inline in inList:
            inword = inline.strip('\r\n')

            # apply some text transformations
            # Transformation #1
            try: result = re.match('^[AEIOUYaeiouy]([bcćdfghjklłmnńprsśtwzżź]|rz|sz|cz|dz|dż|dź|ch)[aąeęioóuy](.*\[=\].*)*', inword).group()
            except: result = None
            if result == inword:
                inword = re.sub('(?<=^[AEIOUYaeiouy])(?=([bcćdfghjklłmnńprsśtwzżź]|rz|sz|cz|dz|dż|dź|ch)[aąeęioóuy])', '[=]', inword)

            # Transformation #2 etc.
            try: result = re.match('(.*\[=\].*)*(\w?\w?)[AEIOUYaąeęioóuy]\[=\][ćsśz][ptkbdg][aąeęioóuyrfw](.*\[=\].*)*', inword).group()
            except: result = None
            if result == inword:
                inword = re.sub('(?<=[AEIOUYaąeęioóuy])\[=\](?=[ćsśz][ptkbdg][aąeęioóuyrfw])', '', inword)
                inword = re.sub('(?<=[AEIOUYaąeęioóuy][ćsśz])(?=[ptkbdg][aąeęioóuyrfw])', '[=]', inword)

            outline = inword + "\n"
            outlist.write(outline)

    print "Processed file number %s" % fileCounter

print "*** Processing completed ***"
try/except is indeed not the most efficient way (nor the most readable one) to test the result of re.match(), but the penalty should still be (more or less) constant: the performance should not degrade during execution (unless perhaps some worst case is triggered by your data), so chances are the problem is elsewhere.
FWIW, you can start by replacing your try/except blocks with the appropriate canonical solution, i.e. instead of:
try:
    result = re.match(someexp, yourline).group()
except:
    result = None
you want:
match = re.match(someexp, yourline)
result = match.group() if match else None
This will slightly improve performance but, most importantly, it will make your code more readable and much more maintainable; at least it won't hide any unexpected error.
As a side note, never use a bare except clause, always only catch expected exceptions (here it would have been an AttributeError since re.match() returns None when nothing matched and None has of course no attribute group).
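If you do keep the try/except shape anywhere, catching only the expected exception looks like this (a standalone toy example):

import re

line = "abc"
try:
    result = re.match(r"\d+", line).group()
except AttributeError:  # re.match() returned None, and None has no .group()
    result = None
print(result)  # None: the pattern doesn't match at the start of "abc"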
This will very probably NOT solve your problem but at least you'll then know the issue is elsewhere.
I've been learning Python for a couple of months and wanted to find a cleaner, more efficient way of writing this function. It's just a basic thing I use to look up bus times near me, then display the contents of mtodisplay on an LCD, but I'm not sure about the mtodisplay = mtodisplay + ... line. There must be a better, smarter, more Pythonic way of concatenating a string without resorting to lists (I want to output this string directly to the LCD; saves me time; maybe that's my problem ... I'm taking shortcuts).
Similarly, my method of using countit and thebuslen seems a bit ridiculous! I'd really welcome some advice or pointers on making this better. I just wanna learn!
Thanks
json_string = requests.get(busurl)
the_data = json_string.json()
mtodisplay = '220 buses:\n'
countit = 0

for entry in the_data['departures']:
    for thebuses in the_data['departures'][entry]:
        if thebuses['line'] == '220':
            thebuslen = len(the_data['departures'][entry])
            print 'buslen', thebuslen
            countit += 1
            mtodisplay = mtodisplay + thebuses['expected_departure_time']
            if countit != thebuslen:
                mtodisplay = mtodisplay + ','

return mtodisplay
Concatenating strings like this:
mtodisplay = mtodisplay + thebuses['expected_departure_time']
used to be very inefficient, but for a long time now CPython has reused the string being concatenated to (as long as there are no other references to it), so you get linear performance instead of the old quadratic behavior, which should definitely be avoided.
In this case it looks like you already have a list of items that you want to put commas between, so
','.join(some_list)
is probably more appropriate (and automatically means you don't get an extra comma at the end).
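For example, with some made-up departure times:

>>> ','.join(['08:15', '08:35', '08:55'])
'08:15,08:35,08:55'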
So the next problem is to construct the list (it could also be a generator, etc.). @bgporter shows how to make the list, so I'll show the generator version:
def mtodisplay(busurl):
    json_string = requests.get(busurl)
    the_data = json_string.json()
    for entry in the_data['departures']:
        for thebuses in the_data['departures'][entry]:
            if thebuses['line'] == '220':
                thebuslen = len(the_data['departures'][entry])
                print 'buslen', thebuslen
                yield thebuses['expected_departure_time']

# This is where you would normally just call the function
result = '220 buses:\n' + ','.join(mtodisplay(busurl))
I'm not sure what you mean by 'resorting to lists', but something like this:
json_string = requests.get(busurl)
the_data = json_string.json()
mtodisplay = []

for entry in the_data['departures']:
    for thebuses in the_data['departures'][entry]:
        if thebuses['line'] == '220':
            thebuslen = len(the_data['departures'][entry])
            print 'buslen', thebuslen
            mtodisplay.append(thebuses['expected_departure_time'])

return '220 buses:\n' + ", ".join(mtodisplay)
I'm creating objects derived from a rather large txt file. My code works properly but takes a long time to run. This is because the elements I'm looking for are not ordered and not (necessarily) unique. For example, I am looking for a digit code that might be used twice in the file, but could appear in the first and the last row. My idea was to check how often a certain code is used...
counter = collections.Counter([l[3] for l in self.body])
...and then loop through the counter. The advantage: if a code is only used once, you don't have to iterate over the whole file. However, you are still stuck with a lot of iterations, which makes the process really slow.
So my question really is: how can I improve my code? Another idea, of course, is to order the data first, but that could take quite long as well.
The crucial part is this method:
def get_pc(self):
    counter = collections.Counter([l[3] for l in self.body])
    # This returns something like this: {'187': 2, '199': 1, ...}
    pcode = []
    # loop through entries of counter
    for k, v in counter.iteritems():
        i = 0
        # find post code in body
        for l in self.body:
            if i == v:
                break
            # find first appearance of key
            if l[3] == k:
                # first encounter...
                if i == 0:
                    # ...so create object
                    self.pc = CodeCana(k, l[2])
                    pcode.append(self.pc)
                i += 1
                # make attributes
                self.pc.attr((l[0], l[1]), l[4])
                if v <= 1:
                    break
    return pcode
I hope the code explains the problem sufficiently. If not, let me know and I will expand the provided information.
You are looping over body way too many times. Collapse this into one loop, and track the CodeCana items in a dictionary instead:
def get_pc(self):
    pcs = dict()
    pcode = []
    for l in self.body:
        pc = pcs.get(l[3])
        if pc is None:
            pc = pcs[l[3]] = CodeCana(l[3], l[2])
            pcode.append(pc)
        pc.attr((l[0], l[1]), l[4])
    return pcode
Counting all the items first, then using those counts to limit how often you loop over body, still loops over all the different kinds of items and so defeats the purpose somewhat...
You may want to consider giving the various indices in l names. You can use tuple unpacking:
for foo, bar, baz, egg, ham in self.body:
    pc = pcs.get(egg)
    if pc is None:
        pc = pcs[egg] = CodeCana(egg, baz)
        pcode.append(pc)
    pc.attr((foo, bar), ham)
but building body out of a namedtuple-based class would help in code documentation and debugging even more.
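Here's a quick sketch of the namedtuple idea; the field names are pure guesses on my part, since the question never says what the columns mean (only that l[3] is the code):

from collections import namedtuple

# Hypothetical column names; only 'code' (l[3]) is known from the question.
Row = namedtuple('Row', ['x', 'y', 'city', 'code', 'extra'])

body = [Row('1', '2', 'Ottawa', '187', 'spam')]
for row in body:
    print(row.code, row.city)  # reads far better than l[3], l[2]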
Yesterday I had to parse a very simple binary data file. The rule: look for two bytes in a row that are both 0xAA; the next byte is then a length byte; skip 9 bytes and output the given amount of data from there; repeat to the end of the file.
My solution works and was very quick to put together (even though I am a C programmer at heart, I still think it was quicker for me to write this in Python than it would have been in C), BUT it is clearly not at all Pythonic, and it reads like a C program (and not a very good one at that!).
What would be a better / more Pythonic approach to this? Is a simple FSM like this even still the right choice in Python?
My solution:
#! /usr/bin/python
import sys

f = open(sys.argv[1], "rb")
state = 0

if f:
    for byte in f.read():
        a = ord(byte)
        if state == 0:
            if a == 0xAA:
                state = 1
        elif state == 1:
            if a == 0xAA:
                state = 2
            else:
                state = 0
        elif state == 2:
            count = a
            skip = 9
            state = 3
        elif state == 3:
            skip = skip - 1
            if skip == 0:
                state = 4
        elif state == 4:
            print "%02x" % a
            count = count - 1
            if count == 0:
                state = 0
                print "\r\n"
The coolest way I've seen to implement FSMs in Python has to be via generators and coroutines. See this Charming Python post for an example. Eli Bendersky also has an excellent treatment of the subject.
If coroutines aren't familiar territory, David Beazley's A Curious Course on Coroutines and Concurrency is a stellar introduction.
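To give a flavor of the approach, here's a minimal coroutine sketch of my own (not code from those articles). Each (yield) suspends the machine until the caller sends the next byte, so the states become plain sequential code instead of a state variable:

def machine():
    while True:
        if (yield) != 0xAA:      # hunt for the first sync byte
            continue
        if (yield) != 0xAA:      # ...and the second
            continue
        length = yield           # length byte follows the marker
        for _ in range(9):       # skip 9 bytes
            yield
        record = []
        for _ in range(length):
            record.append((yield))
        print ' '.join('%02x' % b for b in record)

m = machine()
next(m)  # prime the coroutine
for ch in '\xaa\xaa\x02' + '\x00' * 9 + '\x01\x02':
    m.send(ord(ch))  # prints: 01 02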
You could give your states constant names instead of using 0, 1, 2, etc. for improved readability.
You could use a dictionary to map (current_state, input) -> next_state, but that doesn't really let you do any additional processing during the transitions, unless you also include some "transition function" for the extra processing.
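To make that concrete, here's a sketch of just the marker-hunting states as a table (the state names are invented for illustration; the length and skip states need counters, which is exactly the extra processing mentioned above):

HUNT, HALF_SYNC, SYNCED = range(3)

# (current_state, input) -> next_state; anything unlisted falls back to HUNT
transitions = {
    (HUNT, 0xAA): HALF_SYNC,
    (HALF_SYNC, 0xAA): SYNCED,
}

state = HUNT
for byte in (0x01, 0xAA, 0xAA):
    state = transitions.get((state, byte), HUNT)
print state == SYNCED  # True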
Or you could do a non-FSM approach. I think this will work as long as 0xAA 0xAA only appears when it indicates a "start" (doesn't appear in data).
with open(sys.argv[1], 'rb') as f:
    contents = f.read()

for chunk in contents.split('\xaa\xaa')[1:]:
    length = ord(chunk[0])
    data = chunk[10:10 + length]
    print data
If it does appear in data, you can instead use contents.find('\xaa\xaa', start) to scan through the string, setting the start argument to begin looking where the last data block ended. Repeat until it returns -1.
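Sketched out, that variant could look like the following (offsets per the stated format: 2 marker bytes, 1 length byte, 9 skipped; contents as read in the snippet above):

start = 0
while True:
    i = contents.find('\xaa\xaa', start)
    if i == -1:
        break
    length = ord(contents[i + 2])            # length byte after the marker
    data = contents[i + 12:i + 12 + length]  # 2 + 1 + 9 bytes precede the data
    print data.encode('hex')
    start = i + 12 + length                  # resume search after this block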
I am a little apprehensive about telling anyone what's Pythonic, but here goes. First, keep in mind that in Python, functions are just objects. Transitions can be defined with a dictionary that has (input, current_state) as the key and the tuple (next_state, action) as the value. Action is just a function that does whatever is necessary to transition from the current state to the next state.
There's a nice looking example of doing this at http://code.activestate.com/recipes/146262-finite-state-machine-fsm. I haven't used it, but from a quick read it seems like it covers everything.
A similar question was asked/answered here a couple of months ago: Python state-machine design. You might find looking at those responses useful as well.
I think your solution looks fine, except you should replace count = count - 1 with count -= 1.
This is one of those times when fancy code-show-offs will come up with ways of having dicts map states to callables, with a small driver function, but it isn't better, just fancier, and it uses more obscure language features.
I suggest checking out chapter 4 of Text Processing in Python by David Mertz. He implements a state machine class in Python that is very elegant.
I think the most Pythonic way would be like what FogleBird suggested: mapping from (current state, input) to a function that handles the processing and the transition.
You can use regexps. Something like the code below will find the first block of data; then it's just a case of starting the next search from after the previous match.
find_header = re.compile('\xaa\xaa(.).{9}', re.DOTALL)
m = find_header.search(input_text)
if m:
    length = ord(m.group(1))
    data = input_text[m.end():m.end() + length]