Schwartzian sort example in "Text Processing in Python"

Schwartzian sort example in "Text Processing in Python" - python

I was browsing through "Text Processing in Python" and tried its example about Schwartzian sort.
I used following structure for sample data which also contains empty lines. I sorted this data by fifth column:
383230 -49 -78 1 100034 '06 text' 9562 'text' 720 'text' 867
335067 -152 -18 3 100030 'text' 2400 'text' 2342 'text' 696
136592 21 230 3 100035 '03. text' 10368 'text' 1838 'text' 977
Code used for Schwartzian sorting:
for n in range(len(lines)): # Create the transform
lst = string.split(lines[n])
if len(lst) >= 4: # Tuple w/ sort info first
lines[n] = (lst[4], lines[n])
else: # Short lines to end
lines[n] = (['\377'], lines[n])
lines.sort() # Native sort
for n in range(len(lines)): # Restore original lines
lines[n] = lines[n][1]
open('tmp.schwartzian','w').writelines(lines)
I don't get how the author intended that short or empty lines should go to end of file by using this code. Lines are sorted after the if-else structure, thus raising empty lines to top of file. Short lines of course work as supposed with the custom sort (fourth_word function) as implemented in the example.
This is now bugging me, so any ideas? If I'm correct about this then how would you ensure that short lines actually stay at end of file?
EDIT: I noticed the square brackets around '\377'. This messed up sort() so I removed those brackets and output started working.
else: # Short lines to end
lines[n] = (['\377'], lines[n])
print type(lines[n][0])
>>> (type 'list')
I accepted nosklo's answer for good clarification about the meaning of '\377' and for his improved algorithm. Many thanks for the other answers also!
If curious, I used 2 MB sample file which took 0.95 secs with the custom sort and 0.09 with the Schwartzian sort while creating identical output files. It works!

Not directly related to the question, but note that in recent versions of python (since 2.3 or 2.4 I think), the transform and untransform can be performed automatically using the key argument to sort() or sorted(). eg:
def key_func(line):
lst = string.split(line)
if len(lst) >= 4:
return lst[4]
else:
return '\377'
lines.sort(key=key_func)

I don't know what is the question, so I'll try to clarify things in a general way.
This algorithm sorts lines by getting the 4th field and placing it in front of the lines. Then built-in sort() will use this field to sort. Later the original line is restored.
The lines empty or shorter than 5 fields fall into the else part of this structure:
if len(lst) >= 4: # Tuple w/ sort info first
lines[n] = (lst[4], lines[n])
else: # Short lines to end
lines[n] = (['\377'], lines[n])
It adds a ['\377'] into the first field of the list to sort. The algorithm does that in hope that '\377' (the last char in ascii table) will be bigger than any string found in the 5th field. So the original line should go to bottom when doing the sort.
I hope that clarifies the question. If not, perhaps you should indicate exaclty what is it that you want to know.
A better, generic version of the same algorithm:
sort_by_field(list_of_str, field_number, separator=' ', defaultvalue='\xFF')
# decorates each value:
for i, line in enumerate(list_of_str)):
fields = line.split(separator)
try:
# places original line as second item:
list_of_str[i] = (fields[field_number], line)
except IndexError:
list_of_str[i] = (defaultvalue, line)
list_of_str.sort() # sorts list, in place
# undecorates values:
for i, group in enumerate(list_of_str))
list_of_str[i] = group[1] # the second item is original line
The algorithm you provided is equivalent to this one.

An empty line won't pass the test
if len(lst) >= 4:
so it will have ['\377'] as its sort key, not the 5th column of your data, which is lst[4] ( lst[0] is the first column).

Well, it will sort short lines almost at the end, but not quite always.
Actually, both the "naive" and the schwartzian version are flawed (in different ways). Nosklo and wbg already explained the algorithm, and you probably learn more if you try to find the error in the schwartzian version yourself, therefore I will give you only a hint for now:
Long lines that contain certain text
in the fourth column will sort later
than short lines.
Add a comment if you need more help.

Although the used of the Schwartzian transform is pretty outdated for Python it is worth mentioning that you could have written the code this way to avoid the possibility of a line with line[4] starting with \377 being sorted into the wrong place
for n in range(len(lines)):
lst = lines[n].split()
if len(lst)>4:
lines[n] = ((0, lst[4]), lines[n])
else:
lines[n] = ((1,), lines[n])
Since tuples are compared elementwise, the tuples starting with 1 will always be sorted to the bottom.
Also note that the test should be len(list)>4 instead of >=
The same logic applies when using the modern equivalent AKA the key= function
def key_func(line):
lst = line.split()
if len(lst)>4:
return 0, lst[4]
else:
return 1,
lines.sort(key=key_func)

Related

AIO Castle Cavalry - My code is too slow, is there a way I can shorten this?

So I am currently preparing for a competition (Australian Informatics Olympiad) and in the training hub, there is a problem in AIO 2018 intermediate called Castle Cavalry. I finished it:
input = open("cavalryin.txt").read()
output = open("cavalryout.txt", "w")
squad = input.split()
total = squad[0]
squad.remove(squad[0])
squad_sizes = squad.copy()
squad_sizes = list(set(squad))
yn = []
for i in range(len(squad_sizes)):
n = squad.count(squad_sizes[i])
if int(squad_sizes[i]) == 1 and int(n) == int(total):
yn.append(1)
elif int(n) == int(squad_sizes[i]):
yn.append(1)
elif int(n) != int(squad_sizes[i]):
yn.append(2)
ynn = list(set(yn))
if len(ynn) == 1 and int(ynn[0]) == 1:
output.write("YES")
else:
output.write("NO")
output.close()
I submitted this code and I didn't pass because it was too slow, at 1.952secs. The time limit is 1.000 secs. I wasn't sure how I would shorten this, as to me it looks fine. PLEASE keep in mind I am still learning, and I am only an amateur. I started coding only this year, so if the answer is quite obvious, sorry for wasting your time 😅.
Thank you for helping me out!

One performance issue is calling int() over and over on the same entity, or on things that are already int:
if int(squad_sizes[i]) == 1 and int(n) == int(total):
elif int(n) == int(squad_sizes[i]):
elif int(n) != int(squad_sizes[i]):
if len(ynn) == 1 and int(ynn[0]) == 1:
But the real problem is your code doesn't work. And making it faster won't change that. Consider the input:
4
2
2
2
2
Your code will output "NO" (with missing newline) despite it being a valid configuration. This is due to your collapsing the squad sizes using set() early in your code. You've thrown away vital information and are only really testing a subset of the data. For comparison, here's my complete rewrite that I believe handles the input correctly:
with open("cavalryin.txt") as input_file:
string = input_file.read()
total, *squad_sizes = map(int, string.split())
success = True
while squad_sizes:
squad_size = squad_sizes.pop()
for _ in range(1, squad_size):
try:
squad_sizes.remove(squad_size) # eliminate n - 1 others like me
except ValueError:
success = False
break
else: # no break
continue
break
with open("cavalryout.txt", "w") as output_file:
print("YES" if success else "NO", file=output_file)
Note that I convert all the input to int early on so I don't have to consider that issue again. I don't know whether this will meet AIO's timing constraints.

I can see some things in there that might be inefficient, but the best way to optimize code is to profile it: run it with a profiler and sample data.
You can easily waste time trying to speed up parts that don't need it without having much effect. Read up on the cProfile module in the standard library to see how to do this and interpret the output. A profiling tutorial is probably too long to reproduce here.
My suggestions, without profiling,
squad.remove(squad[0])
Removing the start of a big list is slow, because the rest of the list has to be copied as it is shifted down. (Removing the end of the list is faster, because lists are typically backed by arrays that are overallocated (more slots than elements) anyway, to make .append()s fast, so it only has to decrease the length and can keep the same array.
It would be better to set this to a dummy value and remove it when you convert it to a set (sets are backed by hash tables, so removals are fast), e.g.
dummy = object()
squad[0] = dummy # len() didn't change. No shifting required.
...
squad_sizes = set(squad)
squad_sizes.remove(dummy) # Fast lookup by hash code.
Since we know these will all be strings, you can just use None instead of a dummy object, but the above technique works even when your list might contain Nones.
squad_sizes = squad.copy()
This line isn't required; it's just doing extra work. The set() already makes a shallow copy.
n = squad.count(squad_sizes[i])
This line might be the real bottleneck. It's effectively a loop inside a loop, so it basically has to scan the whole list for each outer loop. Consider using collections.Counter for this task instead. You generate the count table once outside the loop, and then just look up the numbers for each string.
You can also avoid generating the set altogether if you do this. Just use the Counter object's keys for your set.
Another point unrelated to performance. It's unpythonic to use indexes like [i] when you don't need them. A for loop can get elements from an iterable and assign them to variables in one step:
from collections import Counter
...
count_table = Counter(squad)
for squad_size, n in count_table.items():
...

You can collect all occurences of the preferred number for each knight in a dictionary.
Then test if the number of knights with a given preferred number is divisible by that number.
with open('cavalryin.txt', 'r') as f:
lines = f.readlines()
# convert to int
list_int = [int(a) for a in lines]
#initialise counting dictionary: key: preferred number, item: empty list to collect all knights with preferred number.
collect_dict = {a:[] for a in range(1,1+max(list_int[1:]))}
print(collect_dict)
# loop though list, ignoring first entry.
for a in list_int[1:]:
collect_dict[a].append(a)
# initialise output
out='YES'
for key, item in collect_dict.items():
# check number of items with preference for number is divisilbe
# by that number
if item: # if list has entries:
if (len(item) % key) > 0:
out='NO'
break
with open('cavalryout.txt', 'w') as f:
f.write(out)

How to do this recursion method to print out every 3 letters from a string on a separate line?

I'm making a method that takes a string, and it outputs parts of the strings on separate line according to a window.
For example:
I want to output every 3 letters of my string on separate line.
Input : "Advantage"
Output:
Adv
ant
age
Input2: "23141515"
Output:
231
141
515
My code:
def print_method(input):
mywindow = 3
start_index = input[0]
if(start_index == input[len(input)-1]):
exit()
print(input[1:mywindow])
printmethod(input[mywindow:])
However I get a runtime error.... Can someone help?

I think this is what you're trying to get. Here's what I changed:
Renamed input to input_str. input is a keyword in Python, so it's not good to use for a variable name.
Added the missing _ in the recursive call to print_method
Print from 0:mywindow instead of 1:mywindow (which would skip the first character). When you start at 0, you can also just say :mywindow to get the same result.
Change the exit statement (was that sys.exit?) to be a return instead (probably what is wanted) and change the if condition to be to return once an empty string is given as the input. The last string printed might not be of length 3; if you want this, you could use instead if len(input_str) < 3: return
def print_method(input_str):
mywindow = 3
if not input_str: # or you could do if len(input_str) == 0
return
print(input_str[:mywindow])
print_method(input_str[mywindow:])

edit sry missed the title: if that is not a learning example for recursion you shouldn't use recursion cause it is less efficient and slices the list more often.
def chunked_print (string,window=3):
for i in range(0,len(string) // window + 1): print(string[i*window:(i+1)*window])
This will work if the window size doesn't divide the string length, but print an empty line if it does. You can modify that according to your needs

How to match fields from two lists and further filter based upon the values in subsequent fields?

EDIT: My question was answered on reddit. Here is the link if anyone is interested in the answer to this problem https://www.reddit.com/r/learnpython/comments/42ibhg/how_to_match_fields_from_two_lists_and_further/
I am attempting to get the pos and alt strings from file1 to match up with what is in
file2, fairly simple. However, file2 has values in the 17th split element/column to the
last element/column (340th) which contains string such as 1/1:1.2.2:51:12 which
I also want to filter for.
I want to extract the rows from file2 that contain/match the pos and alt from file1.
Thereafter, I want to further filter the matched results that only contain certain
values in the 17th split element/column onwards. But to do so the values would have to
be split by ":" so I can filter for split[0] = "1/1" and split[2] > 50. The problem is
I have no idea how to do this.
I imagine I will have to iterate over these and split but I am not sure how to do this
as the code is presently in a loop and the values I want to filter are in columns not rows.
Any advice would be greatly appreciated, I have sat with this problem since Friday and
have yet to find a solution.
import os,itertools,re
file1 = open("file1.txt","r")
file2 = open("file2.txt","r")
matched = []
for (x),(y) in itertools.product(file2,file1):
if not x.startswith("#"):
cells_y = y.split("\t")
pos_y = cells[0]
alt_y = cells[3]
cells_x = x.split("\t")
pos_x = cells_x[0]+":"+cells_x[1]
alt_x = cells_x[4]
if pos_y in pos_x and alt_y in alt_x:
matched.append(x)
for z in matched:
cells_z = z.split("\t")
if cells_z[16:len(cells_z)]:

Your requirement is not clear, but you might mean this:
for (x),(y) in itertools.product(file2,file1):
if x.startswith("#"):
continue
cells_y = y.split("\t")
pos_y = cells[0]
alt_y = cells[3]
cells_x = x.split("\t")
pos_x = cells_x[0]+":"+cells_x[1]
alt_x = cells_x[4]
if pos_y != pos_x: continue
if alt_y != alt_x: continue
extra_match = False
for f in range(17, 341):
y_extra = y[f].split(':')
if y_extra[0] != '1/1': continue
if y_extra[2] <= 50: continue
extra_match = True
break
if not extra_match: continue
xy = x + y
matched.append(xy)
I chose to concatenate x and y into the matched array, since I wasn't sure whether or not you would want all the data. If not, feel free to go back to just appending x or y.

You may want to look into the csv library, which can use tab as a delimiter. You can also use a generator and/or guards to make the code a bit more pythonic and efficient. I think your approach with indexes works pretty well, but it would be easy to break when trying to modify down the road, or to update if your file lines change shape. You may wish to create objects (I use NamedTuples in the last part) to represent your lines and make it much easier to read/refine down the road.
Lastly, remember that Python has a shortcut feature with the comparative 'if'
for example:
if x_evaluation and y_evaluation:
do some stuff
when x_evaluation returns False, Python will skip y_evaluation entirely. In your code, cells_x[0]+":"+cells_x[1] is evaluated every single time you iterate the loop. Instead of storing this value, I wait until the easier alt comparison evaluates to True before doing this (comparatively) heavier/uglier check.
import csv
def filter_matching_alt_and_pos(first_file, second_file):
for x in csv.reader(open(first_file, 'rb'), delimiter='\t'):
for y in csv.reader(open(second_file, 'rb'), delimiter='\t'):
# continue will skip the rest of this loop and go to the next value for y
# this way, we can abort as soon as one value isn't what we want
# .. todo:: we could make a filter function and even use the filter() built-in depending on needs!
if x[3] == y[4] and x[0] == ":".join(y[:1]):
yield x
def match_datestamp_and_alt_and_pos(first_file, second_file):
for z in filter_matching_alt_and_pos(first_file, second_file):
for element in z[16:]:
# I am not sure I fully understood your filter needs for the 2nd half. Here, I split all elements from the 17th onward and look for the two cases you mentioned. This seems like it might be very heavy, but at least we're using generators!
# same idea as before, we abort as early as possible to avoid needless indexing and checks
for chunk in element.split(":"):
# WARNING: if you aren't 100% sure the 2nd element is an int, this is very dangerous
# here, I use the continue keyword and the negative-check to help eliminate excess overhead. The execution is very similar as above, but might be easier to read/understand and can help speed things along in some cases
# once again, I do the lighter check before the heavier one
if not int(chunk[2])> 50:
# continue automatically skips to the next iteration on element
continue
if not chunk[:1] == "1/1":
continue
yield z
if __name__ == '__main__':
first_file = "first.txt"
second_file = "second.txt"
# match_datestamp_and_alt_and_pos returns a generator; for loop through it for the lines which matched all 4 cases
match_datestamp_and_alt_and_pos(first_file=first_file, second_file=second_file)
namedtuples for the first part
from collections import namedtuple
FirstFileElement = namedtuple("FirstFrameElement", "pos unused1 unused2 alt")
SecondFileElement = namedtuple("SecondFrameElement", "pos1 pos2 unused2 unused3 alt")
def filter_matching_alt_and_pos(first_file, second_file):
for x in csv.reader(open(first_file, 'rb'), delimiter='\t'):
for y in csv.reader(open(second_file, 'rb'), delimiter='\t'):
# continue will skip the rest of this loop and go to the next value for y
# this way, we can abort as soon as one value isn't what we want
# .. todo:: we could make a filter function and even use the filter() built-in depending on needs!
x_element = FirstFileElement(*x)
y_element = SecondFileElement(*y)
if x.alt == y.alt and x.pos == ":".join([y.pos1, y.pos2]):
yield x

Python: Get the line and column number of string index?

Say I have a text file I'm operating on. Something like this (hopefully this isn't too unreadable):
data_raw = open('my_data_file.dat').read()
matches = re.findall(my_regex, data_raw, re.MULTILINE)
for match in matches:
try:
parse(data_raw, from_=match.start(), to=match.end())
except Exception:
print("Error parsing data starting on line {}".format(what_do_i_put_here))
raise
Notice in the exception handler there's a certain variable named what_do_i_put_here. My question is: how can I assign to that name so that my script will print the line number that contains the start of the 'bad region' I'm trying to work with? I don't mind re-reading the file, I just don't know what I'd do...

Here's something a bit cleaner, and in my opinion easier to understand than your own answer:
def index_to_coordinates(s, index):
"""Returns (line_number, col) of `index` in `s`."""
if not len(s):
return 1, 1
sp = s[:index+1].splitlines(keepends=True)
return len(sp), len(sp[-1])
It works essentially the same way as your own answer, but by utilizing string slicing splitlines() actually calculates all the information you need for you without the need for any post processing.
Using the keepends=True is necessary to give correct column counts for end of line characters.
The only extra problem is the edge case of an empty string, which can easily be handled by a guard-clause.
I tested it in Python 3.8, but it probably works correctly after about version 3.4 (in some older versions len() counts code units instead of code points, and I assume it would break for any string containing characters outside of the BMP)

I wrote this. It's untested and inefficient but it does help my exception message be a little clearer:
def coords_of_str_index(string, index):
"""Get (line_number, col) of `index` in `string`."""
lines = string.splitlines(True)
curr_pos = 0
for linenum, line in enumerate(lines):
if curr_pos + len(line) > index:
return linenum + 1, index-curr_pos
curr_pos += len(line)
I haven't even tested to see if the column number is vaguely accurate. I failed to abide by YAGNI

Column indexing starts with 0 so you need to extract 1 from len(sp[-1]) at the very end of your code to get the correct column value. Also, I'd perhaps return None (instead of "1.1" - which is also incorrect since it should be "1.0"...) if the lenght of string is 0 or if the string is too short to fit the index.
Otherwise, it's an excellent and elegant solution Tim.
def index_to_coordinates(txt:str, index:int) -> str:
"""Returns 'line.column' of index in 'txt'."""
if not txt or len(txt)-1 < index:
return None
sp = txt[:index+1].splitlines(keepends=True)
return (f"{len(sp)}.{len(sp[-1])-1}")

index a list in a Python for loop

I'm making a for loop within a for loop. I'm looping through a list and finding a specific string that contains a regular expression pattern. Once I find the line, I need to search to find the next line of a certain pattern. I need to store both lines to be able to parse out the time for them. I've created a counter to keep track of the index number of the list as the outer for loop works. Can I use a construction like this to find the second line I need?
index = 0
for lineString in summaryList:
match10secExp = re.search('taking 10 sec. exposure', lineString)
if match10secExp:
startPlate = lineString
for line in summaryList[index:index+10]:
matchExposure = re.search('taking \d\d\d sec. exposure', line)
if matchExposure:
endPlate = line
break
index = index + 1
The code runs, but I'm not getting the result I'm looking for.
Thanks.

matchExposure = re.search('taking \d\d\d sec. exposure', lineString)
should probably be
matchExposure = re.search('taking \d\d\d sec. exposure', line)

Depending on your exact needs, you can just use an iterator on the list, or two of them as mae by itertools.tee. I.e., if you want to search lines following the first pattern only for the second pattern, a single iterator will do:
theiter = iter(thelist)
for aline in theiter:
if re.search(somestart, aline):
for another in theiter:
if re.search(someend, another):
yield aline, another # or print, whatever
break
This will not search lines from aline to the ending another for somestart, only for someend. If you need to search them for both purposes, i.e., leave theiter itself intact for the outer loop, that's where tee can help:
for aline in theiter:
if re.search(somestart, aline):
_, anotheriter = itertools.tee(iter(thelist))
for another in anotheriter:
if re.search(someend, another):
yield aline, another # or print, whatever
break
This is an exception to the general rule about tee which the docs give:
Once tee() has made a split, the
original iterable should not be used
anywhere else; otherwise, the iterable
could get advanced without the tee
objects being informed.
because the advancing of theiter and that of anotheriter occur in disjoint parts of the code, and anotheriter is always rebuilt afresh when needed (so the advancement of theiter in the meantime is not relevant).

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Schwartzian sort example in "Text Processing in Python" - python

An empty line won't pass the test if len(lst) >= 4: so it will have ['\377'] as its sort key, not the 5th column of your data, which is lst[4] ( lst[0] is the first column).

Related

AIO Castle Cavalry - My code is too slow, is there a way I can shorten this?

How to do this recursion method to print out every 3 letters from a string on a separate line?

How to match fields from two lists and further filter based upon the values in subsequent fields?

Python: Get the line and column number of string index?

index a list in a Python for loop

Categories

Resources