I'm comparing 2 files with an initial identifier column, start value, and end value. The second file contains corresponding identifiers and another value column.
Ex.
File 1:
A 200 900
A 1000 1200
B 100 700
B 900 1000
File 2:
A 103
A 200
A 250
B 50
B 100
B 150
I would like to find all values from the second file that are contained within the ranges found in the first file so that my output would look like:
A 200
A 250
B 100
B 150
For now I have created a dictionary from the first file with a list of ranges:
Ex.
if Identifier in Dictionary:
Dictionary[Identifier].extend(range(Start, (End+1)))
else:
Dictionary[Identifier] = range(Start, (End+1))
I then go through the second file and search for the value within the dictionary of ranges:
Ex.
if Identifier in Dictionary:
if Value in Dictionary[Identifier]:
OutFile.write(Line + "\n")
While not optimal this works for relatively small files, however I have several large files and this program is proving terribly inefficient. I need to optimize my program so that it will run much faster.
from collections import defaultdict

ident_ranges = defaultdict(list)

with open('file1.txt', 'r') as f1:
    for row in f1:
        ident, start, end = row.split()
        start, end = int(start), int(end)
        ident_ranges[ident].append((start, end))

with open('file2.txt', 'r') as f2, open('out.txt', 'w') as output:
    for line in f2:
        ident, value = line.split()
        value = int(value)
        if any(start <= value <= end for start, end in ident_ranges[ident]):
            output.write(line)
Notes: Using a defaultdict allows you to add ranges to your dictionary without first checking for the existence of a key. Using any allows for short-circuiting of the range check. Using a chained comparison is a nice Python syntactic shortcut (start <= value <= end).
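A tiny standalone illustration of those last two points, with made-up values:
ranges = [(200, 900), (1000, 1200)]
value = 250
# any() stops at the first tuple whose chained comparison is True
print(any(start <= value <= end for start, end in ranges))  # True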
Do you need to construct range(START, END)? That seems quite wasteful when you can do:
if START <= x <= END:
# process
Checking if the value is in the range is slow because a) you have to construct the list and b) the in check performs a linear search over the list to find it.
You can try something like this:
In [27]: ranges=defaultdict(list)
In [28]: with open("file1") as f:
for line in f:
name,st,end=line.split()
st,end=int(st),int(end)
ranges[name].append([st,end])
....:
In [30]: ranges
Out[30]: defaultdict(<type 'list'>, {'A': [[200, 900], [1000, 1200]], 'B': [[100, 700], [900, 1000]]})
In [29]: with open("file2") as f:
for line in f:
name,val=line.split()
val=int(val)
if any(y[0]<=val<=y[1] for y in ranges[name]):
print name,val
....:
A 200
A 250
B 100
B 150
Neat trick: Python lets you do in comparisons with xrange objects, which is much faster than doing in with a range, and much more memory efficient.
So, you can do
from collections import defaultdict
rangedict = defaultdict(list)
...
rangedict[ident].append(xrange(start, end+1))
...
for r in rangedict[ident]:
    if v in r:
        print >>outfile, line
        break
Since you've got large ranges and your problem is essentially just a bunch of comparisons, it's almost certainly faster to store a start/end tuple than the whole range (especially since what you have now is going to duplicate most of the numbers in the ranges if two happen to overlap).
# Building the dict
if not ident in d:
d[ident] = (lo, hi)
else:
old_lo, old_hi = d[ident]
d[ident] = (min(lo, old_lo), max(hi, old_hi))
Then your comparisons just look like:
# comparing...
if ident in d:
if d[ident][0] <= val <= d[ident][1]:
outfile.write(line+'\n')
Both parts of this will go faster if you aren't making separate checks for if ident in d. Python dictionaries are nice and fast, so just make the call to it in the first place. You've got the ability to provide defaults to the dictionary, so use it. I haven't benchmarked this or anything to see what the speedup is, but you'd certainly get some, and it certainly works:
# These both make use of the following somewhat silly hack:
# In Python 2, None is treated as less than everything (even -float('inf'))
# and empty containers (e.g. (), [], {}) are treated as greater than everything.
# So we use the tuple ((), None) as if it were (float('inf'), float('-inf')).
for line in file1:
ident, lo, hi = line.split()
lo = int(lo)
hi = int(hi)
old_lo, old_hi = d.get(ident, ((), None))
d[ident] = (min(lo, old_lo), max(hi, old_hi))
# comparing:
for line in file2:
ident, val = line.split()
val = int(val)
lo, hi = d.get(ident, ((), None))
if lo <= val <= hi:
outfile.write(line) # unless you stripped it off, this still has a \n
The above code is what I was using to test; it runs on a file2 of a million lines in a couple seconds.
Related
I am trying to extract specific lines as variables from a file.
This is the content of my test.txt:
#first set
Task Identification Number: 210CT1
Task title: Assignment 1
Weight: 25
fullMark: 100
Description: Program and design and complexity running time.
#second set
Task Identification Number: 210CT2
Task title: Assignment 2
Weight: 25
fullMark: 100
Description: Shortest Path Algorithm
#third set
Task Identification Number: 210CT3
Task title: Final Examination
Weight: 50
fullMark: 100
Description: Close Book Examination
This is my code:
with open(home + '\\Desktop\\PADS Assignment\\test.txt', 'r') as mod:
for line in mod:
taskNumber , taskTile , weight, fullMark , desc = line.strip(' ').split(": ")
print(taskNumber)
print(taskTile)
print(weight)
print(fullMark)
print(description)
Here is what I'm trying to do:
taskNumber is 210CT1
taskTitle is Assignment 1
weight is 25
fullMark is 100
desc is Program and design and complexity running time
and loop through to the third set.
But an error occurs in the output:
ValueError: not enough values to unpack (expected 5, got 2)
Response to SwiftsNamesake:
I tried out your code and I am still getting an error:
ValueError: too many values to unpack (expected 5)
This is my attempt using your code:
from itertools import zip_longest
def chunks(iterable, n, fillvalue=None):
args = [iter(iterable)] * n
return zip_longest(*args, fillvalue=fillvalue)
with open(home + '\\Desktop\\PADS Assignment\\210CT.txt', 'r') as mod:
for group in chunks(mod.readlines(), 5+2, fillvalue=''):
# Choose the item after the colon, excluding the extraneous rows
# that don't have one.
# You could probably find a more elegant way of achieving the same thing
l = [item.split(': ')[1].strip() for item in group if ':' in item]
taskNumber , taskTile , weight, fullMark , desc = l
print(taskNumber , taskTile , weight, fullMark , desc, sep='|')
As previously mentioned, you need some sort of chunking. To chunk it usefully we'd also need to ignore the irrelevant lines of the file. I've implemented such a function with some nice Python witchcraft below.
It might also suit you to use a namedtuple to store the values. A namedtuple is a pretty simple type of object that just stores a number of different values - for example, a point in 2D space might be a namedtuple with an x and a y field. This is the example given in the Python documentation. You should refer to that link for more info on namedtuples and their uses, if you wish. I've taken the liberty of making a Task class with the fields ["number", "title", "weight", "fullMark", "desc"].
As your variables are all properties of a task, using a named tuple might make sense in the interest of brevity and clarity.
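As a quick aside, this is essentially the Point example from the documentation mentioned above (purely illustrative, not part of the parser below):
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])
p = Point(3, 4)
print(p.x, p.y)  # prints: 3 4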
Aside from that, I've tried to generally stick to your approach, splitting by the colon. My code produces the output
================================================================================
number is 210CT1
title is Assignment 1
weight is 25
fullMark is 100
desc is Program and design and complexity running time.
================================================================================
number is 210CT2
title is Assignment 2
weight is 25
fullMark is 100
desc is Shortest Path Algorithm
================================================================================
number is 210CT3
title is Final Examination
weight is 50
fullMark is 100
desc is Close Book Examination
which seems to be roughly what you're after - I'm not sure how strict your output requirements are. It should be relatively easy to modify to that end, though.
Here is my code, with some explanatory comments:
from collections import namedtuple
#defines a simple class 'Task' which stores the given properties of a task
Task = namedtuple("Task", ["number", "title", "weight", "fullMark", "desc"])
#chunk a file (or any iterable) into groups of n (as an iterable of n-tuples)
def n_lines(n, read_file):
return zip(*[iter(read_file)] * n)
#used to strip out empty lines and lines beginning with #, as those don't appear to contain any information
def line_is_relevant(line):
return line.strip() and line[0] != '#'
with open("input.txt") as in_file:
#filters the file for relevant lines, and then chunks into 5 lines
for task_lines in n_lines(5, filter(line_is_relevant, in_file)):
#for each line of the task, strip it, split it by the colon and take the second element
#(ie the remainder of the string after the colon), and build a Task from this
task = Task(*(line.strip().split(": ")[1] for line in task_lines))
#just to separate each parsed task
print("=" * 80)
#iterate over the field names and values in the task, and print them
for name, value in task._asdict().items():
print("{} is {}".format(name, value))
You can also reference each field of the Task, like this:
print("The number is {}".format(task.number))
If the namedtuple approach is not desired, feel free to replace the content of the main for loop with
taskNumber, taskTitle, weight, fullMark, desc = (line.strip().split(": ")[1] for line in task_lines)
and then your code will be back to normal.
Some notes on other changes I've made:
filter does what it says on the tin, only iterating over lines that meet the predicate (line_is_relevant(line) is True).
The * in the Task instantiation unpacks the iterator, so each parsed line is an argument to the Task constructor.
The expression (line.strip().split(": ")[1] for line in task_lines) is a generator. This is needed because we're doing multiple lines at once with task_lines, so for each line in our 'chunk' we strip it, split it by the colon and take the second element, which is the value.
The n_lines function works by passing a list of n references to the same iterator to the zip function (documentation). zip then tries to yield the next element from each element of this list, but as each of the n elements is an iterator over the file, zip yields n lines of the file. This continues until the iterator is exhausted.
The line_is_relevant function uses the idea of "truthiness". A more verbose way to implement it might be
def line_is_relevant(line):
return len(line.strip()) > 0 and line[0] != '#'
However, in Python, every object can implicitly be used in boolean logic expressions. An empty string ("") in such an expression acts as False, and a non-empty string acts as True, so conveniently, if line.strip() is empty it will act as False and line_is_relevant will therefore be False. The and operator will also short-circuit if the first operand is falsy, which means the second operand won't be evaluated and therefore, conveniently, the reference to line[0] will not cause an IndexError.
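A couple of throwaway lines in the interpreter show both behaviours (illustration only):
>>> bool(""), bool("abc")
(False, True)
>>> line = ""
>>> line.strip() and line[0] != '#'   # left side is '', so line[0] is never evaluated
''
>>> line[0]                           # evaluating it directly would fail
Traceback (most recent call last):
  ...
IndexError: string index out of range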
Ok, here's my attempt at a more extended explanation of the n_lines function:
Firstly, the zip function lets you iterate over more than one 'iterable' at once. An iterable is something like a list or a file, that you can go over in a for loop, so the zip function can let you do something like this:
>>> for i in zip(["foo", "bar", "baz"], [1, 4, 9]):
... print(i)
...
('foo', 1)
('bar', 4)
('baz', 9)
The zip function returns a 'tuple' of one element from each list at a time. A tuple is basically a list, except that it's immutable - you can't change it, since zip isn't expecting you to change any of the values it gives you, just to do something with them. A tuple can be used pretty much like a normal list apart from that. Now a useful trick here is using 'unpacking' to separate each of the bits of the tuple, like this:
>>> for a, b in zip(["foo", "bar", "baz"], [1, 4, 9]):
... print("a is {} and b is {}".format(a, b))
...
a is foo and b is 1
a is bar and b is 4
a is baz and b is 9
A simpler unpacking example, which you may have seen before (Python also lets you omit the parentheses () here):
>>> a, b = (1, 2)
>>> a
1
>>> b
2
(The n_lines function doesn't use this, though.) Now zip can also work with more than two arguments - you can zip three, four, or (pretty much) as many lists as you like.
>>> for i in zip([1, 2, 3], [0.5, -2, 9], ["cat", "dog", "apple"], "ABC"):
... print(i)
...
(1, 0.5, 'cat', 'A')
(2, -2, 'dog', 'B')
(3, 9, 'apple', 'C')
Now the n_lines function passes *[iter(read_file)] * n to zip. There are a couple of things to cover here - I'll start with the second part. Note that the first * has lower precedence than everything after it, so it is equivalent to *([iter(read_file)] * n). Now, what iter(read_file) does, is constructs an iterator object from read_file by calling iter on it. An iterator is kind of like a list, except you can't index it, like it[0]. All you can do is 'iterate over it', like going over it in a for loop. It then builds a list of length 1 with this iterator as its only element. It then 'multiplies' this list by n.
In Python, using the * operator with a list concatenates it to itself n times. If you think about it, this kind of makes sense as + is the concatenation operator. So, for example,
>>> [1, 2, 3] * 3 == [1, 2, 3] + [1, 2, 3] + [1, 2, 3] == [1, 2, 3, 1, 2, 3, 1, 2, 3]
True
By the way, this uses Python's chained comparison operators - a == b == c is equivalent to a == b and b == c, except b only has to be evaluated once, which shouldn't matter 99% of the time.
Anyway, we now know that the * operator copies a list n times. It also has one more property - it doesn't build any new objects. This can be a bit of a gotcha -
>>> l = [object()] * 3
>>> id(l[0])
139954667810976
>>> id(l[1])
139954667810976
>>> id(l[2])
139954667810976
Here l has three elements - but they're all in reality the same object (you might think of this as three 'pointers' to the same object). If you were to build a list of more complex objects, such as lists, and perform an in-place operation like sorting them, it would affect all elements of the list.
>>> l = [ [3, 2, 1] ] * 4
>>> l
[[3, 2, 1], [3, 2, 1], [3, 2, 1], [3, 2, 1]]
>>> l[0].sort()
>>> l
[[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]]
So [iter(read_file)] * n is equivalent to
it = iter(read_file)
l = [it, it, it, it... n times]
Now the very first *, the one with the low precedence, 'unpacks' this, again, but this time doesn't assign it to a variable, but to the arguments of zip. This means zip receives each element of the list as a separate argument, instead of just one argument that is the list. Here is an example of how unpacking works in a simpler case:
>>> def f(a, b):
... print(a + b)
...
>>> f([1, 2]) #doesn't work
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: f() missing 1 required positional argument: 'b'
>>> f(*[1, 2]) #works just like f(1, 2)
3
So in effect, now we have something like
it = iter(read_file)
return zip(it, it, it... n times)
Remember that when you 'iterate' over a file object in a for loop, you iterate over each line of the file, so when zip tries to 'go over' each of the n objects at once, it draws one line from each object - but because each object is the same iterator, this line is 'consumed' and the next line it draws is the next line from the file. One 'round' of iteration from each of its n arguments yields n lines, which is what we want.
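Here is the same trick in miniature, on a plain list instead of a file:
>>> lines = ["a\n", "b\n", "c\n", "d\n", "e\n", "f\n"]
>>> it = iter(lines)
>>> list(zip(it, it, it))
[('a\n', 'b\n', 'c\n'), ('d\n', 'e\n', 'f\n')]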
Your line variable gets only Task Identification Number: 210CT1 as its first input. You're trying to extract 5 values from it by splitting it by :, but there are only 2 values there.
What you want is to read each set as a group of 5 lines and then split each of those lines by : individually, as sketched below.
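A minimal sketch of that idea (it assumes every set is exactly five "key: value" lines, that the # header lines and any blank lines can simply be skipped, and that you adjust the path to match your own code):
with open('test.txt', 'r') as mod:
    fields = []
    for line in mod:
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip the '#...' headers and blank separator lines
        fields.append(line.split(': ', 1)[1])
        if len(fields) == 5:
            taskNumber, taskTitle, weight, fullMark, desc = fields
            print(taskNumber, taskTitle, weight, fullMark, desc)
            fields = []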
The problem here is that you are splitting the lines by : and each line contains only one :, so there are only 2 values.
In this line:
taskNumber , taskTile , weight, fullMark , desc = line.strip(' ').split(": ")
you are telling it that there are 5 values but it only finds 2 so it gives you an error.
One way to fix this is to handle each field separately, since you are not allowed to change the format of the file. I would use the first word of each line to sort the data into different lists:
import re
Identification=[]
title=[]
weight=[]
fullmark=[]
Description=[]
with open(home + '\\Desktop\\PADS Assignment\\test.txt', 'r') as mod:
for line in mod:
list_of_line=re.findall(r'\w+', line)
if len(list_of_line)==0:
pass
else:
if list_of_line[0]=='Task':
if list_of_line[1]=='Identification':
Identification.append(line[28:-1])
if list_of_line[1]=='title':
title.append(line[12:-1])
if list_of_line[0]=='Weight':
weight.append(line[8:-1])
if list_of_line[0]=='fullMark':
fullmark.append(line[10:-1])
if list_of_line[0]=='Description':
Description.append(line[13:-1])
print('taskNumber is %s' % Identification[0])
print('taskTitle is %s' % title[0])
print('Weight is %s' % weight[0])
print('fullMark is %s' %fullmark[0])
print('desc is %s' %Description[0])
print('\n')
print('taskNumber is %s' % Identification[1])
print('taskTitle is %s' % title[1])
print('Weight is %s' % weight[1])
print('fullMark is %s' %fullmark[1])
print('desc is %s' %Description[1])
print('\n')
print('taskNumber is %s' % Identification[2])
print('taskTitle is %s' % title[2])
print('Weight is %s' % weight[2])
print('fullMark is %s' %fullmark[2])
print('desc is %s' %Description[2])
print('\n')
Of course you could use a loop for the prints, but I was too lazy, so I copied and pasted :).
If you need any help or have any questions, please ask!
This code assumes that you are not that advanced in coding.
Good luck!
As another poster (#Cuber) has already stated, you're looping over the lines one-by-one, whereas the data sets are split across five lines. The error message is essentially stating that you're trying to unpack five values when all you have is two. Furthermore, it looks like you're only interested in the value on the right hand side of the colon, so you really only have one value.
There are multiple ways of resolving this issue, but the simplest is probably to group the data into fives (plus the padding, making it seven) and process it in one go.
First we'll define chunks, with which we'll turn this somewhat fiddly process into one elegant loop (from the itertools docs).
from itertools import zip_longest
def chunks(iterable, n, fillvalue=None):
args = [iter(iterable)] * n
return zip_longest(*args, fillvalue=fillvalue)
Now, we'll use it with your data. I've omitted the file boilerplate.
for group in chunks(mod.readlines(), 5+2, fillvalue=''):
# Choose the item after the colon, excluding the extraneous rows
# that don't have one.
# You could probably find a more elegant way of achieving the same thing
l = [item.split(': ')[1].strip() for item in group if ':' in item]
taskNumber , taskTile , weight, fullMark , desc = l
print(taskNumber , taskTile , weight, fullMark , desc, sep='|')
The 2 in 5+2 is for the padding (the comment above and the empty line below).
The implementation of chunks may not make sense to you at the moment. If so, I'd suggest looking into Python generators (and the itertools documentation in particular, which is a marvellous resource). It's also a good idea to get your hands dirty and tinker with snippets inside the Python REPL.
You can still read in lines one by one, but you will have to help the code understand what it's parsing. We can use an OrderedDict to lookup the appropriate variable name.
import os
import collections as ct
def printer(dict_, lookup):
for k, v in lookup.items():
print("{} is {}".format(v, dict_[k]))
print()
names = ct.OrderedDict([
("Task Identification Number", "taskNumber"),
("Task title", "taskTitle"),
("Weight", "weight"),
("fullMark","fullMark"),
("Description", "desc"),
])
filepath = home + '\\Desktop\\PADS Assignment\\test.txt'
with open(filepath, "r") as f:
for line in f.readlines():
line = line.strip()
if line.startswith("#"):
header = line
d = {}
continue
elif line:
k, v = line.split(":")
d[k] = v.strip(" ")
else:
printer(d, names)
printer(d, names)
Output
taskNumber is 210CT3
taskTitle is Final Examination
weight is 50
fullMark is 100
desc is Close Book Examination
taskNumber is 210CT1
taskTitle is Assignment 1
weight is 25
fullMark is 100
desc is Program and design and complexity running time.
taskNumber is 210CT2
taskTitle is Assignment 2
weight is 25
fullMark is 100
desc is Shortest Path Algorithm
You're trying to get more data than is present on one line; the five pieces of data are on separate lines.
As SwiftsNamesake suggested, you can use itertools to group the lines:
import itertools
def keyfunc(line):
# Ignores comments in the data file.
if len(line) > 0 and line[0] == "#":
return True
# The separator is an empty line between the data sets, so it returns
# true when it finds this line.
return line == "\n"
with open(home + '\\Desktop\\PADS Assignment\\test.txt', 'r') as mod:
for k, g in itertools.groupby(mod, keyfunc):
if not k: # Does not process lines that are separators.
for line in g:
data = line.strip().partition(": ")
print(f"{data[0] is {data[2]}")
# print(data[0] + " is " + data[2]) # If python < 3.6
print("") # Prints a newline to separate groups at the end of each group.
If you want to use the data in other functions, output it as a dictionary from a generator:
from collections import OrderedDict
import itertools
def isSeparator(line):
# Ignores comments in the data file.
if len(line) > 0 and line[0] == "#":
return True
# The separator is an empty line between the data sets, so it returns
# true when it finds this line.
return line == "\n"
def parseData(data):
for line in data:
k, s, v = line.strip().partition(": ")
yield k, v
def readData(filePath):
with open(filePath, "r") as mod:
for key, g in itertools.groupby(mod, isSeparator):
if not key: # Does not process lines that are separators.
yield OrderedDict((k, v) for k, v in parseData(g))
def printData(data):
for d in data:
for k, v in d.items():
print(f"{k} is {v}")
# print(k + " is " + v) # If python < 3.6
print("") # Prints a newline to separate groups at the end of each group.
data = readData(home + '\\Desktop\\PADS Assignment\\test.txt')
printData(data)
Inspired by itertools-related solutions, here is another using the more_itertools.grouper tool from the more-itertools library. It behaves similarly to #SwiftsNamesake's chunks function.
import collections as ct
import more_itertools as mit
names = dict([
("Task Identification Number", "taskNumber"),
("Task title", "taskTitle"),
("Weight", "weight"),
("fullMark","fullMark"),
("Description", "desc"),
])
filepath = home + '\\Desktop\\PADS Assignment\\test.txt'
with open(filepath, "r") as f:
lines = (line.strip() for line in f.readlines())
for group in mit.grouper(7, lines):
for line in group[1:]:
if not line: continue
k, v = line.split(":")
print("{} is {}".format(names[k], v.strip()))
print()
Output
taskNumber is 210CT1
taskTitle is Assignment 1
weight is 25
fullMark is 100
desc is Program and design and complexity running time.
taskNumber is 210CT2
taskTitle is Assignment 2
weight is 25
fullMark is 100
desc is Shortest Path Algorithm
taskNumber is 210CT3
taskTitle is Final Examination
weight is 50
fullMark is 100
desc is Close Book Examination
Care was taken to print the variable name with the corresponding value.
I have a target file called TARGFILE of the form:
10001000020002002001100100200000111
10201001020000120210101100110010011
02010010200000011100012021001012021
00102000012001202100101202100111010
My idea here was to leave this as a string, and use slicing in python to remove the indices.
The removal will occur based on a list of integers called INDICES like so:
[1, 115654, 115655, 115656, 2, 4, 134765, 134766, 18, 20, 21, 23, 24, 17659, 92573, 30, 32, 88932, 33, 35, 37, 110463, 38, 18282, 46, 18458, 48, 51, 54]
I want to remove every position of every line in TARGFILE that matches with INDICES. For instance, the first digit in INDICES is 1, so the first column of TARGFILE containing 1,1,0,0 would be removed. However, I am wary of doing this incorrectly due to off-by-one errors and changing index positions if everything is not removed at the same time.
Thus, a solution that removed every column from each row at the same time would likely be both much faster and safer than using a nested loop, but I am unsure of how to code this.
My code so far is here:
#!/usr/bin/env python
import fileinput
SRC_FILES=open('YCP.txt', 'r')
for line in SRC_FILES:
EUR_YRI_ADM=line.strip('\n')
EUR,YRI,ADM=EUR_YRI_ADM.split(' ')
ADMFO=open(ADM, 'r')
lines=ADMFO.readlines()
INDICES=[int(val) for val in lines[0].split()]
TARGFILE=open(EUR, 'r')
It seems to me that a solution using enumerate might be possible, but I have not found it, and that might be suboptimal in the first place...
EDIT: in response to concerns about memory: the longest lines are ~180,000 items, but I should be able to get this into memory without a problem, I have access to a cluster.
I like the simplicity of Peter's answer, even though it's currently off-by-one. My thought is that you can get rid of the index-shifting problem, by sorting INDICES, and doing the process from the back to the front. That led to remove_indices1, which is really inefficient. I think 2 is better, but simplest is 3, which is Peter's answer.
I may do timing in a bit for some large numbers, but my intuition says that my remove_indices2 will be faster than Peter's remove_indices3 if INDICES is very sparse. (Because you don't have to iterate over each character, but only over the indices that are being deleted.)
BTW - If you can sort INDICES once, then you don't need to make the local copy to sort/reverse, but I didn't know if you could do that.
rows = [
'0000000001111111111222222222233333333334444444444555555555566666666667',
'1234567890123456789012345678901234567890123456789012345678901234567890',
]
def remove_nth_character(row,n):
return row[:n-1] + row[n:]
def remove_indices1(row,indices):
local_indices = indices[:]
retval = row
local_indices.sort()
local_indices.reverse()
for i in local_indices:
retval = remove_nth_character(retval,i)
return retval
def remove_indices2(row,indices):
local_indices = indices[:]
local_indices.sort()
local_indices.reverse()
front = row
chunks = []
for i in local_indices:
chunks.insert(0,front[i:])
front = front[:i-1]
chunks.insert(0,front)
return "".join(chunks)
def remove_indices3(row,indices):
return ''.join(c for i,c in enumerate(row) if i+1 not in indices)
indices = [1,11,4,54,33,20,7]
for row in rows:
print remove_indices1(row,indices)
print ""
for row in rows:
print remove_indices2(row,indices)
print ""
for row in rows:
print remove_indices3(row,indices)
EDIT: Adding timing info, plus a new winner!
As I suspected, my algorithm (remove_indices2) wins when there aren't many indices to remove. It turns out that the enumerate-based one, though, gets worse even faster as there are more indices to remove. Here's the timing code (bigrows rows have 210000 characters):
bigrows = []
for row in rows:
bigrows.append(row * 30000)
for indices_len in [10,100,1000,10000,100000]:
print "indices len: %s" % indices_len
indices = range(indices_len)
#for func in [remove_indices1,remove_indices2,remove_indices3,remove_indices4]:
for func in [remove_indices2,remove_indices4]:
start = time.time()
for row in bigrows:
func(row,indices)
print "%s: %s" % (func.__name__,(time.time() - start))
And here are the results:
indices len: 10
remove_indices1: 0.0187089443207
remove_indices2: 0.00184297561646
remove_indices3: 1.40601491928
remove_indices4: 0.692481040955
indices len: 100
remove_indices1: 0.0974130630493
remove_indices2: 0.00125503540039
remove_indices3: 7.92742991447
remove_indices4: 0.679095029831
indices len: 1000
remove_indices1: 0.841033935547
remove_indices2: 0.00370812416077
remove_indices3: 73.0718669891
remove_indices4: 0.680690050125
So, why does 3 do so much worse? Well, it turns out that the in operator isn't efficient on a list. It's got to iterate through all of the list items to check. remove_indices4 is just 3 but converting indices to a set first, so the inner loop can do a fast hash-lookup, instead of iterating through the list:
def remove_indices4(row,indices):
indices_set = set(indices)
return ''.join(c for i,c in enumerate(row) if i+1 not in indices_set)
And, as I originally expected, this does better than my algorithm for high densities:
indices len: 10
remove_indices2: 0.00230097770691
remove_indices4: 0.686790943146
indices len: 100
remove_indices2: 0.00113391876221
remove_indices4: 0.665997982025
indices len: 1000
remove_indices2: 0.00296902656555
remove_indices4: 0.700706005096
indices len: 10000
remove_indices2: 0.074893951416
remove_indices4: 0.679219007492
indices len: 100000
remove_indices2: 6.65899395943
remove_indices4: 0.701599836349
If you've got fewer than 10000 indices to remove, 2 is fastest (even faster if you do the indices sort/reverse once outside the function). But, if you want something that is pretty stable in time, no matter how many indices, use 4.
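For reference, hoisting the sort out of remove_indices2 would look something like this (same algorithm, just with the descending index list prepared once outside the per-row loop):
def remove_indices2_presorted(row, desc_indices):
    # desc_indices must already be sorted in descending order (1-based positions)
    front = row
    chunks = []
    for i in desc_indices:
        chunks.insert(0, front[i:])
        front = front[:i-1]
    chunks.insert(0, front)
    return "".join(chunks)

desc_indices = sorted(indices, reverse=True)  # done once
for row in bigrows:
    remove_indices2_presorted(row, desc_indices)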
The simplest way I can see would be something like:
>>> for line in TARGFILE:
... print ''.join(c for i,c in enumerate(line) if (i+1) not in INDICES)
...
100000200020020100200001
100010200001202010110001
010102000000111021001021
000000120012021012100110
(Substituting print for writing to your output file etc)
This relies on being able to load each line into memory which may or may not be reasonable given your data.
Edit: explanation:
The first line is straightforward:
>>> for line in TARGFILE:
Just iterates through each line in TARGFILE. The second line is a bit more complex:
''.join(...) concatenates a list of strings together with an empty joiner (''). join is often used with a comma like: ','.join(['a', 'b', 'c']) == 'a,b,c', but here we just want to join each item to the next.
enumerate(...) takes an iterable and returns pairs of (index, item) for each item in the iterable. For example list(enumerate('abc')) == [(0, 'a'), (1, 'b'), (2, 'c')]
So the line says,
Join together each character of line whose index is not found in INDICES
However, as John pointed out, Python indexes are zero-based, so we add 1 to the value from enumerate.
The script I ended up using is the following:
#!/usr/bin/env python
def remove_indices(row,indices):
indices_set = set(indices)
return ''.join(c for i,c in enumerate(row) if (i+1) in indices_set)
SRC_FILES=open('YCP2.txt', 'r')
CEUDIR='/USER/ScriptsAndLists/LAMP/LAMPLDv1.1/IN/aps/4bogdan/omni/CEU/PARSED/'
YRIDIR='/USER/ScriptsAndLists/LAMP/LAMPLDv1.1/IN/aps/4bogdan/omni/YRI/PARSED/'
i=0
for line in SRC_FILES:
i+=1
EUR_YRI_ADM=line.strip('\n')
EUR,YRI,ADM=EUR_YRI_ADM.split('\t')
ADMFO=open(ADM, 'r')
lines=ADMFO.readlines()
INDICES=[int(val) for val in lines[0].split()]
INDEXSORT=sorted(INDICES, key=int)
EURF=open(EUR, 'r')
EURFOUT=open(CEUDIR + 'chr' + str(i) + 'anc.hap.txt' , 'a')
for haplotype in EURF:
TRIMLINE=remove_indices(haplotype, INDEXSORT)
EURFOUT.write(TRIMLINE + '\n')
EURFOUT.close()
AFRF=open(YRI, 'r')
AFRFOUT=open(YRIDIR + 'chr' + str(i) + 'anc.hap.txt' , 'a')
for haplotype2 in AFRF:
TRIMLINE=remove_indices(haplotype2, INDEXSORT)
AFRFOUT.write(TRIMLINE + '\n')
AFRFOUT.close()
I was coding a High Scores system where the user would enter a name and a score then the program would test if the score was greater than the lowest score in high_scores. If it was, the score would be written and the lowest score, deleted. Everything was working just fine, but i noticed something. The high_scores.txt file was like this:
PL1 50
PL2 50
PL3 50
PL4 50
PL5 50
PL1 was the first score added, PL2 was the second, PL3 the third and so on. Then I tried adding another score, higher than all the others (PL6 60), and what happened was that the program assigned PL1 as the lowest score. PL6 was added and PL1 was deleted. That was exactly the behavior I wanted, but I don't understand how it happened. Do dictionaries keep track of the point in time when an item was assigned? Here's the code:
MAX_NUM_SCORES = 5
def getHighScores(scores_file):
"""Read scores from a file into a list."""
try:
cache_file = open(scores_file, 'r')
except (IOError, EOFError):
print("File is empty or does not exist.")
return []
else:
lines = cache_file.readlines()
high_scores = {}
for line in lines:
if len(high_scores) < MAX_NUM_SCORES:
name, score = line.split()
high_scores[name] = int(score)
else:
break
return high_scores
def writeScore(file_, name, new_score):
"""Write score to a file."""
if len(name) > 3:
name = name[0:3]
high_scores = getHighScores(file_)
if high_scores:
lowest_score = min(high_scores, key=high_scores.get)
if new_score > high_scores[lowest_score] or len(high_scores) < 5:
if len(high_scores) == 5:
del high_scores[lowest_score]
high_scores[name.upper()] = int(new_score)
else:
return 0
else:
high_scores[name.upper()] = int(new_score)
write_file = open(file_, 'w')
while high_scores:
highest_key = max(high_scores, key=high_scores.get)
line = highest_key + ' ' + str(high_scores[highest_key]) + '\n'
write_file.write(line)
del high_scores[highest_key]
return 1
def displayScores(file_):
"""Display scores from file."""
high_scores = getHighScores(file_)
print("HIGH SCORES")
if high_scores:
while high_scores:
highest_key = max(high_scores, key=high_scores.get)
print(highest_key, high_scores[highest_key])
del high_scores[highest_key]
else:
print("No scores yet.")
def resetScores(file_):
open(file_, "w").close()
No. The results you got were due to arbitrary choices internal to the dict implementation that you cannot depend on always happening. (There is a subclass of dict that does keep track of insertion order, though: collections.OrderedDict.) I believe that with the current implementation, if you switch the order of the PL1 and PL2 lines, PL1 will probably still be deleted.
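For illustration, OrderedDict simply remembers the order in which keys were first inserted (nothing else about your scores logic changes):
>>> from collections import OrderedDict
>>> d = OrderedDict()
>>> d['PL1'] = 50
>>> d['PL2'] = 50
>>> d['PL3'] = 60
>>> list(d)
['PL1', 'PL2', 'PL3']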
As others noted, the order of items in the dictionary is "up to the implementation".
This answer is more a comment to your question, "how min() decides what score is the lowest?", but is much too long and format-y for a comment. :-)
The interesting thing is that both max and min can be used this way. The reason is that they (can) work on "iterables", and dictionaries are iterable:
for i in some_dict:
loops i over all the keys in the dictionary. In your case, the keys are the user names. Further, min and max allow passing a key argument to turn each candidate in the iterable into a value suitable for a binary comparison. Thus, min is pretty much equivalent to the following python code, which includes some tracing to show exactly how this works:
def like_min(iterable, key=None):
it = iter(iterable)
result = it.next()
if key is None:
min_val = result
else:
min_val = key(result)
print '** initially, result is', result, 'with min_val =', min_val
for candidate in it:
if key is None:
cmp_val = candidate
else:
cmp_val = key(candidate)
print '** new candidate:', candidate, 'with val =', cmp_val
        if cmp_val < min_val:
            print '** taking new candidate'
            result = candidate
            min_val = cmp_val
    return result
If we run the above on a sample dictionary d, using d.get as our key:
d = {'p': 0, 'ayyy': 3, 'b': 5, 'elephant': -17}
m = like_min(d, key=d.get)
print 'like_min:', m
** initially, result is ayyy with min_val = 3
** new candidate: p with val = 0
** taking new candidate
** new candidate: b with val = 5
** new candidate: elephant with val = -17
** taking new candidate
like_min: elephant
we find that we get the key whose value is the smallest. Of course, if multiple values are equal, the choice of "smallest" depends on the dictionary iteration order (and also whether min actually uses < or <= internally).
(Also, the method you use to "sort" the high scores to print them out is O(n²): pick highest value, remove it from dictionary, repeat until empty. This traverses n items, then n-1, ... then 2, then 1 => n+(n-1)+...+2+1 steps = n(n+1)/2 = O(n²). Deleting the high one is also an expensive operation, although it should still come in at or under O(n²), I think. With n=5 this is not that bad (5 * 6 / 2 = 15), but ... not elegant. :-) )
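For what it's worth, a sketch of the usual alternative: sort the items once (O(n log n)) instead of repeatedly taking the max and deleting it.
for name, score in sorted(high_scores.items(), key=lambda item: item[1], reverse=True):
    print(name, score)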
This is pretty much what http://stromberg.dnsalias.org/~strombrg/python-tree-and-heap-comparison/ is about.
Short version: Get the treap module, which works like a sorted dictionary, and keep the keys in order. Or use the nest module to get the n greatest (or least) values automatically.
collections.OrderedDict is good for preserving insertion order, but not key order.
I have a file that looks like this
N1 1.023 2.11 3.789
Cl1 3.124 2.4534 1.678
Cl2 # # #
Cl3 # # #
Cl4
Cl5
N2
Cl6
Cl7
Cl8
Cl9
Cl10
N3
Cl11
Cl12
Cl13
Cl14
Cl15
The three numbers continue down throughout.
What I would like to do is pretty much a permutation. These are 3 data sets, set 1 is N1-Cl5, 2 is N2-Cl10, and set three is N3 - end.
I want every combination of N's and Cl's. For example the first output would be
Cl1
N1
Cl2
then everything else the same. The next set would be Cl1, Cl2, N1, Cl3... and so on.
I have some code but it won't do what I want, because it wouldn't know that there are three individual data sets. Should I have the three data sets in three different files and then combine them, using code like:
list1 = ['Cl1','Cl2','Cl3','Cl4', 'Cl5']
for line in file1:
line.replace('N1', list1(0))
list1.pop(0)
print >> file.txt, line,
or is there a better way?? Thanks in advance
This should do the trick:
from itertools import permutations

def print_permutations(in_file):
    separators = ['N1', 'N2', 'N3']
    cur_separator = None
    related_elements = []
    with open(in_file, 'rb') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Split Nx / Clx from the numbers.
            value = line.split()[0]
            if value in separators:
                # Found a new Nx. Print the permutations of the previous set.
                if cur_separator is not None:
                    for perm in permutations([cur_separator] + related_elements):
                        print perm
                cur_separator = value
                related_elements = []
            else:
                # Found a new Clx. Append it to the list.
                related_elements.append(value)
    # Don't forget the permutations of the last set.
    if cur_separator is not None:
        for perm in permutations([cur_separator] + related_elements):
            print perm
You could use regex to find the line numbers of the "N" patterns and then slice the file using those line numbers:
import re
n_pat = re.compile(r'N\d')
N_matches = []
with open(sample, 'r') as f:
for num, line in enumerate(f):
if re.match(n_pat, line):
N_matches.append((num, re.match(n_pat, line).group()))
>>> N_matches
[(0, 'N1'), (12, 'N2'), (24, 'N3')]
After you figure out the line numbers where these patterns appear, you can then use itertools.islice to break the file up into a list of lists:
import itertools
first = N_matches[0][0]
final = N_matches[-1][0]
step = N_matches[1][0]
dataset = []
locallist = []
while first < final + step:
    with open(sample, 'r') as f:
        for item in itertools.islice(f, first, first + step):
            if item.strip():
                locallist.append(item.strip())
    dataset.append(locallist)
    locallist = []
    first += step
itertools.islice is a really nice way to take a slice of an iterable. Here's the result of the above on a sample:
>>> dataset
[['N1 1.023 2.11 3.789', 'Cl1 3.126 2.6534 1.878', 'Cl2 3.124 2.4534 1.678', 'Cl3 3.924 2.1134 1.1278', 'Cl4', 'Cl5'], ['N2', 'Cl6 3.126 2.6534 1.878', 'Cl7 3.124 2.4534 1.678', 'Cl8 3.924 2.1134 1.1278', 'Cl9', 'Cl10'], ['N3', 'Cl11', 'Cl12', 'Cl13', 'Cl14', 'Cl15']]
After that, I'm a bit hazy on what you're seeking to do, but I think you want permutations of each sublist of the dataset? If so, you can use itertools.permutations to find permutations on various sublists of dataset:
for item in itertools.permutations(dataset[0]):
print(item)
etc.
Final Note:
Assuming I understand correctly what you're doing, the number of permutations is going to be huge. You can calculate how many permutations there are by taking the factorial of the number of items. Anything over 10 items (10!) is going to produce over 3.6 million permutations.
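A quick sanity check of that growth with the standard library (nothing specific to your data):
>>> import math
>>> math.factorial(6)    # one six-item group
720
>>> math.factorial(10)
3628800
>>> math.factorial(15)
1307674368000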
This is for homework, so I must try to use as few Python functions as possible, but still allow a computer to process a list of 1 million numbers efficiently.
#!/usr/bin/python3
#Find the 10 largest integers
#Don't store the whole list
import sys
import heapq
def fOpen(fname):
try:
fd = open(fname,"r")
except:
print("Couldn't open file.")
sys.exit(0)
all = fd.read().splitlines()
fd.close()
return all
words = fOpen(sys.argv[1])
numbs = map(int,words)
print(heapq.nlargest(10,numbs))
li=[]
count = 1
#Make the list
for x in words:
li.append(int(x))
count += 1
if len(li) == 10:
break
#Selection sort, largest-to-smallest
for each in range(0,len(li)-1):
pos = each
for x in range(each+1,10):
if li[x] > li[pos]:
pos = x
if pos != each:
li[each],li[pos] = li[pos],li[each]
for each in words:
print(li)
each = int(each)
if each > li[9]:
for x in range(0,9):
pos = x
if each > li[x]:
li[x] = each
for i in range(x+1,10):
li[pos],li[i] = li[i],li[pos]
break
#Selection sort, largest-to-smallest
for each in range(0,len(li)-1):
pos = each
for x in range(each+1,10):
if li[x] > li[pos]:
pos = x
if pos != each:
li[each],li[pos] = li[pos],li[each]
print(li)
The code is working ALMOST the way that I want it to. I tried to create a list from the first 10 digits and sort it so that it is in descending order, and then have Python only check the list when a number is larger than the smallest one (instead of reading through the whole list 10 * len(x) times).
This is the output I should be getting:
>>>[9932, 9885, 9779, 9689, 9682, 9600, 9590, 9449, 9366, 9081]
This is the output I am getting:
>>>[9932, 9689, 9885, 9779, 9682, 9025, 9600, 8949, 8612, 8575]
If you only need the 10 top numbers and don't care to sort the whole list, and if "must try to use as few Python functions as possible" means that you (or your teacher) prefer to avoid heapq, another way would be to keep track of the 10 top numbers while you parse the whole file only once:
top = []
with open('numbers.txt') as f:
    # the first ten numbers go directly in
    for line in f:
        top.append(int(line.strip()))
        if len(top) == 10:
            break
    # for the rest of the file, only keep a number if it beats the current minimum
    for line in f:
        num = int(line.strip())
        min_top = min(top)
        if num > min_top:  # check if the new number is a top one
            top.remove(min_top)
            top.append(num)
print(sorted(top))
Update: If you don't really need an in-place sort, and since you're going to sort only 10 numbers, I'd avoid the pain of reordering.
I'd just build a new list, example:
sorted_top = []
while top:
max_top = max(top)
sorted_top.append(max_top)
top.remove(max_top)
Well, by both reading in the entire file and splitting it, then using map(), you are keeping a lot of data in memory.
As Adrien pointed out, files are iterators in py3k, so you can just use a generator expression to provide the iterable for nlargest:
nums = (int(x) for x in open(sys.argv[1]))
then, using
heapq.nlargest(10, nums)
should get you what you need, and you haven't stored the entire list even once.
the program is even shorter than the original, as well!
#!/usr/bin/env python3
from heapq import nlargest
import sys
nums = (int(x) for x in open(sys.argv[1]))
print(nlargest(10, nums))