Optimizing the filling of lists from a .txt file - python

I am currently working on a post-processing program for which I use a .txt file. Each line of this text file contains 4 pieces of information, repeated 8 times. I created a function to read these values and store them in lists in the simplest way:
def add_to_lists(line, frequence, phase, in_phase, in_quad):
    # Serie 1 - Even
    frequence[0].append(line[3])
    in_phase[0].append(line[4])
    in_quad[0].append(line[5])
    phase[0].append(line[6])
    frequence[1].append(line[7])
    in_phase[1].append(line[8])
    in_quad[1].append(line[9])
    phase[1].append(line[10])
    frequence[2].append(line[11])
    in_phase[2].append(line[12])
    in_quad[2].append(line[13])
    phase[2].append(line[14])
    frequence[3].append(line[15])
    in_phase[3].append(line[16])
    in_quad[3].append(line[17])
    phase[3].append(line[18])
    # Serie 2 - Odd
    frequence[4].append(line[19])
    in_phase[4].append(line[20])
    in_quad[4].append(line[21])
    phase[4].append(line[22])
    frequence[5].append(line[23])
    in_phase[5].append(line[24])
    in_quad[5].append(line[25])
    phase[5].append(line[26])
    frequence[6].append(line[27])
    in_phase[6].append(line[28])
    in_quad[6].append(line[29])
    phase[6].append(line[30])
    frequence[7].append(line[31])
    in_phase[7].append(line[32])
    in_quad[7].append(line[33])
    phase[7].append(line[34])
This method works fine but I was wondering if there was a more efficient way of filling in those lists.

Instead of popping items out of the list, which may be dangerous, you can simply use a step in your for loop and divide the loop index by the step.
def add_to_lists(line, frequence, phase, in_phase, in_quad, step=4):
    for i in range(3, len(line), step):
        idx = (i-3) // step
        frequence[idx].append(line[i])
        in_phase[idx].append(line[i+1])
        in_quad[idx].append(line[i+2])
        phase[idx].append(line[i+3])
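For context, here is a minimal driver sketch showing how the eight sub-lists might be set up and filled (the file name and the whitespace-separated layout are assumptions, since the question doesn't show the reading code):
# hypothetical setup; assumes each line has at least 35 whitespace-separated fields,
# with the measurements starting at column 3 as in the question
frequence = [[] for _ in range(8)]
phase = [[] for _ in range(8)]
in_phase = [[] for _ in range(8)]
in_quad = [[] for _ in range(8)]
with open('data.txt') as src:
    for raw in src:
        add_to_lists(raw.split(), frequence, phase, in_phase, in_quad)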

You could chunk your line variable into sublists of length 4. You can simply pip install more-itertools and import chunked from this package.
from more_itertools import chunked
line_chunks = chunked(line[3:], 4)
for i, line_chunk in enumerate(line_chunks):
    frequence[i].append(line_chunk[0])
    in_phase[i].append(line_chunk[1])
    in_quad[i].append(line_chunk[2])
    phase[i].append(line_chunk[3])

from itertools import islice

def add_to_list(line, frequence, phase, in_phase, in_quad):
    frequence.extend(islice(line, 0, None, 4))
    in_phase.extend(islice(line, 1, None, 4))
    in_quad.extend(islice(line, 2, None, 4))
    phase.extend(islice(line, 3, None, 4))
Alternatively, you could just return a tuple of everything, like this:
def add_to_list(line):
    return (list(islice(line, 0, None, 4)),
            list(islice(line, 1, None, 4)),
            list(islice(line, 2, None, 4)),
            list(islice(line, 3, None, 4)))
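Note that these islice versions assume line holds only the measurement fields (i.e. the three leading columns have already been stripped) and that you want four flat lists rather than eight sub-lists per quantity. A possible call, with that assumption made explicit:
# hypothetical call: drop the 3 leading columns before grouping by 4
frequence, in_phase, in_quad, phase = add_to_list(line[3:])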

You could use a generator to create chunks...
def chunked(elements, size, start_index=0, limit=None):
    """ Generator creating chunks of given size. """
    if limit == 0:
        return
    for chunk_index, i in enumerate(range(start_index, len(elements), size)):
        if limit is None or chunk_index < limit:
            yield (chunk_index, elements[i:i+size])
        else:
            break

def add_to_lists(line, frequence, phase, in_phase, in_quad):
    for i, data in chunked(line, 4, 3, 8):
        frequence[i].append(data[0])
        in_phase[i].append(data[1])
        in_quad[i].append(data[2])
        phase[i].append(data[3])
I think using a generator, as in this answer and chatax's, is more readable, reusable and testable. It separates two distinct behaviours: creating the 8 chunks and filling the arrays. The chunked generator can also easily be tested on its own, e.g. with a unit test.
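For example, a minimal test of the generator (plain assert statements, not tied to any particular test framework) could look like this:
def test_chunked():
    data = list(range(10))
    # start at index 2, chunks of 4, at most 2 chunks
    assert list(chunked(data, 4, 2, 2)) == [(0, [2, 3, 4, 5]), (1, [6, 7, 8, 9])]
    # no limit: a trailing partial chunk is still yielded
    assert list(chunked(data, 4)) == [(0, [0, 1, 2, 3]), (1, [4, 5, 6, 7]), (2, [8, 9])]
    # limit=0 yields nothing
    assert list(chunked(data, 4, 0, 0)) == []

test_chunked()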

You could pop the values from the line list (i.e. get the fourth item in this case and remove it)
for i in range(8):
    frequence[i].append(line.pop(3))
    in_phase[i].append(line.pop(3))
    in_quad[i].append(line.pop(3))
    phase[i].append(line.pop(3))
Edit: while this works, popping mutates the line list. If that is unintended (or you are not sure whether it is safe), stepping with range() is a better option.

Related

Is there any available function in numpy that iterates over an ndarray and modifies each element with a custom function?

def evolve(self):
    newgrid = signal.convolve2d(self.grid, self.neighborhood, 'same')
    dimentionX = self.grid.shape[0]
    dimentionY = self.grid.shape[1]
    for i in range(0, dimentionX):
        for j in range(0, dimentionY):
            if newgrid[i,j] < 2:
                self.grid[i,j] = self.deadValue
            elif newgrid[i,j] == 3:
                self.grid[i,j] = self.aliveValue
            elif newgrid[i,j] > 3:
                self.grid[i,j] = self.deadValue
    return self.grid
I am doing something like this. This function is frequently called. It was fine when the grid was not large (64x64, for example). However, when the grid has sides of more than a thousand cells, the simulation runs very slowly.
I was told that with appropriate use of numpy it should be much faster, and that numpy provides a function that does the same thing as what I have written, but much faster.
After some research in the documentation, I only found this:
But this only supports a boolean return type and a simple callback for each element, while I need to do a complex operation (one that spans multiple lines and involves 'if's) for each element.
Note that I do not discuss your approach as such; I strictly address your question.
What about resorting to boolean indexing, as follows?
# [...]
self.grid[(newgrid < 2) | (newgrid > 3)] = self.deadValue
self.grid[newgrid == 3] = self.aliveValue
# [...]
The function you are looking for is np.where:
def evolve(self):
    newgrid = signal.convolve2d(self.grid, self.neighborhood, 'same')
    self.grid = np.where(newgrid == 3, self.aliveValue, self.deadValue)
    return self.grid
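One caveat: the original loop leaves cells with exactly two neighbours unchanged, while this single np.where turns every cell that is not 3 into deadValue. A sketch that keeps the ==2 behaviour (attribute names taken from the question) would be:
def evolve(self):
    newgrid = signal.convolve2d(self.grid, self.neighborhood, 'same')
    # ==3 becomes alive, ==2 keeps its current value, everything else dies
    self.grid = np.where(newgrid == 3, self.aliveValue,
                         np.where(newgrid == 2, self.grid, self.deadValue))
    return self.grid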

Wrong list output compared to what was expected

So I have to iterate through this list, divide the even numbers by 2 and multiply the odd ones by 3, but when I join the list together to print it, it gives me [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]. I printed each value inside the loop to check whether it was an arithmetic error, but it prints the correct value. Using lambda I have found out that it rewrites data every time it is called, so I'm trying to find other ways to do this while still using the map function. The constraint for the code is that it needs to be done using a map function. Here is a snippet of the code:
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
data_list1 = []
i = 0
while i < len(data):
    if (data[i] % 2) == 0:
        data_list1 = list(map(lambda a: a / 2, data))
        print(data_list1[i])
        i += 1
    else:
        data_list1 = list(map(lambda a: a * 3, data))
        print(data_list1[i])
        i += 1
print(list(data_list1))
Edit: Error has been fixed.
The easiest way for me to do this is as follows:
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
data_list1 = []
i = 0
for i in range(len(data)):
    if (data[i]%2) == 0:
        data_list1 = data_list1 + [int(data[i]/2)]
    elif (data[i]%2) == 1: # alternatively an else: would do, but only if every entry in data is an int()
        data_list1 = data_list1 + [data[i]*3]
print(data_list1)
In your case a for loop makes the code much easier to read, but a while loop works just as well.
In your original code the issue is your map() function. If you look into the documentation for it, you will see that map() affects every item in the iterable. You do not want this; instead, you want to change only the specified entry.
Edit: If you want to use lambda for some reason, here's a (pretty useless) way to do it:
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
data_list1 = []
for i in range(len(data)):
    if (data[i] % 2) == 0:
        x = lambda a: a/2
        data_list1.append(x(data[i]))
    else:
        y = lambda a: a*3
        data_list1.append(y(data[i]))
print(data_list1)
If you have additional design constraints, please specify them in your question, so we can help.
Edit 2: Once more unto the breach: since you added your constraints, here's how to do it with a mapping function:
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

def changer(a):
    if a%2 == 0:
        return a/2
    else:
        return a*3

print(list(map(changer, data)))
If you want it to be in a new list, just add data_list1=list(map(changer,data)).
Hope this is what you were looking for!
You can format the string for the output like this:
print(','.join(["%2.1f " % (.5*x if (x%2)==0 else 3*x) for x in data]))
From your latest comment I completely edited the answer below (old version can be found in the edit-history of this post).
From your update, I see that your constraint is to use map. So let's address how this works:
map is a function which exists in many languages, and it might be surprising at first because it takes a function as an argument. One analogy could be: you give a craftsman (the "map" function) pieces of metal (the list of values) and a tool (the function passed into "map"), and you tell him to use the tool on each piece of metal and give you back the modified pieces.
A very important thing to understand is that map takes a complete list/iterable and return a new iterable all by itself. map takes care of the looping so you don't have to.
If you hand him a hammer as tool, each piece of metal will have a dent in it.
If you hand him a scriber, each piece of metal will have a scratch in it.
If you hand him a forge as tool, each piece of metal will be returned molten.
The core thing to understand here is that "map" will take any list (or more precisely, any "iterable") and will apply whatever function you give it to each item, returning the modified list (again, the return value is not really a list but a new "iterable").
So for example (using strings):
def scribe(piece_of_metal):
    """
    This function takes a string and appends "with a scratch" at the end
    """
    return "%s with a scratch" % piece_of_metal

def hammer(piece_of_metal):
    """
    This function takes a string and appends "with a dent" at the end
    """
    return "%s with a dent" % piece_of_metal

def forge(piece_of_metal):
    """
    This function takes a string and prepends it with "molten"
    """
    return "molten %s" % piece_of_metal

metals = ["iron", "gold", "silver"]
scribed_metals = map(scribe, metals)
dented_metals = map(hammer, metals)
molten_metals = map(forge, metals)

for row in scribed_metals:
    print(row)

for row in dented_metals:
    print(row)

for row in molten_metals:
    print(row)
I have deliberately not responded to the core of your question as it is homework, but I hope this post gives you a practical example of using map which helps with the exercise.
Another, more practical example: saving data to disk
The above example is deliberately contrived to keep it simple, but it's not very practical. Here is another example which could actually be useful: storing documents on disk. We assume we have a function fetch_documents which returns a list of strings, where the strings are the text content of the documents. We want to store those into .txt files. As filenames we will use the MD5 hash of the contents. MD5 is chosen to keep things simple: this way we still only require one argument to the "mapped" function, and it is sufficiently unique to avoid overwrites:
from assume_we_have import fetch_documents
from hashlib import md5

def store_document(contents):
    """
    Store the contents into a unique filename and return the generated filename.
    """
    hash = md5(contents.encode('utf-8'))  # md5 needs bytes, so encode if contents is a str (Python 3)
    filename = '%s.txt' % hash.hexdigest()
    with open(filename, 'w') as outfile:
        outfile.write(contents)
    return filename

documents = fetch_documents()
stored_filenames = map(store_document, documents)
The last line which is using map could be replaced with:
stored_filenames = []
for document in documents:
    filename = store_document(document)
    stored_filenames.append(filename)

Nesting loops to arbitrary depth by passing method objects

I am trying to scan over iterable properties of n objects. I am looking for a pythonic way to perform functions in nested loops of arbitrary depth by passing functions to method calls of the loop one level up. I haven't been able to get more than the innermost loop to run when the depth is 3. Here is non-working Python pseudo-code where I am querying a different value at each point in the loops. The other difficulty is that I am trying to capture the output and pass it to the next outer loop.
class Parent(object):
    def __init__(self):
        self.iterable = [None] * 2
        self.result = self.iterable[:]

    def loop(self, query_func):
        def innerloop():
            for i, x in enumerate(self.iterable):
                self.result[i] = query_func(x)
            return self.result[:]
        return innerloop

class ChildA(Parent):
    def __init__(self, A, object_to_queryA):
        self.iterableA = [valueA for valueA in range(A)]
        self.resultA = self.iterableA[:]
        self.object_to_query = object_to_queryA

    def query_valueA(self, x):
        return self.object_to_query.some_query_function(x)

class ChildB(Parent):
    def __init__(self, B, object_to_queryB):
        self.iterableB = [valueB for valueB in range(B)]
        self.resultB = self.iterableB[:]
        self.object_to_query = object_to_queryB

    def query_valueB(self, x):
        return self.object_to_query.some_other_query_function(x)

class ChildC(Parent):
    def __init__(self, C, object_to_queryC):
        self.iterableC = [valueC for valueC in range(C)]
        self.resultC = self.iterableC[:]
        self.object_to_query = object_to_queryC

    def query_valueC(self, x):
        return self.object_to_query.yet_another_query_function(x)
I want to be able to call these loops as follows:
import numpy
query_objA, query_objB, query_objC = (SomeObjA(), SomeObjB(), SomeObjC())
A, B, C = (len(query_objA.data), len(query_objB.data), len(query_objC.data))
instA = ChildA(A, query_objA)
instB = ChildB(B, query_objB)
instC = ChildC(C, query_objC)
my_scanning_func = ChildA.loop(ChildB.loop(ChildC.loop))
my_queries = numpy.array(my_scanning_func()).reshape(A,B,C)
# Equally valid call example below:
my_scanning_func2 = ChildB.loop(ChildC.loop(ChildA.loop))
my_queries2 = numpy.array(my_scanning_func2()).reshape(B,C,A)
The ultimate functionality I'm looking for would be similar to the below, but for arbitrary depth and order:
for i, x in enumerate(query_objA.data):
    response[i] = instA.some_query_function(x)
    for j, y in enumerate(query_objB.data):
        response[i][j] = instB.some_other_query_function(y)
        for k, z in enumerate(query_objC.data):
            response[i][j][k] = instC.yet_another_query_function(z)
Bonus points if this can be done via an inherited recursive function, rather than defining separate looping methods for each child, as I tried to do above. Last Note: I am trying to write Python 2.7 compatible code. Thanks in advance!
After much discussion with the OP, I have a better idea of how you could generalize the construction of these arrays. First, it seems that your objects are designed to both iterate over predefined states and query the present state (possibly with only one of these being valid at a time), so the interface for such an object could be abstracted to something like this:
class Apparatus_interface:
    def __init__(self, *needed_stuff):
        #I have no idea how you are actually interacting with the device
        self._device = SET_UP_OBJECT(needed_stuff)
        #when iterating over this object we need to know how many states there are
        #so we can predefine the shape (dimensions) of our arrays
        self.num_of_states = 5
        #it would make sense for each object to define
        #the type of value that .query() returns (following spec of numpy's dtype)
        self.query_type = [('f1', float), ('f2', float)]

    def __iter__(self):
        """iterates over the physical positions/states of the apparatus

        the state of the device is only active in between iterations
        * calling list(device) doesn't give you any useful information, just a lot of mechanical work
        """
        for position in range(self.num_of_states):
            # ^ not sure what this should be either, you will have a better idea
            self._device.move_to(position) #represents a physical change in the device
            yield position #should it generate different information?

    def query(self):
        return self._device.query()
With this interface you would generate your array by iterating (in a nested loop) over a number of devices; at each combination of their states you query another device and record that value into an array.
Normally you'd be able to use itertools.product to generate the combinations of device states; however, itertools.product consumes its input iterables up front, so the iteration code that moves the physical device would run before the values are actually used. You will need an implementation that does not apply this kind of optimization:
#values is a list that contains the current elements generated
#the loop: for values[depth] in iterables[depth] basically sets the depth-th element to each value in that level of iterable
def _product(iterables, depth, values):
    if len(iterables)-depth == 1:
        for values[depth] in iterables[depth]:
            yield tuple(values)
    else:
        for values[depth] in iterables[depth]:
            #yield from _product(iterables, depth+1, values)
            for tup in _product(iterables, depth+1, values):
                yield tup

def product(*iterables):
    """
    version of itertools.product to activate side-effects of iteration
    only works with iterables, not iterators.
    """
    values = [None]*len(iterables)
    return _product(iterables, 0, values)
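A quick sanity check with plain lists (no hardware involved) shows it yields the same tuples as itertools.product would:
print(list(product([1, 2], ['a', 'b'])))
# -> [(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')]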
Now for actually generating the array. First, a generator that iterates through the product of all states and makes a query at each one; note that the states variable is unused, as I assume the placement in the numpy array is determined by the order in which the states are iterated, not by the values produced:
def traverse_states(variable_devices, queried_device):
    """queries a device at every combination of variable devices states"""
    for states in product(*variable_devices):
        yield queried_device.query()
Then the function to put the array together is quite straightforward:
def array_from_apparatus(variable_devices, queried_object, dtype=None):
    # the # of states in each device <==> # of elements in each dimension
    arr_shape = [device.num_of_states for device in variable_devices]
    iterator = traverse_states(variable_devices, queried_object)
    if dtype is None:
        dtype = queried_object.query_type
    array = numpy.fromiter(iterator, dtype=dtype)
    array.shape = arr_shape #this will fail if .num_of_states doesn't match the actual number of iterations
    return array
I'm not sure how I could make a decent test of this but I believe it would work or at least be close.
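For what it's worth, here is a rough self-test using a fake device class (the names and behaviour are made up, since I don't know your real apparatus API); it only checks that the shape and the number of queries come out right:
import numpy

class FakeDevice(object):
    """Hypothetical stand-in for the real apparatus: no hardware, just counters."""
    def __init__(self, num_of_states):
        self.num_of_states = num_of_states
        self.query_type = float        # each query returns a single float here
        self.reading = 0.0
    def __iter__(self):
        for position in range(self.num_of_states):
            yield position             # pretend the device moved to this state
    def query(self):
        self.reading += 1.0            # pretend each query gives a new measurement
        return self.reading

probe = FakeDevice(1)
arr = array_from_apparatus([FakeDevice(2), FakeDevice(3)], probe)
print(arr.shape)   # (2, 3)
print(arr)         # readings 1.0 .. 6.0 laid out in iteration order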
I'm not sure if this answers your question, but I think it is at least relevant. If you want to generate a numpy array such that array[tup] = func(tup), where tup is a tuple of integer indices, you could use itertools.product in combination with numpy.fromiter like this:
import itertools
#from itertools import imap as map #for python 2
import numpy

def array_from_func(dimensions, func, dtype=float):
    ranges = (range(i) for i in dimensions) #ranges of indices for all dimensions
    all_indices = itertools.product(*ranges) #will iterate over all locations regardless of # of dimensions
    value_gen = map(func, all_indices) #produces each value for each location
    array = numpy.fromiter(value_gen, dtype=dtype)
    array.shape = dimensions #modify the shape in place, .reshape would work but makes a copy.
    return array
This is useful for seeing how indices relate to the actual array output. Here are three demos of the basic functionality (the second one I figured out recently):
>>> from operator import itemgetter
>>> array_from_func((2,3,4), itemgetter(1), int) #second index
array([[[0, 0, 0, 0],
        [1, 1, 1, 1],
        [2, 2, 2, 2]],

       [[0, 0, 0, 0],
        [1, 1, 1, 1],
        [2, 2, 2, 2]]])
>>> def str_join(it):
...     return ",".join(map(str,it))
...
#the '<U5' in the next line specifies strings of length 5, this only works when the string will actually be length 5
#changing to '<U%d'%len(str_join(dims)) would be more generalized but harder to understand
>>> print(array_from_func((3,2,7), str_join, '<U5'))
[[['0,0,0' '0,0,1' '0,0,2' '0,0,3' '0,0,4' '0,0,5' '0,0,6']
  ['0,1,0' '0,1,1' '0,1,2' '0,1,3' '0,1,4' '0,1,5' '0,1,6']]

 [['1,0,0' '1,0,1' '1,0,2' '1,0,3' '1,0,4' '1,0,5' '1,0,6']
  ['1,1,0' '1,1,1' '1,1,2' '1,1,3' '1,1,4' '1,1,5' '1,1,6']]

 [['2,0,0' '2,0,1' '2,0,2' '2,0,3' '2,0,4' '2,0,5' '2,0,6']
  ['2,1,0' '2,1,1' '2,1,2' '2,1,3' '2,1,4' '2,1,5' '2,1,6']]]
>>> array_from_func((3,4), sum) #the sum of the indices, not as useful but another good demo
array([[ 0.,  1.,  2.,  3.],
       [ 1.,  2.,  3.,  4.],
       [ 2.,  3.,  4.,  5.]])
I think this is along the lines of what you are trying to accomplish but I'm not quite sure... please give me feedback if I can be more specific about what you need.

Python - finding all paths between points of arbitrary shape

My goal for the program is the following:
Given any shape (represented as enumerated points and their connections to other points), return a list containing all possible paths (as strings/lists/...). A path is a 'drawing' of the given shape, in which:
no connection has been used more than once and
the 'pen' hasn't been lifted (example included below).
The following code is essentially what I've come up with so far. It's not the code of the actual program, but the basic semantics are the same (i.e. if this code works, my program will work too).
"""
Example used:
2
/ \
/ \
/ \
1-------3
"""
from copy import deepcopy
points = {1: [2,3],
2: [1,3],
3: [1,2]}
def find_paths(prev_point, points):
    for current_point in points[prev_point]:
        points[current_point].remove(prev_point)
        points[prev_point].remove(current_point)
        return [prev_point] + find_paths(current_point, points)
    return [prev_point]

def collect(points):
    results = []
    for first_point in points:
        result = find_paths(first_point, deepcopy(points))
        results.append(result)
    return results

print(collect(points))
My struggle has been to make it return all paths. As of now, it lists only 3 (out of 6). I understand that the issue arises from the for loop in find_paths being executed exactly once each time it is called (and it's being called 3 times), since the execution is terminated by return each time. However, I have so far failed to find a way to avoid this. I played around with making find_paths a generator, but this gave me a list of generators as the end result, no matter how I tried to change it.
Any help is appreciated!
EDIT: The generator version I had simply replaced the returns in find_paths with yields.
So the last two lines look like:
        ...
        yield [prev_point] + find_paths(current_point, points)
    yield [prev_point]
Additionally, I played around with a 'flattener' for generators, but it didn't work at all:
def flatten(x):
    if callable(x):
        for i in x:
            yield flatten(i)
    yield x

def func():
    yield 1

lis = [1,2,func]
for f in flatten(lis):
    print(f)
I think the following works. I based it off of your original code, but did a few things (some necessary, some not):
Rename parameters in find_paths to make more sense for me. We are working with the current_point not the previous_point, etc.
Add an end condition to stop recursion.
Make a copy of points for every possible path being generated and return (yield) each one of those possible paths. Your original code didn't have logic for this since it only expected one result per call to find_paths, but that doesn't really make sense when using recursion like this. I also extend my final result for the same reason.
Here is the code:
from copy import deepcopy

points = {1: [2,3],
          2: [1,3],
          3: [1,2]}

def find_paths(current_point, points):
    if len(points[current_point]) == 0:
        # End condition: have we found a complete path? Then yield
        if all(not v for v in points.values()):
            yield [current_point]
    else:
        for next_point in points[current_point]:
            new_points = deepcopy(points)
            new_points[current_point].remove(next_point)
            new_points[next_point].remove(current_point)
            paths = find_paths(next_point, new_points)
            for path in paths:
                yield [current_point] + path

def collect(points):
    results = []
    for first_point in points:
        result = find_paths(first_point, points)
        results.extend(list(result))
    return results

print(collect(points))
Results in:
[1, 2, 3, 1]
[1, 3, 2, 1]
[2, 1, 3, 2]
[2, 3, 1, 2]
[3, 1, 2, 3]
[3, 2, 1, 3]
Your original example image should work with the following:
points = {
    1: [2,3],
    2: [1,3,4,5],
    3: [1,2,4,5],
    4: [2,3,5],
    5: [2,3,4],
}
Edit: Removed the extra deepcopy I had in collect.
It is necessary to copy the points every time because you are "saving" the current state of the current path you are "drawing". If you didn't copy it then going down the path to node 2 would change the state of the points when going down the path to node 3.
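To try the five-point shape from the question, a quick check (I haven't verified the exact number of paths by hand) could be:
all_paths = collect(points)
print(len(all_paths))        # number of complete drawings found
for path in all_paths[:3]:   # peek at the first few
    print(path)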

Avoiding off-by-one errors when removing columns based on indices in a python list

I have a target file called TARGFILE of the form:
10001000020002002001100100200000111
10201001020000120210101100110010011
02010010200000011100012021001012021
00102000012001202100101202100111010
My idea here was to leave this as a string, and use slicing in python to remove the indices.
The removal will occur based on a list of integers called INDICES like so:
[1, 115654, 115655, 115656, 2, 4, 134765, 134766, 18, 20, 21, 23, 24, 17659, 92573, 30, 32, 88932, 33, 35, 37, 110463, 38, 18282, 46, 18458, 48, 51, 54]
I want to remove every position of every line in TARGFILE that matches with INDICES. For instance, the first digit in INDICES is 1, so the first column of TARGFILE, containing 1,1,0,0, would be removed. However, I am wary of doing this incorrectly due to off-by-one errors and changing index positions if everything is not removed at the same time.
Thus, a solution that removed every column from each row at the same time would likely be both much faster and safer than using a nested loop, but I am unsure of how to code this.
My code so far is here:
#!/usr/bin/env python
import fileinput

SRC_FILES=open('YCP.txt', 'r')
for line in SRC_FILES:
    EUR_YRI_ADM=line.strip('\n')
    EUR,YRI,ADM=EUR_YRI_ADM.split(' ')
    ADMFO=open(ADM, 'r')
    lines=ADMFO.readlines()
    INDICES=[int(val) for val in lines[0].split()]
    TARGFILE=open(EUR, 'r')
It seems to me that a solution using enumerate might be possible, but I have not found it, and that might be suboptimal in the first place...
EDIT: in response to concerns about memory: the longest lines are ~180,000 items, but I should be able to get this into memory without a problem; I have access to a cluster.
I like the simplicity of Peter's answer, even though it's currently off-by-one. My thought is that you can get rid of the index-shifting problem by sorting INDICES and doing the process from back to front. That led to remove_indices1, which is really inefficient. I think 2 is better, but the simplest is 3, which is Peter's answer.
I may do timing in a bit for some large numbers, but my intuition says that my remove_indices2 will be faster than Peter's remove_indices3 if INDICES is very sparse. (Because you don't have to iterate over each character, but only over the indices that are being deleted.)
By the way, if you can sort INDICES once, then you don't need to make the local copy to sort/reverse, but I didn't know if you could do that.
rows = [
    '0000000001111111111222222222233333333334444444444555555555566666666667',
    '1234567890123456789012345678901234567890123456789012345678901234567890',
]

def remove_nth_character(row,n):
    return row[:n-1] + row[n:]

def remove_indices1(row,indices):
    local_indices = indices[:]
    retval = row
    local_indices.sort()
    local_indices.reverse()
    for i in local_indices:
        retval = remove_nth_character(retval,i)
    return retval

def remove_indices2(row,indices):
    local_indices = indices[:]
    local_indices.sort()
    local_indices.reverse()
    front = row
    chunks = []
    for i in local_indices:
        chunks.insert(0,front[i:])
        front = front[:i-1]
    chunks.insert(0,front)
    return "".join(chunks)

def remove_indices3(row,indices):
    return ''.join(c for i,c in enumerate(row) if i+1 not in indices)

indices = [1,11,4,54,33,20,7]

for row in rows:
    print remove_indices1(row,indices)
print ""

for row in rows:
    print remove_indices2(row,indices)
print ""

for row in rows:
    print remove_indices3(row,indices)
EDIT: Adding timing info, plus a new winner!
As I suspected, my algorithm (remove_indices2) wins when there aren't many indices to remove. It turns out that the enumerate-based one, though, gets worse even faster as there are more indices to remove. Here's the timing code (bigrows rows have 210000 characters):
import time

bigrows = []
for row in rows:
    bigrows.append(row * 30000)

for indices_len in [10,100,1000,10000,100000]:
    print "indices len: %s" % indices_len
    indices = range(indices_len)
    #for func in [remove_indices1,remove_indices2,remove_indices3,remove_indices4]:
    for func in [remove_indices2,remove_indices4]:
        start = time.time()
        for row in bigrows:
            func(row,indices)
        print "%s: %s" % (func.__name__,(time.time() - start))
And here are the results:
indices len: 10
remove_indices1: 0.0187089443207
remove_indices2: 0.00184297561646
remove_indices3: 1.40601491928
remove_indices4: 0.692481040955
indices len: 100
remove_indices1: 0.0974130630493
remove_indices2: 0.00125503540039
remove_indices3: 7.92742991447
remove_indices4: 0.679095029831
indices len: 1000
remove_indices1: 0.841033935547
remove_indices2: 0.00370812416077
remove_indices3: 73.0718669891
remove_indices4: 0.680690050125
So, why does 3 do so much worse? Well, it turns out that the in operator isn't efficient on a list: it has to iterate through all of the list items to check. remove_indices4 is just 3, but with indices converted to a set first, so the inner loop can do a fast hash lookup instead of iterating through the list:
def remove_indices4(row,indices):
    indices_set = set(indices)
    return ''.join(c for i,c in enumerate(row) if i+1 not in indices_set)
And, as I originally expected, this does better than my algorithm for high densities:
indices len: 10
remove_indices2: 0.00230097770691
remove_indices4: 0.686790943146
indices len: 100
remove_indices2: 0.00113391876221
remove_indices4: 0.665997982025
indices len: 1000
remove_indices2: 0.00296902656555
remove_indices4: 0.700706005096
indices len: 10000
remove_indices2: 0.074893951416
remove_indices4: 0.679219007492
indices len: 100000
remove_indices2: 6.65899395943
remove_indices4: 0.701599836349
If you've got fewer than 10000 indices to remove, 2 is fastest (even faster if you do the indices sort/reverse once outside the function). But, if you want something that is pretty stable in time, no matter how many indices, use 4.
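For completeness, a sketch of a remove_indices2 variant that assumes the caller has already sorted the (1-based) indices in descending order, so the per-call copy/sort/reverse disappears:
def remove_indices2_presorted(row, desc_indices):
    # desc_indices must already be sorted in descending order, e.g.
    # desc_indices = sorted(INDICES, reverse=True) done once, outside the loop
    front = row
    chunks = []
    for i in desc_indices:
        chunks.insert(0, front[i:])
        front = front[:i-1]
    chunks.insert(0, front)
    return "".join(chunks)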
The simplest way I can see would be something like:
>>> for line in TARGFILE:
...     print ''.join(c for i,c in enumerate(line) if (i+1) not in INDICES)
...
100000200020020100200001
100010200001202010110001
010102000000111021001021
000000120012021012100110
(Substituting print for writing to your output file etc)
This relies on being able to load each line into memory which may or may not be reasonable given your data.
Edit: explanation:
The first line is straightforward:
>>> for line in TARGFILE:
Just iterates through each line in TARGFILE. The second line is a bit more complex:
''.join(...) concatenates a list of strings together with an empty joiner (''). join is often used with a comma like: ','.join(['a', 'b', 'c']) == 'a,b,c', but here we just want to join each item to the next.
enumerate(...) takes an iterable and returns pairs of (index, item) for each item in the iterable. For example, enumerate('abc') yields (0, 'a'), (1, 'b'), (2, 'c').
So the line says:
Join together each character of line whose index is not found in INDICES
However, as John pointed out, Python indexes are zero base, so we add 1 to the value from enumerate.
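A tiny interactive example of that off-by-one adjustment:
>>> line = 'abcde'
>>> INDICES = [1, 3]   # 1-based positions to drop: 'a' and 'c'
>>> ''.join(c for i, c in enumerate(line) if (i + 1) not in INDICES)
'bde'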
The script I ended up using is the following:
#!/usr/bin/env python
def remove_indices(row, indices):
    indices_set = set(indices)
    # note: 'in' keeps only the listed (1-based) columns; use 'not in' to drop them instead, as in remove_indices4
    return ''.join(c for i,c in enumerate(row) if (i+1) in indices_set)

SRC_FILES=open('YCP2.txt', 'r')
CEUDIR='/USER/ScriptsAndLists/LAMP/LAMPLDv1.1/IN/aps/4bogdan/omni/CEU/PARSED/'
YRIDIR='/USER/ScriptsAndLists/LAMP/LAMPLDv1.1/IN/aps/4bogdan/omni/YRI/PARSED/'
i=0
for line in SRC_FILES:
    i+=1
    EUR_YRI_ADM=line.strip('\n')
    EUR,YRI,ADM=EUR_YRI_ADM.split('\t')
    ADMFO=open(ADM, 'r')
    lines=ADMFO.readlines()
    INDICES=[int(val) for val in lines[0].split()]
    INDEXSORT=sorted(INDICES, key=int)
    EURF=open(EUR, 'r')
    EURFOUT=open(CEUDIR + 'chr' + str(i) + 'anc.hap.txt' , 'a')
    for haplotype in EURF:
        TRIMLINE=remove_indices(haplotype, INDEXSORT)
        EURFOUT.write(TRIMLINE + '\n')
    EURFOUT.close()
    AFRF=open(YRI, 'r')
    AFRFOUT=open(YRIDIR + 'chr' + str(i) + 'anc.hap.txt' , 'a')
    for haplotype2 in AFRF:
        TRIMLINE=remove_indices(haplotype2, INDEXSORT)
        AFRFOUT.write(TRIMLINE + '\n')
    AFRFOUT.close()
