Using a dictionary to index parallel arrays?

Using a dictionary to index parallel arrays? - python

I have 4 parallel arrays based on a table representing attributes of a map. Each array has approx. 500 values, but all have the same number of values.
The arrays are:
start = location of the endpoint with the smaller flow accumulation,
end = location of the other endpoint (with the larger flow accumulation),
length = segment length, and;
shape = actual shape, oriented to run from start to end.
I am attempting to create a data structure from which I can use a recursive function on to determine the start and end points every 2000m along the length.
The following question and answer describe what I am attempting to accomplish:
https://gis.stackexchange.com/questions/87649/select-points-approx-2000-metres-from-another-point-along-a-river
How do I store these 4 parallel arrays in a dictionary keyed by start?
I am new to writing functions, dictionaries and using arrays in dictionaries. I am attempting to do this task in Python.

I think this is what you mean:
d = {}
for i in range(len(start)):
d[start[i]] = (shape[i],length[i],end[i])
so now d[some_start_value] will hold the corresponding shape length and end values.

If you want to do things a little bit more Python-esque, you can use enumerate:
d = {}
for (i,st) in enumerate(start):
d[st] = (shape[i],length[i],end[i])
or even better - zip:
d = {}
for (st,sh,le,en) in zip(start,shape,length,end):
d[st] = (sh,le,en)
Note that you can leave out the parantheses around the first part of the for loops (i.e. between the for and in keywords). I used them solely for enhanced code readability.
As with WeaselFox's answer, d[some_start_value] will now hold the corresponding shape, length and end values.

In addition to the above answers, I would recommend using namedtuple to simplify accesses:
from collections import namedtuple
# This creates a namedtuple called GISData. Name of the object and name in the first argument
# should be the same.
GISData = namedtuple('GISData', 'start shape length end')
# zip creates 1 list of 4-tuples from 4 single lists
# There are other ways to write this; this is just the shortest for me.
# Note that if you need this ordered, you should use an OrderedDict,
# which is in the collections module in python 2.7+, or you can find
# backported versions for python 2.6+. In those, the keys preserve ordering,
# so can still be searched as a list, which is useful if you need to find e.g.
# 479, which is not in the dictionary, but 400 and 500 are and you have to interpolate etc.
GISDict = dict((x[0], GISData(*x)) for x in zip(start, shape, length, end))
# The dictionary for any given start value
# Access the 4 individual pieces by name, or by index
GISDict[start_lookup].shape
etc.

Related

Sort unknown length array within unknown length 2D array - Python

I have a Python script which ends up creating a 2D array based on user input. Therefore, the length of the 2D array is unknown and the length of the individual arrays within the 2D array are also unknown until the user has input the information. I would like to sort the individual array pieces based on a value associated with them. An example of a possible output that needs to be sorted is below:
Basically, each individual array is a failure symptom followed by the a list of possible components, each having a "score" associated with them that is the likelihood that this component is causing the failure. My goal is to reorder the array with the components along with their scores in descending order based on the score, i.e., the component and score need to be moved together. The problem I have is like I said, I do not know the length of anything until user input is given. There could be only 1 failure symptom input, or there could be 9. The failure symptom could contain only 1 component, or maybe 12. I know it will take nested for loops and if statements, but I haven't been able to figure it out based on all the possible scenarios. Some possible scenarios I have thought of:
The array is already in order (move to the next failure symptom)
The first component is correct, but the ones after may not be. Or the first two are correct, but the ones after may not be, etc...
The array is completely backwards in order
The array only contains 1 component, therefore there is no need to sort
The array is in some random order, so some positions for some components may already be in the correct spot while some others aren't
Every time I feel like I am making headway, I think of another scenario which wouldn't hold up. Any help is greatly appreciated!

Your problem is a bit special. You don't only want to sort a multidimensional array, which would be rather simple using the default sorting algorithms, you also want to keep the order between the key/value pairs.
The second problem is that the keys are strings with numbers in it. So simple string comparison wouldn't work, because it is compared letter by letter, so "test9" > "test11" would be true (the second 1 wouldn't be even recognized, because 9>1).
The simpliest solution i figured out would be the following:
#get the failure id of one list
def failureId(value):
return int(value[0].replace("failure",""))
#get the id of one component
def componentId(value):
return int(value.replace("component",""))
#sort one failure list using bubble sort
def sortFailure(failure):
#iteraring through the array twice (only the keys, ignoring the values)
for i in range(1,len(failure), 2):
for j in range(1,i, 2):
#comparing the component ids
if (componentId(failure[j])>componentId(failure[j+2])):
#swaping keys and values
failure[j],failure[j+2] = failure[j+2],failure[j]
failure[j+1],failure[j+3] = failure[j+3],failure[j+1]
#sorting the full list
def sortData(data):
#sorting the failures using default sort algorithm
data.sort(key=failureId)
#sorting the single list of failure datas itself
for failure in data:
sortFailure(failure)
data = [['failure2', 'component2', 0.15, 'component1', 0.85], ['failure3', 'component1', 0.95], ['failure1','component1',0.05,'component3', 0.8, 'component2', 0.1, 'component4', 0.05]]
print(data)
sortData(data)
print(data)
The first two functions are required to get the numbers(=id) from the strings as mentioned above. The second function uses "bubble sort" to sort the array. It uses steps 2 for the range function, because we want to skipt the values for each component. If the data are in wrong order we are swapping the key & value. In the sortData function we are using the built in sort function for lists to sort the whole list (by failure ids). Then we take each "sublist" and sort them using the other function.

Why is my dict being overwritten in this loop in python?

I have a dict, coords_dict, in a strange format. Which is currently being used to store a set of Cartesian coordinate points (x,y,z). The structure of the dict (which is unfortunately out of my control) is as follows.
The keys of the dict are a series of z values of a plane, and each entry consists of a single element list, which itself is a list of lists containing the coordinate points. For example, two elements in the dict can be specified as
coords_dict['3.5']=[[[1.62,2.22,3.50],[4.54,5.24,3.50]]]
coords_dict['5.0']=[[[0.33,6.74,5.00],[2.54,12.64,5.00]]]
So, I now want to apply some translational shift to all coordinate points in this dict by some shift vector [-1,-1,-1], i.e. I want all x, y, and z coordinates to be 1 less than they were before (rounded to 2 decimal places). And I want to assign the result of this translation to a new dictionary, coords_dict_translated, while also updating the dict keys to match the z locations of all points
My attempt at a solution is below
import numpy as np
shift_vector=[-1,-1,-1]
coords_dict_translated={}
for key,plane in coords_dict.items(): #iterate over dictionary, k are keys representing each plane
key=str(float(key)+shift_vector[2]) #the new key should match the z location
#print(key)
for point_index in range(0,len(plane[0])): #loop over points in this plane
plane[0][point_index]=list(np.around(np.array(plane[0][point_index])
+np.array(shift_vector),decimals=2)) #add shift vector to all points
coords_dict_translated[key]=plane
However, I notice that if I do this that that the original values of coords_dict are also changing. I want coords_dict to stay the same but return a completely new and entirely separate dict. I am not quite sure where the issue lies, I have tried using for key,plane in list(coords_dict.items()): as well but this did not work. Why does this loop change the values of the original dictionary?

when you are iterating over the dictionary in the for loop you are referencing the elements in your list/array:
for key,plane in coords_dict.items(): #iterate over dictionary, k are keys representing each plane
If you don't want to change the items, you should just make a copy of the variable you are using instead of setting plane directly:
import copy
key=str(float(key)+shift_vector[2]) #the new key should match the z location
#print(key)
c = copy.deepcopy(plane)
for point_index in range(0,len(plane[0])): #loop over points in this plane
c[0][point_index]=list(np.around(np.array(plane[0][point_index])
+np.array(shift_vector),decimals=2)) #add shift vector to all points
coords_dict_translated[key] = c

The most likely issue here is that you have a list that is being referenced from two different variables. This can happen even using .copy() when you have nested structure (as you do here).
If this is the problem, you can probably overcome it by using need to make sure you are making a (deep) copy of lists you want to update independently. copy.deepcopy will iteratively make copies of lists within lists etc. to avoid double references to lower-level lists.
(comment made into answer).

Increment numbers in list from a certain point

I have a list of numbers, e.g. [50,100,150,200,250]. I need to increment (or decrement) each number from a specified index and by a specified amount. I have been able to do this in two ways:
from itertools import islice
l = [50,100,150,200,250]
start_increment_index = 3
l[start_increment_index:] = [e+100 for e in l[start_increment_index:]]
print (l)
l = [50,100,150,200,250]
l[start_increment_index:] = [e+100 for e in islice(l,start_increment_index,len(l))]
print (l)
Both print: [50, 100, 150, 300, 350].
However, my real list contains millions of numbers and this operation is performed repeatedly with different indexes and different increments/decrements. Would there be a faster way of doing this using a Python list? I have been considering writing my own C/C++ extension to deal with this.
Edit: Would this be a useful module for Python in general? Having a function written in C which can take parameters (python_list_object, increment_amount, start_index, end_index)?

Main problem in your solution that you creates(allocating memory + copy) two lists. First it's list comprehension by itself and second l[start_increment_index:] inside it.
If you data source is python list, you can do you operation for O(n):
for i in range(start_increment_index, len(l)):
l[i] += increment
NB: define increment first.

It depends specifically on your goals. I suppose that you can use segment tree for this case. For more information see https://en.m.wikipedia.org/wiki/Segment_tree.
Just for brief description. This structure represents array upon which will be performed range operations (like addition/substraction subarray with number) This structure is optimized for case where you have very big number of such range queries.
Note: if you want to use only python list structure, then you can implement sparse table (it is another view of segment tree with implicit storing of tree in arrays)

Python speeding up the search for a value in a dictionary of ranges

I have a file with a column of values I would like to use to compare with a dictionary that contains two values that together form a range.
for instance:
File A:
Chr1 200 ....
Chr3 300
File B:
Chr1 200 300 ...
Chr2 300 350 ...
For now I created a dictionary of values for File B:
for Line in FileB:
LineB = Line.strip('\n').split('\t')
Ranges[Chr].append(LineB)
For the comparison:
for Line in MethylationFile:
Line = Line.strip("\n")
Info = Line.split("\t")
Chr = Info[0]
Location = int(Info[1])
Annotation = ""
for i, r in enumerate(Ranges[Chr]):
n = i + 1
while (n < len(Ranges[Chr])):
if (int(Ranges[Chr][i][1]) <= Location <= int(Ranges[Chr][i][2])):
Annotation = '\t'.join(Ranges[Chr][i][4:])
n +=1
OutFile.write(Line + '\t' + Annotation + '\n')
If I leave the while loop the program does not seem to run (or is probably running too slow to get results) since I have over 7,000 values in each dictionary. If I change the while loop to an if loop the program runs but at an incredibly slow pace.
I'm looking for a way to make this program faster and more efficient

Dictionaries are great when you want to look up a key by exact match. In particular, the hash of the lookup key has to be the same as the hash of the stored key.
If your ranges are consistent, you could fake this by writing a hash function that returns the same value for a range, and for every value within that range. But if they're not, this hash function would have to keep track of all of the known ranges, which takes you back to the same problem you're starting with.
In that case, the right data structure here is probably some kind of sorted collection. If you only need to build up the collection, and then use it many times without ever modifying it, just sorting a list and using the bisect module will do it for you. If you need to modify the collection after creation, you'll want something built around a binary tree or B-tree variant of some kind, like blist or bintrees.
This will reduce the time to find a range from N/2 to log2(N). So, if you've got 10000 ranges, instead of 5000 comparisons, you'll do 14.
While we're at it, it would help to convert the range start and stop values to ints once, instead of doing it each time. Also, if you want to use the stdlib bisect, you unfortunately can't pass a key to most functions, so let's reorganize the ranges into comparable order too. So:
for Line in FileB:
LineB = Line.strip('\n').split('\t')
Ranges[Chr].append(int(LineB[1]), int(LineB[2]), [LineB[0])
for r in Ranges:
r.sort()
Now, instead of this loop:
for i, r in enumerate(Ranges[Chr]):
# ...
Do this:
i = bisect.bisect(Ranges[Chr], (Location, Location, None))
if i:
r = Ranges[Chr][i-1]
if r[0] <= Location < r[1]:
# do whatever you wanted with r
else:
# there is no range that includes Location
else:
# Location is before all ranges
You have to be careful thinking about bisect, and it's possible I've got this wrong on the first attempt, so… read the docs on what it does, and experiment with your data (printing out the results of the bisect function), before trusting this.
If your ranges can overlap, and you want to be able to find all ranges that contain a value rather than just one, you'll need a bit more than this to keep things efficient. There's no way to fully-order overlapping ranges, so bisect won't cut it.
If you're expecting more than log N matches per average lookup, you can do it with two sorted lists and bisect.
But otherwise, you need a more complex data structure, and more complex code. For example, if you can spare N^2 space, you can keep the time at log N by having, for each range in the first list, a second list, sorted by end, of all the values with a matching start.
And at this point, I think it's getting complex enough that you want to look for a library to do it for you.
However, you might want to consider a different solution.
If you use numpy or a database instead of pure Python, this can't cut the algorithmic complexity from N to log N… but it can cut the constant overhead by a factor of 10 or so, which may be good enough. In fact, if you're doing tons of searches on a medium-small list, it may even be better.
Plus, it looks a lot simpler, and once you get used to array operations or SQL, it may even be more readable. So:
RangeArrays = [np.array(a[:2] for a in value) for value in Ranges]
… or, if Ranges is a dict mapping strings to values, instead of a list:
RangeArrays = {key: np.array(a[:2] for a in value) for key, value in Ranges.items()}
Then, instead of this:
for i, r in enumerate(Ranges[Chr]):
# ...
Do:
comparisons = Location < RangeArrays[Chr]
matches = comparisons[:,0] < comparisons[:,1]
indices = matches.nonzero()[0]
for index in indices:
r = Ranges[indices[0]]
# Do stuff with r
(You can of course make things more concise, but it's worth doing it this way and printing out all of the intermediate steps to see why it works.)
Or, using a database:
cur = db.execute('''SELECT Start, Stop, Chr FROM Ranges
WHERE Start <= ? AND Stop > ?''', (Location, Location))
for (Start, Stop, Chr) in cur:
# do stuff

create an array from a txt file

I'm new in python and I have a problem.
I have some measured data saved in a txt file.
the data is separated with tabs, it has this structure:
0 0 -11.007001 -14.222319 2.336769
i have always 32 datapoints per simulation (0,1,2,...,31) and i have 300 simulations (0,1,2...,299), so the data is sorted at first with the number of simulation and then the number of the data point.
The first column is the simulation number, the second column is the data point number and the other 3 columns are the x,y,z coordinates.
I would like to create a 3d array, the first dimension should be the simulation number, the second the number of the datapoint and the third the three coordinates.
I already started a bit and here is what I have so far:
## read file
coords = [x.split('\t') for x in
open(f,'r').read().replace('\r','')[:-1].split('\n')]
## extract the information you want
simnum = [int(x[0]) for x in coords]
npts = [int(x[1]) for x in coords]
xyz = array([map(float,x[2:]) for x in coords])
but I don't know how to combine these 2 lists and this one array.
in the end i would like to have something like this:
array = [simnum][num_dat_point][xyz]
thanks for your help.
I hope you understand my problem, it's my first posting in a python forum, so if I did anything wrong, I'm sorry about this.
thanks again

you can combine them with zip function, like so:
for sim, datapoint, x, y, z in zip(simnum, npts, *xyz):
# do your thing
or you could avoid list comprehensions altogether and just iterate over the lines of the file:
for line in open(fname):
lst = line.split('\t')
sim, datapoint = int(lst[0]), int(lst[1])
x, y, z = [float(i) for i in lst[2:]]
# do your thing
to parse a single line you could (and should) do the following:
coords = [x.split('\t') for x in open(fname)]

This seems like a good opportunity to use itertools.groupby.
import itertools
import csv
file = open("data.txt")
reader = csv.reader(file, delimiter='\t')
result = []
for simnumberStr, rows in itertools.groupby(reader, key=lambda t: t[0]):
simData = []
for row in rows:
simData.append([float(v) for v in row[2:]])
result.append(simData)
file.close()
This will create a 3 dimensional list named 'result'. The first index is the simulation number, and the second index is the data index within that simulation. The value is a list of integers containing the x, y, and z coordinate.
Note that this assumes the data is already sorted on simulation number and data number.

According to the zen of python, flat is better than nested. I'd just use a dict.
import csv
f = csv.reader(open('thefile.csv'), delimiter='\t',
quoting=csv.QUOTE_NONNUMERIC)
result = {}
for simn, dpoint, c1, c2, c3 in f:
result[simn, dpoint] = c1, c2, c3
# pretty-prints the result:
from pprint import pprint
pprint(result)

You could be using many different kinds of containers for your purposes, but none of them has array as an unqualified name -- Python has a module array which you can import from the standard library, but the array.array type is too limited for your purposes (1-D only and with elementary types as contents); there's a popular third-party extension known as numpy, which does have a powerful numpy.array type, which you could use if you has downloaded and installed the extension -- but as you never even once mention numpy I doubt that's what you mean; the relevant builtin types are list and dict. I'll assume you want any container whatsoever -- but if you could learn to use precise terminology in the future, that will substantially help you AND anybody who's trying to help you (say list when you mean list, array only when you DO mean array, "container" when you're uncertain about what container to use, and so forth).
I suggest you look at the csv module in the standard library for a more robust way to reading your data, but that's a separate issue. Let's start from when you have the coords list of lists of 5 strings each, each sublist with strings representing two ints followed by three floats. Two more key aspects need to be specified...
One key aspect you don't tell us about: is the list sorted in some significant way? is there, in particular, some significant order you want to keep? As you don't even mention either issue, I will have to assume one way or another, and I'll assume that there isn't any guaranteed nor meaningful order; but, no repetition (each pair of simulation/datapoint numbers is not allowed to occur more than once).
Second key aspect: are there the same number of datapoints per simulation, in increasing order (0, 1, 2, ...), or is that not necessarily the case (and btw, are the simulation themselves numbered 0, 1, 2, ...)? Again, no clue from you on this indispensable part of the specs -- note how many assumptions you're forcing would-be helpers to make by just not telling us about such obviously crucial aspects. Don't let people who want to help you stumble in the dark: rather, learn to ask questions the smart way -- this will save untold amounts of time to yourself AND would-be helpers, and give you higher-quality and more relevant help, so, why not do it? Anyway, forced to make yet another assumption, I'll have to assume nothing at all is known about the simulation numbers nor about the numers of datapoints in each simulation.
With these assumptions dict emerges as the only sensible structure to use for the outer container: a dictionary whose key is a tuple with two items, simulation number then datapoint number within the simulation. The values may as well be tuple, too (with three floats each), since it does appear that you have exactly 3 coordinates per line.
With all of these assumptions...:
def make_container(coords):
result = dict()
for s, d, x, y, z in coords:
key = int(s), int(d)
value = float(x), float(y), float(z)
result[key] = value
return result
It's always best, and fastest, to have all significant code within def statements (i.e. as functions to be called, possibly with appropriate arguments), so I'm presenting it this way. make_container returns a dictionary which you can address with the simulation number and datapoint number; for example,
d = make_container(coords)
print d[0, 0]
will print the x, y, z for dp 0 of sim 0, assuming one exists (you would get an error if such a sim/dp combination did not exist). dicts have many useful methods, e.g. changing the print statement above to
print d.get((0, 0))
(yes, you do need double parentheses here -- inner ones to make a tuple, outer ones to call get with that tuple as its single argument), you'd see None, rather than get an exception, if there was no such sim/dp combinarion as (0, 0).
If you can edit your question to make your specs more precise (perhaps including some indication of ways you plan to use the resulting container, as well as the various key aspects I've listed above), I might well be able to fine-tune this advice to match your need and circumstances much better (and so might ever other responder, regarding their own advice!), so I strongly recommend you do so -- thanks in advance for helping us help you!-)

essentially the difficulty is what happens if different simulations have different numbers of points.
You will therefore need to dimension an array to the appropriate sizes first.
t should be an array of at least max(simnum) x max(npts) x 3.
To eliminate confusion you should initialise with not-a-number,
this will allow you to see missing points.
then use something like
for x in coords:
t[int(x[0])][int(x[1])][0]=float(x[3])
t[int(x[0])][int(x[1])][1]=float(x[4])
t[int(x[0])][int(x[1])][2]=float(x[5])
is this what you meant?

First I'd point out that your first data point appears to be an index, and wonder if the data is therefore important or not, but whichever :-)
def parse(line):
mch = re.compile('^(\d+)\s+(\d+)\s+([-\d\.]+)\s+([-\d\.]+)\s+([-\d\.]+)$')
m = mch.match(line)
if m:
l = m.groups()
(idx,data,xyz) = (int(l[0]),int(l[1]), map(float, l[2:]))
return (idx, data, xyz)
return None
finaldata = []
file = open("data.txt",'r')
for line in file:
r = parse(line)
if r is not None:
finaldata.append(r)
Final data should have output along the lines of:
[(0, 0, [-11.007001000000001, -14.222319000000001, 2.3367689999999999]),
(1, 0, [-11.007001000000001, -14.222319000000001, 2.3367689999999999]),
(2, 0, [-11.007001000000001, -14.222319000000001, 2.3367689999999999]),
(3, 0, [-11.007001000000001, -14.222319000000001, 2.3367689999999999]),
(4, 0, [-11.007001000000001, -14.222319000000001, 2.3367689999999999])]
This should be pretty robust about dealing w/ the whitespace issues (tabs spaces whatnot)...
I also wonder how big your data files are, mine are usually large so being able to process them in chunks or groups become more important... Anyway this will work in python 2.6.

Are you sure a 3d array is what you want? It seems more likely that you want a 2d array, where the simulation number is one dimension, the data point is the second, and then the value stored at that location is the coordinates.
This code will give you that.
data = []
for coord in coords:
if coord[0] not in data:
data[coord[0]] = []
data[coord[0]][coord[1]] = (coord[2], coord[3], coord[4])
To get the coordinates at simulation 7, data point 13, just do data[7][13]

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.