create an array from a txt file - python

I'm new to Python and I have a problem.
I have some measured data saved in a txt file.
The data is tab-separated and has this structure:
0 0 -11.007001 -14.222319 2.336769
There are always 32 data points per simulation (numbered 0, 1, 2, ..., 31) and 300 simulations (numbered 0, 1, 2, ..., 299), so the data is sorted first by simulation number and then by data point number.
The first column is the simulation number, the second column is the data point number, and the other 3 columns are the x, y, z coordinates.
I would like to create a 3D array: the first dimension should be the simulation number, the second the number of the data point, and the third the three coordinates.
I've already made a start; here is what I have so far:
## read file
from numpy import array   # array() below comes from numpy

coords = [line.split('\t') for line in
          open(f, 'r').read().splitlines()]

## extract the information you want
simnum = [int(x[0]) for x in coords]
npts = [int(x[1]) for x in coords]
xyz = array([[float(v) for v in x[2:]] for x in coords])
but I don't know how to combine these two lists and this one array.
In the end I would like to have something like this:
array = [simnum][num_dat_point][xyz]
Thanks for your help.
I hope you understand my problem; it's my first post in a Python forum, so if I did anything wrong, I'm sorry about that.
Thanks again.

You can combine them with the zip function, like so:
for sim, datapoint, (x, y, z) in zip(simnum, npts, xyz):
    # do your thing
or you could avoid list comprehensions altogether and just iterate over the lines of the file:
for line in open(fname):
    lst = line.split('\t')
    sim, datapoint = int(lst[0]), int(lst[1])
    x, y, z = [float(i) for i in lst[2:]]
    # do your thing
To split each line into fields in the first place, you could (and should) simply do the following (the rstrip keeps the trailing newline out of the last field):
coords = [line.rstrip('\n').split('\t') for line in open(fname)]
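Given the fixed counts stated in the question (300 simulations of 32 points each), here is a minimal numpy sketch of the whole job, assuming numpy is installed and fname (an illustrative name) is the path to the file:

import numpy as np

# loadtxt splits on any whitespace by default, which covers tabs too.
data = np.loadtxt(fname)                # shape (9600, 5)
xyz = data[:, 2:].reshape(300, 32, 3)   # (simulation, datapoint, coordinate)

print(xyz[0, 0])                        # x, y, z of data point 0 in simulation 0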

This seems like a good opportunity to use itertools.groupby.
import itertools
import csv

file = open("data.txt")
reader = csv.reader(file, delimiter='\t')
result = []
for simnumberStr, rows in itertools.groupby(reader, key=lambda t: t[0]):
    simData = []
    for row in rows:
        simData.append([float(v) for v in row[2:]])
    result.append(simData)
file.close()
This will create a three-dimensional nested list named 'result'. The first index is the simulation number, and the second index is the data index within that simulation. The value is a list of floats containing the x, y, and z coordinates.
Note that this assumes the data is already sorted on simulation number and data number.
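A quick usage sketch for the resulting structure; the numpy conversion is optional and assumes every simulation really has the same number of points, as the question states:

import numpy as np

print(result[7][13])      # [x, y, z] of data point 13 within simulation 7
arr = np.array(result)    # shape (300, 32, 3) for the data in the question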

According to the zen of python, flat is better than nested. I'd just use a dict.
import csv

f = csv.reader(open('thefile.csv'), delimiter='\t',
               quoting=csv.QUOTE_NONNUMERIC)
result = {}
for simn, dpoint, c1, c2, c3 in f:
    result[simn, dpoint] = c1, c2, c3

# pretty-prints the result:
from pprint import pprint
pprint(result)
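One detail worth knowing: QUOTE_NONNUMERIC makes the csv reader convert every unquoted field to float, so the keys of result are pairs of floats, not ints. A small lookup sketch under that assumption:

# Keys are (float, float) tuples because of QUOTE_NONNUMERIC.
print(result[0.0, 0.0])            # (x, y, z) for data point 0 of simulation 0
print(result.get((3.0, 31.0)))     # None instead of a KeyError if missing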

You could be using many different kinds of containers for your purposes, but none of them has array as an unqualified name. Python has a module array in the standard library, but the array.array type is too limited for your purposes (1-D only, and with elementary types as contents). There's a popular third-party extension known as numpy, which does have a powerful numpy.array type that you could use if you have downloaded and installed the extension -- but as you never even once mention numpy, I doubt that's what you mean; the relevant built-in types are list and dict. I'll assume you want any container whatsoever -- but if you could learn to use precise terminology in the future, that will substantially help you AND anybody who's trying to help you (say list when you mean list, array only when you DO mean array, "container" when you're uncertain about what container to use, and so forth).
I suggest you look at the csv module in the standard library for a more robust way of reading your data, but that's a separate issue. Let's start from when you have the coords list of lists of 5 strings each, each sublist with strings representing two ints followed by three floats. Two more key aspects need to be specified...
One key aspect you don't tell us about: is the list sorted in some significant way? Is there, in particular, some significant order you want to keep? As you don't even mention either issue, I will have to assume one way or another, and I'll assume that there isn't any guaranteed or meaningful order; but, no repetition (each pair of simulation/datapoint numbers is not allowed to occur more than once).
Second key aspect: are there the same number of datapoints per simulation, in increasing order (0, 1, 2, ...), or is that not necessarily the case (and by the way, are the simulations themselves numbered 0, 1, 2, ...)? Again, no clue from you on this indispensable part of the specs -- note how many assumptions you're forcing would-be helpers to make by just not telling us about such obviously crucial aspects. Don't let people who want to help you stumble in the dark: rather, learn to ask questions the smart way -- this will save untold amounts of time for yourself AND would-be helpers, and give you higher-quality and more relevant help, so why not do it? Anyway, forced to make yet another assumption, I'll have to assume nothing at all is known about the simulation numbers nor about the numbers of datapoints in each simulation.
With these assumptions dict emerges as the only sensible structure to use for the outer container: a dictionary whose key is a tuple with two items, simulation number then datapoint number within the simulation. The values may as well be tuple, too (with three floats each), since it does appear that you have exactly 3 coordinates per line.
With all of these assumptions...:
def make_container(coords):
    result = dict()
    for s, d, x, y, z in coords:
        key = int(s), int(d)
        value = float(x), float(y), float(z)
        result[key] = value
    return result
It's always best, and fastest, to have all significant code within def statements (i.e. as functions to be called, possibly with appropriate arguments), so I'm presenting it this way. make_container returns a dictionary which you can address with the simulation number and datapoint number; for example,
d = make_container(coords)
print d[0, 0]
will print the x, y, z for dp 0 of sim 0, assuming one exists (you would get an error if such a sim/dp combination did not exist). dicts have many useful methods, e.g. changing the print statement above to
print d.get((0, 0))
(yes, you do need double parentheses here -- inner ones to make a tuple, outer ones to call get with that tuple as its single argument), you'd see None, rather than get an exception, if there was no such sim/dp combination as (0, 0).
If you can edit your question to make your specs more precise (perhaps including some indication of ways you plan to use the resulting container, as well as the various key aspects I've listed above), I might well be able to fine-tune this advice to match your need and circumstances much better (and so might ever other responder, regarding their own advice!), so I strongly recommend you do so -- thanks in advance for helping us help you!-)

Essentially the difficulty is what happens if different simulations have different numbers of points.
You will therefore need to dimension an array to the appropriate size first.
t should be an array of at least (max(simnum) + 1) x (max(npts) + 1) x 3, since the numbering starts at 0.
To eliminate confusion you should initialise it with not-a-number;
this will allow you to spot missing points.
then use something like
for x in coords:
    t[int(x[0])][int(x[1])][0] = float(x[2])
    t[int(x[0])][int(x[1])][1] = float(x[3])
    t[int(x[0])][int(x[1])][2] = float(x[4])
is this what you meant?
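A concrete sketch of that idea, assuming numpy is available and reusing coords, simnum and npts from the question:

import numpy as np

# Dimension the array first, initialised to NaN so missing points stand out.
t = np.full((max(simnum) + 1, max(npts) + 1, 3), np.nan)

for x in coords:
    t[int(x[0]), int(x[1]), :] = [float(v) for v in x[2:]]

print(np.isnan(t).any())   # True if any sim/point combination never appeared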

First I'd point out that your first data point appears to be an index, and wonder if the data is therefore important or not, but whichever :-)
import re

LINE_RE = re.compile(r'^(\d+)\s+(\d+)\s+([-\d.]+)\s+([-\d.]+)\s+([-\d.]+)$')

def parse(line):
    m = LINE_RE.match(line)
    if m:
        l = m.groups()
        idx, data, xyz = int(l[0]), int(l[1]), [float(v) for v in l[2:]]
        return (idx, data, xyz)
    return None

finaldata = []
with open("data.txt") as fh:
    for line in fh:
        r = parse(line)
        if r is not None:
            finaldata.append(r)
finaldata should end up along the lines of:
[(0, 0, [-11.007001000000001, -14.222319000000001, 2.3367689999999999]),
(1, 0, [-11.007001000000001, -14.222319000000001, 2.3367689999999999]),
(2, 0, [-11.007001000000001, -14.222319000000001, 2.3367689999999999]),
(3, 0, [-11.007001000000001, -14.222319000000001, 2.3367689999999999]),
(4, 0, [-11.007001000000001, -14.222319000000001, 2.3367689999999999])]
This should be pretty robust about dealing w/ the whitespace issues (tabs spaces whatnot)...
I also wonder how big your data files are, mine are usually large so being able to process them in chunks or groups become more important... Anyway this will work in python 2.6.
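On that last point, here's a hedged sketch of chunked processing with itertools.islice; the file name and chunk size are illustrative, and parse is the function above:

from itertools import islice

def chunks(fh, size=1000):
    """Yield lists of up to `size` raw lines at a time."""
    while True:
        block = list(islice(fh, size))
        if not block:
            break
        yield block

with open("data.txt") as fh:
    for block in chunks(fh):
        parsed = [r for r in map(parse, block) if r is not None]
        # process `parsed` here; the whole file is never in memory at once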

Are you sure a 3d array is what you want? It seems more likely that you want a 2d array, where the simulation number is one dimension, the data point is the second, and then the value stored at that location is the coordinates.
This code will give you that.
data = {}
for coord in coords:
    sim, point = int(coord[0]), int(coord[1])
    if sim not in data:
        data[sim] = {}
    data[sim][point] = (float(coord[2]), float(coord[3]), float(coord[4]))
To get the coordinates at simulation 7, data point 13, just do data[7][13]


Memory problems for multiple large arrays

I'm trying to do some calculations on over 1000 (100, 100, 1000) arrays. But as I might have expected, it takes no more than about 150-200 arrays before my memory is used up and it all fails (at least with my current code).
This is what I currently have now:
import numpy as np
toxicity_data_path = open("data/toxicity.txt", "r")
toxicity_data = np.array(toxicity_data_path.read().split("\n"), dtype=int)
patients = range(1, 1000, 1)
The above is just a list of 1's and 0's (indicating toxicity or not) for each array (in this case one array is data for one patient). So in this case roughly 1000 patients.
I then create two lists from the above data, so I have one list of the patients with toxicity and one of the patients without.
patients_no_tox = [i for i, e in enumerate(toxicity_data.astype(np.str)) if e in set("0")]
patients_with_tox = [i for i, e in enumerate(toxicity_data.astype(np.str)) if e in set("1")]
I then write this function, which takes an already saved-to-disk (100, 100, 1000) array for each patient, and then removes some indexes (also loaded from a saved file) that will not work later on or just need to be removed, so it is essential to do so. The result is a final list of all patients and their (now flattened) arrays of data. This is where things start to eat memory, when the function is used in the list comprehension.
def log_likely_list(patient, remove_index_list):
    array_data = np.load("data/{}/array.npy".format(patient)).ravel()
    return np.delete(array_data, remove_index_list)

remove_index_list = np.load("data/remove_index_list.npy")
final_list = [log_likely_list(patient, remove_index_list) for patient in patients]
The next step is to create the two lists I need for my calculations: I take the final list with all the patients and remove, respectively, either the patients with toxicity or those without.
patients_no_tox_list = np.column_stack(np.delete(final_list, patients_with_tox, 0))
patients_with_tox_list = np.column_stack(np.delete(final_list, patients_no_tox, 0))
The last piece of the puzzle is to use these two lists in the following equation, where the non-tox list goes into the right-hand term and the with-tox list into the left-hand term. For each individual index (the same index in each patient's array) it sums over all ~1000 patients, and I end up with one large array of values.
log_likely = (np.sum(np.log(patients_with_tox_list), axis=1) +
              np.sum(np.log(1 - patients_no_tox_list), axis=1))
My problem, as stated, is that when I get to around 150-200 patients my memory is used up and it shuts down.
I have obviously tried to save stuff on the disk to load later (that's why I load so many files), but that didn't help me much. I'm thinking maybe I could go one array at a time into the log_likely function, but in the end, before summing, I would probably have just as large an array anyway; plus, the computation might be a lot slower if I can't use numpy's sum feature and such.
So is there any way I could optimize/improve on this, or is the only way to buy a hell of a lot more RAM?
Each time you use a list comprehension, you create a new copy of the data in memory. So this line:
final_list = [log_likely_list(patient, remove_index_list) for patient in patients]
contains the complete data for all 1000 patients!
The better choice is to utilize generator expressions, which process items one at a time. To form a generator, surround your for ... in ... expression with parentheses instead of brackets. This might look something like:
import itertools
import functools

with_tox_data = (log_likely_list(patient, remove_index_list) for patient in patients_with_tox)
with_tox_log = (np.log(data) for data in with_tox_data)

no_tox_data = (log_likely_list(patient, remove_index_list) for patient in patients_no_tox)
no_tox_log = (np.log(1 - data) for data in no_tox_data)

final_data = itertools.chain(with_tox_log, no_tox_log)
Note that no computations have actually been performed yet: generators don't do anything until you iterate over them. A simple, memory-friendly way to aggregate all the results is reduce, which consumes the chain one array at a time:
log_likely = functools.reduce(np.add, final_data)
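A minimal self-contained demo of the generator-plus-reduce pattern, with dummy arrays standing in for the per-patient data:

import functools
import numpy as np

# Three fake "patients", each a 1-D array, produced lazily one at a time.
arrays = (np.full(5, v) for v in (1.0, 2.0, 3.0))
logs = (np.log(a) for a in arrays)

# reduce consumes the generator pairwise, so only about two
# patient-sized arrays are alive at any moment.
total = functools.reduce(np.add, logs)
print(total)   # elementwise log(1) + log(2) + log(3)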

Manipulating a list of numbers into columns or separate lists for plotting in Python

I'm pretty new to Python, so I'm sorry if this one is quite easy. (it seems easy to me, but I'm struggling...)
I have a list of numbers kicked back from a Keithley SMU 2400 from an IV sweep I've done, and the resulting list is ordered like the following:
[v0, c0, t0, v1, c1, t1, v2, c2, t2, ...]
The list really is a list of numbers (not a list of ASCII strings, thanks to PyVisa's query_ascii_values command).
How can I parse these into either columns of numbers (for output to CSV or similar) or three separate lists for plotting in matplotlib?
The output I'd love would be similar to this in the end:
volts = [v0, v1, v2, v3...]
currents = [c0, c1, c2, c3...]
times = [t0, t1, t2, t3...]
that should enable easier plotting in matplotlib (or outputting into a csv text file).
Please note that my v0, v1 etc., are just my names for them, they are numbers currently.
I would have attempted this in Matlab similar to this:
volts = mydata(1:3:length(mydata));
(calling the index by counting every third item from 1)
Thanks for your thoughts and help! Also- are there any good resources for simple data munging like this that I should get a copy of?
Simple slicing works. Thus, with A as the input list, we would have:
volts, currents, times = A[::3], A[1::3], A[2::3]
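A quick runnable check of the slicing with made-up values (ordering as in the question):

A = [0.1, 1e-6, 0.0,    # v0, c0, t0
     0.2, 2e-6, 0.5,    # v1, c1, t1
     0.3, 3e-6, 1.0]    # v2, c2, t2

volts, currents, times = A[::3], A[1::3], A[2::3]
print(volts)    # [0.1, 0.2, 0.3]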
If you want to keep it a list, then Divakar's solution works well. However, if you will be doing analysis and plotting later, you really want to be using a numpy array, and this makes what you want to do easier still.
To get it into a numpy array, you can do:
>>> import numpy as np
>>>
>>> mydata = np.array(mydata)
To get it into individual variables, you can just do:
>>> volts, currents, times = mydata.reshape(-1,3).T
This reshapes into a Nx3 array, then transposes it to a 3xN array, then puts each row in a separate variable. One advantage of this is it is very fast, since only one array is ever created, unlike the list approach where 4 lists need to be created (especially if you put the data directly into a numpy array).
You can also use the identical approach Divakar used with lists, again avoiding creating additional lists.
That being said, I would strongly suggest you look into pandas. It will make keeping track of data like this much easier. You can label the columns of an array with informative names, meaning you don't need to split data like this into three variables to keep track of it.
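A hedged sketch of that pandas suggestion; the column names and output file name are illustrative:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.asarray(mydata).reshape(-1, 3),
                  columns=['volts', 'currents', 'times'])

df.to_csv('sweep.csv', index=False)    # CSV output in one call
ax = df.plot(x='volts', y='currents')  # quick IV plot through matplotlib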

Using a dictionary to index parallel arrays?

I have 4 parallel arrays based on a table representing attributes of a map. Each array has approx. 500 values, but all have the same number of values.
The arrays are:
start = location of the endpoint with the smaller flow accumulation,
end = location of the other endpoint (with the larger flow accumulation),
length = segment length, and;
shape = actual shape, oriented to run from start to end.
I am attempting to create a data structure on which I can use a recursive function to determine the start and end points every 2000 m along the length.
The following question and answer describe what I am attempting to accomplish:
https://gis.stackexchange.com/questions/87649/select-points-approx-2000-metres-from-another-point-along-a-river
How do I store these 4 parallel arrays in a dictionary keyed by start?
I am new to writing functions, dictionaries and using arrays in dictionaries. I am attempting to do this task in Python.
I think this is what you mean:
d = {}
for i in range(len(start)):
    d[start[i]] = (shape[i], length[i], end[i])
so now d[some_start_value] will hold the corresponding shape length and end values.
If you want to do things a little bit more Python-esque, you can use enumerate:
d = {}
for (i, st) in enumerate(start):
    d[st] = (shape[i], length[i], end[i])
or even better - zip:
d = {}
for (st, sh, le, en) in zip(start, shape, length, end):
    d[st] = (sh, le, en)
Note that you can leave out the parentheses around the first part of the for loops (i.e. between the for and in keywords). I used them solely for enhanced code readability.
As with WeaselFox's answer, d[some_start_value] will now hold the corresponding shape, length and end values.
In addition to the above answers, I would recommend using namedtuple to simplify accesses:
from collections import namedtuple
# This creates a namedtuple called GISData. Name of the object and name in the first argument
# should be the same.
GISData = namedtuple('GISData', 'start shape length end')
# zip creates 1 list of 4-tuples from 4 single lists
# There are other ways to write this; this is just the shortest for me.
# Note that if you need this ordered, you should use an OrderedDict,
# which is in the collections module in python 2.7+, or you can find
# backported versions for python 2.6+. In those, the keys preserve ordering,
# so can still be searched as a list, which is useful if you need to find e.g.
# 479, which is not in the dictionary, but 400 and 500 are and you have to interpolate etc.
GISDict = dict((x[0], GISData(*x)) for x in zip(start, shape, length, end))
# The dictionary for any given start value
# Access the 4 individual pieces by name, or by index
GISDict[start_lookup].shape
etc.

python insert into list at constant x position

I know by the title this may sound easy. It's not for this task.
Imagine the following scenario: you have a connection running and a list called example. You get some data with a constant number x that always starts at 1 and increases from then on until the connection is closed. You need some data surrounding this number to be stored in a list at exactly that number's position, so example[x-1]. OK, so this solves the basic problem.
The problem this doesn't solve: say the connection gives you a command to delete some of the data previously stored because it's no longer needed. Let's say at this point you have 10 items in the list and you need to delete the items at positions 3, 5, and 6. So now example holds 7 items. x is now 11; you insert some data, and example's length is 8. At this point, example[x-1] != 11. So now we have fragmentation.
The problem is this: the connection will (though not in a set order) give you some other data. This data will also carry the same number as x, but we'll call it y. The two pieces of data need to go together (for this example, let's just say they're ints that need to be added, though really we're filling in missing fields of a class), but y arrives later in the data sequence, and by then you no longer have x-1 as a reliable index to join the two parts. Because you can't set example[x] and have it stay EXACTLY at position x in example, there's no way at run time to match up positions x and y.
My question is: is there some way, as in C++, to do example[x] = data and have it stay at position x no matter what changes around it in the list, such as items being removed? If this isn't possible, I'll put effort into calculating a formula for positions in the example list so that they always match.
example = {}
example[1] = "whatever"
example[99] = "whatever"
example[-12] = "something else"
example['cow'] = 'pie'
#delete
example.pop(99)
is that what you are looking for?
or
example = [0 for _ in range(MAX_ITEMS)]
#delete
example[x-1] = 0
#add
example[x-1] = data
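A quick illustration of why the dict version survives deletions while list positions shift (the x values here are made up):

example = {}
for x in (1, 2, 3, 4, 5):
    example[x] = "data for %d" % x

example.pop(3)           # delete; nothing else moves
print(example[5])        # still found under key 5: 'data for 5'
print(example.get(3))    # None -- the slot is simply gone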

Plotting occurrences for values higher than a threshold in Python

I have a non-uniform array 'A'.
A = [1,3,2,4,..., 12002, 13242, ...]
I want to explore how many elements from the array 'A' have values above certain threshold values.
For example, there are 1000 elements that have values larger than 1200, so I want to plot the number of elements that have values larger than 1200. Also, there are 1500 elements that have values larger than 110 (this count includes the 1000 elements whose values are larger than 1200).
This is a rather large data set, so I would not like to omit any kind of information.
Then, I want to plot the number of elements N above a value A vs. log(A), i.e. log N(>A) vs. log(A).
I thought of binning the data, but I was rather unsuccessful.
I haven't done that much statistics in python, so I was wondering if there is a good way to plot this data?
Thanks in advance.
Let me take another crack at what we have:
A = [1, 3, 2, 4, ..., 12002, 13242, ...]
# num_above starts as a list of 12,000 zeros.
num_above = [0] * 12000

# Notice how we can re-write this for-loop!
for i in A:
    num_above = [val + 1 if key < i else val for key, val in enumerate(num_above)]
I believe this is what you want. The final list num_above will be such that num_above[5] equals the number of elements in A that are above 5.
Explanation:
That last line is where all the magic happens. It goes through the elements i of A and adds one to every element of num_above whose index is less than i.
The enumerate(...) call generates an iterator of (index, value) tuples over a list; applied to A, for instance, it would yield (0,1) -> (1,3) -> (2,2) -> (3,4) -> ...
Also, the [expression for item in iterable] form is known as a list comprehension, and is a really powerful tool in Python.
Improvements: I see you already modified your question to include these changes, but I think they were important.
I removed the numpy dependency. When possible, removing dependencies reduces the complexity of projects, especially larger projects.
I also removed the original list A. This could be replaced with something that was basically like A = range(12000).
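For the plot the question actually asked for, log N(>A) vs. log A, here is a minimal sketch assuming numpy and matplotlib are available; sorting replaces binning, and ties are ignored for simplicity:

import numpy as np
import matplotlib.pyplot as plt

a = np.sort(np.asarray(A))
n_above = len(a) - np.arange(1, len(a) + 1)   # count of elements > a[i]

plt.loglog(a, n_above)   # the final zero count is masked out on a log axis
plt.xlabel('A')
plt.ylabel('N(> A)')
plt.show()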
