Plotting occurrences for values higher than a threshold in Python - python

I have a non-uniform array 'A'.
A = [1,3,2,4,..., 12002, 13242, ...]
I want to explore how many elements from the array 'A' have values above certain threshold values.
For example, there are 1000 elements that have values larger than 1200, so I want to plot the number of elements that have values larger than 1200. Also, there are other 1500 elements that have values larger than 110 (this includes the 1000 elements, whose values are larger than 1200).
This is a rather large data set, so I would not like to omit any kind of information.
Then, I want to plot the number of elements 'N' above a value A vs. Log (A), i.e.
**'Log N(> A)' vs. 'Log (A)'**.
I thought of binning the data, but I was rather unsuccessful.
I haven't done that much statistics in python, so I was wondering if there is a good way to plot this data?
Thanks in advance.

Let me take another crack at what we have:
A = [1, 3, 2, 4, ..., 12002, 13242, ...]
# Start with a list of 12,000 zeros; num_above[k] will count the elements of A above k.
num_above = [0] * 12000
# For every element i of A, add one to each entry whose index is below i.
for i in A:
    num_above = [val + 1 if key < i else val for key, val in enumerate(num_above)]
I believe this is what you want. The final list num_above is such that num_above[5] equals the number of elements in A that are above 5.
Explanation:
That last line is where all the magic happens. It goes through the elements of A (i) and adds one to every element of num_above whose index is less than i.
enumerate(num_above) returns an iterator of (index, value) tuples for the elements of num_above; applied to A, for example, it would yield (0,1) -> (1,3) -> (2,2) -> (3,4) -> ...
Also, the num_above = [x for y in List] statement is known as a list comprehension, and is a really powerful tool in Python.
Improvements: I see you already modified your question to include these changes, but I think they were important.
I removed the numpy dependency. When possible, removing dependencies reduces the complexity of projects, especially larger projects.
I also removed the original list A. This could be replaced with something that was basically like A = range(12000).
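For a large data set, rebuilding the whole list once per element can get slow. A minimal sketch of an alternative, assuming it is acceptable to sort a copy of the data first: the number of elements above a threshold is then the list length minus a bisect position. The name count_above and the sample thresholds are made up for illustration.

```python
from bisect import bisect_right

def count_above(values, thresholds):
    """For each threshold t, count the entries of values strictly greater than t."""
    ordered = sorted(values)
    total = len(ordered)
    # bisect_right gives the number of entries <= t, so the remainder are > t.
    return [total - bisect_right(ordered, t) for t in thresholds]

# Small made-up sample in the spirit of the question's array A.
A = [1, 3, 2, 4, 12002, 13242]
thresholds = [0, 2, 4, 1200]
counts = count_above(A, thresholds)
print(counts)
```

The pairs (thresholds, counts) can then be handed to a plotting library with both axes on a logarithmic scale, which requires positive thresholds and nonzero counts.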

Related

Sort unknown length array within unknown length 2D array - Python

I have a Python script which ends up creating a 2D array based on user input. Therefore, the length of the 2D array is unknown and the length of the individual arrays within the 2D array are also unknown until the user has input the information. I would like to sort the individual array pieces based on a value associated with them. An example of a possible output that needs to be sorted is below:
Basically, each individual array is a failure symptom followed by a list of possible components, each having a "score" associated with it that is the likelihood that this component is causing the failure. My goal is to reorder the array with the components along with their scores in descending order based on the score, i.e., the component and score need to be moved together. The problem, as I said, is that I do not know the length of anything until user input is given. There could be only 1 failure symptom input, or there could be 9. The failure symptom could contain only 1 component, or maybe 12. I know it will take nested for loops and if statements, but I haven't been able to figure it out based on all the possible scenarios. Some possible scenarios I have thought of:
The array is already in order (move to the next failure symptom)
The first component is correct, but the ones after may not be. Or the first two are correct, but the ones after may not be, etc...
The array is completely backwards in order
The array only contains 1 component, therefore there is no need to sort
The array is in some random order, so some positions for some components may already be in the correct spot while some others aren't
Every time I feel like I am making headway, I think of another scenario which wouldn't hold up. Any help is greatly appreciated!
Your problem is a bit special. You don't only want to sort a multidimensional array, which would be rather simple using the default sorting algorithms; you also want to keep the key/value pairs together.
The second problem is that the keys are strings with numbers in them. So simple string comparison wouldn't work, because strings are compared letter by letter, so "test9" > "test11" would be true (the second 1 wouldn't even be looked at, because 9>1).
The simplest solution I figured out would be the following:
# get the failure id of one list
def failureId(value):
    return int(value[0].replace("failure", ""))

# get the id of one component
def componentId(value):
    return int(value.replace("component", ""))

# sort one failure list using bubble sort
def sortFailure(failure):
    # component names sit at the odd indices 1, 3, 5, ...; each score follows its name
    # after each pass the largest remaining id has bubbled up to index i
    for i in range(len(failure) - 2, 1, -2):
        for j in range(1, i, 2):
            # comparing the component ids
            if componentId(failure[j]) > componentId(failure[j+2]):
                # swapping keys and values
                failure[j], failure[j+2] = failure[j+2], failure[j]
                failure[j+1], failure[j+3] = failure[j+3], failure[j+1]

# sorting the full list
def sortData(data):
    # sorting the failures using the default sort algorithm
    data.sort(key=failureId)
    # sorting each failure list itself
    for failure in data:
        sortFailure(failure)

data = [['failure2', 'component2', 0.15, 'component1', 0.85], ['failure3', 'component1', 0.95], ['failure1','component1',0.05,'component3', 0.8, 'component2', 0.1, 'component4', 0.05]]
print(data)
sortData(data)
print(data)
The first two functions are required to get the numbers (= ids) from the strings, as mentioned above. sortFailure uses "bubble sort" to sort one failure list. It uses a step of 2 for the range function, because we want to skip the values for each component. If two entries are in the wrong order, we swap both the key and the value. In the sortData function we use the built-in sort function for lists to sort the whole list (by failure ids). Then we take each "sublist" and sort it using sortFailure.
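The question itself asks for descending order by score rather than by component number. A minimal sketch of that variant, assuming each sublist has the layout shown in the data above (symptom first, then alternating component and score); sort_by_score is a hypothetical helper name, not part of the answer's code:

```python
def sort_by_score(failure):
    """Return a new list with the (component, score) pairs ordered by descending score."""
    symptom, rest = failure[0], failure[1:]
    # Pair each component name with the score that follows it.
    pairs = list(zip(rest[0::2], rest[1::2]))
    # sorted() is stable, so pairs with equal scores keep their original order.
    pairs = sorted(pairs, key=lambda pair: pair[1], reverse=True)
    # Flatten the pairs back into the original alternating layout.
    return [symptom] + [item for pair in pairs for item in pair]

data = [['failure2', 'component2', 0.15, 'component1', 0.85],
        ['failure3', 'component1', 0.95]]
sorted_data = [sort_by_score(failure) for failure in data]
print(sorted_data)
```

Because the pairs are built with slicing, the length of each sublist does not need to be known in advance, and a sublist with a single component comes back unchanged.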

Random partitioning given array with given bin sizes

How to randomly partition given array with given bin sizes?
Is there an inbuilt function for that? For example, I want something like
function(12,(2,3,3,2,2)) to output five partitions of the numbers from 1 to 12 (or 0 to 11, doesn't matter). So the output may be a list like [[3,4],[7,8,11],[12,1,2],[5,9],[6,10]] (or some other efficient data structure). The first argument of the function may be just a number n, in which case it will consider np.arange(n) as the input, otherwise it may be any other ndarray.
Of course we can randomly permute the list and then pick the first 2, next 3, next 3, next 2 and last 2 elements. But does there exist something more efficient?
numpy.partition() function has a different meaning, it performs a step in quicksort, and I also couldn't find any such function in the numpy.random submodule.
Try the following solution:
import numpy as np

def func(a, b):
    # a is an integer and b is a Python list of bin sizes summing to a
    indx = np.random.permutation(a)  # randomly arranged indices 0 .. a-1
    b = np.array(b)
    # split the shuffled indices at the cumulative bin sizes; returns a list of arrays
    return np.split(indx, b.cumsum()[:-1])
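If the first argument may be either a number or an arbitrary sequence, as the question suggests, the shuffle-then-cut idea can also be sketched with the standard library alone. This is only an illustration: random_partition and its seed parameter are hypothetical names, not an existing numpy or stdlib function.

```python
import random
from itertools import islice

def random_partition(items, sizes, seed=None):
    """Shuffle items (or range(items) if an int is given) and cut them into bins of the given sizes."""
    pool = list(range(items)) if isinstance(items, int) else list(items)
    if sum(sizes) != len(pool):
        raise ValueError("bin sizes must add up to the number of items")
    rng = random.Random(seed)  # a seed makes the partition reproducible
    rng.shuffle(pool)
    it = iter(pool)
    # Each islice call consumes the next `size` shuffled items.
    return [list(islice(it, size)) for size in sizes]

parts = random_partition(12, (2, 3, 3, 2, 2), seed=0)
print(parts)
```

A single shuffle followed by one pass over the result is linear in the number of items, so there is little room for something more efficient than the permute-and-slice approach mentioned in the question.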

Python Iterating through nested list using list comprehension

I'm working on Project Euler, problem 11, which involves finding the greatest product of all possible combinations of four adjacent numbers in a grid. I've split the numbers into a nested list and used a list comprehension to slice the relevant numbers, like this:
if x+4 <= len(matrix[x]): #check right
    my_slice = [int(matrix[x][n]) for n in range(y,y+4)]
...and so on for the other cardinal directions. So far, so good. But when I get to the diagonals things get problematic. I tried to use two ranges like this:
if x+4 <= len(matrix[x]) and y-4 >=0:# check up, right
    my_slice = [int(matrix[m][n]) for m,n in ((range(x,x+4)),range(y,y+4))]
But this yields the following error:
<ipython-input-53-e7c3ebf29401> in <listcomp>(.0)
48 if x+4 <= len(matrix[x]) and y-4 >=0:# check up, right
---> 49 my_slice = [int(matrix[m][n]) for m,n in ((range(x,x+4)),range(y,y+4))]
ValueError: too many values to unpack (expected 2)
My desired indices for x,y values of [0,0] would be ['0,0','1,1','2,2','3,3']. This does not seem all that different for using the enumerate function to iterate over a list, but clearly I'm missing something.
P.S. My apologies for my terrible variable nomenclature, I'm a work in progress.
You do not need to use two ranges, simply use one and apply it twice:
my_slice = [int(matrix[m][m-x+y]) for m in range(x,x+4)]
Since your n is supposed to follow range(y,y+4), we know that there will always be a difference of y-x between m and n. So instead of using two variables, we can account for the difference ourselves.
Or in case you still wish to use two range(..) constructs, you can use zip(..), which takes any number of iterables, consumes them in parallel and emits tuples:
my_slice = [int(matrix[m][n]) for m,n in zip(range(x,x+4),range(y,y+4))]
But I think this will not improve performance because of the tuple packing and unpacking overhead.
[int(matrix[x+d][y+d]) for d in range(4)] for one diagonal.
[int(matrix[x+d][y-d]) for d in range(4)] for the other.
Btw, better use standard matrix index names, i.e., row i and column j. Not x and y. It's confusing. I think you even confused yourself, as for example your if x+4 <= len(matrix[x]) tests x against the second dimension length but uses it in the first dimension. Huh?
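Putting the offset idea together with bounds checks on the correct dimensions gives roughly the following. This is a sketch only: diagonal_slices, the row/column names i and j, and the small test grid are assumptions for illustration, not code from the question.

```python
def diagonal_slices(matrix, i, j, length=4):
    """Return the down-right and down-left diagonals starting at row i, column j.

    A direction that would run off the grid yields None instead of a slice.
    """
    rows, cols = len(matrix), len(matrix[0])
    down_right = down_left = None
    # Rows are checked against the number of rows, columns against the row length.
    if i + length <= rows and j + length <= cols:
        down_right = [int(matrix[i + d][j + d]) for d in range(length)]
    if i + length <= rows and j - (length - 1) >= 0:
        down_left = [int(matrix[i + d][j - d]) for d in range(length)]
    return down_right, down_left

# Made-up 5x5 grid of numeric strings: the entry at row r, column c is r * 5 + c.
grid = [[str(r * 5 + c) for c in range(5)] for r in range(5)]
print(diagonal_slices(grid, 0, 0))
print(diagonal_slices(grid, 0, 4))
```

Using a single offset d for both indices keeps the row and column in step, which is what the tuple unpacking over two separate ranges was trying to achieve.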

Python list of numpy matrices behaving strangely

I am trying to work with lists of numpy matrices and am encountering an annoying problem.
Let's say I start with a list of ten 2x2 zero matrices
para=[numpy.matrix(numpy.zeros((2,2)))]*(10)
I access individual matrices like this
para[0]
para[1]
and so on. So far so good.
Now, I want to modify the first row of the second matrix only, leaving all the others unchanged. So I do this
para[1][0]=numpy.matrix([[1,1]])
The first index points to the second matrix in the list and the second index points to the first row in that matrix, replacing it with [1,1].
But strangely enough, this command changes the first row of ALL ten matrices in the list to [1,1] instead of just the second one like I wanted. What gives?
When you multiply the initial list by 10, you end up with a list of 10 entries which are in fact references to the same underlying matrix. Modifying one will modify all of them because in fact there's only one numpy matrix, not 10.
If you need proof, check out this example in the REPL:
>>> a = numpy.zeros(10)
>>> a = [numpy.zeros(10)]*10
>>> a[0] is a[1]
True
>>>
The is operator checks whether both names refer to the very same object (not whether they are equal in value).
What you should do is use a list comprehension to generate your initial arrays instead of a multiplication, like so:
para=[numpy.matrix(numpy.zeros((2,2))) for i in range(10)]
That will call numpy.matrix() ten times instead of just once and generate 10 distinct matrices.
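The same aliasing can be reproduced without numpy, since it comes from list multiplication itself rather than from the matrix type. A minimal sketch using plain nested lists (the variable names shared and independent are made up for illustration):

```python
# Multiplying a list copies references, so all three entries are one inner list.
shared = [[0, 0]] * 3
shared[1][0] = 1
print(shared)            # the change shows up in every entry
print(shared[0] is shared[1])

# A comprehension evaluates the expression once per iteration: three separate lists.
independent = [[0, 0] for _ in range(3)]
independent[1][0] = 1
print(independent)       # only the second entry changed
print(independent[0] is independent[1])
```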

create an array from a txt file

I'm new to Python and I have a problem.
I have some measured data saved in a txt file.
The data is separated with tabs and has this structure:
0 0 -11.007001 -14.222319 2.336769
I always have 32 data points per simulation (0,1,2,...,31) and I have 300 simulations (0,1,2,...,299), so the data is sorted first by simulation number and then by data point number.
The first column is the simulation number, the second column is the data point number and the other 3 columns are the x,y,z coordinates.
I would like to create a 3D array: the first dimension should be the simulation number, the second the number of the data point and the third the three coordinates.
I already started a bit and here is what I have so far:
## read file
coords = [x.split('\t') for x in
          open(f,'r').read().replace('\r','')[:-1].split('\n')]
## extract the information you want
simnum = [int(x[0]) for x in coords]
npts = [int(x[1]) for x in coords]
xyz = array([map(float,x[2:]) for x in coords])
but I don't know how to combine these 2 lists and this one array.
in the end i would like to have something like this:
array = [simnum][num_dat_point][xyz]
thanks for your help.
I hope you understand my problem, it's my first posting in a python forum, so if I did anything wrong, I'm sorry about this.
thanks again
You can combine them with the zip function, like so:
for sim, datapoint, (x, y, z) in zip(simnum, npts, xyz):
    # do your thing
    pass
Or you could avoid list comprehensions altogether and just iterate over the lines of the file:
for line in open(fname):
    lst = line.split('\t')
    sim, datapoint = int(lst[0]), int(lst[1])
    x, y, z = [float(i) for i in lst[2:]]
    # do your thing
To read the whole file into coords, you could (and should) simply do the following:
coords = [x.split('\t') for x in open(fname)]
This seems like a good opportunity to use itertools.groupby.
import itertools
import csv

file = open("data.txt")
reader = csv.reader(file, delimiter='\t')
result = []
# groupby yields one group of consecutive rows per simulation number
for simnumberStr, rows in itertools.groupby(reader, key=lambda t: t[0]):
    simData = []
    for row in rows:
        simData.append([float(v) for v in row[2:]])
    result.append(simData)
file.close()
This will create a 3-dimensional list named 'result'. The first index is the simulation number, and the second index is the data index within that simulation. The value is a list of floats containing the x, y, and z coordinates.
Note that this assumes the data is already sorted on simulation number and data number.
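To try the grouping without a file on disk, the same idea can be run against an in-memory sample. A minimal sketch, assuming tab-separated rows that are already sorted by simulation number; load_simulations is a hypothetical name, and only the first sample row comes from the question, the other two are made up:

```python
import csv
import io
import itertools

def load_simulations(lines):
    """Group tab-separated rows by their first column into result[sim][point] = [x, y, z]."""
    reader = csv.reader(lines, delimiter="\t")
    result = []
    # groupby only merges consecutive rows, hence the sorted-input assumption.
    for _sim, rows in itertools.groupby(reader, key=lambda row: row[0]):
        result.append([[float(v) for v in row[2:]] for row in rows])
    return result

sample = io.StringIO(
    "0\t0\t-11.007001\t-14.222319\t2.336769\n"
    "0\t1\t1.0\t2.0\t3.0\n"   # made-up row
    "1\t0\t4.0\t5.0\t6.0\n"   # made-up row
)
result = load_simulations(sample)
print(result[1][0])  # coordinates of data point 0 in simulation 1
```

Passing an open file object instead of the StringIO works the same way, since csv.reader accepts any iterable of lines.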
According to the zen of python, flat is better than nested. I'd just use a dict.
import csv
# QUOTE_NONNUMERIC converts every unquoted field to float, so the keys are floats too
f = csv.reader(open('thefile.csv'), delimiter='\t',
               quoting=csv.QUOTE_NONNUMERIC)
result = {}
for simn, dpoint, c1, c2, c3 in f:
    result[simn, dpoint] = c1, c2, c3
# pretty-prints the result:
from pprint import pprint
pprint(result)
You could be using many different kinds of containers for your purposes, but none of them has array as an unqualified name -- Python has a module array which you can import from the standard library, but the array.array type is too limited for your purposes (1-D only and with elementary types as contents); there's a popular third-party extension known as numpy, which does have a powerful numpy.array type, which you could use if you have downloaded and installed the extension -- but as you never even once mention numpy I doubt that's what you mean; the relevant builtin types are list and dict. I'll assume you want any container whatsoever -- but if you could learn to use precise terminology in the future, that will substantially help you AND anybody who's trying to help you (say list when you mean list, array only when you DO mean array, "container" when you're uncertain about what container to use, and so forth).
I suggest you look at the csv module in the standard library for a more robust way to reading your data, but that's a separate issue. Let's start from when you have the coords list of lists of 5 strings each, each sublist with strings representing two ints followed by three floats. Two more key aspects need to be specified...
One key aspect you don't tell us about: is the list sorted in some significant way? is there, in particular, some significant order you want to keep? As you don't even mention either issue, I will have to assume one way or another, and I'll assume that there isn't any guaranteed nor meaningful order; but, no repetition (each pair of simulation/datapoint numbers is not allowed to occur more than once).
Second key aspect: are there the same number of datapoints per simulation, in increasing order (0, 1, 2, ...), or is that not necessarily the case (and btw, are the simulations themselves numbered 0, 1, 2, ...)? Again, no clue from you on this indispensable part of the specs -- note how many assumptions you're forcing would-be helpers to make by just not telling us about such obviously crucial aspects. Don't let people who want to help you stumble in the dark: rather, learn to ask questions the smart way -- this will save untold amounts of time to yourself AND would-be helpers, and give you higher-quality and more relevant help, so, why not do it? Anyway, forced to make yet another assumption, I'll have to assume nothing at all is known about the simulation numbers nor about the numbers of datapoints in each simulation.
With these assumptions dict emerges as the only sensible structure to use for the outer container: a dictionary whose key is a tuple with two items, simulation number then datapoint number within the simulation. The values may as well be tuple, too (with three floats each), since it does appear that you have exactly 3 coordinates per line.
With all of these assumptions...:
def make_container(coords):
    result = dict()
    for s, d, x, y, z in coords:
        key = int(s), int(d)
        value = float(x), float(y), float(z)
        result[key] = value
    return result
It's always best, and fastest, to have all significant code within def statements (i.e. as functions to be called, possibly with appropriate arguments), so I'm presenting it this way. make_container returns a dictionary which you can address with the simulation number and datapoint number; for example,
d = make_container(coords)
print(d[0, 0])
will print the x, y, z for dp 0 of sim 0, assuming one exists (you would get an error if such a sim/dp combination did not exist). dicts have many useful methods, e.g. changing the print statement above to
print(d.get((0, 0)))
(yes, get needs its own pair of parentheses around the tuple -- inner ones to make a tuple, outer ones to call get with that tuple as its single argument), you'd see None, rather than get an exception, if there was no such sim/dp combination as (0, 0).
If you can edit your question to make your specs more precise (perhaps including some indication of ways you plan to use the resulting container, as well as the various key aspects I've listed above), I might well be able to fine-tune this advice to match your need and circumstances much better (and so might every other responder, regarding their own advice!), so I strongly recommend you do so -- thanks in advance for helping us help you!-)
Essentially, the difficulty is what happens if different simulations have different numbers of points.
You will therefore need to dimension an array to the appropriate sizes first.
t should be an array of at least (max(simnum) + 1) x (max(npts) + 1) x 3, since both indices start at 0.
To eliminate confusion you should initialise it with not-a-number;
this will allow you to see missing points.
Then use something like
for x in coords:
    # x[0] and x[1] are the two indices, x[2], x[3], x[4] the x, y, z coordinates
    t[int(x[0])][int(x[1])][0] = float(x[2])
    t[int(x[0])][int(x[1])][1] = float(x[3])
    t[int(x[0])][int(x[1])][2] = float(x[4])
Is this what you meant?
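A minimal sketch of that pre-sized, NaN-filled container, using nested lists so that no third-party package is needed. build_grid is a hypothetical name, and apart from the first row (taken from the question) the sample rows are made up:

```python
import math

def build_grid(coords):
    """Return grid[sim][point] = [x, y, z], with NaN triples where no row was given."""
    # Indices start at 0, so each dimension needs max + 1 slots.
    n_sims = max(int(row[0]) for row in coords) + 1
    n_pts = max(int(row[1]) for row in coords) + 1
    nan = float("nan")
    grid = [[[nan, nan, nan] for _ in range(n_pts)] for _ in range(n_sims)]
    for row in coords:
        grid[int(row[0])][int(row[1])] = [float(v) for v in row[2:5]]
    return grid

coords = [
    ["0", "0", "-11.007001", "-14.222319", "2.336769"],
    ["0", "1", "1.0", "2.0", "3.0"],  # made-up row
    ["1", "0", "4.0", "5.0", "6.0"],  # made-up row
]
grid = build_grid(coords)
# Simulation 1 has no data point 1, so that slot still holds NaN.
print(math.isnan(grid[1][1][0]))
```

NaN never compares equal to itself, so missing slots have to be detected with math.isnan rather than with ==.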
First I'd point out that your first data point appears to be an index, and wonder if the data is therefore important or not, but whichever :-)
import re

# two integer columns followed by three signed decimal columns, separated by whitespace
mch = re.compile(r'^(\d+)\s+(\d+)\s+([-\d.]+)\s+([-\d.]+)\s+([-\d.]+)$')

def parse(line):
    m = mch.match(line)
    if m:
        l = m.groups()
        # list(...) so that xyz is a real list on Python 3 as well
        (idx, data, xyz) = (int(l[0]), int(l[1]), list(map(float, l[2:])))
        return (idx, data, xyz)
    return None

finaldata = []
file = open("data.txt", 'r')
for line in file:
    r = parse(line)
    if r is not None:
        finaldata.append(r)
Final data should have output along the lines of:
[(0, 0, [-11.007001000000001, -14.222319000000001, 2.3367689999999999]),
(1, 0, [-11.007001000000001, -14.222319000000001, 2.3367689999999999]),
(2, 0, [-11.007001000000001, -14.222319000000001, 2.3367689999999999]),
(3, 0, [-11.007001000000001, -14.222319000000001, 2.3367689999999999]),
(4, 0, [-11.007001000000001, -14.222319000000001, 2.3367689999999999])]
This should be pretty robust about dealing w/ the whitespace issues (tabs spaces whatnot)...
I also wonder how big your data files are, mine are usually large so being able to process them in chunks or groups become more important... Anyway this will work in python 2.6.
Are you sure a 3d array is what you want? It seems more likely that you want a 2d array, where the simulation number is one dimension, the data point is the second, and then the value stored at that location is the coordinates.
This code will give you that.
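data = {}  # nested dicts: data[simulation][data point] -> (x, y, z)
for coord in coords:
    sim, point = int(coord[0]), int(coord[1])
    if sim not in data:
        data[sim] = {}
    data[sim][point] = (float(coord[2]), float(coord[3]), float(coord[4]))
To get the coordinates at simulation 7, data point 13, just do data[7][13].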
