Python noob: manipulating arrays

I have already asked a few questions on here about this same topic, but I'm really trying not to disappoint the professor I'm doing research with. This is my first time using Python and I may have gotten in a little over my head.
Anyways, I was sent a file to read and was able to read it using this command:
SNdata = numpy.genfromtxt('...', dtype=None,
                          usecols=(0,6,7,8,9,19,24,29,31,33,34,37,39,40,41,42,43,44),
                          names=['sn','off1','dir1','off2','dir2','type','gal','dist',
                                 'htype','d1','d2','pa','ai','b','berr','b0','k','kerr'])
sn is just an array of the names of the supernovae; type is an array of the supernova types (Ia or II), and so on.
One of the first things I need to do is simply calculate the probabilities of certain properties given the SN type (Ia or II).
For instance, the column htype is the morphology of a galaxy (given as an integer, 1=elliptical to 8=irregular). I need to calculate the probability of an elliptical given a Type Ia and an elliptical given a Type II, for all of the integers up to 8.
For ellipticals, I know that I just need the number of elements that have htype = 1 and type = Ia divided by the total number of elements of type = Ia. And then the number of elements that have htype = 1 and type = II divided by the total number of elements that have type = II.
I just have no idea how to write code for this. I was planning on finding the total number of each type first and then running a for loop to find the number of elements that have a certain htype given their type (Ia or II).
Could anyone help me get started with this? If any clarification is needed, let me know.
Thanks a lot.

Numpy supports boolean array operations, which will make your code fairly straightforward to write. For instance, you could do:
htype_sums = {}
# genfromtxt returns a structured array, so fields are accessed as SNdata['type']
Ia_mask = SNdata['type'] == 'Ia'
II_mask = SNdata['type'] == 'II'
for htype_number in range(1, 9):
    htype_mask = SNdata['htype'] == htype_number
    # float() guards against integer division under Python 2
    Ia_sum = (htype_mask & Ia_mask).sum() / float(Ia_mask.sum())
    II_sum = (htype_mask & II_mask).sum() / float(II_mask.sum())
    htype_sums[htype_number] = (Ia_sum, II_sum)
Each of the _mask variables is a boolean array, so summing one counts the number of elements that are True.
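For instance, here is a quick standalone illustration (not from the original answer) of how summing boolean arrays counts True elements:
import numpy as np

mask = np.array([True, False, True, True])
other = np.array([True, True, False, True])
print(mask.sum())            # 3, since three elements are True
print((mask & other).sum())  # 2, elementwise AND first, then count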

You can use collections.Counter to count the needed observations.
For example,
from collections import Counter
types_counter = Counter(row['type'] for row in data)
will give you the desired counts of SN types, and
htypes_types_counter = Counter((row['type'], row['htype']) for row in data)
gives the counts for each (type, morphology) combination. Then, to get your estimate for ellipticals, just divide:
1.0*htypes_types_counter['Ia', 1]/types_counter['Ia']
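Putting it together, a minimal sketch of the full calculation with Counter (assuming data is the structured array loaded by genfromtxt above and that the type column is read as the strings 'Ia' and 'II'; variable names are illustrative):
from collections import Counter

types_counter = Counter(row['type'] for row in data)
htypes_types_counter = Counter((row['type'], row['htype']) for row in data)

probabilities = {}
for htype_number in range(1, 9):
    # P(htype | SN type) = count(type and htype) / count(type)
    p_Ia = htypes_types_counter['Ia', htype_number] / float(types_counter['Ia'])
    p_II = htypes_types_counter['II', htype_number] / float(types_counter['II'])
    probabilities[htype_number] = (p_Ia, p_II)

print(probabilities)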


enumerate in dictionary loop takes a long time: how to improve the speed

I am using python-3.x and I would like to speed up my code. In every loop I create new values and check whether they already exist in the dictionary; if a value is found, I keep the index where it was found. I am using enumerate, but it takes a long time, even though it is a very clear way to do it. Is there another way to speed up my code, or is enumerate the only option for what I need? I am also not sure whether using numpy would be better in my case.
Here is my code:
# import numpy
import numpy as np

# my first array
my_array_1 = np.random.choice(np.linspace(-1000, 1000, 2 ** 8), size=(100, 3), replace=True)

# here I want to find the unique values from my_array_1
indx = np.unique(my_array_1, return_index=True, return_counts=True, axis=0)

# then save the result to a dictionary
dic_t = {"my_array_uniq": indx[0],  # unique values in my_array_1
         "counts": indx[2]}         # how many times each unique element appears in my_array_1

# here I want to create a random array 100 times
for i in range(100):
    print(i)
    # my 2nd array
    my_array_2 = np.random.choice(np.linspace(-1000, 1000, 2 ** 8), size=(100, 3), replace=True)

    # I would like to check whether the values in my_array_2 exist or not in dic_t["my_array_uniq"].
    # If a value exists, I want to hold the index of that value in the dictionary and
    # add 1 to dic_t["counts"], which means this value appeared again, and count how many times.
    # If it does not exist, add the value to dic_t["my_array_uniq"]
    # and also append 1 to dic_t["counts"].
    for j, a in enumerate(my_array_2):
        ix = [k for k, u in enumerate(dic_t["my_array_uniq"]) if (a == u).all()]
        if ix:
            print(50 * "*", j, "Yes", "at", ix[0])
            dic_t["counts"][ix[0]] += 1
        else:
            # print(50 * "*", j, "No")
            dic_t["counts"] = np.hstack((dic_t["counts"], 1))
            dic_t["my_array_uniq"] = np.vstack((dic_t["my_array_uniq"], a))
Explanation:
1- I create an initial array.
2- Then I find the unique values, indices and counts of the initial array using np.unique.
3- I save the result to the dictionary (dic_t).
4- Then I start the loop, creating random values 100 times.
5- I would like to check whether these random values in my_array_2 exist or not in the dictionary (dic_t["my_array_uniq"]).
6- If one of them exists, I want to hold the index of that value in the dictionary.
7- Add 1 to dic_t["counts"], which means this value appears again, and count how many times.
8- If it does not exist, add this value to the dictionary as a new unique value (dic_t["my_array_uniq"]).
9- Also append 1 to dic_t["counts"].
So from what I can see, you are:
Creating a pool of 256 evenly spaced numbers between -1000 and 1000
Generating 100 triplets from that pool (np.unique could leave you with fewer than 100 rows, but with overwhelming probability it will be exactly 100)
Then doing pretty much the same thing 100 times and each time checking for each of the triplets in the new list whether they exist in the old list.
You're then trying to get a count of how often each element occurs.
I'm wondering why you're trying to do this, because it doesn't make much sense to me, but I'll give a few pointers:
There's no reason to make a dictionary dic_t if you're only going to hold two objects in it; just use two variables, my_array_uniq and counts.
You're dealing with triplets of floating point numbers. In the given range, that should give you about 10^48 different possible triplets (I may be wrong on the exact number but it's an absurdly large number either way). The way you're generating them does reduce the total phase-space a fair bit, but nowhere near enough. The probability of finding identical ones is very very low.
If you have a set of objects (in this case number triplets) and you want to determine whether you have seen a given one before, you want to use sets. Sets can only contain hashable objects, so you want to turn your triplets into tuples. Determining whether a given triplet is already contained in your set is then an O(1) operation on average.
For counting the number of occurrences of something, collections.Counter is the natural data structure to use (see the sketch below).
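A minimal sketch combining both ideas, assuming the same setup as in the question (variable names are illustrative):
import numpy as np
from collections import Counter

values = np.linspace(-1000, 1000, 2 ** 8)

# count the triplets from the initial array
my_array_1 = np.random.choice(values, size=(100, 3), replace=True)
counts = Counter(tuple(row) for row in my_array_1)  # tuples are hashable, numpy rows are not

# 100 further rounds: membership tests and count updates are O(1) on average
for _ in range(100):
    my_array_2 = np.random.choice(values, size=(100, 3), replace=True)
    for row in my_array_2:
        counts[tuple(row)] += 1  # adds the triplet if unseen, increments otherwise

# counts.keys() are the unique triplets, counts.values() how often each was seen
print(len(counts), "unique triplets")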

How to calculate Delta F / F using python?

I've recently "taught" myself Python in order to analyze data for my experiments. As such, I'm pretty clueless on many aspects. I've managed to make my analysis work for certain files, but in some cases it breaks down, and I imagine it is a result of faulty programming.
Currently I export a file containing 3 numpy arrays. One of these arrays is my signal (float values from -10 to 10). What I wish to do is normalize every datum in this array to the range of values that precede it (i.e. the 30001st value must have the average of the preceding 3000 values subtracted from it, and the difference must then be divided by this very same average of the preceding 3000 values). My data is collected at a rate of 100 Hz, so to normalize over the last 30 s I must use the preceding 3000 values.
As it stands, this is how I've managed to make it work:
This stores the signal in the variable photosignal:
photosignal = np.array(seg.analogsignals[0], ndmin=1)
Now this is the part I use to get the delta F/F over a moving window of 30 s:
normalizedphotosignal = [(uu-(np.mean(photosignal[uu-3000:uu])))/abs(np.mean(photosignal[uu-3000:uu])) for uu in photosignal[3000:]]
The following adds 3000 values to the beginning to keep the array the same length, since later on I must time-lock it to another list of the same length:
holder =list(range(3000))
normalizedphotosignal = holder + normalizedphotosignal
What I have noticed is that in certain files this code gives me an error because it says that the "slice" is empty and therefore it cannot compute a mean.
I think maybe there is a better way to program this that could avoid this problem altogether. Or is this a correct way to approach this problem?
So I tried the solution, but it is quite slow and it nevertheless still gives me the "empty slice" error.
I went over the moving average post and found this method:
def running_mean(x, N):
cumsum = np.cumsum(np.insert(x, 0, 0))
return (cumsum[N:] - cumsum[:-N]) / N
However, I'm having trouble adapting it to my desired output, namely (x - running average) / running average.
Alright, so I finally figured it out thanks to your help and the posts you referred me to.
The calculation for my entire data (300 000 +) takes about a second!
I used the following code:
def runningmean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N

photosignal = np.array(seg.analogsignal[0], ndmin=1)
photosignalaverage = runningmean(photosignal, 3000)
holder = np.zeros(2999)
photosignalaverage = np.append(holder, photosignalaverage)
deltafsignal = (photosignal - photosignalaverage) / abs(photosignalaverage)
Photosignal stores my raw signal in a numpy array.
Photosignalaverage uses cumsum to calculate the running average of every datapoint in photosignal. I then add the first 2999 values as 0, to maintain the same array size as my photosignal.
I then use basic numpy calculations to get my delta F/F signal.
Thank you once more for the feedback, was truly helpful!
Your approach goes in the right direction. However, you made a mistake in your list comprehension: you are using uu as your index, whereas the uu values are the elements of your input data photosignal.
You want something like this:
normalizedphotosignal2 = np.zeros(photosignal.shape[0] - 3000)
for i, uu in enumerate(photosignal[3000:]):
    window_mean = np.mean(photosignal[i:i + 3000])  # the 3000 samples preceding uu
    normalizedphotosignal2[i] = (uu - window_mean) / abs(window_mean)
Keep in mind that for-loops are relatively slow in Python. If performance is an issue here, you could try avoiding the for loop and using numpy methods instead (e.g. have a look at Moving average or running mean).
Hope this helps.
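For example, here is a minimal vectorized sketch using a cumulative-sum running mean (assuming photosignal is a 1-D numpy array and a trailing window of 3000 samples; the first 3000 samples, which have no full window, are simply left out):
import numpy as np

def trailing_mean(x, N):
    # mean of each length-N window, computed with a cumulative sum
    cumsum = np.cumsum(np.insert(x, 0, 0.0))
    return (cumsum[N:] - cumsum[:-N]) / N

window = 3000
means = trailing_mean(photosignal, window)[:-1]  # mean of samples i..i+2999, for each i
signal = photosignal[window:]                    # the samples being normalized
normalized = (signal - means) / np.abs(means)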

Finding repeated rows in a numpy array

The following function is designed to find the unique rows of an array:
def unique_rows(a):
b = np.ascontiguousarray(a).view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
_, idx = np.unique(b, return_index=True)
unique_a = a[idx]
return unique_a
For example,
test = np.array([[1,0,1],[1,1,1],[1,0,1]])
unique_rows(test)
[[1,0,1],[1,1,1]]
I believe that this function should work all the time; however, it may not be watertight. In my code I would like to calculate how many unique positions exist for a set of particles. The particles are stored in a 2d array, each row corresponding to the position of a particle. The positions are of type np.float64.
I have also defined the following function
def pos_tag(pos):
x,y,z = pos[:,0],pos[:,1],pos[:,2]
return (2**x)*(3**y)*(5**z)
In principle this function should produce a unique value for any (x,y,z) position.
However, when I use these two functions to calculate the number of unique positions in my set of particles, they produce different answers. Is this due to some possible logical flaw in the first function, or to the second function not producing a unique value for each given position?
EDIT: Usage example
I have some long code that produces a 2d array of particle positions.
partpos.shape = (6039539,3)
I then calculate the number of unique rows as follows
len(unique_rows(partpos))
6034411
And
posids = pos_tag(partpos)
len(np.unique(posids))
5328871
I believe that the discrepancy arises due to a precision error.
Using the code
print len(unique_rows(partpos.astype(np.float32)))
print len(np.unique(pos_tag(partpos)))
6034411
6034411
However with
print len(unique_rows(partpos.astype(np.float32)))
print len(np.unique(pos_tag(partpos.astype(np.float32))))
6034411
5328871
a = [[1,0,1],[1,1,1],[1,0,1]]
# Convert rows to tuples so they're hashable, creating a generator thereof
b = (tuple(row) for row in a)
# Convert back to list of lists, after coercing to a set to eliminate non-unique rows
unique_rows = list(list(row) for row in set(b))
Edit: Well that's embarrassing. I just realized I didn't really address the question asked. This could still be the answer the OP is looking for, so I'll leave it, but it's not really what was asked. Sorry for that.
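As a side note, newer numpy versions (1.13 and later) can find unique rows and their counts directly via the axis argument; a minimal sketch, with a toy array standing in for the OP's partpos:
import numpy as np

partpos = np.array([[1.0, 0.0, 1.0],
                    [1.0, 1.0, 1.0],
                    [1.0, 0.0, 1.0]])  # stand-in for the real particle positions

unique_pos, counts = np.unique(partpos, axis=0, return_counts=True)
print(len(unique_pos))         # number of unique positions
print(unique_pos[counts > 1])  # the rows that appear more than once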

Should this be solved using a subset sum algorithm

The problem asks me to find all possible subsets of a list (single items, pairs, or larger groups) that add up to a given number. I have been reading a lot about subset sum problems and am not sure whether that applies to this problem.
To explain the problem more, I have a max weight of candy that I am allowed to purchase.
I know the weights of ten different pieces of candy, which I have stored in a list:
candy = [ [snickers, 150.5], [mars, 130.3], ......]
I can purchase at most max_weight = 740.5 grams EXACTLY.
Thus I have to find all possible combinations of candy that will equal exactly the max_weight. I will be programming in Python. I don't need the exact code, just whether or not this is a subset sum problem and suggestions on how to proceed.
OK, here's a brute-force approach exploiting numpy's index magic:
from itertools import combinations
import numpy as np

candy = [["snickers", 150.5], ["mars", 130.3], ["choc", 10.0]]
n = len(candy)
ww = np.array([c[1] for c in candy])  # extract the weights of the candies
idx = np.arange(n)                    # list of indices

iidx, sums = [], []
# generate all possible sums together with the index list that produced each one
for k in range(n):
    for ii in combinations(idx, k + 1):
        ii = list(ii)  # convert the tuple to a list so it can be used as an index array
        sums.append(np.sum(ww[ii]))
        iidx.append(ii)
sums = np.asarray(sums)

ll = np.where(np.abs(sums - 160.5) < 1e-9)[0]  # indices of the sums that match 160.5
# print the results
for le in ll:
    print([candy[e] for e in iidx[le]])
This is exactly the subset sum problem. You could use a dynamic programming approach to solve it; a sketch is below.
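A minimal dynamic-programming sketch, assuming the same toy candy list and the 160.5 target used in the other answer (weights are scaled to integer tenths of a gram so sums can be compared exactly; variable names are illustrative):
candy = [["snickers", 150.5], ["mars", 130.3], ["choc", 10.0]]
target = 160.5

scale = 10  # assumes one decimal place of precision in the weights
weights = [int(round(w * scale)) for _, w in candy]
goal = int(round(target * scale))

# subsets_for[s] holds every list of candy indices whose scaled weights sum to s
subsets_for = {0: [[]]}
for i, w in enumerate(weights):
    additions = {}
    for s, subsets in subsets_for.items():
        new_sum = s + w
        if new_sum > goal:
            continue
        additions.setdefault(new_sum, []).extend(sub + [i] for sub in subsets)
    # merge after the pass so each candy is used at most once per subset
    for s, subsets in additions.items():
        subsets_for.setdefault(s, []).extend(subsets)

for indices in subsets_for.get(goal, []):
    print([candy[i] for i in indices])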

Python: How would I get an average of a set of tuples?

I have a problem I am attempting to solve.
I have a function that produces tuples. I attempted to store them in an array like this:
while(loops til exhausted)
    count = 0
    set_of_tuples[count] = function(n,n,n)
    count = count + 1
Apparently Python doesn't store values this way. How can I go about storing a set of tuples in a variable and then averaging them?
You can store them in a couple ways. Here is one:
set_of_tuples = []
while <loop-condition>:
    set_of_tuples.append(function(n, n, n))
If you want to average the results element-wise, you can:
average = tuple(sum(x[i] for x in set_of_tuples) / len(set_of_tuples)
                for i in range(len(set_of_tuples[0])))
If this is numerical data, you probably want to use Numpy instead. If you were using a Numpy array, you would just:
average = numpy.average(arr, axis=0)
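For instance, a quick illustration with made-up data (assuming Python 3, so / is true division):
import numpy

set_of_tuples = [(1, 2), (3, 4), (5, 6)]

# pure-Python element-wise average
average = tuple(sum(x[i] for x in set_of_tuples) / len(set_of_tuples)
                for i in range(len(set_of_tuples[0])))
print(average)  # (3.0, 4.0)

# the numpy equivalent
arr = numpy.array(set_of_tuples)
print(numpy.average(arr, axis=0))  # [3. 4.]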
Hmmm, your pseudo-code is not Python at all. You might want to look at something more like:
## count = 0
set_of_tuples = list()
while not exhausted():
    set_of_tuples.append(function(n,n,n))
    ## count += 1
count = len(set_of_tuples)
However, the count here is superfluous, since we can just call len(set_of_tuples) after the loop if we want. Also, the name "set_of_tuples" is a pretty poor choice, especially given that it's not a set.
