Python: map two arrays with similar values

I have two arrays, let's say A and B. For each element in A, I want to find all elements in B that are within a certain tolerance of it.
To be exact, I am working with a database of galaxies and I want to compare simulated galaxies (S) with observed ones (O).
I have two arrays, brightness (Br) and distance (D), for both the simulations and the observations (so four arrays in total).
For each simulated galaxy, I want to find the set of galaxies from the observed list that have similar brightness and distance.
I have already written code that works, but it would take around 9-10 hours to run. Is there a way I can change the code to speed up the process?
Here is my code:
list1 = flux_hb
list2 = flux_inp_ha
# Note: [[]]*n would make every slot alias the same list, so appending to one
# row would append to all of them; a list comprehension avoids that.
flux_MC = [[] for _ in range(len(list1))]  # map flux from MAMBO to COSMOS (mambo --> list of cosmos galaxies)
noise = [[] for _ in range(len(list1))]
z_MC = [[] for _ in range(len(list1))]
index = np.zeros(len(list1))
flux_ha_temp = np.zeros(len(list1))
for i in tqdm(range(len(list1))):
    temp_flux_diff = abs(list1[i] - list2) / list1[i]  # element-wise over the whole array
    temp_z_diff = abs(z_geo[i] - zinp) / z_geo[i]
    for j in range(len(temp_flux_diff)):
        if temp_flux_diff[j] < 0.01 and temp_z_diff[j] < 0.01:
            flux_MC[i].append(list2[j])
            noise[i].append(snr_ha[j])
            z_MC[i].append(zinp[j])
My approach is to make empty lists to save the information I need.
For each element in list1, I compute the relative difference from every element of list2 (temp_flux_diff).
Then, for each element of list2, if that difference is smaller than the required tolerance, I call the values similar and save them in the lists.
Hopefully I was able to explain what I need and how I did it. I just want to speed up the code.
Update 1: As per the suggestion in the comments by @СергейКох, I changed my second loop to:
    index = np.where(np.logical_and(temp_flux_diff < 0.01, temp_z_diff < 0.01))[0]
    flux_MC[i].append(flux_inp_ha[index])
    noise[i].append(snr_ha[index])
    z_MC[i].append(zinp[index])
and my code now takes around 1 hour instead of 9 hours.
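The whole double loop can also be vectorized with broadcasting, trading memory (a len(S) x len(O) difference matrix) for speed. A minimal sketch, assuming flux_hb, z_geo (simulated) and flux_inp_ha, zinp, snr_ha (observed) are 1-D NumPy arrays as above:
import numpy as np

# 2-D relative-difference matrices: rows are simulated galaxies, columns observed.
flux_diff = np.abs(flux_hb[:, None] - flux_inp_ha[None, :]) / flux_hb[:, None]
z_diff = np.abs(z_geo[:, None] - zinp[None, :]) / z_geo[:, None]
match = (flux_diff < 0.01) & (z_diff < 0.01)

# One array of matching observed values per simulated galaxy.
flux_MC = [flux_inp_ha[row] for row in match]
noise = [snr_ha[row] for row in match]
z_MC = [zinp[row] for row in match]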


How can I easily iterate over NumPy arrays using approaches other than np.nditer?

I would like to iterate over each cell value of my arrays. I tried using np.nditer (for i in np.nditer(bar_st_1)). However, even on a laptop with 64 GB of RAM it takes a lot of computational time and runs out of memory. Do you know the easiest and fastest way to extract each array value? Thanks
# Assign the crop-specific irrigated area of each array for each month according to the crop calendar
# Barley
for Barley in df_dist.Crop:
    for i in np.nditer(bar_st_1):
        for j in df_area.Month:
            for k in df_dist.Planting_month:
                for l in df_dist.Maturity_month:
                    if (j >= min(k, l)) and (j <= max(k, l)):
                        df_area.Barley = i
                    else:
                        df_area.Barley = 0
My goal is to extract a value from each array and assign it to each growing-season month. df_dist is a district-level data frame containing the growing area for each month. bar_st_1 is a 7x7 array that contains the irrigated area of a specific district. For each cell, I would like to extract the value of the corresponding array and assign it to a specific month based on the growing season (per the condition stated above).
for j in df_area.Month:
    for k in df_dist.Planting_month:
        for l in df_dist.Maturity_month:
            if (j >= min(k, l)) and (j <= max(k, l)):
                df_area.Barley = i
            else:
                df_area.Barley = 0
This code block seems to be wasting a lot of effort. If you changed the order of the iterations, you could write
for k in df_dist.Planting_month:
    for l in df_dist.Maturity_month:
        for j in range(min(k, l), max(k, l) + 1):
            df_area.Barley = i
Then you avoid making a lot of comparisons and calculating a lot of max(k,l)'s that aren't necessary.
The loop over i is also wasting effort: you set certain entries of df_area.Barley to i, but in a later iteration you overwrite them with a different value of i, without ever (in the code you've shared) using df_area with the first value of i.
So you could reduce your code to
for Barley in df_dist.Crop:
    # Initialize the df_area array for this crop with zeros:
    df_area.Barley = np.zeros(df_area.Month.max())
    r, c = bar_st_1.shape
    # Choose the last element in bar_st_1:
    i = bar_st_1[r - 1, c - 1]
    for k in df_dist.Planting_month:
        for l in df_dist.Maturity_month:
            for j in range(min(k, l), max(k, l) + 1):
                df_area.Barley = i
Now you've eliminated one level from your nested loop structure and shortened the iteration in another level, so you're likely to get 10x or better improvement in speed.

How do I speed up a nested for-loop in Python with large datasets?

For every element in list A, I need to calculate the Levenshtein distance between it and every element in list B. It's 375 million calculations total, which will take too long (over 10 hours) with the nested for-loop that I currently have below:
for a in range(10000):
    listA_element = listA[a]
    # Calculate the Levenshtein distance between the listA element and every listB element
    for b in range(50000):
        listB_element = listB[b]
        score = abd.DiscountedLevenshtein().sim(listA_element, listB_element)
How can I do what the code above does, but in under 1-2 hours? I have looked into using NumPy, but it seems it will not work with the Levenshtein distance library, and I need the flexibility to do several different things in the loops other than calculations (creating lists, appending to lists, etc.). I am having issues with the Cython route, so any alternatives are welcome.
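One generic option is to split the outer loop across CPU cores with multiprocessing; each row of scores is independent, so the work parallelizes cleanly. A rough sketch with placeholder lists, assuming abd is abydos.distance as in the snippet above:
from multiprocessing import Pool

import abydos.distance as abd  # assumed to be the `abd` used in the question

listA = ["kitten", "sitting"] * 5            # placeholder data
listB = ["mitten", "fitting", "sitter"] * 5

def score_row(a):
    # Build the scorer once per outer element instead of once per pair,
    # then score `a` against every element of listB.
    scorer = abd.DiscountedLevenshtein()
    return [scorer.sim(a, b) for b in listB]

if __name__ == "__main__":
    with Pool() as pool:
        # One row of similarity scores per listA element, computed across cores.
        scores = pool.map(score_row, listA)
If exact DiscountedLevenshtein scores are not required, a C-backed library such as rapidfuzz is typically far faster than a pure-Python scorer.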

Sort unknown length array within unknown length 2D array - Python

I have a Python script which ends up creating a 2D array based on user input. Therefore, the length of the 2D array is unknown, and the lengths of the individual arrays within it are also unknown until the user has entered the information. I would like to sort the individual arrays based on a value associated with their elements. An example of a possible output that needs to be sorted is below:
Basically, each individual array is a failure symptom followed by a list of possible components, each with a "score" that is the likelihood that this component is causing the failure. My goal is to reorder each array so the components and their scores are in descending order by score, i.e., each component and its score need to move together. The problem, like I said, is that I do not know the length of anything until user input is given. There could be only 1 failure symptom input, or there could be 9. A failure symptom could contain only 1 component, or maybe 12. I know it will take nested for loops and if statements, but I haven't been able to figure it out given all the possible scenarios. Some possible scenarios I have thought of:
- The array is already in order (move on to the next failure symptom)
- The first component is correct, but the ones after may not be; or the first two are correct, but the ones after may not be, etc.
- The array is completely backwards in order
- The array contains only 1 component, so there is no need to sort
- The array is in some random order, so some components may already be in the correct spot while others aren't
Every time I feel like I am making headway, I think of another scenario which wouldn't hold up. Any help is greatly appreciated!
Your problem is a bit special. You don't only want to sort a multidimensional array, which would be rather simple using the default sorting algorithms, you also want to keep the order between the key/value pairs.
The second problem is that the keys are strings with numbers in them, so simple string comparison wouldn't work: strings are compared letter by letter, so "test9" > "test11" would be true (the second 1 wouldn't even be considered, because 9 > 1).
The simplest solution I figured out is the following:
# get the failure id of one list
def failureId(value):
    return int(value[0].replace("failure", ""))

# get the id of one component
def componentId(value):
    return int(value.replace("component", ""))

# sort one failure list using bubble sort
def sortFailure(failure):
    # iterating through the array (only the keys, ignoring the values)
    for i in range(1, len(failure), 2):
        for j in range(1, i, 2):
            # comparing the component ids
            if componentId(failure[j]) > componentId(failure[j + 2]):
                # swapping keys and values
                failure[j], failure[j + 2] = failure[j + 2], failure[j]
                failure[j + 1], failure[j + 3] = failure[j + 3], failure[j + 1]

# sorting the full list
def sortData(data):
    # sorting the failures using the default sort algorithm
    data.sort(key=failureId)
    # sorting each single failure list itself
    for failure in data:
        sortFailure(failure)

data = [['failure2', 'component2', 0.15, 'component1', 0.85],
        ['failure3', 'component1', 0.95],
        ['failure1', 'component1', 0.05, 'component3', 0.8, 'component2', 0.1, 'component4', 0.05]]
print(data)
sortData(data)
print(data)
The first two functions are needed to get the numbers (the ids) from the strings, as mentioned above. The sortFailure function uses bubble sort to sort one failure list; the range calls use a step of 2 because we want to skip the score value that follows each component. If two components are in the wrong order, we swap both the keys and the values. In the sortData function we use the built-in sort for lists to sort the whole list (by failure id), then take each sublist and sort it with sortFailure.
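Note the code above orders components by their id. If the goal is instead descending order by score, as the question describes, pairing each component with its score makes the built-in sort usable directly; a small sketch assuming the same data layout:
def sortByScore(failure):
    # failure[0] is the failure name; pair each component with its score.
    pairs = list(zip(failure[1::2], failure[2::2]))
    # Highest score first; each component and its score move together.
    pairs.sort(key=lambda p: p[1], reverse=True)
    failure[1:] = [x for pair in pairs for x in pair]

data = [['failure1', 'component1', 0.05, 'component3', 0.8, 'component2', 0.1, 'component4', 0.05]]
for failure in data:
    sortByScore(failure)
print(data)
# [['failure1', 'component3', 0.8, 'component2', 0.1, 'component1', 0.05, 'component4', 0.05]]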

Algos - Delete Extremes From A List of Integers in Python?

I want to eliminate extremes from a list of integers in Python. I'd say that my problem is one of design. Here's what I cooked up so far:
listToTest = [120, 130, 140, 160, 200]

def function(l):
    length = len(l)
    for x in range(0, length - 1):
        if l[x] < (l[x + 1] - l[x]) * 4:
            l.remove(l[x + 1])
    return l

print(function(listToTest))
So the output of this should be: 120,130,140,160 without 200, since that's way too far ahead from the others.
And this works, provided 200 is the last element or there's only one extreme. However, it gets problematic with a list like this:
listToTest = [120,200,130,140,160,200]
Or
listToTest = [120,130,140,160,200,140,130,120,200]
So, the output for the last list should be: 120,130,140,160,140,130,120. 200 should be gone, since it's a lot bigger than the "usual", which revolved around ~130-140.
Obviously, my method doesn't work. Some thoughts:
- I need to somehow compare x and x+1: if the next pair has a bigger difference than the last pair, the pair with the bigger difference should have its biggest element eliminated, and then this should be applied again recursively. I think I should also have an "acceptable difference", so the algorithm knows when a difference is acceptable and stops recursing rather than whittling the list down to only 2 values.
I tried writing it, but no luck so far.
You can use statistics here, eliminating values that fall beyond n standard deviations from the mean:
import numpy as np

test = [120, 130, 140, 160, 200, 140, 130, 120, 200]
n = 1
mean, std = np.mean(test), np.std(test)  # compute once rather than per element
output = [x for x in test if abs(x - mean) < std * n]
# output is [120, 130, 140, 160, 140, 130, 120]
Your problem statement is not clear. If you simply want to remove the max and min, that is a simple operation, O(N) time with O(1) extra memory: retain the current min/max value and compare it to each entry in the list in turn.
If you want the min/max K items, it is still cheap, O(N + K log K) time with O(K) extra memory: use two priority queues of size K, one for the mins and one for the maxes.
Or did you intend a different output/outcome from your algorithm?
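A quick sketch of both cases using the standard library (heapq gives the size-K priority-queue behaviour described); note this reading of "remove" drops every occurrence of an extreme value:
import heapq

values = [120, 130, 140, 160, 200, 140, 130, 120, 200]

# Plain min/max removal in one O(N) pass:
lo, hi = min(values), max(values)
trimmed = [x for x in values if x not in (lo, hi)]

# Min/max K items via heapq:
K = 2
extremes = set(heapq.nsmallest(K, values)) | set(heapq.nlargest(K, values))
trimmed_k = [x for x in values if x not in extremes]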
UPDATE: the OP has updated the question; it appears they want a moving (windowed) average and to delete outliers.
The following is an online algorithm -i.e. it can handle streaming data http://en.wikipedia.org/wiki/Online_algorithm
We can retain a moving average: let's say you keep K entries for the average.
Then create a linked list of size K and pointers to the head and tail. Handling items within the first K entries needs to be thought out separately. After the first K retained items, the algorithm can proceed as follows:
Check the next item in the input list against the running K-average. If the value exceeds the acceptable ratio threshold, put its list index into a separate "deletion queue" list. Otherwise, update the running windowed sum as follows:
(a) remove the head entry from the linked list and subtract its value from the running sum
(b) add the latest list entry as the tail of the linked list and add its value to the running sum
(c) recalculate the running average as the running sum /K
Now: how to handle the first K entries? - i.e. before we have a properly initialized running sum?
You will need to make some hard-coded decisions here. A possibility:
run through the first K + 2d (d << K) entries,
keep the d max values and d min values,
remove those d max/min values from that list.
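A compact sketch of the streaming filter described above, using collections.deque in place of the hand-rolled linked list. K and ratio are hypothetical tuning parameters, the warm-up choice here is simply to accept the first K values (one of the hard-coded decisions mentioned), and survivors are yielded instead of queueing indices for deletion:
from collections import deque

def filter_outliers(stream, K=20, ratio=3.0):
    window = deque(maxlen=K)   # plays the role of the linked list
    running_sum = 0.0
    for x in stream:
        if len(window) == K:
            avg = running_sum / K
            if abs(x) > ratio * max(abs(avg), 1e-9):
                continue                 # outlier: skip it, leave the window as-is
            running_sum -= window[0]     # (a) drop the head from the running sum
        window.append(x)                 # (b) the deque evicts the head automatically
        running_sum += x                 # (b) ...and the new value joins the sum
        yield x                          # (c) value survived the filter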

Finding several regions of interest in an array

Say I have conducted an experiment where I've left a Python program running for some long time, and in that time I've taken several measurements of some quantity against time. Each measurement is separated by some value between 1 and 3 seconds, with the time step used much smaller than that... say 0.01s. An example of such an event, if you just take the y-axis, might look like:
[...0,1,-1,4,1,0,0,2,3,1,0,-1,2,3,5,7,8,17,21,8,3,1,0,0,-2,-17,-20,-10,-3,3,1,0,-2,-1,1,0,0,1,-1,0,0,2,0...]
Here we have some period of inactivity followed by a sharp rise, fall, a brief pause around 0, drop sharply, rise sharply and settle again around 0. The dots indicate that this is part of a long stream of data extending in both directions. There will be many of these events over the whole dataset with varying lengths separated by low magnitude regions.
I wish to essentially form an array of 'n' arrays (tuples?) of varying lengths, capturing just the events, so I can analyse them separately later. I can't separate purely by an np.absolute()-type threshold, because there are occasional small regions of near-zero values within a given event, such as in the above example. In addition, there may be occasional blips between measurements with large magnitude but short duration.
The sample above would ideally end up as follows, with a couple of elements or so from the flat region on either side:
[0,-1,2,3,5,7,8,17,21,8,3,1,0,0,-2,-17,-20,-10,-3,3,1,0,-2,-1]
I'm thinking something like:
Input:
[0,1,0,0,-1,4,8,22,16,7,2,1,0,-1,-17,-20,-6,-1,0,1,0,2,1,0,8,-7,-1,0,0,1,0,1,-1,-17,-22,-40,16,1,3,14,17,19,8,2,0,1,3,2,3,1,0,0,-2,1,0,0,-1,22,4,0,-1,0]
Split based on some number of consecutive values below a magnitude of 2.
[[-1,4,8,22,16,7,2,1,0,-1,-17,-20,-6,-1],[8,-7,-1,0],[-1,-17,-22,-40,16,1,3,14,17,19,8,2,0],[1,22,4,]]
If a subarray's length is less than, say, 10, then remove it:
[[-1,4,8,22,16,7,2,1,0,-1,-17,-20,-6,-1],[-1,-17,-22,-40,16,1,3,14,17,19,8,2,0]]
Is this a good way to approach it? The first step is also confusing me a little, and I need to preserve those small low-magnitude regions within an event.
Re-edited! I'm going to be comparing two signals, each measured as a function of time, so they will be zipped together in a list of tuples.
Here are my two cents, based on exponential smoothing.
import itertools
import numpy as np

A = np.array([0,1,0,0,-1,4,8,22,16,7,2,1,0,-1,-17,-20,-6,-1,0,1,0,2,1,0,8,-7,-1,0,0,1,0,1,-1,-17,-22,-40,16,1,3,14,17,19,8,2,0,1,3,2,3,1,0,0,-2,1,0,0,-1,22,4,0,-1,0])
B = np.hstack(([0, 0], A, [0, 0]))
B = np.asanyarray(list(zip(*[B[i:] for i in range(5)])))  # 5-element sliding windows
C = (B * [0.25, 0.5, 1, 0.5, 0.25]).mean(axis=1)  # C is the exponentially smoothed signal
D = []
for key, group in itertools.groupby(enumerate(C), lambda x: abs(x[1]) > 1.5):
    if key:
        # Keep the indices where the smoothed signal has magnitude > 1.5.
        # Change 1.5 to control the behavior.
        D.append(list(group))
E = [D[0]]
for item in D[1:]:
    if (item[0][0] - E[-1][-1][0]) < 5:  # Merge interesting regions 5 or fewer indices apart. Change 5 to control the behavior.
        E[-1] = E[-1] + item
    else:
        E.append(item)
print([(item[0][0], item[-1][0]) for item in E])
# Filter out interesting regions fewer than 10 samples long.
events = [A[item[0][0]: item[-1][0]] for item in E if (item[-1][0] - item[0][0]) > 9]
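As an aside, the window construction and run-finding can be done entirely in NumPy: np.convolve reproduces the smoothing (the kernel is symmetric, so no flip is needed), and a diff trick finds the boundaries of the above-threshold runs. A sketch using the same threshold, though it omits the region-merging step above:
import numpy as np

weights = np.array([0.25, 0.5, 1, 0.5, 0.25])
C = np.convolve(A, weights / 5, mode="same")   # same smoothing as B above
mask = (np.abs(C) > 1.5).astype(int)
edges = np.flatnonzero(np.diff(np.concatenate(([0], mask, [0]))))
starts, stops = edges[::2], edges[1::2]        # run boundaries of the mask
regions = [A[s:e] for s, e in zip(starts, stops) if e - s > 9]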
