Optimizing random sampling for scaling permutations - python

I have a two-fold problem here.
1) My initial problem, which was trying to improve the computational time of my algorithm.
2) My next problem, which is that one of my 'improvements' appears to consume over 500 GB of RAM and I don't really know why.
I've been writing a permutation test for a bioinformatics pipeline. Basically, the algorithm starts off sampling 1000 null variants for each variant I have. If a certain condition is not met, the algorithm repeats the sampling with 10000 nulls this time. If the condition is not met again, this up-scales to 100000, and then 1000000 nulls per variant.
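For illustration, a minimal sketch of that escalation pattern; run_sampling() and condition_met() are placeholder names, not the real pipeline functions:
# Placeholder sketch of the up-scaling loop described above.
for num_permutations in (1000, 10000, 100000, 1000000):
    names, p_values = run_sampling(variant_list, num_permutations)  # hypothetical helper
    if condition_met(p_values):  # hypothetical condition check
        break  # condition satisfied, no need to up-scale further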
I've done everything I can think of to optimize this, and at this point I'm stumped for improvements. I have this gnarly looking comprehension:
output_names_df, output_p_values_df = map(list, zip(*[random_sampling_for_variant(variant, options_tables, variant_to_bins, num_permutations, use_num) for variant in variant_list]))
Basically all this is doing is calling my random_sampling_for_variant function on each variant in my variant list and throwing the two outputs from that function into a list of lists (so I end up with two lists of lists, output_names_df and output_p_values_df). I then turn these lists of lists back into DataFrames, rename the columns, and do whatever I need with them. The function being called looks like:
def random_sampling_for_variant(variant, options_tables, variant_to_bins, num_permutations, use_num):
    """
    Inner function to permute_null_variants_single
    equivalent to permutation for 1 variant
    """
    # Get nulls that are in the same bin as our variant
    permuted_null_variant_table = options_tables.get_group(variant_to_bins[variant])
    # If number of permutations > number of nulls, sample with replacement
    if num_permutations >= len(permuted_null_variant_table):
        replace = True
    else:
        replace = False
    # Select rows for permutation, then add as columns to our temporary dfs
    picked_indices = permuted_null_variant_table.sample(n=num_permutations, replace=replace)
    temp_names_df = picked_indices['variant'].values[:use_num]
    temp_p_values_df = picked_indices['GWAS_p_value'].values
    return (temp_names_df, temp_p_values_df)
When defining permuted_null_variant_table, I'm just querying a pre-defined grouped DataFrame to determine the appropriate null table to sample from. I found this to be faster than trying to determine the appropriate nulls to sample from on the fly. The logic that follows determines whether I need to sample with or without replacement, and it takes hardly any time at all. Defining picked_indices is where the random sampling is actually done. temp_names_df and temp_p_values_df get the values that I want out of the null table, and then I send my returns out of the function.
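A hedged sketch of how the pre-grouped table and the per-variant lookup described above might fit together; the toy data and the 'bin' column name are assumptions, while 'variant' and 'GWAS_p_value' come from the code above:
import pandas as pd

# Toy null table; the real one is much larger and loaded elsewhere.
null_table = pd.DataFrame({
    'bin': ['a', 'a', 'b', 'b', 'b'],
    'variant': ['n1', 'n2', 'n3', 'n4', 'n5'],
    'GWAS_p_value': [0.2, 0.5, 0.01, 0.7, 0.33],
})
options_tables = null_table.groupby('bin')        # grouped once, reused for every variant
variant_to_bins = {'rs123': 'b'}                  # maps each real variant to its bin

group = options_tables.get_group(variant_to_bins['rs123'])  # nulls in the same bin
sampled = group.sample(n=5, replace=True)                   # replace=True when n exceeds the group size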
The problem here is that the above doesn't scale very well. For 1000 permutations, the above lines of code take ~254 seconds on 7725 variants. For 10,000 permutations, ~333 seconds. For 100,000 permutations, ~720 seconds. For 1,000,000 permutations, which is the upper boundary I'd like to get to, my process got killed because it was apparently going to take up more RAM than the cluster had.
I'm stumped on how to proceed. I've turned all my ints into 8-bit uints, and all my floats into 16-bit floats. I've dropped columns that I don't need where I don't need them, so I'm only ever sampling from tables with the columns I need. Initially I had the comprehension in the form of a loop, but the computational time of the comprehension is 3/5 that of the loop. Any feedback is appreciated.
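As an aside, downcasting of the kind mentioned above can be done roughly like this (made-up column names and data, purely illustrative):
import numpy as np
import pandas as pd

df = pd.DataFrame({'count_col': [1, 2, 3], 'p_col': [0.2, 0.5, 0.01]})
df['count_col'] = df['count_col'].astype(np.uint8)    # 8-bit unsigned ints
df['p_col'] = df['p_col'].astype(np.float16)          # 16-bit floats
print(df.dtypes)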

Related

How to make a huge loop for checking condition in dataframe run faster in mac

I need to calculate a huge table of values (157954 rows and 365 columns) by checking three conditions against a dataframe with 11 million rows. Do you have any way to speed up the calculation, which is taking more than 10 hours now?
I have 367 stations in total.
for station in stations:
    no_pickup_array = []
    for time_point in data_matrix['Timestamp']:
        time_point_2 = time_point + timedelta(minutes=15)
        no_pickup = len(dataframe[(time_point <= dataframe["departure"]) & (dataframe["departure"] < time_point_2)
                                  & (dataframe['departure_name'] == station)])
        no_pickup_array.append(no_pickup)
    print(f"Station name: {station}")
    data_matrix[station] = no_pickup_array
I appreciate any of your help.
To all: Thank you for your comments; I've added more info about my problem.
Each row of dataframe is the info for one bike rental. I want to create a matrix with the number of bikes picked up at each station for each 15-minute interval. Then I also want to calculate the average speed, average time, etc. as well.
The solution from @Jérôme Richard could reduce the number of calculations, but I still struggle to understand and implement the indexing step and apply a logarithmic (binary) search.
index = {name: df.sort_values('departure')['departure'].to_numpy() for name, df in dataframe.groupby('departure_name')}
# code @Jérôme Richard recommended
The main problem is the right-hand side of the no_pickup assignment, which is algorithmically inefficient because it performs a linear search while a logarithmic search is possible.
The first thing to do is a groupby of dataframe so as to build an index that lets you fetch the dataframe subset having a given name. Then you can sort each dataframe subset by departure so as to be able to perform a binary search that tells you the number of items fitting the condition.
The index can be built with something like:
index = {name: df.sort_values('departure')['departure'].to_numpy() for name, df in dataframe.groupby('departure_name')}
Finally, you can do the binary search with two np.searchsorted calls on index[station]: one to find the starting index and one to find the ending index. You can get the count with a simple subtraction of the two.
Note that you may need some tweaks since I am not sure the above code will work on your dataset, but that is hard to know without an example of code generating the inputs.
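A rough sketch of the searchsorted counting described above, with a tiny stand-in array in place of the real index (not tested against the actual dataset):
import numpy as np

# Assuming `index` has been built as above; here a small stand-in for one station.
departures = np.array([1.0, 2.0, 4.0, 7.0, 9.0])   # sorted departure times for one station
time_point, time_point_2 = 2.0, 7.0                 # one 15-minute window in the real data
start = np.searchsorted(departures, time_point, side='left')   # first departure >= time_point
end = np.searchsorted(departures, time_point_2, side='left')   # first departure >= time_point_2
no_pickup = end - start                                        # departures in [time_point, time_point_2)
print(no_pickup)  # 2 -> the departures at 2.0 and 4.0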
You're indexing the dataframe list with a boolean (which will be zero or one, so you're only ever going to get the length of the first or second element) instead of a number. It's going to get evaluated like so:
len(dataframe[(time_point <= dataframe["departure"]) & (dataframe["departure"] < time_point_2) & (dataframe['departure_name'] == station)])
len(dataframe[True & False & True]) # let's just say the variables work out like this
len(dataframe[False])
len(dataframe[0])
This probably isn't the behavior you're after. (let me know what you're trying to do in a comment and I'll try to help out more.)
In terms of code speed specifically, & is the bitwise "AND"; in Python the boolean operators are written out as and, or, and not. Using and here would speed up your code, since Python only evaluates the parts of boolean expressions that are needed, e.g.
from time import sleep
def slow_function():
    sleep(3)
    return False
# This line doesn't take 3 seconds to run as you may expect.
# Python sees "False and" and is smart enough to realize that whatever comes after is irrelevant.
# No matter what comes after "False and", it's never going to make the first half True.
# So, python doesn't bother evaluating it, and saves 3 seconds in the process.
False and slow_function()
# Some more examples that show python doesn't evaluate the right half unless it needs to
False and print("hi")
False and asdfasdfasdf
False and 42/0
# The same does not happen here. These are bitwise operators, expected to be applied to numbers.
# While it does produce the correct result for boolean inputs, it's going to be slower,
# since it can't rely on the same optimization.
False & slow_function()
# Of course, both of these still take 3 seconds, since the right half has to be evaluated either way.
True and slow_function()
True & slow_function()

How to generate unique(!) arrays/lists/sequences of uniformly distributed random numbers

Let's say I generate a pack, i.e., a one-dimensional array of 10 random numbers, with a random generator. Then I generate another array of 10 random numbers. I do this X times. How can I generate unique arrays, so that even after a trillion generations there is no array equal to another?
In one array the elements can be duplicates; the array just has to differ from every other array in at least one of its elements.
Is there any numpy method for this? Is there some special algorithm which works differently by exploring some space for the random generation? I don't know.
One easy answer would be to write the arrays to a file and check whether they were generated already, but the I/O operations on an ever-growing file need way too much time.
This is a difficult request, since one of the properties of an RNG is that it can repeat sequences randomly.
You also have the problem of trying to record terabytes of prior results. One thing you could try is to form a hash table (for search speed) of the existing arrays. Using this depends heavily on whether you have sufficient RAM to hold the entire list.
If not, you might consider disk-mapping a fast search structure of some sort. For instance, you could implement an on-disk binary tree of hash keys, re-balancing whenever you double the size of the tree (with insertions). This lets you keep the file open and find entries via seek, rather than needing to represent the full file in memory.
You could also maintain an in-memory index to the table, using that to drive your seek to the proper file section, then reading only a small subset of the file for the final search.
Does that help focus your implementation?
Assume that the 10 numbers in a pack are each in the range [0..max]. Each pack can then be considered as a 10 digit number in base max+1. Obviously, the size of max determines how many unique packs there are. For example, if max=9 there are 10,000,000,000 possible unique packs from [0000000000] to [9999999999].
The problem then comes down to generating unique numbers in the correct range.
Given your "trillions" then the best way to generate guaranteed unique numbers in the range is probably to use an encryption with the correct size output. Unless you want 64 bit (DES) or 128 bit (AES) output then you will need some sort of format preserving encryption to get output in the range you want.
For input, just encrypt the numbers 0, 1, 2, ... in turn. Encryption guarantees that, given the same key, the output is unique for each unique input. You just need to keep track of how far you have got with the input numbers. Given that, you can generate more unique packs as needed, within the limit imposed by max. After that point the output will start repeating.
Obviously as a final step you need to convert the encryption output to a 10 digit base max+1 number and put it into an array.
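Leaving the encryption step itself aside, a small sketch of that final conversion (a number turned into a 10-digit, base max+1 pack); the function name is made up:
def int_to_pack(n, base, length=10):
    # Write n as `length` digits in base `base`; each digit lies in [0, base-1].
    digits = []
    for _ in range(length):
        n, d = divmod(n, base)
        digits.append(d)
    return digits[::-1]

# e.g. with max = 9 (base 10), the counter value 1234 becomes [0, 0, 0, 0, 0, 0, 1, 2, 3, 4]
print(int_to_pack(1234, base=10))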
Important caveat:
This will not allow you to generate "arbitrarily" many unique packs. Please see the limits as highlighted by @Prune.
Note that as the number of requested packs approaches the number of unique packs, this takes longer and longer to find a pack. I also put in a safety so that after a certain number of tries it just gives up.
Feel free to adjust:
import random

## -----------------------
## Build a unique pack generator
## -----------------------
def build_pack_generator(pack_length, min_value, max_value, max_attempts):
    existing_packs = set()

    def _generator():
        pack = tuple(random.randint(min_value, max_value) for _ in range(pack_length))
        pack_hash = hash(pack)
        attempts = 1
        while pack_hash in existing_packs:
            if attempts >= max_attempts:
                raise KeyError("Unable to find a valid pack")
            pack = tuple(random.randint(min_value, max_value) for _ in range(pack_length))
            pack_hash = hash(pack)
            attempts += 1
        existing_packs.add(pack_hash)
        return list(pack)

    return _generator

generate_unique_pack = build_pack_generator(2, 1, 9, 1000)
## -----------------------
for _ in range(50):
    print(generate_unique_pack())
The Birthday problem suggests that at some point you don't need to bother checking for duplicates. For example, if each value in a 10 element "pack" can take on more than ~250 values then you only have a 50% chance of seeing a duplicate after generating 1e12 packs. The more distinct values each element can take on the lower this probability.
You've not specified what these random values are in this question (other than being uniformly distributed) but your linked question suggests they are Python floats. Hence each number has 2**53 distinct values it can take on, and the resulting probability of seeing a duplicate is practically zero.
There are a few ways of rearranging this calculation:
for a given amount of state and number of iterations what's the probability of seeing at least one collision
for a given amount of state how many iterations can you generate to stay below a given probability of seeing at least one collision
for a given number of iterations and probability of seeing a collision, what state size is required
The below Python code calculates option 3 as it seems closest to your question. The other options are available on the birthday attack page.
from math import log2, log1p

def birthday_state_size(size, p):
    # -log1p(-p) is a numerically stable version of log(1/(1-p))
    return size**2 / (2 * -log1p(-p))

log2(birthday_state_size(1e12, 1e-6))  # => ~100
So as long as you have more than 100 uniform bits of state in each pack everything should be fine. For example, two or more Python floats is OK (2 * 53), as is 10 integers with >= 1000 distinct values (10*log2(1000)).
You can of course reduce the probability down even further, but as noted in the Wikipedia article going below 1e-15 quickly approaches the reliability of a computer. This is why I say "practically zero" given the 530 bits of state provided by 10 uniformly distributed floats.
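For completeness, option 1 from the list above (the probability of at least one collision for a given state size and number of iterations) can be approximated by inverting the same relationship, roughly p ~ 1 - exp(-n^2 / (2d)); a small sketch:
from math import expm1

def birthday_collision_probability(n, d):
    # Approximate probability of at least one collision after n draws
    # from d equally likely values: 1 - exp(-n**2 / (2*d)).
    return -expm1(-n**2 / (2 * d))

print(birthday_collision_probability(1e12, 2**100))  # ~4e-7, consistent with the ~100 bits above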

Memory problems for multiple large arrays

I'm trying to do some calculations on over 1000 (100, 100, 1000) arrays. But as I could imagine, it doesn't take more than about 150-200 arrays before my memory is used up, and it all fails (at least with my current code).
This is what I currently have now:
import numpy as np
toxicity_data_path = open("data/toxicity.txt", "r")
toxicity_data = np.array(toxicity_data_path.read().split("\n"), dtype=int)
patients = range(1, 1000, 1)
The above is just a list of 1's and 0's (indicating toxicity or not) for each array (in this case one array is data for one patient). So in this case roughly 1000 patients.
I then create two lists from the above code so I have one list with patients having toxicity and one where they have not.
patients_no_tox = [i for i, e in enumerate(toxicity_data.astype(np.str)) if e in set("0")]
patients_with_tox = [i for i, e in enumerate(toxicity_data.astype(np.str)) if e in set("1")]
I then write this function, which takes an already saved-to-disk array ((100, 100, 1000)) for each patient, and then removes some indices (also loaded from a saved file) from each array because they will not work later on or just need to be removed. So it is essential to do so. The result is a final list of all patients and their 3D arrays of data. This is where things start to eat memory, when the function is used in the list comprehension.
def log_likely_list(patient, remove_index_list):
    array_data = np.load("data/{}/array.npy".format(patient)).ravel()
    return np.delete(array_data, remove_index_list)

remove_index_list = np.load("data/remove_index_list.npy")
final_list = [log_likely_list(patient, remove_index_list) for patient in patients]
Next step is to create two lists that I need for my calculations. I take the final list, with all the patients, and remove either patients that have toxicity or not, respectively.
patients_no_tox_list = np.column_stack(np.delete(final_list, patients_with_tox, 0))
patients_with_tox_list = np.column_stack(np.delete(final_list, patients_no_tox, 0))
The last piece of the puzzle is to use these two lists in the following equation, where I put the non-tox list into the right side of the equation, and with tox on the left side. It then sums up for all 1000 patients for each individual index in the 3D array of all patients, i.e. same index in each 3D array/patient, and then I end up with a large list of values pretty much.
log_likely = (np.sum(np.log(patients_with_tox_list), axis=1) +
              np.sum(np.log(1 - patients_no_tox_list), axis=1))
My problem, as stated is, that when I get around 150-200 (in the patients range) my memory is used, and it shuts down.
I have obviously tried to save stuff to disk and load it back (that's why I load so many files), but that didn't help me much. I'm thinking maybe I could go one array at a time into the log_likely function, but in the end, before summing, I would probably still have just as large an array; plus, the computation might be a lot slower if I can't use the numpy sum feature and such.
So is there any way I could optimize/improve on this, or is the only way to buy a hell of a lot more RAM?
Each time you use a list comprehension, you create a new copy of the data in memory. So this line:
final_list = [log_likely_list(patient, remove_index_list) for patient in patients]
contains the complete data for all 1000 patients!
The better choice is to utilize generator expressions, which process items one at a time. To form a generator, surround your for ... in ... expression with parentheses instead of brackets. This might look something like:
import functools
import itertools

with_tox_data = (log_likely_list(patient, remove_index_list) for patient in patients_with_tox)
with_tox_log = (np.log(data) for data in with_tox_data)
no_tox_data = (log_likely_list(patient, remove_index_list) for patient in patients_no_tox)
no_tox_log = (np.log(1 - data) for data in no_tox_data)
final_data = itertools.chain(with_tox_log, no_tox_log)
Note that no computations have actually been performed yet: generators don't do anything until you iterate over them. The fastest way to aggregate all the results in this case is to use reduce:
log_likely = functools.reduce(np.add, final_data)
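A toy, self-contained illustration of the generator-plus-reduce pattern (made-up arrays standing in for the patient data):
import functools
import numpy as np

# Three fake "patients", each a small 1-D array; in the real code these come from np.load.
fake_patients = (np.array([0.2, 0.4, 0.6]) for _ in range(3))
logs = (np.log(p) for p in fake_patients)   # still lazy: nothing has been loaded or computed yet
total = functools.reduce(np.add, logs)      # adds arrays one at a time, so only one is in memory at once
print(total)                                # element-wise sum of the three log-arrays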

How to calculate Delta F / F using python?

I've recently "taught" myself python in order to analyze data for my experiments. As such I'm pretty clueless on many aspects. I've managed to make my analysis work for certain files but in some cases it breaks down and I imagine it is a result of faulty programming.
Currently I export a file containing 3 numpy arrays. One of these arrays is my signal (float values from -10 to 10). What I wish to do is to normalize every datum in this array to a range of values that preceed it. (i.e. the 30001st value must have the average of the preceeding 3000 values subtracted from it and then the difference must then be divided by thisvery same average (the preceeding 3000 values). My data is collected at a rate of 100Hz thus to get a normalization of the alst 30s i must use the preceeding 3000values.
As it stand this is how I've managed to make it work:
This stores the signal in the variable photosignal:
photosignal = np.array(seg.analogsignals[0], ndmin=1)
Now this is the part I use to get the delta F/F over a moving window of 30 s:
normalizedphotosignal = [(uu-(np.mean(photosignal[uu-3000:uu])))/abs(np.mean(photosignal[uu-3000:uu])) for uu in photosignal[3000:]]
The following adds 3000 values to the beginning to keep the array the same length, since later on I must time-lock it to another list that is the same length:
holder = list(range(3000))
normalizedphotosignal = holder + normalizedphotosignal
What I have noticed is that in certain files this code gives me an error because it says that the "slice" is empty and therefore it cannot create a mean.
I think maybe there is a better way to program this that could avoid the problem altogether. Or is this a correct way to approach the problem?
So I tried the solution, but it is quite slow and it nevertheless still gives me the "empty slice" error.
I went over the moving average post and found this method:
def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N
However, I'm having trouble adapting it to my desired output, namely (x - running average) / running average.
Alright, so I finally figured it out thanks to your help and the posts you referred me to.
The calculation for my entire dataset (300,000+ values) takes about a second!
I used the following code:
def runningmean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N

photosignal = np.array(seg.analogsignal[0], ndmin=1)
photosignalaverage = runningmean(photosignal, 3000)
holder = np.zeros(2999)
photosignalaverage = np.append(holder, photosignalaverage)
deltafsignal = (photosignal - photosignalaverage) / abs(photosignalaverage)
Photosignal stores my raw signal in a numpy array.
Photosignalaverage uses cumsum to calculate the running average of every datapoint in photosignal. I then add 2999 zeros at the beginning to maintain the same list size as my photosignal.
I then use basic numpy calculations to get my delta F/F signal.
Thank you once more for the feedback, was truly helpful!
Your approach goes in the right direction. However, you made a mistake in your list comprehension: you are using uu as your index whereas uu are the elements of your input data photosignal.
You want something like this:
normalizedphotosignal2 = np.zeros(photosignal.shape[0] - 3000)
for i, uu in enumerate(photosignal[3000:]):
    window_mean = np.mean(photosignal[i:i + 3000])   # mean of the 3000 values preceding uu
    normalizedphotosignal2[i] = (uu - window_mean) / abs(window_mean)
Keep in mind that for-loops are relatively slow in Python. If performance is an issue here, you could try avoiding the for loop and using numpy methods instead (e.g. have a look at Moving average or running mean).
Hope this helps.

Finding several regions of interest in an array

Say I have conducted an experiment where I've left a Python program running for some long time and in that time I've taken several measurements of some quantity against time. Each measurement is separated by somewhere between 1 and 3 seconds, with the time step used much smaller than that... say 0.01 s. An example of such an event, if you just take the y axis, might look like:
[...0,1,-1,4,1,0,0,2,3,1,0,-1,2,3,5,7,8,17,21,8,3,1,0,0,-2,-17,-20,-10,-3,3,1,0,-2,-1,1,0,0,1,-1,0,0,2,0...]
Here we have some period of inactivity followed by a sharp rise and fall, a brief pause around 0, a sharp drop, a sharp rise, and a return to around 0. The dots indicate that this is part of a long stream of data extending in both directions. There will be many of these events over the whole dataset, with varying lengths, separated by low-magnitude regions.
I wish to essentially form an array of 'n' arrays (tuples?) with varying lengths, capturing just the events so I can analyse them separately later. I can't separate purely by an np.absolute() type threshold because there are occasional small regions of near-zero values within a given event, such as in the above example. In addition to this, there may be occasional blips in between measurements with large magnitudes but short duration.
The sample above would ideally end up as below, with a couple of elements or so from the flat region on either side:
[0,-1,2,3,5,7,8,17,21,8,3,1,0,0,-2,-17,-20,-10,-3,3,1,0,-2,-1]
I'm thinking something like:
Input:
[0,1,0,0,-1,4,8,22,16,7,2,1,0,-1,-17,-20,-6,-1,0,1,0,2,1,0,8,-7,-1,0,0,1,0,1,-1,-17,-22,-40,16,1,3,14,17,19,8,2,0,1,3,2,3,1,0,0,-2,1,0,0,-1,22,4,0,-1,0]
Split based on some number of consecutive values below a magnitude of 2.
[[-1,4,8,22,16,7,2,1,0,-1,-17,-20,-6,-1],[8,-7,-1,0],[-1,-17,-22,-40,16,1,3,14,17,19,8,2,0],[1,22,4]]
If a sub-array's length is less than, say, 10 then remove it:
[[-1,4,8,22,16,7,2,1,0,-1,-17,-20,-6,-1],[-1,-17,-22,-40,16,1,3,14,17,19,8,2,0]]
Is this a good way to approach it? The first step is also confusing me a little, since I need to preserve those small low-magnitude regions within an event.
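For reference, a rough sketch of the two-step idea above (split where several consecutive values stay below a magnitude threshold, then drop short pieces); the run length of 4 and both thresholds are guesses that would need tuning, so this will not reproduce the desired output exactly:
import numpy as np

signal = np.array([0,1,0,0,-1,4,8,22,16,7,2,1,0,-1,-17,-20,-6,-1,0,1,0,2,1,0,8,-7,-1,0,0,1,0,1,-1,-17,-22,-40,16,1,3,14,17,19,8,2,0,1,3,2,3,1,0,0,-2,1,0,0,-1,22,4,0,-1,0])
quiet = np.abs(signal) < 2            # candidate "inactive" samples

events, current, quiet_run = [], [], 0
for value, is_quiet in zip(signal, quiet):
    current.append(int(value))
    quiet_run = quiet_run + 1 if is_quiet else 0
    if quiet_run >= 4:                # a long enough quiet run closes the current event
        piece = current[:-4]          # drop the quiet tail
        if piece:
            events.append(piece)
        current, quiet_run = [], 0
if current:
    events.append(current)

events = [e for e in events if len(e) >= 10]   # remove short blips
print(events)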
Re-edited! I'm going to be comparing two signals, each measured as a function of time, so they will be zipped together in a list of tuples.
Here are my two cents, based on exponential smoothing.
import itertools
import numpy as np

A = np.array([0,1,0,0,-1,4,8,22,16,7,2,1,0,-1,-17,-20,-6,-1,0,1,0,2,1,0,8,-7,-1,0,0,1,0,1,-1,-17,-22,-40,16,1,3,14,17,19,8,2,0,1,3,2,3,1,0,0,-2,1,0,0,-1,22,4,0,-1,0])
B = np.hstack(([0,0], A, [0,0]))
B = np.asanyarray(list(zip(*[B[i:] for i in range(5)])))
C = (B * [0.25, 0.5, 1, 0.5, 0.25]).mean(axis=1)  # C is the 5-element sliding-window exponentially smoothed signal
D = []
for item in itertools.groupby(enumerate(C), lambda x: abs(x[1]) > 1.5):
    if item[0]:
        D.append(list(item[1]))  # Get the indices where the smoothed signal is of magnitude > 1.5. Change 1.5 to control the behavior.
E = [D[0]]
for item in D[1:]:
    if (item[0][0] - E[-1][-1][0]) < 5:  # Merge interesting regions if they are fewer than 5 indices apart. Change 5 to control the behavior.
        E[-1] = E[-1] + item
    else:
        E.append(item)
print([(item[0][0], item[-1][0]) for item in E])
[A[item[0][0]: item[-1][0]] for item in E if (item[-1][0] - item[0][0]) > 9]  # Filter out the interesting regions < 10 in length.
