How can I select samples with the apply method, without replacement, in a loop - Python

I would like to select "dmin" samples from each group (via groupby) in a dataframe and add them to another, initially empty, dataframe.
If the total number of samples collected is not yet enough, another "dmin" samples should be selected per group and added to the dataframe. This loop needs to repeat until the total number of samples we need is covered.
I am new to coding and cannot pin down the problem, but in my code the samples are selected only once and the selection is not repeated per group.
Another problem is that, inside the loop, the number of records in a group may drop below the value of "dmin", in which case the code fails with "number of samples in the group is less than dmin".
I was wondering if you could help me. This is part of my code:
while V > 0:
    x6 = result_sort[result_sort['K'] > p_ratio].groupby('position').apply(lambda x: x.sample(dmin).reset_index(drop=True))
    A = x6.append(A)
    S = len(A)
    V = V_total - S

I solved my problem by adding another condition to the while loop.
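For reference, a minimal sketch of one way to guard both issues; the toy data, the min(dmin, len(g)) cap, and the pd.concat swap are my assumptions, not the exact fix described above:
import numpy as np
import pandas as pd

# toy stand-ins for the question's result_sort, p_ratio, dmin and V_total (assumed values)
rng = np.random.default_rng(0)
result_sort = pd.DataFrame({'K': rng.random(60),
                            'position': rng.integers(0, 5, 60)})
p_ratio, dmin, V_total = 0.3, 4, 100

A = pd.DataFrame()
V = V_total
eligible = result_sort[result_sort['K'] > p_ratio]

while V > 0 and len(eligible) > 0:
    # cap each group's draw at its size, so groups smaller than dmin no longer raise
    x6 = (eligible.groupby('position')
                  .apply(lambda g: g.sample(min(dmin, len(g))))
                  .reset_index(drop=True))
    A = pd.concat([x6, A], ignore_index=True)  # DataFrame.append is deprecated in newer pandas
    V = V_total - len(A)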


Optimizing random sampling for scaling permutations

I have a two-fold problem here.
1) My initial problem, which was trying to improve the computational time of my algorithm.
2) My next problem: one of my 'improvements' appears to consume over 500 GB of RAM, and I don't really know why.
I've been writing a permutation procedure for a bioinformatics pipeline. Basically, the algorithm starts off by sampling 1000 null variants for each variant I have. If a certain condition is not met, the algorithm repeats the sampling with 10000 nulls this time. If the condition is still not met, this scales up to 100000, and then 1000000 nulls per variant.
I've done everything I can think of to optimize this, and at this point I'm stumped for improvements. I have this gnarly-looking comprehension:
output_names_df, output_p_values_df = map(list, zip(*[random_sampling_for_variant(variant, options_tables, variant_to_bins, num_permutations, use_num) for variant in variant_list]))
Basically all this is doing is calling my random_sampling_for_variant function on each variant in my variant list, and throwing the two outputs from that function into lists (so I end up with two lists of lists, output_names_df and output_p_values_df). I then turn these lists of lists back into DataFrames, rename the columns, and do whatever I need with them. The function being called looks like:
def random_sampling_for_variant(variant, options_tables, variant_to_bins, num_permutations, use_num):
    """
    Inner function to permute_null_variants_single
    equivalent to permutation for 1 variant
    """
    # Get nulls that are in the same bin as our variant
    permuted_null_variant_table = options_tables.get_group(variant_to_bins[variant])
    # If number of permutations >= number of nulls, sample with replacement
    if num_permutations >= len(permuted_null_variant_table):
        replace = True
    else:
        replace = False
    # Select rows for permutation, then add as columns to our temporary dfs
    picked_indices = permuted_null_variant_table.sample(n=num_permutations, replace=replace)
    temp_names_df = picked_indices['variant'].values[:use_num]
    temp_p_values_df = picked_indices['GWAS_p_value'].values
    return temp_names_df, temp_p_values_df
When defining permuted_null_variant_table, I'm just querying a pre-defined grouped DataFrame to determine the appropriate null table to sample from. I found this to be faster than trying to determine the appropriate nulls to sample from on the fly. The logic that follows determines whether I need to sample with or without replacement, and it takes hardly any time at all. Defining picked_indices is where the random sampling is actually done. temp_names_df and temp_p_values_df take the values I want out of the null table, and then I return them from the function.
The problem here is that the above doesn't scale very well. For 1000 permutations, the above lines of code take ~254 seconds on 7725 variants. For 10,000 permutations, ~333 seconds. For 100,000 permutations, ~720 seconds. For 1,000,000 permutations, which is the upper boundary I'd like to reach, my process got killed because it was apparently going to take up more RAM than the cluster had.
I'm stumped on how to proceed. I've turned all my ints into 8-bit uints, and all my floats into 16-bit floats. I've dropped columns that I don't need where I don't need them, so I'm only ever sampling from tables with the columns I need. Initially I had the comprehension in the form of a loop, but the computational time of the comprehension is three-fifths that of the loop. Any feedback is appreciated.
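One common way to cut both the time and the memory in a loop like this is to sample integer row positions with numpy and index each column's underlying array directly, instead of materializing a sampled DataFrame per variant. A sketch of that idea follows; random_sampling_for_variant_np is a hypothetical rewrite, not code from the thread:
import numpy as np

def random_sampling_for_variant_np(variant, options_tables, variant_to_bins,
                                   num_permutations, use_num):
    # same bin lookup as the original function
    table = options_tables.get_group(variant_to_bins[variant])
    n = len(table)
    replace = num_permutations >= n
    # draw integer row positions once, then index the raw column arrays;
    # this skips building an intermediate sampled DataFrame per variant
    idx = np.random.choice(n, size=num_permutations, replace=replace)
    names = table['variant'].to_numpy()[idx][:use_num]
    p_values = table['GWAS_p_value'].to_numpy()[idx]
    return names, p_values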

Python repeating a function using the result of the previous function

I am a bit new to Python and have been searching for and trying different solutions to this issue.
I need to create a function that not only counts down within the function but also adds up the previous results.
To help put this in context:
I have a formula for a weekly cost, where Time corresponds to the current time within the model. It looks like the following:
week1 = 5000**((Time-1))
week2 = 5000**((Time-2))
...
(where the number subtracted from Time increases by one over a specific range)
Now the end result needs to be (for example):
if Time > 5:
    return week1 + week2 + week3 + week4 + week5
elif Time == 5:
    return week1 + week2 + week3 + week4
This would continue down to Time <= 1. So I need a formula where not only is the function repeated a specific number of times, adding the previous result each time, but one of the variables in the formula also changes based on the count. I know there must be an efficient way to do this with a loop, but I cannot seem to figure it out.
Any help would be amazing!
Thanks!
One way of solving this problem is using recursion. Put simply, a recursive function will continue to call itself until a specific condition is met (Time <= 1 in this example).
The downside of doing this is that it uses more memory than a simple loop.
An example of this would be:
def funcName(time):
    total = 0
    if time > 1:
        # accumulate all earlier weeks first
        total = funcName(time - 1)
    total += 5000 ** (time - 1)
    return total
I think your formula is wrong; it should be:
week1 = 5000 * (Time - 1)
With a simple loop:
result = 0
for i in range(1, Time):  # weeks 1 .. Time-1
    result += 5000 * (Time - i)
print(result)
You can achieve the same in one line using sum and a generator expression.
result = sum(5000 * (Time - i) for i in range(1, Time))
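A quick sanity check of the loop and the one-liner, assuming the weekly terms really are week_k = 5000 * (Time - k) for k = 1 .. Time - 1:
Time = 5
weeks = [5000 * (Time - k) for k in (1, 2, 3, 4)]  # week1..week4 = 20000, 15000, 10000, 5000
assert sum(weeks) == 50000
assert sum(5000 * (Time - i) for i in range(1, Time)) == 50000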

How to calculate Delta F/F using Python?

I've recently "taught" myself python in order to analyze data for my experiments. As such I'm pretty clueless on many aspects. I've managed to make my analysis work for certain files but in some cases it breaks down and I imagine it is a result of faulty programming.
Currently I export a file containing 3 numpy arrays. One of these arrays is my signal (float values from -10 to 10). What I wish to do is to normalize every datum in this array to a range of values that preceed it. (i.e. the 30001st value must have the average of the preceeding 3000 values subtracted from it and then the difference must then be divided by thisvery same average (the preceeding 3000 values). My data is collected at a rate of 100Hz thus to get a normalization of the alst 30s i must use the preceeding 3000values.
As it stands, this is how I've managed to make it work:
# this stores the signal into the variable photosignal
photosignal = np.array(seg.analogsignals[0], ndmin=1)
# now this is the part I use to get the delta F/F over a moving window of 30 s
normalizedphotosignal = [(uu - (np.mean(photosignal[uu-3000:uu]))) / abs(np.mean(photosignal[uu-3000:uu])) for uu in photosignal[3000:]]
# the following adds 3000 values to the beginning to keep the array the same
# length, since later on I must time-lock it to another list of the same length
holder = list(range(3000))
normalizedphotosignal = holder + normalizedphotosignal
What I have noticed is that with certain files this code gives me an error saying the "slice" is empty and therefore it cannot compute a mean.
I think maybe there is a better way to program this that avoids the problem altogether. Or is this a correct way to approach the problem?
So I tried the solution, but it is quite slow and it nevertheless still gives me the "empty slice" error.
I went over the moving-average post and found this method:
def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N
However, I'm having trouble adapting it to my desired output, namely (x - running average) / running average.
All right, so I finally figured it out thanks to your help and the posts you referred me to.
The calculation for my entire data set (300,000+ points) takes about a second!
I used the following code:
def runningmean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N

photosignal = np.array(seg.analogsignals[0], ndmin=1)
photosignalaverage = runningmean(photosignal, 3000)
# pad the front with 2999 zeros so the averaged array matches the raw signal's length
holder = np.zeros(2999)
photosignalaverage = np.append(holder, photosignalaverage)
deltafsignal = (photosignal - photosignalaverage) / abs(photosignalaverage)
photosignal stores my raw signal in a numpy array.
photosignalaverage uses cumsum to calculate the running average at every data point in photosignal. I then prepend 2999 zeros to maintain the same size as my photosignal array.
I then use basic numpy operations to get my delta F/F signal.
Thank you once more for the feedback, it was truly helpful!
Your approach goes in the right direction. However, you made a mistake in your list comprehension: you are using uu as your index, whereas uu is actually an element of your input data photosignal.
You want something like this:
normalizedphotosignal2 = np.zeros(photosignal.shape[0] - 3000)
for i, uu in enumerate(photosignal[3000:]):
    window_mean = np.mean(photosignal[i:i + 3000])  # the 3000 values preceding uu
    normalizedphotosignal2[i] = (uu - window_mean) / abs(window_mean)
Keep in mind that for-loops are relatively slow in Python. If performance is an issue here, you could try avoiding the for loop and using numpy methods instead (e.g. have a look at Moving average or running mean).
Hope this helps.

Lists and while loops - Python

I am fairly new to Python and I am stuck on a particular question, so I thought I'd ask you guys.
The following contains my code so far, as well as the questions that lie therein:
weights = [100, 20, 30, 40]  # etc.
Just a list with different numeric values representing an object's weight in grams.
object_number = 0
while len(weights) > 0:
    weight = weights.pop(0)
    print("object number:", object_number, "evaluates to")
    object_number += 1
What I want to do next is evaluate the items in the list, so that if we go with index [0], we have a list value of 100. Then I want to separate this into smaller pieces: for a 100 gram object, one would split it into five 20 gram units. If the value being split up were 35, then it would be one 20 gram unit, one 10 gram unit and one 5 gram unit.
The five unit sizes I want to split into are: 20, 10, 5, 1 and 0.5.
If anyone has a quick tip regarding my issue, it would be much appreciated.
Regards
You should think about solving this for a single number first. What you essentially want to do is split up a number into a partition of known components. This is also known as the change-making problem. You can use a greedy algorithm for this that always takes the largest component size as long as it still fits:
units = [20, 10, 5, 1, 0.5]

def change(number):
    counts = {}
    for unit in units:
        # how many whole units fit, and what remains
        count, number = divmod(number, unit)
        counts[unit] = count
    return counts
So this will return a dictionary that maps from each unit to the count of that unit required to get to the target number.
You just need to call that function for each item in your original list.
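For example, applied to the weights from the question:
for weight in [100, 20, 30, 40, 35]:
    print(weight, "->", change(weight))
# 35 -> {20: 1, 10: 1, 5: 1, 1: 0, 0.5: 0.0}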
One way you could do it is with a double for loop. The outer loop would iterate over the numbers you input, and the inner loop over the unit values you want to evaluate (i.e. [20, 10, 5, 1, 0.5]). For each iteration of the inner loop, find how many times the value goes into the number (using math.floor), then use the modulo operator to reassign the number to the remainder. On each loop you can have it print out the info that you want :) I'm not sure exactly what kind of output you're looking for, but I hope this helps!
Ex:
import math

myList = [100, 20, 30, 40, 35]
values = [20, 10, 5, 1, 0.5]
for i in myList:
    print(str(i) + " evaluates to: ")
    for num in values:
        evaluation = math.floor(i / num)
        print("\t" + str(num) + "'s: " + str(evaluation))
        i %= num

Project Euler #82 (Python)

First of all, this is the problem: https://projecteuler.net/problem=82.
This is my code :
# https://projecteuler.net/problem=82
matrice = open('matrix3.txt', 'r').read().split('\n')
m = []
for el in matrice:
    if el == '':
        continue
    tmp = el.split(',')
    m.append(tmp)

matrix = [[0 for i in range(80)] for j in range(80)]
x, y = 0, 0
while True:
    matrix[x][y] = int(m[x][y])
    y += 1
    if y == 80:
        y = 0
        x += 1
    if x == 80:
        break

tmp = [0] * 80
x, y = 0, 78
while True:
    if x == 0:
        tmp[x] = min(matrix[x][y+1], matrix[x+1][y] + matrix[x+1][y+1])
    if x == 79:
        tmp[x] = min(matrix[x][y+1], matrix[x-1][y] + matrix[x-1][y+1])
    else:
        tmp[x] = min(matrix[x][y+1], matrix[x-1][y] + matrix[x-1][y+1], matrix[x+1][y] + matrix[x+1][y+1])
    x += 1
    if x == 80:
        for e in range(80):
            matrix[e][y] += tmp[e]
        tmp = [0] * 80
        x = 0
        y += -1
        if y < 0:
            break

minimo = 10**9
for e in range(80):
    if matrix[e][0] < minimo:
        minimo = matrix[e][0]
print(minimo)
The idea behind this code is the following:
I start from the 79th column (the 78th if you start counting from 0) and calculate the best (minimal) way to get from any given entry in that column to the column to its right.
When the column is done I replace it with the minimal results I found, and I start doing the same with the column to the left.
Is anyone able to help me understand why I get the wrong answer? (I get 262716.)
The same code works for the matrix in the example (it works if you change the indices, of course).
If I understand the question, your code, and your algorithm correctly, it looks like you aren't actually calculating the best way to get from one column to the next because you're only considering a couple of the possible ways to get to the next column. For example, consider the first iteration (when y=78). Then I think what you want is tmp[0] to hold the minimum sum for getting from matrix[0][78] to anywhere in the 79th column, but you only consider two possibilities: go right, or go down and then go right. What if the best way to get from matrix[0][78] to the next column is to go down 6 entries and then go right? Your code will never consider that possibility.
Your code probably works on the small example because it so happens that the minimum path only goes up or down a single time in each column. But I think that's a coincidence (also possibly a poorly chosen example).
One way to solve this problem is the following approach. When the input is an NxN matrix, define an NxN array min_path. We want to fill in min_path so that min_path[x][y] is the minimum path sum starting at any entry in the first column of the input matrix and ending at [x][y]. We fill in one column of min_path at a time, starting with the leftmost column. To compute min_path[i][j], we look at all entries in the (j-1)th column of min_path, plus the cost of getting from each of those entries to (i, j). Here is some Python code showing this solution: https://gist.github.com/estark37/5216851. This is an O(N^4) solution, but it can probably be made faster (maybe by precomputing the results of the sum_to calls?).
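For reference, a minimal sketch of that recurrence (illustrative only, not the gist's exact code; the inner vertical-cost loop is what makes it O(N^4)):
def min_path_sum(matrix):
    n = len(matrix)
    # min_path[x][j]: cheapest path from any first-column entry to (x, j)
    min_path = [[0] * n for _ in range(n)]
    for x in range(n):
        min_path[x][0] = matrix[x][0]
    for j in range(1, n):
        for x in range(n):
            best = float('inf')
            for i in range(n):
                # step right from (i, j-1), then move vertically to row x,
                # paying every entry of column j between rows i and x
                lo, hi = (i, x) if i <= x else (x, i)
                cost = sum(matrix[r][j] for r in range(lo, hi + 1))
                best = min(best, min_path[i][j - 1] + cost)
            min_path[x][j] = best
    return min(row[-1] for row in min_path)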
