I am a SAS user trying to convert SAS code to a Python version.
I have created the SAS code below and am having some issues translating it into Python. Suppose I have a data table containing fields aging1 to aging60, and I want to create two new fields named 'life_def' and 'obs_time'. Both fields start at 0 and are updated based on conditions on the aging1 to aging60 fields.
data want;
  set have;
  array aging_array(*) aging1--aging60;
  life_def=0;
  obs_time=0;
  do i=1 to 60;
    if life_def=0 and aging_array[i] ne . then do;
      if aging_array[i]>=4 then do;
        obs_time=i;
        life_def=1;
      end;
      if aging_array[i]<4 then do;
        obs_time=i;
      end;
    end;
  end;
  drop i;
run;
I have tried to re-create the above SAS code in Python, but it doesn't work the way I thought it would. Below is the code I am currently working on.
df['life_def']=0
df['obs_time']=0
for i in range(1,lag+1):
    if df['life_def'].all()==0 and pd.notnull(df[df.columns[i+4]].all()):
        condition=df[df.columns[i+4]]>=4
        df['life_def']=np.where(condition, 1, df['life_def'])
        df['obs_time']=np.where(condition, i, df['obs_time'])
Suppose df[df.columns[i+4]] corresponds to my aging columns from SAS. With the code above, the loop keeps updating as i increases. However, the SAS logic stops i at the first time aging>=4.
For example, if aging7>=4 for the first time, life_def should become 1 and obs_time should become 7, but my code keeps reassigning them on the next iteration, which is 8.
Thank you!
Your objective is to get, per row, the x of the first agingx column whose value is greater than or equal to 4. The snippet below does that.
Note - I am using python 2.7
mydf['obs_time'] = 0
# number of aging columns in the frame
agingcols_len = len([k for k in mydf.columns.tolist() if 'aging' in k])
rowcnt = mydf['aging1'].fillna(0).count()
for k in xrange(rowcnt):
    isFirst = True
    for i in xrange(1, agingcols_len + 1):  # aging1 .. agingN inclusive
        if isFirst and mydf['aging' + str(i)][k] >= 4:
            mydf.loc[k, 'obs_time'] = i     # record the first column index that is >= 4
            isFirst = False
        elif isFirst and mydf['aging' + str(i)][k] < 4:
            pass                            # keep scanning until a value >= 4 is found
I have uploaded the data that I used to test the above. The same can be found here.
The snippet iterates over the agingx columns (aging1, aging2, ...) and records obs_time as soon as it finds a value greater than or equal to 4. This whole thing iterates over the DataFrame rows with k.
FYI - however, this is super slow when you have a million rows to loop through.
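If the row loop becomes the bottleneck, a vectorized version is possible. This is only a sketch, assuming the columns are named aging1 through aging60 and that rows with no value >= 4 should keep life_def and obs_time at 0; it reproduces the "first column >= 4" objective above rather than the full SAS bookkeeping for obs_time.

import numpy as np

aging_cols = ['aging' + str(i) for i in range(1, 61)]    # positional order aging1..aging60
hits = mydf[aging_cols].ge(4)                # boolean frame: True where the aging value is >= 4 (NaN counts as False)
has_hit = hits.any(axis=1)                   # rows that contain at least one value >= 4
first_hit = hits.values.argmax(axis=1) + 1   # 1-based index of the first True per row

mydf['life_def'] = has_hit.astype(int)
mydf['obs_time'] = np.where(has_hit, first_hit, 0)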
I'm trying to create an 'and' statement in my for loop. There is a data frame that I have under "b". I isolated two columns of the data frame into their respective lists (cdr3_length and heavy_percent).
I'm trying to create a for loop that parses through b and appends all data where cdr3_length > 15 and heavy_percent < 88 to a new list "candidates".
cdr3_length=marv["heavy_cdr3_aa_length"]
cdr3_length.head()
heavy_percent=marv["heavy_percent_id"]
heavy_percent.head()
cnt_=0
candidates=[]
for i in range(0, len(b)):
    if (cdr3_length(b[i]) > 15 & heavy_percent(b[i]) < 88
        candidates.append(b[i])
        cnt_+=1
I am getting a syntax error on the if statement line, but I can't find it. I appreciate any help given!
if (cdr3_length(b[i]) > 15 & heavy_percent(b[i]) < 88
    candidates.append(b[i])
->
if (cdr3_length(b[i]) > 15) and (heavy_percent(b[i]) < 88):
    candidates.append(b[i])
Python uses indentation to form blocks of code, so you should watch this closely. Also, you may want to visit python.org and read its awesome beginners guide. While typing and syntax are both relaxed in this language, they still demand respect.
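As a side note, since cdr3_length and heavy_percent are pandas Series, a vectorized boolean mask is usually simpler and faster than an explicit loop, and Series are indexed with square brackets rather than called like functions. A sketch, assuming b is the DataFrame (or at least shares its index with the two Series):

# element-wise comparisons combined with &; the parentheses are required
mask = (cdr3_length > 15) & (heavy_percent < 88)
candidates = b[mask]      # rows of b where both conditions hold
cnt_ = int(mask.sum())    # number of matching rows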
I have two Google spreadsheets:
QC - many columns; I want to check if a value from column 4 appears in the second spreadsheet (lastEdited_PEID); if it does, the script should put 'Bingo!' in column 14 of the same row where the value was found
lastEdited - one column, a long spreadsheet of values
I achieve that with the following code:
#access the documents on Drive
QC = gc.open_by_key("FIRST KEY").sheet1
lastEdited = gc.open_by_key("SECOND KEY").sheet1

#get values from columns and convert to lists
QC_PEID = QC.col_values(4)
lastEdited_PEID = lastEdited.col_values(1)

#iterate by rows and check if value from each row appears in the second document
for value in QC_PEID:
    ind = QC_PEID.index(value)
    if value in lastEdited_PEID:
        QC.update_cell(ind, 14, 'Bingo!')
So it does the job but does it very slowly (about 5 minutes). I am concerned about the speed because I have to perform the operation for about 50 spreadsheets (avg. 6000 rows each).
I tried to remove the element from the second list when found (it can only appear once) with the following addition inside the loop:
for value in QC_PEID:
    ind = QC_PEID.index(value)
    if value in lastEdited_PEID:
        QC.update_cell(ind, 14, 'Bingo!')
        lastEdited_PEID.remove(value)  # the added line
I thought it would make it faster since the reference list would get shorter, but surprisingly it takes even longer.
What could I do to make the process quicker?
Since gspread is a wrapper around the Google Sheets REST API, each operation you perform on a spreadsheet turns into an HTTP request to the API. Most of the time this is the slowest part of the code. If you want to improve performance, you need to figure out how to reduce the number of interactions with the API.
In your code sample each col_values() call makes a single HTTP request. This is good. But then, while you iterate over the cell values, there's an update_cell() call in the loop:
for value in QC_PEID:
    ind = QC_PEID.index(value)
    if value in lastEdited_PEID:
        QC.update_cell(ind, 14, 'Bingo!') # it makes 2 HTTP requests each time
update_cell makes two HTTP requests to the API (one to retrieve information needed to update the cell and another to actually send the update to the API.) You need to avoid this method call in your loop.
A better idea is to collect all updates and send them in a batch. This is what update_cells() method is for.
update_cells() needs a list of Cell objects to do the batch update. You can get those by calling Worksheet.range().
This is what comes to mind:
# A utility function
def col_cells(worksheet, col):
    """Returns a range of cells in a `worksheet`'s column `col`."""
    start_cell = worksheet.get_addr_int(1, col)
    end_cell = worksheet.get_addr_int(worksheet.row_count, col)
    return worksheet.range('%s:%s' % (start_cell, end_cell))

QC_PEID = QC.col_values(4)
lastEdited_PEID = set(lastEdited.col_values(1))  # make the 'in' lookup a bit faster
column_14_cells = col_cells(QC, 14)

has_updates = False
# iterate by rows and check if value from each row appears in the second document
for i, value in enumerate(QC_PEID):
    if value in lastEdited_PEID:
        has_updates = True
        column_14_cells[i].value = 'Bingo!'

if has_updates:
    QC.update_cells(column_14_cells)
I didn't run the code. Beware of typos.
I have a file with a dictionary of values on each line. For each line I query a MySQL database once per dictionary entry, place the results in a dict, and once all values for the query dict have been generated the line gets written out.
IN > foo bar someotherinfo {1: 'query_val', 2: 'query_val', 3: 'query_val'}
OUT > foo bar someotherinfo 1_result 2_result 3_result
This whole process appears to be somewhat slow because I'm performing around 200,000 mysql queries per file and have around 10 files per sample and around 30 samples in total, so I'm looking to speed up the whole process.
I'm just wondering if the file IO could be creating a bottleneck. Instead of writing the line info (foo, bar, somblah) followed by each result dict as it's returned, would I be better off chunking these results in memory before writing them to file in batches?
Or is this simply a case of having to just wait it out... ?
Example Input line and output line
INPUT
XM_006557349.1 1 - exon XM_006557349.1_exon_2 10316 10534 {1: '10509:10534', 2: '10488:10508', 3: '10467:10487', 4: '10446:10466', 5: '10425:10445', 6: '10404:10424', 7: '10383:10403', 8: '10362:10382', 9: '10341:10361', 10: '10316:10340'}
OUTPUT
XM_006557349.1 1 - exon XM_006557349.1_exon_2 10316 10534 0.7083 0.2945 0.2 0.2931 0.125 0.1154 0.2095 0.5833 0.0569 0.0508
CODE
def array_2_meth(sample,bin_type,type,cur_meth):
    bins_in = open('bin_dicts/'+bin_type,'r')
    meth_out = open('meth_data/'+bin_type+'_'+sample+'_plus_'+type+'_meth.tsv','w')
    for line in bins_in.readlines():
        meth_dict = {}
        # build array of data from each line
        array = line.strip('\n').split('\t')
        mrna_id = array[0]
        assembly = array[1]
        strand = array[2]
        bin_dict = ast.literal_eval(array[7])
        for bin in bin_dict:
            coords = bin_dict[bin].split(':')
            start = int(coords[0]) -1
            end = int(coords[1]) +1
            cur_meth.execute('select sum(mc)/sum(h) from allc_'+str(sample)+'_'+str(assembly) + ' where strand = \'' +str(strand) +'\' and class = \''+str(type)+'\' and position between '+str(start)+' and ' +str(end) + ' and h >= 5')
            for row in cur_meth.fetchall():
                if str(row[0]) == 'None':
                    meth_dict[bin] = 'no_cov'
                else:
                    meth_dict[bin] = float(row[0])
        meth_out.write('\t'.join(array[:7]))
        for k in sorted(meth_dict.keys()):
            meth_out.write('\t'+str(meth_dict[k]))
        meth_out.write('\n')
    meth_out.close()
Not sure if adding this code is going to be a massive help, but it should show the way I'm approaching this.. Any advice you could provide on mistakes I'm making in my approach or tips on how to optimise would be greatly appreciated!!!
Thanks ^_^
I think the file IO shouldn't take too long; the main bottleneck is probably the number of queries you are making. But from the example you provide I see no pattern in those start and end positions, so I have no idea how to cut down the number of queries.
I have an idea that may be either great or stupid depending on your test results (also, I don't know much about Python, so forgive the syntax).
It SEEMS that every query will only return a single value?
Maybe you could try something like:
queries = []
for bin in bin_dict:
    coords = bin_dict[bin].split(':')
    start = int(coords[0]) -1
    end = int(coords[1]) +1
    queries.append('select sum(mc)/sum(h) from allc_'+str(sample)+'_'+str(assembly) + ' where strand = \'' +str(strand) +'\' and class = \''+str(type)+'\' and position between '+str(start)+' and ' +str(end) + ' and h >= 5')
SQL = ' UNION ALL '.join(queries)  # join the per-bin queries into one statement
cur_meth.execute(SQL)
for row in cur_meth.fetchall():
    # loop through the rows (one per bin) and write them to file
    pass
The core idea is to use UNION ALL to join all the queries into one, so you only need one round trip instead of the 10 shown in your example. You also reduce the 10 writes to file to 1. The possible drawback is that UNION ALL might be slow, but as far as I know it shouldn't take any more processing time than 10 individual queries as long as you keep the SQL in the format of my example.
The second obvious method is to do it in parallel. If you are not using all the processing power of your machine, you could try to start multiple scripts/programs at the same time, since all you do is query data and nothing is modified. This would make each individual script slightly slower but the overall job faster, as it should reduce the wait time between queries.
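If you try the parallel route, a minimal sketch with multiprocessing might look like the following. It assumes array_2_meth has the signature shown above, and that connect_db, samples, bin_types and types are your own (hypothetical here) connection helper and job lists; each worker opens its own MySQL connection, since connections and cursors generally cannot be shared across processes.

from multiprocessing import Pool

def run_one(args):
    sample, bin_type, type_ = args
    conn = connect_db()          # hypothetical helper that opens a fresh MySQL connection
    try:
        array_2_meth(sample, bin_type, type_, conn.cursor())
    finally:
        conn.close()

if __name__ == '__main__':
    jobs = [(s, b, t) for s in samples for b in bin_types for t in types]  # your file/sample combinations
    pool = Pool(processes=4)     # tune to your CPU and database capacity
    pool.map(run_one, jobs)
    pool.close()
    pool.join()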
I am converting old pseudo-Fortran code into python and am struggling to create a framework within which I can perform some complex iterative calculations.
As a beginner, my first instinct is to use lists as I find them easier to work with, but I understand that arrays would probably be a more suitable approach.
I already have all the input channels as lists and am hoping for a good explanation of how to set up loops for such calculations.
This is an example of the pseudo-Fortran I am replicating. Each (t) indicates a 'time-series channel' that I currently have stored as a list (i.e. ECART2(t) and NNNN(t) are lists). All lists have the same number of entries.
do while ( ecart2(t) > 0.0002 .and. nnnn(t) < 2000. ) ;
  mmm(t)=nnnn(t)+1.;
  if YRPVBPO(t).ge.0.1 .and. YRPVBPO(t).le.0.999930338 .and. YAEVBPO(t).ge.0.000015 .and. YAEVBPO(t).le.0.000615 then do;
    YM5(t) = customFunction(YRPVBPO,YAEVBPO);*
  end;
  YUEVBO(t) = YU0VBO(t) * YM5(t) ;*m/s
  YHEVBO(t) = YCPEVBO(t)*TPO_TGETO1(t)+0.5*YUEVBO(t)*YUEVBO(t);*J/kg
  YAVBO(t) = ddnn2(t)*(YUEVBO(t)**2);*
  YDVBO(t) = YCPEVBO(t)**2 + 4*YHEVBO(t)*YAVBO(t) ;*
  YTSVBPO(t) = (sqrt(YDVBO(t))-YCPEVBO(t))/2./YAVBO(t);*K
  YUSVBO(t) = ddnn(t)*YUEVBO(t)*YTSVBPO(t);*m/s
  YM7(t) = YUSVBO(t)/YU0VBO(t);*
  YPHSVBPOtot(t) = (YPHEVBPO(t) - YPDHVBPO(t))/(1.+((YGAMAEVBO(t)-1)/2)*(YM7(t)**2))**(YGAMAEVBO(t)/(1-YGAMAEVBO(t)));*bar
  YPHEVBPOtot(t) = YPHEVBPO(t) / (1.+rss0(t)*YM5(t)*YM5(t))**rss1(t);*bar
  YDPVBPOtot(t) = YPHEVBPOtot(t) - YPHSVBPOtot(t) ;*bar
  iter(t) = (YPHEVBPOtot(t) - YDPVBPOtot(t))/YPHEVBPOtot(t);*
  ecart2(t)= ABS(iter(t)-YRPVBPO(t));*
  aa(t)=YRPVBPO(t)+0.0001;
  YRPVBPO(t)=aa(t);*
  nnnn(t)=mmm(t);*
end;
Understanding the pseudo-Fortran: with 'time-series data' there is an implicit loop iterating through the individual values in each list, as well as looping over each of those values until the conditions are met.
It carries out the loop calculations on the first list values until the conditions are met, then moves on to the second value in the lists and performs the same looping calculations until the conditions are met...
ECART2 = [2,0,3,5,3,4]
NNNN = [6,7,5,8,6,7]
do while ( ecart2(t) > 0.0002 .and. nnnn(t) < 2000. )
MMM = NNNN + 1
This looks at the first values in each list (2 and 6). Because the conditions are met, the subsequent calculations are performed on those first values and written to the new lists, e.g. MMM = [6+1, ...].
Only once the rest of the calculations have been performed (looping multiple times if the conditions are not met) does the second value in every list get considered. The second values (0 and 7) do not meet the conditions, and therefore the second entry for MMM is 0:
MMM=[6+1, 0, ...]
Because 0 must be entered if the conditions are not met, I am considering setting up all the new lists in advance and populating them with 0s.
NB: 'customFunction()' is a separate function that is called, returning a value from two input values
MY CURRENT SOLUTION
set up all the empty lists
nPts = len(ECART2)
MMM = [0]*nPts
YM5 = [0]*nPts
etc...
then start performing calculations
for i in range(nPts):
    while (ECART2[i] > 0.0002) and (NNNN[i] < 2000):
        MMM[i] = NNNN[i]+1
        if YRPVBPO[i]>=0.1 and YRPVBPO[i]<=0.999930338 and YAEVBPO[i]>=0.000015 and YAEVBPO[i]<=0.000615:
            YM5[i] = MACH_LBP_DIA30(YRPVBPO[i],YAEVBPO[i])
        YUEVBO[i] = YU0VBO[i]*YM5[i]
        YHEVBO[i] = YCPEVBO[i]*TGETO1[i] + 0.5*YUEVBO[i]**2
        YAVBO[i] = DDNN2[i]*YUEVBO[i]**2
        YDVBO[i] = YCPEVBO[i]**2 + 4*YHEVBO[i]*YAVBO[i]
        etc etc...
but I'm guessing that there are better ways of doing this, such as the suggested use of numpy arrays (something I plan on learning in the near future); a sketch of what that might look like is below.
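For reference, one common NumPy pattern for this kind of per-element iteration is to keep a boolean mask of the elements that still need updating and loop until the mask is empty. This is only a sketch showing the first few assignments; it assumes every channel has been converted to a numpy array and that customFunction accepts array arguments.

import numpy as np

# convert the input channels (lists) to float arrays; output channels start at 0
ECART2 = np.asarray(ECART2, dtype=float)
NNNN = np.asarray(NNNN, dtype=float)
MMM = np.zeros_like(ECART2)
YM5 = np.zeros_like(ECART2)

active = (ECART2 > 0.0002) & (NNNN < 2000)   # elements still being iterated
while active.any():
    MMM[active] = NNNN[active] + 1
    ok = (active & (YRPVBPO >= 0.1) & (YRPVBPO <= 0.999930338)
                 & (YAEVBPO >= 0.000015) & (YAEVBPO <= 0.000615))
    YM5[ok] = customFunction(YRPVBPO[ok], YAEVBPO[ok])
    # ... the remaining assignments go here, each masked with [active] or [ok],
    # including the updates to ECART2, YRPVBPO and NNNN that let the loop terminate ...
    NNNN[active] = MMM[active]
    active = (ECART2 > 0.0002) & (NNNN < 2000)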
I am trying to understand whether there is an advantage in space, time, or programming effort to storing data from a signal processing system as a nested list in either:
data[channel][sample]
data[sample][channel]
I can code the processing for both, though I personally find 1) easier to write and index than 2).
However, 2) is the more common way my local group programs and stores the data (either in Excel/CSV or from the data-gathering systems). While it is easy to transpose:
dataA = map(list, zip(*dataB))
I was wondering if there are any storage, performance, or even module-compatibility issues with 1) over 2)?
with 1) I can loop like this
for R in dataA :
    for C in R :
        process_channel(C)
matplotlib.loglog(dataA[0], dataA[i])
where dataA[0] is time or frequency and i is some other channel to plot
with 2)
for R in dataB :
    for C in R :
        process_sample(C)
matplotlib.loglog([j[0] for j in dataB],[k[i] for k in dataB])
This looks worse in programming style. Maybe I am missing a list method that would make this easier? I have also developed code to use dicts... but that really breaks down in general use, so I am less inclined to continue with dicts. The dict storage looks roughly like:
dataC = [{'f': 0.1, 'chnl1': 100.0}, {'f': 0.2, 'chnl1': 110.0}]
or some such. It seems that, for better integration, option 2) is preferable. However, I am trying to understand how best to code with option 2) when you wish to process over channels rather than samples. Do I just transpose the matrix first, do the work in option 1) space, and transpose the results back, like this:
dataA = smoothing(dataA, smooth_factor)
def smoothing(d, step) :
    # transpose into option 1) orientation (channels as rows)
    td = map(list, zip(*d))
    nd = []
    for row in td :
        col = []
        for i in xrange(0, len(row)-step, step) :
            col.append(sum(row[i:i+step])/step)   # block average over `step` samples
        nd.append(col)
    # transpose back to the original orientation
    nd = map(list, zip(*nd))
    return nd
While this construction works, transposing back and forth all the time looks rather inefficient.
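For comparison, a 2-D NumPy array lets either axis be addressed without transposing, which largely sidesteps the orientation question. A sketch, assuming dataB is samples-by-channels with equal-length rows and step is the same block size used above:

import numpy as np

data = np.asarray(dataB, dtype=float)   # shape: (n_samples, n_channels)

one_channel = data[:, 2]                # all samples of channel 2, no transpose needed
one_sample = data[10, :]                # all channels of sample 10

# per-channel block averaging over `step` samples at a time
n = (data.shape[0] // step) * step
smoothed = data[:n].reshape(-1, step, data.shape[1]).mean(axis=1)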