Using columnar data in for loop - python

I'm trying to use an "and" condition in my for loop. I have a data frame stored under "b". I isolated two of its columns into their own variables (cdr3_length and heavy_percent).
I'm trying to write a for loop that goes through b and adds every entry where cdr3_length > 15 and heavy_percent < 88 to a new list, "candidates".
cdr3_length=marv["heavy_cdr3_aa_length"]
cdr3_length.head()
heavy_percent=marv["heavy_percent_id"]
heavy_percent.head()
cnt_=0
candidates=[]
for i in range(0, len(b)):
    if (cdr3_length(b[i]) > 15 & heavy_percent(b[i]) < 88
        candidates.append(b[i])
        cnt_+=1
I am getting a syntax error on the if statement line, but I can't find it. I appreciate any help given!

if (cdr3_length(b[i]) > 15 & heavy_percent(b[i]) < 88
    candidates.append(b[i])
->
if (cdr3_length(b[i]) > 15) and (heavy_percent(b[i]) < 88):
    candidates.append(b[i])
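The original line is missing its closing parenthesis and the trailing colon, and it uses the bitwise operator & where the logical operator and is intended. Python uses indentation to form blocks of code, so watch it closely. You may also want to visit python.org and read its excellent beginner's guide. Although both typing and syntax are relaxed in this language, they still demand respect.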
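As a rough sketch of the corrected loop, assuming the two columns have been pulled out as plain sequences that line up with b entry by entry, zip can walk all three together. The function name select_candidates and the sample values are made up for illustration. Note that & binds more tightly than > and <, so the comparisons must either be parenthesised or joined with and.

```python
def select_candidates(rows, cdr3_lengths, heavy_percents,
                      min_length=15, max_percent=88):
    """Return the rows whose CDR3 length and heavy-chain identity pass both cut-offs."""
    candidates = []
    # zip pairs each row with the two values that belong to it
    for row, length, percent in zip(rows, cdr3_lengths, heavy_percents):
        # "and" is the logical operator; "&" would be a bitwise operation
        if length > min_length and percent < max_percent:
            candidates.append(row)
    return candidates


# Hypothetical sample data standing in for the real data frame
rows = ["ab1", "ab2", "ab3", "ab4"]
cdr3_lengths = [12, 18, 20, 16]
heavy_percents = [85.0, 80.5, 92.0, 87.9]

candidates = select_candidates(rows, cdr3_lengths, heavy_percents)
print(candidates)  # ['ab2', 'ab4']
```

If the data stays in pandas, a boolean mask such as marv[(marv["heavy_cdr3_aa_length"] > 15) & (marv["heavy_percent_id"] < 88)] avoids the explicit loop. There & is the right operator, but each comparison needs its own parentheses.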

Related

What's the problem in my for loop code to separate and make a new DataFrame from my original data?

I'm a beginner at Python and pandas.
I have original data, which I named F1, with shape (194000, 4).
I want to split it into 97 groups of 2,000 rows each (e.g. index 0~1999 is F1_0, 2000~3999 is F1_1).
I wrote the code below.
n=0
for i in (0, 97):
    num=2000*(i+1)
    globals()['F1_{0}'.format(i)] = F1.loc[n:num]
    n = A
When I call F1_0, there is no problem.
But from F1_1 to F1_96, I get a "not defined" error.
I don't know what is wrong in my code :(
I'd also appreciate it if you could let me know whether there is a better way.
Thanks for reading.
Use range instead of passing a tuple to the loop. In your code, the for loop iterates over only the two values 0 and 97, not the range (0, ..., 96).
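n=0
for i in range(97):
    num=2000*(i+1)
    # .iloc slices by position and excludes the end, so each chunk has exactly 2,000 rows
    globals()['F1_{0}'.format(i)] = F1.iloc[n:num]
    n = num  # the next chunk starts where this one ended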
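On the "better way" part of the question: one common alternative to writing into globals() is to keep the chunks in a dictionary keyed by name. The sketch below is only an illustration. It uses a plain list as a stand-in for the DataFrame, and the helper name split_into_chunks is hypothetical.

```python
def split_into_chunks(rows, chunk_size):
    """Map 'F1_0', 'F1_1', ... to consecutive slices of at most chunk_size rows."""
    return {
        "F1_{0}".format(i): rows[start:start + chunk_size]
        for i, start in enumerate(range(0, len(rows), chunk_size))
    }


# Small stand-in for the 194,000-row data set
rows = list(range(10))
chunks = split_into_chunks(rows, 4)
print(chunks["F1_1"])  # [4, 5, 6, 7]
```

With a real DataFrame the slice would be F1.iloc[start:start + chunk_size] instead of rows[start:start + chunk_size]. The chunks are then looked up as chunks["F1_3"] rather than through 97 separate variable names.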

How to write if statement from SAS to python

I am a SAS user trying to translate SAS code to Python.
I wrote the SAS code below and have some issues porting it to Python. Suppose I have a data table that contains the fields aging1 to aging60, and I want to create two new fields named 'life_def' and 'obs_time'. Both fields start at 0 and are changed based on conditions on the other fields, aging1 to aging60.
data want;
    set have;
    array aging_array(*) aging1--aging60;
    life_def=0;
    obs_time=0;
    do i = 1 to 60;
        if life_def=0 and aging_array[i] ne . then do;
            if aging_array[i]>=4 then do;
                obs_time=i;
                life_def=1;
            end;
            if aging_array[i]<4 then do;
                obs_time=i;
            end;
        end;
    end;
    drop i;
run;
I have tried to re-create the SAS code above in Python, but it doesn't work the way I thought it would. Below is the code I am currently working on.
df['life_def']=0
df['obs_time']=0
for i in range(1,lag+1):
    if df['life_def'].all()==0 and pd.notnull(df[df.columns[i+4]].all()):
        condition=df[df.columns[i+4]]>=4
        df['life_def']=np.where(condition, 1, df['life_def'])
        df['obs_time']=np.where(condition, i, df['obs_time'])
Suppose df[df.columns[i+4]] corresponds to the aging columns in SAS. With the code above, the values keep being overwritten as i increases. The SAS logic, however, stops updating a row the first time that aging>=4.
For example, if aging7>=4 (for the first time), life_def should become 1 and obs_time should become 7, and neither should change in the next iteration, i = 8.
Thank you!
Your objective is to get, for each row, the x of the first aging**x** column whose value is greater than or equal to 4. The snippet below does that.
Note - I am using python 2.7
mydf['obs_time'] = 0
agingcols_len = len([k for k in mydf.columns.tolist() if 'aging' in k])
rowcnt = mydf['aging1'].fillna(0).count()
for k in xrange(rowcnt):
    isFirst = True
    # + 1 so that the last aging column is checked as well
    for i in xrange(1, agingcols_len + 1):
        if isFirst and mydf['aging' + str(i)][k] >= 4:
            # .loc writes to mydf itself rather than to a temporary copy
            mydf.loc[k, 'obs_time'] = i
            isFirst = False
I have uploaded the data that I used to test the above. The same can be found here.
The snippet iterates over all the aging**x** columns (e.g. aging1, aging2) and records in obs_time the number of the first column whose value is greater than or equal to 4. The outer loop repeats this for every DataFrame row, indexed by k.
FYI: this is very slow when you have millions of rows to loop through.
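For comparison, here is a minimal sketch of the full SAS logic (both life_def and obs_time) for a single row. It assumes the aging values arrive as a plain sequence ordered aging1, aging2, ... with missing values stored as None or NaN. The function names are made up for illustration.

```python
import math


def is_missing(value):
    """Treat None and NaN as the SAS missing value '.'."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def life_def_and_obs_time(aging_values, threshold=4):
    """Mirror the SAS DO loop for one row; returns (life_def, obs_time)."""
    life_def = 0
    obs_time = 0
    for i, value in enumerate(aging_values, start=1):
        if is_missing(value):
            continue
        obs_time = i          # last observed period so far
        if value >= threshold:
            life_def = 1      # first default: stop updating this row
            break
    return life_def, obs_time


# Hypothetical rows with five aging columns each
rows = [
    [1, 2, None, 5, 6],
    [0, 1, 2, 3, float("nan")],
    [None, None, None, None, None],
]
results = [life_def_and_obs_time(row) for row in rows]
print(results)  # [(1, 4), (0, 4), (0, 0)]
```

The break plays the role of the life_def=0 guard in the SAS code: once a row has defaulted, later aging columns no longer touch obs_time.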

Python increment loop to add filter parameters to dataframe

I created a dictionary with a set of functions. Then, I created a while loop that attempts to use those functions. But part of the loop doesn't call the functions the way I want it to. Here's the code:
while bool(str(w).endswith(' 2')) != True:
    a = re.search('[0-9]{1,2}$', str(w))
    w = w & int(a.group())-1
    result = df[f[w]]
The third line, w = w & int(a.group())-1, doesn't function the way I want when I test it outside of this loop. I try setting w = 34, and then testing what results when I do 34 & int(a.group())-1. Instead of giving me 34 & 33, I get 32. Is there any way to create an increment that adds parameters to the result, instead of creating some integer that doesn't even seem to be derived logically? I would like it to start with 34, and add an integer that is one less for every go around the loop (34, 34 & 33, 34 & 33 & 32, etc.). Thanks in advance!
34 & 33 is 32. & is the bitwise and operator.
Saying you want "34 & 33" suggests that you want a string as a result, but that seems to conflict with the use of str(w) throughout your code. Or maybe you are just unclear about what & does and really want some different operation.
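To see why 34 & 33 comes out as 32, it can help to print the operands in binary. This is just an illustrative sketch; the variable names are arbitrary.

```python
w = 34
previous = 33

# & keeps only the bits that are set in both operands
print(format(w, "06b"))             # 100010
print(format(previous, "06b"))      # 100001
print(format(w & previous, "06b"))  # 100000

combined = w & previous
print(combined)  # 32

# Subtraction binds more tightly than &, so this is w & (34 - 1)
same_thing = w & 34 - 1
```

Only the bit worth 32 is set in both numbers, so that is all that survives the operation.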
Okay, I figured it out. I needed q = f[w] and then q = q & f[w-n], where f[w] defines parameters to filter a dataframe (df) based on a column, and f[w-n] defines a different parameter for filtering based on the next adjacent column. So, the progression should be f[w], f[w] & f[w-n], f[w] & f[w-n] & f[w-n], etc., instead of 34, 34 & 33, 34 & 33 & 32, etc. while n <= w.
So that would look like this:
w = 34
n = 1
q = f[w]
while n <= w:
    q = q & f[w-n]
    result = df[q]
    n = n+1
And, there would be conditions later on to decide whether or not enough parameters were used. In my usage, I'm not looking for a result before the while loop after q is initially defined because that result would have already been found in a different portion of the program. (In case this helps anyone else.)
Scott Hunter thanks for the tip on the & operator.

How to update/insert cell in variables using Python in SPSS

I'm using this code to read a set of cases from dataset:
begin program.
with spss.DataStep():
    start = 0
    end = 3
    firstColumn = 'deviation'
    datasetObj = spss.Dataset('DataSet1')
    variables = datasetObj.varlist
    caseData = datasetObj.cases
    print([itm[0] for itm in caseData[start:end, variables[firstColumn].index]])
    spss.EndDataStep()
end program.
Now, I want to change this cell based on the variable name and case number.
This question and answer are related to my issue, but I can't use spss.Submit inside with spss.DataStep():
See Example: Modifying Case Values from this page.
*python_dataset_modify_cases.sps.
DATA LIST FREE /cust (F2) amt (F5).
BEGIN DATA
210 4500
242 6900
370 32500
END DATA.
BEGIN PROGRAM.
import spss
spss.StartDataStep()
datasetObj = spss.Dataset()
for i in range(len(datasetObj.cases)):
    # Multiply the value of amt by 1.05 for each case
    datasetObj.cases[i,1] = 1.05*datasetObj.cases[i,1][0]
spss.EndDataStep()
END PROGRAM.

Python nest list performance choice

I am trying to understand whether there is an advantage in space, time, or programming effort to storing data from a signal processing system as a nested list in either of these layouts:
1) data[channel][sample]
2) data[sample][channel]
I can code processing for both, though I personally find 1) easier to write and index than 2).
However, 2) is the more common way my local group programs and stores the data (either in Excel/CSV or from the data gathering systems). While it is easy to transpose
dataA = map(list, zip(*dataB))
I was wondering if there are any storage, performance, or even module compatibility issues with 1) over 2)?
With 1) I can loop like this
for R in dataA :
    for C in R :
        process_channel(C)
matplotlib.pyplot.loglog(dataA[0], dataA[i])
where dataA[0] is time or frequency and i is some other channel to plot
with 2)
for R in dataB :
    for C in R :
        process_sample(C)
matplotlib.pyplot.loglog([j[0] for j in dataB], [k[i] for k in dataB])
This looks worse in programming style. Maybe I am missing a list method that makes this easier? I have also developed code that uses dicts, but this really breaks with general use, so I am less inclined to continue with dicts. The dict storage looks like
dataC = [{'f': 0.1, 'chnl1': 100.0}, {'f': 0.2, 'chnl1': 110.0}]
or some such. It seems that option 2) is the better integrated one. However, I am trying to understand how best to code with option 2) when I wish to process over channels and then samples. Should I just transpose the matrix first, do the work in option 1) space, and transpose the results back?
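import numpy

def smoothing(d, s) :
    # transpose so that each row holds one channel
    td = [list(col) for col in zip(*d)]
    nd = []
    for row in td :
        col = []
        # average each block of s consecutive samples (s is assumed to be the step)
        for i in range(0, len(row)-s, s) :
            col.append(sum(row[i:i+s]) / float(s))
        nd.append(col)
    # transpose back to the original orientation
    nd = numpy.transpose(nd)
    return nd

dataA = smoothing(dataA, smooth_factor)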
While this construction works, transposing back and forth all the time looks, um, inefficient.
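One way to keep the data in layout 2) and still work channel by channel is to let zip(*data) hand out the channels on the fly. The sketch below makes no claim about performance. The helper names block_means and smooth_samples are made up, and unlike the loop above it also averages the final full block.

```python
def block_means(values, size):
    """Average consecutive, non-overlapping blocks of `size` values."""
    return [
        sum(values[i:i + size]) / float(size)
        for i in range(0, len(values) - size + 1, size)
    ]


def smooth_samples(data, size):
    """Take data[sample][channel] and return it smoothed in the same layout."""
    channels = zip(*data)                        # one tuple per channel
    smoothed = [block_means(ch, size) for ch in channels]
    return [list(row) for row in zip(*smoothed)]  # back to sample-major rows


# Hypothetical data: four samples of (frequency, chnl1)
dataB = [[0.0, 10.0], [2.0, 30.0], [4.0, 50.0], [6.0, 70.0]]
smoothed = smooth_samples(dataB, 2)
print(smoothed)  # [[1.0, 20.0], [5.0, 60.0]]
```

The two zip calls are still transposes, but they stay inside one function, so the calling code only ever sees layout 2). If the data is held in a NumPy array instead of nested lists, either axis can be indexed directly and the question of which layout to store largely disappears.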
