How to update/insert cell in variables using Python in SPSS - python

I'm using this code to read a set of cases from dataset:
begin program.
import spss
with spss.DataStep():
    start = 0
    end = 3
    firstColumn = 'deviation'
    datasetObj = spss.Dataset('DataSet1')
    variables = datasetObj.varlist
    caseData = datasetObj.cases
    print([itm[0] for itm in caseData[start:end, variables[firstColumn].index]])
    # no explicit spss.EndDataStep() is needed; leaving the with block ends the data step
end program.
Now, I want to change this cell based on the variable name and case number.
This question and answer are related to my issue, but I can't use spss.Submit inside with spss.DataStep():

See Example: Modifying Case Values from this page.
*python_dataset_modify_cases.sps.
DATA LIST FREE /cust (F2) amt (F5).
BEGIN DATA
210 4500
242 6900
370 32500
END DATA.
BEGIN PROGRAM.
import spss
spss.StartDataStep()
datasetObj = spss.Dataset()
for i in range(len(datasetObj.cases)):
    # Multiply the value of amt by 1.05 for each case
    datasetObj.cases[i,1] = 1.05*datasetObj.cases[i,1][0]
spss.EndDataStep()
END PROGRAM.
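Adapting that example to the original question, a minimal sketch (assuming DataSet1 has a numeric variable named deviation; the case number and the new value are illustrative, and case numbering is zero-based, as in the caseData slice above):
begin program.
import spss
with spss.DataStep():
    datasetObj = spss.Dataset('DataSet1')
    varIndex = datasetObj.varlist['deviation'].index   # column index from the variable name
    caseNumber = 2                                      # zero-based case (row) number
    datasetObj.cases[caseNumber, varIndex] = 42.0       # write the new cell value
end program.
No spss.Submit is needed: assigning through the cases object inside the data step updates the active dataset, just as in the documentation example above.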

Related

WUnderground, Extraction of Extremes Today

As a contributor to WUnderground, it is no problem to read, via an API call, the JSON output file with today's values for my station.
That JSON file has a series of numbered 'bins', and the series grows over time from 00:00.
Each numbered 'bin' contains an equivalent dataset reporting values.
By the end of the day there are a few hundred 'bins' in the JSON file.
To avoid setting up a local database and still get an up-to-date survey of Extremes_Today, the latest JSON file has to be scanned periodically from bin 0 up to the last added bin.
That means reading each numbered bin, extracting and evaluating its values, and jumping to the next bin until the last bin has been reached and processed.
I tried the two approaches below in a Python script; these two script segments should just check and report that a bin exists. The script lines up to 442 do other jobs (including a complete read-out of bin 0 for references) and already run without error.
# Line 442 = In WU JSON-Output Today's Data find & process next Bin upto/incl. last Bin
# Example call-string for ToDay-info = https://api.weather.com/v2/pws/observations/all/1day?stationId=KMAHANOV10&format=json&units=m&apiKey=yourApiKey
# Extracting contents of the JSON-file by the scriptlines below
# page = urllib.urlopen('https://api.weather.com/v2/pws/observations/all/1day?stationId=KMAHANOV10&format=json&units=m&apiKey=yourApiKey')
# content_test = page.read()
# obj_test2 = json.loads(content_test)
# Extraction of a value is like
# Epochcheck = obj_test2['observations'][Bin]['epoch']
# 'epoch' is present as an element in all bins of the JSON file (its value trends with the bin number) and is therefore chosen as the key for scan & search. If it is not found, that bin does not exist = we have passed the last present bin
# Bin [0] earlier separately has been processed => initial contents at 00:00 = references for Extremes-search
# GENERAL setup of the scanning function:
# Bin = 0
# while 'epoch' exists
# Read 'epoch' & translate to CET/LocalTime
# Compare values of Extremes in that bin with earlier Extremes
# if hi_value higher than hiExtreme => new hiExtreme & adapt HiTime (= translated 'epoch')
# if low_value lower than LowExtreme => new lowExtreme & adapt LowTime (= translated 'epoch')
# Bin = Bin + 1
# Approach1
Bin = 0
Epochcheck = obj_test2['observations'][0]['epoch']
try:
    Epochcheck = obj_test2['observations'][Bin]['epoch']
    print(Bin)
    Bin += 1
except NameError:
    Epochcheck = None
# Approach2
Bin = 0
Epochcheck = obj_test2['observations'][0]['epoch']
While Epochcheck is not None:
    Epochcheck = obj_test2['observations'][Bin]['epoch']
    Print(Bin)
    Bin += 1
Approach1 does not throw an error, but it steps out at Bin = 1.
Approach2 reports a syntax error.
File "/home/pi/domoticz/scripts/python/URL_JSON_WU_to_HWA_Start01a_0186.py", line 476
While Epochcheck is not None:
^
SyntaxError: invalid syntax
Apparently the check line with a dynamically changing Bin cannot be set up this way: the dynamic value of Bin has to be supplied in a different way.
Epochcheck = obj_test2['observations'][Bin]['epoch']
What is the appropriate way in Python to perform such JSON scanning with a dynamic index [Bin]?
Or is there a simpler way to scan and extract a series of bins from a JSON file?
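Not an exact fix of the poster's script, but a sketch of two common patterns (only 'epoch' is taken from the question; the extreme fields under 'metric' are assumptions): either iterate over the observations list directly, or keep the dynamic [Bin] index and catch IndexError in a loop.
# Pattern 1: iterate the bins directly; no manual index bookkeeping needed
for Bin, obs in enumerate(obj_test2['observations']):
    epoch = obs['epoch']                         # translate to CET/local time as before
    hi = obs.get('metric', {}).get('tempHigh')   # illustrative extreme fields
    low = obs.get('metric', {}).get('tempLow')
    print(Bin, epoch, hi, low)

# Pattern 2: keep the dynamic [Bin] index, loop, and catch IndexError
Bin = 0
while True:
    try:
        Epochcheck = obj_test2['observations'][Bin]['epoch']
    except IndexError:                           # past the last existing bin
        break
    print(Bin)
    Bin += 1
Approach1 stops at Bin = 1 because there is no loop around the try block, and a missing bin raises IndexError (list index out of range), not NameError; Approach2 fails because Python keywords and built-ins are lower-case (while, print).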

How to write if statement from SAS to python

I am a SAS user trying to convert SAS code to a Python version.
I have created the SAS code below and have some issues applying it in Python. Suppose I have a data table containing the fields aging1 to aging60, and I want to create two new fields named 'life_def' and 'obs_time'. These two fields start with the value 0 and are changed based on conditions on the other fields, aging1 to aging60.
data want;
    set have;
    array aging_array(*) aging1--aging60;
    life_def=0;
    obs_time=0;
    do i=1 to 60;
        if life_def=0 and aging_array[i] ne . then do;
            if aging_array[i]>=4 then do;
                obs_time=i;
                life_def=1;
            end;
            if aging_array[i]<4 then do;
                obs_time=i;
            end;
        end;
    end;
    drop i;
run;
I have tried to re-create the SAS code above in Python, but it doesn't work the way I thought. Below is the code I am currently working on.
df['life_def']=0
df['obs_time']=0
for i in range(1,lag+1):
    if df['life_def'].all()==0 and pd.notnull(df[df.columns[i+4]].all()):
        condition=df[df.columns[i+4]]>=4
        df['life_def']=np.where(condition, 1, df['life_def'])
        df['obs_time']=np.where(condition, i, df['obs_time'])
Suppose df[df.columns[i+4]] corresponds to my aging columns in SAS. With the code above, the loop keeps assigning as i increases. However, the SAS logic stops updating at the first time that aging>=4.
For example, if aging7>=4 for the first time, life_def should become 1 and obs_time should stay 7, but my loop assigns again in the next iteration, i=8.
Thank you!
Your objective is to get, per row, the x of the first agingx column whose value is ge 4. The snippet below does the same thing.
Note - I am using Python 2.7
mydf['obs_time'] = 0
agingcols_len = len([k for k in mydf.columns.tolist() if 'aging' in k])
rowcnt = mydf['aging1'].fillna(0).count()
for k in xrange(rowcnt):
    isFirst = True
    for i in xrange(1, agingcols_len):
        if isFirst and mydf['aging' + str(i)][k] >= 4:
            mydf['obs_time'][k] = i
            isFirst = False
        elif isFirst and mydf['aging' + str(i)][k] < 4:
            pass
I have uploaded the data that I used to test the above. The same can be found here.
The snippet iterates over all the agingx columns (e.g. aging1, aging2) and records in obs_time the first column whose value is greater than or equal to 4. The whole thing iterates over the DataFrame rows with k.
FYI - however, this is super slow when you have a million rows to loop through.
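If the row loop is too slow, a vectorized sketch of the same logic (it assumes the columns aging1..aging60 exist, are numeric, and appear in that order; the no-hit fallback mirrors the SAS code, which leaves obs_time at the last non-missing index):
import numpy as np

aging_cols = ['aging' + str(i) for i in range(1, 61)]
vals = df[aging_cols].values.astype(float)

hit = vals >= 4                                   # NaN compares as False
any_hit = hit.any(axis=1)
first_hit = hit.argmax(axis=1)                    # 0-based position of the first True

non_missing = ~np.isnan(vals)
last_obs = vals.shape[1] - 1 - non_missing[:, ::-1].argmax(axis=1)

df['life_def'] = any_hit.astype(int)
df['obs_time'] = np.where(any_hit, first_hit + 1,          # first aging >= 4 (1-based)
                          np.where(non_missing.any(axis=1), last_obs + 1, 0))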

XLRD: Successfully extracted 2 lists out of 2 sheets, but list comparison won't work

Ok so I have two xlsx sheets; both sheets have in their second column, at index 1, a list of SIM card numbers. After extracting that data using xlrd, I have successfully printed the contents of both columns into my PowerShell terminal as two lists, along with the number of elements in each list.
The first sheet (theirSheet) has 454 entries, the second (ourSheet) has 361. I need to find the 93 that don't exist in the second sheet and put them into (unpaidSims). I could do this manually of course, but I would like to automate this task for the future, when I will inevitably need to do it again, so I am trying to write this Python script.
Since Python agrees that I have a list of 454 entries and a list of 361 entries, I thought I just needed to figure out a list comparison. I researched that on Stack Overflow and tried three different solutions, but each time the script produces the third list (unpaidSims), it says 454... meaning it hasn't removed the entries that also appear in the smaller list. Please advise.
from os.path import join, dirname, abspath
import xlrd
theirBookFileName = join(dirname(dirname(abspath(__file__))), 'pycel', 'theirBook.xlsx')
ourBookFileName = join(dirname(dirname(abspath(__file__))), 'pycel', 'ourBook.xlsx')
theirBook = xlrd.open_workbook(theirBookFileName)
ourBook = xlrd.open_workbook(ourBookFileName)
theirSheet = theirBook.sheet_by_index(0)
ourSheet = ourBook.sheet_by_index(0)
theirSimColumn = theirSheet.col(1)
ourSimColumn = ourSheet.col(1)
numColsTheirSheet = theirSheet.ncols
numRowsTheirSheet = theirSheet.nrows
numColsOurSheet = ourSheet.ncols
numRowsOurSheet = ourSheet.nrows
# First Attempt at the comparison, but fails and returns 454 entries from the bigger list
unpaidSims = [d for d in theirSimColumn if d not in ourSimColumn]
print unpaidSims
lengthOfUnpaidSims = len(unpaidSims)
print lengthOfUnpaidSims
print "\nWe are expecting 93 entries in this new list"
# Second Attempt at the comparison, but fails and returns 454 entries from the bigger list
s = set(ourSimColumn)
unpaidSims = [x for x in theirSimColumn if x not in s]
print unpaidSims
lengthOfUnpaidSims = len(unpaidSims)
print lengthOfUnpaidSims
# Third Attempt at the comparison, but fails and returns 454 entries from the bigger list
unpaidSims = tuple(set(theirSimColumn) - set(ourSimColumn))
print unpaidSims
lengthOfUnpaidSims = len(unpaidSims)
print lengthOfUnpaidSims
According to the xlrd Documentation, the col method returns "a sequence of the Cell objects in the given column".
It doesn't mention anything about comparison of Cell objects. Digging into the source, it appears that they didn't code any comparison methods into the class. As such, the Python documentation states that the objects will be compared by "object identity". In other words, the comparison will be False unless they are the exact same instance of the Cell class, even if the values they contain are identical.
You need to compare the values of the Cells instead. For example:
unpaidSims = set(sim.value for sim in theirSimColumn) - set(sim.value for sim in ourSimColumn)
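For example, building on that (variable names as in the question; the 93 figure assumes every SIM in ourSheet also appears in theirSheet):
theirSims = set(sim.value for sim in theirSimColumn)
ourSims = set(sim.value for sim in ourSimColumn)

unpaidSims = theirSims - ourSims
print len(unpaidSims)              # expected: 93
for sim in sorted(unpaidSims):
    print sim
If the column includes a header cell, slice it off first (e.g. theirSheet.col(1)[1:]) so the header text doesn't end up in the set.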

Optimizing performance using a big dictionary in Python

Background:
I'm trying to create a simple python program that allows me to take part of a transcript by its transcriptomic coordinates and get both its sequence and its genomic coordinates.
I'm not an experienced bioinformatician or programmer, more a biologist, but the way I thought about doing it is to split each transcript into its nucleotides and store along with each nucleotide, in a tuple, both its genomic coordinates and its coordinates inside the transcript. That way I can use Python to take part of a certain transcript (say, the last 200 nucleotides) and get the sequence and the various genomic windows that construct it. The end goal is more complicated than that (the final program will receive a set of coordinates in the form of distance from the translation start site (ATG), randomly assign each coordinate to a random transcript, and output the sequence and genomic coordinates).
This is the code I wrote for this; it takes the information from a BED file containing the coordinates and sequence of each exon (along with information such as transcript length, position of the start (ATG) codon, and position of the stop codon):
from __future__ import print_function
from collections import OrderedDict
from collections import defaultdict
import time
import sys
import os

content = []          # raw rows of the BED file
all_transcripts = []  # transcript IDs, in order of appearance

with open("canonicals_metagene_withseq.bed") as f:
    for line in f:
        content.append(line.strip().split())
        all_transcripts.append(line.strip().split()[3])

all_transcripts = list(OrderedDict.fromkeys(all_transcripts))
genes = dict.fromkeys(all_transcripts)

n = 0
for line in content:
    n += 1
    if genes[line[3]] is not None:
        seq = []
        i = 0
        for nucleotide in line[14]:
            seq.append((nucleotide, int(line[9])+i, int(line[1])+i))
            i += 1
        if line[5] == '+':
            genes[line[3]][5].extend(seq)
        elif line[5] == '-':
            genes[line[3]][5] = seq + genes[line[3]][5]
    else:
        seq = []
        i = 0
        for nucleotide in line[14]:
            seq.append((nucleotide, int(line[9])+i, int(line[1])+i))
            i += 1
        genes[line[3]] = [line[0], line[5], line[11], line[12], line[13], seq]
    sys.stdout.write("\r")
    sys.stdout.write(str(n))
    sys.stdout.flush()
This is an example of how the BED file looks:
Chr Start End Transcript_ID Exon_Type Strand Ex_Start Ex_End Ex_Length Ts_Start Ts_End Ts_Length ATG Stop Sequence
chr1 861120 861180 uc001abw.1 5UTR + 0 60 80 0 60 2554 80 2126 GCAGATCCCTGCGG
chr1 861301 861321 uc001abw.1 5UTR + 60 80 80 60 80 2554 80 2126 GGAAAAGTCTGAAG
chr1 861321 861393 uc001abw.1 CDS + 0 72 2046 80 152 2554 80 2126 ATGTCCAAGGGGAT
chr1 865534 865716 uc001abw.1 CDS + 72 254 2046 152 334 2554 80 2126 AACCGGGGGCGGCT
chr1 866418 866469 uc001abw.1 CDS + 254 305 2046 334 385 2554 80 2126 AGTCCACACCCACT
I wanted to create a dictionary in which each transcript ID is the key, and the values stored are the length of the transcript, the chromosome it is on, the strand, the position of the ATG, the position of the stop codon and, most importantly, a list of tuples for the sequence.
Basically, the code works. However, once the dictionary starts to get big it runs very, very slowly.
So, what I would like to know is: how can I make it run faster? Currently it gets intolerably slow at around the 60,000th line of the BED file. Perhaps there is a more efficient way to do what I'm trying to do, or just a better way to store the data.
The BED file is custom made, btw, using awk from UCSC tables.
EDIT:
Sharing what I learned...
I now know that the bottleneck is in the creation of a large dictionary.
If I alter the program to iterate by gene and create a new list each time, with a similar mechanism, using this code:
from itertools import groupby

n = 0
for transcript in groupby(content, lambda x: x[3]):
    a = list(transcript[1])
    b = a[0]
    gene = [b[0], b[5], b[11], b[12], b[13]]
    seq = []
    n += 1
    sys.stdout.write("\r")
    sys.stdout.write(str(n))
    sys.stdout.flush()
    for exon in a:
        i = 0
        for nucleotide in list(exon)[14]:
            seq.append((nucleotide, int(list(exon)[9])+i, int(list(exon)[1])+i))
            i += 1
    gene.append(seq)
It runs in less than 4 minutes, while in the former version of creating a big dictionary with all of the genes at once it takes an hour to run.
One thing that makes your code more efficient is to add to the dictionary as you read the file.
with open("canonicals_metagene_withseq.bed") as f:
    for line in f:
        all_transcripts.append(line.strip().split()[3])
        add_to_content_dict(line)
and then your add_to_content_dict() function would look like the code inside the for line in content: loop
(see here).
Also, you have to define your defaultdicts as such; I don't see where genes or any other dict is defined as a defaultdict.
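A minimal sketch of that single-pass idea, reusing the field positions from the BED layout in the question (it assumes any header line has been stripped; the +/- strand handling mirrors the original code):
genes = {}
with open("canonicals_metagene_withseq.bed") as f:
    for line in f:
        fields = line.split()
        tid = fields[3]
        # (nucleotide, transcript coordinate, genomic coordinate) per base
        seq = [(nt, int(fields[9]) + i, int(fields[1]) + i)
               for i, nt in enumerate(fields[14])]
        if tid not in genes:
            genes[tid] = [fields[0], fields[5], fields[11], fields[12], fields[13], seq]
        elif fields[5] == '+':
            genes[tid][5].extend(seq)
        else:                                   # '-' strand: prepend the exon
            genes[tid][5] = seq + genes[tid][5]
A defaultdict(list) works just as well if you only need the per-transcript sequence list; the membership test above simply keeps the other metadata in the same entry.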
This might be a good read; it details the practice of binding, as local variables outside your loop, the dot-notation methods used inside it, to enhance performance, because you aren't looking up the attribute on every iteration of the loop. For example, instead of
for line in f:
    all_transcripts.append(line.strip().split()[3])
you would have
f_split = str.split
f_strip = str.strip
f_append = all_transcripts.append
for line in f:
    f_append(f_split(f_strip(line))[3])
There are other goodies in that link about local variable access; it is, again, definitely worth the read.
You may also consider using Cython, PyInline, or Pyrex to move the hottest loops into C code, or running the script under PyPy, for efficiency when dealing with lots and lots of iteration and/or file I/O.
As for the data structure itself (which was your major concern), we're limited in python to how much control over a dictionary's expansion we have. Big dicts do get heavier on the memory consumption as they get bigger... but so do all data structures! You have a couple options that may make a minute difference (storing strings as bytestrings / using a translation dict for encoded integers), but you may want to consider implementing a database instead of holding all that stuff in a python dict during runtime.
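If you do go the database route, a small sketch with the standard sqlite3 module (table and column names are illustrative; storing one row per exon rather than per nucleotide keeps the table compact):
import sqlite3

con = sqlite3.connect("transcripts.db")
con.execute("""CREATE TABLE IF NOT EXISTS exons (
                   transcript TEXT, chrom TEXT, strand TEXT,
                   ts_start INTEGER, g_start INTEGER, seq TEXT)""")
con.execute("CREATE INDEX IF NOT EXISTS idx_transcript ON exons(transcript)")

with open("canonicals_metagene_withseq.bed") as f:
    rows = (line.split() for line in f if line.startswith("chr"))  # skip any header line
    con.executemany("INSERT INTO exons VALUES (?, ?, ?, ?, ?, ?)",
                    ((fl[3], fl[0], fl[5], int(fl[9]), int(fl[1]), fl[14]) for fl in rows))
con.commit()

# Later, pull one transcript's exons without holding everything in memory:
exons = con.execute("SELECT * FROM exons WHERE transcript = ?", ("uc001abw.1",)).fetchall()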

How to efficiently process a large file with a grouping variable in Python

I've got a dataset that looks something like the following:
ID Group
1001 2
1006 2
1008 1
1027 2
1013 1
1014 4
So basically, a long list of unsorted IDs with a grouping variable as well.
At the moment, I want to take subsets of this list based on the generation of a random number (imagine they're being drafted, or won the lottery, etc.). Right now, this is the code I'm using to just process it row-by-row, by ID.
reader = csv.reader(open(inputname), delimiter=' ')
out1 = open(output1name,'wb')
out2 = open(output2name,'wb')
for row in reader:
    assignment = gcd(1,p,marg_rate,rho)
    if assignment[0,0]==1:
        out1.write(row[0])
        out1.write("\n")
    if assignment[0,1]==1:
        out2.write(row[0])
        out2.write("\n")
Basically, if the gcd() function goes one way you get written to one file, another way to a second, and then some get tossed out. The issue is that I'd now like to do this by Group rather than by ID - basically, I'd like to assign values the first time a member of a group appears and then apply them to all members of that group (so, for example, if 1001 goes to File 2, so do 1006 and 1027).
Is there an efficient way to do this in Python? The file's large enough that I'm a little wary of my first thought, which was to do the assignments in a dictionary or list and then have the program look it up for each line.
I used random.randint to generate a random number, but this can be easily replaced.
The idea is to use a defaultdict so that a group has a single score (dict keys are unique) from the moment it is created:
import csv
import random
from collections import defaultdict

reader = csv.DictReader(open(inputname), delimiter=' ')
out1 = open(output1name,'wb')
out2 = open(output2name,'wb')

# create a dictionary with a random default integer value [0, 1] for
# keys that are accessed for the first time
group_scores = defaultdict(lambda: random.randint(0,1))

for row in reader:
    # set a score for the current row according to its group
    # if none found - defaultdict will call its lambda for new keys
    # and create a score for this row and all who follow
    score = group_scores[row['Group']]
    if score==0:
        out1.write(row['ID'])
        out1.write("\n")
    if score==1:
        out2.write(row['ID'])
        out2.write("\n")
out1.close()
out2.close()
I've also used DictReader which I find nicer for csv files with headers.
Tip: you may want to use the with context manager to open files.
Example output:
reut@sharabani:~/python/ran$ cat out1.txt
1001
1006
1008
1027
1013
reut@sharabani:~/python/ran$ cat out2.txt
1014
Sounds like you're looking for a mapping. You can use dicts for that.
Once you've first decided 1001 goes to file 2, you can add to your mapping dict.
fileMap={}
fileMap[group]="fileName"
And then, when you need to check if the group has been decided yet, you just
>>>group in fileMap
True
This is instead of mapping every ID to a filename. Just map the groups.
Also, I'm wondering whether it's worth considering batching the writes, e.g. with .writelines(aListOfLines).
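For completeness, a sketch of that plain-dict mapping (file names as in the question; random.randint stands in for the gcd() assignment, as in the answer above):
import csv
import random

fileMap = {}                                   # group -> 0 or 1, decided once per group
outputs = {0: open(output1name, 'w'), 1: open(output2name, 'w')}

with open(inputname) as src:
    reader = csv.reader(src, delimiter=' ')
    next(reader, None)                         # skip the 'ID Group' header line, if present
    for row in reader:
        rec_id, group = row[0], row[1]
        if group not in fileMap:               # first member of the group draws the lot
            fileMap[group] = random.randint(0, 1)
        outputs[fileMap[group]].write(rec_id + "\n")

for fh in outputs.values():
    fh.close()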
