From spreadsheet to dictionary in ipython/python & more? - python

I would like to be able to take data from a file (spreadsheet or other) and create a dictionary that I can then iterate over in a loop for the keys, and have corresponding values inserted in my command for each key. Sorry if that does not make much sense, I will explain in more detail below.
I have several samples that I am running through a bioinformatics pipeline and I am trying to automate the process. One of the steps is adding "read group" information to my files which is done with the following shell command:
picard-tools AddOrReplaceReadGroups I=input.bam O=output.bam RGID=IDXX
RGLB=LBXX RGPL=PLXX RGPU=PUXX RGSM=SMXX VALIDATION_STRINGENCY=SILENT
SORT_ORDER=coordinate CREATE_INDEX=true
For each sample ID there is a different RGID, RGLB, RGPL, RGPU, and RGSM (and different input files, but I already know how to call that info). What I would like to do is have a loop that executes this command for each sample ID and has the corresponding RGID, RGLB, RGPL, RGPU, and RGSM inserted into the command. Is there an easy way to do this? I have been reading a bit and it seems like a dictionary is probably the way to go, but it is not clear to me how to generate the dictionary and call the individual values into my command.

This should be pretty easy, but how you do it depends on the format of your input file. You're going to want something basically like this:
import subprocess  # This is how we're going to call the commands.

samples = {}  # Empty dict
with open('inputfile', 'r') as f:
    for line in f:
        # Extract sampleID, rgid, rglb, rgpl, rgpu, rgsm depending on file format...
        samples[sampleID] = [rgid, rglb, rgpl, rgpu, rgsm]  # Populate dict

for sampleID in samples:
    rgid, rglb, rgpl, rgpu, rgsm = samples[sampleID]
    # Now you can run your commands using the subprocess module.
    # Remember to change e.g. the I/O file names based on sampleID if they differ.
    subprocess.call(['picard-tools', 'AddOrReplaceReadGroups', 'I=input.bam',
                     'O=output.bam', 'RGID=%s' % rgid, 'RGLB=%s' % rglb, 'RGPL=%s' % rgpl,
                     'RGPU=%s' % rgpu, 'RGSM=%s' % rgsm, 'VALIDATION_STRINGENCY=SILENT',
                     'SORT_ORDER=coordinate', 'CREATE_INDEX=true'])
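If, for example, your spreadsheet is exported as a tab-separated file with one sample per row in the order sampleID, RGID, RGLB, RGPL, RGPU, RGSM (a hypothetical layout and file name, so adjust it to your actual export), the extraction step could look like the sketch below, and the loop above then works unchanged:
import csv

samples = {}
with open('samplesheet.tsv', 'r') as f:  # hypothetical sample sheet exported from the spreadsheet
    for row in csv.reader(f, delimiter='\t'):
        sampleID = row[0]
        samples[sampleID] = row[1:6]  # [RGID, RGLB, RGPL, RGPU, RGSM]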

Related

Read csv file for datadriven testing in robotframework

I am currently trying to do some data-driven testing with Robot Framework from a csv file, using a custom Python library. I am running into some problems though, and would be grateful if someone could point me in the right direction.
This is the error I am getting:
Resolving variable '${Tlogdata.0}' failed: SyntaxError: unexpected EOF while parsing (, line 1)
The csv I want to process currently has two records (I tried without quotes, with single quotes, and with double quotes):
1-KR8P27,11.0,1000
1-KR8P27,12.0,1001
I suspect the problem is with the custom library. I tried a lot of tweaks to my code, but with what I found and my Python knowledge (which is admittedly very basic) I cannot find any issue. This is what I currently have:
import csv

def read_csv_file(filename):
    data = []
    with open(filename) as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            data.append(row)
    return data
I am using some more keywords in Robot Framework to use this custom library to fetch data from my csv. While I suspect that my Python code is the problem, I double checked everything, so I might be overlooking something here instead:
In a datamanager keyword file I created the following Keyword:
Get CSV Data
    [Arguments]    ${FilePath}
    ${Data} =    read csv file    ${FilePath}
    [Return]    ${Data}
Then I created a 'looping' keyword with a for loop:
Check multiple results
    [Arguments]    ${tlogdatas}
    FOR    ${tlogdata}    IN    ${tlogdatas}
        Check result TLOG3    ${tlogdata}
The keyword I call in my loop is already used in a test case without a data-driven setup, and works. Only the variables are named differently to make it work with the data-driven setup. The keyword looks like this:
Check result TLOG3
    [Arguments]    ${Tlogdata}
    ${queryResults} =    query    select x_ord_pts_earn, total_amt from siebel.s_order where contact_id = ${Tlogdata.0} and total_amt = ${Tlogdata.1} and X_ORD_PTS_earn = ${Tlogdata.2}
    # log #{queryResults[0][1]}
    ${dbvalue} =    set variable    ${queryResults}
    ${DB ordptsearn} =    set variable    ${queryResults[0][0]}
    ${DB contact_id} =    set variable    ${queryResults[0][1]}
    should be equal as integers    ${DB ordptsearn}    ${Tlogdata.2}
    should be equal as strings    ${DB contact_id}    ${Tlogdata.1}
    END
Then in my test case I define a variable which fetches its results from my datamanager keyword, and I use the looping keyword to go through the csv values:
Check TLOG results from CSVFile
    ${Tlogdata} =    DataManager.Get CSV Data    ${TLOG_RESULTS_CSVPath}
    TLOG.Check multiple results    ${Tlogdata}
It might also be worth it to show the values from the csv that are fetched according to the report file:
${Tlogdata} = [["'1-KR8P27'", "'11.0'", "'1000'"], ["'1-KR8P27'", "'12.0'", "'1001'"]]
I hope this is somewhat clear; I understand it is quite some text. But I am not 100% sure where the problem is in my scripts. I hope someone can point me in the right direction.
You are indexing your list wrong. Instead of ${Tlogdata.0} you should have ${Tlogdata[0]}, etc.
Here is a quick example:
*** Test Cases ***
Test
    ${Tlogdata}=    Evaluate    [["'1-KR8P27'", "'11.0'", "'1000'"], ["'1-KR8P27'", "'12.0'", "'1001'"]]
    Log    ${Tlogdata[0]}
    Log    ${Tlogdata[1]}
    Log    ${Tlogdata[0][1]}
    Log    ${Tlogdata[1][1]}
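For reference, the list of lists returned by read_csv_file is indexed exactly as it would be in plain Python; Robot's extended variable syntax ${Tlogdata[0][2]} corresponds to data[0][2] in the small illustration below (using the values from your report file):
data = [["'1-KR8P27'", "'11.0'", "'1000'"], ["'1-KR8P27'", "'12.0'", "'1001'"]]
print(data[0])     # first row: ["'1-KR8P27'", "'11.0'", "'1000'"]
print(data[0][2])  # third cell of the first row: "'1000'"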

Cannot get the output of pstats

I'm trying to use cProfile from: https://docs.python.org/2/library/profile.html#module-cProfile
I can get the data to print but I want to be able to manipulate the data and sort so that I get just the info I want. To get the data to print I use:
b = cProfile.run("function_name")
But after that runs and prints, b is None and I cannot figure out where the data it printed lives so that I can manipulate it. Of course, I can see the data, but in order to analyze it I need to be able to get some sort of output into my IDE. I've tried pstats, but I get error messages. It seems that to use pstats I have to save some sort of file, but I cannot figure out how to run the program and save its output to a file.
UPDATE:
I almost have a solution
cProfile.run('re.compile("foo|bar")', 'restats')
The second argument lets you save the output to a file, here 'restats'. Now I should be able to open it and read it.
SOLVED:
cProfile.run("get_result()", 'data_stats')
p = pstats.Stats('data_stats')
p.strip_dirs().sort_stats(-1).print_stats()
p.sort_stats('name')
cProfile.run("get_result()", 'data_stats')
p = pstats.Stats('data_stats')
p.strip_dirs().sort_stats(-1).print_stats()
p.sort_stats('name')
In addition to the first argument, which runs the code, the second argument saves the output to a file. The next line then opens that file. Once the file is open you should be able to see the values of p in your IDE and use normal Python operations to manipulate them.
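As a small extension of the snippet above (same assumed get_result function), you can also sort by a named column and limit the output, which is usually more useful than the default sort:
import cProfile
import pstats

cProfile.run("get_result()", 'data_stats')
p = pstats.Stats('data_stats')
p.strip_dirs().sort_stats('cumulative').print_stats(10)  # ten slowest entries by cumulative time
p.sort_stats('time').print_stats(10)                     # ten slowest by time spent in the function itself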

Constant first row of a .csv file?

I have a Python script which is logging some data into a .csv file.
import datetime

logging_file = 'test.csv'
dt = datetime.datetime.now()
f = open(logging_file, 'a')
f.write('\n "{:%H:%M:%S}",{},{}'.format(dt, x, y))
The above code is the core part, and it produces continuous data in the .csv file as
"00:34:09" ,23.05,23.05
"00:36:09" ,24.05,24.05
"00:38:09" ,26.05,26.05
... etc.,
Now I wish to add the following line as the first row of this data: time, data1, data2. I expect output as
time, data1, data2
"00:34:09" ,23.05,23.05
"00:36:09" ,24.05,24.05
"00:38:09" ,26.05,26.05
... etc.,
I tried many ways, but none of them produced the result in the preferred format, and I am unable to get my expected output. Please help me solve the problem.
I would recommend writing a class specifically for creating and managing logs. Have it initialize a file on creation with the expected first line (don't forget a \n character!), and keep track of any necessary information about that log (the name of the log it created, where it is, etc.). You can then have the class 'write' to the log (append to it, really), you can create new logs as necessary, and you can have it check for existing logs and decide between updating what already exists or scrapping it and starting over.
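A minimal sketch of that idea, assuming the same test.csv file and two data values per row as in the question; the header is written only when the file is new or empty:
import csv
import datetime
import os

class CsvLogger:
    def __init__(self, path, header=('time', 'data1', 'data2')):
        self.path = path
        # Write the header row only if the log does not exist yet (or is empty).
        if not os.path.exists(path) or os.path.getsize(path) == 0:
            with open(path, 'w') as f:
                csv.writer(f).writerow(header)

    def write(self, x, y):
        dt = datetime.datetime.now()
        with open(self.path, 'a') as f:
            csv.writer(f).writerow(['{:%H:%M:%S}'.format(dt), x, y])

logger = CsvLogger('test.csv')
logger.write(23.05, 23.05)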

How to print rdd in python in spark

I have two files on HDFS and I just want to join these two files on a column say employee id.
I am trying to simply print the files to make sure we are reading them correctly from HDFS.
lines = sc.textFile("hdfs://ip:8020/emp.txt")
print lines.count()
I have tried the foreach and println functions as well and I am not able to display the file data.
I am working in Python and am totally new to both Python and Spark.
This is really easy, just do a collect.
You must be sure that all the data fits in memory on your master.
my_rdd = sc.parallelize(xrange(10000000))
print my_rdd.collect()
If that is not the case, you should just take a sample by using the take method.
# I use an exaggerated number to remind you it is very large and won't fit in memory on your master, so collect wouldn't work
my_rdd = sc.parallelize(xrange(100000000000000000))
print my_rdd.take(100)
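Applied to the file from your question (same assumed HDFS path), taking a small sample keeps the driver from pulling everything into memory:
lines = sc.textFile("hdfs://ip:8020/emp.txt")
print lines.count()   # total number of lines in the file
print lines.take(10)  # print only the first ten lines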

Merge two large text files by common row to one mapping file

I have two text files that have similar formatting. The first (732KB):
>lib_1749;size=599;
TACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGACTATTAAGTCAGCTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTCACTAGACTGTCACTGACACTGATGCTCGAAAGTGTGGGTATCAAACA
--
>lib_2235;size=456;
TACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGACTATTAAGTCAGCTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCTAGCGGTGAAATGCTTAGATATCACGAAGAACTCCGATTGCGAAGGCAGCTTACTGGACTGTAACTGACGTTGAGGCTCGAAAGCGTGGGGAGCAAACA
--
>lib_13686;size=69;
TACGTATGGAGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGTGTAGGTGGCCAGGCAAGTCAGAAGTGAAAGCCCGGGGCTCAACCCCGGGGCTGGTAGCGGTGAAATGCGTAGATATTAGGAGGAACACCAGTGGCGAAGGCGGCTTGCTGGACTGTAACTGACACTGAGGCTCGAAAGCGTGGGGAGCAAACA
--
The second (5.26GB):
>Stool268_1 HWI-ST155_0605:1:1101:1194:2070#CTGTCTCTCCTA
TACGGAGGATGCGAGCGTTATCCGGATTTACTGGGTTTAAAGGGAGCGCAGACGGGACGTTAAGTCAGCTGTGAAAGTTTGGGGCTCAACCCTAAAACTGCTAGCGGTGAAATGCTTAGATATCGGGAGGAACTCCGGTTGCGAAGGCAGCATACTGGACTGCAACTGACGCTGATGCTCGAAAGTGTGGGTATCAAACAGG
--
Note the key difference is the header for each entry (lib_1749 vs. Stool268_1). What I need is to create a mapping file between the headers of one file and the headers of the second using the sequence (e.g., TACGGAGGATGCGAGCGTTATCCGGAT...) as a key.
Note, as one final complication, the mapping is not going to be 1-to-1: there will be multiple entries of the form Stool****** for each entry of lib****. This is because the key in the first file was trimmed to 200 characters, but in the second file it can be longer.
For smaller files I would just do something like this in Python, but I often have trouble because these files are so big and cannot be read into memory all at once. Usually I try Unix utilities, but in this case I cannot think of how to accomplish this.
Thank you!
In my opinion, the easiest way would be to use BLAST+...
Set up the larger file as a BLAST database and use the smaller file as the query...
Then just write a small script to analyse the output, i.e. take the top hit or two to create the mapping file.
BTW, you might find SequenceServer (Google it) helpful in setting up a custom BLAST database and your BLAST environment...
BioPython should be able to read in large FASTA files.
from Bio import SeqIO
from collections import defaultdict

mapping = defaultdict(list)
for stool_record in SeqIO.parse('stool.fasta', 'fasta'):
    stool_seq = str(stool_record.seq)
    # Compare each stool read against every (shorter, trimmed) lib sequence.
    for lib_record in SeqIO.parse('libs.fasta', 'fasta'):
        lib_seq = str(lib_record.seq)
        if stool_seq.startswith(lib_seq):
            mapping[lib_record.id.split(';')[0]].append(stool_record.id)
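Since the nested loop above re-reads libs.fasta once per stool record, an alternative sketch (same assumed file names, and assuming every lib sequence is exactly the 200-character trimmed key described in the question) is to index the small file in a dict and stream the 5.26GB file once:
from Bio import SeqIO
from collections import defaultdict

# Index the small file (~732KB) in memory, keyed by its trimmed sequence.
lib_by_seq = {}
for lib_record in SeqIO.parse('libs.fasta', 'fasta'):
    lib_by_seq[str(lib_record.seq)] = lib_record.id.split(';')[0]

mapping = defaultdict(list)
for stool_record in SeqIO.parse('stool.fasta', 'fasta'):
    key = str(stool_record.seq)[:200]  # compare against the 200-character trimmed keys
    if key in lib_by_seq:
        mapping[lib_by_seq[key]].append(stool_record.id)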
