Elegant way to parse a C array from python - python

Is there an elegant way to parse a C array and extract the elements at specific indexes to an output file?
For example:
myfile.c
my_array[SPECIFIC_SIZE]={
0x10,0x12,0x13,0x14,0x15,0x23,0x01,0x02,0x04,0x07,0x08,
0x33,0x97,0x52,0x27,0x56,0x11,0x99,0x97,0x95,0x77,0x23,
0x45,0x97,0x90,0x97,0x68,0x23,0x28,0x05,0x66,0x99,0x38,
0x11,0x37,0x27,0x11,0x22,0x33,0x44,0x66,0x09,0x88,0x17,
0x90,0x97,0x17,0x90,0x97,0x22,0x77,0x97,0x87,0x25,0x22,
0x25,0x47,0x97,0x57,0x97,0x67,0x26,0x62,0x67,0x69,0x96
}
Python script:
I would like to do something like this (just pseudocode):
def parse_data():
    outfile = open('newfile.txt', 'w')
    with open('myfile.c', 'r') as f:
        SEARCH FOR ELEMENT WITH INDEX 0 IN my_array
        COPY ELEMENT TO OUTFILE AND LABEL WITH "Version Number"
        SEARCH FOR ALL ELEMENTS WITH INDEX 1..10 IN my_array
        COPY ELEMENTS TO OUTFILE WITH NEW LINE AND LABEL WITH "Date"
        ....
        ....
At the end I would like to have a newfile.txt like:
Version Number:
0x10
Date:
0x12,0x13,0x14,0x15,0x23,0x01,0x02,0x04,0x07,0x08
Can you show an example on that pseudocode?

If your .c file is always structured like this:
First line is the declaration of the array.
Middle lines are the data.
Last line is the closing bracket.
You can do...
def parse_myfile(fileName, outName):
    with open(outName, 'w') as out:
        with open(fileName, 'r') as f:
            """ 1. Read all lines, except first and last.
                2. Join all the lines together.
                3. Replace all the '\n' by ''.
                4. Split using ','.
            """
            lines = (''.join(f.readlines()[1:-1])).replace('\n', '').split(',')
            header = lines[0]
            date = lines[1:11]
            out.write('Version Number:\n{}\n\nDate:\n{}'.format(header, date))

if __name__ == '__main__':
    fileName = 'myfile.c'
    outFile = 'output.txt'
    parse_myfile(fileName, outFile)
cat output.txt outputs...
Version Number:
0x10
Date:
['0x12', '0x13', '0x14', '0x15', '0x23', '0x01', '0x02', '0x04', '0x07', '0x08']
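If the layout of the .c file can vary (extra whitespace, a different number of values per line), a regex-based sketch is more robust than counting lines. The function name and the inlined sample are mine, not from the question:

```python
import re

def parse_array(text):
    # Grab everything between the braces, then pull out the 0x.. tokens.
    body = re.search(r'\{(.*?)\}', text, re.DOTALL).group(1)
    return re.findall(r'0x[0-9A-Fa-f]+', body)

# Inlined stand-in for the contents of myfile.c
source = """my_array[SPECIFIC_SIZE]={
0x10,0x12,0x13,0x14,0x15,0x23,0x01,0x02,0x04,0x07,0x08,
0x33,0x97,0x52,0x27,0x56,0x11,0x99,0x97,0x95,0x77,0x23
}"""

values = parse_array(source)
print('Version Number:\n{}\n\nDate:\n{}'.format(values[0], ','.join(values[1:11])))
```

This also writes the date values comma-separated instead of as a Python list repr.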

Related

left padding with python

I have following data and link combination of 100000 entries
dn:id=2150fccc-beb8-42f8-b201-182a6bf5ddfe,ou=test,dc=com
link:545214569
dn:id=ffa55959-457d-49e6-b4cf-a34eff8bbfb7,ou=test,dc=com
link:32546897
dn:id=3452a4c3-b768-43f5-8f1e-d33c14787b9b,ou=test,dc=com
link:6547896541
I am trying to write a program in Python 2.7 that left-pads the value of link with zeros when it has fewer than 10 digits.
Eg:
545214569 --> 0545214569
32546897 --> 0032546897
Can you please guide me on what I am doing wrong with the following program:
with open("test.txt", "r") as f:
    line = f.readline()
    line1 = f.readline()
    wordcheck = "link"
    wordcheck1 = "dn"
    for wordcheck1 in line1:
        with open("pad-link.txt", "a") as ff:
            for wordcheck in line:
                with open("pad-link.txt", "a") as ff:
                    key, val = line.strip().split(":")
                    val1 = val.strip().rjust(10, '0')
                    line = line.replace(val, val1)
                    print (line)
                    print (line1)
                    ff.write(line1 + "\n")
                    ff.write('%s:%s \n' % (key, val1))
The usual pythonic way to pad values in Python is by using string formatting and the Format Specification Mini Language
link = 545214569
print('{:0>10}'.format(link))
Your for wordcheck1 in line1: and for wordcheck in line: loops aren't doing what you think. They iterate one character at a time over the lines and assign each character to the wordcheck variable.
If you only want to change the input file to have leading zeroes, this can be simplified as:
import re

# Read the whole file into memory
with open('input.txt') as f:
    data = f.read()

# Replace all instances of "link:<digits>", passing the digits to a function that
# formats the replacement as a width-10 field, right-justified with zeros as padding.
data = re.sub(r'link:(\d+)', lambda m: 'link:{:0>10}'.format(m.group(1)), data)

with open('output.txt', 'w') as f:
    f.write(data)
output.txt:
dn:id=2150fccc-beb8-42f8-b201-182a6bf5ddfe,ou=test,dc=com
link:0545214569
dn:id=ffa55959-457d-49e6-b4cf-a34eff8bbfb7,ou=test,dc=com
link:0032546897
dn:id=3452a4c3-b768-43f5-8f1e-d33c14787b9b,ou=test,dc=com
link:6547896541
I don't know why you open the files so many times. Anyway, open each file once, then for each line split on ':'; the last element in the list is the number. You know what length the digits should consistently be (10 here), so use zfill to pad with zeros (note that zfill takes the total target width, not the number of zeros to add), then put the line back together with join:
with open("test.txt") as f, open("pad-link.txt", "w") as out:
    for line in f:
        words = line.rstrip('\n').split(':')
        if words[0] == 'link':
            words[-1] = words[-1].zfill(10)  # pad the number to a total width of 10
        out.write(':'.join(words) + '\n')

Create as many files as the number of items from two lists with the same number of items in Python

Consider the file testbam.txt:
/groups/cgsd/alexandre/gatk-workflows/src/exomesinglesample_out/bam/pfg001G.GRCh38DH.target.bam
/groups/cgsd/alexandre/gatk-workflows/src/exomesinglesample_out/bam/pfg002G.GRCh38DH.target.bam
/groups/cgsd/alexandre/gatk-workflows/src/exomesinglesample_out/bam/pfg014G.GRCh38DH.target.bam
and the file testbai.txt:
/groups/cgsd/alexandre/gatk-workflows/src/exomesinglesample_out/bam/pfg001G.GRCh38DH.target.bai
/groups/cgsd/alexandre/gatk-workflows/src/exomesinglesample_out/bam/pfg002G.GRCh38DH.target.bai
/groups/cgsd/alexandre/gatk-workflows/src/exomesinglesample_out/bam/pfg014G.GRCh38DH.target.bai
The two files always have the same number of lines, and I created a function to find that length:
def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

n = file_len('/groups/cgsd/alexandre/python_code/src/testbai.txt')
print(n)
3
Then I created two lists by opening the files and doing some manipulation:
content = []
with open('/groups/cgsd/alexandre/python_code/src/testbam.txt') as bams:
    for line in bams:
        content.append(line.strip().split())
print(content)

content2 = []
with open('/groups/cgsd/alexandre/python_code/src/testbai.txt') as bais:
    for line in bais:
        content2.append(line.strip().split())
print(content2)
Now I have a json type file called mutec.json that I would like to replace certain parts with the items of the lists:
{
    "Mutect2.gatk_docker": "broadinstitute/gatk:4.1.4.1",
    "Mutect2.intervals": "/groups/cgsd/alexandre/gatk-workflows/src/interval_list/Basic_Core_xGen_MSI_TERT_HPV_EBV_hg38.interval_list",
    "Mutect2.scatter_count": 30,
    "Mutect2.m2_extra_args": "--downsampling-stride 20 --max-reads-per-alignment-start 6 --max-suspicious-reads-per-alignment-start 6",
    "Mutect2.filter_funcotations": true,
    "Mutect2.funco_reference_version": "hg38",
    "Mutect2.run_funcotator": true,
    "Mutect2.make_bamout": true,
    "Mutect2.funco_data_sources_tar_gz": "/groups/cgsd/alexandre/gatk-workflows/mutect2/inputs/funcotator_dataSources.v1.6.20190124s.tar.gz",
    "Mutect2.funco_transcript_selection_list": "/groups/cgsd/alexandre/gatk-workflows/mutect2/inputs/transcriptList.exact_uniprot_matches.AKT1_CRLF2_FGFR1.txt",
    "Mutect2.ref_fasta": "/groups/cgsd/alexandre/gatk-workflows/src/ref_Homo38_HPV/Homo_sapiens_assembly38_chrHPV.fasta",
    "Mutect2.ref_fai": "/groups/cgsd/alexandre/gatk-workflows/src/ref_Homo38_HPV/Homo_sapiens_assembly38_chrHPV.fasta.fai",
    "Mutect2.ref_dict": "/groups/cgsd/alexandre/gatk-workflows/src/ref_Homo38_HPV/Homo_sapiens_assembly38_chrHPV.dict",
    "Mutect2.tumor_reads": "<<<N_item_of_list_content>>>",
    "Mutect2.tumor_reads_index": "<<<N_item_of_list_content2>>>",
}
Please note that this section:
"Mutect2.tumor_reads": "<<<N_item_of_list_content>>>",
"Mutect2.tumor_reads_index": "<<<N_item_of_list_content2>>>",
<<<N_item_of_list_content>>> and <<<N_item_of_list_content2>>> should be replaced by the respective items of the lists, and I would like to write the result of every modification into a new file.
The final result would be 3 files: mutect1.json with the first item from testbam.txt and the first item from testbai.txt, mutect2.json with the second item from each, and a third file by the same reasoning.
Please note that the notation <<<N_item_of_list_content>>> and <<<N_item_of_list_content2>>> isn't necessarily hard-coded into the file; I wrote it myself just to make clear what I would like to replace.
First, and even if it is unrelated to the question, some of your code is not really Pythonic:
def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

You use a for loop over enumerate just to count lines, when you could simply do:

def file_len(fname):
    with open(fname) as f:
        return sum(1 for _ in f)

because f is an iterator over the lines of the file.
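A quick check of that counting idiom, using io.StringIO as a stand-in for a real file handle (the function name here is mine):

```python
import io

def file_len_from_handle(f):
    # sum(1 for _ in f) consumes the line iterator and counts the lines
    return sum(1 for _ in f)

fake_file = io.StringIO("line1\nline2\nline3\n")
print(file_len_from_handle(fake_file))  # prints 3
```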
Now to your question. You want to replace some elements in a file with data found in two other files.
In your initial question, the strings were enclosed in triple angle brackets.
I would have used:
import re

rx = re.compile(r'<<<.*?>>>')                       # how to identify what is to replace

with open('.../testbam.txt') as bams, open('.../testbai.txt') as bais, \
     open('.../mutect.json') as src:
    for i, reps in enumerate(zip(bams, bais), 1):   # gets a pair of replacement strings at each step
        reps = [r.strip() for r in reps]            # drop the trailing newlines
        src.seek(0)                                 # rewind src file
        with open(f'mutect{i}', 'w') as fdout:      # open the output files
            rep_index = 0                           # will first use rep string from first file
            for line in src:
                if rx.search(line):                 # is the string to replace there?
                    line = rx.sub(reps[rep_index], line)
                    rep_index = 1 - rep_index       # next time will use the other string
                fdout.write(line)
In the comments, you proposed instead to replace the first pair of strings from the files with each of the following pairs. The code could become:
with open('.../testbam.txt') as bams, open('.../testbai.txt') as bais, \
     open('.../mutect.json') as src:
    it = iter(zip(bams, bais))
    to_find = [s.strip() for s in next(it)]      # we will have to find that
    for i, reps in enumerate(it, 2):             # gets a pair of replacement strings at each step
        reps = [s.strip() for s in reps]
        src.seek(0)                              # rewind src file
        with open(f'mutect{i}', 'w') as fdout:   # open the output files
            for line in src:
                line = line.replace(to_find[0], reps[0])   # just try to replace
                line = line.replace(to_find[1], reps[1])
                fdout.write(line)
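Since the template is JSON, another option is to parse it with the json module, overwrite the two keys directly, and dump one file per pair of paths. This assumes the file is valid JSON (i.e. without the trailing comma shown above); the template and path lists below are shortened stand-ins for the real files:

```python
import json

# Hypothetical stand-ins for the real mutec.json and the two path lists.
template = {
    "Mutect2.gatk_docker": "broadinstitute/gatk:4.1.4.1",
    "Mutect2.tumor_reads": "<<<N_item_of_list_content>>>",
    "Mutect2.tumor_reads_index": "<<<N_item_of_list_content2>>>",
}
bams = ["/bam/pfg001G.bam", "/bam/pfg002G.bam", "/bam/pfg014G.bam"]
bais = ["/bam/pfg001G.bai", "/bam/pfg002G.bai", "/bam/pfg014G.bai"]

for i, (bam, bai) in enumerate(zip(bams, bais), 1):
    out = dict(template)                  # copy so the template stays untouched
    out["Mutect2.tumor_reads"] = bam
    out["Mutect2.tumor_reads_index"] = bai
    with open('mutect{}.json'.format(i), 'w') as fh:
        json.dump(out, fh, indent=4)
```

This sidesteps string matching entirely, at the cost of losing any formatting quirks of the original file.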

Data Formatting within a txt. File

I have the following txt file that needs to be formatted with specific start and end positions for the data throughout the file. For instance, column 1 is blank and will be read as an entry number; its value is numeric with a width of 9, occupying positions 1-9. Next is the employee ID at positions 10-15, and so on. Values do not need a delimiter.
,MB4858,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225D,,,DF2016,CA4310,,0172CA,,,,,Y,
,MD6535,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225D,,,DF2016,CA4310,,0172CA,,,,,Y,
,PM7858,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225D,,,DF2016,CA4310,,0172CA,,,,,Y,
,RM0111,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225D,,,DF2016,CA4310,,0172CA,,,,,Y,
,RY2585,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225D,,,DF2016,CA4310,,0172CA,,,,,Y,
,TM0617 ,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225D,,,DF2016,CA4310,,0172CA,,,,,Y,
,VE2495,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225D,,,DF2016,CA4310,,0172CA,,,,,Y,
,VJ8913,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225D,,,DF2016,CA4310,,0172CA,,,,,Y,
,FJ4815 ,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225D,,,DF2016,CA4310,,0172CA,,,,,Y,
,OM0188,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225D,,,DF2016,CA4310,,0172CA,,,,,Y,
,H00858,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225DH,,,DF2016,CA4311,,0172CA,,,,,Y,
,H08392,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225DH,,,DF2016,CA4311,,0172CA,,,,,Y,
,H15624,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225DH,,,DF2016,CA4311,,0172CA,,,,,Y,
,H27573,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225DH,,,DF2016,CA4311,,0172CA,,,,,Y,
,H40249,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225DH,,,DF2016,CA4311,,0172CA,,,,,Y,
,H44581,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225DH,,,DF2016,CA4311,,0172CA,,,,,Y,
,H48473,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225DH,,,DF2016,CA4311,,0172CA,,,,,Y,
,H51570,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225DH,,,DF2016,CA4311,,0172CA,,,,,Y,
,H55768,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225DH,,,DF2016,CA4311,,0172CA,,,,,Y,
,H64315,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225DH,,,DF2016,CA4311,,0172CA,,,,,Y,
,H71507,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225DH,,,DF2016,CA4311,,0172CA,,,,,Y,
,H72248,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225DH,,,DF2016,CA4311,,0172CA,,,,,Y,
,H78527,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225DH,,,DF2016,CA4311,,0172CA,,,,,Y,
,H90393,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225DH,,,DF2016,CA4311,,0172CA,,,,,Y,
,H95973,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225DH,,,DF2016,CA4311,,0172CA,,,,,Y,
You can try starting here:
import sys
inFile = sys.argv[1]
outFile = "newFile.txt"
with open(inFile, 'r') as inf, open(outFile, 'w') as outf:
    for line in inf:
        line = line.split(',')
        print(line)
Where sys argv[1] is the name of your txt file when you run the python script from the command line.
You can see that it will print out a list containing individual strings between the comma delimiters you have in your txt data file. From there you can do list manipulations to format the data. And then write it to outf like so (example):
# do whatever manipulations here to the output line
output_line = line[0] + " " + line[1]
outf.write(output_line)
outf.write('\n')
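To place each field at fixed character positions, str.ljust can pad every value to its column width. In this sketch only the first two widths (9 for the entry number, 6 for the employee ID) come from the question; the remaining widths and the function name are assumptions:

```python
# Field widths in output order; only the first two come from the question,
# the rest are hypothetical.
widths = [9, 6, 2, 1, 2]

def to_fixed_width(fields, widths):
    # Left-justify each value, padding with spaces to its column width.
    return ''.join(f.ljust(w) for f, w in zip(fields, widths))

row = ',MB4858,01,1,CA'.split(',')
line = to_fixed_width(row, widths)
print(repr(line))
```

The blank first field becomes 9 spaces, so the employee ID lands exactly at positions 10-15.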

Adding each item in list to end of specific lines in FASTA file

I solved this in the comments below.
So essentially what I am trying to do is add each element of a list of strings to the end of specific lines in a different file.
Hard to explain but essentially I want to parse a FASTA file, and every time it reaches a header (line.startswith('>')) I want it to replace parts of that header with an element in a list I've already made.
For example:
File1:
">seq1 unwanted here
AATATTATA
ATATATATA
>seq2 unwanted stuff here
GTGTGTGTG
GTGTGTGTG
>seq3 more stuff I don't want
ACACACACAC
ACACACACAC"
I want it to keep ">seq#" but replace everything after with the next item in the list below:
List:
mylist = ['things1', '', 'things3', 'things4', '', 'things6', 'things7']
Result (modified file1):
">seq1 things1
AATATTATA
ATATATATA
>seq2 # adds nothing here due to mylist[1] = ''
GTGTGTGTG
GTGTGTGTG
>seq3 things3
ACACACACAC
ACACACACAC
As you can see I want it to add even the blank items in the list.
So once again, I want it to parse this FASTA file, and every time it gets to a header (there are thousands), I want it to replace everything after the first word with the next item in the separate list I have made.
What you have will work, but there are a few unnecessary lines, so I've edited it down. Also, an important note: you don't close your file handles. This can result in errors, particularly when writing to a file; either way it's bad practice. Code:
#!/usr/bin/python
import sys

# gets list of annotations
def get_annos(infile):
    with open(infile, 'r') as fh:   # makes sure the file is closed properly
        annos = []
        for line in fh:
            annos.append(line.split('\t')[5])   # added tab as separator
    return annos

# replaces extra info on each header with correct annotation
def add_annos(infile1, infile2, outfile):
    annos = get_annos(infile1)   # contains list of annos
    with open(infile2, 'r') as f2, open(outfile, 'w') as output:
        for line in f2:
            if line.startswith('>'):
                line_split = [line.split()[0]]   # keep only the first whitespace-separated word
                line_split.append(annos.pop(0))  # append annotation of interest to current id line
                output.write(' '.join(line_split) + '\n')   # join and write with a newline character
            else:
                output.write(line)

anno = sys.argv[1]
seq = sys.argv[2]
out = sys.argv[3]

add_annos(anno, seq, out)
This is not perfect but it cleans things up a bit. I'd might veer away from using pop() to associate the annotation data with the sequence IDs unless you are certain the files are in the same order every time.
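If the two files are guaranteed to be in the same order, an iterator over the annotations avoids the repeated pop(0) calls (each of which shifts the whole remaining list). A minimal sketch with made-up data; the function name is mine:

```python
def annotate_headers(fasta_lines, annos):
    # Pair each '>' header with the next annotation; other lines pass through.
    anno_iter = iter(annos)
    out = []
    for line in fasta_lines:
        if line.startswith('>'):
            # keep only the ID (first word), then attach the annotation
            out.append(line.split()[0] + ' ' + next(anno_iter))
        else:
            out.append(line)
    return out

fasta = ['>seq1 unwanted here', 'AATATTATA', '>seq2 more junk', 'GTGTGTGTG']
print(annotate_headers(fasta, ['things1', '']))
```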
There is a great Python library for parsing FASTA and other DNA file formats: Biopython. It is enormously helpful in bioinformatics, and you can manipulate the parsed data however you need.
Here is a simple example taken from the library's website:
from Bio import SeqIO

for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))
You should get something like this on your screen:
gi|2765658|emb|Z78533.1|CIZ78533
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet())
740
...
gi|2765564|emb|Z78439.1|PBZ78439
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC', SingleLetterAlphabet())
592
***********EDIT*********
I solved this before anyone could help. This is my code; can anyone tell me if I have any bad practices? Is there a way to do this without writing everything to a new file? It seems like it would take a long time and a lot of memory.
#!/usr/bin/python
# Script takes unedited FASTA file, removes seq length and
# other header info, adds annotation after sequence name
# run as: $ python addanno.py testanno.out testseq.fasta out.txt
import sys

# gets list of annotations
def get_annos(infile):
    f = open(infile)
    list2 = []
    for line in f:
        columns = line.strip().split('\t')
        list2.append(columns[5])
    return list2

# replaces extra info on each header with correct annotation
def add_annos(infile1, infile2, outfile):
    mylist = get_annos(infile1)   # contains list of annos
    f2 = open(infile2, 'r')
    output = open(out, 'w')
    for line in f2:
        if line.startswith('>'):
            l = line.partition(" ")
            list3 = list(l)
            del list3[1:]
            list3.append(' ')
            list3.append(mylist.pop(0))
            final = ''.join(list3)
            line = line.replace(line, final)
            output.write(line)
            output.write('\n')
        else:
            output.write(line)

anno = sys.argv[1]
seq = sys.argv[2]
out = sys.argv[3]

add_annos(anno, seq, out)
get_annos(anno)

python csv replace listitem

I have the following output from a CSV file:
word1|word2|word3|word4|word5|word6|01:12|word8
word1|word2|word3|word4|word5|word6|03:12|word8
word1|word2|word3|word4|word5|word6|01:12|word8
What I need to do is change the time string to look like 00:01:12.
My idea is to extract list item [7] and prepend the string "00:" to it.
import csv

with open('temp', 'r') as f:
    reader = csv.reader(f, delimiter="|")
    for row in reader:
        fixed_time = (str("00:") + row[7])
        begin = row[:6]
        end = row[:8]
        print begin + fixed_time + end
I get the error message:
TypeError: can only concatenate list (not "str") to list.
I also had a look at this post:
how to change [1,2,3,4] to '1234' using python
I need to know if my approach to the solution is the right way. Maybe I need to use split or something else for this.
Thanks for any help.
The line that's throwing the exception is
print begin + fixed_time +end
because begin and end are both lists and fixed_time is a string. Whenever you take a slice of a list (that's the row[:6] and row[:8] parts), a list is returned. If you just want to print it out, you can do
print begin, fixed_time, end
and you won't get an error.
Corrected code:
I'm opening a new file for writing (I'm calling it 'final', but you can call it whatever you want), and I'm just writing everything to it with the one modification. It's easiest to just change the one element of the list that has the line (row[6] here), and use '|'.join to write a pipe character between each column.
import csv

with open('temp', 'r') as f, open('final', 'w') as fw:
    reader = csv.reader(f, delimiter="|")
    for row in reader:
        # just change the element in the row to have the extra zeros
        row[6] = '00:' + row[6]
        # write the row back out, separated by | characters, and a new line
        fw.write('|'.join(row) + '\n')
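The manual '|'.join works here because no field contains a pipe character; csv.writer handles that detail (quoting) automatically. A small sketch of the same fix in Python 3 syntax, with the rows inlined instead of read from 'temp':

```python
import csv
import io

rows = [
    ['word1', 'word2', 'word3', 'word4', 'word5', 'word6', '01:12', 'word8'],
    ['word1', 'word2', 'word3', 'word4', 'word5', 'word6', '03:12', 'word8'],
]

# io.StringIO stands in for the output file
buf = io.StringIO()
writer = csv.writer(buf, delimiter='|', lineterminator='\n')
for row in rows:
    row[6] = '00:' + row[6]   # same fix: prepend the hours to the time field
    writer.writerow(row)

print(buf.getvalue())
```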
you can use regex for that:
>>> txt = """\
... word1|word2|word3|word4|word5|word6|01:12|word8
... word1|word2|word3|word4|word5|word6|03:12|word8
... word1|word2|word3|word4|word5|word6|01:12|word8"""
>>> import re
>>> print(re.sub(r'\|(\d\d:\d\d)\|', r'|00:\1|', txt))
word1|word2|word3|word4|word5|word6|00:01:12|word8
word1|word2|word3|word4|word5|word6|00:03:12|word8
word1|word2|word3|word4|word5|word6|00:01:12|word8
