Elegant way to parse a C array from python
Is there an elegant way to parse a C array and extract the elements at defined indexes into an output file?
For example:
myfile.c
my_array[SPECIFIC_SIZE]={
0x10,0x12,0x13,0x14,0x15,0x23,0x01,0x02,0x04,0x07,0x08,
0x33,0x97,0x52,0x27,0x56,0x11,0x99,0x97,0x95,0x77,0x23,
0x45,0x97,0x90,0x97,0x68,0x23,0x28,0x05,0x66,0x99,0x38,
0x11,0x37,0x27,0x11,0x22,0x33,0x44,0x66,0x09,0x88,0x17,
0x90,0x97,0x17,0x90,0x97,0x22,0x77,0x97,0x87,0x25,0x22,
0x25,0x47,0x97,0x57,0x97,0x67,0x26,0x62,0x67,0x69,0x96
}
Python script:
I would like to do something like (just as pseudocode)
def parse_data():
outfile = open(newfile.txt,'w')
with open(myfile, 'r')
SEARCH FOR ELEMENT WITH INDEX 0 IN my_array
COPY ELEMENT TO OUTFILE AND LABEL WITH "Version Number"
SEARCH FOR ALL ELEMENTS WITH INDEX 1..10 IN my_array
COPY ELEMENTS TO OUTFILE WITH NEW LINE AND LABEL with "Date"
....
....
At the end I would like to have a newfile.txt like:
Version Number:
0x10
Date:
0x12,0x13,0x14,0x15,0x23,0x01,0x02,0x04,0x07,0x08
Can you show an example on that pseudocode?
If your .c file is always structured like this:
First line is the declaration of the array.
Middle lines are the data.
Last line is the closing bracket.
You can do...
def parse_myfile(fileName, outName):
    with open(outName, 'w') as out:
        with open(fileName, 'r') as f:
            """ 1. Read all lines, except first and last.
                2. Join all the lines together.
                3. Replace all the '\n' by ''.
                4. Split using ','.
            """
            lines = (''.join(f.readlines()[1:-1])).replace('\n', '').split(',')
            header = lines[0]
            date = lines[1:11]
            out.write('Version Number:\n{}\n\nDate:\n{}'.format(header, date))

if __name__ == '__main__':
    fileName = 'myfile.c'
    outFile = 'output.txt'
    parse_myfile(fileName, outFile)
cat output.txt outputs...
Version Number:
0x10
Date:
['0x12', '0x13', '0x14', '0x15', '0x23', '0x01', '0x02', '0x04', '0x07', '0x08']
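A more robust variant (a sketch, not from the answer above) uses a regular expression to grab everything between the braces, so the exact line layout of the .c file no longer matters; the sample string below stands in for the file contents:

```python
import re

def parse_c_array(text):
    """Extract the hex tokens between the braces of a C array initializer."""
    body = re.search(r'\{(.*?)\}', text, re.DOTALL).group(1)
    return [tok.strip() for tok in body.split(',') if tok.strip()]

src = "my_array[SPECIFIC_SIZE]={\n0x10,0x12,0x13,\n0x14\n}"
values = parse_c_array(src)
print('Version Number:\n{}\n'.format(values[0]))
print('Date:\n{}'.format(','.join(values[1:11])))
```

This also produces the comma-separated "Date" line from the expected output, rather than a Python list repr.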
Related
left padding with python
I have the following data, a combination of 100000 dn/link entries:

dn:id=2150fccc-beb8-42f8-b201-182a6bf5ddfe,ou=test,dc=com link:545214569
dn:id=ffa55959-457d-49e6-b4cf-a34eff8bbfb7,ou=test,dc=com link:32546897
dn:id=3452a4c3-b768-43f5-8f1e-d33c14787b9b,ou=test,dc=com link:6547896541

I am trying to write a program in Python 2.7 to add left-padding zeros if the value of link is less than 10 digits. E.g.:

545214569 --> 0545214569
32546897 --> 0032546897

Can you please guide me on what I am doing wrong with the following program:

with open("test.txt", "r") as f:
    line = f.readline()
    line1 = f.readline()

wordcheck = "link"
wordcheck1 = "dn"
for wordcheck1 in line1:
    with open("pad-link.txt", "a") as ff:
        for wordcheck in line:
            with open("pad-link.txt", "a") as ff:
                key, val = line.strip().split(":")
                val1 = val.strip().rjust(10, '0')
                line = line.replace(val, val1)
                print (line)
                print (line1)
                ff.write(line1 + "\n")
                ff.write('%s:%s \n' % (key, val1))
The usual Pythonic way to pad values in Python is string formatting, using the Format Specification Mini-Language:

link = 545214569
print('{:0>10}'.format(link))
Your for wordcheck1 in line1: and for wordcheck in line: aren't doing what you think. They iterate one character at a time over the lines and assign that character to the loop variable. If you only want to change the input file to have leading zeros, this can be simplified as:

import re

# Read the whole file into memory
with open('input.txt') as f:
    data = f.read()

# Replace all instances of "link:<digits>", passing the digits to a function that
# formats the replacement as a width-10 field, right-justified with zeros as padding.
data = re.sub(r'link:(\d+)', lambda m: 'link:{:0>10}'.format(m.group(1)), data)

with open('output.txt', 'w') as f:
    f.write(data)

output.txt:

dn:id=2150fccc-beb8-42f8-b201-182a6bf5ddfe,ou=test,dc=com link:0545214569
dn:id=ffa55959-457d-49e6-b4cf-a34eff8bbfb7,ou=test,dc=com link:0032546897
dn:id=3452a4c3-b768-43f5-8f1e-d33c14787b9b,ou=test,dc=com link:6547896541
I don't know why you have to open the file so many times. Anyway, open it once, then for each line split on ':'; the last element in the list is the number. Note that zfill takes the total width you want (10 here, per the question), not the number of zeros to add. Then put the line back together with join:

for line in f.readlines():
    words = line.rstrip('\n').split(':')
    words[-1] = words[-1].zfill(10)  # zfill pads to a total width of 10
    newline = ':'.join(words)
    # write this line to file
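A runnable sketch of the zfill approach (the input lines are hard-coded stand-ins for the file, and the width of 10 comes from the question):

```python
lines = [
    "dn:id=2150fccc-beb8-42f8-b201-182a6bf5ddfe,ou=test,dc=com link:545214569",
    "dn:id=3452a4c3-b768-43f5-8f1e-d33c14787b9b,ou=test,dc=com link:6547896541",
]

fixed = []
for line in lines:
    # split on the last ':' only, so the ':' inside the dn part is untouched
    head, _, number = line.rpartition(':')
    fixed.append('{}:{}'.format(head, number.zfill(10)))

print('\n'.join(fixed))
```

Using rpartition avoids the key, val = line.split(":") crash from the question, which fails because the dn portion also contains a colon.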
Create as many files as the number of items from two lists with the same number of items in Python
Consider the file testbam.txt:

/groups/cgsd/alexandre/gatk-workflows/src/exomesinglesample_out/bam/pfg001G.GRCh38DH.target.bam
/groups/cgsd/alexandre/gatk-workflows/src/exomesinglesample_out/bam/pfg002G.GRCh38DH.target.bam
/groups/cgsd/alexandre/gatk-workflows/src/exomesinglesample_out/bam/pfg014G.GRCh38DH.target.bam

and the file testbai.txt:

/groups/cgsd/alexandre/gatk-workflows/src/exomesinglesample_out/bam/pfg001G.GRCh38DH.target.bai
/groups/cgsd/alexandre/gatk-workflows/src/exomesinglesample_out/bam/pfg002G.GRCh38DH.target.bai
/groups/cgsd/alexandre/gatk-workflows/src/exomesinglesample_out/bam/pfg014G.GRCh38DH.target.bai

They always have the same length, and I created a function to find it:

def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

n = file_len('/groups/cgsd/alexandre/python_code/src/testbai.txt')
print(n)

3

Then I created two lists by opening the files and doing some manipulation:

content = []
with open('/groups/cgsd/alexandre/python_code/src/testbam.txt') as bams:
    for line in bams:
        content.append(line.strip().split())
print(content)

content2 = []
with open('/groups/cgsd/alexandre/python_code/src/testbai.txt') as bais:
    for line in bais:
        content2.append(line.strip().split())
print(content2)

Now I have a JSON-type file called mutec.json whose certain parts I would like to replace with the items of the lists:

{
  "Mutect2.gatk_docker": "broadinstitute/gatk:4.1.4.1",
  "Mutect2.intervals": "/groups/cgsd/alexandre/gatk-workflows/src/interval_list/Basic_Core_xGen_MSI_TERT_HPV_EBV_hg38.interval_list",
  "Mutect2.scatter_count": 30,
  "Mutect2.m2_extra_args": "--downsampling-stride 20 --max-reads-per-alignment-start 6 --max-suspicious-reads-per-alignment-start 6",
  "Mutect2.filter_funcotations": true,
  "Mutect2.funco_reference_version": "hg38",
  "Mutect2.run_funcotator": true,
  "Mutect2.make_bamout": true,
  "Mutect2.funco_data_sources_tar_gz": "/groups/cgsd/alexandre/gatk-workflows/mutect2/inputs/funcotator_dataSources.v1.6.20190124s.tar.gz",
  "Mutect2.funco_transcript_selection_list": "/groups/cgsd/alexandre/gatk-workflows/mutect2/inputs/transcriptList.exact_uniprot_matches.AKT1_CRLF2_FGFR1.txt",
  "Mutect2.ref_fasta": "/groups/cgsd/alexandre/gatk-workflows/src/ref_Homo38_HPV/Homo_sapiens_assembly38_chrHPV.fasta",
  "Mutect2.ref_fai": "/groups/cgsd/alexandre/gatk-workflows/src/ref_Homo38_HPV/Homo_sapiens_assembly38_chrHPV.fasta.fai",
  "Mutect2.ref_dict": "/groups/cgsd/alexandre/gatk-workflows/src/ref_Homo38_HPV/Homo_sapiens_assembly38_chrHPV.dict",
  "Mutect2.tumor_reads": "<<<N_item_of_list_content>>>",
  "Mutect2.tumor_reads_index": "<<<N_item_of_list_content2>>>",
}

Please note that this section:

  "Mutect2.tumor_reads": "<<<N_item_of_list_content>>>",
  "Mutect2.tumor_reads_index": "<<<N_item_of_list_content2>>>",

is where <<<N_item_of_list_content>>> and <<<N_item_of_list_content2>>> should be replaced by the respective items of the lists, and I would like to write the result of every modification into a new file. The final result would be 3 files: mutect1.json with the first item from testbam.txt and the first item from testbai.txt, mutect2.json with the second item from testbam.txt and the second item from testbai.txt, and a third file by the same reasoning. Please note that the notation <<<N_item_of_list_content>>> and <<<N_item_of_list_content2>>> isn't necessarily hard-coded into the file; I wrote it myself just to make clear what I would like to replace.
First, and even if it is unrelated to the question, some of your code is not really Pythonic:

def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

Since f is an iterator over the lines of the file, you can write this more directly as:

def file_len(fname):
    with open(fname) as f:
        return sum(1 for _ in f)

(Note that len(f) does not work here: file objects are iterators and have no length.)

Now to your question. You want to replace some elements in a file with data found in two other files. In your initial question, the strings were enclosed in triple angle brackets. I would have used:

import re

rx = re.compile(r'<<<.*?>>>')  # how to identify what is to replace

with open('.../testbam.txt') as bams, open('.../testbai.txt') as bais, \
     open('.../mutect.json') as src:
    for i, reps in enumerate(zip(bams, bais), 1):  # a pair of replacement strings at each step
        src.seek(0)                                # rewind src file
        with open(f'mutect{i}.json', 'w') as fdout:  # open the output file
            rep_index = 0        # will first use the rep string from the first file
            for line in src:
                if rx.search(line):                        # is the string to replace there?
                    line = rx.sub(reps[rep_index].strip(), line)  # strip the trailing newline
                    rep_index = 1 - rep_index              # next time, use the other string
                fdout.write(line)

In comments, you proposed to instead replace the first line of each file with the others. The code could become:

with open('.../testbam.txt') as bams, open('.../testbai.txt') as bais, \
     open('.../mutect.json') as src:
    it = iter(zip(bams, bais))
    to_find = next(it)                # we will have to find that
    for i, reps in enumerate(it, 2):  # a pair of replacement strings at each step
        src.seek(0)                   # rewind src file
        with open(f'mutect{i}.json', 'w') as fdout:  # open the output file
            for line in src:
                line = line.replace(to_find[0], reps[0])  # just try to replace
                line = line.replace(to_find[1], reps[1])
                fdout.write(line)
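Since the template is JSON, another option (a sketch, not from the answer above; the short bam/bai names are stand-ins for the lines of testbam.txt and testbai.txt, and the template is abbreviated) is to load it with the json module and assign the two fields directly, which avoids pattern matching entirely:

```python
import json

# Abbreviated stand-in for mutec.json; the real file has many more keys.
template = {
    "Mutect2.gatk_docker": "broadinstitute/gatk:4.1.4.1",
    "Mutect2.tumor_reads": "PLACEHOLDER",
    "Mutect2.tumor_reads_index": "PLACEHOLDER",
}
bams = ["pfg001G.bam", "pfg002G.bam", "pfg014G.bam"]  # lines of testbam.txt
bais = ["pfg001G.bai", "pfg002G.bai", "pfg014G.bai"]  # lines of testbai.txt

for i, (bam, bai) in enumerate(zip(bams, bais), 1):
    doc = dict(template)  # fresh copy for each output file
    doc["Mutect2.tumor_reads"] = bam
    doc["Mutect2.tumor_reads_index"] = bai
    with open('mutect{}.json'.format(i), 'w') as f:
        json.dump(doc, f, indent=4)
```

This sidesteps the trailing-comma problem too: the snippet in the question ends with a comma before the closing brace, which json.load would reject, so the real file would need that fixed first.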
Data Formatting within a .txt File
I have the following .txt file that needs to be formatted with specific start and end positions for the data throughout the file. For instance, column 1 is blank and will be read as an entry number. The values for this data type are numeric with width 9 and should occupy positions 1-9. Next is the employee ID with positions 10-15, and so on. Values do not need a delimiter.

,MB4858,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225D,,,DF2016,CA4310,,0172CA,,,,,Y,
,MD6535,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225D,,,DF2016,CA4310,,0172CA,,,,,Y,
,PM7858,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225D,,,DF2016,CA4310,,0172CA,,,,,Y,
,RM0111,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225D,,,DF2016,CA4310,,0172CA,,,,,Y,
,RY2585,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225D,,,DF2016,CA4310,,0172CA,,,,,Y,
,TM0617 ,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225D,,,DF2016,CA4310,,0172CA,,,,,Y,
,VE2495,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225D,,,DF2016,CA4310,,0172CA,,,,,Y,
,VJ8913,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225D,,,DF2016,CA4310,,0172CA,,,,,Y,
,FJ4815 ,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225D,,,DF2016,CA4310,,0172CA,,,,,Y,
,OM0188,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225D,,,DF2016,CA4310,,0172CA,,,,,Y,
,H00858,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225DH,,,DF2016,CA4311,,0172CA,,,,,Y,
,H08392,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225DH,,,DF2016,CA4311,,0172CA,,,,,Y,
,H15624,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225DH,,,DF2016,CA4311,,0172CA,,,,,Y,
,H27573,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225DH,,,DF2016,CA4311,,0172CA,,,,,Y,
,H40249,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225DH,,,DF2016,CA4311,,0172CA,,,,,Y,
,H44581,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225DH,,,DF2016,CA4311,,0172CA,,,,,Y,
,H48473,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225DH,,,DF2016,CA4311,,0172CA,,,,,Y,
,H51570,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225DH,,,DF2016,CA4311,,0172CA,,,,,Y,
,H55768,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225DH,,,DF2016,CA4311,,0172CA,,,,,Y,
,H64315,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225DH,,,DF2016,CA4311,,0172CA,,,,,Y,
,H71507,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225DH,,,DF2016,CA4311,,0172CA,,,,,Y,
,H72248,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225DH,,,DF2016,CA4311,,0172CA,,,,,Y,
,H78527,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225DH,,,DF2016,CA4311,,0172CA,,,,,Y,
,H90393,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225DH,,,DF2016,CA4311,,0172CA,,,,,Y,
,H95973,01,1,CA,07/18/20,0,0,4.8,,,,,,14.77,,Y,2225DH,,,DF2016,CA4311,,0172CA,,,,,Y,
You can try starting here:

import sys

inFile = sys.argv[1]
outFile = "newFile.txt"

with open(inFile, 'r') as inf, open(outFile, 'w') as outf:
    for line in inf:
        line = line.split(',')
        print(line)

where sys.argv[1] is the name of your .txt file when you run the Python script from the command line. You can see that it will print out a list containing the individual strings between the comma delimiters in your data file. From there you can do list manipulations to format the data, and then write it to outf like so (example):

        # do whatever manipulations here to the output line
        output_line = line[0] + " " + line[1]
        outf.write(output_line)
        outf.write('\n')
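To actually place each field at fixed start/end positions, str.ljust can pad each column to its exact width. A sketch: only the first two widths (9 for the entry number at positions 1-9, 6 for the employee ID at positions 10-15) come from the question; the remaining widths are made-up assumptions for illustration:

```python
def to_fixed_width(fields, widths):
    """Left-justify each field to its column width; no delimiter in the output."""
    return ''.join(field.ljust(w) for field, w in zip(fields, widths))

line = ",MB4858,01,1,CA"          # truncated sample record from the file
fields = line.split(',')          # ['', 'MB4858', '01', '1', 'CA']
widths = [9, 6, 2, 1, 2]          # entry no. (1-9), employee ID (10-15), then guesses
record = to_fixed_width(fields, widths)
```

Numeric columns that should be right-justified (like the entry number) can use rjust instead; a per-column list of (width, justify) pairs generalizes this.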
Adding each item in list to end of specific lines in FASTA file
I solved this in the comments below. So essentially what I am trying to do is add each element of a list of strings to the end of specific lines in a different file. Hard to explain, but essentially I want to parse a FASTA file, and every time it reaches a header (line.startswith('>')) I want it to replace parts of that header with an element of a list I've already made. For example:

File1:

>seq1 unwanted here
AATATTATA
ATATATATA
>seq2 unwanted stuff here
GTGTGTGTG
GTGTGTGTG
>seq3 more stuff I don't want
ACACACACAC
ACACACACAC

I want it to keep '>seq#' but replace everything after with the next item in the list below:

mylist = ['things1', '', 'things3', 'things4', '', 'things6', 'things7']

Result (modified File1):

>seq1 things1
AATATTATA
ATATATATA
>seq2            # adds nothing here because mylist[1] == ''
GTGTGTGTG
GTGTGTGTG
>seq3 things3
ACACACACAC
ACACACACAC

As you can see, I want it to use even the blank items in the list. So once again, I want it to parse this FASTA file, and every time it gets to a header (there are thousands), replace everything after the first word with the next item in the separate list I have made.
What you have will work, but there are a few unnecessary lines, so I've edited it down. Also, an important note: you don't close your file handles. This could result in errors, specifically when writing to file; either way it's bad practice. Code:

#!/usr/bin/python
import sys

# gets list of annotations
def get_annos(infile):
    with open(infile, 'r') as fh:  # makes sure the file is closed properly
        annos = []
        for line in fh:
            annos.append(line.split('\t')[5])  # added tab as separator
    return annos

# replaces extra info on each header with correct annotation
def add_annos(infile1, infile2, outfile):
    annos = get_annos(infile1)  # contains list of annos
    with open(infile2, 'r') as f2, open(outfile, 'w') as output:
        for line in f2:
            if line.startswith('>'):
                line_split = [line.split()[0]]   # keep only the first word of the header
                line_split.append(annos.pop(0))  # append the annotation of interest
                output.write(' '.join(line_split) + '\n')  # join and write with a newline
            else:
                output.write(line)

anno = sys.argv[1]
seq = sys.argv[2]
out = sys.argv[3]

add_annos(anno, seq, out)

This is not perfect but it cleans things up a bit. I'd veer away from using pop() to associate the annotation data with the sequence IDs unless you are certain the files are in the same order every time.
There is a great library in Python for FASTA and other DNA file parsing: Biopython. It is very helpful in bioinformatics, and you can manipulate the data according to your needs. Here is a simple example taken from the library's website:

from Bio import SeqIO

for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))

You should get something like this on your screen:

gi|2765658|emb|Z78533.1|CIZ78533
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet())
740
...
gi|2765564|emb|Z78439.1|PBZ78439
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC', SingleLetterAlphabet())
592
***********EDIT*********

I solved this before anyone could help. This is my code; can anyone tell me if I have any bad practices? Is there a way to do this without writing everything to a new file? It seems like that would take a long time / lots of memory.

#!/usr/bin/python
# Script takes an unedited FASTA file, removes the seq length and
# other header info, and adds the annotation after the sequence name.
# run as: $ python addanno.py testanno.out testseq.fasta out.txt

import sys

# gets list of annotations
def get_annos(infile):
    f = open(infile)
    list2 = []
    for line in f:
        columns = line.strip().split('\t')
        list2.append(columns[5])
    return list2

# replaces extra info on each header with correct annotation
def add_annos(infile1, infile2, outfile):
    mylist = get_annos(infile1)      # contains list of annos
    f2 = open(infile2, 'r')
    output = open(outfile, 'w')      # was open(out, 'w'): use the parameter, not the global
    for line in f2:
        if line.startswith('>'):
            l = line.partition(" ")
            list3 = list(l)
            del list3[1:]
            list3.append(' ')
            list3.append(mylist.pop(0))
            final = ''.join(list3)
            output.write(final)
            output.write('\n')
        else:
            output.write(line)

anno = sys.argv[1]
seq = sys.argv[2]
out = sys.argv[3]

add_annos(anno, seq, out)
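On the in-place question at the end of the edit: the stdlib fileinput module can rewrite a file without you managing a second handle (it still writes a temporary copy behind the scenes, so memory use stays low). A sketch assuming the same FASTA layout; the demo file and the hard-coded annotation list stand in for the real inputs:

```python
import fileinput

# Stand-in for get_annos(); blank entries produce a bare '>seqN' header.
annos = ['things1', '', 'things3']

# Set up a tiny example file to rewrite in place.
with open('demo.fasta', 'w') as f:
    f.write('>seq1 unwanted here\nAATAT\n>seq2 junk\nGTGTG\n>seq3 more junk\nACACA\n')

it = iter(annos)
with fileinput.input('demo.fasta', inplace=True) as fi:
    for line in fi:
        if line.startswith('>'):
            # keep the first word of the header, append the next annotation;
            # inside inplace mode, print() output goes back into the file
            print((line.split()[0] + ' ' + next(it)).rstrip())
        else:
            print(line, end='')
```

The rstrip() handles the blank annotations so empty list items don't leave a trailing space on the header line.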
python csv replace listitem
I have the following output from a csv file:

word1|word2|word3|word4|word5|word6|01:12|word8
word1|word2|word3|word4|word5|word6|03:12|word8
word1|word2|word3|word4|word5|word6|01:12|word8

What I need to do is change the time string to look like 00:01:12. My idea is to extract list item [7] and add "00:" as a string to the front.

import csv

with open('temp', 'r') as f:
    reader = csv.reader(f, delimiter="|")
    for row in reader:
        fixed_time = (str("00:") + row[7])
        begin = row[:6]
        end = row[:8]
        print begin + fixed_time + end

I get the error message: TypeError: can only concatenate list (not "str") to list. I also had a look at this post: how to change [1,2,3,4] to '1234' using python. I need to know if my approach to the solution is the right way; maybe I need to use split or something else for this. Thanks for any help.
The line that's throwing the exception is print begin + fixed_time + end, because begin and end are both lists and fixed_time is a string. Whenever you take a slice of a list (that's the row[:6] and row[:8] parts), a list is returned. If you just want to print it out, you can do print begin, fixed_time, end and you won't get an error.

Corrected code: I'm opening a new file for writing (I'm calling it 'final', but you can call it whatever you want), and I'm just writing everything to it with the one modification. It's easiest to just change the one element of the row that has the time (row[6] here), and use '|'.join to write a pipe character between each column:

import csv

with open('temp', 'r') as f, open('final', 'w') as fw:
    reader = csv.reader(f, delimiter="|")
    for row in reader:
        # just change the element in the row to have the extra zeros
        row[6] = '00:' + row[6]
        # write the row back out, separated by | characters, and a newline
        fw.write('|'.join(row) + '\n')
You can use a regex for that:

>>> txt = """\
... word1|word2|word3|word4|word5|word6|01:12|word8
... word1|word2|word3|word4|word5|word6|03:12|word8
... word1|word2|word3|word4|word5|word6|01:12|word8"""
>>> import re
>>> print(re.sub(r'\|(\d\d:\d\d)\|', r'|00:\1|', txt))
word1|word2|word3|word4|word5|word6|00:01:12|word8
word1|word2|word3|word4|word5|word6|00:03:12|word8
word1|word2|word3|word4|word5|word6|00:01:12|word8