I'm writing a function that will read a file given the number of header lines to skip and the number of footer lines to skip.
def LoadText(file, HeaderLinesToSkip, FooterLinesToSkip):
    fin = open(file)
    text = []
    for line in fin.readlines()[HeaderLinesToSkip:-FooterLinesToSkip]:
        text.append(line.strip())
    return text
My problem is that this function works properly only if FooterLinesToSkip is at least 1. If FooterLinesToSkip = 0, the function returns []. I can solve this with an if statement, but is there a simpler form?
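For illustration, the empty result comes from how a negative zero behaves in a slice, shown here on a throwaway list:
lines = ['a', 'b', 'c', 'd', 'e']
print(lines[1:-1])  # ['b', 'c', 'd'] -- skip one header line and one footer line
print(lines[1:-0])  # []              -- -0 is just 0, so the slice ends before it starts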
Edit: I actually simplified my problem; the lines read from the file contain columns separated by a semicolon. The real function includes .split(delimiter_character) and should store only column 1.
def LoadText(file, HeaderLinesToSkip, FooterLinesToSkip):
    fin = open(file)
    text = []
    for line in fin.readlines()[HeaderLinesToSkip:-FooterLinesToSkip]:
        text.append(line.strip().split(';')[1])
    return text
Set FooterLinesToSkip to None instead, so the slice defaults to the list length:
def LoadText(file, HeaderLinesToSkip, FooterLinesToSkip):
    with open(file) as fin:
        FooterLinesToSkip = -FooterLinesToSkip if FooterLinesToSkip else None
        text = []
        for line in fin.readlines()[HeaderLinesToSkip:FooterLinesToSkip]:
            text.append(line.strip().split(';')[1])
    return text
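A quick check with a hypothetical data.txt (one header line, semicolon-separated columns, no footer lines) shows the zero case now works:
# data.txt (hypothetical contents):
#   name;value
#   a;1
#   b;2
print(LoadText('data.txt', 1, 0))   # ['1', '2'] instead of []
print(LoadText('data.txt', 1, 1))   # ['1'] -- last line treated as a footer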
Let me offer you an improvement, which does not require you to read the whole file into memory:
from collections import deque
from itertools import islice

def skip_headers_and_footers(fh, header_skip, footer_skip):
    buffer = deque(islice(fh, header_skip, header_skip + footer_skip), footer_skip)
    for line in fh:
        yield buffer.popleft()
        buffer.append(line)
This reads lines one by one, after skipping header_skip lines, and keeps footer_skip lines in a buffer. By the time we have looped over all the lines in the file, footer_skip lines remain in the buffer and are ignored.
This is a generator function, so it yields lines as you loop over it:
with open(filename) as open_file:
    for line in skip_headers_and_footers(open_file, 2, 2):
        # do something with this line.
        line = line.strip()
I moved the file opening out of the function so that it can be used for other iterables too, not just files.
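For example, a quick check with a plain list (wrapped in iter() so the generator consumes it the same way it consumes a file handle):
rows = ['header', 'data 1', 'data 2', 'data 3', 'footer']
print(list(skip_headers_and_footers(iter(rows), 1, 1)))
# ['data 1', 'data 2', 'data 3']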
Now you can use the csv module to handle the column splitting and stripping:
import csv

with open(filename, newline='') as open_file:
    reader = csv.reader(open_file, delimiter=';')
    for row in skip_headers_and_footers(reader, 2, 2):
        column = row[1]
and the skip_headers_and_footers() generator has skipped the first two rows for you and will never yield the last two rows either.
Hello there community!
I've been struggling with this function for a while and I cannot seem to make it work.
I need a function that reads a csv file and takes as arguments: the csv file, and a range of first to last line to be read. I've looked into many threads but none seem to work in my case.
My current function is based on the answer to this post: How to read specific lines of a large csv file
It looks like this:
def lines(file, first, last):
    lines_set = range(first, last, 1)
    result = []
    with open("./2020.csv", "r") as csvfile:
        for line_number, row in enumerate(csvfile):
            if line_number in lines_set:
                result.append(extract_data(csvfile.readline()))  # this extract_data is a previously created function that fetches specific columns from the csv
    return result
What's happening is that it's skipping a row every time, meaning that instead of reading, for example, from line 1 to 4, it reads these four lines: 1, 3, 5 and 7.
Additional question: the CSV file has a header line. How can I use the next() method so the header is never included?
Thank you very much for all your help!
I recommend you use the CSV reader; it will save you from getting incomplete rows, since a row can span multiple lines.
That said, your basic problem is calling csvfile.readline() inside the for loop, which is already iterating over the file, so each readline() consumes (and skips) an extra line:
import csv

def lines(file, first, last):
    lines_set = range(first, last, 1)
    result = []
    with open(file, "r") as csv_file:
        reader = csv.reader(csv_file)
        next(reader)  # only if the CSV has a header AND it should not be counted
        for line_number, row in enumerate(reader):
            if line_number in lines_set:
                result.append(extract_data(row))
    return result
Also, keep in mind that enumerate() starts at 0 by default; pass the start=1 option (or whatever start value you need) to count the way you expect. Then, range() has a non-inclusive end, so you might want last + 1.
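Putting those adjustments together, the corrected function might look like this (extract_data is the helper from the question; first and last are treated as 1-based and inclusive):
import csv

def lines(file, first, last):
    result = []
    with open(file, newline='') as csv_file:
        reader = csv.reader(csv_file)
        next(reader)                                   # skip the header row
        for line_number, row in enumerate(reader, start=1):
            if line_number in range(first, last + 1):  # inclusive of `last`
                result.append(extract_data(row))       # extract_data: the OP's helper
    return result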
Or... do away with range() entirely since < and > are sufficient, and maybe clearer:
import csv

start = 3
end = 6

with open('input.csv') as file:
    reader = csv.reader(file)
    next(reader)  # Skip header row
    for i, row in enumerate(reader, start=1):
        if i < start or i > end:
            continue
        print(row)
Consider the file testbam.txt:
/groups/cgsd/alexandre/gatk-workflows/src/exomesinglesample_out/bam/pfg001G.GRCh38DH.target.bam
/groups/cgsd/alexandre/gatk-workflows/src/exomesinglesample_out/bam/pfg002G.GRCh38DH.target.bam
/groups/cgsd/alexandre/gatk-workflows/src/exomesinglesample_out/bam/pfg014G.GRCh38DH.target.bam
and the file testbai.txt:
/groups/cgsd/alexandre/gatk-workflows/src/exomesinglesample_out/bam/pfg001G.GRCh38DH.target.bai
/groups/cgsd/alexandre/gatk-workflows/src/exomesinglesample_out/bam/pfg002G.GRCh38DH.target.bai
/groups/cgsd/alexandre/gatk-workflows/src/exomesinglesample_out/bam/pfg014G.GRCh38DH.target.bai
They always have the same number of lines, and I created a function to find that length:
def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

n = file_len('/groups/cgsd/alexandre/python_code/src/testbai.txt')
print(n)
3
Then I created two lists by opening the files and doing some manipulation:
content = []
with open('/groups/cgsd/alexandre/python_code/src/testbam.txt') as bams:
    for line in bams:
        content.append(line.strip().split())
print(content)
content2 = []
with open('/groups/cgsd/alexandre/python_code/src/testbai.txt') as bais:
    for line in bais:
        content2.append(line.strip().split())
print(content2)
Now I have a JSON-type file called mutect.json in which I would like to replace certain parts with the items of the lists:
{
    "Mutect2.gatk_docker": "broadinstitute/gatk:4.1.4.1",
    "Mutect2.intervals": "/groups/cgsd/alexandre/gatk-workflows/src/interval_list/Basic_Core_xGen_MSI_TERT_HPV_EBV_hg38.interval_list",
    "Mutect2.scatter_count": 30,
    "Mutect2.m2_extra_args": "--downsampling-stride 20 --max-reads-per-alignment-start 6 --max-suspicious-reads-per-alignment-start 6",
    "Mutect2.filter_funcotations": true,
    "Mutect2.funco_reference_version": "hg38",
    "Mutect2.run_funcotator": true,
    "Mutect2.make_bamout": true,
    "Mutect2.funco_data_sources_tar_gz": "/groups/cgsd/alexandre/gatk-workflows/mutect2/inputs/funcotator_dataSources.v1.6.20190124s.tar.gz",
    "Mutect2.funco_transcript_selection_list": "/groups/cgsd/alexandre/gatk-workflows/mutect2/inputs/transcriptList.exact_uniprot_matches.AKT1_CRLF2_FGFR1.txt",
    "Mutect2.ref_fasta": "/groups/cgsd/alexandre/gatk-workflows/src/ref_Homo38_HPV/Homo_sapiens_assembly38_chrHPV.fasta",
    "Mutect2.ref_fai": "/groups/cgsd/alexandre/gatk-workflows/src/ref_Homo38_HPV/Homo_sapiens_assembly38_chrHPV.fasta.fai",
    "Mutect2.ref_dict": "/groups/cgsd/alexandre/gatk-workflows/src/ref_Homo38_HPV/Homo_sapiens_assembly38_chrHPV.dict",
    "Mutect2.tumor_reads": "<<<N_item_of_list_content>>>",
    "Mutect2.tumor_reads_index": "<<<N_item_of_list_content2>>>",
}
Please note that this section:
"Mutect2.tumor_reads": "<<<N_item_of_list_content>>>",
"Mutect2.tumor_reads_index": "<<<N_item_of_list_content2>>>",
<<<N_item_of_list_content>>> and <<<N_item_of_list_content2>>> should be replaced by their respective items of the list and I would like to finally write the result of every modification into a new file.
The final result would be 3 files: mutect1.json with the first item from testbam.txt and the first item from testbai.txt, mutect2.json with the second item from each, and a third file following the same reasoning.
Please note that the notation I wrote, <<<N_item_of_list_content>>> and <<<N_item_of_list_content2>>>, isn't necessarily hard-coded into the file; I wrote it myself just to make clear what I would like to replace.
First, and even if it is unrelated to the question, some of your code is not really Pythonic:
def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1
You use a for loop over enumerate when you should simply do:
def file_len(fname):
    with open(fname) as f:
        return sum(1 for _ in f)
because f is an iterator over the lines of the file, so you can count them directly.
Now to your question. You want to replace some elements in a file with data found in two other files.
In your initial question, the strings were enclosed in triple angle brackets.
I would have used:
import re

rx = re.compile(r'<<<.*?>>>')  # how to identify what is to replace

with open('.../testbam.txt') as bams, open('.../testbai.txt') as bais, \
        open('.../mutect.json') as src:
    for i, reps in enumerate(zip(bams, bais), 1):    # gets a pair of replacement strings at each step
        src.seek(0)                                  # rewind src file
        with open(f'mutect{i}', 'w') as fdout:       # open the output file
            rep_index = 0                            # will first use the rep string from the first file
            for line in src:
                if rx.search(line):                  # is the string to replace there?
                    line = rx.sub(reps[rep_index].strip(), line)
                    rep_index = 1 - rep_index        # next time, use the other string
                fdout.write(line)
In the comments, you proposed to use the first line of each file as the text to find and replace it with the other lines. The code could become:
with open('.../testbam.txt') as bams, open('.../testbai.txt') as bais, \
        open('.../mutect.json') as src:
    it = iter(zip(bams, bais))
    to_find = next(it)                         # we will have to find that
    for i, reps in enumerate(it, 2):           # gets a pair of replacement strings at each step
        src.seek(0)                            # rewind src file
        with open(f'mutect{i}', 'w') as fdout: # open the output file
            for line in src:
                line = line.replace(to_find[0].strip(), reps[0].strip())  # just try to replace
                line = line.replace(to_find[1].strip(), reps[1].strip())
                fdout.write(line)
I solved this in the comments below.
So essentially what I am trying to do is add each element of a list of strings to the end of specific lines in a different file.
Hard to explain but essentially I want to parse a FASTA file, and every time it reaches a header (line.startswith('>')) I want it to replace parts of that header with an element in a list I've already made.
For example:
File1:
>seq1 unwanted here
AATATTATA
ATATATATA
>seq2 unwanted stuff here
GTGTGTGTG
GTGTGTGTG
>seq3 more stuff I don't want
ACACACACAC
ACACACACAC"
I want it to keep ">seq#" but replace everything after with the next item in the list below:
List:
mylist = ['things1', '', 'things3', 'things4', '', 'things6', 'things7']
Result (modified file1):
>seq1 things1
AATATTATA
ATATATATA
>seq2 # adds nothing here due to mylist[1] = ''
GTGTGTGTG
GTGTGTGTG
>seq3 things3
ACACACACAC
ACACACACAC
As you can see I want it to add even the blank items in the list.
So once again, I want it to parse this FASTA file, and every time it gets to a header (there are thousands), I want it to replace everything after the first word with the next item in the separate list I have made.
What you have will work, but there are a few unnecessary lines, so I've edited it down. Also, an important note: you don't close your file handles. This can cause problems, specifically when writing to a file; either way it's bad practice. Code:
#!/usr/bin/python
import sys

# gets list of annotations
def get_annos(infile):
    with open(infile, 'r') as fh:  # makes sure the file is closed properly
        annos = []
        for line in fh:
            annos.append(line.split('\t')[5])  # added tab as separator
        return annos

# replaces extra info on each header with correct annotation
def add_annos(infile1, infile2, outfile):
    annos = get_annos(infile1)  # contains list of annos
    with open(infile2, 'r') as f2, open(outfile, 'w') as output:
        for line in f2:
            if line.startswith('>'):
                line_split = [line.split()[0]]             # split line on whitespace and keep the first element in a list
                line_split.append(annos.pop(0))            # append annotation of interest to the current id line
                output.write(' '.join(line_split) + '\n')  # join and write to file with a newline character
            else:
                output.write(line)

anno = sys.argv[1]
seq = sys.argv[2]
out = sys.argv[3]

add_annos(anno, seq, out)
This is not perfect, but it cleans things up a bit. I might veer away from using pop() to associate the annotation data with the sequence IDs unless you are certain the files are in the same order every time.
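If the order is not guaranteed, one option is to key the annotations by sequence ID and look them up while writing. A rough sketch, assuming (hypothetically) that the annotation file carries the sequence ID in its first tab-separated column and the annotation text in column 5:
def get_annos_by_id(infile):
    """Map sequence ID -> annotation, so file order no longer matters."""
    annos = {}
    with open(infile) as fh:
        for line in fh:
            cols = line.rstrip('\n').split('\t')
            annos[cols[0]] = cols[5]
    return annos

# inside add_annos(), instead of annos.pop(0):
#   seq_id = line.split()[0].lstrip('>')
#   output.write('>' + seq_id + ' ' + annos.get(seq_id, '') + '\n')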
There is a great Python library for FASTA and other sequence file parsing: Biopython. It is very useful in bioinformatics, and you can manipulate the parsed data however you need.
Here is a simple example extracted from the library website:
from Bio import SeqIO

for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))
You should get something like this on your screen:
gi|2765658|emb|Z78533.1|CIZ78533
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet())
740
...
gi|2765564|emb|Z78439.1|PBZ78439
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC', SingleLetterAlphabet())
592
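Applied to the question, a rough sketch of rewriting the headers with SeqIO (file names are placeholders, it assumes mylist lines up with the records in order, and it loads all records into memory):
from Bio import SeqIO

records = list(SeqIO.parse("testseq.fasta", "fasta"))  # placeholder input name
for record, annotation in zip(records, mylist):
    record.description = annotation   # everything after the ID becomes the annotation
SeqIO.write(records, "out.fasta", "fasta")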
EDIT:
I solved this before anyone could help. This is my code; can anyone tell me if I have any bad practices? Is there a way to do this without writing everything to a new file? It seems like it would take a long time and a lot of memory.
#!/usr/bin/python
# Script takes unedited FASTA file, removed seq length and
# other header info, adds annotation after sequence name
# run as: $ python addanno.py testanno.out testseq.fasta out.txt
import sys

# gets list of annotations
def get_annos(infile):
    f = open(infile)
    list2 = []
    for line in f:
        columns = line.strip().split('\t')
        list2.append(columns[5])
    return list2

# replaces extra info on each header with correct annotation
def add_annos(infile1, infile2, outfile):
    mylist = get_annos(infile1)  # contains list of annos
    f2 = open(infile2, 'r')
    output = open(out, 'w')
    for line in f2:
        if line.startswith('>'):
            l = line.partition(" ")
            list3 = list(l)
            del list3[1:]
            list3.append(' ')
            list3.append(mylist.pop(0))
            final = ''.join(list3)
            line = line.replace(line, final)
            output.write(line)
            output.write('\n')
        else:
            output.write(line)

anno = sys.argv[1]
seq = sys.argv[2]
out = sys.argv[3]

add_annos(anno, seq, out)
get_annos(anno)
What I'm trying to do is take 4 lines at a time from a file that looks like this:
#blablabla
blablabla #this string needs to match the amount of characters in line 4
!blablabla
blablabla #there is a string here
This pattern repeats a few hundred times.
I read the entire thing line by line, make a change to the fourth line, then want to match the second line's character count to the amount in the fourth line.
I can't figure out how to "backtrack" and change the second line after making changes to the fourth.
with fileC as inputA:
    for line1 in inputA:
        line2 = next(inputA)
        line3 = next(inputA)
        line4 = next(inputA)
is what I'm currently using, because it lets me handle 4 lines at the same time, but there has to be a better way, as this causes all sorts of problems when writing the file back out. What could I use as an alternative?
you could do:
with open(filec, 'r') as f:
    lines = f.readlines()  # readlines creates a list of the lines
to access line 4 and do something with it you would access:
lines[3] # as lines is a list
and for line 2
lines[1] # etc.
You could then write your lines back into a file if you wish
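For instance, assuming "match the character count" just means padding line 2 out to the length of line 4, a minimal sketch of editing and saving could be:
target_len = len(lines[3].rstrip('\n'))                    # character count of line 4
lines[1] = lines[1].rstrip('\n').ljust(target_len) + '\n'  # pad line 2 to match
with open(filec, 'w') as f:
    f.writelines(lines)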
EDIT:
Regarding your comment, perhaps something like this:
def change_lines(fileC):
    with open(fileC, 'r') as f:
        while True:
            lines = []
            for i in range(4):
                try:
                    lines.append(next(f))  # next(f) returns the next line in the file
                except StopIteration:      # this will happen if you reach the end of the file before finding 4 more lines
                    # decide what you want to do here
                    return
            # otherwise this will happen
            lines[1] = lines[3]  # or whatever you want to do here
            # maybe write them to a new file
            # remember you're still within the while loop here
EDIT:
Since your file divides into fours evenly, this works:
def change_lines(fileC):
    with open(fileC, 'r') as f:
        while True:
            lines = []
            for i in range(4):
                try:
                    lines.append(next(f))
                except StopIteration:
                    return
            # do something with lines here
            # and write to a new file, etc.
Another way to do it:
import sys
from itertools import islice

def read_in_chunks(file_path, n):
    with open(file_path) as fh:
        while True:
            lines = list(islice(fh, n))
            if lines: yield lines
            else: break

for lines in read_in_chunks(sys.argv[1], 4):
    print(lines)
Also relevant is the grouper() recipe from the itertools documentation. In that case, you would need to filter out the None padding values before handing the lines to the caller.
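A sketch along those lines (the recipe pads the final chunk with None via zip_longest, so we drop the padding before using the lines):
import sys
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    """Collect data into fixed-length chunks (itertools recipe)."""
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

with open(sys.argv[1]) as fh:
    for chunk in grouper(fh, 4):
        lines = [line for line in chunk if line is not None]  # filter out the padding
        print(lines)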
You could read the file with .readlines, then index whichever line you want to change, and write the lines back to the file:
rf = open('/path/to/file')
file_lines = rf.readlines()
rf.close()

file_lines[1] = file_lines[3]  # trim/edit however you'd like

wf = open('/path/to/write', 'w')
wf.writelines(file_lines)
wf.close()
How can I skip the header row and start reading a file from line 2?
with open(fname) as f:
    next(f)
    for line in f:
        # do something
f = open(fname,'r')
lines = f.readlines()[1:]
f.close()
If you want to read the first line and then perform some operation on the rest of the file, this code will be helpful.
with open(filename, 'r') as f:
    first_line = f.readline()
    for line in f:
        # Perform some operations
Slicing doesn't work directly on iterators, but itertools.islice does the same job:
from itertools import islice

with open(fname) as f:
    for line in islice(f, 1, None):
        pass
f = open(fname).readlines()
firstLine = f.pop(0)  # removes the first line
for line in f:
    ...
To generalize the task of reading multiple header lines and to improve readability, I'd use method extraction. Suppose you wanted to tokenize the first two lines of coordinates.txt to use as header information.
Example
coordinates.txt
---------------
Name,Longitude,Latitude,Elevation, Comments
String, Decimal Deg., Decimal Deg., Meters, String
Euler's Town,7.58857,47.559537,0, "Blah"
Faneuil Hall,-71.054773,42.360217,0
Yellowstone National Park,-110.588455,44.427963,0
Then method extraction allows you to specify what you want to do with the header information (in this example we simply tokenize the header lines based on the comma and return it as a list but there's room to do much more).
def __readheader(filehandle, numberheaderlines=1):
    """Reads the specified number of lines and returns the comma-delimited
    strings on each line as a list"""
    for _ in range(numberheaderlines):
        yield [field.strip() for field in filehandle.readline().split(',')]

with open('coordinates.txt', 'r') as rh:
    # Single header line
    # print(next(__readheader(rh)))
    # Multiple header lines
    for headerline in __readheader(rh, numberheaderlines=2):
        print(headerline)  # Or do other stuff with headerline tokens
Output
['Name', 'Longitude', 'Latitude', 'Elevation', 'Comments']
['String', 'Decimal Deg.', 'Decimal Deg.', 'Meters', 'String']
If coordinates.txt contains another header line, simply change numberheaderlines. Best of all, it's clear what __readheader(rh, numberheaderlines=2) is doing, and we avoid the ambiguity of having to figure out or comment on why the author of the accepted answer uses next() in his code.
If you want to read multiple CSV files starting from line 2, this works like a charm
import csv

for files in csv_file_list:
    with open(files, 'r') as r:
        next(r)  # skip headers
        rr = csv.reader(r)
        for row in rr:
            # do something
(this is part of Parfait's answer to a different question)
# Open a connection to the file
with open('world_dev_ind.csv') as file:

    # Skip the column names
    file.readline()

    # Initialize an empty dictionary: counts_dict
    counts_dict = {}

    # Process only the first 1000 rows
    for j in range(0, 1000):

        # Split the current line into a list: line
        line = file.readline().split(',')

        # Get the value for the first column: first_col
        first_col = line[0]

        # If the column value is in the dict, increment its value
        if first_col in counts_dict.keys():
            counts_dict[first_col] += 1

        # Else, add to the dict and set value to 1
        else:
            counts_dict[first_col] = 1

# Print the resulting dictionary
print(counts_dict)