Python line.split between two delimiters - python

I have a text file that contains the following data:
Schema:
Column Name Localized Name Type MaxLength
---------------------------- ---------------------------- ------ ---------
Raw Binary Binary 16384
Row 1:
Binary:
-----BEGIN-----
fdsfdsfdasadsad
fsdfafsdafsadfa
fsdafadsfadsfdsa
-----END-----
Row 2:
Binary:
-----BEGIN-----
fsdfdssd
fdsfadsfasd
fsdafdsa
-----END-----
Row 3:
Binary:
-----BEGIN-----
fsdafadsds
fsdafasdsda
fdsafadssad
-----END-----
I need to extract the data between the "-----BEGIN-----" and "-----END-----" delimiters into an array.
This is what I've tried:
data = open("test_data.txt", 'r')
result = [line.split('-----BEGIN-----') for line in data.readlines()]
print data
However this obviously gets all of the data after the '-----BEGIN-----' delimiter.
How can I add the end delimiter?
Note that the file is quite large, around 1GB.

If there are multiple lines between the delimiters and you want the data separated into sections, catch each line equal to -----BEGIN----- and keep adding lines until you reach -----END-----:
with open("file.txt") as f:
    out = []
    for line in f:
        if line.rstrip() == "-----BEGIN-----":
            tmp = []
            for line in f:
                if line.rstrip() == "-----END-----":
                    out.append(tmp)
                    break
                tmp.append(line)
The sections will be split into sublists:
[['fdsfdsfdasadsad\n', 'fsdfafsdafsadfa\n', 'fsdafadsfadsfdsa\n'], ['fsdfdssd\n', 'fdsfadsfasd\n', 'fsdafdsa \n'], ['fsdafadsds\n', 'fsdafasdsda\n', 'fdsafadssad\n']]
Use with to open your files, and don't call readlines unless you want a list; you can iterate over the file object as above without storing all the content in memory.
Or use itertools.takewhile to get the sections:
from itertools import takewhile, imap
with open("file.txt") as f:
    f = imap(str.rstrip, f)  # use map for python3
    out = [list(takewhile(lambda x: x != "-----END-----", f)) for line in f if line == "-----BEGIN-----"]
print(out)
[['fdsfdsfdasadsad', 'fsdfafsdafsadfa', 'fsdafadsfadsfdsa'],
['fsdfdssd', 'fdsfadsfasd', 'fsdafdsa'],
['fsdafadsds', 'fsdafasdsda', 'fdsafadssad']]
If you want a single list of all the words you can chain:
from itertools import takewhile, chain, imap
with open("file.txt") as f:
    f = imap(str.rstrip, f)
    out = chain.from_iterable(takewhile(lambda x: x != "-----END-----", f) for line in f if line == "-----BEGIN-----")
    print(list(out))
['fdsfdsfdasadsad', 'fsdfafsdafsadfa', 'fsdafadsfadsfdsa',
'fsdfdssd', 'fdsfadsfasd', 'fsdafdsa', 'fsdafadsds', 'fsdafasdsda', 'fdsafadssad']
A file object returns its own iterator, so every time we iterate or call takewhile we consume lines. takewhile keeps taking lines until it hits -----END-----, and then we continue iterating until we hit another -----BEGIN----- line. If the delimiter lines always start with - and no other lines do, you can check just that condition, i.e. if line[0] == "-" and x[0] != "-", instead of comparing the full line.
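The shared-iterator behaviour described above can be seen without a file. The following is a minimal Python 3 sketch that uses the built-in map in place of imap; the sample lines and the names lines, stripped and blocks are made up for illustration.

```python
from itertools import takewhile

# Hypothetical sample standing in for the lines of the file.
lines = iter([
    "Row 1:\n", "-----BEGIN-----\n", "aaa\n", "bbb\n", "-----END-----\n",
    "Row 2:\n", "-----BEGIN-----\n", "ccc\n", "-----END-----\n",
])

stripped = map(str.rstrip, lines)  # lazy in Python 3, like imap in Python 2

blocks = []
for line in stripped:
    if line == "-----BEGIN-----":
        # takewhile pulls from the same iterator as the for loop, so the
        # loop resumes on the line after the consumed "-----END-----".
        blocks.append(list(takewhile(lambda x: x != "-----END-----", stripped)))

print(blocks)  # [['aaa', 'bbb'], ['ccc']]
```

Because both the loop and takewhile advance the same iterator, no line is read twice and nothing beyond the current block is held in memory.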
If you wanted to process each section you could use a generator expression and work on the lines from each section:
with open("file.txt") as f:
    f = imap(str.rstrip, f)
    out = ((takewhile(lambda x: x != "-----END-----", f)) for line in f if line == "-----BEGIN-----")
    for sec in out:
        print(list(sec))
['fdsfdsfdasadsad', 'fsdfafsdafsadfa', 'fsdafadsfadsfdsa']
['fsdfdssd', 'fdsfadsfasd', 'fsdafdsa']
['fsdafadsds', 'fsdafasdsda', 'fdsafadssad']
If you want a single string call join:
with open("file.txt") as f:
    f = imap(str.rstrip, f)
    st, end = "-----BEGIN-----", "-----END-----"
    out = "".join(chain.from_iterable(takewhile(lambda x: x != end, f)
                                      for line in f if line == st))
print(out)
Output:
fdsfdsfdasadsadfsdfafsdafsadfafsdafadsfadsfdsafsdfdssdfdsfadsfasdfsdafdsafsdafadsdsfsdafasdsdafdsafadssad
To get a single string keeping -----BEGIN----- and -----END-----:
with open("out.txt") as f:
    f = imap(str.rstrip, f)
    st, end = "-----BEGIN-----", "-----END-----"
    out = "".join(["{}{}{}".format(st, "".join(takewhile(lambda x: x != end, f)), end)
                   for line in f if line == st])
print(out)
Output:
-----BEGIN-----fdsfdsfdasadsadfsdfafsdafsadfafsdafadsfadsfdsa-----END----------BEGIN-----fsdfdssdfdsfadsfasdfsdafdsa-----END----------BEGIN-----fsdafadsdsfsdafasdsdafdsafadssad-----END-----

Try this:
array1 = []
with open('test_data.txt', 'r') as infile:
    copy = False
    for line in infile:
        if line.strip() == "-----BEGIN-----":
            copy = True
        elif line.strip() == "-----END-----":
            copy = False
        elif copy:
            array1.append(line)
This collects every line found between the delimiters into array1.
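The flag approach puts every line into one flat list. If the blocks need to stay separate and the file is too large to hold at once, the same idea can be written as a generator. This is only a sketch: iter_blocks and its parameters are hypothetical names, and the sample list stands in for an open file.

```python
def iter_blocks(lines, begin="-----BEGIN-----", end="-----END-----"):
    """Yield each delimited block as a list of stripped lines."""
    block = None  # None means we are currently outside a block
    for line in lines:
        text = line.strip()
        if text == begin:
            block = []
        elif text == end and block is not None:
            yield block
            block = None
        elif block is not None:
            block.append(text)

# Hypothetical sample; with a real file, pass the open file object instead.
sample = ["Row 1:\n", "-----BEGIN-----\n", "aaa\n", "-----END-----\n",
          "Row 2:\n", "-----BEGIN-----\n", "bbb\n", "ccc\n", "-----END-----\n"]
result = list(iter_blocks(sample))
print(result)  # [['aaa'], ['bbb', 'ccc']]
```

Iterating over iter_blocks(f) directly, rather than calling list() on it, keeps only one block in memory at a time.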

If your file is small enough to load the whole thing into memory, then using a Regular Expression (aka regex) is probably the best approach.
import re
beginstr = '\n-----BEGIN-----\n'
endstr = '-----END-----\n'
pat = re.compile(beginstr + '(.*?\n)' + endstr, re.DOTALL)
with open('test_data.txt', 'r') as f:
    data = f.read()
result = pat.findall(data)
for row in result:
    print(repr(row))
Output:
'fdsfdsfdasadsad\nfsdfafsdafsadfa\nfsdafadsfadsfdsa\n'
'fsdfdssd\nfdsfadsfasd\nfsdafdsa \n'
'fsdafadsds\nfsdafasdsda\nfdsafadssad\n'
That code creates a compiled regex pattern; it's not strictly necessary in this case, since we're only using the pattern once, but it does make the code look neater, IMHO.
That regex looks for substrings delimited by 'beginstr' and '\n' + endstr. The findall call only captures the stuff between those delimiters, due to use of the grouping parentheses. I've put a '\n' inside those parentheses so that the captured substrings will always have a trailing newline.
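To illustrate why the pattern uses the non-greedy .*? together with re.DOTALL, here is a small sketch on an in-memory string. The sample data and the name greedy are made up for the comparison; pat is built the same way as in the answer.

```python
import re

# Hypothetical in-memory sample; a real file would be read with f.read().
data = (
    "Row 1:\nBinary:\n-----BEGIN-----\naaa\nbbb\n-----END-----\n"
    "Row 2:\nBinary:\n-----BEGIN-----\nccc\n-----END-----\n"
)

# Non-greedy: stops at the first end marker after each begin marker.
pat = re.compile(r"\n-----BEGIN-----\n(.*?\n)-----END-----\n", re.DOTALL)
# Greedy variant, for comparison only: runs on to the last end marker.
greedy = re.compile(r"\n-----BEGIN-----\n(.*\n)-----END-----\n", re.DOTALL)

print(pat.findall(data))          # ['aaa\nbbb\n', 'ccc\n']
print(len(greedy.findall(data)))  # 1
```

With the greedy form the single match swallows everything from the first begin marker to the last end marker, which is why the lazy quantifier matters here.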

You can use itertools.ifilter :
from itertools import ifilter
with open('a1.txt') as f, open('a1.txt') as g:
    f.next()
    it = f
    print [i.strip() for i in ifilter(lambda x: next(f).strip() == '-----END-----', g)]
result :
['fdsfdsfdasadsad', 'fsdfdssd', 'fsdafadsds']
If the file is not huge use re.findall :
>>> re.findall('-----BEGIN-----\n(.*?)\n-----END-----',open('file_name').read(),re.M|re.DOTALL)
['fdsfdsfdasadsad', 'fsdfdssd', 'fsdafadsds']
Or, without itertools, you can use the following recipe:
with open('a1.txt') as f, open('a1.txt') as g:
    next(f)  # advance f so it is always one line ahead of g
    for line in g:
        try:
            n = next(f)
        except StopIteration:
            break
        if n.strip() == '-----END-----':
            print(line)
result :
fdsfdsfdasadsad
fsdfdssd
fsdafadsds
Note that a file object is an iterator, so you can get its next item with the next function. Here f is kept one line ahead of g, so on each iteration we compare the (stripped) line that follows the current one with '-----END-----' and print the current line if they are equal.

split alone is just fine, no need for other tools. Just also split off the end marker and everything after it:
with open("file.txt") as f:
    blocks = [part.split('-----END-----')[0].strip()
              for part in f.read().split('-----BEGIN-----')[1:]]

Related

Create as many files as the number of items from two lists with the same number of items in Python

Consider the file testbam.txt:
/groups/cgsd/alexandre/gatk-workflows/src/exomesinglesample_out/bam/pfg001G.GRCh38DH.target.bam
/groups/cgsd/alexandre/gatk-workflows/src/exomesinglesample_out/bam/pfg002G.GRCh38DH.target.bam
/groups/cgsd/alexandre/gatk-workflows/src/exomesinglesample_out/bam/pfg014G.GRCh38DH.target.bam
and the file testbai.txt:
/groups/cgsd/alexandre/gatk-workflows/src/exomesinglesample_out/bam/pfg001G.GRCh38DH.target.bai
/groups/cgsd/alexandre/gatk-workflows/src/exomesinglesample_out/bam/pfg002G.GRCh38DH.target.bai
/groups/cgsd/alexandre/gatk-workflows/src/exomesinglesample_out/bam/pfg014G.GRCh38DH.target.bai
They always have the same number of lines, and I created a function to find it:
def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i+1
n = file_len('/groups/cgsd/alexandre/python_code/src/testbai.txt')
print(n)
3
Then I created two lists by opening the files and doing some manipulation:
content = []
with open('/groups/cgsd/alexandre/python_code/src/testbam.txt') as bams:
    for line in bams:
        content.append(line.strip().split())
print(content)
content2 = []
with open('/groups/cgsd/alexandre/python_code/src/testbai.txt') as bais:
    for line in bais:
        content2.append(line.strip().split())
print(content2)
Now I have a JSON file called mutect.json in which I would like to replace certain parts with the items of the lists:
{
"Mutect2.gatk_docker": "broadinstitute/gatk:4.1.4.1",
"Mutect2.intervals": "/groups/cgsd/alexandre/gatk-workflows/src/interval_list/Basic_Core_xGen_MSI_TERT_HPV_EBV_hg38.interval_list",
"Mutect2.scatter_count": 30,
"Mutect2.m2_extra_args": "--downsampling-stride 20 --max-reads-per-alignment-start 6 --max-suspicious-reads-per-alignment-start 6",
"Mutect2.filter_funcotations": true,
"Mutect2.funco_reference_version": "hg38",
"Mutect2.run_funcotator": true,
"Mutect2.make_bamout": true,
"Mutect2.funco_data_sources_tar_gz": "/groups/cgsd/alexandre/gatk-workflows/mutect2/inputs/funcotator_dataSources.v1.6.20190124s.tar.gz",
"Mutect2.funco_transcript_selection_list": "/groups/cgsd/alexandre/gatk-workflows/mutect2/inputs/transcriptList.exact_uniprot_matches.AKT1_CRLF2_FGFR1.txt",
"Mutect2.ref_fasta": "/groups/cgsd/alexandre/gatk-workflows/src/ref_Homo38_HPV/Homo_sapiens_assembly38_chrHPV.fasta",
"Mutect2.ref_fai": "/groups/cgsd/alexandre/gatk-workflows/src/ref_Homo38_HPV/Homo_sapiens_assembly38_chrHPV.fasta.fai",
"Mutect2.ref_dict": "/groups/cgsd/alexandre/gatk-workflows/src/ref_Homo38_HPV/Homo_sapiens_assembly38_chrHPV.dict",
"Mutect2.tumor_reads": "<<<N_item_of_list_content>>>",
"Mutect2.tumor_reads_index": "<<<N_item_of_list_content2>>>",
}
Please note this section:
"Mutect2.tumor_reads": "<<<N_item_of_list_content>>>",
"Mutect2.tumor_reads_index": "<<<N_item_of_list_content2>>>",
<<<N_item_of_list_content>>> and <<<N_item_of_list_content2>>> should be replaced by the respective items of the lists, and I would like to write the result of every modification into a new file.
The final result would be 3 files: mutect1.json with the first item from testbam.txt and the first item from testbai.txt, mutect2.json with the second item from each, and a third file built the same way.
Please note that the notation <<<N_item_of_list_content>>> and <<<N_item_of_list_content2>>> isn't necessarily hard-coded into the file; I wrote it myself just to make clear what I would like to replace.
First, even though it is unrelated to the question, some of your code is not really Pythonic:
def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i+1
Instead of a for loop over enumerate, you can simply do:
def file_len(fname):
    with open(fname) as f:
        return sum(1 for _ in f)
because f is an iterator over the lines of the file (a file object has no len(), so len(f) would raise a TypeError). This version also returns 0 for an empty file instead of failing.
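Since a file object is an iterator and does not support len(), counting its lines means consuming it. A minimal sketch of that, with io.StringIO standing in for an open text file and count_lines as a hypothetical helper name:

```python
import io

def count_lines(lines):
    """Count the items of any iterable of lines without storing them."""
    return sum(1 for _ in lines)

# io.StringIO stands in for an open text file in this sketch.
fake_file = io.StringIO("a.bam\nb.bam\nc.bam\n")

try:
    len(fake_file)
    has_len = True
except TypeError:
    has_len = False  # file-like objects do not define __len__

n = count_lines(fake_file)
print(has_len, n)  # False 3
```

The generator expression reads one line at a time, so this works for large files and returns 0 for an empty one.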
Now to your question. You want to replace some elements in a file with data found in two other files.
In your initial question, the strings to replace were enclosed in triple angle brackets.
I would have used:
import re
rx = re.compile(r'<<<.*?>>>')  # how to identify what is to replace
with open('.../testbam.txt') as bams, open('.../testbai.txt') as bais, \
        open('.../mutect.json') as src:
    for i, reps in enumerate(zip(bams, bais), 1):  # gets a pair of replacement strings at each step
        reps = [rep.strip() for rep in reps]  # drop the trailing newlines
        src.seek(0)  # rewind src file
        with open(f'mutect{i}.json', 'w') as fdout:  # open the output file
            rep_index = 0  # will first use rep string from first file
            for line in src:
                if rx.search(line):  # is the string to replace there?
                    line = rx.sub(reps[rep_index], line)
                    rep_index = 1 - rep_index  # next time will use the other string
                fdout.write(line)
In the comments, you proposed that the JSON file contains the first line of each list, to be replaced by each of the other lines in turn. The code could become:
with open('.../testbam.txt') as bams, open('.../testbai.txt') as bais, \
        open('.../mutect.json') as src:
    it = ((bam.strip(), bai.strip()) for bam, bai in zip(bams, bais))
    to_find = next(it)  # the pair we will have to find in the JSON file
    for i, reps in enumerate(it, 2):  # gets a pair of replacement strings at each step
        src.seek(0)  # rewind src file
        with open(f'mutect{i}.json', 'w') as fdout:  # open the output file
            for line in src:
                line = line.replace(to_find[0], reps[0])  # just try to replace
                line = line.replace(to_find[1], reps[1])
                fdout.write(line)
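An alternative to text substitution is to treat the file as JSON and set the two keys directly. This is a sketch under assumptions, not what the answer above does: build_configs is a hypothetical helper, and the template and paths are shortened stand-ins. Note also that the JSON shown in the question has a trailing comma after its last entry, which json.load would reject, so that comma would have to be removed first.

```python
import json

def build_configs(template, bam_paths, bai_paths):
    """Return one config dict per (bam, bai) pair, leaving template untouched."""
    configs = []
    for bam, bai in zip(bam_paths, bai_paths):
        config = dict(template)  # a shallow copy is enough for flat values
        config["Mutect2.tumor_reads"] = bam.strip()
        config["Mutect2.tumor_reads_index"] = bai.strip()
        configs.append(config)
    return configs

# Hypothetical, shortened stand-ins for the real template and path lists.
template = {"Mutect2.scatter_count": 30,
            "Mutect2.tumor_reads": "",
            "Mutect2.tumor_reads_index": ""}
bams = ["/data/s1.bam\n", "/data/s2.bam\n"]
bais = ["/data/s1.bai\n", "/data/s2.bai\n"]

configs = build_configs(template, bams, bais)
for i, config in enumerate(configs, 1):
    text = json.dumps(config, indent=2)  # this text would go into mutect{i}.json
    print(f"mutect{i}.json:", text)
```

In real use the template would come from json.load on the cleaned-up file, and each text would be written with open(f"mutect{i}.json", "w").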

Changing the contents of a text file and making a new file with same format

I have a big text file with a lot of parts. Every part has 4 lines and next part starts immediately after the last part.
The first line of each part starts with #, the 2nd line is a sequence of characters, the 3rd line is a + and the 4th line is again a sequence of characters.
Small example:
#M00872:462:000000000-D47VR:1:1101:15294:1338 1:N:0:ACATCG
TGCTCGGTGTATGTAAACTTCCGACTTCAACTGTATAGGGATCCAATTTTGACAAAATATTAACGCTTATCGATAAAATTTTGAATTTTGTAACTTGTTTTTGTAATTCTTTAGTTTGTATGTCTGTTGCTATTATGTCTACTATTCTTTCCCCTGCACTGTACCCCCCAATCCCCCCTTTTCTTTTAAAAGTTAACCGATACCGTCGAGATCCGTTCACTAATCGAACGGATCTGTCTCTGTCTCTCTC
+
BAABBADBBBFFGGGGGGGGGGGGGGGHHGHHGH55FB3A3GGH3ADG5FAAFEGHHFFEFHD5AEG1EF511F1?GFH3#BFADGD55F?#GFHFGGFCGG/GHGHHHHHHHDBG4E?FB?BGHHHHHHHHHHHHHHHHHFHHHHHHHHHGHGHGHHHHHFHHHHHGGGGHHHHGGGGHHHHHHHGHGHHHHHHFGHCFGGGHGGGGGGGGFGGEGBFGGGGGGGGGFGGGGFFB9/BFFFFFFFFFF/
I want to change the 2nd and the 4th line of each part and make a new file with similar structure (4 lines for each part). In fact I want to keep the 1st 65 characters (in lines 2 and 4) and remove the rest of characters. The expected output for the small example would look like this:
#M00872:462:000000000-D47VR:1:1101:15294:1338 1:N:0:ACATCG
TGCTCGGTGTATGTAAACTTCCGACTTCAACTGTATAGGGATCCAATTTTGACAAAATATTAACG
+
BAABBADBBBFFGGGGGGGGGGGGGGGHHGHHGH55FB3A3GGH3ADG5FAAFEGHHFFEFHD5A
I wrote the following code:
infile = open("file.fastq", "r")
new_line=[]
for line_number in len(infile.readlines()):
    if line_number ==2 or line_number ==4:
        new_line.append(infile[line_number])
with open('out_file.fastq', 'w') as f:
    for item in new_line:
        f.write("%s\n" % item)
but it does not return what I want. How to fix it to get the expected output?
This code will achieve what you want:
from itertools import islice
with open('bio.txt', 'r') as infile, open('mod_bio.txt', 'w') as f:
    while True:
        lines = list(islice(infile, 4))  # the next 4 lines of the file
        if not lines:
            break
        a, b, c, d = lines
        b = b.rstrip('\n')[:65] + '\n'
        d = d.rstrip('\n')[:65] + '\n'
        f.write(a + b + c + d)
How does it work?
We first use islice to read 4 lines at a time, as you describe.
Then we unpack them into the individual lines a, b, c, d and slice b and d. Finally we join the four lines and write them to the new file.
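The four-lines-at-a-time idea can be tried without touching the disk. In this sketch io.StringIO objects stand in for the input and output files, and truncate_records and keep are hypothetical names chosen for illustration.

```python
import io
from itertools import islice

def truncate_records(infile, outfile, keep=65):
    """Copy 4-line records, trimming lines 2 and 4 to `keep` characters."""
    while True:
        record = list(islice(infile, 4))  # next 4 lines, or fewer at EOF
        if len(record) < 4:
            break  # end of input (a trailing partial record is dropped)
        header, seq, plus, qual = (line.rstrip("\n") for line in record)
        outfile.write("\n".join([header, seq[:keep], plus, qual[:keep]]) + "\n")

# io.StringIO objects stand in for the real files; the record is made up.
src = io.StringIO("#read1\n" + "A" * 70 + "\n+\n" + "B" * 70 + "\n")
dst = io.StringIO()
truncate_records(src, dst)
out_lines = dst.getvalue().splitlines()
print([len(line) for line in out_lines])  # [6, 65, 1, 65]
```

Stripping the newline before slicing means lines shorter than 65 characters do not end up with a doubled line break.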
I think some itertools.cycle could be nice here:
import itertools
with open("transformed.file.fastq", "w+") as output_file:
    with open("file.fastq", "r") as input_file:
        for i in itertools.cycle((1,2,3,4)):
            line = input_file.readline().strip()
            if not line:
                break
            if i in (2,4):
                line = line[:65]
            output_file.write("{}\n".format(line))
readlines() returns a list of the lines in your file, so you don't need to build a separate new_line list. Iterate over the index-value pairs of that list and modify the values at the positions you want.
Modifying your code, try this:
with open("file.fastq", "r") as infile:
    new_lines = infile.readlines()
for i, t in enumerate(new_lines):
    if i % 4 in (1, 3):  # 2nd and 4th line of every 4-line part
        new_lines[i] = t.rstrip('\n')[:65] + '\n'
with open('out_file.fastq', 'w') as f:
    for item in new_lines:
        f.write(item)

how to count empty lines in python file

I would like to print the total number of empty lines using Python. I have been trying:
f = open('file.txt','r')
for line in f:
    if (line.split()) == 0:
but I am not able to get the proper output.
I have also tried the following, but it prints 0 and I am not sure what is wrong with the code:
print "\nblank lines are",(sum(line.isspace() for line in fname))
It prints:
blank lines are 0
There are 7 lines in the file.
There are 46 characters in the file.
There are 8 words in the file.
Since the empty string is a falsy value, you may use .strip():
for line in f:
    if not line.strip():
        ....
The above also counts lines that contain only whitespace as empty.
If you want only completely empty lines, you may want to use this instead:
if line in ['\r\n', '\n']:
    ...
Please use a context manager (with statement) to open files:
with open('file.txt') as f:
    print(sum(line.isspace() for line in f))
line.isspace() returns True (== 1) if line doesn't have any non-whitespace characters, and False (== 0) otherwise. Therefore, sum(line.isspace() for line in f) returns the number of lines that are considered empty.
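The difference between whitespace-only lines and strictly empty lines can be checked on a small in-memory example. The content below is made up, and io.StringIO stands in for an open text file. One possible reason for a count of 0, offered only as a guess, is that the file had already been read to the end by earlier code, in which case iterating over it again yields nothing.

```python
import io

# io.StringIO stands in for an open text file; the content is made up.
f = io.StringIO("first\n\n   \nsecond\n\t\nthird\n")

lines = list(f)
blank = sum(line.isspace() for line in lines)                    # whitespace-only lines
strictly_empty = sum(line in ("\n", "\r\n") for line in lines)   # nothing but the line break

leftover = sum(1 for _ in f)  # the file is exhausted, so a second pass sees no lines

print(blank, strictly_empty, leftover)  # 3 1 0
```

Calling f.seek(0) before the second pass would rewind the file and make the lines available again.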
line.split() always returns a list. Both
if line.split() == []:
and
if not line.split():
would work.
FILE_NAME = 'file.txt'
empty_line_count = 0
with open(FILE_NAME,'r') as fh:
    for line in fh:
        # split() breaks the line into a list of words. If the line is
        # empty or contains only whitespace, split() returns an empty list,
        # which is what the comparison with [] checks for.
        if line.split() == []:
            empty_line_count += 1
print('Empty Line Count : ' , empty_line_count)

Printing specific lines txt file python

I have a text file I wish to analyze. I'm trying to find every line that contains certain characters (ex: "#") and then print the line located 3 lines before it (ex: if line 5 contains "#", I would like to print line 2)
This is what I got so far:
file = open('new_file.txt', 'r')
a = list()
x = 0
for line in file:
    x = x + 1
    if '#' in line:
        a.append(x)
        continue
x = 0
for index, item in enumerate(a):
    for line in file:
        x = x + 1
        d = a[index]
        if x == d - 3:
            print line
            continue
It won't work (it prints nothing when I feed it a file that has lines containing "#"), any ideas?
First, you are going through the file multiple times without re-opening (or rewinding) it for the subsequent passes. That means all subsequent attempts to iterate over the file terminate immediately without reading anything.
Second, your indexing logic is a little convoluted. Assuming your files are not huge relative to your memory size, it is much easier to simply read the whole file into memory (as a list) and manipulate it there.
with open('new_file.txt', 'r') as myfile:
    a = myfile.readlines()
for index, item in enumerate(a):
    if '#' in item and index - 3 >= 0:
        print(a[index - 3].strip())
This has been tested on the following input:
PrintMe
PrintMe As Well
Foo
#Foo
Bar#
hello world will print
null
null
##
OK, the issue is that you have already iterated completely through the file object file in line 4 when you try again in line 11, so the loop in line 11 never runs. It may be a better idea to iterate over the file only once and remember the last few lines:
file = open('new_file.txt', 'r')
a = ["","",""]
for line in file:
    if "#" in line:
        print(a[0], end="")
    a.append(line)
    a = a[1:]
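The "remember the last few lines" idea can also be expressed with collections.deque, which discards the oldest entry automatically once maxlen is reached. This is a sketch with hypothetical names (lines_three_before, marker, offset) and a made-up sample list in place of the file.

```python
from collections import deque

def lines_three_before(lines, marker="#", offset=3):
    """Yield the line `offset` lines before each line containing `marker`."""
    history = deque(maxlen=offset)  # keeps only the last `offset` lines
    for line in lines:
        if marker in line and len(history) == offset:
            yield history[0]  # the oldest remembered line
        history.append(line)

# Hypothetical sample; with a real file, pass the open file object instead.
sample = ["one\n", "two\n", "three\n", "#four\n", "five#\n", "six\n"]
found = list(lines_three_before(sample))
print(found)  # ['one\n', 'two\n']
```

The length check skips markers that appear in the first three lines, where there is no line three positions back to report.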
For this kind of file processing, using a regular expression to match the pattern, combined with iterating over the lines of the file, is usually efficient in both programmer time and run time, and it makes your problem straightforward.
import re
with open('new_file.txt', 'r') as file:
    document = file.read()
lines = document.split("\n")
LinesOfInterest = []
for lineNumber, line in enumerate(lines):
    WhereItsAt = re.search(r'#', line)
    if lineNumber > 2 and WhereItsAt:
        LinesOfInterest.append(lineNumber - 3)
print(LinesOfInterest)
for lineNumber in LinesOfInterest:
    print(lines[lineNumber])
LinesOfInterest is now a list of the line numbers matching your criteria.
I used
line1,0
line2,0
line3,0
#
line1,1
line2,1
line3,1
#
line1,2
line2,2
line3,2
#
line1,3
line2,3
line3,3
#
as input yielding
[0, 4, 8, 12]
line1,0
line1,1
line1,2
line1,3

How would I read only the first word of each line of a text file?

I wanted to know how I could read ONLY the FIRST WORD of each line in a text file. I tried various codes and tried altering codes but can only manage to read whole lines from a text file.
The code I used is as shown below:
QuizList = []
with open('Quizzes.txt','r') as f:
    for line in f:
        QuizList.append(line)
line = QuizList[0]
for word in line.split():
    print(word)
This refers to an attempt to extract only the first word from the first line. In order to repeat the process for every line I would do the following:
QuizList = []
with open('Quizzes.txt','r') as f:
    for line in f:
        QuizList.append(line)
capacity = len(QuizList)
capacity = capacity-1
index = 0
while index!=capacity:
    line = QuizList[index]
    for word in line.split():
        print(word)
    index = index+1
You are using split at the wrong point. Try:
for line in f:
    QuizList.append(line.split(None, 1)[0]) # add only first word
Changed to a one-liner that's also more efficient with the strip as Jon Clements suggested in a comment.
with open('Quizzes.txt', 'r') as f:
    wordlist = [line.split(None, 1)[0] for line in f]
This is pretty irrelevant to your question, but just so the line.split(None, 1) doesn't confuse you, it's a bit more efficient because it only splits the line 1 time.
From the str.split([sep[, maxsplit]]) docs
If sep is not specified or is None, a different splitting algorithm is
applied: runs of consecutive whitespace are regarded as a single
separator, and the result will contain no empty strings at the start
or end if the string has leading or trailing whitespace. Consequently,
splitting an empty string or a string consisting of just whitespace
with a None separator returns [].
' 1 2 3 '.split() returns ['1', '2', '3']
and
' 1 2 3 '.split(None, 1) returns ['1', '2 3 '].
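As the quoted documentation says, splitting a blank line with a None separator returns [], so indexing [0] on it would raise an IndexError. A small sketch that skips such lines, with io.StringIO standing in for the file and made-up quiz names:

```python
import io

# io.StringIO stands in for the open file; the quiz names are made up.
f = io.StringIO("Maths quiz one\n\n  Science   quiz two\nHistory\n")

# split(None, 1) returns [] for blank lines, so filter those out
# before indexing to avoid an IndexError.
first_words = [parts[0] for parts in (line.split(None, 1) for line in f) if parts]

print(first_words)  # ['Maths', 'Science', 'History']
```

Leading whitespace is ignored by the None separator, so the indented second entry still yields its first word.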
with open(filename, "r") as f:
    wordlist = [r.split()[0] for r in f]
I'd go for str.split and similar approaches, but for completeness here's one that uses a combination of mmap and re, in case you need to extract more complicated data:
import mmap, re
with open('quizzes.txt', 'rb') as fin:
    mf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    wordlist = re.findall(br'^(\w+)', mf, flags=re.M)  # bytes pattern, because mmap exposes bytes
You could read one character at a time:
import string
QuizList = []
with open('Quizzes.txt','r') as f:
    for line in f:
        for i, c in enumerate(line):
            if c not in string.ascii_letters:
                print(line[:i])
                break
l = []
with open('task-1.txt', 'rt') as myfile:
    for x in myfile:
        l.append(x)
for i in l:
    print(i.split()[0])
