I have a large text file with over 200 million lines. It is split into blocks of approximately 50,000 lines. What I need to do is replace lines 10-100 of every block with lines 10-100 from the first block. Any ideas how to go about this?
Thanks in advance
Use a list. First read the lines you want to reuse from the first block into a list. Next, read each other file in turn line by line and write the lines out to a new file, but if the line number is between 10 and 100, use the line from your list. Example that achieves your goal:
fnames = ["file1.txt", "file2.txt", "file3.txt"]
sub_list_start = 9    # 0-indexed: line 10
sub_list_end = 100    # 0-indexed: up to and including line 100

# Collect lines 10-100 from the first file.
file1_line_10_to_100 = []
with open(fnames[0]) as f:
    for i, line in enumerate(f):
        if i >= sub_list_start and i < sub_list_end:
            file1_line_10_to_100.append(line)
        if i >= sub_list_end:
            break

# Copy every other file, substituting lines 10-100.
for fname in fnames[1:]:
    with open(fname) as f, open(fname + '.new', 'w') as f_out:
        for i, line in enumerate(f):
            if i >= sub_list_start and i < sub_list_end:
                f_out.write(file1_line_10_to_100[i - sub_list_start])
            else:
                f_out.write(line)
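If the blocks all live in one physical file rather than in separate files, a minimal sketch under that assumption (the filenames big.txt and big.txt.new are placeholders) tracks the position within each 50,000-line block:

BLOCK_SIZE = 50000
START, END = 9, 100  # 0-indexed bounds for lines 10-100 of each block

template = []  # lines 10-100 of the first block
with open("big.txt") as fin, open("big.txt.new", "w") as fout:
    for i, line in enumerate(fin):
        pos = i % BLOCK_SIZE
        if i < BLOCK_SIZE and START <= pos < END:
            template.append(line)  # remember the first block's lines
        if i >= BLOCK_SIZE and START <= pos < END:
            fout.write(template[pos - START])  # substitute in later blocks
        else:
            fout.write(line)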
Fastest way to read and delete N lines in Python
First I read the file, something like this (I think this is the best way to read large files: Source):
N = 50
with open("ahref.txt", "r+") as f:
    link_list = [next(f).removesuffix("\n") for x in range(N)]
After that I run my code:
# My code here
After that I want to delete the first N lines (I read about it: Source).
# Source: https://stackoverflow.com/questions/4710067/how-to-delete-a-specific-line-in-a-file/28057753#28057753
with open("target.txt", "r+") as f:
    d = f.readlines()
    f.seek(0)
    for i in d:
        if i != "line you want to remove...":
            f.write(i)
    f.truncate()
This code doesn't work for me, because I only read the first N lines rather than matching on content.
I can remove lines:
with open("xml\\ahref.txt", "r+") as f:
N = 5
all_lines = f.readlines()
f.seek(0)
f.truncate()
f.writelines(all_lines[N:])
But there is a problem with that: I have to read all the lines and then write them all back, which is not fast (there are many variations, but they all need to read every line).
What is the fastest way in terms of performance? The file is huge.
The fastest way is not to read the entire file into memory: use a temporary output file that you can then move over the original file if required.
try:

import os

N = 50
mode = "r+"
if not os.path.isfile('output'):
    mode = "w+"
with open('input', 'r') as fin, open('output', mode) as fout:
    # count any lines already in the output so an interrupted run can resume
    for line in fout:
        N += 1
    # index >= N skips exactly N lines (0-indexed): the 50 to delete
    # plus any lines already copied on a previous run
    for index, line in enumerate(fin):
        if index >= N:
            fout.write(line)
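If you then want the trimmed copy to replace the original, a minimal follow-up sketch (using the placeholder names 'input' and 'output' from above; assumes both files are on the same filesystem):

import os
os.replace('output', 'input')  # atomically move the new file over the original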
I have a big text file with a lot of parts. Every part has 4 lines, and the next part starts immediately after the previous one.
The first line of each part starts with #, the 2nd line is a sequence of characters, the 3rd line is a +, and the 4th line is again a sequence of characters.
Small example:
#M00872:462:000000000-D47VR:1:1101:15294:1338 1:N:0:ACATCG
TGCTCGGTGTATGTAAACTTCCGACTTCAACTGTATAGGGATCCAATTTTGACAAAATATTAACGCTTATCGATAAAATTTTGAATTTTGTAACTTGTTTTTGTAATTCTTTAGTTTGTATGTCTGTTGCTATTATGTCTACTATTCTTTCCCCTGCACTGTACCCCCCAATCCCCCCTTTTCTTTTAAAAGTTAACCGATACCGTCGAGATCCGTTCACTAATCGAACGGATCTGTCTCTGTCTCTCTC
+
BAABBADBBBFFGGGGGGGGGGGGGGGHHGHHGH55FB3A3GGH3ADG5FAAFEGHHFFEFHD5AEG1EF511F1?GFH3#BFADGD55F?#GFHFGGFCGG/GHGHHHHHHHDBG4E?FB?BGHHHHHHHHHHHHHHHHHFHHHHHHHHHGHGHGHHHHHFHHHHHGGGGHHHHGGGGHHHHHHHGHGHHHHHHFGHCFGGGHGGGGGGGGFGGEGBFGGGGGGGGGFGGGGFFB9/BFFFFFFFFFF/
I want to change the 2nd and the 4th line of each part and make a new file with a similar structure (4 lines for each part). In fact, I want to keep the first 65 characters (in lines 2 and 4) and remove the rest. The expected output for the small example would look like this:
#M00872:462:000000000-D47VR:1:1101:15294:1338 1:N:0:ACATCG
TGCTCGGTGTATGTAAACTTCCGACTTCAACTGTATAGGGATCCAATTTTGACAAAATATTAACG
+
BAABBADBBBFFGGGGGGGGGGGGGGGHHGHHGH55FB3A3GGH3ADG5FAAFEGHHFFEFHD5A
I wrote the following code:
infile = open("file.fastq", "r")
new_line = []
for line_number in len(infile.readlines()):
    if line_number == 2 or line_number == 4:
        new_line.append(infile[line_number])
with open('out_file.fastq', 'w') as f:
    for item in new_line:
        f.write("%s\n" % item)
but it does not return what I want. How can I fix it to get the expected output?
This code will achieve what you want:
from itertools import islice

with open('bio.txt', 'r') as infile, open('mod_bio.txt', 'w') as f:
    while True:
        lines_gen = list(islice(infile, 4))  # next 4-line part
        if len(lines_gen) < 4:               # stop at EOF (or a truncated part)
            break
        a, b, c, d = lines_gen
        b = b[:65] + '\n'                    # truncate the sequence line
        d = d[:65] + '\n'                    # truncate the quality line
        f.write(a + b + c + d)
How does it work?
We first use islice to pull 4 lines at a time from the file, as you describe.
Then we unpack them into individual lines a, b, c, d and slice each string. Eventually we join the strings back together and write them to a new file.
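To see the islice chunking on its own, a small sketch (the values are arbitrary):

from itertools import islice

it = iter(range(10))
print(list(islice(it, 4)))  # [0, 1, 2, 3]
print(list(islice(it, 4)))  # [4, 5, 6, 7] -- islice consumes the underlying iterator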
I think some itertools.cycle could be nice here:
import itertools

with open("transformed.file.fastq", "w+") as output_file:
    with open("file.fastq", "r") as input_file:
        for i in itertools.cycle((1, 2, 3, 4)):
            line = input_file.readline().strip()
            if not line:
                break
            if i in (2, 4):
                line = line[:65]
            output_file.write("{}\n".format(line))
readlines() returns a list of all the lines in your file, so you don't need to prepare a new_line list. Iterate directly over the index-value pairs of that list and you can modify the values at the positions you want. Note that the sequence and quality lines repeat every 4 lines, and that the slice drops the trailing newline, so both need handling.
Modifying your code, try this:
with open("file.fastq", "r") as infile:
    new_lines = infile.readlines()
for i, t in enumerate(new_lines):
    if i % 4 == 1 or i % 4 == 3:  # 2nd and 4th line of every part
        new_lines[i] = new_lines[i][:65] + '\n'
with open('out_file.fastq', 'w') as f:
    for item in new_lines:
        f.write("%s" % item)
I am reading lines from a file in Python. Here is my code:
with open('words', 'rb') as f:
    for line in f:
Is there a way to define the number of lines I want to use, say, for example, the first 1000 lines in the file?
You can use enumerate():
with open('words', 'rb') as f:
    for i, line in enumerate(f):
        if i >= 1000:
            break
        # do work for first 1000 lines
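Equivalently, itertools.islice stops the iteration for you; a small sketch:

from itertools import islice

with open('words', 'rb') as f:
    for line in islice(f, 1000):  # yields only the first 1000 lines
        pass  # do work here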
Make a counter variable (i in the example below) and increment it on each iteration. While the counter is below 1000, i.e. for the first 1000 lines, you can do your work; after that you can stop:
i = 0
with open('words', 'rb') as f:
    for line in f:
        if i >= 1000:
            break  # stop once 1000 lines have been processed
        # do stuff
        i = i + 1
I am trying to parse some text files and need to extract blocks of text. Specifically, the line that starts with "1:" and the 19 lines after it. The "1:" does not start on the same row in each file, and there is only one instance of "1:". I would prefer to save the block of text and export it to a separate file. In addition, I need to preserve the formatting of the text from the original file.
Needless to say I am new to Python. I generally work with R but these files are not really compatible with R and I have about 100 to process. Any information would be appreciated.
The code that I have so far is:
tmp = open(files[0], "r")
lines = tmp.readlines()
tmp.close()
num = 0
a = 0
for line in lines:
    num += 1
    if "1:" in line:
        a = num
        break
a = num is the line number for the block of text I want. I then want to save the next 19 lines to another file, but can't figure out how to do this. Any help would be appreciated.
Here is one option. Read all lines from your file, iterate until you find your line, and take the next 19 lines. You would need to handle situations where your file doesn't contain 19 additional lines (the slice below simply returns fewer lines in that case).
with open('yourfile.txt', 'r') as fh:
    all_lines = fh.readlines()

block = None
for count, line in enumerate(all_lines):
    if "1:" in line:
        block = all_lines[count + 1:count + 20]  # the 19 lines after the match
        break
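To export the extracted block to a separate file, as the question asks, a short follow-up (the output filename is a placeholder):

if block is not None:
    with open('block.txt', 'w') as out:
        out.writelines(block)  # lines keep their original newlines and formatting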
Could be done in a one-liner...
open(files[0]).read().split('1:', 1)[1].split('\n')[:19]
or more readable
txt = open(files[0]).read() # read the file into a big string
before, after = txt.split('1:', 1) # split the file on the first "1:"
after_lines = after.split('\n') # create lines from the after text
lines_to_save = after_lines[:19] # grab the first 19 lines after "1:"
then join the lines with a newline (and add a newline to the end) before writing it to a new file:
out_text = "1:" # add back "1:"
out_text += "\n".join(lines_to_save) # add all 19 lines with newlines between them
out_text += "\n" # add a newline at the end
open("outputfile.txt", "w").write(out_text)
To comply with best practice for reading and writing files, you should also use the with statement to ensure that the file handles are closed as soon as possible. You can create convenience functions for it:
def read_file(fname):
    "Returns contents of file with name `fname`."
    with open(fname) as fp:
        return fp.read()

def write_file(fname, txt):
    "Writes `txt` to a file named `fname`."
    with open(fname, 'w') as fp:
        fp.write(txt)
then you can replace the first line above with:
txt = read_file(files[0])
and the last line with:
write_file("outputfile.txt", out_text)
I always prefer to read the file into memory first, but sometimes that isn't possible. If you want to use iteration, then this will work:
def process_file(fname):
    with open(fname) as fp:
        for line in fp:
            if line.startswith('1:'):
                break
        else:
            return  # no '1:' in file
        yield line  # yield line containing '1:'
        for i, line in enumerate(fp):
            if i >= 19:
                break
            yield line

if __name__ == "__main__":
    with open('output.txt', 'w') as fp:
        for line in process_file('intxt.txt'):
            fp.write(line)
It uses the else: clause on a for-loop, which you don't see very often anymore, but it was created for just this purpose (the else clause is executed if the for-loop completes without hitting break).
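A minimal illustration of that for-else behavior (the values are arbitrary):

for n in [1, 3, 5]:
    if n % 2 == 0:
        print("found an even number")
        break
else:
    print("no even number found")  # runs because the loop never breaks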
I am trying to figure out a way to split a big txt file with columns of data into smaller files for uploading purposes. The big file has 4000 lines, and I am wondering if there is a way to divide it into four parts such as
file 1 (lines 1-1000)
file 2 (lines 1001-2000)
file 3 (lines 2001-3000)
file 4 (lines 3001-4000)
I appreciate the help.
This works (you could use a for rather than a while loop, but it makes little difference, and the while loop does not assume how many files will be necessary):
with open('longFile.txt', 'r') as f:
    lines = f.readlines()
threshold = 1000
fileID = 0
while fileID < len(lines) / float(threshold):
    with open('fileNo' + str(fileID) + '.txt', 'w') as currentFile:
        for currentLine in lines[threshold * fileID:threshold * (fileID + 1)]:
            currentFile.write(currentLine)
    fileID += 1
Hope this helps. Try to use open in a with block, as suggested in the Python docs.
Give this a try:
with open(filename, 'r') as fhand:
    all_lines = fhand.readlines()
for x in range(4):  # xrange in Python 2
    # new_file_names is assumed to be a list of 4 output filenames
    with open(new_file_names[x], 'w') as new_file:
        new_file.writelines(all_lines[x * 1000:(x + 1) * 1000])
I like Aleksander Lidtke's answer, but with a for loop and a pop() twist for fun. I also like to keep some of the file's original naming when I do this, since it usually applies to multiple files, so I build the new names from the original name by splitting off the extension.
with open('Data.txt', 'r') as f:
    lines = f.readlines()
limit = 1000
for o in range(len(lines)):
    if lines != []:
        with open(f.name.split(".")[0] + "_" + str(o) + '.txt', 'w') as NewFile:
            for i in range(limit):
                if lines != []:
                    NewFile.write(lines.pop(0))
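For a much larger input, a sketch that avoids loading the whole file into memory (the Data.txt naming from above is assumed):

from itertools import islice

limit = 1000
with open('Data.txt', 'r') as f:
    part = 0
    while True:
        chunk = list(islice(f, limit))  # lazily read the next 1000 lines
        if not chunk:
            break
        with open('Data_{}.txt'.format(part), 'w') as out:
            out.writelines(chunk)
        part += 1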