Fastest way to read and delete N lines in Python.
First I read the file, something like this (I think this is the best way to read large files: Source):
N = 50
with open("ahref.txt", "r+") as f:
link_list = [(next(f)).removesuffix("\n") for x in range(N)]
After that I run my code:
# My code here
After that I want to delete the first N lines (as I read here: Source).
# Source: https://stackoverflow.com/questions/4710067/how-to-delete-a-specific-line-in-a-file/28057753#28057753
with open("target.txt", "r+") as f:
d = f.readlines()
f.seek(0)
for i in d:
if i != "line you want to remove...":
f.write(i)
f.truncate()
This code doesn't work for me, because it removes lines by matching their content, while I just want to drop the first N lines that I read.
I can remove the lines like this:
with open("xml\\ahref.txt", "r+") as f:
N = 5
all_lines = f.readlines()
f.seek(0)
f.truncate()
f.writelines(all_lines[N:])
But there is a problem with that: I have to read all the lines and then write them all back, which is not fast (there are many variations of this approach, but they all need to read every line).
What is the fastest way in terms of performance? The file is huge.
The fastest way is not to read the entire file into memory, and to use a temporary output file that you can then move over the original file if required.
Try this:
import os

N = 50
mode = "r+"
if not os.path.isfile('output'):
    mode = "w+"
with open('input', 'r') as fin, open('output', mode) as fout:
    # if the output already has lines from a previous run, skip that many more
    for index, line in enumerate(fout):
        N += 1
    for index, line in enumerate(fin):
        if index >= N:  # indices 0..N-1 are the first N lines, so keep from N on
            fout.write(line)
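If you would rather keep the original file name, a minimal sketch of the temp-file-and-move idea described above, assuming the file name from the question and that the temp file can live in the current directory (so os.replace stays on one filesystem):
import os
import tempfile

N = 50
src = "ahref.txt"  # file name taken from the question
# write the surviving lines to a temp file in the same directory,
# then atomically swap it over the original
with open(src) as fin, tempfile.NamedTemporaryFile(
        "w", dir=".", delete=False) as fout:
    for index, line in enumerate(fin):
        if index >= N:  # drop the first N lines
            fout.write(line)
os.replace(fout.name, src)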
Related
I have this text file and let's say it contains 10 lines.
Bye
Hi
2
3
4
5
Hi
Bye
7
Hi
Every time it says "Hi" or "Bye" I want it removed, except for the first time it was said.
My current code is (yes, filename actually points to a file; I just didn't include it here):
text_file = open(filename)
for i, line in enumerate(text_file):
    if i == 0:
        var_Line1 = line
    if i == 1:
        var_Line2 = line
    if i > 1:
        if line == var_Line2:
            del line  # only removes the local name, not the line in the file
text_file.close()
It does detect the duplicates, but it takes a very long time considering how many lines there are, and I'm not sure how to actually delete them and save the result.
You could use dict.fromkeys to remove duplicates and preserve order efficiently (dicts preserve insertion order in Python 3.7+):
with open(filename, "r") as f:
    lines = dict.fromkeys(f.readlines())
with open(filename, "w") as f:
    f.writelines(lines)
Idea from Raymond Hettinger
Using a set & some basic filtering logic:
with open('test.txt') as f:
    seen = set()  # keep track of the lines already seen
    deduped = []
    for line in f:
        line = line.rstrip()
        if line not in seen:  # if not seen already, keep the line
            deduped.append(line)
            seen.add(line)

# re-write the file with the de-duplicated lines
with open('test.txt', 'w') as f:
    f.writelines([l + '\n' for l in deduped])
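Note that both answers above drop every repeated line, including the numbers. If, as the question literally asks, only repeats of particular words like "Hi" and "Bye" should go, a sketch along these lines (the target set and file name are assumptions):
targets = {"Hi", "Bye"}  # only these lines are de-duplicated (assumed)
seen = set()
kept = []
with open('test.txt') as f:
    for line in f:
        word = line.rstrip('\n')
        if word in targets:
            if word in seen:
                continue  # skip every repeat of a target word
            seen.add(word)
        kept.append(line)
with open('test.txt', 'w') as f:
    f.writelines(kept)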
I have a large text file with over ~200 million lines. It is split into blocks of approximately 50000 lines. What I need to do is replace lines 10-100 from all the blocks with lines 10-100 from the first block. Any ideas how to go about this?
Thanks in advance
Use a list. First read the lines you want to use from the first block into a list. Next, read each other file in turn line by line, and write the lines out to a new file, but if the line number is between 10 and 100 then use the line from your list. An example that achieves your goal:
fnames = ["file1.txt", "file2.txt", "file3.txt"]
sub_list_start = 9   # zero-based index of line 10
sub_list_end = 100   # one past the zero-based index of line 100
file1_line_10_to_100 = []
with open(fnames[0]) as f:
    for i, line in enumerate(f):
        if i >= sub_list_start and i < sub_list_end:
            file1_line_10_to_100.append(line)
        if i >= sub_list_end:
            break
for fname in fnames[1:]:
    with open(fname) as f:
        with open(fname + '.new', 'w') as f_out:
            for i, line in enumerate(f):
                if i >= sub_list_start and i < sub_list_end:
                    f_out.write(file1_line_10_to_100[i - sub_list_start])
                else:
                    f_out.write(line)
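The question actually describes one large file divided into ~50,000-line blocks rather than separate files. Under the assumption that every block has exactly BLOCK lines (the question says "approximately", so this is an assumption, as are the file names), a sketch of the same substitution on a single file:
BLOCK = 50000        # assumed fixed block size
START, END = 9, 100  # zero-based bounds for lines 10-100 of each block

with open("big.txt") as fin, open("big.txt.new", "w") as fout:
    template = []    # lines 10-100 of the first block
    for i, line in enumerate(fin):
        pos = i % BLOCK  # position of this line within its block
        if i < BLOCK:
            if START <= pos < END:
                template.append(line)
            fout.write(line)  # the first block is copied unchanged
        elif START <= pos < END:
            fout.write(template[pos - START])  # substitute from block 1
        else:
            fout.write(line)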
I have a dataset of about 10 CSV files. I want to combine those files row-wise into a single CSV file.
What I tried:
import csv

fout = open("claaassA.csv", "a")
# first file:
writer = csv.writer(fout)
for line in open("a01.ihr.60.ann.csv"):
    print line
    # note: writerow expects a sequence, so passing a string
    # writes every character as its own field
    writer.writerow(line)
# now the rest:
for num in range(2, 10):
    print num
    f = open("a0" + str(num) + ".ihr.60.ann.csv")
    #f.next()  # skip the header
    for line in f:
        print line
        writer.writerow(line)
    #f.close()  # not really needed
fout.close()
Definitely more details are needed in the question (ideally examples of the inputs and the expected output).
Given the little information provided, I will assume that you know that all files are valid CSV and that they all have the same number of lines (rows). I'll also assume that memory is not a concern (i.e. they are "small" files that fit together in memory). Furthermore, I assume that the line endings are newlines (\n).
If all these assumptions are valid, then you can do something like this:
input_files = ['file1.csv', 'file2.csv', 'file3.csv']
output_file = 'output.csv'

output = None
for infile in input_files:
    with open(infile, 'r') as fh:
        if output:
            for i, l in enumerate(fh.readlines()):
                output[i] = "{},{}".format(output[i].rstrip('\n'), l)
        else:
            output = fh.readlines()

with open(output_file, 'w') as fh:
    for line in output:
        fh.write(line)
There are probably more efficient ways, but this is a quick and dirty way to achieve what I think you are asking for.
The previous answer implicitly assumes we need to do this in Python. If bash is an option, then you could use the paste command. For example:
paste -d, file1.csv file2.csv file3.csv > output.csv
I don't fully understand why you use the csv library. Actually, it's enough to fill the output file with the lines from the given files (if they have the same column names and order).
input_path_list = [
    "a01.ihr.60.ann.csv",
    "a02.ihr.60.ann.csv",
    "a03.ihr.60.ann.csv",
    "a04.ihr.60.ann.csv",
    "a05.ihr.60.ann.csv",
    "a06.ihr.60.ann.csv",
    "a07.ihr.60.ann.csv",
    "a08.ihr.60.ann.csv",
    "a09.ihr.60.ann.csv",
]
output_path = "claaassA.csv"

with open(output_path, "w") as fout:
    header_written = False
    for input_path in input_path_list:
        with open(input_path) as fin:
            header = next(fin)
            # write the header once and skip it in every other file
            if not header_written:
                fout.write(header)
                header_written = True
            # copy all data rows
            for line in fin:
                fout.write(line)
I am trying to figure out a way to split a big txt file with columns of data into smaller files for uploading purposes. The big file has 4000 lines and I am wondering if there is a way to divide it into four parts, such as
file 1 (lines 1-1000)
file 2 (lines 1001-2000)
file 3 (lines 2001-3000)
file 4 (lines 3001-4000)
I appreciate the help.
This works (you could use a for loop rather than a while loop, but it makes little difference, and the while loop does not assume how many files will be necessary):
with open('longFile.txt', 'r') as f:
    lines = f.readlines()

threshold = 1000
fileID = 0
while fileID < len(lines) / float(threshold):
    with open('fileNo' + str(fileID) + '.txt', 'w') as currentFile:
        for currentLine in lines[threshold * fileID:threshold * (fileID + 1)]:
            currentFile.write(currentLine)
    fileID += 1
Hope this helps. Try to use open in a with block, as suggested in the Python docs.
Give this a try:
fhand = open(filename, 'r')
all_lines = fhand.readlines()
fhand.close()
for x in range(4):
    new_file = open(new_file_names[x], 'w')  # new_file_names: a list of four output file names
    new_file.writelines(all_lines[x * 1000:(x + 1) * 1000])
    new_file.close()
I like Aleksander Lidtke's answer, but with a for loop and a pop() twist for fun. I also like to keep part of the file's original name, since this usually produces multiple files, so I build each new name with f.name.split(".").
with open('Data.txt', 'r') as f:
    lines = f.readlines()

limit = 1000
for o in range(len(lines)):
    if lines != []:
        with open(f.name.split(".")[0] + "_" + str(o) + '.txt', 'w') as NewFile:
            for i in range(limit):
                if lines != []:
                    NewFile.write(lines.pop(0))
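All of the answers above read the whole file into memory first. For a file too big for that, a sketch that streams the lines in fixed-size chunks with itertools.islice (file names are placeholders):
from itertools import islice

limit = 1000
with open('Data.txt') as f:
    part = 0
    while True:
        chunk = list(islice(f, limit))  # read at most `limit` lines
        if not chunk:
            break
        with open('Data_' + str(part) + '.txt', 'w') as out:
            out.writelines(chunk)
        part += 1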
Suppose I have a file (say file1.txt) with around 3 MB or more of data. If I want to write this data to a second file (say file2.txt), which one of the following approaches will be better?
Language used: Python 2.7.3
Approach 1:
file1_handler = file("file1.txt", 'r')
for lines in file1_handler:
    line = lines.strip()
    # Perform some operation
    file2_handler = file("file2.txt", 'a')
    file2_handler.write(line)
    file2_handler.write('\r\n')
    file2_handler.close()
file1_handler.close()
Approach 2:
file1_handler = file("file1.txt", 'r')
file2_handler = file("file2.txt", 'a')
for lines in file1_handler:
    line = lines.strip()
    # Perform some operation
    file2_handler.write(line)
    file2_handler.write('\r\n')
file2_handler.close()
file1_handler.close()
I think approach two will be better because you just have to open and close file2.txt once. What do you say?
Use with, it will close the files automatically for you:
with open("file1.txt", 'r') as in_file, open("file2.txt", 'a') as out_file:
for lines in in_file:
line = lines.strip()
# Perform some operation
out_file.write(line)
out_file.write('\r\n')
Use open instead of file; file is deprecated.
Of course it's unreasonable to open file2 on every line of file1.
I was recently doing something similar (if I understood you correctly). How about:
file = open('file1.txt', 'r')
file2 = open('file2.txt', 'wt')
for line in file:
    newLine = line.strip()
    # You can do your operation here on newLine
    file2.write(newLine)
    file2.write('\r\n')
file.close()
file2.close()
This approach works like a charm!
My solution (derived from Pavel Anossov + buffering):
dim = 1000
buffer = []
with open("file1.txt", 'r') as in_file, open("file2.txt", 'a') as out_file:
    for i, lines in enumerate(in_file):
        line = lines.strip()
        # Perform some operation
        buffer.append(line)
        if i % dim == dim - 1:  # flush every `dim` lines
            for bline in buffer:
                out_file.write(bline)
                out_file.write('\r\n')
            buffer = []
    # flush whatever is left in the final partial buffer
    for bline in buffer:
        out_file.write(bline)
        out_file.write('\r\n')
Pavel Anossov gave the right solution first: this is just a suggestion ;)
There is probably a more elegant way to implement this; if anyone knows it, please tell us.
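One possibly more elegant variant is a sketch that uses itertools.islice for the batching, so no manual buffer bookkeeping is needed (the transform function is a placeholder standing in for "Perform some operation"):
from itertools import islice

def transform(line):
    # placeholder for the per-line operation
    return line.strip()

batch_size = 1000
with open("file1.txt", 'r') as in_file, open("file2.txt", 'a') as out_file:
    while True:
        batch = list(islice(in_file, batch_size))  # up to batch_size lines
        if not batch:
            break
        out_file.writelines(transform(line) + '\r\n' for line in batch)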