I am new to Python. I want to read one file and copy its data to another file. My code is below. When I open the files inside the for loop, all the data is written to dst_file correctly, but it takes 8 seconds to write dst_file.
for cnt, hex_num in enumerate(hex_data):
    with open(src_file, "r") as src_f, open(dst_file, "a") as dst_f:
        copy_flag = False
        for src_line in src_f:
            if r"SPI_frame_0" in src_line:
                src_line = src_line.replace('SPI_frame_0', 'SPI_frame_' + str(cnt))
                copy_flag = True
            if r"halt" in src_line:
                copy_flag = False
            if copy_flag:
                copy_mid_data += src_line
        updated_data = WriteHexData(copy_mid_data, hex_num, cnt, msb_lsb_flag)
        copy_mid_data = ""
        dst_f.write(updated_data)
To improve performance, I tried opening the files outside of the for loop, but it does not work properly: only one iteration of the for loop is written to dst_file. As shown below.
with open(src_file, "r") as src_f, open(dst_file, "a") as dst_f:
    for cnt, hex_num in enumerate(hex_data):
        copy_flag = False
        for src_line in src_f:
            if r"SPI_frame_0" in src_line:
                src_line = src_line.replace('SPI_frame_0', 'SPI_frame_' + str(cnt))
                copy_flag = True
            if r"halt" in src_line:
                copy_flag = False
            if copy_flag:
                copy_mid_data += src_line
        updated_data = WriteHexData(copy_mid_data, hex_num, cnt, msb_lsb_flag)
        copy_mid_data = ""
        dst_f.write(updated_data)
Can someone please help me find my mistake?
Files are iterators. Looping over one reads the file line by line until you reach the end, and the iterator does not simply go back to the start when you try to read more. A new for loop over the same file object does not 'reset' the file.
Either re-open the input file each time through the loop, seek back to the start explicitly, or read the file just once. You can seek back with src_f.seek(0); re-opening means using two with statements (one to open the output file once, the other inside the for loop to handle the src_f source file).
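For example, a minimal sketch of the seek(0) variant applied to your loop (keeping your WriteHexData call and variables exactly as given) could look like this:
with open(src_file, "r") as src_f, open(dst_file, "a") as dst_f:
    for cnt, hex_num in enumerate(hex_data):
        src_f.seek(0)  # rewind to the start of the source file for each pass
        copy_flag = False
        copy_mid_data = ""
        for src_line in src_f:
            if "SPI_frame_0" in src_line:
                src_line = src_line.replace('SPI_frame_0', 'SPI_frame_' + str(cnt))
                copy_flag = True
            if "halt" in src_line:
                copy_flag = False
            if copy_flag:
                copy_mid_data += src_line
        updated_data = WriteHexData(copy_mid_data, hex_num, cnt, msb_lsb_flag)
        dst_f.write(updated_data)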
In this case, given that you build up the data to be written out in memory in one go anyway, I'd read the input file just once, keeping only the lines you need to copy.
You can use multiple for loops over the same file object; the file position advances accordingly. That makes reading a series of lines from a match on one key string to another very simple, and the itertools.takewhile() function makes it even easier:
from itertools import takewhile

# read the correct lines (from SPI_frame_0 to halt) from the source file
lines = []
with open(src_file, "r") as src_f:
    for line in src_f:
        if r"SPI_frame_0" in line:
            lines.append(line)
            # read additional lines until we find 'halt'
            lines += takewhile(lambda l: 'halt' not in l, src_f)

# transform the source lines with a new counter
with open(dst_file, "a") as dst_f:
    for cnt, hex_num in enumerate(hex_data):
        copy_mid_data = []
        for line in lines:
            if "SPI_frame_0" in line:
                line = line.replace('SPI_frame_0', 'SPI_frame_{}'.format(cnt))
            copy_mid_data.append(line)
        updated_data = WriteHexData(''.join(copy_mid_data), hex_num, cnt, msb_lsb_flag)
        dst_f.write(updated_data)
Note that I changed copy_mid_data to a list to avoid quadratic string copying; it is far more efficient to join a list of strings just once.
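To illustrate that point generically (the names here are placeholders, not from the code above): repeated += re-copies the whole accumulated string on every iteration, whereas collecting the pieces in a list and joining once copies each character only a single time.
# quadratic: every += copies the entire accumulated string again
result = ""
for piece in pieces:
    result += piece

# linear: gather the pieces, then copy them once with a single join
parts = []
for piece in pieces:
    parts.append(piece)
result = ''.join(parts)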
Related
The standard Python approach to working with files, using the open() function to create a 'file object' f, allows you either to load the entire file into memory at once with f.read() or to read lines one by one using a for loop:
with open('filename') as f:
    # 1) Read all lines at once into memory:
    all_data = f.read()

    # 2) Read lines one-by-one:
    for line in f:
        # Work with each line
I'm searching through several large files looking for a pattern that might span multiple lines. The most intuitive way to do this is to read line-by-line looking for the beginning of the pattern, and then to load in the next few lines to see where it ends:
with open('large_file') as f:
    # Read lines one-by-one:
    for line in f:
        if line.startswith("beginning"):
            # Load in the next line, i.e.
            nextline = f.getline(line+1) # ??? #
            # or something
The line I've marked with # ??? # is my own pseudocode for what I imagine this should look like.
My question is, does this exist in Python? Is there any method for me to access other lines as needed while keeping the cursor at line and without loading the entire file into memory?
Edit: Inferring from the responses here and other reading, the answer is "No."
Like this:
gather = []
for line in f:
    if gather:
        gather.append(line)
        if "ending" in line:
            process(''.join(gather))
            gather = []
    elif line.startswith("beginning"):
        gather = [line]
Although in many cases it's easier just to load the whole file into a string and search it.
You may want to rstrip the newline before appending the line.
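For instance, a minimal sketch of that whole-file approach (assuming, as above, that a block runs from a line starting with "beginning" to a line containing "ending", and that process() is your own handler):
import re

with open('large_file') as f:
    data = f.read()

# re.DOTALL lets '.' also match newlines, so one pattern can span several lines;
# re.MULTILINE makes '^' match at the start of every line.
pattern = re.compile(r'^beginning.*?ending', re.DOTALL | re.MULTILINE)
for match in pattern.finditer(data):
    process(match.group(0))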
Just store the interesting lines into a list while going line-wise through the file:
with open("file.txt", "w") as f:
    f.write("""
a
b
------
c
d
e
####
g
f""")

interesting_data = []
inside = False
with open("file.txt") as f:
    for line in f:
        line = line.strip()
        # start of interesting stuff
        if line.startswith("---"):
            inside = True
        # end of interesting stuff
        elif line.startswith("###"):
            inside = False
        # adding interesting bits
        elif inside:
            interesting_data.append(line)

print(interesting_data)
to get
['c', 'd', 'e']
I think you're looking for .readline(), which does exactly that. Here is a sketch to proceed to the line where a pattern starts.
with open('large_file') as f:
    line = f.readline()
    while not line.startswith("beginning"):
        line = f.readline()
        # end of file
        if not line:
            print("EOF")
            break
    # do_something with line, get additional lines by
    # calling .readline() again, etc.
I'm trying to extract lines from a very large text file (10Gb). The text file contains the output from an engineering software (it's not a CSV file). I want to copy from line 1 to the first line containing the string 'stop' and then resume from the first line containing 'restart' to the end of the file.
The following code works but it's rather slow (about a minute). Is there a better way to do it using pandas? I have tried the read_csv function but I don't have a delimiter to input.
file_to_copy = r"C:\Users\joedoe\Desktop\C ANSYS R1\PATCHED\modes.txt"
output = r"C:\Users\joedoe\Desktop\C ANSYS R1\PATCHED\modes_extract.txt"

stop = '***** EIGENVECTOR (MODE SHAPE) SOLUTION *****'
restart = '***** PARTICIPATION FACTOR CALCULATION ***** X DIRECTION'

with open(file_to_copy) as f:
    orig = f.readlines()

newf = open(output, "w")

write = True
first_time = True
for line in orig:
    if first_time == True:
        if stop in line:
            first_time = False
            write = False
            for i in range(300):
                newf.write(
                    '\n -------------------- MIDDLE OF THE FILE -------------------')
            newf.write('\n\n')
    if restart in line: write = True
    if write: newf.write(line)

newf.close()
print('Done.')
readlines reads the whole file into a list, and then you iterate over that list. I think the following edit will save you one whole pass through the big file.
write = True
first_time = True

with open(file_to_copy) as f, open(output, "w") as newf:
    for line in f:
        if first_time == True:
            if stop in line:
                first_time = False
                write = False
                for i in range(300):
                    newf.write(
                        '\n -------------------- MIDDLE OF THE FILE -------------------')
                newf.write('\n\n')
        if restart in line: write = True
        if write: newf.write(line)

print('Done.')
You should use Python generators. Also, printing makes the process slower.
Following are a few examples of using generators:
Python generator to read large CSV file
Lazy Method for Reading Big File in Python?
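For illustration, here is a minimal generator sketch (the function name and chunk size are placeholders, not from the code above) that reads a large file lazily instead of pulling it all into memory with readlines:
def read_in_chunks(file_object, chunk_size=1024 * 1024):
    """Lazily yield successive chunks from a file (default 1 MB)."""
    while True:
        chunk = file_object.read(chunk_size)
        if not chunk:
            break
        yield chunk

with open(file_to_copy) as f:
    for chunk in read_in_chunks(f):
        pass  # process each chunk here instead of collecting everything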
I'm trying to find a way of making this process work pythonically, or at all. Basically, I have a really long text file that is split into lines. Every x number of lines there is one that is mainly uppercase, which should roughly be the title of that particular section. Ideally, I'd want the title and everything after it to go into a text file that uses the title as its name. This would have to happen 3039 times in this case, as that is how many titles there are.
My process so far is this: I created a function that reads through a piece of text and tells me whether it's mostly uppercase.
import numpy as np

def mostly_uppercase(text):
    threshold = 0.7
    isupper_bools = [character.isupper() for character in text]
    isupper_ints = [int(val) for val in isupper_bools]
    try:
        upper_percentage = np.mean(isupper_ints)
    except:
        return False
    if upper_percentage >= threshold:
        return True
    else:
        return False
Afterwards, I made a counter so that I could create an index, and then I combined them:
counter = 0
headline_indices = []
for line in page_text:
    if mostly_uppercase(line):
        print(line)
        headline_indices.append(counter)
    counter += 1

headlines_with_articles = []
headline_indices_expanded = [0] + headline_indices + [len(page_text) - 1]
for first, second in list(zip(headline_indices_expanded, headline_indices_expanded[1:])):
    article_text = page_text[first:second]
    headlines_with_articles.append(article_text)
All of that seems to be working fine as far as I can tell. But when I try to print the pieces that I want to files, all I manage to do is print the entire text into all of the txt files.
for i in range(100):
    out_pathname = '/sharedfolder/temp_directory/' + 'new_file_' + str(i) + '.txt'
    with open(out_pathname, 'w') as fo:
        fo.write(articles_filtered[2])
Edit: This got me halfway there. Now, I just need a way of naming each file with the first line.
for i, text in enumerate(articles_filtered):
    open('/sharedfolder/temp_directory' + str(i + 1) + '.txt', 'w').write(str(text))
One conventional way of processing a single input file involves using a Python with statement and a for loop, in the following way. I have also adapted a good answer from someone else for counting uppercase characters, to get the fraction you need.
def mostly_upper(text):
    threshold = 0.7
    ## adapted from https://stackoverflow.com/a/18129868/131187
    upper_count = sum(1 for c in text if c.isupper())
    # guard against empty lines to avoid dividing by zero
    return bool(text) and upper_count / len(text) >= threshold

first = True
out_file = None
with open('some_uppers.txt') as some_uppers:
    for line in some_uppers:
        line = line.rstrip()
        if first or mostly_upper(line):
            first = False
            if out_file: out_file.close()
            out_file = open(line + '.txt', 'w')
        print(line, file=out_file)
if out_file: out_file.close()
In the loop, we read each line, asking whether it's mostly uppercase. If it is we close the file that was being used for the previous collection of lines and open a new file for the next collection, using the contents of the current line as a title.
I allow for the possibility that the first line might not be a title. In this case the code creates a file with the contents of the first line as its name, and proceeds to write everything it finds to that file until it does find a title line.
I have some CSV files that I have to modify, which I do in a loop. The code loops through the source file, reads each line, makes some modifications, and then saves the output to another CSV file. In order to check my work, I want the first line and the last line saved in another file so I can confirm that nothing was skipped.
What I've done is put all of the lines into a list and then get the last one using the length of the list minus 1. This works, but I'm wondering if there is a more elegant way to accomplish it.
Code sample:
from itertools import islice

def CVS1():
    fb = open('C:\\HP\\WS\\final-cir.csv', 'wb')
    check = open('C:\\HP\\WS\\check-all.csv', 'wb')
    check_count = 0
    check_list = []
    with open('C:\\HP\\WS\\CVS1-source.csv', 'r') as infile:
        skip_first_line = islice(infile, 3, None)
        for line in skip_first_line:
            check_list.append(line)
            check_count += 1
            if check_count == 1:
                check.write(line)
            [CSV modifications become a string called "newline"]
            fb.write(newline)
        final_check = check_list[len(check_list)-1]
        check.write(final_check)
    fb.close()
If you actually need check_list for something, then, as the other answers suggest, using check_list[-1] is equivalent to but better than check_list[len(check_list)-1].
But do you really need the list? If all you want to keep track of is the first and last lines, you don't. If you keep track of the first line specially, and keep track of the current line as you go along, then at the end, the first line and the current line are the ones you want.
In fact, since you appear to be writing the first line into check as soon as you see it, you don't need to keep track of anything but the current line. And the current line, you've already got that, it's line.
So, let's strip all the other stuff out:
def CVS1():
    fb = open('C:\\HP\\WS\\final-cir.csv', 'wb')
    check = open('C:\\HP\\WS\\check-all.csv', 'wb')
    first_line = True
    with open('C:\\HP\\WS\\CVS1-source.csv', 'r') as infile:
        skip_first_line = islice(infile, 3, None)
        for line in skip_first_line:
            if first_line:
                check.write(line)
                first_line = False
            [CSV modifications become a string called "newline"]
            fb.write(newline)
        check.write(line)
    fb.close()
You can enumerate the CSV rows of the input file and check the index, writing the first row when the index is 0 and the last row once the loop has finished, like this:
def CVS1():
    with open('C:\\HP\\WS\\final-cir.csv', 'wb') as fb, open('C:\\HP\\WS\\check-all.csv', 'wb') as check, open('C:\\HP\\WS\\CVS1-source.csv', 'r') as infile:
        skip_first_line = islice(infile, 3, None)
        for idx, line in enumerate(skip_first_line):
            if idx == 0:
                check.write(line)
            #[CSV modifications become a string called "newline"]
            fb.write(newline)
        # after the loop, line still holds the final row
        check.write(line)
I've replaced the open statements with a with block, to delegate closing the file handles to the interpreter.
You can access index -1 directly:
final_check = check_list[-1]
which is nicer than what you have now:
final_check = check_list[len(check_list)-1]
If it's not an empty or one-line file, you can:
my_file = open(root_to_file, 'r')
my_lines = my_file.readlines()
first_line = my_lines[0]
last_line = my_lines[-1]
I am writing a Python script and I just need the second line of a series of very small text files. I would like to extract this without saving the file to my hard drive, as I currently do.
I have found a few threads that reference the TempFile and StringIO modules but I was unable to make much sense of them.
Currently I download all of the files and name them sequentially like 1.txt, 2.txt, etc., then go through all of them and extract the second line. I would like to open a file, grab the line, then move on to finding, opening, and reading the next file.
Here is what I currently do, writing to my HDD:
while (count4 <= num_files):
    file_p = [directory, str(count4), '.txt']
    file_path = ''.join(file_p)
    cand_summary = string.strip(linecache.getline(file_path, 2))
    linkFile = open('Summary.txt', 'a')
    linkFile.write(cand_summary)
    linkFile.write("\n")
    count4 = count4 + 1
linkFile.close()
Just replace the file writing with a call to append() on a list. For example:
summary = []
while (count4 <= num_files):
    file_p = [directory, str(count4), '.txt']
    file_path = ''.join(file_p)
    cand_summary = string.strip(linecache.getline(file_path, 2))
    summary.append(cand_summary)
    count4 = count4 + 1
As an aside, you would normally write count4 += 1. Also, it looks like count4 uses 1-based indexing, which seems pretty unusual for Python.
You open and close the output file in every iteration.
Why not simply do
with open("Summary.txt", "w") as linkfile:
    while (count4 <= num_files):
        file_p = [directory, str(count4), '.txt']
        file_path = ''.join(file_p)
        cand_summary = linecache.getline(file_path, 2).strip()  # the string module is deprecated
        linkfile.write(cand_summary)
        linkfile.write("\n")
        count4 = count4 + 1
Also, linecache is probably not the right tool here since it's optimized for reading multiple lines from the same file, not the same line from multiple files.
Instead, better do
with open(file_path, "r") as infile:
    dummy = infile.readline()
    cand_summary = infile.readline().strip()
Also, if you drop the strip() method, you don't have to re-add the \n, but who knows why you have that in there. Perhaps .lstrip() would be better?
Finally, what's with the manual while loop? Why not use a for loop?
Lastly, after your comment, I understand you want to put the result in a list instead of a file. OK.
All in all:
summary = []
for count in xrange(num_files):
    file_p = [directory, str(count), '.txt']  # or count+1, if you start at 1
    file_path = ''.join(file_p)
    with open(file_path, "r") as infile:
        dummy = infile.readline()
        cand_summary = infile.readline().strip()
    summary.append(cand_summary)