Extracting data from a very large text file using python and pandas?

I'm trying to extract lines from a very large text file (10 GB). The file contains the output of an engineering program (it's not a CSV file). I want to copy from line 1 to the first line containing the string 'stop', and then resume from the first line containing 'restart' to the end of the file.
The following code works, but it's rather slow (about a minute). Is there a better way to do it using pandas? I have tried the read_csv function, but I don't have a delimiter to pass it.
file_to_copy = r"C:\Users\joedoe\Desktop\C ANSYS R1\PATCHED\modes.txt"
output = r"C:\Users\joedoe\Desktop\C ANSYS R1\PATCHED\modes_extract.txt"

stop = '***** EIGENVECTOR (MODE SHAPE) SOLUTION *****'
restart = '***** PARTICIPATION FACTOR CALCULATION ***** X DIRECTION'

with open(file_to_copy) as f:
    orig = f.readlines()

newf = open(output, "w")
write = True
first_time = True
for line in orig:
    if first_time == True:
        if stop in line:
            first_time = False
            write = False
            for i in range(300):
                newf.write('\n -------------------- MIDDLE OF THE FILE -------------------')
            newf.write('\n\n')
    if restart in line:
        write = True
    if write:
        newf.write(line)
newf.close()
print('Done.')

readlines() makes one pass over the whole file to build a list, and then you make a second pass over that list. I think the following edit will save you one whole pass through the big file (and the memory for the list).
write = True
first_time = True
with open(file_to_copy) as f, open(output, "w") as newf:
    for line in f:
        if first_time == True:
            if stop in line:
                first_time = False
                write = False
                for i in range(300):
                    newf.write('\n -------------------- MIDDLE OF THE FILE -------------------')
                newf.write('\n\n')
        if restart in line:
            write = True
        if write:
            newf.write(line)
print('Done.')

You should use Python generators. Also, printing makes the process slower.
Here are a few examples of using generators:
Python generator to read large CSV file
Lazy Method for Reading Big File in Python?
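For illustration, a minimal generator-based sketch for the original stop/restart extraction (it reuses the question's variables and omits the MIDDLE OF THE FILE banner):

def wanted_lines(path, stop, restart):
    """Yield lines up to the 'stop' marker, then from 'restart' to EOF."""
    with open(path) as f:
        writing = True
        for line in f:
            if writing and stop in line:
                writing = False
            elif not writing and restart in line:
                writing = True
            if writing:
                yield line

with open(output, "w") as newf:
    newf.writelines(wanted_lines(file_to_copy, stop, restart))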

Related

How can I handle multiple lines at once while reading from a file?

The standard Python approach to working with files is to call open() to get a 'file object' f; you can then either load the entire file into memory at once with f.read(), or read lines one-by-one in a for loop:
with open('filename') as f:
    # 1) Read all lines at once into memory:
    all_data = f.read()
    # 2) Or read lines one-by-one (in practice you'd do one or the other, not both):
    for line in f:
        pass  # work with each line
I'm searching through several large files looking for a pattern that might span multiple lines. The most intuitive way to do this is to read line-by-line looking for the beginning of the pattern, and then to load in the next few lines to see where it ends:
with open('large_file') as f:
    # Read lines one-by-one:
    for line in f:
        if line.startswith("beginning"):
            # Load in the next line, i.e.
            nextline = f.getline(line+1)  # ??? #
            # or something
The line I've marked with # ??? # is my own pseudocode for what I imagine this should look like.
My question is, does this exist in Python? Is there any method for me to access other lines as needed while keeping the cursor at line and without loading the entire file into memory?
Edit: Inferring from the responses here and other reading, the answer is "No."
Like this:
gather = []
for line in f:
    if gather:
        gather.append(line)
        if "ending" in line:
            process(''.join(gather))
            gather = []
    elif line.startswith("beginning"):
        gather = [line]
Although in many cases it's easier just to load the whole file into a string and search it.
You may want to rstrip the newline before appending the line.
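For the load-everything variant, a minimal regex sketch (using the same placeholder markers and the process() helper from above):

import re

with open('large_file') as f:
    text = f.read()

# each match spans from a line starting with 'beginning'
# through the end of the nearest line containing 'ending'
for block in re.findall(r'(?ms)^beginning.*?ending.*?$', text):
    process(block)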
Just store the interesting lines into a list while going line-wise through the file:
with open("file.txt","w") as f:
f.write("""
a
b
------
c
d
e
####
g
f""")
interesting_data = []
inside = False
with open ("file.txt") as f:
for line in f:
line = line.strip()
# start of interesting stuff
if line.startswith("---"):
inside = True
# end of interesting stuff
elif line.startswith("###"):
inside = False
# adding interesting bits
elif inside:
interesting_data.append(line)
print(interesting_data)
to get
['c', 'd', 'e']
I think you're looking for .readline(), which does exactly that. Here is a sketch to proceed to the line where a pattern starts.
with open('large_file') as f:
    line = f.readline()
    while not line.startswith("beginning"):
        line = f.readline()
        # end of file
        if not line:
            print("EOF")
            break
    # do_something with line, get additional lines by
    # calling .readline() again, etc.
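Once the while loop exits on a match, the cursor sits just past that line, so follow-up lines can be pulled on demand, for example:

# read the next two lines lazily, without touching the rest of the file
following = [f.readline() for _ in range(2)]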

object.write() is not working as expected

I am new to Python. I want to read one file and copy its data to another file. My code is below. When I open the files inside the for loop, I can write all the data into dst_file, but it takes 8 seconds to write dst_file.
for cnt, hex_num in enumerate(hex_data):
    with open(src_file, "r") as src_f, open(dst_file, "a") as dst_f:
        copy_flag = False
        for src_line in src_f:
            if r"SPI_frame_0" in src_line:
                src_line = src_line.replace('SPI_frame_0', 'SPI_frame_' + str(cnt))
                copy_flag = True
            if r"halt" in src_line:
                copy_flag = False
            if copy_flag:
                copy_mid_data += src_line
        updated_data = WriteHexData(copy_mid_data, hex_num, cnt, msb_lsb_flag)
        copy_mid_data = ""
        dst_f.write(updated_data)
To improve performance, I am trying to open the files outside of the for loop, but it is not working properly: it writes to dst_file only once (for one iteration of the for loop), as shown below.
with open(src_file, "r") as src_f, open(dst_file, "a") as dst_f:
    for cnt, hex_num in enumerate(hex_data):
        copy_flag = False
        for src_line in src_f:
            if r"SPI_frame_0" in src_line:
                src_line = src_line.replace('SPI_frame_0', 'SPI_frame_' + str(cnt))
                copy_flag = True
            if r"halt" in src_line:
                copy_flag = False
            if copy_flag:
                copy_mid_data += src_line
        updated_data = WriteHexData(copy_mid_data, hex_num, cnt, msb_lsb_flag)
        copy_mid_data = ""
        dst_f.write(updated_data)
Can someone please help me find my mistake?
Files are iterators. Looping over them reads the file line by line until you reach the end; they don't just go back to the start when you try to read more. A new for loop over a file object does not 'reset' the file.
Either re-open the input file each time in the loop, seek back to the start explicitly, or read the file just once. You can seek back with src_f.seek(0); re-opening means you need two with statements (one to open the output file once, the other inside the for loop to handle the src_f source file).
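For illustration, a minimal sketch of the seek(0) variant applied to the second version above:

with open(src_file, "r") as src_f, open(dst_file, "a") as dst_f:
    for cnt, hex_num in enumerate(hex_data):
        src_f.seek(0)  # rewind; otherwise every pass after the first sees an exhausted file
        copy_flag = False
        for src_line in src_f:
            ...  # unchanged line handling from the question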
In this case, given that you build up the data to be written out to memory in one go anyway, I'd read the input file just once, keeping only the lines you need to copy.
You can use multiple for loops over the same file object, the file position will change accordingly. That makes reading a series of lines from a match on one key string to another very simple. The itertools.takewhile() function makes it even easier:
from itertools import takewhile

# read the correct lines (from SPI_frame_0 to halt) from the source file
lines = []
with open(src_file, "r") as src_f:
    for line in src_f:
        if r"SPI_frame_0" in line:
            lines.append(line)
            # read additional lines until we find 'halt'
            lines += takewhile(lambda l: 'halt' not in l, src_f)

# transform the source lines with a new counter
with open(dst_file, "a") as dst_f:
    for cnt, hex_num in enumerate(hex_data):
        copy_mid_data = []
        for line in lines:
            if "SPI_frame_0" in line:
                line = line.replace('SPI_frame_0', 'SPI_frame_{}'.format(cnt))
            copy_mid_data.append(line)
        updated_data = WriteHexData(''.join(copy_mid_data), hex_num, cnt, msb_lsb_flag)
        dst_f.write(updated_data)
Note that I changed copy_mid_data to a list to avoid quadratic string copying; it is far more efficient to join a list of strings just once.

Python Splitting Text file based on a keyword

I am trying to write a Python program that will constantly read a text file line by line, and each time it comes across a line with the word 'SPLIT', write the contents so far to a new text file.
Could someone please point me in the right direction for writing a new text file each time the script comes across the word 'SPLIT'? I have no problem reading a text file with Python; I'm unsure how to split on the keyword and create an individual text file each time.
The script below works in Python 2.7.13:
file_counter = 0
done = False
with open('test.txt') as input_file:
    # with open("test" + str(file_counter) + ".txt", "w") as out_file:
    while not done:
        for line in input_file:
            if "SPLIT" in line:
                done = True
                file_counter += 1
            else:
                print(line)
                out_file = open("test" + str(file_counter) + ".txt", "a")
                out_file.write(line)
                # out_file.write(line.strip() + "\n")
print file_counter
You need two loops: an outer one that iterates over the output file names, and an inner one that writes the input contents to the currently active output until "SPLIT" is found:
out_n = 0
done = False
with open("test.txt") as in_file:
    while not done:  # loop over output file names
        with open(f"out{out_n}.txt", "w") as out_file:  # generate an output file name
            while not done:  # loop over lines in input file and write to output file
                try:
                    line = next(in_file).strip()  # strip whitespace for consistency
                except StopIteration:
                    done = True
                    break
                if "SPLIT" in line:  # more robust than 'if line == "SPLIT\n":'
                    break
                else:
                    out_file.write(line + '\n')  # add the newline back; it was stripped above
        out_n += 1  # increment output file name integer
for line in text.splitlines():
    if " SPLIT " in line:
        # write in new file.
        pass
To write to a new file, see:
https://www.tutorialspoint.com/python/python_files_io.htm
or
https://docs.python.org/3.6/library/functions.html#open
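As a concrete starting point, here is a minimal sketch of that idea (the file names are placeholders, not from the question):

# write each chunk between 'SPLIT' lines to its own numbered file
part = 0
out_file = open('part0.txt', 'w')
with open('test.txt') as in_file:
    for line in in_file:
        if "SPLIT" in line:
            out_file.close()
            part += 1
            out_file = open('part{}.txt'.format(part), 'w')
        else:
            out_file.write(line)
out_file.close()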

How to make log parsing faster for large text files

I have large (500,000-line) log files that I parse for specified sections. When found, the sections are printed to a Text widget. Even if I cut the readlines down to the last 50,000 lines, it takes upwards of a minute to finish.
with open(i, "r") as f:
r = f.readlines()
r = r[-50000:]
start = 0
for line in r:
if 'Start section' in line:
if start == 1:
cpfotxt.insert('end', line + "\n", 'hidden')
start = 1
if 'End section' in line:
start = 0
cpfotxt.insert('end', line + "\n")
if start == 1:
cpfotxt.insert('end', line + "\n")
f.close()
Any way to do this faster?
You should try to read it in chunks.
with open(...) as f:
    for line in f:
        <do something with line>
A clearer approach that can be applied in your case:
def readInChunks(fileObj, chunkSize=2048):
    """
    Lazy function to read a file piece by piece.
    Default chunk size: 2 kB.
    """
    while True:
        data = fileObj.read(chunkSize)
        if not data:
            break
        yield data

f = open('bigFile')
for chunk in readInChunks(f):
    do_something(chunk)
f.close()
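If the goal is specifically the last 50,000 lines, a different technique, collections.deque with a maxlen, bounds memory while still streaming the file once. A sketch (the file name is a placeholder):

from collections import deque

# keep only the newest 50,000 lines while streaming the whole file once
with open('logfile') as f:
    last_lines = deque(f, maxlen=50000)
# iterate last_lines with the same Start/End section logic as before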
Another possibility is to use seek() to skip over most of the file. However, this requires some idea of how many bytes the last 50K lines might occupy. Instead of reading through all the early lines, jump close to the end:
with ... as f:
    f.seek(-50000 * 80, 2)  # 2 = os.SEEK_END; needs binary mode, assumes ~80 bytes/line
    # insert your processing here
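Fleshed out, a sketch under the stated assumption of roughly 80 bytes per line (the file name is a placeholder; tune the guess to your data):

import os

with open('logfile', 'rb') as f:              # binary mode permits seeking from the end
    try:
        f.seek(-50000 * 80, os.SEEK_END)      # jump to roughly 50,000 lines before EOF
        f.readline()                          # discard the probably-partial first line
    except OSError:
        f.seek(0)                             # file smaller than the guess: read it all
    for raw in f:
        line = raw.decode(errors='replace')
        # hand 'line' to the same Start/End section logic as before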

Read File lines and characters

I have an input file which looks like this
some data...
some data...
some data...
...
some data...
<binary size="2358" width="32" height="24">
data of size 2358 bytes
</binary>
some data...
some data...
The value 2358 in the binary size attribute can change from file to file.
Now I want to extract those 2358 bytes of data (the count is a variable) and write them to another file.
I wrote the following code for this, but it gives me an error; I am not able to extract the 2358 bytes of binary data and write them to another file.
c = responseFile.read(1)
ValueError: Mixing iteration and read methods would lose data
The code is:
import re

outputFile = open('output', 'w')
inputFile = open('input.txt', 'r')
fileSize = 0
width = 0
height = 0
for line in inputFile:
    if "<binary size" in line:
        x = re.findall('\w+', line)
        fileSize = int(x[2])
        width = int(x[4])
        height = int(x[6])
        break
print x
# Here the file will point to the start location of the 2358 bytes.
for i in range(0, fileSize, 1):
    c = inputFile.read(1)
    outputFile.write(c)
outputFile.close()
inputFile.close()
Final answer to my question:
#!/usr/local/bin/python
import os

inputFile = open('input', 'r')
outputFile = open('output', 'w')
flag = False
for line in inputFile:
    if line.startswith("<binary size"):
        print 'Start of Data'
        flag = True
    elif line.startswith("</binary>"):
        flag = False
        print 'End of Data'
    elif flag:
        outputFile.write(line)
inputFile.close()
outputFile.close()

# Delete the last extra newline character from the output.
size = os.path.getsize('output')
outputFile = open('output', 'ab')
outputFile.truncate(size - 1)
outputFile.close()
How about a different approach? In pseudo-code:
for each line in input file:
    if line starts with binary tag: set output flag to True
    if line starts with binary-termination tag: set output flag to False
    if output flag is True: copy line to the output file
And in real code:
outputFile = open('./output', 'w')
inputFile = open('./input.txt', 'r')
flag = False
for line in inputFile:
    if line.startswith("<binary size"):
        flag = True
    elif line.startswith("</binary>"):
        flag = False
    elif flag:
        outputFile.write(line[:-1])  # remove newline
outputFile.close()
inputFile.close()
Try changing your first loop to something like this:
while True:
    line = inputFile.readline()
    # continue the loop as it was
This gets rid of iteration and only leaves read methods, so the problem should disappear.
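Applied to the question's code, the first loop might look like this (a sketch; only the loop construct changes):

import re

while True:
    line = inputFile.readline()
    if not line:  # empty string means end of file
        break
    if "<binary size" in line:
        x = re.findall('\w+', line)
        fileSize = int(x[2])
        break
# inputFile.read(fileSize) is now safe: no iteration was used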
Consider this method:
import re

line = '<binary size="2358" width="32" height="24">'
m = re.search('size="(\d*)"', line)
print m.group(1)  # 2358
This differs from your code, so it's not a drop-in replacement, but it demonstrates different regular-expression functionality.
It uses Python's regex group-capturing features and is much more robust than your string-splitting method.
For example, consider what would happen if the attributes were re-ordered:
<binary width="32" size="2358" height="24">
instead of
<binary size="2358" width="32" height="24">
Would your code still work? Mine would. :-)
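For instance, the same search finds the size in either ordering:

import re

for tag in ('<binary size="2358" width="32" height="24">',
            '<binary width="32" size="2358" height="24">'):
    print(re.search(r'size="(\d+)"', tag).group(1))  # prints 2358 both times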
Edit: To answer your question:
If you want to read n bytes of data from the beginning of a file, you could do something like
bytes = ifile.read(n)
Note that you may get less than n bytes if the input file is not long enough.
If you don't want to start from the "0th" byte, but some other byte, use seek() first, as in:
ifile.seek(9)
bytes = ifile.read(5)
Which would give you the bytes at offsets 9 through 13, i.e. the 10th through the 14th bytes.
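Putting the pieces together, a sketch that locates the tag and then reads the payload in a single call (binary mode keeps byte counts exact; file names as in the question):

import re

with open('input.txt', 'rb') as ifile, open('output', 'wb') as ofile:
    # iter(readline, b'') avoids mixing file iteration with .read()
    for line in iter(ifile.readline, b''):
        m = re.search(br'size="(\d+)"', line)
        if m:
            ofile.write(ifile.read(int(m.group(1))))  # the payload, in one read
            break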
