Memory Error Python Processing Large File Line by Line - python

I am trying to concatenate model output files. The model run was broken up into 5 parts, and each output file corresponds to one of those partial runs; due to the way the software writes its output, the timestep labels restart from 0 in each file. I wrote some code to:
1) concatenate all the output files together
2) edit the merged file to re-label all timesteps, starting at 0 and increasing by a fixed increment at each one.
The aim is that I can load this single file into my visualization software in one chunk, rather than opening 5 different windows.
So far my code throws a memory error due to the large files I am dealing with.
I have a few ideas for how I could get rid of it, but I'm not sure which would work and which might slow things down to a crawl.
Code so far:
import os
import time

start_time = time.time()

#create new txt file in same folder as python script
open("domain.txt","w").close()

"""create concatenated document of all tecplot output files"""
#look into file number 1
for folder in range(1,6,1):
    folder = str(folder)
    for name in os.listdir(folder):
        if "domain" in name:
            with open(folder+'/'+name) as file_content_list:
                start = ""
                for line in file_content_list:
                    start = start + line  # + '\n'
                with open('domain.txt','a') as f:
                    f.write(start)
                    # print start
#identify file with "domain" in name
#extract contents
#append to the end of the new document with "domain" in folder level above
#once completed, add 1 to the file number previously searched and do again
#keep going until no more files with a higher number exist

""" replace the old timesteps with new timesteps """
#open the file named domain.txt
#Look for lines:
##ZONE T="0.000000000000e+00s", N=87715, E=173528, F=FEPOINT, ET=QUADRILATERAL
##STRANDID=1, SOLUTIONTIME=0.000000000000e+00
# if they are found, edit them; otherwise copy the line without alteration
with open("domain.txt", "r") as combined_output:
    start = ""
    start_timestep = 0
    time_increment = 3.154e10
    for line in combined_output:
        if "ZONE" in line:
            start = start + 'ZONE T="' + str(start_timestep) + 's", N=87715, E=173528, F=FEPOINT, ET=QUADRILATERAL' + '\n'
        elif "STRANDID" in line:
            start = start + 'STRANDID=1, SOLUTIONTIME=' + str(start_timestep) + '\n'
            start_timestep = start_timestep + time_increment
        else:
            start = start + line
    with open('domain_final.txt','w') as f:
        f.write(start)

end_time = time.time()
print 'runtime : ', end_time - start_time

os.remove("domain.txt")
So far, I get the memory error at the concatenation stage.
To improve I could:
1) Try to do the corrections on the fly as I read each file, but since it's already failing to get through a single file, I don't think that would make much difference beyond computing time.
2) Load each file into an array, turn the checks into a function, and run that function on the array:
Something like:
def do_correction(line):
    if "ZONE" in line:
        return 'ZONE T="' + str(start_timestep) + 's", N=87715, E=173528, F=FEPOINT, ET=QUADRILATERAL' + '\n'
    elif "STRANDID" in line:
        return 'STRANDID=1, SOLUTIONTIME=' + str(start_timestep) + '\n'
    else:
        return line
3) Keep it as is, but ask Python to indicate when it is about to run out of memory and write to the file at that point. Does anyone know if that is possible?
Thank you for your help

It is not necessary to read the entire contents of each file into memory before writing to the output file. Large files will simply consume much, and possibly all, of the available memory.
Simply read and write one line at a time. Also, open the output file only once, and choose a name that will not itself be picked up and treated as an input file, otherwise you run the risk of concatenating the output file onto itself (not yet a problem, but it could be if you also process files from the current directory), assuming loading it doesn't already consume all memory.
import os.path

with open('output.txt', 'w') as outfile:
    for folder in range(1, 6):
        for name in os.listdir(str(folder)):
            if "domain" in name:
                with open(os.path.join(str(folder), name)) as file_content_list:
                    for line in file_content_list:
                        # perform corrections/modifications to line here
                        outfile.write(line)
Now you can process the data in a line oriented manner - just modify it before writing to the output file.
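For completeness, here is a rough sketch of how the asker's ZONE/STRANDID relabelling could be folded into that single pass; the header constants and the 3.154e10 increment are copied from the question, and the whole thing should be treated as illustrative rather than tested against real Tecplot output:

import os
import os.path

start_timestep = 0
time_increment = 3.154e10  # increment taken from the question

with open('domain_final.txt', 'w') as outfile:
    for folder in range(1, 6):
        for name in os.listdir(str(folder)):
            if "domain" not in name:
                continue
            with open(os.path.join(str(folder), name)) as infile:
                for line in infile:
                    if "ZONE" in line:
                        # rewrite the zone header with the running timestep
                        line = 'ZONE T="' + str(start_timestep) + 's", N=87715, E=173528, F=FEPOINT, ET=QUADRILATERAL\n'
                    elif "STRANDID" in line:
                        line = 'STRANDID=1, SOLUTIONTIME=' + str(start_timestep) + '\n'
                        start_timestep += time_increment
                    outfile.write(line)

This keeps memory usage down to one line at a time and removes the need for the intermediate domain.txt file entirely.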

Related

Is there a way to specify a range of files for Python to loop over in a given directory?

I have a script that iterates over the files in a directory to convert them from one format to another. Unfortunately, I did not take into consideration the possibility of losing the connection to the network drive where the files reside, which terminated my script. In the event of an error, and as a way to keep track of how far into the directory the script was, I had the program display the last file it read in. I would like to begin at the file where the script stopped instead of starting all the way back at the beginning.
Below is my original script. This script converts from DBF format to CSV.
import os
from dbfread import DBF
import pandas as pd

directory = 'Directory containing files'

for file in os.listdir(directory):
    if file.startswith('File_Prefix') and file.endswith('.DBF'):
        file_path = os.path.join(directory, file)
        print(f'\nReading in {file}...')
        dbf = DBF(file_path)
        dbf.encoding = 'utf-8'
        dbf.char_decode_errors = 'ignore'
        print('\nConverting to DataFrame...')
        df = pd.DataFrame(iter(dbf))
        df.columns.astype(str)
        print(df)
        print('\nWriting to CSV...')
        dest_directory = 'Destination_Directory\\%s.csv' % (File_Prefix + file.strip('.DBF'))
        df.to_csv(dest_directory, index = False)
        print(f'\nConverted {file} to CSV. Moving to next file...')
    elif file.startswith(Another_File_Prefix) and file.endswith('.DBF'):
        print('File not needed.')
        continue
    elif file.endswith('.FPT'):
        print('Skipping FPT file.')
        continue
    elif file.startswith('Another_file_prefix') and file.endswith('.DB~'):
        print('All files converted to CSV.')
        break
    else:
        print('\nFile not found or error.')
        print(f'Last file read in was {file}.')
What could I modify so that I can specify the last file read in and start from there, while ignoring the previously converted files? The names of the files in the directory are rather vague, just a letter and a number that increases as you travel down through the directory (e.g. 'A0001.DBF', 'A0002.DBF', 'A0003.DBF', etc.).
My initial solution was to assign the last file to a variable and then modify my "if" statement.
start_file = last_file_read_in

for file in os.listdir(directory):
    if file == start_file:
        #run conversion code
        #continue iterating through each file starting from this point
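One way to make that idea concrete, sketched below under the assumption that names like 'A0001.DBF' sort in processing order: sort the listing and flip a flag once the last-read file is reached. The names last_file_read_in and convert() are stand-ins for the asker's own variable and conversion code.

import os

directory = 'Directory containing files'
last_file_read_in = 'A0002.DBF'   # hypothetical: the file the script stopped on

def convert(file_path):
    # placeholder for the DBF -> CSV conversion shown above
    pass

resume = False
for file in sorted(os.listdir(directory)):
    if not resume:
        if file == last_file_read_in:
            resume = True        # re-process the file that failed, then keep going
        else:
            continue             # skip files that were already converted
    if file.endswith('.DBF'):
        convert(os.path.join(directory, file))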

Given a filename, go to the next file in a directory

I am writing a method that takes a filename and a path to a directory and returns the next available filename in the directory, or None if there are no files with names that would sort after the given file.
There are plenty of questions about how to list all the files in a directory or iterate over them, but I am not sure whether the best solution for finding a single next filename is to take the list that one of those answers generates, find the location of the current file in it, and choose the next element (or None if we're already on the last one).
EDIT: here's my current file-picking code. It's reused from a different part of the project, where it is used to pick a random image from a potentially nested series of directories.
import os
import random

# picks a file from a directory
# if the file is also a directory, pick a file from the new directory
# this might choke up if it encounters a directory only containing invalid files
def pickNestedFile(directory, bad_files):
    file=None
    while file is None or file in bad_files:
        file=random.choice(os.listdir(directory))
        #file=directory+file # use the full path name
        print "Trying "+file
    if os.path.isdir(os.path.join(directory, file))==True:
        print "It's a directory!"
        return pickNestedFile(directory+"/"+file, bad_files)
    else:
        return directory+"/"+file
The program I am using this in takes a folder of chat logs and picks a random log, a starting position, and a length. These are then processed into a MOTD-like series of (typically) short log snippets. What I need the next-file ability for is when the length is unusually long or the starting line is at the end of the file, so that the snippet continues at the top of the next file (a.k.a. wrapping around midnight).
I am open to the idea of using a different method to choose the file, since the method above does not discretely give a separate filename and directory, and I'd have to use a listdir and match to get an index anyway.
You should probably consider rewriting your program to not have to use this. But this would be how you could do it:
import os

def nextFile(filename,directory):
    fileList = os.listdir(directory)
    nextIndex = fileList.index(filename) + 1
    if nextIndex == 0 or nextIndex == len(fileList):
        return None
    return fileList[nextIndex]

print(nextFile("mail","test"))
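One caveat worth noting: os.listdir() returns names in arbitrary order, while the question asks for the file that sorts after the current one, so a sorted variant along the following lines may be closer to what is wanted (a sketch, not a drop-in replacement):

import os

def next_file_sorted(filename, directory):
    # sort so that "next" means the next name in lexicographic order
    file_list = sorted(os.listdir(directory))
    next_index = file_list.index(filename) + 1
    if next_index == len(file_list):
        return None   # already at the last file
    return file_list[next_index]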
I tweaked the accepted answer to allow new files to be added to the directory on the fly and for it to work if a file is deleted or changed or doesn't exist. There are better ways to work with filenames/paths, but the example below keeps it simple. Maybe it's helpful:
import os

def next_file_in_dir(directory, current_file=None):
    file_list = os.listdir(directory)
    next_index = 0
    if current_file in file_list:
        next_index = file_list.index(current_file) + 1
        if next_index >= len(file_list):
            next_index = 0
    return file_list[next_index]

file_name = None
directory = "videos"
user_advanced_to_next = True
while user_advanced_to_next:
    file_name = next_file_in_dir(directory=directory, current_file=file_name)
    user_advanced_to_next = play_video("{}/{}".format(directory, file_name))
finish_and_clean_up()

Python - Duplicate File Finder using defaultdict

I'm experimenting with different ways to identify duplicate files based on file content, by looping through a top-level directory where folders A-Z exist. Within folders A-Z there is one additional layer of folders named after the current date. Finally, within the dated folders there are between several thousand and several million (<3 million) files in various formats.
Using the script below I was able to process roughly 800,000 files in about 4 hours. However, when I run it over a larger data set of roughly 13,000,000 files in total, it consistently breaks on the letter "I" folder, which contains roughly 1.5 million files.
Given the size of the data I'm dealing with, I'm considering outputting the content directly to a text file and then importing it into MySQL or something similar for further processing. Please let me know if I'm going down the right track, or if you feel a modified version of the script below should be able to handle 13+ million files.
Question - How can I modify the script below to handle 13+ million files?
Error traceback:
Traceback (most recent call last):
  File "C:/Users/"user"/PycharmProjects/untitled/dups.py", line 28, in <module>
    for subdir, dirs, files in os.walk(path):
  File "C:\Python34\lib\os.py", line 379, in walk
    yield from walk(new_path, topdown, onerror, followlinks)
  File "C:\Python34\lib\os.py", line 372, in walk
    nondirs.append(name)
MemoryError
my code:
import hashlib
import os
import datetime
from collections import defaultdict

def hash(filepath):
    hash = hashlib.md5()
    blockSize = 65536
    with open(filepath, 'rb') as fpath:
        block = fpath.read(blockSize)
        while len(block) > 0:
            hash.update(block)
            block = fpath.read(blockSize)
    return hash.hexdigest()

directory = "\\\\path\\to\\files\\"
directories = [name for name in os.listdir(directory) if os.path.isdir(os.path.join(directory, name))]
outFile = open("\\path\\output.txt", "w", encoding='utf8')

for folder in directories:
    sizeList = defaultdict(list)
    path = directory + folder
    print("Start time: " + str(datetime.datetime.now()))
    print("Working on folder: " + folder)
    # Walk through one level of directories
    for subdir, dirs, files in os.walk(path):
        for file in files:
            filePath = os.path.join(subdir, file)
            sizeList[os.stat(filePath).st_size].append(filePath)
    print("Hashing " + str(len(sizeList)) + " Files")
    ## Hash remaining files
    fileList = defaultdict(list)
    for fileSize in sizeList.values():
        if len(fileSize) > 1:
            for dupSize in fileSize:
                fileList[hash(dupSize)].append(dupSize)
    ## Write remaining hashed files to file
    print("Writing Output")
    for fileHash in fileList.values():
        if len(fileHash) > 1:
            for hashOut in fileHash:
                outFile.write(hashOut + " ~ " + str(os.stat(hashOut).st_size) + '\n')
            outFile.write('\n')

outFile.close()
print("End time: " + str(datetime.datetime.now()))
Disclaimer: I don't know if this is a solution.
I looked at your code, and I realized the error is provoked by .walk. Now it's true that this might be because of too much info being processed (so maybe an external DB would help matters, though the added operations might hinder your speed). But other than that, .listdir (which is called by .walk) is really terrible when you handle a huge amount of files. Hopefully this is resolved in Python 3.5, because it implements the much better scandir, so if you're willing* to try the latest (and I do mean latest, it was released, what, 8 days ago?), that might help.
Other than that, you can try tracing bottlenecks and garbage collection to maybe figure it out.
*you can also just install it with pip using your current python, but where's the fun in that?
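For illustration, a minimal sketch of the scandir-based idea on Python 3.5+ (where os.walk itself was reimplemented on top of os.scandir, so simply upgrading may already help); it streams DirEntry objects instead of materializing whole directory listings:

import os

def iter_files(root):
    # yield (path, size) for every regular file under root,
    # without building large name lists in memory
    for entry in os.scandir(root):
        if entry.is_dir(follow_symlinks=False):
            yield from iter_files(entry.path)
        elif entry.is_file(follow_symlinks=False):
            # on Windows, DirEntry.stat() is usually served from the scan itself
            yield entry.path, entry.stat().st_size

# usage in place of the os.walk() loop above:
# for filePath, size in iter_files(path):
#     sizeList[size].append(filePath)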

How to go through files in a directory and delete them depending on if the "time" column is less than local time

I am very new to Python, and right now I am trying to go through a directory of 8 "train" text files, each of which contains a line such as: Paris - London 19.10
What I want to do is write some code (probably some sort of for loop) to automatically go through the files and delete those in which the time column is less than the local time, i.e. where the train has already left. I want this to happen as soon as I start my code. What I have managed so far is to make this happen only when the user gives an input to open a particular file; I cannot make it happen without any user input.
def read_from_file(textfile):
    try:
        infile = open(textfile + '.txt', 'r')
        infotrain = infile.readline().rstrip().split(' ')
        localtime = time.asctime(time.localtime(time.time()))
        localtime = localtime.split(' ')
        if infotrain[2] < localtime[3]:
            os.remove(textfile + '.txt')
            print('This train has left the station.')
            return None, None
    except:
        pass
(Be aware that this is not the whole function as it is very long and contains code that does not relate to my question)
Does anyone have a solution?
os.listdir() gives you all of the file names in a directory.
import os

file_names = os.listdir(".")
for fname in file_names:
    # do your time checking stuff here
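Tying that to the asker's function, a minimal sketch might look like the following. It mirrors the question's own logic (the departure time is taken from the split first line and compared as a string against the local time), so the field indices and comparison are assumptions inherited from that code rather than verified against the real file format:

import os
import time

def train_has_left(path):
    # read the first line, e.g. "Paris - London 19.10"
    with open(path, 'r') as infile:
        infotrain = infile.readline().rstrip().split(' ')
    localtime = time.asctime(time.localtime(time.time())).split(' ')
    # same string comparison as in the question's read_from_file()
    return infotrain[2] < localtime[3]

for fname in os.listdir('.'):
    if fname.endswith('.txt') and train_has_left(fname):
        os.remove(fname)
        print('This train has left the station:', fname)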

problem when re-running program

I have a fairly simple Python loop that calls a few functions and writes the output to a file. To do this it creates a folder and saves the file in this folder.
When I run the program the first time with a unique file name, it runs fine. However, if I try to run it again, it will not work, and I do not understand why. I am quite certain that it is not a problem of overwriting the file, as I delete the folder before re-running, and this is the only place that the file is stored. Is there a concept that I am misunderstanding?
The problematic file is 'buff1.shp'. I am using Python 2.5 to run some analysis in ArcGIS
Thanks for any advice (including suggestions about how to improve my coding style). One other note is that my loops currently only use one value as I am testing this at the moment.
# Import system modules
import sys, string, os, arcgisscripting, shutil

# Create the Geoprocessor object
gp = arcgisscripting.create()

# Load required toolboxes...
gp.AddToolbox("C:/Program Files/ArcGIS/ArcToolbox/Toolboxes/Spatial Statistics Tools.tbx")
gp.AddToolbox("C:/Program Files/ArcGIS/ArcToolbox/Toolboxes/Analysis Tools.tbx")

# specify workspace
gp.Workspace = "C:/LEED/Cities_20_Oct/services"
path = "C:\\LEED\\Cities_20_Oct\\services\\"
results = 'results\\'

os.mkdir( path + results )
newpath = path + results

# Loop through each file (0 -> 20)
for j in range(0,1):
    in_file = "ser" + str(j) + ".shp"
    in_file_2 = "ser" + str(j) + "_c.shp"
    print "Analyzing " + str(in_file) + " and " + str(in_file_2)

    #Loop through a range of buffers - in this case, 1,2
    for i in range(1,2):
        print "Buffering....."

        # Local variables...
        center_services = in_file_2
        buffer_shp = newpath + "buff" + str(i) + ".shp"
        points = in_file_2
        buffered_analysis_count_shp = newpath + "buffered_analysis_count.shp"
        count_txt = newpath + "count.txt"

        # Buffer size
        b_size = 1000 + 1000 * i
        b_size_input = str(b_size) + ' METERS'
        print "Buffer:" + b_size_input + "\n"

        # Process: Buffer...
        gp.Buffer_analysis(center_services, buffer_shp, b_size_input, "FULL", "ROUND", "ALL", "")

print "over"
(To clarify this question I edited a few parts that did not make sense without the rest of the code. The error still remains in the program.)
Error message:
ExecuteError: ERROR 000210: Cannot create output C:\LEED\Cities_20_Oct\services\results\buff1.shp Failed to execute (Buffer).
I can't see how the file name in the error message blahblah\buff1.shp can arise from your code.
for i in range(0,1):
    buffer_shp = newpath + "buff" + str(i) + ".shp"
    gp.Buffer_analysis(center_services, buffer_shp, etc etc)
should produce blahblah\buff0.shp not blahblah\buff1.shp... I strongly suggest that the code you display should be the code that you actually ran. Throw in a print statement just before the gp.Buffer_analysis() call to show the value of i and repr(buffer_shp). Show all print results.
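For instance (Python 2 print syntax, to match the question's code), a hypothetical diagnostic just before the call might be:

print "i =", i, "buffer_shp =", repr(buffer_shp)
gp.Buffer_analysis(center_services, buffer_shp, b_size_input, "FULL", "ROUND", "ALL", "")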
Also the comment #Loop through a range of buffers (1 ->100) indicates you want to start at 1, not 0. It helps (you) greatly if the comments match the code.
Don't repeat yourself; instead of
os.mkdir( path + results )
newpath = path + results
do this:
newpath = path + results # using os.path.join() is even better
os.mkdir(newpath)
You might like to get into the habit of constructing all paths using os.path.join().
You need to take the call to os.mkdir() outside the loops i.e. do it once per run of the script, not once each time round the inner loop.
The results of these statements are not used:
buffered_analysis_count_shp = newpath + "buffered_analysis_count.shp"
count_txt = newpath + "count.txt"
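As a rough illustration of the os.path.join() advice, keeping the question's variable names (and untested against ArcGIS):

import os

path = "C:\\LEED\\Cities_20_Oct\\services"
newpath = os.path.join(path, "results")
if not os.path.isdir(newpath):
    os.mkdir(newpath)   # create the results folder once, before the loops

# inside the inner loop:
# buffer_shp = os.path.join(newpath, "buff" + str(i) + ".shp")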
Update
Googling with the first few words in your error message (always a good idea!) brings up this: troubleshooting geoprocessing errors which provides the following information:
geoprocessing errors that occur when reading or writing ArcSDE/DBMS data receive a generic 'catch-all' error message, such as error 00210 when writing output
This goes on to suggest some ways of determining what your exact problem is. If that doesn't help you, you might like to try asking in the relevant ESRI forum or on GIS StackExchange.
I see this is a 3 year old posting, but for others I will add:
When I generate Python scripts to work with Arc, I always include this right after my imports:
arcpy.env.overwriteOutput = True  # This allows the script to overwrite files.
Also, you mentioned that you delete your "folder"? That would be part of your directory, and I do not see where you are creating a directory in the script. You would want to clear the folder, not delete it (though maybe you meant that you delete the file).
JJH
I'd be tempted to look again at
path = "C:\LEED\Cities_20_Oct\services\"
Surely you want double front slashes, not double back slashes?
