I have some questions related to setting a maximum running time in Python. I would like to use pdfminer to convert PDF files to .txt. The problem is that very often some files are impossible to decode and take an extremely long time, so I want to use time.time() to limit the conversion time for each file to 20 seconds. In addition, I run under Windows, so I cannot use the signal module.
I succeeded in running the conversion code with pdfminer.convert_pdf_to_txt() (imported as "c" in my code), but I could not integrate time.time() into the while loop. It seems to me that in the following code, the while loop and time.time() do not work.
In summary, I want to:
Convert the PDF file to a .txt file
The time limit for each conversion is 20 seconds. If it runs out of time, throw an exception and save an empty file
Save all the txt files under the same folder
If there are any exceptions/errors, still save the file, but with empty content.
Here is the current code:
import converter as c
import os
import timeit
import time

yourpath = 'D:/hh/'
for root, dirs, files in os.walk(yourpath, topdown=False):
    for name in files:
        t_end = time.time() + 20
        try:
            while time.time() < t_end:
                c.convert_pdf_to_txt(os.path.join(root, name))
                t = os.path.split(os.path.dirname(os.path.join(root, name)))[1]
                a = str(os.path.split(os.path.dirname(os.path.join(root, name)))[0])
                g = str(a.split("\\")[1])
                with open("D:/f/" + g + "&" + t + "&" + name + ".txt", mode="w") as newfile:
                    newfile.write(c.convert_pdf_to_txt(os.path.join(root, name)))
                print "yes"
                if time.time() > t_end:
                    print "no"
                    with open("D:/f/" + g + "&" + t + "&" + name + ".txt", mode="w") as newfile:
                        newfile.write("")
        except KeyboardInterrupt:
            raise
        except:
            for name in files:
                t = os.path.split(os.path.dirname(os.path.join(root, name)))[1]
                a = str(os.path.split(os.path.dirname(os.path.join(root, name)))[0])
                g = str(a.split("\\")[1])
                with open("D:/f/" + g + "&" + t + "&" + name + ".txt", mode="w") as newfile:
                    newfile.write("")
You have the wrong approach. You define the end time and then enter the while loop whenever the current timestamp is lower than the end timestamp, which at that point is always true. So the loop is entered and you get stuck inside the conversion function: the time check can only happen between iterations, never during one.
On other platforms I would suggest the signal module, which is already included in Python and lets you abort a function after n seconds, but since you are on Windows you can use a threading.Timer that calls thread.interrupt_main instead. A basic example can be seen in this Stack Overflow answer.
Your code would be like this:
import converter as c
import os
import timeit
import time
import threading
import thread

yourpath = 'D:/hh/'
for root, dirs, files in os.walk(yourpath, topdown=False):
    for name in files:
        # build the output path first so every branch below can use it
        t = os.path.split(os.path.dirname(os.path.join(root, name)))[1]
        a = str(os.path.split(os.path.dirname(os.path.join(root, name)))[0])
        g = str(a.split("\\")[1])
        outname = "D:/f/" + g + "&" + t + "&" + name + ".txt"
        # raise KeyboardInterrupt in the main thread after 20 seconds
        timer = threading.Timer(20.0, thread.interrupt_main)
        timer.start()
        try:
            text = c.convert_pdf_to_txt(os.path.join(root, name))
        except KeyboardInterrupt:
            # timed out: save an empty file, as required
            print("no")
            with open(outname, mode="w") as newfile:
                newfile.write("")
        except:
            # any other conversion error: also save an empty file
            with open(outname, mode="w") as newfile:
                newfile.write("")
        else:
            print("yes")
            with open(outname, mode="w") as newfile:
                newfile.write(text)
        finally:
            timer.cancel()  # harmless if the timer already fired
Just for the future: Four spaces indentation and not too much whitespace ;)
I've written a simple python script to search for a log file in a folder (which has approx. 4 million files) and read the file.
Currently, the average time taken for the entire operation is 20 seconds. I was wondering if there is a way to get the response faster.
Below is my script
import re
import os
import timeit
from datetime import date

log_path = "D:\\Logs Folder\\"
rx_file_name = r"[0-9a-z]{8}-[0-9a-z]{4}-[0-9a-z]{4}-[0-9a-z]{4}-[0-9a-z]{12}"
log_search_script = True
today = str(date.today())

while log_search_script:
    try:
        log_search = input("Enter image file name: ")
        file_name = re.search(rx_file_name, log_search).group()
        log_file_name = str(file_name) + ".log"
        print(f"\nLooking for log file '{log_file_name}'...\n")
        pass
    except:
        print("\n ***** Invalid input. Try again! ***** \n")
        continue
    start = timeit.default_timer()
    if log_file_name in os.listdir(log_path):
        log_file = open(log_path + "\\" + log_file_name, 'r', encoding="utf8")
        print('\n' + "--------------------------------------------------------" + '\n')
        print(log_file.read())
        log_file.close()
        print('\n' + "--------------------------------------------------------" + '\n')
        print("Time Taken: " + str(timeit.default_timer() - start) + " seconds")
        print('\n' + "--------------------------------------------------------" + '\n')
    else:
        print("Log File Not Found")
    search_again = input('\nDo you want to search for another log ("y" / "n") ?').lower()
    if search_again[0] == 'y':
        print("======================================================\n\n")
        continue
    else:
        log_search_script = False
Your problem is the line:
if log_file_name in os.listdir(log_path):
This has two problems:
os.listdir will create a huge list which can take a lot of time (and space...).
the ... in ... part will now go over that huge list linearly and search for the file.
Instead, let your OS do the hard work and "ask for forgiveness, not permission". Just assume the file is there and try to open it. If it is not actually there, an error will be raised, which we will catch:
try:
    with open(log_path + "\\" + log_file_name, 'r', encoding="utf8") as log_file:
        print(log_file.read())
except FileNotFoundError:
    print("Log File Not Found")
You can use glob.
import glob
print(glob.glob(directory_path))
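Applied to this question, a minimal sketch (reusing log_path and log_file_name from the code above) could ask glob for the one expected file; with no wildcard characters in the pattern, glob just checks for that single path instead of scanning the 4-million-entry listing:

import glob
import os

# Sketch only: look up the single expected file rather than
# listing the whole directory.
matches = glob.glob(os.path.join(log_path, log_file_name))
if matches:
    with open(matches[0], 'r', encoding="utf8") as log_file:
        print(log_file.read())
else:
    print("Log File Not Found")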
Good day.
I wrote a little Python program to help me easily create .cbc files for Calibre, which is just a renamed .zip file with a text file called comics.txt for TOC purposes. Each chapter is another zip file.
The issue is that the last zip file zipped always has the error "Unexpected end of data". The file itself is not corrupt: if I unzip it and rezip it, it works perfectly. Playing around, it seems the problem is that Python doesn't close the last zip file after zipping it, since I can't delete the last zip while the program is still running (it's still open in Python). Needless to say, Calibre doesn't like the file and fails to convert it unless I manually rezip the affected chapters.
The code is as follows: it checks the folders for non-image files, zips the folders, zips the zips while creating the text file, and "changes" the extension.
import re, glob, os, zipfile, shutil, pathlib, gzip, itertools

Folders = glob.glob("*/")
items = len(Folders)
cn_list = []
cn_list_filtered = []
dirs_filtered = []
ch_id = ["c", "Ch. "]
subdir_im = []
total = 0

Dirs = next(os.walk('.'))[1]
for i in range(0, len(Dirs)):
    for items in os.listdir("./" + Dirs[i]):
        if items.__contains__('.png') or items.__contains__('.jpg'):
            total += 1
        else:
            print(items + " not an accepted format.")
    subdir_im.append(total)
    total = 0

for fname in Folders:
    if re.search(ch_id[0] + r'\d+' + r'[\S]' + r'\d+', fname):
        cn = re.findall(ch_id[0] + "(\d+[\S]\d+)", fname)[0]
        cn_list.append(cn)
    elif re.search(ch_id[0] + r'\d+', fname):
        cn = re.findall(ch_id[0] + "(\d+)", fname)[0]
        cn_list.append(cn)
    elif re.search(ch_id[1] + r'\d+' + '[\S]' + r'\d+', fname):
        cn = re.findall(ch_id[1] + "(\d+[\S]\d+)", fname)[0]
        cn_list.append(cn)
    elif re.search(ch_id[1] + r'\d+', fname):
        cn = re.findall(ch_id[1] + "(\d+)", fname)[0]
        cn_list.append(cn)
    else:
        print('Warning: File found without proper filename format.')

cn_list_filtered = set(cn_list)
cn_list_filtered = sorted(cn_list_filtered)
cwd = os.getcwd()
Dirs = Folders
subdir_zi = []
total = 0

for i in range(0, len(cn_list_filtered)):
    for folders in Dirs:
        if folders.__contains__(ch_id[0] + cn_list_filtered[i] + " ") \
                or folders.__contains__(ch_id[1] + cn_list_filtered[i] + " "):
            print('Zipping folder ', folders)
            namezip = "Chapter " + cn_list_filtered[i] + ".zip"
            current_zip = zipfile.ZipFile(namezip, "a")
            for items in os.listdir(folders):
                if items.__contains__('.png') or items.__contains__('.jpg'):
                    current_zip.write(folders + "/" + items, items)
                    total += 1
    subdir_zi.append(total)
    total = 0

print('Folder contents in order:', subdir_im, ' Total:', sum(subdir_im))
print("Number of items per zip: ", subdir_zi, ' Total:', sum(subdir_zi))
if subdir_im == subdir_zi:
    print("All items in folders have been successfully zipped")
else:
    print("Warning: File count in folders and zips do not match. Please check the affected chapters")

zips = glob.glob("*.zip")
namezip2 = os.path.basename(os.getcwd()) + ".zip"
zipfinal = zipfile.ZipFile(namezip2, "a")
for i in range(0, len(zips), 1):
    zipfinal.write(zips[i], zips[i])

Data = []
for i in range(0, len(cn_list_filtered), 1):
    Datai = ("Chapter " + cn_list_filtered[i] + ".zip" + ":Chapter " + cn_list_filtered[i] + "\r\n")
    Data.append(Datai)
Dataok = ''.join(Data)

with zipfile.ZipFile(namezip2, 'a') as myzip:
    myzip.writestr("comics.txt", Dataok)

zipfinal.close()
os.rename(namezip2, namezip2 + ".cbc")
os.system("pause")
I am by no means a programmer; this is just Frankenstein-monster code I eventually managed to put together by reading threads, but this last issue has me stumped.
Some solutions I tried are:
for i in range(0, len(zips), 1):
    zipfinal.write(zips[i], zips[i])
    zips[i].close()
Fails with:
zips[i].close()
AttributeError: 'str' object has no attribute 'close'
and:
for i in range(0, len(zips), 1):
    zipfinal.write(zips[i], zips[i])
zips[len(zips)].close()
Fails with:
zips[len(zips)].close()
IndexError: list index out of range
Thanks for the help.
This solved my issue:
import io
import zipfile

def generate_zip(file_list, file_name=None):
    zip_buffer = io.BytesIO()
    zf = zipfile.ZipFile(zip_buffer, mode="w", compression=zipfile.ZIP_DEFLATED)
    for file in file_list:
        print(f"Filename: {file[0]}\nData: {file[1]}")
        zf.writestr(file[0], file[1])
    zf.close()  # the crucial part: close the archive before reading the buffer
    with open(file_name, 'wb') as f:
        f.write(zip_buffer.getvalue())
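For what it's worth, the root cause in the posted script is the current_zip = zipfile.ZipFile(namezip, "a") handle that is never closed, so the last chapter archive never gets flushed to disk. A minimal sketch of that inner loop using a with-block (names taken from the question's code) closes every archive automatically:

# Sketch only: the with-block flushes and closes each chapter archive
# as soon as its files are written, even if an exception occurs.
with zipfile.ZipFile(namezip, "a") as current_zip:
    for items in os.listdir(folders):
        if items.__contains__('.png') or items.__contains__('.jpg'):
            current_zip.write(folders + "/" + items, items)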
I'm running a series of Python scripts from the command window via a batch file.
Previously, this worked without issue. However, now, without any change in code, every time a script reaches its end I get a "Python.exe has stopped working" error. The scripts have actually completed processing, but I need to close the error window for the batch to proceed.
I've tried adding sys.exit to the ends of the scripts, but that makes no difference. The first script has no issue, but every script after it has this error.
How do I stop this error from happening?
Batch File
C:\Path\to\Python\ArcGIS64bitversion C:\Path\to\Script1
C:\Path\to\Python\ArcGIS64bitversion C:\Path\to\Script2
C:\Path\to\Python\ArcGIS64bitversion C:\Path\to\Script3
C:\Path\to\Python\ArcGIS64bitversion C:\Path\to\Script4a
C:\Path\to\Python\ArcGIS64bitversion C:\Path\to\Script4b
C:\Path\to\Python\ArcGIS64bitversion C:\Path\to\Script4c
C:\Path\to\Python\ArcGIS64bitversion C:\Path\to\Script4d
C:\Path\to\Python\ArcGIS64bitversion C:\Path\to\Script5
C:\Path\to\Python\ArcGIS64bitversion C:\Path\to\Script6
The Python scripts do all actually complete. Scripts 2-5 all use multiprocessing; script 6 does not use multiprocessing, yet it still experiences the error.
General Script Structure
import statements
global variables
get data statements
def Code():
    try:
        code
        sys.exit
    except:
        print error in text file

def multiprocessing():
    pool = multiprocessing.Pool(32)
    pool.map(Code, listofData)

if __name__ == '__main__':
    try:
        code
        multiprocessing()
        sys.exit
    except:
        print error to text file
Script 2 (the first script to error)
import arcpy, fnmatch, os, shutil, sys, traceback
import multiprocessing
from time import strftime

#===========================================================================================
ras_dir = r'C:\Path\to\Input'
working_dir = r'C:\Path\to\Output'
output_dir = os.path.join(working_dir, 'Results')
if not os.path.isdir(output_dir):
    os.mkdir(output_dir)
#===========================================================================================
global input_files1
global raslist
global ras
raslist = []
input_files1 = []
#===========================================================================================
for r, d, f in os.walk(working_dir):
    for inFile in fnmatch.filter(f, '*.shp'):
        input_files1.append(os.path.join(r, inFile))
for r, d, f in os.walk(ras_dir):
    for rasf in fnmatch.filter(f, '*.tif'):
        raslist.append(os.path.join(r, rasf))
ras = raslist[0]
del rasf, raslist

def rasextract(file):
    arcpy.CheckOutExtension("Spatial")
    arcpy.env.overwriteOutput = True
    proj = file.split('.')
    proj = proj[0] + '.' + proj[1] + '.prj'
    arcpy.env.outputCoordinateSystem = arcpy.SpatialReference(proj)
    try:
        filename = str(file)
        filename = filename.split('\\')
        filename = filename[-1]
        filename = filename.split('.')
        filename = filename[0]
        tif_dir = output_dir + '\\' + filename
        os.mkdir(tif_dir)
        arcpy.env.workspace = tif_dir
        arcpy.env.scratchWorkspace = tif_dir
        dname = tif_dir + '\\' + filename + '_ras.tif'
        fname = working_dir + '\\' + filename + '_ras.tif'
        bufname = tif_dir + '\\' + filename + '_rasbuf.shp'
        arcpy.Buffer_analysis(file, bufname, "550 METERS", "FULL", "ROUND", "ALL")
        newras = arcpy.sa.ExtractByMask(ras, bufname)
        newras.save(dname)
        print "Saved " + filename + " ras"
        sys.exit
    except:
        var = traceback.format_exc()
        x = str(var)
        timecode = strftime("%a, %d %b %Y %H:%M:%S + 0000")
        logfile = open(r'C:\ErrorLogs\Log_Script2_rasEx.txt', "a+")
        ent = "\n"
        logfile.write(timecode + " " + x + ent)
        logfile.close()

def MCprocess():
    pool = multiprocessing.Pool(32)
    pool.map(rasextract, input_files1)

if __name__ == '__main__':
    try:
        arcpy.CheckOutExtension("Spatial")
        ras_dir = r'C:\Path\to\Input'
        working_dir = r'C:\Path\to\Output'
        output_dir = os.path.join(working_dir, 'Results')
        if not os.path.isdir(output_dir):
            os.mkdir(output_dir)
        #=============================================================
        raslist = []
        input_files1 = []
        #=============================================================
        for r, d, f in os.walk(working_dir):
            for inFile in fnmatch.filter(f, '*.shp'):
                input_files1.append(os.path.join(r, inFile))
        for r, d, f in os.walk(ras_dir):
            for rasf in fnmatch.filter(f, '*.tif'):
                raslist.append(os.path.join(r, rasf))
        ras = raslist[0]
        del rasf, raslist
        MCprocess()
        sys.exit
    except:
        var = traceback.format_exc()
        x = str(var)
        timecode = strftime("%a, %d %b %Y %H:%M:%S + 0000")
        logfile = open(r'C:\ErrorLogs\Log_Script2_rasEx.txt', "a+")
        ent = "\n"
        logfile.write(timecode + " " + x + ent)
        logfile.close()
NEW error message: this error was encountered after disabling Error Reporting. Windows is catching the error.
Try disabling 'Windows Error Reporting' in the Registry. After that, a traceback/error should be shown. Here you can find instructions on how to disable 'WER' for Windows 10.
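If you prefer to script the change, here is a minimal sketch using the standard winreg module; the key path and the "DontShowUI" value name are taken from Microsoft's WER documentation, so verify them for your Windows version before relying on this:

# Sketch: suppress the WER crash dialog for the current user.
# Key path and value name assumed from Microsoft's WER documentation.
import winreg

key = winreg.CreateKey(winreg.HKEY_CURRENT_USER,
                       r"Software\Microsoft\Windows\Windows Error Reporting")
winreg.SetValueEx(key, "DontShowUI", 0, winreg.REG_DWORD, 1)
winreg.CloseKey(key)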
Posting this as it is the only web search result I could find matching my error: a script that ran flawlessly in IDLE but threw the "Python has stopped working" error when called from a batch (.bat) file. Full disclosure: I was using shelve, not arcpy.
I think the issue is that you are somehow leaving files open, and when the script ends Python is forced to clean up the open files in an 'unplanned' fashion. Inside the IDE this is caught and handled, but from a batch file the issue bubbles up and produces the 'stopped working' dialog.
Contrast:

f = open("example.txt", "r")

with:

f = open("example.txt", "r")
f.close()

The first will error out from a bat file; the second will not.
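The usual way to guarantee the second behaviour is a with-block, which closes the file however the script exits:

# The file is closed automatically when the block ends,
# even if an exception is raised inside it.
with open("example.txt", "r") as f:
    data = f.read()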
So I'm writing a script to take large csv files and divide them into chunks. These files each have lines formatted accordingly:
01/07/2003,1545,12.47,12.48,12.43,12.44,137423
The first field is the date and the next field is a time value; the data points are at minute granularity. My goal is to fill each output file with 8 days' worth of data, so I want to write all the lines from a file covering each 8-day span into a new file.
Right now, I'm only seeing the program write one line per "chunk" rather than all the lines. Code is shown below, along with screenshots showing how the chunk directories are made and the resulting file and its contents.
For reference, the day-8 file is shown, and the 1559 timestamp means it stored only the last line before the mod operator became true. So I'm thinking everything is getting overwritten somehow, since only the last values are being stored.
import os
import time

CWD = os.getcwd()
WRITEDIR = CWD + "/Divided Data/"
if not os.path.exists(WRITEDIR):
    os.makedirs(WRITEDIR)
FILEDIR = CWD + "/SP500"
os.chdir(FILEDIR)

valid_files = []
filelist = open("filelist.txt", 'r')
for file in filelist:
    cur_file = open(file.rstrip() + ".csv", 'r')
    cur_file.readline()  # skip first line
    prev_day = ""
    count = 0
    chunk_count = 1
    for line in cur_file:
        day = line[3:5]
        WDIR = WRITEDIR + "Chunk"
        cur_dir = os.getcwd()
        path = WDIR + " " + str(chunk_count)
        if not os.path.exists(path):
            os.makedirs(path)
        if (day != prev_day):
            # print(day)
            prev_day = day
            count += 1
            # Create new directory
            if (count % 8 == 0):
                chunk_count += 1
                PATH = WDIR + " " + str(chunk_count)
                if not os.path.exists(PATH):
                    os.makedirs(PATH)
                print("Chunk count: " + str(chunk_count))
                print("Global count: " + str(count))
        temp_path = WDIR + " " + str(chunk_count)
        os.chdir(temp_path)
        fname = file.rstrip() + str(chunk_count) + ".csv"
        with open(fname, 'w') as f:
            try:
                f.write(line + '\n')
            except:
                print("Could not write to file. \n")
        os.chdir(cur_dir)
        if (chunk_count >= 406):
            continue
    cur_file.close()
    # count += 1
The answer is in the comments, but let me give it here so that your question is answered:
You're opening your file in 'w' mode, which overwrites all previously written content. You need to open it in 'a' (append) mode:
fname = file.rstrip()+str(chunk_count)+".csv"
with open(fname, 'a') as f:
See more on the open function and its modes in the Python documentation; in particular, mode 'w' truncates the file if it already exists.
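A quick way to see the difference (using a hypothetical demo.txt):

# 'w' truncates the file on every open; 'a' keeps the existing content.
with open("demo.txt", "w") as f:
    f.write("first\n")
with open("demo.txt", "a") as f:
    f.write("second\n")
# demo.txt now holds both lines; opening with 'w' twice
# would have left only "second\n".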
I wrote a Python script that collects file metadata (filename, creation date, creation time, last modified date, last modified time) from a file directory. However, when the directory is a path located on an external hard drive, the script doesn't work, and I can't figure out why.
Here is the code:
import os
from os.path import basename
import datetime
import time

def getSize(filename):
    st = os.stat(filename)
    print st
    return st.st_size

# get last modified date
def getMTime(filename):
    fileModTime = os.path.getmtime(filename)
    return fileModTime

# get creation date
def getCTime(filename):
    fileModTime = os.path.getctime(filename)
    return fileModTime

# get data from directory
MyDirectory = "H:\0_tempfiles\150115_Portfolio\Work\Work\BarBackUp"
MyExtension = ".jpg"

# write to file
WorkingDirectory = "C:\\Users\Admin\Downloads\demo\\"
MyTxtFile = WorkingDirectory + "fileData6.txt"
delim = ";"

with open(MyTxtFile, 'wb') as f:
    f.write(delim.join(["FILENAME", "FILESIZE", "mDATE", "mTIME",
                        "cDATE", "cTIME"]) + "\n")
    for root, dirs, files in os.walk(MyDirectory):
        for file in files:
            if file.endswith(MyExtension):
                # get file name
                a = (os.path.join(root, file))
                #print a
                filename = a
                MyFileName = basename(a)
                # get file size
                MyFileSize = getSize(filename) / 1000
                print MyFileName + " >>> file size: " + str(MyFileSize) + "Kb"
                # get modification time V2
                modTimeV2 = getMTime(filename)
                modTimeV2 = time.strftime("%Y/%d/%m;%I:%M:%S %p", \
                    time.localtime(modTimeV2))
                print "time modified: " + str(modTimeV2)
                # get creation time
                creTime = getCTime(filename)
                creTime = time.strftime("%Y/%d/%m;%I:%M:%S %p", \
                    time.localtime(creTime))
                print "time created: " + str(creTime)
                # write data to file
                entry = delim.join([str(MyFileName), str(MyFileSize), \
                    str(modTimeV2), str(creTime)]) + "\n"
                f.write(entry)

print "<<<<<<everything went fine>>>>>>"
Your code works fine for me, but your MyDirectory variable has escape characters in it: sequences like \0 are interpreted as escape codes, not as a literal backslash plus character. Try adding an r in front of the quotes:
MyDirectory = r"H:\0_tempfiles\150115_Portfolio\Work\Work\BarBackUp"
or
MyDirectory = "H:/0_tempfiles/150115_Portfolio/Work/Work/BarBackUp"
or
MyDirectory = "H:\\0_tempfiles\\150115_Portfolio\\Work\\Work\\BarBackUp"