I'm assuming this has to be a memory issue, but I'm not sure. The program loops through PDFs looking for corrupted files; when a file is corrupted, it writes that file's location to a txt file for me to review later. On the first run I logged both pass and fail results, and after 67381 log entries it stopped. I then changed the logic so it only logs errors, but kept a console count of the loop so I can tell how far along the process is. There are about 190k files to loop through, and the count stops at exactly 67381 every time. The Python process looks like it is still running in the background, since memory and CPU usage keep fluctuating, but it's hard to be sure, and I don't know whether it will still write errors to the log.
Here is the code:
import PyPDF2, os
from time import gmtime, strftime

path = raw_input("Enter folder path of PDF files:")
t = open(r'c:\pdf_check\log.txt', 'w')
count = 1

for dirpath, dnames, fnames in os.walk(path):
    for file in fnames:
        print count
        count = count + 1
        if file.endswith(".pdf"):
            file = os.path.join(dirpath, file)
            try:
                PyPDF2.PdfFileReader(open(file, "rb"))
            except PyPDF2.utils.PdfReadError:
                curdate = strftime("%Y-%m-%d %H:%M:%S", gmtime())
                t.write(str(curdate) + " - " + file + " - fail\n")
            else:
                pass
                #curdate = strftime("%Y-%m-%d %H:%M:%S", gmtime())
                #t.write(str(curdate) + " - " + file + " - pass\n")
t.close()
Edit 1: (New Code)
New code and the same issue:
import PyPDF2, os
from time import gmtime, strftime

path = raw_input("Enter folder path of PDF files:")
t = open(r'c:\pdf_check\log.txt', 'w')
count = 1

for dirpath, dnames, fnames in os.walk(path):
    for file in fnames:
        print count
        count = count + 1
        if file.endswith(".pdf"):
            file = os.path.join(dirpath, file)
            try:
                with open(file, 'rb') as f:
                    PyPDF2.PdfFileReader(f)
            except PyPDF2.utils.PdfReadError:
                curdate = strftime("%Y-%m-%d %H:%M:%S", gmtime())
                t.write(str(curdate) + " - " + file + " - fail\n")
                f.close()  # redundant: the with block has already closed f
            else:
                pass
                f.close()  # redundant: the with block has already closed f
                #curdate = strftime("%Y-%m-%d %H:%M:%S", gmtime())
                #t.write(str(curdate) + " - " + file + " - pass\n")
t.close()
Edit 2: I am now trying to run this from a different machine with beefier hardware and a different version of Windows (10 Pro instead of Server 2008 R2), but I don't think that is the issue.
Try editing one of the .pdf files to make it larger. If the loop number your program "stops" at then becomes smaller, you can identify the problem as a memory issue.
Otherwise, it might be an unusually large PDF file that is simply taking your program a while to verify.
To debug this, print the location of each .pdf file as you open it, find the particular PDF the program stalls on, and open that file manually to investigate further.
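For example, a minimal change to the loop from the question (a sketch; it assumes the same variable names) prints each path and size before parsing, so the last line on the console names the suspect file:

for dirpath, dnames, fnames in os.walk(path):
    for file in fnames:
        if file.endswith(".pdf"):
            file = os.path.join(dirpath, file)
            # Print before parsing: if the program hangs, the last line
            # printed identifies the file it is stuck on.
            print file, os.path.getsize(file), "bytes"
            try:
                with open(file, 'rb') as f:
                    PyPDF2.PdfFileReader(f)
            except PyPDF2.utils.PdfReadError:
                pass  # log the failure as in the original code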
Figured it out. The issue is actually due to a random and very large corrupted PDF. So this is not a loop issue, it's a corrupted file issue.
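If anyone hits the same thing: a size guard before parsing (a sketch; the 500 MB threshold is an arbitrary assumption, not something from the question) would let the loop log and skip such files instead of stalling on them:

MAX_BYTES = 500 * 1024 * 1024  # arbitrary cutoff; tune to your data

# inside the inner loop, before calling PdfFileReader:
if os.path.getsize(file) > MAX_BYTES:
    t.write(file + " - skipped (too large to check)\n")
    continue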
I've written a simple Python script to search for a log file in a folder (which has approx. 4 million files) and read the file.
Currently, the average time taken for the entire operation is 20 seconds. I was wondering if there is a way to get the response faster.
Below is my script:
import re
import os
import timeit
from datetime import date

log_path = "D:\\Logs Folder\\"
rx_file_name = r"[0-9a-z]{8}-[0-9a-z]{4}-[0-9a-z]{4}-[0-9a-z]{4}-[0-9a-z]{12}"
log_search_script = True
today = str(date.today())

while log_search_script:
    try:
        log_search = input("Enter image file name: ")
        file_name = re.search(rx_file_name, log_search).group()
        log_file_name = str(file_name) + ".log"
        print(f"\nLooking for log file '{log_file_name}'...\n")
    except:
        print("\n ***** Invalid input. Try again! ***** \n")
        continue
    start = timeit.default_timer()
    if log_file_name in os.listdir(log_path):
        log_file = open(log_path + "\\" + log_file_name, 'r', encoding="utf8")
        print('\n' + "--------------------------------------------------------" + '\n')
        print(log_file.read())
        log_file.close()
        print('\n' + "--------------------------------------------------------" + '\n')
        print("Time Taken: " + str(timeit.default_timer() - start) + " seconds")
        print('\n' + "--------------------------------------------------------" + '\n')
    else:
        print("Log File Not Found")
    search_again = input('\nDo you want to search for another log ("y" / "n") ?').lower()
    if search_again[0] == 'y':
        print("======================================================\n\n")
        continue
    else:
        log_search_script = False
Your problem is this line:
if log_file_name in os.listdir(log_path):
It has two problems:
os.listdir will build a huge list, which can take a lot of time (and space...).
The ... in ... check then scans that huge list linearly for the file.
Instead, let your OS do the hard work and "ask for forgiveness, not permission": just assume the file is there and try to open it. If it is not actually there, an error will be raised, which we catch:
try:
    with open(log_path + "\\" + log_file_name, 'r', encoding="utf8") as file:
        print(file.read())
except FileNotFoundError:
    print("Log File Not Found")
You can use glob, which matches a filename pattern instead of listing the whole directory:
import glob
print(glob.glob(directory_path))
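Applied to this question's variables (a sketch; it assumes log_path and log_file_name from the script above), glob returns a list of matching paths, so an empty list means the file does not exist:

import glob

matches = glob.glob(log_path + log_file_name)  # no wildcard: at most one match
if matches:
    with open(matches[0], 'r', encoding="utf8") as f:
        print(f.read())
else:
    print("Log File Not Found")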
I have more than 500 XML files, and each XML file should be processed by FME Workbench individually (one iteration of FME Workbench per XML file).
For this purpose I run a Python file (loop.py) that iterates FME Workbench over each XML file.
The whole process used to work on another PC without any problem. Now, when I run the module, I get the following error:
Traceback (most recent call last):
  File "E:\XML_Data\process\01_XML_Tile_1.py", line 28, in <module>
    if "Translation was SUCCESSFUL" in open(path_log + "\\" + data + ".log").read():
IOError: [Errno 2] No such file or directory: 'E:\XML_Data\data_out\log_01\re_3385-5275.xml.log'
The Python code (loop.py) is attached below.
Any help is greatly appreciated.
import os
import time

# Main path and working folder:
#path_main = r"E:\XML_Data"
path_main = r"E:\XML_Data"
teil = str("01")

# variables
path_in = path_main + r"\data_in\03_Places\teil_" + teil                # source folder of XML files
path_in_tile10 = path_main + r"\data_in\01_Tiling\10x10.shp"            # grid shapefile
path_in_commu = path_main + r"\data_in\02_Communities\Communities.shp"  # communities shapefile
path_out = path_main + r"\data_out\teil_" + teil                        # output folder of shapefiles resulting from the XML files (tile_01 folder)
path_log = path_main + r"\data_out\log_" + teil                         # output folder of log files for each run (log_01 folder)
path_fme = r"%FME_EXE_2015%"                                            # "C:\Program Files\FME2015\fme.exe"
path_fme_workbench = path_main + r"\process\PY_FME2015.fmw"             # path of FME workbench

datalists = os.listdir(path_in)
count = 0

# loop over each file individually in FME
for data in datalists:
    if data.find(".xml") != -1:
        count += 1
        print ("Run-No." + str(count) + ": with data " + data)
        os.system(path_fme + " " + path_fme_workbench + " " +
                  "--SourceDataset_XML" + " " + path_in + "\\" + data + " " +
                  "--SourceDataset_SHAPE" + " " + path_in_tile10 + " " +
                  "--SourceDataset_SHAPE_COMU" + " " + path_in_commu + " " +
                  "--DestDataset_SHAPE" + " " + path_out + " " +
                  "LOG_FILENAME" + " " + path_log + "\\" + data + ".log")
        print ("Data processed: " + data)
        shape = str(data[19:28]) + "_POPINT_CENTR_UTM32N.shp"
        print ("ResultsFileName: " + shape)
        if "Translation was SUCCESSFUL" in open(path_log + "\\" + data + ".log").read():
            # Translation was successful and SHP file exists:
            if os.path.isfile(path_out + "\\" + shape):
                write_log = open(path_out + "\\" + "result_xml.log", "a")
                write_log.write(time.asctime(time.localtime()) + " " + shape + "\n")
                write_log.close()
                print("Everything ok")
            # Translation was successful, but SHP file does not exist:
            else:
                write_log = open(path_out + "\\" + "error_xml.log", "a")
                write_log.write(time.asctime(time.localtime()) + " Data: " + shape + " unavailable.\n")
                write_log.close()
        # Translation was not successful:
        else:
            write_log = open(path_out + "\\" + "error_xml.log", "a")
            write_log.write(time.asctime(time.localtime()) + " Translation " + data + " not successful.\n")  # was "Data", a NameError
            write_log.close()

print ("Number of calculated files: " + str(count))
Most likely, the script failed at the os.system line, so the log file was never created by the command. Since you mentioned a different computer, there could be many causes, such as a different version of FME (so the environment variable %FME_EXE_2015% would not exist).
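One way to make that failure visible (a sketch; it reuses the variables from loop.py, and fme_command stands for the long command string built in the question) is to check the os.system exit code and the log file's existence before reading it:

# Sketch: detect a failed FME run instead of crashing on the missing log.
log_file = path_log + "\\" + data + ".log"
ret = os.system(fme_command)  # fme_command: the command string from loop.py
if ret != 0 or not os.path.isfile(log_file):
    print ("FME run failed for " + data + " (exit code " + str(ret) + "), no log to check")
else:
    success = "Translation was SUCCESSFUL" in open(log_file).read()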
Use a WorkspaceRunner transformer to do this.
The FME version may be outdated, so first check whether the version is what is creating the problem.
import subprocess
subprocess.call(["C:/Program Files/fme/FMEStarter/FMEStarter.exe",
                 "C:/Program Files/fme/fme20238/fme.exe",
                 "/fmefile.fmw", "LOG_FILENAME", "logfile"],
                stdin=None, stdout=None, stderr=None, shell=True, timeout=None)
I am trying to parse multiple files dealing with "Mike's Pies", as you can see in the code below. I have written it so that I get the desired output; now I would like to parse all the files named "Mike's Pies".
import json
import sys
import glob

with open("Mike's Pies.20130201.json") as json_data:
    data = json.load(json_data)

# Keep all orders with variable of r
for r in data["orders"]:
    orderName = r["orderPlacer"]["name"]
    # Pull the address parts: house number/street/city/state
    address = r["address"]["houseNumber"]
    street = r["address"]["street"]
    city = r["address"]["city"]
    state = r["address"]["state"]
    Mikes = "Mike's Pies,"
    output = (str(orderName) + ", " + str(address) + " " + str(street) +
              " " + str(city) + " " + str(state) + ", " + Mikes + " ")
    length = len(r["pizzas"])
    for i in range(0, length):
        #if length >= 1 print r["pizzas"][1]["name"]
        #if i != length:
        pizza = r["pizzas"][i]["name"].strip("\n").strip(" ")
        if i != length - 1:
            output += pizza + ", "
        else:
            output += pizza
    print(output + "\n")
It sounds like you have code which works on "Mike's Pies.20130201.json", and you want to run that code on every file that starts with "Mike's Pies" and ends with "json", regardless of the timestamp-like bit in the middle. Am I right? You can get all matching filenames with glob and parse them one after the other.
for filename in glob.glob("Mike's Pies.*.json"):
    with open(filename) as json_data:
        data = json.load(json_data)
        # etc etc... Insert rest of code here
I am trying to create a Python script that will look in a series of sub-folders and delete empty shapefiles. I have successfully created the part of the script that deletes the empty files in one folder, but there are a total of 70 folders within the "Project" folder. While I could just copy and paste the code 69 times, I'm sure there must be a way to make it look at each sub-folder and run the code for each one. Below is what I have so far. Any ideas? I'm very new to this and have simply edited an existing script to get this far. Thanks!
import os

# Set the working directory
os.chdir("C:/Naview/Platypus/Project")

# Get the list of only files in the current directory
file = filter(os.path.isfile, os.listdir('C:/Naview/Platypus/Project'))

# For each file in directory
for shp in file:
    # Get only the files that end in ".shp"
    if shp.endswith(".shp"):
        # Get the size of the ".shp" file.
        # NOTE: The ".dbf" file can vary in size, whereas
        # the shp & shx are always the same when "empty".
        size = os.path.getsize(shp)
        print "\nChecking " + shp + "'s file size..."
        # If the file size is greater than 100 bytes, leave it alone.
        if size > 100:
            print "File is " + str(size) + " bytes"
            print shp + " will NOT be deleted \n"
        # If the file size is equal to 100 bytes, delete it.
        if size == 100:
            # Convert the int output from (size) to a string.
            print "File is " + str(size) + " bytes"
            # Get the filename without the extension
            base = shp[:-4]
            # Remove entire shapefile
            print "Removing " + base + ".* \n"
            if os.path.exists(base + ".shp"):
                os.remove(base + ".shp")
            if os.path.exists(base + ".shx"):
                os.remove(base + ".shx")
            if os.path.exists(base + ".dbf"):
                os.remove(base + ".dbf")
            if os.path.exists(base + ".prj"):
                os.remove(base + ".prj")
            if os.path.exists(base + ".sbn"):
                os.remove(base + ".sbn")
            if os.path.exists(base + ".sbx"):
                os.remove(base + ".sbx")
            if os.path.exists(base + ".shp.xml"):
                os.remove(base + ".shp.xml")
There are several ways to do this. I'm a fan of glob:

import glob
import os

# NOTE: the '**' pattern only recurses into sub-folders with
# glob.glob(pattern, recursive=True) on Python 3.5+.
for shp in glob.glob('C:/Naview/Platypus/Project/**/*.shp'):
    size = os.path.getsize(shp)
    print "\nChecking " + shp + "'s file size..."
    # If the file size is greater than 100 bytes, leave it alone.
    if size > 100:
        print "File is " + str(size) + " bytes"
        print shp + " will NOT be deleted \n"
        continue
    print "Removing", shp, "files"
    for file in glob.glob(shp[:-3] + '*'):
        print " removing", file
        os.remove(file)
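Since the '**' recursion needs Python 3.5+, an os.walk version (a sketch under the same folder layout, kept in the Python 2 style of the question) does the same walk on any version:

import os

# Sketch: visit every sub-folder of Project and collect the .shp paths.
for dirpath, dirnames, filenames in os.walk('C:/Naview/Platypus/Project'):
    for name in filenames:
        if name.endswith(".shp"):
            shp = os.path.join(dirpath, name)
            print "Found", shp, "-", os.path.getsize(shp), "bytes"
            # apply the size check and deletion logic from above here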
Time to learn about procedural programming: Defining Functions.
Put your code into a function with a path parameter and call it for each of your 70 paths:
def delete_empty_shapefiles(path):
    # Get the list of only files in the current directory
    file = filter(os.path.isfile, os.listdir(path))
    ...

paths = ['C:/Naview/Platypus/Project', ...]
for path in paths:
    delete_empty_shapefiles(path)
Bonus points for creating a function that performs the os.path.exists() and os.remove() calls.
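Such a helper might look like this (a sketch; remove_if_exists is a hypothetical name, not from the original):

def remove_if_exists(filename):
    # Delete a file only if it is actually there.
    if os.path.exists(filename):
        os.remove(filename)

# Replaces the seven copy-pasted if/remove pairs:
for ext in (".shp", ".shx", ".dbf", ".prj", ".sbn", ".sbx", ".shp.xml"):
    remove_if_exists(base + ext)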
I have the following script (see below), which takes stdin and manipulates it into some simple files.
# Import modules for script
import os, sys, fileinput, platform, subprocess

# Global variables
hostsFile = "hosts.txt"
hostsLookFile = "hosts.csv"

# Determine platform
plat = platform.system()

if plat == "Windows":
    # Define variables based on Windows and process
    currentDir = os.getcwd()
    hostsFileLoc = currentDir + "\\" + hostsFile
    hostsLookFileLoc = currentDir + "\\" + hostsLookFile
    ipAddress = sys.argv[1]
    hostName = sys.argv[2]
    hostPlatform = sys.argv[3]
    hostModel = sys.argv[4]
    # Add ipAddress to the hosts file for python to process
    with open(hostsFileLoc, "a") as hostsFilePython:
        for line in open(hostsFilePython, "r"):  # broken: open() cannot take a file object
            if ipAddress in line:
                print "ipAddress \(" + ipAddress + "\) already present in hosts file"
            else:
                print "Adding ipAddress: " + ipAddress + " to file"
                hostsFilePython.write(ipAddress + "\n")
    # Add all details to the lookup file for displaying on-screen and added value
    with open(hostsLookFileLoc, "a") as hostsLookFileCSV:
        for line in open(hostsLookFileCSV, "r"):  # broken: open() cannot take a file object
            if ipAddress in line:
                print "ipAddress \(" + ipAddress + "\) already present in lookup file"
            else:
                print "Adding details: " + ipAddress + "," + hostName + "," + hostPlatform + "," + hostModel + " to file"
                hostsLookFileCSV.write(ipAddress + "," + hostName + "," + hostPlatform + "," + hostModel + "\n")

if plat == "Linux":
    # Define variables based on Linux and process
    currentDir = os.getcwd()
    hostsFileLoc = currentDir + "/" + hostsFile
    hostsLookFileLoc = currentDir + "/" + hostsLookFile
    ipAddress = sys.argv[1]
    hostName = sys.argv[2]
    hostPlatform = sys.argv[3]
    hostModel = sys.argv[4]
    # Add ipAddress to the hosts file for python to process
    with open(hostsFileLoc, "a") as hostsFilePython:
        print "Adding ipAddress: " + ipAddress + " to file"
        hostsFilePython.write(ipAddress + "\n")
    # Add all details to the lookup file for displaying on-screen and added value
    with open(hostsLookFileLoc, "a") as hostsLookFileCSV:
        print "Adding details: " + ipAddress + "," + hostName + "," + hostPlatform + "," + hostModel + " to file"
        hostsLookFileCSV.write(ipAddress + "," + hostName + "," + hostPlatform + "," + hostModel + "\n")
This code obviously does not work, because the for line in open(hostsFilePython, "r"): syntax is wrong: I cannot pass an existing file object to open(). But this is what I want to achieve; how can I do it?
You want to open your file using the a+ mode so that you can both read and write, then simply reuse the existing file object (hostsFilePython).
However, this still won't work as written, because you can only iterate over a file once before it is exhausted.
It's also worth noting that this isn't very efficient. The better plan is to read the data into a set, update the set with your new values, then write the set back to the file. (As pointed out in the comments, sets discard duplicates, which is good for your purposes, but also discard order, which may or may not work for you. If not, you might need to use a list, which will be less efficient.)
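A sketch of that set-based approach (it assumes the hostsFileLoc and ipAddress variables from the question; note the original line order is lost):

# Read the existing entries into a set (empty if the file doesn't exist yet).
try:
    with open(hostsFileLoc) as f:
        hosts = set(line.strip() for line in f)
except IOError:
    hosts = set()

# Adding to a set silently ignores duplicates.
hosts.add(ipAddress)

# Write the updated set back out.
with open(hostsFileLoc, "w") as f:
    for host in hosts:
        f.write(host + "\n")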
with open(hostsFileLoc) as hostsFilePython:
    lines = hostsFilePython.readlines()

for filename in lines:
    with open(hostsFileLoc, 'a') as hostsFilePython:
        with open(filename) as hostsFile:
            for line in hostsFile.readlines():
                if ipAddress in line:
                    print "ipAddress \(" + ipAddress + "\) already present in hosts file"
                else:
                    print "Adding ipAddress: " + ipAddress + " to file"
                    hostsFilePython.write(ipAddress + "\n")
The default mode is read, so you don't need to pass in r explicitly.