Equivalent to grep, but not using open()

Equivalent to grep, but not using open() - python

I have the following script (see below) which is taken stdin and manipulating into some simple files.
# Import Modules for script
import os, sys, fileinput, platform, subprocess
# Global variables
hostsFile = "hosts.txt"
hostsLookFile = "hosts.csv"
# Determine platform
plat = platform.system()
if plat == "Windows":
# Define Variables based on Windows and process
currentDir = os.getcwd()
hostsFileLoc = currentDir + "\\" + hostsFile
hostsLookFileLoc = currentDir + "\\" + hostsLookFile
ipAddress = sys.argv[1]
hostName = sys.argv[2]
hostPlatform = sys.argv[3]
hostModel = sys.argv[4]
# Add ipAddress to the hosts file for python to process
with open(hostsFileLoc,"a") as hostsFilePython:
for line in open(hostsFilePython, "r"):
if ipAddress in line:
print "ipAddress \(" + ipAddress + "\) already present in hosts file"
else:
print "Adding ipAddress: " + ipAddress + " to file"
hostsFilePython.write(ipAddress + "\n")
# Add all details to the lookup file for displaying on-screen and added value
with open(hostsLookFileLoc,"a") as hostsLookFileCSV:
for line in open(hostsLookFileCSV, "r"):
if ipAddress in line:
print "ipAddress \(" + ipAddress + "\) already present in lookup file"
else:
print "Adding details: " + ipAddress + "," + hostName + "," + hostPlatform + "," + hostModel + " to file"
hostsLookFileCSV.write(ipAddress + "," + hostName + "," + hostPlatform + "," + hostModel + "\n")
if plat == "Linux":
# Define Variables based on Linux and process
currentDir = os.getcwd()
hostsFileLoc = currentDir + "/" + hostsFile
hostsLookFileLoc = currentDir + "/" + hostsLookFile
ipAddress = sys.argv[1]
hostName = sys.argv[2]
hostPlatform = sys.argv[3]
hostModel = sys.argv[4]
# Add ipAddress to the hosts file for python to process
with open(hostsFileLoc,"a") as hostsFilePython:
print "Adding ipAddress: " + ipAddress + " to file"
hostsFilePython.write(ipAddress + "\n")
# Add all details to the lookup file for displaying on-screen and added value
with open(hostsLookFileLoc,"a") as hostsLookFileCSV:
print "Adding details: " + ipAddress + "," + hostName + "," + hostPlatform + "," + hostModel + " to file"
hostsLookFileCSV.write(ipAddress + "," + hostName + "," + hostPlatform + "," + hostModel + "\n")
This code obviously does not work, because the for line in open(hostsFilePython, "r"): syntax is wrong... I can not use a current file object with "open()". However this is want I want to achieve, how can I do this?

You want to open your file using the a+ mode so that you can both read and write, then simply use the existing file object (hostsFilePython).
However, this still won't work as you can only iterate over a file once before it is exhausted.
It's worth noting that this isn't very efficient. The better plan is to read the data into a set, update the set with your new values, then write the set to the file. (As pointed out in the comments, sets don't preserve duplicates (good for your purposes), and order, which may or may not work for you. If not, then you might need to use a list, which will be less efficient).

with open(hostsFileLoc) as hostsFilePython:
lines = hostsFilePython.readlines()
for filename in lines:
with open(hostsFileLoc, 'a') as hostFilePython:
with open(filename) as hostsFile:
for line in hostsFile.readlines():
if ipAddress in line:
print "ipAddress \(" + ipAddress + "\) already present in hosts file"
else:
print "Adding ipAddress: " + ipAddress + " to file"
hostsFilePython.write(ipAddress + "\n")
The default mode is read, so you don't need to pass in r explicitly.

Related

Mongodump specify the output and filename

Is there a way to specify the whole path when running mongodump? I tried using --out but what it does at the moment is saving into a file at my_given_path/database/collection_name.json.gz
I have the following:
path = file_path + '/' + database + '/' + collection + '/'
query_input = "{\\\"metadata_id\\\": {\\\"\$oid\\\": \\\"" + metadata_id + "\\\"}}"
command = "mongodump --uri " + connection_string + database + " " \
"--collection=" + collection + " --query=\"" + query_input + "\" --gzip --out=" + path + " --quiet"
for which the file is saved at:
file_path/database/collection/database/collection.json.gz
Ideally I would like to save it into
file_path/database/collection/metadata_id.json.gz
Would this be possible?

You can use the --archive=<file> flag instead of --out

Python script to run FME workbench

I have more than 500 xml files and each xml file should processed on FME workbench individually (iteration of FME workbench for each xml file).
For such a propose i have to run a python file (loop.py) to iterate FME workbench for each xml file.
The whole process was working in past on other PC without any problem. Now Once i run Module i got the following error:
Traceback (most recent call last):E:\XML_Data
File "E:\XML_Data\process\01_XML_Tile_1.py", line 28, in
if "Translation was SUCCESSFUL" in open(path_log + "\" + data + ".log").read():
IOError: [Errno 2] No such file or directory: 'E:\XML_Data\data_out\log_01\re_3385-5275.xml.log'
Attached the python code(loop.py).
Any help is greatly appreciated.
import os
import time
# Mainpath and Working Folder:
#path_main = r"E:\XML_Data"
path_main = r"E:\XML_Data"
teil = str("01")
# variables
path_in = path_main + r"\data_in\03_Places\teil_" + teil # "Source folder of XML files"
path_in_tile10 = path_main + r"\data_in\01_Tiling\10x10.shp" # "Source folder of Grid shapefile"
path_in_commu = path_main + r"\data_in\02_Communities\Communities.shp" # "Source folder of Communities shapefile"
path_out = path_main + r"\data_out\teil_" + teil # "Output folder of shapefiles that resulted from XML files (tile_01 folder)"
path_log = path_main + r"\data_out\log_" + teil # "Output folder of log files for each run(log_01 folder)"
path_fme = r"%FME_EXE_2015%" # "C:\Program Files\FME2015\fme.exe"
path_fme_workbench = path_main + r"\process\PY_FME2015.fmw" # "path of FME workbench"
datalists = os.listdir(path_in)
count = 0
# loop each file individually in FME
for data in datalists:
if data.find(".xml") != -1:
count +=1
print ("Run-No." + str(count) + ": with data " + data)
os.system (path_fme + " " + path_fme_workbench + " " + "--SourceDataset_XML"+ " " + path_in + "\\" + data + " " + "--SourceDataset_SHAPE" + " " + path_in_tile10 + " " + "--SourceDataset_SHAPE_COMU" + " " + path_in_commu + " " + "--DestDataset_SHAPE" +" " +path_out + " " +"LOG_FILENAME" + " " + path_log + "\\" + data + ".log" )
print ("Data processed: " + data)
shape = str(data[19:28]) + "_POPINT_CENTR_UTM32N.shp"
print ("ResultsFileName: " + shape)
if "Translation was SUCCESSFUL" in open(path_log + "\\" + data + ".log").read():
# Translation was successful and SHP file exists:
if os.path.isfile(path_out + "\\" + shape):
write_log = open(path_out + "\\" + "result_xml.log", "a")
write_log.write(time.asctime(time.localtime()) + " " + shape + "\n")
write_log.close()
print("Everything ok")
#Translation was successful, but SHP file does not exist:
else:
write_log = open(path_out + "\\" + "error_xml.log", "a")
write_log.write(time.asctime(time.localtime()) + " Data: " + shape + " unavailable.\n")
write_log.close()
# Translation was not successful:
else:
write_log = open(path_out + "\\" + "error_xml.log", "a")
write_log.write(time.asctime(time.localtime()) + " Translation " + Data + " not successful.\n")
write_log.close()
print ("Number of calculated files: " + str(count))

Most likely, the script failed at the os.system line, so the log file was not created from the command. Since you mentioned a different computer, it could be caused by many reasons, such as a different version of FME (so the environment variable %FME_EXE_2015% would not exist).

Use a workspace runner transformer to do this.

The FME version is outdated.so first check the version whether it is creating the problem.

subprocess.call(["C:/Program Files/fme/FMEStarter/FMEStarter.exe", "C:/Program Files/fme/fme20238/fme.exe", "/fmefile.fmw" "LOG_FILENAME","logfile"], stdin=None, stdout=None, stderr=None, shell=True, timeout=None)

Python Loop Count Stops at 67381

I'm assuming this has to be a memory issue but I'm not sure. The program loops through PDF's to look for corrupted files. When a file is corrupted, it writes that location to a txt file for me to review later. When running it the first time, I logged both pass and fail scenarios to the log. After 67381 log entries, it stopped. Then I changed this logic so it only logs errors, however, in the console I did display a count of the loop so I can tell how far along the process is. There are about 190k files to loop through and at exactly 67381 the count stops every time. It looks like the python program is still running in the background as the memory and cpu keeps fluctuating but it's hard to be sure. I also don't know now if it will still write errors to the log.
Here is the code,
import PyPDF2, os
from time import gmtime,strftime
path = raw_input("Enter folder path of PDF files:")
t = open(r'c:\pdf_check\log.txt','w')
count = 1
for dirpath,dnames,fnames in os.walk(path):
for file in fnames:
print count
count = count + 1
if file.endswith(".pdf"):
file = os.path.join(dirpath, file)
try:
PyPDF2.PdfFileReader(open(file, "rb"))
except PyPDF2.utils.PdfReadError:
curdate = strftime("%Y-%m-%d %H:%M:%S", gmtime())
t.write (str(curdate) + " " + "-" + " " + file + " " + "-" + " " + "fail" + "\n")
else:
pass
#curdate = strftime("%Y-%m-%d %H:%M:%S", gmtime())
#t.write(str(curdate) + " " + "-" + " " + file + " " + "-" + " " + "pass" + "\n")
t.close()
Edit 1: (New Code)
New code and the same issue:
import PyPDF2, os
from time import gmtime,strftime
path = raw_input("Enter folder path of PDF files:")
t = open(r'c:\pdf_check\log.txt','w')
count = 1
for dirpath,dnames,fnames in os.walk(path):
for file in fnames:
print count
count = count + 1
if file.endswith(".pdf"):
file = os.path.join(dirpath, file)
try:
with open(file,'rb') as f:
PyPDF2.PdfFileReader(f)
except PyPDF2.utils.PdfReadError:
curdate = strftime("%Y-%m-%d %H:%M:%S", gmtime())
t.write (str(curdate) + " " + "-" + " " + file + " " + "-" + " " + "fail" + "\n")
f.close()
else:
pass
f.close()
#curdate = strftime("%Y-%m-%d %H:%M:%S", gmtime())
#t.write(str(curdate) + " " + "-" + " " + file + " " + "-" + " " + "pass" + "\n")
t.close()
Edit 2: I am trying to now run this from a different machine with beefier hardware and a different version of windows (10 pro instead of server 2008 r2) but I don't think this is the issue.

Try to edit one of the .pdf files to make it larger. That way, if the loop number your program "stops" at is smaller, you can identify the problem as a memory issue.
Else, it might be an unusually large pdf file that is taking your program a while to verify integrity.
Debugging this, you could print the file location of the .pdf files you open to find this particular .pdf and manually open it to investigate further..

Figured it out. The issue is actually due to a random and very large corrupted PDF. So this is not a loop issue, it's a corrupted file issue.

Converting multiple gz file within subdirectories into csv

I have many subdirectories in my main directory and would like to write a script to unzip and convert all the files within it. If possible, I would also like to combine all the CSV within a single directory into a single CSV. But more importantly, I need help with my nested loop.
import gzip
import csv
import os
subdirlist = os.listdir('/home/user/Desktop/testloop')
subtotal = len(subdirlist)
subcounter = 0
for dirlist in subdirlist:
print "Working On " + dirlist
total = len(dirlist)
counter = 0
for dir in dirlist:
print "Working On " + dir
f = gzip.open('/' + str(subdirlist) + '/' + dir, 'rb')
file_content = f.read()
f.close()
print "25% Complete"
filename = '/' + str(subdirlist) + '/temp.txt'
target = open(filename, 'w')
target.write(file_content)
target.close()
print "50% Complete!"
csv_file = '/' + str(subdirlist) + '/' + str(dir) + '.csv'
in_txt = csv.reader(open(filename, "rb"), delimiter = '\t')
out_csv = csv.writer(open(csv_file, 'wb'))
out_csv.writerows(in_txt)
os.remove(filename)
os.remove('/' + str(subdirlist) + '/' + dir)
counter+=1
print str(counter) + "/" + str(total) + " " + str(dir) + " Complete!"
print "SubDirectory Converted!"
print str(subcounter) + "/" + str(subtotal) + " " + str(subdirlist) + " Complete!"
subcounter+=1
print "All Files Converted!"
Thanks in advance

To get lists of files and subdirectories, you can use os.walk. Below is an implementation I wrote to get all files (optionally, of certain type(s)) in arbitrarily nested subdirectories:
from os import walk, sep
from functools import reduce # in Python 3.x only
def get_filelist(root, extensions=None):
"""Return a list of files (path and name) within a supplied root directory.
To filter by extension(s), provide a list of strings, e.g.
get_filelist(root, ["zip", "csv"])
"""
return reduce(lambda x, y: x+y,
[[sep.join([item[0], name]) for name in item[2]
if (extensions is None or
name.split(".")[-1] in extensions)]
for item in walk(root)])

Python File Integrity Monitoring

I have written the following code and can't get it working and I can't see why. The code:
Reads list of target files
Loops through the directories
Runs MD5 hash on the file
Checks the activefile for previous md5 hashes for said file
If the file is new it will log it as new
If the log is existing but changed it will write the change and log the change
If not new and no change then do nothing
Here is the code:
import hashlib
import logging as log
import optparse
import os
import re
import sys
import glob
import shutil
def md5(fileName):
"""Compute md5 hash of the specified file"""
try:
fileHandle = open(fileName, "rb")
except IOError:
return
m5Hash = hashlib.md5()
while True:
data = fileHandle.read(8192)
if not data:
break
m5Hash.update(data)
fileHandle.close()
return m5Hash.hexdigest()
req = open("requested.txt")
for reqline in req:
reqName = reqline[reqline.rfind('/') + 1:len(reqline) - 1]
reqDir = reqline[0:reqline.rfind('/') + 1]
tempFile = open("activetemp.txt", 'w')
for name in glob.glob(reqDir + reqName):
fileHash = md5(name)
actInt = 0
if fileHash != None:
actFile = open("activefile.txt")
for actLine in actFile:
actNameDir = actLine[0:actLine.rfind(' : ')]
actHash = actLine[actLine.rfind(' : ') + 3:len(actLine) -1]
if actNameDir == name and actHash == fileHash:
tempFile.write(name + " : " + fileHash + "\n")
actInt = 1
print fileHash
print actHash
print name
print actNameDir
if actNameDir == name and actHash != fileHash:
fimlog = open("fimlog.txt", 'a')
tempFile.write(name + " : " + actHash + "\n")
actInt = 1
fimlog.write("FIM Log: The file " + name + " was modified: " + actHash + "\n")
if actInt == 0:
fimlog = open("fimlog.txt", 'a')
fimlog.write("FIM Log: The file " + name + " was created: " + fileHash + "\n")
tempFile.write(name + " : " + fileHash + "\n")
shutil.copyfile("activetemp.txt", "activefile.txt")

You never really described the problem, but one possible culprit is that you never close the tempFile (or the other files for that matter), so the file copy at the end may fail.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Equivalent to grep, but not using open() - python

Related

Mongodump specify the output and filename

Python script to run FME workbench

Python Loop Count Stops at 67381

Converting multiple gz file within subdirectories into csv

Python File Integrity Monitoring

Categories

Resources