I wrote a quick and sloppy python script for my dad in order to read in text files from a given folder and replace the top lines with a specific format. My apologies for any mix of pluses (+) and commas (,). The purpose was to replace something like this:
Sounding: BASF CPT-1
Depth: 1.05 meter(s)
TIME AMPLITUDE
(ms)
with something like this:
Tempo(ms); Amplitude(cm/s) Valores provisorios da Sismica; Profundidade[m] = 1.05
I thought I had gotten it all resolved until my dad mentioned that all the text files had the last number repeated on a new line. Here are some examples of output:
output sample links - not enough reputation to post more than 2 links, sorry
Here is my code:
#imports
import glob, inspect, os, re
from sys import argv
#work
is_correct = False
succeeded = 0
failed = 0
while not is_correct:
    print "Please type the folder name: "
    folder_name = raw_input()
    full_path = os.path.dirname(os.path.abspath(__file__)) + "\\" + folder_name + "\\"
    print "---------Looking in the following folder: " + full_path
    print "Is this correct? (Y/N)"
    confirm_answer = raw_input()
    if confirm_answer == 'Y':
        is_correct = True
    else:
        is_correct = False
files_list = glob.glob(full_path + "\*.txt")
print "Files found: ", files_list
for file_name in files_list:
    new_header = "Tempo(ms); Amplitude(cm/s) Valores provisorios da Sismica; Profundidade[m] ="
    current_file = open(file_name, "r+")
    print "---------Looking at: " + current_file.name
    file_data = current_file.read()
    current_file.close()
    match = re.search("Depth:\W(.+)\Wmeter", file_data)
    if match:
        new_header = new_header + str(match.groups(1)[0]) + "\n"
        print "Depth captured: ", match.groups()
        print "New header to be added: ", new_header
    else:
        print "Match failed!"
    match_replace = re.search("(Sounding.+\s+Depth:.+\s+TIME\s+AMPLITUDE\s+.+\s+) \d", file_data)
    if match_replace:
        print "Replacing text ..."
        text_to_replace = match_replace.group(1)
        print "SANITY CHECK - Text found: ", text_to_replace
        new_data = file_data.replace(text_to_replace, new_header)
        current_file = open(file_name, "r+")
        current_file.write(new_data)
        current_file.close()
        succeeded = succeeded + 1
    else:
        print "Text not found!"
        failed = failed + 1
    # this was added after I noticed the mysterious repeated number (quick fix)
    # why do I need this?
    lines = file(file_name, 'r').readlines()
    del lines[-1]
    file(file_name, 'w').writelines(lines)
print "--------------------------------"
print "RESULTS"
print "--------------------------------"
print "Succeeded: " , succeeded
print "Failed: ", failed
#template -- new_data = file_data.replace("Sounding: BASF CPT-1\nDepth: 29.92 meter(s)\nTIME AMPLITUDE \n(ms)\n\n")
What am I doing wrong exactly? I am not sure why the extra number is being added at the end (as you can see on the "modified text file - broken" link above). I'm sure it is something simple, but I am not seeing it. If you want to replicate the broken output, you just need to comment out these lines:
lines = file(file_name, 'r').readlines()
del lines[-1]
file(file_name, 'w').writelines(lines)
The problem is that, when you go to write your new data to the file, you are opening the file in mode r+, which means "open the file for reading and writing, and start at the beginning". Your code then writes data into the file starting at the beginning. However, your new data is shorter than the data already in the file, and since the file isn't getting truncated, that extra bit of data is left over at the end of the file.
Quick solution: in your if match_replace: section, change this line:
current_file = open(file_name, "r+")
to this:
current_file = open(file_name, "w")
This will open the file in write mode, and will truncate the file before you write to it. I just tested it, and it works fine.
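To see the difference between the two modes in isolation, here is a minimal sketch (written for Python 3, unlike the Python 2 script in the question). The helper names `overwrite` and `demo` and the sample strings are made up for illustration; the only thing being shown is that "r+" leaves the tail of the old contents in place while "w" truncates first.

```python
import os
import tempfile


def overwrite(path, new_text, mode):
    """Write new_text to path using the given mode, then return the file's contents."""
    with open(path, mode) as handle:
        handle.write(new_text)
    with open(path, "r") as handle:
        return handle.read()


def demo():
    """Overwrite a 10-character file with 3 characters, once per mode."""
    results = {}
    for mode in ("r+", "w"):
        fd, path = tempfile.mkstemp(suffix=".txt")
        os.close(fd)
        try:
            # Start from the same longer contents each time.
            with open(path, "w") as handle:
                handle.write("0123456789")
            results[mode] = overwrite(path, "abc", mode)
        finally:
            os.remove(path)
    return results


if __name__ == "__main__":
    # "r+" keeps the leftover tail of the old data; "w" does not.
    print(demo())
```

With "r+" the file ends up as "abc3456789", which is the same kind of leftover data as the repeated number in the question.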
I've been trying to write my Python code, but it always outputs the file with a blank line at the end. Is there a way to modify my code so it doesn't write that last blank line?
def write_concordance(self, filename):
    """ Write the concordance entries to the output file(filename)
    See sample output files for format."""
    try:
        file_out = open(filename, "w")
    except FileNotFoundError:
        raise FileNotFoundError("File Not Found")
    word_lst = self.concordance_table.get_all_keys() #gets a list of all the words
    word_lst.sort() #orders it
    for i in word_lst:
        ln_num = self.concordance_table.get_value(i) #line number list
        ln_str = "" #string that will be written to file
        for c in ln_num:
            ln_str += " " + str(c) #loads line numbers as a string
        file_out.write(i + ":" + ln_str + "\n")
    file_out.close()
Line 13 in this picture is what I need gone
Put in a check so that the new line is not added for the last element of the list:
def write_concordance(self, filename):
    """ Write the concordance entries to the output file(filename)
    See sample output files for format."""
    try:
        file_out = open(filename, "w")
    except FileNotFoundError:
        raise FileNotFoundError("File Not Found")
    word_lst = self.concordance_table.get_all_keys() #gets a list of all the words
    word_lst.sort() #orders it
    for i in word_lst:
        ln_num = self.concordance_table.get_value(i) #line number list
        ln_str = "" #string that will be written to file
        for c in ln_num:
            ln_str += " " + str(c) #loads line numbers as a string
        file_out.write(i + ":" + ln_str)
        if i != word_lst[-1]:
            file_out.write("\n")
    file_out.close()
The issue is here:
file_out.write(i + ":" + ln_str + "\n")
The \n adds a newline after every entry, including the last one.
The way to fix this is to rewrite it slightly:
ln_strs = []
for i in word_lst:
    ln_num = self.concordance_table.get_value(i) #line number list
    ln_str = " ".join(str(c) for c in ln_num) #join needs strings, so convert each line number
    ln_strs.append(f"{i}: {ln_str}")
file_out.write('\n'.join(ln_strs))
By the way, you should not use file_out = open() and file_out.close(), but rather with open(...) as file_out:. That way the file is always closed, and an exception won't leave it hanging open.
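Putting both suggestions together, a minimal sketch could look like the following. It assumes the concordance is available as a plain dict mapping each word to a list of line numbers; that dict is a stand-in for the `concordance_table` object in the question, whose real interface isn't shown beyond `get_all_keys()` and `get_value()`. The function names are also just illustrative.

```python
import os
import tempfile


def format_concordance(entries):
    """Build the output text with newlines only BETWEEN entries."""
    lines = []
    for word in sorted(entries):
        numbers = " ".join(str(n) for n in entries[word])
        lines.append(f"{word}: {numbers}")
    # join() puts the separator between items, so there is no trailing newline.
    return "\n".join(lines)


def write_concordance(entries, filename):
    """Write the formatted concordance; the with-block always closes the file."""
    with open(filename, "w") as file_out:
        file_out.write(format_concordance(entries))


if __name__ == "__main__":
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        write_concordance({"banana": [2], "apple": [1, 3]}, path)
        with open(path) as check:
            print(repr(check.read()))
    finally:
        os.remove(path)
```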
I have tried to create a really simple program that counts the words that you have written. When I run my code, I do not get any errors, the problem is that it always says: "the numbers of words are 0" when it is clearly not 0. I have tried to add this and see if it actually reads anything from the file: print(data) . It doesn't print anything ): so there must be a problem with the read part.
print("copy ur text down below")
words = input("")
f = open("data.txt", "w+")
z = open("data.txt", "r+")
info = f.write(words)
data = z.read()
res = len(data.split())
print("the numbers of words are " + str(res))
f.close()
Thx in advance
This is because you haven't closed the file after writing to it, so the text is still sitting in the write buffer when you read. Call f.close() before calling z.read().
Code:
print("copy ur text down below")
words = input("")
f = open("data.txt", "w+")
z = open("data.txt", "r+")
info = f.write(words)
f.close() # closing the file here after writing
data = z.read()
res = len(data.split())
print("the numbers of words are " + str(res))
z.close() # f is already closed above; close the read handle here
Output:
copy ur text down below
hello world
the numbers of words are 2
After writing to f with f.write, you should close f with f.close before calling z.read. See here.
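A slightly tidier way to get the same effect is to let with-blocks do the closing, so the write handle is guaranteed to be closed (and flushed) before the file is reopened for reading. This is only a sketch; the function name `count_words` and the temporary file path are made up for the example.

```python
import os
import tempfile


def count_words(text, path):
    """Write text to path, then read it back and count whitespace-separated words."""
    # Leaving this block closes the file, which flushes the data to disk.
    with open(path, "w") as out_file:
        out_file.write(text)
    # Only now open the file for reading.
    with open(path, "r") as in_file:
        data = in_file.read()
    return len(data.split())


if __name__ == "__main__":
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        print("the numbers of words are " + str(count_words("hello world", path)))
    finally:
        os.remove(path)
```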
I am just picking up and learning Python. For work I go through a lot of PDFs, so I found a PDFMiner tool that converts a directory of PDFs to text. I then wrote the code below to tell me whether each PDF is an approved claim or a denied claim. What I don't understand is how to find the string that starts with "Tracking Identification Number...", take the 18 characters after it, and put them into an array.
import os
import glob
import csv
def check(filename):
    if 'DELIVERY NOTIFICATION' in open(filename).read():
        isDenied = True
        print ("This claim was Denied")
        print (isDenied)
    elif 'Dear Customer:' in open(filename).read():
        isDenied = False
        print("This claim was Approved")
        print (isDenied)
    else:
        print("I don't know if this is approved or denied")
def iterate():
    path = 'text/'
    for infile in glob.glob(os.path.join(path, '*.txt')):
        print ('current file is:' + infile)
        filename = infile
        check(filename)
iterate()
Any help would be appreciated. This is what the text file looks like:
Shipper Number............................577140Pickup Date....................................06/27/17
Number of Parcels........................1Weight.............................................1 LBS
Shipper Invoice Number..............30057010Tracking Identification Number...1Z000000YW00000000
Merchandise..................................1 S NIKE EQUALS EVERYWHERE T BK B
WE HAVE BEEN UNABLE TO PROVIDE SATISFACTORY PROOF OF DELIVERY FOR THE ABOVE
SHIPMENT. WE APOLOGIZE FOR THE INCONVENIENCE THIS CAUSES.
NPT8AEQ:000A0000LDI 07
----------------Page (1) Break----------------
Update: Many helpful answers. Here is the route I took, and it is working quite nicely, if I do say so myself. This is going to save tons of time! Here is the entire code for any future viewers.
import os
import glob
import csv  # needed for the csv.writer calls below
arrayDenied = []
arrayApproved = []  # must exist before check() appends to it
def iterate():
    path = 'text/'
    for infile in glob.glob(os.path.join(path, '*.txt')):
        print ('current file is:' + infile)
        check(infile)
def check(filename):
    with open(filename, 'rt') as file_contents:
        myText = file_contents.read()
    if 'DELIVERY NOTIFICATION' in myText:
        start = myText.index("Tracking Identification Number...") + len("Tracking Identification Number...")
        myNumber = myText[start : start+18]
        print("Denied: " + myNumber)
        arrayDenied.append(myNumber)
    elif 'Dear Customer:' in myText:  # reuse the text already read instead of reopening the file
        print("This claim was Approved")
        startTrackingNum = myText.index("Tracking Identification Number...") + len("Tracking Identification Number...")
        myNumber = myText[startTrackingNum : startTrackingNum+18]
        startClaimNumberIndex = myText.index("Claim Number ") + len("Claim Number ")
        myClaimNumber = myText[startClaimNumberIndex : startClaimNumberIndex+11]
        arrayApproved.append(myNumber + " - " + myClaimNumber)
    else:
        print("I don't know if this is approved or denied")
iterate()
with open('Approved.csv', "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    for val in arrayApproved:
        writer.writerow([val])
with open('Denied.csv', "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    for val in arrayDenied:
        writer.writerow([val])
print(arrayDenied)
print(arrayApproved)
Update: Added the rest of my finished code. It writes the lists to CSV files, where I run some =LEFT() formulas and such, and boom, I have 1000 tracking numbers in a matter of minutes. This is why programming is great.
If your goal is just to find the "Tracking Identification Number..." string and the subsequent 18 characters, you can find the index of that string, move to where it ends, and slice from that point through the next 18 characters.
# Read the text file into memory:
with open(filename, 'rt') as txt_file:
    myText = txt_file.read()
if 'DELIVERY NOTIFICATION' in myText:
    # Find the desired string and get the subsequent 18 characters:
    start = myText.index("Tracking Identification Number...") + len("Tracking Identification Number...")
    myNumber = myText[start : start+18]
    arrayDenied.append(myNumber)
You can also modify the append line into arrayDenied.append(myText + ' ' + myNumber) or things like that.
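The index-and-slice idea can be wrapped in a small helper so it is easy to reuse and test. This is a sketch, not part of the original answer: the name `extract_tracking_number` is made up, and it uses `str.find` instead of `str.index` so that a file without the marker returns None rather than raising ValueError.

```python
MARKER = "Tracking Identification Number..."


def extract_tracking_number(text, marker=MARKER, length=18):
    """Return the `length` characters that follow `marker`, or None if it is absent."""
    position = text.find(marker)
    if position == -1:
        return None
    start = position + len(marker)
    return text[start:start + length]


if __name__ == "__main__":
    sample = ("Shipper Invoice Number..............30057010"
              "Tracking Identification Number...1Z000000YW00000000")
    print(extract_tracking_number(sample))
```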
Regular expressions are the way to go for your task. Here is a way to modify your code to search for the pattern.
import re
pattern = r"(?<=Tracking Identification Number)(?:(\.+))[A-Z-a-z0-9]{18}"
def check(filename):
    file_contents = open(filename, 'r').read()
    if 'DELIVERY NOTIFICATION' in file_contents:
        isDenied = True
        print ("This claim was Denied")
        print (isDenied)
        matches = re.finditer(pattern, file_contents)  # search the text that was just read
        for match in matches:
            print("Tracking Number = %s" % match.group().strip("."))
    elif 'Dear Customer:' in file_contents:
        isDenied = False
        print("This claim was Approved")
        print (isDenied)
    else:
        print("I don't know if this is approved or denied")
Explanation:
r"(?<=Tracking Identification Number)(?:(\.+))[A-Z-a-z0-9]{18}"
(?<=Tracking Identification Number) Looks behind the capturing group to find the string "Tracking Identification Number"
(?:(\.+)) matches one or more dots (.) (we strip these out after)
[A-Z-a-z0-9]{18} matches 18 instances of (capital or lowercase) letters or numbers
More on Regex.
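As a variation on the pattern above, a capturing group can pick out just the 18 characters, which avoids stripping the dots afterwards. This is a sketch under my own assumptions: the pattern below is not the one from the answer (it drops the lookbehind and the literal hyphen in the character class), and the names `TRACKING_RE` and `find_tracking_numbers` are illustrative.

```python
import re

# Label, one or more literal dots, then exactly 18 letters/digits captured as group 1.
TRACKING_RE = re.compile(r"Tracking Identification Number\.+([A-Za-z0-9]{18})")


def find_tracking_numbers(text):
    """Return every captured tracking number found in text (possibly an empty list)."""
    # With a single capturing group, findall() returns the group contents only.
    return TRACKING_RE.findall(text)


if __name__ == "__main__":
    sample = ("Shipper Invoice Number..............30057010"
              "Tracking Identification Number...1Z000000YW00000000")
    print(find_tracking_numbers(sample))
```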
I think this solves your issue, just turn it into a function.
import re
string = 'Tracking Identification Number...1Z000000YW00000000'
no_dots = re.sub(r'\.', '', string) #Removes all dots from the string
matchObj = re.search('^Tracking Identification Number(.*)', no_dots) #Matches anything after the "Tracking Identification Number"
try:
    print (matchObj.group(1))
except AttributeError: #re.search returned None, so there is no group to read
    print("No match!")
If you want to read the documentation it is here: https://docs.python.org/3/library/re.html#re.search
import os
searchquery = 'word'
with open('Y:/Documents/result.txt', 'w') as f:
    for filename in os.listdir('Y:/Documents/scripts/script files'):
        with open('Y:/Documents/scripts/script files/' + filename) as currentFile:
            for line in currentFile:
                if searchquery in line:
                    start = line.find(searchquery)
                    end = line.find("R")
                    result = line[start:end]
                    print result
                    f.write(result + ' ' +filename[:-4] + '\n')
Now this works well to search for "word" and prints everything after "word" up until an "R", provided that it is on the same line. However, if the "R" is on the next line, it won't print the text before it.
eg:
this should not be printed!
this should also not be printed! "word" = 12345
6789 "R" After this R should not be printed either!
In the case above, the 6789 on line 3 will not be printed with my current code. However, I want it to be. How do I make Python keep going over multiple lines until it reaches the "R"?
Thanks for any help!
It is normal that it does not print the content on the next line, because you are searching one line at a time. A better solution is to read the whole file into a single string and search that, as follows.
import os
searchquery = 'word'
with open('Y:/Documents/result.txt', 'w') as f:
    for filename in os.listdir('Y:/Documents/scripts/script files'):
        with open('Y:/Documents/scripts/script files/' + filename) as currentFile:
            content = ''.join([line for line in currentFile])
            start = content.find(searchquery)
            end = content.find("R", start)  # look for "R" only after the search word
            result = content[start:end].replace("\n", "")
            print result
            f.write(result + ' ' +filename[:-4] + '\n')
Please be advised that this will work only for a single occurrence. You will need to break it up further to handle multiple occurrences.
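For the multiple-occurrence case, one option is a non-greedy regular expression with re.DOTALL, so that the match can run across line breaks. This is only a sketch written for Python 3; the function name `extract_between` and the sample text are made up, and it keeps the behaviour of the code above in that the search word is included in each result and newlines are removed.

```python
import re


def extract_between(content, start_token, end_token):
    """Return every stretch from start_token up to (not including) end_token."""
    # re.escape treats the tokens literally; (.*?) is non-greedy so each match
    # stops at the FIRST end_token; DOTALL lets "." cross line breaks.
    pattern = re.compile(
        re.escape(start_token) + r"(.*?)" + re.escape(end_token), re.DOTALL
    )
    return [start_token + body.replace("\n", "") for body in pattern.findall(content)]


if __name__ == "__main__":
    text = "skip\nskip word = 12345\n6789 R tail\nword = 1 R"
    for hit in extract_between(text, "word", "R"):
        print(hit)
```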
I'd like to read a file looking for a specific match in the style "word = word". Specifically, I'm looking for files with usernames and passwords in them. These files would be scripts created by admins using bad practices, with clear-text credentials in logon scripts and so on.
The code I have created so far does the job, but it's very messy and prints the entire line when a match is found (I can't help but think there is a more elegant way to do this). This creates ugly output; I'd like to print only the match in the line, and I can't seem to find a way to do that. If I can create the correct regex for something like the match below, is it possible to print only the match found in the line rather than the entire line?
(I am going to try to describe the type of match I'm looking for.)
Key
* = wildcard
- = space
^ = anycharacter until a space
Match
*(U|u)ser^-=-^
import os
import re
dirt = "/dir/path/"
def get_files():
    for root, dirs, files in os.walk(dirt):
        for filename in files:
            if filename.endswith(('.bat', '.vbs', '.ps', '.txt')):
                readfile = open(os.path.join(root, filename), "r")
                for line in readfile:
                    if re.match("(.*)(U|u)ser(.*)", line) and re.match("(.*)(=)(.*)", line) or re.match("(.*)(P|p)ass(.*)", line) and re.match("(.*)(=)(.*)", line):
                        print line
TEST SCRIPT
strComputer = "atl-ws-01"
strNamespace = "root\cimv2"
strUser = "Administrator"
strPassword = "4rTGh2#1"
user = AnotherUser #Test
pass = AnotherPass #test
Set objWbemLocator = CreateObject("WbemScripting.SWbemLocator")
Set objWMIService = objwbemLocator.ConnectServer _
(strComputer, strNamespace, strUser, strPassword)
objWMIService.Security_.authenticationLevel = WbemAuthenticationLevelPktPrivacy
Set colItems = objWMIService.ExecQuery _
("Select * From Win32_OperatingSystem")
For Each objItem in ColItems
Wscript.Echo strComputer & ": " & objItem.Caption
Next
Latest code after taking on board the responses
This is the latest code I am using. It seems to be doing the job as expected, except that the output isn't managed as well as I'd like. I'd like to add the items to a dictionary, with the file name as the key and two values, the username and the password. That will be asked as a separate question, though.
Thanks all for the help
dirt = "~/Desktop/tmp"
def get_files():
    regs = ["(.*)((U|u)ser(.*))(\s=\s\W\w+\W)", "(.*)((U|u)ser(.*))(\s=\s\w+)", "(.*)((P|p)ass(.*))\s=\s(\W(.*)\W)", "(.*)((P|p)ass(.*))(\s=\s\W\w+\W)"]
    combined = "(" + ")|(".join(regs) + ")"
    results = dict()
    for root, dirs, files in os.walk(dirt):
        for filename in files:
            if filename.endswith(('.bat', '.vbs', '.ps', '.txt')):
                readfile = open(os.path.join(root, filename), "r")
                for line in readfile:
                    m = re.match(combined, line)
                    if m:
                        print os.path.join(root, filename)
                        print m.group(0)
Latest Code output
~/Desktop/tmp/Domain.local/Policies/{31B2F340-016D-11D2-945F-00C04FB984F9}/USER/Scripts/Logon/logonscript1.vbs
strUser = "guytom"
~/Desktop/tmp/DLsec.local/Policies/{31B2F340-016D-11D2-945F-00C04FB984F9}/USER/Scripts/Logon/logonscript1.vbs
strPassword = "P#ssw0rd1"
~/Desktop/tmp/DLsec.local/Policies/{31B2F340-016D-11D2-945F-00C04FB984F9}/USER/Scripts/Logon/logonscript2.bat
strUsername = "guytom2"
~/Desktop/tmp/DLsec.local/Policies/{31B2F340-016D-11D2-945F-00C04FB984F9}/USER/Scripts/Logon/logonscript2.bat
strPass = "SECRETPASSWORD"
https://docs.python.org/2/library/re.html
group([group1, ...])
Returns one or more subgroups of the match. If there is a single argument, the result is a single string; if there are multiple arguments, the result is a tuple with one item per argument. Without arguments, group1 defaults to zero (the whole match is returned). If a groupN argument is zero, the corresponding return value is the entire matching string;
match.group(0)
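To make the difference between group(0) and the numbered groups concrete, here is a small sketch in Python 3. The pattern is an assumption of mine, not the one from the question: it looks for a name containing "user" followed by "=" and a value. The function name `describe_match` is illustrative.

```python
import re


def describe_match(line):
    """Return the whole match and its two subgroups, or None if nothing matches."""
    # Group 1: a name containing "user" (case-insensitive); group 2: the value after "=".
    match = re.search(r"(\w*user\w*)\s*=\s*(\S+)", line, re.IGNORECASE)
    if match is None:
        return None
    return {
        "whole": match.group(0),  # the entire matched text, not the entire line
        "key": match.group(1),
        "value": match.group(2),
    }


if __name__ == "__main__":
    print(describe_match('    strUser = "Administrator"  \' set the account'))
```

Printing only match.group(0), or just the subgroups, gives the matched text without the rest of the line.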
Since you can have many object=value you need to use regular expressions. Here is some sample code for you.
line1 = " someuser = bob "
line2 = " bob'spasswd= secretpassword"
#re.I will do case insensitive search
userMatchObj=re.search('.*user.*=\\s*([\\S]*).*', line1, re.I)
pwdMatchObj=re.search(r'.*pass.*=\s*(.*)', line2, re.I)
if userMatchObj: print "user="+userMatchObj.group(1)
if pwdMatchObj: print "password="+pwdMatchObj.group(1)
output:
user=bob
password=secretpassword
References: https://docs.python.org/2/library/re.html , http://www.tutorialspoint.com/python/python_reg_expressions.htm
Thanks all for the help. My working code is the version shown under "Latest code" in the question above, together with its output (the output needs further work, but the matching is working well).