I am just starting to learn Python. For work I go through a lot of PDFs, so I found a PDFMiner tool that converts a directory of PDFs to text files. I then wrote the code below to tell me whether a PDF is an approved claim or a denied claim. What I don't understand is how to find the string that starts with "Tracking Identification Number...", take the 18 characters after it, and put them into an array.
import os
import glob
import csv

def check(filename):
    if 'DELIVERY NOTIFICATION' in open(filename).read():
        isDenied = True
        print ("This claim was Denied")
        print (isDenied)
    elif 'Dear Customer:' in open(filename).read():
        isDenied = False
        print("This claim was Approved")
        print (isDenied)
    else:
        print("I don't know if this is approved or denied")

def iterate():
    path = 'text/'
    for infile in glob.glob(os.path.join(path, '*.txt')):
        print ('current file is:' + infile)
        filename = infile
        check(filename)

iterate()
Any help would be appreciated. This is what the text file looks like:
Shipper Number............................577140Pickup Date....................................06/27/17
Number of Parcels........................1Weight.............................................1 LBS
Shipper Invoice Number..............30057010Tracking Identification Number...1Z000000YW00000000
Merchandise..................................1 S NIKE EQUALS EVERYWHERE T BK B
WE HAVE BEEN UNABLE TO PROVIDE SATISFACTORY PROOF OF DELIVERY FOR THE ABOVE
SHIPMENT. WE APOLOGIZE FOR THE INCONVENIENCE THIS CAUSES.
NPT8AEQ:000A0000LDI 07
----------------Page (1) Break----------------
Update: Many helpful answers. Here is the route I took, and it is working quite nicely. This is going to save tons of time! Here is the entire code for any future viewers.
import os
import glob
import csv

arrayDenied = []
arrayApproved = []

def iterate():
    path = 'text/'
    for infile in glob.glob(os.path.join(path, '*.txt')):
        print ('current file is:' + infile)
        check(infile)

def check(filename):
    with open(filename, 'rt') as file_contents:
        myText = file_contents.read()
    if 'DELIVERY NOTIFICATION' in myText:
        start = myText.index("Tracking Identification Number...") + len("Tracking Identification Number...")
        myNumber = myText[start : start+18]
        print("Denied: " + myNumber)
        arrayDenied.append(myNumber)
    elif 'Dear Customer:' in myText:
        print("This claim was Approved")
        startTrackingNum = myText.index("Tracking Identification Number...") + len("Tracking Identification Number...")
        myNumber = myText[startTrackingNum : startTrackingNum+18]
        startClaimNumberIndex = myText.index("Claim Number ") + len("Claim Number ")
        myClaimNumber = myText[startClaimNumberIndex : startClaimNumberIndex+11]
        arrayApproved.append(myNumber + " - " + myClaimNumber)
    else:
        print("I don't know if this is approved or denied")

iterate()
with open('Approved.csv', "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    for val in arrayApproved:
        writer.writerow([val])
with open('Denied.csv', "w") as output:
    writer = csv.writer(output, lineterminator='\n')
    for val in arrayDenied:
        writer.writerow([val])
print(arrayDenied)
print(arrayApproved)
Update: Added the rest of my finished code. It writes the lists to a CSV file, where I run some =LEFT() formulas and such, and I have 1000 tracking numbers in a matter of minutes. This is why programming is great.
If your goal is just to find the "Tracking Identification Number..." string and the 18 characters after it, you can find the index of that string, skip to where it ends, and slice out the next 18 characters.
# Read the text file into memory:
with open(filename, 'rt') as txt_file:
    myText = txt_file.read()
if 'DELIVERY NOTIFICATION' in myText:
    # Find the desired string and get the subsequent 18 characters:
    start = myText.index("Tracking Identification Number...") + len("Tracking Identification Number...")
    myNumber = myText[start : start+18]
    arrayDenied.append(myNumber)
You can also modify the append line into arrayDenied.append(myText + ' ' + myNumber) or things like that.
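The index-and-slice idea can also be wrapped in a small helper. This is only a sketch: `extract_after` is a made-up name, and it uses `str.find` so that a missing marker gives `None` instead of the `ValueError` that `str.index` would raise. The sample line is taken from the text file in the question.

```python
def extract_after(text, marker, length):
    """Return the `length` characters that follow `marker`, or None if absent."""
    pos = text.find(marker)  # str.find returns -1 when the marker is missing
    if pos == -1:
        return None
    start = pos + len(marker)
    return text[start:start + length]


sample = ("Shipper Invoice Number..............30057010"
          "Tracking Identification Number...1Z000000YW00000000\n")
tracking = extract_after(sample, "Tracking Identification Number...", 18)
print(tracking)
```

Returning `None` lets the caller decide what to do with files where the label is missing, rather than stopping the whole directory scan on the first odd file.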
Regular expressions are the way to go for your task. Here is a way to modify your code to search for the pattern.
import re

pattern = r"(?<=Tracking Identification Number)(?:(\.+))[A-Za-z0-9]{18}"

def check(filename):
    with open(filename, 'r') as f:
        file_contents = f.read()
    if 'DELIVERY NOTIFICATION' in file_contents:
        isDenied = True
        print ("This claim was Denied")
        print (isDenied)
        # Search the text that was just read for tracking numbers
        matches = re.finditer(pattern, file_contents)
        for match in matches:
            print("Tracking Number = %s" % match.group().strip("."))
    elif 'Dear Customer:' in file_contents:
        isDenied = False
        print("This claim was Approved")
        print (isDenied)
    else:
        print("I don't know if this is approved or denied")
Explanation:
r"(?<=Tracking Identification Number)(?:(\.+))[A-Za-z0-9]{18}"
(?<=Tracking Identification Number) is a lookbehind: it requires the string "Tracking Identification Number" immediately before the match, without making it part of the match
(?:(\.+)) matches one or more dots (.) (we strip these out after)
[A-Za-z0-9]{18} matches 18 instances of (capital or lowercase) letters or numbers
More on Regex.
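As a variation, a capture group can pull out just the 18 characters so nothing has to be stripped afterwards. This is a sketch under the assumption that the label is always followed by a run of dots and then 18 letters or digits; `TRACKING_RE` and `find_tracking_numbers` are names invented for the example.

```python
import re

# Assumed format: the label, one or more dots, then 18 letters or digits.
TRACKING_RE = re.compile(r"Tracking Identification Number\.+([A-Za-z0-9]{18})")


def find_tracking_numbers(text):
    """Return every tracking number found in `text` (captured group only)."""
    return TRACKING_RE.findall(text)


sample = ("Shipper Invoice Number..............30057010"
          "Tracking Identification Number...1Z000000YW00000000")
numbers = find_tracking_numbers(sample)
print(numbers)
```

With a single group in the pattern, `findall` returns a list of the captured strings, which can be appended to a list such as `arrayDenied` directly.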
I think this solves your issue, just turn it into a function.
import re
string = 'Tracking Identification Number...1Z000000YW00000000'
no_dots = re.sub(r'\.', '', string)  # Removes all dots from the string
matchObj = re.search(r'^Tracking Identification Number(.*)', no_dots)  # Matches anything after the "Tracking Identification Number"
# re.search returns None when nothing matches, so test the result first
if matchObj:
    print(matchObj.group(1))
else:
    print("No match!")
If you want to read the documentation it is here: https://docs.python.org/3/library/re.html#re.search
I want to basically remove all the characters in delete list from the file (Line 11 to 15). What would be the neatest way to delete the words without making the code not neat. I am not sure whether to open the file again here which I know would not be the right way but I can't think of a different solution. Any help would be appreciated.
from os import write
import re

def readText():
    with open(r'C:\Users\maxth\Desktop\TextCounter\Text.txt') as f:
        print(f.read())

def longestWord():
    with open(r'C:\Users\maxth\Desktop\TextCounter\Text.txt', 'r+') as f:
        users_text = f.read()
        #I want to basically remove all the char in delete list from the file. What would be the neatest way to delete the words without making the code not neat. I am not sure wether to open the file again here and re write it or what!
        deleteList = ['!','£','$','%','^','&','*','()','_','+']
        for line in f:
            for word in deleteList:
                line = line.replace(word, '')
        longest = max(users_text.split(), key=len)
        count_longest = str(len(longest))
        print('The longest word in the file is: ' + longest)
        print('Thats a total of '+count_longest+' letters!')

def writeWord():
    with open(r'C:\Users\maxth\Desktop\TextCounter\Text.txt', 'w') as f:
        users_text = input('Enter your desired text to continue. \n: ')
        f.write(users_text)
        f.close()
    with open(r'C:\Users\maxth\Desktop\TextCounter\Text.txt', 'r') as file:
        print(file.read())

longestWord()
I had to rework it and do the cleaning in a different function. I still need to switch to relative paths, which will make it a lot cleaner as well.
import re

def longestWord():
    with open(r'C:\Users\maxth\Desktop\TextCounter\Text.txt', 'r+') as f:
        users_text = f.read()
    longest = max(users_text.split(), key=len)
    count_longest = str(len(longest))
    print('The longest word in the file is: ' + longest)
    print('Thats a total of '+count_longest+' letters!')

def writeWord():
    with open(r'C:\Users\maxth\Desktop\TextCounter\Text.txt', 'w') as f:
        users_text = input('Enter your desired text to continue. \n: ')
        # Replace anything that is not a letter, digit, space, newline or dot
        cleanText = re.sub(r'[^a-zA-Z0-9 \n.]', ' ', users_text)
        f.write(cleanText)
    print('\nRemoved any illegal characters. Here is your text:\n\n' + cleanText + '\n')

while True:
    print("""
Welcome to Skies word text counter!
====================================================
""")
    writeWord()
    longestWord()
    userDecide = input("""
====================================================
Would you like to enter new text and repeat?
Type 'yes' to continue else program will terminate.
====================================================
: """)
    # .lower has to be called; the bare method object never equals 'yes'
    if userDecide.lower() != 'yes':
        print('Application closing...')
        exit()
Other questions don't seem to be getting answered or are not getting answered for Python. I'm trying to get it to find the keyword "name", set the position to there, then set a variable to that specific line, and then have it use only that piece of text as a variable. In shorter terms, I'm trying to locate a variable in the .txt file based on "name" or "HP" which will always be there.
I hope that makes sense...
I've tried to use different variables like currentplace instead of namePlace but neither works.
import os

def savetest():
    save = open("nametest_text.txt", "r")
    print("Do you have a save?")
    conf = input(": ")
    if conf == "y" or conf == "Y" or conf == "Yes" or conf == "yes":
        text = save.read()
        namePlace = text.find("name")
        currentText = namePlace + 7
        save.seek(namePlace)
        nameLine = save.readline()
        username = nameLine[currentText:len(nameLine)]
        print(username)
        hpPlace = text.find("HP")
        currentText = hpPlace + 5
        save.seek(hpPlace)
        hpLine = save.readline()
        playerHP = hpLine[currentText:len(hpLine)]
        print(playerHP)
        os.system("pause")
    save.close()

savetest()
My text file is simply:
name = Wubzy
HP = 100
I want it to print out whatever is put after the equals sign at name and the same for HP, but not name and HP itself.
So it should just print
Wubzy
100
Press any key to continue . . .
But it instead prints
Wubzy
Press any key to continue . . .
This looks like a good job for a regex. Regexes can match and capture patterns in text, which seems to be exactly what you are trying to do.
For example, the regex ^name\s*=\s*(\w+)$ will match lines that have the exact text "name", followed by 0 or more whitespace characters, an '=', another 0 or more whitespace characters, and then one or more word characters (letters, digits or underscores). It will capture the word group at the end.
The regex ^HP\s*=\s*(\d+)$ will match lines that have the exact text "HP", followed by 0 or more whitespace characters, an '=', another 0 or more whitespace characters, and then one or more digits. It will capture the number group at the end. Because the whole file is searched as one string, both patterns need the re.MULTILINE flag so that ^ and $ match at the start and end of each line.
# This is the regex library
import re

# This might be easier to use if you're getting more information in the future.
# re.MULTILINE makes ^ and $ match at every line boundary, which is needed
# because the whole file is searched as a single string.
reg_dict = {
    "name": re.compile(r"^name\s*=\s*(\w+)$", re.MULTILINE),
    "HP": re.compile(r"^HP\s*=\s*(\d+)$", re.MULTILINE)
}

def savetest():
    save = open("nametest_text.txt", "r")
    print("Do you have a save?")
    conf = input(": ")
    # instead of checking each one individually, you can check if conf is
    # within a much smaller set of valid answers
    if conf.lower() in ["y", "yes"]:
        text = save.read()
        # Find the name
        match = reg_dict["name"].search(text)
        # .search will return the first match of the text, or if there are
        # no occurrences, None
        if(match):
            # With match groups, group(0) is the entire match, group(1) is
            # What was captured in the first set of parenthesis
            username = match.group(1)
        else:
            print("The text file does not contain a username.")
            return
        print(username)
        # Find the HP
        match = reg_dict["HP"].search(text)
        if(match):
            player_hp = match.group(1)
        else:
            print("The text file does not contain a HP.")
            return
        print(player_hp)
        # Using system calls to pause output is not a great idea for a
        # variety of reasons, such as cross OS compatibility
        # Instead of os.system("pause") try
        input("Press enter to continue...")
    save.close()

savetest()
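One detail worth checking when a whole file is searched in one go is how `^` and `$` behave. The short sketch below uses an in-memory copy of the two-line save file from the question (no file is read) to compare a search with and without `re.MULTILINE`.

```python
import re

text = "name = Wubzy\nHP = 100\n"

# Without re.MULTILINE, ^ only matches at the very start of the string and
# $ only at the very end (or just before a final newline).
plain_name = re.search(r"^name\s*=\s*(\w+)$", text)
plain_hp = re.search(r"^HP\s*=\s*(\d+)$", text)

# With re.MULTILINE, ^ and $ also match at the start and end of every line.
multi_name = re.search(r"^name\s*=\s*(\w+)$", text, re.MULTILINE)
multi_hp = re.search(r"^HP\s*=\s*(\d+)$", text, re.MULTILINE)

print(plain_name, plain_hp)
print(multi_name.group(1), multi_hp.group(1))
```

The alternative is to loop over the file line by line and match each stripped line separately, in which case the flag is not needed.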
Use a regex to extract based on a pattern:
'(?:name|HP) = (.*)'
This captures anything that follows an equal to sign preceded by either name or HP.
Code:
import re
with open("nametest_text.txt", "r") as f:
    for line in f:
        m = re.search(r'(?:name|HP) = (.*)', line.strip())
        if m:
            print(m.group(1))
Simplest way may be to use str.split() and then print everything after the '=' character:
with open("nametest_text.txt", "r") as f:
    for line in f:
        if line.strip():
            print(line.strip().split(' = ')[1])
output:
Wubzy
100
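If the values need to be looked up by key later instead of just printed, the same splitting idea can fill a dictionary. This is a sketch: `parse_save` is a hypothetical helper, and it uses `str.partition` so that lines without an `=` are skipped instead of raising an `IndexError`.

```python
def parse_save(lines):
    """Parse 'key = value' lines into a dict, skipping lines without '='."""
    values = {}
    for line in lines:
        key, sep, value = line.partition("=")
        if sep:  # partition returns an empty separator when '=' is missing
            values[key.strip()] = value.strip()
    return values


# The lines of the example save file, as a file object would yield them.
save = parse_save(["name = Wubzy\n", "HP = 100\n", "\n"])
print(save["name"])
print(save["HP"])
```

Passing an open file object works the same way, since iterating over a file yields its lines.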
Instead of trying to create and parse a proprietary format (you will most likely hit limitations at some point and will need to change your logic and/or file format), better stick to a well-known and well-defined file format that comes with the required writers and parsers, like yaml, json, cfg, xml, and many more.
This saves a lot of pain; consider the following quick example of a class that holds a state and that can be serialized to a key-value-mapped file format (I'm using yaml here, but you can easily exchange it for json, or others):
#!/usr/bin/python
import os
import yaml

class GameState:
    def __init__(self, name, **kwargs):
        self.name = name
        self.health = 100
        self.__dict__.update(kwargs)

    @staticmethod
    def from_savegame(path):
        with open(path, 'r') as savegame:
            args = yaml.safe_load(savegame)
        return GameState(**args)

    def save(self, path, overwrite=False):
        if os.path.exists(path) and os.path.isfile(path) and not overwrite:
            raise IOError('Savegame exists; refusing to overwrite.')
        with open(path, 'w') as savegame:
            savegame.write(yaml.dump(self.__dict__))

    def __str__(self):
        # sorted(...items()) works on Python 2 and 3 and keeps the order stable
        return (
            'GameState(\n{}\n)'
            .format(
                '\n'.join([
                    '  {}: {}'.format(k, v)
                    for k, v in sorted(self.__dict__.items())
                ]))
        )
Using this simple class as an example:
SAVEGAMEFILE = 'savegame_01.yml'
new_gs = GameState(name='jbndlr')
print(new_gs)
new_gs.health = 98
print(new_gs)
new_gs.save(SAVEGAMEFILE, overwrite=True)
old_gs = GameState.from_savegame(SAVEGAMEFILE)
print(old_gs)
... yields:
GameState(
health: 100
name: jbndlr
)
GameState(
health: 98
name: jbndlr
)
GameState(
health: 98
name: jbndlr
)
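For comparison, here is roughly what the same round trip looks like with the standard-library `json` module, which avoids the third-party yaml dependency. It is a sketch, not a drop-in replacement for the class above: `save_state` and `load_state` are hypothetical helpers, and the file is written to a temporary directory so the example does not touch real savegames.

```python
import json
import os
import tempfile


def save_state(state, path):
    """Write a plain dict of game state to `path` as JSON."""
    with open(path, "w") as fh:
        json.dump(state, fh, indent=2)


def load_state(path):
    """Read the dict back from a JSON savegame."""
    with open(path) as fh:
        return json.load(fh)


path = os.path.join(tempfile.mkdtemp(), "savegame_01.json")
save_state({"name": "jbndlr", "health": 98}, path)
restored = load_state(path)
print(restored)
```

JSON keeps strings and numbers as distinct types, so `health` comes back as an integer rather than text that has to be converted.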
I've never used Python and have copied a script (with permission) from someone online, so I'm not sure why the code is failing. I'm hoping someone can understand it and put it right for me!
from os import walk
from os.path import join
#First some options here.
!RootDir = "C:\\Users\\***\\Documents\\GoGames"
!OutputFile = "C:\\Users\\***\\Documents\\GoGames\\protable.csv"
Properties = !!['pb', 'pw', 'br', 'wr', 'dt', 'ev', 're']
print """
SGF Database Maker
==================
Use this program to create a CSV file with sgf info.
"""
def getInfo(filename):
"""Read out file info here and return a dictionary with all the
properties needed."""
result = !![]
file = open(filename, 'r')
data = file.read(1024) read at most 1kb since we assume all relevant info is in the beginning
file.close()
for prop in Properties:
try:
i = data.lower().index(prop)
except !ValueError:
result.append((prop, ''))
continue
try:
value = data![data.index('![', i)+1 : data.index(']', i)]
except !ValueError:
value = ''
result.append((prop, value))
return dict(result)
!ProgressCounter = 0
file = open(!OutputFile, "w")
file.write('^Filename^;^PB^;^BR^;^PW^;^WR^;^RE^;^EV^;^DT^\n')
for root, dirs, files in walk(!RootDir):
for name in files:
if name![-3:].lower() != "sgf":
continue
info = getInfo(join(root, name))
file.write('^'+join(root, name)+'^;^'+info!['pb']+'^;^'+info!['br']+'^;^'+info!['pw']+'^;^'+info!['wr']+'^;^'+info!['re']+'^;^'+info!['ev']+'^;^'+info!['dt']+'^\n')
!ProgressCounter += 1
if (!ProgressCounter) % 100 == 0:
print str(!ProgressCounter) + " games processed."
file.close()
print "A total of " + str(!ProgressCounter) + " have been processed."
Using Netbeans IDE I get the following error:
!RootDir = "C:\\Users\\***\\Documents\\GoGames"
^
SyntaxError: mismatched input '' expecting EOF
I have previously been able to step through the code as far as file.close(), where I got an error "does not match outer indentation level".
Anyone able to put the syntax of this code right for me?
Remove the exclamation marks in front of variable names, list declarations (!![]) and in except clauses (except !ValueError), this is not valid Python syntax.
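To show what the property lookup looks like once the stray markers are gone, here is a sketch of the `getInfo` logic written for Python 3. It takes the file contents as a string instead of a filename so it can be tried without any files, and the sample record is made up for illustration.

```python
PROPERTIES = ['pb', 'pw', 'br', 'wr', 'dt', 'ev', 're']


def get_info(data):
    """Return a dict with the bracketed value of each property in `data`."""
    result = {}
    lowered = data.lower()
    for prop in PROPERTIES:
        try:
            i = lowered.index(prop)
            # The value is whatever sits between the next '[' and ']'
            value = data[data.index('[', i) + 1:data.index(']', i)]
        except ValueError:
            # Property name or brackets not found: record an empty value
            value = ''
        result[prop] = value
    return result


# Hypothetical, minimal SGF-like header used only for this example
info = get_info("(;PB[Alice]PW[Bob]RE[B+2.5])")
print(info)
```

Like the original, this is a plain substring search, so a property name that happens to appear inside another value would be picked up by mistake; a real SGF parser would be more robust.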
I wrote a quick and sloppy python script for my dad in order to read in text files from a given folder and replace the top lines with a specific format. My apologies for any mix of pluses (+) and commas (,). The purpose was to replace something like this:
Sounding: BASF CPT-1
Depth: 1.05 meter(s)
with something like this:
Tempo(ms); Amplitude(cm/s) Valores provisorios da Sismica; Profundidade[m] = 1.05
I thought I had gotten it all resolved until my dad mentioned that all the text files had the last number repeated in a new line. Here are some examples of output:
output sample links - not enough reputation to post more than 2 links, sorry
Here is my code:
TIME AMPLITUDE
(ms)
#imports
import glob, inspect, os, re
from sys import argv

#work
is_correct = False
succeeded = 0
failed = 0
while not is_correct:
    print "Please type the folder name: "
    folder_name = raw_input()
    full_path = os.path.dirname(os.path.abspath(__file__)) + "\\" + folder_name + "\\"
    print "---------Looking in the following folder: " + full_path
    print "Is this correct? (Y/N)"
    confirm_answer = raw_input()
    if confirm_answer == 'Y':
        is_correct = True
    else:
        is_correct = False
files_list = glob.glob(full_path + "\*.txt")
print "Files found: ", files_list
for file_name in files_list:
    new_header = "Tempo(ms); Amplitude(cm/s) Valores provisorios da Sismica; Profundidade[m] ="
    current_file = open(file_name, "r+")
    print "---------Looking at: " + current_file.name
    file_data = current_file.read()
    current_file.close()
    match = re.search("Depth:\W(.+)\Wmeter", file_data)
    if match:
        new_header = new_header + str(match.groups(1)[0]) + "\n"
        print "Depth captured: ", match.groups()
        print "New header to be added: ", new_header
    else:
        print "Match failed!"
    match_replace = re.search("(Sounding.+\s+Depth:.+\s+TIME\s+AMPLITUDE\s+.+\s+) \d", file_data)
    if match_replace:
        print "Replacing text ..."
        text_to_replace = match_replace.group(1)
        print "SANITY CHECK - Text found: ", text_to_replace
        new_data = file_data.replace(text_to_replace, new_header)
        current_file = open(file_name, "r+")
        current_file.write(new_data)
        current_file.close()
        succeeded = succeeded + 1
    else:
        print "Text not found!"
        failed = failed + 1
    # this was added after I noticed the mysterious repeated number (quick fix)
    # why do I need this?
    lines = file(file_name, 'r').readlines()
    del lines[-1]
    file(file_name, 'w').writelines(lines)
print "--------------------------------"
print "RESULTS"
print "--------------------------------"
print "Succeeded: " , succeeded
print "Failed: ", failed
#template -- new_data = file_data.replace("Sounding: BASF CPT-1\nDepth: 29.92 meter(s)\nTIME AMPLITUDE \n(ms)\n\n")
What am I doing wrong exactly? I am not sure why the extra number is being added at the end (as you can see on the "modified text file - broken" link above). I'm sure it is something simple, but I am not seeing it. If you want to replicate the broken output, you just need to comment out these lines:
lines = file(file_name, 'r').readlines()
del lines[-1]
file(file_name, 'w').writelines(lines)
The problem is that, when you go to write your new data to the file, you are opening the file in mode r+, which means "open the file for reading and writing, and start at the beginning". Your code then writes data into the file starting at the beginning. However, your new data is shorter than the data already in the file, and since the file isn't getting truncated, that extra bit of data is left over at the end of the file.
Quick solution: in your if match_replace: section, change this line:
current_file = open(file_name, "r+")
to this:
current_file = open(file_name, "w")
This will open the file in write mode, and will truncate the file before you write to it. I just tested it, and it works fine.
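The difference between the two modes is easy to reproduce on a throwaway file. The sketch below is written for Python 3 and uses a temporary directory; the file name and contents are invented for the demonstration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with open(path, "w") as fh:
    fh.write("0123456789")

# "r+" starts writing at offset 0 but keeps whatever lies beyond the new data.
with open(path, "r+") as fh:
    fh.write("abc")
with open(path) as fh:
    after_r_plus = fh.read()

# "w" truncates the file to zero length before anything is written.
with open(path, "w") as fh:
    fh.write("abc")
with open(path) as fh:
    after_w = fh.read()

print(after_r_plus)  # abc3456789
print(after_w)       # abc
```

The leftover "3456789" in the first result is the same effect as the repeated number at the end of the modified text files. If the file must stay open in "r+" mode, calling `truncate()` on the file object after writing also removes the leftover tail.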
I'd like to search files for a specific match in the style "word = word". Specifically, I'm looking for files with usernames and passwords in them. These files would be scripts created by admins using bad practices, with clear-text credentials used in logon scripts etc.
The code I have created so far does the job, but it's very messy and prints the entire line if a match is found (I can't help but think there is a more elegant way to do this). This creates ugly output; I'd like to print only the match in the line, and I can't seem to find a way to do that. If I can create the correct regex for something like the match below, is it possible to print only the match found in the line rather than the entire line?
(I am going to try to describe the type of match I'm looking for)
Key
* = wildcard
- = space
^ = anycharacter until a space
Match
*(U|u)ser^-=-^
dirt = "/dir/path/"
def get_files():
    for root, dirs, files in os.walk(dirt):
        for filename in files:
            if filename.endswith(('.bat', '.vbs', '.ps', '.txt')):
                readfile = open(os.path.join(root, filename), "r")
                for line in readfile:
                    if re.match("(.*)(U|u)ser(.*)", line) and re.match("(.*)(=)(.*)", line) or re.match("(.*)(P|p)ass(.*)", line) and re.match("(.*)(=)(.*)", line):
                        print line
TEST SCRIPT
strComputer = "atl-ws-01"
strNamespace = “root\cimv2”
strUser = "Administrator"
strPassword = "4rTGh2#1"
user = AnotherUser #Test
pass = AnotherPass #test
Set objWbemLocator = CreateObject("WbemScripting.SWbemLocator")
Set objWMIService = objwbemLocator.ConnectServer _
(strComputer, strNamespace, strUser, strPassword)
objWMIService.Security_.authenticationLevel = WbemAuthenticationLevelPktPrivacy
Set colItems = objWMIService.ExecQuery _
("Select * From Win32_OperatingSystem")
For Each objItem in ColItems
Wscript.Echo strComputer & ": " & objItem.Caption
Next
Latest code after taking on board the responses
This is the latest code I am using. It seems to be doing the job as expected, except that the output isn't managed as well as I'd like. I'd like to add the items into a dictionary, with the file name as the key and two values, the username and the password. That will be asked as a separate question, though.
Thanks all for the help
dirt = "~/Desktop/tmp"
def get_files():
    regs = ["(.*)((U|u)ser(.*))(\s=\s\W\w+\W)", "(.*)((U|u)ser(.*))(\s=\s\w+)", "(.*)((P|p)ass(.*))\s=\s(\W(.*)\W)", "(.*)((P|p)ass(.*))(\s=\s\W\w+\W)"]
    combined = "(" + ")|(".join(regs) + ")"
    results = dict()
    for root, dirs, files in os.walk(dirt):
        for filename in files:
            if filename.endswith(('.bat', '.vbs', '.ps', '.txt')):
                readfile = open(os.path.join(root, filename), "r")
                for line in readfile:
                    m = re.match(combined, line)
                    if m:
                        print os.path.join(root, filename)
                        print m.group(0)
Latest Code output
~/Desktop/tmp/Domain.local/Policies/{31B2F340-016D-11D2-945F-00C04FB984F9}/USER/Scripts/Logon/logonscript1.vbs
strUser = "guytom"
~/Desktop/tmp/DLsec.local/Policies/{31B2F340-016D-11D2-945F-00C04FB984F9}/USER/Scripts/Logon/logonscript1.vbs
strPassword = "P#ssw0rd1"
~/Desktop/tmp/DLsec.local/Policies/{31B2F340-016D-11D2-945F-00C04FB984F9}/USER/Scripts/Logon/logonscript2.bat
strUsername = "guytom2"
~/Desktop/tmp/DLsec.local/Policies/{31B2F340-016D-11D2-945F-00C04FB984F9}/USER/Scripts/Logon/logonscript2.bat
strPass = "SECRETPASSWORD"
https://docs.python.org/2/library/re.html
group([group1, ...])
Returns one or more subgroups of the match. If there is a single argument, the result is a single string; if there are multiple arguments, the result is a tuple with one item per argument. Without arguments, group1 defaults to zero (the whole match is returned). If a groupN argument is zero, the corresponding return value is the entire matching string;
match.group(0)
Since you can have many object=value you need to use regular expressions. Here is some sample code for you.
line1 = " someuser = bob "
line2 = " bob'spasswd= secretpassword"
#re.I will do case insensitive search
userMatchObj=re.search('.*user.*=\\s*([\\S]*).*', line1, re.I)
pwdMatchObj=re.search(r'.*pass.*=\s*(.*)', line2, re.I)
if userMatchObj: print "user="+userMatchObj.group(1)
if pwdMatchObj: print "password="+pwdMatchObj.group(1)
output:
user=bob
password=secretpassword
References: https://docs.python.org/2/library/re.html , http://www.tutorialspoint.com/python/python_reg_expressions.htm
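To print only the matched pieces rather than the whole line, capture groups can return the variable name and its value separately. The following is a sketch in Python 3, not a complete scanner: `CRED_RE` and `find_credentials` are names made up here, the pattern assumes values contain no spaces, and the sample lines are borrowed from the test script in the question.

```python
import re

# Group 1: a name containing "user" or "pass" (any case).
# Group 2: the value after '=', with optional surrounding double quotes.
CRED_RE = re.compile(r'(\w*(?:user|pass)\w*)\s*=\s*"?([^"\s]+)"?', re.IGNORECASE)


def find_credentials(lines):
    """Return (name, value) pairs for lines that look like credentials."""
    found = []
    for line in lines:
        m = CRED_RE.search(line)
        if m:
            found.append((m.group(1), m.group(2)))
    return found


script = [
    'strComputer = "atl-ws-01"',
    'strUser = "Administrator"',
    'strPassword = "4rTGh2#1"',
    'user = AnotherUser #Test',
]
creds = find_credentials(script)
for name, value in creds:
    print(name + " -> " + value)
```

The returned pairs could then be stored in a dictionary keyed by file path, for example `results[path] = find_credentials(open_file)`, where `results` and `open_file` are placeholders for whatever the walking loop provides.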