Combining PDFs on filename - python

Hello, I wrote this code to extract a list of PDF names. It then compares that list against the folder where the PDF data is stored. The current code merges everything in that folder that is in the PDF data. I commented out the part where it checks if the filename "starts with", because with that check it only merges one of the documents, not all of the ones that are specified.
import json
from PyPDF2 import PdfFileMerger
import os

input = 'pdfdata'
dlist = os.listdir(input)
with open('pdf.json', 'r') as jsonfile:
    jason_info = json.load(jsonfile)
json_data = [jason_info]
for item in json_data:
    for data_item in item['info']:
        jobjects = data_item.values()  # return only values
        informa = ''.join(jobjects)  # string
        filemerger = PdfFileMerger()
        for file in dlist:
            # if file.startswith(informa):
            filemerger.append(file)
        filemerger.write("combinedpdfs.pdf")
        filemerger.close()

Does this do what you want it to?
filemerger = PdfFileMerger()
for item in json_data:
    for dictionary in item['info']:  # retrieve key & value
        for pdf in dictionary.values():
            if pdf in dlist:
                filemerger.append(pdf)
            else:
                raise ValueError(f'Pdf file {pdf} missing!')
filemerger.write("combinedpdfs.pdf")
filemerger.close()
I moved the filemerger line out of the for block as otherwise you're constantly resetting it. I also removed the for loop which goes through dlist as you can just check if the pdf file is in dlist.
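If the commented-out startswith check is what you actually want (merge every file whose name begins with one of the values from the JSON), here is an untested sketch of that variant; the folder name, the 'info' key, and the JSON layout are taken from the question, the rest is an assumption about your data:

import json
import os
from PyPDF2 import PdfFileMerger

folder = 'pdfdata'
dlist = os.listdir(folder)

with open('pdf.json', 'r') as jsonfile:
    json_info = json.load(jsonfile)

# collect every value named in the JSON as a filename prefix
prefixes = [value for data_item in json_info['info']
            for value in data_item.values()]

filemerger = PdfFileMerger()
for file in dlist:
    # keep a file when it starts with any of the JSON prefixes
    if any(file.startswith(prefix) for prefix in prefixes):
        filemerger.append(os.path.join(folder, file))

filemerger.write("combinedpdfs.pdf")
filemerger.close()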

Read multiple files and save to xls in columns (PyPDF2 and xlsxwriter)

I need to take a directory with multiple PDFs and structure the contents into an xls, but I don't understand how to list the files in the directory and save the data to the xls.
import PyPDF2
import xlsxwriter

#---------------------Input file-----------------------------------#
pdf_file = open('arquivo_file', 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
page = read_pdf.getPage(0)
doc = read_pdf.getOutlines()
page_content = page.extractText()
text = page_content.replace("\n", " ").replace("\t", " ").replace(" ", "")
content = page_content.split("\n")
data = content[0]
# the original snippet never created these; assuming a new workbook here
workbook = xlsxwriter.Workbook('output.xlsx')
worksheet = workbook.add_worksheet()
worksheet.write(1, 1, data)
workbook.close()
Typically your code would resemble something like this.
import os
import glob

DIRPATH = "/path/to/your/pdf/directory"

# Get list of files with extension .pdf in a given directory
pdf_filepaths = glob.glob(os.path.join(DIRPATH, '*.pdf'))

# Loop over the pdf file-paths
# For each pdf-file:
#   1. read each pdf file
#   2. process the content you read (optional)
#   3. save the processed content to excel file
for i, pdf_filepath in enumerate(pdf_filepaths):
    content = read_pdf_file(pdf_filepath)
    content = process_data(content)
    write_excel_file(filepath=f'out_{i}.xlsx', content=content)
Here I assumed you would have wrapped your reading, processing and writing logic in three functions:
def read_pdf_file(filepath):
    # your pdf reading logic goes here
    ...
    return content

def process_data(content):
    # your post-reading data-processing logic goes here
    ...
    return content

def write_excel_file(filepath, content):
    # your logic for writing to excel-file goes here
    ...
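As a rough illustration of those stubs (untested, assuming the PyPDF2 1.x and xlsxwriter APIs used elsewhere on this page, and mirroring the question's first-page, first-line processing):

import PyPDF2
import xlsxwriter

def read_pdf_file(filepath):
    # read the text of the first page, as the question's snippet did
    with open(filepath, 'rb') as f:
        reader = PyPDF2.PdfFileReader(f)
        content = reader.getPage(0).extractText()
    return content

def process_data(content):
    # keep only the first line, as the question's snippet did
    return content.split("\n")[0]

def write_excel_file(filepath, content):
    # write the processed content into cell B2 of a fresh workbook
    workbook = xlsxwriter.Workbook(filepath)
    worksheet = workbook.add_worksheet()
    worksheet.write(1, 1, content)
    workbook.close()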

Get PDF attachments using Python

I admit that I am new to Python.
We have to process PDF files with attachments or annotated attachments. I am trying to extract attachments from a PDF file using PyPDF2 library.
The only (!) example found on GitHub contains the following code:
import PyPDF2

def getAttachments(reader):
    catalog = reader.trailer["/Root"]
    # VK
    print(catalog)
    #
    fileNames = catalog['/Names']['/EmbeddedFiles']['/Names']
And the call is:
rootdir = "C:/Users/***.pdf" # My file path
handler = open(rootdir, 'rb')
reader = PyPDF2.PdfFileReader(handler)
dictionary = getAttachments(reader)
I am getting a KeyError: '/EmbeddedFiles'
A print of the catalog indeed does not contain EmbeddedFiles:
{'/Extensions': {'/ADBE': {'/BaseVersion': '/1.7', '/ExtensionLevel': 3}}, '/Metadata': IndirectObject(2, 0), '/Names': IndirectObject(5, 0), '/OpenAction': IndirectObject(6, 0), '/PageLayout': '/OneColumn', '/Pages': IndirectObject(3, 0), '/PieceInfo': IndirectObject(7, 0), '/Type': '/Catalog'}
This particular PDF contains 9 attachments. How can I get them?
Too long for a comment, and I have not personally tested this code, which looks very similar to the outline in your question; I am adding it here for others to test. It is the subject of a pull request https://github.com/mstamy2/PyPDF2/pull/440 and here is the full updated sequence as described by Kevin M Loeffler in https://kevinmloeffler.com/2018/07/08/how-to-extract-pdf-file-attachments-using-python-and-pypdf2/
Viewable at https://gist.github.com/kevinl95/29a9e18d474eb6e23372074deff2df38#file-extract_pdf_attachments-py
Download as
https://gist.github.com/kevinl95/29a9e18d474eb6e23372074deff2df38/raw/acdc194058f9fa2c4d2619a4c623d0efeec32555/extract_pdf_attachments.py
It always helps if you can provide an example input of the type you have problems with so that others can adapt the extraction routine to suit.
In response to getting an error:
"I’m guessing the script is breaking because the embedded files section of the PDF doesn’t always exist so trying to access it throws an error."
"Something I would try is to put everything after the ‘catalog’ line in the get_attachments method in a try-catch."
Unfortunately there are many pending pull requests not yet merged into PyPDF2 (https://github.com/mstamy2/PyPDF2/pulls), and others may also be relevant or needed to address this and other shortcomings, so you should check whether any of those also help.
For one pending example of a try/catch that you might be able to include and adapt for other use cases, see https://github.com/mstamy2/PyPDF2/pull/551/commits/9d52ef517319b538f007669631ba6b778f8ec3a3
Associated keywords for embedded files, apart from /Type /EmbeddedFiles, include /Type /Filespec and /Subtype /FileAttachment. Note the pairs may not always have spaces, so perhaps see if those can be interrogated for the attachments.
Again, on that last point: the example searches for /EmbeddedFiles as indexed in the plural, whilst any individual entry itself is identified as singular.
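Putting the try-catch suggestion above into code, a minimal sketch (untested, PyPDF2 1.x API) that simply returns an empty dict when the document-level /EmbeddedFiles entry is absent:

import PyPDF2

def get_attachments_safe(reader):
    # returns {filename: bytes}; empty when nothing is embedded at document level
    attachments = {}
    catalog = reader.trailer["/Root"]
    try:
        # this key chain only exists when the PDF embeds files at document level
        file_names = catalog['/Names']['/EmbeddedFiles']['/Names']
    except KeyError:
        return attachments
    # the /Names array alternates: name string, then the file-spec object
    for i in range(0, len(file_names), 2):
        name = file_names[i]
        file_spec = file_names[i + 1].getObject()
        attachments[name] = file_spec['/EF']['/F'].getData()
    return attachments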
This can be improved but it was tested to work (using PyMuPDF).
It detects corrupted PDF files, encryption, attachments, annotations and portfolios.
I am yet to compare the output with our internal classification.
It produces a semicolon-separated file that can be imported into Excel.
import fitz  # = PyMuPDF
import os

outfile = open("C:/Users/me/Downloads/testPDF3.txt", "w", encoding="utf-8")
folder = "C:/Users/me/Downloads"
print("filepath;", "encrypted;", "pages;", "embedded;", "attachments;", "annotations;", "portfolio", file=outfile)
enc = pages = count = names = annots = collection = ''
for subdir, dirs, files in os.walk(folder):
    for file in files:
        #print(os.path.join(subdir, file))
        filepath = subdir + os.sep + file
        if filepath.endswith(".pdf"):
            #print(filepath, file=outfile)
            try:
                doc = fitz.open(filepath)
                enc = doc.is_encrypted
                #print("Encrypted? ", enc, file=outfile)
                pages = doc.page_count
                #print("Number of pages: ", pages, file=outfile)
                count = doc.embfile_count()
                #print("Number of embedded files:", count, file=outfile)  # shows number of embedded files
                names = doc.embfile_names()
                #print("Embedded files:", str(names), file=outfile)
                #if count > 0:
                #    for emb in names:
                #        print(doc.embfile_info(emb), file=outfile)
                annots = doc.has_annots()
                #print("Has annots?", annots, file=outfile)
                links = doc.has_links()
                #print("Has links?", links, file=outfile)
                trailer = doc.pdf_trailer()
                #print("Trailer: ", trailer, file=outfile)
                xreflen = doc.xref_length()  # length of objects table
                for xref in range(1, xreflen):  # skip item 0!
                    #print("", file=outfile)
                    #print("object %i (stream: %s)" % (xref, doc.is_stream(xref)), file=outfile)
                    #print(doc.xref_object(i, compressed=False), file=outfile)
                    if "Collection" in doc.xref_object(xref, compressed=False):
                        #print("Portfolio", file=outfile)
                        collection = 'True'
                        break
                    else:
                        collection = "False"
                    #print(doc.xref_object(xref, compressed=False), file=outfile)
            except:
                #print("Not a valid PDF", file=outfile)
                enc = pages = count = names = annots = collection = "Not a valid PDF"
            print(filepath, ";", enc, ";", pages, ";", count, ";", names, ";", annots, ";", collection, file=outfile)
outfile.close()
I was also running into the same problem with several PDFs that I have. I was able to make these changes to the referenced code, which got it to work for me:
import PyPDF2

def getAttachments(reader):
    """
    Retrieves the file attachments of the PDF as a dictionary of file names
    and the file data as a bytestring.

    :return: dictionary of filenames and bytestrings
    """
    attachments = {}
    # First, get those that are pdf attachments
    catalog = reader.trailer["/Root"]
    if "/EmbeddedFiles" in catalog["/Names"]:
        fileNames = catalog['/Names']['/EmbeddedFiles']['/Names']
        for f in fileNames:
            if isinstance(f, str):
                name = f
                dataIndex = fileNames.index(f) + 1
                fDict = fileNames[dataIndex].getObject()
                fData = fDict['/EF']['/F'].getData()
                attachments[name] = fData
    # Next, go through all pages and all annotations to those pages
    # to find any attached files
    for pagenum in range(0, reader.getNumPages()):
        page_object = reader.getPage(pagenum)
        if "/Annots" in page_object:
            for annot in page_object['/Annots']:
                annotobj = annot.getObject()
                if annotobj['/Subtype'] == '/FileAttachment':
                    fileobj = annotobj["/FS"]
                    attachments[fileobj["/F"]] = fileobj["/EF"]["/F"].getData()
    return attachments

handler = open(filename, 'rb')
reader = PyPDF2.PdfFileReader(handler)
dictionary = getAttachments(reader)
for fName, fData in dictionary.items():
    with open(fName, 'wb') as outfile:
        outfile.write(fData)
I know it is a late reply, but I only started looking into this yesterday. I have used the PyMuPDF library to extract the embedded files. Here is my code:
import os
import fitz

def get_embedded_pdfs(input_pdf_path, output_path=None):
    input_path = "/".join(input_pdf_path.split('/')[:-1])
    if not output_path:
        output_path = input_pdf_path.split(".")[0] + "_embeded_files/"
    if output_path not in os.listdir(input_path):
        os.mkdir(output_path)
    doc = fitz.open(input_pdf_path)
    item_name_dict = {}
    for each_item in doc.embfile_names():
        item_name_dict[each_item] = doc.embfile_info(each_item)["filename"]
    for item_name, file_name in item_name_dict.items():
        out_pdf = output_path + file_name
        ## get embeded_file in bytes
        fData = doc.embeddedFileGet(item_name)
        ## save embeded file
        with open(out_pdf, 'wb') as outfile:
            outfile.write(fData)
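Calling it then looks like this (hypothetical paths; by default the attachments land in a folder named after the PDF, next to it):

# extract into <name>_embeded_files/ next to the source file
get_embedded_pdfs("C:/Users/me/Downloads/some_document.pdf")

# or direct the output to a specific folder
get_embedded_pdfs("C:/Users/me/Downloads/some_document.pdf",
                  output_path="C:/Users/me/Downloads/extracted/")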
disclaimer: I am the author of borb (the library used in this answer)
borb is an open-source, pure Python PDF library. It abstracts away most of the unpleasantness of dealing with PDF (such as having to deal with dictionaries and having to know PDF-syntax and structure).
There is a huge repository of examples, containing a section on dealing with embedded files, which you can find here.
I'll repeat the relevant example here for completeness:
import typing
from borb.pdf.document.document import Document
from borb.pdf.pdf import PDF

def main():
    # read the Document
    doc: typing.Optional[Document] = None
    with open("output.pdf", "rb") as pdf_file_handle:
        doc = PDF.loads(pdf_file_handle)

    # check whether we have read a Document
    assert doc is not None

    # retrieve all embedded files and their bytes
    for k, v in doc.get_embedded_files().items():
        # display the file name, and the size
        print("%s, %d bytes" % (k, len(v)))

if __name__ == "__main__":
    main()
After the Document has been read, you can simply ask it for a dict mapping the filenames to the bytes.
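If you also want the attachments on disk, a small untested extension of the same loop, writing each file out under its embedded name:

for file_name, file_bytes in doc.get_embedded_files().items():
    # write each attachment out under its embedded filename
    with open(file_name, "wb") as fh:
        fh.write(file_bytes)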

Run a python script on all files in a directory

First time posting a question here; hopefully someone who has experienced/tried this will share their insights... I've been working to get this far over the last few days and nights, and now I am getting nowhere trying to loop this script over every file in a directory.
Basically, these two scripts work perfectly fine: they take a PDF file and turn it into an Excel workbook. Now what I need to do is go through all files in a selected directory and do the same job.
I keep getting stuck at the file-opening stage. Is this saying that the data (the pdf page, data[0]) can't be called in? Or should I add more stages to bring the dataset in?
Do I have to create a list for the dataset so I can call in the data, since there is more than one dataset to call in? Is this why Python can't read data[0]?
Revised Script
# import
import os
import glob
import pdftotext
import openpyxl
from pathlib import Path
from string import ascii_uppercase

# open a pdf file
def to_excel(pdf_file):
    with open(pdf_file, 'rb') as f:
        data = pdftotext.PDF(f)
    # operate data to get titles, values
    datas = data[0].split('\r\n')
    finalData = list()
    for item in datas:
        if item != '':
            finalData.append(item)
    finalDataRefined = list()
    for item in finalData:
        if item != ' BCA Scheduled Maintenance Questions' and item != ' Do you suspect there is Asbestos at the property?' and item != ' Yes' and item != ' No' and item != '\x0c':
            finalDataRefined.append(item.strip())
    titles = list()
    values = list()
    for num, item in enumerate(finalDataRefined):
        if num % 2 == 0:
            titles.append(item)
        else:
            values.append(item)
    # get an output file name
    OPRAST = values[1]
    filename = work_dir / f"{OPRAST}.xlsx"
    # create an excel workbook
    excel_file = openpyxl.Workbook()
    excel_sheet = excel_file.active
    excel_sheet.append([])
    alphaList = list(ascii_uppercase)
    for alphabet in alphaList:
        excel_sheet.column_dimensions[alphabet].width = 20
    excel_sheet.append(titles)
    excel_sheet.append(values)
    # save the excel workbook
    excel_file.save(filename)
    excel_file.close()

# run a python script on every file in a directory
alphaList = list(ascii_uppercase)
work_dir = Path(r"C:\Users\Sunny Kim\Downloads\Do Forms")
for pdf_file in work_dir.glob("*.pdf"):
    to_excel(pdf_file)
I basically know what you want to do, but your code's indentation is not very readable, especially since this is Python.
Is your goal to create an Excel file for each PDF file in a given directory, or to aggregate all the PDF files together into a single Excel file?
The following code is for the first goal.
Code logic:
get all the pdf files
loop over all the pdf files; for each:
open the pdf file
do some operations
export to an excel file
Your full code may look like this (just a guess):
# ----------------import part-------------------
import os
import glob
import pdftotext
import openpyxl
from string import ascii_uppercase
from pathlib import Path

def to_excel(pdf_file):
    with open(pdf_file, 'rb') as f:  # this opens the pdf file
        data = pdftotext.PDF(f)
    # ---------------operate the data, get title and value-----------
    datas = data[0].split('\r\n')
    finalData = list()
    for item in datas:
        if item != '':
            finalData.append(item)
    finalDataRefined = list()
    for item in finalData:
        if item != ' BCA Scheduled Maintenance Questions' and item != ' Do you suspect there is Asbestos at the property?' and item != ' Yes' and item != ' No' and item != '\x0c':
            finalDataRefined.append(item.strip())
    titles = list()
    values = list()
    for num, item in enumerate(finalDataRefined):
        if num % 2 == 0:
            titles.append(item)
        else:
            values.append(item)
    # ------------------get output file name---------------------
    OPRAST = values[1]
    filename = work_dir / f"{OPRAST}.xlsx"
    # ------------------create excel file sheet------------------
    excel_file = openpyxl.Workbook()
    excel_sheet = excel_file.active
    excel_sheet.append([])
    alphaList = list(ascii_uppercase)
    for alphabet in alphaList:
        excel_sheet.column_dimensions[alphabet].width = 20
    excel_sheet.append(titles)
    excel_sheet.append(values)
    # --------------------save----------------
    excel_file.save(filename)
    excel_file.close()

# -------------------main program---------------
alphaList = list(ascii_uppercase)
work_dir = Path(r"C:\Users\Sunny Kim\Downloads\Do Forms")
for pdf_file in work_dir.glob("*.pdf"):
    to_excel(pdf_file)

Merging PDFs while retaining custom page numbers (aka pagelabels) and bookmarks

I'm trying to automate merging several PDF files and have two requirements: a) existing bookmarks AND b) pagelabels (custom page numbering) need to be retained.
Retaining bookmarks when merging happens by default with PyPDF2 and pdftk, but not with pdfrw.
Pagelabels are consistently not retained in PyPDF2, pdftk or pdfrw.
I am guessing, after having searched a lot, that there is no straightforward approach to doing what I want. If I'm wrong then I hope someone can point to this easy solution. But, if there is no easy solution, any tips on how to get this going in python will be much appreciated!
Some example code:
1) With PyPDF2
from PyPDF2 import PdfFileWriter, PdfFileMerger, PdfFileReader
tmp1 = PdfFileReader('file1.pdf', 'rb')
tmp2 = PdfFileReader('file2.pdf', 'rb')
#extracting pagelabels is easy
pl1 = tmp1.trailer['/Root']['/PageLabels']
pl2 = tmp2.trailer['/Root']['/PageLabels']
#but PdfFileWriter or PdfFileMerger does not support writing from what I understand
So I don't know how to proceed from here.
2) With pdfrw (has more promise)
from pdfrw import PdfReader, PdfWriter
writer = PdfWriter()
#read 1st file
tmp1 = PdfReader('file1')
#add the pages
writer.addpages(tmp1.pages)
#copy bookmarks to writer
writer.trailer.Root.Outlines = tmp1.Root.Outlines
#copy pagelabels to writer
writer.trailer.Root.PageLabels = tmp1.Root.PageLabels
#read second file
tmp2 = PdfReader('file2')
#append pages
writer.addpages(tmp2.pages)
# so far so good
Page numbers of bookmarks from the 2nd file need to be offset before adding them, but when reading outlines I almost always get (IndirectObject, XXX) instead of page numbers. It's unclear how to get page numbers for each label and bookmark using pdfrw. So I'm stuck again.
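Not a pdfrw answer, but for the bookmark half of the problem: in PyPDF2 those (IndirectObject, XXX) destinations can be resolved to zero-based page indices with PdfFileReader.getDestinationPageNumber(). An untested sketch (assuming file2.pdf from above):

from PyPDF2 import PdfFileReader

def walk_outlines(outlines, reader):
    # getOutlines() returns a nested list: Destination objects,
    # with sub-lists holding the children of the preceding entry
    for item in outlines:
        if isinstance(item, list):
            walk_outlines(item, reader)
        else:
            # resolve the destination to a zero-based page index
            print(item.title, reader.getDestinationPageNumber(item))

reader = PdfFileReader('file2.pdf')
walk_outlines(reader.getOutlines(), reader)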
As mentioned in my comment, I'm posting a generic solution to merge several PDFs that works in PyPDF2. I don't know what makes this work in PyPDF2 (where the other answer needed PyPDF4), other than initializing pls as an ArrayObject().
from PyPDF2 import PdfFileWriter, PdfFileMerger, PdfFileReader
import PyPDF2.pdf as PDF

# pls holds all the pagelabels as we iterate through multiple pdfs
pls = PDF.ArrayObject()
# used to offset bookmarks
pageCount = 0
cpdf = PdfFileMerger()

# pdffiles is a list of all files to be merged
for i in range(len(pdffiles)):
    tmppdf = PdfFileReader(pdffiles[i], 'rb')
    cpdf.append(tmppdf)
    # copy all the pagelabels which I assume is present in all files
    # you could use 'try' in case no pagelabels are present
    plstmp = tmppdf.trailer['/Root']['/PageLabels']['/Nums']
    # sometimes keys are indirect objects
    # so, iterate through each pagelabel and...
    for j in range(len(plstmp)):
        # ... get the actual values
        plstmp[j] = plstmp[j].getObject()
        # offset pagenumbers by current count of pages
        if isinstance(plstmp[j], int):
            plstmp[j] = PDF.NumberObject(plstmp[j] + pageCount)
    # once all the pagelabels are processed I append to pls
    pls += plstmp
    # increment pageCount
    pageCount += tmppdf.getNumPages()

# rest follows KevinM's answer
pagenums = PDF.DictionaryObject()
pagenums.update({PDF.NameObject('/Nums'): pls})
pagelabels = PDF.DictionaryObject()
pagelabels.update({PDF.NameObject('/PageLabels'): pagenums})
cpdf.output._root_object.update(pagelabels)
cpdf.write("filename.pdf")
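For completeness, pdffiles above is assumed to be a plain list of paths, e.g.:

# hypothetical input list; any sequence of PDF paths works
pdffiles = ['file1.pdf', 'file2.pdf', 'file3.pdf']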
You need to iterate through the existing PageLabels and add them to the merged output, taking care to add an offset to the page index entry, based on the number of pages already added.
This solution also requires PyPDF4, since PyPDF2 produces a weird error (see bottom).
from PyPDF4 import PdfFileWriter, PdfFileMerger, PdfFileReader
# To manipulate the PDF dictionary
import PyPDF4.pdf as PDF
import logging

def add_nums(num_entry, page_offset, nums_array):
    for num in num_entry['/Nums']:
        if isinstance(num, int):
            logging.debug("Found page number %s, offset %s: ", num, page_offset)
            # Add the physical page information
            nums_array.append(PDF.NumberObject(num + page_offset))
        else:
            # {'/S': '/r'}, or {'/S': '/D', '/St': 489}
            keys = num.keys()
            logging.debug("Found page label, keys: %s", keys)
            number_type = PDF.DictionaryObject()
            # Always copy the /S entry
            s_entry = num['/S']
            number_type.update({PDF.NameObject("/S"): PDF.NameObject(s_entry)})
            logging.debug("Adding /S entry: %s", s_entry)
            if '/St' in keys:
                # If there is an /St entry, fetch it
                pdf_label_offset = num['/St']
                # and add the new offset to it
                logging.debug("Found /St %s", pdf_label_offset)
                number_type.update({PDF.NameObject("/St"): PDF.NumberObject(pdf_label_offset)})
            # Add the label information
            nums_array.append(number_type)
    return nums_array

def write_merged(pdf_readers):
    # Output
    merger = PdfFileMerger()
    # For PageLabels information
    page_labels = []
    page_offset = 0
    nums_array = PDF.ArrayObject()
    # Iterate through all the inputs
    for pdf_reader in pdf_readers:
        try:
            # Merge the content
            merger.append(pdf_reader)
            # Handle the PageLabels
            # Fetch page information
            old_page_labels = pdf_reader.trailer['/Root']['/PageLabels']
            page_count = pdf_reader.getNumPages()
            # Add PageLabel information
            add_nums(old_page_labels, page_offset, nums_array)
            page_offset = page_offset + page_count
        except Exception as err:
            print("ERROR: %s" % err)
    # Add PageLabels
    page_numbers = PDF.DictionaryObject()
    page_numbers.update({PDF.NameObject("/Nums"): nums_array})
    page_labels = PDF.DictionaryObject()
    page_labels.update({PDF.NameObject("/PageLabels"): page_numbers})
    root_obj = merger.output._root_object
    root_obj.update(page_labels)
    # Write output
    merger.write('merged.pdf')

pdf_readers = []
tmp1 = PdfFileReader('file1.pdf', 'rb')
tmp2 = PdfFileReader('file2.pdf', 'rb')
pdf_readers.append(tmp1)
pdf_readers.append(tmp2)
write_merged(pdf_readers)
Note: PyPDF2 produces this weird error:
...
...
  File "/usr/lib/python3/dist-packages/PyPDF2/pdf.py", line 552, in _sweepIndirectReferences
    data[key] = value
  File "/usr/lib/python3/dist-packages/PyPDF2/generic.py", line 507, in __setitem__
    raise ValueError("key must be PdfObject")
ValueError: key must be PdfObject

Python - stuck at reading a specific row from a csv

I need to add columns to a "matched" shape file based on a csv. I have one last step to complete, which is to get the value to enter into the shp from the csv.
I get:
readCSV[rowID]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '_csv.reader' object is not subscriptable
The stripped-down CSV and the matching files are shown in screenshots (not reproduced here).
The code matches OVL_CAT + OVL2_DESC to the file name.
I then get the code to add a column called LGA_CODE and need to populate it with '583094', which is row 2, column 1 of the CSV. How do I get this value when I can't index FileList2 to get row 2 from the csv (row 3 in the example below, but 2 in Python)?
import os, sys, datetime, csv, arcpy, string
from subprocess import Popen
from itertools import islice

top = os.getcwd()  # change to a specific path if required.
# This otherwise starts with the directory the script is in (preferred).
RootOutput = r'P:\2012\273_CCRC_Townplanning_Datasets\Working\scratch'  # change if you want output somewhere else
top = RootOutput
SourceDIR = r'P:\2012\273_CCRC_Townplanning_Datasets\Working\scratch\OM_011'  # source of your data (subdirectories searched as well)
outDIR = top + "\\workingFiles"  # directory where output is written to. Includes temp files
finalDIR = top + "\\final"  # folder for final data only
AOI = 'AOI.csv'  # name of the file containing the las file names in the second column
Compare_Column = 2
Compare_Column2 = 3
# END setting base paths
# NOTHING BELOW should need editing.
FileTypes = ['shp']
SearchStrings = []
filecount = 0
List = []
count = 0
x = 0
os.chdir(top)

# Generate list with unique file name codes from CSV
FileList = csv.reader(open(AOI))
SearchStrings = []
rowID = 0
for File in FileList:
    #SearchStrings.append(File[0]+","+File[1])
    SearchStrings.append(str(rowID)+'_'+File[Compare_Column]+'_'+File[Compare_Column2])
    rowID = rowID + 1

for root, dirs, files in os.walk(SourceDIR, topdown=False):
    for fl in files:
        currentFile = os.path.join(root, fl)
        for FileType in FileTypes:
            status = str.endswith(currentFile, FileType)
            if str(status) == 'True':
                List.append(currentFile)
                for SearchString in SearchStrings:
                    #print currentFile
                    #print SearchString
                    if str(SearchString in currentFile) == 'True':
                        #print str(currentFile)+str(status)
                        List.append(currentFile)
                        filecount = filecount + 1
#del fl

# Get list of Column Names
headers_count = 1
with open(AOI) as fin:
    headers = list(islice(fin, headers_count))
delimiter = ','
header = str(headers)
header_list = header.split(delimiter)

# Process matching files
for fl in List:
    header_count = 0
    for header in header_list:
        dfStore = fl
        #arcpy.AddField_management(dfStore, str(header), 'TEXT')
        # Get RowID to read column data from
        filename = fl[fl.rfind('\\')+1:fl.rfind('_')]
        for field in SearchStrings:
            #print field, filename
            if field.endswith(filename):
                rowID = field[:field.find('_')]
                with open(AOI, 'rb') as f:
                    readCSV = csv.reader(f)
                    text = readCSV[rowID][1]
##                    arcpy.CalculateField_management(fl, header, text, "PYTHON_9.3")
=== UPDATED CODE BASED ON COMMENTS === It's all working fine, if anyone needs it.
import os, sys, datetime, csv, arcpy, string
from subprocess import Popen
from itertools import islice

top = os.getcwd()  # change to a specific path if required.
# This otherwise starts with the directory the script is in (preferred).
RootOutput = r'P:\2012\273_CCRC_Townplanning_Datasets\Working\scratch'  # change if you want output somewhere else
top = RootOutput
SourceDIR = r'P:\2012\273_CCRC_Townplanning_Datasets\Working\scratch\OM_011'  # source of your data (subdirectories searched as well)
outDIR = top + "\\workingFiles"  # directory where output is written to. Includes temp files
finalDIR = top + "\\final"  # folder for final data only
AOI = 'AOI.csv'  # name of the file containing the las file names in the second column
Compare_Column = 3
Compare_Column2 = 4
# END setting base paths
# NOTHING BELOW should need editing.
FileTypes = ['shp']
SearchStrings = []
filecount = 0
List = []
count = 0
x = 0
os.chdir(top)

# Generate list with unique file name codes from CSV
FileList = csv.reader(open(AOI))
SearchStrings = []
rows = []
#FinalList = []
rowID = 0
for File in FileList:
    #SearchStrings.append(File[0]+","+File[1])
    SearchStrings.append(str(rowID)+'_'+File[Compare_Column]+'_'+File[Compare_Column2])
    rows.append(File)
    #FinalList.append()
    rowID += 1

for root, dirs, files in os.walk(SourceDIR, topdown=False):
    for fl in files:
        currentFile = os.path.join(root, fl)
        for FileType in FileTypes:
            status = str.endswith(currentFile, FileType)
            if status:
                #List.append(currentFile)
                for SearchString in SearchStrings:
                    #print currentFile, SearchString
                    if str(SearchString[SearchString.find('_')+1:] in currentFile) == 'True':
                        #print str(currentFile)+str(status)
                        List.append(currentFile)
                        filecount = filecount + 1
#del fl

# Get list of Column Names
headers_count = 1
with open(AOI) as fin:
    headers = list(islice(fin, headers_count))
delimiter = ','
header = str(headers)
header_listT = header.split(delimiter)
header_list = []
for hdr in header_listT:
    header_list.append(arcpy.ValidateTableName(hdr)[:10])

# Process matching files
columnID = 1
for fl in List:
    header_count = 0
    for header in header_list:
        print header
        dfStore = fl
        try:
            arcpy.AddField_management(dfStore, str(header), 'TEXT')
        except:
            pass
        # Get RowID to read column data from
        filename = fl[fl.rfind('\\')+1:fl.rfind('_')]
        for field in SearchStrings:
            #print field, filename
            #print header, field
            if field.endswith(filename):
                #print 'FOUND......................'
                column_count = len(fl)
                if columnID < len(header_list):
                    rowID = int(field[:field.find('_')])
                    text = rows[rowID][columnID]
                    print filename, header, text
                    columnID += 1
                    arcpy.CalculateField_management(fl, header, "text", "PYTHON_9.3")
#arcpy.CalculateField_management("P:/2012/273_CCRC_Townplanning_Datasets/Working/scratch/OM_011/OM_011_Waterway_Envelopes_ccrc.shp","LGA_CODE","5","PYTHON","#")
Your problem is in these two lines:
readCSV= csv.reader(f)
text=readCSV[rowID][1]
csv.reader is an iterable over the lines of the file; it cannot be directly indexed. You could use islice to get the element you want (islice(readCSV, rowID, rowID+1).next()), though a neater solution would just be to store a dictionary mapping rowID to the AOI row when you read it the first time (in the SearchStrings loop):
FileList = csv.reader(open(AOI))
SearchStrings = []
rows = []
rowID = 0
for File in FileList:
    #SearchStrings.append(File[0]+","+File[1])
    SearchStrings.append(str(rowID)+'_'+File[Compare_Column]+'_'+File[Compare_Column2])
    rows.append(File)
    rowID = rowID + 1
... # later
rowID = int(field[:field.find('_')])
text = rows[rowID][1]
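If you prefer the dictionary mapping mentioned above, the same idea as a sketch (reusing the question's AOI, Compare_Column and Compare_Column2 variables):

rows_by_id = {}
SearchStrings = []
for rowID, File in enumerate(csv.reader(open(AOI))):
    SearchStrings.append(str(rowID)+'_'+File[Compare_Column]+'_'+File[Compare_Column2])
    rows_by_id[rowID] = File  # remember each row by its ID
... # later, with no re-reading of the CSV
rowID = int(field[:field.find('_')])
text = rows_by_id[rowID][1]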
