Python: Input and output filename matching - python

I'm trying to come up with a way for the filenames that I'm reading to have the same filename as what I'm writing. The code is currently reading the images and doing some processing. My output will be extracting the data from that process into a csv file. I want both the filenames to be the same. I've come across fname for matching, but that's for existing files.

So if your input file name is infile = 'myfile.jpg', do this:
my_outfile = ".".join(infile.split('.')[:-1]) + '.csv'
This splits infile into a list of parts that are separated by '.'. It then joins them back together without the last part, and appends '.csv'.
my_outfile will then be myfile.csv
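The same idea can be sketched with the standard library, which only touches the final extension and so copes with names that contain several dots. The helper name csv_name_for and the sample file names are made up for illustration:

```python
import os
from pathlib import Path

def csv_name_for(image_path):
    """Return the input path with its final extension swapped for .csv."""
    root, _ext = os.path.splitext(image_path)  # splits off only the last extension
    return root + ".csv"

print(csv_name_for("myfile.jpg"))
print(csv_name_for("scan.2021.v2.png"))

# pathlib offers the same operation as a method on Path objects
print(Path("images/myfile.jpg").with_suffix(".csv"))
```

Either form can be called once per input image, and the result passed to open() when writing the csv.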

Well, in Python it's possible to do that, but the original file might be corrupted if the output has exactly the same file name, i.e. writing BibleKJV.pdf to the path of BibleKJV.pdf will corrupt the first file. Take a look at this script to verify that I'm on the right track (if I'm totally off, disregard my answer):
import os
from PyPDF2 import PdfFileReader , PdfFileWriter
path = "C:/Users/Catrell Washington/Pride"
input_file_name = os.path.join(path, "BibleKJV.pdf")
input_file = PdfFileReader(open(input_file_name , "rb"))
output_PDF = PdfFileWriter()
total_pages = input_file.getNumPages()
for page_num in range(1,total_pages) :
    output_PDF.addPage(input_file.getPage(page_num))
output_file_name = os.path.join(path, "BibleKJV.pdf")
output_file = open(output_file_name , "wb")
output_PDF.write(output_file)
output_file.close()
When I ran the above script, I lost all the data in the original "BibleKJV.pdf", which shows that if the output has the same file name and the same file extension (.pdf, .cs, .word, etc.) as the input, then the data will be corrupted unless the changes are very minimal.
If this doesn't help, please edit your question with a script of what you're trying to achieve.
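If the output really must replace the input, one common way around the corruption is to read everything first, write the result to a temporary file in the same folder, and only then swap it in. This is a minimal sketch for plain bytes, not PDF-specific; rewrite_in_place and transform are hypothetical names:

```python
import os
import tempfile

def rewrite_in_place(path, transform):
    """Read path fully, then replace it with transform(data) via a temporary file."""
    with open(path, "rb") as src:
        data = src.read()  # the whole input is in memory before anything is written
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(transform(data))
        os.replace(tmp_path, path)  # swap the finished file in over the original
    except BaseException:
        os.remove(tmp_path)  # clean up the partial temporary file on failure
        raise
```

The original is only replaced once the new content has been written completely, so a failure part-way through leaves the input untouched.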

Related

Python: writing file to specific directory

This is likely a fundamental Python question, but I'm stumped (still learning). My script uses Pandas to create txt files from csv cells, and works properly. However, I'd like to write the files to a specific directory, listed as save_path below, and my efforts to put this together keep running into errors.
Here's my (not) working code:
import os
import pandas as pd
save_path = "C:\users\name\folder\txts"
df= pd.read_csv("C:\users\name\folder\test.csv", sep=",")
df2 = df.fillna('')
for index in range(len(df)):
    with open(df2["text_number"][index] + '.txt', 'w') as output:
        output2 = os.path.join(save_path, output) # I'm uncertain how to structure or place the os.path.join command.
        output2.write(df2["text"][index])
The resulting error is below:
TypeError: join() argument must be str, bytes, or os.PathLike object, not 'TextIOWrapper'
Thoughts? Any assistance is greatly appreciated.
You need to first generate the file name and then open it in write mode to put the contents.
for index in range(len(df)):
    # create file name
    filename = df2["text_number"][index] + '.txt'
    # then generate full path using os lib
    full_path = os.path.join(save_path, filename)
    # now open that file; 'w+' (like 'w') creates the file if it doesn't exist
    with open(full_path, 'w+') as output_file_handler:
        # and write the contents
        output_file_handler.write(df2["text"][index])
This should work.
(But you might want to check out this answer)
for index in range(len(df)):
    filename = df2["text_number"][index] + '.txt'
    fp = os.path.join(save_path, filename)
    with open(fp, 'w') as output:
        output.write(df2["text"][index])
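For reference, here is a self-contained sketch of the same pattern that runs without pandas. The rows list is a made-up stand-in for the DataFrame (with pandas, df2.to_dict("records") would give the same shape), write_texts is a hypothetical helper name, and a temporary folder stands in for save_path:

```python
import os
import tempfile

def write_texts(rows, save_path):
    """Write one .txt file per row into save_path and return the paths written."""
    os.makedirs(save_path, exist_ok=True)  # create the target folder if needed
    written = []
    for row in rows:
        # str() guards against numeric IDs, which cannot be concatenated with '.txt'
        filename = str(row["text_number"]) + ".txt"
        full_path = os.path.join(save_path, filename)
        with open(full_path, "w", encoding="utf-8") as handle:
            handle.write(row["text"])
        written.append(full_path)
    return written

# Hypothetical sample data in place of the csv contents
rows = [{"text_number": 1, "text": "first"}, {"text_number": 2, "text": "second"}]
target = os.path.join(tempfile.mkdtemp(), "txts")  # stand-in for save_path
paths = write_texts(rows, target)
print(paths)
```

On Windows, writing the folder as a raw string such as r"C:\users\name\folder\txts" keeps Python from interpreting sequences like \n and \t inside the path as escape characters.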

Edit multiple text files, and save as new files

My first post on StackOverflow, so please be nice. In other words, a super beginner to Python.
So I want to read multiple files from a folder, divide the text and save the output as a new file. I currently have figured out this part of the code, but it only works on one file at a time. I have tried googling but can't figure out a way to use this code on multiple text files in a folder and save it as "output" + a number, for each file in the folder. Is this something that's doable?
with open("file_path") as fReader:
    corpus = fReader.read()
    loc = corpus.find("\n\n")
    print(corpus[:loc], file=open("output.txt","a"))
Possibly work with a list, like:
from pathlib import Path
source_dir = Path("./") # path to the directory
files = list(x for x in source_dir.iterdir() if x.is_file())
for i in range(len(files)):
    file = Path(files[i])
    outfile = "output_" + str(i) + file.suffix
    with open(file) as fReader, open(outfile, "w") as fOut:
        corpus = fReader.read()
        loc = corpus.find("\n\n")
        fOut.write(corpus[:loc])
** sorry for multiple editing....
Welcome to the site. Yes, what you are asking above is completely doable and you are on the right track. You will need to do a little research/practice with the os module, which is highly useful when working with files. The two commands that you will want to research a bit are:
os.path.join()
os.listdir()
I would suggest you put two folders next to your Python file, one called data and the other called output to catch the results. Start by seeing if you can make the code list all the files in your data directory, then keep building on that loop. Something like this should list all the files:
# folder file lister/test writer
import os
source_folder_name = 'data' # the folder to be read that is in the SAME directory as this file
output_folder_name = 'output' # will be used later...
files = os.listdir(source_folder_name)
# get this working first
for f in files:
    print(f)
# make output file names and just write a 1-liner into each file...
for f in files:
    output_filename = f.split('.')[0] # the part before the first period
    output_filename += '_output.csv'
    output_path = os.path.join(output_folder_name, output_filename)
    with open(output_path, 'w') as writer:
        writer.write('some data')
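Putting the pieces from the answers above together, a sketch of the whole loop might look like this. The function name first_paragraphs and the output_<n>.txt naming are assumptions for illustration, and the output folder is assumed to be separate from the source folder:

```python
from pathlib import Path

def first_paragraphs(source_dir, output_dir):
    """Write the text before the first blank line of each file to output_<n>.txt."""
    source_dir, output_dir = Path(source_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    # sorted() makes the numbering reproducible between runs
    files = sorted(p for p in source_dir.iterdir() if p.is_file())
    outputs = []
    for number, path in enumerate(files, start=1):
        corpus = path.read_text(encoding="utf-8")
        # partition() returns the whole text when there is no blank line
        head, _sep, _rest = corpus.partition("\n\n")
        target = output_dir / f"output_{number}.txt"
        target.write_text(head, encoding="utf-8")
        outputs.append(target)
    return outputs
```

One difference from corpus[:corpus.find("\n\n")]: when a file has no blank line, find() returns -1 and the slice silently drops the last character, whereas partition() keeps the full text.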

Python script not combining csv files

I am trying to combine over 100,000 CSV files (all in the same format) in a folder using the script below. Each CSV file is on average 3-6KB in size. When I run this script, it opens exactly 47 .csv files and combines them. When I re-run it, it combines the same .csv files, not all of them. I don't understand why it is doing that.
import os
import glob
os.chdir("D:\Users\Bop\csv")
want_header = True
out_filename = "combined.files.csv"
if os.path.exists(out_filename):
    os.remove(out_filename)
read_files = glob.glob("*.csv")
with open(out_filename, "w") as outfile:
    for filename in read_files:
        with open(filename) as infile:
            if want_header:
                outfile.write('{},Filename\n'.format(next(infile).strip()))
                want_header = False
            else:
                next(infile)
            for line in infile:
                outfile.write('{},{}\n'.format(line.strip(), filename))
Firstly check the length of read_files:
read_files = glob.glob("*.csv")
print(len(read_files))
Note that glob isn't necessarily recursive, as described in this SO question.
Otherwise your code looks fine. You may want to consider using the csv module, but note that you need to adjust the field size limit with really large files.
Are you sure all your filenames end with .csv? If every file in this directory contains what you need, then open all of them without filtering:
glob.glob('*')
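As a sketch of how the suggestions above could be combined, the following version searches subfolders as well, accepts any capitalisation of the .csv extension, and uses the csv module instead of string formatting. The function name combine_csvs is hypothetical, and all input files are assumed to share one header row:

```python
import csv
from pathlib import Path

def combine_csvs(folder, out_path):
    """Merge every .csv under folder (recursively) into out_path; return the file count."""
    out_path = Path(out_path).resolve()
    # rglob descends into subfolders; the suffix test ignores case (.CSV, .Csv)
    sources = sorted(
        p for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() == ".csv" and p.resolve() != out_path
    )
    header_written = False
    with open(out_path, "w", newline="", encoding="utf-8") as outfile:
        writer = csv.writer(outfile)
        for source in sources:
            with open(source, newline="", encoding="utf-8") as infile:
                reader = csv.reader(infile)
                header = next(reader, None)
                if header is None:  # empty file: nothing to copy
                    continue
                if not header_written:
                    writer.writerow(header + ["Filename"])
                    header_written = True
                for row in reader:
                    writer.writerow(row + [source.name])
    return len(sources)
```

Comparing the returned count with the number of files expected in the folder is a quick way to see whether the file search, rather than the merge loop, is where files go missing.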

Python csv : writing to a different directory

I'm downloading files from a site and I need to save the original file, then open it and then add the url that the file was downloaded from and the date of the download to the file before saving the file to a different directory.
I've used this answer to amend the csv: how to Add New column in beginning of CSV file by Python
but I'm struggling to redirect the file to a different directory before the write() function is called.
Is the best answer to write the file and then move it, or is there a way to write the file to a different directory within the open() function?
if fileName in fileList:
    print "already got file "+ fileName
else:
    # download the file
    urllib.urlretrieve(csvUrl, os.path.basename(fileName))
    #print "Saving to 1_Downloaded "+ fileName
    # open the file and then add the extra columns
    with open(fileName, 'rb') as inf, open("out_"+fileName, 'wb') as outf:
        csvreader = csv.DictReader(inf)
        # add column names to beginning
        fieldnames = ['url_source','downloaded_at'] + csvreader.fieldnames
        csvwriter = csv.DictWriter(outf, fieldnames)
        csvwriter.writeheader()
        for node, row in enumerate(csvreader, 1):
            csvwriter.writerow(dict(row, url_source=csvUrl, downloaded_at=today))
I believe both would work.
To me it seems the neatest way to do it would be to append to the file and relocate it afterwards.
Have a look at:
shutil.move
I believe rewriting the entire file would be less efficient.
It's not necessary to rebuild the file. Try using the time module to create a timestamp string for your file name, and os.rename to move your file.
Example - this just moves the file to your specified location:
os.rename('filename.csv','NEW_dir/filename.csv')
Hope this helps.
Went with an additional routine using shutil in the end:
# move and rename the 'out_' files to the right dir
source = os.listdir(downloaded)
for files in source:
    if files.startswith('out_'):
        newName = files.replace('out_','')
        newPath = renamed+'/'+newName
        shutil.move(files,newPath)
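On the original question of whether open() can write straight into another directory: it can, as long as it is given the full destination path. Below is a Python 3 sketch of that alternative to writing and then moving. annotate_csv and its parameters are hypothetical names, and the destination folder is assumed to differ from the source folder:

```python
import csv
import os

def annotate_csv(src_path, dest_dir, url, downloaded_at):
    """Copy src_path into dest_dir, prepending url_source and downloaded_at columns."""
    os.makedirs(dest_dir, exist_ok=True)
    # build the destination path first, then hand it to open()
    dest_path = os.path.join(dest_dir, os.path.basename(src_path))
    with open(src_path, newline="") as inf, open(dest_path, "w", newline="") as outf:
        reader = csv.DictReader(inf)
        fieldnames = ["url_source", "downloaded_at"] + list(reader.fieldnames or [])
        writer = csv.DictWriter(outf, fieldnames=fieldnames)
        writer.writeheader()
        for row in reader:
            writer.writerow(dict(row, url_source=url, downloaded_at=downloaded_at))
    return dest_path
```

Because the output keeps the original file name, no 'out_' prefix has to be stripped afterwards, and the downloaded original stays untouched in its own folder.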

pypdf Merging multiple pdf files into one pdf

I have 1000+ pdf files that need to be merged into one pdf:
from PyPDF2 import PdfReader, PdfWriter
writer = PdfWriter()
for i in range(1000):
    filepath = f"my/pdfs/{i}.pdf"
    reader = PdfReader(open(filepath, "rb"))
    for page in reader.pages:
        writer.add_page(page)
with open("document-output.pdf", "wb") as fh:
    writer.write(fh)
When I execute the above code, the line reader = PdfReader(open(filepath, "rb")) fails with this error message:
IOError: [Errno 24] Too many open files:
I think this is a bug. If not, what should I do?
I recently came across this exact same problem, so I dug into PyPDF2 to see what's going on, and how to resolve it.
Note: I am assuming that filename is a well-formed file path string. Assume the same for all of my code
The Short Answer
Use the PdfFileMerger() class instead of the PdfFileWriter() class. I've tried to make the following resemble your code as closely as I could:
from PyPDF2 import PdfFileMerger, PdfFileReader
[...]
merger = PdfFileMerger()
for filename in filenames:
    merger.append(PdfFileReader(file(filename, 'rb')))
merger.write("document-output.pdf")
The Long Answer
The way you're using PdfFileReader and PdfFileWriter is keeping each file open, and eventually causing Python to generate IOError 24. To be more specific, when you add a page to the PdfFileWriter, you are adding references to the page in the open PdfFileReader (hence the noted IO Error if you close the file). Python detects the file to still be referenced and doesn't do any garbage collection / automatic file closing despite re-using the file handle. They remain open until PdfFileWriter no longer needs access to them, which is at the final write() call in your code.
To solve this, create copies in memory of the content, and allow the file to be closed. I noticed in my adventures through the PyPDF2 code that the PdfFileMerger() class already has this functionality, so instead of re-inventing the wheel, I opted to use it instead. I learned, though, that my initial look at PdfFileMerger wasn't close enough, and that it only created copies in certain conditions.
My initial attempts looked like the following, and were resulting in the same IO Problems:
merger = PdfFileMerger()
for filename in filenames:
    merger.append(filename)
merger.write(output_file_path)
Looking at the PyPDF2 source code, we see that append() requires fileobj to be passed, and then uses the merge() function, passing in its last page as the new file's position. merge() does the following with fileobj (before opening it with PdfFileReader(fileobj)):
if type(fileobj) in (str, unicode):
    fileobj = file(fileobj, 'rb')
    my_file = True
elif type(fileobj) == file:
    fileobj.seek(0)
    filecontent = fileobj.read()
    fileobj = StringIO(filecontent)
    my_file = True
elif type(fileobj) == PdfFileReader:
    orig_tell = fileobj.stream.tell()
    fileobj.stream.seek(0)
    filecontent = StringIO(fileobj.stream.read())
    fileobj.stream.seek(orig_tell)
    fileobj = filecontent
    my_file = True
We can see that the append() option does accept a string, and when doing so, assumes it's a file path and creates a file object at that location. The end result is the exact same thing we're trying to avoid. A PdfFileReader() object holding open a file until the file is eventually written!
However, if we either make a file object of the file path string or a PdfFileReader (see Edit 2) object of the path string before it gets passed into append(), it will automatically create a copy for us as a StringIO object, allowing Python to close the file.
I would recommend the simpler merger.append(file(filename, 'rb')), as others have reported that a PdfFileReader object may stay open in memory, even after calling writer.close().
Hope this helped!
EDIT: I assumed you were using PyPDF2, not PyPDF. If you aren't, I highly recommend switching, as PyPDF is no longer maintained with the author giving his official blessings to Phaseit in developing PyPDF2.
If for some reason you cannot swap to PyPDF2 (licensing, system restrictions, etc.) then PdfFileMerger won't be available to you. In that situation you can re-use the code from PyPDF2's merge function (provided above) to create a copy of the file as a StringIO object, and use that in your code in place of the file object.
EDIT 2: Previous recommendation of using merger.append(PdfFileReader(file(filename, 'rb'))) changed based on comments (Thanks @Agostino).
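The core of the approach described above, copying each file's content into memory so the file on disk can be closed right away, can be sketched in Python 3 without any PDF library. load_into_memory is a hypothetical helper, and whether a given PDF reader or merger accepts such an in-memory buffer in place of an open file is an assumption to check against that library:

```python
import io

def load_into_memory(path):
    """Return the file's bytes wrapped in a BytesIO; the on-disk file is closed on return."""
    with open(path, "rb") as handle:
        # read() copies the content; the with block then closes the file descriptor
        return io.BytesIO(handle.read())

def load_all(paths):
    """Load many files one after another, holding at most one file open at a time."""
    return [load_into_memory(p) for p in paths]
```

The trade-off is memory: every input stays in RAM until the merged document has been written, which is usually acceptable for many small files but worth keeping in mind for large ones.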
The pdfrw package reads each file all in one go, so will not suffer from the problem of too many open files. Here is an example concatenation script.
The relevant part -- assumes inputs is a list of input filenames, and outfn is an output file name:
from pdfrw import PdfReader, PdfWriter
writer = PdfWriter()
for inpfn in inputs:
    writer.addpages(PdfReader(inpfn).pages)
writer.write(outfn)
Disclaimer: I am the primary pdfrw author.
The problem is that you are only allowed to have a certain number of files open at any given time. There are ways to change this (http://docs.python.org/3/library/resource.html#resource.getrlimit), but I don't think you need this.
What you could try is closing the files in the for loop:
output = PdfFileWriter()
for file in filenames:
    f = open(file, 'rb')
    input = PdfFileReader(f)
    # Some code
    f.close()
I have written this code to help with the answer:-
import sys
import os
import PyPDF2
merger = PyPDF2.PdfFileMerger()
#get PDFs files and path
path = sys.argv[1]
pdfs = sys.argv[2:]
os.chdir(path)
#iterate among the documents
for pdf in pdfs:
    try:
        #if doc exist then merge
        if os.path.exists(pdf):
            input = PyPDF2.PdfFileReader(open(pdf,'rb'))
            merger.append((input))
        else:
            print(f"problem with file {pdf}")
    except:
        print("cant merge !! sorry")
    else:
        print(f" {pdf} Merged !!! ")
merger.write("Merged_doc.pdf")
In this, I have used PyPDF2.PdfFileMerger and PyPDF2.PdfFileReader, instead of explicitly converting the file name to a file object.
It may be just what it says: you are opening too many files.
You may explicitly use f=file(filename) ... f.close() in the loop, or use the with statement. So that each opened file is properly closed.
