So I've created a form that includes the following item
<input type="file" name="form_file" multiple/>
This tells the browser to allow the user to select multiple files while browsing. The problem I am having is that when reading / writing the files that are being uploaded, I can only see the last of the files, not all of them. I was pretty sure I've seen this done before, but had no luck searching. Here's generally what my read looks like:
if request.FILES:
    filename = parent_id + str(random.randrange(0, 100))
    output_file = open(settings.PROJECT_PATH + "static/img/inventory/" + filename + ".jpg", "w")
    output_file.write(request.FILES["form_file"].read())
    output_file.close()
Now, as you can see I'm not looping through each file, because I've tried a few different ways and can't seem to find the other files (in objects and such)
I added in this print(request.FILES["form_file"]) and was only getting the last filename, as expected. Is there some trick to get to the other files? Am I stuck with a single file upload? Thanks!
Based on your file element form_file, all of the uploaded files are available as a list via request.FILES.getlist('form_file'). So you can do something like:
for upfile in request.FILES.getlist('form_file'):
    filename = upfile.name
    # instead of "filename" specify the full path and filename of your choice here
    fd = open(filename, 'wb')  # binary mode, since the uploads are image files
    fd.write(upfile.read())
    fd.close()
Using chunks:
for upfile in request.FILES.getlist('form_file'):
    filename = upfile.name
    fd = open(filename, 'w+')  # or 'wb+' for binary files
    for chunk in upfile.chunks():
        fd.write(chunk)
    fd.close()
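If you don't want to manage paths and file modes by hand, Django's storage API can do the chunked writing for you. A minimal sketch, assuming the same form_file field and that the default storage backend is acceptable for where the files should end up:

from django.core.files.storage import default_storage

for upfile in request.FILES.getlist('form_file'):
    # save() writes the file in chunks and returns the name it was actually
    # stored under (it picks a new name if one already exists).
    saved_name = default_storage.save(upfile.name, upfile)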
I have a single PDF from which I would like to create a separate PDF for each of its pages. How would I be able to do so without downloading anything locally? I know that Document AI has a file-splitting module (which would actually identify different files; that would be most ideal), but that is not available publicly.
I am using PyPDF2 to do this currently:
list_of_blobs = list(bucket.list_blobs(prefix='tmp/'))
print(len(list_of_blobs))
list_of_blobs[1].download_to_filename('/' + list_of_blobs[1].name)
inputpdf = PdfFileReader(open('/' + list_of_blobs[1].name, "rb"))
individual_files = []
stream = io.StringIO()
for i in range(inputpdf.numPages):
    output = PdfFileWriter()
    output.addPage(inputpdf.getPage(i))
    individual_files.append(output)
    with open("document-page%s.pdf" % (i + 1), "a") as outputStream:
        outputStream.write(stream.getvalue())
        #print(outputStream.read())
        with open(outputStream.name, 'rb') as f:
            data = f.seek(85)
            data = f.read()
            individual_files.append(data)
            bucket.blob('processed/' + "doc%s.pdf" % (i + 1)).upload_from_string(data, content_type='application/pdf')
In the output, I see different PyPDF2 objects such as
<PyPDF2.pdf.PdfFileWriter object at 0x12a2037f0> but I have no idea how I should proceed next. I am also open to using other libraries if those work better.
There were two reasons why my program was not working:
I was trying to read a file that was still open in append mode (I fixed this by moving the second with open() block outside of the first one),
I should have been writing bytes (I fixed this by changing the open mode to 'wb' instead of 'a')
Below is the corrected code:
if inputpdf.numPages > 2:
    for i in range(inputpdf.numPages):
        output = PdfFileWriter()
        output.addPage(inputpdf.getPage(i))
        with open("/tmp/document-page%s.pdf" % (i + 1), "wb") as outputStream:
            output.write(outputStream)
        with open(outputStream.name, 'rb') as f:
            data = f.seek(0)
            data = f.read()
            #print(data)
            bucket.blob(prefix + '/processed/' + "page-%s.pdf" % (i + 1)).upload_from_string(data, content_type='application/pdf')
        stream.truncate(0)
To split a PDF file into several small files (one per page), you need to download the data first. You can materialize the data as a file (in the writable directory /tmp) or simply keep it in memory in a Python variable.
In both cases:
The data will reside in memory
You need to get the data to perform the PDF split.
If you absolutely want to read the data as a stream (I don't know whether that is even possible with the PDF format!), you can use the streaming feature of GCS. But because there is no CRC on the downloaded data, I wouldn't recommend this solution unless you are ready to handle corrupted data, retries, and all the related work.
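For the in-memory option mentioned above, here is a rough sketch. It reuses the bucket and list_of_blobs objects from the question and assumes a google-cloud-storage version that provides download_as_bytes(); no temporary files are written:

import io
from PyPDF2 import PdfFileReader, PdfFileWriter

source_blob = list_of_blobs[1]
inputpdf = PdfFileReader(io.BytesIO(source_blob.download_as_bytes()))

for i in range(inputpdf.numPages):
    output = PdfFileWriter()
    output.addPage(inputpdf.getPage(i))
    buffer = io.BytesIO()
    output.write(buffer)  # serialize the single-page PDF into memory
    bucket.blob('processed/doc%s.pdf' % (i + 1)).upload_from_string(
        buffer.getvalue(), content_type='application/pdf')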
I have a large directory with PDF files (images); how can I efficiently extract the text from all the files in the directory? So far I have tried:
import multiprocessing
import textract
def extract_txt(file_path):
    text = textract.process(file_path, method='tesseract')

p = multiprocessing.Pool(2)
file_path = ['/Users/user/Desktop/sample.pdf']
list(p.map(extract_txt, file_path))
However, it is not working... it takes a lot of time (I have some documents with 600 pages). Additionally: a) I do not know how to efficiently handle the directory part. b) I would like to add a page separator, say <start/page = 1> ... page content ... <end/page = 1>, but I have no idea how to do this.
Thus, how can I apply the extract_txt function to all the elements of a directory that end with .pdf, write the same files to another directory in .txt format, and add a page separator to the OCR text extraction?
Also, I was curious about using Google Docs for this task; is it possible to programmatically use Google Docs to solve the aforementioned text-extraction problem?
UPDATE
Regarding the "adding a page separator" issue (<start/page = 1> ... page content ... <end/page = 1>), after reading Roland Smith's answer I tried:
from PyPDF2 import PdfFileWriter, PdfFileReader
import textract
import os

def extract_text(pdf_file):
    inputpdf = PdfFileReader(open(pdf_file, "rb"))
    for i in range(inputpdf.numPages):
        w = PdfFileWriter()
        w.addPage(inputpdf.getPage(i))
        outfname = 'page{:03d}.pdf'.format(i)
        with open(outfname, 'wb') as outfile:  # I presume you need `wb`.
            w.write(outfile)
        print('\n<begin page pos =', i, '>\n')
        text = textract.process(str(outfname), method='tesseract')
        os.remove(outfname)  # clean up.
        print(str(text, 'utf8'))
        print('\n<end page pos =', i, '>\n')

extract_text('/Users/user/Downloads/ImageOnly.pdf')
However, I still have issues with the print() part; instead of printing, it would be more useful to save all the output into a file. Thus, I tried to redirect the output to a file:
sys.stdout=open("test.txt","w")
print('\n<begin page pos =' , i, '>\n')
sys.stdout.close()
text = textract.process(str(outfname), method='tesseract')
os.remove(outfname) # clean up.
sys.stdout=open("test.txt","w")
print(str(text, 'utf8'))
sys.stdout.close()
sys.stdout=open("test.txt","w")
print('\n<end page pos =' , i, '>\n')
sys.stdout.close()
Any idea of how to do the page extraction/separator trick and save everything into a file?
In your code, you are extracting the text, but you don't do anything with it.
Try something like this:
def extract_txt(file_path):
    text = textract.process(file_path, method='tesseract')
    outfn = file_path[:-4] + '.txt'  # assuming filenames end with '.pdf'
    with open(outfn, 'wb') as output_file:
        output_file.write(text)
    return file_path
This writes the text to a file with the same name but a .txt extension.
It also returns the path of the original file to let the parent know that this file is done.
So I would change the mapping code to:
p = multiprocessing.Pool()
file_path = ['/Users/user/Desktop/sample.pdf']
for fn in p.imap_unordered(extract_txt, file_path):
    print('completed file:', fn)
You don't need to give an argument when creating a Pool. By default it will create as many workers as there are cpu-cores.
Using imap_unordered creates an iterator that starts yielding values as soon as they are available.
Because the worker function returned the filename, you can print it to let the user know that this file is done.
Edit 1:
The additional question is if it is possible to mark page boundaries. I think it is.
A method that would surely work is to split the PDF file into pages before the OCR. You could use e.g. pdfinfo from the poppler-utils package to find out the number of pages in a document. And then you could use e.g. pdfseparate from the same poppler-utils package to convert that one pdf file of N pages into N pdf files of one page. You could then OCR the single page PDF files separately. That would give you the text on each page separately.
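A rough sketch of that approach driven from Python via subprocess (the output file pattern is just an example; pdfseparate substitutes the page number for %d):

import subprocess

def split_pages(pdf_path):
    # Count the pages with pdfinfo.
    info = subprocess.run(['pdfinfo', pdf_path], capture_output=True, text=True)
    pages = next(int(line.split()[-1]) for line in info.stdout.splitlines()
                 if line.startswith('Pages:'))
    # Write one single-page PDF per page: page-1.pdf, page-2.pdf, ...
    subprocess.run(['pdfseparate', pdf_path, 'page-%d.pdf'], check=True)
    return pages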
Alternatively you could OCR the whole document and then search for page breaks. This will only work if the document has a constant or predictable header or footer on every page. It is probably not as reliable as the abovementioned method.
Edit 2:
If you need a file, write a file:
from PyPDF2 import PdfFileWriter, PdfFileReader
import textract
import os

def extract_text(pdf_file):
    inputpdf = PdfFileReader(open(pdf_file, "rb"))
    outfname = pdf_file[:-4] + '.txt'  # Assuming the PDF file name ends with ".pdf"
    with open(outfname, 'w') as textfile:
        for i in range(inputpdf.numPages):
            w = PdfFileWriter()
            w.addPage(inputpdf.getPage(i))
            outfname = 'page{:03d}.pdf'.format(i)
            with open(outfname, 'wb') as outfile:  # I presume you need `wb`.
                w.write(outfile)
            print('page', i)
            # Decode the OCR output so it can be joined with the str markers below.
            text = str(textract.process(outfname, method='tesseract'), 'utf8')
            # Add header and footer.
            text = '\n<begin page pos = {}>\n'.format(i) + text + '\n<end page pos = {}>\n'.format(i)
            # Write the OCR-ed text to the output file.
            textfile.write(text)
            os.remove(outfname)  # clean up.
            print(text)
I have tried the following code, slightly modifying the example in the documentation:
class Upload():
    def POST(self):
        web.header('enctype', 'multipart/form-data')
        print strftime("%Y-%m-%d %H:%M:%S", gmtime())
        x = web.input(file={})
        filedir = '/DiginUploads' # change this to the directory you want to store the file in.
        if 'file' in x: # to check if the file-object is created
            filepath = x.file.filename.replace('\\', '/') # replaces the windows-style slashes with linux ones.
            filename = filepath.split('/')[-1] # splits the path and chooses the last part (the filename with extension)
            fout = open(filedir + '/' + filename, 'w') # creates the file where the uploaded file should be stored
            fout.write(x.file.file.read()) # writes the uploaded file to the newly created file.
            fout.close() # closes the file, upload complete.
But this works only for CSV and TXT documents. For Excel/PDF etc. the file gets created but it can't be opened (it is corrupted). What should I do to handle this scenario?
I saw this, but it is about printing the content, which does not address my issue.
You need to use wb (binary) mode when opening the file:
fout = open(filedir +'/'+ filename, 'wb')
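For example, the same write with a context manager so the file is always closed (filedir and filename are the variables from the question):

with open(filedir + '/' + filename, 'wb') as fout:
    # 'wb' keeps binary formats (Excel, PDF, images) intact.
    fout.write(x.file.file.read())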
test.txt contains the list of files to be downloaded:
http://example.com/example/afaf1.tif
http://example.com/example/afaf2.tif
http://example.com/example/afaf3.tif
http://example.com/example/afaf4.tif
http://example.com/example/afaf5.tif
How can these files be downloaded using Python with maximum download speed?
My thinking was as follows:
import urllib.request
with open('test.txt', 'r') as f:
    lines = f.read().splitlines()
    for line in lines:
        response = urllib.request.urlopen(line)
What comes after that? How do I select the download directory?
Select a path to your desired output directory (output_dir). In your for loop, split every URL on the / character and use the last piece as the filename. Also open the files for writing in binary mode (wb), since response.read() returns bytes, not str.
import os
import urllib.request

output_dir = 'path/to/your/output/dir'

with open('test.txt', 'r') as f:
    lines = f.read().splitlines()
    for line in lines:
        response = urllib.request.urlopen(line)
        output_file = os.path.join(output_dir, line.split('/')[-1])
        with open(output_file, 'wb') as writer:
            writer.write(response.read())
Note:
Downloading multiple files can be faster if you use multiple threads, since a single download rarely uses the full bandwidth of your internet connection.
Also, if the files you are downloading are pretty big, you should probably stream the read (reading chunk by chunk). As #Tiran commented, you should use shutil.copyfileobj(response, writer) instead of writer.write(response.read()).
I would only add that you should probably always specify the length parameter too: shutil.copyfileobj(response, writer, 5*1024*1024) # (at least 5 MB), since the default value of 16 KB is really small and it will just slow things down.
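Putting both notes together, a sketch of the streaming variant (same test.txt and output_dir placeholders as above):

import os
import shutil
import urllib.request

output_dir = 'path/to/your/output/dir'

with open('test.txt', 'r') as f:
    for line in f.read().splitlines():
        output_file = os.path.join(output_dir, line.split('/')[-1])
        with urllib.request.urlopen(line) as response, open(output_file, 'wb') as writer:
            # Copy in 5 MB chunks instead of loading the whole download into memory.
            shutil.copyfileobj(response, writer, 5 * 1024 * 1024)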
This works fine for me: (note that name must be absolute, for example 'afaf1.tif')
import urllib, os

def download(baseUrl, fileName, layer=0):
    print 'Trying to download file:', fileName
    url = baseUrl + fileName
    name = os.path.join('foldertodwonload', fileName)
    try:
        # Note that folder needs to exist
        urllib.urlretrieve(url, name)
    except:
        # Upon failure to download retries total 5 times
        print 'Download failed'
        print 'Could not download file:', fileName
        if layer > 4:
            return
        else:
            layer += 1
            print 'retrying', str(layer) + '/5'
            download(baseUrl, fileName, layer)
    print fileName + ' downloaded'

for fileName in nameList:
    download(url, fileName)
Moved the unnecessary code out of the try block.
I am trying to create text files in bulk based on a list. A text file contains a number of lines/titles, and the aim is to create one text file per title. Below is what my titles.txt looks like, along with my non-working code and the expected output.
titles = open("C:\\Dropbox\\Python\\titles.txt", 'r')
for lines in titles.readlines():
    d_path = 'C:\\titles'
    output = open((d_path.lines.strip()) + '.txt', 'a')
    output.close()
titles.close()
titles.txt
Title-A
Title-B
Title-C
new blank files to be created under directory c:\\titles\\
Title-A.txt
Title-B.txt
Title-C.txt
It's a little difficult to tell what you're attempting here, but hopefully this will be helpful:
import os.path

with open('titles.txt') as f:
    for line in f:
        newfile = os.path.join('C:\\titles', line.strip()) + '.txt'
        ff = open(newfile, 'a')
        ff.close()
If you want to replace existing files with blank files, you can open your files with mode 'w' instead of 'a'.
The following should work.
import os

titles = 'C:/Dropbox/Python/titles.txt'
d_path = 'c:/titles'

with open(titles, 'r') as f:
    for l in f:
        # Append '.txt' so the created files match the expected output names.
        with open(os.path.join(d_path, l.strip()) + '.txt', 'w') as _:
            pass