Saving images to generated directories - python

Hopefully this is a quick one for someone; it's been annoying me for a while now.
I can create the directory and save the images to the directory where the script is run, but I cannot figure out how to save the images to the specific folder created for that specific advert.
Would someone be able to shed some light on this, please?
gundir = soup.find("title").text  # keep - folder creation for each advert using title
gun_folders = os.makedirs(gundir)  # note: os.makedirs() returns None
for img in imgs:
    clean = re.compile('src=".*?"')
    strings = clean.findall(str(img))
    for string in strings:
        imgUrl = string.split('"')[1]
        filename = imgUrl.split('/')[-1]
        resp = requests.get(imgUrl, stream=True)
        local_file = open(filename, 'wb')
        resp.raw.decode_content = True
        shutil.copyfileobj(resp.raw, local_file)
        del resp
I understand the above code does what it's supposed to do, but it's not enough for what I wish it to do.
Could someone point me in the right direction on how to achieve what I'm after?
Thanks!

String concatenation
Build the path from the directory name and the filename. Note that os.makedirs() returns None, so use gundir itself rather than the gun_folders variable:
local_file = open('{}/{}'.format(gundir, filename), 'wb')
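A slightly more robust variant is to build the path with os.path.join instead of manual string formatting. Here is a minimal sketch of the download loop with that change, assuming soup and imgs are defined as in the question (and Python 3, so exist_ok is available):
import os
import re
import shutil
import requests

gundir = soup.find("title").text      # folder named after the advert title
os.makedirs(gundir, exist_ok=True)    # don't fail if the folder already exists

for img in imgs:
    for string in re.findall('src=".*?"', str(img)):
        imgUrl = string.split('"')[1]
        filename = imgUrl.split('/')[-1]
        resp = requests.get(imgUrl, stream=True)
        resp.raw.decode_content = True
        # Join the advert folder and the image filename so the file lands in that folder.
        with open(os.path.join(gundir, filename), 'wb') as local_file:
            shutil.copyfileobj(resp.raw, local_file)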

Related

BeautifulSoup - why aren't the images I'm scraping saving?

I'm iterating through and scraping images off a website, but for some reason the "write" isn't working and saving the image. Am I supposed to declare a directory to save them to or something? Here's my request. I'm using Python 2.7.
for img in imgs:
    image = img['href']
    img_url = my_url + image
    resource = urllib.urlretrieve(img_url)
    resource = resource[0]
    output = open(resource, "wb")
    output.write(resource)
    output.close()
You're working too hard! urlretrieve will already have written the file to disk; all you need to do is copy it somewhere more permanent.
filename, headers = urllib.urlretrieve(img_url)
import shutil
shutil.copy(filename, "/path/to/somewhere")
But to answer your question about what is going on...
resource = urllib.urlretrieve(img_url) # the file is on disk at /tmp/foobar
resource = resource[0] # resource now contains "/tmp/foobar"
output = open(resource, "wb") # oops! You just opened "/tmp/foobar" for writing, which clears the file
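Putting those two pieces together, the loop might look something like this (Python 2.7 as in the question; the destination directory is a placeholder):
import shutil
import urllib

for img in imgs:
    image = img['href']
    img_url = my_url + image
    # urlretrieve downloads to a temporary file and returns its path.
    tmp_path, headers = urllib.urlretrieve(img_url)
    # Copy the downloaded file somewhere permanent, keeping the original name.
    shutil.copy(tmp_path, "/path/to/somewhere/" + image.split('/')[-1])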

KeyError: 'APP1' when reading metadata with exif

I want to cycle through every .jpg in my pictures folder, get the name and the date it was taken, and add them to a list. Running my code, however, results in KeyError: 'APP1'. Below you can see my code:
import os
from exif import Image

path = 'F:/Bilder/'
bilder_list = []
for i in os.listdir(path):
    if ".jpg" in i:
        with open(path + i, 'rb') as image_file:
            image = Image(image_file)
            bilder_list.append(image.datetime)
print(bilder_list)
Any idea what went wrong here? Any help is greatly appreciated :)
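The KeyError usually means one of the JPEGs has no EXIF segment at all. A minimal guard, assuming the exif package's has_exif attribute, would be to skip such files:
import os
from exif import Image

path = 'F:/Bilder/'
bilder_list = []
for name in os.listdir(path):
    if name.lower().endswith('.jpg'):
        with open(os.path.join(path, name), 'rb') as image_file:
            image = Image(image_file)
            # Skip files that carry no EXIF data instead of raising KeyError.
            if image.has_exif and hasattr(image, 'datetime'):
                bilder_list.append((name, image.datetime))
print(bilder_list)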

App Engine - download files from Cloud Storage

I am using Python 2.7 and ReportLab to create .pdf files for display/print in my App Engine system. I am using ndb.Model to store the data, if that matters.
I am able to produce the equivalent of a bank statement for a single client on-line. That is, the user clicks the on-screen 'pdf' button and the .pdf statement appears on screen in a new tab, exactly as it should.
I am using the following code to save .pdf files to Google Cloud Storage successfully:
buffer = StringIO.StringIO()
self.p = canvas.Canvas(buffer, pagesize=portrait(A4))
self.p.setLineWidth(0.5)
try:
    # create .pdf of .csv data here
finally:
    self.p.save()
pdfout = buffer.getvalue()
buffer.close()
filename = getgcsbucket() + '/InvestorStatement.pdf'
write_retry_params = gcs.RetryParams(backoff_factor=1.1)
try:
    gcs_file = gcs.open(filename,
                        'w',
                        content_type='application/pdf',
                        retry_params=write_retry_params)
    gcs_file.write(pdfout)
except:
    logging.error(traceback.format_exc())
finally:
    gcs_file.close()
I am using the following code to create a list of all files for display on-screen; it shows all the files stored above:
allfiles = []
bucket_name = getgcsbucket()
rfiles = gcs.listbucket(bucket_name)
for rfile in rfiles:
    allfiles.append(rfile.filename)
return allfiles
My screen (html) shows rows of ([Delete] and Filename). When the user clicks the [Delete] button, the following delete code snippet works (filename is /bucket/filename, complete)
filename = self.request.get('filename')
try:
    gcs.delete(filename)
except gcs.NotFoundError:
    pass
My question - given I have a list of files on-screen, I want the user to click on a filename and for that file to be downloaded to the user's computer. In Google's Chrome browser, this would result in the file being downloaded, with its name displayed at the bottom left of the screen.
One other point: the above example is for .pdf files. I will also have to show .csv files in the list and would like them to be downloaded as well. I only want the files to be downloaded; no display is required.
So, I would like a snippet like ...
filename = self.request.get('filename')
try:
    gcs.downloadtousercomputer(filename)  # ???
except gcs.NotFoundError:
    pass
I think I have tried everything I can find both here and elsewhere. Sorry I have been so long-winded. Any hints for me?
To download a file instead of showing it in the browser, you need to add a header to your response:
self.response.headers["Content-Disposition"] = 'attachment; filename="%s"' % filename
You can specify the filename as shown above and it works for any file type.
One solution you can try is to read the file from the bucket and print the content as the response with the correct header:
import cloudstorage
...

def read_file(self, filename):
    bucket_name = "/your_bucket_name"
    file = bucket_name + '/' + filename
    with cloudstorage.open(file) as cloudstorage_file:
        self.response.headers["Content-Disposition"] = str('attachment;filename=' + filename)
        contents = cloudstorage_file.read()
        cloudstorage_file.close()
    self.response.write(contents)
Here filename could be something you are sending as a GET parameter; it needs to be a file that exists in your bucket or you will raise an exception.
[1] Here you will find a sample: https://cloud.google.com/appengine/docs/standard/python/googlecloudstorageclient/read-write-to-cloud-storage

How to extract text from a directory of PDF files efficiently with OCR?

I have a large directory with PDF files (images); how can I efficiently extract the text from all the files inside the directory? So far I tried:
import multiprocessing
import textract

def extract_txt(file_path):
    text = textract.process(file_path, method='tesseract')

p = multiprocessing.Pool(2)
file_path = ['/Users/user/Desktop/sample.pdf']
list(p.map(extract_txt, file_path))
However, it is not working; it takes a lot of time (I have some documents that have 600 pages). Additionally: a) I do not know how to handle the directory transformation part efficiently. b) I would like to add a page separator, say <start/page = 1> ... page content ... <end/page = 1>, but I have no idea how to do this.
Thus, how can I apply the extract_txt function to all the elements of a directory that end with .pdf, return the same files in another directory but in .txt format, and add a page separator with OCR text extraction?
Also, I was curious about using Google Docs for this task; is it possible to programmatically use Google Docs to solve the aforementioned text extraction problem?
UPDATE
Regarding the "adding a page separator" issue (<start/page = 1> ... page content ... <end/page = 1>), after reading Roland Smith's answer I tried:
import os
from PyPDF2 import PdfFileWriter, PdfFileReader
import textract

def extract_text(pdf_file):
    inputpdf = PdfFileReader(open(pdf_file, "rb"))
    for i in range(inputpdf.numPages):
        w = PdfFileWriter()
        w.addPage(inputpdf.getPage(i))
        outfname = 'page{:03d}.pdf'.format(i)
        with open(outfname, 'wb') as outfile:  # I presume you need `wb`.
            w.write(outfile)
        print('\n<begin page pos =', i, '>\n')
        text = textract.process(str(outfname), method='tesseract')
        os.remove(outfname)  # clean up.
        print(str(text, 'utf8'))
        print('\n<end page pos =', i, '>\n')

extract_text('/Users/user/Downloads/ImageOnly.pdf')
However, I still have issues with the print() part: instead of printing, it would be more useful to save all the output into a file. Thus, I tried to redirect the output to a file:
sys.stdout = open("test.txt", "w")
print('\n<begin page pos =', i, '>\n')
sys.stdout.close()
text = textract.process(str(outfname), method='tesseract')
os.remove(outfname)  # clean up.
sys.stdout = open("test.txt", "w")
print(str(text, 'utf8'))
sys.stdout.close()
sys.stdout = open("test.txt", "w")
print('\n<end page pos =', i, '>\n')
sys.stdout.close()
Any idea how to pull off the page extraction/separator trick and save everything into a file?
In your code, you are extracting the text, but you don't do anything with it.
Try something like this:
def extract_txt(file_path):
    text = textract.process(file_path, method='tesseract')
    outfn = file_path[:-4] + '.txt'  # assuming filenames end with '.pdf'
    with open(outfn, 'wb') as output_file:
        output_file.write(text)
    return file_path
This writes the text to a file that has the same name but a .txt extension.
It also returns the path of the original file to let the parent know that this file is done.
So I would change the mapping code to:
p = multiprocessing.Pool()
file_path = ['/Users/user/Desktop/sample.pdf']
for fn in p.imap_unordered(extract_txt, file_path):
    print('completed file:', fn)
You don't need to give an argument when creating a Pool. By default it will create as many workers as there are cpu-cores.
Using imap_unordered creates an iterator that starts yielding values as soon as they are available.
Because the worker function returns the filename, you can print it to let the user know that this file is done.
Edit 1:
The additional question is whether it is possible to mark page boundaries. I think it is.
A method that would surely work is to split the PDF file into pages before the OCR. You could use e.g. pdfinfo from the poppler-utils package to find out the number of pages in a document. And then you could use e.g. pdfseparate from the same poppler-utils package to convert that one pdf file of N pages into N pdf files of one page. You could then OCR the single page PDF files separately. That would give you the text on each page separately.
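For illustration, here is a rough sketch of that split-then-OCR approach, driving pdfinfo and pdfseparate through subprocess (parsing pdfinfo's "Pages:" line is an assumption about its output format):
import subprocess
import textract

def ocr_per_page(pdf_file):
    # Ask pdfinfo for the page count; its output contains a line like "Pages: 12".
    info = subprocess.check_output(['pdfinfo', pdf_file]).decode('utf8')
    pages = int(next(line.split(':')[1] for line in info.splitlines()
                     if line.startswith('Pages')))
    # Split the document into one-page PDFs named page-1.pdf, page-2.pdf, ...
    subprocess.check_call(['pdfseparate', pdf_file, 'page-%d.pdf'])
    # OCR each single-page PDF separately, keeping the texts in page order.
    return [textract.process('page-{}.pdf'.format(n), method='tesseract')
            for n in range(1, pages + 1)]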
Alternatively you could OCR the whole document and then search for page breaks. This will only work if the document has a constant or predictable header or footer on every page. It is probably not as reliable as the abovementioned method.
Edit 2:
If you need a file, write a file:
import os
from PyPDF2 import PdfFileWriter, PdfFileReader
import textract

def extract_text(pdf_file):
    inputpdf = PdfFileReader(open(pdf_file, "rb"))
    outfname = pdf_file[:-4] + '.txt'  # Assuming PDF file name ends with ".pdf"
    with open(outfname, 'w') as textfile:
        for i in range(inputpdf.numPages):
            w = PdfFileWriter()
            w.addPage(inputpdf.getPage(i))
            outfname = 'page{:03d}.pdf'.format(i)
            with open(outfname, 'wb') as outfile:  # I presume you need `wb`.
                w.write(outfile)
            print('page', i)
            # textract returns bytes; decode so it can be joined with the str markers.
            text = textract.process(outfname, method='tesseract').decode('utf8')
            # Add header and footer.
            text = '\n<begin page pos = {}>\n'.format(i) + text + '\n<end page pos = {}>\n'.format(i)
            # Write the OCR-ed text to the output file.
            textfile.write(text)
            os.remove(outfname)  # clean up.
            print(text)

Python download file with original date

I am quite sure this is something very common. I want to download files from an HTTPS server and keep the original (modified) date, so the file should not show the date it was downloaded.
Here is the way I am doing it:
filename = file.text
file = session.get(url, verify=False)
if file.endswith('arg'):
    file = open('C:/RD/M/' + filename, 'wb+')
    file.write(file.content)
    file.close()
else:
    'do something'
Is there any way to add something after file.write(file.content)?
Thanks for the info in advance.
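One common approach, sketched below, is to read the server's Last-Modified header (if it sends one) and apply that timestamp to the saved file with os.utime. Here session, url and filename are taken from the question; the local path is a placeholder:
import os
from email.utils import parsedate_to_datetime

resp = session.get(url, verify=False)
local_path = 'C:/RD/M/' + filename
with open(local_path, 'wb') as f:
    f.write(resp.content)

# If the server reports when the file was last changed, copy that
# timestamp onto the local file so it keeps the original date.
last_modified = resp.headers.get('Last-Modified')
if last_modified:
    mtime = parsedate_to_datetime(last_modified).timestamp()
    os.utime(local_path, (mtime, mtime))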
