I am downloading multiple PDFs. I have a list of urls and the code is written to download them and also create one big pdf with them all in. The code works for the first 144 pdfs then it throws this error:
PdfReadError: EOF marker not found
I've tried making all the pdfs end in %%EOF but that doesn't work - it still reaches the same point then I get the error again.
Here's my code:
my file and converting to list for python to read each separately
with open('minutelinks.txt', 'r') as file:
data = file.read()
links = data.split()
download pdfs
from PyPDF2 import PdfFileMerger
import requests
urls = links
merger = PdfFileMerger()
for url in urls:
response = requests.get(url)
title = url.split("/")[-1]
with open(title, 'wb') as f:
f.write(response.content)
merger.append(title)
merger.write("allminues.pdf")
merger.close()
I want to be able to download all of them and create one big pdf - which it appears to do until it throws this error. I have about 750 pdfs and it only gets to 144.
This is how I changed my code so it now downloads all of the pdfs and skips the one (or more) that may be correupted. I also had to add the self argument to the function.
from PyPDF2 import PdfFileMerger
import requests
import sys
urls = links
def download_pdfs(self):
merger = PdfFileMerger()
for url in urls:
try:
response = requests.get(url)
title = url.split("/")[-1]
with open(title, 'wb') as f:
f.write(response.content)
except PdfReadError:
print(title)
sys.exit()
merger.append(title)
merger.write("allminues.pdf")
merger.close()
The end of file marker '%%EOF' is meant to be the very last line. It is a kind of marker where the pdf parser knows, that the PDF document ends here.
My solution is to force this marker to stay at the end:
def reset_eof(self, pdf_file):
with open(pdf_file, 'rb') as p:
txt = (p.readlines())
for i, x in enumerate(txt[::-1]):
if b'%%EOF' in x:
actual_line = len(txt)-i-1
break
txtx = txt[:actual_line] + [b'%%EOF']
with open(pdf_file, 'wb') as f:
f.writelines(txtx)
return PyPDF4.PdfFileReader(pdf_file)
I read that EOF is a kind of tag included in PDF files. link in portuguese
However, I guess some kinds of PDF files do not have the 'EOF marker' and PyPDF2 do not recognizes those ones.
So, what I did to fix "PdfReadError: EOF marker not found" was opening my PDF with Google Chromer and print it as .pdf once more, so that the file is converted to .pdf by Chromer and hopefully with the EOF marker.
I ran my script with the new .pdf file converted by Chromer and it worked fine.
Related
I am trying to read a whole pdf file that is more then 250 pages. for that first i am converting my pdf to docx thorough the pdf2docx library.
here is a code;
from docx import Document
document = Document()
document.save('file.docx')
url = file_path #(google drive url where file was uploaded)
response = requests.get(url)
my_raw_data = response.content
with open("my_pdf.pdf", 'wb') as my_data:
my_data.write(my_raw_data)
open_pdf_file = open("my_pdf.pdf", 'rb')
cv = Converter(open_pdf_file)
cv.convert("roshni.docx")
Parse=parser.from_file("file.docx")
data=[]
for i in (Parse['content'].strip().split('\n')):
if len(i.split())<5:
pass
else:
data.append(i)
Text=data[1:-1]
But I am not able to read the file. getting error like "Exception: No parsed pages. Please parse page first."
How to solve this issue ? how to read a whole pdf using python ?
The goal is to download GTFS data through python web scraping, starting with https://transitfeeds.com/p/agence-metropolitaine-de-transport/129/latest/download
Currently, I'm using requests like so:
def download(url):
fpath = "prov/city/GTFS"
r = requests.get(url)
if r.ok:
print("Saving file.")
open(fpath, "wb").write(r.content)
else:
print("Download failed.")
The results of requests.content of the above url unfortunately renders the following:
You can see the files of interest within the output (e.g. stops.txt) but how might I access them to read/write?
I fear you're trying to read a zip file with a text editor, perhaps you should try using the "zipfile" module.
The following worked:
def download(url):
fpath = "path/to/output/"
f = requests.get(url, stream = True, headers = headers)
if f.ok:
print("Saving to {}".format(fpath))
g=open(fpath+'output.zip','wb')
g.write(f.content)
g.close()
else:
print("Download failed with error code: ", f.status_code)
You need to write this file into a zip.
import requests
url = "https://transitfeeds.com/p/agence-metropolitaine-de-transport/129/latest/download"
fname = "gtfs.zip"
r = requests.get(url)
open(fname, "wb").write(r.content)
Now fname exists and has several text files inside. If you want to programmatically extract this zip and then read the content of a file, for example stops.txt, then you need first to extract a single file, or simply extractall.
import zipfile
# this will extract only a single file, and
# raise a KeyError if the file is missing from the archive
zipfile.ZipFile(fname).extract("stops.txt")
# this will extract all the files found from the archive,
# overwriting files in the process
zipfile.ZipFile(fname).extractall()
Now you just need to work with your file(s).
thefile = "stops.txt"
# just plain text
text = open(thefile).read()
# csv file
import csv
reader = csv.reader(open(thefile))
for row in reader:
...
Drawing inspiration from this post, I am trying to download a bunch of xml files in batch from a website:
import urllib2
url='http://ratings.food.gov.uk/open-data/'
f = urllib2.urlopen(url)
data = f.read()
with open("C:\Users\MyName\Desktop\data.zip", "wb") as code:
code.write(data)
The zip file is created within seconds, but as I attempt to access it, an error window comes up:
Windows cannot open the folder.
The Compressed (zipped) Folder "C:\Users\MyName\Desktop\data.zip" is invalid.
What am I doing wrong here?
you are not opening file handles inside the zip file:
import urllib2
from bs4 import BeautifulSoup
import zipfile
url='http://ratings.food.gov.uk/open-data/'
fileurls = []
f = urllib2.urlopen(url)
mainpage = f.read()
soup = BeautifulSoup(mainpage, 'html.parser')
tablewrapper = soup.find(id='openDataStatic')
for table in tablewrapper.find_all('table'):
for link in table.find_all('a'):
fileurls.append(link['href'])
with zipfile.ZipFile("data.zip", "w") as code:
for url in fileurls:
print('Downloading: %s' % url)
f = urllib2.urlopen(url)
data = f.read()
xmlfilename = url.rsplit('/', 1)[-1]
code.writestr(xmlfilename, data)
You are doing nothing to encode this as zip file. If instead you choose to open it in a plain text editor such as notepad it should show you the raw xml.
I have a large directory with PDF files (images), how can I extract efficiently the text from all the files inside the directory?. So far I tried to:
import multiprocessing
import textract
def extract_txt(file_path):
text = textract.process(file_path, method='tesseract')
p = multiprocessing.Pool(2)
file_path = ['/Users/user/Desktop/sample.pdf']
list(p.map(extract_txt, file_path))
However, it is not working... it takes a lot of time (I have some documents that have 600 pages). Additionally: a) I do not know how to handle efficiently the directory transformation part. b) I would like to add a page separator, let's say: <start/age = 1> ... page content ... <end/page = 1>, but I have no idea of how to do this.
Thus, how can I apply the extract_txt function to all the elements of a directory that end with .pdf and return the same files in another directory but in a .txt format, and add a page separator with OCR text extraction?.
Also, I was curios about using google docs to make this task, is it possible to programmatically use google docs to solve the aforementioned text extracting problem?.
UPDATE
Regarding the "adding a page separator" issue (<start/age = 1> ... page content ... <end/page = 1>) after reading Roland Smith's answer I tried to:
from PyPDF2 import PdfFileWriter, PdfFileReader
import textract
def extract_text(pdf_file):
inputpdf = PdfFileReader(open(pdf_file, "rb"))
for i in range(inputpdf.numPages):
w = PdfFileWriter()
w.addPage(inputpdf.getPage(i))
outfname = 'page{:03d}.pdf'.format(i)
with open(outfname, 'wb') as outfile: # I presume you need `wb`.
w.write(outfile)
print('\n<begin page pos =' , i, '>\n')
text = textract.process(str(outfname), method='tesseract')
os.remove(outfname) # clean up.
print(str(text, 'utf8'))
print('\n<end page pos =' , i, '>\n')
extract_text('/Users/user/Downloads/ImageOnly.pdf')
However, I still have issues with the print() part, since instead of printing, it would be more useful to save into a file all the output. Thus, I tried to redirect the output to a a file:
sys.stdout=open("test.txt","w")
print('\n<begin page pos =' , i, '>\n')
sys.stdout.close()
text = textract.process(str(outfname), method='tesseract')
os.remove(outfname) # clean up.
sys.stdout=open("test.txt","w")
print(str(text, 'utf8'))
sys.stdout.close()
sys.stdout=open("test.txt","w")
print('\n<end page pos =' , i, '>\n')
sys.stdout.close()
Any idea of how to make the page extraction/separator trick and saving everything into a file?...
In your code, you are extracting the text, but you don't do anything with it.
Try something like this:
def extract_txt(file_path):
text = textract.process(file_path, method='tesseract')
outfn = file_path[:-4] + '.txt' # assuming filenames end with '.pdf'
with open(outfn, 'wb') as output_file:
output_file.write(text)
return file_path
This writes the text to file that has the same name but a .txt extension.
It also returns the path of the original file to let the parent know that this file is done.
So I would change the mapping code to:
p = multiprocessing.Pool()
file_path = ['/Users/user/Desktop/sample.pdf']
for fn in p.imap_unordered(extract_txt, file_path):
print('completed file:', fn)
You don't need to give an argument when creating a Pool. By default it will create as many workers as there are cpu-cores.
Using imap_unordered creates an iterator that starts yielding values as soon as they are available.
Because the worker function returned the filename, you can print it to let the user know that this file is done.
Edit 1:
The additional question is if it is possible to mark page boundaries. I think it is.
A method that would surely work is to split the PDF file into pages before the OCR. You could use e.g. pdfinfo from the poppler-utils package to find out the number of pages in a document. And then you could use e.g. pdfseparate from the same poppler-utils package to convert that one pdf file of N pages into N pdf files of one page. You could then OCR the single page PDF files separately. That would give you the text on each page separately.
Alternatively you could OCR the whole document and then search for page breaks. This will only work if the document has a constant or predictable header or footer on every page. It is probably not as reliable as the abovementioned method.
Edit 2:
If you need a file, write a file:
from PyPDF2 import PdfFileWriter, PdfFileReader
import textract
def extract_text(pdf_file):
inputpdf = PdfFileReader(open(pdf_file, "rb"))
outfname = pdf_file[:-4] + '.txt' # Assuming PDF file name ends with ".pdf"
with open(outfname, 'w') as textfile:
for i in range(inputpdf.numPages):
w = PdfFileWriter()
w.addPage(inputpdf.getPage(i))
outfname = 'page{:03d}.pdf'.format(i)
with open(outfname, 'wb') as outfile: # I presume you need `wb`.
w.write(outfile)
print('page', i)
text = textract.process(outfname, method='tesseract')
# Add header and footer.
text = '\n<begin page pos = {}>\n'.format(i) + text + '\n<end page pos = {}>\n'.format(i)
# Write the OCR-ed text to the output file.
textfile.write(text)
os.remove(outfname) # clean up.
print(text)
test.txt contains the list of files to be downloaded:
http://example.com/example/afaf1.tif
http://example.com/example/afaf2.tif
http://example.com/example/afaf3.tif
http://example.com/example/afaf4.tif
http://example.com/example/afaf5.tif
How these files can be downloaded using python with maximum download speed?
my thinking was as follows:
import urllib.request
with open ('test.txt', 'r') as f:
lines = f.read().splitlines()
for line in lines:
response = urllib.request.urlopen(line)
What after that?How to select download directory?
Select a path to your desired output directory (output_dir). In your for loop split every url on / character and use the last peace as the filename. Also open the files for writing in binary mode wb since the response.read() returns bytes, not str.
import os
import urllib.request
output_dir = 'path/to/you/output/dir'
with open ('test.txt', 'r') as f:
lines = f.read().splitlines()
for line in lines:
response = urllib.request.urlopen(line)
output_file = os.path.join(output_dir, line.split('/')[-1])
with open(output_file, 'wb') as writer:
writer.write(response.read())
Note:
Downloading multiple files can be faster if you use multiple threads since the download is rarely using the full bandwidth of your internet connection._
Also if the files you are downloading are pretty big you should probably stream the read (reading chunk by chunk). As #Tiran commented you should use shutil.copyfileobj(response, writer) instead of writer.write(response.read()).
I would only add that you should probably always specify the length parameter too: shutil.copyfileobj(response, writer, 5*1024*1024) # (at least 5MB) since the default value of 16kb is really small and it will just slow things down.
This works fine for me: (note that name must be absolute, for example 'afaf1.tif')
import urllib,os
def download(baseUrl,fileName,layer=0):
print 'Trying to download file:',fileName
url = baseUrl+fileName
name = os.path.join('foldertodwonload',fileName)
try:
#Note that folder needs to exist
urllib.urlretrieve (url,name)
except:
# Upon failure to download retries total 5 times
print 'Download failed'
print 'Could not download file:',fileName
if layer > 4:
return
else:
layer+=1
print 'retrying',str(layer)+'/5'
download(baseUrl,fileName,layer)
print fileName+' downloaded'
for fileName in nameList:
download(url,fileName)
Moved unnecessary code out from try block