How can we convert a PDF to docx, with or without using Python? I actually want to automate the conversion of a large number of files, so I need an API.
I have used online websites like:
https://pdf2docx.com/
https://online2pdf.com/pdf2docx
https://www.zamzar.com/convert/pdf-to-docx/
I was unable to get access to use their API directly.
pdf2docx
Install the pdf2docx package.
Installation
Clone or download pdf2docx
pip install pdf2docx
or
# download the package and install it in your environment
python setup.py install
Option 1
from pdf2docx import Converter
pdf_file = r'C:\Users\ABCD\Desktop\XYZ\Document1.pdf'  # source file
docx_file = r'C:\Users\ABCD\Desktop\XYZ\sample.docx'  # destination file
# convert pdf to docx
cv = Converter(pdf_file)
cv.convert(docx_file, start=0, end=None)
cv.close()
#Output
Parsing Page 53: 53/53...
Creating Page 53: 53/53...
--------------------------------------------------
Terminated in 6.258919400000195s.
Option 2
from pdf2docx import parse
pdf_file = r'C:\Users\ABCD\Desktop\XYZ\Document2.pdf'  # source file
docx_file = r'C:\Users\ABCD\Desktop\XYZ\sample_2.docx'  # destination file
# convert pdf to docx
parse(pdf_file, docx_file, start=0, end=None)
# output
Parsing Page 53: 53/53...
Creating Page 53: 53/53...
--------------------------------------------------
Terminated in 5.883666100000482s.
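Since the goal is to automate conversion of a large number of files, here is a minimal batch-conversion sketch built on the same Converter API; the folder paths are placeholders you would replace with your own.
import os
from pdf2docx import Converter
src_dir = r'C:\Users\ABCD\Desktop\XYZ'  # folder containing the PDFs (placeholder)
out_dir = r'C:\Users\ABCD\Desktop\XYZ\docx'  # destination folder (placeholder)
os.makedirs(out_dir, exist_ok=True)
for name in os.listdir(src_dir):
    if name.lower().endswith('.pdf'):
        pdf_path = os.path.join(src_dir, name)
        docx_path = os.path.join(out_dir, os.path.splitext(name)[0] + '.docx')
        cv = Converter(pdf_path)
        cv.convert(docx_path, start=0, end=None)  # convert all pages
        cv.close()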
You can try pdftohtml, then use Pandoc to convert HTML to docx.
Actually, PDF is not really a document format but rather a page-layout format, so conversion can be problematic.
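For example, a rough sketch of that pipeline driven from Python with subprocess (assumes the pdftohtml and pandoc executables are on your PATH; file names are placeholders, and output naming can differ slightly between pdftohtml versions):
import subprocess
pdf_file = 'Document1.pdf'  # placeholder input
html_file = 'Document1.html'
docx_file = 'Document1.docx'
# pdftohtml (from poppler-utils) writes an HTML rendering of the PDF
subprocess.check_call(['pdftohtml', '-s', '-noframes', pdf_file, html_file])
# pandoc converts the HTML to docx
subprocess.check_call(['pandoc', html_file, '-o', docx_file])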
I'm the CTO at Zamzar and we have an API to do just this available at https://developers.zamzar.com/
We have a test account you can use for free to try out the service, and code samples for Python in our docs, which would enable you to convert a PDF file to DOCX quite simply:
import requests
from requests.auth import HTTPBasicAuth
api_key = 'YOUR_API_KEY'
endpoint = "https://sandbox.zamzar.com/v1/jobs"
source_file = "/tmp/my.pdf"
target_format = "docx"
file_content = {'source_file': open(source_file, 'rb')}
data_content = {'target_format': target_format}
res = requests.post(endpoint, data=data_content, files=file_content, auth=HTTPBasicAuth(api_key, ''))
print(res.json())
You can then poll the job to see when it has finished before downloading your converted file.
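For example, a rough polling-and-download sketch against the sandbox (the field names and status value here are as described in the developer docs; verify against the current API reference):
import time
import requests
from requests.auth import HTTPBasicAuth
job_id = res.json()['id']  # id of the job created above
auth = HTTPBasicAuth(api_key, '')
# poll until the job has finished
while True:
    job = requests.get("https://sandbox.zamzar.com/v1/jobs/{}".format(job_id), auth=auth).json()
    if job.get('status') == 'successful':
        break
    time.sleep(5)
# download the first converted file
file_id = job['target_files'][0]['id']  # field name per the docs; check the API reference
content = requests.get("https://sandbox.zamzar.com/v1/files/{}/content".format(file_id), auth=auth)
with open('converted.docx', 'wb') as f:
    f.write(content.content)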
Try PDF.to; it has a PDF API with cURL, PHP, Python, and Node.js examples, and good documentation.
Converting PDFs to documents can be a problematic task; it would be much easier the other way around.
One possible solution could be to "Save As" the PDF file in a desired location with a ".docx" extension. This might work if the PDF was originally saved from a docx file.
Related
I am trying to read a CSV file from a Bitbucket URL into a DataFrame using Python. For the work I am doing I cannot read it locally; it has to come from Bitbucket every time.
Any ideas on how to do this? Thank you!
Here is my example:
url = 'https://bitbucket.EXAMPLE.com/EXAMPLE/EXAMPLE/EXAMPLE/EXAMPLE/raw/wpcProjects.csv?at=refs%2Fheads%2Fmaster'
colnames=['project_id','project_name','gourmet_url']
df7 = pd.read_csv(url, names =colnames)
However, the output is not correct; it's not the DataFrame I expect but some bad data.
You have multiple options, but your question is actually 2 separate questions.
How to get a file (.csv in this case) from a remote location.
How to load a csv into a "df" which is a pandas data frame.
For #2, you simply import pandas, and use the df = pandas.read_csv() function call. See the documentation! If the CSV file was in the current directory, you would do pandas.read_csv('myfile.csv')
For #1: the CSV is on a server somewhere; in this case, it happens to be on Bitbucket's servers, accessed through their website. You have several options: fetch it and save it locally, then read it; fetch it to a temporary location, read it into pandas, and discard it; or even read the data into Python as a string. Having a lot of options doesn't mean they are all useful, though; I am just listing them for completeness. Looking at the documentation, pandas already has remote fetching built into read_csv(): you can pass in a URL directly, and "valid URL schemes include http, ftp, s3, gs, and file".
If you want to save it locally, you can once again use pandas, this time with the DataFrame's .to_csv() method.
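For example, a minimal sketch (the URL here is a placeholder):
import pandas as pd
df = pd.read_csv('https://example.com/remote.csv')  # placeholder URL
df.to_csv('local_copy.csv', index=False)  # write a local copy to disk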
FOR BITBUCKET SPECIFICALLY:
You need to make sure to link to the 'raw' file on Bitbucket. Get the link to the raw file, and pass that in. The link used to view the file in your web browser is not, by default, the direct link to the raw file; it's a webpage that offers a view into that file. Get the raw file link, then pass that into pandas.
Code example:
Assume we want (a random csv file I found on bitbucket):
https://bitbucket.org/pedrorijo91/nodejstutorial/src/db4c991864e65c4d72e98a1dc94e33606e3adde9/node_modules/levelmeup/data/horse_js.csv?at=master
What you need is a link to the raw file! Clicking on '...' and pressing 'Open raw', we get:
https://bitbucket.org/pedrorijo91/nodejstutorial/raw/db4c991864e65c4d72e98a1dc94e33606e3adde9/node_modules/levelmeup/data/horse_js.csv
Let's look at this in detail. The link is the same up to the project name:
https://bitbucket.org/pedrorijo91/nodejstutorial/
afterwards, the raw file is under raw/
then it's the same commit hash (it looks random, but it's the same letters and numbers in both links):
db4c991864e65c4d72e98a1dc94e33606e3adde9/
Finally, it's the same directory structure:
node_modules/levelmeup/data/horse_js.csv
The first link ends with ?at=master, which is parsed by the web server, and goes through src/ on the web server. The second link, the actual link to the raw file, goes through raw/ and ends with .csv.
import pandas as pd
RAW_Bitbucket_URL = 'https://bitbucket.org/pedrorijo91/nodejstutorial/raw/db4c991864e65c4d72e98a1dc94e33606e3adde9/node_modules/levelmeup/data/horse_js.csv'
df = pd.read_csv(RAW_Bitbucket_URL)
The above code is successful for me.
Alternatively, you can download the entire file first: make the request with requests, save it, and then read it as a local file with pandas.read_csv().
>>> import pandas as pd
>>> import requests
>>> url = 'https://bitbucket.org/pedrorijo91/nodejstutorial/raw/db4c991864e65c4d72e98a1dc94e33606e3adde9/node_modules/levelmeup/data/horse_js.csv'
>>> r = requests.get(url, allow_redirects=True)
>>> open('file.csv', 'wb').write(r.content)
>>> pd.read_csv('file.csv', encoding='utf-8-sig').head()
ID Tweet Date Via
0 374667940827635712 So, yes, a 100% JS App is 100% awesome 08:59:32, 9-3, 2013 web
1 374656867466637312 "vituperating priests" who rail against JavaSc... 08:15:32, 9-3, 2013 web
2 374654221292806144 Node/Browserify/CJS folks, is there any benefit 08:05:01, 9-3, 2013 Twitter for iPhone
3 374640446955212800 100% JavaScript applications. You may get some 07:10:17, 9-3, 2013 Twitter for iPhone
4 374613490763169792 A node.js app that will order you a sandwich 05:23:10, 9-3, 2013 web
I am interested in extracting the 'Company Name' column from this link:
https://calgaryeconomicdevelopment.com/assets/PDFs/Industry-Quick-Lists/Energy-2019-07.pdf
I was able to achieve something similar with this solution: How do I decode text from a pdf online with Requests?
However, I was wondering how I would go about extracting only the company name column, since that solution returns all of the text in an unstructured format. Thanks in advance; I am fairly new to Python and having difficulties.
You get the error because the server is preventing bots from web scraping, or something similar. I don't quite understand it either, but I found a fix, which is to download the file locally first and then use tabula to get the data, like so:
import requests
from tabula import read_pdf
url = "https://calgaryeconomicdevelopment.com/assets/PDFs/Industry-Quick-Lists/Energy-2019-07.pdf"
r = requests.get(url, allow_redirects=True)
open('data.pdf', 'wb').write(r.content)
tables = read_pdf("data.pdf", pages = "all", multiple_tables = True)
You may then get the following message:
tabula.errors.JavaNotFoundError: `java` command is not found from this Python process.Please ensure Java is installed and PATH is set for `java`
To fix it, follow the steps from this thread:
`java` command is not found from this Python process. Please ensure Java is installed and PATH is set for `java`
and everything should be working.
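To then pull out just the company names, one rough approach is sketched below (the exact column header in that PDF is an assumption; print tables[0].columns to confirm it before relying on it):
import pandas as pd
# concatenate the per-page tables returned by read_pdf and select the column
all_rows = pd.concat(tables, ignore_index=True)
company_names = all_rows['Company Name']  # assumed header; adjust after inspecting the tables
print(company_names.head())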
There is a python library named tabula-py
You can install it using "pip install tabula-py"
You can use it as follows:
import tabula
file = "https://calgaryeconomicdevelopment.com/assets/PDFs/Industry-Quick-Lists/Energy-2019-07.pdf"
tables = tabula.read_pdf(file, pages = "all", multiple_tables = True)
You can use this to convert the table to a CSV file:
tabula.convert_into(file, "table.csv")
Then you can use the csv library to get the required columns you want.
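For example, a minimal sketch with the csv module (this assumes the column is literally named 'Company Name' in table.csv; adjust to the real header):
import csv
with open('table.csv', newline='') as f:
    reader = csv.DictReader(f)
    company_names = [row['Company Name'] for row in reader]  # assumed header name
print(company_names[:5])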
I've attempted to use urllib, requests, and wget. None of the three works.
I'm trying to download a 300 KB .npz file from a URL. When I download the file with wget.download(), urllib.request.urlretrieve(), or requests, no error is thrown and the .npz file downloads. However, the file is not 300 KB; it is only 1 KB. It is also unreadable: when I use np.load(), the error OSError: Failed to interpret file 'x.npz' as a pickle shows up.
I am also certain that the URL is valid. When I download the file with a browser, it is correctly read by np.load() and has the right file size.
Thank you very much for the help.
Edit 1:
The full code was requested. This was the code:
loadfrom = "http://example.com/dist/x.npz"
savedir = "x.npz"
wget.download(loadfrom, savedir)
data = np.load(savedir)
I've also used variants with urllib:
loadfrom = "http://example.com/dist/x.npz"
savedir = "x.npz"
urllib.request.urlretrieve(loadfrom, savedir)
data = np.load(savedir)
and requests:
loadfrom = "http://example.com/dist/x.npz"
savedir = "x.npz"
r = requests.get(loadfrom).content
with open("x.npz", 'wb') as f:
    f.write(r)
data = np.load(savedir)
They all produce the same result, with the aforementioned conditions.
Kindly show the full code and the exact lines you use to download the file. Remember you need to use
r=requests.get("direct_URL_of_your_file.npz").content
with open("local_file.npz", 'wb') as f:
    f.write(r)
Also make sure the URL is a direct download link.
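A quick sanity check that you actually received the binary file and not an HTML page, as a rough sketch:
import requests
r = requests.get("direct_URL_of_your_file.npz")
print(r.headers.get('Content-Type'))  # a direct download should not be text/html
print(len(r.content))  # should be roughly the expected size (~300 KB), not ~1 KB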
The issue was that the server required JavaScript to run as a security precaution. So, when I sent the request, all I got was HTML saying "This Site Requires Javascript to Work". I found out that there was a __test cookie that needed to be passed with the request.
This answer explains it fully. This video may also be helpful.
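As a rough illustration of the workaround, you can pass a cookie captured from a real browser session along with the request (the cookie value here is a placeholder; the linked answer shows how the value is actually derived):
import requests
url = "http://example.com/dist/x.npz"
cookies = {'__test': 'VALUE_FROM_BROWSER'}  # placeholder value taken from the browser's dev tools
r = requests.get(url, cookies=cookies)
with open('x.npz', 'wb') as f:
    f.write(r.content)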
I have a PDF that I'm trying to get Tika to parse. The PDF has not been OCR'd. Tesseract is installed on my machine.
I used ImageMagick to convert file.pdf to file.tiff, so the TIFF I am parsing is a direct conversion from the PDF.
Tika parses the TIFF with no problems, but returns "None" content for the PDF. What gives? I'm using Tika 1.14.1, Tesseract 3.03, and Leptonica 1.70.
Here is the code...
from tika import parser
# This works
print(parser.from_file('/from/file.tiff', 'http://localhost:9998/tika'))
# This returns "None" for content
print(parser.from_file('/from/file.pdf', 'http://localhost:9998/tika'))
So, after some feedback from Chris Mattmann (who was wonderful and very helpful!), I sorted out the issue.
His response:
Since Tika Python acts as a thin client to the REST server, you just
need to make sure the REST server is started with a classpath
configuration that sets the right flags for TesseractOCR, see here:
http://wiki.apache.org/tika/TikaOCR
While I had read this before, the issue did not click for me until later, after some further reading. Tesseract does not natively support OCR conversion of PDFs; therefore, Tika doesn't either, since Tika relies on Tesseract for that support (and, by extension, neither does tika-python).
My solution:
I combined subprocess, ImageMagick (CLI) and Tika to work together in python to first convert the PDF to a TIFF, and then allow Tika/Tesseract to perform an OCR conversion on the file.
Notes:
This process is very slow for large PDFs
Requires: tika-python, tesseract, imagemagick
The code:
from tika import parser
import subprocess
import os
def ConvertPDFToOCR(file):
    meta = parser.from_file(file, 'http://localhost:9998/tika')
    # Check if parsed content is NoneType and handle accordingly.
    if "content" in meta and meta['content'] is None:
        # Run ImageMagick via subprocess (command line) to convert the PDF to a TIFF
        params = ['convert', '-density', '300', file, '-depth', '8', '-strip', '-background', 'white', '-alpha', 'off', 'temp.tiff']
        subprocess.check_call(params)
        # Run Tika again on the new temp.tiff file
        meta = parser.from_file('temp.tiff', 'http://localhost:9998/tika')
        # Delete the temporary file
        os.remove('temp.tiff')
    return meta['content']
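Example usage (the file path is a placeholder):
text = ConvertPDFToOCR('/from/file.pdf')
print(text)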
You can set the X-Tika-PDFextractInlineImages header to 'true' and directly extract text from images in the PDFs. No need for conversion. It took a while to figure out, but it works perfectly.
from tika import parser
headers = {
'X-Tika-PDFextractInlineImages': 'true',
}
parsed = parser.from_file("Citi.pdf",serverEndpoint='http://localhost:9998/rmeta/text',headers=headers)
print(parsed['content'])
I'm trying to download a file from a website, but it looks like the site is detecting urllib and doesn't allow the download (I'm getting the error "urllib.error.HTTPError: HTTP Error 403: Forbidden").
How can I fix this? I found on the internet that I have to add a header, but the answers weren't going the way I need (they used Request, and I didn't find anything about an argument for a header in urllib.request.urlretrieve()).
I'm using Python 3.6
Here's the code:
import urllib.request
filelink = 'https://randomwebsite.com/changelog.txt'
filename = filelink.rsplit('/', 1)
filename = str(filename[1])
urllib.request.urlretrieve(filelink, filename)
I want to include a header to give me permission to download the file, but I need to keep a line like the last one, using the two variables (one for the link to the file and one for the name, which depends on the link).
Thanks in advance for your help!
Check the below link:
https://stackoverflow.com/a/7244263/5903276
The most correct way to do this would be to use the urllib.request.urlopen function to return a file-like object that represents an HTTP response and copy it to a real file using shutil.copyfileobj.
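For example, a sketch that keeps the asker's two variables but sends a User-Agent header (the header value is just an example; any browser-like string may do):
import shutil
import urllib.request
filelink = 'https://randomwebsite.com/changelog.txt'
filename = filelink.rsplit('/', 1)[1]
# build a Request with a custom header, open it, and copy the response body to a local file
req = urllib.request.Request(filelink, headers={'User-Agent': 'Mozilla/5.0'})  # example header value
with urllib.request.urlopen(req) as response, open(filename, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)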