Downloading a PDF file using Python

I am working with an API that returns a document ID, which I can then use to fetch the PDF. I append the ID to the base URL like this:
from urllib.request import urlopen
document_id = 'fanlfe48ry4ihkefewfl934'
full_url = url + document_id  # url holds the API's base URL
response = urlopen(full_url)
# binary mode, because response.read() returns bytes
with open("document.pdf", 'wb') as file:
    file.write(response.read())
But I only get an HTML response. The URL does not point to a file with a .pdf extension; it is a link that, when clicked in a browser, opens a save dialog and then saves the file as a PDF.
I don't understand how to handle this situation.
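One way to narrow this down is to check whether the bytes that come back are actually a PDF before saving them. The following is only a sketch under assumptions: the helper names looks_like_pdf and save_pdf are hypothetical, and the check relies on PDF files starting with the signature %PDF-. If the check fails, the server most likely returned an HTML page (for example a login or redirect page) instead of the document.

```python
PDF_SIGNATURE = b"%PDF-"  # PDF files begin with this byte sequence


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the PDF signature."""
    return data.startswith(PDF_SIGNATURE)


def save_pdf(data: bytes, path: str) -> None:
    """Write the payload to path in binary mode, refusing non-PDF content."""
    if not looks_like_pdf(data):
        raise ValueError("body is not a PDF (possibly an HTML page)")
    with open(path, "wb") as fh:
        fh.write(data)


# Hypothetical usage with the question's variables:
# save_pdf(urlopen(full_url).read(), "document.pdf")
```

If save_pdf raises, inspecting the returned HTML usually shows what the server expects, such as authentication headers or a different endpoint.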

Related

Exception: No parsed pages. Please parse page first

I am trying to read a whole PDF file that is more than 250 pages long. To do that, I first convert the PDF to docx with the pdf2docx library.
Here is the code:
from docx import Document
document = Document()
document.save('file.docx')
url = file_path  # (Google Drive URL where the file was uploaded)
response = requests.get(url)
my_raw_data = response.content
with open("my_pdf.pdf", 'wb') as my_data:
    my_data.write(my_raw_data)
open_pdf_file = open("my_pdf.pdf", 'rb')
cv = Converter(open_pdf_file)
cv.convert("roshni.docx")
Parse = parser.from_file("file.docx")
data = []
for i in Parse['content'].strip().split('\n'):
    if len(i.split()) < 5:
        pass
    else:
        data.append(i)
Text = data[1:-1]
But I am not able to read the file. I get the error "Exception: No parsed pages. Please parse page first."
How can I solve this issue? How can I read a whole PDF using Python?

Download a file without name using Python

I want to download a file. The HTML page contains a hyperlink that does not include the file name or extension. How can I download the file using Python?
For example, the link is http://1.1.1.1:8080/tank-20/a/at_download/file,
and whenever I click on it, the file is downloaded and opened by the browser.
Use Python requests to get the body of the response and write it to a file; this is essentially what the browser does when you click the link.
Try the following:
import requests
# define variables
request_url = "http://1.1.1.1:8080/tank-20/a/at_download/file"
output_file = "output.txt"
# send get request
response = requests.get(request_url)
# response.content is bytes, so open the file in binary mode;
# 'with' closes the file automatically
with open(output_file, 'wb') as fh:
    fh.write(response.content)

Retrieve filename from the API URL once its downloaded, Python 3.6

I am downloading a file from the API URL http://api.worldbank.org/v2/en/topic/19?downloadformat=csv, and the download arrives as "API_19_DS2_en_csv_v2_10225248.zip".
Unlike a URL such as "http://databank.worldbank.org/data/download/SE4ALL_csv.zip", where I can use
ntpath.basename(URL)
the API URL does not contain the file name.
How can I get the file name?
The code below works:
r = requests.get(Source_Link)
URL_Metadata = r.headers['Content-Disposition']
Source_File_Name = URL_Metadata[URL_Metadata.find('filename=')+9:]
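The slicing above keeps any surrounding quotes and any parameters that follow the file name. As a hedged alternative, the header value can be parsed with the standard library's email.message.Message, which understands this parameter syntax. The function name filename_from_content_disposition is hypothetical, and the header strings in the comment are illustrative rather than captured from the World Bank API.

```python
from email.message import Message


def filename_from_content_disposition(header_value):
    """Extract the filename parameter from a Content-Disposition value.

    Returns None when the header carries no filename parameter.
    """
    msg = Message()
    msg["Content-Disposition"] = header_value
    return msg.get_filename()


# Hypothetical usage with the variables from the code above:
# Source_File_Name = filename_from_content_disposition(
#     r.headers['Content-Disposition'])
```

This handles both quoted and unquoted file names, and returns None instead of a wrong slice when the parameter is missing.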

How to save a file with a URL name?

I'm saving the response of an HTTP request as an HTML page.
How can I save the HTML file under the name of the URL?
I'm using Linux.
So the file name would look like this: "http://www.test.com.html"
My code:
url = "http://www.test.com"
page = urllib.urlopen(url).read()
f = open("./file.html", "w")
f.write(page)
f.close()
Unfortunately, you cannot save a file under the full URL as its name. The character "/" is not allowed in file names on Linux (it is the path separator), nor on Windows.
However, you can derive the name www.test.com.html from the host part of the URL:
file_name = url.split('/')[2] + '.html'
If you need to handle something like https://www.test.com/posts/1, you can replace / with a custom string that does not usually occur in URLs, such as __:
url = 'https://www.test.com/posts/11111111'
file_name = '__'.join(url.split('/')[2:])
This results in
www.test.com__posts__11111111
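The same idea can be sketched with urllib.parse, which avoids relying on the position of the host in the split list. The function name url_to_filename and the choice to drop the query string are assumptions made for illustration, not part of the answer above.

```python
from urllib.parse import urlsplit


def url_to_filename(url, separator="__", extension=".html"):
    """Build a file name from the host and path of a URL.

    The scheme and query string are dropped, and every '/' in the path
    is replaced by the separator so the result is a valid file name.
    """
    parts = urlsplit(url)
    name = (parts.netloc + parts.path).strip("/")
    return name.replace("/", separator) + extension


print(url_to_filename("http://www.test.com"))
print(url_to_filename("https://www.test.com/posts/11111111"))
```

If two pages differ only in their query string, they would map to the same file name under this sketch, so the query would need to be encoded into the name as well.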

Writing data to csv or text file using python

I am trying to write some data to a CSV file when a condition is met, as described below.
I have a list of URLs in a text file:
urls.txt
www.example.com/3gusb_form.aspx?cid=mum
www.example_second.com/postpaid_mum.aspx?cid=mum
www.example_second.com/feedback.aspx?cid=mum
I go through each URL from the text file, read its content using the urllib2 module in Python, and search for a string in the entire HTML page. If the required string is found, I write that URL to a CSV file.
But when I write the URL to the CSV file, each character is saved in its own column, as shown below, instead of the entire URL being saved in one column:
h t t p s : / / w w w......
Code.py
import urllib2
import csv
search_string = 'Listen Capcha'
html_urls = open('/path/to/input/file/urls.txt','r').readlines()
outputcsv = csv.writer(open('output/path' + 'urls_contaning _%s.csv'%search_string, "wb"),delimiter=',', quoting=csv.QUOTE_MINIMAL)
outputcsv.writerow(['URL'])
for url in html_urls:
    url = url.replace('\n','').strip()
    if not len(url) == 0:
        req = urllib2.Request(url)
        response = urllib2.urlopen(req)
        if str(search_string) in response.read():
            outputcsv.writerow(url)
So what is wrong with the above code, and what needs to be done to save the entire URL (string) in one column of the CSV file?
Also, how can I write the same data to a text file?
Edited
I also have a URL such as http://www.vodafone.in/Pages/tuesdayoffers_che.aspx.
In the browser, this URL is redirected to http://www.vodafone.in/pages/home_che.aspx?cid=che, but when I request it through the code below, the result is essentially the same as the URL given above:
import urllib2, httplib
httplib.HTTPConnection.debuglevel = 1
request = urllib2.Request("http://www.vodafone.in/Pages/tuesdayoffers_che.aspx")
opener = urllib2.build_opener()
f = opener.open(request)
print f.geturl()
Result
http://www.vodafone.in/pages/tuesdayoffers_che.aspx?cid=che
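So, finally, how can I catch the redirected URL with urllib2 and fetch the data from it?
Change the last line of Code.py to:
outputcsv.writerow([url])
writerow expects a sequence of fields. A bare string is itself a sequence of characters, so each character ends up in its own column; wrapping the URL in a list makes it a single field.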
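The difference can be reproduced without any network access. The sketch below uses Python 3 and an in-memory buffer instead of the question's Python 2 file handle; the helper name rows_as_text and the sample URL are made up for illustration.

```python
import csv
import io


def rows_as_text(rows):
    """Write each row with csv.writer and return the resulting CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=",", quoting=csv.QUOTE_MINIMAL)
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()


url = "http://a.io"
wrong = rows_as_text([url])    # the string is iterated character by character
right = rows_as_text([[url]])  # a one-element list yields a single column

print(repr(wrong))
print(repr(right))
```

The first call produces one field per character, which matches the symptom in the question; the second produces the URL as one field.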
