Cannot download pdf using python requests - python

I was able to do this previously, but I think the site might have updated something, and I'm not sure what to change.
URL = "https://www.bursamalaysia.com/misc/missftp/securities/securities_equities_2020-12-10.pdf"
r = requests.get(URL, stream=True)
with open(f"{path_to_store_pdfs}/KLSE 2020-12-10.pdf", "wb") as fd:
    fd.write(r.content)
When I try to download the data using the above code now, the file appears, but opening it gives the error message "Adobe Reader could not open … because it is either not a supported file type or because the file has been damaged".
My main task is to run the following code, which also fails, with the error "PdfReadError: EOF marker not found".
pdf_file = io.BytesIO(r.content)
pdf_reader = PyPDF2.PdfFileReader(pdf_file)
It appears that both problems have to do with the encoding of the pdf, but I'm new to encoding and am not sure if a different encoding was used or a purposely damaged one was used (for detecting bots). Any help or guidance is greatly appreciated.

Check the request status code. For me, it gives 503 Service Unavailable. Setting the User-Agent fixed it:
import requests
user_agent = "scrapping_script/1.0"
headers = {'User-Agent': user_agent}
URL = "https://www.bursamalaysia.com/misc/missftp/securities/securities_equities_2020-12-10.pdf"
r = requests.get(URL, headers=headers, stream=True)
with open("KLSE 2020-12-10.pdf", "wb") as fd:
    fd.write(r.content)
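To catch this kind of failure before PyPDF2 does, one option is to check the status code and take a quick look at the bytes before saving them. The sketch below is only an illustration: the names looks_like_pdf and fetch_pdf are made up for this example, and the 1024-byte window for the header and EOF marker is an assumption, not a rule from the PDF specification.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Cheap sanity check: a PDF header near the start and an EOF marker near the end."""
    return b"%PDF-" in data[:1024] and b"%%EOF" in data[-1024:]


def fetch_pdf(url: str, user_agent: str = "scraping_script/1.0") -> bytes:
    """Download url and return its bytes, failing loudly if it is not a PDF."""
    # Imported here so looks_like_pdf can be used without requests installed.
    import requests

    r = requests.get(url, headers={"User-Agent": user_agent}, timeout=30)
    r.raise_for_status()  # turns a 503 (or any other 4xx/5xx) into an exception
    if not looks_like_pdf(r.content):
        raise ValueError(f"Response from {url} does not look like a PDF")
    return r.content
```

With a check like this, an HTML error page would raise an exception instead of being written to disk as a "damaged" PDF.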

Related

SSRS Download from Python Giving Corrupt File

When I run my script, it creates a corrupt, not-openable PDF file.
Here is my code, which seems to work for a lot of other people:
import requests
from requests_ntlm import HttpNtlmAuth
filename = './report.pdf'
username = "myusername"
password = "mypassword"
url = "http://RptServerIp/ReportServerName/reports/Untitled&rs:Command=Render&rs:Format=PDF"
r = requests.get(url, auth=HttpNtlmAuth(username, password))
print(r.status_code)
if r.status_code == 200:
    with open(filename, 'wb') as out:
        # write in larger chunks; the default chunk size is a single byte
        for bits in r.iter_content(chunk_size=8192):
            out.write(bits)
This is a no-parameter test report (and it's called Untitled). The status code is 200 and I've confirmed the login info is correct by changing one character in the password to make it incorrect, which returns a bad status code. If I go to the URL http://RptServerIp/ReportServerName/reports/Untitled in my browser it shows the report, but if I do the full URL http://RptServerIp/ReportServerName/reports/Untitled&rs:Command=Render&rs:Format=PDF it gives me an error.
Any suggestions?
Thank you to everyone who answered. I found a solution that, for some reason, is hard to find on Stack Overflow.
The URL I was using was something like:
http://ServerIpAddress/Reports/report/Untitled&rs:Command=Render&rs:Format=PDF
but it needed to be changed to:
http://ServerIpAddress/Reportserver?/Untitled&rs:Command=Render&rs:Format=PDF
The big difference is that the browser-facing web portal is served under Reports/{admin-named-subfolder}, whereas URL access goes through Reportserver? with no subfolder included.
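As a rough illustration of the URL shape described above, here is a small helper that assembles it. The function name ssrs_render_url and its parameters are hypothetical, and the "Reportserver" endpoint name is taken from the URL in this answer; your server may be configured with a different virtual directory.

```python
from urllib.parse import quote


def ssrs_render_url(server: str, report_path: str, fmt: str = "PDF",
                    endpoint: str = "Reportserver") -> str:
    """Build a URL that asks the report server to render a report directly."""
    # Percent-encode spaces and similar characters, but keep the slashes.
    path = quote(report_path, safe="/")
    return f"http://{server}/{endpoint}?{path}&rs:Command=Render&rs:Format={fmt}"


print(ssrs_render_url("ServerIpAddress", "/Untitled"))
```

The resulting string can then be passed to requests.get together with the HttpNtlmAuth credentials from the question.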

Can't download csv.gz from website using Python

I'm currently trying to download a csv.gz file from the following link: https://www.cryptoarchive.com.au/bars/pair. As you can see, opening the link with a browser simply opens the save file dialogue. However, passing the link to requests or urllib simply downloads HTML as opposed to the actual file.
This is the current approach I'm trying:
EDIT: Updated to reflect changes I've made.
url = "https://www.cryptoarchive.com.au/bars/pair"
file_name = "test.csv.gz"
headers = {"PLAY_SESSION": play_session}
r = requests.get(url, stream=True, headers=headers)
with open(file_name, "wb") as f:
    for chunk in r.raw.stream(1024, decode_content=False):
        if chunk:
            f.write(chunk)
            f.flush()
The only saved cookie I can find is the PLAY_SESSION. Setting that as a header doesn't change the result I'm getting.
Further, I've tried posting a request to the login page like this:
login = "https://www.cryptoarchive.com.au/signup"
data = {"email": email,
        "password": password,
        "accept": "checked"}
with requests.Session() as s:
    p = s.post(login, data=data)
    print(p.text)
However, this also doesn't seem to work and I especially can't figure out what to pass to the login page or how to actually check the checkbox...
Browsing that URL in a private window shows the error:
Please login/register first.
To get that file, you need to log in to the site first. Logging in will probably give you a session token, a cookie or something similar that you need to send with the request.
Both @Daniel Argüelles' and @Abhyudaya Sharma's answers helped me. The solution was simply to get the PLAY_SESSION cookie after logging into the website and pass it to the request function as a cookie rather than as a header.
cookies = {"PLAY_SESSION": play_session}
url = "https://www.cryptoarchive.com.au/bars/pair"
r = requests.get(url, stream=True, cookies=cookies)
with open(file_name, "wb") as f:
    for chunk in r.raw.stream(1024, decode_content=False):
        if chunk:
            f.write(chunk)
            f.flush()
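Since the original symptom was HTML being saved under a .csv.gz name, it can help to confirm that the downloaded bytes really are gzip data. This is a minimal sketch using only the standard library; the helper names is_gzip and first_line are invented for this example, and it assumes the archive contains UTF-8 text.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def is_gzip(data: bytes) -> bool:
    """Return True if data starts with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


def first_line(data: bytes) -> str:
    """Decompress data and return its first line, e.g. the CSV header."""
    if not is_gzip(data):
        # A login page or error message was probably returned instead.
        raise ValueError(f"Not gzip data; response starts with {data[:40]!r}")
    return gzip.decompress(data).splitlines()[0].decode("utf-8")
```

Calling first_line on the bytes read back from test.csv.gz would either return the CSV header or show the start of the unexpected response.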

Why request fails to download an excel file from web?

The URL is a direct link to a file on the web (an xlsb file) which I am trying to download. The code below runs without error and the file is created at the given path, but when I try to open it, Excel shows a corrupt-file message. The response status is 400, so it is a bad request. Any advice on this?
url = 'http://rigcount.bakerhughes.com/static-files/55ff50da-ac65-410d-924c-fe45b23db298'
file_name = r'local path with xlsb extension'
with open(file_name, "wb") as file:
    response = requests.request(method="GET", url=url)
    file.write(response.content)
It seems to work for me. Try this out:
from requests import get
url = 'http://rigcount.bakerhughes.com/static-files/55ff50da-ac65-410d-924c-fe45b23db298'
# make HTTP request to fetch data
r = get(url)
# check if request is success
r.raise_for_status()
# write out byte content to file
with open('out.xlsb', 'wb') as out_file:
    out_file.write(r.content)

Extract media files from cache using Selenium

I'm trying to download some videos from a website using Selenium.
Unfortunately, I can't download them from the source because the videos are stored in a directory with restricted access. Trying to retrieve them using urllib, requests or ffmpeg returns a 403 Forbidden error, even after injecting my user data into the website.
I was thinking of playing each video in its entirety and storing the media file from the cache.
Would it be a possibility? Where can I find the cache folder in a custom profile? How do I discriminate among files in cache?
EDIT: This is what I attempted to do using requests
import requests
def main():
    s = requests.Session()
    login_page = '<<login_page>>'
    login_data = dict()
    login_data['username'] = '<<username>>'
    login_data['password'] = '<<psw>>'
    login_r = s.post(login_page, data=login_data)  # send the credentials with the login request
    video_src = '<<video_src>>'
    cookies = dict(login_r.cookies) # contains the session cookie
    # static cookies for every session
    cookies['_fbp'] = 'fb.1.1630500067415.734723547'
    cookies['_ga'] = 'GA1.2.823223936.1630500067'
    cookies['_gat'] = '1'
    cookies['_gid'] = 'GA1.2.1293544716.1631011551'
    cookies['user'] = '66051'
    video_r = s.get(video_src, cookies=cookies)
    print(video_r.status_code)
if __name__ == '__main__':
    main()
The print() function returns:
403
This is the network tab for the video:
Regarding video_r = s.get(video_src, cookies=cookies): have you tried streaming the response? Many websites do not allow a file to be downloaded as one block and expect byte-range requests. Note that stream=True only defers reading the body so that it can be written in chunks; it does not add a Range header by itself.
with open('...', 'wb') as f:
    response = s.get(url=link, stream=True)
    for chunk in response.iter_content(chunk_size=512):
        if chunk: # filter out keep-alive chunks
            f.write(chunk)
You can send a HEAD request first if you want. The Content-Length header in its response gives the full size, which lets you build a progress bar.
Also, a 403 is commonly used by anti-bot systems; maybe your Selenium session is being detected.
You are blocked because you forgot about the headers.
You must use:
s.get('https://httpbin.org/headers', headers={'user-agent': '<the user agent value (for example, the last line of your uploaded image)>'})
or:
s.headers.update({'user-agent': '<the user agent value (for example, the last line of your uploaded image)>'})
before sending a request.
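Putting the two answers together, one possible approach is to set the User-Agent on the session, ask for the size with a HEAD request, and then fetch the file in explicit byte ranges. This is only a sketch under several assumptions: the server reports Content-Length, honours Range requests, and accepts the chosen User-Agent. The function names and the 1 MiB chunk size are invented for this example.

```python
def byte_ranges(total: int, chunk: int):
    """Yield inclusive (start, end) offsets that cover total bytes."""
    for start in range(0, total, chunk):
        yield start, min(start + chunk, total) - 1


def range_header(start: int, end: int) -> str:
    """Format the value of an HTTP Range header for one byte range."""
    return f"bytes={start}-{end}"


def download_in_ranges(url: str, dest: str, user_agent: str,
                       chunk: int = 1024 * 1024) -> None:
    """Download url to dest one byte range at a time."""
    # Imported here so the helpers above can be used without requests installed.
    import requests

    with requests.Session() as s:
        s.headers.update({"User-Agent": user_agent})
        head = s.head(url, allow_redirects=True)
        head.raise_for_status()
        total = int(head.headers["Content-Length"])  # assumes the header is present
        with open(dest, "wb") as f:
            for start, end in byte_ranges(total, chunk):
                r = s.get(url, headers={"Range": range_header(start, end)})
                r.raise_for_status()
                f.write(r.content)
                print(f"{end + 1}/{total} bytes")  # simple progress report
```

If the server still answers 403, the block is probably based on something other than the headers, such as the session cookies or the Referer.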

How do i download pdf file over https with python

I am writing a Python script that saves a PDF file locally, named according to the format given in the URL, e.g.:
https://Hostname/saveReport/file_name.pdf #saves the content in PDF file.
I am opening this URL through a Python script:
import webbrowser
webbrowser.open("https://Hostname/saveReport/file_name.pdf")
The page at this URL contains lots of images and text. Once the URL is opened, I want to save it as a PDF file using a Python script.
This is what I have done so far.
Code 1:
import requests
url="https://Hostname/saveReport/file_name.pdf" #Note: It's https
r = requests.get(url, auth=('usrname', 'password'), verify=False)
with open("file_name.pdf", 'wb') as file:  # binary mode: a PDF is not text
    file.write(r.content)  # a requests Response has .content, not .read()
Code 2:
import urllib2
import ssl
url="https://Hostname/saveReport/file_name.pdf"
context = ssl._create_unverified_context()
response = urllib2.urlopen(url, context=context) #How should i pass authorization details here?
html = response.read()
With the above code I get: urllib2.HTTPError: HTTP Error 401: Unauthorized
If I use Code 2, how can I pass the authorization details?
I think this will work:
import requests
import shutil
url="https://Hostname/saveReport/file_name.pdf" #Note: It's https
r = requests.get(url, auth=('usrname', 'password'), verify=False, stream=True)
r.raw.decode_content = True
with open("file_name.pdf", 'wb') as f:
    shutil.copyfileobj(r.raw, f)
One way you can do that is:
import urllib3
urllib3.disable_warnings()
url = r"https://websitewithfile.com/file.pdf"
fileName = r"file.pdf"
with urllib3.PoolManager() as http:
    r = http.request('GET', url)
    with open(fileName, 'wb') as fout:
        fout.write(r.data)
You can try something like :
import requests
response = requests.get('https://websitewithfile.com/file.pdf', verify=False, auth=('user', 'pass'))
with open('file.pdf', 'wb') as fout:  # binary mode, since a PDF is not text
    fout.write(response.content)  # Response exposes .content, not .read()
For some files, at least tar archives (and possibly others), you can use pip:
import sys
from subprocess import call, PIPE
url = "https://blabla.bla/foo.tar.gz"
call([sys.executable, "-m", "pip", "download", url], stdout=PIPE, stderr=PIPE)
However, you should confirm that the download succeeded in some other way, because pip raises an error for any file that is not an archive containing setup.py; hence stderr=PIPE. (Alternatively, you may be able to tell whether the download succeeded by parsing the error message of the subprocess.)
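None of the answers above addresses the Code 2 part of the question, namely how to pass credentials without requests. Below is a minimal Python 3 sketch using urllib.request, the successor of urllib2. It assumes the server uses HTTP Basic authentication, which the question does not confirm; the function names are made up for this example.

```python
import base64
import ssl
import urllib.request


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an Authorization header for HTTP Basic auth."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def download(url: str, dest: str, username: str, password: str,
             verify: bool = True) -> None:
    """Fetch url with Basic auth and write the body to dest."""
    request = urllib.request.Request(
        url, headers={"Authorization": basic_auth_header(username, password)}
    )
    # verify=False mirrors the unverified context used in the question.
    context = ssl.create_default_context() if verify else ssl._create_unverified_context()
    with urllib.request.urlopen(request, context=context) as response:
        with open(dest, "wb") as f:
            f.write(response.read())
```

If the server uses a different scheme, such as NTLM or a login form, this header will not be enough and the 401 will persist.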
