Can't download csv.gz from website using Python

I'm currently trying to download a csv.gz file from the following link: https://www.cryptoarchive.com.au/bars/pair. As you can see, opening the link with a browser simply opens the save file dialogue. However, passing the link to requests or urllib simply downloads HTML as opposed to the actual file.
This is the current approach I'm trying:
EDIT: Updated to reflect changes I've made.
url = "https://www.cryptoarchive.com.au/bars/pair"
file_name = "test.csv.gz"
headers = {"PLAY_SESSION": play_session}
r = requests.get(url, stream=True, headers=headers)
with open(file_name, "wb") as f:
for chunk in r.raw.stream(1024, decode_content=False):
if chunk:
f.write(chunk)
f.flush()
The only saved cookie I can find is the PLAY_SESSION. Setting that as a header doesn't change the result I'm getting.
Further, I've tried posting a request to the login page like this:
login = "https://www.cryptoarchive.com.au/signup"
data = {"email": email,
"password": password,
"accept": "checked"}
with requests.Session() as s:
p = s.post(login, data=data)
print(p.text)
However, this doesn't seem to work either, and I especially can't figure out what to pass to the login page or how to actually check the checkbox...

Just browsing that URL in a private window shows the error:
Please login/register first.
To get that file, you need to log in to the site first. The login will probably give you a session token, a cookie, or something similar that you then need to include in the request.
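A minimal sketch of that flow with requests.Session, which stores whatever cookies the login sets and sends them on later requests. The login URL and form field names below are guesses, not confirmed against the site; check the real ones in your browser's developer tools.

import requests

with requests.Session() as s:
    # Hypothetical endpoint and field names -- inspect the actual login form first.
    s.post("https://www.cryptoarchive.com.au/login",
           data={"email": "you@example.com", "password": "..."})
    # The session now carries the cookies set by the login (e.g. PLAY_SESSION),
    # so this request should be treated as authenticated.
    r = s.get("https://www.cryptoarchive.com.au/bars/pair", stream=True)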

Both @Daniel Argüelles' and @Abhyudaya Sharma's answers have helped me. The solution was simply getting the PLAY_SESSION cookie after logging into the website and passing it to the request function as a cookie (not as a header):
cookies = {"PLAY_SESSION": play_session}
url = "https://www.cryptoarchive.com.au/bars/pair"
r = requests.get(url, stream=True, cookies=cookies)
with open(file_name, "wb") as f:
for chunk in r.raw.stream(1024, decode_content=False):
if chunk:
f.write(chunk)
f.flush()
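Given that the original failure mode was receiving HTML instead of the file, a quick sanity check is to verify the gzip magic bytes (0x1f 0x8b) before parsing. A small sketch; the pandas call is just one convenient way to read the result and assumes pandas is installed.

import pandas as pd  # assumption: pandas is available

with open(file_name, "rb") as f:
    magic = f.read(2)
if magic != b"\x1f\x8b":  # every gzip stream starts with these two bytes
    raise ValueError("Not gzip data - probably an HTML error page; check the cookie")

df = pd.read_csv(file_name, compression="gzip")  # reads the compressed CSV directly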

Related

Extract media files from cache using Selenium

I'm trying to download some videos from a website using Selenium.
Unfortunately, I can't download them from the source because the videos are stored in a directory with restricted access; trying to retrieve them using urllib, requests, or ffmpeg returns a 403 Forbidden error, even after injecting my user data into the requests.
I was thinking of playing each video in its entirety and then saving the media file from the cache.
Would that be a possibility? Where can I find the cache folder in a custom profile? And how do I discriminate among the files in the cache?
EDIT: This is what I attempted to do using requests
import requests

def main():
    s = requests.Session()
    login_page = '<<login_page>>'
    login_data = dict()
    login_data['username'] = '<<username>>'
    login_data['password'] = '<<psw>>'
    login_r = s.post(login_page, data=login_data)
    video_src = '<<video_src>>'
    cookies = dict(login_r.cookies)  # contains the session cookie
    # static cookies for every session
    cookies['_fbp'] = 'fb.1.1630500067415.734723547'
    cookies['_ga'] = 'GA1.2.823223936.1630500067'
    cookies['_gat'] = '1'
    cookies['_gid'] = 'GA1.2.1293544716.1631011551'
    cookies['user'] = '66051'
    video_r = s.get(video_src, cookies=cookies)
    print(video_r.status_code)

if __name__ == '__main__':
    main()
The print() function returns:
403
This is the network tab for the video (screenshot not reproduced here; per the answer below, its last line showed the request's User-Agent header):
Regarding video_r = s.get(video_src, cookies=cookies): have you tried streaming the response? That sends the correct byte-range headers to download the video; most websites prevent downloading the file as "one" block.
with open('...', 'wb') as f:
    response = s.get(url=link, stream=True)
    for chunk in response.iter_content(chunk_size=512):
        if chunk:  # filter out keep-alive new chunks
            f.write(chunk)
You can send a HEAD request first if you want; its headers give you the full content length, which lets you build a progress bar.
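A sketch of that idea, reusing s and link from the snippet above and assuming the server reports a Content-Length header (not every server does):

# HEAD request to learn the total size, then a streamed GET with a simple progress display
head = s.head(link)
total = int(head.headers.get('Content-Length', 0))

downloaded = 0
with open('video.mp4', 'wb') as f:
    response = s.get(url=link, stream=True)
    for chunk in response.iter_content(chunk_size=512):
        if chunk:
            f.write(chunk)
            downloaded += len(chunk)
            if total:
                print(f'\r{downloaded / total:.1%}', end='')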
Also, a 403 is commonly returned by anti-bot systems; your Selenium instance may have been detected.
You were blocked because you forgot about headers.
You must use:
s.get('https://httpbin.org/headers', headers={'user-agent': <the User-Agent value (for example, the last line of your uploaded image)>})
or:
s.headers.update({'user-agent': <the User-Agent value (for example, the last line of your uploaded image)>})
before sending a request.
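Put together with the script from the question, that looks roughly like this; the User-Agent string below is only an example of a desktop-browser value, so copy the one your own browser actually sends (e.g. from the network tab):

import requests

s = requests.Session()
# Example User-Agent; replace it with the value your browser sends.
s.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0 Safari/537.36'
})
# video_src and cookies as built in the question's script
video_r = s.get(video_src, cookies=cookies)
print(video_r.status_code)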

Cannot download pdf using python requests

I was able to do this previously, but I think the site might have updated something, and I'm not sure what to change.
URL = "https://www.bursamalaysia.com/misc/missftp/securities/securities_equities_2020-12-10.pdf"
r = requests.get(URL, stream = True)
with open(f"{path_to_store_pdfs}/KLSE 2020-12-10.pdf", "wb") as fd:
fd.write(r.content)
When I try to download the data using the above code now, the file appears but there's an error message that says "Adobe Reader could not open … because it is either not a supported file type or because the file has been damaged"
My main task is to perform the following code, which also does not work and gives the error "PdfReadError: EOF marker not found".
import io
import PyPDF2

pdf_file = io.BytesIO(r.content)
pdf_reader = PyPDF2.PdfFileReader(pdf_file)
It appears that both problems have to do with the encoding of the PDF, but I'm new to encodings and am not sure whether a different encoding was used or whether the file is purposely damaged (to detect bots). Any help or guidance is greatly appreciated.
Check the request status code. For me, it gives 503 Service Unavailable. Setting the User-Agent fixed it:
import requests

user_agent = "scrapping_script/1.0"
headers = {'User-Agent': user_agent}
URL = "https://www.bursamalaysia.com/misc/missftp/securities/securities_equities_2020-12-10.pdf"

r = requests.get(URL, headers=headers, stream=True)
with open("KLSE 2020-12-10.pdf", "wb") as fd:
    fd.write(r.content)
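To fail loudly instead of silently writing an error page to disk, you can also check the status and the PDF magic bytes before saving; that would have surfaced the "EOF marker not found" problem immediately, since an HTML error page has no PDF structure. A small sketch; the %PDF check is a generic sanity test, not something specific to this site:

r = requests.get(URL, headers=headers, stream=True)
r.raise_for_status()  # raises for 4xx/5xx responses such as the 503 above

if not r.content.startswith(b"%PDF"):  # every PDF starts with this marker
    raise ValueError("Response is not a PDF - probably an HTML error page")

with open("KLSE 2020-12-10.pdf", "wb") as fd:
    fd.write(r.content)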

Grab auto-download links using requests

I'm trying to grab the auto-started direct download link from Yourupload using BS4.
The direct download link is auto-generated every time, and the download starts automatically after 5 seconds.
I want to get the direct download link and store it in a "Link.txt" file.
import requests
import bs4

req = requests.get('https://www.yourupload.com/download?file=2573285', stream=True)
soup = bs4.BeautifulSoup(req.text, 'lxml')
print(soup)
Well, the site is actually running JavaScript code that handles the redirect to the final destination URL, which streams the download after validating a token.
Now we will be more wolfish and get through it.
We first send a GET request while maintaining the session via requests.Session(), extract the token from the page, and then send a second GET request to download the video :).
That means you end up holding the final URL, so you can do whatever you like: download it now or later.
import requests
from bs4 import BeautifulSoup

def Main():
    main = "https://www.yourupload.com/download?file=2573285"
    with requests.Session() as req:
        r = req.get(main)
        soup = BeautifulSoup(r.text, 'html.parser')
        token = soup.findAll("script")[2].text.split("'")[1][-4:]
        headers = {
            'Referer': main
        }
        r = req.get(
            f"https://www.yourupload.com/download?file=2573285&sendFile=true&token={token}",
            stream=True, headers=headers)
        print(f"Downloading From {r.url}")
        name = r.headers.get("Content-Disposition").split('"')[1]
        with open(name, 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024*1024):
                if chunk:
                    f.write(chunk)
        print(f"File {name} Saved.")

Main()
Output:
Downloading From https://s205.vidcache.net:8166/play/a202003090La0xSot1Kl/okanime-2107-HD-19_99?&attach=okanime-2107-HD-19_99.mp4
File okanime-2107-HD-19_99.mp4 Saved.
Confirmation by size (screenshot omitted): as you can see, the file is 250M.
Notice that the download link is only callable once, as the token is validated a single time by the back-end.
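Since the original goal was to store the link in "Link.txt" rather than download it right away, you could write r.url out instead, as sketched below; but because of the one-time token just described, the saved link will most likely be dead by the time it is reused.

# Save the resolved direct link from the streamed response above.
# Caveat: the token embedded in this URL has already been consumed,
# so the stored link will probably no longer be valid.
with open("Link.txt", "w") as f:
    f.write(r.url)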

Python Requests GET fails after successful session login

So I'm using the Python Requests library to log in to a PHP web CMS. So far, I was able to log in using the POST command with a payload, and I am able to download a file.
The problem is: when I run the GET command right after logging in via POST, it tells me that I'm not logged in anymore, although I'm still using the same session! Please have a look at the code:
import requests

# Let's log in
with requests.session() as s:
    payload = {'username': strUserName, 'password': strUserPass, 'Submit': 'Login'}
    r = s.post(urlToLoginPHP, data=payload, stream=True)
    # OK, we are logged in. If I ran the downloading code right here,
    # I would get a correct zip file.
    # But since the following GET delivers the "please login" page,
    # it doesn't work anymore:
    r2 = s.get("urlToAnotherPageOfThisWebsite", stream=True)
    # We are not logged in anymore

    # DOWNLOADING files: this now just delivers a 5 KB file which contains
    # the content of the "please login" page
    local_filename = 'image.zip'
    # NOTE the stream=True parameter
    r1 = s.get(downloadurl, stream=True)
    with open(local_filename, 'wb') as f:
        for chunk in r1.iter_content(chunk_size=1024):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)
                f.flush()
I found the solution: I was logged in through the browser and was trying to log in via Python at the same time. The site noticed that I was already logged in, and I had misinterpreted its response.

Log in to a website and download a file with python requests

I have a website with an HTML form. After logging in, it takes me to a start.php page and then redirects me to overview.php.
I want to download files from that server... When I click the download link of a ZIP file, the address behind the link is:
getimage.php?path="vol/img"&id="4312432"
How can I do that with requests? I tried to create a session and issue the GET command with the right params... but the answer is just the website I would see when I'm not logged in.
import requests

c = requests.Session()
c.auth = ('myusername', 'myPass')
request1 = c.get(myUrlToStart.PHP)
tex = request1.text

with open('data.zip', 'wb') as handle:
    request2 = c.get(urlToGetImage.Php, params=payload2, stream=True)
    print(request2.headers)
    for block in request2.iter_content(1024):
        if not block:
            break
        handle.write(block)
What you're doing is a request with basic authentication, which does not fill out the form displayed on the page.
If you know the URL that the form sends its POST request to, you can try sending the form data directly to that URL.
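A minimal sketch of that approach; the action URL and field names below are placeholders, so read the real ones from the <form> element in the page source:

import requests

s = requests.Session()
# Placeholder login action and field names -- take them from the <form> markup.
s.post('https://example.com/login.php',
       data={'username': 'myusername', 'password': 'myPass'})
# The session cookie set by the login is now sent automatically:
r = s.get('https://example.com/getimage.php',
          params={'path': 'vol/img', 'id': '4312432'}, stream=True)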
Those who are looking for the same thing could try this:
import requests

site_url = 'site_url_here'
userid = 'userid'
password = 'password'
file_url = 'getimage.php?path="vol/img"&id="4312432"'
o_file = 'abc.zip'

# create a session
s = requests.Session()
# GET request; this will generate the cookie for you
s.get(site_url)
# log in to the site
s.post(site_url, data={'_username': userid, '_password': password})
# next, visit the URL of the file you would like to download
r = s.get(file_url)
# download the file
with open(o_file, 'wb') as output:
    output.write(r.content)
print(f"requests:: File {o_file} downloaded successfully!")

# close the session once all work is done
s.close()
