Grab auto-started download links using requests - Python

I'm trying to grab the auto-started direct download link from Yourupload using Bs4.
The direct download link is generated anew every time, and the download starts automatically after 5 seconds.
I want to get the direct download link and store it in a "Link.txt" file.
import requests
import bs4

req = requests.get('https://www.yourupload.com/download?file=2573285', stream=True)
soup = bs4.BeautifulSoup(req.text, 'lxml')
print(soup)

Well, actually the site runs JavaScript that redirects to the final-destination URL and streams the download after a token validation.
We can get through that ourselves: first send a GET request within requests.Session() so the session (and its cookies) is maintained, extract the token from the page, then send a second GET request to download the video :).
At that point you hold the final URL, so you can download it now or later as you wish.
import requests
from bs4 import BeautifulSoup


def main():
    url = "https://www.yourupload.com/download?file=2573285"
    with requests.Session() as req:
        r = req.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        # The token is the last 4 characters of the first quoted string
        # inside the page's third <script> tag.
        token = soup.find_all("script")[2].text.split("'")[1][-4:]
        headers = {
            'Referer': url
        }
        r = req.get(
            f"https://www.yourupload.com/download?file=2573285&sendFile=true&token={token}",
            stream=True, headers=headers)
        print(f"Downloading From {r.url}")
        name = r.headers.get("Content-Disposition").split('"')[1]
        with open(name, 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024 * 1024):
                if chunk:
                    f.write(chunk)
        print(f"File {name} Saved.")


main()
Output:
Downloading From https://s205.vidcache.net:8166/play/a202003090La0xSot1Kl/okanime-2107-HD-19_99?&attach=okanime-2107-HD-19_99.mp4
File okanime-2107-HD-19_99.mp4 Saved.
Confirmation by size: the saved file is the full 250M.
Notice that the download link is callable only once, as each token is validated a single time by the back-end.
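If you need the flow to be reusable, fetch a fresh token immediately before each download attempt. A minimal sketch, assuming the page keeps the layout used above (token in the first quoted string of the third script tag):

import requests
from bs4 import BeautifulSoup

def fresh_download_url(session, file_id):
    # Request the page again so the back-end issues a new, unused token.
    page = session.get(f"https://www.yourupload.com/download?file={file_id}")
    soup = BeautifulSoup(page.text, 'html.parser')
    # Same extraction as above; the page layout is an assumption and may change.
    token = soup.find_all("script")[2].text.split("'")[1][-4:]
    return f"https://www.yourupload.com/download?file={file_id}&sendFile=true&token={token}"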

Related

Can't download csv.gz from website using Python

I'm currently trying to download a csv.gz file from the following link: https://www.cryptoarchive.com.au/bars/pair. Opening the link in a browser simply brings up the save-file dialogue, but passing the link to requests or urllib downloads HTML rather than the actual file.
This is the current approach I'm trying:
EDIT: Updated to reflect changes I've made.
import requests

url = "https://www.cryptoarchive.com.au/bars/pair"
file_name = "test.csv.gz"
headers = {"PLAY_SESSION": play_session}

r = requests.get(url, stream=True, headers=headers)
with open(file_name, "wb") as f:
    for chunk in r.raw.stream(1024, decode_content=False):
        if chunk:
            f.write(chunk)
            f.flush()
The only saved cookie I can find is PLAY_SESSION, and setting it as a header doesn't change the result I'm getting.
Further, I've tried posting a request to the login page like this:
login = "https://www.cryptoarchive.com.au/signup"
data = {"email": email,
        "password": password,
        "accept": "checked"}

with requests.Session() as s:
    p = s.post(login, data=data)
    print(p.text)
However, this doesn't seem to work either, and I can't figure out what exactly to pass to the login page or how to actually check the checkbox...
Just browsing that URL from a private window shows the error:
Please login/register first.
To get that file, you need to log in to the site first. The login will probably give you a session token, a cookie or something similar that you then need to include in the request.
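For instance, a requests.Session() keeps whatever cookie the login sets and re-sends it on later requests, so you never have to copy PLAY_SESSION by hand. A rough sketch, reusing the endpoint and field names from the question's own attempt (both unverified; email and password are your credentials):

import requests

login = "https://www.cryptoarchive.com.au/signup"
url = "https://www.cryptoarchive.com.au/bars/pair"

with requests.Session() as s:
    # Field names are assumptions copied from the question; check the real form.
    s.post(login, data={"email": email, "password": password, "accept": "checked"})
    # The session automatically re-sends the cookie the login set (e.g. PLAY_SESSION).
    r = s.get(url, stream=True)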
Both @Daniel Argüelles' and @Abhyudaya Sharma's answers have helped me. The solution was simply getting the PLAY_SESSION cookie after logging into the website and passing it to the request function.
cookies = {"PLAY_SESSION": play_session}
url = "https://www.cryptoarchive.com.au/bars/pair"

r = requests.get(url, stream=True, cookies=cookies)
with open(file_name, "wb") as f:
    for chunk in r.raw.stream(1024, decode_content=False):
        if chunk:
            f.write(chunk)
            f.flush()

Extract media files from cache using Selenium

I'm trying to download some videos from a website using Selenium.
Unfortunately I can't download them from the source, because the videos are stored in a directory with restricted access; trying to retrieve them using urllib, requests or ffmpeg returns a 403 Forbidden error, even after injecting my user data into the website.
I was thinking of playing each video in its entirety and saving the media file from the cache.
Would that be a possibility? Where can I find the cache folder in a custom profile, and how do I tell the cached files apart?
EDIT: This is what I attempted to do using requests
import requests


def main():
    s = requests.Session()
    login_page = '<<login_page>>'
    login_data = dict()
    login_data['username'] = '<<username>>'
    login_data['password'] = '<<psw>>'
    login_r = s.post(login_page, data=login_data)  # send the login form

    video_src = '<<video_src>>'
    cookies = dict(login_r.cookies)  # contains the session cookie
    # static cookies for every session
    cookies['_fbp'] = 'fb.1.1630500067415.734723547'
    cookies['_ga'] = 'GA1.2.823223936.1630500067'
    cookies['_gat'] = '1'
    cookies['_gid'] = 'GA1.2.1293544716.1631011551'
    cookies['user'] = '66051'
    video_r = s.get(video_src, cookies=cookies)
    print(video_r.status_code)


if __name__ == '__main__':
    main()
The print() function returns:
403
This is the network tab for the video:
Regarding video_r = s.get(video_src, cookies=cookies): have you tried streaming the response? Most websites prevent downloading the file as "one" block, so requesting it with stream=True and writing it chunk by chunk can help.
with open('...', 'wb') as f:
    response = s.get(url=link, stream=True)
    for chunk in response.iter_content(chunk_size=512):
        if chunk:  # filter out keep-alive new chunks
            f.write(chunk)
You can send a HEAD request beforehand if you want: it returns the full content length in the headers, which lets you build a progress bar.
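A rough sketch of that idea, where link stands in for the real video URL (not every server reports Content-Length):

import requests

s = requests.Session()
# HEAD returns only the headers, so asking for the size first is cheap.
head = s.head(link, allow_redirects=True)
total = int(head.headers.get("Content-Length", 0))

done = 0
with open("video.mp4", "wb") as f:
    response = s.get(link, stream=True)
    for chunk in response.iter_content(chunk_size=512):
        if chunk:
            f.write(chunk)
            done += len(chunk)
            if total:
                print(f"\rdownloaded {100 * done / total:.1f}%", end="")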
Also, a 403 is commonly returned by anti-bot systems; your Selenium may be getting detected.
You were blocked because you forgot about headers.
You must use:
s.get('https://httpbin.org/headers', headers={'user-agent': <the user agent value (for example: the last line of your uploaded image)>})
or:
s.headers.update({'user-agent': <the user agent value (for example: the last line of your uploaded image)>})
before sending a request.

Web Scraping with Requests - Python

I have been struggling with the web scraping code below: it returns null records, and printing the output data doesn't show the requested output. The website I am scraping is https://coinmarketcap.com/, and there are several pages (64) that need to be taken into the data frame.
import requests
import pandas as pd
url = "https://api.coinmarketcap.com/data-api/v3/topsearch/rank"
req= requests.post(url)
main_data=req.json()
Can anyone help me sort this out?
Instead of using a POST request, use GET in the request call and it will work!
import requests
res=requests.get("https://api.coinmarketcap.com/data-api/v3/topsearch/rank")
main_data=res.json()
data=main_data['data']['cryptoTopSearchRanks']
With all pages: you can find this URL in the browser's Network tab. Go to XHR and reload, then open the second page; the URL will appear in the XHR tab, and you can copy it and make the call yourself (I have shortened the URL here):
import requests
from bs4 import BeautifulSoup

res = requests.get("https://coinmarketcap.com/")
soup = BeautifulSoup(res.text, "html.parser")
last_page = soup.find_all("p", class_="sc-1eb5slv-0 hykWbK")[-1].get_text().split(" ")[-1]
res = requests.get(f"https://api.coinmarketcap.com/data-api/v3/cryptocurrency/listing?start=1&limit={last_page}&sortBy=market_cap&sortType=desc&convert=USD,BTC,ETH&cryptoType=all&tagType=all&audited=false&aux=ath")
Now use the json method:
data=res.json()['data']['cryptoCurrencyList']
print(len(data))
Output:
6304
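Since the goal is a data frame, the JSON list returned above can be flattened with pandas. A sketch using a fixed limit of 100 for illustration (the exact columns depend on the response shape, so inspect df.columns first):

import requests
import pandas as pd

res = requests.get("https://api.coinmarketcap.com/data-api/v3/cryptocurrency/listing?start=1&limit=100&sortBy=market_cap&sortType=desc&convert=USD&cryptoType=all&tagType=all&audited=false&aux=ath")
rows = res.json()['data']['cryptoCurrencyList']
# json_normalize flattens each nested JSON record into one row of columns.
df = pd.json_normalize(rows)
print(df.shape)
print(df.columns)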
For getting/reading the data you need to use the GET method, not POST:
import requests
import pandas as pd
import json
url = "https://api.coinmarketcap.com/data-api/v3/topsearch/rank"
req = requests.get(url)
main_data = req.json()
print(main_data) # without pretty printing
pretty_json = json.loads(req.text)
print(json.dumps(pretty_json, indent=4)) # with pretty print
Their terms of use prohibit web scraping. The site provides a well-documented API that has a free tier. Register to get an API token:
from requests import Session

url = 'https://pro-api.coinmarketcap.com/v1/cryptocurrency/listings/latest'
parameters = {
    'start': '1',
    'limit': '5000',
    'convert': 'USD'
}
headers = {
    'Accepts': 'application/json',
    'X-CMC_PRO_API_KEY': HIDDEN_TOKEN,  # replace that with your API key
}

session = Session()
session.headers.update(headers)

response = session.get(url, params=parameters)
data = response.json()
print(data)

python-requests does not grab JSESSIONID and SessionData cookies

I want to scrape a PDF file from http://www.jstor.org/stable/pdf/10.1086/512825.pdf, but it wants me to accept Terms and Conditions first. While downloading from the browser I found out that JSTOR saves my acceptance in two cookies named JSESSIONID and SessionData, but python-requests does not grab these two cookies (it grabs two other cookies, but not these).
Here is my session instantiation code:
import requests
from fake_useragent import UserAgent  # provides UserAgent().random


def get_raw_session():
    session = requests.Session()
    session.headers.update({'User-Agent': UserAgent().random})
    session.headers.update({'Connection': 'keep-alive'})
    return session
Note that I have used python-requests on login-required sites several times before and it worked great, but not in this case.
My guess is that the problem is that JSTOR is built with JSP and python-requests does not support that.
Any ideas?
The following code works perfectly fine for me:
import requests
from bs4 import BeautifulSoup

s = requests.session()
r = s.get('http://www.jstor.org/stable/pdf/10.1086/512825.pdf')
soup = BeautifulSoup(r.content, 'html.parser')
# The accept-terms link (id 'acptTC') points at the actual PDF URL.
pdfurl = 'http://www.jstor.org' + soup.find('a', id='acptTC')['href']

with open('export.pdf', 'wb') as handle:
    response = s.get(pdfurl, stream=True)
    for block in response.iter_content(1024):
        if not block:
            break
        handle.write(block)
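To see why the question's session never picked up JSESSIONID and SessionData, it can help to print what the session actually holds after the first GET; a small check, reusing get_raw_session() from the question (the cookies may only be set after following the accept-terms link):

session = get_raw_session()
r = session.get('http://www.jstor.org/stable/pdf/10.1086/512825.pdf')
# Cookies the server has set on this session so far.
print(session.cookies.get_dict())
# Any redirects in between; cookies are often set on these intermediate responses.
print([resp.url for resp in r.history])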

Log in to a website and download a file with Python requests

I have a website with an HTML form. After logging in, it takes me to a start.php page and then redirects me to overview.php.
I want to download files from that server... When I click on the download link of a ZIP file, the address behind the link is:
getimage.php?path="vol/img"&id="4312432"
How can I do that with requests? I tried to create a session and issue the GET command with the right params... but the answer is just the website I would see when I'm not logged in.
c = requests.Session()
c.auth = ('myusername', 'myPass')
request1 = c.get(myUrlToStart.PHP)
tex = request1.text

with open('data.zip', 'wb') as handle:
    request2 = c.get(urlToGetImage.Php, params=payload2, stream=True)
    print(request2.headers)
    for block in request2.iter_content(1024):
        if not block:
            break
        handle.write(block)
What you're doing is a request with basic authentication, which does not fill out the form that is displayed on the page.
If you know the URL that your form sends its POST request to, you can try sending the form data directly to that URL.
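If you don't know that URL, you can read it out of the page itself: the form's action attribute is where the browser would POST, and its input names are the field names the server expects. A sketch (the login page URL is a placeholder):

import requests
from bs4 import BeautifulSoup

with requests.Session() as s:
    page = s.get('https://example.com/login.php')  # placeholder login page
    form = BeautifulSoup(page.text, 'html.parser').find('form')
    print(form.get('action'))                                        # the POST target
    print([field.get('name') for field in form.find_all('input')])   # the field names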
Those who are looking for the same thing could try this:
import requests

site_url = 'site_url_here'
userid = 'userid'
password = 'password'
file_url = 'getimage.php?path="vol/img"&id="4312432"'
o_file = 'abc.zip'

# create session
s = requests.Session()
# GET request. This will generate the session cookie for you
s.get(site_url)
# login to the site
s.post(site_url, data={'_username': userid, '_password': password})
# next, visit the URL for the file you would like to download
r = s.get(file_url)
# download the file
with open(o_file, 'wb') as output:
    output.write(r.content)
print(f"requests:: File {o_file} downloaded successfully!")
# close the session once all work is done
s.close()
