Any idea how to get to this URL with Scrapy? - python

The URL is "https://sb-content.pa.caesarsonline.com/content-service/api/v1/q/time-band-event-list?".
I think it uses Cloudflare, which is why I am having difficulty, but I am not sure that is the only issue. I don't necessarily need a solution in Scrapy; I have also played around with cfscrape and can't get any response other than a 403.

You are correct in assuming that this is Cloudflare blocking automated requests; the response contains:
<title>Access denied | sb-content.pa.caesarsonline.com used Cloudflare to restrict access</title>
You can use the library "cloudscraper" to try to bypass this, but as Cloudflare changes its detection methods periodically, you may run into trouble again until the library is updated.
Cloud Scraper Library: https://pypi.org/project/cloudscraper/
Example:
import cloudscraper

# create_scraper() returns a requests.Session-like object that solves
# Cloudflare's JavaScript challenge before handing the response back.
scraper = cloudscraper.create_scraper()
response = scraper.get("https://sb-content.pa.caesarsonline.com/content-service/api/v1/q/time-band-event-list?").text
print(response)
Output:
{"data":{"timeBandEvents":[{"type":"LIVE","date":null,"competitionSummary":[],"events":[],"outrights":[]},{"type":"NEXT_TO_GO"........

Related

Why don't I get a response from my request?

I'm trying to make one simple request:
from fake_useragent import UserAgent
import requests
ua = UserAgent()
req = requests.get('https://www.casasbahia.com.br/', headers={'User-Agent': ua.random})
I would understand if I received <Response [403]> or something like that, but instead I receive nothing; the code keeps running with no response.
Using logging, I see:
I know I could use a timeout to stop the code from hanging, but I just want to understand why I don't get a response.
Thanks in advance.
I have never used this API before, but from what I researched just now, some sites block requests that come from fake user agents.
So, to reproduce this example on my PC, I installed the fake_useragent and requests modules on Python 3.10 and tried to execute your script. It turns out that with my authentic User-Agent string the request succeeds: when printed on the console, req.text shows the entire HTML file received from the request.
But if I try again with a fake user agent, using ua.random, it fails. The site was probably built to detect and reject requests from fake agents (or bots).
Then again, this is just a theory; I have no way to access the site's server code to confirm it.
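For reference, a minimal sketch of the working variant: the same request, but with a hard-coded, realistic browser User-Agent (the exact string below is only an example; copying the one your own browser sends is safest):
import requests

# An example of a realistic desktop browser User-Agent string; nothing about
# this particular value is special, it just looks like a real browser.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0.0.0 Safari/537.36"
}

# The timeout also avoids the endless hang described in the question.
req = requests.get('https://www.casasbahia.com.br/', headers=headers, timeout=10)
print(req.status_code)
print(req.text[:200])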

Cannot scrape using Python requests but works when loading in a browser

I want to scrape data from this page: https://raritysniffer.com/viewcollection/primeapeplanet
The API request works in the browser but returns a 403 error when I use Python requests.
requests.get("https://raritysniffer.com/api/index.php?query=fetch&collection=0x6632a9d63e142f17a668064d41a21193b49b41a0&taskId=any&norm=true&partial=true&traitCount=true")
I understand it is possible that I have to pass specific headers to make it work, but as a Python novice, I have no idea how. Please advise. Thanks!
If you check the response, you can see that the website uses Cloudflare, which is indeed what returns the 403. To bypass this, try cloudscraper (but be mindful that Cloudflare updates its detection methods periodically, so this can break).
import cloudscraper

url = 'https://raritysniffer.com/api/index.php?query=fetch&collection=0x6632a9d63e142f17a668064d41a21193b49b41a0&taskId=any&norm=true&partial=true&traitCount=true'
scraper = cloudscraper.create_scraper(browser='firefox')
print(scraper.get(url).text)
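For what it's worth, the browser='firefox' argument only tells cloudscraper which family of browser User-Agent strings to emulate (browser='chrome' works the same way). And if the endpoint returns JSON, scraper.get(url).json() may be more convenient than .text once the 403 is gone.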

urllib.error.HTTPError: HTTP Error 413: Payload Too Large

I am scraping a variety of pages (the_url) within a large website using the following code:
import urllib.request

opener = urllib.request.build_opener()
url = opener.open(the_url)
contents_of_webpage = url.read()
url.close()
contents_of_webpage = contents_of_webpage.decode("utf-8")
This works fine for almost every page but occasionally I get:
urllib.error.HTTPError: HTTP Error 413: Payload Too Large
Looking for solutions, I keep coming up against answers of the form "well, a web server may choose to give this as a response...", as if there were nothing to be done. But all of my browsers can read the page without problems, and presumably they are making the same kind of request, so surely some kind of solution exists. For example, can you ask for a web page a little bit at a time to avoid a large payload?
It depends heavily on the site and the URL you're requesting. To avoid the problem, most sites/APIs offer pagination on their endpoints. Check whether the endpoint you're requesting accepts GET parameters like ?offset=<int>&limit=<int> or something similar.
UPD: besides that, urllib is not very good at emulating browser behavior, so you could try making the same request with requests, or at least set the User-Agent header to the one your browser sends.
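A minimal sketch combining both suggestions; the offset/limit parameter names are an assumption, so check what the endpoint actually accepts:
import requests

# Example browser User-Agent string; the offset/limit parameters are a guess
# at the endpoint's pagination scheme and may need to be adapted or removed.
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) "
                         "Gecko/20100101 Firefox/115.0"}

the_url = "https://example.com/some/endpoint"  # placeholder for your URL
response = requests.get(the_url, headers=headers,
                        params={"offset": 0, "limit": 100})
response.raise_for_status()  # raises on 4xx/5xx, including the 413
contents_of_webpage = response.text  # requests decodes the body for you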

Downloading torrent file using get request (.torrent)

I am trying to download a torrent file with this code:
url = "https://itorrents.org/torrent/0BB4C10F777A15409A351E58F6BF37E8FFF53CDB.torrent"
r = requests.get(url, allow_redirects=True)
open('test123.torrent', 'wb').write(r.content)
It downloads a torrent file, but when I load it into BitTorrent an error occurs: it says Unable to Load, Torrent Is Not Valid Bencoding.
Can anybody please help me resolve this problem? Thanks in advance.
This page uses Cloudflare to prevent scraping, and I am sorry to say that bypassing Cloudflare with requests alone is very hard, because the measures Cloudflare takes are updated frequently. The page checks whether your browser supports JavaScript; if it doesn't, Cloudflare won't give you the bytes of the file. That's why your download is broken. (You can use r.text to see the response content: it is an HTML page, not a file.)
Under these circumstances, I think you should consider using Selenium.
Bypassing Cloudflare can be a pain, so I suggest using a library that handles it. Keep in mind that your code may still break in the future, because Cloudflare changes its techniques periodically; when that happens, you will just need to update the library (at least, you should hope so).
I have used a similar library, though only in NodeJS, but I see Python has something like that too: cloudscraper.
Example:
import cloudscraper
scraper = cloudscraper.create_scraper() # returns a CloudScraper instance
# Or: scraper = cloudscraper.CloudScraper() # CloudScraper inherits from requests.Session
print(scraper.get("http://somesite.com").text)  # => "<!DOCTYPE html><html><head>..."
Depending on your usage, you may need to consider using proxies: Cloudflare can still block you if you send too many requests.
Also, if you are working with video torrents, you may be interested in Torrent Stream Server. It's a server that downloads and streams video at the same time, so you can watch the video without fully downloading it.
You can also get through by adding the cookies from a real browser session to your headers, but those cookies expire after some time. Once they do, the only option left is to download the file from an actual browser again.
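A minimal sketch of that cookie approach, with placeholder values you would copy out of your browser's dev tools (cf_clearance is the cookie Cloudflare sets after its challenge, and it is tied to the browser's User-Agent, so copy both from the same session):
import requests

url = "https://itorrents.org/torrent/0BB4C10F777A15409A351E58F6BF37E8FFF53CDB.torrent"

# Placeholder values: copy cf_clearance and the exact User-Agent from the
# browser session in which the cookie was issued; they expire eventually.
cookies = {"cf_clearance": "<value copied from your browser>"}
headers = {"User-Agent": "<the User-Agent of that same browser>"}

r = requests.get(url, cookies=cookies, headers=headers, allow_redirects=True)

# A valid .torrent file is bencoded and always starts with b"d" (a dictionary);
# if we received HTML, the Cloudflare challenge page was served instead.
if r.content[:1] == b"d":
    with open('test123.torrent', 'wb') as f:
        f.write(r.content)
else:
    print("Blocked - got HTML, not a torrent:", r.text[:80])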

Extract HTML-Content from URL of Site that probably uses Cookies via Python

I recently wanted to extract data from a website that seems to use cookies to grant access. I do not know very much about these procedures, but apparently this interferes with my method of getting the HTML content of the website via Python and its requests module.
The code I am running to extract the information contains the following lines:
import requests
#...
response = requests.get(url, proxies=proxies)
content = response.text
where url is the website I am referring to, http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber=6675630&tag=1, and proxies is a valid dict of my proxy servers (I tested those settings on websites that seemed to work fine). However, instead of the content of the article on this site, I receive the HTML content of the page you get when you do not accept cookies in your browser.
As I am not really aware of what the website is doing and lack real web development experience, I have not found a solution so far, even though a similar question may have been asked before. Is there any way to access the content of this website via Python?
import requests

startr = requests.get('https://viennaairport.com/login/')  # first request collects the cookies the site sets
secondr = requests.post('http://xxx/', cookies=startr.cookies)  # send them back on the follow-up request
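The snippet above shows the idea (fetch the cookies first, then send them back), but a requests.Session does that bookkeeping automatically. A sketch adapted to the question's URL and proxies, untested against the live site:
import requests

url = "http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber=6675630&tag=1"
proxies = {}  # your proxy dict from the question

with requests.Session() as s:
    s.get(url, proxies=proxies)             # first visit: the site sets its cookies on the session
    response = s.get(url, proxies=proxies)  # second visit: the cookies are sent back automatically
    print(response.text[:200])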
