My requests.get call returns an error in Python - python

Python error when using requests.get
Hello, I have this in my code:
from bs4 import BeautifulSoup
import requests
url = 'https://www3.animeflv.net/anime/sailor-moon'
r = requests.get(url)
And printing r gives me this:
<Response [403]>
What could be the solution?
By the way, the title is odd because Stack Overflow would not accept the wording I wanted :(

For your specific case you can overcome that by faking your User-Agent in request headers.
import requests
url = 'https://www3.animeflv.net/anime/sailor-moon'
headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'}
res = requests.get(url, headers=headers)
print(res.status_code)
200
Some websites try to block requests made with the Python requests library. By default, a request sent from a Python script carries a User-Agent of the form python-requests/<version>, which is easy for a server to detect. Overriding it through the headers argument is often enough to get past that check. Take a look at the fake-useragent library (https://pypi.org/project/fake-useragent/) for generating User-Agent strings.
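If you want to see the default for yourself, or set the browser-like value once instead of on every call, something along these lines should work. It is only a sketch: BROWSER_UA is just the string from the snippet above, and nothing is sent over the network here.

```python
import requests

BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
)

# The User-Agent that requests sends when you do not set one yourself.
default_ua = requests.utils.default_user_agent()
print(default_ua)  # python-requests/<installed version>

# A Session applies its headers to every request made through it,
# so the override only has to be written once.
session = requests.Session()
session.headers.update({"User-Agent": BROWSER_UA})
print(session.headers["User-Agent"])
```

After that, session.get(url) would be sent with the browser-like User-Agent.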

Related

Not getting any HTML Response Codes

I am new to scraping and am trying to scrape some information off a website with Python, but when checking the HTTP response code (e.g. 200) I am not getting any result back in the terminal. Below is my code. I appreciate any help! Edit: I have fixed my rookie mistake in the print statement below xD Thank you for the correction!
import requests
url = "https://www.sephora.ae/en/shop/makeup-c302/"
page = requests.get(url)
print(page.status_code)
The problem is that the page you are trying to scrape protects against scraping by ignoring requests from unusual user agents.
Set the user agent to some well-known string, like below:
import requests
url = "https://www.sephora.ae/en/shop/makeup-c302/"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36'
}
response = requests.get(url, headers=headers)
print(response.status_code)
For one thing, you don't print to the console in Python with the syntax Print = (page). That code assigns the response to a new variable called Print and prints nothing. Python is case-sensitive, so Print is unrelated to the built-in print function. In order to output to the console, change your code to:
print(page)
Second, printing page is just printing the response object you are receiving after making your GET request, which is not very helpful. The response object has a number of properties you can access, which you can read about in the documentation for the requests Python library.
To get the status code of your response, try:
print(page.status_code)
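If the bare number is not descriptive enough, the standard library's http.HTTPStatus can turn it into a readable phrase. A small sketch; describe_status is a hypothetical helper name, not something requests provides:

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return e.g. '403 Forbidden' for a known HTTP status code."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not know.
        return f"{code} (unknown status)"


print(describe_status(200))  # 200 OK
print(describe_status(403))  # 403 Forbidden
```

With the code from the question you would call it as print(describe_status(page.status_code)).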

How to bypass Mod_Security while scraping

I tried running this Python script using the BeautifulSoup and requests modules:
from bs4 import BeautifulSoup as bs
import requests
url = 'https://udemyfreecourses.org/'
headers = {
    'UserAgent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.193 Safari/537.36'
}
soup = bs(requests.get(url, headers=headers).text, 'lxml')
But when I run this line:
print(soup.get_text())
it doesn't scrape the text data. Instead, it returns this output:
Not Acceptable!Not Acceptable!An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security.
I even used headers when requesting the webpage so that the request looks like it comes from a normal browser, but I'm still getting this message, which prevents me from accessing the real webpage.
Note: The webpage works perfectly when opened directly in the browser, but it doesn't show much info when I try to scrape it.
Is there any way other than the headers I used to send a valid request to the website and bypass this security layer called Mod_Security?
Any help would be appreciated. Thanks.
EDIT: The dash in "User-Agent" is essential. With the key 'UserAgent', requests sends an extra header of that name and keeps its default User-Agent.
Following this answer: https://stackoverflow.com/a/61968635/8106583
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) Gecko/20100101 Firefox/55.0',
}
Your User-Agent is the problem. This User-Agent works for me.
Also: your IP might be blocked by now :D
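To check which headers would actually go out, you can prepare the request through a Session without sending it. This is a rough sketch, and prepared_headers is a hypothetical helper name:

```python
import requests

URL = "https://udemyfreecourses.org/"
UA = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) Gecko/20100101 Firefox/55.0"

session = requests.Session()


def prepared_headers(headers):
    """Return the headers requests would send, without sending anything."""
    request = requests.Request("GET", URL, headers=headers)
    return session.prepare_request(request).headers


wrong = prepared_headers({"UserAgent": UA})   # key used in the question
right = prepared_headers({"User-Agent": UA})  # key with the dash

print(wrong["User-Agent"])  # still python-requests/<installed version>
print(right["User-Agent"])  # the Firefox string
```

The misspelled key is simply sent as an additional header, so the server still sees the library's default User-Agent.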

ValueError while scraping instagram with python

Hello, I am trying to scrape this URL: https://www.instagram.com/cristiano/?__a=1 but I get a ValueError.
import json
from bs4 import BeautifulSoup
from requests import get
url_user = "https://www.instagram.com/cristiano/?__a=1"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
response = get(url_user, headers=headers)
print(response)  # <Response [200]>
html_soup = BeautifulSoup(response.content, 'html.parser')
# print(html_soup)
jsondata = json.loads(str(html_soup))
ValueError: No JSON object could be decoded
Any idea why I get this error?
The reason you're getting the error is because you're trying to parse a JSON response as if it was HTML. You don't need BeautifulSoup for that.
Try this:
import json
import requests
url_user = "https://www.instagram.com/cristiano/?__a=1"
d = json.loads(requests.get(url_user).text)
print(d)
However, it is better practice to use .json() from requests, as it'll do a better job of figuring out the encoding used.
import requests
url_user = "https://www.instagram.com/cristiano/?__a=1"
d = requests.get(url_user).json()
print(d)
You might be getting a non-200 HTTP status code, which means that the server responded with an error; for example, the server might have banned your IP for sending requests too frequently. The requests library doesn't raise an exception for that on its own. To catch error status codes, insert this line after the get(...) call:
response.raise_for_status()
Also, it is enough to do jsondata = response.json(). The requests library can parse JSON this way, with no need for Beautiful Soup. An easy-to-read tutorial about the main features of the requests library is located here.
Also, if there is some parsing problem, save the binary content of the response to a file so you can attach it to the question, like this:
with open('response.dat', 'wb') as f:
    f.write(response.content)
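The ValueError from the question is what json.loads raises whenever the text is not valid JSON, for instance if the server answers with an HTML page instead (an assumption here, since the actual body is not shown). A minimal sketch of guarding against that; parse_json_or_none is a hypothetical helper and the sample strings are made up:

```python
import json


def parse_json_or_none(text: str):
    """Return the decoded JSON value, or None if text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # JSONDecodeError is a subclass of ValueError.
        return None


print(parse_json_or_none('{"user": "cristiano"}'))            # {'user': 'cristiano'}
print(parse_json_or_none("<html><body>Login</body></html>"))  # None
```

You would pass response.text to it and inspect the saved response file whenever it returns None.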

Getting response code 403 on API data in requests python

The following Python script gives me a 403 error; the request type is GET.
import requests
import json
url ='https://footballapi.pulselive.com/football/players?pageSize=30&compSeasons=210&altIds=true&page=2&type=player&id=-1&compSeasonId=210'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36'}
result = requests.get(url, headers=headers)
print(result.status_code)
Your code looks fine. I ran it and got the same 403 response. But if you open the url you posted, you'll notice a 403 error there as well. This looks like an issue with the website itself or maybe you are using an incorrect url.
This might be a late answer, but what you're missing is the correct header to access the Pulselive API. The necessary header is 'Origin': 'https://www.premierleague.com'.
This makes the API treat the request as coming from the official Premier League website, which is allowed to access it.
Hope this helps!
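Put together with the code from the question, that could look roughly like this. It is a sketch only: fetch_players is a hypothetical function name, the timeout value is arbitrary, and whether the API still accepts this header is not something I can promise.

```python
import requests

URL = (
    "https://footballapi.pulselive.com/football/players"
    "?pageSize=30&compSeasons=210&altIds=true&page=2&type=player&id=-1&compSeasonId=210"
)
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36",
    # Origin value taken from the answer above.
    "Origin": "https://www.premierleague.com",
}


def fetch_players():
    """Send the request and return the decoded JSON body (needs network access)."""
    response = requests.get(URL, headers=HEADERS, timeout=10)
    response.raise_for_status()  # turn a 403 into an exception instead of silent failure
    return response.json()


# Prepare the request without sending it, just to inspect the headers.
prepared = requests.Request("GET", URL, headers=HEADERS).prepare()
print(prepared.headers["Origin"])
```

Calling fetch_players() performs the real request.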

Download ZIP file from the web (Python)

I am trying to download a ZIP file from this website. I have looked at other questions like this and tried using both requests and urllib, but I get the same error:
urllib.error.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop. The last 30x error message was: Found
Any ideas on how to open the file straight from the web?
Here is some sample code
from urllib.request import urlopen
response = urlopen('http://www1.caixa.gov.br/loterias/_arquivos/loterias/D_megase.zip')
The linked URL redirects indefinitely, which is why you get the 302 error.
You can examine this yourself over here. As you can see, the linked URL immediately redirects to itself, creating a single-URL loop.
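A quick way to test for that kind of loop yourself is to compare the Location header of the first response with the URL you requested. A sketch using only the standard library; is_self_redirect is a hypothetical helper, and the relative paths in the demo are made up:

```python
from urllib.parse import urljoin


def is_self_redirect(url: str, location: str) -> bool:
    """True if a Location header value resolves back to the requested URL."""
    # Location may be relative, so resolve it against the original URL first.
    return urljoin(url, location) == url


url = "http://www1.caixa.gov.br/loterias/_arquivos/loterias/D_megase.zip"
print(is_self_redirect(url, url))                    # True
print(is_self_redirect(url, "D_megase.zip"))         # True
print(is_self_redirect(url, "/loterias/other.zip"))  # False
```

With the requests library, the header value can be read via requests.get(url, allow_redirects=False).headers.get('Location').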
Works for me using the Requests library:
import io
import zipfile
import requests
url = 'http://www1.caixa.gov.br/loterias/_arquivos/loterias/D_megase.zip'
response = requests.get(url)
# Unzip it into a local directory if you want
# (named archive rather than zip to avoid shadowing the built-in zip function)
archive = zipfile.ZipFile(io.BytesIO(response.content))
archive.extractall("/path/to/your/directory")
Note that sometimes trying to access web pages programmatically leads to 302 responses because they only want you to access the page via a web browser.
If you need to fake this (don't be abusive), just set the 'User-Agent' header to be like a browser. Here's an example of making a request look like it's coming from a Chrome browser.
user_agent = 'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36'
headers = {'User-Agent': user_agent}
requests.get(url, headers=headers)
There are several libraries (e.g. https://pypi.org/project/fake-useragent/) to help with this for more extensive scraping projects.
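Since a blocked request may come back as an HTML page rather than the archive, it can help to check the downloaded bytes before extracting them. A minimal sketch of the unzip step with that check; extract_if_zip is a hypothetical helper, and the demo builds a small in-memory archive that stands in for response.content:

```python
import io
import tempfile
import zipfile


def extract_if_zip(data: bytes, target_dir: str) -> list:
    """Extract data into target_dir if it is a ZIP archive; return member names."""
    buffer = io.BytesIO(data)
    if not zipfile.is_zipfile(buffer):
        raise ValueError("response body is not a ZIP archive")
    with zipfile.ZipFile(buffer) as archive:
        archive.extractall(target_dir)
        return archive.namelist()


# Demo: an in-memory archive stands in for response.content.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("hello.txt", "hi")

with tempfile.TemporaryDirectory() as tmp:
    print(extract_if_zip(demo.getvalue(), tmp))  # ['hello.txt']
```

In the download code above you would call extract_if_zip(response.content, "/path/to/your/directory").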
