I am new to the whole scraping thing and am trying to scrape some information off a website with Python, but when checking for the HTML response (i.e. 200) I am not getting any results back in the terminal. Below is my code. I appreciate all sorts of help! Edit: I have fixed my rookie mistake in the print section below xD thank you guys for the correction!
import requests
url = "https://www.sephora.ae/en/shop/makeup-c302/"
page = requests.get(url)
print(page.status_code)
The problem is that the page you are trying to scrape protects against scraping by ignoring requests from unusual user agents.
Set the User-Agent to a well-known browser string, like the one below:
import requests
url = "https://www.sephora.ae/en/shop/makeup-c302/"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36'
}
response = requests.get(url, headers=headers)
print(response.status_code)
For one thing, you don't print to the console in Python with the syntax Print = (page). That code assigns the page variable to a new variable called Print, which is probably not a good idea since print is the name of Python's built-in output function. To output to the console, change your code to:
print(page)
Second, printing page is just printing the response object you are receiving after making your GET request, which is not very helpful. The response object has a number of properties you can access, which you can read about in the documentation for the requests Python library.
To get the status code of your response, try:
print(page.status_code)
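Beyond the status code, the response object exposes a few other commonly used attributes. A quick illustration (these are standard requests attributes, nothing specific to this site):

print(page.headers.get('Content-Type'))  # response headers behave like a dict
print(page.encoding)                     # encoding requests guessed for the body
print(page.text[:200])                   # first 200 characters of the body as text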
Python error when using request get
Hello guys, I have this in my code:
import requests
from bs4 import BeautifulSoup
r = requests.get(url)
And I'm getting this:
<Response [403]>
What could be the solution?
The url is 'https://www3.animeflv.net/anime/sailor-moon'
Btw the title is weird because I don't know why Stack Overflow doesn't let me put it the way I want :(
For your specific case you can overcome that by faking your User-Agent in request headers.
import requests
url = 'https://www3.animeflv.net/anime/sailor-moon'
headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'}
res = requests.get(url, headers=headers)
print(res)
<Response [200]>
Some websites try to block requests made with the Python requests library. By default, when you make a request from a Python script, your User-Agent is something like python-requests/2.x, but if you fake it by manipulating the headers you can easily bypass that. Take a look at the fake-useragent library (https://pypi.org/project/fake-useragent/) for generating fake User-Agent strings.
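For example, a minimal sketch with fake-useragent (assuming the package is installed via pip install fake-useragent):

import requests
from fake_useragent import UserAgent

ua = UserAgent()
headers = {'User-Agent': ua.random}  # a randomly picked real-browser User-Agent string
res = requests.get('https://www3.animeflv.net/anime/sailor-moon', headers=headers)
print(res.status_code)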
I am currently trying to build a web scraping program to pull data from a real estate website using Beautiful Soup. I haven't gotten very far, but the code is as follows:
import requests
from bs4 import BeautifulSoup
r = requests.get("http://pyclass.com/real-estate/rock-springs-wy/LCWYROCKSPRINGS/")
c = r.content
soup = BeautifulSoup(c, "html.parser")
print(soup)
When I try to print the data to at least see if the program is working, I get an error message saying "Not Acceptable! An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security." How do I get the server to stop blocking my IP address? I've read about similar issues with other programs and tried clearing cookies, trying different browsers, etc., and nothing has fixed it.
This is happening because the webpage thinks that you're a bot (and it is correct), so your request gets blocked.
To "bypass" this issue, try adding the user-agent to the headers parameter in the requests.get() method.
import requests
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
}
url = "http://pyclass.com/real-estate/rock-springs-wy/LCWYROCKSPRINGS/"
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
print(soup.prettify())
Hello, I am trying to scrape this URL: https://www.instagram.com/cristiano/?__a=1 but I get a ValueError.
import json
from bs4 import BeautifulSoup
from requests import get

url_user = "https://www.instagram.com/cristiano/?__a=1"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
response = get(url_user, headers=headers)
print(response) # 200
html_soup = BeautifulSoup(response.content, 'html.parser')
# print(html_soup)
jsondata = json.loads(str(html_soup))
ValueError: No JSON object could be decoded
Any idea why I get this error?
The reason you're getting the error is that you're trying to parse a JSON response as if it were HTML. You don't need BeautifulSoup for that.
Try this:
import json
import requests
url_user = "https://www.instagram.com/cristiano/?__a=1"
d = json.loads(requests.get(url_user).text)
print(d)
However, best practice is to use .json() from requests, as it'll do a better job of figuring out the encoding used.
import requests
url_user = "https://www.instagram.com/cristiano/?__a=1"
d = requests.get(url_user).json()
print(d)
You might be getting a non-200 HTTP status code, which means the server responded with an error, e.g. the server might have banned your IP for making too many requests. The requests library doesn't throw any errors for that on its own. To catch erroneous status codes, insert this line after the get(...) call:
response.raise_for_status()
Also, it is enough just to do jsondata = response.json(). The requests library can parse JSON this way without any need for BeautifulSoup. An easy-to-read tutorial covering the main requests features is located here.
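Putting those two suggestions together, a minimal sketch could look like this:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
response = requests.get('https://www.instagram.com/cristiano/?__a=1', headers=headers)
response.raise_for_status()   # raises requests.HTTPError for any 4xx/5xx status
jsondata = response.json()    # parses the JSON body directly, no BeautifulSoup needed
print(jsondata)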
Also, if there is some parsing problem, save the binary content of the response to a file so you can attach it to the question, like this:
with open('response.dat', 'wb') as f:
    f.write(response.content)
I'm a Korean who just started learning Python.
First, I apologize for my English.
I learned how to use BeautifulSoup on YouTube, and on certain sites crawling was successful.
However, I found that crawling did not go well on certain other sites, and after searching I learned that I had to set the user-agent.
So I used 'requests' to write code that sets the user-agent. Then I used the same code as before to read a particular class from the HTML, but it did not work.
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
url = 'https://store.leagueoflegends.co.kr/skins'
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
for skin in soup.select(".item-name"):
    print(skin)
Here's my code. I have no idea what the problem is.
Please help me.
Your problem is that requests does not render JavaScript; it only gives you the "initial" source code of the page. What you should use is a package called Selenium. It lets you control your browser (Chrome, Firefox, etc.) from Python; the website won't be able to tell the difference, and you won't need to mess with headers and user-agents. There are plenty of videos on YouTube on how to use it.
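A rough sketch of what that could look like (assuming chromedriver is available on your PATH; the .item-name selector is taken from your code):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()   # opens a real Chrome window controlled from Python
driver.get('https://store.leagueoflegends.co.kr/skins')
driver.implicitly_wait(10)    # give the JavaScript some time to render the page

for skin in driver.find_elements(By.CSS_SELECTOR, '.item-name'):
    print(skin.text)

driver.quit()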
I've had some success using POST requests on other sites in the past and receiving data from them, but for some reason I'm having difficulty with the Metacritic site.
Using Chrome and the developer tools, I can see that when I begin to type in the search bar, it starts a POST request to the following URL.
searchURL = 'http://www.metacritic.com/g00/3_c-6bbb.rjyfhwnynh.htr_/c-6RTWJUMJZX77x24myyux3ax2fx2fbbb.rjyfhwnynh.htrx2ffzytx78jfwhmx3fn65h.rfwpx3dcmw_$/$'
I also know that my headers need to be the following in order to get a response
headers = {'User-Agent' : "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36"}
When I run this, I get a status code of 200, which indicates it worked, but my response text is not what I expected. I am receiving the content of the entire page when I'm expecting JSON of the search results. What am I missing here?
title = 'Grand Theft Auto'
#search request using POST
r = requests.post(searchURL, data={'searchTerm': title}, headers=headers)
print(r.status_code)
print(r.text)
You can see in the images below what I'm expecting to get (screenshots of the request headers and the expected response).
Not sure about the difference - maybe GDPR-related since I live in Europe, or because I have set DNT (Do Not Track) to true in Chrome - but for me, Metacritic autocomplete requests simply POST to http://www.metacritic.com/autosearch, with the parameter search_term set to the search value and search_filter set to all.
From your screenshots, I think the URL for autocomplete in your browser is constructed with your session id, maybe to prevent exactly the kind of thing you intend to do :)
So in your case I would try the following, in order (a rough sketch follows below):
post to the /autosearch URL, and if that doesn't work,
figure out the session-id-to-URL logic, then make an initial request in your code to get a session id and work with that
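A hedged sketch of the first option, reusing the search_term / search_filter parameters mentioned above and a requests.Session so that any cookies Metacritic sets are carried along (whether the endpoint and parameters still behave this way is an assumption on my side):

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36'}

with requests.Session() as s:
    s.headers.update(headers)
    s.get('http://www.metacritic.com/')   # initial request so the session picks up cookies / a session id
    r = s.post('http://www.metacritic.com/autosearch',
               data={'search_term': 'Grand Theft Auto', 'search_filter': 'all'})
    print(r.status_code)
    print(r.text)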