I've had some success using the POST requests in the past on other sites and receiving data from them but for some reason I'm having difficulty with the metacritic site.
Using chrome and the developer tools, I can see that when I begin to type in the search bar, it starts a POST request to the following url.
searchURL = 'http://www.metacritic.com/g00/3_c-6bbb.rjyfhwnynh.htr_/c-6RTWJUMJZX77x24myyux3ax2fx2fbbb.rjyfhwnynh.htrx2ffzytx78jfwhmx3fn65h.rfwpx3dcmw_$/$'
I also know that my headers need to be the following in order to get a response
headers = {'User-Agent' : "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36"}
When I run this, I get a status code of 200 which indicates it worked but my response text is not what I expected. I am receiving the content of the entire page when I'm expecting json of search results. What am I missing here?
title = 'Grand Theft Auto'
#search request using POST
r = requests.post(searchURL, data = {'searchTerm' : title}, headers = headers)
print(r.status_code)
print(r.text)
You can see in the images below what I'm expecting to get.
Headers
Response
Not sure about the difference - maybe GDPR-related since i live in Europe, or because i have set DNT (Do not track) to true in Chrome - but for me, Metacritic autocomplete requests post simply to http://www.metacritic.com/autosearch with the parameters search_term set to the search value and search_filter set to all :
From your screenshots, i think the URL for autocomplete in your browser is constructed with your session id, maybe to avoid stuff like you intend to do :)
So in your case i would try in following order:
post to the /autosearch URL and if that doesn't work
figure out the session-id to URL-writing logic, then make an initial request in the code to get a session id and work with that
Related
I am trying to scrape data that generates a chart on a website using python's request module.
My code currently looks like this:
# load modules
import os
import json
import requests as r
# url to send the call to
postURL = <insert website>
# utiliz get to pull cookie data
cookie_intel = r.get(postURL, verify = False)
# get cookies
search_cookies = cookie_intel.cookies
#### Request Information ####
# API request data
post_data = <insert request json>
# header information
headers = {"user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"}
# results
results_post = r.post(postURL, data = post_data, cookies = search_cookies, headers = headers, verify = False)
# result
print(results_post.json())
As a quick summary, I first loaded the site to then inspect it, from there I identified the url for the request in the network tab and then checked the required request data in the payload tab. Then I took the user-agent from the request headers tab.
The request itself works, however, it is always empty. I have tried altering all sorts of inputs but without success. I would highly appreciate any sort of tips that would help me to solve this issue. Thank you in advance!
in this case you have to use json= instead of data= when making the post request according to the requests documentation . By replacing this part of your code you should get the expected response.
results_post = r.post(postURL, json = post_data, cookies = search_cookies, headers = headers, verify = False)
You can also try other scraping tools like Scrapy to crawl these data and maybe running the crawler on the cloud using estela.
I am new to the whole scraping thing and am trying to scrape some information off a website through python but when checking for HTML response (i.e. 200) I am not getting any results back on the terminal. below is my code. Appreciate all sort of help! Edit: I have fixed my rookie mistake in the print section below xD thank you guys for the correction!
import requests
url = "https://www.sephora.ae/en/shop/makeup-c302/"
page = requests.get(url)
print(page.status_code)
The problem is that the page you are trying to scrape protects against scraping by ignoring requests from unusual user agents.
Set the user agent to some well-known string like below
import requests
url = "https://www.sephora.ae/en/shop/makeup-c302/"
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36'
}
response = requests.get(url, headers=headers)
print(response.status_code)
For one thing, you don't print to the console in Python with the syntax Print = (page). That code assigns the page variable to a variable called Print, which is probably not a good idea as print is a keyword in Python. In order to output to the console, change your code to:
print(page)
Second, printing page is just printing the response object you are receiving after making your GET request, which is not very helpful. The response object has a number of properties you can access, which you can read about in the documentation for the requests Python library.
To get the status code of your response, try:
print(page.status_code)
I'm trying to scrape the value of Balance from a webpage using requests module. I've looked for the name Balance in dev tools and in page source but found nowhere. I hope there should be any way to grab the value of Balance from that webpage without using any browser simulator.
website address
Output I'm after:
I've tried with:
import requests
from bs4 import BeautifulSoup
link = 'https://tronscan.org/?fbclid=IwAR2WiSKZoTDPWX1ufaAIEg9vaA5oLj9Yd_RUfpjE6MWEQKRGBaK-L_JdtwQ#/contract/TCSPn1Lbdv62QfSCczbLdwupNoCFYAfUVL'
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36"}
res = requests.get(link,headers=headers)
soup = BeautifulSoup(res.text,'lxml')
balance = soup.select_one("li:has(> p:contains('Balance'))").get_text(strip=True)
print(balance)
The reason the page's HTML doesn't have the balance is because the page is making AJAX requests which are sending back the information you want after the page is loaded. You can look at these requests by loading up your developer window by pressing F12 in Chrome (it might be different in other browsers), go to the Network tab and you'll see this:
Here you can see the request that you want is account?address= followed by the code that is in the URL string for the page, and mousing over that shows the complete URL for the AJAX request, highlighted in coral, and the part of the response which holds the data you want is on the right highlighted in turquoise.
You can look at response by going here and find tokenBalances.
In order to get the balance in Python you can run the following:
import requests, json
url = 'https://apilist.tronscan.org/api/account?address=TCSPn1Lbdv62QfSCczbLdwupNoCFYAfUVL'
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36"}
response = requests.get(url, headers=headers)
response = json.loads(response.text)
balance = response['tokenBalances'][0]['balance']
print(balance)
The following python script gives me 403 error, the type of request is 'GET'.
import requests
import json
url ='https://footballapi.pulselive.com/football/players?pageSize=30&compSeasons=210&altIds=true&page=2&type=player&id=-1&compSeasonId=210'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36'}
result = requests.get(url, headers=headers)
print(result.status_code)
A Screenshot:
Check XHR Request Screenshot
Your code looks fine. I ran it and got the same 403 response. But if you open the url you posted, you'll notice a 403 error there as well. This looks like an issue with the website itself or maybe you are using an incorrect url.
This might be a late answer, but what you're missing is the correct header to access the Pulselive API. The necessary header is 'Origin':'https://www.premierleague.com'.
This makes the API think that the request is coming from the official Premier League website, and they have access to the API.
Hope this helps!
BackgroundInfo:
I am scraping amazon. I need to set up the session cookies before using requests.session.get() to get the final version of the page source code of a url.
Code:
import requests
# I am currently working in China, so it's cn.
# Use the homepage to get cookies. Then use it later to scrape data.
homepage = 'http://www.amazon.cn'
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
response = requests.get(homepage,headers = headers)
cookies = response.cookies
#set up the Session object, so as to preserve the cookies between requests.
session = requests.Session()
session.headers = headers
session.cookies = cookies
#now begin download the source code
url = 'https://www.amazon.cn/TCL-%E7%8E%8B%E7%89%8C-L65C2-CUDG-65%E8%8B%B1%E5%AF%B8-%E6%96%B0%E7%9A%84HDR%E6%8A%80%E6%9C%AF-%E5%85%A8%E6%96%B0%E7%9A%84%E9%87%8F%E5%AD%90%E7%82%B9%E6%8A%80%E6%9C%AF-%E9%BB%91%E8%89%B2/dp/B01FXB0ZG4/ref=sr_1_2?ie=UTF8&qid=1476165637&sr=8-2&keywords=L65C2-CUDG'
response = session.get(url)
Desired Result:
When navigate to the amazon homepage in Chrome, the cookies should be something like:
As you can find in the cookies part,which I underscore in red, part of the cookies set by the response to our request to the homepage is "ubid-acbcn", which is also part of the request header, probably left from last visit.
So that is the cookie I want, which I attempted to get by the above code.
In python code, it should be a cookieJar, or a dictionary. Either way, its content should be something that contains 'ubid-acbcn' and 'session-id':
{'ubid-acbcn':'453-7613662-1073007','session-id':'455-1363863-7141553','otherparts':'otherparts'}
What I am getting instead:
The 'session-id' is there, but the 'ubid-acbcn' is missing.
>>homepage = 'http://www.amazon.cn'
>>headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
>>response = requests.get(homepage,headers = headers)
>>cookies = response.cookies
>>print(cookies.get_dict()):
>>{'session-id': '456-2975694-3270026','otherparts':'otherparts'}
Related Info:
OS: WINDOWS 10
PYTHON: 3.5
requests: 2.11.1
I am sorry for being a bit verbose.
What I tried and figure:
I googled for certain keywords, but nobody seems to be facing this
problem.
I figure it might be something to do with the amazon
anti-scraping measure. But other than change my headers to disguise
myself as a human, there isn't much I know I should do.
I have also entertained the possibility that tt might not be a case of missing cookie. But rather I have not set up my requests.get(homepage,headers = headers) properly, hence the response.cookie is not as expected. Given this,I have tried to copying the request header in my browser, leaving out only the cookie part, but still the response cookie is missing the 'ubid-acbcn' part. Maybe some other parameter has to be set up?
You're trying to get cookies from simple "nameless" GET request. But if to sent it "on behalf" of Session you can get required ubid-acbcn value:
session = requests.Session()
homepage = 'http://www.amazon.cn'
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
response = session.get(homepage,headers = headers)
cookies = response.cookies
print(cookies.get_dict())
Output:
{'ubid-acbcn': '456-2652288-5841140' ...}
The cookies being set are from other pages/resources, probably loaded by JavaScript code. So you probably need to used selenium web driver for it. Check out the link for detail discussion.
not getting all cookie info using python requests module