I am trying to get some data from a page. I open Chrome's developer tools and successfully find the data I want. It's an XHR request using the GET method (sorry, I don't know how to describe it better). I then copy the params and headers and pass them all to requests.get(). The response I get is completely different from what I see in the developer tools.
Here is my code:

import requests

queryList = {
    "category": "summary",
    "subcategory": "all",
    "statsAccumulationType": "0",
    "isCurrent": "true",
    "playerId": None,
    "teamIds": "825",
    "matchId": "1103063",
    "stageId": None,
    "tournamentOptions": None,
    "sortBy": None,
    "sortAscending": None,
    "age": None,
    "ageComparisonType": None,
    "appearances": None,
    "appearancesComparisonType": None,
    "field": None,
    "nationality": None,
    "positionOptions": None,
    "timeOfTheGameEnd": None,
    "timeOfTheGameStart": None,
    "isMinApp": None,
    "page": None,
    "includeZeroValues": None,
    "numberOfPlayersToPick": None,
}
header = {
    'modei-last-mode': 'JL7BrhwmeqKfQpbWy6CpG/eDlC0gPRS2BCvKvImVEts=',
    'Referer': 'https://www.whoscored.com/Matches/1103063/LiveStatistics/Spain-La-Liga-2016-2017-Leganes-Real-Madrid',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
    'x-requested-with': 'XMLHttpRequest',
}
url = 'https://www.whoscored.com/StatisticsFeed/1/GetMatchCentrePlayerStatistics'
test = requests.get(url=url, params=queryList, headers=header)
print(test.text)
I followed the post below, but it is already two years old and I believe the site's structure has changed since then.
XHR request URL says does not exist when attempting to parse it's content
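For what it's worth, one common reason a copied XHR request returns different content is that the endpoint expects cookies set by the page that triggers it. This is an assumption about this particular feed, not something verified against the site, but a sketch using a requests.Session (so cookies from the match page carry over to the feed request) would look like:

```python
import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/57.0.2987.133 Safari/537.36",
})

def fetch_feed(match_url, feed_url, params):
    """Visit the match page first so any cookies it sets apply to the XHR."""
    session.get(match_url)  # the feed may check cookies set here
    return session.get(
        feed_url,
        params=params,
        headers={"Referer": match_url, "X-Requested-With": "XMLHttpRequest"},
    )
```

If the feed also checks a per-page token (like the modei-last-mode value above), that token likely has to be scraped fresh from the match page rather than copied once from DevTools.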
I am trying to use Python to simulate a request to http://bmfw.haedu.gov.cn/jycx/pthcj_check/78. On that page, only 姓名 (name) and 准考证号 (exam number) are required. The name field accepts anything, like aaa, but the exam number must be a real number, for example 04301022291. After entering the correct image-check code and clicking the 查询 (Query) button, the query result appears.
My question is: how can I simulate this request with Python? I used the developer tools to find the backend request, which is http://bmfw.haedu.gov.cn/jycx/pthcj_check, but when I call this URL from Python it still returns the query page. My code is as follows:
import requests
import json

url = 'http://bmfw.haedu.gov.cn/jycx/pthcj_check'
header = {
    "cookie": "HAEDU_SESSION_ID=02c06a8f-fb1e-4c42-80d7-5c17e4d9cdb2; Hm_lvt_6b92f031645868e1e23be9be5938d979=1668318839,1668343665; Hm_lpvt_6b92f031645868e1e23be9be5938d979=1668343665",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36",
    "Content-Type": "application/x-www-form-urlencoded"
}
data = {
    'xxid': '78',
    'trueName': 'hahaha',
    'sfzh': '',
    'zkzh': '04301022266',
    'imageId': '19GJI'
}
response = requests.post(url, data=data, headers=header)
print(response.text)
I am confused; maybe this is a fundamental front-end question.
BTW, there is no login required for this request.
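A likely missing piece (an assumption, not verified against this site) is that imageId must match a captcha generated for your current session cookie, so a hard-coded value combined with a cookie copied from the browser will not line up. A sketch that at least keeps everything in one session, with the captcha code passed in after solving it:

```python
import requests

session = requests.Session()
session.headers["User-Agent"] = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                                 "AppleWebKit/537.36 (KHTML, like Gecko) "
                                 "Chrome/107.0.0.0 Safari/537.36")

def query(true_name, zkzh, image_code):
    base = "http://bmfw.haedu.gov.cn/jycx/pthcj_check"
    # Load the form page first so the session cookie and the captcha
    # belong to the same session.
    session.get(base + "/78")
    data = {
        "xxid": "78",
        "trueName": true_name,
        "sfzh": "",
        "zkzh": zkzh,
        "imageId": image_code,  # must come from the captcha for THIS session
    }
    return session.post(base, data=data)
```

The captcha image itself would have to be fetched through the same session and solved (manually or otherwise) before posting.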
I've been trying to make an auto code redeemer for a site, but there's a problem: every time I send a request to the website I get a 403 error, which would normally mean I haven't passed the right headers, cookies, or Cloudflare checks. But I have, so I'm lost. I've tried everything; the problem is almost certainly Cloudflare running some verification I can't find a way past. I've passed the auth headers with the correct cookies as well, and I've tried the requests library as well as cloudscraper and bs4.
The site is
from bs4 import BeautifulSoup
import cloudscraper

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'
}
scraper = cloudscraper.create_scraper()
r = scraper.get('https://rblxwild.com/api/promo-code/redeem-code', headers=headers)
print(r)  # -> <Response [403]>
Can someone tell me how to get past Cloudflare's protection here?
I am trying to do some web scraping with Yahoo Finance:
"https://finance.yahoo.com/quote/AUDUSD%3DX/history?p=AUDUSD%3DX"
I finished the code, but it returns response code 404.
I noticed that I need to add a user-agent header before I can scrape the website, e.g.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
}
But I was just wondering how I can get that header information via Python. Is there code I can run to get the user-agent header? Thank you.
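If the question is what User-Agent requests sends when you don't set one, you can inspect its defaults directly via requests.utils.default_headers(), and then override the value per request:

```python
import requests

# The headers requests attaches by default when you don't supply any.
defaults = requests.utils.default_headers()
print(defaults["User-Agent"])  # something like python-requests/2.x.y

# To look like a browser instead, pass your own User-Agent:
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                         "AppleWebKit/537.36 (KHTML, like Gecko) "
                         "Chrome/62.0.3202.94 Safari/537.36"}
```

There is no way to ask Python for your desktop browser's User-Agent string; you either copy it from the browser's DevTools Network tab or from a site that echoes request headers.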
Why don't you check out this package? You might find it easier and less confusing:
Download market data from Yahoo! Finance's API Python
I'm new to web scraping and I am trying out basic skills on Amazon. I want to write code that finds the top 10 "Today's Greatest Deals" with prices, ratings, and other information.
Every time I try to find a specific tag using find() with a class specified, it returns None, even though the actual HTML does have that tag.
On manually scanning the output I found that about half of the page's code isn't being displayed in the terminal. The body and html tags do close, but a huge chunk of code inside the body tag is missing.
The last line of code displayed is:
<!--[endif]---->
then the body tag closes.
Here is the code I'm trying:

from bs4 import BeautifulSoup as bs
import requests

source = requests.get('https://www.amazon.in/gp/goldbox?ref_=nav_topnav_deals')
soup = bs(source.text, 'html.parser')

print(soup.prettify())
# On printing this it misses some portion of the html

article = soup.find('div', class_='a-row dealContainer dealTile')
print(article)
# On printing this it shows 'None'
Ideally, this should give me the code within the div tag so that I can continue on to get the name of the product. However, the output just shows None, and printing the whole document shows a huge chunk of the HTML missing. And of course the information I need is in the missing HTML.
Is Amazon blocking my request? Please help.
The User-Agent request header contains a characteristic string that allows the network protocol peers to identify the application type, operating system, software vendor or software version of the requesting software user agent. Validating User-Agent header on server side is a common operation so be sure to use valid browser’s User-Agent string to avoid getting blocked.
(Source: http://go-colly.org/articles/scraping_related_http_headers/)
The only thing you need to do is set a legitimate user-agent, so add headers to emulate a browser:
# This is a standard user-agent of Chrome browser running on Windows 10
headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36' }
Example:

from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
resp = requests.get('https://www.amazon.com', headers=headers).text
soup = BeautifulSoup(resp, 'html.parser')
# <your code here>
Additionally, you can add another set of headers to look even more like a legitimate browser. Add some more headers like this:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip',
    'DNT': '1',  # Do Not Track request header
    'Connection': 'close'
}
I am using Python requests to get an HTML page, with the latest version of Chrome in the user agent. But the response says "Please update your browser".
Here is my sample code:
import requests

url = 'https://www.choicehotels.com/alabama/mobile/quality-inn-hotels/al045/hotel-reviews/4'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36', 'content-type': 'application/xhtml+xml', 'referer': url}
s = requests.Session()  # s was undefined in the original snippet
url_response = s.get(url, headers=headers, timeout=15)
print url_response.text
I am using Python 2.7 on a Windows server. When I run the same code on my local machine I get the required output, but on the server the response is just "Please update your browser".
You cannot do HTTPS with an old client, and requests on an old Python 2.7 build effectively is an old browser: it is limited to whatever TLS versions and ciphers the OpenSSL it was built against supports. There have been many security problems in the SSL/TLS protocols, so servers increasingly refuse connections that use insecure encryption or outdated connection standards. Updating Python (or at least the OpenSSL it links against) on the server usually fixes this.
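To confirm this is a TLS problem rather than a header problem, you can check what your Python stack actually supports. This is a diagnostic sketch; howsmyssl.com is a third-party echo service (assumed reachable) that reports the TLS version your client negotiated:

```python
import ssl

import requests

# What the local Python's ssl module was built against.
print(ssl.OPENSSL_VERSION)

def check_tls():
    # The service echoes back details of the TLS handshake it saw.
    info = requests.get("https://www.howsmyssl.com/a/check", timeout=15).json()
    return info["tls_version"]

# print(check_tls())  # e.g. "TLS 1.2" or "TLS 1.3" on a current build
```

If the server-side Python reports an old OpenSSL or negotiates below TLS 1.2, that would explain why the same code works locally but not on the Windows server.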