I am dealing with a small error but cannot find the solution. I authenticate on a page, and I had Chrome's "Inspect/Network" tool open to see which web service is called and how; the screenshot shows what is used.
I have censored sensitive data related to the site. So, I have to make this same request from Python, but I always get a 500 error, and the server-side log shows no helpful information (only a Java traceback).
This is the code for the request:
response = requests.post(url, data='username=XXXXX&password=XXXXXXX')
url contains the same string that you see in the image under the "General/Request URL" label.
data contains the same string that you see in the image under "Form Data".
It looks like a very simple request, but I cannot get it to work :(.
Best regards
If you want your request to look as if it is coming from Chrome, then besides sending the correct data you need to specify headers as well. The reason you got a 500 error is probably that there are settings on the server side disallowing traffic from "non-browsers".
So in your case, you need to add headers:
headers = {
    'Accept': 'application/json, text/plain, */*',
    'Accept-Encoding': 'gzip, deflate',
    # ...... more headers copied from Chrome's network tab
    'User-Agent': 'Mozilla/5.0 XXXXX...'  # this line tells the server what browser/agent is used for this request
}
response = requests.post(url, data='username=XXXXX&password=XXXXXXX', headers=headers)
P.S. If you are curious, the default headers from requests are:
>>> import requests
>>> session = requests.Session()
>>> session.headers
{'Connection': 'keep-alive', 'Accept-Encoding': 'gzip, deflate',
'Accept': '*/*', 'User-Agent': 'python-requests/2.13.0'}
As you can see, the default User-Agent is python-requests/2.13.0, and some websites do block such traffic.
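Putting it together, here is a minimal self-contained sketch of the same login request with browser-like headers. The URL, form fields, and header values below are placeholders standing in for whatever you copy from Chrome's network tab, not the real site's values; passing the form fields as a dict also lets requests handle the URL encoding for you.
import requests

# Placeholder values -- copy the real ones from Chrome's "Inspect/Network" tab.
url = 'https://example.com/login'
headers = {
    'Accept': 'application/json, text/plain, */*',
    'Accept-Encoding': 'gzip, deflate',
    'Content-Type': 'application/x-www-form-urlencoded',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
}
payload = {'username': 'XXXXX', 'password': 'XXXXXXX'}

response = requests.post(url, data=payload, headers=headers)
print(response.status_code)
print(response.text)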
Related
I'm just starting to learn Python and am trying to learn web scraping.
When I make a GET request to the site, in Chrome's inspector -> Network tab I found links to JSON data.
This one: https://knowledgedb-api.elmorelab.com/database/getNpcDetail?alias=c1&npcId=12082
And this one: https://knowledgedb-api.elmorelab.com/database/getNpc?alias=c1&minlevel=1&maxLevel=10&type=Monster
Both of these links work fine in the browser; I can see the JSON data and can copy it into an online viewer or do something else with it.
But I also have a third link: https://resources-service.elmorelab.com/Resources/getNpcInfo?alias=c1
It does not work in the browser. It also does not work when I execute a GET request: it returns error 415. When I then try a POST request, it returns error 400.
Please help me.
My code:
import requests

headers = {
    'Content-Type': 'application/problem+json; charset=utf-8',
    'accept': 'application/json, text/plain, */*',
    'content-encoding': 'br, gzip, deflate',
}
response = requests.post(url='https://resources-service.elmorelab.com/Resources/getNpcInfo?alias=c1', headers=headers)
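For context, HTTP 415 ("Unsupported Media Type") generally means the server rejected the request's Content-Type, and 400 means it considered the body missing or malformed, so the endpoint probably expects a specific JSON payload. A minimal sketch of that idea, assuming (this is only a guess, not taken from the site) that the endpoint accepts a JSON body:
import requests

url = 'https://resources-service.elmorelab.com/Resources/getNpcInfo?alias=c1'

# The json= argument sets Content-Type: application/json and encodes the body;
# whatever fields the endpoint actually requires are unknown here.
response = requests.post(url, json={})
print(response.status_code)
print(response.text)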
I am trying to download a series of trading history files from www.mql5.com. I want to automate this using Python, which I have never done before. The website requires a login, so I have been following the tutorial at this page regarding the login/session. What I am trying to achieve is the equivalent of going to this page:
https://www.mql5.com/en/auth_login
and logging in, and then going to a link such as this:
https://www.mql5.com/en/signals/552592/export/history
If I do this in Chrome, pasting that link into the browser (once logged in) downloads a csv file immediately to my Downloads folder without opening any real page in the browser.
The code that I have written is:
import requests

loginurl = 'https://www.mql5.com/en/auth_login'
fileurl = 'https://www.mql5.com/en/signals/552592/export/history'

loginpayload = {
    'Login': '<mylogingoeshere>',
    'Password': '<mypasswordgoeshere>'
}

session = requests.Session()
post = session.post(loginurl, data=loginpayload)
print(post.status_code)

myfile = session.get(fileurl)
with open('<pathtowhereIwantfilestogo>\\420560.history.csv', 'wb') as f:
    f.write(myfile.content)
When I run the code, the print statement prints "200", but it is hard to tell whether it has really "logged in". I was sort of expecting that just "visiting" the file URL in Python might be enough to make the file appear on my computer somewhere, but that doesn't seem to be the case. I decided to use the first example on this page regarding getting the actual file. From my code, a csv file is created at the path specified, but it appears to contain lines of HTML code from the website rather than the file data I was expecting. The real file, when downloaded properly, is a list of trading data.
Please would it be possible for someone to help me identify where I might have gone wrong? It looks like "myfile" is getting a webpage rather than the real file that I am after. There are millions of pages about requesting files/logging in on Google but many of them seem to be much more complicated than what I am trying to achieve, or I simply don't understand them! Any help would be much appreciated.
You have to add headers and the proper payload while making the login request.
import requests

login_url = 'https://www.mql5.com/en/auth_login'
file_url = 'https://www.mql5.com/en/signals/552592/export/history'

login_payload = {
    "RedirectAfterLoginUrl": "https://www.mql5.com/",
    "RegistrationUrl": "",
    "ShowOpenId": "True",
    "ViewType": "0",
    "Login": "USER_ID",
    "Password": "PASSWORD"
}

headers = {
    'Host': 'www.mql5.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Referer': 'https://www.mql5.com/en/auth_login',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Origin': 'https://www.mql5.com'
}

session = requests.Session()
post = session.post(login_url, data=login_payload, headers=headers)
print(post.status_code)

my_file = session.get(file_url)
with open('<pathtowhereIwantfilestogo>\\420560.history.csv', 'wb') as file:
    file.write(my_file.content)
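A quick way to confirm the login actually worked before writing the file: if it did not, the server usually sends back an HTML login page instead of the CSV, so inspecting the response first saves some guessing. A small sketch, reusing the session and file_url from above:
# Reuses `session` and `file_url` from the snippet above.
my_file = session.get(file_url)

print(my_file.status_code)
print(my_file.headers.get('Content-Type'))  # should not be text/html for a CSV export
print(my_file.text[:200])                   # a short preview of what was actually returned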
I am trying to crawl a website and copied the request headers directly from Chrome; however, after using requests.get, the returned content is empty. But the headers I printed from requests are correct. Does anyone know the reason for this? Thanks!
Mac, Chrome, Python3.7
(Screenshots: "General Information" and "Requests Information")
import requests
headers = {
'Accept': '*/*',
'Accept-Encoding': 'gzip, deflate',
'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
'Cache-Control': 'max-age=0',
'Connection': 'keep-alive',
'Content-Type': 'application/x-www-form-urlencoded; charset=utf-8',
'Cookie': '_RSG=Ja4TD8hvFh2MGc7wBysunA; _RDG=28458f5367f9b123363c043b75e3f9aa31; _RGUID=2acfe6b2-0d74-4913-ac78-dbc2fa1e6416; _abtest_userid=bce0b01e-fdb6-48c8-9b86-4e1d8ef468df; _ga=GA1.2.937100695.1547968515; Session=SmartLinkCode=U155952&SmartLinkKeyWord=&SmartLinkQuery=&SmartLinkHost=&SmartLinkLanguage=zh; HotelCityID=5split%E5%93%88%E5%B0%94%E6%BB%A8splitHarbinsplit2019-01-25split2019-01-26split0; Mkt_UnionRecord=%5B%7B%22aid%22%3A%224897%22%2C%22timestamp%22%3A1548157938143%7D%5D; ASP.NET_SessionId=w1pq5dvchogxhbnxzmbgbtkk; OID_ForOnlineHotel=1509697509766jepc81550141458933102003; _RF1=123.165.147.203; MKT_Pagesource=PC; HotelDomesticVisitedHotels1=698432=0,0,4.5,3674,/hotel/8000/7899/df84daa197dd4b868868cba4db14f71f.jpg,&448367=0,0,4.3,4455,/fd/hotel/g6/M02/6D/8B/CggYtFc1nAKAEnRYAAdgA-rkEXw300.jpg,&13679014=0,0,4.9,1484,/200g0w000000k4wqrB407.jpg,; __zpspc=9.6.1550232718.1550232718.1%234%7C%7C%7C%7C%7C%23; _jzqco=%7C%7C%7C%7C1550232718632%7C1.2024536341.1547968514847.1550141461869.1550232718448.1550141461869.1550232718448.undefined.0.0.13.13; _gid=GA1.2.506035914.1550232719; _bfi=p1%3D102003%26p2%3D102003%26v1%3D18%26v2%3D17; appFloatCnt=8; _bfa=1.1509697509766.jepc8.1.1550141458610.1550232715314.7.19; _bfs=1.2',
'Host': 'hotels.ctrip.com',
'Referer': 'http://hotels.ctrip.com/hotel/698432.html?isFull=F',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36'
}
url = 'http://hotels.ctrip.com/Domestic/tool/AjaxHotelCommentList.aspx?MasterHotelID=698432&hotel=698432&property=0&card=0&cardpos=0&NewOpenCount=0&AutoExpiredCount=0&RecordCount=3663&OpenDate=2015-01-01&currentPage=1&orderBy=2&viewVersion=c&eleven=cb6ab06dc6aff1e215d71d006e6de92d3cb1428213f72763175fe035341c4f61&callback=CASTdHqLYNMOfGFbr&_=1550303542815'
data = requests.get(url, headers=headers)
print(data.request.headers)
The request header information that you shared in the image shows that the server responded correctly to the request. Also, the actual URL that you shared, http://hotels.ctrip.com/Domestic/tool/AjaxHotelCommentList.aspx?MasterHotelID=698432&hotel=698432&property=0&card=0&cardpos=0&NewOpenCount=0&AutoExpiredCount=0&RecordCount=3663&OpenDate=2015-01-01&currentPage=1&orderBy=2&viewVersion=c&eleven=cb6ab06dc6aff1e215d71d006e6de92d3cb1428213f72763175fe035341c4f61&callback=CASTdHqLYNMOfGFbr&_=1550303542815,
is different from the one shown in the image. In fact, it seems the actual page calls a lot of other URLs to build the final page, so there is no guarantee that you will get the same response with requests that you see in the browser. If the server, or the actual implementation at the server end, depends on the browser's JavaScript engine to execute the JavaScript and then render the content, you won't be able to get the final HTML as it looks in the browser. It would be better to use Selenium WebDriver in those cases to hit the URL and then get the HTML content. Again, if you can share the actual URL, I can suggest other ideas.
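As a rough illustration of that suggestion, a minimal Selenium sketch might look like the following. It assumes a local Chrome plus chromedriver installation; the URL is the hotel page taken from the question's Referer header:
from selenium import webdriver

# Assumes chromedriver is installed and on PATH.
driver = webdriver.Chrome()
driver.get('http://hotels.ctrip.com/hotel/698432.html?isFull=F')

# page_source contains the HTML after the browser has executed the page's JavaScript.
html = driver.page_source
print(html[:500])

driver.quit()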
I'm trying to scrape some data from an online GIS system that uses XML. I was able to whip up a quick script using the requests library that successfully posted a payload and returned an HTTP 200 with the correct results, but when moving the request over to Scrapy, I continually get a 413. I inspected the two requests using Wireshark and found a few differences, though I'm not totally sure I understand them.
The request in scrapy looks like:
yield Request(
self.parcel_number_url,
headers={'Accept': '*/*',
'Accept-Encoding': 'gzip,deflate,sdch',
'Accept-Language': 'en-US,en;q=0.8',
'Connection': 'keep-alive',
'Content-Length': '823',
'Content-Type': 'application/xml',
'Host': 'xxxxxxxxxxxx',
'Origin': 'xxxxxxxxxxx',
'Referer': 'xxxxxxxxxxxx',
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36',
'X-Requested-With': 'XMLHttpRequest'},
method='POST',
cookies={'_ga': 'GA1.3.1332485584.1402003562', 'PHPSESSID': 'tpfn5s4k3nagnq29hqrolm2v02'},
body=PAYLOAD,
callback=self.parse
)
The packets I inspected are located here: http://justpaste.it/fxht
That includes the HTTP request when using the requests library and the HTTP request when yielding a scrapy Request object. The request seems to be larger when using scrapy; it looks like the 2nd TCP segment is 21 bytes larger than the corresponding segment when using the requests library. The Content-Length header also gets set twice in the scrapy request.
Has anyone ever experienced this kind of problem with scrapy? I've never gotten a 413 scraping anything before.
I resolved this by removing the cookies and not setting the "Content-Length" header manually on my yielded request. It seems like those 2 things were the extra 21 bytes on the 2nd TCP segment and caused the 413 response. Maybe the server was interpreting the "Content-Length" as the combined value of the 2 "Content-Length" headers and therefore returning a 413, but I'm not certain.
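For illustration, the fix described above amounts to yielding the same request without the cookies argument and without a manual Content-Length header, letting Scrapy compute the length from the body itself. A sketch along those lines, inside the same spider callback and with the site-specific values abbreviated exactly as in the question:
yield Request(
    self.parcel_number_url,
    method='POST',
    headers={
        'Accept': '*/*',
        'Content-Type': 'application/xml',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
        # no 'Content-Length' here -- Scrapy sets it from the length of the body
    },
    body=PAYLOAD,
    callback=self.parse,
)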
I've written an application in Python for crawling a web site that uses ASP.NET on the server side.
This is what I've been doing (I just copied the HTTP headers and body from the browser, because I can see no other way of doing it):
(And it worked, some time ago! But now it aborts with a "connection timeout".)
import urllib2

def SBPageLoader(keyWord):
    headers = {
        'Host': 'www.sberbank-ast.ru',
        'Connection': 'keep-alive',
        'Content-Length': '46203',
        'Cache-Control': 'max-age=0',
        'Origin': 'http://www.sberbank-ast.ru',
        'User-Agent': 'Mozilla/5.0 (Linux i686)',
        'Content-Type': 'application/x-www-form-urlencoded; charset=utf-8',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Referer': 'http://www.sberbank-ast.ru/purchaseList.aspx',
        'Accept-Encoding': 'gzip,deflate,sdch',
        'Accept-Language': 'ru-RU,ru;q=0.8,en-US;q=0.6,en;q=0.4',
        'Accept-Charset': 'utf-8',
        'Cookie': 'ASP.NET_SessionId=d4ki4j55hsq3km45b4qbrgjs; __utma=99173852.1461595200.1340564818.1341685237.1341758931.11; __utmb=99173852.4.9.1341758978151; __utmc=99173852; __utmz=99173852.1340564818.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)'
    }
    # ....( here is lots of data with undefined meaning - what is it? )......
    data = '_EVENTTARGET=&__EVENTARGUMENT=........&__VIEWSTATE=%2FwEPDwUJMzUwNDEzMjgxD2QWAmYPZBYCZg9kFgICAw9kFgQCAQ9kFgICAg8PFgIeB1Zpc2libGVoZGQCBQ9kFgICAQ9kFgYCAQ9kFgICAQ9kFgwCFQ8PZBYGHgdjb250ZW50BRRsZWFmOnB1YmxpY2RhdGVzdGFydB4JbWF4bGVuZ3RoBQIxMB4FY2xhc3MFCCBkYXRlUlVTZAIXDw9kFgYfAQUSbGVhZjpwdWJsaWNkYXRlpurchID400=887031'
    data = data.replace("Toyota", keyWord)  # ha ha - ugly hack
    log("Start loading http://www.sberbank-ast.ru/purchaseList.aspx ...")
    req = urllib2.Request('http://www.sberbank-ast.ru/purchaseList.aspx', data, headers)
    response = urllib2.urlopen(req)
    page = response.read()
    log(".. Loading is finished")
Now, even if I replace the old body and headers with new ones, the same thing happens.
Any ideas about what's wrong with it are welcome.
The session for the website has probably expired. If you look at the cookies you can see that it is passing in a session identifier:
'Cookie': 'ASP.NET_SessionId=d4ki4j55hsq3km45b4qbrgjs;
__utma=99173852.1461595200.1340564818.1341685237.1341758931.11; __utmb=99173852.4.9.1341758978151; __utmc=99173852; __utmz=99173852.1340564818.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)'
(By the way, the rest of the cookies can be ignored; they are actually Google Analytics cookies, which are only used client-side in JavaScript.)
Most servers have sessions that expire if they are not used for a certain period of time. If the sessions are stored in memory on the server, then they would be lost if the server was rebooted.
You may need to go back to the site in your browser and get a new session identifier, or build that part into your crawler.
If you want to build it into your crawler then you need to take a look at storing the cookies that you receive back from the server.
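A minimal sketch of that last idea, assuming urllib2 as in the question: use a cookie jar so the opener stores whatever fresh ASP.NET_SessionId the server hands out on a first GET and replays it automatically on the POST, instead of hard-coding an expired one in the Cookie header. The form data (the __VIEWSTATE fields and so on) would still have to be built or parsed out of the page as before.
import cookielib
import urllib2

# The cookie jar stores the Set-Cookie values (including ASP.NET_SessionId)
# returned by the server and sends them back on later requests automatically.
jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))

# First request: load the page once just to obtain a fresh session cookie.
opener.open('http://www.sberbank-ast.ru/purchaseList.aspx')

# Second request: post the form; `data` and `headers` are built as in the
# question, except that the hard-coded 'Cookie' entry is left out.
# req = urllib2.Request('http://www.sberbank-ast.ru/purchaseList.aspx', data, headers)
# page = opener.open(req).read()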