I am trying to scrape the data that generates a chart on a website using Python's requests module.
My code currently looks like this:
# load modules
import os
import json
import requests as r
# url to send the call to
postURL = <insert website>
# use a GET request to pull cookie data
cookie_intel = r.get(postURL, verify = False)
# get cookies
search_cookies = cookie_intel.cookies
#### Request Information ####
# API request data
post_data = <insert request json>
# header information
headers = {"user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"}
# results
results_post = r.post(postURL, data = post_data, cookies = search_cookies, headers = headers, verify = False)
# result
print(results_post.json())
As a quick summary: I first loaded the site and inspected it. In the network tab I identified the URL for the request, and in the payload tab I checked the required request data. I then took the user-agent from the request headers tab.
The request itself works, but the response is always empty. I have tried altering all sorts of inputs, without success. I would highly appreciate any tips that would help me solve this issue. Thank you in advance!
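In this case you have to use json= instead of data= when making the POST request, according to the requests documentation. With json=, requests serializes the payload to JSON and sets the Content-Type header to application/json, which data= does not do. By replacing this part of your code you should get the expected response.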
results_post = r.post(postURL, json = post_data, cookies = search_cookies, headers = headers, verify = False)
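To see why the switch matters, the two keyword arguments can be compared without sending anything. The sketch below prepares the same request twice with requests and prints what would go over the wire. The URL and payload are made-up placeholders, not the values from the question.

```python
import json
import requests

# Hypothetical endpoint and payload, for illustration only
url = "https://example.com/api/chart"
payload = {"chart": "sales", "year": 2022}

# data= with a dict: the body is form-encoded
form_req = requests.Request("POST", url, data=payload).prepare()

# json= with a dict: the body is serialized to JSON
json_req = requests.Request("POST", url, json=payload).prepare()

print(form_req.headers["Content-Type"])  # application/x-www-form-urlencoded
print(form_req.body)                     # chart=sales&year=2022
print(json_req.headers["Content-Type"])  # application/json
print(json.loads(json_req.body))         # the original dict again
```

An API that expects a JSON body will often answer a form-encoded request with an empty or default result rather than an error, which would match the symptom described in the question.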
You can also try other scraping tools like Scrapy to crawl this data, and perhaps run the crawler in the cloud using estela.
Related
I am a beginner at using Scrapy and I was trying to scrape this website, https://directory.ntschools.net/#/schools, which uses JavaScript to load its contents. So I checked the network tab and found an API address, https://directory.ntschools.net/api/System/GetAllSchools. If you open this address directly, the data is in XML format. But when you check the response tab while inspecting the network tab, the data is there in JSON format.
I first tried using Scrapy and sent the request to the API address WITHOUT any headers. The response it returned was XML, which threw a JSONDecodeError on json.loads(). So I added the header 'Accept': 'application/json', and the response I got was JSON. That worked well:
import scrapy
import json
import requests
class NtseSpider_new(scrapy.Spider):
    name = 'ntse_new'
    header = {
        'Accept': 'application/json',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36 Edg/107.0.1418.56',
    }

    def start_requests(self):
        yield scrapy.Request('https://directory.ntschools.net/api/System/GetAllSchools', callback=self.parse, headers=self.header)

    def parse(self, response):
        data = json.loads(response.body)  # returned JSON response
But then I used the requests module WITHOUT any headers and the response I got was in JSON too!
import requests
import json
res = requests.get('https://directory.ntschools.net/api/System/GetAllSchools')
js = json.loads(res.content) #returned json response
Can anyone please tell me if there's any difference between both the types of requests? Is there a default response format for requests module when making a request to an API? Surely, I am missing something?
Thanks
It's because Scrapy sets the Accept header to 'text/html,application/xhtml+xml,application/xml ...' by default. You can see that in Scrapy's DEFAULT_REQUEST_HEADERS setting.
I experimented and found that the server sends a JSON response if the request has no Accept header.
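For comparison, the defaults on the requests side can be checked locally. As far as I know, requests does send an Accept header of its own, but it is the wildcard */*, which this server presumably treats the same as having no preference (that last part is an assumption based on the behaviour described above). A minimal sketch:

```python
import requests

# Headers that requests attaches when you pass none yourself
defaults = requests.utils.default_headers()
print(defaults["Accept"])  # */*

# Asking for JSON explicitly avoids depending on either library's default
url = "https://directory.ntschools.net/api/System/GetAllSchools"
explicit = requests.Request(
    "GET", url, headers={"Accept": "application/json"}
).prepare()
print(explicit.headers["Accept"])  # application/json
```

Setting Accept explicitly in both Scrapy and requests makes the two clients behave the same way regardless of their defaults.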
I am new to the whole scraping thing and am trying to scrape some information off a website with Python, but when checking the HTTP response status (i.e. 200) I am not getting any results back in the terminal. Below is my code. I appreciate any help! Edit: I have fixed my rookie mistake in the print section below xD thank you guys for the correction!
import requests
url = "https://www.sephora.ae/en/shop/makeup-c302/"
page = requests.get(url)
print(page.status_code)
The problem is that the page you are trying to scrape protects against scraping by ignoring requests from unusual user agents.
Set the user agent to some well-known string, as below:
import requests
url = "https://www.sephora.ae/en/shop/makeup-c302/"
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36'
}
response = requests.get(url, headers=headers)
print(response.status_code)
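To check what the site actually sees, the outgoing headers can be inspected without making a request. This is a small sketch using a Session to prepare (not send) the request. The browser string is the one from the answer above, and the variable names are only illustrative.

```python
import requests

url = "https://www.sephora.ae/en/shop/makeup-c302/"
browser_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36"
}

session = requests.Session()

# Prepared with the library defaults: identifies itself as python-requests
default_req = session.prepare_request(requests.Request("GET", url))

# Prepared with the browser-like header: overrides the default User-Agent
custom_req = session.prepare_request(
    requests.Request("GET", url, headers=browser_headers)
)

print(default_req.headers["User-Agent"])
print(custom_req.headers["User-Agent"])
```

The first line printed starts with python-requests/, which is an easy string for a site to filter on.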
For one thing, you don't print to the console in Python with the syntax Print = (page). That code assigns page to a new variable called Print and outputs nothing. Python names are case-sensitive, so Print is unrelated to the built-in print function. In order to output to the console, change your code to:
print(page)
Second, printing page is just printing the response object you are receiving after making your GET request, which is not very helpful. The response object has a number of properties you can access, which you can read about in the documentation for the requests Python library.
To get the status code of your response, try:
print(page.status_code)
I've had some success using the POST requests in the past on other sites and receiving data from them but for some reason I'm having difficulty with the metacritic site.
Using Chrome and the developer tools, I can see that when I begin to type in the search bar, it starts a POST request to the following URL.
searchURL = 'http://www.metacritic.com/g00/3_c-6bbb.rjyfhwnynh.htr_/c-6RTWJUMJZX77x24myyux3ax2fx2fbbb.rjyfhwnynh.htrx2ffzytx78jfwhmx3fn65h.rfwpx3dcmw_$/$'
I also know that my headers need to be the following in order to get a response:
headers = {'User-Agent' : "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36"}
When I run this, I get a status code of 200 which indicates it worked but my response text is not what I expected. I am receiving the content of the entire page when I'm expecting json of search results. What am I missing here?
title = 'Grand Theft Auto'
#search request using POST
r = requests.post(searchURL, data = {'searchTerm' : title}, headers = headers)
print(r.status_code)
print(r.text)
You can see in the images below what I'm expecting to get.
[Screenshots: request headers and response]
I'm not sure what causes the difference (maybe it is GDPR-related since I live in Europe, or because I have set DNT (Do Not Track) to true in Chrome), but for me the Metacritic autocomplete requests simply POST to http://www.metacritic.com/autosearch with the parameter search_term set to the search value and search_filter set to all.
From your screenshots, I think the URL for autocomplete in your browser is constructed with your session ID, maybe to prevent exactly the kind of thing you intend to do :)
So in your case I would try the following, in order:
1. POST to the /autosearch URL.
2. If that doesn't work, figure out the logic that writes the session ID into the URL, then make an initial request in the code to get a session ID and work with that.
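The first step can be sketched as follows. The URL and the two form fields are the ones described in this answer and the User-Agent is the one from the question. The helper name is hypothetical, and whether the endpoint still responds this way is not something I have verified.

```python
import requests

AUTOSEARCH_URL = "http://www.metacritic.com/autosearch"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36"
}


def build_autosearch_request(term):
    """Prepare (but do not send) the autocomplete POST described above."""
    form = {"search_term": term, "search_filter": "all"}
    return requests.Request(
        "POST", AUTOSEARCH_URL, data=form, headers=HEADERS
    ).prepare()


prepared = build_autosearch_request("Grand Theft Auto")
print(prepared.body)  # search_term=Grand+Theft+Auto&search_filter=all

# To actually send it (needs network access; the endpoint may have changed):
# with requests.Session() as session:
#     response = session.send(prepared, timeout=10)
#     print(response.status_code, response.headers.get("Content-Type"))
```

Checking the Content-Type of the reply is a quick way to tell whether you got JSON search results or the full HTML page again.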
The default data link is http://tools.morningstar.co.uk/uk/fundscreener/results.aspx?LanguageId=en-GB&Universe=FOGBR%24%24ALL&CurrencyId=GBP&URLKey=t92wz0sj7c&Site=uk
But I do not want the data on this default page. I want the data under the Portfolio tab. So I used Firefox to determine the URL of the Portfolio tab and attempted the following Python code:
testpage = urlopen('http://tools.morningstar.co.uk/uk/fundscreener/results.aspx?LanguageId=en-GB&Universe=FOGBR%24%24ALL&CurrencyId=GBP&URLKey=t92wz0sj7c&Site=uk&tabAction=Portfolio')
However, the page is always redirected to the default link. How do I get to the Portfolio page?
You need to pay attention to the request that is being made along with all the headers and the data.
To get the "portfolio" data, if you inspect the traffic, you will see that a POST request is made with a lot of form data (the payload), and the portfolio data is sent back in the response.
What you need to do is mimic that request to fetch the response data, and then handle it according to your needs. You can do something like this:
import requests
from lxml import html
headers = {
'user-agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
'accept-language':'en-US,en;q=0.8,ms;q=0.6'
}
url = "http://tools.morningstar.co.uk/uk/fundscreener/results.aspx?LanguageId=en-GB&Universe=FOGBR%24%24ALL&CurrencyId=GBP&URLKey=t92wz0sj7c&Site=uk"
payload = {
'ctl00_ContentPlaceHolder1_aFundScreenerResultControl_ScriptManager1_HiddenField':';;AjaxControlToolkit, Version=3.5.7.123, Culture=neutral, PublicKeyToken=28f01b0e84b6d53e:en-GB:5a4df314-b4a2-4da2-a207-9284f1b1e96c:de1feab2:f2c8e708:720a52bf:f9cec9bc:589eaa30:a67c2700:ab09e3fe:87104b7c:8613aea7:3202a5a2:be6fb298',
'__EVENTTARGET':'TabAction',
'__EVENTARGUMENT':'Portfolio',
'__LASTFOCUS':'',
'__VIEWSTATE':'Or/Z5BkJx2WVGMIPWgbVTVzk9hu+/eDKDHsbG74cJRlPSPW9dXuSQt31f2njq7X4NCZF/VW7u63TU5lF3lWGIAFNRoIIWwlRVMeMWeHygunbmBVxWWO08k90rAhbOCiyeOgKoaL1lVKO0R0DGS9rjl1Gah7C2NiIyLeD8boWobKLRV47aRiqaWI9ZYprxoky4zmuNp4NP51z0QLfb/4TvQKfcXJcUHHAAknVurwXfye3cHiUGf7pOyI84E9KJscHsbowC6mejPX4XmlXLVrVXk/lupYU8yTXSp03D2vfyPcQcrxt3y/uF0kXNG+4A/hFWOQFazVk1SRMYnQlrWtQ9Ulh58Q71zEZvX3yZhnp2EA5ZnYuOfeFWCnwwUBa6s9o8uLocDK1Q4chtjXDqK7q2W89kPZoyYjmgB5xunFDt8A7Sz3IFaDkJEyPYdBPOKx1Y1zv0g3/gwBnd64UXkTlBySHZao2CB/OBNQoqI6RqI6L44nrbESabh+DHBCdcCKeL8Pj+lsM5o7P0ShXpXHbCRTPk4PiWVeP4hk1vyOFA7tiReoWEPwQvDe3sqWh+K7EHLHefW5ke6W9zy5seHuC1vfcVTwT5FUIcTaAnhoDSphsMHWPoVc/vtcfExPWUx/aC2KIf1m+DKtN/no8Frt4SYqxMGtDMSUZjMR5xhFHSaqfjv/0Gs+RVod4N+A4rYeUO07A9VTLTE8SuZ4ovxjhrEtAQ3bYqzt29leHpmFT7Pfl7OZw3t3wt6SjQX+Q3M5ozThannhRKaDJCnBZdFh7ZnY4fgCLpDNyMDq3FccJC0V6PDSuu6enpPWOcy4NJj5H+/rEqo61/e2wgmefzt7Zaygu4v66MmOKLqbWymNa6C1Xuc0u0FhERUrSWrL/rS9kwC+LA7aWFPhdnEnwPewV6yj7kWzb4IZZ6ivGs3CXYH7G2HTnsP/P2bHXNV+YaaXTkdKJXkiPF7/qQ3JKzZYhDJjj0PObqtI2RlhmeecF8Lq8SfRRrBTXWjvg48q7nXurZ7ztX28QRDHC5aP/13+X8RyvLRmiM4V3vRdMjxpt8ySZtKM5wpEA4XjTUtCWtrNKO18yc0pbMaGRm5xEoXLY/i1cHC62OvKRsEYX82q+KuGyqEKwPoElc2SbMfyEit8M6tkA12wBce4cqlUMX4D85OOKCMzhY+h0loQFSsgVFfqKpEHpH9yg5lKtRg0dZ8P301xoGCeXhBhyZIp234EAdOOQySV4iNcBykLFGOuB0w63KqVbQRejqlnj2Qd+OkoXQ4hAh9tgCXdxhOHZ1hLB3nHHNMT3TDBO+j8eXgxAE8PN9zt6Xj7qGqDmkHAlwMP3Q8er0Ms2i80x7pUzvy5ixozAbUgfuEeKtjkK7fSD0UkKMa/YELEjTkgVJm6goPPIR3D2lwNAQLyHM8xLFSy3evkpJojw+QEFw4U9n31CoO6OB15Isqy/E1MPgwq9Wz3mUn2iYH1JruwsgQqQXraUKAiyMlpfbtj2YQL39Zp+AwzPeDDbwRaCCNBFvmpapcJyMpmzlzd0tr9gdV1GoVTtWBg+UcVGSsQi4XkD+32CfUuQ+ZpFlmUoYLuYSAEFV7Y97MlqLMqW89r/BZXRXNacpizFFrnQlCnsM4Bj4DUp+K7pcxAaYRKWcH3tiQO6zhCa2b8YoawWzQ5Ij8b19Z7PLN3Yug9ldeJ2CcYOzUQebT23ofSNtCU+uTbYzzh6RE8Bg/rut8R6A1uwYBWvjfL7N7M2fUSd01pwYgJ0BfsViV1pipzpCvTL5hGf1aK25gR+T7GtIxNbrdlo6Z1LbV/xYQYIDTod5dq6wUttZJVLeLVZRkCAv+M+o7Bvd86pi82TIdC8foOPgo7OR6ykPk+aMt1pr/hBV3tmBOUvMyYADmmOZQR+L/AQ57tRukeRyACeTJq1b5icpxawI+qn71we6eAKmg5POvkbq+pI+YnoSs1Mhk9OWeJ1CPRg3P5TDMIhq
XsG4mKY6awMwZXF12/r4qb7bRnfZGFukHBAYJTRsmZsLgiicM2uJ7kchxs2U/jwVcItGHgnIYkg1r7TTJ3oFo1rHEVhFHm8dIem3iI/VUpbe/XZyEKseDxoALSbASjYxM5n2eGfBLFnHMHv8RPfrX5EBfD0ZzMAVc8MoSycTsJuJI8L912Eewk9Cz3mb7o2zF9L+8syg8NpEDy78kIa0lE+QNqvdtk3P7uCxUckKWdmKLUfU2zaTBBGkIcDo8xXktZFgC+yQbUtxD2yFC21tvSA3xJaPVWqycMiVRp3fwIabWylnRnnwLqvAjIPTKiZI5w/szdciCwzx0GhSY14xpVV+jlLlfH8KCqVBVL5NIzxRTw+ELVPHOS3orE1dKtCcOqM22GE5PsU69E7ViA+fC2Gn/HzkUfUHPBKjKixX9hTmZzOnXToBU5sdEMZ1i3Jte+xfk3YVzYv9TO9f5EiibdNgw8MdCrXwxlgYNZUob0PixOajsPed+qv2PTl+kvLOSTkw6Z6K892TJkBvpAGQvP/zSgUorcNhuAJwQVG32TnX0HypMPpVwX0SqOhZLGM9essa7guKOrA3GdIDsoA2/f4JkFlJMtVgXKGPNXr7mTCeq2H8vFfQbH/59wPfMgrxxo6s9C+Tyt5zG3lyRoTEGUr4QwBkSeHq4J6Vya3sFDH911QHrfFuaKF2auqHHGuKyCCViqpb3A1Z4/GbllXBC4cmjyKc8FfI5i2eSSEMOd95N198ZCOD7x1zXPACX9QjaMdzZbadJ9UHXYsb/7l87ujNY4x5S9oQXgfW8fva9i4oqTqMV3VXTQK8lVcFovH0OxXXpNZ+rPm8Tj5kbRGrMgp6CdyxWSLKvqYv8f57ICr6ozaxyAd8XiTM+AhkfnXsN8BcH0u1yP6WUBDkUjhBi+4lfO6Dj5r6pFIN65GqPaz0mRFDpZU3nVQ1CmmeXneh0ZT/u7tG7Ray5Md5jr9onVsWfWnbc0hbUP0ghMANhtZtcrLpFikwxxQybdsS/xWdB4dLenTMAi2hn0KQ196thhQvvhEvEWaSxuEjX+iaQB14kXwOHAsBj8Ikp4lIdBsVctVQFVNzM3+F+UfDIbpTFh4IaAvOWNZzFGZYjdKDKKIuIgSAhdkHZbjQGpvXWdx12WR1/I/aqk5dx8OFpU3Lq/thZxQ+0oODetvex87L6lKWMgUcvQQAzAXbwzFp4wcTHnQuKJ21hqotOfn8F0GmWv59/hqfH1oFpt6/ENAs162hXOdGt5kTYl7u6X+ciQiIioRLiJ/NRIOoa1T++6v2FMk9acnOfNYMxEGeBdtqmLIN70aL8wvoFLliCkUhfe4yPaFQzFo26JsnnAXUpuiDKfs5fjDS+Rk/1BfVScqDIMv8IL8RDIoWxg8NX5DOOJPwAc3uC+s/kCCpoG2L0m9FLgSBv6Nr9wuv1rt59C/K/5RETD/VP415ArnuUBrdGpuYza1FvYyCo85HREzIL2lN6yZUBUXbBBrWxa3LiGaojhfhCyflhhHs+GoM8zfY5IW7Wpvp/YMPAgxXNRtegGL80+HU/dmlkRO8nRx3eyzpcpWZ302rK9m+OYqtfUXwvFKR7ULWnk/2aHsTQe6lwifxK70QG+jhZlrJqbPGi8vSpajsGMw5iU+VJM4CEDcGhvgpzODw3LkXPvsFrdLq8eUzHXo1Ox+yiZ2zSN3vGDcGeEZiQAbG2dcNt7niW+reozfdxVQAi4uLpPGWYu8jvVnRxoMuQEKEGzIiwNNsvpgCMGdUfk0izvvkTplz8lvk6ROlhy417VtiiVXVIMTAovFNO+W3O3/17LJ9Ed7QPYdoUO4n5fidYX6r4QUAoRGowMAPHQRIdg/AqN7N/EfmDRD72t7BOqvzVetXuTVId75vKB2P0CwoQPDIy0ynLZcTRykRs38LIHwYI5irp1NUjCee7mvo1RE0asD670LM03ZFMCOu/hmgln2dk5oFeyysISdxVUQKRmI6VytwEsSv
iOZeP1cZgB5DakdSgCaloRI2JGVbZ22B9UgO6hFSvfHhox5y1p/CzIrJPd+GUB80wmFX8Kgl0DSjsf9PJNQlAKu/jb85+wvF7exNrPyrShkWE9lYjbcmBHPYc+8J3ia1N3LWtVbR1x554dYoWHVGw3VbU3bWfqLjn5Eon4x3h5R/bCwVBorVCsQ99SzWCv5J9dMRF38r8y1yA4iEUPcX89n4nl91t6cnia4THuk2hhbaBPeu2PFwnTuwiJxAknVEGFUslAXu621wvmyssftVnQ+jzirCQNJAXyE75t+pNrWmQJXrpHDxnR3V9/LFrNy3tZn61H+UkEY1QK29bUJHE+DOfSnS4QkNY3VpLpaBdBeBorOOZ6dEc+lzVDcPrgjL+1fqu/yHwFmxCN8MfreDuX5E8M86YAR/xnyYAJRMxafR1p9eIG+cgHwIBeOhCw1J0p+ydN/bNK7KousyY4OcEr4zTF6crn6LmN6C7zDqabx8cjMRmUeNl24x27LxJhakNbmQjMVPSfo6Ro/edo0L7pG+pbj9SwiaJxGkr65b8pvrTwDFKh7tLUQtZ9j4s4y6MiQ565q2OJp93rm2deHHXUsM/ziI1t9OV00dbjuhTaLTmF5u3rIpKgryYVvmIa6G081WcKCeWz46amFLg7v39SB6XNuL0AIxIow5Hu+S0oIv51+ycUZoLUnypTFe/SnlobjYxAxsJt/cnxZ4wh556EZ9rN0HaJrbbb7pA4uaDBvz4EL8ndM+zEmlacyfQlKSr+jdB0XX+An6zhNQv3D5dkz84QdmPuAavhemrwr2m30Q/tNZ8DZdsBpOyR/U86nplu9Sx5LcFGnWULX+teY24nBUUghfuhGRPEr0dHPUUgMwpQq1fpcz6YQft09B0uthQiYWhNXvnsrlvnLzZTWTZLjFfwDlNn5RZqn0fAxudbM+eOzL9xvx4TBEEpcyf5nLTuNAKvfeZm4KWcRmV+WPnDJxmf7OlTVNKsXiY7Y+bJMjgNfKMh3oQws/1+gtATMlYSdjNIzuYSglhMyXS+BPRI8dpDq2VeF/cb6AII0Pvyq0H7nRadP1xccD6hTKdb4rP7kEAAZClm7P4M4Mog+CAXePDMw3kSkRGzsbT/6rKffKp9crRcOnKwSHU2yuf+NBTES6xeaPD0R7YwjbrRHPDsOoOdQXEcn/bl0oNnnLheSZhDKdFERtlvrpVB8qZ469A2Jqw/X5QMcIrEb0gisLWRSuiCpg/zmFDqaDsj1M8evc2MPGtkKxw9IsuupthsWKxYkbwB2inJdnLwgCDx+2B5oIT7pYbLricSseF1ukjL3uEyHicA3WztLzKoLjumpzevRWBs1VnYCL0Ow0U0yABR/dz3nh0mcE6X0iBb8ulgp+zn/8CNTNEE7lVSPPn3FFr6+mNuYu5O9fn9G6lji/8muhJWTW/9bbrA/2ZVPK4pto7mmfh/OWmkHnw3Te4ZysDIOXcXD7BCixoSB/3l88JQrGB/EAqrNz6oEhXeQ9hof2EhwKI3ZoxvKh5jfDii3PWI+NJPdFFtP+zRS+P1p4aMpQC703rHkmiSFRJIIaPnnbnXNN1NhBefjkjFA6nTvUcYsbBtKQzFJbAEiBnhOo/+jgUdd31gZbZbRi2Iw+Pv07qjDgwVznE5HLwEu8y2k+mdW7f1RKIgjiOhPA4CzBcWumeo7USUDpHaLNWEP0lLiwuxxB8CigRUln763e4xFAvd+vPlBoJJsJBUezJ5OdV5AC6Fe9/UuFT8Ov+Bsknk072xPIHxLks5J6XxDNrm7mnDKTirLE3y2OLpy5gAUPc1n7UpdH08k3C4y+8iqILZXfN6WzR459QcmY7Uu2YFSbxVM6dVYsE7arsp3zyDgjCgnctbrlO2A2iJ2P7f8eGdYEnMjm8Hv4lwFfSHDKVuVoD3+2Amw5CtE9Smtdidm4OTC/C1yePG+IvtXlx+21lgPpRdWOCFmz8/bQusVvRlQCz8any5fVnXJERaeQ
xC8UMbWgmRPQs1Q0rrpe7V8LKq4W5rwmEClsmUoqWiaDXN/nuzuPY2Bm7l4Qo7B0JQd/AEA3Kw4/4L8XLbQ8JHtnamJExXbDFZp2jPjz/9igiPq5j1+/ZqJtnwHPa54R2gLGbV9plpEMOu94Og993CM4QxKN4LSD/TKUV/ik46I5H1texmN7RWMcL1gAPnO0AVbiI8sP1xNxclROHanYTudyVVKld0qrq6ht4eYOgPL2RWB8ma1i8DiqfEy2Z+iDHIDv4nB92ktT7BKNA9MEC19+Nmbv1nNgiULtt+jOZZ92XlPnVU5fYIQZEsn/VPzxFx+6yoDmdN9+aeC6k/SfWFPJhdQ6e0Kk9sOpZQowS+GpaTGw86m2K/RyyzQyzJetV58Wlon3bYwzST6N/CHxiyWJ6KiC+JOJHjKuZjoX33FKKp/LqkG7PlC599l6afuCmU/e9r+MczU+BqdMZTAHshA5mpgPqT2gjHvD2Er+J9Be/P7YHCUvOUFpnfcDWVDz0Hx3kqibf8iFKBFDzK+YHk7U0I0O7yRgLTpxf3CC3ZpEF0unuuM/29BgViIucoRdPIHceE2aTUcH27myPTT5SJ8SGpJrfr5bS264VZ7la/ewkdSVAvfKZAD95KofW0up8Bt4z14+IQCE1Xe1c2fA6Vr20junvkX4ZgQVrlWGztCX26LigP19olHT8mDFPGzCEuXXqFDSGUaJ9bpRglkd5Ps9JkKbCfnzF+PLFMByr5xDywnCaMDyzLttVzfdu32qVI3UP51UZ1lnnylXuKjuctCJiC3it20cmMf4VK41gZpGbUlWoasmnx5OGapdvwGTtrT0cVNRkg6UXjj2BuZncnTBPzNhwarsZUsBBtOAbDue7xfEOXn1lLUDQ1u8xKqwzN4tqYTbp+VQAuI3LG9FaF6cD0wQX9qyFji7oIcmsHMe9KROCKi83IdG/Td0ML2/h3yUG/VkKVTdKbX2QGNShAKCT3JPPfR49q0WIeuJ4SCskuiZfzzo2m3rdodBvqEPiGSOmlmp6/RLGF0iWtZ177aqYzeEEMwO2BpxR0ZB/+0JTD9u+DE50h83q3eRsikNc4VlZDB8JWJW3WRLU/AxRefA0rP3nes0B7Thx0MTXpRsOGjprtSKYR9h73QxDugTSNA77sCpjSutaswovVdn9Lj6k4IL6av1gRK2Wso/e8YNEglHp+VBEegXw35ZOByleDUwqdOS07xCA9hBS0Oec0v7YvFEAIJnW4FunEV2fscciH2gVpDR2s+FbKjVdT7t6qNnWT4HX2PuL2mrHDKgE/l8tDvJ2z2zZW1fTiExfTngLbTOyhlptrc8RFDAJBi3jdYw4HU9LufawJjUIukBgiX9cM5y2IykqNzM10tMsMVfxRH10lHqieGf9e0u2ht+gRmYwooRzoyenkKlWVHh8E37+yXa0SHuV5qXb/8sk1IGqE5p0wL7qWUfOTRAdWtPll30n16f7Epfl+dYI8m93uTk2FrL0Dsosdkp5BmIilduNXje1bMonEtliHrJ012Q0FIxVjEOZDUTUXYwRw0mF0eaxvKu27cJ1OqYUGfJk9zqAiAc9QnTBDL/f+zljgBzs6FWC2PWASaoMrS5Q0aNlN/y5wmFma9swHoEwrBUXr4Vi2Nyf5jj/FijJz77DaNs1J4G5uUF8Abe8HRvYc0XCtEMWqCcv3W78Px4/v/ThOMvamMcJvBh91/6Ep95/SzHvuEDpb8WsUKjwpXDdmp1k7QgMwb0ymrheZhxj/mYklx4EtnMWYwIPt02RJbEFoEcgB09chAg5x8rTh6FmJzGHmZOv7A8oEg0CvrO2pT+aiKqCTRcJsOKKvZnLXlQg1TwgJh3jCgvVVSPGIEO4RpIWMNT1/Opno6ytmiubgX5NoythDBrG5WtoAltsfomRTkb1NWOhcam9Q=',
'__VIEWSTATEGENERATOR':'7400285C',
'__VIEWSTATEENCRYPTED':'',
'__EVENTVALIDATION':'1p0545yV8Pljo87c7Dlb6kiemgGIXd5S3wYUGUoMyg6IPO12GWyBgNM7bu67YNl9f9Sx9ad9lHLIfwYtw0nDGqWYtWBnM8PHrdmxYdOb5+qUooGamIPBCCel/8Ri+FGpvNGPTZkKeuYzfomnlqr/mYoMcjdnsiQWCf7Dvou8X5p8A/pkHRReFtE8H4xIhr8X3MU6lpxBhHZKj3UK+hBHCWxEnQkGb0Nz2Pi8hyWNt5AUu830RSQnl793RwuxwQ1HCmJYFEx00c11gXmSn36PPP42OCMstDR/GpK2LUPsNQbdJ7TUq25rzG/5SIjYxWA0nQbGY/mWaY3Q9iCo7k5o9QnZEf4yLaOF5g7nEva4lTZNwx27ynyDAWrRBVE0KsGTsbIQMgMPqCV8gzc6irsluosW+EI6zW0mdaeoiBaGYHFQnJ77a5rnbpL0j6fMiDfL+5VW7jAaRnyz3Y10Cn1TlXEY4rvQjHZIxvK/rBe2WSVkyXIhrUgAl7a7EvDGnBniVdsizrBSgASrjcT9svJ0aHPEpfxJmy8nuzV0pZbXGzG1q06Xyij4SoxHfi7tf1dPjOn6+zdR2SY0/sQvXmQ35bAlbnFMKWdzyJHB0uEm6GYQtV4Dcj6fjGijxjIiQW0SgjuRd7/8k1i7MvEnD+I6MRIBhx/KNOdP3os9oP8pyMicIz8V1o7KENKwX4fyUmbIx34adCXt/DXT1sjFtNu4S0vzvOU/AtxBLOEcw+clV25xSp/94dEq/dge+K2ySRuxKt6DcNhjDMYvc1ACXbfVjANG8ar3x7n9kX14EMtnpip0RI9ypma3tOmjqip+Qc+lyc12A0jV714BfTzw6nSjYya75Idztq1gZifNt+pQn7GO7Qw2kIqNnXvpA+UkWbsTlnTKyY9gTqPHbF5XcSvtvDfNYM6mxVMJZ1MyAt4pxrCgyGAC0IswPZ8wMAablR9fNStFs67D4kyeUCU/2IVTD1/pfmMC8meLuaXpgHkl2er3Wr84H2lVL+xUd7/wCUSkLFke68SeRfqPl7dIR9hstVJC0cCTbmco0KxTzcwln3QdoxveE/N8v31Z9teZoJxdeRRFyJQFGHw81JVor2kACBsIkioLUvz4IpUxE8XEwUrjHCBJyZG3QzcQAxSBXprztdoknBgrd38wCssuCa3gQvIoMtbCaedWhmY0pA/AI0aHTHR/j5nTg9jaqeEViZF0hLVEhVz2MojXswtp0aA70YwuMPBmMCdgy5w+wSeThtsyt6j+b2NHHRkE0uqEc8D5XDUB5M2UolpsZCcuOKSwK+jwGdeb3gPWssUQShMgRTEoFLapZJKX7c8/yeZ7Lf4KrFE4pBz8+JeFJOVFl3y1ewckAFZVHdYvu1901Aq6PKTMu6kGz/LElno8fCJbyKacsZ3LtpssGhvFBN0vv0WIn4elkiLCL3u9v6oWOxaK4OIDTaVwDLjb5BvBpd2Szj5diHAG1IXoVQYvJ2VEDVwbiTUChXRZcDY6bAm7dvkYWLOxsa0whGz2xeeApbEceeQrHREelH89ucBenmuENPiF98Kf4mZQ/ThiVhFxiWAux1b0Dn0z7M/mXfposuy8ytqtRry3SJoC5V7I+7E4N5x0JyVwN/vtxBpd5h443R35RvDZ1tnscirsGzNoulevkeUqM+I6TgjrvBF00fv3isZvzUjIXK6E9cAg4G7aPyuirI+KIICi9VNxF5fDRxi2UPTHiB3NT01Vez5GVt0Tu8lpn2iakJSBjihOYORrSI+xJzbQdnCzJa1+h8UiAFXgpqWviJUVXG22wFQ1HQckAbFxU/Pcyx+QrsnDrhwihqmnwFd1fuwOy74SAvPMojpxujxWDe+37nhroEyhrk5yOB65RDUcQFS77a+3RwuNyXAodTC3QMp5lMZD1Ae8zGEBesg4zbkP7aMS+ljYBShRN6n9KYhHZ7s2Iq5V4K6GrUcOFdXP157jN4vBuj8l+Uo
BIPjpMm9KKpLnCuSjGNPIyoxfPg==',
'ctl00$ContentPlaceHolder1$aFundScreenerResultControl$ddlPageSize':'20',
'ctl00$ContentPlaceHolder1$aFundScreenerResultControl$hdnFilterBySelection':'',
'ctl00$ContentPlaceHolder1$aFundScreenerResultControl$hdnSelectedIndex':'',
'ctl00$ContentPlaceHolder1$aFundScreenerResultControl$hdnSelectedFunds':'',
'ctl00$ContentPlaceHolder1$aFundScreenerResultControl$hdnCustomImageFileIds':'',
'ctl00$ContentPlaceHolder1$aFundScreenerResultControl$hdnSelectedSecurityCount':'0',
'ctl00$ContentPlaceHolder1$aFundScreenerResultControl$hdnTabs':'Snapshot,ShortTerm,Performance,Portfolio,FeesAndDetails',
'ctl00$ContentPlaceHolder1$aFundScreenerResultControl$msExportToExcel$hdnSelectedRow':'',
'ctl00$ContentPlaceHolder1$aFundScreenerResultControl$msExportToExcel$hdnCheckedColumns':'',
'ctl00$ContentPlaceHolder1$aFundScreenerResultControl$msExportToExcel$hdnExportLimit':'50',
'ctl00$ContentPlaceHolder1$aFundScreenerResultControl$msExportToExcel$hdnExportCount':'0',
'ctl00$ContentPlaceHolder1$aFundScreenerResultControl$msExportToExcel$toExport':'',
'ctl00$ContentPlaceHolder1$aFundScreenerResultControl$msExportToExcel$hdnAllowAllFundsExport':'true',
'ctl00$ContentPlaceHolder1$aFundScreenerResultControl$txtSaveSearch':'',
'ctl00$__RequestVerificationToken':'e_BrPK0DxBjkgMrhfkdyFJjp1nPzltSn0h20aUjHJPSe3W4w3FRsFQNo_YY3Ml0D1CkNGqC5PEJBigtZuvdbiYrldSMrUoOQFUjaPifPbM41'
}
r = requests.post(url, headers = headers, data = payload)
print(r.content)
root = html.fromstring(r.content)
You can now fetch the elements you need from root using xpath such as :
root.xpath('//input[@class="some_class"]')
Refer to scraping and lxml documentation for more understanding.
I have used all the payload data from the request; you can remove some fields and check which ones are absolutely necessary.
Also, follow the website rules about scraping and scrape gracefully without putting too much pressure on the website.
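The __VIEWSTATE and __EVENTVALIDATION values in a payload like the one above are tied to a particular page load and are likely to go stale, so one option is to read the hidden fields from a fresh copy of the page and only override the event fields. The sketch below shows that idea on a tiny stand-in HTML snippet. It uses the standard library's html.parser instead of lxml to stay dependency-free, the class and function names are hypothetical, and whether the real site accepts a payload built this way is an assumption.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""


def build_postback_payload(page_html, event_target, event_argument):
    """Return form data for an ASP.NET-style postback from fresh hidden fields."""
    parser = HiddenFieldParser()
    parser.feed(page_html)
    payload = dict(parser.fields)
    payload["__EVENTTARGET"] = event_target
    payload["__EVENTARGUMENT"] = event_argument
    return payload


# Tiny stand-in for the real page, for illustration only
sample_html = """
<form>
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="query" value="ignored" />
</form>
"""

payload = build_postback_payload(sample_html, "TabAction", "Portfolio")
print(payload)
```

With the real page, page_html would come from a GET of the results URL, and the resulting dict would be passed as data= to requests.post in place of the hard-coded payload.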
BackgroundInfo:
I am scraping Amazon. I need to set up the session cookies before using a requests Session's get() to fetch the final version of the page source for a URL.
Code:
import requests
# I am currently working in China, so it's cn.
# Use the homepage to get cookies. Then use it later to scrape data.
homepage = 'http://www.amazon.cn'
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
response = requests.get(homepage,headers = headers)
cookies = response.cookies
#set up the Session object, so as to preserve the cookies between requests.
session = requests.Session()
session.headers = headers
session.cookies = cookies
# now download the page source
url = 'https://www.amazon.cn/TCL-%E7%8E%8B%E7%89%8C-L65C2-CUDG-65%E8%8B%B1%E5%AF%B8-%E6%96%B0%E7%9A%84HDR%E6%8A%80%E6%9C%AF-%E5%85%A8%E6%96%B0%E7%9A%84%E9%87%8F%E5%AD%90%E7%82%B9%E6%8A%80%E6%9C%AF-%E9%BB%91%E8%89%B2/dp/B01FXB0ZG4/ref=sr_1_2?ie=UTF8&qid=1476165637&sr=8-2&keywords=L65C2-CUDG'
response = session.get(url)
Desired Result:
When I navigate to the Amazon homepage in Chrome, the cookies look something like this:
As you can see in the cookies part, which I underlined in red, one of the cookies set by the response to our request for the homepage is "ubid-acbcn", which is also part of the request header, probably left over from the last visit.
So that is the cookie I want, which I attempted to get by the above code.
In python code, it should be a cookieJar, or a dictionary. Either way, its content should be something that contains 'ubid-acbcn' and 'session-id':
{'ubid-acbcn':'453-7613662-1073007','session-id':'455-1363863-7141553','otherparts':'otherparts'}
What I am getting instead:
The 'session-id' is there, but the 'ubid-acbcn' is missing.
>>> homepage = 'http://www.amazon.cn'
>>> headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
>>> response = requests.get(homepage, headers=headers)
>>> cookies = response.cookies
>>> print(cookies.get_dict())
{'session-id': '456-2975694-3270026', 'otherparts': 'otherparts'}
Related Info:
OS: WINDOWS 10
PYTHON: 3.5
requests: 2.11.1
I am sorry for being a bit verbose.
What I tried and figured:
I googled for certain keywords, but nobody seems to be facing this problem.
I figure it might have something to do with Amazon's anti-scraping measures. But other than changing my headers to disguise myself as a human, there isn't much I know I should do.
I have also entertained the possibility that it might not be a case of a missing cookie, but rather that I have not set up my requests.get(homepage,headers = headers) properly, so response.cookies is not as expected. Given this, I have tried copying the request headers from my browser, leaving out only the cookie part, but the response cookies are still missing the 'ubid-acbcn' part. Maybe some other parameter has to be set?
You're trying to get the cookies from a plain, "nameless" GET request. If you send it on behalf of a Session instead, you can get the required ubid-acbcn value:
session = requests.Session()
homepage = 'http://www.amazon.cn'
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
response = session.get(homepage,headers = headers)
cookies = response.cookies
print(cookies.get_dict())
Output:
{'ubid-acbcn': '456-2652288-5841140' ...}
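One thing worth knowing when comparing the two approaches: response.cookies only holds the cookies set by the final response, while cookies set along a redirect chain sit on the entries of response.history (a Session also accumulates all of them in session.cookies). Whether that is what happens to ubid-acbcn on amazon.cn is an assumption I have not verified. The sketch below uses offline stand-in responses with made-up cookie values, and the helper name is hypothetical.

```python
import requests


def cookies_across_redirects(response):
    """Merge cookies from every redirect hop and the final response."""
    merged = {}
    for hop in list(response.history) + [response]:
        merged.update(hop.cookies.get_dict())
    return merged


# Offline stand-ins for a redirect chain (no network involved)
first_hop = requests.Response()
first_hop.cookies.set("ubid-acbcn", "example-ubid")

final = requests.Response()
final.cookies.set("session-id", "example-session")
final.history = [first_hop]

print(final.cookies.get_dict())         # only the final response's cookie
print(cookies_across_redirects(final))  # cookies from both hops
```

With a real Session, printing session.cookies.get_dict() after the request gives the same merged view without a helper.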
The cookies being set come from other pages/resources, probably loaded by JavaScript code. So you probably need to use the Selenium web driver for it. Check out the linked question for a detailed discussion:
not getting all cookie info using python requests module