I am trying to scrape a website where I have to get to the right page using a POST request.
Below are screenshots showing how I found the headers and payload that I needed to use in my request:
1) Here is the page: it is a list of economic indicators:
2) It is possible to select which country's indicators are displayed using the "filter" that is on the right-hand side of the screen:
3) Clicking the "apply" button sends a POST request to the site, which refreshes the page to show only the information for the ticked boxes. Here is a screen capture showing the elements of the form sent in the POST request:
But if I try to make this POST request with Python's requests library using the following code (see below), it seems that the form is not processed, and the page returned is simply the default one.
payload = {
    'country[]': 5,
    'limit_from': '0',
    'submitFilters': '1',
    'timeFilter': 'timeRemain',
    'currentTab': 'today',
    'timeZone': '55',
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
    'Host': 'www.investing.com',
    'Origin': 'https://www.investing.com',
    'Referer': 'https://www.investing.com/economic-calendar/',
    'Content-Length': '94',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Cookie': 'adBlockerNewUserDomains=1505902229; __qca=P0-734073995-1505902265195; __gads=ID=d69b337b0f60d8f0:T=1505902254:S=ALNI_MYlYKXUUbs8WtYTEO2fN9O_q9oykA; cookieConsent=was-set; travelDistance=4; editionPostpone=1507424197769; PHPSESSID=v9q2deffu2n0b9q07t3jkgk4a4; StickySession=id.71595783179.419www.investing.com; geoC=GB; gtmFired=OK; optimizelySegments=%7B%224225444387%22%3A%22gc%22%2C%224226973206%22%3A%22direct%22%2C%224232593061%22%3A%22false%22%2C%225010352657%22%3A%22none%22%7D; optimizelyEndUserId=oeu1505902244597r0.8410692836488942; optimizelyBuckets=%7B%228744291438%22%3A%228731763165%22%2C%228785438042%22%3A%228807365450%22%7D; nyxDorf=OT5hY2M1P2E%2FY24xZTE3YTNoMG9hYmZjPDdlYWFnNz0wNjNvYW5kYWU6PmFvbDM6Y2Y0MDAwYTk1MzdpYGRhPDk2YTNjYT82P2E%3D; billboardCounter_1=1; _ga=GA1.2.1460679521.1505902261; _gid=GA1.2.655434067.1508542678',
}
import lxml.html
import requests

g = requests.post("https://www.investing.com/economic-calendar/", data=payload, headers=headers)
html = lxml.html.fromstring(g.text)
tr = html.xpath("//table[@id='economicCalendarData']//tr")
for t in tr[4:]:
    print(t.find(".//td[@class='left flagCur noWrap']/span").attrib["title"])
This is visible because if, for instance, I select only country "5" (the USA), post the request, and look for the countries present in the result page, I see other countries as well.
Does anyone know what I am doing wrong with that POST request?
As your own screenshot shows, the site posts to the URL
https://www.investing.com/economic-calendar/Service/getCalendarFilteredData
whereas you're posting directly to
https://www.investing.com/economic-calendar/
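The fix can be sketched as follows; the payload is copied from the question, and whether the live site still accepts these exact fields is an assumption. Preparing the request without sending it shows exactly what would go over the wire:

```python
import requests

# Target the AJAX endpoint the browser uses, not the page URL itself.
payload = {
    'country[]': '5',
    'limit_from': '0',
    'submitFilters': '1',
    'timeFilter': 'timeRemain',
    'currentTab': 'today',
    'timeZone': '55',
}
headers = {
    'User-Agent': 'Mozilla/5.0',
    'X-Requested-With': 'XMLHttpRequest',
    'Referer': 'https://www.investing.com/economic-calendar/',
}
req = requests.Request(
    'POST',
    'https://www.investing.com/economic-calendar/Service/getCalendarFilteredData',
    data=payload,
    headers=headers,
).prepare()

print(req.url)   # the Service endpoint, not the page itself
print(req.body)  # form-encoded filters, e.g. country%5B%5D=5
# To actually send it: requests.Session().send(req)
```

Note that requests computes Content-Length and the form encoding by itself, so there is no need to hard-code them as in the original snippet.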
How would I add the item whose id I got to the cart using POST requests? This is my code:
post_headers = {
    'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36',
    'x-requested-with': 'XMLHttpRequest',
    'content-type': 'application/x-www-form-urlencoded',
}
post_data = {"utf-8": "%E2%9C%93", "commit": "add to cart"}
url = "https://www.supremenewyork.com/shop/{productid}/add".format(productid=id)
add_to_cart = requests.post(url, headers=post_headers, data=post_data)
print(add_to_cart.content)
The specific product that I am trying to add to cart using POST requests is: https://www.supremenewyork.com/shop/shirts/zkmt62fz1/lg1ehyx3s
It accurately prints the item id in the console, but when I go to my cart, there is nothing there.
I am guessing you are looking at your cart in your browser. Usually a website keeps track of you as a user with cookies (e.g. a session id) and uses this information to decide which orders to display. If you send the order as a request from Python, the cookies from the server come back in that Python response. So when you then look for the order in your browser, the browser does not have the cookies from the Python response, and the site will not recognise you as the same user.
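The mechanics can be sketched with a requests.Session; the cookie name and value below are made up for illustration, standing in for whatever the shop's server would actually set on the add-to-cart response:

```python
import requests

# A Session stores cookies from every response and replays them on
# later requests automatically. Simulate a server-set cart cookie:
session = requests.Session()
session.cookies.set('cart', 'item-123', domain='www.supremenewyork.com')

# Prepare (without sending) a follow-up request to the cart page:
# the stored cookie is attached for us.
req = session.prepare_request(
    requests.Request('GET', 'https://www.supremenewyork.com/shop/cart')
)
print(req.headers.get('Cookie'))  # cart=item-123
```

So if both the add-to-cart POST and the cart lookup go through the same Session object, the site should see them as the same user.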
I am trying to web-scrape Pokémon information from their online Pokédex, but I'm having trouble with the findAll() function. I've got:
containers = page_soup.findAll("div", {"class": "pokemon-info"})
but I'm not sure if this div is where I need to be looking at all, because (see photo html) this div is inside a li, so perhaps I should search within that instead, like so:
containers = page_soup.findAll("li", {"class": "animating"})
But in both cases, when I use len(containers), the length returned is always 0, even though there are several entries.
I also tried find_all(), but the results of len() are the same.
The problem is that BeautifulSoup can't run JavaScript. As furas said, you should open the web page with JavaScript turned off (here's how) and then see if you can still access what you want. If you can't, you need to use something like Selenium to control the browser.
As the other comments and answer suggested, the site is loading the data in the background. The most common response to this is to use Selenium; my approach is to first check for any API calls in Chrome. Luckily for us, the page retrieves 953 pokemon on load.
Below is a script that will retrieve the clean JSON data, and here is a little article I wrote explaining why I check Chrome's developer tools in the first instance rather than reaching for Selenium.
# Gotta catch em all
import requests
import pandas as pd

headers = {
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Referer': 'https://www.pokemon.com/us/pokedex/',
    'Connection': 'keep-alive',
}

r = requests.get('https://www.pokemon.com/us/api/pokedex/kalos', headers=headers)
j = r.json()
print(j[0])
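The pandas import in the script goes unused; if you want the JSON list as a table, a list of dicts converts straight into a DataFrame. The field names below are illustrative stand-ins, not the API's actual schema:

```python
import pandas as pd

# Stand-in for the list of per-pokemon dicts the endpoint returns.
sample = [
    {"name": "Bulbasaur", "number": "0001"},
    {"name": "Ivysaur", "number": "0002"},
]
df = pd.DataFrame(sample)  # one row per pokemon
print(df.shape)            # (2, 2)
```

In the real script this would simply be `df = pd.DataFrame(j)`.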
I am a rookie just learning Python; however, for our Bachelor's thesis we need the data from the following website (it's just municipal financial data from the Latvian government):
https://e2.kase.gov.lv/pub5.5_pasv/code/pub.php?module=pub
So far I have done the following:
1. Got frustrated that this is not a simple HTML page and that it has this 'interactive' header (sorry, my knowledge of how to interact with it is very limited).
2. By using Chrome dev tools and the Network tab, I found out that I can run the following URL to 'request' the period, municipality, financial statement, etc. that I need: https://e2.kase.gov.lv/pub5.5_pasv/code/ajax.php?module=pub&job=getDoc&period_id=1626&org_id=2542&blank_id=200079&currency_id=2&editable=0&type=HTML
3. Created basic Python code to get the HTML from that URL (see below).
4. Found out that it returns empty data. Thought that this was a bug; however, the response code is 200, which as I understand means the request was successful.
5. Tested this URL in different browsers, and lo and behold: it works in Chrome, but in Microsoft Edge it returns an empty blank page.
6. Read somewhere that I have to 'introduce' myself to the server, and tried to use headers and a User-Agent both manually and via the fake_useragent library with a Chrome User-Agent. Yet it still doesn't work.
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get("https://e2.kase.gov.lv/pub5.5_pasv/code/ajax.php?module=pub&job=getDoc&period_id=1626&org_id=2542&blank_id=200079&currency_id=2&editable=1&type=HTML", headers=headers)
print(r.text)
So I'm stuck at point 6. The URL works well in Chrome but does not work in Edge. And it seems that my Python code gets the same blank page the Edge browser gets, with no data whatsoever.
I would appreciate it a lot if anyone could at least lead me in the right direction or give me some reading material, because right now I have no idea how to configure my Python code to reproduce the HTML output from Chrome, or whether this is even a legitimate (or good) way to approach obtaining this data.
EDIT: Sorry guys, I found out that it is not possible to access this website from outside Latvia; however, I have found a solution (see below).
Solved the problem.
Previously, when imitating a browser, I only used the following headers:
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Mobile Safari/537.36'
}
It turns out I had to include all of the request headers the browser sends to the server (found through Chrome dev tools), like so:
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9',
    'Connection': 'keep-alive',
    'Cookie': 'Cookie; Cookie',
    'DNT': '1',
    'Host': 'e2.kase.gov.lv',
    'Referer': 'https://e2.kase.gov.lv/pub5.5_pasv/code/pub.php?module=pub',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Mobile Safari/537.36'
}
I want to scrape https://sparrow.eoffice.gov.in/IPRSTATUS/IPRFiledSearch and download the entire set of PDF files that show up in the search results as on date (say 01-01-2016). The employee fields are optional. On clicking search, the site throws up a list of all the employees. I am unable to get the POST method to work using Python requests; I keep getting a 405 error. My code is below:
from bs4 import BeautifulSoup
import requests
url = "https://sparrow.eoffice.gov.in/IPRSTATUS/IPRFiledSearch"
data = {
    'assessmentYearId': 'vH4pgBbZ8y8rhOFBoM0g7w',
    'empName': '',
    'allotmentYear': '',
    'cadreId': '',
    'iprReportType': 'cqZvyXc--mpmnRNfPp2k7w',
    'userType': 'JgPOADxEXU1jGi53Xa2vGQ',
    '_csrf': '7819ec72-eedf-4290-ba70-6f2b14cc4b79'
}
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-US,en;q=0.8',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Content-Length': '184',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
response = requests.post(url,data=data,headers=headers)
I'm not familiar with the website, but I strongly suggest reading their policy before trying to scrape its content.
In similar scenarios, when you don't get the expected results from a simple POST, using requests.Session usually helps.
The problem lay in my using the same CSRF token each time. It needs to be changed with every request.
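A sketch of how the token can be refreshed on each run: fetch the form page with a Session first, then pull the fresh value out of the hidden input before posting. The exact markup of the hidden input below is an assumed stand-in (check the live form in dev tools); only the field name '_csrf' is taken from the question:

```python
from bs4 import BeautifulSoup

def extract_csrf(html):
    """Return the value of the hidden '_csrf' input, or None if absent."""
    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.find('input', {'name': '_csrf'})
    return tag['value'] if tag else None

# Intended flow (not executed here):
# session = requests.Session()
# page = session.get('https://sparrow.eoffice.gov.in/IPRSTATUS/IPRFiledSearch')
# data['_csrf'] = extract_csrf(page.text)
# response = session.post(url, data=data, headers=headers)

# Local demonstration with a stand-in form:
sample = '<form><input type="hidden" name="_csrf" value="abc-123"/></form>'
print(extract_csrf(sample))  # abc-123
```

Using the same Session for both requests also keeps the server's session cookie paired with the token it issued.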
I am trying to get some data from a page. I open Chrome's developer tools and successfully find the data I want. It's in XHR with the GET method (sorry, I don't know how to describe it better). Then I copy the params and headers and put all of these into the requests.get() method. The response I get is totally different from what I saw in the developer tools.
Here is my code
import requests
queryList = {
    "category": "summary",
    "subcategory": "all",
    "statsAccumulationType": "0",
    "isCurrent": "true",
    "playerId": None,
    "teamIds": "825",
    "matchId": "1103063",
    "stageId": None,
    "tournamentOptions": None,
    "sortBy": None,
    "sortAscending": None,
    "age": None,
    "ageComparisonType": None,
    "appearances": None,
    "appearancesComparisonType": None,
    "field": None,
    "nationality": None,
    "positionOptions": None,
    "timeOfTheGameEnd": None,
    "timeOfTheGameStart": None,
    "isMinApp": None,
    "page": None,
    "includeZeroValues": None,
    "numberOfPlayersToPick": None,
}
header = {
    'modei-last-mode': 'JL7BrhwmeqKfQpbWy6CpG/eDlC0gPRS2BCvKvImVEts=',
    'Referer': 'https://www.whoscored.com/Matches/1103063/LiveStatistics/Spain-La-Liga-2016-2017-Leganes-Real-Madrid',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
    "x-requested-with": "XMLHttpRequest",
}

url = 'https://www.whoscored.com/StatisticsFeed/1/GetMatchCentrePlayerStatistics'
test = requests.get(url=url, params=queryList, headers=header)
print(test.text)
I followed the post below, but it's already 2 years old and I believe the structure has changed since.
XHR request URL says does not exist when attempting to parse it's content