Can't scrape names from a table using requests - python

I'm trying to scrape names from a table that gets populated after selecting some options on a webpage. These are the options used to generate the table on that site. However, when I try doing the same using the script below, I always get status 500. I had success using Selenium, so I'm not after any Selenium-based solution.
webpage address
Here is what I've written so far:
import requests
from bs4 import BeautifulSoup
link = 'https://rederef-saude.appspot.com/proximidade/prestador/buscar?'
params = {
    'canal': '1',
    'latitude': '-23.5505199',
    'longitude': '-46.63330939999999',
    'categoria': '1',
    'produto': '557',
    'plano': '18051',
    'nome': '',
    'qualificacoes': '',
    'prefixoEmpresa': '',
    'empresa': '',
    'especialidade': '',
    'procedimento': '',
    'tipoPesquisaProcedimento': '1',
    'raio': '200000'
}
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36',
    'referer': 'https://rederef-saude.appspot.com/rederef/buscaPrestadores?login=publico&canal=1&data=23/04/2021&hora=00:16:55&tipoProduto=M&produto=557&plano=18051',
    'x-requested-with': 'XMLHttpRequest',
    'content-type': 'application/json;charset=utf-8',
    'accept': '*/*',
    'captcha-token': ''
}
with requests.Session() as s:
    s.headers = headers
    res = s.get(link, params=params)
    print(res.status_code)
How can I scrape the names from that table using requests?

When checking the requests, I found the site does have reCAPTCHA protection; it's just in invisible mode, which only shows up when required. The search call also needs the cookies from the first call, or else it gives an invalid-session error. For an automated reCAPTCHA solution (paid, but pretty cheap) there are various providers, e.g. 2Captcha, Anti-Captcha, CapMonster, etc.
I have used 2Captcha for it; you could follow the code below.
Reference: the 2Captcha library.
import requests
from twocaptcha import TwoCaptcha

solver = TwoCaptcha('Twocaptcha API KEY')
result = solver.recaptcha(sitekey='6LdOzkUaAAAAANV9z7gQOokV2kNWI8WI_eSH80vC',
                          url='https://rederef-saude.appspot.com', invisible=1)["code"]
params = {
    'login': 'publico',
    'canal': 1,
    'data': '23/04/2021',
    'hora': '04:10:41',
    'tipoProduto': 'M',
    'produto': 557,
    'plano': 18051
}
cookies = requests.get(
    'https://rederef-saude.appspot.com/rederef/buscaPrestadores', params=params).cookies
headers = {
    'captcha-token': result,
}
params = {
    'canal': '1',
    'latitude': '-23.5505199',
    'longitude': '-46.63330939999999',
    'categoria': '1',
    'produto': '557',
    'plano': '18051',
    'nome': '',
    'qualificacoes': '',
    'prefixoEmpresa': '',
    'empresa': '',
    'especialidade': '',
    'procedimento': '',
    'tipoPesquisaProcedimento': '1',
    'raio': '200000'
}
response = requests.get('https://rederef-saude.appspot.com/proximidade/prestador/buscar', headers=headers,
                        params=params, cookies=cookies)
print(response.json())
Don't forget to update the 2Captcha API key in the code above.
You could use any provider you feel comfortable with.
The output is JSON, so there is no need for BeautifulSoup.
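To pull just the names out of that JSON, something along these lines should work. This is a minimal sketch; the 'prestadores' and 'nome' keys are guesses at the response layout, so inspect response.json() first:

data = response.json()
for prestador in data.get('prestadores', []):  # 'prestadores' is an assumed key
    print(prestador.get('nome'))               # 'nome' is an assumed key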
Let me know if you have any questions :)

Related

Scraping with AJAX - how to obtain the data?

I am trying to scrape the data from https://www.anre.ro/ro/info-consumatori/comparator-oferte-tip-de-furnizare-a-gn, which gets its input via Ajax (request URL is https://www.anre.ro/ro/ajax/comparator/get_results_gaz).
However, I can see that the Form Data takes the form tip_client=casnic&modalitate_racordare=sistem_de_distributie&transee_de_consum=b1&tip_pret_unitar=cu_reglementate&id_judet=ALBA&id_siruta=1222&consum_mwh=&pret_furnizare_mwh=&componenta_fixa=&suplimentar_componenta_fixa=&termen_plata=&durata_contractului=&garantii=&frecventa_emitere_factura=&tip_pret= (if I view the source in Chrome). How do I pass this to Scrapy or any other module to retrieve the desired webpage?
So far, I have this (is the JSON format correct considering the Form Data?):
class ExSpider(scrapy.Spider):
    name = 'ExSpider'
    allowed_domains = ['anre.ro']

    def start_requests(self):
        params = {
            "tip_client": "casnic",
            "modalitate_racordare": "sistem_de_distributie",
            "transee_de_consum": "b1",
            "tip_pret_unitar": "cu_reglementate",
            "id_judet": "ALBA",
            "id_siruta": "1222",
            "consum_mwh": "",
            "pret_furnizare_mwh": "",
            "componenta_fixa": "",
            "suplimentar_componenta_fixa": "",
            "termen_plata": "",
            "durata_contractului": "",
            "garantii": "",
            "frecventa_emitere_factura": "",
            "tip_pret": ""
        }
        r = scrapy.FormRequest('https://www.anre.ro/ro/ajax/comparator/get_results_gaz', method="POST", formdata=params)
        print(r)
The following should produce the required response from that page you wish to grab data from.
class ExSpider(scrapy.Spider):
    name = "exspider"

    url = 'https://www.anre.ro/ro/ajax/comparator/get_results_gaz'

    payload = {
        'tip_client': 'casnic',
        'modalitate_racordare': 'sistem_de_distributie',
        'transee_de_consum': 'b2',
        'tip_pret_unitar': 'cu_reglementate',
        'id_judet': 'ALBA',
        'id_siruta': '1222',
        'consum_mwh': '',
        'pret_furnizare_mwh': '',
        'componenta_fixa': '',
        'suplimentar_componenta_fixa': '',
        'termen_plata': '',
        'durata_contractului': '',
        'garantii': '',
        'frecventa_emitere_factura': '',
        'tip_pret': ''
    }

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
        'Referer': 'https://www.anre.ro/ro/info-consumatori/comparator-oferte-tip-de-furnizare-a-gn'
    }

    def start_requests(self):
        yield scrapy.FormRequest(
            self.url,
            formdata=self.payload,
            headers=self.headers,
            callback=self.parse
        )

    def parse(self, response):
        print(response.text)
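As a side note, the raw Form Data string from the question doesn't have to be retyped by hand; urllib can turn it straight into the dict that FormRequest expects. A small sketch:

from urllib.parse import parse_qsl

raw = 'tip_client=casnic&modalitate_racordare=sistem_de_distributie&transee_de_consum=b1&tip_pret_unitar=cu_reglementate&id_judet=ALBA&id_siruta=1222&consum_mwh=&pret_furnizare_mwh=&componenta_fixa=&suplimentar_componenta_fixa=&termen_plata=&durata_contractului=&garantii=&frecventa_emitere_factura=&tip_pret='
payload = dict(parse_qsl(raw, keep_blank_values=True))  # keep_blank_values preserves the empty fields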

Unable to grab expected result from a site issuing post requests

I'm trying to fetch some JSON response from a webpage using the script below. Here are the steps to populate the result on that site: click on the AGREE button located at the bottom of this webpage, then on the EDIT SEARCH button, and finally on the SHOW RESULTS button, without changing anything.
I've tried like this:
import requests
from bs4 import BeautifulSoup

url = 'http://finra-markets.morningstar.com/BondCenter/Results.jsp'
post_url = 'http://finra-markets.morningstar.com/bondSearch.jsp'

payload = {
    'postData': {'Keywords': []},
    'ticker': '',
    'startDate': '',
    'endDate': '',
    'showResultsAs': 'B',
    'debtOrAssetClass': '1,2',
    'spdsType': ''
}
payload_second = {
    'count': '20',
    'searchtype': 'B',
    'query': {"Keywords": [{"Name": "debtOrAssetClass", "Value": "3,6"}, {"Name": "showResultsAs", "Value": "B"}]},
    'sortfield': 'issuerName',
    'sorttype': '1',
    'start': '0',
    'curPage': '1'
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36'
    s.headers['Referer'] = 'http://finra-markets.morningstar.com/BondCenter/UserAgreement.jsp'
    r = s.post(url, json=payload)
    s.headers['Access-Control-Allow-Headers'] = r.headers['Access-Control-Allow-Headers']
    s.headers['cf-request-id'] = r.headers['cf-request-id']
    s.headers['CF-RAY'] = r.headers['CF-RAY']
    s.headers['X-Requested-With'] = 'XMLHttpRequest'
    s.headers['Origin'] = 'http://finra-markets.morningstar.com'
    s.headers['Referer'] = 'http://finra-markets.morningstar.com/BondCenter/Results.jsp'
    r = s.post(post_url, json=payload_second)
    print(r.content)
This is the result I get when I run the script above:
b'\n\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\n\n\n{}'
How can I make the script produce the expected result from that site?
P.S. I do not wish to go for selenium to get this done.
The response for http://finra-markets.morningstar.com/BondCenter/Results.jsp doesn't contain the search results. It must be fetching the data asynchronously.
An easy way to find out which network request returned the search results is to search the requests for one of the search results using Firefox's dev tools:
To convert the HTTP request to a Python request, I copied the request as cURL from Firefox, imported it into Postman and then exported it as Python code (a little long-winded (and lazy), I know!):
All this leads to the following code:
import requests
url = "http://finra-markets.morningstar.com/bondSearch.jsp"
payload = "count=20&searchtype=B&query=%7B%22Keywords%22%3A%5B%7B%22Name%22%3A%22debtOrAssetClass%22%2C%22Value%22%3A%223%2C6%22%7D%2C%7B%22Name%22%3A%22showResultsAs%22%2C%22Value%22%3A%22B%22%7D%5D%7D&sortfield=issuerName&sorttype=1&start=0&curPage=1"
headers = {
    'User-Agent': "...",
    'Accept': "text/plain, */*; q=0.01",
    'Accept-Language': "en-US,en;q=0.5",
    'Content-Type': "application/x-www-form-urlencoded; charset=UTF-8",
    'X-Requested-With': "XMLHttpRequest",
    'Origin': "http://finra-markets.morningstar.com",
    'DNT': "1",
    'Connection': "keep-alive",
    'Referer': "http://finra-markets.morningstar.com/BondCenter/Results.jsp",
    'Cookie': "...",
    'cache-control': "no-cache"
}
response = requests.request("POST", url, data=payload, headers=headers)
print(response.text)
The response wasn't 100% JSON, so I just stripped away the outer whitespace and the {B:...} wrapper:
>>> text = response.text.strip()[3:-1]
>>> import json
>>> data = json.loads(text)
>>> data['Columns'][0]
{'moodyRating': {'ratingText': '', 'ratingNumber': 0},
'fitchRating': {'ratingText': None, 'ratingNumber': None},
'standardAndPoorRating': {'ratingText': '', 'ratingNumber': 0},
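If the fixed [3:-1] slicing feels brittle, a regular expression can pull the JSON body out of the {B:...} wrapper regardless of surrounding whitespace. A sketch, assuming that wrapper is stable:

import json
import re

# capture everything between '{B:' and the final '}' (assumes the {B:...} wrapper)
match = re.search(r'\{B:(\{.*\})\}', response.text, re.DOTALL)
if match:
    data = json.loads(match.group(1))
    print(data['Columns'][0])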

Can't fetch the name of different suppliers from a webpage

I've created a script in Python using POST requests to fetch the names of different suppliers from a webpage, but unfortunately I'm getting the error AttributeError: 'NoneType' object has no attribute 'text', whereas it seemed to me that I had done things the right way.
websitelink
To populate the content, you need to click on the search button, just as shown in the image.
Here's what I've tried so far:
import requests
from bs4 import BeautifulSoup

url = "https://www.gebiz.gov.sg/ptn/supplier/directory/index.xhtml"

r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml")

payload = {
    'contentForm': 'contentForm',
    'contentForm:j_idt225_listButton2_HIDDEN-INPUT': '',
    'contentForm:j_idt161_inputText': '',
    'contentForm:j_idt164_SEARCH': '',
    'contentForm:j_idt167_selectManyMenu_SEARCH-INPUT': '',
    'contentForm:j_idt167_selectManyMenu-HIDDEN-INPUT': '',
    'contentForm:j_idt167_selectManyMenu-HIDDEN-ACTION-INPUT': '',
    'contentForm:search': 'Search',
    'contentForm:j_idt185_select': 'SUPPLIER_NAME',
    'javax.faces.ViewState': soup.select_one('[id="javax.faces.ViewState"]')['value']
}

res = requests.post(url, data=payload, headers={
    'Content-Type': 'application/x-www-form-urlencoded',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
})
sauce = BeautifulSoup(res.text, "lxml")
item = sauce.select_one(".form2_ROW").text
print(item)
Only this portion will do as well: 8121 results found.
Full traceback:
Traceback (most recent call last):
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\general_demo.py", line 27, in <module>
    item = sauce.select_one(".form2_ROW").text
AttributeError: 'NoneType' object has no attribute 'text'
You need to find a way to get the cookie. The following currently works for me across multiple requests.
import requests
from bs4 import BeautifulSoup

url = "https://www.gebiz.gov.sg/ptn/supplier/directory/index.xhtml"

headers = {
    'Content-Type': 'application/x-www-form-urlencoded',
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'https://www.gebiz.gov.sg/ptn/supplier/directory/index.xhtml',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Cookie': '__cfduid=d3fe47b7a0a7f3ef307c266817231b5881555951761; wlsessionid=pFpF87sa9OCxQhUzwQ3lXcKzo04j45DP3lIVYylizkFMuIbGi6Ka!1395223647; BIGipServerPTN2_PRD_Pool=52519072.47873.0000'
}

with requests.Session() as s:
    r = s.get(url, headers=headers)
    soup = BeautifulSoup(r.text, "lxml")
    payload = {
        'contentForm': 'contentForm',
        'contentForm:search': 'Search',
        'contentForm:j_idt185_select': 'SUPPLIER_NAME',
        'javax.faces.ViewState': soup.select_one('[id="javax.faces.ViewState"]')['value']
    }
    res = s.post(url, data=payload, headers=headers)
    sauce = BeautifulSoup(res.text, "lxml")
    item = sauce.select_one(".formOutputText_HIDDEN-LABEL.outputText_TITLE-BLACK").text
    print(item)
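If the site will issue those cookies to requests at all (the __cfduid one comes from Cloudflare, so it may not), letting the session capture them from the initial GET avoids hard-coding values that will expire. A sketch of that idea, reusing the imports and url above and untested against the live site:

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0'
    r = s.get(url)  # any cookies the server sets here are stored on the session
    soup = BeautifulSoup(r.text, "lxml")
    payload = {
        'contentForm': 'contentForm',
        'contentForm:search': 'Search',
        'contentForm:j_idt185_select': 'SUPPLIER_NAME',
        'javax.faces.ViewState': soup.select_one('[id="javax.faces.ViewState"]')['value']
    }
    res = s.post(url, data=payload)  # the stored cookies are re-sent automatically
    sauce = BeautifulSoup(res.text, "lxml")
    print(sauce.select_one(".formOutputText_HIDDEN-LABEL.outputText_TITLE-BLACK").text)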

How can I log in this specific website using Python Requests?

I am trying to log in to this website using the following request, but it doesn't work.
The cookie never contains 'userid'.
What should I change? Do I need to add headers to my POST request?
import requests
payload = {
    'ctl00$MasterMainContent$LoginCtrl$Username': 'myemail#email.com',
    'ctl00$MasterMainContent$LoginCtrl$Password': 'mypassword',
    'ctl00$MasterMainContent$LoginCtrl$cbxRememberMe': 'on',
}
with requests.Session() as s:
    login_page = s.get('http://www.bentekenergy.com/')
    response = s.post('http://benport.bentekenergy.com/Login.aspx', data=payload)
    if 'userid' in response.cookies:
        print("connected")
    else:
        print("not connected")
Edit 1 (following comments):
I am not sure what to put in the request headers; below is what I tried, but without success.
request_headers = {
    'Accept': 'image/webp,image/*,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch, br',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
    'Cookie': 'ACOOKIE=C8ctADJmMTc1YTRhLTBiMTEtNGViOC1iZjE0LTM5NTNkZDVmMDc1YwAAAAABAAAASGYBALlflFnvWZRZAQAAAABLAAC5X5RZ71mUWQAAAAA-',
    'Host': 'statse.webtrendslive.com',
    'Referer': 'https://benport.bentekenergy.com/Login.aspx',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
}
Edit 2 (following stovfl's answer):
I now use the following payload, filling each attribute with the value from the form and completing it with the username, password and rememberMe.
I also tried with the following headers in the request.
Still not connected.
payload = {
    '__VIEWSTATE': '',
    '__VIEWSTATEGENERATOR': '',
    '__PREVIOUSPAGE': '',
    '__EVENTVALIDATION': '',
    'isAuthenticated': 'False',
    'ctl00$hfAccessKey': '',
    'ctl00$hfVisibility': '',
    'ctl00$hfDateTime': '',
    'ctl00$hfHash': '',
    'ctl00$hfAnnouncementsUrl': '',
    'ctl00$MasterMainContent$LoginCtrl$Username': '',
    'ctl00$MasterMainContent$LoginCtrl$Password': '',
    'ctl00$MasterMainContent$LoginCtrl$cbxRememberMe': '',
}
request_headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.8',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Content-Length': '7522',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Cookie': '',
    'Host': 'benport.bentekenergy.com',
    'Origin': 'https://benport.bentekenergy.com',
    'Referer': 'https://benport.bentekenergy.com/Login.aspx',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
}
with requests.Session() as s:
    response = s.get('http://benport.bentekenergy.com/Login.aspx')
    soup = BeautifulSoup(response.text, "html.parser")
    if soup.find("input", {"name": "ctl00$MasterMainContent$LoginCtrl$Username"}):
        print("not connected")
    soup = BeautifulSoup(response.text, "lxml")
    for element in soup.select("input"):
        if element.get("name") in payload:
            payload[element.get("name")] = element.get("value")
    payload['ctl00$MasterMainContent$LoginCtrl$Username'] = 'myemail#email.com'
    payload['ctl00$MasterMainContent$LoginCtrl$Password'] = 'mypassword'
    payload['ctl00$MasterMainContent$LoginCtrl$cbxRememberMe'] = 'on'
    response = s.post('http://benport.bentekenergy.com/Login.aspx', data=payload, headers=request_headers)
    print(s.cookies)
    soup = BeautifulSoup(response.text, "html.parser")
    if soup.find("input", {"name": "ctl00$MasterMainContent$LoginCtrl$Username"}):
        print("not connected")
    else:
        print("connected")
s.cookies contains:
<RequestsCookieJar[<Cookie BenportState=q1k2r2eqftltjm55igy5mg55 for .bentekenergy.com/>, <Cookie RememberMe=True for .bentekenergy.com/>]>
Edit 3 (the answer!):
I added '__EVENTTARGET': '' to the payload and filled it with the value 'ctl00$MasterMainContent$LoginCtrl$btnSignIn'.
Now I am connected!
NB: the headers were not necessary, just the payload.
Comment: ... found that there is a parameter '__EVENTTARGET' that was not in the payload. It needed to contain 'ctl00$MasterMainContent$LoginCtrl$btnSignIn'. Now I am connected!
Yes, I overlooked the Submit button; there is a JavaScript handler:
href="javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions("ctl00$headerLoginCtrl$btnSignIn",
Relevant: SO answer How To see POST Data
Comment: ... based on your answer (Edit 2). Still not connected.
You are using http instead of https; the request will be auto-redirected to https.
The <RequestsCookieJar> has changed, so some progress.
I'm still unsure about your authenticated check: if soup.find("input", {"name"....
Have you checked the page content? Any error message?
Don't use BeautifulSoup(...); your subsequent requests should use Session s to reuse the assigned cookie, e.g. response = s.get('<url to some restricted page>').
Try request_headers with only 'User-Agent'.
Analysis of the <form>:
Login URL: https://benport.bentekenergy.com/Login.aspx
Form: action: /Login.aspx, method: post
A non-empty value means it is pre-set by the login page.
1:input type:hidden value:/wEPDwUKLT... id:__VIEWSTATE
2:input type:hidden value:0BA31D5D id:__VIEWSTATEGENERATOR
3:input type:hidden value:2gILTn0H1S... id:__PREVIOUSPAGE
4:input type:hidden value:/wEWDAKIr6... id:__EVENTVALIDATION
5:input type:hidden value:False id:isAuthenticated
6:input type:hidden value:nu66O9eqvE id:ctl00_hfAccessKey
7:input type:hidden value:public id:ctl00_hfVisibility
8:input type:hidden value:08%2F16%2F... id:ctl00_hfDateTime
9:input type:hidden value:3AB353573D... id:ctl00_hfHash
10:input type:hidden value://announce... id:ctl00_hfAnnouncementsUrl
11:input type:text value:empty id:ctl00_MasterMainContent_LoginCtrl_Username
12:input type:password value:empty id:ctl00_MasterMainContent_LoginCtrl_Password
13:input type:checkbox value:empty id:ctl00_MasterMainContent_LoginCtrl_cbxRememberMe
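Putting the thread together: a generic ASP.NET WebForms login usually means fetching the login page, copying every pre-set hidden input into the payload, adding __EVENTTARGET for the submit button, and posting over https. A consolidated sketch (untested; field names are taken from the analysis above):

import requests
from bs4 import BeautifulSoup

login_url = 'https://benport.bentekenergy.com/Login.aspx'

with requests.Session() as s:
    soup = BeautifulSoup(s.get(login_url).text, 'html.parser')
    # start from every named input the form already carries (__VIEWSTATE, __EVENTVALIDATION, ...)
    payload = {tag['name']: tag.get('value', '') for tag in soup.find_all('input', attrs={'name': True})}
    payload['__EVENTTARGET'] = 'ctl00$MasterMainContent$LoginCtrl$btnSignIn'  # the missing piece from Edit 3
    payload['ctl00$MasterMainContent$LoginCtrl$Username'] = 'myemail#email.com'
    payload['ctl00$MasterMainContent$LoginCtrl$Password'] = 'mypassword'
    payload['ctl00$MasterMainContent$LoginCtrl$cbxRememberMe'] = 'on'
    response = s.post(login_url, data=payload)
    print(s.cookies)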

Extracting second list from nested list returned by XHR request

I am using the below code to return data from a website by copying an XHR request that is submitted to it:
import requests
url = 'http://www.whoscored.com/stageplayerstatfeed/-1/Overall'
params = {
    'field': '0',
    'isAscending': 'false',
    'isMoreThanAvgApps': 'true',
    'isOverall': 'false',
    'numberOfPlayersToPick': '20',
    'orderBy': 'Rating',
    'page': '1',
    'stageId': '9155',
    'teamId': '-1'
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
    'Host': 'www.whoscored.com',
    'Referer': 'http://www.whoscored.com/'
}
responser = requests.get(url, params=params, headers=headers)
responser = responser.text
responser = responser.encode('cp1252')
print responser
This returns a nested list. The first element is a simple list, whilst the second is a list of dictionaries. I want to return the second list.
I have tried amending the last line of my code to print responser[1], however for some reason this just prints a [.
Can anyone see why this is not returning what I require?
Thanks
The responser variable contains a JSON string. That means that when you get responser[1], you are basically getting the second character of the string, which is [.
Load the JSON string into a Python list. The easiest way is to use the .json() method that the requests module provides:
responser = requests.get(url, params=params, headers=headers)
data = responser.json()
print data[1]
Because you're turning the request response into text. So this line:
responser = responser.text
should be:
responser = responser.json()
And then you can print:
print responser[1]
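Once responser holds the parsed list, pulling fields out of the dictionaries in responser[1] is plain Python. A small sketch; 'Name' is a guess at the feed's key names, so inspect responser[1][0].keys() first:

for player in responser[1]:
    print(player.get('Name'))  # 'Name' is an assumed key; check the real ones first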
