Unable to grab expected result from a site issuing POST requests - Python

I'm trying to fetch some JSON response from a webpage using the script below. These are the steps to populate the result on that site: click the AGREE button at the bottom of the webpage, then the EDIT SEARCH button, and finally the SHOW RESULTS button, without changing anything.
This is what I've tried:
import requests
from bs4 import BeautifulSoup

url = 'http://finra-markets.morningstar.com/BondCenter/Results.jsp'
post_url = 'http://finra-markets.morningstar.com/bondSearch.jsp'

payload = {
    'postData': {'Keywords': []},
    'ticker': '',
    'startDate': '',
    'endDate': '',
    'showResultsAs': 'B',
    'debtOrAssetClass': '1,2',
    'spdsType': ''
}

payload_second = {
    'count': '20',
    'searchtype': 'B',
    'query': {"Keywords": [{"Name": "debtOrAssetClass", "Value": "3,6"}, {"Name": "showResultsAs", "Value": "B"}]},
    'sortfield': 'issuerName',
    'sorttype': '1',
    'start': '0',
    'curPage': '1'
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36'
    s.headers['Referer'] = 'http://finra-markets.morningstar.com/BondCenter/UserAgreement.jsp'
    r = s.post(url, json=payload)
    s.headers['Access-Control-Allow-Headers'] = r.headers['Access-Control-Allow-Headers']
    s.headers['cf-request-id'] = r.headers['cf-request-id']
    s.headers['CF-RAY'] = r.headers['CF-RAY']
    s.headers['X-Requested-With'] = 'XMLHttpRequest'
    s.headers['Origin'] = 'http://finra-markets.morningstar.com'
    s.headers['Referer'] = 'http://finra-markets.morningstar.com/BondCenter/Results.jsp'
    r = s.post(post_url, json=payload_second)
    print(r.content)
This is the result I get when I run the script above:
b'\n\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\n\n\n{}'
How can I make the script produce the expected result from that site?
P.S. I do not wish to use Selenium to get this done.

The response for http://finra-markets.morningstar.com/BondCenter/Results.jsp doesn't contain the search results. It must be fetching the data asynchronously.
An easy way to find out which network request returned the search results is to search the requests for one of the search results using Firefox's Dev Tools.
To convert the HTTP request to a Python request, I copy the request as cURL from Firefox, import it into Postman and then export it as Python code (a little long-winded (and lazy), I know!).
All this leads to the following code:
import requests

url = "http://finra-markets.morningstar.com/bondSearch.jsp"

payload = "count=20&searchtype=B&query=%7B%22Keywords%22%3A%5B%7B%22Name%22%3A%22debtOrAssetClass%22%2C%22Value%22%3A%223%2C6%22%7D%2C%7B%22Name%22%3A%22showResultsAs%22%2C%22Value%22%3A%22B%22%7D%5D%7D&sortfield=issuerName&sorttype=1&start=0&curPage=1"

headers = {
    'User-Agent': "...",
    'Accept': "text/plain, */*; q=0.01",
    'Accept-Language': "en-US,en;q=0.5",
    'Content-Type': "application/x-www-form-urlencoded; charset=UTF-8",
    'X-Requested-With': "XMLHttpRequest",
    'Origin': "http://finra-markets.morningstar.com",
    'DNT': "1",
    'Connection': "keep-alive",
    'Referer': "http://finra-markets.morningstar.com/BondCenter/Results.jsp",
    'Cookie': "...",
    'cache-control': "no-cache"
}

response = requests.request("POST", url, data=payload, headers=headers)
print(response.text)
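Rather than hard-coding the copied URL-encoded string, the same form body can be built from a dict. A minimal sketch, equivalent to the payload string above (separators=(',', ':') reproduces the compact JSON in the copied request):

import json
from urllib.parse import urlencode

query = {"Keywords": [{"Name": "debtOrAssetClass", "Value": "3,6"},
                      {"Name": "showResultsAs", "Value": "B"}]}
# urlencode percent-escapes the JSON exactly like the browser did
payload = urlencode({
    'count': '20',
    'searchtype': 'B',
    'query': json.dumps(query, separators=(',', ':')),
    'sortfield': 'issuerName',
    'sorttype': '1',
    'start': '0',
    'curPage': '1'
})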
The response wasn't 100% JSON. So I just stripped away the outer whitespace and {B:..} part:
>>> text = response.text.strip()[3:-1]
>>> import json
>>> data = json.loads(text)
>>> data['Columns'][0]
{'moodyRating': {'ratingText': '', 'ratingNumber': 0},
'fitchRating': {'ratingText': None, 'ratingNumber': None},
'standardAndPoorRating': {'ratingText': '', 'ratingNumber': 0},
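The fixed [3:-1] slice assumes the wrapper is exactly {B:...}. A slightly more defensive variant under the same assumption, locating the delimiters instead of hard-coding offsets:

import json

text = response.text.strip()
# Assumes the envelope looks like {B:{...json...}}: cut between the first
# colon and the last closing brace rather than relying on fixed offsets.
inner = text[text.index(':') + 1 : text.rindex('}')]
data = json.loads(inner)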

Related

How can I add threads to this Python code to make multiple requests

I am creating a custom tool for login brute-forcing on a web application for bug bounty hunting. I came across a bug on one web application for which I had to create my own tool. This is not a complete tool, but I need a solution for adding threads to the current code.
import requests
import re

exploit = open('password.txt', 'r').readlines()

headers = {
    'Host': 'TARGET.COM',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Connection': 'close',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'iframe',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-User': '?1'
}

for line in exploit:
    params = {
        'execution': '111u9342',
        'client_id': 'client-23429df',
        'tab_id': '234324',
    }
    password = line.strip()
    http = requests.post('https://www.target.com/test',
                         params=params,
                         headers=headers,
                         data={'username': myname, 'password': password},
                         verify=False,
                         proxies=proxies)
    content = http.content
print("finished")
I am a beginner in Python.
You can use ThreadPoolExecutor:
from concurrent.futures import ThreadPoolExecutor
import requests

# ....Other code parts...

def base_post(url, headers, data, params=None, verify=True, proxies=None, timeout=10):
    # A single POST attempt; this is what each worker thread runs.
    response = requests.post(url, params=params, headers=headers, data=data,
                             verify=verify, proxies=proxies, timeout=timeout)
    return response

total_possibilities = []
exploit = open('password.txt', 'r').readlines()
for line in exploit:
    params = {
        'execution': '111u9342',
        'client_id': 'client-23429df',
        'tab_id': '234324',
    }
    password = line.strip()
    total_possibilities.append({'url': "...",
                                'params': params,
                                'headers': headers,
                                'data': {'username': myname, 'password': password},
                                'verify': False,
                                'proxies': proxies})

results = []
with ThreadPoolExecutor(max_workers=3) as executor:
    for row in total_possibilities:
        results.append(executor.submit(base_post, **row))
print(results)
Don't forget to update "max_workers" based on your needs.
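Note that executor.submit returns Future objects, so results above is a list of futures rather than responses. A minimal sketch (reusing base_post and total_possibilities from above) for collecting the actual responses as they finish:

import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(base_post, **row) for row in total_possibilities]
    for future in as_completed(futures):
        try:
            response = future.result()  # re-raises any exception from the worker
            print(response.status_code, len(response.content))
        except requests.RequestException as exc:
            print("request failed:", exc)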

Can't scrape names from a table using requests

I'm trying to scrape names from a table that gets populated upon selecting some options on a webpage. These are the options to generate the table on that site. However, when I try to do the same using the script below, I always get status 500. I did succeed using Selenium, so I'm not after any Selenium-based solution.
webpage address
This is what I've written so far:
import requests
from bs4 import BeautifulSoup

link = 'https://rederef-saude.appspot.com/proximidade/prestador/buscar?'

params = {
    'canal': '1',
    'latitude': '-23.5505199',
    'longitude': '-46.63330939999999',
    'categoria': '1',
    'produto': '557',
    'plano': '18051',
    'nome': '',
    'qualificacoes': '',
    'prefixoEmpresa': '',
    'empresa': '',
    'especialidade': '',
    'procedimento': '',
    'tipoPesquisaProcedimento': '1',
    'raio': '200000'
}

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36',
    'referer': 'https://rederef-saude.appspot.com/rederef/buscaPrestadores?login=publico&canal=1&data=23/04/2021&hora=00:16:55&tipoProduto=M&produto=557&plano=18051',
    'x-requested-with': 'XMLHttpRequest',
    'content-type': 'application/json;charset=utf-8',
    'accept': '*/*',
    'captcha-token': ''
}

with requests.Session() as s:
    s.headers = headers
    res = s.get(link, params=params)
    print(res.status_code)
How can I scrape the names from that table using requests?
When checking the requests, the site does have reCAPTCHA protection; it's just in invisible mode, which only shows up if required. You would also need the cookies from the first call, or else it will give an invalid session error. For automated reCAPTCHA solving (paid but pretty cheap) there are various providers, e.g. 2Captcha, Anti-Captcha, CapMonster, etc.
I have used 2Captcha for it; you can follow the code below.
Reference to 2Captcha Library
import requests
from twocaptcha import TwoCaptcha

solver = TwoCaptcha('Twocaptcha API KEY')
result = solver.recaptcha(sitekey='6LdOzkUaAAAAANV9z7gQOokV2kNWI8WI_eSH80vC',
                          url='https://rederef-saude.appspot.com', invisible=1)["code"]

params = {
    'login': 'publico',
    'canal': 1,
    'data': '23/04/2021',
    'hora': '04:10:41',
    'tipoProduto': 'M',
    'produto': 557,
    'plano': 18051
}
cookies = requests.get(
    'https://rederef-saude.appspot.com/rederef/buscaPrestadores', params=params).cookies

headers = {
    'captcha-token': result,
}

params = {
    'canal': '1',
    'latitude': '-23.5505199',
    'longitude': '-46.63330939999999',
    'categoria': '1',
    'produto': '557',
    'plano': '18051',
    'nome': '',
    'qualificacoes': '',
    'prefixoEmpresa': '',
    'empresa': '',
    'especialidade': '',
    'procedimento': '',
    'tipoPesquisaProcedimento': '1',
    'raio': '200000'
}

response = requests.get('https://rederef-saude.appspot.com/proximidade/prestador/buscar',
                        headers=headers, params=params, cookies=cookies)
print(response.json())
Don't forget to update the 2Captcha API key in the code above.
You could use any provider you feel okay with.
The output is JSON, so there is no need to use BeautifulSoup.
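For example, to pull the names out of the JSON you could iterate over it; note that the 'prestadores' and 'nome' keys below are hypothetical, so inspect the actual response first to see where the names really live:

data = response.json()
# NOTE: hypothetical keys; adjust after inspecting the real JSON structure.
for prestador in data.get('prestadores', []):
    print(prestador.get('nome'))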
Let me know if you have any questions :)

Script gets stuck while sending post requests with parameters

I'm trying to populate a JSON response by issuing a POST request with appropriate parameters to a webpage. When I run the script, it gets stuck and doesn't produce any result. It doesn't throw any error either. This is the site link. I chose three options from the three dropdowns in this form on that site before hitting the Get times & tickets button.
I've tried with:
import requests
from bs4 import BeautifulSoup

url = 'https://www.thetrainline.com/'
link = 'https://www.thetrainline.com/api/journey-search/'

payload = {
    "passengers": [{"dateOfBirth": "1991-01-31"}],
    "isEurope": False,
    "cards": [],
    "transitDefinitions": [{
        "direction": "outward",
        "origin": "1f06fc66ccd7ea92ae4b0a550e4ddfd1",
        "destination": "7c25e933fd14386745a7f49423969308",
        "journeyDate": {"type": "departAfter", "time": "2021-02-11T22:45:00"}
    }],
    "type": "single",
    "maximumJourneys": 4,
    "includeRealtime": True,
    "applyFareDiscounts": True
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
    s.headers['content-type'] = 'application/json'
    s.headers['accept'] = 'application/json'
    r = s.post(link, json=payload)
    print(r.status_code)
    print(r.json())
How can I get a JSON response issuing a POST request with parameters to that site?
You are missing the required headers: x-version and referer. The referer header refers to the search form, and you can build it. Before journey-search, you have to post an availability request.
import requests
from requests.models import PreparedRequest

headers = {
    'authority': 'www.thetrainline.com',
    'pragma': 'no-cache',
    'cache-control': 'no-cache',
    'x-version': '2.0.18186',
    'dnt': '1',
    'accept-language': 'en-GB',
    'sec-ch-ua-mobile': '?0',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1_0) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/88.0.4324.96 Safari/537.36',
    'content-type': 'application/json',
    'accept': 'application/json',
    'origin': 'https://www.thetrainline.com',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'cors',
    'sec-fetch-dest': 'empty',
}

with requests.Session() as s:
    origin = "6e2242b3f38bbbd8d8124e1d84d319e1"
    destination = "15bcf02bc44ea754837c8cf14569f608"
    localDateTime = "2021-02-03T19:30:00"
    dateOfBirth = "1991-02-03"
    passenger_type = "single"

    # Build the referer header from the search form URL
    req = PreparedRequest()
    params = {
        "origin": origin,
        "destination": destination,
        "outwardDate": localDateTime,
        "outwardDateType": "departAfter",
        "journeySearchType": passenger_type,
        "passengers[]": dateOfBirth
    }
    req.prepare_url("https://www.thetrainline.com/book/results", params)
    headers.update({"referer": req.url})
    s.headers = headers

    # First, post the availability request
    payload_availability = {
        "origin": origin,
        "destination": destination,
        "outwardDefinition": {
            "localDateTime": localDateTime,
            "searchMethod": "DEPARTAFTER"
        },
        "passengerBirthDates": [{
            "id": "PASSENGER-0",
            "dateOfBirth": dateOfBirth
        }],
        "maximumNumberOfJourneys": 4,
        "discountCards": []
    }
    r = s.post('https://www.thetrainline.com/api/coaches/availability', json=payload_availability)
    r.raise_for_status()

    # Then the journey search itself
    payload_search = {
        "passengers": [{"dateOfBirth": dateOfBirth}],
        "isEurope": False,
        "cards": [],
        "transitDefinitions": [{
            "direction": "outward",
            "origin": origin,
            "destination": destination,
            "journeyDate": {
                "type": "departAfter",
                "time": localDateTime}
        }],
        "type": passenger_type,
        "maximumJourneys": 4,
        "includeRealtime": True,
        "applyFareDiscounts": True
    }
    r = s.post('https://www.thetrainline.com/api/journey-search/', json=payload_search)
    r.raise_for_status()
    print(r.json())
As in Sers's reply, the headers are missing.
When scraping websites, you have to keep the anti-scraping mechanisms in mind. The website may block your requests by taking into consideration your IP address, request headers, cookies, and various other factors.
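As a generic illustration (not specific to this site), a requests.Session keeps cookies across calls and lets you attach browser-like headers to every request:

import requests

session = requests.Session()
# Browser-like headers make the traffic look less like a bare script.
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Accept': 'application/json',
    'Accept-Language': 'en-US,en;q=0.9',
})

# The first request stores any cookies the server sets; subsequent
# requests through the same session send them back automatically.
session.get('https://www.thetrainline.com/')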

Can't fetch the name of different suppliers from a webpage

I've created a script in Python that uses POST requests to fetch the names of different suppliers from a webpage, but unfortunately I'm getting the error AttributeError: 'NoneType' object has no attribute 'text', even though it seemed to me that I did things the right way.
websitelink
To populate the content, it is required to click on the search button first.
This is what I've tried so far:
import requests
from bs4 import BeautifulSoup

url = "https://www.gebiz.gov.sg/ptn/supplier/directory/index.xhtml"

r = requests.get(url)
soup = BeautifulSoup(r.text, "lxml")

payload = {
    'contentForm': 'contentForm',
    'contentForm:j_idt225_listButton2_HIDDEN-INPUT': '',
    'contentForm:j_idt161_inputText': '',
    'contentForm:j_idt164_SEARCH': '',
    'contentForm:j_idt167_selectManyMenu_SEARCH-INPUT': '',
    'contentForm:j_idt167_selectManyMenu-HIDDEN-INPUT': '',
    'contentForm:j_idt167_selectManyMenu-HIDDEN-ACTION-INPUT': '',
    'contentForm:search': 'Search',
    'contentForm:j_idt185_select': 'SUPPLIER_NAME',
    'javax.faces.ViewState': soup.select_one('[id="javax.faces.ViewState"]')['value']
}

res = requests.post(url, data=payload, headers={
    'Content-Type': 'application/x-www-form-urlencoded',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
})
sauce = BeautifulSoup(res.text, "lxml")
item = sauce.select_one(".form2_ROW").text
print(item)
Even just this portion would do: 8121 results found.
Full traceback:
Traceback (most recent call last):
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\general_demo.py", line 27, in <module>
    item = sauce.select_one(".form2_ROW").text
AttributeError: 'NoneType' object has no attribute 'text'
You need to find a way to get the cookie. The following currently works for me across multiple requests.
import requests
from bs4 import BeautifulSoup

url = "https://www.gebiz.gov.sg/ptn/supplier/directory/index.xhtml"

headers = {
    'Content-Type': 'application/x-www-form-urlencoded',
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'https://www.gebiz.gov.sg/ptn/supplier/directory/index.xhtml',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Cookie': '__cfduid=d3fe47b7a0a7f3ef307c266817231b5881555951761; wlsessionid=pFpF87sa9OCxQhUzwQ3lXcKzo04j45DP3lIVYylizkFMuIbGi6Ka!1395223647; BIGipServerPTN2_PRD_Pool=52519072.47873.0000'
}

with requests.Session() as s:
    r = s.get(url, headers=headers)
    soup = BeautifulSoup(r.text, "lxml")
    payload = {
        'contentForm': 'contentForm',
        'contentForm:search': 'Search',
        'contentForm:j_idt185_select': 'SUPPLIER_NAME',
        'javax.faces.ViewState': soup.select_one('[id="javax.faces.ViewState"]')['value']
    }
    res = s.post(url, data=payload, headers=headers)
    sauce = BeautifulSoup(res.text, "lxml")
    item = sauce.select_one(".formOutputText_HIDDEN-LABEL.outputText_TITLE-BLACK").text
    print(item)

Extracting second list from nested list returned by XHR request

I am using the code below to return data from a website by copying an XHR request that is submitted to it:
import requests

url = 'http://www.whoscored.com/stageplayerstatfeed/-1/Overall'

params = {
    'field': '0',
    'isAscending': 'false',
    'isMoreThanAvgApps': 'true',
    'isOverall': 'false',
    'numberOfPlayersToPick': '20',
    'orderBy': 'Rating',
    'page': '1',
    'stageId': '9155',
    'teamId': '-1'
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
    'Host': 'www.whoscored.com',
    'Referer': 'http://www.whoscored.com/'
}

responser = requests.get(url, params=params, headers=headers)
responser = responser.text
responser = responser.encode('cp1252')
print(responser)
This returns a pair of nested lists: the first is a simple list, whilst the second is a list of dictionaries. I want to return the second list.
I have tried amending the last line of my code to print(responser[1]), however for some reason this just prints a [.
Can anyone see why this is not returning what I require?
Thanks
The responser variable contains a JSON string. That means that when you get responser[1], you basically get the second character of the string, which is [.
Load the JSON string into a Python list. The easiest way is to use the .json() method that the requests module provides:
responser = requests.get(url, params=params, headers=headers)
data = responser.json()
print(data[1])
Because you're turning the request response into text. So this line:
responser = responser.text
should be:
responser = responser.json()
And then you can print:
print(responser[1])
