I am trying to scrape this website: https://ssweb.seap.minhap.es/portalEELL/consulta_alcaldes
When you choose Alicante from the first menu and then Ayuntamiento de Abengibre from the second you will see a table with results. This is what I want.
I saw in Chrome Console that choosing the values in drop-downs generates a POST request. So I thought it would be straight-forward to obtain that with requests.post
params = {
"consulta_alcalde[_csrf_token]":"dd1546dd35bf0f1af4a1f3aac165a1b5",
"consulta_alcalde[id_provincia]":"2",
"consulta_alcalde[id_entidad]":"17926"
}
r = requests.post("https://ssweb.seap.minhap.es/portalEELL/consulta_alcaldes", params)
But then when I check what r.text contains I get 200 response but can't see my data from the table. What am I doing wrong?
I am aware it can be done with Selenium but I am trying to avoid it as it's very slow.
EDIT:
As per Brian's suggestion I have modified my code as:
params = {
"consulta_alcalde[_csrf_token]":"dd1546dd35bf0f1af4a1f3aac165a1b5",
"consulta_alcalde[id_provincia]":"2",
"consulta_alcalde[id_entidad]":"17951",
"User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"
}
with requests.Session() as s:
s.get("https://ssweb.seap.minhap.es/portalEELL/consulta_alcaldes")
r = s.post("https://ssweb.seap.minhap.es/portalEELL/consulta_alcaldes", data=params)
But still no luck...
The "csrf_token" is not static, you'll have to parse the page with bs4 to get it.
Also the site provides content via xhr request, so you need to have "XMLHttpRequest" in the headers. Code:
url = 'https://ssweb.seap.minhap.es/portalEELL/consulta_alcaldes'
s = requests.Session()
r = s.get(url, verify=False)
soup = BeautifulSoup(r.content, 'html.parser')
csrf_token = soup.find('input', id="consulta_alcalde__csrf_token")['value']
data = {
"consulta_alcalde[_csrf_token]":csrf_token,
"consulta_alcalde[id_provincia]":"2",
"consulta_alcalde[id_entidad]":"17951"
}
headers = {"X-Requested-With":"XMLHttpRequest"}
r = s.post(url, data=data, headers=headers, verify=False)
print(r.content)
With a post request, the payload should be the body of the request. To do this make pass the params using the data keyword argument.
requests.post(url, data=payload)
If the post requires json, then you can either use json.dumps or simply pass the payload to the json keyword argument instead.
requests.post(url, json=payload)
Related
I am trying to get the listing data on this page https://stashh.io/collection/secret1f7mahjdux4hldnn6nqc8vnu0h5466ks9u8fwg3?sort=sold_date+desc using web scraping
Because the data is being loaded with Javascript, I can't use something like requests and BeautifulSoup. I checked the network tab to see how the request are being sent. I found that to get the data, I need to get the sid to make further request I can get the sid with the code below
def get_sid():
url = "https://stashh.io/socket.io/?EIO=4&transport=polling&t=NyPfiJ-"
response = requests.get(url)
response.raise_for_status()
text = response.text[1:]
data = {"data": ast.literal_eval(text)}
return data["data"]["sid"]
Then use the SID to send a request to this endpoint which gets the data using the code below
def get_listings():
sid = get_sid()
headers = {
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36"
}
url = f"https://stashh.io/socket.io/?EIO=4&transport=polling&sid={sid}"
response = requests.get(url, headers=headers)
response.raise_for_status()
print(response.content)
return response.json()
I am getting this b'2' as a response instead of this
434[{"nfts":[{"_id":"61ffffd9aa7f94f21e7262c0","collection":"secret1f7mahjdux4hldnn6nqc8vnu0h5466ks9u8fwg3","id":"354","fullid":"secret1f7mahjdux4hldnn6nqc8vnu0h5466ks9u8fwg3_354","name":"Amalia","thumbnail":[{"authentication":{"key":"","user":""},"file_type":"image","extension":"png","url":"https://arweave.net/7pVsbsC2M6uVDMHaVxds-oZkDNajhsrIkKEDT-vfkM8/public_image.png"}],"created_at":1644080437,"royalties_decimal_rate":3,"royalties":[{"recipient":null,"rate":20},{"recipient":null,"rate":15},{"recipient":null,"rate":15}],"isTemplate":false,"mint_on_demand":{"serial":null,"quantity":null,"version":null,"from_template":""},"template":{},"likes":[{"from":"secret19k85udnt8mzxlt3tx0gk29thgnszyjcxe8vrkt","timestamp":1644543830855}],"listing"...
I resort to using selenium to get the data, it works but it's quite slow.
Is there a way I can get this data without using selenium?
I try to login to the member area of the following website :
https://trader.degiro.nl/
Unfortunately, I tried many way without success.
The post form since to be a json it's the reason why I sent a json instead of the post data
import requests
session = requests.Session()
data = {"username":"test", "password":"test", "isRedirectToMobile": "false", "loginButtonUniversal": ""}
url = "https://trader.degiro.nl/login/#/login"
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.62 Safari/537.36'}
r = session.post(url, headers=headers, json={'json_payload': data})
Does any one have a idea why it doesn't work ?
Looking at the request my browser sends, the code should be:
url = "https://trader.degiro.nl/login/secure/login"
...
r = session.post(url, headers=headers, json=data)
That is, there's no need to wrap the data in json_payload and the url is slightly different to the one for viewing the login page.
I have tried logging into GitHub using the following code:
url = 'https://github.com/login'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
'login':'username',
'password':'password',
'authenticity_token':'Token that keeps changing',
'commit':'Sign in',
'utf8':'%E2%9C%93'
}
res = requests.post(url)
print(res.text)
Now, res.text prints the code of login page. I understand that it maybe because the token keeps changing continuously. I have also tried setting the URL to https://github.com/session but that does not work either.
Can anyone tell me a way to generate the token. I am looking for a way to login without using the API. I had asked another question where I mentioned that I was unable to login. One comment said that I am not doing it right and it is possible to login just by using the requests module without the help of Github API.
ME:
So, can I log in to Facebook or Github using the POST method? I have tried that and it did not work.
THE USER:
Well, presumably you did something wrong
Can anyone please tell me what I did wrong?
After the suggestion about using sessions, I have updated my code:
s = requests.Session()
headers = {Same as above}
s.put('https://github.com/session', headers=headers)
r = s.get('https://github.com/')
print(r.text)
I still can't get past the login page.
I think you get back to the login page because you are redirected and since your code doesn't send back your cookies, you can't have a session.
You are looking for session persistance, requests provides it :
Session Objects The Session object allows you to persist certain
parameters across requests. It also persists cookies across all
requests made from the Session instance, and will use urllib3's
connection pooling. So if you're making several requests to the same
host, the underlying TCP connection will be reused, which can result
in a significant performance increase (see HTTP persistent
connection).
s = requests.Session()
s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
r = s.get('http://httpbin.org/cookies')
print(r.text)
# '{"cookies": {"sessioncookie": "123456789"}}'
http://docs.python-requests.org/en/master/user/advanced/
Actually in post method the request parameters should be in request body, not in header.So the login data should be in data parameter.
For github, authenticity token is present in value attribute of an input tag which is extracted using BeautifulSoup library.
This code works fine
import requests
from getpass import getpass
from bs4 import BeautifulSoup
headers = {
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
login_data = {
'commit': 'Sign in',
'utf8': '%E2%9C%93',
'login': input('Username: '),
'password': getpass()
}
url = 'https://github.com/session'
session = requests.Session()
response = session.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html5lib')
login_data['authenticity_token'] = soup.find(
'input', attrs={'name': 'authenticity_token'})['value']
response = session.post(url, data=login_data, headers=headers)
print(response.status_code)
response = session.get('https://github.com', headers=headers)
print(response.text)
This code works perfectly
headers = {
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
login_data = {
'commit': 'Sign in',
'utf8': '%E2%9C%93',
'login': 'your-username',
'password': 'your-password'
}
with requests.Session() as s:
url = "https://github.com/session"
r = s.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'html5lib')
login_data['authenticity_token'] = soup.find('input', attrs={'name': 'authenticity_token'})['value']
r = s.post(url, data=login_data, headers=headers)
You can also try using the PyGitHub API to perform common git tasks.
Check the link below:
https://github.com/PyGithub/PyGithub
import requests
headers ={
"Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Encoding":"gzip, deflate",
"Accept-Language":"en-US,en;q=0.5",
"Connection":"keep-alive",
"Host":"mcfbd.com",
"Referer":"https://mcfbd.com/mcf/FrmView_PropertyTaxStatus.aspx",
"User-Agent":"Mozilla/5.0(Windows NT 10.0; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0"}
a = requests.session()
soup = BeautifulSoup(a.get("https://mcfbd.com/mcf/FrmView_PropertyTaxStatus.aspx").content)
payload = {"ctl00$ContentPlaceHolder1$txtSearchHouse":"",
"ctl00$ContentPlaceHolder1$txtSearchSector":"",
"ctl00$ContentPlaceHolder1$txtPropertyID":"",
"ctl00$ContentPlaceHolder1$txtownername":"",
"ctl00$ContentPlaceHolder1$ddlZone":"1",
"ctl00$ContentPlaceHolder1$ddlSector":"2",
"ctl00$ContentPlaceHolder1$ddlBlock":"2",
"ctl00$ContentPlaceHolder1$btnFind":"Search",
"__VIEWSTATE":soup.find('input',{'id':'__VIEWSTATE'})["value"],
"__VIEWSTATEGENERATOR":"14039419",
"__EVENTVALIDATION":soup.find("input",{"name":"__EVENTVALIDATION"})["value"],
"__SCROLLPOSITIONX":"0",
"__SCROLLPOSITIONY":"0"}
b = a.post("https://mcfbd.com/mcf/FrmView_PropertyTaxStatus.aspx",headers = headers,data = payload).text
print(b)
above is my code for this website.
https://mcfbd.com/mcf/FrmView_PropertyTaxStatus.aspx
I checked firebug out and these are the values of the form data.
however doing this:
b = requests.post("https://mcfbd.com/mcf/FrmView_PropertyTaxStatus.aspx",headers = headers,data = payload).text
print(b)
throws this error:
[ArgumentException]: Invalid postback or callback argument
is my understanding of submitting forms via request correct?
1.open firebug
2.submit form
3.go to the NET tab
4.on the NET tab choose the post tab
5.copy form data like the code above
I've always wanted to know how to do this. I could use selenium but I thought I'd try something new and use requests
The error you are receiving is correct because the fields like _VIEWSTATE (and others as well) are not static or hardcoded. The proper way to do this is as follows:
Create a Requests Session object. Also, it is advisable to update it with headers containing USER-AGENT string -
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36",}`
s = requests.session()
Navigate to the specified url -
r = s.get(url)
Use BeautifulSoup4 to parse the html returned -
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.content, 'html5lib')
Populate formdata with the hardcoded values and dynamic values -
formdata = {
'__VIEWSTATE': soup.find('input', attrs={'name': '__VIEWSTATE'})['value'],
'field1': 'value1'
}
Then send the POST request using the session object itself -
s.post(url, data=formdata)
You are seeing this message because this HTTPS site requires a 'Referer
header' to be sent by your Web browser, but none was sent. This header is
required for security reasons, to ensure that your browser is not being
hijacked by third parties.
I was trying to login to a website using requests but received the error above, how do I create a 'Referer
header'?
payload = {'inUserName': 'xxx.com', 'inUserPass': 'xxxxxx'}
url = 'https:xxxxxx'
req=requests.post(url, data=payload)
print(req.text)
You can pass in headers you want to send on your request as a keyword argument to request.post:
payload = {'inUserName': 'xxx.com', 'inUserPass': 'xxxxxx'}
url = 'https:xxxxxx'
req=requests.post(url, data=payload, headers={'Referer': 'yourReferer')
print(req.text)
I guess you are using this library: http://docs.python-requests.org/en/latest/user/quickstart/
If this is the case you have to add a custom header Referer (see section Custom headers). The code would be something like this:
url = '...'
payload = ...
headers = {'Referer': 'https://...'}
r = requests.post(url, data=payload, headers=headers)
For more information on the referer see this wikipedia article: https://en.wikipedia.org/wiki/Referer
I was getting the Same error in Chrome. What I did was just disabled all my chrome extensions including ad blockers. Here after I reloaded the page from where i wanted to scrape the data and logged in once again and then in the code as #Stephan Kulla mentioned you need to add headers inside headers i added user agent, referer, referrer-policy, origin. all these you can get in from inspect sample where you will find a Network part..
add all those in header and try to login again using post it should work.(It worked for me)
ori = 'https:......'
login_route = 'login/....'
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36' , 'origin':'https://www.screener.in', 'referer': '/login/','referrer-policy':'same-origin'}
s=requests.session()
csrf = s.get(ori+login_route).cookies['csrftoken']
payload = {
'username': 'xxxxxx',
'password': 'yyyyyyy',
'csrfmiddlewaretoken': csrf
}
login_req = s.post(ori+login_route,headers=header,data=payload)