I'm trying to use Python requests to get the results of this target URL. As you can see, the page updates via JavaScript when you press the "Consultar" button (leaving the fields empty), so a plain POST request is not working.
I'm trying this code right here:
import requests

URL = "https://www.cmfchile.cl/institucional/mercados/entidad.php?mercado=V&rut=61808000&tipoentidad=RVEMI&control=svs&pestania=25"
headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder; my real headers are longer
page = requests.post(URL, headers=headers)
print(page.text)
Does anyone know any other way or how I could solve this?
This works:
import requests

url = "https://www.cmfchile.cl/institucional/mercados/entidad.php?mercado=V&rut=61808000&tipoentidad=RVEMI&control=svs&pestania=25"

# Form body captured from the browser request (already URL-encoded)
payload = 'dd=%23%23%23&mm=%23%23%23&aa=%23%23%23&dd2=%23%23%23&mm2=%23%23%23&aa2=%23%23%23&dias=&entidad=AGUAS%2BANDINAS%2BS.A.&rut=61808000%2B&formulario=1'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:96.0) Gecko/20100101 Firefox/96.0',
}

response = requests.request("POST", url, headers=headers, data=payload)
print(response.text)
The process I followed was to:
1. Copy the request as cURL in Firefox (right-click the request in the Network tab).
2. Import it into Postman.
3. Export it as Python code from Postman.
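As a side note, the URL-encoded payload that Postman exports can also be written as a plain dict and handed to requests to encode. This is a sketch, assuming the decoded field values (the '###' placeholders and the literal '+' characters) are exactly what the server expects:

import requests

url = "https://www.cmfchile.cl/institucional/mercados/entidad.php?mercado=V&rut=61808000&tipoentidad=RVEMI&control=svs&pestania=25"

# Decoded form fields (%23 -> '#', %2B -> '+'); requests re-encodes them on send
payload = {
    'dd': '###', 'mm': '###', 'aa': '###',
    'dd2': '###', 'mm2': '###', 'aa2': '###',
    'dias': '',
    'entidad': 'AGUAS+ANDINAS+S.A.',
    'rut': '61808000+',
    'formulario': '1',
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:96.0) Gecko/20100101 Firefox/96.0',
}

response = requests.post(url, headers=headers, data=payload)
print(response.text)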
So I'm trying to do a data science project using information from this site. Sadly, when I try to scrape it, it blocks me because it thinks I am a bot. I saw a couple of posts here, e.g. Python webscraping blocked,
but it seems that Immoscout has already found a fix for that workaround. Does anybody know how I can get around this? Thanks!
My Code:
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Linux; U; Android 4.2.2; he-il; NEO-X5-116A Build/JDQ39) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Safari/534.30",
    "Accept-Language": "en-US,en;q=0.5",
}

url = "https://www.immobilienscout24.de/Suche/de/berlin/berlin/wohnung-kaufen?enteredFrom=one_step_search"
response = requests.get(url, cookies={'required_cookie': 'reese84=xxx'}, headers=headers)
webpage = response.content
print(response.status_code)

soup = BeautifulSoup(webpage, "html.parser")
print(soup.prettify())
thanks :)
The data is generated dynamically by an API call that returns a JSON response to a POST request, so you can extract the data using only the requests module. You can follow the next example.
import requests

headers = {
    'content-type': 'application/json',
    'x-requested-with': 'XMLHttpRequest'
}
api_url = "https://www.immobilienscout24.de/Suche/de/berlin/berlin/wohnung-kaufen?pagenumber=1"
jsonData = requests.post(api_url, headers=headers).json()

# Each entry's first attribute holds the price, e.g. "4.350.000 €"
for item in jsonData['searchResponseModel']['resultlist.resultlist']['resultlistEntries'][0]['resultlistEntry']:
    value = item['attributes'][0]['attribute'][0]['value'].replace('€', '').replace('.', ',')
    print(value)
Output:
4,350,000
285,000
620,000
590,000
535,000
972,500
579,000
1,399,900
325,000
749,000
290,000
189,900
361,825
199,900
299,000
195,000
1,225,000
199,000
825,000
315,000
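If you need more than the first page, the pagenumber query parameter in the API URL suggests the results are paginated. A sketch of a page loop, untested and assuming the response keeps the same shape on every page:

import requests

headers = {
    'content-type': 'application/json',
    'x-requested-with': 'XMLHttpRequest'
}

# The page range is a placeholder; the real page count would come from the response
for page in range(1, 4):
    api_url = f"https://www.immobilienscout24.de/Suche/de/berlin/berlin/wohnung-kaufen?pagenumber={page}"
    jsonData = requests.post(api_url, headers=headers).json()
    for item in jsonData['searchResponseModel']['resultlist.resultlist']['resultlistEntries'][0]['resultlistEntry']:
        print(item['attributes'][0]['attribute'][0]['value'])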
I am trying to get data from a page. I've read the posts of other people who had the same problem: making a GET request first to get cookies, setting headers, none of it works. When I examine the output of print(soup.title.get_text()) I still get "Log In" as the returned title. The login_data dict uses the same key names as the HTML <input> elements, e.g. <input name="ctl00$cphMain$logIn$UserName" ...> for the username and <input name="ctl00$cphMain$logIn$Password" ...> for the password. Not sure what to do next. I can't use Selenium, as I have to run this script on an EC2 instance that's running a Splunk server.
import requests
from bs4 import BeautifulSoup

link = "****"
login_URL = "https://erecruit.elwoodstaffing.com/Login.aspx"
login_data = {
    "ctl00$cphMain$logIn$UserName": "****",
    "ctl00$cphMain$logIn$Password": "****"
}

with requests.Session() as session:
    z = session.get(login_URL)
    session.headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.63 Safari/537.36',
        'Content-Type': 'application/json;charset=UTF-8',
    }
    post = session.post(login_URL, data=login_data)
    response = session.get(link)
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    print(soup.title.get_text())
I actually found the answer.
You can basically just go to the Network tab in Chrome's DevTools and copy the request as a cURL statement. Then use a website or tool to convert the cURL statement to its programming-language equivalent (Python, Node, Java, and so forth).
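For an ASP.NET page like this one, what the converted request boils down to is posting the form's hidden state fields along with the credentials. Here is a hedged sketch of that idea, assuming the page uses the standard WebForms hidden inputs (__VIEWSTATE, __VIEWSTATEGENERATOR, __EVENTVALIDATION); the exact set of fields for this site should be read off the copied cURL:

import requests
from bs4 import BeautifulSoup

login_URL = "https://erecruit.elwoodstaffing.com/Login.aspx"

with requests.Session() as session:
    # Load the login page and scrape the hidden WebForms state fields
    page = session.get(login_URL)
    soup = BeautifulSoup(page.text, "html.parser")
    login_data = {
        "ctl00$cphMain$logIn$UserName": "****",
        "ctl00$cphMain$logIn$Password": "****",
    }
    for name in ("__VIEWSTATE", "__VIEWSTATEGENERATOR", "__EVENTVALIDATION"):
        field = soup.find("input", attrs={"name": name})
        if field:  # not every ASP.NET page emits all three
            login_data[name] = field.get("value", "")
    # Post as a regular form; note there is no JSON Content-Type header here
    response = session.post(login_URL, data=login_data)
    print(BeautifulSoup(response.text, "html.parser").title.get_text())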
I have a crawl-data task. After inspecting the URL with Firefox F12 (DevTools), I found that the site needs a JSON array as input, which looks like:
{"phyIDs": ["FDER047ERDF"]}
and returns some data, also in JSON format:
{"trueIDs": ["802.112.1"]}
What I need is just the trueIDs, so I used Python 3.6.1 and Requests to do the job. Here is part of the code:
import json
import requests
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0',
'Cookie': 'JSFDKF.......',
'Content-Type': 'application/json;charset=UTF-8'}
data = {'phyIDs': json.dumps([{0: 'FDER047ERDF'}])}
resp = requests.post(url, headers=headers, verify=False,
data=data)
print(resp.text)
But the printed response text is an HTML-like message saying that some error occurred, and the status_code is 500. However, if I comment out the 'Content-Type' entry in headers and use a normal dict instead of JSON as the input data, then nothing returns and the status_code changes to 415. Now I don't know what to do and hope someone can help me, thanks very much!
...........
Thanks guys, I have solved this. The problem was that I shouldn't have added the '0' key inside the JSON array; the DevTools view shows indices, but the payload itself is just a plain list.
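For anyone who hits the same 500/415 pair, this is roughly what the fixed call looks like. A sketch, with the url and cookie values standing in for the real ones; the json= parameter serializes the dict and sets the Content-Type header in one go:

import requests

url = '****'  # the real endpoint goes here
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0',
           'Cookie': 'JSFDKF.......'}

# The array is a plain list; no index keys like 0
resp = requests.post(url, headers=headers, verify=False,
                     json={'phyIDs': ['FDER047ERDF']})
print(resp.json().get('trueIDs'))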
I use the following code to retrieve a web page.
import requests

payload = {'name': temp}  # I extract temp from another page.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:49.0) Gecko/20100101 Firefox/49.0',
           'Accept': 'text/html, */*; q=0.01',
           'Accept-Language': 'en-US,en;q=0.5',
           'X-Requested-With': 'XMLHttpRequest'}
full_url = url.rstrip() + '/test/log?'
r = requests.get(full_url, params=payload, headers=headers, stream=True)
for line in r.iter_lines():
    if line:
        print(line)
However, for some reason the HTTP response is missing the text inside the tags.
I found out that if I send the request through Burp, intercept it, and wait about 3 seconds before forwarding it, then I get the complete HTML page containing the text inside the tags.
I still could not find the cause. Ideas?
From the requests documentation:
By default, when you make a request, the body of the response is downloaded immediately. You can override this behaviour and defer downloading the response body until you access the Response.content attribute with the stream parameter:
Body Content Workflow
In other words, try removing stream=True from your requests.get() call,
or
access r.content, where r is the response; you will have all the content once you do.
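A minimal sketch of both options, reusing full_url, payload, and headers as defined in the question's snippet:

import requests

# Option 1: drop stream=True so the body downloads immediately
r = requests.get(full_url, params=payload, headers=headers)
print(r.text)

# Option 2: keep stream=True but force the full download first
r = requests.get(full_url, params=payload, headers=headers, stream=True)
_ = r.content  # accessing .content consumes and caches the whole body
for line in r.text.splitlines():
    if line:
        print(line)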
I have tried logging into GitHub using the following code:
url = 'https://github.com/login'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
'login':'username',
'password':'password',
'authenticity_token':'Token that keeps changing',
'commit':'Sign in',
'utf8':'%E2%9C%93'
}
res = requests.post(url)
print(res.text)
Now, res.text prints the code of the login page. I understand that this may be because the token keeps changing continuously. I have also tried setting the URL to https://github.com/session, but that does not work either.
Can anyone tell me a way to generate the token? I am looking for a way to log in without using the API. I asked another question where I mentioned that I was unable to log in. One comment said that I was not doing it right and that it is possible to log in just by using the requests module, without the help of the GitHub API.
ME:
So, can I log in to Facebook or Github using the POST method? I have tried that and it did not work.
THE USER:
Well, presumably you did something wrong
Can anyone please tell me what I did wrong?
After the suggestion about using sessions, I have updated my code:
s = requests.Session()
headers = {Same as above}
s.put('https://github.com/session', headers=headers)
r = s.get('https://github.com/')
print(r.text)
I still can't get past the login page.
I think you get sent back to the login page because you are redirected, and since your code doesn't send back your cookies, you can't keep a session.
You are looking for session persistence; requests provides it:
Session Objects
The Session object allows you to persist certain parameters across requests. It also persists cookies across all requests made from the Session instance, and will use urllib3's connection pooling. So if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase (see HTTP persistent connection).
s = requests.Session()
s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
r = s.get('http://httpbin.org/cookies')
print(r.text)
# '{"cookies": {"sessioncookie": "123456789"}}'
http://docs.python-requests.org/en/master/user/advanced/
Actually, in a POST request the parameters should go in the request body, not in the headers, so the login data belongs in the data parameter.
For GitHub, the authenticity token is present in the value attribute of an <input> tag; it can be extracted with the BeautifulSoup library.
This code works fine
import requests
from getpass import getpass
from bs4 import BeautifulSoup

headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
login_data = {
    'commit': 'Sign in',
    'utf8': '%E2%9C%93',
    'login': input('Username: '),
    'password': getpass()
}
url = 'https://github.com/session'

session = requests.Session()
# Fetch the login page first so the per-session CSRF token can be scraped
response = session.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html5lib')
login_data['authenticity_token'] = soup.find(
    'input', attrs={'name': 'authenticity_token'})['value']

# Post the credentials together with the token; the session keeps the cookies
response = session.post(url, data=login_data, headers=headers)
print(response.status_code)

response = session.get('https://github.com', headers=headers)
print(response.text)
This code works perfectly:
import requests
from bs4 import BeautifulSoup

headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
login_data = {
    'commit': 'Sign in',
    'utf8': '%E2%9C%93',
    'login': 'your-username',
    'password': 'your-password'
}

with requests.Session() as s:
    url = "https://github.com/session"
    r = s.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'html5lib')
    # Pull the CSRF token from the login form before posting the credentials
    login_data['authenticity_token'] = soup.find('input', attrs={'name': 'authenticity_token'})['value']
    r = s.post(url, data=login_data, headers=headers)
You can also try the PyGithub library to perform common GitHub tasks through the API.
Check the link below:
https://github.com/PyGithub/PyGithub
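For instance, a minimal sketch with PyGithub, assuming you authenticate with a personal access token (a placeholder below):

from github import Github  # pip install PyGithub

g = Github('your-access-token')  # personal access token, not your password

# List the repositories visible to the authenticated user
for repo in g.get_user().get_repos():
    print(repo.name)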