Posting form data isn't working, and since my other post about this didn't get me anywhere, I figured I would ask the question again to get another perspective. I am currently trying to get requests.get(url, data=q) to work. When I print the result, I get a page not found. For now I have resorted to setting variables and joining them into the full URL by hand, but I really want to learn this aspect of requests. Where am I making the mistake? I am using the HTML tag attributes name=search_terms and name=geo_location_terms for the form.
search_terms = "Bars"
location = "New Orleans, LA"
url = "https://www.yellowpages.com"
q = {'search_terms': search_terms, 'geo_locations_terms': location}
page = requests.get(url, data=q)
print(page.url)
You have a few small mistakes in your code:
Check the form's action attribute: the form submits to /search, so url = "https://www.yellowpages.com/search".
The second parameter is geo_location_terms, not geo_locations_terms.
Pass query parameters to requests.get as params, not as request data (data).
So, the final version of the code:
import requests
search_terms = "Bars"
location = "New Orleans, LA"
url = "https://www.yellowpages.com/search"
q = {'search_terms': search_terms, 'geo_location_terms': location}
page = requests.get(url, params=q)
print(page.url)
Result:
https://www.yellowpages.com/search?search_terms=Bars&geo_location_terms=New+Orleans%2C+LA
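One way to see why data= does not work here, without sending anything over the network, is to prepare both variants and compare them. This is only a sketch: the helper name build_request is made up for illustration, and it reuses the URL and field names from the answer above.

```python
import requests

SEARCH_URL = "https://www.yellowpages.com/search"
FIELDS = {"search_terms": "Bars", "geo_location_terms": "New Orleans, LA"}


def build_request(**kwargs):
    """Prepare (but do not send) a GET request so it can be inspected."""
    return requests.Request("GET", SEARCH_URL, **kwargs).prepare()


# params= is URL-encoded into the query string.
with_params = build_request(params=FIELDS)
print(with_params.url)

# data= is form-encoded into the request body; the URL carries no query string.
with_data = build_request(data=FIELDS)
print(with_data.url)
print(with_data.body)
```

With data=, the search terms never reach the URL, so the site only ever sees the bare address.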
Besides the issues pointed out by @Lev Zakharov, you need to set cookies in your request, like this:
import requests
search_terms = "Bars"
location = "New Orleans, LA"
url = "https://www.yellowpages.com/search"
with requests.Session() as session:
    session.headers.update({
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36',
        'Cookie': 'cookies'  # placeholder: replace with the cookie string from your browser
    })
    q = {'search_terms': search_terms, 'geo_location_terms': location}
    response = session.get(url, params=q)
    print(response.url)
    print(response.status_code)
Output
https://www.yellowpages.com/search?search_terms=Bars&geo_location_terms=New+Orleans%2C+LA
200
To get the cookies, inspect the request in a network monitor, for instance the Network tab of Chrome Developer Tools, then replace the placeholder value 'cookies' with the cookie string your browser sends.
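Instead of pasting the whole cookie string into a 'Cookie' header, individual cookies can also be stored in the session's cookie jar. The sketch below assumes a single cookie; the name session_token and the value abc123 are placeholders, not the real cookies the site uses. It only prepares the request, so the resulting header can be checked before anything is sent.

```python
import requests

url = "https://www.yellowpages.com/search"
q = {"search_terms": "Bars", "geo_location_terms": "New Orleans, LA"}

with requests.Session() as session:
    # Placeholder cookie: copy the real name/value pairs from your browser.
    session.cookies.set("session_token", "abc123", domain="www.yellowpages.com")

    # prepare_request merges the session's headers and cookies into the request.
    prepared = session.prepare_request(requests.Request("GET", url, params=q))
    cookie_header = prepared.headers.get("Cookie")
    print(cookie_header)

    # To actually send it: response = session.send(prepared)
```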
My goal is to scrape this URL and iterate through its pages, but I keep getting a strange error. My code and the error follow:
import requests
import json
import pandas as pd
url = 'https://www.acehardware.com/api/commerce/storefront/locationUsageTypes/SP/locations?page='
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0',
}
# create a URL list to scrape data from all pages
url_list = []
for i in range(0, 4375):
    url_list.append(url + str(i))
response = requests.get(url, headers=headers)
data = response.json()
d = json.dumps(data)
df = pd.json_normalize(d)
Error:
{'items': [{'applicationName': 'ReverseProxy', 'errorCode': 'UNAUTHORIZED', 'message': 'You are Unauthorized to perform the attempted operation. Application access token required', 'additionalErrorData': [{'name': 'OperationName', 'value': 'http://www.acehardware.com/api/commerce/storefront/locationUsageTypes/SP/locations?page=0&page=1'}]}], 'exceptionDetail': {'type': 'Mozu.Core.Exceptions.VaeUnAuthorizedException'}
This is strange to me because I should be able to access each page of this URL: I can follow the link in my browser and copy and paste the JSON data. Is there a way to scrape this site without an API key?
It works in your browser because the token cookie is saved in your browser.
Once you delete all cookies, navigating to the API link directly no longer works.
The token cookie is sb-sf-at-prod-s. Add this cookie to your headers and it will work.
I do not know whether the value of this cookie is tied to my IP address. If it is and the request does not work for you, replace the value with the one from your own browser.
The cookie may be valid only for a limited number of requests or for a limited time.
I recommend sleeping between requests.
The website is protected by Akamai's anti-bot security.
import requests
import json
url = 'https://www.acehardware.com/api/commerce/storefront/locationUsageTypes/SP/locations?page='
headers = {
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0',
'cookie': 'sb-sf-at-prod=at=%2FVzynTSsuVJGJMAd8%2BjAO67EUtyn1fIEaqKmCi923rynHnztv6rQZH%2F5LMa7pmMBRiW00x2L%2B%2FLfmJhJKLpNMoK9OFJi069WHbzphl%2BZFM%2FpBV%2BdqmhCL%2FtylU11GQYQ8y7qavW4MWS4xJzWdmKV%2F01iJ0RkwynJLgcXmCzcde2oqgxa%2FAYWa0hN0xuYBMFlCoHJab1z3CU%2F01FJlsBDzXmJwb63zAJGVj4PIH5LvlcbnbOhbouQBKxCrMyrmpvxDf70U3nTl9qxF9qgOyTBZnvMBk1juoK8wL1K3rYp51nBC0O%2Bthd94wzQ9Vkolk%2B4y8qapFaaxRtfZiBqhAAtMg%3D%3D'
}
# create a URL list to scrape data from all pages
url_list = []
for i in range(0, 4375):
    url_list.append(url + str(i))
response = requests.get(url, headers=headers)
data = response.json()
d = json.dumps(data)
print(d)
I hope I have been able to help you.
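Note that the code above builds url_list but only ever requests url once. A minimal sketch of actually walking the pages with a pause between requests might look like the following. The function name fetch_pages, the one-second default delay and the injectable get argument are illustrative choices, and it assumes the API accepts page as an ordinary query parameter, as the URL in the question suggests.

```python
import time

import requests

BASE_URL = (
    "https://www.acehardware.com/api/commerce/storefront/"
    "locationUsageTypes/SP/locations"
)


def fetch_pages(page_numbers, headers, delay_seconds=1.0, get=requests.get):
    """Yield the decoded JSON of each page, pausing between requests.

    `get` is injectable so the loop can be exercised without network access.
    """
    for page in page_numbers:
        response = get(BASE_URL, params={"page": page}, headers=headers, timeout=30)
        response.raise_for_status()  # raise on HTTP error statuses
        yield response.json()
        time.sleep(delay_seconds)
```

It would be used as `for data in fetch_pages(range(3), headers): ...`, with headers being the dictionary that carries the User-Agent and the cookie.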
I am writing a parser for the site https://myip.ms/. For this page, https://myip.ms/browse/sites/1/ipID/23.227.38.0/ipIDii/23.227.38.255/own/376714, everything works fine, but for the next page, https://myip.ms/browse/sites/2/ipID/23.227.38.0/ipIDii/23.227.38.255/own/376714, it does not output any data, although the site structure is the same. I think this may be because the site limits views or requires registration, but I can't find which request I need to send to log in to my account. What should I do?
import requests
from bs4 import BeautifulSoup
import time
link_list = []
URL = 'https://myip.ms/browse/sites/2/ipID/23.227.38.0/ipIDii/23.227.38.255/own/376714'
HEADERS = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 YaBrowser/20.12.2.105 Yowser/2.5 Safari/537.36','accept':'*/*'}
#HOST =
def get_html(url, params=None):
    r = requests.get(url, headers=HEADERS, params=params)
    return r

def get_content(html):
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find_all('td', class_='row_name')
    for item in items:
        links = item.find('a').get('href')
        link_list.append({
            'link': links
        })

def parser():
    print(URL)
    html = get_html(URL)
    if html.status_code == 200:
        get_content(html.text)
    else:
        print('Error')

parser()
print(link_list)
Send a session ID with your request. It will allow you at least 50 requests per day.
If you use a proxy that supports cookies, this number might be even higher.
So the process is as follows:
Load the page in your browser.
Find the session ID in the request inside your browser's Dev Tools.
Use this session ID in your request; no headers or additional info are required.
Enjoy results for 50 requests per day.
Repeat after 24 hours.
I am trying to get data from a page. I've tried what other people with the same problem suggested, such as making a GET request first to get cookies and setting headers, but none of it works. When I examine the output of print(soup.title.get_text()), I still get "Log In" as the title. The login_data has the same key names as the HTML <input> elements, e.g. <input name=ctl00$cphMain$logIn$UserName ...> for the username and <input name=ctl00$cphMain$logIn$Password ...> for the password. I'm not sure what to do next. I can't use Selenium, as I have to run this script on an EC2 instance that's running a Splunk server.
import requests
from bs4 import BeautifulSoup
link = "****"
login_URL = "https://erecruit.elwoodstaffing.com/Login.aspx"
login_data = {
"ctl00$cphMain$logIn$UserName": "****",
"ctl00$cphMain$logIn$Password": "****"
}
with requests.Session() as session:
    z = session.get(login_URL)
    session.headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.63 Safari/537.36',
        'Content-Type':'application/json;charset=UTF-8',
    }
    post = session.post(login_URL, data=login_data)
    response = session.get(link)
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    print(soup.title.get_text())
I actually found the answer.
You can basically just go to the Network tab in Chrome and copy the request as a cURL command. Then use a website or tool to convert that cURL command to its equivalent in your programming language (Python, Node, Java, and so forth).
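A request copied from the browser usually carries every field the login form submitted, not just the username and password. On ASP.NET WebForms pages such as Login.aspx this typically includes hidden inputs like __VIEWSTATE, so one thing worth checking is whether login_data is missing them. The sketch below is an assumption about what the form needs, not a confirmed fix; the helper name hidden_form_fields and the sample HTML are made up for illustration.

```python
from bs4 import BeautifulSoup


def hidden_form_fields(html):
    """Return {name: value} for every hidden <input> in the page."""
    soup = BeautifulSoup(html, "html.parser")
    fields = {}
    for tag in soup.find_all("input", attrs={"type": "hidden"}):
        name = tag.get("name")
        if name:
            fields[name] = tag.get("value", "")
    return fields


# Made-up sample standing in for the real login page.
sample_html = """
<form method="post" action="Login.aspx">
  <input type="hidden" name="__VIEWSTATE" value="abc" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz" />
  <input type="text" name="ctl00$cphMain$logIn$UserName" />
</form>
"""

payload = hidden_form_fields(sample_html)
payload.update({"ctl00$cphMain$logIn$UserName": "****"})
print(payload)
```

With the real page, the HTML would come from session.get(login_URL).text and the result would be posted with data=payload. In that case the 'Content-Type': 'application/json' header should be dropped, since data= sends a form-encoded body.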
I am trying to log in to the member area of the following website:
https://trader.degiro.nl/
Unfortunately, I have tried many ways without success.
The login form seems to post JSON, which is why I send JSON instead of form data.
import requests
session = requests.Session()
data = {"username":"test", "password":"test", "isRedirectToMobile": "false", "loginButtonUniversal": ""}
url = "https://trader.degiro.nl/login/#/login"
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.62 Safari/537.36'}
r = session.post(url, headers=headers, json={'json_payload': data})
Does anyone have an idea why it doesn't work?
Looking at the request my browser sends, the code should be:
url = "https://trader.degiro.nl/login/secure/login"
...
r = session.post(url, headers=headers, json=data)
That is, there's no need to wrap the data in json_payload and the url is slightly different to the one for viewing the login page.
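To check what each variant actually puts in the request body, the request can be prepared without being sent. This is a small sketch using the placeholder credentials from the question; the variable names flat and wrapped are only for illustration.

```python
import json

import requests

url = "https://trader.degiro.nl/login/secure/login"
data = {
    "username": "test",
    "password": "test",
    "isRedirectToMobile": "false",
    "loginButtonUniversal": "",
}

# json=data serialises the dict itself as the JSON body.
flat = requests.Request("POST", url, json=data).prepare()

# Wrapping it adds an extra "json_payload" level around the same fields.
wrapped = requests.Request("POST", url, json={"json_payload": data}).prepare()

print(flat.headers["Content-Type"])
print(json.loads(flat.body))
print(json.loads(wrapped.body))
```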
I have tried logging into GitHub using the following code:
url = 'https://github.com/login'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
'login':'username',
'password':'password',
'authenticity_token':'Token that keeps changing',
'commit':'Sign in',
'utf8':'%E2%9C%93'
}
res = requests.post(url)
print(res.text)
Now, res.text prints the HTML of the login page. I understand that this may be because the token keeps changing. I have also tried setting the URL to https://github.com/session, but that does not work either.
Can anyone tell me a way to generate the token? I am looking for a way to log in without using the API. I had asked another question where I mentioned that I was unable to log in. One comment said that I was not doing it right and that it is possible to log in using just the requests module, without the help of the GitHub API.
ME:
So, can I log in to Facebook or Github using the POST method? I have tried that and it did not work.
THE USER:
Well, presumably you did something wrong
Can anyone please tell me what I did wrong?
After the suggestion about using sessions, I have updated my code:
s = requests.Session()
headers = {Same as above}
s.put('https://github.com/session', headers=headers)
r = s.get('https://github.com/')
print(r.text)
I still can't get past the login page.
I think you end up back at the login page because you are redirected, and since your code doesn't send your cookies back, you don't have a session.
You are looking for session persistence, which requests provides:
Session Objects
The Session object allows you to persist certain parameters across requests. It also persists cookies across all requests made from the Session instance, and will use urllib3's connection pooling. So if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase (see HTTP persistent connection).
s = requests.Session()
s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
r = s.get('http://httpbin.org/cookies')
print(r.text)
# '{"cookies": {"sessioncookie": "123456789"}}'
http://docs.python-requests.org/en/master/user/advanced/
Actually, in a POST request the parameters should go in the request body, not in the headers, so the login data should be passed in the data parameter.
For GitHub, the authenticity token is in the value attribute of an input tag, which can be extracted using the BeautifulSoup library.
This code works fine:
import requests
from getpass import getpass
from bs4 import BeautifulSoup
headers = {
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
login_data = {
'commit': 'Sign in',
'utf8': '%E2%9C%93',
'login': input('Username: '),
'password': getpass()
}
url = 'https://github.com/session'
session = requests.Session()
response = session.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html5lib')
login_data['authenticity_token'] = soup.find(
'input', attrs={'name': 'authenticity_token'})['value']
response = session.post(url, data=login_data, headers=headers)
print(response.status_code)
response = session.get('https://github.com', headers=headers)
print(response.text)
This code works perfectly:
import requests
from bs4 import BeautifulSoup
headers = {
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
login_data = {
'commit': 'Sign in',
'utf8': '%E2%9C%93',
'login': 'your-username',
'password': 'your-password'
}
with requests.Session() as s:
    url = "https://github.com/session"
    r = s.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'html5lib')
    login_data['authenticity_token'] = soup.find('input', attrs={'name': 'authenticity_token'})['value']
    r = s.post(url, data=login_data, headers=headers)
You can also try using the PyGithub library to perform common GitHub tasks.
See the link below:
https://github.com/PyGithub/PyGithub