Python: Foursquare API and Requests requires cookies and javascript

Issue
I am trying to contact the Foursquare API, specifically the checkin/resolve endpoint. In the past this has worked, but lately I am getting blocked with an error message saying I am a bot, and that cookies and javascript cannot be read.
Code
response = "Swarmapp URL" # from previous functions, this isn't the problem
checkin_id = response.split("c/")[1] # To get shortID
url = "https://api.foursquare.com/v2/checkins/resolve"
params = dict(
client_id = "client_id",
client_secret = "client_secret",
shortId = checkin_id,
v = "20180323")
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
time.sleep(8.5) # Limit of 500 requests an hour
resp = requests.get(url = url, params=params, headers = headers)
data = json.loads(resp.text)
This code works for about 30-40 requests, then errors out and returns an HTML page containing: "Please verify you are human", "Access to this page has been denied because we believe you are using automation tools to browse the website.", "Your browser does not support cookies" and so on.
I've tried Googling and searching this site for similar errors, but I can't find anything that has helped. The Foursquare API documentation does not say anything about this either.
Any suggestions?

Answer
According to the Foursquare API documentation, this code should work:
import json, requests

url = 'https://api.foursquare.com/v2/checkins/resolve'
params = dict(
    client_id='CLIENT_ID',
    client_secret='CLIENT_SECRET',
    v='20180323',
    shortId='swarmPostID',
)
resp = requests.get(url=url, params=params)
data = json.loads(resp.text)
However, the bot detection Foursquare uses evidently contradicts the functionality of the API. I found that implementing a try/except block with a wait timer fixed the issue.
import json, requests, time

url = 'https://api.foursquare.com/v2/checkins/resolve'
params = dict(
    client_id='CLIENT_ID',
    client_secret='CLIENT_SECRET',
    v='20180323',
    shortId='swarmPostID',
)

# This block sits inside a loop over check-ins, hence the `continue` below.
try:
    resp = requests.get(url=url, params=params)
except requests.exceptions.RequestException:
    time.sleep(60)  # avoids bot detection
    try:
        resp = requests.get(url=url, params=params)
    except requests.exceptions.RequestException:
        print("Post is private or deleted.")
        continue  # move on to the next check-in
data = json.loads(resp.text)
This seems like a very weird fix. Either Foursquare has implemented a DDoS prevention system that contradicts its own functionality, or their checkin/resolve endpoint is broken. Either way, the code works.
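A slightly more general version of the same idea (just a sketch, not part of the original answer; the endpoint, version string, and retry count are carried over from the code above) retries with an increasing delay and also catches the case where the bot-check HTML page comes back instead of JSON:
import json
import time
import requests

def resolve_checkin(short_id, client_id, client_secret, retries=3):
    """Call checkins/resolve, backing off and retrying when a request fails."""
    url = 'https://api.foursquare.com/v2/checkins/resolve'
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': '20180323',
        'shortId': short_id,
    }
    delay = 10
    for attempt in range(retries):
        try:
            resp = requests.get(url, params=params, timeout=30)
            return json.loads(resp.text)
        except (requests.exceptions.RequestException, ValueError):
            # Network error or non-JSON body (e.g. the "verify you are human" page):
            # wait, then try again with a longer delay.
            time.sleep(delay)
            delay *= 2
    return None  # give up after `retries` attempts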

Related

How to know what user-agent should I use? (404 Client Error Not Found for url)

I'm trying to download data from a specific URL and I get the "404 Client Error: Not Found for URL" Error. The website I'm trying to access is an FTP server of a university.
From searching the web I understand that a user-agent must be configured but even after configuration I still get the same error...
The URL I'm trying to access is this- https://idcftp.files.com/files/Users%20Folders/yoav.yair/WWLLN%20Data/December2018/
(You need a password to access it, but this is information I can't give).
The code I'm trying to use is this-
import requests
from requests.auth import HTTPBasicAuth
from bs4 import BeautifulSoup

def get_url_paths(url, header_list, ext='', params={}):
    response = requests.get(url, params=params, headers=header_list,
                            auth=HTTPBasicAuth('<user_name>', '<password>'))
    # response = requests.get(url, params=params)
    if response.ok:
        response_text = response.text
    else:
        return response.raise_for_status()
    soup = BeautifulSoup(response_text, 'html.parser')
    parent = [url + node.get('href') for node in soup.find_all('a') if node.get('href').endswith(ext)]
    return parent

def main():
    # url = 'https://www.ncei.noaa.gov/data/total-solar-irradiance/access/monthly/'
    header_list = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}
    url = 'https://idcftp.files.com/files/Users%20Folders/yoav.yair/WWLLN%20Data/December2018/'
    ext = 'mat'
    result = get_url_paths(url, header_list, ext)
    for file in result:
        f_name = file[-19:-13]
        print(f_name)

if __name__ == '__main__':
    main()
I've tried using all kinds of user agents, but nothing works. How can I find out which user agent this website expects?
Thank you,
Karin.
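One way to narrow this down (a hedged debugging sketch; the credentials and User-Agent string are placeholders) is to print the status code and the headers requests actually sent, since a 404 usually points at the URL itself rather than the User-Agent:
import requests
from requests.auth import HTTPBasicAuth

resp = requests.get(
    'https://idcftp.files.com/files/Users%20Folders/yoav.yair/WWLLN%20Data/December2018/',
    headers={'User-Agent': 'Mozilla/5.0'},            # any browser-like UA string
    auth=HTTPBasicAuth('<user_name>', '<password>'),  # placeholders
)
print(resp.status_code)      # 401/403 suggests credentials, 404 suggests the URL itself
print(resp.request.headers)  # the headers that were actually sent, User-Agent included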

Is there a way I can get data that is being loaded with ajax request on a website using web scraping in python?

I am trying to get the listing data on this page https://stashh.io/collection/secret1f7mahjdux4hldnn6nqc8vnu0h5466ks9u8fwg3?sort=sold_date+desc using web scraping
Because the data is loaded with JavaScript, I can't use something like requests and BeautifulSoup directly. I checked the network tab to see how the requests are being sent and found that, to get the data, I first need a sid to make further requests. I can get the sid with the code below:
import ast
import requests

def get_sid():
    url = "https://stashh.io/socket.io/?EIO=4&transport=polling&t=NyPfiJ-"
    response = requests.get(url)
    response.raise_for_status()
    text = response.text[1:]  # drop the leading packet-type character
    data = {"data": ast.literal_eval(text)}
    return data["data"]["sid"]
Then I use the sid to send a request to this endpoint, which should return the data:
def get_listings():
    sid = get_sid()
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36"
    }
    url = f"https://stashh.io/socket.io/?EIO=4&transport=polling&sid={sid}"
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    print(response.content)
    return response.json()
However, I am getting b'2' as the response instead of this:
434[{"nfts":[{"_id":"61ffffd9aa7f94f21e7262c0","collection":"secret1f7mahjdux4hldnn6nqc8vnu0h5466ks9u8fwg3","id":"354","fullid":"secret1f7mahjdux4hldnn6nqc8vnu0h5466ks9u8fwg3_354","name":"Amalia","thumbnail":[{"authentication":{"key":"","user":""},"file_type":"image","extension":"png","url":"https://arweave.net/7pVsbsC2M6uVDMHaVxds-oZkDNajhsrIkKEDT-vfkM8/public_image.png"}],"created_at":1644080437,"royalties_decimal_rate":3,"royalties":[{"recipient":null,"rate":20},{"recipient":null,"rate":15},{"recipient":null,"rate":15}],"isTemplate":false,"mint_on_demand":{"serial":null,"quantity":null,"version":null,"from_template":""},"template":{},"likes":[{"from":"secret19k85udnt8mzxlt3tx0gk29thgnszyjcxe8vrkt","timestamp":1644543830855}],"listing"...
I resorted to using Selenium to get the data; it works, but it's quite slow.
Is there a way I can get this data without using selenium?
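For what it's worth, a bare 2 is an Engine.IO ping packet, and event data is normally only delivered after the client finishes the Socket.IO connect step. The sketch below is an assumption about how that handshake could look over the polling transport, reusing the stashh.io endpoint from the question; the specific events the site expects would still have to be copied from the browser's network tab:
import json
import requests

BASE = "https://stashh.io/socket.io/"  # endpoint from the question

with requests.Session() as s:
    # 1. Engine.IO handshake: the body is "0{...}" and the JSON contains the sid
    r = s.get(BASE, params={"EIO": "4", "transport": "polling"})
    r.raise_for_status()
    sid = json.loads(r.text[1:])["sid"]

    # 2. Socket.IO connect: "40" = Engine.IO message (4) + Socket.IO CONNECT (0)
    s.post(BASE, params={"EIO": "4", "transport": "polling", "sid": sid}, data="40")

    # 3. Poll again: the response should now contain "40{...}" plus any queued
    #    event/ack packets ("42..."/"43...") carrying actual data.
    r = s.get(BASE, params={"EIO": "4", "transport": "polling", "sid": sid})
    print(r.text)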

How can I use POST from requests module to login to Github?

I have tried logging into GitHub using the following code:
url = 'https://github.com/login'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
           'login': 'username',
           'password': 'password',
           'authenticity_token': 'Token that keeps changing',
           'commit': 'Sign in',
           'utf8': '%E2%9C%93'
           }
res = requests.post(url)
print(res.text)
Now, res.text prints the code of the login page. I understand that it may be because the token keeps changing continuously. I have also tried setting the URL to https://github.com/session, but that does not work either.
Can anyone tell me a way to generate the token? I am looking for a way to log in without using the API. I had asked another question where I mentioned that I was unable to log in. One comment said that I was not doing it right and that it is possible to log in using only the requests module, without the help of the GitHub API.
Me: So, can I log in to Facebook or GitHub using the POST method? I have tried that and it did not work.
The user: Well, presumably you did something wrong.
Can anyone please tell me what I did wrong?
After the suggestion about using sessions, I have updated my code:
s = requests.Session()
headers = {Same as above}
s.put('https://github.com/session', headers=headers)
r = s.get('https://github.com/')
print(r.text)
I still can't get past the login page.
I think you get sent back to the login page because you are redirected, and since your code doesn't send your cookies back, you can't have a session.
You are looking for session persistence, and requests provides it:
Session Objects: The Session object allows you to persist certain parameters across requests. It also persists cookies across all requests made from the Session instance, and will use urllib3's connection pooling. So if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase (see HTTP persistent connection).
s = requests.Session()
s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
r = s.get('http://httpbin.org/cookies')
print(r.text)
# '{"cookies": {"sessioncookie": "123456789"}}'
http://docs.python-requests.org/en/master/user/advanced/
Actually, in a POST request the parameters belong in the request body, not in the headers, so the login data should be passed through the data parameter.
For GitHub, the authenticity token is present in the value attribute of an input tag and can be extracted with the BeautifulSoup library.
This code works fine
import requests
from getpass import getpass
from bs4 import BeautifulSoup

headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
login_data = {
    'commit': 'Sign in',
    'utf8': '%E2%9C%93',
    'login': input('Username: '),
    'password': getpass()
}
url = 'https://github.com/session'

session = requests.Session()
response = session.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html5lib')
login_data['authenticity_token'] = soup.find(
    'input', attrs={'name': 'authenticity_token'})['value']
response = session.post(url, data=login_data, headers=headers)
print(response.status_code)
response = session.get('https://github.com', headers=headers)
print(response.text)
This code works perfectly
import requests
from bs4 import BeautifulSoup

headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}
login_data = {
    'commit': 'Sign in',
    'utf8': '%E2%9C%93',
    'login': 'your-username',
    'password': 'your-password'
}

with requests.Session() as s:
    url = "https://github.com/session"
    r = s.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'html5lib')
    login_data['authenticity_token'] = soup.find('input', attrs={'name': 'authenticity_token'})['value']
    r = s.post(url, data=login_data, headers=headers)
You can also try the PyGithub library, a wrapper around the GitHub REST API, to perform common GitHub tasks.
Check the link below:
https://github.com/PyGithub/PyGithub
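For example, a minimal sketch using a personal access token (the token value is a placeholder; this goes through the API rather than the login form):
from github import Github  # pip install PyGithub

# Authenticate with a personal access token (placeholder value)
g = Github("YOUR_PERSONAL_ACCESS_TOKEN")

# List the authenticated user's repositories
for repo in g.get_user().get_repos():
    print(repo.full_name)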

python requests handle error 302?

I am trying to use the requests library to make an HTTP request and read the redirect URL from the Location response header. In Chrome's inspector I can see that the response status is 302.
However, in Python, requests always returns a 200 status. I added allow_redirects=False, but the status is still always 200.
The url is https://api.weibo.com/oauth2/authorize?redirect_uri=http%3A//oauth.weico.cc&response_type=code&client_id=211160679
On that page, the first field takes the test account moyan429#hotmail.com, the second field takes the password 112358, and then you click the first button to log in.
My Python code:
import requests

user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.152 Safari/537.36'

session = requests.session()
session.headers['User-Agent'] = user_agent
session.headers['Host'] = 'api.weibo.com'
session.headers['Origin'] = 'https://api.weibo.com'
session.headers['Referer'] = 'https://api.weibo.com/oauth2/authorize?redirect_uri=http%3A//oauth.weico.cc&response_type=code&client_id=211160679'
session.headers['Connection'] = 'keep-alive'

data = {
    'client_id': api_key,
    'redirect_uri': callback_url,
    'userId': 'moyan429#hotmail.com',
    'passwd': '112358',
    'switchLogin': '0',
    'action': 'login',
    'response_type': 'code',
    'quick_auth': 'null'
}

resp = session.post(
    url='https://api.weibo.com/oauth2/authorize',
    data=data,
    allow_redirects=False
)
code = resp.url[-32:]
print(code)
You are probably getting an API error message. Use print(resp.text) to see what the server tells you is wrong here.
Note that you can always inspect resp.history to see if there were any redirects; if there were any you'll find a list of response objects.
Do not set the Host or Connection headers; leave those to requests to handle. I doubt the Origin or Referer headers are needed here either. Since this is an API, the User-Agent header is probably also overkill.
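To illustrate both points, here is a small sketch against httpbin.org (not the Weibo endpoint):
import requests

# With redirects disabled you see the 302 itself and its Location header
resp = requests.get('http://httpbin.org/redirect/1', allow_redirects=False)
print(resp.status_code)          # 302
print(resp.headers['Location'])  # the redirect target

# With redirects enabled (the default) the 302 shows up in resp.history
resp = requests.get('http://httpbin.org/redirect/1')
print(resp.status_code)                       # 200, the final response
print([r.status_code for r in resp.history])  # [302]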

python requests - 403 forbidden

I have this
import requests

payload = {'from': 'me', 'lang': lang, 'url': csv_url}
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'
}
api_url = 'http://dev.mypage.com/app/import/'
sent = requests.get(api_url, params=payload, headers=headers)
I just keep getting 403. I have been following the requests docs.
What am I doing wrong?
UPDATE:
The URL only accepts logged-in users. How can I log in there with requests?
This is how it's usually done using a Session object:
import requests

# start a new session to persist data between requests
session = requests.Session()

# log in within the session
response = session.post(
    'http://dev.mypage.com/login/',
    data={'user': 'username', 'password': '12345'}
)

# make sure the login was successful
if not 200 <= response.status_code < 300:
    raise Exception("Error while logging in, code: %d" % response.status_code)

# ... use the session object to make logged-in requests, your example:
api_url = 'http://dev.mypage.com/app/import/'
sent = session.get(api_url, params=payload, headers=headers)
You should obviously adapt this to your usage scenario.
The reason a session object is needed is that the HTTP protocol itself is stateless and has no concept of a session; sessions are implemented on top of HTTP, typically with cookies that the client has to send back on every request.
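As a quick check (a sketch reusing the hypothetical login endpoint from the answer above), you can inspect the cookies the server set during login; those cookies are what the Session object replays on later requests:
import requests

session = requests.Session()

# Log in against the (hypothetical) endpoint from the answer above
response = session.post(
    'http://dev.mypage.com/login/',
    data={'user': 'username', 'password': '12345'}
)
response.raise_for_status()

# The cookies set during login are stored on the session and sent automatically
# with every subsequent request made through it.
print(session.cookies.get_dict())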
