Getting a 403 error on a web scraping script - Python

I have a web scraping script that has recently run into a 403 error.
It worked for a while with just the basic code, but it has now started hitting 403 errors.
I tried rotating user agents to circumvent this, and that briefly worked, but those requests are now getting a 403 error too.
Does anyone have any idea how to get this script running again?
If it helps, here is some context:
The purpose of the script is to find out which artists are on which Tidal playlists. For the purposes of this question, I have only included the snippet of code that fetches the site, as that is where the error occurs.
Thanks in advance!
The basic code looks like this:
import requests
import bs4

baseurl = 'https://tidal.com/browse'
for i in platformlist:  # platformlist is defined elsewhere in the script
    url = baseurl + str(i[0])
    tidal = requests.get(url)
    tidal.raise_for_status()  # note: this raises on a 403 before the status check below can run
    if tidal.status_code != 200:
        print("Website Error:", url)
        pass
    else:
        soup = bs4.BeautifulSoup(tidal.text, "lxml")
        text = str(soup)
        text2 = text.lower()
With user-agents:
import random
import requests
import bs4

user_agent_list = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
]
url = 'https://tidal.com/playlist/1b418bb8-90a7-4f87-901d-707993838346'

for i in range(1, 4):
    # Pick a random user agent
    user_agent = random.choice(user_agent_list)
    # Set the headers
    headers = {'User-Agent': user_agent}
    # Make the request
    tidal = requests.get(url, headers=headers)
    print("Request #%d\nUser-Agent Sent: %s" % (i, user_agent))
    print(tidal.status_code)
    print("-------------------")
    tidal.raise_for_status()
    # Make the webpage content legible
    soup = bs4.BeautifulSoup(tidal.text, "lxml")
    print(soup)
    # Turn the bs4 content into text
    text = str(soup)
    text2 = text.lower()

I'd like to suggest an alternative solution - one that doesn't involve BeautifulSoup.
I visited the main page and clicked on an album while logging my network traffic. I noticed that the browser made an HTTP POST request to a GraphQL API, which accepts a custom query string as part of the POST payload that dictates the shape of the response data. The response is JSON, and it contains all the information requested by the query string (in this case, all artists for every track of a playlist). This API is normally used by the page to populate itself asynchronously with JavaScript, which is what happens when the page is viewed in a browser as intended.
Since we have the API endpoint, request headers and POST payload, we can imitate that request in Python to get a JSON response:
def main():
    import requests

    url = "https://tidal.com/browse/api"
    headers = {
        "accept": "application/json",
        "accept-encoding": "gzip, deflate",
        "content-type": "application/json",
        "user-agent": "Mozilla/5.0"
    }
    query = """
    query ($playlistId: String!) {
        playlist(uuid: $playlistId) {
            creator {
                name
            }
            title
            tracks {
                albumID
                albumTitle
                artists {
                    id
                    name
                }
                id
                title
            }
        }
    }
    """
    payload = {
        "operationName": None,
        "query": query,
        "variables": {
            "playlistId": "1b418bb8-90a7-4f87-901d-707993838346"
        }
    }

    response = requests.post(url, headers=headers, json=payload)
    response.raise_for_status()

    playlist = response.json()["data"]["playlist"]
    print("Artists in playlist \"{}\":".format(playlist["title"]))
    for track_number, track in enumerate(playlist["tracks"], start=1):
        artists = ", ".join(artist["name"] for artist in track["artists"])
        print("Track #{} [{}]: {}".format(track_number, track["title"], artists))
    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
Artists in playlist "New Arrivals":
Track #1 [Higher Power]: Coldplay
Track #2 [i n t e r l u d e]: J. Cole
Track #3 [Fast (Motion)]: Saweetie
Track #4 [Miss The Rage]: Trippie Redd, Playboi Carti
Track #5 [In My Feelings (feat. Quavo & Young Dolph)]: Tee Grizzley, Quavo, Young Dolph
Track #6 [Thumbin]: Kash Doll
Track #7 [Tiempo]: Ozuna
...
You can change the playlistId key-value pair in the payload dictionary to get the artist information for any playlist.
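For example, it's a one-line tweak (the UUID below is just a placeholder - substitute any real playlist's UUID):

payload["variables"]["playlistId"] = "00000000-0000-0000-0000-000000000000"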
Take a look at this other answer I posted, where I go more in-depth on logging your network traffic, finding API endpoints, and imitating requests.

Related

Read JSON metadata for a token from Solscan

I'm using Python and trying to read the metadata from a token on Solscan.
I am looking for the name, image, etc. from the metadata.
I am currently using a JSON request, which seems to work (i.e. it doesn't fail), but it only returns:
{"holder":0}
Process finished with exit code 0
I am making several other requests to this website, so I think my request is correct.
I tried looking at the documentation on https://public-api.solscan.io/docs and I believe I am requesting the correct info, but I don't get it.
Here is my current code:
import requests

headers = {
    'accept': 'application/jsonParsed',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
}
params = (
    ('tokenAddress', 'EArf8AxBi44QxFVnSab9gZpXTxVGiAX2YCLokccr1UsW'),
)
response = requests.get('https://public-api.solscan.io/token/meta', headers=headers, params=params)
#response = requests.get('https://arweave.net/viPcoBnO9OjXvnzGMXGvqJ2BEgl25BMtqGaj-I1tkCM', headers=headers)
print(response.content.decode())
Any help appreciated!
This code sample works:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
}
params = {
    'address': 'EArf8AxBi44QxFVnSab9gZpXTxVGiAX2YCLokccr1UsW',
}
response = requests.get('https://api.solscan.io/account', headers=headers, params=params)
print(response.content.decode())
My sample uses a different URL and parameters: https://api.solscan.io/account instead of https://public-api.solscan.io/token/meta, and the address param instead of tokenAddress.

Python Selenium hard webscraping

The website is: https://www.jao.eu/auctions#/
There you can see the 'OUT AREA' dropdown (I see a lot of ReactSelect...).
I need to get the full list of items contained in that list [AT, BDL-GB, BDL-NL, BE...].
Can you please help me?
wait = WebDriverWait(driver, 20)
driver.get('https://www.jao.eu/auctions#/')
first = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '.css-1739xgv-control')))
first.click()
second = wait.until(......
Logging one's network traffic reveals that the page makes several requests to REST APIs, one endpoint being getcorridors, whose JSON response contains all the values from the dropdown(s). All you need to do is imitate that HTTP POST request. No Selenium required:
def get_corridors():
    import requests
    from operator import itemgetter

    url = "https://www.jao.eu/api/v1/auction/calls/getcorridors"
    headers = {
        "Accept": "application/json",
        "Accept-Encoding": "gzip, deflate",
        "Content-Type": "application/json",
        "User-Agent": "Mozilla/5.0"
    }

    response = requests.post(url, headers=headers, json={})
    response.raise_for_status()
    return list(map(itemgetter("value"), response.json()))

def main():
    for corridor in get_corridors():
        print(corridor)
    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
IT-CH
HU-SK
ES-PT
FR-IT
SK-CZ
NL-DK
IT-FR
HU-HR
FR-ES
IT-GR
CZ-AT
DK-NL
SI-AT
CH-DE
...
Try the following to fetch the required list of items from that site using the requests module:
import requests

link = 'https://www.jao.eu/api/v1/auction/calls/getcorridors'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    res = s.post(link, json={})
    items = [item['value'] for item in res.json()]
    print(items)
The output looks like (truncated):
'IT-CH', 'HU-SK', 'ES-PT', 'FR-IT', 'SK-CZ', 'NL-DK', 'IT-FR', 'HU-HR'

Can't login to Instagram using requests

I'm trying to log in to Instagram using the requests library. I succeeded with the following script; however, it doesn't work anymore. The password field is now encrypted (I checked the dev tools while logging in manually).
I've tried:
import re
import requests
from bs4 import BeautifulSoup

link = 'https://www.instagram.com/accounts/login/'
login_url = 'https://www.instagram.com/accounts/login/ajax/'

payload = {
    'username': 'someusername',
    'password': 'somepassword',
    'enc_password': '',
    'queryParams': {},
    'optIntoOneTap': 'false'
}

with requests.Session() as s:
    r = s.get(link)
    csrf = re.findall(r"csrf_token\":\"(.*?)\"", r.text)[0]
    r = s.post(login_url, data=payload, headers={
        "user-agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36",
        "x-requested-with": "XMLHttpRequest",
        "referer": "https://www.instagram.com/accounts/login/",
        "x-csrftoken": csrf
    })
    print(r.status_code)
    print(r.url)
This is what I found using dev tools:
username: someusername
enc_password: #PWD_INSTAGRAM_BROWSER:10:1592421027:ARpQAAm7pp/etjy2dMjVtPRdJFRPu8FAGILBRyupINxLckJ3QO0u0RLmU5NaONYK2G0jQt+78BBDBxR9nrUsufbZgR02YvR8BLcHS4uN8Gu88O2Z2mQU9AH3C0Z2NpDPpS22uqUYhxDKcYS5cA==
queryParams: {"oneTapUsers":"[\"36990119985\"]"}
optIntoOneTap: false
How can I log in to Instagram using requests?
You can use authentication version 0 - plain password, no encryption:
import re
import requests
from bs4 import BeautifulSoup
from datetime import datetime

link = 'https://www.instagram.com/accounts/login/'
login_url = 'https://www.instagram.com/accounts/login/ajax/'

time = int(datetime.now().timestamp())
payload = {
    'username': '<USERNAME HERE>',
    'enc_password': f'#PWD_INSTAGRAM_BROWSER:0:{time}:<PLAIN PASSWORD HERE>',  # <-- note the '0' - that means we want to use plain passwords
    'queryParams': {},
    'optIntoOneTap': 'false'
}

with requests.Session() as s:
    r = s.get(link)
    csrf = re.findall(r"csrf_token\":\"(.*?)\"", r.text)[0]
    r = s.post(login_url, data=payload, headers={
        "user-agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36",
        "x-requested-with": "XMLHttpRequest",
        "referer": "https://www.instagram.com/accounts/login/",
        "x-csrftoken": csrf
    })
    print(r.status_code)
    print(r.url)
    print(r.text)
Prints:
200
https://www.instagram.com/accounts/login/ajax/
{"authenticated": true, "user": true, "userId": "XXXXXXXX", "oneTapPrompt": true, "reactivated": true, "status": "ok"}
To do this yourself, you need to investigate their JavaScript.
After a little research, I found that they use AES-GCM with a 256-bit key. There is a 100-byte prefix whose meaning I still do not know; the password is concatenated to it, and the whole message of 100 + len(password) bytes is encrypted.
You can read about AES-GCM, extract the key, IV, and additional data from their code, and complete the job yourself.
I hope that I have helped. Good luck :)
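For reference, here is a minimal sketch of the AES-GCM primitive itself, using the cryptography package. The key, nonce (IV), and associated data below are placeholders, since the real values would have to be extracted from Instagram's JavaScript:

import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)  # placeholder: the real 256-bit key comes from Instagram's JS
nonce = os.urandom(12)                     # placeholder IV
aad = b"placeholder-additional-data"       # placeholder additional authenticated data

prefix = b"\x00" * 100                     # stand-in for the unknown 100-byte prefix
plaintext = prefix + b"my-plain-password"  # 100 + len(password) bytes, as described above
ciphertext = AESGCM(key).encrypt(nonce, plaintext, aad)
print(len(ciphertext))                     # plaintext length plus the 16-byte GCM tag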
The above code provided by @Andrej Kesely fails for me, but I made some changes by setting the headers (user-agent and Referer) in the first s.get(link) request.
If you get the same error as me:
csrf = re.findall(r"csrf_token\":\"(.*?)\"",r.text)[0]
IndexError: list index out of range
then here is the solution, as described above.
import re
import requests
from bs4 import BeautifulSoup
from datetime import datetime

link = 'https://www.instagram.com/accounts/login/'
login_url = 'https://www.instagram.com/accounts/login/ajax/'
userAgent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"

time = int(datetime.now().timestamp())
payload = {
    'username': '<Your_Username>',
    'enc_password': f'#PWD_INSTAGRAM_BROWSER:0:{time}:<Your_Password>',
    'queryParams': {},
    'optIntoOneTap': 'false'
}

with requests.Session() as s:
    s.headers = {"user-agent": userAgent}
    s.headers.update({"Referer": link})
    r = s.get(link)
    print(r)
    csrf = re.findall(r"csrf_token\":\"(.*?)\"", r.text)[0]
    r = s.post(login_url, data=payload, headers={
        "user-agent": userAgent,
        "x-requested-with": "XMLHttpRequest",
        "referer": "https://www.instagram.com/accounts/login/",
        "x-csrftoken": csrf
    })
    print(r.status_code)
    print(r.url)
    print(r.text)
Please use Python 3 to run this script; otherwise you will get further errors about "formatted string literals" and "AttributeError: 'datetime.datetime' object has no attribute 'timestamp'".
Run the script like this:
python3 <your_script_name.py>
I hope this solves your problem.
You can't. Instagram encrypts the password when the request is sent. Unless you can figure out how they encrypt it and do the same with the password you're sending, you can't log in to Instagram with requests.

Can't get required response using json parameters within get requests

I'm trying to get a JSON response from this webpage using the following approach, but this is what I get: {"message": "Must provide valid one of: query_id, query_hash", "status": "fail"}. I tried to print the response URL (r.url in the second script) to see if it matches the one I tried to send, but I found it different in structure.
If I use the URL directly (taken from dev tools) within requests, I get the required content:
import json
import requests
check_url = 'https://www.instagram.com/graphql/query/?query_hash=7dabc71d3e758b1ec19ffb85639e427b&variables=%7B%22tag_name%22%3A%22instagood%22%2C%22first%22%3A2%2C%22after%22%3A%22QVFDa3djMUFwM1BkRWJNTlEzRmxBYkRGdFBDVzViU2JoNVZPbWNQSmNCTE1HNDlhYWdsdi1EcE5ickhvYjhRWUhqUDhIcXE3YTE4M1JMbmdVN0lMSXM3ZA%3D%3D%22%7D'
r = requests.get(check_url)
print(r.json())
But, I can't make it work:
import json
import requests

url = 'https://www.instagram.com/explore/tags/instagood/'
query_url = 'https://www.instagram.com/graphql/query/?'

payload = {
    "query_hash": "7dabc71d3e758b1ec19ffb85639e427b",
    "variables": {"tag_name": "instagood", "first": "2", "after": "QVFDa3djMUFwM1BkRWJNTlEzRmxBYkRGdFBDVzViU2JoNVZPbWNQSmNCTE1HNDlhYWdsdi1EcE5ickhvYjhRWUhqUDhIcXE3YTE4M1JMbmdVN0lMSXM3ZA=="}
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    r = s.get(query_url, params=json.dumps(payload))
    print(r.content)
How can I make the above script work?
Your problem is connected to how you encode the params.
From the check_url in your first example we can see:
?query_hash=7dabc71d3e758b1ec19ffb85639e427b&variables=%7B%22tag_name%22%3A%22...
This URL has 2 params:
query_hash - string
variables - looks like a URL encoded string, judging by the escape values (%7B%22).
As you have correctly identified, %7B%22 corresponds to {". In other words, the second parameter is a url-escaped JSON string.
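As a quick sanity check (standard library only, assuming the check_url variable from the question), you can decode that query string and inspect the variables value:

from urllib.parse import urlsplit, parse_qs

params = parse_qs(urlsplit(check_url).query)
print(params["query_hash"][0])  # 7dabc71d3e758b1ec19ffb85639e427b
print(params["variables"][0])   # a plain JSON string: {"tag_name":"instagood",...}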
From this we can get a clue about the new solution:
import json
import requests

query_url = 'https://www.instagram.com/graphql/query/?'

variables = {"tag_name": "instagood", "first": "2",
             "after": "QVFDa3djMUFwM1BkRWJNTlEzRmxBYkRGdFBDVzViU2JoNVZPbWNQSmNCTE1HNDlhYWdsdi1EcE5ickhvYjhRWUhqUDhIcXE3YTE4M1JMbmdVN0lMSXM3ZA=="}
payload = {
    "query_hash": "7dabc71d3e758b1ec19ffb85639e427b",
    "variables": json.dumps(variables)
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) ' + \
                              'Chrome/81.0.4044.138 Safari/537.36'
    r = s.get(query_url, params=payload)
    print(r.content)
As you can see, the params argument passed to the requests.get method is a dict with two keys. This gets translated into ?query_hash=value1&variables=value2.
To get the correct value for variables, we just dump the JSON to a string. The requests library takes care of URL-escaping characters like { and " in the string.
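For instance, this (standard library only) produces the same escaping that requests applies under the hood:

from urllib.parse import urlencode

print(urlencode({"variables": '{"tag_name":"instagood"}'}))
# variables=%7B%22tag_name%22%3A%22instagood%22%7D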
While running your code, the URL formed by the API call contains unnecessary escape characters, which is what screws up the API call.
Sending a data payload with a GET request is not recommended. A quick solution is to use a POST request instead. It worked fine!
import json
import requests

url = 'https://www.instagram.com/explore/tags/instagood/'
query_url = 'https://www.instagram.com/graphql/query/?'

payload = {
    "query_hash": "7dabc71d3e758b1ec19ffb85639e427b",
    "variables": {"tag_name": "instagood", "first": "2", "after": "QVFDa3djMUFwM1BkRWJNTlEzRmxBYkRGdFBDVzViU2JoNVZPbWNQSmNCTE1HNDlhYWdsdi1EcE5ickhvYjhRWUhqUDhIcXE3YTE4M1JMbmdVN0lMSXM3ZA=="}
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    r = s.post(query_url, params=json.dumps(payload))
    print(r.content)

Python scraping response data

I would like to get the response data for a specific website.
I have this site:
https://enjoy.eni.com/it/milano/map/
and if I open the browser debugger console I can see a POST request that gives a JSON response.
How can I capture this response in Python by scraping the website?
Thanks
Apparently the web service validates a PHPSESSID cookie, so we need to obtain it first using a proper user agent:
import requests
import json

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'
}

# First request: load the map page to obtain the PHPSESSID cookie
r = requests.get('https://enjoy.eni.com/it/milano/map/', headers=headers)
session_id = r.cookies['PHPSESSID']

# Attach the session cookie and call the vehicles endpoint directly
headers['Cookie'] = 'PHPSESSID={};'.format(session_id)
res = requests.post('https://enjoy.eni.com/ajax/retrieve_vehicles', headers=headers, allow_redirects=False)
json_obj = json.loads(res.content)
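From there, json_obj is an ordinary Python structure you can inspect or iterate over, for example (assuming the endpoint returns JSON-serializable vehicle records):

print(json.dumps(json_obj, indent=2))  # pretty-print the vehicle data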
