Python Selenium hard webscraping - python

The website is: https://www.jao.eu/auctions#/
There you see the 'OUT AREA' dropdown (I see a lot of ReactSelect...).
I need to get the full list of items contained in that dropdown [AT, BDL-GB, BDL-NL, BE...].
Can you please help me? My attempt so far:
wait = WebDriverWait(driver, 20)
driver.get('https://www.jao.eu/auctions#/')
first = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '.css-1739xgv-control')))
first.click()
second = wait.until(......

Logging your network traffic reveals that the page makes several requests to REST APIs, one endpoint being getcorridors, whose JSON response contains all values from the dropdown(s). All you need to do is imitate that HTTP POST request. No Selenium required:
def get_corridors():
    import requests
    from operator import itemgetter

    url = "https://www.jao.eu/api/v1/auction/calls/getcorridors"
    headers = {
        "Accept": "application/json",
        "Accept-Encoding": "gzip, deflate",
        "Content-Type": "application/json",
        "User-Agent": "Mozilla/5.0"
    }

    response = requests.post(url, headers=headers, json={})
    response.raise_for_status()
    return list(map(itemgetter("value"), response.json()))

def main():
    for corridor in get_corridors():
        print(corridor)
    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
IT-CH
HU-SK
ES-PT
FR-IT
SK-CZ
NL-DK
IT-FR
HU-HR
FR-ES
IT-GR
CZ-AT
DK-NL
SI-AT
CH-DE
...

Try the following to fetch the required list of items from that site using the requests module:
import requests

link = 'https://www.jao.eu/api/v1/auction/calls/getcorridors'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    res = s.post(link, json={})
    items = [item['value'] for item in res.json()]
    print(items)
Output is like (truncated):
'IT-CH', 'HU-SK', 'ES-PT', 'FR-IT', 'SK-CZ', 'NL-DK', 'IT-FR', 'HU-HR'

Related

Getting a 403 error on a webscraping script

I have a web scraping script that has recently run into a 403 error.
It worked for a while with just the basic code but now has been running into 403 errors.
I've tried using user agents to circumvent this and it very briefly worked, but those are now getting a 403 error too.
Does anyone have any idea how to get this script running again?
If it helps, here is some context:
The purpose of the script is to find out which artists are on which Tidal playlists; for the purpose of this question I have only included the snippet of code that fetches the site, as that is where the error occurs.
Thanks in advance!
The basic code looks like this:
baseurl = 'https://tidal.com/browse'

for i in platformlist:
    url = baseurl + str(i[0])
    tidal = requests.get(url)
    tidal.raise_for_status()
    if tidal.status_code != 200:
        print("Website Error: ", url)
        pass
    else:
        soup = bs4.BeautifulSoup(tidal.text, "lxml")
        text = str(soup)
        text2 = text.lower()
With user-agents:
user_agent_list = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
]

url = 'https://tidal.com/playlist/1b418bb8-90a7-4f87-901d-707993838346'

for i in range(1, 4):
    # Pick a random user agent
    user_agent = random.choice(user_agent_list)
    # Set the headers
    headers = {'User-Agent': user_agent}
    # Make the request
    tidal = requests.get(url, headers=headers)
    print("Request #%d\nUser-Agent Sent:%s\n\nHeaders Received by HTTPBin:" % (i, user_agent))
    print(tidal.status_code)
    print("-------------------")
    # tidal = requests.get(webpage)
    tidal.raise_for_status()
    print(tidal.status_code)
    # make webpage content legible
    soup = bs4.BeautifulSoup(tidal.text, "lxml")
    print(soup)
    # turn bs4 type content into text
    text = str(soup)
    text2 = text.lower()
I'd like to suggest an alternative solution - one that doesn't involve BeautifulSoup.
I visited the main page and clicked on an album while logging my network traffic. I noticed that the browser made an HTTP POST request to a GraphQL API, which accepts a custom query string as part of the POST payload that dictates the format of the response data. The response is JSON and contains everything the original query asked for (in this case, all artists for every track of a playlist). The page normally uses this API to populate itself asynchronously with JavaScript when it is viewed in a browser, as intended.
Since we have the API endpoint, request headers and POST payload, we can imitate that request in Python to get a JSON response:
def main():
    import requests

    url = "https://tidal.com/browse/api"
    headers = {
        "accept": "application/json",
        "accept-encoding": "gzip, deflate",
        "content-type": "application/json",
        "user-agent": "Mozilla/5.0"
    }

    query = """
    query ($playlistId: String!) {
        playlist(uuid: $playlistId) {
            creator {
                name
            }
            title
            tracks {
                albumID
                albumTitle
                artists {
                    id
                    name
                }
                id
                title
            }
        }
    }
    """

    payload = {
        "operationName": None,
        "query": query,
        "variables": {
            "playlistId": "1b418bb8-90a7-4f87-901d-707993838346"
        }
    }

    response = requests.post(url, headers=headers, json=payload)
    response.raise_for_status()

    playlist = response.json()["data"]["playlist"]

    print("Artists in playlist \"{}\":".format(playlist["title"]))
    for track_number, track in enumerate(playlist["tracks"], start=1):
        artists = ", ".join(artist["name"] for artist in track["artists"])
        print("Track #{} [{}]: {}".format(track_number, track["title"], artists))

    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
Artists in playlist "New Arrivals":
Track #1 [Higher Power]: Coldplay
Track #2 [i n t e r l u d e]: J. Cole
Track #3 [Fast (Motion)]: Saweetie
Track #4 [Miss The Rage]: Trippie Redd, Playboi Carti
Track #5 [In My Feelings (feat. Quavo & Young Dolph)]: Tee Grizzley, Quavo, Young Dolph
Track #6 [Thumbin]: Kash Doll
Track #7 [Tiempo]: Ozuna
...
You can change the playlistId key-value pair in the payload dictionary to get the artist information for any playlist.
Take a look at this other answer I posted, where I go more in-depth on how to log your network traffic, finding API endpoints and imitating requests.

Can't login to Instagram using requests

I'm trying to log in to Instagram using the requests library. I succeeded with the following script, but it doesn't work anymore. The password field is now encrypted (I checked the dev tools while logging in manually).
I've tried :
import re
import requests
from bs4 import BeautifulSoup

link = 'https://www.instagram.com/accounts/login/'
login_url = 'https://www.instagram.com/accounts/login/ajax/'

payload = {
    'username': 'someusername',
    'password': 'somepassword',
    'enc_password': '',
    'queryParams': {},
    'optIntoOneTap': 'false'
}

with requests.Session() as s:
    r = s.get(link)
    csrf = re.findall(r"csrf_token\":\"(.*?)\"", r.text)[0]
    r = s.post(login_url, data=payload, headers={
        "user-agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36",
        "x-requested-with": "XMLHttpRequest",
        "referer": "https://www.instagram.com/accounts/login/",
        "x-csrftoken": csrf
    })
    print(r.status_code)
    print(r.url)
I found using dev tools:
username: someusername
enc_password: #PWD_INSTAGRAM_BROWSER:10:1592421027:ARpQAAm7pp/etjy2dMjVtPRdJFRPu8FAGILBRyupINxLckJ3QO0u0RLmU5NaONYK2G0jQt+78BBDBxR9nrUsufbZgR02YvR8BLcHS4uN8Gu88O2Z2mQU9AH3C0Z2NpDPpS22uqUYhxDKcYS5cA==
queryParams: {"oneTapUsers":"[\"36990119985\"]"}
optIntoOneTap: false
How can I login to Instagram using requests?
You can use authentication version 0 - plain password, no encryption:
import re
import requests
from bs4 import BeautifulSoup
from datetime import datetime

link = 'https://www.instagram.com/accounts/login/'
login_url = 'https://www.instagram.com/accounts/login/ajax/'

time = int(datetime.now().timestamp())

payload = {
    'username': '<USERNAME HERE>',
    'enc_password': f'#PWD_INSTAGRAM_BROWSER:0:{time}:<PLAIN PASSWORD HERE>',  # <-- note the '0' - that means we want to use plain passwords
    'queryParams': {},
    'optIntoOneTap': 'false'
}

with requests.Session() as s:
    r = s.get(link)
    csrf = re.findall(r"csrf_token\":\"(.*?)\"", r.text)[0]
    r = s.post(login_url, data=payload, headers={
        "user-agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36",
        "x-requested-with": "XMLHttpRequest",
        "referer": "https://www.instagram.com/accounts/login/",
        "x-csrftoken": csrf
    })
    print(r.status_code)
    print(r.url)
    print(r.text)
Prints:
200
https://www.instagram.com/accounts/login/ajax/
{"authenticated": true, "user": true, "userId": "XXXXXXXX", "oneTapPrompt": true, "reactivated": true, "status": "ok"}
To replicate the encrypted version instead, you need to do some investigation of their JavaScript.
After a little research, I found that they use AES-GCM with a 256-bit key; there is a prefix of 100 bytes whose meaning I still don't know, the password is concatenated to it, and the whole 100 + len(password) message is then encrypted.
You can read about AES-GCM, extract the key, IV, and additional data from their code, and complete the job yourself.
I hope that I have helped, good luck :)
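For illustration only, here is a minimal sketch of AES-GCM encryption with the cryptography package. It only shows the general shape of the technique described above; the real key, nonce, and 100-byte prefix come from Instagram's JavaScript and are replaced by random placeholder values here:

# pip install cryptography
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Placeholder 256-bit key and 12-byte nonce; Instagram obtains these differently.
key = AESGCM.generate_key(bit_length=256)
nonce = os.urandom(12)

prefix = os.urandom(100)          # stand-in for the unknown 100-byte prefix
password = b"somepassword"
plaintext = prefix + password     # 100 + len(password), as described above

ciphertext = AESGCM(key).encrypt(nonce, plaintext, None)
print(len(ciphertext))            # plaintext length plus the 16-byte GCM tag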
The above code provided by @Andrej Kesely failed for me, but I made some changes by setting the headers (user-agent and Referer) on the first s.get(link) request.
If you get the same error as me:
csrf = re.findall(r"csrf_token\":\"(.*?)\"",r.text)[0]
IndexError: list index out of range
then here is the fix described above.
import re
import requests
from bs4 import BeautifulSoup
from datetime import datetime

link = 'https://www.instagram.com/accounts/login/'
login_url = 'https://www.instagram.com/accounts/login/ajax/'
userAgent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"

time = int(datetime.now().timestamp())

payload = {
    'username': '<Your_Username>',
    'enc_password': f'#PWD_INSTAGRAM_BROWSER:0:{time}:<Your_Password>',
    'queryParams': {},
    'optIntoOneTap': 'false'
}

with requests.Session() as s:
    s.headers = {"user-agent": userAgent}
    s.headers.update({"Referer": link})
    r = s.get(link)
    print(r)
    csrf = re.findall(r"csrf_token\":\"(.*?)\"", r.text)[0]
    r = s.post(login_url, data=payload, headers={
        "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36",
        "x-requested-with": "XMLHttpRequest",
        "referer": "https://www.instagram.com/accounts/login/",
        "x-csrftoken": csrf
    })
    print(r.status_code)
    print(r.url)
    print(r.text)
Please run this script with Python 3; otherwise you will get additional errors about "formatted string literals" and "AttributeError: 'datetime.datetime' object has no attribute 'timestamp'".
Run the script like this:
python3 <your_script_name.py>
I hope this solves your problem.
You can't. Instagram encrypts the password when the request is sent. Unless you can figure out how they encrypt it and do the same with the password you're sending, you can't log in to Instagram with requests.

Download bing image search results using python (custom url)

I want to download Bing image search results using Python code.
Example URL: https://www.bing.com/images/search?q=sketch%2520using%20iphone%2520students
My Python code generates a Bing search URL as shown in the example. The next step is to download all images shown at that link to my local desktop.
In my project I generate some words in Python and my code builds the Bing image search URL. All I need is to download the images shown on that search page using Python.
To download an image, you need to make a request to the image URL that ends with .png, .jpg etc.
But Bing provides an "m" attribute inside each <a> element that stores the needed data in JSON format, from which you can parse the image URL stored under the "murl" key and download it afterward.
To download all images locally to your computer, you can use either of two methods:
# bs4
for index, url in enumerate(soup.select(".iusc"), start=1):
    img_url = json.loads(url["m"])["murl"]
    image = requests.get(img_url, headers=headers, timeout=30)
    query = query.lower().replace(" ", "_")
    if image.status_code == 200:
        with open(f"images/{query}_image_{index}.jpg", 'wb') as file:
            file.write(image.content)

# urllib
for index, url in enumerate(soup.select(".iusc"), start=1):
    img_url = json.loads(url["m"])["murl"]
    query = query.lower().replace(" ", "_")
    opener = req.build_opener()
    opener.addheaders = [("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36")]
    req.install_opener(opener)
    req.urlretrieve(img_url, f"images/{query}_image_{index}.jpg")
In the first case, you use a context manager with open() to save the image locally. In the second case, you use the urllib.request.urlretrieve method from the urllib.request library.
Also, make sure you send a user-agent request header to act like a "real" user visit, because the default requests user-agent is python-requests and websites understand that it is most likely a script sending the request. Check what your user-agent is.
Note: an error might occur with the urllib.request.urlretrieve method when a request hits a captcha or something else that returns an unsuccessful status code. The main problem is that it is hard to check the response code, whereas requests provides a status_code attribute for that (see the error-handling sketch after the urlretrieve example below).
Code and full example in online IDE:
from bs4 import BeautifulSoup
import requests, lxml, json

query = "sketch using iphone students"

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": query,
    "first": 1
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36"
}

response = requests.get("https://www.bing.com/images/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(response.text, "lxml")

for index, url in enumerate(soup.select(".iusc"), start=1):
    img_url = json.loads(url["m"])["murl"]
    image = requests.get(img_url, headers=headers, timeout=30)
    query = query.lower().replace(" ", "_")
    if image.status_code == 200:
        with open(f"images/{query}_image_{index}.jpg", 'wb') as file:
            file.write(image.content)
Using urllib.request.urlretrieve.
from bs4 import BeautifulSoup
import requests, lxml, json
import urllib.request as req

query = "sketch using iphone students"

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": query,
    "first": 1
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36"
}

response = requests.get("https://www.bing.com/images/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(response.text, "lxml")

for index, url in enumerate(soup.select(".iusc"), start=1):
    img_url = json.loads(url["m"])["murl"]
    query = query.lower().replace(" ", "_")
    opener = req.build_opener()
    opener.addheaders = [("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36")]
    req.install_opener(opener)
    req.urlretrieve(img_url, f"images/{query}_image_{index}.jpg")
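As noted above, urlretrieve gives you no status_code to check. A minimal sketch, assuming the same loop variables as the example above, of skipping downloads whose hosts answer with an HTTP error (the helper name save_image is made up for illustration):

import urllib.error
import urllib.request as req

def save_image(img_url: str, filename: str) -> bool:
    # Try the download; skip images whose hosts answer with an error page
    # (e.g. a captcha or 403), which urlretrieve raises as HTTPError.
    try:
        req.urlretrieve(img_url, filename)
        return True
    except urllib.error.HTTPError as error:
        print(f"Skipped {img_url}: HTTP {error.code}")
        return False
    except urllib.error.URLError as error:
        print(f"Skipped {img_url}: {error.reason}")
        return False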
Edit your code to find the designated image URL, then use urllib.request to download it:
import urllib.request as req
imgurl ="https://i.ytimg.com/vi/Ks-_Mh1QhMc/hqdefault.jpg"
req.urlretrieve(imgurl, "image_name.jpg")

Python scraping response data

I would like to get the response data from a specific website.
I have this site:
https://enjoy.eni.com/it/milano/map/
and if I open the browser debugger console I can see a POST request that gives a JSON response.
How can I get this response in Python by scraping the website?
Thanks
Apparently the webservice has a PHPSESSID validation, so we need to get that first using a proper user agent:
import requests
import json

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'
}

r = requests.get('https://enjoy.eni.com/it/milano/map/', headers=headers)
session_id = r.cookies['PHPSESSID']
headers['Cookie'] = 'PHPSESSID={};'.format(session_id)

res = requests.post('https://enjoy.eni.com/ajax/retrieve_vehicles', headers=headers, allow_redirects=False)
json_obj = json.loads(res.content)
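A requests.Session keeps cookies between requests automatically, so the manual PHPSESSID copy can also be left to the session object. A sketch of the same flow under that assumption:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'
}

with requests.Session() as s:
    s.headers.update(headers)
    # The first request stores PHPSESSID in the session's cookie jar.
    s.get('https://enjoy.eni.com/it/milano/map/')
    res = s.post('https://enjoy.eni.com/ajax/retrieve_vehicles', allow_redirects=False)
    json_obj = res.json()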

Post forms using requests on .net website (python)

import requests
from bs4 import BeautifulSoup

headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en-US,en;q=0.5",
    "Connection": "keep-alive",
    "Host": "mcfbd.com",
    "Referer": "https://mcfbd.com/mcf/FrmView_PropertyTaxStatus.aspx",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0"
}

a = requests.session()
soup = BeautifulSoup(a.get("https://mcfbd.com/mcf/FrmView_PropertyTaxStatus.aspx").content)

payload = {
    "ctl00$ContentPlaceHolder1$txtSearchHouse": "",
    "ctl00$ContentPlaceHolder1$txtSearchSector": "",
    "ctl00$ContentPlaceHolder1$txtPropertyID": "",
    "ctl00$ContentPlaceHolder1$txtownername": "",
    "ctl00$ContentPlaceHolder1$ddlZone": "1",
    "ctl00$ContentPlaceHolder1$ddlSector": "2",
    "ctl00$ContentPlaceHolder1$ddlBlock": "2",
    "ctl00$ContentPlaceHolder1$btnFind": "Search",
    "__VIEWSTATE": soup.find('input', {'id': '__VIEWSTATE'})["value"],
    "__VIEWSTATEGENERATOR": "14039419",
    "__EVENTVALIDATION": soup.find("input", {"name": "__EVENTVALIDATION"})["value"],
    "__SCROLLPOSITIONX": "0",
    "__SCROLLPOSITIONY": "0"
}

b = a.post("https://mcfbd.com/mcf/FrmView_PropertyTaxStatus.aspx", headers=headers, data=payload).text
print(b)
Above is my code for this website:
https://mcfbd.com/mcf/FrmView_PropertyTaxStatus.aspx
I checked Firebug and these are the values of the form data.
However, doing this:
b = requests.post("https://mcfbd.com/mcf/FrmView_PropertyTaxStatus.aspx",headers = headers,data = payload).text
print(b)
throws this error:
[ArgumentException]: Invalid postback or callback argument
Is my understanding of submitting forms via requests correct?
1. Open Firebug
2. Submit the form
3. Go to the NET tab
4. On the NET tab, choose the POST tab
5. Copy the form data, as in the code above
I've always wanted to know how to do this. I could use selenium but I thought I'd try something new and use requests
The error you are receiving is expected, because fields like __VIEWSTATE (and others as well) are not static or hardcoded. The proper way to do this is as follows:
Create a Requests Session object. It is also advisable to update it with headers containing a USER-AGENT string -
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36"}
s = requests.session()
Navigate to the specified url -
r = s.get(url)
Use BeautifulSoup4 to parse the html returned -
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.content, 'html5lib')
Populate formdata with the hardcoded values and dynamic values -
formdata = {
    '__VIEWSTATE': soup.find('input', attrs={'name': '__VIEWSTATE'})['value'],
    'field1': 'value1'
}
Then send the POST request using the session object itself -
s.post(url, data=formdata)
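Putting those steps together for the page in the question, a sketch could look like the following. The search field values are copied from the question's payload; which hidden fields the server actually validates (e.g. __SCROLLPOSITIONX, omitted here) is an assumption:

import requests
from bs4 import BeautifulSoup

url = "https://mcfbd.com/mcf/FrmView_PropertyTaxStatus.aspx"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 "
                         "(KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36"}

with requests.Session() as s:
    s.headers.update(headers)
    r = s.get(url)
    soup = BeautifulSoup(r.content, "html.parser")  # or 'html5lib' as suggested above

    # Dynamic ASP.NET fields must be read from the page, not hardcoded.
    formdata = {
        "__VIEWSTATE": soup.find("input", attrs={"name": "__VIEWSTATE"})["value"],
        "__VIEWSTATEGENERATOR": soup.find("input", attrs={"name": "__VIEWSTATEGENERATOR"})["value"],
        "__EVENTVALIDATION": soup.find("input", attrs={"name": "__EVENTVALIDATION"})["value"],
        # Search fields copied from the question's payload.
        "ctl00$ContentPlaceHolder1$ddlZone": "1",
        "ctl00$ContentPlaceHolder1$ddlSector": "2",
        "ctl00$ContentPlaceHolder1$ddlBlock": "2",
        "ctl00$ContentPlaceHolder1$btnFind": "Search",
    }

    result = s.post(url, data=formdata).text
    print(result)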
