Web scraping: identifying, executing, and troubleshooting a request - Python

I am having some trouble scraping data from the following website:
https://www.loft.com.br/apartamentos/sao-paulo-sp?q=pin
When the page loads, it shows the first ~30 real estate posts for the city of São Paulo.
If we scroll down, it loads more posts.
Usually I would use Selenium to get around this, but I want to learn how to do it properly, which I imagine means working with requests directly.
Using Chrome's inspector and watching what happens when we scroll down, I can see a request being made which I presume is what retrieves the new posts.
If I copy its content as cURL, I get the following command:
curl "https://landscape-api.loft.com.br/listing/search?city=S^%^C3^%^A3o^%^20Paulo^&facetFilters^\[^\]=address.city^%^3AS^%^C3^%^A3o^%^20Paulo^&limit=18^&limitedColumns=true^&loftUserId=417b37df-19ab-4014-a800-688c5acc039d^&offset=28^&orderBy^\[^\]=rankB^&orderByStatus=^%^27FOR_SALE^%^27^%^2C^%^20^%^27JUST_LISTED^%^27^%^2C^%^20^%^27DEMOLITION^%^27^%^2C^%^20^%^27COMING_SOON^%^27^%^20^%^2C^%^20^%^27SOLD^%^27^&originType=LISTINGS_LOAD_MORE^&q=pin^&status^\[^\]=FOR_SALE^&status^\[^\]=JUST_LISTED^&status^\[^\]=DEMOLITION^&status^\[^\]=COMING_SOON^&status^\[^\]=SOLD" ^
-X "OPTIONS" ^
-H "Connection: keep-alive" ^
-H "Accept: */*" ^
-H "Access-Control-Request-Method: GET" ^
-H "Access-Control-Request-Headers: loft_user_id,loftuserid,utm_campaign,utm_content,utm_created_at,utm_id,utm_medium,utm_source,utm_term,utm_user_agent,x-user-agent,x-utm-source,x-utm-user-id" ^
-H "Origin: https://www.loft.com.br" ^
-H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36" ^
-H "Sec-Fetch-Mode: cors" ^
-H "Sec-Fetch-Site: same-site" ^
-H "Sec-Fetch-Dest: empty" ^
-H "Referer: https://www.loft.com.br/" ^
-H "Accept-Language: en-US,en;q=0.9" ^
--compressed
I was unsure of the proper way to convert this into a call to the Python requests module, so I used this website - https://curl.trillworks.com/ - to do it.
The result is:
import requests

headers = {
    'Connection': 'keep-alive',
    'Accept': '*/*',
    'Access-Control-Request-Method': 'GET',
    'Access-Control-Request-Headers': 'loft_user_id,loftuserid,utm_campaign,utm_content,utm_created_at,utm_id,utm_medium,utm_source,utm_term,utm_user_agent,x-user-agent,x-utm-source,x-utm-user-id',
    'Origin': 'https://www.loft.com.br',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-site',
    'Sec-Fetch-Dest': 'empty',
    'Referer': 'https://www.loft.com.br/',
    'Accept-Language': 'en-US,en;q=0.9',
}

params = (
    ('city', 'S\xE3o Paulo'),
    ('facetFilters/[/]', 'address.city:S\xE3o Paulo'),
    ('limit', '18'),
    ('limitedColumns', 'true'),
    ('loftUserId', '417b37df-19ab-4014-a800-688c5acc039d'),
    ('offset', '28'),
    ('orderBy/[/]', 'rankB'),
    ('orderByStatus', "'FOR_SALE', 'JUST_LISTED', 'DEMOLITION', 'COMING_SOON' , 'SOLD'"),
    ('originType', 'LISTINGS_LOAD_MORE'),
    ('q', 'pin'),
    ('status/[/]', ['FOR_SALE', 'JUST_LISTED', 'DEMOLITION', 'COMING_SOON', 'SOLD']),
)

response = requests.options('https://landscape-api.loft.com.br/listing/search', headers=headers, params=params)
However, when I try to run it, I get a 204.
So my questions are:
What is the proper/best way to identify requests from this website? Are there better alternatives to what I did?
Once a request is identified, is "copy as cURL" the best way to replicate it?
What is the best way to replicate the request in Python?
Why am I getting a 204?

Your way of finding requests is correct, but you need to find and analyze the correct request.
As for why you get a 204 response code with no results: you are sending an OPTIONS request instead of a GET. In Chrome DevTools you can see two similar requests; the first is the OPTIONS (the CORS preflight) and the second is the GET with type xhr.
For the website's data you need the second one, but your code used requests.options(...).
To see the response of a request, select it and check the Response or Preview tab.
One of the best HTTP libraries in Python is requests.
And here's complete code to get all the search results:
import requests

headers = {
    'x-user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_2_0) AppleWebKit/537.36 (KHTML, like Gecko) '
                    'Chrome/88.0.4324.146 Safari/537.36',
    'utm_created_at': '',
    'Accept': 'application/json, text/plain, */*',
}

with requests.Session() as s:
    s.headers = headers
    listings = []
    limit = 18
    offset = 0
    while True:
        params = {
            "city": "São Paulo",
            # the cmd-escaped cURL wrote these keys as ^\[^\]; the literal
            # wire format is plain brackets, e.g. facetFilters[]
            "facetFilters[]": "address.city:São Paulo",
            "limit": limit,
            "limitedColumns": "true",
            # "loftUserId": "a2531ad4-cc3f-49b0-8828-e78fb489def8",
            "offset": offset,
            "orderBy[]": "rankA",
            "orderByStatus": "'FOR_SALE', 'JUST_LISTED', 'DEMOLITION', 'COMING_SOON' , 'SOLD'",
            "originType": "LISTINGS_LOAD_MORE",
            "q": "pin",
            "status[]": ["FOR_SALE", "JUST_LISTED", "DEMOLITION", "COMING_SOON", "SOLD"]
        }
        r = s.get('https://landscape-api.loft.com.br/listing/search', params=params)
        r.raise_for_status()
        data = r.json()
        listings.extend(data["listings"])
        offset += limit
        total = data["pagination"]["total"]
        # stop when a page comes back empty or everything has been collected
        if len(data["listings"]) == 0 or len(listings) == total:
            break

print(len(listings))
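A note on the loop's design: each iteration advances offset by limit, mirroring the site's "load more" behavior, and the loop stops either when a page comes back empty or when the number of collected listings reaches the total reported under data["pagination"]["total"], whichever happens first.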

1- You did it the proper way! I have been doing it the same way for a long time, and based on my web-scraping experience, the browser's Network tab is by far the best way to get information about the requests a website makes, better than any extension and/or plugin that I know of! There is also Burp Suite, on Kali Linux or on Windows, but again, the Network tab in the browser is always my number one choice!
2- I have been using the same website that you mentioned! It makes my life easier and works seamlessly. Of course, you could do the conversion manually, but the website you mentioned makes it faster, and I have been using it for a long time.
3- You could do it manually; it's pretty straightforward. But like I said, the website you mentioned makes it easier and faster.
4- It's probably because you're using requests.options; I would use requests.get instead, as sketched below.
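For point 4, here is a minimal sketch of that one-line fix, reusing the headers and params objects from the question (and keeping in mind the bracket-key fix noted in the first answer's code):

import requests

# Same URL, headers, and params as in the question, but issued as a GET
# instead of the CORS preflight OPTIONS that returned the empty 204.
response = requests.get('https://landscape-api.loft.com.br/listing/search',
                        headers=headers, params=params)
print(response.status_code)  # the GET, unlike the preflight, carries a body
print(response.json())       # the endpoint returns JSON, as seen in the browser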

Related

Connect to NordVPN using Python on macOS without using command-line tools

So, I want to get a few Google search results for a machine learning app without getting blocked. I want to use a Python script to rotate my IP address while making requests, to avoid getting blocked by Google, but I can't seem to get the script working: I can't find an API endpoint through which I can connect to NordVPN.
I tried to figure out the endpoint using the Chrome extension and by inspecting its web page, but it was of no use.
Currently I'm stuck at this issue.
My code:
import requests

access_token = 'my-secret-token'

# Get a list of available server groups
server_groups_url = "https://api.nordvpn.com/server"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/89.0.4389.82 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9,fr;q=0.8,es;q=0.7',
    'Accept-Encoding': 'gzip',
    'Accept': 'application/json',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'TE': 'Trailers',
    'Authorization': f"Bearer {access_token}"
}
server_groups = requests.get(server_groups_url, headers=headers).json()

# Choose a server (e.g. the first server in the list)
hostname = server_groups[0]['domain']
The hostname in the code comes back as something like 'p119.nordvpn.com'.
I don't know how to connect to this VPN using Python code. Can someone help me?

How to bypass Cloudflare with Python on GET requests?

I want to bypass Cloudflare on a GET request. I have tried using Cloudscraper, which worked for me in the past but now seems deprecated.
I tried:
import cloudscraper
import requests

ses = requests.Session()
ses.headers = {
    'referer': 'https://magiceden.io/',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36',
    'accept': 'application/json'
}

scraper = cloudscraper.create_scraper(sess=ses)
hookLink = f"https://magiceden.io/launchpad/planetarians"
meG = scraper.get("https://api-mainnet.magiceden.io/launchpads/planetarians")
print(meG.status_code)
print(meG.text)
The issue seems to be that I'm getting a captcha on the request.
The Python library works well (I never knew about it); the issue is your user agent. Cloudflare uses some sort of extra check to determine whether you're faking it.
For me, any of the following works:
ses.headers = {
    'referer': 'https://magiceden.io/',
    'accept': 'application/json'
}

ses.headers = {
    'accept': 'application/json'
}

And also just:

scraper = cloudscraper.create_scraper()
meG = scraper.get("https://api-mainnet.magiceden.io/launchpads/planetarians")
EDIT:
You can use this dict syntax instead to fake the user agent (as per the manual):

scraper = cloudscraper.create_scraper(
    browser={
        'browser': 'chrome',
        'platform': 'windows',
        'desktop': True
    }
)
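Putting the pieces together, a minimal end-to-end sketch based on the snippets above (the browser dict and the endpoint are taken verbatim from this answer; whether Cloudflare lets a given request through can of course change over time):

import cloudscraper

# Let cloudscraper generate a consistent browser fingerprint instead of
# pinning a user-agent header by hand.
scraper = cloudscraper.create_scraper(
    browser={'browser': 'chrome', 'platform': 'windows', 'desktop': True}
)
meG = scraper.get("https://api-mainnet.magiceden.io/launchpads/planetarians")
print(meG.status_code)
if meG.ok:
    print(meG.json())  # the launchpad endpoint returns JSON when not challenged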

Python web scraping - len(containers) always returning 0

I am trying to web-scrape Pokémon information from the online Pokédex, but I'm having trouble with the findAll() function. I've got:
containers = page_soup.findAll("div", {"class": "pokemon-info"})
but I'm not sure if this div is where I need to be looking at all, because this div sits inside a li (per the page's HTML), so perhaps I should search for that instead, like so:
containers = page_soup.findAll("li", {"class": "animating"})
In both cases, however, len(containers) always returns 0, even though there are several entries on the page.
I also tried find_all(), but the results of len() are the same.
The problem is that BeautifulSoup can't execute JavaScript. As furas said, you should open the web page with JavaScript turned off and see if you can still access what you want. If you can't, you need something like Selenium to control a real browser, as sketched below.
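For completeness, a minimal sketch of that Selenium fallback (this assumes selenium 4+ and a working Chrome/ChromeDriver install; the selector is the one from the question):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.implicitly_wait(10)  # give the page's JavaScript time to render the list
driver.get("https://www.pokemon.com/us/pokedex/")
containers = driver.find_elements(By.CSS_SELECTOR, "div.pokemon-info")
print(len(containers))  # non-zero once the entries have rendered
driver.quit()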
As the other comments and answer suggested, the site loads its data in the background. The most common response to this is to use Selenium; my approach is to first check for any API calls in Chrome. Luckily for us, the page retrieves 953 Pokémon on load.
Below is a script that retrieves the clean JSON data, and here is a little article I wrote explaining why I reach for the Chrome developer tools before Selenium.
# Gotta catch em all
import requests
import pandas as pd

headers = {
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Referer': 'https://www.pokemon.com/us/pokedex/',
    'Connection': 'keep-alive',
}

r = requests.get('https://www.pokemon.com/us/api/pokedex/kalos', headers=headers)
j = r.json()
print(j[0])
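Since pandas is imported but unused in the script above, a natural follow-up is to tabulate the JSON (continuing from the script; the exact columns depend on the API's response shape):

df = pd.DataFrame(j)  # one row per pokemon record
print(df.head())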

Web scraper returns empty HTML while the Chrome browser works; already tried User-Agent

I am a rookie just learning Python; however, for our Bachelor's thesis we need the data from the following website (it's just municipal financial data from the Latvian government):
https://e2.kase.gov.lv/pub5.5_pasv/code/pub.php?module=pub
So far I have done the following:
1. Got frustrated that this is not a simple HTML page and that it has this 'interactive' header (sorry, my knowledge of how to interact with it is very limited).
2. By using the Chrome dev tools and the Network tab, found out that I can use the following URL to request the period, municipality, financial statement, etc. that I need: https://e2.kase.gov.lv/pub5.5_pasv/code/ajax.php?module=pub&job=getDoc&period_id=1626&org_id=2542&blank_id=200079&currency_id=2&editable=0&type=HTML
3. Created basic Python code to get the HTML at that URL (see below).
4. Found out that it returns empty data. Thought this was a bug; however, the response code is 200, which as I understand means the request was successful.
5. Tested the URL in different browsers, and lo and behold: it works in Chrome, but in Microsoft Edge it returns an empty blank page.
6. Read somewhere that I have to 'introduce' myself to the server, and tried to use headers and a User-Agent both manually and via the fake_useragent library with a Chrome user agent. Yet it still doesn't work.
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get("https://e2.kase.gov.lv/pub5.5_pasv/code/ajax.php?module=pub&job=getDoc&period_id=1626&org_id=2542&blank_id=200079&currency_id=2&editable=1&type=HTML", headers=headers)
print(r.text)
So I'm stuck at point 6. The URL works well in Chrome but does not work in Edge. And it seems my Python code gets the same blank page the Edge browser gets, with no data whatsoever.
I would appreciate it a lot if anyone could at least point me in the right direction or give me some reading material, because right now I have no idea how to configure my Python code to reproduce the HTML output from Chrome, or whether this is even a legitimate (or good) way to approach obtaining this data.
EDIT: Sorry guys, I found out that it is not possible to access this website from outside Latvia; however, I have found a solution (see below).
Solved the problem.
Previously, when imitating a browser, I only used the following header:
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Mobile Safari/537.36'
}
It turns out I had to include all of the request headers the browser sends to the server (found through the Chrome dev tools), like so:
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9',
    'Connection': 'keep-alive',
    'Cookie': 'Cookie; Cookie',
    'DNT': '1',
    'Host': 'e2.kase.gov.lv',
    'Referer': 'https://e2.kase.gov.lv/pub5.5_pasv/code/pub.php?module=pub',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Mobile Safari/537.36'
}
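With the full header dict in place, the request from the question works; here is a minimal sketch putting it together (the ajax.php URL and its query parameters are the ones found through dev tools above, and 'Cookie; Cookie' is the redacted placeholder for the real cookie values copied from the browser):

import requests

# headers as defined above, including the (redacted) Cookie value
url = ("https://e2.kase.gov.lv/pub5.5_pasv/code/ajax.php"
       "?module=pub&job=getDoc&period_id=1626&org_id=2542"
       "&blank_id=200079&currency_id=2&editable=0&type=HTML")
r = requests.get(url, headers=headers)
r.raise_for_status()
print(r.text)  # should now contain the report HTML instead of a blank page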

How can I query the result without Selenium in Python or Ruby?

I'm trying to track ticket price trends.
Currently, I'm using Selenium to simulate submitting the forms, but as you know, Selenium is slow and consumes a lot of memory.
When you submit the form, it redirects you to a new URL, http://makeabooking.flyscoot.com/Flight/Select, and I can't simply rewrite the query as something like http://makeabooking.flyscoot.com/Flight/from={TPE}&to={NYK}&date={2015-10-12} to fetch the result, so I have no idea how to do this without Selenium.
Any idea how to do this in Ruby or Python with SSL proxy and HTTP proxy support?
Sample website: http://www.flyscoot.com/index.php/en/
You can easily get the cURL version of a request from Chrome and reuse it:
F12 > Network > request > Right Click > Copy as cURL
curl 'http://makeabooking.flyscoot.com/Flight/Select' -H 'Accept-Encoding: gzip, deflate, sdch' -H 'Accept-Language: en-US,en;q=0.8,tr;q=0.6' -H 'Upgrade-Insecure-Requests: 1' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,/;q=0.8' -H 'Referer: http://www.flyscoot.com/index.php/en/' -H 'Cookie: optimizelyEndUserId=oeu1444666692081r0.12463579000905156; __utmt=1; granify.lasts#1345=1444666699786; ASP.NET_SessionId=lql5yzv1l3yatkh1lcumg2e5; dotrez=1209262602.20480.0000; optimizelySegments=%7B%222335550040%22%3A%22gc%22%2C%222344180004%22%3A%22referral%22%2C%222354350067%22%3A%22false%22%2C%222355380121%22%3A%22none%22%7D; optimizelyBuckets=%7B%223025070068%22%3A%223020800213%22%7D; __utma=185425846.733949751.1444666694.1444666694.1444666694.1; __utmb=185425846.2.10.1444666694; __utmc=185425846; __utmz=185425846.1444666694.1.1.utmcsr=stackoverflow.com|utmccn=(referral)|utmcmd=referral|utmcct=/questions/33084039/how-could-i-query-the-result-without-selenium-on-python-or-ruby; granify.uuid=68b0d8e8-d068-40d8-9068-3098e870b858; granify.session#1345=1444666699786; granify.flags#1345=8; _gr_ep_sent=1; _gr_er_sent=1; granify.session_init#1345=2; optimizelyPendingLogEvents=%5B%5D' -H 'Connection: keep-alive' -H 'X-FirePHP-Version: 0.0.6' -H 'Cache-Control: max-age=0' --compressed
If you set the headers and cookie info correctly, you can use Python requests and simulate the browser this way. To convert the cURL command to Python requests you can use this link. See the resulting Python requests code:
import requests

cookies = {
    'optimizelyEndUserId': 'oeu1444666692081r0.12463579000905156',
    '__utmt': '1',
    'granify.lasts#1345': '1444666699786',
    'ASP.NET_SessionId': 'lql5yzv1l3yatkh1lcumg2e5',
    'dotrez': '1209262602.20480.0000',
    'optimizelySegments': '%7B%222335550040%22%3A%22gc%22%2C%222344180004%22%3A%22referral%22%2C%222354350067%22%3A%22false%22%2C%222355380121%22%3A%22none%22%7D',
    'optimizelyBuckets': '%7B%223025070068%22%3A%223020800213%22%7D',
    '__utma': '185425846.733949751.1444666694.1444666694.1444666694.1',
    '__utmb': '185425846.2.10.1444666694',
    '__utmc': '185425846',
    '__utmz': '185425846.1444666694.1.1.utmcsr=stackoverflow.com|utmccn=(referral)|utmcmd=referral|utmcct=/questions/33084039/how-could-i-query-the-result-without-selenium-on-python-or-ruby',
    'granify.uuid': '68b0d8e8-d068-40d8-9068-3098e870b858',
    'granify.session#1345': '1444666699786',
    'granify.flags#1345': '8',
    '_gr_ep_sent': '1',
    '_gr_er_sent': '1',
    'granify.session_init#1345': '2',
    'optimizelyPendingLogEvents': '%5B%5D',
}

headers = {
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-US,en;q=0.8,tr;q=0.6',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Referer': 'http://www.flyscoot.com/index.php/en/',
    'Connection': 'keep-alive',
    'X-FirePHP-Version': '0.0.6',
    'Cache-Control': 'max-age=0',
}

requests.get('http://makeabooking.flyscoot.com/Flight/Select', headers=headers, cookies=cookies)
If you save the result, you can see that the result is the same as the browser produces (open stack1.html):

r = requests.get('http://makeabooking.flyscoot.com/Flight/Select', headers=headers, cookies=cookies)
with open("stack1.html", "w") as f:
    f.write(r.text)  # r.text is a decoded str; writing r.content (bytes) would need mode "wb"
I think this answer https://stackoverflow.com/a/1196151/1033953 is what you're looking for.
You'll need to inspect the parameters on that form to make sure you're posting the right values, but then you just need Ruby's net/http to send the HTTP POST.
I'm sure Python has something similar. Or you could use curl to post, as shown in this answer https://superuser.com/a/149335. A Python sketch follows below.
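In Python, the equivalent POST is a few lines with requests. The form field names below are purely hypothetical placeholders; the real names must be read off the booking form in the browser's dev tools:

import requests

# Hypothetical field names; inspect the actual <form> on flyscoot.com
# for the real names and values before relying on this.
form_data = {
    "origin": "TPE",                # placeholder
    "destination": "NYK",           # placeholder
    "departureDate": "2015-10-12",  # placeholder
}
proxies = {  # optional HTTP/SSL proxy support, as the question asked
    "http": "http://myproxy:8080",   # placeholder proxy address
    "https": "http://myproxy:8080",
}
r = requests.post("http://makeabooking.flyscoot.com/Flight/Select",
                  data=form_data, proxies=proxies)
print(r.status_code)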
