Website blocking curl even with a real browser's headers - Python

I noticed that http://www.momondo.com.cn/ is using some magic technology:
curl doesn't work on it. The URL displays fine in a web browser, but curl always times out, even when I add all the headers a web browser would send.
I also tried Python requests and urllib2, but they didn't work either.
C:\Users\Administrator>curl -v -H "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.89 Safari/537.36" -H "Connection: Keep-Alive" -H "Accept-Encoding:gzip, deflate, sdch" -H "Cache-Control:no-cache" -H "Upgrade-Insecure-Requests:1" -H "Accept-Language:zh-CN,zh;q=0.8" -H "Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"
http://www.momondo.com.cn/
* About to connect() to www.momondo.com.cn port 80 (#0)
* Trying 184.50.91.106...
* connected
* Connected to www.momondo.com.cn (184.50.91.106) port 80 (#0)
> GET / HTTP/1.1
> Host: www.momondo.com.cn
> User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.89 Safari/537.36
> Connection: Keep-Alive
> Accept-Encoding:gzip, deflate, sdch
> Cache-Control:no-cache
> Upgrade-Insecure-Requests:1
> Accept-Language:zh-CN,zh;q=0.8
> Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
>
Why and how does this happen? How does Momondo detect and block curl?

How are you setting up the request? If you are using requests, you should use the Session object and set the headers there so they can be easily reused. They don't appear to be doing anything special: connecting to the site directly with telnet (i.e. telnet www.momondo.com.cn 80) and sending the headers generated by the browser (captured via tcpdump, just to be sure) returned content rather than hanging until a timeout. It also pays to look at what CDN (content delivery network) the site is behind; in this case the address resolves to a subdomain at akamaiedge.net, so it may be worth checking why they might have blocked you.
Anyway, using the headers you supplied with a requests.Session object, a response came back:
>>> from requests import Session
>>> session = Session()
>>> session.headers # check the default headers
{'User-Agent': 'python-requests/2.12.5', 'Connection': 'keep-alive', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*'}
>>> session.headers['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
>>> session.headers['Accept-Language'] = 'en-GB,en-US;q=0.8,en;q=0.6,zh-TW;q=0.4'
>>> session.headers['Cache-Control'] = 'max-age=0'
>>> session.headers['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.89 Safari/537.36'
>>> response = session.get('http://www.momondo.com.cn/')
>>> response
<Response [200]>
Doesn't seem to be anything magic at all.

I figured out the reason:
Momondo uses the following checks to block non-browser clients.
It checks the User-Agent; it cannot be curl's default UA.
It checks the "Connection" header; it must be "keep-alive" rather than the "Keep-Alive" I used in my initial test.
It checks the "Accept-Encoding" header; it cannot be empty, but any value works.
Finally, I can use curl to get the content now:
curl -v -H "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X
10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.89
Safari/537.36" -H "Connection: keep-alive" -H "Accept-Encoding:
nothing" http://www.momondo.com.cn/
BTW, I have been doing web scraping for about seven years, and this is the first time I've come across a website using this anti-scraping method. Worth noting.

Related

Web scraping: identifying, executing, and troubleshooting a request

I am having some trouble scraping data from the following website:
https://www.loft.com.br/apartamentos/sao-paulo-sp?q=pin
When the page loads, it shows the first ~30 posts on real estate in the city of São Paulo.
If we scroll down, it loads more posts.
Usually I would use Selenium to get around this, but I want to learn how to do it properly, and I imagine that means fiddling with requests.
By using Inspect on Chrome and watching what happens when we scroll down, I can see a request being made which I presume is what retrieves the new posts.
If I copy it as cURL, I get the following command:
curl "https://landscape-api.loft.com.br/listing/search?city=S^%^C3^%^A3o^%^20Paulo^&facetFilters^\[^\]=address.city^%^3AS^%^C3^%^A3o^%^20Paulo^&limit=18^&limitedColumns=true^&loftUserId=417b37df-19ab-4014-a800-688c5acc039d^&offset=28^&orderBy^\[^\]=rankB^&orderByStatus=^%^27FOR_SALE^%^27^%^2C^%^20^%^27JUST_LISTED^%^27^%^2C^%^20^%^27DEMOLITION^%^27^%^2C^%^20^%^27COMING_SOON^%^27^%^20^%^2C^%^20^%^27SOLD^%^27^&originType=LISTINGS_LOAD_MORE^&q=pin^&status^\[^\]=FOR_SALE^&status^\[^\]=JUST_LISTED^&status^\[^\]=DEMOLITION^&status^\[^\]=COMING_SOON^&status^\[^\]=SOLD" ^
-X "OPTIONS" ^
-H "Connection: keep-alive" ^
-H "Accept: */*" ^
-H "Access-Control-Request-Method: GET" ^
-H "Access-Control-Request-Headers: loft_user_id,loftuserid,utm_campaign,utm_content,utm_created_at,utm_id,utm_medium,utm_source,utm_term,utm_user_agent,x-user-agent,x-utm-source,x-utm-user-id" ^
-H "Origin: https://www.loft.com.br" ^
-H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36" ^
-H "Sec-Fetch-Mode: cors" ^
-H "Sec-Fetch-Site: same-site" ^
-H "Sec-Fetch-Dest: empty" ^
-H "Referer: https://www.loft.com.br/" ^
-H "Accept-Language: en-US,en;q=0.9" ^
--compressed
I am unsure of the proper way to convert this into a call to the Python requests module, so I used this website - https://curl.trillworks.com/ - to do it.
The result is:
import requests

headers = {
    'Connection': 'keep-alive',
    'Accept': '*/*',
    'Access-Control-Request-Method': 'GET',
    'Access-Control-Request-Headers': 'loft_user_id,loftuserid,utm_campaign,utm_content,utm_created_at,utm_id,utm_medium,utm_source,utm_term,utm_user_agent,x-user-agent,x-utm-source,x-utm-user-id',
    'Origin': 'https://www.loft.com.br',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-site',
    'Sec-Fetch-Dest': 'empty',
    'Referer': 'https://www.loft.com.br/',
    'Accept-Language': 'en-US,en;q=0.9',
}

params = (
    ('city', 'S\xE3o Paulo'),
    ('facetFilters/[/]', 'address.city:S\xE3o Paulo'),
    ('limit', '18'),
    ('limitedColumns', 'true'),
    ('loftUserId', '417b37df-19ab-4014-a800-688c5acc039d'),
    ('offset', '28'),
    ('orderBy/[/]', 'rankB'),
    ('orderByStatus', '\'FOR_SALE\', \'JUST_LISTED\', \'DEMOLITION\', \'COMING_SOON\' , \'SOLD\''),
    ('originType', 'LISTINGS_LOAD_MORE'),
    ('q', 'pin'),
    ('status/[/]', ['FOR_SALE', 'JUST_LISTED', 'DEMOLITION', 'COMING_SOON', 'SOLD']),
)

response = requests.options('https://landscape-api.loft.com.br/listing/search', headers=headers, params=params)
However, when I try to run it, I get a 204.
So my questions are:
What is the proper/best way to identify requests from this website? Are there any better alternatives to what I did?
Once identified, is copy as curl the best way to replicate the command?
How to best replicate the command in Python?
Why am I getting a 204?
Your way of finding requests is correct, but you need to find and analyze the correct request.
As for why you get a 204 response code with no results: you send an OPTIONS request instead of a GET. In Chrome DevTools you can see two similar requests; one is the OPTIONS preflight and the second is the GET with type xhr.
For the website you need the second one, but you used OPTIONS in your code (requests.options(...)).
To see the response of a request, select it and check the Response or Preview tab.
One of the best HTTP libraries in Python is requests.
And here's the complete code to get all the search results:
import requests

headers = {
    'x-user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_2_0) AppleWebKit/537.36 (KHTML, like Gecko) '
                    'Chrome/88.0.4324.146 Safari/537.36',
    'utm_created_at': '',
    'Accept': 'application/json, text/plain, */*',
}

with requests.Session() as s:
    s.headers = headers
    listings = list()
    limit = 18
    offset = 0
    while True:
        params = {
            "city": "São Paulo",
            "facetFilters/[/]": "address.city:São Paulo",
            "limit": limit,
            "limitedColumns": "true",
            # "loftUserId": "a2531ad4-cc3f-49b0-8828-e78fb489def8",
            "offset": offset,
            "orderBy/[/]": "rankA",
            "orderByStatus": "\'FOR_SALE\', \'JUST_LISTED\', \'DEMOLITION\', \'COMING_SOON\' , \'SOLD\'",
            "originType": "LISTINGS_LOAD_MORE",
            "q": "pin",
            "status/[/]": ["FOR_SALE", "JUST_LISTED", "DEMOLITION", "COMING_SOON", "SOLD"]
        }
        r = s.get('https://landscape-api.loft.com.br/listing/search', params=params)
        r.raise_for_status()
        data = r.json()
        listings.extend(data["listings"])
        offset += limit
        total = data["pagination"]["total"]
        if len(data["listings"]) == 0 or len(listings) == total:
            break

print(len(listings))
1- You did it the proper way! I have been doing it the same way for a long time, and based on my experience with web scraping, using your browser's Network tab is by far the best way to get info about the requests made on a website, better than any extension and/or plugin that I know of! There is also Burp Suite on Kali Linux or on Windows, but again, the Network tab in the browser is always my number one choice!
2- I have been using the same website that you mentioned! It makes my life easier and works seamlessly. Of course, you could do it manually, but the website you mentioned makes it easier and faster, and I have been using it for a long time!
3- You could do it manually, it's pretty straightforward, but like I said, the website you mentioned makes it easier and faster.
4- It's probably because you're using requests.options; I would try requests.get instead!

Urllib request takes too long to respond [duplicate]

I am trying to open a URL with Python 3:
import urllib.request
fp = urllib.request.urlopen("http://lebed.com/")
mybytes = fp.read()
mystr = mybytes.decode("utf8")
fp.close()
print(mystr)
But it hangs on the second line.
What's the reason for this problem, and how can I fix it?
I suppose the reason is that the site does not allow robot visits. You need to fake a browser visit by sending browser headers along with your request:
import urllib.request

url = "http://lebed.com/"
req = urllib.request.Request(
    url,
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)
f = urllib.request.urlopen(req)
Tried this one on my system and it works.
Agreed with Arpit Solanki. Here is the output for a failed request vs. a successful one.
Failed
GET / HTTP/1.1
Accept-Encoding: identity
Host: www.lebed.com
Connection: close
User-Agent: Python-urllib/3.5
Success
GET / HTTP/1.1
Accept-Encoding: identity
Host: www.lebed.com
Connection: close
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36

Returning 403 Forbidden from simple get but loads okay in browser

I'm trying to get some data from a page, but it's returning the error [403 Forbidden].
I thought it was the user agent, but I tried several user agents and it still returns the error.
I also tried the fake-useragent library, but I did not succeed.
import requests
from fake_useragent import UserAgent

with requests.Session() as c:
    url = '...'
    #headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2224.3 Safari/537.36'}
    ua = UserAgent()
    header = {'User-Agent': str(ua.chrome)}
    page = c.get(url, headers=header)
    print page.content
When I access the page manually, everything works.
I'm using Python 2.7.14 and the requests library. Any idea?
The site could be using anything in the request to trigger the rejection.
So, copy all headers from the request that your browser makes. Then delete them one by one¹ to find out which are essential.
As per "Python requests. 403 Forbidden", to add custom headers to the request, do:
result = requests.get(url, headers={'header':'value', <etc>})
¹ A faster way would be to delete half of them each time instead, but that's more complicated since there are probably multiple essential headers.
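As a rough sketch of that elimination process (the function name here is made up for the example, and it assumes a 200 status marks a successful request):

import requests

def find_essential_headers(url, headers):
    # Drop headers one at a time and keep only those whose removal
    # breaks the request. The halving trick above would be a faster
    # refinement of the same idea.
    essential = dict(headers)
    for name in list(essential):
        trial = {k: v for k, v in essential.items() if k != name}
        if requests.get(url, headers=trial, timeout=10).status_code == 200:
            del essential[name]  # this header was not needed
    return essential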
These are all the headers I can see the browser including for a generic GET request:
Host: <URL>
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br
Connection: keep-alive
Upgrade-Insecure-Requests: 1
Try including all of those in your request incrementally (one by one) to identify which one(s) are required for a successful request.
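A hedged sketch of that incremental approach (the target URL is a placeholder; Host is omitted because requests sets it automatically):

import requests

# Headers from the list above, added one at a time until the
# request succeeds.
browser_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
}

url = 'https://example.com/'  # placeholder: use the page that returns 403
added = {}
for name, value in browser_headers.items():
    added[name] = value
    status = requests.get(url, headers=added, timeout=10).status_code
    print('after adding %s: %s' % (name, status))
    if status == 200:
        break  # this subset of headers is enough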
On the other hand, take a look at the Cookies and/or Security tabs available in your browser's developer tools under the Network option.

Python; Submitting Web Form, How to use a search bar?

I have been trying to search a website and collect URLs to the articles it yields, but I have run into problems I don't understand.
There are several similar questions on here already, but none I've found which deal with how to deconstruct the html and what steps are needed to find and submit a web form.
When researching how to submit web forms, I found out about Selenium, for which I can't find working Python 3 examples or good documentation. On another SO question I found a CodeProject link which has gotten me the most progress so far; the code is featured below: https://www.codeproject.com/Articles/873060/Python-Search-Youtube-for-Video
That said, I don't yet understand why it works or what variables I will need to change in order to harvest results from another website. Namely, the website I'm looking to search is: https://globenewswire.com/Search
So my question is this: whether by web form submission or by proper URL formation, how can I obtain the search results HTML?
Here is the code I had been using to formulate the post search url:
name = input()
name = name.replace(' ', '%20')
url = 'https://globenewswire.com/Search/NewsSearch?keyword=' + name + '#'
Here is the code featured on the code project link:
import urllib.request
import urllib.parse
import re
query_string = urllib.parse.urlencode({"search_query" : input()})
html_content = urllib.request.urlopen("http://www.youtube.com/results?" + query_string)
search_results = re.findall(r'href=\"\/watch\?v=(.{11})', html_content.read().decode())
print("http://www.youtube.com/watch?v=" + search_results[0])
Edit:
After capturing my request using Chrome's dev tools, I now have the response headers and the following curl:
curl "https://globenewswire.com/Search" -H "cookie:somecookie" -H "Origin: https://globenewswire.com" -H "Accept-Encoding: gzip, deflate, br" -H "Accept-Language: en-US,en;q=0.9" -H "Upgrade-Insecure-Requests: 1" -H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36" -H "Content-Type: application/x-www-form-urlencoded" -H "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8" -H "Cache-Control: max-age=0" -H "Referer: https://globenewswire.com/Search" -H "Connection: keep-alive" -H "DNT: 1" --data "__RequestVerificationToken=xY^%^2BkRoeEL8DswUTDlUVEWUCSxnRzX5Ax2Z^%^2FNCTa0lBNfqOFaU2eb^%^2FTD8XqENnf8d2Ghtm1taW8Cu0BvWrC1dh^%^2BdKZVgHyC6HM0EEm7mupQe1UZ7pHrF9GhnpwwcXR0dyJ^%^2B91Ng^%^3D^%^3D^&quicksearch-textbox=Abeona+Therapeutics" --compressed
As well as the request headers:
POST /Search HTTP/1.1
Host: globenewswire.com
Connection: keep-alive
Content-Length: 217
Cache-Control: max-age=0
Origin: https://globenewswire.com
Upgrade-Insecure-Requests: 1
Content-Type: application/x-www-form-urlencoded
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
DNT: 1
Referer: https://globenewswire.com/Search
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
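For what it's worth, a minimal sketch of replicating that captured POST with requests (the form field names come from the captured body above; the assumption that a fresh __RequestVerificationToken can be scraped from a hidden input on the search page is mine):

import re
import requests

session = requests.Session()
session.headers['User-Agent'] = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                                 'AppleWebKit/537.36 (KHTML, like Gecko) '
                                 'Chrome/64.0.3282.140 Safari/537.36')

# Load the search page first so the session picks up cookies and a
# fresh anti-forgery token (assumed to live in a hidden form input).
page = session.get('https://globenewswire.com/Search')
match = re.search(
    r'name="__RequestVerificationToken"[^>]*value="([^"]+)"', page.text)
token = match.group(1)

# Replay the captured form POST with our own token and search term.
response = session.post(
    'https://globenewswire.com/Search',
    data={
        '__RequestVerificationToken': token,
        'quicksearch-textbox': 'Abeona Therapeutics',
    },
)
print(response.status_code)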

How could I query the result without selenium on Python or Ruby

I'm trying to track the ticket trend.
Currently, I'm using Selenium to simulate submitting forms.
As you know, Selenium is slow and consumes a lot of memory.
However, when you submit the form, it redirects you to a new URL, http://makeabooking.flyscoot.com/Flight/Select.
Therefore, I have no idea how I could do this without Selenium, because I can't just change the query into a URL like http://makeabooking.flyscoot.com/Flight/from={TPE}&to={NYK}&date={2015-10-12} to fetch the result.
Any idea how to do this with Ruby or Python, with SSL proxy and HTTP proxy support?
sample website: http://www.flyscoot.com/index.php/en/
You can easily get the curl request from Chrome and use it:
F12 > Network > request > Right Click > Copy As cURL
curl 'http://makeabooking.flyscoot.com/Flight/Select' -H 'Accept-Encoding: gzip, deflate, sdch' -H 'Accept-Language: en-US,en;q=0.8,tr;q=0.6' -H 'Upgrade-Insecure-Requests: 1' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' -H 'Referer: http://www.flyscoot.com/index.php/en/' -H 'Cookie: optimizelyEndUserId=oeu1444666692081r0.12463579000905156; __utmt=1; granify.lasts#1345=1444666699786; ASP.NET_SessionId=lql5yzv1l3yatkh1lcumg2e5; dotrez=1209262602.20480.0000; optimizelySegments=%7B%222335550040%22%3A%22gc%22%2C%222344180004%22%3A%22referral%22%2C%222354350067%22%3A%22false%22%2C%222355380121%22%3A%22none%22%7D; optimizelyBuckets=%7B%223025070068%22%3A%223020800213%22%7D; __utma=185425846.733949751.1444666694.1444666694.1444666694.1; __utmb=185425846.2.10.1444666694; __utmc=185425846; __utmz=185425846.1444666694.1.1.utmcsr=stackoverflow.com|utmccn=(referral)|utmcmd=referral|utmcct=/questions/33084039/how-could-i-query-the-result-without-selenium-on-python-or-ruby; granify.uuid=68b0d8e8-d068-40d8-9068-3098e870b858; granify.session#1345=1444666699786; granify.flags#1345=8; _gr_ep_sent=1; _gr_er_sent=1; granify.session_init#1345=2; optimizelyPendingLogEvents=%5B%5D' -H 'Connection: keep-alive' -H 'X-FirePHP-Version: 0.0.6' -H 'Cache-Control: max-age=0' --compressed
If you can set the headers and cookie info correctly, you can use Python requests. To convert the curl command to Python requests, you can use a converter (such as https://curl.trillworks.com/, mentioned above). This way you can simulate the browser. See the Python requests code:
import requests

cookies = {
    'optimizelyEndUserId': 'oeu1444666692081r0.12463579000905156',
    '__utmt': '1',
    'granify.lasts#1345': '1444666699786',
    'ASP.NET_SessionId': 'lql5yzv1l3yatkh1lcumg2e5',
    'dotrez': '1209262602.20480.0000',
    'optimizelySegments': '%7B%222335550040%22%3A%22gc%22%2C%222344180004%22%3A%22referral%22%2C%222354350067%22%3A%22false%22%2C%222355380121%22%3A%22none%22%7D',
    'optimizelyBuckets': '%7B%223025070068%22%3A%223020800213%22%7D',
    '__utma': '185425846.733949751.1444666694.1444666694.1444666694.1',
    '__utmb': '185425846.2.10.1444666694',
    '__utmc': '185425846',
    '__utmz': '185425846.1444666694.1.1.utmcsr=stackoverflow.com|utmccn=(referral)|utmcmd=referral|utmcct=/questions/33084039/how-could-i-query-the-result-without-selenium-on-python-or-ruby',
    'granify.uuid': '68b0d8e8-d068-40d8-9068-3098e870b858',
    'granify.session#1345': '1444666699786',
    'granify.flags#1345': '8',
    '_gr_ep_sent': '1',
    '_gr_er_sent': '1',
    'granify.session_init#1345': '2',
    'optimizelyPendingLogEvents': '%5B%5D',
}

headers = {
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-US,en;q=0.8,tr;q=0.6',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Referer': 'http://www.flyscoot.com/index.php/en/',
    'Connection': 'keep-alive',
    'X-FirePHP-Version': '0.0.6',
    'Cache-Control': 'max-age=0',
}

requests.get('http://makeabooking.flyscoot.com/Flight/Select', headers=headers, cookies=cookies)
If you save the result, you can see that the result is the same as via the browser (open stack1.html):
r = requests.get('http://makeabooking.flyscoot.com/Flight/Select', headers=headers, cookies=cookies)
f = open("stack1.html", "wb")  # binary mode, since r.content is bytes
f.write(r.content)
f.close()
I think this answer https://stackoverflow.com/a/1196151/1033953 is what you're looking for.
You'll need to inspect the parameters on that form to make sure you're posting the right values, but then you just need to use Ruby's net/http to send the HTTP POST.
I'm sure Python has something similar. Or you could use curl to post, as shown in this answer: https://superuser.com/a/149335
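In Python, that "something similar" could look roughly like this (the form field names and proxy address are hypothetical; inspect the actual form on the site to find the real input names):

import requests

# Hypothetical field names based on the query the asker wanted to build;
# the real form on flyscoot.com must be inspected for the actual names.
form_data = {
    'from': 'TPE',
    'to': 'NYK',
    'date': '2015-10-12',
}

# requests supports HTTP/HTTPS proxies natively, which covers the
# asker's proxy requirement.
proxies = {
    'http': 'http://myproxy:3128',   # hypothetical proxy address
    'https': 'http://myproxy:3128',
}

response = requests.post('http://makeabooking.flyscoot.com/Flight/Select',
                         data=form_data, proxies=proxies)
print(response.status_code)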
