I have been trying to search a website and collect the URLs of the articles it yields, but I have run into problems I don't understand. From what I have read about web forms, I believe my problem arises because I am trying to submit a prebuilt link to the search results rather than generating them with the web form.
There are several similar questions on here already, but none I've found deal with how to deconstruct the HTML and what steps are needed to find and submit a web form.
When researching how to submit web forms, I found out about Selenium, but I can't find working Python 3 examples of it, nor good documentation. Another SO question pointed me to a CodeProject article which has gotten me the most progress so far (code featured below): https://www.codeproject.com/Articles/873060/Python-Search-Youtube-for-Video
That said, I don't yet understand why it works or which variables I will need to change in order to harvest results from another website. Namely, the website I'm looking to search is: https://globenewswire.com/Search
So my question is this: whether by web form submission or by proper URL formation, how can I obtain the search results HTML?
Here is the code I had been using to build the search URL:
name = input()
name = name.replace(' ', '%20')  # crude percent-encoding: handles spaces only
url = 'https://globenewswire.com/Search/NewsSearch?keyword=' + name + '#'
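As an aside, I gather the standard library can handle the percent-encoding rather than replacing spaces by hand; a minimal sketch, assuming the NewsSearch?keyword= endpoint accepts plain GET requests:
import urllib.parse

name = input()
# urlencode() percent-encodes spaces and any other reserved characters
query = urllib.parse.urlencode({'keyword': name})
url = 'https://globenewswire.com/Search/NewsSearch?' + query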
Here is the code featured in the CodeProject article:
import urllib.request
import urllib.parse
import re

# Build a percent-encoded query string from the user's input
query_string = urllib.parse.urlencode({"search_query": input()})
# Fetch the results page, then pull out the 11-character video IDs
html_content = urllib.request.urlopen("http://www.youtube.com/results?" + query_string)
search_results = re.findall(r'href="/watch\?v=(.{11})', html_content.read().decode())
print("http://www.youtube.com/watch?v=" + search_results[0])
Edit:
After having captured my request using Chrome's dev tools, I now have the response headers and the following curl command:
curl "https://globenewswire.com/Search" -H "cookie:somecookie" -H "Origin: https://globenewswire.com" -H "Accept-Encoding: gzip, deflate, br" -H "Accept-Language: en-US,en;q=0.9" -H "Upgrade-Insecure-Requests: 1" -H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36" -H "Content-Type: application/x-www-form-urlencoded" -H "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8" -H "Cache-Control: max-age=0" -H "Referer: https://globenewswire.com/Search" -H "Connection: keep-alive" -H "DNT: 1" --data "__RequestVerificationToken=xY^%^2BkRoeEL8DswUTDlUVEWUCSxnRzX5Ax2Z^%^2FNCTa0lBNfqOFaU2eb^%^2FTD8XqENnf8d2Ghtm1taW8Cu0BvWrC1dh^%^2BdKZVgHyC6HM0EEm7mupQe1UZ7pHrF9GhnpwwcXR0dyJ^%^2B91Ng^%^3D^%^3D^&quicksearch-textbox=Abeona+Therapeutics" --compressed
As well as the request headers:
POST /Search HTTP/1.1
Host: globenewswire.com
Connection: keep-alive
Content-Length: 217
Cache-Control: max-age=0
Origin: https://globenewswire.com
Upgrade-Insecure-Requests: 1
Content-Type: application/x-www-form-urlencoded
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
DNT: 1
Referer: https://globenewswire.com/Search
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
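Based on the capture above, here is a minimal, untested sketch of replaying the form submission with requests. It assumes the __RequestVerificationToken and quicksearch-textbox field names from the captured POST, and that the token can be scraped from a hidden input on the search page itself:
import re
import requests

with requests.Session() as s:
    # Load the search page first so the session picks up the right cookies
    page = s.get('https://globenewswire.com/Search')
    # Pull the anti-forgery token out of the form (assumes a standard hidden input)
    token = re.search(
        r'name="__RequestVerificationToken"[^>]*value="([^"]+)"', page.text
    ).group(1)
    # Replay the captured POST with our own search term
    resp = s.post(
        'https://globenewswire.com/Search',
        data={
            '__RequestVerificationToken': token,
            'quicksearch-textbox': 'Abeona Therapeutics',
        },
        headers={'Referer': 'https://globenewswire.com/Search'},
    )
    print(resp.status_code)  # resp.text should now hold the search results HTML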
I am using the socket library to handle HTTP requests, waiting on port 80 for connections (that part does not really matter right now). This works fine, as all incoming requests follow the format below:
b"""GET / HTTP/1.1
Host: localhost:8000
Connection: keep-alive
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36 OPR/70.0.3728.189
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Sec-Fetch-Site: none
Sec-Fetch-Mode: navigate
Sec-Fetch-User: ?1
Sec-Fetch-Dest: document
Accept-Encoding: gzip, deflate, br
Accept-Language: el-GR,el;q=0.9"""
If you open port 443, or just use https in any browser, the data is encrypted when a request is made. But how can you actually decrypt the data and interact with the client? I've seen many posts about this, but no one explains how the data can actually be decrypted. The received data always looks something like this, starting with the same 0x16 and 0x03 bytes:
b'\x16\x03\x01\x02\x00\x01\x00\x01\xfc\x03\x03\xfb\'\xa3\xa5\xa4\x1cf\xd1w~(L\xb5%0,\xfb\xa57\xf4\x92\x03}\x84xCIA\xd9}]2 \x15ID\xafU\xb6\xe3\x9d\xbdr\x93 L\x98\rD\xca\xa7\x11\x89\x00`Q\xf5\th\xde\x85S\xf8Q\x98\x00"jj\x13\x03\x13\x01\x13\x02\xcc\xa9\xcc\xa8\xc0+\xc0/\xc0,\xc00\xc0\x13\xc0\x14\x00\x9c\x00\x9d\x00/\x005\x00\n\x01\x00\x01\x91ZZ\x00\x00\x00\x00\x00\x0e\x00\x0c\x00\x00\tlocalhost\x00\x17\x00\x00\xff\x01\x00\x01\x00\x00\n\x00\n\x00\x08\x9a\x9a\x00\x1d\x00\x17\x00\x18\x00\x0b\x00\x02\x01\x00\x00#\x00\x00\x00\x10\x00\x0e\x00\x0c\x02h2\x08http/1.1\x00\x05\x00\x05\x01\x00\x00\x00\x00\x00\r\x00\x14\x00\x12\x04\x03\x08\x04\x04\x01\x05\x03\x08\x05\x05\x01\x08\x06\x06\x01\x02\x01\x00\x12\x00\x00\x003\x00+\x00)\x9a\x9a\x00\x01\x00\x00\x1d\x00 \xa5\x81S\xec\xf4I_\x08\xd2\n\xa6\xb5\xf6E\x9dE\xe6ha\xe7\xfdy\xdab=\xf4\xd3\x1b`V\x94F\x00-\x00\x02\x01\x01\x00+\x00\x0b\nZZ\x03\x04\x03\x03\x03\x02\x03\x01\x00\x1b\x00\x03\x02\x00\x02\xea\xea\x00\x01\x00\x00\x15\x00\xcf\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
My question is how I can bring the HTTPS data into a form like the plaintext request above. I've read about some specific handshake procedures, but I could not find an answer telling me exactly what to do. Of course, I am only asking for development purposes.
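For reference, the standard ssl module can perform the whole TLS handshake and decryption for you, so the bytes you recv() are already plaintext. A minimal sketch, assuming a certificate and key (e.g. self-signed ones generated with openssl) saved as cert.pem and key.pem:
import socket
import ssl

# The TLS context holds the certificate and performs the handshake for us
context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
context.load_cert_chain(certfile='cert.pem', keyfile='key.pem')

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.bind(('localhost', 443))
sock.listen(5)

# Wrapping the listening socket is the only change from a plain HTTP server
with context.wrap_socket(sock, server_side=True) as ssock:
    conn, addr = ssock.accept()
    data = conn.recv(4096)  # already-decrypted bytes, like the GET request above
    print(data)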
I have a list of named locations and a set of lat/lng bounds that I want to plug in to the Google Maps API and have it find the locations for me. The names can be very vague, like simply "the boarding school". Using the lat/lng bounds, is there a way I can get GMaps to find these vaguely named locations within the coordinates provided?
My application is web-based and powered by Python Flask in the backend. I've tried looking into Maps' Place Search, but it seems like it can only 'prefer' a certain area to search in, and with my vague place names, it doesn't do well:
https://maps.googleapis.com/maps/api/place/findplacefromtext/xml?input=beach&inputtype=textquery&fields=formatted_address,geometry&locationbias=rectangle:43.3145,-79.8236|43.3490,-79.7741&key=XXXXXXX
This query has a bias covering part of Burlington, ON, but the result is in the neighbouring town of Oakville, significantly out of bounds. If you perform the search with the term "Burlington beach" instead, however, it finds the beach that is within bounds.
I need the query to find the beach in Burlington, simply given the term "beach", and bounds that said beach falls within.
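For context, here is the same Find Place query issued from Python; a minimal sketch with a placeholder key:
import requests

params = {
    'input': 'beach',
    'inputtype': 'textquery',
    'fields': 'formatted_address,geometry',
    'locationbias': 'rectangle:43.3145,-79.8236|43.3490,-79.7741',
    'key': 'XXXXXXX',  # placeholder API key
}
resp = requests.get(
    'https://maps.googleapis.com/maps/api/place/findplacefromtext/json',
    params=params,
)
print(resp.json())  # the candidate comes back in Oakville, out of bounds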
Edit: here are my HTTP requests+headers in both Chrome and Edge when testing the original query and Evan's smaller query (URL in comments):
==== Google Chrome: Original Request ====
:authority: maps.googleapis.com
:method: GET
:path: /maps/api/place/findplacefromtext/json?input=beach&inputtype=textquery&fields=formatted_address,geometry&locationbias=rectangle:43.3145,-79.8236|43.3490,-79.7741&key=
:scheme: https
accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
accept-encoding: gzip, deflate, br
accept-language: en-CA,en;q=0.9,it;q=0.8,el-GR;q=0.7,el;q=0.6
cache-control: max-age=0
dnt: 1
sec-fetch-mode: navigate
sec-fetch-site: none
sec-fetch-user: ?1
upgrade-insecure-requests: 1
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36
x-client-data: CIi2yQEIprbJAQjBtskBCKmdygEIup/KAQioo8oBCOKoygEIl63KAQjNrcoBCMqvygEIh7TKARjwsMoB
==== Google Chrome: Evan's Request ====
:authority: maps.googleapis.com
:method: GET
:path: /maps/api/place/findplacefromtext/json?input=beach&inputtype=textquery&fields=formatted_address,geometry&locationbias=rectangle:43.3145,-79.8236|43.3490,-79.800879&key=
:scheme: https
accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3
accept-encoding: gzip, deflate, br
accept-language: en-CA,en;q=0.9,it;q=0.8,el-GR;q=0.7,el;q=0.6
cache-control: max-age=0
dnt: 1
sec-fetch-mode: navigate
sec-fetch-site: none
sec-fetch-user: ?1
upgrade-insecure-requests: 1
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36
x-client-data: CIi2yQEIprbJAQjBtskBCKmdygEIup/KAQioo8oBCOKoygEIl63KAQjNrcoBCMqvygEIh7TKARjwsMoB
==== Edge: Original Request ====
Request URL: https://maps.googleapis.com/maps/api/place/findplacefromtext/json?input=beach&inputtype=textquery&fields=formatted_address,geometry&locationbias=rectangle:43.3145,-79.8236|43.3490,-79.7741&key=
Request Method: GET
Status Code: 200 /
Accept: text/html, application/xhtml+xml, application/xml; q=0.9, */*; q=0.8
Accept-Encoding: gzip, deflate, br
Accept-Language: en-CA
Cache-Control: max-age=0
Host: maps.googleapis.com
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18362
==== Edge: Evan's Request ====
Request URL: https://maps.googleapis.com/maps/api/place/findplacefromtext/json?input=beach&inputtype=textquery&fields=formatted_address,geometry&locationbias=rectangle:43.3145,-79.8236|43.3490,-79.800879&key=
Request Method: GET
Status Code: 200 /
Accept: text/html, application/xhtml+xml, application/xml; q=0.9, */*; q=0.8
Accept-Encoding: gzip, deflate, br
Accept-Language: en-CA
Host: maps.googleapis.com
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18362
This inconsistency appears to be related to the following issue, reported in Google's Issue Tracker:
https://issuetracker.google.com/issues/35822155
The language parameter influences the results you get from Place Search requests, and even more so with such generic/broad queries. So it's intended behavior for the API.
Potential alternatives to Find Place include the Nearby Search and Text Search services. These are more appropriate for ambiguous queries, and you can filter out results that fall outside a given location and radius.
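For instance, a minimal Nearby Search sketch (placeholder key; the radius is a rough guess covering the rectangle, and the rectangle filter is applied client-side, since Nearby Search itself is radius-based):
import requests

# Bounds from the question (Burlington, ON)
sw = (43.3145, -79.8236)
ne = (43.3490, -79.7741)
center = ((sw[0] + ne[0]) / 2, (sw[1] + ne[1]) / 2)

resp = requests.get(
    'https://maps.googleapis.com/maps/api/place/nearbysearch/json',
    params={
        'keyword': 'beach',
        'location': f'{center[0]},{center[1]}',
        'radius': 3000,  # metres; rough guess that covers the bounds
        'key': 'XXXXXXX',  # placeholder API key
    },
)
results = resp.json().get('results', [])

# Keep only the results that fall inside the rectangle
in_bounds = [
    r for r in results
    if sw[0] <= r['geometry']['location']['lat'] <= ne[0]
    and sw[1] <= r['geometry']['location']['lng'] <= ne[1]
]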
Hope this helps!
I'm trying to get some data from a page, but it returns a 403 Forbidden error.
I thought it was the user agent, but I tried several user agents and it still returns the error.
I also tried to use the fake-useragent library, but I did not succeed.
import requests
from fake_useragent import UserAgent

with requests.Session() as c:
    url = '...'
    # headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2224.3 Safari/537.36'}
    ua = UserAgent()
    header = {'User-Agent': str(ua.chrome)}
    page = c.get(url, headers=header)
    print page.content
When I access the page manually, everything works.
I'm using Python 2.7.14 and the requests library. Any ideas?
The site could be using anything in the request to trigger the rejection.
So, copy all headers from the request that your browser makes. Then delete them one by one¹ to find out which are essential.
As per "Python requests. 403 Forbidden", to add custom headers to the request, do:
result = requests.get(url, headers={'header':'value', <etc>})
¹ A faster way would be to delete half of them each time instead, but that's more complicated since there are probably multiple essential headers.
These are all the headers I can see the browser including in a generic GET request:
Host: <URL>
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br
Connection: keep-alive
Upgrade-Insecure-Requests: 1
Try including all of those in your request incrementally (one by one), in order to identify which one(s) are required for a successful request.
Also, take a look at the Cookies and/or Security tabs available in your browser's developer tools, under the Network option.
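Putting both answers together, here is a sketch of copying the full browser header set into requests and then pruning it (hypothetical URL; header values taken from the list above):
import requests

url = 'http://example.com/page'  # hypothetical URL standing in for the real one

# Full header set copied from the browser; delete entries to find the essential ones
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
}

page = requests.get(url, headers=headers)
print(page.status_code)  # aiming for 200 instead of 403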
I noticed that http://www.momondo.com.cn/ is using some magic technology:
curl doesn't work on it. The URL displays fine in a web browser, but curl always returns a timeout, even when I add all of the headers like a web browser would.
I also tried Python requests and urllib2, but they didn't work either.
C:\Users\Administrator>curl -v -H "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.89 Safari/537.36" -H "Connection: Keep-Alive" -H "Accept-Encoding:gzip, deflate, sdch" -H "Cache-Control:no-cache" -H "Upgrade-Insecure-Requests:1" -H "Accept-Language:zh-CN,zh;q=0.8" -H "Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8" http://www.momondo.com.cn/
* About to connect() to www.momondo.com.cn port 80 (#0)
* Trying 184.50.91.106...
* connected
* Connected to www.momondo.com.cn (184.50.91.106) port 80 (#0)
> GET / HTTP/1.1
> Host: www.momondo.com.cn
> User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.89 Safari/537.36
> Connection: Keep-Alive
> Accept-Encoding:gzip, deflate, sdch
> Cache-Control:no-cache
> Upgrade-Insecure-Requests:1
> Accept-Language:zh-CN,zh;q=0.8
> Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
>
Why and how does this happen? How does Momondo escape curl?
How are you setting up the request? If you are using requests, you should use the Session object type and change the headers there, so they can be easily reused. It doesn't look like they are doing anything special: using telnet directly on that site (i.e. telnet www.momondo.com.cn 80) with the headers generated by the browser (captured via tcpdump, just to be sure) resulted in content being returned rather than hanging until the timeout. It also pays to look at what CDN (content delivery network) the site is behind; in this case the address resolves to some subdomain at akamaiedge.net, which means it might be useful to check out why they might have blocked you.
Anyway, using the headers you have supplied with a requests.Session object, a response was generated.
>>> from requests import Session
>>> session = Session()
>>> session.headers # check the default headers
{'User-Agent': 'python-requests/2.12.5', 'Connection': 'keep-alive', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*'}
>>> session.headers['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
>>> session.headers['Accept-Language'] = 'en-GB,en-US;q=0.8,en;q=0.6,zh-TW;q=0.4'
>>> session.headers['Cache-Control'] = 'max-age=0'
>>> session.headers['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.89 Safari/537.36'
>>> response = session.get('http://www.momondo.com.cn/')
>>> response
<Response [200]>
Doesn't seem to be anything magic at all.
I figured out the reason:
Momondo is using the following methods to block non-browser clients.
It detects the User-Agent header: it cannot be curl's default UA.
It detects the "Connection" header: I had to use "keep-alive" rather than "Keep-Alive" in my initial test.
It detects the "Accept-Encoding" header: it cannot be empty, but it can be anything.
Finally, I can use curl to get the content now:
curl -v -H "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X
10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.89
Safari/537.36" -H "Connection: keep-alive" -H "Accept-Encoding:
nothing" http://www.momondo.com.cn/
BTW, I have been doing web scraping for about seven years, and this is the first time I've come across a website using this anti-scraping method. Mark it.
I am trying to log into my router's panel using Python, but the problem is that I have no idea what the protocol for doing so is. I tried using Wireshark to find out, but it shows just a GET request and a response. I tried logging in to the router and then searching for the username and password in the packets, but I didn't find them. (My guess is that they're encrypted.)
If anyone could help me with the protocol of logging in to the panel, it would be greatly appreciated.
Found it. Following the TCP stream gave me the following:
GET / HTTP/1.1
Host: 10.0.0.138
Connection: keep-alive
Cache-Control: max-age=0
Authorization: Basic UG90YXRvOg==
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,en;q=0.8,he;q=0.6
HTTP/1.0 401 Unauthorized
WWW-Authenticate: Basic realm="NETGEAR DGN2200v2BEZEQ"
Content-type: text/html
<html>
<head><title>401 Unauthorized</title></head>
<body><h1>401 Unauthorized</h1>
<p>Access to this resource is denied, your client has not supplied the correct authentication.</p></body>
</html>
The username and password are encoded in base64 in the format of username:password.
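For example, "UG90YXRvOg==" from the capture above is just base64 for "Potato:" (username "Potato", empty password), and the login can be scripted with requests' built-in Basic Auth support. A sketch with placeholder credentials:
import base64
import requests

print(base64.b64decode('UG90YXRvOg=='))  # b'Potato:' -> username 'Potato', empty password

# requests builds the Authorization: Basic header from a (username, password) tuple
resp = requests.get('http://10.0.0.138/', auth=('admin', 'password'))  # placeholder credentials
print(resp.status_code)  # 200 on success, 401 Unauthorized otherwise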