Python/urllib suddenly stops working properly

I'm writing a little tool to monitor class openings at my school.
I wrote a Python script that fetches the current availability of classes from each department every few minutes.
The script was functioning properly until the uni's site started returning this:
SIS Server is not available at this time
The uni must have blocked my server, right? Not really, because that is the same output I get when I go to the URL directly from other PCs. But if I go through the intermediary form on the uni's site that does a POST, I don't get that message.
The URL I'm requesting is https://s4.its.unc.edu/SISMisc/SISTalkerServlet
This is what my python code looks like:
import urllib  # Python 2: urlencode and urlopen both live in urllib

data = urllib.urlencode({"progname": "SIR033WA", "SUBJ": "busi", "CRS": "", "TERM": "20099"})
f = urllib.urlopen("https://s4.its.unc.edu/SISMisc/SISTalkerServlet", data)
s = f.read()
print(s)
I am really stumped! It seems like Python isn't sending a proper request. At first I thought it wasn't sending proper POST data, but I pointed the URL at my local box and the POST data Apache received looked just fine.
If you'd like to see the system actually functioning, go to https://s4.its.unc.edu/SISMisc/browser/student_pass_z.jsp, click the "Enter as Guest" button, and then look for "Course Availability". (Now you know why I'm building this!)
Weirdest thing is this was working until 11am! I've had the same error before, but it only lasted a few minutes. This makes me think it's a problem somewhere else rather than the uni blocking my server.
Update
Upon a suggestion, I tried a more legitimate Referer/User-Agent. Same result. This is what I tried:
import httplib
import urllib

# note: the real HTTP header name is "Referer" (a single "r" in the middle)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.4) Gecko/2008102920 Firefox/3.0.4",
    "Content-type": "application/x-www-form-urlencoded",
    "Accept": "text/plain",
    "Referer": "https://s4.its.unc.edu/SISMisc/SISTalkerServlet",
}
data = urllib.urlencode({"progname": "SIR033WA", "SUBJ": "busi", "CRS": "", "TERM": "20099"})
c = httplib.HTTPSConnection("s4.its.unc.edu", 443)
c.request("POST", "/SISMisc/SISTalkerServlet", data, headers)
r = c.getresponse()
print r.read()

This post doesn't attempt to fix your code, but suggests a debugging tool.
Once upon a time I was coding a program to fill out online forms for me. To learn exactly how my browser was handling the POSTs, cookies, and the rest, I installed Wireshark (http://www.wireshark.org/), a network sniffer. This application let me view, chunk by chunk, the data being sent and received at the IP and hardware level.
You might consider trying a similar program and comparing the network flow. That could highlight differences between what your browser is doing and what your script is doing.

After seeing multiple requests with an odd non-browser User-Agent string, it's possible that they are blocking users who weren't referred from the site itself. For example, PHP exposes $_SERVER['HTTP_REFERER'] (IIRC), which tells the server which page referred the user to the current one. Since your script isn't sending a Referer header (you are trying to access the servlet directly), it is very possible they are denying access based on that. Try adding a Referer to the headers of your HTTP request and see how it goes (preferably a page which links to the one you're trying to access).
http://whatsmyuseragent.com/ can assist you in building your spoofed user agent.
You then build headers like so:
headers = {"Content-type": "application/x-www-form-urlencoded",
           "Accept": "text/plain"}
and then send them as an additional parameter with your HTTPConnection request...
conn.request("POST", "/page/on/site", params, headers)
See the Python docs on httplib for further reference and examples.
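Putting those pieces together, a minimal sketch of the full request might look like the following (Python 2 httplib, reusing the servlet URL and form fields from the question; the Referer value pointing at the guest form page is my own assumption):
import httplib
import urllib

# browser-like headers; Referer assumed to be the form page that normally submits this POST
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 5.1; rv:10.0.1) Gecko/20100101 Firefox/10.0.1",
    "Content-type": "application/x-www-form-urlencoded",
    "Accept": "text/plain",
    "Referer": "https://s4.its.unc.edu/SISMisc/browser/student_pass_z.jsp",
}

params = urllib.urlencode({"progname": "SIR033WA", "SUBJ": "busi",
                           "CRS": "", "TERM": "20099"})

conn = httplib.HTTPSConnection("s4.its.unc.edu", 443)
conn.request("POST", "/SISMisc/SISTalkerServlet", params, headers)
resp = conn.getresponse()
print resp.status, resp.reason
print resp.read()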

Related

Python. Status code missing in HTTP response

I want to get audio streaming data from a server using Python.
I try a simple request to the audio stream URL using urllib:
import urllib.request

req = urllib.request.Request(<url>)
resp = urllib.request.urlopen(req)   # the exception below is raised here
but I get an exception:
http.client.BadStatusLine: Uª¨Ì5¦`
It looks like the server responds and sends data without any HTTP header, including the status line.
Is there any way to get and process the response in this case?
It is also worth mentioning the results I got requesting this URL with other clients:
Curl:
curl "http://<server>:81/audiostream.cgi?user=<user>&pwd=<password>&streamid=0&filename=" curl: (1) Received HTTP/0.9 when not allowed
The workaround is to use the --http0.9 switch.
Chrome/Chromium-based browsers show:
ERR_INVALID_HTTP_RESPONSE
Mozilla Firefox can correctly fetch this data as binary
Can you post the full code fragment? Or maybe you just need to search Google and SO. I have found several links that mention this problem.
Like:
Why am I getting httplib.BadStatusLine in python?
BadStatusLine exception raised when returning reply from server in Python 3
Why does this url raise BadStatusLine with httplib2 and urllib2?
Issue 42432: Http client, Bad Status Line triggered for no reason
Check again and think twice! Search SO before starting a new thread.
HTTP 0.9 is about the simplest possible http protocol:
The client sends a document request consisting of a line of ASCII characters terminated by a CR LF (carriage return, line feed) pair [...]
This request consists of the word "GET", a space, the document address , omitting the "http:, host and port parts when they are the coordinates just used to make the connection.
The response to a simple GET request is a message in hypertext mark-up language ( HTML ). This is a byte stream of ASCII characters.
source
Thus your server is not sending a valid HTTP 0.9 response, as it's not HTML. Chrome (etc.) is quite within its rights to reject it, although in practice it may not even support HTTP 0.9.
In this case the camera is apparently (ab)using http to start a stream (since presumably it will carry on sending data over the connection, which is also not http 0.9, although not explicitly forbidden). The simplest way to get the data you want is to do it manually:
Create and open a socket with the server's base address
send a GET request for audiostream.cgi?user=<user>&pwd=<password>&streamid=0&filename= (do you really need that last param?)
run socket.recv(max_bytes) in a loop in a thread and transfer the chunks to a (thread-safe) buffer; do whatever you want with that buffer in another thread (see the sketch below)
Alternatively if you're familiar with async programming, use asyncio rather than threads.
You will obviously need to handle decoding the file stream yourself. Hopefully you can identify the format and pass it to a decoder; alternatively something like ffmpeg might be able to guess it.
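A rough sketch of that manual approach (the host, port, and the handle_audio_bytes callback below are placeholders, not anything from your setup):
import socket

HOST, PORT = "camera.example.local", 81   # placeholder address
PATH = "/audiostream.cgi?user=<user>&pwd=<password>&streamid=0&filename="

sock = socket.create_connection((HOST, PORT), timeout=10)
# HTTP/0.9-style request: just "GET <path>" terminated by CRLF
sock.sendall("GET {}\r\n".format(PATH).encode("ascii"))

try:
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break
        handle_audio_bytes(chunk)   # hypothetical: push into a thread-safe buffer / decoder
finally:
    sock.close()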
Have you tried including a User-Agent header when doing this request? Sometimes this can be caused by web-scraping detection.
import urllib2

opener = urllib2.build_opener()
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 5.1; rv:10.0.1) Gecko/20100101 Firefox/10.0.1',
}
opener.addheaders = headers.items()
response = opener.open(<url>)

Why don't I get a response from my request?

I'm trying to make one simple request:
from fake_useragent import UserAgent
import requests

ua = UserAgent()
req = requests.get('https://www.casasbahia.com.br/', headers={'User-Agent': ua.random})
I would understand if I received <Response [403]> or something like that, but instead I receive nothing; the code keeps running with no response.
using logging I see:
I know I could use a timeout to avoid the code hanging, but I just want to understand why I don't get a response.
Thanks in advance.
I have never used this API before, but from what I just researched, some sites block requests from fake user agents.
So, to reproduce this example on my PC, I installed the fake_useragent and requests modules on Python 3.10 and ran your script. It turns out that with my authentic User-Agent string, the request goes through. When printed to the console, req.text shows the entire HTML file received from the request.
But if I try again with a fake user agent, using ua.random, it fails. The site was probably developed to detect and reject requests from fake agents (or bots).
Then again, this is just a theory. I have no way to access this site's server files to confirm it.
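For what it's worth, a small sketch along those lines: pin a real browser User-Agent string instead of ua.random, and add a timeout so the call can't hang forever (the UA string below is just an example):
import requests

headers = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/96.0.4664.110 Safari/537.36"),
}

try:
    req = requests.get("https://www.casasbahia.com.br/", headers=headers, timeout=10)
    print(req.status_code)
except requests.exceptions.Timeout:
    print("connected, but the server never sent a response")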

Python scraping HTTPError: 403 Client Error: Forbidden for url:

My Python code used to work, but when I tried it today it did not work anymore.
I assume the website owner recently started refusing non-browser requests.
code
import requests, bs4
res = requests.get('https://manga1001.com/日常-raw-free/')
res.raise_for_status()
print(res.text)
I read that adding headers to the requests.get call may work, but I don't know exactly which header info I need to make it work.
error
---------------------------------------------------------------------------
HTTPError Traceback (most recent call last)
<ipython-input-15-ed1948d83d51> in <module>
3 # res = requests.get('https://manga1001.com/日常-raw-free/', headers=headers_dic)
4 res = requests.get('https://manga1001.com/日常-raw-free/')
----> 5 res.raise_for_status()
6 print(res.text)
7
~/opt/anaconda3/lib/python3.8/site-packages/requests/models.py in raise_for_status(self)
939
940 if http_error_msg:
--> 941 raise HTTPError(http_error_msg, response=self)
942
943 def close(self):
HTTPError: 403 Client Error: Forbidden for url: https://manga1001.com/%E6%97%A5%E5%B8%B8-raw-free/
requests.get has a headers argument:
res = requests.get('https://manga1001.com/日常-raw-free/', headers="")
I think adding a proper value here could make it work, but I don't know what the value is.
I would really appreciate it if you could tell me.
And if you know any other ways to make it work, that would also be quite helpful.
Btw I have also tried the code below but it also didn't work.
code 2
from requests_html import HTMLSession
url = "https://search.yahoo.co.jp/realtime"
session = HTMLSession()
r = session.get(url)
r = r.html.render()
print(r)
FYI, HTMLSession may not work in an IDE like Jupyter Notebook, so I tried it after saving it as a Python file, but it still did not work.
When I run the first code without res.raise_for_status(), I can see HTML with "Why do I have to complete a CAPTCHA?" and a Cloudflare Ray ID, which shows what the problem is. The site uses Cloudflare to detect scripts/bots/hackers/spammers, and it uses a CAPTCHA to check. But if I send a 'User-Agent' header with a value from a real browser, or even just the short 'Mozilla/5.0', then I get the expected page.
It works for me with both pages.
import requests

headers = {
    'User-Agent': 'Mozilla/5.0'
}

url = 'https://manga1001.com/日常-raw-free/'
#url = 'https://search.yahoo.co.jp/realtime'

res = requests.get(url, headers=headers)

print('status_code:', res.status_code)
print(res.text)
BTW:
If you run it often, for many links in a short time, it may display the CAPTCHA again, and then you may need other methods to behave more like a real human: e.g. sleep() with a random time, Session() to keep cookies, getting the main page first (to pick up fresh cookies) and only then this page, and adding other headers.
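A minimal sketch of those ideas (the pause range and the warm-up URL are my own guesses, adjust as needed):
import random
import time
import requests

headers = {'User-Agent': 'Mozilla/5.0'}

s = requests.Session()            # keeps Cloudflare cookies between requests
s.headers.update(headers)

s.get('https://manga1001.com/')   # get the main page first to pick up fresh cookies

urls = ['https://manga1001.com/日常-raw-free/']   # add more pages here

for url in urls:
    res = s.get(url)
    print(url, res.status_code)
    time.sleep(random.uniform(3, 8))   # random pause to behave less like a bot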
I wanted to expand on the answer given by @furas because his fix will not be the solution in all cases. Yes, in this instance you're getting the 403 and the Cloudflare/security CAPTCHA page because your request doesn't "score" high enough on the security system (your HTTP client isn't similar enough to a real browser).
This creates a big question. What is a real browser and what score do I need to beat it? How do I increase my browser score and make my HTTP-request based browser look more real to the bot protection?
Firstly, it's important to understand that these 403/security blocks are based on different levels of security. Something that works on one site may not work on another due to different security configurations/versions. Two sites may use the same security system, and still a request you make may only work on one of them.
Why would they use different configurations and not all run the highest security available? Because each additional security measure brings more false positives and more challenges to pass; at large scale, or for an e-commerce store, this can mean lost sales due to a poor user experience, or additional bugs/downtime introduced by the security program.
What is a real browser?
A real browser can perform SSL/TLS handshakes, parse and run JavaScript, and make TCP requests. Along with this, the security programs analyze the patterns and timings of everything from layer 2 upward to decide whether you're a "real" human. When you use something like Python to make a request that only performs an HTTP(S) request, it's really easy for these security programs to recognise you as a bot without some heavy configuration.
One way security systems combat bots is by putting a JavaScript challenge as a proxy between the bot and the site. This requires running client-side JavaScript, which bots cannot do by default; not only do you need to run the client-side JavaScript, it also needs to produce results similar to what your own browser would generate. The challenge can typically consist of a few hundred individual "browser" checks, sometimes along with a manual CAPTCHA, to fingerprint and track your browser and decide whether you're a human (this is the page you're seeing).
Typical, lower-standard security configurations can be beaten by using the correct headers (with the right capitalization, header order, and HTTP version). As @furas mentioned, using consistent sessions can also help create longer-lasting sessions before you get another 403. More advanced, higher-level security configurations can track you at lower levels by looking at flags of the TCP connection (such as window size) and by JA3 fingerprinting, which analyzes the TLS handshake and looks at your cipher suites and ALPN, amongst other things. Security systems can see characteristics that differentiate between browsers, browser versions, and operating systems, and compare them all to generate your realness score. Your IP can also be an important factor: requests can be cross-checked with other sites, intervals, older requests you tried before, and much more. You can use proxies to divide your requests between and look less suspicious, but this can bring additional problems and can itself be fingerprinted and blocked.
To understand this better, here's a great site you can visit in your browser and also make a GET request to: check your browser "rank" and look at the different values that can be seen from the TLS handshake alone.
I hope this provides some insight into why a block might appear, although it's impossible to tell from a single URL since blocks can appear for such a variety of different reasons.
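As a purely illustrative sketch of the "correct headers" part: the snippet below sends a fuller, browser-like header set over one persistent session. It does nothing about the TCP/TLS fingerprinting described above, and the values are copied from a generic Firefox request rather than any particular site's requirements; the URL is a placeholder.
import requests

browser_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
}

with requests.Session() as s:
    s.headers.update(browser_headers)   # sent on every request in the session
    r = s.get("https://example.com/")   # placeholder URL
    print(r.status_code)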

Python requests isn't giving me the same HTML as my browser is

I am grabbing a Wikia page using Python requests. There's a problem, though: the requests request isn't giving me the same HTML as my browser does for the very same page.
For comparison, here's the page Firefox gets me, and here's the page requests fetches (download them to view - sorry, no easy way to just visually host a bit of HTML from another site).
You'll note a few differences (super unfriendly diff). There are some small things, like attributes being ordered differently, but there are also a few very, very large ones. Most important is the lack of the last six <img>s and the entirety of the navigation and footer sections. Even in the raw HTML, the page looks like it cuts off abruptly.
Why is this happening, and is there a way to fix it? I've thought of a bunch of things already, none of which have been fruitful:
Request headers interfering? Nope, I tried copying the headers my browser sends, User-Agent and all, 1:1 into the requests request, but nothing changed.
JavaScript loading content after the HTML is loaded? Nah. Even with JS disabled, Firefox gives me the "good" page.
Uh... well... what else could there be?
It'd be amazing if you know a way this could happen and a way to fix it. Thank you!
I had a similar issue:
Identical headers with Python and through the browser
JavaScript definitely ruled out as a cause
To resolve the issue, I ended up swapping out the requests library for urllib.request.
Basically, I replaced:
import requests
session = requests.Session()
r = session.get(URL)
with:
import urllib.request
r = urllib.request.urlopen(URL)
and then it worked.
Maybe one of those libraries is doing something strange behind the scenes? Not sure if that's an option for you or not.
I suspect that you're not sending the proper headers (or sending them wrong) with your request. That's why you are getting different content. Here is an example of an HTTP request with headers:
import urllib2

url = 'https://www.google.co.il/search?q=eminem+twitter'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36'

# header dict
headers = {'User-Agent': user_agent}

# creating the request
req = urllib2.Request(url, None, headers)

# getting the html
html = urllib2.urlopen(req).read()
If you are sure that you are sending the right headers but are still getting different HTML, you can try Selenium. It allows you to drive a browser directly (or PhantomJS if your machine doesn't have a GUI). With Selenium you can grab the HTML straight from the browser.
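If you go the Selenium route, a minimal sketch (Selenium 4 syntax; assumes Firefox and its driver are available) could be:
from selenium import webdriver

options = webdriver.FirefoxOptions()
options.add_argument("-headless")          # no visible browser window

driver = webdriver.Firefox(options=options)
try:
    driver.get("https://www.google.co.il/search?q=eminem+twitter")
    html = driver.page_source              # HTML exactly as the browser rendered it
finally:
    driver.quit()

print(len(html))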
A lot of the differences I see show me that the content is still there; it's just rendered in a different order, sometimes with different spacing.
You could be receiving different content based on multiple different things:
Your headers
Your user agent
The time!
The order in which the web application decides to render elements on the page, subject to random attribute order if the elements are pulled from an unsorted data source.
If you could include all of your headers at the top of that Diff, then we may be able to make more sense of it.
I suspect that the application chose not to render certain images as they aren't optimized for what it thinks is some kind of robot/mobile device (Python Requests)
On a closer look at the diff, it appears that everything was loaded in both requests, just with a different formatting.
I was facing a similar issue while requesting a page. Then I noticed that the URL I was using required 'http' to be prepended, but I was prepending 'https'. My request URL looked like https://example.com, so make the URL look like http://example.com instead. Hope that solves the problem.
Maybe Requests and browsers use different ways to render the raw data from the web server, and the diff in the above example only concerns the rendered HTML.
I found that when HTML is broken, different browsers, e.g. Chrome and Safari, use different ways to fix it up while parsing. So maybe it is the same idea with Requests and Firefox.
I suggest diffing the raw data from both Requests and Firefox, i.e. the byte stream on the socket. Requests can use the .raw property of the response object to get the raw data from the socket (http://docs.python-requests.org/en/master/user/quickstart/). If the raw data from both sides is the same and there is some broken markup in the HTML, the difference is probably due to the different auto-fixing policies of Requests and the browser when parsing broken HTML.
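A small sketch of that comparison using requests (the URL is a placeholder; with stream=True the body is not consumed or decoded until you read it):
import requests

url = "https://example.wikia.com/wiki/Some_Page"   # placeholder URL

r = requests.get(url, stream=True)   # don't consume/decode the body yet
raw_bytes = r.raw.read()             # bytes straight off the socket (still compressed if the server compressed them)

with open("requests_raw.bin", "wb") as f:
    f.write(raw_bytes)

print(r.status_code, len(raw_bytes))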
(Maybe my recent experience will help)
I faced the same issue scraping on Amazon: my local machine was able to process all the pages, but when I moved the project to a Google Cloud instance, the behavior changed for some of the items I was scraping.
Previous implementation
On my local machine I was using the requests library as follows:
page = requests.get(url_page, headers=self.headers)
page=page.content
with headers specified in my class, based on my local browser
headers = {
    "User-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36 OPR/67.0.3575.137"
}
but I get incomplete pages using this setup on the Google Cloud instance.
New implementation
The following implementation uses urllib without the headers:
import urllib.request

req = urllib.request.Request(
    url_page,
    data=None
)
f = urllib.request.urlopen(req)
page = f.read().decode('utf-8')
self.page = page
This solution works on both machines. Before this attempt I had also tried using the same headers, and the problem was not solved, so I removed the headers, supposing the problem was there (maybe because I was incorrectly identifying myself as another client).
So, my code works perfectly and I'm still able to process the content of the pages with beautifulsoup, as in the following method, which I implemented in my class in order to extract the text from a specific portion of the page.
def find_data(self, div_id):
    soup = BeautifulSoup(self.page, features="lxml")
    text = soup.select("#" + div_id)[0].get_text()
    text = text.strip()
    text = str(text)
    text = text.replace('"', "")
    return text

How do you open a URL with Python without using a browser?

I want to open a URL with Python code, but I don't want to use the "webbrowser" module. I tried that already and it worked (it opened the URL in my actual default browser, which is what I DON'T want). So then I tried using urllib (urlopen) and mechanize. Both of them ran fine with my program, but neither of them actually sent my request to the website!
Here is part of my code:
import urllib2
import mechanize

br = mechanize.Browser()  # assumed: the mechanize browser used elsewhere in the script

# newPID and ZA[z] are defined earlier in the program
finalURL = "http://www.locationary.com/access/proxy.jsp?ACTION_TOKEN=proxy_jsp$JspView$SaveAction&inPlaceID=" + str(newPID) + "&xxx_c_1_f_987=" + str(ZA[z])
print finalURL
print ""
br.open(finalURL)
page = urllib2.urlopen(finalURL).read()
When I go into the site, locationary.com, it doesn't show that any changes have been made! When I used "webbrowser" though, it did show changes on the website after I submitted my URL. How can I do the same thing that webbrowser does without actually opening a browser?
I think the website wants a "GET"
I'm not sure what OS you're working on, but if you use something like HTTP Scoop (Mac), Fiddler (PC), or Wireshark, you should be able to watch the traffic and see what's happening. It may be that the website does a redirect (which your browser is following) or there's some other subsequent activity.
Start an HTTP sniffer, make the request using the web browser and watch the traffic. Once you've done that, try it with the python script and see if the request is being made, and what the difference is in the HTTP traffic. This should help identify where the disconnect is.
An HTTP GET doesn't need any specific code or action on the client side: it's just the base URL (http://server/) + path + an optional query string.
If the URL is correct, then the code above should work. Some pointers what you can try next:
Is the URL really correct? Use Firebug or a similar tool to watch the network traffic which gives you the full URL plus any header fields from the HTTP request.
Maybe the site requires you to log in, first. If so, make sure you set up cookies correctly.
Some sites require a correct "Referer" field (to protect themselves against deep linking). Add the Referer header which your browser used to the request (see the sketch after this list).
The log file of the server is a great source of information to trouble shoot such problems - when you have access to it.
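A minimal sketch combining the cookie and Referer pointers above (Python 2, like the question's code; the Referer value is only a guess at the page you would normally come from):
import cookielib
import urllib2

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# visit the page you would normally come from, so any session cookie gets set
opener.open("http://www.locationary.com/")

finalURL = "http://www.locationary.com/access/proxy.jsp?..."   # built as in the question
req = urllib2.Request(finalURL)
req.add_header("Referer", "http://www.locationary.com/")
page = opener.open(req).read()
print page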
