I have asked this question here about a Python command that fetches a URL of a web page and stores it in a variable. The first thing that I wanted to know then was whether or not the variable in this code contains the HTML code of a web-page:
from google.appengine.api import urlfetch
url = "http://www.google.com/"
result = urlfetch.fetch(url)
if result.status_code == 200:
    doSomethingWithResult(result.content)
The answer that I received was "yes", i.e. the variable "result" in the code did contain the HTML code of a web page, and the programmer who was answering said that I needed to "check the Content-Type header and verify that it's either text/html or application/xhtml+xml". I've looked through several Python tutorials, but couldn't find anything about headers. So my question is where is this Content-Type header located and how can I check it? Could I send the content of that variable directly to my mailbox?
Here is where I got this code. It's on Google App Engine.
If you look at the Google App Engine documentation for the response object, the result of urlfetch.fetch() contains the member headers which contains the HTTP response headers, as a mapping of names to values. So, all you probably need to do is:
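# the headers live on result.headers; the value may carry a "; charset=..." suffix
content_type = result.headers.get('Content-Type', '').split(';')[0].strip()
if content_type in ('text/html', 'application/xhtml+xml'):
    # assuming you want to do something with the content
    doSomethingWithXHTML(result.content)
else:
    # use content for something else
    doTheOtherThing(result.content)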
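To make the header check a bit more concrete, here is a minimal sketch of a standalone helper. The name is_html is made up for illustration and is not part of App Engine. It assumes you pass it the raw Content-Type value, which servers often send with parameters such as "; charset=utf-8" attached.

```python
def is_html(content_type):
    """Return True if a Content-Type header value denotes HTML or XHTML.

    Hypothetical helper, not part of any library. Parameters such as
    "; charset=utf-8" are ignored, and the comparison is case-insensitive.
    """
    if not content_type:
        return False
    # Keep only the media type, i.e. everything before the first ";".
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")
```

With the fetch result from the question you would presumably call it as is_html(result.headers.get('Content-Type')).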
As far as emailing the variable's contents, I suggest the Python email module.
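A rough sketch of what that could look like with the standard library's email package (Python 3). The function name, addresses and subject are placeholders I made up, and the actual sending step is left out on purpose because it depends on your mail server or on the hosting platform's mail service.

```python
from email.message import EmailMessage


def build_html_email(html, sender, recipient, subject):
    """Wrap an HTML string in an email message (hypothetical helper)."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    # subtype="html" marks the body as text/html instead of text/plain.
    msg.set_content(html, subtype="html")
    return msg


# Placeholder values; in the question the HTML would come from result.content
# (decoded to text first if it is a byte string).
message = build_html_email(
    "<html><body><p>Hello</p></body></html>",
    "me@example.com",
    "you@example.com",
    "Fetched page",
)
```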
For info on sending a Content-Type header, see here: http://code.google.com/appengine/docs/python/urlfetch/overview.html#Request_Headers
Related
I have this url which returns the json data of my models but I don't know how to create a unit test for a url like this
path("list/", views.json_list, name="json_list"),
I'm not really sure what is being asked. A test like this
url = reverse('myapp:json_list')
response = client.get(url)
body = response.content.decode()
is going to fail if anything is wrong with the URL definition. (Specifically, reverse will fail if the name doesn't exist and, for a URL with arguments, if what you supply as kwargs isn't accepted by the URL definition.)
As for validating the response, we can't help without knowing a lot more about what is expected. Presumably, you will locate the start of some JSON text in body, feed it to json.loads, and make sure the data is as expected. But I don't think that's what is being asked.
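For the validation part, a minimal sketch of the json.loads step, under the assumption that the view returns a JSON list of objects. The helper name check_item_list and the field names "id" and "name" are invented for illustration; body here is a stand-in for response.content.decode().

```python
import json


def check_item_list(body, required_keys):
    """Parse a JSON body and check that it is a list of objects.

    Hypothetical helper: raises ValueError if the shape is unexpected,
    otherwise returns the parsed list.
    """
    data = json.loads(body)
    if not isinstance(data, list):
        raise ValueError("expected a JSON list")
    for item in data:
        missing = [key for key in required_keys if key not in item]
        if missing:
            raise ValueError("item is missing keys: %s" % missing)
    return data


# Stand-in for the decoded response body of the json_list view.
body = '[{"id": 1, "name": "first"}, {"id": 2, "name": "second"}]'
items = check_item_list(body, ["id", "name"])
```

Inside a test case you would then assert on items, for example on its length or on individual field values.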
EDIT:
In a similar vein, when I now try to log into their account with a post request, what is returned is none of the errors they suggest on their site, but is in fact a "JSON exception". Is there any way to debug this, or is an error code 500 completely impossible to deal with?
I'm well aware this question has been asked before. Sadly, when trying the proposed answers, none worked. I have an extremely simple Python project with urllib, and I've never done web programming in Python before, nor am I even a regular Python user. My friend needs to get access to content from this site, but their user-friendly front-end is down and I learned that they have a public API to access their content. Not knowing what I'm doing, but glad to try to help and interested in the challenge, I have very slowly set out.
Note that it is necessary for me to only use standard Python libraries, so that any finished project could easily be emailed to their computer and just work.
The following works completely fine minus the "originalLanguage" query, but when using it, which the API has documented as an array value, no matter whether I comma-separate things, or write "originalLanguage[0]" or "originalLanguage0" or anything that I've seen online, this creates the error message from the server: "Array value expected but string detected" or something along those lines.
Is there any way for me to get this working? Because it clearly can work, otherwise the API wouldn't document it. Many thanks.
In case it helps, when using "[]" or "<>" or "{}" or any delimiter I could think of, my IDE didn't recognise it as part of the URL.
import urllib.request as request
import urllib.parse as parse
def make_query(url, params):
    url += "?"
    for i in range(len(params)):
        url += list(params)[i]
        url += '='
        url += list(params.values())[i]
        if i < len(params) - 1:
            url += '&'
    return url
base = "https://api.mangadex.org/manga"
params = {
    "limit": "50",
    "originalLanguage": "en"
}
url = make_query(base, params)
req = request.Request(url)
response = request.urlopen(req)
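One thing that might be worth trying, using only the standard library: let urllib.parse.urlencode build the query string, and send the array parameter as a repeated key with a "[]" suffix. That bracket convention is an assumption on my part about how this API expects arrays, so check it against their documentation. The sketch below only builds the URL and makes no request.

```python
from urllib import parse


def make_query(url, params):
    """Append params to url as a percent-encoded query string.

    doseq=True makes urlencode emit one key=value pair per list element
    instead of encoding the list's string representation.
    """
    return url + "?" + parse.urlencode(params, doseq=True)


base = "https://api.mangadex.org/manga"
params = {
    "limit": "50",
    # Assumption: the server reads repeated "name[]" keys as an array.
    "originalLanguage[]": ["en", "ja"],
}
url = make_query(base, params)
```

The square brackets come out percent-encoded (%5B%5D), which also sidesteps the problem of the IDE not recognising them as part of the URL.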
I have this long list of URLs that I need to check the response code of, where the links are repeated 2-3 times. I have written this script to check the response code of each URL.
connection = urllib.request.urlopen(url)
return connection.getcode()
The URL comes in XML in this format
<entry key="something">url</entry>
<entry key="somethingelse">url</entry>
and I have to associate the response code with the attribute Key so I don't want to use a SET.
Now I definitely don't want to make more than one request for the same URL, so I was searching whether urlopen uses a cache or not but didn't find a conclusive answer. If not, what other technique can be used for this purpose?
You can store the response codes in a dictionary (urls = {}) as you make requests, and check whether you have already made a request to that URL:
if url not in urls:
    connection = urllib.request.urlopen(url)
    urls[url] = connection.getcode()
return urls[url]
BTW, if you make requests to the same URLs repeatedly (multiple runs of the script) and need a persistent cache, I recommend using requests with requests-cache.
Why don't you create a python set() of the URLs? That way each url is included only once.
How are you associating the URL with the key? A dictionary?
You can use a dictionary to map the URL to its response and any other information you need to keep track of. If the URL is already in the dictionary then you know the response. So you have one dictionary:
url_cache = {
    "url1": ("response", [key1, key2])
}
If you need to organize things differently it shouldn't be too hard with another dictionary.
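Putting the dictionary idea together with the XML format from the question, a sketch could look like the following. The names status_by_key and fetch_status are invented for illustration, and the single wrapping root element is an assumption about the file. Note that urlopen raises HTTPError for 4xx/5xx responses, so a real script would probably want to catch that and record the error's code.

```python
import urllib.request
import xml.etree.ElementTree as ET


def fetch_status(url):
    """Return the HTTP status code for url (performs a real request)."""
    with urllib.request.urlopen(url) as connection:
        return connection.getcode()


def status_by_key(xml_text, fetch=fetch_status):
    """Map every entry key to its URL's status, requesting each URL once."""
    cache = {}   # url -> status code, filled on first sight of a URL
    result = {}  # entry key -> status code
    root = ET.fromstring(xml_text)
    for entry in root.iter("entry"):
        key = entry.get("key")
        url = (entry.text or "").strip()
        if url not in cache:
            cache[url] = fetch(url)
        result[key] = cache[url]
    return result
```

The fetch parameter is only there so that the function can be tried out with a fake fetcher instead of hitting the network.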
To give an overview of the problem, I have a list of Twitter users' "screen_names" and I want to verify whether they are suspended users or not. I don't want to use the Twitter search API, to avoid the rate limit problem (the list is quite big). Therefore, I am trying to use a cluster of computers to label my dataset (whether an account in my database is suspended or not).
If an account is suspended by Twitter and you try to access them through the link http://www.twitter/screen_name you get redirected to https://twitter.com/account/suspended
I tried to capture this behaviour using Python 2.7 with urllib, using the geturl() method. It works but is not reliable (I don't get the same results on the same link). I tested it on the same account, and yet sometimes it returns https://twitter.com/account/suspended and other times it returns http://www.twitter/screen_name
The same problem occurs with requests.
My code:
import urllib
import requests
from lxml import html
screen_name = 'IaMaGuyGetIt'
account_url = "https://twitter.com/" + screen_name
url = requests.get(account_url)
print url.url
req = urllib.urlopen(url.url).read()
page = html.fromstring(req)
for heading in page.xpath("//h1"):
    if heading.text == 'Account suspended':
        print True
The twitter server only serves you the 302 redirect once; after that it'll assume your browser has cached the redirect.
The body of the page does contain a pointer though, so even if you were not redirected you can see that there is still the link there:
>>> r = requests.get(account_url)
>>> r.url
u'https://twitter.com/IaMaGuyGetIt'
>>> r.text
u'<html><body>You are being redirected.</body></html>'
Look for that exact text.
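As a sketch of how the two signals could be combined (written for Python 3; the function name looks_suspended is invented here): treat the account as suspended if either the final URL is the suspended page or the body carries the redirect notice. Whether the notice alone is a reliable signal is an assumption based on the behaviour described above.

```python
SUSPENDED_URL = "https://twitter.com/account/suspended"
REDIRECT_NOTICE = "You are being redirected."


def looks_suspended(final_url, body):
    """Guess whether a fetched profile belongs to a suspended account.

    final_url is the URL after redirects (r.url with requests) and body
    is the response text (r.text). Hypothetical helper, not a Twitter API.
    """
    if final_url.startswith(SUSPENDED_URL):
        return True
    # Fallback for the case where the redirect was not served again.
    return REDIRECT_NOTICE in body
```

With requests this would be called as looks_suspended(r.url, r.text).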
I'm trying to extract the response header of a URL request. When I use firebug to analyze the response output of a URL request, it returns:
Content-Type text/html
However when I use the python code:
urllib2.urlopen(URL).info()
the resulting output returns:
Content-Type: video/x-flv
I am new to python, and to web programming in general; any helpful insight is much appreciated. Also, if more info is needed please let me know.
Thanks in advance for reading this post
Try to request as Firefox does. You can see the request headers in Firebug, so add them to your request object:
import urllib2
request = urllib2.Request('http://your.tld/...')
request.add_header('User-Agent', 'some fake agent string')
request.add_header('Referer', 'fake referrer')
...
response = urllib2.urlopen(request)
# check content type:
print response.info().getheader('Content-Type')
There's also HTTPCookieProcessor which can make it better, but I don't think you'll need it in most cases. Have a look at python's documentation:
http://docs.python.org/library/urllib2.html
Content-Type text/html
Really, like that, without the colon?
If so, that might explain it: it's an invalid header, so it gets ignored, so urllib guesses the content-type instead, by looking at the filename. If the URL happens to have ‘.flv’ at the end, it'll guess the type should be video/x-flv.
This peculiar discrepancy might be explained by different headers (maybe ones of the accept kind) being sent by the two requests -- can you check that...? Or, if Javascript is running in Firefox (which I assume you're using when you're running firebug?) -- since it's definitely NOT running in the Python case -- "all bets are off", as they say;-).
Keep in mind that a web server can return different results for the same URL based on differences in the request. For example, content-type negotiation: the requestor can specify a list of content types it will accept, and the server can return different results to try to accommodate different needs.
Also, you may be getting an error page for one of your requests, for example, because it is malformed, or you don't have cookies set that authenticate you properly, etc. Look at the response itself to see what you are getting.
According to http://docs.python.org/library/urllib2.html there is only a get_header() method and nothing about getheader.
Asking because your code works fine for
response.info().getheader('Set-Cookie')
but once I execute
response.info().get_header('Set-Cookie')
I get:
Traceback (most recent call last):
  File "baza.py", line 11, in <module>
    cookie = response.info().get_header('Set-Cookie')
AttributeError: HTTPMessage instance has no attribute 'get_header'
edit:
Moreover
response.headers.get('Set-Cookie') works fine as well, though it is not mentioned in the urllib2 docs.
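For what it's worth, get_header() belongs to the request object, while the object returned by info() is a message object with a dict-like get(). Here is a small offline sketch in Python 3, where urllib2 became urllib.request. The header values are made up, and the message is parsed from a string instead of coming from a real response, so nothing is fetched.

```python
import email
import http.client
import urllib.request

# Made-up response headers, parsed into the same class that
# response.info() / response.headers uses in Python 3.
RAW_HEADERS = (
    "Content-Type: text/html; charset=utf-8\n"
    "Set-Cookie: session=abc123\n"
    "\n"
)
headers = email.message_from_string(RAW_HEADERS, _class=http.client.HTTPMessage)

# Response side: dict-style lookup, case-insensitive on the header name.
set_cookie = headers.get("Set-Cookie")
content_type = headers.get_content_type()

# Request side: this is where add_header() and get_header() live.
request = urllib.request.Request("http://example.com/")
request.add_header("Referer", "http://example.com/start")
referer = request.get_header("Referer")
```

So headers.get(...) is the lookup that carries over between versions, while getheader() on the message object is a Python 2 thing.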
For getting the raw data for the headers in Python 2, here is a little bit of a hack, but it works.
"".join(urllib2.urlopen("http://google.com/").info().__dict__["headers"])
Basically, "".join(list) joins the list of header lines, each of which already ends with a line break.
__dict__ is a built-in attribute that holds an object's instance attributes as a dictionary, so you can look up any attribute by name with it.
And of course ["headers"] selects the list of header lines from the object returned by .info().
hope this helped you learn a few ez python tricks :)