Some help understanding my own Python code - python

I'm starting to learn Python and I've written the following Python code (some of it omitted) and it works fine, but I'd like to understand it better. So I do the following:
html_doc = requests.get('[url here]')
Followed by:
if html_doc.status_code == 200:
    soup = BeautifulSoup(html_doc.text, 'html.parser')
    line = soup.find('a', class_="some_class")
    value = re.search('[regex]', str(line))
    print(value.group(0))
My questions are:
What does html_doc.text really do? I understand that it makes "text" (a string?) out of html_doc, but why isn't it text already? What is it? Bytes? Maybe a stupid question but why doesn't requests.get create a really long string containing the HTML code?
The only way that I could get the result of re.search was by value.group(0) but I have literally no idea what this does. Why can't I just look at value directly? I'm passing it a string, there's only one match, why is the resulting value not a string?

As stated in the docs, requests.get() returns a Response object.
As stated in the docs, re.search() returns a MatchObject (or None if the pattern is not found).
Both return objects rather than plain values because they carry more information than the response bytes alone (e.g. the HTTP status code and response headers) or the matched string alone (e.g. the positions of the first and last matched characters).
For more information you'll have to study the docs.
FYI, to check type of returned value you may use built-in type function:
response = requests.get('[url here]')
print(type(response))  # <class 'requests.models.Response'>
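To make the point about the match object concrete, here is a minimal sketch. The HTML snippet and the pattern are made up for illustration; they are not the asker's actual page or regex.

```python
import re

# Made-up input standing in for str(line) from the question.
html = '<a class="some_class" href="/item/42">Item 42</a>'

match = re.search(r'/item/(\d+)', html)

if match is None:
    # re.search returns None when the pattern is not found,
    # so calling .group() without this check could raise AttributeError.
    print("no match")
else:
    whole = match.group(0)      # the entire matched text
    item_id = match.group(1)    # the first parenthesised group
    start, end = match.span()   # where the match sits in the input
    print(whole, item_id, start, end)
```

So the match object is a small record describing the match, and group(0) is just the accessor for the matched text itself.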

It seems to me you are missing some basic knowledge about classes, objects, methods and so on. You should read more about them here (for Python 2.7) and about the requests module here.
As for what you asked: when you type html_doc = requests.get('url'), you are creating an instance of the class requests.models.Response. You can check this with:
>>> type(html_doc)
<class 'requests.models.Response'>
html_doc has attributes and methods; html_doc.text gives you the server's response body as a string.
The same goes for the re module: re.search returns a match object, not simply an int or a string.
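On the "why isn't it text already?" part: what travels over the network is bytes. requests exposes the raw bytes as response.content and the decoded string as response.text, using the encoding stored in response.encoding. The sketch below imitates that step with plain bytes, so it runs without a network call; the body and the UTF-8 encoding are assumptions for illustration.

```python
# Made-up body; a real one would come from response.content.
raw_body = "<p>café</p>".encode("utf-8")   # bytes, as sent by a server

# Decoding turns bytes into a str; this is roughly what .text does for you.
text = raw_body.decode("utf-8")

print(type(raw_body), len(raw_body))
print(type(text), len(text))
```

The two lengths differ because "é" takes two bytes in UTF-8 but is a single character in the string.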

Related

Can't replace spaces in a Python variable

I tried to replace spaces in a variable in Python, but it gives me this error:
AttributeError: 'HTTPHeaders' object has no attribute 'replace'
This is my code:
for req in driver.requests:
    print(req.headers)
    d = req.headers
    x = d.replace("""
""", "")
If you check out the class HTTPHeaders, you'll see that it has a __repr__ function and that it's an HTTPMessage object, not a string.
Depending on what exactly you want to achieve (which is still not clear to me: for which header do you want to replace spaces?), you can go about this in two ways. Use the methods on the HTTPMessage object (documented here), or use the string version of it by calling repr on the headers. I recommend the first approach, as it is much cleaner.
I'll give an example in which I remove spaces for all canary values in all of the requests:
for req in driver.requests:
    canary = req.headers.get("canary")
    canary = canary.replace(" ", "")
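One caveat with the loop above: get() returns None when a request has no such header, and None has no replace method either. The sketch below uses the standard library's http.client.HTTPMessage as a stand-in for the seleniumwire headers object (an assumption, based on the statement above that HTTPHeaders is an HTTPMessage); the header name and value are made up, and header_without_spaces is a hypothetical helper.

```python
from http.client import HTTPMessage

# Stand-in for req.headers; the "canary" value is invented.
headers = HTTPMessage()
headers["canary"] = "a b c"

# The headers object itself is not a string, hence the AttributeError.
has_replace = hasattr(headers, "replace")

def header_without_spaces(headers, name):
    """Return the header value with spaces removed, or None if absent."""
    value = headers.get(name)
    if value is None:
        return None
    return value.replace(" ", "")

cleaned = header_without_spaces(headers, "canary")
missing = header_without_spaces(headers, "not-there")
print(has_replace, cleaned, missing)
```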
P.S. Your question is nowhere near clear enough as it stands. Only after asking multiple times and seeing your other question did it become clear that you are using seleniumwire, for example. Ideally, the code you provide can be run by anyone with the installed packages and reproduces the issue you have. But all right, the comments made it clearer.

Input variable name as raw string into request in python

I am kind of very new to python.
I tried to loop through a URL request in Python, and I want to change one variable each time it loops.
My code looks something like this:
codes = ["MCDNDF3","MCDNDF4"]
#count = 0
for x in codes:
    response = requests.get(url_part1 + str(codes) + url_part3, headers=headers)
    print(response.content)
    print(response.status_code)
    print(response.url)
I want to have the url change at every loop to like url_part1+code+url_part3 and then url_part1+NEXTcode+url_part3.
Sadly my request badly formats the string from the variable to "%5B'MCDNDF3'%5D".
It should get inserted as a raw string each loop. I don't know if I need url encoding as I don't have any special chars in the request. Just change code to MCDNDF3 and in the next request to MCDNDF4.
Any thoughts?
Thanks!
In your for loop, the first line should be:
response = requests.get(url_part1 + x + url_part3, headers=headers)
This will work assuming url_part1 and url_part3 are regular strings. x is already a string, as your codes list (at least in your example) contains only strings. %5B and %5D are [ and ] URL-encoded, respectively. You got that error because you called str() on a single-membered list:
>>> str(["This is a string"])
"['This is a string']"
If url_part1 and url_part3 are raw strings, as you seem to indicate, please update your question to show how they are defined. Feel free to use example.com if you don't want to reveal your actual target URL. You should probably be calling str() on them before constructing the full URL.
You’re putting the whole list in (codes) when you probably want x.
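A small sketch of the difference, with no network call. The two URL parts are hypothetical placeholders, since the question does not show how url_part1 and url_part3 are defined.

```python
from urllib.parse import quote

# Hypothetical URL parts; the real ones are not shown in the question.
url_part1 = "https://example.com/api/"
url_part3 = "/details"
codes = ["MCDNDF3", "MCDNDF4"]

# str() on the list gives its printed form, brackets and quotes included.
wrong = str(codes)

# Using the loop variable inserts one plain code per URL.
urls = [url_part1 + code + url_part3 for code in codes]

# The brackets are what show up percent-encoded as %5B and %5D.
encoded_brackets = quote("[") + quote("]")

print(wrong)
print(urls)
print(encoded_brackets)
```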

Python documentation on possibly inherited method

I am writing a program (Python 3.5.2) that uses an HTTPSConnection to get a JSON object as a response. I have it working using some example code, but am not sure where a method comes from.
My question is this: in the code below, the decode('utf-8') method doesn't appear in the documentation at https://docs.python.org/3.4/library/http.client.html#http.client.HTTPResponse under "21.12.2. HTTPResponse Objects". How would I know that the return value from the method "response.read()" has the method "decode('utf-8')" available?
Do Python objects inherit from a base class like C# objects do or am I missing something?
http = HTTPSConnection(get_hostname(token))
http.request('GET', uri_path, headers=get_authorization_header(token))
response = http.getresponse()
print(response.status, response.reason)
feed = json.loads(response.read().decode('utf-8'))
Thank you for your help.
The read method of the response object always returns a byte string (in Python 3, which I presume you are using as you use the print function). The byte string does indeed have a decode method, so there should be no problem with this code. Of course it makes the assumption that the response is encoded in UTF-8, which may or may not be correct.
[Technical note: email is a very difficult medium to handle: messages can be made up of different parts, each of which is differently encoded. At least with web traffic you stand a chance of reading the Content-Type header's charset attribute to find the correct encoding].
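As a sketch of reading the charset instead of hard-coding UTF-8: the headers of an http.client response are an HTTPMessage, which has a get_content_charset() method. The example below builds such a headers object by hand so it runs without a connection; decode_body is a hypothetical helper, and the header value and body are made up.

```python
import json
from http.client import HTTPMessage

def decode_body(raw, headers, default="utf-8"):
    """Decode bytes using the Content-Type charset, else a default."""
    charset = headers.get_content_charset() or default
    return raw.decode(charset)

# Stand-in for response.headers and response.read().
headers = HTTPMessage()
headers["Content-Type"] = "application/json; charset=iso-8859-1"
raw = '{"name": "café"}'.encode("iso-8859-1")

# read() returns bytes; bytes is the type that provides decode().
feed = json.loads(decode_body(raw, headers))
print(type(raw), feed)
```

With a real response this would be decode_body(response.read(), response.headers), falling back to UTF-8 when the server declares no charset.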

trouble scraping from JSONP feed

I asked a similar question earlier
python JSON feed returns string not object
but I am having a little more trouble and don't understand it.
For about half of the dates this works and returns a JSON object
for example November 9 2013 works
url = 'http://data.ncaa.com/jsonp/scoreboard/basketball-men/d1/2013/11/09/scoreboard.html?callback=c'
r = requests.get(url)
jsonObj = json.loads(r.content[2:-2])
but if I try November 11 2013:
url = 'http://data.ncaa.com/jsonp/scoreboard/basketball-men/d1/2013/11/11/scoreboard.html?callback=c'
r = requests.get(url)
jsonObj = json.loads(r.content[2:-2])
I get this error
ValueError: No JSON object could be decoded
I don't understand why. When I put both URLs into a browser they look exactly the same.
The JSON in the second feed is, in fact, invalid JSON. Found this by removing the callback function and running it through: http://jsonlint.com/
To see for yourself, search for the following ID: 336252
The lines just above that ID contain two commas in a row, which is disallowed by the JSON spec.
My guess is that the server at data.ncaa.com is trying to generate JSON itself rather than using a JSON library. You should contact the site administrator and make them aware of this error.
Using demjson
demjson.decode(r.content[2:-2])
seems to work
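If you stay with the standard json module, a sketch like the following at least makes the failure explicit instead of crashing. It strips the callback wrapper by looking for the outer parentheses rather than slicing fixed offsets; strip_jsonp and parse_jsonp are hypothetical helper names, and the sample payloads are made up to imitate a valid feed and one with two commas in a row.

```python
import json

def strip_jsonp(text):
    """Return what sits between the first '(' and the last ')'."""
    start = text.index("(") + 1
    end = text.rindex(")")
    return text[start:end]

def parse_jsonp(text):
    """Parse a JSONP payload; return None if the JSON inside is invalid."""
    try:
        return json.loads(strip_jsonp(text))
    except ValueError as exc:
        # json raises a ValueError subclass on malformed input.
        print("invalid JSON:", exc)
        return None

good = parse_jsonp('c({"games": [1, 2]});')
bad = parse_jsonp('c({"games": [1,, 2]});')   # two commas in a row
print(good, bad)
```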

Use of string in if() statement django error: string as left operand, not QuerySet

So, I am getting an error of:
TypeError: 'in <string>' requires string as left operand, not QuerySet
I have a method which has:
error_val = self.error_object
for p in self.output:
    request = requests.get(p, timeout=settings.REQUESTS_TIMEOUT, verify=False)
    for req in request:
        if error_val in req:
            print 'error Found in'+req
This error is happening due to error_val in the if()
In layman's terms, this is basically saying (if I'm not mistaken), "Whoa, I'm getting an object value, with strings, but I can't compare it to another object value."
req - is basically the html output of a page e.g. <html><body><!--html content here--></body></html>
error_val - is a variable holding the values of an object (results from a Django query)
My question: how can I rework this method so, I can use the error_val var against each req (request)?
Any help, comments, suggestions are really helpful. Thank you.
self.error_object holds an instance of the QuerySet class, and no, you can't check whether an object of this type is inside a string.
QuerySet is a class that wraps Django ORM queries. It implements the iterable protocol, so you can iterate over it to get the matching model instances one by one.
You can then access the fields of these instances as normal object attributes. If one of them is a string, you can check whether it is a substring of req.
It's hard to say exactly what you are trying to do, but here is a guess:
for model_instance in self.error_object:
    for req in request:
        if model_instance.some_string_field in req:
            print 'error Found in' + req
If you're trying to check whether any of multiple objects' string representations appear in your response (I am unable to figure out what other behavior "queryset in string" might be intended to create) you want something like:
error_strings = [str(val) for val in self.error_object]
...
# then in your loop
if any(val in req for val in error_strings):
You might also profile building a single regexp that ORs the error strings together.
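A minimal sketch of both variants, using plain strings in place of the QuerySet so it runs without Django. The error strings and the page content are invented; in the real code error_strings would come from the model instances as shown above.

```python
import re

# Invented stand-ins for str(val) of each object in the QuerySet.
error_strings = ["Server Error", "Traceback (most recent call last)"]
page = "<html><body>500 Server Error</body></html>"

# Variant 1: test each string in turn.
found_any = any(s in page for s in error_strings)

# Variant 2: one compiled pattern with the strings OR'ed together.
# re.escape keeps characters such as parentheses literal.
pattern = re.compile("|".join(re.escape(s) for s in error_strings))
match = pattern.search(page)

print(found_any, match.group(0) if match else None)
```

Note that an empty list would produce an empty pattern, which matches everywhere, so the regexp variant needs a guard for that case.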
