Input variable as a raw string into a request in Python

I'm very new to Python.
I'm trying to loop over a URL request, changing one variable on each iteration.
My code looks something like this:
codes = ["MCDNDF3", "MCDNDF4"]
#count = 0
for x in codes:
    response = requests.get(url_part1 + str(codes) + url_part3, headers=headers)
    print(response.content)
    print(response.status_code)
    print(response.url)
I want the URL to change on every loop, to url_part1 + code + url_part3, then url_part1 + NEXTcode + url_part3.
Sadly the request badly formats the string from the variable as "%5B'MCDNDF3'%5D".
It should be inserted as a raw string on each loop. I don't know whether I need URL encoding, since there are no special characters in the request. The code should simply be MCDNDF3 on one request and MCDNDF4 on the next.
Any thoughts?
Thanks!

In your for loop, the first line should be:
response = requests.get(url_part1 + x + url_part3, headers=headers)
This will work assuming url_part1 and url_part3 are regular strings. x is already a string, since your codes list (at least in your example) contains only strings. %5B and %5D are [ and ] URL-encoded, respectively. You got that output because you called str() on a single-element list:
>>> str(["This is a string"])
"['This is a string']"
If url_part1 and url_part3 are raw strings, as you seem to indicate, please update your question to show how they are defined. Feel free to use example.com if you don't want to reveal your actual target URL. If they are not plain strings, you should probably call str() on them before constructing the full URL.

You're putting the whole list (codes) into the URL when you probably want x.
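Putting both answers together, a minimal sketch of the fixed loop. url_part1 and url_part3 are invented placeholders here; the real code would pass each built URL to requests.get:

```python
# Hypothetical URL parts -- the question doesn't show the real ones
url_part1 = "https://example.com/lookup?code="
url_part3 = "&format=json"
codes = ["MCDNDF3", "MCDNDF4"]

urls = []
for code in codes:
    # code is already a str, so plain concatenation works -- no str() needed
    urls.append(url_part1 + code + url_part3)

print(urls[0])  # https://example.com/lookup?code=MCDNDF3&format=json
```

Each iteration now uses the loop variable itself, so the list brackets never end up in the URL.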

Related

How can I prepend each value of a list with either a loop or a function?

I'm currently making API calls to an application and then parsing the JSON data for certain values. The values I'm parsing are suffixes of a full URL. I need to take this list of suffixes and prepend them with my variable which completes each URL. This list can vary in size and so I will not always know the exact list length nor the exact values of the suffixes.
The list (worker_uri_list) will be in the following format: ['/api/admin/configuration/v1/vm/***/', '/api/admin/configuration/v1/vm/***/', '/api/admin/configuration/v1/vm/***/']. The asterisks are characters that can vary at any given time the API call is made. I need to prepend each one of these suffixes with my variable (url_prefix).
I'm having a hard time working out the best method and how to do this correctly. Very new to Python here, so I hope I've explained this well enough. It seems a simple for or while loop could accomplish this, but most of the loop examples I find use simple integer values.
Any help is appreciated.
url_prefix = "https://website.com"
url1 = url_prefix + "/api/admin/configuration/v1/config"
payload = {}
header = {
    'Authorization': 'Basic ********'
}
# Parse response (nested_lookup comes from the nested-lookup package)
response = requests.get(url1, headers=header).json()
worker_uri_list = nested_lookup("resource_uri", response)
loop?
function?
You could iterate the worker_uri_list in a loop:
for uri in worker_uri_list:
    url = url_prefix + uri
    # do something with the url
    print(url)
Output:
https://website.com/api/admin/configuration/v1/vm/***/
https://website.com/api/admin/configuration/v1/vm/***/
https://website.com/api/admin/configuration/v1/vm/***/
Or if you want a list of URLs, use a list comprehension:
urls = [url_prefix + uri for uri in worker_uri_list]
print(urls)
Output:
[
'https://website.com/api/admin/configuration/v1/vm/***/',
'https://website.com/api/admin/configuration/v1/vm/***/',
'https://website.com/api/admin/configuration/v1/vm/***/'
]

Error formatting the body of a scrapy request

When I make a scrapy request without formatting the body I get the right results; however, when I format it in order to make a loop I get a 400 error.
This is the body that's not formatted:
'{"fields":"id,angellist_url,job_roles","limit":25,"offset":0,"form_data":{"must":{"filters":{"founding_or_hq_slug_locations":{"values":["spain"],"execution":"or"}},"execution":"and"},"should":{"filters":{}},"must_not":{"growth_stages":["mature"],"company_type":["service provider","government nonprofit"],"tags":["outside tech"],"company_status":["closed"]}},"keyword":null,"sort":"-last_funding_date"}'
This is the formatted body:
'{"fields":"id,angellist_url,job_roles","limit":25,"offset":{offset_items},"form_data":{"must":{"filters":{"founding_or_hq_slug_locations":{"values":["spain"],"execution":"or"}},"execution":"and"},"should":{"filters":{}},"must_not":{"growth_stages":["mature"],"company_type":["service provider","government nonprofit"],"tags":["outside tech"],"company_status":["closed"]}},"keyword":null,"sort":"-last_funding_date"}'
Then when making the request I use:
yield scrapy.Request(url = url, headers = headers, body = body.format(offset_items = '0'))
The two are identical except for **offset_items** in your 'formatted' string, which is causing the issue. The endpoint expects an int for the offset, not a quoted string.
You're trying to pass the offset kwarg value into the body using .format(), but with the wrong syntax: every literal { and } in the JSON template has to be escaped as {{ and }} before {offset_items} is treated as a placeholder. The same caveat applies if you use an f-string to pass in the value.
You can use something like difflib in Python to quickly check the difference between two strings, which lets you troubleshoot issues like these easily.
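An alternative that sidesteps the brace-escaping problem entirely is to build the body as a dict and serialize it per request. A sketch, with the question's body abbreviated to a few fields:

```python
import json

def build_body(offset):
    # Build the request body as a dict and serialize it, instead of
    # splicing values into a JSON template string with str.format(),
    # which would require escaping every literal { and } as {{ and }}.
    payload = {
        "fields": "id,angellist_url,job_roles",  # abbreviated from the question
        "limit": 25,
        "offset": offset,  # stays a real int in the JSON output
        "keyword": None,
        "sort": "-last_funding_date",
    }
    return json.dumps(payload)

print(build_body(0))
```

The loop then becomes `body=build_body(offset)` for each offset value, with no string templating involved.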

retrieved URLs, trouble building payload to use requests module

I'm a Python novice, thanks for your patience.
I retrieved a web page, using the requests module. I used Beautiful Soup to harvest a few hundred href objects (links). I used uritools to create an array of full URLs for the target pages I want to download.
I don't want everybody who reads this note to bombard the web server with requests, so I'll show a hypothetical example that is realistic for just 2 hrefs. The array looks like this:
hrefs2 = ['http://ku.edu/pls/WP040?PT001F01=910&pf7331=11',
'http://ku.edu/pls/WP040?PT001F01=910&pf7331=12']
If I were typing these into 100s of lines of code, I understand what to do in order to retrieve each page:
from lxml import html
import requests
url = 'http://ku.edu/pls/WP040/'
payload = {'PT001F01' : '910', 'pf7331' : '11'}
r = requests.get(url, params = payload)
Then get the second page
payload = {'PT001F01' : '910', 'pf7331' : '12'}
r = requests.get(url, params = payload)
And keep typing in payload objects. Not all of the hrefs I'm dealing with are sequential, and not all of the payloads differ only in the last integer.
I want to automate this and I don't see how to create the payloads from the hrefs2 array.
While fiddling with uritools, I find urisplit which can give me the part I need to parse into a payload:
[urisplit(x)[3] for x in hrefs2]
['PT001F01=910&pf7331=11',
'PT001F01=910&pf7331=12']
Each one of those has to be turned into a payload object and I don't understand what to do.
I'm using Python3 and I used uritools because that appears to be the standards-compliant replacement of urltools.
I fell back on shell script to get pages with wget, which does work, but it is so un-Python-ish that I'm asking here for what to do. I mean, this does work:
import subprocess

for i in hrefs2:
    subprocess.call(["wget", i])
You can pass the full url to requests.get() without splitting up the parameters.
>>> requests.get('http://ku.edu/pls/WP040?PT001F01=910&pf7331=12')
<Response [200]>
If for some reason you don't want to do that, you'll need to split up the parameters somehow. I'm sure there are better ways to do it, but the first thing that comes to mind is:
a = ['PT001F01=910&pf7331=11',
     'PT001F01=910&pf7331=12']

# list to store all url parameters after they're converted to dicts
urldata = []

# iterate over the list of params
for param in a:
    data = {}
    # split the string into key=value pairs
    for kv in param.split('&'):
        # split each pair up
        b = kv.split('=')
        # first part is the key, second is the value
        data[b[0]] = b[1]
    # after converting every kv pair in the parameter, add the result to the list
    urldata.append(data)
You could do this with less code but I wanted to be clear what was going on. I'm sure there is already a module somewhere out there that does this for you too.
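Such a module does exist in the standard library: urllib.parse.parse_qsl does the splitting shown above in one call. A sketch using the hypothetical URLs from the question:

```python
from urllib.parse import urlsplit, parse_qsl

hrefs2 = ['http://ku.edu/pls/WP040?PT001F01=910&pf7331=11',
          'http://ku.edu/pls/WP040?PT001F01=910&pf7331=12']

# urlsplit(...).query extracts 'PT001F01=910&pf7331=11';
# parse_qsl splits that into [('PT001F01', '910'), ('pf7331', '11')]
payloads = [dict(parse_qsl(urlsplit(href).query)) for href in hrefs2]
print(payloads[0])  # {'PT001F01': '910', 'pf7331': '11'}
```

Each dict in payloads can then be passed straight to requests.get(url, params=...).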

Plus instead of a space in the url

I'm given a request URL as a string, and I want to take it apart and send the parameters separately, because the URL is very long.
But in req.url I get a plus instead of a space, unlike in the original query.
query = "https://www.superjob.ru/resume/search_resume.html?sbmit=1&detail_search=1&keywords%5B0%5D%5Bkeys%5D=%D0%9A%D0%BB%D0%B0%D0%B4%D0%BE%D0%B2%D1%89%D0%B8%D0%BA+%D0%9A%D0%BE%D0%BC%D0%BF%D0%BB%D0%B5%D0%BA%D1%82%D0%BE%D0%B2%D1%89%D0%B8%D0%BA+%D0%93%D1%80%D1%83%D0%B7%D1%87%D0%B8%D0%BA+%D0%A0%D0%B0%D0%B1%D0%BE%D1%82%D0%BD%D0%B8%D0%BA+%D1%81%D0%BA%D0%BB%D0%B0%D0%B4%D0%B0&keywords%5B0%5D%5Bskwc%5D=or&keywords%5B0%5D%5Bsrws%5D=60&keywords%5B1%5D%5Bkeys%5D=%D0%9A%D0%BB%D0%B0%D0%B4%D0%BE%D0%B2%D1%89%D0%B8%D0%BA+%D0%9A%D0%BE%D0%BC%D0%BF%D0%BB%D0%B5%D0%BA%D1%82%D0%BE%D0%B2%D1%89%D0%B8%D0%BA+%D0%93%D1%80%D1%83%D0%B7%D1%87%D0%B8%D0%BA+%D0%A0%D0%B0%D0%B1%D0%BE%D1%82%D0%BD%D0%B8%D0%BA+%D1%81%D0%BA%D0%BB%D0%B0%D0%B4%D0%B0&keywords%5B1%5D%5Bskwc%5D=or&keywords%5B1%5D%5Bsrws%5D=7&keywords%5B2%5D%5Bkeys%5D=%D0%AE%D0%BB%D0%BC%D0%B0%D1%80%D1%82&keywords%5B2%5D%5Bskwc%5D=nein&keywords%5B2%5D%5Bsrws%5D=50&keywords%5B3%5D%5Bkeys%5D=%D0%A0%D1%83%D0%BA%D0%BE%D0%B2%D0%BE%D0%B4%D0%B8%D1%82%D0%B5%D0%BB%D1%8C+%D0%B4%D0%B8%D1%80%D0%B5%D0%BA%D1%82%D0%BE%D1%80+%D0%BD%D0%B0%D1%87%D0%B0%D0%BB%D1%8C%D0%BD%D0%B8%D0%BA+%D1%8D%D0%BA%D0%BE%D0%BD%D0%BE%D0%BC%D0%B8%D1%81%D1%82+%D0%B0%D0%BD%D0%B0%D0%BB%D0%B8%D1%82%D0%B8%D0%BA+%D0%BB%D0%BE%D0%B3%D0%B8%D1%81%D1%82+%D0%9E%D1%84%D0%B8%D1%81-%D0%BC%D0%B5%D0%BD%D0%B5%D0%B4%D0%B6%D0%B5%D1%80+%D0%93%D0%BE%D1%81%D1%83%D0%B4%D0%B0%D1%80%D1%81%D0%B2*&keywords%5B3%5D%5Bskwc%5D=nein&keywords%5B3%5D%5Bsrws%5D=60&exclude_words=%D0%A0%D1%83%D0%BA%D0%BE%D0%B2%D0%BE%D0%B4%D0%B8%D1%82%D0%B5%D0%BB%D1%8C+%D0%B4%D0%B8%D1%80%D0%B5%D0%BA%D1%82%D0%BE%D1%80+%D0%BD%D0%B0%D1%87%D0%B0%D0%BB%D1%8C%D0%BD%D0%B8%D0%BA+%D1%8D%D0%BA%D0%BE%D0%BD%D0%BE%D0%BC%D0%B8%D1%81%D1%82+%D0%B0%D0%BD%D0%B0%D0%BB%D0%B8%D1%82%D0%B8%D0%BA+%D0%BB%D0%BE%D0%B3%D0%B8%D1%81%D1%82+%D0%9E%D1%84%D0%B8%D1%81-%D0%BC%D0%B5%D0%BD%D0%B5%D0%B4%D0%B6%D0%B5%D1%80+%D0%93%D0%BE%D1%81%D1%83%D0%B4%D0%B0%D1%80%D1%81%D0%B2*&period=7&place_of_work=0&t%5B%5D=4&m%5B%5D=194&m%5B%5D=195&m%5B%5D=193&m%5B%5D=192&m%5B%5D=191&m%5B%5D=164&m%5B%5D=162&m%5B%5D=163&m%5B%5D=150&m%5B%5D=534&m%5B%5D=540&m%5B%5D=149&m%5B%5D=148&m%5B%5D=147&m%5B%5D=146&m%5B%5D=535&m%5B%5D=536&m%5B%5D=159&m%5B%5D=157&m%5B%5D=158&m%5B%5D=156&m%5B%5D=155&m%5B%5D=154&m%5B%5D=161&m%5B%5D=20&m%5B%5D=21&m%5B%5D=22&m%5B%5D=24&m%5B%5D=23&m%5B%5D=25&m%5B%5D=28&m%5B%5D=573&m%5B%5D=29&m%5B%5D=27&m%5B%5D=26&m%5B%5D=145&m%5B%5D=144&m%5B%5D=143&m%5B%5D=142&m%5B%5D=141&m%5B%5D=30&m%5B%5D=31&m%5B%5D=115&m%5B%5D=153&m%5B%5D=72&m%5B%5D=84&m%5B%5D=152&m%5B%5D=83&m%5B%5D=82&m%5B%5D=81&m%5B%5D=79&m%5B%5D=80&m%5B%5D=542&m%5B%5D=151&m%5B%5D=46&m%5B%5D=85&paymentfrom=20000&paymentto=35000&type_of_work=0&citizenship%5B0%5D=1&old1=18&old2=40&maritalstatus=0&pol=0&children=0&education=0&eduform=0&id_institute=0&institution=&languages%5B0%5D%5Blanguage_id%5D=0&languages%5B0%5D%5Blanguage_level%5D=0&business_trip=0"
def getparams(url):
    params = {}
    urlarr = url.split("&")
    i = 0
    while i < len(urlarr):
        params[urllib2.unquote(urlarr[i].split("=")[0].encode("utf-8"))] = urllib2.unquote(urlarr[i].split("=")[1].encode('utf-8'))
        i += 1
    return params
s = Session()
req = s.get(query.split("?")[0], params=getparams(query.split("?")[1]))
print(req.url)
I tried decode() instead of encode(), without encode(), and without urllib2.unquote().
Have you looked at the url you are posting?
you set
query = "https://www.superjob.ru/resume/search_resume.html?sbmit=1&detail_search=1&keywords%5B0%5D%5Bkeys%5D=%D0%9A%D0%BB%D0%B0%D0%B4%D0%BE%D0%B2%D1%89%D0%B8%D0%BA+%D0%9A%D0%BE%D0%BC%D0%BF%D0%BB%D0%B5%D0%BA%D1%82%D0%BE%D0%B2%D1%89%D0%B8%D0%BA+%D0%93%D1%80%D1%83%D0%B7%D1%87%D0%B8%D0%BA+%D0%A0%D0%B0%D0%B1%D0%BE%D1%82%D0%BD%D0%B8%D0%BA+%D1%81%D0%BA%D0%BB%D0%B0%D0%B4%D0%B0&keywords%5B0%5D%5Bskwc%5D=or&keywords%5B0%5D%5Bsrws%5D=60&keywords%5B1%5D%5Bkeys%5D=%D0%9A%D0%BB%D0%B0%D0%B4%D0%BE%D0%B2%D1%89%D0%B8%D0%BA+%D0%9A%D0%BE%D0%BC%D0%BF%D0%BB%D0%B5%D0%BA%D1%82%D0%BE%D0%B2%D1%89%D0%B8%D0%BA+%D0%93%D1%80%D1%83%D0%B7%D1%87%D0%B8%D0%BA+%D0%A0%D0%B0%D0%B1%D0%BE%D1%82%D0%BD%D0%B8%D0%BA+%D1%81%D0%BA%D0%BB%D0%B0%D0%B4%D0%B0&keywords%5B1%5D%5Bskwc%5D=or&keywords%5B1%5D%5Bsrws%5D=7&keywords%5B2%5D%5Bkeys%5D=%D0%AE%D0%BB%D0%BC%D0%B0%D1%80%D1%82&keywords%5B2%5D%5Bskwc%5D=nein&keywords%5B2%5D%5Bsrws%5D=50&keywords%5B3%5D%5Bkeys%5D=%D0%A0%D1%83%D0%BA%D0%BE%D0%B2%D0%BE%D0%B4%D0%B8%D1%82%D0%B5%D0%BB%D1%8C+%D0%B4%D0%B8%D1%80%D0%B5%D0%BA%D1%82%D0%BE%D1%80+%D0%BD%D0%B0%D1%87%D0%B0%D0%BB%D1%8C%D0%BD%D0%B8%D0%BA+%D1%8D%D0%BA%D0%BE%D0%BD%D0%BE%D0%BC%D0%B8%D1%81%D1%82+%D0%B0%D0%BD%D0%B0%D0%BB%D0%B8%D1%82%D0%B8%D0%BA+%D0%BB%D0%BE%D0%B3%D0%B8%D1%81%D1%82+%D0%9E%D1%84%D0%B8%D1%81-%D0%BC%D0%B5%D0%BD%D0%B5%D0%B4%D0%B6%D0%B5%D1%80+%D0%93%D0%BE%D1%81%D1%83%D0%B4%D0%B0%D1%80%D1%81%D0%B2*&keywords%5B3%5D%5Bskwc%5D=nein&keywords%5B3%5D%5Bsrws%5D=60&exclude_words=%D0%A0%D1%83%D0%BA%D0%BE%D0%B2%D0%BE%D0%B4%D0%B8%D1%82%D0%B5%D0%BB%D1%8C+%D0%B4%D0%B8%D1%80%D0%B5%D0%BA%D1%82%D0%BE%D1%80+%D0%BD%D0%B0%D1%87%D0%B0%D0%BB%D1%8C%D0%BD%D0%B8%D0%BA+%D1%8D%D0%BA%D0%BE%D0%BD%D0%BE%D0%BC%D0%B8%D1%81%D1%82+%D0%B0%D0%BD%D0%B0%D0%BB%D0%B8%D1%82%D0%B8%D0%BA+%D0%BB%D0%BE%D0%B3%D0%B8%D1%81%D1%82+%D0%9E%D1%84%D0%B8%D1%81-%D0%BC%D0%B5%D0%BD%D0%B5%D0%B4%D0%B6%D0%B5%D1%80+%D0%93%D0%BE%D1%81%D1%83%D0%B4%D0%B0%D1%80%D1%81%D0%B2*&period=7&place_of_work=0&t%5B%5D=4&m%5B%5D=194&m%5B%5D=195&m%5B%5D=193&m%5B%5D=192&m%5B%5D=191&m%5B%5D=164&m%5B%5D=162&m%5B%5D=163&m%5B%5D=150&m%5B%5D=534&m%5B%5D=540&m%5B%5D=149&m%5B%5D=148&m%5B%5D=147&m%5B%5D=146&m%5B%5D=535&m%5B%5D=536&m%5B%5D=159&m%5B%5D=157&m%5B%5D=158&m%5B%5D=156&m%5B%5D=155&m%5B%5D=154&m%5B%5D=161&m%5B%5D=20&m%5B%5D=21&m%5B%5D=22&m%5B%5D=24&m%5B%5D=23&m%5B%5D=25&m%5B%5D=28&m%5B%5D=573&m%5B%5D=29&m%5B%5D=27&m%5B%5D=26&m%5B%5D=145&m%5B%5D=144&m%5B%5D=143&m%5B%5D=142&m%5B%5D=141&m%5B%5D=30&m%5B%5D=31&m%5B%5D=115&m%5B%5D=153&m%5B%5D=72&m%5B%5D=84&m%5B%5D=152&m%5B%5D=83&m%5B%5D=82&m%5B%5D=81&m%5B%5D=79&m%5B%5D=80&m%5B%5D=542&m%5B%5D=151&m%5B%5D=46&m%5B%5D=85&paymentfrom=20000&paymentto=35000&type_of_work=0&citizenship%5B0%5D=1&old1=18&old2=40&maritalstatus=0&pol=0&children=0&education=0&eduform=0&id_institute=0&institution=&languages%5B0%5D%5Blanguage_id%5D=0&languages%5B0%5D%5Blanguage_level%5D=0&business_trip=0"
which contains a lot of percent-escapes. None of them is a space (%20).
The reason you get a plus instead of a space is that your URL contains a plus in place of every space. This is expected behaviour: proper URLs must not contain literal spaces, so in query strings spaces are conventionally encoded as "+".
If I put your escaped URL into my browser's address bar and look at it, all I see is "+"; there are no spaces there.
This is from somewhere at the start of your URL:
detail_search=1&keywords%5B0%5D%5Bkeys%5D=%D0%9A%D0%BB%D0%B0%D0%B4%D0%BE%D0%B2%D1%89%D0%B8%D0%BA+%D0%9A%D0%BE%D0%BC%D0%BF%D0%BB%D0%B5%D0%BA%D1%82%D0%BE%D0%B2%D1%89%D0%B8%D0%BA
which is essentially detail_search=1&keywords[0][keys]=Кладовщик+Комплектовщик
if unescaped. As you can see, both versions, quoted and unquoted, contain the plus. This is done by the website and can't (or shouldn't) be changed, especially if you want to reconstruct the URL somewhere else.
If you need the parameters without the plus, just do "myUnquotedString".replace("+", " ") (after unquoting, because I don't know what urllib2 does if it gets a malformed string as the parameter to unquote()).
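To make the round-trip concrete, a small sketch using Python 3's urllib.parse (where urllib2's quoting helpers now live); the strings are taken from the question's keywords:

```python
from urllib.parse import parse_qs, quote_plus, unquote_plus

# quote_plus encodes a space as '+' (the query-string convention);
# unquote_plus reverses it
encoded = quote_plus("Кладовщик Комплектовщик")
assert "+" in encoded and " " not in encoded
print(unquote_plus(encoded))  # Кладовщик Комплектовщик

# parse_qs likewise decodes '+' back to a space
print(parse_qs("keys=a+b"))  # {'keys': ['a b']}
```

So as long as the query string is decoded with the *_plus helpers (or parse_qs), the pluses come back out as spaces automatically.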
I solved this problem like this:
query = "https://www.superjob.ru/resume/search_resume.html?sbmit=1&detail_search=1&keywords%5B0%5D%5Bkeys%5D=%D0%9A%D0%BB%D0%B0%D0%B4%D0%BE%D0%B2%D1%89%D0%B8%D0%BA+%D0%9A%D0%BE%D0%BC%D0%BF%D0%BB%D0%B5%D0%BA%D1%82%D0%BE%D0%B2%D1%89%D0%B8%D0%BA+%D0%93%D1%80%D1%83%D0%B7%D1%87%D0%B8%D0%BA+%D0%A0%D0%B0%D0%B1%D0%BE%D1%82%D0%BD%D0%B8%D0%BA+%D1%81%D0%BA%D0%BB%D0%B0%D0%B4%D0%B0&keywords%5B0%5D%5Bskwc%5D=or&keywords%5B0%5D%5Bsrws%5D=60&keywords%5B1%5D%5Bkeys%5D=%D0%9A%D0%BB%D0%B0%D0%B4%D0%BE%D0%B2%D1%89%D0%B8%D0%BA+%D0%9A%D0%BE%D0%BC%D0%BF%D0%BB%D0%B5%D0%BA%D1%82%D0%BE%D0%B2%D1%89%D0%B8%D0%BA+%D0%93%D1%80%D1%83%D0%B7%D1%87%D0%B8%D0%BA+%D0%A0%D0%B0%D0%B1%D0%BE%D1%82%D0%BD%D0%B8%D0%BA+%D1%81%D0%BA%D0%BB%D0%B0%D0%B4%D0%B0&keywords%5B1%5D%5Bskwc%5D=or&keywords%5B1%5D%5Bsrws%5D=7&keywords%5B2%5D%5Bkeys%5D=%D0%AE%D0%BB%D0%BC%D0%B0%D1%80%D1%82&keywords%5B2%5D%5Bskwc%5D=nein&keywords%5B2%5D%5Bsrws%5D=50&keywords%5B3%5D%5Bkeys%5D=%D0%A0%D1%83%D0%BA%D0%BE%D0%B2%D0%BE%D0%B4%D0%B8%D1%82%D0%B5%D0%BB%D1%8C+%D0%B4%D0%B8%D1%80%D0%B5%D0%BA%D1%82%D0%BE%D1%80+%D0%BD%D0%B0%D1%87%D0%B0%D0%BB%D1%8C%D0%BD%D0%B8%D0%BA+%D1%8D%D0%BA%D0%BE%D0%BD%D0%BE%D0%BC%D0%B8%D1%81%D1%82+%D0%B0%D0%BD%D0%B0%D0%BB%D0%B8%D1%82%D0%B8%D0%BA+%D0%BB%D0%BE%D0%B3%D0%B8%D1%81%D1%82+%D0%9E%D1%84%D0%B8%D1%81-%D0%BC%D0%B5%D0%BD%D0%B5%D0%B4%D0%B6%D0%B5%D1%80+%D0%93%D0%BE%D1%81%D1%83%D0%B4%D0%B0%D1%80%D1%81%D0%B2*&keywords%5B3%5D%5Bskwc%5D=nein&keywords%5B3%5D%5Bsrws%5D=60&exclude_words=%D0%A0%D1%83%D0%BA%D0%BE%D0%B2%D0%BE%D0%B4%D0%B8%D1%82%D0%B5%D0%BB%D1%8C+%D0%B4%D0%B8%D1%80%D0%B5%D0%BA%D1%82%D0%BE%D1%80+%D0%BD%D0%B0%D1%87%D0%B0%D0%BB%D1%8C%D0%BD%D0%B8%D0%BA+%D1%8D%D0%BA%D0%BE%D0%BD%D0%BE%D0%BC%D0%B8%D1%81%D1%82+%D0%B0%D0%BD%D0%B0%D0%BB%D0%B8%D1%82%D0%B8%D0%BA+%D0%BB%D0%BE%D0%B3%D0%B8%D1%81%D1%82+%D0%9E%D1%84%D0%B8%D1%81-%D0%BC%D0%B5%D0%BD%D0%B5%D0%B4%D0%B6%D0%B5%D1%80+%D0%93%D0%BE%D1%81%D1%83%D0%B4%D0%B0%D1%80%D1%81%D0%B2*&period=7&place_of_work=0&t%5B%5D=4&m%5B%5D=194&m%5B%5D=195&m%5B%5D=193&m%5B%5D=192&m%5B%5D=191&m%5B%5D=164&m%5B%5D=162&m%5B%5D=163&m%5B%5D=150&m%5B%5D=534&m%5B%5D=540&m%5B%5D=149&m%5B%5D=148&m%5B%5D=147&m%5B%5D=146&m%5B%5D=535&m%5B%5D=536&m%5B%5D=159&m%5B%5D=157&m%5B%5D=158&m%5B%5D=156&m%5B%5D=155&m%5B%5D=154&m%5B%5D=161&m%5B%5D=20&m%5B%5D=21&m%5B%5D=22&m%5B%5D=24&m%5B%5D=23&m%5B%5D=25&m%5B%5D=28&m%5B%5D=573&m%5B%5D=29&m%5B%5D=27&m%5B%5D=26&m%5B%5D=145&m%5B%5D=144&m%5B%5D=143&m%5B%5D=142&m%5B%5D=141&m%5B%5D=30&m%5B%5D=31&m%5B%5D=115&m%5B%5D=153&m%5B%5D=72&m%5B%5D=84&m%5B%5D=152&m%5B%5D=83&m%5B%5D=82&m%5B%5D=81&m%5B%5D=79&m%5B%5D=80&m%5B%5D=542&m%5B%5D=151&m%5B%5D=46&m%5B%5D=85&paymentfrom=20000&paymentto=35000&type_of_work=0&citizenship%5B0%5D=1&old1=18&old2=40&maritalstatus=0&pol=0&children=0&education=0&eduform=0&id_institute=0&institution=&languages%5B0%5D%5Blanguage_id%5D=0&languages%5B0%5D%5Blanguage_level%5D=0&business_trip=0"
# parse_qs and urlparse come from the urlparse module in Python 2
# (urllib.parse in Python 3); Session comes from requests
query = query.encode('utf-8')
s = Session()
req = s.get(query.split("?")[0], params=parse_qs(urlparse(query).query))
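For anyone on Python 3, the same idea looks like this; the URL below is a shortened stand-in for the very long query above, and no .encode() call is needed:

```python
from urllib.parse import urlparse, parse_qs

# Shortened stand-in for the long query URL from the question
query = "https://example.com/search?keys=a+b&period=7"
parts = urlparse(query)
params = parse_qs(parts.query)  # '+' is decoded back to a space here
print(params)  # {'keys': ['a b'], 'period': ['7']}
```

The params dict can then be passed to requests.get(parts.scheme + "://" + parts.netloc + parts.path, params=params), which re-encodes it correctly.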

Some help understanding my own Python code

I'm starting to learn Python and I've written the following Python code (some of it omitted) and it works fine, but I'd like to understand it better. So I do the following:
html_doc = requests.get('[url here]')
Followed by:
if html_doc.status_code == 200:
    soup = BeautifulSoup(html_doc.text, 'html.parser')
    line = soup.find('a', class_="some_class")
    value = re.search('[regex]', str(line))
    print(value.group(0))
My questions are:
What does html_doc.text really do? I understand it turns html_doc into text (a string?), but why isn't it text already? What is it, bytes? Maybe a stupid question, but why doesn't requests.get just produce one long string containing the HTML code?
The only way that I could get the result of re.search was by value.group(0) but I have literally no idea what this does. Why can't I just look at value directly? I'm passing it a string, there's only one match, why is the resulting value not a string?
requests.get()'s return value, as stated in the docs, is a Response object.
re.search()'s return value, as stated in the docs, is a match object.
Both objects exist because they carry much more information than the raw response bytes (e.g. the HTTP status code, response headers, etc.) or the matched string alone (e.g. the positions of the first and last matched characters).
For more information you'll have to study docs.
FYI, to check the type of a returned value you can use the built-in type function:
response = requests.get('[url here]')
print(type(response))  # <class 'requests.models.Response'>
It seems to me you are lacking some basic knowledge about classes, objects, methods, etc.; you can read more about them here (for Python 2.7) and about the requests module here.
Concerning what you asked: when you type html_doc = requests.get('url'), you create an instance of the class requests.models.Response, which you can check with:
>>> type(html_doc)
<class 'requests.models.Response'>
Now, html_doc has methods, so html_doc.text returns the server's response body decoded as text.
The same goes for the re module: its functions return match objects, which are not simply ints or strings.
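To make .group(0) concrete, a small self-contained example (the pattern and input string here are invented):

```python
import re

# re.search returns a Match object (or None if nothing matched),
# which wraps more than just the matched text
match = re.search(r"\d+", "order 12345 shipped")
print(match.group(0))  # '12345' -- the full match as a string
print(match.span())    # (6, 11) -- start and end positions in the input
```

This is why printing value directly shows an object representation: the string you want is inside it, retrieved with .group(0).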
