I'm just a beginner in Python. Please help me with this problem:
I have API documentation (server-allowed methods):
GET, http://view.example.com/candidates/<id>, shows the candidate with id=<id>. Returns 200.
I wrote this code:
import requests
url = 'http://view.example.com/candidates/4'
r = requests.get(url)
print(r)
But I want to know how I can supply the candidate id through the input() built-in function instead of hard-coding it in the URL.
Here is my attempt:
import requests
cand_id = input('Please, type id of askable candidate: ')
url = ('http://view.example.com/candidates' + 'cand_id')
r = requests.get(url)
print(r)
dir(r)
r.content
But it's not working...
You can do this to construct the url:
url = 'http://view.example.com/candidates'
params = { 'cand_id': 4 }
requests.get(url, params=params)
Result: http://view.example.com/candidates?cand_id=4
--
Or if you want to build the same url as you mentioned in your post:
url = 'http://view.example.com/candidates'
cand_id = input("Enter a candidate id: ")
new_url = "{}/{}".format(url, cand_id)
Result: http://view.example.com/candidates/4
You're using the string 'cand_id' instead of the variable cand_id. The concatenation produces the URL 'http://view.example.com/candidatescand_id' (note also the missing '/' separator before the id).
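For completeness, a minimal corrected version of the attempt might look like this (a sketch; note the '/' separator and the unquoted variable):
import requests

cand_id = input('Please, type id of askable candidate: ')
url = 'http://view.example.com/candidates/' + cand_id  # concatenate the variable, not the string 'cand_id'
r = requests.get(url)
print(r.status_code)
print(r.content)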
I don't really know what to call this issue, sorry for the undescriptive title.
My program checks whether an element exists on multiple paths of a website. The program has a base URL to which it appends the different paths to check, which are stored in a JSON file (name.json).
In its current state, the program prints 1 if the element is found and 2 if not. I want it to print the URL instead of 1 or 2. My problem is that fullurl is overwritten on every iteration before the final for loop runs, so when I try to print fullurl I only get the URL for the last id in my JSON file printed multiple times, instead of each unique URL.
import json
import grequests
from bs4 import BeautifulSoup

idlist = json.loads(open('name.json').read())
baseurl = 'https://steamcommunity.com/id/'

complete_urls = []
for uid in idlist:
    fullurl = baseurl + uid
    complete_urls.append(fullurl)

rs = (grequests.get(fullurl) for fullurl in complete_urls)
resp = grequests.map(rs)
for r in resp:
    soup = BeautifulSoup(r.text, 'lxml')
    if soup.find('span', class_='actual_persona_name'):
        print('1')
    else:
        print('2')
Since grequests.map returns the responses in the order of the requests, you can match the fullurl of each request to its response using enumerate.
import json
import grequests
from bs4 import BeautifulSoup

idlist = json.loads(open('name.json').read())
baseurl = 'https://steamcommunity.com/id/'

complete_urls = []
for uid in idlist:
    fullurl = baseurl + uid
    complete_urls.append(fullurl)

rs = (grequests.get(fullurl) for fullurl in complete_urls)
resp = grequests.map(rs)
for index, r in enumerate(resp):  # use enumerate to get the index of each response
    soup = BeautifulSoup(r.text, 'lxml')
    print(complete_urls[index])  # the index of the response looks up the matching URL in complete_urls
    if soup.find('span', class_='actual_persona_name'):
        print('1')
    else:
        print('2')
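A slightly tighter alternative, relying on the same ordering guarantee, is to zip the URLs with the responses (a sketch):
for url, r in zip(complete_urls, resp):
    soup = BeautifulSoup(r.text, 'lxml')
    print(url)
    print('1' if soup.find('span', class_='actual_persona_name') else '2')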
If I understood correctly, you could simply print(r.url) instead of the numbers, since the full URL is stored on each response object (r.url is the final URL of the response, after any redirects).
for r in resp:
    soup = BeautifulSoup(r.text, 'lxml')
    if soup.find('span', class_='actual_persona_name'):
        print(r.url)
    else:
        print(r.url)
I'm trying to snip an embedded JSON object out of a webpage and pass it to json.loads(). The first URL works fine, but loading the second URL returns this error:
ValueError: Unterminated string starting at: line 1 column 2078 (char 2077)
Here is the code:
import requests, json
from bs4 import BeautifulSoup

urls = ['https://www.autotrader.co.uk/dealers/greater-manchester/manchester/williams-landrover-9994',
        'https://www.autotrader.co.uk/dealers/warwickshire/stratford-upon-avon/guy-salmon-land-rover-stratford-upon-avon-9965'
        ]
for url in urls:
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    scripts = soup.find_all('script')[0]
    data = scripts.text.split("window['AT_APOLLO_STATE'] = ")[1].split(';')[0]
    jdata = json.loads(data)
    print(jdata)
If you print out scripts.text.split("window['AT_APOLLO_STATE'] = ")[1], you will see the following, which includes a ; right after "and enthusiastic". So .split(';')[0] cuts the string off at that semicolon, and the resulting data, ending at "and enthusiastic", is not a valid JSON string.
"strapline":"In our state-of-the-art dealerships across the U.K, Sytner Group
represents the world’s most prestigious car manufacturers.
All of our staff are knowledgeable and enthusiastic; making every interaction
special by going the extra mile.",
The reason has been given above. You could also use a regex to extract the appropriate string:
import requests, json, re

urls = ['https://www.autotrader.co.uk/dealers/greater-manchester/manchester/williams-landrover-9994',
        'https://www.autotrader.co.uk/dealers/warwickshire/stratford-upon-avon/guy-salmon-land-rover-stratford-upon-avon-9965'
        ]
p = re.compile(r"window\['AT_APOLLO_STATE'\] =(.*?});", re.DOTALL)
for url in urls:
    r = requests.get(url)
    jdata = json.loads(p.findall(r.text)[0])
    print(jdata)
(An earlier version of this answer missed a } in the pattern.) The non-greedy (.*?}) stops at the first }; sequence, which is what closes the JSON object, so plain semicolons inside string values no longer truncate the data.
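If you'd rather not guess at delimiters at all, json.JSONDecoder.raw_decode is another option: it parses exactly one complete JSON value and ignores whatever trails it (a sketch, reusing the scripts variable from the question's loop):
import json

decoder = json.JSONDecoder()
raw = scripts.text.split("window['AT_APOLLO_STATE'] = ")[1].lstrip()
jdata, end = decoder.raw_decode(raw)  # stops cleanly at the end of the object, ignoring the trailing ';'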
In my code, a user inputs a search term and get_all_links parses the HTML response and extracts the links that start with 'http'. When req is replaced with a hard-coded URL such as:
content = urllib.request.urlopen("http://www.ox.ac.uk")
the program correctly returns a list of properly formatted links. However, when passing in req, no links are returned. I suspect this may be a formatting blip.
Here is my code:
import urllib.request
import urllib.parse

def get_all_links(s):  # function to get all the links
    d = 0
    links = []  # collect all links into a list
    while d != -1:  # until d is -1, i.e. no more links in the page
        d = s.find('<a href=', d)  # look for the next <a href=
        start = s.find('"', d)  # start is the opening quote
        end = s.find('"', start + 1)  # end is the closing quote
        if d != -1:  # a link was found
            d += 1
            if s[start + 1] == 'h':  # only add links that start with http
                links.append(s[start + 1:end])  # add to the link list
    return links  # return the list

def main():
    term = input('Enter a search term: ')
    url = 'http://www.google.com/search'
    value = {'q': term}
    user_agent = 'Mozilla/5.0'
    headers = {'User-Agent': user_agent}
    data = urllib.parse.urlencode(value)
    print(data)
    url = url + '?' + data
    print(url)
    req = urllib.request.Request(url, None, headers)
    content = urllib.request.urlopen(req)
    s = content.read()
    print(s)
    links = get_all_links(s.decode('utf-8'))
    for i in links:  # print the returned list
        print(i)

main()
You should use an HTML parser, as suggested in the comments. A library like BeautifulSoup is perfect for this.
I have adapted your code to use BeautifulSoup:
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup

def get_all_links(s):
    soup = BeautifulSoup(s, "html.parser")
    return soup.select("a[href^=\"http\"]")  # select all anchor tags whose href attribute starts with 'http'

def main():
    term = input('Enter a search term: ')
    url = 'http://www.google.com/search'
    value = {'q': term}
    user_agent = 'Mozilla/5.0'
    headers = {'User-Agent': user_agent}
    data = urllib.parse.urlencode(value)
    print(data)
    url = url + '?' + data
    print(url)
    req = urllib.request.Request(url, None, headers)
    content = urllib.request.urlopen(req)
    s = content.read()
    print(s)
    links = get_all_links(s.decode('utf-8'))
    for i in links:  # print the returned list
        print(i)

main()
It uses the select method of the BeautifulSoup library and returns a list of the selected elements (in your case, anchor tags).
Using a library like BeautifulSoup not only makes this easier, it also gives you much more powerful selections. Imagine how you would have to change your hand-rolled code if you wanted to select all links whose href attribute contains the word "google" or "code". With CSS selectors that is a one-line change, as sketched below.
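For instance (a sketch; BeautifulSoup's select supports standard CSS attribute selectors):
soup.select('a[href*="google"]')  # href contains "google"
soup.select('a[href^="http"]')    # href starts with "http", as used above
soup.select('a[href$=".pdf"]')    # href ends with ".pdf"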
You can read the BeautifulSoup documentation at https://www.crummy.com/software/BeautifulSoup/bs4/doc/.
My code below gives a list of URLs. If I want one specific URL, how do I get it?
My code is given below:
import bs4, requests

index_pages = ('https://www.tripadvisor.in/Hotels-g60763-oa{}-New_York_City_New_York-Hotels.html#ACCOM_OVERVIEW'.format(i) for i in range(0, 180, 30))

urls = []
with requests.session() as s:
    for index in index_pages:
        r = s.get(index)
        soup = bs4.BeautifulSoup(r.text, 'lxml')
        url_list = [i.get('href') for i in soup.select('.property_title')]
        urls.append(url_list)
        print(url_list)
The output I am getting is a list of URLs, shown below:
New_York_City_New_York.html', '/Hotel_Review-g60763-d93543-Reviews-Shelburne_NYC_an_Affinia_hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d1485603-Reviews-Comfort_Inn_Times_Square_West-New_York_City_New_York.html', '/Hotel_Review-g60763-d93340-Reviews-Hotel_Elysee_by_Library_Hotel_Collection-New_York_City_New_York.html', '/Hotel_Review-g60763-d1641016-Reviews-The_Chatwal_A_Luxury_Collection_Hotel_New_York-New_York_City_New_York.html', '/Hotel_Review-g60763-d93585-Reviews-Lowell_Hotel-New_York_City_New_York.html']
D:\anaconda3\lib\site-packages\requests\packages\urllib3\connectionpool.py:852: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning)
['/Hotel_Review-g60763-d277882-Reviews-Hampton_Inn_Manhattan_Seaport_Financial_District-New_York_City_New_York.html', '/Hotel_Review-g60763-d3529145-Reviews-Holiday_Inn_Express_Manhattan_Times_Square_South-New_York_City_New_York.html', '/Hotel_Review-g60763-d208453-Reviews-Hilton_Times_Square-New_York_City_New_York.html', '/Hotel_Review-g60763-d249711-Reviews-The_Hotel_at_Times_Square-New_York_City_New_York.html', '/Hotel_Review-g60763-d1158753-Reviews-Kimpton_Ink48_Hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d1186070-Reviews-Marriott_Vacation_Club_Pulse_New_York_City-New_York_City_New_York.html', '/Hotel_Review-g60763-d1938661-Reviews-Row_NYC_Hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d93345-Reviews-Skyline_Hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d217616-Reviews-Kimpton_Muse_Hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d1888977-Reviews-The_Pearl_Hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d223021-Reviews-Club_Quarters_Hotel_Midtown-New_York_City_New_York.html', '/Hotel_Review-g60763-d611947-Reviews-New_York_Hilton_Midtown-New_York_City_New_York.html', '/Hotel_Review-g60763-d4274398-Reviews-Courtyard_New_York_Manhattan_Times_Square_West-New_York_City_New_York.html', '/Hotel_Review-g60763-d1456416-Reviews-The_Dominick_Hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d122014-Reviews-Gild_Hall_a_Thompson_Hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d2622936-Reviews-Wyndham_Garden_Chinatown-New_York_City_New_York.html', '/Hotel_Review-g60763-d1456560-Reviews-Kimpton_Hotel_Eventi-New_York_City_New_York.html', '/Hotel_Review-g60763-d249710-Reviews-Morningside_Inn-New_York_City_New_York.html', '/Hotel_Review-g60763-d2079052-Reviews-YOTEL_New_York-New_York_City_New_York.html', '/Hotel_Review-g60763-d224214-Reviews-The_Bryant_Park_Hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d1785018-Reviews-The_James_New_York_SoHo-New_York_City_New_York.html', '/Hotel_Review-g60763-d247814-Reviews-The_Gatsby_Hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d112039-Reviews-Hotel_Newton-New_York_City_New_York.html', '/Hotel_Review-g60763-d612263-Reviews-Hotel_Mela-New_York_City_New_York.html', '/Hotel_Review-g60763-d99392-Reviews-Hotel_Metro-New_York_City_New_York.html', '/Hotel_Review-g60763-d4446427-Reviews-Hotel_Boutique_At_Grand_Central-New_York_City_New_York.html', '/Hotel_Review-g60763-d1503474-Reviews-Distrikt_Hotel_New_York_City-New_York_City_New_York.html', '/Hotel_Review-g60763-d93467-Reviews-Gardens_NYC_an_Affinia_hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d93603-Reviews-The_Pierre_A_Taj_Hotel_New_York-New_York_City_New_York.html', '/Hotel_Review-g60763-d113311-Reviews-The_Peninsula_New_York-New_York_City_New_York.html']
Now, if I am looking for one particular URL in the above list, how do I find it?
For example, how do I find the URL for Hilton_Times_Square in the above list?
To look for an exact URL:
def findExactUrl(urlList, searched):
    for url in urlList:
        if url == searched:
            return url
In IDLE you can call:
>>> findExactUrl(url_list, "http://maritonhotel.com/123")
'http://maritonhotel.com/123'
If the URL is in your list, it is returned; if it is not, the function returns None and nothing is echoed.
Alternatively, calling from your .py file:
myUrl = findExactUrl(url_list, "http://maritonhotel.com/123")
print(myUrl)  # prints http://maritonhotel.com/123
You can edit the function to return True, or the item's index, instead.
For a fuzzier search:
def findOccurence(urlList, searched):
    foundUrls = []
    for url in urlList:
        if searched in url:  # Python strings have no .contains() method; use the 'in' operator
            foundUrls.append(url)
    return foundUrls
If you want to remove some substring from a string, simply call the .replace() method:
for i in range(len(url_list)):
    if "[" in url_list[i]:
        url_list[i] = url_list[i].replace("[", "")
Let me know if there is something you do not understand. Also, please seriously consider getting a Python book for beginners or taking an online course.
This is what I would do:
keyword = "Hilton_Times_Square"
target_urls = [ e for e in url_list if keyword in e ]
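If you only need the first match rather than every match, next() with a default does it in one line (a sketch):
first_match = next((u for u in url_list if keyword in u), None)  # None if nothing matches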
I am requesting an Ajax website with a Python script, fetching the cities and branch offices of http://www.yurticikargo.com/bilgi-servisleri/Sayfalar/en-yakin-sube.aspx
I completed the first step by POSTing {cityID: 34} to this URL and fetching the JSON output:
http://www.yurticikargo.com/_layouts/ArikanliHolding.YurticiKargo.WebSite/ajaxproxy-sswservices.aspx/GetTownByCity
But I cannot retrieve the JSON output with Python, although I succeed with the Chrome Advanced REST Client extension, POSTing {cityID:54,townID:5416,unitOnDutyFlag:null,closestFlag:2} to
http://www.yurticikargo.com/_layouts/ArikanliHolding.YurticiKargo.WebSite/ajaxproxy-unitservices.aspx/GetUnit
All of the source code is here:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import requests
import json

class Yurtici(object):
    baseUrl = 'http://www.yurticikargo.com/'
    ajaxRoot = '_layouts/ArikanliHolding.YurticiKargo.WebSite/ajaxproxy-sswservices.aspx/'
    getTown = 'GetTownByCity'
    getUnit = 'GetUnit'
    urlGetTown = baseUrl + ajaxRoot + getTown
    urlGetUnit = baseUrl + ajaxRoot + getUnit
    headers = {'content-type': 'application/json', 'encoding': 'utf-8'}

    def __init__(self):
        pass

    def ilceler(self, plaka=34):  # default testing value
        payload = {'cityId': plaka}
        url = self.urlGetTown
        r = requests.post(url, data=json.dumps(payload), headers=self.headers)
        return r.json()  # OK

    def subeler(self, ilceNo=5902):  # default testing value
        # 5902 Çerkezköy
        payload = {'cityID': 59, 'townID': 5902, 'unitOnDutyFlag': 'null', 'closestFlag': 0}
        url = self.urlGetUnit
        headers = {'content-type': 'application/json', 'encoding': 'utf-8'}
        r = requests.post(url, data=json.dumps(payload), headers=headers)
        print(r.status_code, r.raw.read())

if __name__ == '__main__':
    a = Yurtici()
    print(a.ilceler(37))  # OK
    print(a.subeler())    # NOT OK !!!
Your code isn't POSTing to the same URL you quoted in your question.
Let's walk through this backwards. First, let's look at the failing POST.
url = self.urlGetUnit
headers = {'content-type': 'application/json','encoding':'utf-8'}
r = requests.post(url, data=json.dumps(payload), headers=headers)
So we're posting to a URL that is equal to self.urlGetUnit. Ok, let's look at how that's defined:
baseUrl = 'http://www.yurticikargo.com/'
ajaxRoot = '_layouts/ArikanliHolding.YurticiKargo.WebSite/ajaxproxy-sswservices.aspx/'
getUnit = 'GetUnit'
urlGetUnit = baseUrl + ajaxRoot + getUnit
If you expand urlGetUnit, the URL comes out as http://www.yurticikargo.com/_layouts/ArikanliHolding.YurticiKargo.WebSite/ajaxproxy-sswservices.aspx/GetUnit. Put this alongside the URL you used in Chrome to compare the differences:
http://www.yurticikargo.com/_layouts/ArikanliHolding.YurticiKargo.WebSite/ajaxproxy-sswservices.aspx/GetUnit
http://www.yurticikargo.com/_layouts/ArikanliHolding.YurticiKargo.WebSite/ajaxproxy-unitservices.aspx/GetUnit
See the difference? ajaxRoot is not the same for both URLs. Sort that out and you'll get back a JSON response.
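A minimal fix, assuming the unit endpoint lives under ajaxproxy-unitservices.aspx exactly as in your Chrome test, is to give GetUnit its own Ajax root:
baseUrl = 'http://www.yurticikargo.com/'
ajaxRootSsw = '_layouts/ArikanliHolding.YurticiKargo.WebSite/ajaxproxy-sswservices.aspx/'    # town/city services
ajaxRootUnit = '_layouts/ArikanliHolding.YurticiKargo.WebSite/ajaxproxy-unitservices.aspx/'  # unit services
urlGetTown = baseUrl + ajaxRootSsw + 'GetTownByCity'
urlGetUnit = baseUrl + ajaxRootUnit + 'GetUnit'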