Unable to find a URL in an output URL list - Python

My code below gives a list of URLs. If I want to pick out any specific URL, how do I do that?
My code is given below:
import bs4, requests

index_pages = ('https://www.tripadvisor.in/Hotels-g60763-oa{}-New_York_City_New_York-Hotels.html#ACCOM_OVERVIEW'.format(i) for i in range(0, 180, 30))

urls = []
with requests.session() as s:
    for index in index_pages:
        r = s.get(index)
        soup = bs4.BeautifulSoup(r.text, 'lxml')
        url_list = [i.get('href') for i in soup.select('.property_title')]
        urls.append(url_list)
        print(url_list)
The output I am getting is a list of URLs, shown below:
New_York_City_New_York.html', '/Hotel_Review-g60763-d93543-Reviews-Shelburne_NYC_an_Affinia_hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d1485603-Reviews-Comfort_Inn_Times_Square_West-New_York_City_New_York.html', '/Hotel_Review-g60763-d93340-Reviews-Hotel_Elysee_by_Library_Hotel_Collection-New_York_City_New_York.html', '/Hotel_Review-g60763-d1641016-Reviews-The_Chatwal_A_Luxury_Collection_Hotel_New_York-New_York_City_New_York.html', '/Hotel_Review-g60763-d93585-Reviews-Lowell_Hotel-New_York_City_New_York.html']
D:\anaconda3\lib\site-packages\requests\packages\urllib3\connectionpool.py:852: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning)
['/Hotel_Review-g60763-d277882-Reviews-Hampton_Inn_Manhattan_Seaport_Financial_District-New_York_City_New_York.html', '/Hotel_Review-g60763-d3529145-Reviews-Holiday_Inn_Express_Manhattan_Times_Square_South-New_York_City_New_York.html', '/Hotel_Review-g60763-d208453-Reviews-Hilton_Times_Square-New_York_City_New_York.html', '/Hotel_Review-g60763-d249711-Reviews-The_Hotel_at_Times_Square-New_York_City_New_York.html', '/Hotel_Review-g60763-d1158753-Reviews-Kimpton_Ink48_Hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d1186070-Reviews-Marriott_Vacation_Club_Pulse_New_York_City-New_York_City_New_York.html', '/Hotel_Review-g60763-d1938661-Reviews-Row_NYC_Hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d93345-Reviews-Skyline_Hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d217616-Reviews-Kimpton_Muse_Hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d1888977-Reviews-The_Pearl_Hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d223021-Reviews-Club_Quarters_Hotel_Midtown-New_York_City_New_York.html', '/Hotel_Review-g60763-d611947-Reviews-New_York_Hilton_Midtown-New_York_City_New_York.html', '/Hotel_Review-g60763-d4274398-Reviews-Courtyard_New_York_Manhattan_Times_Square_West-New_York_City_New_York.html', '/Hotel_Review-g60763-d1456416-Reviews-The_Dominick_Hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d122014-Reviews-Gild_Hall_a_Thompson_Hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d2622936-Reviews-Wyndham_Garden_Chinatown-New_York_City_New_York.html', '/Hotel_Review-g60763-d1456560-Reviews-Kimpton_Hotel_Eventi-New_York_City_New_York.html', '/Hotel_Review-g60763-d249710-Reviews-Morningside_Inn-New_York_City_New_York.html', '/Hotel_Review-g60763-d2079052-Reviews-YOTEL_New_York-New_York_City_New_York.html', '/Hotel_Review-g60763-d224214-Reviews-The_Bryant_Park_Hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d1785018-Reviews-The_James_New_York_SoHo-New_York_City_New_York.html', 
'/Hotel_Review-g60763-d247814-Reviews-The_Gatsby_Hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d112039-Reviews-Hotel_Newton-New_York_City_New_York.html', '/Hotel_Review-g60763-d612263-Reviews-Hotel_Mela-New_York_City_New_York.html', '/Hotel_Review-g60763-d99392-Reviews-Hotel_Metro-New_York_City_New_York.html', '/Hotel_Review-g60763-d4446427-Reviews-Hotel_Boutique_At_Grand_Central-New_York_City_New_York.html', '/Hotel_Review-g60763-d1503474-Reviews-Distrikt_Hotel_New_York_City-New_York_City_New_York.html', '/Hotel_Review-g60763-d93467-Reviews-Gardens_NYC_an_Affinia_hotel-New_York_City_New_York.html', '/Hotel_Review-g60763-d93603-Reviews-The_Pierre_A_Taj_Hotel_New_York-New_York_City_New_York.html', '/Hotel_Review-g60763-d113311-Reviews-The_Peninsula_New_York-New_York_City_New_York.html']
Now, if I am looking for a particular URL in the above list, how do I do that?
For example, how do I find the URL for Hilton_Times_Square in the above list?

For looking for an exact URL:

def findExactUrl(urlList, searched):
    for url in urlList:
        if url == searched:
            return url

In IDLE you can call:

>>> findExactUrl(url_list, "http://maritonhotel.com/123")
'http://maritonhotel.com/123'

if such a URL is in your list; if it is not there, nothing is shown (the function falls through and returns None).
Alternatively, calling from your .py file:

myUrl = findExactUrl(url_list, "http://maritonhotel.com/123")
print(myUrl)

prints http://maritonhotel.com/123.
You can edit the function to return True instead, or to return the index of the match.
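For instance, a hypothetical index-returning variant (the function name is made up here) could look like:

```python
def find_url_index(url_list, searched):
    # Return the position of the first exact match, or -1 if absent.
    for i, url in enumerate(url_list):
        if url == searched:
            return i
    return -1
```

Returning -1 for "not found" mirrors str.find(); you could equally raise an exception or return None.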
For a more vague (substring) search:

def findOccurence(urlList, searched):
    foundUrls = []
    for url in urlList:
        if searched in url:
            foundUrls.append(url)
    return foundUrls
If you want to remove some substring from a string, simply call the .replace() method:

for i in range(len(url_list)):
    if "[" in url_list[i]:
        url_list[i] = url_list[i].replace("[", "")
Let me know if there is something you do not understand. Also, please seriously consider getting a Python book for beginners or doing an online course.

This is what I would do:

keyword = "Hilton_Times_Square"
target_urls = [e for e in url_list if keyword in e]
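If you only need the first hit rather than every match, a small variation using next() works too (sketched on a shortened, made-up slice of the list):

```python
url_list = [
    '/Hotel_Review-g60763-d208453-Reviews-Hilton_Times_Square-New_York_City_New_York.html',
    '/Hotel_Review-g60763-d93345-Reviews-Skyline_Hotel-New_York_City_New_York.html',
]
keyword = "Hilton_Times_Square"

# next() stops at the first match; the default (None) avoids StopIteration
# when nothing matches.
first_match = next((e for e in url_list if keyword in e), None)
```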


How to filter certain URLs from a list based on keywords

I am having a problem removing certain URLs from a list in Python, and I want to know the easiest and quickest way.
I have a list of URLs returned from the Google Search API. I want to remove all the websites whose domain contains "tripadvisor", "facebook", "instagram", "twitter" or "ebag".
Here is what I have tried so far:
page_urls = ['https://the1955club.com/', 'https://www.tripadvisor.com/Restaurant_Review-g1842228-d10140470-Reviews-The_1955_Club-Walton_On_Thames_Surrey_England.html']
# print('page_urls', page_urls)
all_urls = []
for address in page_urls:
    url = urlparse(address)
    new_url = url.netloc
    all_urls.append(new_url)

all_urls.remove('www.tripadvisor.com')
all_urls.remove('tripadvisor.com')
I am getting this error:
ValueError: list.remove(x): x not in list
You can do:

from urllib.parse import urlparse

page_urls = [
    "https://the1955club.com/",
    "https://www.tripadvisor.com/Restaurant_Review-g1842228-d10140470-Reviews-The_1955_Club-Walton_On_Thames_Surrey_England.html",
]
forbiddens = ["facebook", "instagram", "twitter", "tripadvisor"]

def check_url(url):
    parsed_url = urlparse(url).netloc
    return not any(item in parsed_url for item in forbiddens)

valid_urls = [url for url in page_urls if check_url(url)]
print(valid_urls)
Basically you first parse the URL with urllib.parse.urlparse and take its netloc part. Then you check whether any of the forbidden names appears in the netloc; check_url does the filtering.
This of course doesn't remove URLs from page_urls; it creates a new list. But you can do this if you want to mutate that list:
page_urls[:] = [url for url in page_urls if check_url(url)]
print(page_urls)
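The difference between rebinding the name and slice assignment only matters when another name refers to the same list; a minimal illustration with made-up values:

```python
nums = [1, 2, 3, 4]
alias = nums  # second name for the same list object

# Rebinding creates a new list; alias still sees the old contents.
nums = [n for n in nums if n % 2 == 0]

# Slice assignment mutates the list in place; every alias sees the change.
alias[:] = [n for n in alias if n % 2 == 0]
```

After the rebinding, nums points at a new filtered list while alias still held all four values; the slice assignment then filters the original object itself.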

Values are getting replaced in python list

I am trying to get the actual URLs of shortened URLs by using the following code (I have replaced the shortened URLs with others, as Stack Overflow doesn't allow them):
url_list = ['https://stackoverflow.com/questions/62242867/php-lumen-request-body-always-empty', 'https://twitter.com/i/web/status/1269102116364324865']

import requests

actual_list = []
for link in url_list:
    response = requests.get(link)
    actual_url = response.url
    actual_list = actual_url
print(actual_list)
At the end there is only the last URL left in actual_list, but I need each URL. Can someone tell me what I am doing wrong here?
You need to append the URLs to the list:

actual_list = []
for link in url_list:
    response = requests.get(link)
    actual_url = response.url
    actual_list.append(actual_url)
Please try this; you need to use append to add the item to the list:

url_list = ['https://stackoverflow.com/questions/62242867/php-lumen-request-body-always-empty', 'https://twitter.com/i/web/status/1269102116364324865']

import requests

actual_list = []
for link in url_list:
    response = requests.get(link)
    actual_url = response.url
    actual_list.append(actual_url)
print(actual_list)
You need to append the URL to the list:

actual_list = []
for link in url_list:
    response = requests.get(link)
    actual_url = response.url
    actual_list.append(actual_url)

but you are assigning the URL to the actual_list variable, so each iteration overwrites the previous one.
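The same accumulate-in-a-loop pattern can also be written as a list comprehension; sketched here with a stand-in function in place of the live requests.get(link).url call, so it runs offline:

```python
def resolve(link):
    # Stand-in for requests.get(link).url; just tags the link as resolved.
    return link + "/resolved"

url_list = ["https://example.com/a", "https://example.com/b"]

# One expression per input element, collected into a new list.
actual_list = [resolve(link) for link in url_list]
```

With the real call it would be actual_list = [requests.get(link).url for link in url_list].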

json not loads data properly

I'm trying to snip an embedded JSON object from a webpage and pass it to json.loads(). The first URL is okay, but loading the second URL returns this error:
ValueError: Unterminated string starting at: line 1 column 2078 (char 2077)
Here is the code:

import requests, json
from bs4 import BeautifulSoup

urls = ['https://www.autotrader.co.uk/dealers/greater-manchester/manchester/williams-landrover-9994',
        'https://www.autotrader.co.uk/dealers/warwickshire/stratford-upon-avon/guy-salmon-land-rover-stratford-upon-avon-9965'
        ]

for url in urls:
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    scripts = soup.find_all('script')[0]
    data = scripts.text.split("window['AT_APOLLO_STATE'] = ")[1].split(';')[0]
    jdata = json.loads(data)
    print(jdata)
If you print out scripts.text.split("window['AT_APOLLO_STATE'] = ")[1], you will see the passage below, which contains a ; right after "and enthusiastic". So .split(';')[0] cuts the string at that semicolon, and you get an invalid JSON string that ends mid-value at "and enthusiastic":
"strapline":"In our state-of-the-art dealerships across the U.K, Sytner Group
represents the world’s most prestigious car manufacturers.
All of our staff are knowledgeable and enthusiastic; making every interaction
special by going the extra mile.",
The reason has been given. You could also regex out the appropriate string:

import re
import requests, json

urls = ['https://www.autotrader.co.uk/dealers/greater-manchester/manchester/williams-landrover-9994',
        'https://www.autotrader.co.uk/dealers/warwickshire/stratford-upon-avon/guy-salmon-land-rover-stratford-upon-avon-9965'
        ]

p = re.compile(r"window\['AT_APOLLO_STATE'\] =(.*?});", re.DOTALL)

for url in urls:
    r = requests.get(url)
    jdata = json.loads(p.findall(r.text)[0])
    print(jdata)
Missed a } in the original post.
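An alternative that avoids splitting on ';' altogether is json.JSONDecoder.raw_decode, which parses the first complete JSON value and ignores whatever follows it; a sketch with made-up page text:

```python
import json

# Stand-in for the text after the AT_APOLLO_STATE marker: the JSON value
# contains a ';' inside a string and is followed by a trailing ';' and code.
tail = '{"strapline": "knowledgeable and enthusiastic; going the extra mile"}; other code'

# raw_decode returns the parsed object plus the index where parsing stopped,
# so semicolons inside string values can't truncate the data.
jdata, end = json.JSONDecoder().raw_decode(tail)
```

On the real page you would call raw_decode on scripts.text.split("window['AT_APOLLO_STATE'] = ")[1] instead of splitting on ';'.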

Parsing HTML using LXML Python

I'm trying to parse Oxford Dictionary in order to obtain the etymology of a given word.
import lxml.html
from urllib.request import urlopen

class SkipException(Exception):
    def __init__(self, value):
        self.value = value

try:
    doc = lxml.html.parse(urlopen('https://en.oxforddictionaries.com/definition/%s' % "good"))
except SkipException:
    doc = ''

if doc:
    table = []
    trs = doc.xpath("//div[1]/div[2]/div/div/div/div[1]/section[5]/div/p")
I cannot seem to work out how to obtain the string of text I need. I know the lines I have copied are incomplete, but I don't fully understand how HTML or lxml works. I would much appreciate it if someone could show me the correct way to solve this.
You don't want to do web scraping, especially when practically every dictionary has an API. In the case of Oxford, create an account at https://developer.oxforddictionaries.com/, get the API credentials from your account, and do something like this:
import requests
import json

api_base = 'https://od-api.oxforddictionaries.com:443/api/v1/entries/{}/{}'
language = 'en'
word = 'parachute'
headers = {
    'app_id': '',
    'app_key': ''
}

url = api_base.format(language, word)
reply = requests.get(url, headers=headers)
if reply.ok:
    reply_dict = json.loads(reply.text)
    results = reply_dict.get('results')
    if results:
        headword = results[0]
        entries = headword.get('lexicalEntries')[0].get('entries')
        if entries:
            entry = entries[0]
            senses = entry.get('senses')
            if senses:
                sense = senses[0]
                print(sense.get('short_definitions'))
Here's a sample to get you started scraping Oxford dictionary pages:

import lxml.html as lh
from urllib.request import urlopen

url = 'https://en.oxforddictionaries.com/definition/parachute'
html = urlopen(url)
root = lh.parse(html)
body = root.find("body")
elements = body.xpath("//span[@class='ind']")
for element in elements:
    print(element.text)
To find the correct search string you need to format the HTML so you can see the structure. I used the HTML formatter at https://www.freeformatter.com/html-formatter.html. Looking at the formatted HTML, I could see the definitions were in the span elements with the 'ind' class attribute.
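The same attribute-based lookup can be tried offline; a self-contained sketch using the standard library's xml.etree (whose limited XPath support also understands [@class='ind']) on a made-up markup snippet:

```python
import xml.etree.ElementTree as ET

# Made-up markup standing in for a dictionary page.
snippet = """<body>
  <span class="ind">a cloth canopy that fills with air</span>
  <span class="other">unrelated</span>
  <span class="ind">drop by parachute</span>
</body>"""

root = ET.fromstring(snippet)
# findall supports the [@attrib='value'] predicate used above with lxml.
definitions = [e.text for e in root.findall(".//span[@class='ind']")]
```

For real pages lxml is the better tool (full XPath, tolerant HTML parsing); this just shows the selector logic in isolation.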

How to embed "input(id)" into URL in GET request using Python?

I'm just a beginner in Python. Please help me with this problem:
I have API documentation (Server allowed methods):
GET http://view.example.com/candidates/<id> — shows the candidate with the given id. Returns 200.
I wrote this code:
import requests
url = 'http://view.example.com/candidates/4'
r = requests.get(url)
print r
But I want to know how I can supply the candidate id through the input() built-in function instead of hard-coding it into the URL.
Here is my attempt:
import requests
cand_id = input('Please, type id of askable candidate: ')
url = ('http://view.example.com/candidates' + 'cand_id')
r = requests.get(url)
print r
dir(r)
r.content
But it's not working...
You can do this to construct the url:
url = 'http://view.example.com/candidates'
params = { 'cand_id': 4 }
requests.get(url, params=params)
Result: http://view.example.com/candidates?cand_id=4
--
Or if you want to build the same url as you mentioned in your post:
url = 'http://view.example.com/candidates'
cand_id = input("Enter a candidate id: ")
new_url = "{}/{}".format(url, cand_id)
Result: http://view.example.com/candidates/4
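Both URL shapes can be built and inspected offline with the standard library; a sketch using urllib.parse.urlencode (the host is the example one from the question):

```python
from urllib.parse import urlencode

base = 'http://view.example.com/candidates'
cand_id = 4

# Query-string style, like requests.get(url, params=...) would produce.
query_url = '{}?{}'.format(base, urlencode({'cand_id': cand_id}))

# Path style, matching the API documentation in the question.
path_url = '{}/{}'.format(base, cand_id)
```

urlencode also percent-escapes values for you, which plain string concatenation does not.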
You're using the string 'cand_id' instead of the variable cand_id. The string concatenation creates the URL 'http://view.example.com/candidatescand_id'.
