I've created a script in Python to get the first 400 links of search results from Bing. It's not certain that there will always be at least 400 results; in this case the number of results is around 300. There are 10 results on the landing page, and the rest of the results can be found by traversing the next pages. The problem is that when there is no more "next page" link, the webpage displays the last results over and over again.
The search keyword is michael jackson and this is a full-fledged link.
How can I get rid of the loop when there are no more new results or when there are fewer than 400 results?
I've tried with:
import time
import requests
from bs4 import BeautifulSoup

link = "https://www.bing.com/search?"
params = {'q': 'michael jackson', 'first': ''}

def get_bing_results(url):
    q = 1
    while q <= 400:
        params['first'] = q
        res = requests.get(url, params=params, headers={
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36"
        })
        soup = BeautifulSoup(res.text, "lxml")
        for link in soup.select("#b_results h2 > a"):
            print(link.get("href"))
        time.sleep(2)
        q += 10

if __name__ == '__main__':
    get_bing_results(link)
As I mentioned in the comments, couldn't you do something like this:
import time
import requests
from bs4 import BeautifulSoup

link = "https://www.bing.com/search?"
params = {'q': 'michael jackson', 'first': ''}

def get_bing_results(url):
    q = 1
    prev_soup = str()
    while q <= 400:
        params['first'] = q
        res = requests.get(url, params=params, headers={
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36"
        })
        soup = BeautifulSoup(res.text, "lxml")
        if str(soup) != prev_soup:
            for link in soup.select("#b_results h2 > a"):
                print(link.get("href"))
            prev_soup = str(soup)
        else:
            break
        time.sleep(2)
        q += 10

if __name__ == '__main__':
    get_bing_results(link)
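If comparing the whole HTML ever turns out to be too strict (other parts of the page could change between requests even when the results are the same), a variant of the same idea, sketched below, compares only the extracted result links:

def get_bing_results(url):
    q = 1
    prev_links = None
    while q <= 400:
        params['first'] = q
        res = requests.get(url, params=params, headers={
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36"
        })
        soup = BeautifulSoup(res.text, "lxml")
        links = [a.get("href") for a in soup.select("#b_results h2 > a")]
        if not links or links == prev_links:
            break  # nothing new: either no results at all or the last page repeated
        for href in links:
            print(href)
        prev_links = links
        time.sleep(2)
        q += 10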
I'm scraping the activities to do in Paris from TripAdvisor (https://www.tripadvisor.it/Attractions-g187147-Activities-c42-Paris_Ile_de_France.html).
The code that I've written works well, but I still haven't found a way to obtain the rating of each activity. The rating on TripAdvisor is represented by 5 circles, and I need to know how many of those circles are filled in.
I obtain nothing in the "rating" field.
Here is the code:
wd = webdriver.Chrome('chromedriver', chrome_options=chrome_options)
wd.get("https://www.tripadvisor.it/Attractions-g187147-Activities-c42-Paris_Ile_de_France.html")

import pprint

detail_tours = []
for tour in list_tours:
    url = tour.find_elements_by_css_selector("a")[0].get_attribute("href")
    title = ""
    reviews = ""
    rating = ""
    if(len(tour.find_elements_by_css_selector("._1gpq3zsA._1zP41Z7X")) > 0):
        title = tour.find_elements_by_css_selector("._1gpq3zsA._1zP41Z7X")[0].text
    if(len(tour.find_elements_by_css_selector("._7c6GgQ6n._22upaSQN._37QDe3gr.WullykOU._3WoyIIcL")) > 0):
        reviews = tour.find_elements_by_css_selector("._7c6GgQ6n._22upaSQN._37QDe3gr.WullykOU._3WoyIIcL")[0].text
    if(len(tour.find_elements_by_css_selector(".zWXXYhVR")) > 0):
        rating = tour.find_elements_by_css_selector(".zWXXYhVR")[0].text
    detail_tours.append({'url': url,
                         'title': title,
                         'reviews': reviews,
                         'rating': rating})
I would use BeautifulSoup in a way similar to the suggested code. (I would also recommend studying the structure of the HTML, but given the original code I don't think that's necessary.)
import requests
from bs4 import BeautifulSoup
import re

header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36"}
resp = requests.get('https://www.tripadvisor.it/Attractions-g187147-Activities-c42-Paris_Ile_de_France.html', headers=header)
if resp.status_code == 200:
    soup = BeautifulSoup(resp.text, 'lxml')
    cards = soup.find_all('div', {'data-automation': 'cardWrapper'})
    for card in cards:
        rating = card.find('svg', {'class': 'zWXXYhVR'})
        match = re.match('Punteggio ([0-9,]+)', rating.attrs['aria-label'])[1]
        print(float(match.replace(',', '.')))
And a small bonus: the part of the link preceded by oa (in the example below, oa60) indicates the starting offset, which runs in 30-result increments. So if you want to change pages, you can change your link to include oa30, oa60, oa90, etc.: https://www.tripadvisor.it/Attractions-g187147-Activities-c42-oa60-Paris_Ile_de_France.html
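For illustration, a small sketch of how that offset could be used to build the page links (the 30-result step and the URL pattern come from the note above; the range limit is just an example):

# Build the first few listing-page URLs using the "oa" offset (30 results per page).
base = "https://www.tripadvisor.it/Attractions-g187147-Activities-c42-{}Paris_Ile_de_France.html"

for offset in range(0, 120, 30):
    part = "" if offset == 0 else "oa{}-".format(offset)  # the first page has no "oa" part
    print(base.format(part))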
I've written a script to parse all the links recursively, down to the dead end, from the left-side panel under Any Department. As the selectors are identical throughout all the depths, I thought the script would be able to get all the links, but when I run it I notice that it only parses a few links superficially.
webpage address
I've tried so far with:
import requests
from bs4 import BeautifulSoup

link = 'https://www.amazon.de/-/en/gp/bestsellers/digital-text/ref=zg_bs_nav_0'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'
    while True:
        r = s.get(link)
        soup = BeautifulSoup(r.text, "lxml")
        if not soup.select("li:has(> span.zg_selected) + ul > li > a[href]"): break
        for item in soup.select("li:has(> span.zg_selected) + ul > li > a[href]"):
            link = item.get("href")
            print(link)
Current output:
https://www.amazon.de/-/en/gp/bestsellers/digital-text/530886031/ref=zg_bs_nav_kinc_1_kinc/261-4895013-9879242
https://www.amazon.de/-/en/gp/bestsellers/digital-text/530887031/ref=zg_bs_nav_kinc_1_kinc/261-4895013-9879242
https://www.amazon.de/-/en/gp/bestsellers/digital-text/4824719031/ref=zg_bs_nav_kinc_1_kinc/261-4895013-9879242
https://www.amazon.de/-/en/gp/bestsellers/digital-text/13065180031/ref=zg_bs_nav_kinc_1_kinc/261-4895013-9879242
https://www.amazon.de/-/en/gp/bestsellers/digital-text/13065217031/ref=zg_bs_nav_kinc_2_13065180031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/13065201031/ref=zg_bs_nav_kinc_2_13065180031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/13065326031/ref=zg_bs_nav_kinc_2_13065180031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/13065236031/ref=zg_bs_nav_kinc_2_13065180031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/13065329031/ref=zg_bs_nav_kinc_2_13065180031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/13065181031/ref=zg_bs_nav_kinc_2_13065180031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/13065237031/ref=zg_bs_nav_kinc_2_13065180031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/13065312031/ref=zg_bs_nav_kinc_2_13065180031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/13065242031/ref=zg_bs_nav_kinc_2_13065180031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/13065244031/ref=zg_bs_nav_kinc_3_13065242031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/13065259031/ref=zg_bs_nav_kinc_3_13065242031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/13065243031/ref=zg_bs_nav_kinc_3_13065242031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/13065287031/ref=zg_bs_nav_kinc_3_13065242031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/13065282031/ref=zg_bs_nav_kinc_3_13065242031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/13065290031/ref=zg_bs_nav_kinc_3_13065242031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/13065295031/ref=zg_bs_nav_kinc_3_13065242031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/13065294031/ref=zg_bs_nav_kinc_3_13065242031
I can see a lot more links in there at different depths. To be clearer, I'm interested in all the links at every depth from the block in the image attached.
How can I get all the links from the left-side panel recursively?
Actually, your code is not recursive, and an actual recursive piece of code would be one way of obtaining what you want. In the following code I have taken the added precaution of adding a set to which all seen links are added. I was not sure whether a link could appear more than once and lead to an infinite loop, so I check whether a link has already been processed before processing it again:
import requests
from bs4 import BeautifulSoup

def find_links(session, link, indent, seen_set):
    print(' ' * indent * 4, link, sep='')
    r = session.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    for item in soup.select("li:has(> span.zg_selected) + ul > li > a[href]"):
        link = item.get("href")
        if link not in seen_set:
            seen_set.add(link)
            find_links(session, link, indent + 1, seen_set)

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'
    link = 'https://www.amazon.de/-/en/gp/bestsellers/digital-text/ref=zg_bs_nav_0'
    find_links(s, link, 0, set([link]))
Update
The code has been modified to use your generator (with my indentation support, which can be ignored) and threading to speed up the processing a bit. There is a slight change to the logic: when the links on a page are read in, the pages for those links are retrieved concurrently before recursing to process them. It is only a change in when a page is retrieved. There also seems to be a far greater number of links today: 4392.
from concurrent.futures import ThreadPoolExecutor
import requests
from bs4 import BeautifulSoup
import functools

def get_page(s, link):
    resp = s.get(link)
    return resp.text

def get_links(executor, s, link, page, indent, seen_set):
    yield link, indent
    soup = BeautifulSoup(page, "lxml")
    links = [item.get("href") for item in soup.select("li:has(> span.zg_selected) + ul > li > a[href]") if item.get("href") not in seen_set]
    pages = executor.map(functools.partial(get_page, s), links)
    for i, page in enumerate(pages):
        link = links[i]
        seen_set.add(link)
        yield from get_links(executor, s, link, page, indent + 1, seen_set)

def main():
    with requests.Session() as s:
        s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'
        with ThreadPoolExecutor(max_workers=30) as executor:
            link = 'https://www.amazon.de/-/en/gp/bestsellers/digital-text/ref=zg_bs_nav_0'
            page = get_page(s, link)
            for elem, indent in get_links(executor, s, link, page, 0, set([link])):
                print(' ' * 4 * indent, elem, sep='')

if __name__ == '__main__':
    main()
Prints (I can only show a portion since I have exceeded the maximum post length -- there are 316 links):
https://www.amazon.de/-/en/gp/bestsellers/digital-text/ref=zg_bs_nav_0
https://www.amazon.de/-/en/gp/bestsellers/digital-text/530886031/ref=zg_bs_nav_kinc_1_kinc/257-8192800-2632148
https://www.amazon.de/-/en/gp/bestsellers/digital-text/567111031/ref=zg_bs_nav_kinc_2_530886031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/6692008031/ref=zg_bs_nav_kinc_3_567111031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/6692009031/ref=zg_bs_nav_kinc_4_6692008031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/6692012031/ref=zg_bs_nav_kinc_4_6692008031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/6692020031/ref=zg_bs_nav_kinc_4_6692008031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/610657031/ref=zg_bs_nav_kinc_3_567111031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/610653031/ref=zg_bs_nav_kinc_3_567111031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/610651031/ref=zg_bs_nav_kinc_3_567111031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/6692040031/ref=zg_bs_nav_kinc_4_610651031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/6692032031/ref=zg_bs_nav_kinc_4_610651031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/6692046031/ref=zg_bs_nav_kinc_4_610651031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/6692048031/ref=zg_bs_nav_kinc_4_610651031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/6692049031/ref=zg_bs_nav_kinc_3_567111031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/610670031/ref=zg_bs_nav_kinc_4_6692049031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/6692050031/ref=zg_bs_nav_kinc_4_6692049031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/610668031/ref=zg_bs_nav_kinc_4_6692049031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/6692063031/ref=zg_bs_nav_kinc_3_567111031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/6692064031/ref=zg_bs_nav_kinc_4_6692063031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/610662031/ref=zg_bs_nav_kinc_4_6692063031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/6692065031/ref=zg_bs_nav_kinc_4_6692063031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/6692066031/ref=zg_bs_nav_kinc_4_6692063031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/6692067031/ref=zg_bs_nav_kinc_4_6692063031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/6692068031/ref=zg_bs_nav_kinc_5_6692067031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/6692069031/ref=zg_bs_nav_kinc_5_6692067031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/6692070031/ref=zg_bs_nav_kinc_4_6692063031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/6692071031/ref=zg_bs_nav_kinc_4_6692063031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/6692073031/ref=zg_bs_nav_kinc_4_6692063031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/6692074031/ref=zg_bs_nav_kinc_4_6692063031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/6692075031/ref=zg_bs_nav_kinc_4_6692063031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/6692076031/ref=zg_bs_nav_kinc_4_6692063031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/6692077031/ref=zg_bs_nav_kinc_4_6692063031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/16008439031/ref=zg_bs_nav_kinc_4_6692063031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/6692078031/ref=zg_bs_nav_kinc_4_6692063031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/6692079031/ref=zg_bs_nav_kinc_4_6692063031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/6692080031/ref=zg_bs_nav_kinc_4_6692063031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/6692081031/ref=zg_bs_nav_kinc_4_6692063031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/6692082031/ref=zg_bs_nav_kinc_4_6692063031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/6692083031/ref=zg_bs_nav_kinc_4_6692063031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/610655031/ref=zg_bs_nav_kinc_3_567111031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/6692095031/ref=zg_bs_nav_kinc_4_610655031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/6692099031/ref=zg_bs_nav_kinc_4_610655031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/6692101031/ref=zg_bs_nav_kinc_4_610655031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/6692103031/ref=zg_bs_nav_kinc_4_610655031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/6692108031/ref=zg_bs_nav_kinc_4_610655031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/6692113031/ref=zg_bs_nav_kinc_4_610655031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/567123031/ref=zg_bs_nav_kinc_3_567111031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/6692341031/ref=zg_bs_nav_kinc_4_567123031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/6692345031/ref=zg_bs_nav_kinc_4_567123031
https://www.amazon.de/-/en/gp/bestsellers/digital-text/6692348031/ref=zg_bs_nav_kinc_4_567123031
https://www.amazon.de/-/en/gp/bestsellers/digital-
I am trying to gather some information about some books available on Amazon, and I am hitting a weird glitch that I can't understand. At first I thought Amazon was blocking my connection, but then I noticed the request came back "200 OK" and contained the real HTML of the corresponding page.
Let's take for example this book: https://www.amazon.co.uk/All-Rage-Cara-Hunter/dp/0241985110
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
url = 'https://www.amazon.co.uk/All-Rage-Cara-Hunter/dp/0241985110/ref=sr_1_1?crid=2PPCQEJD706VY&dchild=1&keywords=books+bestsellers+2020+paperback&qid=1598132071&sprefix=book%2Caps%2C234&sr=8-1'
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, features="lxml")

price = {}
if soup.select("#buyBoxInner > ul > li > span > .a-text-strike") != []:
    price["regular_price"] = float(
        soup.select("#buyBoxInner > ul > li > span > .a-text-strike")[0].string[1:].replace(",", "."))
    price["promo_price"] = float(soup.select(".offer-price")[0].string[1:].replace(",", "."))
else:
    price["regular_price"] = float(soup.select(".offer-price")[0].string[1:].replace(",", "."))
price["currency"] = soup.select(".offer-price")[0].string[0]
This part works fine and I can get the regular price, a promo price (if one exists), and even the currency. But when I do this:
isbn = soup.select("td.bucket > .content > ul > li")[4].contents[1].string.strip().replace("-", "")
I get "IndexError: list index out of range". But if I debug the code, the content is actually there!
Is this a bug in BeautifulSoup? Is the request response too long?
It seems that Amazon returns two versions of the page: one that has a <td class="bucket"> element and one that uses several <span> tags instead. This script tries to extract the ISBN from both of them:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
url = 'https://www.amazon.co.uk/All-Rage-Cara-Hunter/dp/0241985110'
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, features="lxml")
isbn_10 = soup.select_one('span.a-text-bold:contains("ISBN-10"), b:contains("ISBN-10")').find_parent().text
isbn_13 = soup.select_one('span.a-text-bold:contains("ISBN-13"), b:contains("ISBN-13")').find_parent().text
print(isbn_10.split(':')[-1].strip())
print(isbn_13.split(':')[-1].strip())
Prints:
0241985110
978-0241985113
I wish I had an explanation of the problem, but one solution would be to wrap your code in a function like so:
def scrape():
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
    url = 'https://www.amazon.co.uk/All-Rage-Cara-Hunter/dp/0241985110/ref=sr_1_1?crid=2PPCQEJD706VY&dchild=1&keywords=books+bestsellers+2020+paperback&qid=1598132071&sprefix=book%2Caps%2C234&sr=8-1'
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, features="lxml")

    price = {}
    if soup.select("#buyBoxInner > ul > li > span > .a-text-strike") != []:
        price["regular_price"] = float(
            soup.select("#buyBoxInner > ul > li > span > .a-text-strike")[0].string[1:].replace(",", "."))
        price["promo_price"] = float(soup.select(".offer-price")[0].string[1:].replace(",", "."))
    else:
        price["regular_price"] = float(soup.select(".offer-price")[0].string[1:].replace(",", "."))
    price["currency"] = soup.select(".offer-price")[0].string[0]

    # ADD THIS FEATURE TO YOUR CODE
    isbn = soup.select("td.bucket > .content > ul > li")
    if not isbn:
        return scrape()  # the other page variant came back; try again
    isbn = isbn[4].contents[1].string.strip().replace("-", "")
Then if it fails, it will just call itself again. You might want to refactor it so it only makes the request once, as in the sketch below.
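If the recursion bothers you, the same idea can be expressed as a plain retry loop; a sketch, reusing the selector and headers from the code above (the function name scrape_isbn and the retry limit are made up for illustration):

import requests
from bs4 import BeautifulSoup

def scrape_isbn(url, headers, max_retries=5):
    # Retry the request until the "td.bucket" variant of the page comes back.
    for _ in range(max_retries):
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.content, features="lxml")
        items = soup.select("td.bucket > .content > ul > li")
        if len(items) > 4:
            return items[4].contents[1].string.strip().replace("-", "")
    return None  # gave up: the expected markup never appeared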
I am trying to scrape the BookMyShow website to find out movie details such as the times at which tickets are available and how many seats are available. I have figured out how to get the show timings in which seats are available, but now I want to get the total seats available in each show. My code is:
import requests
from bs4 import BeautifulSoup
import json

base_url = "https://in.bookmyshow.com"
s = requests.session()
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"}
r = s.get("https://in.bookmyshow.com/vizag/movies", headers=headers)
print(r.status_code)
soup = BeautifulSoup(r.text, "html.parser")
movies_list = soup.find("div", {"class": "__col-now-showing"})
movies = movies_list.findAll("a", {"class": "__movie-name"})
for movie in movies:
    print(movie.text)

show = []
containers = movies_list.findAll("div", {"class": "card-container"})
for container in containers:
    try:
        detail = container.find("div", {"class": "__name overflowEllipses"})
        button = container.find("div", {"class": "book-button"})
        print(detail.text)
        print(button.a["href"])
        url_ticket = base_url + button.a["href"]
        show.append(url_ticket)
    except:
        pass

for i in show:
    print(i)

for t in show:
    res = s.get(t, headers=headers)
    bs = BeautifulSoup(res.text, "html.parser")
    movie_name = bs.find("div", {"class": "cinema-name-wrapper"})
    print(movie_name.text.replace(" ", "").replace("\t", "").replace("\n", ""))
    venue_list = bs.find("ul", {"id": "venuelist"})
    venue_names = venue_list.findAll("li", {"class": "list"})
    try:
        for i in venue_names:
            vn = i.find("div", {"class": "__name"})
            print(vn.text.replace(" ", "").replace("\t", "").replace("\n", ""))
            show_times = i.findAll("div", {"data-online": "Y"})
            for st in show_times:
                print(st.text.replace(" ", "").replace("\t", "").replace("\n", ""))
    except:
        pass
    print("\n")

heads = {
    "accept": "*/*",
    "accept-encoding": "gzip, deflate, br",
    "accept-language": "en-US,en;q=0.9",
    "origin": "https://in.bookmyshow.com",
    "referer": "https://in.bookmyshow.com/buytickets/chalo-vizag/movie-viza-ET00064364-MT/20180204",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
}
rr = s.post("https://b-eu.simility.com/b?c=bookmyshow&v=1.905&ec=BLOFaZ2HdToCxwcr&cl=0&si=5a76bfce6ae4a00027767ae9&sc=3B0CB9F4-4A27-4588-9FB4-A2A2760569BC&uc=D834EDA4-57E4-4889-A34F-473AC6BBDDBB&e=Seatlayout&cd=.simility.com&r=0&st=1517731803171&s=792a6c66313a2032223133302633343a2c393a322e3c202422636e312a382037633f3c606669673e61653e6338323230353f3c35616f3b2a2c2269663a203820606765696d7371606f77282e2a61663320327e70756f2e2a63643e20326c776e6e242861643f20326e75666e24206166342a306c75666e2422636e352a386c776e64262073692032223348324b403b4436253e43323d2f3c3538322f314440362f493843323d3438353633404b202e20776b2838224e3a3b34454e433c2f3735473c273638323b2541333e4425363531434b3c40424e464a422226206a66303120326c636c79672422626e303a203864636479672c28716c32342838253131322e2a7966323f203231353b353f31333a323b3b353326207b643428382a32202e207b6e302230767a756526207b663420382a6f6c2d5f512a2c2279663f203859206d642f5559202422656420552e2071663028383026207b6431392032204f6d7861666e6125372630202255616c666d757b2a4c542a33382e3031225f6b6c3436332a7a363e2b2841707a6e6d55676049617e2d3539352633362a2a434a564f4e242a6e6961672847656969672b22416a7a656f6525343b2e3024313a313b2c333b3822536b6469726925373b352c31342a2620736e3338223a2855616c313020242871643b362a3a224d6d67656e67224164612e282e2a73643b342a383a3036242871643b352a3a313f313e2e2071663932203a32343c2c227966393b2038333d39342c28716c323028383a362e20716c38332230303c2c22686639362038767a7f672c28606c313628383b2e206066393d203a282f3a30303f363c353a3a332a2620626e3330223a282024207565332a3076727f672422776d302a385920756d68656c282e2a65787a677a6b6f676c7c6b6e2d7d676a676c285f24207565342a3020576f60436974282e2a756535203228556568496174205d676a454e202e2a7d65323d203274727f6724207565312a30202d3b333c3833323a31333a202e2a7a66312838535b226b72786e6b61637c636d6e257a25676f656564672f616a7a656f6527726c66222620616c766770666b6e2d7a666e2d7663677f6770202e2a496a72656f6d20504e4428526e77656164202c6477646c5d26592a6372726e61696374636d662f706e642a2e206f6a626c606d6e656b666a68607863676d68676c6d6865676e67696f6a62636b202e2a496a72656f6d20504e4428546b67756d78202c6477646c5d26592a6372726e61696374636d662f78276c69616e2e63787a6e6969637c696f642d702f726c636b66202c286b667465786c696e2f6c636b662f7066776f696e282e2a4c63766b7e6f2243666b6d6e74282e66776e6e5f245120617a726469636b76616d6c2d7a257a72617a6b2577696e677e6b6c672f6b6e6f2226207f69646f74616c676166656b66617a766d722e6e6e64202e2055616e6776636c6d2043656c7c676c76224c6f617273727c696f6422456d66776e6d282e223b2c3c2e38243338303b205f5577",headers =heads) # i got the link while i was inspecting the booking tickets page
f = s.get("https://in.bookmyshow.com/buytickets/chalo-vizag/movie-viza-ET00064364-MT/20180204#!seatlayout") # this is the page gets displayed when we click the show time
ff = f.text
j = json.loads(ff)
print(j)
Once I get the source code of that page I can extract the seat availability easily, but I am unable to get the page itself. How can I do this? Thanks in advance!
Steps:
1) Use selenium to click on the show-time block:
driver.find_element_by_xpath('<enter xpath>').click()
To find the xpath, inspect the element in the browser, right-click it, and choose "Copy XPath".
time.sleep(4)  # wait for 4 seconds for the page to appear
2) Get the HTML source code using:
html = driver.page_source
then use BeautifulSoup to scrape the page:
soup = BeautifulSoup(html, 'html.parser')
Find all a href tags having class='_available' and count them, then find all a href tags having class='_blocked' and count them.
Using these counts you can work out the total number of seats and the number of available seats, as in the sketch below.
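A rough sketch of that counting step, assuming the seat-layout page really marks seats with anchors carrying the _available and _blocked classes (the class names come from the step above; everything else is an assumption):

from bs4 import BeautifulSoup

html = driver.page_source  # the source obtained after clicking the show time (see step 2)
soup = BeautifulSoup(html, 'html.parser')

available = soup.find_all('a', class_='_available')  # seats that can still be booked
blocked = soup.find_all('a', class_='_blocked')      # seats already taken

print('available seats:', len(available))
print('blocked seats:', len(blocked))
print('total seats:', len(available) + len(blocked))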
I have the code ready for one keyword and it's working fine. The next problem is that I want to run the scrape for 10 different keywords and save the results in one CSV file with the keyword name in a column/row. I think we could give a CSV file as input so the script picks the keywords one by one and scrapes each. Here is the code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

base_url = "http://www.amazon.in/s/ref=sr_pg_2?rh=n%3A4772060031%2Ck%3Ahelmets+for+men&keywords=helmets+for+men&ie=UTF8"
# excluding page from base_url for further adding

res = []
for page in range(1, 3):
    request = requests.get(base_url + '&page=' + str(page), headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'})  # here adding page
    if request.status_code == 404:  # added just in case of error
        break
    soup = BeautifulSoup(request.content, "lxml")
    for url in soup.find_all('li', class_='s-result-item'):
        res.append([url.get('data-asin'), url.get('id')])

df = pd.DataFrame(data=res, columns=['Asin', 'Result'])
df.to_csv('hel.csv')
I made up some sample keywords; replace them with the ones you need.
import requests
from bs4 import BeautifulSoup
import pandas as pd

base_url = "http://www.amazon.in/s/ref=sr_pg_2?rh=n%3A4772060031%2Ck%3Ahelmets+for+men&ie=UTF8"
keywords_list = ['helmets for men', 'helmets for women']
keyword = 'helmets for men'
# excluding page from base_url for further adding

res = []
for page in range(1, 3):
    for keyword in keywords_list:
        request = requests.get(base_url + '&keywords=' + requests.utils.quote(keyword) + '&page=' + str(page), headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'})  # here adding page and keyword
        if request.status_code == 404:  # added just in case of error
            break
        soup = BeautifulSoup(request.content, "lxml")
        for url in soup.find_all('li', class_='s-result-item'):
            res.append([url.get('data-asin'), url.get('id'), keyword])

df = pd.DataFrame(data=res, columns=['Asin', 'Result', 'keyword'])
df.to_csv('hel.csv')
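If you would rather feed the keywords in from a CSV file, as suggested in the question, this is one way it might look (the file name keywords.csv and its single keyword column are assumptions):

import pandas as pd

# Assumed input: keywords.csv with one column named "keyword".
keywords_list = pd.read_csv('keywords.csv')['keyword'].dropna().tolist()
print(keywords_list)  # e.g. ['helmets for men', 'helmets for women']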