I can't get this for loop to run and pick up the listing of items. It just prints nothing at all and skips the whole loop.
import requests
import re
from bs4 import BeautifulSoup

maxPages = 10
keyword = "ps4"
costMax = 0
costMin = 0

def tradeSpiderGS(maxPages):
    page = 1
    while page <= maxPages:
        print(page)
        # creating url for soup
        if page <= 1:
            url = 'https://www.gamestop.com/browse?nav=16k-3-' + keyword + ',28zu0'
        else:
            url = 'https://www.gamestop.com/browse?nav=16k-3-' + keyword + ',2b' + str(page * 12) + ',28zu0'
        # creating soup object
        srcCode = requests.get(url)
        plainTxt = srcCode.text
        soup = BeautifulSoup(plainTxt, "html.parser")
        # this for loop is not being read; supposed to grab links on gs website
        for links in soup.find_all('a', {'class': 'ats-product-title-lnk'}):
            href = links.get('href')
            trueHref = 'https://www.gamestop.com/' + href
            print(trueHref)
        page += 1

tradeSpiderGS(maxPages)
Why Doesn't the Loop Run?
The loop doesn't run because soup.find_all('a', {'class': 'ats-product-title-lnk'}) is [] (there are no a elements with that class in the response).
The reason there are no a elements with that class is that GameStop doesn't let you access the /browse pages unless you've been to a normal page first. You can confirm this by opening one of the URLs in a web browser in incognito mode.
Workarounds:
You can use a different scraping mechanism like Selenium in Python to work around this. You might also be able to copy headers from a web browser request into the requests.get call, although I wasn't able to get this to work.
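As a sketch of the header-copying idea, the values below are assumptions copied from a typical browser session, not headers GameStop is known to require; paste your own from the devtools Network tab:

```python
import requests

# Hypothetical headers copied from a browser's devtools Network tab;
# which ones (if any) satisfy the site is an assumption, so paste your own.
BROWSER_HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0.0.0 Safari/537.36"),
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
}

def fetch_like_a_browser(url):
    """GET a page while presenting browser-like headers."""
    return requests.get(url, headers=BROWSER_HEADERS, timeout=10)

# usage (real network call, so left commented out):
# resp = fetch_like_a_browser('https://www.gamestop.com/browse?nav=16k-3-ps4,28zu0')
# print(resp.status_code)
```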
Related
I've written a function to try and get the names of authors and their respective links from a sandbox website (https://quotes.toscrape.com/), which should move onto the next page when all have been covered.
It works for the first two pages but fails when moving onto the third with the error 'NoneType' object has no attribute 'find_all'.
Why would it break at the start of a new page when it has already successfully moved between pages?
Here's the function:
def AuthorLink(url):
    a = 0
    url = url
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    divContainer = soup.find("div", class_="container")
    divRow = divContainer.find_all("div", class_="row")
    for result in divRow:
        divQuotes = result.find_all("div", class_="quote")
        for quotes in divQuotes:
            for el in quotes.find_all("small", class_="author"):
                print(el.get_text())
            for link in quotes.find_all("a"):
                if link['href'][1:7] == "author":
                    print(url + link['href'])
    a += 1
    print("Page:", a)
    nav = soup.find("li", class_="next")
    nextPage = nav.find("a")
    AuthorLink(url + nextPage['href'])
Here's the code that it broke on:
5 soup = BeautifulSoup(page.content, "html.parser")
6 divContainer = soup.find("div", class_="container")
----> 7 divRow = divContainer.find_all("div", class_= "row")
I don't see why this is happening if it ran for the first two pages successfully.
I've checked the structure of the website and it seems little has changed from each page.
I've also tried to change the code so that instead of using the link from "Next" at the bottom of the page, it just adds the number of the next page to the URL but this doesn't work either.
You are facing this error because each new request URL is appended to the previous URL instead of the site root.
The url value over the iterations is:
"https://quotes.toscrape.com/", which works;
"https://quotes.toscrape.com/page/2/", which also works;
"https://quotes.toscrape.com/page/2//page/3/", which the website can't serve, so it doesn't work.
The exact solution could look different, but here is your code with a couple of small changes.
import requests
from bs4 import BeautifulSoup

base_url = "https://quotes.toscrape.com"

def AuthorLink(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    divContainer = soup.find("div", class_="container")
    divRow = divContainer.find_all("div", class_="row")[1]
    divQuotes = divRow.find_all("div", class_="quote")
    for quotes in divQuotes:
        for el in quotes.find_all("small", class_="author"):
            print(el.get_text())
        for link in quotes.find_all("a"):
            if link['href'][1:7] == "author":
                print(base_url + link['href'])

for i in range(1, 5):
    AuthorLink(f"{base_url}/page/{i}")
I defined a new base_url to store the actual website link. The next page is "/page/[i]", which means we can use a for loop to generate i = 1, 2, 3, .... The other change is print(base_url + link['href']), where you had used url instead of base_url; that again leads to the same URL-stacking problem described above.
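A standard-library alternative to manual string concatenation is urllib.parse.urljoin, which resolves an href against the page it was found on and so avoids the stacking problem entirely:

```python
from urllib.parse import urljoin

base = "https://quotes.toscrape.com/page/2/"

# A root-relative href resolves against the site root, not the current path,
# so repeated joins never stack "/page/2//page/3/" the way naive "+" does.
print(urljoin(base, "/page/3/"))                 # https://quotes.toscrape.com/page/3/
print(urljoin(base, "/author/Albert-Einstein"))  # https://quotes.toscrape.com/author/Albert-Einstein
```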
I have around 900 pages and each page contains 10 buttons (each button links to a PDF). I want to download all the PDFs: the program should browse through all the pages and download the PDFs one by one.
The code below only searches for hrefs ending in .pdf, but my hrefs don't contain .pdf; the pages are addressed by page_no (1 to 900).
https://bidplus.gem.gov.in/bidlists?bidlists&page_no=3
That is the website, and below is an example of a bid link:
BID NO: GEM/2021/B/1804626
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://bidplus.gem.gov.in/bidlists"

# If there is no such folder, the script will create one automatically
folder_location = r'C:\webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
for link in soup.select("a[href$='.pdf']"):
    # Name the pdf files using the last portion of each link, which is unique in this case
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)
You only need the href attributes associated with the links you call buttons, then prefix them with the appropriate protocol + domain.
The links can be matched with the following selector:
.bid_no > a
That is, anchor (a) tags whose direct parent element has class bid_no.
This should pick up 10 links per page. As you will need a file name for each download, I suggest having a global dict in which you store the links as values and the link text as keys. I would replace the "/" in the link descriptions with "_". You simply add to this dict during your loop over the desired number of pages.
As there are over 800 pages I have chosen to add in an additional termination page count variable called end_number. I don't want to loop to all pages so this allows me an early exit. You can remove this param if so desired.
Next, you need to determine the actual number of pages. For this you can use the following css selector to get the Last pagination link and then extract its data-ci-pagination-page value and convert to integer. This can then be the num_pages (number of pages) to terminate your loop at:
.pagination li:last-of-type > a
That looks for an a tag which is a direct child of the last li element, where those li elements have a shared parent with class pagination i.e. the anchor tag in the last li, which is the last page link in the pagination element.
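To illustrate, here is that selector run against a minimal snippet of markup (the snippet mimics the structure described above; the real page has more attributes):

```python
from bs4 import BeautifulSoup

# Minimal stand-in for the site's pagination block; the exact markup is
# an assumption based on the selector discussed above.
html = '''
<ul class="pagination">
  <li><a data-ci-pagination-page="2" href="?page_no=2">2</a></li>
  <li><a data-ci-pagination-page="836" href="?page_no=836">Last</a></li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')
# the anchor inside the last li is the "Last" pagination link
last = soup.select_one('.pagination li:last-of-type > a')
num_pages = int(last['data-ci-pagination-page'])
print(num_pages)  # → 836
```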
Once you have all your desired links and file suffixes (the description text for the links) in your dictionary, loop the key, value pairs and issue requests for the content. Write that content out to disk.
TODO:
I would suggest you look at ways of optimizing the final issuing of requests and writing out to disk. For example, you could first issue all requests asynchronously and store the responses in a dictionary to optimize what is an I/O-bound process, then loop over that dictionary writing to disk, perhaps with a multi-processing approach to optimize the more CPU-bound part.
I would additionally consider whether some sort of wait should be introduced between requests, or whether requests should be batched. You could theoretically currently be making something like (836 * 10) + 836 requests.
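As a sketch of the batching idea (download_all, batched and their parameters are hypothetical helpers, not part of the answer's code), a thread pool covers the I/O-bound fetching while a pause between batches keeps the request rate down:

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

def batched(items, size):
    """Yield successive chunks of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def download_all(pdf_links, path, batch_size=10, pause=1.0):
    """Fetch the dict of name -> url in batches and write each PDF to disk."""
    with requests.Session() as s, ThreadPoolExecutor(max_workers=batch_size) as pool:
        for batch in batched(list(pdf_links.items()), batch_size):
            # issue the whole batch concurrently (I/O-bound)
            futures = [pool.submit(s.get, url) for _, url in batch]
            for (name, _), fut in zip(batch, futures):
                with open(f'{path}/{name}.pdf', 'wb') as f:
                    f.write(fut.result().content)
            time.sleep(pause)  # polite gap between batches
```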
import requests
from bs4 import BeautifulSoup as bs

end_number = 3
current_page = 1
pdf_links = {}
path = '<your path>'

with requests.Session() as s:
    while True:
        r = s.get(f'https://bidplus.gem.gov.in/bidlists?bidlists&page_no={current_page}')
        soup = bs(r.content, 'lxml')
        for i in soup.select('.bid_no > a'):
            pdf_links[i.text.strip().replace('/', '_')] = 'https://bidplus.gem.gov.in' + i['href']
        # print(pdf_links)
        if current_page == 1:
            num_pages = int(soup.select_one('.pagination li:last-of-type > a')['data-ci-pagination-page'])
            print(num_pages)
        if current_page == num_pages or current_page > end_number:
            break
        current_page += 1

    for k, v in pdf_links.items():
        with open(f'{path}/{k}.pdf', 'wb') as f:
            r = s.get(v)
            f.write(r.content)
Your site doesn't work for 90% of people, but you provided examples of the HTML, so I hope this will help you:
url = 'https://bidplus.gem.gov.in/bidlists'
response = requests.get(url)
soup = BeautifulSoup(response.text, features='lxml')
for bid_no in soup.find_all('p', class_='bid_no pull-left'):
    for pdf in bid_no.find_all('a'):
        with open('pdf_name_here.pdf', 'wb') as f:
            # if you have a full link
            href = pdf.get('href')
            # if you have a link without the full path, like /showbidDocument/2993132
            # href = url + pdf.get('href')
            response = requests.get(href)
            f.write(response.content)
Hello, I am new to Python and practicing web scraping with some demo sites.
I am trying to scrape this website http://books.toscrape.com/ and want to extract:
href
name/title
star rating/star-rating
price/price_color
in-stock availability/instock availability
I have written some basic code which goes down to each book level, but after that I am clueless about how to extract that information.
import requests
from csv import reader, writer
from bs4 import BeautifulSoup

base_url = "http://books.toscrape.com/"
r = requests.get(base_url)
htmlContent = r.content
soup = BeautifulSoup(htmlContent, 'html.parser')
for article in soup.find_all('article'):
    pass  # stuck here: how do I extract the fields?
This will find you the href and name for every book. You could also extract some other information if you want.
import requests
from csv import reader, writer
from bs4 import BeautifulSoup

base_url = "http://books.toscrape.com/"
r = requests.get(base_url)
soup = BeautifulSoup(r.content, 'html.parser')

def extract_info(soup):
    href = []
    for a in soup.find_all('a', href=True):
        if a.text:
            if "catalogue" in a["href"]:
                href.append(a['href'])
    name = []
    for a in soup.find_all('a', title=True):
        name.append(a.text)
    return href, name

href, name = extract_info(soup)
print(href[0], name[0])
The output will be the href and name of the first book.
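Since the question already imports csv's writer, the two lists can then be written out to a file like this (books.csv and the sample rows are made up for illustration):

```python
import csv

# Sample data standing in for the lists returned by extract_info above.
href = ['catalogue/a-light-in-the-attic_1000/index.html']
name = ['A Light in the Attic']

with open('books.csv', 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(['href', 'name'])   # header row
    for h, n in zip(href, name):   # one row per book
        w.writerow([h, n])
```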
Try the below approach using Python requests and BeautifulSoup. I fetched the page URL from the website itself after inspecting the Network section > Doc tab in the Google Chrome browser.
What exactly the below script does:
First it builds the page URL from a page-number parameter and issues a GET request.
The URL is dynamic and gets re-created on each iteration; you will notice that the PAGE_NO parameter is incremented after every iteration.
After getting the data, the script parses the HTML using the built-in html.parser.
Finally it iterates over the list of books fetched on each page and prints, for example, the title, hyperlink, price, stock availability and rating.
There are 50 pages and 1,000 results; the below script extracts all the book details, one page per iteration.
import requests
from urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
from bs4 import BeautifulSoup as bs

def scrap_books_data():
    PAGE_NO = 1  # page number parameter which gets incremented after every iteration
    while True:
        print('Creating URL to scrape books data for ', str(PAGE_NO))
        URL = 'http://books.toscrape.com/catalogue/page-' + str(PAGE_NO) + '.html'  # dynamic URL created on every iteration
        response = requests.get(URL, verify=False)  # GET request to fetch data from site
        soup = bs(response.text, 'html.parser')  # parse HTML data using 'html.parser'
        extracted_books_data = soup.find_all('article', class_='product_pod')  # find all article tags where book details are nested
        if len(extracted_books_data) == 0:  # break the loop and exit the script if there is no more data to process
            break
        else:
            for item in range(len(extracted_books_data)):  # iterate over the list of extracted books
                print('-' * 100)
                print('Title : ', extracted_books_data[item].contents[5].contents[0].attrs['title'])
                print('Link : ', extracted_books_data[item].contents[5].contents[0].attrs['href'])
                print('Rating : ', extracted_books_data[item].contents[3].attrs['class'][1])
                print('Price : ', extracted_books_data[item].contents[7].contents[1].text.replace('Â', ''))
                print('Availability : ', extracted_books_data[item].contents[7].contents[3].text.replace('\n', '').strip())
                print('-' * 100)
        PAGE_NO += 1  # increment page no by 1 to scrape next page data

scrap_books_data()
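Positional .contents indexing breaks as soon as whitespace or markup shifts; a sketch of the same extraction using named tag/class lookups (the class names are taken from books.toscrape.com's markup at the time of writing) is:

```python
from bs4 import BeautifulSoup

def parse_book(article):
    """Pull the five fields from one <article class="product_pod">."""
    a = article.h3.a
    return {
        'title': a['title'],
        'link': a['href'],
        # class attr is ['star-rating', 'Three']; index 1 is the rating word
        'rating': article.find('p', class_='star-rating')['class'][1],
        'price': article.find('p', class_='price_color').text,
        'stock': article.find('p', class_='instock').text.strip(),
    }

# usage (real network call, so left commented out):
# import requests
# soup = BeautifulSoup(requests.get('http://books.toscrape.com/catalogue/page-1.html').text, 'html.parser')
# for art in soup.find_all('article', class_='product_pod'):
#     print(parse_book(art))
```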
I'm doing Python scraping and trying to get all the links between href tags, and then access them one by one to scrape data from those links. I'm a newbie and can't figure out how to continue from this. The code is as follows:
import requests
import urllib.request
import re
from bs4 import BeautifulSoup
import csv

url = 'https://menupages.com/restaurants/ny-new-york'
url1 = 'https://menupages.com'
response = requests.get(url)
f = csv.writer(open('Restuarants_details.csv', 'w'))
soup = BeautifulSoup(response.text, "html.parser")
menu_sections = []
for url2 in soup.find_all('h3', class_='restaurant__title'):
    completeurl = url1 + url2.a.get('href')
    print(completeurl)
    # print(url)
If you want to scrape all the links obtained from the first page, and then scrape all the links obtained from these links, etc, you need a recursive function.
Here is some initial code to get you started:
def scrape(url):
    print("now looking at " + url)
    # scrape URL
    # do something with the data
    if STOP_CONDITION:  # update this!
        return
    # scrape new URLs:
    for new_url in soup.find_all(...):
        scrape(new_url)

if __name__ == "__main__":
    initial_url = "https://menupages.com/restaurants/ny-new-york"
    scrape(initial_url)
The problem with this recursive function is that it will not stop until there are no links on the pages, which probably won't happen anytime soon. You will need to add a stop condition.
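One common stop condition is a visited set plus a depth limit. In this sketch scrape_page is a hypothetical helper, stubbed here with a tiny fixed link graph so the example runs standalone; in real code it would fetch the page and return the links found on it:

```python
def scrape_page(url):
    """Hypothetical stand-in: return the links found on `url`.
    Stubbed with a tiny link graph so the sketch runs without a network."""
    graph = {'a': ['b', 'c'], 'b': ['a'], 'c': []}
    return graph.get(url, [])

def crawl(url, visited=None, depth=0, max_depth=2):
    if visited is None:
        visited = set()
    if url in visited or depth > max_depth:  # the stop condition
        return visited
    visited.add(url)
    for new_url in scrape_page(url):
        crawl(new_url, visited, depth + 1, max_depth)
    return visited

print(sorted(crawl('a')))  # → ['a', 'b', 'c']
```

The visited set prevents infinite loops between pages that link to each other; the depth limit bounds how far the recursion fans out even on an endless site.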
I want to parse some info from website that has data spread among several pages.
The problem is I don't know how many pages there are. There might be 2, but there might be also 4, or even just one page.
How can I loop over pages when I don't know how many pages there will be?
I know however the url pattern which looks something like in the code below.
Also, the page names are not plain numbers; they are 'pe2' for page 2, 'pe4' for page 3, etc., so I can't just loop over range(number).
This is dummy code for the loop I am trying to fix:
pages = ['', 'pe2', 'pe4', 'pe6', 'pe8']

import requests
from bs4 import BeautifulSoup

for i in pages:
    url = "http://www.website.com/somecode/dummy?page={}".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    # rest of the scraping code
You can use a while loop that stops running when it encounters an exception.
Code:
from bs4 import BeautifulSoup
from time import sleep
import requests

i = 0
while True:
    try:
        if i == 0:
            url = "http://www.website.com/somecode/dummy?page=pe"
        else:
            url = "http://www.website.com/somecode/dummy?page=pe{}".format(i)
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')
        # print page url
        print(url)
        # rest of the scraping code
        # don't overflow website
        sleep(2)
        # increase page number
        i += 2
    except:
        break
Output:
http://www.website.com/somecode/dummy?page=pe
http://www.website.com/somecode/dummy?page=pe2
http://www.website.com/somecode/dummy?page=pe4
http://www.website.com/somecode/dummy?page=pe6
http://www.website.com/somecode/dummy?page=pe8
...
... and so on, until it faces an Exception.
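Note that a bare except also swallows real bugs, and requests does not raise on a 404 by itself. An alternative sketch (using the question's dummy URL pattern) is to generate the page suffixes explicitly and stop on the status code:

```python
import requests

def page_suffixes():
    """Yield '', 'pe2', 'pe4', ... matching the question's page names."""
    yield ""
    i = 2
    while True:
        yield f"pe{i}"
        i += 2

def scrape_all(base="http://www.website.com/somecode/dummy?page={}"):
    for suffix in page_suffixes():
        r = requests.get(base.format(suffix), timeout=10)
        if r.status_code == 404:  # explicit stop instead of a bare except
            break
        # rest of the scraping code

# scrape_all()  # real network calls, so left commented out
```

Whether the site answers a missing page with a 404 is an assumption; some sites return 200 with an empty listing, in which case you would check the parsed content instead.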