Isolating a link with beautifulsoup - python

I have to scrape through the text of a website: link. Using BeautifulSoup, I created a set of all the links on the page, and eventually I want to iterate through that set.
import requests
from bs4 import BeautifulSoup
url = 'https://crmhelpcenter.gitbook.io/wahi-digital/getting-started/readme'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
links = soup.find_all('a')
check = []
for link in links:
    link = 'https://crmhelpcenter.gitbook.io' + link.get('href')
    check.append(link)
print(check)
With this method, the sub-links of some of the links in the sidebar are not added. I could loop through each page and add its links accordingly, but then I would have to go through each link again and check whether it is already included in the set, which makes this time-expensive. Is there any way I can instead just isolate the "next" link that is on each page and go through that recursively till I reach the end?

Is there any way I can instead just isolate the "next" link that is on each page and go through that recursively till I reach the end?
If you mean the "Next" navigation buttons at the bottom of each page, then you can look for a tags with data-rnwi-handle="BaseCard" and [because the "Previous" button has the same attribute] containing "Next" as the first [stripped] string (see aNxt below). You don't necessarily need recursion - since each page has at most one "Next" button, a while loop should suffice:
import requests
from bs4 import BeautifulSoup
# from urllib.parse import urljoin # [ if you use it ]

rootUrl = 'https://crmhelpcenter.gitbook.io'
nxtUrl = f'{rootUrl}/wahi-digital/getting-started/readme'
nextUrls = [nxtUrl]
# allUrls = [nxtUrl] # [ if you want to collect ]

while nxtUrl:
    resp = requests.get(nxtUrl)
    print([len(nextUrls)], resp.status_code, resp.reason, 'from', resp.url)
    soup = BeautifulSoup(resp.content, 'html.parser')

    ### EXTRACT ANY PAGE DATA YOU WANT TO COLLECT ###
    # pgUrl = {urljoin(nxtUrl, a["href"]) for a in soup.select('a[href]')}
    # allUrls += [l for l in pgUrl if l not in allUrls]

    aNxt = [a for a in soup.find_all(
        'a', {'href': True, 'data-rnwi-handle': 'BaseCard'}
    ) if list(a.stripped_strings)[:1] == ['Next']]

    # nxtUrl = urljoin(nxtUrl, aNxt[0]["href"]) if aNxt else None
    nxtUrl = f'{rootUrl}{aNxt[0]["href"]}' if aNxt else None
    nextUrls.append(nxtUrl)  # the last item will [most likely] be None

# if nxtUrl is None: nextUrls = nextUrls[:-1] # remove last item if None
On Colab this took about 3 min to run and collected 344 [+1 for None] items in nextUrls and 2879 in allUrls; omitting or keeping allUrls does not seem to make any significant difference in this duration, since most of the delay is due to the requests (and some due to parsing).
You can also try to scrape all ~3k links with this queue-based crawler. [It took about 15 min in my Colab notebook.] The results of that, as well as nextUrls and allUrls, have been uploaded to this spreadsheet.
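In case you want to build something similar yourself, here is a minimal sketch of a queue-based (breadth-first) crawler along the same lines; the same-site filter and the fragment stripping are my assumptions here, not necessarily what the linked crawler does:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urldefrag

rootUrl = 'https://crmhelpcenter.gitbook.io'
startUrl = f'{rootUrl}/wahi-digital/getting-started/readme'

to_visit = [startUrl]   # queue of pages still to fetch
visited = set()         # every unique same-site URL seen so far

while to_visit:
    url = to_visit.pop(0)          # FIFO pop -> breadth-first order
    if url in visited:
        continue
    visited.add(url)
    resp = requests.get(url)
    if not resp.ok:
        continue
    soup = BeautifulSoup(resp.content, 'html.parser')
    for a in soup.find_all('a', href=True):
        link = urldefrag(urljoin(url, a['href'])).url  # absolute URL, no #fragment
        if link.startswith(rootUrl) and link not in visited:
            to_visit.append(link)

print(len(visited), 'pages crawled')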

Related

How to get final links from find_all('a') as a list?

import requests
import re
from bs4 import BeautifulSoup
respond = requests.get("http://www.kulugyminiszterium.hu/dtwebe/Irodak.aspx")
print(respond)
soup = BeautifulSoup(respond.text, 'html.parser')
for link in soup.find_all('a'):
    links = link.get('href')
    linki_bloc = ('http://www.kulugyminiszterium.hu/dtwebe/'+links).replace(' ', '%20' )
    print(linki_bloc)
value = linki_bloc
print(value.split())
I am trying to use the results of find_all('a') as a list. The only thing that succeeds for me is the last link.
It seems to me that the problem is that the results come out as links separated by newline (\n) characters. I tried many ways to get rid of the newline character but failed. Saving to a file (e.g. .txt) also fails, saving only the last link.
You are close to your goal, but you overwrite the result with each iteration. Simply append your manipulated links to a list, either with a list comprehension directly:
['http://www.kulugyminiszterium.hu/dtwebe/'+link.get('href').replace(' ', '%20' ) for link in soup.find_all('a')]
or as in your example:
links = []
for link in soup.find_all('a'):
    links.append('http://www.kulugyminiszterium.hu/dtwebe/'+link.get('href').replace(' ', '%20' ))
Example
import requests
from bs4 import BeautifulSoup
respond = requests.get("http://www.kulugyminiszterium.hu/dtwebe/Irodak.aspx")
soup = BeautifulSoup(respond.text, 'html.parser')
links = []
for link in soup.find_all('a'):
    links.append('http://www.kulugyminiszterium.hu/dtwebe/'+link.get('href').replace(' ', '%20' ))
links

Comparing results with Beautiful Soup in Python

I've got the following code that filters a particular search on an auction site.
I can display the titles of each value & also the len of all returned values:
from bs4 import BeautifulSoup
import requests
url = requests.get("https://www.trademe.co.nz/a/marketplace/music-instruments/instruments/guitar-bass/electric-guitars/search?search_string=prs&condition=used")
soup = BeautifulSoup(url.text, "html.parser")
listings = soup.findAll("div", attrs={"class":"tm-marketplace-search-card__title"})
print(len(listings))
for listing in listings:
    print(listing.text)
This prints out the following:
#print(len(listings))
3
#for listing in listings:
# print(listing.text)
PRS. Ten Top Custom 24, faded Denim, Piezo.
PRS SE CUSTOM 22
PRS Tremonti SE *With Seymour Duncan Pickups*
I know what I want to do next but don't know how to code it. Basically, I want to only display new results. I was thinking of storing the len of the listings (3 at the moment) as a variable and then comparing it against a second GET request (a 2nd variable) that maybe runs first thing in the morning. Alternatively, compare both text values instead of the len. If they don't match, then it shows the new listings. Is there a better or different way to do this? Any help appreciated, thank you.
With length-comparison, there is the issue of some results being removed between checks, so it might look like there are no new results even if there are; and text-comparison does not account for results with similar titles.
I can suggest 3 other methods. (The 3rd uses my preferred approach.)
Closing time
A comment suggested using the closing time, which can be found in the tag before the title; you can define a function to get the days until closing
from datetime import date
import dateutil.parser
def get_days_til_closing(lSoup):
    cTxt = lSoup.previous_sibling.find('div', {'tmid':'closingtime'}).text
    cTime = dateutil.parser.parse(cTxt.replace('Closes:', '').strip())
    return (cTime.date() - date.today()).days
and then filter by the returned value
min_dtc = 3 # or as preferred
# your current code up to listings = soup.findAll....
new_listings = [l for l in listings if get_days_til_closing(l) > min_dtc]
print(len(new_listings), f'new listings [of {len(listings)}]')
for listing in new_listings: print(listing.text)
However, I don't know if sellers are allowed to set their own closing times or if they're set at a fixed offset; also, I don't see the closing time text when inspecting with the browser dev tools [even though I could extract it with the code above], and that makes me a bit unsure of whether it's always available.
JSON list of Listing IDs
Each result is in a "card" with a link to the relevant listing, and that link contains a number that I'm calling the "listing ID". You can save those IDs in a list as a JSON file and keep checking against it on every new scrape:
from bs4 import BeautifulSoup
import requests
import json
lFilename = 'listing_ids.json' # or as preferred
url = requests.get("https://www.trademe.co.nz/a/marketplace/music-instruments/instruments/guitar-bass/electric-guitars/search?search_string=prs&condition=used")
try:
    prev_listings = json.load(open(lFilename, 'r'))
except Exception as e:
    prev_listings = []
print(len(prev_listings), 'saved listings found')

soup = BeautifulSoup(url.text, "html.parser")
listings = soup.select("div.o-card > a[href*='/listing/']")
new_listings = [
    l for l in listings if
    l.get('href').split('/listing/')[1].split('?')[0]
    not in prev_listings
]
print(len(new_listings), f'new listings [of {len(listings)}]')
for listing in new_listings:
    print(listing.select_one('div.tm-marketplace-search-card__title').text)

with open(lFilename, 'w') as f:
    json.dump(prev_listings + [
        l.get('href').split('/listing/')[1].split('?')[0]
        for l in new_listings
    ], f)
This should be fairly reliable as long as they don't tend to recycle the listing IDs. (Even then, every once in a while, after checking the new listings for that day, you can just delete the JSON file and re-run the program once; that will also keep the file from getting too big...)
CSV Logging [including Listing IDs]
Instead of just saving the IDs, you can save pretty much all the details from each result
from bs4 import BeautifulSoup
import requests
from datetime import date
import pandas
lFilename = 'listings.csv' # or as preferred
max_days = 60 # or as preferred
date_today = date.today()
url = requests.get("https://www.trademe.co.nz/a/marketplace/music-instruments/instruments/guitar-bass/electric-guitars/search?search_string=prs&condition=used")
try:
    prev_listings = pandas.read_csv(lFilename).to_dict(orient='records')
    prevIds = [str(l['listing_id']) for l in prev_listings]
except Exception as e:
    prev_listings, prevIds = [], []
print(len(prev_listings), 'saved listings found')

def get_listing_details(lSoup, prevList, lDate=date_today):
    selectorsRef = {
        'title': 'div.tm-marketplace-search-card__title',
        'location_time': 'div.tm-marketplace-search-card__location-and-time',
        'footer': 'div.tm-marketplace-search-card__footer',
    }
    lId = lSoup.get('href').split('/listing/')[1].split('?')[0]
    lDets = {'listing_id': lId}
    for k, sel in selectorsRef.items():
        s = lSoup.select_one(sel)
        lDets[k] = None if s is None else s.text
    lDets['listing_link'] = 'https://www.trademe.co.nz/a/' + lSoup.get('href')
    lDets['new_listing'] = lId not in prevList
    lDets['last_scraped'] = lDate.isoformat()
    return lDets

soup = BeautifulSoup(url.text, "html.parser")
listings = [
    get_listing_details(s, prevIds) for s in
    soup.select("div.o-card > a[href*='/listing/']")
]
todaysIds = [l['listing_id'] for l in listings]
new_listings = [l for l in listings if l['new_listing']]
print(len(new_listings), f'new listings [of {len(listings)}]')
for listing in new_listings: print(listing['title'])

prev_listings = [
    p for p in prev_listings if str(p['listing_id']) not in todaysIds
    and (date_today - date.fromisoformat(p['last_scraped'])).days < max_days
]
pandas.DataFrame(prev_listings + listings).to_csv(lFilename, index=False)
You'll end up with a spreadsheet of your scraping history that you can check anytime, and depending on what you set max_days to, the oldest data will be cleared automatically.
Fixed it with the following:
allGuitars = ["",]
latestGuitar = soup.select("#-title")[0].text.strip()
if latestGuitar in allGuitars[0]:
    print("No change. The latest listing is still: " + allGuitars[0])
elif not latestGuitar in allGuitars[0]:
    print("New listing detected! - " + latestGuitar)
    allGuitars.clear()
    allGuitars.insert(0, latestGuitar)

How do you move to a new page when web scraping with BeautifulSoup?

Below I have code that pulls the records off Craigslist. Everything works great, but I need to be able to go to the next set of records and repeat the same process; being new to programming, I am stuck. From looking at the page code, it looks like I should be clicking the arrow button (the a tag with class "button next", whose text is "next >") until it no longer contains an href.
I was thinking that maybe this was a loop within a loop but I suppose this could be a try/except situation too. Does that sound right? How would you implement that?
import requests
from urllib.request import urlopen
import pandas as pd
response = requests.get("https://nh.craigslist.org/d/computer-parts/search/syp")
soup = BeautifulSoup(response.text,"lxml")
listings = soup.find_all('li', class_= "result-row")
base_url = 'https://nh.craigslist.org/d/computer-parts/search/'
next_url = soup.find_all('a', class_= "button next")
dates = []
titles = []
prices = []
hoods = []
while base_url !=
    for listing in listings:
        datar = listing.find('time', {'class': ["result-date"]}).text
        dates.append(datar)
        title = listing.find('a', {'class': ["result-title"]}).text
        titles.append(title)
        try:
            price = listing.find('span', {'class': "result-price"}).text
            prices.append(price)
        except:
            prices.append('missing')
        try:
            hood = listing.find('span', {'class': "result-hood"}).text
            hoods.append(hood)
        except:
            hoods.append('missing')
#write the lists to a dataframe
listings_df = pd.DataFrame({'Date': dates, 'Titles' : titles, 'Price' : prices, 'Location' : hoods})
#write to a file
listings_df.to_csv("craigslist_listings.csv")
For each page you crawl you can find the next url to crawl and add it to a list.
This is how I would do it, without changing your code too much. I added some comments so you understand what's happening, but leave me a comment if you need any extra explanation:
import requests
from urllib.request import urlopen
import pandas as pd
from bs4 import BeautifulSoup
base_url = 'https://nh.craigslist.org/d/computer-parts/search/syp'
base_search_url = 'https://nh.craigslist.org'
urls = []
urls.append(base_url)
dates = []
titles = []
prices = []
hoods = []
while len(urls) > 0: # while we have urls to crawl
    print(urls)
    url = urls.pop(0) # removes the first element from the list of urls
    response = requests.get(url)
    soup = BeautifulSoup(response.text,"lxml")

    next_url = soup.find('a', class_= "button next") # finds the next urls to crawl
    if next_url: # if it's not an empty string
        urls.append(base_search_url + next_url['href']) # adds next url to crawl to the list of urls to crawl

    listings = soup.find_all('li', class_= "result-row") # get all current url listings

    # this is your code unchanged
    for listing in listings:
        datar = listing.find('time', {'class': ["result-date"]}).text
        dates.append(datar)
        title = listing.find('a', {'class': ["result-title"]}).text
        titles.append(title)
        try:
            price = listing.find('span', {'class': "result-price"}).text
            prices.append(price)
        except:
            prices.append('missing')
        try:
            hood = listing.find('span', {'class': "result-hood"}).text
            hoods.append(hood)
        except:
            hoods.append('missing')

#write the lists to a dataframe
listings_df = pd.DataFrame({'Date': dates, 'Titles' : titles, 'Price' : prices, 'Location' : hoods})

#write to a file
listings_df.to_csv("craigslist_listings.csv")
Edit: You are also forgetting to import BeautifulSoup in your code, which I added in my response.
Edit 2: You only need to find the first instance of the next button, as the page can (and in this case does) have more than one next button.
Edit 3: For this to crawl computer parts, base_url should be changed to the one present in this code.
This is not a direct answer to how to access the "next" button, but it may be a solution to your problem. When I've web-scraped in the past, I use the URLs of each page to loop through search results.
On Craigslist, when you click "next page" the URL changes. There's usually a pattern to this change that you can take advantage of. I didn't have too long a look, but it looks like the second page of results is https://nh.craigslist.org/search/syp?s=120, and the third is https://nh.craigslist.org/search/syp?s=240. It looks like that final part of the URL changes by 120 each time.
You could create a list of multiples of 120, and then build a for loop to add each of those values onto the end of the URL.
Then you have your current for loop nested in this for loop.
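For instance, a rough sketch of that approach (the page size of 120, the s= query parameter, and the number of pages are assumptions based on the pattern above, not something I've verified against the site):

import requests
from bs4 import BeautifulSoup

base_url = "https://nh.craigslist.org/d/computer-parts/search/syp"
page_size = 120   # assumed results per page
num_pages = 5     # hypothetical; adjust to however many result pages exist

for offset in range(0, page_size * num_pages, page_size):
    response = requests.get(base_url, params={"s": offset})  # ...?s=0, ?s=120, ?s=240, ...
    soup = BeautifulSoup(response.text, "lxml")
    listings = soup.find_all("li", class_="result-row")
    if not listings:   # ran out of results, so stop early
        break
    # ...reuse the existing per-listing loop here to collect dates/titles/prices/hoods...
    print(offset, len(listings))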

Using For Loop with BeautifulSoup to Select Text on a Different URLs

I have been scratching my head for nearly 4 days trying to find the best way to loop through a table of URLs on one website, request the URL and scrape text from 2 different areas of the second site.
I have tried to rewrite this script multiple times, using several different solutions to achieve my desired results, however, I have not been able to fully accomplish this.
Currently, I am able to select the first link of the table on page one, go to the new page, and select the data I need, but I can't get the code to continue looping through every link on the first page.
import requests
from bs4 import BeautifulSoup
journal_site = "https://journals.sagepub.com"
site_link = 'http://journals.sagepub.com/action/showPublications?pageSize=100&startPage='
# each page contains 100 results I need to scrape from
page_1 = '0'
page_2 = '1'
page_3 = '3'
page_4 = '4'
journal_list = site_link + page_1
r = requests.get(journal_list)
soup = BeautifulSoup(r.text, 'html.parser')
for table_row in soup.select('div.results'):
    journal_name = table_row.findAll('tr', class_='False')
    journal_link = table_row.find('a')['href']
    journal_page = journal_site + journal_link
    r = requests.get(journal_page)
    soup = BeautifulSoup(r.text, 'html.parser')
    for journal_header, journal_description in zip(soup.select('main'),
                                                   soup.select('div.journalCarouselTextText')):
        try:
            title = journal_header.h1.text.strip()
            description = journal_description.p.text.strip()
            print(title, ':', description)
        except AttributeError:
            continue
What is the best way to find the title and the description for every journal_name? Thanks in advance for the help!
Most of your code works for me; I just needed to modify the middle section of the code, leaving the parts before and after the same:
# all code same up to here
journal_list = site_link + page_1
r = requests.get(journal_list)
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find("div", { "class" : "results" })
table = results.find('table')
for row in table.find_all('a', href=True):
    journal_link = row['href']
    journal_page = journal_site + journal_link
    # from here same as your code
I stopped after it got the fourth response (title/description) of the 100 results from the first page. I'm pretty sure it will get all the expected results; it only needs to loop through the 4 subsequent pages as well.
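For example, that outer loop could look roughly like this, continuing from the imports and variables defined above (the total of 5 pages is an assumption taken from the page_1..page_4 variables in the question; adjust it to the real page count):

for page in range(0, 5):   # startPage=0 through startPage=4 (assumed page count)
    journal_list = site_link + str(page)
    r = requests.get(journal_list)
    soup = BeautifulSoup(r.text, 'html.parser')

    results = soup.find("div", {"class": "results"})
    table = results.find('table')
    for row in table.find_all('a', href=True):
        journal_link = row['href']
        journal_page = journal_site + journal_link
        # ...request journal_page and extract the title/description exactly as before...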
Hope this helps.

Python scraper advice

I have been working on a scraper for a little while now, and have come very close to getting it to run as intended. My code is as follows:
import urllib.request
from bs4 import BeautifulSoup
# Crawls main site to get a list of city URLs
def getCityLinks():
    city_sauce = urllib.request.urlopen('https://www.prodigy-living.co.uk/') # Enter url here
    city_soup = BeautifulSoup(city_sauce, 'html.parser')
    the_city_links = []
    for city in city_soup.findAll('div', class_="city-location-menu"):
        for a in city.findAll('a', href=True, text=True):
            the_city_links.append('https://www.prodigy-living.co.uk/' + a['href'])
    return the_city_links

# Crawls each of the city web pages to get a list of unit URLs
def getUnitLinks():
    getCityLinks()
    for the_city_links in getCityLinks():
        unit_sauce = urllib.request.urlopen(the_city_links)
        unit_soup = BeautifulSoup(unit_sauce, 'html.parser')
        for unit_href in unit_soup.findAll('a', class_="btn white-green icon-right-open-big", href=True):
            yield('the_url' + unit_href['href'])

the_unit_links = []
for link in getUnitLinks():
    the_unit_links.append(link)

# Soups returns all of the html for the items in the_unit_links
def soups():
    for the_links in the_unit_links:
        try:
            sauce = urllib.request.urlopen(the_links)
            for things in sauce:
                soup_maker = BeautifulSoup(things, 'html.parser')
                yield(soup_maker)
        except:
            print('Invalid url')

# Below scrapes property name, room type and room price
def getPropNames(soup):
    try:
        for propName in soup.findAll('div', class_="property-cta"):
            for h1 in propName.findAll('h1'):
                print(h1.text)
    except:
        print('Name not found')

def getPrice(soup):
    try:
        for price in soup.findAll('p', class_="room-price"):
            print(price.text)
    except:
        print('Price not found')

def getRoom(soup):
    try:
        for theRoom in soup.findAll('div', class_="featured-item-inner"):
            for h5 in theRoom.findAll('h5'):
                print(h5.text)
    except:
        print('Room not found')

for soup in soups():
    getPropNames(soup)
    getPrice(soup)
    getRoom(soup)
When I run this, it returns all the prices for all the URLs picked up. However, it does not return the names or the rooms, and I am not really sure why. I would really appreciate any pointers on this, or ways to improve my code - I have been learning Python for a few months now!
I think that the links you are scraping will in the end redirect you to another website, in which case your scraping functions will not be useful!
For instance, the link for a room in Birmingham is redirecting you to another website.
Also, be careful in your usage of the find and find_all methods in BS. The first returns only one tag (as when you want one property name) while find_all() will return a list allowing you to get, for instance, multiple room prices and types.
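As a tiny standalone illustration of that difference (the HTML here is made up just for the example):

from bs4 import BeautifulSoup

html = '<div><p class="room-price">£120</p><p class="room-price">£150</p></div>'
soup = BeautifulSoup(html, "html.parser")

first_price = soup.find("p", class_="room-price")      # a single Tag (or None if nothing matches)
all_prices = soup.find_all("p", class_="room-price")   # always a list, possibly empty

print(first_price.get_text())                # £120
print([p.get_text() for p in all_prices])    # ['£120', '£150']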
Anyway, I have simplified your code a bit, and this is how I came across your issue. Maybe you can take some inspiration from it:
import requests
from bs4 import BeautifulSoup

main_url = "https://www.prodigy-living.co.uk/"

# Getting individual cities url
re = requests.get(main_url)
soup = BeautifulSoup(re.text, "html.parser")
city_tags = soup.find("div", class_ = "footer-city-nav") # Bottom-of-page nav is not loaded dynamically
cities_links = [main_url + tag["href"] for tag in city_tags.find_all("a")] # Links to cities

# Getting the individual links to the apts
indiv_apts = []
for link in cities_links[0:4]:
    print("At link: ", link)
    re = requests.get(link)
    soup = BeautifulSoup(re.text, "html.parser")
    links_tags = soup.find_all("a", class_ = "btn white-green icon-right-open-big")
    for url in links_tags:
        indiv_apts.append(main_url + url.get("href"))

# Now defining your functions
def GetName(tag):
    print(tag.find("h1").get_text())

def GetType_Price(tags_list):
    for tag in tags_list:
        print(tag.find("h5").get_text())
        print(tag.find("p", class_ = "room-price").get_text())

# Now scraping each of the apts - name, price, room.
for link in indiv_apts[0:2]:
    print("At link: ", link)
    re = requests.get(link)
    soup = BeautifulSoup(re.text, "html.parser")
    property_tag = soup.find("div", class_ = "property-cta")
    rooms_tags = soup.find_all("div", class_ = "featured-item")
    GetName(property_tag)
    GetType_Price(rooms_tags)
You will see that right at the second element of the list, you will get an AttributeError, as you are no longer on your website's page. Indeed:
>>> print(indiv_apts[1])
https://www.prodigy-living.co.uk/http://www.iqstudentaccommodation.com/student-accommodation/birmingham/penworks-house?utm_source=prodigylivingwebsite&utm_campaign=birminghampagepenworksbutton&utm_medium=referral # You will not scrape the expected link right at the beginning
Next time, come with a precise problem to solve; otherwise, take a look at the Code Review section.
On find and find_all: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#calling-a-tag-is-like-calling-find-all
Finally, I think it also answers your question here: https://stackoverflow.com/questions/42506033/urllib-error-urlerror-urlopen-error-errno-11001-getaddrinfo-failed
Cheers :)
