Comparing results with Beautiful Soup in Python

I've got the following code that filters a particular search on an auction site.
I can display the titles of each value & also the len of all returned values:
from bs4 import BeautifulSoup
import requests
url = requests.get("https://www.trademe.co.nz/a/marketplace/music-instruments/instruments/guitar-bass/electric-guitars/search?search_string=prs&condition=used")
soup = BeautifulSoup(url.text, "html.parser")
listings = soup.findAll("div", attrs={"class":"tm-marketplace-search-card__title"})
print(len(listings))
for listing in listings:
    print(listing.text)
This prints out the following:
#print(len(listings))
3
#for listing in listings:
# print(listing.text)
PRS. Ten Top Custom 24, faded Denim, Piezo.
PRS SE CUSTOM 22
PRS Tremonti SE *With Seymour Duncan Pickups*
I know what I want to do next, but don't know how to code it. Basically, I want to display only new results. I was thinking of storing the len of the listings (3 at the moment) as a variable and then comparing that with another GET request (a 2nd variable) that maybe runs first thing in the morning. Alternatively, I could compare both text values instead of the len. If they don't match, then it shows the new listings. Is there a better or different way to do this? Any help appreciated, thank you.

With length-comparison, there is the issue of some results being removed between checks, so it might look like there are no new results even if there are; and text-comparison does not account for results with similar titles.
I can suggest 3 other methods. (The 3rd uses my preferred approach.)
Closing time
A comment suggested using the closing time, which can be found in the tag before the title; you can define a function to get the days until closing
from datetime import date
import dateutil.parser
def get_days_til_closing(lSoup):
    cTxt = lSoup.previous_sibling.find('div', {'tmid':'closingtime'}).text
    cTime = dateutil.parser.parse(cTxt.replace('Closes:', '').strip())
    return (cTime.date() - date.today()).days
and then filter by the returned value
min_dtc = 3 # or as preferred
# your current code upto listings = soup.findAll....
new_listings = [l for l in listings if get_days_til_closing(l) > min_dtc]
print(len(new_listings), f'new listings [of {len(listings)}]')
for listing in new_listings: print(listing.text)
However, I don't know if sellers are allowed to set their own closing times or if they're set at a fixed offset; also, I don't see the closing time text when inspecting with the browser dev tools [even though I could extract it with the code above], and that makes me a bit unsure of whether it's always available.
JSON list of Listing IDs
Each result is in a "card" with a link to the relevant listing, and that link contains a number that I'm calling the "listing ID". You can save that in a list as a JSON file and keep checking against it every new scrape
from bs4 import BeautifulSoup
import requests
import json
lFilename = 'listing_ids.json' # or as preferred
url = requests.get("https://www.trademe.co.nz/a/marketplace/music-instruments/instruments/guitar-bass/electric-guitars/search?search_string=prs&condition=used")
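try:
    with open(lFilename, 'r') as f:  # close the file once it has been read
        prev_listings = json.load(f)
except (OSError, ValueError):  # no saved file yet, or it is not valid JSON
    prev_listings = []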
print(len(prev_listings), 'saved listings found')
soup = BeautifulSoup(url.text, "html.parser")
listings = soup.select("div.o-card > a[href*='/listing/']")
new_listings = [
    l for l in listings if
    l.get('href').split('/listing/')[1].split('?')[0]
    not in prev_listings
]
print(len(new_listings), f'new listings [of {len(listings)}]')
for listing in new_listings:
    print(listing.select_one('div.tm-marketplace-search-card__title').text)
with open(lFilename, 'w') as f:
    json.dump(prev_listings + [
        l.get('href').split('/listing/')[1].split('?')[0]
        for l in new_listings
    ], f)
This should be fairly reliable as long as they don't tend to recycle the listing IDs. (Even if they do, every once in a while, after checking the new listings for that day, you can just delete the JSON file and re-run the program once; that will also keep the file from getting too big.)
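The ID comparison at the heart of this method can be tried without any network access. Below is a minimal sketch, assuming the hrefs look like the ones matched by the selector above; the helper names `listing_id_from_href` and `find_new_ids` and the sample hrefs are hypothetical, made up for illustration.

```python
def listing_id_from_href(href):
    """Return the text between '/listing/' and any '?query' part of the link."""
    return href.split('/listing/')[1].split('?')[0]


def find_new_ids(hrefs, saved_ids):
    """Return the IDs found in hrefs that are not among saved_ids, in page order."""
    seen = set(saved_ids)  # set membership tests stay cheap as the saved list grows
    new_ids = []
    for href in hrefs:
        listing_id = listing_id_from_href(href)
        if listing_id not in seen:
            new_ids.append(listing_id)
            seen.add(listing_id)  # also guards against the same card appearing twice
    return new_ids


# Hypothetical hrefs, shaped like the links on the search cards
sample_hrefs = [
    'marketplace/guitars/listing/1001?bof=abc',
    'marketplace/guitars/listing/1002',
    'marketplace/guitars/listing/1003?bof=def',
]
print(find_new_ids(sample_hrefs, ['1001']))  # ['1002', '1003']
```

Unlike comparing lengths, this still reports 1002 and 1003 as new even if some older listing was removed between the two checks.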
CSV Logging [including Listing IDs]
Instead of just saving the IDs, you can save pretty much all the details from each result
from bs4 import BeautifulSoup
import requests
from datetime import date
import pandas
lFilename = 'listings.csv' # or as preferred
max_days = 60 # or as preferred
date_today = date.today()
url = requests.get("https://www.trademe.co.nz/a/marketplace/music-instruments/instruments/guitar-bass/electric-guitars/search?search_string=prs&condition=used")
try:
    prev_listings = pandas.read_csv(lFilename).to_dict(orient='records')
    prevIds = [str(l['listing_id']) for l in prev_listings]
except Exception as e:
    prev_listings, prevIds = [], []
print(len(prev_listings), 'saved listings found')
def get_listing_details(lSoup, prevList, lDate=date_today):
    selectorsRef = {
        'title': 'div.tm-marketplace-search-card__title',
        'location_time': 'div.tm-marketplace-search-card__location-and-time',
        'footer': 'div.tm-marketplace-search-card__footer',
    }
    lId = lSoup.get('href').split('/listing/')[1].split('?')[0]
    lDets = {'listing_id': lId}
    for k, sel in selectorsRef.items():
        s = lSoup.select_one(sel)
        lDets[k] = None if s is None else s.text
    lDets['listing_link'] = 'https://www.trademe.co.nz/a/' + lSoup.get('href')
    lDets['new_listing'] = lId not in prevList
    lDets['last_scraped'] = lDate.isoformat()
    return lDets
soup = BeautifulSoup(url.text, "html.parser")
listings = [
    get_listing_details(s, prevIds) for s in
    soup.select("div.o-card > a[href*='/listing/']")
]
todaysIds = [l['listing_id'] for l in listings]
new_listings = [l for l in listings if l['new_listing']]
print(len(new_listings), f'new listings [of {len(listings)}]')
for listing in new_listings: print(listing['title'])
prev_listings = [
    p for p in prev_listings if str(p['listing_id']) not in todaysIds
    and (date_today - date.fromisoformat(p['last_scraped'])).days < max_days
]
pandas.DataFrame(prev_listings + listings).to_csv(lFilename, index=False)
You'll end up with a spreadsheet of scraping history/log that you can check anytime, and depending on what you set max_days to, the oldest data will be automatically cleared.

Fixed it with the following:
allGuitars = ["",]
latestGuitar = soup.select("#-title")[0].text.strip()
if latestGuitar in allGuitars[0]:
    print("No change. The latest listing is still: " + allGuitars[0])
else:
    print("New listing detected! - " + latestGuitar)
    allGuitars.clear()
    allGuitars.insert(0, latestGuitar)

Related

Output non JSON data from regex web scraping to a JSON file

I'm using requests and regex to scrape data from an entire website and then save it to a JSON file, hosted on github so I and anyone else can access the data from other devices.
The first thing I tried was to open every single page on the website and get all the data I want, but I found that to be unnecessary. So I decided to make two scripts: the first finds the URL of every page on the site, and the second is the one that gets called to scrape the requested URL. What I'm having trouble with right now is getting my data formatted correctly for the JSON file. Currently this is a sample of what the output looks like:
{
"Console":"/neo-geo-aes",
"Call ID":"62815",
"URL":"https://www.pricecharting.com/game/jp-sega-mega-drive/bare-knuckle"
}{
"Console":"/neo-geo-cd",
"Call ID":"62817",
"URL":"https://www.pricecharting.com/game/jp-sega-mega-drive/bare-knuckle-2"
}{
"Console":"/neo-geo-pocket-color",
"Call ID":"62578",
"URL":"https://www.pricecharting.com/game/jp-sega-mega-drive/batman"
}{
"Console":"/playstation",
"Call ID":"62580",
"URL":"https://www.pricecharting.com/game/jp-sega-mega-drive/batman-forever"
}
I've looked into this a lot and can't find a solution. Here's the code in question:
import re
import requests
import json
##The base URL
URL = "https://www.pricecharting.com/"
r = requests.get(URL)
htmltext = r.text
##Find all system URLs
dataUrl = re.findall('(?<=<li><a href="\/console).*(?=">)', htmltext)
print(dataUrl)
##For each Item(number of consoles) find games
for i in range(len(dataUrl)):
    ##make console URL
    newUrl = ("https://www.pricecharting.com/console" + dataUrl[i])
    req = requests.get(newUrl)
    newHtml = req.text
    ##Get item URLs
    urlOne = re.findall('(?<=<a href="\/game).*(?=">)', newHtml)
    itemId = re.findall('(?<=tr id="product-).*(?=" data)', newHtml)
    ##For every item in list(items per console)
    out_list = []
    for i in range(len(urlOne)):
        ##Make item URL
        itemUrl = ("https://www.pricecharting.com/game" + urlOne[i])
        callId = (itemId[i])
        ##Format for JSON
        json_file_content = {}
        json_file_content['Console'] = dataUrl[i]
        json_file_content['Call ID'] = callId
        json_file_content['URL'] = itemUrl
        out_list.append(json_file_content)
    data_json_filename = 'docs/result.json'
    with open(data_json_filename, 'a') as data_json_file:
        json.dump(out_list, data_json_file, indent=4)
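A likely cause of output like `}{` is that every `json.dump` call on a file opened in append mode writes its own complete JSON value, and several values back to back are not one valid JSON document. The sketch below is one possible fix, assuming the goal is a single JSON array: collect every record in one list and write it once with mode `'w'`. The helper `build_records`, the `scraped` sample data, and the temporary output path are hypothetical stand-ins for the regex results and `docs/result.json`. Keeping the console name in its own loop variable also avoids the inner loop reusing `i` to index `dataUrl`.

```python
import json
import os
import tempfile


def build_records(items_by_console):
    """Flatten {console: [(call_id, item_path), ...]} into one list of dicts."""
    records = []
    for console, items in items_by_console.items():
        for call_id, item_path in items:
            records.append({
                'Console': console,
                'Call ID': call_id,
                'URL': 'https://www.pricecharting.com/game' + item_path,
            })
    return records


# Hypothetical values standing in for what the regexes would return
scraped = {
    '/jp-sega-mega-drive': [
        ('62815', '/jp-sega-mega-drive/bare-knuckle'),
        ('62817', '/jp-sega-mega-drive/bare-knuckle-2'),
    ],
}
records = build_records(scraped)

# Temporary path for the demo; the question writes to docs/result.json
out_path = os.path.join(tempfile.mkdtemp(), 'result.json')
with open(out_path, 'w') as fh:  # 'w', not 'a': one dump produces one JSON document
    json.dump(records, fh, indent=4)

with open(out_path) as fh:
    reloaded = json.load(fh)  # succeeds only because the file holds a single array
print(len(reloaded), 'records written and read back')
```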

Attempting to retrieve text from <td></td> tags using BeautifulSoup

So I'm using BeautifulSoup to scrape the link in the code. The artist names and the links come out fine, but I'm not sure how to access the nationality in that second tag.
Here's the code:
import requests
import csv
from bs4 import BeautifulSoup
def findName():
    page = requests.get('https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anB1.htm')
    soup = BeautifulSoup(page.text, 'html.parser')
    last_links = soup.find(class_='AlphaNav')
    last_links.decompose()
    f = csv.writer(open('h-artist_lastname.csv', 'w')) # Create a file to write
    f.writerow(['Last Name, First Name', 'Nationality', 'Link'])
    artist_name_list = soup.find(class_='BodyText')
    artist_name_list_items = artist_name_list.find_all('a')
    artist_nationality_list_items = artist_name_list.find_all('td')
    print(artist_nationality_list_items)
    for artist_name in artist_name_list_items:
        names = artist_name.contents[0]
        #nationalities = artist_nationality_list_items.contents[0]
        links = 'https://web.archive.org' + artist_name.get('href')
        #print(nationalities)
        f.writerow([names, links])
findName()
If I uncomment the line in the for loop, I get a runtime error which I expect. The print statement gives me this value for artist_nationality_list_items:
<td>Babbitt, Platt D.</td>, <td>American, died 1879</td>, ..... <- follows this pattern for every artist
Basically, I want the part with 'American, died 1879'.
You can use select, which accepts CSS selectors, with :nth-child() to select the second <td> in each <tr> instead of find_all, so this:
artist_nationality_list_items = artist_name_list.find_all('td')
becomes:
artist_nationality_list_items = artist_name_list.select('td:nth-child(2)')
You can still work with contents, but don't get bogged down with all the lists: select your target more specifically and extract all the information in one pass.
What happens?
You're treating artist_nationality_list_items (a list) like a single element; that won't work.
How to fix?
To get the right result from artist_nationality_list_items, you have to iterate over it too.
(Works, but bad idea):
for i,artist_name in enumerate(artist_name_list_items):
    names = artist_name.contents[0]
    nationalities = artist_nationality_list_items[i+1].contents[0]
    links = 'https://web.archive.org' + artist_name.get('href')
Alternative and much leaner approach
import requests, csv
from bs4 import BeautifulSoup
def findName():
    page = requests.get('https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anB1.htm')
    soup = BeautifulSoup(page.text, 'html.parser')
    f = csv.writer(open('h-artist_lastname.csv', 'w')) # Create a file to write
    f.writerow(['Last Name, First Name', 'Nationality', 'Link'])
    for row in soup.select('div.BodyText h3+table tr'):
        names = row.contents[0].text
        nationalities = row.contents[1].text
        links = 'https://web.archive.org' + row.a.get('href')
        #print([names,nationalities,links])
        f.writerow([names,nationalities,links])
findName()
A little bit of a botched answer with some sloppy workarounds, but this resulted in what I needed:
import requests
import csv
from bs4 import BeautifulSoup
def findName():
    page = requests.get('https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anB1.htm')
    soup = BeautifulSoup(page.text, 'html.parser')
    last_links = soup.find(class_='AlphaNav')
    last_links.decompose()
    f = csv.writer(open('b-artist_lastname.csv', 'w')) # Create a file to write
    f.writerow(['Last Name, First Name', 'Nationality', 'Link'])
    artist_name_list = soup.find(class_='BodyText')
    artist_name_list_items = artist_name_list.find_all('a')
    i = 2
    for artist_name in artist_name_list_items:
        str_list = list('td:nth-of-type(i)')
        str_list[15] = str(i)
        selection = "".join(str_list)
        names = artist_name.contents[0]
        nationality = artist_name_list.select(selection)
        links = 'https://web.archive.org' + artist_name.get('href')
        nat_to_str = str(nationality)
        nat_str_final = nat_to_str[5:len(nat_to_str) - 6]
        #print(nat_str_final)
        f.writerow([names, nat_str_final, links])
        i += 2
findName()
Thank you to everyone who answered. Using 'td:nth-of-type()' seemed to work, but to get every artist on the page I needed to increase the value inside nth-of-type on every iteration, so I used a list of chars and converted it back into a string after incrementing i at each traversal.
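For what it's worth, the selector string can also be built with an f-string, which avoids the hard-coded character index. This is only a sketch of the string-building step, not of the scraping itself; the helper name `nth_of_type_selector` is hypothetical.

```python
def nth_of_type_selector(tag, n):
    """Build a CSS selector such as 'td:nth-of-type(4)' for a tag and a position."""
    return f'{tag}:nth-of-type({n})'


# Same sequence of positions as the loop above: 2, 4, 6, ...
selectors = [nth_of_type_selector('td', i) for i in range(2, 8, 2)]
print(selectors)  # ['td:nth-of-type(2)', 'td:nth-of-type(4)', 'td:nth-of-type(6)']
```

Each resulting string could then be passed to `select` in place of `selection`.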

Asking the user to input something and use Beautiful Soup to parse a website

I am supposed to use Beautiful Soup 4 to obtain course information off of my school's website as an exercise. I have been at this for the past few days and my code still does not work.
The first thing I ask the user is to input the course catalog abbreviation. For example, ICS is the abbreviation for Information for Computer Science. Beautiful Soup 4 is supposed to list all of the courses and how many students are enrolled.
While I was able to get the input portion to work, I still have errors or the program just stops.
Question: Is there a way for Beautiful Soup to accept user input so that when the user inputs ICS, the output would be a list of all courses that are related to ICS?
Here is the code and my attempt at it:
from bs4 import BeautifulSoup
import requests
import re
#get input for course
course = input('Enter the course:')
#Here is the page link
BASE_AVAILABILITY_URL = f"https://www.sis.hawaii.edu/uhdad/avail.classes?i=MAN&t=202010&s={course}"
#get request and response
page_response = requests.get(BASE_AVAILABILITY_URL)
#getting Beautiful Soup to gather the html content
page_content = BeautifulSoup(page_response.content, 'html.parser')
#getting course information
main = page_content.find_all(class_='parent clearfix')
main_p = "".join(str (x) for x in main)
#get the course anchor tags
main_q = BeautifulSoup(main_p, "html.parser")
courses = main.find('a', href = True)
#get each course name
#empty dictionary for course list
courses_list = []
for a in courses:
    courses_list.append(a.text)
search = input('Enter the course title:')
for course in courses_list:
    if re.search(search, course, re.IGNORECASE):
        print(course)
This is the original code that was provided in Jupyter Notebook:
import requests, bs4
BASE_AVAILABILITY_URL = f"https://www.sis.hawaii.edu/uhdad/avail.classes?i=MAN&t=202010&s={course}"
#get input for course
course = input('Enter the course:')
def scrape_availability(text):
    soup = bs4.BeautifulSoup(text)
    r = requests.get(str(BASE_AVAILABILITY_URL) + str(course))
    rows = soup.select('.listOfClasses tr')
    for row in rows[1:]:
        columns = row.select('td')
        class_name = columns[2].contents[0]
        if len(class_name) > 1 and class_name != b'\xa0':
            print(class_name)
            print(columns[4].contents[0])
            print(columns[7].contents[0])
            print(columns[8].contents[0])
What's odd is that if the user saves the HTML file, uploads it into Jupyter Notebook, and then opens the file to be read, the courses are displayed. But for this task the user cannot save files; the output must come directly from the input.
The problem with your code is that page_content.find_all(class_='parent clearfix') returns an empty list []. So that's the first thing you need to change. Looking at the HTML, you'll want to be looking for <table>, <tr>, and <td> tags.
Working off the original code that was provided, you just need to alter a few things so it flows logically.
I'll point out what I changed:
import requests, bs4
BASE_AVAILABILITY_URL = f"https://www.sis.hawaii.edu/uhdad/avail.classes?i=MAN&t=202010&s={course}"
#get input for course
course = input('Enter the course:')
def scrape_availability(text):
    soup = bs4.BeautifulSoup(text) #<-- need to get the html text before creating a bs4 object. So I moved the request (line below) before this, and also adjusted the parameter for this function.
    # the rest of the code is fine
    r = requests.get(str(BASE_AVAILABILITY_URL) + str(course))
    rows = soup.select('.listOfClasses tr')
    for row in rows[1:]:
        columns = row.select('td')
        class_name = columns[2].contents[0]
        if len(class_name) > 1 and class_name != b'\xa0':
            print(class_name)
            print(columns[4].contents[0])
            print(columns[7].contents[0])
            print(columns[8].contents[0])
This will give you:
import requests, bs4
BASE_AVAILABILITY_URL = "https://www.sis.hawaii.edu/uhdad/avail.classes?i=MAN&t=202010&s="
#get input for course
course = input('Enter the course:')
url = BASE_AVAILABILITY_URL + course
def scrape_availability(url):
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text, 'html.parser')
    rows = soup.select('.listOfClasses tr')
    for row in rows[1:]:
        columns = row.select('td')
        class_name = columns[2].contents[0]
        if len(class_name) > 1 and class_name != b'\xa0':
            print(class_name)
            print(columns[4].contents[0])
            print(columns[7].contents[0])
            print(columns[8].contents[0])
scrape_availability(url)
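Since the subject code comes straight from `input()`, it may help to build the query string with `urllib.parse.urlencode` rather than by plain concatenation, so stray spaces or special characters don't end up in the URL. This is a minimal sketch: the helper `build_availability_url` is hypothetical, the `i`, `t` and `s` parameter names are taken from the URL above, and upper-casing the subject is an assumption about what the site expects.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.sis.hawaii.edu/uhdad/avail.classes"


def build_availability_url(subject, institution="MAN", term="202010"):
    """Return the class-availability URL for a subject code such as 'ICS'."""
    params = {
        "i": institution,
        "t": term,
        "s": subject.strip().upper(),  # assumption: the site expects upper-case codes
    }
    return BASE_URL + "?" + urlencode(params)


print(build_availability_url(" ics "))
# https://www.sis.hawaii.edu/uhdad/avail.classes?i=MAN&t=202010&s=ICS
```

The returned string can be passed to `scrape_availability` in place of `url`.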

How to scrape embedded integers on a website

I'm trying to scrape the number of likes for the datasets available on this website.
I've been unable to work out a way of reliably identifying and scraping the relationship between the dataset title and the like integer, as it is embedded in the HTML as below:
I have used a scraper previously to get information about the resource urls. In that case I was able to capture the last child a of parent h3 with a parent having class .dataset-item.
I would like to adapt my existing code to scrape the number of likes for each resource in the catalogue, rather than the URLs. Below is the code for the url scraper I used:
from bs4 import BeautifulSoup as bs
import requests
import csv
from urllib.parse import urlparse
json_api_links = []
data_sets = []
def get_links(s, url, css_selector):
r = s.get(url)
soup = bs(r.content, 'lxml')
base = '{uri.scheme}://{uri.netloc}'.format(uri=urlparse(url))
links = [base + item['href'] if item['href'][0] == '/' else item['href'] for item in soup.select(css_selector)]
return links
results = []
#debug = []
with requests.Session() as s:
for page in range(1,2): #set number of pages
links = get_links(s, 'https://data.nsw.gov.au/data/dataset?page={}'.format(page), '.dataset-item h3 a:last-child')
for link in links:
data = get_links(s, link, '[href*="/api/3/action/package_show?id="]')
json_api_links.append(data)
#debug.append((link, data))
resources = list(set([item.replace('opendata','') for sublist in json_api_links for item in sublist])) #can just leave as set
for link in resources:
try:
r = s.get(link).json() #entire package info
data_sets.append(r)
title = r['result']['title'] #certain items
if 'resources' in r['result']:
urls = ' , '.join([item['url'] for item in r['result']['resources']])
else:
urls = 'N/A'
except:
title = 'N/A'
urls = 'N/A'
results.append((title, urls))
with open('data.csv','w', newline='') as f:
w = csv.writer(f)
w.writerow(['Title','Resource Url'])
for row in results:
w.writerow(row)
My desired output would appear like this:
The approach is pretty straightforward. The page holds each dataset in a list item, so get the source of each <li> tag and fetch the heading, which has a certain class; the same goes for the like count.
The catch with the like count is that the text contains some noise. To fix that, you can use a regular expression to extract the digits ('\d+') from the like-count text. The following code gives the desired result:
from bs4 import BeautifulSoup as soup
import requests
import re
import pandas as pd
source = requests.get('https://data.nsw.gov.au/data/dataset')
sp = soup(source.text,'lxml')
element = sp.find_all('li',{'class':"dataset-item"})
heading = []
likeList = []
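for i in element:
    try:
        header = i.find('a',{'class':"searchpartnership-url-analytics"})
        heading.append(header.text)
    except AttributeError:  # no link with that class, so fall back to the first link
        header = i.find('a')
        heading.append(header.text)
    like = i.find('span',{'id':'likes-count'})
    likeList.append(re.findall(r'\d+',like.text)[0])
data = {'Title': heading, 'Likes': likeList}  # renamed so the built-in dict is not shadowed
df = pd.DataFrame(data)
print(df.to_string(index=False))  # index=False is an output option, not a constructor argument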
Hope it helped!
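The digit-extraction step from the answer above can be tried on its own. This is a small sketch, assuming the like text looks something like the hypothetical sample strings below; the helper `extract_count` is made up for illustration and returns `None` instead of raising when no number is present.

```python
import re


def extract_count(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    match = re.search(r'\d+', text)
    return int(match.group()) if match else None


print(extract_count('\n   12 likes\n'))  # 12
print(extract_count('no likes yet'))     # None
```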
You could use the following.
I am using a css selector with Or syntax to retrieve title and likes as one list (as every publication has both). I then use slicing to separate titles from likes.
from bs4 import BeautifulSoup as bs
import requests
import csv
def get_titles_and_likes(s, url, css_selector):
    r = s.get(url)
    soup = bs(r.content, 'lxml')
    info = [item.text.strip() for item in soup.select(css_selector)]
    titles = info[::2]
    likes = info[1::2]
    return list(zip(titles,likes))
results = []
with requests.Session() as s:
    for page in range(1,10): #set number of pages
        data = get_titles_and_likes(s, 'https://data.nsw.gov.au/data/dataset?page={}'.format(page), '.dataset-heading .searchpartnership-url-analytics, .dataset-heading [href*="/data/dataset"], .dataset-item #likes-count')
        results.append(data)
results = [i for item in results for i in item]
with open(r'data.csv','w', newline='') as f:
    w = csv.writer(f)
    w.writerow(['Title','Likes'])
    for row in results:
        w.writerow(row)
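The slicing step relies on the combined selector returning titles and likes strictly alternating. A minimal sketch of just that step, using a hypothetical `info` list in place of the scraped text (the helper name `pair_up` is also made up):

```python
def pair_up(flat):
    """Split [title, likes, title, likes, ...] into (title, likes) tuples."""
    titles = flat[::2]   # items at even positions
    likes = flat[1::2]   # items at odd positions
    return list(zip(titles, likes))


# Hypothetical scraped text, alternating title and like count
info = ['Dataset A', '3', 'Dataset B', '0', 'Dataset C', '11']
print(pair_up(info))
# [('Dataset A', '3'), ('Dataset B', '0'), ('Dataset C', '11')]
```

Note that if any dataset were missing its title or its like count, every later pair would shift by one, and `zip` would silently drop an unmatched trailing item, so this approach only holds while every publication has both.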

How do you move to a new page when web scraping with BeautifulSoup?

Below I have code that pulls the records off craigslist. Everything works great but I need to be able to go to the next set of records and repeat the same process but being new to programming I am stuck. From looking at the page code it looks like I should be clicking the arrow button contained in the span here until it contains no href:
next >
I was thinking that maybe this was a loop within a loop but I suppose this could be a try/except situation too. Does that sound right? How would you implement that?
import requests
from urllib.request import urlopen
import pandas as pd
response = requests.get("https://nh.craigslist.org/d/computer-parts/search/syp")
soup = BeautifulSoup(response.text,"lxml")
listings = soup.find_all('li', class_= "result-row")
base_url = 'https://nh.craigslist.org/d/computer-parts/search/'
next_url = soup.find_all('a', class_= "button next")
dates = []
titles = []
prices = []
hoods = []
while base_url !=
    for listing in listings:
        datar = listing.find('time', {'class': ["result-date"]}).text
        dates.append(datar)
        title = listing.find('a', {'class': ["result-title"]}).text
        titles.append(title)
        try:
            price = listing.find('span', {'class': "result-price"}).text
            prices.append(price)
        except:
            prices.append('missing')
        try:
            hood = listing.find('span', {'class': "result-hood"}).text
            hoods.append(hood)
        except:
            hoods.append('missing')
#write the lists to a dataframe
listings_df = pd.DataFrame({'Date': dates, 'Titles' : titles, 'Price' : prices, 'Location' : hoods})
#write to a file
listings_df.to_csv("craigslist_listings.csv")
For each page you crawl you can find the next url to crawl and add it to a list.
This is how I would do it, without changing your code too much. I added some comments so you understand what's happening, but leave me a comment if you need any extra explanation:
import requests
from urllib.request import urlopen
import pandas as pd
from bs4 import BeautifulSoup
base_url = 'https://nh.craigslist.org/d/computer-parts/search/syp'
base_search_url = 'https://nh.craigslist.org'
urls = []
urls.append(base_url)
dates = []
titles = []
prices = []
hoods = []
while len(urls) > 0: # while we have urls to crawl
    print(urls)
    url = urls.pop(0) # removes the first element from the list of urls
    response = requests.get(url)
    soup = BeautifulSoup(response.text,"lxml")
    next_url = soup.find('a', class_= "button next") # finds the next url to crawl
    if next_url: # find returns None when there is no next button
        urls.append(base_search_url + next_url['href']) # adds next url to crawl to the list of urls to crawl
    listings = soup.find_all('li', class_= "result-row") # get all current url listings
    # this is your code unchanged
    for listing in listings:
        datar = listing.find('time', {'class': ["result-date"]}).text
        dates.append(datar)
        title = listing.find('a', {'class': ["result-title"]}).text
        titles.append(title)
        try:
            price = listing.find('span', {'class': "result-price"}).text
            prices.append(price)
        except:
            prices.append('missing')
        try:
            hood = listing.find('span', {'class': "result-hood"}).text
            hoods.append(hood)
        except:
            hoods.append('missing')
#write the lists to a dataframe
listings_df = pd.DataFrame({'Date': dates, 'Titles' : titles, 'Price' : prices, 'Location' : hoods})
#write to a file
listings_df.to_csv("craigslist_listings.csv")
Edit: You are also forgetting to import BeautifulSoup in your code, which I added in my response
Edit2: You only need to find the first instance of the next button, as the page can have (and in this case does have) more than one next button.
Edit3: For this to crawl computer parts, base_url should be changed to the one present in this code
This is not a direct answer to how to access the "next" button, but this may be a solution to your problem. When I've webscraped in the past I use the URLs of each page to loop through search results.
On craigslist, when you click "next page" the URL changes. There's usually a pattern to this change you can take advantage of. I didn't have too long a look, but it looks like the second page of craigslist is https://nh.craigslist.org/search/syp?s=120, and the third is https://nh.craigslist.org/search/syp?s=240. It looks like that final part of the URL changes by 120 each time.
You could create a list of multiples of 120, and then build a for loop to add this value on to the end of each URL.
Then you have your current for loop nested in this for loop.
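The offset idea can be sketched as below. This is only an illustration under stated assumptions: the helper name `page_urls` is hypothetical, the step of 120 comes from the URLs observed above, and treating `?s=0` as the first page is a guess that would need checking against the site.

```python
def page_urls(base_url, page_size=120, pages=3):
    """Return search URLs for the first `pages` result pages, one offset each."""
    return [f'{base_url}?s={page_size * n}' for n in range(pages)]


for url in page_urls('https://nh.craigslist.org/search/syp'):
    print(url)
# https://nh.craigslist.org/search/syp?s=0
# https://nh.craigslist.org/search/syp?s=120
# https://nh.craigslist.org/search/syp?s=240
```

The existing `for listing in listings:` loop would then sit inside a loop over these URLs, with the request and the `find_all` call repeated for each one. The trade-off compared with following the "next" button is that the number of pages has to be known or guessed in advance.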
