Someone was kind enough to help me put together a web scraper for a government website.
The code:
import urllib.request
from pywebcopy import save_webpage
import requests
from bs4 import BeautifulSoup as Soup

url = "https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Laboratory&CycleBeginYear="
year = 2018  # This variable can be changed to whatever year you want to parse
url = url + str(year)  # combine the government url with the chosen year

response = requests.get(url)
response.raise_for_status()
soup = Soup(response.content, "html.parser")

# This class contains all 4 fields in the NHANES table
class Chemical:
    def __init__(self, chemical_name, doc_file, data_file, last_updated):
        self.chemical_name = chemical_name
        self.doc_file = doc_file
        self.data_file = data_file
        self.last_updated = last_updated

chemicalArray = []  # initiating array
for row in soup.find("tbody").find_all("tr"):
    name, *files, date = row.find_all("td")
    hrefs = [file.a["href"] for file in files]  # this is where I run into an error
    chemical = Chemical(name.get_text(strip=True), hrefs[0], hrefs[1], date.get_text(strip=True))
    chemicalArray.append(chemical)
However, for some years there are entries that look like this:
Sometimes there is no href in certain years because the data file has been withdrawn, and I am not sure how to handle this case. Basically, I need to figure out how to deal with the case where there is no href in the "a" tag.
Test if it has the href attribute before trying to access it.
hrefs = [file.a["href"] for file in files if file.a and "href" in file.a.attrs]
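If you still want one Chemical per row even when a file link has been withdrawn, a minimal sketch of the loop (assuming it is acceptable to store None for a missing link) could look like this:
for row in soup.find("tbody").find_all("tr"):
    name, *files, date = row.find_all("td")
    # store None when the <a> tag is missing or has no href (e.g. withdrawn files)
    hrefs = [
        file.a["href"] if file.a and "href" in file.a.attrs else None
        for file in files
    ]
    chemical = Chemical(name.get_text(strip=True), hrefs[0], hrefs[1],
                        date.get_text(strip=True))
    chemicalArray.append(chemical)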
I was wondering if you can help.
I'm using beautifulsoup to write to Google Sheets.
I've created a crawler that runs through a series of URLs, scrapes the content and then updates a Google sheet.
What I now want to do is prevent a URL from being written to my sheet again if it already exists in column C.
E.g. if I had the URL https://www.bbc.co.uk/1 in my table, I wouldn't want it appearing in my table again.
Here is my code:
from cgitb import text
import requests
from bs4 import BeautifulSoup
import gspread
import datetime
import urllib.parse

gc = gspread.service_account(filename='creds.json')
sh = gc.open('scrapetosheets').sheet1

urls = ["https://www.ig.com/uk/trading-strategies", "https://www.ig.com/us/trading-strategies"]

for url in urls:
    my_url = requests.get(url)
    html = my_url.content
    soup = BeautifulSoup(html, 'html.parser')

    for item in soup.find_all('h3', class_="article-category-section-title"):
        date = datetime.datetime.now()
        title = item.find('a', class_='primary js_target').text.strip()
        url = item.find('a', class_='primary js_target').get('href')
        abs = "https://www.ig.com"
        rel = url
        info = {'date': date, 'title': title, 'url': urllib.parse.urljoin(abs, rel)}
        sh.append_row([str(info['date']), str(info['title']), str(info['url'])])
I'd like to know what I can add to the end of my code to prevent duplicate URLs being entered into my Google Sheet.
Thanks in advance.
Mark
I believe your goal is as follows.
You want to append the values [str(info['date']), str(info['title']), str(info['url'])] only when str(info['url']) does not already exist in column "C".
Modification points:
In this case, it is necessary to check column "C" of the existing sheet sh = gc.open('scrapetosheets').sheet1. This has already been mentioned in TheMaster's comment.
In your script, append_row is used inside a loop. Each call to append_row is a separate API request, so the process cost becomes high.
When these points are reflected in your script, how about the following modification?
Modified script:
from cgitb import text
import requests
from bs4 import BeautifulSoup
import gspread
import datetime
import urllib.parse

gc = gspread.service_account(filename='creds.json')
sh = gc.open('scrapetosheets').sheet1

urls = ["https://www.ig.com/uk/trading-strategies", "https://www.ig.com/us/trading-strategies"]

# I modified the below script.
obj = {r[2]: True for r in sh.get_all_values()}
ar = []
for url in urls:
    my_url = requests.get(url)
    html = my_url.content
    soup = BeautifulSoup(html, "html.parser")

    for item in soup.find_all("h3", class_="article-category-section-title"):
        date = datetime.datetime.now()
        title = item.find("a", class_="primary js_target").text.strip()
        url = item.find("a", class_="primary js_target").get("href")
        abs = "https://www.ig.com"
        rel = url
        info = {"date": date, "title": title, "url": urllib.parse.urljoin(abs, rel)}
        url = str(info["url"])
        if url not in obj:
            ar.append([str(info["date"]), str(info["title"]), url])

if ar != []:
    sh.append_rows(ar, value_input_option="USER_ENTERED")
When this script is run, the existing values are first retrieved from the sheet and used to build an object for looking up str(info["url"]). When str(info["url"]) does not already exist in column "C" of the sheet, the row values are added to an array. The array is then appended to the sheet in a single request.
Reference:
append_rows
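As a smaller variant (not part of the modification above), you could fetch only column "C" for the duplicate check instead of every cell in the sheet; a minimal sketch, assuming the same sheet layout and credentials, with a hypothetical scraped_rows list standing in for the rows your crawler builds:
import gspread

gc = gspread.service_account(filename='creds.json')
sh = gc.open('scrapetosheets').sheet1

# col_values(3) returns column "C" as a flat list of strings, so only the URLs
# needed for the duplicate check are downloaded.
existing_urls = set(sh.col_values(3))

# Hypothetical rows gathered by the crawler: [date, title, url]
scraped_rows = [["2024-01-01", "Example title", "https://www.ig.com/example"]]

new_rows = [row for row in scraped_rows if row[2] not in existing_urls]
if new_rows:
    sh.append_rows(new_rows, value_input_option="USER_ENTERED")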
import requests
from bs4 import BeautifulSoup
Year = input("What year would you like to travel to? YYYY-MM-DD ")
URL = "https://www.billboard.com/charts/hot-100/"
URL += URL + Year
response = requests.get(URL)
data = response.text
soup = BeautifulSoup(data,"html.parser")
songs = soup.find_all(name='h3', id="title-of-a-story")
all_songs = [song.getText() for song in songs]
print(all_songs)
I'm new to web scraping.
It's supposed to give me the list of songs in the Hot 100 for the year that I specify, but it's giving me news headlines instead. It's giving me the wrong data.
Try printing URL before making a request:
https://www.billboard.com/charts/hot-100/https://www.billboard.com/charts/hot-100/2022-01-01
That's clearly wrong: you got the base part twice. The line URL += URL + Year is the culprit; it should have been URL = URL + Year.
Adding to what Sasszem mentioned above:
import requests
from bs4 import BeautifulSoup

Year = input("What year would you like to travel to? YYYY-MM-DD ")

URL = "https://www.billboard.com/charts/hot-100/"
URL = URL + Year
response = requests.get(URL)
data = response.text

songs = []
soup = BeautifulSoup(data, "html.parser")

# instead of directly jumping to the element, I found the container element first
# to restrict the code to a specific section of the website
container = soup.find_all(class_='lrv-a-unstyle-list lrv-u-flex lrv-u-height-100p lrv-u-flex-direction-column#mobile-max')

for x in container:
    song = x.find(id="title-of-a-story")  # locating the element that contains the title in that specific 'container'
    songs.append(song)

all_songs = [song.getText() for song in songs]  # getting all the song titles in a list
print(all_songs)  # ['\n\n\t\n\t\n\t\t\n\t\t\t\t\tAll I Want For Christmas Is You\t\t\n\t\n'] there is a weird prefix and suffix of strings with every title

# removing the suffix and prefix strings
final_output = []
for i in all_songs:
    final_output.append(i[14:-5])
print(final_output)
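As an alternative to slicing fixed index positions off each title, BeautifulSoup can strip the surrounding whitespace for you; a minimal sketch of that clean-up step, reusing the songs list collected above:
# get_text(strip=True) removes the '\n' and '\t' padding around each title,
# so no manual slicing of index positions is needed; the None check guards
# against containers that have no "title-of-a-story" element
final_output = [song.get_text(strip=True) for song in songs if song is not None]
print(final_output)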
I am supposed to use Beautiful Soup 4 to obtain course information off of my school's website as an exercise. I have been at this for the past few days and my code still does not work.
The first thing I ask the user for is the course catalog abbreviation. For example, ICS is the abbreviation for Information and Computer Sciences. Beautiful Soup 4 is supposed to list all of the courses and how many students are enrolled.
While I was able to get the input portion to work, I still get errors or the program just stops.
Question: Is there a way for Beautiful Soup to accept user input so that when the user inputs ICS, the output would be a list of all courses that are related to ICS?
Here is the code and my attempt at it:
from bs4 import BeautifulSoup
import requests
import re

# get input for course
course = input('Enter the course:')

# Here is the page link
BASE_AVAILABILITY_URL = f"https://www.sis.hawaii.edu/uhdad/avail.classes?i=MAN&t=202010&s={course}"

# get request and response
page_response = requests.get(BASE_AVAILABILITY_URL)

# getting Beautiful Soup to gather the html content
page_content = BeautifulSoup(page_response.content, 'html.parser')

# getting course information
main = page_content.find_all(class_='parent clearfix')
main_p = "".join(str(x) for x in main)

# get the course anchor tags
main_q = BeautifulSoup(main_p, "html.parser")
courses = main.find('a', href=True)

# get each course name
# empty list for the course names
courses_list = []
for a in courses:
    courses_list.append(a.text)

search = input('Enter the course title:')
for course in courses_list:
    if re.search(search, course, re.IGNORECASE):
        print(course)
This is the original code that was provided in Jupyter Notebook:
import requests, bs4

BASE_AVAILABILITY_URL = f"https://www.sis.hawaii.edu/uhdad/avail.classes?i=MAN&t=202010&s={course}"

# get input for course
course = input('Enter the course:')

def scrape_availability(text):
    soup = bs4.BeautifulSoup(text)
    r = requests.get(str(BASE_AVAILABILITY_URL) + str(course))
    rows = soup.select('.listOfClasses tr')
    for row in rows[1:]:
        columns = row.select('td')
        class_name = columns[2].contents[0]
        if len(class_name) > 1 and class_name != b'\xa0':
            print(class_name)
            print(columns[4].contents[0])
            print(columns[7].contents[0])
            print(columns[8].contents[0])
What's odd is that if the user saves the HTML file, uploads it into Jupyter Notebook, and then opens the file to be read, the courses are displayed. But for this task the user cannot save files; it has to work directly from the user's input.
The problem with your code is that page_content.find_all(class_='parent clearfix') returns an empty list [], so that's the first thing you need to change. Looking at the HTML, you'll want to look for <table>, <tr>, and <td> tags.
Working off what was provided in the original code, you just need to alter a few things so that it flows logically.
I'll point out what I changed:
import requests, bs4

BASE_AVAILABILITY_URL = f"https://www.sis.hawaii.edu/uhdad/avail.classes?i=MAN&t=202010&s={course}"

# get input for course
course = input('Enter the course:')

def scrape_availability(text):
    soup = bs4.BeautifulSoup(text)  # <-- need to get the html text before creating a bs4 object.
                                    # So I moved the request (line below) before this, and also
                                    # adjusted the parameter for this function.
    # the rest of the code is fine
    r = requests.get(str(BASE_AVAILABILITY_URL) + str(course))
    rows = soup.select('.listOfClasses tr')
    for row in rows[1:]:
        columns = row.select('td')
        class_name = columns[2].contents[0]
        if len(class_name) > 1 and class_name != b'\xa0':
            print(class_name)
            print(columns[4].contents[0])
            print(columns[7].contents[0])
            print(columns[8].contents[0])
This will give you:
import requests, bs4

BASE_AVAILABILITY_URL = "https://www.sis.hawaii.edu/uhdad/avail.classes?i=MAN&t=202010&s="

# get input for course
course = input('Enter the course:')
url = BASE_AVAILABILITY_URL + course

def scrape_availability(url):
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text, 'html.parser')
    rows = soup.select('.listOfClasses tr')
    for row in rows[1:]:
        columns = row.select('td')
        class_name = columns[2].contents[0]
        if len(class_name) > 1 and class_name != b'\xa0':
            print(class_name)
            print(columns[4].contents[0])
            print(columns[7].contents[0])
            print(columns[8].contents[0])

scrape_availability(url)
I am trying to extract some information based on the year entered in the url. The information extracted is from an unknown number of pages.
How can I build the new URL after the year is substituted, so that it can be passed on for processing the content extracted from multiple pages? I also want to be able to get all the information from however many pages there are.
As I understand it, I would need a while loop. How do I check whether a next page exists?
Is there an efficient way to do this? Thanks!
import requests
from datetime import datetime
from bs4 import BeautifulSoup
from urllib import parse
from time import sleep

input_year = int(input("Enter year here >>: "))

def print_info(response_text):
    soup = BeautifulSoup(response_text, 'lxml')
    for info in soup.find_all('div', class_='grid'):
        for a in info.find_all('a'):
            if a.parent.name == 'div':
                print(''.join(text for text in a.find_all(text=True)))

url = 'https://mywebsite.org/archive.pl?op=bytime&keyword=&year={}&page={}'.format(input_year, 1)
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')

# current page number
page_number_tag = soup.find('span', class_='active tcenter')
page_number = page_number_tag.text

# next page number
for x in soup.find_all('div', class_='t'):
    for a in x.find_all('a'):
        if a.parent.name == 'div':
            next_page_number = ''.join(text for text in a.find_all(text=True))
Assuming you have the variables year and page already, you can use string formatting to build a new url with those values:
base_url = 'https://mywebsite.com/archive.pl?op=bytime&keyword=&year=%s&page=%s'
new_url = base_url % (year, page)
Use format and pass multiple arguments as below. This is an example; you can specify year and page the way you want.
year = 2019
for page in range(1, 10):
    url = 'https://mywebsite.com/archive.pl?op=bytime&keyword=&year={}&page={}'.format(year, page)
    print(url)
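Neither snippet above checks whether a next page actually exists. One common pattern, sketched here under the assumption that the archive simply returns no 'grid' result blocks once you go past the last page (the real stopping condition depends on the site's markup), is to keep incrementing the page number until a page comes back empty:
import requests
from bs4 import BeautifulSoup
from time import sleep

input_year = int(input("Enter year here >>: "))
base_url = 'https://mywebsite.org/archive.pl?op=bytime&keyword=&year={}&page={}'

page = 1
while True:
    response = requests.get(base_url.format(input_year, page))
    soup = BeautifulSoup(response.content, 'lxml')

    # Assumption: results live in <div class="grid"> blocks, as in the question's
    # print_info(); an empty list is treated as "no next page" and ends the loop.
    results = soup.find_all('div', class_='grid')
    if not results:
        break

    for info in results:
        for a in info.find_all('a'):
            if a.parent.name == 'div':
                print(''.join(a.find_all(text=True)))

    page += 1
    sleep(1)  # be polite between requests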
I'd like to scrape Airbnb's listings by city (for the 5 cities listed in the code) and gather information such as price, a link to the listing, room type, number of guests, etc.
I was able to get the link, but I'm having trouble getting the price.
from bs4 import BeautifulSoup
import requests
import csv
from urllib.parse import urljoin  # For joining next page url with base url
from datetime import datetime  # For inserting the current date and time

start_url_nyc = "https://www.airbnb.com/s/New-York--NY--United-States"
start_url_mia = "https://www.airbnb.com/s/Miami--FL--United-States"
start_url_la = "https://www.airbnb.com/s/Los_Angeles--CA--United-States"
start_url_sf = "https://www.airbnb.com/s/San_Francisco--CA--United-States"
start_url_orl = "https://www.airbnb.com/s/Orlando--FL--United-States"

def scrape_airbnb(url):
    # Set up the URL Request
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")

    # Iterate over search results
    for search_result in soup.find_all('div', 'infoContainer_tfq3vd'):
        # Parse the name and price and record the time
        link_end = search_result.find('a').get('href')
        link = "https://www.airbnb.com" + link_end
        price = search_result.find('span', 'data-pricerate').find('data-reactid').get(int)
    return (price)

print(scrape_airbnb(start_url_orl))
This is the html code:
<span data-pricerate="true" data-reactid=".91165im9kw.0.2.0.3.2.1.0.$0.$grid_0.$0/=1$=01$16085565.$=1$16085565.0.2.0.1.0.0.0.1:1">552</span>
This is your code
price = search_result.find('span', 'data-pricerate').find('data-reactid').get(int)
first:
Some attributes, like the data-* attributes in HTML 5, have names that can’t be used as the names of keyword arguments:
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression
You can use these attributes in searches by putting them into a
dictionary and passing the dictionary into find_all() as the attrs
argument:
data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]
then:
price = search_result.find('span', attrs={"data-pricerate":"true"})
This will return a span tag which contains the price as a string; just use .text:
price = search_result.find('span', attrs={"data-pricerate":"true"}).text
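Putting that fix back into your function, a minimal sketch (assuming the infoContainer_tfq3vd class from your snippet still matches Airbnb's current markup, which changes frequently) that collects the link and price for every search result might look like this:
import requests
from bs4 import BeautifulSoup

def scrape_airbnb(url):
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")

    results = []
    for search_result in soup.find_all('div', 'infoContainer_tfq3vd'):
        link = "https://www.airbnb.com" + search_result.find('a').get('href')
        # data-* attributes go through the attrs dict; .text then gives the price string
        price_tag = search_result.find('span', attrs={"data-pricerate": "true"})
        price = price_tag.text if price_tag else None
        results.append((link, price))
    return results

print(scrape_airbnb("https://www.airbnb.com/s/Orlando--FL--United-States"))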