Python BeautifulSoup extracting href text - python

I am writing some Python to scrape lottery numbers and other columns in a table.
The issue I have is extracting the text January 2001 from an anchor element such as <a href="...">January 2001</a> using Python and BeautifulSoup.
The code I have created so far:
import requests
from bs4 import BeautifulSoup
URL = "https://www.lotterysearch.org/results/2001"
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1 Safari/605.1.15"
}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, "html.parser")
table = soup.find("table", {"style": "width:100%"})
# Get each table row 'tr'
for row in table.find_all("tr"):
    cells = row.findAll("td")
    # print(row.find("td").find("a"))
    draw_year = cells[0].find("a")
    draw_date = cells[0].find(text=True)
    # draw_date = cells[0].find(text=True)
    winning_numbers = cells[1].find(text=True)
    jackpot = cells[3].find(text=True)
    draw_number = cells[4].find(text=True)
    print(draw_year)
The result that gets printed is the whole anchor element, for example:
<a href="...">January 2001</a>
I could do some substringing to pull out the January 2001, but I want to find the correct method for doing so.

I made this quick change. Please let me know if it is helpful. I think it prints a relative URL but you can combine it with the base URL.
draw_year = cells[0].find("a", href=True)
if draw_year is not None:
    print(draw_year['href'])
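If the href turns out to be relative, a small sketch of combining it with the base URL (assuming the links are site-relative paths):

from urllib.parse import urljoin

base = "https://www.lotterysearch.org"
if draw_year is not None:
    # urljoin resolves a relative href against the base URL
    print(urljoin(base, draw_year['href']))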

I got it now. Add this to the end. I added the if statement because otherwise you get None in the output. Is this how you want it?
if draw_year is not None:
    print(draw_year.get_text())

Late answer, but you can also use:
import requests
from bs4 import BeautifulSoup
h = { "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1 Safari/605.1.15"}
u = "https://www.lotterysearch.org/results/2001"
html = requests.get(u, headers=h).text
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"style": "width:100%"})
for row in table.find_all("tr"):
    cells = row.findAll("td")
    draw_year = cells[0].findAll("a")
    if len(draw_year) != 1:
        continue  # skip the first tr that only contains "Date"
    draw_year = draw_year[0].text
    draw_date = cells[0].find(text=True)
    winning_numbers = cells[1].find(text=True)
    jackpot = cells[3].find(text=True)
    draw_number = cells[4].find(text=True)
    print(draw_year)
Output:
January 2001
January 2001
January 2001
January 2001
January 2001
January 2001
January 2001
...

So the issue was that I was not catering for None when trying to extract the .text, so I tested for None with if type(draw_year) != type(None):
import requests
from bs4 import BeautifulSoup
URL = "https://www.lotterysearch.org/results/2001"
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1 Safari/605.1.15"
}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, "html.parser")
table = soup.find("table", {"style": "width:100%"})
# Get each table row 'tr'
for row in table.find_all("tr"):
    cells = row.findAll("td")
    draw_year = cells[0].find("a")
    draw_date = cells[0].find(text=True)
    # draw_date = cells[0].find(text=True)
    winning_numbers = cells[1].find(text=True)
    jackpot = cells[3].find(text=True)
    draw_number = cells[4].find(text=True)
    if type(draw_year) != type(None):
        print(draw_year.text)
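As a side note, the usual Python idiom for this check is an identity comparison rather than comparing types, so the last two lines could read:

    if draw_year is not None:
        print(draw_year.text)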

Related

BeautifulSoup: how to get the value of the price from the webpage's source code if there is no id within the source code for the price

I am doing web scraping in Python with BeautifulSoup and wondering if there is a way of getting the value of a cell when it has no id. The code is as below:
from bs4 import BeautifulSoup
import requests
import time
import datetime
URL = "https://www.amazon.co.uk/Got-Data-MIS-Business-Analyst/dp/B09F319PK2/ref=sr_1_1?keywords=funny+got+data+mis+data+systems+business+analyst+tshirt&qid=1636481904&qsid=257-9827493-6142040&sr=8-1&sres=B09F319PK2%2CB09F33452D%2CB08MCBFLHC%2CB07Y8Z4SF8%2CB07GJGXY7P%2CB07Z2DV1C2%2CB085MZDMZ8%2CB08XYL6GRM%2CB095CXJ226%2CB08JDMYMPV%2CB08525RB37%2CB07ZDNR6MP%2CB07WL5JGPH%2CB08Y67YF63%2CB07GD73XD8%2CB09JN7Z3G2%2CB078W9GXJY%2CB09HVDRJZ1%2CB07JD7R6CB%2CB08JDKYR6Q&srpt=SHIRT"
headers = {"User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 14092.77.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.107 Safari/537.36"}
page = requests.get(URL, headers = headers)
soup1 = BeautifulSoup(page.content, "html.parser")
soup2 = BeautifulSoup(soup1.prettify(), "html.parser")
title = soup2.find(id="productTitle").get_text()
price = soup2.find(id="priceblock_ourprice").get_text()
print(title)
print(price)
For this page, you have to select the garment size before the price is displayed. We can get the price from the dropdown list of sizes which is a SELECT with id = "dropdown_selected_size_name"
First let's get a list of the options in the SELECT dropdown:
options = soup2.find(id='variation_size_name').select('select option')
Then we can get the price say for size 'Large'
for opt in options:
    if opt.get('data-a-html-content', '') == 'Large':
        print(opt['value'])
or a little more succinctly:
print([opt['value'] for opt in options if opt.get('data-a-html-content', '') == 'Large'][0])
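If there is a chance the size is missing from the dropdown, a slightly safer sketch of the one-liner uses next() with a default instead of indexing, so it returns None rather than raising an IndexError:

price = next((opt['value'] for opt in options if opt.get('data-a-html-content', '') == 'Large'), None)
print(price)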

python/ beautifulsoup KeyError: 'href'

I am using bs4 to write a webscraper to obtain funding news data.
The first part of my code extracts the title, link, summary and date of each article for n number of pages.
The second part of my code loops through the link column and feeds each resulting URL into a new function, which extracts the URL of the company in question.
For the most part, the code works fine (40 pages scraped without errors). I am trying to stress test it by raising it to 80 pages, but I'm running into KeyError: 'href' and I don't know how to fix this.
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from tqdm import tqdm
def clean_data(column):
    df[column] = df[column].str.encode('ascii', 'ignore').str.decode('ascii')

#extract
def extract(page):
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Safari/605.1.15'}
    url = f'https://www.uktechnews.info/category/investment-round/series-a/page/{page}/'
    r = requests.get(url, headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    return soup

#transform
def transform(soup):
    for item in soup.find_all('div', class_='post-block-style'):
        title = item.find('h3', {'class': 'post-title'}).text.replace('\n', '')
        link = item.find('a')['href']
        summary = item.find('p').text
        date = item.find('span', {'class': 'post-meta-date'}).text.replace('\n', '')
        news = {
            'title': title,
            'link': link,
            'summary': summary,
            'date': date
        }
        newslist.append(news)
    return

newslist = []

#subpage
def extract_subpage(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 Safari/605.1.15'}
    r = requests.get(url, headers)
    soup_subpage = BeautifulSoup(r.text, 'html.parser')
    return soup_subpage

def transform_subpage(soup_subpage):
    main_data = soup_subpage.select("div.entry-content.clearfix > p > a")
    if len(main_data):
        subpage_link = {
            'subpage_link': main_data[0]['href']
        }
        subpage.append(subpage_link)
    else:
        subpage_link = {
            'subpage_link': '--'
        }
        subpage.append(subpage_link)
    return

subpage = []

#load
page = np.arange(0, 80, 1).tolist()
for page in tqdm(page):
    try:
        c = extract(page)
        transform(c)
    except:
        None

df1 = pd.DataFrame(newslist)

for url in tqdm(df1['link']):
    t = extract_subpage(url)
    transform_subpage(t)

df2 = pd.DataFrame(subpage)
The traceback in the screenshot ends in KeyError: 'href' (screenshot omitted).
I think the issue is that my if statement for the transform_subpage function does not account for instances where main_data is not an empty list but does not contain href links. I am relatively new to Python so any help would be much appreciated!
You are correct, it's caused by main_data[0] not having an 'href' attribute at some point. You can try changing the logic to something like:
def transform_subpage(soup_subpage):
    main_data = soup_subpage.select("div.entry-content.clearfix > p > a")
    if len(main_data):
        if 'href' in main_data[0].attrs:
            subpage_link = {
                'subpage_link': main_data[0]['href']
            }
            subpage.append(subpage_link)
    else:
        subpage_link = {
            'subpage_link': '--'
        }
        subpage.append(subpage_link)
Also, just a note: it's probably not a great idea to iterate over a list while reusing the list's variable name for each item. So change it to something like:
page_list = np.arange(0, 80, 1).tolist()
for page in tqdm(page_list):
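As an aside, bs4 tags also have a .get() method that returns a default when an attribute is missing, so the same logic can be written more compactly (a sketch, keeping the '--' placeholder used above):

def transform_subpage(soup_subpage):
    main_data = soup_subpage.select("div.entry-content.clearfix > p > a")
    # Tag.get() returns '--' when the first match has no href (or when there is no match at all)
    href = main_data[0].get('href', '--') if main_data else '--'
    subpage.append({'subpage_link': href})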

Creating a dataframe from a dictionary is giving me a could not broadcast error

I am trying to create a data frame from a dictionary I have and it gives me an error that says:
> ValueError: could not broadcast input array from shape (3) into shape (1)
Here is the code:
import requests
import pandas as pd  # needed for pd.DataFrame below
from bs4 import BeautifulSoup
from requests.api import request
from selenium import webdriver
from bs4 import Tag, NavigableString

baseurl = "https://www.olx.com.eg/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36"
}
product_links = []

for x in range(1, 13):
    r = requests.get(f"https://www.olx.com.eg/jobs/?page={x}", headers=headers)
    soup = BeautifulSoup(r.content, "lxml")
    product_list = soup.findAll("div", class_="ads__item")
    for item in product_list:
        for link in item.findAll("a", href=True):
            product_links.append(link['href'])

for thing in product_links:
    if '#' in product_links:
        product_links.remove('#')

# test_link = 'https://www.olx.com.eg/ad/-IDcjqyP.html'
for link in product_links:
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.content, "lxml")
    job_title = soup.find('h1', class_="brkword")
    job_location = soup.find('strong', class_="c2b")
    job_date = soup.find('span', class_="pdingleft10 brlefte5")
    try:
        seniority = soup.find_all('td', class_='value')[0].text.strip()
    except:
        print("")
    try:
        full_or_part = soup.find_all('td', class_='value')[1].text.strip()
    except:
        print("")
    try:
        education_level = soup.find_all('td', class_='value')[2].text.strip()
    except:
        print("")
    try:
        sector = soup.find_all('td', class_='value')[3].text.strip()
    except:
        print("")
    description = soup.find_all('p', class_='pding10')

df = {
    "Job Title": job_title,
    "Job Location": job_location,
    "Post Date": job_date,
    "Seniority Level": seniority,
    "Full or Part time": full_or_part,
    "Educational Level": education_level,
    "Sector": sector,
    "Job Description": description
}
job_data = pd.DataFrame(df)
Please tell me how I can transform the data I have into a data frame so I can export it to a CSV.
First of all, I was trying to scrape this jobs website and the scrape itself worked, returning 500 jobs in the dictionary, but I was unfortunately not able to transform the result into a dataframe so that later on I can export it to a CSV file and do some analysis on it.
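A likely cause, given the code above, is that the dictionary values are BeautifulSoup objects and lists of different lengths (description is a find_all result with several <p> tags), so pandas cannot line them up into a single row. The pattern in the answer below, one plain-string dict per ad collected into a list, avoids this; a minimal sketch of that shape:

import pandas as pd

# one dict per ad, with plain-string values (placeholder data for illustration)
rows = [
    {"Job Title": "Accountant", "Job Location": "Cairo"},
    {"Job Title": "Driver", "Job Location": "Giza", "Sector": "Transport"},
]
job_data = pd.DataFrame(rows)  # each dict becomes a row; missing keys become NaN
job_data.to_csv("jobs.csv", index=False)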
To create a dataframe from the job ads, you can try the next example (some column names need to be renamed from Arabic to English, though):
import requests
import pandas as pd
from bs4 import BeautifulSoup
baseurl = "https://www.olx.com.eg/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36"
}
product_links = []

for x in range(1, 2):  # <-- increase the range here
    r = requests.get(f"https://www.olx.com.eg/jobs/?page={x}", headers=headers)
    soup = BeautifulSoup(r.content, "lxml")
    product_list = soup.findAll("div", class_="ads__item")
    for item in product_list:
        for link in item.findAll("a", href=True):
            if link["href"] != "#":
                product_links.append(link["href"])

all_data = []
for link in product_links:
    print(f"Getting {link} ...")
    soup = BeautifulSoup(requests.get(link, headers=headers).content, "lxml")
    d = {}
    job_title = soup.find("h1").get_text(strip=True)
    job_location = soup.find("strong", class_="c2b")
    job_date = soup.find("span", class_="pdingleft10 brlefte5")
    d["title"] = job_title
    d["location"] = job_location.get_text(strip=True) if job_location else "N/A"
    d["date"] = job_date.get_text(strip=True) if job_date else "N/A"
    for table in soup.select("table.item"):
        d[table.th.get_text(strip=True)] = table.td.get_text(strip=True)
    all_data.append(d)

job_data = pd.DataFrame(all_data)
print(job_data)
job_data.to_csv("data.csv", index=False)
Creates data.csv (screenshot from LibreOffice omitted).
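Since the per-ad detail tables on the site are in Arabic, the extra columns of job_data come out with Arabic headers; a sketch of renaming them afterwards (the keys below are placeholders, the real labels come from the scraped table headers):

job_data = job_data.rename(columns={
    "ARABIC_HEADER_1": "Seniority Level",   # placeholder keys; replace with the actual scraped headers
    "ARABIC_HEADER_2": "Employment Type",
})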

How do I web scrape Worldometer charts?

Our organization is using Worldometers for COVID-19 data. I'm able to scrape the page state data, but our leaders want the 7-day moving average for new cases and deaths. To do this manually, you have to click on the 7-day moving average button and hover over today's date. Is there an automated method or module that is available to the public?
Link I can web scrape: https://www.worldometers.info/coronavirus/country/us/
The data I need is shown in the images below (screenshots omitted).
You can use regex to pull that out:
import requests
from bs4 import BeautifulSoup
import re
url = 'https://www.worldometers.info/coronavirus/country/us/'
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Mobile Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
scripts = soup.find_all('script')
for script in scripts:
    if "Highcharts.chart('graph-cases-daily'" in str(script):
        jsonStr = str(script)
        data = re.search(r"(name: '7-day moving average')[\s\S\W\w]*(data:[\s\S\W\w]*\d\])", jsonStr, re.IGNORECASE)
        data = data.group(2).split('data:')[-1].strip().replace('[','').replace(']','').split(',')
Output:
print(data[-1])
148755
Better yet, we can pull out the dates too and make a dataframe:
import requests
import pandas as pd
from bs4 import BeautifulSoup
import re
import ast
url = 'https://www.worldometers.info/coronavirus/country/us/'
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Mobile Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
scripts = soup.find_all('script')
for script in scripts:
    if "Highcharts.chart('graph-cases-daily'" in str(script):
        jsonStr = str(script)
        dates = re.search(r'(xAxis: {[\s\S\W\w]*)(categories: )(\[[\w\W\s\W]*\"\])', jsonStr)
        dates = dates.group(3).replace('[','').replace(']','')
        dates = ast.literal_eval(dates)
        dates = [x for x in dates]
        data = re.search(r"(name: '7-day moving average')[\s\S\W\w]*(data:[\s\S\W\w]*\d\])", jsonStr, re.IGNORECASE)
        data = data.group(2).split('data:')[-1].strip().replace('[','').replace(']','').split(',')

df = pd.DataFrame({'Date': dates, '7 Day Moving Average': data})
And to plot:
import matplotlib.pyplot as plt
df.iloc[1:]['7 Day Moving Average'].astype(int).plot(x ='Date', y='7 Day Moving Average', kind = 'line')
plt.show()
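If you want the dates along the x-axis, one variant (a sketch) is to set Date as the index before plotting, since plotting the bare Series just uses its positional index:

import matplotlib.pyplot as plt

plot_df = df.iloc[1:].copy()
plot_df['7 Day Moving Average'] = plot_df['7 Day Moving Average'].astype(int)
# setting Date as the index makes pandas use it for the x-axis
plot_df.set_index('Date')['7 Day Moving Average'].plot(kind='line')
plt.show()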
UPDATE:
To get each state, we grabbed the href for each of them then pulled out the data. I went ahead and combined all the tables and you can just query the 'State' column for a specific state:
import requests
import pandas as pd
from bs4 import BeautifulSoup
import re
import ast
url = 'https://www.worldometers.info/coronavirus/country/us/'
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Mobile Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
states_list = []
rows = soup.find('table', {'id': 'usa_table_countries_today'}).find_all('tr')
for row in rows:
    if row.find_all('td'):
        tds = row.find_all('td')
        for data in tds:
            if data.find('a', {'class': 'mt_a'}):
                href = data.find('a', {'class': 'mt_a'})['href']
                states_list.append(href)

states_list = [x for x in states_list]

df_dict = {}
for state in states_list:
    print(state)
    df_dict[state] = []
    state_url = 'https://www.worldometers.info/' + state
    response = requests.get(state_url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    scripts = soup.find_all('script')
    for script in scripts:
        for graph_type in ['cases','deaths']:
            if "Highcharts.chart('graph-%s-daily'" %graph_type in str(script):
                jsonStr = str(script)
                dates = re.search(r'(xAxis: {[\s\S\W\w]*)(categories: )(\[[\w\W\s\W]*\"\])', jsonStr)
                dates = dates.group(3).replace('[','').replace(']','')
                dates = ast.literal_eval(dates)
                dates = [x for x in dates]
                data = re.search(r"(name: '7-day moving average')[\s\S\W\w]*(data:[\s\S\W\w]*\d\])", jsonStr, re.IGNORECASE)
                data = data.group(2).split('data:')[-1].strip().replace('[','').replace(']','').split(',')
                df = pd.DataFrame({'Date': dates, '7 Day Moving Average - %s' %graph_type.title(): data})
                df_dict[state].append(df)

# Combine the tables
df_list = []
for state, tables in df_dict.items():
    dfs = [df.set_index('Date') for df in tables]
    temp_df = pd.concat(dfs, axis=1).reset_index(drop=False)
    temp_df['State'] = state.split('/')[-2]
    df_list.append(temp_df)

results = pd.concat(df_list, axis=0)
I was able to just scrape the page using BeautifulSoup. I got the area I want - finding the 7-day average - but I'm having difficulties trying to organize the data into a data frame. Ultimately, I just want the latest date, but I'm unsure about how to get there.
import requests
from bs4 import BeautifulSoup
url = "https://www.worldometers.info/coronavirus/usa/california/#graph-cases-daily"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
all_scripts = soup.find_all('script')
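A sketch of how this could continue, reusing the regex approach from the answer above (assuming the California page embeds the same graph-cases-daily Highcharts block):

import re

for script in all_scripts:
    if "Highcharts.chart('graph-cases-daily'" in str(script):
        json_str = str(script)
        # same pattern as above: grab the data array that follows the 7-day moving average series
        match = re.search(r"(name: '7-day moving average')[\s\S\W\w]*(data:[\s\S\W\w]*\d\])", json_str, re.IGNORECASE)
        if match:
            values = match.group(2).split('data:')[-1].strip().replace('[','').replace(']','').split(',')
            print(values[-1])  # most recent day's 7-day moving average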

How to scrape a website table where the cell values have the same class name?

I am trying to scrape a (football squad) table from Transfermarkt.com for a project but some columns have the same class name and cannot be differentiated.
Columns [2,10] have unique classes and work fine. I am struggling to find a way to get the rest.
from bs4 import BeautifulSoup
import requests  # needed for requests.get below
import pandas as pd

headers = {'User-Agent':
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
page = "https://www.transfermarkt.com/hertha-bsc-u17/kader/verein/21066/saison_id/2018/plus/1"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
Players = pageSoup.find_all("a", {"class": "spielprofil_tooltip"})
Values = pageSoup.find_all("td", {"class": "zentriert"})
PlayersList = []
ValuesList = []
for i in range(0, 25):
    PlayersList.append(Players[i].text)
    ValuesList.append(Values[i].text)
df = pd.DataFrame({"Players": PlayersList, "Values": ValuesList})
I would like to scrape all columns on rows of that table.
I would get all <tr> rows and then use a for loop to get all <td> cells in each row. Then I can use an index to pick a column and use different methods to get the value from that column.
import requests
from bs4 import BeautifulSoup
import pandas as pd
data = {
    'name': [],
    'data of birth': [],
    'height': [],
    'foot': [],
    'joined': [],
    'contract until': [],
}
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'
}
url = "https://www.transfermarkt.com/hertha-bsc-u17/kader/verein/21066/saison_id/2018/plus/1"
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
all_tr = soup.find_all('tr', {'class': ['odd', 'even']})
print('rows:', len(all_tr))

for row in all_tr:
    all_td = row.find_all('td', recursive=False)
    print('columns:', len(all_td))
    for column in all_td:
        print(' >', column.text)
    data['name'].append(all_td[1].text.split('.')[0][:-1])
    data['data of birth'].append(all_td[2].text[:-5])
    data['height'].append(all_td[4].text)
    data['foot'].append(all_td[5].text)
    data['joined'].append(all_td[6].text)
    data['contract until'].append(all_td[8].text)

df = pd.DataFrame(data)
print(df.head())
Result:
name data of birth height foot joined contract until
0 Kilian Schubert Sep 9, 2002 1,80 m right Jul 1, 2018 -
1 Raphael Bartell Jan 26, 2002 1,82 m - Jul 1, 2018 -
2 Till Aufderheide Jun 15, 2002 1,79 m - Jul 1, 2018 -
3 Milan Kremenovic Mar 8, 2002 1,91 m - Jul 1, 2018 30.06.2020
4 Adnan Alagic Jul 4, 2002 1,86 m right Jul 1, 2018 30.06.2021
Using bs4, pandas and CSS selectors. This separates out position (e.g. goalkeeper) from name. It doesn't include market value, as no values are given. For any given player it shows all nationality values "/"-separated, and likewise all "signed from" values "/"-separated. Columns with the same class can be differentiated by nth-of-type.
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
headers = {'User-Agent' : 'Mozilla/5.0'}
df_headers = ['position_number' , 'position_description' , 'name' , 'dob' , 'nationality' , 'height' , 'foot' , 'joined' , 'signed_from' , 'contract_until']
r = requests.get('https://www.transfermarkt.com/hertha-bsc-u17/kader/verein/21066/saison_id/2018/plus/1', headers = headers)
soup = bs(r.content, 'lxml')
position_number = [item.text for item in soup.select('.items .rn_nummer')]
position_description = [item.text for item in soup.select('.items td:not([class])')]
name = [item.text for item in soup.select('.hide-for-small .spielprofil_tooltip')]
dob = [item.text for item in soup.select('.zentriert:nth-of-type(3):not([id])')]
nationality = ['/'.join([i['title'] for i in item.select('[title]')]) for item in soup.select('.zentriert:nth-of-type(4):not([id])')]
height = [item.text for item in soup.select('.zentriert:nth-of-type(5):not([id])')]
foot = [item.text for item in soup.select('.zentriert:nth-of-type(6):not([id])')]
joined = [item.text for item in soup.select('.zentriert:nth-of-type(7):not([id])')]
signed_from = ['/'.join([item['title'].lstrip(': '), item['alt']]) for item in soup.select('.zentriert:nth-of-type(8):not([id]) [title]')]
contract_until = [item.text for item in soup.select('.zentriert:nth-of-type(9):not([id])')]
df = pd.DataFrame(list(zip(position_number, position_description, name, dob, nationality, height, foot, joined, signed_from, contract_until)), columns = df_headers)
print(df.head())
Example df.head output (screenshot omitted).
