This issue is driving me crazy. I am trying to pull text from the Pro Football Reference website.
The information I need is the QB hurries value in the second section of the web page. It is in a td element whose data-stat attribute is qb_hurry. Here is what I have so far:
res = requests.get('https://www.pro-football-reference.com/players/D/DonaAa00.htm')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
I tried
totalQbHurrys = soup.find('div', {'id':'all_detailed_defense'})
and I can see the information I need to pull when I parse through the beautiful soup object and print it. But when I try to retrieve the td element I need
totalQbHurrys = soup.find('div', {'id':'all_detailed_defense'}).find('td', {'data-stat':'qb_hurry'})
it returns None. I think the text I am looking for exists as a comment first, but I am having trouble getting to the actual HTML element I need. Would anyone know of a way to target the qb_hurry element successfully?
The issue is that this field is inside an HTML comment.
Here is a solution:
import bs4
import requests
res = requests.get('https://www.pro-football-reference.com/players/D/DonaAa00.htm')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
extract = soup.find('div', {'id':'all_detailed_defense'})
# Pull each comment out of the div; the table markup lives inside the comment
for comments in extract.find_all(string=lambda text: isinstance(text, bs4.Comment)):
    comments.extract()
# Parse the comment text (the last one found) as HTML in its own right
soup2 = bs4.BeautifulSoup(comments, 'html.parser')
totalQbHurrys = soup2.find('td', {'data-stat':'qb_hurry'})
print(totalQbHurrys)
PS: I used this trick: https://stackoverflow.com/a/52874885/2186074
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
import pandas as pd
options = Options()
options.add_argument('--headless')
driver = webdriver.Firefox(options=options)
driver.get("https://www.pro-football-reference.com/players/D/DonaAa00.htm")
df = pd.read_html(driver.page_source, attrs={
    'class': 'row_summable sortable stats_table now_sortable'}, header=1)[0]
print(df.loc[1, 'Hrry'])
driver.quit()
Output:
32
The HTML you need is inside a comment, so it is not directly visible in the soup. You need to first grab the comment and then parse it as a new soup object. From this you can then locate the tr, td and th elements. For example:
from bs4 import BeautifulSoup, Comment
import requests
res = requests.get('https://www.pro-football-reference.com/players/D/DonaAa00.htm')
soup = BeautifulSoup(res.text, 'html.parser')
div = soup.find('div', {'id':'all_detailed_defense'})
comment_html = div.find(string=lambda text: isinstance(text, Comment))
comment_soup = BeautifulSoup(comment_html, 'html.parser')
for tr in comment_soup.find_all('tr'):
    row = [td.text for td in tr.find_all(['td', 'th'])]
    print(row)
Giving you:
['', 'Games', 'Pass Coverage', 'Pass Rush', 'Tackles']
['Year', 'Age', 'Tm', 'Pos', 'No.', 'G', 'GS', 'Int', 'Tgt', 'Cmp', 'Cmp%', 'Yds', 'Yds/Cmp', 'Yds/Tgt', 'TD', 'Rat', 'DADOT', 'Air', 'YAC', 'Bltz', 'Hrry', 'QBKD', 'Sk', 'Prss', 'Comb', 'MTkl', 'MTkl%']
['2018*+', '27', 'LAR', 'DT', '99', '16', '16', '0', '1', '0', '0.0%', '0', '', '0.0', '0', '39.6', '-2.0', '0', '0', '0', '30', '19', '20.5', '70', '59', '6', '9.2%']
['2019*+', '28', 'LAR', 'DT', '99', '16', '16', '0', '0', '0', '', '0', '', '', '0', '', '', '0', '0', '0', '32', '9', '12.5', '55', '48', '6', '11.1%']
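The same two-pass idea can be tried offline without any third-party package. The following is only a minimal sketch that swaps Beautiful Soup for the standard library's html.parser; SAMPLE_HTML is a hypothetical, heavily trimmed stand-in for the real page, and the class and function names are made up for illustration. Unlike the answers above, it looks at every comment in the document rather than only those inside the all_detailed_defense div.

```python
from html.parser import HTMLParser

# Hypothetical, heavily trimmed stand-in for the real page markup.
SAMPLE_HTML = """
<div id="all_detailed_defense">
<!--
<table id="detailed_defense">
<tr><th data-stat="year_id">2019</th><td data-stat="qb_hurry">32</td></tr>
</table>
-->
</div>
"""


class CommentCollector(HTMLParser):
    """Collect the raw text of every HTML comment in the document."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)


class StatCollector(HTMLParser):
    """Collect the text of every <td> whose data-stat equals `stat`."""

    def __init__(self, stat):
        super().__init__()
        self.stat = stat
        self.values = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        if tag == "td" and dict(attrs).get("data-stat") == self.stat:
            self._inside = True
            self.values.append("")

    def handle_data(self, data):
        if self._inside:
            self.values[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._inside = False


def find_stat_in_comments(html, stat):
    # First pass: pull the comment text out of the page.
    outer = CommentCollector()
    outer.feed(html)
    outer.close()
    # Second pass: parse each comment as HTML and look for the cell.
    inner = StatCollector(stat)
    for comment in outer.comments:
        inner.feed(comment)
    inner.close()
    return [value.strip() for value in inner.values]


print(find_stat_in_comments(SAMPLE_HTML, "qb_hurry"))
```

On the real page you would pass res.text instead of SAMPLE_HTML; whether the live markup still matches this shape is an assumption.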
I'm trying to create a PDF reader in Python. I have already read the PDF and
have a list with its content. Now I want to get back the numbers with eleven digits, like 123.456.789-33 or 124.323.432.33
from PyPDF2 import PdfReader
import re
reader = PdfReader(r"\\abcdacd.pdf")
number_of_pages = len(reader.pages)
page = reader.pages[0]
text = page.extract_text()
num = re.findall(r'\d+', text)
print(num)
Here's the output:
['01', '01', '2000', '26', '12', '2022', '04483203983', '044', '832', '039', '83', '20210002691450', '5034692', '79', '2020', '8', '24', '0038', '1', '670', '03', '2', '14', '2', '14', '1', '670', '03', '2', '14', '2', '14', '1', '1', '8', '21', '1']
If someone could help me, I'll be really thankful.
Change the regex pattern to the following (to match groups of digits):
s = 'text text 123.456.789-33 or 124.323.432.33 text or 12323112333 or even 123,231,123,33 '
num = re.findall(r'\d{3}[.,]?\d{3}[.,]?\d{3}[.,-]?\d{2}', s)
print(num)
['123.456.789-33', '124.323.432.33', '12323112333', '123,231,123,33']
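One caveat with the pattern above: the PDF output in the question also contains a 14-digit token, '20210002691450', and nothing stops the pattern from matching the first eleven digits of such a longer run. A possible refinement, offered only as a sketch, is to wrap the same pattern in lookarounds. The sample text below is made up from tokens in the question's output, and its separators are an assumption.

```python
import re

# Same pattern as above, wrapped in lookarounds so a match cannot start or
# end in the middle of a longer run of digits.
PATTERN = re.compile(r'(?<!\d)\d{3}[.,]?\d{3}[.,]?\d{3}[.,-]?\d{2}(?!\d)')

# Illustrative text; the separators in the second number are assumed.
text = 'id 04483203983 or 044.832.039-83, protocol 20210002691450'

matches = PATTERN.findall(text)
print(matches)
```

Here the two eleven-digit numbers are returned and the 14-digit token is skipped.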
You can try:
\b(?:\d[.-]*){11}\b
import re
s = '''\
123.456.789-33
124.323.432.33
111-2-3-4-5-6-7-8-9'''
pat = re.compile(r'\b(?:\d[.-]*){11}\b')
for m in pat.findall(s):
    print(m)
Prints:
123.456.789-33
124.323.432.33
111-2-3-4-5-6-7-8-9
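If the goal is to compare or store the numbers afterwards, it may help to strip the separators from each match. This is a small sketch built on the pattern above; extract_ids is a hypothetical helper name, not something from the question.

```python
import re

PATTERN = re.compile(r'\b(?:\d[.-]*){11}\b')


def extract_ids(text):
    """Return each 11-digit match with every non-digit character removed."""
    return [re.sub(r'\D', '', match) for match in PATTERN.findall(text)]


print(extract_ids('123.456.789-33 and 124.323.432.33'))
```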
I am trying to use beautiful soup to pull the table corresponding to the HTML code below
<table class="sortable stats_table now_sortable" id="team_pitching" data-cols-to-freeze=",2">
<caption>Team Pitching</caption>
from https://www.baseball-reference.com/register/team.cgi?id=17cdc2d2. Here is a screenshot of the site layout and HTML code I am trying to extract from.
I was using the code
url = 'https://www.baseball-reference.com/register/team.cgi?id=17cdc2d2'
res = requests.get(url)
soup1 = BS(res.content, 'html.parser')
table1 = soup1.find('table',{'id':'team_pitching'})
table1
I can't seem to figure out how to get this working. The table above can be extracted with the line
table1 = soup1.find('table',{'id':'team_batting'})
and I figured similar code should work for the one below. Additionally, is there a way to extract this using the table class "sortable stats_table now_sortable" rather than id?
The problem is that if you open the page normally it shows all the tables, but if you load the page with Developer Tools open, only the first table is shown. So when you make your request, the remaining tables are not included in the HTML you get back. The table you're looking for is not shown until the "Show team pitching" button is pressed; to press it you could use Selenium and then get the full HTML response.
That is because the table you are looking for - i.e. <table> with id="team_pitching" is present as a comment inside the soup. You can check it for yourself by printing soup.
You need to:
1. Extract that comment from the soup.
2. Convert it into a soup object.
3. Extract the table data from the soup object.
Here is the complete code that does the above mentioned steps.
from bs4 import BeautifulSoup, Comment
import requests
url = 'https://www.baseball-reference.com/register/team.cgi?id=17cdc2d2'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
main_div = soup.find('div', {'id': 'all_team_pitching'})
# Extracting the comment from the above selected <div>
for comments in main_div.find_all(string=lambda x: isinstance(x, Comment)):
    temp = comments.extract()
# Converting the above extracted comment to a soup object
s = BeautifulSoup(temp, 'lxml')
trs = s.find('table', {'id': 'team_pitching'}).find_all('tr')
# Printing four rows of the table, skipping trs[0]
for tr in trs[1:5]:
    print(list(tr.stripped_strings))
The first four entries from the table:
['1', 'Tyler Ahearn', '21', '1', '0', '1.000', '1.93', '6', '0', '0', '1', '9.1', '8', '5', '2', '0', '4', '14', '0', '0', '0', '42', '1.286', '7.7', '0.0', '3.9', '13.5', '3.50']
['2', 'Jack Anderson', '20', '2', '0', '1.000', '0.79', '4', '1', '0', '0', '11.1', '6', '4', '1', '0', '3', '11', '1', '0', '0', '45', '0.794', '4.8', '0.0', '2.4', '8.7', '3.67']
['3', 'Shane Drohan', '*', '21', '0', '1', '.000', '4.08', '4', '4', '0', '0', '17.2', '15', '12', '8', '0', '11', '27', '1', '0', '2', '82', '1.472', '7.6', '0.0', '5.6', '13.8', '2.45']
['4', 'Conor Grady', '21', '2', '0', '1.000', '3.00', '4', '4', '0', '0', '15.0', '10', '5', '5', '3', '8', '15', '1', '0', '2', '68', '1.200', '6.0', '1.8', '4.8', '9.0', '1.88']
I'm trying to extract a table from several web pages, storing the headers as keys and the body cells as values, with the results kept separate for each page. Here's what I have tried:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
s=Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=s)
driver.maximize_window()
driver.get('https://www.google.com')
all_data = []
for i in range(1,6):
    url = "https://www.transfermarkt.co.uk/silvio-adzic/profil/spieler/{}".format(i)
    driver.get(url)
    time.sleep(3)
    data = {}
    soup = BeautifulSoup(driver.page_source, 'html5lib')
    print(f"In Page {i}")
    for th in soup.select("#yw2 tr"):
        data[th.get_text(strip = True)] = th.find_next('td').get_text(strip=True)
    all_data.append(data)
However this produces a jumbled dictionary:
[{'competitionwettbewerb': 'Total :',
'Total :25169524616.948': 'Total :',
'Regionalliga Süd': 'Regionalliga Süd',
'Regionalliga Süd7922-2876.318': '',
'2. Bundesliga': '2. Bundesliga',
'2. Bundesliga60933873.487': '',
'RL West-Südwest': 'RL West-Südwest',
'RL West-Südwest5818-1943.493': '',
'Oberliga Südwest': 'Oberliga Südwest',
'Oberliga Südwest2015-1101.649': '',
'Bundesliga': 'Bundesliga',
'Bundesliga1212355355': '',
..
..
..
Expected outcome: is there a way to separate these for each page that's extracted, so that the result looks something like this?
[{'p1':{'competition': ["Regionalliga Süd", "2. Bundesliga", ...],
'Appearances': [79, 60,...],
'Goals':[22, 9,...],
'Assists':[-, 3, ...]
...},
'p2':{'competition': ["Bundesliga", "2. Bundesliga", ...],
'Appearances': [262, 98,...],
'Goals':[62, 18,...],
'Assists':[79, -, ...]
...}}]
This needs more complex code that works with every row and every cell separately.
First I create a place for all the data:
data = {'competition': [], 'Appearances': [], 'Goals':[], 'Assists':[]}
Next I use a for-loop to get the rows in the table.
But there are two problems:
- Some tr elements are empty, but they don't have a class, so it is easy to skip them. I also use tbody to skip the row in the header.
- Some pages use the ID yw2 and others yw1, but both are in a div with data-viewport=Leistungsdaten_Saison, so it is easy to get the correct table.
for tr in soup.select("div[data-viewport=Leistungsdaten_Saison] tbody tr[class]"):
Next I get all the cells in the row and put the values in the correct lists.
cells = tr.find_all('td')
#print(cells)
data['competition'].append(cells[1].get_text(strip=True))
data['Appearances'].append(cells[2].get_text(strip=True))
data['Goals'] .append(cells[3].get_text(strip=True))
data['Assists'] .append(cells[4].get_text(strip=True))
And finally I put data in all_data under the key p1, p2, etc.
all_data.append({'p{}'.format(i): data})
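The line above yields a list of one-key dictionaries. If looking a page up by its key matters, the list can be folded into one plain dictionary afterwards. This is only a sketch: the all_data sample below is a hypothetical, shortened copy of that shape, and by_page is a made-up name.

```python
# Hypothetical sample shaped like the all_data list built above.
all_data = [
    {'p1': {'competition': ['Regionalliga Süd'], 'Goals': ['22']}},
    {'p2': {'competition': ['Bundesliga'], 'Goals': ['62']}},
]

# Fold the one-key dictionaries into a single dictionary keyed by page.
by_page = {}
for entry in all_data:
    by_page.update(entry)

print(by_page['p2']['Goals'])
```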
Full working code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import time
from bs4 import BeautifulSoup
s = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=s)
driver.maximize_window()
all_data = []
for i in range(1, 6):
    print('--- page', i, '---')
    url = "https://www.transfermarkt.co.uk/silvio-adzic/profil/spieler/{}".format(i)
    driver.get(url)
    time.sleep(5)
    soup = BeautifulSoup(driver.page_source, 'html5lib')
    data = {'competition': [], 'Appearances': [], 'Goals':[], 'Assists':[]}
    for tr in soup.select("div[data-viewport=Leistungsdaten_Saison] tbody tr[class]"):
        cells = tr.find_all('td')
        #print(cells)
        data['competition'].append(cells[1].get_text(strip=True))
        data['Appearances'].append(cells[2].get_text(strip=True))
        data['Goals'].append(cells[3].get_text(strip=True))
        data['Assists'].append(cells[4].get_text(strip=True))
    all_data.append({'p{}'.format(i): data})
# --- display ---
for player in all_data:
    name, data = list(player.items())[0]
    print('---', name, '---')
    for key, value in data.items():
        print(key, value)
Result:
--- p1 ---
competition ['Regionalliga Süd', '2. Bundesliga', 'RL West-Südwest', 'Oberliga Südwest', 'Bundesliga', 'NOFV-Oberliga Süd', 'Oberliga Bayern', 'DFB-Pokal', 'UEFA Cup']
Appearances ['79', '60', '58', '20', '12', '9', '6', '6', '1']
Goals ['22', '9', '18', '15', '1', '-', '2', '2', '-']
Assists ['-', '3', '-', '-', '2', '-', '-', '-', '-']
--- p2 ---
competition ['Bundesliga', '2. Bundesliga', 'DFB-Pokal', 'Champions League', '2. BL North', 'UEFA Cup', 'VL Südwest', '2. BL Nord Aufstiegsr.', 'Ligapokal', "Cup Winners' Cup", 'DFB-SuperCup', 'UI Cup', 'Champions League Qu.', 'Südwestpokal', 'Intertoto-Cup (until 94/95)']
Appearances ['262', '98', '38', '23', '21', '15', '11', '9', '6', '4', '2', '2', '1', '1', '0']
Goals ['62', '18', '9', '7', '3', '4', '1', '2', '2', '2', '-', '2', '-', '1', '-']
Assists ['79', '-', '5', '3', '-', '5', '-', '-', '-', '1', '2', '-', '-', '-', '-']
# ...
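The expected outcome in the question shows numbers rather than strings, while the scraped cells are all text and use '-' where there is no value. A small conversion step could look like the sketch below. to_int is a hypothetical helper, and mapping '-' to None (rather than 0) is an assumption about what is wanted.

```python
def to_int(cell):
    """Convert a scraped cell to int; '-' or an empty cell becomes None."""
    cell = cell.strip()
    if cell in ('', '-'):
        return None
    return int(cell)


# Values copied from the p1 'Goals' row in the result above.
goals = ['22', '9', '18', '15', '1', '-', '2', '2', '-']
goals_int = [to_int(g) for g in goals]

print(goals_int)
print(sum(g for g in goals_int if g is not None))
```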
I am trying to grab table data and found out that it is dynamic and comes from an iframe. My snippet does not work. Any ideas or help would be very useful.
from selenium import webdriver
import requests
from bs4 import BeautifulSoup
browser = webdriver.Chrome('C://Python38/chromedriver')
browser.get("https://poocoin.app/rugcheck/0xe56842ed550ff2794f010738554db45e60730371/top-holders")
url = "https://poocoin.app/rugcheck/0xe56842ed550ff2794f010738554db45e60730371/top-holders"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
t = soup.find('table', class_='table table-bordered table-condensed text-small')
trs = t.find('tbody').find_all('tr')
for tr in trs[:10]:
    print(list(tr.stripped_strings))
browser.quit()
Current Output/Error:
Traceback (most recent call last):
File "C:/Users/Acer/poocoin.py", line 8, in <module>
trs = t.find('tbody').find_all('tr')
AttributeError: 'NoneType' object has no attribute 'find'
The webpage is dynamic but the table is not a part of any <iframe>. The table is a part of the current webpage.
Here I have tried to extract the data from the table you need.
import requests
from bs4 import BeautifulSoup
import time
from selenium import webdriver
driver = webdriver.Chrome('chromedriver.exe')
url = 'https://poocoin.app/rugcheck/0xe56842ed550ff2794f010738554db45e60730371/top-holders'
driver.get(url)
time.sleep(8)
soup = BeautifulSoup(driver.page_source, 'lxml')
t = soup.find('table', class_='table table-bordered table-condensed text-small')
# Get all the rows from the table
trs = t.find_all('tr')
for tr in trs:
    print(list(tr.stripped_strings))
Output:
['Address', 'Track Wallet', 'Type', 'Amount', 'Transfer Count', 'Current Balance']
['0xe432afb7283a08be24e9038c30ca6336a7cc8218', 'Contract', '2,047,063,909.1119', '14488', '74,050,430.9257']
['0xa36b9dc17e421d86ddf8e490dafa87344e76125b', 'Track', 'Wallet', '1,000,000,000.0000', '1', '49,463,154.0462']
['0x0eb207b525dc856c3bad5bfd7a7a4aae781e1757', 'Contract', '800,000,000.0000', '1', '3,620,000.0000']
['0xeaed594b5926a7d5fbbc61985390baaf936a6b8d', 'Contract', '150,526,843.9538', '1', '150,000,000.0000']
['0xe56842ed550ff2794f010738554db45e60730371', 'Contract', '148,413,174.7757', '14495', '9,165,152.5432']
['0xbbda05ea467ad348212dade5c38c11910c14e83e', 'Track', 'Wallet', '65,442,888.2752', '2093', '61,246,203.1985']
['0x537d90d1d2743f44b65612c9fff3b6f011f65471', 'Track', 'Wallet', '42,871,267.9652', '1', '3,919,432.0622']
['0x2def4d262bc8d7456c8d59138760c992283abf80', 'Track', 'Wallet', '42,411,193.5197', '1', '0.0000']
['0xab2feac90728c278b30c6597760d74eb57b3726f', 'Track', 'Wallet', '42,411,193.5197', '1', '0.0000']
['0xc1e16013a158d57a60d6aa5bb3108722b0ac6df5', 'Contract', '27,101,254.6006', '18', '0.0000']
['0xcfdb8569fb546a010bb22b5057679c4053d4a231', 'Track', 'Wallet', '26,328,593.9564', '7', '11,493,129.6564']
['0x000159831a681a63b01911b9c162fbb8949976ba', 'Contract', '23,385,665.4517', '1', '0.4517']
['0x8f3e8ab6cc8b3d565564256cce95ba9f213c2a0d', 'Track', 'Wallet', '21,880,000.0000', '21', '0.0000']
['0xc590175e458b83680867afd273527ff58f74c02b', 'Contract', '20,386,615.2880', '173', '0.0000']
['0xdb6f1920a889355780af7570773609bd8cb1f498', 'Contract', '19,065,071.3715', '2', '0.0000']
['0x112ac5463b46ba4f32b95ae733f73c6e23bd3e53', 'Track', 'Wallet', '17,982,140.3274', '7', '0.0074']
['0xa8b398896d67cea6d26fc140e056f745261c4b00', 'Track', 'Wallet', '17,933,245.4310', '21', '9,024,167.7594']
['0x2368b6acc957339cf34a08a064830fcdfcac02c6', 'Track', 'Wallet', '17,675,714.5003', '1', '8.5003']
['0x7cacd11be7d7c95c48a0477875d31040ddaff2da', 'Track', 'Wallet', '17,467,065.9791', '1', '0.0000']
['0xc62184ac04a0610147bd890ba32d1918b67e017c', 'Track', 'Wallet', '16,841,462.6819', '1', '2.6819']
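The printed rows are lists of strings whose length varies: some carry a single 'Contract' entry after the address, others carry 'Track' and 'Wallet' as two separate strings. One way to normalise them, sketched here under the assumption that the last three strings are always amount, transfer count and current balance, is to read those from the end of the list. parse_row and the 'label' field name are hypothetical.

```python
from decimal import Decimal


def parse_row(cells):
    """Turn one printed row into a dict, reading the numbers from the end."""
    amount, transfers, balance = cells[-3:]
    return {
        'address': cells[0],
        # Whatever sits between the address and the numbers, joined up.
        'label': ' '.join(cells[1:-3]),
        'amount': Decimal(amount.replace(',', '')),
        'transfer_count': int(transfers),
        'current_balance': Decimal(balance.replace(',', '')),
    }


# Two rows copied from the output above.
rows = [
    ['0xe432afb7283a08be24e9038c30ca6336a7cc8218', 'Contract',
     '2,047,063,909.1119', '14488', '74,050,430.9257'],
    ['0xa36b9dc17e421d86ddf8e490dafa87344e76125b', 'Track', 'Wallet',
     '1,000,000,000.0000', '1', '49,463,154.0462'],
]

parsed = [parse_row(row) for row in rows]
for record in parsed:
    print(record)
```

Decimal is used instead of float so the four decimal places in the amounts are kept exactly.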
My understanding is that the page renders the iframe in a separate call, so the standard Soup call isn't finding it.
Given that you're importing selenium, have you tried switch_to?
Once you've switched to the iframe, you can read .page_source (a property, not a method) and use it as your bs4 input.
browser.switch_to.frame(your_frame_name)
r = browser.page_source
# page_source is already a string, so pass it straight to BeautifulSoup
soup = BeautifulSoup(r, 'lxml')
t = soup.find('table', class_='table table-bordered table-condensed text-small')
trs = t.find('tbody').find_all('tr')
for tr in trs[:10]:
    print(list(tr.stripped_strings))
browser.quit()
I have a number of pages containing statistics in lists that I am scraping. Everything is working except this one minor issue I cannot seem to resolve. In using the text of the data fields to find them, one heading that is very similar to another picks up the wrong value. Anyone know how to correct for this?
HTML looks like this:
<li><span class="bp3-tag p p-50">50</span> <span class="some explaining words.">Positioning</span>
<li><span class="bp3-tag p p-14">14</span> <span class="some other explaining words.">BB Positioning</span>
Code looks like this, and the output returns 14 for both values when it should return 50 for Positioning and 14 for BB Positioning...
fifa_stats = ['Positioning', 'BB Positioning']
url = urlopen(req)
soups = bs(url, 'lxml')
def statistics(soups):
    data = {}
    divs_without_skill = soups[1].find_all('div', {'class': 'col-3'})
    more_lis = [div.find_all('li') for div in divs_without_skill]
    lis = soups[0].find_all('li') + more_lis[0]
    for li in lis:
        for stats in fifa_stats:
            if stats in li.text:
                data[stats.replace(' ', '_').lower()] = str(
                    (li.text.split(' ')[0]).replace('\n', ''))
    return(data)
Any help greatly appreciated.
import requests
from bs4 import BeautifulSoup
from pprint import pp
def main(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    goal = {x.h5.text: [i.text for i in x.select(
        '.bp3-tag')] for x in soup.select('div.column.col-3')[7:-1]}
    pp(goal)
main('https://sofifa.com/player/244042/moussa-djitte/210049')
Output:
{'Attacking': ['56', '71', '64', '62', '53'],
'Skill': ['72', '46', '29', '36', '70'],
'Movement': ['78', '79', '83', '65', '74'],
'Power': ['67', '77', '74', '70', '59'],
'Mentality': ['51', '29', '69', '57', '65', '55'],
'Defending': ['33', '14', '16'],
'Goalkeeping': ['8', '8', '6', '15', '13']}
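As for why the question's code returns 14 for both headings: the test `stats in li.text` is a substring check, and 'Positioning' is a substring of 'BB Positioning', so the later row overwrites the earlier value. The sketch below reproduces that with plain (value, label) pairs, assumed to have been read already from the two span elements of each li, and shows an exact comparison next to it. The variable names are illustrative.

```python
# (value, label) pairs as they would be read from the two <span> elements
# of each <li> in the question's HTML.
pairs = [('50', 'Positioning'), ('14', 'BB Positioning')]
wanted = ['Positioning', 'BB Positioning']

# Substring test, as in the question: the 'BB Positioning' row also contains
# 'Positioning', so it overwrites the value stored for 'Positioning'.
by_substring = {}
for value, label in pairs:
    for stat in wanted:
        if stat in label:
            by_substring[stat] = value

# Exact comparison (list membership) keeps the two labels apart.
by_equality = {label: value for value, label in pairs if label in wanted}

print(by_substring)
print(by_equality)
```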