I'm using the following code but still keep getting an empty set. Any ideas?
from bs4 import BeautifulSoup as bs
import pandas as pd
import requests, time, os, html5lib
base_site = "https://www.espn.com/college-football/scoreboard/_/year/2019/seasontype/2/week/14"
response = requests.get(base_site)
soup = bs(response.text,"html.parser")
soup
# Find all links on the page
game = soup.find_all("section", class_="sb-score.final")
game
Here is what I'm seeing on the site:
The most likely issue is that the element has multiple classes: class_="sb-score.final" looks for a single class literally named "sb-score.final", while the element's class attribute is actually "sb-score final". You could try:
find_all('section', class_=['sb-score', 'final'])
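Note that passing a list to class_ matches elements that have either class, while a CSS selector can require both at once. A quick sketch on inline markup (invented for illustration):

```python
from bs4 import BeautifulSoup

html = """
<section class="sb-score final">final game</section>
<section class="sb-score">in progress</section>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ with a list matches EITHER class, so this returns both sections
either = soup.find_all("section", class_=["sb-score", "final"])

# a CSS selector requires BOTH classes on the same element
both = soup.select("section.sb-score.final")

print(len(either))  # 2
print(len(both))    # 1
```

If you only want sections that are both "sb-score" and "final", the select() form is the safer bet.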
Related
So I am trying to scrape the price of a product on Aliexpress. I tried inspecting the element which looks like
<span class="product-price-value" itemprop="price" data-spm-anchor-id="a2g0o.detail.1000016.i3.fe3c2b54yAsLRn">US $14.43</span>
I'm trying to run the following code
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
url = 'https://www.aliexpress.com/item/32981494236.html?spm=a2g0o.productlist.0.0.44ba26f6M32wxY&algo_pvid=520e41c9-ba26-4aa6-b382-4aa63d014b4b&algo_expid=520e41c9-ba26-4aa6-b382-4aa63d014b4b-22&btsid=0bb0623b16170222520893504e9ae8&ws_ab_test=searchweb0_0,searchweb201602_,searchweb201603_'
source = urlopen(url).read()
soup = BeautifulSoup(source, 'lxml')
soup.find('span', class_='product-price-value')
but I keep getting a blank output. I must be doing something wrong but these methods seem to work in the tutorials I've seen.
Here's what I found. If I understood correctly, the page you linked is rendered by scripts: the original response doesn't contain the price itself, just script tags, so I used split to pull it out of one of them. Here is my code:
from bs4 import BeautifulSoup
import requests
url = 'https://aliexpress.ru/item/1005002281350811.html?spm=a2g0o.productlist.0.0.42d53b59T5ddTM&algo_pvid=f3c72fef-c5ab-44b6-902c-d7d362bcf5a5&algo_expid=f3c72fef-c5ab-44b6-902c-d7d362bcf5a5-1&btsid=0b8b035c16170960366785062e33c0&ws_ab_test=searchweb0_0,searchweb201602_,searchweb201603_&sku_id=12000019900010138'
data = requests.get(url)
soup = BeautifulSoup(data.content, features="lxml")
res = soup.findAll("script")
total_value = str(res[-3]).split("totalValue:")[1].split("}")[0].replace("\"", "").replace(".", "").strip()
print(total_value)
It works fine; I tried it on a few pages from Ali.
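A slightly more robust variant of the same idea is a regular expression instead of chained splits, so the extraction doesn't break if the script's layout shifts. Here's a sketch against an inline snippet (the script content below is invented to mimic the embedded totalValue field):

```python
import re
from bs4 import BeautifulSoup

html = """
<html><body>
<script>window.runParams = { totalValue: "14.43", currency: "USD" };</script>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

price = None
for script in soup.find_all("script"):
    text = script.string or ""
    # capture whatever sits between the quotes after totalValue:
    match = re.search(r'totalValue:\s*"([^"]+)"', text)
    if match:
        price = match.group(1)
        break

print(price)  # 14.43
```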
When I request the page and search with class_='even', I end up just receiving [] as my result.
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
url = 'https://www.worldometers.info/coronavirus/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.findAll('tr', class_='even'))
This is my result
[]
I tried looking in a lot of places but I couldn't find out why. The HTML code is really long as there is a lot of data.
I am not sure that this is a universal solution, but this is what worked for my project. I used @ahmedamerican's solution with a few changes in order to fix it.
import requests
import pandas as pd
import lxml
r = requests.get("https://www.worldometers.info/coronavirus/")
df = pd.read_html(r.content)[0]
print(df)
The only change: instead of doing print(type(df)) as he suggested, I did print(df).
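The likely reason the original attempt returned [] is that striping classes like even/odd are often added by JavaScript after the page loads, so they are missing from the static HTML that requests receives. A quick illustration on a static snippet (invented markup):

```python
from bs4 import BeautifulSoup

# what requests sees: rows without the striping classes
static_html = """
<table>
  <tr><td>row 1</td></tr>
  <tr><td>row 2</td></tr>
</table>
"""
soup = BeautifulSoup(static_html, "html.parser")

print(soup.find_all("tr", class_="even"))  # [] -- the class never existed server-side
print(len(soup.find_all("tr")))           # 2  -- the rows themselves are there
```

Searching on the tag alone (or using pd.read_html as above) sidesteps the missing classes entirely.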
I am trying to scrape box-score data from ProFootball reference. After running into issues with javascript, I turned to selenium to get the initial soup object. I'm trying to find a specific table on a website and subsequently iterate through its rows.
The code works if I simply use find_all('table')[#], but the # changes depending on which box score I am looking at, so it isn't reliable. I therefore want to use the id='player_offense' attribute to identify the same table across games, but when I use it, it returns nothing. What am I missing here?
from selenium import webdriver
import os
from bs4 import BeautifulSoup
#path to chromedriver
chrome_path = os.path.expanduser('~/Documents/chromedriver.exe')
driver = webdriver.Chrome(chrome_path)
driver.get('https://www.pro-football-reference.com/boxscores/201709070nwe.htm')
soup = BeautifulSoup(driver.page_source,'lxml')
driver.quit()
#doesn't work
soup.find('table',id='player_offense')
#works
table = soup.find_all('table')[3]
The data is inside HTML comments. Find the appropriate comment, then extract the table:
import requests
from bs4 import BeautifulSoup as bs
from bs4 import Comment
import pandas as pd
r= requests.get('https://www.pro-football-reference.com/boxscores/201709070nwe.htm#')
soup = bs(r.content, "lxml")
comments = soup.find_all(string=lambda text:isinstance(text,Comment))
for comment in comments:
    if 'id="player_offense"' in comment:
        print(pd.read_html(comment)[0])
        break
This also works.
from requests_html import HTMLSession, HTML
import pandas as pd
with HTMLSession() as s:
    r = s.get('https://www.pro-football-reference.com/boxscores/201709070nwe.htm')
    r = HTML(html=r.text)
    r.render()
    table = r.find('table#player_offense', first=True)
    df = pd.read_html(table.html)[0]
    print(df)
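The comment trick is easier to see on a self-contained snippet. The markup below is invented, mimicking how sports-reference wraps tables in HTML comments; BeautifulSoup exposes the comment as a Comment string that can be re-parsed:

```python
from bs4 import BeautifulSoup, Comment

html = """
<div id="all_player_offense">
<!-- <table id="player_offense"><tr><td>Tom Brady</td></tr></table> -->
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Comment is a string subclass, so we can filter strings by type
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
table = None
for comment in comments:
    if 'id="player_offense"' in comment:
        # re-parse the comment text as HTML to get a real table element
        table = BeautifulSoup(comment, "html.parser").find("table", id="player_offense")
        break

print(table.td.text)  # Tom Brady
```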
https://en.wikipedia.org/wiki/Economy_of_the_European_Union
Above is the link to the website, and I want to scrape this table: Fortune top 10 E.U. corporations by revenue (2016).
Here is my attempt so far:
import requests
from bs4 import BeautifulSoup
def web_crawler(url):
    page = requests.get(url)
    plain_text = page.text
    soup = BeautifulSoup(plain_text, "html.parser")
    tables = soup.findAll("tbody")[1]
    print(tables)

soup = web_crawler("https://en.wikipedia.org/wiki/Economy_of_the_European_Union")
Following what @FanMan said, this is simple code to help you get started; keep in mind that you will need to clean it up and perform the rest of the work on your own.
import requests
from bs4 import BeautifulSoup
url='https://en.wikipedia.org/wiki/Economy_of_the_European_Union'
r=requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
temp_datastore = list()
for text in soup.findAll('p'):
    w = text.findAll(text=True)
    if len(w) > 0:
        temp_datastore.append(w)
Some documentation:
Beautiful Soup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
requests: http://docs.python-requests.org/en/master/user/intro/
urllib: https://docs.python.org/2/library/urllib.html
Your first issue is that your url is not properly defined. After that you need to find the table to extract and its class. In this case the class was "wikitable" and it was the first table. I have started your code for you so it gives you the extracted data from the table. Web scraping is good to learn, but if you are just starting to program, practice with some simpler stuff first.
import requests
from bs4 import BeautifulSoup
def webcrawler():
    url = "https://en.wikipedia.org/wiki/Economy_of_the_European_Union"
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "html.parser")
    tables = soup.findAll("table", class_='wikitable')[0]
    print(tables)

webcrawler()
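If pandas is available, pd.read_html can also pick out a table by its text via the match parameter, which avoids counting tables by index. A minimal sketch on inline HTML (the captions and cell values below are made up):

```python
import io
import pandas as pd

html = """
<table><caption>Other table</caption><tr><th>a</th></tr><tr><td>1</td></tr></table>
<table><caption>Fortune top 10</caption>
  <tr><th>Corporation</th><th>Revenue</th></tr>
  <tr><td>Example Corp</td><td>100</td></tr>
</table>
"""
# match keeps only tables whose text matches the given string/regex
tables = pd.read_html(io.StringIO(html), match="Fortune top 10")
df = tables[0]
print(df)
```

On the real Wikipedia page, matching on part of the caption text ("Fortune top 10") should select the table directly.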
I am running into some trouble scraping a table using BeautifulSoup. Here is my code
from urllib.request import urlopen
from bs4 import BeautifulSoup
site = "http://www.sports-reference.com/cbb/schools/clemson/2014.html"
page = urlopen(site)
soup = BeautifulSoup(page,"html.parser")
stats = soup.find('table', id = 'totals')
print(stats)
None
When I right click on the table to inspect the element the HTML looks as I'd expect, however when I view the source the only element with id = 'totals' is commented out. Is there a way to scrape a table from the commented source code?
I have referenced this post but can't seem to replicate their solution.
Here is a link to the webpage I am interested in. I'd like to scrape the table labeled "Totals" and store it as a data frame.
I am relatively new to Python, HTML, and web scraping. Any help would be greatly appreciated.
Thanks in advance.
Michael
Comments are string instances in BeautifulSoup, so you can use BeautifulSoup's find method with a regular expression to locate the particular string you're after. Once you have the string, have BeautifulSoup parse it and there you go.
In other words,
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup
site = "http://www.sports-reference.com/cbb/schools/clemson/2014.html"
page = urlopen(site)
soup = BeautifulSoup(page,"html.parser")
stats_html = soup.find(string=re.compile('id="totals"'))
stats_soup = BeautifulSoup(stats_html, "html.parser")
print(stats_soup.table.caption.text)
You can also do this:
from urllib.request import urlopen
from bs4 import BeautifulSoup
site = "http://www.sports-reference.com/cbb/schools/clemson/2014.html"
page = urlopen(site)
soup = BeautifulSoup(page, "lxml")
stats = soup.find_all('div', id='all_totals')
print(stats)
Please let me know if this helped!