When requesting rows from a website using class_='even', I just receive '[]' as my result.
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
url = 'https://www.worldometers.info/coronavirus/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.findAll('tr', class_='even'))
This is my result
[]
I looked in a lot of places but couldn't find out why. The page's HTML is really long, as there is a lot of data.
I am not sure this is a universal solution, but this is what worked for my project. I used @ahmedamerican's solution with a few changes to fix it.
import requests
import pandas as pd
import lxml
r = requests.get("https://www.worldometers.info/coronavirus/")
df = pd.read_html(r.content)[0]
print(df)
The only change: instead of doing print(type(df)) as he said, I did print(df).
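When find_all returns [], a quick sanity check is whether the literal class string appears in the raw response at all; if it doesn't, no selector against it can match. A minimal sketch, where the HTML string is a stand-in for response.text:

```python
from bs4 import BeautifulSoup

# Stand-in for response.text; with the real page, test whether the
# literal string 'class="even"' occurs anywhere in the fetched HTML.
html = '<table><tr><td>Country</td></tr></table>'
print('class="even"' in html)              # False: the selector cannot match

soup = BeautifulSoup(html, "html.parser")
print(soup.find_all("tr", class_="even"))  # []
```

If the check is False, the class name either changed or the rows are injected after page load, and a different approach (like pd.read_html above) is needed.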
I'm using the following code but still keep getting an empty list. Any ideas?
from bs4 import BeautifulSoup as bs
import pandas as pd
import requests, time, os, html5lib
base_site = "https://www.espn.com/college-football/scoreboard/_/year/2019/seasontype/2/week/14"
response = requests.get(base_site)
soup = bs(response.text,"html.parser")
soup
# Find all links on the page
game = soup.find_all("section", class_="sb-score.final")
game
Here is what I'm seeing on the site:
The most likely issue is that there are multiple classes; you could try:
soup.find_all('section', class_=['sb-score', 'final'])
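The difference matters: passing a list to class_ matches a tag carrying any of the listed classes, while a CSS selector with chained classes requires the tag to carry both. A small illustration on toy markup (not the real ESPN page):

```python
from bs4 import BeautifulSoup

html = '''
<section class="sb-score final">done</section>
<section class="sb-score">in progress</section>
'''
soup = BeautifulSoup(html, "html.parser")

# A list for class_ matches tags carrying ANY of the listed classes.
both_or_either = soup.find_all("section", class_=["sb-score", "final"])
print(len(both_or_either))  # 2

# A CSS selector with chained classes requires BOTH on the same tag.
finished = soup.select("section.sb-score.final")
print(len(finished))  # 1
```

So class_=['sb-score', 'final'] is the looser match; use select("section.sb-score.final") if you only want sections that are both.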
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
url = "https://www.covid19india.org/"
headers = {"Accept-Language":"en-US, en;q=0.5"}
results = requests.get(url,headers = headers)
soup = BeautifulSoup(results.text,"html.parser")
cases_div = soup.find_all('div', class_="Level1")
print(cases_div)
My expected output is the HTML.
However, I am getting an empty list while printing cases_div.
Why is that and how can I fix it?
It seems the site is built with React, so the first HTTP request does not return the full content.
Try selenium, or look for the API requests the page makes to the server, as suggested in the comments.
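If you spot the JSON endpoint in the browser's network tab, plain requests plus json is usually enough and no rendering is needed. A sketch with a stand-in payload (the field names here are assumptions for illustration, not the site's real API):

```python
import json

# Stand-in for the body of an API response found via the network tab;
# with the real site you would fetch it with requests.get(endpoint).
payload = '{"statewise": [{"state": "Total", "confirmed": "1234"}]}'
data = json.loads(payload)
for entry in data["statewise"]:
    print(entry["state"], entry["confirmed"])  # Total 1234
```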
I am trying to scrape box-score data from ProFootball reference. After running into issues with javascript, I turned to selenium to get the initial soup object. I'm trying to find a specific table on a website and subsequently iterate through its rows.
The code works if I simply use find_all('table')[#]; however, the # changes depending on which box score I am looking at, so it isn't reliable. I therefore want to use the id='player_offense' attribute to identify the same table across games, but when I use it, it returns nothing. What am I missing here?
from selenium import webdriver
import os
from bs4 import BeautifulSoup
#path to chromedriver
chrome_path = os.path.expanduser('~/Documents/chromedriver.exe')
driver = webdriver.Chrome(chrome_path)
driver.get('https://www.pro-football-reference.com/boxscores/201709070nwe.htm')
soup = BeautifulSoup(driver.page_source,'lxml')
driver.quit()
#doesn't work
soup.find('table',id='player_offense')
#works
table = soup.find_all('table')[3]
The data is in HTML comments. Find the appropriate comment and then extract the table:
import requests
from bs4 import BeautifulSoup as bs
from bs4 import Comment
import pandas as pd
r= requests.get('https://www.pro-football-reference.com/boxscores/201709070nwe.htm#')
soup = bs(r.content, "lxml")
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for comment in comments:
    if 'id="player_offense"' in comment:
        print(pd.read_html(comment)[0])
        break
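The comment trick can be exercised offline on a tiny stand-in page; a real browser reveals the commented-out table via JavaScript, but requests sees only the comment:

```python
from bs4 import BeautifulSoup, Comment

# Tiny stand-in: the stats table ships inside an HTML comment, as on
# pro-football-reference box-score pages.
html = '''<div><!-- <table id="player_offense">
<tr><th>Player</th><th>Yds</th></tr>
<tr><td>Brady</td><td>267</td></tr>
</table> --></div>'''
soup = BeautifulSoup(html, "html.parser")
comments = soup.find_all(string=lambda t: isinstance(t, Comment))
for comment in comments:
    if 'id="player_offense"' in comment:
        # Re-parse the comment text: inside it, the table is ordinary HTML.
        inner = BeautifulSoup(str(comment), "html.parser")
        rows = [[cell.text for cell in tr.find_all(["th", "td"])]
                for tr in inner.find_all("tr")]
        print(rows)  # [['Player', 'Yds'], ['Brady', '267']]
        break
```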
Rendering the page with requests_html also works:
from requests_html import HTMLSession, HTML
import pandas as pd
with HTMLSession() as s:
    r = s.get('https://www.pro-football-reference.com/boxscores/201709070nwe.htm')
    r = HTML(html=r.text)
    r.render()
    table = r.find('table#player_offense', first=True)
    df = pd.read_html(table.html)
    print(df)
I'm looking to extract data from Instagram and record the time of the post without using auth.
The below code gives me the HTML of the pages from the IG post, but I'm not able to extract the time element from the HTML.
from requests_html import HTMLSession
from bs4 import BeautifulSoup
import json
url_path = 'https://www.instagram.com/<username>'
session = HTMLSession()
r = session.get(url_path)
soup = BeautifulSoup(r.content,features='lxml')
print(soup)
I would like to extract data from the time element near the bottom of this screenshot
To extract the time, target the HTML tag and its class. Note find rather than findAll: findAll returns a list, which has no .text attribute.
time = soup.find("time", {"class": "_1o9PC Nzb55"}).text
I'm guessing the picture you've shared is a browser-inspector screenshot. Inspecting the page is a good starting point for web scraping, but you should check what BeautifulSoup actually receives. If you print soup, you will see that the data you are looking for is JSON inside a script tag. So your code, and any other solution that targets the time tag, won't work with BS4 alone. You might try selenium instead.
Anyway, here is a BeautifulSoup pseudo-solution using the Instagram profile from your screenshot:
from bs4 import BeautifulSoup
import json
import re
import requests
import time
url_path = "https://www.instagram.com/srirachi9/"
response = requests.get(url_path)
soup = BeautifulSoup(response.content)
pattern = re.compile(r"window\._sharedData = (.*);", re.MULTILINE)
script = soup.find("script", text=lambda x: x and "window._sharedData" in x).text
data = json.loads(re.search(pattern, script).group(1))
times = len(data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['edges'])
for x in range(times):
    print(time.strftime(
        '%Y-%m-%d %H:%M:%S',
        time.localtime(data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['edges'][x]['node']['taken_at_timestamp'])))
The times variable is the number of timestamps the JSON contains. It may look like hell, but it's just a matter of patiently following the JSON structure and indexing accordingly.
I'm new to scraping and trying to use Beautiful Soup to get the wheelbase value (and eventually other things) from a Wikipedia page (I'll deal with robots.txt later). This is the guide I've been using.
Two questions
1.) How do I resolve the error below?
2.) How do I scrape the value in the cell that contains the wheelbase? Is it just "td#Wheelbase td"?
The error I get is
File "evscraper.py", line 25, in <module>
wheelbase_data['Wheelbase'] = soup.select('div#Wheelbase h3') [0].get_text()
IndexError: list index out of range
Thanks for any help!
__author__ = 'KirkLazarus'
import re
import json
import gspread
from oauth2client.client import SignedJwtAssertionCredentials
import bs4
from bs4 import BeautifulSoup
import requests
response =requests.get ('https://en.wikipedia.org/wiki/Tesla_Model_S')
soup = bs4.BeautifulSoup(response.text)
wheelbase_data['Wheelbase'] = soup.select('div#Wheelbase h3')[0].get_text()
print wheelbase_data
Well, your first problem is with your selector: there's no div with the ID "Wheelbase" on that page, so it returns an empty list.
What follows is by no means perfect, but will get you what you want, only because you know the structure of the page already:
import re
import json
import gspread
from oauth2client.client import SignedJwtAssertionCredentials
import bs4
from bs4 import BeautifulSoup
import requests
wheelbase_data = {}
response =requests.get ('https://en.wikipedia.org/wiki/Tesla_Model_S')
soup = bs4.BeautifulSoup(response.text)
for link in soup.find_all('a'):
    if link.get('href') == "/wiki/Wheelbase":
        wheelbase = link
        break
wheelbase_data['Wheelbase'] = wheelbase.parent.parent.td.text
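The parent.parent.td hop works because the link sits inside the label cell of an infobox row. A toy version of that markup shows the navigation (the HTML here is a simplified stand-in for the Wikipedia infobox):

```python
from bs4 import BeautifulSoup

html = '''<table><tr>
<th><a href="/wiki/Wheelbase">Wheelbase</a></th>
<td>2,960 mm</td>
</tr></table>'''
soup = BeautifulSoup(html, "html.parser")
link = soup.find("a", href="/wiki/Wheelbase")
# a -> th (parent) -> tr (parent.parent); tr.td is the value cell
value = link.parent.parent.td.text
print(value)  # 2,960 mm
```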
It looks like you're following the wrong path. I've had to do something similar in the past. I'm not sure this is the best approach, but it certainly worked for me.
import pandas as pd
from bs4 import BeautifulSoup
import urllib2
car_data = pd.DataFrame()
models = ['Tesla_Model_S','Tesla_Model_X']
for model in models:
    wiki = "https://en.wikipedia.org/wiki/{0}".format(model)
    header = {'User-Agent': 'Mozilla/5.0'}
    req = urllib2.Request(wiki, headers=header)
    page = urllib2.urlopen(req)
    soup = BeautifulSoup(page)
    table = soup.find("table", {"class": "infobox hproduct"})
    for row in table.findAll("tr")[2:]:
        try:
            field = row.findAll("th")[0].text.strip()
            val = row.findAll("td")[0].text.strip()
            car_data.set_value(model, field, val)
        except:
            pass
car_data
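Note that DataFrame.set_value was removed in pandas 1.0 and urllib2 is Python-2-only; the same row-by-row fill works today with .loc. A sketch over an offline stand-in infobox (the class name comes from the answer above; the values are invented):

```python
import pandas as pd
from bs4 import BeautifulSoup

# Offline stand-in for one model's infobox table.
html = '''<table class="infobox hproduct">
<tr><th>Wheelbase</th><td>2,960 mm</td></tr>
<tr><th>Curb weight</th><td>2,100 kg</td></tr>
</table>'''
soup = BeautifulSoup(html, "html.parser")
car_data = pd.DataFrame()
table = soup.find("table", {"class": "infobox hproduct"})
for row in table.find_all("tr"):
    th, td = row.find("th"), row.find("td")
    if th and td:
        # .loc with enlargement replaces the removed DataFrame.set_value
        car_data.loc["Tesla_Model_S", th.text.strip()] = td.text.strip()
print(car_data)
```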