I'm scraping the NFL's website for player statistics. I'm having an issue parsing the web page and getting to the HTML table that contains the actual information I'm looking for. I successfully downloaded the page and saved it into the directory I'm working in. For reference, the page I've saved can be found here.
# import relevant libraries
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("1998.html"))
result = soup.find(id="result")
print(result)
I found that at one point, I ran the code and result printed the correct table I was looking for. Every other time, it contains nothing! I'm assuming this is user error, but I can't figure out what I'm missing. Using "lxml" as the parser also returned nothing, and I can't get html5lib to work (is that a parsing library?).
Any help is appreciated!
First, you should read the contents of your file before passing it to BeautifulSoup.
soup = BeautifulSoup(open("1998.html").read())
Second, verify manually that the table in question exists in the HTML by printing the contents to screen. The .prettify() method makes the data easier to read.
print(soup.prettify())
Lastly, if the element does in fact exist, the following will be able to find it:
table = soup.find('table',{'id':'result'})
A simple test script I wrote cannot reproduce your results.
from urllib.request import urlopen
from bs4 import BeautifulSoup

def test():
    # The URL of the page you're scraping.
    url = 'http://www.nfl.com/stats/categorystats?tabSeq=0&statisticCategory=PASSING&conference=null&season=1998&seasonType=REG&d-447263-s=PASSING_YARDS&d-447263-o=2&d-447263-n=1'
    # Make a request to the URL.
    conn = urlopen(url)
    # Read the contents of the response.
    html = conn.read()
    # Close the connection.
    conn.close()
    # Create a BeautifulSoup object and find the table.
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', {'id': 'result'})
    # Find all rows in the table.
    trs = table.find_all('tr')
    # Print the number of rows found in the table.
    print(len(trs))
This outputs 51 every time.
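As an aside, since pandas is already imported in the question, pd.read_html can often pull such a table directly. A minimal sketch, assuming the saved 1998.html still contains the table:
import pandas as pd

# read_html returns a list of DataFrames, one per <table> it can parse;
# attrs narrows the match to the table with id="result".
tables = pd.read_html("1998.html", attrs={"id": "result"})
print(tables[0].head())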
I am trying to scrape a table from this url: https://cryptoli.st/lists/fixed-supply
I gather that the table I want is in the div class "dataTables_scroll". I use the following code and it only returns an empty list:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
response = requests.get("https://cryptoli.st/lists/fixed-supply")
soup = bs(response.content, 'lxml')
table = soup.find_all("div", {"class": "dataTables_scroll"})
print(table)
Any help would be most appreciated.
Thanks!
The reason is that the response you get from requests.get() does not contain the table data.
It is most likely loaded client-side (by JavaScript).
What can you do about this? Using a selenium webdriver is a possible solution: you can wait until the table is loaded and becomes interactive, get the page content with selenium, then pass that content to bs4 to do the scraping.
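A minimal sketch of that approach, assuming a local Chrome driver (the "dataTables_scroll" class comes from the question; the 15-second timeout is arbitrary):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://cryptoli.st/lists/fixed-supply")
# Block until the scrollable table container shows up in the DOM.
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CLASS_NAME, "dataTables_scroll"))
)
soup = BeautifulSoup(driver.page_source, "lxml")
driver.quit()
print(soup.find("div", {"class": "dataTables_scroll"}) is not None)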
You can check the response by writing it to a file:
with open("demofile.html", "w", encoding="utf-8") as f:
    f.write(soup.prettify())
and you will be able to see "...Loading..." where the table is expected.
I believe the data is loaded from a script tag. I have to go to work so can't spend more time working out how to recreate a dataframe from the "|"-delimited data at present, but the following may serve as a starting point for others, as it extracts the relevant entries from the script tag for the table body.
import re
import ast
import requests

r = requests.get('https://cryptoli.st/lists/fixed-supply').text
# Pull the raw list literal out of the inline script tag.
s = re.search(r'cl\.coinmainlist\.dataraw = (\[.*?\]);', r, flags=re.S).group(1)
# The literal is a valid Python list of strings, so literal_eval can parse it.
data = ast.literal_eval(s)
data = [i.split('|') for i in data]
print(data)
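Picking up that starting point, a minimal sketch of the dataframe step (the real column names would have to be worked out from the page, so pandas' default positional labels stand in):
import pandas as pd

# Each entry in `data` is one row of "|"-separated fields.
df = pd.DataFrame(data)
print(df.head())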
I am trying to get a value from a webpage. In the page source, the data is in CDATA format and is populated by jQuery. I have managed to write the code below, which returns a large amount of text; index 21 contains the information I need. However, the output is large and not in a format I understand. Within it I need to isolate and output "redshift":"0.06", but I don't know how. What is the best way to solve this?
import requests
from bs4 import BeautifulSoup
link = "https://wis-tns.weizmann.ac.il/object/2020aclx"
html = requests.get(link).text
soup = BeautifulSoup(html, "html.parser")
res = soup.find_all('b')
print(soup.find_all('script')[21])
It can be done using the current approach you have. However, I'd advise against it. There's a neater way to do it by observing that the redshift value is present in a few convenient places on the page itself.
The following approach should work for you. It looks for tables on the page with the class "atreps-results-table" -- of which there are two. We take the second such table and look for the table cell with the class "cell-redshift". Then, we just print out its text content.
from bs4 import BeautifulSoup
import requests
link = 'https://wis-tns.weizmann.ac.il/object/2020aclx'
html = requests.get(link).text
soup = BeautifulSoup(html, 'html.parser')
tab = soup.find_all('table', {'class': 'atreps-results-table'})[1]
redshift = tab.find('td', {'class': 'cell-redshift'})
print(redshift.text)
Try simply:
soup.select_one('div.field-redshift > div.value>b').text
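For completeness, a sketch wiring that selector into the same request as above:
import requests
from bs4 import BeautifulSoup

html = requests.get('https://wis-tns.weizmann.ac.il/object/2020aclx').text
soup = BeautifulSoup(html, 'html.parser')
# select_one takes a CSS selector and returns the first match (or None).
print(soup.select_one('div.field-redshift > div.value > b').text)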
If you view the page source of the URL, you will find two script elements containing CDATA. The script element you are interested in also has jQuery in it, so you have to select the script element based on that. After that, you need to do some cleaning to get rid of the CDATA tags and the jQuery wrapper. Then, with the help of the json library, convert the JSON data to a Python dictionary.
import requests
from bs4 import BeautifulSoup
import json
page = requests.get('https://wis-tns.weizmann.ac.il/object/2020aclx')
htmlpage = BeautifulSoup(page.text, 'html.parser')
scriptelements = htmlpage.find_all('script')
for script in scriptelements:
    if 'CDATA' in script.text and 'jQuery' in script.text:
        scriptcontent = script.text.replace('<!--//--><![CDATA[//>', '').replace('<!--', '').replace('//--><!]]>', '').replace('jQuery.extend(Drupal.settings,', '').replace(');', '')
        break
jsondata = json.loads(scriptcontent)
print(jsondata['objectFlot']['plotMain1']['params']['redshift'])
I am doing a project that I am a little stuck on. I am trying to web scrape a website to grab all of the cast members and their characters.
I successfully grab the cast and characters from the wiki page, and I can print out links to their personal Wikipedia pages.
My question is: is it possible to follow each of those links using a loop?
Next, I would want to use the loop to scrape each actor's personal Wikipedia page for their age and where they are from.
After the full code works and outputs what I have asked, I need help creating a CSV file so the results are written into it.
Thanks in advance.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://en.wikipedia.org/wiki/Law_%26_Order:_Special_Victims_Unit'
html = urlopen(url)
bs = BeautifulSoup(html, 'html.parser')
The above code imports the libraries I am going to use and sets the URL I am trying to scrape:
right_table = bs.find('table', class_='wikitable plainrowheaders')
table_rows = right_table.find_all('td')

for row in table_rows:
    try:
        if row.a['href'].startswith('/wiki'):
            print(row.get_text())
            print(row.a['href'])
            print('-------------------------')
    except TypeError:
        pass
This was before I added the print statements below. I think there is a way to create a list of the "/wiki/..." links it prints and then loop over that list.
right_table = bs.find('table', class_='wikitable plainrowheaders')
table_rows = right_table.find_all('td')

for row in table_rows:
    try:
        if row.a['href'].startswith('/wiki'):
            print(row.get_text())
            # href already starts with '/', so don't add another slash
            link = 'https://en.wikipedia.org' + row.a['href']
            print(link)
            print('-------------------------')
    except TypeError:
        pass
The above code currently prints the cast and their allocated Wikipedia pages, but I am unsure if that's the right way to do it. I am also printing my outcomes first, before putting it all in the CSV, to check that I am outputting the right data.
def age_finder(url):
    ...
    return age
In the code above, I am unsure what to put where the "..." is so that it returns the age.
for url in url_list:
    age_finder(url)
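As a hedged sketch, one way age_finder could work is to read the 'Born' row of each page's infobox (the infobox class name and the 'Born' label are assumptions about Wikipedia's markup, and some pages will lack them):
from urllib.request import urlopen
from bs4 import BeautifulSoup

def age_finder(url):
    # Sketch: pull the 'Born' cell from the page's infobox, if present.
    bs = BeautifulSoup(urlopen(url), 'html.parser')
    infobox = bs.find('table', class_='infobox')
    if infobox is None:
        return None
    born = infobox.find('th', string='Born')
    if born is None:
        return None
    # The sibling <td> holds the birth date and birthplace text.
    return born.find_next_sibling('td').get_text(' ', strip=True)
For the CSV step, collecting (name, link, age_finder(link)) rows into a list and writing them out with pd.DataFrame(rows).to_csv('cast.csv', index=False) would fit the pandas import already in the question.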
I have scraped a set of links off a website (https://www.gmcameetings.co.uk), all the links containing the word "meetings", i.e. the meeting papers, which are now contained in 'meeting_links'. I now need to follow each of those links to scrape some more links within them.
I've gone back to using the request library and tried
r2 = requests.get("meeting_links")
But it returns the following error:
MissingSchema: Invalid URL 'list_meeting_links': No schema supplied.
Perhaps you meant http://list_meeting_links?
I changed it to that, but it made no difference.
This is my code so far and how I got the links from the first url that I wanted.
# importing libraries
import requests
import urllib.request
import time
from bs4 import BeautifulSoup as bs
# set url
url = "https://www.gmcameetings.co.uk/"
# grab html
r = requests.get(url)
page = r.text
soup = bs(page,'lxml')
# creating folder to store pdfs - if not, create a separate folder
folder_location = r'E:\Internship\WORK'

# getting all meeting hrefs off url
meeting_links = soup.find_all('a', href=True)

for link in meeting_links:
    print(link['href'])
    if link['href'].find('/meetings/') > 1:
        print("Meeting!")
#second set of links
r2 = requests.get("meeting_links")
Do I need to do something with the 'meeting_links' before I can start using the requests library again? I'm completely lost.
As I understand it, your new requests should go here:
for link in meeting_links:
    if link['href'].find('/meetings/') > 1:
        r2 = requests.get(link['href'])
        # do something with the response here
It looks like you were passing a literal string to the requests method. The call should receive an actual URL:
requests.get('https://example.com')
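If any of the scraped hrefs turn out to be relative, a small sketch using urljoin to make them absolute before requesting (the base URL comes from the question):
from urllib.parse import urljoin

import requests

base = "https://www.gmcameetings.co.uk/"
for link in meeting_links:
    href = link['href']
    if '/meetings/' in href:
        # urljoin leaves absolute URLs untouched and resolves relative ones.
        r2 = requests.get(urljoin(base, href))
        # ...parse r2.text here for the next set of links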
I'm sure this will be a quick fix for someone with reasonable knowledge of web scraping with BeautifulSoup. I'm trying to grab the data from a table, but for some reason it's not giving me the expected output. Below is my code:
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import time
import json
def main():
    # BASE AND EXTENSIONS FOR EACH CURRENCY COLUMNWISE
    base_cols_url = 'https://uk.reuters.com/assets/'
    forex_cols = {}
    forex_cols['GBP'] = 'jsonCurrencyPairs?callback=drawCurrencyPairs&srcCurr=GBP'
    forex_cols['EUR'] = 'jsonCurrencyPairs?callback=drawCurrencyPairs&srcCurr=EUR'
    forex_cols['USD'] = 'jsonCurrencyPairs?callback=drawCurrencyPairs&srcCurr=USD'
    forex_cols['JPY'] = 'jsonCurrencyPairs?callback=drawCurrencyPairs&srcCurr=JPY'
    forex_cols['CHF'] = 'jsonCurrencyPairs?callback=drawCurrencyPairs&srcCurr=CHF'
    forex_cols['AUD'] = 'jsonCurrencyPairs?callback=drawCurrencyPairs&srcCurr=AUD'
    forex_cols['CAD'] = 'jsonCurrencyPairs?callback=drawCurrencyPairs&srcCurr=CAD'
    forex_cols['CNY'] = 'jsonCurrencyPairs?callback=drawCurrencyPairs&srcCurr=CNY'
    forex_cols['HKD'] = 'jsonCurrencyPairs?callback=drawCurrencyPairs&srcCurr=HKD'
    # loop through the pages
    for sym in forex_cols:
        print(sym)
        print(base_cols_url + forex_cols[sym])
        get_data_from_page(sym, base_cols_url + forex_cols[sym])

def get_data_from_page(SYMBOL, PAGE):
    browser = webdriver.PhantomJS()
    # PARSE THE HTML
    browser.get(PAGE)
    soup = BeautifulSoup(browser.page_source, "lxml")
    rows = soup.find_all('td')
    # PARSE ALL THE COLUMN DATA
    for r in rows:
        print(r)  # this prints nothing
    print(soup)  # this prints the page but the markup is missing, replaced with '<tr><td><'
    return

if __name__ == '__main__':
    main()
If I manually load the page in Chrome, I can see the 'td' and 'tr' markup that should be parseable, but for some reason nothing prints. However, if I just print the entire soup object, the markup seems to be missing, which explains why print(r) returns nothing. I still don't know how to parse out the parts I need (the data displayed in the table on the base webpage: https://uk.reuters.com/business/currencies).
It looks like a format called JSON, but I've never really used it. When I tried json.loads(soup), it said it can't load a soup object, so I tried json.loads(soup.text()) and got a ValueError: Expecting value: line 1 column 1 (char 0).
Would be really grateful if anyone could help me parse the data. Many thanks for reading!
Ok, after some failed attempts with json, I tried an incredibly crude string-parsing method, but it does the job; posting it in case anyone else wants to do anything similar.
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import time
import json
def main():
    # BASE AND EXTENSIONS FOR EACH CURRENCY COLUMNWISE
    base_cols_url = 'https://uk.reuters.com/assets/'
    forex_cols = {}
    forex_cols['GBP'] = 'jsonCurrencyPairs?callback=drawCurrencyPairs&srcCurr=GBP'
    forex_cols['EUR'] = 'jsonCurrencyPairs?callback=drawCurrencyPairs&srcCurr=EUR'
    forex_cols['USD'] = 'jsonCurrencyPairs?callback=drawCurrencyPairs&srcCurr=USD'
    forex_cols['JPY'] = 'jsonCurrencyPairs?callback=drawCurrencyPairs&srcCurr=JPY'
    forex_cols['CHF'] = 'jsonCurrencyPairs?callback=drawCurrencyPairs&srcCurr=CHF'
    forex_cols['AUD'] = 'jsonCurrencyPairs?callback=drawCurrencyPairs&srcCurr=AUD'
    forex_cols['CAD'] = 'jsonCurrencyPairs?callback=drawCurrencyPairs&srcCurr=CAD'
    forex_cols['CNY'] = 'jsonCurrencyPairs?callback=drawCurrencyPairs&srcCurr=CNY'
    forex_cols['HKD'] = 'jsonCurrencyPairs?callback=drawCurrencyPairs&srcCurr=HKD'
    for sym in forex_cols:
        print(sym)
        print(base_cols_url + forex_cols[sym])
        get_data_from_page(sym, base_cols_url + forex_cols[sym])

def get_data_from_page(SYMBOL, PAGE):
    browser = webdriver.PhantomJS()
    # PARSE THE HTML
    browser.get(PAGE)
    soup = BeautifulSoup(browser.page_source, "lxml")
    rows = str(soup).split('"row"')
    # PARSE ALL THE COLUMN DATA
    for r in rows:
        # PARSE OUT VALUE COL
        try:
            print(r.split('</a></td><td>')[1].split('</td><td class=')[0])
        except IndexError:
            pass
        # PARSE OUT CURRENCY PAIR
        try:
            print(r.split('sparkchart?symbols=')[1].split('=X&')[0])
        except IndexError:
            pass
    return

if __name__ == '__main__':
    main()
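As a possibly cleaner alternative to the string splitting: the callback=drawCurrencyPairs parameter in the URL suggests the endpoint returns JSONP, i.e. JSON wrapped in a function call, which would explain why json.loads failed on the raw text. A hedged sketch of stripping the wrapper first, using requests rather than the PhantomJS driver since the asset URL serves plain text (the payload's key layout is an assumption to verify by inspection):
import json
import re

import requests

url = 'https://uk.reuters.com/assets/jsonCurrencyPairs?callback=drawCurrencyPairs&srcCurr=GBP'
raw = requests.get(url).text
# Strip the JSONP wrapper, e.g. drawCurrencyPairs({...}); -> {...}
match = re.search(r'drawCurrencyPairs\((.*)\)', raw, flags=re.S)
if match:
    payload = json.loads(match.group(1))
    # Inspect the top-level keys to find where the row data lives.
    print(list(payload.keys()))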