I'm still leaning how to utilize beautifulsoup. I've managed to use tags and what not to pull the data from Depth Chart table at https://fantasydata.com/nfl-stats/team-details/CHI
But now I'm try to pull the Full Roster table. I can't quite seem to figure out the tags for that. I do notice in the source though that the info is in a list with dictionaries, as seen:
vm.Roster = [{"PlayerId":16236,"Name":"Cody Parkey","Team":"CHI","Position":"K","FantasyPosition":"K","Height":"6\u00270\"","Weight":189,"Number":1,"CurrentStatus":"Healthy","CurrentStatusCol
...
What's an elegant way to pull that Full Roster table? My thought was if I could just grab that list/dictionary, I could just convert to a dataframe. But not sure how to grab that, or if there is a better way to do that to put that table in a dataframe in python.
One possible solution is to use a regular expression to extract the raw JSON object which then can be loaded using the json library.
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
import json
html_page = urlopen("https://fantasydata.com/nfl-stats/team-details/CHI")
soup = BeautifulSoup(html_page, "html.parser")
raw_data = re.search(r"vm.Roster = (\[.*\])", soup.text).group(1)
data = json.loads(raw_data)
print(data[0]["Name"]) # Cody Parkey
It should be noted that scraping data from that particular site in this fashion most likely violates their terms of service and might even be illegal in some jurisdictions.
Related
I'm very new to Python and am attempting my first web scraping project. I'm attempting to extract the data following a tag within a XML data source. I've attached an image of the data I'm working with. My issue is that, it seems like no matter what tag I try to extract I constantly return no results. I am able to return the entire data source so I know the connection is not the issue.
My ultimate goal is to loop through all of the data and return the data following a particular tag. I think if I can understand why I'm unable to print a singular particular tag I should be able to figure out how to loop through all of the data. I've looked through similar posts but I think the tree in my set of data is particularly troublesome (that and my inexperience).
My Code:
from bs4 import BeautifulSoup
import requests
#Assign URL to scrape
URL = "http://api.powertochoose.org/api/PowerToChoose/plans?zip_code=78364"
#Fetch the raw HTML Data
Data = requests.get(URL)
Soup = BeautifulSoup(Data.text, "html.parser")
tags = Soup.find_all('fact_sheet')
print (tags)
Try to check the response of your example first, it is JSON not XML so no BeautifulSoup needed here, simply iterate the data list to pick your fact_sheets:
for plan in Data.json()['data']:
print(plan['fact_sheet'])
Out:
https://rates.cleanskyenergy.com:8443/rates/DownloadDoc?path=a70e9298-5537-481a-985c-c7a005b2e4f3.html&id_plan=223344
https://texpo-prod-api.eroms.works/api/v1/document/ViewProductDocument?type=efl&rateCode=SRCPLF24PTC&lang=en
https://www.txu.com/Handlers/PDFGenerator.ashx?comProdId=TCXSIMVL1212AR&lang=en&formType=EnergyFactsLabel&custClass=3&tdsp=AEPTCC
https://signup.myvaluepower.com/Home/EFL?productId=32653&Promo=16410
https://docs.cloud.flagshippower.com/EFL?term=36&duns=007924772&product=galleon&lang=en&code=FPSPTC2
...
As you've already realized by now, you're getting the data as json, so doing something like:
fact_sheet_links = [d['fact_sheet'] for d in Data.json()['data']]
would get you the data you want.
But also, if you'd prefer to work with the xml, you can add headers to the request:
Data = requests.get(URL, headers={ 'Accept': 'application/xml' })
and get an xml response. When I did this, Soup.find_all('fact_sheet') still did not work (although I've seen this method used in some tutorials, so it might be a version problem - and it might still work for you), but it did work when I used find_all with lambda:
tags = Soup.find_all(lambda t: 'fact_sheet' in t.name)
and the results after altering your code looked like this. That just gives you the tags though, so if you want a list of the contents instead, one way would be to use list comprehension:
fact_sheet_links = [t.text for t in tags]
so that you get them like this.
I have written the following code to pull the quotes from the webpage:
#importing python libraries
from bs4 import BeautifulSoup as bs
import pandas as pd
pd.set_option('display.max_colwidth', 500)
import time
import requests
import random
from lxml import html
#collect first page of quotes
page = requests.get("https://www.kdnuggets.com/2017/05/42-essential-quotes-data-science-thought-leaders.html")
#create a BeautifulSoup object
soup=BeautifulSoup(page.content, 'html.parser')
soup
print(soup.prettify())
#find all quotes on the page
soup.find_all('ol')
#pull just the quotes and not the superfluous data
Quote=soup.find(id='post-')
Quote_list=Quote.find_all('ol')
quote_list
At this point, I now want to just show the text in a list and not see < li > or < ol > tags
I've tried using the .get_text() attribute but I get an error saying
ResultSet object has no attribute 'get_text'
How can I get only the text to return?
This is only for the first page of quotes - there is a second page which I am going to need to pull the quotes from. I will also need to present the data in a table with a column for the quotes and a column for the author from both pages.
Help is greatly appreciated... I'm still new to learning python and I've been working to this point for 8 hours on this code and feel so stuck/discouraged.
The 'html.parser' seems to be having bit of a problem even with what I think is now the correct code. But after switching to using 'lxml', which this was not using, it now seems to be working:
from bs4 import BeautifulSoup as bs
import requests
page = requests.get("https://www.kdnuggets.com/2017/05/42-essential-quotes-data-science-thought-leaders.html")
soup=bs(page.content, 'lxml')
quotes = []
post_id = soup.find(id='post-')
ordered_lists = post_id.find_all('ol')
quotes.extend([li.get_text()
for li in ordered_list.find_all('li')
for ordered_list in ordered_lists
])
print(len(quotes))
print(quotes[0]) # Get the first quote
print('-' * 80)
print(quotes[-1]) #print last quote
Prints:
22
“By definition all scientists are data scientists. In my opinion, they are half hacker, half analyst, they use data to build products and find insights. It’s Columbus meet Columbo―starry-eyed explorers and skeptical detectives.”
--------------------------------------------------------------------------------
“Once you have a certain amount of math/stats and hacking skills, it is much better to acquire a grounding in one or more subjects than in adding yet another programming language to your hacking skills, or yet another machine learning algorithm to your math/stats portfolio…. Clients will rather work with some data scientist A who understands their specific field than with another data scientist B who first needs to learn the basics―even if B is better in math/stats/hacking.”
Alternate Coding
from bs4 import BeautifulSoup as bs
import requests
page = requests.get("https://www.kdnuggets.com/2017/05/42-essential-quotes-data-science-thought-leaders.html")
soup=bs(page.content, 'lxml')
quotes = []
post_id = soup.find(id='post-')
ordered_lists = post_id.find_all('ol')
for ordered_list in ordered_lists:
for li in ordered_list.find_all('li'):
quotes.append(li.get_text())
print(len(quotes))
print(quotes[0]) # Get the first quote
print('-' * 80)
print(quotes[-1]) #print last quote
The find_all() method can take a list of elements to search for. It basically takes a function which determines what elements should be returned. For printing the result from the tags, you need get_text(). But it only works on single entity so you have to loop over the entire find_all() to get each element and then apply get_text() to extract text from each element.
Use this code to get all your quotes :- (This is updated and working)
from bs4 import BeautifulSoup as bs
import requests
from lxml import html
page = requests.get("https://www.kdnuggets.com/2017/05/42-essential-quotes-data-science-thought-leaders.html")
soup=bs(page.text, 'html.parser')
# Q return all of your quotes with HTML tagging but you can loop over it with `get_text()`
Q = soup.find('div',{"id":"post-"}).find_all('ol')
quotes=[]
for every_quote in Q:
quotes.append(every_quote.get_text())
print(quotes[0]) # Get the first quote
Use quotes[0], quotes[1], ... to get 1st, 2nd, and so on quotes!
Python3 - Beautiful Soup 4
I'm trying to parse the weather graph out of the website:
https://www.wunderground.com/forecast/us/ny/new-york-city
But when I grab the weather graph html but beautiful soup seems to grab all around it.
I am new to Beautiful Soup. I think it is not able to grab this because either it is not able to parse the tag thing they have going on or because the javascript that populates the graph hasn't loaded or is not parsable by BS (at least the way I'm using it).
As far as my code goes, it's extremely basic
import requests, bs4
url = 'https://www.wunderground.com/forecast/us/ny/new-york-city'
requrl = requests.get(url, headers={'user-agent': 'Mozilla/5.0'})
requrl.raise_for_status()
bs = bs4.BeautifulSoup(requrl.text, features="html.parser")
a = str(bs)
x = 'weather-graph'
print(a[a.find('x'):])
#Also tried a.find('weather-graph') which returns -1
I have verified that each piece of the code works in other scenarios. The last line should find that string and print out everything after that.
I tried making x many different pieces of the html in and around the graph but got nothing of substance.
There is an API you can use. Same as the page does. Don't know if key expires. You may need to do some ordering on output but you can do that by datetime field
import requests
r = requests.get('https://api.weather.com/v1/geocode/40.765/-73.981/forecast/hourly/240hour.json?apiKey=6532d6454b8aa370768e63d6ba5a832e&units=e').json()
for i in r['forecasts']:
print(i)
If unsure I will happily update to show you how to build dataframe and order.
I want to make a Python list of all of Vincent van Gogh's paintings out of the JSON file from a Wikipedia API call. Here is my URL that I use to make the request:
http://en.wikipedia.org/w/api.php?format=json&action=query&titles=list%20of%20works%20by%20Vincent%20van%20Gogh&Page&prop=revisions&rvprop=content
As you can see if you open the URL in your browser, it's a huge blob of text. How can I begin to extract the titles of paintings from this massive JSON return? I have done a great deal of research before asking this question, and tried numerous methods to solve it. It would be helpful if this JSON file was a useful dictionary to work with, but I can't make sense of it. How would you extract names of paintings from this JSON file?
Instead of directly parsing the results of JSON API calls, use a python wrapper:
import wikipedia
page = wikipedia.page("List_of_works_by_Vincent_van_Gogh")
print page.links
There are also other clients and wrappers.
Alternatively, here's an option using BeautifulSoup HTML parser:
>>> from bs4 import BeautifulSoup
>>> url = "http://en.wikipedia.org/wiki/List_of_works_by_Vincent_van_Gogh"
>>> soup = BeautifulSoup(urlopen(url))
>>> table = soup.find('table', class_="wikitable")
>>> for row in table.find_all('tr')[1:]:
... print(row.find_all('td')[1].text)
...
Still Life with Cabbage and Clogs
Crouching Boy with Sickle, Black chalk and watercolor
Woman Sewing, Watercolor
Woman with White Shawl
...
Here is a quick way to have your list in a panda dataframe
import pandas as pd
url = 'http://en.wikipedia.org/wiki/List_of_works_by_Vincent_van_Gogh'
df = pd.read_html(url, attrs={"class": "wikitable"})[0] # 0 is for the 1st table in this particular page
df.head()
I just started learning web scraping using Python. However, I've already ran into some problems.
My goal is to web scrape the names of the different tuna species from fishbase.org (http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=salmon)
The problem: I'm unable to extract all of the species names.
This is what I have so far:
import urllib2
from bs4 import BeautifulSoup
fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Tuna'
page = urllib2.urlopen(fish_url)
soup = BeautifulSoup(html_doc)
spans = soup.find_all(
From here, I don't know how I would go about extracting the species names. I've thought of using regex (i.e. soup.find_all("a", text=re.compile("\d+\s+\d+")) to capture the texts inside the tag...
Any input will be highly appreciated!
You might as well take advantage of the fact that all the scientific names (and only scientific names) are in <i/> tags:
scientific_names = [it.text for it in soup.table.find_all('i')]
Using BS and RegEx are two different approaches to parsing a webpage. The former exists so you don't have to bother so much with the latter.
You should read up on what BS actually does, it seems like you're underestimating its utility.
What jozek suggests is the correct approach, but I couldn't get his snippet to work (but that's maybe because I am not running the BeautifulSoup 4 beta). What worked for me was:
import urllib2
from BeautifulSoup import BeautifulSoup
fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Tuna'
page = urllib2.urlopen(fish_url)
soup = BeautifulSoup(page)
scientific_names = [it.text for it in soup.table.findAll('i')]
print scientific_names
Looking at the web page, I'm not sure exactly about what information you want to extract. However, note that you can easily get the text in a tag using the text attribute:
>>> from bs4 import BeautifulSoup
>>> html = '<a>some text</a>'
>>> soup = BeautifulSoup(html)
>>> [tag.text for tag in soup.find_all('a')]
[u'some text']
Thanks everyone...I was able to solve the problem I was having with this code:
import urllib2
from bs4 import BeautifulSoup
fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Salmon'
page = urllib2.urlopen(fish_url)
html_doc = page.read()
soup = BeautifulSoup(html_doc)
scientific_names = [it.text for it in soup.table.find_all('i')]
for item in scientific_names:
print item
If you want a long term solution, try scrapy. It is quite simple and does a lot of work for you. It is very customizable and extensible. You will extract all the URLs you need using xpath, which is more pleasant and reliable. Still scrapy allows you to use re, if you need.