I want to make a Python list of all of Vincent van Gogh's paintings out of the JSON file from a Wikipedia API call. Here is my URL that I use to make the request:
http://en.wikipedia.org/w/api.php?format=json&action=query&titles=list%20of%20works%20by%20Vincent%20van%20Gogh&prop=revisions&rvprop=content
As you can see if you open the URL in your browser, it's a huge blob of text. How can I begin to extract the titles of the paintings from this massive JSON return? I have done a great deal of research before asking this question and tried numerous methods to solve it. It would be helpful if this JSON file were a usable dictionary to work with, but I can't make sense of it. How would you extract the names of the paintings from this JSON file?
Instead of directly parsing the results of JSON API calls, use a Python wrapper:
import wikipedia
page = wikipedia.page("List_of_works_by_Vincent_van_Gogh")
print(page.links)
There are also other clients and wrappers.
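If you do want to work with the JSON directly, it is a nested dictionary: the content sits under query -> pages -> <pageid> -> revisions. Here is a minimal sketch using requests (this assumes the legacy response format, where the wikitext is stored under the '*' key of the first revision):

import requests

# Ask the API for the page's wikitext (same query as the URL in the question)
params = {
    "format": "json",
    "action": "query",
    "titles": "List of works by Vincent van Gogh",
    "prop": "revisions",
    "rvprop": "content",
}
data = requests.get("https://en.wikipedia.org/w/api.php", params=params).json()

# "pages" is keyed by page id, so grab its first (and only) value
page = next(iter(data["query"]["pages"].values()))
wikitext = page["revisions"][0]["*"]  # "*" holds the raw wikitext in the legacy format
print(wikitext[:200])

The painting titles would still have to be pulled out of the raw wikitext, which is why the wrapper or HTML approaches below are usually less painful.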
Alternatively, here's an option using BeautifulSoup HTML parser:
>>> from urllib.request import urlopen
>>> from bs4 import BeautifulSoup
>>> url = "http://en.wikipedia.org/wiki/List_of_works_by_Vincent_van_Gogh"
>>> soup = BeautifulSoup(urlopen(url), "html.parser")
>>> table = soup.find('table', class_="wikitable")
>>> for row in table.find_all('tr')[1:]:
...     print(row.find_all('td')[1].text)
...
Still Life with Cabbage and Clogs
Crouching Boy with Sickle, Black chalk and watercolor
Woman Sewing, Watercolor
Woman with White Shawl
...
Here is a quick way to get your list into a pandas DataFrame:
import pandas as pd
url = 'http://en.wikipedia.org/wiki/List_of_works_by_Vincent_van_Gogh'
df = pd.read_html(url, attrs={"class": "wikitable"})[0] # 0 is for the 1st table in this particular page
df.head()
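From there, getting the titles as a plain Python list is one more step. A sketch, with the caveat that "Title" is an assumed column name; check df.columns against the actual table header first:

titles = df["Title"].tolist()  # "Title" is an assumed column name
print(titles[:5])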
I have written the following code to pull the quotes from the webpage:
#importing python libraries
from bs4 import BeautifulSoup as bs
import pandas as pd
pd.set_option('display.max_colwidth', 500)
import time
import requests
import random
from lxml import html
#collect first page of quotes
page = requests.get("https://www.kdnuggets.com/2017/05/42-essential-quotes-data-science-thought-leaders.html")
#create a BeautifulSoup object
soup = bs(page.content, 'html.parser')
print(soup.prettify())
#find all quotes on the page
soup.find_all('ol')
#pull just the quotes and not the superfluous data
quote = soup.find(id='post-')
quote_list = quote.find_all('ol')
quote_list
At this point, I now want to just show the text in a list and not see <li> or <ol> tags
I've tried using the .get_text() attribute but I get an error saying
ResultSet object has no attribute 'get_text'
How can I get only the text to return?
This is only for the first page of quotes - there is a second page which I am going to need to pull the quotes from. I will also need to present the data in a table with a column for the quotes and a column for the author from both pages.
Help is greatly appreciated... I'm still new to learning python and I've been working to this point for 8 hours on this code and feel so stuck/discouraged.
The 'html.parser' parser seems to have a bit of a problem even with what I think is now the correct code. But after switching to 'lxml', which this was not using, it now seems to be working:
from bs4 import BeautifulSoup as bs
import requests
page = requests.get("https://www.kdnuggets.com/2017/05/42-essential-quotes-data-science-thought-leaders.html")
soup=bs(page.content, 'lxml')
quotes = []
post_id = soup.find(id='post-')
ordered_lists = post_id.find_all('ol')
quotes.extend([li.get_text()
               for ordered_list in ordered_lists
               for li in ordered_list.find_all('li')
               ])
print(len(quotes))
print(quotes[0]) # Get the first quote
print('-' * 80)
print(quotes[-1]) #print last quote
Prints:
22
“By definition all scientists are data scientists. In my opinion, they are half hacker, half analyst, they use data to build products and find insights. It’s Columbus meet Columbo―starry-eyed explorers and skeptical detectives.”
--------------------------------------------------------------------------------
“Once you have a certain amount of math/stats and hacking skills, it is much better to acquire a grounding in one or more subjects than in adding yet another programming language to your hacking skills, or yet another machine learning algorithm to your math/stats portfolio…. Clients will rather work with some data scientist A who understands their specific field than with another data scientist B who first needs to learn the basics―even if B is better in math/stats/hacking.”
Alternate Coding
from bs4 import BeautifulSoup as bs
import requests
page = requests.get("https://www.kdnuggets.com/2017/05/42-essential-quotes-data-science-thought-leaders.html")
soup=bs(page.content, 'lxml')
quotes = []
post_id = soup.find(id='post-')
ordered_lists = post_id.find_all('ol')
for ordered_list in ordered_lists:
    for li in ordered_list.find_all('li'):
        quotes.append(li.get_text())
print(len(quotes))
print(quotes[0]) # Get the first quote
print('-' * 80)
print(quotes[-1]) #print last quote
The find_all() method can take a list of tag names to search for, or a function that decides which elements should be returned. To print the text from the matched tags you need get_text(), but it only works on a single Tag, not on the ResultSet that find_all() returns, so you have to loop over the results and call get_text() on each element.
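As a minimal, self-contained sketch of that pattern with a toy document:

from bs4 import BeautifulSoup

html = "<ol><li>first</li><li>second</li></ol>"
soup = BeautifulSoup(html, "html.parser")
items = soup.find_all("li")  # a ResultSet, which has no get_text()
texts = [li.get_text() for li in items]  # call get_text() on each Tag instead
print(texts)  # ['first', 'second']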
Use this code to get all your quotes (this is updated and working):
from bs4 import BeautifulSoup as bs
import requests
from lxml import html
page = requests.get("https://www.kdnuggets.com/2017/05/42-essential-quotes-data-science-thought-leaders.html")
soup=bs(page.text, 'html.parser')
# Q return all of your quotes with HTML tagging but you can loop over it with `get_text()`
Q = soup.find('div',{"id":"post-"}).find_all('ol')
quotes=[]
for every_quote in Q:
    quotes.append(every_quote.get_text())
print(quotes[0]) # Get the first quote
Use quotes[0], quotes[1], ... to get the text of the 1st, 2nd, and so on <ol> blocks; note that each block can contain several quotes.
Python3 - Beautiful Soup 4
I'm trying to parse the weather graph out of the website:
https://www.wunderground.com/forecast/us/ny/new-york-city
But when I grab the weather graph HTML, Beautiful Soup seems to grab everything around it instead.
I am new to Beautiful Soup. I think it is not able to grab this either because it can't parse the custom tags the page uses, or because the JavaScript that populates the graph hasn't loaded or isn't parsable by BS (at least the way I'm using it).
As far as my code goes, it's extremely basic
import requests, bs4
url = 'https://www.wunderground.com/forecast/us/ny/new-york-city'
requrl = requests.get(url, headers={'user-agent': 'Mozilla/5.0'})
requrl.raise_for_status()
bs = bs4.BeautifulSoup(requrl.text, features="html.parser")
a = str(bs)
x = 'weather-graph'
print(a[a.find(x):])  # was a.find('x'), which searched for the literal letter "x"
#Also tried a.find('weather-graph') which returns -1
I have verified that each piece of the code works in other scenarios. The last line should find that string and print out everything after that.
I tried setting x to many different pieces of the HTML in and around the graph but got nothing of substance.
There is an API you can use, the same one the page itself uses. I don't know if the key expires. You may need to do some ordering on the output, but you can do that by the datetime field.
import requests
r = requests.get('https://api.weather.com/v1/geocode/40.765/-73.981/forecast/hourly/240hour.json?apiKey=6532d6454b8aa370768e63d6ba5a832e&units=e').json()
for i in r['forecasts']:
    print(i)
If unsure, I will happily update this to show you how to build the dataframe and order it.
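For instance, a sketch of that idea (assuming each forecast dict carries an 'fcst_valid_local' timestamp; that field name is an assumption, so inspect the columns first):

import pandas as pd
import requests

r = requests.get('https://api.weather.com/v1/geocode/40.765/-73.981/forecast/hourly/240hour.json?apiKey=6532d6454b8aa370768e63d6ba5a832e&units=e').json()
df = pd.DataFrame(r['forecasts'])  # one row per hourly forecast
df = df.sort_values('fcst_valid_local')  # assumed field name; check df.columns
print(df.head())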
I'm still learning how to utilize BeautifulSoup. I've managed to use tags and whatnot to pull the data from the Depth Chart table at https://fantasydata.com/nfl-stats/team-details/CHI
But now I'm trying to pull the Full Roster table. I can't quite seem to figure out the tags for that. I do notice in the source, though, that the info is in a list of dictionaries, as seen:
vm.Roster = [{"PlayerId":16236,"Name":"Cody Parkey","Team":"CHI","Position":"K","FantasyPosition":"K","Height":"6\u00270\"","Weight":189,"Number":1,"CurrentStatus":"Healthy","CurrentStatusCol
...
What's an elegant way to pull that Full Roster table? My thought was that if I could just grab that list of dictionaries, I could convert it to a dataframe. But I'm not sure how to grab it, or whether there is a better way to get that table into a dataframe in Python.
One possible solution is to use a regular expression to extract the raw JSON object which then can be loaded using the json library.
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
import json
html_page = urlopen("https://fantasydata.com/nfl-stats/team-details/CHI")
soup = BeautifulSoup(html_page, "html.parser")
raw_data = re.search(r"vm\.Roster = (\[.*\])", soup.text).group(1)  # escape the dot so it matches literally
data = json.loads(raw_data)
print(data[0]["Name"]) # Cody Parkey
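Since the goal was a dataframe, the parsed list of dicts can then be handed straight to pandas; a sketch, with the column names taken from the JSON keys visible in the question:

import pandas as pd

roster_df = pd.DataFrame(data)  # one row per player
print(roster_df[["Name", "Position"]].head())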
It should be noted that scraping data from that particular site in this fashion most likely violates their terms of service and might even be illegal in some jurisdictions.
I just started learning web scraping using Python. However, I've already run into some problems.
My goal is to web scrape the names of the different tuna species from fishbase.org (http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=salmon)
The problem: I'm unable to extract all of the species names.
This is what I have so far:
import urllib2
from bs4 import BeautifulSoup
fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Tuna'
page = urllib2.urlopen(fish_url)
html_doc = page.read()
soup = BeautifulSoup(html_doc)
spans = soup.find_all(
From here, I don't know how I would go about extracting the species names. I've thought of using regex (i.e. soup.find_all("a", text=re.compile("\d+\s+\d+"))) to capture the text inside the tags...
Any input will be highly appreciated!
You might as well take advantage of the fact that all the scientific names (and only the scientific names) are in <i> tags:
scientific_names = [it.text for it in soup.table.find_all('i')]
Using BS and RegEx are two different approaches to parsing a webpage. The former exists so you don't have to bother so much with the latter.
You should read up on what BS actually does, it seems like you're underestimating its utility.
What jozek suggests is the correct approach, but I couldn't get his snippet to work (but that's maybe because I am not running the BeautifulSoup 4 beta). What worked for me was:
import urllib2
from BeautifulSoup import BeautifulSoup
fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Tuna'
page = urllib2.urlopen(fish_url)
soup = BeautifulSoup(page)
scientific_names = [it.text for it in soup.table.findAll('i')]
print scientific_names
Looking at the web page, I'm not sure exactly about what information you want to extract. However, note that you can easily get the text in a tag using the text attribute:
>>> from bs4 import BeautifulSoup
>>> html = '<a>some text</a>'
>>> soup = BeautifulSoup(html)
>>> [tag.text for tag in soup.find_all('a')]
[u'some text']
Thanks everyone... I was able to solve the problem I was having with this code:
import urllib2
from bs4 import BeautifulSoup
fish_url = 'http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Salmon'
page = urllib2.urlopen(fish_url)
html_doc = page.read()
soup = BeautifulSoup(html_doc)
scientific_names = [it.text for it in soup.table.find_all('i')]
for item in scientific_names:
    print item
If you want a long-term solution, try Scrapy. It is quite simple and does a lot of the work for you. It is very customizable and extensible. You will extract all the URLs you need using XPath, which is more pleasant and reliable. Scrapy still allows you to use re if you need it.
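A minimal sketch of such a spider, reusing the observation above that the scientific names sit in <i> tags inside the results table (the spider name and output file are illustrative):

import scrapy

class FishSpider(scrapy.Spider):
    name = "fishbase"  # illustrative spider name
    start_urls = [
        "http://www.fishbase.org/ComNames/CommonNameSearchList.php?CommonName=Salmon",
    ]

    def parse(self, response):
        # Yield one item per scientific name found in the table
        for sci_name in response.xpath("//table//i/text()").getall():
            yield {"scientific_name": sci_name.strip()}

Running it with scrapy runspider fish_spider.py -o names.json collects the items into a JSON file.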
I'm trying to use BeautifulSoup to parse some HTML in Python. Specifically, I'm trying to create two arrays of soup objects: one for the dates of postings on a website, and one for the postings themselves. However, when I use findAll on the div class that matches the postings, only the initial tag is returned, not the text inside the tag. On the other hand, my code works just fine for the dates. What is going on??
# store all texts of posts
texts = soup.findAll("div", {"class":"quote"})
# store all dates of posts
dates = soup.findAll("div", {"class":"datetab"})
The first line above returns only
<div class="quote">
which is not what I want. The second line returns
<div class="datetab">Feb<span>2</span></div>
which IS what I want (pre-refining).
I have no idea what I'm doing wrong. Here is the website I'm trying to parse. This is for homework, and I'm really really desperate.
Which version of BeautifulSoup are you using? Version 3.1.0 performs significantly worse with real-world HTML (read: invalid HTML) than 3.0.8. This code works with 3.0.8:
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen("http://harvardfml.com/")
soup = BeautifulSoup(page)
for incident in soup.findAll('span', { "class" : "quote" }):
    print incident.contents
That site is powered by Tumblr, and Tumblr has an API.
There's a Python wrapper for the Tumblr API that you can use to read posts.
from tumblr import Api

api = Api('harvardfml.com')
freq = {}
posts = api.read()
for post in posts:
    # do something with each post here
As for your failing findAll, it is hard to see what is wrong without the actual source code of your program.