I'm learning Python and I'm following this online class lesson.
https://openclassrooms.com/fr/courses/7168871-apprenez-les-bases-du-langage-python/exercises/4173
At the end of the lesson, we're learning the ETL procedure.
Question 3:
I have to load an HTML script and use BeautifulSoup in a Python script.
The problem is there: the only thing I've done when it comes to data mining is with a website, I create a variable that contains the URL link of the website and after that I create a variable soup.
import requests
from bs4 import BeautifulSoup
url = 'https://www.gov.uk/search/news-and-communications'
reponse = requests.get(url)
page = reponse.content
soup = BeautifulSoup(page, 'html.parser')
This is easy because the HTML code is in a URL but how can I do that with a file inside my machine?
I create a new HTML file with the script inside (the file is named TestOC.html)
I create a new Python file.
from bs4 import BeautifulSoup
soup = BeautifulSoup('TestOC.html', 'html.parser')
But the file is not taken. How can I do that?
BeautifulSoup takes the content, not the file name. You could open it yourself and read() it though:
with open('TestOC.html') as f:
content = f.read()
soup = BeautifulSoup(content, 'html.parser')
Related
Hello I am trying to use beautiful soup and requests to log the data coming from an anemometer which updates live every second. The link to this website here:
http://88.97.23.70:81/
The piece of data I want to scrape is highlighted in purple in the image :
from inspection of the html in my browser.
I have written the code bellow in to try to print out the data however when I run the code it prints: None. I think this means that the soup object doesnt infact contain the whole html page? Upon printing soup.prettify() I cannot find the same id=js-2-text I find when inspecting the html in my browser. If anyone has any ideas why this might be or how to fix it I would be most grateful.
from bs4 import BeautifulSoup
import requests
wind_url='http://88.97.23.70:81/'
r = requests.get(wind_url)
data = r.text
soup = BeautifulSoup(data, 'lxml')
print(soup.find(id='js-2-text'))
All the best,
Brendan
The data is loaded from external URL, so beautifulsoup doesn't need it. You can try to use API URL the page is connecting to:
import requests
from bs4 import BeautifulSoup
api_url = "http://88.97.23.70:81/cgi-bin/CGI_GetMeasurement.cgi"
data = {"input_id": "1"}
soup = BeautifulSoup(requests.post(api_url, data=data).content, "html.parser")
_, direction, metres_per_second, *_ = soup.csv.text.split(",")
knots = float(metres_per_second) * 1.9438445
print(direction, metres_per_second, knots)
Prints:
210 006.58 12.79049681
Is it possible to pass BeautifulSoup a HTML I have copied in my clipboard using pyperclip. I have difficulties using requests as the page requires login, and the usual methods for passing cookies to requests doesn't work.
The first approach using Pyperclip would be:
soup = BeautifulSoup(pyperclip.paste(), "html.parser")
You could also save the html file locally with Ctrl+S and import the html file in Python:
with open(r'yourfile.html', "r", encoding='utf-8') as html_file:
soup = BeautifulSoup(html_file.read(), "html.parser")
https://en.wikipedia.org/wiki/Economy_of_the_European_Union
Above is the link to website and I want to scrape table: Fortune top 10 E.U. corporations by revenue (2016).
Please, share the code for the same:
import requests
from bs4 import BeautifulSoup
def web_crawler(url):
page = requests.get(url)
plain_text = page.text
soup = BeautifulSoup(plain_text,"html.parser")
tables = soup.findAll("tbody")[1]
print(tables)
soup = web_crawler("https://en.wikipedia.org/wiki/Economy_of_the_European_Union")
following what #FanMan said , this is simple code to help you get started, keep in mind that you will need to clean it and also perform the rest of the work on your own.
import requests
from bs4 import BeautifulSoup
url='https://en.wikipedia.org/wiki/Economy_of_the_European_Union'
r=requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
temp_datastore=list()
for text in soup.findAll('p'):
w=text.findAll(text=True)
if(len(w)>0):
temp_datastore.append(w)
Some documentation
beautiful soup:https://www.crummy.com/software/BeautifulSoup/bs4/doc/
requests: http://docs.python-requests.org/en/master/user/intro/
urllib: https://docs.python.org/2/library/urllib.html
You're first issue is that your url is not properly defined. After that you need to find the table to extract and it's class. In this case the class was "wikitable" and it was a the first table. I have started your code for you so it gives you the extracted data from the table. Web-scraping is good to learn but if your are just starting to program, practice with some simpler stuff first.
import requests
from bs4 import BeautifulSoup
def webcrawler():
url = "https://en.wikipedia.org/wiki/Economy_of_the_European_Union"
page = requests.get(url)
soup = BeautifulSoup(page.text,"html.parser")
tables = soup.findAll("table", class_='wikitable')[0]
print(tables)
webcrawler()
There's something I still don't understand about using BeautifulSoup. I can use this to parse the raw HTML of a webpage, here "example_website.com":
from bs4 import BeautifulSoup # load BeautifulSoup class
import requests
r = requests.get("http://example_website.com")
data = r.text
soup = BeautifulSoup(data)
# soup.find_all('a') grabs all elements with <a> tag for hyperlinks
Then, to retrieve and print all elements with the 'href' attribute, we can use a for loop:
for link in soup.find_all('a'):
print(link.get('href'))
What I don't understand: I have a website with several webpages, and each webpage lists several hyperlinks to a single webpage with tabular data.
I can use BeautifulSoup to parse the homepage, but how do I use the same Python script to scrape page 2, page 3, and so on? How do you "access" the contents found via the 'href' links?
Is there a way to write a python script to do this? Should I be using a spider?
You can do that with requests+BeautifulSoup for sure. It would be of a blocking nature, since you would process the extracted links one by one and you would not proceed to the next link until you are done with the current. Sample implementation:
from urlparse import urljoin
from bs4 import BeautifulSoup
import requests
with requests.Session() as session:
r = session.get("http://example_website.com")
data = r.text
soup = BeautifulSoup(data)
base_url = "http://example_website.com"
for link in soup.find_all('a'):
url = urljoin(base_url, link.get('href'))
r = session.get(url)
# parse the subpage
Though, it may quickly get complex and slow.
You may need to switch to Scrapy web-scraping framework which makes web-scraping, crawling, following the links easy (check out CrawlSpider with link extractors), fast and in a non-blocking nature (it is based on Twisted).
I'm trying to scrape a webpage using BeautifulSoup using the code below:
import urllib.request
from bs4 import BeautifulSoup
with urllib.request.urlopen("http://en.wikipedia.org//wiki//Markov_chain.htm") as url:
s = url.read()
soup = BeautifulSoup(s)
with open("scraped.txt", "w", encoding="utf-8") as f:
f.write(soup.get_text())
f.close()
The problem is that it saves the Wikipedia's main page instead of that specific article. Why the address doesn't work and how should I change it?
The correct url for the page is http://en.wikipedia.org/wiki/Markov_chain:
>>> import urllib.request
>>> from bs4 import BeautifulSoup
>>> url = "http://en.wikipedia.org/wiki/Markov_chain"
>>> soup = BeautifulSoup(urllib.request.urlopen(url))
>>> soup.title
<title>Markov chain - Wikipedia, the free encyclopedia</title>
#alecxe's answer will generate:
**GuessedAtParserWarning**:
No parser was explicitly specified, so I'm using the best
available HTML parser for this system ("html.parser"). This usually isn't a problem,
but if you run this code on another system, or in a different virtual environment, it
may use a different parser and behave differently. The code that caused this warning
is on line 25 of the file crawl.py.
To get rid of this warning, pass the additional argument 'features="html.parser"' to
the BeautifulSoup constructor.
Here is a solution without GuessedAtParserWarning using requests:
# crawl.py
import requests
url = 'https://www.sap.com/belgique/index.html'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
file = path.join(path.dirname(__file__), 'downl.txt')
# Either print the title/text or save it to a file
print(soup.title)
# download the text
with open(file, 'w') as f:
f.write(soup.text)