Extract tables in webpages from Python/R or other software - python

I would like to extract Name, Address of School, Tel, Fax, Head of School from the website:
https://www.edb.gov.hk/en/student-parents/sch-info/sch-search/schlist-by-district/school-list-cw.html
Is it possible to do so?

Yes, it is possible, and there are many tools that help you do that. If you do not want to use a programming language, there are plenty of tools available (though you will probably have to pay for them); here is an article that might be useful: https://popupsmart.com/blog/web-scraping-tools
However, if you want to use Python, what you should do is load the page and then parse the HTML. Then look for the element you want and fetch its data. This article explains the whole process with code: https://www.freecodecamp.org/news/web-scraping-python-tutorial-how-to-scrape-data-from-a-website/
Here is a simple snippet, based on the code from the article above, that prints the tables in the page you posted:
import requests
from bs4 import BeautifulSoup

# Request the page and parse the HTML
page = requests.get(
    "https://www.edb.gov.hk/en/student-parents/sch-info/sch-search/schlist-by-district/school-list-cw.html")
soup = BeautifulSoup(page.content, 'html.parser')

# Select every <table> element on the page and print it
tables = soup.select('table')
for table in tables:
    print(table)
You can try it out here:
https://colab.research.google.com/drive/13EzFWNBqpkGf4CZvGt5pYySCuW7Ij6I4?usp=sharing
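If you want the actual columns (Name, Address of School, Tel, Fax, Head of School) rather than the raw HTML, one option is pandas.read_html, which parses every table it finds into a DataFrame. Here is a minimal sketch; which index in the returned list holds the school table, and the exact column headers, are assumptions you would have to verify against the live page.

from io import StringIO

import pandas as pd
import requests

url = ("https://www.edb.gov.hk/en/student-parents/sch-info/sch-search/"
       "schlist-by-district/school-list-cw.html")

# read_html parses every <table> in the HTML into a list of DataFrames
resp = requests.get(url)
tables = pd.read_html(StringIO(resp.text))

# Assumption: the first table is the school list; print len(tables) and a few
# of the frames to find the right one on the real page.
schools = tables[0]
print(schools.columns)   # check which headers are present
print(schools.head())    # e.g. Name, Address, Tel, Fax, Head of School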

Related

Select css tags with randomized letters at the end

I am currently learning web scraping with Python. I'm reading Web Scraping with Python by Ryan Mitchell.
I am stuck at Crawling Sites Through Search. For example, the Reuters search given in the book works perfectly, but when I try to find it by myself, as I will do in the future, I get this link.
While the second link works fine for a human, I cannot figure out how to scrape it because of weird class names like class="media-story-card__body__3tRWy".
The first link gives me simple names, like class="search-result-content", that I can scrape.
I've encountered the same problem on other sites too. How would I go about scraping it, or finding a link with normal names, in the future?
Here's my code example:
from bs4 import BeautifulSoup
import requests
from rich.pretty import pprint

text = "hello"
url = f"https://www.reuters.com/site-search/?query={text}"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")

results = soup.select("div.media-story-card__body__3tRWy")
for result in results:
    pprint(result)
    pprint("###############")
You might resort to a prefix attribute value selector, like
div[class^="media-story-card__body__"]
This assumes that the class is the only one on the element (or at least the first one listed). The idea can also be extended to a substring check with the *= operator.
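A small sketch of both variants with BeautifulSoup's select; the URL and the class prefix are taken from the question, and the randomized suffix may of course have changed since:

import requests
from bs4 import BeautifulSoup

url = "https://www.reuters.com/site-search/?query=hello"
soup = BeautifulSoup(requests.get(url).text, "lxml")

# Prefix match: the class attribute starts with the stable part of the name
prefix_hits = soup.select('div[class^="media-story-card__body__"]')

# Substring match: also works if another class is listed first on the element
substring_hits = soup.select('div[class*="media-story-card__body__"]')

for card in substring_hits:
    print(card.get_text(strip=True))

Note that if the search results are rendered client-side by JavaScript, requests will not see them at all; the selector syntax is the point of the sketch, not the fetching.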

How to take hyperlinks from a wikipedia page

My current project requires obtaining the summaries of some Wikipedia pages. This is really easy to do, but I want to make a general script for it. More specifically, I also want to obtain the summaries of hyperlinks. For example, I want to get the summary of this page: https://en.wikipedia.org/wiki/Creative_industries (this is easy). Moreover, I would also like to get the summaries of the hyperlinks in the section Definitions -> 'Advertising', 'Marketing', 'Architecture', ..., 'visual arts'. My problem is that some of these hyperlinks have different page names. For example, the previously mentioned page has the hyperlink 'Software' (number 6), but I want the summary of the underlying page, which is 'Software Engineering'.
Can someone help me with that? I can find the summaries of the pages whose hyperlink text matches the page name, but that is not always the case. So basically I am looking for a way to restrict (page.links) to only one area of the page.
Thank you in advance
Try using Beautiful Soup; this will print all the links with the given prefix:
from bs4 import BeautifulSoup
import requests, re

# Don't forget to install the 'lxml' parser package
url = "your link"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'lxml')
tags = soup.find_all('a')

# This will print every available link
for tag in tags:
    print(tag.get('href'))

# This will print only links whose href starts with the given prefix
for link in soup.find_all('a', attrs={'href': re.compile("^{{your prefix here}}")}):
    print(link.get('href'))
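To restrict this to a single section (for example Definitions) and to cope with links whose text differs from the page name, one approach is to walk from that section's heading to the next heading and collect the hrefs in between: the href (e.g. /wiki/Software_engineering) carries the real page title even when the visible text is just 'Software'. This is only a sketch, and the assumption that the heading carries id="Definitions" would need to be checked against the page's markup:

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Creative_industries"
soup = BeautifulSoup(requests.get(url).text, "lxml")

# Assumption: the section heading (or a span inside it) has id="Definitions"
heading = soup.find(id="Definitions")
if heading.name not in ("h2", "h3"):
    heading = heading.find_parent(["h2", "h3"])

titles = []
# Walk forward in document order until the next section heading
for el in heading.find_all_next(["a", "h2", "h3"]):
    if el.name in ("h2", "h3"):
        break
    href = el.get("href", "")
    if href.startswith("/wiki/") and ":" not in href:
        # The href holds the real page title, e.g. Software_engineering,
        # even when the link text is just "Software"
        titles.append(href[len("/wiki/"):].replace("_", " "))

print(titles)
# Each title can then be fed to whatever summary lookup you already have,
# e.g. wikipedia.summary(title) from the wikipedia package.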

Check every hour if specific HTML classes exist on external websites

I want to check a few external links every few hours for some specific classes.
For example, I have these two links:
https://nike.com/product-1/
https://adidas.com/product1/
On each one of these links, I want to check every few hours whether a specific class exists. More exactly, I want to check the stock availability of each of the sizes (S, M, L, XL, ...).
If any size on those two links is "out of stock", I want to receive an email with a message.
From my research, I found that I can use Beautiful Soup which is a Python library for pulling data out of HTML.
This is what I have started:
import requests
from bs4 import BeautifulSoup

result = requests.get("https://nike.com/product-1/")
src = result.content
soup = BeautifulSoup(src, 'lxml')

stock = []
for h2_tag in soup.find_all('h2'):
    a_tag = h2_tag.find('a')
    if a_tag is not None:
        stock.append(a_tag)
        print(a_tag)
This seems pretty complicated and it's just a start... I have the impression that there might be simpler ways of doing this.
What is the easiest way to do this?
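A minimal sketch of what the whole loop could look like, assuming the product pages mark unavailable sizes with a class such as "out-of-stock"; that class name, the example URLs and the SMTP details are placeholders you would have to replace with the real ones:

import smtplib
import time
from email.message import EmailMessage

import requests
from bs4 import BeautifulSoup

URLS = [
    "https://nike.com/product-1/",
    "https://adidas.com/product1/",
]
OUT_OF_STOCK_CLASS = "out-of-stock"   # assumption: inspect the real pages

def out_of_stock_sizes(url):
    soup = BeautifulSoup(requests.get(url).content, "lxml")
    # Assumption: each unavailable size is an element carrying the class above
    return [el.get_text(strip=True)
            for el in soup.find_all(class_=OUT_OF_STOCK_CLASS)]

def send_alert(body):
    msg = EmailMessage()
    msg["Subject"] = "Stock alert"
    msg["From"] = "me@example.com"                          # placeholder
    msg["To"] = "me@example.com"                            # placeholder
    msg.set_content(body)
    with smtplib.SMTP("smtp.example.com", 587) as server:   # placeholder
        server.starttls()
        server.login("user", "password")                    # placeholder
        server.send_message(msg)

while True:
    for url in URLS:
        sizes = out_of_stock_sizes(url)
        if sizes:
            send_alert(f"{url} is out of stock in sizes: {', '.join(sizes)}")
    time.sleep(3 * 60 * 60)   # wait three hours; a cron job is an alternative

If the shops render the stock information with JavaScript, requests alone will not see it and you may need a browser-based tool such as Selenium instead.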

Web Scraping Box Office Mojo with Python

I am trying to scrape data from BoxOfficeMojo for a Data Science project. I made some changes to this code I found from an already existing GitHub repository to suit my needs.
https://github.com/OscarPrediction1/boxOfficeCrawler/blob/master/crawlMovies.py
I need some help regarding scraping a particular feature.
While I can scrape a movie gross normally, Box Office Mojo has a feature where they show you the Inflation-adjusted gross (The gross of the movie if it released in any particular year). The inflation-adjusted gross comes with an additional "&adjust_yr=2018" at the end of the normal movie link.
For example -
Titanic Normal link (http://www.boxofficemojo.com/movies/?id=titanic.htm)
Titanic 2018 Inflation adjusted link (http://www.boxofficemojo.com/movies/?id=titanic.htm&adjust_yr=2018)
In the particular code I linked earlier, a table of URLs is created by going through the alphabetical list (http://www.boxofficemojo.com/movies/alphabetical.htm) and then each of the URLs is visited. The problem is that the alphabetical list has the normal links of the movies and not the inflation-adjusted links. What do I change to get the inflation-adjusted values from here?
(The only way I could crawl all the movies at once is via the alphabetical list. I have checked that earlier)
One possible way would be simply to generate all the necessary URLs by appending "&adjust_yr=2018" to each normal URL and scraping each page.
I personally like to use XPath (a language for navigating HTML structures, very useful for scraping!) and recommend against using string matching to filter data out of HTML, as was once recommended to me. A simple way to use XPath is via the lxml library.
from lxml import html

<your setup>
....

for site in major_sites:
    page = 1
    while True:
        # fetch table
        url = "http://www.boxofficemojo.com/movies/alphabetical.htm?letter=" + site + "&p=.htm&page=" + str(page)
        print(url)
        element_tree = html.parse(url)
        rows = element_tree.xpath('//td/*/a[contains(@href, "movies/?id")]/@href')
        rows_adjusted = ['http://www.boxofficemojo.com' + row + '&adjust_yr=2018' for row in rows]
        # then loop over rows_adjusted and grab the necessary info from each page
If you're comfortable with the pandas dataframe library, I would also point out the pd.read_html() function, which in my opinion is well suited to this task. It would allow you to scrape a whole alphabetical page in almost a single line, and you can perform any necessary substitutions / annotations afterwards column-wise.
One possible way could be this.
import pandas as pd
import requests

<your setup>
....

for site in major_sites:
    page = 1
    while True:
        # fetch table
        url = "http://www.boxofficemojo.com/movies/alphabetical.htm?letter=" + site + "&p=.htm&page=" + str(page)
        print(url)
        req = requests.get(url=url)
        # pandas parses the HTML content (it needs a parser such as lxml installed)
        content = pd.read_html(req.content)
        # choose the correct table from the result
        tabular_data = content[3]
        # drop the row with the title
        tabular_data = tabular_data.drop(0)
        # housekeeping: rename the columns
        tabular_data.columns = ['title', 'studio', 'total_gross', 'total_theaters',
                                'opening_gross', 'opening_theaters', 'opening_date']
        # now you can use pandas to perform all necessary replacement and string operations
Further resources:
Wikipedia has a nice overview of XPath syntax.

Python: Parsing a class prints nothing?

Trying to parse a weather page and select the weekly forecasted highs.
Normally I would search with tags = soup.find_all("span", id="hi"), but this tag doesn't use an id; it uses a class.
Full code:
import mechanize
from bs4 import BeautifulSoup
my_browser = mechanize.Browser()
html_page = my_browser.open("http://www.wunderground.com/weather-forecast/45056")
html_text = html_page.get_data()
my_soup = BeautifulSoup(html_text)
tags = my_soup.find_all("span", class_="hi")
temp = tags[0].string
print temp
When I run this, nothing prints
The piece of HTML is buried inside a bunch of other tags, however the specific tag for today's high is as follows:
<span class="hi">63</span>
Just use class_ as the parameter name. See the docs.
The problem arises because class is a Python keyword, so you can't use it directly.
As an alternative to scraping the web page, you could always check out Weather Underground's API. It's free for developers (limited number of calls per day, etc.), but if you're going to be doing a number of lookups, this might be easier in the end.
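A small sketch showing the class_ keyword and the equivalent attrs dict form, using requests instead of mechanize simply to keep it self-contained; the URL and the "hi" class come from the question and the live page may have changed since:

import requests
from bs4 import BeautifulSoup

html_text = requests.get("http://www.wunderground.com/weather-forecast/45056").text
soup = BeautifulSoup(html_text, "html.parser")

# Option 1: the class_ keyword (the trailing underscore avoids the Python keyword)
tags = soup.find_all("span", class_="hi")

# Option 2: pass the attribute in a dict, which needs no special spelling
tags_alt = soup.find_all("span", attrs={"class": "hi"})

for tag in tags:
    print(tag.string)   # e.g. 63

If the live page now builds the forecast with JavaScript, neither call will find the span and the API route mentioned above is the better option.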
