Scraping the web in python - python

I'm completely new to scraping the web but I really want to learn it in python. I have a basic understanding of python.
I'm having trouble understanding a code to scrape a webpage because I can't find a good documentation about the modules which the code uses.
The code scraps some movie's data of this webpage
I get stuck after the comment "selection in pattern follows the rules of CSS".
I would like to understand the logic behind that code or a good documentation to understand that modules. Is there any previous topic which I need to learn?
The code is the following :
import requests
from pattern import web
from BeautifulSoup import BeautifulSoup
url = 'http://www.imdb.com/search/title?sort=num_votes,desc&start=1&title_type=feature&year=1950,2012'
r = requests.get(url)
print r.url
url = 'http://www.imdb.com/search/title'
params = dict(sort='num_votes,desc', start=1, title_type='feature', year='1950,2012')
r = requests.get(url, params=params)
print r.url # notice it constructs the full url for you
#selection in pattern follows the rules of CSS
dom = web.Element(r.text)
for movie in dom.by_tag('td.title'):
title = movie.by_tag('a')[0].content
genres = movie.by_tag('span.genre')[0].by_tag('a')
genres = [g.content for g in genres]
runtime = movie.by_tag('span.runtime')[0].content
rating = movie.by_tag('span.value')[0].content
print title, genres, runtime, rating

Here's the documentation for BeautifulSoup, which is an HTML and XML parser.
The comment
selection in pattern follows the rules of CSS
means the strings such as 'td.title' and 'span.runtime' are CSS selectors that help find the data you are looking for, where td.title searches for the <TD> element with attribute class="title".
The code is iterating through the HTML elements in the webpage body and extracting title, genres, runtime, and rating by the CSS selectors .

Related

Finding the correct elements for scraping a website

I am trying to scrape only certain articles from this main page. To be more specific, I am trying to scrape only articles from sub-page media and from sub-sub-pages Press releases; Governing Council decisions; Press conferences; Monetary policy accounts; Speeches; Interviews, and also just those which are in English.
I managed (based on some tutorials and other SE:overflow answers), to put together a code that scrapes completely everything from the website because my original idea was to scrape everything and then in data frame just clear the output later but the website includes so much that it always freezes after some time.
Getting the sub-links:
import requests
import re
from bs4 import BeautifulSoup
master_request = requests.get("https://www.ecb.europa.eu/")
base_url = "https://www.ecb.europa.eu"
master_soup = BeautifulSoup(master_request.content, 'html.parser')
master_atags = master_soup.find_all("a", href=True)
master_links = [ ]
sub_links = {}
for master_atag in master_atags:
master_href = master_atag.get('href')
master_href = base_url + master_href
print(master_href)
master_links.append(master_href)
sub_request = requests.get(master_href)
sub_soup = BeautifulSoup(sub_request.content, 'html.parser')
sub_atags = sub_soup.find_all("a", href=True)
sub_links[master_href] = []
for sub_atag in sub_atags:
sub_href = sub_atag.get('href')
sub_links[master_href].append(sub_href)
print("\t"+sub_href)
Some things I tried were to change the base link to sublinks - my idea was that maybe I can just do it separately for every sub-page and later just put the links together but that did not work). Other things that I tried was to replace the 17th line with the following;
sub_atags = sub_soup.find_all("a",{'class': ['doc-title']}, herf=True)
this seemed to partially solve my problem because even though it did not got only links from the sub-pages it at least ignored links that are not 'doc-title' which are all the links with text on the website but it was still too much and some links were not retrieved correctly.
I tried also tried the following:
for master_atag in master_atags:
master_href = master_atag.get('href')
for href in master_href:
master_href = [base_url + master_href if str(master_href).find(".en") in master_herf
print(master_href)
I thought that because all hrefs with English documents had .en somewhere in them this would only give me all links where .en occurs somewhere in the href but this code gives me syntax error for the print(master_href) which I dont understand because previously print(master_href) worked.
Next I want to extract the following information from sublinks. This part of code works when I test it for a single link, but I never had chance to try it on the above code since it wont finish running. Will this work once I manage to get the proper list of all links?
for links in sublinks:
resp = requests.get(sublinks)
soup = BeautifulSoup(resp.content, 'html5lib')
article = soup.find('article')
title = soup.find('title')
textdate = soup.find('h2')
paragraphs = article.find_all('p')
matches = re.findall('(\d{2}[\/ ](\d{2}|January|Jan|February|Feb|March|Mar|April|Apr|May|May|June|Jun|July|Jul|August|Aug|September|Sep|October|Oct|November|Nov|December|Dec)[\/ ]\d{2,4})', str(textdate))
for match in matches:
print(match[0])
datadate = match[0]
import pandas as pd
ecbdf = pd.DataFrame({"Article": [Article]; "Title": [title]: "Text": [paragraphs], "date": datadate})
Also going back to the scraping, since the first approach with beautiful soup did not worked for me I also tried to just approach the problem differently. The website has RSS feeds so I wanted to use the following code:
import feedparser
from pandas.io.json import json_normalize
import pandas as pd
import requests
rss_url='https://www.ecb.europa.eu/home/html/rss.en.html'
ecb_feed = feedparser.parse(rss_url)
df_ecb_feed=json_normalize(ecb_feed.entries)
df_ecb_fead.head()
Here I run into a problem of not being even able to find the RSS feed url in the first place. I tried the following: I viewed the source page and I tried to search for "RSS" and tried all urls that I could find this way but I always get empty dataframe.
I am a beginner to web-scraping and at this point I dont know how to proceed or how to approach this problem. In the end what I want to accomplish is to just collect all articles from the subpages with their titles, and dates and authors and put them into one dataframe.
The biggest problem you have with scraping this site is probably the lazy loading: Using JavaScript, they load the articles from several html pages and merge them into the list. For details, look out for index_include in the source code. This is problematic for scraping with only requests and BeautifulSoup because what your soup instance gets from the request content is just the basic skeleton without the list of articles. Now you have two options:
Instead of the main article list page (Press Releases, Interviews, etc.), use the lazy-loaded lists of articles, e.g., /press/pr/date/2019/html/index_include.en.html. This will probably be the easier option, but you have to do it for each year you're interested in.
Use a client that can execute JavaScript like Selenium to obtain the HTML instead of requests.
Apart from that, I would suggest to use CSS selectors for extracting information from the HTML code. This way, you only need a few lines for the article thing. Also, I don't think you have to filter for English articles if you use the index.en.html page for scraping because it shows English by default and -- additionally -- other languages if available.
Here's an example I quickly put together, this can certainly be optimized but it shows how to load the page with Selenium and extract the article URLs and article contents:
from bs4 import BeautifulSoup
from selenium import webdriver
base_url = 'https://www.ecb.europa.eu'
urls = [
f'{base_url}/press/pr/html/index.en.html',
f'{base_url}/press/govcdec/html/index.en.html'
]
driver = webdriver.Chrome()
for url in urls:
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
for anchor in soup.select('span.doc-title > a[href]'):
driver.get(f'{base_url}{anchor["href"]}')
article_soup = BeautifulSoup(driver.page_source, 'html.parser')
title = article_soup.select_one('h1.ecb-pressContentTitle').text
date = article_soup.select_one('p.ecb-publicationDate').text
paragraphs = article_soup.select('div.ecb-pressContent > article > p:not([class])')
content = '\n\n'.join(p.text for p in paragraphs)
print(f'title: {title}')
print(f'date: {date}')
print(f'content: {content[0:80]}...')
I get the following output for the Press Releases page:
title: ECB appoints Petra Senkovic as Director General Secretariat and Pedro Gustavo Teixeira as Director General Secretariat to the Supervisory Board
date: 20 December 2019
content: The European Central Bank (ECB) today announced the appointments of Petra Senkov...
title: Monetary policy decisions
date: 12 December 2019
content: At today’s meeting the Governing Council of the European Central Bank (ECB) deci...

How to collect data from website when wanted tag haven't class?

I would know how to get data from a website
I find a tutorial and finished with this
import os
import csv
import requests
from bs4 import BeautifulSoup
requete = requests.get("https://www.palabrasaleatorias.com/mots-aleatoires.php")
page = requete.content
soup = BeautifulSoup(page)
The tutorial say me that I should use something like this to get the string of a tag
h1 = soup.find("h1", {"class": "ico-after ico-tutorials"})
print(h1.string)
But I got a problem : the tag where I want to get text content haven't class... how should I do ?
I tried to put {} but not working
this too {"class": ""}
In fact, it's return me a None
I want to get the text content of this part of the website :
<div style="font-size:3em; color:#6200C5;">
Orchard</div>
Where Orchard is the random word
Thank for any type of help
Unfortunately, there aren't many pointers featured in BeautifulSoup, and the page you are trying to get is terribly ill-suited for your task (no IDs, classes, or other useful html features to point at).
Hence, you should change the way you use to point at the html element, and use the Xpath - and you can't do it with BeautifulSoup. In order to do that, just use html from package lxml to parse the page. Below a code snippet (based on the answers to this question) which extracts the random word in your example.
import requests
from lxml import html
requete = requests.get("https://www.palabrasaleatorias.com/mots-aleatoires.php")
tree = html.fromstring(requete.content)
rand_w = tree.xpath('/html/body/center/center/table[1]/tr/td/div/text()')
print(rand_w)

Python web-scraping on a multi-layered website without [href]

I am looking for a way to scrape data from the student-accomodation website uniplaces: https://www.uniplaces.com/en/accommodation/berlin.
In the end, I would like to scrape particular information for each property, such as bedroom size, number of roommates, location. In order to do this, I will first have to scrape all property links and then scrape the individual links afterwards.
However, even after going through the console and using BeautifulSoup for the extraction of urls, I was not able to extract the urls leading to the separate listings. They don't seem to be included as a [href] and I wasn't able to identify the links in any other format within the html code.
This is the python code I used but it also didn't return anything:
from bs4 import BeautifulSoup
import urllib.request
resp = urllib.request.urlopen("https://www.uniplaces.com/accommodation/lisbon")
soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))
for link in soup.find_all('a', href=True):
print(link['href'])
So my question is: If links are not included in http:// format or referenced as [href]: is there any way to extract the listings urls?
I would really highly appreciate any support on this!
All the best,
Hannah
If you look at the network tab, you find some API call specifically to this url : https://www.uniplaces.com/api/search/offers?city=PT-lisbon&limit=24&locale=en_GB&ne=38.79507211908374%2C-9.046124472314432&page=1&sw=38.68769060641113%2C-9.327992453271463
which specifies the location PT-lisbon and northest(ne) and southwest(sw) direction. From this file, you can get the id for each offers and append it to the current url, you can also get all info you get from the webpage (price, description etc...)
For instance :
import requests
resp = requests.get(
url = 'https://www.uniplaces.com/api/search/offers',
params = {
"city":'PT-lisbon',
"limit":'24',
"locale":'en_GB',
"ne":'38.79507211908374%2C-9.046124472314432',
"page":'1',
"sw":'38.68769060641113%2C-9.327992453271463'
})
body = resp.json()
base_url = 'https://www.uniplaces.com/accommodation/lisbon'
data = [
(
t['id'], #offer id
base_url + '/' + t['id'], #this is the offer page
t['attributes']['accommodation_offer']['title'],
t['attributes']['accommodation_offer']['price']['amount'],
t['attributes']['accommodation_offer']['available_from']
)
for t in body['data']
]
print(data)

Python: (Beautifulsoup) How to limit extracted text from a html news article to only the news article.

I wrote this test code which uses BeautifulSoup.
url = "http://www.dailymail.co.uk/news/article-3795511/Harry-Potter-sale-half-million-pound-house-Iconic-Privet-Drive-market-suburban-Berkshire-complete-cupboard-stairs-one-magical-boy.html"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html,"lxml")
for n in soup.find_all('p'):
print(n.get_text())
It works fine but it also retrieves text that is not part of the news article, such as the time it was posted, number of comments, copyrights ect.
I would wish for it to only retrieve text from the news article itself, how would one go about this?
You might have much better luck with newspaper library which is focused on scraping articles.
If we talk about BeautifulSoup only, one option to get closer to the desired result and have more relevant paragraphs is to find them in the context of div element with itemprop="articleBody" attribute:
article_body = soup.find(itemprop="articleBody")
for p in article_body.find_all("p"):
print(p.get_text())
You'll need to target more specifically than just the p tag. Try looking for a div class="article" or something similar, then only grab paragraphs from there
Be more specific, you need to catch the div with class articleBody, so :
import urllib.request
from bs4 import BeautifulSoup
url = "http://www.dailymail.co.uk/news/article-3795511/Harry-Potter-sale-half-million-pound-house-Iconic-Privet-Drive-market-suburban-Berkshire-complete-cupboard-stairs-one-magical-boy.html"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html,"lxml")
for n in soup.find_all('div', attrs={'itemprop':"articleBody"}):
print(n.get_text())
Responses on SO is not just for you, but also for people coming from google searches and such. So as you can see, attrs is a dict, it is then possible to pass more attributes/values if needed.

regex pattern in python for parsing HTML title tags

I am learning to use both the re module and the urllib module in python and attempting to write a simple web scraper. Here's the code I've written to scrape just the title of websites:
#!/usr/bin/python
import urllib
import re
urls=["http://google.com","https://facebook.com","http://reddit.com"]
i=0
these_regex="<title>(.+?)</title>"
pattern=re.compile(these_regex)
while(i<len(urls)):
htmlfile=urllib.urlopen(urls[i])
htmltext=htmlfile.read()
titles=re.findall(pattern,htmltext)
print titles
i+=1
This gives the correct output for Google and Reddit but not for Facebook - like so:
['Google']
[]
['reddit: the front page of the internet']
This is because, I found that on Facebook's page the title tag is as follows: <title id="pageTitle">. To accomodate for the additional id=, I modified the these_regex variable as follows: these_regex="<title.+?>(.+?)</title>". But this gives the following output:
[]
['Welcome to Facebook \xe2\x80\x94 Log in, sign up or learn more']
[]
How would I combine both so that I can take into account any additional parameters passed within the title tag?
It is recommended that you use Beautiful Soup or any other parser to parse HTML, but if you badly want regex the following piece of code would do the job.
The regex code:
<title.*?>(.+?)</title>
How it works:
Produces:
['Google']
['Welcome to Facebook - Log In, Sign Up or Learn More']
['reddit: the front page of the internet']
You are using a regular expression, and matching HTML with such expressions get too complicated, too fast.
Use a HTML parser instead, Python has several to choose from. I recommend you use BeautifulSoup, a popular 3rd party library.
BeautifulSoup example:
from bs4 import BeautifulSoup
response = urllib2.urlopen(url)
soup = BeautifulSoup(response.read(), from_encoding=response.info().getparam('charset'))
title = soup.find('title').text
Since a title tag itself doesn't contain other tags, you can get away with a regular expression here, but as soon as you try to parse nested tags, you will run into hugely complex issues.
Your specific problem can be solved by matching additional characters within the title tag, optionally:
r'<title[^>]*>([^<]+)</title>'
This matches 0 or more characters that are not the closing > bracket. The '0 or more' here lets you match both extra attributes and the plain <title> tag.
If you wish to identify all the htlm tags, you can use this
batRegex = re.compile(r'(<[a-z]*>)')
m1=batRegex.search(html)
print batRegex.findall(yourstring)
You could scrape a bunch of titles with a couple lines of gazpacho:
from gazpacho import Soup
urls = ["http://google.com", "https://facebook.com", "http://reddit.com"]
titles = []
for url in urls:
soup = Soup.get(url)
title = soup.find("title", mode="first").text
titles.append(title)
This will output:
titles
['Google',
'Facebook - Log In or Sign Up',
'reddit: the front page of the internet']

Categories

Resources