Wikipedia Scraping ambiguous results - python

I am scraping Wikipedia and a disambiguation error comes up because there are multiple articles with the same title. How do I go through all of them and pull them? As an additional question, how do I skip them?
import wikipedia

text = ['child', 'pca', 'united states']
df = []
for x in text:
    wiki = wikipedia.page(x)
    df.append(wiki.content)
Multiple results come up for some of them and it errors out. Any ideas? I am thinking a try/except/else?

The disambiguation notes have a very specific format. It should be easy for you to find them and extract the links they contain. Indeed, the disambiguation links themselves have a unique class that you can search for.
As to whether you pull them or skip them, that's entirely up to you, depending on your need.
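If you are working with the wikipedia package, a minimal sketch along these lines should cover both options; it assumes the package raises wikipedia.DisambiguationError with an options attribute listing the candidate titles:

import wikipedia

text = ['child', 'pca', 'united states']
df = []
for x in text:
    try:
        wiki = wikipedia.page(x)
    except wikipedia.DisambiguationError as e:
        # Option 1: skip the ambiguous title entirely
        # continue
        # Option 2: pull the content of every candidate on the disambiguation page
        for option in e.options:
            try:
                df.append(wikipedia.page(option).content)
            except wikipedia.DisambiguationError:
                continue  # nested disambiguation; skip it
        continue
    df.append(wiki.content)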

Related

for loop with .find()

I'm new to programming and I have been studying through FreeCodeCamp for a couple of weeks. I was reviewing some code available on their website and I have two questions.
Why is it necessary to shuffle the variable allLinks?
What does if link['href'].find("/wiki/") == -1 really check?
I really appreciate your help.
import requests
from bs4 import BeautifulSoup
import random

response = requests.get(
    url="https://en.wikipedia.org/wiki/Web_scraping",
)
soup = BeautifulSoup(response.content, 'html.parser')
title = soup.find(id="firstHeading")
print(title.content)

# Get all the links
allLinks = soup.find(id="bodyContent").find_all("a")
random.shuffle(allLinks)
linkToScrape = 0

for link in allLinks:
    # We are only interested in other wiki articles
    if link['href'].find("/wiki/") == -1:
        continue
    # Use this link to scrape
    linkToScrape = link
    break

print(linkToScrape)
Here is the output:
Eventbrite, Inc.
It's better to first explain question 2 and describe what the code is doing.
Recall that find() returns the index of the first occurrence of the substring, or -1 if it can't find the desired substring. So if link['href'].find("/wiki/") == -1 checks whether the substring "/wiki/" is found in the string link['href']. It executes the continue if the reference link does not contain "/wiki/". So the loop will continue through the links until it finds a link that contains /wiki/ or until it goes through all the links. Hence the value of linkToScrape at the end of the loop will be the first link containing /wiki/, or 0 if no links contain it.
Now for the first question: the shuffle is necessary for the code to do what it does, which is printing a random Wikipedia article linked within the article https://en.wikipedia.org/wiki/Web_scraping. Without the shuffling, the code would instead always print the first wiki article linked within that article.
Note: The intention expressed in the comments is to find only wiki articles, but a link such as https://google.com/wiki/ will also be printed without being a wiki article.
You can read the documentation for .find() here: https://docs.python.org/3/library/stdtypes.html#str.find
Basically, haystack.find(needle) == -1 is a roundabout way of saying needle not in haystack.
The point of calling random.shuffle() is that the author of this code evidently wanted to select a link from the page at random rather than using, say, the first one or the last one.
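As a small sketch of both points, the loop over the allLinks list from the code above can use in instead of find(), and the false positive noted earlier can be avoided by requiring the href to start with "/wiki/" (a stricter condition than the original code uses):

for link in allLinks:
    href = link.get('href', '')
    # stricter than .find("/wiki/"): accept only relative links that begin with /wiki/
    if not href.startswith("/wiki/"):
        continue
    linkToScrape = link
    break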

Using regex to find something in the middle of a href while looping

For "extra credit" in a beginners class in Python that I am taking I wanted to extract data out of a URL using regex. I know that there are other ways I could probably do this, but my regex desperately needs work so...
Given a URL to start at, find the xth occurrence of a href on the page, and use that link to go down a level. Rinse and repeat until I have found the required link on the page at the requested depth on the site.
I am using Python 3.7 and Beautiful Soup 4.
At the beginning of the program, after all of the house-keeping is done, I have:
starting_url = 'http://blah_blah_blah_by_Joe.html'
extracted_name = re.findall('(?<=by_)([a-zA-Z0-9]+)[^.html]*', starting_url)
selected_names.append(extracted_name)

# Just for testing purposes
print(selected_names)   # [['Joe']]
Hmm, a bit odd; I didn't expect a nested list, but I know how to flatten a list, so OK. Let's go on.
I work my way through a couple of loops, opening each url for the next level down by using:
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
tags = soup('a')
Continue processing and, in the loop where the program should have found the href I want:
# Testing to check I have found the correct href
print(desired_link)
# <a href="http://blah_blah_blah_by_Mary.html">blah blah</a>
type(desired_link)
# bs4.element.Tag
Correct link, but a "type" new to me and not something I can use re.findall on. So more research and I have found:
for link in soup.find_all('a'):
    tags = link.get('href')
    type(tags)   # str
    print(tags)

# Output:
# http://blah_blah_blah_by_George.html
# http://blah_blah_blah_by_Bill.html
# http://blah_blah_blah_by_Mary.html
# etc.
Right type, but when I look at what was printed, I think what I am looking at is maybe just one long string? And I need a way to assign just the third href to a variable that I can use in re.findall('regex expression', desired_link).
Time to ask for help, I think.
And, while we are at it, any ideas about why I get the nested list the first time I used re.findall with the regex?
Please let me know how to improve this question so it is clearer what I've done and what I'm looking for (I KNOW you guys will, without me even asking).
You've printed every link on the page, but each time through the loop tags contains only one of them (you can print len(tags) to verify this easily).
Also, I suggest replacing [a-zA-Z0-9]+ with \w+ - it will catch letters, numbers, and underscores, and is much cleaner.
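A rough sketch that pulls these pieces together, reusing the blah_blah URL from the question: collect every href into a list, index the third one, then take the first findall match rather than appending the whole list (appending the list is what produced the nested [['Joe']]):

import re
import urllib.request
from bs4 import BeautifulSoup

selected_names = []
starting_url = 'http://blah_blah_blah_by_Joe.html'

html = urllib.request.urlopen(starting_url).read()
soup = BeautifulSoup(html, 'html.parser')

# collect every href into a real list, then index the one you want
hrefs = [a.get('href') for a in soup.find_all('a') if a.get('href')]
desired_link = hrefs[2]   # the third href on the page

# re.findall returns a list of matches; take the first element instead of
# appending the list itself
match = re.findall(r'(?<=by_)\w+', desired_link)
if match:
    selected_names.append(match[0])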

How to scrape data selectively from tables in Python 2

I am working on a small project of my own and am trying to wrap my mind around web scraping.
I am using Python 2 and the BeautifulSoup module (but I have tried other modules as well, experimenting with the re module, among others).
Briefly, given the website http://www.bankofcanada.ca/rates/exchange/daily-closing-past-five-day/, I would like to gather the information about the exchange rates for each currency, but with more flexible code.
Here is my example:
import urllib2
from bs4 import BeautifulSoup
import string
import re

myurl = 'http://www.bankofcanada.ca/rates/exchange/daily-closing-past-five-day/'
soup = BeautifulSoup(urllib2.urlopen(myurl).read(), "lxml")

dataTables = soup.find_all('td')
brandNewList = []
for x in dataTables:
    text = x.get_text().strip()
    brandNewList.append(text)
    #print text

for index, item in enumerate(brandNewList):
    if item == "U.S. dollar (close)":
        for item in brandNewList[index:6]:
            print item
It displays:
$ python crawler.py
U.S. dollar (close)
1.4530
1.4557
1.4559
1.4490
1.4279
So, as you may see, I can display the data corresponding to each currency by scraping the 'td' tags; I could get even more specific if I used 'th' in combination with 'td' tags.
But what if I don't really want to specify the exact string "U.S. dollar (close)"; how can I make the script more adaptable to different websites?
For example, I would like to enter only "US"/"us" as an argument from the terminal, and the script would give me back the values corresponding to the US dollar, independently of how the column is named on different websites.
Also, I am kind of a beginner in Python, so can you please show me a neater way of rewriting my web crawler? It feels like I have written it in a kind of "dumb" way, mostly :)
how can I make the script more adaptable to different websites?
Different sites have really different markups, it is close to impossible to make a universal and reliable location mechanism in your case. Depending on how many sites you want to scrape, you may just loop over the different locating functions with an EAFP approach until you successfully get the currency rate.
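A minimal sketch of that EAFP loop, reusing the soup object from the question; the locator functions below are hypothetical placeholders for per-site extraction logic:

def us_rate_bank_of_canada(soup):
    label = soup.find("td", text="U.S. dollar (close)")
    return [td.get_text() for td in label.find_next_siblings("td")]

def us_rate_some_other_site(soup):
    raise NotImplementedError  # placeholder for another site's markup

locators = [us_rate_bank_of_canada, us_rate_some_other_site]

rates = None
for locate in locators:
    try:
        rates = locate(soup)
        break       # this locator fit the page's markup
    except Exception:
        continue    # it did not; try the next one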
Note that some resources provide public or private APIs and you don't really need to scrape them.
By the way, you can improve your code by locating the U.S. dollar (close) label and getting the following td siblings:
us_dollar_label = soup.find("td", text="U.S. dollar (close)")
rates = [td.get_text() for td in us_dollar_label.find_next_siblings("td")]
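For the "US"/"us" idea specifically, one rough heuristic (only a sketch; it assumes the label cell and its rates sit in the same table row) is to normalize both the query and each cell's text before comparing:

import re
import sys
import urllib2
from bs4 import BeautifulSoup

def normalize(s):
    # drop punctuation and whitespace and lowercase:
    # "U.S. dollar (close)" -> "usdollarclose"
    return re.sub(r'\W+', '', s).lower()

query = normalize(sys.argv[1]) if len(sys.argv) > 1 else 'us'
myurl = 'http://www.bankofcanada.ca/rates/exchange/daily-closing-past-five-day/'
soup = BeautifulSoup(urllib2.urlopen(myurl).read(), "lxml")

for td in soup.find_all('td'):
    if query in normalize(td.get_text()):
        rates = [sib.get_text().strip() for sib in td.find_next_siblings('td')]
        print td.get_text().strip(), rates
        break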

Finding urls containing a specific string

I haven't used regex before, and everyone seems to agree that it's bad for web scraping and HTML in particular, but I'm not really sure how to solve my little challenge without it.
I have a small Python scraper that opens 24 different webpages. In each webpage, there are links to other webpages. I want to make a simple solution that gets the links that I need, and even though the webpages are somewhat similar, the links that I want are not.
The only common thing between the URLs seems to be a specific string: 'uge' or 'Uge' (uge means week in Danish - and the week number changes every week, duh). It's not like the URLs have a common ID or something like that which I could use to target the correct ones each time.
I figure it would be possible to use regex to go through the webpage and find all URLs that have 'uge' or 'Uge' in them and then open them. But is there a way to do that using BS? And if I do it using regex, how would a possible solution look?
For example, here are two of the urls I want to grab in different webpages:
http://www.domstol.dk/KobenhavnsByret/retslister/Pages/Uge45-Tvangsauktioner.aspx
http://www.domstol.dk/esbjerg/retslister/Pages/Straffesageruge32.aspx
This should work... The RegEx uge\d\d? tells it to find "uge" followed by a digit, and possibly another one.
import re

for item in listofurls:
    l = re.findall(r"uge\d\d?", item, re.IGNORECASE)
    if l:
        print item   # just do whatever you want to do when it finds it
Yes, you can do this with BeautifulSoup.
import re
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(html_string)

# To find just 'Uge##' or 'uge##', as specified in the question:
urls = [el["href"] for el in soup.findAll("a", href=re.compile(r"[Uu]ge\d+"))]

# To find without regard to case at all:
urls = [el["href"] for el in soup.findAll("a", href=re.compile(r"(?i)uge\d+"))]
Or just use a simple for loop:
list_of_urls = ["""LIST GOES HERE"""]
for url in list_of_urls:
    if 'uge' in url.lower():
        pass  # Code to execute
The regex expression would look something like: uge\d\d
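Putting the pieces together, a rough sketch of the loop over the 24 start pages (the start URL below is a hypothetical placeholder):

import re
import requests
from bs4 import BeautifulSoup

start_pages = [
    "http://www.domstol.dk/KobenhavnsByret/retslister/Pages/Forside.aspx",  # hypothetical start page
    # ... the other 23 start pages go here
]
week_link = re.compile(r"uge\d{1,2}", re.IGNORECASE)

for page in start_pages:
    soup = BeautifulSoup(requests.get(page).content, "html.parser")
    for a in soup.find_all("a", href=week_link):
        print(a["href"])   # e.g. .../Uge45-Tvangsauktioner.aspx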

Use python to crawl a website

So I am looking for a dynamic way to crawl a website and grab links from each page. I decided to experiment with BeautifulSoup. Two questions: how do I do this more dynamically than using nested while statements searching for links? I want to get all the links from this site, but I don't want to keep adding nested while loops.
topLevelLinks = self.getAllUniqueLinks(baseUrl)
listOfLinks = list(topLevelLinks)

length = len(listOfLinks)
count = 0

while(count < length):
    twoLevelLinks = self.getAllUniqueLinks(listOfLinks[count])
    twoListOfLinks = list(twoLevelLinks)
    twoCount = 0
    twoLength = len(twoListOfLinks)

    for twoLinks in twoListOfLinks:
        listOfLinks.append(twoLinks)

    count = count + 1

    while(twoCount < twoLength):
        threeLevelLinks = self.getAllUniqueLinks(twoListOfLinks[twoCount])
        threeListOfLinks = list(threeLevelLinks)

        for threeLinks in threeListOfLinks:
            listOfLinks.append(threeLinks)

        twoCount = twoCount + 1

    print '--------------------------------------------------------------------------------------'

#remove all duplicates
finalList = list(set(listOfLinks))
print finalList
My second question: is there any way to tell if I got all the links from the site? Please forgive me; I am somewhat new to Python (a year or so) and I know some of my processes and logic might be childish. But I have to learn somehow. Mainly I just want to do this more dynamically than using nested while loops. Thanks in advance for any insight.
The problem of spidering over a web site and getting all the links is a common problem. If you Google search for "spider web site python" you can find libraries that will do this for you. Here's one I found:
http://pypi.python.org/pypi/spider.py/0.5
Even better, Google found this question already asked and answered here on StackOverflow:
Anyone know of a good Python based web crawler that I could use?
If you are using BeautifulSoup, why don't you use the findAll() method? Basically, in my crawler I do:
self.soup = BeautifulSoup(HTMLcode)

for frm in self.soup.findAll(str('frame')):
    try:
        if not frm.has_key('src'):
            continue
        src = frm[str('src')]
        #rest of URL processing here
    except Exception, e:
        print 'Parser <frame> tag error: ', str(e)
for the frame tag. The same goes for "img src" and "a href" tags; a sketch for the anchor case follows below.
I like the topic though - maybe it's me who has something wrong here...
edit: there is of course a top-level instance, which saves the URLs and gets the HTML code from each link later...
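As a sketch of the same pattern applied to anchor tags, keeping the answer's BeautifulSoup 3 / Python 2 style (has_key() only exists in that old API):

for a in self.soup.findAll(str('a')):
    try:
        if not a.has_key('href'):
            continue
        href = a[str('href')]
        #rest of URL processing here
    except Exception, e:
        print 'Parser <a> tag error: ', str(e)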
To answer your question from the comment, here's an example (it's in Ruby, but I don't know Python, and they are similar enough for you to be able to follow along easily):
#!/usr/bin/env ruby
require 'open-uri'

hyperlinks = []
visited = []

# add all the hyperlinks from a url to the array of urls
def get_hyperlinks url
  links = []
  begin
    s = open(url).read
    s.scan(/(href|src)\w*=\w*[\",\']\S+[\",\']/) do
      link = $&.gsub(/((href|src)\w*=\w*[\",\']|[\",\'])/, '')
      link = url + link if link[0] == '/'
      # add to array if not already there
      links << link unless links.include?(link)
    end
  rescue
    puts 'Looks like we can\'t be here...'
  end
  links
end

print 'Enter a start URL: '
hyperlinks << gets.chomp
puts 'Off we go!'

count = 0
while true
  break if hyperlinks.length == 0
  link = hyperlinks.shift
  next if visited.include? link
  visited << link
  puts "Connecting to #{link}..."
  links = get_hyperlinks(link)
  puts "Found #{links.length} links on #{link}..."
  hyperlinks = links + hyperlinks
  puts "Moving on with #{hyperlinks.length} links left...\n\n"
end
Sorry about the Ruby, but it's a better language :P and it shouldn't be hard to adapt or, like I said, understand.
1) In Python, we do not count elements of a container and use them to index in; we just iterate over its elements, because that's what we want to do.
2) To handle multiple levels of links, we can use recursion.
def followAllLinks(self, from_where):
    for link in list(self.getAllUniqueLinks(from_where)):
        self.followAllLinks(link)
This does not handle cycles of links, but neither did the original approach. You can handle that by building a set of already-visited links as you go.
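A minimal sketch of that visited-set idea, assuming the same getAllUniqueLinks helper from the question:

def followAllLinks(self, from_where, visited=None):
    if visited is None:
        visited = set()
    if from_where in visited:
        return visited
    visited.add(from_where)
    for link in self.getAllUniqueLinks(from_where):
        self.followAllLinks(link, visited)
    return visited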
Use scrapy:
Scrapy is a fast high-level screen scraping and web crawling
framework, used to crawl websites and extract structured data from
their pages. It can be used for a wide range of purposes, from data
mining to monitoring and automated testing.
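A minimal Scrapy sketch of a link-collecting spider (the class name and start URL are hypothetical placeholders); it can be run with scrapy runspider:

import scrapy

class LinkSpider(scrapy.Spider):
    name = "links"
    start_urls = ["http://example.com/"]   # hypothetical start page

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            yield {"link": response.urljoin(href)}            # record the link
            yield response.follow(href, callback=self.parse)  # keep crawling; Scrapy filters duplicate requests by default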
