for loop with .find() - Python

I'm new to programming and I have been studying through FreeCodeCamp for a couple of weeks. I was reviewing a code available on their website and I have two questions.
Why is it necessary to shuffle the variable allLinks?
What does if link['href'].find("/wiki/") == -1 really check?
I really appreciate your help.
import requests
from bs4 import BeautifulSoup
import random

response = requests.get(
    url="https://en.wikipedia.org/wiki/Web_scraping",
)
soup = BeautifulSoup(response.content, 'html.parser')
title = soup.find(id="firstHeading")
print(title.content)

# Get all the links
allLinks = soup.find(id="bodyContent").find_all("a")
random.shuffle(allLinks)

linkToScrape = 0
for link in allLinks:
    # We are only interested in other wiki articles
    if link['href'].find("/wiki/") == -1:
        continue
    # Use this link to scrape
    linkToScrape = link
    break

print(linkToScrape)
Here is the output:
Eventbrite, Inc.

It's better to first explain question 2 and describe what the code is doing.
Recall that find() returns the index of the first occurrence of the substring, or -1 if it can't find the desired substring. So if link['href'].find("/wiki/") == -1 checks whether the substring "/wiki/" appears in the string link['href']. The continue executes when the reference link does not contain "/wiki/", so the loop keeps going through the links until it finds one that contains /wiki/ or until it has gone through them all. Hence the value of linkToScrape at the end of the loop will be the first link containing /wiki/, or 0 if no link contains it.
Now for the first question: the shuffle is necessary for the code to do what it does, which is print a random Wikipedia article linked within the article https://en.wikipedia.org/wiki/Web_scraping. Without the shuffling, the code would always print the first wiki article linked within that page.
Note: The intention expressed in the comments is to find only wiki articles, but a link such as https://google.com/wiki/ would also pass the check despite not being a wiki article.

You can read the documentation for .find() here: https://docs.python.org/3/library/stdtypes.html#str.find
Basically, haystack.find(needle) == -1 is a roundabout way of saying needle not in haystack.
The point of calling random.shuffle() is that the author of this code evidently wanted to select a link from the page at random rather than using, say, the first one or the last one.
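Both answers can be checked with a short, self-contained sketch; the hrefs below are hypothetical stand-ins for what find_all("a") would return:

```python
import random

# Hypothetical hrefs like those found in a Wikipedia bodyContent div
hrefs = [
    "#cite_note-1",
    "/wiki/Data_scraping",
    "https://example.com/external",
    "/wiki/Web_crawler",
]

# The two checks are equivalent: find() returns -1 when the substring is absent
for h in hrefs:
    assert (h.find("/wiki/") == -1) == ("/wiki/" not in h)

# Shuffling first means the surviving link is a random wiki link,
# not always the first one in document order
random.shuffle(hrefs)
wiki_links = [h for h in hrefs if "/wiki/" in h]
print(wiki_links[0])  # one of the two /wiki/ links, chosen at random
```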

Related

Getting hyperlinks with a certain prefix in Python with BeautifulSoup

I was trying to create a function to get a link to another Wikipedia page from one. Links to all other wiki articles start with the prefix "/wiki/". I tried writing code to get one random link, but my code was getting all the links. After that, I saw the following code on the internet.
allLinks = soup.find(id="bodyContent").find_all("a")
random.shuffle(allLinks)

linkToScrape = 0
for link in allLinks:
    # We are only interested in other wiki articles
    if link['href'].find("/wiki/") == -1:
        continue
    # Use this link to scrape
    linkToScrape = link
    break
This code chunk seems to work perfectly. However, I couldn't understand a part.
if link['href'].find("/wiki/") == -1:
I couldn't understand the use of -1. Moreover, can someone explain how the conditional in this line of code works and how the find function is used here?
For background, here is the page I found the code on: "https://www.freecodecamp.org/news/scraping-wikipedia-articles-with-python/"
s.find(sub) returns -1 if the substring sub is not found in the string s. So in this case it is saying "If we don't find /wiki/ in the link string, then continue, because it is not a Wikipedia link".
The reason it is a weird number like -1 is that find returns the index where the substring is found, and that could be any index from 0 upward. So -1 is used to signify "not found at any index".
https://docs.python.org/3/library/stdtypes.html#str.find
Although it seems like startswith would be more appropriate in this case:

if not link['href'].startswith("/wiki/"):
    continue
https://docs.python.org/3/library/stdtypes.html#str.startswith
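To see the difference between the two checks, here is a minimal sketch with hypothetical hrefs; only startswith rejects an external URL that merely contains "/wiki/":

```python
# Hypothetical hrefs; only true article links start with "/wiki/"
hrefs = [
    "https://google.com/wiki/",   # contains "/wiki/" but is not a wiki article
    "/wiki/Web_scraping",
    "#History",
]

# find()-based check: keeps the google.com link too
contains = [h for h in hrefs if h.find("/wiki/") != -1]

# startswith-based check: keeps only relative article links
starts = [h for h in hrefs if h.startswith("/wiki/")]

print(contains)  # ['https://google.com/wiki/', '/wiki/Web_scraping']
print(starts)    # ['/wiki/Web_scraping']
```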

Using regex to find something in the middle of a href while looping

For "extra credit" in a beginners class in Python that I am taking I wanted to extract data out of a URL using regex. I know that there are other ways I could probably do this, but my regex desperately needs work so...
Given a URL to start at, find the xth occurrence of a href on the page, and use that link to go down a level. Rinse and repeat until I have found the required link on the page at the requested depth on the site.
I am using Python 3.7 and Beautiful Soup 4.
At the beginning of the program, after all of the house-keeping is done, I have:
starting_url = 'http://blah_blah_blah_by_Joe.html'
extracted_name = re.findall('(?<=by_)([a-zA-Z0-9]+)[^.html]*', starting_url)
selected_names.append(extracted_name)
# Just for testing purposes
print(selected_names)   # [['Joe']]
Hmm, a bit odd didn't expect a nested list, but I know how to flatten a list, so ok. Let's go on.
I work my way through a couple of loops, opening each url for the next level down by using:
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
tags = soup('a')
Continue processing and, in the loop where the program should have found the href I want:
# Testing to check I have found the correct href
print(desired_link)
# <a href="http://blah_blah_blah_by_Mary.html">blah blah</a>
type(desired_link)
# bs4.element.Tag
Correct link, but a "type" new to me and not something I can use re.findall on. So more research and I have found:
for link in soup.find_all('a'):
    tags = link.get('href')
    type(tags)   # str
    print(tags)

# Output:
# http://blah_blah_blah_by_George.html
# http://blah_blah_blah_by_Bill.html
# http://blah_blah_blah_by_Mary.html
# etc.
Right type, but when I look at what was printed, I think what I am looking at is maybe just one long string? And I need a way to assign just the third href in that output to a variable that I can use in re.findall('regex expression', desired_link).
Time to ask for help, I think.
And, while we are at it, any ideas about why I get the nested list the first time I used re.findall with the regex?
Please let me know how to improve this question so it is clearer what I've done and what I'm looking for (I KNOW you guys will, without me even asking).
You've printed every link on the page, but each time through the loop tags contains only one of them, a single href string (you can print it inside the loop to verify this). If you want the third href, collect them into a list first and index it.
Also I suggest replacing [a-zA-Z0-9]+ with \w+ - it will catch letters, numbers and underscores and is much cleaner.
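As for the nested list: re.findall itself returns a list, so appending that list to another list nests it. A small sketch, using the simplified pattern (?<=by_)(\w+) and the example URLs from the question:

```python
import re

starting_url = "http://blah_blah_blah_by_Joe.html"

# re.findall returns a list of matches; with one capture group,
# each element is the group's text
extracted = re.findall(r"(?<=by_)(\w+)", starting_url)
print(extracted)   # ['Joe']

# Appending that list to another list is what produced [['Joe']];
# extend (or indexing) flattens it
selected_names = []
selected_names.append(extracted)   # -> [['Joe']]
selected_names = []
selected_names.extend(extracted)   # -> ['Joe']

# Collecting hrefs into a list makes "the third link" a simple index
links = [
    "http://blah_blah_blah_by_George.html",
    "http://blah_blah_blah_by_Bill.html",
    "http://blah_blah_blah_by_Mary.html",
]
third = links[2]
print(re.findall(r"(?<=by_)(\w+)", third))   # ['Mary']
```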

Python - Regular Expression outputting last occurrence [HTML Scraping]

I'm web scraping from a local archive of techcrunch.com. I'm using regex to sort through and grab every heading for each article, however my output continues to remain as the last occurrence.
def extractNews():
    selection = listbox.curselection()
    if selection == (0,):
        # Read the webpage:
        response = urlopen("file:///E:/University/IFB104/InternetArchive/Archives/Sun,%20October%201st,%202017.html")
        html = response.read()
        match = findall(r'<h2 class="post-title"><a href="(.*?)".*>(.*)</a></h2>', str(html))  # use [-2] for position after )
        if match:
            for link, title in match:
                variable = "%s" % (title)
                print(variable)
and the current output is
Heetch raises $12 million to reboot its ridesharing service
which is the last heading of the entire webpage (the last occurrence on the page). Each article block consists of the same code for the heading:
<h2 class="post-title">Heetch raises $12 million to reboot its ridesharing service</h2>
I cannot see why it keeps returning this last match. I have run it through websites such as https://regex101.com/ and it tells me that I only have one match, which is not the one being outputted in my program. Any help would be greatly appreciated.
EDIT: If anyone is aware of a way to display each matched result SEPARATELY between different <h1></h1> tags when writing to a .html file, it would mean a lot :) I am not sure if this is right, but I think you use [-#] for the position/match being referred to?
The regex is fine, but your problem is in the loop here.

if match:
    for link, title in match:
        variable = "%s" % (title)

Your variable is overwritten in each iteration. That's why you only see its value for the last iteration of the loop.
You could do something along these lines:

if match:
    variableList = []
    for link, title in match:
        variable = "%s" % (title)
        variableList.append(variable)
    print(variableList)
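To address the EDIT as well, here is a minimal regex-only sketch (with a tiny stand-in HTML string, since the archived page isn't available) that collects every title and writes each one into its own <h1> element:

```python
import re

# A tiny stand-in for the archived page (assumed structure)
html = (
    '<h2 class="post-title"><a href="http://example.com/a">First headline</a></h2>'
    '<h2 class="post-title"><a href="http://example.com/b">Second headline</a></h2>'
)

# Lazy quantifiers keep each match inside a single <h2> block
matches = re.findall(r'<h2 class="post-title"><a href="(.*?)"[^>]*>(.*?)</a></h2>', html)

# Keep every title rather than overwriting a single variable
titles = [title for link, title in matches]
print(titles)   # ['First headline', 'Second headline']

# Write each one into its own <h1> element
html_out = "\n".join("<h1>%s</h1>" % t for t in titles)
print(html_out)
```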
Also, generally, I would recommend against using regex to parse html (as per the famous answer).
If you haven't already familiarised yourself with BeautifulSoup, you should. Here is a non-regex solution using BeautifulSoup to dig out all h2 post-titles from your html page.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
soup.findAll('h2', {'class':'post-title'})

Python update value within function and reuse it

I searched a lot about this, but I might be using the wrong terms; the answers I found are either not very relevant or too advanced for me.
So, I have a very simple program. I have a function that reads a web page, scans it for href links using BeautifulSoup, takes one of the links it finds, and follows it. The function takes the first link through user input.
Now I want this function to re-run automatically using the link it found, but I only manage to create endless loops by reusing the first variable it got. This is all done in a controlled environment which has a maximum depth of 10 links.
This is my code:
import urllib
from BeautifulSoup import *

site = list()

def follinks(x):
    html = urllib.urlopen(x).read()
    bs = BeautifulSoup(html)
    tags = bs('a')
    for tag in tags:
        site.append(tag.get('href', None))
    x = site[2]
    print x
    return

url1 = raw_input('Enter url:')
How do I make it use the x variable and go back to the start and rerun the function until there are no more links to follow? I tried a few variations of while True, but again ended up in endless loops of the url the user gave.
thanks.
What you're looking for is called recursion. It's where you call a method from within its own body definition.
def follow_links(x):
    html = urllib.urlopen(x).read()
    bs = BeautifulSoup(html)
    # Put all the links on page x into the pagelinks list
    pagelinks = []
    tags = bs('a')
    for tag in tags:
        pagelinks.append(tag.get('href', None))
    # Track all links from this page in the master site list
    global site
    site += pagelinks
    # Follow the third link, if there is one
    if len(pagelinks) > 2:
        follow_links(pagelinks[2])
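To keep the recursion from running forever, you can thread a depth counter through the calls. A minimal sketch of the idea, using a hypothetical dict of pages in place of real HTTP requests so it runs anywhere:

```python
# Fake "site map": each page maps to the hrefs found on it (assumed data)
FAKE_PAGES = {
    "page0": ["a", "b", "page1"],
    "page1": ["c", "d", "page2"],
    "page2": ["e"],  # fewer than three links: recursion stops here
}

visited = []

def follow_links(url, depth=0, max_depth=10):
    visited.append(url)
    links = FAKE_PAGES.get(url, [])
    # Follow the third link, if there is one and we are not too deep
    if len(links) > 2 and depth < max_depth:
        follow_links(links[2], depth + 1, max_depth)

follow_links("page0")
print(visited)   # ['page0', 'page1', 'page2']
```

With real pages you would replace the dict lookup with the urlopen/BeautifulSoup code from the answer; the depth check is what prevents an endless loop.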

Finding urls containing a specific string

I haven't used RegEx before, and everyone seems to agree that it's bad for webscraping and html in particular, but I'm not really sure how to solve my little challenge without.
I have a small Python scraper that opens 24 different webpages. In each webpage, there are links to other webpages. I want to make a simple solution that gets the links that I need, and even though the webpages are somewhat similar, the links that I want are not.
The only common thing between the urls seems to be a specific string: 'uge' or 'Uge' (uge means week in Danish - and the week number changes every week, duh). It's not like the urls have a common ID or something like that which I could use to target the correct ones each time.
I figure it would be possible using RegEx to go through the webpage and find all urls that have 'uge' or 'Uge' in them and then open them. But is there a way to do that using BS? And if I do it using RegEx, what would a possible solution look like?
For example, here are two of the urls I want to grab in different webpages:
http://www.domstol.dk/KobenhavnsByret/retslister/Pages/Uge45-Tvangsauktioner.aspx
http://www.domstol.dk/esbjerg/retslister/Pages/Straffesageruge32.aspx
This should work... The RegEx uge\d\d? tells it to find "uge" followed by a digit, and possibly another one.

import re

for item in listofurls:
    l = re.findall("uge\d\d?", item, re.IGNORECASE)
    if l:
        print item  # just do whatever you want to do when it finds it
Yes, you can do this with BeautifulSoup.
import re
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(html_string)
# To find just 'Uge##' or 'uge##', as specified in the question:
urls = [el["href"] for el in soup.findAll("a", href=re.compile("[Uu]ge\d+"))]
# To find without regard to case at all:
urls = [el["href"] for el in soup.findAll("a", href=re.compile("(?i)uge\d+"))]
Or just use a simple for loop:
list_of_urls = ["""LIST GOES HERE"""]
for url in list_of_urls:
    if 'uge' in url.lower():
        # Code to execute
The regex expression would look something like: uge\d\d
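Putting the pieces together, here is a minimal sketch that filters a list of urls case-insensitively on "uge" plus digits; the urls are the two examples from the question plus one hypothetical non-match:

```python
import re

# Two real example urls from the question, plus a hypothetical non-match
urls = [
    "http://www.domstol.dk/KobenhavnsByret/retslister/Pages/Uge45-Tvangsauktioner.aspx",
    "http://www.domstol.dk/esbjerg/retslister/Pages/Straffesageruge32.aspx",
    "http://www.domstol.dk/esbjerg/retslister/Pages/Kontakt.aspx",
]

# Case-insensitive match on "uge" followed by one or more digits
pattern = re.compile(r"uge\d+", re.IGNORECASE)
wanted = [u for u in urls if pattern.search(u)]
print(wanted)   # the first two urls; "Kontakt.aspx" is filtered out
```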
