I searched a lot for this, but I might be using the wrong terms; the answers I found are either not very relevant or too advanced for me.
So, I have a very simple program. I have a function that reads a web page, scans it for href links using BeautifulSoup, takes one of the links it finds, and follows it. The function takes the first link through user input.
Now I want this function to re-run automatically using the link it found, but I only manage to create endless loops by using the first variable it got. This is all done in a controlled environment which has a maximum depth of 10 links.
This is my code:
import urllib
from BeautifulSoup import *

site = list()

def follinks(x):
    html = urllib.urlopen(x).read()
    bs = BeautifulSoup(html)
    tags = bs('a')
    for tag in tags:
        site.append(tag.get('href', None))
    x = site[2]
    print x
    return

url1 = raw_input('Enter url:')
How do I make it use the x variable, go back to the start, and re-run the function until there are no more links to follow? I tried a few variations of while True, but again I ended up in an endless loop of the URL the user gave.
Thanks.
What you're looking for is called recursion: a function that calls itself from within its own body.
def follow_links(x):
    global site  # needed because we assign to the module-level list below
    html = urllib.urlopen(x).read()
    bs = BeautifulSoup(html)

    # Put all the links on page x into the pagelinks list
    pagelinks = []
    tags = bs('a')
    for tag in tags:
        pagelinks.append(tag.get('href', None))

    # Track all links from this page in the master site list
    site += pagelinks

    # Follow the third link, if there is one
    if len(pagelinks) > 2:
        follow_links(pagelinks[2])
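To kick things off, call the function with the URL the user entered. If you would rather not rely on the controlled environment to stop the recursion, here is a rough sketch with an extra depth counter; the depth parameter is an addition for illustration, not something from the code above:

def follow_links(x, depth=0):
    if depth >= 10:                      # stop after 10 levels of links
        return
    html = urllib.urlopen(x).read()
    bs = BeautifulSoup(html)
    pagelinks = [tag.get('href', None) for tag in bs('a')]
    if len(pagelinks) > 2:
        follow_links(pagelinks[2], depth + 1)

url1 = raw_input('Enter url:')
follow_links(url1)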
A total newbie here in search of your wisdom (1st post/question, too)! Thank you in advance for your time and patience.
I am hoping to automate scientific literature searches on Google Scholar using Selenium (via Chrome) with Python. I envision entering a topic, which will be searched on Google Scholar, and then entering each link of the articles/books in the results, extracting the abstract/summary, and printing them to the console (or saving them to a text file). This will be an easy way to determine the relevancy of the articles in the results for the things I'm writing.
Thus far, I am able to visit Google Scholar, enter text in the search bar, filter by date (newest to oldest), and extract each of the links in the results. I have not been able to write a loop that will enter each article link and extract the abstract (or other relevant text), as each result may be coded differently.
Kind regards,
JP (Aotus_californicus)
This is my code so far:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

def get_results(search_term):
    url = 'https://scholar.google.com'
    browser = webdriver.Chrome(executable_path=r'C:\Users\Aotuscalifornicus\Downloads\chromedriver_win32\chromedriver.exe')
    browser.get(url)
    searchBar = browser.find_element_by_id('gs_hdr_tsi')
    searchBar.send_keys(search_term)
    searchBar.submit()
    browser.find_element_by_link_text("Trier par date").click()
    results = []
    links = browser.find_elements_by_xpath('//h3/a')
    for link in links:
        href = link.get_attribute('href')
        print(href)
        results.append(href)
    browser.close()

get_results('Primate thermoregulation')
With regard to your comment, and using that as the basis for my answer:
To clarify, I am looking to write a loop that enters each link and extracts an element by tag, for example
Open a new window or start a new driver session to check the links in the results, then use a rule to extract the text you want. You could re-use your existing driver session if you extract all the hrefs first, or create a new tab as you get each result link (a rough sketch of the new-tab variant follows the snippet below).
for link in links:
    href = link.get_attribute('href')
    print(href)
    results.append(href)

extractor = webdriver.Chrome(executable_path=...)  # as above
for result in results:
    extractor.get(result)
    section_you_want = extractor.find_elements_by_xpath(...)  # or whichever set of rules
    # other code here

extractor.close()
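If you go the new-tab route instead of a second driver, a minimal sketch (assuming the original browser session is still open and the hrefs are already collected in results) could look like this:

for result in results:
    browser.execute_script("window.open('');")             # open a blank tab
    browser.switch_to.window(browser.window_handles[-1])   # focus the new tab
    browser.get(result)
    # extract whatever you need here, e.g. with find_elements(...)
    browser.close()                                         # close the tab again
    browser.switch_to.window(browser.window_handles[0])    # back to the results tab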
You can set up rules to use with the base find_element() or find_elements() finders and then iterate over them until you get a result (validate based on element presence, text length, or something sane & useful). Each of the rules is a tuple that can be passed to the base finder function:
from selenium.webdriver.common.by import By  # see the docs linked above for the available `By` class attributes

rules = [(By.XPATH, '//h3/p'),
         (By.ID, 'summary'),
         (By.TAG_NAME, 'div'),
         ...  # etc.
        ]

for url in results:
    extractor.get(url)
    for rule in rules:
        elems = extractor.find_elements(*rule)  # argument unpacking
        if not elems:
            continue  # not found, try next rule
        print(elems[0].text)
        break  # stop after first successful "find"
    else:  # only executed if no rule matched and `break` was never reached, or the `rules` list is empty
        print('Could not find anything for url:', url)
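Since the goal was to print the abstracts to the console or save them to a text file, the same loop can append whatever it finds to a file; a minimal sketch, assuming an abstracts.txt output file:

with open('abstracts.txt', 'a', encoding='utf-8') as out:
    for url in results:
        extractor.get(url)
        for rule in rules:
            elems = extractor.find_elements(*rule)
            if not elems:
                continue
            out.write(url + '\n' + elems[0].text + '\n\n')   # save the url and the extracted text
            break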
Below is my Python code for scraping using BS4. When I try to run the loop it prints the same data. Also, please let me know how to run the pagination loop in Python.
import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.yellowpages.com/los-angeles-ca/restaurants'
page = requests.get(url)
soup = bs(page.content,'html.parser')
#print(len(soup))
containers = soup.find_all("div",{"class","v-card"})
#print(containers[0])
name = containers.find_all("a",{"class","business-name"})
print(name[0].get_text())
phone = soup.find_all("div",{"class","phone"})
#print(phone[0].get_text())
add = soup.find_all("p",{"class","adr"})
#print(add[0].get_text())

for items in containers:
    name_soup = containers.find("a",{"class","business-name"})
    print(name_soup)
This line will give you an error-
name = containers.find_all("a",{"class","business-name"})
because containers is a list, and not a single element on which you could call find_all() method.
You need to access containers in a loop, as it is a list of div tags that you have extracted on the previous line.
This is your previous line, where you are extracting all div tags (having class=v-card) in a list-
containers = soup.find_all("div",{"class","v-card"})
for items in containers:
    name_soup = containers.find("a",{"class","business-name"})
    print(name_soup)
You're not using your items variable; you're constantly searching in containers.
Use items.find(...).
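Putting that together, the corrected loop would look something like this (still keyed on the business-name class from your code):

for item in containers:
    name_soup = item.find("a", {"class": "business-name"})
    if name_soup:                                # some cards may not contain a business-name link
        print(name_soup.get_text())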
please let me know how to run the pagination loop in python.
This is much broader and really depends on the target website. Look at what changes when you click the next-page button on the site. Often it's just a query-string parameter (e.g. ?p=3). Then replicate that in your GET.
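For example, if the site pages through results via a query-string parameter, the request loop might look like the sketch below; the parameter name page is an assumption here, so check what actually appears in your address bar:

base_url = 'https://www.yellowpages.com/los-angeles-ca/restaurants'
for page_number in range(1, 6):                                   # first five pages, for example
    page = requests.get(base_url, params={'page': page_number})   # parameter name is an assumption
    soup = bs(page.content, 'html.parser')
    containers = soup.find_all("div", {"class": "v-card"})
    # ... same per-card extraction as above ...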
I have a loop that constantly adds a variable with an unknown value to a list and then prints the list. However, I can't find a way to ignore the values previously found and added to the list when I print the list the next time.
I'm scraping a constantly updating website for keyword-matching links using requests and bs4 inside a loop. Once the website adds the links I'm looking for, my code adds them to a list and prints the list. Once the website adds the next wave of matching links, these will also be added to my list; however, my code will also add the old links found before to the list again, since they still match my keyword. Is it possible to ignore these old links?
url = "www.website.com"
keyword = "news"
results = [] #list which saves the links
while True:
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')
options = soup.find_all("a", class_="name-link")
for o in options:
if keyword in o.text:
link = o.attrs["href"] #the links I want
results.append(link) #adds links to list
print(results)
time.sleep(5) #wait until next scrape
#so with every loop the value of 'link' is changing which makes it hard
for me to find a way to ignore previously found links
To maybe make it easier to understand: you could think of a loop that adds an unknown number to a list with every execution, but each number should only be printed the first time it appears.
Here is a proof of concept using sets, if the challenge is that you only want to keep unique links, and then print the new links found that have not been found previously:
import random

results = set()
for k in range(15):
    new = {random.randint(1, 5)}
    print(f"First Seen: {new - results}")
    results = results.union(new)
    print(f"All: {results}")
If it is more of a streaming issue, where you save all links to a large list, but only want to print the latest ones found you can do something like this:
import random

results = []
for k in range(5):
    n = len(results)
    new = []
    for j in range(random.randint(1, 5)):
        new.append(random.randint(1, 5))
    results.extend(new)
    print(results[n:])
But then again, you can also just print new in this case....
This is a good use case for the set data structure. Sets do not maintain any ordering of their items. It's a very simple change to your code above:
url = "www.website.com"
keyword = "news"
results = {}
while True:
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')
options = soup.find_all("a", class_="name-link")
for o in options:
if keyword in o.text:
link = o.attrs["href"] #the links I want
results.add(link) #adds links to list
print(results)
time.sleep(5) #wait until next scrape
If you want to maintain order, you can use some variation of an ordered dictionary. Please see here: Does Python have an ordered set?
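As a rough sketch of that idea: from Python 3.7 onwards a plain dict preserves insertion order, so the keys of a dict can serve as an order-preserving set of links:

results = {}                                   # dict keys act as an ordered set (Python 3.7+)

def add_link(link):
    if link not in results:
        results[link] = True                   # the value is unused; only the key matters
        print("New link:", link)

# inside the scraping loop:
#     add_link(o.attrs["href"])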
I am in need of some help for an assignment. I need to build a "simple" (according to my teacher) web scraper that takes a URL as an argument, searches the source code of that URL, and then returns the links within that source code (anything after href). The example URL my teacher has been having us use is http://citstudent.lanecc.net/tools.shtml. When the program is executed, there should be ten links returned, as well as the URL of the website.
Since I am still trying to wrap my head around these concepts, I wasn't sure where to start, so I turned to Stack Overflow and found a script that -kind of- works. It does what I want it to do, but does not fulfill every requirement:
import urllib2

url = "http://citstudent.lanecc.net/tools.shtml"
page = urllib2.urlopen(url)
data = page.read().split("</a>")
tag = "<a href=\""
endtag = "\">"

for item in data:
    if "<a href" in item:
        try:
            ind = item.index(tag)
            item = item[ind+len(tag):]
            end = item.index(endtag)
        except: pass
        else:
            print item[:end]
This works because I hard-coded the URL into my code, and because it prints after some href tags. Normally I would say to just guide me through this and to not just give me the code, but I'm having such a shit day and any explanation or example of this has to be better than what we went over in class. Thank you.
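For what it's worth, a minimal sketch of how the hard-coded URL could become a command-line argument (using sys.argv), with the URL printed before the links, might look like this:

import sys
import urllib2

url = sys.argv[1]                        # e.g. python scraper.py http://citstudent.lanecc.net/tools.shtml
print url                                # the assignment wants the URL printed as well
page = urllib2.urlopen(url)
data = page.read().split("</a>")
tag = "<a href=\""
endtag = "\">"

for item in data:
    if "<a href" in item:
        try:
            ind = item.index(tag)
            item = item[ind + len(tag):]
            end = item.index(endtag)
        except ValueError:
            pass
        else:
            print item[:end]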
Trying to parse a weather page and select the weekly forecasted highs.
Normally I would search with tags = soup.find_all("span", id="hi"), but this tag doesn't use an id; it uses a class.
Full code:
import mechanize
from bs4 import BeautifulSoup
my_browser = mechanize.Browser()
html_page = my_browser.open("http://www.wunderground.com/weather-forecast/45056")
html_text = html_page.get_data()
my_soup = BeautifulSoup(html_text)
tags = my_soup.find_all("span", class_="hi")
temp = tags[0].string
print temp
When I run this, nothing prints.
The piece of HTML is buried inside a bunch of other tags; however, the specific tag for today's high is as follows:
<span class="hi">63</span>
Just use class_ as the parameter name. See the docs.
The problem arises because class is a Python keyword, so you can't use it directly.
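If you prefer not to use the special keyword argument, the same search can be written with an explicit attrs dictionary, for example:

tags = my_soup.find_all("span", attrs={"class": "hi"})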
As an alternative to scraping the web page, you could always check out Weather Underground's API. It's free for developers (limited number of calls per day, etc.), but if you're going to be doing a number of lookups, this might be easier in the end.