Problem looping over multiple divs using Beautiful Soup - python

Below is my Python code for scraping with BS4. When I run the loop, it prints the same data on every iteration. Also, please let me know how to run the pagination loop in Python.
import requests
from bs4 import BeautifulSoup as bs
url = 'https://www.yellowpages.com/los-angeles-ca/restaurants'
page = requests.get(url)
soup = bs(page.content,'html.parser')
#print(len(soup))
containers = soup.find_all("div",{"class","v-card"})
#print(containers[0])
name = containers.find_all("a",{"class","business-name"})
print(name[0].get_text())
phone = soup.find_all("div",{"class","phone"})
#print(phone[0].get_text())
add = soup.find_all("p",{"class","adr"})
#print(add[0].get_text())
for items in containers:
    name_soup = containers.find("a",{"class","business-name"})
    print(name_soup)

This line will raise an error:
name = containers.find_all("a",{"class","business-name"})
because containers is a list, not a single element on which you could call the find_all() method.
You need to iterate over containers in a loop, as it is the list of div tags that you extracted on the previous line.
This is that previous line, where you extract all div tags with class v-card into a list:
containers = soup.find_all("div",{"class","v-card"})

for items in containers:
    name_soup = containers.find("a",{"class","business-name"})
    print(name_soup)
You're not using your items variable; on every iteration you're searching in containers instead.
Use items.find(...).
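To make the fix concrete, here is a minimal sketch of the corrected loop. It runs on a small inline HTML string rather than the live site, and the markup and business names are made up for illustration; only the class names (v-card, business-name, phone) come from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the real results page.
html = """
<div class="v-card"><a class="business-name">Cafe One</a><div class="phone">(213) 555-0101</div></div>
<div class="v-card"><a class="business-name">Cafe Two</a><div class="phone">(213) 555-0102</div></div>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for item in soup.find_all("div", class_="v-card"):
    # Search inside the current card, not in the whole list of cards.
    name = item.find("a", class_="business-name")
    phone = item.find("div", class_="phone")
    rows.append((name.get_text(strip=True), phone.get_text(strip=True)))

print(rows)
```

On a real page a card may lack a phone or name, in which case find() returns None, so a check before calling get_text() would be needed.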
please let me know how to run the pagination loop in python.
This is a much broader question and really depends on the target website. Look at what changes in the URL when you click the "next page" button on the site. Often it's just a query string parameter (e.g. ?p=3). Then replicate that in your GET request.
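As a rough sketch of that idea, the helper below builds one URL per page number. The parameter name page and the page count of 3 are assumptions for illustration; check the site's real "next page" links to see which parameter it uses.

```python
from urllib.parse import urlencode

def page_urls(base_url, last_page, param="page"):
    """Yield one URL per page number, from 1 to last_page inclusive.

    `param` is a hypothetical query-string name; use whatever the
    target site actually changes when you click "next page".
    """
    for number in range(1, last_page + 1):
        yield "{}?{}".format(base_url, urlencode({param: number}))

urls = list(page_urls("https://www.yellowpages.com/los-angeles-ca/restaurants", 3))
for url in urls:
    # Each URL would be fetched with requests.get(url) and parsed as before.
    print(url)
```

In practice you would stop the loop when a page comes back with no result containers, instead of hard-coding the last page.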

Related

Scraping Webpage with Javascript Elements

To preface: the website I've been trying to scrape seems to use JavaScript (I'm unsure about the jargon around web development and the like), and I've had varying success scraping different tables on different pages.
For instance on this page: http://www.tennisabstract.com/cgi-bin/player.cgi?p=NovakDjokovic I was easily able to 'inspect element' then go to Network find the correct 'Name' of the script and then find the Request URL I needed to get the table that I wanted. The code I used for this was:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'http://www.minorleaguesplits.com/tennisabstract/cgi-bin/frags/NovakDjokovic.js'
content = requests.get(url)
soup = BeautifulSoup(content.text, 'html.parser')
table = soup.find('table', id='tour-years', attrs={'class': 'tablesorter'})
dfs = pd.read_html(str(table))
df = pd.concat(dfs)
However, now when I'm looking at a different page on the same site, say this one http://www.tennisabstract.com/charting/20190714-M-Wimbledon-F-Roger_Federer-Novak_Djokovic.html, I'm unable to find the Request URL that will allow me to eventually get the table that I want. I repeat the same process as I did above, but there's no .js script under the Network tab that has the table. I do see the table when I'm looking at the html elements, but of course I can't get it without the correct url.
So my question would be, how can I get the table from this page http://www.tennisabstract.com/charting/20190714-M-Wimbledon-F-Roger_Federer-Novak_Djokovic.html ?
TIA!
Looking at the HTML source of the page, you can see that all the data is already loaded inside a script tag. All you need to do is extract the variable values and load them into BeautifulSoup.
The following code isolates the part of the script tag that holds the variables and their values:
import requests, re
from bs4 import BeautifulSoup
res = requests.get("http://www.tennisabstract.com/charting/20190714-M-Wimbledon-F-Roger_Federer-Novak_Djokovic.html")
soup = BeautifulSoup(res.text, "lxml")
script = soup.find("script", attrs={"language":"JavaScript"}).text
var_only = script[:script.index("$(document)")].strip()
Next you can use regex to get the variable values - https://regex101.com/r/7cE85A/1
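A minimal sketch of that regex step follows. The sample string is a hypothetical stand-in for var_only, and the pattern assumes each declaration looks like var name = '...'; on one line or across lines; the real page's layout may differ, so adjust the pattern after inspecting the script.

```python
import re

# Hypothetical stand-in for the `var_only` string built in the answer's code.
var_only = """
var player1 = 'Roger Federer';
var serve = '<table id="serve"><tr><td>1</td></tr></table>';
"""

# Capture the variable name and the single-quoted value that follows it.
# DOTALL lets a value span several lines.
pattern = re.compile(r"var\s+(\w+)\s*=\s*'(.*?)';", re.DOTALL)
variables = dict(pattern.findall(var_only))

for name, value in variables.items():
    print(name, "->", value[:40])
```

Any value that contains table markup can then be handed to BeautifulSoup or pandas.read_html like an ordinary HTML string.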

Python (Selenium/BeautifulSoup) Search Result Dynamic URL

Disclaimer: This is my first foray into web scraping
I have a list of URLs corresponding to search results, e.g.,
http://www.vinelink.com/vinelink/servlet/SubjectSearch?siteID=34003&agency=33&offenderID=2662
I'm trying to use Selenium to access the HTML of the result as follows:
for url in detail_urls:
    driver.get(url)
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.prettify())
However, when I comb through the resulting prettified soup, I notice that the components I need are missing. Upon looking back at the page loading process, I see that the URL redirects a few times as follows:
http://www.vinelink.com/vinelink/servlet/SubjectSearch?siteID=34003&agency=33&offenderID=2662
https://www.vinelink.com/#/searchResults/id/offender/34003/33/2662
https://www.vinelink.com/#/searchResults/1
Does anyone have a tip on how to access the final search results data?
Update: After further exploration this seems like it might have to do with the scripts being executed to retrieve the relevant data for display... there are many search results-related scripts referenced in the page_source; is there a way to determine which is relevant?
I am able to Inspect the information I need per this image:
Once you have your soup variable with the HTML, follow the code below.
import json
data = soup.find('search-result')['data']
print(data)
Output (a JSON string; once it is parsed with json.loads below, each value can be read like a dict entry):
{"offender_sid":154070373,"siteId":34003,"siteDesc":"NC_STATE","first_name":"WESLEY","last_name":"ADAMS","middle_initial":"CHURCHILL","alias_first_name":null,"alias_last_name":null,"alias_middle_initial":null,"oid":"2662","date_of_birth":"1965-11-21","agencyDesc":"Durham County Detention Center","age":53,"race":2,"raceDesc":"African American","gender":null,"genderDesc":null,"status_detail":"Durham County Detention Center","agency":33,"custody_status_cd":1,"custody_detail_cd":33,"custody_status_description":"In Custody","aliasFlag":false,"registerValid":true,"detailAgLink":false,"linkedCases":false,"registerMessage":"","juvenile_flg":0,"vineLinkInd":1,"vineLinkAgAccessCd":2,"links":[{"rel":"agency","href":"//www.vinelink.com/VineAppWebService/api/site/agency/34003/33"},{"rel":"self","href":"//www.vinelink.com/VineAppWebService/api/offender/?offSid=154070373&lang=en_US"}],"actions":[{"name":"register","template":"//www.vinelink.com/VineAppWebService/api/register/{json data}","method":"POST"}]}
Next, parse the string:
info = json.loads(data)
print(info['first_name'], info['last_name'])
# This prints the first and last name. Other values work the same way: use their key, such as 'date_of_birth' or 'siteId'. You can also assign them to variables.
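For reference, here is a small self-contained sketch of working with the parsed result. It uses a trimmed copy of the output shown above (most keys omitted) so that it runs without Selenium.

```python
import json

# Trimmed copy of the attribute value shown above; most keys are omitted.
data = (
    '{"first_name":"WESLEY","last_name":"ADAMS","date_of_birth":"1965-11-21",'
    '"gender":null,'
    '"links":[{"rel":"agency","href":"//www.vinelink.com/VineAppWebService/api/site/agency/34003/33"}]}'
)

info = json.loads(data)
full_name = "{} {}".format(info["first_name"], info["last_name"])
# Nested JSON arrays of objects become lists of dicts.
hrefs = {link["rel"]: link["href"] for link in info["links"]}
# JSON null becomes None; .get() also returns None for a missing key.
gender = info.get("gender")

print(full_name, hrefs["agency"], gender)
```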

Scraping with xpath with requests and lxml but having problems

I keep running into an issue when I scrape data with lxml using XPath. I want to scrape the Dow price, but when I print it in Python it says Element span at 0x448d6c0. I know that must be a memory address, but I just want the price. How can I print the price instead of its location in memory?
from lxml import html
import requests
page = requests.get('https://markets.businessinsider.com/index/realtime-chart/dow_jones')
content = html.fromstring(page.content)
#This will create a list of prices:
prices = content.xpath('//*[@id="site"]/div/div[3]/div/div[3]/div[2]/div/table/tbody/tr[1]/th[1]/div/div/div/span')
#This will create a list of volume:
print (prices)
xpath() returns a list of Element objects, and what you see printed is their default representation, which includes the memory address. To get the value, read an attribute of each element; in this case you want the text, so use .text.
Additionally, I would highly recommend changing your XPath, since it spells out the literal position of the element and will break as soon as the page layout changes.
prices = content.xpath("//div[@id='site']//div[@class='price']//span[@class='push-data ']")
prices_holder = [i.text for i in prices]
prices_holder
['25,389.06',
'25,374.60',
'7,251.60',
'2,813.60',
'22,674.50',
'12,738.80',
'3,500.58',
'1.1669',
'111.7250',
'1.3119',
'1,219.58',
'15.43',
'6,162.55',
'67.55']
Also of note, you will only get the values at load. If you want the prices as they change, you'd likely need to use Selenium.
The variable prices is a list containing an lxml element. Read its text attribute to extract the value.
print(prices[0].text)
'25,396.03'
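The two ways of getting text out of an XPath result can be sketched on an inline document. The markup below is a made-up, heavily simplified stand-in for the real page; only the id and class names are taken from the answer above.

```python
from lxml import html

# Hypothetical, simplified markup; the real page is far larger.
page_source = """
<html><body><div id="site">
  <div class="price"><span class="push-data ">25,389.06</span></div>
  <div class="price"><span class="push-data ">7,251.60</span></div>
</div></body></html>
"""
tree = html.fromstring(page_source)

spans = tree.xpath("//div[@id='site']//div[@class='price']/span")
# Each item is an Element; .text holds the string inside the tag.
prices = [span.text for span in spans]
# Ending the XPath in text() returns the strings directly.
same_prices = tree.xpath("//div[@id='site']//div[@class='price']/span/text()")
# Strip the thousands separators before converting to numbers.
values = [float(p.replace(",", "")) for p in prices]

print(prices, values)
```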

What is the most efficient way to get a specific link using Beautiful Soup in Python 3.0?

I am currently taking the Python specialization on Coursera and have run into the problem of extracting a specific link from a webpage using BeautifulSoup. From this webpage (http://py4e-data.dr-chuck.net/known_by_Fikret.html), I am supposed to extract a URL based on user input and open the subsequent links, all identified through the anchor tag, for some number of iterations.
While I was able to program this using lists, I am wondering if there is a simpler way of doing it without using lists or dictionaries.
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
tags = soup('a')
nameList=list()
loc=''
count=0
for tag in tags:
    loc=tag.get('href',None)
    nameList.append(loc)
url=nameList[pos-1]
In the above code, you will notice that after locating the links using the 'a' tag and 'href', I can't help but create a list called nameList to locate the position of the link. As this is inefficient, I would like to know if I could locate the URL directly without using lists. Thanks in advance!
The easiest way is to get an element out of tags list and then extract href value:
tags = soup('a')
a = tags[pos-1]
loc = a.get('href', None)
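A runnable sketch of this direct-indexing approach is below, with a bounds check added so that an out-of-range position does not raise IndexError. The inline HTML and the value of pos are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page with three links.
html = (
    '<ul><li><a href="/a.html">A</a></li>'
    '<li><a href="/b.html">B</a></li>'
    '<li><a href="/c.html">C</a></li></ul>'
)
soup = BeautifulSoup(html, "html.parser")
pos = 2  # 1-based position, as a user would type it

tags = soup("a")  # shorthand for soup.find_all("a")
if 1 <= pos <= len(tags):
    loc = tags[pos - 1].get("href")
else:
    loc = None  # position is outside the list of links

print(loc)
```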
You can also use the :nth-of-type selector with the soup.select_one() method:
a = soup.select_one('a:nth-of-type({})'.format(pos))
As :nth-of-type uses 1-based indexing, you don't need to subtract 1 from pos if your users are expected to use 1-based indexing too.
Note that this relies on how older Beautiful Soup versions implemented :nth-of-type, where it always selected only one element. The CSS :nth-of-type pseudo-class counts an element among its siblings of the same type and may select many elements at once. Beautiful Soup 4.7 and later follows the CSS behavior, so check the result against your page's structure.
And if you're looking for "the most efficient way", then you need to look at lxml:
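from lxml.html import fromstring
tree = fromstring(html)  # html is the page source read in the question's code
# XPath positions are 1-based; [0] takes the single match out of the result list
url = tree.xpath('(//a)[{}]/@href'.format(pos))[0]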

Python update value within function and reuse it

I searched a lot about this, but I might be using the wrong terms; the answers I found are either not very relevant or too advanced for me.
So, I have a very simple program. I have a function that reads a web page, scans for href links using BeautifulSoup, takes one of the links it finds, and follows it. The function takes the first link through user input.
Now I want this function to re-run automatically using the link it found, but I only manage to create endless loops that keep using the first variable it got. This is all done in a controlled environment which has a maximum depth of 10 links.
This is my code:
import urllib
from BeautifulSoup import *
site=list()
def follinks(x):
    html = urllib.urlopen(x).read()
    bs = BeautifulSoup(html)
    tags = bs('a')
    for tag in tags:
        site.append(tag.get('href', None))
    x=site[2]
    print x
    return;
url1 = raw_input('Enter url:')
How do I make it use the x variable and go back to start and rerun the function until there are no more links to follow? I tried few variations of while true, but again ended in endless loops of the url the user gave.
thanks.
What you're looking for is called recursion: calling a function from within its own body.
def follow_links(x):
    html = urllib.urlopen(x).read()
    bs = BeautifulSoup(html)
    # Put all the links on page x into the pagelinks list
    pagelinks = []
    tags = bs('a')
    for tag in tags:
        pagelinks.append(tag.get('href', None))
    # Track all links from this page in the master site list
    # (extend() mutates the global list; `site += pagelinks` here would raise UnboundLocalError)
    site.extend(pagelinks)
    # Follow the third link, if there is one
    if len(pagelinks) > 2:
        follow_links(pagelinks[2])
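Recursion like this needs a stopping condition, otherwise pages that link back to each other keep it going until Python hits its recursion limit. Below is a minimal Python 3 sketch, not the answer's original code, that adds a depth limit (the question mentions a maximum depth of 10) and skips pages it has already seen. The get_links parameter and the fake_site dictionary are hypothetical: get_links stands in for "download the page and return its hrefs", so the example runs without network access.

```python
def follow_links(url, get_links, position=2, max_depth=10, visited=None):
    """Follow the link at index `position` on each page, up to `max_depth` pages."""
    visited = [] if visited is None else visited
    # Base cases: the depth budget is used up, or this page was already seen.
    if len(visited) >= max_depth or url in visited:
        return visited
    visited.append(url)
    links = get_links(url)
    # Recursive case: follow the chosen link, if the page has that many links.
    if len(links) > position:
        return follow_links(links[position], get_links, position, max_depth, visited)
    return visited

# Fake site for illustration: each page maps to the hrefs found on it.
fake_site = {
    "a": ["x", "y", "b"],
    "b": ["x", "y", "c"],
    "c": ["x", "y", "a"],  # points back to "a", forming a cycle
}
path = follow_links("a", lambda url: fake_site.get(url, []))
print(path)
```

In real use, get_links would wrap urllib.request.urlopen and BeautifulSoup, returning [tag.get('href') for tag in soup('a')].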
