Scraping data from href - python

I was trying to get the postcodes for DFS. For that I tried getting the href for each shop and then clicking on it; the next page has the shop location, from which I can get the postal code. But I am not able to get things working. Where am I going wrong?
I tried getting the upper-level attribute td.searchResults first, and then for each of them I am trying to click on the href with title DFS and, after clicking, getting the postalCode. Eventually I want to iterate over all three pages.
If there is a better way to do it let me know.
driver = webdriver.Firefox()
driver.get('http://www.localstore.co.uk/stores/75061/dfs/')
html = driver.page_source
soup = BeautifulSoup(html)
listings = soup.select('td.searchResults')

for l in listings:
    while True:
        driver.find_element_by_css_selector("a[title*='DFS']").click()
        shops = {}
        #info = soup.find('span', itemprop='postalCode').contents
        html = driver.page_source
        soup = BeautifulSoup(html)
        info = soup.find(itemprop="postalCode").get_text()
        shops.append(info)
Update:
driver = webdriver.Firefox()
driver.get('http://www.localstore.co.uk/stores/75061/dfs/')
html = driver.page_source
soup = BeautifulSoup(html)
listings = soup.select('td.searchResults')

for l in listings:
    driver.find_element_by_css_selector("a[title*='DFS']").click()
    shops = []
    html = driver.page_source
    soup = BeautifulSoup(html)
    info = soup.find_all('span', attrs={"itemprop": "postalCode"})
    for m in info:
        if m:
            m_text = m.get_text()
            shops.append(m_text)
print(shops)

So after playing with this for a little while, I don't think the best way to do this is with selenium. It would require using driver.back() and waiting for elements to re-appear, and a whole mess of other stuff. I was able to get what you want using just requests, re and bs4. re is included in the Python standard library and if you haven't installed requests, you can do it with pip as follows: pip install requests
from bs4 import BeautifulSoup
import re
import requests

base_url = 'http://www.localstore.co.uk'
url = 'http://www.localstore.co.uk/stores/75061/dfs/'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
shops = []
links = soup.find_all('a', href=re.compile(r'.*/store/.*'))
for l in links:
    full_link = base_url + l['href']
    town = l['title'].split(',')[1].strip()
    res = requests.get(full_link)
    soup = BeautifulSoup(res.text, 'html.parser')
    info = soup.find('span', attrs={"itemprop": "postalCode"})
    postalcode = info.text
    shops.append(dict(town_name=town, postal_code=postalcode))
print(shops)
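For the record, here is roughly what the Selenium route would have involved. This is a sketch only, assuming the same markup as above; the waits and the stale-element handling are illustrative, not tested, which is exactly the mess I wanted to avoid:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get('http://www.localstore.co.uk/stores/75061/dfs/')
wait = WebDriverWait(driver, 10)
shops = []
count = len(driver.find_elements_by_css_selector("a[title*='DFS']"))
for i in range(count):
    # re-find the links on every pass: the old references go stale after navigating
    link = driver.find_elements_by_css_selector("a[title*='DFS']")[i]
    link.click()
    postcode = wait.until(EC.presence_of_element_located(
        (By.CSS_SELECTOR, "span[itemprop='postalCode']")))
    shops.append(postcode.text)
    driver.back()  # go back to the listing and wait for it to re-appear
    wait.until(EC.presence_of_element_located(
        (By.CSS_SELECTOR, "a[title*='DFS']")))
print(shops)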

Your code has some problems. You are using an infinite while loop with no break condition. Also, shops = {} creates a dict, but you are calling the list method append on it.
Instead of using selenium you can use python-requests or urllib2.
But in your code you can do something like this:
driver = webdriver.Firefox()
driver.get('http://www.localstore.co.uk/stores/75061/dfs/')
html = driver.page_source
soup = BeautifulSoup(html)
listings = soup.select('td.searchResults')
shops = []  # create the list once, outside the loop, so it isn't reset each pass

for l in listings:
    driver.find_element_by_css_selector("a[title*='DFS']").click()
    html = driver.page_source
    soup = BeautifulSoup(html)
    info = soup.find('span', attrs={"itemprop": "postalCode"})
    if info:
        info_text = info.get_text()
        shops.append(info_text)
print(shops)
In BeautifulSoup you can find a tag by its attribute like this:
soup.find('span', attrs={"itemprop": "postalCode"})
Also, if it doesn't find anything it will return None, and calling .get_text() on None will raise AttributeError. So check first before applying .get_text().
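For example, a guard like this (variable name mine) avoids that AttributeError:

info = soup.find('span', attrs={"itemprop": "postalCode"})
postal_code = info.get_text() if info else None  # safe: only call .get_text() when the tag was found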

Related

How can I extract href links from an &lt;a&gt; within a table &lt;th&gt; using BeautifulSoup

I am trying to create a list of all football teams/links from any one of a number of tables within the base URL: https://fbref.com/en/comps/10/stats/Championship-Stats
I would then use the link from the href to scrape each individual team's data. The href is embedded within the th tag as per below
<th scope="row" class="left " data-stat="squad">
  <a href="/en/squads/293cb36b/Barnsley-Stats">Barnsley</a>
</th>
The following code gives me a list of the 'th' tags:
import requests
from bs4 import BeautifulSoup

page = "https://fbref.com/en/comps/10/Championship-Stats"
pageTree = requests.get(page)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
Teams = pageSoup.find_all("th", {"class": "left"})
Output (for each cell with class 'left'):
<th class="left" data-stat="squad" scope="row">
  <a href="/en/squads/293cb36b/Barnsley-Stats">Barnsley</a>
</th>,
I have tried the guidance from a previous Stack question (Extract links after th in beautifulsoup)
However, the following code based on that thread produces errors
AttributeError: 'NoneType' object has no attribute 'find_parent'
def import_TeamList():
    BASE_URL = "https://fbref.com/en/comps/10/Championship-Stats"
    r = requests.get(BASE_URL)
    soup = BeautifulSoup(r.text, 'lxml')
    team_list = []
    team_tr = soup.find('a', {'data-stat': 'squad'}).find_parent('tr')
    for tr in reels_tr.find_next_siblings('tr'):
        if tr.find('a').text != 'squad':
            break
        midi_list.append(BASE_URL + tr.find('a')['href'])
    return TeamList
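For what it's worth, the AttributeError almost certainly comes from soup.find('a', {'data-stat': 'squad'}) returning None: in the markup shown above, data-stat sits on the &lt;th&gt;, not the &lt;a&gt;. A minimal correction of just that lookup, with names of my own choosing:

squad_th = soup.find('th', {'data-stat': 'squad'})     # data-stat lives on the <th>
first_link = squad_th.find('a') if squad_th else None  # guard against a missing cell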
Here is a version using CSS selectors, which I find simpler than most other methods.
import requests
from bs4 import BeautifulSoup

url = 'https://fbref.com/en/comps/10/stats/Championship-Stats'
data = requests.get(url).text
soup = BeautifulSoup(data, 'html.parser')
links = soup.select('th a')  # every <a> nested inside a <th>
urls = [link['href'] for link in links]
print(urls)
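If 'th a' turns out to be too broad (it matches links in every &lt;th&gt;, not just the squad names), the data-stat attribute from the question's markup can narrow it down; an untested sketch:

squad_links = soup.select('th[data-stat="squad"] a')  # only the squad-name cells
urls = [link['href'] for link in squad_links]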
Is this what you're looking for?
import requests
from bs4 import BeautifulSoup as BS
from lxml import etree

with requests.Session() as session:
    r = session.get('https://fbref.com/en/comps/10/stats/Championship-Stats')
    r.raise_for_status()
    dom = etree.HTML(str(BS(r.text, 'lxml')))
    for a in dom.xpath('//th[@class="left"]/a'):
        print(a.attrib['href'])
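The hrefs come back site-relative (e.g. /en/squads/...). If you need absolute URLs, one option is to join them against the site root; a sketch continuing from the dom object above, with the base taken from the question's URL:

from urllib.parse import urljoin

base = 'https://fbref.com'
for a in dom.xpath('//th[@class="left"]/a'):
    print(urljoin(base, a.attrib['href']))  # e.g. https://fbref.com/en/squads/...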

How to get just the links of articles in a list using BeautifulSoup

Hey guys, so I got as far as being able to add the a elements to a list. The problem is I just want the href link to be added to the links_with_text list and not the entire a element. What am I doing wrong?
from bs4 import BeautifulSoup
import requests

URL = "https://news.ycombinator.com"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id='hnmain')
articles = results.find_all(class_="title")
links_with_text = []
for article in articles:
    link = article.find('a', href=True)
    links_with_text.append(link)
print('\n'.join(map(str, links_with_text)))
This prints exactly how I want the list to print, but I just want the href from every a element, not the entire element. Thank you
To get all the links from https://news.ycombinator.com, you can use the CSS selector 'a.storylink'.
For example:
from bs4 import BeautifulSoup
import requests

URL = "https://news.ycombinator.com"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

links_with_text = []
for a in soup.select('a.storylink'):   # <-- find all <a> with class="storylink"
    links_with_text.append(a['href'])  # <-- note the ['href']
print(*links_with_text, sep='\n')
Prints:
https://blog.mozilla.org/futurereleases/2020/06/18/introducing-firefox-private-network-vpns-official-product-the-mozilla-vpn/
https://mxb.dev/blog/the-return-of-the-90s-web/
https://github.blog/2020-06-18-introducing-github-super-linter-one-linter-to-rule-them-all/
https://www.sciencemag.org/news/2018/11/why-536-was-worst-year-be-alive
https://www.strongtowns.org/journal/2020/6/16/do-the-math-small-projects
https://devblogs.nvidia.com/announcing-cuda-on-windows-subsystem-for-linux-2/
https://lwn.net/SubscriberLink/822568/61d29096a4012e06/
https://imil.net/blog/posts/2020/fakecracker-netbsd-as-a-function-based-microvm/
https://jepsen.io/consistency
https://tumblr.beesbuzz.biz/post/621010836277837824/advice-to-young-web-developers
https://archive.org/search.php?query=subject%3A%22The+Navy+Electricity+and+Electronics+Training+Series%22&sort=publicdate
https://googleprojectzero.blogspot.com/2020/06/ff-sandbox-escape-cve-2020-12388.html?m=1
https://apnews.com/1da061ce00eb531291b143ace0eed1c9
https://support.apple.com/library/content/dam/edam/applecare/images/en_US/appleid/android-apple-music-account-payment-none.jpg
https://standpointmag.co.uk/issues/may-june-2020/the-healing-power-of-birdsong/
https://steveblank.com/2020/06/18/the-coming-chip-wars-of-the-21st-century/
https://www.videolan.org/security/sb-vlc3011.html
https://onesignal.com/careers/2023b71d-2f44-4934-a33c-647855816903
https://www.bbc.com/news/world-europe-53006790
https://github.com/efficient/HOPE
https://everytwoyears.org/
https://www.historytoday.com/archive/natural-histories/intelligence-earthworms
https://cr.yp.to/2005-590/powerpc-cwg.pdf
https://quantum.country/
http://www.crystallography.net/cod/
https://parkinsonsnewstoday.com/2020/06/17/tiny-magnetically-powered-implant-may-be-future-of-deep-brain-stimulation/
https://spark.apache.org/releases/spark-release-3-0-0.html
https://arxiv.org/abs/1712.09624
https://www.washingtonpost.com/technology/2020/06/18/data-privacy-law-sherrod-brown/
https://blog.chromium.org/2020/06/improving-chromiums-browser.html
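If you also want the story title next to each href, a small variation on the same selector works; a sketch reusing the soup object from above:

pairs = [(a.get_text(), a['href']) for a in soup.select('a.storylink')]
for title, href in pairs:
    print(title, '->', href)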

Beautiful Soup find nested div

I am trying to parse a webpage that looks like this with Python->Beautiful Soup
see image
I need data from
<div class="p-offer__price-new">199,99 ₽</div>
I tried this code:
soup = BeautifulSoup(data)
res = soup.findAll("div", {"class": "poffer__price-new"})
print(res)
But result is empty -- []
How can I get this data?
Example of URL: https://edadeal.ru/moskva/offers/d71b75ff-bfee-4731-95ad-52a24ddea72e?from=%2F
import bs4
from selenium import webdriver

driver = webdriver.Chrome(r'C:\chromedriver_win32\chromedriver.exe')
driver.get('https://edadeal.ru/moskva/offers/d71b75ff-bfee-4731-95ad-52a24ddea72e?from=%2F')
html = driver.page_source
soup = bs4.BeautifulSoup(html, 'html.parser')
res = soup.findAll("div", {"class": "p-offer__price-new"})
print(res[0].text)
driver.close()
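Two notes on why the original attempt returned []: the class name in the question ('poffer__price-new') is missing a hyphen ('p-offer__price-new'), and the price block is most likely rendered by JavaScript, which is presumably why this answer reaches for Selenium in the first place. Rather than hoping the page has finished rendering when page_source is read, you can wait for the element explicitly; a sketch, untested against the live page:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome(r'C:\chromedriver_win32\chromedriver.exe')
driver.get('https://edadeal.ru/moskva/offers/d71b75ff-bfee-4731-95ad-52a24ddea72e?from=%2F')
# block until the price element exists (up to 10 s), then read it
price = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'p-offer__price-new')))
print(price.text)
driver.close()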

BeautifulSoup and scraping href's isn't working

Again I am having trouble scraping hrefs with BeautifulSoup. I have a list of pages that I am scraping, and I have the data, but I can't seem to get the hrefs even when I use code that works in other scripts.
So here is the code and my data will be below that:
import requests
from bs4 import BeautifulSoup

with open('states_names.csv', 'r') as reader:
    states = [states.strip().replace(' ', '-') for states in reader]

url = 'https://www.hauntedplaces.org/state/alabama'
for state in states:
    page = requests.get(url + state)
    soup = BeautifulSoup(page.text, 'html.parser')
    links = soup.findAll('div', class_='description')
    # When I try to add .get('href') I get a traceback error. Am I trying to scrape the href too early?
    h_page = soup.findAll('h3')
<h3>Gaines Ridge Dinner Club</h3>
<h3>Purifoy-Lipscomb House</h3>
<h3>Kate Shepard House Bed and Breakfast</h3>
<h3>Cedarhurst Mansion</h3>
<h3>Crybaby Bridge</h3>
<h3>Gaineswood Plantation</h3>
<h3>Mountain View Hospital</h3>
This works perfectly:
from bs4 import BeautifulSoup
import requests
url = 'https://www.hauntedplaces.org/state/Alabama'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
for link in soup.select('div.description a'):
print(link['href'])
Try this:
soup = BeautifulSoup(page.content, 'html.parser')
list0 = []
possible_links = soup.find_all('a')
for link in possible_links:
    if link.has_attr('href'):
        print(link.attrs['href'])
        list0.append(link.attrs['href'])
print(list0)
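One more thing worth checking in the question's loop: url already ends in a state name ('.../state/alabama'), so url + state requests paths like '.../state/alabamaAlabama'. Presumably the intended base is the bare '/state/' path; a sketch combining that guess with the div.description a selector from the first answer:

import requests
from bs4 import BeautifulSoup

base = 'https://www.hauntedplaces.org/state/'  # no state name appended yet
with open('states_names.csv', 'r') as reader:
    states = [s.strip().replace(' ', '-') for s in reader]

for state in states:
    soup = BeautifulSoup(requests.get(base + state).text, 'html.parser')
    for link in soup.select('div.description a'):
        print(link['href'])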

Convert a result in the console into a list Python

I want to web scrape a list of URLs from a web site and then open them one by one.
I can get the list of all the URLs, but when I try to turn it into a list, things go wrong.
When I print the list, instead of getting [url1, url2, ...] I get something like this in the console, each on a different line:
[url1, url2, url3]
[url1, url2, url3, url4]
[url1, url2, url3, url4, url5]
Find my script below:
driver = webdriver.Chrome()
my_url = "https://prog.nfz.gov.pl/app-jgp/AnalizaPrzekrojowa.aspx"
driver.get(my_url)
time.sleep(3)
content = driver.page_source.encode('utf-8').strip()
page_soup = soup(content, "html.parser")

links = []
for link in page_soup.find_all('a', href=True):
    url = link['href']
    ai = str(url)
    links.append(ai)
    print(links)
I have rewritten your code a little. First you need to load and scrape the main page to get all the links from the "href" attributes. After that, just use the scraped URLs in a loop to get the next pages.
Also, there is some junk in the "href" values which isn't a URL, so you have to clean that first.
I prefer requests for doing the GET:
http://docs.python-requests.org/en/master/
I hope it helps.
from bs4 import BeautifulSoup
import requests

def main():
    links = []
    url = "https://prog.nfz.gov.pl/app-jgp/AnalizaPrzekrojowa.aspx"
    web_page = requests.get(url)
    soup = BeautifulSoup(web_page.content, "html.parser")
    a_tags = soup.find_all('a', href=True)
    for a in a_tags:
        links.append(a.get("href"))
    print(links)  # just to demonstrate that links are there

    cleaned_list = []
    for link in links:
        if "http" in link:
            cleaned_list.append(link)
    print(cleaned_list)
    return cleaned_list

def load_pages_from_links(urls):
    user_agent = {'User-agent': 'Mozilla/5.0'}
    links = urls
    downloaded_pages = {}
    if len(links) == 0:
        return "There are no links."
    else:
        for nr, link in enumerate(links):
            web_page = requests.get(link, headers=user_agent)
            downloaded_pages[nr] = web_page.content
    print(downloaded_pages)

if __name__ == "__main__":
    links = main()
    load_pages_from_links(links)
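One caveat about the cleaning step: filtering on "http" in link throws away site-relative hrefs entirely. If those matter, urljoin can resolve them against the page URL instead; a sketch of a drop-in replacement for the cleaned_list loop above:

from urllib.parse import urljoin

cleaned_list = []
for link in links:
    absolute = urljoin(url, link)      # resolves relative hrefs against the page URL
    if absolute.startswith('http'):    # still drops javascript: / mailto: style junk
        cleaned_list.append(absolute)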
