How to scrape links without an href attribute? - python

I want to extract links, but there is no href attribute given. How do I scrape the links from the page?
from bs4 import BeautifulSoup
import requests

for count in range(1, 421):
    r = requests.get('http://iapsm.org/MemberPage/members.php?page=' + str(count) + '&Search=',
                     headers={'User-Agent': 'Googleboat'})
    soup = BeautifulSoup(r.text, 'lxml')
    links = soup.find_all('div', class_='Table')
    for link in soup.find_all('tr'):
        c = link.get('a')
        print(c)
I'm not getting any output or any error.

To scrape all the details, first search for all the divs whose class is modal-content.
You can try my code below to get all the information for each user.
modals = soup.find_all('div', {'class': 'modal-content'})
user_data = []
for modal in modals:
    uls = modal.find_all('ul', {'class': 'Modal-List'})
    info = {}
    for ul in uls:
        # each <ul> holds a label <li> followed by a value <li>
        lis = ul.find_all('li')
        info[lis[0].get_text(strip=True)] = lis[1].get_text(strip=True)
    user_data.append(info)
print(user_data)
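For reference, putting the paging loop from the question together with the modal parsing above might look like this (a sketch: the label/value <li> layout inside each <ul class="Modal-List"> is an assumption, not verified against the live page):

from bs4 import BeautifulSoup
import requests

all_users = []
for count in range(1, 421):
    r = requests.get('http://iapsm.org/MemberPage/members.php?page=' + str(count) + '&Search=',
                     headers={'User-Agent': 'Googleboat'})
    soup = BeautifulSoup(r.text, 'lxml')
    for modal in soup.find_all('div', {'class': 'modal-content'}):
        info = {}
        for ul in modal.find_all('ul', {'class': 'Modal-List'}):
            lis = ul.find_all('li')
            if len(lis) >= 2:
                # assumed layout: first <li> is the field label, second is its value
                info[lis[0].get_text(strip=True)] = lis[1].get_text(strip=True)
        all_users.append(info)
print(all_users)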

Related

How can I extract href links from an <a> within a table <th> using BeautifulSoup

I am trying to create a list of all football teams/links from any one of a number of tables within the base URL: https://fbref.com/en/comps/10/stats/Championship-Stats
I would then use the link from each href to scrape the individual team's data. The href is embedded within the <th> tag, as below:
th scope="row" class="left " data-stat="squad">Barnsley</th
a href="/en/squads/293cb36b/Barnsley-Stats">Barnsley</a
The following code gives me a list of the matching 'th' tags (each containing an 'a' tag):
import requests
from bs4 import BeautifulSoup

page = "https://fbref.com/en/comps/10/Championship-Stats"
pageTree = requests.get(page)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
Teams = pageSoup.find_all("th", {"class": "left"})
Output (for each th with class 'left'):
<th class="left" data-stat="squad" scope="row">
  <a href="/en/squads/293cb36b/Barnsley-Stats">Barnsley</a>
</th>
I have tried the guidance from a previous Stack Overflow question (Extract links after th in beautifulsoup). However, the following code based on that thread produces an error:
AttributeError: 'NoneType' object has no attribute 'find_parent'
def import_TeamList():
    BASE_URL = "https://fbref.com/en/comps/10/Championship-Stats"
    r = requests.get(BASE_URL)
    soup = BeautifulSoup(r.text, 'lxml')
    team_list = []
    team_tr = soup.find('a', {'data-stat': 'squad'}).find_parent('tr')
    for tr in team_tr.find_next_siblings('tr'):
        if tr.find('a').text != 'squad':
            break
        team_list.append(BASE_URL + tr.find('a')['href'])
    return team_list
Here is a version using CSS selectors, which I find simpler than most other methods.
import requests
from bs4 import BeautifulSoup

url = 'https://fbref.com/en/comps/10/stats/Championship-Stats'
data = requests.get(url).text
soup = BeautifulSoup(data, 'html.parser')
links = soup.select('th a')  # every <a> nested inside a <th>
urls = [link['href'] for link in links]
print(urls)
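The hrefs collected above are site-relative paths such as /en/squads/293cb36b/Barnsley-Stats. If you want absolute links for the follow-up requests, a small sketch using the standard library's urljoin (the https://fbref.com base is assumed from the page URL):

from urllib.parse import urljoin

base = 'https://fbref.com'
# e.g. '/en/squads/293cb36b/Barnsley-Stats' -> 'https://fbref.com/en/squads/293cb36b/Barnsley-Stats'
full_urls = [urljoin(base, u) for u in urls]
print(full_urls[:5])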
Is this what you're looking for?
import requests
from bs4 import BeautifulSoup as BS
from lxml import etree

with requests.Session() as session:
    r = session.get('https://fbref.com/en/comps/10/stats/Championship-Stats')
    r.raise_for_status()
    dom = etree.HTML(str(BS(r.text, 'lxml')))
    for a in dom.xpath('//th[@class="left"]/a'):
        print(a.attrib['href'])

How would I scrape some of the 'title' attributes if there are multiple under the same HTML tree?

I'm trying to scrape the front page of https://nyaa.si/ for torrent names and torrent magnet links. I was successful in getting the magnet links, but I'm having issues with the torrent names, due to the HTML structure of where the names are placed. The contents I'm trying to scrape are located in <td> tags (table cells) that can be uniquely identified through an attribute, but within those the name sits in an <a> tag's title attribute, and that tag has no uniquely identifiable attribute that I can see. Is there any way I can scrape this information?
Here is my code:
import re, requests
from bs4 import BeautifulSoup

nyaa_link = 'https://nyaa.si/'
request = requests.get(nyaa_link, headers={'User-Agent': 'Mozilla/5.0'})
source = request.content
soup = BeautifulSoup(source, 'lxml')

#GETTING TORRENT NAMES
title = []
rows = soup.findAll("td", colspan="2")
for row in rows:
    title.append(row.content)

#GETTING MAGNET LINKS
magnets = []
for link in soup.findAll('a', attrs={'href': re.compile("^magnet")}):
    magnets.append(link.get('href'))
print(magnets)
You need to pull the title from the link inside each table cell. Since each <td> here contains an <a>, just call td.find('a')['title']:
import re, requests
from bs4 import BeautifulSoup

nyaa_link = 'https://nyaa.si/'
request = requests.get(nyaa_link, headers={'User-Agent': 'Mozilla/5.0'})
source = request.content
soup = BeautifulSoup(source, 'lxml')

#GETTING TORRENT NAMES
title = []
rows = soup.findAll("td", colspan="2")
for row in rows:
    #UPDATED CODE
    desired_title = row.find('a')['title']
    if 'comment' not in desired_title:
        title.append(desired_title)

#GETTING MAGNET LINKS
magnets = []
for link in soup.findAll('a', attrs={'href': re.compile("^magnet")}):
    magnets.append(link.get('href'))
print(magnets)
So I've figured out the problem and found a solution.
The problem was this line: if 'comment' not in desired_title:
It made the loop process only rows whose first link title didn't contain 'comment'. The issue is the way the HTML is structured on the page I was trying to scrape: if a torrent has a comment on it, the comment link appears in the markup above the title link, so my code would completely skip torrents with comments on them.
Here is a working solution:
import re, requests
from bs4 import BeautifulSoup

nyaa_link = 'https://nyaa.si/?q=test'
request = requests.get(nyaa_link)
source = request.content
soup = BeautifulSoup(source, 'lxml')

#GETTING TORRENT NAMES
title = []
n = 0
rows = soup.findAll("td", colspan="2")
for row in rows:
    if 'comment' in row.find('a')['title']:
        # a comment link precedes the name link, so take the second title-bearing <a>
        desired_title = row.findAll('a', title=True)[1].text
        print(desired_title)
        title.append(desired_title)
        n = n + 1
    else:
        desired_title = row.find('a')['title']
        title.append(desired_title)
        print(row.find('a')['title'])
    print('\n')
#print(title)

#GETTING MAGNET LINKS
magnets = []
for link in soup.findAll('a', attrs={'href': re.compile("^magnet")}):
    magnets.append(link.get('href'))
#print(magnets)

#GETTING NUMBER OF MAGNET LINKS AND TITLES
print('Number of rows', len(rows))
print('Number of magnet links', len(magnets))
print('Number of titles', len(title))
print('Number of removed', n)
Thank you CannedScientist for some of the code needed for the solution.
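Building on the structure described above (the comment link, when present, sits before the name link), the if/else can be collapsed by always taking the last title-bearing <a> in each cell; a minimal sketch, assuming every row keeps that ordering:

# the torrent-name link is the last <a> with a title attribute in each cell,
# whether or not a comment link precedes it
titles = [row.find_all('a', title=True)[-1]['title']
          for row in soup.findAll('td', colspan='2')
          if row.find_all('a', title=True)]
print(len(titles))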

Beautifulsoup not finding the tag

I'm working on a project using BeautifulSoup.
from bs4 import BeautifulSoup as soup
from requests import get

url = "https://www.yelp.com/search?find_desc=&find_loc=New+York%2C+NY&ns=1"
clnt = get(url)
page = soup(clnt.text, "html.parser")
container = page.findAll("div", {"class": "lemon--div__373c0__1mboc container__373c0__ZB8u4 hoverable__373c0__3CcYQ margin-t3__373c0__1l90z margin-b3__373c0__q1DuY padding-t3__373c0__1gw9E padding-r3__373c0__57InZ padding-b3__373c0__342DA padding-l3__373c0__1scQ0 border--top__373c0__3gXLy border--right__373c0__1n3Iv border--bottom__373c0__3qNtD border--left__373c0__d1B7K border-color--default__373c0__3-ifU"})
container = container[1]
url2 = "https://www.yelp.com" + container.a["href"]
clnt2 = get(url2)
page2 = soup(clnt2.text, 'html.parser')
info = page2.find("div", {"class": "lemon--div__373c0__1mboc island__373c0__3fs6U u-padding-t1 u-padding-r1 u-padding-b1 u-padding-l1 border--top__373c0__19Owr border--right__373c0__22AHO border--bottom__373c0__uPbXS border--left__373c0__1SjJs border-color--default__373c0__2oFDT background-color--white__373c0__GVEnp"})
contact = info.div  # example contact variable
In this "info" variable i am getting the div which has all the contact details, I want to grab the Contact number form this div
And as I print this "info" variable it also shows that contact no. is present in the variable including other details but as I traverse through the div's to get the Contact No. I couldn't find it.
I have also tried to get all the child div's and even including the class of the div itself I wasn't able to get that
First url given is this :https://www.yelp.com/search?find_desc=&find_loc=New+York%2C+NY&ns=1
Second url "url2" is this : https://www.yelp.com/biz/levain-bakery-new-york Which has the Contact details
Any Solutions ???
You can get the contact number with the help of the class name, but I seriously doubt it will work for any given page, as the class names seem to be dynamic. Still, you could give it a try.
from bs4 import BeautifulSoup as soup
from requests import get

url = "https://www.yelp.com/search?find_desc=&find_loc=New+York%2C+NY&ns=1"
clnt = get(url)
page = soup(clnt.text, "html.parser")
container = page.findAll("div", {"class": "lemon--div__373c0__1mboc container__373c0__ZB8u4 hoverable__373c0__3CcYQ margin-t3__373c0__1l90z margin-b3__373c0__q1DuY padding-t3__373c0__1gw9E padding-r3__373c0__57InZ padding-b3__373c0__342DA padding-l3__373c0__1scQ0 border--top__373c0__3gXLy border--right__373c0__1n3Iv border--bottom__373c0__3qNtD border--left__373c0__d1B7K border-color--default__373c0__3-ifU"})
container = container[1]
url2 = "https://www.yelp.com" + container.a["href"]
clnt2 = get(url2)
page2 = soup(clnt2.text, 'html.parser')
info = page2.find("div", {"class": "lemon--div__373c0__1mboc island__373c0__3fs6U u-padding-t1 u-padding-r1 u-padding-b1 u-padding-l1 border--top__373c0__19Owr border--right__373c0__22AHO border--bottom__373c0__uPbXS border--left__373c0__1SjJs border-color--default__373c0__2oFDT background-color--white__373c0__GVEnp"})
ContactNumber = info.find("p", {"class": "lemon--p__373c0__3Qnnj text__373c0__2pB8f text-color--normal__373c0__K_MKN text-align--left__373c0__2pnx_"})
print(ContactNumber.text)
Output:
(917) 464-3769
The following works for me:
from bs4 import BeautifulSoup as soup
from requests import get

url = "https://www.yelp.com/search?find_desc=&find_loc=New+York%2C+NY&ns=1"
clnt = get(url)
page = soup(clnt.text, "html.parser")
container = page.findAll("div", {"class": "lemon--div__373c0__1mboc container__373c0__ZB8u4 hoverable__373c0__3CcYQ margin-t3__373c0__1l90z margin-b3__373c0__q1DuY padding-t3__373c0__1gw9E padding-r3__373c0__57InZ padding-b3__373c0__342DA padding-l3__373c0__1scQ0 border--top__373c0__3gXLy border--right__373c0__1n3Iv border--bottom__373c0__3qNtD border--left__373c0__d1B7K border-color--default__373c0__3-ifU"})
container = container[1]
url2 = "https://www.yelp.com" + container.a["href"]
clnt2 = get(url2)
page2 = soup(clnt2.text, 'html.parser')
info = page2.find("div", {"class": "lemon--div__373c0__1mboc island__373c0__3fs6U u-padding-t1 u-padding-r1 u-padding-b1 u-padding-l1 border--top__373c0__19Owr border--right__373c0__22AHO border--bottom__373c0__uPbXS border--left__373c0__1SjJs border-color--default__373c0__2oFDT background-color--white__373c0__GVEnp"})
p_tags_of_info = info.find_all("p")
print(p_tags_of_info[2].text)
I just extracted all the <p> tags from the info variable, and then selected the text of the third <p> tag.
Try this. I have tried it with the second URL and it gives the contact number.
import requests
from bs4 import BeautifulSoup

url = 'https://www.yelp.com/biz/levain-bakery-new-york'
page = requests.get(url)
soup1 = BeautifulSoup(page.content, "lxml")
info = soup1.find_all("div", {"class": "lemon--div__373c0__1mboc island-section__373c0__3vKXy border--top__373c0__19Owr border-color--default__373c0__2oFDT"})
contact_number = info[1].find("p", attrs={'class': 'lemon--p__373c0__3Qnnj text__373c0__2pB8f text-color--normal__373c0__K_MKN text-align--left__373c0__2pnx_'}).text
print(contact_number)
Please accept the answer if it was useful; it might help others too.
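All three answers above depend on Yelp's auto-generated class names, which can change without notice. As a hedged alternative sketch, you could skip the class names entirely and match a US-style phone pattern in the raw HTML (this assumes the number is rendered as (xxx) xxx-xxxx somewhere on the page):

import re
import requests

html = requests.get('https://www.yelp.com/biz/levain-bakery-new-york',
                    headers={'User-Agent': 'Mozilla/5.0'}).text
# match the first (xxx) xxx-xxxx pattern anywhere in the page source
match = re.search(r'\(\d{3}\)\s*\d{3}-\d{4}', html)
print(match.group() if match else 'no phone number found')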

Scraping each element from website with BeautifulSoup

I wrote code for scraping a real estate website. This is the link:
https://www.nekretnine.rs/stambeni-objekti/stanovi/lista/po-stranici/10/
From this page I can only get the location, size and price of each apartment, but is it possible to write code that will go to each apartment's own page and scrape values from it, since it contains much more info? Check this link:
https://www.nekretnine.rs/stambeni-objekti/stanovi/arena-bulevar-arsenija-carnojevica-97m-2-lode-energoprojekt/NkvJK0Ou5tV/
I have posted my code below. I noticed that the URL changes when I click on a specific listing, for example:
arena-bulevar-arsenija-carnojevica-97m-2-lode-energoprojekt/NkvJK0Ou5tV/
I thought about creating a for loop, but there is no way to predict how the URL changes, because it has some ID at the end:
NkvJK0Ou5tV
This is the code that I have:
from bs4 import BeautifulSoup
import requests

website = "https://www.nekretnine.rs/stambeni-objekti/stanovi/lista/po-stranici/10/"
soup = requests.get(website).text
my_html = BeautifulSoup(soup, 'lxml')
lokacija = my_html.find_all('p', class_='offer-location text-truncate')
ukupna_kvadratura = my_html.find_all('p', class_='offer-price offer-price--invert')
ukupna_cena = my_html.find_all('div', class_='d-flex justify-content-between w-100')
ukupni_opis = my_html.find_all('div', class_='mt-1 mb-1 mt-lg-0 mb-lg-0 d-md-block offer-meta-info offer-adress')

for lok, kvadratura, cena_stana, sumarno in zip(lokacija, ukupna_kvadratura, ukupna_cena, ukupni_opis):
    lok = lok.text.split(',')[0]              # location
    kv = kvadratura.span.text.split(' ')[0]   # floor area
    jed = kvadratura.span.text.split(' ')[1]  # unit of measure
    cena = cena_stana.span.text               # price
    sumarno = sumarno.text
    datum = sumarno.split('|')[0].strip()     # date
    status = sumarno.split('|')[1].strip()    # status
    opis = sumarno.split('|')[2].strip()      # description
    print(lok, kv, jed, cena, datum, status, opis)
You can get the href from the <div class="placeholder-preview-box ratio-4-3"> element; from there you can build each listing's URL.
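Continuing from the question's my_html soup, a sketch of that approach (the div class and link placement come from this answer and are not verified against the live page):

from urllib.parse import urljoin

for box in my_html.select('div.placeholder-preview-box.ratio-4-3'):
    a = box.find('a')
    if a and a.get('href'):
        detail_url = urljoin('https://www.nekretnine.rs', a['href'])
        detail_page = BeautifulSoup(requests.get(detail_url).text, 'lxml')
        # pull whatever extra fields you need from detail_page here
        print(detail_url)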
You can iterate over the links provided by the pagination at the bottom of the page:
from bs4 import BeautifulSoup as soup
import requests

d = soup(requests.get('https://www.nekretnine.rs/stambeni-objekti/stanovi/lista/po-stranici/10/').text, 'html.parser')

def scrape_page(page):
    return [{'title': i.h2.get_text(strip=True),
             'loc': i.p.get_text(strip=True),
             'price': i.find('p', {'class': 'offer-price'}).get_text(strip=True)}
            for i in page.find_all('div', {'class': 'row offer'})]

result = [scrape_page(d)]
while d.find('a', {'class': 'pagination-arrow arrow-right'}):
    d = soup(requests.get(f'https://www.nekretnine.rs{d.find("a", {"class": "pagination-arrow arrow-right"})["href"]}').text, 'html.parser')
    result.append(scrape_page(d))
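Note that result ends up as a list of per-page lists; if you'd rather work with one flat list of listings, for example:

from itertools import chain

listings = list(chain.from_iterable(result))  # flatten the per-page lists
print(len(listings))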

Scraping data from href

I was trying to get the postcodes for DFS stores. For that I tried getting the href for each shop and then clicking on it; the next page has the shop location, from which I can get the postal code. But I am not able to get it working. Where am I going wrong?
I tried getting the upper-level element first (td.searchResults), and then for each of them I try to click on the href with title DFS and, after clicking, get the postalCode. Eventually I'd iterate over all three pages.
If there is a better way to do it, let me know.
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.get('http://www.localstore.co.uk/stores/75061/dfs/')
html = driver.page_source
soup = BeautifulSoup(html)
listings = soup.select('td.searchResults')
for l in listings:
    while True:
        driver.find_element_by_css_selector("a[title*='DFS']").click()
        shops = {}
        #info = soup.find('span', itemprop='postalCode').contents
        html = driver.page_source
        soup = BeautifulSoup(html)
        info = soup.find(itemprop="postalCode").get_text()
        shops.append(info)
Update:
driver = webdriver.Firefox()
driver.get('http://www.localstore.co.uk/stores/75061/dfs/')
html = driver.page_source
soup = BeautifulSoup(html)
listings = soup.select('td.searchResults')
for l in listings:
    driver.find_element_by_css_selector("a[title*='DFS']").click()
    shops = []
    html = driver.page_source
    soup = BeautifulSoup(html)
    info = soup.find_all('span', attrs={"itemprop": "postalCode"})
    for m in info:
        if m:
            m_text = m.get_text()
            shops.append(m_text)
print(shops)
So after playing with this for a little while, I don't think the best way to do this is with selenium. It would require using driver.back(), waiting for elements to re-appear, and a whole mess of other stuff. I was able to get what you want using just requests, re and bs4. re is included in the Python standard library, and if you haven't installed requests you can do so with pip: pip install requests
from bs4 import BeautifulSoup
import re
import requests

base_url = 'http://www.localstore.co.uk'
url = 'http://www.localstore.co.uk/stores/75061/dfs/'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
shops = []
links = soup.find_all('a', href=re.compile(r'.*/store/.*'))
for l in links:
    full_link = base_url + l['href']
    town = l['title'].split(',')[1].strip()
    res = requests.get(full_link)
    soup = BeautifulSoup(res.text, 'html.parser')
    info = soup.find('span', attrs={"itemprop": "postalCode"})
    postalcode = info.text
    shops.append(dict(town_name=town, postal_code=postalcode))
print(shops)
Your code has some problems: you are using an infinite loop with no break condition, and shops = {} is a dict, but you are calling the append method on it (which only lists have).
Instead of using selenium you could use python-requests or urllib2.
But within your current approach you can do something like this:
driver = webdriver.Firefox()
driver.get('http://www.localstore.co.uk/stores/75061/dfs/')
html = driver.page_source
soup = BeautifulSoup(html)
listings = soup.select('td.searchResults')
shops = []
for l in listings:
    driver.find_element_by_css_selector("a[title*='DFS']").click()
    html = driver.page_source
    soup = BeautifulSoup(html)
    info = soup.find('span', attrs={"itemprop": "postalCode"})
    if info:
        info_text = info.get_text()
        shops.append(info_text)
print(shops)
In BeautifulSoup you can find a tag by its attribute like this:
soup.find('span', attrs={"itemprop": "postalCode"})
Also, if it doesn't find anything it will return None, and calling .get_text() on None will raise an AttributeError, so check first before applying .get_text().
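A minimal guard, for example:

info = soup.find('span', attrs={"itemprop": "postalCode"})
postal = info.get_text(strip=True) if info else None  # avoid AttributeError on a missing tag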
