I am using Python and Beautiful Soup to obtain the URLs of available software from the Civic Commons - Social Media page. I want the links for all of the Social Media software (spread across 20 pages). I am able to get the URLs of the software listed on the first page.
Below is the Python code that I wrote for obtaining these values.
from bs4 import BeautifulSoup
import re
import urllib2
base_url = "http://civiccommons.org"
url = "http://civiccommons.org/software-functions/social-media"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
list_of_links = []
for link_tag in soup.findAll('a', href=re.compile('^/apps/.*')):
    string_temp_link = base_url + link_tag.get('href')
    list_of_links.append(string_temp_link)

list_of_links = list(set(list_of_links))

for link_item in list_of_links:
    print link_item
    print ("\n")
# Newly added code to get all "next page" links from a url
next_page_links = []
for link_tag in soup.findAll('a', href=re.compile('^/.*page=')):
    string_temp_link = base_url + link_tag.get('href')
    next_page_links.append(string_temp_link)

for next_page in next_page_links:
    print next_page
I used the /apps/ regex to get the list of software.
But I wanted to know if there is a better approach to crawl through the next pages. I am able to match the next-page links using the regex ".*page=", but this gives a repeated list of pages.
How can I do this in a better way?
Looking at the page, there are 5 pages, the last of which is "...?page=4", so we know there is the first page, then page=1 through page=4...
<li class="pager-last last">
last »
</li>
So you could retrieve that by the class (or by title), then parse the href...
from urlparse import urlparse, parse_qs

# here `url` is the href taken from that "last" pager link
for pageno in xrange(1, int(parse_qs(urlparse(url).query)['page'][0]) + 1):
    pass  # do something useful here, like building a url string with pageno
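For completeness, here is a minimal sketch (still Python 2, like the question) that combines the two pieces: it reads the page number out of the "last" pager link and then reuses the original /apps/ filter on every page. It assumes the pager's list item wraps an <a> whose href carries the page parameter, as the snippet above suggests.
from bs4 import BeautifulSoup
import re
import urllib2
from urlparse import urlparse, parse_qs

base_url = "http://civiccommons.org"
url = "http://civiccommons.org/software-functions/social-media"

soup = BeautifulSoup(urllib2.urlopen(url).read())

# Read the href of the "last page" link and pull the page number from its query string.
last_href = soup.find('li', class_='pager-last').a['href']
last_page = int(parse_qs(urlparse(last_href).query)['page'][0])

all_links = set()
for pageno in xrange(0, last_page + 1):
    # Page 0 is the first page and has no ?page= parameter.
    page_url = url if pageno == 0 else url + '?page=%d' % pageno
    page_soup = BeautifulSoup(urllib2.urlopen(page_url).read())
    for link_tag in page_soup.findAll('a', href=re.compile('^/apps/')):
        all_links.add(base_url + link_tag.get('href'))

for link in sorted(all_links):
    print link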
I tried to fetch all of the product names from the web page, but I could only get 12.
If I scroll down the page, it refreshes and loads more products.
How can I get all of the information?
import requests
from bs4 import BeautifulSoup
import re
url = "https://www.outre.com/product-category/wigs/"
res = requests.get(url)
res.raise_for_status()
soup = BeautifulSoup(res.text, "lxml")
items = soup.find_all("div", attrs={"class":"title-wrapper"})
for item in items:
    print(item.p.a.get_text())
Your code is fine. The thing is that on this website the products are loaded dynamically, so a single request only gets you the first 12 products.
You can check the developer console inside your browser to track the Ajax calls made while browsing.
I did, and it turns out a call is made to retrieve more products from the URL
https://www.outre.com/product-category/wigs/page/2/
So if you want to get all of the products you need to request multiple pages. I suggest you use a loop and run your code once per page (see the sketch below).
N.B.: You can also check the website for a more convenient place to get the products (for example, not from the main page).
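As a rough illustration of that suggestion (not a definitive implementation), the original parsing code can be wrapped in a loop over the paginated URL observed above, stopping when a page returns no products:
import requests
from bs4 import BeautifulSoup

# Paginated URL pattern observed in the browser's network tab.
base = "https://www.outre.com/product-category/wigs/page/{}/"

page = 1
while True:
    res = requests.get(base.format(page))
    if res.status_code == 404:
        break
    soup = BeautifulSoup(res.text, "lxml")
    # Same selector as in the question.
    items = soup.find_all("div", attrs={"class": "title-wrapper"})
    if not items:
        break
    for item in items:
        print(item.p.a.get_text())
    page += 1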
The page loads the products from a different URL via JavaScript, so Beautiful Soup doesn't see them. To get all pages, you can use the following example:
import requests
from bs4 import BeautifulSoup
url = "https://www.outre.com/product-category/wigs/page/{}/"
page = 1
while True:
    soup = BeautifulSoup(requests.get(url.format(page)).content, "html.parser")
    titles = soup.select(".product-title")

    if not titles:
        break

    for title in titles:
        print(title.text)

    page += 1
Prints:
...
Wet & Wavy Loose Curl 18″
Wet & Wavy Boho Curl 20″
Nikaya
Jeanette
Natural Glam Body
Natural Free Deep
I have been trying to scrape a website such as the one below. In the footer there are a number of links to their social media pages, of which the LinkedIn URL is the one I care about. Is there a way to fish out only that link, maybe using regex or any other libraries available in Python?
This is what I have tried so far -
import requests
from bs4 import BeautifulSoup
url = "https://www.southcoast.org/"
req = requests.get(url)
soup = BeautifulSoup(req.text, "html.parser")

for link in soup.find_all('a'):
    print(link.get('href'))
But I'm fetching all the URLs instead of the one I'm looking for.
Note: I'd appreciate a dynamic code which I can use for other sites as well.
Thanks in advance for your suggestions/help.
One approach could be to use CSS selectors and look for the string linkedin.com/company/ in the values of the href attributes:
soup.select_one('a[href*="linkedin.com/company/"]')['href']
Example
import requests
from bs4 import BeautifulSoup
url = "https://www.southcoast.org/"
req = requests.get(url)
soup = BeautifulSoup(req.text,"html.parser")
# single (first) link
link = e['href'] if(e := soup.select_one('a[href*="linkedin.com/company/"]')) else None
# multiple links
links = [link['href'] for link in soup.select('a[href*="linkedin.com/company/"]')]
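Since you asked for something you can reuse on other sites: here is a small, hypothetical helper (the name and parameters are mine, not from any library) that wraps the same selector-based idea and lets you swap the domain fragment per site:
import requests
from bs4 import BeautifulSoup

def find_social_link(url, domain_fragment="linkedin.com/"):
    """Return the first href on the page containing domain_fragment, or None."""
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    e = soup.select_one('a[href*="{}"]'.format(domain_fragment))
    return e['href'] if e else None

# Usage: swap the fragment for other networks or other sites.
print(find_social_link("https://www.southcoast.org/", "linkedin.com/company/"))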
I have been tasked with creating a search engine. I understand that I need to create an adaptable URL. I have found the value I need in the onclick attribute of the button, but it changes from page to page, so my for loop needs to read it each time the page changes in order to build the new URL. I have provided an example of the part of the URL I need to change in square brackets.
I have provided a picture with the highlighted source code I require and part of my unfinished code.
Any help with this would be greatly appreciated.
https://scholar.google.co.uk/citations?view_op=view_org&hl=en&org=9117984065169182779&after_author=c7lwAPTu__8J&astart=20
https://scholar.google.co.uk/citations?view_op=view_org&hl=en&org=9117984065169182779&after_author=[NEW AUTHOR/USER CODE]&astart=[NEW PAGE NUMBER]
def main_page(max_pages):
    page = 0
    newpage = soup.find_all('button', {'onclick': ''})
    while page <= max_pages:
        url = 'https://scholar.google.co.uk/citations?view_op=view_org&hl=en&org=9117984065169182779&after_author=' + str(newpage) + '&astart=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.findAll('a', {'href': '/citations?hl=en&user='}):
            href = link.get('href')
            print(href)
        page += 10

main_page(1)
[Screenshot: highlighted source code required]
You can use a little regular-expression work together with urllib.
from bs4 import BeautifulSoup
import re
from urllib import parse
data = '''
<button onclick="window.location='/citations?view_op\x3dview_org\x26hl\x3den\x26org\x3d9117984065169182779\x26after_author\x3doHpYACHy__8J\x26astart\x3d30'">click me</button>
'''
PATTERN = re.compile(r"^window.location='(.+)'$")
soup = BeautifulSoup(data, 'html.parser')
for button in soup.find_all('button'):
    location = PATTERN.match(button.attrs['onclick']).group(1)
    parseresult = parse.urlparse(location)
    d = parse.parse_qs(parseresult.query)
    print(d['after_author'][0])
    print(d['astart'][0])
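To tie this back to the URL template in the question, the two parsed values can then be substituted into it to request the next page. This is only a sketch of that last step, and the helper name is mine:
from urllib import parse

base = ('https://scholar.google.co.uk/citations?'
        'view_op=view_org&hl=en&org=9117984065169182779')

def next_page_url(after_author, astart):
    # Plug the values parsed from the onclick attribute into the URL template.
    return '{}&after_author={}&astart={}'.format(base, parse.quote(after_author), astart)

# With the values from the sample button above:
print(next_page_url('oHpYACHy__8J', 30))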
I am trying to scrape a website, www.zath.co.uk, and extract the links to all of the articles using Python 3. Looking at the raw HTML file, I identified one of the sections I am interested in, shown below.
<article class="post-32595 post type-post status-publish format-standard has-post-thumbnail category-games entry" itemscope="" itemtype="https://schema.org/CreativeWork">
  <header class="entry-header">
    <h2 class="entry-title" itemprop="headline">
      <a class="entry-title-link" href="https://www.zath.co.uk/family-games-day-night-event-giffgaff/" rel="bookmark">
        A Family Games Night (& Day) With giffgaff
      </a>
I then wrote the following code to execute this. I started by setting up a list of URLs from the website to scrape.
urlList = ["https://www.zath.co.uk/", "https://www.zath.co.uk/page/2/", ....., "https://www.zath.co.uk/page/35/"]
Then (after importing the necessary libraries) I defined a function to get all Zath articles.
def getAllZathPosts(url, links):
    request = urllib.request.Request(url)
    response = urllib.request.urlopen(request)
    soup = BeautifulSoup(response)
    for a in soup.findAll('a'):
        url = a['href']
        c = a['class']
        if c == "entry-title-link":
            print(url)
            links.append(url)
    return
Then call the function.
links = []
zathPosts = {}
for url in urlList:
    zathPosts = getAllZathPosts(url, links)
The code runs with no errors, but the links list remains empty and no URLs are printed, as if the class never equals "entry-title-link". I have tried adding an else case:
else:
    print(url + " not article")
and all of the links from the pages were printed as "not article". Any suggestions?
You can simply iterate over the pages using range and extract the article tags. (Incidentally, the original check fails because a['class'] returns a list of class names rather than a string, so it never equals "entry-title-link".)
import requests
from bs4 import BeautifulSoup
for page_no in range(1, 36):
    page = requests.get("https://www.zath.co.uk/page/{}/".format(page_no))
    parser = BeautifulSoup(page.content, 'html.parser')
    for article in parser.findAll('article'):
        print(article.h2.a['href'])
You can do something like the below code:
import requests
from bs4 import BeautifulSoup
def getAllZathPosts(url, links):
    response = requests.get(url).text
    soup = BeautifulSoup(response, 'html.parser')
    results = soup.select("a.entry-title-link")
    # for i in results:
    #     print(i.text)
    #     links.append(url)
    if len(results) > 0:
        links.append(url)
links = []
urlList = ["https://www.zath.co.uk/","https://www.zath.co.uk/page/2/","https://www.zath.co.uk/page/35/"]
for url in urlList:
    getAllZathPosts(url, links)
print(set(links))
I'm trying to collect a specific link to visit later in my script, but there are many links on the page I'm crawling and they all use the same <a href> markup.
How can I select one specifically? The site is bbb.org and my code is below.
For example, if you search for "lamps" on BBB, I want to collect the links embedded in the business names so I can visit their profiles later.
#!/usr/bin/python
import requests
from bs4 import BeautifulSoup
def bbb_spider(max_pages):
    bus_cat = raw_input('Enter a business category: ')
    pages = 1
    while pages <= max_pages:
        url = 'http://www.bbb.org/search/?type=category&input=' + str(bus_cat) + '&page=' + str(pages)
        sauce_code = requests.get(url)
        plain_text = sauce_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll('a'):
            href = link.get('href')
            print(href)
        pages += 1
You need the links located inside the h4 elements, which are inside the search-results table. There are different ways to get to them, but I would use a CSS selector:
soup.select("table.search-results-table tr h4 a")
I have created something similar to this.
Look at my example of a crawler:
https://github.com/shiva1791/Python_webcrawler
The code takes the URLs it needs to parse from link.csv.
All the logic for parsing every link on the page is in the webcrawler.py file.