Python Web Crawler from thenewboston - python

I recently watched a thenewboston video on writing a web crawler using python. For some reason, I'm getting a SSLError. I tried fixing it with line 6 of code but no luck. Any idea why it's throwing errors? The code is verbatim from thenewboston.
import requests
from bs4 import BeautifulSoup
def creepy_crawly(max_pages):
page = 1
#requests.get('https://www.thenewboston.com/', verify = True)
while page <= max_pages:
url = "https://www.thenewboston.com/trade/search.php?pages=" + str(page)
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text)
for link in soup.findAll('a', {'class' : 'item-name'}):
href = "https://www.thenewboston.com" + link.get('href')
print(href)
page += 1
creepy_crawly(1)

I've done a web crawler using urllib, it can be faster and has no problem accessing https pages, one thing though is that it doesn't validates the server certificate, this make it faster but more dangerous ( vulnerable to mitm attacks).
Bellow there's an usage example of that lib:
link = 'https://www.stackoverflow.com'
html = urllib.urlopen(link).read()
print(html)
3 lines is all you need to grab the HTML from a page, simple isn't it?
More about urllib: https://docs.python.org/2/library/urllib.html
I also recommend you use regex on the HTML to grab other links, an example for that (using re library) would be:
for url in re.findall(r'<a[^>]+href=["\'](.[^"\']+)["\']', html, re.I): # Searches the HTML for other URLs
link = url.split("#", 1)[0] \
if url.startswith("http") \
else '{uri.scheme}://{uri.netloc}'.format(uri=urlparse.urlparse(origLink)) + url.split("#", 1)[0] # Checks if the HTML is valid and format it

Related

why doesn't my web scraper work? Python3 - requests, BeautifulSoup

I have been following this python tutorial for a while, and I made a web scrawler, similar to the one in the video.
Language: Python
import requests
from bs4 import BeautifulSoup
def spider(max_pages):
page = 1
while page <= max_pages:
url = 'https://www.aliexpress.com/category/7/computer-office.html?trafficChannel=main&catName=computer-office&CatId=7&ltype=wholesale&SortType=default&g=n&page=' + str(page)
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, 'html.parser')
for link in soup.findAll('a', {'class':'item-title'}):
href = link.get('href')
title = link.string
print(href)
page += 1
spider(1)
And this is the output that the program gives:
PS D:\development> & C:/Users/hirusha/AppData/Local/Programs/Python/Python38/python.exe "d:/development/Python/TheNewBoston/Python/one/web scrawler.py"n/TheNewBoston/Python/one/web scrawler.py"
PS D:\development>
What can I do?
Before this, I had an error, the code was:
soup = BeautifulSoup(plain_text)
i changed this to
soup = BeautifulSoup(plain_text, 'html.parser')
and the error was gone,
the error i got here was:
d:/development/Python/TheNewBoston/Python/one/web scrawler.py:10: GuessedAtParserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 10 of the file d:/development/Python/TheNewBoston/Python/one/web scrawler.py. To get rid of this warning, pass the additional argument 'features="lxml"' to the BeautifulSoup constructor.
soup = BeautifulSoup(plain_text)
Any help is appreciated, Thank You!
There are no results as the class you are targeting is not present until the webpage is rendered, which doesn't happen with requests.
Data is dynamically retrieved from a script tag. You can regex the JavaScript object holding the data and parse with json to get that info.
The error you show was due to a parser not being specified originally; which you rectified.
import re, json, requests
import pandas as pd
r = requests.get('https://www.aliexpress.com/category/7/computer-office.html?trafficChannel=main&catName=computer-office&CatId=7&ltype=wholesale&SortType=default&g=n&page=1')
data = json.loads(re.search(r'window\.runParams = (\{".*?\});', r.text, re.S).group(1))
df = pd.DataFrame([(item['title'], 'https:' + item['productDetailUrl']) for item in data['items']])
print(df)

Web Crawling Google Scholar - Extracting part of HTML URL to be able crawl the next/previous page

I have been tasked with creating a search engine. I understand that I need to create an adaptable URL, I have found the source code that I need to use from the onclick attribute on the button however as this changes from page to page. I need my for loop to be able to read this each time the page changes to be able to update the new URL. I have provided an example of the URL I need to change in square brackets.
I have provided a picture with the highlighted source code I require and part of my unfinished code.
Any help with this would be greatly appreciated.
https://scholar.google.co.uk/citations?view_op=view_org&hl=en&org=9117984065169182779&after_author=c7lwAPTu__8J&astart=20
https://scholar.google.co.uk/citations?view_op=view_org&hl=en&org=9117984065169182779&after_author=[NEW AUTHOR/USER CODE]&astart=[NEW PAGE NUMBER]
def main_page(max_pages):
page = 0
newpage = soup.find_all('button', {'onclick': ''})
while page <= max_pages:
url = 'https://scholar.google.co.uk/citations?view_op=view_org&hl=en&org=9117984065169182779&after_author='+str(newpage)'&astart='+str(page)
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text)
for link in soup.findAll('a', {'href': '/citations?hl=en&user='}):
href = link.get('href')
print(href)
page += 10
main_page(1)
Highlighted source code required
You can use a little regular expression and urllib.
from bs4 import BeautifulSoup
import re
from urllib import parse
data = '''
<button onclick="window.location='/citations?view_op\x3dview_org\x26hl\x3den\x26org\x3d9117984065169182779\x26after_author\x3doHpYACHy__8J\x26astart\x3d30'">click me</button>
'''
PATTERN = re.compile(r"^window.location='(.+)'$")
soup = BeautifulSoup(data, 'html.parser')
for button in soup.find_all('button'):
location = PATTERN.match(button.attrs['onclick']).group(1)
parseresult = parse.urlparse(location)
d = parse.parse_qs(parseresult.query)
print(d['after_author'][0])
print(d['astart'][0])

Getting specific data after crawling websites in Python

This is my first Python project, which I pretty much wrote by following youtube videos. Although not well versed, I think I have the basics of coding.
#importing the module that allows to connect to the internet
import requests
#this allows to get data from by crawling webpages
from bs4 import BeautifulSoup
#creating a loop to change url everytime it is executed
def creator_spider(max_pages):
page = 0
while page < max_pages:
url = 'https://www.patreon.com/sitemap/campaigns/' + str(page)
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, "html.parser")
for link in soup.findAll('a', {'class': ''}):
href = "https://www.patreon.com" + link.get('href')
#title = link.string
print(href)
#print(title)
get_single_item_data(href)
page = page + 1
def get_single_item_data(item_url):
source_code = requests.get(item_url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, "html.parser")
print soup
for item_name in soup.findAll('h6'):
print(item_name.string)
From each page I crawl, I want the code to get this highlighted information: http://imgur.com/a/e59S9
whose source code is: http://imgur.com/a/8qv7k
what I reckon is I should change the attributes of soup.findAll() in the get_single_item_data() functiom, but all my attempts have been futile. Any help on this is very much appreciated.
from bs4 docs
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
It’s very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, “class”, is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_:
soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
However after closer look at the code you mentioned in pic this approach will not get what you want. In the source I see data-react-id . DOM is build by ReactJS and requests.get(url) will not execute JS on your end. Disable JS in your browser to see what is returned with requests.get(url).
Best regards

Parsing a webpage using Beautiful Soup in python, doesnt work with specific page

New to python, thought I'd try to make a web crawler as a first project. Found Beautiful Soup as the solution. All is well except that the ONE page I want to crawl yields no results :(
Here is the code:
import requests
from bs4 import BeautifulSoup
from mechanize import Browser
def crawl_list(max_pages):
mech = Browser()
place = 1
while place <= max_pages:
url = "http://www.crummy.com/software/BeautifulSoup/bs4/doc/"
page = mech.open(url)
html = page.read()
soup = BeautifulSoup(html)
for link in soup.findAll('a'):
href = link.get('href')
print(href)
place += 1
crawl_list(1)
This code works wonders. I get the whole list of links. BUT, as soon as I put http://diseasesdatabase.com/disease_index_a.asp in the value of 'url', no dice.
Perhaps it has to do with the .asp? Can someone please solve this mystery?
I'm getting this as an error message:
mechanize._response.httperror_seek_wrapper: HTTP Error 410: Gone
Thanks in advance.

Python Web Scraping - Navigating to Next_Page link and obtaining data

I am using Python and Beautiful Soup to obtain url of available software from Civic Commons - Social Media link. I want the link of all the Social Media software (spread across 20 pages). I am able to get the url of software listed in the first page.
Below is the Python code that I wrote for obtaining these values.
from bs4 import BeautifulSoup
import re
import urllib2
base_url = "http://civiccommons.org"
url = "http://civiccommons.org/software-functions/social-media"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
list_of_links = []
for link_tag in soup.findAll('a', href=re.compile('^/apps/.*')):
string_temp_link = base_url+link_tag.get('href')
list_of_links.append(string_temp_link)
list_of_links = list(set(list_of_links))
for link_item in list_of_links:
print link_item
print ("\n")
#Newly added code to get all Next Page links from a url
next_page_links = []
for link_tag in soup.findAll('a', href=re.compile('^/.*page=')):
string_temp_link = base_url+link_tag.get('href')
next_page_links.append(string_temp_link)
for next_page in next_page_links:
print next_page
I used /apps/ regex to get the list of software.
But I wanted to know if there is better approach to crawl through next page. I am able to match the next page link by using regex "*page=". But this gives repeated list of pages.
How can I do this in a better way?
Looking at the page, there's 5 pages, the last of which is "...?page=4", so, we know there's the first page, then page=1 through page=4...
<li class="pager-last last">
last »
</li>
So you could retrieve that by the class (or by title), then parse the href...
from urlparse import urlparse, parse_qs
for pageno in xrange(1, int(parse_qs(urlparse(url).query)['page'][0]) + 1):
pass # do something useful here like building a url string with pageno

Categories

Resources