This is my first Python project, which I wrote mostly by following YouTube videos. Although not well versed, I think I have the basics of coding down.
# import the module that lets us connect to the internet
import requests
# this lets us pull data out of the pages we crawl
from bs4 import BeautifulSoup

# loop that changes the url every time it executes
def creator_spider(max_pages):
    page = 0
    while page < max_pages:
        url = 'https://www.patreon.com/sitemap/campaigns/' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll('a', {'class': ''}):
            href = "https://www.patreon.com" + link.get('href')
            # title = link.string
            print(href)
            # print(title)
            get_single_item_data(href)
        page = page + 1

def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    print(soup)
    for item_name in soup.findAll('h6'):
        print(item_name.string)
From each page I crawl, I want the code to get this highlighted information: http://imgur.com/a/e59S9
whose source code is: http://imgur.com/a/8qv7k
What I reckon is that I should change the attributes of soup.findAll() in the get_single_item_data() function, but all my attempts have been futile. Any help on this is very much appreciated.
From the bs4 docs:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
It’s very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, “class”, is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_:
soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
However, after a closer look at the page source from your picture, this approach will not get what you want. In the source I see data-react-id: the DOM is built by ReactJS, and requests.get(url) will not execute JS on your end. Disable JS in your browser to see what requests.get(url) actually returns.
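As a quick check, you can look at the raw markup that requests receives (a minimal sketch; the campaign URL and the tags probed are hypothetical examples):

import requests

# any campaign page URL collected by creator_spider (hypothetical example)
campaign_url = 'https://www.patreon.com/someuser'

r = requests.get(campaign_url)

# this is the exact markup BeautifulSoup will parse: no JavaScript has run,
# so content that React renders client-side is simply not in it
print('data-react-id' in r.text)  # React mount points show up in the raw HTML
print('<h6' in r.text)            # likely False if the h6 content is JS-rendered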
Best regards
Related
I have been following this Python tutorial for a while, and I made a web crawler similar to the one in the video.
Language: Python
import requests
from bs4 import BeautifulSoup

def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'https://www.aliexpress.com/category/7/computer-office.html?trafficChannel=main&catName=computer-office&CatId=7&ltype=wholesale&SortType=default&g=n&page=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        for link in soup.findAll('a', {'class': 'item-title'}):
            href = link.get('href')
            title = link.string
            print(href)
        page += 1

spider(1)
And this is the output that the program gives:
PS D:\development> & C:/Users/hirusha/AppData/Local/Programs/Python/Python38/python.exe "d:/development/Python/TheNewBoston/Python/one/web scrawler.py"
PS D:\development>
What can I do?
Before this, I had an error. The code was:

soup = BeautifulSoup(plain_text)

I changed this to:

soup = BeautifulSoup(plain_text, 'html.parser')

and the error was gone. The error I got here was:
d:/development/Python/TheNewBoston/Python/one/web scrawler.py:10: GuessedAtParserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 10 of the file d:/development/Python/TheNewBoston/Python/one/web scrawler.py. To get rid of this warning, pass the additional argument 'features="lxml"' to the BeautifulSoup constructor.
soup = BeautifulSoup(plain_text)
Any help is appreciated, Thank You!
There are no results as the class you are targeting is not present until the webpage is rendered, which doesn't happen with requests.
Data is dynamically retrieved from a script tag. You can regex the JavaScript object holding the data and parse with json to get that info.
The warning you show was due to a parser not being specified originally, which you rectified.
import re, json, requests
import pandas as pd

r = requests.get('https://www.aliexpress.com/category/7/computer-office.html?trafficChannel=main&catName=computer-office&CatId=7&ltype=wholesale&SortType=default&g=n&page=1')
# the listing data lives in a JS object assigned to window.runParams;
# grab that object literal with a regex and parse it as JSON
data = json.loads(re.search(r'window\.runParams = (\{".*?\});', r.text, re.S).group(1))
# build a (title, url) table from the parsed items
df = pd.DataFrame([(item['title'], 'https:' + item['productDetailUrl']) for item in data['items']])
print(df)
I have been tasked with creating a search engine. I understand that I need to create an adaptable URL. I have found the source code I need to use in the onclick attribute of the button; however, this changes from page to page, so I need my loop to read it each time the page changes in order to build the new URL. I have provided an example of the URL, with the parts I need to change in square brackets.
I have provided a picture with the highlighted source code I require and part of my unfinished code.
Any help with this would be greatly appreciated.
https://scholar.google.co.uk/citations?view_op=view_org&hl=en&org=9117984065169182779&after_author=c7lwAPTu__8J&astart=20
https://scholar.google.co.uk/citations?view_op=view_org&hl=en&org=9117984065169182779&after_author=[NEW AUTHOR/USER CODE]&astart=[NEW PAGE NUMBER]
def main_page(max_pages):
    page = 0
    newpage = soup.find_all('button', {'onclick': ''})
    while page <= max_pages:
        url = 'https://scholar.google.co.uk/citations?view_op=view_org&hl=en&org=9117984065169182779&after_author=' + str(newpage) + '&astart=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.findAll('a', {'href': '/citations?hl=en&user='}):
            href = link.get('href')
            print(href)
        page += 10

main_page(1)
Highlighted source code required
You can use a little regular expression and urllib.
from bs4 import BeautifulSoup
import re
from urllib import parse

data = '''
<button onclick="window.location='/citations?view_op\x3dview_org\x26hl\x3den\x26org\x3d9117984065169182779\x26after_author\x3doHpYACHy__8J\x26astart\x3d30'">click me</button>
'''

# the onclick handler assigns the next page's relative URL to window.location
PATTERN = re.compile(r"^window.location='(.+)'$")

soup = BeautifulSoup(data, 'html.parser')
for button in soup.find_all('button'):
    # pull the URL out of the onclick attribute, then parse its query string
    location = PATTERN.match(button.attrs['onclick']).group(1)
    parseresult = parse.urlparse(location)
    d = parse.parse_qs(parseresult.query)
    print(d['after_author'][0])
    print(d['astart'][0])
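From there, a minimal sketch of rebuilding the next page's URL from the parsed values, assuming the base URL from the question (the example values are the ones extracted above):

from urllib import parse

# values parsed out of the onclick attribute above
after_author = 'oHpYACHy__8J'
astart = '30'

# rebuild the next page's URL around the question's base URL
base = 'https://scholar.google.co.uk/citations'
query = parse.urlencode({
    'view_op': 'view_org',
    'hl': 'en',
    'org': '9117984065169182779',
    'after_author': after_author,
    'astart': astart,
})
next_url = base + '?' + query
print(next_url)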
I am trying to implement a simple web crawler in Colab. When I ran the following code, I got the syntax error shown below it. Please advise me how to resolve the issue so I can run it:
import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'https://www.ebay.com/sch/i.html?_from=R40&_nkw=2%22+Butterfly+Valve&_sacat=0&_pgn=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
            for link in soup.findALL('a', {'class':'s-item__title s-item__title--has-tags'})
                href = link.get('href')
                print(href)
        page += 1

trade_spider(1)
Error:
File "<ipython-input-4-5d567ac26fb5>", line 11
for link in soup.findALL('a', {'class':'s-item__title s-item__title--has-tags'})
^
IndentationError: unexpected indent
There are a lot of things wrong with this code, but I can help. The for loop has an extra indent, so remove one level of indentation from the start of it, and add a : to the end of the for line. Also, it seems like you just copied this from the internet, but whatever. Anyway, here is the corrected code:
import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'https://www.ebay.com/sch/i.html?_from=R40&_nkw=2%22+Butterfly+Valve&_sacat=0&_pgn=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.findAll('a', {'class': 's-item__title s-item__title--has-tags'}):
            href = link.get('href')
            print(href)
        page += 1

trade_spider(1)
Edit: after I ran this code, there was a warning:
main.py:10: GuessedAtParserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html5lib"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 10 of the file main.py. To get rid of this warning, pass the additional argument 'features="html5lib"' to the BeautifulSoup constructor.
So here's the correct code:
import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'https://www.ebay.com/sch/i.html?_from=R40&_nkw=2%22+Butterfly+Valve&_sacat=0&_pgn=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, features="html5lib")
        for link in soup.find_all('a', {'class': 's-item__title s-item__title--has-tags'}):
            href = link.get('href')
            print(href)
        page += 1

trade_spider(1)
I am trying to write a Python script that lists all the links on a webpage that contain some substring. The problem I am running into is that the webpage has multiple "pages" so that it doesn't clutter the whole screen. Take a look at https://www.go-hero.net/jam/17/solutions/1/1/C++ for an example.
This is what I have so far:
import requests
from bs4 import BeautifulSoup

url = "https://www.go-hero.net/jam/17/solutions/1/1/C++"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html5lib")
links = soup.find_all('a')
for tag in links:
    link = tag.get('href', None)
    if link is not None and 'GetSource' in link:
        print(link)
Any suggestions on how I might get this to work? Thanks in advance.
Edit/Update: Using Selenium, you could click through the page links before scraping, so that all the content accumulates in the page's html. Many websites with pagination don't accumulate all the text in the html as you click through the pages, but I noticed that the example you provided does. Take a look at this SO question for a quick example of making Selenium work with BeautifulSoup. Here is how you could use it in your code:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
original_url = "https://www.go-hero.net/jam/17/solutions/1/1/C++"
driver.get(original_url)

# click the links for pages 1-29
for i in range(1, 30):
    path_string = '/jam/17/solutions/1/1/C++#page-' + str(i)
    driver.find_element_by_xpath('//a[@href="' + path_string + '"]').click()

# scrape from the accumulated html
html = driver.page_source
soup = BeautifulSoup(html, "html5lib")
links = soup.find_all('a')

# proceed as normal from here
for tag in links:
    link = tag.get('href', None)
    if link is not None and 'GetSource' in link:
        print(link)
Original Answer: For the link you provided above, you could simply loop through possible urls and run your scraping code in the loop:
import requests
from bs4 import BeautifulSoup

original_url = "https://www.go-hero.net/jam/17/solutions/1/1/C++"

# scrape from the original page (has no page number)
response = requests.get(original_url)
soup = BeautifulSoup(response.content, "html5lib")
links = soup.find_all('a')

# prepare to scrape from the pages numbered 1-29
# (note that the original page is not numbered, and the next page is "#page-1")
url_suffix = '#page-'
for i in range(1, 30):
    # add page number to the url
    paginated_url = original_url + url_suffix + str(i)
    response = requests.get(paginated_url)
    soup = BeautifulSoup(response.content, "html5lib")
    # append resulting list to 'links' list
    links += soup.find_all('a')

# proceed as normal from here
for tag in links:
    link = tag.get('href', None)
    if link is not None and 'GetSource' in link:
        print(link)
I don't know if you mind getting duplicates in your results. As the code currently stands, you will get duplicates in your links list, but you could add the links to a set instead to easily remedy that.
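For example, a minimal sketch of that deduplication, reusing the links list from the code above:

# collect matching hrefs into a set so each link is kept only once
unique_links = set()
for tag in links:
    link = tag.get('href', None)
    if link is not None and 'GetSource' in link:
        unique_links.add(link)

for link in sorted(unique_links):
    print(link)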
I am a beginner Python programmer and I am trying to make a web crawler as practice.
Currently I am facing a problem that I cannot find the right solution for: I am trying to get a link location/address from a page where the link has no class, so I have no idea how to filter out that specific link.
It is probably better to show you.
The page I am trying to get the link from.
As you can see, I am trying to get what is inside the href attribute of the "Historical prices" link. Here is my Python code:
import requests
from bs4 import BeautifulSoup

def find_historicalprices_link(url):
    source = requests.get(url)
    text = source.text
    soup = BeautifulSoup(text, 'html.parser')
    link = soup.find_all('li', 'fjfe-nav-sub')
    href = str(link.get('href'))
    find_spreadsheet(href)

def find_spreadsheet(url):
    source = requests.get(url)
    text = source.text
    soup = BeautifulSoup(text, 'html.parser')
    link = soup.find('a', {'class': 'nowrap'})
    href = str(link.get('href'))
    download_spreadsheet(href)

def download_spreadsheet(url):
    response = requests.get(url)
    text = response.text
    lines = text.split("\\n")
    filename = r'google.csv'
    file = open(filename, 'w')
    for line in lines:
        file.write(line + "\n")
    file.close()

find_historicalprices_link('https://www.google.com/finance?q=NASDAQ%3AGOOGL&ei=3lowWYGRJNSvsgGPgaywDw')
In the function find_spreadsheet(url), I could easily filter the link by looking for the class called "nowrap". Unfortunately, the Historical prices link does not have such a class, and right now my script just gives me the following error:
AttributeError: ResultSet object has no attribute 'get'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
How do I make sure that my crawler only takes the href from the "Historical prices"?
Thank you in advance.
UPDATE:
I found the way to do it. By looking only for a link with specific text attached to it, I could find the href I needed.
Solution:
soup.find('a', string="Historical prices")
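In context, that call would replace the find_all() line in find_historicalprices_link (a minimal sketch; the rest of the function is unchanged from the question):

def find_historicalprices_link(url):
    source = requests.get(url)
    text = source.text
    soup = BeautifulSoup(text, 'html.parser')
    # find() returns a single Tag (or None), so .get('href') works on it,
    # unlike the ResultSet that find_all() returns
    link = soup.find('a', string="Historical prices")
    href = str(link.get('href'))
    find_spreadsheet(href)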
Does the following code snippet help you? I hope you can solve your problem with it:
from bs4 import BeautifulSoup
html = """<a href='http://www.google.com'>Something else</a>
<a href='http://www.yahoo.com'>Historical prices</a>"""
soup = BeautifulSoup(html, "html5lib")
urls = soup.find_all("a")
print(urls)
print([a["href"] for a in urls if a.text == "Historical prices"])