Scraping threads that are hundreds of pages deep w/ BeautifulSoup - python

Python and BeautifulSoup newbie here.
I am trying to scrape a forum that has about 500 pages, each of which contains 50 individual threads. Some of these threads contain about 200 pages worth of posts.
I would like to write a program that can scrape the relevant parts of the whole forum in an automated fashion, having been fed a single URL as an entry point:
page_list = ['http://forum.doctissimo.fr/sante/diabete/liste_sujet-1.htm']
While I have no problem extracting the 'next link' for both the individual threads and the pages that contain them:
def getNext_link(soup0bj):
    # Extracts a page's next link from the BeautifulSoup object
    next_link = []
    try:
        for link in soup0bj.find_all('link', {'rel': 'next'}):
            if link.attrs['href'] not in next_link:
                next_link.append(link.attrs['href'])
    except AttributeError:
        return None
    return next_link
...I'm stuck with a program that takes the seeded URL and extracts contents only from the first page of each thread it hosts. The program then ends:
for page in page_list:
    if page is not None:
        html = getHTMLsoup(page)
        print(getNext_link(html))
        page_list.append(getNext_link(html))
        print(page_list)
        for thread in getThreadURLs(html):
            if thread is not None:
                html = getHTMLsoup(thread)
                print('\n'.join(getHandles(html)))
                print('\n'.join(getTime_stamps(html)))
                print('\n', getNext_link(html))
                print('\n'.join(getPost_contents(html)), '\n')
I've tried appending the 'next link' into page_list, but that hasn't worked, as urlopen is then handed a list rather than a string. I've also tried this:
for page in itertools.chain(page_list):
...but the program throws this error:
AttributeError: 'list' object has no attribute 'timeout'
I'm really stuck. Any and all help would be most welcome!

I solved this myself, so I'm posting the answer, just in case someone else might benefit.
So, the problem was that urlopen could not open a URL that was sitting in a list within a list.
In my case, each forum page had at most one relevant internal link. Rather than having my getNext_link function return a list containing that link, as seen here (note the list next_link)...
def getNext_link(soup0bj):
    # Extracts a page's next link (if available) and returns it in a list
    try:
        next_link = []
        if len(soup0bj.find_all('link', {'rel': 'next'})) != 0:
            for link in soup0bj.find_all('link', {'rel': 'next'}):
                next_link.append(link.attrs['href'])
        return next_link
    except AttributeError:
        return None
I asked it to return the URL as a string, as seen here:
def getNext_link(soup0bj):
    # Extracts a page's next link (if available) and returns it as a string
    try:
        next_link = None
        if len(soup0bj.find_all('link', {'rel': 'next'})) != 0:
            for link in soup0bj.find_all('link', {'rel': 'next'}):
                next_link = link.attrs['href']
        return next_link
    except AttributeError:
        return None
As next_link is now simply a string, it can easily be appended to the list that is being iterated over (see the question above for details, and the sketch below). Voilà!
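For reference, a minimal sketch of how the string-returning version slots into the crawl loop from the question (getHTMLsoup, getThreadURLs and the other helper functions are assumed to be defined as in that post):
page_list = ['http://forum.doctissimo.fr/sante/diabete/liste_sujet-1.htm']

for page in page_list:                # the list grows while it is being iterated over
    html = getHTMLsoup(page)
    next_page = getNext_link(html)    # now a plain string, or None
    if next_page is not None and next_page not in page_list:
        page_list.append(next_page)   # urlopen later receives a string, not a list
    for thread in getThreadURLs(html):
        if thread is not None:
            thread_html = getHTMLsoup(thread)
            # ...extract handles, timestamps and post contents here...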

Related

How do I obtain results from 'yield' in python?

Perhaps yield in Python is remedial for some, but not for me... at least not yet.
I understand yield creates a 'generator'.
I stumbled upon yield when I decided to learn scrapy.
I wrote some code for a Spider which works as follows:
Go to the start hyperlink and extract all hyperlinks, which are not full hyperlinks, just sub-directories to be concatenated onto the starting hyperlink.
Examine the hyperlinks and append those meeting specific criteria to the base hyperlink.
Use Request to navigate to the new hyperlink and parse it to find the unique id in the element with 'onclick'.
import scrapy
from scrapy import Request

class newSpider(scrapy.Spider):
    name = 'new'
    allowed_domains = ['www.alloweddomain.com']
    start_urls = ['https://www.alloweddomain.com']

    def parse(self, response):
        links = response.xpath('//a/@href').extract()
        for link in links:
            if link == 'SpecificCriteria':
                next_link = response.urljoin(link)
                yield Request(next_link, callback=self.parse_new)
EDIT 1:
for uid_dict in self.parse_new(response):
    print(uid_dict['uid'])
    break
End EDIT 1
Running the code here, response evaluates to the HTTP response for start_urls, not for next_link.
def parse_new(self, response):
    trs = response.xpath("//*[@class='unit-directory-row']").getall()
    for tr in trs:
        if 'SpecificText' in tr:
            elements = tr.split()
            for element in elements:
                if 'onclick' in element:
                    subelement = element.split('(')[1]
                    uid = subelement.split(')')[0]
                    print(uid)
                    yield {'uid': uid}
                    break
It works: scrapy crawls the first page, creates the new hyperlink and navigates to the next page. parse_new parses the HTML for the uid and 'yields' it, and scrapy's engine shows that the correct uid is 'yielded'.
What I don't understand is how I can 'use' the uid obtained by parse_new to create and navigate to a new hyperlink the way I would with a variable; I cannot seem to return a variable with Request.
I'd check out What does the "yield" keyword do? for a good explanation of how exactly yield works.
In the meantime, spider.parse_new(response) is an iterable object. That is, you can acquire its yielded results via a for loop. E.g.,
for uid_dict in spider.parse_new(response):
    print(uid_dict['uid'])
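If the goal is to act on each uid inside the spider itself, the usual Scrapy pattern is not to return a variable at all but to yield a further Request from the callback. A rough sketch, where the uid-based URL pattern is a made-up placeholder:
def parse_new(self, response):
    for tr in response.xpath("//*[@class='unit-directory-row']").getall():
        if 'onclick' in tr:
            uid = tr.split('(')[1].split(')')[0]
            # hypothetical URL pattern; substitute the real one for your site
            yield scrapy.Request(response.urljoin('/units/{}'.format(uid)),
                                 callback=self.parse_unit)

def parse_unit(self, response):
    # whatever should happen on the page reached via the uid goes here
    yield {'uid_page': response.url}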
After much reading and learning I discovered the reason scrapy does not perform the callback in the first parse, and it has nothing to do with yield! It comes down to two issues (both tweaks are sketched below):
1) robots.txt: this can be 'resolved' with ROBOTSTXT_OBEY = False in settings.py.
2) The logger reports 'Filtered offsite request to ...': passing dont_filter=True to the Request may resolve this.
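A minimal sketch of those two tweaks, assuming a standard Scrapy project layout:
# settings.py
ROBOTSTXT_OBEY = False

# in the spider, when yielding the request
yield Request(next_link, callback=self.parse_new, dont_filter=True)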

Python beautiful soup web scraper doesn't return tag contents

I'm trying to scrape matches and their respective odds from a local bookie site, but for every site I try my web scraper doesn't return anything; it just prints "Process finished with exit code 0".
Can someone help me crack open the containers and get out the contents?
I have tried all of the sites listed below for almost a month, but with no success. The problem seems to be with the exact div, class or span element layout.
https://www.betlion.co.ug/
https://www.betpawa.ug/
https://www.premierbet.ug/
For example, I tried link 2 in the code shown below:
import requests
from bs4 import BeautifulSoup

url = "https://www.betpawa.ug/"
response = requests.get(url, timeout=5)
content = BeautifulSoup(response.content, "html.parser")

for match in content.findAll("div", attrs={"class": "events-container prematch", "id": "Bp-Event-591531"}):
    print(match.text.strip())
I expect the program to return a list of matches, odds and all the other components of the container. However, the program runs and just prints "Process finished with exit code 0", nothing else.
It looks like the base site gets loaded in two phases:
Load some HTML structure for the page,
Use JavaScript to fill in the contents
You can prove this to yourself by right-clicking on the page, choosing "view page source", and then searching for "events-container" (it is not there).
So you'll need something more powerful than requests + bs4. I have heard of folks using Selenium to do this, but I'm not familiar with it.
You could consider using urllib.request (from the standard library) instead of requests:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

# build your request with a browser-like User-Agent
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
# retrieve the document
res = urlopen(req)
# parse it using bs4
html = BeautifulSoup(res, 'html.parser')
Like Chris Curvey described, the problem is that requests can't execute the JavaScript of the page. If you print your content variable, you can see that the page would display a message like: "JavaScript Required! To provide you with the best possible product, our website requires JavaScript to function..." With Selenium you control a full browser in the form of a WebDriver (for example the ChromeDriver binary for the Google Chrome browser):
from bs4 import BeautifulSoup
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
# chrome_options.add_argument('headless')
driver = webdriver.Chrome(chrome_options=chrome_options)

url = "https://www.betpawa.ug/"
driver.get(url)
page = driver.page_source
content = BeautifulSoup(page, 'html.parser')

for match in content.findAll("div", attrs={"class": "events-container"}):
    print(match.text.strip())
Update:
The command print(match.text.strip()) in the last line simply extracts the text of each match div that has the class attribute "events-container".
If you want to extract more specific content, you can access each match through the match variable.
You need to know:
which of the available pieces of information you want,
how to identify this information inside the match div's structure,
and which data type you need this information in.
To make this easier, run the program and open Chrome's developer tools with F12; in the top-left corner you will see the icon for "select an element...".
If you click on that icon and then click on the desired element in the browser, the corresponding source appears in the panel below the icon.
Analyse it carefully to get the information you need, for example:
The title of the football match is the first h3 tag in the match div and is a string.
The odds shown are span tags with the class event-odds and are numbers (float/double).
Search for the function you need on Google or in the reference for the package you use (BeautifulSoup4).
Let's try it quick and dirty by using the BeautifulSoup functions on the match variable, so that we don't pick up elements from the full site:
# (1) let's try to find the h3 tag
title_tags = match.findAll("h3")     # use on the match variable
if len(title_tags) > 0:              # at least one found?
    title = title_tags[0].getText()  # get the text of the first one
    print("Title: ", title)          # show it
else:
    print("no h3-tags found")
    exit()

# (2) let's try to get some odds as numbers in the order in which they are displayed
odds_tags = match.findAll("span", attrs={"class": "event-odds"})
if len(odds_tags) > 2:               # at least three found?
    odds = []                        # create a list
    for tag in odds_tags:            # loop over the odds_tags we found
        odd = tag.getText()          # get the text
        print("Odd: ", odd)
        # good, but it is a string; you can't compare it with a number in
        # Python and expect a sensible result.
        # You have to clean it and convert it:
        clean_odd = odd.strip()      # remove surrounding spaces
        odd = float(clean_odd)       # convert it to float
        print("Odd as Number:", odd)
        odds.append(odd)             # keep the converted odd
else:
    print("something went wrong with the odds")
    exit()

input("Press enter to try it on the next match!")

Scrapy infinite scrolling - no pagination indication

I am new to web scraping and I encountered some issues when I was trying to scrape a website with infinite scroll. I looked at some other questions but I could not find the answer, so I hope someone could help me out here.
I am working on the website http://www.aastocks.com/tc/stocks/analysis/stock-aafn/00001/0/all/. I have the following (very basic) piece of code so far, with which I can get every article on the first page (20 entries).
def parse(self, response):
    # collect all article links
    news = response.xpath("//div[starts-with(@class,'newshead4')]//a//text()").extract()
    # visit each news link and gather news info
    for n in news:
        url = urljoin(response.url, n)
        yield scrapy.Request(url, callback=self.parse_news)
However, I could not figure out how to go to the next page. I read some tutorials online, such as going to Inspect -> Network and observing the Request URL after scrolling; it returned http://www.aastocks.com/tc/resources/datafeed/getmorenews.ashx?cat=all&newstime=905169272&newsid=NOW.895783&period=0&key=&symbol=00001, where I could not find any indication of pagination or another pattern to help me go to the next page. When I copy this link into a new tab, I see a JSON document with the news of the next page, but without a URL attached to it. In this case, how could I fix it? Many thanks!
The link
http://www.aastocks.com/tc/resources/datafeed/getmorenews.ashx?cat=all&newstime=905169272&newsid=NOW.895783&period=0&key=&symbol=00001
gives JSON data with values like NOW.XXXXXX, which you can use to generate links to news articles:
"http://www.aastocks.com/tc/stocks/analysis/stock-aafn-con/00001/" + "NOW.XXXXXX" + "/all"
If you scroll down a few times, you will see that the following pages are fetched with similar links, but with different newstime and newsid parameters.
If you check the JSON data, you will see that the last item has the values 'dtd' and 'id', which are the same as the newstime and newsid parameters in the link used to download the JSON data for the next page.
So you can generate the link that fetches the JSON data for the next page(s):
"http://www.aastocks.com/tc/resources/datafeed/getmorenews.ashx?cat=all&newstime=" + DTD + "&newsid=" + ID + "&period=0&key=&symbol=00001"
Working example with requests
import requests

newstime = '934735827'
newsid = 'HKEX-EPS-20190815-003587368'

url = 'http://www.aastocks.com/tc/resources/datafeed/getmorenews.ashx?cat=all&newstime={}&newsid={}&period=0&key=&symbol=00001'
url_article = "http://www.aastocks.com/tc/stocks/analysis/stock-aafn-con/00001/{}/all"

for x in range(5):
    print('---', x, '----')
    print('data:', url.format(newstime, newsid))

    # get JSON data
    r = requests.get(url.format(newstime, newsid))
    data = r.json()

    #for item in data[:3]: # test only few links
    for item in data[:-1]:  # skip last link which gets next page
        # test links to articles
        r = requests.get(url_article.format(item['id']))
        print('news:', r.status_code, url_article.format(item['id']))

    # get data for next page
    newstime = data[-1]['dtd']
    newsid = data[-1]['id']
    print('next page:', newstime, newsid)
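Since the question is written for Scrapy, the same idea can also be expressed as spider callbacks. A rough sketch, where parse_news is the callback from the question and the starting newstime/newsid values are the ones used in the example above:
import json
import scrapy

class AastocksNewsSpider(scrapy.Spider):
    name = 'aastocks_news'
    feed_url = ('http://www.aastocks.com/tc/resources/datafeed/getmorenews.ashx'
                '?cat=all&newstime={}&newsid={}&period=0&key=&symbol=00001')
    article_url = 'http://www.aastocks.com/tc/stocks/analysis/stock-aafn-con/00001/{}/all'

    def start_requests(self):
        # start from the feed directly; the values could also be read from the first page
        yield scrapy.Request(self.feed_url.format('934735827', 'HKEX-EPS-20190815-003587368'),
                             callback=self.parse_feed)

    def parse_feed(self, response):
        data = json.loads(response.text)
        for item in data[:-1]:        # the last item only points to the next page
            yield scrapy.Request(self.article_url.format(item['id']),
                                 callback=self.parse_news)
        last = data[-1]               # use its 'dtd' and 'id' to fetch the next batch
        yield scrapy.Request(self.feed_url.format(last['dtd'], last['id']),
                             callback=self.parse_feed)

    def parse_news(self, response):
        # the article parsing from the question goes here
        pass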

How to follow 302 redirects while still getting page information when scraping using Scrapy?

Been wrestling with trying to get around this 302 redirection. First of all, the point of this particular part of my scraper is to get the next page index so I can flip through pages. The direct URLs aren't available for this site, so I can't just move on to the next one or anything; in order to continue scraping the actual data using a parse_details function, I have to go through each page and simulate requests.
This is all pretty new to me, so I made sure to try anything I could find first. I have tried various settings ("REDIRECT_ENABLED":False, altering handle_httpstatus_list, etc.) but none are getting me through this. Currently I'm trying to follow the location of the redirection, but this isn't working either.
Here is an example of one of the potential solutions I've tried following.
try:
    print('Current page index: ', page_index)
except:  # Will be thrown if page_index wasn't found due to redirection.
    if response.status in (302,) and 'Location' in response.headers:
        location = to_native_str(response.headers['location'].decode('latin1'))
        yield scrapy.Request(response.urljoin(location), method='POST', callback=self.parse)
The code, without the details parsing and such, is as follows:
def parse(self, response):
    table = response.css('td > a::attr(href)').extract()
    additional_page = response.css('span.page_list::text').extract()
    for string_item in additional_page:
        # The text has some non-breaking spaces (&nbsp;) to ignore. We want
        # the text representing the current page index only.
        char_list = list(string_item)
        for char in char_list:
            if char.isdigit():
                page_index = char
                break  # Now that we have the current page index, we can
                       # back out of this loop.
    # Below is where the code breaks; it cannot find page_index since it is
    # not getting to the site for scraping after redirection.
    try:
        print('Current page index: ', page_index)
        # To get to the next page, we submit a form request since it is all
        # set up with javascript instead of simply giving a URL to follow.
        # The event target has 'dgTournament' information where the first
        # piece is always '_ctl1' and the second is '_ctl' followed by the
        # page index number we want to go to minus one (so if we want to go
        # to the 8th page, it is '_ctl7').
        # Thus we can just plug in the current page index, which is equal to
        # the next one we want to hit minus one.
        # Here is how I am making the requests; they work until the (302)
        # redirection...
        form_data = {"__EVENTTARGET": "dgTournaments:_ctl1:_ctl" + page_index,
                     "__EVENTARGUMENT": ";;AjaxControlToolkit, Version=3.5.50731.0, Culture=neutral, PublicKeyToken=28f01b0e84b6d53e:en-US:ec0bb675-3ec6-4135-8b02-a5c5783f45f5:de1feab2:f9cec9bc:35576c48"}
        yield FormRequest(current_LEVEL, formdata=form_data, method="POST", callback=self.parse, priority=2)
Alternatively, a solution may be to follow pagination in a different way, instead of making all of these requests?
The original link is
https://m.tennislink.usta.com/TournamentSearch/searchresults.aspx?typeofsubmit=&action=2&keywords=&tournamentid=&sectiondistrict=&city=&state=&zip=&month=0&startdate=&enddate=&day=&year=2019&division=G16&category=28&surface=&onlineentry=&drawssheets=&usertime=&sanctioned=-1&agegroup=Y&searchradius=-1
if anyone is able to help.
You don't have to follow the 302 redirects; instead you can do a POST request directly and receive the details of the page. The following code prints the data for the first 5 pages:
import requests
from bs4 import BeautifulSoup

url = 'https://m.tennislink.usta.com/TournamentSearch/searchresults.aspx'
pages = 5

for i in range(pages):
    params = {'year': '2019', 'division': 'G16', 'month': '0', 'searchradius': '-1'}
    payload = {'__EVENTTARGET': 'dgTournaments:_ctl1:_ctl' + str(i)}
    res = requests.post(url, params=params, data=payload)
    soup = BeautifulSoup(res.content, 'lxml')
    table = soup.find('table', id='ctl00_mainContent_dgTournaments')
    # pretty print the table contents
    for row in table.find_all('tr'):
        for column in row.find_all('td'):
            text = ', '.join(x.strip() for x in column.text.split('\n') if x.strip()).strip()
            print(text)
        print('-'*10)
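Since the question is a Scrapy spider, the same POST approach can also be expressed with FormRequest. A rough sketch, under the assumption that the year/division/month/searchradius parameters shown above are all the server needs:
import scrapy
from scrapy import FormRequest

class TournamentSpider(scrapy.Spider):
    name = 'tournaments'
    search_url = ('https://m.tennislink.usta.com/TournamentSearch/searchresults.aspx'
                  '?year=2019&division=G16&month=0&searchradius=-1')

    def start_requests(self):
        for i in range(5):  # first five pages, as in the requests example
            yield FormRequest(self.search_url,
                              formdata={'__EVENTTARGET': 'dgTournaments:_ctl1:_ctl' + str(i)},
                              callback=self.parse)

    def parse(self, response):
        for row in response.css('table#ctl00_mainContent_dgTournaments tr'):
            cells = [c.strip() for c in row.css('td ::text').getall() if c.strip()]
            if cells:
                yield {'row': cells}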

How to check if a web element is visible

I am using Python with BeautifulSoup4 and I need to retrieve visible links on the page. Given this code:
soup = BeautifulSoup(html)
links = soup('a')
I would like to create a method is_visible that checks whether or not a link is displayed on the page.
Solution Using Selenium
Since I am also working with Selenium, I know that the following solution exists:
from selenium.webdriver import Firefox

firefox = Firefox()
firefox.get('https://google.com')
links = firefox.find_elements_by_tag_name('a')

for link in links:
    if link.is_displayed():
        print('{} => Visible'.format(link.text))
    else:
        print('{} => Hidden'.format(link.text))

firefox.quit()
Performance Issue
Unfortunately, the is_displayed method and reading the text attribute each perform an HTTP request to retrieve that information. Things can therefore get really slow when there are many links on a page or when you have to do this multiple times.
On the other hand, BeautifulSoup can perform these parsing operations almost instantly once you have the page source. But I can't figure out how to do this.
AFAIK, BeautifulSoup will only help you parse the actual markup of the HTML document anyway. If that's all you need, then you can do it in a manner like so (yes, I already know it's not perfect):
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc)

def is_visible_1(link):
    # do whatever in this function you can to determine your markup is correct
    try:
        style = link.get('style')
        if 'display' in style and 'none' in style:  # or use a regular expression
            return False
    except Exception:
        return False
    return True

def is_visible_2(**kwargs):
    try:
        soup = kwargs.get('soup', None)
        del kwargs['soup']
        # Exception thrown if element can't be found using kwargs
        link = soup.find_all(**kwargs)[0]
        style = link.get('style')
        if 'display' in style and 'none' in style:  # or use a regular expression
            return False
    except Exception:
        return False
    return True

# checks links that already exist, not *if* they exist
for link in soup.find_all('a'):
    print(str(is_visible_1(link)))

# checks if an element exists
print(str(is_visible_2(soup=soup, id='someID')))
BeautifulSoup doesn't take into account the other factors that determine whether an element is visible or not, such as CSS, scripts, and dynamic DOM changes. Selenium, on the other hand, does tell you whether an element is actually being rendered, and generally does so through accessibility APIs in the given browser. You must decide whether sacrificing accuracy for speed is worth it. Good luck! :-)
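One way to combine the two, given the performance issue described in the question, is to let Selenium render the page once and then hand the finished source to BeautifulSoup. A rough sketch reusing the is_visible_1 helper above (it still only catches inline style-based hiding):
from bs4 import BeautifulSoup
from selenium.webdriver import Firefox

firefox = Firefox()
firefox.get('https://google.com')
# a single round-trip: grab the rendered HTML, then parse it locally
soup = BeautifulSoup(firefox.page_source, 'html.parser')
firefox.quit()

for link in soup.find_all('a'):
    state = 'Visible' if is_visible_1(link) else 'Hidden'
    print('{} => {}'.format(link.get_text(), state))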
Try find_elements_by_xpath and execute_script:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.google.com/?hl=en")
links = driver.find_elements_by_xpath('//a')

driver.execute_script('''
    var links = document.querySelectorAll('a');
    links.forEach(function(a) {
        a.addEventListener("click", function(event) {
            event.preventDefault();
        });
    });
''')

visible = []
hidden = []

for link in links:
    try:
        link.click()
        visible.append('{} => Visible'.format(link.text))
    except:
        hidden.append('{} => Hidden'.format(link.get_attribute('textContent')))
    #time.sleep(0.1)

print('\n'.join(visible))
print('===============================')
print('\n'.join(hidden))
print('===============================\nTotal links length: %s' % len(links))

driver.execute_script('alert("Finish")')
