Scrapy infinite scrolling - no pagination indication - python

I am new to web scraping and I encountered some issues when I was trying to scrape a website with infinite scroll. I looked at some other questions but I could not find the answer, so I hope someone could help me out here.
I am working on the website http://www.aastocks.com/tc/stocks/analysis/stock-aafn/00001/0/all/. I have the following (very basic) piece of code so far, with which I can get every article on the first page (20 entries).
from urllib.parse import urljoin

import scrapy

def parse(self, response):
    # collect all article links
    news = response.xpath("//div[starts-with(@class,'newshead4')]//a/@href").extract()
    # visit each news link and gather news info
    for n in news:
        url = urljoin(response.url, n)
        yield scrapy.Request(url, callback=self.parse_news)
However, I could not figure out how to go to the next page. I read some tutorials online that suggest going to Inspect -> Network and observing the request URL after scrolling. It returned http://www.aastocks.com/tc/resources/datafeed/getmorenews.ashx?cat=all&newstime=905169272&newsid=NOW.895783&period=0&key=&symbol=00001, where I could not find any indication of pagination or another pattern that would help me go to the next page. When I copy this link into a new tab, I see a JSON document with the news of the next page, but without a URL attached to it. In this case, how could I fix it? Many thanks!

The link
http://www.aastocks.com/tc/resources/datafeed/getmorenews.ashx?cat=all&newstime=905169272&newsid=NOW.895783&period=0&key=&symbol=00001
gives JSON data with values like NOW.XXXXXX, which you can use to generate links to the news articles:
"http://www.aastocks.com/tc/stocks/analysis/stock-aafn-con/00001/" + "NOW.XXXXXX" + "/all"
If you scroll down a few times, you will see that the next pages generate similar links, but with different newstime and newsid parameters.
If you check the JSON data, you will see that the last item has the values 'dtd' and 'id', which are the same as the newstime and newsid parameters in the link used to download the JSON data for the next page.
So you can generate the link to get the JSON data for the next page(s):
"http://www.aastocks.com/tc/resources/datafeed/getmorenews.ashx?cat=all&newstime=" + DTD + "&newsid=" + ID + "&period=0&key=&symbol=00001"
Working example with requests
import requests

newstime = '934735827'
newsid = 'HKEX-EPS-20190815-003587368'

url = 'http://www.aastocks.com/tc/resources/datafeed/getmorenews.ashx?cat=all&newstime={}&newsid={}&period=0&key=&symbol=00001'
url_article = "http://www.aastocks.com/tc/stocks/analysis/stock-aafn-con/00001/{}/all"

for x in range(5):
    print('---', x, '----')
    print('data:', url.format(newstime, newsid))

    # get JSON data
    r = requests.get(url.format(newstime, newsid))
    data = r.json()

    #for item in data[:3]:  # test only a few links
    for item in data[:-1]:  # skip the last item, which is used to get the next page
        # test links to articles
        r = requests.get(url_article.format(item['id']))
        print('news:', r.status_code, url_article.format(item['id']))

    # get parameters for the next page
    newstime = data[-1]['dtd']
    newsid = data[-1]['id']
    print('next page:', newstime, newsid)
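If you would rather keep this inside Scrapy than switch to requests, the page-chaining logic can be isolated in a small helper and reused from a spider callback. This is only a sketch: the URL templates and the 'dtd'/'id' fields are the ones described above, while the helper name and the toy data are made up.

```python
FEED_URL = ('http://www.aastocks.com/tc/resources/datafeed/getmorenews.ashx'
            '?cat=all&newstime={}&newsid={}&period=0&key=&symbol=00001')
ARTICLE_URL = 'http://www.aastocks.com/tc/stocks/analysis/stock-aafn-con/00001/{}/all'

def split_feed_page(data):
    """Split one decoded JSON page into (article_urls, next_feed_url).

    Every item except the last is an article; the last item's 'dtd' and
    'id' fields are the newstime/newsid parameters of the next JSON page.
    """
    articles = [ARTICLE_URL.format(item['id']) for item in data[:-1]]
    tail = data[-1]
    return articles, FEED_URL.format(tail['dtd'], tail['id'])

# toy JSON page shaped like the real feed
page = [{'id': 'NOW.895783'}, {'id': 'NOW.895784'},
        {'id': 'NOW.895785', 'dtd': '905169272'}]
articles, next_url = split_feed_page(page)
print(articles)
print(next_url)
```

In a spider's callback you would then yield scrapy.Request(u, callback=self.parse_news) for each article URL and scrapy.Request(next_url, callback=self.parse_feed) to keep paging.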

Related

How to follow 302 redirects while still getting page information when scraping using Scrapy?

I've been wrestling with trying to get around this 302 redirection. First of all, the point of this particular part of my scraper is to get the next page index so I can flip through the pages. Direct URLs aren't available for this site, so I can't just move on to the next one; in order to continue scraping the actual data using a parse_details function, I have to go through each page and simulate requests.
This is all pretty new to me, so I made sure to try anything I could find first. I have tried various settings ("REDIRECT_ENABLED": False, altering handle_httpstatus_list, etc.), but none of them got me through this. Currently I'm trying to follow the location header of the redirection, but this isn't working either.
Here is an example of one of the potential solutions I've tried following.
try:
    print('Current page index: ', page_index)
except:  # thrown if page_index wasn't found due to redirection
    if response.status in (302,) and 'Location' in response.headers:
        location = to_native_str(response.headers['location'].decode('latin1'))
        yield scrapy.Request(response.urljoin(location), method='POST', callback=self.parse)
The code, without the details parsing and such, is as follows:
def parse(self, response):
    table = response.css('td > a::attr(href)').extract()
    additional_page = response.css('span.page_list::text').extract()
    for string_item in additional_page:
        # The text has some non-breaking spaces (&nbsp;) to ignore. We want
        # the text representing the current page index only.
        char_list = list(string_item)
        for char in char_list:
            if char.isdigit():
                page_index = char
                break  # Now that we have the current page index, we can back out of this loop.

    # Below is where the code breaks; it cannot find page_index since it is
    # not getting to the site for scraping after redirection.
    try:
        print('Current page index: ', page_index)
        # To get to the next page, we submit a form request since it is all
        # set up with javascript instead of simply giving a URL to follow.
        # The event target has 'dgTournament' information where the first
        # piece is always '_ctl1' and the second is '_ctl' followed by the
        # index of the page we want to go to minus one (so to go to the 8th
        # page, it's '_ctl7'). Thus we can just plug in the current page
        # index, which is equal to the page we want to hit next minus one.
        # Here is how I am making the requests; they work until the (302)
        # redirection...
        form_data = {"__EVENTTARGET": "dgTournaments:_ctl1:_ctl" + page_index,
                     "__EVENTARGUMENT": ";;AjaxControlToolkit, Version=3.5.50731.0, Culture=neutral, PublicKeyToken=28f01b0e84b6d53e:en-US:ec0bb675-3ec6-4135-8b02-a5c5783f45f5:de1feab2:f9cec9bc:35576c48"}
        yield FormRequest(current_LEVEL, formdata=form_data, method="POST", callback=self.parse, priority=2)
Alternatively, a solution may be to follow pagination in a different way, instead of making all of these requests?
The original link is
https://m.tennislink.usta.com/TournamentSearch/searchresults.aspx?typeofsubmit=&action=2&keywords=&tournamentid=&sectiondistrict=&city=&state=&zip=&month=0&startdate=&enddate=&day=&year=2019&division=G16&category=28&surface=&onlineentry=&drawssheets=&usertime=&sanctioned=-1&agegroup=Y&searchradius=-1
in case anyone is able to help.
You don't have to follow the 302 redirects; instead you can do a POST request and receive the details of the page directly. The following code prints the data from the first 5 pages:
import requests
from bs4 import BeautifulSoup

url = 'https://m.tennislink.usta.com/TournamentSearch/searchresults.aspx'
pages = 5

for i in range(pages):
    params = {'year': '2019', 'division': 'G16', 'month': '0', 'searchradius': '-1'}
    payload = {'__EVENTTARGET': 'dgTournaments:_ctl1:_ctl' + str(i)}

    res = requests.post(url, params=params, data=payload)
    soup = BeautifulSoup(res.content, 'lxml')
    table = soup.find('table', id='ctl00_mainContent_dgTournaments')

    # pretty-print the table contents
    for row in table.find_all('tr'):
        for column in row.find_all('td'):
            text = ', '.join(x.strip() for x in column.text.split('\n') if x.strip()).strip()
            print(text)
        print('-' * 10)

How to get all links in a directory separated by pagination

The structure of my website is:
article manager
    article1link  # on 1st page
    article2link  # on 1st page
    article3link  # on 2nd page
    article4link  # on 2nd page
    article5link  # on 3rd page
    article6link  # on 3rd page
I am using the requests and BeautifulSoup modules to get the links of article[1-6]link with:
from bs4 import BeautifulSoup
article_manager_soup = BeautifulSoup(article_manager_page,'lxml')
article_list = [article_manager_url + '/' + node.get('href') for node in article_manager_soup.find_all('a') if node.get('href')]
But the problem is that I'm able to get only the first two article links, i.e. article1link and article2link on the first page. When I click on the next page, no parameters change, so I cannot use the URL as a reference. How can I get the remaining links?
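No answer was recorded here, but the usual situation when the URL does not change between pages is that the "next page" click fires a POST (or XHR) request whose form fields can be copied from the browser's Network tab. A hedged sketch of replaying that request: the field name 'page' and the URLs are hypothetical stand-ins, while the link-extraction helper matches the list comprehension above.

```python
import requests
from bs4 import BeautifulSoup

def extract_article_links(html, base_url):
    # collect every href on the page, joined onto the article-manager URL
    soup = BeautifulSoup(html, 'html.parser')
    return [base_url + '/' + a['href'] for a in soup.find_all('a', href=True)]

def collect_all_links(article_manager_url, n_pages):
    # 'page' is a made-up field name: copy the real one from the POST
    # request the browser sends when you click "next page"
    session = requests.Session()
    links = []
    for page in range(1, n_pages + 1):
        res = session.post(article_manager_url, data={'page': page})
        links += extract_article_links(res.text, article_manager_url)
    return links
```

If the Network tab shows a GET with a hidden query parameter instead, the same loop applies with session.get and params=.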

Python Requests / Sessions using generated cookie

I'm trying to create a bot that checks for open classes; the webpage uses a cookie that is set when visiting the site. However, I can't seem to replicate this using requests/sessions in my code.
What it's supposed to do:
visit link 1 (creates the cookie) (search page)
visit link 2, which includes the search terms in the URL (search results)
When done in a browser, link 2 shows the search results.
Issue:
I can create the cookie by visiting link 1,
but I can't use it with link 2, which includes the search terms;
this results in loading the same first link (the search page).
Here is some sample code I have tried:
import requests
from bs4 import BeautifulSoup

s = requests.Session()
# create the cookie using the first link
r = s.get(url)
# r2 should be the search results
r2 = s.post(urlWithSearchTerms, cookies=r.cookies)
# parse html etc.; however, this loads the wrong page
data = r2.text
soup = BeautifulSoup(data, "html.parser")
print(soup.prettify())
Instead of loading the search results, it still loads the first page.
I also tried including r.headers, using sessions.post(url), going without sessions, etc.
How would I get python to load the second page?
Thanks!
You are sending an HTTP POST request where you should be sending a GET.
Change this line:
r2 = s.post(urlWithSearchTerms, cookies=r.cookies)
to:
r2 = s.get(urlWithSearchTerms, cookies=r.cookies)
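Worth noting: a requests.Session already persists cookies across calls, so the cookies=r.cookies argument is redundant once the first s.get(url) has run. A small sketch (the cookie name and URL are made up) showing that the session's jar is merged into later requests automatically:

```python
import requests

s = requests.Session()
# simulate the cookie the search page would set on the first s.get(url)
s.cookies.set('sessionid', 'abc123')

# any later request prepared on the same session carries the jar
prepared = s.prepare_request(requests.Request('GET', 'http://example.com/search'))
print(prepared.headers.get('Cookie'))
```

So after the fix, plain s.get(urlWithSearchTerms) is enough.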

Can't crawl more than a few items per page

I'm new to scrapy and have tried to crawl a couple of sites, but wasn't able to get more than a few images from them.
For example, for http://shop.nordstrom.com/c/womens-dresses-new with the following code -
def parse(self, response):
    for dress in response.css('article.npr-product-module'):
        yield {
            'src': dress.css('img.product-photo').xpath('@src').extract_first(),
            'url': dress.css('a.product-photo-href').xpath('@href').extract_first()
        }
I got 6 products. I expect 66.
For URL https://www.renttherunway.com/products/dress with the following code -
def parse(self, response):
    for dress in response.css('div.cycle-image-0'):
        yield {
            'image-url': dress.xpath('.//img/@src').extract_first(),
        }
I got 12. I expect roughly 100.
Even when I changed it to crawl every 'next' page, I got the same number per page, though it did go through all the pages successfully.
I have tried a different USER_AGENT, disabling COOKIES, and a DOWNLOAD_DELAY of 5.
I imagine I would run into the same problem on any site, so folks must have seen this before, but I can't find a reference to it.
What am I missing?
It's one of those weird websites where they store the product data as JSON in the HTML source and unpack it with javascript on page load.
To figure this out, what you usually want to do is:
disable javascript and do scrapy view <url>
investigate the results
find the id in the product url and search for that id in the page source to check whether it exists and, if so, where it is hidden. If it doesn't exist, that means it's being populated by an AJAX request -> re-enable javascript, go to the page and dig through the browser inspector's network tab to find it.
If you do a regex-based search:
re.findall(r"ProductResults, (\{.+\})\)", response.body_as_unicode())
you'll get a huge JSON blob that contains all the products and their information.
import json
import re

data = re.findall(r"ProductResults, (\{.+\})\)", response.body_as_unicode())
data = json.loads(data[0])['data']
print(len(data['ProductResult']['Products']))
# 66
That gets the correct number of products!
So in your parse you can do this:
def parse(self, response):
    data = re.findall(r"ProductResults, (\{.+\})\)", response.body_as_unicode())
    data = json.loads(data[0])['data']
    for product in data['ProductResult']['Products']:
        # find the main image
        image_url = [m['Url'] for m in product['Media'] if m['Type'] == 'MainImage']
        yield {'image_url': image_url}

Scraping threads that are hundreds of pages deep w/ BeautifulSoup

Python and BeautifulSoup newbie here.
I am trying to scrape a forum that has about 500 pages, each of which contains 50 individual threads. Some of these threads contain about 200 pages worth of posts.
I would like to write a program that can scrape the relevant parts of the whole forum in an automated fashion, having been fed a single URL as an entry point:
page_list = ['http://forum.doctissimo.fr/sante/diabete/liste_sujet-1.htm']
While I have no problem extracting the 'next link' for both the individual threads and the pages containing the threads... :
def getNext_link(soup0bj):
    # extracts a page's next link from the BSoup object
    try:
        next_link = []
        soup0bj = (soup0bj)
        for link in soup0bj.find_all('link', {'rel': 'next'}):
            if link.attrs['href'] not in next_link:
                next_link.append(link.attrs['href'])
        return next_link
...I'm stuck with a program that takes the seeded URL and extracts content only from the first page of each thread it hosts. The program then ends:
for page in page_list:
    if page != None:
        html = getHTMLsoup(page)
        print(getNext_link(html))
        page_list.append(getNext_link(html))
        print(page_list)
        for thread in getThreadURLs(html):
            if thread != None:
                html = getHTMLsoup(thread)
                print('\n'.join(getHandles(html)))
                print('\n'.join(getTime_stamps(html)))
                print('\n', getNext_link(html))
                print('\n'.join(getPost_contents(html)), '\n')
I've tried appending the 'next link' to page_list, but that hasn't worked, as urlopen then tries to access a list rather than a string. I've also tried this:
for page in itertools.chain(page_list):
...but the programme throws this error:
AttributeError: 'list' object has no attribute 'timeout'
I'm really stuck. Any and all help would be most welcome!
I solved this myself, so I'm posting the answer, just in case someone else might benefit.
So, the problem was that urlopen could not open a URL found in a list within a list.
In my case, each forum page had at most one relevant internal link. Rather than asking my getNext_link function to return a list containing the internal link, as seen here (note the empty list next_link)...
def getNext_link(soup0bj):
    # extracts a page's next link (if available)
    try:
        soup0bj = (soup0bj)
        next_link = []
        if len(soup0bj.find_all('link', {'rel': 'next'})) != 0:
            for link in soup0bj.find_all('link', {'rel': 'next'}):
                next_link.append(link.attrs['href'])
        return next_link
I asked it to return the URL as a string, as seen here:
def getNext_link(soup0bj):
    try:
        soup0bj = (soup0bj)
        if len(soup0bj.find_all('link', {'rel': 'next'})) != 0:
            for link in soup0bj.find_all('link', {'rel': 'next'}):
                next_link = link.attrs['href']
        return next_link
As the variable next_link is simply a string, it can easily be added to a list that is being iterated over (see my post above for details). Voilà!
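The crawl loop can then append the string directly to the list it is iterating over. A minimal sketch of that pattern, with a toy page chain standing in for getNext_link(getHTMLsoup(url)):

```python
def follow_next_links(start, fetch_next):
    """Walk a chain of rel=next links; fetch_next returns the next URL or None."""
    pages = [start]
    for page in pages:  # appending while iterating extends the walk
        nxt = fetch_next(page)
        if nxt is not None and nxt not in pages:
            pages.append(nxt)
    return pages

# toy three-page chain standing in for the real fetch-and-parse step
chain = {'page1.htm': 'page2.htm', 'page2.htm': 'page3.htm', 'page3.htm': None}
print(follow_next_links('page1.htm', chain.get))
```

The `not in pages` guard also protects against a site whose last page links back to itself.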
