How to get all links in a directory separated by pagination - python

The structure of my website is:
article manager
    article1link    # on 1st page
    article2link    # on 1st page
    article3link    # on 2nd page
    article4link    # on 2nd page
    article5link    # on 3rd page
    article6link    # on 3rd page
I am using the requests and BeautifulSoup modules to get the links of article[1-6]link with:
import requests
from bs4 import BeautifulSoup

article_manager_page = requests.get(article_manager_url).text
article_manager_soup = BeautifulSoup(article_manager_page, 'lxml')
article_list = [article_manager_url + '/' + node.get('href') for node in article_manager_soup.find_all('a') if node.get('href')]
But the problem is that I'm able to get only the first two article links, i.e. article1link and article2link, which are on the first page. When I click on the next page, no parameters change in the URL, so I cannot use the URL as a reference. How can I get the remaining links?
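One hedged way forward, since the URL never changes, is to drive a real browser and click through the pages, re-parsing the HTML after each click. This is only a sketch: the "Next" link locator below is hypothetical and has to be replaced with whatever control the article manager actually uses (the related answers below show two other options, a hidden page query parameter and replaying the underlying request):
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()
driver.get(article_manager_url)  # article_manager_url as in the question
article_list = []
while True:
    soup = BeautifulSoup(driver.page_source, 'lxml')
    article_list += [article_manager_url + '/' + a.get('href')
                     for a in soup.find_all('a') if a.get('href')]
    next_buttons = driver.find_elements(By.LINK_TEXT, 'Next')  # hypothetical locator
    if not next_buttons:
        break
    next_buttons[0].click()
    time.sleep(2)  # crude wait for the next page's content to load
driver.quit()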

Related

How to scrape a page that is dynamically loaded?

So here's my problem. I wrote a program that is perfectly able to get all of the information I want on the first page that I load. But when I click on the nextPage button, it runs a script that loads the next batch of products without actually moving to another page.
So when I run the next loop iteration, all that happens is that I get the same content as the first one, even though the content in the browser I'm emulating is different.
This is the code I run:
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()
driver.get("https://www.my-website.com/search/results-34y1i")
time.sleep(2)
soup = BeautifulSoup(driver.page_source, 'html.parser')

# /////////// code to find total number of pages

currentPage = 0
button_NextPage = driver.find_element(By.ID, 'nextButton')
while currentPage != totalPages:
    # ///////// code to find the products
    currentPage += 1
    button_NextPage = driver.find_element(By.ID, 'nextButton')
    button_NextPage.click()
    time.sleep(5)
Is there any way for me to scrape exactly what's loaded on my browser?
The issue seems to be that you're only fetching page 1, as shown in this line:
driver.get("https://www.tcgplayer.com/search/magic/commander-streets-of-new-capenna?productLineName=magic&setName=commander-streets-of-new-capenna&page=1&view=grid")
As you can see, there is a query parameter called page in the URL that determines which page's HTML you are fetching. So every time you loop to a new page, you have to fetch the new HTML content with the driver by changing the page query parameter. For example, in your loop it would be something like this:
driver.get("https://www.tcgplayer.com/search/magic/commander-streets-of-new-capenna?productLineName=magic&setName=commander-streets-of-new-capenna&page={page}&view=grid".format(page = currentPage))
After you fetch the new HTML you'll be able to access the elements present on each of the different pages, as shown in the sketch below.
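A minimal sketch of that loop, assuming totalPages has already been determined as in the question and that the products are re-parsed from driver.page_source on every iteration (the URL is the one from this answer; swap in your own):
from selenium import webdriver
from bs4 import BeautifulSoup
import time

base_url = ("https://www.tcgplayer.com/search/magic/commander-streets-of-new-capenna"
            "?productLineName=magic&setName=commander-streets-of-new-capenna"
            "&page={page}&view=grid")

driver = webdriver.Chrome()
for page in range(1, totalPages + 1):  # totalPages found earlier, as in the question
    driver.get(base_url.format(page=page))
    time.sleep(2)  # crude wait; an explicit WebDriverWait would be more robust
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # ... find the products in `soup` here ...
driver.quit()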

Click Multiple links and get their url

I need to get the detail page urls of the match links on this webpage: https://www.sportybet.com/ke/sport/football/today
What I want:
I want to click on Man city vs PSG, copy the detail match URL and print it, then do the same for the next match, Holstein Kiel vs SV Sandhausen, and likewise for all the matches on the web page.
I have this Selenium code below for just one match:
driver.find_element_by_xpath('//*[@id="importMatch"]/div[2]/div/div[3]/div[2]/div[3]').click()
get_url = driver.current_url
print(get_url)
I need help to get all the match urls with a loop or any better suggestions.
If I understand correctly what you are asking for, you should do the following:
import time

links = driver.find_elements_by_xpath("//div[@class='match-league']//div[contains(@class,'market-size')]")
for link in links:
    link.click()
    time.sleep(1)
    url = driver.current_url
    print(url)
    driver.execute_script("window.history.go(-1)")
    time.sleep(2)
On the first line you collect the elements you can click to expand.
Then the for loop iterates through all of these elements, clicks each one, gets the URL, and goes back to the main page.
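One caveat to hedge: after window.history.go(-1) the elements found before the click may go stale, so a slightly more defensive version of the same idea re-locates them on every iteration (same XPath as above; it may still need adjusting to the live page):
import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.sportybet.com/ke/sport/football/today")
time.sleep(2)  # let the match list load
xpath = "//div[@class='match-league']//div[contains(@class,'market-size')]"

count = len(driver.find_elements_by_xpath(xpath))
for i in range(count):
    # Re-locate the elements each time; the previous references are stale after going back.
    driver.find_elements_by_xpath(xpath)[i].click()
    time.sleep(1)
    print(driver.current_url)
    driver.execute_script("window.history.go(-1)")
    time.sleep(2)
driver.quit()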

Scrapy infinite scrolling - no pagination indication

I am new to web scraping and I encountered some issues when trying to scrape a website with infinite scroll. I looked at some other questions but I could not find the answer, so I hope someone can help me out here.
I am working on the website http://www.aastocks.com/tc/stocks/analysis/stock-aafn/00001/0/all/. I have the following (very basic) piece of code so far, with which I can get every article on the first page (20 entries).
def parse(self, response):
    # collect all article links
    news = response.xpath("//div[starts-with(@class,'newshead4')]//a//text()").extract()

    # visit each news link and gather news info
    for n in news:
        url = urljoin(response.url, n)
        yield scrapy.Request(url, callback=self.parse_news)
However, I could not figure out how to go to the next page. I read some tutorials online, such as going to Inspect -> Network and observing the Request URL after scrolling. It returned http://www.aastocks.com/tc/resources/datafeed/getmorenews.ashx?cat=all&newstime=905169272&newsid=NOW.895783&period=0&key=&symbol=00001, where I could not find an indication of pagination or another pattern to help me go to the next page. When I copy this link into a new tab, I see a JSON document with the news of the next page, but without a URL with it. In this case, how could I fix it? Many thanks!
The link
http://www.aastocks.com/tc/resources/datafeed/getmorenews.ashx?cat=all&newstime=905169272&newsid=NOW.895783&period=0&key=&symbol=00001
gives JSON data with values like NOW.XXXXXX, which you can use to generate links to the news articles:
"http://www.aastocks.com/tc/stocks/analysis/stock-aafn-con/00001/" + "NOW.XXXXXX" + "/all"
If you scroll down a few times, you will see that the next pages generate similar links but with different newstime and newsid parameters.
If you check the JSON data, you will see that the last item has the values 'dtd' and 'id', which are the same as the newstime and newsid parameters in the link used to download the JSON data for the next page.
So you can generate the link to get the JSON data for the next page(s):
"http://www.aastocks.com/tc/resources/datafeed/getmorenews.ashx?cat=all&newstime=" + DTD + "&newsid=" + ID + "&period=0&key=&symbol=00001"
Working example with requests
import requests
newstime = '934735827'
newsid = 'HKEX-EPS-20190815-003587368'
url = 'http://www.aastocks.com/tc/resources/datafeed/getmorenews.ashx?cat=all&newstime={}&newsid={}&period=0&key=&symbol=00001'
url_article = "http://www.aastocks.com/tc/stocks/analysis/stock-aafn-con/00001/{}/all"
for x in range(5):
    print('---', x, '----')
    print('data:', url.format(newstime, newsid))

    # get JSON data
    r = requests.get(url.format(newstime, newsid))
    data = r.json()

    #for item in data[:3]: # test only few links
    for item in data[:-1]: # skip last link which gets next page
        # test links to articles
        r = requests.get(url_article.format(item['id']))
        print('news:', r.status_code, url_article.format(item['id']))

    # get data for next page
    newstime = data[-1]['dtd']
    newsid = data[-1]['id']
    print('next page:', newstime, newsid)
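Since the question itself uses Scrapy, roughly the same idea can be expressed as a spider; this is only a sketch under the same assumptions as the requests example (the JSON layout, the dtd/id fields, and a parse_news callback like the one in the question):
import json
import scrapy

class AastocksNewsSpider(scrapy.Spider):
    name = 'aastocks_news'
    feed_url = ('http://www.aastocks.com/tc/resources/datafeed/getmorenews.ashx'
                '?cat=all&newstime={}&newsid={}&period=0&key=&symbol=00001')
    article_url = 'http://www.aastocks.com/tc/stocks/analysis/stock-aafn-con/00001/{}/all'
    start_urls = [feed_url.format('934735827', 'HKEX-EPS-20190815-003587368')]

    def parse(self, response):
        data = json.loads(response.text)
        # every item except the last one is a news entry
        for item in data[:-1]:
            yield scrapy.Request(self.article_url.format(item['id']),
                                 callback=self.parse_news)
        # the last item carries the pointer to the next chunk of the infinite scroll
        yield scrapy.Request(self.feed_url.format(data[-1]['dtd'], data[-1]['id']),
                             callback=self.parse)

    def parse_news(self, response):
        # placeholder; extract whatever article fields you need here
        yield {'url': response.url}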

How to follow 302 redirects while still getting page information when scraping using Scrapy?

Been wrestling with trying to get around this 302 redirection. First of all, the point of this particular part of my scraper is to get the next page index so I can flip through pages. Direct URLs aren't available for this site, so I can't just move on to the next page; in order to continue scraping the actual data using a parse_details function, I have to go through each page and simulate requests.
This is all pretty new to me, so I made sure to try anything I could find first. I have tried various settings ("REDIRECT_ENABLED":False, altering handle_httpstatus_list, etc.) but none are getting me through this. Currently I'm trying to follow the location of the redirection, but this isn't working either.
Here is an example of one of the potential solutions I've tried following.
try:
    print('Current page index: ', page_index)
except:  # Will be thrown if page_index wasn't found due to redirection.
    if response.status in (302,) and 'Location' in response.headers:
        location = to_native_str(response.headers['location'].decode('latin1'))
        yield scrapy.Request(response.urljoin(location), method='POST', callback=self.parse)
The code, without the details parsing and such, is as follows:
def parse(self, response):
    table = response.css('td > a::attr(href)').extract()
    additional_page = response.css('span.page_list::text').extract()

    for string_item in additional_page:
        # The text has some non-breaking spaces (&nbsp;) to ignore. We want the
        # text representing the current page index only.
        char_list = list(string_item)
        for char in char_list:
            if char.isdigit():
                page_index = char
                break  # Now that we have the current page index, we can back out of this loop.

    # Below is where the code breaks; it cannot find page_index since it is
    # not getting to the site for scraping after redirection.
    try:
        print('Current page index: ', page_index)

        # To get to the next page, we submit a form request since it is all
        # set up with javascript instead of simply giving a URL to follow.
        # The event target has 'dgTournament' information where the first
        # piece is always '_ctl1' and the second is '_ctl' followed by
        # the page index number we want to go to minus one (so if we want
        # to go to the 8th page, it's '_ctl7').
        # Thus we can just plug in the current page index, which is equal to
        # the page we want to hit next minus one.
        # Here is how I am making the requests; they work until the (302)
        # redirection...
        form_data = {"__EVENTTARGET": "dgTournaments:_ctl1:_ctl" + page_index,
                     "__EVENTARGUMENT": {";;AjaxControlToolkit, Version=3.5.50731.0, Culture=neutral, PublicKeyToken=28f01b0e84b6d53e:en-US:ec0bb675-3ec6-4135-8b02-a5c5783f45f5:de1feab2:f9cec9bc:35576c48"}}
        yield FormRequest(current_LEVEL, formdata=form_data, method="POST", callback=self.parse, priority=2)
Alternatively, a solution may be to follow pagination in a different way, instead of making all of these requests?
The original link is
https://m.tennislink.usta.com/TournamentSearch/searchresults.aspx?typeofsubmit=&action=2&keywords=&tournamentid=&sectiondistrict=&city=&state=&zip=&month=0&startdate=&enddate=&day=&year=2019&division=G16&category=28&surface=&onlineentry=&drawssheets=&usertime=&sanctioned=-1&agegroup=Y&searchradius=-1
if anyone is able to help.
You don't have to follow the 302 redirects; instead, you can do a POST request and receive the details of the page. The following code prints the data on the first 5 pages:
import requests
from bs4 import BeautifulSoup
url = 'https://m.tennislink.usta.com/TournamentSearch/searchresults.aspx'
pages=5
for i in range(pages):
    params = {'year': '2019', 'division': 'G16', 'month': '0', 'searchradius': '-1'}
    payload = {'__EVENTTARGET': 'dgTournaments:_ctl1:_ctl' + str(i)}
    res = requests.post(url, params=params, data=payload)
    soup = BeautifulSoup(res.content, 'lxml')
    table = soup.find('table', id='ctl00_mainContent_dgTournaments')

    # pretty print the table contents
    for row in table.find_all('tr'):
        for column in row.find_all('td'):
            text = ', '.join(x.strip() for x in column.text.split('\n') if x.strip()).strip()
            print(text)
        print('-' * 10)
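If you prefer to stay inside the Scrapy spider from the question instead of switching to requests, the same POST can be made with FormRequest; a hedged sketch (it assumes, as the requests example above suggests, that the server accepts the request with only __EVENTTARGET set and the search parameters in the query string):
import scrapy
from scrapy import FormRequest

class TournamentsSpider(scrapy.Spider):
    name = 'usta_tournaments'
    search_url = ('https://m.tennislink.usta.com/TournamentSearch/searchresults.aspx'
                  '?year=2019&division=G16&month=0&searchradius=-1')

    def start_requests(self):
        for i in range(5):  # first 5 pages, as in the requests example above
            yield FormRequest(self.search_url,
                              formdata={'__EVENTTARGET': 'dgTournaments:_ctl1:_ctl' + str(i)},
                              callback=self.parse)

    def parse(self, response):
        for row in response.css('table#ctl00_mainContent_dgTournaments tr'):
            yield {'cells': [c.strip() for c in row.css('td ::text').getall() if c.strip()]}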

BeautifulSoup returns empty span elements?

I'm trying to pull prices from Binance's home page and BeautifulSoup returns empty elements for me. Binance's home page is at https://www.binance.com/en/, and the interesting block I'm trying to get text from is:
<div class="sc-62mpio-0-sc-iAyFgw iQwJlO" color="#999"><span>"/" "$" "35.49"</span></div>
On Binance's home page is a table and one of the columns is titled "Last Price". Next to the last price is the last USD price in a faded gray color and I'm trying to pull every one of those. Here's my code so far.
import requests
from bs4 import BeautifulSoup

def grabPrices():
    page = requests.get("https://www.binance.com/en")
    soup = BeautifulSoup(page.text, "lxml")
    prices = soup.find_all("span", {"class": None})
    print(prices)
But the output is just a large array of "–" tags.
Selenium should be one way of scraping the table content you want from this Binance page. See Google for how to set Selenium up; it pretty much comes down to downloading a driver and placing it on your local disk (if you are a Chrome user, download chromedriver). Here is my code to access the content you are interested in:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from parsel import Selector  # Selector comes from parsel (also importable as scrapy.selector.Selector)
import time

driver = webdriver.Chrome(executable_path=r'C:\chromedriver\chromedriver.exe')
time.sleep(3)  # Allow time to launch the controlled browser
driver.get('https://www.binance.com/en/')
time.sleep(3)  # Allow time to load the page
sel = Selector(text=driver.page_source)
Table = sel.xpath('//*[@id="__next"]/div/main/div[4]/div/div[2]/div/div[2]/div/div[2]/div')
Table.extract()  # This basically gives you all the content of the table
Then you can further process the entire table content with something like:
tb_rows = Table.xpath('.//div/a//div//div//span/text()').extract()
tb_rows  # a flat list of the text of every span in the table
At this point the result is narrowed down to pretty much what you are interested in, but notice that the last price's two components (the number and the dollar price) are stored in two separate tags in the source page, so we can do the following to combine them and reach the destination:
for n in range(0, len(tb_rows), 2):
    LastPrice = tb_rows[n] + tb_rows[n + 1]
    print(LastPrice)  # Other than print, you could store each element in a list

driver.quit()  # don't forget to quit the driver at the end
The final output is the list of combined last prices, one line per table row.
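As a side note, if the goal is just the latest prices rather than the rendered page itself, Binance also exposes them through its public REST API, which avoids the browser entirely; a minimal sketch (the /api/v3/ticker/price endpoint is public, but narrowing it down to exactly the symbols shown on the home page is left as an assumption):
import requests

# Public endpoint returning the latest traded price for every symbol
resp = requests.get("https://api.binance.com/api/v3/ticker/price")
resp.raise_for_status()

for ticker in resp.json():
    if ticker["symbol"].endswith("USDT"):  # rough stand-in for the USD prices on the home page
        print(ticker["symbol"], ticker["price"])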
