Web scraping with varying page numbers - Python

So I'm trying to webscrape a bunch of profiles. Each profile has a collection of videos, and I'm trying to webscrape information about each video. The problem I'm running into is that each profile uploads a different number of videos, so the number of pages of videos per profile varies. For instance, one profile has 45 pages of videos, as you can see from the HTML below:
<div class="pagination "><ul><li><a class="active" href="">1</a></li><li>2</li><li>3</li><li>4</li><li>5</li><li>6</li><li>7</li><li>8</li><li>9</li><li>10</li><li>11</li><li class="no-page">...<li>45</li><li><a href="#1" class="no-page next-page"><span class="mobile-hide">Next</span>
While another profile has only 2 pages:
<div class="pagination "><ul><li><a class="active" href="">1</a></li><li>2</li><li><a href="#1" class="no-page next-page"><span class="mobile-hide">Next</span>
My question is: how do I account for the varying number of pages? I was thinking of making a for loop and just putting a large arbitrary number at the end, like
for i in range(0, 1000):
    new_url = 'url' + str(i)
where i accounts for the page, but I want to know if there's a more efficient way of doing this.
Thank you.

The "skeleton" of the loop can look like this:
import requests
from bs4 import BeautifulSoup

url = 'http://url/?page={page}'
page = 1

while True:
    soup = BeautifulSoup(requests.get(url.format(page=page)).content, 'html.parser')
    # ... scrape the videos on this page ...

    # do we have a next page?
    next_page = soup.select_one('.next-page')
    # no, so break from the loop
    if not next_page:
        break
    page += 1
You can have an infinite loop with while True: and break out of it only when there's no next page (i.e., when there is no class="next-page" tag on the last page).

Get the <li>...</li> elements of the <div class="pagination "><ul>
Exclude the "..." entry by its class <li class="no-page">
Parse the "href" values and build your next URL destinations.
Scrape every new URL destination (see the sketch after these steps).
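A minimal sketch of those steps with BeautifulSoup; the base URL and the ?page= query pattern are placeholders, not taken from the real site:
import requests
from bs4 import BeautifulSoup

base_url = 'http://url/videos'  # placeholder for the real profile URL

soup = BeautifulSoup(requests.get(base_url).content, 'html.parser')

# get the <li> entries inside <div class="pagination "><ul>
pagination_items = soup.select('div.pagination ul li')

# keep only the numeric entries, which drops the "..." and "Next" items
page_numbers = [li.get_text(strip=True) for li in pagination_items
                if li.get_text(strip=True).isdigit()]

# build a URL for every page and scrape each one
for page in range(1, max(int(n) for n in page_numbers) + 1):
    page_url = f'{base_url}?page={page}'  # assumed URL pattern
    page_soup = BeautifulSoup(requests.get(page_url).content, 'html.parser')
    # ... scrape the videos on this page ...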

I just want to thank everyone who took the time to answer my question. I figured out the answer (or at least what worked for me) and decided to share it in case it is helpful for anyone else.
import requests
from bs4 import BeautifulSoup

url = 'insert url'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
# look for the pagination class
page = soup.find(class_='pagination')
# create a list to hold all page numbers
href = []
# look for all 'li' tags as the users above suggested
links = page.find_all('li')
for link in links:
    href += [link.find('a', href=True).text]
'''
href now includes every page number plus the word "Next", so it looks
something like [1, 2, 3, ..., 44, Next]. The last page number is href[-2];
convert it to an int for the for loop, and add 1 because range stops at
i - 1 (iterating over range(0, 44) ends at 43, and we want to include page 44).
'''
for i in range(0, int(href[-2]) + 1):
    new_url = url + str(i)
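One small follow-up on that last loop: new_url is only built there, so each page still has to be requested and parsed before its videos can be scraped. A minimal sketch, assuming the page number really is just appended to the base URL:
for i in range(0, int(href[-2]) + 1):
    new_url = url + str(i)
    # fetch the page and parse it before extracting the video details
    page_soup = BeautifulSoup(requests.get(new_url).content, 'html.parser')
    # ... pull the video information for this page out of page_soup ...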

Related

python selenium search element by class name

I want to collect the detailed recommendation description paragraphs that a person received on his/her LinkedIn profile, such as this link:
https://www.linkedin.com/in/teddunning/details/recommendations/
(This link can be viewed after logging in to any LinkedIn account.)
Here is my best try:
for index, row in df2.iterrows():
    linkedin = row['LinkedIn Website']
    current_url = f'{linkedin}/details/recommendations/'
    driver.get(current_url)
    time.sleep(random.uniform(2, 3))
    descriptions = driver.find_elements_by_xpath("//*[@class='display-flex align-items-center t-14 t-normal t-black']")
    s = 0
    for description in descriptions:
        s += 1
        print(description.text)
        df2.loc[index, f'RecDescription_{str(s)}'] = description.text
The URLs I scraped into df2 are all similar to the example link above.
The code finds nothing for the "descriptions" variable.
My question is: which element should I use to find the detailed recommendation content under the "Received" tab? Thank you very much!
Well, you would first get the direct parent of the paragraphs. You can do that with XPath, a class, or an id, whatever fits best. After that you can call Your_Parent.find_elements(by=By.XPATH, value='./child::*') and loop over the result to get all paragraphs.
Edit
This selects all the paragraphs. I have not yet looked into separating them by post, but here is what I got so far:
parents_of_paragraphs = driver.find_elements(By.CSS_SELECTOR, "div.display-flex.align-items-center.t-14.t-normal.t-black")

text_total = ""
for element in parents_of_paragraphs:
    paragraph = element.find_element(by=By.XPATH, value='./child::*')
    text_total += f"{paragraph.text}\n"
print(text_total)

Scraping news articles using Selenium Python

I am learning to scrape news articles from the website https://tribune.com.pk/pakistan/archives. The first step is to scrape the link of every news article. The problem is that the markup for each article contains two <a> tags, each with an href, but I only want the first href, which I am unable to get.
I am attaching the html of that particular part
The code I have written returns me 2 href tags but I only want the first one
def Url_Extraction():
    category_name = driver.find_element(By.XPATH, '//*[@id="main-section"]/h1')
    cat = category_name.text  # Save category name in variable
    print(f"{cat}")
    news_articles = driver.find_elements(By.XPATH, "//div[contains(@class,'flex-wrap')]//a")
    for element in news_articles:
        URL = element.get_attribute('href')
        print(URL)
        Url.append(URL)
        Category.append(cat)
    current_time = time.time() - start_time
    print(f'{len(Url)} urls extracted')
    print(f'{len(Category)} categories extracted')
    print(f'Current Time: {current_time / 3600:.2f} hr, {current_time / 60:.2f} min, {current_time:.2f} sec',
          flush=True)
Moreover, I am able to paginate, but I can't get the full article by clicking the individual links given on the main page.
You have to modify the below XPath.
Instead of this -
news_articles = driver.find_elements(By.XPATH, "//div[contains(@class,'flex-wrap')]//a")
Use this -
news_articles = driver.find_elements(By.XPATH, "//div[contains(@class,'flex-wrap')]/a")
The single slash (/a) matches only the direct child anchor of the div, whereas the double slash (//a) also matches the anchor nested deeper inside it, which is why you were getting two hrefs.

Unable to list "all" class text from a webpage

I'm trying to list all nicknames from a specific forum thread (webpage):
url = "https://www.webpage.com"
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
username = doc.find('div', class_='userText')
userd = username.a.text
print(userd)
On the webpage:
<div class="userText">
Nickname1
</div>
<div class="userText">
Nickname2
</div>
etc
So I'm successfully isolating the "userText" name from the webpage.
The thing is, I'm only able to get the first nickname, while there are more than 150 on the page.
I tried
doc.find_all
instead of my
doc.find
But then I'm hit with:
You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
I'm unsure how to tackle this.
Fixed it with a loop and by putting the div inside a list:
username = doc.find_all(["div"], class_="userText")
for i in range(0, 150):
    print(username[i].a.text)
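As a side note, iterating over the result of find_all directly avoids hard-coding the 150 count; a small sketch that assumes the same doc soup object from the question:
usernames = doc.find_all('div', class_='userText')
for username in usernames:
    # guard against a userText div that has no <a> inside it
    if username.a:
        print(username.a.text)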

Scrapy infinite scrolling - no pagination indication

I am new to web scraping and I encountered some issues when I was trying to scrape a website with infinite scroll. I looked at some other questions but I could not find the answer, so I hope someone could help me out here.
I am working on the website http://www.aastocks.com/tc/stocks/analysis/stock-aafn/00001/0/all/. I have the following (very basic) piece of code so far, with which I can get every article on the first page (20 entries).
def parse(self, response):
    # collect all article links
    news = response.xpath("//div[starts-with(@class,'newshead4')]//a//text()").extract()
    # visit each news link and gather news info
    for n in news:
        url = urljoin(response.url, n)
        yield scrapy.Request(url, callback=self.parse_news)
However, I could not figure out how to go to the next page. I read some tutorials online, such as going to Inspect -> Network and observing the request URL after scrolling. It returned http://www.aastocks.com/tc/resources/datafeed/getmorenews.ashx?cat=all&newstime=905169272&newsid=NOW.895783&period=0&key=&symbol=00001, where I could not find any indication of pagination or other pattern to help me go to the next page. When I copy this link into a new tab, I see a JSON document with the news of the next page, but without a URL attached to it. In this case, how could I fix it? Many thanks!
The link
http://www.aastocks.com/tc/resources/datafeed/getmorenews.ashx?cat=all&newstime=905169272&newsid=NOW.895783&period=0&key=&symbol=00001
gives JSON data with values like NOW.XXXXXX, which you can use to generate links to the news articles:
"http://www.aastocks.com/tc/stocks/analysis/stock-aafn-con/00001/" + "NOW.XXXXXX" + "/all"
If you scroll down a few times, you will see that the next pages are fetched with similar links but with different newstime and newsid parameters.
If you check the JSON data, you will see that the last item has the values 'dtd' and 'id', which are the same as the newstime and newsid parameters in the link used to download the JSON data for the next page.
So you can generate the link to get the JSON data for the next page(s):
"http://www.aastocks.com/tc/resources/datafeed/getmorenews.ashx?cat=all&newstime=" + DTD + "&newsid=" + ID + "&period=0&key=&symbol=00001"
Working example with requests:
import requests

newstime = '934735827'
newsid = 'HKEX-EPS-20190815-003587368'

url = 'http://www.aastocks.com/tc/resources/datafeed/getmorenews.ashx?cat=all&newstime={}&newsid={}&period=0&key=&symbol=00001'
url_article = "http://www.aastocks.com/tc/stocks/analysis/stock-aafn-con/00001/{}/all"

for x in range(5):
    print('---', x, '----')
    print('data:', url.format(newstime, newsid))
    # get JSON data
    r = requests.get(url.format(newstime, newsid))
    data = r.json()
    #for item in data[:3]:  # test only a few links
    for item in data[:-1]:  # skip the last item, which is used to get the next page
        # test links to articles
        r = requests.get(url_article.format(item['id']))
        print('news:', r.status_code, url_article.format(item['id']))
    # get data for next page
    newstime = data[-1]['dtd']
    newsid = data[-1]['id']
    print('next page:', newstime, newsid)
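Since the question is written as a Scrapy spider, the same follow-the-JSON idea can also be expressed as a callback. This is only a sketch: the spider class and name are made up, it reuses the parse_news callback from the question, and it assumes the dtd/id fields keep behaving as described above.
import json
import scrapy

class AastocksNewsSpider(scrapy.Spider):
    name = 'aastocks_news'
    feed_url = ('http://www.aastocks.com/tc/resources/datafeed/getmorenews.ashx'
                '?cat=all&newstime={}&newsid={}&period=0&key=&symbol=00001')
    article_url = 'http://www.aastocks.com/tc/stocks/analysis/stock-aafn-con/00001/{}/all'
    # starting parameters taken from the requests example above
    start_urls = [feed_url.format('934735827', 'HKEX-EPS-20190815-003587368')]

    def parse(self, response):
        data = json.loads(response.text)
        # every item except the last one is a news entry
        for item in data[:-1]:
            yield scrapy.Request(self.article_url.format(item['id']),
                                 callback=self.parse_news)
        # the last item carries the parameters for the next JSON page;
        # add a stop condition here if you only want a limited number of pages
        next_page = self.feed_url.format(data[-1]['dtd'], data[-1]['id'])
        yield scrapy.Request(next_page, callback=self.parse)

    def parse_news(self, response):
        # ... extract the article fields here ...
        pass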

Trying to use Python Mechanize to fill in search box contained in <td> tags

I'm brand new to Python and am attempting to scrape information from a real estate listing website (www.realtor.ca). So far, I've managed to collect MLS numbers in a list using this code:
import urllib2, sys, re, mechanize, itertools, csv

# Set the url for the online search
url = 'http://www.realtor.ca/PropertyResults.aspx?Page=1&vs=Residential&ret=300&curPage=PropertySearch.aspx&sts=0-0&beds=0-0&baths=0-0&ci=Victoria&pro=3&mp=200000-300000-0&mrt=0-0-4&trt=2&of=1&ps=10&o=A'
content = urllib2.urlopen(url).read()
text = str(content)

# finds all instances of "MLS®: " to create a list of MLS numbers
# "[0-9]+" matches one or more digits; in this case it's looking for a 6-digit MLS number
findMLS = re.findall("MLS®: [0-9]+", text)
findMLS = [x.strip('MLS®: ') for x in findMLS]

# "Page 1 of " precedes the number of pages in the search result (10 listings per page)
num_pages = re.findall("Page 1 of [0-9]+", text)
num_pages = [y.strip('Page 1 of ') for y in num_pages]
pages = int(num_pages[0])

for page in range(2, pages + 1):
    # Update the url with the different search page numbers
    url_list = list(url)
    url_list[48] = str(page)
    url = "".join(url_list)
    # Read the new url to get more MLS numbers
    content = urllib2.urlopen(url).read()
    text = str(content)
    newMLS = re.findall("MLS®: [0-9]+", text)
    newMLS = [x.strip('MLS®: ') for x in newMLS]
    # Append new MLS numbers to the list findMLS
    for number in newMLS:
        findMLS.append(number)
With my list of MLS numbers (findMLS), I'd like to input each number into the MLS# search box at the top of this website: http://www.realtor.ca/propertySearch.aspx
Using inspect element I can find this search box, but I don't know how to use Python code and Mechanize to access it.
<input type="text" id="txtMlsNumber" value="" style="background-color:#ebebeb;border:solid 1px #C8CACA; " onkeypress="javascript:MLSNumberSearch(event)">
Any help would be greatly appreciated.
I have not used Mechanize, but I have had great luck navigating with Selenium. I know this is an extra module that you may or may not want to use, but it's very user friendly since Selenium 2 came out, and you could definitely navigate that site the way you'd like to.
Edit:
This would be really easy with something like this:
mls_search = driver.find_element_by_id('txtMlsNumber')
mls_search.send_keys('number that you scraped')
search = driver.find_element_by_id('lnkMlsSearch')
search.click()
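Putting that together with the findMLS list from the question might look roughly like this. This is only a sketch: it reuses the lnkMlsSearch id from the snippet above, keeps the same older find_element_by_id API, and assumes the search page is reloaded between lookups, so the element ids may need to be re-checked against the live site.
from selenium import webdriver

driver = webdriver.Firefox()

for mls_number in findMLS:
    # reload the search page for each scraped MLS number
    driver.get('http://www.realtor.ca/propertySearch.aspx')
    mls_search = driver.find_element_by_id('txtMlsNumber')
    mls_search.send_keys(mls_number)
    driver.find_element_by_id('lnkMlsSearch').click()
    # ... scrape the listing details for this MLS number here ...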
