Website not allowing me to extract data using Python

I am trying to extract data from a webpage (https://clinicaltrials.gov). I have built a scraper using Selenium and lxml and it works fine for the first page. Once that page is scraped I need to click the next-page button, take the URL of the new page with (driver.current_url), and start scraping again.
The problem is that only the search-results table changes; the URL stays the same. So whenever the driver reads the current URL (driver.current_url), the first page's results come back again and again.
Edit: here is the code.
import re
import time
import urllib.parse
import lxml.html
import pandas as pd
import requests
import urllib3
from lxml import etree
from lxml import html
from pandas import ExcelFile
from pandas import ExcelWriter
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.wait import WebDriverWait
import selenium.webdriver.support.expected_conditions as EC
siteurl = 'https://clinicaltrials.gov/'
driver = webdriver.Chrome()
driver.get(siteurl)
WebDriverWait(driver, 5)
driver.maximize_window()
def advancesearch():
    driver.find_element_by_link_text('Advanced Search').click()
    driver.find_element_by_id('StartDateStart').send_keys('01/01/2016')
    driver.find_element_by_id('StartDateEnd').send_keys('12/30/2020')
    webdriver.ActionChains(driver).send_keys(Keys.ENTER).perform()
    time.sleep(3)

driver.find_element_by_xpath("//input[contains(@id, 'home-search-condition-query')]").send_keys('medicine')  # give keyword here
advancesearch()
#driver.find_element_by_xpath("//div[contains(@class, 'dataTables_length')]//label//select//option[4]").click()
#time.sleep(8)
def nextbutton():
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)
    driver.find_element_by_xpath("//a[contains(@class, 'paginate_button next')]").click()
def extractor():
    cur_url = driver.current_url
    read_url = requests.get(cur_url)
    souptree = html.fromstring(read_url.content)
    tburl = souptree.xpath("//table[contains(@id, 'theDataTable')]//tbody//tr//td[4]//a//@href")
    for tbu in tburl:
        allurl = []
        allurl.append(urllib.parse.urljoin(siteurl, tbu))
        for tb in allurl:
            get_url = requests.get(tb)
            get_soup = html.fromstring(get_url.content)
            pattern = re.compile("^\s+|\s*,\s*|\s+$")
            name = get_soup.xpath('//td[@headers="contactName"]//text()')
            phone = get_soup.xpath('//td[@headers="contactPhone"]//text()')
            mail = get_soup.xpath('//td[@headers="contactEmail"]//a//text()')
            artitle = get_soup.xpath('//td[@headers="contactEmail"]//a//@href')
            artit = ([x for x in pattern.split(str(artitle)) if x][-1])
            title = artit[:-2]
            for (names, phones, mails) in zip(name, phone, mail):
                fullname = names[9:]
                print(fullname, phones, mails, title, sep='\t')

while True:
    extractor()
    nextbutton()

You don't need to get the URL if the page has already changed.
You can just start the next iteration once the page has reloaded after you click next. You can make the driver wait until an element is present (explicit wait) or just wait (implicit wait).
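For example, here is a minimal sketch of an explicit wait, reusing the theDataTable and paginate_button selectors from your code. It assumes the grid replaces its rows when the next page is rendered; if it only rewrites their text, you would wait for a known cell value to change instead:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def nextbutton():
    # remember a row from the current page so the redraw can be detected
    first_row = driver.find_element_by_xpath("//table[contains(@id, 'theDataTable')]//tbody//tr[1]")
    driver.find_element_by_xpath("//a[contains(@class, 'paginate_button next')]").click()
    # explicit wait: block until the old row has been detached from the DOM (or time out after 10 s)
    WebDriverWait(driver, 10).until(EC.staleness_of(first_row))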

There are a number of changes I would probably make (for example, using shorter, less fragile CSS selectors and bs4), but two stand out:
1) You already have the data you need, so there is no requirement for a new URL. Simply use the driver's current page_source.
So the top of the extractor function becomes:
def extractor():
    souptree = html.fromstring(driver.page_source)
    tburl = souptree.xpath("//table[contains(@id, 'theDataTable')]//tbody//tr//td[4]//a//@href")
    # rest of code
2) To reduce iterations, I would set the results count to 100 at the start:
def advancesearch():
    driver.find_element_by_link_text('Advanced Search').click()
    driver.find_element_by_id('StartDateStart').send_keys('01/01/2016')
    driver.find_element_by_id('StartDateEnd').send_keys('12/30/2020')
    webdriver.ActionChains(driver).send_keys(Keys.ENTER).perform()
    time.sleep(3)
    WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#theDataTable_length [value='100']"))).click()  # change to 100 results so there is less looping
then add the additional import:
from selenium.webdriver.common.by import By

Related

Using Python and Selenium to scrape hard-to-find web tables

I've been using Python and Selenium to scrape data from specific state health web pages and output the table to a local CSV.
I've had a lot of success on several other states using similar code. But, I have hit a state that is using what appears to be R to create dynamic dashboards that I can't really access using my normal methods.
I've spent a great deal of time combing through StackOverflow . . . I've checked to see if there's an iframe to switch to, but, I'm just not seeing the data I want located in the iframe on the page.
I can find the table info easily enough using Chrome's "Inspect" feature. But, starting from the original URL, the data I need is not on that page and I can't find a source URL for the table. I've even used Fiddler to see if there's a call somewhere.
So, I'm not sure what to do. I can see the data--but, I don't know where it is to tell Selenium and BS4 where to access it.
The page is here: https://coronavirus.utah.gov/case-counts/
The page takes a while to load . . . I've had other states have this issue and Selenium could work through it.
The table I need looks like this:
Any help or suggestions would be appreciated.
Here is the code I've been using . . . it doesn't work here, but, the structure is very similar to that which has worked for other states.
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
st = 'ut'
url = 'https://coronavirus.utah.gov/case-counts/'
timeout = 20
# Spawn the webpage using Selenium
driver = webdriver.Chrome(r'D:\Work\Python\utilities\chromedriver\chromedriver.exe')
driver.minimize_window()
driver.get(url)
# Let page load . . . it takes a while
wait = WebDriverWait(driver, timeout).until(EC.visibility_of_element_located((By.ID, "total-number-of-lab-confirmed-covid-19-cases-living-in-utah")))
# Now, scrape table
html = driver.find_element_by_id("total-number-of-lab-confirmed-covid-19-cases-living-in-utah")
soup = BeautifulSoup(html, 'lxml')
table = soup.find_all('table', id='#DataTables_Table_0')
df = pd.read_html(str(table))
exec(st + "_counts = df[0]")
tmp_str = f"{st}_counts.to_csv(r'D:\Work\Python\projects\Covid_WebScraping\output\{st}_covid_cnts_' + str(datetime.now().strftime('%Y_%m_%d_%H_%M_%S')) + '.csv'"
file_path = tmp_str + ", index=False)"
exec(file_path)
# Close the chrome web driver
driver.close()
I found another way to get the information I needed.
Thanks to Julian Stanley for letting me know about the Katalon Recorder product. That allowed me to see which iframe the table was in.
Using my old method of finding the element by CSS or XPath was causing a pickle error due to a locked thread. I have no clue how to deal with that . . . but it caused the entire project to just hang.
However, I was able to get the text/HTML of the table via an attribute. After that, I just read it with BS4 as usual.
import pandas as pd
from datetime import datetime
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
st = 'ut'
url = 'https://coronavirus.utah.gov/case-counts/'
timeout = 20
# Spawn the webpage using Selenium
driver = webdriver.Chrome(r'D:\Work\Python\utilities\chromedriver\chromedriver.exe')
#driver.minimize_window()
driver.get(url)
# Let page load . . . it takes a while
wait = WebDriverWait(driver, timeout)
# Get name of frame (or use index=0)
frames = [frame.get_attribute('id') for frame in driver.find_elements_by_tag_name('iframe')]
# Switch to frame
#driver.switch_to_frame("coronavirus-dashboard")
driver.switch_to_frame(0)
# Now, scrape table
html = driver.find_element_by_css_selector('#DataTables_Table_0_wrapper').get_attribute('innerHTML')
soup = BeautifulSoup(html, 'lxml')
table = soup.find_all('table', id='DataTables_Table_0')
df = pd.read_html(str(table))
exec(st + "_counts = df[0]")
tmp_str = f"{st}_counts.to_csv(r'D:\Work\Python\projects\Covid_WebScraping\output\{st}_covid_cnts_' + str(datetime.now().strftime('%Y_%m_%d_%H_%M_%S')) + '.csv'"
file_path = tmp_str + ", index=False)"
exec(file_path)
# Close the chrome web driver
driver.close()
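One small note on the above: switch_to_frame has been deprecated for a long time; the supported spelling uses the switch_to object, e.g.
driver.switch_to.frame(0)            # switch into the first iframe by index
driver.switch_to.default_content()   # switch back to the top-level document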

Error scraping Highcharts in Python with selenium

I am trying to scrape Highcharts charts from two different websites.
I came across the execute_script answer in this Stack Overflow question:
How to scrape charts from a website with python?
It helped me scrape the first website, but when I use it on the second website it returns the following error:
line 27, in <module>
    temp = driver.execute_script('return window.Highcharts.charts[0]'
selenium.common.exceptions.JavascriptException: Message: javascript error: Cannot read property '0' of undefined
The website is: http://lumierecapital.com/#
You're supposed to click on the Performance button on the left to get the Highchart.
Goal: I just want to scrape the Date and NAV per unit values from it.
As with the first website, this code should've printed out a dict with x and y as keys and the dates and data as values, but it doesn't work for this one.
Here's the python code:
from bs4 import BeautifulSoup
import requests
from selenium.webdriver.chrome.options import Options
from shutil import which
from selenium import webdriver
import time
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_path = which("chromedriver")
driver = webdriver.Chrome(executable_path=chrome_path, options=chrome_options)
driver.set_window_size(1366, 768)
driver.get("http://lumierecapital.com/#")
performance_button = driver.find_element_by_xpath("//a[@page='performance']")
performance_button.click()
time.sleep(7)
temp = driver.execute_script('return window.Highcharts.charts[0]'
                             '.series[0].options.data')
for item in temp:
    print(item)
You can use the re module to extract the values of the performance chart:
import re
import requests
url = 'http://lumierecapital.com/content_performance.html'
html_data = requests.get(url).text
for year, month, day, value, datex in re.findall(r"{ x:Date\.UTC\((\d+), (\d+), (\d+)\), y:([\d.]+), datex: '(.*?)' }", html_data):
    print('{:<10} {}'.format(datex, value))
Prints:
30/9/07 576.092
31/10/07 577.737
30/11/07 567.998
31/12/07 556.670
31/1/08 460.886
29/2/08 496.740
31/3/08 484.016
30/4/08 523.829
31/5/08 546.661
30/6/08 494.067
31/7/08 475.942
31/8/08 389.147
30/9/08 299.661
31/10/08 183.690
30/11/08 190.054
31/12/08 211.960
31/1/09 193.308
... and so on.
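If you would rather stay with the original execute_script approach, the "Cannot read property '0' of undefined" error usually means window.Highcharts.charts is queried before the chart has been created (or the chart lives inside an iframe, in which case you would need driver.switch_to.frame(...) first). A minimal sketch, assuming the chart is created in the top-level document, that polls for the object before reading the series data:
from selenium.webdriver.support.ui import WebDriverWait

# wait up to 20 s for the Highcharts global to exist and hold at least one chart
WebDriverWait(driver, 20).until(lambda d: d.execute_script(
    "return !!(window.Highcharts && window.Highcharts.charts"
    " && window.Highcharts.charts.length > 0)"))

temp = driver.execute_script('return window.Highcharts.charts[0].series[0].options.data')
for item in temp:
    print(item)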

Scraping data by automatically accessing multiple pages based on href

I am having a problem automating pagination with the Selenium webdriver and Python. My code clicks through the pages automatically up to page 10, but after that it stops working: the pages from 11 onwards are never clicked.
import urllib.request
from bs4 import BeautifulSoup
import csv
import os
from selenium import webdriver
from selenium.webdriver.support.select import Select
from selenium.webdriver.common.keys import Keys
import time
import pandas as pd
import os
url = 'http://www.igrmaharashtra.gov.in/eASR/eASRCommon.aspx?hDistName=Buldhana'
chrome_path = r'C:/Users/User/AppData/Local/Programs/Python/Python36/Scripts/chromedriver.exe'
d = webdriver.Chrome(executable_path=chrome_path)
d.implicitly_wait(10)
d.get(url)
Select(d.find_element_by_name('ctl00$ContentPlaceHolder5$ddlTaluka')).select_by_value('7')
Select(d.find_element_by_name('ctl00$ContentPlaceHolder5$ddlVillage')).select_by_value('1464')
page = [page.get_attribute('href') for page in
        d.find_elements_by_css_selector(
            "#ctl00_ContentPlaceHolder5_grdUrbanSubZoneWiseRate [href*='Page$']")]

while True:
    pages = [page.get_attribute('href') for page in
             d.find_elements_by_css_selector(
                 "#ctl00_ContentPlaceHolder5_grdUrbanSubZoneWiseRate [href*='Page$']")]
    for script_page in pages:
        d.execute_script(script_page)
        #print(script_page)
Try using page indexes: check whether the link for the next page number is available, click it, and carry on. Try the following code.
from selenium import webdriver
from selenium.webdriver.support.select import Select  # needed for the Select() calls below

url = 'http://www.igrmaharashtra.gov.in/eASR/eASRCommon.aspx?hDistName=Buldhana'
chrome_path = r'C:/Users/User/AppData/Local/Programs/Python/Python36/Scripts/chromedriver.exe'
d = webdriver.Chrome(executable_path=chrome_path)
d.implicitly_wait(10)
d.get(url)
Select(d.find_element_by_name('ctl00$ContentPlaceHolder5$ddlTaluka')).select_by_value('7')
Select(d.find_element_by_name('ctl00$ContentPlaceHolder5$ddlVillage')).select_by_value('1464')
i = 2
while True:
    if len(d.find_elements_by_css_selector("#ctl00_ContentPlaceHolder5_grdUrbanSubZoneWiseRate a[href*='Page${}']".format(i))) > 0:
        print(d.find_elements_by_css_selector("#ctl00_ContentPlaceHolder5_grdUrbanSubZoneWiseRate a[href*='Page${}']".format(i))[0].get_attribute('href'))
        d.find_elements_by_css_selector("#ctl00_ContentPlaceHolder5_grdUrbanSubZoneWiseRate a[href*='Page${}']".format(i))[0].click()
        i += 1
    else:
        break
Output (since i starts from page 2):
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$2')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$3')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$4')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$5')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$6')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$7')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$8')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$9')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$10')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$11')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$12')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$13')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$14')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$15')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$16')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$17')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$18')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$19')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$20')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$21')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$22')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$23')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$24')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$25')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$26')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$27')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$28')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$29')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$30')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$31')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$32')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$33')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$34')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$35')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$36')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$37')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$38')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$39')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$40')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$41')
javascript:__doPostBack('ctl00$ContentPlaceHolder5$grdUrbanSubZoneWiseRate','Page$42')
Process finished with exit code 0
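Since the original goal was to scrape the data and not just click through the pages, you could also read the grid on each page before moving on. A minimal sketch, reusing the d driver and the pager selector from the answer above, and assuming pandas is available (after each click the ASP.NET postback may need a short wait, and the pager row may have to be dropped from each frame afterwards):
import pandas as pd

tables = []
i = 2  # pager links start at Page$2, as in the answer above
while True:
    # parse the currently displayed grid into a DataFrame
    grid_html = d.find_element_by_id(
        'ctl00_ContentPlaceHolder5_grdUrbanSubZoneWiseRate').get_attribute('outerHTML')
    tables.append(pd.read_html(grid_html)[0])
    pager = d.find_elements_by_css_selector(
        "#ctl00_ContentPlaceHolder5_grdUrbanSubZoneWiseRate a[href*='Page${}']".format(i))
    if not pager:   # no link for the next page number: we are on the last page
        break
    pager[0].click()
    i += 1

data = pd.concat(tables, ignore_index=True)
print(data.shape)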

python selenium to scrape data from asos - need a better approach

Hi, I'm new to Python and crawling. After researching on Stack Overflow I came up with Python + Selenium: open a webdriver, load the URL, grab the page source, and turn it into the data I need. However, I know there's a better approach (for example, scraping without Selenium, not having to scrape the page source, posting data to the ASP endpoint, etc.), and I hope I can get some help here for educational purposes.
Here's what I'd like to achieve.
Start:
http://www.asos.com/Women/New-In-Clothing/Cat/pgecategory.aspx?cid=2623
Obtain: product title, price, img, and its link
Next: go to the next page if there is one; if not, output the results.
Before you go into my code, here is some background information. ASOS is a site that uses pagination, so this is about scraping through multiple pages. I also tried to do it without Selenium by posting to http://www.asos.com/services/srvWebCategory.asmx/GetWebCategories
with this data:
{'cid':'2623', 'strQuery':"", 'strValues':'undefined', 'currentPage':'0',
 'pageSize':'204', 'pageSort':'-1', 'countryId':'10085', 'maxResultCount':''}
but I get nothing back (there is a sketch of how such a request is usually sent after the code below).
I know my approach is not good; I'd much appreciate any help/recommendation/approach/idea! Thanks!
import scrapy
import time
import logging
from random import randint
from selenium import webdriver
from asos.items import ASOSItem
from scrapy.selector import Selector
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

class ASOSSpider(scrapy.Spider):
    name = "asos"
    allowed_domains = ["asos.com"]
    start_urls = [
        "http://www.asos.com/Women/New-In-Clothing/Cat/pgecategory.aspx?cid=2623#/parentID=-1&pge=0&pgeSize=204&sort="
    ]

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)
        view_204 = self.driver.find_element_by_xpath("//div[@class='product-count-bottom']/a[@class='view-max-paged']")
        view_204.click()  # click and show 204 pictures
        time.sleep(5)  # wait till the 204 images are loaded; I've also tried the explicit wait, but I got timed out
        # element = WebDriverWait(self.driver, 8).until(EC.presence_of_element_located((By.XPATH, "category-controls bottom")))
        logging.debug("wait time has been reached! go CRAWL!")
        next = self.driver.find_element_by_xpath("//li[@class='page-skip']/a")
        pageSource = Selector(text=self.driver.page_source)  # load the page source instead; can't seem to crawl the page by just passing the regular request
        for sel in pageSource.xpath("//ul[@id='items']/li"):
            item = ASOSItem()
            item["product_title"] = sel.xpath("a[@class='desc']/text()").extract()
            item["product_link"] = sel.xpath("a[@class='desc']/@href").extract()
            item["product_price"] = sel.xpath("div/span[@class='price']/text()").extract()
            item["product_img"] = sel.xpath("div/a[@class='productImageLink']/img/@src").extract()
            yield item
        next.click()
        self.driver.close()
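Regarding the attempt to call GetWebCategories directly (mentioned above): ASMX web-service endpoints like that one usually only respond when the payload is sent as JSON with a Content-Type: application/json header, and they typically wrap the result in a top-level "d" key; posted as plain form data they tend to return nothing useful. A hedged sketch of that style of request, assuming the endpoint and field names from the question are still what the site expects (the site may well have changed since):
import requests

url = 'http://www.asos.com/services/srvWebCategory.asmx/GetWebCategories'
payload = {'cid': '2623', 'strQuery': '', 'strValues': 'undefined',
           'currentPage': '0', 'pageSize': '204', 'pageSort': '-1',
           'countryId': '10085', 'maxResultCount': ''}
headers = {'X-Requested-With': 'XMLHttpRequest',
           'User-Agent': 'Mozilla/5.0'}

# requests sets Content-Type: application/json automatically when json= is used
resp = requests.post(url, json=payload, headers=headers)
resp.raise_for_status()
print(resp.json().get('d'))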

Urllib in Python is not providing the HTML code I see with Inspect Element

I'm trying to crawl the results in this link:
url = "http://topsy.com/trackback?url=http%3A%2F%2Fmashable.com%2F2014%2F08%2F27%2Faustralia-retail-evolution-lab-aopen-shopping%2F"
When I inspect it with Firebug I can see the HTML code and I know what I need to do to extract the tweets. The problem is that when I get the response using urlopen, I don't get the same HTML code. I only get tags. What am I missing?
Example code below:
def get_tweets(section_url):
    html = urlopen(section_url).read()
    soup = BeautifulSoup(html, "lxml")
    tweets = soup.find("div", "results")
    category_links = [dd.a["href"] for tweet in tweets.findAll("div", "result-tweet")]
    return category_links

url = "http://topsy.com/trackback?url=http%3A%2F%2Fmashable.com%2F2014%2F08%2F27%2Faustralia-retail-evolution-lab-aopen-shopping%2F"
cat_links = get_tweets(url)
Thanks,
YB
The problem is that the content of the results div is filled in by an extra HTTP call and JavaScript code executed on the browser side. urllib only "sees" the initial HTML page, which doesn't contain the data you need.
One option would be to follow @Himal's suggestion and simulate the underlying request to trackbacks.js that is sent for the tweet data. The result is in JSON format, which you can load() using the json module from the standard library:
import json
import urllib2
url = 'http://otter.topsy.com/trackbacks.js?url=http%3A%2F%2Fmashable.com%2F2014%2F08%2F27%2Faustralia-retail-evolution-lab-aopen-shopping%2F&infonly=0&call_timestamp=1411090809443&apikey=09C43A9B270A470B8EB8F2946A9369F3'
data = json.load(urllib2.urlopen(url))
for tweet in data['response']['list']:
    print tweet['permalink_url']
Prints:
http://twitter.com/Evonomie/status/512179917610835968
http://twitter.com/abs_office/status/512054653723619329
http://twitter.com/TKE_Global/status/511523709677756416
http://twitter.com/trevinocreativo/status/510216232122200064
http://twitter.com/TomCrouser/status/509730668814028800
http://twitter.com/Evonomie/status/509703168062922753
http://twitter.com/peterchaly/status/509592878491136000
http://twitter.com/chandagarwala/status/509540405411840000
http://twitter.com/Ayjay4650/status/509517948747526144
http://twitter.com/Marketingccc/status/509131671900536832
That was the "go down to the metal" option.
Otherwise, you can take a "high-level" approach and not bother about what is happening under the hood: let a real browser load the page and interact with it through the Selenium WebDriver:
from selenium import webdriver
driver = webdriver.Chrome() # can be Firefox(), PhantomJS() and more
driver.get("http://topsy.com/trackback?url=http%3A%2F%2Fmashable.com%2F2014%2F08%2F27%2Faustralia-retail-evolution-lab-aopen-shopping%2F")
for tweet in driver.find_elements_by_class_name('result-tweet'):
    print tweet.find_element_by_xpath('.//div[@class="media-body"]//ul[@class="inline"]/li//a').get_attribute('href')
driver.close()
Prints:
http://twitter.com/Evonomie/status/512179917610835968
http://twitter.com/abs_office/status/512054653723619329
http://twitter.com/TKE_Global/status/511523709677756416
http://twitter.com/trevinocreativo/status/510216232122200064
http://twitter.com/TomCrouser/status/509730668814028800
http://twitter.com/Evonomie/status/509703168062922753
http://twitter.com/peterchaly/status/509592878491136000
http://twitter.com/chandagarwala/status/509540405411840000
http://twitter.com/Ayjay4650/status/509517948747526144
http://twitter.com/Marketingccc/status/509131671900536832
This is how you can scale the second option to get all of the tweets by following the pagination:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
BASE_URL = 'http://topsy.com/trackback?url=http%3A%2F%2Fmashable.com%2F2014%2F08%2F27%2Faustralia-retail-evolution-lab-aopen-shopping%2F&offset={offset}'
driver = webdriver.Chrome()
# get tweets count
driver.get('http://topsy.com/trackback?url=http%3A%2F%2Fmashable.com%2F2014%2F08%2F27%2Faustralia-retail-evolution-lab-aopen-shopping%2F')
tweets_count = int(driver.find_element_by_xpath('//li[@data-name="all"]/a/span').text)

for x in xrange(0, tweets_count, 10):
    driver.get(BASE_URL.format(offset=x))

    # page header appears in case no more tweets are found
    try:
        driver.find_element_by_xpath('//div[@class="page-header"]/h3')
    except NoSuchElementException:
        pass
    else:
        break

    # wait for results
    WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.ID, "results"))
    )

    # get tweets
    for tweet in driver.find_elements_by_class_name('result-tweet'):
        print tweet.find_element_by_xpath('.//div[@class="media-body"]//ul[@class="inline"]/li//a').get_attribute('href')

driver.close()
