PYTHON scrapy selenium WebDriverWait - python

Hello experts, I would appreciate your help.
I have recently been building a web crawler with scrapy and selenium in Python, and I am stuck.
Is it possible to still get an empty result even after using the statement
WebDriverWait(driver, 100, 0.1).until(EC.presence_of_all_elements_located((By.XPATH,xxxxx)))
to get those elements? It also does not take anywhere near 100 seconds before the empty result comes back. Why?
It also happens at random: the empty results can show up anywhere, at any time.
Could the empty results have something to do with my network connection?
Could you help me, or give me some opinions or suggestions about the question above?
Thanks a lot!
-----------------------supplementary notes-----------------------
Thanks for the heads up.
In summary, I used scrapy and selenium to crawl a review site and write the username, posting time, comment content, etc. to a .xlsx file via pipeline.py. I wanted it to be as fast as possible while still gathering complete information.
Each page has many comments, and long review text is collapsed, which means that almost 20 comments per page have their own expand button.
Therefore, I need to use selenium to click the expand button and then use the driver to fetch the complete comment. Common sense dictates that it takes a bit of time to load after the expand button is clicked, and I believe the time it takes depends on the speed of the network. So using WebDriverWait seems to be a wise choice here. In my experience, timeout=10 with the default poll_frequency=0.5 seemed too slow and error-prone, so I switched to timeout=100 and poll_frequency=0.1.
However, every time I run the project with the command scrapy crawl spider, several of the crawled comments come back empty, and the empty ones are in different places each time. I've thought about using time.sleep() to force a pause, which is certainly a more reliable way to get complete information, but it would take a lot of time if every page did that. It also looks clumsy and not very elegant, in my opinion.
Have I expressed my question clearly?
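One possible explanation, offered as a sketch rather than a diagnosis: a polling wait such as WebDriverWait returns as soon as its condition is truthy, so the 100-second timeout is only an upper bound. A presence condition is satisfied once matching elements exist in the DOM, which can happen before their text has been filled in. The pure-Python helper below mimics that behavior. The names poll_until and FakeElement and the 0.3 s delay are hypothetical, invented for illustration; they are not Selenium APIs.

```python
import time


def poll_until(condition, timeout, poll_frequency):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value immediately; raises TimeoutError after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError("condition was not met in time")
        time.sleep(poll_frequency)


class FakeElement:
    """Hypothetical stand-in for a page element whose text arrives late."""

    def __init__(self, ready_at):
        self.ready_at = ready_at

    @property
    def text(self):
        return "full review" if time.monotonic() >= self.ready_at else ""


# The element is "present" right away, but its text loads 0.3 s later.
element = FakeElement(ready_at=time.monotonic() + 0.3)

# Waiting for presence only: returns on the first poll, text is still empty.
present = poll_until(lambda: [element], timeout=100, poll_frequency=0.1)
early_text = present[0].text

# Waiting for non-empty text: returns once the content has actually loaded.
late_text = poll_until(lambda: element.text, timeout=100, poll_frequency=0.1)
print(repr(early_text), repr(late_text))
```

In Selenium terms, this suggests passing until() a condition that checks the text itself (until() accepts any callable that takes the driver) instead of waiting for mere presence.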
-------------------------------add something--------------------------------
What I mean by getting an empty result somewhere is shown in the picture below.
---------------------------add my code--------------------------2022/5/18
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get('https://movie.douban.com/subject/5045678/reviews?start=0')
users = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'a[class=name]')]
dates = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'span[class=main-meta]')]
full_content, words = [], []
unfolds = WebDriverWait(driver,100,0.1).until(EC.presence_of_all_elements_located((By.XPATH,"//a[@class='unfold']")))
# Here's how I think about and design my loop body.
# I click the expand button, then grab the text, then fold it away, then move on to the next one.
for i in range(len(unfolds)):
    unfolds[i].click()
    time.sleep(1)
    # After the javascript runs, the `div[@class='review-content clearfix']` appears,
    # and some of the full review content will be put in a `<p></p>` tag
    find_full_content_p = WebDriverWait(driver,100,0.1).until(EC.presence_of_all_elements_located((By.XPATH,"//div[@class='review-content clearfix']/p")))
    full_content_p = [j.text for j in find_full_content_p]
    # and some of it will just be put in `div[@class='review-content clearfix']` itself.
    find_full_content_div = WebDriverWait(driver,100,0.1).until(EC.presence_of_all_elements_located((By.XPATH,"//div[@class='review-content clearfix']")))
    full_content_div = [j.text for j in find_full_content_div]
    # and I merge the two lists
    full_content_p.extend(full_content_div)
    full_content.append("".join(full_content_p))
    words.append(len("".join(full_content_p)))
    time.sleep(1)
    # then fold it away
    WebDriverWait(driver,100,0.1).until(EC.element_to_be_clickable((By.XPATH,"//a[@class='fold']"))).click()
driver.close()
pd.DataFrame({"users":users, "dates":dates, "full_content":full_content, "words":words})
And this is the code of an expert I genuinely respect, named sound wave. (It is slightly modified; the core code has not been changed.)
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
# from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome()
driver.get('https://movie.douban.com/subject/5045678/reviews?start=0')
users = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'a[class=name]')]
dates = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'span[class=main-meta]')]
reviews, words = [], []
for review in driver.find_elements(By.CSS_SELECTOR, 'div.review-short'):
    show_more = review.find_elements(By.CSS_SELECTOR, 'a.unfold')
    if show_more:
        # scroll to the show more button, needed to avoid ElementClickInterceptedException
        driver.execute_script('arguments[0].scrollIntoView({block: "center"});', show_more[0])
        show_more[0].click()
        review = review.find_element(By.XPATH, 'following-sibling::div')
        # poll until the expanded container is no longer hidden
        while review.get_attribute('class') == 'hidden':
            time.sleep(0.2)
        review = review.find_element(By.CSS_SELECTOR, 'div.review-content')
    reviews.append(review.text)
    words.append(len(review.text))
    print('done',len(reviews),end='\r')
pd.DataFrame({"users":users,"dates":dates,"reviews":reviews,"words":words})

NEW
Added code for the site douban. To export the scraped data to a csv see the pandas code in the OLD section below
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service('...'))
driver.get('https://movie.douban.com/subject/5045678/reviews?start=0')
users = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'a[class=name]')]
dates = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'span[class=main-meta]')]
reviews = []
for review in driver.find_elements(By.CSS_SELECTOR, 'div.review-short'):
    show_more = review.find_elements(By.CSS_SELECTOR, 'a.unfold')
    if show_more:
        # scroll to the show more button, needed to avoid ElementClickInterceptedException
        driver.execute_script('arguments[0].scrollIntoView({block: "center"});', show_more[0])
        show_more[0].click()
        review = review.find_element(By.XPATH, 'following-sibling::div')
        while review.get_attribute('class') == 'hidden':
            time.sleep(0.2)
        review = review.find_element(By.CSS_SELECTOR, 'div.review-content')
    reviews.append(review.text)
    print('done',len(reviews),end='\r')
OLD
For the website you mentioned (imdb.com), there is no need to click the show more button to scrape the hidden content, because the text is already loaded in the HTML code; it is simply not shown on the site. So you can scrape all the comments in one go. The code below stores users, dates and reviews in separate lists, and finally saves the data to a .csv file.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service(chromedriver_path))
driver.get('https://www.imdb.com/title/tt1683526/reviews')
# sets a maximum waiting time for .find_element() and similar commands
driver.implicitly_wait(10)
reviews = [el.get_attribute('innerText') for el in driver.find_elements(By.CSS_SELECTOR, 'div.text')]
users = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'span.display-name-link')]
dates = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'span.review-date')]
# store data in a csv file
import pandas as pd
df = pd.DataFrame(list(zip(users,dates,reviews)), columns=['user','date','review'])
df.to_csv(r'C:\Users\your_name\Desktop\data.csv', index=False)
To print a single review you can do something like this
i = 0
print(f'User: {users[i]}\nDate: {dates[i]}\n{reviews[i]}')
the output (truncated) is
User: dschmeding
Date: 26 February 2012
Wow! I was not expecting this movie to be this engaging. Its one of those films...
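To illustrate the point that collapsed text is already present in the HTML, here is a minimal sketch using only the standard library. The HTML snippet is made up for illustration (only the div.text class name comes from the answer above), and TextCollector is a hypothetical helper, not part of Selenium.

```python
from html.parser import HTMLParser

# Made-up markup: the second review is hidden by CSS but still in the source.
SAMPLE_HTML = """
<div class="text">Short review.</div>
<div class="text" style="display:none">Long review that the site collapses.</div>
"""


class TextCollector(HTMLParser):
    """Collect the text of every <div class="text">, visible or not."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # > 0 while inside a matching div
        self.reviews = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.depth:
                self.depth += 1
            elif "text" in (dict(attrs).get("class") or "").split():
                self.depth = 1
                self.reviews.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.reviews[-1] += data


collector = TextCollector()
collector.feed(SAMPLE_HTML)
print(collector.reviews)
```

Selenium's el.text is meant to return only the text a user can see, which is presumably why the answer reads the innerText attribute for the reviews instead.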

Related

Python Scraper Won't Complete

I am using this code to scrape emails from Google search results. However, it only scrapes the first 10 results despite having 100 search results loaded.
Ideally, I would like for it to scrape all search results.
Is there a reason for this?
from selenium import webdriver
import time
import re
import pandas as pd
PATH = r'C:\Program Files (x86)\chromedriver.exe'
l=list()
o={}
target_url = "https://www.google.com/search?q=solicitors+wales+%27email%27+%40&rlz=1C1CHBD_en-GBIT1013IT1013&sxsrf=AJOqlzWC1oRbVtWcmcIgC4-3ZnGkQ8sP_A%3A1675764565222&ei=VSPiY6WeDYyXrwStyaTwAQ&ved=0ahUKEwjlnIy9lYP9AhWMy4sKHa0kCR4Q4dUDCA8&uact=5&oq=solicitors+wales+%27email%27+%40&gs_lcp=Cgxnd3Mtd2l6LXNlcnAQAzIFCAAQogQyBwgAEB4QogQyBQgAEKIESgQIQRgASgQIRhgAUABYAGD4AmgAcAF4AIABc4gBc5IBAzAuMZgBAKABAcABAQ&sclient=gws-wiz-serp"
driver=webdriver.Chrome(PATH)
driver.get(target_url)
email_pattern = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}"
html = driver.page_source
emails = re.findall(email_pattern, html)
time.sleep(10)
df = pd.DataFrame(emails, columns=['Email Addresses'])
df.to_excel('email_addresses_.xlsx',index=False)
# print(emails)
driver.close()
The code is working as expected: it scrapes 10 results, which is the default for a Google search page. You can use a method like 'find_element_by_xpath' to find the next button and click it.
Repeat this in a loop until enough results have been collected. Refer to the Selenium documentation on locating elements for more details.
For how to use the Selenium commands, you can look on the web. I found one similar question which can provide some reference.
Following up on Bijendra's answer,
you could update the code as below:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import re
import pandas as pd
PATH = r'C:\Program Files (x86)\chromedriver.exe'
l=list()
o={}
target_url = "https://www.google.com/search?q=solicitors+wales+%27email%27+%40&rlz=1C1CHBD_en-GBIT1013IT1013&sxsrf=AJOqlzWC1oRbVtWcmcIgC4-3ZnGkQ8sP_A%3A1675764565222&ei=VSPiY6WeDYyXrwStyaTwAQ&ved=0ahUKEwjlnIy9lYP9AhWMy4sKHa0kCR4Q4dUDCA8&uact=5&oq=solicitors+wales+%27email%27+%40&gs_lcp=Cgxnd3Mtd2l6LXNlcnAQAzIFCAAQogQyBwgAEB4QogQyBQgAEKIESgQIQRgASgQIRhgAUABYAGD4AmgAcAF4AIABc4gBc5IBAzAuMZgBAKABAcABAQ&sclient=gws-wiz-serp"
driver=webdriver.Chrome(PATH)
driver.get(target_url)
emails = []
email_pattern = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}"
for i in range(2):
    html = driver.page_source
    for e in re.findall(email_pattern, html):
        emails.append(e)
    # "pnnext" is the id of the next-page link on the results page
    a_attr = driver.find_element(By.ID,"pnnext")
    a_attr.click()
    time.sleep(2)
df = pd.DataFrame(emails, columns=['Email Addresses'])
df.to_csv('email_addresses_.csv',index=False)
driver.close()
You could either change the range value passed in for loop or entirely replace the for loop with while loop so instead of
for i in range(2):
You could do:
while len(emails) < 100:
Make sure to handle the timing: after clicking the next button on the search result page, wait for the next page to load before extracting the available emails and moving on to the next click.
Refer to the docs to get a clear idea of what you need to do to achieve what you want. Happy Hacking!!
Selenium loads its own empty browser, so your Google setting for 100 results needs to be in the code; the default is 10 results, which is what you're getting. You will have better luck using query parameters and adding the one for the number of results to the end of your URL.
If you need further information on query parameters to achieve this, it's the second method described below
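One caveat with a loop like `while len(emails) < 100` is that it never ends if the results run out first. The sketch below shows a loop with both stopping conditions. It works on a list of made-up page sources instead of a live browser, and collect_emails and fake_pages are hypothetical names used only for illustration.

```python
import re

EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}")


def collect_emails(pages, wanted):
    """Collect unique emails page by page until `wanted` are found or pages run out."""
    emails = []
    for html in pages:  # each item stands for one driver.page_source
        for email in EMAIL_PATTERN.findall(html):
            if email not in emails:
                emails.append(email)
        if len(emails) >= wanted:
            break
    return emails[:wanted]


# Made-up page sources standing in for three result pages.
fake_pages = [
    "contact: a@example.com, b@example.org",
    "again a@example.com and c@example.net",
    "never reached: d@example.com",
]
found = collect_emails(fake_pages, wanted=3)
print(found)
```

With Selenium, "pages run out" would correspond to the next button no longer being found on the page.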
tldevtech.com/how-to-show-100-results-per-page-in-google-search
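Adding a query parameter to a URL can be sketched with the standard library as follows. The parameter name "num" is an assumption on my part, not something stated in the answer above; check the linked article for the exact name. with_query_param is a hypothetical helper.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def with_query_param(url, key, value):
    """Return `url` with the query parameter `key` set to `value`."""
    parts = urlparse(url)
    # Keep every existing parameter except an earlier value of `key`.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != key]
    query.append((key, str(value)))
    return urlunparse(parts._replace(query=urlencode(query)))


# "num" is an assumed parameter name for the number of results per page.
url = with_query_param("https://www.google.com/search?q=solicitors+wales", "num", 100)
print(url)
```

The resulting URL can then be passed to driver.get() in place of the original target_url.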

Selenium scraping Issues with site having an popup window with endless scroll

I am trying to scrape a website that populates a list of providers. The site makes you go through a list of options, and then it finally populates a list of providers in a popup that has an endless/continuous scroll.
I have tried:
from selenium.webdriver.common.action_chains import ActionChains
element = driver.find_element_by_id("my-id")
actions = ActionChains(driver)
actions.move_to_element(element).perform()
but this code didn't work.
I tried something similar to this:
driver.execute_script("arguments[0].scrollIntoView();", list )
but this didn't move anything; it just stayed on the first 20 providers.
I tried this alternative:
main = driver.find_element_by_id('mainDiv')
recentList = main.find_elements_by_class_name('nameBold')
for list in recentList :
    driver.execute_script("arguments[0].scrollIntoView(true);", list)
    time.sleep(20)
but ended up with this error message:
selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: element is not attached to the page document
The code that worked the best was this one:
while True:
    # Scroll down to bottom
    element_inside_popup = driver.find_element_by_xpath('//*[@id="mainDiv"]')
    element_inside_popup.send_keys(Keys.END)
    # Wait to load page
    time.sleep(3)
but this is an endless scroll that I don't know how to stop, since "while True:" will always be true.
Any help with this would be great and thanks in advance.
This is my code so far:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.ui import Select
import pandas as pd
PATH = '/Users/AnthemScraper/venv/chromedriver'
driver = webdriver.Chrome(PATH)
#location for the website
driver.get('https://shop.anthem.com/sales/eox/abc/ca/en/shop/plans/medical/snq?execution=e1s13')
print(driver.title)
#entering the zipcode
search = driver.find_element_by_id('demographics.zip5')
search.send_keys(90210)
#making the scraper sleep for 5 seconds while the page loads
time.sleep(5)
#entering first name and DOB then hitting next
search = driver.find_element_by_id('demographics.applicants0.firstName')
search.send_keys('juelz')
search = driver.find_element_by_id('demographics.applicants0.dob')
search.send_keys('01011990')
driver.find_element_by_xpath('//*[@id="button/shop/getaquote/next"]').click()
#hitting the next button
driver.find_element_by_xpath('//*[@id="hypertext/shop/estimatesavings/skipthisstep"]').click()
#making the scraper sleep for 2 seconds while the page loads
time.sleep(2)
#clicking the no option to view all the health plans
driver.find_element_by_xpath('//*[@id="radioNoID"]').click()
driver.find_element_by_xpath('/html/body/div[4]/div[11]/div/button[2]/span').click()
#making the scraper sleep for 2 seconds while the page loads
time.sleep(2)
driver.find_element_by_xpath('//*[@id="hypertext/shop/medical/showmemydoctorlink"]/span').click()
time.sleep(2)
#section to choose the specialist. here we are choosing all
find_specialist=\
    driver.find_element_by_xpath('//*[@id="specializedin"]')
#this is the method for a dropdown
select_provider = Select(find_specialist)
select_provider.select_by_visible_text('All Specialties')
#choosing the distance. Here we click on 50 miles
choose_mile_radius=\
    driver.find_element_by_xpath('//*[@id="distanceInMiles"]')
select_provider = Select(choose_mile_radius)
select_provider.select_by_visible_text('50 miles')
driver.find_element_by_xpath('/html/body/div[4]/div[11]/div/button[2]/span').click()
#handling the endless scroll
while True:
    time.sleep(20)
    # Scroll down to bottom
    element_inside_popup = driver.find_element_by_xpath('//*[@id="mainDiv"]')
    element_inside_popup.send_keys(Keys.END)
    # Wait to load page
    time.sleep(3)
#block below allows us to grab the majority of the data. we would have to split it up in pandas since this info
#is nested in with classes
time.sleep(5)
main = driver.find_element_by_id('mainDiv')
sections = main.find_elements_by_class_name('firstRow')
pcp_info = []
#print(section.text)
for pcp in sections:
    #the site stores the information inside inner classes which make it difficult to scrape.
    #the solution would be to pull the entire text in the block and hope to clean it afterwards
    #innerText allows to pull just the text inside the blocks
    first_blox = pcp.find_element_by_class_name('table_content_colone').get_attribute('innerText')
    second_blox = pcp.find_element_by_class_name('table_content_coltwo').get_attribute('innerText')
    #creating columns and rows and assigning them
    pcp_items = {
        'first_block' : [first_blox],
        'second_block' : [second_blox]
    }
    pcp_info.append(pcp_items)
df = pd.DataFrame(pcp_info)
print(df)
df.to_csv('yerp.csv',index=False)
#driver.quit()
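One common way to end a `while True:` scroll loop is to stop once scrolling no longer loads anything new. The sketch below shows that idea with plain callables so it runs without a browser. scroll_until_stable, fake_scroll and the item counts are hypothetical, invented for illustration.

```python
def scroll_until_stable(scroll_once, count_items, max_rounds=50):
    """Scroll until the item count stops growing or max_rounds is reached.

    scroll_once: callable that scrolls and waits for new items to load.
    count_items: callable returning how many items are currently loaded.
    """
    previous = count_items()
    for _ in range(max_rounds):
        scroll_once()
        current = count_items()
        if current == previous:  # nothing new appeared: assume the end
            break
        previous = current
    return previous


# Fake page for demonstration: 20 more items per scroll, 65 items in total.
loaded = {"n": 20}


def fake_scroll():
    loaded["n"] = min(loaded["n"] + 20, 65)


total = scroll_until_stable(fake_scroll, lambda: loaded["n"])
print(total)
```

In the script above, scroll_once could send Keys.END to the mainDiv element and then sleep, and count_items could return the number of elements with the class name firstRow. The sleep has to be long enough for a new batch to load, otherwise the loop stops too early.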

scraping yahoo stock news

I am scraping news articles related to Infosys from the end of the page, but I am getting this error:
selenium.common.exceptions.InvalidSelectorException: Message: invalid selector .
I want to scrape all articles related to Infosys.
from bs4 import BeautifulSoup
import re
from selenium import webdriver
import chromedriver_binary
import string
import time
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
driver = webdriver.Chrome("/Users/abhishekgupta/Downloads/chromedriver")
driver.get("https://finance.yahoo.com/quote/INFY/news?p=INFY")
for i in range(20): # adjust integer value for need
    # you can change right side number for scroll convenience or destination
    driver.execute_script("window.scrollBy(0, 250)")
    # you can change time integer to float or remove
    time.sleep(1)
print(driver.find_element_by_xpath('//*[@id="latestQuoteNewsStream-0-Stream"]/ul/li[9]/div/div/div[2]/h3/a/text()').text())
You could use a less detailed XPath, with // instead of /div/div/div[2].
And if you want the last item, get all li elements as a list and then use [-1] to get the last element of the list.
from selenium import webdriver
import time
driver = webdriver.Chrome("/Users/abhishekgupta/Downloads/chromedriver")
#driver = webdriver.Firefox()
driver.get("https://finance.yahoo.com/quote/INFY/news?p=INFY")
for i in range(20):
    driver.execute_script("window.scrollBy(0, 250)")
    time.sleep(1)
all_items = driver.find_elements_by_xpath('//*[@id="latestQuoteNewsStream-0-Stream"]/ul/li')
#for item in all_items:
# print(item.find_element_by_xpath('.//h3/a').text)
# print(item.find_element_by_xpath('.//p').text)
# print('---')
print(all_items[-1].find_element_by_xpath('.//h3/a').text)
print(all_items[-1].find_element_by_xpath('.//p').text)
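The same two ideas (a shorter path using `.//`, and `[-1]` for the last item) can be tried without a browser. The sketch below swaps Selenium for the standard library's xml.etree.ElementTree, which supports only a subset of XPath. The markup is made up and simplified; only the id value comes from the question.

```python
import xml.etree.ElementTree as ET

# Made-up, simplified markup; only the id value comes from the question.
SAMPLE = """
<div>
  <div id="latestQuoteNewsStream-0-Stream">
    <ul>
      <li><div><h3><a>First headline</a></h3><p>First summary</p></div></li>
      <li><div><h3><a>Last headline</a></h3><p>Last summary</p></div></li>
    </ul>
  </div>
</div>
"""

root = ET.fromstring(SAMPLE)
# Attribute predicates use "@", and ".//" skips the intermediate div levels.
all_items = root.findall(".//*[@id='latestQuoteNewsStream-0-Stream']/ul/li")
last = all_items[-1]
headline = last.find(".//h3/a").text
summary = last.find(".//p").text
print(headline, "|", summary)
```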
xPath you provided does not exist in the page.
Download the xPath Finder Chrome Extension to find the correct xPath for articles.
Here is an example xPath of articles list, you need to loop through id:
/html/body/div[1]/div/div/div[1]/div/div[3]/div[1]/div/div[5]/div/div/div/ul/li[ID]/div/div/div[2]/h3/a/u
I think your code is fine except for one thing: retrieving text or links with XPath works a little differently in selenium than in scrapy or lxml's fromstring. In selenium the XPath has to select an element rather than a text() node, and .text is a property, not a method. So here is something that should work for you:
#use this code for printing instead
print(driver.find_element_by_xpath('//*[@id="latestQuoteNewsStream-0-Stream"]/ul/li[9]/div/div/div[2]/h3/a').text)
Since there is only one element with this id, you can also simply print the text of the whole stream:
#This should also work fine
print(driver.find_element_by_xpath('//*[@id="latestQuoteNewsStream-0-Stream"]').text)

selenium and python: How to click on 'OnClick' button for a div class and extract the subsequent data

I want to use selenium on this page.
The steps I want to take to scrape the page:
1. type '22663' into the box that says 'search by plant-based food'
2. click 'food-disease association'
3. click submit on the bottom of the page
4. click 'plant-disease associations'
5. export the plant-disease table
I wrote this code:
import sys
import pandas as pd
from bs4 import BeautifulSoup
import selenium
from selenium import webdriver
from selenium.webdriver.support.ui import Select
import csv
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
#binary = FirefoxBinary('/Users/kela/Desktop/scripts/scraping/geckodriver')
url = 'http://147.8.185.62/services/NutriChem-2.0/'
driver = webdriver.Firefox(executable_path='/Users/kela/Desktop/scripts/scraping/geckodriver')
driver.get(url)
element = driver.find_element_by_id("input_food_name")
element.send_keys("22663")
#click food-disease association
element = driver.find_element_by_xpath("//select[@name='food_search_section']")
#all_options = element.find_elements_by_tag_name("option")
element = Select(driver.find_element_by_css_selector('[name=food_search_section]'))
element.select_by_value('food_disease')
submit_xpath = '/html/body/form/p[2]/input[1]'
destination_page_link = driver.find_element_by_xpath(submit_xpath)
destination_page_link.click()
#this doesn't work for step 4
#xpath2 = '/html/body/table/tbody/tr/td[3]/div'
#destination_page_link = driver.find_element_by_xpath(xpath2)
#destination_page_link.click()
#this doesn't work for step 4
xpath2 = '/html/body/table/tbody/tr/td[3]/div/span'
destination_page_link = driver.find_element_by_xpath(xpath2)
destination_page_link.click()
I am struggling with steps 4 and 5.
For step 4, how do I select the 'div class -> onclick ClickButton('nutrichem12587_disease.tsv','plant_disease')' button? You can see a couple of things I've tried in the above code, based on other stackoverflow questions. I tried a good few things; these are two examples.
Then for step 5, I can already foresee having a similar issue, because I want to click the 'expand/right arrow' for each row (e.g. the arrow beside pomegranate/diabetes), and print out the data beneath that, i.e.
PredictionPMID:22919408 Punica granatum Diabetes
PredictionPMID:22529479 P. granatum Diabetes
PredictionPMID:22529479 Punica granatum Diabetes
PredictionPMID:20020514 Punica granatum Diabetes
for each of the subsequent rows. Could someone show me how to do this?
Edit 1: for step 4, I've tried things like this, but they return errors saying the elements don't exist, even though I got the locations by copying the XPaths:
#click plant-disease associations
#submit_xpath = '/html/body/table/tbody/tr/td[3]/div/span'
submit_xpath = '/html/body/table/tbody/tr/td[3]'
destination_page_link = driver.find_element_by_xpath(submit_xpath)
destination_page_link.click()
For step 4
If you're confident the web page will be exactly the same every time, you could identify an element which contains your "plant-disease associations" button and then manually click (x, y) coordinates within that element, as described in the second answer here
For step 5
Try to scoop up the entire table first, as opposed to the individual right arrows, and go over it manually by identifying all the children.
For step 4, it's possible that your code is not working because it's not waiting for the page to load. If this is the case, add these import statements:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
And I have this function that I find really handy for browser automation, which you can add to your script:
def wait_for_element(driver, selector, method):
    """Returns element after waiting for page load"""
    # look up the locator strategy by name, e.g. 'XPATH' -> By.XPATH
    by = getattr(By, method)
    try:
        wait = WebDriverWait(driver, 10)
        wait.until(EC.presence_of_element_located((by, selector)))
    finally:
        element = driver.find_element(by, selector)
    return element
Implement it to find the button for step 4 by using:
xpath2 = '/html/body/table/tbody/tr/td[3]/div'
destination_page_link = wait_for_element(driver, xpath2, 'XPATH')
Hope this helps!

extracting more information from webdriver

I have written a code to extract the mobile models from the following website
"http://www.kart123.com/mobiles/pr?p%5B%5D=sort%3Dfeatured&sid=tyy%2C4io&ref=659eb948-c365-492c-99ef-59bd9f0427c6"
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("http://www.kart123.com/mobiles/pr?p%5B%5D=sort%3Dfeatured&sid=tyy%2C4io&ref=659eb948-c365-492c-99ef-59bd9f0427c6")
elem=[]
elem=driver.find_elements_by_xpath('.//div[@class="pu-title fk-font-13"]')
for e in elem:
    print(e.text)
Everything is working fine, but the problem arises at the end of the page: it shows the contents of the first page only. Could you please tell me what I can do in order to get all the models?
This will get you on your way, I would use while loops using sleep to get all the page loaded before getting the information from the page.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
driver = webdriver.Firefox()
driver.get("http://www.flipkart.com/mobiles/pr?p%5B%5D=sort%3Dfeatured&sid=tyy%2C4io&ref=659eb948-c365-492c-99ef-59bd9f0427c6")
time.sleep(3)
for i in range(5):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") # scroll to bottom of page
    time.sleep(2)
    driver.find_element_by_xpath('//*[@id="show-more-results"]').click() # click load more button, needs to be done until you reach the end.
elem=[]
elem=driver.find_elements_by_xpath('.//div[@class="pu-title fk-font-13"]')
for e in elem:
    print(e.text)
Ok, this is going to be a major hack, but here goes... The site gets more phones as you scroll down by hitting an ajax script, giving you 20 more each time. The script it's hitting is this:
http://www.flipkart.com/mobiles/pr?p[]=sort%3Dpopularity&sid=tyy%2C4io&start=1&ref=8aef4a5f-3429-45c9-8b0e-41b05a9e7d28&ajax=true
Notice the start parameter; you can hack this into what you want with
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
num = 1
while num <=2450:
    """
    This condition will need to be updated to the maximum number
    of models you're interested in (or if you're feeling brave try to extract
    this from the top of the page)
    """
    # str.format is used because the URL itself contains literal % signs
    driver.get("http://www.flipkart.com/mobiles/pr?p[]=sort%3Dpopularity&sid=tyy%2C4io&start={}&ref=8aef4a5f-3429-45c9-8b0e-41b05a9e7d28&ajax=true".format(num))
    elem=[]
    elem=driver.find_elements_by_xpath('.//div[@class="pu-title fk-font-13"]')
    for e in elem:
        print(e.text)
    num += 20
With this loop you'll be making 123 GET requests (start = 1, 21, ..., 2441), so this will be quite slow...
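The URLs visited by this approach can be generated up front, which also makes the number of requests easy to check. This is a minimal sketch. BASE reuses the ajax URL quoted above with a {start} placeholder, and page_urls is a hypothetical helper. str.format is used instead of the % operator because the URL contains literal percent signs such as %3D, which the % operator would try to interpret as format codes.

```python
BASE = ("http://www.flipkart.com/mobiles/pr?p[]=sort%3Dpopularity"
        "&sid=tyy%2C4io&start={start}"
        "&ref=8aef4a5f-3429-45c9-8b0e-41b05a9e7d28&ajax=true")


def page_urls(last_start, step=20):
    """Return one URL per page, with start = 1, 1 + step, ... up to last_start."""
    return [BASE.format(start=start) for start in range(1, last_start + 1, step)]


urls = page_urls(2450)
print(len(urls))   # number of GET requests the loop would make
print(urls[1])     # second page, start=21
```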
You can get full source of the page and do all the analysis based on it:
page_text = driver.page_source
The page source contains the current content, including whatever was generated by JavaScript. Be careful to read it only once all the rendering is completed (for example, you may wait for the presence of some string that gets rendered at the end).
