How to scrape data for mobile reviews on Flipkart? - Python

How to scrape mobile review data from Flipkart?
I tried using the Selenium package but was only able to extract one review instead of all of them at once, so can anyone help me with the code...
fk_path = ('https://www.flipkart.com/moto-g-turbo-white-16-gb/product-'
           'reviews/itmecc4uhbue7ve6?pid=MOBECC4UQTJ5QZFR')
from selenium import webdriver
browser = webdriver.Chrome('/home/subhasis/chromedriver')
browser.get(fk_path)
browser.find_element_by_xpath("//span[@class='_1EPkIx']/span").click()
# Mimic clicking on 'Read More'
[p.click() for p in browser.find_elements_by_xpath("//span[@class='_1EPkIx']/span")]  # Expand all 'Read More' buttons
browser.find_element_by_xpath("//div[@class='_3DCdKt']//div[@class='qwjRop']/div").text
# Extract text from the respective XPath (1st review)

Try opening a browser like Firefox or Chrome and checking the XPath selection in the developer console:
$x('//div[@class="col"]')
$x('//div[@class="col"]/*/*/p/text()')
Consider giving the browser some time to load all of the extra JavaScript before going through and clicking so quickly. This also prevents timeouts that can occur when you get blocked for making too many requests too fast. Consider sleeping between each "read more" click:
time.sleep(1)
The reason is that it looks like clicking "read more" makes a network request.
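A minimal sketch of that idea, reusing the browser session from the question (the class names are copied from the question's XPaths; they are auto-generated and may have changed since):
import time

# Pause between 'Read More' clicks so each network request can finish
# before the next click fires.
for button in browser.find_elements_by_xpath("//span[@class='_1EPkIx']/span"):
    button.click()
    time.sleep(1)

# With every review expanded, collect all of them in one pass instead of
# just the first.
reviews = [r.text for r in browser.find_elements_by_xpath(
    "//div[@class='_3DCdKt']//div[@class='qwjRop']/div")]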

Related

Selenium WebDriver didn't return the page source that I am interested in

I am trying to scrape a website, which I believe is a dynamic website:
https://publicrecordsaccess.fultoncountyga.gov/Portal/Home/WorkspaceMode?p=0
If the link needs you to search again, here is what I put for the Search Criteria: 16ED*, then click Advanced Filtering Options to choose Filter by File Date Start: 01/01/2016 and Filter by File Date End: 01/08/2016, then click Submit. You will land on the page that I need to scrape.
I used the Selenium Chrome WebDriver for this task, but every time I load the link it shows me the page before any search criteria have been entered. So I paused the driver for 60 seconds to give myself enough time to navigate to the page I need to scrape. But for some reason, it still didn't capture the page source I need.
I used the developer tools to inspect the source rather than just viewing the page source, since Inspect is the only way to find the elements I need to scrape. Are there any ways to do this task, or did I do something wrong in my code?
import time

from bs4 import BeautifulSoup
from selenium import webdriver

web = 'https://publicrecordsaccess.fultoncountyga.gov/Portal/Home/WorkspaceMode?p=0'
op = webdriver.ChromeOptions()
# op.add_argument('headless')
driver = webdriver.Chrome(options=op, executable_path="/usr/local/bin/chromedriver")
driver.get(web)
time.sleep(60)
plain_text = driver.page_source
soup = BeautifulSoup(plain_text, 'html.parser')
Update to the post/question:
The URL: https://publicrecordsaccess.fultoncountyga.gov/Portal/Home/Dashboard/29
I need to put in the information for the search criteria in order to go to the page that I am interested in.
Here is what I put for the Search Criteria: 16ED*, then click Advanced Filtering Options to choose Filter by File Date Start: 01/01/2016 and Filter by File Date End: 01/08/2016, then click Submit. You will land on the page that I need to scrape.
My goal is to scrape the Case Number, Style/Defendant, File Date, Type, Status, and the corresponding URL for each case, then store them in a data frame. I can see all the information when I use Developer Tools and click Inspect in Google Chrome. Since I know it is a dynamic website, I wrote code that uses Selenium to scrape it.
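For instance, a sketch of that flow using explicit waits instead of time.sleep(60). The element IDs below are hypothetical placeholders; the real ones have to be taken from Inspect on the live page:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome(executable_path="/usr/local/bin/chromedriver")
driver.get('https://publicrecordsaccess.fultoncountyga.gov/Portal/Home/Dashboard/29')
wait = WebDriverWait(driver, 30)

# Wait for the search box to exist before typing into it.
search_box = wait.until(EC.presence_of_element_located(
    (By.ID, 'caseCriteria_SearchCriteria')))  # hypothetical ID
search_box.send_keys('16ED*')
driver.find_element_by_id('btnSSSubmit').click()  # hypothetical ID

# Wait for the results grid to render, then hand the DOM to BeautifulSoup.
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'table')))
soup = BeautifulSoup(driver.page_source, 'html.parser')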

Python and Selenium pull up a dialog box instead of the automatic download I get when I manually click on the link

I'm using Python to access the SEC's website for 10-K downloadable spreadsheets. I created code that requests user input for a Stock Ticker Symbol, successfully opens Firefox, accesses the Edgar search page at https://www.sec.gov/edgar/searchedgar/companysearch.html, and inputs the correct ticker symbol. The problem is downloading the spreadsheet automatically and saving it.
Right now, I can manually click on "View Excel Spreadsheet" and the spreadsheet downloads automatically. But when I run my Python code, I get a dialog box from Firefox. I've set Firefox to download automatically, and I've tried using 'find_element_by_xpath' and 'find_element_by_css_selector', but neither simply downloads the file; both merely call up the same dialog box. I tried 'find_element_by_link_text' and got an error message about not being able to find "view Excel Spreadsheet". My example ticker symbol was CAT for Caterpillar (NYSE: CAT). My code is below:
import selenium.webdriver.support.ui as ui
from pathlib import Path
import selenium.webdriver as webdriver
import time

ticker = input("please provide a ticker symbol: ")

# can do this other ways, but will create a function to do this
def get_edgar_results(ticker):
    url = "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=" + str(ticker) + "&type=10-k&dateb=20200501&count=20"
    # define variable that opens Firefox via my executable path for geckodriver
    driver = webdriver.Firefox(executable_path=r"C:\Program Files\JetBrains\geckodriver.exe")
    # timers to wait for the webpage to open and display the page itself
    wait = ui.WebDriverWait(driver, 40)
    driver.set_page_load_timeout(40)
    driver.get(url)
    # timers to have the page wait for everything to load.
    # seemed that the total amount of time was necessary; not sure about these additional lines
    driver.set_page_load_timeout(50)
    wait = ui.WebDriverWait(driver, 50)
    time.sleep(30)
    # actual code to search the resulting page for the button to click and access the Excel document for download
    annual_links = driver.find_element_by_xpath('//*[@id="interactiveDataBtn"]')
    annual_links.click()
    # need to download the Excel spreadsheet itself in the "financial report"
    driver.set_page_load_timeout(50)
    wait = ui.WebDriverWait(driver, 50)
    excel_sheet = driver.find_element_by_xpath('/html/body/div[5]/table/tbody/tr[1]/td/a[2]')
    excel_sheet.click()
    # I'm setting the resulting dialog box to open and download automatically from now on. If I want to change it back,
    # I'll need to use this page: https://support.mozilla.org/en-US/kb/change-firefox-behavior-when-open-file
    # Testing showed that the dialog box "open as" probably suits my needs better than 'save'.
    driver.close()
    driver.quit()

get_edgar_results(ticker)
Any help or suggestions are greatly appreciated. Thanks!!
This is not so much a recommendation based on your actual code, or how Selenium works, but more general advice when trying to gather information from the web.
Given the opportunity, accessing a website through its API is far more friendly to programming than attempting the same task through Selenium. When you use Selenium for web scraping, websites very often do not behave the same way they do in a normal browser, for any number of reasons, not the least of which is that some websites intentionally block automated browsers like Selenium.
In this case, EDGAR SEC provides an HTTPS access service through which you should be able to get the information you're looking for.
Without digging too deeply into this data, it should not be tremendously difficult to instead request this information with an HTTP library like requests and save it that way:
import requests

# The SEC expects automated clients to identify themselves, so send a
# descriptive User-Agent with the request.
result = requests.get(
    "https://www.sec.gov/Archives/edgar/data/18230/000001823020000214/Financial_Report.xlsx",
    headers={"User-Agent": "your-name your-email@example.com"})
with open("file.xlsx", "wb") as excelFile:
    excelFile.write(result.content)
The only difficulty comes with getting the stock ticker's CIK to build the above URL, but that shouldn't be too hard with the same API information.
The EDGAR website fairly transparently exposes you to its data through its URLs. You can bypass all the Selenium weirdness and instead just build the URL, and request the information directly without loading all the JavaScript, etc.
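For example, the ticker-to-CIK step can probably be done with a single request. A sketch, assuming the company_tickers.json mapping the SEC publishes keeps its current shape:
import requests

# SEC publishes a ticker-to-CIK mapping; the field names below are
# assumed from its current shape, so verify against a live response.
resp = requests.get(
    "https://www.sec.gov/files/company_tickers.json",
    headers={"User-Agent": "your-name your-email@example.com"})
mapping = resp.json()

# Entries look like {"0": {"cik_str": 18230, "ticker": "CAT", "title": "CATERPILLAR INC"}, ...}
cik = next(v["cik_str"] for v in mapping.values() if v["ticker"] == "CAT")
print(cik)  # 18230, the CIK used in the Financial_Report.xlsx URL above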
EDIT: You can also browse through this information in a more programmatic fashion. The link above mentions that each directory within edgar/full-index also provides a JSON file, which is easily computer-readable. So you could request https://www.sec.gov/Archives/edgar/full-index/index.json, parse out the year you want, request that year, parse out the quarter you want, request that quarter, then parse out the company you want and request that company's information, and so on.
For instance, to get the CIK number of Caterpillar, you would get and parse the company.gz file from https://www.sec.gov/Archives/edgar/full-index/2020/QTR4.json, parse it into a dataframe, find the line with CATERPILLAR INC on it, find the CIK and accession numbers from the associated .txt file, and then find the right URL to download their Excel file. A bit circuitous, but if you can work out a way to skip straight to the CIK number, you can cut down on the number of requests needed.
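As a rough sketch of that walk (the field names below are assumed from the directory-listing JSON that EDGAR serves; verify them against a live response):
import requests

headers = {"User-Agent": "your-name your-email@example.com"}  # SEC asks clients to identify themselves
base = "https://www.sec.gov/Archives/edgar/full-index"

# The top-level index lists one directory entry per year.
top = requests.get(base + "/index.json", headers=headers).json()
years = [entry["name"] for entry in top["directory"]["item"]]

# Each quarter's index lists the downloadable index files (company.idx, master.idx, ...).
qtr = requests.get(base + "/2020/QTR4/index.json", headers=headers).json()
files = [entry["name"] for entry in qtr["directory"]["item"]]
print(years[-1], files)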

Using Python to scrape LinkedIn connections but only some show up - Selenium and BeautifulSoup

So the title says it all. I am trying to scrape the connections based upon a search term I supply. Once the page renders, not all of the connections are in the HTML; it is as if they are hidden until I scroll down to see them. Is there a way to use Selenium to show all of the connections at once? I have no code to post since this is only a question.
You can use selenium to scroll down the page, loading the data you intend to grab.
The code below will scroll to the bottom of the page:
...
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
...
I've had a use case where I had to hit the bottom of the page several times in a row to load the content and get all the data I needed, and I used the approach above for it.
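For example, a common pattern is to scroll in a loop until the page height stops growing (assuming driver is your already-logged-in Selenium session):
import time

# Keep scrolling until the page height stops growing, i.e. no more
# connections are being lazy-loaded.
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the new batch of results time to render
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height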
Hope this helps...

How to use Python to scrape all the table contents on this website which is written by AJAX?

https://www.fedsdatacenter.com/federal-pay-rates/index.php?y=2017&n=&l=&a=&o=
This website seems to be built with jQuery (AJAX). I would like to scrape the tables on all of its pages. When I inspect the page tags 1, 2, 3, 4, they do not have a specific href link. Besides, clicking on them does not produce a clear pattern of GET requests, so I find it hard to use Python's urllib to send a GET request for each page.
You can use Selenium with Python (http://selenium-python.readthedocs.io/) to navigate through the pages. I would find the Next button, .click() it, then time.sleep(seconds) and scrape the page. I can't navigate to the last page on this site, unfortunately (it seems broken, which you should also be aware of), but I'm assuming the Next button disappears or is disabled when you reach the last page. If not, you might want to save what you've scraped every time you go to a new page; this way you don't lose your data in the event of an error.
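A rough sketch of that loop, assuming driver is an open Selenium session on the site (the row selector is a generic placeholder; take the real one from the page's markup):
import time

all_rows = []
while True:
    # Scrape the currently visible page first, so nothing is lost if a
    # later page errors out.
    for row in driver.find_elements_by_css_selector("table tbody tr"):
        all_rows.append(row.text)
    # Paginators of this kind usually disable 'Next' on the last page.
    next_button = driver.find_element_by_link_text("Next")
    if "disabled" in (next_button.get_attribute("class") or ""):
        break
    next_button.click()
    time.sleep(2)  # give the AJAX call time to repopulate the table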

Access widget window beautifulsoup python mechanize

I am trying to scrape information off websites like this:
https://www.glassdoor.com/Overview/Working-at-7-Eleven-EI_IE3581.11,19.htm
using python + beautifulsoup + mechanize.
Accessing anything on the main site is no problem. However, I also need the information that appears in an overlay window when one clicks on the "Rating Trends" button next to the bar with stars.
This overlay window can also be accessed directly by using the URL:
https://www.glassdoor.com/Reviews/7-Eleven-Reviews-E3581.htm#trends-overallRating
The HTML associated with this page is a modification of the original site's HTML.
However, regardless of which element I try to find (via findAll) on that overlay-window page, BeautifulSoup returns zero hits.
How can I fix this? I tried adding a sleep between accessing the website and reading anything in, to no avail.
Thanks!
If you're using the Chrome browser, select the background of that page (without the additional information displayed), choose 'Inspect' from the context menu (on Windows, anyway), and then open the 'Network' tab so that you can see network traffic. Now click on 'Rating Trends'. The entry marked 'xhr' will be https://www.glassdoor.ca/api/employer/3581-rating.htm?locationStr=&jobTitleStr=&filterCurrentEmployee=false&filterEmploymentStatus=REGULAR&filterEmploymentStatus=PART_TIME (I much hope!) and its contents will be the following:
{"employerId":3581,"ratings":[{"hasRating":true,"type":"overallRating","value":2.9},{"hasRating":true,"type":"ceoRating","value":0.54},{"hasRating":true,"type":"bizOutlook","value":0.35},{"hasRating":true,"type":"recommend","value":0.4},{"hasRating":true,"type":"compAndBenefits","value":2.4},{"hasRating":true,"type":"cultureAndValues","value":2.5},{"hasRating":true,"type":"careerOpportunities","value":2.5},{"hasRating":true,"type":"workLife","value":2.4},{"hasRating":true,"type":"seniorManagement","value":2.3}],"week":0,"year":0}
Whether this URL can be altered for use in obtaining information for other employers, I regret, I cannot tell you.
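For this URL at least, you can skip BeautifulSoup and mechanize entirely and fetch the JSON with requests; a sketch based on the response shown above:
import requests

url = ("https://www.glassdoor.ca/api/employer/3581-rating.htm"
       "?locationStr=&jobTitleStr=&filterCurrentEmployee=false"
       "&filterEmploymentStatus=REGULAR&filterEmploymentStatus=PART_TIME")
# A browser-like User-Agent helps, since many sites refuse bare scripted requests.
resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
ratings = {r["type"]: r["value"] for r in resp.json()["ratings"]}
print(ratings["overallRating"])  # 2.9 in the sample response above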
