web scraping a site without direct access

Any help is appreciated in advance.
I have been trying to scrape data from this website (https://www.mptax.mp.gov.in/mpvatweb/leftMenu.do), but direct access to that page is not possible. Instead of the data I need, I get an "invalid access" response. To reach the page I must first go to https://www.mptax.mp.gov.in/mpvatweb/index.jsp, hover over 'dealer information', and click 'dealer search' in the dropdown menu.
I am looking for a solution in Python.
Here is what I tried. I have just started web scraping:
import requests
from bs4 import BeautifulSoup
with requests.session() as request:
    MAIN="https://www.mptax.mp.gov.in/mpvatweb/leftMenu.do"
    INITIAL="https://www.mptax.mp.gov.in/mpvatweb/"
    page=request.get(INITIAL)
    jsession=page.cookies["JSESSIONID"]
    print(jsession)
    print(page.headers)
    result=request.post(INITIAL,headers={"Cookie":"JSESSIONID="+jsession+"; zoomType=0","Referer":INITIAL})
    page1=request.get(MAIN,headers={"Referer":INITIAL})
    soup=BeautifulSoup(page1.content,'html.parser')
    data=soup.find_all("tr",class_="whitepapartd1")
    print(data)
What I want is to scrape data about firms based on their firm name.
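A requests-only sketch of the same idea: visit the entry page first so the session cookie is set before the protected page is requested. requests.Session stores cookies from earlier responses and sends them automatically, so the Cookie header does not need to be built by hand. The helper name fetch_via_entry_page is made up for illustration, and whether this particular site accepts the request this way is not confirmed here:

```python
def fetch_via_entry_page(session, entry_url, target_url):
    """Open entry_url first, then request target_url on the same session.

    `session` is expected to behave like requests.Session: cookies set by
    the first response are stored on it and sent with the second request.
    """
    # First hit: lets the server set its cookies (such as JSESSIONID).
    session.get(entry_url)
    # Second hit: the Referer header says we arrived from the entry page.
    return session.get(target_url, headers={"Referer": entry_url})
```

With the names from the code above it would be called as fetch_via_entry_page(requests.Session(), INITIAL, MAIN). If the site only accepts navigation triggered through its JavaScript menu, a plain HTTP client may still be rejected and a browser-driven approach is the fallback.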

Thanks for pointing me in the right direction, @Arnav and @Arman. Here is the final code:
from selenium import webdriver #to work with the website
from bs4 import BeautifulSoup #to scrape data
from selenium.webdriver.common.action_chains import ActionChains #to initiate hovering
from selenium.webdriver.common.keys import Keys #to input value
PROXY = "10.3.100.207:8080" # IP:PORT or HOST:PORT
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server=%s' % PROXY)
#ask for input
company_name=input("tell the company name")
#open the website
browser = webdriver.Chrome(chrome_options=chrome_options)
browser.get("https://www.mptax.mp.gov.in/mpvatweb/")
#hover over the menu entry to reveal the dropdown
element_to_hover_over = browser.find_element_by_css_selector("#mainsection > form:nth-child(2) > table:nth-child(1) > tbody:nth-child(1) > tr:nth-child(3) > td:nth-child(3) > a:nth-child(1)")
hover = ActionChains(browser).move_to_element(element_to_hover_over)
hover.perform()
#click on dealer search from dropdown menu
browser.find_element_by_css_selector("#dropmenudiv > a:nth-child(1)").click()
#we are now on the leftmenu page
#click on radio button
browser.find_element_by_css_selector("#byName").click()
#input company name
inputElement = browser.find_element_by_css_selector("#showNameField > td:nth-child(2) > input:nth-child(1)")
inputElement.send_keys(company_name)
#submit form
inputElement.submit()
#now we are on dealerssearch page
#scrape data
soup=BeautifulSoup(browser.page_source,"lxml")
#get the list of cells we need
cells=soup.find_all('td',class_="tdBlackBorder")
#check the length of 'cells' and on that basis decide what to print
if(len(cells)!=0):
    #company name at index=9
    #tin no. at index=10
    #registration status at index=11
    #circle name at index=15
    #store the values
    name=cells[9].get_text()
    tin=cells[10].get_text()
    status=cells[11].get_text()
    circle=cells[15].get_text()
    #make dictionary
    Company_Details={"TIN":tin ,"Firm name":name ,"Circle_Name":circle, "Registration_Status":status}
    print(Company_Details)
else:
    Company_Details={"VAT RC No":"Not found in database"}
    print(Company_Details)
#close chrome
browser.stop_client()
browser.close()
browser.quit()
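The fixed positions 9, 10, 11 and 15 used above raise an IndexError if the results table ever has some cells but fewer than expected. A small sketch of a guard, working on plain strings so it can be tried without a browser. The helper name extract_company_details and the FIELD_INDEXES mapping are illustrative; only the index values and dictionary keys come from the code above:

```python
# Positions of the wanted cells, taken from the comments in the code above.
FIELD_INDEXES = {
    "Firm name": 9,
    "TIN": 10,
    "Registration_Status": 11,
    "Circle_Name": 15,
}

def extract_company_details(cell_texts):
    """Map a flat list of cell strings to a details dictionary.

    Returns the "not found" dictionary when the list is too short to
    contain every expected position.
    """
    needed = max(FIELD_INDEXES.values()) + 1
    if len(cell_texts) < needed:
        return {"VAT RC No": "Not found in database"}
    return {field: cell_texts[i].strip() for field, i in FIELD_INDEXES.items()}
```

It could be fed with the text of each matching cell, for example [td.get_text() for td in soup.find_all('td', class_="tdBlackBorder")].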

Would you mind using a browser?
You can drive a browser and click the link at the XPath //*[@id="dropmenudiv"]/a[1].
You might have to download chromedriver and put it in the directory used below if you haven't used chromedriver before. You can also use selenium + phantomjs if you want headless browsing (without a browser window opening each time).
from selenium import webdriver
xpath = '//*[@id="dropmenudiv"]/a[1]'
browser = webdriver.Chrome('/usr/local/bin/chromedriver')
browser.set_window_size(1120,550)
browser.get('https://www.mptax.mp.gov.in/mpvatweb')
link = browser.find_element_by_xpath(xpath)
link.click()
url = browser.current_url

Related

https://www.realestate.com.au/ not permitting web scraping?

I am trying to extract data from https://www.realestate.com.au/
First I create my url based on the type of property that I am looking for and then I open the url using selenium webdriver, but the page is blank!
Any idea why it happens? Is it because this website doesn't provide web scraping permission? Is there any way to scrape this website?
Here is my code:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
PostCode = "2153"
propertyType = "house"
minBedrooms = "3"
maxBedrooms = "4"
page = "1"
url = "https://www.realestate.com.au/sold/property-{p}-with-{mib}-bedrooms-in-{po}/list-{pa}?maxBeds={mab}&includeSurrounding=false".format(p = propertyType, mib = minBedrooms, po = PostCode, pa = page, mab = maxBedrooms)
print(url)
# url should be "https://www.realestate.com.au/sold/property-house-with-3-bedrooms-in-2153/list-1?maxBeds=4&includeSurrounding=false"
driver = webdriver.Edge("./msedgedriver.exe") # edit the address to where your driver is located
driver.get(url)
time.sleep(3)
src = driver.page_source
soup = BeautifulSoup(src, 'html.parser')
print(soup)
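As a side note on building the search URL: a sketch that keeps the path and the query string separate and lets urllib.parse.urlencode assemble the query. The function name build_sold_url is hypothetical, and the URL pattern is simply the one from the code above, not a documented API of the site:

```python
from urllib.parse import urlencode

BASE = "https://www.realestate.com.au/sold"

def build_sold_url(property_type, min_beds, max_beds, postcode, page=1):
    """Assemble the listing URL from its parts."""
    path = (
        f"{BASE}/property-{property_type}-with-{min_beds}"
        f"-bedrooms-in-{postcode}/list-{page}"
    )
    # urlencode converts each value to a string and escapes it if needed.
    query = urlencode({"maxBeds": max_beds, "includeSurrounding": "false"})
    return f"{path}?{query}"
```

Calling build_sold_url("house", 3, 4, "2153") reproduces the URL shown in the comment above.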

You are passing the link incorrectly. Try this:
driver.get("your link")
api - https://selenium-python.readthedocs.io/api.html?highlight=get#:~:text=ef_driver.get(%22http%3A//www.google.co.in/%22)

I tried to access realestate.com.au through selenium and, in a different use case, through scrapy.
I even got results from scrapy by using a proper user-agent and cookie, but after a few days realestate.com.au detected selenium / scrapy and blocked the requests.
Additionally, it is clearly written in their terms & conditions that indexing any content on their website is strictly prohibited.
You can find more information / analysis in these questions:
Chrome browser initiated through ChromeDriver gets detected
selenium isn't loading the page
Bottom line: you have to get past their security if you want to scrape the content.

How to select particular region and scrape all the Jobs from a website

I am trying to scrape all the jobs from a job portal after selecting a particular country.
I am sorry to attach a picture, but the intent is to show what the page looks like.
What I tried:
Below is what I tried, but I am not getting anything. I have just started learning web scraping.
import requests
from bs4 import BeautifulSoup
job_url = 'https://wd3.myworkdayjobs.com/careers/'
out_req = requests.get(job_url)
soup = BeautifulSoup(out_req.text, 'html.parser')
print(soup)
urls = []
for link in soup.find_all('a'):
    print(link.get('href'))
Any help will be much appreciated.

Try the selenium library: select the filter based on its attributes, then scrape the search results with Beautiful Soup.
from selenium import webdriver
# the browser driver is an executable file;
# Selenium invokes that executable, which in turn launches the actual browser
driver = webdriver.Chrome(executable_path="C:\\chromedriver.exe")
# maximize the browser window
driver.maximize_window()
# launch the URL
driver.get("Website")
# refresh the browser
driver.refresh()
# collect the checkboxes, identified by their type attribute, in a list
chk = driver.find_elements_by_xpath("//input[@type='checkbox']")
# len gives the size of that list
print(len(chk))
# get_attribute reads the value attribute
for i in chk:
    if i.get_attribute("value") == "United states of America":
        i.click()
# close the browser
driver.close()
#############################
#Beautiful soup code here
#############################
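The checkbox loop above can be wrapped in a small helper that clicks only the matching box, skips it if it is already selected, and reports whether a match was found. This is a sketch: click_checkbox_by_value is a made-up name, and it relies only on the WebElement methods get_attribute, is_selected and click. Ignoring case and surrounding spaces is an assumption about how the page writes the country name:

```python
def click_checkbox_by_value(checkboxes, wanted):
    """Click the first checkbox whose value matches `wanted`.

    Returns True if a matching checkbox exists, False otherwise.
    """
    target = wanted.strip().lower()
    for box in checkboxes:
        # get_attribute may return None when the attribute is missing.
        value = (box.get_attribute("value") or "").strip().lower()
        if value == target:
            if not box.is_selected():
                box.click()
            return True
    return False
```

With the list from the code above it would be called as click_checkbox_by_value(chk, "United states of America"); a False result is a hint that the value on the page differs from what was expected.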

Why do I only get first page data when using selenium?

I want to crawl reviews from IMDb using Python. The page only displays 25 reviews until I click the "load more" button. I use the Python package selenium to click the "load more" button automatically, and the clicking succeeds. But why can't I get the data loaded after "load more"? I just get the first 25 reviews repeatedly.
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import time
seed = 'https://www.imdb.com/title/tt4209788/reviews'
movie_review = requests.get(seed)
PATIENCE_TIME = 60
LOAD_MORE_BUTTON_XPATH = '//*[@id="browse-itemsprimary"]/li[2]/button/span/span[2]'
driver = webdriver.Chrome('D:/chromedriver_win32/chromedriver.exe')
driver.get(seed)
while True:
    try:
        loadMoreButton = driver.find_element_by_xpath("//button[@class='ipl-load-more__button']")
        review_soup = BeautifulSoup(movie_review.text, 'html.parser')
        review_containers = review_soup.find_all('div', class_ ='imdb-user-review')
        print('length: ',len(review_containers))
        for review_container in review_containers:
            review_title = review_container.find('a', class_ = 'title').text
            print(review_title)
        time.sleep(2)
        loadMoreButton.click()
        time.sleep(5)
    except Exception as e:
        print(e)
        break
print("Complete")
I want all the reviews, but now I can only get the first 25.

You have several issues in your script. Hardcoded waits are inconsistent and the least reliable option. Because your scraping logic sits inside the while True: loop, it slows parsing down by collecting the same items over and over again. Moreover, every title comes with a large run of surrounding whitespace in the output, which needs to be stripped. I've slightly changed your script to reflect these suggestions.
Try this to get the required output:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
URL = "https://www.imdb.com/title/tt4209788/reviews"
driver = webdriver.Chrome()
wait = WebDriverWait(driver,10)
driver.get(URL)
soup = BeautifulSoup(driver.page_source, 'lxml')
while True:
    try:
        driver.find_element_by_css_selector("button#load-more-trigger").click()
        wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR,".ipl-load-more__load-indicator")))
        soup = BeautifulSoup(driver.page_source, 'lxml')
    except Exception:
        break
for elem in soup.find_all(class_='imdb-user-review'):
    name = elem.find(class_='title').get_text(strip=True)
    print(name)
driver.quit()

Your code is fine. Great even. But, you never fetch the 'updated' HTML for the web page after hitting the 'Load More' button. That's why you are getting the same 25 reviews listed all the time.
When you use Selenium to control the web browser, you are clicking the 'Load More' button. This triggers an XHR request (more commonly called an AJAX request) that you can see in the 'Network' tab of your web browser's developer tools.
The bottom line is that JavaScript (which is run in the web browser) updates the page. But in your Python program, you only get the HTML once for the page statically using the Requests library.
seed = 'https://www.imdb.com/title/tt4209788/reviews'
movie_review = requests.get(seed) #<-- SEE HERE? This is always the same HTML. You fetched it once at the beginning.
PATIENCE_TIME = 60
To fix this problem, you need to use Selenium to get the innerHTML of the div box containing the reviews. Then, have BeautifulSoup parse the HTML again. We want to avoid picking up the entire page's HTML again and again because it takes computation resources to have to parse that updated HTML over and over again.
So, find the div on the page that contains the reviews, and parse it again with BeautifulSoup. Something like this should work:
while True:
    try:
        allReviewsDiv = driver.find_element_by_xpath("//div[@class='lister-list']")
        allReviewsHTML = allReviewsDiv.get_attribute('innerHTML')
        loadMoreButton = driver.find_element_by_xpath("//button[@class='ipl-load-more__button']")
        review_soup = BeautifulSoup(allReviewsHTML, 'html.parser')
        review_containers = review_soup.find_all('div', class_ ='imdb-user-review')
        print('length: ',len(review_containers))
        for review_container in review_containers:
            review_title = review_container.find('a', class_ = 'title').text
            print(review_title)
        time.sleep(2)
        loadMoreButton.click()
        time.sleep(5)
    except Exception as e:
        print(e)
        break
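Because the loop above re-parses the whole review list on every pass, each pass prints the earlier titles again. One way to print only what is new is to remember what has already been seen. A minimal sketch on plain strings: unseen_titles is a hypothetical helper, and keying on the title text is a simplifying assumption (two different reviews with the same title would be merged):

```python
def unseen_titles(titles, seen):
    """Return the titles not yet in `seen`, in page order.

    `seen` is a set that is updated in place, so it can be passed to
    every pass of the loop.
    """
    fresh = []
    for title in titles:
        key = title.strip()  # drop the whitespace around each title
        if key and key not in seen:
            seen.add(key)
            fresh.append(key)
    return fresh
```

Inside the loop it could be used as: for title in unseen_titles([c.find('a', class_='title').text for c in review_containers], seen): print(title), with seen = set() created once before the loop.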

Clicking multiple items on one page using selenium

My main goal is to go to this specific website, click each of the products, have enough time to scrape the data from the clicked product, then go back and click another product until all the products have been clicked through and scraped (I have not included the scraping code).
My code opens Chrome, navigates to the desired website, and generates a list of links to click by class name. This is the part I am stuck on: I believe I need a for loop to iterate through the list of links, clicking each one and going back to the original page, but I can't figure out why this won't work.
Here is my code:
import csv
import time
from selenium import webdriver
import selenium.webdriver.chrome.service as service
import requests
from bs4 import BeautifulSoup
url = "https://www.vatainc.com/infusion/adult-infusion.html?limit=all"
service = service.Service('path to chromedriver')
service.start()
capabilities = {'chrome.binary': 'path to chrome'}
driver = webdriver.Remote(service.service_url, capabilities)
driver.get(url)
time.sleep(2)
links = driver.find_elements_by_class_name('product-name')
for link in links:
    link.click()
    driver.back()
    link.click()

I have another solution to your problem.
When I tested your code it showed strange behaviour. I fixed all the problems I had by using XPath.
url = "https://www.vatainc.com/infusion/adult-infusion.html?limit=all"
driver.get(url)
links = [x.get_attribute('href') for x in driver.find_elements_by_xpath("//*[contains(@class, 'product-name')]/a")]
htmls = []
for link in links:
    driver.get(link)
    htmls.append(driver.page_source)
Instead of going back and forth, I saved all the links (in the list named links) and iterated over that list.
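If the hrefs are collected from parsed HTML (for example with BeautifulSoup) rather than from Selenium, they may be relative, missing or repeated. A small sketch for cleaning such a list before looping over it, using urllib.parse.urljoin; normalise_links is a hypothetical name:

```python
from urllib.parse import urljoin

def normalise_links(base_url, hrefs):
    """Resolve hrefs against base_url and drop empties and duplicates.

    The order of first appearance is kept.
    """
    absolute = (urljoin(base_url, href) for href in hrefs if href)
    # dict.fromkeys keeps insertion order, so this de-duplicates in order.
    return list(dict.fromkeys(absolute))
```

The result can then be visited one by one with driver.get(link), exactly as in the loop above.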

Python Selenium Screen Scrape

I am trying to screen scrape a website (snippet below).
The website takes an input, navigates to a second page, takes more inputs, and finally displays a table. I fail at this step:
driver.find_element_by_xpath("//select[@id='agencies']/option[@value='13156']").click()
The error I get is:
selenium.common.exceptions.NoSuchElementException: Message: 'Unable to locate element:
Which is strange because I do see the element (using the commented-out "Display id tags" code). Any help/pointers, please?
(I also tried requests/RoboBrowser, but could not get the POST to work there either.)
from selenium import webdriver
from selenium import selenium
from bs4 import BeautifulSoup
driver = webdriver.Firefox()
url = 'http://www.ucrdatatool.gov/Search/Crime/Local/OneYearofData.cfm'
driver.get(url)
driver.find_element_by_xpath("//select[@id='state']/option[@value='1']").click()
#driver.find_element_by_xpath("//select[@id='groups']/option[@value='8']").click()
driver.find_element_by_xpath("//input[@type='submit' and @value='Next']").click()
driver.implicitly_wait(5) # seconds
# Display id tags
#elementsAll = driver.find_elements_by_xpath('//*[@id]')
#for elements in elementsAll:
#    print("id: ", repr(elements))
#    print("idName: ",elements.get_attribute("id"))
#    driver.implicitly_wait(5) # seconds
driver.find_element_by_xpath("//select[@id='groups']/option[@value='2']").click()
driver.find_element_by_xpath("//select[@id='year']/option[@value=1986]").click()
driver.find_element_by_xpath("//select[@id='agencies']/option[@value='13156']").click()
Update: the code below works with Selenium. I intended to choose all options in the list box and save the query results. Thanks for the pointer, Alecxe!
from selenium.webdriver.support.ui import Select
select = Select(driver.find_element_by_id('agencies'))
for options in select.options:
    select.select_by_visible_text(options.text)
select = Select(driver.find_element_by_id('groups'))
for options in select.options:
    select.select_by_visible_text(options.text)
driver.find_element_by_xpath("//select[@id='year']/option[@value=1985]").click()
driver.find_element_by_xpath("//input[@type='submit' and @value='Get Table']").click()

There is no option with the value 13156 in the select with id agencies. The values range from 102 to 522; you can see them by evaluating:
[element.get_attribute('value') for element in driver.find_elements_by_xpath('//select[@id="agencies"]/option')]
Also, instead of finding options by value, use Select and get options by text:
from selenium.webdriver.support.ui import Select
select = Select(driver.find_element_by_id('agencies'))
print(select.options)
select.select_by_visible_text('Selma Police Dept')
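To avoid the NoSuchElementException from the question when a value is not in the list, the available values can be checked before selecting. A sketch that uses only the options attribute and the select_by_value method of Selenium's Select class; select_value_if_present is a made-up helper name:

```python
def select_value_if_present(select, value):
    """Select the option with the given value if the list contains it.

    `select` is expected to behave like selenium's Select wrapper.
    Returns True when the option was selected, False when it is absent.
    """
    available = [opt.get_attribute("value") for opt in select.options]
    if value not in available:
        return False
    select.select_by_value(value)
    return True
```

For example, select_value_if_present(Select(driver.find_element_by_id('agencies')), '13156') would return False on the page described above instead of raising.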
