Web scrape with multiple inputs and collect Total Margin required - python

I have a web link as:
url = "zerodha.com/margin-calculator/SPAN"
The input parameters, with sample values for reference, are listed below:
Exchange - NFO
Product - Options
Symbol - DHFL 27-JUN-19
Option Type - Calls
Strike Price - 120
Net Qty appears automatically as 1500,
then use the SELL button and click the ADD button.
I want to collect the Total Margin required (in the above case it's Rs 49,308), which appears at the right-hand end.

You can just use requests. If you observe your network traffic, you can see that the page makes a POST request with the selected values as the payload. This is how I would do it:
from requests import Session

BASE_URL = 'https://zerodha.com/margin-calculator/SPAN'

# payload copied from the POST request seen in the browser's network tab
payload = {
    'action': 'calculate',
    'exchange[]': 'NFO',
    'product[]': 'FUT',
    'scrip[]': 'DHFL19AUG',
    'option_type[]': 'CE',
    'strike_price[]': 120,
    'qty[]': 4000,
    'trade[]': 'sell',
}

session = Session()
res = session.post(BASE_URL, data=payload)
data = res.json()
print(data)
I got the URL and payload by observing the network traffic. This is what you will get back as JSON data.
(Screenshots of the results in Chrome and Python omitted.)
Just observe how Chrome or Firefox sends and receives the data, and reverse-engineer it with your requests.
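Once print(data) shows you the JSON structure, you can index straight into it for the total margin. The key names below are assumptions made purely for illustration (the real response may nest them differently), so adjust the lookups to whatever print(data) actually shows:

# A minimal sketch with hypothetical key names -- inspect print(data)
# first and change the lookups to match the real response structure.
totals = data.get('total', {})        # hypothetical top-level section of the response
total_margin = totals.get('total')    # hypothetical field holding the combined margin
print('Total margin required:', total_margin)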

The page renders the table data dynamically. You should try the Selenium automation library; it lets you scrape pages whose data is rendered dynamically (via JS or AJAX).
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
import time
driver = webdriver.Chrome("/usr/bin/chromedriver")
driver.get("https://zerodha.com/margin-calculator/SPAN")
# select exchange option of NFO
exchange = driver.find_element_by_name('exchange[]')
exchange.send_keys("NFO")
# select the product: Options (OPT)
product = driver.find_element_by_name('product[]')
product.send_keys("OPT")
# select symbol by option value
symbol = Select(driver.find_element_by_name("scrip[]"))
symbol.select_by_value("DHFL19JUN")
# select option type: Calls (CE)
optionType = driver.find_element_by_name('option_type[]')
optionType.send_keys("CE")
#add Strike price
strikePrice = driver.find_element_by_name('strike_price[]')
strikePrice.clear()
strikePrice.send_keys("120")
# add Net quantity
netQty = driver.find_element_by_name('qty[]')
netQty.clear()
netQty.send_keys("1500")
# select sell radio button
driver.find_elements_by_css_selector("input[name='trade[]'][value='sell']")[0].click()
#submit form
submit = driver.find_element_by_css_selector("input[type='submit'][value='Add']")
submit.click()
time.sleep(2)
# scrape margin
margin = driver.find_element_by_css_selector(".val.total")
print(margin.text)
where '/usr/bin/chromedriver' is the path to the Selenium Chrome web driver.
Download selenium web driver for chrome browser:
http://chromedriver.chromium.org/downloads
Install web driver for chrome browser:
https://christopher.su/2015/selenium-chromedriver-ubuntu/
Selenium tutorial:
https://selenium-python.readthedocs.io/

Related

Python log in website once for multiple site scraping

I want to scrape data from a website.
The Python code below works fine for one id, but it logs in once for each id, so downloading data for 100 ids would require logging in 100 times.
Is there any way to log in just once, followed by a scraping loop that downloads data for multiple ids?
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup

# URL and id ---------------------------------
company = 313055
url = 'https://www.capitaliq.com/CIQDotNet/Estimates/CIQ/consensus.aspx?CompanyId={}&dataVendorId=1'
bot = webdriver.Chrome(ChromeDriverManager().install())
bot.get(url.format(company))

# Log in -------------------------------------
bot.find_element_by_id('username').send_keys("")
pwd = bot.find_element_by_id('password')
pwd.send_keys('')
pwd.send_keys(Keys.RETURN)

# Download data ------------------------------
soup = BeautifulSoup(bot.page_source, 'lxml')
UPDATED:
So the solution is to use the main page URL and log in once. Then put each company into the search box; this way you don't need to log in again and again.
# LOG ON TO Capitaliq.com
url = 'https://www.capitaliq.com'
bot = webdriver.Chrome(ChromeDriverManager().install())
bot.get(url)
bot.find_element_by_id('username').send_keys("")
pwd = bot.find_element_by_id('password')
pwd.send_keys('')
pwd.send_keys(Keys.RETURN)

tickers_list = [19049, 24937]
for ticker in tickers_list:
    print('**********', ticker)
    good = 0
    # Enter firm information into the search box
    bot.find_element_by_id('SearchTopBar').send_keys(ticker)
    bot.find_element_by_id('SearchTopBar').send_keys(Keys.RETURN)
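The loop above ends right after submitting the search; to actually download data for every ticker, the parsing step goes at the end of the same loop body. A minimal sketch, where the fixed sleep and the BeautifulSoup step are assumptions rather than part of the original code:

import time
from bs4 import BeautifulSoup

results = {}
for ticker in tickers_list:
    bot.find_element_by_id('SearchTopBar').send_keys(ticker)
    bot.find_element_by_id('SearchTopBar').send_keys(Keys.RETURN)
    time.sleep(3)  # crude wait for the company page; an explicit WebDriverWait would be more robust
    # the logged-in session is reused, so no extra login is needed per ticker
    soup = BeautifulSoup(bot.page_source, 'lxml')
    results[ticker] = soup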

Python Selenium scrape data when button "Load More" doesn't change URL

I am using the following code to attempt to keep clicking a "Load More" button until all page results are shown on the website:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

def startWebDriver():
    global driver
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument("--incognito")
    chrome_options.add_argument("--window-size=1920x1080")
    driver = webdriver.Chrome(options=chrome_options)

startWebDriver()
driver.get("https://together.bunq.com/all")
time.sleep(4)

while True:
    try:
        wait = WebDriverWait(driver, 10, 10)
        element = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "[title='Load More']")))
        element.click()
        print("Loading more page results")
    except:
        print("All page results displayed")
        break
However, since the button click does not change the URL, no new data is loaded into chromedriver and the while loop will break on the second iteration.
Selenium is overkill for this. You only need requests. Logging one's network traffic reveals that at some point JavaScript makes an XHR HTTP GET request to a REST API endpoint, the response of which is JSON and contains all the information you're likely to want to scrape.
One of the query-string parameters for that endpoint URL is page[offset], which is used to offset the query results for pagination (in this case, the "Load More" button). A value of 0 corresponds to no offset, or "start at the beginning". Increment this value to suit your needs; a loop is a natural place to do this (a sketch of such a loop follows the sample output below).
Simply imitate that XHR HTTP GET request - copy the API endpoint URL and query-string parameters and request headers, then parse the JSON response:
def get_discussions():
    import requests
    url = "https://together.bunq.com/api/discussions"
    params = {
        "include": "user,lastPostedUser,tags,firstPost",
        "page[offset]": 0
    }
    headers = {
        "user-agent": "Mozilla/5.0"
    }
    response = requests.get(url, params=params, headers=headers)
    response.raise_for_status()
    yield from response.json()["data"]

def main():
    for discussion in get_discussions():
        print(discussion["attributes"]["title"])
    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
⚑️What’s new in App Update 18.8.0
Local Currencies Accounts Fees
Local Currencies πŸ’Έ
Spanish IBANs Service Coverage πŸ‡ͺπŸ‡Έ
bunq Update 18 FAQ πŸ“š
Phishing and Spoofing - The new ways of scamming, explained πŸ‘€
Easily rent a car with your bunq credit card πŸš—
Giveaway - Hallo Deutschland! πŸ‡©πŸ‡ͺ
Giveaway - Hello Germany! πŸ‡©πŸ‡ͺ
True Name: Feedback πŸ’¬
True Name 🌈
What plans are available?
Everything about refunds πŸ’Έ
Identity verification explained! 🀳
When will I receive my payment?
Together Community Guidelines πŸ“£
What is my Tax Identification Number (TIN)?
How do I export a bank statement?
How do I change my contact info?
Large cash withdrael
If this is a new concept for you, I would suggest you look up tutorials on how to use your browser's developer tools (Google Chrome's Devtools, for example), how to log your browser's network traffic, REST APIs, HTTP, etc.
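To pull every page rather than just the first, increment page[offset] in a loop as described above. A minimal sketch, assuming the endpoint keeps accepting larger offsets and returns an empty data list once you run past the last page; the step size of 20 is a guess, so check how many items one response actually contains:

import requests

def get_all_discussions(step=20):
    url = "https://together.bunq.com/api/discussions"
    headers = {"user-agent": "Mozilla/5.0"}
    offset = 0
    while True:
        params = {
            "include": "user,lastPostedUser,tags,firstPost",
            "page[offset]": offset,
        }
        response = requests.get(url, params=params, headers=headers)
        response.raise_for_status()
        data = response.json()["data"]
        if not data:      # assumed stop condition: an empty page means we are done
            break
        yield from data
        offset += step    # assumed page size; adjust to the real number of items per response

for discussion in get_all_discussions():
    print(discussion["attributes"]["title"])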

https://www.realestate.com.au/ not permitting web scraping?

I am trying to extract data from https://www.realestate.com.au/
First I create my URL based on the type of property that I am looking for, and then I open the URL using the Selenium webdriver, but the page is blank!
Any idea why this happens? Is it because this website doesn't permit web scraping? Is there any way to scrape this website?
Here is my code:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
PostCode = "2153"
propertyType = "house"
minBedrooms = "3"
maxBedrooms = "4"
page = "1"
url = "https://www.realestate.com.au/sold/property-{p}-with-{mib}-bedrooms-in-{po}/list-{pa}?maxBeds={mab}&includeSurrounding=false".format(p = propertyType, mib = minBedrooms, po = PostCode, pa = page, mab = maxBedrooms)
print(url)
# url should be "https://www.realestate.com.au/sold/property-house-with-3-bedrooms-in-2153/list-1?maxBeds=4&includeSurrounding=false"
driver = webdriver.Edge("./msedgedriver.exe") # edit the address to where your driver is located
driver.get(url)
time.sleep(3)
src = driver.page_source
soup = BeautifulSoup(src, 'html.parser')
print(soup)
You are passing the link incorrectly; try
driver.get("your link")
API reference: https://selenium-python.readthedocs.io/api.html?highlight=get
I did try to access realestate.com.au through Selenium, and in a different use case through Scrapy.
I even got results from a Scrapy crawl by using a proper user-agent and cookie, but after a few days realestate.com.au detects Selenium / Scrapy and blocks the requests.
Additionally, it is clearly written in their terms & conditions that indexing any content on their website is strictly prohibited.
You can find more information / analysis in these questions:
Chrome browser initiated through ChromeDriver gets detected
selenium isn't loading the page
Bottom line: you would have to get past their bot detection if you want to scrape the content.
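As mentioned above, a realistic User-Agent (and, where needed, cookies copied from a real browser session) is usually the first thing such sites check. A minimal sketch with requests; the header values here are illustrative, and the site may still block you by other means:

import requests

headers = {
    # a browser-like User-Agent string; copy the exact one from your own browser's network tab
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Accept-Language": "en-AU,en;q=0.9",
}

url = "https://www.realestate.com.au/sold/property-house-with-3-bedrooms-in-2153/list-1?maxBeds=4&includeSurrounding=false"
response = requests.get(url, headers=headers)
print(response.status_code)  # the site may still return a block page despite the headers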

Creating POST request to scrape website with python where no network form data changes

I am scraping a website that renders dynamically with JavaScript. The URL doesn't change when hitting the > button, so I have been looking at the Network section of the inspector, specifically the "General" section for the "Request URL" and "Request Method", as well as the "Form Data" section, looking for any sort of ID unique to each successive page. However, when recording a log of clicking the > button from page to page, the "Form Data" seems to be the same each time (see images):
Currently my code doesn't incorporate this method, because I can't see it helping until I find a unique identifier in the "Form Data" section. However, I can show my code if it helps. In essence, it just pulls the first page of data over and over again in my while loop, even though I'm using a Selenium driver and calling driver.find_elements_by_xpath("xpath of > button").click() before trying to get the data with BeautifulSoup.
(Updated code, see comments)
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import pandas as pd

masters_list = []

def extract_info(html_source):
    # html_source will be the inner HTML of the table
    global lst
    soup = BeautifulSoup(html_source, 'html.parser')
    lst = soup.find('tbody').find_all('tr')[0]  # keep only the first row of the current page
    masters_list.append(lst)

chrome_driver_path = '/Users/Justin/Desktop/Python/chromedriver'
driver = webdriver.Chrome(executable_path=chrome_driver_path)
url = 'https://cryptoli.st/lists/fixed-supply'
driver.get(url)

loop = True
while loop:  # loop for extracting all 120 pages
    crypto_table = driver.find_element(By.ID, 'DataTables_Table_0').get_attribute('innerHTML')  # the crypto data table
    extract_info(crypto_table)
    paginate = driver.find_element(By.ID, "DataTables_Table_0_paginate")  # the table's pagination controls
    pages_list = paginate.find_elements(By.TAG_NAME, 'li')
    # click the next-arrow at the end, not the 2, 3, ... anchor links
    next_page_link = pages_list[-1].find_element(By.TAG_NAME, 'a')
    # check whether a next page is available
    if "disabled" in next_page_link.get_attribute('class'):
        loop = False
    pages_list[-1].click()  # if there is a next page, click it

df = pd.DataFrame(masters_list)
print(df)
df.to_csv("crypto_list.csv")
driver.quit()
I am using my own code to show how I am getting the table; I have added explanations as comments on the important lines.
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

def extract_info(html_source):
    soup = BeautifulSoup(html_source, 'html.parser')  # html_source will be the inner HTML of the table
    lst = soup.find('tbody').find_all('tr')
    for i in lst:
        # printing just the id because the id is set to the crypto name; do more scraping to get more info
        print(i.get('id'))

driver = webdriver.Chrome()
url = 'https://cryptoli.st/lists/fixed-supply'
driver.get(url)

loop = True
while loop:  # loop for extracting all 120 pages
    crypto_table = driver.find_element(By.ID, 'DataTables_Table_0').get_attribute('innerHTML')  # the crypto data table
    print(extract_info(crypto_table))
    paginate = driver.find_element(By.ID, "DataTables_Table_0_paginate")  # the table's pagination controls
    pages_list = paginate.find_elements(By.TAG_NAME, 'li')
    # click the next-arrow at the end, not the 2, 3, ... anchor links
    next_page_link = pages_list[-1].find_element(By.TAG_NAME, 'a')
    if "disabled" in next_page_link.get_attribute('class'):  # check whether a next page is available
        loop = False
    pages_list[-1].click()  # if there is a next page, click it
So the main answer to your question is: when you click the button, Selenium updates the page, and you can then use driver.page_source to get the updated HTML. Sometimes (not for this URL) a page fires an AJAX request that takes a while, so you have to wait until Selenium has loaded the full page.
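If the page did load its data through a slow AJAX call, an explicit wait is more reliable than a fixed sleep. A minimal sketch, assuming the table with id DataTables_Table_0 is the element worth waiting for:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the table to be present before reading its HTML
wait = WebDriverWait(driver, 10)
table = wait.until(EC.presence_of_element_located((By.ID, "DataTables_Table_0")))
extract_info(table.get_attribute("innerHTML"))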

web scraping a site without direct access

Any help is appreciated in advance.
The deal is, I have been trying to scrape data from this website (https://www.mptax.mp.gov.in/mpvatweb/leftMenu.do), but direct access to it is not possible. Instead of the data I need, I get an "invalid access" response. To reach the page I must go to https://www.mptax.mp.gov.in/mpvatweb/index.jsp and then click on 'Dealer Search' from the dropdown menu that appears while hovering over 'Dealer Information'.
I am looking for a solution in Python.
Here's something I tried; I have just started web scraping:
import requests
from bs4 import BeautifulSoup

with requests.session() as request:
    MAIN = "https://www.mptax.mp.gov.in/mpvatweb/leftMenu.do"
    INITIAL = "https://www.mptax.mp.gov.in/mpvatweb/"
    page = request.get(INITIAL)
    jsession = page.cookies["JSESSIONID"]
    print(jsession)
    print(page.headers)
    result = request.post(INITIAL, headers={"Cookie": "JSESSIONID=" + jsession + "; zoomType=0", "Referer": INITIAL})
    page1 = request.get(MAIN, headers={"Referer": INITIAL})
    soup = BeautifulSoup(page1.content, 'html.parser')
    data = soup.find_all("tr", class_="whitepapartd1")
    print(data)
The deal is, I want to scrape data about firms based on their firm name.
Thanks for showing me the way, @Arnav and @Arman; here's the final code:
from selenium import webdriver #to work with website
from bs4 import BeautifulSoup #to scrap data
from selenium.webdriver.common.action_chains import ActionChains #to initiate hovering
from selenium.webdriver.common.keys import Keys #to input value
PROXY = "10.3.100.207:8080" # IP:PORT or HOST:PORT
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server=%s' % PROXY)
#ask for input
company_name=input("tell the company name")
#import website
browser = webdriver.Chrome(chrome_options=chrome_options)
browser.get("https://www.mptax.mp.gov.in/mpvatweb/")
#perform hovering to reveal the dropdown menu
element_to_hover_over = browser.find_element_by_css_selector("#mainsection > form:nth-child(2) > table:nth-child(1) > tbody:nth-child(1) > tr:nth-child(3) > td:nth-child(3) > a:nth-child(1)")
hover = ActionChains(browser).move_to_element(element_to_hover_over)
hover.perform()
#click on dealer search from dropdown menu
browser.find_element_by_css_selector("#dropmenudiv > a:nth-child(1)").click()
#we are now on the leftmenu page
#click on radio button
browser.find_element_by_css_selector("#byName").click()
#input company name
inputElement = browser.find_element_by_css_selector("#showNameField > td:nth-child(2) > input:nth-child(1)")
inputElement.send_keys(company_name)
#submit form
inputElement.submit()
#now we are on dealerssearch page
#scrap data
soup=BeautifulSoup(browser.page_source,"lxml")
#get the list of values we need
list=soup.find_all('td',class_="tdBlackBorder")
#check the length of 'list' and on that basis decide what to print
if len(list) != 0:
    #company name at index=9
    #tin no. at index=10
    #registration status at index=11
    #circle name at index=15
    #store the values
    name = list[9].get_text()
    tin = list[10].get_text()
    status = list[11].get_text()
    circle = list[15].get_text()
    #make dictionary
    Company_Details = {"TIN": tin, "Firm name": name, "Circle_Name": circle, "Registration_Status": status}
    print(Company_Details)
else:
    Company_Details = {"VAT RC No": "Not found in database"}
    print(Company_Details)
#close the chrome
browser.stop_client()
browser.close()
browser.quit()
Would you mind using a browser?
You can use a browser and access the link at the XPath //*[@id="dropmenudiv"]/a[1].
You might have to download chromedriver and put it in the mentioned directory if you haven't used chromedriver before. You can also use Selenium + PhantomJS if you want to do headless browsing (without the browser window opening up each time); a headless-Chrome alternative is sketched after the code below.
from selenium import webdriver

xpath = '//*[@id="dropmenudiv"]/a[1]'
browser = webdriver.Chrome('/usr/local/bin/chromedriver')
browser.set_window_size(1120, 550)
browser.get('https://www.mptax.mp.gov.in/mpvatweb')
link = browser.find_element_by_xpath(xpath)
link.click()
url = browser.current_url
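For the headless browsing mentioned above, Chrome itself can run without a visible window via an options flag, as an alternative to PhantomJS. A minimal sketch; the chromedriver path is an assumption, and depending on your Selenium version you may need to pass the path differently:

from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")              # run without opening a browser window
chrome_options.add_argument("--window-size=1120,550")

# the driver path is an assumption; point it at your own chromedriver
browser = webdriver.Chrome('/usr/local/bin/chromedriver', options=chrome_options)
browser.get('https://www.mptax.mp.gov.in/mpvatweb')
print(browser.title)
browser.quit()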
