How to scrape a popup using Python and Selenium - python

I'm trying to scrape NGO data such as name, mobile number, city, etc. from https://ngodarpan.gov.in/index.php/search/. The site lists the names of the NGOs in a table, and clicking each name opens a popup page with the details. In my code below, I'm extracting the onclick attribute for each NGO and making a GET request followed by a POST request to extract the data. I've also tried accessing it using Selenium, but the JSON data never comes back.
list_of_cells = []
for cell in row.find_all('td'):
    text = cell.text.replace(" ", "")
    list_of_cells.append(text)
list_of_rows.append(list_of_cells)
writer = csv.writer(f)
writer.writerow(list_of_cells)
The portion above retrieves the full table details from every page; the site has 7,721 pages, and we can simply change the number_of_pages variable. But our real goal is each NGO's phone number and email address, which only appear after clicking the NGO name link. That link is not a plain href: clicking it fires an API GET request (for a CSRF token) followed by a POST request that fetches the data, as can be seen in the Network tab of the browser's inspector.
driver.get("https://ngodarpan.gov.in/index.php/search/")  # load the web page
sleep(2)
....
....
driver.find_element(By.NAME, "commit").submit()
for page in range(number_of_pages - 1):
    list_of_rows = []
    src = driver.page_source  # gets the html source of the page
    parser = BeautifulSoup(src, 'html.parser')
    sleep(1)
    table = parser.find("table", {"class": "table table-bordered table-striped"})
    sleep(1)
    for row in table.find_all('tr')[:]:
        list_of_cells = []
        for cell in row.find_all('td'):
            x = requests.get("https://ngodarpan.gov.in/index.php/ajaxcontroller/get_csrf")
            dat = x.json()
            z = dat["csrf_token"]
            print(z)  # prints the csrf token
            r = requests.post("https://ngodarpan.gov.in/index.php/ajaxcontroller/show_ngo_info",
                              data={'id': '', 'csrf_test_name': 'z'})
            json_data = r.text  # something is not working here: it prints HTML text, but we need the data of the POST request (mobile, email, etc.)
            with open('data1.json', 'a') as outfile:
                json.dump(json_data, outfile)
    driver.find_element_by_xpath("//a[contains(text(),'»')]").click()
There is no error message; the code runs, but it prints HTML content instead of the JSON:
<html>
...
...
<body>
    <div id="container">
        <h1>An Error Was Encountered</h1>
        <p>The action you have requested is not allowed.</p>
    </div>
</body>
</html>

This could be done much faster by avoiding the use of Selenium. The site appears to request a fresh token prior to each request; you might find it is possible to skip this.
The following shows how to get the JSON containing the mobile number and email address:
from bs4 import BeautifulSoup
import requests
import time

def get_token(sess):
    req_csrf = sess.get('https://ngodarpan.gov.in/index.php/ajaxcontroller/get_csrf')
    return req_csrf.json()['csrf_token']

search_url = "https://ngodarpan.gov.in/index.php/ajaxcontroller/search_index_new/{}"
details_url = "https://ngodarpan.gov.in/index.php/ajaxcontroller/show_ngo_info"

sess = requests.Session()

for page in range(0, 10000, 10):  # Advance 10 at a time
    print(f"Getting results from {page}")

    for retry in range(1, 10):
        data = {
            'state_search': 7,
            'district_search': '',
            'sector_search': 'null',
            'ngo_type_search': 'null',
            'ngo_name_search': '',
            'unique_id_search': '',
            'view_type': 'detail_view',
            'csrf_test_name': get_token(sess),
        }

        req_search = sess.post(search_url.format(page), data=data, headers={'X-Requested-With': 'XMLHttpRequest'})
        soup = BeautifulSoup(req_search.content, "html.parser")
        table = soup.find('table', id='example')

        if table:
            for tr in table.find_all('tr'):
                row = [td.text for td in tr.find_all('td')]
                link = tr.find('a', onclick=True)

                if link:
                    link_number = link['onclick'].strip("show_ngif(')")
                    req_details = sess.post(details_url, headers={'X-Requested-With': 'XMLHttpRequest'},
                                            data={'id': link_number, 'csrf_test_name': get_token(sess)})
                    json = req_details.json()
                    details = json['infor']['0']
                    print([details['Mobile'], details['Email'], row[1], row[2]])
            break
        else:
            print(f'No data returned - retry {retry}')
            time.sleep(3)
This would give you the following kind of output for the first page:
['9871249262', 'pnes.delhi@yahoo.com', 'Pragya Network Educational Society', 'S-52559, Narela, DELHI']
['9810042046', 'mathew.cherian@helpageindia.org', 'HelpAge India', '9270, New Delhi, DELHI']
['9811897589', 'aipssngo@yahoo.com', 'All India Parivartan Sewa Samiti', 's-43282, New Delhi, DELHI']
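As a side note on the question's original code: the "The action you have requested is not allowed." page looks like CodeIgniter's CSRF rejection, and it most likely appeared because the token was fetched with a plain requests.get (so it was not tied to any session cookie) and because the literal string 'z' was posted instead of the token variable. A minimal corrected sketch of the token-then-details flow, reusing the endpoints above (the id value here is just a placeholder):

import requests

sess = requests.Session()  # keep cookies; the token is tied to the session

# fetch a fresh CSRF token, then POST it together with an NGO id
token = sess.get('https://ngodarpan.gov.in/index.php/ajaxcontroller/get_csrf').json()['csrf_token']
r = sess.post('https://ngodarpan.gov.in/index.php/ajaxcontroller/show_ngo_info',
              headers={'X-Requested-With': 'XMLHttpRequest'},
              data={'id': '1', 'csrf_test_name': token})  # 'id' value is a placeholder
print(r.json())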

Switch to an iframe through Selenium and Python
You can use an XPath to locate the <iframe>:
iframe = driver.find_element_by_xpath("//iframe[@name='Dialogue Window']")
Then switch_to the frame:
driver.switch_to.frame(iframe)
Here's how to switch back to the default content (out of the <iframe>):
driver.switch_to.default_content()
In your instance, I believe the 'Dialogue Window' name would be CalendarControlIFrame.
Once you switch to that frame, you will be able to use Beautiful Soup to get the frame's html.
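For example, a minimal sketch combining the two steps (the frame name is the one guessed above):

from bs4 import BeautifulSoup

driver.switch_to.frame(driver.find_element_by_name('CalendarControlIFrame'))
soup = BeautifulSoup(driver.page_source, 'html.parser')  # html of the frame only
driver.switch_to.default_content()  # switch back out of the iframe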

I am trying to iterate over all the pages and extract the data in one run.
After extracting the data from one page, it does not move on to the other pages:
....
....
['9829059202', 'cecoedecon@gmail.com', 'CECOEDECON', '206, Jaipur, RAJASTHAN']
['9443382475', 'odamindia@gmail.com', 'ODAM', '43/1995, TIRUCHULI, TAMIL NADU']
['9816510096', 'shrisaisnr@gmail.com', 'OPEN EDUCATIONAL DEVELOPMENT RESEARCH AND WELFARE', '126/2004, SUNDERNAGAR, HIMACHAL PRADESH']
['9425013029', 'card_vivek@yahoo.com', 'Centre for Advanced Research and Development', '25634, Bhopal, MADHYA PRADESH']
['9204645161', 'secretary_smvm@yahoo.co.in', 'Srijan Mahila Vikas Manch', '833, Chakradharpur, JHARKHAND']
['9419107550', 'amarjit.randwal@gmail.com', 'J and K Sai Star Society', '4680-S, Jammu, JAMMU & KASHMIR']
No data returned - retry 2
No data returned - retry 2
No data returned - retry 2
No data returned - retry 2
No data returned - retry 2
...
...

Related

Why is Python requests returning a different text value to what I get when I navigate to the webpage by hand?

I am trying to build a simple 'stock-checker' for a T-shirt I want to buy. Here is the link: https://yesfriends.co/products/mens-t-shirt-black?variant=40840532689069
As you can see, I am presented with 'Coming Soon' text, whereas usually, if an item is in stock, it will show 'Add To Cart'.
I thought the simplest way would be to use requests and beautifulsoup to isolate this <button> tag and read the value of its text. If it eventually says 'Add To Cart', then I will write the code to email/message myself that it's back in stock.
However, here's the code I have so far, and you'll see that the response says the text contains 'Add To Cart', which is not what the website actually shows.
import requests
import bs4

URL = 'https://yesfriends.co/products/mens-t-shirt-black?variant=40840532689069'

def check_stock(url):
    page = requests.get(url)
    soup = bs4.BeautifulSoup(page.content, "html.parser")
    buttons = soup.find_all('button', {'name': 'add'})
    return buttons

if __name__ == '__main__':
    buttons = check_stock(URL)
    print(buttons[0].text)
All the data is available as JSON in a <script> tag, so we need to get it and extract the information we need. Let's use a simple slice by indexes to get clean JSON:
import requests
import json

url = 'https://yesfriends.co/products/mens-t-shirt-black'
response = requests.get(url)
index_start = response.text.index('product:', 0) + len('product:')
index_finish = response.text.index(', }', index_start)
json_obj = json.loads(response.text[index_start:index_finish])

for variant in json_obj['variants']:
    available = 'IN STOCK' if variant['available'] else 'OUT OF STOCK'
    print(variant['id'], variant['option1'], available)
OUTPUT:
40840532623533 XXS OUT OF STOCK
40840532656301 XS OUT OF STOCK
40840532689069 S OUT OF STOCK
40840532721837 M OUT OF STOCK
40840532754605 L OUT OF STOCK
40840532787373 XL OUT OF STOCK
40840532820141 XXL OUT OF STOCK
40840532852909 3XL IN STOCK
40840532885677 4XL OUT OF STOCK
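Building on that, here is a small sketch of how the original stock-checker could use this; the variant id is the S size from the question's URL, and the notification part is left as a stub:

import requests
import json

def in_stock(variant_id):
    response = requests.get('https://yesfriends.co/products/mens-t-shirt-black')
    text = response.text
    index_start = text.index('product:') + len('product:')
    index_finish = text.index(', }', index_start)
    product = json.loads(text[index_start:index_finish])
    return any(v['id'] == variant_id and v['available'] for v in product['variants'])

if __name__ == '__main__':
    if in_stock(40840532689069):  # the S variant from the question's URL
        print('Back in stock!')   # replace with your email/message code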

Beautiful Soup not returning data within a table

I want to retrieve a financial dataset from a website which has a login. I've managed to log in using requests and access the HTML:
import requests
from bs4 import BeautifulSoup
import pandas as pd

s = requests.session()
login_data = dict(email='my login', password='password')
s.post('*portal website with/login*', data=login_data)
r = s.get(' *website with financial page* ')
print(r.content)

## work on r as it's a direct link
soup = BeautifulSoup(r.text, 'html.parser')  # returns the html of the finance page
The above code allows me to log in and get the html from the correct page.
headers = []

# find all the column headers
for i in table.find_all('th'):
    title = i.text.strip()
    headers.append(title)

df = pd.DataFrame(columns=headers)
print(df)
This block finds the table and gets the column headers, which are printed as:
Columns: [Date, Type, Type, Credit, Debit, Outstanding, Case File, ]
The next part is the problem. When I attempt to retrieve the financials using the following code:
for row in table.find_all('tr')[1:]:
    data = row.find_all('td')
    row_data = [td.text.strip() for td in data]
    print(row_data)
it returns this:
['"Loading Please Wait..."']
The HTML of the site looks like this:
[screenshot: HTML of the site I want to scrape]
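The '"Loading Please Wait..."' placeholder is a strong hint that the table body is filled in by JavaScript after the page loads, so requests never sees the real rows. As with the other questions on this page, the usual fix is to find the follow-up XHR call in the browser's Network tab and request it with the same logged-in session. A minimal sketch, where the endpoint path is purely hypothetical:

# The real endpoint must be copied from the browser's Network tab;
# '/ajax/financials' is a made-up example.
r_data = s.get(' *website with financial page* /ajax/financials',
               headers={'X-Requested-With': 'XMLHttpRequest'})
print(r_data.text)  # often JSON or an HTML fragment containing the table rows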

Extracting url and headline of a list of articles from a given url

http://comp20008-jh.eng.unimelb.edu.au:9889/main/
Hi, so I'm trying to get the list of all the URLs and main headings from the articles given on an HTML page. The link above is the main page, and clicking the 'next article' link leads to the next article, with a link like this:
http://comp20008-jh.eng.unimelb.edu.au:9889/main/Hodg001.html
"Hodg001.html" href which continues until 147th article. This page has a 'next article' link that leads to the next article and so on.
I'm trying to extract the url and the heading from each article and create a dataframe with to save into a csv file. I'm totally clueless and don't know how to proceed now
base_url = 'http://comp20008-jh.eng.unimelb.edu.au:9889/main/'
req = requests.get(base_url)
print(req)

soup = BeautifulSoup(req.text, 'html.parser')
print(soup.prettify())
print(soup.h1)

links = soup.findAll("a")
print(links)
headings = soup.findAll("h1")
print(headings)

for link in links:
    print(link.get("href"))  # only gets 1

for i in headings:
    print(i)  # doesn't work
Can anyone please explain how I can proceed? I can provide more information if needed.
Do you mean something like this?
In the code below a few things are happening:
- set a base_url, as the links are relative, not absolute
- keep track of the next_url for the while loop
- the next_link_class is just a placeholder to find which <p>-tag is needed
- data will contain the links and headings
- csv_path is the path to your export file
Next we tell the script to keep on fetching links and extracting information as long as next_url is populated. After it's done, it will write the data to the provided path for the CSV file.
I must admit I didn't let it run through to the end, so you may need to catch it differently if no next link is available on the page. But this should get you well on your way.
import csv
import bs4
import requests

url_base = 'http://comp20008-jh.eng.unimelb.edu.au:9889/main/'
next_url = 'http://comp20008-jh.eng.unimelb.edu.au:9889/main/Hodg001.html'
next_link_class = 'nextLink'
csv_path = 'export.csv'

data = []
while next_url:
    soup = bs4.BeautifulSoup(requests.get(next_url).content, 'html.parser')
    try:
        next_url = url_base + soup.find('p', {'class': next_link_class}).a['href']
    except AttributeError:
        break  # This exception should be thrown when the last page is reached.
               # The loop should break at that point and dump the data to the csv-file
    d = {
        'url': next_url,  # note: this is the *next* page's url, paired with the current page's heading
        'heading': soup.find('h1').string
    }
    data.append(d)
    print(d)

with open(csv_path, 'w') as f:
    c = csv.DictWriter(f, fieldnames=data[0].keys())
    c.writeheader()
    [c.writerow(i) for i in data]
The code above will provide you with a list of dictionaries of links and headings, and save it all to the given csv_path:
In [20]: data
Out[20]:
[{'url': 'http://comp20008-jh.eng.unimelb.edu.au:9889/main/Vick002.html',
'heading': 'Hodgson shoulders England blame'},
{'url': 'http://comp20008-jh.eng.unimelb.edu.au:9889/main/Yach003.html',
'heading': 'Vickery out of Six Nations'},
{'url': 'http://comp20008-jh.eng.unimelb.edu.au:9889/main/Lapo004.html',
'heading': 'Yachvili savours France comeback'},
{'url': 'http://comp20008-jh.eng.unimelb.edu.au:9889/main/Lews005.html',
'heading': 'Laporte tinkers with team'},
{'url': 'http://comp20008-jh.eng.unimelb.edu.au:9889/main/Fumi006.html',
'heading': 'Lewsey puzzle over disallowed try'}]
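Since the question mentioned building a dataframe, the same data list can also be handed to pandas, which writes the CSV in two lines (assuming pandas is installed):

import pandas as pd

df = pd.DataFrame(data)           # columns: 'url' and 'heading'
df.to_csv(csv_path, index=False)  # same export file as above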

Unable to access table inside div (basketballreference)

I'm currently writing a Python script, and part of it gets the win shares from the first 4 seasons of the career of every player in the NBA draft between 2005 and 2015. I've been messing around with this for almost 2 hours (getting increasingly frustrated), but I've been unable to get the Win Shares for the individual players. I'm trying to use the "Advanced" table at the following link as a test case: https://www.basketball-reference.com/players/b/bogutan01.html#advanced::none
When getting the players' names from the draft pages I had no problems, but I've tried many iterations of the following code and have had no success in accessing the <td> element the stat is in.
playerSoup = BeautifulSoup(playerHtml)
playertr = playerSoup.find_all("table", id = "advanced").find("tbody").findAll("tr")
playerws = playertr.findAll("td")[21].getText()
This page uses JavaScript to add the tables, but it doesn't read the data from the server. All the tables are already in the HTML, but as comments: <!-- ... -->.
Using BeautifulSoup you can find all the comments and then check which one contains the text "Advanced". Then you can parse that comment as normal HTML with BeautifulSoup:
import requests
from bs4 import BeautifulSoup
from bs4 import Comment

url = 'https://www.basketball-reference.com/players/b/bogutan01.html#advanced::none'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

all_comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for item in all_comments:
    if "Advanced" in item:
        adv = BeautifulSoup(item, 'html.parser')
        playertr = adv.find("table", id="advanced")
        if not playertr:
            #print('skip')
            continue  # skip comments without the table - go back to `for`
        playertr = playertr.find("tbody").findAll("tr")
        playerws = adv.find_all("td")[21].getText()
        print('playertr:', playertr)
        print('playerws:', playerws)
        for row in playertr:
            if row:
                print(row.find_all('th')[0].text)
                all_td = row.find_all('td')
                print([x.text for x in all_td])
                print('--')
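If the goal is just the win-shares column for each season, indexing cell 21 is fragile if the column layout ever changes. basketball-reference marks every cell with a data-stat attribute, and the win-shares column appears to use data-stat="ws" (worth verifying in the page source), so a more targeted sketch would be:

for row in playertr:
    season = row.find('th', {'data-stat': 'season'})  # row label, assumed attribute name
    ws = row.find('td', {'data-stat': 'ws'})          # win shares, assumed attribute name
    if season and ws:
        print(season.text, ws.text)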

Dynamic Web scraping

I am trying to scrape this page ("http://www.arohan.in/branch-locator.php"): when I select the state and city, an address is displayed, and I have to write the state, city, and address to a CSV/Excel file. I am able to reach this step; now I am stuck.
Here is my code:
from selenium import webdriver
from selenium.webdriver.support.ui import Select, WebDriverWait

chrome_path = r"C:\Users\IBM_ADMIN\Downloads\chromedriver_win32\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get("http://www.arohan.in/branch-locator.php")

select = Select(driver.find_element_by_name('state'))
select.select_by_visible_text('Bihar')
drop = Select(driver.find_element_by_name('branch'))
city_option = WebDriverWait(driver, 5).until(lambda x: x.find_element_by_xpath("//select[@id='city1']/option[text()='Gaya']"))
city_option.click()
Is Selenium necessary? It looks like you can use URLs to arrive at what you want: http://www.arohan.in/branch-locator.php?state=Assam&branch=Mirza
Get a list of the state/branch combinations, then use the Beautiful Soup tutorial to get the info from each page.
In a slightly organized manner:
import requests
from bs4 import BeautifulSoup

link = "http://www.arohan.in/branch-locator.php?"

def get_links(session, url, payload):
    session.headers["User-Agent"] = "Mozilla/5.0"
    res = session.get(url, params=payload)
    soup = BeautifulSoup(res.text, "lxml")
    item = [item.text for item in soup.select(".address_area p")]
    print(item)

if __name__ == '__main__':
    for st, br in zip(['Bihar', 'West Bengal'], ['Gaya', 'Kolkata']):
        payload = {
            'state': st,
            'branch': br
        }
        with requests.Session() as session:
            get_links(session, link, payload)
Output:
['Branch', 'House no -10/12, Ward-18, Holding No-12, Swarajpuri Road, Near Bank of Baroda, Gaya Pin 823001(Bihar)', 'N/A', 'N/A']
['Head Office', 'PTI Building, 4th Floor, DP Block, DP-9, Salt Lake City Calcutta, 700091', '+91 33 40156000', 'contact@arohan.in']
A better approach would be to avoid using Selenium. Selenium is useful when you require the JavaScript processing needed to render the HTML; in your case, this is not needed, as the required information is already contained within the HTML.
What is needed is to first make a request to get a page containing all of the states. Then, for each state, request the list of branches. Then, for each state/branch combination, a URL request can be made to get the HTML containing the address. The address happens to be contained in the second <li> entry following a <ul class='address_area'> entry:
from bs4 import BeautifulSoup
import requests
import csv
import time

# Get a list of available states
r = requests.get('http://www.arohan.in/branch-locator.php')
soup = BeautifulSoup(r.text, 'html.parser')
state_select = soup.find('select', id='state1')
states = [option.text for option in state_select.find_all('option')[1:]]

# Open an output CSV file
with open('branch addresses.csv', 'w', newline='', encoding='utf-8') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(['State', 'Branch', 'Address'])

    # For each state determine the available branches
    for state in states:
        r_branches = requests.post('http://www.arohan.in/Ajax/ajax_branch.php', data={'ajax_state': state})
        soup = BeautifulSoup(r_branches.text, 'html.parser')

        # For each branch, request a page containing the address
        for option in soup.find_all('option')[1:]:
            time.sleep(0.5)  # Reduce server loading
            branch = option.text
            print("{}, {}".format(state, branch))
            r_branch = requests.get('http://www.arohan.in/branch-locator.php', params={'state': state, 'branch': branch})
            soup_branch = BeautifulSoup(r_branch.text, 'html.parser')
            ul = soup_branch.find('ul', class_='address_area')
            if ul:
                address = ul.find_all('li')[1].get_text(strip=True)
                row = [state, branch, address]
                csv_output.writerow(row)
            else:
                print(soup_branch.title)
Giving you an output CSV file starting:
State,Branch,Address
West Bengal,Kolkata,"PTI Building, 4th Floor,DP Block, DP-9, Salt Lake CityCalcutta, 700091"
West Bengal,Maheshtala,"Narmada Park, Par Bangla,Baddir Bandh Bus Stop,Opp Lane Kismat Nungi Road,Maheshtala,Kolkata- 700140. (W.B)"
West Bengal,ShyamBazar,"First Floor, 6 F.b.T. Road,Ward No.-6,Kolkata-700002"
You should slow the script down using time.sleep(0.5) to avoid overloading the server.
Note: [1:] is used because the first item in the drop-down lists is not a branch or state, but a 'Select Branch' entry.
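If an Excel file is wanted instead of CSV (the question mentions both), the finished CSV can be converted with pandas, assuming pandas and openpyxl are installed:

import pandas as pd

df = pd.read_csv('branch addresses.csv')
df.to_excel('branch addresses.xlsx', index=False)  # requires openpyxl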
