Dynamic web scraping - Python

I am trying to scrape this page ("http://www.arohan.in/branch-locator.php"). When I select the state and city, an address is displayed, and I have to write the state, city and address to a CSV/Excel file. I am able to reach this step, but now I am stuck.
Here is my code:
from selenium import webdriver
from selenium.webdriver.support.ui import Select, WebDriverWait

chrome_path = r"C:\Users\IBM_ADMIN\Downloads\chromedriver_win32\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get("http://www.arohan.in/branch-locator.php")

select = Select(driver.find_element_by_name('state'))
select.select_by_visible_text('Bihar')

drop = Select(driver.find_element_by_name('branch'))
city_option = WebDriverWait(driver, 5).until(lambda x: x.find_element_by_xpath("//select[@id='city1']/option[text()='Gaya']"))
city_option.click()

Is Selenium necessary? It looks like you can use URLs to arrive at what you want: http://www.arohan.in/branch-locator.php?state=Assam&branch=Mirza.
Get a list of the state/branch combinations, then use Beautiful Soup to get the info from each page.
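For instance, a minimal sketch of fetching one state/branch page directly with requests and Beautiful Soup (the .address_area selector is borrowed from the answers below, so treat it as an assumption about the page markup):
import requests
from bs4 import BeautifulSoup

# Build the URL with query parameters instead of driving a browser
params = {"state": "Assam", "branch": "Mirza"}
r = requests.get("http://www.arohan.in/branch-locator.php", params=params)

soup = BeautifulSoup(r.text, "html.parser")
# The address text sits inside the "address_area" block
for p in soup.select(".address_area p"):
    print(p.get_text(strip=True))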

In a slightly organized manner:
import requests
from bs4 import BeautifulSoup

link = "http://www.arohan.in/branch-locator.php?"

def get_links(session, url, payload):
    session.headers["User-Agent"] = "Mozilla/5.0"
    res = session.get(url, params=payload)
    soup = BeautifulSoup(res.text, "lxml")
    item = [item.text for item in soup.select(".address_area p")]
    print(item)

if __name__ == '__main__':
    for st, br in zip(['Bihar', 'West Bengal'], ['Gaya', 'Kolkata']):
        payload = {
            'state': st,
            'branch': br
        }
        with requests.Session() as session:
            get_links(session, link, payload)
Output:
['Branch', 'House no -10/12, Ward-18, Holding No-12, Swarajpuri Road, Near Bank of Baroda, Gaya Pin 823001(Bihar)', 'N/A', 'N/A']
['Head Office', 'PTI Building, 4th Floor, DP Block, DP-9, Salt Lake City Calcutta, 700091', '+91 33 40156000', 'contact@arohan.in']

A better approach is to avoid using Selenium. Selenium is useful when you need JavaScript processing to render the HTML; in your case this is not needed, as the required information is already contained within the HTML.
What is needed is to first make a request to get a page containing all of the states. Then, for each state, request the list of branches. Then, for each state/branch combination, a URL request can be made to get the HTML containing the address. This happens to be contained in the second <li> entry following a <ul class='address_area'> entry:
from bs4 import BeautifulSoup
import requests
import csv
import time

# Get a list of available states
r = requests.get('http://www.arohan.in/branch-locator.php')
soup = BeautifulSoup(r.text, 'html.parser')
state_select = soup.find('select', id='state1')
states = [option.text for option in state_select.find_all('option')[1:]]

# Open an output CSV file
with open('branch addresses.csv', 'w', newline='', encoding='utf-8') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(['State', 'Branch', 'Address'])

    # For each state, determine the available branches
    for state in states:
        r_branches = requests.post('http://www.arohan.in/Ajax/ajax_branch.php', data={'ajax_state': state})
        soup = BeautifulSoup(r_branches.text, 'html.parser')

        # For each branch, request a page containing the address
        for option in soup.find_all('option')[1:]:
            time.sleep(0.5)   # Reduce server loading
            branch = option.text
            print("{}, {}".format(state, branch))

            r_branch = requests.get('http://www.arohan.in/branch-locator.php', params={'state': state, 'branch': branch})
            soup_branch = BeautifulSoup(r_branch.text, 'html.parser')
            ul = soup_branch.find('ul', class_='address_area')

            if ul:
                address = ul.find_all('li')[1].get_text(strip=True)
                row = [state, branch, address]
                csv_output.writerow(row)
            else:
                print(soup_branch.title)
Giving you an output CSV file starting:
State,Branch,Address
West Bengal,Kolkata,"PTI Building, 4th Floor,DP Block, DP-9, Salt Lake CityCalcutta, 700091"
West Bengal,Maheshtala,"Narmada Park, Par Bangla,Baddir Bandh Bus Stop,Opp Lane Kismat Nungi Road,Maheshtala,Kolkata- 700140. (W.B)"
West Bengal,ShyamBazar,"First Floor, 6 F.b.T. Road,Ward No.-6,Kolkata-700002"
You should slow the script down with the time.sleep(0.5) call to avoid putting too much load on the server.
Note: [1:] is used because the first item in the drop-down lists is not a branch or state, but a Select Branch placeholder entry.

Related

Scrape data from a website whose URL doesn't change

I'm new to web scraping, but I have enough command of requests, BeautifulSoup and Selenium to extract data from a website. The problem is that I'm trying to scrape data from a website whose URL doesn't change when I click the page number for the next page.
(Screenshot: the page number element shown in the inspector)
Website URL: https://www.ellsworth.com/products/adhesives/
I also tried the Google developer tools but couldn't figure out the approach. If someone could guide me with code, I would be grateful.
(Screenshot: the GET request shown in the developer tools)
Here is my code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import pandas as pd
import requests

itemproducts = pd.DataFrame()
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get('https://www.ellsworth.com/products/adhesives/')
base_url = 'https://www.ellsworth.com'

html = driver.page_source
s = BeautifulSoup(html, 'html.parser')
data = []
href_link = s.find_all('div', {'class': 'results-products-item-image'})
for links in href_link:
    href_link_a = links.find('a')['href']
    data.append(base_url + href_link_a)

# url = 'https://www.ellsworth.com/products/adhesives/silicone/dow-838-silicone-adhesive-sealant-white-90-ml-tube/'
for c in data:
    driver.get(c)
    html_pro = driver.page_source
    soup = BeautifulSoup(html_pro, 'html.parser')
    title = soup.find('span', {'itemprop': 'name'}).text.strip()
    part_num = soup.find('span', {'itemprop': 'sku'}).text.strip()
    manfacture = soup.find('span', {'class': 'manuSku'}).text.strip()
    manfacture_ = manfacture.replace('Manufacturer SKU:', '').strip()
    pro_det = soup.find('div', {'class': 'product-details'})
    p = pro_det.find_all('p')
    try:
        d = p[1].text.strip()
        c = p.text.strip()
    except:
        pass
    table = pro_det.find('table', {'class': 'table'})
    tr = table.find_all('td')
    typical = tr[1].text.strip()
    brand = tr[3].text.strip()
    color = tr[5].text.strip()
    image = soup.find('img', {'itemprop': 'image'})['src']
    image_ = base_url + image
    png_url = title + '.jpg'
    img_data = requests.get(image_).content
    with open(png_url, 'wb') as fh:
        fh.write(img_data)
    itemproducts = itemproducts.append({'Product Title': title,
                                        'Part Number': part_num,
                                        'SKU': manfacture_,
                                        'Description d': d,
                                        'Description c': c,
                                        'Typical': typical,
                                        'Brand': brand,
                                        'Color': color,
                                        'Image URL': image_}, ignore_index=True)
The content of the page is rendered dynamically, but if you inspect the XHR tab under Network in the developer tools you can find the API request URL. I've shortened the URL a bit, but it still works just fine.
Here's how you can get the list of the first 10 products from page 1:
import requests

start = 0
n_items = 10

api_request_url = f"https://www.ellsworth.com/api/catalogSearch/search?sEcho=1&iDisplayStart={start}&iDisplayLength={n_items}&DefaultCatalogNode=Adhesives&_=1497895052601"
data = requests.get(api_request_url).json()

print(f"Found: {data['iTotalRecords']} items.")
for item in data["aaData"]:
    print(item)
This gets you a nice JSON response with all the data for each product and that should get you started.
['Sauereisen Insa-Lute Adhesive Cement No. P-1 Powder Off-White 1 qt Can', 'P-1-INSA-LUTE-ADHESIVE', 'P-1 INSA-LUTE ADHESIVE', '$72.82', '/products/adhesives/ceramic/sauereisen-insa-lute-adhesive-cement-no.-p-1-powder-off-white-1-qt-can/', '/globalassets/catalogs/sauereisen-insa-lute-cement-no-p-1-off-white-1qt_170x170.jpg', 'Adhesives-Ceramic', '[{"qty":"1-2","price":"$72.82","customerPrice":"$72.82","eachPrice":"","custEachPrice":"","priceAmount":"72.820000000","customerPriceAmount":"72.820000000","currency":"USD"},{"qty":"3-15","price":"$67.62","customerPrice":"$67.62","eachPrice":"","custEachPrice":"","priceAmount":"67.620000000","customerPriceAmount":"67.620000000","currency":"USD"},{"qty":"16+","price":"$63.36","customerPrice":"$63.36","eachPrice":"","custEachPrice":"","priceAmount":"63.360000000","customerPriceAmount":"63.360000000","currency":"USD"}]', '', '', '', 'P1-Q', '1000', 'true', 'Presentation of packaged goods may vary. For special packaging requirements, please call (877) 454-9224', '', '', '']
If you want to get the next 10 items, change the iDisplayStart value to 10; if you want more items per request, just change iDisplayLength to, say, 20.
In the demo I substitute these values with start and n_items, but you can easily automate the whole thing because the total number of items comes with the response as iTotalRecords.
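For example, here is a small sketch of paging through every result and using iTotalRecords to decide when to stop (the parameter names come from the request above; the page size of 50 is an arbitrary assumption):
import requests

page_size = 50  # arbitrary; the API appeared to accept larger iDisplayLength values
start = 0
all_items = []

while True:
    url = (
        "https://www.ellsworth.com/api/catalogSearch/search"
        f"?sEcho=1&iDisplayStart={start}&iDisplayLength={page_size}"
        "&DefaultCatalogNode=Adhesives&_=1497895052601"
    )
    data = requests.get(url).json()
    all_items.extend(data["aaData"])
    start += page_size
    # Stop once we've collected every record the API reports
    if start >= int(data["iTotalRecords"]):
        break

print(f"Collected {len(all_items)} items in total.")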

How to scrape a popup using Python and Selenium

I'm trying to scrape NGO data such as name, mobile number, city, etc. from https://ngodarpan.gov.in/index.php/search/. It lists the names of the NGOs in a table, and clicking on each name opens a pop-up page. In my code below, I'm extracting the onclick attribute for each NGO and making a GET request followed by a POST request to extract the data. I've tried accessing it with Selenium, but the JSON data is not coming through.
list_of_cells = []
for cell in row.find_all('td'):
    text = cell.text.replace(" ", "")
    list_of_cells.append(text)
list_of_rows.append(list_of_cells)
writer = csv.writer(f)
writer.writerow(list_of_cells)
With the portion above we can get the full details from the table on all of the pages (this website has 7721 pages; we can simply change the number_of_pages variable).
But our goal is to find each NGO's phone number and email ID, which only appear after clicking the NGO name link. That link is not an href; instead an API GET request followed by a POST request fetches the data, as can be seen in the Network section of the inspector.
driver.get("https://ngodarpan.gov.in/index.php/search/") # load the web page
sleep(2)
....
....
driver.find_element(By.NAME,"commit").submit()
for page in range(number_of_pages - 1):
list_of_rows = []
src = driver.page_source # gets the html source of the page
parser = BeautifulSoup(src,'html.parser')
sleep(1)
table = parser.find("table",{ "class" : "table table-bordered table-striped" })
sleep(1)
for row in table.find_all('tr')[:]:
list_of_cells = []
for cell in row.find_all('td'):
x = requests.get("https://ngodarpan.gov.in/index.php/ajaxcontroller/get_csrf")
dat=x.json()
z=dat["csrf_token"]
print(z) # prints csrf token
r= requests.post("https://ngodarpan.gov.in/index.php/ajaxcontroller/show_ngo_info", data = {'id':'','csrf_test_name':'z'})
json_data=r.text # i guess here is something not working it is printing html text but we need text data of post request like mob,email,and here it will print all the data .
with open('data1.json', 'a') as outfile:
json.dump(json_data, outfile)
driver.find_element_by_xpath("//a[contains(text(),'ยป')]").click()
There is no error message as such; the code runs, but it prints HTML content:
<html>
...
...
<body>
<div id="container">
<h1>An Error Was Encountered</h1>
<p>The action you have requested is not allowed.</p> </div>
</body>
</html>
This could be done much faster by avoiding the use of Selenium. Their site appears to continually request a token prior to each request; you might find it is possible to skip this.
The following shows how to get the JSON containing the mobile number and email address:
from bs4 import BeautifulSoup
import requests
import time

def get_token(sess):
    req_csrf = sess.get('https://ngodarpan.gov.in/index.php/ajaxcontroller/get_csrf')
    return req_csrf.json()['csrf_token']

search_url = "https://ngodarpan.gov.in/index.php/ajaxcontroller/search_index_new/{}"
details_url = "https://ngodarpan.gov.in/index.php/ajaxcontroller/show_ngo_info"

sess = requests.Session()

for page in range(0, 10000, 10):     # Advance 10 at a time
    print(f"Getting results from {page}")

    for retry in range(1, 10):
        data = {
            'state_search' : 7,
            'district_search' : '',
            'sector_search' : 'null',
            'ngo_type_search' : 'null',
            'ngo_name_search' : '',
            'unique_id_search' : '',
            'view_type' : 'detail_view',
            'csrf_test_name' : get_token(sess),
        }

        req_search = sess.post(search_url.format(page), data=data, headers={'X-Requested-With' : 'XMLHttpRequest'})
        soup = BeautifulSoup(req_search.content, "html.parser")
        table = soup.find('table', id='example')

        if table:
            for tr in table.find_all('tr'):
                row = [td.text for td in tr.find_all('td')]
                link = tr.find('a', onclick=True)

                if link:
                    link_number = link['onclick'].strip("show_ngif(')")
                    req_details = sess.post(details_url, headers={'X-Requested-With' : 'XMLHttpRequest'}, data={'id' : link_number, 'csrf_test_name' : get_token(sess)})
                    json = req_details.json()
                    details = json['infor']['0']
                    print([details['Mobile'], details['Email'], row[1], row[2]])
            break
        else:
            print(f'No data returned - retry {retry}')
            time.sleep(3)
This would give you the following kind of output for the first page:
['9871249262', 'pnes.delhi@yahoo.com', 'Pragya Network Educational Society', 'S-52559, Narela, DELHI']
['9810042046', 'mathew.cherian@helpageindia.org', 'HelpAge India', '9270, New Delhi, DELHI']
['9811897589', 'aipssngo@yahoo.com', 'All India Parivartan Sewa Samiti', 's-43282, New Delhi, DELHI']
Switch to an iframe through Selenium and Python
You can use an XPath to locate the iframe:
iframe = driver.find_element_by_xpath("//iframe[@name='Dialogue Window']")
Then switch_to the frame:
driver.switch_to.frame(iframe)
Here's how to switch back to the default content (out of the iframe):
driver.switch_to.default_content()
In your instance, I believe the 'Dialogue Window' name would be CalendarControlIFrame.
Once you switch to that frame, you will be able to use Beautiful Soup to get the frame's html.
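A minimal sketch of that last step, assuming the frame really is named CalendarControlIFrame and that the driver is already on a page containing it (both assumptions taken from the suggestion above):
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://ngodarpan.gov.in/index.php/search/")  # assumed starting page

# Locate and switch into the iframe (the name is an assumption)
iframe = driver.find_element_by_xpath("//iframe[@name='CalendarControlIFrame']")
driver.switch_to.frame(iframe)

# Inside the frame, page_source is the frame's own HTML, so it can be parsed with Beautiful Soup
frame_soup = BeautifulSoup(driver.page_source, "html.parser")
print(frame_soup.prettify()[:500])

# Switch back out of the frame when done
driver.switch_to.default_content()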
I am trying to iterate over all the pages and extract the data in one go.
After extracting data from one page, it does not move on to the other pages.
....
....
['9829059202', 'cecoedecon@gmail.com', 'CECOEDECON', '206, Jaipur, RAJASTHAN']
['9443382475', 'odamindia@gmail.com', 'ODAM', '43/1995, TIRUCHULI, TAMIL NADU']
['9816510096', 'shrisaisnr@gmail.com', 'OPEN EDUCATIONAL DEVELOPMENT RESEARCH AND WELFARE', '126/2004, SUNDERNAGAR, HIMACHAL PRADESH']
['9425013029', 'card_vivek@yahoo.com', 'Centre for Advanced Research and Development', '25634, Bhopal, MADHYA PRADESH']
['9204645161', 'secretary_smvm@yahoo.co.in', 'Srijan Mahila Vikas Manch', '833, Chakradharpur, JHARKHAND']
['9419107550', 'amarjit.randwal@gmail.com', 'J and K Sai Star Society', '4680-S, Jammu, JAMMU & KASHMIR']
No data returned - retry 2
No data returned - retry 2
No data returned - retry 2
No data returned - retry 2
No data returned - retry 2
...
...

Unable to access table inside div (basketball-reference)

I'm currently writing a Python script, part of which gets the win shares from the first 4 seasons of every player's career in the NBA drafts between 2005 and 2015. I've been messing around with this for almost 2 hours (getting increasingly frustrated), but I've been unable to get the win shares for the individual players. I'm trying to use the "Advanced" table at the following link as a test case: https://www.basketball-reference.com/players/b/bogutan01.html#advanced::none
When getting the players' names from the draft pages I had no problems, but I've tried many iterations of the following code and have had no success in accessing the td element the stat is in.
playerSoup = BeautifulSoup(playerHtml)
playertr = playerSoup.find_all("table", id = "advanced").find("tbody").findAll("tr")
playerws = playertr.findAll("td")[21].getText()
This page uses JavaScript to add the tables, but it doesn't read the data from the server. All the tables are already in the HTML, but inside comments (<!-- ... -->).
Using BeautifulSoup you can find all the comments and check which one contains the text "Advanced". You can then parse that comment as normal HTML with BeautifulSoup:
import requests
from bs4 import BeautifulSoup
from bs4 import Comment

url = 'https://www.basketball-reference.com/players/b/bogutan01.html#advanced::none'
r = requests.get(url)
soup = BeautifulSoup(r.content)

all_comments = soup.find_all(string=lambda text: isinstance(text, Comment))

for item in all_comments:
    if "Advanced" in item:
        adv = BeautifulSoup(item)
        playertr = adv.find("table", id="advanced")
        if not playertr:
            #print('skip')
            continue  # skip comment without table - go back to `for`
        playertr = playertr.find("tbody").findAll("tr")
        playerws = adv.find_all("td")[21].getText()
        print('playertr:', playertr)
        print('playerws:', playerws)
        for row in playertr:
            if row:
                print(row.find_all('th')[0].text)
                all_td = row.find_all('td')
                print([x.text for x in all_td])
                print('--')

Links on original webpage missing after parsing with beautiful soup

Please excuse me if my explanation seems elementary. I'm new to both Python and Beautiful Soup.
I'm trying to extract data from the following website:
https://valor.militarytimes.com/award/5?page=1
I want to extract the links that correspond to each of the 24 medal recipients on the website. I can see from Firefox inspector that they all have the word 'hero' in their links. However, when I use beautiful soup to parse the website, these links do not appear.
I have tried using the standard html parser as well as the html5lib parser but none of them show the links corresponding to these medal recipients.
page = requests.get('https://valor.militarytimes.com/award/5?page=1')
soup = BeautifulSoup(page.text, "html5lib")
for idx, link in enumerate(soup.find_all('a', href=True)):
    print(link)
The above code finds only some of the links on the original website, and in particular, there are no links corresponding to the medal recipients. Even running soup.prettify() shows that these links are not in the parsed text.
I hope to have a simple code that can extract the links for the 24 medal recipients on this website.
If you want to avoid using Selenium, there is a simple way to get the data you require. The page loads the data by sending a POST request to the following URL:
https://valor.militarytimes.com/api/awards/5?page=1
This returns a JSON response, which is then used to populate the page with JavaScript. All you have to do is send the same request using python-requests and then get the data out of the JSON response.
import requests

r = requests.post('https://valor.militarytimes.com/api/awards/5?page=1')
for item in r.json()['data']:
    name = item['recipient']['name']
    url = 'https://valor.militarytimes.com/hero/' + str(item['recipient']['id'])
    print(name, url)
Output:
EUGENE MCCARLEY https://valor.militarytimes.com/hero/500963
TIMOTHY KEENAN https://valor.militarytimes.com/hero/500962
JOHN THOMPSON https://valor.militarytimes.com/hero/500961
WALTER BORDEN https://valor.militarytimes.com/hero/500941
WILLIAM ROSE https://valor.militarytimes.com/hero/94465
YUKITAKA MIZUTARI https://valor.militarytimes.com/hero/94175
ALBERT MARTIN https://valor.militarytimes.com/hero/92498
FRANCIS CODY https://valor.militarytimes.com/hero/500944
JAMES O'KEEFFE https://valor.militarytimes.com/hero/500943
PHILLIP FLEMING https://valor.militarytimes.com/hero/500942
JOHN WANAMAKER https://valor.militarytimes.com/hero/314466
ROBERT CHILSON https://valor.militarytimes.com/hero/102316
CHRISTOPHER NELMS https://valor.militarytimes.com/hero/89255
SAMUEL BARNETT https://valor.militarytimes.com/hero/71533
ANDREW BYERS https://valor.militarytimes.com/hero/500938
ANDREW RUSSELL https://valor.militarytimes.com/hero/500937
****** CALDWELL https://valor.militarytimes.com/hero/500935
****** WALWRATH https://valor.militarytimes.com/hero/500934
****** MADSEN https://valor.militarytimes.com/hero/500933
****** NELSON https://valor.militarytimes.com/hero/500932
WILLIAM SOUKUP https://valor.militarytimes.com/hero/500931
BENJAMIN WILSON https://valor.militarytimes.com/hero/500930
ANDREW MARCKESANO https://valor.militarytimes.com/hero/500929
WAYNE KUNZ https://valor.militarytimes.com/hero/500927
I have fetched the name as well. You can just get the link if you require only that.
Edit
To get URLs from multiple pages, use this code:
import requests

list_of_urls = []
last_page = 9  # replace this with your last page
for i in range(1, last_page + 1):
    r = requests.post('https://valor.militarytimes.com/api/awards/5?page={}'.format(i))
    for item in r.json()['data']:
        url = 'https://valor.militarytimes.com/hero/' + str(item['recipient']['id'])
        list_of_urls.append(url)
print(list_of_urls)
Output:
['https://valor.militarytimes.com/hero/500963', 'https://valor.militarytimes.com/hero/500962', 'https://valor.militarytimes.com/hero/500961', 'https://valor.militarytimes.com/hero/500941', 'https://valor.militarytimes.com/hero/94465', 'https://valor.militarytimes.com/hero/94175', 'https://valor.militarytimes.com/hero/92498', 'https://valor.militarytimes.com/hero/500944', 'https://valor.militarytimes.com/hero/500943', 'https://valor.militarytimes.com/hero/500942', 'https://valor.militarytimes.com/hero/314466', 'https://valor.militarytimes.com/hero/102316', 'https://valor.militarytimes.com/hero/89255', 'https://valor.militarytimes.com/hero/71533', 'https://valor.militarytimes.com/hero/500938', 'https://valor.militarytimes.com/hero/500937', 'https://valor.militarytimes.com/hero/500935', 'https://valor.militarytimes.com/hero/500934', 'https://valor.militarytimes.com/hero/500933', 'https://valor.militarytimes.com/hero/500932', 'https://valor.militarytimes.com/hero/500931', 'https://valor.militarytimes.com/hero/500930', 'https://valor.militarytimes.com/hero/500929', 'https://valor.militarytimes.com/hero/500927', 'https://valor.militarytimes.com/hero/500926', 'https://valor.militarytimes.com/hero/500925', 'https://valor.militarytimes.com/hero/500924', 'https://valor.militarytimes.com/hero/500923', 'https://valor.militarytimes.com/hero/500922', 'https://valor.militarytimes.com/hero/500921', 'https://valor.militarytimes.com/hero/500920', 'https://valor.militarytimes.com/hero/500919', 'https://valor.militarytimes.com/hero/500918', 'https://valor.militarytimes.com/hero/500917', 'https://valor.militarytimes.com/hero/500916', 'https://valor.militarytimes.com/hero/500915', 'https://valor.militarytimes.com/hero/500914', 'https://valor.militarytimes.com/hero/500913', 'https://valor.militarytimes.com/hero/500912', 'https://valor.militarytimes.com/hero/500911', 'https://valor.militarytimes.com/hero/500910', 'https://valor.militarytimes.com/hero/500909', 'https://valor.militarytimes.com/hero/500908', 'https://valor.militarytimes.com/hero/500907', 'https://valor.militarytimes.com/hero/500906', 'https://valor.militarytimes.com/hero/500905', 'https://valor.militarytimes.com/hero/500904', 'https://valor.militarytimes.com/hero/500903', 'https://valor.militarytimes.com/hero/500902', 'https://valor.militarytimes.com/hero/500901', 'https://valor.militarytimes.com/hero/500900', 'https://valor.militarytimes.com/hero/500899', 'https://valor.militarytimes.com/hero/500898', 'https://valor.militarytimes.com/hero/500897', 'https://valor.militarytimes.com/hero/500896', 'https://valor.militarytimes.com/hero/500895', 'https://valor.militarytimes.com/hero/500894', 'https://valor.militarytimes.com/hero/500893', 'https://valor.militarytimes.com/hero/500892', 'https://valor.militarytimes.com/hero/500891', 'https://valor.militarytimes.com/hero/500890', 'https://valor.militarytimes.com/hero/500889', 'https://valor.militarytimes.com/hero/500888', 'https://valor.militarytimes.com/hero/29160', 'https://valor.militarytimes.com/hero/106931', 'https://valor.militarytimes.com/hero/106375', 'https://valor.militarytimes.com/hero/94936', 'https://valor.militarytimes.com/hero/94928', 'https://valor.militarytimes.com/hero/94927', 'https://valor.militarytimes.com/hero/94926', 'https://valor.militarytimes.com/hero/94923', 'https://valor.militarytimes.com/hero/94777', 'https://valor.militarytimes.com/hero/94769', 'https://valor.militarytimes.com/hero/94711', 'https://valor.militarytimes.com/hero/94644', 
'https://valor.militarytimes.com/hero/94571', 'https://valor.militarytimes.com/hero/94570', 'https://valor.militarytimes.com/hero/94494', 'https://valor.militarytimes.com/hero/94468', 'https://valor.militarytimes.com/hero/94454', 'https://valor.militarytimes.com/hero/94388', 'https://valor.militarytimes.com/hero/94358', 'https://valor.militarytimes.com/hero/94279', 'https://valor.militarytimes.com/hero/94275', 'https://valor.militarytimes.com/hero/94253', 'https://valor.militarytimes.com/hero/94251', 'https://valor.militarytimes.com/hero/94223', 'https://valor.militarytimes.com/hero/94222', 'https://valor.militarytimes.com/hero/94217', 'https://valor.militarytimes.com/hero/94211', 'https://valor.militarytimes.com/hero/94210', 'https://valor.militarytimes.com/hero/94195', 'https://valor.militarytimes.com/hero/94194', 'https://valor.militarytimes.com/hero/94173', 'https://valor.militarytimes.com/hero/94168', 'https://valor.militarytimes.com/hero/94055', 'https://valor.militarytimes.com/hero/93916', 'https://valor.militarytimes.com/hero/93847', 'https://valor.militarytimes.com/hero/93780', 'https://valor.militarytimes.com/hero/93779', 'https://valor.militarytimes.com/hero/93775', 'https://valor.militarytimes.com/hero/93774', 'https://valor.militarytimes.com/hero/93733', 'https://valor.militarytimes.com/hero/93722', 'https://valor.militarytimes.com/hero/93706', 'https://valor.militarytimes.com/hero/93551', 'https://valor.militarytimes.com/hero/93435', 'https://valor.militarytimes.com/hero/93407', 'https://valor.militarytimes.com/hero/93374', 'https://valor.militarytimes.com/hero/93277', 'https://valor.militarytimes.com/hero/93243', 'https://valor.militarytimes.com/hero/93193', 'https://valor.militarytimes.com/hero/92989', 'https://valor.militarytimes.com/hero/92972', 'https://valor.militarytimes.com/hero/92958', 'https://valor.militarytimes.com/hero/93923', 'https://valor.militarytimes.com/hero/90130', 'https://valor.militarytimes.com/hero/90128', 'https://valor.militarytimes.com/hero/89704', 'https://valor.militarytimes.com/hero/89703', 'https://valor.militarytimes.com/hero/89702', 'https://valor.militarytimes.com/hero/89701', 'https://valor.militarytimes.com/hero/89698', 'https://valor.militarytimes.com/hero/89673', 'https://valor.militarytimes.com/hero/89661', 'https://valor.militarytimes.com/hero/90127', 'https://valor.militarytimes.com/hero/89535', 'https://valor.militarytimes.com/hero/89493', 'https://valor.militarytimes.com/hero/89406', 'https://valor.militarytimes.com/hero/89405', 'https://valor.militarytimes.com/hero/89404', 'https://valor.militarytimes.com/hero/89261', 'https://valor.militarytimes.com/hero/89259', 'https://valor.militarytimes.com/hero/88805', 'https://valor.militarytimes.com/hero/88803', 'https://valor.militarytimes.com/hero/88789', 'https://valor.militarytimes.com/hero/88770', 'https://valor.militarytimes.com/hero/88766', 'https://valor.militarytimes.com/hero/88765', 'https://valor.militarytimes.com/hero/88719', 'https://valor.militarytimes.com/hero/88680', 'https://valor.militarytimes.com/hero/88679', 'https://valor.militarytimes.com/hero/88678', 'https://valor.militarytimes.com/hero/88658', 'https://valor.militarytimes.com/hero/88657', 'https://valor.militarytimes.com/hero/88616', 'https://valor.militarytimes.com/hero/88578', 'https://valor.militarytimes.com/hero/88551', 'https://valor.militarytimes.com/hero/88445', 'https://valor.militarytimes.com/hero/88366', 'https://valor.militarytimes.com/hero/88365', 'https://valor.militarytimes.com/hero/88045', 
'https://valor.militarytimes.com/hero/88044', 'https://valor.militarytimes.com/hero/88013', 'https://valor.militarytimes.com/hero/88012', 'https://valor.militarytimes.com/hero/87986', 'https://valor.militarytimes.com/hero/87918', 'https://valor.militarytimes.com/hero/87909', 'https://valor.militarytimes.com/hero/87898', 'https://valor.militarytimes.com/hero/87830', 'https://valor.militarytimes.com/hero/88570', 'https://valor.militarytimes.com/hero/88568', 'https://valor.militarytimes.com/hero/88239', 'https://valor.militarytimes.com/hero/87792', 'https://valor.militarytimes.com/hero/87782', 'https://valor.militarytimes.com/hero/87677', 'https://valor.militarytimes.com/hero/87655', 'https://valor.militarytimes.com/hero/87523', 'https://valor.militarytimes.com/hero/87460', 'https://valor.militarytimes.com/hero/87292', 'https://valor.militarytimes.com/hero/87291', 'https://valor.militarytimes.com/hero/87288', 'https://valor.militarytimes.com/hero/87283', 'https://valor.militarytimes.com/hero/87282', 'https://valor.militarytimes.com/hero/87281', 'https://valor.militarytimes.com/hero/87280', 'https://valor.militarytimes.com/hero/87279', 'https://valor.militarytimes.com/hero/87272', 'https://valor.militarytimes.com/hero/86875', 'https://valor.militarytimes.com/hero/86811', 'https://valor.militarytimes.com/hero/86451', 'https://valor.militarytimes.com/hero/86077', 'https://valor.militarytimes.com/hero/86076', 'https://valor.militarytimes.com/hero/85994', 'https://valor.militarytimes.com/hero/86005', 'https://valor.militarytimes.com/hero/6190', 'https://valor.militarytimes.com/hero/5022', 'https://valor.militarytimes.com/hero/500877', 'https://valor.militarytimes.com/hero/500851', 'https://valor.militarytimes.com/hero/500844', 'https://valor.militarytimes.com/hero/500843', 'https://valor.militarytimes.com/hero/500842', 'https://valor.militarytimes.com/hero/500841', 'https://valor.militarytimes.com/hero/500840', 'https://valor.militarytimes.com/hero/500839', 'https://valor.militarytimes.com/hero/500838', 'https://valor.militarytimes.com/hero/500837', 'https://valor.militarytimes.com/hero/500836', 'https://valor.militarytimes.com/hero/500835', 'https://valor.militarytimes.com/hero/500834', 'https://valor.militarytimes.com/hero/500833', 'https://valor.militarytimes.com/hero/500832', 'https://valor.militarytimes.com/hero/500831', 'https://valor.militarytimes.com/hero/500830', 'https://valor.militarytimes.com/hero/500829', 'https://valor.militarytimes.com/hero/500827', 'https://valor.militarytimes.com/hero/500826', 'https://valor.militarytimes.com/hero/500817', 'https://valor.militarytimes.com/hero/500816', 'https://valor.militarytimes.com/hero/500815', 'https://valor.militarytimes.com/hero/500813', 'https://valor.militarytimes.com/hero/500808', 'https://valor.militarytimes.com/hero/401188', 'https://valor.militarytimes.com/hero/401185', 'https://valor.militarytimes.com/hero/89851', 'https://valor.militarytimes.com/hero/89846']
You can also use Selenium WebDriver together with Beautiful Soup:
from selenium import webdriver
import time
from bs4 import BeautifulSoup

url = 'https://valor.militarytimes.com/award/5?page=1'
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('window-size=1920x1080')
driver = webdriver.Chrome(options=chrome_options)
driver.get(url)
time.sleep(10)

page = driver.page_source
soup = BeautifulSoup(page, 'lxml')
items = soup.find_all('a', href=True)  # find_all rather than select(), which does not take an href filter
hero = []
for item in items:
    if 'hero' in item['href']:
        print(item['href'])
        hero.append(item['href'])
print(hero)
Output:
/hero/500963
/hero/500962
/hero/500961
/hero/500941
/hero/94465
/hero/94175
/hero/92498
/hero/500944
/hero/500943
/hero/500942
/hero/314466
/hero/102316
/hero/89255
/hero/71533
/hero/500938
/hero/500937
/hero/500935
/hero/500934
/hero/500933
/hero/500932
/hero/500931
/hero/500930
/hero/500929
/hero/500927
['/hero/500963', '/hero/500962', '/hero/500961', '/hero/500941', '/hero/94465', '/hero/94175', '/hero/92498', '/hero/500944', '/hero/500943', '/hero/500942', '/hero/314466', '/hero/102316', '/hero/89255', '/hero/71533', '/hero/500938', '/hero/500937', '/hero/500935', '/hero/500934', '/hero/500933', '/hero/500932', '/hero/500931', '/hero/500930', '/hero/500929', '/hero/500927']
You can make POST requests to the API to retrieve JSON containing the IDs for each recipient, which you can concatenate onto a base URL to give the full URL for each recipient. The JSON also contains the URL of the last page, so you can determine the end point for a subsequent loop over all pages.
import requests
import pandas as pd

baseUrl = 'https://valor.militarytimes.com/hero/'
url = 'https://valor.militarytimes.com/api/awards/5?page=1'

headers = {
    'Accept': 'application/json, text/plain, */*',
    'Referer': 'https://valor.militarytimes.com/award/5?page=1',
    'User-Agent': 'Mozilla/5.0'
}

info = requests.post(url, headers=headers, data='').json()
urls = [baseUrl + str(item['recipient']['id']) for item in info['data']]  # page 1

linksInfo = info['links']
firstLink = linksInfo['first']
lastLink = linksInfo['last']
lastPage = lastLink.replace('https://valor.militarytimes.com/api/awards/5?page=', '')

print('last page = ' + lastPage)
print(urls)
While testing retrieval of all the results, I noticed you would potentially need to back off and retry (a sketch of that follows the snippet below).
You can build the additional URLs as follows:
if int(lastPage) > 1:
    for page in range(2, int(lastPage) + 1):
        url = 'https://valor.militarytimes.com/api/awards/5?page={}'.format(page)
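A minimal sketch of the back-off and retry mentioned above, reusing the baseUrl, headers, urls and lastPage variables from the snippet; the retry count and delays are arbitrary assumptions, not values taken from the site:
import time
import requests

def fetch_page(page, headers, max_retries=5):
    # Fetch one page of the awards API, backing off and retrying on failure
    api_url = 'https://valor.militarytimes.com/api/awards/5?page={}'.format(page)
    delay = 1  # seconds; doubled after each failed attempt
    for attempt in range(1, max_retries + 1):
        try:
            r = requests.post(api_url, headers=headers, data='')
            r.raise_for_status()
            return r.json()
        except (requests.RequestException, ValueError):
            print('Page {} failed (attempt {}), retrying in {}s'.format(page, attempt, delay))
            time.sleep(delay)
            delay *= 2
    return None  # give up after max_retries

# Continuing from the variables defined above:
all_urls = list(urls)  # page 1 results
for page in range(2, int(lastPage) + 1):
    info = fetch_page(page, headers)
    if info:
        all_urls += [baseUrl + str(item['recipient']['id']) for item in info['data']]
print(len(all_urls))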

Python 3 scrape Yellow Pages

I'm trying to scrape data off of Yellow Pages but I'm stuck where I can't get the text of each business name and address/phone. I'm using the code below; where am I going wrong? I'm only printing the text of each business for now so I can see it while testing, but once I'm done I'm going to save the data to CSV.
import csv
import requests
from bs4 import BeautifulSoup

# dont worry about opening this file
"""with open('cities_louisiana.csv','r') as cities:
    lines = cities.read().splitlines()
cities.close()"""

for city in lines:
    print(city)
    url = "http://www.yellowpages.com/search?search_terms=businesses&geo_location_terms=amite+LA&page=" + str(count)

for city in lines:
    for x in range(0, 50):
        print("http://www.yellowpages.com/search?search_terms=businesses&geo_location_terms=amite+LA&page=" + str(x))
        page = requests.get("http://www.yellowpages.com/search?search_terms=businesses&geo_location_terms=amite+LA&page=" + str(x))
        soup = BeautifulSoup(page.text, "html.parser")
        name = soup.find_all("div", {"class": "v-card"})
        for name in name:
            try:
                print(name.contents[0]).find_all(class_="business-name").text
                #print(name.contents[1].text)
            except:
                pass
You should iterate over the search results and, for every result, locate the business name (the element with the "business-name" class) and the address (the element with the "adr" class):
for result in soup.select(".search-results .result"):
    name = result.select_one(".business-name").get_text(strip=True, separator=" ")
    address = result.select_one(".adr").get_text(strip=True, separator=" ")
    print(name, address)
.select() and .select_one() are handy CSS selector methods.
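Putting that together with the requests code from the question, a minimal end-to-end sketch might look like this (it assumes yellowpages.com still serves results with the search-results, result, business-name and adr classes):
import csv
import requests
from bs4 import BeautifulSoup

url = "http://www.yellowpages.com/search"
params = {"search_terms": "businesses", "geo_location_terms": "amite LA", "page": 1}

page = requests.get(url, params=params, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(page.text, "html.parser")

with open("businesses.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Name", "Address"])
    for result in soup.select(".search-results .result"):
        name_el = result.select_one(".business-name")
        addr_el = result.select_one(".adr")
        # Some results may lack a name or address, so guard against None
        writer.writerow([
            name_el.get_text(strip=True, separator=" ") if name_el else "",
            addr_el.get_text(strip=True, separator=" ") if addr_el else "",
        ])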
