Scraping and navigating to links to get more info

Scraping and navigating to links to get more info - python

Not sure if what I am trying to do is possible...but here goes. I am trying to navigate and scrape info from this table (simplified)...
> <tr class="transaction odd" id="transaction_7"><td><a
> href="/show_customer/11111">Erin</a></td></tr> <tr class="transaction
> even" id="transaction_6"><td><a
> href="/show_customer/2222">Jack</a></td></tr> <tr class="transaction
> odd" id="transaction_5"><td><a
> href="/show_customer/3333">Carl</a></td></tr> <tr class="transaction
> even" id="transaction_4"><td><a
> href="/show_customer/4444">Kelly</a></td></tr>
This is the code I used to scrape the table and output into a csv...works well.
columns = ["User Name", "Source", "Staff", "Location", "Attended On", "Used", "Date"]
table = []
for row in table_1.find_all('tr'):
tds = row.find_all('td')
try:
data = [td.get_text() for td in tds]
for field,value in zip(columns, data):
print("{}: {}".format(field, value))
table.append(data)
except:
print("Bad string value")
import csv
with open("myfile.csv", "wb") as outf:
outcsv = csv.writer(outf)
# header row
outcsv.writerow(columns)
# data
outcsv.writerows(table)
What I need to do is navigate to each link in the table like this
<a> href="/show_customer/11111">Erin</a>
and grab the customers email address that is in this html form
<div class="field">
<div class = "label">Email</div>
<p>XXXX#email.com</p>
</div>
And add that to the relevant row in my csv.
Any help would be greatly appreciated!

You would have to make a http request for every href in the td. This is how you would modify your existing code to do that:
from urllib2 import urlopen
for row in table_1.find_all('tr'):
tds = row.find_all('td')
# Get all the hrefs to make http request
links = row.find_all('a').get('href')
try:
data = [td.get_text() for td in tds]
for field,value in zip(columns, data):
print("{}: {}".format(field, value))
# For every href make a request, get the page,
# create a BS object
for link in links:
link_soup = BeautifulSoup(urlopen(link))
# Use link_soup BS instance to get the email
# by navigating the div and p and add it to your data
table.append(data)
except:
print("Bad string value")
Note that your href is relative to the website's url. So after you extract the href you would have to prepend it with the website's url to form a valid url

Related

How to Scrape a popup using python and selenium

I'm trying to scrape Ngo's data like name,mobile no,city etc from https://ngodarpan.gov.in/index.php/search/. It has names of the NGOs in a table format and on clicking on each name gives way to a pop up page. In my code below, I'm extracting the onclick attribute for each NGO.I am making a get followed by a post request to extract the data. I've tried accessing it using selenium but the json data is not coming.
list_of_cells = []
for cell in row.find_all('td'):
text = cell.text.replace(" ", "")
list_of_cells.append(text)
list_of_rows.append(list_of_cells)
writer=csv.writer(f)
writer.writerow(list_of_cells)
By implementing above portion we can get entire details of the table of all the pages .In this website there are 7721 pages.we can simply change number_of_pages var.
But our motive is to find Ngo phone no/email id which is the main purpose we will get after clicking ngo name link.But it is not a href to link rather it a api get req followed by post request to fetch data.find in network section of inspect
driver.get("https://ngodarpan.gov.in/index.php/search/") # load the web page
sleep(2)
....
....
driver.find_element(By.NAME,"commit").submit()
for page in range(number_of_pages - 1):
list_of_rows = []
src = driver.page_source # gets the html source of the page
parser = BeautifulSoup(src,'html.parser')
sleep(1)
table = parser.find("table",{ "class" : "table table-bordered table-striped" })
sleep(1)
for row in table.find_all('tr')[:]:
list_of_cells = []
for cell in row.find_all('td'):
x = requests.get("https://ngodarpan.gov.in/index.php/ajaxcontroller/get_csrf")
dat=x.json()
z=dat["csrf_token"]
print(z) # prints csrf token
r= requests.post("https://ngodarpan.gov.in/index.php/ajaxcontroller/show_ngo_info", data = {'id':'','csrf_test_name':'z'})
json_data=r.text # i guess here is something not working it is printing html text but we need text data of post request like mob,email,and here it will print all the data .
with open('data1.json', 'a') as outfile:
json.dump(json_data, outfile)
driver.find_element_by_xpath("//a[contains(text(),'»')]").click()
There is no such error message the code is running but it is printing html content
<html>
...
...
<body>
<div id="container">
<h1>An Error Was Encountered</h1>
<p>The action you have requested is not allowed.</p> </div>
</body>
</html>

This could be done much faster by avoiding the use of Selenium. Their site appears to continually request a token prior to each request, you might find it is possible to skip this.
The following shows how to get the JSON containing the mobile number and email address:
from bs4 import BeautifulSoup
import requests
import time
def get_token(sess):
req_csrf = sess.get('https://ngodarpan.gov.in/index.php/ajaxcontroller/get_csrf')
return req_csrf.json()['csrf_token']
search_url = "https://ngodarpan.gov.in/index.php/ajaxcontroller/search_index_new/{}"
details_url = "https://ngodarpan.gov.in/index.php/ajaxcontroller/show_ngo_info"
sess = requests.Session()
for page in range(0, 10000, 10): # Advance 10 at a time
print(f"Getting results from {page}")
for retry in range(1, 10):
data = {
'state_search' : 7,
'district_search' : '',
'sector_search' : 'null',
'ngo_type_search' : 'null',
'ngo_name_search' : '',
'unique_id_search' : '',
'view_type' : 'detail_view',
'csrf_test_name' : get_token(sess),
}
req_search = sess.post(search_url.format(page), data=data, headers={'X-Requested-With' : 'XMLHttpRequest'})
soup = BeautifulSoup(req_search.content, "html.parser")
table = soup.find('table', id='example')
if table:
for tr in table.find_all('tr'):
row = [td.text for td in tr.find_all('td')]
link = tr.find('a', onclick=True)
if link:
link_number = link['onclick'].strip("show_ngif(')")
req_details = sess.post(details_url, headers={'X-Requested-With' : 'XMLHttpRequest'}, data={'id' : link_number, 'csrf_test_name' : get_token(sess)})
json = req_details.json()
details = json['infor']['0']
print([details['Mobile'], details['Email'], row[1], row[2]])
break
else:
print(f'No data returned - retry {retry}')
time.sleep(3)
This would give you the following kind of output for the first page:
['9871249262', 'pnes.delhi#yahoo.com', 'Pragya Network Educational Society', 'S-52559, Narela, DELHI']
['9810042046', 'mathew.cherian#helpageindia.org', 'HelpAge India', '9270, New Delhi, DELHI']
['9811897589', 'aipssngo#yahoo.com', 'All India Parivartan Sewa Samiti', 's-43282, New Delhi, DELHI']

Switch to an iframe through Selenium and python
You can use an XPath to locate the :
iframe = driver.find_element_by_xpath("//iframe[#name='Dialogue Window']")
Then switch_to the :
driver.switch_to.frame(iframe)
Here's how to switch back to the default content (out of the ):
driver.switch_to.default_content()
In your instance, I believe the 'Dialogue Window' name would be CalendarControlIFrame
Once you switch to that frame, you will be able to use Beautiful Soup to get the frame's html.

I am tying to iterate over all the pages and extract data in one attempt
After extracting data from one page it is not iterating other pages
....
....
['9829059202', 'cecoedecon#gmail.com', 'CECOEDECON', '206, Jaipur, RAJASTHAN']
['9443382475', 'odamindia#gmail.com', 'ODAM', '43/1995, TIRUCHULI, TAMIL NADU']
['9816510096', 'shrisaisnr#gmail.com', 'OPEN EDUCATIONAL DEVELOPMENT RESEARCH AND WELFARE', '126/2004, SUNDERNAGAR, HIMACHAL PRADESH']
['9425013029', 'card_vivek#yahoo.com', 'Centre for Advanced Research and Development', '25634, Bhopal, MADHYA PRADESH']
['9204645161', 'secretary_smvm#yahoo.co.in', 'Srijan Mahila Vikas Manch', '833, Chakradharpur, JHARKHAND']
['9419107550', 'amarjit.randwal#gmail.com', 'J and K Sai Star Society', '4680-S, Jammu, JAMMU & KASHMIR']
No data returned - retry 2
No data returned - retry 2
No data returned - retry 2
No data returned - retry 2
No data returned - retry 2
...
...

Extract links after th in beautifulsoup

Im trying to extract links from this page:
http://www.tadpoletunes.com/tunes/celtic1/
view-source:http://www.tadpoletunes.com/tunes/celtic1/
but I only want the reels: which in the page are delineated by :
start:
<th align="left"><b><a name="reels">REELS</a></b></th>
end ( the lines above the following):
<th align="left"><b><a name="slides">SLIDES</a></b></th>
The question is how to do this. I have the following code which gets the links for everything with a .mid extension:
def import_midifiles():
archive_url="http://www.tadpoletunes.com/tunes/celtic1/"
sauce= urllib.request.urlopen("http://www.tadpoletunes.com/tunes/celtic1/celtic.htm").read()
soup=bs.BeautifulSoup(sauce,'lxml')
tables=soup.find_all('table')
for table in tables:
for link in table.find_all('a',href=True):
if link['href'].endswith('.mid'):
listofmidis.append(archive_url + link['href'])
if listofmidis:
listoflists.append(listofmidis)
midi_list = [item for sublist in listoflists for item in sublist]
return midi_list
I cannot figure this out from the beautifulsoup docs. I need the code because I will be repeating the activity on other sites in order to scrape data for training a model.

To get all the "REELS" links, you need to do the following:
Get the links in between "REELS" and "SLIDES" as you mentioned. To do that, first you'll need to find the <tr> tag containing <a name="reels">REELS</a>. This can be done using the .find_parent() method.
reels_tr = soup.find('a', {'name': 'reels'}).find_parent('tr')
Now, you can use the .find_next_siblings() method to get all the <tr> tags after "REELS". We can break the loop when we find the <tr> tag with <a name="slides">SLIDES</a> (or .find('a').text == 'SLIDES').
Complete code:
def import_midifiles():
BASE_URL = 'http://www.tadpoletunes.com/tunes/celtic1/'
r = requests.get(BASE_URL)
soup = BeautifulSoup(r.text, 'lxml')
midi_list = []
reels_tr = soup.find('a', {'name': 'reels'}).find_parent('tr')
for tr in reels_tr.find_next_siblings('tr'):
if tr.find('a').text == 'SLIDES':
break
midi_list.append(BASE_URL + tr.find('a')['href'])
return midi_list
print(import_midifiles())
Partial output:
['http://www.tadpoletunes.com/tunes/celtic1/ashplant.mid',
'http://www.tadpoletunes.com/tunes/celtic1/bashful.mid',
'http://www.tadpoletunes.com/tunes/celtic1/bigpat.mid',
'http://www.tadpoletunes.com/tunes/celtic1/birdcage.mid',
'http://www.tadpoletunes.com/tunes/celtic1/boatstre.mid',
...
...
'http://www.tadpoletunes.com/tunes/celtic1/silspear.mid',
'http://www.tadpoletunes.com/tunes/celtic1/stafreel.mid',
'http://www.tadpoletunes.com/tunes/celtic1/kilkenny.mid',
'http://www.tadpoletunes.com/tunes/celtic1/swaltail.mid',
'http://www.tadpoletunes.com/tunes/celtic1/cuptea.mid']

Python Scraping: How to separate multiple attributes in one cell (td)?

When scraping an HTML table, if a cell (td) in the table has multiple attributes (See HTML snippet for example) how can you separate the two and/or how could you select just one?
HTML snippet:
<td class="playerName md align-left pre in post" style="display: table-cell;"><span ...</span>
<a role="button" class="full-name">Dustin Johnson</a>
<a role="button" class="short-name">D. Johnson</a></td>
Code I'm trying:
url = 'http://www.espn.com/golf/leaderboard?tournamentId=3742'
req = requests.get(url)
soup = bs4.BeautifulSoup(req.text,'lxml')
table = soup.find(id='leaderboard-view')
headings = [th.get_text() for th in table.find('tr').find_all('th')]
dataset = []
for row in table.find_all('tr'):
a = [td.get_text() for td in row.find_all('td')]
dataset.append(a)
Any advice on how to either a) select just one of the names, or b) separate the cell in to two cells would be appreciated.
Thank you.

If you want the full name and short name, you can try this:
for td in row.find_all('td'):
full_name = td.find('a', {'class': 'full-name'}).text
short_name = td.find('a', {'class': 'short-name'}).text

try to use regex to match the tr
players = the_soup.findAll('tr',{'class':re.compile("player-overview")})
for p in players:
name = p.find('a',{'class':'full-name'}).get_text()

Trying to select rows in a table, always getting NavigableString error

I'm trying unsuccessfully to scrape a list of countries and altitudes from a wiki page:
Here's the relevant HTML from this page:
<table class="wikitable sortable jquery-tablesorter">
<thead>
<tbody>
<tr>
<td>
And here's my code
url = "https://en.wikipedia.org/wiki/List_of_countries_by_average_elevation"
soup = BeautifulSoup(read_url(url), 'html.parser')
table = soup.find("table", {"class":"wikitable"})
tbody = table.find("tbody")
rows = tbody.find("tr") <---this gives the error, saying tbody is None
countries = []
altitudes = []
for row in rows:
cols = row.findAll('td')
for td in cols:
if td.a:
countries.append(td.a.text)
elif "m (" in td.text:
altitudes.append(float(td.text.split("m")[0].replace(",", "")))
Here's the error:
Traceback (most recent call last):
File "wiki.py", line 18, in <module>
rows = tbody.find("tr")
AttributeError: 'NoneType' object has no attribute 'find'
So then I tried just selecting the rows straight up with soup.find('tr').
This results in a NavigableString error. What else can I try to retrieve the info in a paired fashion?

If you go to the page source and search for tbody, you will get 0 results, so that could be the cause of the first problem. It seems like Wikipedia uses a custom <table class="wikitable sortable"> and does not specify tbody.
For your second problem, you need to be using find_all and not find because find just returns the first tr. So instead you want
rows = soup.find_all("tr")
Hope this helps :)

Below code worked for me-
import requests
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/List_of_countries_by_average_elevation"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.find('table')
countries = []
altitudes = []
for row in table.find_all('tr')[1:]:
col = row.find_all('td')
country= col[0].text.strip()
elevation = float(''.join(map(unicode.strip,col[1].text.split("m")[0])).replace(',',''))
countries.append(country)
altitudes.append(elevation)
print countries,'\n',altitudes

I do not quite understand how to parse the Yahoo NHL Page

Here is my code so far:
from bs4 import BeautifulSoup
from urllib.request import urlopen
url = urlopen("http://sports.yahoo.com/nhl/scoreboard?d=2013-04-01")
content = url.read()
soup = BeautifulSoup(content)
print (soup.prettify)
table = soup.find('table')
rows = table.findAll('tr')
for tr in rows:
cols = tr.findAll('td')
for td in cols:
text = td.findAll('yspscores')
for yspscores in td:
print (yspscores)
The problem I've been having is that the HTML for that yahoo page has the table data in this context: <td class="yspscores">
I do not quite understand how to reference it in my code. My goal is to print out the scores and name of the teams that the score corresponds to.

You grabbed the first table, but there is more than one table on that page. In fact, there are 46 tables.
You want to find the tables with the scores class:
for table in soup.find_all('table', class_='scores'):
for row in table.find_all('tr'):
for cell in row.find_all('td', class_='yspscores'):
print(cell.text)
Note that searching for a specific class is done with the class_ keyword argument.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Scraping and navigating to links to get more info - python

Related

How to Scrape a popup using python and selenium

Extract links after th in beautifulsoup

Python Scraping: How to separate multiple attributes in one cell (td)?

Trying to select rows in a table, always getting NavigableString error

I do not quite understand how to parse the Yahoo NHL Page

Categories

Resources