Python BeautifulSoup - getting input value

I've got many table rows like this:
<tr>
<td>100</td>
<td>200</td>
<td><input type="radio" value="123599"></td>
</tr>
Iterate with:
table = BeautifulSoup(response).find(id="sometable") # Make soup.
for row in table.find_all("tr")[1:]: # Find rows.
    cells = row.find_all("td") # Find cells.
    points = int(cells[0].get_text())
    gold = int(cells[1].get_text())
    id = cells[2].input['value']
    print id
Error:
  File "./script.py", line XX, in <module>
    id = cells[2].input['value']
TypeError: 'NoneType' object has no attribute '__getitem__'
How can I get input value? I don't want to use regexp.

soup = BeautifulSoup(html)
try:
    value = soup.find('input', {'id': 'xyz'}).get('value')
except Exception as e:
    print("Got unhandled exception %s" % str(e))

You want to find the <input> element inside the cell, so you should use find/find_all on the cell like this:
cells[2].find('input')['value']
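The TypeError in the question means that the lookup returned None for at least one row, i.e. some third cell has no <input>. Here is a minimal sketch of a loop that skips such rows, written for Python 3 with bs4. The sample markup, the header row, and the input_values helper are made up for illustration; only the id "sometable" and the first data row come from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one row with an <input>, one row without.
html = """
<table id="sometable">
<tr><th>Points</th><th>Gold</th><th>Pick</th></tr>
<tr><td>100</td><td>200</td><td><input type="radio" value="123599"></td></tr>
<tr><td>300</td><td>400</td><td></td></tr>
</table>
"""

def input_values(markup):
    """Collect the value attribute of the <input> in the third cell of each row."""
    table = BeautifulSoup(markup, "html.parser").find(id="sometable")
    values = []
    for row in table.find_all("tr")[1:]:      # skip the header row
        cells = row.find_all("td")
        if len(cells) < 3:                    # row is too short, ignore it
            continue
        field = cells[2].find("input")
        if field is None:                     # cell has no <input>, ignore it
            continue
        values.append(field.get("value"))
    return values

print(input_values(html))
```

Checking for None before indexing avoids the crash without resorting to a regexp.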

Related

Splitting HTML text by <br> while using beautifulsoup

HTML code:
<td> <label class="identifier">Speed (avg./max):</label> </td> <td class="value"> <span class="block">4.5 kn<br>7.1 kn</span> </td>
I need to get the values 4.5 kn and 7.1 kn as separate list items so that I can append them separately. I tried to split the text string using re.sub, but it does not work. I also tried to use replace to replace the <br>, but that did not work either. Can anybody provide any insight?
Python code:
def NameSearch(shipLink, mmsi, shipName):
    from bs4 import BeautifulSoup
    import urllib2
    import csv
    import re
    values = []
    values.append(mmsi)
    values.append(shipName)
    regex = re.compile(r'[\n\r\t]')
    i = 0
    with open('Ship_indexname.csv', 'wb') as f:
        writer = csv.writer(f)
        while True:
            try:
                shipPage = urllib2.urlopen(shipLink, timeout=5)
            except urllib2.URLError:
                continue
            except:
                continue
            break
        soup = BeautifulSoup(shipPage, "html.parser") # Read the web page HTML
        #soup.find('br').replaceWith(' ')
        #for br in soup('br'):
            #br.extract()
        table = soup.find_all("table", {"id": "vessel-related"}) # Finds table with class table1
        for mytable in table: #Loops tables with class table1
            table_body = mytable.find_all('tbody') #Finds tbody section in table
            for body in table_body:
                rows = body.find_all('tr') #Finds all rows
                for tr in rows: #Loops rows
                    cols = tr.find_all('td') #Finds the columns
                    for td in cols: #Loops the columns
                        checker = td.text.encode('ascii', 'ignore')
                        check = regex.sub('', checker)
                        if check == ' Speed (avg./max): ':
                            i = 1
                        elif i == 1:
                            print td.text
                            pat=re.compile('<br\s*/>')
                            print pat.sub(" ",td.text)
                            values.append(td.text.strip("\n").encode('utf-8')) #Takes the second columns value and assigns it to a list called Values
                            i = 0
    #print values
    return values
NameSearch('https://www.fleetmon.com/vessels/kind-of-magic_0_3478642/','230034570','KIND OF MAGIC')
Locate the "Speed (avg./max)" label first and then go to the value via .find_next():
from bs4 import BeautifulSoup
data = '<td> <label class="identifier">Speed (avg./max):</label> </td> <td class="value"> <span class="block">4.5 kn<br>7.1 kn</span> </td>'
soup = BeautifulSoup(data, "html.parser")
label = soup.find("label", class_="identifier", text="Speed (avg./max):")
value = label.find_next("td", class_="value").get_text(strip=True)
print(value) # prints 4.5 kn7.1 kn
Now, you can extract the actual numbers from the string:
import re
speed_values = re.findall(r"([0-9.]+) kn", value)
print(speed_values)
Prints ['4.5', '7.1'].
You can then further convert the values to floats and unpack into separate variables:
avg_speed, max_speed = map(float, speed_values)
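If the goal is only to get the two pieces of text on either side of the <br> as separate list items, a regular expression may not be needed at all. A small sketch, assuming the same markup as in the question: the stripped_strings generator yields each text fragment of a tag separately and skips whitespace-only fragments. The variable names cell and parts are only for illustration.

```python
from bs4 import BeautifulSoup

data = ('<td> <label class="identifier">Speed (avg./max):</label> </td> '
        '<td class="value"> <span class="block">4.5 kn<br>7.1 kn</span> </td>')
soup = BeautifulSoup(data, "html.parser")

# Locate the label, then move on to the value cell that follows it.
label = soup.find("label", class_="identifier", string="Speed (avg./max):")
cell = label.find_next("td", class_="value")

# The <br> splits the text into two strings; stripped_strings yields each one.
parts = list(cell.stripped_strings)
print(parts)  # ['4.5 kn', '7.1 kn']
```

Each item can then be appended to the values list on its own.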

Trying to select rows in a table, always getting NavigableString error

I'm trying unsuccessfully to scrape a list of countries and altitudes from a wiki page:
Here's the relevant HTML from this page:
<table class="wikitable sortable jquery-tablesorter">
<thead>
<tbody>
<tr>
<td>
And here's my code
url = "https://en.wikipedia.org/wiki/List_of_countries_by_average_elevation"
soup = BeautifulSoup(read_url(url), 'html.parser')
table = soup.find("table", {"class":"wikitable"})
tbody = table.find("tbody")
rows = tbody.find("tr") # <--- this gives the error, saying tbody is None
countries = []
altitudes = []
for row in rows:
    cols = row.findAll('td')
    for td in cols:
        if td.a:
            countries.append(td.a.text)
        elif "m (" in td.text:
            altitudes.append(float(td.text.split("m")[0].replace(",", "")))
Here's the error:
Traceback (most recent call last):
  File "wiki.py", line 18, in <module>
    rows = tbody.find("tr")
AttributeError: 'NoneType' object has no attribute 'find'
So then I tried just selecting the rows straight up with soup.find('tr').
This results in a NavigableString error. What else can I try to retrieve the info in a paired fashion?
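If you go to the page source and search for tbody, you will get 0 results, which is the cause of the first problem. Wikipedia serves a plain <table class="wikitable sortable"> without a tbody element; the tbody you see in the browser's inspector is inserted by the browser when it builds the DOM, and html.parser does not add it.
For your second problem, you need to use find_all and not find, because find returns only the first tr. Looping over that single Tag then iterates over its children, some of which are NavigableString objects, hence the error. So instead you want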
rows = soup.find_all("tr")
Hope this helps :)
The code below worked for me:
import requests
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/List_of_countries_by_average_elevation"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.find('table')
countries = []
altitudes = []
for row in table.find_all('tr')[1:]:
    col = row.find_all('td')
    country= col[0].text.strip()
    elevation = float(''.join(map(unicode.strip,col[1].text.split("m")[0])).replace(',',''))
    countries.append(country)
    altitudes.append(elevation)
print countries,'\n',altitudes
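To keep each country together with its altitude, the rows can also be collected as tuples. The following is a sketch for Python 3 that runs on a small inline table instead of the live page. The two rows ("Country A", "Country B") and their numbers are made-up sample data, not values from Wikipedia; the markup only mimics the structure described in the question.

```python
from bs4 import BeautifulSoup

# Made-up sample rows in the shape of the Wikipedia table (no <tbody>).
html = """
<table class="wikitable sortable">
<tr><th>Country</th><th>Elevation</th></tr>
<tr><td><a href="/wiki/A">Country A</a></td><td>1,200 m (3,937 ft)</td></tr>
<tr><td><a href="/wiki/B">Country B</a></td><td>845.5 m (2,774 ft)</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "wikitable"})

# html.parser keeps the markup as written, so there is no tbody to find.
print(table.find("tbody"))

pairs = []
for row in table.find_all("tr"):
    cols = row.find_all("td")
    if len(cols) < 2:            # header row has <th> cells only
        continue
    country = cols[0].get_text(strip=True)
    # Keep the part before "m", then drop the thousands separator.
    metres = float(cols[1].get_text().split("m")[0].replace(",", "").strip())
    pairs.append((country, metres))

print(pairs)
```

Storing (country, altitude) tuples avoids the two lists drifting out of step when a row is skipped.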

Scraping and navigating to links to get more info

Not sure if what I am trying to do is possible...but here goes. I am trying to navigate and scrape info from this table (simplified)...
<tr class="transaction odd" id="transaction_7"><td><a href="/show_customer/11111">Erin</a></td></tr>
<tr class="transaction even" id="transaction_6"><td><a href="/show_customer/2222">Jack</a></td></tr>
<tr class="transaction odd" id="transaction_5"><td><a href="/show_customer/3333">Carl</a></td></tr>
<tr class="transaction even" id="transaction_4"><td><a href="/show_customer/4444">Kelly</a></td></tr>
This is the code I used to scrape the table and output into a csv...works well.
columns = ["User Name", "Source", "Staff", "Location", "Attended On", "Used", "Date"]
table = []
for row in table_1.find_all('tr'):
    tds = row.find_all('td')
    try:
        data = [td.get_text() for td in tds]
        for field,value in zip(columns, data):
            print("{}: {}".format(field, value))
        table.append(data)
    except:
        print("Bad string value")
import csv
with open("myfile.csv", "wb") as outf:
    outcsv = csv.writer(outf)
    # header row
    outcsv.writerow(columns)
    # data
    outcsv.writerows(table)
What I need to do is navigate to each link in the table like this
<a href="/show_customer/11111">Erin</a>
and grab the customer's email address that is in this HTML
<div class="field">
<div class = "label">Email</div>
<p>XXXX#email.com</p>
</div>
And add that to the relevant row in my csv.
Any help would be greatly appreciated!
You would have to make an HTTP request for every href in the td. This is how you would modify your existing code to do that:
from urllib2 import urlopen
for row in table_1.find_all('tr'):
    tds = row.find_all('td')
    # Get all the hrefs to make http request
    # (find_all returns a list of tags, so read href from each one)
    links = [a.get('href') for a in row.find_all('a')]
    try:
        data = [td.get_text() for td in tds]
        for field,value in zip(columns, data):
            print("{}: {}".format(field, value))
        # For every href make a request, get the page,
        # create a BS object
        for link in links:
            link_soup = BeautifulSoup(urlopen(link))
            # Use link_soup BS instance to get the email
            # by navigating the div and p and add it to your data
        table.append(data)
    except:
        print("Bad string value")
Note that your href is relative to the website's URL. So after you extract the href, you have to prepend the website's URL to it to form a valid URL before passing it to urlopen.
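The two missing pieces, building an absolute URL and pulling the email out of the customer page, could look roughly like this in Python 3. This is a sketch under assumptions: BASE_URL is a placeholder for the real site address, the helpers customer_urls and extract_email are hypothetical names, and the customer page is an inline string so that nothing is fetched over the network.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = "http://example.com/"  # placeholder; replace with the real site

row_html = ('<tr class="transaction odd" id="transaction_7"><td>'
            '<a href="/show_customer/11111">Erin</a></td></tr>')
customer_html = """
<div class="field">
<div class="label">Email</div>
<p>XXXX#email.com</p>
</div>
"""

def customer_urls(row_markup, base_url=BASE_URL):
    """Return absolute URLs for every link found in one table row."""
    row = BeautifulSoup(row_markup, "html.parser")
    return [urljoin(base_url, a["href"]) for a in row.find_all("a", href=True)]

def extract_email(page_markup):
    """Return the text of the <p> that follows the 'Email' label, or None."""
    soup = BeautifulSoup(page_markup, "html.parser")
    label = soup.find("div", class_="label", string="Email")
    if label is None:
        return None
    paragraph = label.find_next("p")
    return paragraph.get_text(strip=True) if paragraph else None

print(customer_urls(row_html))
print(extract_email(customer_html))
```

In the real script, the markup passed to extract_email would be the body of the response for each URL, and the returned value would be appended to data before table.append(data).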

Get an attribute value based on the name attribute with BeautifulSoup

I want to print an attribute value based on its name, take for example
<META NAME="City" content="Austin">
I want to do something like this
soup = BeautifulSoup(f) # f is some HTML containing the above meta tag
for meta_tag in soup("meta"):
    if meta_tag["name"] == "City":
        print(meta_tag["content"])
The above code gives a KeyError: 'name'. I believe this is because name is used by BeautifulSoup, so it can't be used as a keyword argument.
It's pretty simple, use the following:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<META NAME="City" content="Austin">')
>>> soup.find("meta", {"name":"City"})
<meta name="City" content="Austin" />
>>> soup.find("meta", {"name":"City"})['content']
'Austin'
theharshest answered the question, but here is another way to do the same thing.
Also, in your example you have NAME in caps and in your code you have name in lowercase.
s = '<div class="question" id="get attrs" name="python" x="something">Hello World</div>'
soup = BeautifulSoup(s)
attributes_dictionary = soup.find('div').attrs
print attributes_dictionary
# prints: {'id': 'get attrs', 'x': 'something', 'class': ['question'], 'name': 'python'}
print attributes_dictionary['class'][0]
# prints: question
print soup.find('div').get_text()
# prints: Hello World
6 years late to the party but I've been searching for how to extract an html element's tag attribute value, so for:
<span property="addressLocality">Ayr</span>
I want "addressLocality". I kept being directed back here, but the answers didn't really solve my problem.
How I managed to do it eventually:
>>> from bs4 import BeautifulSoup as bs
>>> soup = bs('<span property="addressLocality">Ayr</span>', 'html.parser')
>>> my_attributes = soup.find().attrs
>>> my_attributes
{u'property': u'addressLocality'}
As it's a dict, you can then also use keys and 'values'
>>> my_attributes.keys()
[u'property']
>>> my_attributes.values()
[u'addressLocality']
Hopefully it helps someone else!
theharshest's answer is the best solution, but FYI the problem you were encountering has to do with the fact that a Tag object in Beautiful Soup acts like a Python dictionary. If you access tag['name'] on a tag that doesn't have a 'name' attribute, you'll get a KeyError.
The following works:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<META NAME="City" content="Austin">', 'html.parser')
metas = soup.find_all("meta")
for meta in metas:
    print meta.attrs['content'], meta.attrs['name']
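The KeyError usually shows up when the page has other meta tags without a name attribute. A sketch of the question's loop in Python 3 that tolerates those tags by using .get(), which returns None instead of raising. The extra charset meta tag is an assumed example of a tag without name; the City tag is the one from the question.

```python
from bs4 import BeautifulSoup

# The first tag has no name attribute; the second is the one from the question.
html = '<meta charset="utf-8"><META NAME="City" content="Austin">'
soup = BeautifulSoup(html, "html.parser")

first = soup.find("meta")
try:
    first["name"]                     # dictionary-style access raises KeyError
except KeyError:
    print("no name attribute on", first)

# .get() returns None for a missing attribute, so the comparison is safe.
cities = [m["content"] for m in soup.find_all("meta") if m.get("name") == "City"]
print(cities)
```

Note that html.parser lowercases tag and attribute names, so NAME="City" in the markup is matched by "name" in the code.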
One can also try this solution :
To find the value, which is written in span of table
htmlContent
<table>
<tr>
<th>
ID
</th>
<th>
Name
</th>
</tr>
<tr>
<td>
<span name="spanId" class="spanclass">ID123</span>
</td>
<td>
<span>Bonny</span>
</td>
</tr>
</table>
Python code
soup = BeautifulSoup(htmlContent, "lxml")
soup.prettify()
tables = soup.find_all("table")
for table in tables:
    storeValueRows = table.find_all("tr")
    # get_text(strip=True) drops the whitespace around "ID" inside <th>
    thValue = storeValueRows[0].find_all("th")[0].get_text(strip=True)
    if (thValue == "ID"): # with this condition I am verifying that this html is correct, that I wanted.
        value = storeValueRows[1].find_all("span")[0].string
        value = value.strip()
        # storeValueRows[1] will represent <tr> tag of table located at first index and find_all("span")[0] will give me <span> tag and '.string' will give me value
        # value.strip() - will remove space from start and end of the string.
        # find using attribute :
        # 'class' is multi-valued, so it comes back as a list; take the first entry
        value = storeValueRows[1].find("span", {"name":"spanId"})['class'][0]
        print value
        # this will print spanclass
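If tdd is the Tag parsed from '<td class="abc"> 75</td>', for example
tdd = BeautifulSoup('<td class="abc"> 75</td>', 'html.parser').td
then in BeautifulSoup:
if tdd.has_attr('class'):
    print(tdd.attrs['class'][0])
Result: abc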

Beautiful Soup line matching

I'm trying to build an HTML table that only contains the table header and the row that is relevant to me. The site I'm using is http://wolk.vlan77.be/~gerben.
I'm trying to get the table header and my table entry so I do not have to look for my own name each time.
What I want to do :
get the html page
Parse it to get the header of the table
Parse it to get the line with table tags relevant to me (so the table row containing lucas)
Build a html page that shows the header and table entry relevant to me
What I am doing now :
get the header with beautifulsoup first
get my entry
add both to an array
pass this array to a method that generates a string that can be printed as html page
def downloadURL(self):
    global input
    filehandle = self.urllib.urlopen('http://wolk.vlan77.be/~gerben')
    input = ''
    for line in filehandle.readlines():
        input += line
    filehandle.close()
def soupParserToTable(self,input):
    global header
    soup = self.BeautifulSoup(input)
    header = soup.first('tr')
    tableInput='0'
    table = soup.findAll('tr')
    for line in table:
        print line
        print '\n \n'
        if '''lucas''' in line:
            print 'true'
        else:
            print 'false'
        print '\n \n **************** \n \n'
I want to get the line from the html file that contains lucas, however when I run it like this I get this in my output :
****************
<tr><td>lucas.vlan77.be</td> <td><span style="color:green;font-weight:bold">V</span></td> <td><span style="color:green;font-weight:bold">V</span></td> <td><span style="color:green;font-weight:bold">V</span></td> </tr>
false
Now I don't get why it doesn't match, the string lucas is clearly in there :/ ?
It looks like you're over-complicating this.
Here's a simpler version...
>>> import BeautifulSoup
>>> import urllib2
>>> html = urllib2.urlopen('http://wolk.vlan77.be/~gerben')
>>> soup = BeautifulSoup.BeautifulSoup(html)
>>> print soup.find('td', text=lambda data: data.string and 'lucas' in data.string)
lucas.vlan77.be
It's because line is not a string, but a BeautifulSoup.Tag instance. Try to get the td value instead:
if '''lucas''' in line.td.string:
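The difference between testing the Tag and testing its text can be seen on a small inline table. This is a sketch for Python 3 with bs4 rather than the BeautifulSoup 3 module used above. The header row and the "other.vlan77.be" row are invented for the example; only the lucas.vlan77.be entry comes from the question's output.

```python
from bs4 import BeautifulSoup

# Invented sample table; only the lucas row is taken from the question.
html = """
<table>
<tr><th>Host</th><th>Status</th></tr>
<tr><td>other.vlan77.be</td><td>V</td></tr>
<tr><td>lucas.vlan77.be</td><td>V</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")
header = rows[0]

# "in" on a Tag looks through its child nodes, not through its text,
# so the test has to run against the text of the row instead.
matches = [row for row in rows[1:] if "lucas" in row.get_text()]

# Build the reduced table from the header and the matching row(s).
reduced = "<table>" + str(header) + "".join(str(m) for m in matches) + "</table>"
print(reduced)
```

Using get_text() on the whole row also finds the name when it sits in a cell other than the first one.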
