Beautiful Soup line matching - python

I'm trying to build an HTML table that contains only the table header and the row that is relevant to me. The site I'm using is http://wolk.vlan77.be/~gerben.
I'm trying to get the table header and my table entry so I don't have to look for my own name each time.
What I want to do:
get the HTML page
parse it to get the header of the table
parse it to get the table row relevant to me (the row containing lucas)
build an HTML page that shows the header and the table entry relevant to me
What I am doing now:
get the header with BeautifulSoup first
get my entry
add both to an array
pass this array to a method that generates a string that can be printed as an HTML page
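The steps above can be sketched as a single function with a current requests + bs4 install (the URL and the name lucas come from the question; the site may no longer serve the same table, so the fetch is left commented out):

```python
from bs4 import BeautifulSoup

def build_page(html, needle):
    """Return a minimal HTML page with the header row and the row containing needle."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find_all("tr")
    header = rows[0]                                        # the table header row
    mine = next(tr for tr in rows[1:] if needle in tr.get_text())
    return "<html><body><table>%s%s</table></body></html>" % (header, mine)

# Fetch the page, then keep only the header and my row:
# import requests
# page = build_page(requests.get("http://wolk.vlan77.be/~gerben").text, "lucas")
```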
def downloadURL(self):
    global input
    filehandle = self.urllib.urlopen('http://wolk.vlan77.be/~gerben')
    input = ''
    for line in filehandle.readlines():
        input += line
    filehandle.close()
def soupParserToTable(self, input):
    global header
    soup = self.BeautifulSoup(input)
    header = soup.first('tr')
    tableInput = '0'
    table = soup.findAll('tr')
    for line in table:
        print line
        print '\n \n'
        if '''lucas''' in line:
            print 'true'
        else:
            print 'false'
        print '\n \n **************** \n \n'
I want to get the line from the HTML file that contains lucas; however, when I run it like this, I get this in my output:
****************
<tr><td>lucas.vlan77.be</td> <td><span style="color:green;font-weight:bold">V</span></td> <td><span style="color:green;font-weight:bold">V</span></td> <td><span style="color:green;font-weight:bold">V</span></td> </tr>
false
Now I don't get why it doesn't match; the string lucas is clearly in there.

It looks like you're over-complicating this.
Here's a simpler version...
>>> import BeautifulSoup
>>> import urllib2
>>> html = urllib2.urlopen('http://wolk.vlan77.be/~gerben')
>>> soup = BeautifulSoup.BeautifulSoup(html)
>>> print soup.find('td', text=lambda data: data.string and 'lucas' in data.string)
lucas.vlan77.be

It's because line is not a string but a BeautifulSoup.Tag instance. Try checking the td value instead:
if '''lucas''' in line.td.string:
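A safer variant (sketched here with the bs4 package; the legacy BeautifulSoup module behaves similarly) is to test against the row's full text, which also matches when lucas appears in a later cell:

```python
from bs4 import BeautifulSoup

html = """<table>
<tr><td>lucas.vlan77.be</td><td>V</td></tr>
<tr><td>other.vlan77.be</td><td>X</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
for row in soup.find_all("tr"):
    # row is a Tag, not a string; compare against its text content
    if "lucas" in row.get_text():
        print(row.td.string)  # lucas.vlan77.be
```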

Related

scraping a web page with python

Here is my code; it produces what I want, but not formatted the way I want:
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Florida'
fl = requests.get(url)
fl_soup = BeautifulSoup(fl.text, 'html.parser')
for name in fl_soup.findAll('td', {'class': 'bb-04em'}):
    print(name.text)
Output:
2020-04-21
27,869(+3.0%)
867
I would like the output like this:
2020-04-21 27,869(+3.0%) 867
Before accessing each <td>, try to get the data row by row via each <tr>; that gives you the information for each table row. Then you can search inside each <td> or whatever you want.
The following should do what you want:
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Florida'
fl = requests.get(url)
fl_soup = BeautifulSoup(fl.text, 'html.parser')
div_with_table = fl_soup.find('div', {'class': 'barbox tright'})
table = div_with_table.find('table')
for row in table.findAll('tr'):
    for cell in row.findAll('td', {'class': 'bb-04em'}):
        print(cell.text, end=' ')
    print()  # new line for each row
In your print statement, include the end parameter; by default, print uses end='\n':
print(name.text, end=' ')
This will give you the desired output.
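Equivalently, the row's cell texts can be joined into one string, which avoids relying on print's end parameter (the class name bb-04em is taken from the question and may change as the wiki page evolves):

```python
from bs4 import BeautifulSoup

html = """<table><tr>
<td class="bb-04em">2020-04-21</td>
<td class="bb-04em">27,869(+3.0%)</td>
<td class="bb-04em">867</td>
</tr></table>"""

soup = BeautifulSoup(html, "html.parser")
for row in soup.find_all("tr"):
    cells = row.find_all("td", {"class": "bb-04em"})
    if cells:
        # Join the cell texts with single spaces into one output line
        line = " ".join(cell.get_text(strip=True) for cell in cells)
        print(line)  # 2020-04-21 27,869(+3.0%) 867
```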

Beautiful Soup - scraping empty values

I have some Python code which scrapes the game logs of NBA players for a given season (for instance: the data here) into a csv file. I'm using Beautiful Soup. I am aware that there is an option to just get csv version by clicking on a link on the website, but I am adding something to each line, so I feel like scraping line by line is the easiest option. The goal is to eventually write code that does this for every season of every player.
The code looks like this:
import urllib
from bs4 import BeautifulSoup

def getData(url):
    html = urllib.urlopen(url)
    soup = BeautifulSoup(html, "html.parser")
    file = open('/Users/Mika/Desktop/a_players.csv', 'a')
    for table in soup.find_all("pre", class_=""):
        dataline = table.getText
        player_id = player_season_url[47:-14]
        file.write(player_id + ',' + dataline + '\n')
    file.close()

player_season_url = "https://www.basketball-reference.com/players/a/abdelal01/gamelog/1991/"
getData(player_season_url)
The problem is this: as you can see from inspecting the element of the URL, some cells in the table have empty values.
<td class="right " data-stat="fg3_pct"></td>
(this is an example of a good cell, with a value ("1") in it, that is properly scraped):
<th scope="row" class="right " data-stat="ranker" csk="1">1</th>
When scraping, the rows come out uneven, skipping over the empty values to create a CSV file with the values out of place. Is there a way to ensure that those empty values get replaced with " " in the CSV file?
Python has built-in support for writing CSV files (the csv module). To grab the whole table from the page, you could use a script like this:
import requests
from bs4 import BeautifulSoup
import csv
import re

def getData(url):
    soup = BeautifulSoup(requests.get(url).text, 'lxml')
    player_id = re.findall(r'(?:/[^/]/)(.*?)(?:/gamelog)', url)[0]
    with open('%s.csv' % player_id, 'w') as f:
        csvwriter = csv.writer(f, delimiter=',', quotechar='"')
        d = [[td.text for td in tr.find_all('td')]
             for tr in soup.find('div', id='all_pgl_basic').find_all('tr')
             if tr.find_all('td')]
        for row in d:
            csvwriter.writerow([player_id] + row)

player_season_url = "https://www.basketball-reference.com/players/a/abdelal01/gamelog/1991/"
getData(player_season_url)
The output is written to a CSV file (I viewed it in LibreOffice).
Edit:
extracted player_id from URL
file is saved in {player_id}.csv
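One reason the list-comprehension approach keeps the columns aligned is that an empty <td> still contributes an empty string to the row, so csv.writer emits a blank field instead of shifting later values; a minimal sketch:

```python
from bs4 import BeautifulSoup

# A row where one cell is empty, like the fg3_pct example in the question
html = "<tr><th>1</th><td>0.500</td><td></td><td>12</td></tr>"
row = BeautifulSoup(html, "html.parser").find("tr")

# The empty <td> becomes an empty string, preserving column positions
values = [td.text for td in row.find_all("td")]
print(values)  # ['0.500', '', '12']
```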

Trying to select rows in a table, always getting NavigableString error

I'm trying unsuccessfully to scrape a list of countries and altitudes from a wiki page:
Here's the relevant HTML from this page:
<table class="wikitable sortable jquery-tablesorter">
<thead>
<tbody>
<tr>
<td>
And here's my code:
url = "https://en.wikipedia.org/wiki/List_of_countries_by_average_elevation"
soup = BeautifulSoup(read_url(url), 'html.parser')
table = soup.find("table", {"class": "wikitable"})
tbody = table.find("tbody")
rows = tbody.find("tr")  # <-- this gives the error, saying tbody is None
countries = []
altitudes = []
for row in rows:
    cols = row.findAll('td')
    for td in cols:
        if td.a:
            countries.append(td.a.text)
        elif "m (" in td.text:
            altitudes.append(float(td.text.split("m")[0].replace(",", "")))
Here's the error:
Traceback (most recent call last):
File "wiki.py", line 18, in <module>
rows = tbody.find("tr")
AttributeError: 'NoneType' object has no attribute 'find'
So then I tried just selecting the rows straight up with soup.find('tr').
This results in a NavigableString error. What else can I try to retrieve the info in a paired fashion?
If you go to the page source and search for tbody, you will get 0 results, so that could be the cause of the first problem. It seems like Wikipedia uses a custom <table class="wikitable sortable"> and does not specify tbody.
For your second problem, you need to be using find_all and not find because find just returns the first tr. So instead you want
rows = soup.find_all("tr")
Hope this helps :)
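The find vs find_all distinction is the root of the NavigableString error: find('tr') returns a single Tag, and iterating over a Tag yields its children (including NavigableStrings), not rows. A minimal sketch, assuming bs4 is installed:

```python
from bs4 import BeautifulSoup

html = "<table><tr><td>A</td></tr><tr><td>B</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

first = soup.find("tr")      # a single Tag: the first row only
rows = soup.find_all("tr")   # a ResultSet of every row

print(first.td.text)   # A
print(len(rows))       # 2
```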
The code below worked for me:
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_countries_by_average_elevation"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
table = soup.find('table')
countries = []
altitudes = []
for row in table.find_all('tr')[1:]:
    col = row.find_all('td')
    country = col[0].text.strip()
    elevation = float(''.join(map(unicode.strip, col[1].text.split("m")[0])).replace(',', ''))
    countries.append(country)
    altitudes.append(elevation)
print countries, '\n', altitudes

BeautifulSoup - How to extract text after specified string

I have HTML like:
<tr>
<td>Title:</td>
<td>Title value</td>
</tr>
I have to specify after which <td> (by its text) I want to grab the text of the second <td>. Something like: grab the text of the first <td> after the <td> which contains the text Title:. The result should be: Title value
I have some basic understanding of Python and BeautifulSoup, and I have no idea how I can do this when there is no class to select by.
I have tried this:
row = soup.find_all('td', string='Title:')
text = str(row.nextSibling)
print(text)
and I receive the error: AttributeError: 'ResultSet' object has no attribute 'nextSibling'
First of all, soup.find_all() returns a ResultSet containing all the elements with tag td and string Title:.
For each such element in the result set, you need to get the nextSibling separately (also, you should loop until you find the next sibling with tag td, since you can get other elements in between, like a NavigableString).
Example -
>>> from bs4 import BeautifulSoup
>>> s="""<tr>
... <td>Title:</td>
... <td>Title value</td>
... </tr>"""
>>> soup = BeautifulSoup(s,'html.parser')
>>> row = soup.find_all('td', string='Title:')
>>> for r in row:
... nextSib = r.nextSibling
... while nextSib is not None and nextSib.name != 'td':
... nextSib = nextSib.nextSibling
... print(nextSib.text)
...
Title value
Or you can use another library that has XPath support; with XPath you can do this easily, using libraries like lxml or xml.etree.
What you're intending to do is easier with lxml using XPath. You can try something like this:
from lxml import etree

tree = etree.parse(<your file>)
path_list = tree.xpath('//<xpath to td>')
for i in range(0, len(path_list)):
    if path_list[i].text == '<What you want>' and i + 1 < len(path_list):
        your_text = path_list[i + 1].text
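In BeautifulSoup itself, find_next_sibling('td') skips the intervening NavigableStrings for you, so no manual sibling loop is needed; a sketch on the question's own HTML:

```python
from bs4 import BeautifulSoup

html = """<tr>
<td>Title:</td>
<td>Title value</td>
</tr>"""

soup = BeautifulSoup(html, "html.parser")
label = soup.find("td", string="Title:")
# find_next_sibling skips the whitespace NavigableStrings automatically
value = label.find_next_sibling("td")
print(value.text)  # Title value
```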

How to use BeautifulSoup to parse a table?

This is a context-specific question about how to use BeautifulSoup to parse an HTML table in Python 2.7.
I would like to extract the HTML table here, place it in a tab-delimited CSV, and have tried playing around with BeautifulSoup.
Code for context:
import requests
from bs4 import BeautifulSoup

proxies = {
    "http://": "198.204.231.235:3128",
}
site = "http://sloanconsortium.org/onlineprogram_listing?page=11&Institution=&field_op_delevery_mode_value_many_to_one[0]=100%25%20online"
r = requests.get(site, proxies=proxies)
print 'r: ', r
html_source = r.text
print 'src: ', html_source
soup = BeautifulSoup(html_source)
Why doesn't this code get the 4th row?
soup.find('table','views-table cols-6').tr[4]
How would I print out all of the elements in the first row (not the header row)?
Okay, someone might be able to give you a one-liner, but the following should get you started:
table = soup.find('table', class_='views-table cols-6')
for row in table.find_all('tr'):
    row_text = list()
    for item in row.find_all('td'):
        text = item.text.strip()
        row_text.append(text.encode('utf8'))
    print row_text
I believe your tr[4] is being interpreted as an attribute lookup, not as an index as you suppose.
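Attribute-style access like table.tr only reaches the first <tr>, and subscripting a Tag looks up an HTML attribute, which is why tr[4] fails. Indexing into find_all gets the fourth row instead (sketched on a made-up table, since the original site's markup is not reproduced here):

```python
from bs4 import BeautifulSoup

# A small stand-in table with six rows
html = "<table class='views-table cols-6'>" + "".join(
    "<tr><td>row %d</td></tr>" % i for i in range(6)) + "</table>"
soup = BeautifulSoup(html, "html.parser")

# find_all returns a list of row Tags, which supports integer indexing
rows = soup.find("table", class_="views-table cols-6").find_all("tr")
print(rows[4].td.text)  # row 4
```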
