Extract a specific HTML table row in Python

Let's say I have an HTML table like this:
<tr>
<td class="Klasse gerade">12A<br></td>
<td class="Stunde gerade">4<br></td>
<td class="Fach gerade">GEO statt GE<br></td>
<td class="Lehrer gerade"><br></td>
<td class="Vertretung gerade">Herr Grieger<br></td>
<td class="Raum gerade">603<br></td>
<td class="Anmerkung gerade"><br></td>
</tr>
<tr>
<td class="Klasse gerade">10A<br></td>
<td class="Stunde gerade">2<br></td>
<td class="Fach gerade">MA statt GE<br></td>
<td class="Lehrer gerade"><br></td>
<td class="Vertretung gerade">Herr Grieger<br></td>
<td class="Raum gerade">406<br></td>
<td class="Anmerkung gerade"><br></td>
</tr>
If I fetch the HTML in Python (2.7) with:
link = "http://www.test.com/vplan.html"
f = urllib.urlopen(link)
vplan = f.read()
print vplan
How can I do the following: if a td contains 10A, print the complete tr that it belongs to?
Sorry for the awkward wording, but in my opinion this is the easiest way to explain my question. Don't be confused by the German words (I'm German).

You need an HTML parser like BeautifulSoup. Assuming the table in question is the only one or the first one in the document, the program may look like this:
#!/usr/bin/env python
import urllib
from bs4 import BeautifulSoup

def main():
    link = 'http://www.test.com/vplan.html'
    soup = BeautifulSoup(urllib.urlopen(link), 'lxml')
    table = soup.find('table')
    # find every cell whose text is '10A' and walk up to its row
    rows = [x.find_parent('tr') for x in table.find_all(text='10A')]
    for row in rows:
        for cell in row.find_all('td'):
            print cell.text
        print '-' * 10

if __name__ == '__main__':
    main()
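If '10A' could also appear in another column (say, inside a room number or remark), a slightly stricter variant is to match only the Klasse cells; a minimal sketch reusing the table object from the snippet above:
# hedged variant: filter on the Klasse column only, instead of matching any cell text
rows = [td.find_parent('tr')
        for td in table.find_all('td', class_='Klasse')
        if td.get_text(strip=True) == '10A']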

Related

How to extract <td> elements with no attributes?

I am a newbie at web scraping and I am stuck on an issue; I have tried and searched but found nothing. I want to extract data from the table. The problem is that the tr and td elements have no distinguishing attributes, only a class, and <td class="review-value"> is the same for different values, so I don't know how to tell them apart. I need lists with every single value, for example:
list_1 = [aircraft1, aircraft2, aircraft3, ..]
list_2 = [type_of_traveller1, type_of_traveller2, ...]
list_3 = [cabin_flown1, cabin_flown2,...]
list_4 = [route1, route2,...]
list_5 = [date_flown1, date_flown2, ..]
This is the table code:
<table class="review-ratings">
<tr>
<td class="review-rating-header aircraft">Aircraft</td>
<td class="review-value">Boeing 787-9</td>
</tr>
<tr>
<td class="review-rating-header type_of_traveller">Type Of Traveller</td>
<td class="review-value">Couple Leisure</td>
</tr>
<tr>
<td class="review-rating-header cabin_flown">Seat Type</td>
<td class="review-value">Business Class</td>
</tr>
<tr>
<td class="review-rating-header route">Route</td>
<td class="review-value">Mexico City to London</td>
</tr>
<tr>
<td class="review-rating-header date_flown">Date Flown</td>
<td class="review-value">February 2023</td>
</tr>
</table>
I am using BeautifulSoup:
import requests
from bs4 import BeautifulSoup

page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
table = soup.find('article')
review_table = table.find_all('table', class_='review-ratings')
find_tr_header = table.find_all('td', class_='review-rating-header')
headers = []
for i in find_tr_header:
    headers.append(i.text.strip())
And I don't know what to do with class="review-value".
As I can see in your table, each header cell is followed by a .review-value cell (its direct sibling).
So what you can do is use the CSS adjacent-sibling selector +.
For instance, .aircraft + .review-value will give you the value of the aircraft.
In Beautiful Soup you can even avoid this type of selector, since there are built-in methods for it; check find_next_sibling.
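A minimal sketch of both approaches, assuming soup already holds the parsed review page from the question; the per-field grouping mirrors the list_1 … list_5 the question asks for:
from collections import defaultdict

# CSS adjacent-sibling selector: the value cell that directly follows the aircraft header
aircraft = [td.get_text(strip=True) for td in soup.select('.aircraft + .review-value')]

# sibling navigation: group every value under its header's field class (aircraft, route, ...)
values = defaultdict(list)
for header in soup.find_all('td', class_='review-rating-header'):
    field = [c for c in header['class'] if c != 'review-rating-header'][0]
    value = header.find_next_sibling('td', class_='review-value')
    if value:
        values[field].append(value.get_text(strip=True))
# values['aircraft'] -> ['Boeing 787-9'], values['type_of_traveller'] -> ['Couple Leisure'], ...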

Select a table without attributes with BeautifulSoup

Can BeautifulSoup select a table that has no attributes?
There are many tables in the HTML, but the data I want is in the table without any attributes.
Here is my example:
There are 2 tables in the HTML.
One is the "english" table (letters), and the other is the "number" table.
from bs4 import BeautifulSoup
HTML2 = """
<table>
<tr>
<td class>a</td>
<td class>b</td>
<td class>c</td>
<td class>d</td>
</tr>
<tr>
<td class>e</td>
<td class>f</td>
<td class>g</td>
<td class>h</td>
</tr>
</table>
<table cellpadding="0">
<tr>
<td class>111</td>
<td class>222</td>
<td class>333</td>
<td class>444</td>
</tr>
<tr>
<td class>555</td>
<td class>666</td>
<td class>777</td>
<td class>888</td>
</tr>
"""
soup2 = BeautifulSoup(HTML2, 'html.parser')
f2 = soup2.select('table[cellpadding!="0"]') #<--- I think the key point is here.
for div in f2:
    row = ''
    rows = div.findAll('tr')
    for row in rows:
        if(row.text.find('td') != False):
            print(row.text)
I only want the data in the "english" table,
formatted like the following:
a b c d
e f g h
Then save it to Excel.
But I can only access the "number" table.
Is there a hint?
Thanks!
You could use find_all and select only tables that don't have a specific attribute.
f2 = soup2.find_all('table', {'cellpadding':None})
Or if you want to select tables that have absolutely no attributes:
f2 = [tbl for tbl in soup2.find_all('table') if not tbl.attrs]
Then you can build a list of rows from f2 and pass it to a DataFrame.
data = [
    [td.text for td in tr.find_all('td')]
    for table in f2 for tr in table.find_all('tr')
]
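Since the question also wants the result saved to Excel, a minimal sketch of that last step, assuming pandas and an Excel writer such as openpyxl are installed:
import pandas as pd

df = pd.DataFrame(data)   # one list of cell texts per table row
df.to_excel('english_table.xlsx', index=False, header=False)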
You can use the has_attr method to test whether a table has the cellpadding attribute:
soup2 = BeautifulSoup(HTML2, 'html.parser')
f2 = soup2.find_all('table')
for div in f2:
    if not div.has_attr('cellpadding'):
        row = ''
        rows = div.findAll('tr')
        for row in rows:
            if(row.text.find('td') != False):
                print(row.text)

Python Beautiful Soup find string and extract following string

I am programming a web crawler with the help of Beautiful Soup. I have the following HTML code:
<tr class="odd-row">
<td>xyz</td>
<td class="numeric">5,00%</td>
</tr>
<tr class="even-row">
<td>abc</td>
<td class="numeric">50,00%</td
</tr>
<tr class="odd-row">
<td>ghf</td>
<td class="numeric">2,50%</td>
My goal is to write the numbers after class="numeric" to a specific variable. I want to do this conditional on the string above the class statement (e.g. "xyz", "abc", ...).
At the moment I am doing the following:
for c in soup.find_all("a", string=re.compile('abc')):
    abc = c.string
But of course it returns the string "abc" and not the number in the tag that follows.
So basically my question is how to address the string after class="numeric", conditional on the string beforehand.
Thanks for your help!
Once you find the correct td, which I presume is what you meant to search for instead of a, get the next sibling with the class you want:
h = """<tr class="odd-row">
<td>xyz</td>
<td class="numeric">5,00%</td>
</tr>
<tr class="even-row">
<td>abc</td>
<td class="numeric">50,00%</td
</tr>
<tr class="odd-row">
<td>ghf</td>
<td class="numeric">2,50%</td>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(h, "html.parser")
for td in soup.find_all("td", text="abc"):
    print(td.find_next_sibling("td", class_="numeric"))
If the numeric td is always next you can just call find_next_sibling():
for td in soup.find_all("td", text="abc"):
    print(td.find_next_sibling())
For your input both would give you:
<td class="numeric">50,00%</td>
If I understand your question correctly, and if I assume your html code will always follow your sample structure, you can do this:
result = {}
table_rows = soup.find_all("tr")
for row in table_rows:
    table_columns = row.find_all("td")
    result[table_columns[0].text] = table_columns[1].text
print result #### {u'xyz': u'5,00%', u'abc': u'50,00%', u'ghf': u'2,50%'}
You eventually get a dictionary whose keys are 'xyz', 'abc', etc. and whose values are the strings in the class="numeric" cells.
So as I understand your question you want to iterate over the tuples
('xyz', '5,00%'), ('abc', '50,00%'), ('ghf', '2,50%'). Is that correct?
But I don't understand how your code produces any results, since you are searching for <a> tags.
Instead you should iterate over the <tr> tags and then take the strings inside the <td> tags. Notice the double next_sibling for accessing the second <td>, since the first next_sibling would reference the whitespace between the two tags.
html = """
<tr class="odd-row">
<td>xyz</td>
<td class="numeric">5,00%</td>
</tr>
<tr class="even-row">
<td>abc</td>
<td class="numeric">50,00%</td
</tr>
<tr class="odd-row">
<td>ghf</td>
<td class="numeric">2,50%</td>
</tr>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
for tr in soup.find_all("tr"):
    print((tr.td.string, tr.td.next_sibling.next_sibling.string))

Iterating Through Table Rows in Selenium (Python)

I have a webpage with a table that only appears when I click 'Inspect Element' and is not visible through the View Source page. The table contains only two rows with several cells each and looks similar to this:
<table class="datadisplaytable">
<tbody>
<tr>
<td class="dddefault">16759</td>
<td class="dddefault">MATH</td>
<td class="dddefault">123</td>
<td class="dddefault">001</td>
<td class="dddefault">Calculus</td>
<td class="dddefault"></td>
<td class="dddead"></td>
<td class="dddead"></td>
</tr>
<tr>
<td class="dddefault">16449</td>
<td class="dddefault">PHY</td>
<td class="dddefault">456</td>
<td class="dddefault">002</td>
<td class="dddefault">Physics</td>
<td class="dddefault"></td>
<td class="dddead"></td>
<td class="dddead"></td>
</tr>
</tbody>
</table>
What I'm trying to do is to iterate through the rows and return the text contained in each cell. I can't really seem to do it with Selenium. The elements contain no IDs and I'm not sure how else to get them. I'm not very familiar with using xpaths and such.
Here is a debugging attempt that returns a TypeError:
def check_grades(self):
    table = []
    for i in self.driver.find_element_by_class_name("dddefault"):
        table.append(i)
    print(table)
What is an easy way to get the text from the rows?
XPath is fragile. It's better to use CSS selectors or classes:
mytable = driver.find_element_by_css_selector('table.datadisplaytable')
for row in mytable.find_elements_by_css_selector('tr'):
    for cell in row.find_elements_by_tag_name('td'):
        print(cell.text)
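If the goal is to collect the cells into a list of rows, as the check_grades attempt above tries to do, a short sketch along the same lines (the data name is just a placeholder):
mytable = driver.find_element_by_css_selector('table.datadisplaytable')
data = [[cell.text for cell in row.find_elements_by_tag_name('td')]
        for row in mytable.find_elements_by_css_selector('tr')]
# data -> [['16759', 'MATH', '123', '001', 'Calculus', '', '', ''],
#          ['16449', 'PHY', '456', '002', 'Physics', '', '', '']]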
If you want to go row by row using an xpath, you can use the following:
h = """<table class="datadisplaytable">
<tr>
<td class="dddefault">16759</td>
<td class="dddefault">MATH</td>
<td class="dddefault">123</td>
<td class="dddefault">001</td>
<td class="dddefault">Calculus</td>
<td class="dddefault"></td>
<td class="dddead"></td>
<td class="dddead"></td>
</tr>
<tr>
<td class="dddefault">16449</td>
<td class="dddefault">PHY</td>
<td class="dddefault">456</td>
<td class="dddefault">002</td>
<td class="dddefault">Physics</td>
<td class="dddefault"></td>
<td class="dddead"></td>
<td class="dddead"></td>
</tr>
</table>"""
from lxml import html

xml = html.fromstring(h)
# gets the table
table = xml.xpath("//table[@class='datadisplaytable']")[0]
# iterate over all the rows
for row in table.xpath(".//tr"):
    # get the text from all the td's of each row
    print([td.text for td in row.xpath(".//td[@class='dddefault'][text()]")])
Which outputs:
['16759', 'MATH', '123', '001', 'Calculus']
['16449', 'PHY', '456', '002', 'Physics']
Using td[text()] will avoid getting any Nones returned for the td's that hold no text.
So to do the same using selenium you would:
table = driver.find_element_by_xpath("//table[@class='datadisplaytable']")
for row in table.find_elements_by_xpath(".//tr"):
    print([td.text for td in row.find_elements_by_xpath(".//td[@class='dddefault'][1]")])
For multiple tables:
def get_row_data(table):
    for row in table.find_elements_by_xpath(".//tr"):
        yield [td.text for td in row.find_elements_by_xpath(".//td[@class='dddefault'][text()]")]

for table in driver.find_elements_by_xpath("//table[@class='datadisplaytable']"):
    for data in get_row_data(table):
        pass  # use the data
Correction of the Selenium part of @Padraic Cunningham's answer:
table = driver.find_element_by_xpath("//table[@class='datadisplaytable']")
for row in table.find_elements_by_xpath(".//tr"):
    print([td.text for td in row.find_elements_by_xpath(".//td[@class='dddefault']")])
Note: the original snippet was missing a closing round bracket at the end; this version also drops the [1] index, to match the first lxml example.
Another note: the example with the [1] index is still worth keeping, since it shows how to extract an individual cell.
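A small sketch of that, assuming the same table element as above; the [1] XPath index keeps only the first dddefault cell of each row:
for row in table.find_elements_by_xpath(".//tr"):
    first = row.find_elements_by_xpath(".//td[@class='dddefault'][1]")
    if first:
        print(first[0].text)  # '16759' for the first row, '16449' for the second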
Another Version (modified and corrected post by Padraic Cunningham):
Tested with Python 3.x
#!/usr/bin/python
h = """<table class="datadisplaytable">
<tr>
<td class="dddefault">16759</td>
<td class="dddefault">MATH</td>
<td class="dddefault">123</td>
<td class="dddefault">001</td>
<td class="dddefault">Calculus</td>
<td class="dddefault"></td>
<td class="dddead"></td>
<td class="dddead"></td>
</tr>
<tr>
<td class="dddefault">16449</td>
<td class="dddefault">PHY</td>
<td class="dddefault">456</td>
<td class="dddefault">002</td>
<td class="dddefault">Physics</td>
<td class="dddefault"></td>
<td class="dddead"></td>
<td class="dddead"></td>
</tr>
</table>"""
from lxml import html

xml = html.fromstring(h)
# gets the table
table = xml.xpath("//table[@class='datadisplaytable']")[0]
# iterate over all the rows
for row in table.xpath(".//tr"):
    # get the text from all the td's of each row
    print([td.text for td in row.xpath(".//td[@class='dddefault']")])

How to get all html elements with id equal to `constant_text-something_changed`?

I am trying to parse html with lxml like below:
<tr id="element-36a07b7" class=" " ... data-date="2014-05-25">
<td>2014-05-25</td>
<td>Wikipedia (link)</td>
<td>Yandex (link)</td>
<td title="what I am looking for">another needed info<span class="small">(info 3)</span>
</td>
<td class="result">1</td>
<td class="result">2</td>
<td class="result">3</td>
...
</tr>
and would like to get all elements with id equal to element-... and extract 36a07b7, data-date, what I am looking for, another needed info and info 3 from there.
Firstly, I am trying to get all element-s:
elements = t.find('//*[@id="flight-"]')
How can I use wildcards in the id name? I tried * and .*, but it doesn't work.
Use the starts-with function:
import lxml.html
root = lxml.html.fromstring('''
<table>
<tr id="element-36a07b7" class=" " data-date="2014-05-25">
<td>2014-05-25</td>
<td>Wikipedia (link)</td>
<td>Yandex (link)</td>
<td title="what I am looking for">another needed info<span class="small">(info 3)</span>
</td>
<td class="result">1</td>
<td class="result">2</td>
<td class="result">3</td>
...
</tr>
</table>
''')
tr_list = root.xpath('//*[starts-with(@id, "element-")]')
for tr in tr_list:
    print tr.get('id').split('-')[1]
    print tr.get('data-date')
output:
36a07b7
2014-05-25
Alternatively, you can use css selector, using cssselect method:
tr_list = root.cssselect('[id^=element-]')
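The question also asks for the title attribute, the cell text and the span text; a short sketch extending the loop from the answer above (same root and tr_list, the expected values are taken from the sample row):
for tr in tr_list:
    td = tr.xpath('.//td[@title]')[0]       # the cell carrying a title attribute
    print(td.get('title'))                  # what I am looking for
    print(td.text)                          # another needed info
    print(td.xpath('.//span')[0].text)      # (info 3)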
