Reading table using BeautifulSoup - python

I'm reading an HTML file with BeautifulSoup. I have a table in the HTML from which I need to read data, but the HTML contains more than one table.
To distinguish between the tables, I need to see the number of columns on each line by counting <td> tags.
I count like this:
for i in soup.find_all('tr'):
    for x in i.find_all_next('td'):
        pass  # count each <td> here
This returns all <td> tags after the <tr> until the end of the document. But I need to know the number of <td> tags between the start of a line (<tr>) and the end of that line (</tr>).
<tr> <!-- Should return 2 columns, but will return 4 in script. -->
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
</tr>

Replace findallnext (find_all_next in bs4) with find_all.
find_all_next returns every matching tag after the <tr>, up to the end of the document, as you said.
find_all returns only the matching tags nested inside the <tr>.
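As a minimal sketch of that fix, the snippet below counts the cells in each row with find_all. The sample HTML and the variable names are made up for illustration; only find_all itself comes from the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two tables with different column counts.
html = """
<table>
<tr><td></td><td></td></tr>
<tr><td></td><td></td></tr>
</table>
<table>
<tr><td></td><td></td><td></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all on a row only searches inside that row, so each count
# stops at the row's closing </tr>.
counts = [len(tr.find_all("td")) for tr in soup.find_all("tr")]
print(counts)  # [2, 2, 3]
```

If tables are nested inside cells, passing recursive=False to the inner find_all restricts the count to the row's own direct cells.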

Related

How do I get Beautifulsoup to Parse a Serial HTML list in a table into a CSV pattern of data?

I have an internal company webpage that lists a variety of data in a long list that I want to convert into a CSV file for reviewing. The data is in the format of:
*CUSTOMER_1*
Email Link Category_Text Phone_Numbers
Email Link Category_Text Phone_Numbers
*Customer_2*
Email Link Category_Text Phone_Numbers
Email Link Category_Text Phone_Numbers
Encoded in HTML it looks like
<table id="responsibility">
<tr class="customer">
<td colspan="6">
<strong>CUSTOMER 1</strong>
</td>
</tr>
<tr id="tr_1" title="Role_Name1">
<td>Name_1</td>
<td>Category_Text</td>
<td>Phone_Numbers</td>
<td></td>
</tr>
<tr id="tr_2" title="Role_Name2">
<td>Name_2</td>
<td>Category_Text</td>
<td>Phone_Numbers</td>
<td></td>
</tr>
<tr class="customer">
<td colspan="6">
<strong>CUSTOMER 2</strong>
</td>
</tr>
<tr id="tr_1" title="Role_Name1">
<td>Name_3</td>
<td>Category_Text</td>
<td>Phone_Numbers</td>
<td></td>
</tr>
<tr id="tr_2" title="Role_Name2">
<td>Name_2</td>
<td>Category_Text</td>
<td>Phone_Numbers</td>
<td></td>
</tr>
</table>
I'd like to end up with a file.csv that contains the info in this fashion
CUSTOMER1,Role_Name1,Name_1,Email_1,Category_Text,Phone_Numbers
CUSTOMER1,Role_Name2,Name_2,Email_2,Category_Text,Phone_Numbers
CUSTOMER2,Role_Name1,Name_3,Email_3,Category_Text,Phone_Numbers
CUSTOMER2,Role_Name1,Name_2,Email_2,Category_Text,Phone_Numbers
Right now I can get a list of all of the customer names or a list of all of the text, but I haven't been able to figure out how to iterate over every customer and then over every line for each customer.
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("source.html"), "html.parser")
with open("output.csv",'w') as file:
    responsibility=soup.find('table',{'id':'responsibility'})
    line=responsibility.tr
    for i in responsibility:
        print(line)
        line=responsibility.tr.next_sibling
I was expecting this to print every tag in the document but instead it only prints the first and never cycles to the next tags.
Focus on this line of code :
line=responsibility.tr
Here, you are using .tr, which locates the first <tr> tag in the table and returns it.
What does that mean here?
Say you have n <tr> tags: .tr gives you only the first of those n. To extract all n of them, use find_all(), which returns a list of all matches.
line=responsibility.find_all("tr", class_="customer")
Also, add the class_="customer" filter. It locates all the <tr> blocks with the "customer" class. From each of those, find_next_sibling("tr") reaches the rows with the title="Role_Name*" attribute. Plain .next_sibling often returns the whitespace text node between two tags rather than the next <tr>, so it is less reliable here.
So, to put the above theory in practice:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("source.html"), "html.parser")
responsibility = soup.find('table', {'id': 'responsibility'})
lines = responsibility.find_all("tr", class_="customer")
for line in lines:
    # find_next_sibling skips the whitespace text nodes between rows
    line1 = line.find_next_sibling("tr")   # locates tr with title="Role_Name1"
    line2 = line1.find_next_sibling("tr")  # locates tr with title="Role_Name2"
    print(line1)
    print(line2)
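To get from printed rows to the CSV the question asks for, one possible approach is to walk the rows in order and remember the most recent customer row. This is only a sketch: the HTML is a shortened copy of the question's sample, the variable names are hypothetical, and the Email column is left out because the sample HTML does not contain it. An in-memory buffer stands in for the output file.

```python
import csv
import io
from bs4 import BeautifulSoup

# Shortened copy of the table from the question.
html = """<table id="responsibility">
<tr class="customer"><td colspan="6"><strong>CUSTOMER 1</strong></td></tr>
<tr id="tr_1" title="Role_Name1"><td>Name_1</td><td>Category_Text</td><td>Phone_Numbers</td><td></td></tr>
<tr id="tr_2" title="Role_Name2"><td>Name_2</td><td>Category_Text</td><td>Phone_Numbers</td><td></td></tr>
<tr class="customer"><td colspan="6"><strong>CUSTOMER 2</strong></td></tr>
<tr id="tr_1" title="Role_Name1"><td>Name_3</td><td>Category_Text</td><td>Phone_Numbers</td><td></td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="responsibility")

records = []
customer = None
for tr in table.find_all("tr"):
    if "customer" in tr.get("class", []):
        # A customer header row: remember its name for the rows that follow.
        customer = tr.get_text(strip=True)
        continue
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    records.append([customer, tr.get("title", "")] + cells)

# Stand-in for open("output.csv", "w", newline="").
buffer = io.StringIO()
csv.writer(buffer).writerows(records)
print(buffer.getvalue())
```

Tracking the current customer in a variable avoids having to know in advance how many role rows follow each customer.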

Python Bs4: How to retrieve row in table based on specific 'td' value in that row

If I have a web page with multiple tables and I want to retrieve the source code for a specific row from a specific table based on a keyword in beautifulsoup4, how can I do that using the find or find_all methods (or any other methods, for that matter)?
Using the table above, let's say I want to retrieve the row that contains the keyword "ROW 1" (or "A", "B", "C" etc.) and only that row. How can I go about that?
The example below is contrived, but with bs4 4.7.1+ you can use the pseudo-class CSS selectors :has and :contains to match a tr (row) that has a td (table cell) containing the wanted phrase. In newer Soup Sieve releases :contains is deprecated in favor of :-soup-contains. A table identifier is passed as well to target the correct table (an id here, to keep things simple). select returns all qualifying tr elements; use select_one if only the first match is required.
soup.select('#example tr:has(> td:contains("Row 1"))')
from bs4 import BeautifulSoup as bs
html = '''
<table id="example">
<tbody><tr>
<th>Col1</th>
<th>Col2</th>
<th>Col3</th>
</tr>
<tr>
<td>Row 1</td>
<td>A</td>
<td>B</td>
</tr>
<tr>
<td>Row 2</td>
<td>C</td>
<td>D</td>
</tr>
</tbody></table>
<table id="example2">
<tbody><tr>
<th>Col1</th>
<th>Col2</th>
<th>Col3</th>
</tr>
<tr>
<td>Not Row 1</td>
<td>A</td>
<td>B</td>
</tr>
<tr>
<td>Not Row 2</td>
<td>C</td>
<td>D</td>
</tr>
</tbody></table>
'''
soup = bs(html, 'lxml') #'html.parser'
soup.select('#example tr:has(> td:contains("Row 1"))')
Alternatively, load the HTML into pandas and do the following (this code is untested):
import pandas as pd
html_table = 'From your web scraping'
dfs = pd.read_html(io=html_table)  # read_html returns a list of DataFrames, one per table
df = dfs[0]  # pick the table you need
df.loc[0]  # all the information for the first data row (rows are labeled from 0)
I'd suggest spending 10 minutes to learn pandas; it will really help out. https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html
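Since the question asks specifically about find and find_all, here is a rough sketch of that route. The helper name find_rows and the shortened HTML are assumptions for illustration; it does a plain substring test on each cell's text.

```python
from bs4 import BeautifulSoup

# Shortened version of the two example tables above.
html = """
<table id="example">
<tr><th>Col1</th><th>Col2</th><th>Col3</th></tr>
<tr><td>Row 1</td><td>A</td><td>B</td></tr>
<tr><td>Row 2</td><td>C</td><td>D</td></tr>
</table>
<table id="example2">
<tr><th>Col1</th><th>Col2</th><th>Col3</th></tr>
<tr><td>Not Row 1</td><td>A</td><td>B</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")


def find_rows(soup, table_id, keyword):
    """Return the <tr> tags of one table that have a <td> containing keyword."""
    table = soup.find("table", id=table_id)
    if table is None:
        return []
    return [
        tr
        for tr in table.find_all("tr")
        if any(keyword in td.get_text() for td in tr.find_all("td"))
    ]


matches = find_rows(soup, "example", "Row 1")
for tr in matches:
    print(tr)
```

Because this is a substring test, "Row 1" would also match a cell such as "Row 10"; comparing td.get_text(strip=True) == keyword instead gives an exact match.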

Crawling webpage with scrapy

I've been reading up on Scrapy. My Python skills are weak, but I'm usually able to build something through trial and error and determination...
I'm able to run through my project site and scrape 'structured' product data.
The problem occurs with a table that has different rows and values per page.
In the example below, I can get the name and price of the product.
The problem is with the table underneath: products have different specifications and a different number of rows, but always 2 columns. I'm trying to loop through by counting the <tr> elements and, for each, take the first <td> as a label and the second <td> as the corresponding value, then append it to the other page data to create 1 entry.
In the end I'd like to yield Name: name, Price: price, Label X: Value X, Label Y: Value Y
<div>name</div>
<div>price</div>
<table>
<tr><td>LABEL X</td><td>VALUE X</td></tr>
<tr><td>LABEL Y</td><td>VALUE Y</td></tr>
<tr><td>LABEL Z</td><td>VALUE Z</td></tr>
Could be anywhere from 2 to 6 rows
</table>
Any help would be much appreciated, or if someone could point me to an example.
EDIT >>>>
The HTML code
<table class="table table-striped">
<tbody>
<tr>
<td><b>Name:</b></td>
<td>Car</td>
</tr>
<tr>
<td><b>Brand:</b></td>
<td itemprop="brand">Merc</td>
</tr>
<tr>
<td><b>Size:</b></td>
<td>30 XL</td>
</tr>
<tr>
<td><b>Color:</b></td>
<td>white</td>
</tr>
<tr>
<td><b>Stock</b></td>
<td>20</td>
</tr>
</tbody>
</table>
You should have posted some Scrapy code to help us out.
Anyways, here is the code you can use to parse your HTML.
for row in response.css('table tr'):  # 'table > tr' misses rows when the source HTML has a <tbody>
    data = {}
    data['name'] = row.css("td:nth-child(1) b::text").extract()[0]
    data['value'] = row.css("td:nth-child(2)::text").extract()[0]
    yield MyItem(name=data['name'], value=data['value'])
PS:
Do not use tbody in selectors or XPaths unless it is present in the raw response: modern browsers add tbody when rendering, so it is often missing from the original HTML.
See here: https://doc.scrapy.org/en/0.14/topics/firefox.html
Firefox, in particular, is known for adding <tbody> elements to tables. Scrapy, on the other hand, does not modify the original page HTML, so you won't be able to extract any data if you use <tbody> in your XPath expressions.

Transform BeautifulSoup extract into sqlite table

I have an HTML table structure that looks something like:
<table>
<tbody>
<tr>
<td>
<ul>
</ul>
</td>
<td>
<table>
<tbody>
<tr></tr>
<tr></tr>
<tr></tr>
</tbody>
</table>
<table> -- (table structure I am interested in)
<tbody>
<tr>
<td class="dte"></td>
<td class="id"></td>
<td class="desc"></td>
</tr>
<tr>
<td class="dte"></td>
<td class="id"></td>
<td class="desc"></td>
</tr>
<tr>
<td class="dte"></td>
<td class="id"></td>
<td class="desc"></td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
Using Python/BeautifulSoup, I have managed to print output to the screen like:
[b'16 March', b'987654', b'Something happens on this date']
[b'23 March', b'321987', b'Something happens on this date']
[b'26 March', b'123456', b'Something happens on this date']
using the following code (which I have hacked together from various posts on this site):-
for mytable in soup.find('body').find_all('table'):
    #print (len(mytable))
    for trs in mytable.find_all('tr'):
        tds = trs.find_all('td', class_='dte id desc'.split())
        if tds: # checks if 'tds' has value. if YES then block is executed
            row = [elem.text.strip().encode('utf-8') for elem in tds]
            print (row)
        else:
            continue # 'row' item is empty, proceed to next loop
2 questions:
1. When the output prints to screen, I get the whole table structure on the first line (so each of the above examples is output on the first line; the actual table has about 100 entries) and then, from the second line, I get a single entry per line (as shown above), which is what I want. How can I ignore or NOT output the full structure on the first line? And why do I get that?
2. I would like to transform the results shown above into a sqlite3 table structure which I would at a later date ETL into a production MSSQL environment. I have not been able to find a way to do this based on the output I am getting.
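On the first question: the outermost <tr> contains the nested table, and find_all searches all descendants by default, so that one row matches every td in the inner table at once. A possible sketch addressing both questions follows. The sample data is a cut-down version of the structure above, and the table and column names (events, dte, event_id, descr) are assumptions, as is the in-memory database.

```python
import sqlite3
from bs4 import BeautifulSoup

# Cut-down version of the nested structure from the question.
html = """
<table><tbody><tr>
<td><ul></ul></td>
<td>
<table><tbody><tr></tr></tbody></table>
<table><tbody>
<tr><td class="dte">16 March</td><td class="id">987654</td><td class="desc">Something happens on this date</td></tr>
<tr><td class="dte">23 March</td><td class="id">321987</td><td class="desc">Something happens on this date</td></tr>
</tbody></table>
</td>
</tr></tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.find_all("tr"):
    # recursive=False looks only at this row's own cells, so the outer row
    # (which merely contains the nested table) yields nothing.
    tds = tr.find_all("td", class_=["dte", "id", "desc"], recursive=False)
    if len(tds) == 3:
        rows.append(tuple(td.get_text(strip=True) for td in tds))

# Hypothetical schema; swap ":memory:" for a file path to persist the data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (dte TEXT, event_id TEXT, descr TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
conn.commit()

stored = conn.execute(
    "SELECT dte, event_id, descr FROM events ORDER BY rowid"
).fetchall()
print(stored)
```

Keeping the values as str rather than encoding them to bytes lets sqlite3 store them as ordinary TEXT columns.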

Limiting BeautifulSoup output

I have been working semi-successfully with BeautifulSoup and Selenium for some weeks now. However I have found myself in a situation I cannot untangle.
I need to extract the html from the first 6 rows or so out of a table. These rows do not share any class, id or similar.
Table structure:
<table class="Table">
<tr class="Table_Header">
<td colspan="2">Some Text</td>
</tr>
<tr>
<td class="Class2">Some Text</td>
<td><span class="Class"></span>Some Text</td>
</tr>
<tr>
<td class="Class2">Some Text</td>
<td>Some Text</td>
</tr>
<tr>
<td class="Class2">Some Text</td>
<td>Some Text</td>
</tr>
<tr class="Class3">
<td class="Class2"> Some Text </td>
<td>Some Text</td>
</tr>
<tr class="Class3">
<td class="Class2">Some Text</td>
<td>Some Text</td>
</tr>
<tr>
<td class="Class2">Some Text</td>
<td> <div class="Class4">Some Text</div>
<div class="Class4">Some Text</div>
</td>
</tr>
The table goes on and on, maintaining this structure but with seemingly random classes popping in and out.
Basically I need to return the first six tr elements. I have tried several methods, but they return either the entire table or a single tr.
Any ideas?
Thanks in advance!
So you're trying to get the first 6 tr from a table? If I understand the question correctly I had a similar problem where I needed to get the first 400 td. Perhaps the code below would help?
Maybe something like
i = 0
for row in get_log().findAll('tr'):
    for cell in row.findAll('td'):
        print (cell.text)
        logfile.write('{}\n'.format(cell.text))
        i += 1
        if i == 400:
            break
    if i == 400:
        break  # also leave the outer loop once 400 cells are written
Also let me point you at the article I used to solve my own problem, the good stuff is near the end as it assumes you know literally nothing.
https://first-web-scraper.readthedocs.org/en/latest/
EDIT:
Using the table on Boone County as a source:
import requests
from bs4 import BeautifulSoup
url = 'http://www.showmeboone.com/sheriff/JailResidents/JailResidents.asp'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', attrs={'class': 'collapse shadow BCSDTable'})
i = 0
for row in table.find_all('tr'):
    print(row.prettify())
    i += 1
    print(i)
    if i == 6:
        break
This outputs a ton of information, so I won't post it. Maybe you want to refine what you want from within each tr?
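A shorter alternative to counting in a loop, sketched here under the assumption that the rows of interest are simply the first six tr tags of the table: find_all accepts a limit argument, and its result can also be sliced like a list. The generated HTML below is a stand-in for the real page.

```python
from bs4 import BeautifulSoup

# Stand-in table: one header row followed by nine data rows.
data_rows = "".join(
    f'<tr><td class="Class2">label {n}</td><td>value {n}</td></tr>'
    for n in range(1, 10)
)
html = (
    '<table class="Table">'
    '<tr class="Table_Header"><td colspan="2">Some Text</td></tr>'
    f"{data_rows}</table>"
)

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="Table")

# limit stops the search after six matches...
first_six = table.find_all("tr", limit=6)
# ...and slicing the full result gives the same rows.
same_rows = table.find_all("tr")[:6]

for row in first_six:
    print(row)
```

With limit the search stops early, which may matter on very long tables; slicing is handy when other ranges (for example rows 2 to 7) are needed.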
