Transform BeautifulSoup extract into sqlite table

I have an HTML table structure that looks something like this:
<table>
<tbody>
<tr>
<td>
<ul>
</ul>
</td>
<td>
<table>
<tbody>
<tr></tr>
<tr></tr>
<tr></tr>
</tbody>
</table>
<table> -- (table structure I am interested in)
<tbody>
<tr>
<td class="dte"></td>
<td class="id"></td>
<td class="desc"></td>
</tr>
<tr>
<td class="dte"></td>
<td class="id"></td>
<td class="desc"></td>
</tr>
<tr>
<td class="dte"></td>
<td class="id"></td>
<td class="desc"></td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
Using Python and BeautifulSoup, I have managed to print output to the screen like this:
[b'16 March', b'987654', b'Something happens on this date']
[b'23 March', b'321987', b'Something happens on this date']
[b'26 March', b'123456', b'Something happens on this date']
using the following code (which I have hacked together from various posts on this site):
for mytable in soup.find('body').find_all('table'):
    # print(len(mytable))
    for trs in mytable.find_all('tr'):
        tds = trs.find_all('td', class_='dte id desc'.split())
        if tds:  # non-empty list: this row has at least one matching cell
            row = [elem.text.strip().encode('utf-8') for elem in tds]
            print(row)
        else:
            continue  # no matching cells in this row, proceed to the next one
Two questions:
1. When the output prints to the screen, the first line contains the whole table (every one of the rows above, run together; the actual table has about 100 entries). From the second line onward I get a single entry per line, as shown above, which is what I want. How can I suppress that first line, and why does it appear?
2. I would like to load the results shown above into a sqlite3 table, which I would later ETL into a production MSSQL environment. I have not found a way to do this from the output I am getting.
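
A minimal sketch covering both points, assuming the cell classes shown above. The extra first line most likely comes from the outer <tr>: find_all searches all descendants, so the outer row matches every nested cell at once. Passing recursive=False restricts the search to direct children. The sample HTML, the table name events, the column name descr, and the in-memory database are illustrative assumptions, not part of the original post.

```python
import sqlite3
from bs4 import BeautifulSoup

# Trimmed stand-in for the nested structure in the question (assumed data).
html = """
<table><tbody><tr>
  <td><ul></ul></td>
  <td>
    <table><tbody><tr></tr></tbody></table>
    <table><tbody>
      <tr><td class="dte">16 March</td><td class="id">987654</td><td class="desc">Something happens on this date</td></tr>
      <tr><td class="dte">23 March</td><td class="id">321987</td><td class="desc">Something happens on this date</td></tr>
    </tbody></table>
  </td>
</tr></tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.find_all("tr"):
    # recursive=False inspects only the direct <td> children of this row,
    # so the outer <tr> that merely contains the nested tables yields nothing.
    tds = tr.find_all("td", class_=["dte", "id", "desc"], recursive=False)
    if len(tds) == 3:
        rows.append(tuple(td.get_text(strip=True) for td in tds))

conn = sqlite3.connect(":memory:")  # pass a file path for a persistent database
conn.execute("CREATE TABLE events (dte TEXT, id TEXT, descr TEXT)")
# Parameterized inserts: one tuple per table row.
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
conn.commit()

stored = conn.execute("SELECT dte, id, descr FROM events").fetchall()
print(stored)
```

Keeping the values as text (rather than encoding them to bytes) avoids the b'...' prefixes and lets sqlite3 store them as ordinary TEXT columns.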

Related

Python Bs4: How to retrieve a row in a table based on a specific 'td' value in that row

If I have a web page with multiple tables and I want to retrieve the source code for a specific row from a specific table based on a keyword in beautifulsoup4, how can I do that using the find or find_all methods (or any other method, for that matter)?
Using the table above, let's say I want to retrieve the row that contains the keyword "ROW 1" (or "A", "B", "C", etc.) and only that row. How can I go about that?

A contrived example is below. With bs4 4.7.1 you can use the pseudo-class CSS selectors :has and :contains to match a tr (row) that has a td (table cell) containing the wanted phrase. A table identifier is passed as well to target the correct table (an id here, to keep things simple). select returns all qualifying tr elements; use select_one if only the first match is required.
soup.select('#example tr:has(> td:contains("Row 1"))')
from bs4 import BeautifulSoup as bs
html = '''
<table id="example">
<tbody><tr>
<th>Col1</th>
<th>Col2</th>
<th>Col3</th>
</tr>
<tr>
<td>Row 1</td>
<td>A</td>
<td>B</td>
</tr>
<tr>
<td>Row 2</td>
<td>C</td>
<td>D</td>
</tr>
</tbody></table>
<table id="example2">
<tbody><tr>
<th>Col1</th>
<th>Col2</th>
<th>Col3</th>
</tr>
<tr>
<td>Not Row 1</td>
<td>A</td>
<td>B</td>
</tr>
<tr>
<td>Not Row 2</td>
<td>C</td>
<td>D</td>
</tr>
</tbody></table>
'''
soup = bs(html, 'lxml')  # 'html.parser' also works
print(soup.select('#example tr:has(> td:contains("Row 1"))'))
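
In newer Soup Sieve releases (the selector library behind bs4's select), :contains is deprecated in favor of :-soup-contains. A minimal sketch of the same query with the newer spelling, assuming soupsieve 2.1 or later and a trimmed copy of the example tables:

```python
from bs4 import BeautifulSoup

# Shortened version of the two example tables (assumed data).
html = '''
<table id="example">
<tr><th>Col1</th><th>Col2</th></tr>
<tr><td>Row 1</td><td>A</td></tr>
<tr><td>Row 2</td><td>C</td></tr>
</table>
<table id="example2">
<tr><td>Not Row 1</td><td>A</td></tr>
</table>
'''

soup = BeautifulSoup(html, 'html.parser')

selector = '#example tr:has(> td:-soup-contains("Row 1"))'
matches = soup.select(selector)    # every qualifying row in #example
first = soup.select_one(selector)  # only the first qualifying row

# Pull the cell text out of the matched row.
cells = [td.get_text(strip=True) for td in first.find_all('td')]
print(cells)
```

The #example prefix matters here: the phrase match is a substring test, so without it the "Not Row 1" row in the second table would qualify as well.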

Grab the entire HTML with pandas and do the following (this code is untested):
import pandas as pd
html_table = 'From your web scraping'  # placeholder for the page's HTML
tables = pd.read_html(io=html_table)  # returns a list of DataFrames, one per <table>
df = tables[0]  # pick the table you need
df.loc[0]  # all the information for the first data row
I'd suggest spending 10 minutes learning pandas; it will really help. https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html

how to verify that a specific cell in a table contains a certain value in a link with Selenium and Python

I have a table similar to the following
<tbody>
<tr>
<td>some text</td>
<td>other text</td>
<td>
process data or
delete
</td>
</tr>
<tr>
<td>some text</td>
<td>other text</td>
<td>
process data or
delete
</td>
</tr>
<tr>
<td>some text</td>
<td>other text</td>
<td>
process data or
delete
</td>
</tr>
</tbody>
I would like to check with Python and Selenium that the cell in row 1, column 3 contains a link to /process/100/
I am able to access the text with
cell_content = self.browser.find_element_by_xpath('//table/tbody/tr[1]/td[3]').text
but I would like to access
href="/process/101/"
to check if it contains 101

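Use the get_attribute() method on the link element inside the cell rather than on the <td> itself (a <td> has no href, so get_attribute('href') would return None). This assumes the process link is the first <a> in the cell:
cell_href = self.browser.find_element_by_xpath('//table/tbody/tr[1]/td[3]/a').get_attribute('href')
assert "101" in cell_href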

How to auto extract data from a html file with python?

I'm beginning to learn Python (2.7) and would like to extract certain information from HTML stored in a text file. The code below is just a snippet of the whole file. In the full file, the structure is the same for every other firm's data, and these HTML "blocks" are positioned one below the other (in case that helps).
The html snippet code:
<body><div class="tab_content-wrapper noPrint"><div class="tab_content_card">
<div class="card-header">
<strong title="" d.="" kon.="" nl="">"Liberty Associates LLC"</strong>
<span class="tel" title="Phone contacts">Phone contacts</span>
</div>
<div class="card-content">
<table>
<tbody>
<tr>
<td colspan="4">
<label class="downdrill-sbi" title="Industry: Immigration">Industry: Immigration</label>
</td>
</tr>
<tr>
<td width="20"> </td>
<td width="245"> </td>
<td width="50"> </td>
<td width="80"> </td>
</tr>
<tr>
<td colspan="2">
59 Wall St</td>
<td></td>
<td></td>
</tr>
<tr>
<td colspan="2">NJ 07105
<label class="downdrill-sbi" title="New York">New York</label>
</td>
<td></td>
<td></td>
</tr>
<tr>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr><td>Phone:</td><td>+1 973-344-8300</td><td>Firm Nr:</td><td>KL4568TL</td></tr>
<tr><td>Fax:</td><td>+1 973-344-8300</td><td colspan="2"></td></tr>
<tr>
<td colspan="2"> www.liberty.edu </td>
<td>Active:</td>
<td>Yes</td>
</tr>
</tbody>
</table>
</div>
</div></div></body>
Right now I'm using the following script to extract the desired information:
from lxml import html
str = open('html1.txt', 'r').read()
tree = html.fromstring(str)
for variable in tree.xpath('/html/body/div/div'):
    company_name = variable.xpath('/html/body/div/div/div[1]/strong/text()')
    location = variable.xpath('/html/body/div/div/div[2]/table/tbody/tr[4]/td[1]/label/text()')
    website = variable.xpath('/html/body/div/div/div[2]/table/tbody/tr[8]/td[1]/a/text()')
print(company_name, location, website)
Printed result:
('"Liberty Associates LLC"', 'New York', 'www.liberty.edu')
So far so good. However, when I use the script above to scrape the whole HTML file, the results are printed right after each other on one single line. I would like to print the data (one HTML "block" per line) under each other like this:
Liberty Associates LLC | New York | +1 973-344-8300 | www.liberty.edu
Company B | Los Angeles | +1 213-802-1770 | perchla.com
I know I can use [0], [1], [2] etc. to get the data under each other like I would like, but doing this manually for all thousands of html "blocks" is just not really feasible.
So my question: how can I automatically extract the data "block by block" from the html code and print the results under each other like illustrated above?

I think what you want is
print(company_name, location, website,'\n')
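
Another way to approach this, sketched under assumptions: the XPath expressions in the question start with /html, so they are evaluated against the whole document on every iteration. Expressions that start with .// are evaluated relative to the current block instead. The sample page below is a simplified stand-in for the real file, and the class-based lookups (tab_content_card, card-header) are taken from the snippet in the question.

```python
from lxml import html

# Simplified two-company page modeled on the snippet in the question (assumed data).
page = """
<html><body><div class="tab_content-wrapper">
  <div class="tab_content_card">
    <div class="card-header"><strong>"Liberty Associates LLC"</strong></div>
    <div class="card-content"><table><tbody>
      <tr><td colspan="2">NJ 07105 <label class="downdrill-sbi">New York</label></td></tr>
      <tr><td>Phone:</td><td>+1 973-344-8300</td></tr>
      <tr><td colspan="2"><a href="#">www.liberty.edu</a></td></tr>
    </tbody></table></div>
  </div>
  <div class="tab_content_card">
    <div class="card-header"><strong>"Company B"</strong></div>
    <div class="card-content"><table><tbody>
      <tr><td colspan="2">CA 90012 <label class="downdrill-sbi">Los Angeles</label></td></tr>
      <tr><td>Phone:</td><td>+1 213-802-1770</td></tr>
      <tr><td colspan="2"><a href="#">perchla.com</a></td></tr>
    </tbody></table></div>
  </div>
</div></body></html>
"""

tree = html.fromstring(page)

lines = []
for card in tree.xpath('//div[@class="tab_content_card"]'):
    # Each ".//" path is resolved inside this card only.
    name = card.xpath('string(.//div[@class="card-header"]/strong)').strip().strip('"')
    location = card.xpath('string(.//table//label)').strip()
    phone = card.xpath(
        'string(.//td[normalize-space()="Phone:"]/following-sibling::td[1])'
    ).strip()
    website = card.xpath('string(.//table//a)').strip()
    lines.append(' | '.join([name, location, phone, website]))

print('\n'.join(lines))
```

Wrapping each expression in string(...) returns a single text value per block, so no [0], [1], [2] indexing is needed.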

Limiting BeautifulSoup output

I have been working semi-successfully with BeautifulSoup and Selenium for some weeks now. However I have found myself in a situation I cannot untangle.
I need to extract the html from the first 6 rows or so out of a table. These rows do not share any class, id or similar.
Table structure:
<table class="Table">
<tr class="Table_Header">
<td colspan="2">Some Text</td>
</tr>
<tr>
<td class="Class2">Some Text</td>
<td><span class="Class"></span>Some Text</td>
</tr>
<tr>
<td class="Class2">Some Text</td>
<td>Some Text</td>
</tr>
<tr>
<td class="Class2">Some Text</td>
<td>Some Text</td>
</tr>
<tr class="Class3">
<td class="Class2"> Some Text </td>
<td>Some Text</td>
</tr>
<tr class="Class3">
<td class="Class2">Some Text</td>
<td>Some Text</td>
</tr>
<tr>
<td class="Class2">Some Text</td>
<td> <div class="Class4">Some Text</div>
<div class="Class4">Some Text</div>
</td>
</tr>
The table goes on and on, maintaining this structure but with seemingly random classes popping in and out.
Basically I would need to return the first six tr . I have tried several methods that either return the entire table or a single tr.
Any ideas?
Thanks in advance!

So you're trying to get the first 6 tr from a table? If I understand the question correctly, I had a similar problem where I needed to get the first 400 td. Perhaps the code below would help?
Maybe something like
i = 0
for row in get_log().findAll('tr'):
    for cell in row.findAll('td'):
        print(cell.text)
        logfile.write('{}\n'.format(cell.text))
        i += 1
        if i == 400:
            break
    if i == 400:
        break  # also leave the outer loop once 400 cells have been written
Also let me point you at the article I used to solve my own problem, the good stuff is near the end as it assumes you know literally nothing.
https://first-web-scraper.readthedocs.org/en/latest/
EDIT:
Using the table on Boone County as a source:
import requests
from bs4 import BeautifulSoup
url = 'http://www.showmeboone.com/sheriff/JailResidents/JailResidents.asp'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', attrs={'class': 'collapse shadow BCSDTable'})
i = 0
for row in table.findAll('tr'):
    print(row.prettify())
    i += 1
    print(i)
    if i == 6:
        break
This outputs a ton of information, so I won't post it. Maybe you want to refine what you want from within each tr?
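
A shorter variant of the same idea, as a sketch: find_all accepts a limit argument, and its result can also be sliced like a list. The generated ten-row table below is assumed sample data standing in for the real page.

```python
from bs4 import BeautifulSoup

# Build a ten-row stand-in table (assumed data).
html = '<table class="Table">' + ''.join(
    '<tr><td class="Class2">Label {0}</td><td>Value {0}</td></tr>'.format(n)
    for n in range(1, 11)
) + '</table>'

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', class_='Table')

first_six = table.find_all('tr', limit=6)  # stop searching after six matches
also_six = table.find_all('tr')[:6]        # or collect all rows, then slice

# The HTML of each row, ready to print or store.
first_six_html = [str(tr) for tr in first_six]
print('\n'.join(first_six_html))
```

Neither form depends on the rows sharing a class or id, since the rows are selected purely by position.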

Reading table using BeautifulSoup

I'm reading an HTML file with BeautifulSoup. I have a table in the HTML from which I need to read data, but the HTML contains more than one table.
To distinguish between the tables, I need to see the number of columns on each line by counting <td> tags.
I count like this:
for i in soup.find_all('tr'):
    for x in i.findallnext('td'):
This returns all <td> tags after the <tr> until the end of the document. But I need to know the number of <td> tags between the start of a line (<tr>) and the end of that line (</tr>).
<tr> <!-- Should return 2 columns, but will return 4 in script. -->
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
</tr>

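Replace findallnext with find_all.
findallnext (spelled find_all_next or findAllNext in bs4) returns every matching tag after the <tr> until the end of the document, as you said.
find_all searches only inside the element, so called on a <tr> it returns just that row's <td> tags.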
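
A minimal sketch of counting columns per row this way, and using the counts to tell tables apart. The two small tables and their ids are assumed sample data.

```python
from bs4 import BeautifulSoup

# Two tables with different column counts (assumed data).
html = """
<table id="a">
  <tr><td></td><td></td></tr>
  <tr><td></td><td></td></tr>
</table>
<table id="b">
  <tr><td></td><td></td><td></td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

# Number of <td> tags inside each <tr>, in document order.
counts = [len(tr.find_all('td')) for tr in soup.find_all('tr')]

# Widest row per table, keyed by the table's id attribute.
columns_by_table = {
    table['id']: max(len(tr.find_all('td')) for tr in table.find_all('tr'))
    for table in soup.find_all('table')
}
print(counts, columns_by_table)
```

If the tables are nested, passing recursive=False to the inner find_all('td') keeps cells of inner tables out of the outer row's count.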
