I am using Python 2.7, mechanize, and BeautifulSoup; if it helps, I could also use urllib.
OK, I am trying to download a couple of different zip files that live in different HTML tables. I know which table each file is in (first, second, third, and so on).
Here is the second table, as it appears in the page's HTML:
<table class="fe-form" cellpadding="0" cellspacing="0" border="0" width="50%">
<tr>
<td colspan="2"><h2>Eligibility List</h2></td>
</tr>
<tr>
<td><b>Eligibility File for Met-Ed</b> -
<a href="/content/fecorp/supplierservices/eligibility_list.suppliereligibility.html?id=ME&ftype=1&fname=cmb_me_elig_lst_06_2013.zip">cmb_me_elig_lst_06_2013.zip</td>
</tr>
<tr>
<td><b>Eligibility File for Penelec</b> -
<a href="/content/fecorp/supplierservices/eligibility_list.suppliereligibility.html?id=PN&ftype=1&fname=cmb_pn_elig_lst_06_2013.zip">cmb_pn_elig_lst_06_2013.zip</td>
</tr>
<tr>
<td><b>Eligibility File for Penn Power</b> -
<a href="/content/fecorp/supplierservices/eligibility_list.suppliereligibility.html?id=PP&ftype=1&fname=cmb_pennelig_06_2013.zip">cmb_pennelig_06_2013.zip</td>
</tr>
<tr>
<td><b>Eligibility File for West Penn Power</b> -
<a href="/content/fecorp/supplierservices/eligibility_list.suppliereligibility.html?id=WP&ftype=1&fname=cmb_wp_elig_lst_06_2013.zip">cmb_wp_elig_lst_06_2013.zip</td>
</tr>
<tr>
<td> </td>
</tr>
</table>
I was going to use the following code just to get to the 2nd table:
from bs4 import BeautifulSoup
html= br.response().read()
soup = BeautifulSoup(html)
table = soup.find("table", class=fe-form)
I guess that class="fe-form" is wrong because it will not work, but there is no other attribute that differentiates this table from the others; they all have cellpadding="0" cellspacing="0" border="0" width="50%". So I guess I can't use the find() function.
So I am trying to get to the second table and then download the files on the page. Could someone give me some info to push me in the right direction? I have worked with forms before, but not tables. I wish there were some way to find the particular titles of the zip files I am looking for and then download them, since I will always know their names.
Thanks for any help,
Tom
To select the table you want, simply do
table = soup.find('table', attrs={'class' : 'fe-form', 'cellpadding' : '0' })
This assumes that there is only one table with class=fe-form and cellpadding=0 in your document. If there are more, this code will select only the first table. To be sure you are not overlooking anything on the page, you could do
tables = soup.findAll('table', attrs={'class' : 'fe-form', 'cellpadding' : '0' })
table = tables[0]
And maybe assert that len(tables)==1 to be sure that there is only one table.
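For instance, right after the lookup:
assert len(tables) == 1  # fail loudly if a second matching table ever appears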
Now, to download the files, there is plenty you can do. Assuming from your code that you have mechanize loaded, you could do something like
import urlparse  # stdlib on Python 2.7

a_tags = table.findAll('a')
for a in a_tags:
    href = a.get('href')
    if '.zip' in href:
        # hrefs here are relative; join with the current page URL
        br.retrieve(urlparse.urljoin(br.geturl(), href), a.text)
That would download all files to your current working directory and would name them according to their link text.
Related
I am doing web scraping for a DS project, and I am using BeautifulSoup for it. But I am unable to extract the Duration from the "tbody" tag in the "table" class.
Following is the HTML code:
<div class="table-responsive">
<table class="table">
<thead>
<tr>
<th>Start Date</th>
<th>Duration</th>
<th>Stipend</th>
<th>Posted On</th>
<th>Apply By</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<div id="start-date-first">Immediately</div>
</td>
<td>1 Month</td>
<td class="stipend_container_table_cell"> <i class="fa fa-inr"></i>
1500 /month
</td>
<td>26 May'20</td>
<td>23 Jun'20</td>
</tr>
</tbody>
</table>
</div>
Note: for extracting the 'Immediately' text, I use the following code:
x = container.find("div", {"class" : "table-responsive"})
x.table.tbody.tr.td.div.text
You can use the select() function to find tags by CSS selector.
tds = container.select('div > table > tbody > tr > td')
# or just select('td'), since there's no other td tag
print(tds[1].text)
The select() function returns a list of all HTML tags that match the selector. The one you want is the second, hence index 1; then take its text.
Try this:
from bs4 import BeautifulSoup
import requests
url = "yourUrlHere"
pageRaw = requests.get(url).text
soup = BeautifulSoup(pageRaw , 'lxml')
print(soup.table)
In my code I use the lxml library to parse the data. If you don't have it, run pip install lxml, or just swap in your preferred parser in this part of the code:
soup = BeautifulSoup(pageRaw , 'lxml')
This code will return the first table on the page.
Take care
I have an internal company webpage that lists a variety of data in one long list, which I want to convert into a CSV file for review. The data is in this format:
*CUSTOMER_1*
Email Link Category_Text Phone_Numbers
Email Link Category_Text Phone_Numbers
*Customer_2*
Email Link Category_Text Phone_Numbers
Email Link Category_Text Phone_Numbers
Encoded in HTML, it looks like this:
<table id="responsibility">
<tr class="customer">
<td colspan="6">
<strong>CUSTOMER 1</strong>
</td>
</tr>
<tr id="tr_1" title="Role_Name1">
<td>Name_1</td>
<td>Category_Text</td>
<td>Phone_Numbers</td>
<td></td>
</tr>
<tr id="tr_2" title="Role_Name2">
<td>Name_2</td>
<td>Category_Text</td>
<td>Phone_Numbers</td>
<td></td>
</tr>
<tr class="customer">
<td colspan="6">
<strong>CUSTOMER 2</strong>
</td>
</tr>
<tr id="tr_1" title="Role_Name1">
<td>Name_3</td>
<td>Category_Text</td>
<td>Phone_Numbers</td>
<td></td>
</tr>
<tr id="tr_2" title="Role_Name2">
<td>Name_2</td>
<td>Category_Text</td>
<td>Phone_Numbers</td>
<td></td>
</tr>
</table>
I'd like to end up with a file.csv that contains the info in this fashion
CUSTOMER1,Role_Name1,Name_1,Email_1,Category_Text,Phone_Numbers
CUSTOMER1,Role_Name2,Name_2,Email_2,Category_Text,Phone_Numbers
CUSTOMER2,Role_Name1,Name_3,Email_3,Category_Text,Phone_Numbers
CUSTOMER2,Role_Name1,Name_2,Email_2,Category_Text,Phone_Numbers
Right now I can get a list of all the customer names, or a list of all the text, but I haven't been able to figure out how to iterate over every customer and then over every line for that customer.
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("source.html"), "html.parser")
with open("output.csv",'w') as file:
    responsibility=soup.find('table',{'id':'responsibility'})
    line=responsibility.tr
    for i in responsibility:
        print(line)
        line=responsibility.tr.next_sibling
I was expecting this to print every tag in the document but instead it only prints the first and never cycles to the next tags.
Focus on this line of code :
line=responsibility.tr
Here, you are using the .tr shortcut, which locates the first <tr> tag block and returns its contents.
What does that mean here?
Say you have n instances of the <tr> tag; .tr gives you only the first of those n instances. So, if you wish to extract all n of them, use find_all() instead. It returns a list of all matches.
line=responsibility.find_all("tr", class_="customer")
Also, add the class_="customer" filter; it locates all the <tr> blocks with the "customer" class. From each of those, stepping to the following rows (find_next_sibling("tr") is the robust spelling, since plain .next_sibling can land on a whitespace text node) gives you the two subsequent rows carrying the title="Role_Name*" attribute.
So, to put the above theory into practice:
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("source.html"), "html.parser")
with open("output.csv", 'w') as file:
    responsibility = soup.find('table', {'id': 'responsibility'})
    lines = responsibility.find_all("tr", class_="customer")
    for line in lines:
        # find_next_sibling skips the whitespace text nodes that
        # plain .next_sibling would return
        line1 = line.find_next_sibling("tr")   # tr with title="Role_Name1"
        line2 = line1.find_next_sibling("tr")  # tr with title="Role_Name2"
        print(line1)
        print(line2)
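To get all the way to the CSV layout described in the question, here is one sketch. It assumes the table structure shown above and writes whatever cells each row has; the email column isn't visible in the posted HTML, so it is left out:
import csv
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("source.html"), "html.parser")
table = soup.find('table', {'id': 'responsibility'})

with open("output.csv", 'w') as f:
    writer = csv.writer(f)
    customer = None
    for tr in table.find_all('tr'):
        if 'customer' in (tr.get('class') or []):
            # a header row: remember the customer for the rows below it
            customer = tr.get_text(strip=True)
        else:
            cells = [td.get_text(strip=True) for td in tr.find_all('td')]
            # the title attribute carries the role name, e.g. "Role_Name1"
            writer.writerow([customer, tr.get('title', '')] + cells)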
I am trying to parse the data in this website:
http://www.baseball-reference.com/boxes/CHN/CHN201606020.shtml
I want to extract some of the data in the tables. But for some reason, I am struggling to find them. For example, what I want to do is this
from bs4 import BeautifulSoup
import requests
url = 'http://www.baseball-reference.com/boxes/CHN/CHN201606020.shtml'
soup = BeautifulSoup(requests.get(url).text)
soup.find('table', id='ChicagoCubsbatting')
The final line returns nothing despite a table with that id existing in the HTML. Furthermore, len(soup.findAll('table')) returns 1 even though there are many tables on the page. I've tried the 'lxml', 'html.parser' and 'html5lib' parsers; all behave the same way.
What is going on? Why does this not work and what can I do to extract the table?
Use soup.find('div', class_='placeholder').next_sibling.next_sibling to get the comment text, then build a new soup from that text.
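A minimal sketch of that idea, assuming the page still wraps its stats tables in HTML comments next to div.placeholder elements:
from bs4 import BeautifulSoup
import requests

url = 'http://www.baseball-reference.com/boxes/CHN/CHN201606020.shtml'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

# the table markup lives in a comment node two siblings after the
# placeholder div, so .next_sibling.next_sibling lands on it
text = soup.find('div', class_='placeholder').next_sibling.next_sibling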
In [35]: new_soup = BeautifulSoup(text, 'lxml')
In [36]: new_soup.table
Out[36]:
<table class="teams poptip" data-tip="San Francisco Giants at Atlanta Braves">
<tbody>
<tr class="winner">
<td>SFG</td>
<td class="right">6</td>
<td class="right gamelink">
Final
</td>
</tr>
<tr class="loser">
<td>ATL</td>
<td class="right">0</td>
<td class="right">
</td>
</tr>
</tbody>
</table>
Been reading up on Scrapy. My Python skills are weak, but I am usually able to build something through trial and error and determination...
I'm able to run through my project site and scrape 'structured' product data.
The problem occurs with a table that has different rows and values per page.
Below is an example; I can get the name and price of the product.
The problem is with the table underneath: products have different specifications and a different number of rows, but always 2 columns. I'm trying to loop through by counting the <tr> and, for each, take the first <td> as a label and the second <td> as the corresponding value, then append it to the other page data to create one entry.
In the end I'd like to yield Name: name, Price: price, Label X: Value X, Label Y: Value Y
<div>name</div>
<div>price</div>
<table>
<tr><td>LABEL X</td><td>VALUE X</td></tr>
<tr><td>LABEL Y</td><td>VALUE Y</td></tr>
<tr><td>LABEL Z</td><td>VALUE Z</td></tr>
Could be anywhere from 2 to 6 rows
</table>
Any help would be much appreciated, or if someone could point me to an example.
EDIT >>>>
The HTML code
<table class="table table-striped">
<tbody>
<tr>
<td><b>Name:</b></td>
<td>Car</td>
</tr>
<tr>
<td><b>Brand:</b></td>
<td itemprop="brand">Merc</td>
</tr>
<tr>
<td><b>Size:</b></td>
<td>30 XL</td>
</tr>
<tr>
<td><b>Color:</b></td>
<td>white</td>
</tr>
<tr>
<td><b>Stock</b></td>
<td>20</td>
</tr>
</tbody>
</table>
You should have posted some Scrapy code to help us out.
Anyways, here is the code you can use to parse your HTML.
# use a descendant selector ('table tr' rather than 'table > tr') so it
# matches whether or not the rows sit inside a <tbody>
for row in response.css('table tr'):
    data = {}
    data['name'] = row.css("td:nth-child(1) b::text").extract()[0]
    data['value'] = row.css("td:nth-child(2)::text").extract()[0]
    yield MyItem(name=data['name'], value=data['value'])
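If you would rather yield one combined entry per page (Name, Price, plus every label/value pair), as the question describes, a sketch along these lines could work. The div::text selectors for name and price are guesses based on the simplified HTML in the question, so adjust them to the real page:
item = {
    'Name': response.css('div::text').extract_first(),   # assumed selector
    'Price': response.css('div::text').extract()[1],     # assumed selector
}
# every spec row contributes one label/value pair to the same item
for row in response.css('table.table-striped tr'):
    label = row.css('td:nth-child(1) ::text').extract_first(default='')
    value = row.css('td:nth-child(2) ::text').extract_first(default='')
    if label:
        item[label.strip(': ')] = value.strip()
yield item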
PS:
Do not use tbody in your selectors or XPaths; tbody is added by modern browsers and is not necessarily included in the original response.
See here: https://doc.scrapy.org/en/0.14/topics/firefox.html
Firefox, in particular, is known for adding <tbody> elements to tables. Scrapy, on the other hand, does not modify the original page HTML, so you won't be able to extract any data if you use <tbody> in your XPath expressions.
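In practice that means using the descendant axis, which matches whether or not a <tbody> is present in the raw HTML:
# matches rows whether or not a <tbody> is present in the raw HTML
rows = response.xpath('//table[contains(@class, "table-striped")]//tr')
# rather than '//table/tbody/tr', which a browser inspector would suggest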
I am attempting to parse the second table seen below using BeautifulSoup. I am having trouble distinguishing the second table from the first because their attributes are exactly the same. How do I access the information in the table with name = PATHWAY? What I have used so far to attempt to access the table is:
table = soup.find('table', {'name':'PATHWAY'})
I receive a response of "None", although I know the table is present. To me this means that my method of distinguishing between the two is not working. Any suggestions?
<table border="0" cellspacing="0" cellpadding="0" bgcolor="#DCDCDC">
<tr><td>
<table border="0" cellspacing="1" cellpadding="3">
<tr>
<td class=ue><a name="REACTION TYPE">REACTION TYPE</td><td class=ue>ORGANISM</td><td class=ue>COMMENTARY</td><td class=ue>LITERATURE</td></tr>
<tr class=tr1>
<td class=g>condensation</td><td class=no>-</td><td class=no>-</td><td class=no>-</td></tr>
</table>
</td></tr></table>
<br>
<table border="0" cellspacing="0" cellpadding="0" bgcolor="#DCDCDC">
<tr><td>
<table border="0" cellspacing="1" cellpadding="3">
<tr>
<td class=ue><a name="PATHWAY">PATHWAY</td><td class=ue>KEGG Link</td><td class=ue>MetaCyc Link</td><td class=ue></td></tr>
<table>
Mu Mind has it right: find the "a", then traverse back up to the parent:
soup.find(attrs={"name":"PATHWAY"}).findParent('table')
That's the Python way... There is a single XPath command, but operating with XPath on axes is more complicated and only worth the effort if it has some specific use (XSLT or JavaScript requirements, e.g.).
>>> soup.find(attrs={"name":"PATHWAY"})
<a name="PATHWAY">PATHWAY</a>
First:
table = soup.find('table' {'name':'PATHWAY'}
is not valid Python code.
And even with the syntax fixed, what should it match?
find('table', {'name': 'PATHWAY'}) matches only a <table> tag that itself carries a name="PATHWAY" attribute; in your document the name attribute sits on an <a> tag inside the table, so find() returns None.
Either you iterate through each single table and perform the related check inside it, or you iterate over every node of the tree until you find the related node and then walk up the hierarchy (following the parent nodes) until you reach a table element. The recursiveChildGenerator() can be used to iterate over all nodes (like a flat list).
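A rough sketch of that second approach, assuming soup was built from the HTML in the question (recursiveChildGenerator() is the BeautifulSoup 3 name; in bs4 the same iteration is available via soup.descendants):
table = None
for node in soup.recursiveChildGenerator():
    # only Tag nodes have .get(); the getattr check skips text nodes
    if getattr(node, 'name', None) == 'a' and node.get('name') == 'PATHWAY':
        table = node.findParent('table')
        break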
You can use the function form of find:
soup.find(lambda tag: tag.name == 'table' and
          tag.find('a', attrs={'name': 'PATHWAY'}) is not None)