I'm trying to obtain links from an e-commerce website using Selenium. I'm a complete beginner at web scraping, so I'm open to any suggestions.
This is the basic structure. Some of the <tr> tags contain the href links I want.
<tbody>
<tr>...</tr>
<tr>...</tr>
<tr>...</tr>
<tr>...</tr>
<tr>...</tr>
<tr>...</tr>
</tbody>
What I have tried:
x1 = driver.find_elements_by_tag_name('tbody')
for x in x1:
    print(x.text)
For some reason, this fetches everything on the page, not only the things I want. Maybe that's because there's another <tbody> tag at the start of the code that covers everything in it.
My question is:
How can I grab links from the <tbody> tag that I want?
x1 = driver.find_elements_by_tag_name('tbody')
for x in x1:
    print(x.get_attribute('href'))
Related
I have an internal company webpage that lists a variety of data in a long list that I want to convert into a CSV file for reviewing. The data is in the format of:
*CUSTOMER_1*
Email Link Category_Text Phone_Numbers
Email Link Category_Text Phone_Numbers
*Customer_2*
Email Link Category_Text Phone_Numbers
Email Link Category_Text Phone_Numbers
Encoded in HTML it looks like
<table id="responsibility">
<tr class="customer">
<td colspan="6">
<strong>CUSTOMER 1</strong>
</td>
</tr>
<tr id="tr_1" title="Role_Name1">
<td>Name_1</td>
<td>Category_Text</td>
<td>Phone_Numbers</td>
<td></td>
</tr>
<tr id="tr_2" title="Role_Name2">
<td>Name_2</td>
<td>Category_Text</td>
<td>Phone_Numbers</td>
<td></td>
</tr>
<tr class="customer">
<td colspan="6">
<strong>CUSTOMER 2</strong>
</td>
</tr>
<tr id="tr_1" title="Role_Name1">
<td>Name_3</td>
<td>Category_Text</td>
<td>Phone_Numbers</td>
<td></td>
</tr>
<tr id="tr_2" title="Role_Name2">
<td>Name_2</td>
<td>Category_Text</td>
<td>Phone_Numbers</td>
<td></td>
</tr>
</table>
I'd like to end up with a file.csv that contains the info in this fashion
CUSTOMER1,Role_Name1,Name_1,Email_1,Category_Text,Phone_Numbers
CUSTOMER1,Role_Name2,Name_2,Email_2,Category_Text,Phone_Numbers
CUSTOMER2,Role_Name1,Name_3,Email_3,Category_Text,Phone_Numbers
CUSTOMER2,Role_Name2,Name_2,Email_2,Category_Text,Phone_Numbers
Right now I can get a list of all of the customer names or a list of all of the text, but I haven't been able to figure out how to iterate over every customer and then iterate over every line for each customer.
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("source.html"), "html.parser")
with open("output.csv",'w') as file:
    responsibility=soup.find('table',{'id':'responsibility'})
    line=responsibility.tr
    for i in responsibility:
        print(line)
        line=responsibility.tr.next_sibling
I was expecting this to print every tag in the document but instead it only prints the first and never cycles to the next tags.
Focus on this line of code :
line=responsibility.tr
Here, you are using the .tr attribute, which locates the first <tr> tag block and returns it.
What does that mean here?
Say you have n instances of the <tr> tag; using .tr will give you only the first of those n instances. If you wish to extract all n of them, use find_all(). It returns a list of all matches.
line=responsibility.find_all("tr", class_="customer")
Also, add the class_="customer" filter. It will help you locate all the <tr> blocks with the "customer" class. Then use .find_next_sibling("tr") to reach the 2 subsequent rows with the title="Role_Name*" attribute; plain .next_sibling usually lands on the whitespace text node between two tags rather than on the next <tr>.
So, to put the above theory into practice:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("source.html"), "html.parser")
with open("output.csv",'w') as file:
    responsibility=soup.find('table',{'id':'responsibility'})
    lines=responsibility.find_all("tr", class_ = "customer")
    # iterate over the customer header rows only; looping over the table again would repeat the output
    for line in lines:
        line1=line.find_next_sibling("tr") #locates tr with title="Role_Name1"
        line2=line1.find_next_sibling("tr") #locates tr with title="Role_Name2"
        print(line1)
        print(line2)
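To get from printed rows to the CSV layout the question asks for, one option is to walk every <tr> in order and remember the most recent customer header. The following is a minimal sketch under assumptions: the helper names table_to_rows and rows_to_csv are invented for illustration, the sample markup is a shortened copy of the table in the question, and the Email column is left out because the posted HTML does not show where the email lives.

```python
import csv
import io

from bs4 import BeautifulSoup


def table_to_rows(html):
    """Flatten the 'responsibility' table into one list per role row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="responsibility")
    rows = []
    customer = None
    for tr in table.find_all("tr"):
        if "customer" in tr.get("class", []):
            # A customer header row: remember the name for the rows below it.
            customer = tr.get_text(strip=True)
            continue
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        rows.append([customer, tr.get("title", "")] + cells)
    return rows


def rows_to_csv(rows):
    """Serialise the rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


sample = """
<table id="responsibility">
  <tr class="customer"><td colspan="6"><strong>CUSTOMER 1</strong></td></tr>
  <tr id="tr_1" title="Role_Name1"><td>Name_1</td><td>Category_Text</td><td>Phone_Numbers</td><td></td></tr>
  <tr class="customer"><td colspan="6"><strong>CUSTOMER 2</strong></td></tr>
  <tr id="tr_1" title="Role_Name1"><td>Name_3</td><td>Category_Text</td><td>Phone_Numbers</td><td></td></tr>
</table>
"""
print(rows_to_csv(table_to_rows(sample)))
```

Tracking the current customer in a variable avoids depending on how many role rows follow each header, which the sibling-based approach assumes to be exactly two.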
It's an odd one, and I've been stuck on it for nearly a week now.
Maybe it's obvious and I'm just not seeing things right anymore...
Any leads for alternative solutions are welcome, too.
I have no influence on the website.
I'm new to HTML.
I'm trying to get specific links from a website using Scrapy (the number of links varies).
In this case they are RELATIVELINK1 and RELATIVELINK4; both are labeled "Details".
The number of tables depends on what you are allowed to see.
Before I start with the problem:
I'm using the Scrapy shell to test responses.
I can get values from all other parts of the HTML code.
I tried XPath, response.css and Scrapy's LinkExtractor.
I tried ignoring the /p part in the path.
Now, if I try to get a response with XPath:
response.xpath('/html/body').extract() - I get everything, including what is inside <p>
but when I get to
response.xpath('/html/body/.../p').extract() - I only get: ['<p>\n<br>\n</p>']
and then
response.xpath('/html/body/.../p/table').extract() - I get [ ]
same for
response.xpath('/html/body/.../p/br').extract()
Here is the HTML segment I'm having trouble with:
<p>
<BR>
<TABLE BORDER>
<TR>
<TD><b>NAME1</b></TD>
<TD><b>NAME2</b></TD>
<TD><b>NAME3</b></TD>
<TD><b>NAME4</b></TD>
<TD COLSPAN=3><b>Links</b></TD>
</TR>
<TR>
<TD>NUMBER1</font></TD>
<TD>LINK1 </font></TD>
<TD> </font></TD>
<TD>NAME5 </font></TD>
<TD><a href=RELATIVELINK1>Details</a></TD>
<TD><a href=RELATIVELINK2>LABEL1</TD>
<TD><a href=RELATIVELINK3>LABEL2</TD>
</TR>
<TR>
<TD>NUMBER2</font></TD>
<TD>LINK2 </font></TD>
<TD> </font></TD>
<TD>NAME5;</font></TD>
<TD><a href=RELATIVELINK4>Details</a></TD>
<TD><a href=RELATIVELINK5>LABEL1</TD>
<TD><a href=RELATIVELINK6>LABEL2</TD>
</TR>
</TABLE>
<BR>
There is no </P>.
for link_href in response.xpath('//a[.="Details"]/@href').extract():
    print(link_href)
I am trying to parse the data in this website:
http://www.baseball-reference.com/boxes/CHN/CHN201606020.shtml
I want to extract some of the data in the tables. But for some reason, I am struggling to find them. For example, what I want to do is this
from bs4 import BeautifulSoup
import requests
url = 'http://www.baseball-reference.com/boxes/CHN/CHN201606020.shtml'
soup = BeautifulSoup(requests.get(url).text)
soup.find('table', id='ChicagoCubsbatting')
The final line returns nothing despite a table with that id existing in the html. Furthermore, len(soup.findAll('table')) returns 1 even though there are many tables in the page. I've tried using the 'lxml', 'html.parser' and 'html5lib'. All behave the same way.
What is going on? Why does this not work and what can I do to extract the table?
Use soup.find('div', class_='placeholder').next_sibling.next_sibling to get the comment text, then build a new soup from that text.
In [35]: new_soup = BeautifulSoup(text, 'lxml')
In [36]: new_soup.table
Out[36]:
<table class="teams poptip" data-tip="San Francisco Giants at Atlanta Braves">
<tbody>
<tr class="winner">
<td>SFG</td>
<td class="right">6</td>
<td class="right gamelink">
Final
</td>
</tr>
<tr class="loser">
<td>ATL</td>
<td class="right">0</td>
<td class="right">
</td>
</tr>
</tbody>
</table>
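The answer above depends on the tables sitting inside an HTML comment, which is why find() on the original soup does not see them. A slightly more general sketch, assuming the wanted table is wrapped in a comment somewhere in the page: scan all Comment nodes and re-parse each one. The helper name find_commented_table and the sample markup are made up for illustration; only the table id comes from the question.

```python
from bs4 import BeautifulSoup, Comment


def find_commented_table(html, table_id):
    """Look for a table by id, including inside HTML comments."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is not None:
        return table
    # Comments are plain strings to the parser, so each one is re-parsed.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        inner = BeautifulSoup(comment, "html.parser")
        table = inner.find("table", id=table_id)
        if table is not None:
            return table
    return None


sample = """
<div class="placeholder"></div>
<!--
<table id="ChicagoCubsbatting"><tr><td>Fowler</td></tr></table>
-->
"""
table = find_commented_table(sample, "ChicagoCubsbatting")
print(table.td.get_text())
```

Searching every comment avoids counting siblings from the placeholder div, which may break if the page layout shifts.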
I've been reading up on Scrapy. My Python skills are weak, but I'm usually able to build something through trial and error and determination...
I'm able to run through my project site and scrape 'structured' product data.
The problem occurs with a table that has different rows and values per page.
Below is an example; I can get the name and price of the product.
The problem is with the table underneath: products have different specifications and a different number of rows, but always 2 columns. I'm trying to loop through it by counting the <tr> elements and, for each one, taking the first <td> as a label and the second <td> as the corresponding value, then appending it to the other page data to create 1 entry.
In the end I'd like to yield Name: name, Price: price, Label X: Value X, Label Y: Value Y
<div>name</div>
<div>price</div>
<table>
<tr><td>LABEL X</td><td>VALUE X</td></tr>
<tr><td>LABEL Y</td><td>VALUE Y</td></tr>
<tr><td>LABEL Z</td><td>VALUE Z</td></tr>
Could be anywhere from 2 to 6 rows
</table>
Any help would be much appreciated, or if someone could point me to an example.
EDIT >>>>
The HTML code
<table class="table table-striped">
<tbody>
<tr>
<td><b>Name:</b></td>
<td>Car</td>
</tr>
<tr>
<td><b>Brand:</b></td>
<td itemprop="brand">Merc</td>
</tr>
<tr>
<td><b>Size:</b></td>
<td>30 XL</td>
</tr>
<tr>
<td><b>Color:</b></td>
<td>white</td>
</tr>
<tr>
<td><b>Stock</b></td>
<td>20</td>
</tr>
</tbody>
</table>
You should have posted some Scrapy code to help us out.
Anyways, here is the code you can use to parse your HTML.
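# 'table tr' (descendant) matches rows whether or not a tbody sits in between
for row in response.css('table tr'):
    data = {}
    data['name'] = row.css("td:nth-child(1) b::text").extract()[0]
    data['value'] = row.css("td:nth-child(2)::text").extract()[0]
    # MyItem is assumed to be an Item class defined in your project
    yield MyItem(name=data['name'], value=data['value'])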
PS:
Do not rely on tbody in selectors or XPaths. Browsers often add tbody when rendering a table, so it may not be present in the original response.
See here: https://doc.scrapy.org/en/0.14/topics/firefox.html
Firefox, in particular, is known for adding <tbody> elements to tables. Scrapy, on the other hand, does not modify the original page HTML, so you won't be able to extract any data if you use <tbody> in your XPath expressions.
I am using Python 2.7, mechanize, and BeautifulSoup, and if it helps I could use urllib.
OK, I am trying to download a couple of different zip files that are in different HTML tables. I know which table each file is in (the first, second, third, ... table).
Here is the second table, in HTML format, from the webpage:
<table class="fe-form" cellpadding="0" cellspacing="0" border="0" width="50%">
<tr>
<td colspan="2"><h2>Eligibility List</h2></td>
</tr>
<tr>
<td><b>Eligibility File for Met-Ed</b> -
<a href="/content/fecorp/supplierservices/eligibility_list.suppliereligibility.html?id=ME&ftype=1&fname=cmb_me_elig_lst_06_2013.zip">cmb_me_elig_lst_06_2013.zip</td>
</tr>
<tr>
<td><b>Eligibility File for Penelec</b> -
<a href="/content/fecorp/supplierservices/eligibility_list.suppliereligibility.html?id=PN&ftype=1&fname=cmb_pn_elig_lst_06_2013.zip">cmb_pn_elig_lst_06_2013.zip</td>
</tr>
<tr>
<td><b>Eligibility File for Penn Power</b> -
<a href="/content/fecorp/supplierservices/eligibility_list.suppliereligibility.html?id=PP&ftype=1&fname=cmb_pennelig_06_2013.zip">cmb_pennelig_06_2013.zip</td>
</tr>
<tr>
<td><b>Eligibility File for West Penn Power</b> -
<a href="/content/fecorp/supplierservices/eligibility_list.suppliereligibility.html?id=WP&ftype=1&fname=cmb_wp_elig_lst_06_2013.zip">cmb_wp_elig_lst_06_2013.zip</td>
</tr>
<tr>
<td> </td>
</tr>
</table>
I was going to use the following code just to get to the 2nd table:
from bs4 import BeautifulSoup
html= br.response().read()
soup = BeautifulSoup(html)
table = soup.find("table", class=fe-form)
I guess that class="fe-form" is wrong because it will not work, but there are no other attributes of the table that differentiates it from the other tables. All tables have cellpadding="0" cellspacing="0" border="0" width="50%". I guess I can't use the find() function.
So I am trying to get to the second table and then download the files on this page. Could someone give me some info to push me in the right direction? I have worked with forms before, but not tables. I wish there was some way to find the particular titles of the zip files I am looking for and then download them, since I will always know their names.
Thanks for any help,
Tom
To select the table you want, simply do
table = soup.find('table', attrs={'class' : 'fe-form', 'cellpadding' : '0' })
This assumes that there is only one table with class=fe-form and cellpadding=0 in your document. If there are more, this code will select only the first table. To be sure you are not overlooking anything on the page, you could do
tables = soup.findAll('table', attrs={'class' : 'fe-form', 'cellpadding' : '0' })
table = tables[0]
And maybe assert that len(tables)==1 to be sure that there is only one table.
Now, to download the files, there is plenty you can do. Assuming from your code that you have loaded mechanize, you could do something like
a_tags = table.findAll('a')
for a in a_tags:
    # default to '' so anchors without an href do not raise a TypeError
    if '.zip' in a.get('href', ''):
        br.retrieve(a.get('href'), a.text)
That would download all files to your current working directory and would name them according to their link text.
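Since the question says every table shares the same attributes and the file names are known in advance, another option is to pick the table by position and index the links by file name. The sketch below rests on several assumptions: zip_links, the sample markup and the example.com base URL are all invented for illustration, and it is written for Python 3 (on Python 2.7, urljoin lives in the urlparse module instead).

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def zip_links(html, base_url, table_index=1):
    """Return {file name: absolute URL} for the .zip links in one table."""
    soup = BeautifulSoup(html, "html.parser")
    tables = soup.find_all("table", class_="fe-form")
    table = tables[table_index]  # index 1 is the second matching table
    found = {}
    for a in table.find_all("a", href=True):
        name = a.get_text(strip=True)
        if name.endswith(".zip"):
            # hrefs on the page are relative, so resolve them first
            found[name] = urljoin(base_url, a["href"])
    return found


sample = """
<table class="fe-form"><tr><td><a href="/files/first.zip">first.zip</a></td></tr></table>
<table class="fe-form">
<tr><td><b>Eligibility File for Met-Ed</b> -
<a href="/content/list.html?id=ME&amp;fname=cmb_me_elig_lst_06_2013.zip">cmb_me_elig_lst_06_2013.zip</a></td></tr>
<tr><td><a href="/about.html">About</a></td></tr>
</table>
"""
links = zip_links(sample, "https://example.com/supplier/page.html")
print(links["cmb_me_elig_lst_06_2013.zip"])
```

Each resolved URL could then be handed to br.retrieve(url, name) as in the answer above, looking up only the file names that are actually wanted.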