Crawling webpage with scrapy - python

Been reading up on Scrapy. My Python skills are weak, but I can usually build something through trial, error and determination...
I'm able to run through my project site and scrape 'structured' product data.
The problem occurs with a table that has different rows and values on each page.
In the example below, I can get the name and price of the product.
The problem is with the table underneath: products have different specifications and a different number of rows, but always 2 columns. I'm trying to loop through by counting the <tr> elements and, for each one, take the first <td> as a label and the second <td> as the corresponding value, then append it to the other page data to create 1 entry.
In the end I'd like to yield Name: name, Price: price, Label X: Value X, Label Y: Value Y
<div>name</div>
<div>price</div>
<table>
<tr><td>LABEL X</td><td>VALUE X</td></tr>
<tr><td>LABEL Y</td><td>VALUE Y</td></tr>
<tr><td>LABEL Z</td><td>VALUE Z</td></tr>
Could be anywhere from 2 to 6 rows
</table>
Any help would be much appreciated, or if someone could point me to an example.
EDIT >>>>
The HTML code
<table class="table table-striped">
<tbody>
<tr>
<td><b>Name:</b></td>
<td>Car</td>
</tr>
<tr>
<td><b>Brand:</b></td>
<td itemprop="brand">Merc</td>
</tr>
<tr>
<td><b>Size:</b></td>
<td>30 XL</td>
</tr>
<tr>
<td><b>Color:</b></td>
<td>white</td>
</tr>
<tr>
<td><b>Stock</b></td>
<td>20</td>
</tr>
</tbody>
</table>

You should have posted some Scrapy code to help us out.
Anyways, here is the code you can use to parse your HTML.
for row in response.css('table > tr'):
    data = {}
    data['name'] = row.css("td:nth-child(1) b::text").extract()[0]
    data['value'] = row.css("td:nth-child(2)::text").extract()[0]
    yield MyItem(name = data['name'], value = data['value'])
PS:
Do not use tbody in your selectors or XPaths. Modern browsers add tbody when they render a page, so it is usually not present in the original response that Scrapy receives.
See here: https://doc.scrapy.org/en/0.14/topics/firefox.html
Firefox, in particular, is known for adding <tbody> elements to tables. Scrapy, on the other hand, does not modify the original page HTML, so you won’t be able to extract any data if you use <tbody> in your XPath expressions.
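To get the single entry the question asks for (name, price, plus one key per table row), the rows can be collected into a dict first and merged with the fixed fields at the end. The sketch below only shows that pairing logic. It is not Scrapy code: it parses a trimmed copy of the table from the question with the standard library's xml.etree.ElementTree so it runs without a crawl, and the price "100" and the lowercase keys name and price are made up for illustration. In a spider callback the same loop would run over the rows returned by response.css('table tr').

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the table from the question (well-formed, so it parses as XML).
TABLE = """
<table class="table table-striped">
<tbody>
<tr><td><b>Brand:</b></td><td itemprop="brand">Merc</td></tr>
<tr><td><b>Size:</b></td><td>30 XL</td></tr>
<tr><td><b>Stock</b></td><td>20</td></tr>
</tbody>
</table>
"""


def table_to_fields(table_markup):
    """Return {label: value} for every two-column row, whatever the row count."""
    fields = {}
    for row in ET.fromstring(table_markup.strip()).iter("tr"):
        cells = row.findall("td")
        if len(cells) != 2:
            continue  # skip rows that are not label/value pairs
        # itertext() also picks up text nested in <b>; drop the trailing colon.
        label = "".join(cells[0].itertext()).strip().rstrip(":")
        value = "".join(cells[1].itertext()).strip()
        fields[label] = value
    return fields


def build_item(name, price, table_markup):
    """Merge the fixed page fields with the variable table fields into one entry."""
    item = {"name": name, "price": price}
    item.update(table_to_fields(table_markup))
    return item


# "Car" and "100" stand in for the values scraped from the two <div> elements.
item = build_item("Car", "100", TABLE)
print(item)
```

Because the labels become dict keys, pages with 2 rows and pages with 6 rows go through the same code path, and the callback can yield the dict directly.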

Related

weasyprint single page html

I’m trying to convert an HTML page containing a single table (whose tds contain either images or text) to a PDF. I am using WeasyPrint as a Python module.
My problem is that weasyprint inserts page breaks in the middle of columns, which I do not want.
What I would like to do is either:
Rescale (downscale) the page to fit into the height of a page
Change the page height to fit the height of the content
The size of the table (number of rows / columns) is variable (I don’t know it in advance).
Things I tried:
1. I tried using the CSS @page size property to change the page size, but it doesn’t work because I don’t know the size of the table in advance.
2. I tried adding the following
body html{
    height: 100%;
}
and adding a lot of display:block, but those simply make scrollbars appear; it doesn’t resize the content to fit.
3. I tried restricting page breaks using
table {
    break-inside: avoid;
}
but it didn’t change anything.
There doesn't seem to be a dedicated solution for the problem you've encountered.
A suitable workaround is to manually insert page breaks before the tables that are being split.
That can be done using the following style:
break-before: page;
You can read more about it here
Example
You can experiment with this example by removing the .break-before class
.break-before {
    break-before: page;
}
.m-top {
    margin-top: 90vh;
}
<body>
<table class="m-top break-before">
<tr>
<th>Company</th>
<th>Contact</th>
<th>Country</th>
</tr>
<tr>
<td>Alfreds Futterkiste</td>
<td>Maria Anders</td>
<td>Germany</td>
</tr>
<tr>
<td>Alfreds Futterkiste</td>
<td>Maria Anders</td>
<td>Germany</td>
</tr>
<tr>
<td>Centro comercial Moctezuma</td>
<td>Francisco Chang</td>
<td>Mexico</td>
</tr>
<tr>
<td>Alfreds Futterkiste</td>
<td>Maria Anders</td>
<td>Germany</td>
</tr>
<tr>
<td>Centro comercial Moctezuma</td>
<td>Francisco Chang</td>
<td>Mexico</td>
</tr>
<tr>
<td>Alfreds Futterkiste</td>
<td>Maria Anders</td>
<td>Germany</td>
</tr>
</table>
</body>

Huge HTML table - filter rows containing a string

I have a sample HTML document as shown below. I need to keep only the rows whose Profession (column 2) is Engineer and generate the resulting HTML document. The problem is that my document contains 2 million rows and is 1 GB in size. Could anyone please suggest a faster way to process this?
I tried parsing it with Python and the BeautifulSoup module, but the filtering takes more than 15 hours. Is there a faster way to do this?
Code:
from BeautifulSoup import BeautifulSoup
fd = open("input.html")
soup = BeautifulSoup(fd.read())
for tr in soup('tr'):
    if str(tr('td')[1].text) != "Engineer":
        tr.extract()
with open("output.html", "w") as file:
    file.write(str(soup))
fd.close()
INPUT:
<html>
<body>
<table>
<tr>
<td>Name</td>
<td>Profession</td>
<td>Address</td>
</tr>
<tr>
<td>John</td>
<td>Assassin</td>
<td>JohnWick</td>
</tr>
<tr>
<td>Tony</td>
<td>Engineer</td>
<td>IronMan</td>
</tr>
<tr>
<td>Stark</td>
<td>Engineer</td>
<td>IronMan</td>
</tr>
<tr>
<td>Bruce</td>
<td>Professor</td>
<td>Hulk</td>
</tr>
</table>
</body>
</html>
OUTPUT:
<html>
<body>
<table>
<tr>
<td>Name</td>
<td>Profession</td>
<td>Address</td>
</tr>
<tr>
<td>Tony</td>
<td>Engineer</td>
<td>IronMan</td>
</tr>
<tr>
<td>Stark</td>
<td>Engineer</td>
<td>IronMan</td>
</tr>
</table>
</body>
</html>
Do you need to retain the whitespace / formatting? Is this something you need to do many times, or just as a one off?
If it's a one-time job, you might be able to do it a little more simply. Try opening it up in Notepad++, Sublime etc. Use find and replace to reformat so you have one code row per table row:
<tr><td>Bruce</td><td>Professor</td><td>Hulk</td></tr>
<tr><td>Stark</td><td>Engineer</td><td>IronMan</td></tr>
(You can do it without this step, but it'll make it easier to see what's going on).
Then you could find and replace for:
<tr>.*?<td>Professor</td>.*?</tr>
with a blank row (repeat for each non-Engineer role). If there are a lot of professions, you can use back-references to change the Engineer rows from
<tr> content </tr>
to
<tr-keep> content </tr>
and then find and replace all of the vanilla tr rows.
You could also open it up in Excel and filter that way. I'm sure there are some good Python solutions here too, just telling you how I'd do it - I've had similar issues handling large files in Python, and you can do a lot of data munging in a basic text or spreadsheet editor. Excel eats a million rows for breakfast, but note that a single worksheet is limited to 1,048,576 rows, so a 2-million-row table would have to be split first.
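For a Python route that avoids loading the whole 1 GB document into a parse tree, the file can be streamed line by line, buffering only one <tr> block at a time. This is a minimal sketch rather than a benchmarked solution, and it rests on assumptions taken from the sample in the question: every row starts on its own line, rows are not nested, and the first row is the header. The names filter_rows, column and wanted are made up for the example.

```python
import io
import re

# Captures the text of every <td> cell in one buffered <tr> block.
CELL = re.compile(r"<td[^>]*>(.*?)</td>", re.S)


def filter_rows(src, dst, column=1, wanted="Engineer"):
    """Copy src to dst, dropping data rows whose cell at `column` != `wanted`.

    The first row is treated as the header and always kept.
    Returns the number of data rows kept.
    """
    row = []             # lines of the <tr> block currently being read
    in_row = False
    seen_header = False
    kept = 0
    for line in src:
        if not in_row and "<tr" in line:
            in_row = True
        if not in_row:
            dst.write(line)          # markup outside rows passes straight through
            continue
        row.append(line)
        if "</tr>" in line:
            block = "".join(row)
            cells = CELL.findall(block)
            if not seen_header:
                dst.write(block)     # first row is the header, always kept
                seen_header = True
            elif len(cells) > column and cells[column].strip() == wanted:
                dst.write(block)
                kept += 1
            row, in_row = [], False
    return kept


SAMPLE = """<html>
<body>
<table>
<tr>
<td>Name</td>
<td>Profession</td>
<td>Address</td>
</tr>
<tr>
<td>John</td>
<td>Assassin</td>
<td>JohnWick</td>
</tr>
<tr>
<td>Tony</td>
<td>Engineer</td>
<td>IronMan</td>
</tr>
</table>
</body>
</html>
"""

source, target = io.StringIO(SAMPLE), io.StringIO()
kept = filter_rows(source, target)
print(kept)
print(target.getvalue())
```

On the real data the two StringIO objects would be replaced by open("input.html") and open("output.html", "w"). Memory use stays at roughly one row, since nothing but the current block is held.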

How to obtain links using Selenium

I'm trying to obtain links using Selenium from an e-commerce website. I'm a complete beginner at web scraping, so I'm open to any type of suggestion.
This is the basic structure. Some of the <tr> tags contain links (href attributes) which I want.
<tbody>
<tr>...</tr>
<tr>...</tr>
<tr>...</tr>
<tr>...</tr>
<tr>...</tr>
<tr>...</tr>
</tbody>
What I have tried:
x1 = driver.find_elements_by_tag_name('tbody')
for x in x1:
    print(x.text)
For some reason, this fetches everything on the page, not only the things I want. Maybe that's because there's another <tbody> tag at the start of the code that covers everything in it.
My Question is:
How can I grab links from the <tbody> tag that I want?
# the href attribute lives on the <a> elements inside <tbody>, not on <tbody> itself
x1 = driver.find_elements_by_css_selector('tbody a')
for x in x1:
    print(x.get_attribute('href'))
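If an outer <tbody> wraps the whole page, as the question suspects, the links have to be read from the specific <tbody> that holds the product rows. The sketch below illustrates that idea offline with the standard library's html.parser instead of Selenium, so it can be run without a browser. The page markup, the URLs and the class name TbodyLinkCollector are all invented for the example. With a live driver, the rough equivalent would be to index into the list returned by find_elements_by_tag_name('tbody') and look for <a> elements under that one element.

```python
from html.parser import HTMLParser


class TbodyLinkCollector(HTMLParser):
    """Collect href values of <a> tags that sit inside the n-th <tbody>."""

    def __init__(self, tbody_index=0):
        super().__init__()
        self.tbody_index = tbody_index  # which <tbody> to read, 0-based, document order
        self.seen = -1                  # index of the most recently opened <tbody>
        self.depth = 0                  # > 0 while inside the wanted <tbody>
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "tbody":
            self.seen += 1
            # Enter the wanted <tbody>, or track a <tbody> nested inside it.
            if self.depth or self.seen == self.tbody_index:
                self.depth += 1
        elif tag == "a" and self.depth:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "tbody" and self.depth:
            self.depth -= 1


# Hypothetical page: an outer layout table wraps the product table.
PAGE = """
<table><tbody><tr><td>
  <a href="/header">Header link</a>
  <table><tbody>
    <tr><td><a href="/product/1">One</a></td></tr>
    <tr><td>No link here</td></tr>
    <tr><td><a href="/product/2">Two</a></td></tr>
  </tbody></table>
</td></tr></tbody></table>
"""

collector = TbodyLinkCollector(tbody_index=1)
collector.feed(PAGE)
print(collector.links)
```

With tbody_index=0 the outer body is selected and the header link is picked up as well, which mirrors the "fetching everything on the page" symptom.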

How do I get Beautifulsoup to Parse a Serial HTML list in a table into a CSV pattern of data?

I have an internal company webpage that lists a variety of data in a long list that I want to convert into a CSV file for reviewing. The data is in the format of:
*CUSTOMER_1*
Email Link Category_Text Phone_Numbers
Email Link Category_Text Phone_Numbers
*Customer_2*
Email Link Category_Text Phone_Numbers
Email Link Category_Text Phone_Numbers
Encoded in HTML it looks like
<table id="responsibility">
<tr class="customer">
<td colspan="6">
<strong>CUSTOMER 1</strong>
</td>
</tr>
<tr id="tr_1" title="Role_Name1">
<td>Name_1</td>
<td>Category_Text</td>
<td>Phone_Numbers</td>
<td></td>
</tr>
<tr id="tr_2" title="Role_Name2">
<td>Name_2</td>
<td>Category_Text</td>
<td>Phone_Numbers</td>
<td></td>
</tr>
<tr class="customer">
<td colspan="6">
<strong>CUSTOMER 2</strong>
</td>
</tr>
<tr id="tr_1" title="Role_Name1">
<td>Name_3</td>
<td>Category_Text</td>
<td>Phone_Numbers</td>
<td></td>
</tr>
<tr id="tr_2" title="Role_Name2">
<td>Name_2</td>
<td>Category_Text</td>
<td>Phone_Numbers</td>
<td></td>
</tr>
</table>
I'd like to end up with a file.csv that contains the info in this fashion
CUSTOMER1,Role_Name1,Name_1,Email_1,Category_Text,Phone_Numbers
CUSTOMER1,Role_Name2,Name_2,Email_2,Category_Text,Phone_Numbers
CUSTOMER2,Role_Name1,Name_3,Email_3,Category_Text,Phone_Numbers
CUSTOMER2,Role_Name2,Name_2,Email_2,Category_Text,Phone_Numbers
Right now I can get a list of all of the customer names or a list of all of the text, but I haven't been able to figure out how to iterate over every customer and then iterate over every line for each customer.
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("source.html"), "html.parser")
with open("output.csv",'w') as file:
    responsibility=soup.find('table',{'id':'responsibility'})
    line=responsibility.tr
    for i in responsibility:
        print(line)
        line=responsibility.tr.next_sibling
I was expecting this to print every tag in the document but instead it only prints the first and never cycles to the next tags.
Focus on this line of code:
line=responsibility.tr
Here you are using the .tr attribute, which locates the first <tr> tag in the table and returns it.
What does that mean here?
Say you have n <tr> tags: .tr gives you only the first of those n. So, if you wish to extract all n of them, use find_all(). It returns a list of all matches.
line=responsibility.find_all("tr", class_="customer")
Also, add the class_="customer" filter. It locates all the <tr> blocks with the "customer" class. From each of those, find_next_sibling("tr") finds the 2 subsequent rows with the title="Role_Name*" attribute. (Plain .next_sibling can return the whitespace text node between two rows instead of the next <tr>.)
So, to put the above theory into practice:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("source.html"), "html.parser")
with open("output.csv",'w') as file:
    responsibility=soup.find('table',{'id':'responsibility'})
    lines=responsibility.find_all("tr", class_ = "customer")
    for line in lines:
        line1=line.find_next_sibling("tr") #locates tr with title="Role_Name1"
        line2=line1.find_next_sibling("tr") #locates tr with title="Role_Name2"
        print(line1)
        print(line2)
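To go all the way to the CSV, one option is to walk the rows in document order and remember the most recent customer row, so every role row can be prefixed with it. The sketch below shows that carry-forward idea with the standard library's html.parser and csv modules instead of BeautifulSoup, purely so that it runs without extra packages. The class name ResponsibilityParser is made up, and the sample is a trimmed copy of the table from the question. The Email column from the desired output is not present in the posted HTML, so it does not appear here.

```python
import csv
import io
from html.parser import HTMLParser


class ResponsibilityParser(HTMLParser):
    """Turn customer header rows plus role rows into flat records."""

    def __init__(self):
        super().__init__()
        self.customer = None          # text of the most recent class="customer" row
        self.is_customer_row = False
        self.role = None              # title attribute of the current row
        self.cells = []               # cell texts of the current row
        self.in_cell = False
        self.records = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "tr":
            classes = (attrs.get("class") or "").split()
            self.is_customer_row = "customer" in classes
            self.role = attrs.get("title")
            self.cells = []
        elif tag == "td":
            self.in_cell = True
            self.cells.append("")

    def handle_data(self, data):
        if self.in_cell and self.cells:
            self.cells[-1] += data    # also collects text nested in <strong>

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False
        elif tag == "tr":
            cells = [c.strip() for c in self.cells]
            if self.is_customer_row:
                if cells:
                    self.customer = cells[0]
            else:
                # Prefix every role row with the customer it belongs to.
                self.records.append([self.customer, self.role] + cells)


SAMPLE = """
<table id="responsibility">
<tr class="customer"><td colspan="6"><strong>CUSTOMER 1</strong></td></tr>
<tr id="tr_1" title="Role_Name1"><td>Name_1</td><td>Category_Text</td><td>Phone_Numbers</td><td></td></tr>
<tr id="tr_2" title="Role_Name2"><td>Name_2</td><td>Category_Text</td><td>Phone_Numbers</td><td></td></tr>
<tr class="customer"><td colspan="6"><strong>CUSTOMER 2</strong></td></tr>
<tr id="tr_1" title="Role_Name1"><td>Name_3</td><td>Category_Text</td><td>Phone_Numbers</td><td></td></tr>
</table>
"""

parser = ResponsibilityParser()
parser.feed(SAMPLE)

# Written to an in-memory buffer here; a real run would open("output.csv", "w", newline="").
buffer = io.StringIO()
csv.writer(buffer).writerows(parser.records)
print(buffer.getvalue())
```

The same carry-forward loop works with BeautifulSoup: iterate over find_all("tr"), update the current customer when a row has the "customer" class, and emit a record otherwise.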

CSS Selectors, Choose by CHILD values

Let's say I have an HTML structure like this:
<html><head></head>
<body>
<table>
<tr>
<td>
<table>
<tr>
<td>Left</td>
</tr>
</table>
</td>
<td>
<table>
<tr>
<td>Center</td>
</tr>
</table>
</td>
<td>
<table>
<tr>
<td>Right</td>
</tr>
</table>
</td>
</tr>
</table>
</body>
</html>
I would like to construct CSS selectors to access the three sub tables, which are only distinguished by the contents of a table data item in their first row.
How can I do this?
I think there is no method available in CSS selectors to match on the inner text.
You can achieve that by using XPath or a jQuery selector.
XPath:
"//td[contains(text(),'Left')]"
or
"//td[text()='Right']"
jQuery selector:
jQuery("td:contains('Center')")
Using the logic below, you can execute jQuery selectors in WebDriver automation (locator must be a script string that returns the element, and jQuery must be loaded on the page).
JavascriptExecutor js = (JavascriptExecutor) driver;
WebElement element=(WebElement)js.executeScript(locator);
The .text property of an element returns the text of that element.
tables = page.find_elements_by_xpath('.//table')
contents = "Left Center Right".split()
results = []
for table in tables:
    if table.find_element_by_xpath('.//td').text in contents: # returns only the first element
        results.append(table)
You can narrow the search field by setting 'page' to the first 'table' element, and then running your search over that. There are all kinds of ways to improve performance like this. Note that this method will be fairly slow if there are a lot of extraneous tables present. Each webpage has its own quirks in how it represents information, so make sure you work around those to gain efficiency.
You can also use a list comprehension to return your results.
results = [t for t in tables if t.find_element_by_xpath('.//td').text in contents]
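The "pick the table by the text of its first cell" logic can be tried without a browser. The sketch below runs it against the HTML from the question using the standard library's xml.etree.ElementTree, which is an assumption on my part: it only works because that markup happens to be well-formed XML. The function name tables_by_first_cell is made up for the example.

```python
import xml.etree.ElementTree as ET

# The structure from the question, condensed.
PAGE = """<html><head></head><body>
<table><tr>
  <td><table><tr><td>Left</td></tr></table></td>
  <td><table><tr><td>Center</td></tr></table></td>
  <td><table><tr><td>Right</td></tr></table></td>
</tr></table>
</body></html>"""


def tables_by_first_cell(root, wanted):
    """Map first-cell text -> <table> element for tables whose first cell matches."""
    found = {}
    for table in root.iter("table"):
        # Only look at the table's own first row, not at nested tables.
        first_cell = table.find("./tr/td")
        if first_cell is None:
            continue
        text = (first_cell.text or "").strip()
        if text in wanted:
            found[text] = table
    return found


root = ET.fromstring(PAGE)
tables = tables_by_first_cell(root, {"Left", "Center", "Right"})
print(sorted(tables))
```

The outer table is skipped because its first cell holds a nested table rather than text, so only the three sub tables end up in the result.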
