I'm trying to convert an HTML page containing a single table (whose td cells contain either images or text) to a PDF. I am using WeasyPrint as a Python module.
My problem is that WeasyPrint inserts page breaks in the middle of columns, which I do not want.
What I would like to do is either:
Rescale (downscale) the page to fit into the height of a page
Change the page height to fit the height of the content
The size of the table (amount of rows / columns) is variable (I don’t know it in advance).
Things I tried:
1.
I tried using the CSS @page size property to change the page size, but it doesn't work because I don't know the size of the table in advance.
2.
I tried adding the following
html, body {
height: 100%;
}
and adding a lot of display: block, but those simply make scrollbars appear; they don't resize the content to fit.
3.
I tried restricting page breaks using
table {
break-inside: avoid;
}
but it didn’t change anything
There doesn't seem to be a direct solution to the problem you've encountered.
A suitable workaround is to manually insert page breaks before the tables that are being split.
That can be done using the following style:
break-before: page;
Example
You can experiment with this example by removing the .break-before class:
.break-before {
break-before: page;
}
.m-top {
margin-top: 90vh;
}
<body>
<table class="m-top break-before">
<tr>
<th>Company</th>
<th>Contact</th>
<th>Country</th>
</tr>
<tr>
<td>Alfreds Futterkiste</td>
<td>Maria Anders</td>
<td>Germany</td>
</tr>
<tr>
<td>Alfreds Futterkiste</td>
<td>Maria Anders</td>
<td>Germany</td>
</tr>
<tr>
<td>Centro comercial Moctezuma</td>
<td>Francisco Chang</td>
<td>Mexico</td>
</tr>
<tr>
<td>Alfreds Futterkiste</td>
<td>Maria Anders</td>
<td>Germany</td>
</tr>
<tr>
<td>Centro comercial Moctezuma</td>
<td>Francisco Chang</td>
<td>Mexico</td>
</tr>
<tr>
<td>Alfreds Futterkiste</td>
<td>Maria Anders</td>
<td>Germany</td>
</tr>
</table>
</body>
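For completeness, a minimal sketch of feeding such a stylesheet to WeasyPrint from Python; the file names and the A4-landscape size are placeholders, not something the question prescribes:

from weasyprint import HTML, CSS

# Render the page with an extra stylesheet carrying the page-break rule
html = HTML(filename='table.html')  # placeholder input file
css = CSS(string='''
    @page { size: A4 landscape; }         /* pick a size wide enough for the table */
    .break-before { break-before: page; }
''')
html.write_pdf('table.pdf', stylesheets=[css])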
I have a sample HTML document as shown below. I need to filter all the rows whose Profession (column 2) is Engineer and generate a resultant HTML document. The problem is that my document contains 2 million rows and is 1 GB in size. Could anyone please suggest a faster way to process this?
I tried parsing with Python and the BeautifulSoup module and filtering, but it takes more than 15 hours to process the data. Is there a faster way to do this?
Code:
from BeautifulSoup import BeautifulSoup

fd = open("input.html")
soup = BeautifulSoup(fd.read())

# Remove every row whose second cell is not "Engineer"
for tr in soup('tr'):
    if str(tr('td')[1].text) != "Engineer":
        tr.extract()

with open("output.html", "w") as file:
    file.write(str(soup))
fd.close()
INPUT:
<html>
<body>
<table>
<tr>
<td>Name</td>
<td>Profession</td>
<td>Address</td>
</tr>
<tr>
<td>John</td>
<td>Assassin</td>
<td>JohnWick</td>
</tr>
<tr>
<td>Tony</td>
<td>Engineer</td>
<td>IronMan</td>
</tr>
<tr>
<td>Stark</td>
<td>Engineer</td>
<td>IronMan</td>
</tr>
<tr>
<td>Bruce</td>
<td>Professor</td>
<td>Hulk</td>
</tr>
</table>
</body>
</html>
OUTPUT:
<html>
<body>
<table>
<tr>
<td>Name</td>
<td>Profession</td>
<td>Address</td>
</tr>
<tr>
<td>Tony</td>
<td>Engineer</td>
<td>IronMan</td>
</tr>
<tr>
<td>Stark</td>
<td>Engineer</td>
<td>IronMan</td>
</tr>
</table>
</body>
</html>
Do you need to retain the whitespace / formatting? Is this something you need to do many times, or just as a one-off?
If it's a one-time job, you might be able to do it a little more simply. Try opening it up in Notepad++, Sublime Text, etc. Use find-and-replace to reformat so you have one line of code per table row:
<tr><td>Bruce</td><td>Professor</td><td>Hulk</td></tr>
<tr><td>Stark</td><td>Engineer</td><td>IronMan</td></tr>
(You can do it without this step, but it'll make it easier to see what's going on).
Then you could find and replace for:
<tr>.*?<td>Professor</td>.*?</tr>
with a blank row (repeat for each non-Engineer role). If there are a lot of professions, you can use back-references to change the Engineer rows from
<tr> content </tr>
to
<tr-keep> content </tr>
and then find and replace all of the vanilla tr rows.
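If you'd rather script that same pass, here is a rough Python sketch of the idea, assuming the one-row-per-line reformat above (it streams the file line by line, so the 1 GB size is not a problem):

import re

# Second cell of each one-line row: <tr><td>NAME</td><td>PROFESSION</td>...
data_row = re.compile(r'<tr><td>.*?</td><td>(.*?)</td>')

with open("input.html") as src, open("output.html", "w") as dst:
    for line in src:
        m = data_row.search(line)
        # Keep non-row lines, the header row, and the Engineer rows
        if m is None or m.group(1) in ("Profession", "Engineer"):
            dst.write(line)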
You could also open it up in Excel and filter that way. I'm sure there are some good Python solutions here too; I'm just telling you how I'd do it. I've had similar issues handling large files in Python, and you can do a lot of data munging in a basic text or spreadsheet editor. Excel eats a million rows for breakfast.
It's an odd one, and I have been sitting on it for nearly a week now.
Maybe it's obvious and I'm just not seeing things right anymore...
Any leads on alternative solutions are welcome, too.
I have no influence on the website.
I'm new to HTML.
I'm trying to get specific links from a website using Scrapy (how many there are changes); in this case RELATIVELINK1 and RELATIVELINK4, both labeled "Details".
How many tables there are depends on what you are allowed to see.
Before I start with the problem:
I'm using the scrapy shell to test responses.
I get values from all other parts of the HTML code.
I tried XPath, response.css and Scrapy's LinkExtractor.
I tried ignoring the /p part in the path.
Now, if I try to get a response with XPath:
response.xpath('/html/body').extract() - I get everything, including what is inside <p>
but when I get to
response.xpath('/html/body/.../p').extract() - I only get: ['<p>\n<br>\n</p>']
and then
response.xpath('/html/body/.../p/table').extract() - I get []
same for
response.xpath('/html/body/.../p/br').extract()
Here is the HTML segment I'm having trouble with:
<p>
<BR>
<TABLE BORDER>
<TR>
<TD><b>NAME1</b></TD>
<TD><b>NAME2</b></TD>
<TD><b>NAME3</b></TD>
<TD><b>NAME4</b></TD>
<TD COLSPAN=3><b>Links</b></TD>
</TR>
<TR>
<TD>NUMBER1</font></TD>
<TD>LINK1 </font></TD>
<TD> </font></TD>
<TD>NAME5 </font></TD>
<TD><a href=RELATIVELINK1>Details</a></TD>
<TD><a href=RELATIVELINK2>LABEL1</TD>
<TD><a href=RELATIVELINK3>LABEL2</TD>
</TR>
<TR>
<TD>NUMBER2</font></TD>
<TD>LINK2 </font></TD>
<TD> </font></TD>
<TD>NAME5;</font></TD>
<TD><a href=RELATIVELINK4>Details</a></TD>
<TD><a href=RELATIVELINK5>LABEL1</TD>
<TD><a href=RELATIVELINK6>LABEL2</TD>
</TR>
</TABLE>
<BR>
There is no </P>.
Because a <table> is not valid inside a <p>, the HTML parser closes the paragraph as soon as it reaches the table, so the table ends up as a sibling of the <p> rather than a child of it. That is why paths going through /p/table come back empty. Matching the anchors by their text sidesteps the broken hierarchy entirely:
for link_href in response.xpath('//a[.="Details"]/@href').extract():
    print(link_href)
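Since the question mentions Scrapy's LinkExtractor: the same text-based filter can be expressed there as well. A sketch, assuming Scrapy 1.0 or later:

from scrapy.linkextractors import LinkExtractor

# Restrict extraction to anchors whose text is exactly "Details"
extractor = LinkExtractor(restrict_xpaths='//a[.="Details"]')
for link in extractor.extract_links(response):
    print(link.url)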
I've been reading up on Scrapy. My Python skills are weak, but I am usually able to build something through trial and error and determination...
I'm able to run through my project site and scrape 'structured' product data.
The problem occurs with a table that has different rows and values per page.
Below is an example; I can get the name and price of the product.
The problem is with the table underneath: products have different specifications and a different number of rows, but always 2 columns. I'm trying to loop through by counting the <tr> elements and, for each, take the first <td> as a label and the second <td> as the corresponding value, then append them to the other page data to create 1 entry.
In the end I'd like to yield Name: name, Price: price, Label X: Value X, Label Y: Value Y
<div>name</div>
<div>price</div>
<table>
<tr><td>LABEL X</td><td>VALUE X</td></tr>
<tr><td>LABEL Y</td><td>VALUE Y</td></tr>
<tr><td>LABEL Z</td><td>VALUE Z</td></tr>
Could be anywhere from 2 to 6 rows
</table>
Any help would be much appreciated, or if someone could point me to an example.
EDIT >>>>
The HTML code
<table class="table table-striped">
<tbody>
<tr>
<td><b>Name:</b></td>
<td>Car</td>
</tr>
<tr>
<td><b>Brand:</b></td>
<td itemprop="brand">Merc</td>
</tr>
<tr>
<td><b>Size:</b></td>
<td>30 XL</td>
</tr>
<tr>
<td><b>Color:</b></td>
<td>white</td>
</tr>
<tr>
<td><b>Stock</b></td>
<td>20</td>
</tr>
</tbody>
</table>
You should have posted some of your Scrapy code to help us out.
Anyway, here is code you can use to parse your HTML:
for row in response.css('table > tr'):
    data = {}
    data['name'] = row.css("td:nth-child(1) b::text").extract()[0]
    data['value'] = row.css("td:nth-child(2)::text").extract()[0]
    yield MyItem(name=data['name'], value=data['value'])
PS:
Do not use tbody in your CSS selectors or XPaths; tbody is added by modern browsers, and it is not included in the original response.
See here: https://doc.scrapy.org/en/0.14/topics/firefox.html
"Firefox, in particular, is known for adding <tbody> elements to tables. Scrapy, on the other hand, does not modify the original page HTML, so you won't be able to extract any data if you use <tbody> in your XPath expressions."
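If you want a single item per product page rather than one per row, a rough sketch along the same lines; the div selectors for name and price are assumptions, adapt them to the real page:

def parse(self, response):
    item = {
        # Hypothetical selectors for the name and price divs
        'name': response.css('div.name::text').extract_first(),
        'price': response.css('div.price::text').extract_first(),
    }
    # Each spec row: first cell is the label, second cell the value
    for row in response.css('table.table-striped tr'):
        label = row.css('td:nth-child(1) b::text').extract_first()
        value = row.css('td:nth-child(2)::text').extract_first()
        if label and value:
            item[label.strip(': ')] = value.strip()
    yield item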
I am using Python 2.7, mechanize, and BeautifulSoup, and if it helps I could use urllib.
OK, I am trying to download a couple of different zip files that are in different HTML tables. I know which tables the particular files are in (I know if they are in the first, second, third... table).
Here is the second table, in HTML format, from the webpage:
<table class="fe-form" cellpadding="0" cellspacing="0" border="0" width="50%">
<tr>
<td colspan="2"><h2>Eligibility List</h2></td>
</tr>
<tr>
<td><b>Eligibility File for Met-Ed</b> -
<a href="/content/fecorp/supplierservices/eligibility_list.suppliereligibility.html?id=ME&ftype=1&fname=cmb_me_elig_lst_06_2013.zip">cmb_me_elig_lst_06_2013.zip</td>
</tr>
<tr>
<td><b>Eligibility File for Penelec</b> -
<a href="/content/fecorp/supplierservices/eligibility_list.suppliereligibility.html?id=PN&ftype=1&fname=cmb_pn_elig_lst_06_2013.zip">cmb_pn_elig_lst_06_2013.zip</td>
</tr>
<tr>
<td><b>Eligibility File for Penn Power</b> -
<a href="/content/fecorp/supplierservices/eligibility_list.suppliereligibility.html?id=PP&ftype=1&fname=cmb_pennelig_06_2013.zip">cmb_pennelig_06_2013.zip</td>
</tr>
<tr>
<td><b>Eligibility File for West Penn Power</b> -
<a href="/content/fecorp/supplierservices/eligibility_list.suppliereligibility.html?id=WP&ftype=1&fname=cmb_wp_elig_lst_06_2013.zip">cmb_wp_elig_lst_06_2013.zip</td>
</tr>
<tr>
<td> </td>
</tr>
</table>
I was going to use the following code just to get to the 2nd table:
from bs4 import BeautifulSoup
html= br.response().read()
soup = BeautifulSoup(html)
table = soup.find("table", class=fe-form)
I guess that class="fe-form" is wrong because it will not work, but there are no other attributes that differentiate this table from the others: all the tables have cellpadding="0" cellspacing="0" border="0" width="50%", so I guess I can't use the find() function.
So I am trying to get to the second table and then download the files on this page. Could someone give me some info to push me in the right direction? I have worked with forms before, but not tables. I wish there were some way to find the particular titles of the zip files I am looking for and then download them, since I will always know their names.
Thanks for any help,
Tom
To select the table you want, simply do
table = soup.find('table', attrs={'class' : 'fe-form', 'cellpadding' : '0' })
This assumes that there is only one table with class=fe-form and cellpadding=0 in your document. If there are more, this code will select only the first table. To be sure you are not overlooking anything on the page, you could do
tables = soup.findAll('table', attrs={'class' : 'fe-form', 'cellpadding' : '0' })
table = tables[0]
And maybe assert that len(tables)==1 to be sure that there is only one table.
Now, to download the files, there is plenty you can do. Assuming from your code that you have mechanize loaded, you could do something like
a_tags = table.findAll('a')
for a in a_tags:
    if '.zip' in a.get('href'):
        br.retrieve(a.get('href'), a.text)
That would download all files to your current working directory and would name them according to their link text.
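One caveat: the href values in this table are site-relative, so they will likely need to be resolved against the page URL before retrieving. A minimal sketch (Python 2.7 standard library, as in the question):

import urlparse

base_url = br.geturl()  # URL of the page the table came from
for a in table.findAll('a'):
    href = a.get('href')
    if href and '.zip' in href:
        # Resolve the relative link before downloading
        br.retrieve(urlparse.urljoin(base_url, href), a.text)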
Let's say I have an HTML structure like this:
<html><head></head>
<body>
<table>
<tr>
<td>
<table>
<tr>
<td>Left</td>
</tr>
</table>
</td>
<td>
<table>
<tr>
<td>Center</td>
</tr>
</table>
</td>
<td>
<table>
<tr>
<td>Right</td>
</tr>
</table>
</td>
</tr>
</table>
</body>
</html>
I would like to construct CSS selectors to access the three sub-tables, which are distinguished only by the contents of a table data cell in their first row.
How can I do this?
I don't think there is a method available in CSS selectors to match inner text.
You can achieve that by using an XPath or jQuery path.
xpath :
"//td[contains(text(),'Left')]"
or
"//td[text()='Right']"
jQuery path
jQuery("td:contains('Centre')")
Using the logic below, you can execute jQuery paths in WebDriver automation:
JavascriptExecutor js = (JavascriptExecutor) driver;
String locator = "return jQuery(\"td:contains('Center')\")[0];";
WebElement element = (WebElement) js.executeScript(locator);
The .text method on an element returns that element's text.
tables = page.find_elements_by_xpath('.//table')
contents = "Left Center Right".split()
results = []
for table in tables:
    # find_element (singular) returns only the first matching td
    if table.find_element_by_xpath('.//td').text in contents:
        results.append(table)
You can narrow the search field by setting 'page' to the first 'table' element and then running your search over that; there are all kinds of ways to improve performance like this. Note that this method will be fairly slow if there are a lot of extraneous tables present. Each webpage has its quirks in how it chooses to represent information; make sure you work around those to gain efficiency.
You can also use list comprehension to return your results.
results = [t for t in tables if t.find_element_by_xpath('.//td').text in contents]
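If the position of each sub-table inside the outer row is stable, plain CSS can also address them positionally, even though it cannot match text. A sketch against the same Selenium page object:

# Assumes the markup above: the outer row's first, second and third cells
left = page.find_element_by_css_selector('td:nth-child(1) table')
center = page.find_element_by_css_selector('td:nth-child(2) table')
right = page.find_element_by_css_selector('td:nth-child(3) table')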