Weasyprint HTML to PDF huge gap in right margin - python

I'm using Weasyprint to print an HTML template to PDF, and I keep getting a gap of 10cm on the right side.
I'm using @page { size: letter; } as the only page rule.
I've tried setting the page size manually, but I still keep getting a huge space to the right of all the pages.
Any thoughts on what could be the problem?

Found the solution: it was a CSS problem. The class used to style the body was not at the beginning of the CSS file, which caused erratic behavior with the styles declared before it.
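A minimal sketch of the setup described above, assuming WeasyPrint is installed and importable as `weasyprint`. The `page-body` class name, the margin values, and the helper names are illustrative assumptions, not the original poster's stylesheet. It places the `@page` rule and the body rule at the top of the stylesheet, which is the ordering the poster reports fixed the gap.

```python
# Hypothetical stylesheet: @page and the body class are declared first,
# before any other rules, as the fix above describes.
PAGE_CSS = """
@page { size: letter; margin: 2cm; }
body.page-body { margin: 0; padding: 0; width: auto; }
"""


def build_document(content_html):
    """Wrap a content fragment in a full HTML document with the stylesheet inlined."""
    return (
        "<!DOCTYPE html><html><head><meta charset='utf-8'>"
        f"<style>{PAGE_CSS}</style></head>"
        f"<body class='page-body'>{content_html}</body></html>"
    )


def render_pdf(content_html, target_path):
    """Render the document to a PDF file; requires WeasyPrint."""
    from weasyprint import HTML  # imported lazily so the module loads without it

    HTML(string=build_document(content_html)).write_pdf(target_path)
```

Calling `render_pdf("<p>Hello</p>", "out.pdf")` would write a letter-sized page; whether it removes a given gap depends on the rest of the stylesheet.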

Related

The HTML code I scrape seems to be incomplete in comparison to the full website. Could the HTML be changing dynamically?

I am currently scraping a website for work so that I can sort the data locally. However, the HTML I get seems to be incomplete, and I suspect the page changes as I scroll to add more content. Can this happen? And if so, how can I make sure I scrape the whole website for processing?
I currently only know some Python and HTML for web scraping, and I am looking into what else may be affecting this (JavaScript, ReactJS, etc.).
I am expecting to get a list of 50 names when scraping the website, but it only returns 13. I have downloaded the whole HTML file to go through it, and none of the other names seem to exist in the file, which is why I think the page may be changing dynamically.
Yes, the HTML content can be dynamic, and content loaded by JavaScript is the most likely cause. For Python, scrapy + splash may be a good place to start.
Depending on how the data is loaded, there are different ways to handle dynamic HTML content.
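One quick check, sketched here with only the standard library, is to count the elements in the HTML exactly as the server returned it. The `name` class and the sample markup are hypothetical stand-ins for the real page. If the count is lower than what the browser shows, the remaining items are most likely added by JavaScript after the page loads.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_by_class(html_text, class_name):
    """Return how many elements in the static HTML carry class_name."""
    parser = ClassCounter(class_name)
    parser.feed(html_text)
    parser.close()
    return parser.count


# Hypothetical static response: only two names are present in the markup,
# and a script tag hints that more are loaded later.
static_html = (
    '<ul><li class="name">Ann</li><li class="name">Bob</li></ul>'
    '<script src="more.js"></script>'
)
print(count_by_class(static_html, "name"))
```

If the static count is short, the usual options are a tool that executes JavaScript (such as the scrapy + splash setup mentioned above) or requesting the data from whatever endpoint the page's scripts call.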

Weasyprint and Twitter Bootstrap

Does anybody have any experience rendering web pages in Weasyprint that are styled using Twitter Bootstrap? Whenever I try, the HTML renders completely unstyled, as if no CSS were applied to it.
I figured out what the problem was. When I declared the style sheet, I had set media="screen". I removed that attribute and it seemed to fix it. Further research indicated I could also declare a separate stylesheet and set media="print".
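Since WeasyPrint renders with the print media type by default, a stylesheet limited to `screen` is skipped. The following is a rough standard-library sketch for spotting such links in a template. The function name and sample file names are hypothetical, and the check only understands plain comma-separated media types, not full media queries.

```python
from html.parser import HTMLParser


class StylesheetMediaParser(HTMLParser):
    """Collect (href, media) for every <link rel="stylesheet"> element."""

    def __init__(self):
        super().__init__()
        self.sheets = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        if tag == "link" and "stylesheet" in rel:
            self.sheets.append((attr.get("href"), attr.get("media")))


def sheets_skipped_for_print(html_text):
    """Return hrefs whose media attribute names neither 'print' nor 'all'."""
    parser = StylesheetMediaParser()
    parser.feed(html_text)
    parser.close()
    skipped = []
    for href, media in parser.sheets:
        if media is None or not media.strip():
            continue  # no media attribute: the sheet applies to all media
        types = [m.strip().lower() for m in media.split(",")]
        if "print" not in types and "all" not in types:
            skipped.append(href)
    return skipped
```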

Rendering float divs from html to pdf

Is there any way to generate a PDF from HTML with floating divs (I can even use fixed width and height values for the divs), margins and padding in Python? Does anybody know Python libraries that handle this CSS property correctly, or maybe system tools? Any info would be helpful.
I have tried wkhtmltopdf. Pisa excluded immediately...
Not Python, but you could try http://phantomjs.org/: write a simple JS script to load the page, then call .render to generate a PDF.

Python : Rendering part of webpage with proper styling from server

I am building a screen clipping app.
So far:
I can get the html mark up of the part of the web page the user has selected including images and videos.
I then send it to a server, which processes the HTML with BeautifulSoup to sanitize it and convert any relative paths to absolute paths.
Now I need to render that part of the page, but I have no way to render the styling. Is there any library to help with this, or any other way to do it in Python?
One way would be to fetch the whole webpage with urllib2 and remove the parts of the body I don't need and then render it.
But there must be a more pythonic way :)
Note: I don't want a screenshot. I am trying to render proper html with styling.
Thanks :)
Download the complete webpage, extract the style elements and the stylesheet link elements, and download the files referenced by the latter. That should give you the CSS used on the page.
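The extraction step could be sketched like this with the standard library only. The class and function names are hypothetical, and the sketch stops at collecting absolute stylesheet URLs and inline CSS; fetching each URL (for example with `urllib.request.urlopen`) is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CssCollector(HTMLParser):
    """Gather inline <style> contents and absolute URLs of linked stylesheets."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.stylesheet_urls = []
        self.inline_css = []
        self._in_style = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "style":
            self._in_style = True
        elif tag == "link" and "stylesheet" in (attr.get("rel") or "").lower().split():
            href = attr.get("href")
            if href:
                # Resolve relative hrefs against the page URL.
                self.stylesheet_urls.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag == "style":
            self._in_style = False

    def handle_data(self, data):
        if self._in_style:
            self.inline_css.append(data)


def collect_css(html_text, base_url):
    """Return (stylesheet_urls, inline_css_chunks) for a downloaded page."""
    parser = CssCollector(base_url)
    parser.feed(html_text)
    parser.close()
    return parser.stylesheet_urls, parser.inline_css
```

Note that this only finds CSS declared in the markup; styles injected by scripts or pulled in through `@import` inside a stylesheet would need extra handling.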

How to check if a page is displaying a specific <img> tag

What is the best way to determine if a page on a website is REALLY displaying a specific img tag like this <img src=http://domain.com/img.jpg>? A simple string comparison is easy to fool using HTML comments <!-- -->. Even if the HTML tag exists, it could be deleted with JavaScript. It could also be obscured by placing an image over it using CSS. Do you know of a solid method of detecting the img tag despite the obscuring attacks listed? Do you know of another method of obscuring the image? Python code to detect the image would be ideal, but if you know of a good tactic or method, that will earn you a +1 from me.
I don't think you can ever be sure. First, you're not even sure the program will stop.
Aside from that, consider the following scenarios. Your <img> can be added, removed or get obscured using JavaScript, CSS and/or server-side:
randomly.
at specific times.
to a certain part of the world.
according to differences and bugs between browsers.
Google is facing a similar problem - people are hiding search keywords in hidden text and links to get a better rank. Their solution is to penalize sites with hidden text. They get away with it because they're Google; people depend on them for traffic.
As for you, you can't do much better than to ask nicely...
The only surefire way I can think of is to render the page and check. It is simple to strip comments etc. But if scripts are involved, it is not possible to have a general solution that will not amount to executing them (I believe this is the first time I ever invoked Church's theorem...).
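The "strip comments" part can be sketched with the standard library's HTML parser, which reports comments separately from real tags. The helper names are hypothetical. This covers only the comment trick; it does not execute scripts or apply CSS, so it cannot detect the other attacks.

```python
from html.parser import HTMLParser


class ImgFinder(HTMLParser):
    """Record src values of real <img> tags; commented-out markup is ignored."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def has_img(html_text, src):
    """Return True if an <img> with exactly this src exists outside comments."""
    finder = ImgFinder()
    finder.feed(html_text)
    finder.close()
    return src in finder.sources
```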
You could place a script anywhere that processes the request, counts the view and delivers the image like this:
http://yourhost.com/imageprocess?image=media/foo/bar.jpg
Then you can be sure that the image was loaded. Whether it was actually viewed, of course, you can't be sure.
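A minimal sketch of such a counting endpoint, written as a plain WSGI application. The in-memory `IMAGES` store, the `view_counts` counter, and the placeholder image bytes are assumptions made for illustration; a real service would read files from disk and persist the counts.

```python
from collections import Counter
from urllib.parse import parse_qs

# Hypothetical in-memory store keyed by the ?image= value.
IMAGES = {"media/foo/bar.jpg": b"\xff\xd8\xff\xe0 placeholder jpeg bytes"}
view_counts = Counter()


def image_app(environ, start_response):
    """WSGI app: count the request, then deliver the image named in ?image=."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    name = query.get("image", [""])[0]
    body = IMAGES.get(name)
    if body is None:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found"]
    view_counts[name] += 1  # one more delivery of this image
    start_response(
        "200 OK",
        [("Content-Type", "image/jpeg"), ("Content-Length", str(len(body)))],
    )
    return [body]
```

For local experiments it can be served with `wsgiref.simple_server.make_server("", 8000, image_app).serve_forever()`. Looking names up in a fixed mapping, rather than opening arbitrary paths from the query string, avoids serving files the caller should not reach.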
