Here is what I'm trying to do: through a Python script, I would like to get the first 5 pages of results of a Google search and save them as PDF files in a folder.
What do you suggest?
(1) Do I start by parsing the HTML pages one by one and then find a tool to convert them into PDF?
(2) Or do I find a way to do all the steps directly in one go, through a module I don't know yet?
Thank you very much in advance for your insights !
Use the standard Python library to download the file(s). Then you can use http://www.xhtml2pdf.com/ to convert the pages to PDF.
Note: Most web pages use a lot of JavaScript to do all kinds of magic. So for many pages, only a full-blown web browser will get you nice/useful results. If you run into this problem, then there is no pure Python solution. Try phantomjs as explained here:
phantomjs rasterize.js 'http://en.wikipedia.org/w/index.php?title=Jakarta&printable=yes' jakarta.pdf
PS: I found these solutions by googling for "python convert html to pdf". You should try it once in a while.
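For the simple (no-JavaScript) case, the pure Python route could look roughly like this. It is only a sketch: it downloads one page with the standard library and converts it with xhtml2pdf's pisa.CreatePDF; the example URL and output file name are placeholders, and JavaScript-heavy pages (including modern Google results pages) will not render properly this way.
# Sketch: download a page with the standard library, convert it with xhtml2pdf.
# Only works for pages that render fine without JavaScript.
import urllib.request
from xhtml2pdf import pisa

url = "https://en.wikipedia.org/w/index.php?title=Jakarta&printable=yes"  # placeholder page
html = urllib.request.urlopen(url).read().decode("utf-8", errors="ignore")

with open("page.pdf", "wb") as output:
    status = pisa.CreatePDF(html, dest=output)

print("conversion failed" if status.err else "wrote page.pdf")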
I am trying to convert a large section of a website into a PDF. I can convert one page with pdfkit, but I do not want the images on the website, just the text. Is there a way to do this with pdfkit? I have been searching Google for the last half hour looking for a solution, but can only find information about getting pictures, not excluding them.
Thank you for your help!
This information can be found in the documentation for "wkhtmltopdf".
This tool has a --no-images option.
The PyPI page for pdfkit explains how to set options when using the Python package.
So this is what you are looking for:
import pdfkit

# --no-images is a flag option in wkhtmltopdf, so it takes no value;
# pdfkit expects None (or an empty string) for such flags rather than True.
options = {
    'no-images': None,
}

pdfkit.from_url('https://www.google.com/', 'out.pdf', options=options)
First post - be gentle!
I am starting to learn Python and would like to get information from a table in a web page (https://en.wikipedia.org/wiki/European_Union#Demographics) into a pandas DataFrame.
I am using Google Colab, and from researching a bit I understand the process has something to do with 'web scraping': turning HTML into .CSV.
Any thoughts welcome please. Worth noting I am constrained by not being able to download additional software due to the secure nature of my work.
Thanks.
You need a library to help you parse the HTML - a well known library for that in Python would be BeautifulSoup.
There are also some available tools online that do this kind of thing for you, and you can take some inspiration from them, even if you can't use them directly: https://wikitable2csv.ggor.de/
As you can see, the website above uses the CSS selector "table.wikitable" to identify the tables.
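Since you can't install anything extra, the good news is that Google Colab already ships with requests, BeautifulSoup and pandas, so something along these lines should work. This is just a sketch: which table index you actually need depends on the page layout, and the output file name is only an example.
# Sketch: pull the "wikitable" tables from the page and turn them into DataFrames.
# Assumes Google Colab, where requests, bs4 and pandas are preinstalled.
from io import StringIO

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/European_Union"
html = requests.get(url).text

# Same "table.wikitable" selector the online tool uses.
soup = BeautifulSoup(html, "html.parser")
tables = soup.select("table.wikitable")

# pandas can parse the raw HTML of each table straight into a DataFrame.
dfs = [pd.read_html(StringIO(str(t)))[0] for t in tables]
print(len(dfs), "tables found")

dfs[0].to_csv("eu_table.csv", index=False)  # export one of them as CSV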
You can use Scrapy, a Python-based scraping framework, to fetch and parse the data as required. In Scrapy, you create spiders which crawl a set of URLs you have initialized. You can then parse the HTML data, for example with Beautiful Soup, to get your table from the response. The Scrapy documentation is pretty useful in itself and should get you set up quickly! Scrapy also lets you export the parsed data as CSV, which should help you with the export part.
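To give you an idea, a minimal spider could look something like this. The spider name, the yielded field and the CSS selectors are illustrative assumptions; adapt them to whatever table you are after.
# Sketch of a Scrapy spider that yields the rows of each "wikitable" on the page.
# Run it with:  scrapy runspider eu_tables_spider.py -o rows.csv
import scrapy

class EuTablesSpider(scrapy.Spider):
    name = "eu_tables"
    start_urls = ["https://en.wikipedia.org/wiki/European_Union"]

    def parse(self, response):
        for table in response.css("table.wikitable"):
            for row in table.css("tr"):
                cells = [c.strip() for c in row.css("th::text, td::text").getall()]
                if cells:
                    yield {"row": " | ".join(cells)}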
All the best!
I would like to download the data in this table:
http://portal.ujn.gov.rs/Izvestaji/IzvestajiVelike.aspx
I know how to use selenium to go through the pages and the CSS selectors are helpful enough that it shouldn't be too difficult to get all the data...
However, I am curious if anyone knows some way of getting to a JSON response, or whatever intermediary object is used to build the HTML? As in, the raw data file, in whatever format, that the server sends out? Is this possible with ASP.NET frameworks?
I have found such solutions in the past, but with much simpler web pages and web pages with get requests...
Thank you!
Taking a look at the website (I have no experience with the language at all, but that doesn't really matter), it looks to me like the table is filled in server-side from a database (ASP.NET WebForms here, in my book the "old" way of doing it) rather than from a JSON endpoint. Which means that you're basically stuck doing it the normal web-scraping route like you said, OR finding a SQL injection (which I am in NO WAY suggesting, as it is illegal) to bypass the limitations of their crappy search page.
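Since you already know Selenium, the "normal route" would look roughly like this. This is only a sketch: the number of pages, the locator for the pager's "next" control and the output file name are placeholders, and ASP.NET grids usually page through a __doPostBack rather than a plain link, so inspect the real page to find the right element.
# Sketch: page through the grid with Selenium and collect the rendered tables with pandas.
import time
from io import StringIO

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("http://portal.ujn.gov.rs/Izvestaji/IzvestajiVelike.aspx")

frames = []
for _ in range(5):                      # however many pages you need
    frames.extend(pd.read_html(StringIO(driver.page_source)))
    # Placeholder locator -- replace with the real selector for the pager's "next" control.
    driver.find_element(By.LINK_TEXT, "Next").click()
    time.sleep(2)                       # crude wait for the postback to finish

driver.quit()
pd.concat(frames, ignore_index=True).to_csv("ujn_izvestaji.csv", index=False)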
I am working on a project that needs to do a search on the internet (e.g. Stack Overflow) and retrieve all relevant results from the search (URL, text, image paths) into an XML file. I am building it with Python. Does anyone have any suggestion as to how I should approach this problem? I don't want to scan through the entire web, just the top relevant results (Stack Overflow, 10/08/2013, python, as an example).
For Stack Overflow you can use the API directly.
For example:
https://api.stackexchange.com/2.1/questions?fromdate=1381190400&todate=1381276800&order=desc&sort=activity&tagged=python&site=stackoverflow
see https://api.stackexchange.com/docs/questions#fromdate=2013-10-08&todate=2013-10-09&order=desc&sort=activity&tagged=python&filter=default&site=stackoverflow
Note that you can't make more than 30 requests a second; see http://api.stackexchange.com/docs/throttle
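To get from the API's JSON response to the XML file you mentioned, something like this sketch would do. The fields written out (title and link) are examples of what the /questions endpoint returns, and the output file name is arbitrary.
# Sketch: fetch questions from the Stack Exchange API and dump them to XML.
import requests
import xml.etree.ElementTree as ET

params = {
    "fromdate": 1381190400,   # 2013-10-08
    "todate": 1381276800,     # 2013-10-09
    "order": "desc",
    "sort": "activity",
    "tagged": "python",
    "site": "stackoverflow",
}
resp = requests.get("https://api.stackexchange.com/2.1/questions", params=params)
resp.raise_for_status()

root = ET.Element("questions")
for item in resp.json().get("items", []):
    q = ET.SubElement(root, "question")
    ET.SubElement(q, "title").text = item.get("title", "")
    ET.SubElement(q, "url").text = item.get("link", "")

ET.ElementTree(root).write("results.xml", encoding="utf-8", xml_declaration=True)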
It sounds like you could use BeautifulSoup. And check out this thread; it sounds like it's what you need: Creating an XML document with BeautifulSoup (Stack Overflow).
As for downloading and using BeautifulSoup, the site is here
It's pretty simple to use.
Hope this helps.
I am a social scientist and a complete newbie/noob when it comes to coding. I have searched through the other questions/tutorials but am unable to get the gist of how to crawl a news website targeting the comments section specifically. Ideally, I'd like to tell python to crawl a number of pages and return all the comments as a .txt file. I've tried
from bs4 import BeautifulSoup
import urllib2
url="http://www.xxxxxx.com"
and that's as far as I can go before I get an error message saying bs4 is not a module. I'd appreciate any kind of help on this, and please, if you decide to respond, DUMB IT DOWN for me!
I can run wget in the terminal and get all kinds of text from websites, which is awesome IF I could actually figure out how to save the individual output HTML files into one big .txt file. I will take a response to either question.
Try Scrapy. It is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
You will most likely encounter this as you go, but in some cases, if the site is employing 3rd party services for comments, like Disqus, you will find that you will not be able to pull the comments down in this manner. Just a heads up.
I've gone down this route before and have had to tailor the script to a particular site's layout/design/etc.
I've found libcurl to be extremely handy, if you don't mind doing the post-processing with Python's string-handling functions.
If you don't need to implement it purely in Python, you can use wget's recursive mirroring option to handle the content pull, then write your Python code to parse the downloaded files.
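For the second half of that approach, the parsing step could be as simple as this sketch: walk whatever directory wget created, strip the HTML with BeautifulSoup and append the text to one big .txt file. The directory name is a placeholder.
# Sketch: turn a wget mirror into a single text file.
import os
from bs4 import BeautifulSoup

mirror_dir = "www.example.com"   # whatever directory `wget -m` produced

with open("all_pages.txt", "w", encoding="utf-8") as out:
    for dirpath, _, filenames in os.walk(mirror_dir):
        for name in filenames:
            if not name.endswith((".html", ".htm")):
                continue
            with open(os.path.join(dirpath, name), encoding="utf-8", errors="ignore") as f:
                soup = BeautifulSoup(f.read(), "html.parser")
            out.write(soup.get_text(separator="\n"))
            out.write("\n\n")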
I'll add my two cents here as well.
The first things to check are that you have actually installed Beautiful Soup (the package is called beautifulsoup4 on PyPI) and that it lives somewhere on your Python path where it can be found. There are all kinds of things that can go wrong here.
My experience is similar to yours: I work at a web startup, and we have a bunch of users who register, but give us no information about their job (which is actually important for us). So my idea was to scrape the homepage and the "About us" page from the domain in their email address, and try to put a learning algorithm around the data that I captured to predict their job. The results for each domain are stored as a text file.
Unfortunately (for you...sorry), the code I ended up with was a bit complicated. The problem is that you'll end up getting a lot of garbage when you do the scraping, and you'll have to filter it out. You'll also end up with encoding issues, and (assuming you want to do some learning here) you'll have to get rid of low-value words. The total code is about 1000 lines, and I'll post some important pieces that may help you out here, if you're interested.