Is it possible to automatically download the latest PDFs from a website? - python

For example, I want to download the latest WHO PDF on COVID-19. I'm really not sure how to do this.
If you type in 'who covid19 pdf' on Google, the pdf and link will come up.
I noticed that the links branch off from the main WHO domain name - maybe this can help?
Does anyone know how I can go about this?

From Python's standard library, use the urllib package, specifically the urlretrieve function. A succinct example is recreated below from this reference.
import urllib.request

urllib.request.urlretrieve("http://randomsite.com/file.pdf", "file.pdf")
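That only covers downloading a URL you already know. For the "latest PDF" part of the question, a rough standard-library sketch along these lines would first scan a page for .pdf links and then download one; the page URL and the "take the first link" rule below are placeholders, not a real WHO endpoint:
import re
import urllib.request
from urllib.parse import urljoin

# Placeholder page; swap in the report listing you actually want to scan
page_url = "https://www.example.org/reports"
html = urllib.request.urlopen(page_url).read().decode("utf-8", errors="ignore")

# Collect every href ending in .pdf and resolve it against the page URL
pdf_links = [urljoin(page_url, href)
             for href in re.findall(r'href="([^"]+\.pdf)"', html)]

if pdf_links:
    # "Latest" here just means the first link found; a real script would sort
    # by a date embedded in the filename or in the surrounding HTML
    urllib.request.urlretrieve(pdf_links[0], "latest.pdf")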

How to use a URL to get .csv data in Python

First post - be gentle!
I am starting to learn Python and would like to get information from a table on a web page (https://en.wikipedia.org/wiki/European_Union#Demographics) into a pandas DataFrame.
I am using Google Colab, and from researching a bit I understand the process has something to do with 'web scraping': turning HTML into .CSV.
Any thoughts welcome please. Worth noting I am constrained by not being able to download additional software due to the secure nature of my work.
Thanks.
You need a library to help you parse the HTML - a well known library for that in Python would be BeautifulSoup.
There are also some available tools online that do this kind of thing for you, and you can take some inspiration from them, even if you can't use them directly: https://wikitable2csv.ggor.de/
As you can see, the website above uses the CSS selector "table.wikitable" to identify the tables.
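For example, here is a minimal sketch using requests, BeautifulSoup and pandas (all of which Colab ships with, so there is nothing extra to install). Picking the first wikitable is an assumption; choose the index of the table you actually want:
import requests
import pandas as pd
from io import StringIO
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/European_Union"
html = requests.get(url, headers={"User-Agent": "table-demo"}).text
soup = BeautifulSoup(html, "html.parser")

# Wikipedia marks its data tables with the "wikitable" class
tables = soup.select("table.wikitable")

# Hand one table to pandas, then export it as CSV
df = pd.read_html(StringIO(str(tables[0])))[0]
print(df.head())
df.to_csv("eu_table.csv", index=False)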
You can use Scrapy, a Python-based scraping framework, to fetch and parse the data as required. In Scrapy, you create spiders which crawl a set of URLs that you have initialized. Furthermore, you can parse the HTML data using something like Beautiful Soup to get your table from the response. The Scrapy documentation itself is pretty useful and should get you set up quickly! Scrapy also lets you export the parsed data as CSV, which should help you with the export part.
All the best!

Newspaper3k returns 0 articles from archive.org waybackmachine pages whereas the live page works as expected

When attempting to use the Python library newspaper3k on an archived page URL from archive.org, it fails to fetch any articles. However, when using it on the same live page URL, it works fine. Please see below:
import newspaper
len(newspaper.build('https://bbc.co.uk/news').articles)
>>> 111
len(newspaper.build('https://web.archive.org/web/http://www.bbc.co.uk/news').articles)
>>> 0
Even using the special id_ hack that returns the original, unmodified page doesn't work:
len(newspaper.build('https://web.archive.org/web/20171219030622id_/http://www.bbc.co.uk/news').articles)
>>> 0
Any help would be much appreciated, thanks!
I find no indication that this library is meant to work with archive.org, or that it works with archive.org.
Neither of the two lists of sources [1][2] mentions archive.org or web.archive.org.
I downloaded the entire repository to search the source code, and it also contains no mention of either Internet Archive domain.
From what I can tell based on this file, the articles attribute is based on RSS/ATOM feeds. I don't think Internet Archive archives those, and even if it does, since they would link back to the live version of the site, some changes in the library itself would be needed to get them to work with Internet Archive.
You've already opened an issue, where you say it doesn't work at all (even on single articles -- that is probably an issue elsewhere, such as in the node-scoring algorithm it uses to decide which nodes contain the article), so if you don't want to dive into the library source code and fix it yourself, all you can do is wait.
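If you want to see this for yourself, a small diagnostic sketch along these lines compares what the library discovers on the live site versus the archived copy. category_urls() is in the documented API; feed_urls() exists on the Source object in the versions I've looked at, but treat that as an assumption:
import newspaper

for url in ('https://www.bbc.co.uk/news',
            'https://web.archive.org/web/http://www.bbc.co.uk/news'):
    # memoize_articles=False stops newspaper from hiding already-seen URLs
    paper = newspaper.build(url, memoize_articles=False)
    print(url)
    print('  categories:', len(paper.category_urls()))
    print('  feeds:     ', len(paper.feed_urls()))   # assumption, see note above
    print('  articles:  ', len(paper.articles))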

Python downloading PDF with urllib2 creates corrupt document

Last week I defined a function to download pdfs from a journal website. I successfully downloaded several pdfs using:
import urllib2

def pdfDownload(url):
    response = urllib2.urlopen(url)
    expdf = response.read()
    egpdf = open('ex.pdf', 'wb')
    egpdf.write(expdf)
    egpdf.close()
I tried this function out with:
pdfDownload('http://pss.sagepub.com/content/26/1/3.full.pdf')
At the time, this was how the URLs on the journal Psychological Science were formatted. The pdf downloaded just fine.
I then went to write some more code to actually generate the URL lists and name the files appropriately so I could download large numbers of appropriately named pdf documents at once.
When I came back to join my two scripts together (sorry for non-technical language; I'm no expert, have just taught myself the basics) the formatting of URLs on the relevant journal had changed. Following the previous URL takes you to a page with URL 'http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009'. And now the pdfDownload function doesn't work anymore (either with the original URL or new URL). It creates a pdf which cannot be opened "because the file is not a supported file type or has been damaged".
I'm confused because, to me, it seems like all that has changed is the formatting of the URLs, but something else must have changed to cause this? Any help would be hugely appreciated.
The problem is that the new URL points to a webpage--not the original PDF. If you print the value of "expdf", you'll get a bunch of HTML--not the binary data you're expecting.
I was able to get your original function working with a small tweak--I used the requests library to download the file instead of urllib2. requests appears to pull the file with the loader referenced in the html you're getting from your current implementation. Try this:
import requests

def pdfDownload(url):
    response = requests.get(url)
    expdf = response.content
    egpdf = open('ex.pdf', 'wb')
    egpdf.write(expdf)
    egpdf.close()
Note that requests is a third-party package, so whether you're on Python 2.7 or Python 3 you'll need to pip install requests if it isn't already installed.
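One optional extra, not part of the original answer: since the failure mode here is silently saving an HTML page under a .pdf name, it can help to check the Content-Type header before writing anything to disk. A sketch of that guard:
import requests

def pdfDownload(url, filename='ex.pdf'):
    response = requests.get(url)
    content_type = response.headers.get('Content-Type', '')
    # Guard added for illustration: refuse to save anything the server
    # didn't label as a PDF
    if 'pdf' not in content_type.lower():
        raise ValueError('Expected a PDF but got ' + content_type)
    with open(filename, 'wb') as f:
        f.write(response.content)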

Python Google Web Crawler

I am working on a project that needs to do a search on the internet (e.g. Stack Overflow), retrieve all relevant results (URL, text, image paths) from the search, and write them to an XML file. I am building it with Python. Does anyone have any suggestion as to how I should approach this problem? I don't want to scan the entire web, just the top relevant results (Stack Overflow, 10/08/2013, python, as an example).
For Stack Overflow you can use the API directly.
For example:
https://api.stackexchange.com/2.1/questions?fromdate=1381190400&todate=1381276800&order=desc&sort=activity&tagged=python&site=stackoverflow
See https://api.stackexchange.com/docs/questions#fromdate=2013-10-08&todate=2013-10-09&order=desc&sort=activity&tagged=python&filter=default&site=stackoverflow
Note that you can't make more than 30 requests a second; see http://api.stackexchange.com/docs/throttle
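As a rough illustration of the whole pipeline (the XML layout below is one I made up for the example, not anything the API prescribes), you could fetch one page of results from that endpoint and write the titles and links out to a file:
import requests
import xml.etree.ElementTree as ET

params = {
    'fromdate': 1381190400, 'todate': 1381276800,
    'order': 'desc', 'sort': 'activity',
    'tagged': 'python', 'site': 'stackoverflow',
}
resp = requests.get('https://api.stackexchange.com/2.1/questions', params=params)
resp.raise_for_status()

# Build a simple XML document out of the returned items
root = ET.Element('results')
for item in resp.json().get('items', []):
    question = ET.SubElement(root, 'question')
    ET.SubElement(question, 'title').text = item.get('title')
    ET.SubElement(question, 'url').text = item.get('link')

ET.ElementTree(root).write('results.xml', encoding='utf-8', xml_declaration=True)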
It sounds like you could use BeautifulSoup. Also check out this thread; it sounds like it's what you need: Creating an XML document with BeautifulSoup (Stack Overflow).
As for downloading and using BeautifulSoup, the site is here
It's pretty simple to use.
Hope this helps.

Different results for the same RSS feed fetching from different user agents

If I add a feed URL to Google Reader or to a desktop feed aggregator, I receive nice results. The URL is:
http://estaticos03.marca.com/rss/futbol_1adivision.xml
But when I fetch the same URL from a script (python script, using feedparser library) I am getting slightly different content for the same results (the title for each entry, for example, is different and all in uppercase).
I believe something is done on the server side to discourage people like me from parsing the content for our own projects (the feed is from a popular football newspaper), but I am not sure about it. I tried passing some user agents (like the Google Reader one) but still no luck, so maybe they check the IP as well? I am really confused.
Any idea why is this happening to me?
Thanks!
AFAIK Google Reader does some "magic" in the content to beautify it. They strip some tags and styles to avoid breaking their interface.
Can you provide more details on the differences?
Did you change the user agent of your script? Try to mimic Firefox and see what happens.
All right folks, I found it. I analyzed the source XML received (as @TryPyPy suggested). I had been trusting the feedparser library too much. The latest official version (4.1) has a bug where it takes the title tag from the media namespace instead of the original one:
http://code.google.com/p/feedparser/issues/detail?id=76
So I reinstalled from trunk and now everything is OK. Thanks for helping anyway!
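For anyone debugging something similar, a quick check is to look at the raw XML the server sends next to what feedparser extracts; the User-Agent string below is just an example value:
import urllib.request
import feedparser

url = 'http://estaticos03.marca.com/rss/futbol_1adivision.xml'

# Fetch the raw feed with a browser-like User-Agent to see exactly what the
# server returns, independent of any parsing
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
print(urllib.request.urlopen(req).read()[:500])

# Compare with what feedparser picks as the entry titles; with the buggy 4.1
# release these came from the media: namespace instead of the plain <title>
feed = feedparser.parse(url)
for entry in feed.entries[:3]:
    print(entry.title)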
