Webscraping to download unique PDF files - python

I have to routinely download over 300 PDFs from over 150 websites once every quarter, and I've started to think there has to be a way to automate this using Python. These PDFs are released quarterly and detail the performance of mutual funds over the previous quarter. 90% of the time, these PDFs are called 'quarterly commentary' or 'commentary', so what I want to do is write a Python script that searches the fund-specific URL (e.g. https://www.pimco.com/investments/mutual-funds/total-return-fund/inst) for the keyword 'commentary', finds the link, and then downloads the resulting PDF file.
I would also like to name the downloaded file after the proper mutual fund name. I have been working from an Excel spreadsheet: column A holds the proper mutual fund name and column B holds the mutual fund URL.
Would this be possible?

Personally I find it easier to use CasperJS and PhantomJS to download files from external websites, because you can inject JavaScript code into the page to grab the elements you need.
Here is the CasperJS documentation
And here is some code that I've written to collect lecture PDFs from my professor's webpage and save them to my desktop:
var casper = require('casper').create({verbose: true, logLevel: "debug"});
var url = "https://www.cs.rit.edu/~ib/Classes/CSCI264_Fall16-17/assignments.html";
var fs = require('fs');
casper.start(url);
var elements;
casper.then(function(){
    elements = this.evaluate(function(){
        var pdfs = document.querySelectorAll('body ul li a');
        return Array.prototype.map.call(pdfs, function(e) {
            return e.getAttribute('href');
        });
    });
    for(var i = 0; i < elements.length; ++i){
        var url = "" + elements[i] + "";
        if(url.indexOf('pdf') !== -1){
            var file = fs.absolute(url.substring(url.lastIndexOf("/")+1, url.length));
            this.download(url, file);
        }
    }
});
casper.run(function() {
    this.echo('Done.').exit();
});
Of course, if you're dead set on using Python then disregard this completely. Otherwise, good luck with your CasperJS script.
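For readers who do want to stay in Python, the workflow described in the question can be sketched with the standard library alone. This is only a rough sketch under several assumptions: the names LinkCollector, find_commentary_links, safe_filename and download_commentary are made up for illustration, the fund page is assumed to serve its links as plain HTML (pages that build their links with JavaScript would need a real browser, as in the CasperJS answer), and the first matching link is assumed to be the PDF.

```python
import re
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs for every <a> element on a page."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_commentary_links(html, base_url, keyword="commentary"):
    """Return absolute URLs of links whose text or href mentions the keyword."""
    parser = LinkCollector()
    parser.feed(html)
    keyword = keyword.lower()
    return [
        urljoin(base_url, href)
        for href, text in parser.links
        if keyword in text.lower() or keyword in href.lower()
    ]


def safe_filename(fund_name):
    """Turn a fund name into a file name by replacing non-alphanumeric runs."""
    return re.sub(r"[^A-Za-z0-9]+", "_", fund_name).strip("_") + ".pdf"


def download_commentary(fund_name, fund_url):
    """Fetch the fund page and save the first matching link under the fund name."""
    with urllib.request.urlopen(fund_url) as response:
        html = response.read().decode("utf-8", errors="replace")
    links = find_commentary_links(html, fund_url)
    if not links:
        return None  # nothing matched; this fund needs manual handling
    target = safe_filename(fund_name)
    urllib.request.urlretrieve(links[0], target)
    return target
```

To drive this from the spreadsheet, one option is to export columns A and B to a CSV file, read the rows with the csv module, and call download_commentary(name, url) for each row, logging the funds for which it returns None.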

Related

Python + webapp2 + Modify the URL without reloading the page [duplicate]

Is there a way I can modify the URL of the current page without reloading the page?
I would like to access the portion before the # hash if possible.
I only need to change the portion after the domain, so it's not like I'm violating cross-domain policies.
window.location.href = "www.mysite.com/page2.php"; // this reloads

This can now be done in Chrome, Safari, Firefox 4+, and Internet Explorer 10pp4+!
See this question's answer for more information:
Updating address bar with new URL without hash or reloading the page
Example:
function processAjaxData(response, urlPath){
    document.getElementById("content").innerHTML = response.html;
    document.title = response.pageTitle;
    window.history.pushState({"html":response.html,"pageTitle":response.pageTitle},"", urlPath);
}
You can then use window.onpopstate to detect the back/forward button navigation:
window.onpopstate = function(e){
    if(e.state){
        document.getElementById("content").innerHTML = e.state.html;
        document.title = e.state.pageTitle;
    }
};
For a more in-depth look at manipulating browser history, see this MDN article.

HTML5 introduced the history.pushState() and history.replaceState() methods, which allow you to add and modify history entries, respectively.
window.history.pushState('page2', 'Title', '/page2.php');
Read more about this from here

You can also use HTML5 replaceState if you want to change the url but don't want to add the entry to the browser history:
if (window.history.replaceState) {
//prevents browser from storing history with each change:
window.history.replaceState(statedata, title, url);
}
This would 'break' the back button functionality. This may be required in some instances such as an image gallery (where you want the back button to return back to the gallery index page instead of moving back through each and every image you viewed) whilst giving each image its own unique url.

Here is my solution (newUrl is the new URL that should replace the current one):
history.pushState({}, null, newUrl);

NOTE: If you are working with an HTML5 browser then you should ignore this answer. This is now possible as can be seen in the other answers.
There is no way to modify the URL in the browser without reloading the page. The URL represents what the last loaded page was. If you change it (document.location) then it will reload the page.
One obvious reason being, you write a site on www.mysite.com that looks like a bank login page. Then you change the browser URL bar to say www.mybank.com. The user will be totally unaware that they are really looking at www.mysite.com.

parent.location.hash = "hello";

Modern browsers implement the HTML5 pushState method on window.history. It changes the URL and pushes it onto the history without loading the page.
It takes three parameters: 1) a state object, 2) a title, and 3) a URL:
window.history.pushState({page: "another"}, "another page", "example.html");
This will change the URL, but not reload the page. Also, it doesn't check whether the page exists, so if you have JavaScript code that reacts to the URL, you can work with it like this.
There is also history.replaceState(), which does exactly the same thing, except that it modifies the current history entry instead of creating a new one!
You can also create a function that checks whether history.pushState exists and then carries on accordingly, like this:
function goTo(page, title, url) {
    if ("undefined" !== typeof history.pushState) {
        history.pushState({page: page}, title, url);
    } else {
        window.location.assign(url);
    }
}
goTo("another page", "example", 'example.html');
Also, you can change the # fragment for pre-HTML5 browsers, which won't reload the page. That's how Angular does SPA routing based on the hash...
Changing the # is quite easy:
window.location.hash = "example";
And you can detect it like this:
window.onhashchange = function () {
console.log("#changed", window.location.hash);
}

The HTML5 replaceState is the answer, as already mentioned by Vivart and geo1701. However it is not supported in all browsers/versions.
History.js wraps HTML5 state features and provides additional support for HTML4 browsers.

Before HTML5 we can use:
parent.location.hash = "hello";
and:
window.location.replace("http://www.example.com");
This method will reload your page, but HTML5 introduced history.pushState(page, caption, replace_url), which should not reload your page.

If what you're trying to do is allow users to bookmark/share pages, and you don't need it to be exactly the right URL, and you're not using hash anchors for anything else, then you can do this in two parts: you use the location.hash approach discussed above, and then implement a check on the home page that looks for a URL with a hash anchor in it and redirects to the corresponding page.
For instance:
User is on www.site.com/section/page/4
User does some action which changes the URL to www.site.com/#/section/page/6 (with the hash). Say you've loaded the correct content for page 6 into the page, so apart from the hash the user is not too disturbed.
User passes this URL on to someone else, or bookmarks it
Someone else, or the same user at a later date, goes to www.site.com/#/section/page/6
Code on www.site.com/ redirects the user to www.site.com/section/page/6, using something like this:
if (window.location.hash.length > 0){
window.location = window.location.hash.substring(1);
}
Hope that makes sense! It's a useful approach for some situations.

Below is a function that changes the URL without reloading the page. It is only supported in HTML5 browsers.
function ChangeUrl(page, url) {
    if (typeof (history.pushState) != "undefined") {
        var obj = {Page: page, Url: url};
        history.pushState(obj, obj.Page, obj.Url);
    } else {
        // no History API: fall back to a normal navigation to the same URL
        window.location.href = url;
        // alert("Browser does not support HTML5.");
    }
}
ChangeUrl('Page1', 'homePage');

You can use this beautiful and simple function to do so anywhere on your application.
function changeurl(url, title) {
var new_url = '/' + url;
window.history.pushState('data', title, new_url);
}
You can not only edit the URL but you can update the title along with it.

Any change of the location (either window.location or document.location) will cause a request for that new URL, unless you are only changing the URL fragment. If you change the URL, you change the URL.
Use server-side URL rewrite techniques like Apache’s mod_rewrite if you don’t like the URLs you are currently using.

You can add anchor tags. I use this on my site so that I can track with Google Analytics what people are visiting on the page.
I just add an anchor tag and then the part of the page I want to track:
var trackCode = "/#" + encodeURIComponent($("myDiv").text());
window.location.href = "http://www.piano-chords.net" + trackCode;
pageTracker._trackPageview(trackCode);

As pointed out by Thomas Stjernegaard Jeppesen, you could use History.js to modify URL parameters whilst the user navigates through your Ajax links and apps.
Almost a year has passed since that answer, and History.js has grown and become more stable and cross-browser. It can now be used to manage history states in HTML5-compliant browsers as well as in many HTML4-only browsers. In this demo you can see an example of how it works (and try out its functionality and limits).
Should you need any help using and implementing this library, I suggest you take a look at the source code of the demo page: you will see it's very easy to do.
Finally, for a comprehensive explanation of what can be the issues about using hashes (and hashbangs), check out this link by Benjamin Lupton.

Use history.pushState() from the HTML 5 History API.
Refer to the HTML5 History API for more details.

Set your new URL:
let newUrlIS = window.location.origin + '/user/profile/management';
In a sense, calling pushState() is similar to setting window.location = "#foo", in that both will also create and activate another history entry associated with the current document. But pushState() has a few advantages:
history.pushState({}, null, newUrlIS);
You can check out the root: https://developer.mozilla.org/en-US/docs/Web/API/History_API

This code works for me. I used it in my application with Ajax.
history.pushState({ foo: 'bar' }, '', '/bank');
Once a page loads into an element by ID using Ajax, it changes the browser URL automatically without reloading the page.
This is the Ajax function below.
function showData(){
$.ajax({
type: "POST",
url: "Bank.php",
data: {},
success: function(html){
$("#viewpage").html(html).show();
$("#viewpage").css("margin-left","0px");
}
});
}
Example: from any page or controller such as "Dashboard", when I click on the bank link, it loads the bank list using the Ajax code without reloading the page. At this point, the browser URL is not changed.
history.pushState({ foo: 'bar' }, '', '/bank');
But when I add this line inside the Ajax success callback, it changes the browser URL without reloading the page.
Here is the full Ajax code:
function showData(){
    $.ajax({
        type: "POST",
        url: "Bank.php",
        data: {},
        success: function(html){
            $("#viewpage").html(html).show();
            $("#viewpage").css("margin-left","0px");
            history.pushState({ foo: 'bar' }, '', '/bank');
        }
    });
}

This is all you need to navigate without reloading:
// add "setting" to the URL without reloading
location.hash = "setting";
// if the URL hash changes, do something
window.addEventListener('hashchange', () => {
    console.log('url hash changed!');
});
// if the URL changes, do something (does not detect hash changes)
//window.addEventListener('locationchange', function(){
//    console.log('url changed!');
//})
// remove #setting without reloading
history.back();

Simply use this; it will not reload the page, only change the URL:
$('#form_name').attr('action', '/shop/index.htm').submit();

Extract string from <script> - BeautifulSoup python

I'm trying to create a Python script to extract some information from a webmail site. I want to follow a redirect.
My code:
import mechanize

# cj (a cookie jar), mail_site and pw_site are assumed to be defined earlier
br1 = mechanize.Browser()
br1.set_handle_robots(False)
br1.set_cookiejar(cj)
br1.open("LOGIN URL")
br1.select_form(nr=0)
br1.form['username'] = mail_site
br1.form['password'] = pw_site
res1 = br1.submit()
html = res1.read()
print(html)
The result is not what I expect.
It contains only a redirection script.
I've seen that I have to extract the information from this script to follow the redirect.
So, in my case, I have to extract the jsessionid from the script.
The script is:
<script>
function redir(){
window.self.location.replace('/webmail/en_EN/continue.html;jsessionid=1D5QS4DA6C148DC4C14QS4CS5.1FDS5F4DSV1A64DA5DA?MESSAGE=NO_COOKIE&DT=1&URL_VALID=welcome.html');
return true;
}
</script>
If I'm not wrong, I have to build a regex.
I've tried many things but with no results.
Does anyone have an idea?

import re
# script_ holds the text of the <script> block shown in the question
get_jsession = re.search(r'jsessionid=([A-Za-z0-9.]+)', script_)
print(get_jsession.group(1))
# prints: 1D5QS4DA6C148DC4C14QS4CS5.1FDS5F4DSV1A64DA5DA
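Since the goal in the question is to follow the redirect rather than only read the session id, a variation is to capture the whole argument of location.replace(...) and resolve it against the login URL. The sketch below is an illustration, not the answerer's method: the function name extract_redirect and the base URL https://mail.example.com/login are assumptions, and the script text is copied from the question.

```python
import re
from urllib.parse import urljoin

# Script text as shown in the question.
script_ = """
function redir(){
window.self.location.replace('/webmail/en_EN/continue.html;jsessionid=1D5QS4DA6C148DC4C14QS4CS5.1FDS5F4DSV1A64DA5DA?MESSAGE=NO_COOKIE&DT=1&URL_VALID=welcome.html');
return true;
}
"""


def extract_redirect(script_text, base_url):
    """Return the absolute redirect target found in the script, or None."""
    match = re.search(r"location\.replace\('([^']+)'\)", script_text)
    if match is None:
        return None
    # The captured path is site-relative, so resolve it against the login URL.
    return urljoin(base_url, match.group(1))


# base_url is a placeholder; use the URL the login form was loaded from.
redirect_url = extract_redirect(script_, "https://mail.example.com/login")
print(redirect_url)
```

The resulting URL could then be passed to br1.open(...) on the same mechanize browser, so that the cookies collected during login are sent along with the request.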

Getting resources with python phantomJS

I am trying to find a way to get all the resources loaded by a page using Python with PhantomJS. Note that some resources may be loaded by scripts, etc. Is there a way to do this? Thanks for all help in advance!
My current code is:
self.driver.get("about:blank")
js = '''
    console.log('hello world');
    var page = this;
    var urls = [];
    page.onResourceRequested = function (req) {
        urls.push(req.url);
    };
    page.onResourceReceived = function (res) {
        urls.push(res.url);
    };
    return urls;
'''
result = self.driver.execute_script(js)
self.driver.get(url_to_open)
time.sleep(2)
print(result)

Edit and replace partial URLs in tumblr posts

I want to change my URL in tumblr but I have hardcoded links to my blog all over the place. Rather than go into over 1000 posts and manually update the links, I was told it can be automated. I'd need it to:
Visit a page of the blog
Check for the old URL within the hyperlinked text of the post
If present, click Edit to edit the post content
Click the URL in the text area
Click Edit in the popup that'll appear
Replace only part of the URL in the popup with the new URL (ex: if we start with http://old.tumblr.com/tagged we'd then want http://new.tumblr.com/tagged)
Click Done on the popup to close it and save URL changes
Save changes to the post
Proceed to check the next post on the page
If no more instances occur, continue to the next page
Repeat until last page is reached
So I believe I understand the logic and the steps required, but my difficulty is in executing them. What would be the best language or method for implementing this? Something straightforward is preferred, as I'm a complete coding newbie. Python was mentioned to me. AutoHotkey maybe, as well?
My apologies if this isn't the correct place to ask.
Currently I've got a redirect in place on the old URL's page.
<title>Redirect</title>
<script>location.replace('http://new.tumblr.com' + location.pathname);</script>
<noscript>
<h1>This blog has moved to New Blog.</h1>
<p>If you’re reading this, you have JavaScript turned off and therefore can’t be redirected automatically. Replace “{BlogURL}” with “http://{text:New Tumblr URL}.tumblr.com/” in your browser’s address bar to get to your destination.</p>
</noscript>

Well, in AutoHotkey I'd use IE COM automation to get the job done; it'd be the most reliable.
COM Object Reference
Edit:
Frankly, editing HTML using browser automation methods is just a terribly inefficient way to go about this. If you have access to the site, you likely have access to upload HTML files directly. If this is the case, the code below should give you enough detail about how to edit the links contained within your pages.
The code below is a simplification of what you'll be doing, just to familiarize yourself with the process.
html =
(
<html>
<body>
<a href="http://old.tumblr.com/tagged1">this old link</a>
<a href="http://old.tumblr.com/tagged2">this old link two</a>
<a href="http://old.tumblr.com/tagged3">this old link three</a>
<a href="http://new.tumblr.com/tagged3">this new link</a>
</body>
</html>
)
pwb := ComObjCreate("HTMLfile"), pwb.Write( html )
Links := pwb.Links
Loop % Links.Length ; check each link
If ((RelatedLink := Links[A_Index-1].href) != "" && (Links[A_Index-1].href ~= "http://old.")) { ; if the link is not blank
Links[A_Index-1].href := StrReplace(Links[A_Index-1].href, "http://old.", "http://new.")
}
html := pwb.documentElement.innerHTML
MsgBox % html
And this is how I would go about applying it to a bunch of websites:
SetBatchLines -1
fileName := A_ScriptDir . "\myfile.txt"
MyListOfWebPages = ; add all your blog page urls here
(
http://myblogpageone.html
http://myblogpagetwo.html
http://myblogpagethree.html
http://myblogpagefour.html
)
For Each, Line in StrSplit(MyListOfWebPages, "`n", "`r") {
FileAppend, % GrabWebPage(Line), % A_scriptDir "\htmlfile" A_index ".html"
}
GrabWebPage(Webpage) {
whr := ComObjCreate("WinHttp.WinHttpRequest.5.1")
;Change below to your URL
whr.Open("GET", Webpage, true)
whr.Send()
whr.WaitForResponse()
pwb := ComObjCreate("HTMLfile"), pwb.Write( whr.ResponseText )
Links := pwb.Links ; collection of hyperlinks on the page
Loop % Links.Length ; check each link
If ((RelatedLink := Links[A_Index-1].href) != "" && (Links[A_Index-1].href ~= "http://old.")) { ; if the link is not blank
Links[A_Index-1].href := StrReplace(Links[A_Index-1].href, "http://old.", "http://new.")
}
Return pwb.documentElement.innerHTML
}
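Since Python was also mentioned in the question, the link rewrite at the heart of the AutoHotkey code can be sketched in Python as well. This is a minimal sketch that only covers the text substitution: the function name rewrite_links is made up for illustration, the old and new base URLs are the placeholders used in the question, and fetching posts from Tumblr and saving them back is not shown.

```python
import re


def rewrite_links(html, old_base="http://old.tumblr.com", new_base="http://new.tumblr.com"):
    """Replace old_base with new_base, but only at the start of href attributes."""
    # Group 1 keeps the 'href="' (or "href='") prefix so the quote style is preserved.
    pattern = re.compile(r"(href=[\"'])" + re.escape(old_base))
    return pattern.sub(lambda m: m.group(1) + new_base, html)


sample = (
    '<a href="http://old.tumblr.com/tagged">this old link</a> '
    '<a href="http://new.tumblr.com/tagged3">this new link</a>'
)
print(rewrite_links(sample))
```

Restricting the match to href attributes keeps the rest of the post intact, so a post that merely mentions the old blog name in its text is left unchanged.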

create static html pages dynamically cos of SEO

I am trying to work out how to create static HTML pages dynamically.
This is because I have read that dynamic content is not Google friendly: Google cannot crawl content that comes from the database when the page is opened.
A concrete example:
{{ content_from_db }}
This variable is replaced with a long text that contains many important keywords for the page. I read that this content is unfortunately not seen by Google since it is dynamic.
So I thought, well, let's create static HTML pages dynamically, but I am stuck here, not knowing how.
Is it possible?

Your premise is completely and utterly false. It is absolutely not the case that Google can't index dynamically created websites. Of course it can: StackOverflow, which has awesome SEO, is just one of the millions of dynamic websites indexed by Google.

Yes, it's possible. I will give you a short example that you can build on. We had a similar requirement, so what we did was:
def GenerateDynamicSelectPopulate(model, modelField, selectId):
    # hypothetical example values: model='product',
    # modelField='data.objects[i].productname', selectId='product_select'
    template = """$.ajax({
        type: 'GET',
        async: false,
        url: 'http://127.0.0.1:8000/api/v1/%s/?format=json',
        cache: false,
        accepts: 'application/json',
        success: function(data){
            var options = ''
            for(i = 0; i < data.objects.length; ++i) {
                var str = '<option value="' + data.objects[i].id + '">'+ %s + '</option>'
                options=options+str
            }
            $('#%s').html(options)
        },
        dataType: "json"
    });"""
    # fill the three %s placeholders in order: API resource, option label expression, <select> id
    return template % (model, modelField, selectId)
Here, every '%s' in the code above is replaced with the value you want. Similarly, for an HTML page, build a string containing the HTML code, mark the parts that can change as %s, and provide the values at runtime. That way you can generate an HTML page at runtime.
Good luck.
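The same idea applied to whole HTML pages can be sketched as follows. This is a minimal illustration under assumptions, not the answerer's actual code: the names PAGE, render_page and write_pages are made up, the records are assumed to be (slug, title, content) tuples already read from the database, and string.Template is used in place of %s so that each placeholder is named.

```python
import html
from pathlib import Path
from string import Template

# Page skeleton; $title and $content are filled in for every record.
PAGE = Template("""<!DOCTYPE html>
<html>
<head><title>$title</title></head>
<body>
<h1>$title</h1>
<p>$content</p>
</body>
</html>
""")


def render_page(title, content):
    """Return the full HTML for one page, escaping the database text."""
    return PAGE.substitute(title=html.escape(title), content=html.escape(content))


def write_pages(records, out_dir):
    """Write one static .html file per (slug, title, content) record."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for slug, title, content in records:
        path = out_dir / f"{slug}.html"
        path.write_text(render_page(title, content), encoding="utf-8")
        paths.append(path)
    return paths
```

Running write_pages again whenever the database content changes regenerates the files, which the web server can then serve as ordinary static pages.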
