Parse variable from javascript like BeautifulSoup parses HTML - python

I do not need a screen scraper.
I just want to extract a variable from static JS, e.g. this piece of JavaScript:
$(document).ready(function(){
    has_map = true;
    new hip.Events({
        has_date_picker: false,
        has_big_button: true,
        has_map: has_map,
        map_selector: 'event-map',
        coordinates: 'POINT (-79.4004249999999985 43.6524020000000021)',
        marker_icon: 'http://s3.amazonaws.com/imageupload/static/new_media/images/map-marker.png'
    });
});
Let's say I want the value for the key marker_icon (the URL in this example). Is there a Python library for this, or should I do it the ugly way, with a regex?
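For a single, simply quoted value like this one, a regular expression may be enough. The following is a minimal sketch, not a general JavaScript parser: the helper name extract_js_string is made up for illustration, and it assumes the value is a plain quoted string with no escaped quotes inside it.

```python
import re

# Sample input: part of the static JavaScript from the question, held as a string.
JS_SOURCE = """
$(document).ready(function(){
    has_map = true;
    new hip.Events({
        has_date_picker: false,
        map_selector: 'event-map',
        marker_icon: 'http://s3.amazonaws.com/imageupload/static/new_media/images/map-marker.png'
    });
});
"""


def extract_js_string(source, key):
    """Return the quoted string assigned to `key` in an object literal, or None.

    Hypothetical helper: handles only `key: '...'` or `key: "..."` with no
    escaped quotes inside the value.
    """
    # The backreference \1 requires the closing quote to match the opening one.
    pattern = r"\b" + re.escape(key) + r"""\s*:\s*(['"])(.*?)\1"""
    match = re.search(pattern, source)
    return match.group(2) if match else None


marker_icon = extract_js_string(JS_SOURCE, "marker_icon")
print(marker_icon)
```

If the script grows more complex (nested objects, escaped quotes, computed values), a real JavaScript parser is likely a better fit than a pattern like this.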

Related

How can I enable PDF page breaks from HTML, maybe using a marker in the source HTML file?

I am using pdfkit to create a PDF from a HTML file... like so:
import pdfkit
pdfkit.from_file([source], target + '.pdf')
I create the HTML file myself before doing this conversion.
What I'm now trying to do is find a way to implement a page break.
The HTML file doesn't use page breaks because ... well, it's basic html.
But PDFs are page-based structures.
So how can I pick up something in the HTML as a marker, and then use that to implement a page break in the PDF?
Of course pdfkit.from_file([source], target + '.pdf') is a simple single line... there's no parsing of the content..... so I don't see how I could tell it what to look for.
Any ideas?
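Since the HTML file is generated before conversion, one possible approach is to rewrite a marker into a page-break element at that stage. This is only a sketch: the marker comment and the function name below are invented for illustration, and whether the converter honors the inline page-break style should be checked against your own output.

```python
# Hypothetical marker; any unique string that never occurs in real content works.
PAGE_BREAK_MARKER = "<!-- pagebreak -->"

# An empty block whose only job is to request a new page in paged output.
PAGE_BREAK_HTML = '<div style="page-break-before: always;"></div>'


def insert_page_breaks(html):
    """Replace every marker comment with an element that requests a page break."""
    return html.replace(PAGE_BREAK_MARKER, PAGE_BREAK_HTML)


sample = "<h2>Part 1</h2><p>...</p><!-- pagebreak --><h2>Part 2</h2>"
print(insert_page_breaks(sample))
```

The rewritten string can then be saved to the HTML file that is handed to the converter.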
EDIT
With some advice from @Nathaniel below, I've added this to my CSS:
@media print {
    h2 {
        page-break-before: always;
    }
}
But I don't see pdfkit.from_file([source], target + '.pdf') picking it up?
Opening the HTML file in the browser and printing to PDF works perfectly, so this is more of a pdfkit issue.
Found a similar question here:
How to insert a page break in HTML so wkhtmltopdf parses it?
I think the pdfkit wrapper for wkhtmltopdf is limited.
On the command line, this works perfectly:
wkhtmltopdf --print-media-type 10100005.html 10100005.pdf
But how do I replicate that in Python? It's not my first choice to do an os.execute... :/
After some fiddling, this worked for me. I'm putting this here to help the next person.
Thanks @Nathaniel Flick for pointing me to use media print and print-only styles.
Example 11 on this page also helped
https://www.programcreek.com/python/example/100586/pdfkit.from_file
In the style sheet
@media print {
    h2 {
        page-break-before: always;
    }
}
Then in the python code
pdfkit_options = {
'print-media-type': '',
}
>>> print (source)
c:/users/maxcot/desktop/Reports/10100001.html
>>> print (target)
c:/users/maxcot/desktop/Reports/10100001.pdf
>>> print (pdfkit_options)
{'print-media-type': ''}
pdfkit.from_file(source, target, options=pdfkit_options)
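To see how the options dictionary above lines up with the command line that worked, here is a small illustrative helper. It is not pdfkit's own code: options_to_argv is a hypothetical function that assumes an empty string value stands for a bare switch, which is how the 'print-media-type': '' entry is used above.

```python
def options_to_argv(options):
    """Turn an options dict into wkhtmltopdf-style command-line arguments.

    Hypothetical helper for illustration; it does not call pdfkit.
    """
    argv = []
    for name, value in options.items():
        argv.append("--" + name)
        # An empty value is treated as a bare switch with no argument.
        if value != "":
            argv.append(str(value))
    return argv


pdfkit_options = {"print-media-type": ""}
command = ["wkhtmltopdf"] + options_to_argv(pdfkit_options) + ["10100005.html", "10100005.pdf"]
print(" ".join(command))
```

The printed line matches the command that was run by hand earlier, which is a quick way to check that the dictionary expresses the intended flags.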

Python + webapp2 + Modify the URL without reloading the page [duplicate]

Is there a way I can modify the URL of the current page without reloading the page?
I would like to access the portion before the # hash if possible.
I only need to change the portion after the domain, so it's not like I'm violating cross-domain policies.
window.location.href = "www.mysite.com/page2.php"; // this reloads
This can now be done in Chrome, Safari, Firefox 4+, and Internet Explorer 10pp4+!
See this question's answer for more information:
Updating address bar with new URL without hash or reloading the page
Example:
function processAjaxData(response, urlPath){
document.getElementById("content").innerHTML = response.html;
document.title = response.pageTitle;
window.history.pushState({"html":response.html,"pageTitle":response.pageTitle},"", urlPath);
}
You can then use window.onpopstate to detect the back/forward button navigation:
window.onpopstate = function(e){
if(e.state){
document.getElementById("content").innerHTML = e.state.html;
document.title = e.state.pageTitle;
}
};
For a more in-depth look at manipulating browser history, see this MDN article.
HTML5 introduced the history.pushState() and history.replaceState() methods, which allow you to add and modify history entries, respectively.
window.history.pushState('page2', 'Title', '/page2.php');
Read more about this from here
You can also use HTML5 replaceState if you want to change the url but don't want to add the entry to the browser history:
if (window.history.replaceState) {
//prevents browser from storing history with each change:
window.history.replaceState(statedata, title, url);
}
This would 'break' the back button functionality. This may be required in some instances such as an image gallery (where you want the back button to return back to the gallery index page instead of moving back through each and every image you viewed) whilst giving each image its own unique url.
Here is my solution (newUrl is your new URL which you want to replace with the current one):
history.pushState({}, null, newUrl);
NOTE: If you are working with an HTML5 browser then you should ignore this answer. This is now possible as can be seen in the other answers.
There is no way to modify the URL in the browser without reloading the page. The URL represents what the last loaded page was. If you change it (document.location) then it will reload the page.
One obvious reason being, you write a site on www.mysite.com that looks like a bank login page. Then you change the browser URL bar to say www.mybank.com. The user will be totally unaware that they are really looking at www.mysite.com.
parent.location.hash = "hello";
In modern browsers and HTML5, there is a method called pushState on window history. That will change the URL and push it to the history without loading the page.
You can use it like this; it takes three parameters: 1) a state object, 2) a title, and 3) a URL:
window.history.pushState({page: "another"}, "another page", "example.html");
This will change the URL, but not reload the page. Also, it doesn't check if the page exists, so if you do some JavaScript code that is reacting to the URL, you can work with them like this.
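Because pushState does not check that the page exists, the server still has to answer for the pushed URL if the user reloads or shares it. Below is a minimal sketch of that idea as a plain WSGI callable using only the standard library. The set of paths and the HTML shell are assumptions made for illustration; a webapp2 application would express the same thing with its own route definitions.

```python
# Paths the client is assumed to push with history.pushState(); hypothetical values.
KNOWN_PATHS = {"/", "/page2.php", "/example.html"}

# The same page shell is served for every known path; client-side code fills it in.
SHELL = b"<!doctype html><title>App</title><div id='content'></div>"


def app(environ, start_response):
    """Serve the page shell for any known pushed path, 404 otherwise."""
    path = environ.get("PATH_INFO", "/")
    if path in KNOWN_PATHS:
        start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
        return [SHELL]
    start_response("404 Not Found", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"Not found"]
```

The callable can be served with wsgiref.simple_server.make_server from the standard library for local experiments.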
Also, there is history.replaceState(), which does exactly the same thing, except that it modifies the current history entry instead of creating a new one!
You can also create a function that checks whether history.pushState exists, then carry on with the rest like this:
function goTo(page, title, url) {
if ("undefined" !== typeof history.pushState) {
history.pushState({page: page}, title, url);
} else {
window.location.assign(url);
}
}
goTo("another page", "example", 'example.html');
Also, you can change the # fragment in pre-HTML5 browsers, which won't reload the page. That is how Angular does SPA routing based on the hash...
Changing # is quite easy, doing like:
window.location.hash = "example";
And you can detect it like this:
window.onhashchange = function () {
console.log("#changed", window.location.hash);
}
The HTML5 replaceState is the answer, as already mentioned by Vivart and geo1701. However it is not supported in all browsers/versions.
History.js wraps HTML5 state features and provides additional support for HTML4 browsers.
Before HTML5 we can use:
parent.location.hash = "hello";
and:
window.location.replace("http://www.example.com");
The latter method will reload your page, but HTML5 introduced history.pushState(page, caption, replace_url), which should not reload your page.
If what you're trying to do is allow users to bookmark/share pages, and you don't need it to be exactly the right URL, and you're not using hash anchors for anything else, then you can do this in two parts: you use the location.hash approach discussed above, and then implement a check on the home page that looks for a URL with a hash anchor in it and redirects to the corresponding page.
For instance:
User is on www.site.com/section/page/4
User does some action which changes the URL to www.site.com/#/section/page/6 (with the hash). Say you've loaded the correct content for page 6 into the page, so apart from the hash the user is not too disturbed.
User passes this URL on to someone else, or bookmarks it
Someone else, or the same user at a later date, goes to www.site.com/#/section/page/6
Code on www.site.com/ redirects the user to www.site.com/section/page/6, using something like this:
if (window.location.hash.length > 0){
window.location = window.location.hash.substring(1);
}
Hope that makes sense! It's a useful approach for some situations.
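The URL mapping in that last step can be written out in Python for anyone who wants to test it or process shared links offline. This is only an illustration of the transformation: browsers never send the fragment to the server, so the redirect itself still has to happen in client-side JavaScript. The function name is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def hash_to_path_url(url):
    """Map http://host/#/a/b to http://host/a/b; return url unchanged if no fragment."""
    parts = urlsplit(url)
    if not parts.fragment:
        return url
    # Use the fragment as the new path and drop the fragment itself.
    return urlunsplit((parts.scheme, parts.netloc, parts.fragment, "", ""))


print(hash_to_path_url("http://www.site.com/#/section/page/6"))
```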
Below is the function to change the URL without reloading the page. It is only supported for HTML5.
function ChangeUrl(page, url) {
if (typeof (history.pushState) != "undefined") {
var obj = {Page: page, Url: url};
history.pushState(obj, obj.Page, obj.Url);
} else {
window.location.href = "homePage";
// alert("Browser does not support HTML5.");
}
}
ChangeUrl('Page1', 'homePage');
You can use this beautiful and simple function to do so anywhere on your application.
function changeurl(url, title) {
var new_url = '/' + url;
window.history.pushState('data', title, new_url);
}
You can not only edit the URL but you can update the title along with it.
Any change of the location (either window.location or document.location) will cause a request to that new URL, unless you're only changing the URL fragment. If you change the URL, you change the URL.
Use server-side URL rewrite techniques like Apache’s mod_rewrite if you don’t like the URLs you are currently using.
You can add anchor tags. I use this on my site so that I can track with Google Analytics what people are visiting on the page.
I just add an anchor tag and then the part of the page I want to track:
var trackCode = "/#" + urlencode($("myDiv").text());
window.location.href = "http://www.piano-chords.net" + trackCode;
pageTracker._trackPageview(trackCode);
As pointed out by Thomas Stjernegaard Jeppesen, you could use History.js to modify URL parameters whilst the user navigates through your Ajax links and apps.
Almost a year has passed since that answer, and History.js has grown and become more stable and cross-browser. It can now be used to manage history states in HTML5-compliant browsers as well as in many HTML4-only browsers. In this demo you can see an example of how it works (and try out its functionality and limits).
Should you need any help using and implementing this library, I suggest you take a look at the source code of the demo page: you will see it's very easy to do.
Finally, for a comprehensive explanation of what can be the issues about using hashes (and hashbangs), check out this link by Benjamin Lupton.
Use history.pushState() from the HTML 5 History API.
Refer to the HTML5 History API for more details.
Your new url.
let newUrlIS = window.location.origin + '/user/profile/management';
In a sense, calling pushState() is similar to setting window.location = "#foo", in that both will also create and activate another history entry associated with the current document. But pushState() has a few advantages:
history.pushState({}, null, newUrlIS);
You can check out the root: https://developer.mozilla.org/en-US/docs/Web/API/History_API
This code works for me. I used it in my application with Ajax.
history.pushState({ foo: 'bar' }, '', '/bank');
Once a page is loaded into an element (selected by ID) using Ajax, this call changes the browser URL without reloading the page.
This is the Ajax function:
function showData(){
$.ajax({
type: "POST",
url: "Bank.php",
data: {},
success: function(html){
$("#viewpage").html(html).show();
$("#viewpage").css("margin-left","0px");
}
});
}
Example: From any page or controller such as "Dashboard", when I click on the bank, it loads the bank list using the Ajax code without reloading the page. At this point, the browser URL is not changed.
history.pushState({ foo: 'bar' }, '', '/bank');
But when I use this code inside the Ajax success callback, it changes the browser URL without reloading the page.
This is the full Ajax code:
function showData(){
$.ajax({
type: "POST",
url: "Bank.php",
data: {},
success: function(html){
$("#viewpage").html(html).show();
$("#viewpage").css("margin-left","0px");
history.pushState({ foo: 'bar' }, '', '/bank');
}
});
}
This is all you will need to navigate without reloading:
// add #setting without reload
location.hash = "setting";
// if the URL hash changes, do something
window.addEventListener('hashchange', () => {
    console.log('url hash changed!');
});
// if the URL changes, do something (does not detect hash changes)
//window.addEventListener('locationchange', function(){
// console.log('url changed!');
//})
// remove #setting without reload
history.back();
Simply use, it will not reload the page, but just the URL :
$('#form_name').attr('action', '/shop/index.htm').submit();

Extract string from <script> - BeautifulSoup python

I'm trying to create a Python script to extract some information from a webmail site. I want to follow a redirect.
My code :
br1 = mechanize.Browser()
br1.set_handle_robots(False)
br1.set_cookiejar(cj)
br1.open("LOGIN URL")
br1.select_form(nr=0)
br1.form['username'] = mail_site
br1.form['password'] = pw_site
res1 = br1.submit()
html = res1.read()
print html
The result is not what I expect.
It contains only a redirection script.
I've seen that I have to extract the information from this script to follow the redirection.
So, in my case, I have to extract the jsessionid from the script.
The script is :
<script>
function redir(){
window.self.location.replace('/webmail/en_EN/continue.html;jsessionid=1D5QS4DA6C148DC4C14QS4CS5.1FDS5F4DSV1A64DA5DA?MESSAGE=NO_COOKIE&DT=1&URL_VALID=welcome.html');
return true;
}
</script>
If I'm not wrong, I have to build a regex.
I've tried many things but got no results.
Does anyone have an idea?
import re
# script_ holds the text of the redirect <script> shown above
get_jsession = re.search(r'jsessionid=([A-Za-z0-9.]+)', script_)
print(get_jsession.group(1))
# prints: 1D5QS4DA6C148DC4C14QS4CS5.1FDS5F4DSV1A64DA5DA
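If the goal is to follow the redirect rather than just read the session id, it may be simpler to pull the whole target out of location.replace(...) and resolve it against the login URL. The sketch below assumes the script text is available as a string; BASE_URL and the function name are placeholders invented for illustration.

```python
import re
from urllib.parse import urljoin

# The redirect script from the question, held as a string.
script_ = """
<script>
function redir(){
window.self.location.replace('/webmail/en_EN/continue.html;jsessionid=1D5QS4DA6C148DC4C14QS4CS5.1FDS5F4DSV1A64DA5DA?MESSAGE=NO_COOKIE&DT=1&URL_VALID=welcome.html');
return true;
}
</script>
"""

# Placeholder for the real login URL, which the question does not give.
BASE_URL = "https://webmail.example.com/login"


def redirect_target(html, base_url):
    """Return the absolute URL passed to location.replace(), or None if absent."""
    match = re.search(r"location\.replace\(\s*['\"]([^'\"]+)['\"]\s*\)", html)
    if match is None:
        return None
    # Resolve the site-relative path against the page that served the script.
    return urljoin(base_url, match.group(1))


target = redirect_target(script_, BASE_URL)
print(target)
```

The resulting URL could then be opened with the same browser object so that the cookies from the login are kept.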

Webscraping to download unique PDF files

I have to routinely download over 300 PDFs from over 150 websites once every quarter, and I've been starting to think there has to be a way to automate this using Python. These PDFs are released on a quarterly basis and detail the performance of mutual funds over the previous quarter. 90% of the time, these PDFs are called 'quarterly commentary' or 'commentary', so what I want to do is write a Python script that searches the fund-specific URL (e.g. https://www.pimco.com/investments/mutual-funds/total-return-fund/inst) for the keyword 'commentary', finds the link, and then downloads the resulting PDF file.
I would also like to name the downloaded file after the proper mutual fund name. What I have been working from is an Excel spreadsheet: column A holds the proper mutual fund name and column B holds the mutual fund URL.
Would this be possible?
Personally, I find it easier to use CasperJS and PhantomJS to download files from external websites because you can inject JavaScript code into the page to grab the elements you need.
Here is the casperjs documentation
And here is some code that I've written to download lectures from my professor's webpage to my desktop:
var casper = require('casper').create({verbose: true , logLevel: "debug" });
var url = "https://www.cs.rit.edu/~ib/Classes/CSCI264_Fall16-17/assignments.html";
var fs = require('fs');
casper.start(url);
var elements;
casper.then(function(){
elements = this.evaluate(function(){
var pdfs = document.querySelectorAll('body ul li a');
return Array.prototype.map.call(pdfs, function(e) {
return e.getAttribute('href');
});
});
for(var i = 0; i < elements.length; ++i){
var url = "" + elements[i] + "";
if(url.indexOf('pdf') !== -1){
var file = fs.absolute(url.substring(url.lastIndexOf("/")+1, url.length));
this.download(url, file);
}
}
});
casper.run(function() {
this.echo('Done.').exit();
});
Of course, if you're dead set on using Python, then disregard this completely. Otherwise, good luck with your CasperJS script.
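For readers who would rather stay in Python, the link-finding step can be sketched with the standard library alone. This assumes the link is present in the page's static HTML (pages that build their links with JavaScript would need a browser-driven tool), and it leaves out the download and the spreadsheet handling. The sample HTML and page URL are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_commentary_links(html, page_url, keyword="commentary"):
    """Return absolute URLs of links whose visible text contains the keyword."""
    parser = LinkCollector()
    parser.feed(html)
    return [urljoin(page_url, href) for href, text in parser.links
            if keyword in text.lower()]


# Made-up sample page and URL, standing in for a real fund page.
SAMPLE_HTML = """
<ul>
  <li><a href="/docs/fund-factsheet.pdf">Fact Sheet</a></li>
  <li><a href="/docs/q3-commentary.pdf">Quarterly Commentary</a></li>
</ul>
"""
PAGE_URL = "https://funds.example.com/total-return-fund"

found = find_commentary_links(SAMPLE_HTML, PAGE_URL)
print(found)
```

Each URL found this way could then be fetched and saved under the fund name taken from column A of the spreadsheet.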

create static html pages dynamically cos of SEO

I am trying to work out how to create static HTML pages dynamically.
This is because I have read that dynamic content is not Google-friendly: Google cannot crawl content that comes from a database once the page is opened.
A concrete example:
{{ content_from_db }}
This variable is replaced with a long text that contains many of the page's important keywords. I read that this content is unfortunately not seen by Google since it is dynamic.
So I thought, let's create static HTML pages dynamically, but I am stuck, not knowing how.
Is it possible?
Your premise is completely and utterly false. It is absolutely not the case that Google can't index dynamically created websites. Of course it can: StackOverflow, which has awesome SEO, is just one of the millions of dynamic websites indexed by Google.
Yes, it's possible. I will give you a short example that you can build on. We had a similar requirement, so what we did was:
def GenerateDynamicSelectPopulate(model, modelField, selectId):
    # model: API resource name, e.g. 'product'
    # modelField: JS expression for the option label, e.g. 'data.objects[i].productname'
    # selectId: id of the <select> element to fill (third parameter added here,
    # since the template has three placeholders)
    template = """$.ajax({
    type: 'GET',
    async: false,
    url: 'http://127.0.0.1:8000/api/v1/%s/?format=json',
    cache: false,
    accepts: 'application/json',
    success: function(data){
        var options = ''
        for(i = 0; i < data.objects.length; ++i) {
            var str = '<option value="' + data.objects[i].id + '">'+ %s + '</option>'
            options=options+str
        }
        $('#%s').html(options)
    },
    dataType: "json"
});"""
    # Fill the three %s placeholders in order.
    return template % (model, modelField, selectId)
Here every '%s' in the code above is replaced with the value you want. Similarly, for an HTML page you make a string holding the HTML code, mark the parts that can change with %s, and provide the values at runtime; that way you can build the HTML page at runtime.
Good luck.
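The same fill-in-the-blanks idea, applied to a whole HTML page, might look like the following sketch. It uses string.Template from the standard library instead of %s so the placeholders are named; the template text and function name are invented for illustration, and content_from_db simply mirrors the variable name used in the question.

```python
import html
from string import Template

# Hypothetical page layout; $title and $content_from_db are filled in at build time.
PAGE_TEMPLATE = Template("""<!doctype html>
<html>
<head><title>$title</title></head>
<body>
<h1>$title</h1>
<p>$content_from_db</p>
</body>
</html>
""")


def render_page(title, content_from_db):
    """Return a complete static HTML page with the values escaped and filled in."""
    return PAGE_TEMPLATE.substitute(
        title=html.escape(title),
        content_from_db=html.escape(content_from_db),
    )


page = render_page("Blue Widget", "Long description with keywords & details.")
print(page)
```

The returned string can be written to a .html file once per database record, which gives a static page for each item.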
