I am trying to build static pages to display data on my internal website.
I would like to grab data from various text files, so that every time new data is created I just need to run the script again and the page is rebuilt with the new data.
I can't use JS or other runtime languages, since my server only allows static pages, so I opted for Python to build the static pages.
Now the question is: how do I write a script that lets me build such a web page?
All the data I need is 3-4 lines, so the page is not complex. I tried to create an empty page and then modify its content via Python, but it was a disaster; then I figured it would probably be simpler to rebuild the whole page from scratch every time.
To be clear, I am making a simple page with a white background and some text on it, adjusted so it is nice to read; no graphics, no animations, nothing; just pure old-school HTML.
Is there a template to do what I am trying to achieve? Thanks
You mean something like this?
I'm using the yattag library to define the template.
from yattag import Doc

def homepage_content():
    return {
        'text': open('/home/username/texts/homepage_text.txt').read(),
        'title': open('/home/username/texts/homepage_title.txt').read(),
    }

def page_template(content):
    doc, tag, text = Doc().tagtext()
    with tag('html'):
        with tag('head'):
            with tag('title'):
                text(content['title'])
        with tag('body'):
            with tag('div', id='main'):
                text(content['text'])
    return doc.getvalue()

def create_homepage():
    with open('/home/username/www/index.html', "w") as fp:
        fp.write(page_template(homepage_content()))
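If you'd rather not depend on yattag, roughly the same thing works with string.Template from the standard library. This is only a minimal sketch: the template string is a placeholder you can adapt, and the paths are simply reused from the question.

import string

# Stdlib-only sketch; the HTML skeleton here is just an example layout.
PAGE = string.Template("""<html>
<head><title>$title</title></head>
<body><div id="main">$text</div></body>
</html>""")

def create_homepage():
    with open('/home/username/texts/homepage_title.txt') as f:
        title = f.read().strip()
    with open('/home/username/texts/homepage_text.txt') as f:
        text = f.read()
    with open('/home/username/www/index.html', 'w') as out:
        out.write(PAGE.substitute(title=title, text=text))

create_homepage()

Rerunning the script after the text files change regenerates index.html with the new data.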
I'm trying to understand how to automate the checkout process on a Demandware website that uses Adyen checkout.
payload_creditcard = {
    ...
    "dwfrm_billing_paymentMethod": "CREDIT_CARD",
    "dwfrm_billing_creditCardFields_cardType": "Master+card",
    "dwfrm_billing_creditCardFields_adyenEncryptedData": "adyenjs_0_1_18$ibddsadc65...",
    "dwfrm_billing_creditCardFields_cardNumber": "************3345",
    "dwfrm_billing_creditCardFields_expirationMonth": "20",
    "dwfrm_billing_creditCardFields_expirationYear": "2030"
}
This is the script for the payment:
checkout_page = s.get("https://www.slamjam.com/en_IT/checkout-begin?stage=payment#payment",headers=headers)
checkout_card = s.post("https://www.slamjam.com/on/demandware.store/Sites-slamjam-Site/en_IT/CheckoutServices-SubmitPayment",headers=headers, data=payload_creditcard)
place_order = s.get("https://www.slamjam.com/en_IT/checkout-begin?stage=placeOrder#placeOrder",headers=headers)
The problem is that the "dwfrm_billing_creditCardFields_adyenEncryptedData" value changes every time and I don't know how to generate it.
I found JavaScript functions within the website, but to make them work you need an HTML page containing the form with the card inputs, and obviously I can't load an HTML page every time I need this token inside the Python code, because everything is based on speed. Is there any approach you can recommend, or has someone already done this before?
The Adyen client JS intentionally performs per-session, client-side encryption to keep a shopper's card information safe and to keep the company's server out of scope for PCI.
If you really need to test this, then you will need to use something like Selenium WebDriver for Python to actually load the page and render the JS.
I'm working on a project that basically requires me to go to a website, pick a search mode (name, year, number, etc.), search a name, select among the results those with a specific type (filtering, in other words), pick the option to save those results as opposed to emailing them, pick a format to save them in, and then download them by clicking the save button.
My question is: is there a way to do those steps using a Python program? I am only aware of extracting data and downloading pages/images, but I was wondering if there was a way to write a script that would manipulate the data and do what a person would manually do, only for a large number of iterations.
I've thought of looking into the URL structure and finding a way to generate the correct URL for each iteration, but even if that works I'm still stuck because of the "Save" button: I can't find a link that would automatically download the data I want, and using a function from the urllib2 library would download the page but not the actual file I'm after.
Any idea on how to approach this? Any reference/tutorial would be extremely helpful, thanks!
EDIT: When I inspect the save button here is what I get:
Search Button
This would depend a lot on the website you're targeting and how its search is implemented.
For some websites, like Reddit, they have an open API where you can add a .json extension to a URL and get a JSON string response as opposed to pure HTML.
To use a REST API or any JSON response, you can load it as a Python dictionary using the json module, like this:
import json

json_response = '{"customers":[{"name":"carlos", "age":4}, {"name":"jim", "age":5}]}'
rdict = json.loads(json_response)

def print_names(data):
    for entry in data["customers"]:
        print(entry["name"])

print_names(rdict)
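If the JSON comes over HTTP, as with the Reddit trick above, a small sketch with requests looks like this. The subreddit is just an example, and the listing structure shown matches what Reddit currently returns, but verify it against the actual response yourself.

import requests

resp = requests.get(
    "https://www.reddit.com/r/python/top.json",
    headers={"User-Agent": "my-scraper/0.1"},  # Reddit tends to throttle default user agents
)
resp.raise_for_status()
# Listings nest posts under data -> children -> data.
for post in resp.json()["data"]["children"]:
    print(post["data"]["title"])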
You should take a look at the Library of Congress docs for developers. If they have an API, you'll be able to learn about how you can do search and filter through their API. This will make everything much easier than manipulating a browser through something like Selenium. If there's an API, then you could easily scale your solution up or down.
If there's no API, then you have these options:
Use Selenium with a browser (I prefer Firefox).
Try to get as much as possible generated and filtered without actually having to push any buttons on the page, by learning how their search works with GET and POST requests. For example, if you're looking for books within a year range, conduct that search manually and watch how the URL changes. If you're lucky, your search criteria will appear right in the URL, so your program can run a search just by visiting that URL instead of filling out a form and clicking buttons and drop-downs (see the sketch after this list).
If you do have to drive the browser through Selenium (for example, if you want to save the whole page with its HTML, CSS, and JS files, you have to press Ctrl+S and then click the "Save" button), then you need a library that lets you control the keyboard from Python. There are such libraries for Ubuntu; they let you press any key and even key combinations.
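A rough sketch of that second option, assuming the search criteria really do show up as query parameters. Every parameter name and the URL below are made up; copy the real ones from your browser's address bar or network tab.

import requests

params = {
    "q": "some name",      # hypothetical search-term parameter
    "start_year": 1990,    # hypothetical filter parameters
    "end_year": 2000,
}
response = requests.get("https://www.example.com/search", params=params)
response.raise_for_status()
print(response.url)         # the reconstructed search URL
print(response.text[:500])  # beginning of the results page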
An example of what's possible:
I wrote a script that logs me in to a website, then navigates me to some page, downloads specific links on that page, visits every link, saves every page, avoids saving duplicate pages, and avoids getting caught(i.e. it doesn't behave like a bot by for example visiting 100 pages per minute).
The whole thing took 3-4 hours to code, and it ran in a virtual Ubuntu machine on my Mac, which means that while it was doing all that work I could still use my machine. If you don't use a virtual machine, then you'll either have to leave the script running and not interfere with it, or build a much more robust program, which IMO is not worth coding since you can just use a virtual machine.
I wrote a script that scrapes various things from around the web and stores them in a Python list, and I have a few questions about the best way to get it into an HTML table to display on a web page.
First off, should my data be in a list? It will be at most a 25-by-9 list.
I'm assuming I should write the list to a file for the web site to import? Is a text file preferred, or something like a CSV or XML file?
What's the standard way to import a file into a table? In my quick look around the web I didn't see an obvious answer (major web-design beginner here). Is JavaScript the best thing to use? Or can Python write out something that can easily be read by HTML?
Thanks
Store everything in a database, e.g. sqlite, mysql, mongodb, redis ...
Then query the db every time you want to display the data.
This is good if the data will change later or come from multiple sources.
Or store everything in a "flat file": sqlite, xml, json, msgpack.
Again, open and read the file whenever you want to use the data,
or read it in completely on startup.
Simple and often fast enough.
Or generate an HTML file from your list with a template engine, e.g. jinja, and save it as an HTML file (see the sketch below).
Good for simple static hosting.
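A minimal sketch of that template-engine option with Jinja2 (pip install jinja2). The list and file name here are dummy data standing in for your 25-by-9 list.

from jinja2 import Template

rows = [["a", "b", "c"], ["d", "e", "f"]]  # stand-in for the scraped 25x9 list

# One <tr> per row, one <td> per column.
template = Template("""<table>
{% for row in rows %}  <tr>{% for col in row %}<td>{{ col }}</td>{% endfor %}</tr>
{% endfor %}</table>""")

with open("table.html", "w") as f:
    f.write(template.render(rows=rows))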
There are some good Python web frameworks out there; some I've used:
Flask, Bottle, Django, Twisted, Tornado.
They all more or less output HTML.
Feel free to use HTML5/DHTML/JavaScript.
You could use a web framework to create/use an "api" on the backend which serves JSON or XML.
Then your JavaScript callback will display it on your site.
The most direct way to create an HTML table is to loop through your list and print out the rows.
print '<table><tr><th>Column1</th><th>Column2</th>...</tr>'
for row in my_list:
    print '<tr>'
    for col in row:
        print '<td>%s</td>' % col
    print '</tr>'
print '</table>'
Adjust the code as needed for your particular table.
I am trying to grab content from Wikipedia and use the HTML of the article. Ideally I would also like to be able to alter the content (e.g., hide certain infoboxes).
I am able to grab page content using mwclient:
>>> import mwclient
>>> site = mwclient.Site('en.wikipedia.org')
>>> page = site.Pages['Samuel_Pepys']
>>> print page.text()
{{Redirect|Pepys}}
{{EngvarB|date=January 2014}}
{{Infobox person
...
But I can't see a relatively simple, lightweight way to translate this wikicode into HTML using python.
Pandoc is too much for my needs.
I could just scrape the original page using Beautiful Soup but that doesn't seem like a particularly elegant solution.
mwparserfromhell might help in the process, but I can't quite tell from the documentation if it gives me anything I need and don't already have.
I can't see an obvious solution on the Alternative Parsers page.
What have I missed?
UPDATE: I wrote up what I ended up doing, following the discussion below.
page="""<html>
your pretty html here
<div id="for_api_content">%s</div>
</html>"""
Now you can grab your raw content with your API and just call
generated_page = page%api_content
This way you can design any HTML you want and just insert the API content in a designed spot.
Those APIs you are using are designed to return raw content, so it's up to you to decide how that content should be styled and displayed.
UPDATE
Since you showed me the actual output you are dealing with, I realize your dilemma. Luckily for you, there are modules that already parse the wikitext and convert it to HTML for you.
There is one called mwlib that will parse the wiki markup and output HTML, PDF, etc. You can install it with pip using the install instructions. This is probably one of your better options, since it was created in cooperation between the Wikimedia Foundation and PediaPress.
Once you have it installed you can use the writer method to do the dirty work.
def writer(env, output, status_callback, **kwargs): pass
Here are the docs for this module: http://mwlib.readthedocs.org/en/latest/index.html
And you can set attributes on the writer object to set the filetype (HTML, PDF, etc).
writer.description = 'PDF documents (using ReportLab)'
writer.content_type = 'application/pdf'
writer.file_extension = 'pdf'
writer.options = {
    'coverimage': {
        'param': 'FILENAME',
        'help': 'filename of an image for the cover page',
    },
}
I don't know exactly what the rendered HTML looks like, but I would imagine it's close to the actual wiki page. And since it's rendered in code, I'm sure you have control over modifications as well.
I would go with HTML parsing: the page content is reasonably semantic (class="infobox" and such), and there are classes explicitly meant to demarcate content which should not be displayed in alternative views (the first rule of the print stylesheet might be interesting).
That said, if you really want to manipulate wikitext, the best way is to fetch it, use mwparserfromhell to drop the templates you don't like, and then use the parse API to get the modified HTML. Or use the Parsoid API, a partial reimplementation of the parser that returns XHTML/RDFa, which is richer in semantic elements.
At any rate, trying to set up a local wikitext->HTML converter is by far the hardest way you can approach this task.
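A rough sketch of the mwparserfromhell step described above, assuming you already have the wikitext from mwclient; the template name being dropped is just an example.

import mwparserfromhell

wikitext = page.text()  # the mwclient Page object from the question
wikicode = mwparserfromhell.parse(wikitext)

# Drop the templates you don't want rendered, e.g. the infobox.
for template in wikicode.filter_templates():
    if template.name.matches('Infobox person'):
        wikicode.remove(template)

stripped_wikitext = str(wikicode)  # feed this to the parse API to get HTML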
The MediaWiki API contains a (perhaps confusingly named) parse action that in effect renders wikitext into HTML. I find that mwclient's faithful mirroring of the API structure sometimes actually gets in the way. There's a good example of just using requests to call the API to "parse" (i.e. render) a page given its title.
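For the requests route, a minimal sketch against the standard api.php endpoint; this uses the stock parse action with its default JSON output, so double-check the parameters against the API docs for anything fancier.

import requests

resp = requests.get('https://en.wikipedia.org/w/api.php', params={
    'action': 'parse',
    'page': 'Samuel_Pepys',
    'prop': 'text',
    'format': 'json',
})
html = resp.json()['parse']['text']['*']  # rendered HTML for the article body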
I want to save a visited page to disk as a file. I am using urllib and URLopener.
I chose the site http://emma-watson.net/. The file is saved correctly as .html, but when I open it I notice that the main picture at the top, which contains links to other subpages, is not displayed, and neither are some other elements (like the POTD). How can I save the page so that all of it is stored on disk?
import urllib

def saveUrl(url):
    testfile = urllib.URLopener()
    testfile.retrieve(url, "file.html")
    ...

saveUrl("http://emma-watson.net")
[Screenshot of the real page]
[Screenshot of the saved file opened from disk]
What you're trying to do is create a very simple web scraper (that is, you want to find all the links in the file, and download them, but you don't want to do so recursively, or do any fancy filtering or postprocessing, etc.).
You could do this by using a full-on web scraper library like scrapy and just restricting it to a depth of 1 and not enabling anything else.
Or you could do it manually. Pick your favorite HTML parser (BeautifulSoup is easy to use; html.parser is built into the stdlib; there are dozens of other choices). Download the page, then parse the resulting file, scan it for img, a, script, etc. tags with URLs, then download those URLs as well, and you're done.
If you want this all to be stored in a single file, there are a number of "web archive file" formats that exist, and different browsers (and other tools) support different ones. The basic idea of most of them is that you create a zipfile with the files in some specific layout and some extension like .webarch instead of .zip. That part's easy. But you also need to change all the absolute links to be relative links, which is a little harder. Still, it's not that hard with a tool like BeautifulSoup or html.parser or lxml.
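A rough sketch of that link-rewriting step with BeautifulSoup (Python 2, to match the code below); it only touches a few tag/attribute pairs, and the local naming scheme is made up.

import urlparse
import bs4

def rewrite_links(html, base_url):
    soup = bs4.BeautifulSoup(html)
    for tag, attr in (('img', 'src'), ('script', 'src'), ('a', 'href')):
        for node in soup(tag):
            if node.get(attr):
                absolute = urlparse.urljoin(base_url, node[attr])
                # Point the saved copy at the locally downloaded files.
                node[attr] = 'file.html_files' + urlparse.urlparse(absolute).path
    return str(soup)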
As a side note, if you're not actually using the URLopener for anything, you're making life harder for yourself for no good reason; just use urlopen. Also, as the docs mention, you should be using urllib2, not urllib; in fact urllib.urlopen is deprecated as of 2.6. And, even if you do need to use an explicit opener, as the docs say, "Unless you need to support opening objects using schemes other than http:, ftp:, or file:, you probably want to use FancyURLopener."
Here's a simple example (enough to get you started, once you decide exactly what you do and don't want) using BeautifulSoup:
import os
import urllib2
import urlparse

import bs4

def saveUrl(url):
    # Save the page itself.
    page = urllib2.urlopen(url).read()
    with open("file.html", "wb") as f:
        f.write(page)
    # Parse the downloaded HTML and fetch every image it references.
    soup = bs4.BeautifulSoup(page)
    for img in soup('img'):
        imgurl = img['src']
        imgpath = 'file.html_files/' + urlparse.urlparse(imgurl).path
        if not os.path.isdir(os.path.dirname(imgpath)):
            os.makedirs(os.path.dirname(imgpath))
        with open(imgpath, "wb") as f:
            f.write(urllib2.urlopen(imgurl).read())

saveUrl("http://emma-watson.net")
This code won't work if there are any images with relative links. To handle that, you need to call urlparse.urljoin to attach a base URL. And, since the base URL can be set in various different ways, if you want to handle every page anyone will ever write, you will need to read up on the documentation and write the appropriate code. It's at this point that you should start looking at something like scrapy. But, if you just want to handle a few sites, just writing something that works for those sites is fine.
Meanwhile, if any of the images are loaded by JavaScript after page-load time—which is pretty common on modern websites—nothing will work, short of actually running that JavaScript code. At that point, you probably want a browser automation tool like Selenium or a browser simulator tool like Mechanize+PhantomJS, not a scraper.
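If it comes to that, a minimal Selenium sketch for grabbing the JavaScript-rendered DOM; Firefox/geckodriver is just one choice of browser, and the output file name is arbitrary.

import io
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://emma-watson.net")
rendered_html = driver.page_source  # the DOM after scripts have run
driver.quit()

with io.open("rendered.html", "w", encoding="utf-8") as f:
    f.write(rendered_html)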