System: Dell, Windows 7
Setup: PyCharm 5.0.2 + anaconda 2.7.11-0
Limitations: My preference would be to use a Mac, but my work requires me to use a Windows machine. I am familiar with Unix/Linux terminals but unfamiliar with the Windows command prompt.
Problem: I would like to open multiple webpages that run a Bing (or Google) search on a list of keywords. (I have entered a new field and need to look up hundreds of unknown terms. I am too lazy to do each search manually.)
My attempts:
I have not been successful in finding an application that does Bing or Google searches in batch mode. Thus, I have decided to write my own script.
I recalled seeing that webpages could be rendered directly in a Jupyter notebook with HTML() from IPython.display:
# import modules
from IPython.display import HTML
# create the list of keywords
keys = ['grumpy+cat','garfield+cat','basement+cat']
# loop through each key and display the search webpage
for k in keys:
    HTML('<iframe src=https://www.google.com/search?q='+k+' width=700 height=350></iframe>')
Alas, nothing happens. I did check this:
HTML('<iframe src=https://www.google.com/search?q='+keys[0]+' width=700 height=350></iframe>')
And that seems to be fine, but I need automation.
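(A likely explanation: inside a loop, HTML() merely constructs an object, and the notebook only auto-renders the last expression of a cell. Below is a minimal sketch using IPython.display.display to render each iframe explicitly, assuming a Jupyter/IPython notebook; whether Google permits its results page to be embedded in an iframe is a separate question.)

from IPython.display import HTML, display

keys = ['grumpy+cat', 'garfield+cat', 'basement+cat']
for k in keys:
    # display() renders each object explicitly instead of relying on
    # the notebook auto-displaying a cell's final expression
    display(HTML('<iframe src="https://www.google.com/search?q=' + k +
                 '" width=700 height=350></iframe>'))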
I moved on to a combination of an input file, a Python script that generates the commands, and a batch script (run in the command prompt, the Windows equivalent of a shell script in bash or tcsh).
Contents of 'search_list.txt':
grumpy+cat
garfield+cat
basement+cat
Contents of 'batch_search.py':
# Do a batch Bing search in Firefox using a keyword search list
# Import statements
import numpy as np
# Read in the data
keys = np.loadtxt("search_list.txt", dtype='str')
# Output to a batch file; the with-block closes the file so the
# batch script is fully written to disk before it is run
with open('./batch_search.bat', 'w+') as f1:
    for k in keys:
        command = "start firefox http://www.bing.com/search?q=" + k + "\r\n"
        f1.write(command)
# Why doesn't this line work?
#%%!batch_search.bat
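(Aside: that commented-out last line is IPython shell-escape/magic syntax, which only works interactively in an IPython session, not in a script run this way. A minimal plain-Python alternative, assuming the .bat file sits in the working directory, might be:)

import subprocess

# Run the freshly generated batch file from within the script
subprocess.call('batch_search.bat', shell=True)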
Contents of 'run_batch_search.bat':
ipython batch_search.py
batch_search.bat
Finally, run in command prompt using:
run_batch_search.bat
What now?:
The above does open multiple tabs in Firefox pointed at the Bing search results for the input keywords, but I would like this to be a bit more streamlined.
Ideally, I would be able to open the web browser pointed at the Google/Bing search directly from the Python script (most preferable). Solutions?
Otherwise, how can I properly format the last line of 'batch_search.py' to get the call to the command prompt to work?
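(On the first, direct-from-Python question, one possibility is a sketch using the standard library's webbrowser module; note that it drives the system default browser rather than Firefox specifically:)

import webbrowser

keys = ['grumpy+cat', 'garfield+cat', 'basement+cat']
for k in keys:
    # Opens each search in a new browser tab, no batch file needed
    webbrowser.open_new_tab('http://www.bing.com/search?q=' + k)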
More Information for #Kris (see comments below): The main goal is to look up hundreds of unknown terms via web searches for self-education. For example, if I look up https://www.google.com/search?q=garfield+cat in Firefox, the Wikipedia result for Garfield the cat pops up in the right-hand column of the page. If, for one of my own search terms, the Bing popup results appear to be accurate, then I retain that as my reference information and move on to the results for the next keyword. If it does not appear to be a good match, then I must continue reading the search results and following links until I find an appropriate match.
Related
I am building a website which accepts a PDF from users along with options, like pages to print, number of copies, color or black & white, and the shop from which they want to get it printed.
The PDF will be stored on the server and passed on to the shop to print. How do I get it printed automatically with those options applied? One way I thought of was to edit the PDF and send it to the store to print with the options applied.
How do I print the PDF automatically and report back to the server that the PDF was printed?
I chose Python as it may have an easy implementation.
By the way, I'll build the website using NodeJS.
You can do the following:
import os
os.startfile("C:/Users/TestFile.txt", "print")
This will start the file in its default opener with the verb 'print', which will print to your default printer. It only requires the os module, which comes with the standard library.
This only works on Windows, so if you want it to work on other OSes you'll need a way to detect which OS the PDF is being sent to.
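(A minimal sketch of such a branch, assuming os.startfile on Windows and the CUPS lp command on macOS/Linux:)

import os
import platform
import subprocess

def print_file(path):
    # Send a file to the default printer, branching on the operating system
    if platform.system() == 'Windows':
        os.startfile(path, 'print')    # same call as above
    else:
        subprocess.call(['lp', path])  # assumes the CUPS 'lp' command exists

print_file('C:/Users/TestFile.txt')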
I'm working on a project that basically requires me to go to a website, pick a search mode (name, year, number, etc.), search a name, select among the results those with a specific type (filtering, in other words), pick the option to save those results as opposed to emailing them, pick a format to save them in, and then download them by clicking the save button.
My question is, is there a way to do those steps using a Python program? I am only aware of extracting data and downloading pages/images, but I was wondering if there was a way to write a script that would manipulate the data, and do what a person would manually do, only for a large number of iterations.
I've thought of looking into the URL structures and finding a way to generate the accurate URL for each iteration, but even if that works, I'm still stuck because of the "Save" button: I can't find a link that would automatically download the data I want, and using a function from the urllib2 library would download the page but not the actual file that I want.
Any idea on how to approach this? Any reference/tutorial would be extremely helpful, thanks!
EDIT: When I inspect the save button, here is what I get:
Search Button
This will depend a lot on the website you're targeting and how its search is implemented.
Some websites, like Reddit, have an open API where you can add a .json extension to a URL and get a JSON string response instead of pure HTML.
To use a REST API or any JSON response, you can load it as a Python dictionary using the json module, like this:
import json
json_response = '{"customers":[{"name":"carlos", "age":4}, {"name":"jim", "age":5}]}'
rdict = json.loads(json_response)
def print_names(data):
    for entry in data["customers"]:
        print(entry["name"])
print_names(rdict)
You should take a look at the Library of Congress docs for developers. If they have an API, you'll be able to learn how to search and filter through it. This will make everything much easier than manipulating a browser through something like Selenium. If there's an API, then you can easily scale your solution up or down.
If there's no API, then you have to:
Use Selenium with a browser (I prefer Firefox).
Try to get as much as possible generated, filtered, etc. without actually having to push any buttons on that page, by learning how their search engine works with GET and POST requests (see the sketch after this list). For example, if you're looking for books within a range, conduct the search manually once and watch how the URL changes. If you're lucky, you'll see that your search criteria are in the URL. Using this info you can conduct a search just by visiting that URL, which means your program won't have to fill out a form and push buttons, drop-downs, etc.
If you have to drive the browser through Selenium (for example, if you want to save the whole page with HTML, CSS, and JS files, you have to press Ctrl+S and then click the "Save" button), then you need libraries that let you manipulate the keyboard from Python. Such libraries exist for Ubuntu; they allow you to press any key and even do key combinations.
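(A minimal sketch of the second point, with a hypothetical endpoint and parameter names; the real ones come from watching how the site's URL changes during a manual search. Written for Python 2 to match the urllib2 mentioned in the question:)

import urllib
import urllib2

# Hypothetical search endpoint and GET parameters
params = urllib.urlencode({'q': 'some name', 'year': '1999'})
response = urllib2.urlopen('http://www.example.com/search?' + params)
html = response.read()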
An example of what's possible:
I wrote a script that logs me in to a website, navigates to a page, downloads specific links from that page, visits every link, saves every page, avoids saving duplicate pages, and avoids getting caught (i.e., it doesn't behave like a bot by, for example, visiting 100 pages per minute).
The whole thing took 3-4 hours to code, and it actually ran in a virtual Ubuntu machine on my Mac, which means that while it was doing all that work I could still use my machine. If you don't use a virtual machine, you'll either have to leave the script running without interfering with it, or write a much more robust program, which IMO is not worth coding since you can just use a virtual machine.
Using an iPad, I'm attempting to import a text file from the Internet in order to use it in the Python MOOC exercise "hangman" from edX:
For this problem, you will need the code files ps3_hangman.py and words.txt. Right-click on each and hit "Save Link As". Be sure to save them in the same directory. Open and run the file ps3_hangman.py without making any modifications to it, in order to ensure that everything is set up correctly.
Thing is, these are not easy options on an iPad. I managed to copy and paste the hangman.py file into a new Pythonista file, but...
How do I handle the large text file?
Where do I store it as a text file, find it, and then import it into this iPad program?
This would be no problem on Windows, but Apple does not allow a file.open() type operation.
One way you could do this if you don't have access to a Mac/PC, i.e. entirely on your iPad in Pythonista:
Copy the URL of the text file in Safari (tap and hold the link)
In Pythonista, switch to the interactive prompt (swipe from right to left)
Enter the following two lines:
import urllib
urllib.urlretrieve('<paste copied url here>', 'words.txt')
You could also write these two lines in a regular script instead of using the interactive prompt. But you'll probably just need this once.
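(If your copy of Pythonista runs Python 3 rather than Python 2, the same function lives under urllib.request; this is a minor assumption about the Pythonista version:)

import urllib.request
urllib.request.urlretrieve('<paste copied url here>', 'words.txt')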
We test an application developed in house using a Python test suite which accomplishes web navigation/interaction through Selenium WebDriver. A tricky part of our web testing is dealing with a series of pdf reports in the app. We are testing a planned upgrade of Firefox from v3.6 to v16.0.1, and it turns out that the way we captured reports before no longer works, because of changes in the directory structure of Firefox's temp folder. I didn't write the original pdf-capturing code, but I will refactor it for whatever we end up using with v16.0.1, so I was wondering if there's a better way to save a pdf using Python's Selenium WebDriver bindings than what we're currently doing.
Previously, for Firefox v3.6, after clicking a link that generates a report, we would scan the "C:\Documents and Settings\\Local Settings\Temp\plugtmp" directory for a pdf file (with a specific naming convention) to be generated. To be clear, we're not saving the report from the webpage itself; we're just using the one generated in Firefox's Temp folder.
In Firefox 16.0.1, after clicking a link that generates a report, the file is generated in "C:\Documents and Settings\ \Local Settings\Temp\tmp*\cache*", with a random file name not ending in ".pdf". This makes capturing the file somewhat more difficult with a technique similar to our previous one: each browser instance has a different tmp*** folder, which has a cache full of folders, inside one of which the report is generated with a random file name.
The easiest solution I can see would be to directly save the pdf, but I haven't found a way to do that yet.
To use the same approach as in FF3.6 (finding the pdf in the Temp folder), I'm thinking we'll need to do the following:
Figure out which tmp*** folder belongs to this particular browser instance (which we can do by inspecting the tmp*** folders that exist before and after the browser is instantiated)
Look inside that browser's cache for a file generated immediately after the pdf report was generated (which we can do by comparing timestamps)
In cases where multiple files are generated in the cache, we could possibly sort by size and take the largest file, since the pdf will almost certainly be the largest temp file (although this seems flaky and will need to be tested in practice)
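(A rough sketch of steps 1 and 3, with a hypothetical temp path; the real one depends on the Windows user profile:)

import os

# Hypothetical temp root ('someuser' is a placeholder)
TEMP = r'C:\Documents and Settings\someuser\Local Settings\Temp'

# Step 1: snapshot the tmp*** folders, instantiate the browser, then diff
before = set(os.listdir(TEMP))
# ... instantiate the WebDriver browser here ...
new_dirs = [d for d in set(os.listdir(TEMP)) - before if d.startswith('tmp')]

# Step 3: within the new folder's cache, take the largest file as the pdf
for d in new_dirs:
    cache_files = [os.path.join(root, name)
                   for root, _, names in os.walk(os.path.join(TEMP, d))
                   for name in names]
    if cache_files:
        report = max(cache_files, key=os.path.getsize)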
I'm not feeling great about this approach, and was wondering if there's a better way to capture pdf files. Can anyone suggest a better approach?
Note: the actual scraping of the PDF file is still working fine.
We ultimately accomplished this by clearing Firefox's temporary internet files before the test, then looking for the most recently created file after the report was generated.
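(A minimal sketch of that approach, again with a hypothetical temp path:)

import glob
import os

# Hypothetical temp root, cleared before the test run
TEMP = r'C:\Documents and Settings\someuser\Local Settings\Temp'

# After the report is generated, take the most recently created file
candidates = [p for p in glob.glob(os.path.join(TEMP, '*')) if os.path.isfile(p)]
newest = max(candidates, key=os.path.getctime)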
I've been searching for some time, but I couldn't seem to find a way to achieve this.
What I want is the web-page-to-PDF conversion functionality of Firefox. Right now the web page is generated in my Django application, and I use an open-source package called "pisa" (or "xhtml2pdf") to produce a pdf report. However, it supports only very limited CSS styles, and some of the images are not rendered properly. After trying several possibilities, I found that Firefox gives exactly what I want through the print-web-page-to-PDF option in the browser GUI, so I'm wondering if I could use Python or the command line to make Firefox do the same thing. I would appreciate it if somebody could point me to some resources on Firefox commands or a Python API. Thanks.
To print from the command line with Firefox, you need to install an extension. One such extension is
Command Line Print by torisugari.
This extension allows you to print URLs immediately, without user interaction. This can be useful to convert html pages to PDF for example.
You first have to install the extension from http://torisugari.googlepages.com/commandlineprint2
After you've properly installed the extension, you can start using Firefox as command line printer.
Usage:
$>firefox -print http://www.example.com/index.html
$>firefox -print http://www.example.com/index.html -printmode pdf -printfile foobar.pdf
$>firefox -print http://www.example.com/index.html -printmode PNG
(from Command Line Print - torisugari: https://sites.google.com/site/torisugari/commandlineprint2)
Now serve your pages from the Django development server at addresses like 127.0.0.1/yourpage.
Then, by looping over those addresses, you can print every page; see the sketch below.
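(A minimal sketch of that loop, with hypothetical URLs, relying on the -print flags documented above:)

import subprocess

# Hypothetical list of Django-served pages to convert
urls = ['http://127.0.0.1:8000/page/1/', 'http://127.0.0.1:8000/page/2/']

for i, url in enumerate(urls):
    # Requires the Command Line Print extension to be installed in Firefox
    subprocess.call(['firefox', '-print', url,
                     '-printmode', 'pdf',
                     '-printfile', 'page_%d.pdf' % i])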
Take a look at wkhtmltopdf.
It is a simple command-line utility using the WebKit rendering engine, which is also used by Google Chrome and Apple Safari.
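(For example, a one-liner from Python, assuming wkhtmltopdf is on the PATH:)

import subprocess

# wkhtmltopdf takes an input URL (or local file) and an output PDF path
subprocess.call(['wkhtmltopdf', 'http://www.example.com/index.html', 'out.pdf'])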