I would like to ask how I can get a list of the URLs that are currently open in my web browser, for example in Firefox. I need to do it in Python.
Thanks
Try either Selenium RC, which is very good, or MozRepl: https://github.com/bard/mozrepl/wiki/
You can use MozRepl with Python as described here:
http://www.craigethomas.com/blog/2009/04/get-android-market-stats-with-python-mozrepl-and-beautifulsoup/
But I would go the Selenium route for anything non-trivial.
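If you just need the URLs of the open tabs, a minimal sketch of the MozRepl approach might look like this (assuming the MozRepl extension is installed and listening on its default port 4242; the exact prompt text and the gBrowser snippet vary between Firefox/MozRepl versions):

import telnetlib

# Connect to the MozRepl extension running inside Firefox.
tn = telnetlib.Telnet("127.0.0.1", 4242)
tn.read_until("repl>")  # wait for the MozRepl prompt

# Ask Firefox for the URL of every tab in the front window.
js = ("Array.map(gBrowser.tabContainer.childNodes, "
      "function (tab) { return tab.linkedBrowser.currentURI.spec; })"
      ".join('\\n');")
tn.write(js + "\n")
print(tn.read_until("repl>"))
tn.close()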
First I'd check whether the browser has some kind of command-line argument that could print such information. I only checked Opera, and it doesn't have one. What you could do instead is parse the session file. I'd bet that every browser stores the list of open tabs/windows on disk (so it can recover after a crash). Opera keeps this information in ~/.opera/sessions/autosave.win, which is a pretty straightforward text file. Look for other browsers' session files under .mozilla, .google, etc., or in the /user/ directories if you are on Windows. There might also be commands to ask a running instance for its working directory (since you can specify it on startup and it doesn't have to be the default one).
That's the way I'd go. Might be the wrong one.
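As a rough sketch of the session-file idea for Opera (this assumes autosave.win is an INI-style text file containing lines of the form url=...; other browsers use different paths and formats):

import os
import re

# Opera's autosave session file (adjust the path for your browser/version).
session_file = os.path.expanduser("~/.opera/sessions/autosave.win")

with open(session_file) as f:
    text = f.read()

# Pull out anything stored under a "url=" key.
for url in re.findall(r"^url=(\S+)", text, re.MULTILINE):
    print(url)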
I need to inspect the contents of a file named "(index)" (it doesn't have an extension, if that matters) on my target website (reddit.com), or download it and search it somehow, in order to find an access token to use later in my script.
How would I go about doing this? Also, this seems like a roundabout way of doing it, so do you have any suggestions that would be better? For example, is there a command I can execute in the console with driver.execute_script("something") to then search the file for me?
I have tried driver.page_source, but, as expected, it doesn't return anything involving the page's files.
I'm working on a project that basically requires me to go to a website, pick a search mode (name, year, number, etc.), search for a name, filter the results down to those with a specific type, choose to save those results rather than emailing them, pick a format to save them in, and then download them by clicking the save button.
My question is, is there a way to do those steps using a Python program? I am only aware of extracting data and downloading pages/images, but I was wondering if there was a way to write a script that would manipulate the data, and do what a person would manually do, only for a large number of iterations.
I've thought of looking into the URL structure and finding a way to generate the correct URL for each iteration, but even if that works, I'm still stuck because of the "Save" button: I can't find a link that would directly download the data I want, and using a function from the urllib2 library would download the page but not the actual file.
Any idea on how to approach this? Any reference/tutorial would be extremely helpful, thanks!
EDIT: When I inspect the save button, here is what I get:
Search Button
This will depend a lot on the website you're targeting and how its search is implemented.
For some websites, like Reddit, they have an open API where you can add a .json extension to a URL and get a JSON string response as opposed to pure HTML.
For a REST API or any other JSON response, you can load it into a Python dictionary using the json module, like this:
import json

json_response = '{"customers":[{"name":"carlos", "age":4}, {"name":"jim", "age":5}]}'
rdict = json.loads(json_response)

def print_names(data):
    for entry in data["customers"]:
        print(entry["name"])

print_names(rdict)
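As a sketch of the .json trick mentioned above (the URL and the data/children/title keys follow Reddit's public listing format at the time of writing, and Reddit expects a custom User-Agent header):

import json
import urllib2

req = urllib2.Request("http://www.reddit.com/r/python/.json",
                      headers={"User-Agent": "example-script/0.1"})
listing = json.load(urllib2.urlopen(req))

# Each child is one post; its title lives under the "data" key.
for post in listing["data"]["children"]:
    print(post["data"]["title"])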
You should take a look at the Library of Congress docs for developers. If they have an API, you'll be able to learn how to search and filter through it. That will make everything much easier than driving a browser with something like Selenium, and you can easily scale your solution up or down.
If there's no API, then you'll have to:
Use Selenium with a browser (I prefer Firefox).
Try to get as much as possible generated and filtered without actually pushing any buttons on the page, by learning how the site's search works with GET and POST requests. For example, if you're looking for books within a date range, conduct that search manually and watch how the URL changes. If you're lucky, you'll see your search criteria in the URL, which means your program can run a search just by visiting that URL instead of filling out forms and clicking buttons and drop-downs (see the sketch after this list).
If you do have to drive the browser through Selenium (for example, to save a whole page with its HTML, CSS, and JS files you have to press Ctrl+S and then click the "Save" button), then you need a library that lets you control the keyboard from Python. Such libraries exist for Ubuntu; they let you press any key and even key combinations.
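A minimal sketch of the GET-request idea from the second step (the base URL and the mode/q parameter names are purely hypothetical; read the real ones off the address bar after a manual search):

import urllib
import urllib2

# Hypothetical search parameters observed in the URL of a manual search.
params = urllib.urlencode({"mode": "name", "q": "Dickens"})
page = urllib2.urlopen("http://www.example.com/search?" + params).read()
print(len(page))  # raw HTML of the results page, ready for parsing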
An example of what's possible:
I wrote a script that logs me in to a website, navigates to a page, downloads specific links from that page, visits every link, saves every page, avoids saving duplicates, and avoids getting caught (i.e. it doesn't behave like a bot by, say, visiting 100 pages per minute).
The whole thing took 3-4 hours to code, and it ran in a virtual Ubuntu machine on my Mac, which means I could still use my own machine while it did all that work. If you don't use a virtual machine, you'll either have to leave the script running without interfering with it, or write a much more robust program, which IMO isn't worth it when you can just use a virtual machine.
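For the third step, a hedged sketch of the keyboard-driving idea (pyautogui is one library that can do this; using it here is an assumption, and on Linux it needs an X display to send key events to):

import time

import pyautogui

pyautogui.hotkey("ctrl", "s")  # open the browser's Save dialog
time.sleep(1)                  # give the dialog time to appear
pyautogui.press("enter")       # accept the default filename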
I am writing a small class assignment in Python. The raw_input is supposed to receive a link like 'http://python-data.dr-chuck.net/comments_243948.xml'; if that works, I can then parse some of the data. I am doing this assignment using PyCharm as the IDE. When it prompts me to enter a location and I type or paste in the above link and hit Enter, it just opens the linked page and does not go on to process the rest of the data. Is there a way to enter this link without having it pop open the linked page? Please help me, thanks.
While Stack Overflow isn't a homework-answering site, I can provide pointers on documentation to look at:
urllib2.urlopen will allow you to create a file-like object which reads from a web address.
The ElementTree XML API will allow you to parse XML without third-party libraries.
These two should provide enough examples to get you on your way.
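A minimal sketch combining those two pointers, assuming the XML contains <count> elements the way the course's comment files do (adjust the tag name to your actual document):

import urllib2
import xml.etree.ElementTree as ET

url = raw_input("Enter location: ")
tree = ET.fromstring(urllib2.urlopen(url).read())

# Collect every <count> element anywhere in the document.
counts = [int(node.text) for node in tree.findall(".//count")]
print(sum(counts))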
If your problem is with PyCharm automatically redirecting URLs entered in the console (which is a problem I can't seem to reproduce), the easiest solution is to simply always use the terminal.
I ran into the same case.
My solution was to type a non-whitespace character along with the URL before pressing Enter at the input() prompt, which keeps the console from jumping to the URL (the extra character can then be stripped off in code).
I wanted to open the get-pip.py file in a new tab so that I could view it, but in Firefox, unlike Chrome, I cannot find a way to view code in a tab with an odd extension. When clicking on the code it asks if I want to download it, but I don't want to.
When I selected Firefox as the default program to open it (hoping it would just treat it like a text file or something, the same way Chrome handles odd extensions like .new) new tabs kept opening like a runaway freight train! It was difficult to get it under control and salvage my session.
Does anyone know how I may modify Firefox so that it will treat extensions like .py as a text file and open it in a new tab?
My version of your problem was solved like so (on Linux):
cp /etc/mime.types ~/.mime.types # Note the dot in front of .mime
Edit ~/.mime.types, and comment out or delete everything but this:
text/plain asc txt text pot brf srt lua py
Add whatever file types you want to force as text/plain. In the above example, I added lua and py, they were not in the original /etc/mime.types.
Restart Firefox.
Note that this is for all apps that look at that/those files, not just Firefox. For example, if you run a web server, depending on the server, it may also affect that. This is why I did it in my home dir rather than /etc.
How I figured it out:
Open about:config and search for mime.
This shows three settings:
helpers.global_mime_types_file /etc/mime.types
helpers.private_mime_types_file ~/.mime.types
plugin.java.mime application/x-java-vm
In my case I did not have the private file, so I did the copy/edit/restart as above.
Also note: I can't get this to work for C files.
I've been searching for this for some time, but I couldn't seem to find a way to achieve this.
What I want is the web-page-to-PDF conversion functionality that Firefox has. Right now the web page is generated by my Django application, and I use an open-source package called "pisa" (or "xhtml2pdf") to produce the PDF report. However, it only supports a very limited set of CSS styles, and some of the images don't render properly. After trying several possibilities, I found that Firefox gives exactly what I want through the browser GUI's print-to-PDF option, so I'm wondering if I could use Python or the command line to make Firefox do the same thing. I would really appreciate it if somebody could point me to some resources on Firefox commands or a Python API. Thanks.
To print from the command line with Firefox, you need to install an extension. One such extension is
Command Line Print by torisugari.
This extension allows you to print URLs immediately, without user interaction. This can be useful to convert html pages to PDF for example.
You first have to install the extension from http://torisugari.googlepages.com/commandlineprint2
After you've properly installed the extension, you can start using Firefox as a command-line printer.
Usage:
$>firefox -print http://www.example.com/index.html
$>firefox -print http://www.example.com/index.html -printmode pdf -printfile foobar.pdf
$>firefox -print http://www.example.com/index.html -printmode PNG
(from Command Line Print - torisugari: https://sites.google.com/site/torisugari/commandlineprint2)
Now serve your pages from the Django development server (e.g. at 127.0.0.1:8000/yourpage), and with a loop over the addresses you can print every page.
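A small sketch of that loop (the host, port, page list, and output names are assumptions for illustration; it also assumes the Command Line Print extension is installed):

import subprocess

# Hypothetical pages served by the Django development server.
pages = ["report/1", "report/2", "report/3"]

for i, page in enumerate(pages):
    url = "http://127.0.0.1:8000/" + page
    subprocess.call(["firefox", "-print", url,
                     "-printmode", "pdf",
                     "-printfile", "report%d.pdf" % i])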
Take a look at wkhtmltopdf.
It is a simple command-line utility using the WebKit rendering engine, which is also used by Google Chrome and Apple Safari.
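Basic usage, assuming wkhtmltopdf is installed and on your PATH:
$>wkhtmltopdf http://www.example.com/index.html report.pdf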