I need to inspect the contents of a file named "(index)" on my target website (reddit.com), or download it and search it some other way, so I can find an access token to use later in my script. (The file doesn't have an extension, if that matters.)
How would I go about doing this? Also, this seems like a roundabout way of doing it, so do you have any suggestions that would be better? For example, is there a command I can execute in the console with driver.execute_script("something") to then search the file for me?
I have tried driver.page_source, but, as expected, it doesn't return anything involving the page's files.
I tried with SASPy but it's not working. I am able to open the SAS .egp file, but not able to run the multiple scripts within it in sequence.
import os, sys, subprocess

def OpenProject(sas_exe, egp_path):
    sasExe = sas_exe
    sasEGpath = egp_path
    subprocess.call([sasExe, sasEGpath])

sas_exe = "path\\path\\"          # path to the SAS/EG executable (placeholder)
egp_path = "path\\path\\path\\"   # path to the .egp project (placeholder)
OpenProject(sas_exe, egp_path)
This depends a bit on exactly what the workflow is. A few side notes, then the full solution.
First: EGP is not really intended to store production processes, in my opinion. EGP should really be used for development, then production is done with .sas (text) files. EGP can directly store the nodes as .sas files; ask a new question about that if you want to know more, but it's pretty easy to figure out. Best practice is to have EGP save the code modules as .sas files, then run those - SASPy will easily do that for you.
Second: If you use SAS's built-in Git connectivity, then you can do this a bit more easily I suspect. Consider doing that if you already use Git for your other processes. Again, then you end up with a .sas file, and can directly run that via SASPy.
So: how can you do this in Python, with the assumption you do have to use the .egp itself, without too many different moving parts? The key here is the .egp format. EGP is a container file, which is actually a .zip format container that has in it, among other things, all of the SAS code you want to run, as text. Text in xml format, but still, text.
You can write a Python program that opens the .egp as a .zip file using the zipfile library, and then use xml.etree.ElementTree to parse the project.xml file inside that container. Exactly what you do from there depends on your particular details, and is well out of scope for a Stack Overflow answer, but if you do better visually you can simply rename the .egp to .zip, open it in the unzip program of your choice, and browse project.xml in your text editor to find the nodes and the code related to those nodes.
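For instance, here is a minimal sketch of cracking the container open in Python. It assumes the zip member is named project.xml (which is what you typically see when you rename the .egp to .zip), and the crude keyword filter at the end is only a starting point, since the exact XML layout varies by Enterprise Guide version; inspect your own project.xml first.

import zipfile
import xml.etree.ElementTree as ET

egp_path = "my_project.egp"   # hypothetical path to your Enterprise Guide project

# The .egp is a zip container; project.xml inside it describes the project.
with zipfile.ZipFile(egp_path) as egp:
    with egp.open("project.xml") as f:
        tree = ET.parse(f)

root = tree.getroot()

# Walk every element and print any text that looks like SAS code.
for elem in root.iter():
    text = (elem.text or "").strip()
    if "proc " in text.lower() or "data " in text.lower():
        print("----", elem.tag, "----")
        print(text)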
You can then extract the .sas code as text, and submit it directly via SASPy, or extract it to a .sas file and then submit that however you prefer (SASPy or something else).
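A rough sketch of the SASPy side, assuming you already have a working saspy configuration; the cfgname and the .sas file name here are placeholders.

import saspy

sas = saspy.SASsession(cfgname="default")   # use whatever config name you set up

with open("extracted_step.sas") as f:       # code pulled out of the .egp
    code = f.read()

result = sas.submit(code)    # returns a dict with the LOG and LST output
print(result["LOG"])

sas.endsas()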
I do something similar to this for a project - I don't actually run code from it, I'm just parsing it to verify that the correct programs were synced from the EGP to production - but it would be trivial to actually submit the code from what I've written, which is about 50 lines of code total. I may write an SGF paper this year or next year on this topic, in which case I'll try to remember to submit it here - or you can head over to my GitHub page and see if it's there (in the future!).
I'm working on a project that basically requires me to go to a website, pick a search mode (name, year, number, etc.), search a name, select among the results those with a specific type (filtering, in other words), pick the option to save those results as opposed to emailing them, pick a format to save them in, and then download them by clicking the save button.
My question is, is there a way to do those steps using a Python program? I am only aware of extracting data and downloading pages/images, but I was wondering if there was a way to write a script that would manipulate the data, and do what a person would manually do, only for a large number of iterations.
I've thought of looking into the URL structure and finding a way to generate the correct URL for each iteration, but even if that works, I'm still stuck because of the "Save" button: I can't find a link that would automatically download the data I want, and using a function from the urllib2 library would download the page but not the actual file that I want.
Any idea on how to approach this? Any reference/tutorial would be extremely helpful, thanks!
EDIT: When I inspect the save button, here is what I get:
[image: Search Button]
This would depend a lot on the website you're targeting and how its search is implemented.
Some websites, like Reddit, have an open API where you can add a .json extension to a URL and get a JSON string response as opposed to pure HTML.
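For example, a quick sketch of that trick, assuming the requests library is installed; the subreddit is arbitrary, and Reddit expects a descriptive User-Agent.

import requests

url = "https://www.reddit.com/r/learnpython/hot.json"
headers = {"User-Agent": "my-research-script/0.1"}   # Reddit blocks generic default agents

resp = requests.get(url, headers=headers)
print(resp.text[:500])   # the raw JSON string; parse it as shown below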
For using a REST API or any JSON response, you can load it as a Python dictionary using the json module, like this:

import json

json_response = '{"customers":[{"name":"carlos", "age":4}, {"name":"jim", "age":5}]}'
rdict = json.loads(json_response)   # parse the JSON string into a dict

def print_names(data):
    for entry in data["customers"]:
        print(entry["name"])

print_names(rdict)
You should take a look at the Library of Congress docs for developers. If they have an API, you'll be able to learn how to search and filter through it, which will make everything much easier than manipulating a browser through something like Selenium. If there's an API, you can also easily scale your solution up or down.
If there's no API, then you'll have to:
Use Selenium with a browser (I prefer Firefox).
Try to get as much as possible generated and filtered without actually pushing any buttons on the page, by learning how the site's search works with GET and POST requests. For example, if you're looking for books within a date range, conduct that search manually and watch how the URL changes. If you're lucky, you'll see that your search criteria appear in the URL, and your program can then conduct a search simply by visiting that URL, which means it won't have to fill out forms or push buttons and drop-downs (see the sketch after this list).
If you do have to drive the browser through Selenium (for example, to save the whole page with its HTML, CSS and JS files you have to press Ctrl+S and then click the "Save" button), then you need a library that lets you control the keyboard from Python. There are such libraries for Ubuntu; they allow you to press any key and even key combinations.
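To illustrate the second point above, here is a minimal sketch of querying a search form directly once you know its GET parameters; the URL and parameter names are made up for illustration, and it assumes the requests library.

import requests

base_url = "https://www.example.org/search"            # hypothetical search endpoint
params = {"q": "some name", "year_from": 1990, "year_to": 2000, "type": "book"}

# Equivalent to the URL your browser builds when you submit the form,
# e.g. https://www.example.org/search?q=some+name&year_from=1990&...
resp = requests.get(base_url, params=params)
print(resp.url)           # the fully built URL
html = resp.text          # feed this to BeautifulSoup, a regex, etc.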
An example of what's possible:
I wrote a script that logs me in to a website, navigates to some page, downloads specific links on that page, visits every link, saves every page, avoids saving duplicate pages, and avoids getting caught (i.e. it doesn't behave like a bot by, for example, visiting 100 pages per minute).
The whole thing took 3-4 hours to code, and it actually worked in a virtual Ubuntu machine I had running on my Mac, which means that while it was doing all that work I could still use my machine. If you don't use a virtual machine, you'll either have to leave the script running and not interfere with it, or write a much more robust program, which IMO is not worth the effort since you can just use a virtual machine.
I have a web-page here that I need to crawl. It looks like this:
www.abc.com/a/b/,
and I know that under the /b directory, there are some files with .html extensions I need. I know that I have access to those .html files, but I have no access to www.abc.com/a/b/. So, without knowing the .html file name, how can I crawl those .html pages?
You can't crawl webpages if you don't know how to get to them.
If I understood you correctly, you want to access pages inside a directory whose index page is not accessible (because you get a 403).
Before you give up, you can try the following:
check if the main search engines link to the pages inside the directory that you seem to know about (since you know you have access to those .html files, you probably know at least one of them). The page that includes that link may include other links to files inside that directory as well. For instance, in google, use the link: operator:
link:www.abc.com/a/b/the_file_you_know_exists
check if the website is indexed in the main search engines. For instance, in google, use the site: operator:
site:www.abc.com/a/b/
check if the website is archived in archive.org:
http://web.archive.org/web/*/www.abc.com/a/b/
check if you can find it in other web archives using memento:
http://timetravel.mementoweb.org/reconstruct/*/www.abc.com/a/b/
try to find other possible filenames such as index1.html, index_old.html, index.html_old, contact.html, and so on. You could build a long list of candidate filenames to try, but this also depends on what you know about the website (a sketch of probing such a list follows below).
This may give you possible pages from that website that still exist or existed in the past.
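As a sketch of that filename-guessing idea, assuming the requests library; the host and candidate names are placeholders, and keep the request rate polite.

import time
import requests

base = "http://www.abc.com/a/b/"
candidates = ["index1.html", "index_old.html", "index.html_old", "contact.html"]

for name in candidates:
    url = base + name
    resp = requests.head(url, allow_redirects=True)   # HEAD is enough to check existence
    if resp.status_code == 200:
        print("Found:", url)
    time.sleep(1)   # be gentle with the server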
We test an application developed in house using a Python test suite which accomplishes web navigations/interactions through Selenium WebDriver. A tricky part of our web testing is dealing with a series of pdf reports in the app. We are testing a planned upgrade of Firefox from v3.6 to v16.0.1, and it turns out that the way we captured reports before no longer works, because of changes in the directory structure of Firefox's temp folder. I didn't write the original pdf capturing code, but I will refactor it for whatever we end up using with v16.0.1, so I was wondering if there's a better way to save a pdf using Python's Selenium WebDriver bindings than what we're currently doing.
Previously, for Firefox v3.6, after clicking a link that generates a report, we would scan the "C:\Documents and Settings\\Local Settings\Temp\plugtmp" directory for a pdf file (with a specific naming convention) to be generated. To be clear, we're not saving the report from the webpage itself; we're just using the one generated in Firefox's Temp folder.
In Firefox 16.0.1, after clicking a link that generates a report, the file is generated in "C:\Documents and Settings\ \Local Settings\Temp\tmp*\cache*", with a random file name, not ending in ".pdf". This makes capturing this file somewhat more difficult, if using a technique similar to our previous one - each browser has a different tmp*** folder, which has a cache full of folders, inside of which the report is generated with a random file name.
The easiest solution I can see would be to directly save the pdf, but I haven't found a way to do that yet.
To use the same approach as we used in FF3.6 (finding the pdf in the Temp folder directory), I'm thinking we'll need to do the following:
Figure out which tmp*** folder belongs to this particular browser instance (which we can do by inspecting the tmp*** folders that exist before and after the browser is instantiated; see the sketch after this list)
Look inside that browser's cache for a file generated immediately after the pdf report was generated (which we can do by comparing timestamps)
In cases where multiple files are generated in the cache, we could possibly sort based on size, and take the largest file, since the pdf will almost certainly be the largest temp file (although this seems flaky and will need to be tested in practice).
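A rough sketch of the first step; the Temp root below is a placeholder, and this just diffs the tmp* folders before and after launching the browser.

import os
from selenium import webdriver

temp_root = r"C:\path\to\Temp"   # placeholder for the Temp root your tests inspect

def tmp_dirs(root):
    return {d for d in os.listdir(root) if d.startswith("tmp")}

before = tmp_dirs(temp_root)
driver = webdriver.Firefox()
after = tmp_dirs(temp_root)

new_dirs = after - before   # folder(s) created by this browser instance
print("This browser's temp folder(s):", new_dirs)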
I'm not feeling great about this approach, and was wondering if there's a better way to capture pdf files. Can anyone suggest a better approach?
Note: the actual scraping of the PDF file is still working fine.
We ultimately accomplished this by clearing Firefox's temporary internet files before the test, then looking for the most recently created file after the report was generated.
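For reference, a minimal sketch of the "most recently created file" check (Python 3; the temp directory path is a placeholder):

import glob
import os

temp_dir = r"C:\path\to\firefox\temp"   # placeholder for the Temp/cache root

def newest_file(root):
    """Return the most recently created file anywhere under root."""
    files = [p for p in glob.glob(os.path.join(root, "**", "*"), recursive=True)
             if os.path.isfile(p)]
    return max(files, key=os.path.getctime) if files else None

# ... click the report link via Selenium, wait for the download to finish, then:
print("Captured report:", newest_file(temp_dir))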
I would like to ask how I can get a list of the URLs that are open in my web browser, for example in Firefox. I need it in Python.
Thanks
Try either Selenium RC, which is very good,
or https://github.com/bard/mozrepl/wiki/
You can use it with Python as described here:
http://www.craigethomas.com/blog/2009/04/get-android-market-stats-with-python-mozrepl-and-beautifulsoup/
But I would go the Selenium route for anything non-trivial.
First I'd check if the browser has some kind of command line argument which could print such information. I only checked Opera and it doesn't have one. What you could do is parse the session file. I'd bet that every browser stores its list of open tabs/windows on disk (so it can recover after a crash). Opera keeps this information in ~/.opera/sessions/autosave.win, which is a pretty straightforward text file. Find other browsers' session files in .mozilla, .google, etc., or, if you are on Windows, in the user profile directories. There might be commands to ask a running instance for its working directory (since you can specify it on startup and it doesn't have to be the default one).
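For illustration, a rough sketch of pulling URLs out of a session file with a regex; the path follows the Opera example above, and the exact file location and format will differ per browser and OS.

import re
from pathlib import Path

# Path per the Opera example above; adjust for your browser/OS.
session_file = Path.home() / ".opera" / "sessions" / "autosave.win"

text = session_file.read_text(errors="ignore")
urls = re.findall(r"https?://[^\s\"']+", text)
for url in urls:
    print(url)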
That's the way I'd go. Might be the wrong one.