Python 3.4 - Downloading newly uploaded text files from pastebin.com - python

I want to download text files from pastebin.com.
Once I start the program it should look for text files that are being uploaded and "download" them once they're uploaded.
I know how to "download" them but not how to tell Python to click on one of the public files on http://pastebin.com/archive and then click on the "raw"-button to open a new tab that contains the "raw" content.
I googled a lot but literally nothing came up that would help me.
Thanks

Well, a program doesn't know how to "click" anything :). In order to retrieve information from a page, you simply need to send a GET request to the correct URL. In your case, that would be http://pastebin.com/raw/4ffLHviP or any other code of the pastebin you want to download. You can collect codes manually, or automatically by applying text parsers (regex, BeautifulSoup, ...) to the archive page.
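For example, a minimal sketch using the third-party requests library; the paste code below is the one from the example URL above, and any other code works the same way:
import requests

paste_code = "4ffLHviP"  # code taken from the example URL above
response = requests.get("http://pastebin.com/raw/" + paste_code)
response.raise_for_status()  # fail loudly on 403/404 etc.
print(response.text)  # the raw paste content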
Note that there is an API for scraping Pastebin (see http://pastebin.com/scraping). If you want to extract a significant amount of content, it is strongly recommended to use it: it is more "polite", may offer better service, and will keep you from being blacklisted.

To choose a file you simply do the following:
Visit the link of the file, ex. http://pastebin.com/B8A6L7Zt
The raw content is already on that page, inside <textarea id='paste_code'>...</textarea>. So you just extract that content, using a regex for example.
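A rough sketch of that approach with the standard library (urllib plus a regex); the paste URL is the example above, and the exact textarea markup is assumed to match what the page serves, so a proper HTML parser like BeautifulSoup would be more robust:
import re
import urllib.request

url = "http://pastebin.com/B8A6L7Zt"
html = urllib.request.urlopen(url).read().decode("utf-8")

# grab whatever sits inside the paste_code textarea
# (the attribute quoting matches the markup quoted in the answer above)
match = re.search(r"<textarea[^>]*id='paste_code'[^>]*>(.*?)</textarea>", html, re.DOTALL)
if match:
    print(match.group(1))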

Related

Identify the edited location in the PDF modified by online editor www.ilovepdf.com using Python

I have an SBI bank statement PDF which is tampered/forged. Here is the link for the PDF.
This PDF is edited using online editor www.ilovepdf.com. The edited part is the first entry under the 'Credit' column. Original entry was '2,412.00' and I have modified it to '12.00'.
Is there any programmatic way either using Python or any other opensource technology to identify the edited/modified location/area of the PDF (i.e. BBOX(Bounding Box) around 12.00 credit entry in this PDF)?
2 things I already know:
Metadata (Info or XMP metadata) is not useful. The modify date in the metadata doesn't confirm whether the PDF was merely compressed or actually edited; it changes in both cases. It also doesn't give the location of the edit.
The PyMuPDF SPANS JSON object is also not useful, as the edited entry doesn't come at the end of the SPANS JSON; instead it appears in the proper order of the text inside the PDF. Here is the SPAN JSON file generated from PyMuPDF.
Kindly let me know if anyone has any opensource solution to resolve this problem.
iLovePDF completely rewrites the text in the document. You can even see this: open the original and the manipulated PDFs in two Acrobat Reader tabs and switch back and forth between them, and you'll see nearly all letters move a bit.
Internally iLovePDF also rewrote the PDF completely according to its own preferences, and the edit fits in perfectly.
Thus, no, you cannot recognize the manipulated text based on this document alone, because technically it is a completely different, completely new document.

Search/Filter/Select/Manipulate data from a website using Python

I'm working on a project that basically requires me to go to a website, pick a search mode (name, year, number, etc), search a name, select amongst the results those with a specific type (filtering in other words), pick the option to save those results as opposed to emailing them, pick a format to save them then download them by clicking the save button.
My question is, is there a way to do those steps using a Python program? I am only aware of extracting data and downloading pages/images, but I was wondering if there was a way to write a script that would manipulate the data, and do what a person would manually do, only for a large number of iterations.
I've thought of looking into the URL structure and finding a way to generate the correct URL for each iteration. But even if that works, I'm still stuck because of the "Save" button: I can't find a link that would automatically download the data I want, and a urllib2 call would download the page but not the actual file I want.
Any idea on how to approach this? Any reference/tutorial would be extremely helpful, thanks!
EDIT: When I inspect the save button, here is what I get:
[image: Search Button]
This would depend a lot on the website you're targeting and how its search is implemented.
Some websites, like Reddit, have an open API where you can add a .json extension to a URL and get a JSON response instead of plain HTML.
When using a REST API or any JSON response, you can load it as a Python dictionary using the json module, like this:
import json

json_response = '{"customers":[{"name":"carlos", "age":4}, {"name":"jim", "age":5}]}'
rdict = json.loads(json_response)

def print_names(data):
    for entry in data["customers"]:
        print(entry["name"])

print_names(rdict)
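The same module works when the JSON comes straight from a URL. For instance, a sketch of the Reddit .json trick mentioned above, using the third-party requests library; the subreddit and the User-Agent string are just illustrative:
import requests

# appending .json to a Reddit listing URL returns JSON instead of HTML
response = requests.get("https://www.reddit.com/r/python/new.json",
                        headers={"User-Agent": "example-scraper/0.1"})
listing = response.json()
for post in listing["data"]["children"]:
    print(post["data"]["title"])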
You should take a look at the Library of Congress docs for developers. If they have an API, you'll be able to learn how to search and filter through it. This will make everything much easier than manipulating a browser through something like Selenium, and with an API you could easily scale your solution up or down.
If there's no API, then you have to:
Use Selenium with a browser (I prefer Firefox); see the minimal sketch after this list.
Try to get as much info generated, filtered, etc. without actually having to push any buttons on that page by learning how their search engine works with GET and POST requests. For example, if you're looking for books within a range, then manually conduct this search and look at how the URL changes. If you're lucky, you'll see that your search criteria is in the URL. Using this info you can actually conduct a search by visiting that URL which means your program won't have to fill out a form and push buttons, drop-downs, etc.
If you have to use the browser through Selenium (for example, if you want to save the whole page with its html, css, and js files, you have to press ctrl+s and then click the "Save" button), then you need to find libraries that allow you to manipulate the keyboard from Python. There are such libraries for Ubuntu; they let you press any key and even do key combinations.
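A minimal Selenium sketch along those lines; the URL, field name, and button id are hypothetical placeholders, not the actual markup of any particular site:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("https://example.org/search")                   # hypothetical search page
driver.find_element(By.NAME, "q").send_keys("some name")   # hypothetical form field
driver.find_element(By.ID, "save-button").click()          # hypothetical save button
driver.quit()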
An example of what's possible:
I wrote a script that logs me in to a website, then navigates me to some page, downloads specific links on that page, visits every link, saves every page, avoids saving duplicate pages, and avoids getting caught(i.e. it doesn't behave like a bot by for example visiting 100 pages per minute).
The whole thing took 3-4 hours to code, and it actually worked in a virtual Ubuntu machine I had running on my Mac, which means that while it was doing all that work I could still use my machine. If you don't use a virtual machine, then you'll either have to leave the script running and not interfere with it, or build a much more robust program, which IMO is not worth coding since you can just use a virtual machine.

How do I use a crawler if I know the target web page and file extension but not the file name?

I have a web-page here that I need to crawl. It looks like this:
www.abc.com/a/b/,
and I know that under the /b directory, there are some files with .html extensions I need. I know that I have access to those .html files, but I have no access to www.abc.com/a/b/. So, without knowing the .html file name, how can I crawl those .html pages?
You can't crawl webpages if you don't know how to get to them.
If I understood what you meant, you want to access pages that are accessible in a directory whose index page is not (because you get a 403).
Before you give up, you can try the following:
check if the main search engines link to the pages inside the directory that you seem to know about (because if you know you have access to those .html you probably know at least one of them). The page that includes that link may include other links to files inside that directory as well. For instance, in google, use the link: operator:
link:www.abc.com/a/b/the_file_you_know_exists
check if the website is indexed in the main search engines. For instance, in google, use the site: operator:
site:www.abc.com/a/b/
check if the website is archived in archive.org:
http://web.archive.org/web/*/www.abc.com/a/b/
check if you can find it in other web archives using memento:
http://timetravel.mementoweb.org/reconstruct/*/www.abc.com/a/b/
try to find other possible filenames such as index1.html, index_old.html, index.html_old, contact.html and so on. You could create a long list of the possible filenames to try but this also depends on what you know about the website.
This may give you possible pages from that website that still exist or existed in the past.
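The last suggestion, guessing filenames, is easy to automate. A small sketch with the third-party requests library that probes a list of candidate names and reports which ones respond; the host and names are placeholders taken from the question:
import requests

base = "http://www.abc.com/a/b/"
candidates = ["index1.html", "index_old.html", "index.html_old", "contact.html"]

for name in candidates:
    # HEAD keeps the probe cheap; some servers may require a GET instead
    response = requests.head(base + name, allow_redirects=True)
    if response.status_code == 200:
        print("found:", base + name)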

Streaming audio (YouTube)

I'm writing a CLI for a music-media-platform. One of the features is going to be that you can directly play YouTube videos from the CLI. I don't really have an idea of how to do it, but this one sounded the most reasonable:
I'm going to use one of those sites where you can download music from YouTube, for example http://keepvid.com/, and then directly stream and play it, but I have one problem. Is there any Python library capable of doing this, and if so, do you have any concrete examples?
I've been looking, but found nothing, not even with GStreamer.
You need two things to download a YouTube video: the video id, which is the v= parameter of the URL, and a hidden field t= which is present in the page source. I have no idea what this t value is, but it's what you need :)
You can then download the video using a URL in the format;
http://www.youtube.com/get_video?video_id=*******&t=*******
Where the stars represent the values obtained.
I'm guessing you can ask for the video id from user input, as it's straightforward to obtain. Your program would then download the HTML source for that video, parse the source for the t value, then download the video using the newly constructed URL.
For example, if you open this link in your browser, it should download the video, or you can use a downloading program such as Wget;
http://www.youtube.com/get_video?video_id=3HrSN7176XI&t=vjVQa1PpcFNM4c8MbEhsnGaNvYKoYERIJ-hK7ErLpUI=
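A rough sketch of that flow; note that the get_video endpoint and the t field come from this (old) answer and no longer work on today's YouTube, and the regex used to locate t is an assumption, so treat this purely as an illustration of the parse-then-construct idea:
import re
import urllib.request
from urllib.parse import urlencode

video_id = "3HrSN7176XI"
watch_html = urllib.request.urlopen(
    "http://www.youtube.com/watch?v=" + video_id).read().decode("utf-8", "ignore")

# assumed: the t value appears as "t": "..." somewhere in the page source
match = re.search(r'"t"\s*:\s*"([^"]+)"', watch_html)
if match:
    video_url = "http://www.youtube.com/get_video?" + urlencode(
        {"video_id": video_id, "t": match.group(1)})
    urllib.request.urlretrieve(video_url, video_id + ".flv")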
It appears that KeepVid is simply a JavaScript bookmarklet that links you to a KeepVid download page where you can then download the YouTube video in any one of a variety of formats. So, unless you want to figure out how to stream the file that it links you to, it's not easily doable. You'd have to scrape the page returned and figure out which URL you wanted to download, and then you'd have to stream from that URL (and some of the formats may or may not be streamable anyway).
And as an aside, even though they don't have a terms of service specified, I'd imagine that since they appear to be mostly advertisement-supported that abusing their functionality by going around their advertisement-supported webpage would be ethically questionable.

How to get image details from firefox webdriver?

I've got an image on a page rendered by Firefox via Webdriver, I can get its object (wd.find_element_by_xpath("id('main')/form/p[5]/img")), but how can I get its body either base64-encoded or just a location on my hard drive?
PS: please don't suggest getting the src and fetching it with an external tool. I want the image I already have in the browser.
Cached images can be extracted from Firefox's cache by navigating to a URL like this one:
about:cache-entry?client=HTTP&sb=1&key=http://your.server/image.png
The resulting page will contain a line with the "file on disk" label, like this one:
file on disk: /home/fviktor/.mozilla/firefox/7jx6k3hx.default/Cache/CF7379D8d01
This page will also contain a hex dump of the file's contents. You can load the file from that path or parse the hex dump. Please note that the path can also be none in the case of small files cached only in memory; parsing the hex dump is your only option then.
Maybe there's a way to suppress the hex dump if there's a cache file on the disk, but I'm not sure about it.
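A hedged sketch of that idea, reusing the same webdriver instance (wd) from the question; the cache-entry URL format and the "file on disk" label are those shown above and vary between Firefox versions, as a later answer on this page notes:
import re

image_url = "http://your.server/image.png"  # the src of the image already rendered
wd.get("about:cache-entry?client=HTTP&sb=1&key=" + image_url)

# look for the "file on disk" line in the cache-entry page
match = re.search(r"file on disk:\s*(\S+)", wd.page_source)
if match and match.group(1) != "none":
    with open(match.group(1), "rb") as f:
        image_bytes = f.read()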
I've created a little script for extracting data from the browser cache. You can extract cache entries using it. Check it out at this gist, and check this post for a usage guide.
fviktor's answer helped, but the syntax has changed. In Firefox 60.9 ESR, entries are stored as about:cache-entry?storage=disk&context=&eid=&uri=https://example.com/images/img.png, and the page no longer contains a file on disk label, but at the bottom you will still find the hex dump.
