I'd like to download all the PDFs from a domain to analyse them later. The problem is that the domain has no list of all its PDFs; instead, the program would have to find all the PDFs on the website and then download them. Based on a Google search with site:www.Domain.com and filetype:pdf, I know the website has roughly 8000 PDFs.
I'd like to know:
1.) whether this is even possible
2.) What solutions would be available? I'd prefer a solution that works with python but am open to all solutions. (I'm new to python)
3.) If there is a link that references the solution you suggest
4.) How resource intensive the resulting program would be and how long it would take to run on an average pc.
I googled for hours for solutions that involve Python and solutions that don't. I might be using the wrong search terms. The solutions I found focused on downloading all PDFs from a webPAGE, but I need all the PDFs from the DOMAIN (at least those that can be found with the site's own search function).
I am aware that this can't be done with a bash script alone, or at least not as far as I know (and I'm still learning). This is why I'm asking for help. What else do I need? Are there specific tools?
This is what I'd like to do:
Upload an image to https://www.google.com/searchbyimage/upload
Then find all the identical images
Download the one which has the greatest resolution
So far I've been able to upload an image to Searchbyimage through curl. This uploaded image then creates a very long token that is used to search similar images, with some supplementary keywords.
The uploaded image creates a link composed like so:
https://www.google.com/search?tbs=sbi:
After this is the awfully long token: AMhZZith3JfR2OzwmuyQjufBifvdFWNjMShRMypWIE2-g005QfYLeTATLhGHAWz8MLI-tbgHzZp-bREPlJbsNWhY7U4Z2_19bu0oHII6VJPIVVJSPANODqnrJXp6X5VKKoXHMLcBCmI9eIpxS_1EX9g9YJPFL2XFEfJqIApLX83erP5mlRM7rSiIF5Te_1RPNyVkp4IPZPBRtoOKGhpDw2xad-JZsqd2ai4F5sMvyO2A_18PMFKg21nTRH_1jVeOeUhz8U5zkL4lycIg3kafAYlNy8YwmjSFcmc2nZB_10t9MFyi2BnBmemDRp4DCACI0FVM6pLTIB8VCBpU9A
And it adds this at the end: &hl=fr.
Finally the image is searched, and I have the choice between clicking "similar images" or "all sizes" (it's "all sizes" I want, as "similar images" doesn't ensure the results will be identical). This adds some keywords from Google's analysis of the picture (here, a photograph of Émile Zola) and creates a second token:
The picture I searched here
https://www.google.com/search?safe=strict&hl=fr&
q=emile+zola&tbm=isch
&tbs=simg:
CAQSmQEJthA57uIOXdcajQELEKjU2AQaBggXCD0IQgwLELCMpwgaYgpgCAMSKLQZ9QH3BLMZ2A6xGdcO3w70Ad0OwjrEOqEuwzqiLsE67iSTLoM4oC4aMIk1iw7XQn7Wu55hLB2k-bnfW3_1yf24eA0N-w-baKvWkDj48J67yZZS-uQ-BgjCRQyAEDAsQjq7-CBoKCggIARIEnfZWUgw&sa=X&ved=0ahUKEwi965ashtrhAhWI3eAKHSmRCBwQ2A4IKygB
&biw=1920&bih=944
At the end comes the resolution of the picture. The idea is to recreate this second link, and then download the highest-resolution image among those Google has found. I have to get the token, but everything else can be found from the picture file itself: the file is properly named after the picture, so its name could serve as keywords, and its resolution is also easily known. I'd like to make this a script, to download higher-resolution versions of many paintings (over a thousand) that I only have in low quality. Ideally I'd use it quite often. So far I have found how to upload a picture with curl, and it gave me back a token, but an incomplete one. Beyond this, I am completely lost.
In theory this doesn't seem impossible. The problem is that I'm too much of a newbie: I enjoy Linux and bash a lot so far, but I know very little. I have of course done some hours of googling beforehand, but nothing showed up that I knew I could use. There is nothing like it on GitHub either: there are a lot of scripts that search for similar images, but none for identical ones, and none that also compare the sizes of these images. There's also a Python API for reverse image searching, but it didn't seem able to search for identical images, and it seems tied to the Google API, which is problematic. All of this is probably hard for me because I'm only a beginner and don't know enough to build this script. On the other hand, maybe due to my lack of knowledge, it doesn't seem impossible at all, and I'm very willing to try, fail, try again and learn. So here I am, to ask: how do I do that? Can it be done in bash only? If not, what must I include? Or perhaps it cannot be done?
Lastly, I know there is a Google API for reverse image searching. That would be very useful if it weren't limited to a hundred image searches a day: if you want more, you've got to pay. At 100 images a day, it would take me around eleven days to reverse search all the images I want in better quality; I'd be done just as fast searching for them all by hand. Neither of these options seems to be a solution, and this script doesn't seem impossible. It is only beyond my current abilities.
Thank you in advance, if anyone has got an idea !
PS: I can use Linux either through WSL or in a virtual machine. Both work fine so far, with whatever command or package I need. WSL is much faster. And sorry for my English, I'm French!
Second PS: I've been asked to show what I had as code, but this doesn't get beyond this:
curl -i -F sch=sch -F encoded_image=@path/to/my/imagefile.jpg https://www.google.com/searchbyimage/upload
Which was a partial answer to my question I had found here:
How to use google search by image in curl
There's two fundamental ways to use the web programmatically:
via API: this is purpose built for computers to access web resources and always preferred. You follow strict rules and get well defined results back.
by crawling: this is when the computer pretends to be a user, emulating the clicking on links done in a browser. Basically curl, but over and over again with state stored in between, parameters generated correctly, encoding applied, etc.
As you say, there's an API available, so if it does what you want then it's the right way to go. The fact that it does what you want, but enforces limits, is a very useful sign that what you're trying to do has limits. Those limits will have been carefully set to incentivise you to work within them. Trying to crawl for the same results will likely breach either Google's terms of service or the limits of your sanity.
So if you really want to work around the API, then use a crawler library such as Python Scrapy. But note that the API limits might be a useful indication of how far you can expect to get without paying.
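To make the crawling option more concrete, here is a minimal sketch of the "curl, but over and over again with state stored in between" loop. It is only an illustration: `fetch` is a hypothetical caller-supplied function (URL in, HTML text out) that could wrap any HTTP client, and the `example.test` site in the usage lines is made up. It uses only the standard library rather than Scrapy.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl of one host; returns the URLs visited, in order.

    `fetch` is a hypothetical callable: URL in, HTML text out.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}            # the state kept between requests
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        parser = LinkCollector()
        parser.feed(fetch(url))
        for href in parser.hrefs:
            absolute = urljoin(url, href)   # resolve relative links
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Usage with an in-memory stand-in for a website (no network access).
pages = {
    "http://example.test/": '<a href="/a">A</a><a href="http://other.test/x">X</a>',
    "http://example.test/a": '<a href="/">home</a><a href="b">B</a>',
    "http://example.test/b": "",
}
result = crawl("http://example.test/", lambda u: pages.get(u, ""))
print(result)
```

A real crawler would also need error handling, rate limiting and respect for the site's terms, which is part of why a library such as Scrapy is the usual choice.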
I'm working on a project that basically requires me to go to a website, pick a search mode (name, year, number, etc), search a name, select amongst the results those with a specific type (filtering in other words), pick the option to save those results as opposed to emailing them, pick a format to save them then download them by clicking the save button.
My question is, is there a way to do those steps using a Python program? I am only aware of extracting data and downloading pages/images, but I was wondering if there was a way to write a script that would manipulate the data, and do what a person would manually do, only for a large number of iterations.
I've thought of looking into the URL structures, and finding a way to generate for each iteration the accurate URL, but even if that works, I'm still stuck because of the "Save" button, as I can't find a link that would automatically download the data that I want, and using a function of the urllib2 library would download the page but not the actual file that I want.
Any idea on how to approach this? Any reference/tutorial would be extremely helpful, thanks!
EDIT: When I inspect the save button, here is what I get: [screenshot: Search Button]
This would depend a lot on the website you're targeting and how their search is implemented.
For some websites, like Reddit, they have an open API where you can add a .json extension to a URL and get a JSON string response as opposed to pure HTML.
For a REST API or any JSON response, you can load the response as a Python dictionary using the json module, like this:
import json

# Example JSON string, as an API might return it
json_response = '{"customers":[{"name":"carlos", "age":4}, {"name":"jim", "age":5}]}'
rdict = json.loads(json_response)  # parse the string into a Python dict

def print_names(data):
    # Print the name of each customer entry
    for entry in data["customers"]:
        print(entry["name"])

print_names(rdict)
You should take a look at the Library of Congress docs for developers. If they have an API, you'll be able to learn about how you can do search and filter through their API. This will make everything much easier than manipulating a browser through something like Selenium. If there's an API, then you could easily scale your solution up or down.
If there's no API, then you have the following options:
Use Selenium with a browser (I prefer Firefox).
Try to get as much info generated, filtered, etc. without actually having to push any buttons on that page, by learning how their search engine works with GET and POST requests. For example, if you're looking for books within a range, then conduct this search manually and look at how the URL changes. If you're lucky, you'll see that your search criteria are in the URL. Using this info, you can conduct a search just by visiting that URL, which means your program won't have to fill out a form or operate buttons, drop-downs, etc.
If you have to use the browser through Selenium (for example, if you want to save the whole page with its HTML, CSS and JS files, you have to press Ctrl+S and then click the "save" button), then you need to find libraries that let you control the keyboard from Python. There are such libraries for Ubuntu. They allow you to press any key on the keyboard and even do key combinations.
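As a rough illustration of the "search criteria in the URL" idea, the sketch below builds a search URL from keyword arguments using the standard library. The site (`example.test`) and the parameter names (`q`, `year_from`, `year_to`, `fmt`) are hypothetical; the real ones have to be read off the URL your browser shows after a manual search.

```python
from urllib.parse import urlencode


def build_search_url(base_url, **criteria):
    """Return base_url with the given criteria appended as a query string."""
    # Leave out criteria that were not set.
    query = urlencode({k: v for k, v in criteria.items() if v is not None})
    return f"{base_url}?{query}" if query else base_url


# Hypothetical site and parameter names, for illustration only.
url = build_search_url(
    "https://example.test/search",
    q="moby dick",
    year_from=1850,
    year_to=1860,
    fmt=None,  # unset, so it does not appear in the URL
)
print(url)
```

Looping over a list of names or year ranges and calling `build_search_url` for each one gives the set of URLs to fetch, with no form filling involved.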
An example of what's possible:
I wrote a script that logs me in to a website, navigates to some page, downloads specific links on that page, visits every link, saves every page, avoids saving duplicate pages, and avoids getting caught (i.e. it doesn't behave like a bot by, for example, visiting 100 pages per minute).
The whole thing took 3-4 hours to code, and it worked in a virtual Ubuntu machine I had running on my Mac, which means that while it was doing all that work I could still use my machine. If you don't use a virtual machine, then you'll either have to leave the script running and not interfere with it, or write a much more robust program, which IMO is not worth coding since you can just use a virtual machine.
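The "avoid duplicates" and "don't behave like a bot" parts of such a script can be sketched as follows. This is not the script described above, just a minimal illustration: `fetch` is a hypothetical callable (URL in, page content out), and the two-second default delay is an arbitrary assumption.

```python
import time


def visit_politely(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Fetch each distinct URL once, pausing between requests.

    `fetch` is a hypothetical callable (URL in, page content out).
    Returns a dict mapping URL -> content.
    """
    saved = {}
    for url in urls:
        if url in saved:      # skip pages that were already saved
            continue
        if saved:             # no need to wait before the first request
            sleep(delay_seconds)
        saved[url] = fetch(url)
    return saved
```

The `sleep` parameter exists only so the pause can be replaced when trying the function out; in normal use the default `time.sleep` is what slows the requests down.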
I wanted to do this before for some websites but didn't know where to start. This time, however, I am adamant. I am talking about scripts that crawl a website and extract the data we require. My target is this: I have to appear for job interviews in December. There is this site (http://www.geeksforgeeks.org/) which contains a large number of questions from previous interviews (like http://www.geeksforgeeks.org/amazon-interview-set-42-on-campus/ & http://www.geeksforgeeks.org/adobe-interview-set-6-campus-mts-1/). Every title has the word "set" and a number in it. It is quite cumbersome to keep track of what I have done and what I haven't. So I want to extract the questions from each of these pages and put them in a PDF with the title. How can I do this using curl, regex and Scrapy? I am intermediate in C/C++/Java but have only beginner proficiency in Python. Any help is much appreciated. Also point me to any such scripts you know of. I want to do this on my own; I just need a starting point and some guidance. Thanks.
If you just want a starting point, try Scrapy, a screen-scraping library for Python. I would recommend that you use the requests library for making requests. It's by far the simplest option (with no loss of power).
Also, don't try to parse html or xml with a regex. Just don't. Use one of the fine libraries available (beautifulsoup or lxml, or lxml with a beautifulsoup backend are the most popular, but there are others).
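To show what "use a parser instead of a regex" looks like, here is a small sketch that pulls out the interview-set links. It deliberately uses the standard library's `html.parser` rather than BeautifulSoup or lxml so that it runs without installing anything; the sample HTML and its link text are made up, and the "-set-" filter is an assumption based on the two URLs in the question.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collects (href, link text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None     # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def interview_set_links(html):
    """Return (href, text) for links whose URL contains '-set-'."""
    parser = LinkTextParser()
    parser.feed(html)
    return [(href, text) for href, text in parser.links if "-set-" in href]


# Made-up sample page for illustration.
sample = """
<ul>
  <li><a href="http://www.geeksforgeeks.org/amazon-interview-set-42-on-campus/">Amazon Interview | Set 42 (On-Campus)</a></li>
  <li><a href="/about/">About <b>us</b></a></li>
</ul>
"""
print(interview_set_links(sample))
```

With BeautifulSoup or lxml the same extraction is shorter, but the principle is identical: let the parser find the tags, then filter the results in ordinary Python.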
I am a social scientist and a complete newbie/noob when it comes to coding. I have searched through the other questions/tutorials but am unable to get the gist of how to crawl a news website targeting the comments section specifically. Ideally, I'd like to tell python to crawl a number of pages and return all the comments as a .txt file. I've tried
from bs4 import BeautifulSoup
import urllib2
url="http://www.xxxxxx.com"
and that's as far as I can go before I get an error message saying bs4 is not a module. I'd appreciate any kind of help on this, and please, if you decide to respond, DUMB IT DOWN for me!
I can run wget on terminal and get all kinds of text from websites which is awesome IF I could actually figure out how to save the individual output html files into one big .txt file. I will take a response to either question.
Try Scrapy. It is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
You will most likely encounter this as you go, but in some cases, if the site is employing 3rd party services for comments, like Disqus, you will find that you will not be able to pull the comments down in this manner. Just a heads up.
I've gone down this route before and have had to tailor the script to a particular site's layout/design/etc.
I've found libcurl to be extremely handy, if you don't mind doing the post-processing using Python's string handler functions.
If you don't need to implement it purely in Python, you can make use of wget's recursive mirroring option to handle the content pull, then write your python code to parse the downloaded files.
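A minimal sketch of the second half of that approach, assuming the pages have already been downloaded into a local directory (for example by a recursive wget run): it walks the directory, strips the markup from every `.html` file, and writes everything into one text file. The function and class names are illustrative, and it extracts all visible text, not just comments; narrowing it down to the comments section depends on the particular site's layout.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Keeps visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def merge_html_to_text(mirror_dir, output_file):
    """Write the text of every .html file under mirror_dir into output_file.

    Returns the number of HTML files processed.
    """
    html_files = sorted(Path(mirror_dir).rglob("*.html"))
    with open(output_file, "w", encoding="utf-8") as out:
        for path in html_files:
            extractor = TextExtractor()
            extractor.feed(path.read_text(encoding="utf-8", errors="replace"))
            out.write(f"== {path.name} ==\n")          # mark where each page starts
            out.write("\n".join(extractor.chunks) + "\n\n")
    return len(html_files)
```

Calling `merge_html_to_text("path/to/mirror", "all_pages.txt")` (both paths are placeholders) produces the single big .txt file asked about in the question.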
I'll add my two cents here as well.
The first things to check are that you installed beautiful soup, and that it lives somewhere that it can be found. There's all kinds of things that can go wrong here.
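One quick way to run that check is sketched below: it asks Python whether the `bs4` module can be found, and prints which Python executable is running, since a common cause of the error is installing the package for a different Python than the one running the script. The helper name `module_available` is just an illustration.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if `import name` would find a module."""
    return importlib.util.find_spec(name) is not None


print("Python executable:", sys.executable)
print("bs4 importable:", module_available("bs4"))
```

If the second line prints False, installing the package with that same interpreter (for instance `python -m pip install beautifulsoup4`) is the usual fix.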
My experience is similar to yours: I work at a web startup, and we have a bunch of users who register, but give us no information about their job (which is actually important for us). So my idea was to scrape the homepage and the "About us" page from the domain in their email address, and try to put a learning algorithm around the data that I captured to predict their job. The results for each domain are stored as a text file.
Unfortunately (for you...sorry), the code I ended up with was a bit complicated. The problem is that you'll end up getting a lot of garbage when you do the scraping, and you'll have to filter it out. You'll also end up with encoding issues, and (assuming you want to do some learning here) you'll have to get rid of low-value words. The total code is about 1000 lines, and I'll post some important pieces that may help you out here, if you're interested.
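This is not the code described above, only a minimal sketch of two of the steps it mentions: decoding scraped bytes without crashing on bad characters, and dropping low-value words before counting. The stop-word list is a tiny made-up sample, the length cut-off is arbitrary, and the input is assumed to be text from which the HTML tags have already been removed.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "we", "our", "for"}


def keyword_counts(raw_bytes, encoding="utf-8"):
    """Decode scraped bytes, drop low-value words, and count the rest."""
    # errors="replace" avoids a crash on bytes that are not valid in `encoding`.
    text = raw_bytes.decode(encoding, errors="replace").lower()
    words = re.findall(r"[a-z]+", text)
    return Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)


page = b"We are a team of engineers. Our engineers build tools for the web."
counts = keyword_counts(page)
print(counts.most_common(3))
```

The resulting word counts are the kind of feature a learning algorithm could then be trained on.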
I am new to programming and to Python itself. I have no programming experience. I have managed to read up on Python and done some fairly basic Python tutorial, now I am ready for my first project in Python.
I am basing my project around XBMC, I want to develop some addons for this awesome media center.
I have a few websites that I want to scrape and display in XBMC. One is a music website and one is a paid TV website which is only available to people who have accounts with them. I have managed to scrape a website with feedparser, but I have no idea how to output these titles and links so that they play in XBMC.
My question here is: where do I start, how do I construct the script for these websites, and what tools/libraries/modules do I need? And what do I need to do to include it in XBMC?
On the general topic that has been asked a ton of times regarding webpage scraping, the common answer is always Mechanize/Beautiful Soup for python. That would allow you to actually get your data.
Once you have your data, it's then just a matter of formatting it the way you want for your XBMC app: http://wiki.xbmc.org/index.php?title=HOW-TO:Write_Python_Scripts_for_XBMC
It's a two-step process:
Get your data from a source and format it into some common structure
Use the common structure to populate your elements in the xbmc script
What you actually want to do with your script will determine how you use your data. If it's simply providing information, then the link above pretty much explains it.
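The two steps can be sketched in plain Python as below. Everything here is hypothetical: `MediaItem` is an invented common structure, the input is assumed to be a list of dicts with "title" and "link" keys (roughly what a feed parser gives you), and step 2 only produces text rows, because the actual XBMC calls are described in the wiki page linked above rather than reproduced here.

```python
from dataclasses import dataclass


@dataclass
class MediaItem:
    """Common structure shared by every scraper (hypothetical)."""
    title: str
    url: str


def from_feed_entries(entries):
    """Step 1: normalise scraped entries (dicts with 'title' and 'link')."""
    return [
        MediaItem(e["title"].strip(), e["link"])
        for e in entries
        if e.get("title") and e.get("link")   # drop incomplete entries
    ]


def to_menu_rows(items):
    """Step 2: turn the common structure into rows a UI layer could display."""
    return [f"{i}. {item.title} -> {item.url}" for i, item in enumerate(items, 1)]


# Made-up entries for illustration.
entries = [
    {"title": " Song A ", "link": "http://example.test/a"},
    {"title": "", "link": "http://example.test/broken"},
]
rows = to_menu_rows(from_feed_entries(entries))
print(rows)
```

Keeping the two steps separate means each website only needs its own version of step 1, while the display code is written once.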