Django image scraping then using sort thumbnail - python

I'm creating a Reddit clone, and this is my last piece I ne d to finish the job. I want image scraping from the url user provides. I have no idea how to start this though. I checked out Reddit code, and it seems like there are different functions for different sites. Any guidance to get me started? Any tutorial I can take a look at?

Check out Beautiful Soup. It should get the job done.

Related

I am wanting to pull some data about mortgages from the web

For example, take a look at this page. https://erecord.co.lubbock.tx.us/recorder/eagleweb/viewDoc.jsp?node=DOCCL-OPR19830027670
I would like to get all the data from a page like that with the user entering a name to my program.
Should I do this with web scraping or is using an API (if there is one) best? How would you go about tackling this project?
The simple answer is definitely API first if one exists. The second is to figure out what type of pages the site is serving. It looks like JSP, so any of the JSP related scraping would be the way forward. Not sure it merits another answer since there's lots of stuff out there.
For example here are answers to scraping jsp with python:
How do I web-scrape a JSP with Python, Selenium and BeautifulSoup?

How can I Scrape Business Email Contact with python?

this morning I wanted to create a little Software/Script in Python, it was 6am when I started and now I'm about to become crazy because it's 22pm and I have nothing that works.
So basically, I want to do this: Given an Instagram Username, scrape the Name, Number of followers and the business contact email.
I found out that going to the page source will give me this info (let's consider only the email for now): https://imgur.com/a/jYQ2FtR
Any idea about how I can do that? I try many different things and nothing is working. I don't know what to do. I tried downloading the page and parsing the text looking for "business_email" but I have no idea about how to implement it and extracting the data I'm looking for, I know it's a simple task, but I'm a total noob and I haven't been coding for years.
Can someone tell me how to do it? Or at least point me in the right direction.
There are different ways to approach this problem. If the data you want is visible on the page, then you could scrap that info using Beatiful Soup. If not, then it's a little more trickier but you could extract the info for the page source using regular expressions with the re module.

Scraping information from a flash object on a website using python or any other method

I was just wondering if it is possible to scrape information form this website that contained in a flash file.(http://www.tomtom.com/lib/doc/licensing/coverage/)
I am trying to get the all the text from the different components of this website.
Can anyone suggest a good starting point in python or any simpler method.
I believe the following blog post answers your question well. The author had the same need, to scrape Flash content using Python. And the same problem came up. He realized that he just needed to instantiate a browser (even just an in-memory one that did not even display to the screen) and then scrape its output. I think this could be a successful approach for what you need, and he makes it easy to understand.
http://blog.motane.lu/2009/06/18/pywebkitgtk-execute-javascript-from-python/

Read all pages within a domain

I am using the urllib library to fetch pages. Typically I have the top-level domain name & I wish to extract some information from EVERY page within that domain. Thus, if I have xyz.com, I'd like my code to fetch the data from xyz.com/about etc. Here's what I am using:
import urllib,re
htmlFile = urllib.urlopen("http://www.xyz.com/"+r"(.*)")
html = htmlFile.read()
...............
This doe not do the trick for me though. Any ideas are appreciated.
Thanks.
-T
I don't know why you would expect domain.com/(.*) to work. You need to have a list of all the pages (dynamic or static) within that domain. Your python program cannot automatically know that. This knowledge you must obtain from elsewhere, either by following links or looking at the sitemap of the website.
As a footnote, scraping is a slightly shady business. Always make sure, no matter what method you employ, that you are not violating any terms and conditions.
You are trying to use a regular expression on the web server. Turns out, web servers don't actually support this kind of format, so it's failing.
To do what you're trying to, you need to implement a spider. A program that will download a page, find all the links within it, and decide which of them to follow. Then, downloads each of those pages, and repeats.
Some things to watch out for - looping, multiple links that end up pointing at the same page, links going outside of the domain, and getting banned from the webserver for spamming it with 1000s of requests.
In addition to #zigdon answer I recommend you to take a look at scrapy framework.
CrawlSpider will help you to implement crawling quite easily.
Scrapy has this functionality built in. No recursively getting links. It asynchronously automatically handles all the heavy lifting for you. Just specify your domain and search terms and how deep you want it to search in the page .ie the whole site.
http://doc.scrapy.org/en/latest/index.html

Is there a python module which web scrapes the image, title and a description of any link?

What I'm looking for, should give me something like this ->
There are many APIs available that can accomplish your task (more precisely the task you describe on your question, not the image :) ). I personally use diffbot, which I discovered after reading this. Beware though, for this kind of "content" extraction does not always end with success, because of the nature of web pages. Instead, it relies on heuristics and training and thus may not suffice for your specific purposes...
If you're wanting an entire screenshot of the page then something like https://stackoverflow.com/questions/1041371/alexa-api may help you?
Otherwise if you're just wanting to get a few key images from the page..
you could use mechanize to assit you. When you connect to a webpage you can search through all the links on the page using:
for link in br.links():
where br is your browser object.
You can see an example here:
Download all the links(related documents) on a webpage using Python
if you print dir(link) it will show you various properties such as link.text and link.url. furthermore you can import urlparse.urlsplit and use it on the url. You can direct the browser towards the URL and scrape the images as shown in the above example.
You should really use a search engines interpretation of the page and the images in it.
You could use, the python wrapper on the bing API, or the xGoogle library.
Beware the xGoogle library fakes to google as if a browser and may not be endorsed way to consume Google's data.
This one should help: http://palewi.re/posts/2008/04/20/python-recipe-grab-a-page-scrape-a-table-download-a-file/
Learns you how to scrape content and images and store it.

Categories

Resources