I want to create or find an open source web crawler (spider/bot) written in Python. It must find and follow links, collect meta tags and meta descriptions, the titles of web pages, and the URL of each page, and put all of that data into a MySQL database.
Does anyone know of any open source scripts that could help me? Also, if anyone can give me some pointers as to what I should do then they are more than welcome to.
Yes, I know of a few.
Libraries:
https://github.com/djay/transmogrify.webcrawler
http://code.google.com/p/harvestman-crawler/
http://code.activestate.com/pypm/orchid/
Open source web crawler:
http://scrapy.org/
Tutorials:
http://www.example-code.com/python/pythonspider.asp
PS: I don't know whether they use MySQL, because Python projects normally use SQLite or PostgreSQL. If you want, you could take the libraries I gave you, import the Python MySQL module, and do it yourself :D
http://sourceforge.net/projects/mysql-python/
I would suggest using Scrapy, a powerful scraping framework based on Twisted and lxml. It is particularly well suited to the kind of task you want to perform: it features regex-based rules for following links, and it lets you use either regular expressions or XPath expressions to extract data from the HTML. It also provides what it calls "pipelines" to dump the data wherever you want.
Scrapy doesn't provide a built-in MySQL pipeline, but someone has written one here, which you could base your own on.
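For reference, a minimal sketch of such a pipeline might look like the following; the table and column names and the use of the mysql-connector-python package are my assumptions, not part of Scrapy itself:

import mysql.connector  # assumes the mysql-connector-python package is installed

class MySQLStorePipeline:
    def open_spider(self, spider):
        # hypothetical connection settings; adjust to your own database
        self.conn = mysql.connector.connect(user="crawler", password="secret", database="crawl")
        self.cur = self.conn.cursor()

    def process_item(self, item, spider):
        # assumes a "pages" table with url/title/description columns
        self.cur.execute(
            "INSERT INTO pages (url, title, description) VALUES (%s, %s, %s)",
            (item.get("url"), item.get("title"), item.get("description")),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()

You would then enable it through the ITEM_PIPELINES setting of your Scrapy project.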
Scrapy is a web crawling and scraping framework you can extend to insert the selected data into a database.
It's like an inverse of the Django framework.
First post - be gentle!
I am starting to learn Python and would like to get information from a table on a web page (https://en.wikipedia.org/wiki/European_Union#Demographics) into a pandas DataFrame.
I am using Google Colab, and from researching a bit I understand the process has something to do with 'web scraping': turning HTML into a .CSV.
Any thoughts welcome please. Worth noting I am constrained by not being able to download additional software due to the secure nature of my work.
Thanks.
You need a library to help you parse the HTML - a well known library for that in Python would be BeautifulSoup.
There are also some available tools online that do this kind of thing for you, and you can take some inspiration from them, even if you can't use them directly: https://wikitable2csv.ggor.de/
As you can see, the website above uses the CSS selector "table.wikitable" to identify the tables.
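A minimal sketch along those lines, assuming the libraries Colab already ships (requests, beautifulsoup4 and pandas) and using that same selector:

from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/European_Union"
soup = BeautifulSoup(requests.get(url).text, "html.parser")
table = soup.select_one("table.wikitable")          # same selector the tool above uses
df = pd.read_html(StringIO(str(table)))[0]          # turn the HTML table into a DataFrame
df.to_csv("eu_table.csv", index=False)

Since nothing beyond what Colab already provides is needed, this should fit within your no-extra-software constraint.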
You can use Scrapy, a Python-based scraping framework, to get and parse the data as required. In Scrapy, you create spiders which crawl a set of URLs that you have initialized. You can then parse the HTML data using something like Beautiful Soup to get your table from the response. The Scrapy documentation itself is pretty useful and should get you set up quickly! Scrapy also lets you export the parsed data as CSV, which should help with the export part.
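As a rough sketch of such a spider, using Scrapy's own CSS selectors rather than Beautiful Soup to keep it short (the spider name and the one-item-per-row output are just illustrative assumptions):

import scrapy

class WikiTableSpider(scrapy.Spider):
    name = "wikitable"  # hypothetical spider name
    start_urls = ["https://en.wikipedia.org/wiki/European_Union"]

    def parse(self, response):
        # emit one item per row of any "wikitable" table on the page
        for row in response.css("table.wikitable tr"):
            cells = [c.get().strip() for c in row.css("td::text, th::text")]
            if cells:
                yield {"cells": cells}

Running it with scrapy runspider spider.py -o table.csv would give you the CSV export mentioned above.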
All the best!
I would like to download the data in this table:
http://portal.ujn.gov.rs/Izvestaji/IzvestajiVelike.aspx
I know how to use selenium to go through the pages and the CSS selectors are helpful enough that it shouldn't be too difficult to get all the data...
However, I am curious if anyone knows a way of getting at the JSON (or whatever intermediary object is used to build the HTML)? As in, whatever raw data format the server exports? Is this possible with ASP.NET frameworks?
I have found such solutions in the past, but only for much simpler web pages, and for pages using GET requests...
Thank you!
Taking a look at the website (I have no experience with the language, but that doesn't really matter), it looks to me like it is rendering the information from a database on the server side (in my book the "old" way of doing it) rather than pulling a JSON file. Which means that you're basically stuck doing it the normal web-scraping route like you said, OR finding a SQL injection (which I am in NO WAY suggesting, as it is illegal) to bypass the limitations of their crappy search page.
I am looking to replace Yahoo Query Language with something more manageable and dependable. Right now we use it to scrape public CSV files and use the information in our web app.
Currently I am having trouble finding an alternative, and it seems that scraping websites with Python is the best bet. However, I don't even know where to start.
My question is what is needed to scrape a CSV, save the data and use it elsewhere in a web application using Python? Do I need a dedicated database or can I save the data a different way?
A simple explanation is appreciated
This is a bit broad, but let's divide it into separate tasks.
My question is what is needed to scrape a CSV
If you mean downloading CSV files from already-known URLs, you can simply use urllib. If you don't have the CSV URLs, you'll have to obtain them somehow; if you want to get the URLs from web pages, beautifulsoup is commonly used to parse HTML, and scrapy is used for larger-scale scraping.
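A minimal sketch of that first step, with a hypothetical CSV URL standing in for your real one:

import csv
import urllib.request

url = "https://example.com/data.csv"  # hypothetical URL of an already-known CSV

with urllib.request.urlopen(url) as resp:
    text = resp.read().decode("utf-8")

rows = list(csv.reader(text.splitlines()))  # each row is a list of strings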
save the data.
Do I need a dedicated database or can I save the data a different way?
Not at all. You can save the CSV files directly to disk, store them with pickle, serialize them to JSON, or use a relational or NoSQL database. What you should use depends heavily on what you want to do and what kind of access you need to the data (local/remote, centralized/distributed).
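For instance, sticking with the simplest of those options, the parsed rows from the previous step could be dumped to and reloaded from a JSON file:

import json

# "rows" is the list of parsed CSV rows from the previous sketch
with open("data.json", "w") as f:
    json.dump(rows, f)

with open("data.json") as f:
    rows = json.load(f)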
and use it elsewhere in a web application using Python
You'll probably want to learn how to use a web framework for that (django, flask and cherrypy are common choices). If you don't need concurrent write access, any of the storage approaches I mentioned would work with these.
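As a tiny illustration of that last step with flask (the /data route and the in-memory ROWS variable are just assumptions for this sketch):

from flask import Flask, jsonify

app = Flask(__name__)
ROWS = []  # hypothetical: filled in by the scraping/loading steps above

@app.route("/data")
def data():
    return jsonify(ROWS)  # expose the scraped rows to the rest of the web app

if __name__ == "__main__":
    app.run(debug=True)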
I have an HTML webpage. It has a search textbox. I want to allow the user to search within a dataset. The dataset is represented by a bunch of files on my server. I wrote a python script which can make that search.
Unfortunately, I'm not familiar with how to unite the HTML page and a Python script.
The task is to hook the Python script up to the HTML page so that:
Python code will be run on the server side
Python code can somehow take the values from the HTML page as input
Python code can somehow put the search results into the HTML webpage as output
Question 1: How can I do this?
Question 2: How should the Python code be stored on the website?
Question 3: How should it take HTML values as input?
Question 4: How can it output the results to the webpage? Do I need to install/use any additional frameworks?
Thanks!
There are too many things to get wrong if you try to implement that by yourself with only what the standard library provides.
I would recommend using a web framework, like flask or django. I linked to the quickstart sections of the comprehensive documentation of both. Basically, you write code and URL specifications that are mapped to the code, e.g. an HTTP GET on /search is mapped to a method returning the HTML page.
You can then use a form submit button to GET /search?query=<param>, with <param> being the user's input. Based on that input you search the dataset and return a new HTML page with results.
Both frameworks have template languages that help you put the search results into HTML.
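To make that concrete, here is a minimal flask sketch; search_dataset and data.txt stand in for your existing search script and files, and the inline template is only for illustration:

from flask import Flask, request, render_template_string

app = Flask(__name__)

def search_dataset(query):
    # placeholder for your existing search script over the files on the server
    with open("data.txt") as f:
        return [line.strip() for line in f if query in line]

PAGE = """
<form action="/search" method="get">
  <input name="query"><button type="submit">Search</button>
</form>
<ul>{% for r in results %}<li>{{ r }}</li>{% endfor %}</ul>
"""

@app.route("/search")
def search():
    query = request.args.get("query", "")
    results = search_dataset(query) if query else []
    return render_template_string(PAGE, results=results)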
For testing purposes, web frameworks usually come with a simple webserver you can use. For production purposes, there are better solutions like uwsgi and gunicorn.
Also, you should consider putting the data into a database; parsing the files for each query can be quite inefficient.
I'm sure you will have more questions on the way, but that's what stackoverflow is for, and if you can ask more specific questions, it is easier to provide more focused answers.
I would look at the cgi library in python.
You should check out Django, it's a very flexible and easy Python web framework.
I'm trying to do three things.
One: crawl and archive, at least daily, a predefined set of sites.
Two: run overnight batch python scripts on this data (text classification).
Three: expose a Django based front end to users to let them search the crawled data.
I've been playing with Apache Nutch/Lucene but getting it to play nice with Django just seems too difficult when I could just use another crawler engine.
Question 950790 suggests I could just write the crawler in Django itself, but I'm not sure how to go about this.
Basically - any pointers to writing a crawler in Django or an existing python crawler that I could adapt? Or should I incorporate 'turning into Django-friendly stuff' in step two and write some glue code? Or, finally, should I abandon Django altogether? I really need something that can search quickly from the front end, though.
If you insert your Django project's app directories into sys.path, you can write standard Python scripts that utilize the Django ORM functionality. We have an /admin/ directory that contains scripts to perform various tasks; at the top of each script is a block that looks like:
import os, sys  # needed for the path and environment setup below

sys.path.insert(0, os.path.abspath('../my_django_project'))
sys.path.insert(0, os.path.abspath('../'))
sys.path.insert(0, os.path.abspath('../../'))
os.environ['DJANGO_SETTINGS_MODULE'] = 'settings'  # point Django at the project settings
Then it's just a matter of using your tool of choice to crawl the web and using the Django database API to store the data.
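Concretely, after the path/settings block above, a standalone crawl script might look roughly like this; the archive app and its CrawledPage model are hypothetical, and on modern Django (1.7+) you also need to call django.setup() before touching the ORM:

import django
django.setup()  # required on modern Django before the ORM can be used standalone

import urllib.request
from archive.models import CrawledPage  # hypothetical app and model

def crawl_one(url):
    html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
    # store (or refresh) the page through the Django database API
    CrawledPage.objects.update_or_create(url=url, defaults={"html": html})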
You can write your own crawler, using urllib2 to get the pages and Beautiful Soup to parse the HTML looking for the content.
Here's an example of reading a page:
http://docs.python.org/library/urllib2.html#examples
Here's an example of parsing the page:
http://www.crummy.com/software/BeautifulSoup/documentation.html#Parsing HTML
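In current Python that combination looks roughly like the sketch below (urllib2 became urllib.request in Python 3, and Beautiful Soup is installed as the beautifulsoup4 package); the example URL is just a placeholder:

import urllib.request
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def fetch(url):
    with urllib.request.urlopen(url) as resp:
        return resp.read()

def parse(html, base_url):
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.string if soup.title else ""
    # resolve relative links so they can be followed later
    links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
    return title, links

title, links = parse(fetch("https://example.com/"), "https://example.com/")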
If you don't want to write the crawler using the Django ORM (or already have a working crawler), you could share the database between the crawler and the Django-powered front end.
To be able to search (and edit) the existing database using the Django admin, you should create Django models for it.
The easy way for that is described here:
http://docs.djangoproject.com/en/dev/howto/legacy-databases/
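The end result is a model whose Meta options tell Django the table already exists; as a rough sketch (the table and field names here are hypothetical, and would normally come from manage.py inspectdb):

from django.db import models

class CrawledPage(models.Model):
    url = models.URLField(max_length=500)
    title = models.CharField(max_length=300, blank=True)
    html = models.TextField()
    fetched_at = models.DateTimeField()

    class Meta:
        managed = False           # Django will not create or alter this table
        db_table = "crawled_pages"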