Best approach to save data from webscraper - python

I have recently hit some obstacles while trying to scrape a website. I'm using Python requests to fetch the pages and bs4 to extract the elements and information I want, with a ThreadPoolExecutor to make this work faster. I'm using sqlite3 to save the information and to check whether some of it is already in the database. The problem is that, until now, I've used a multiprocessing Lock so that only one thread uses the database at a time, since sqlite doesn't allow a connection to be shared across threads by default. This is really slow and drags the speed of the scraper down. I've read about MongoDB, but I would prefer a database that can be saved as a local file, i.e. mydatabase.db, like sqlite. I want the database to support threading so that the threads don't have to wait for each other.
What's the best approach to solve this problem?
Thank you
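For reference, the lock-guarded setup described above looks roughly like this (a minimal sketch: the requests/bs4 parsing is replaced with a stand-in since the real code isn't shown, and ":memory:" keeps it self-contained where a real scraper would use "mydatabase.db"):

```python
import sqlite3
import threading
from concurrent.futures import ThreadPoolExecutor

# One connection shared by all threads; sqlite3 forbids this by default,
# so check_same_thread=False plus our own Lock serializes access.
conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("CREATE TABLE IF NOT EXISTS items (url TEXT PRIMARY KEY, title TEXT)")
db_lock = threading.Lock()

def save_item(url, title):
    # Only one thread may touch the shared connection at a time.
    with db_lock:
        conn.execute(
            "INSERT OR IGNORE INTO items (url, title) VALUES (?, ?)",
            (url, title),
        )
        conn.commit()

def scrape(url):
    # Stand-in for the requests + bs4 work described in the question.
    title = "title of " + url
    save_item(url, title)

urls = ["https://example.com/page/%d" % i for i in range(20)]
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(scrape, urls))
```

With every write funneled through one lock, the workers spend most of their time waiting, which is exactly the slowdown the question describes.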

Related

Python multiprocess/multithreading in database driven scraping

I have around 700,000 (7 lakh) website URLs in my database, and I'm scraping some short info from each website.
But when I run the script, it naturally takes a huge amount of time to check that many URLs.
Currently I am doing it with a for loop:
def scrape_short_webinfo(url):
    # a function scraping some minor data
    ...

for instance in Link.objects.all():
    scrape_short_webinfo(instance.url)
I want to put this in multiprocessing/multithreading, so that the script finishes much faster.
Can anyone help me in this case?
You can use a thread pool to parallelize your code (Python's concurrent.futures.ThreadPoolExecutor is the rough equivalent of Java's ExecutorService)
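A minimal sketch of that idea, with a stand-in scrape function since the real one isn't shown (in the real code the URLs would come from Link.objects.all()):

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_short_webinfo(url):
    # Stand-in for the real scraping logic.
    return "info from " + url

# Hypothetical URL list; replace with Link.objects.all() values.
urls = ["https://example.com/%d" % i for i in range(10)]

# A pool of worker threads processes many URLs concurrently
# instead of one at a time in a for loop.
with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(scrape_short_webinfo, urls))
```

For I/O-bound scraping, threads are usually enough; a ProcessPoolExecutor has the same interface if the work turns out to be CPU-bound.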

Doesn't get results after a while scraping (python)

I'm trying to scrape a large database for a project of mine; however, I find that after I scrape a relatively big amount of data, I stop receiving some of the XML information I'm interested in. I'm not sure if it's because the server is limiting my access or because I'm scraping too fast.
I put a "sleep" line between the scraping loops to work around this, but as I try to reach more data it doesn't work anymore.
I guess this is a known problem in web scraping but I'm very new to this field so any suggestion will be very helpful.
Note: I tried requests with some free proxies, but that didn't work either (some data was still missing). I also checked the original website, and it does have the data I'm looking for.
Edit: It looks like most of the data I'm missing comes from specific attributes that don't load as fast as the rest. So I think I'm looking for a way to tell whether the XML I'm after has loaded already.
I'm using lxml and requests.
Thanks.
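One common refinement of the fixed "sleep" approach is to retry with an increasing delay when a fetch comes back incomplete. A sketch, with `fetch` as a stand-in for the real requests/lxml call since the question's code isn't shown:

```python
import time

def fetch_with_backoff(fetch, url, retries=4, base_delay=1.0):
    """Call fetch(url), retrying with exponential backoff on failure.

    `fetch` stands in for the real requests.get + lxml parsing;
    returns None if every attempt fails."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            # Wait longer after each failure: base, 2*base, 4*base, ...
            time.sleep(base_delay * (2 ** attempt))
    return None

# Demo with a fake fetch that fails twice before succeeding.
calls = {"n": 0}

def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("simulated rate limit")
    return "<xml>data</xml>"

result = fetch_with_backoff(flaky_fetch, "https://example.com", base_delay=0.01)
```

The fetch can raise (or be made to raise) whenever the expected XML attributes are missing, so slow-loading data simply triggers another, later attempt.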

Best way to scrape CSVs on the web with Python

I am looking to replace Yahoo Query Language with something more manageable and dependable. Right now we use it to scrape public CSV files and use the information in our web app.
Currently I am having trouble finding an alternative, and it seems that scraping websites with Python is the best bet. However, I don't even know where to start.
My question is what is needed to scrape a CSV, save the data and use it elsewhere in a web application using Python? Do I need a dedicated database or can I save the data a different way?
A simple explanation is appreciated
This is a bit broad, but let's divide it into separate tasks.
My question is what is needed to scrape a CSV
If you mean downloading CSV files from already-known URLs, you can simply use urllib. If you don't have the CSV URLs, you'll have to obtain them somehow; if you want to get them from webpages, beautifulsoup is commonly used to parse HTML, and scrapy is used for larger-scale scraping.
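For the known-URL case, a standard-library sketch (the download function is shown but the parsing step is demonstrated on inline data so it runs without the network; the sample rows are made up):

```python
import csv
import io
from urllib.request import urlopen

def fetch_csv_rows(url):
    # Download a CSV from a known URL and parse it into a list of rows.
    with urlopen(url) as resp:
        text = resp.read().decode("utf-8")
    return list(csv.reader(io.StringIO(text)))

# The parsing step on its own, with inline data standing in for a download:
rows = list(csv.reader(io.StringIO("ticker,price\nABC,12.3\n")))
```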
save the data.
Do I need a dedicated database or can I save the data a different way?
Not at all. You can save the CSV files directly to your disk, store them with pickle, serialize them to JSON, or use a relational or NoSQL database. What you should use depends heavily on what you want to do and what kind of access you need to the data (local/remote, centralized/distributed).
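Each of those local options is only a few lines; a sketch of four of them on stand-in rows (file and table names are arbitrary):

```python
import csv
import json
import os
import pickle
import sqlite3
import tempfile

rows = [["ticker", "price"], ["ABC", "12.3"]]  # stand-in for scraped CSV rows
outdir = tempfile.mkdtemp()

# 1) keep the raw CSV file on disk
with open(os.path.join(outdir, "data.csv"), "w", newline="") as f:
    csv.writer(f).writerows(rows)

# 2) serialize to JSON
with open(os.path.join(outdir, "data.json"), "w") as f:
    json.dump(rows, f)

# 3) pickle the Python objects as-is
with open(os.path.join(outdir, "data.pkl"), "wb") as f:
    pickle.dump(rows, f)

# 4) a local relational database
conn = sqlite3.connect(os.path.join(outdir, "data.db"))
conn.execute("CREATE TABLE quotes (ticker TEXT, price REAL)")
conn.executemany("INSERT INTO quotes VALUES (?, ?)", rows[1:])
conn.commit()
```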
and use it elsewhere in a web application using Python
You'll probably want to learn how to use a web framework for that (django, flask and cherrypy are common choices). If you don't need concurrent write access, any of the storage approaches I mentioned would work with these.

Database in Excel using win32com or xlrd Or Database in mysql

I have developed a website where the pages are simply html tables. I have also developed a server by expanding on python's SimpleHTTPServer. Now I am developing my database.
Most of the table contents on each page are static and don't need to be touched. However, there is one column per table (i.e. page) that needs to be editable and stored. The values are simply text that the user can enter. The user enters the text via HTML textareas that are appended to the tables via JavaScript.
The database is to store key/value pairs where the value is the user entered text (for now at least).
Current situation
Because the original format of my webpages was xlsx files, I opted to use an Excel workbook as my database that basically just mirrors the displayed HTML tables (pages).
I hook up to the excel workbook through win32com. Every time the table (page) loads, javascript iterates through the html textareas and sends an individual request to the server to load in its respective text from the database.
Currently this approach works but is terribly slow. I have tried to optimize everything as much as I can and I believe the speed limitation is a direct consequence of win32com.
Thus, I see four possible ways to go:
Replace my current win32com functionality with xlrd
Try to load all the html textareas for a table (page) at once through one server call to the database using win32com
Switch to something like sql (probably use mysql since it's simple and robust enough for my needs)
Use xlrd but make a single call to the server for each table (page) as in (2)
My schedule to build this functionality is around two days.
Does anyone have any thoughts on the tradeoffs in time-spent-coding versus speed of these approaches? If anyone has any better/more streamlined methods in mind please share!
Probably not the answer you were looking for, but your post is very broad, and I've used win32com and Excel a fair bit and don't see those as good tools for your goal. An easier strategy is this:
for the server, use Flask: it is a Python HTTP server framework that makes it crazy easy to respond to HTTP requests via Python code and HTML templates. You'll have a fully capable server running in 5 minutes, then you'll need a bit of time to create code that gets data from your DB and renders it from templates (which are really easy to use).
for the database, use SQLite (there is far more overhead integrating with MySQL); because you only have two days,
you could also use a simple CSV file, since the API (Python has a CSV read/write module) is much simpler, with less ramp-up time. One CSV per user is easy to manage. You don't worry about inserting rows for a user, you just append; and you don't implement removing rows for a user, you just mark them as inactive (an active/inactive column in your CSV). While processing a GET request from the client, as you read the CSV you can count how many rows are inactive and once in a while rewrite the CSV, so occasionally a request will be a little slower to respond to the client.
even simpler yet, you could use an in-memory data structure of your choice if you don't need persistence across restarts of the server. If this is for a demo, this should be an acceptable limitation.
for the client side, use jQuery on top of JavaScript -- maybe you are doing that already. It makes it super easy to manipulate the DOM and use effects like slide-in/out, etc. Get yourself the book "Learning jQuery"; you'll be able to make good use of jQuery in just a couple of hours.
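The append-only CSV store from the list above might look like this (a sketch with made-up names; "last row for a key wins" implements the active/inactive idea, and `compact` is the occasional rewrite):

```python
import csv
import os
import tempfile

def set_entry(path, key, value, state="active"):
    # Append-only: never edit rows in place; the last row for a key wins.
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([key, value, state])

def load_entries(path):
    # Replay the log: an inactive row drops its key.
    entries = {}
    with open(path, newline="") as f:
        for key, value, state in csv.reader(f):
            if state == "active":
                entries[key] = value
            else:
                entries.pop(key, None)
    return entries

def compact(path):
    # Occasionally rewrite the file, keeping only live rows.
    live = load_entries(path)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for key, value in live.items():
            writer.writerow([key, value, "active"])

# Demo: store two cells, then "remove" one.
path = os.path.join(tempfile.mkdtemp(), "user.csv")
set_entry(path, "cell_a1", "hello")
set_entry(path, "cell_b2", "world")
set_entry(path, "cell_a1", "", "inactive")
compact(path)
```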
If you only have two days it might be a little tight, but you will probably need more than two days to get around the issues you are facing with your current strategy, and the issues you will face imminently.

How to display database query results of 100,000 rows or more with HTML?

We're rewriting a website used by one of our clients. The user traffic on it is very low, less than 100 unique visitors a week. It's basically just a nice interface to their data in our databases. It allows them to query and filter on different sets of data of theirs.
We're rewriting the site in Python, re-using the same Oracle database that the data is currently on. The current version is written in an old, old version of Coldfusion. One of the things that Coldfusion does well, though, is displaying tons of database records on a single page. It's capable of displaying hundreds of thousands of rows at once without crashing the browser. It uses a Java applet, and it looks like the contents of the rows are perhaps compressed and passed in through the HTML or something. There is a large block of data in the HTML but it's not displayed - it's just rendered by the Java applet.
I've tried several JavaScript solutions but they all hinge on the fact that the data will be present in an HTML table or something along those lines. This causes browsers to freeze and run out of memory.
Does anyone know of any solutions to this situation? Our client loves the ability to scroll through all of this data without clicking a "next page" link.
I have done just what you are describing using the following (which works very well):
jQuery Datatables
It enables you to do 'fetch as you scroll' pagination, so you can disable the pagination arrows in favor of a 'forever' scroll.
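'Fetch as you scroll' works because the server only serializes the slice the grid asks for. The handler logic, framework-agnostic and with fabricated rows standing in for the Oracle query results:

```python
import json

# Stand-in for the result of the Oracle query.
ROWS = [{"id": i, "name": "row %d" % i} for i in range(100000)]

def rows_chunk(offset=0, limit=200):
    """Return one JSON page; the grid calls this as the user scrolls,
    e.g. GET /rows?offset=400&limit=200."""
    chunk = ROWS[offset:offset + limit]
    return json.dumps({"total": len(ROWS), "offset": offset, "rows": chunk})

payload = json.loads(rows_chunk(offset=400, limit=2))
```

The browser only ever holds a few hundred rows at a time, which is what avoids the freezing and out-of-memory problems described in the question.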
Give jQuery scroll a try.
Instead of an image scroll, you need a data scroll:
populate the divs with data instead of images.
http://www.smoothdivscroll.com/#quickdemo
It should work, I hope.
You've got a great client anyway :-)
Something related to your Q
http://www.9lessons.info/2009/07/load-data-while-scroll-with-jquery-php.html
http://api.jquery.com/scroll/
I'm using Open Rico's LiveGrid in a project to display a table with thousands of rows in a page as an endless scrolling table. It has been working really well so far. The table requests data on demand as you scroll through the rows. The parameters are sent as simple GET parameters, and the response you have to create on the server side is simple XML. It should be possible to implement a data backend for a Rico LiveGrid in Python.
Most people, in this case, would use a framework. The best documented and most popular framework in Python is Django. It has good database support (including Oracle), and you'll have the easiest time getting help using it since there's such an active Django community.
You can try some other frameworks, but if you're tied to Python I'd recommend Django.
Of course, Jython (if it's an option), would make your job very easy. You could take the existing Java framework you have and just use Jython to build a frontend (and continue to use your Java applet and Java classes and Java server).
The memory problem is an interesting one; I'd be curious to see what you come up with.
Have you tried jqGrid? It can be buggy at times, but overall it's one of the better JavaScript grids. It's fairly efficient in dealing with large datasets. It also has a feature whereby the grid retrieves data asynchronously in chunks, but still allows continuous scrolling. It just asks for more data as the user scrolls down to it.
I did something like this a while ago and successfully implemented YUI's DataTable combined with Django.
http://developer.yahoo.com/yui/datatable/
This gives you column sorting, pagination, scrolling and so on. It also allows you to use a variety of data sources such as JSON or XML.
